This repository evaluates how well different AI models generate R code. The evaluation uses the `are` ("An R Eval") dataset from the vitals package, which contains challenging R coding problems and solutions.
View the live app to compare performance and cost across the models we evaluated.
There is also a series of accompanying blog posts that go into more depth about the analysis. Read the latest one: Which AI model writes the best R code?.
- We used ellmer to create connections to the various models and vitals to evaluate model performance on R code generation tasks.
- We tested each model on a shared benchmark: the `are` dataset ("An R Eval"). `are` contains a collection of difficult R coding problems and a column, `target`, with information about the target solution.
- Using vitals, we had each model solve each problem in `are` (see the sketch after this list). Then, we scored their solutions using a scoring model (Claude 3.7 Sonnet). Each solution received either an Incorrect, Partially Correct, or Correct score.
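For a sense of how these pieces fit together, here is a minimal sketch of an ellmer-plus-vitals evaluation. It follows the pattern documented for vitals, but the model string, the `partial_credit` flag, and the task arguments are illustrative assumptions rather than the exact configuration used in `eval/run_eval.R`.

```r
library(ellmer)
library(vitals)

# Connect to a model through one of ellmer's chat_*() constructors
# (the model name here is a placeholder).
chat <- chat_anthropic(model = "claude-3-7-sonnet-latest")

# Define a vitals task over the `are` dataset: the solver has the model
# attempt each problem, and the scorer grades each answer against `target`.
tsk <- Task$new(
  dataset = are,
  solver = generate(chat),
  scorer = model_graded_qa(partial_credit = TRUE),  # allows "Partially Correct"
  name = "An R Eval"
)

# Run the evaluation and inspect the results.
tsk$eval()
```

Because the scorer is itself model-graded, the grades depend on which scoring model is used, which is why the analysis pins a specific one (Claude 3.7 Sonnet).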
To run the evaluations yourself (or experiment with different models):
- If adding a new model: edit `data/models.yaml` with the specification for the model you want to run.
- Run `eval/run_eval.R`. This will run the evaluation for all models listed in `data/models.yaml`. Note that you will need API keys for all model providers; see the ellmer documentation on authentication for more details, and the example below.
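As one illustration, ellmer typically reads provider keys from environment variables (the exact names, such as `ANTHROPIC_API_KEY` or `OPENAI_API_KEY`, depend on the provider, so confirm them in the ellmer authentication docs). A common way to set them up before running the eval:

```r
# Put keys in ~/.Renviron so they are available in every R session,
# e.g. lines like ANTHROPIC_API_KEY=... and OPENAI_API_KEY=...
usethis::edit_r_environ()

# Alternatively, set a key for the current session only:
# Sys.setenv(ANTHROPIC_API_KEY = "<your key>")

# Then run the full evaluation for every model in data/models.yaml:
source("eval/run_eval.R")
```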
We also evaluated Pandas code generation using the inspect_ai framework. See the Python version of the blog post for results.
This work builds on Simon Couch's blog series analyzing LLM code generation capabilities, including Claude 4 and R Coding.
