This repository evaluates how well different AI models generate R code. The evaluation uses the `are` ("An R Eval") dataset from the vitals package, which contains challenging R coding problems and solutions.
View the live app to compare performance and cost across the models we evaluated.
There is also a series of accompanying blog posts that go into more depth about the analysis. Read the latest one: Which AI model writes the best R code?.
- We used ellmer to create connections to the various models and vitals to evaluate model performance on R code generation tasks.
- We tested each model on a shared benchmark: the `are` dataset ("An R Eval"). `are` contains a collection of difficult R coding problems and a column, `target`, with information about the target solution.
- Using vitals, we had each model solve each problem in `are` (see the sketch after this list). Then, we scored their solutions using a scoring model (Claude 3.7 Sonnet). Each solution received either an Incorrect, Partially Correct, or Correct score.
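For a sense of how these pieces fit together, here is a minimal sketch of an ellmer-plus-vitals evaluation. It follows the pattern documented for vitals, but the model string, the `partial_credit` flag, and the task arguments are illustrative assumptions rather than the exact configuration used in `eval/run_eval.R`.

```r
library(ellmer)
library(vitals)

# Connect to a model through one of ellmer's chat_*() constructors
# (the model name here is a placeholder).
chat <- chat_anthropic(model = "claude-3-7-sonnet-latest")

# Define a vitals task over the `are` dataset: the solver has the model
# attempt each problem, and the scorer grades each answer against `target`.
tsk <- Task$new(
  dataset = are,
  solver = generate(chat),
  scorer = model_graded_qa(partial_credit = TRUE),  # allows "Partially Correct"
  name = "An R Eval"
)

# Run the evaluation and inspect the results.
tsk$eval()
```

Because the scorer is itself model-graded, the grades depend on which scoring model is used, which is why the analysis pins a specific one (Claude 3.7 Sonnet).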
To run the evaluations yourself (or experiment with different models):
- If adding a new model: edit `data/models.yaml` with the specification for the model you want to run.
- Run `eval/run_eval.R`. This will run the evaluation for all models listed in `data/models.yaml`. Note that you will need API keys for all model providers; see the ellmer documentation on authentication for more details, and the example below.
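As one illustration, ellmer typically reads provider keys from environment variables (the exact names, such as `ANTHROPIC_API_KEY` or `OPENAI_API_KEY`, depend on the provider, so confirm them in the ellmer authentication docs). A common way to set them up before running the eval:

```r
# Put keys in ~/.Renviron so they are available in every R session,
# e.g. lines like ANTHROPIC_API_KEY=... and OPENAI_API_KEY=...
usethis::edit_r_environ()

# Alternatively, set a key for the current session only:
# Sys.setenv(ANTHROPIC_API_KEY = "<your key>")

# Then run the full evaluation for every model in data/models.yaml:
source("eval/run_eval.R")
```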
We also evaluated Pandas code generation using the inspect_ai framework. See the Python version of the blog post for results.
This work builds on Simon Couch's blog series analyzing LLM code generation capabilities, including Claude 4 and R Coding.
