Which LLM writes the best R code?

This repository evaluates how well different AI models generate R code. The evaluation uses the are ("An R Eval") dataset from the vitals package, which contains challenging R coding problems and their target solutions.

Shiny App

View the live app to compare performance and cost across a variety of models.

Blog posts

A series of accompanying blog posts goes into more depth about the analysis. Read the latest one: Which AI model writes the best R code?.

Methodology

  • We used ellmer to connect to the various models and vitals to evaluate their performance on R code generation tasks (a minimal sketch follows this list).
  • We tested each model on a shared benchmark: the are dataset ("An R Eval"). are contains a collection of difficult R coding problems and a target column describing the expected solution.
  • Using vitals, we had each model solve every problem in are, then scored the solutions with a grading model (Claude 3.7 Sonnet). Each solution was marked Incorrect, Partially Correct, or Correct.
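The core of the evaluation looks roughly like the sketch below. It follows the usual ellmer + vitals pattern (a Task with a generate() solver and a model_graded_qa() scorer); the exact model names and arguments used in eval/run_eval.R may differ.

```r
library(ellmer)
library(vitals)

# Connect to a model via ellmer (the model name here is illustrative)
solver_chat <- chat_anthropic(model = "claude-3-7-sonnet-latest")

# Define a task over the are dataset: the solver answers each problem and a
# model-graded scorer compares the answer against the target column.
are_task <- Task$new(
  dataset = are,
  solver = generate(solver_chat),
  scorer = model_graded_qa(partial_credit = TRUE)
)

# Run the evaluation; scores can then be inspected or exported
are_task$eval()
```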

Running Evaluations

To run the evaluations yourself (or experiment with different models):

  1. If adding a new model: edit data/models.yaml with the specification for the model you want to run.
  2. Run eval/run_eval.R to evaluate all models listed in data/models.yaml. You will need API keys for every model provider you include; see the ellmer documentation on authentication for details. A sketch of the setup follows this list.
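For example, assuming keys for Anthropic and OpenAI models (the environment variable names below are the ones ellmer reads by default; adjust for the providers in your data/models.yaml):

```r
# Provide API keys for the providers you plan to evaluate; ellmer picks these
# up from environment variables (setting them in ~/.Renviron also works).
Sys.setenv(
  ANTHROPIC_API_KEY = "<your-anthropic-key>",
  OPENAI_API_KEY    = "<your-openai-key>"
)

# Run the evaluation for every model listed in data/models.yaml
source("eval/run_eval.R")
```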

Python/Pandas Evaluations

We also evaluated Pandas code generation using the inspect_ai framework. See the Python version of the blog post for results.

Related Work

This work builds on Simon Couch's blog series analyzing LLM code generation capabilities, including Claude 4 and R Coding.