LMEval aims to help AI researchers and developers compare the performance of different large language models. Designed to be accurate, multimodal, and easy to use, LMEval has already been used to evaluate major models in terms of safety and security.
One of the reasons behind LMEval is the fast pace at which new models are being introduced, which, according to Google researchers, makes it essential to evaluate them quickly and reliably to assess their suitability for specific applications. Among its key features are compatibility with a wide range of LLM providers, incremental benchmark execution for improved efficiency, support for multimodal evaluation covering text, images, and code, and encrypted result storage for enhanced security.
For cross-provider support, it is critical that evaluation benchmarks can be defined once and reused across multiple models, despite differences in their APIs. To this end, LMEval uses LiteLLM, a framework that allows developers to use the OpenAI API format to call a variety of LLM providers, including Bedrock, Hugging Face, Vertex AI, Together AI, Azure, OpenAI, Groq, and others. LiteLLM translates inputs to match each provider’s specific requirements for completion, embedding, and image generation endpoints, and produces a uniform output format.
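As a rough sketch of what this abstraction looks like in LiteLLM itself (generic LiteLLM usage, not LMEval code; the model identifiers are examples and require the corresponding provider credentials to be configured):

from litellm import completion

messages = [{"role": "user", "content": "What color are the cat's eyes?"}]

# The same OpenAI-style call works across providers; only the model string changes.
openai_response = completion(model="gpt-4o-mini", messages=messages)
gemini_response = completion(model="vertex_ai/gemini-1.5-pro", messages=messages)

# Both responses come back in the same OpenAI-compatible format.
print(openai_response.choices[0].message.content)
print(gemini_response.choices[0].message.content)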
To improve execution efficiency when new models are released, LMEval runs only the evaluations that are strictly necessary, whether for new models, prompts, or questions. This is made possible by an intelligent evaluation engine that follows an incremental evaluation model.
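The engine itself is not shown in the article, but the underlying idea can be illustrated with a small, purely hypothetical sketch (none of these names are LMEval APIs): keep track of which (model, question) pairs already have stored results and schedule only the missing ones.

from itertools import product

# Hypothetical illustration of incremental planning; LMEval's actual engine differs.
def plan_incremental(models, question_ids, completed):
    """Return the (model, question_id) pairs that still need to be evaluated.

    completed is a set of (model, question_id) tuples loaded from stored results.
    """
    return [pair for pair in product(models, question_ids) if pair not in completed]

# After adding a new model, only its combinations are scheduled.
completed = {("gemini-1.5-flash", 0), ("gemini-1.5-flash", 1)}
todo = plan_incremental(["gemini-1.5-flash", "gemini-1.5-pro"], [0, 1], completed)
# todo == [("gemini-1.5-pro", 0), ("gemini-1.5-pro", 1)]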
Written in Python and available on GitHub, LMEval requires a few steps to run an evaluation. First, you define your benchmark by specifying the tasks to execute, e.g., detecting a cat's eye color in a picture, along with the prompt, the image, and the expected result. Then, you list the models to evaluate and run the benchmark:
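# imports from the lmeval package are omitted for brevity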
benchmark = Benchmark(name='Cat Visual Questions',
                      description='Ask questions about cat pictures')
...
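# score answers by case-insensitive substring matching against the expected answer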
scorer = get_scorer(ScorerType.contain_text_insensitive)
task = Task(name='Eye color', type=TaskType.text_generation, scorer=scorer)
category.add_task(task)
# add questions
source = QuestionSource(name='cookbook')
# cat 1 question - create question then add media image
question = Question(id=0, question='what is the color of the eyes?',
                    answer='blue', source=source)
question.add_media('./data/media/cat_blue.jpg')
task.add_question(question)
...
# evaluate benchmark on two models
models = [GeminiModel(), GeminiModel(model_version='gemini-1.5-pro')]
prompt = SingleWordAnswerPrompt()
evaluator = Evaluator(benchmark)
eval_plan = evaluator.plan(models, prompt) # plan evaluation
completed_benchmark = evaluator.execute() # run evaluation
Optionally, you can save the evaluation results to a SQLite database and export the data to pandas for further analysis and visualization. LMEval stores benchmark data and evaluation results in encrypted form to prevent them from being inadvertently crawled or indexed.
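The storage and export calls themselves are not shown here; the snippet below only illustrates the kind of analysis that becomes easy once results are in a pandas DataFrame (the records are made-up placeholders, not real scores):

import pandas as pd

# Hypothetical export: assume results are available as one dict per (model, question).
records = [
    {"model": "gemini-1.5-flash", "task": "Eye color", "score": 1.0},
    {"model": "gemini-1.5-pro", "task": "Eye color", "score": 1.0},
]

df = pd.DataFrame(records)
# Average score per model, ready for plotting or further comparison.
print(df.groupby("model")["score"].mean())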
LMEval also includes LMEvalboard, a visual dashboard that lets you view overall performance, analyze individual models, or compare multiple models.
As mentioned, LMEval has been used to create the Phare LLM Benchmark, designed to evaluate the safety and security of LLMs across dimensions such as hallucination resistance, factual accuracy, bias, and potential for harm.
LMEval is not the only cross-provider LLM evaluation framework currently available. Others include Harbor Bench and EleutherAI's LM Evaluation Harness. Harbor Bench, limited to text prompts, has the interesting feature of using an LLM to judge result quality. In contrast, EleutherAI’s LM Evaluation Harness includes over 60 benchmarks and allows users to define new ones using YAML.
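For comparison, a minimal run with EleutherAI's LM Evaluation Harness might look like the following (a sketch based on the project's documented Python API; the model and task names are examples, and the exact signature may vary between versions):

import lm_eval

# Evaluate a small Hugging Face model on one of the built-in benchmarks.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag"],
    num_fewshot=0,
)
print(results["results"]["hellaswag"])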