evals
Here are 88 public repositories matching this topic...
AI Observability & Evaluation
Updated Nov 1, 2025 - Jupyter Notebook
Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks including CrewAI, Agno, OpenAI Agents SDK, Langchain, Autogen, AG2, and CamelAI
Updated Oct 30, 2025 - Python
The easiest tool for fine-tuning LLMs, generating synthetic data, and collaborating on datasets.
Updated Oct 31, 2025 - Python
Evaluation and Tracking for LLM Experiments and AI Agents
Updated Oct 30, 2025 - Python
Laminar - open-source, all-in-one platform for engineering AI products. Create a data flywheel for your AI app. Traces, Evals, Datasets, Labels. YC S24.
Updated Oct 31, 2025 - TypeScript
🥤 RAGLite is a Python toolkit for Retrieval-Augmented Generation (RAG) with DuckDB or PostgreSQL
Updated Oct 31, 2025 - Python
[NeurIPS 2024] Official code for HourVideo: 1-Hour Video Language Understanding
Updated Jul 12, 2025 - Jupyter Notebook
Test Generation for Prompts
Updated Oct 21, 2025 - TeX
Vivaria is METR's tool for running evaluations and conducting agent elicitation research.
Updated Oct 23, 2025 - TypeScript
Evalica, your favourite evaluation toolkit
Updated Oct 30, 2025 - Python
Benchmarking Large Language Models for FHIR
Updated Sep 26, 2025 - TypeScript
A collection of particularly difficult test scenarios for evaluating browser-use.
Updated Oct 30, 2025 - HTML