Note: This is not an officially supported Google product.
This repository contains code for evaluating LLMs on various natural language tasks in three Semitic languages: Modern Hebrew, Modern Standard Arabic, and Levantine Arabic. The tasks are detailed below. This page includes instructions for downloading and preprocessing the data, fine-tuning models, running inference, and evaluating predictions in a fine-tuning setting.
We include scripts for fine-tuning encoder-decoder and decoder LLMs, and for generating test set predictions using both Huggingface transformers and T5X. The evaluation script calculates performance metrics from these predictions. On this page we provide baseline results for all tasks using mT5-XL.
The following tasks are supported:
| Language | Name | Task | Metric | Paper / Page |
|---|---|---|---|---|
| Hebrew | HebNLI | Natural Language Inference | Macro F1 | Page |
| Hebrew | HeSentiment | Sentiment Analysis | Macro F1 | Page |
| Hebrew | HeQ | Question Answering | TLNLS | Paper |
| Hebrew | HeQ-QG | Question Generation | Rouge | Paper |
| Hebrew | HeSum | Summarization | Rouge | Paper |
| Hebrew | HebSummaries | Summarization | Rouge | Page |
| Hebrew | Nemo-Token | NER (token level) | F1 | Paper |
| Hebrew | Nemo-Morph | NER (morph level) | F1 | Paper |
| Hebrew | HebCo | Coreference Resolution | Macro F1 | Page |
| Modern Standard Arabic | ArabicNLI | Natural Language Inference | Macro F1 | Page |
| Modern Standard Arabic | MSA Sentiment | Sentiment Analysis | Macro F1 | Page |
| Modern Standard Arabic | ArQ-MSA-QA | Question Answering | TLNLS | Page |
| Modern Standard Arabic | ArQ-MSA-QG | Question Generation | Rouge | Page |
| Modern Standard Arabic | ArTyDiQA-QA | Question Answering | TyDiQA-F1 | Page |
| Modern Standard Arabic | ArTyDiQA-QG | Question Generation | Rouge | Page |
| Modern Standard Arabic | ArXLSum | Summarization | Rouge | Page |
| Modern Standard Arabic | ASAS | Summarization | Rouge | Page |
| Modern Standard Arabic | IAHLT-NER | Named Entity Recognition | F1 | Page |
| Modern Standard Arabic & Levantine Arabic | Wojood-Full | NER (token level) | F1 | Paper |
| Modern Standard Arabic | Wojood-MSA | NER (token level) | F1 | Paper |
| Modern Standard Arabic | OntoNotes | Coreference Resolution | Macro F1 | Page |
| Levantine Arabic | ArSentiment | Sentiment Analysis | Macro F1 | Page |
| Levantine Arabic | ArQ-Spoken-QA | Question Answering | TLNLS | Page |
| Levantine Arabic | ArQ-Spoken-QG | Question Generation | Rouge | Page |
| Levantine Arabic | Wojood-Spoken | NER (token level) | F1 | Paper |
| Levantine Arabic | ShamNER | Named Entity Recognition | F1 | Page |
| Levantine Arabic | ArCoref | Coreference Resolution | Macro F1 | Page |
Note that this package requires Python 3.10 or higher.
First, clone the repository:
```bash
git clone https://github.com/google-research/mrl_eval.git
```

Then, install the requirements, preferably in a new virtual environment:
```bash
pip install torch --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
```

Download and preprocess the raw data for all tasks (done once per dataset):
```bash
bash mrl_eval/datasets/download_raw_data.sh
bash mrl_eval/datasets/ingest_all_datasets.sh
```

To prepare the dataset for the OntoNotes task (End-to-end and Gold Mentions variants), follow these steps:
- Download the OntoNotes 5.0 dataset: Obtain the dataset from the LDC website (LDC2013T19).
- Generate the CoNLL 2012 data split in JSONL format: Run setup_training.sh to generate the `jsonl` files for the Arabic subset of the dataset.
- Run the data ingestion script: Copy the generated files to the data directories `mrl_eval_data/ontonotes/` and `mrl_eval_data/ontonotes_gold_mentions/`, and run the ingestion script:
```bash
python -m mrl_eval.datasets.ingest_dataset ontonotes
python -m mrl_eval.datasets.ingest_dataset ontonotes_gold_mentions
```
To finetune on a specific dataset using Huggingface, run:
```bash
python -m mrl_eval.hf.finetune --dataset {dataset}
```

The options for `dataset` are:
- hebnli
- hesentiment
- heq
- heq_question_gen
- hesum
- hebsummaries
- nemo_token
- nemo_morph
- hebco
- hebco_gold_mentions
- arabic_nli
- msa_sentiment
- arq_MSA
- arq_MSA_question_gen
- artydiqa
- artydiqa_question_gen
- ar_xlsum
- asas
- iahlt_ner
- wojood_msa
- wojood_full
- ontonotes
- ontonotes_gold_mentions
- arsentiment
- arq_spoken
- arq_spoken_question_gen
- wojood_spoken
- shamner
- arcoref
- arcoref_gold_mentions
By default, the mT5-XL model will be trained. To train a different model (e.g. a decoder LLM), specify its HF model name as follows:

```bash
python -m mrl_eval.hf.finetune --dataset {dataset} --model "google/gemma-2-9b"
```

Decoder models are trained by default with LoRA using half precision.
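For reference, the default decoder setup is conceptually similar to the PEFT-based sketch below; the adapter rank, dropout, and target modules shown are illustrative assumptions rather than the repository's actual defaults.

```python
# Sketch only: LoRA fine-tuning of a decoder model in half precision with PEFT.
# Hyperparameters here are illustrative assumptions, not the repository's defaults.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b",
    torch_dtype=torch.bfloat16,  # half precision
)
lora = LoraConfig(
    r=16,                       # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```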
Once the training is done, the script will print the path to the best checkpoint.
To generate responses for the inputs of the test set, run:
```bash
python -m mrl_eval.hf.generate --dataset {dataset} --checkpoint_path path/to/checkpoint
```

Alternatively, the T5X framework can also be used for running training and inference with T5-family models. For more details, see models/T5X.md.
To evaluate the score of model predictions, run:
```bash
python -m mrl_eval.evaluation.evaluate --dataset {dataset} --predictions_path path/to/prediction/file
```

The options for `dataset` are the same as above.
Your predictions file is expected to be a jsonl file in the following format:
{"input": {"id": "example_id_1"}, "prediction": "prediction1"}
{"input": {"id": "example_id_2"}, "prediction": "prediction2"}
We fine-tune an mT5-XL model per task as the first baseline. Results are shown in the table below.
| Language | Model | Task | Metric | Value |
|---|---|---|---|---|
| Hebrew | mT5-XL | HeQ | TLNLS | 87.1 |
| Hebrew | mT5-XL | HeQ-QG | R1/R2/RL | 40.2 / 22.0 / 39.7 |
| Hebrew | mT5-XL | HeSum | R1/R2/RL | 17.9 / 7.2 / 15.0 |
| Hebrew | mT5-XL | HebSummaries | R1/R2/RL | 23.9 / 10.1 / 16.6 |
| Hebrew | mT5-XL | NEMO | Token / Morph F1 | 86.3 / 84.8 |
| Hebrew | mT5-XL | Sentiment | Macro F1 | 85.0 |
| Hebrew | mT5-XL | HebNLI | Macro F1 | 84.6 |
| Hebrew | mT5-XL | HebCo | End-to-end / Gold Mentions Macro F1 | 49.3 / 75.8 |
| Modern Standard Arabic | mT5-XL | ArQ-MSA-QA | TLNLS | 79.5 |
| Modern Standard Arabic | mT5-XL | ArQ-MSA-QG | R1/R2/RL | 35.8 / 17.2 / 35.5 |
| Modern Standard Arabic | mT5-XL | ArTyDi-QA | TyDiQA-F1 | 87.4 |
| Modern Standard Arabic | mT5-XL | ArTyDi-QG | R1/R2/RL | 60.6 / 44.1 / 60.5 |
| Modern Standard Arabic | mT5-XL | IAHLT-NER | Token F1 | 84.2 |
| Modern Standard Arabic | mT5-XL | ArabicNLI | Macro F1 | 82.2 |
| Modern Standard Arabic | mT5-XL | ArXLSum | R1/R2/RL | 26.5 / 11.4 / 23.4 |
| Modern Standard Arabic | mT5-XL | ASAS | R1/R2/RL | 39.5 / 21.3 / 27.8 |
| Modern Standard Arabic | mT5-XL | MSA Sentiment | Macro F1 | 61.0 |
| Modern Standard Arabic & Levantine Arabic | mT5-XL | Wojood-Full | Token F1 | 90.67 |
| Modern Standard Arabic | mT5-XL | Wojood-MSA | Token F1 | 91.62 |
| Modern Standard Arabic | mT5-XL | OntoNotes | End-to-end / Gold Mentions Macro F1 | 50.9 / 88.7 |
| Levantine Arabic | mT5-XL | ArSentiment | Macro F1 | 71.2 |
| Levantine Arabic | mT5-XL | ArQ-Spoken-QA | TLNLS | 81.8 |
| Levantine Arabic | mT5-XL | ArQ-Spoken-QG | R1/R2/RL | 35.6 / 16.6 / 35.3 |
| Levantine Arabic | mT5-XL | Wojood-Spoken | Token F1 | 79.12 |
| Levantine Arabic | mT5-XL | ShamNER | Token F1 | 42.0 |
| Levantine Arabic | mT5-XL | ArCoref | End-to-end / Gold Mentions Macro F1 | 50.1 / 74.4 |
This framework includes an evaluation module that leverages a Large Language Model (LLM) to assess the quality of question generation and summarization outputs, offering evaluation beyond word-matching metrics. The specific model used for this evaluation is Gemini 2.5 Flash, accessed via the Vertex AI platform on Google Cloud.
More details about autoraters can be found in evaluation/autoraters/README.md.
For question generation, the autorater evaluates answerability: the percentage of generated questions that can be accurately and directly answered by the provided reference answer/passage.
The following command can be used to run autorater evaluation on question generation datasets:
```bash
python -m mrl_eval.evaluation.autoraters.qg_autorater_main \
  --project_id={cloud-project-id} \
  --location={project-location} \
  --model_name="gemini-2.5-flash" \
  --dataset={question-generation-dataset} \
  --dataset_split=test \
  --predictions_path=path/to/prediction/file.jsonl \
  --output_path=path/to/output/file.jsonl
```
where question-generation-dataset can be one of:
- arq_MSA_question_gen
- arq_spoken_question_gen
- artydiqa_question_gen
- heq_question_gen
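As a rough illustration of how such a percentage can be aggregated, the sketch below averages per-example judgments from the autorater's output file. The `is_answerable` field name is a hypothetical placeholder, not the tool's actual output schema; see evaluation/autoraters/README.md for the actual format written to `--output_path`.

```python
# Sketch only: aggregate per-example autorater judgments into an answerability
# score (fraction of generated questions judged answerable).
# The "is_answerable" field is a hypothetical name, not the tool's actual schema.
import json

judgments = []
with open("autorater_output.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        judgments.append(bool(record["is_answerable"]))

answerability = sum(judgments) / len(judgments) if judgments else 0.0
print(f"Answerability: {answerability:.3f}")
```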
| Dataset | Answerability |
|---|---|
| arq_MSA_question_gen | 0.930 |
| arq_spoken_question_gen | 0.884 |
| artydiqa_question_gen | 0.883 |
| heq_question_gen | 0.915 |
For summarisation, the autorater evaluates three dimensions:
- Factual grounding: all the information in the summary should be grounded in the original article.
- Relevance / Recall: every salient fact of the original article should appear in the summary.
- Quality: the summary should be fluent and coherent. Each is rated on a scale of 1-5.
The following command can be used to run autorater evaluation on summarisation datasets:
```bash
python -m mrl_eval.evaluation.autoraters.summarization_autoraters_main \
  --project_id={cloud-project-id} \
  --location={project-location} \
  --model_name="gemini-2.5-flash" \
  --dataset={summarisation-dataset} \
  --dataset_split="test" \
  --predictions_path=path/to/prediction/file.jsonl \
  --output_path=path/to/output/file.jsonl
```
where summarisation-dataset can be one of:
- hesum
- hebsummaries
- ar_xlsum
- asas
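Analogously to the answerability sketch above, per-example ratings can be averaged per dimension. The field names below are hypothetical placeholders rather than the tool's actual output schema.

```python
# Sketch only: average per-example autorater ratings per dimension.
# The field names below are hypothetical, not the tool's actual output schema.
import json
from collections import defaultdict

totals, counts = defaultdict(float), defaultdict(int)
with open("autorater_output.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        for dim in ("factual_grounding", "recall", "fluency", "coherence"):
            if dim in record:
                totals[dim] += float(record[dim])
                counts[dim] += 1

for dim in totals:
    print(f"{dim}: {totals[dim] / counts[dim]:.3f}")
```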
| Dataset | Factual Grounding | Recall | Fluency | Coherence |
|---|---|---|---|---|
| hesum | 0.812 | 0.153 | 4.530 | 3.931 |
| hebsummaries | 0.736 | 0.522 | 4.271 | 3.819 |
| ar_xlsum | 0.561 | 0.205 | 4.894 | 4.865 |
| asas | 0.913 | 0.546 | 3.973 | 3.560 |