Note: This is not an officially supported Google product.
This repository contains code for evaluating LLMs on various natural language tasks in three Semitic languages: Modern Hebrew, Modern Standard Arabic, and Levantine Arabic. The tasks are detailed below. This page includes instructions for downloading and preprocessing the data, fine-tuning models, running inference, and evaluating predictions in a fine-tuning setting.
We include scripts for fine-tuning encoder-decoder and decoder LLMs, and for generating test set predictions using both Huggingface transformers and T5X. The evaluation script calculates performance metrics from these predictions. On this page we provide baseline results for all tasks using mT5-XL.
The following tasks are supported:
| Language | Name | Task | Metric | Paper / Page |
|---|---|---|---|---|
| Hebrew | HebNLI | Natural Language Inference | Macro F1 | Page |
| Hebrew | HeSentiment | Sentiment Analysis | Macro F1 | Page |
| Hebrew | HeQ | Question Answering | TLNLS | Paper |
| Hebrew | HeQ-QG | Question Generation | Rouge | Paper |
| Hebrew | HeSum | Summarization | Rouge | Paper |
| Hebrew | HebSummaries | Summarization | Rouge | Page |
| Hebrew | Nemo-Token | NER (token level) | F1 | Paper |
| Hebrew | Nemo-Morph | NER (morph level) | F1 | Paper |
| Hebrew | HebCo | Coreference Resolution | Macro F1 | Page |
| Modern Standard Arabic | ArabicNLI | Natural Language Inference | Macro F1 | Page |
| Modern Standard Arabic | MSA Sentiment | Sentiment Analysis | Macro F1 | Page |
| Modern Standard Arabic | ArQ-MSA-QA | Question Answering | TLNLS | Page |
| Modern Standard Arabic | ArQ-MSA-QG | Question Generation | Rouge | Page |
| Modern Standard Arabic | ArTyDiQA-QA | Question Answering | TyDiQA-F1 | Page |
| Modern Standard Arabic | ArTyDiQA-QG | Question Generation | Rouge | Page |
| Modern Standard Arabic | ArXLSum | Summarization | Rouge | Page |
| Modern Standard Arabic | ASAS | Summarization | Rouge | Page |
| Modern Standard Arabic | IAHLT-NER | Named Entity Recognition | F1 | Page |
| Modern Standard Arabic & Levantine Arabic | Wojood-Full | NER (token level) | F1 | Paper |
| Modern Standard Arabic | Wojood-MSA | NER (token level) | F1 | Paper |
| Modern Standard Arabic | OntoNotes | Coreference Resolution | Macro F1 | Page |
| Levantine Arabic | ArSentiment | Sentiment Analysis | Macro F1 | Page |
| Levantine Arabic | ArQ-Spoken-QA | Question Answering | TLNLS | Page |
| Levantine Arabic | ArQ-Spoken-QG | Question Generation | Rouge | Page |
| Levantine Arabic | Wojood-Spoken | NER (token level) | F1 | Paper |
| Levantine Arabic | ShamNER | Named Entity Recognition | F1 | Page |
| Levantine Arabic | ArCoref | Coreference Resolution | Macro F1 | Page |
Note that this package requires Python 3.10 or higher.
First, clone the repository:
```bash
git clone https://github.com/google-research/mrl_eval.git
```

Then, install the requirements, preferably in a new virtual environment:
```bash
pip install torch --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
```

Download and preprocess the raw data for all tasks (done once per dataset):
```bash
bash mrl_eval/datasets/download_raw_data.sh
bash mrl_eval/datasets/ingest_all_datasets.sh
```

To prepare the dataset for the OntoNotes task (End-to-end and Gold Mentions variants), follow these steps:
- Download the OntoNotes 5.0 dataset: Obtain the dataset from the LDC website (LDC2013T19).
- Generate the CoNLL 2012 data split in JSONL format: Run setup_training.sh to generate the `jsonl` files for the Arabic subset of the dataset.
- Run the data ingestion script: Copy the generated files to the data directories `mrl_eval_data/ontonotes/` and `mrl_eval_data/ontonotes_gold_mentions/`, and run the ingestion script:
```bash
python -m mrl_eval.datasets.ingest_dataset ontonotes
python -m mrl_eval.datasets.ingest_dataset ontonotes_gold_mentions
```
To finetune on a specific dataset using Huggingface, run:
```bash
python -m mrl_eval.hf.finetune --dataset {dataset}
```

The options for `dataset` are:
- hebnli
- hesentiment
- heq
- heq_question_gen
- hesum
- hebsummaries
- nemo_token
- nemo_morph
- hebco
- hebco_gold_mentions
- arabic_nli
- msa_sentiment
- arq_MSA
- arq_MSA_question_gen
- artydiqa
- artydiqa_question_gen
- ar_xlsum
- asas
- iahlt_ner
- wojood_msa
- wojood_full
- ontonotes
- ontonotes_gold_mentions
- arsentiment
- arq_spoken
- arq_spoken_question_gen
- wojood_spoken
- shamner
- arcoref
- arcoref_gold_mentions
By default, the mT5-XL model will be trained. To train a different model (e.g. a decoder LLM), specify its HF model name as follows:

```bash
python -m mrl_eval.hf.finetune --dataset {dataset} --model "google/gemma-2-9b"
```

Decoder models are trained by default with LoRA using half precision.
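For reference, the default decoder setup is conceptually similar to the PEFT-based sketch below; the adapter rank, dropout, and target modules shown are illustrative assumptions rather than the repository's actual defaults.

```python
# Sketch only: LoRA fine-tuning of a decoder model in half precision with PEFT.
# Hyperparameters here are illustrative assumptions, not the repository's defaults.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b",
    torch_dtype=torch.bfloat16,  # half precision
)
lora = LoraConfig(
    r=16,                       # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```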
Once the training is done, the script will print the path to the best checkpoint.
To generate responses for the inputs of the test set, run:
```bash
python -m mrl_eval.hf.generate --dataset {dataset} --checkpoint_path path/to/checkpoint
```

Alternatively, the T5X framework can also be used for running training and inference with T5-family models. For more details, see models/T5X.md.
To evaluate the score of model predictions, run:
```bash
python -m mrl_eval.evaluation.evaluate --dataset {dataset} --predictions_path path/to/prediction/file
```

The options for `dataset` are the same as above.
Your predictions file is expected to be a jsonl file in the following format:
{"input": {"id": "example_id_1"}, "prediction": "prediction1"}
{"input": {"id": "example_id_2"}, "prediction": "prediction2"}
We fine-tune an mT5-XL model per task as the first baseline. Results are shown in the table below.
| Language | Model | Task | Metric | Value |
|---|---|---|---|---|
| Hebrew | mT5-XL | HeQ | TLNLS | 87.1 |
| Hebrew | mT5-XL | HeQ-QG | R1/R2/RL | 40.2 / 22.0 / 39.7 |
| Hebrew | mT5-XL | HeSum | R1/R2/RL | 17.9 / 7.2 / 15.0 |
| Hebrew | mT5-XL | HebSummaries | R1/R2/RL | 23.9 / 10.1 / 16.6 |
| Hebrew | mT5-XL | NEMO | Token / Morph F1 | 86.3 / 84.8 |
| Hebrew | mT5-XL | Sentiment | Macro F1 | 85.0 |
| Hebrew | mT5-XL | HebNLI | Macro F1 | 84.6 |
| Hebrew | mT5-XL | HebCo | End-to-end / Gold Mentions Macro F1 | 49.3 / 75.8 |
| Modern Standard Arabic | mT5-XL | ArQ-MSA-QA | TLNLS | 79.5 |
| Modern Standard Arabic | mT5-XL | ArQ-MSA-QG | R1/R2/RL | 35.8 / 17.2 / 35.5 |
| Modern Standard Arabic | mT5-XL | ArTyDi-QA | TyDiQA-F1 | 87.4 |
| Modern Standard Arabic | mT5-XL | ArTyDi-QG | R1/R2/RL | 60.6 / 44.1 / 60.5 |
| Modern Standard Arabic | mT5-XL | IAHLT-NER | Token F1 | 84.2 |
| Modern Standard Arabic | mT5-XL | ArabicNLI | Macro F1 | 82.2 |
| Modern Standard Arabic | mT5-XL | ArXLSum | R1/R2/RL | 26.5 / 11.4 / 23.4 |
| Modern Standard Arabic | mT5-XL | ASAS | R1/R2/RL | 39.5 / 21.3 / 27.8 |
| Modern Standard Arabic | mT5-XL | MSA Sentiment | Macro F1 | 61.0 |
| Modern Standard Arabic & Levantine Arabic | mT5-XL | Wojood-Full | Token F1 | 90.67 |
| Modern Standard Arabic | mT5-XL | Wojood-MSA | Token F1 | 91.62 |
| Modern Standard Arabic | mT5-XL | OntoNotes | End-to-end / Gold Mentions Macro F1 | 50.9 / 88.7 |
| Levantine Arabic | mT5-XL | ArSentiment | Macro F1 | 71.2 |
| Levantine Arabic | mT5-XL | ArQ-Spoken-QA | TLNLS | 81.8 |
| Levantine Arabic | mT5-XL | ArQ-Spoken-QG | R1/R2/RL | 35.6 / 16.6 / 35.3 |
| Levantine Arabic | mT5-XL | Wojood-Spoken | Token F1 | 79.12 |
| Levantine Arabic | mT5-XL | ShamNER | Token F1 | 42.0 |
| Levantine Arabic | mT5-XL | ArCoref | End-to-end / Gold Mentions Macro F1 | 50.1 / 74.4 |
This framework includes an evaluation module that leverages a Large Language Model (LLM) to assess the quality of question generation and summarization outputs, offering evaluation beyond word-matching metrics. The specific model used for this evaluation is Gemini 2.5 Flash, accessed via the Vertex AI platform on Google Cloud.
More details about autoraters can be found in evaluation/autoraters/README.md.
For question generation, the autorater evaluates answerability: the percentage of generated questions that can be accurately and directly answered by the provided reference answer/passage.
The following command can be used to run autorater evaluation on question generation datasets:
```bash
python -m mrl_eval.evaluation.autoraters.qg_autorater_main \
  --project_id={cloud-project-id} \
  --location={project-location} \
  --model_name="gemini-2.5-flash" \
  --dataset={question-generation-dataset} \
  --dataset_split=test \
  --predictions_path=path/to/prediction/file.jsonl \
  --output_path=path/to/output/file.jsonl
```
where question-generation-dataset can be one of:
- arq_MSA_question_gen
- arq_spoken_question_gen
- artydiqa_question_gen
- heq_question_gen
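As a rough illustration of how such a percentage can be aggregated, the sketch below averages per-example judgments from the autorater's output file. The `is_answerable` field name is a hypothetical placeholder, not the tool's actual output schema; see evaluation/autoraters/README.md for the actual format written to `--output_path`.

```python
# Sketch only: aggregate per-example autorater judgments into an answerability
# score (fraction of generated questions judged answerable).
# The "is_answerable" field is a hypothetical name, not the tool's actual schema.
import json

judgments = []
with open("autorater_output.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        judgments.append(bool(record["is_answerable"]))

answerability = sum(judgments) / len(judgments) if judgments else 0.0
print(f"Answerability: {answerability:.3f}")
```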
| Dataset | Answerability |
|---|---|
| arq_MSA_question_gen | 0.930 |
| arq_spoken_question_gen | 0.884 |
| artydiqa_question_gen | 0.883 |
| heq_question_gen | 0.915 |
For summarisation, the autorater evaluates three dimensions:
- Factual grounding: all the information in the summary should be grounded in the original article.
- Relevance / Recall: every salient fact of the original article should appear in the summary.
- Quality: the summary should be fluent and coherent. Each is rated on a scale of 1-5.
The following command can be used to run autorater evaluation on summarisation datasets:
```bash
python -m mrl_eval.evaluation.autoraters.summarization_autoraters_main \
  --project_id={cloud-project-id} \
  --location={project-location} \
  --model_name="gemini-2.5-flash" \
  --dataset={summarisation-dataset} \
  --dataset_split="test" \
  --predictions_path=path/to/prediction/file.jsonl \
  --output_path=path/to/output/file.jsonl
```
where summarisation-dataset can be one of:
- hesum
- hebsummaries
- ar_xlsum
- asas
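Analogously to the answerability sketch above, per-example ratings can be averaged per dimension. The field names below are hypothetical placeholders rather than the tool's actual output schema.

```python
# Sketch only: average per-example autorater ratings per dimension.
# The field names below are hypothetical, not the tool's actual output schema.
import json
from collections import defaultdict

totals, counts = defaultdict(float), defaultdict(int)
with open("autorater_output.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        for dim in ("factual_grounding", "recall", "fluency", "coherence"):
            if dim in record:
                totals[dim] += float(record[dim])
                counts[dim] += 1

for dim in totals:
    print(f"{dim}: {totals[dim] / counts[dim]:.3f}")
```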
| Dataset | Factual Grounding | Recall | Fluency | Coherence |
|---|---|---|---|---|
| hesum | 0.812 | 0.153 | 4.530 | 3.931 |
| hebsummaries | 0.736 | 0.522 | 4.271 | 3.819 |
| ar_xlsum | 0.561 | 0.205 | 4.894 | 4.865 |
| asas | 0.913 | 0.546 | 3.973 | 3.560 |