Enterprise Benchmarks for Large Language Model Evaluation

Introduction

This page describes the benchmarks and metrics used in a study of enterprise benchmarking of LLMs with domain-specific datasets from the finance, legal, climate, and cyber security domains.

In many enterprise applications of LLMs, a model needs to process domain-specific text data. In such cases, the performance of a model can differ from what general-purpose benchmarks, such as benchmarks for language capabilities and common-sense knowledge, would suggest. It is therefore important to evaluate with a domain-specific dataset whose distribution is close to that of the actual application domain.

The following scenarios are included.

  • Finance
    • conv_fin_qa_calc
    • financial_phrasebank
    • gold_commodity_news (news_headline)
    • kpi_edgar
  • Legal
    • casehold
    • echr_judgment_classification
    • legal_contract_summarization
    • legal_opinion_sentiment_classification
  • Climate
    • sumosum
  • Cyber security
    • cti_to_mitre

Getting started

Please follow the standard installation procedure. No additional steps are required for these benchmarks.

Here is an example of run_specs.conf:

entries: [
  {description: "gold_commodity_news:model=openai/gpt2,category=price_or_not", priority: 1},
  {description: "sumosum:model=openai/gpt2", priority: 1},
  {description: "legal_contract_summarization:model=openai/gpt2", priority: 1},
  {description: "casehold:model=openai/gpt2", priority: 1},
]

Then, the benchmark scenarios can be executed with the following commands.

helm-run --conf-paths {path_to}/run_specs.conf -m 100 --suite a_benchmark_suite_v1
helm-summarize --suite a_benchmark_suite_v1
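
The helm-summarize step aggregates the results; they can then be inspected in a local web UI. Assuming the commands above were run from the directory that contains benchmark_output, start the viewer with:

helm-server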

Usage of the scenarios

Please refer to the docstring in the source code of each scenario, or to the page shown by helm-server (backed by src/helm/benchmark/static/schema_enterprise.yaml), for details.

Finance

conv_fin_qa_calc (ConvFinQACalc)

  • Description: A mathematical calculation benchmark based on ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering (Chen et al., 2022).
  • Scenario name: conv_fin_qa_calc
  • Parameters: None
  • Metrics:
    • float_equiv (from helm.benchmark.metrics.conv_fin_qa_calc_metrics.ConvFinQACalcMetric)
  • Source code: src/helm/benchmark/scenarios/conv_fin_qa_calc_scenario.py
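
Since this scenario takes no parameters, a run_specs.conf entry for it can be as simple as (the model name is illustrative, as in the example above):

{description: "conv_fin_qa_calc:model=openai/gpt2", priority: 1}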

financial_phrasebank (Financial Phrasebank)

  • Description: A sentiment classification benchmark based on the dataset from Good Debt or Bad Debt - Detecting Semantic Orientations in Economic Texts (Malo et al., 2013).
  • Scenario name: financial_phrasebank
  • Parameters:
    • agreement: int = 50. Optional. Valid values are defined in FinancialPhrasebankScenario.AGREEMENT_VALUES = [50, 66, 75, 100]. This argument specifies the minimum ratio (in percent) of annotators who agreed on the ground truth label.
  • Metrics:
    • classification_weighted_f1
  • Source code: src/helm/benchmark/scenarios/financial_phrasebank_scenario.py
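
For example, to evaluate on the subset where at least 75% of annotators agreed on the label, add an entry like the following to run_specs.conf (the model name is illustrative):

{description: "financial_phrasebank:model=openai/gpt2,agreement=75", priority: 1}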

gold_commodity_news (Gold Commodity News)

  • Description: A classification benchmark based on a dataset of human-annotated gold commodity news headlines (Sinha & Khandait, 2019).
  • Scenario name: gold_commodity_news
  • Parameters:
    • category: str. This is a mandatory option. The task is to classify whether an input (a news headline) discusses the specified category. One of the following values can be specified.
      • price_or_not: "the gold price"
      • direction_up: "the gold price heading up"
      • direction_constant: "the price remaining constant or stable"
      • direction_down: "the gold price heading down"
      • past_price: "any past information about gold prices"
      • future_price: "any future information about gold prices"
      • past_news: "any past information other than the gold prices"
      • future_news: "any future information other than the gold prices"
      • asset_comparison: "a comparison purely in the context of the gold commodity with another asset"
  • Metrics:
    • classification_weighted_f1
  • Source code: src/helm/benchmark/scenarios/gold_commodity_news_scenario.py
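
The run_specs.conf example above uses category=price_or_not; to classify, say, whether headlines report the gold price heading up, use:

{description: "gold_commodity_news:model=openai/gpt2,category=direction_up", priority: 1}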

kpi_edgar (KPI EDGAR)

  • Description: A named entity recognition benchmark based on the paper KPI-EDGAR - A Novel Dataset and Accompanying Metric for Relation Extraction from Financial Documents (Deußer et al., 2022).
  • Scenario name: kpi_edgar
  • Parameters: None
  • Metrics:
    • word_macro_f1_score (from helm.benchmark.metrics.kpi_edgar_metrics.KPIEdgarMetric)
    • adjusted_macro_f1_score (from helm.benchmark.metrics.kpi_edgar_metrics.KPIEdgarMetric)
  • Source code: src/helm/benchmark/scenarios/kpi_edgar_scenario.py
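
An example run_specs.conf entry for this scenario (illustrative model name):

{description: "kpi_edgar:model=openai/gpt2", priority: 1}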

Legal

casehold (CaseHOLD)

  • Description: CaseHOLD (Case Holdings On Legal Decisions) is a multiple-choice question answering scenario where the task is to identify the relevant holding of a cited case (Zheng et al., 2021).
  • Scenario name: casehold
  • Parameters: None
  • Metrics:
    • f1 (instance-wide F1 in basic_metrics.py)
  • Source code: src/helm/benchmark/scenarios/casehold_scenario.py

echr_judgment_classification (ECHR Judgment Classification)

  • Description: The "Binary Violation" Classification task from the paper Neural Legal Judgment Prediction in English (Chalkidis et al., 2019). The task is to analyze the description of a legal case from the European Court of Human Rights (ECHR), and classify it as positive if any human rights article or protocol has been violated and negative otherwise.
  • Scenario name: echr_judgment_classification
  • Parameters: None
  • Metrics:
    • classification_weighted_f1
  • Source code: src/helm/benchmark/scenarios/echr_judgment_classification_scenario.py
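
An example entry for run_specs.conf (illustrative model name):

{description: "echr_judgment_classification:model=openai/gpt2", priority: 1}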
legal_contract_summarization (Legal Contract Summarization)

  • Description: A summarization benchmark based on Plain English Summarization of Contracts (Manor et al., 2019).
  • Scenario name: legal_contract_summarization
  • Parameters: None
  • Metrics:
    • rouge_1
    • rouge_2
    • rouge_l
  • Source code: src/helm/benchmark/scenarios/legal_contract_summarization_scenario.py
legal_opinion_sentiment_classification (Legal Opinion Sentiment Classification)

  • Description: A legal opinion sentiment classification task based on the paper Effective Approach to Develop a Sentiment Annotator For Legal Domain in a Low Resource Setting (Ratnayaka et al., 2020).
  • Scenario name: legal_opinion_sentiment_classification
  • Parameters: None
  • Metrics:
    • classification_weighted_f1
  • Source code: src/helm/benchmark/scenarios/legal_opinion_sentiment_classification_scenario.py
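
Likewise, an example run_specs.conf entry (illustrative model name):

{description: "legal_opinion_sentiment_classification:model=openai/gpt2", priority: 1}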

Climate

sumosum (SUMO Web Claims Summarization)

  • Description: A summarization benchmark based on the climate subset of the SUMO dataset (Mishra et al., 2020).
  • Scenario name: sumosum
  • Parameters: None
  • Metrics:
    • rouge_1
    • rouge_2
    • rouge_l
  • Source code: src/helm/benchmark/scenarios/sumosum_scenario.py

Cyber security

cti_to_mitre (CTI to MITRE)

  • Description: A classification benchmark based on Automatic Mapping of Unstructured Cyber Threat Intelligence - An Experimental Study (Orbinato et al., 2022).
  • Scenario name: cti_to_mitre
  • Parameters:
    • num_options: int = 10. Optional. The number of choices in the multiple-choice task.
    • seed: int = 42. Optional. The seed for the random module (used when sampling the answer options).
  • Metrics:
    • exact_match
  • Source code: src/helm/benchmark/scenarios/cti_to_mitre_scenario.py
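
For example, to run the task with 25 answer options and a fixed seed, an entry can look like (illustrative model name):

{description: "cti_to_mitre:model=openai/gpt2,num_options=25,seed=42", priority: 1}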

Paper

This study is published in the following paper. Please cite it if you use this code in your research.

  • B. Zhang, M. Takeuchi, R. Kawahara, S. Asthana, Md. M. Hossain, G. Ren, K. Soule and Y. Zhu, “Enterprise Benchmarks for Large Language Model Evaluation.” arXiv preprint arXiv:2410.12857, 2024. https://arxiv.org/abs/2410.12857

@article{zhang2024enterprise,
  title={Enterprise Benchmarks for Large Language Model Evaluation},
  author={Zhang, Bing and Takeuchi, Mikio and Kawahara, Ryo and Asthana, Shubhi and Hossain, Md Maruf and Ren, Guang-Jie and Soule, Kate and Zhu, Yada},
  journal={arXiv preprint arXiv:2410.12857},
  url={https://arxiv.org/abs/2410.12857}, 
  year={2024}
}

Benchmark results

Per-domain benchmark results (finance, legal, climate and sustainability, and cyber security) are reported in the paper above.

Contributors

Original contributors are as follows:

  • Yada Zhu, Kate Soule (MIT-IBM Watson AI Lab)
  • Mikio Takeuchi, Ryo Kawahara, Futoshi Iwama, Alisa Arno (IBM Research - Tokyo)
  • Bing Zhang, Shubhi Asthana (IBM Almaden Research Lab)
  • Md Maruf Hossain, Naoto Satoh, Guang-Jie Ren (former IBM members)