HEIM (Text-to-image Model Evaluation)

Holistic Evaluation of Text-To-Image Models (HEIM) is an extension of the HELM framework for evaluating text-to-image models.

Holistic Evaluation of Text-To-Image Models

Significant effort has recently been made in developing text-to-image generation models, which take textual prompts as input and generate images. As these models are widely used in real-world applications, there is an urgent need to comprehensively understand their capabilities and risks. However, existing evaluations primarily focus on image-text alignment and image quality. To address this limitation, we introduce a new benchmark, Holistic Evaluation of Text-To-Image Models (HEIM).

We identify 12 aspects that are important in real-world model deployment:

  • image-text alignment
  • image quality
  • aesthetics
  • originality
  • reasoning
  • knowledge
  • bias
  • toxicity
  • fairness
  • robustness
  • multilinguality
  • efficiency

By curating scenarios encompassing these aspects, we evaluate state-of-the-art text-to-image models using this benchmark. Unlike previous evaluations that focused on alignment and quality, HEIM significantly improves coverage by evaluating all models across all aspects. Our results reveal that no single model excels in all aspects, with different models demonstrating strengths in different aspects.

Installation

First, follow the installation instructions to install the base HELM Python package.

To install the additional dependencies to run HEIM, run:

pip install "crfm-helm[heim]"
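To verify that the installation succeeded, you can check that the HELM command-line tools are on your PATH (a quick sanity check, assuming a standard pip install):

helm-run --help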

Some models (e.g., DALLE-mini/mega) and metrics (DetectionMetric) require extra dependencies that are not available on PyPI. To install these dependencies, download and run the extra install script:

bash install-heim-extras.sh

Getting Started

The following is an example of evaluating Stable Diffusion v1.4 on the MS-COCO scenario using 10 instances.

helm-run --run-entries mscoco:model=huggingface/stable-diffusion-v1-4 --suite my-heim-suite --max-eval-instances 10
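After the run finishes, you can summarize the results and browse them in a local web UI using HELM's standard tools. This is a minimal sketch assuming the same --suite name as above:

helm-summarize --suite my-heim-suite
helm-server

helm-summarize aggregates the raw run outputs, and helm-server starts a local server for browsing the results.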

Reproducing the Leaderboard

To reproduce the entire HEIM leaderboard, refer to the HEIM instructions in the Reproducing Leaderboards documentation.

Note:

The full HEIM leaderboard is not currently reproducible with these instructions. We are working to resolve this. In the meantime, we have disabled the NSFWMetric so that the rest of the evaluation suite can run.
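As a rough sketch, a full leaderboard run points helm-run at a run-entries configuration file rather than a single run entry. The file name below (run_entries_heim.conf) is illustrative; use the exact conf file and settings given in the Reproducing Leaderboards documentation:

helm-run --conf-paths run_entries_heim.conf --suite my-heim-suite --max-eval-instances 10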