A benchmark for evaluating the capability of implementing new features in code repositories.
This repository is the official implementation of the paper "FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation." It can be used for baseline evaluation using the prompts mentioned in the paper.
The repository covers three main functions: obtaining the full dataset, running model inference as described in the paper, and evaluating the results. The complete pipeline is as follows:
You can create a new Python environment and install all dependencies using:
pip install -e .

If you plan to use vLLM inference, ensure that the installed libraries match your hardware.
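To confirm that the inference dependencies match your hardware, a minimal Python check such as the sketch below can help (it assumes you installed vllm and a CUDA-enabled build of torch in the step above):

import torch
import vllm

# Print versions and whether a CUDA device is visible; mismatched builds usually fail here.
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("vllm:", vllm.__version__)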
Due to licensing and company policies, we cannot release the full dataset. Our published version (https://huggingface.co/datasets/microsoft/FEA-Bench) only includes essential attributes, and the remaining content needs to be scraped from GitHub.
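If you want to inspect the published attribute-only release before scraping, it can be loaded with the Hugging Face datasets library; the sketch below assumes only that datasets is installed:

from datasets import load_dataset

# Download the attribute-only release from the Hugging Face Hub and show its splits and columns.
ds = load_dataset("microsoft/FEA-Bench")
print(ds)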
To construct the full FEA-Bench dataset and save it in the feabench-data folder, run the following command. Note that you need to replace GITHUB_TOKEN with your own GitHub token, which should have read-only access to public repositories:
export GITHUB_TOKEN="xxx"
python -m feabench.get_dataset \
--dataset microsoft/FEA-Bench \
--testbed feabench-data/testbed \
--lite_ids instances_lite.json \
--medium_file feabench-data/FEA-Bench-v1.0-medium.jsonl \
--standard_dataset_path feabench-data/FEA-Bench-v1.0-Standard \
--oracle_dataset_path feabench-data/FEA-Bench-v1.0-Oracle \
--lite_standard_dataset_path feabench-data/FEA-Bench-v1.0-Lite-Standard \
--lite_oracle_dataset_path feabench-data/FEA-Bench-v1.0-Lite-Oracle

Our repository only provides inference methods consistent with those in the paper. Agentless and other agent-based approaches can use the FEA-Bench-v1.0-Lite-Standard dataset constructed in the previous step, which is aligned with the format of SWE-Bench.
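After construction, a quick sanity check is to count the collected task instances in the medium file; the sketch below assumes only that the file is newline-delimited JSON, as its .jsonl extension suggests:

import json

# Each line of the medium file should be one task instance serialized as JSON.
with open("feabench-data/FEA-Bench-v1.0-medium.jsonl") as f:
    instances = [json.loads(line) for line in f if line.strip()]

print(f"Collected {len(instances)} instances")
print("Fields of the first instance:", sorted(instances[0].keys()))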
For local inference with vLLM, run:

export MAX_SEQ_LEN=128000
export MAX_GEN_LEN=4096
DATASET_PATH=feabench-data/FEA-Bench-v1.0-Oracle
MODEL_NAME=Qwen/Qwen2.5-Coder-3B-Instruct
RESULTS_ROOT_DIR=scripts/experiments/results_full
PROMPT_MODE=natural-detailed
python -m feabench.run_prediction \
--dataset_name_or_path $DATASET_PATH \
--model_type vllm \
--model_name_or_path $MODEL_NAME \
--input_text $PROMPT_MODE \
--output_dir $RESULTS_ROOT_DIR/$PROMPT_MODE

For inference through an OpenAI-compatible API (the example below uses DeepSeek), set the following variables. (DEEPSEEK_TOKENIZER_PATH is only required when using DeepSeek model inference.)
export DEEPSEEK_TOKENIZER_PATH="xxx"
export OPENAI_API_KEY="xxx"
export OPENAI_BASE_URL="https://api.deepseek.com"
DATASET_PATH=feabench-data/FEA-Bench-v1.0-Oracle
MODEL_NAME=deepseek-chat
RESULTS_ROOT_DIR=scripts/experiments/results_full
PROMPT_MODE=natural-detailed
python -m feabench.run_prediction \
--dataset_name_or_path $DATASET_PATH \
--model_type openai \
--model_name_or_path $MODEL_NAME \
--input_text $PROMPT_MODE \
--output_dir $RESULTS_ROOT_DIR/$PROMPT_MODE \
--num_proc 1

After running the inference, you should see the output .jsonl result files in the specified output_dir.
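Before evaluation, you can take a quick look at the generated predictions; the sketch below assumes the prediction records follow the SWE-Bench schema (in particular the model_patch field), so treat that field name as an assumption:

import json
from pathlib import Path

results_dir = Path("scripts/experiments/results_full/natural-detailed")

# Inspect every prediction file produced by run_prediction in the output directory.
for pred_file in sorted(results_dir.glob("*.jsonl")):
    with open(pred_file) as f:
        preds = [json.loads(line) for line in f if line.strip()]
    # "model_patch" is assumed here, following the SWE-Bench prediction schema.
    empty = sum(1 for p in preds if not p.get("model_patch"))
    print(f"{pred_file.name}: {len(preds)} predictions, {empty} empty patches")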
Our evaluation process is based on the code provided by SWE-Bench. We provide a patch file, swe-bench.diff, which adds the environment configurations for the task instances included in FEA-Bench.
Clone the SWE-Bench repository and apply the patch:
mkdir -p evaluator
cd evaluator
git clone https://github.com/SWE-bench/SWE-bench.git
cd SWE-bench
git checkout a0536ee6f9fd5ff88acf17a36a384bf3da3d93d6
git apply ../../swe-bench.diff
conda create --name fea-eval python=3.11
conda activate fea-eval
pip install -e .

To verify that the FEA-Bench task instances can run correctly on your machine, you can build a gold result based on the dataset:
python -m feabench.get_gold_results \
--dataset_name_or_path feabench-data/FEA-Bench-v1.0-Standard \
--save_dir feabench-data/experiments/gold \
--file_name Gold__FEABench_v1.0__test.jsonl

The command to run the evaluation script is as follows (using the gold result constructed above as an example):
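Optionally, you can confirm that every line of the gold file parses and carries the keys the SWE-Bench harness expects; the key names instance_id and model_patch below are assumed from the SWE-Bench prediction format:

import json

# Keys the SWE-Bench harness expects in each prediction record (assumed).
required = {"instance_id", "model_patch"}

with open("feabench-data/experiments/gold/Gold__FEABench_v1.0__test.jsonl") as f:
    records = [json.loads(line) for line in f if line.strip()]

missing = [r for r in records if not required.issubset(r)]
print(f"{len(records)} gold predictions; {len(missing)} records missing required keys")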
python -m swebench.harness.run_evaluation \
--dataset_name ../../feabench-data/FEA-Bench-v1.0-Standard \
--predictions_path ../../feabench-data/experiments/gold/Gold__FEABench_v1.0__test.jsonl \
--max_workers 10 \
--cache_level instance \
--timeout 900 \
--run_id FEABench_v1_Gold

The usage is identical to SWE-Bench. You can set --cache_level based on your available disk space. You should then obtain a result file in a .json format similar to the following:
{
"total_instances": 1401,
"submitted_instances": 1401,
"completed_instances": 1401,
"resolved_instances": 1401,
"unresolved_instances": 0,
"empty_patch_instances": 0,
"error_instances": 0,
...
}

Congratulations! You have completed the FEA-Bench workflow. If you have any questions, please open an issue in this repository.
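If you want to read the headline numbers programmatically, the keys shown above can be parsed directly from the report; the file name in the sketch below is a placeholder, since run_evaluation derives it from the model name and run_id:

import json

# Placeholder path; the actual report name depends on the model name and run_id.
report_path = "gold.FEABench_v1_Gold.json"

with open(report_path) as f:
    report = json.load(f)

resolved_rate = report["resolved_instances"] / report["total_instances"]
print(f"Resolved {report['resolved_instances']}/{report['total_instances']} ({resolved_rate:.1%})")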
For more details, please refer to the FEA-Bench paper. If you find our work helpful, we would be grateful if you could cite it:
@misc{li2025feabenchbenchmarkevaluatingrepositorylevel,
title={FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation},
author={Wei Li and Xin Zhang and Zhongxin Guo and Shaoguang Mao and Wen Luo and Guangyue Peng and Yangyu Huang and Houfeng Wang and Scarlett Li},
year={2025},
eprint={2503.06680},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2503.06680},
}
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.