This repo contains the code and datasets for the paper "ProgGen: Generating Named Entity Recognition Datasets Step-by-step with Self-Reflexive Large Language Models".
We study 4 datasets: CoNLL-2003, WikiGold, MIT-Movie and MIT-Restaurant.
See (1) the data section (reproduce folder) for LLM prompts, responses and processed datasets, and (2) the commands section (scripts folder) for reproducing the results of the main experiments.
The reproduce folder contains prompts, LLM responses and processed datasets as reported in our main experiments. It is organized as follows:
- diversify-x (Diversify X)
  - gen-attr-dim and gen-attr-val contain prompts and responses for attribute dimension generation and attribute value generation, respectively.
  - config contains processed attribute dimensions and values.
- diversify-y (Diversify Y)
  - gen-entity-vanilla and gen-entity-latent contain prompts and responses for named entity pool generation, for the vanilla and latent variants, respectively.
  - config contains processed named entities.
- sample, for NER sample generation
  - gen-sample contains prompts and responses.
  - dataset contains processed NER datasets.
- correction, for LLM self-correction
  - gen-correction contains prompts and responses.
  - config contains entity-class-specific annotation instructions and demos for each dataset and each diversity approach.
  - instruction&demo-pool contains the annotation instruction pool and demo pool for each entity class, shared across all diversity approaches, for illustration purposes.
  - annotation-error contains representative entity annotation errors from NER sample generation for each dataset.
  - dataset contains processed datasets with entity annotations overridden by processed corrections.
Note
- LLM prompts and responses are available in 2 formats:
  - A readable format, via prompts.log and completion-*.txt files, and
  - OpenAI API format, via requests.jsonl and requests_results.jsonl files.
- All folders have date prefixes indicating the date of the experiments.
- In each processed dataset (sample/dataset) folder, each entity annotation triple (sentence, span, entity type) is available in logprobs-triple.json files.
- Top-uncertain triples selected for LLM Self-Correction are available from correction generation log files (correction/gen-correction/**/completion.log).
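The .jsonl files mentioned above can be inspected with a few lines of Python. The following is a minimal sketch, assuming only that each non-empty line holds one JSON value (the usual JSON Lines convention). The read_jsonl helper is hypothetical and not part of this repo, and no particular field names are assumed.

```python
import json
from pathlib import Path
from typing import Any, List


def read_jsonl(path: str) -> List[Any]:
    """Load a JSON Lines file: one JSON value per non-empty line."""
    records = []
    with Path(path).open(encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```

Calling read_jsonl on a requests.jsonl file and printing the first record is a quick way to see which fields a given run actually stored.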
We detail scripts for running experiments and reproducing our results with example commands.
Note
- Each script contains all relevant arguments (see help in each script and utils.py).
- Each script/command is expected to be run from the repository root directory.
- Terminal logging messages (and log file writes) for each script show where the relevant (dataset) files are saved.
- All OpenAI API responses and processed datasets are written to the generated_data folder.
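Since the scripts are expected to run from the repository root, a small pre-flight check can catch a wrong working directory before any API call is made. This is a minimal sketch, not part of the repo: the preflight function is hypothetical, and it assumes the root can be recognized by its scripts and src folders and that the LLM generation steps read the OPENAI_API_KEY environment variable.

```python
import os
from pathlib import Path
from typing import List, Mapping


def preflight(root: str, env: Mapping[str, str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # Assumption: the repository root is the directory holding these two folders.
    for folder in ('scripts', 'src'):
        if not (Path(root) / folder).is_dir():
            problems.append(f'missing folder: {folder}')
    # LLM generation steps need an OpenAI API key in the environment.
    if not env.get('OPENAI_API_KEY'):
        problems.append('OPENAI_API_KEY is not set')
    return problems


if __name__ == '__main__':
    for problem in preflight(os.getcwd(), os.environ):
        print(problem)
```

Passing the environment in as a mapping, rather than reading os.environ inside the function, keeps the check easy to try with different settings.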
Before you run a script, make sure Python sees the src package folder:

export PYTHONPATH=$PYTHONPATH:$(pwd)

For all LLM generation steps, set your OpenAI API key via

export OPENAI_API_KEY='<your-api-key>'

Python version 3.8
1> Install conda environment

conda create -n prog-gen python=3.8 pip

2> Activate environment and install packages

conda activate prog-gen
pip install -r requirements.txt

Includes writing (1) few-shot demo samples and (2) entire test set for each of the datasets studied. Intended for downstream model training.
See write_original_dataset.py for details.
Example 1: Write few-shot demo samples for CoNLL-2003:
python scripts/write_original_dataset.py demo \
--dataset_name 'conll2003-no-misc' \
--n_demo 1 \
--include_negative_sample 1

Example 2: Write entire test set for MIT-Movie:

python scripts/write_original_dataset.py test --dataset_name 'mit-movie'

Note this step is not necessary as each subsequent step will automatically write the respective files if not found.
Note that additional manual inspection and filtering for low-quality values may be needed.
1: Diversify X
Note that we omit the step for attribute dimension generation, as we queried the GPT-4 web app. See the paper for the prompt templates and the reproduce folder for the actual prompts used.
See generate_diversify_x_config.py for details on generating attribute values.
Example on WikiGold:
python scripts/generate_diversity_config.py \
--dataset_name 'wiki-gold-no-misc' \
--diversity_variant 'diversify-x' \
--prompt_seed 42 \
--chat_model_name 'gpt-3.5-turbo-1106' \
--chat_timeout 30 \
--n_call 3

2: Diversify Y
Includes the vanilla and latent variants. See generate_diversify_y_config.py
Example 1: The vanilla variant on MIT-Restaurant:
python scripts/generate_diversity_config.py \
--dataset_name 'mit-restaurant' \
--diversity_variant 'diversify-y-vanilla' \
--prompt_seed 42 \
--chat_model_name 'gpt-3.5-turbo-1106' \
--chat_timeout 30 \
--n_call 10

Example 2: The latent variant on CoNLL-2003:
python scripts/generate_diversity_config.py \
--dataset_name 'conll2003-no-misc' \
--diversity_variant 'diversify-y-latent' \
--diversify_y_latent_attribute 'reproduce/diversify-x/config/conll2003_no_misc.json' \
--prompt_seed 42 \
--chat_model_name 'gpt-3.5-turbo-1106' \
--chat_max_tokens 256 \
--chat_timeout 30 \
--n_call 5

Note the internal name of the dataset-independent attribute dimension for each dataset is given by

DATASET_NAME2TOPIC_DIM = {
    'conll2003-no-misc': 'news-category',
    'wiki-gold-no-misc': 'topic',
    'mit-movie': 'query-category',
    'mit-restaurant': 'meal-category'
}

Includes Simple Prompt and all 4 diversity variants studied.
See generate_ner_sample.py for details.
Example 1: Simple Prompt on MIT-Movie:
python scripts/generate_ner_sample.py \
--dataset_name 'mit-movie' \
--diversity_variant 'simple-prompt' \
--prompt_seed 42 \
--n_list 50 \
--n_call 36 \
--chat_model_name 'gpt-3.5-turbo-1106' \
--chat_max_tokens 2560 \
--chat_logprobs 'True' \
--chat_timeout 60

Note that (1) a large n_list (e.g. 50) may sometimes yield fewer than 50 generated samples, as discussed in the paper, and (2) WikiGold generated samples are much longer, so a relatively higher chat_max_tokens is advised.
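As a rough sanity check before launching a run, the number of samples requested can be estimated from n_list and n_call. This sketch assumes that n_list is the number of samples requested per API call and that n_call is the number of API calls; that reading of n_call is an assumption, not taken from the script. Because a call may return fewer samples than requested, the product is only an upper bound. The function name is hypothetical.

```python
def max_requested_samples(n_list: int, n_call: int) -> int:
    """Upper bound on generated samples, assuming n_list samples are requested in each of n_call calls."""
    return n_list * n_call


# Settings from the Simple Prompt example above.
print(max_requested_samples(n_list=50, n_call=36))  # 1800
```

Under the same assumption, the settings used in the other examples of this section (n_list 3, n_call 600) give the same bound of 1800.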
Example 2: Diversify X on WikiGold:
python scripts/generate_ner_sample.py \
--dataset_name 'wiki-gold-no-misc' \
--diversity_variant 'diversify-x' \
--diversify_x_config 'reproduce/diversify-x/config/wiki_gold_no_misc.json' \
--prompt_seed 42 \
--n_list 3 \
--n_call 600 \
--chat_model_name 'gpt-3.5-turbo-1106' \
--chat_max_tokens 256 \
--chat_logprobs 'True' \
--chat_timeout 20

Example 3: Diversify Y (vanilla) on MIT-Restaurant:
python scripts/generate_ner_sample.py \
--dataset_name 'mit-restaurant' \
--diversity_variant 'diversify-y-vanilla' \
--diversify_y_config 'reproduce/diversify-y/config/vanilla/mit_restaurant.json' \
--prompt_seed 42 \
--n_list 3 \
--n_call 600 \
--chat_model_name 'gpt-3.5-turbo-1106' \
--chat_max_tokens 256 \
--chat_logprobs 'True' \
--chat_timeout 20

Example 4: Diversify Y (latent) on MIT-Movie:
python scripts/generate_ner_sample.py \
--dataset_name 'mit-movie' \
--diversity_variant 'diversify-y-latent' \
--diversify_y_config 'reproduce/diversify-y/config/latent/mit_movie.json' \
--diversify_y_n_exp_entity 4.5 \
--prompt_seed 42 \
--n_list 3 \
--n_call 600 \
--chat_model_name 'gpt-3.5-turbo-1106' \
--chat_max_tokens 256 \
--chat_logprobs 'True' \
--chat_timeout 20

Example 5: Diversify X+Y on CoNLL-2003:
python scripts/generate_ner_sample.py \
--dataset_name 'conll2003-no-misc' \
--diversity_variant 'diversify-x+y' \
--diversify_x_config 'reproduce/diversify-x/config/conll2003_no_misc.json' \
--diversify_y_config 'reproduce/diversify-y/config/latent/conll2003_no_misc.json' \
--prompt_seed 42 \
--n_list 3 \
--n_call 600 \
--chat_model_name 'gpt-3.5-turbo-1106' \
--chat_max_tokens 256 \
--chat_logprobs 'True' \
--chat_timeout 20

The diversity arguments, including diversify_x_config, diversify_x_sample_prob, diversify_y_config and diversify_y_n_exp_entity, are optional and default to the setups reported in the paper (via loading from processed datasets in the generated_data folder).
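Because the diversity arguments are optional, a run can be scripted by passing only the arguments that should differ from the defaults. The following is a minimal sketch of building such a command line in Python. The build_command helper is hypothetical and not part of this repo, and it only uses script and argument names that appear in the examples above.

```python
import shlex
from typing import Dict, List, Union

ArgValue = Union[str, int, float]


def build_command(script: str, args: Dict[str, ArgValue]) -> List[str]:
    """Turn {'dataset_name': 'mit-movie'} into ['python', script, '--dataset_name', 'mit-movie']."""
    command = ['python', script]
    for name, value in args.items():
        command += [f'--{name}', str(value)]
    return command


command = build_command('scripts/generate_ner_sample.py', {
    'dataset_name': 'mit-movie',
    'diversity_variant': 'simple-prompt',
    'prompt_seed': 42,
    'n_list': 50,
    'n_call': 36,
})
# shlex.join (Python 3.8+) renders the list as a shell-safe string for logging.
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) from the repository root. Keeping it as a list rather than a single string avoids shell quoting issues with paths and values.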
For generating LLM Self-Corrections for entity annotations given a generated (and processed) NER dataset.
See generate_correction.py
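The example command for this step passes --logprob_thresh and --top_n to control which annotations are sent for correction. To illustrate what uncertainty-based selection can look like, here is a minimal sketch on toy data. It is not the script's actual logic: it assumes each triple carries a log-probability, keeps the triples below the threshold, and caps the result at a top_n fraction of all triples. The Triple type and select_uncertain function are hypothetical.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    sentence: str
    span: str
    entity_type: str
    logprob: float  # assumed: log-probability of the annotation; lower means less certain


def select_uncertain(triples: List[Triple], logprob_thresh: float, top_n: float) -> List[Triple]:
    """Pick the least certain triples: below the threshold, capped at a fraction of all triples."""
    below = [t for t in triples if t.logprob < logprob_thresh]
    below.sort(key=lambda t: t.logprob)  # most uncertain first
    limit = int(len(triples) * top_n)  # assumed: top_n is a fraction of the full set
    return below[:limit]


# Toy triples with made-up log-probabilities, for illustration only.
triples = [
    Triple('Paris is lovely', 'Paris', 'location', -0.001),
    Triple('I met Jordan', 'Jordan', 'person', -0.9),
    Triple('Apple fell', 'Apple', 'organization', -1.6),
    Triple('See you in May', 'May', 'person', -0.3),
    Triple('Berlin calls', 'Berlin', 'location', -0.01),
]
for t in select_uncertain(triples, logprob_thresh=-2e-2, top_n=0.4):
    print(t.span, t.logprob)
```

With these toy values, three triples fall below the threshold and the cap of two keeps the two with the lowest log-probability.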
Example: Self-Correction for a processed dataset (diversify-y-vanilla) on MIT-Movie
python scripts/generate_correction.py \
--dataset_name 'mit-movie' \
--generated_dataset_dir_name 'reproduce/sample/dataset/mit_movie/24-02-06_Diversify-Y-vanilla' \
--correction_config 'reproduce/correction/config/mit_movie/Diverse-Y-vanilla.json' \
--output_postfix 'diversify-y-vanilla' \
--prompt_seed 42 \
--n_correct 3 \
--logprob_thresh=-2e-2 \
--top_n 0.2 \
--chat_model_name 'gpt-3.5-turbo-1106' \
--chat_max_tokens 256 \
--chat_temperature 0 \
--chat_timeout 30

Includes training a BERT-class model with epoch-wise evaluation. See train.py.
Example: Train on a generated dataset (Diversify X) with self-correction for WikiGold:
python scripts/train.py \
--dataset_name 'wiki-gold-no-misc' \
--generated_dataset_dir_name 'reproduce/correction/dataset/wiki_gold_no_misc/24-02-11_Diversify-X' \
--few_shot_demo_file 'bio-train-1-shot-shuffled+neg.jsonl' \
--test_file 'bio-test-all.jsonl' \
--hf_model_name 'microsoft/deberta-v3-base' \
--learning_rate 4e-5 \
--n_epochs 16.0 \
--train_batch_size 24 \
--seed 42

To train with GPU, use:

CUDA_VISIBLE_DEVICES=<your-gpu-id> python scripts/train.py ...

Potential functionalities to support include
- Custom shuffle seed for a different set of n-shot demos
- Customizable templates, including
  - (1) diversity config generation, (2) data generation instruction and (3) diversity requirement
