
Commit 2adcf43

Span labeling datasets (#154)

* data sets init
* ignore analyses
* clean
* clean up wikineural
* clean up finer
* add mit restaurant reviews
* add restaurant to all
* update english datasets
* link to the restaurant dataset paper
* replace unix commands with python functions
* replace linux commands with python functions
* Update descriptions for each dataset
* Fix spancat_default config for validation errors: the spancat_default config was outdated and produced incompatible parsing errors from pydantic, so the config was regenerated.
* Add tests for the project
* Run black and isort on script files
* Attempt fix for encoding error: a UnicodeDecodeError showed up when the CI ran on Windows; fixed by specifying the encoding in the open() function.
* Include encoding even during file output
* Update project title
* Update category docs

Co-authored-by: Lj Miranda <ljvmiranda@gmail.com>
1 parent c19c6a0 commit 2adcf43
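The Windows encoding fix described in the commit message amounts to always passing an explicit `encoding` to `open()`, for both reading and writing. A minimal sketch of the idea (the helper names are illustrative, not taken from the project's scripts):

```python
from pathlib import Path


def write_text(path: Path, text: str) -> None:
    # An explicit encoding avoids Windows' locale-dependent default
    # (often cp1252), which raises UnicodeDecodeError/UnicodeEncodeError
    # on non-ASCII dataset content.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def read_text(path: Path) -> str:
    with open(path, "r", encoding="utf-8") as f:
        return f.read()
```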

19 files changed (+1810, -1 lines)

benchmarks/README.md

Lines changed: 2 additions & 1 deletion
@@ -1,13 +1,14 @@
 <a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a>
 
-# 🪐 Project Templates: Benchmarks (7)
+# 🪐 Project Templates: Benchmarks (8)
 
 | Template | Description |
 | --- | --- |
 | [`healthsea_spancat`](healthsea_spancat) | Healthsea-Spancat |
 | [`nel`](nel) | NEL Benchmark |
 | [`ner_conll03`](ner_conll03) | Named Entity Recognition (CoNLL-2003) |
 | [`parsing_penn_treebank`](parsing_penn_treebank) | Dependency Parsing (Penn Treebank) |
+| [`span-labeling-datasets`](span-labeling-datasets) | Span labeling datasets |
 | [`speed`](speed) | Project for speed benchmarking of various pretrained models of different NLP libraries. |
 | [`textcat_architectures`](textcat_architectures) | Textcat performance benchmarks |
 | [`ud_benchmark`](ud_benchmark) | Universal Dependencies v2.5 Benchmarks |
Lines changed: 168 additions & 0 deletions

@@ -0,0 +1,168 @@
+project.lock
+assets/
+corpus/
+training/
+wandb/
+unseen/
+analyses/
+
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+# Usually these files are written by a python script from a template
+# before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+cover/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+.pybuilder/
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+# For a library or package, you might want to ignore these files since the code is
+# intended to run in multiple environments; otherwise, check them in:
+# .python-version
+
+# pipenv
+# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+# However, in case of collaboration, if having platform-specific dependencies or dependencies
+# having no cross-platform support, pipenv may install dependencies that don't work, or not
+# install all needed dependencies.
+#Pipfile.lock
+
+# poetry
+# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+# This is especially recommended for binary packages to ensure reproducibility, and is more
+# commonly ignored for libraries.
+# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+#poetry.lock
+
+# pdm
+# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
+#pdm.lock
+# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
+# in version control.
+# https://pdm.fming.dev/#use-with-ide
+.pdm.toml
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+# pytype static type analyzer
+.pytype/
+
+# Cython debug symbols
+cython_debug/
+
+# PyCharm
+# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+# and can be added to the global gitignore or merged into this file. For a more nuclear
+# option (not recommended) you can uncomment the following to ignore the entire idea folder.
+#.idea/
Lines changed: 81 additions & 0 deletions

@@ -0,0 +1,81 @@
+<!-- SPACY PROJECT: AUTO-GENERATED DOCS START (do not remove) -->
+
+# 🪐 spaCy Project: Span labeling datasets
+
+This project compiles various NER and more general spancat datasets
+and their converters into the [spaCy format](https://spacy.io/api/data-formats).
+You can use it to experiment with `ner` and `spancat`,
+or to pre-train them for your application.
+
+
+## 📋 project.yml
+
+The [`project.yml`](project.yml) defines the data assets required by the
+project, as well as the available commands and workflows. For details, see the
+[spaCy projects documentation](https://spacy.io/usage/projects).
+
+### ⏯ Commands
+
+The following commands are defined by the project. They
+can be executed using [`spacy project run [name]`](https://spacy.io/api/cli#project-run).
+Commands are only re-run if their inputs have changed.
+
+| Command | Description |
+| --- | --- |
+| `preprocess-wnut17` | Canonicalize the WNUT2017 data set for conversion to .spacy. |
+| `convert-wnut17-ents` | Convert WNUT17 dataset into the spaCy format |
+| `convert-wnut17-spans` | Convert WNUT17 dataset into the spaCy format |
+| `inspect-wnut17` | Analyze span-characteristics |
+| `unpack-conll` | Decompress CoNLL 2002, remove temporary files and change encoding. |
+| `preprocess-conll` | Canonicalize the Dutch CoNLL data set for conversion to .spacy. |
+| `convert-conll-spans` | Convert CoNLL dataset (de, en, es, nl) into the spaCy format |
+| `convert-conll-ents` | Convert CoNLL dataset (de, en, es, nl) into the spaCy format |
+| `inspect-conll` | Analyze span-characteristics |
+| `convert-archaeo-spans` | Convert Dutch Archaeology dataset into the spaCy format |
+| `convert-archaeo-ents` | Convert Dutch Archaeology dataset into the spaCy format |
+| `inspect-archaeo` | Analyze span-characteristics |
+| `clean-archaeo` | |
+| `convert-anem-spans` | Convert AnEM dataset into the spaCy format |
+| `convert-anem-ents` | Convert AnEM dataset into the spaCy format |
+| `inspect-anem` | Analyze span-characteristics |
+| `preprocess-restaurant` | Make the MIT Restaurant Review data set format comply with the converter. |
+| `convert-restaurant-ents` | Convert MIT Restaurant Review data to .spacy format |
+| `convert-restaurant-spans` | Convert MIT Restaurant Review dataset into the spaCy format |
+| `inspect-restaurant` | Analyze span-characteristics |
+| `generate-unseen` | Create unseen entities splits for all preprocessed datasets. |
+
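The `unpack-conll` command above (decompress, clean up, change encoding) is one of the shell pipelines the commit message says was replaced with Python functions. A cross-platform sketch of that kind of step, using only the standard library; the archive layout and `latin-1` source encoding are assumptions, not taken from the project config:

```python
import tarfile
from pathlib import Path


def unpack_and_reencode(archive: Path, out_dir: Path, src_encoding: str = "latin-1") -> None:
    """Decompress a .tgz archive and rewrite its text members as UTF-8.

    A portable stand-in for a tar + iconv shell pipeline: extract every
    member, then re-read each file in the (assumed) source encoding and
    write it back out as UTF-8.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(out_dir)
    for path in out_dir.rglob("*"):
        if path.is_file():
            text = path.read_text(encoding=src_encoding)
            path.write_text(text, encoding="utf-8")
```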
### ⏭ Workflows
48+
49+
The following workflows are defined by the project. They
50+
can be executed using [`spacy project run [name]`](https://spacy.io/api/cli#project-run)
51+
and will run the specified commands in order. Commands are only re-run if their
52+
inputs have changed.
53+
54+
| Workflow | Steps |
55+
| --- | --- |
56+
| `wnut17` | `preprocess-wnut17` &rarr; `convert-wnut17-ents` &rarr; `convert-wnut17-spans` &rarr; `inspect-wnut17` |
57+
| `conll` | `unpack-conll` &rarr; `preprocess-conll` &rarr; `convert-conll-ents` &rarr; `convert-conll-spans` &rarr; `inspect-conll` |
58+
| `archaeo` | `convert-archaeo-ents` &rarr; `convert-archaeo-spans` &rarr; `clean-archaeo` &rarr; `inspect-archaeo` |
59+
| `anem` | `convert-anem-ents` &rarr; `convert-anem-spans` &rarr; `inspect-anem` |
60+
| `restaurant` | `preprocess-restaurant` &rarr; `convert-restaurant-ents` &rarr; `convert-restaurant-spans` &rarr; `inspect-restaurant` |
61+
| `all` | `preprocess-wnut17` &rarr; `convert-wnut17-ents` &rarr; `convert-wnut17-spans` &rarr; `unpack-conll` &rarr; `convert-conll-spans` &rarr; `convert-conll-ents` &rarr; `convert-archaeo-ents` &rarr; `convert-archaeo-spans` &rarr; `convert-anem-ents` &rarr; `convert-anem-spans` &rarr; `preprocess-restaurant` &rarr; `convert-restaurant-ents` &rarr; `convert-restaurant-spans` &rarr; `inspect-restaurant` |
62+
+### 🗂 Assets
+
+The following assets are defined by the project. They can
+be fetched by running [`spacy project assets`](https://spacy.io/api/cli#project-assets)
+in the project directory.
+
+| File | Source | Description |
+| --- | --- | --- |
+| `assets/wnut17-train.iob` | URL | WNUT17 training dataset for Emerging and Rare Entities Task from Derczynski et al. (ACL 2017) |
+| `assets/wnut17-test.iob` | URL | WNUT17 test dataset for Emerging and Rare Entities Task from Derczynski et al. (ACL 2017) |
+| `assets/wnut17-dev.iob` | URL | WNUT17 dev dataset for Emerging and Rare Entities Task from Derczynski et al. (ACL 2017) |
+| `assets/conll.tgz` | URL | CoNLL 2002 shared task data from Tjong Kim Sang (ACL 2002) |
+| `assets/archaeo.bio` | URL | Dutch Archaeological NER dataset by Alex Brandsen (LREC 2020) |
+| `assets/anem-train.iob` | URL | Anatomical Entity Mention Detection training dataset from Ohta et al. (ACL 2012) |
+| `assets/anem-test.iob` | URL | Anatomical Entity Mention Detection test dataset from Ohta et al. (ACL 2012) |
+| `assets/restaurant-train_raw.iob` | URL | Training data from the MIT Restaurants Review dataset |
+| `assets/restaurant-test_raw.iob` | URL | Test data from the MIT Restaurants Review dataset |
+
+<!-- SPACY PROJECT: AUTO-GENERATED DOCS END (do not remove) -->
