NLP | Wordlist Corpus
In Natural Language Processing (NLP), a corpus is a collection of text data used for training, testing or evaluating NLP models. Corpora support many NLP tasks such as sentiment analysis, text classification and machine translation. A Wordlist Corpus is a specific type of corpus that contains a plain list of words and is used for tasks where word-level information is required. In this article, we will explore the Wordlist Corpus and how it can be used in various NLP tasks.
Understanding Wordlist Corpus
A Wordlist Corpus is a collection of words organized in a specific format with each word on a separate line. This type of corpus is widely used in NLP tasks that require a predefined set of words such as creating custom dictionaries, spell-checking applications, text normalization or filtering out certain words based on the task's requirements.
- Text Preprocessing: Remove stopwords, unwanted words or other non-relevant terms.
- Spell Checking: Verify that words are correctly spelled by looking them up in a dictionary-like corpus.
- Text Normalization: Convert variations of the same word into a standard format for further processing.
- Word Filtering: Filter out certain words or phrases when working with a text corpus.
- Building Custom Dictionaries: Create custom dictionaries to enhance Named Entity Recognition (NER), classification or other NLP tasks that require domain-specific knowledge.
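Two of the uses above, word filtering and spell checking, reduce to membership tests against a wordlist. The following is a minimal sketch using only the Python standard library. The stopword and vocabulary lists, and the helper names load_wordlist, filter_tokens and unknown_words, are made up for illustration and are not part of NLTK.

```python
import io

def load_wordlist(lines):
    """Build a lowercase set from an iterable of lines, one word per line."""
    return {line.strip().lower() for line in lines if line.strip()}

def filter_tokens(tokens, blocked):
    """Keep only the tokens that are not in the blocked wordlist."""
    return [t for t in tokens if t.lower() not in blocked]

def unknown_words(tokens, vocabulary):
    """Return tokens missing from the vocabulary (possible misspellings)."""
    return [t for t in tokens if t.lower() not in vocabulary]

# Hypothetical wordlists; io.StringIO stands in for files on disk
stopwords = load_wordlist(io.StringIO("the\nis\na\n"))
vocabulary = load_wordlist(io.StringIO("the\ncorpus\nis\na\nlist\nof\nwords\n"))

tokens = "The corpus is a lisst of words".split()
kept = filter_tokens(tokens, stopwords)
suspect = unknown_words(tokens, vocabulary)
print(kept)     # tokens left after removing stopwords
print(suspect)  # tokens not found in the vocabulary
```

Storing the wordlist in a set keeps each lookup fast, which matters once the list grows to thousands of entries.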
Implementation of Wordlist Corpus
We will see how to create a Wordlist Corpus for NLP tasks:
Step 1: Prepare Your Wordlist File
The first step is to create a file that contains a list of words. Each word should be listed on a new line, and the file can be in .txt or .csv format. We are using a simple wordlist in CSV format (nlp_task_wordlist.csv) whose words are stored in a column named word.
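If you do not have a wordlist file at hand, one can be generated in a few lines. The sketch below writes a small CSV with a single word column using Python's csv module. The sample words and the temporary directory are placeholders chosen for illustration; the header name word matches the column that the loading code in Step 2 reads.

```python
import csv
import os
import tempfile

# Placeholder words; replace with your own domain vocabulary
sample_words = ["text", "analysis", "preprocessing", "tokenization"]

# A temporary directory keeps the example self-contained
out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "nlp_task_wordlist.csv")

with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["word"])                    # header row
    writer.writerows([w] for w in sample_words)  # one word per row

# Read the file back to confirm its layout
with open(csv_path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
print(rows)
```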
Step 2: Load the Wordlist Using NLTK
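Now that we have our CSV file, we can load it with WordListCorpusReader, an NLTK class that reads and manages wordlist corpora. Because WordListCorpusReader reads plain-text files with one word per line, we first read the CSV with pandas and write its words to a text file.
We use the nltk and pandas libraries for this.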
- df['word']: extracts the word column from the DataFrame.
- corpus_root = '/content': defines the directory path where the text file will be stored.
- The loop for word in wordlist iterates through each word in the wordlist and writes it to the text file nlp_task_wordlist.txt in the /content directory, each word followed by a newline character (\n).
import nltk
import pandas as pd
from nltk.corpus.reader import WordListCorpusReader

# Read the CSV and pull the 'word' column into a Python list
df = pd.read_csv('/content/nlp_task_wordlist.csv')
wordlist = df['word'].tolist()

# Write one word per line to a plain-text file inside the corpus root
corpus_root = '/content'
with open(f'{corpus_root}/nlp_task_wordlist.txt', 'w') as file:
    for word in wordlist:
        file.write(f"{word}\n")

# Load the text file as a wordlist corpus
wordlist_reader = WordListCorpusReader(corpus_root, 'nlp_task_wordlist.txt')
print(wordlist_reader.words())
print(wordlist_reader.fileids())
Output :
['text', 'analysis', 'preprocessing', 'tokenization', 'lemmatization', ........ 'scaling', 'accuracy', 'correction', 'text']
['nlp_task_wordlist.txt']
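The sample output starts and ends with 'text', which suggests the wordlist can contain repeated entries. words() returns the entries as they appear in the file, so if duplicates matter for your task you may want to remove them first. Here is a small sketch, assuming the original order should be preserved. The list below is a shortened stand-in for the result of wordlist_reader.words().

```python
# Shortened stand-in for the list returned by wordlist_reader.words()
words = ["text", "analysis", "preprocessing", "accuracy", "correction", "text"]

# dict.fromkeys keeps the first occurrence of each word and preserves order
unique_words = list(dict.fromkeys(words))

print(len(words), len(unique_words))
print(unique_words)
```

Converting to a set would also drop duplicates, but it would not keep the words in file order.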
Step 3: Accessing Raw Text
We can access the raw text of our wordlist file using the raw() method, which returns the entire content of the file as a single string.
- from nltk.tokenize import line_tokenize: imports the line_tokenize function from the nltk.tokenize module.
- raw_text = wordlist_reader.raw(): retrieves the entire raw text from the wordlist file.
- line_tokenize(raw_text): splits the raw text back into a list of lines, one word per entry.
from nltk.tokenize import line_tokenize
raw_text = wordlist_reader.raw()
print("Wordlist: ", line_tokenize(raw_text))
Output :
Wordlist: ['text', 'analysis', 'preprocessing',.... 'accuracy', 'correction', 'text']
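The snippets in this article write to /content, which looks like a notebook-specific path and may not exist on your machine. The following sketch repeats the load-and-read flow against a temporary directory instead, assuming NLTK is installed. The file name demo_wordlist.txt and its three words are placeholders. It also checks that, for a simple file like this one, words() and line_tokenize(raw()) give the same list.

```python
import os
import tempfile

from nltk.corpus.reader import WordListCorpusReader
from nltk.tokenize import line_tokenize

# Placeholder corpus: a temporary directory with a one-word-per-line file
corpus_root = tempfile.mkdtemp()
with open(os.path.join(corpus_root, "demo_wordlist.txt"), "w", encoding="utf-8") as f:
    f.write("text\nanalysis\ntokenization\n")

reader = WordListCorpusReader(corpus_root, ["demo_wordlist.txt"])

raw_text = reader.raw()              # whole file as one string
from_raw = line_tokenize(raw_text)   # split the string into lines
from_words = reader.words()          # same list via the reader

print(repr(raw_text))
print(from_raw == from_words)
```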
Step 4: Accessing Predefined and Custom Wordlist Corpora
The 'names' corpus is pre-defined in NLTK and provides a list of male and female names. It’s very useful for tasks like gender classification based on names or just analyzing name patterns.
import nltk
nltk.download('names')
from nltk.corpus import names
print("Files : ", names.fileids())
print("\nNo. of female names : ", len(names.words('female.txt')))
print("\nNo. of male names : ", len(names.words('male.txt')))
Output :

Accessing our custom wordlist corpus: in addition to predefined corpora like names, we can also work with custom wordlists. Let's see how to access the custom wordlist corpus that we built from the CSV file.
- fileids(): returns the list of file names in the corpus. In our case it returns the name of the text file used: ['nlp_task_wordlist.txt'].
- print("\nTotal number of words in the custom wordlist: ", len(wordlist_reader.words())): prints the total number of words in the wordlist.
from nltk.corpus import WordListCorpusReader
corpus_root = '/content'
wordlist_reader = WordListCorpusReader(corpus_root, 'nlp_task_wordlist.txt')
print("File: ", wordlist_reader.fileids())
print("\nTotal number of words in the custom wordlist: ", len(wordlist_reader.words()))
Output :

Step 5: Accessing English Wordlist corpus
NLTK's 'words' corpus provides two English wordlists:
- en-basic list is a smaller, basic set of English words.
- en list is the larger collection of words that includes a more extensive vocabulary.
import nltk
nltk.download('words')
from nltk.corpus import words
print("File : ", words.fileids())
print("\nNo. of words in en-basic : ", len(words.words('en-basic')))
print("\nNo. of words in en : ", len(words.words('en')))
Output :

Whether you're working on a specific domain or on general language processing tasks, resources like WordListCorpusReader can enhance your NLP projects by letting you work with predefined or custom wordlists.