NLP | Wordlist Corpus

Last Updated : 14 Apr, 2025

In Natural Language Processing (NLP), a corpus is a collection of text data used for training, testing or evaluating NLP models. Corpora are used in many NLP tasks such as sentiment analysis, text classification and machine translation. A Wordlist Corpus is a specific type of corpus that contains a list of words, used for tasks where word-level information is required. In this article, we will explore the Wordlist Corpus and how it can be used in various NLP tasks.

Understanding Wordlist Corpus

A Wordlist Corpus is a collection of words stored in a simple format, with each word on a separate line. This type of corpus is widely used in NLP tasks that require a predefined set of words. Common uses include:

  1. Text Preprocessing: Removing stopwords, unwanted words or other non-relevant terms.
  2. Spell Checking: Checking whether a word is spelled correctly by looking it up in a dictionary-like corpus.
  3. Text Normalization: Converting variations of the same word into a standard form for further processing.
  4. Word Filtering: Filtering out specific words or phrases from a text corpus.
  5. Building Custom Dictionaries: Creating domain-specific dictionaries to enhance Named Entity Recognition (NER), classification or other NLP tasks.
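
As a minimal sketch of the preprocessing and word-filtering uses listed above, the snippet below loads a wordlist into a set and drops matching tokens from a sentence. The stopword text, the helper name remove_listed_words and the sample sentence are hypothetical and chosen only for illustration.

```python
# Hypothetical wordlist; in practice this would be read from a file, one word per line.
stopword_lines = "the\na\nis\nof\n"

# A set gives fast membership checks.
stopwords = {line.strip().lower() for line in stopword_lines.splitlines() if line.strip()}


def remove_listed_words(tokens, wordlist):
    """Return the tokens that do not appear in the wordlist (case-insensitive)."""
    return [tok for tok in tokens if tok.lower() not in wordlist]


tokens = "The accuracy of the model is high".split()
filtered = remove_listed_words(tokens, stopwords)
print(filtered)  # ['accuracy', 'model', 'high']
```

The same membership test works for any of the listed uses: only the content of the wordlist changes.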

Implementation of Wordlist Corpus

We will see how to create a Wordlist Corpus for NLP tasks:

Step 1: Prepare Your Wordlist File

The first step is to create a file that contains a list of words. Each word should be on its own line, and the file can be in .txt or .csv format. We are using a simple wordlist in CSV format, nlp_task_wordlist.csv, which stores the words in a column named word.
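
If you do not have a wordlist file yet, one way to create it is with Python's built-in csv module. This is only a sketch: the sample words are hypothetical, and the file is written to a temporary directory rather than /content so that it runs anywhere. The word header matches the column name that Step 2 reads.

```python
import csv
import os
import tempfile

# Hypothetical sample words; replace with your own vocabulary.
sample_words = ["text", "analysis", "tokenization"]

# Written to a temporary directory so the sketch runs anywhere.
csv_path = os.path.join(tempfile.mkdtemp(), "nlp_task_wordlist.csv")

with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["word"])                    # header row
    writer.writerows([w] for w in sample_words)  # one word per row

# Read the file back to confirm its layout.
with open(csv_path, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

print([row["word"] for row in rows])  # ['text', 'analysis', 'tokenization']
```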

Step 2: Load the Wordlist Using NLTK

Now that we have our CSV file, we can convert it to a plain-text wordlist and load it with WordListCorpusReader, the NLTK class for reading wordlist corpora.

We use the nltk and pandas libraries for this.

  • df['word']: extracts the word column from the DataFrame.
  • corpus_root = '/content': defines the directory where the text file will be stored.
  • The loop for word in wordlist writes each word to the text file nlp_task_wordlist.txt in the /content directory, followed by a newline character (\n).
  • words() returns the words in the corpus and fileids() returns the names of its files.
Python
import pandas as pd
from nltk.corpus.reader import WordListCorpusReader

# Read the CSV and pull the 'word' column into a Python list
df = pd.read_csv('/content/nlp_task_wordlist.csv')
wordlist = df['word'].tolist()

# Write one word per line to a plain-text file in the corpus directory
corpus_root = '/content'
with open(f'{corpus_root}/nlp_task_wordlist.txt', 'w') as file:
    for word in wordlist:
        file.write(f"{word}\n")

# Load the text file as a wordlist corpus
wordlist_reader = WordListCorpusReader(corpus_root, 'nlp_task_wordlist.txt')
print(wordlist_reader.words())
print(wordlist_reader.fileids())

Output :

['text', 'analysis', 'preprocessing', 'tokenization', 'lemmatization', ........ 'scaling', 'accuracy', 'correction', 'text']
['nlp_task_wordlist.txt']
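
Note that 'text' appears both first and last in the output above, so the reader returns words exactly as they are listed in the file, duplicates included. If your task needs unique words, one possible approach is to deduplicate while keeping the original order. The short list below is a hypothetical stand-in for the full output of words().

```python
# Hypothetical short list mirroring the output above, where 'text' appears twice.
words = ["text", "analysis", "preprocessing", "accuracy", "correction", "text"]

# dict.fromkeys keeps the first occurrence of each word and preserves order.
unique_words = list(dict.fromkeys(words))

print(unique_words)  # ['text', 'analysis', 'preprocessing', 'accuracy', 'correction']
print(len(words), len(unique_words))  # 6 5
```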

Step 3: Accessing Raw Text

We can access the raw text of our wordlist file with the raw() method, which returns the entire content of the file as a single string. The line_tokenize function then splits that string back into one item per line.

  • from nltk.tokenize import line_tokenize: imports the line_tokenize function from the nltk.tokenize module.
  • raw_text = wordlist_reader.raw(): retrieves the entire raw text of the wordlist file.
Python
from nltk.tokenize import line_tokenize
raw_text = wordlist_reader.raw()
print("Wordlist: ", line_tokenize(raw_text))

Output :

Wordlist: ['text', 'analysis', 'preprocessing',.... 'accuracy', 'correction', 'text']
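
To see what this step does without NLTK, the plain-Python sketch below splits a raw string into lines and drops blank ones. It only approximates the default behavior of line_tokenize; the helper name split_lines and the sample string are hypothetical, and NLTK's handling of edge cases may differ.

```python
def split_lines(raw_text):
    """Split raw text into lines and drop blank ones."""
    return [line for line in raw_text.splitlines() if line.strip()]


# Hypothetical raw content with one blank line in the middle.
raw_text = "text\nanalysis\n\npreprocessing\n"
print(split_lines(raw_text))  # ['text', 'analysis', 'preprocessing']
```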

Step 4: Accessing Predefined and Custom Wordlist Corpora

The names corpus is predefined in NLTK and provides lists of male and female names. It is useful for tasks like gender classification based on names or analyzing name patterns.

Python
import nltk
from nltk.corpus import names

# Download the names corpus (only needed once)
nltk.download('names')

# fileids() lists the files in the corpus: one per gender
print("Files : ", names.fileids())
print("\nNo. of female names : ", len(names.words('female.txt')))
print("\nNo. of male names : ", len(names.words('male.txt')))

Output :

[Output image: number of male and female names]

In addition to predefined corpora like names, we can also work with custom wordlists. Let's see how to access the custom wordlist corpus we created from the CSV file.

  • WordListCorpusReader(corpus_root, 'nlp_task_wordlist.txt'): creates a reader for the text file in the corpus_root directory.
  • fileids(): returns the list of file names in the corpus. In our case it returns ['nlp_task_wordlist.txt'].
  • len(wordlist_reader.words()): gives the total number of words in the wordlist.
Python
from nltk.corpus import WordListCorpusReader
corpus_root = '/content'  
wordlist_reader = WordListCorpusReader(corpus_root, 'nlp_task_wordlist.txt')
print("File: ", wordlist_reader.fileids())
print("\nTotal number of words in the custom wordlist: ", len(wordlist_reader.words()))

Output :

[Output image: number of words in the custom wordlist]

Step 5: Accessing English Wordlist corpus

NLTK also provides a predefined words corpus of English words, which contains two files:

  • en-basic: a smaller, basic set of English words.
  • en: a larger collection of words with a more extensive vocabulary.
Python
import nltk
from nltk.corpus import words

# Download the words corpus (only needed once)
nltk.download('words')

print("File : ", words.fileids())
print("\nNo. of words in 'en-basic' : ", len(words.words('en-basic')))
print("\nNo. of words in 'en' : ", len(words.words('en')))

Output :

[Output image: number of words in 'en-basic' and 'en']
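
An English wordlist like this can back a simple spell checker. The sketch below is one possible approach using Python's built-in difflib module, not an NLTK feature. The small vocabulary is a hypothetical stand-in for a list such as words.words('en-basic'), and check_word is a helper name made up for this example.

```python
from difflib import get_close_matches

# Hypothetical mini vocabulary standing in for a real wordlist.
vocabulary = ["accuracy", "analysis", "correction", "scaling", "text"]
vocab_set = set(vocabulary)


def check_word(word, cutoff=0.6):
    """Return (is_known, suggestions) for a single word."""
    word = word.lower()
    if word in vocab_set:
        return True, []
    # Suggest up to three vocabulary words whose similarity is at least `cutoff`.
    return False, get_close_matches(word, vocabulary, n=3, cutoff=cutoff)


print(check_word("text"))      # (True, [])
print(check_word("analisys"))  # (False, ['analysis'])
```

Raising cutoff makes the suggestions stricter, while lowering it returns looser matches.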

Whether you're working on a specific domain or on general language processing tasks, WordListCorpusReader gives you a simple way to use predefined or custom wordlists in your NLP projects.

