
Text Preprocessing in NLP

Last Updated : 23 Jul, 2025

Natural Language Processing (NLP) has advanced significantly and now plays an important role in multiple real-world applications like chatbots, search engines and sentiment analysis. An early step in any NLP workflow is text preprocessing, which prepares raw textual data for further analysis and modeling.

Text preprocessing involves cleaning and preparing raw text data for analysis or model training. Done well, it can significantly improve the performance and accuracy of NLP models.

Importance of Text Preprocessing

Raw text data is usually noisy and unstructured, containing various inconsistencies such as typos, slang, abbreviations and irrelevant information. Preprocessing helps in:

  • Improving Data Quality: Removing noise and irrelevant information ensures that the data fed into the model is clean and consistent.
  • Enhancing Model Performance: Well-preprocessed text can lead to better feature extraction, improving the performance of NLP models.
  • Reducing Complexity: Simplifying the text data can reduce the computational complexity and make the models more efficient.

Text Preprocessing Techniques in NLP

  • Regular Expressions: Regular expressions (regex) are an important tool in text preprocessing for Natural Language Processing (NLP). They allow efficient and flexible pattern matching and text manipulation.
  • Tokenization: Tokenization is the process of breaking down text into smaller units such as words or sentences. This is an important step in NLP as it transforms raw text into a structured format that can be further analyzed.
  • Lemmatization and Stemming: Lemmatization and stemming are techniques used in NLP to reduce words to their base or root forms. This process is important for tasks like text normalization, information retrieval and text mining.
  • Parts of Speech (POS): Parts of Speech (POS) tagging involves labeling each word in a sentence with its corresponding part of speech, such as noun, verb or adjective. This information is crucial for many NLP applications, including parsing, information retrieval and text analysis; a short tagging sketch follows this list.
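
The walkthrough below demonstrates cleaning, tokenization, stop word removal and stemming/lemmatization, but does not revisit POS tagging, so here is a minimal sketch using NLTK's pos_tag. The resource name is an assumption: recent NLTK releases ship the English tagger as 'averaged_perceptron_tagger_eng', while older ones use 'averaged_perceptron_tagger'.

Python
import nltk
from nltk.tokenize import word_tokenize

# Assumed resource names for a recent NLTK release; older releases use
# 'punkt' and 'averaged_perceptron_tagger' instead.
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

tokens = word_tokenize("Python is a great programming language")
print(nltk.pos_tag(tokens))
# Expected output along the lines of (exact tags may vary):
# [('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'),
#  ('programming', 'NN'), ('language', 'NN')]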

Example - Text Preprocessing in NLP

We will now walk through these preprocessing steps on a small sample corpus:

Python
corpus = [
    "I can't wait for the new season of my favorite show!",
    "The COVID-19 pandemic has affected millions of people worldwide.",
    "U.S. stocks fell on Friday after news of rising inflation.",
    "<html><body>Welcome to the website!</body></html>",
    "Python is a great programming language!!! ??"
]

1. Text Cleaning

We'll strip HTML tags, then convert the text to lowercase and remove numbers, punctuation and any remaining special characters.

  • Defines a clean_text() function to clean and normalize raw text data for NLP tasks.
  • Applies clean_text() to every document in the corpus list using a list comprehension.
  • Stores the cleaned version of all documents in a new list called cleaned_corpus.
  • Prints cleaned_corpus which is ready for tokenization.
Python
import re
import string
from bs4 import BeautifulSoup

def clean_text(text):
    text = BeautifulSoup(text, "html.parser").get_text()  # Remove HTML tags before punctuation is stripped
    text = text.lower()  # Lowercase
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    text = re.sub(r'\W', ' ', text)  # Replace remaining non-word characters with spaces
    return text

cleaned_corpus = [clean_text(doc) for doc in corpus]
print(cleaned_corpus)

Output:

['i cant wait for the new season of my favorite show', 'the covid pandemic has affected millions of people worldwide', 'us stocks fell on friday after news of rising inflation', 'welcome to the website', 'python is a great programming language ']

2. Tokenization

  • Splitting the cleaned text into tokens (words).
  • Imports word_tokenize to split text into individual words.
  • Downloads the necessary NLTK tokenizer model ('punkt_tab').
  • Tokenizes each cleaned document in cleaned_corpus.
  • Stores the list of tokens for each document in tokenized_corpus.
  • Prints the final tokenized output (a list of word lists).
Python
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt_tab')

tokenized_corpus = [word_tokenize(doc) for doc in cleaned_corpus]
print(tokenized_corpus)

Output:

[['i', 'cant', 'wait', 'for', 'the', 'new', 'season', 'of', 'my', 'favorite', 'show'], ['the', 'covid', 'pandemic', 'has', 'affected', 'millions', 'of', 'people', 'worldwide'], ['us', 'stocks', 'fell', 'on', 'friday', 'after', 'news', 'of', 'rising', 'inflation'], ['welcome', 'to', 'the', 'website'], ['python', 'is', 'a', 'great', 'programming', 'language']]

[nltk_data] Downloading package punkt_tab to /root/nltk_data...

[nltk_data] Package punkt_tab is already up-to-date!

3. Stop Words Removal

Removing common stop words from the tokens.

  • Imports the list of English stopwords from nltk.corpus.stopwords.
  • Downloads the stopwords corpus using nltk.download('stopwords').
  • Stores all English stopwords (like "the", "is", "and", etc.) in a set called stop_words for fast lookup.
  • Iterates over each document in tokenized_corpus and removes all stopwords.
  • Saves the cleaned, non-stopword tokens into filtered_corpus.
  • Prints the resulting list of documents with stop words removed.
Python
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
filtered_corpus = [[word for word in doc if word not in stop_words] for doc in tokenized_corpus]
print(filtered_corpus)

Output:

[nltk_data] Downloading package stopwords to /root/nltk_data...

[['cant', 'wait', 'new', 'season', 'favorite', 'show'], ['covid', 'pandemic', 'affected', 'millions', 'people', 'worldwide'], ['us', 'stocks', 'fell', 'friday', 'news', 'rising', 'inflation'], ['welcome', 'website'], ['python', 'great', 'programming', 'language']]

[nltk_data] Unzipping corpora/stopwords.zip.

4. Stemming and Lemmatization

Reducing words to their base form using stemming and lemmatization.

  • Imports PorterStemmer and WordNetLemmatizer from NLTK.
  • Downloads the wordnet resource required for lemmatization.
  • Initializes the stemmer and lemmatizer.
  • Applies stemming to each word in filtered_corpus and stores the result in stemmed_corpus.
  • Applies lemmatization to each word in filtered_corpus and stores the result in lemmatized_corpus.
  • Prints both the stemmed and lemmatized versions of the corpus.
Python
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

stemmed_corpus = [[stemmer.stem(word) for word in doc] for doc in filtered_corpus]
lemmatized_corpus = [[lemmatizer.lemmatize(word) for word in doc] for doc in filtered_corpus]
print(stemmed_corpus)
print(lemmatized_corpus)

Output:

[nltk_data] Downloading package wordnet to /root/nltk_data...

[['cant', 'wait', 'new', 'season', 'favorit', 'show'], ['covid', 'pandem', 'affect', 'million', 'peopl', 'worldwid'], ['us', 'stock', 'fell', 'friday', 'news', 'rise', 'inflat'], ['welcom', 'websit'], ['python', 'great', 'program', 'languag']]

[['cant', 'wait', 'new', 'season', 'favorite', 'show'], ['covid', 'pandemic', 'affected', 'million', 'people', 'worldwide'], ['u', 'stock', 'fell', 'friday', 'news', 'rising', 'inflation'], ['welcome', 'website'], ['python', 'great', 'programming', 'language']]
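
One detail worth noting in the lemmatized output above: WordNetLemmatizer treats every word as a noun by default, which is why "affected" and "rising" come through unchanged. Passing a part-of-speech hint changes the result. A small sketch, reusing the wordnet resource downloaded above:

Python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize('affected'))           # 'affected' (treated as a noun by default)
print(lemmatizer.lemmatize('affected', pos='v'))  # 'affect'
print(lemmatizer.lemmatize('rising', pos='v'))    # 'rise'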

5. Handling Contractions

Expanding contractions in the text.

  • Imports the contractions library, which expands shortened words.
  • Applies contractions.fix() to each document in cleaned_corpus.
  • Expands all contractions in the text for better clarity and consistency.
  • Stores the output in expanded_corpus.
  • Prints a list of documents with all contractions expanded.
Python
import contractions

expanded_corpus = [contractions.fix(doc) for doc in cleaned_corpus]
print(expanded_corpus)

Output:

['i cannot wait for the new season of my favorite show', 'the covid pandemic has affected millions of people worldwide', 'us stocks fell on friday after news of rising inflation', 'welcome to the website', 'python is a great programming language ']
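
The expansion above works because the contractions library also recognizes some apostrophe-free forms such as "cant". In practice, contraction expansion is usually applied before punctuation removal, while the apostrophes are still present. A minimal illustration on raw text:

Python
import contractions

print(contractions.fix("I can't wait for the new season!"))
# I cannot wait for the new season!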

6. Handling Emojis and Emoticons

Converting emojis to their textual representation.

  • Imports the emoji library for handling emojis in text.
  • Applies emoji.demojize() to each document in cleaned_corpus.
  • Converts all emojis into descriptive text.
  • Stores the output in emoji_corpus.
  • Prints a list of documents where emojis are replaced with readable names.
Python
import emoji

emoji_corpus = [emoji.demojize(doc) for doc in cleaned_corpus]
print(emoji_corpus)

Output:

['i cant wait for the new season of my favorite show', 'the covid pandemic has affected millions of people worldwide', 'us stocks fell on friday after news of rising inflation', 'welcome to the website', 'python is a great programming language ']
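
The output is identical to the cleaned corpus because the cleaning step has already stripped any emojis and special symbols. To see the conversion itself, here is a small standalone example on a made-up sentence that still contains an emoji:

Python
import emoji

print(emoji.demojize("Python is a great programming language 🐍"))
# Python is a great programming language :snake: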

7. Spell Checking

Correcting spelling errors in the text. As the output below shows, dictionary-based correction can misfire on domain terms: the out-of-vocabulary token 'covid' is "corrected" to 'bovid', so apply spell checking selectively.

  • Imports the SpellChecker class from the pyspellchecker library.
  • Initializes the spell checker with spell = SpellChecker().
  • Iterates through each word in each tokenized document from tokenized_corpus.
  • Applies spell.correction(word) to fix misspelled words.
  • Stores the corrected words in corrected_corpus.
  • Prints a list of tokenized documents with spelling corrections applied.
Python
from spellchecker import SpellChecker

spell = SpellChecker()
corrected_corpus = [[spell.correction(word) for word in doc] for doc in tokenized_corpus]
print(corrected_corpus)

Output:

[['i', 'cant', 'wait', 'for', 'the', 'new', 'season', 'of', 'my', 'favorite', 'show'], ['the', 'bovid', 'pandemic', 'has', 'affected', 'millions', 'of', 'people', 'worldwide'], ['us', 'stocks', 'fell', 'on', 'friday', 'after', 'news', 'of', 'rising', 'inflation'], ['welcome', 'to', 'the', 'website'], ['python', 'is', 'a', 'great', 'programming', 'language']]


After completing all the preprocessing steps, the final corpus is well-prepared for downstream NLP tasks such as feature extraction, text classification or sentiment analysis. This structured pipeline ensures the text is clean, standardized and optimized for modeling, ultimately enhancing the effectiveness and reliability of NLP applications.
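
As a quick illustration of that downstream step, the preprocessed tokens can be joined back into strings and fed to a vectorizer. The sketch below assumes scikit-learn is available and reuses the lemmatized_corpus produced earlier:

Python
from sklearn.feature_extraction.text import TfidfVectorizer

# Join each document's tokens back into a single string for the vectorizer
docs = [" ".join(tokens) for tokens in lemmatized_corpus]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

print(X.shape)                             # (number of documents, vocabulary size)
print(vectorizer.get_feature_names_out())  # learned vocabulary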

Further Reading

  • Regular Expressions
  • Tokenization
  • Stemming
  • POS

