Latent Dirichlet Allocation and Topic Modelling

Last Updated : 11 Aug, 2025

Topic modeling is a technique in Natural Language Processing (NLP) that helps uncover hidden themes or "topics" across large sets of raw text. By recognizing patterns in how words appear together, topic models can organize documents by their underlying ideas without needing labeled data. Latent Dirichlet Allocation (LDA), the most widely applied topic modeling method, works as an unsupervised probabilistic model. It assumes that similar documents will share similar word usage and thus will likely belong to the same topics. Each document is viewed as a mixture of topics and each topic is characterized by a distribution over words.

  • Documents are expressed as probabilities over topics.
  • Topics are defined as probabilities over words.

Components of Latent Dirichlet Allocation (LDA)

Probabilistic Generative Model

LDA assumes that each document is generated by a two-step random process (a minimal simulation is sketched after this list):

  • For each document, sample a distribution over topics (using a Dirichlet prior).
  • For each word in the document, sample a topic from the document’s topic distribution, then sample a word from the selected topic’s word distribution.
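A minimal NumPy sketch of this generative story, with made-up values for the number of topics, vocabulary size and Dirichlet parameters (NumPy is not otherwise used in the pipeline below):

Python
import numpy as np

rng = np.random.default_rng(0)

num_topics, vocab_size, doc_length = 3, 8, 12   # assumed toy sizes
alpha = np.full(num_topics, 0.5)                # per-document Dirichlet prior
beta = np.full(vocab_size, 0.1)                 # per-topic Dirichlet prior

# Topic-word distributions: one categorical distribution over the vocabulary per topic
topic_word = rng.dirichlet(beta, size=num_topics)

# Step 1: sample this document's distribution over topics
doc_topic = rng.dirichlet(alpha)

# Step 2: for each word position, sample a topic, then a word from that topic
words = []
for _ in range(doc_length):
    z = rng.choice(num_topics, p=doc_topic)       # topic assignment
    w = rng.choice(vocab_size, p=topic_word[z])   # word id drawn from that topic
    words.append(w)

print("Document topic mixture:", np.round(doc_topic, 2))
print("Sampled word ids:", words)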

Role of Dirichlet Distributions

The model uses Dirichlet distributions in two places:

  • To model the diversity of topic proportions for each document (parameter α).
  • To model the diversity of word proportions for each topic (parameter β).

LDA as a Mixture Model

Each document is viewed as a random mixture of topics and each topic as a mixture over words. For example, an article about sports might be a combination of topics like “teams,” “games,” and “scores.” LDA discovers these topics based on patterns in word usage across the corpus.

Bayesian Inference in LDA

LDA uses Bayesian inference to "reverse engineer" the hidden topics from the observed words in documents. Techniques like Gibbs sampling or variational Bayes are used to estimate the latent variables (the snippet after this list shows how to inspect them in gensim):

  • The topic proportions in each document.
  • The word probabilities in each topic.
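For example, once the gensim LdaModel in the implementation below has been fitted, these two sets of latent variables can be inspected directly; lda_model and doc_term_matrix refer to the objects built later in this article:

Python
# Assumes lda_model and doc_term_matrix from the pipeline below
print(lda_model.get_document_topics(doc_term_matrix[0]))  # topic proportions of the first document
print(lda_model.show_topic(0, topn=5))                    # top 5 word probabilities in topic 0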

Key Model Parameters

  • α: Controls per-document topic diversity (high α means documents have many topics); its effect is illustrated in the sketch after this list.
  • β: Controls per-topic word diversity (high β means topics use many different words).
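A quick, standalone NumPy sketch (unrelated to the review data) showing how the concentration of α changes sampled topic proportions:

Python
import numpy as np

rng = np.random.default_rng(0)

# Low alpha: each document concentrates on a few topics (sparse mixtures)
print(np.round(rng.dirichlet(np.full(5, 0.1), size=3), 2))

# High alpha: topic proportions spread more evenly across topics
print(np.round(rng.dirichlet(np.full(5, 10.0), size=3), 2))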

Step-by-Step Implementation

Let's walk through the implementation of an LDA topic modeling pipeline.

Step 1: Install and Import libraries

We install and import the required libraries:

  • pandas: Loads, manipulates and inspects tabular data.
  • string: Supplies the punctuation characters removed during text cleaning.
  • spacy: Processes text (tokenizes, tags, lemmatizes) for NLP tasks.
  • nltk: Supplies English stopwords and other language tools.
  • gensim: Performs topic modeling and builds the bag-of-words corpus.
  • pyLDAvis: Renders interactive visualizations of the fitted topic model.
Python
!pip install --upgrade gensim pyLDAvis spacy pandas scikit-learn
import spacy.cli
spacy.cli.download("en_core_web_md")

import pandas as pd
import string
import spacy
import nltk
import gensim
from gensim import corpora
from gensim.models import CoherenceModel
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis
from nltk.corpus import stopwords
import en_core_web_md
nltk.download('wordnet')
nltk.download('stopwords')

Step 2: Load Data

We load the dataset of reviews:

  • pd.read_csv('/content/mock_yelp.csv'): Loads Yelp-style reviews from a CSV into a pandas DataFrame.
  • len(yelp_review), groupby('business_id') and groupby('user_id'): Quickly check how many reviews, unique businesses and users are present.
Python
yelp_review = pd.read_csv('/content/mock_yelp.csv')
print("Number of reviews:", len(yelp_review))
print("Unique businesses:", len(yelp_review.groupby('business_id')))
print("Unique users:", len(yelp_review.groupby('user_id')))

Output:

Number of reviews: 10
Unique businesses: 5
Unique users: 5

Step 3: Preprocess Text

3.1 Clean text: clean_text(text): Removes punctuation and digits, lowercases text and discards short/non-informative words. Ensures input text is standardized for modeling.

Python
def clean_text(text):
    # Build a translation table that deletes every punctuation character
    delete_dict = {sp_char: '' for sp_char in string.punctuation}
    delete_dict[' '] = ' '  # keep spaces as-is
    table = str.maketrans(delete_dict)
    text1 = text.translate(table)
    # Drop pure digits and very short words, then lowercase
    textArr = text1.split()
    text2 = ' '.join([w for w in textArr if not w.isdigit() and len(w) > 3])
    return text2.lower()


yelp_review['text'] = yelp_review['text'].apply(clean_text)
yelp_review['Num_words_text'] = yelp_review['text'].apply(
    lambda x: len(str(x).split()))
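For illustration, applying clean_text to a made-up review string (not taken from the dataset):

Python
sample = "Great food!!! 5 stars, we will come back in 2024."
print(clean_text(sample))
# -> "great food stars will come back"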

3.2 Remove Stopwords:

  • nltk.download('stopwords') and stopwords.words('english'): Retrieve the list of English stopwords.
  • remove_stopwords(text): Filters these stopwords from reviews so only content-rich words remain.
Python
stop_words = stopwords.words('english')


def remove_stopwords(text):
    textArr = text.split(' ')
    rem_text = " ".join([i for i in textArr if i not in stop_words])
    return rem_text


yelp_review['text'] = yelp_review['text'].apply(remove_stopwords)

3.3 Lemmatization (nouns and adjectives):

  • spacy.cli.download("en_core_web_md"): Downloads spaCy’s medium English model with vocabulary and grammatical info.
  • en_core_web_md.load(disable=['parser', 'ner']): Loads the model for fast lemmatization, ignoring other NLP features to speed up code.
  • lemmatization(texts, allowed_postags=['NOUN', 'ADJ']): Converts all reviews into lists of base-form words (lemmas), only keeping nouns and adjectives, which are most useful for discovering themes.
Python
nlp = en_core_web_md.load(disable=['parser', 'ner'])


def lemmatization(texts, allowed_postags=['NOUN', 'ADJ']):
    output = []
    for sent in texts:
        doc = nlp(sent)
        output.append(
            [token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return output


text_list = yelp_review['text'].tolist()
tokenized_reviews = lemmatization(text_list)

Step 4: Create Document-Term Matrix

We create the document-term matrix:

  • corpora.Dictionary(tokenized_reviews): Creates an ID-to-word mapping from tokenized reviews.
  • [dictionary.doc2bow(rev) for rev in tokenized_reviews]: Builds a bag-of-words matrix needed for LDA input.
Python
dictionary = corpora.Dictionary(tokenized_reviews)
if len(dictionary) > 0:
    doc_term_matrix = [dictionary.doc2bow(rev) for rev in tokenized_reviews]
else:
    doc_term_matrix = []
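As a toy illustration of the bag-of-words format (separate from the review corpus; the integer word ids depend on the dictionary):

Python
toy_dictionary = corpora.Dictionary([["food", "service", "food", "price"]])
print(toy_dictionary.token2id)                              # e.g. {'food': 0, 'price': 1, 'service': 2}
print(toy_dictionary.doc2bow(["food", "food", "service"]))  # e.g. [(0, 2), (2, 1)] meaning (word id, count)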

Step 5: Fit LDA Model

We fit the LDA model:

  • Instantiates LdaModel from gensim using the corpus and dictionary.
  • Parameters like num_topics, passes and iterations control how many topics to find and how thoroughly to search for them.
  • print(lda_model.print_topics()): Outputs the top words and their weights for each detected topic.
Python
if doc_term_matrix:
    LDA = gensim.models.ldamodel.LdaModel
    lda_model = LDA(
        corpus=doc_term_matrix,
        id2word=dictionary,
        num_topics=10,
        random_state=100,
        chunksize=1000,
        passes=50,
        iterations=100
    )
    print(lda_model.print_topics())
else:
    print("Document term matrix is empty, cannot build LDA model.")

Output:

[Screenshot: top words and their weights for each of the 10 topics printed by lda_model.print_topics()]

Step 6: Model Evaluation

We evaluate the fitted model:

  • lda_model.log_perplexity(...): Measures how well the model fits the data (lower is better for perplexity).
  • CoherenceModel(...): Calculates topic coherence, indicating the interpretability and meaningfulness of the topics (higher is better).
Python
total_docs = len(doc_term_matrix)
if total_docs > 0:
    print('\nPerplexity:', lda_model.log_perplexity(
        doc_term_matrix, total_docs=total_docs))
    coherence_model_lda = CoherenceModel(
        model=lda_model,
        texts=tokenized_reviews,
        dictionary=dictionary,
        coherence='c_v'
    )
    coherence_lda = coherence_model_lda.get_coherence()
    print('Coherence:', coherence_lda)
else:
    print("No documents to evaluate coherence or perplexity.")

Output:

Perplexity: -5.0528945582253595
Coherence: 0.48202029896063986
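Because num_topics must be chosen in advance, a common approach is to fit models for several candidate topic counts and compare their coherence scores. A rough sketch, reusing the variables defined above (this can be slow on larger corpora):

Python
if total_docs > 0:
    # Compare coherence for a few candidate topic counts
    for k in [2, 5, 10]:
        candidate = LDA(corpus=doc_term_matrix, id2word=dictionary,
                        num_topics=k, random_state=100, passes=10)
        cm = CoherenceModel(model=candidate, texts=tokenized_reviews,
                            dictionary=dictionary, coherence='c_v')
        print(k, "topics -> coherence:", round(cm.get_coherence(), 3))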

Step 7: Visualize

  • pyLDAvis.gensim_models.prepare(...): Prepares topic and term distributions for visualization using LDA results.
  • pyLDAvis.enable_notebook(): Ensures the visualization will display interactively in Colab/Jupyter.
  • vis_data: Contains the topic map and term-relevance charts for interactive exploration.
Python
if total_docs > 0:
    pyLDAvis.enable_notebook()
    vis_data = gensimvis.prepare(lda_model, doc_term_matrix, dictionary)
    vis_data
    pyLDAvis.save_html(vis_data, 'lda_visualization.html')
else:
    print("No documents for visualization.")

Output:

[Screenshot: interactive pyLDAvis topic map with term-relevance bar chart]

The visualization is also saved to lda_visualization.html, which can be downloaded and opened in a browser.

Applications of LDA

Let's look at a few applications of LDA:

  • Document Clustering: LDA is widely used to automatically organize large collections of text (news, reviews, research papers) into thematic groups; a sketch of dominant-topic assignment follows this list.
  • Recommendation Systems: By identifying document/topic overlap, LDA can recommend articles, books, products or videos with similar themes.
  • Content Summarization: LDA helps summarize corpora by surfacing prominent topics and representative terms.
  • Information Retrieval: LDA-powered search systems can find and rank documents based on topic relevance, beyond simple keyword matches.
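As a concrete sketch of the document clustering use case, each review can be assigned to its single most probable topic using the lda_model and doc_term_matrix built earlier (the column name dominant_topic is just illustrative):

Python
# Assign each review to its most probable topic (clustering by dominant topic)
dominant_topics = []
for bow in doc_term_matrix:
    topic_probs = lda_model.get_document_topics(bow)
    best_topic = max(topic_probs, key=lambda tp: tp[1])[0] if topic_probs else None
    dominant_topics.append(best_topic)

yelp_review['dominant_topic'] = dominant_topics
print(yelp_review[['text', 'dominant_topic']].head())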

Advantages

  • Interpretable Output: Each topic is a clear distribution over words; users can read topics and see representative terms.
  • Handles Large Data: Efficient and scalable to large datasets and corpora.
  • Flexible: Can be applied to various domains (text, genetics, images, etc.) and supports extension to dynamic, hierarchical or correlated topic models.
  • Improves Personalized Recommendations: By modeling user preferences as distributions over topics, personalization is improved.

Limitations

  • Bag-of-Words Assumption: Ignores word order and syntactic structure, which can lose semantic nuance.
  • Topic Interpretability: Topics may be hard to interpret or too broad/narrow, especially with noisy or small datasets.
  • Requires Pre-Specifying Number of Topics: The num_topics parameter must be set manually and chosen carefully for each dataset.
  • Sensitivity to Preprocessing: Performance and quality are heavily influenced by text cleaning, stopword removal and lemmatization choices.
