Clustering Text Documents using K-Means in Scikit-Learn
Clustering text documents is a common problem in Natural Language Processing (NLP) where similar documents are grouped based on their content. K-Means clustering is a popular clustering technique used for this purpose. In this article we'll learn how to perform text document clustering using the K-Means algorithm in Scikit-Learn.
Implementation using Python
In this project we cluster news headlines taken from a sarcasm detection dataset. Sarcasm can make a sentence mean the opposite of what it literally says, which can confuse systems that analyze sentiment.
Step 1: Import Necessary Libraries
We need a few Python libraries for this task: NumPy, pandas, requests, Matplotlib and scikit-learn.
import json
import numpy as np
import pandas as pd
import requests
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
Step 2: Load the Dataset
Now let's load the dataset of sarcasm headlines. We download it with requests.get(url). The .json() method parses the JSON response body into Python objects. Then we create a pandas DataFrame df to make the data easier to work with.
url = "https://raw.githubusercontent.com/PawanKrGunjan/Natural-Language-Processing/main/Sarcasm%20Detection/sarcasm.json"
response = requests.get(url)
data = response.json()
df = pd.DataFrame(data)
Step 3: Convert Text to Numeric Representation using TF-IDF
We need to convert the text data into a format that the K-Means algorithm can understand (numbers). We use TF-IDF for this.
- TfidfVectorizer converts text into a numeric format.
- stop_words='english' removes common words like "the", "and" that don't add much meaning.
- fit_transform(sentence) learns the vocabulary and returns a sparse TF-IDF matrix where each row represents a document, each column corresponds to a word in the vocabulary, and each value is that word’s weight in that document.
sentence = df['headline']
vectorizer = TfidfVectorizer(stop_words='english')
vectorized_documents = vectorizer.fit_transform(sentence)
Step 4: Reduce Dimensionality using PCA
Since TF-IDF produces a high-dimensional sparse matrix, we reduce it to two dimensions to make it easier to visualize.
- .toarray() converts the sparse TF-IDF matrix to a dense array before PCA, which can use a lot of memory when the vocabulary is large.
- PCA(n_components=2) reduces it to 2 dimensions so we can plot it.
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(vectorized_documents.toarray())
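If the dense array is too large for memory, one possible alternative is TruncatedSVD, which accepts the sparse matrix directly. This is a swapped-in technique, not the method used in this article: unlike PCA it does not center the data, so the projection will differ. The sketch below uses made-up headlines, and the names toy_headlines, toy_matrix and toy_reduced are hypothetical.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy headlines, used only for illustration
toy_headlines = [
    "stock markets rise after rate cut",
    "stock markets fall on rate fears",
    "team wins football championship final",
    "football team loses championship final",
    "scientists discover water on mars",
]
toy_matrix = TfidfVectorizer(stop_words='english').fit_transform(toy_headlines)

# TruncatedSVD works on the sparse matrix, so no .toarray() call is needed
svd = TruncatedSVD(n_components=2, random_state=42)
toy_reduced = svd.fit_transform(toy_matrix)

print(toy_reduced.shape)              # (number of documents, 2)
print(svd.explained_variance_ratio_)  # share of variance kept per component
```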
Step 5: Applying K-Means Clustering
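We will now apply the K-Means algorithm to group the headlines into two clusters. K-Means is unsupervised: it never sees the sarcastic or not sarcastic labels, so the clusters it finds are not guaranteed to match them.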
- KMeans(n_clusters=2): We choose 2 clusters since the dataset has headlines labeled as either sarcastic or not sarcastic.
- n_init=5: Runs K-Means 5 times with different initial centroids and keeps the run with the lowest inertia.
- max_iter=500: Allows up to 500 iterations per run; the algorithm stops earlier if it converges.
- random_state=42: Fixes the random seed so that results are reproducible.
num_clusters = 2
kmeans = KMeans(n_clusters=num_clusters, n_init=5, max_iter=500, random_state=42)
kmeans.fit(vectorized_documents)
Step 6: Storing Clustering Results
After clustering we store the results in a DataFrame for easy viewing.
- kmeans.labels_ contains the cluster label for each headline (0 or 1).
- We print 5 random samples of the results to check the clustering.
results = pd.DataFrame()
results['document'] = sentence
results['cluster'] = kmeans.labels_
print(results.sample(5))
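Because cluster IDs are arbitrary, it helps to compare them with the known labels before interpreting them. The sketch below uses hypothetical values; on the real data the two inputs would presumably be df['is_sarcastic'] and kmeans.labels_, where the column name is_sarcastic is an assumption about the dataset and is not shown in this article.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical values for illustration only
true_labels = [0, 0, 1, 1, 1, 0]
cluster_ids = [1, 1, 0, 0, 1, 1]

# Rows are true labels, columns are cluster IDs, cells are headline counts
table = pd.crosstab(pd.Series(true_labels, name='is_sarcastic'),
                    pd.Series(cluster_ids, name='cluster'))
print(table)

# 1.0 means identical groupings (up to renaming the clusters);
# values near 0 mean agreement no better than chance
print(adjusted_rand_score(true_labels, cluster_ids))
```

A table with most counts on one diagonal and a score well above 0 would suggest the clusters track the labels; a flat table would suggest they capture something else, such as topic.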
Step 7: Visualizing Clusters
Finally we visualize the clustered headlines in a scatter plot.
- We use plt.scatter to plot the data points.
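- Each cluster is drawn in its own color: red for cluster 0 (labeled Not Sarcastic) and green for cluster 1 (labeled Sarcastic). K-Means assigns cluster IDs arbitrarily, so these names are nominal.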
- The scatter plot shows how K-Means has grouped the headlines.
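colors = ['red', 'green']
# Nominal legend names: K-Means cluster IDs are arbitrary and may not match the true labels
cluster_labels = ['Not Sarcastic', 'Sarcastic']
for i in range(num_clusters):
    # Plot the 2-D PCA coordinates of the headlines assigned to cluster i
    plt.scatter(reduced_data[kmeans.labels_ == i, 0],
                reduced_data[kmeans.labels_ == i, 1],
                s=10, color=colors[i],
                label=cluster_labels[i])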
plt.legend()
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('K-Means Clustering of Sarcasm Headlines')
plt.show()
The scatter plot shows the two K-Means clusters projected onto the first two PCA components. Red points belong to cluster 0 and green points to cluster 1. The legend names them Not Sarcastic and Sarcastic, but K-Means is unsupervised and assigns cluster IDs arbitrarily, so this mapping should be checked against the dataset's labels rather than assumed. Even so, the example shows how TF-IDF and K-Means can be combined to group text documents with scikit-learn.