Document Statistics Analyzer

Overview

The Document Statistics Analyzer is a Python tool that extracts and analyzes linguistic and structural features from text documents. It supports PDF, DOCX, and TXT files and reports statistics at the word, sentence, paragraph, and document levels. Reports can be generated in Markdown, HTML, or PDF, with visualizations such as word clouds, histograms, bar charts, and sentiment heatmaps. Each report and its visualizations are bundled into a downloadable ZIP archive for easy sharing.

This project uses NLTK, spaCy, and TextBlob for natural language processing and sentiment analysis, and Gradio for the interactive user interface.
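
The ZIP bundling mentioned in the overview can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the helper names `bundle_report` and `demo_bundle`, the `images/` folder inside the archive, and the placeholder file names are all assumptions made for the example.

```python
import tempfile
import zipfile
from pathlib import Path


def bundle_report(report_path, image_paths, archive_path):
    """Write a report and its images into one ZIP archive (hypothetical helper)."""
    archive_path = Path(archive_path)
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # Store files by name only so the archive contains no absolute paths.
        zf.write(report_path, arcname=Path(report_path).name)
        for image in image_paths:
            # The images/ prefix is an assumed layout, not the project's.
            zf.write(image, arcname=f"images/{Path(image).name}")
    return archive_path


def demo_bundle():
    """Build a small archive from placeholder files and return its member names."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        report = tmp / "report.md"
        report.write_text("# Report\n", encoding="utf-8")
        image = tmp / "wordcloud.png"
        image.write_bytes(b"placeholder")  # stand-in bytes, not a real PNG
        archive = bundle_report(report, [image], tmp / "report.zip")
        with zipfile.ZipFile(archive) as zf:
            return sorted(zf.namelist())
```

Calling `demo_bundle()` writes a throwaway archive in a temporary directory and lists its members, which is a quick way to check the layout before wiring the helper to real report output.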

Features

File Support: Processes PDF, DOCX, and TXT files.
Word-Level Analysis:
- Tokenization and Part-of-Speech (POS) tagging
- TF-IDF keyword extraction
- N-gram frequency analysis
- Keyword co-occurrence
- Lexical diversity (Type-Token Ratio)
- Stop word ratio and average word length
Sentence-Level Analysis:
- Sentence count and length distribution
- Average sentence length
- Sentence-level sentiment analysis
Paragraph-Level Analysis:
- Paragraph count
- Sentence count per paragraph
- Average sentences per paragraph
Document-Level Analysis:
- Total word count
- Readability scores (Flesch Reading Ease, Flesch-Kincaid Grade)
- Named Entity Recognition (NER)
- Overall document sentiment
- Punctuation frequency
Visualizations:
- Word cloud
- Sentence and paragraph length histograms
- POS tag distribution bar chart
- Sentiment heatmap
Report Generation:
- Customizable reports in Markdown, HTML, or PDF
- Executive summary with key insights
- Embedded visualizations
- ZIP archive for report and images
User Interface:
- Interactive Gradio interface for file upload and report generation
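
Several of the word-level and sentence-level metrics listed above reduce to simple ratios. The sketch below shows how they could be computed; it is an assumption-laden illustration rather than the project's implementation. It uses regular-expression tokenization instead of NLTK or spaCy, a tiny hand-written stop-word list, and a hypothetical function name `basic_stats`.

```python
import re

# Tiny illustrative stop-word list (an assumption); a real run would use a
# fuller list such as the one shipped with NLTK.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def basic_stats(text):
    """Return a few word-level and sentence-level ratios for a text."""
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Naive word tokenization: runs of letters and apostrophes, lowercased.
    words = re.findall(r"[a-z']+", text.lower())
    if not words or not sentences:
        return {
            "type_token_ratio": 0.0,
            "stop_word_ratio": 0.0,
            "avg_word_length": 0.0,
            "avg_sentence_length": 0.0,
        }
    return {
        # Lexical diversity: unique words divided by total words.
        "type_token_ratio": len(set(words)) / len(words),
        # Share of tokens that appear in the stop-word list.
        "stop_word_ratio": sum(w in STOP_WORDS for w in words) / len(words),
        # Mean number of characters per word.
        "avg_word_length": sum(len(w) for w in words) / len(words),
        # Mean number of words per sentence.
        "avg_sentence_length": len(words) / len(sentences),
    }


stats = basic_stats("The cat sat on the mat. The dog barked.")
```

For the sample sentence pair, the text has 9 tokens, 7 distinct words, and 2 sentences, so the type-token ratio is 7/9 and the average sentence length is 4.5 words. A library tokenizer would handle abbreviations, numbers, and hyphenated words more carefully than these regular expressions do.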

dgholamian/document-statistics-analyzer
