This project is my implementation of a task assigned during the Computation Methods course at AGH UST.
Lab report (with search examples)
The project's goal is to develop a search engine utilizing Natural Language Processing (NLP) techniques to efficiently retrieve relevant documents from a large corpus of text data. Singular Value Decomposition (SVD) is employed for low-rank approximation of the term-by-document matrix to enhance information retrieval and reduce noise.
There were several key steps during the development of this project:
- Data collection
- Data preprocessing
- Data extraction/indexing/vectorization
- Calculating the term-by-document matrix and its low-rank approximation
- Querying the matrix
To crawl some articles:

```shell
python app/engine/data_generation/wikipedia_crawler_2.py
```

To process the articles:

```shell
python app/engine/data_processing/MainDataProcessor.py
```

To run the console-based search engine:

```shell
python app/engine/search_engine.py
```

To run the web-based search engine:

```shell
python -m flask --app app.backend run
```

The application should then be available at http://localhost:5000
demo.mp4
The project is split into three parts; the app directory contains:
- frontend - contains the react based user interface
- backend - contains the flask server
- engine - contains the core logic of the search engine
The engine directory is the most important one, as it contains the core logic of the search engine.
Inside the data_generation module there are two Wikipedia crawlers: the first was a slow prototype written to get familiar with the wikipedia library; the second is faster thanks to multithreading.
The crawler works by visiting one article and then following every link it finds on that article, until the maximum depth is reached. The raw text of each article is saved into a folder.
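The crawling strategy described above is essentially a breadth-first traversal of the article link graph up to a depth limit. A minimal sketch of that logic (the `crawl` and `get_links` names are illustrative, not the project's actual API; the real crawler also fetches and saves each article's raw text, and parallelizes the fetches):

```python
from collections import deque

def crawl(start, get_links, max_depth):
    """Breadth-first crawl: visit the start article, then follow every
    link found on each visited article until max_depth is reached."""
    visited = {start}
    frontier = deque([(start, 0)])
    order = [start]
    while frontier:
        title, depth = frontier.popleft()
        if depth >= max_depth:
            continue  # do not expand links beyond the depth limit
        for link in get_links(title):
            if link not in visited:
                visited.add(link)
                order.append(link)
                frontier.append((link, depth + 1))
    return order
```

In the real crawler, `get_links` would call the wikipedia library for a page's outgoing links; here it can be any function mapping a title to a list of titles.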
The data_processing module comprises scripts for processing downloaded Wikipedia articles.
Responsible for sanitizing and processing raw article text: it removes stop words and punctuation and stems the remaining words. The processed text is then saved to disk.
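The sanitizing step can be sketched as below. This is a toy version: the stop-word list is a tiny illustrative sample rather than the full list the project uses, and the stemmer is passed in as an optional callable (in practice something like a Porter stemmer would be supplied):

```python
import string

# Toy stop-word list for illustration; the real pipeline uses a full list.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def preprocess(text, stemmer=None):
    """Lowercase the text, strip punctuation, drop stop words,
    and optionally stem the remaining tokens."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    if stemmer is not None:
        tokens = [stemmer(t) for t in tokens]
    return tokens
```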
Generates the dictionary of words to be indexed by the engine. It collects every word from the processed articles and removes uncommon words (those occurring fewer than 15 times).
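Dictionary generation amounts to counting word occurrences across all processed articles and keeping only the frequent ones. A minimal sketch (function name and the word-to-index mapping shape are assumptions, not the project's exact format):

```python
from collections import Counter

def build_dictionary(processed_articles, min_count=15):
    """Count word occurrences across all articles and map each
    sufficiently common word to a column index."""
    counts = Counter()
    for tokens in processed_articles:
        counts.update(tokens)
    # Keep only words occurring at least min_count times; sort for stable indices.
    words = sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate(words)}
```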
Creates an array of article names to map engine results to article names.
Creates the term-by-document matrix using the generated dictionary. It iterates over the processed articles, creating matrix rows from word occurrence counts. Columns are then multiplied by the Inverse Document Frequency (IDF) of each word to reduce the significance of common words. Finally, the matrix rows are normalized and the matrix is saved to disk.
Performs low-rank approximation of the term-by-document matrix using the SVD algorithm.
Due to RAM limitations on my PC, the result is saved as the (U * S) matrix and the Vh matrix. It would be better to fully compute and store the U * S * Vh product, but my PC couldn't handle that much data.
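The truncation-and-factored-storage idea can be sketched with NumPy's dense SVD (the real project presumably uses a truncated/sparse SVD routine for large matrices; the function name is illustrative). Keeping only the top k singular values and storing the two factors separately avoids materializing the full approximated matrix:

```python
import numpy as np

def low_rank(M, k):
    """Return (U*S, Vh) truncated to rank k.
    Their product (U*S) @ Vh is the best rank-k approximation of M."""
    U, S, Vh = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] * S[:k], Vh[:k, :]
```

Storing `U*S` and `Vh` separately costs O(k * (rows + cols)) memory instead of O(rows * cols) for the full product.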
Aggregates all processing scripts, executing each to process, index, and vectorize the data.
Utilizes the generated term-by-document matrix, dictionary, and article lookup map to query articles. The user-provided query (Q) is normalized, and M * Q is computed, where M represents the term-by-document matrix. The resulting vector contains cosine similarity values for each article. By default, the engine returns the top 10 results with the highest similarity scores.
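The query step described above can be sketched as follows (names like `search` and the `(title, score)` result shape are assumptions; the real engine would also apply the same preprocessing to the query as to the articles):

```python
import numpy as np

def search(M, dictionary, article_names, query, top_k=10):
    """Build a normalized bag-of-words vector Q from the query and rank
    articles by M @ Q, which gives cosine similarity for normalized rows."""
    q = np.zeros(len(dictionary))
    for word in query.lower().split():
        if word in dictionary:
            q[dictionary[word]] += 1
    norm = np.linalg.norm(q)
    if norm == 0:
        return []          # no known words in the query
    q /= norm
    sims = M @ q           # one similarity score per article
    best = np.argsort(sims)[::-1][:top_k]
    return [(article_names[i], float(sims[i])) for i in best]
```

With the factored SVD output, `M @ q` would instead be computed as `(U*S) @ (Vh @ q)` to stay in the low-rank representation.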