ArXiv GenAI Application

This application fetches computer science (AI) paper abstracts from arXiv for a given date, stores them, generates daily digests, creates embeddings for semantic search, and provides a RAG-based chatbot to query the papers.

Important note: the date must be within the last 90 days.
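The 90-day window can be checked before any request is made. The following is a minimal sketch, not the project's actual validation: the function name `is_fetchable` and the choice to treat the boundary as inclusive are assumptions made for illustration.

```python
from datetime import date, timedelta
from typing import Optional

MAX_AGE_DAYS = 90  # window stated in the note above


def is_fetchable(target: date, today: Optional[date] = None) -> bool:
    """Return True if `target` is not in the future and at most MAX_AGE_DAYS old."""
    today = today or date.today()
    # Assumption: a date exactly MAX_AGE_DAYS old is still accepted.
    return timedelta(0) <= (today - target) <= timedelta(days=MAX_AGE_DAYS)
```

Passing `today` explicitly keeps the check deterministic in tests; in normal use it falls back to the current date.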

Recent Enhancements

Separate Scraping and Embedding: New CLI commands to separate paper scraping from embedding generation
Date Range Support: Ability to scrape papers from a range of dates in a single command
Rate-Limited Embedding: Configurable rate limiting for API requests to respect Google's embedding API limits
Improved Subject Handling: Papers now store subject categories as proper lists for better filtering and categorization
Document Optimization: Vector store documents now include only title and abstract, with authors and subjects stored as metadata
Immediate Vector Upsert: Each paper is immediately added to the vector store after its embedding is generated, preventing memory issues with large batches
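
The rate-limited embedding and immediate upsert described above could be combined roughly as follows. This is a sketch under stated assumptions rather than the code in core_workflow.py: `embed_fn` and `upsert_fn` are hypothetical callables standing in for the Google embedding call and the ChromaDB write, and the limit is assumed to be expressed in requests per minute.

```python
import time
from typing import Callable, Iterable, List


def embed_and_upsert(
    papers: Iterable[dict],
    embed_fn: Callable[[str], List[float]],        # hypothetical embedding call
    upsert_fn: Callable[[str, List[float]], None],  # hypothetical vector-store write
    requests_per_minute: int = 100,                 # assumed unit for the rate limit
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Embed papers one at a time, pausing so calls stay under the rate limit."""
    min_interval = 60.0 / requests_per_minute
    last_call = None
    processed = 0
    for paper in papers:
        if last_call is not None:
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        # Only title and abstract go into the embedded text.
        vector = embed_fn(f"{paper['title']}\n\n{paper['abstract']}")
        # Upsert immediately so no large batch of vectors accumulates in memory.
        upsert_fn(paper["id"], vector)
        processed += 1
    return processed
```

Injecting `sleep` and `clock` is a design choice that lets the pacing logic be tested without real delays.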

Features

Fetch daily arXiv cs.AI papers
Store paper metadata (title, abstract, subjects, links, ID) in a local SQLite database
Generate a daily digest of paper topics using Google's Generative AI
Create embeddings for paper titles and abstracts using Google's text embedding model
Store embeddings in a local vector database (ChromaDB)
Prevent reprocessing of already processed papers
Provide a RAG (Retrieval Augmented Generation) chatbot to query papers using natural language
Command-line interface for all operations
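
One way to prevent reprocessing is to key stored papers on their arXiv ID and let the database ignore duplicates. The sketch below uses the standard-library `sqlite3` module as a stand-in; the project itself uses SQLModel, and the table name, column names, and function name here are assumptions chosen for illustration.

```python
import sqlite3


def save_new_papers(conn: sqlite3.Connection, papers: list) -> int:
    """Insert papers, skipping any arXiv ID already stored; return rows added."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS paper ("
        "arxiv_id TEXT PRIMARY KEY, title TEXT, abstract TEXT)"
    )
    before = conn.total_changes
    # INSERT OR IGNORE silently skips rows whose primary key already exists.
    conn.executemany(
        "INSERT OR IGNORE INTO paper (arxiv_id, title, abstract) VALUES (?, ?, ?)",
        [(p["arxiv_id"], p["title"], p["abstract"]) for p in papers],
    )
    conn.commit()
    return conn.total_changes - before
```

Running the same batch twice adds rows only the first time, so a fetch for an already processed date becomes a no-op at the storage layer.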

Setup

Prerequisites

Python 3.9+
Poetry (dependency management)
Google API Key for Gemini models

Installation

Clone the repository

git clone https://github.com/yourusername/arxiv-genai-chat.git
cd arxiv-genai-chat

Install dependencies

poetry install

Create a .env file in the project root with your Google API key:

GOOGLE_API_KEY=your_api_key_here
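
A missing key is easier to diagnose if it fails early with a clear message. The helper below is a hypothetical sketch, not the contents of config.py; it assumes the .env file has already been loaded into the process environment by whatever loader the project uses.

```python
import os
from typing import Mapping, Optional


def get_google_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return GOOGLE_API_KEY from the environment or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get("GOOGLE_API_KEY", "").strip()
    if not key:
        raise RuntimeError("GOOGLE_API_KEY is not set; add it to .env or export it.")
    return key
```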

Usage

Fetch Daily Papers

# Fetch papers from yesterday (default)
poetry run arxiv-app fetch-daily

# Fetch papers for a specific date
poetry run arxiv-app fetch-daily --date 2025-05-08

# Fetch papers for a specific category
poetry run arxiv-app fetch-daily --category cs.CL

Scrape Papers Only (Without Embedding)

# Scrape papers from yesterday (default) without generating embeddings
poetry run arxiv-app scrape-only

# Scrape papers for a specific date
poetry run arxiv-app scrape-only --date 2025-05-08

# Scrape papers for a date range
poetry run arxiv-app scrape-only --start-date 2025-05-01 --end-date 2025-05-08

# Scrape papers for a specific category
poetry run arxiv-app scrape-only --category cs.CL
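
A date range such as the --start-date/--end-date example above has to be expanded into individual days before each one is scraped. A minimal sketch of that expansion, assuming both endpoints are included (the helper name `iter_dates` is hypothetical):

```python
from datetime import date, timedelta
from typing import Iterator


def iter_dates(start: date, end: date) -> Iterator[date]:
    """Yield each day from start to end, inclusive."""
    if end < start:
        raise ValueError("end date must not be before start date")
    for offset in range((end - start).days + 1):
        yield start + timedelta(days=offset)
```

Under this assumption, the range 2025-05-01 to 2025-05-08 covers eight days.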

Generate Embeddings Only

# Generate embeddings for all papers in the database that don't have them
poetry run arxiv-app embed-only

# Generate embeddings for papers from a specific date
poetry run arxiv-app embed-only --date 2025-05-08

# Control the rate limiting for the embedding API
poetry run arxiv-app embed-only --rate-limit 100

# Limit the number of papers to process (useful for testing)
poetry run arxiv-app embed-only --limit 10

Query Vectors

Search for papers semantically related to your query:

poetry run arxiv-app query-vectors "machine learning for natural language processing" --n-results 10
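
Conceptually, a semantic query embeds the search text and ranks stored papers by vector similarity. The pure-Python sketch below illustrates the idea with cosine similarity and brute-force ranking; it is not how ChromaDB is implemented, which uses its own index and configured distance metric, and the function names are made up for this example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_n(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, List[float]],
    n: int = 10,
) -> List[Tuple[str, float]]:
    """Return the n document IDs most similar to the query, best first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:n]
```

Here `n` plays the same role as the --n-results option: it caps how many matches are returned.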

Interactive Chat

Start an interactive chat session about papers in your database:

poetry run arxiv-app chat
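
In a RAG chat turn, the papers retrieved for the user's question are placed into the prompt so the model answers from them. The sketch below shows one plausible way to assemble such a prompt; the wording, the `max_papers` cap, and the function name are assumptions, and the project's actual prompt in llm_services.py may differ.

```python
from typing import List


def build_rag_prompt(question: str, papers: List[dict], max_papers: int = 5) -> str:
    """Assemble a prompt that asks the model to answer only from retrieved papers."""
    context_blocks = [
        f"[{i}] {p['title']}\n{p['abstract']}"
        for i, p in enumerate(papers[:max_papers], start=1)
    ]
    context = "\n\n".join(context_blocks) if context_blocks else "(no papers retrieved)"
    return (
        "Answer the question using only the papers below. "
        "Cite papers by their bracketed number.\n\n"
        f"Papers:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Numbering the context blocks gives the model a simple way to point back at specific papers in its answer.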

Architecture

The application consists of several modules:

arxiv_scraper.py: Handles fetching and parsing papers from arXiv
database.py: Database engine, session management, and CRUD operations (table models are defined in models_db.py)
vector_store.py: ChromaDB integration for vector storage
llm_services.py: Google Gemini integration for embeddings and text generation
core_workflow.py: Business logic for paper processing
main.py: CLI interface using Typer

Limitations and Future Work

ArXiv API Constraints:
- The arXiv catchup endpoint appears to restrict which dates it accepts and in what format; see the 90-day note at the top of this document
- Testing so far has confirmed successful operation with the date 2025-05-08
- Future work should explore alternative APIs or methods for accessing the arXiv database

Vector Store Improvements:
- The current implementation uses a local ChromaDB instance
- Future versions could support remote ChromaDB servers or alternative vector databases

Embedding Model Optimization:
- The current embedding dimension is 768, which could be tuned for specific use cases
- Future work could include dimensionality reduction or model selection options

LLM Response Format:
- The current Google Gemini integration requires specific response parsing logic
- This logic may need updates as the Google API evolves
License

MIT License

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Project Structure

.
├── pyproject.toml      # Poetry project configuration
├── README.md           # This file
├── .env.example        # Example environment file
├── .gitignore
├── src/
│   └── arxiv_genai_chat/
│       ├── __init__.py
│       ├── main.py             # Typer CLI application
│       ├── __main__.py         # Entry point for python -m arxiv_genai_chat
│       ├── config.py           # Settings, env loading, logging
│       ├── models_db.py        # SQLModel definitions
│       ├── database.py         # SQLModel engine, session, CRUD
│       ├── arxiv_scraper.py    # ArXiv HTML fetching and parsing
│       ├── llm_services.py     # Google GenAI interactions
│       ├── vector_store.py     # ChromaDB client and operations
│       └── core_workflow.py    # Main application logic orchestration
├── tests/
│   └── arxiv_genai_chat/
│       ├── __init__.py
│       ├── test_config.py
│       ├── test_models_db.py
│       ├── test_database.py
│       ├── test_arxiv_scraper.py
│       ├── test_llm_services.py
│       ├── test_vector_store.py
│       ├── test_core_workflow.py
│       └── test_cli.py
├── llm_docs/               # Markdown notes for libraries and plan
├── vector_store_data/      # For ChromaDB persistent storage (gitignored)
└── db_data/                # For SQLite DB file (gitignored)

Running Tests

The project uses pytest for testing.

Make sure you have installed the development dependencies (which poetry install does by default).

Run tests from the project root:

poetry run pytest

To include coverage reports:

poetry run pytest --cov=arxiv_genai_chat tests/
# To generate an HTML coverage report:
# poetry run pytest --cov=arxiv_genai_chat --cov-report=html tests/
# Then open htmlcov/index.html in your browser.

Troubleshooting

Common Issues

ModuleNotFoundError: No module named 'src.main'

This error occurs if the Poetry script in pyproject.toml is incorrectly pointing to "src.main" instead of "arxiv_genai_chat.main". Ensure your pyproject.toml contains:

```
[tool.poetry.scripts]
arxiv-app = "arxiv_genai_chat.main:app"
```

ChromaDB File Lock Issues (Windows)

If you encounter file permission errors with ChromaDB during testing (particularly on Windows), it may be due to SQLite database files not being properly released. Solutions include:
- Add proper garbage collection in your tests: del client followed by gc.collect()
- Use time.sleep(1) after cleanup to allow resources to be fully released
- Implement retry mechanisms with multiple attempts for file operations
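
The three suggestions above can be combined into one cleanup helper for test teardown. This is a minimal sketch; the function name and default values are assumptions, and the caller is expected to `del` its own ChromaDB client reference before calling it.

```python
import gc
import shutil
import time
from pathlib import Path


def remove_dir_with_retry(path: Path, attempts: int = 5, delay: float = 1.0) -> bool:
    """Try to delete a directory, retrying when the OS still holds file locks."""
    for attempt in range(attempts):
        gc.collect()  # release lingering objects that may still hold file handles
        try:
            shutil.rmtree(path)
            return True
        except FileNotFoundError:
            return True  # already gone, nothing to clean up
        except PermissionError:
            # Typical on Windows while the SQLite file is still locked.
            if attempt < attempts - 1:
                time.sleep(delay)
    return False
```

Returning a boolean instead of raising lets a teardown step log a warning and continue when cleanup ultimately fails.
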
Test Mock Assertion Failures

When writing tests with mocks, ensure your mock assertions match the actual parameter names used in function calls. For example, if a function is called with named parameters in the code:
```
get_or_create_arxiv_collection(collection_name="name", client=client_obj)
```
Your test assertion should match this format:
```
mock_function.assert_called_once_with(collection_name="name", client=mock_client)
```
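
The mismatch can be reproduced in isolation with a plain `MagicMock`. In this sketch the mock names mirror the example above, and the `positional_matched` flag exists only to show the outcome.

```python
from unittest.mock import MagicMock

mock_get_collection = MagicMock(name="get_or_create_arxiv_collection")
mock_client = MagicMock(name="client")

# The code under test calls the function with keyword arguments.
mock_get_collection(collection_name="name", client=mock_client)

# Passes: the assertion uses the same keywords as the real call.
mock_get_collection.assert_called_once_with(collection_name="name", client=mock_client)

# Fails: a plain mock records positional and keyword arguments separately,
# so the same values passed positionally do not match.
try:
    mock_get_collection.assert_called_once_with("name", mock_client)
    positional_matched = True
except AssertionError:
    positional_matched = False
```
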
Typer Version Compatibility Issues

The latest versions of Typer (0.12+) may have compatibility issues with certain parameter configurations. If you encounter errors like "Parameter.make_metavar() missing 1 required positional argument: 'ctx'", you can:
- Downgrade to a more stable version: pip install typer==0.9.0
- Update the project's dependencies in pyproject.toml:
```
[tool.poetry.dependencies]
typer = "^0.9.0"
```
- Run poetry update to apply the dependency changes

Repository: RLinnae/Arxiv-Genai-Chat
