This application fetches computer science (AI) paper abstracts from arXiv for a given date, stores them, generates daily digests, creates embeddings for semantic search, and provides a RAG-based chatbot to query the papers.
Important note: the date must be within the last 90 days.
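The 90-day window can be checked before any request is sent. The following is a minimal sketch, not the project's own validation: the names `MAX_AGE_DAYS` and `is_fetchable` are hypothetical, and treating day 90 as still valid is an assumption.

```python
from datetime import date, timedelta
from typing import Optional

# Window stated in the note above; the constant name is hypothetical.
MAX_AGE_DAYS = 90


def is_fetchable(target: date, today: Optional[date] = None) -> bool:
    """Return True if target is not in the future and at most MAX_AGE_DAYS old."""
    today = today or date.today()
    age = today - target
    # Assumption: the boundary day (exactly 90 days old) is still accepted.
    return timedelta(0) <= age <= timedelta(days=MAX_AGE_DAYS)
```

Passing `today` explicitly keeps the check deterministic in tests; when it is omitted, the current local date is used.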
- Separate Scraping and Embedding: New CLI commands to separate paper scraping from embedding generation
- Date Range Support: Ability to scrape papers from a range of dates in a single command
- Rate-Limited Embedding: Configurable rate limiting for API requests to respect Google's embedding API limits
- Improved Subject Handling: Papers now store subject categories as proper lists for better filtering and categorization
- Document Optimization: Vector store documents now include only title and abstract, with authors and subjects stored as metadata
- Immediate Vector Upsert: Each paper is immediately added to the vector store after its embedding is generated, preventing memory issues with large batches
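The rate-limited embedding and immediate upsert described in the list above can be pictured roughly as follows. This is an illustrative sketch under stated assumptions, not the project's code: `embed_and_upsert`, `embed`, and `upsert` are hypothetical names, and interpreting the rate limit as requests per minute is an assumption.

```python
import time
from typing import Callable, Iterable, List, Tuple


def embed_and_upsert(
    papers: Iterable[Tuple[str, str]],           # (paper_id, title + abstract)
    embed: Callable[[str], List[float]],         # hypothetical embedding call
    upsert: Callable[[str, List[float]], None],  # hypothetical vector-store write
    requests_per_minute: int = 100,              # assumed unit for the rate limit
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Embed papers one at a time, spacing calls evenly and storing each result at once."""
    min_interval = 60.0 / requests_per_minute
    last_call = None
    count = 0
    for paper_id, text in papers:
        if last_call is not None:
            # Wait out whatever remains of the minimum gap between API calls.
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        vector = embed(text)
        upsert(paper_id, vector)  # written immediately, so no batch is held in memory
        count += 1
    return count
```

Injecting `clock` and `sleep` lets a test drive the loop with a fake clock instead of waiting in real time.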
- Fetch daily arXiv cs.AI papers
- Store paper metadata (title, abstract, subjects, links, ID) in a local SQLite database
- Generate a daily digest of paper topics using Google's Generative AI
- Create embeddings for paper titles and abstracts using Google's text embedding model
- Store embeddings in a local vector database (ChromaDB)
- Prevent reprocessing of already processed papers
- Provide a RAG (Retrieval Augmented Generation) chatbot to query papers using natural language
- Command-line interface for all operations
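The retrieval step behind semantic search and the RAG chatbot amounts to ranking stored paper embeddings by similarity to the query embedding. In the application ChromaDB performs this search; the pure-Python sketch below only illustrates the idea. The function names are hypothetical, and cosine similarity is an assumed metric, since the distance function the project configures is not stated here.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    paper_vecs: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k paper IDs most similar to the query, best match first."""
    scored = [(pid, cosine(query_vec, vec)) for pid, vec in paper_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]
```

In a RAG flow, the titles and abstracts of the returned papers would then be passed to the language model as context for answering the question.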
- Python 3.9+
- Poetry (dependency management)
- Google API Key for Gemini models
- Clone the repository
git clone https://github.com/yourusername/arxiv-genai-chat.git
cd arxiv-genai-chat
- Install dependencies
poetry install
- Create a .env file in the project root with your Google API key:
GOOGLE_API_KEY=your_api_key_here
# Fetch papers from yesterday (default)
poetry run arxiv-app fetch-daily
# Fetch papers for a specific date
poetry run arxiv-app fetch-daily --date 2025-05-08
# Fetch papers for a specific category
poetry run arxiv-app fetch-daily --category cs.CL
# Scrape papers from yesterday (default) without generating embeddings
poetry run arxiv-app scrape-only
# Scrape papers for a specific date
poetry run arxiv-app scrape-only --date 2025-05-08
# Scrape papers for a date range
poetry run arxiv-app scrape-only --start-date 2025-05-01 --end-date 2025-05-08
# Scrape papers for a specific category
poetry run arxiv-app scrape-only --category cs.CL
# Generate embeddings for all papers in the database that don't have them
poetry run arxiv-app embed-only
# Generate embeddings for papers from a specific date
poetry run arxiv-app embed-only --date 2025-05-08
# Control the rate limiting for the embedding API
poetry run arxiv-app embed-only --rate-limit 100
# Limit the number of papers to process (useful for testing)
poetry run arxiv-app embed-only --limit 10
Search for papers semantically related to your query:
poetry run arxiv-app query-vectors "machine learning for natural language processing" --n-results 10
Start an interactive chat session about papers in your database:
poetry run arxiv-app chat
The application consists of several modules:
- arxiv_scraper.py: Handles fetching and parsing papers from arXiv
- database.py: Database models and session management
- vector_store.py: ChromaDB integration for vector storage
- llm_services.py: Google Gemini integration for embeddings and text generation
- core_workflow.py: Business logic for paper processing
- main.py: CLI interface using Typer
- ArXiv API Constraints:
  - The arXiv catchup API appears to have date limitations or format restrictions
  - Current testing confirms successful operation with date 2025-05-08
  - Future work should explore alternative APIs or methods for accessing the arXiv database
- Vector Store Improvements:
  - The current implementation uses a local ChromaDB instance
  - Future versions could support remote ChromaDB servers or alternative vector databases
- Embedding Model Optimization:
  - Current embedding dimension is 768, which may be optimized for specific use cases
  - Future work could include dimensionality reduction or model selection options
- LLM Response Format:
  - The current Google Gemini integration requires specific response parsing logic
  - This may need updates as the Google API evolves
Contributions are welcome! Please feel free to submit a Pull Request.
.
├── pyproject.toml # Poetry project configuration
├── README.md # This file
├── .env.example # Example environment file
├── .gitignore
├── src/
│   └── arxiv_genai_chat/
│       ├── __init__.py
│       ├── main.py # Typer CLI application
│       ├── __main__.py # Entry point for python -m arxiv_genai_chat
│       ├── config.py # Settings, env loading, logging
│       ├── models_db.py # SQLModel definitions
│       ├── database.py # SQLModel engine, session, CRUD
│       ├── arxiv_scraper.py # ArXiv HTML fetching and parsing
│       ├── llm_services.py # Google GenAI interactions
│       ├── vector_store.py # ChromaDB client and operations
│       └── core_workflow.py # Main application logic orchestration
├── tests/
│   └── arxiv_genai_chat/
│       ├── __init__.py
│       ├── test_config.py
│       ├── test_models_db.py
│       ├── test_database.py
│       ├── test_arxiv_scraper.py
│       ├── test_llm_services.py
│       ├── test_vector_store.py
│       ├── test_core_workflow.py
│       └── test_cli.py
├── llm_docs/ # Markdown notes for libraries and plan
├── vector_store_data/ # For ChromaDB persistent storage (gitignored)
└── db_data/ # For SQLite DB file (gitignored)
The project uses pytest for testing.
- Make sure you have installed the development dependencies (which poetry install does by default).
- Run tests from the project root:
poetry run pytest
To include coverage reports:
poetry run pytest --cov=arxiv_genai_chat tests/
# To generate an HTML coverage report:
# poetry run pytest --cov=arxiv_genai_chat --cov-report=html tests/
# Then open htmlcov/index.html in your browser.
- ModuleNotFoundError: No module named 'src.main'
  This error occurs if the Poetry script in pyproject.toml is incorrectly pointing to "src.main" instead of "arxiv_genai_chat.main". Ensure your pyproject.toml contains:
  [tool.poetry.scripts]
  arxiv-app = "arxiv_genai_chat.main:app"
- ChromaDB File Lock Issues (Windows)
  If you encounter file permission errors with ChromaDB during testing (particularly on Windows), it may be due to SQLite database files not being properly released. Solutions include:
  - Add proper garbage collection in your tests: del client followed by gc.collect()
  - Use time.sleep(1) after cleanup to allow resources to be fully released
  - Implement retry mechanisms with multiple attempts for file operations
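The three suggestions above (garbage collection, a short pause, and retries) can be combined in one test helper. This is a minimal sketch using only the standard library; `remove_dir_with_retry` is a hypothetical name, and the attempt count and delay are assumed defaults rather than project settings.

```python
import gc
import shutil
import time
from pathlib import Path


def remove_dir_with_retry(path: Path, attempts: int = 5, delay: float = 1.0) -> bool:
    """Try to delete a directory, collecting garbage and pausing between attempts."""
    for attempt in range(attempts):
        gc.collect()  # encourage release of lingering file handles
        try:
            shutil.rmtree(path)
            return True
        except FileNotFoundError:
            return True  # nothing left to remove
        except PermissionError:
            # Typical symptom of a still-locked SQLite file on Windows.
            if attempt < attempts - 1:
                time.sleep(delay)
    return False
```

In a test teardown, the ChromaDB client reference would be dropped first (del client) and the helper then called on the temporary storage directory.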
- Test Mock Assertion Failures
  When writing tests with mocks, ensure your mock assertions match the actual parameter names used in function calls. For example, if a function is called with named parameters in the code:
  get_or_create_arxiv_collection(collection_name="name", client=client_obj)
  Your test assertion should match this format:
  mock_function.assert_called_once_with(collection_name="name", client=mock_client)
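The mismatch can be reproduced in isolation with the standard library's unittest.mock. The sketch below uses a plain MagicMock standing in for get_or_create_arxiv_collection; the variable names are illustrative only.

```python
from unittest.mock import MagicMock

mock_get_collection = MagicMock()
mock_client = MagicMock()

# The code under test calls the function with keyword arguments.
mock_get_collection(collection_name="name", client=mock_client)

# Passes: the assertion uses the same keyword form as the call.
mock_get_collection.assert_called_once_with(collection_name="name", client=mock_client)

# Fails: a plain MagicMock has no signature to consult, so positional and
# keyword forms of the same arguments are not treated as equivalent.
try:
    mock_get_collection.assert_called_once_with("name", mock_client)
    positional_matched = True
except AssertionError:
    positional_matched = False
```

Mocks created with a spec of the real function (for example via `create_autospec`) know its signature and can treat the two call forms as equivalent, which is one way to make such assertions less brittle.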
- Typer Version Compatibility Issues
  The latest versions of Typer (0.12+) may have compatibility issues with certain parameter configurations. If you encounter errors like "Parameter.make_metavar() missing 1 required positional argument: 'ctx'", you can:
  - Downgrade to a more stable version: pip install typer==0.9.0
  - Update the project's dependencies in pyproject.toml:
    [tool.poetry.dependencies]
    typer = "^0.9.0"
  - Run poetry update to apply the dependency changes