ZitouniNidhal/GAC-RAG-Graph-Augmented-Code-RAG

🧠 GAC-RAG: Graph-Augmented Code RAG

A novel Retrieval-Augmented Generation mechanism for code assistants that uses dependency graph traversal rather than vector similarity alone, enabling multi-hop reasoning across codebases.

Python Neo4j ChromaDB License: MIT


🔍 The Problem with Standard RAG on Code

Standard RAG retrieves code chunks by cosine similarity alone. But code is a graph:

  • Functions call other functions
  • Classes inherit from other classes
  • Modules import other modules

When you ask "Why does UserService.save() fail when DB disconnects?", vanilla RAG retrieves only UserService.save() and misses DatabasePool, RetryHandler, and the config it depends on.
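To make the "code is a graph" point concrete, here is a minimal sketch that pulls the three edge types listed above (calls, imports, inherits) out of a Python snippet. It is only an illustration: it uses the standard-library ast module as a stand-in for tree-sitter, handles Python only, and the function name extract_edges and the sample module are hypothetical rather than part of this repo.

```python
import ast


def extract_edges(source: str, module: str = "example") -> set[tuple[str, str, str]]:
    """Return (source_node, edge_type, target_name) triples for one Python module.

    Assumption: names are left unresolved (e.g. "conn.write"); a real indexer
    would map them to fully qualified definitions.
    """
    edges: set[tuple[str, str, str]] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                edges.add((module, "imports", alias.name))
        elif isinstance(node, ast.ImportFrom) and node.module:
            edges.add((module, "imports", node.module))
        elif isinstance(node, ast.ClassDef):
            for base in node.bases:
                edges.add((node.name, "inherits", ast.unparse(base)))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Every call expression inside the function body becomes a "calls" edge.
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call):
                    edges.add((node.name, "calls", ast.unparse(inner.func)))
    return edges


# Hypothetical sample module, loosely modelled on the UserService example.
SAMPLE = '''
import db_pool

class UserService(BaseService):
    def save(self, user):
        conn = db_pool.get_connection()
        return conn.write(user)
'''

edges = extract_edges(SAMPLE, module="user_service")
for edge in sorted(edges):
    print(edge)
```

A similarity search over the text of save() sees none of these edges, which is why the connection pool and base class never reach the prompt.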

GAC-RAG solves this with 3-layer retrieval:

Query → [Semantic Anchor] → [Graph Expansion] → [LLM Reranker] → Answer

πŸ—οΈ Architecture

┌─────────────────────────────────────────────────────────┐
│                        GAC-RAG                          │
│                                                         │
│  ┌──────────┐    ┌──────────────┐    ┌──────────────┐   │
│  │ Layer 1  │    │   Layer 2    │    │   Layer 3    │   │
│  │ Semantic │───▶│   Graph      │───▶│    LLM       │   │
│  │  Anchor  │    │  Expansion   │    │  Reranker    │   │
│  │(ChromaDB)│    │  (Neo4j)     │    │  (Claude)    │   │
│  └──────────┘    └──────────────┘    └──────────────┘   │
│       │                │                    │           │
│  Vector search   Hop traversal         Prune noise      │
│  top-k nodes    calls/imports/         keep relevant    │
│                 inherits edges         nodes only       │
└─────────────────────────────────────────────────────────┘
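
The three layers compose as a simple function pipeline. The sketch below shows only that composition, with each layer injected as a callable so it can be swapped or tested in isolation. The type aliases, the retrieve function, the toy stand-ins, and the Logger.debug node are all hypothetical; the real layers in src/retriever.py would be backed by ChromaDB, Neo4j, and Claude and may be structured differently.

```python
from typing import Callable, Iterable

# Assumed layer signatures, for illustration only.
Anchor = Callable[[str], list[str]]             # query -> seed node ids
Expand = Callable[[Iterable[str]], list[str]]   # seeds -> seeds plus graph neighbours
Rerank = Callable[[str, list[str]], list[str]]  # (query, candidates) -> kept nodes


def retrieve(query: str, anchor: Anchor, expand: Expand, rerank: Rerank) -> list[str]:
    """Run Layer 1 -> Layer 2 -> Layer 3 and return the final context nodes."""
    seeds = anchor(query)
    candidates = expand(seeds)
    return rerank(query, candidates)


# Toy stand-ins so the sketch runs without any external service.
TOY_GRAPH = {"PaymentService.process": ["DatabasePool.getConnection", "Logger.debug"]}


def toy_anchor(query: str) -> list[str]:
    return ["PaymentService.process"]  # pretend this was the top vector hit


def toy_expand(seeds: Iterable[str]) -> list[str]:
    out = list(seeds)
    for seed in list(out):
        out += [n for n in TOY_GRAPH.get(seed, []) if n not in out]
    return out


def toy_rerank(query: str, candidates: list[str]) -> list[str]:
    return [c for c in candidates if not c.startswith("Logger.")]  # prune noise


context = retrieve("why does payment fail?", toy_anchor, toy_expand, toy_rerank)
print(context)
```

Keeping the layers behind plain callables means the expensive parts (embedding search, graph queries, LLM calls) can be replaced with stubs like these in unit tests.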

✨ Features

  • 🔎 Multi-language support: parses Python, JS/TS, Java, Go, C++ via tree-sitter
  • 🕸️ Dependency graph: builds call graph, import graph, inheritance graph in Neo4j
  • 🔢 Semantic anchoring: ChromaDB stores function/class embeddings
  • 🔁 N-hop traversal: configurable depth with relevance decay scoring
  • 🤖 LLM reranking: Claude prunes irrelevant nodes before final generation
  • 📓 Jupyter demo: interactive notebook with a real sample codebase
  • 🧪 Test suite: unit tests for graph builder, retriever, and reranker
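
As an illustration of N-hop traversal with relevance decay, the sketch below runs a breadth-first expansion over an in-memory adjacency dict and multiplies a node's score by a fixed factor per hop. The exponential decay rule, the default values, and the name expand_with_decay are assumptions made for this example; the scoring used in the repo may differ.

```python
from collections import deque


def expand_with_decay(
    graph: dict[str, list[str]],
    seeds: dict[str, float],
    max_hops: int = 2,
    decay: float = 0.5,
) -> dict[str, float]:
    """Breadth-first expansion from scored seed nodes.

    A node reached h hops from a seed gets seed_score * decay**h; when several
    paths reach the same node, the highest score wins.
    """
    scores = dict(seeds)
    queue = deque((node, score, 0) for node, score in seeds.items())
    while queue:
        node, score, hops = queue.popleft()
        if hops == max_hops:
            continue  # depth limit reached, do not expand further
        for neighbour in graph.get(node, ()):
            candidate = score * decay
            if candidate > scores.get(neighbour, 0.0):
                scores[neighbour] = candidate
                queue.append((neighbour, candidate, hops + 1))
    return scores


# Toy graph: a -> b -> d -> e, and a -> c
toy_graph = {"a": ["b", "c"], "b": ["d"], "d": ["e"]}
scores = expand_with_decay(toy_graph, {"a": 0.8}, max_hops=2, decay=0.5)
print(scores)
```

With max_hops=2 the node e (three hops away) is never visited, and the decayed scores give the reranker a cheap prior for which distant nodes to drop first.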

🚀 Quick Start

1. Prerequisites

  • Python 3.10+
  • Neo4j Desktop or Docker
  • Anthropic API key

2. Install

git clone https://github.com/YOUR_USERNAME/gac-rag.git
cd gac-rag
pip install -r requirements.txt

3. Start Neo4j (Docker)

docker run \
  --name neo4j-gac \
  -p 7474:7474 -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/password \
  neo4j:5

4. Configure

cp .env.example .env
# Edit .env with your API key and Neo4j credentials
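
For orientation, here is a hedged sketch of how the values in .env might be read. The contents of .env.example are not shown in this README, so the variable names NEO4J_URI, NEO4J_USER, and NEO4J_PASSWORD are assumptions (ANTHROPIC_API_KEY is the name the Anthropic SDK conventionally reads). The defaults mirror the Docker command above. The Settings class and both helpers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    anthropic_api_key: str
    neo4j_uri: str
    neo4j_user: str
    neo4j_password: str


def parse_env_file(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_settings(env: dict[str, str]) -> Settings:
    """Build Settings from a mapping; variable names here are assumed, not confirmed."""
    if not env.get("ANTHROPIC_API_KEY"):
        raise ValueError("ANTHROPIC_API_KEY is not set")
    return Settings(
        anthropic_api_key=env["ANTHROPIC_API_KEY"],
        neo4j_uri=env.get("NEO4J_URI", "bolt://localhost:7687"),
        neo4j_user=env.get("NEO4J_USER", "neo4j"),
        neo4j_password=env.get("NEO4J_PASSWORD", "password"),
    )


example = parse_env_file('# local dev\nANTHROPIC_API_KEY="sk-test"\nNEO4J_PASSWORD=secret\n')
settings = load_settings(example)
print(settings.neo4j_uri)
```

Failing fast on a missing API key surfaces configuration mistakes before indexing starts rather than at the first reranker call.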

5. Run the Demo Notebook

jupyter notebook notebooks/demo.ipynb

📁 Project Structure

gac-rag/
├── src/
│   ├── indexer.py          # Parses code → builds graph + embeddings
│   ├── graph_store.py      # Neo4j interface (nodes, edges, traversal)
│   ├── vector_store.py     # ChromaDB interface (embed, search)
│   ├── retriever.py        # 3-layer retrieval pipeline
│   ├── reranker.py         # LLM-based context pruning
│   └── assistant.py        # Final answer generation
├── notebooks/
│   └── demo.ipynb          # Full interactive walkthrough
├── tests/
│   ├── test_indexer.py
│   ├── test_retriever.py
│   └── test_reranker.py
├── sample_repo/            # Example codebase to query against
│   ├── services/
│   ├── models/
│   └── utils/
├── docs/
│   └── architecture.md
├── requirements.txt
├── .env.example
└── README.md
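
To give a feel for the traversal side of graph_store.py, the sketch below builds a variable-length Cypher query for the graph expansion step. The CodeNode label, the qualified_name property, and the CALLS/IMPORTS/INHERITS relationship types are assumed names, not the schema this repo necessarily uses. Cypher generally does not accept query parameters for hop bounds or relationship types, so those are validated and formatted into the string, while the start node name stays a parameter.

```python
ALLOWED_EDGES = ("CALLS", "IMPORTS", "INHERITS")  # assumed relationship types


def build_expansion_query(max_hops: int, edge_types: tuple[str, ...] = ALLOWED_EDGES) -> str:
    """Return a Cypher query listing dependencies within max_hops of $name."""
    if not 1 <= max_hops <= 5:
        raise ValueError("max_hops must be between 1 and 5")
    unknown = set(edge_types) - set(ALLOWED_EDGES)
    if not edge_types or unknown:
        # Whitelisting keeps formatted-in text from becoming an injection vector.
        raise ValueError(f"unsupported edge types: {sorted(unknown)}")
    rel = "|".join(edge_types)
    return (
        "MATCH path = (start:CodeNode {qualified_name: $name})"
        f"-[:{rel}*1..{max_hops}]->(dep:CodeNode) "
        "RETURN dep.qualified_name AS name, min(length(path)) AS hops "
        "ORDER BY hops"
    )


query = build_expansion_query(2)
print(query)
```

With the official neo4j Python driver, a query like this would be executed as session.run(query, name="PaymentService.process"), which returns each dependency with its shortest hop distance.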

🧪 Example Query

from src.assistant import CodeAssistant

assistant = CodeAssistant(repo_path="./sample_repo")
assistant.index()  # Build graph + embeddings

answer = assistant.ask(
    "Why does payment processing fail silently when the database is down?"
)
print(answer)

Standard RAG retrieves: PaymentService.process() only

GAC-RAG retrieves:

  • PaymentService.process() ← semantic anchor
  • DatabasePool.getConnection() ← 1 hop (called)
  • RetryHandler.attempt() ← 1 hop (called)
  • EventBus.emit() ← 2 hops
  • config.db_timeout ← 2 hops (imported)

📊 Benchmark

| Metric                    | Standard RAG | GAC-RAG      |
|---------------------------|--------------|--------------|
| Single-function questions | ✅ Good      | ✅ Good      |
| Cross-file reasoning      | ❌ Poor      | ✅ Excellent |
| Multi-hop (3+ steps)      | ❌ Fails     | ✅ Strong    |
| Retrieval precision       | ~0.61        | ~0.84        |
| Context relevance         | ~0.58        | ~0.81        |

Tested on the included sample repo with 20 multi-hop questions.
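
This README does not spell out how retrieval precision was computed. As a point of reference only, the sketch below uses the conventional definition (fraction of retrieved nodes that are relevant), with a hypothetical function name and hand-labelled toy data; it is not the benchmark script and the numbers are not benchmark results.

```python
def retrieval_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of distinct retrieved nodes that appear in the relevant set."""
    unique = list(dict.fromkeys(retrieved))  # drop duplicates, keep order
    if not unique:
        return 0.0
    hits = sum(1 for node in unique if node in relevant)
    return hits / len(unique)


# Toy labels for a single question (illustrative values only).
retrieved_nodes = ["n1", "n2", "n3", "n4", "n5"]
relevant_nodes = {"n1", "n2", "n3", "n5", "n9"}
precision = retrieval_precision(retrieved_nodes, relevant_nodes)
print(precision)
```

Averaging this value over all questions in an evaluation set would give a single per-system figure comparable to the precision row in the table.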


🤝 Contributing

PRs welcome! See CONTRIBUTING.md for guidelines.


📄 License

MIT (see LICENSE)
