Ask questions about any graph in natural language - no SQL, no graph query languages, just plain English
โ ๏ธ Software Disclaimer: GraphQA is provided "as is" without warranty of any kind. This is research/experimental software designed for graph analysis and exploration. Use at your own discretion and always validate results for production use cases.
- ๐ฏ What is GraphQA?
- ๐ง Graph Algorithm Intelligence
- ๐ Quick Start
- ๐ฃ๏ธ Interactive Chat
- ๐ Programmatic Usage
- ๐ง Custom Data Loaders
- ๐ Memory & Performance
- โ๏ธ Configuration
- ๐ ๏ธ Troubleshooting
- ๐ Documentation
- ๐ค Contributing
- ๐ Project Files Explained
GraphQA is a comprehensive graph analysis framework that transforms complex graph datasets into natural language conversations. It's essentially a graph algorithm engine with an AI interface that understands what you're asking and automatically selects the right NetworkX algorithms to answer your questions.
Traditional graph analysis requires:
- Deep expertise in graph theory and algorithms
- Programming skills to implement NetworkX code
- Domain knowledge to know which algorithms apply when
- Trial and error to find the right analysis approach
GraphQA solves this by providing an intelligent algorithm selection layer that:
- Understands natural language questions about graph data
- Automatically chooses appropriate graph algorithms
- Executes analysis using proven NetworkX implementations
- Presents results in human-readable format
Foundation: GraphQA is built on NetworkX, the gold standard for graph analysis in Python. Your entire dataset lives in memory as a NetworkX MultiDiGraph, providing:
- โก Instant access to 500+ NetworkX algorithms
- ๐ฌ Research-grade implementations (PageRank, Louvain, Dijkstra, etc.)
- ๐ Rich data structures supporting any graph topology
- ๐ Interactive analysis with sub-second response times
- ๐งฎ Mathematical precision for complex computations
Intelligence Layer: AI agents intelligently orchestrate graph algorithms by:
- Schema Discovery: Embedding-based analysis of your graph structure
- Question Understanding: LangChain ReAct agents parse natural language
- Algorithm Selection: Smart routing to optimal NetworkX algorithms
- Execution: Running analysis with appropriate parameters
- Result Synthesis: Converting algorithm outputs to insights
Traditional NetworkX requires expertise:
# Manual algorithm selection and interpretation
communities = nx.community.greedy_modularity_communities(G)
centrality = nx.pagerank(G)
# ... complex result analysis ...GraphQA makes it conversational:
๐ค Ask GraphQA: "Find communities and influential nodes"
๐ Found 1,247 communities. Top nodes: B001, B002, B003...
GraphQA makes zero assumptions about your data structure:
- ๐ Automatic Discovery: embeddings map attributes to semantic concepts
- ๐ค Contextual Understanding: "expensive" automatically maps to price/cost attributes
- ๐ Algorithm Adaptation: Community detection parameters adjust to graph size/density
The Result: Whether you're analyzing protein interactions, supply chains, or social media networks, GraphQA provides the same powerful natural language interface backed by rigorous graph algorithms.
GraphQA's power comes from its intelligent algorithm selection system that automatically chooses and configures the right NetworkX algorithms for your questions. Here's what's under the hood:
GraphQA's ReAct agent uses exactly 5 specialized tools:
- Embedding-based attribute discovery
- Multi-attribute filtering and pattern matching
- Sample data retrieval
- Attribute-based node/edge filtering
- Multi-hop neighborhood exploration (up to depth N)
- Similarity searches and metric-based queries
- Centrality measures (PageRank, betweenness, degree)
- Connectivity analysis and topology metrics
- Attribute distributions and summary statistics
- Node-specific centrality and importance metrics
- Local neighborhood analysis and clustering
- Similarity computation with other nodes
- Community detection (Louvain, greedy modularity, label propagation)
- Pathfinding algorithms (Dijkstra, shortest paths)
- Network topology analysis and connectivity patterns
GraphQA doesn't just run algorithms randomly - it uses sophisticated decision logic:
Dataset-Aware Selection:
# GraphQA automatically chooses based on your graph characteristics:
if graph_size == "small" and preference == "comprehensive":
use_betweenness_centrality() # O(Vยณ) - thorough but slow
elif graph_size == "large" and preference == "fast":
use_degree_centrality() # O(V) - fast approximation
else:
use_pagerank() # O(V+E) - balanced choicePerformance-Optimized Routing:
- Small graphs (< 1K nodes): Use comprehensive algorithms (betweenness centrality, all-pairs shortest paths)
- Medium graphs (1K-100K nodes): Balance quality vs speed (PageRank, greedy modularity)
- Large graphs (100K+ nodes): Prioritize scalable algorithms (degree centrality, label propagation)
Context-Aware Configuration:
- Community Detection: Louvain for undirected graphs, greedy modularity for directed
- Centrality: Adapts to graph density and connectivity patterns
- Pathfinding: Chooses between Dijkstra, A*, or bidirectional search
Here's how natural language maps to specific graph algorithms:
| Question Type | Selected Algorithms | NetworkX Functions |
|---|---|---|
| "Most connected nodes" | Degree, PageRank centrality | nx.degree_centrality(), nx.pagerank() |
| "Find communities" | Louvain, greedy modularity | nx.community.louvain_communities() |
| "Shortest path" | Dijkstra, A* pathfinding | nx.shortest_path(), nx.astar_path() |
| "Important nodes" | Betweenness, closeness centrality | nx.betweenness_centrality() |
| "Network structure" | Clustering, connectivity analysis | nx.average_clustering(), nx.connected_components() |
| "Similar products" | Structural similarity, attribute matching | nx.jaccard_coefficient(), custom similarity |
GraphQA uses research-grade implementations:
- NetworkX algorithms: Peer-reviewed, mathematically validated
- Performance tested: Benchmarked on real-world datasets
- Parameter tuned: Automatically optimized for different graph types
- Error handling: Graceful fallbacks for edge cases
Example Output Quality:
๐ค Ask GraphQA: "Find communities in this network"
๐ง Algorithm Selection:
โข Graph: 29,091 nodes, 60,168 edges (medium, directed)
โข Chosen: Greedy modularity optimization
โข Rationale: Best balance of quality vs performance for this size
๐ Results:
โข Found 2,743 communities (modularity = 0.892)
โข Largest community: 150 nodes
โข Average community size: 10.6 nodes
โข Execution time: 49.8 seconds
This intelligent algorithm selection is what makes GraphQA special - it brings expert-level graph analysis to anyone who can ask a question in English.
- Python 3.8+ (recommended: Python 3.10+)
- OpenAI API Key for LLM functionality
# Clone the repository
git clone https://github.com/catio-tech/graphqa.git
cd graphqa
# Run interactive setup - guides you through everything
python quickstart.py# Clone the repository
git clone https://github.com/catio-tech/graphqa.git
cd graphqa
# Create virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Automated setup
python quickstart.py --auto
# Or use the wrapper script
./setup# Clone and install manually
git clone https://github.com/catio-tech/graphqa.git
cd graphqa
# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate
# Install GraphQA (core dependencies)
pip install -e .
# Or install with development tools
pip install -e ".[dev]"
# Or install with observability
pip install -e ".[observability]"
# Or install everything
pip install -e ".[all]"Set your OpenAI API key:
Option 1 (Recommended): Use .env file
# Copy the example environment file
cp env.example .env
# Edit .env and add your API key
# OPENAI_API_KEY=your-api-key-hereOption 2: Set environment variable
export OPENAI_API_KEY="your-api-key-here"Test the installation:
from graphqa import GraphQA
# Initialize GraphQA
agent = GraphQA(dataset_name="amazon")
print("โ
GraphQA ready!")- ๐ NetworkX Integration: Full compatibility with NetworkX ecosystem and algorithms
- ๐ง Zero Configuration: Automatic schema discovery - no manual setup required
- ๐ Embedding-Based Search: 384-dimensional semantic vectors for intelligent attribute matching
- ๐ค LangChain ReAct Agents: Multi-step reasoning with tool selection and execution
- ๐ Universal Graph Algorithms: Community detection, centrality, clustering, pathfinding
- ๐พ In-Memory Performance: Sub-second analysis on graphs with 100K+ nodes
- ๐ง Production Ready: Built-in observability, error handling, and memory management
The easiest way to use GraphQA is through the interactive chat interface:
# Start interactive GraphQA chat
python -m graphqa.demo
# Or with specific dataset
python -m graphqa.demo --dataset amazonThis launches an interactive session where you can:
- Ask questions in natural language
- Get immediate responses about your graph data
- Explore data without writing code
Example Interactive Session:
๐ GraphQA Interactive Demo - Amazon Products Dataset
๐ก Ask questions about your graph data in plain English!
๐ค Ask GraphQA: What are the highest rated products?
๐ I found the highest rated products in the Amazon dataset...
[Results with top-rated products including ASINs like B0001234...]
๐ค Ask GraphQA: What categories do these products belong to?
๐ Based on the previous results, here are the categories...
[Category information for the highly rated products]
๐ค Ask GraphQA: Find products under $50 with good ratings
๐๏ธ I found products under $50 with high ratings...
[Budget-friendly alternatives with ratings above 4.0]
# Quick interactive demo
python -m graphqa.clifrom graphqa import GraphQA
# Load any graph dataset into memory as NetworkX graph
agent = GraphQA(dataset_name="amazon")
agent.load_dataset() # Loads into NetworkX MultiDiGraph
# Ask questions in natural language
response = agent.ask("What are the most connected nodes?")
print(response)
# Follow-up questions use conversation context
response = agent.ask("Show me their categories")
print(response)
agent.shutdown()from graphqa import GraphQA
# Load dataset
agent = GraphQA(dataset_name="amazon")
agent.load_dataset()
# Access the underlying NetworkX graph directly
graph = agent.graph # NetworkX MultiDiGraph object
print(f"Nodes: {graph.number_of_nodes()}")
print(f"Edges: {graph.number_of_edges()}")
# Use any NetworkX algorithms directly
import networkx as nx
centrality = nx.degree_centrality(graph)
communities = nx.community.greedy_modularity_communities(graph)
# Or let GraphQA's AI choose the right algorithms
response = agent.ask("Find the most central nodes and detect communities")
print(response)import networkx as nx
from graphqa import GraphQA
# Create your own NetworkX graph
graph = nx.MultiDiGraph()
graph.add_node("A", type="server", region="us-west-2")
graph.add_node("B", type="database", region="us-east-1")
graph.add_edge("A", "B", relationship="connects_to", weight=1.0)
# Load into GraphQA for natural language analysis
agent = GraphQA()
agent.graph = graph
agent._discover_and_setup_schema() # Auto-discover schema
# Now ask questions about your custom graph
response = agent.ask("What nodes are in the us-west-2 region?")
print(response)GraphQA is designed to work with any graph dataset. You can create custom data loaders to import your specific data format into the GraphQA ecosystem.
All data loaders inherit from BaseGraphLoader, which provides:
- Automatic schema discovery using embedding-based analysis
- Validation and error handling for robust data loading
- Metadata generation for intelligent tool configuration
- Universal interface that works with all GraphQA tools
from graphqa.loaders import BaseGraphLoader
import networkx as nx
class MyCustomLoader(BaseGraphLoader):
"""Custom loader for your specific data format"""
def __init__(self, config: Dict[str, Any] = None):
super().__init__(config)
self.data_source = self.config.get('data_source', 'my_data.json')
def load_graph(self) -> nx.MultiDiGraph:
"""
REQUIRED: Load your data and return a NetworkX MultiDiGraph
This is the core method you must implement
"""
graph = nx.MultiDiGraph()
# Your custom data loading logic here
# Example: loading from JSON, CSV, database, API, etc.
return graph
def get_dataset_description(self) -> str:
"""REQUIRED: Describe what this dataset contains"""
return "My custom dataset containing..."
def get_sample_queries(self) -> List[str]:
"""REQUIRED: Provide example questions for this dataset"""
return [
"What are the most connected entities?",
"Find clusters in the data",
"Show me the network structure"
]
# Optional: Override these methods for custom behavior
def get_dataset_name(self) -> str:
return "MyCustomDataset"
def get_data_sources(self) -> List[str]:
return [self.data_source]import json
import pandas as pd
import networkx as nx
from graphqa.loaders import BaseGraphLoader
from pathlib import Path
class SocialNetworkLoader(BaseGraphLoader):
"""Example: Load social network data from CSV files"""
def __init__(self, config: Dict[str, Any] = None):
super().__init__(config)
# Configure data sources
self.users_file = self.config.get('users_file', 'users.csv')
self.connections_file = self.config.get('connections_file', 'connections.csv')
self.max_users = self.config.get('max_users', None)
self.logger.info(f"Social network loader initialized") def load_graph(self) -> nx.MultiDiGraph:
"""Load social network data into NetworkX graph"""
graph = nx.MultiDiGraph()
try:
# Load user data (nodes)
users_df = pd.read_csv(self.users_file)
if self.max_users:
users_df = users_df.head(self.max_users)
# Add nodes with attributes
for _, user in users_df.iterrows():
graph.add_node(
user['user_id'],
name=user.get('name', ''),
age=user.get('age', 0),
location=user.get('location', ''),
followers=user.get('followers', 0),
# Add any other attributes from your data
)
# Load connections (edges)
connections_df = pd.read_csv(self.connections_file)
for _, conn in connections_df.iterrows():
source = conn['user_id']
target = conn['friend_id']
# Only add edge if both nodes exist
if graph.has_node(source) and graph.has_node(target):
graph.add_edge(
source, target,
relationship_type='friendship',
weight=conn.get('strength', 1.0),
created_date=conn.get('date', '')
)
self.logger.info(f"Loaded {graph.number_of_nodes()} users, {graph.number_of_edges()} connections")
return graph
except Exception as e:
raise LoaderError(f"Failed to load social network data: {e}") def get_dataset_description(self) -> str:
return (
"Social Network dataset containing user profiles and friendship connections. "
"Includes user demographics, follower counts, and relationship data for "
"analyzing social influence and community structures."
)
def get_sample_queries(self) -> List[str]:
return [
"Who are the most popular users?",
"Find friend groups and communities",
"What's the average number of connections?",
"Show me users in the same location",
"Which users have the most influence?",
"Find the shortest path between two users",
"Analyze age distribution in the network",
"Identify users with unusual connection patterns"
]
def get_dataset_name(self) -> str:
return "Social Network"class MultiSourceLoader(BaseGraphLoader):
"""Example: Combine data from multiple sources"""
def load_graph(self) -> nx.MultiDiGraph:
graph = nx.MultiDiGraph()
# Load from database
graph = self._load_from_database(graph)
# Enrich with API data
graph = self._enrich_with_api(graph)
# Add computed features
graph = self._add_computed_features(graph)
return graph
def _load_from_database(self, graph):
# Your database loading logic
return graph
def _enrich_with_api(self, graph):
# API enrichment logic
return graph
def _add_computed_features(self, graph):
# Add computed node/edge attributes
for node in graph.nodes():
graph.nodes[node]['degree'] = graph.degree(node)
return graph def load_graph(self) -> nx.MultiDiGraph:
graph = nx.MultiDiGraph()
# Validate data sources exist
for source in self.get_data_sources():
if not Path(source).exists():
raise LoaderError(f"Data source not found: {source}")
try:
# Load with progress tracking
total_steps = 3
self.logger.info("Step 1/3: Loading nodes...")
self._load_nodes(graph)
self.logger.info("Step 2/3: Loading edges...")
self._load_edges(graph)
self.logger.info("Step 3/3: Validating graph...")
self._validate_graph(graph)
return graph
except Exception as e:
self.logger.error(f"Data loading failed: {e}")
raise LoaderError(f"Cannot load dataset: {e}")
def _validate_graph(self, graph):
"""Custom validation logic"""
if graph.number_of_nodes() == 0:
raise LoaderError("No nodes loaded - check your data")
# Add your specific validation rules
isolated_nodes = [n for n in graph.nodes() if graph.degree(n) == 0]
if len(isolated_nodes) > graph.number_of_nodes() * 0.5:
self.logger.warning(f"Many isolated nodes: {len(isolated_nodes)}")from graphqa import GraphQA
from my_loaders import SocialNetworkLoader
# Create custom config
config = {
'users_file': 'my_social_data/users.csv',
'connections_file': 'my_social_data/connections.csv',
'max_users': 10000 # Limit for testing
}
# Initialize GraphQA with custom loader
loader = SocialNetworkLoader(config)
agent = GraphQA()
agent.loader = loader
agent.load_dataset()
# Now use natural language on your data!
response = agent.ask("Find the most influential users in the network")
print(response)# Register your loader with GraphQA
from graphqa.loaders import register_loader
register_loader("social_network", SocialNetworkLoader)
# Now use it like built-in loaders
agent = GraphQA(dataset_name="social_network", config=config)
agent.load_dataset()Your loader can handle any data format, but the output must be:
โ Required:
nx.MultiDiGraphobject (supports any graph topology)- Nodes with string IDs (can be any string: "user123", "product_B001", etc.)
- Basic error handling for missing files/invalid data
๐ฏ Recommended:
- Node attributes: Add meaningful properties (name, category, price, etc.)
- Edge attributes: Include relationship types, weights, timestamps
- Data validation: Check for duplicates, missing values, format issues
- Progress logging: Use
self.logger.info()for status updates - Memory management: Handle large datasets with chunking/streaming
๐ง Best Practices:
- Attribute naming: Use descriptive names (
product_titlevstitle) - Data types: Keep consistent types (all prices as float, all dates as strings)
- Missing data: Handle nulls gracefully (empty string vs None vs 0)
- Performance: Use pandas/numpy for large data processing
- Documentation: Provide clear dataset descriptions and sample queries
GraphQA includes reference implementations you can learn from:
AmazonProductLoader: JSON files, product relationships, review data
Check src/graphqa/loaders/ for complete working examples!
GraphQA loads your entire graph into system memory as a NetworkX object, providing:
- โก Sub-second queries: No database round-trips or network latency
- ๐งฎ Rich algorithms: Full NetworkX algorithm library available
- ๐ Interactive analysis: Immediate follow-up questions and exploration
- ๐ Complex analytics: Multi-step analysis without data movement
| Dataset Size | Memory Usage | Load Time | Recommended For |
|---|---|---|---|
| Small (< 1K nodes) | < 50MB | < 5s | Development, testing |
| Medium (1K-10K nodes) | 50-200MB | 5-30s | Demos, prototypes |
| Large (10K-100K nodes) | 200MB-2GB | 30s-5min | Research, production |
| Very Large (100K+ nodes) | 2GB+ | 5+ min | High-memory servers |
- Amazon Dataset (Full): ~200K products, ~1M reviews โ 2GB RAM, 2-5min load
- Amazon Dataset (Demo): ~5K products, ~10K reviews โ 200MB RAM, 10-30s load
- Query Performance: Most questions answered in < 5 seconds
- Memory Management: Automatic cleanup, configurable limits
GraphQA is optimized for interactive analysis rather than big data processing:
โ
Perfect for: Research, prototyping, dashboard analytics, graph exploration
โ
Good for: Production analytics on medium-large graphs (< 1M nodes)
For larger datasets, consider:
- Sampling strategies (built-in test modes)
- Graph databases (Neo4j, Amazon Neptune) for storage + GraphQA for analysis
- Distributed processing (Apache Spark GraphX) for preprocessing
GraphQA ships with test mode enabled by default to ensure fast loading and prevent memory issues:
- Amazon dataset: Loads 5,000 products (vs 200,000+ full dataset)
- Loading time: 10-30 seconds (vs 2-5 minutes for full dataset)
- Memory usage: 100-200MB (vs 1-2GB for full dataset)
GraphQA automatically loads config.yaml from the current directory. Edit this file to customize behavior:
datasets:
amazon_products:
config:
# DEMO MODE (default - fast & safe)
test_mode: true
max_products: 5000
max_reviews: 10000
# FULL DATASET (requires more resources)
# test_mode: false
# max_products: null
# max_reviews: nullChoose the right template for your use case:
# For demos and quick testing (recommended)
cp config-templates/demo.yaml config.yaml
# For research and full analysis
cp config-templates/full-analysis.yaml config.yaml| Template | Products | Load Time | Memory | Use Case |
|---|---|---|---|---|
| demo.yaml | 5,000 | 10-30s | 200MB | Demos, tutorials, quick tests |
| full-analysis.yaml | 200,000+ | 2-5min | 2GB | Research, production analysis |
from graphqa import GraphQA
# Option 1: Use custom config file
agent = GraphQA(dataset_name="amazon", config_file="my_config.yaml")
# Option 2: Programmatic configuration
from graphqa.config import UniversalRetrieverConfig
config = UniversalRetrieverConfig()
config.datasets['amazon_products'].config['test_mode'] = False # Full dataset
agent = GraphQA(dataset_name="amazon", config=config)
agent.load_dataset()Import Error: "No module named 'graphqa'"
# Make sure you're in the virtual environment
source venv/bin/activate
pip install -e .OpenAI API Error
# Option 1: Check your .env file
cat .env # Should contain: OPENAI_API_KEY=your-key-here
# Option 2: Set environment variable
export OPENAI_API_KEY="your-key-here"Memory Issues with Large Graphs
# Option 1: Edit config.yaml to reduce limits
# datasets.amazon_products.config.max_products: 2000
# Option 2: Use smaller test mode
python -c "
from graphqa import GraphQA
from graphqa.config import UniversalRetrieverConfig
config = UniversalRetrieverConfig()
config.datasets['amazon_products'].config['max_products'] = 1000
agent = GraphQA(dataset_name='amazon', config=config)
"Slow Loading Performance
# Check your current settings
grep -A 5 "test_mode" config.yaml
# For faster loading, ensure test mode is enabled
# test_mode: true (default)
# max_products: 5000 (vs 200,000+ full)Missing Dependencies
# Install optional dependencies
pip install langfuse # For observabilityEnvironment Variables Not Loading
# Check if .env file exists and has correct format
cat .env
# Should contain: OPENAI_API_KEY=your-actual-key-here
# Reload .env in current session
python -c "from dotenv import load_dotenv; load_dotenv(override=True); import os; print('Loaded:', bool(os.getenv('OPENAI_API_KEY')))"Performance Issues / Out of Memory
# Switch to demo mode (lighter load)
cp config-templates/demo.yaml config.yaml
# Or manually reduce limits in config.yaml
# test_mode: true
# max_products: 1000 # Even smaller- Check the logs: GraphQA provides detailed logging
- Try the examples: Run files in
examples/directory - Open an issue: GitHub Issues
- User Guide - Complete usage documentation
- Examples - Real-world usage examples
# Development setup
git clone https://github.com/catio-tech/graphqa.git
cd graphqa
# Interactive setup (recommended)
python quickstart.py
# Or manual setup
python3 -m venv venv
source venv/bin/activate
pip install -e ".[dev]" # Installs dev dependencies from pyproject.toml
# Run tests
pytest
# Format code
black src/
isort src/| File | Purpose | When to Use |
|---|---|---|
quickstart.py |
Interactive guided setup with options | New users, need help with API keys/config |
setup |
Automated setup script | CI/CD, experienced users |
setup.py |
Python package definition (legacy) | Used by pip (don't run directly) |
| File | Purpose | Usage |
|---|---|---|
pyproject.toml |
Modern Python dependencies & project config | pip install -e ".[dev]" |
Why no requirements.txt? We use pyproject.toml (modern Python standard) instead of legacy requirements.txt files. This eliminates redundancy and follows current best practices.
Apache License 2.0. See LICENSE for details.
Need help? Open an issue or check our troubleshooting guide.