Releases: Hawksight-AI/semantica
v0.3.0
🧠 Semantica v0.3.0 — First Stable Release
Released: 2026-03-10 | PyPI: pip install semantica | Python: 3.8 – 3.12 | License: MIT
The first Production/Stable release of Semantica — an open-source framework for building context graphs and decision intelligence layers for AI agents. This release consolidates everything shipped across three stages: 0.3.0-alpha (2026-02-19), 0.3.0-beta (2026-03-07), and 0.3.0 stable (2026-03-10).
pip install --upgrade semantica

No breaking changes. All new parameters carry safe defaults. All new methods are purely additive.
🚦 Release Highlights
- 🕐 Temporal Validity — valid_from/valid_until on nodes & edges; query what's active at any point in time
- 🔗 Cross-Graph Navigation — link separate ContextGraph instances; navigate across them; survives save/load
- ⚖️ Weighted BFS Traversal — filter multi-hop queries by edge confidence with min_weight
- 🧠 Decision Intelligence — full lifecycle: record → causal chain → impact analysis → precedent search → policy enforcement
- 🔄 Delta Processing — SPARQL-based incremental graph diffs; only changed data flows through the pipeline
- 🗃️ Deduplication v2 — 6.98x faster semantic dedup, 63.6% faster candidate generation
- 📤 New Export Formats — ArangoDB AQL, Apache Parquet (Spark/BigQuery/Databricks ready)
- 🗄️ Graph Backends — Apache AGE, PgVector, AWS Neptune, FalkorDB
- ✅ 886+ tests passing — 0 failures
👥 Contributors
| Contributor | Areas |
|---|---|
| @KaifAhmad1 | Lead maintainer — context graph, decision intelligence, KG algorithms, semantic extraction, pipeline, provenance, bug fixes, release management |
| @ZohaibHassan16 | Deduplication v2 suite, incremental/delta processing, benchmark suite |
| @Sameer6305 | Apache AGE backend, PgVector store, Snowflake connector, Apache Arrow export |
| @tibisabau | ArangoDB AQL export, Apache Parquet export |
| @d4ndr4d3 | ResourceScheduler deadlock fix |
✨ v0.3.0 Stable — Context Graph Feature Completeness
Shipped 2026-03-10 · All changes by @KaifAhmad1
🕐 Temporal Validity Windows
Nodes and edges now carry first-class valid_from / valid_until ISO datetime fields — stored directly on the ContextNode and ContextEdge dataclasses, not buried in metadata.
New API:
- add_node(valid_from=..., valid_until=...) and add_edge(valid_from=..., valid_until=...) — set the validity window at creation
- node.is_active(at_time=None) and edge.is_active(at_time=None) — returns True if live at the given time (defaults to now)
- graph.find_active_nodes(node_type=None, at_time=None) — filters the entire graph to active nodes only
Bug fixes:
- is_active() crashed with TypeError on tz-aware datetime inputs — fixed by normalising to tz-naive UTC via the new _parse_iso_dt() helper
- Validity fields silently lost during serialisation — fixed across all four paths: add_nodes(), add_edges(), to_dict(), from_dict()
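The tz-normalisation fix can be illustrated with a small stdlib sketch. The helper name _parse_iso_dt comes from the notes above; the surrounding is_active logic is a simplified stand-in, not the library's actual code:

```python
from datetime import datetime, timezone

def _parse_iso_dt(value):
    """Parse ISO strings and normalise tz-aware datetimes to tz-naive UTC
    so mixed inputs never raise TypeError on comparison."""
    dt = datetime.fromisoformat(value) if isinstance(value, str) else value
    if dt.tzinfo is not None:
        dt = dt.astimezone(timezone.utc).replace(tzinfo=None)
    return dt

def is_active(valid_from, valid_until, at_time=None):
    """Simplified stand-in for node.is_active(at_time): live iff the
    validity window contains the given time (defaults to now)."""
    at = _parse_iso_dt(at_time) if at_time is not None else datetime.now()
    lo = _parse_iso_dt(valid_from) if valid_from else None
    hi = _parse_iso_dt(valid_until) if valid_until else None
    return (lo is None or lo <= at) and (hi is None or at <= hi)

# tz-aware and tz-naive bounds can now be mixed freely
print(is_active("2026-01-01T00:00:00", "2026-12-31T00:00:00+00:00",
                at_time="2026-06-01T12:00:00"))  # True
```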
🔗 Cross-Graph Navigation
Separate ContextGraph instances can now be linked and navigated between. Links are fully durable — they survive save_to_file() / load_from_file() and reconnect via a registry.
New API:
- graph.graph_id — stable UUID assigned at init; persisted to JSON
- link_graph(other_graph, source_node_id, target_node_id) — creates a navigable bridge; returns link_id
- navigate_to(link_id) — returns (other_graph, target_node_id)
- resolve_links({graph_id: instance}) — reconnects links after load; returns count resolved
- save_to_file() — now writes a links section alongside nodes and edges
- load_from_file() — restores graph_id and populates _unresolved_links
Bug fix: Previous implementation auto-created marker targets as phantom "entity" nodes — fixed by pre-creating a "cross_graph_link" typed ContextNode before inserting the marker edge.
14 new tests in tests/context/test_cross_graph_navigation.py covering link creation, phantom-node prevention, partial registry resolution, and full save/load round-trips.
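The registry-based resolution flow can be sketched with a toy model. Class and method names mirror the API above, but this is illustrative code, not the real ContextGraph:

```python
import uuid

class MiniGraph:
    """Toy stand-in for ContextGraph's cross-graph links."""
    def __init__(self):
        self.graph_id = str(uuid.uuid4())   # stable id, persisted to JSON
        self._links = {}                    # link_id -> (target graph_id, node_id)
        self._resolved = {}                 # link_id -> (graph instance, node_id)

    def link_graph(self, other, target_node_id):
        link_id = str(uuid.uuid4())
        self._links[link_id] = (other.graph_id, target_node_id)
        return link_id

    def resolve_links(self, registry):
        """After load, reconnect link ids to live instances; returns count."""
        count = 0
        for link_id, (gid, node_id) in self._links.items():
            if gid in registry:
                self._resolved[link_id] = (registry[gid], node_id)
                count += 1
        return count

    def navigate_to(self, link_id):
        return self._resolved[link_id]      # (other_graph, target_node_id)

a, b = MiniGraph(), MiniGraph()
lid = a.link_graph(b, "node-7")
a.resolve_links({b.graph_id: b})            # simulate post-load reconnection
graph, node = a.navigate_to(lid)
```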
⚖️ Weighted Multi-Hop BFS Traversal
get_neighbors() now accepts a min_weight threshold to confine traversal to high-confidence causal links only. Default 0.0 passes all edges — fully backward-compatible.
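A minimal sketch of what a weight-thresholded BFS does — the adjacency layout and function name are illustrative, not the library's internals:

```python
from collections import deque

def bfs_neighbors(adj, start, min_weight=0.0, max_hops=2):
    """Multi-hop BFS that only follows edges meeting the confidence
    threshold; adj maps node -> list of (neighbor, weight) pairs."""
    seen, out = {start}, []
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for nbr, w in adj.get(node, []):
            if w >= min_weight and nbr not in seen:
                seen.add(nbr)
                out.append(nbr)
                frontier.append((nbr, depth + 1))
    return out

adj = {"a": [("b", 0.9), ("c", 0.2)], "b": [("d", 0.8)]}
print(bfs_neighbors(adj, "a", min_weight=0.5))  # ['b', 'd'] — 'c' filtered out
```

With the default threshold of 0.0 every edge passes, matching the backward-compatible behaviour described above.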
🔧 Additional Fixes in v0.3.0 Stable
- PipelineBuilder.add_step() return type annotation corrected from "PipelineBuilder" to "PipelineStep"
- test_hybrid_search_performance fixed to accumulate a true search_times list; threshold relaxed to < 5.0s for real sentence-transformers latency
🔧 v0.3.0-beta — Semantic Extraction, Deduplication v2, New Export Formats
Shipped 2026-03-07
🧩 Semantic Extraction Fixes — @KaifAhmad1 (PR #354, #355)
LLM Relation Extraction:
- Unmatched subjects/objects now produce a synthetic UNKNOWN entity instead of silently dropping the relation
- Orphaned legacy block in _parse_relation_result that appended every relation twice has been removed
- extraction_method parameter added — typed extraction paths now record "llm_typed" instead of "llm"
Reasoner Pattern Matching:
_match_pattern in reasoner.py fully rewritten — splits patterns on ?var placeholders, escapes only literal segments, uses backreferences for repeated variables and non-greedy .+? to prevent over-consumption
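The rewrite described above can be sketched as a small pattern compiler. This is a hypothetical re-implementation of the behaviour, not the library's code:

```python
import re

def compile_pattern(pattern: str) -> re.Pattern:
    """Compile a '?var' pattern into a regex: literal segments are escaped,
    the first use of a variable becomes a non-greedy named group, and
    repeated uses become backreferences."""
    parts = re.split(r"(\?\w+)", pattern)
    seen, out = set(), []
    for part in parts:
        if part.startswith("?"):
            name = part[1:]
            if name in seen:
                out.append(f"(?P={name})")      # repeated variable -> backreference
            else:
                seen.add(name)
                out.append(f"(?P<{name}>.+?)")  # non-greedy to avoid over-consumption
        else:
            out.append(re.escape(part))          # escape only literal segments
    return re.compile("".join(out))

m = compile_pattern("?x works at ?y").fullmatch("Alice Smith works at Acme Corp")
print(m.groupdict())  # variables bind to multi-word values
```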
RDF Export Aliases:
RDFExporter now accepts "ttl", "nt", "xml", "rdf", and "json-ld" as format aliases — zero API changes
Tests added: tests/reasoning/test_reasoner.py (4 tests), tests/semantic_extract/test_relation_extractor.py (6 tests), tests/export/test_rdf_exporter.py (8 tests)
🔄 Incremental / Delta Processing — @ZohaibHassan16, @KaifAhmad1 (PR #349)
- Native SPARQL-based diff between graph snapshots — only changed triples enter the pipeline
- delta_mode flag in PipelineBuilder for near-real-time incremental workloads
- Version snapshot management with graph URI tracking and per-snapshot metadata storage
- prune_versions() for automatic retention cleanup of old snapshots
Bug fixes: corrected SPARQL variable order, fixed class references, resolved duplicate dictionary keys.
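Conceptually, a snapshot delta reduces to set differences over triples — the library computes this natively in SPARQL, but the core idea is:

```python
def graph_delta(old_triples, new_triples):
    """Delta between two snapshots as triple-set differences; only the
    'added' and 'removed' triples need to flow through the pipeline."""
    old, new = set(old_triples), set(new_triples)
    return {"added": new - old, "removed": old - new}

old = {("acme", "employs", "alice"), ("acme", "based_in", "nyc")}
new = {("acme", "employs", "alice"), ("acme", "employs", "bob")}
delta = graph_delta(old, new)
print(delta["added"])    # only the new triple enters the pipeline
```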
🗃️ Deduplication v2 Suite — @ZohaibHassan16, @KaifAhmad1 (PR #338, #339, #340, #344)
Three independently opt-in tiers — legacy mode remains the default, fully backward-compatible.
Candidate Generation v2 (PR #338):
- New blocking_v2 and hybrid_v2 strategies replace O(N²) pair enumeration
- Multi-key blocking with normalised token prefixes, type-aware keys, and optional phonetic (Soundex) matching
- Deterministic max_candidates_per_entity budgeting with stable sorting
- 63.6% faster in worst-case scenarios (0.259s → 0.094s for 100 entities)
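The blocking idea can be sketched in a few lines — function name, key format, and prefix length here are illustrative, not the library's implementation:

```python
def blocking_keys(name, entity_type, prefix_len=4):
    """Multi-key blocking sketch: normalised token prefixes combined with
    a type-aware key. Only entities sharing at least one key become
    candidate pairs, avoiding O(N^2) enumeration."""
    tokens = name.lower().split()
    return {f"{entity_type}:{t[:prefix_len]}" for t in tokens}

a = blocking_keys("Acme Corporation", "org")
b = blocking_keys("ACME Corp.", "org")
print(a & b)  # non-empty intersection -> the pair is a dedup candidate
```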
Two-Stage Scoring Prefilter (PR #339):
- Fast gates for type mismatch, name-length ratio, and token overlap eliminate expensive semantic scoring for obvious non-matches
- Configurable thresholds: min_length_ratio, min_token_overlap_ratio, required_shared_token
- 18–25% faster batch processing when enabled (prefilter_enabled=False by default)
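The gate logic can be sketched as follows — the threshold names follow the config options above, but the checks themselves are illustrative:

```python
def passes_prefilter(a, b, min_length_ratio=0.5, min_token_overlap_ratio=0.2):
    """Two-stage scoring sketch: cheap gates reject obvious non-matches
    before any expensive semantic scoring runs."""
    la, lb = len(a), len(b)
    if min(la, lb) / max(la, lb) < min_length_ratio:
        return False                     # name-length gate
    ta, tb = set(a.lower().split()), set(b.lower().split())
    overlap = len(ta & tb) / max(len(ta | tb), 1)
    return overlap >= min_token_overlap_ratio   # token-overlap gate

print(passes_prefilter("Acme Corp", "Acme Corporation"))  # True
```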
Semantic Relationship Deduplication v2 (PR #340):
- Canonicalisation engine with predicate synonym mapping (e.g. works_for → employed_by)
- O(1) hash matching for exact canonical signatures before any semantic scoring
- Weighted scoring: 60% predicate + 40% object with explainable semantic_match_score in metadata
- 6.98x faster than legacy mode (83ms vs 579ms)
- dedup_triplets() infinite recursion bug fixed; promoted to first-class API in methods.py
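The canonical-signature fast path can be sketched like this — the synonym map and function name are illustrative:

```python
SYNONYMS = {"works_for": "employed_by", "works_at": "employed_by"}  # example mapping

def canonical_signature(subject, predicate, obj):
    """Map predicate synonyms to one canonical form so exact duplicates
    collapse via an O(1) hash/equality check before any semantic scoring."""
    return (subject.lower(), SYNONYMS.get(predicate, predicate), obj.lower())

sig1 = canonical_signature("Alice", "works_for", "Acme")
sig2 = canonical_signature("alice", "employed_by", "ACME")
print(sig1 == sig2)  # exact canonical match — no semantic scoring needed
```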
Migration guide: MIGRATION_V2.md with complete examples for all v2 strategies (PR #344)
📤 New Export Formats — @tibisabau (PR #342, #343)
ArangoDB AQL Export (PR #342):
- Full AQL INSERT statement generation for vertices and edges
- Configurable collection names with validation and sanitisation; batch processing (default: 1000)
- export_arango() convenience function; .aql auto-detection in the unified exporter
- 17 tests — 100% pass rate
Apache Parquet Export (PR #343):
- Columnar storage with configurable compression: snappy, gzip, brotli, zstd, lz4, none
- Explicit Apache Arrow schemas with type safety and field normalisation
- Analytics-ready: pandas, Spark, Snowflake, BigQuery, Databricks
- export_parquet() convenience function; .parquet auto-detection
- 25 tests — 100% pass rate
🐛 Beta Bug Fixes — @KaifAhmad1
Context module:
- retrieve_decision_precedents — entity extraction correctly gated on use_hybrid_search=True
- _extract_entities_from_query — switched to word[0].isupper() to capture camelCase identifiers like CreditCard
- Added missing expand_context() (BFS traversal) and _get_decision_query() methods
- Fixed hybrid_retrieval, dynamic_context_traversal, multi_hop_context_assembly for correct single-pass BFS
- Fixed _retrieve_from_vector fallback to prevent empty content and negative similarity scores
KG module:
- calculate_pagerank — added alpha/max_iter aliases; return format structured as {"centrality": scores, "rankings": sorted_list}
- community_detector._to_networkx — fixed silent edge loss when a NetworkX graph is passed directly
- Added 9 domain-specific tracking methods to AlgorithmTrackerWithProvenance
- Created provenance_tracker.py with ProvenanceTracker; correctly exported from semantica.kg
Pipeline module:
- Retry loop fixed — now correctly iterates...
v0.3.0-beta
Semantica v0.3.0-beta — Release Notes
Date: 2026-03-07 | Tag: v0.3.0-beta | Status: Internal Beta (Pre-release)
Consolidates all alpha and unreleased features for internal validation ahead of the public 0.3.0 launch.
What's New
Semantic Extraction & Reasoning
- Multi-Founder LLM Extraction Fix (#354) — Unmatched relation subjects/objects now produce synthetic UNKNOWN entities instead of being silently dropped; all LLM-returned co-founders preserved
- Reasoner Pattern Matching Rewrite (#354) — _match_pattern correctly handles multi-word values, pre-bound variables, repeated variable backreferences, and non-greedy separators
Export
- RDF / TTL Alias Fix (#355) — format="ttl", "nt", "xml", "rdf", and "json-ld" all resolve without breaking existing callers
- ArangoDB AQL Export (#342) — Full AQL INSERT generation for vertices and edges; configurable batching; 17 tests passing
- Apache Parquet Export (#343) — Columnar storage with configurable compression (snappy, gzip, brotli, zstd, lz4); explicit Arrow schemas; 25 tests passing
Deduplication v2 (Epic #333)
- Candidate Generation v2 (#338) — blocking_v2/hybrid_v2 strategies with multi-key and phonetic blocking; 63.6% faster worst-case
- Two-Stage Scoring Prefilter (#339) — Fast prefilter gates before expensive semantic scoring; 18–25% faster batch processing
- Semantic Deduplication v2 (#340) — Opt-in semantic_v2 with canonicalization, O(1) hash matching, weighted scoring; 6.98x speedup; fixed infinite recursion bug
- Migration Guide (#344) — MIGRATION_V2.md with full examples; 5.86x speedup confirmed; backward compatible
Incremental / Delta Processing
- Delta Processing (#349) — Native SPARQL delta computation between graph snapshots; delta_mode pipeline config; prune_versions() for snapshot retention; production-ready for near real-time pipelines
Bug Fixes
- NameError — missing Type import in utils/helpers.py; removed unused import from config_manager.py
- Context module — fixed retrieve_decision_precedents, hybrid_retrieval, dynamic_context_traversal, multi_hop_context_assembly, _retrieve_from_vector, _extract_entities_from_query; added missing expand_context and _get_decision_query methods
- Knowledge Graph module — fixed calculate_pagerank, community_detector._to_networkx, detect_communities, _build_adjacency; added ProvenanceTracker and 9 domain-specific tracking methods
- Pipeline module — fixed retry loop in execution_engine; added RecoveryAction with LINEAR / EXPONENTIAL / FIXED backoff; fixed add_step return value; added validate alias
- Test files — replaced emoji with ASCII for Windows cp1252 compatibility; fixed assertion ordering and loop bugs across 4 test files
Test Results
| Passing | Skipped (external services) | Failed |
|---|---|---|
| ~840 | 36 | 0 |
Contributors
v0.3.0-alpha
🎉 Semantica v0.3.0-alpha Release
This alpha release introduces comprehensive decision tracking capabilities, advanced knowledge graph algorithms, and production-ready architecture for testing.
🚀 Major Features
Decision Tracking System
- Complete decision lifecycle management with audit trails
- Provenance tracking and lineage management
- Policy compliance and exception handling
- Decision influence analysis and impact scoring
Advanced Knowledge Graph Algorithms
- Node2Vec embeddings for semantic similarity
- Centrality analysis (degree, betweenness, closeness, eigenvector)
- Community detection and graph analytics
- Path finding and link prediction
Enhanced Context Module
- Unified AgentContext with granular feature flags
- Decision tracking integration
- Production-ready architecture with validation
- GraphStore capability validation
Vector Store Features
- Hybrid search combining semantic, structural, and category similarity
- Advanced retrieval with configurable weights
- FastEmbed integration for efficient operations
🧪 Testing & Quality
- 113+ tests passing across context and core modules
- Comprehensive decision tracking test coverage
- Enhanced error handling and edge case testing
- Fixed all critical test failures for release readiness
📦 Installation
pip install semantica==0.3.0a0

Semantica 0.2.7
Overview
Release 0.2.7 adds Snowflake integration, Apache Arrow export, and a comprehensive benchmark suite.
🚀 New Features
Snowflake Connector for Data Ingestion
PR #276 by @Sameer6305
Native Snowflake connector with multi-authentication support (password, OAuth, key-pair, SSO). Includes table/query ingestion, schema introspection, and SQL injection prevention.
Tests: 24/24 passing
Dependency: db-snowflake optional
Apache Arrow Export Support
PR #273 by @Sameer6305
High-performance columnar export with explicit schemas, compression, and Pandas/DuckDB compatibility.
Tests: 20/20 passing
Dependency: db-arrow optional
Comprehensive Benchmark Suite
PR #289 by @ZohaibHassan16, @KaifAhmad1
137+ benchmarks across all modules with regression detection and CI/CD integration.
Features: Statistical analysis, environment-agnostic design, CLI tool
📊 Quality Assurance
- Total Tests: 44/44 passing
- Breaking Changes: None
- Backward Compatible: Yes
🛠 Installation
pip install semantica==0.2.7
pip install semantica[db-snowflake,db-arrow]==0.2.7

🙏 Contributors
- @Sameer6305: Snowflake Connector, Arrow Export
- @ZohaibHassan16: Benchmark Suite implementation
- @KaifAhmad1: Benchmark enhancements, CI/CD integration
🔗 Links
- GitHub: https://github.com/Hawksight-AI/semantica
- PyPI: https://pypi.org/project/semantica/
- Benchmarks: python benchmarks/benchmark_runner.py
📈 Performance
- Text Processing: >10,000 ops/sec
- Arrow Export: 10x faster
- Benchmark Coverage: 137+ tests
Thanks to all contributors for making this release possible!
Semantica v0.2.6
Semantica v0.2.6
Release Date: February 3, 2026
We're excited to announce Semantica v0.2.6, featuring major enhancements in provenance tracking, change management, and several important bug fixes!
🎉 Highlights
Major Features
- W3C PROV-O Compliant Provenance Tracking - Enterprise-grade lineage tracking across all 17 modules
- Enhanced Change Management - Version control for knowledge graphs and ontologies
- CSV Ingestion Improvements - Auto-detection and robust error handling
- Comprehensive Test Coverage - 80-86% coverage for ingestion modules
Bug Fixes
- Temperature compatibility for LLM providers
- JenaStore empty graph initialization
✨ New Features & Enhancements
W3C PROV-O Compliant Provenance Tracking
PRs: #254, #246 | Contributor: @KaifAhmad1
A comprehensive provenance tracking system with W3C PROV-O compliance across all 17 Semantica modules.
Core Module:
- ProvenanceManager for centralized tracking
- W3C PROV-O schemas (Activity, Entity, Agent)
- Storage backends: InMemory and SQLite
- SHA-256 integrity verification
Module Integrations:
- Semantic Extract, LLMs (Groq, OpenAI, HuggingFace, LiteLLM)
- Pipeline, Context, Ingest, Embeddings
- Graph/Vector/Triplet stores
- Reasoning, Conflicts, Deduplication
- Export, Parse, Normalize, Ontology, Visualization
Features:
- Complete lineage tracking: Document → Chunk → Entity → Relationship → Graph
- LLM tracking: tokens, costs, latency
- Source tracking and bridge axioms for domain transformations
Compliance:
- W3C PROV-O, FDA 21 CFR Part 11, SOX, HIPAA, TNFD
Testing:
- 237 tests covering core functionality, all 17 module integrations, edge cases, backward compatibility
Design:
- Opt-in with provenance=False by default
- Zero breaking changes
- No new dependencies
Enhanced Change Management Module
PRs: #248, #243 | Contributor: @KaifAhmad1
Enterprise-grade version control for knowledge graphs and ontologies with persistent storage and audit trails.
Core Classes:
- TemporalVersionManager - Knowledge graph versioning
- OntologyVersionManager - Ontology versioning
- ChangeLogEntry - Change metadata tracking
Storage:
- SQLite (persistent) and in-memory backends
- Thread-safe operations
Features:
- SHA-256 checksums for integrity
- Detailed entity/relationship diffs
- Structural ontology comparison
- Email validation
Compliance:
- HIPAA, SOX, FDA 21 CFR Part 11
- Immutable audit trails
Testing:
- 104 tests (100% pass)
- Unit, integration, compliance, performance, edge cases
Performance:
- 17.6ms for 10k entities
- 510+ ops/sec concurrent
- Handles 5k+ entity graphs
Migration:
- Backward compatible
- Simplified class names
- Zero external dependencies
CSV Ingestion Enhancements
PR: #244 | Contributor: @saloni0318
Robust CSV parsing with auto-detection and error handling.
Features:
- Auto-detect CSV encoding using chardet
- Auto-detect delimiter using csv.Sniffer
- Tolerant decoding and malformed-row handling (on_bad_lines='warn')
- Optional chunked reading for large files
- Metadata tracks detected values
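The delimiter detection leans on the standard library's csv.Sniffer; a quick illustration of what it buys on a semicolon-delimited file:

```python
import csv
import io

sample = "name;age\nAlice;30\nBob;25\n"
dialect = csv.Sniffer().sniff(sample)            # infers ';' as the delimiter
rows = list(csv.reader(io.StringIO(sample), dialect))
print(dialect.delimiter, rows[1])                # ; ['Alice', '30']
```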
Testing:
- Expanded unit tests covering:
- Multiple delimiters
- Quoted/multiline fields
- Header overrides
- Chunked reading
- NaN preservation
Comprehensive Test Coverage
TextNormalizer Tests
PR: #242 | Contributor: @ZohaibHassan16
Added focused test coverage for TextNormalizer behavior across various inputs.
Integration Test Improvements
PR: #241 | Contributor: @KaifAhmad1
- Introduced integration test marker
- Reduced noisy warnings in ingest tests
Ingest Unit Tests
PRs: #239, #232 | Contributor: @Mohammed2372
Comprehensive unit tests for ingestion modules (file, web, and feed ingestors).
Coverage:
- File scanning: local/cloud (S3/GCS/Azure)
- Web ingestion: URL/sitemap/robots.txt
- RSS/Atom feed parsing
Testing:
- 998 lines of test code
- Mocked external dependencies for fast, isolated execution
Results:
- file_ingestor: 86% coverage
- web_ingestor: 86% coverage
- feed_ingestor: 80% coverage
Covers happy paths, edge cases, and error handling.
🐛 Bug Fixes
Temperature Compatibility Fix
PRs: #256, #252 | Contributors: @F0rt1s, @IGES-Institut
Fixed hardcoded temperature=0.3 that broke compatibility with models requiring specific temperature values (e.g., gpt-5-mini).
Changes:
- Added _add_if_set helper method to BaseProvider
- Only passes parameters when explicitly set
- When temperature=None, the parameter is omitted, allowing APIs to use model defaults
- Updated all 5 providers: OpenAI, Groq, Gemini, Ollama, DeepSeek
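The helper's behaviour can be sketched in a few lines — this mirrors the description above but is not the provider's actual code:

```python
def _add_if_set(params, key, value):
    """Only include a parameter when it was explicitly set, so the API
    falls back to the model's own default otherwise."""
    if value is not None:
        params[key] = value
    return params

params = {}
_add_if_set(params, "temperature", None)   # omitted -> model default applies
_add_if_set(params, "max_tokens", 1024)
print(params)  # {'max_tokens': 1024}
```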
Impact:
- Reduced code by ~85 lines with cleaner parameter handling
- Comprehensive test coverage added (10 temperature tests, all passing)
- Backward compatible - no breaking changes
JenaStore Empty Graph Bug
PRs: #257, #258 | Contributor: @ZohaibHassan16
Fixed ProcessingError: Graph not initialized when operating on empty (but initialized) graphs.
Changes:
- Replaced implicit if not self.graph: checks with explicit if self.graph is None: validation
- Updated 5 methods: add_triplets, get_triplets, delete_triplet, execute_sparql, serialize
- Properly distinguishes None (uninitialized) from empty graphs (initialized with 0 triplets)
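The distinction matters because an empty container is falsy just like None. A minimal stand-in (the function name is illustrative):

```python
def ensure_initialized(graph):
    """The fixed check: only a truly uninitialised graph (None) is an
    error; an initialised graph with zero triplets is valid."""
    if graph is None:                    # NOT `if not graph:`
        raise RuntimeError("Graph not initialized")
    return graph

empty = []                               # stand-in for an initialised, empty graph
ensure_initialized(empty)                # OK under the fix
# The old `if not self.graph:` check would have raised here.
```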
Impact:
- Unblocks benchmarking suite
- Enables fresh deployments
- Improves testing workflows
📦 Installation
pip install semantica==0.2.6

Or upgrade from a previous version:

pip install --upgrade semantica

🙏 Contributors
Special thanks to all contributors who made this release possible:
- @KaifAhmad1 - Provenance tracking, change management, test improvements
- @saloni0318 - CSV ingestion enhancements
- @ZohaibHassan16 - TextNormalizer tests, JenaStore bug fix
- @Mohammed2372 - Comprehensive ingest unit tests
- @F0rt1s - Temperature compatibility fix
- @IGES-Institut - Temperature compatibility fix
📚 Documentation
- Documentation: https://semantica.readthedocs.io
- GitHub: https://github.com/Hawksight-AI/semantica
- PyPI: https://pypi.org/project/semantica/
🚀 What's Next?
Stay tuned for upcoming features in future releases. Check our GitHub Issues to see what we're working on!
Full Changelog: v0.2.5...v0.2.6
Deep Extraction, BYOM & Pinecone Support (v0.2.5)
Semantica v0.2.5
🚀 Release Highlights
This release brings native Pinecone Vector Store support, configurable LLM retry logic, and major enhancements to the Semantic Extraction module, including robust support for custom Hugging Face models (BYOM), improved NER/Relation extraction, and completed Triplet extraction logic.
🌟 New Features
Pinecone Vector Store Support
- Implemented native PineconeStore with full CRUD capabilities
- Support for serverless and pod-based indexes, namespaces, and metadata filtering
- Fully integrated with the unified VectorStore interface and registry
- (Closes #219, Resolves #220)
Configurable LLM Retry Logic
- Exposed max_retries parameter in NERExtractor, RelationExtractor, and TripletExtractor
- Defaults to 3 retries to handle JSON validation failures or API timeouts gracefully
- Propagated retry configuration through chunked processing helpers for consistent long-document handling.
Bring Your Own Model (BYOM) Support
- Custom Hugging Face Models: Enabled full support for custom models in NERExtractor, RelationExtractor, and TripletExtractor
- Custom Tokenizers: Added support for models with non-standard tokenization requirements
- Runtime Overrides: extract(model=...) now correctly overrides configuration defaults
Enhanced Extraction Capabilities
- NER: Added configurable aggregation strategies (simple, first, average, max) and robust IOB/BILOU parsing
- Relation Extraction: Implemented standard entity marker techniques (<subj>, <obj>) and structured output parsing
- Triplet Extraction: Added specialized parsing for Seq2Seq models (e.g., REBEL) to generate structured triplets directly from text
🐛 Bug Fixes
- LLM Extraction Stability: Fixed infinite retry loops by strictly enforcing max_retries limits
- Model Parameter Precedence: Resolved issues where config defaults overrode runtime arguments
- Import Handling: Fixed circular import issues in test suites via improved mocking strategies.
📦 Installation
pip install semantica==0.2.5

Semantica v0.2.4
Added
- Ontology Ingestion Module:
  - Implemented OntologyIngestor for parsing RDF/OWL files (Turtle, RDF/XML, JSON-LD, N3)
  - Added ingest_ontology and unified ingest(source_type="ontology") interface
  - Added recursive directory scanning for batch ontology ingestion
  - Added OntologyData dataclass for consistent metadata
- Documentation:
  - Updated ontology_usage.md and ontology.md with usage examples and API details
- Tests:
  - Added comprehensive test suite tests/ingest/test_ontology_ingestor.py
  - Added examples/demo_ontology_ingest.py for end-to-end demonstration
Semantica v0.2.3
We are excited to announce Semantica v0.2.3! This release focuses on stability, performance, and developer experience improvements, including critical fixes for LLM relation extraction, high-performance vector store ingestion, and resolved circular dependencies.
🚀 Added
Vector Store High-Performance Ingestion
- New add_documents API: Added high-throughput ingestion with automatic embedding generation, batching, and parallel processing
- embed_batch helper: Efficiently generate embeddings for lists of texts without immediate storage
- Parallel Defaults: Enabled default parallel ingestion in VectorStore (default: max_workers=6) for faster processing
- Documentation: Added dedicated guide docs/vector_store_usage.md for high-performance configuration
- Tests: Added tests/vector_store/test_vector_store_parallel.py covering parallel vs. sequential performance and edge cases
Amazon Neptune Dev Environment
- CloudFormation Template: Added cookbook/introduction/neptune-setup.yaml to provision a development Neptune cluster with public endpoints and IAM auth
- Documentation: Updated cookbook/introduction/21_Amazon_Neptune_Store.ipynb with deployment guides, cost estimates, and IAM best practices
- Linting: Added cfn-lint to pre-commit hooks for CloudFormation validation
Comprehensive Test Suite
- Unit Tests: Added tests/test_relations_llm.py covering typed and structured response paths for relation extraction
- Integration Tests: Added tests/integration/test_relations_groq.py for real Groq API validation
🐛 Fixed
LLM Relation Extraction Parsing
- Zero Relations Fix: Resolved issue where relation extraction returned zero results despite successful API calls.
- Response Normalization: Normalized typed responses from Instructor/OpenAI/Groq to a consistent dictionary format.
- JSON Fallback: Added structured JSON fallback when typed generation yields empty results.
- Parameter Cleanup: Removed unsupported kwargs (max_tokens, max_entities_prompt) from internal calls to prevent API errors
Pipeline Circular Import
- Resolved Import Cycles: Fixed circular dependency between pipeline_builder and pipeline_validator (Issues #192, #193)
- Lazy Loading: Implemented lazy loading for PipelineValidator to ensure stable imports
JupyterLab Stability
- Progress Output Control: Added SEMANTICA_DISABLE_JUPYTER_PROGRESS environment variable
- Memory Fix: Fallback to console-style output when enabled to prevent JupyterLab out-of-memory errors from infinite scrolling tables (Issue #181)
⚡ Changed
Relation Extraction API
- Simplified Interface: Removed unused kwargs to prevent parameter leakage.
- Better Debugging: Improved error handling and verbose logging for extraction workflows.
- Robust Parsing: Enhanced post-response parsing stability across different LLM providers.
Vector Store Defaults
- Standardized Concurrency: Set default max_workers=6 for VectorStore parallel ingestion
- Simplified Usage: Updated documentation to rely on smart defaults rather than manual configuration
Semantica 0.2.2
Highlights
- High-throughput parallel extraction engine across all core extractors.
- Major performance improvements (~1.89x speedup) for real-world extraction workloads.
- Stronger security hygiene in examples and caching.
- Updated Gemini SDK integration and dependency constraints for more stable installs.
Added
- Parallel Extraction Engine
  - Implemented parallel batch processing across all core extractors: NERExtractor, RelationExtractor, TripletExtractor, EventDetector, SemanticNetworkExtractor
  - Added max_workers parameter to all extractor extract() methods so users can tune concurrency based on CPU or rate limits
  - Enabled parallel chunking for large documents in _extract_entities_chunked and _extract_relations_chunked
  - Enhanced ProgressTracker to be thread-safe for concurrent batch updates
- Semantic Extract Performance & Regression
  - Added regression suite for max worker defaults, LLM prompt entity filtering, and extractor reuse scenarios
  - Added a runnable benchmark script for batch latency across NERExtractor, RelationExtractor, TripletExtractor, EventDetector, SemanticAnalyzer, SemanticNetworkExtractor
  - Added Groq LLM smoke tests for entities/relations/triplets when GROQ_API_KEY is set
Security
- Credential Sanitization
  - Removed hardcoded API keys from 8 cookbook notebooks to prevent secret leakage
  - Enforced environment variable usage for GROQ_API_KEY across all examples
- Secure Caching
  - Updated ExtractionCache to exclude sensitive parameters (api_key, token, password, etc.) from cache keys, enabling safe cache sharing
  - Upgraded cache key hashing from MD5 to SHA-256 for stronger collision resistance
Changed
- Gemini SDK Migration
  - Migrated GeminiProvider to the new google-genai SDK (v0.1.0+) to address deprecations
  - Added graceful fallback to google.generativeai for backward compatibility
- Dependency Resolution
  - Pinned opentelemetry-api and opentelemetry-sdk to 1.37.0 to resolve pip conflicts
  - Updated protobuf and grpcio constraints for better stability
- Entity Filtering Scope
  - Removed entity filtering from non-LLM extraction flows to avoid accuracy regressions
  - Limited entity downselection to LLM relation prompt construction, while still matching returned entities against the full original list
- Batch Concurrency Defaults
  - Standardized max_workers defaults across semantic_extract: ML-backed methods default to a single worker; pattern/regex/rules/LLM/HuggingFace methods use higher parallelism, capped by CPU count
  - Increased global optimization.max_workers default to 8 for better throughput on batch workloads
Performance
- Bottleneck Optimization (GitHub Issue #186)
  - Replaced sequential loops with parallel execution for document-level batches and intra-document chunks
  - Achieved roughly 1.89x speedup in real-world extraction scenarios (benchmarked with Groq llama-3.3-70b-versatile)
- Low-Latency Entity Matching
  - Reduced reliance on heavyweight embedding stacks by improving fast matching heuristics and short-circuiting before embedding similarity when possible
  - Optimized entity matching to prefer exact / substring / word-boundary matches, falling back to embedding similarity only when necessary, reducing CPU overhead
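The short-circuit ordering can be sketched as tiered matching — the function name and tiers are illustrative, not the library's code:

```python
import re

def fast_match(mention, candidates):
    """Cheap tiers first: exact match, then word-boundary containment;
    only when both fail would a caller fall back to embedding similarity."""
    low = mention.lower()
    for c in candidates:                                  # tier 1: exact
        if low == c.lower():
            return c
    for c in candidates:                                  # tier 2: word boundary
        if re.search(rf"\b{re.escape(low)}\b", c.lower()):
            return c
    return None                                           # tier 3: embeddings (not shown)

print(fast_match("acme", ["Beta Ltd", "Acme Corporation"]))  # Acme Corporation
```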
Release v0.2.1: Stability Fixes
🚀 Summary
This release addresses critical LLM extraction failures on long documents (Bug #176) and fixes the Earnings Call Analysis cookbook (Bug #177).
🛠 Key Changes
- LLM Stability (Fixes #176):
  - Solved incomplete JSON outputs by correctly propagating max_tokens
  - Added auto-retry with reduced chunk sizes when token limits are hit
  - Standardized default chunk size to 64k for Groq, OpenAI, and Anthropic
- Cookbook Fix (Fixes #177):
  - Resolved TypeError in 03_Earnings_Call_Analysis.ipynb by fixing SourceReference usage
- Improvements:
  - Added max_completion_tokens support for newer provider APIs
  - Removed hardcoded length constraints from semantic classes
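The chunk-size auto-retry can be sketched as follows — extract_fn and the error type here are hypothetical stand-ins; the real fix propagates max_tokens to the provider:

```python
def extract_with_retry(extract_fn, text, chunk_size=64_000, min_chunk=4_000):
    """Halve the chunk size whenever a (stand-in) token-limit error signals
    truncated JSON, instead of retrying forever at the same size."""
    while chunk_size >= min_chunk:
        try:
            return extract_fn(text[:chunk_size])
        except ValueError:               # stand-in for a token-limit error
            chunk_size //= 2
    raise RuntimeError("could not fit within token limits")

def fake_extract(chunk):                 # pretend the model chokes above 10k chars
    if len(chunk) > 10_000:
        raise ValueError("token limit")
    return {"chunk_len": len(chunk)}

result = extract_with_retry(fake_extract, "x" * 100_000)
print(result)  # succeeds once the chunk fits
```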
✅ Verification
- Verified max_tokens propagation and error handling via new tests
- Validated Groq Llama 3.3 70B integration manually
- PyPI Release: Successfully built and uploaded semantica-0.2.1