
Releases: Hawksight-AI/semantica

v0.3.0

10 Mar 22:07


🧠 Semantica v0.3.0 — First Stable Release

Released: 2026-03-10  |  PyPI: pip install semantica  |  Python: 3.8 – 3.12  |  License: MIT

The first Production/Stable release of Semantica — an open-source framework for building context graphs and decision intelligence layers for AI agents. This release consolidates everything shipped across three stages: 0.3.0-alpha (2026-02-19), 0.3.0-beta (2026-03-07), and 0.3.0 stable (2026-03-10).

pip install --upgrade semantica

No breaking changes. All new parameters carry safe defaults. All new methods are purely additive.


🚦 Release Highlights

  • 🕐 Temporal Validity — valid_from/valid_until on nodes & edges; query what's active at any point in time
  • 🔗 Cross-Graph Navigation — link separate ContextGraph instances; navigate across them; survives save/load
  • ⚖️ Weighted BFS Traversal — filter multi-hop queries by edge confidence with min_weight
  • 🧠 Decision Intelligence — full lifecycle: record → causal chain → impact analysis → precedent search → policy enforcement
  • 🔄 Delta Processing — SPARQL-based incremental graph diffs; only changed data flows through the pipeline
  • 🗃️ Deduplication v2 — 6.98x faster semantic dedup, 63.6% faster candidate generation
  • 📤 New Export Formats — ArangoDB AQL, Apache Parquet (Spark/BigQuery/Databricks ready)
  • 🗄️ Graph Backends — Apache AGE, PgVector, AWS Neptune, FalkorDB
  • 886+ tests passing — 0 failures

👥 Contributors

| Contributor | Areas |
| --- | --- |
| @KaifAhmad1 | Lead maintainer — context graph, decision intelligence, KG algorithms, semantic extraction, pipeline, provenance, bug fixes, release management |
| @ZohaibHassan16 | Deduplication v2 suite, incremental/delta processing, benchmark suite |
| @Sameer6305 | Apache AGE backend, PgVector store, Snowflake connector, Apache Arrow export |
| @tibisabau | ArangoDB AQL export, Apache Parquet export |
| @d4ndr4d3 | ResourceScheduler deadlock fix |

✨ v0.3.0 Stable — Context Graph Feature Completeness

Shipped 2026-03-10 · All changes by @KaifAhmad1

🕐 Temporal Validity Windows

Nodes and edges now carry first-class valid_from / valid_until ISO datetime fields — stored directly on the ContextNode and ContextEdge dataclasses, not buried in metadata.

New API:

  • add_node(valid_from=..., valid_until=...) and add_edge(valid_from=..., valid_until=...) — set validity window at creation
  • node.is_active(at_time=None) and edge.is_active(at_time=None) — returns True if live at the given time (defaults to now)
  • graph.find_active_nodes(node_type=None, at_time=None) — filters entire graph to active nodes only

Bug fixes:

  • is_active() crashed with TypeError on tz-aware datetime inputs — fixed by normalising to tz-naive UTC via new _parse_iso_dt() helper
  • Validity fields silently lost during serialisation — fixed across all four paths: add_nodes(), add_edges(), to_dict(), from_dict()
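The tz-aware datetime fix can be sketched as follows. This is an illustrative reconstruction of the normalisation approach (the helper name `_parse_iso_dt` comes from the notes; the exact implementation is assumed):

```python
from datetime import datetime, timezone

def parse_iso_dt(value: str) -> datetime:
    """Parse an ISO datetime string and normalise to tz-naive UTC.

    Sketch of the fix described above: tz-aware inputs are converted
    to UTC and stripped of tzinfo, so comparisons against tz-naive
    timestamps no longer raise TypeError.
    """
    dt = datetime.fromisoformat(value)
    if dt.tzinfo is not None:
        dt = dt.astimezone(timezone.utc).replace(tzinfo=None)
    return dt
```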

🔗 Cross-Graph Navigation

Separate ContextGraph instances can now be linked and navigated between. Links are fully durable — they survive save_to_file() / load_from_file() and reconnect via a registry.

New API:

  • graph.graph_id — stable UUID assigned at init; persisted to JSON
  • link_graph(other_graph, source_node_id, target_node_id) — creates a navigable bridge; returns link_id
  • navigate_to(link_id) — returns (other_graph, target_node_id)
  • resolve_links({graph_id: instance}) — reconnects links after load; returns count resolved
  • save_to_file() — now writes a links section alongside nodes and edges
  • load_from_file() — restores graph_id and populates _unresolved_links
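The link/navigate/resolve lifecycle can be illustrated with a toy stand-in. The class below is not the real ContextGraph; it only mirrors the id-based indirection that lets links survive save/load:

```python
import uuid

class MiniGraph:
    """Toy stand-in illustrating the link -> navigate -> resolve flow."""

    def __init__(self):
        self.graph_id = str(uuid.uuid4())  # stable id, persisted to JSON
        self._links = {}                   # link_id -> (target_graph_id, target_node_id)
        self._registry = {}                # graph_id -> live instance

    def link_graph(self, other, target_node_id):
        # Store only ids, so the link survives serialisation; the live
        # instance is reattached later via resolve_links().
        link_id = str(uuid.uuid4())
        self._links[link_id] = (other.graph_id, target_node_id)
        self._registry[other.graph_id] = other
        return link_id

    def resolve_links(self, registry):
        # After loading from disk, reconnect graph ids to live instances.
        resolved = 0
        for gid in {gid for gid, _ in self._links.values()}:
            if gid in registry:
                self._registry[gid] = registry[gid]
                resolved += 1
        return resolved

    def navigate_to(self, link_id):
        gid, node_id = self._links[link_id]
        return self._registry[gid], node_id
```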

Bug fix: Previous implementation auto-created marker targets as phantom "entity" nodes — fixed by pre-creating a "cross_graph_link" typed ContextNode before inserting the marker edge.

14 new tests in tests/context/test_cross_graph_navigation.py covering link creation, phantom-node prevention, partial registry resolution, and full save/load round-trips.


⚖️ Weighted Multi-Hop BFS Traversal

get_neighbors() now accepts a min_weight threshold to confine traversal to high-confidence causal links only. Default 0.0 passes all edges — fully backward-compatible.
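The threshold semantics can be sketched with a plain BFS over an adjacency map. This is an illustrative mirror of the min_weight behaviour, not the library's internal traversal code:

```python
from collections import deque

def weighted_neighbors(adj, start, max_hops=2, min_weight=0.0):
    """Multi-hop BFS that only follows edges whose weight meets the
    threshold. `adj` maps node -> list of (neighbor, weight) pairs;
    min_weight=0.0 passes every edge, matching the backward-compatible
    default described above."""
    seen, frontier, result = {start}, deque([(start, 0)]), []
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for nbr, w in adj.get(node, []):
            if w < min_weight or nbr in seen:
                continue
            seen.add(nbr)
            result.append(nbr)
            frontier.append((nbr, depth + 1))
    return result
```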


🔧 Additional Fixes in v0.3.0 Stable

  • PipelineBuilder.add_step() return type annotation corrected from "PipelineBuilder" to "PipelineStep"
  • test_hybrid_search_performance fixed to accumulate a true search_times list; threshold relaxed to < 5.0s for real sentence-transformers latency

🔧 v0.3.0-beta — Semantic Extraction, Deduplication v2, New Export Formats

Shipped 2026-03-07

🧩 Semantic Extraction Fixes — @KaifAhmad1 (PR #354, #355)

LLM Relation Extraction:

  • Unmatched subjects/objects now produce a synthetic UNKNOWN entity instead of silently dropping the relation
  • Removed an orphaned legacy block in _parse_relation_result that appended every relation twice
  • extraction_method parameter added — typed extraction paths now record "llm_typed" instead of "llm"

Reasoner Pattern Matching:

  • _match_pattern in reasoner.py fully rewritten — splits patterns on ?var placeholders, escapes only literal segments, uses backreferences for repeated variables and non-greedy .+? to prevent over-consumption

RDF Export Aliases:

  • RDFExporter now accepts "ttl", "nt", "xml", "rdf", and "json-ld" as format aliases — zero API changes

Tests added: tests/reasoning/test_reasoner.py (4 tests), tests/semantic_extract/test_relation_extractor.py (6 tests), tests/export/test_rdf_exporter.py (8 tests)


🔄 Incremental / Delta Processing — @ZohaibHassan16, @KaifAhmad1 (PR #349)

  • Native SPARQL-based diff between graph snapshots — only changed triples enter the pipeline
  • delta_mode flag in PipelineBuilder for near-real-time incremental workloads
  • Version snapshot management with graph URI tracking and per-snapshot metadata storage
  • prune_versions() for automatic retention cleanup of old snapshots
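Conceptually, the delta step is a set difference between snapshots. The sketch below shows the idea in Python sets for clarity; the actual implementation computes this natively in SPARQL between graph snapshots:

```python
def triple_delta(old_snapshot, new_snapshot):
    """Compute added/removed triples between two snapshots, so only
    changed data needs to flow through the pipeline."""
    old, new = set(old_snapshot), set(new_snapshot)
    return {"added": new - old, "removed": old - new}
```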

Bug fixes: corrected SPARQL variable order, fixed class references, resolved duplicate dictionary keys.


🗃️ Deduplication v2 Suite — @ZohaibHassan16, @KaifAhmad1 (PR #338, #339, #340, #344)

Three independently opt-in tiers — legacy mode remains the default, fully backward-compatible.

Candidate Generation v2 (PR #338):

  • New blocking_v2 and hybrid_v2 strategies replace O(N²) pair enumeration
  • Multi-key blocking with normalised token prefixes, type-aware keys, and optional phonetic (Soundex) matching
  • Deterministic max_candidates_per_entity budgeting with stable sorting
  • 63.6% faster in worst-case scenarios (0.259s → 0.094s for 100 entities)

Two-Stage Scoring Prefilter (PR #339):

  • Fast gates for type mismatch, name-length ratio, and token overlap eliminate expensive semantic scoring for obvious non-matches
  • Configurable thresholds: min_length_ratio, min_token_overlap_ratio, required_shared_token
  • 18–25% faster batch processing when enabled (prefilter_enabled=False by default)
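The two-stage idea is to run cheap gates before any expensive semantic scoring. The sketch below uses the threshold names from the notes, but the concrete gate logic is an assumption for illustration:

```python
def passes_prefilter(a, b, min_length_ratio=0.5, min_token_overlap_ratio=0.2):
    """Cheap gates that eliminate obvious non-matches before semantic
    scoring: type mismatch, name-length ratio, and token overlap."""
    if a["type"] != b["type"]:                        # type mismatch gate
        return False
    la, lb = len(a["name"]), len(b["name"])
    if min(la, lb) / max(la, lb) < min_length_ratio:  # name-length ratio gate
        return False
    ta = set(a["name"].lower().split())
    tb = set(b["name"].lower().split())
    overlap = len(ta & tb) / max(len(ta | tb), 1)     # token overlap gate
    return overlap >= min_token_overlap_ratio
```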

Semantic Relationship Deduplication v2 (PR #340):

  • Canonicalisation engine with predicate synonym mapping (e.g. works_for → employed_by)
  • O(1) hash matching for exact canonical signatures before any semantic scoring
  • Weighted scoring: 60% predicate + 40% object with explainable semantic_match_score in metadata
  • 6.98x faster than legacy mode (83ms vs 579ms)
  • dedup_triplets() infinite recursion bug fixed; promoted to first-class API in methods.py
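The canonical-signature pre-check can be sketched as follows. The synonym map and normalisation rules here are illustrative, not the library's actual tables:

```python
SYNONYMS = {"works_for": "employed_by"}  # illustrative synonym map

seen = {}

def canonical_signature(triplet):
    """Canonicalise a (subject, predicate, object) triplet so exact
    duplicates hash to the same key."""
    s, p, o = triplet
    p = SYNONYMS.get(p.lower(), p.lower())
    return (s.lower().strip(), p, o.lower().strip())

def is_duplicate(triplet):
    """O(1) hash lookup on the canonical signature, run before any
    semantic scoring is attempted."""
    sig = canonical_signature(triplet)
    if sig in seen:
        return True
    seen[sig] = triplet
    return False
```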

Migration guide: MIGRATION_V2.md with complete examples for all v2 strategies (PR #344)


📤 New Export Formats — @tibisabau (PR #342, #343)

ArangoDB AQL Export (PR #342):

  • Full AQL INSERT statement generation for vertices and edges
  • Configurable collection names with validation and sanitisation; batch processing (default: 1000)
  • export_arango() convenience function; .aql auto-detection in the unified exporter
  • 17 tests — 100% pass rate

Apache Parquet Export (PR #343):

  • Columnar storage with configurable compression: snappy, gzip, brotli, zstd, lz4, none
  • Explicit Apache Arrow schemas with type safety and field normalisation
  • Analytics-ready: pandas, Spark, Snowflake, BigQuery, Databricks
  • export_parquet() convenience function; .parquet auto-detection
  • 25 tests — 100% pass rate

🐛 Beta Bug Fixes — @KaifAhmad1

Context module:

  • retrieve_decision_precedents — entity extraction correctly gated on use_hybrid_search=True
  • _extract_entities_from_query — switched to word[0].isupper() to capture mixed-case identifiers like CreditCard
  • Added missing expand_context() (BFS traversal) and _get_decision_query() methods
  • Fixed hybrid_retrieval, dynamic_context_traversal, multi_hop_context_assembly for correct single-pass BFS
  • Fixed _retrieve_from_vector fallback to prevent empty content and negative similarity scores

KG module:

  • calculate_pagerank — added alpha/max_iter aliases; return format structured to {"centrality": scores, "rankings": sorted_list}
  • community_detector._to_networkx — fixed silent edge-loss when a NetworkX graph is passed directly
  • Added 9 domain-specific tracking methods to AlgorithmTrackerWithProvenance
  • Created provenance_tracker.py with ProvenanceTracker; correctly exported from semantica.kg

Pipeline module:

  • Retry loop fixed — now correctly iterate...

v0.3.0-beta

07 Mar 11:28


Pre-release

Semantica v0.3.0-beta — Release Notes

Date: 2026-03-07 | Tag: v0.3.0-beta | Status: Internal Beta (Pre-release)

Consolidates all alpha and unreleased features for internal validation ahead of the public 0.3.0 launch.


What's New

Semantic Extraction & Reasoning

  • Multi-Founder LLM Extraction Fix (#354) — Unmatched relation subjects/objects now produce synthetic UNKNOWN entities instead of being silently dropped; all LLM-returned co-founders preserved
  • Reasoner Pattern Matching Rewrite (#354) — _match_pattern correctly handles multi-word values, pre-bound variables, repeated variable backreferences, and non-greedy separators

Export

  • RDF / TTL Alias Fix (#355) — format="ttl", "nt", "xml", "rdf", "json-ld" all resolve without breaking existing callers
  • ArangoDB AQL Export (#342) — Full AQL INSERT generation for vertices and edges; configurable batching; 17 tests passing
  • Apache Parquet Export (#343) — Columnar storage with configurable compression (snappy, gzip, brotli, zstd, lz4); explicit Arrow schemas; 25 tests passing

Deduplication v2 (Epic #333)

  • Candidate Generation v2 (#338) — blocking_v2 / hybrid_v2 strategies with multi-key and phonetic blocking; 63.6% faster worst-case
  • Two-Stage Scoring Prefilter (#339) — Fast prefilter gates before expensive semantic scoring; 18–25% faster batch processing
  • Semantic Deduplication v2 (#340) — Opt-in semantic_v2 with canonicalization, O(1) hash matching, weighted scoring; 6.98x speedup; fixed infinite recursion bug
  • Migration Guide (#344) — MIGRATION_V2.md with full examples; 5.86x speedup confirmed; backward compatible

Incremental / Delta Processing

  • Delta Processing (#349) — Native SPARQL delta computation between graph snapshots; delta_mode pipeline config; prune_versions() for snapshot retention; production-ready for near real-time pipelines

Bug Fixes

  • NameError — missing Type import in utils/helpers.py; removed unused import from config_manager.py
  • Context module — fixed retrieve_decision_precedents, hybrid_retrieval, dynamic_context_traversal, multi_hop_context_assembly, _retrieve_from_vector, _extract_entities_from_query; added missing expand_context and _get_decision_query methods
  • Knowledge Graph module — fixed calculate_pagerank, community_detector._to_networkx, detect_communities, _build_adjacency; added ProvenanceTracker and 9 domain-specific tracking methods
  • Pipeline module — fixed retry loop in execution_engine; added RecoveryAction with LINEAR / EXPONENTIAL / FIXED backoff; fixed add_step return value; added validate alias
  • Test files — replaced emoji with ASCII for Windows cp1252 compatibility; fixed assertion ordering and loop bugs across 4 test files

Test Results

| Passing | Skipped (external services) | Failed |
| --- | --- | --- |
| ~840 | 36 | 0 |

Contributors

@KaifAhmad1 · @ZohaibHassan16 · @tibisabau

v0.3.0-alpha

19 Feb 18:46


Pre-release

🎉 Semantica v0.3.0-alpha Release

This alpha release introduces comprehensive decision tracking capabilities, advanced knowledge graph algorithms, and production-ready architecture for testing.

🚀 Major Features

Decision Tracking System

  • Complete decision lifecycle management with audit trails
  • Provenance tracking and lineage management
  • Policy compliance and exception handling
  • Decision influence analysis and impact scoring

Advanced Knowledge Graph Algorithms

  • Node2Vec embeddings for semantic similarity
  • Centrality analysis (degree, betweenness, closeness, eigenvector)
  • Community detection and graph analytics
  • Path finding and link prediction

Enhanced Context Module

  • Unified AgentContext with granular feature flags
  • Decision tracking integration
  • Production-ready architecture with validation
  • GraphStore capability validation

Vector Store Features

  • Hybrid search combining semantic, structural, and category similarity
  • Advanced retrieval with configurable weights
  • FastEmbed integration for efficient operations

🧪 Testing & Quality

  • 113+ tests passing across context and core modules
  • Comprehensive decision tracking test coverage
  • Enhanced error handling and edge case testing
  • Fixed all critical test failures for release readiness

📦 Installation

pip install semantica==0.3.0a0

Semantica 0.2.7

09 Feb 07:26


Overview

Release 0.2.7 adds Snowflake integration, Apache Arrow export, and benchmark suite.

🚀 New Features

Snowflake Connector for Data Ingestion

PR #276 by @Sameer6305

Native Snowflake connector with multi-authentication support (password, OAuth, key-pair, SSO). Includes table/query ingestion, schema introspection, and SQL injection prevention.

Tests: 24/24 passing
Optional dependency: db-snowflake

Apache Arrow Export Support

PR #273 by @Sameer6305

High-performance columnar export with explicit schemas, compression, and Pandas/DuckDB compatibility.

Tests: 20/20 passing
Optional dependency: db-arrow

Comprehensive Benchmark Suite

PR #289 by @ZohaibHassan16, @KaifAhmad1

137+ benchmarks across all modules with regression detection and CI/CD integration.

Features: Statistical analysis, environment-agnostic design, CLI tool

📊 Quality Assurance

  • Total Tests: 44/44 passing
  • Breaking Changes: None
  • Backward Compatible: Yes

🛠 Installation

pip install semantica==0.2.7
pip install semantica[db-snowflake,db-arrow]==0.2.7


📈 Performance

  • Text Processing: >10,000 ops/sec
  • Arrow Export: 10x faster
  • Benchmark Coverage: 137+ tests

Thanks to all contributors for making this release possible!

Semantica v0.2.6

03 Feb 05:10


Semantica v0.2.6

Release Date: February 3, 2026

We're excited to announce Semantica v0.2.6, featuring major enhancements in provenance tracking, change management, and several important bug fixes!


🎉 Highlights

Major Features

  • W3C PROV-O Compliant Provenance Tracking - Enterprise-grade lineage tracking across all 17 modules
  • Enhanced Change Management - Version control for knowledge graphs and ontologies
  • CSV Ingestion Improvements - Auto-detection and robust error handling
  • Comprehensive Test Coverage - 80-86% coverage for ingestion modules

Bug Fixes

  • Temperature compatibility for LLM providers
  • JenaStore empty graph initialization

✨ New Features & Enhancements

W3C PROV-O Compliant Provenance Tracking

PRs: #254, #246 | Contributor: @KaifAhmad1

A comprehensive provenance tracking system with W3C PROV-O compliance across all 17 Semantica modules.

Core Module:

  • ProvenanceManager for centralized tracking
  • W3C PROV-O schemas (Activity, Entity, Agent)
  • Storage backends: InMemory and SQLite
  • SHA-256 integrity verification

Module Integrations:

  • Semantic Extract, LLMs (Groq, OpenAI, HuggingFace, LiteLLM)
  • Pipeline, Context, Ingest, Embeddings
  • Graph/Vector/Triplet stores
  • Reasoning, Conflicts, Deduplication
  • Export, Parse, Normalize, Ontology, Visualization

Features:

  • Complete lineage tracking: Document → Chunk → Entity → Relationship → Graph
  • LLM tracking: tokens, costs, latency
  • Source tracking and bridge axioms for domain transformations

Compliance:

  • W3C PROV-O, FDA 21 CFR Part 11, SOX, HIPAA, TNFD

Testing:

  • 237 tests covering core functionality, all 17 module integrations, edge cases, backward compatibility

Design:

  • Opt-in with provenance=False by default
  • Zero breaking changes
  • No new dependencies

Enhanced Change Management Module

PRs: #248, #243 | Contributor: @KaifAhmad1

Enterprise-grade version control for knowledge graphs and ontologies with persistent storage and audit trails.

Core Classes:

  • TemporalVersionManager - Knowledge graph versioning
  • OntologyVersionManager - Ontology versioning
  • ChangeLogEntry - Change metadata tracking

Storage:

  • SQLite (persistent) and in-memory backends
  • Thread-safe operations

Features:

  • SHA-256 checksums for integrity
  • Detailed entity/relationship diffs
  • Structural ontology comparison
  • Email validation

Compliance:

  • HIPAA, SOX, FDA 21 CFR Part 11
  • Immutable audit trails

Testing:

  • 104 tests (100% pass)
  • Unit, integration, compliance, performance, edge cases

Performance:

  • 17.6ms for 10k entities
  • 510+ ops/sec concurrent
  • Handles 5k+ entity graphs

Migration:

  • Backward compatible
  • Simplified class names
  • Zero external dependencies

CSV Ingestion Enhancements

PR: #244 | Contributor: @saloni0318

Robust CSV parsing with auto-detection and error handling.

Features:

  • Auto-detect CSV encoding using chardet
  • Auto-detect delimiter using csv.Sniffer
  • Tolerant decoding and malformed-row handling (on_bad_lines='warn')
  • Optional chunked reading for large files
  • Metadata tracks detected values
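Delimiter auto-detection can be shown with the standard library's csv.Sniffer; encoding detection via chardet works analogously on the raw bytes. A minimal sketch (the candidate delimiter set is an assumption):

```python
import csv

def detect_delimiter(sample: str) -> str:
    """Sniff the delimiter from a sample of the file, as the ingestor
    does; restricting the candidate set makes detection more robust."""
    dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
    return dialect.delimiter
```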

Testing:

  • Expanded unit tests covering:
    • Multiple delimiters
    • Quoted/multiline fields
    • Header overrides
    • Chunked reading
    • NaN preservation

Comprehensive Test Coverage

TextNormalizer Tests

PR: #242 | Contributor: @ZohaibHassan16

Added focused test coverage for TextNormalizer behavior across various inputs.

Integration Test Improvements

PR: #241 | Contributor: @KaifAhmad1

  • Introduced integration test marker
  • Reduced noisy warnings in ingest tests

Ingest Unit Tests

PRs: #239, #232 | Contributor: @Mohammed2372

Comprehensive unit tests for ingestion modules (file, web, and feed ingestors).

Coverage:

  • File scanning: local/cloud (S3/GCS/Azure)
  • Web ingestion: URL/sitemap/robots.txt
  • RSS/Atom feed parsing

Testing:

  • 998 lines of test code
  • Mocked external dependencies for fast, isolated execution

Results:

  • file_ingestor: 86% coverage
  • web_ingestor: 86% coverage
  • feed_ingestor: 80% coverage

Covers happy paths, edge cases, and error handling.


🐛 Bug Fixes

Temperature Compatibility Fix

PRs: #256, #252 | Contributors: @F0rt1s, @IGES-Institut

Fixed hardcoded temperature=0.3 that broke compatibility with models requiring specific temperature values (e.g., gpt-5-mini).

Changes:

  • Added _add_if_set helper method to BaseProvider
  • Only passes parameters when explicitly set
  • When temperature=None, parameter is omitted allowing APIs to use model defaults
  • Updated all 5 providers: OpenAI, Groq, Gemini, Ollama, DeepSeek

Impact:

  • Reduced code by ~85 lines with cleaner parameter handling
  • Comprehensive test coverage added (10 temperature tests, all passing)
  • Backward compatible - no breaking changes

JenaStore Empty Graph Bug

PRs: #257, #258 | Contributor: @ZohaibHassan16

Fixed ProcessingError: Graph not initialized when operating on empty (but initialized) graphs.

Changes:

  • Replaced implicit if not self.graph: checks with explicit if self.graph is None: validation
  • Updated 5 methods: add_triplets, get_triplets, delete_triplet, execute_sparql, serialize
  • Properly distinguishes None (uninitialized) from empty graphs (initialized with 0 triplets)
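The bug pattern is easy to reproduce with any falsy-when-empty container (rdflib graphs behave this way): `if not self.graph` rejects an initialized-but-empty graph, while `is None` does not. A minimal stand-in, not the actual JenaStore code:

```python
class Store:
    """Minimal illustration of the None-vs-empty distinction."""

    def __init__(self, graph=None):
        self.graph = graph  # None = uninitialized; [] = empty but valid

    def add_triplets(self, triplets):
        # Explicit check: only a truly uninitialized store fails.
        # `if not self.graph` would also (wrongly) fail on an empty graph.
        if self.graph is None:
            raise RuntimeError("Graph not initialized")
        self.graph.extend(triplets)
```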

Impact:

  • Unblocks benchmarking suite
  • Enables fresh deployments
  • Improves testing workflows

📦 Installation

pip install semantica==0.2.6

Or upgrade from a previous version:

pip install --upgrade semantica

🙏 Contributors

Special thanks to all contributors who made this release possible.



🚀 What's Next?

Stay tuned for upcoming features in future releases. Check our GitHub Issues to see what we're working on!


Full Changelog: v0.2.5...v0.2.6

Deep Extraction, BYOM & Pinecone Support (v0.2.5)

27 Jan 16:26


Semantica v0.2.5

🚀 Release Highlights

This release brings native Pinecone Vector Store support, configurable LLM retry logic, and major enhancements to the Semantic Extraction module, including robust support for custom Hugging Face models (BYOM), improved NER/Relation extraction, and completed Triplet extraction logic.

🌟 New Features

Pinecone Vector Store Support

  • Implemented native PineconeStore with full CRUD capabilities.
  • Support for serverless and pod-based indexes, namespaces, and metadata filtering.
  • Fully integrated with the unified VectorStore interface and registry.
  • (Closes #219, Resolves #220)

Configurable LLM Retry Logic

  • Exposed max_retries parameter in NERExtractor, RelationExtractor, and TripletExtractor.
  • Defaults to 3 retries to handle JSON validation failures or API timeouts gracefully.
  • Propagated retry configuration through chunked processing helpers for consistent long-document handling.
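The retry behaviour can be sketched as a small wrapper. This is an illustration of the max_retries semantics in spirit; the extractors' actual loop may differ:

```python
def extract_with_retry(call, max_retries=3):
    """Re-invoke an extraction call on validation errors (e.g. malformed
    JSON from the LLM) up to max_retries times, then re-raise."""
    last_error = None
    for _ in range(max_retries):
        try:
            return call()
        except ValueError as exc:  # e.g. JSON validation failure
            last_error = exc
    raise last_error
```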

Bring Your Own Model (BYOM) Support

  • Custom Hugging Face Models: Enabled full support for custom models in NERExtractor, RelationExtractor, and TripletExtractor.
  • Custom Tokenizers: Added support for models with non-standard tokenization requirements.
  • Runtime Overrides: extract(model=...) now correctly overrides configuration defaults.

Enhanced Extraction Capabilities

  • NER: Added configurable aggregation strategies (simple, first, average, max) and robust IOB/BILOU parsing.
  • Relation Extraction: Implemented standard entity marker techniques (<subj>, <obj>) and structured output parsing.
  • Triplet Extraction: Added specialized parsing for Seq2Seq models (e.g., REBEL) to generate structured triplets directly from text.

🐛 Bug Fixes

  • LLM Extraction Stability: Fixed infinite retry loops by strictly enforcing max_retries limits.
  • Model Parameter Precedence: Resolved issues where config defaults overrode runtime arguments.
  • Import Handling: Fixed circular import issues in test suites via improved mocking strategies.

📦 Installation

pip install semantica==0.2.5

Semantica v0.2.4

22 Jan 07:20


Added

  • Ontology Ingestion Module:
    • Implemented OntologyIngestor for parsing RDF/OWL files (Turtle, RDF/XML, JSON-LD, N3).
    • Added ingest_ontology and unified ingest(source_type="ontology") interface.
    • Added recursive directory scanning for batch ontology ingestion.
    • Added OntologyData dataclass for consistent metadata.
  • Documentation:
    • Updated ontology_usage.md and ontology.md with usage examples and API details.
  • Tests:
    • Added comprehensive test suite tests/ingest/test_ontology_ingestor.py.
    • Added examples/demo_ontology_ingest.py for end-to-end demonstration.

Semantica v0.2.3

20 Jan 06:39


We are excited to announce Semantica v0.2.3! This release focuses on stability, performance, and developer experience improvements, including critical fixes for LLM relation extraction, high-performance vector store ingestion, and resolved circular dependencies.

🚀 Added

Vector Store High-Performance Ingestion

  • New add_documents API: Added high-throughput ingestion with automatic embedding generation, batching, and parallel processing.
  • embed_batch Helper: Efficiently generate embeddings for lists of texts without immediate storage.
  • Parallel Defaults: Enabled default parallel ingestion in VectorStore (default: max_workers=6) for faster processing.
  • Documentation: Added dedicated guide docs/vector_store_usage.md for high-performance configuration.
  • Tests: Added tests/vector_store/test_vector_store_parallel.py covering parallel vs. sequential performance and edge cases.

Amazon Neptune Dev Environment

  • CloudFormation Template: Added cookbook/introduction/neptune-setup.yaml to provision a development Neptune cluster with public endpoints and IAM auth.
  • Documentation: Updated cookbook/introduction/21_Amazon_Neptune_Store.ipynb with deployment guides, cost estimates, and IAM best practices.
  • Linting: Added cfn-lint to pre-commit hooks for CloudFormation validation.

Comprehensive Test Suite

  • Unit Tests: Added tests/test_relations_llm.py covering typed and structured response paths for relation extraction.
  • Integration Tests: Added tests/integration/test_relations_groq.py for real Groq API validation.

🐛 Fixed

LLM Relation Extraction Parsing

  • Zero Relations Fix: Resolved issue where relation extraction returned zero results despite successful API calls.
  • Response Normalization: Normalized typed responses from Instructor/OpenAI/Groq to a consistent dictionary format.
  • JSON Fallback: Added structured JSON fallback when typed generation yields empty results.
  • Parameter Cleanup: Removed unsupported kwargs (max_tokens, max_entities_prompt) from internal calls to prevent API errors.

Pipeline Circular Import

  • Resolved Import Cycles: Fixed circular dependency between pipeline_builder and pipeline_validator (Issues #192, #193).
  • Lazy Loading: Implemented lazy loading for PipelineValidator to ensure stable imports.

JupyterLab Stability

  • Progress Output Control: Added SEMANTICA_DISABLE_JUPYTER_PROGRESS environment variable.
  • Memory Fix: Fallback to console-style output when enabled to prevent JupyterLab out-of-memory errors from infinite scrolling tables (Issue #181).

⚡ Changed

Relation Extraction API

  • Simplified Interface: Removed unused kwargs to prevent parameter leakage.
  • Better Debugging: Improved error handling and verbose logging for extraction workflows.
  • Robust Parsing: Enhanced post-response parsing stability across different LLM providers.

Vector Store Defaults

  • Standardized Concurrency: Set default max_workers=6 for VectorStore parallel ingestion.
  • Simplified Usage: Updated documentation to rely on smart defaults rather than manual configuration.

Semantica 0.2.2

14 Jan 19:13


Highlights

  • High-throughput parallel extraction engine across all core extractors.
  • Major performance improvements (~1.89x speedup) for real-world extraction workloads.
  • Stronger security hygiene in examples and caching.
  • Updated Gemini SDK integration and dependency constraints for more stable installs.

Added

  • Parallel Extraction Engine

    • Implemented parallel batch processing across all core extractors:
      • NERExtractor, RelationExtractor, TripletExtractor
      • EventDetector, SemanticNetworkExtractor
    • Added max_workers parameter to all extractor extract() methods so users can tune concurrency based on CPU or rate limits.
    • Enabled parallel chunking for large documents in:
      • _extract_entities_chunked
      • _extract_relations_chunked
    • Enhanced ProgressTracker to be thread-safe for concurrent batch updates.
  • Semantic Extract Performance & Regression

    • Added regression suite for:
      • Max worker defaults
      • LLM prompt entity filtering
      • Extractor reuse scenarios
    • Added a runnable benchmark script for batch latency across:
      • NERExtractor, RelationExtractor, TripletExtractor
      • EventDetector, SemanticAnalyzer, SemanticNetworkExtractor
    • Added Groq LLM smoke tests for entities/relations/triplets when GROQ_API_KEY is set.

Security

  • Credential Sanitization

    • Removed hardcoded API keys from 8 cookbook notebooks to prevent secret leakage.
    • Enforced environment variable usage for GROQ_API_KEY across all examples.
  • Secure Caching

    • Updated ExtractionCache to exclude sensitive parameters (api_key, token, password, etc.) from cache keys, enabling safe cache sharing.
    • Upgraded cache key hashing from MD5 to SHA-256 for stronger collision resistance.
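The safe cache-key construction can be sketched as follows. The sensitive-parameter list and serialisation format are assumptions for illustration; only the two documented properties (credential exclusion, SHA-256 hashing) come from the notes:

```python
import hashlib
import json

SENSITIVE = {"api_key", "token", "password"}

def cache_key(method: str, params: dict) -> str:
    """Build a cache key that excludes credentials and hashes with
    SHA-256, so caches can be shared safely across users."""
    safe = {k: v for k, v in params.items() if k not in SENSITIVE}
    payload = json.dumps([method, safe], sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()
```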

Changed

  • Gemini SDK Migration

    • Migrated GeminiProvider to the new google-genai SDK (v0.1.0+) to address deprecations.
    • Added graceful fallback to google.generativeai for backward compatibility.
  • Dependency Resolution

    • Pinned opentelemetry-api and opentelemetry-sdk to 1.37.0 to resolve pip conflicts.
    • Updated protobuf and grpcio constraints for better stability.
  • Entity Filtering Scope

    • Removed entity filtering from non-LLM extraction flows to avoid accuracy regressions.
    • Limited entity downselection to LLM relation prompt construction, while still matching returned entities against the full original list.
  • Batch Concurrency Defaults

    • Standardized max_workers defaults across semantic_extract:
      • ML-backed methods default to single worker.
      • Pattern/regex/rules/LLM/HuggingFace methods use higher parallelism, capped by CPU.
    • Increased global optimization.max_workers default to 8 for better throughput on batch workloads.

Performance

  • Bottleneck Optimization (GitHub Issue #186)

    • Replaced sequential loops with parallel execution for:
      • Document-level batches
      • Intra-document chunks
    • Achieved roughly 1.89x speedup in real-world extraction scenarios
      (benchmarked with Groq llama-3.3-70b-versatile).
  • Low-Latency Entity Matching

    • Reduced reliance on heavyweight embedding stacks for common cases by:
      • Improving fast matching heuristics.
      • Short-circuiting before embedding similarity when possible.
    • Optimized entity matching to:
      • Prefer exact / substring / word-boundary matches.
      • Only fall back to embedding similarity when necessary, reducing CPU overhead.

Release v0.2.1: Stability Fixes

12 Jan 12:36
ccaadf6


🚀 Summary

This release addresses critical LLM extraction failures on long documents (Bug #176) and fixes the Earnings Call Analysis cookbook (Bug #177).

🛠 Key Changes

  • LLM Stability (Fixes #176):
    • Solved incomplete JSON outputs by correctly propagating max_tokens.
    • Added auto-retry with reduced chunk sizes when token limits are hit.
    • Standardized default chunk size to 64k for Groq, OpenAI, and Anthropic.
  • Cookbook Fix (Fixes #177):
    • Resolved TypeError in 03_Earnings_Call_Analysis.ipynb by fixing SourceReference usage.
  • Improvements:
    • Added max_completion_tokens support for newer provider APIs.
    • Removed hardcoded length constraints from semantic classes.

✅ Verification

  • Verified max_tokens propagation and error handling via new tests.
  • Validated Groq Llama 3.3 70B integration manually.
  • PyPI Release: Successfully built and uploaded semantica-0.2.1.