Haji Rufai hajirufai

Hey, I'm Haji 👋

Data engineer who builds infrastructure from scratch to understand how it actually works.

Based in Mombasa, Kenya. I write Python, SQL, and whatever gets the pipeline running.

What I build

Data pipelines and infrastructure. Most of my recent work is a series of from-scratch implementations of tools I use daily. No frameworks, no dependencies, just the core algorithms:

Project	What it is
streamlite	Stream processing engine - windowing, watermarks, keyed state, checkpoints. Flink internals demystified.
brokerlite	Message broker with pub/sub, consumer groups, WAL, dead letter queues. Kafka-inspired.
raftkv	Distributed key-value store with Raft consensus - leader election, log replication, strong consistency.
queryforge	SQL query engine - lexer, parser, optimizer, executor. SELECT, JOIN, GROUP BY, subqueries over CSV/JSON.
searchlite	Full-text search engine - inverted index, BM25 scoring, Porter stemmer, faceted search.
cachelite	In-memory cache with LRU/LFU/FIFO eviction, TTL, snapshots, HTTP API.
cronlite	Task scheduler - POSIX cron syntax, priority queues, DAG dependencies, retry strategies, SQLite persistence.
vaultlite	Secrets manager with AES-128 from scratch. Envelope encryption, seal/unseal, audit logging, versioning.
gatelite	API gateway - routing, rate limiting, JWT auth, circuit breaking, load balancing, caching.
tracelite	Distributed tracing - W3C Trace Context, sampling, critical path analysis, waterfall visualization.
servekit	HTTP/1.1 server built from raw TCP sockets.
tinylang	Programming language interpreter - lexer, parser, AST, closures, first-class functions.

Every one of these has zero dependencies and uses only the Python standard library.
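To give a flavour of what "standard library only" looks like, here is a minimal sketch of LRU eviction with a per-entry TTL, two of the features listed for cachelite. It is not cachelite's code: the `LRUCache` class, its method names, and the injectable `clock` parameter are assumptions made up for this illustration.

```python
import time
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache with optional per-entry TTL (illustrative only)."""

    def __init__(self, capacity, clock=time.monotonic):
        self.capacity = capacity
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._data = OrderedDict()  # key -> (value, expires_at or None)

    def set(self, key, value, ttl=None):
        expires_at = None if ttl is None else self.clock() + ttl
        self._data[key] = (value, expires_at)
        self._data.move_to_end(key)  # mark as most recently used
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used entry

    def get(self, key, default=None):
        item = self._data.get(key)
        if item is None:
            return default
        value, expires_at = item
        if expires_at is not None and self.clock() >= expires_at:
            del self._data[key]  # lazily drop expired entries on read
            return default
        self._data.move_to_end(key)  # a read also counts as a use
        return value
```

`OrderedDict` keeps insertion order and can move a key to the end in constant time, so the least recently used entry is always at the front. Expiry here is lazy: an entry is only dropped when it is next read.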

Production data work

Project	Stack
afridata-pipeline	World Bank API to DuckDB star-schema warehouse. Dimensional modeling, data quality checks, Vercel dashboard.
realtime-event-pipeline	Kafka + DuckDB streaming pipeline. Ingestion, transformation, enrichment, OLAP analytics.
dbt-ecommerce-warehouse	dbt + DuckDB analytics warehouse. Star schema, 50+ tests, custom macros, incremental models.
stock-market-data-pipeline	Real-time stock tracking. Airflow, Spark, Slack alerts, Metabase dashboards.
datapact	Data quality and contract validation library. Declare expectations, enforce in pipelines and CI.
datadrift	Drift detection framework - schema changes, distribution shifts, statistical testing, HTML reports.
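The "declare expectations, enforce in pipelines" idea behind datapact can be sketched in a few lines. This is not datapact's API: the `check_rows` function, the shape of the `expectations` dictionary, and the sample rows are all hypothetical, chosen only to show the pattern.

```python
def check_rows(rows, expectations):
    """Return (row_index, column, message) for every violated expectation."""
    failures = []
    for i, row in enumerate(rows):
        for column, (predicate, message) in expectations.items():
            if column not in row:
                failures.append((i, column, "missing column"))
            elif not predicate(row[column]):
                failures.append((i, column, message))
    return failures


# Expectations are declared as data: column -> (predicate, failure message).
expectations = {
    "order_id": (lambda v: isinstance(v, int) and v > 0, "must be a positive integer"),
    "amount": (lambda v: isinstance(v, (int, float)) and v >= 0, "must be non-negative"),
}

# Made-up sample rows: one valid, one with a bad value, one missing a column.
rows = [
    {"order_id": 1, "amount": 19.99},
    {"order_id": 2, "amount": -5.0},
    {"order_id": 3},
]

failures = check_rows(rows, expectations)
```

In a CI job, a non-empty `failures` list would typically fail the build, which is what turns a declared expectation into an enforced contract.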

Tools and AI

Project	What it does
documind	RAG document Q&A. Hybrid search (BM25 + TF-IDF), cited answers, pluggable LLMs.
ai-agent-toolkit	Composable agent framework - tool use, memory, multi-agent orchestration. Under 1000 lines of core.
pipeforge	CI/CD pipeline generator - analyzes codebases and outputs GitHub Actions, GitLab CI, Docker configs.
vectorlite	Vector search engine - Flat, IVF, HNSW indexes with cosine/euclidean/dot product.
airbnb-clone	Full-stack MERN app. MongoDB, Express, React, Node. Auth, search, bookings, image upload.
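For context on the vectorlite entry, a "Flat" index is the brute-force baseline: score every stored vector against the query and keep the best matches. The sketch below shows that idea with cosine similarity. The `cosine` and `flat_search` names and the dictionary-based index are assumptions for illustration, not vectorlite's interface.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def flat_search(index, query, k=3):
    """Exhaustively score every vector in `index` and return the top-k (key, score) pairs."""
    scored = [(cosine(vector, query), key) for key, vector in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(key, score) for score, key in scored[:k]]
```

A flat scan is exact but linear in the number of vectors. IVF and HNSW indexes trade some recall for speed by searching only part of the collection.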

Background

BSc Mathematics and Computer Science, JKUAT
Data Engineering certs from ExploreAI Academy and Wizeline Academy
AWS Certified Cloud Practitioner
Day-to-day: Python, SQL, dbt, Airflow, Spark, Kafka, DuckDB, BigQuery, Docker, GCP, Azure
