hajirufai/readme.md

Hey, I'm Haji 👋

Data engineer who builds infrastructure from scratch to understand how it actually works.

Based in Mombasa, Kenya. I write Python, SQL, and whatever gets the pipeline running.

LinkedIn Dev.to Email


What I build

Data pipelines and infrastructure. Most of my recent work is a series of from-scratch implementations of tools I use daily. No frameworks, no dependencies, just the core algorithms:

| Project | What it is |
| --- | --- |
| streamlite | Stream processing engine - windowing, watermarks, keyed state, checkpoints. Flink internals demystified. |
| brokerlite | Message broker with pub/sub, consumer groups, WAL, dead letter queues. Kafka-inspired. |
| raftkv | Distributed key-value store with Raft consensus - leader election, log replication, strong consistency. |
| queryforge | SQL query engine - lexer, parser, optimizer, executor. SELECT, JOIN, GROUP BY, subqueries over CSV/JSON. |
| searchlite | Full-text search engine - inverted index, BM25 scoring, Porter stemmer, faceted search. |
| cachelite | In-memory cache with LRU/LFU/FIFO eviction, TTL, snapshots, HTTP API. |
| cronlite | Task scheduler - POSIX cron syntax, priority queues, DAG dependencies, retry strategies, SQLite persistence. |
| vaultlite | Secrets manager with AES-128 from scratch. Envelope encryption, seal/unseal, audit logging, versioning. |
| gatelite | API gateway - routing, rate limiting, JWT auth, circuit breaking, load balancing, caching. |
| tracelite | Distributed tracing - W3C Trace Context, sampling, critical path analysis, waterfall visualization. |
| servekit | HTTP/1.1 server built from raw TCP sockets. |
| tinylang | Programming language interpreter - lexer, parser, AST, closures, first-class functions. |

Every one of these has zero dependencies and uses only the Python standard library.
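To give a flavor of what "standard library only" looks like for one of the techniques named above (LRU eviction with TTL), here is a minimal sketch. It is not cachelite's actual code: the `LRUCache` class, its method names, and the injectable `clock` parameter are all hypothetical and exist only for illustration.

```python
import time
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache with optional per-entry TTL (illustrative sketch only)."""

    def __init__(self, capacity, clock=time.monotonic):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._data = OrderedDict()  # key -> (value, expires_at or None)

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if expires_at is not None and self.clock() >= expires_at:
            del self._data[key]  # lazily drop expired entries on read
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key, value, ttl=None):
        expires_at = None if ttl is None else self.clock() + ttl
        self._data[key] = (value, expires_at)
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used key

    def __len__(self):
        return len(self._data)
```

`OrderedDict` keeps keys in recency order, so both the "touch" on read and the eviction on write are constant-time operations. Expired entries are only removed when they are read, which keeps the sketch short; a real cache would likely also sweep them in the background.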


Production data work

| Project | Stack |
| --- | --- |
| afridata-pipeline | World Bank API to DuckDB star-schema warehouse. Dimensional modeling, data quality checks, Vercel dashboard. |
| realtime-event-pipeline | Kafka + DuckDB streaming pipeline. Ingestion, transformation, enrichment, OLAP analytics. |
| dbt-ecommerce-warehouse | dbt + DuckDB analytics warehouse. Star schema, 50+ tests, custom macros, incremental models. |
| stock-market-data-pipeline | Real-time stock tracking. Airflow, Spark, Slack alerts, Metabase dashboards. |
| datapact | Data quality and contract validation library. Declare expectations, enforce in pipelines and CI. |
| datadrift | Drift detection framework - schema changes, distribution shifts, statistical testing, HTML reports. |

Tools and AI

| Project | What it does |
| --- | --- |
| documind | RAG document Q&A. Hybrid search (BM25 + TF-IDF), cited answers, pluggable LLMs. |
| ai-agent-toolkit | Composable agent framework - tool use, memory, multi-agent orchestration. Under 1000 lines of core. |
| pipeforge | CI/CD pipeline generator - analyzes codebases and outputs GitHub Actions, GitLab CI, Docker configs. |
| vectorlite | Vector search engine - Flat, IVF, HNSW indexes with cosine/euclidean/dot product. |
| airbnb-clone | Full-stack MERN app. MongoDB, Express, React, Node. Auth, search, bookings, image upload. |
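BM25 scoring comes up in both searchlite and documind, so here is a rough sketch of the formula in plain Python. It is not taken from either project: the `bm25_scores` function is hypothetical, documents are assumed to be pre-tokenized lists of strings, `k1=1.5` and `b=0.75` are just commonly used defaults, and the IDF uses the "+1 inside the log" smoothing so scores never go negative.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query terms with Okapi BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n if n else 0.0
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            # Smoothed inverse document frequency (always >= 0).
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["data", "pipeline", "kafka"],
    ["sql", "query", "engine"],
    ["data", "data", "warehouse"],
]
scores = bm25_scores(["data"], docs)
# The third document mentions "data" twice, so it ranks above the first;
# the second document does not contain the term and scores 0.0.
```

A full search engine would pair this with an inverted index so that only documents containing a query term are scored, rather than looping over every document as this sketch does.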

Background

  • BSc Mathematics and Computer Science, JKUAT
  • Data Engineering certs from ExploreAI Academy and Wizeline Academy
  • AWS Certified Cloud Practitioner
  • Day-to-day: Python, SQL, dbt, Airflow, Spark, Kafka, DuckDB, BigQuery, Docker, GCP, Azure
