Miguel Jimeno, PhD’s Post

This is the first in a series of data engineering projects I will be uploading to GitHub. It is a simple application that generates a stream of events and publishes them to a Kafka topic. Spark consumes from that topic, writes to Redis for fast initial aggregation when needed, and also writes to a file for later consumption. There are several directions to take the project next, including 1) moving it to the cloud, 2) fixing the Parquet writing, and 3) scaling it. The repository is here: https://lnkd.in/gKyHi34R #spark #kafka #dataengineering #python #redis
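Not the repo's actual code, but a minimal PySpark Structured Streaming sketch of the shape of that consumer. The topic name, event schema, Redis keys, and output paths are all assumptions:

```python
# Sketch of the Kafka -> Spark -> Redis/file consumer described above.
# Topic name, event schema, and paths are assumptions, not the repo's
# actual code. Requires: pyspark, redis, and the spark-sql-kafka
# package on the Spark classpath.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType
import redis

spark = SparkSession.builder.appName("event-consumer").getOrCreate()

# Assumed event shape; the real generator may emit different fields.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("ts", TimestampType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")  # hypothetical topic name
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

def sink(batch_df, batch_id):
    # Fast aggregation: bump a per-type counter in Redis.
    r = redis.Redis(host="localhost", port=6379)
    for row in batch_df.groupBy("event_type").count().collect():
        r.incrby(f"events:{row['event_type']}", row["count"])
    # Durable copy for later consumption (the Parquet writing the post
    # says still needs fixing).
    batch_df.write.mode("append").parquet("output/events")

(events.writeStream
    .foreachBatch(sink)
    .option("checkpointLocation", "output/checkpoints")
    .start()
    .awaitTermination())
```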
-
🚀 𝐌𝐚𝐬𝐭𝐞𝐫 𝐍𝐨𝐒𝐐𝐋 — 𝐓𝐡𝐞 𝐂𝐨𝐫𝐞 𝐒𝐤𝐢𝐥𝐥 𝐄𝐯𝐞𝐫𝐲 𝐃𝐚𝐭𝐚 𝐄𝐧𝐠𝐢𝐧𝐞𝐞𝐫 𝐌𝐮𝐬𝐭 𝐊𝐧𝐨𝐰 𝐢𝐧 𝟐𝟎𝟐𝟓!

💡 𝐖𝐡𝐲 𝐍𝐨𝐒𝐐𝐋 𝐌𝐚𝐭𝐭𝐞𝐫𝐬: In today’s world of big data, IoT, and real-time analytics, relational databases alone can’t handle the scale and flexibility modern systems demand. That’s where NoSQL databases — like MongoDB, Cassandra, Couchbase, and DynamoDB — shine by offering speed, scalability, and schema flexibility.

📘 𝐋𝐞𝐚𝐫𝐧𝐢𝐧𝐠 𝐑𝐨𝐚𝐝𝐦𝐚𝐩 (𝐒𝐭𝐞𝐩 𝐛𝐲 𝐒𝐭𝐞𝐩):
Step 1 → Understand what NoSQL is and how it differs from RDBMS (CAP theorem, schema-less data, scalability).
Step 2 → Learn the 4 types of NoSQL databases:
• Document (MongoDB)
• Key-Value (Redis, DynamoDB)
• Column (Cassandra, HBase)
• Graph (Neo4j)
Step 3 → Practice CRUD operations in MongoDB and Cassandra (see the sketch after this post).
Step 4 → Explore indexing, aggregation pipelines, and query optimization.
Step 5 → Learn data modeling principles for NoSQL (denormalization, embedding vs. referencing).
Step 6 → Deploy and scale NoSQL clusters on cloud platforms (AWS, Azure, GCP).
━━━━━━━━━━━━━━━━━━━
🎯 𝐑𝐞𝐚𝐥-𝐭𝐢𝐦𝐞 𝐈𝐧𝐭𝐞𝐫𝐯𝐢𝐞𝐰 𝐐𝐮𝐞𝐬𝐭𝐢𝐨𝐧𝐬:
1️⃣ When would you prefer NoSQL over a traditional relational database?
2️⃣ Explain the CAP theorem and how it applies to NoSQL systems.
3️⃣ How does MongoDB handle data relationships without foreign keys?
4️⃣ Describe partitioning and replication strategies in Cassandra.
5️⃣ How would you model a product catalog in a NoSQL database?
6️⃣ What are the trade-offs between consistency and availability?
━━━━━━━━━━━━━━━━━━━
💼 𝐏𝐫𝐚𝐜𝐭𝐢𝐜𝐚𝐥 𝐔𝐬𝐞 𝐂𝐚𝐬𝐞𝐬:
✅ E-commerce: Product catalog and user sessions in MongoDB.
✅ Social Media: Graph relationships and follower systems in Neo4j.
✅ Banking: Real-time transactions and fraud detection with Cassandra.
✅ IoT Analytics: Time-series data ingestion with DynamoDB or InfluxDB.
✅ Gaming: Leaderboards and real-time state tracking in Redis.
━━━━━━━━━━━━━━━━━━━
🎓 𝐅𝐫𝐞𝐞 𝐋𝐞𝐚𝐫𝐧𝐢𝐧𝐠 𝐑𝐞𝐬𝐨𝐮𝐫𝐜𝐞𝐬:
🔹 https://lnkd.in/gg34HxfY
🔹 https://lnkd.in/gi8DccJW
🔹 https://lnkd.in/g3vMJyqb
━━━━━━━━━━━━━━━━━━━
📺 𝐘𝐨𝐮𝐓𝐮𝐛𝐞 𝐋𝐞𝐚𝐫𝐧𝐢𝐧𝐠 𝐂𝐡𝐚𝐧𝐧𝐞𝐥𝐬 (𝐖𝐨𝐫𝐤𝐢𝐧𝐠 𝐋𝐢𝐧𝐤𝐬):
🎥 MongoDB Full Course for Beginners | Programming with Mosh https://lnkd.in/gV8xcV69
🎥 Cassandra Tutorial | Intellipaat https://lnkd.in/gdNV-ZjZ
🎥 Redis Crash Course | Traversy Media https://lnkd.in/g3M7dQaV
🎥 Neo4j Crash Course | Amigoscode https://lnkd.in/gRQdZCtJ

🔥 𝐐𝐮𝐢𝐜𝐤 𝐓𝐢𝐩: Don’t learn NoSQL in isolation — build small projects like a chat app, IoT dashboard, or product inventory. That’s how you’ll truly understand scalability and schema design!

❓ 𝐂𝐚𝐥𝐥 𝐭𝐨 𝐀𝐜𝐭𝐢𝐨𝐧: What’s your favorite NoSQL database — and why?

#DataEngineering #NoSQL #MongoDB #Cassandra #Redis #DynamoDB #BigData #DatabaseDesign #CloudComputing #CareerGrowth
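To make Step 3 concrete, here is a minimal PyMongo sketch of the four CRUD operations plus a Step 4-style aggregation. The connection string, database, and collection names are placeholders:

```python
# CRUD in MongoDB via PyMongo; db/collection names are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
products = client["shop"]["products"]

# Create
products.insert_one({"sku": "A-100", "name": "Keyboard",
                     "price": 49.99, "category": "peripherals"})

# Read
doc = products.find_one({"sku": "A-100"})

# Update
products.update_one({"sku": "A-100"}, {"$set": {"price": 44.99}})

# Delete
products.delete_one({"sku": "A-100"})

# Step 4 preview: an aggregation pipeline (average price per category).
for row in products.aggregate([
    {"$group": {"_id": "$category", "avg_price": {"$avg": "$price"}}}
]):
    print(row)
```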
-
Upgrading Apache Spark has historically been painful — breaking workloads, shifting APIs, and endless patching cycles. Databricks just changed that forever.

🙋‍♂️ Introducing Versionless Spark: a new architecture that delivers automatic, AI-powered Spark upgrades with zero code changes and seamless stability. Over the last 18 months, Databricks has auto‑upgraded 2+ billion workloads across 25 Databricks Runtime releases (including Spark 4), all without user intervention — an industry first.

🔩 How it works:
✅ Stable public Spark API via Spark Connect decouples client & server (see the sketch below).
✅ Environment versioning provides 3‑year supported base images for reproducibility.
✅ AI-powered Release Stability System (RSS) detects regressions, auto‑rolls back failing jobs, and ensures smooth recovery.

🥏 Only 0.000006% of jobs required rollback — all remediated and upgraded within 12 days.

This marks a major leap for data teams: faster access to new features, stronger reliability, and no maintenance overhead. “Versionless Spark = Continuous Innovation, Zero Friction.” Great future ahead!

#Databricks #ApacheSpark #DataEngineering #AI #Automation #BigData #VersionlessSpark
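For a sense of what that first point means in practice, here is a tiny PySpark sketch of a Spark Connect client. The endpoint is a generic placeholder, not a Databricks connection string:

```python
# Spark Connect client sketch: the client speaks a stable gRPC
# protocol, so the server-side Spark version can be upgraded without
# touching this code. Endpoint below is a placeholder.
# Requires: pip install "pyspark[connect]" (PySpark 3.4+)
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

df = spark.range(10).selectExpr("id", "id * 2 AS doubled")
df.show()  # executed on the remote server, results streamed back
```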
-
Just published my latest article diving deep into a question every Platform/DevOps engineer faces: "Should you run your production databases on Docker, or go Bare Metal/VM?"

The common wisdom of "spin it up quickly" works great for testing, but when I/O performance, persistence (data loss!), and observability are on the line, the abstraction layer of containers becomes a liability.

In this analysis, based on my real-world experience with Elasticsearch, Redis, and Cassandra, I share:
✅ The undeniable speed benefits of Docker for Dev/CI/CD.
❌ The critical I/O overhead and complex networking in production.
⚖️ My decision framework: Docker as the lab, Bare Metal/StatefulSet as the foundation.

Don't let "Container Deleted, Data Gone" be your next war story. Read the full technical breakdown below.

#DatabaseEngineering #DevOps #Docker #PlatformEngineering #Elasticsearch #Redis #SystemDesign
-
📣 We are thrilled to announce that Timeplus 3.0 is now generally available! This version of our single-binary, vectorized streaming SQL platform simplifies real-time data pipelines and scales efficiently to massive workloads. 50GB/s processing throughput. Zero replication. Zero lag.

𝑯𝒊𝒈𝒉𝒍𝒊𝒈𝒉𝒕𝒔:
✔️ Highly scalable, zero-replication shared-storage model
✔️ Zero-replication NativeLog, zero-replication query state checkpoint
✔️ End-to-end streaming: ETL, joins, aggregation, alerts, and tasks
✔️ Native Python UDF/UDAF support
✔️ Native connectors: Apache Kafka, Redpanda Data, Apache Pulsar, ClickHouse, Splunk, Elastic, MongoDB, #AmazonS3, Apache Iceberg
✔️ Light mode UI in Timeplus Console
✔️ #BYOC: Deploy real-time processing inside your own cloud
✔️ Updates in community version, Timeplus Proton 3.0
… and more

Read the announcement blog from our CEO, Ting Wang: https://lnkd.in/ggBr2sNA
👉 Try Timeplus Enterprise 3.0, free for 30 days: https://lnkd.in/gGPts8RT
👉 Try our community version, Timeplus Proton 3.0, free forever: https://lnkd.in/gJ8vV_8E
-
Timeplus 3.0 just launched, with features that came straight from real customer feedback to solve real challenges in stream processing.

Zero-replication architecture → Cut your storage costs in half. No more paying for 3x data copies across replicas. Plus, your ops team isn't managing replica sync anymore.

Native Python UDF/UDAF → Support your AI/ML workloads in the stream. Build real-time features, score transactions for fraud in real time, apply sentiment analysis on incoming messages, or run anomaly detection without moving data to a separate system.

BYOC (Bring Your Own Cloud) → Your data never leaves your AWS VPC. Makes data compliance straightforward since you control the data perimeter.

Native connectors for Kafka, S3, MongoDB, etc. → No more writing and maintaining custom integration code. Connect to your existing data sources in minutes instead of weeks.

End-to-end streaming with joins, aggregations, alerts → Build your entire pipeline in one place. One SQL query can read from Kafka, join with S3 reference data, aggregate, and trigger alerts.

Light mode UI → Your team working long hours actually gets a UI that doesn't cause eye strain. Accessibility matters for developer productivity.

50GB/s throughput, zero lag → Process every clickstream event, IoT sensor reading, or log entry without sampling or batching. Make decisions on complete data, not subsets.

Check out our CEO Ting's full announcement: https://lnkd.in/ggBr2sNA

#StreamProcessing #RealTimeAnalytics #DataEngineering #Timeplus
-
The long-awaited Timeplus 3.0 is finally here 🚀. It's been an incredible journey — built hand-in-hand with our customers and community. Their real-world use cases and constant feedback shaped what Timeplus has become today.

From day one, we made two big bets: our Multi-Raft NativeLog (which speaks SQL) and a single-binary architecture. Those decisions turned out to be clutch — especially for customers running high-performance streaming systems on edge or hybrid setups where low latency and local storage really matter.

But as we landed more customers, many users are now all-in on AWS, looking for elasticity and cost efficiency at scale. To tackle these new challenges, NativeLog now supports S3-based zero replication in Timeplus 3.0. That means you can mix MPP and shared-storage modes in one cluster — edge-grade performance where you need it, and cloud-native efficiency where you can afford to relax latency.

And my favorite part: 360° observability for Materialized Views — full visibility into processing lag, CPU, memory, disk, live execution DAGs, and workload skew in real time. It's one of those things that makes operating scaled pipelines genuinely less painful.

Timeplus 3.0 isn't just an update — it's the foundation for the next wave of fast, intelligent streaming analytics.
-
Having access to real-time, accurate data is critical for making agile business decisions. Traditional batch processing pipelines often struggle to keep up with evolving data schemas and continuous streams of change data capture (CDC) events.

Amazon MSK Serverless, Apache Iceberg, and AWS Glue streaming can solve these challenges with an automated schema evolution approach that adapts dynamically to changes in source databases without disruption. You can seamlessly capture real-time database changes using the Debezium MySQL connector and Amazon MSK Serverless, then stream and process the CDC events with AWS Glue jobs that update Apache Iceberg tables. This provides near real-time data synchronization in your data lake, with schema evolution that automatically handles new columns and schema changes.

#ApacheIceberg #ApacheKafka #MSK #Streaming #DataStreaming #Data #AWS https://lnkd.in/gMaigS6M
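A rough sketch of what the Glue streaming side of such a pipeline can look like. The bootstrap servers, topic, catalog, and table names are placeholders; MSK Serverless IAM auth options and Debezium envelope unwrapping are elided:

```python
# Sketch of a Glue streaming job upserting Debezium CDC into Iceberg.
# Assumes the job is configured with the Iceberg connector and a Glue
# catalog named "glue_catalog"; all names below are placeholders. A
# real job must unwrap the Debezium envelope (payload.before/after/op),
# handle deletes, and configure IAM auth for MSK Serverless.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, LongType, StringType

spark = SparkSession.builder.appName("cdc-to-iceberg").getOrCreate()

# Assumed flattened row shape after unwrapping the Debezium envelope.
row_schema = StructType([
    StructField("id", LongType()),
    StructField("name", StringType()),
    StructField("email", StringType()),
])

cdc = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "<msk-serverless-endpoint>:9098")
    .option("subscribe", "dbserver1.inventory.customers")  # Debezium-style topic
    .option("startingOffsets", "earliest")
    .load()
    .select(from_json(col("value").cast("string"), row_schema).alias("r"))
    .select("r.*")
)

def upsert(batch_df, batch_id):
    batch_df.createOrReplaceTempView("updates")
    # MERGE is provided by Iceberg's Spark SQL extensions.
    batch_df.sparkSession.sql("""
        MERGE INTO glue_catalog.analytics.customers t
        USING updates s
        ON t.id = s.id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)

(cdc.writeStream
    .foreachBatch(upsert)
    .option("checkpointLocation", "s3://<bucket>/checkpoints/customers")
    .start()
    .awaitTermination())
```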
-
AWS Serverless Data Pipeline Project

Just finished building my first end-to-end AWS Serverless Data Pipeline! Over the past few days, I wanted to understand how modern cloud data workflows are automated using serverless tools — so I built a small but complete pipeline using AWS Lambda, S3, Glue, Athena, and QuickSight.

Here's what I created 👇

🔹 Architecture Overview
Amazon S3 – stores both raw and processed data
AWS Lambda – automatically triggered on every new upload, cleans and transforms data
AWS Glue – catalogs processed data for easy querying
Amazon Athena – runs SQL queries directly on S3
Amazon QuickSight – visualizes data insights in interactive dashboards

💡 How it works (a sketch of the Lambda step follows this post)
1. Upload raw CSV data to the S3 raw bucket
2. Lambda gets triggered → cleans data using Python (pandas + boto3)
3. Writes the processed file back to a processed S3 bucket (partitioned by date)
4. Glue Crawler detects the schema
5. Athena queries the cleaned data
6. QuickSight creates dashboards like "Event Type vs Total Purchases"

🔍 Tech stack: AWS Lambda, S3, Glue, Athena, QuickSight, Python, IAM, CloudWatch

This project really helped me strengthen my understanding of:
- How serverless data workflows are orchestrated
- The importance of IAM permissions (least-privilege principle)
- How AWS services connect seamlessly in a data engineering context

💻 Check out the full project here:
🔗 https://lnkd.in/dy_69TyX

#AWS #CloudComputing #Serverless #DataEngineering #Lambda #S3 #Athena #Glue #QuickSight #CloudArchitecture #Python #AWSCertified
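A minimal sketch of what that S3-triggered Lambda could look like. The bucket name and cleaning rules are illustrative, not the project's actual code, and pandas must be supplied via a Lambda layer or container image:

```python
# Sketch of the S3-triggered cleaning Lambda. Bucket name and cleaning
# rules are illustrative; pandas is not in the default Lambda runtime
# and must come from a layer or container image.
import urllib.parse
from datetime import datetime, timezone
from io import StringIO

import boto3
import pandas as pd

s3 = boto3.client("s3")
PROCESSED_BUCKET = "my-processed-bucket"  # hypothetical name

def lambda_handler(event, context):
    # The S3 put event names the uploaded raw object.
    rec = event["Records"][0]["s3"]
    bucket = rec["bucket"]["name"]
    key = urllib.parse.unquote_plus(rec["object"]["key"])

    df = pd.read_csv(s3.get_object(Bucket=bucket, Key=key)["Body"])

    # Minimal cleaning: drop fully empty rows, normalize column names.
    df = df.dropna(how="all")
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

    # Write back partitioned by load date so Glue/Athena can prune.
    now = datetime.now(timezone.utc)
    out_key = (f"processed/year={now:%Y}/month={now:%m}/day={now:%d}/"
               f"{key.rsplit('/', 1)[-1]}")
    buf = StringIO()
    df.to_csv(buf, index=False)
    s3.put_object(Bucket=PROCESSED_BUCKET, Key=out_key, Body=buf.getvalue())
    return {"status": "ok", "rows": len(df)}
```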
-