I have been exploring the problem of Real-Time Ad Click Aggregation by combining insights from multiple resources and practical experience.

Some key takeaways include:
- Don’t block the user redirect path
- Streaming (Kafka/Kinesis) is the backbone
- Event-time processing is preferred over processing-time for accuracy
- Raw data stored in S3 enables replay, audit, and machine learning
- Replay pipelines are critical for billing correctness

Curious to know: would you choose Kafka or Kinesis for this use case?

#systemdesign #datastreaming #aws #bigdata #softwareengineering
Real-Time Ad Click Aggregation with Kafka/Kinesis
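To make the event-time and replay takeaways concrete, here is a minimal sketch in plain Python. It is not tied to Kafka or Kinesis; the `Click` record, the 60-second tumbling window, and the function names are all assumptions made for illustration. Each click is bucketed by the time it happened rather than the time it arrived, so a late event still lands in the right window, and replaying the same raw events in any order gives the same counts.

```python
from collections import defaultdict
from dataclasses import dataclass

WINDOW_SECONDS = 60  # assumed tumbling-window size, chosen for illustration


@dataclass(frozen=True)
class Click:
    """Hypothetical raw click record, as it might be archived for replay."""
    ad_id: str
    event_time: int  # epoch seconds at which the click actually happened


def window_start(ts: int, size: int = WINDOW_SECONDS) -> int:
    """Return the start of the tumbling window that contains ts."""
    return ts - (ts % size)


def aggregate_by_event_time(clicks):
    """Count clicks per (ad_id, window start), keyed on event time."""
    counts = defaultdict(int)
    for click in clicks:
        counts[(click.ad_id, window_start(click.event_time))] += 1
    return dict(counts)


clicks = [
    Click("ad-1", 5),
    Click("ad-1", 59),
    Click("ad-1", 61),
    Click("ad-2", 30),
    Click("ad-1", 10),  # arrives last, but still belongs to the window starting at 0
]

result = aggregate_by_event_time(clicks)

# Replaying the same raw events in a different arrival order gives the same counts.
replayed = aggregate_by_event_time(list(reversed(clicks)))
```

A processing-time version would bucket on arrival time instead, so the late `ad-1` click could be counted in the wrong minute. That is the accuracy gap the event-time takeaway refers to.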
More Relevant Posts
-
Everyone talks about streaming latency. Nobody talks about backfills.

Latency is the fun metric. Sub-minute, sub-second, real-time. It makes great marketing.

But here's what actually ruins your week: you need to resync a table with 2B rows, and your destination goes dark for 12 hours. Dashboards break. Downstream jobs fail. Your data team is fielding Slack messages asking "why is the data wrong?"

That's a backfill. And most replication tools treat it like an afterthought.

The standard playbook is brutal:
-> Find a maintenance window (good luck)
-> Pause CDC while the resync runs
-> Hope nothing breaks while your table is in a half-loaded state
-> Pray the WAL doesn't pile up and crash your source

AWS DMS literally recommends running resyncs during "periods of minimal source database activity." Translation: do it at 2am and hope for the best.

The crazy part is this problem was solved years ago for schema migrations. GitHub built gh-ost - dual-write to a staging table, then atomic swap. No downtime.

That same pattern works for backfills. Dual-write to a staging table while the live table keeps serving queries. Backfill into staging. Swap when it's caught up. Your consumers barely notice.

We built this into Artie and it changed how our customers think about resyncs entirely. They went from "schedule it for the weekend" to "just run it."

The unsexy problems are usually the ones worth solving.
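The staging-and-swap pattern described above can be sketched roughly as follows. This is an illustration only, not how Artie or gh-ost implement it: an in-memory SQLite database stands in for the destination, and the table names, the `apply_change` helper, and the sample rows are all hypothetical.

```python
import sqlite3

# Autocommit mode, so the swap transaction below is controlled explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)

# Live table that keeps serving queries; its contents are stale.
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.execute("INSERT INTO orders VALUES (1, 10)")

# Staging table that the backfill will load into.
conn.execute("CREATE TABLE orders_staging (id INTEGER PRIMARY KEY, amount INTEGER)")


def apply_change(row_id: int, amount: int) -> None:
    """Dual-write one incoming change to both the live and staging tables."""
    for table in ("orders", "orders_staging"):
        conn.execute(
            f"INSERT OR REPLACE INTO {table} VALUES (?, ?)", (row_id, amount)
        )


# A change arrives while the backfill is still running.
apply_change(2, 25)

# Backfill from an older source snapshot. OR IGNORE keeps any row the
# dual-write already delivered, so newer values are not overwritten.
snapshot = [(1, 11), (2, 20), (3, 30)]
conn.executemany("INSERT OR IGNORE INTO orders_staging VALUES (?, ?)", snapshot)

# Swap: rename staging into place inside a single transaction.
conn.execute("BEGIN")
conn.execute("ALTER TABLE orders RENAME TO orders_old")
conn.execute("ALTER TABLE orders_staging RENAME TO orders")
conn.execute("DROP TABLE orders_old")
conn.execute("COMMIT")

rows = conn.execute("SELECT id, amount FROM orders ORDER BY id").fetchall()
```

The design point worth noticing is the `OR IGNORE` on the backfill: the snapshot is older than the live change stream, so staging must prefer rows written by the dual-write path. A real warehouse would need its own equivalent of the transactional rename.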
-
Proud to have been part of this with such an amazing team. Check out how Hydrolix and Amazon Bedrock AgentCore come together to power agentic analytics for media streaming operations. And if you're ready to build, don't start from zero. Build from here using Kiro Powers. 🚀 Reference Solution: https://lnkd.in/djC4p87m Extend to a multi-agent system using Kiro Powers for AgentCore and Strands Agents SDK: https://kiro.dev/powers/ Osmar Bento da Silva Junior
AWS for M&E + Hydrolix = 💡 💥 🚀 Excited to share this post, detailing our collaboration with the Amazon Web Services (AWS) Media, Entertainment, Games & Sports team - providing customers with a ready-made reference architecture for intelligent agentic streaming operations. https://lnkd.in/eZVxh-5B Learn more about the collaboration, see the demo, and join the Builder's Lab this weekend at #NAB2026 - we'll be on the AWS booth, as well as having our own (#W2300A). We'll also be presenting a session with NVIDIA around AI-Powered CDN Monitoring at Dan Rayburn's Streaming Summit. In the agentic era, the data layer becomes the basis of differentiation. Plenty more to come from us here - watch this space!
-
https://lnkd.in/eT7rpNPs Learn how AWS and Hydrolix combined CDN Insights with Amazon Bedrock AgentCore to bring agentic AI to petabyte-scale streaming data. Great post from Rory McVicar and Franz Knupfer!
-
How does video understanding and AI-powered semantic search technology enable teams to unlock insights from massive media archives, accelerate storytelling, and create entirely new content formats? Find out in our talk with Moments Lab client Bloomberg, and Amazon Web Services (AWS), TODAY at NAB Show - see you in West Hall in the AWS Theater 👋 #NAB2026 #NABShow #VideoUnderstanding #SemanticSearch #MediaTechnology
-
I think we may be underestimating what Polars is becoming for data engineering. 👀

Reading the latest Polars update, what stands out to me is not just performance. It is the direction of the ecosystem. Streaming is improving. ⚙️ Streaming joins are getting more serious. Iceberg and Delta support are making the story more credible.

So the question is changing. It is no longer just: “Is Polars fast?” 🚀 It is becoming: “Where is Polars now the right tradeoff between performance, simplicity, and operational cost?” That is a much more interesting question.

Because not every pipeline needs a heavy distributed stack. And not every production workload should start with infrastructure built for maximum scale. There is a big space between fragile scripts and over-engineered platforms. 📦 Polars is starting to look very credible in that space.

To me, that is one of the most interesting shifts in data engineering right now. Curious how others see it: Is Polars still mainly an analyst tool to you, or already a real production building block? 💬

https://lnkd.in/epub-YWm
-
Redpanda Data Connect Breaks Down Data Silos with Salesforce Connectors, Streaming CDC for Oracle and DynamoDB https://ow.ly/7vXo50YGWMy #MarTech #MarketingTechnology #MarketingTech #AdTech
-
Most data pipelines run in batches. But what happens when your data cannot wait? I wrote a deep dive on real-time streaming in Databricks covering: • Structured Streaming • Watermarks • Windowing strategies • Exactly-once guarantees If you're preparing for data engineering roles or building real-time systems, this will help. Read here 👇 #DataEngineering #Databricks #Streaming
-
"TL;DR: DuckLake’s data inlining stores small updates directly in the catalog, eliminating the “small files problem” and making continuous streaming into data lakes practical. Our benchmark shows 926× faster queries and 105× faster ingestion when compared to Iceberg." https://lnkd.in/dWvZxNhD
-
How to build a streaming analytics pipeline with Terraform and Instaclustr-Part 2: Designing the complete data pipeline Learn how to design a streaming analytics pipeline with Terraform. Discover how Kafka, ClickHouse & Kafka Connect power real-time data insights.
-
Three years ago, most of us killed our DStv subscriptions. Felt great. Picked Showmax and Netflix. Felt smart.

Today your TV has nine streaming apps. You pay more than DStv ever cost. Half the time you just scroll.

Data infrastructure is running the exact same play. The "modern data stack" that emerged around 2020 was the streaming moment for enterprise data. Stop paying for a bloated Oracle contract. Buy exactly what you need. Fivetran for ingestion. Snowflake for storage. dbt for transformation. Power BI for dashboards. A fractional data engineer glued on top.

It was cheaper. On paper. Then came five contracts, five renewal cycles, five support queues, and a person whose whole job was making sure the handoffs didn't break at 2am. We've watched this play out across dozens of mid-market data teams.

Last October, Fivetran and dbt announced they're merging. Combined ARR near $600 million. Over 1,500 shared customers. Roughly 80% of Fivetran users were already paying for dbt.

This is the DStv bundle coming back. Streaming users called it the Disney Bundle. Data will call it something else.

For a 200-person company, the real takeaway isn't "pick the winning tool." It's that stitching infrastructure is a phase, not an architecture. Your team's job is insight. Not integration.
-