Understanding Data Warehousing Trends


Summary

Understanding data warehousing trends means staying up to date with the evolving ways organizations store, manage, and use large volumes of data for reporting and analysis. Data warehousing involves central repositories where information is processed so businesses can make smarter decisions, but the technology and approaches are changing rapidly—especially with the influence of cloud services and artificial intelligence.

  • Embrace cloud solutions: Consider shifting from old, rigid data warehouse models to modern cloud-based platforms that offer scalability and flexible storage options, helping your business adapt quickly as data needs change.
  • Adopt hybrid architectures: Explore lakehouse systems that combine the reliability of traditional warehouses with the cost savings and flexibility of data lakes, so you can manage both structured and unstructured data efficiently.
  • Integrate AI tools: Use artificial intelligence for tasks like data quality checks, anomaly detection, and automating routine processes, freeing up your team to focus on deeper business insights.
Summarized by AI based on LinkedIn member posts
  • View profile for Akhil Reddy

    Senior Data Engineer | AI & ML Data Infrastructure | Databricks, Snowflake, PySpark, Delta Lake, Unity Catalog | LLM Pipelines & GenAI Platforms | Kafka, dbt, Airflow | Azure, AWS, GCP |

    3,438 followers

    The data engineering landscape just shifted overnight. 90% of teams have no idea it's happening. January 2026 is wild. Here's what changed this week:

    1. Snowflake's "Cortex Search" - Goodbye vector databases
    Native semantic search in Snowflake. No Pinecone needed. That $15K/month vector DB bill? Now $500.

    2. dbt Copilot - AI writes your entire data model
    You: "Build customer LTV model"
    dbt: *generates 15 models with tests and docs*
    Junior engineers just became 10x more productive.

    3. Apache Iceberg 2.0 - Data lakehouse wars over
    → 100x faster metadata
    → Built-in row-level security
    → Time travel up to 1 year
    Your data lake finally works like a database.

    4. OpenAI's "Data Analyst GPT" - Threatens BI tools
    Ask: "Show revenue trends with anomaly detection"
    It:
    → Pulls data from warehouse
    → Runs analysis
    → Creates visualizations
    → Explains findings
    Looker and Tableau execs are sweating.

    5. Real-time is now the default
    → Snowflake Dynamic Tables: second-level updates
    → BigQuery streaming: 10x cheaper
    → Databricks real-time: GA
    If your dashboards refresh nightly, you're behind.

    6. ETL tools dying?
    OLD: Source → Fivetran → Warehouse → dbt → BI
    NEW: Source → Warehouse (direct) → AI Layer → Everywhere
    Cloud warehouses have native connectors now.

    7. Data quality shifted left
    → Great Expectations 2.0 validates at source
    → dbt contracts enforced at build
    → LLM-powered anomaly detection
    Bad data never enters your warehouse.

    8. The "Modern Data Stack" consolidated
    2024: 8-12 tools, $180K/year
    2026: 3-4 tools, $50K/year
    Why? Cloud warehouses absorbed everything.

    9. Data mesh actually works now
    50 data domains, zero central team. Works because:
    → Data contracts
    → AI validates compatibility
    → Automatic documentation
    → Self-service discovery

    10. Salaries are weird
    Junior roles: Down 20% (AI doing entry work)
    Senior roles: Up 40% (demand exploded)
    New titles:
    → AI Data Engineer
    → LLM Operations Engineer
    → Semantic Layer Architect

    The pattern: Every 6 months, something that cost $10K/month → $500/month. The industry is consolidating FAST.

    What to do NOW:
    ❌ Standalone vector DB → Try Snowflake Cortex
    ❌ Manual quality checks → Implement dbt contracts
    ❌ 10+ tools → Consolidate to 3-4

    Skills that matter less in 2026:
    → Knowing specific tools
    → Writing boilerplate code
    → Manual testing

    MORE valuable:
    → AI prompt engineering
    → Cost optimization
    → Architecture decisions
    → Business communication
    → Knowing when to keep it simple

    The prediction: By end of 2026:
    → 40% of roles AI-assisted
    → Team sizes shrink 30%
    → Salaries increase 50%
    → Title splits: "Platform Engineer" vs "AI Data Engineer"

    What I'm doing this month:
    1️⃣ Testing Snowflake Cortex (cut vector DB costs)
    2️⃣ Training team on dbt Copilot (3x productivity)
    3️⃣ Migrating to real-time (users love it)
    4️⃣ Consolidating from 8 tools to 4
    5️⃣ Mandatory AI fluency training

    The landscape changed in ONE WEEK. How fast are you adapting? What trends are you seeing? 👇

    #DataEngineering #AI #Snowflake #dbt #2026Trends

  • View profile for Dmitriy Braverman

    Data Architect @ Western Midstream

    7,680 followers

    🚀 The Evolution of Data Warehousing: From ETL to Lakehouse The data warehousing landscape has undergone a massive #transformation over the past few decades — driven by growing data volumes, the demand for agility, and the need for faster, more reliable insights. 🏛️ The Birth of the Enterprise Data Warehouse (EDW) 35–40 years ago, the Enterprise Data Warehouse (EDW) emerged as a centralized repository for reporting and analytics. * Data was integrated from multiple operational systems via #ETL (Extract → Transform → Load). * Tables were predefined, and transformations happened before loading — a #schema-on-write approach. * Reporting tools relied on consistent, structured, relational data. * This model prioritized #governance, #quality, and #reliability, but struggled with flexibility and scalability. 🌊 The Rise of the Data Lake About 15 years ago, the Data Lake emerged — first via Hadoop Distributed File System (#HDFS) and later through cloud-native object storage like #Amazon S3 and Azure Data Lake Storage (#ADLS). This era introduced two key shifts: * #ELT (Extract → Load → Transform) replaced traditional ETL, allowing more flexibility by performing transformations post-load. * A #schema-on-read approach enabled storing raw, #unstructured, or semi-structured data without enforcing a schema upfront. 🔻 Limitations of Classic Data Lakes Despite their flexibility and scalability, traditional data lakes had critical shortcomings: ❌ Lack of schema enforcement – Made it harder to manage and validate data. ❌ No ACID guarantees – Data consistency was not ensured in concurrent environments. ❌ No transactional consistency – No safe way to update or delete data without risks. As a result, data lakes were often unsuitable for BI, governance, or regulatory use cases. ☁️ The #Cloud #Data #Warehouse Era (2012- Present) To address the limitations of both EDWs and classic data lakes, cloud data warehouses emerged. 
They brought scalability, performance, and accessibility by leveraging cloud infrastructure. Key platforms include: * Snowflake * Google BigQuery * Azure Synapse Analytics * Amazon Redshift Key benefits: * Fully managed infrastructure * High performance and concurrency * Familiar #SQL interfaces However, these systems still had limitations, including closed formats, vendor lock-in, and cost challenges at extreme scale. 🏠 The Data Lakehouse: The Best of Both Worlds (2019 - Present) The Lakehouse architecture emerged as a hybrid solution, combining the cost-efficiency and flexibility of data lakes with the structure and reliability of data warehouses. Key components: * Open table formats like Apache Iceberg and Delta Lake * Open, scalable storage (e.g., S3, ADLS) * ACID transactions directly on the data lake * Query engines like #Presto, Trino, #Spark SQL, and Athena enable #SQL queries directly on lake data This unified architecture allows organizations to support #BI, data #engineering, #datascience, and #ML.
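
The schema-on-write versus schema-on-read distinction described in this post can be sketched in a few lines of Python. This is an illustrative sketch only: `ORDER_SCHEMA`, `write_with_schema`, and `read_with_schema` are hypothetical names, and plain lists stand in for a warehouse table and a lake of raw files.

```python
import json

# Hypothetical schema for illustration: field name -> required Python type.
ORDER_SCHEMA = {"order_id": int, "amount": float}


def write_with_schema(table: list, record: dict) -> None:
    """Schema-on-write: validate before storing and reject bad records."""
    for field, expected in ORDER_SCHEMA.items():
        if not isinstance(record.get(field), expected):
            raise ValueError(f"{field!r} must be {expected.__name__}")
    table.append(record)


def read_with_schema(raw_lines: list) -> list:
    """Schema-on-read: store anything, apply the schema at query time."""
    rows = []
    for line in raw_lines:
        doc = json.loads(line)
        try:
            rows.append({"order_id": int(doc["order_id"]),
                         "amount": float(doc["amount"])})
        except (KeyError, TypeError, ValueError):
            continue  # not usable for this query; the raw line stays stored
    return rows


# Warehouse style: the second record is rejected at load time.
warehouse = []
write_with_schema(warehouse, {"order_id": 1, "amount": 9.5})
try:
    write_with_schema(warehouse, {"order_id": "2", "amount": 3.0})
except ValueError as err:
    rejected = str(err)

# Lake style: both raw lines are stored; only one parses for this query.
lake = ['{"order_id": "2", "amount": "3.0"}', '{"note": "free text"}']
parsed = read_with_schema(lake)
```

The trade-off matches the post's description: the write path fails early and keeps the table clean, while the read path accepts everything and defers interpretation (and errors) to each consumer.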

  • View profile for Brij Kishore Pandey

    AI Architect & AI Engineer | Building Agentic Systems & Scalable AI Solutions

    727,405 followers

    Data Integration Revolution: ETL, ELT, Reverse ETL, and the AI Paradigm Shift

    In recent years, we've witnessed a seismic shift in how we handle data integration. Let's break down this evolution and explore where AI is taking us:

    1. ETL: The Reliable Workhorse
    Extract, Transform, Load - the backbone of data integration for decades.
    Why it's still relevant:
    • Critical for complex transformations and data cleansing
    • Essential for compliance (GDPR, CCPA) - scrubbing sensitive data pre-warehouse
    • Often the go-to for legacy system integration

    2. ELT: The Cloud-Era Innovator
    Extract, Load, Transform - born from the cloud revolution.
    Key advantages:
    • Preserves data granularity - transform only what you need, when you need it
    • Leverages cheap cloud storage and powerful cloud compute
    • Enables agile analytics - transform data on-the-fly for various use cases
    Personal experience: Migrating a financial services data pipeline from ETL to ELT cut processing time by 60% and opened up new analytics possibilities.

    3. Reverse ETL: The Insights Activator
    The missing link in many data strategies.
    Why it's game-changing:
    • Operationalizes data insights - pushes warehouse data to front-line tools
    • Enables data democracy - right data, right place, right time
    • Closes the analytics loop - from raw data to actionable intelligence
    Use case: E-commerce company using Reverse ETL to sync customer segments from their data warehouse directly to their marketing platforms, supercharging personalization.

    4. AI: The Force Multiplier
    AI isn't just enhancing these processes; it's redefining them:
    • Automated data discovery and mapping
    • Intelligent data quality management and anomaly detection
    • Self-optimizing data pipelines
    • Predictive maintenance and capacity planning
    Emerging trend: AI-driven data fabric architectures that dynamically integrate and manage data across complex environments.

    The Pragmatic Approach: In reality, most organizations need a mix of these approaches. The key is knowing when to use each:
    • ETL for sensitive data and complex transformations
    • ELT for large-scale, cloud-based analytics
    • Reverse ETL for activating insights in operational systems
    AI should be seen as an enabler across all these processes, not a replacement.

    Looking Ahead: The future of data integration lies in seamless, AI-driven orchestration of these techniques, creating a unified data fabric that adapts to business needs in real-time.

    How are you balancing these approaches in your data stack? What challenges are you facing in adopting AI-driven data integration?
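
To make the three patterns concrete, here is a minimal sketch that uses Python's built-in `sqlite3` as a stand-in for a warehouse. The table names, the `mask` helper, and the `marketing_audience` set are assumptions made up for illustration; a real pipeline would use a managed warehouse and dedicated integration tooling.

```python
import hashlib
import sqlite3

source_rows = [
    ("alice@example.com", "2024-01-05", 120.0),
    ("bob@example.com", "2024-01-05", 80.0),
    ("alice@example.com", "2024-01-06", 40.0),
]

conn = sqlite3.connect(":memory:")

# ELT: load raw rows untouched, then transform inside the warehouse with SQL.
conn.execute("CREATE TABLE raw_orders (email TEXT, day TEXT, amount REAL)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", source_rows)
conn.execute(
    "CREATE TABLE daily_revenue AS "
    "SELECT day, SUM(amount) AS revenue FROM raw_orders GROUP BY day"
)


def mask(email: str) -> str:
    """Hypothetical scrubbing step: replace an email with a stable pseudonym."""
    return hashlib.sha256(email.encode()).hexdigest()[:12]


# ETL: transform (here, scrub the sensitive field) before loading.
conn.execute(
    "CREATE TABLE clean_orders (customer_key TEXT, day TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO clean_orders VALUES (?, ?, ?)",
    [(mask(email), day, amount) for email, day, amount in source_rows],
)

# Reverse ETL: compute a segment in the warehouse and push it to an
# operational tool (a plain set stands in for a marketing platform).
segment = [
    row[0]
    for row in conn.execute(
        "SELECT customer_key FROM clean_orders "
        "GROUP BY customer_key HAVING SUM(amount) >= 100 "
        "ORDER BY customer_key"
    )
]
marketing_audience = set(segment)
```

In this sketch the ELT path keeps full granularity in `raw_orders`, while the ETL path guarantees that no raw email ever lands in `clean_orders`, which is the compliance argument the post makes.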

  • View profile for Mark Freeman II

    Building Trustworthy Agentic Systems | O’Reilly Author | LinkedIn Learning Instructor (39k+ students) | Translating deep technical expertise into developer demand for Pre-Seed to Series A startups.

    66,429 followers

    Is the data warehouse dying!? Early data warehouses were feats of engineering discipline. Racks of dedicated hardware stood behind locked doors, and significant thought went into the design of the data. That rigidity produced dependable reports, yet it also baked assumptions so deeply into schemas that adapting to a new market question felt like excavating bedrock. When infrastructure costs plummeted, due to the rise of cloud services, software teams began spinning up new services overnight, and the warehouse’s planned rigidity morphed into technical debt. A nightly batch that worked fine on ten tables balked at hundreds of micro-service streams. Suddenly, the speed of business outpaced the speed of schema redesign. Longevity has driven many enterprises to maintain carefully modeled warehouses (decades old) that must continue to serve them alongside new cloud technologies. Perhaps the "cheap" cloud compute and storage will enable organizations to adapt quickly, along with their core data, in their data warehouse. The opposite is happening: the cheaper it becomes to spin up new data sources, the faster that static model decays. Guardrails, therefore, cannot be bolted on after deployment; they need to be part of the entire data and software lifecycle. Now, please excuse my rage bait at the beginning of this post. Data warehouses are not dying... instead, they are evolving into a continuously governed, continuously tested backbone that thrives only when quality questions shift left into the development lifecycle, where data is created or sourced. That last sentence is why data teams must move beyond the comfort of their data warehouse best practices! Instead, they need to understand what's happening upstream with software engineers and why it's so hard to implement those data best practices. #DataEngineering #SoftwareEngineering #Cloud
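
One way to picture quality checks "shifting left" is a contract test that runs in the producing team's CI before a schema change ships. The sketch below is hypothetical: `Column`, `CONTRACT`, and `breaking_changes` are illustrative names, not part of any specific tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Column:
    name: str
    dtype: str
    nullable: bool = False


# Hypothetical contract agreed between the producing service and data consumers.
CONTRACT = [
    Column("user_id", "int"),
    Column("email", "str"),
    Column("signup_ts", "str", nullable=True),
]


def breaking_changes(contract: List[Column], proposed: List[Column]) -> List[str]:
    """List the ways a proposed schema would break downstream consumers."""
    proposed_by_name = {col.name: col for col in proposed}
    problems = []
    for col in contract:
        new = proposed_by_name.get(col.name)
        if new is None:
            problems.append(f"column removed: {col.name}")
        elif new.dtype != col.dtype:
            problems.append(f"type changed: {col.name} {col.dtype} -> {new.dtype}")
        elif new.nullable and not col.nullable:
            problems.append(f"became nullable: {col.name}")
    return problems


# A proposed change from the upstream team: retypes one column, drops
# another, and adds a new one (additions are treated as non-breaking).
proposed = [
    Column("user_id", "str"),
    Column("signup_ts", "str", nullable=True),
    Column("plan", "str"),
]
problems = breaking_changes(CONTRACT, proposed)
```

Failing the build when `problems` is non-empty puts the guardrail where the data is created, rather than in the warehouse after the fact.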

  • View profile for Dunith Danushka

    Technical Product Marketing at EDB | Author of “Practical Data Engineering with Apache Projects”

    6,838 followers

    💡There’s an interesting trend I observed with organizations recently: they are choosing to save money and simplify their operations by using slower but cheaper storage systems. This is especially true when they handle large amounts of data and sub-second latency isn't critical. Let’s find out what’s motivating this.

    Data loses its value over time. Once data becomes older and rarely accessed, real-time performance becomes less crucial. While developers need to access historical data for analysis, ad hoc queries, and compliance requirements, they can accept some latency. Their priority now shifts to storing this older data as cost-effectively and efficiently as possible.

    Compute-storage decoupling is something that we inherited from the Hadoop era, allowing storage systems to use tiered storage for improved cost-efficiency and scalability.

    ✳️ Object stores became the de facto tiered storage
    Amazon S3 was officially launched in 2006. Almost 20 years later and with trillions of objects stored, we now have reliable infinite storage. People started to call this cheap, infinitely scalable storage a Data Lake (or Lakehouse nowadays). For developers, it offers a simple path to disaster recovery. When you upload a file to S3, you immediately get eleven nines of durability: that's 99.999999999%. To put this in perspective: if you store 10,000 objects, you might lose just one in 10 million years.

    As object stores like S3 become more affordable, databases and OLAP systems have increasingly utilized deep object storage to enhance cost efficiency and durability. For example, PGAA, EDB’s analytics extension for Postgres, allows you to query hot data and cold data with a single dedicated node, ensuring optimal performance by automatically offloading cold data to columnar tables in object storage, reducing the complexity of managing analytics over multiple data tiers.

    ✳️ Not only databases, but streaming data platforms are evolving too
    Redpanda and WarpStream show how modern streaming platforms can save money while maintaining good performance. They do this by using a mix of fast local storage (SSDs) for quick access and cloud storage for most of their data, avoiding costly cross-AZ data transfers.

    ✳️ Why not make the object stores Iceberg compatible?
    That will transform simple storage solutions into powerful data management systems like data lakehouses. This compatibility brings essential features like schema evolution, time travel capabilities, ACID transactions, and performance optimizations, all while maintaining the cost benefits of object storage. This gives organizations the flexibility to choose their own query engine and catalog, making data platforms more modular and composable.
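
The "one object in 10 million years" figure in this post can be checked with a short calculation. The sketch assumes the eleven-nines figure is an annual, per-object durability and that object losses are independent; both are simplifying assumptions for the arithmetic, not statements about how any provider measures durability.

```python
from fractions import Fraction

# Eleven nines, read as the yearly probability that one object survives.
annual_durability = Fraction(99_999_999_999, 100_000_000_000)
annual_loss_per_object = 1 - annual_durability            # 1e-11

objects_stored = 10_000
expected_losses_per_year = objects_stored * annual_loss_per_object  # 1e-7

# Average number of years until one object is expected to be lost.
years_per_expected_loss = 1 / expected_losses_per_year

print(f"{float(annual_loss_per_object):.0e} loss probability per object-year")
print(f"about one lost object every {int(years_per_expected_loss):,} years")
```

Using `Fraction` keeps the arithmetic exact, so the result is not blurred by floating-point rounding at the eleventh decimal place.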

  • View profile for Ashish Joshi

    Engineering Director & Crew Architect @ UBS - Data & AI | Driving Scalable Data Platforms to Accelerate Growth, Optimize Costs & Deliver Future-Ready Enterprise Solutions | LinkedIn Top 1% Content Creator

    44,819 followers

    Most data strategies fail for one reason: They are built on outdated architecture assumptions. In 2026, the question is no longer “Do we need a data warehouse or a data lake?” That debate is already over. Modern data systems are composed, event-driven, and AI-aware. 𝐇𝐞𝐫𝐞 𝐢𝐬 𝐡𝐨𝐰 𝐥𝐞𝐚𝐝𝐢𝐧𝐠 𝐭𝐞𝐚𝐦𝐬 𝐚𝐫𝐞 𝐭𝐡𝐢𝐧𝐤𝐢𝐧𝐠 𝐚𝐛𝐨𝐮𝐭 𝐝𝐚𝐭𝐚 𝐚𝐫𝐜𝐡𝐢𝐭𝐞𝐜𝐭𝐮𝐫𝐞 𝐧𝐨𝐰: → 𝐖𝐚𝐫𝐞𝐡𝐨𝐮𝐬𝐞 𝐢𝐬 𝐬𝐭𝐢𝐥𝐥 𝐫𝐞𝐥𝐞𝐯𝐚𝐧𝐭 • Strong for governed analytics and reporting • But no longer the center of gravity → 𝐋𝐚𝐤𝐞 𝐢𝐬 𝐧𝐨𝐰 𝐟𝐨𝐮𝐧𝐝𝐚𝐭𝐢𝐨𝐧𝐚𝐥 • Cheap storage for raw and semi-structured data • Rarely used standalone → 𝐋𝐚𝐤𝐞𝐡𝐨𝐮𝐬𝐞 𝐡𝐚𝐬 𝐛𝐞𝐜𝐨𝐦𝐞 𝐝𝐞𝐟𝐚𝐮𝐥𝐭 • Combines storage + compute flexibility • Backbone for BI + AI workloads → 𝐒𝐭𝐫𝐞𝐚𝐦𝐢𝐧𝐠-𝐟𝐢𝐫𝐬𝐭 𝐢𝐬 𝐫𝐢𝐬𝐢𝐧𝐠 𝐟𝐚𝐬𝐭 • Real-time data is becoming the baseline • Critical for AI, personalization, fraud detection → 𝐊𝐚𝐩𝐩𝐚 𝐨𝐯𝐞𝐫 𝐋𝐚𝐦𝐛𝐝𝐚 • Treat everything as streams • Simpler operational model at scale → 𝐃𝐚𝐭𝐚 𝐌𝐞𝐬𝐡 (𝐨𝐫𝐠 𝐩𝐫𝐨𝐛𝐥𝐞𝐦, 𝐧𝐨𝐭 𝐣𝐮𝐬𝐭 𝐭𝐞𝐜𝐡) • Domain ownership of data products • Requires cultural and governance maturity → 𝐃𝐚𝐭𝐚 𝐅𝐚𝐛𝐫𝐢𝐜 (𝐜𝐨𝐧𝐭𝐫𝐨𝐥 𝐩𝐥𝐚𝐧𝐞 𝐭𝐡𝐢𝐧𝐤𝐢𝐧𝐠) • Metadata-driven integration across systems • Focus on governance + discoverability → 𝐄𝐯𝐞𝐧𝐭-𝐝𝐫𝐢𝐯𝐞𝐧 𝐚𝐫𝐜𝐡𝐢𝐭𝐞𝐜𝐭𝐮𝐫𝐞𝐬 • Decouple producers and consumers • Foundation for scalable, reactive systems → 𝐀𝐈-𝐧𝐚𝐭𝐢𝐯𝐞 𝐝𝐚𝐭𝐚 𝐬𝐭𝐚𝐜𝐤𝐬 • Vector DBs, feature stores, model pipelines • Data architecture now directly powers AI systems → 𝐂𝐨𝐦𝐩𝐨𝐬𝐚𝐛𝐥𝐞 𝐬𝐭𝐚𝐜𝐤 • Decoupled storage, compute, and serving • Avoid vendor lock-in, increase flexibility → 𝐑𝐞𝐯𝐞𝐫𝐬𝐞 𝐄𝐓𝐋 𝐜𝐥𝐨𝐬𝐞𝐬 𝐭𝐡𝐞 𝐥𝐨𝐨𝐩 • Push data back into operational systems • Turn insights into actions The shift is clear: Data architecture is no longer about where data lives. It is about how data flows, is governed, and creates value in real time. P.S. Which of these architectures is becoming central in your stack today? Follow Ashish Joshi for more insights
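
The "Kappa over Lambda" point, treating everything as streams, comes down to having one processing path for both history and live traffic. A minimal sketch, assuming an in-memory list stands in for a durable, replayable log and `apply_event` is a hypothetical handler:

```python
def apply_event(totals: dict, event: dict) -> None:
    """The single processing path used for both replay and live traffic."""
    user = event["user"]
    totals[user] = totals.get(user, 0.0) + event["amount"]


# An append-only list stands in for a durable, replayable log (an assumption).
log = [
    {"user": "u1", "amount": 20.0},
    {"user": "u2", "amount": 5.0},
]

# Catch up on history by replaying the log from the start.
totals = {}
for event in log:
    apply_event(totals, event)

# A live event goes through exactly the same function.
live_event = {"user": "u1", "amount": 2.5}
log.append(live_event)
apply_event(totals, live_event)

# Reprocessing (for example after a logic change) is another replay
# into fresh state, not a separate batch code path.
rebuilt = {}
for event in log:
    apply_event(rebuilt, event)
```

A Lambda-style design would maintain a separate batch job alongside the streaming job; here the replay loop and the live path share one function, which is the operational simplification the post refers to.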

  • View profile for Akshay T.

    Azure 15X | GCP | Alteryx | Power BI | Data Engineer | Microsoft Fabric | DataBricks | Data Lake | Data Pipelines | Data Warehousing | CI/CD | PySpark | SQL | [Views Are Personal]

    28,490 followers

    𝑻𝒉𝒆 𝑬𝒗𝒐𝒍𝒖𝒕𝒊𝒐𝒏 𝒐𝒇 𝑫𝒂𝒕𝒂 𝑨𝒓𝒄𝒉𝒊𝒕𝒆𝒄𝒕𝒖𝒓𝒆𝒔: 𝑭𝒓𝒐𝒎 𝑾𝒂𝒓𝒆𝒉𝒐𝒖𝒔𝒆𝒔 𝒕𝒐 𝑳𝒂𝒌𝒆𝒉𝒐𝒖𝒔𝒆𝒔 (1980𝒔-2020) The journey of enterprise data architectures tells a fascinating story about how businesses have adapted to handle ever-growing volumes and varieties of data. Let me walk you through this remarkable evolution that spans four decades: 𝐋𝐚𝐭𝐞 𝟏𝟗𝟖𝟎𝐬: 𝐓𝐡𝐞 𝐃𝐚𝐭𝐚 𝐖𝐚𝐫𝐞𝐡𝐨𝐮𝐬𝐞 𝐄𝐫𝐚 The traditional data warehouse emerged as enterprises needed centralized repositories for their structured data. The architecture was elegantly simple: - Data flowed through a classic ETL process - Information was first extracted and loaded into staging areas - Transformation happened within the warehouse environment - Department-specific data marts provided tailored views - The focus was on structured data and batch processing This approach worked brilliantly for its time, providing a single source of truth that enabled consistent reporting across the organization. 𝐋𝐚𝐭𝐞 𝟐𝟎𝟎𝟎𝐬: 𝐓𝐡𝐞 𝐑𝐢𝐬𝐞 𝐨𝐟 𝐃𝐚𝐭𝐚 𝐋𝐚𝐤𝐞𝐬 As data volumes exploded and unstructured data became increasingly valuable, data lakes emerged with technologies like Apache Spark leading the charge: - Distributed storage and computation became essential - Enterprise departments gained individual access - The architecture supported a wider variety of data types - ELT (Extract-Load-Transform) processes became more common - More users could directly interact with the data This democratization of data access was revolutionary, allowing organizations to store vast amounts of raw data for later discovery and analysis. 
𝐌𝐢𝐝 𝟐𝟎𝟏𝟎𝐬: 𝐓𝐡𝐞 𝐃𝐚𝐭𝐚 𝐅𝐚𝐛𝐫𝐢𝐜 𝐀𝐩𝐩𝐫𝐨𝐚𝐜𝐡 The need to combine the best of both worlds led to the data fabric concept: - Modern data warehouses connected with data lakes - Big data compute engines handled transformations - Data lakes evolved with distinct raw, query, and report layers - Real-time processing capabilities were integrated - Organizations could process both historical and streaming data This hybrid approach recognized that different data needs required different tools and architectures working together seamlessly. 𝟐𝟎𝟐𝟎: 𝐓𝐡𝐞 𝐃𝐚𝐭𝐚 𝐋𝐚𝐤𝐞𝐡𝐨𝐮𝐬𝐞 & 𝐃𝐞𝐥𝐭𝐚 𝐋𝐚𝐤𝐞 The most recent evolution brings us the data lakehouse and delta lake concepts: - Big data compute engines sit at the heart of these architectures - Transformations happen before data lands in structured layers - The raw-query-report layering provides both flexibility and structure - The architecture combines data warehouse reliability with data lake flexibility - Organizations gain both governance and agility in a single architecture This convergence represents a maturation of our understanding that organizations need both the structure of warehouses and the flexibility of lakes. Are you still operating with legacy architectures, or have you embraced the latest approaches? #DataArchitecture #DataWarehouse #DataLake

  • View profile for Ravena O

    AI Researcher and Data Leader | Healthcare Data | GenAI | Driving Business Growth | Data Science Consultant | Data Strategy

    93,203 followers

    Still building data platforms without clear design patterns? That’s where most pipelines break. This visual is a powerful reminder that data engineering isn’t about tools — it’s about patterns. Modern data systems scale not because of Spark, Snowflake, or Kafka… They scale because the right architectural patterns are applied at the right time. 🧩 What this image breaks down clearly 🔹 Ingestion Design Patterns • Batch ingestion for cost-efficient historical loads • Streaming ingestion for real-time use cases • CDC for low-latency, low-impact data movement 🔹 Storage Design Patterns • Data Lake for raw, flexible storage • Data Warehouse for curated analytics • Lakehouse for combining flexibility + performance 🔹 Transformation Patterns • ETL for schema-first, compliance-heavy systems • ELT for agile analytics and scalability • Incremental processing to avoid reprocessing everything 🔹 Orchestration & Workflow • DAG-based pipelines for complex dependencies • Event-driven pipelines for real-time architectures 🔹 Reliability & Fault Tolerance • Idempotent pipelines (safe re-runs) • Retry & dead-letter queues • Backfill patterns for safe historical reprocessing 🔹 Data Quality & Governance • Validation checks (nulls, ranges, constraints) • Schema evolution without breaking consumers • Data lineage for trust, debugging, and compliance 🔹 Serving & Consumption • Semantic layers to abstract complexity • API-based serving instead of direct table access 🔹 Performance & Scalability • Partitioning for faster queries • Caching to reduce compute and latency 🔹 Cost Optimization • Tiered storage for retention compliance • On-demand compute to avoid idle spend 🎯 Why this matters If you’re: • Designing a modern data platform • Scaling analytics for multiple teams • Migrating to cloud or lakehouse • Building real-time or AI-ready pipelines 👉 These patterns matter more than any single tool choice. 📌 Bookmark this. 📤 Share it with your data team. 
Question for you: Which of these patterns has saved you the most pain in production — and which one do teams usually ignore until it’s too late? #DataEngineering #DataArchitecture #AnalyticsEngineering #BigData #CloudData #ModernDataStack #Lakehouse #DataGovernance
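
Several of the reliability patterns listed in this post (validation checks, retries, dead-letter queues, idempotent re-runs) fit in one small sketch. Everything here is hypothetical scaffolding: `FlakySink` is an in-memory stand-in for a warehouse table that fails once, and `validate` and `run` are illustrative names.

```python
class FlakySink:
    """In-memory stand-in for a target table whose first write fails."""

    def __init__(self):
        self.rows = {}
        self.failures_left = 1

    def upsert(self, key, row):
        if self.failures_left > 0:
            self.failures_left -= 1
            raise ConnectionError("transient failure")
        self.rows[key] = row  # keyed write: a re-run overwrites, never duplicates


def validate(event):
    """Validation checks: non-null key and an amount within range."""
    if event.get("id") is None:
        raise ValueError("missing id")
    amount = event.get("amount")
    if not isinstance(amount, (int, float)) or not 0 <= amount <= 1_000_000:
        raise ValueError("amount out of range")


def run(events, sink, dead_letters, max_attempts=3):
    for event in events:
        try:
            validate(event)
        except ValueError as err:
            dead_letters.append((event, str(err)))  # bad data: no retry
            continue
        for attempt in range(1, max_attempts + 1):
            try:
                sink.upsert(event["id"], event)
                break
            except ConnectionError:
                if attempt == max_attempts:  # retries exhausted
                    dead_letters.append((event, "write failed"))


events = [
    {"id": 1, "amount": 10.0},
    {"id": None, "amount": 5.0},
    {"id": 2, "amount": -3.0},
    {"id": 3, "amount": 7.5},
]
sink, dead_letters = FlakySink(), []
run(events, sink, dead_letters)

# Re-running the same batch (a backfill) leaves the table unchanged.
rows_after_first_run = dict(sink.rows)
run(events, sink, [])
```

The design choice worth noting is the split between error types: invalid records go straight to the dead-letter list because retrying cannot fix them, while transient write failures are retried up to `max_attempts`.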

  • View profile for David Yaffe

    Co-Founder at Estuary, Previously Co-Founder of Arbor (Acquired by LiveRamp)

    18,731 followers

    Is the Modern Data Stack dying, or turning inside out?

    A funny thing happened a while back. Snowflake added change data capture support. We support it now, as customers asked to extend their basic ELT pipelines by pushing to new destinations like databases, vector DBs, SaaS, and other compute engines, to process data not just for analytics but also for real-time operations or for AI model training and execution.

    About the same time the modern data stack was taking off, Martin Kleppmann was talking about turning the database inside out, which was a big inspiration for us when we created Estuary Flow. Think of a database as something that keeps (mutable) state. A data warehouse is more like a collection of immutable facts; it’s meant to keep history and is read-optimized.

    But what if you focused instead on working with historical data or events, changes, or facts, as they arrive? For a database, that’s a transaction or write-ahead log (WAL). Change data capture exposes that stream from a database, turning the database inside out. It’s possible to store that stream as a new log and keep adding to it forever. Joining streams together to form new ones enables real-time materialized views, and it’s possible to create whatever pre-computed state you want, enabling arbitrary views. Kleppmann talks about replication, secondary indexes, caching, and materialized views all as derived, up-to-date real-time “inside-out” views of data optimized for specific queries.

    This is exactly what’s happening to the modern data stack. It’s starting to go real-time and turning inside out. Materialized views have already been happening, as have caches. Snowflake, Databricks and data lakes, Amazon Athena, Starburst, and others can be used for data processing. They’re not quite real-time, but newer entrants like Materialize can provide a real-time materialized view.

    Back in 2014 the Gazette open source project was created to manage streams and batch data together as data with schema, inside out. It eventually became the foundation of Estuary for real-time ETL and CDC. A collection is a durable, append-only cloud store of a stream with exactly-once transactionally guaranteed delivery, just like a WAL. But you can also create new derived views with state called, you guessed it, derivations. These derivations are created by compute engines using SQL, TypeScript, and (soon) Python. Companies use them to do all kinds of processing for data warehouses, but also for real-time operational analytics, search, or processing data for AI. You can connect to many sources, streaming or batch, and to many targets (a data warehouse, Elastic, MongoDB, or hundreds of others), streaming or batch.

    The modern data stack isn’t dead. Like Jurassic Park and other software, it has found a way. It’s evolving and turning inside out, becoming compute engines with state, all wired together as streaming data using something like Estuary, to support real-time analytics and AI use cases.
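
The idea of deriving a materialized view from a change stream can be sketched as follows. This is a generic illustration, not Estuary's or any database's actual API: the change-record shape (`op`, `key`, `after`) and the `apply_change` function are assumptions made up for the example.

```python
from collections import defaultdict

# A hypothetical CDC stream: each record carries the row image after the change.
change_log = [
    {"op": "insert", "key": 1, "after": {"customer": "a", "total": 10}},
    {"op": "insert", "key": 2, "after": {"customer": "b", "total": 5}},
    {"op": "update", "key": 1, "after": {"customer": "a", "total": 25}},
    {"op": "delete", "key": 2, "after": None},
    {"op": "insert", "key": 3, "after": {"customer": "a", "total": 7}},
]


def apply_change(table: dict, view: dict, change: dict) -> None:
    """Apply one change to the replica table and update the view incrementally."""
    before = table.get(change["key"])
    if before is not None:
        view[before["customer"]] -= before["total"]  # retract the old row
    after = change["after"]
    if after is None:
        table.pop(change["key"], None)               # delete
    else:
        table[change["key"]] = after                 # insert or update
        view[after["customer"]] += after["total"]


table = {}                        # replica of the source table (mutable state)
revenue_by_customer = defaultdict(int)  # derived view, always up to date
for change in change_log:
    apply_change(table, revenue_by_customer, change)

# The same view recomputed from scratch, to show the two agree.
recomputed = defaultdict(int)
for row in table.values():
    recomputed[row["customer"]] += row["total"]
```

Because the log is the source of truth, the view never needs a full rescan: each change retracts the old contribution and adds the new one, which is what keeps the derived state current as events arrive.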

  • View profile for Aditi Jain

    Co-Founder of The Ravit Show | Data & Generative AI | Media & Marketing for Data & AI Companies | Community Evangelist | ACCA |

    76,479 followers

    “Data 3.0 in the Lakehouse era,” using this map as a guide.

    Data 3.0 is composable. Open formats anchor the system, metadata is the control plane, orchestration glues it together, and AI use cases shape choices.

    Ingestion & Transformation - Pipelines are now products, not scripts. Fivetran, Airbyte, Census, dbt, Meltano and others standardize ingestion. Orchestration tools like Prefect, Flyte, Dagster and Airflow keep things moving, while Kafka, Redpanda and Flink show that streaming is no longer a sidecar but central to both analytics and AI.

    Storage & Formats - Object storage has become the system of record. Open file and table formats (Parquet, Iceberg, Delta, Hudi) are the backbone. Warehouses (Snowflake, Firebolt) and lakehouses (Databricks, Dremio) co-exist, while vector databases sit alongside because RAG and agents demand fast recall.

    Metadata as Control - This is where teams succeed or fail. Unity Catalog, Glue, Polaris and Gravitino act as metastores. Catalogs like Atlan, Collibra, Alation and DataHub organize context. Observability tools (Telmai, Anomalo, Monte Carlo, Acceldata) make trust scalable. Without this layer, you might have a modern-looking stack that still behaves like 2015.

    Compute & Query Engines - The right workload drives the choice: Spark and Trino for broad analytics, ClickHouse for throughput, DuckDB/MotherDuck for frictionless exploration, and Druid/Imply for real-time. ML workloads lean on Ray, Dask and Anyscale. Cost tools like Sundeck and Bluesky matter because economics matter more than logos.

    Producers vs Consumers - The left half builds, the right half uses. Treat datasets, features and vector indexes as products with owners and SLOs. That mindset shift matters more than picking any single vendor.

    Trends I see
    • Batch and streaming are converging around open table formats.
    • Catalogs are evolving into enforcement layers for privacy and quality.
    • Orchestration is getting simpler while CI/CD for data is getting more rigorous.
    • AI sits on the same foundation as BI and data science, not a separate stack.

    This is my opinion of how the space is shaping up. Use this to reflect on your own stack, simplify, standardize, and avoid accidental complexity!!!!

    ----
    ✅ I post real stories and lessons from data and AI. Follow me and join the newsletter at www.theravitshow.com
