Most data strategies fail for one reason: They are built on outdated architecture assumptions.

In 2026, the question is no longer “Do we need a data warehouse or a data lake?” That debate is already over. Modern data systems are composed, event-driven, and AI-aware.

𝐇𝐞𝐫𝐞 𝐢𝐬 𝐡𝐨𝐰 𝐥𝐞𝐚𝐝𝐢𝐧𝐠 𝐭𝐞𝐚𝐦𝐬 𝐚𝐫𝐞 𝐭𝐡𝐢𝐧𝐤𝐢𝐧𝐠 𝐚𝐛𝐨𝐮𝐭 𝐝𝐚𝐭𝐚 𝐚𝐫𝐜𝐡𝐢𝐭𝐞𝐜𝐭𝐮𝐫𝐞 𝐧𝐨𝐰:

→ 𝐖𝐚𝐫𝐞𝐡𝐨𝐮𝐬𝐞 𝐢𝐬 𝐬𝐭𝐢𝐥𝐥 𝐫𝐞𝐥𝐞𝐯𝐚𝐧𝐭
• Strong for governed analytics and reporting
• But no longer the center of gravity

→ 𝐋𝐚𝐤𝐞 𝐢𝐬 𝐧𝐨𝐰 𝐟𝐨𝐮𝐧𝐝𝐚𝐭𝐢𝐨𝐧𝐚𝐥
• Cheap storage for raw and semi-structured data
• Rarely used standalone

→ 𝐋𝐚𝐤𝐞𝐡𝐨𝐮𝐬𝐞 𝐡𝐚𝐬 𝐛𝐞𝐜𝐨𝐦𝐞 𝐝𝐞𝐟𝐚𝐮𝐥𝐭
• Combines storage + compute flexibility
• Backbone for BI + AI workloads

→ 𝐒𝐭𝐫𝐞𝐚𝐦𝐢𝐧𝐠-𝐟𝐢𝐫𝐬𝐭 𝐢𝐬 𝐫𝐢𝐬𝐢𝐧𝐠 𝐟𝐚𝐬𝐭
• Real-time data is becoming the baseline
• Critical for AI, personalization, fraud detection

→ 𝐊𝐚𝐩𝐩𝐚 𝐨𝐯𝐞𝐫 𝐋𝐚𝐦𝐛𝐝𝐚
• Treat everything as streams
• Simpler operational model at scale

→ 𝐃𝐚𝐭𝐚 𝐌𝐞𝐬𝐡 (𝐨𝐫𝐠 𝐩𝐫𝐨𝐛𝐥𝐞𝐦, 𝐧𝐨𝐭 𝐣𝐮𝐬𝐭 𝐭𝐞𝐜𝐡)
• Domain ownership of data products
• Requires cultural and governance maturity

→ 𝐃𝐚𝐭𝐚 𝐅𝐚𝐛𝐫𝐢𝐜 (𝐜𝐨𝐧𝐭𝐫𝐨𝐥 𝐩𝐥𝐚𝐧𝐞 𝐭𝐡𝐢𝐧𝐤𝐢𝐧𝐠)
• Metadata-driven integration across systems
• Focus on governance + discoverability

→ 𝐄𝐯𝐞𝐧𝐭-𝐝𝐫𝐢𝐯𝐞𝐧 𝐚𝐫𝐜𝐡𝐢𝐭𝐞𝐜𝐭𝐮𝐫𝐞𝐬
• Decouple producers and consumers
• Foundation for scalable, reactive systems

→ 𝐀𝐈-𝐧𝐚𝐭𝐢𝐯𝐞 𝐝𝐚𝐭𝐚 𝐬𝐭𝐚𝐜𝐤𝐬
• Vector DBs, feature stores, model pipelines
• Data architecture now directly powers AI systems

→ 𝐂𝐨𝐦𝐩𝐨𝐬𝐚𝐛𝐥𝐞 𝐬𝐭𝐚𝐜𝐤
• Decoupled storage, compute, and serving
• Avoid vendor lock-in, increase flexibility

→ 𝐑𝐞𝐯𝐞𝐫𝐬𝐞 𝐄𝐓𝐋 𝐜𝐥𝐨𝐬𝐞𝐬 𝐭𝐡𝐞 𝐥𝐨𝐨𝐩
• Push data back into operational systems
• Turn insights into actions

The shift is clear: Data architecture is no longer about where data lives. It is about how data flows, is governed, and creates value in real time.

P.S. Which of these architectures is becoming central in your stack today?

Follow Ashish Joshi for more insights
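The "decouple producers and consumers" and "treat everything as streams" points can be made concrete with a small in-memory sketch. This is only an illustration under stated assumptions: `EventLog`, `Consumer`, the handler functions, and the fraud threshold are hypothetical names and values, and a production system would use a durable log service rather than a Python list.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable

Event = dict[str, Any]


@dataclass
class EventLog:
    """Append-only log. Producers only append; they never call consumers."""
    events: list[Event] = field(default_factory=list)

    def append(self, event: Event) -> int:
        self.events.append(event)
        return len(self.events) - 1  # offset of the event just written

    def read_from(self, offset: int) -> list[Event]:
        return self.events[offset:]


@dataclass
class Consumer:
    """Keeps its own offset, so each consumer reads at its own pace."""
    handler: Callable[[Event], None]
    offset: int = 0

    def poll(self, log: EventLog) -> int:
        batch = log.read_from(self.offset)
        for event in batch:
            self.handler(event)
        self.offset += len(batch)
        return len(batch)


log = EventLog()
spend_by_user: dict[str, float] = {}
flagged_ids: list[str] = []


def track_spend(event: Event) -> None:
    user = event["user"]
    spend_by_user[user] = spend_by_user.get(user, 0.0) + event["amount"]


def flag_large_payment(event: Event) -> None:
    if event["amount"] > 1000:  # hypothetical fraud threshold
        flagged_ids.append(event["id"])


analytics = Consumer(track_spend)
fraud = Consumer(flag_large_payment)

# The producer writes events without knowing who will read them.
log.append({"id": "t1", "user": "ana", "amount": 40.0})
log.append({"id": "t2", "user": "ben", "amount": 2500.0})
log.append({"id": "t3", "user": "ana", "amount": 60.0})

analytics.poll(log)
fraud.poll(log)

# Kappa-style reprocessing: a new consumer replays the same log from offset 0.
replayed_ids: list[str] = []
replay = Consumer(lambda event: replayed_ids.append(event["id"]))
replay.poll(log)

print(spend_by_user)  # {'ana': 100.0, 'ben': 2500.0}
print(flagged_ids)    # ['t2']
print(replayed_ids)   # ['t1', 't2', 't3']
```

Because each consumer owns its offset, adding a new use case means adding a reader, not changing the producer, and "reprocessing" is just reading the log again from the start.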
Storage Architecture Trends in Data Centers
Summary
Storage architecture trends in data centers refer to the evolving ways organizations design and manage the systems that store, move, and process vast amounts of data. Recent developments focus on flexible frameworks that support real-time analytics, AI applications, and cost savings by blending traditional data warehouses, data lakes, and new hybrid solutions like “lakehouses.”
- Adopt hybrid models: Consider combining flexible storage formats and compute engines to manage a variety of data needs, from analytics to machine learning, while controlling costs.
- Use tiered storage: Store frequently accessed data on faster drives and move older or less critical information to more affordable storage layers to save money and improve disaster recovery options.
- Prioritize intelligent control: Implement systems that use metadata and automation to organize, monitor, and govern data across all storage platforms, making data easier to find and trust.
-
💡 There’s an interesting trend I've observed with organizations recently: they are choosing to save money and simplify their operations by using slower but cheaper storage systems. This is especially true when they handle large amounts of data and sub-second latency isn't critical. Let’s find out what’s motivating this.

Data loses its value over time. Once data becomes older and rarely accessed, real-time performance becomes less crucial. Developers still need to access historical data for analysis, ad hoc queries, and compliance requirements, but they can accept some latency. Their priority shifts to storing this older data as cost-effectively and efficiently as possible.

Compute-storage decoupling is something that we inherited from the Hadoop era, allowing storage systems to use tiered storage for improved cost-efficiency and scalability.

✳️ Object stores became the de facto tiered storage

Amazon S3 was officially launched in 2006. Almost 20 years later and with trillions of objects stored, we now have reliable, effectively infinite storage. People started to call this cheap, infinitely scalable storage a Data Lake (or Lakehouse nowadays). For developers, it offers a simple path to disaster recovery. When you upload a file to S3, you immediately get eleven nines of durability: that's 99.999999999%. To put this in perspective: if you store 10,000 objects, you can expect to lose, on average, just one every 10 million years.

As object stores like S3 become more affordable, databases and OLAP systems have increasingly utilized deep object storage to enhance cost efficiency and durability. For example, PGAA, EDB’s analytics extension for Postgres, allows you to query hot data and cold data with a single dedicated node. It maintains performance by automatically offloading cold data to columnar tables in object storage, which reduces the complexity of managing analytics over multiple data tiers.
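The "one object in 10 million years" figure follows directly from the eleven-nines number. A minimal check, assuming the durability figure is an annual per-object rate and that losses are independent (the function name is made up for this sketch):

```python
from fractions import Fraction

# Eleven nines of durability means an annual loss probability of
# 1 - 0.99999999999 = 1e-11 per object (assumed annual and independent).
ANNUAL_LOSS_PROBABILITY = Fraction(1, 10**11)


def expected_years_per_lost_object(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    expected_losses_per_year = object_count * ANNUAL_LOSS_PROBABILITY
    return 1 / expected_losses_per_year


print(expected_years_per_lost_object(10_000))      # 10000000
print(expected_years_per_lost_object(10_000_000))  # 10000
```

Exact fractions are used instead of floats so the result is not blurred by rounding error at eleven decimal places. The same formula shows how the expectation scales: a thousand times more objects means a loss a thousand times more often.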
✳️ Not only databases, but streaming data platforms are evolving too

Redpanda and WarpStream show how modern streaming platforms can save money while maintaining good performance. They do this by using a mix of fast local storage (SSDs) for quick access and cloud storage for most of their data, avoiding costly cross-AZ data transfers.

✳️ Why not make the object stores Iceberg compatible?

That will transform simple storage solutions into powerful data management systems like data lakehouses. This compatibility brings essential features like schema evolution, time travel capabilities, ACID transactions, and performance optimizations, all while maintaining the cost benefits of object storage. This gives organizations the flexibility to choose their own query engine and catalog, making data platforms more modular and composable.
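The hot/cold split described in this post can be sketched as an age-based placement rule. Everything specific here is an assumption for illustration: the 7-day hot window, the tier names, the `Segment` type, and the per-GB prices are placeholders, not real vendor settings or price lists.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

HOT_WINDOW = timedelta(days=7)  # hypothetical cutoff for "recently accessed"
# Placeholder prices per GB-month, chosen only to make the arithmetic visible.
PRICE_PER_GB_MONTH = {"local_ssd": 0.10, "object_store": 0.02}


@dataclass(frozen=True)
class Segment:
    name: str
    size_gb: float
    last_accessed: datetime


def choose_tier(segment: Segment, now: datetime) -> str:
    """Keep recently accessed data on fast storage, move the rest to cheap storage."""
    age = now - segment.last_accessed
    return "local_ssd" if age <= HOT_WINDOW else "object_store"


def monthly_cost(segments: list[Segment], now: datetime) -> float:
    return sum(s.size_gb * PRICE_PER_GB_MONTH[choose_tier(s, now)] for s in segments)


now = datetime(2025, 1, 31, tzinfo=timezone.utc)
segments = [
    Segment("orders-2025-01-30", 50.0, now - timedelta(days=1)),
    Segment("orders-2024-06", 900.0, now - timedelta(days=200)),
]

for s in segments:
    print(s.name, "->", choose_tier(s, now))

tiered = monthly_cost(segments, now)
all_hot = sum(s.size_gb for s in segments) * PRICE_PER_GB_MONTH["local_ssd"]
print(f"tiered: {tiered:.2f}, all on SSD: {all_hot:.2f}")
```

With these placeholder prices the saving comes almost entirely from the large, rarely read segment, which is the pattern the post describes: most bytes are cold, so most bytes can live on the cheaper tier.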
-
The global SSD controller industry is entering one of the most important structural transformations in its history.

Historically, SSD controllers were viewed primarily as NAND management devices focused on:
• wear leveling
• ECC correction
• garbage collection
• firmware management
• and cost optimization.

With the rise of Generative AI, hyperscale infrastructure, GPU-centric computing, edge inference and vector databases, the SSD controller is increasingly becoming a strategic “data orchestration processor” responsible for managing latency, parallelism, QoS, power efficiency and intelligent data movement across AI infrastructures.

Our latest 2025–2030 SSD Controller Strategic Playbook highlights several important industry observations:

The global SSD controller market reached approximately 401M units in 2025, yet growth dynamics are becoming highly bifurcated.

Consumer/client SSDs still account for ~82% of shipments, but enterprise SSD controllers are emerging as the dominant future profit pool due to significantly higher ASPs, firmware complexity and AI infrastructure requirements.

Merchant controller suppliers still represent ~51% of total market volume, led by Silicon Motion (~47% share of merchant market) and rapidly growing Maxio (~30%).

PCIe Gen5 is now entering mainstream enterprise acceleration, while PCIe Gen6 is beginning to shape future hyperscale and AI-storage roadmaps for the 2027–2030 timeframe.

AI storage architectures increasingly prioritize:
• latency determinism
• parallel queue optimization
• power efficiency
• firmware intelligence
• thermal optimization
• and GPU utilization efficiency.

Vertical integration is becoming a major strategic advantage. Companies such as Samsung, SK hynix/Solidigm, Micron and Kioxia benefit from tight NAND-controller-firmware co-optimization and direct hyperscaler qualification ecosystems.

At the same time, China is accelerating localization efforts through companies such as Maxio, D-One, Innogrit and Yeestor as storage infrastructure increasingly becomes part of broader semiconductor sovereignty strategies.

One of the most important conclusions from this analysis is that the industry is no longer competing only on:
➡️ raw bandwidth
➡️ NAND channels
➡️ or benchmark throughput.

The next competitive battleground will increasingly revolve around:
✔ firmware sophistication
✔ AI workload optimization
✔ low-latency architectures
✔ power efficiency
✔ computational storage
✔ and ecosystem integration.

In many ways, the SSD controller is evolving into the “CPU of storage infrastructure.”

#SSD #SSDController #Storage #Semiconductors #AI #ArtificialIntelligence #PCIeGen5 #PCIeGen6 #DataCenter #EnterpriseSSD #Hyperscale #Firmware #FlashMemory #NAND #SiliconMotion #Phison #Maxio #Samsung #Micron #SKHynix #Solidigm #Kioxia #ComputationalStorage #EdgeAI #GPU #AIInfrastructure #SemiconductorIndustry #TechStrategy #DigitalInfrastructure #MemoryMarket #DataInfrastructure
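The market percentages quoted in this post can be turned into approximate unit counts. This is a back-of-the-envelope sketch that assumes every share is measured in units shipped and that the merchant-market shares apply to the 51% merchant volume; the post does not state either explicitly.

```python
TOTAL_UNITS_M = 401.0              # ~401M controllers in 2025 (from the post)
CLIENT_SHARE = 0.82                # consumer/client share of shipments
MERCHANT_SHARE = 0.51              # merchant suppliers' share of total volume
SILICON_MOTION_OF_MERCHANT = 0.47  # assumed to be a unit share of the merchant market
MAXIO_OF_MERCHANT = 0.30           # assumed to be a unit share of the merchant market

client_units = TOTAL_UNITS_M * CLIENT_SHARE
non_client_units = TOTAL_UNITS_M - client_units  # enterprise plus any other segments
merchant_units = TOTAL_UNITS_M * MERCHANT_SHARE
silicon_motion_units = merchant_units * SILICON_MOTION_OF_MERCHANT
maxio_units = merchant_units * MAXIO_OF_MERCHANT

rows = [
    ("client", client_units),
    ("non-client", non_client_units),
    ("merchant (all suppliers)", merchant_units),
    ("Silicon Motion", silicon_motion_units),
    ("Maxio", maxio_units),
]
for label, value in rows:
    print(f"{label}: ~{value:.0f}M units")
```

Under those assumptions this works out to roughly 329M client and 72M non-client controllers, with about 205M units from merchant suppliers, of which roughly 96M would be Silicon Motion and 61M Maxio.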
-
"Lakehouse" has been the talk of the year 2024! More & more organizations have been thinking about and adopting open lakehouse architectures. The reason is actually quite simple.
- customers have the flexibility to store data in open storage formats
- they own/control their cloud storage (such as an S3 bucket, MinIO, etc.)
- they can work on the same data with multiple compute engines (BI, Streaming, ML use cases)

These aspects have resonated with orgs that are suffering from increasing storage & compute costs, are unable to manage multiple data copies, or need to maintain a 2-tier architecture (data warehouse + data lake), among other pains.

This year I plan to bring to light the learnings from organizations that have taken lakehouses into production! While a lot of time last year was spent on debates about the 'table format wars', it is time for data engineers/architects to look beyond them & learn from implementations. I picked these 4 examples of orgs that have been running Apache Hudi for some time now.

✅ Uber:
- Uber has been running a lakehouse architecture in production for quite some time. In fact, Hudi started at Uber to solve some of their architectural challenges.
- The 'Co-services Data engineering' team at Uber facilitates data solutions for various verticals such as payments, with use cases such as 'What's the collection from a Trip or an Uber Eats order?'
- They were able to achieve a 75% improvement in their end-to-end refresh latency (20 hours -> 5 hours), among other benefits

✅ Amazon:
- This team is part of 'Worldwide Amazon Stores' & their goal is to enable a top-quality selling experience for their sellers + analytics (pricing, forecasting) so they can grow.
- Their solution: Nexus (built on top of Hudi) helped streamline data workflows by providing a consistent config-driven framework to scale operations & onboard new businesses.
- They deal with 3 PB+ of data, with ~1 PB added & deleted every month using Hudi

✅ Peloton Interactive:
- Peloton's data platform team shared some of their pain points with their old architecture, especially how their recommender system was constrained to daily recommendations.
- With Hudi they got hourly data freshness for high-frequency tables, a near real-time recommender system & significant cost savings (with Merge-on-read tables, async cleaner + compaction)

✅ Notion:
- Notion has experienced exponential user growth that led them to re-think their data infrastructure, especially for Notion AI.
- Over 90% of their operations are updates (with Snowflake this was slower & costly). Hudi’s UPSERT operation ensures the changes are efficiently handled without reprocessing the entire dataset.
- Moving 10 TB+ Postgres datasets to the data lake gave them net savings of 1 million+ dollars for 2022 and proportionally higher savings in 2023 & 2024.

There is a lot to learn from these implementations. The videos for each of these talks are linked in the comments.

#dataengineering #softwareengineering
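The reason upserts matter for update-heavy workloads like the Notion case can be shown with a toy in-memory model. This is not Hudi's API or storage layout; `upsert`, the row shape, and the sample data are hypothetical, and the point is only that work scales with the number of changed keys rather than with table size. The last lines also check the Uber latency figure quoted above.

```python
from typing import Any

Row = dict[str, Any]


def upsert(table: dict[str, Row], changes: list[Row], key: str = "id") -> dict[str, int]:
    """Apply changes in place, touching only the keys that appear in `changes`."""
    stats = {"inserted": 0, "updated": 0}
    for change in changes:
        row_key = change[key]
        if row_key in table:
            # Update: merge the new column values over the existing row.
            table[row_key] = {**table[row_key], **change}
            stats["updated"] += 1
        else:
            # Insert: the key has not been seen before.
            table[row_key] = dict(change)
            stats["inserted"] += 1
    return stats


table: dict[str, Row] = {
    "b1": {"id": "b1", "title": "Roadmap", "version": 1},
    "b2": {"id": "b2", "title": "Notes", "version": 1},
}
changes = [
    {"id": "b1", "title": "Roadmap 2024", "version": 2},  # existing key
    {"id": "b3", "title": "Ideas", "version": 1},         # new key
]

stats = upsert(table, changes)
print(stats)        # {'inserted': 1, 'updated': 1}
print(table["b2"])  # untouched: {'id': 'b2', 'title': 'Notes', 'version': 1}

# Sanity check of the refresh-latency claim: 20 hours down to 5 hours.
refresh_improvement = (20 - 5) / 20
print(f"{refresh_improvement:.0%}")  # 75%
```

A full rewrite would have touched all three rows to apply two changes; the keyed merge touches two. At table sizes in the terabytes that difference is where the cost savings in these case studies come from.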
-
Let’s do this! I speak to so many leaders and get so many insights into how the space is evolving!

“Data 3.0 in the Lakehouse era,” using this map as a guide.

Data 3.0 is composable. Open formats anchor the system, metadata is the control plane, orchestration glues it together, and AI use cases shape choices.

Ingestion & Transformation - Pipelines are now products, not scripts. Fivetran, Airbyte, Census, dbt, Meltano and others standardize ingestion. Orchestration tools like Prefect, Flyte, Dagster and Airflow keep things moving, while Kafka, Redpanda and Flink show that streaming is no longer a sidecar but central to both analytics and AI.

Storage & Formats - Object storage has become the system of record. Open file and table formats (Parquet, Iceberg, Delta, Hudi) are the backbone. Warehouses (Snowflake, Firebolt) and lakehouses (Databricks, Dremio) co-exist, while vector databases sit alongside because RAG and agents demand fast recall.

Metadata as Control - This is where teams succeed or fail. Unity Catalog, Glue, Polaris and Gravitino act as metastores. Catalogs like Atlan, Collibra, Alation and DataHub organize context. Observability tools (Telmai, Anomalo, Monte Carlo, Acceldata) make trust scalable. Without this layer, you might have a modern-looking stack that still behaves like 2015.

Compute & Query Engines - The right workload drives the choice: Spark and Trino for broad analytics, ClickHouse for throughput, DuckDB/MotherDuck for frictionless exploration, and Druid/Imply for real-time. ML workloads lean on Ray, Dask and Anyscale. Cost tools like Sundeck and Bluesky matter because economics matter more than logos.

Producers vs Consumers - The left half builds, the right half uses. Treat datasets, features and vector indexes as products with owners and SLOs. That mindset shift matters more than picking any single vendor.

Trends I see
• Batch and streaming are converging around open table formats.
• Catalogs are evolving into enforcement layers for privacy and quality.
• Orchestration is getting simpler while CI/CD for data is getting more rigorous.
• AI sits on the same foundation as BI and data science, not a separate stack.

This is my opinion of how the space is shaping up. Use this to reflect on your own stack, simplify, standardize, and avoid accidental complexity!

----
✅ I post real stories and lessons from data and AI. Follow me and join the newsletter at www.theravitshow.com
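"Treat datasets, features and vector indexes as products with owners and SLOs" can be read as a concrete record plus a check. The sketch below is one possible shape, not a catalog's real schema: `DataProduct`, `slo_breaches`, the team names, and the freshness targets are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class DataProduct:
    name: str
    kind: str                 # "dataset", "feature", or "vector_index"
    owner: str                # the team accountable for the product
    freshness_slo: timedelta  # maximum acceptable age of the latest refresh


def slo_breaches(
    products: list[DataProduct],
    last_refreshed: dict[str, datetime],
    now: datetime,
) -> list[tuple[str, str]]:
    """Return (product name, owner) for every product staler than its SLO."""
    breaches = []
    for product in products:
        age = now - last_refreshed[product.name]
        if age > product.freshness_slo:
            breaches.append((product.name, product.owner))
    return breaches


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
products = [
    DataProduct("orders_daily", "dataset", "payments-team", timedelta(hours=24)),
    DataProduct("user_embeddings", "vector_index", "search-team", timedelta(hours=1)),
]
last_refreshed = {
    "orders_daily": now - timedelta(hours=30),       # 6 hours past its SLO
    "user_embeddings": now - timedelta(minutes=20),  # within its SLO
}

print(slo_breaches(products, last_refreshed, now))  # [('orders_daily', 'payments-team')]
```

The useful part is less the check than the fact that every breach comes back with an owner attached, which is what separates a product from a table nobody is responsible for.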
-
Two years ago most teams were evaluating Apache Iceberg. Today they are deploying it at massive scale.

Last week I hosted a closed-door roundtable with data leaders running billions of records and tens of terabytes per day through Iceberg in production. Here is what became clear.

→ Adoption is clearly inflecting as teams move from pilots to serious scale deployments (billions of records / tens of TB+ per day) across core analytics workloads.

→ Cost pressure is a major driver as companies move off pure Snowflake, Redshift, and Databricks footprints toward Iceberg-based lakehouse designs to reduce warehouse spend.

→ The architecture is getting simpler as teams reduce brittle microservice-to-warehouse sync pipelines and centralize storage around open table formats.

→ Iceberg handles near-real-time workloads in the tens-of-seconds range well, while sub-second operational use cases still rely on specialized serving engines.

One thing became obvious. Iceberg is becoming the storage foundation. Performance and differentiation are shifting to the layers on top.

In my next post I will break down what we heard about interoperability friction, the emerging Bronze-Silver-Gold patterns, how data teams are shifting from infra to modeling, and how database vendors are building serious performance advantages on top of Iceberg.

The center of gravity in the modern data stack is shifting toward open table formats with performance layered on top. The teams that win will understand storage, modeling, and serving as one system instead of isolated tools.

#iceberg #analytics #firebolt #data
-
Ready to architect the future of data❓

The Data Lakehouse isn't just a buzzword. It's the architectural evolution that's reshaping how we think about data storage, processing, and analytics. But theory without practice is just wishful thinking.

🧊 𝗗𝗮𝘁𝗮 𝗟𝗮𝗸𝗲
Think of it as a raw data reservoir. Stores structured, semi-structured, and unstructured data. Great for scalability and flexibility. But… lacks governance, performance, and query optimization.

🏢 𝗗𝗮𝘁𝗮 𝗪𝗮𝗿𝗲𝗵𝗼𝘂𝘀𝗲
A refined data factory. Optimized for structured data and analytics. Strong governance, ACID compliance, and fast queries. But… expensive and rigid for modern data types.

🏙️ Enter 𝗗𝗮𝘁𝗮 𝗟𝗮𝗸𝗲𝗵𝗼𝘂𝘀𝗲: The Smart City of Data
A Lakehouse combines the best of both worlds:
• Scalability of a data lake
• Reliability & performance of a warehouse
• Unified architecture for BI, ML, and real-time analytics
• ACID transactions, schema evolution, time travel, and streaming support

🔍 It’s like building a smart city where raw materials (data) and finished goods (insights) coexist, governed by intelligent systems.

🛠️ Explore some powerful tools to start building your own lakehouse:
• Apache Iceberg – Table format with time travel & schema evolution
• Delta Lake – ACID transactions on data lakes
• Apache Hudi – Real-time ingestion and upserts
• LakeSoul – Rust-based lakehouse with streaming support
• Nessie – Git-like catalog versioning

Explore these amazing Lakehouse resources to get your hands dirty:
1. 𝗣𝗮𝗰𝗸𝘁 𝗗𝗮𝘁𝗮𝗯𝗿𝗶𝗰𝗸𝘀 𝗟𝗮𝗸𝗲𝗵𝗼𝘂𝘀𝗲 by Will Girten - https://lnkd.in/gx6Hpt_y
2. 𝗥𝗲𝗮𝗹𝘁𝗶𝗺𝗲 𝗦𝘁𝗿𝗲𝗮𝗺𝗶𝗻𝗴 𝘄𝗶𝘁𝗵 𝗗𝗮𝘁𝗮 𝗟𝗮𝗸𝗲𝗵𝗼𝘂𝘀𝗲 by Yusuf Ganiyu - https://lnkd.in/gN9_Bnb7
3. 𝗡𝗬𝗖 𝗧𝗮𝘅𝗶 𝗗𝗮𝘁𝗮 𝗟𝗮𝗸𝗲𝗵𝗼𝘂𝘀𝗲 𝗣𝗿𝗼𝗷𝗲𝗰𝘁 - https://lnkd.in/gjaJuiZp

📣 Ready to build your smart data city?
💬 Start exploring these tools, experiment with hybrid architectures, and share your learnings.

#data #engineering
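To get a feel for what "ACID transactions, schema evolution, time travel" mean before picking a tool, here is a toy snapshot-based table. It is a conceptual sketch only and does not use or imitate the API of Iceberg, Delta Lake, or Hudi; `SnapshotTable` and its methods are invented for this example.

```python
from __future__ import annotations

from typing import Any

Row = dict[str, Any]


class SnapshotTable:
    """Toy table where every commit publishes a new immutable snapshot."""

    def __init__(self, columns: list[str]) -> None:
        self.columns = list(columns)
        self._snapshots: list[tuple[Row, ...]] = [()]  # snapshot 0 is the empty table

    def add_column(self, name: str) -> None:
        """Schema evolution: old rows are not rewritten, they simply read as None."""
        if name not in self.columns:
            self.columns.append(name)

    def commit(self, rows: list[Row]) -> int:
        """All-or-nothing append: validate every row before publishing anything."""
        for row in rows:
            unknown = set(row) - set(self.columns)
            if unknown:
                raise ValueError(f"unknown columns: {sorted(unknown)}")
        new_rows = tuple(dict(row) for row in rows)
        self._snapshots.append(self._snapshots[-1] + new_rows)
        return len(self._snapshots) - 1  # id of the new snapshot

    def read(self, snapshot_id: int | None = None) -> list[Row]:
        """Time travel: read the latest snapshot, or any earlier one by id."""
        snapshot = self._snapshots[-1 if snapshot_id is None else snapshot_id]
        return [{col: row.get(col) for col in self.columns} for row in snapshot]


trips = SnapshotTable(["trip_id", "fare"])
v1 = trips.commit([{"trip_id": 1, "fare": 12.5}])

trips.add_column("tip")
v2 = trips.commit([{"trip_id": 2, "fare": 8.0, "tip": 1.5}])

try:
    # The second row is invalid, so neither row becomes visible.
    trips.commit([{"trip_id": 3, "fare": 9.0}, {"trip_id": 4, "surge": 2.0}])
except ValueError as err:
    print("rejected:", err)

print(len(trips.read()))  # 2
print(trips.read(v1))     # [{'trip_id': 1, 'fare': 12.5, 'tip': None}]
```

Real table formats get the same behavior by writing new metadata files that point at immutable data files, then atomically swapping a pointer to the current snapshot. The toy version keeps only the idea: readers always see a complete snapshot, and older snapshots stay readable.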
-
What happens when AI-driven workloads push enterprise data storage to its limits?

During The IT Press Tour in Silicon Valley, I sat down with David Flynn, CEO of Hammerspace, to discuss how the company is redefining unstructured data management. With AI, hybrid cloud, and high-performance computing generating massive storage demands, enterprises need a new way to access and orchestrate data across edge, data centers, and the cloud.

Hammerspace is having a breakout year, reporting 10x revenue growth, fueled by the rising demand for AI storage and hybrid cloud solutions. At the heart of this success is its Global Data Platform, which eliminates data silos and ensures seamless access to unstructured data, no matter where it resides.

One of the key takeaways from our discussion was the role of Parallel Network File System (pNFS) and why it’s becoming a game-changer for enterprise storage. Unlike traditional storage architectures, pNFS separates metadata from data, unlocking extreme parallel performance without the usual bottlenecks. But as David explains, storage performance alone isn’t enough: true data orchestration is the missing piece.

We also explored why global namespaces alone are not a solution, and how Hammerspace is combining them with data orchestration to create a system that moves, accesses, and utilizes data efficiently across distributed environments. This shift is critical as enterprises struggle to keep pace with AI-driven analytics, GPU computing, and hybrid cloud strategies.

With its recent recognition as TechTarget’s Storage Product of the Year and a new Chief Revenue Officer joining the team, Hammerspace is scaling rapidly to meet the evolving needs of AI-powered businesses.

So, what does the future hold for AI storage and enterprise data management? Will solutions like Hammerspace’s redefine the way organizations handle high-performance workloads?

https://lnkd.in/eV4_MS5M

#ITPT #Technology #DataManagement #artificialintelligence
-
Are storage platforms becoming control planes?

Performance and capacity still matter, but as AI workloads move into production, teams are managing more data, more variability, and tighter recovery expectations without adding staff. The limiting factor increasingly becomes admin time.

At HyperFRAME Research, we took a closer look at IBM’s FlashSystem update and published two distinct research notes today. A few things stood out:

🔵 FlashSystem.ai automates provisioning, placement, SLA monitoring, and reporting so admins spend less time on tickets and routine tasks
🔵 FlashSystem Grid lets multiple systems run as a coordinated pool and move workloads without disruption
🔵 New FlashCore modules increase density while IBM’s SLC–QLC design targets more stable cost and supply
🔵 Inline inspection at the media layer detects anomalies earlier, which tightens recovery windows during ransomware events

Taken together, the portfolio points toward storage functioning as a managed platform across compute, storage, and cloud, with fewer systems to run, tighter recovery objectives, and less day-to-day overhead.

In our view, storage teams are being asked to manage more capacity and complexity with fewer specialists, which is pushing the market toward platforms that automate routine decisions and focus operators on outcomes instead of systems. IBM's approach brings together telemetry with a control plane that interprets intent and enforces outcomes across an elastic fabric, helping generalist teams run enterprise storage with more consistency and less day-to-day effort.

Links to both research notes below, covering the systems architecture and the resiliency strategy, respectively.

#StorageInfrastructure #DataResiliency

Steven Dickens Stephanie Walter Ron Westfall Stephen Sopko Fred McClimans Sam Werner Alistair Symon Scott Baker Elisa Ortiz Fava Alexandra Demetriades