The Data Lake: Half Empty or Half Full?

I wrote a post on the Data Informed blog this week about the different perspectives on the Data Lake and traditional data warehousing. Re-posting it here for comment:

Conference season is in full swing in the world of data management and business intelligence, and it’s clear that when it comes to the infrastructure needed to support modern analytics, we are in a major transition. To put things in paleontology terms, with the emergence of Hadoop and its impact on traditional data warehousing, it’s as if we’ve gone from the Mesozoic to the Cenozoic Era and people who have worked in the industry for some time are struggling with the aftermath of the tectonic shift.

A much-debated topic is the so-called data lake. The concept of an easily accessible raw data repository running on Hadoop is also called a data hub or data refinery, although critics call it nothing more than a data fallacy or, even worse, a data swamp. But where you stand (or swim) depends upon where you sit (or dive in). Here’s what I’ve seen and heard in the past few months.

Hadoop love is in the air. In February, I attended Strata + Hadoop World in Silicon Valley. A sold-out event, the exhibit hall was buzzing with developers, data scientists, and IT professionals. Newly minted companies along with legacy vendors trying to get their mojo back were giving out swag like it was 1999. The event featured more than 50 sponsors, over 250 sessions on topics ranging from the Hadoop basics to machine learning and real-time streaming with Spark, and even a message from President Obama in the keynote.

One of the recurring themes at the conference was the potential of the data lake as the new, more flexible strategy to deliver on the analytical and economic benefits of big data. As organizations move from departmental to production Hadoop deployments, there was genuine excitement about the data lake as a way to extend the traditional data warehouse and, in some cases, replace it altogether.

The traditionalists resist change. The following week, I attended The Data Warehouse Institute (TDWI) conference in Las Vegas, and the contrast was stark. Admittedly, TDWI is a pragmatic, hands-on type of event, but the vibe was a bit of a big data buzz kill.

What struck me was the general antipathy toward the developer-centric Hadoop crowd. The concept of a data lake was the object of great skepticism and even frustration with many people I spoke with – it was being cast as an alternative to traditional data warehousing methodologies. IT pros from mid-sized insurance companies were quietly discussing vintage data warehouse deployments. An analyst I met groused, “Hadoopies think they own the world. They’ll find out soon enough how hard this stuff is.”

And that about sums it up: New School big data Kool-Aid drinkers think Hadoop is the ultimate data management technology, while the Old Guard points to the market dominance of legacy solutions and the data governance, stewardship, and security lessons learned from past decades. But Hadoop isn’t just about replacing data warehouse technologies. Hadoop brings value by extending and working alongside those traditional systems, bringing flexibility and cost savings, along with greater business visibility and insight.

Making the Case for Hadoop
It’s wise to heed the warnings of the pragmatists and not throw the baby out with the bath (lake) water. As one industry analyst said to me recently, “People who did data warehousing badly will do things badly again.” Fair enough. Keep in mind that, just like a data warehouse, a data lake strategy is a lot more than just the technology choices. But is the data lake half full or half empty? Will Hadoop realize its potential, or is it more hype than reality?

I believe that, as the market moves from the early adopter techies and visionaries to the pragmatists and skeptics, Hadoopies will learn from the mistakes of their predecessors and something better – that is, more flexible, accessible, and economical – will emerge. Traditional data warehousing and data management methodologies are being re-imagined today.

Every enterprise IT organization should consider the strengths, weaknesses, opportunities, and threats of the data lake. Hadoop will expand analytic and storage capabilities at lower costs, bringing big data to Main Street. There are still issues around security and governance, no doubt. But in the short term, Hadoop is making a nice play for data collection and staging. Hadoop is not a panacea, but the promise of forward-looking, real-time analytics and the potential to ask – and answer – bigger questions is too enticing to ignore.

This post originally appeared on Data Informed.

Sam Chehab

Head of Security and IT

10y

Great article...I find it particularly interesting trying to watch companies 'sidestep' the whole master data concept and move directly to 'shiny toy x'. The real area for improvement is how to make 'mastering' data suck less.

Ramon Chen

Ping me to learn how #agenticDM means better #datagovernance #dataobservability #dataquality, #MDM, & #ai | CPO @Acceldata | Ex-Reltio CPO & CMO | Partner to CDOs & CIOs | Top LinkedIn Exec Voice | Who’s who in Data

10y

Great article, Darren. You know where we swim here at Reltio :-) For the sake of business agility and speed, companies have to shift to NoSQL repositories to be data-driven. But the data lake can become muddy and swampy without the backbone of reliable master data discipline. And regardless of where the data is stored or how it's managed, it doesn't mean anything if business users don't get the refined, distilled, relevant insights and increasingly recommended actions to help them in their goals. Rather than debating the backend, it might be better to ask business users whether they would like their enterprise apps to be like LinkedIn: all-encompassing, with data types of any variety seamlessly blended and related, easy-to-use functionality, mobile interfaces built in, new capabilities requiring no upgrades, collaboration included, and unlimited scale and performance. LinkedIn and Facebook use Graph/NoSQL technology. So the answer in my mind is not lake vs. no lake, but whether the business needs new data-driven apps vs. sticking with traditional process-driven siloed apps.

Neil Barry

Unlocking opportunities through data

10y

Excellent article. Before deciding on your architecture, you should ask yourself: What data do I need to analyze in real time, and what data can I store for a period of time?


Great article. While Hadoop itself has proven to be an amazing game-changer, the lakes and swamps are in a rudimentary space with a lot of vendor-speak and promises. We need to be rational and apply use cases as opposed to making a blind tectonic shift. Given the maturity curve of the data lake space, we may see a mad rush like the one currently underway from MapReduce to Spark. Certainly a good time to pilot and experiment without assuming the full risks as the evolution happens!

Ron Dunn

Performance Specialist at Snowflake - The Data Cloud

10y

Microsoft's announcement of Azure Data Lake pushes this concept into prominence. Until now it has been the specialist vendors like Cloudera and Pivotal promoting the concept, Microsoft's move brings it to the mainstream.
