The most important data structure you rarely think about is the log.
Every time #PostgreSQL commits a transaction, etcd updates cluster state, or CockroachDB applies a write, the same mechanism is at work: Write-Ahead Logging (WAL).
Before modifying data, you first append the intended change to a log and flush it to disk. If the system crashes, the log is replayed and the state is rebuilt. Nothing that was acknowledged is lost.
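The whole idea fits in a few lines. Here is a toy sketch (not any real database's WAL format; `TinyWAL` and its record layout are invented for illustration): append the intent, fsync, only then mutate state, and rebuild state by replaying the log on restart.

```python
import json
import os
import tempfile

class TinyWAL:
    """Toy key-value store: append each intent to the log before applying it."""
    def __init__(self, path):
        self.path = path
        self.state = {}
        if os.path.exists(path):  # crash recovery: rebuild state by replaying the log
            with open(path) as f:
                for line in f:
                    rec = json.loads(line)
                    self.state[rec["key"]] = rec["value"]
        self.log = open(path, "a")

    def put(self, key, value):
        rec = json.dumps({"key": key, "value": value})
        self.log.write(rec + "\n")    # 1. append the intent
        self.log.flush()
        os.fsync(self.log.fileno())   # 2. force it to stable storage
        self.state[key] = value       # 3. only then mutate in-memory state

path = os.path.join(tempfile.mkdtemp(), "tiny.wal")
db = TinyWAL(path)
db.put("user:1", "alice")

db2 = TinyWAL(path)  # simulate a restart: state is rebuilt from the log
print(db2.state["user:1"])  # alice
```

The ordering is the entire contract: the fsync must complete before the in-memory write is acknowledged. Real engines add checksums, record framing, and log truncation after checkpoints.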
WAL is not just crash recovery. It powers:
Replication: In PostgreSQL streaming replication, standbys replay WAL segments. No #WAL, no high availability.
Change Data Capture (#CDC): #Debezium reads the WAL to stream changes into Apache #Kafka without application changes.
Point in Time Recovery (PITR): A base backup plus archived WAL lets you restore to any point between the backup and the last archived segment.
Distributed consensus: #etcd persists Raft proposals to its WAL before acknowledging them. Consensus and durability become the same operation.
LSM storage engines: #CockroachDB uses #Pebble, which maintains its own WAL for local durability, while Raft handles distributed agreement. Different layers, same primitive.
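Several of the uses above reduce to the same operation: tail the log from a known offset. A minimal CDC-flavored sketch (hypothetical names, assuming one JSON record per line; this is the shape of the idea, not Debezium's actual API):

```python
import json

def read_changes(path, offset=0):
    """Return every record appended after byte `offset`, plus the new offset.

    A consumer stores the returned offset and calls again later, so it
    streams only new changes -- the essence of log-based CDC and of
    replication standbys replaying WAL segments.
    """
    records = []
    with open(path) as f:
        f.seek(offset)
        for line in f:
            records.append(json.loads(line))
        return records, f.tell()
```

Because the log is append-only, the consumer's position is a single integer, and resuming after a consumer crash is trivial: re-read from the last saved offset.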
Even Apache Kafka is, at its core, an append-only distributed commit log. With Apache ZooKeeper removed, its metadata is now managed through KRaft, a Raft-based internal log.
The core insight is performance. Sequential writes are dramatically faster than random writes on HDDs, and still cheaper on NVMe SSDs thanks to better batching and lower write amplification. Append-only wins because the hardware favors it.
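You can measure this yourself. A rough micro-benchmark sketch: write the same blocks once back to back and once at shuffled offsets. On a spinning disk the gap is dramatic; on an SSD with a warm page cache it may be modest, since the OS absorbs and reorders writes before they hit the device.

```python
import os
import random
import tempfile
import time

BLOCK, COUNT = 4096, 2000  # 2000 writes of 4 KiB each
buf = os.urandom(BLOCK)

def timed_sequential(path):
    """Append blocks back to back, then fsync once."""
    with open(path, "wb") as f:
        t0 = time.perf_counter()
        for _ in range(COUNT):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())
        return time.perf_counter() - t0

def timed_random(path):
    """Write the same blocks at shuffled offsets, then fsync once."""
    offsets = list(range(COUNT))
    random.shuffle(offsets)
    with open(path, "wb") as f:
        f.truncate(COUNT * BLOCK)  # pre-size the file
        t0 = time.perf_counter()
        for i in offsets:
            f.seek(i * BLOCK)
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())
        return time.perf_counter() - t0

d = tempfile.mkdtemp()
seq = timed_sequential(os.path.join(d, "seq.dat"))
rnd = timed_random(os.path.join(d, "rnd.dat"))
print(f"sequential: {seq:.4f}s  random: {rnd:.4f}s")
```

Absolute numbers depend entirely on your hardware and filesystem; the point is that the append-only pattern never asks the device to do anything but its fastest operation.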
Takeaway: If you are building a system that requires durability, replication, or auditability, you are either already relying on a write-ahead log or you are about to reinvent one poorly.
#DistributedSystems #Databases #Logging #WAL #PostgreSQL #Kafka #DataEngineering #SystemDesign #HighAvailability #StorageEngines #LSMTree
Will you be at the forefront of object storage innovation? Join VAST FWD: https://www.vastdata.com/vast-forward?utm_medium=social&utm_source=social&utm_campaign=