If you are building or using a distributed database where multiple nodes can accept writes on the same entities, you should know a bit about CRDTs. In traditional distributed databases, a write must be confirmed by other nodes (a quorum) to maintain consistency, which requires network connectivity and consensus. CRDTs (conflict-free replicated data types) flip this model: each node can accept writes independently, with no coordination at write time, and all nodes eventually converge to the same state. This comes in handy when building an application that lets you write to your local copy while completely offline. When nodes eventually reconnect, the CRDT merge algorithm reconciles all changes automatically and deterministically. This "write locally, sync later" property makes CRDTs especially useful for local-first applications: your notes, todos, contacts, calendar, and chat messages can all sync in the background whenever your devices are online. The main catch is that CRDTs do not work well for high-volume writes from many nodes. Still, CRDTs are pretty handy for building use cases that need to work seamlessly offline and reduce dependence on centralized servers.
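The convergence property is easiest to see in a grow-only counter (G-Counter), one of the simplest CRDTs. Here is a minimal sketch in Python; the node names and single-counter scope are illustrative assumptions, not any particular library's API.

```python
# Minimal G-Counter CRDT sketch: each replica increments only its own
# slot, and merge takes the element-wise max. Because max is commutative,
# associative, and idempotent, replicas converge in any merge order.

class GCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # node_id -> count contributed by that node

    def increment(self, amount=1):
        # Writes touch only this node's slot: no coordination needed.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other):
        # Deterministic reconciliation when replicas reconnect.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())

# Two replicas accept writes independently while "offline" ...
a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
# ... then reconcile deterministically when they reconnect.
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 5
```

Deletions and more complex structures (sets, maps, text) need richer CRDTs, but the merge-by-max idea above is the core of the "write locally, sync later" guarantee.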
Distributed Database Systems
Summary
Distributed database systems are networks of databases spread across multiple servers or locations, allowing data to be stored and accessed in a coordinated way for better scalability, reliability, and performance. These systems help manage large volumes of data and support applications that need to be always available, even if some parts of the network experience failures.
- Consider scaling strategies: Evaluate whether your application needs vertical scaling (adding resources to a single machine) or horizontal scaling (adding more machines) to handle growing data and user demand.
- Prioritize data consistency: Choose the right consistency model—such as strong, eventual, or causal—based on your application's requirements for accuracy and speed.
- Select database type carefully: Match your use case with the appropriate distributed database, like document, graph, or time-series databases, to simplify development and maintenance.
Database Decision Matrix: A Data Engineer's Guide 🛠️

As data engineers, we often get confused choosing the right database when architecting data solutions. This isn't just about storing data - it's about understanding your data's journey. Here's a deep dive:

1. Data Flow Patterns
- Heavy Write Workloads: Consider Apache Cassandra or TimescaleDB for time-series data with massive write operations
- Read-Heavy Applications: Redis or MongoDB with read replicas shine for caching and quick retrievals
- ACID Requirements: PostgreSQL or MySQL remain gold standards for transactional integrity

2. Scaling Requirements
- Horizontal Scaling Needs: DynamoDB or Cassandra excel with distributed architectures
- Vertical Scaling Focus: Traditional RDBMSs like PostgreSQL with powerful single instances
- Global Distribution: CockroachDB or Azure Cosmos DB for multi-region deployments

3. Data Complexity
- Complex Relationships: Graph databases like Neo4j for interconnected data models
- Document Storage: MongoDB or CouchDB for nested, schema-flexible documents
- Time-Series Data: InfluxDB or TimescaleDB for temporal data analytics
- Search-Heavy Apps: Elasticsearch for full-text search capabilities

4. Operational Overhead
- Managed Services: Cloud offerings (RDS, Atlas) for reduced DevOps burden
- Self-Hosted: Consider team expertise and maintenance capacity
- Backup & Recovery: Evaluate point-in-time recovery capabilities and replication features

5. Performance Considerations
- Query Patterns: Analyze common query patterns and required response times
- Indexing Requirements: Evaluate index size and maintenance overhead
- Memory vs. Disk Trade-offs: Consider in-memory solutions like Redis for ultra-low latency

6. Cost Analysis
- Data Volume Growth: Project storage costs and scaling expenses
- Query Costs: Especially important for cloud-based solutions where queries = dollars
- Operational Costs: Factor in monitoring, maintenance, and expertise required

Real-World Selection Examples:
- User Activity Tracking: Cassandra (high write throughput, time-series friendly)
- Financial Transactions: PostgreSQL (ACID compliance, robust consistency)
- Content Management: MongoDB (flexible schema, document-oriented)
- Real-time Analytics: ClickHouse (columnar storage, fast aggregations)
- Cache Layer: Redis (in-memory, fast access)

It's important to start with boring technology (PostgreSQL) unless you have a compelling reason not to. It's better to scale a proven solution than debug an exotic one in production.

A few cloud database options:
- Amazon Web Services (AWS): Amazon DynamoDB, Amazon ElastiCache, Amazon Kinesis, Amazon Redshift, and Amazon SimpleDB
- Google Cloud: Cloud Bigtable, Cloud Datastore, Firestore, BigQuery, Cloud SQL, and Google Cloud Spanner
- Microsoft Azure: Azure Cosmos DB, Azure Table Storage, Azure Redis Cache, Azure Data Lake Storage, and Azure DocumentDB

PC: Rocky Bhatia
-
This Distributed Systems Cheatsheet Took Me a Year to Build

I spent all of 2024 learning, failing, and finally understanding these concepts. If you're starting with distributed systems in 2025, let this be your guide:

1/ Core Concepts
+ Scalability:
- Vertical Scaling: Adding resources to existing machines (e.g., CPU, RAM).
- Horizontal Scaling: Adding more machines (nodes).
+ Reliability: Ensure the system continues functioning correctly despite failures.
+ Availability: Maintain uptime by ensuring the system is operational even during failures.
+ Consistency Models: Decide the trade-off between data consistency and latency (eventual, strong, or causal).
+ CAP Theorem: During a network partition, a system must choose between Consistency and Availability - you can't have all three of Consistency, Availability, and Partition Tolerance.

2/ Communication
+ Remote Procedure Calls (RPC): Call a function on a remote server as if it’s local.
+ Message Queues: Asynchronous communication (e.g., RabbitMQ, Kafka).
+ REST vs gRPC:
- REST: HTTP-based, suitable for CRUD APIs.
- gRPC: High-performance, protocol-buffer-based, ideal for low-latency communication.
+ Webhooks: Get notified when specific events occur (e.g., payment success).

3/ Coordination & Consensus
+ Consensus Algorithms: Achieve agreement across nodes (Paxos, Raft).
+ Distributed Locks: Ensure only one process accesses a critical section (e.g., ZooKeeper).
+ Leader Election: Decide which node should act as the leader in distributed systems.
+ Gossip Protocol: Decentralized communication for node state updates.

4/ Architectures
+ Client-Server: Traditional request-response systems.
+ Peer-to-Peer (P2P): All nodes are equal, like torrents.
+ Event-Driven: Trigger actions in response to events (e.g., Kafka).
+ Microservices: Small, independent services that work together.
+ Lambda Architecture: Hybrid approach for real-time and batch data processing.

5/ Key Technologies
+ Container Orchestration: Kubernetes, Docker Swarm.
+ Service Discovery: Tools like Consul to locate services dynamically.
+ API Gateways: Central point for routing API requests (e.g., Kong, NGINX).
+ Distributed Tracing: Tools like Jaeger to monitor requests across services.
+ Infrastructure as Code: Automate server setups using Terraform or Pulumi.

6/ Data Management
+ Distributed Databases: MongoDB, Cassandra, CockroachDB (scalable, high availability).
+ Caching: Use Redis or Memcached to speed up frequently accessed data.
+ Replication & Partitioning: Copy data across nodes for fault tolerance (replication), and split it across nodes for scale (sharding/partitioning).
+ Consistency Models: Understand BASE (eventual consistency) vs. ACID (strict consistency).

7/ Common Pitfalls
+ The Network Isn’t Reliable: Always design for potential network failures.
+ Latency Is Never Zero: Acknowledge real-world delays.
+ Topology Changes Happen: Nodes can go offline or change - build for this flexibility.
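The gossip protocol mentioned above can be sketched in a few lines. This is a toy push-only model, assuming each node tracks a heartbeat version per peer; real implementations add failure detection, anti-entropy rounds, and randomized peer selection.

```python
# Toy gossip sketch: each node keeps a versioned map of peer heartbeats.
# When gossiping, the receiver keeps the freshest version per peer, so
# state spreads transitively without any central coordinator.

class GossipNode:
    def __init__(self, name):
        self.name = name
        self.state = {name: 0}  # peer name -> latest heartbeat version seen

    def heartbeat(self):
        # A node only ever advances its own entry.
        self.state[self.name] += 1

    def gossip_to(self, peer):
        # Push our view; the receiver merges by keeping higher versions.
        for node, version in self.state.items():
            if version > peer.state.get(node, -1):
                peer.state[node] = version

a, b, c = GossipNode("a"), GossipNode("b"), GossipNode("c")
a.heartbeat()
a.gossip_to(b)   # b now knows about a
b.gossip_to(c)   # c learns about a transitively, without contacting a
assert c.state["a"] == 1
```

The key property: information reaches all nodes in O(log N) rounds on average, even though each node only talks to a few peers per round.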
-
How did Google build the world’s most scalable database? Spanner is an incredibly impressive system - the first data store to provide globally distributed transactions at scale. However, when it was first built, it was hard to use, offering only key-value semantics. This paper tells the story of how Spanner evolved into a full database, with support for SQL and all the features application developers expect, at a scale at which they had never been offered before.

Like any database, Spanner compiles a SQL query into a physical query plan, then optimizes it. However, to run queries at massive scale, it needs new distributed operators. The most important of these is distributed union, which ships a subquery to each shard of the underlying data and concatenates the results. This is a building block for performing distributed aggregations or joins over sharded tables. To make Spanner work, a distributed union has to be inserted above every table in a query plan. Because a distributed union is expensive, they do a lot of work to push operations into the union (particularly filters). Moreover, joins are aggressively rewritten to minimize the number and size of the distributed unions performed.

Spanner optimizes the performance of distributed queries using a coprocessor framework: each remote call is addressed not to a particular server, but to a range of data. This gives the runtime leeway to execute each query in the most efficient manner, routing each subquery request to the nearest replica that can serve it. It also gives the runtime freedom to filter which shards are queried based on the requested keys, so shards don’t receive irrelevant requests. Moreover, it allows transparent masking of transient failures, as any subquery is automatically served from an available replica, even if other replicas are offline.

What are the main takeaways? First, SQL semantics help adoption. Spanner always allowed users to reliably store and query data at scale, but adding SQL made it easier to write faster queries (thanks to the optimizer) and more complex queries. Second, even with full SQL support, using databases at scale is hard. The optimizer makes it much easier to write performant queries, but even then there are many pitfalls, and the paper makes it clear that (as of 2017) the Spanner team still needed to work with internal customers to make sure their queries won't hit scaling bottlenecks.
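The distributed union described above reduces to a simple idea: ship the same subquery to every shard and concatenate the results. Here is a sketch under heavily simplified assumptions - shards are plain in-memory lists and the pushed-down filter is an arbitrary predicate; real Spanner routes each subquery through its coprocessor framework to the nearest available replica.

```python
# Sketch of a "distributed union" operator: run the same subquery on
# every shard of a table and concatenate the per-shard results.

def distributed_union(shards, subquery):
    results = []
    for shard in shards:
        # In a real system this call goes over the network, addressed
        # to a key range rather than a specific server.
        results.extend(subquery(shard))
    return results

# Pushing the filter into the per-shard subquery (instead of filtering
# after the union) means each shard returns only matching rows - the
# optimization the paper spends so much effort on.
shards = [
    [{"id": 1, "region": "us"}, {"id": 2, "region": "eu"}],
    [{"id": 3, "region": "us"}],
]
rows = distributed_union(shards, lambda s: [r for r in s if r["region"] == "us"])
assert [r["id"] for r in rows] == [1, 3]
```

Aggregations work the same way: each shard computes a partial result, and a final step combines the partials above the union.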
-
"One of the most interesting things in DSQL’s architecture is that we can scale compute (SQL execution), read throughput, write throughput, and storage space independently. At a fundamental level, scaling compute in a database system requires disaggregation of storage and compute. If you stick storage and compute together, you end up needing to scale one to scale the other, which is either impossible or uneconomical."

"Each transaction inside DSQL runs in a customized Postgres engine inside a Firecracker MicroVM, dedicated to your database. When you connect to DSQL, we make sure there are enough of these MicroVMs to serve your load, and scale up dynamically if needed. We add MicroVMs in the AZs and regions your connections are coming from, keeping your SQL query processor engine as close to your client as possible to optimize for latency. We opted to use PostgreSQL here because of its pedigree, modularity, extensibility, and performance. We’re not using any of the storage or transaction processing parts of PostgreSQL, but are using the SQL engine, an adapted version of the planner and optimizer, and the client protocol implementation."

"The overall approach here is disaggregation: we’ve taken each of the critical components of an OLTP database and made it a dedicated service. Each of those services is independently horizontally scalable, most of them are shared-nothing, and each can make the design choices that are most optimal in its domain. This approach is enabled by the extremely fast and reliable networking available in modern data centers, and by designing each component as part of the overall architecture."

https://lnkd.in/gJgs9FGr by Marc Brooker
-
2 years ago, I fumbled in a system design interview. The interviewer asked, “How would you scale a database to handle millions of users?” And I blanked out.

I knew what partitioning was, but not why it mattered, when to use key-range vs hash, or how routing and indexing could break everything if misunderstood. That day stung. But it also sparked something in me. 🔥

So I made a video where I dive deep into Partitioning, Indexing & Request Routing in Distributed Databases - the core building blocks behind systems like MongoDB, Cassandra, and DynamoDB.

🔍 You’ll learn:
✅ Real-world analogies for sharding, hotspots & scatter-gather queries
✅ Key differences between local and global secondary indexes
✅ How service discovery (ZooKeeper, Gossip Protocol, DNS) keeps routing reliable
✅ Request routing strategies: smart clients vs routers vs any-node
✅ Why trade-offs in design define system performance at scale

If you're aiming for SDE-II/III interviews, backend architecture roles, or building real-time, large-scale systems, this breakdown is made for you.

👉 Watch here: https://lnkd.in/gXUzrGFV

P.S.: Follow Aarchi Gandhi for more such stuff :) P.P.S.: Thanks for the Tee, #TakeYouForward Team.
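The key-range vs hash distinction mentioned above fits in a few lines of code. This sketch assumes four partitions and arbitrary string keys; the boundary values and the MD5-based hash are illustrative choices, not any specific database's scheme.

```python
# Key-range vs hash partitioning. Range partitioning keeps adjacent
# keys together (efficient range scans, but prone to hotspots); hash
# partitioning spreads keys evenly (balanced load, but range queries
# become scatter-gather across all partitions).

import hashlib

NUM_PARTITIONS = 4
RANGE_BOUNDS = ["g", "n", "t"]  # partition 0: keys < "g", 1: < "n", ...

def range_partition(key):
    for i, bound in enumerate(RANGE_BOUNDS):
        if key < bound:
            return i
    return len(RANGE_BOUNDS)

def hash_partition(key):
    # A stable hash (unlike Python's per-process salted hash()) so
    # every router maps the same key to the same partition.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# Adjacent keys land on the same range partition ...
assert range_partition("apple") == range_partition("banana") == 0
# ... but get scattered by hashing.
print({k: hash_partition(k) for k in ["apple", "banana", "cherry"]})
```

This is also why hotspots happen: if every client writes keys starting with today's date, range partitioning sends all traffic to one partition, while hashing spreads it out.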
-
Have you ever searched for an old message in Facebook Messenger? Have you wondered how it finds all messages containing that word within milliseconds? This is the use case that triggered the innovation of Cassandra.

Why can't traditional databases handle this? Relational databases are usually hosted on a single big machine (high CPU count, huge memory & disk). Given the scale of Facebook, a single machine can't store all messages of all users.

Facebook's solution? Cassandra, a distributed NoSQL database that solves this exact problem.

Distributed - You can host Cassandra on multiple machines. It internally distributes data across all of them, but to the developer it pretends to hold all the data on a single mega machine.

NoSQL - SQL comes with a rigid schema, relations & joins between tables. NoSQL doesn't enforce a strict schema, and table joins are not supported.

How does Cassandra make search lightning fast? For this use case, Cassandra works as a key-value store: each word in a message becomes a key, and the messageId becomes the value. When you search for a word, Cassandra looks up that key and returns all associated messageIds instantly.

Handling massive scale: Cassandra splits all keys across multiple machines using a hash function. To find a particular key, it first determines which machine holds the data, then queries that specific machine.

Today, this distributed database powers not just Facebook, but Netflix, Instagram, and countless other platforms.
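The word-to-messageIds scheme described above is an inverted index sharded by hash. Here is a toy sketch; the three-node setup and MD5-based placement are illustrative assumptions, not Cassandra's actual partitioner or replication.

```python
# Toy sharded inverted index: each word maps to the set of messageIds
# containing it, and words are assigned to nodes by a hash function,
# so a search touches exactly one node.

import hashlib

NUM_NODES = 3
nodes = [dict() for _ in range(NUM_NODES)]  # word -> set of messageIds

def node_for(word):
    # Stable hash so every client routes a word to the same node.
    digest = hashlib.md5(word.encode()).hexdigest()
    return int(digest, 16) % NUM_NODES

def index_message(message_id, text):
    # Each word in the message becomes a key; the messageId is the value.
    for word in text.lower().split():
        nodes[node_for(word)].setdefault(word, set()).add(message_id)

def search(word):
    # Route straight to the one node that owns this key - no scan of
    # the other nodes is needed.
    word = word.lower()
    return nodes[node_for(word)].get(word, set())

index_message(1, "see you at lunch")
index_message(2, "lunch is ready")
assert search("lunch") == {1, 2}
```

A real deployment adds replication (each key stored on several nodes) and tokenization beyond whitespace splitting, but the routing idea is the same.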