Building a resilient DNS client for web-scale infrastructure

At LinkedIn, our infrastructure supports a vast ecosystem of applications that serve our more than one billion members worldwide. These applications are built on foundational systems that ensure seamless connectivity, scalability, and reliability. The Domain Name System (DNS) is part of that foundation: it resolves domain names to IP addresses and acts as the backbone of our internet communication.

DNS infrastructure consists of two key components: the server side, which stores and serves DNS records, and the client side, which caches records close to applications, often on the same host. While DNS servers are designed for scalability and redundancy, the resilience of DNS clients is just as critical. A failure in the DNS client can lead to performance degradation, increased latency, or outright outages, impacting every service that relies on it and degrading the experience for members and customers using the platform.

For years, we used the Name Service Cache Daemon (NSCD) for our DNS caching. However, as our infrastructure grew, NSCD faced significant challenges scaling with us. Its limited visibility and lack of robust debugging tools made troubleshooting difficult and time-consuming. Existing alternatives such as systemd-resolved and Unbound didn't fully meet our needs for reliability, scalability, and visibility.

To solve this, we built a DNS Caching Layer (DCL), a high-performance, resilient DNS client cache deployed across our fleet. DCL, which has become a critical component of LinkedIn's infrastructure, enhances reliability and provides visibility into DNS behavior for easier troubleshooting and anomaly detection. In this blog, we will dive deeper into the architecture, features, and impact of this next-generation DNS caching solution.

Architecture and features

DCL is built for high availability, simplicity, and efficiency, running as a daemon on the host and listening on localhost:53 by default. It serves queries over both User Datagram Protocol (UDP) and Transmission Control Protocol (TCP), falling back to TCP for truncated responses to ensure robust DNS resolution. DNS queries are redirected to DCL by pointing /etc/resolv.conf at localhost, naturally integrating it into the system's DNS workflow. For diagnostics and monitoring, a command-line utility provides insights into DCL's performance and statistics. Managed as a systemd service, DCL benefits from automatic startup, resource constraints, and security restrictions, making it a reliable and resilient DNS caching layer for large-scale infrastructure.
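
That redirection uses the standard resolver configuration; an illustrative /etc/resolv.conf pointing at the local DCL daemon might look like:

```
# Illustrative /etc/resolv.conf: send all lookups to the local DCL daemon
nameserver 127.0.0.1
```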

Figure 1: A high-level architecture of how DCL is being deployed in every linux host in our environment.

DCL provides a highly flexible configuration, enabling users to fine-tune cache size, caching behavior for different DNS record types, and handling of specific return codes. It also supports customizable policies for individual domain names, offering granular control over DNS queries and cached records. This level of configurability allows applications to optimize DNS behavior based on their specific needs. By adjusting these settings, services can achieve the right balance of performance, efficiency, and resilience in their DNS resolution workflows.
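
DCL's actual configuration format isn't shown here, but a hypothetical file illustrating the knobs described above (cache size, per-record-type behavior, per-domain policies) might look like:

```
# Hypothetical DCL configuration -- real option names and format may differ
cache_max_entries: 100000     # overall cache size
negative_ttl: 5s              # how long to cache NXDOMAIN answers
record_types:
  AAAA: { cache: true, min_ttl: 30s }
domains:
  "payments.internal":        # hypothetical domain name
    refresh_ahead: true       # keep these records warm
    max_ttl: 60s
```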

DCL also includes two key features that improve overall DNS reliability:

  • Adaptive timeout: DCL dynamically adjusts DNS query timeouts based on real-time latency measurements (inspired by RFC 6298), rather than relying on the default 5-second timeout.
  • Exponential backoff: DCL spaces out retries exponentially to avoid piling load onto upstream servers that are already degraded.

DCL continuously tracks DNS query error rates to upstream servers, automatically isolating endpoints that exceed a predefined failure threshold. These isolated servers undergo periodic health checks and are reinstated once they recover. In the meantime, failed queries are instantly retried with healthy servers, enabling rapid fault detection and mitigation. This proactive mechanism has prevented major incidents where misconfigurations left upstream servers unhealthy yet still reachable by clients. By shielding applications from infrastructure breakdowns, DCL ensures a resilient DNS experience.
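
The isolation logic can be sketched as a per-upstream error-rate tracker; the sample count and threshold below are illustrative, not DCL's production values:

```go
package main

// upstreamHealth tracks query outcomes for a single upstream DNS
// server and isolates it when its error rate crosses a threshold.
type upstreamHealth struct {
	total, failed int
	isolated      bool
}

const (
	minSamples       = 20   // don't judge an upstream on too few queries
	failureThreshold = 0.25 // hypothetical: isolate above a 25% error rate
)

// Record notes one query result; once enough samples exist and the
// failure rate exceeds the threshold, the upstream is isolated and
// subsequent queries are retried against healthy servers instead.
func (u *upstreamHealth) Record(ok bool) {
	u.total++
	if !ok {
		u.failed++
	}
	if u.total >= minSamples &&
		float64(u.failed)/float64(u.total) > failureThreshold {
		u.isolated = true
	}
}

// Probe is called periodically for isolated upstreams; a healthy
// response reinstates the server and resets its counters.
func (u *upstreamHealth) Probe(healthy bool) {
	if u.isolated && healthy {
		*u = upstreamHealth{}
	}
}
```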

To minimize DNS disruptions, DCL leverages dynamic configuration management, enabling updates to be applied in real time without requiring a restart or reload. This allows DCL to validate and deploy new configurations efficiently while continuing to serve DNS queries. Combined with code upgrades without service disruption, this approach ensures upgrades and configuration changes can be pushed without any service interruptions.
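
In Go, this kind of zero-downtime configuration swap is commonly built on an atomic pointer; a minimal sketch of the pattern (the config fields are hypothetical):

```go
package main

import (
	"errors"
	"sync/atomic"
)

// config is a hypothetical immutable snapshot of DCL settings.
type config struct {
	CacheSize int
	Upstreams []string
}

// current holds the active config. Query paths Load it lock-free on
// every lookup; the management path validates and Stores a
// replacement, so changes apply in real time with no restart.
var current atomic.Value

// Apply validates a new config and swaps it in atomically; on
// rejection the old config simply stays live.
func Apply(c *config) error {
	if c.CacheSize <= 0 || len(c.Upstreams) == 0 {
		return errors.New("rejecting invalid config")
	}
	current.Store(c)
	return nil
}

// Active returns the config in effect for the current query.
func Active() *config { return current.Load().(*config) }
```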

DCL incorporates a warm cache mechanism that keeps DNS records valid at all times by proactively refreshing entries before they expire. This prevents cache misses and lets applications resolve domain names with minimal latency. DCL also preserves its cache across restarts, preventing spikes in query load and latency during service restarts or reloads. Together, these features minimize tail latency, enhance the performance of critical applications, and let new code ship without the complexity of traditional rollout processes.

Testing for production readiness

Since DNS is a critical system that underpins our global operations, we conduct rigorous testing to safeguard against vulnerabilities, confirm the system’s capabilities to handle increased loads, and proactively prevent misconfigurations during updates.

DCL, developed in Go, benefits from robust unit testing with strong code coverage. We use Go's testing framework and a Docker-based end-to-end framework for functional tests, ensuring comprehensive coverage of all supported features. Additionally, we employ an open-source DNS compliance test suite to verify DCL's adherence to DNS RFC standards.

Our team has made significant investments in testing DCL alongside other DNS clients through A/B testing, focusing on behavior, scalability and performance over extended periods to identify potential resource inefficiencies. We also conducted penetration testing, aligning DCL with LinkedIn's security standards to ensure it is secure for fleet-wide deployment.

DCL is tested against various programming languages, including Java, Go, and Python, to confirm compatibility with diverse DNS library implementations. Furthermore, we validate DCL across multiple operating systems, such as Red Hat Enterprise Linux and Azure Linux, to address any behavioral discrepancies.

Deployment methodology

DCL runs on every host in LinkedIn's fleet, which means a bug or misconfiguration could cause a correlated failure across thousands of hosts. Given this potential impact, we designed a multi-layered rollout strategy to minimize risk at every stage of deployment, implemented in three phases:

  1. Installed DCL across all hosts without affecting live traffic, validating configurations and functionality; health checks, metrics, and alerts supported this validation.
  2. Progressively shifted DNS traffic from NSCD (the legacy DNS client) to DCL, starting with a small subset of hosts and expanding as confidence grew.
  3. Stopped NSCD once DCL was fully validated.

To ensure reliability and prevent incidents as we worked through the rollout, we introduced an external health checker that runs alongside DCL on every host. This checker periodically validates DCL’s health, and if an issue is detected, the host quickly switches to a cacheless mode while alerting the site reliability team. If the failure persists for more than 15 minutes, NSCD is temporarily restarted to restore caching. However, thanks to our rigorous validation and testing, we have never needed to invoke NSCD in production.

Observability and metrics

One of the key advantages of DCL is the deep visibility it provides into DNS traffic across LinkedIn. While DCL exports a standard set of per-query metrics, their true value emerges when aggregated across all hosts in our fleet. This rich dataset has been instrumental in enabling proactive DNS alerting, faster debugging of complex network issues, and accurate forecasting of long-term traffic trends. Since DCL’s deployment, our teams have leveraged this visibility to enhance DNS reliability and optimize infrastructure performance at scale.

Smart DNS alerting with DCL

DNS alerting is essential but challenging due to the sheer volume and variability of traffic across the fleet. Query patterns range from low-frequency lookups to 100K queries per second from proxy and intermediary hosts.

Traditional server-side DNS monitoring struggles with a high noise-to-signal ratio, making precise alerting difficult. DCL's client-side metrics changed this, enabling aggregate-based alerts that leverage fleet-wide patterns rather than isolated noisy signals. For example, we now trigger an alert when 5% of total DNS queries fail, ensuring DNS Site Reliability Engineers are paged only for significant issues. Since deploying this approach, we have proactively detected and mitigated faulty upstream DNS servers, preventing widespread application impact.
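
The aggregate alert reduces to a simple fleet-wide ratio; a sketch of the 5% rule described above:

```go
package main

// shouldAlert sums per-host query and failure counts across the
// fleet and pages only when the fleet-wide failure rate reaches 5%,
// rather than alerting on any single noisy host.
func shouldAlert(queries, failures []int) bool {
	var q, f int
	for i := range queries {
		q += queries[i]
		f += failures[i]
	}
	return q > 0 && float64(f)/float64(q) >= 0.05
}
```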

Figure 2: Graph showing fluctuations in error rate from all DNS clients to DNS-server-A during a failure event. The dashed red line is an alert threshold. Note that the y-axis scale is logarithmic.

Diagnostic tools for engineers 

Debugging is an equally important part of maintaining our infrastructure. We wanted all the necessary debugging data available when needed, while managing the cost of collecting detailed process profiles and query traces. To navigate this tradeoff, we focused our efforts on a few areas.

Granular metrics

Iterating over progressive rollouts, we found the right balance of debug-data granularity: we identified DNS query parameters that could be exported efficiently as metrics with a small set of labels.

Scalable logs for deeper insights

Storing individual DNS queries as metrics is impractical, so DCL leverages logs for detailed analysis. This has been crucial for root-cause investigations, especially when only specific queries fail during incidents. To maintain efficiency, we use standard log rotation techniques to minimize the memory footprint on the host, and all logs are sent to a centralized logging system for large-scale analytics and security monitoring. Additionally, DCL supports runtime log level adjustments, allowing teams to quickly switch from info to debug mode for deeper troubleshooting without requiring restarts.

Query tracing and process profiles 

A crucial gap in DNS troubleshooting is the inability to know which application or client made a specific query. To address this, we developed a tracing feature that can be enabled via command line on demand, providing fine-grained visibility into each DNS query and response. This tracing mechanism not only captures the details of every DNS message exchanged between DCL and upstream DNS servers, but also identifies the client that initiated the request by backtracking the User Datagram Protocol port used for the DNS query. The tracing output is efficiently written to files in binary format and can be conveniently decoded and displayed in a human-readable format.
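
The port-backtracking step can be illustrated by how a Linux host maps a UDP source port to a socket inode via /proc/net/udp (field positions follow the kernel's procfs format; resolving the inode to a process through /proc/&lt;pid&gt;/fd is omitted here, and this sketch is not DCL's actual implementation):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseUDPLine extracts the local port and socket inode from one
// /proc/net/udp entry. The local_address field is hexIP:hexPort,
// and the inode is the tenth whitespace-separated field.
func parseUDPLine(line string) (port uint64, inode string, err error) {
	fields := strings.Fields(line)
	if len(fields) < 10 {
		return 0, "", fmt.Errorf("short /proc/net/udp line")
	}
	hp := strings.Split(fields[1], ":")
	if len(hp) != 2 {
		return 0, "", fmt.Errorf("bad local address %q", fields[1])
	}
	port, err = strconv.ParseUint(hp[1], 16, 16) // port is hex-encoded
	return port, fields[9], err
}
```

Given a query's source port, scanning this table yields the socket inode, which in turn identifies the owning process and thus the client behind the DNS query.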

Dashboards and alerts

Dashboards built on DCL metrics have been invaluable for troubleshooting, not just DNS-related incidents, but broader infrastructure issues as well. To ensure accessibility and usability, we structured dashboards by tiers of granularity and use cases, making it easier to pinpoint issues at different levels. Key dashboard categories include:

  • High-level SLIs – Tracking core DNS reliability and performance metrics
  • DNS upstream server health – The primary dashboard for on-call triaging and incident response
  • Latency breakdown – Capturing latency at various points in the DNS resolution path, helping diagnose slow response times due to upstream congestion, cache misses, or packet drops
  • Per-host visibility – Identifying and troubleshooting problematic clients
  • Long-term trends – Used for capacity planning, including QPS growth and IPv6 adoption trends
  • Resource monitoring – Tracking the DNS health and resource utilization

This structured approach ensures quick access to actionable data, enabling faster incident resolution and better long-term decision-making.

Figure 3: LinkedIn fleet-wide DNS query latency. The graph is useful to understand the median and tail latency of the DNS.

Beyond triaging, dashboards acted as a testing framework for alert rules. End-to-end testing of the alert pipeline, from rule evaluation to alert message rendering, is challenging. There’s no standardized framework for end-to-end alert testing. However, we found a workaround:

  • Visualize the alert expression in a dashboard
  • Overlay the threshold limit on historical data
  • Check if the alert would have triggered at expected points in the past

Effective dashboard debugging is an iterative process. Since alerts and dashboards lack standardized testing, they will inevitably contain bugs. The key to success is continuous iteration and refinement.

DCL at scale

As of this writing, DCL has been serving all internal DNS queries across our fleet for over a year. The journey of developing and deploying DCL at scale has been a rewarding experience, offering valuable insights into DNS patterns and system behavior.

Currently, DCL processes millions of queries per second, significantly reducing DNS resolution latency to sub-millisecond levels for most requests. Its robust fault detection and isolation mechanisms have seamlessly masked infrastructure failures, ensuring uninterrupted application performance. Additionally, DCL's validation layer has proactively prevented multiple misconfiguration errors, enhancing overall system reliability.

One of DCL’s key strengths is its application-level DNS visibility, enabling service owners to quickly diagnose connectivity issues. With fine-grained observability, DCL has reduced the mean time to detect (MTTD) infrastructure outages from hours to minutes. This visibility also helps identify heavy DNS users within our fleet, providing valuable data for trend analysis and capacity planning.

With DCL, we have built a modern, secure, and highly efficient DNS client cache that ensures reliable and low-latency resolution at scale. This transition has strengthened our infrastructure, enabling us to better support LinkedIn’s global services while maintaining the highest standards of performance and resilience.

Acknowledgements

This work would not have been possible without the contributions of many individuals. Special thanks to the amazing LinkedIn DCL team members who built DCL and led this massive initiative: Artur Makutunowicz, Nisheed Meethal, Tim Crofts, Mike Scheel, Maanas Alungh, Shenghao Huang, Bhavani Kanikaram, Deepu K, Stephen Xu, Sangita Maity, Diana Issatayeva, Harish Shetty, Cliff Mcintire, Ievegen Priadka, Caleb Cameron, Franck Martin, Guy Purcell, Abhijeet Panday, Rohit Bhanot, Lovell Felix, Zaheer Kasim Shaikh, Vaibhav Singh Gour

This project would not have been possible without the continued support and investment from the leadership team in boosting productivity across all of infrastructure engineering: Neil Pinto.