The COVID-19 pandemic gave the engineering teams at travel search giant Skyscanner an opportunity to re-examine their observability stack. The company has since described how it overhauled its approach to technical observability, building a system that improves reliability for engineers and travellers alike.
According to a recent engineering blog post, the company began this transformation in 2020, when the travel industry lay dormant and faced unprecedented disruption. Skyscanner's engineering team used this period as an opportunity to identify and address weaknesses in its monitoring systems.
"Our challenges were not simply related to cost, or the complexity of running this platform with a small team," the Skyscanner engineering team writes. "We understood that our most important problem to solve was improving the confidence of all engineers to understand and operate their services, to reliably connect more than 110 million users to over 1,200 flight, hotel and car hire partners each month."
The company's previous observability architecture was complex, combining specialised vendors with internal systems built on open-source technologies such as OpenTSDB, Prometheus, and multiple ELK stacks. This fragmentation created problems for engineers who needed to understand and troubleshoot issues across the platform.
To address these challenges, Skyscanner developed a new strategy centred on two key principles: adopting OpenTelemetry as a single standard for instrumenting services and transporting data, and implementing New Relic as a unified backend for data storage and analysis.
"We're all in when it comes to the vision of high-quality, standardised, portable, and ubiquitous telemetry provided by OpenTelemetry," the engineering team writes. "The future is OTel-native, not APM agents, and we're ready for it."
The simplicity of OpenTracing and OpenTelemetry's API designs allowed Skyscanner to migrate over 300 microservices within weeks by just bumping a library version number. This rapidly reduced the cognitive load on engineers by removing context switching between multiple observability platforms. The standardisation also enabled better correlation of traces, metrics, logs, and events across services and frameworks.
An unexpectedly positive outcome was a cultural shift in how teams approached telemetry. When made aware of the costs associated with data collection and storage, many teams voluntarily looked for more efficient approaches.
"We had teams that wanted to find more optimal ways of using telemetry," the engineering team notes. "When they saw the advantages, they were convinced, and started to rely on tracing rather than verbose logging or high-cardinality metrics. This made some teams reduce their telemetry costs by over 90%!"
To drive further adoption of the new observability stack, Skyscanner launched an "Observability Ambassadors" initiative, identifying engineers within teams who could bring observability best practices to their domains. The company also began hosting Observability Game Days, using the official OpenTelemetry Demo to make system debugging more engaging.
Skyscanner has also connected its new abilities in technical observability back to actual business outcomes that affect its travelling customers by rethinking its approach to Service Level Objectives (SLOs). Rather than focusing solely on technical metrics like API response codes, the company now drives SLOs from signals directly related to user experience.
"With access to client-side telemetry, we can drive SLOs from signals that relate directly to our users, like 'how many flight searches displayed valid results?'" the engineering team writes.
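As a rough illustration of a user-centric SLO, the sketch below computes a service level indicator as the share of flight searches that displayed valid results and compares it with a target. It is not Skyscanner's implementation: the `SearchEvent` type, the field names, the sample data, and the 95% target are all hypothetical, chosen only to show the arithmetic.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SearchEvent:
    """One client-side flight search (hypothetical telemetry record)."""
    search_id: str
    displayed_valid_results: bool


def valid_results_sli(events: list[SearchEvent]) -> float:
    """Fraction of searches that displayed valid results (1.0 when there is no traffic)."""
    if not events:
        return 1.0
    good = sum(1 for e in events if e.displayed_valid_results)
    return good / len(events)


def error_budget_remaining(sli: float, target: float) -> float:
    """Share of the error budget still unspent; negative once the SLO is breached."""
    allowed_failure = 1.0 - target
    if allowed_failure <= 0:
        raise ValueError("target must be below 1.0 to leave an error budget")
    return 1.0 - (1.0 - sli) / allowed_failure


# Illustrative data: 1,000 searches, of which every 50th shows no valid results.
events = [SearchEvent(f"s{i}", displayed_valid_results=(i % 50 != 0)) for i in range(1000)]
sli = valid_results_sli(events)
budget = error_budget_remaining(sli, target=0.95)  # assumed target, not from the source
print(f"SLI: {sli:.3f}, error budget remaining: {budget:.0%}")
```

Framing the indicator this way ties the objective to what a traveller actually sees, rather than to whether an individual API call returned a successful response code.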
This new observability approach has radically changed how Skyscanner manages cross-domain dependencies and facilitates collaboration between teams. "We're using observability not only as a technical tool, but also as a sociotechnical tool, to help us reason about our system and make data-driven decisions," the engineering team concludes. "We base our commitments on evidence, not intuition."
Skyscanner is not the only company using OpenTelemetry as a lever to redefine how SLOs are driven. An article by NOFire AI reinforces and extends many of these themes, particularly the transformative impact of OpenTelemetry and the shift from traditional monitoring to intelligent observability. The article also emphasises how OpenTelemetry provides unified observability by replacing the fragmented, siloed approaches that it argues hold organisations back. NOFire argues that Service Level Objectives should focus on user experience rather than arbitrary technical metrics, mirroring Skyscanner's shift from API response codes to meaningful user-centric signals like "how many flight searches displayed valid results".
However, NOFire takes this further by introducing AI-powered incident resolution. While Skyscanner achieved cultural transformation through human-centred initiatives like Observability Ambassadors, NOFire proposes using Generative AI to automatically surface root causes and generate actionable resolutions, potentially eliminating the manual dashboard exploration that both companies identify as problematic.