
How a Critical Hosting Failure Solved a DevOps Crisis

Resilience isn’t just about solving today’s problems — it’s about building systems and cultures that can adapt to tomorrow’s challenges.
Feb 7th, 2025 2:00pm by

When routine system updates caused critical hosting systems to fail and left machines unbootable, Pentera’s DevOps team found itself racing against time to track down a nightmare bug.

With operations grinding to a halt, they collaborated with the company’s in-house security researchers for a different perspective. This collaboration uncovered a flaw in the hosting platform and showcased the power of cross-discipline teamwork to resolve complex issues. This story offers a blueprint for resilience for organizations grappling with similar challenges: combining technical know-how with strategic collaboration to stay ahead of disruptions.

The Unexpected Boot Failure

In the final weeks of 2024, our DevOps team faced a surprising situation: Machines previously accessible on the network suddenly failed to connect. This failure halted the team’s ability to continue developing and releasing versions to customers, making it imperative to identify and fix the issue quickly.

The team launched an exhaustive investigation, retracing their steps through environment variables and configuration files to determine what could have changed to cause this. Upon physical inspection of the affected machines, they encountered boot failures accompanied by the following error message:

Looking a bit further up the terminal, they could also see:

Something was causing libcrypto.so.1.1 to be missing during the boot process, rendering the machine unusable.

When in Doubt (and Facing a Time Crunch): Brainstorm

Under pressure to roll out product updates on schedule, the DevOps team faced a tough decision. They didn’t know where the issue was coming from and needed to figure it out quickly. There was a strong indication that something was wrong with the initramfs, a key component in the boot process of Debian and other Linux systems, but little more than that. With that knowledge, they could reach out to the Debian team for long-term insights, but there was no predicting how long that would take, and it wouldn’t resolve the immediate challenge of getting back online quickly.

Alternatively, they could implement a workaround to bypass the issue, but that risked leaving the root cause unresolved and inviting future problems. Instead, they opted for a more innovative approach: a brainstorming session involving fresh perspectives — people unfamiliar with the problem and free from biases tied to past actions. Given my background in researching Linux systems, our VP of Research suggested I join the team to see what I could contribute.

As a research team lead within the Pentera Labs team, my experience and perspective differ from those of the DevOps team. While their experience primarily focuses on building and maintaining products, my role involves researching the latest attack trends and techniques, understanding how threat actors exploit vulnerabilities, and, in essence, figuring out how to break and exploit things effectively.

The root of the issue wasn’t immediately apparent. This task was unlike my usual assignments: instead of looking for ways to break a system, I set out to reverse-engineer the conditions or mechanisms that had created a denial-of-service (DoS) scenario. This shift in perspective was challenging but engaging, offering a valuable opportunity to approach the problem creatively.

Debian Mkinitramfs Flaw

I spent two weeks analyzing the system and collaborating closely with DevOps. We uncovered the root cause: a bug that had been dormant in the system until this specific scenario triggered it. Interestingly, the issue wasn’t directly related to the choice of tools or infrastructure upgrades but revealed a more significant systemic weakness within Debian.

The Culprit

A routine part of our product’s installation is upgrading the system’s packages to the latest available versions. To achieve that, we have compiled Python code that runs apt upgrade. This turned out to be the root cause: during the investigation, we discovered that the bug was triggered by running apt upgrade from within an ELF executable compiled with PyInstaller. Digging further into why, we found that PyInstaller packages some libraries with the executable and then uses the LD_LIBRARY_PATH environment variable to load them. In short, LD_LIBRARY_PATH specifies directories where the dynamic linker should look for shared libraries before searching the standard library paths.
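PyInstaller's documentation notes that on Linux the bootloader saves the caller's original value in LD_LIBRARY_PATH_ORIG before overriding LD_LIBRARY_PATH. The sketch below shows one way a frozen application could undo that override for its child processes. It assumes the application launches apt through subprocess, which the article does not show; restored_env and run_apt_upgrade are hypothetical names. This is an application-side mitigation only and does not fix the underlying mkinitramfs behavior.

```python
import os
import subprocess


def restored_env(environ=None):
    """Return a copy of the environment with PyInstaller's
    LD_LIBRARY_PATH override undone."""
    env = dict(os.environ if environ is None else environ)
    original = env.get("LD_LIBRARY_PATH_ORIG")
    if original is not None:
        # PyInstaller saved the caller's value here; put it back.
        env["LD_LIBRARY_PATH"] = original
    else:
        # The variable was not set before the bootloader ran; remove it.
        env.pop("LD_LIBRARY_PATH", None)
    return env


def run_apt_upgrade():
    """Hypothetical helper: run the upgrade with the sanitized environment."""
    return subprocess.run(
        ["apt-get", "-y", "upgrade"],
        env=restored_env(),
        check=True,
    )
```

With this in place, any tool that apt triggers, including mkinitramfs, inherits the system's own library search path rather than the one pointing at the bundled libraries.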

Removing the apt upgrade from the ELF file made the crash disappear.

This crash can be easily replicated using the following command (tested on Ubuntu 20.04).

Underlying Cause

The upgrade process can update the kernel or other critical packages, requiring changes to the initial RAM filesystem (initramfs). The initramfs contains the essential drivers and tools needed to mount the root filesystem and boot the system, so it must be regenerated whenever updates affect the boot process.

During this process, the mkinitramfs command uses a subroutine called copy_exec to copy some executables into a temporary directory, which is later compressed into the final initramfs image.

copy_exec uses the ldd command to find the library dependencies of those binaries and copies them as well. For example, running ldd on /sbin/modprobe:

We can see libcrypto.so.1.1 here as well.

At its start, the mkinitramfs script creates the necessary directories for the libraries being copied.

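However, when the LD_LIBRARY_PATH environment variable is set, the output of ldd changes: libraries found in those directories are reported at their LD_LIBRARY_PATH locations instead of the standard ones.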
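To make the effect of the search order concrete, here is a deliberately simplified Python model of library resolution. It is not how the dynamic linker is implemented (it ignores RPATH, the ld.so cache, and other rules); it only illustrates that directories listed in LD_LIBRARY_PATH are consulted before the standard ones. The function names and the directory layout under a temporary root are assumptions made for the illustration.

```python
import os
import tempfile


def resolve_library(name, ld_library_path, default_dirs):
    """Return the first existing path for `name`, searching the
    LD_LIBRARY_PATH entries before the default directories."""
    extra_dirs = [d for d in ld_library_path.split(":") if d]
    for directory in extra_dirs + list(default_dirs):
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    return None


def demo():
    """Resolve the same library with and without an LD_LIBRARY_PATH override."""
    with tempfile.TemporaryDirectory() as root:
        system_dir = os.path.join(root, "usr", "lib")
        bundled_dir = os.path.join(root, "tmp", "lib")
        for directory in (system_dir, bundled_dir):
            os.makedirs(directory)
            # Empty placeholder files stand in for the real libraries.
            open(os.path.join(directory, "libcrypto.so.1.1"), "w").close()
        plain = resolve_library("libcrypto.so.1.1", "", [system_dir])
        overridden = resolve_library("libcrypto.so.1.1", bundled_dir, [system_dir])
        return os.path.relpath(plain, root), os.path.relpath(overridden, root)
```

Calling demo() returns the library's location under the standard directory first and under the overriding directory second, which mirrors how the paths reported by ldd shift once LD_LIBRARY_PATH is set.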

After adding some logs to the copy_file subroutine, which is used by copy_exec to do the actual copying, I got the following log:


The directory /tmp/lib was never created inside the temporary mkinitramfs directory, causing the cp command to fail. Thus, any library inside the LD_LIBRARY_PATH directory was left out of the initramfs image.
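The failure mode described above can be modeled in a few lines of Python. This is a simplified illustration, not the actual shell code of mkinitramfs: it pre-creates only a set of standard directories inside a staging area, then tries to copy each library to the same relative path it was resolved at. build_image, demo, and the directory names are hypothetical.

```python
import os
import shutil
import tempfile


def build_image(root, image, library_paths, standard_dirs):
    """Copy libraries from `root` into `image`, creating only the
    standard directories up front. Return the paths that failed to copy."""
    for directory in standard_dirs:
        os.makedirs(os.path.join(image, directory), exist_ok=True)
    missing = []
    for rel_path in library_paths:
        try:
            shutil.copy(os.path.join(root, rel_path), os.path.join(image, rel_path))
        except FileNotFoundError:
            # The destination directory does not exist, so the copy fails
            # and the library is silently left out of the image.
            missing.append(rel_path)
    return missing


def demo():
    """Compare a library at a standard path with one at an overridden path."""
    with tempfile.TemporaryDirectory() as root, tempfile.TemporaryDirectory() as image:
        for directory in ("usr/lib", "tmp/lib"):
            os.makedirs(os.path.join(root, directory))
            open(os.path.join(root, directory, "libcrypto.so.1.1"), "w").close()
        ok = build_image(root, image, ["usr/lib/libcrypto.so.1.1"], ["usr/lib"])
        bad = build_image(root, image, ["tmp/lib/libcrypto.so.1.1"], ["usr/lib"])
        return ok, bad
```

In this model the library resolved at the standard path is copied, while the one resolved under the non-standard directory ends up in the list of missing files.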

Remediation

Initially, the team considered avoiding the problematic feature entirely. It seemed like the most straightforward path forward — a workaround to bypass the issue. But this was a short-term band-aid that didn’t address the underlying problem. Without fixing the issue, the bug could resurface in future scenarios, possibly in ways that were harder to predict or control. Fixing the issue would ensure the entire system’s integrity for future operations.

It appears the Debian team encountered a similar issue in the past, as evidenced by the usage of ldd within copy_exec:

The environment variable LD_PRELOAD is unset while using the ldd command.

LD_PRELOAD works very similarly to LD_LIBRARY_PATH, except that it points to a specific library rather than a directory of libraries.

So, to fix the bug we found, all that needs to be done is to unset LD_LIBRARY_PATH in the same way when ldd is invoked:
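The same idea can be illustrated in Python. The sketch below is not the patch to mkinitramfs itself; it is a hypothetical helper that runs ldd with both LD_PRELOAD and LD_LIBRARY_PATH removed from the environment and extracts the resolved library paths from its output. The function names and the regular expression are assumptions for the illustration.

```python
import os
import re
import subprocess

# Matches "name => /path (0x...)" as well as "/path (0x...)" lines.
_LDD_LINE = re.compile(r"^\s*(?:\S+\s+=>\s+)?(/\S+)\s+\(0x[0-9a-f]+\)\s*$")


def parse_ldd_output(output):
    """Extract the absolute library paths from ldd output."""
    paths = []
    for line in output.splitlines():
        match = _LDD_LINE.match(line)
        if match:
            paths.append(match.group(1))
    return paths


def library_dependencies(binary):
    """Run ldd on `binary` with loader overrides removed from the environment."""
    env = {
        key: value
        for key, value in os.environ.items()
        if key not in ("LD_PRELOAD", "LD_LIBRARY_PATH")
    }
    result = subprocess.run(
        ["ldd", binary], env=env, capture_output=True, text=True, check=False
    )
    return parse_ldd_output(result.stdout)
```

Because the overrides are dropped before ldd runs, the reported paths reflect the system's standard library locations regardless of how the calling process was launched.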

Security Perspective

As a security researcher investigating the situation, I was intrigued by the potential use of what I had found as an attack vector. The outcome would be a massive DoS attack on critical hosting services, a highly destructive endgame. However, logically, from my perspective, it’s not the most attractive tactic unless your goal from the outset is to shut down the entire operation.

Executing the attack would require very high-level permissions. As an attacker, if I had already gained that level of access, I would have much more attractive options. I could use those permissions to access more lucrative systems, move laterally, or escalate privileges. I wouldn’t want to waste my access on an attack that would shut down the whole system, alerting the organization to an issue and potentially taking the system I have access to offline. So while this could technically be used as an attack, the more realistic outcome is precisely what happened here: a DevOps team accidentally creating these conditions rather than a hacker actively and purposefully exploiting them.

Cross-Discipline Collaboration: A Blueprint for Resilience

This incident highlights how cross-discipline collaboration builds resilience at the organizational level. By combining the DevOps team’s operational expertise with the investigative mindset of security researchers, we avoided waiting on the Debian team for support. This approach allowed us to identify the underlying issue and develop a real fix rather than relying on a rough workaround.

For team leaders, the lesson is clear: resilience stems from encouraging diverse perspectives and fostering interdepartmental collaboration. In this case, it was security researchers teaming with DevOps, but the principle applies across any combination of specialized skill sets. Breaking down silos and inviting fresh viewpoints can transform challenges into opportunities, ensuring long-term solutions rather than quick fixes.

To make collaboration like this a repeatable process, leaders can take deliberate steps to institutionalize it. For example:

  • Establish cross-functional “tiger teams” to tackle high-priority problems that cut across disciplines
  • Create shared knowledge hubs where teams can document and exchange insights, tools, and strategies to address recurring challenges
  • Promote cross-training opportunities, so team members develop a baseline understanding of other disciplines, improving communication and trust when it’s time to collaborate

Resilience isn’t just about solving today’s problems — it’s about building systems and cultures that can adapt to tomorrow’s challenges. Strategic teamwork isn’t merely a “nice-to-have”; it’s how organizations thrive in an increasingly complex and unpredictable world.

TNS owner Insight Partners is an investor in: Pentera.