Key Takeaways
- Investments in Engineering Productivity tend to happen at specific inflection points, such as increases in headcount, after incidents, with organizational maturity, entering new markets, or when aiming for operational excellence.
- Some decisions are foundational and will shape builder tools for generations. For example, deciding between monorepo and multirepo architectures can impact the development lifecycle, testing strategies, and overall engineering workflow, requiring tailored approaches to optimize productivity.
- As organizations scale, it's important to implement controls and gates (like mandatory code reviews or canary deployments) strategically and pragmatically to mitigate risks, even if they appear to slow down development.
- If you want to advocate for Engineering Productivity improvements, you need a data-driven approach. This can help your leadership understand the impact of seemingly small inefficiencies that aggregate to significant waste at scale.
- Whether to build proprietary tools or use third-party solutions depends on factors such as scalability needs, opportunities for optimization, the need for ecosystem integration, and a commitment to continuous evolution in line with industry standards.
My presentation at QCon San Francisco 2024 delved into these topics:
Lesson Learned from a Failure
Back in 2011, I was an engineer on a team that owned a service processing petabytes of data to support our company’s retail site. As Black Friday approached, we knew we needed load testing to ensure the service could handle peak traffic, but testing was so time consuming that we decided it was not worth the investment and cut it. Black Friday came, and our service didn’t have enough hardware to handle the load. We frantically scaled, only to overload a critical database.
The Amazon Store flew blind for eight hours during the busiest shopping day of the year. The next day my Director asked me how to prevent it from happening again. It was a turning point. Suddenly, building the load testing infrastructure I'd previously suggested was a priority. That failure, a multi-million dollar operational issue, became an inflection point, not just for my career, but for Amazon's infrastructure too. The lesson? Never waste a great crisis!
The important takeaway, therefore, is to identify the inflection points at which investments in Engineering Productivity become advantageous, and to recognize the recurring patterns behind them.
Number of Engineers
The most obvious inflection point is typically the number of engineers employed by the organization. As a company expands from three thousand to six thousand to fifty thousand engineers, inefficiencies that were previously inconsequential can become significant.
I tend to think about the extent of operational friction, the frequency of its occurrence, and the number of people affected. To illustrate, an engineer recently highlighted a manual task requiring ten seconds to complete. This seemingly minor inconvenience was frequent enough that, at Amazon’s scale, it equated to approximately thirty-five engineer years of lost productivity per year. It’s not just about the time lost; it’s also about the opportunities missed when engineers could have been working on higher-leverage, more strategic bets.
Automating the task would require four months of work, plus an additional two months for ongoing operational maintenance. Thus, an investment of roughly half an engineer year to recover thirty-five engineer years represents a judicious allocation of resources. These benefits compound over time: over a five-year period with fixed headcount alone, the cumulative savings amount to roughly 174 engineer years, and in high-growth environments these benefits are amplified even further.
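For readers who want to sanity-check this kind of arithmetic, here is a minimal back-of-the-envelope sketch in Python. The figures are the ones quoted above; the 2,000-hour engineer year and the annually recurring maintenance cost are my assumptions, and the implied execution frequency is back-calculated rather than measured.

```python
# Back-of-the-envelope ROI for automating a small manual task.
# Quoted figures: a ten-second task costing ~35 engineer years/year,
# four months to automate, two months of maintenance. The 2,000-hour
# engineer year and annually recurring maintenance are assumptions.

ENGINEER_YEAR_HOURS = 2000            # assumption
task_seconds = 10
lost_engineer_years_per_year = 35     # figure quoted above

# Implied number of task executions per year (back-calculated).
executions_per_year = (
    lost_engineer_years_per_year * ENGINEER_YEAR_HOURS * 3600 / task_seconds
)
print(f"Implied executions per year: {executions_per_year:,.0f}")  # ~25 million

build_cost_years = 4 / 12             # four months to build
maintenance_years_per_year = 2 / 12   # assumed recurring maintenance
horizon_years = 5

net_savings = (
    lost_engineer_years_per_year * horizon_years
    - build_cost_years
    - maintenance_years_per_year * horizon_years
)
print(f"Net savings over {horizon_years} years: ~{net_savings:.0f} engineer years")  # ~174
```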
While quantifying engineering productivity can be challenging, it is not an insurmountable task. I recommend Douglas Hubbard's book How to Measure Anything, which offers valuable methodologies for quantifying seemingly intangible metrics. Even if initial estimates are imprecise – yielding a range of, for example, twenty to fifty engineer years – a reasonable approximation provides the directional data necessary to inform sound investment decisions.
Inflection Point - A Crisis
While engineer headcount constitutes one inflection point, crises represent another significant one. I experienced such a turning point with the Black Friday 2011 incident described earlier.
Eight years later, Amazon Prime Video was responsible for the live world premiere of "You Need to Calm Down", by Taylor Swift. We anticipated tens of millions of simultaneous viewers given her extensive fanbase. The technology developed in response to that 2011 operational issue was leveraged to validate the platform's capacity to handle this volume of concurrent users.
The evolution from that operational issue in 2011 to supporting a livestream for tens of millions of viewers in 2019 was quite a journey. First, I focused on creating the infrastructure for just my team, simply wanting to prevent a recurrence of that previous incident. Then, I recognized the potential benefit of making it available to other teams. Adoption grew organically, progressing from two teams to ten, then to a hundred.
This project, initially a personal pet project, ultimately scaled to support thousands of services and evolved into my primary responsibility. It became the Amazon-wide load and performance testing infrastructure used for events such as the above mentioned Taylor Swift premiere and the launch of AWS services. This progression underscores the potential of crises, both personal and professional, to serve as inflection points for advancements in engineering productivity.
Maturity
Organizational maturity can also serve as an inflection point. Large software companies like Amazon, Microsoft, Google, and Meta often have significant duplication in Engineering Productivity tools. While I am not a huge fan of redundancy, this duplication is both common and often justified.
During periods of rapid growth, moving quickly and independently is paramount, so individual organizations optimize their toolsets to expedite time to market.
Or, under conditions of high ambiguity, a degree of experimentation is also warranted. The current state of Generative AI exemplifies this situation. While its eventual impact on engineering productivity is certain, the specific ways in which it will reshape coding, testing, and related processes remain unclear, so the exploration of diverse approaches concurrently is a reasonable strategy.
While some duplication makes sense, there is always a time for convergence, where deprecating redundant systems and consolidating infrastructure becomes the priority. Upon rejoining Amazon and working within Amazon Worldwide Stores, I noticed that more and more of our customers were reaching us from mobile devices, so I felt it was necessary to evolve our testing infrastructure to reflect that change.
As we were thinking about building infrastructure to facilitate device provisioning, test execution, and resource management on both physical and virtual devices, I engaged with other teams across Amazon shipping mobile applications (e.g., Amazon Prime Video and Amazon Music) and with device-centric organizations such as Alexa and Kindle. These discussions revealed considerable duplication of effort, a consequence of independent development during periods of rapid expansion. So, currently, I am focusing on driving convergence towards a more unified model across these organizations. In my role as a Senior Principal, I have the latitude to operate with a healthy disregard for organizational boundaries and act in the best interests of Amazon as a whole.
It is important to recognize that inflection points do not always necessitate expansion; convergence and consolidation are equally valid objectives. My experience at Google, where I led the effort to consolidate four disparate integration testing infrastructures, reinforced this perspective.
Operational or Engineering Excellence
Raising the bar in operational or engineering excellence can also serve as a catalyst for change. As organizations mature and the number of engineers increases, gates become important to keep that bar high.
As an example, consider the evolution of security practices. Believe it or not, in 2009, direct SSH access to production servers was common. Today, granting similar access to the EC2 or S3 production environments would be absolutely unacceptable.
Similarly, direct code submissions without review were once common. While code reviews were encouraged as a best practice, the repository tooling itself did not enforce them. Now, our code review tooling ensures that approvals from designated reviewers happen before the code is merged. These controls, while introducing a degree of friction, are essential at scale.
Should every AWS service implement canaries to monitor its health? A number of years ago, we decided that, given the availability SLAs we provide to our customers, the answer was yes. Appropriate canaries are now a mandatory prerequisite for launching any AWS service.
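AWS’s internal canary tooling is proprietary, but the underlying idea is straightforward: a small synthetic client exercises the service on a schedule and reports a health signal that alarms can act on. Below is a minimal illustrative sketch; the endpoint and interval are hypothetical placeholders, not Amazon’s actual implementation.

```python
# Minimal illustrative canary: periodically probe a critical endpoint and
# report success or failure. The endpoint URL and interval are hypothetical.
import time

import requests  # third-party: pip install requests

HEALTH_URL = "https://example-service.internal/healthcheck"  # hypothetical
INTERVAL_SECONDS = 60

def probe() -> bool:
    """Return True if the service answered with HTTP 200 within 5 seconds."""
    try:
        return requests.get(HEALTH_URL, timeout=5).status_code == 200
    except requests.RequestException:
        return False

while True:
    # A real canary would publish this signal to a metrics/alarming system
    # (e.g., CloudWatch) rather than printing it.
    print(f"canary success={int(probe())}")
    time.sleep(INTERVAL_SECONDS)
```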
Should code changes be gated on adequate code coverage? Again, the response is yes. While teams retain the autonomy to determine their specific code coverage targets, a deliberate decision regarding the acceptable level of untested code in production is required.
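Amazon’s gate is enforced by internal tooling, but the same idea is easy to reproduce in an open-source Python stack. The sketch below is a hypothetical CI step using pytest with the pytest-cov plugin; the package name `myservice` and the 80% threshold are illustrative, since each team picks its own target.

```python
# Hypothetical CI gate: run the test suite with coverage measurement and
# fail the build if coverage drops below the team-chosen threshold.
# Requires pytest and pytest-cov; "myservice" and 80% are placeholders.
import subprocess
import sys

THRESHOLD = 80  # each team decides its own acceptable level

result = subprocess.run(
    ["pytest", "--cov=myservice", f"--cov-fail-under={THRESHOLD}"]
)
sys.exit(result.returncode)
```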
Each of these controls, while potentially perceived as impediments to agility, has been implemented in direct response to past operational incidents, to raise our engineering bar.
A New Market as an Inflection Point
The emergence of new markets can also serve as an inflection point.
From 2014 to 2020, when I worked in the Builder Tools organization at Amazon, my primary focus was Engineering Productivity infrastructure for web services, because that represented the majority of the work happening at the company. But the company's expansion into devices presented new challenges. Our existing continuous integration and continuous delivery (CI/CD) tools were optimized for web services; mobile phones, Alexa devices, and Kindle devices had different needs.
Throughout the years we’ve expanded our CI/CD infrastructure to adapt to some things I never imagined we would do, such as testing turnstiles. Amazon Go stores have a checkout-free model where customers enter, scan their payment method, select their items, and exit, without requiring interaction with a cashier. The turnstiles in these stores are controlled by software, and we need to test them.
As a company diversifies its business into new markets, pre-existing assumptions embedded in tooling may no longer be valid, requiring a broadening of perspective.
Foundational Decisions as Inflection Points
Finally, certain decisions are inherently foundational, shaping the trajectory of subsequent engineering practices and tools for generations. Having observed Amazon's evolution over the past fifteen years, I have witnessed the lasting impact of such decisions.
Around 2007, Amazon had challenges managing a large, tightly coupled codebase (referred to as "Obidos"). These challenges included compilation times, deployment complexities, memory constraints, and difficulties in collaborative development. We decided to decompose the monolith into microservices – this concept is widely accepted today, but was less prevalent at the time. A primary objective in decomposing the monolith was to establish clear boundaries between teams, thereby minimizing interdependencies.
This approach led to the adoption of "two-pizza teams", a term referring to the number of engineers that can be adequately fed with two pizzas (typically six to nine individuals). The intention was to create autonomous teams responsible for discrete microservices, communicating exclusively over the network via RPC or HTTP, with no shared code. To enforce this separation, we adopted a multirepo strategy, with each team maintaining its own independent repository.
The question of monorepo versus multirepo is a matter of architectural preference. My experience at Google, a monorepo environment, and at Amazon, a multirepo environment, has demonstrated that both approaches entail managing complexity, albeit in different forms. One is not better than the other: there are trade-offs determined by where you deal with the complexity.
Consider the software development lifecycle, which typically encompasses local development changes, code reviews, submission and merging, deployment to a testing environment, and ultimately deployment to production. Ideally, integration tests are executed at each stage.
Using the analogy of a house, a multirepo environment is like a house with a lot of discrete, independent rooms. An error within one repository is contained, and it primarily impacts the responsible team, limiting disruption to other teams.
In contrast, a monorepo environment is analogous to a studio apartment: an error can have far-reaching consequences. Google, with a workforce of 120 thousand engineers, has invested significant resources in infrastructure and processes to mitigate the risks associated with a monorepo that lacks branching. While effective for the most part, this approach has drawbacks, such as a higher degree of validation complexity and deployment challenges.
Multirepo architectures inherently limit the blast radius of potential errors. In a monorepo environment, the impact can be widespread, so pre-submit testing is paramount. As a consequence, companies like Google prioritize investments in pre-submit integration testing to identify and prevent such issues. Amazon, on the other hand, leverages the inherent blast radius reduction afforded by its multirepo architecture to focus on post-submit testing strategies. The pre-submit and post-submit strategies are contrasting; however, each is effective within its respective environment.
Because of this need to shift-left and test early, Google invested heavily in infrastructure for ephemeral, hermetic test environments that can be quickly provisioned and deprovisioned. In contrast, multirepo environments have historically emphasized long-lived, static test environments with high fidelity, as well as robust canary deployments, telemetry, rollback mechanisms, and production alarming. (And now, even Amazon is increasing its focus on shifting left).
Managing library dependencies presents unique challenges in a multirepo environment. We needed to create mechanisms to facilitate the controlled vending of libraries from one repository to others, as well as to track library versions across the organization to address potential security vulnerabilities.
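To make the version-tracking half of that problem concrete, here is a minimal illustrative sketch that walks per-repository dependency manifests and flags repositories pinned to a vulnerable range of a library. The file layout, library name, and affected range are hypothetical placeholders; Amazon’s actual vending and tracking systems are internal.

```python
# Illustrative audit: scan per-repository dependency manifests and flag
# repositories that depend on a vulnerable range of a library.
# The layout (repos/<name>/dependencies.json), the library name, and the
# affected version range are hypothetical placeholders.
import json
from pathlib import Path

from packaging.specifiers import SpecifierSet  # pip install packaging
from packaging.version import Version

VULNERABLE = {"somelib": SpecifierSet("<2.3.1")}  # hypothetical security advisory

def audit(repos_root: str) -> None:
    for manifest in Path(repos_root).glob("*/dependencies.json"):
        deps = json.loads(manifest.read_text())  # e.g., {"somelib": "2.2.0"}
        for name, version in deps.items():
            affected_range = VULNERABLE.get(name)
            if affected_range and Version(version) in affected_range:
                print(f"{manifest.parent.name}: {name} {version} is affected")

audit("./repos")
```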
Proprietary Tooling vs Third Party / Open Source Tooling
We often have to choose between developing proprietary tooling and leveraging third-party solutions. There are several justifications for investing in bespoke tools. First, existing third-party or open-source solutions may lack the scalability required for your specific needs. My experience developing Amazon's load and performance testing infrastructure in 2012 exemplifies this: no readily available tools could generate the necessary throughput.
Second, opportunities for optimization can warrant bespoke development. Google's internal IDE is a relevant example. Given that Google has 120 thousand highly paid engineers, investing in an IDE that enhances their efficiency is a sound strategic decision.
Third, a cohesive ecosystem of integrated tools can be more effective than a collection of disparate third-party solutions requiring significant integration efforts. This is a justification for the internal investments that large organizations such as Amazon, Microsoft, and Google make in comprehensive engineering productivity platforms.
However, awareness of potential pitfalls associated with proprietary tooling is important. The danger of becoming isolated within a specific organizational "bubble" is ever-present. It is imperative to maintain awareness of industry trends and ensure that internal tools continue to evolve in accordance with external advancements. A failure to do so can result in proprietary solutions becoming outdated and less effective than commercially available alternatives. While I have spent a substantial portion of my career developing proprietary tools, I recognize the inherent trade-offs and the need for careful consideration when pursuing this path.
Conclusion
In summary, the path to optimizing engineering productivity is multifaceted and requires deliberate evaluation of various inflection points: headcount, crises, organizational maturity, new markets, and the pursuit of operational excellence. Foundational architectural decisions, such as choosing between monorepo and multirepo strategies, further influence the allocation of resources and the prioritization of testing methodologies.
Ultimately, the decision to invest in proprietary tooling versus third-party solutions demands a balanced assessment of scalability, optimization potential, ecosystem effects, and the ongoing need to remain aligned with industry advancements. Organizations can cultivate an engineering environment that fosters both efficiency and innovation by thoughtfully navigating these considerations.