
Security: Heartbleed vulnerability

On April 7, 2014 information was released about a new vulnerability (CVE-2014-0160) in OpenSSL, the cryptography library that powers the vast majority of private communication across the Internet. This library is key for maintaining privacy between servers and clients, and confirming that Internet servers are who they say they are.

This vulnerability, known as Heartbleed, would allow an attacker to steal the keys that protect communication, user passwords, even the system memory of a vulnerable server. This represents a major risk to large portions of private traffic on the Internet, including github.com.

Note: GitHub Enterprise servers are not affected by this vulnerability. They run an older OpenSSL version which is not vulnerable to the attack.

As of right now, we have no indication that the attack has been used against github.com. That said, the nature of the attack makes it hard to detect so we're proceeding with a high level of caution.

What is GitHub doing about this?

UPDATE: 2014-04-08 16:00 PST - All browser sessions that were active prior to the vulnerability being addressed have been reset. See below for more info.

We've completed a number of measures already and continue to work on the issue.

  1. We've patched all our systems using the newer, protected versions of OpenSSL. We started upgrading yesterday after the vulnerability became public and completed the rollout today. We are also working with our providers to make sure they're upgrading their systems to minimize GitHub's exposure.

  2. We've recreated and redeployed new SSL keys and reset internal credentials. We have also revoked our older certs just to be safe.

  3. We've forcibly reset all browser sessions that were active prior to the vulnerability being addressed on our servers. You may have been logged out and have to log back into GitHub. This was a proactive measure to defend against potential session hijacking attacks that may have taken place while the vulnerability was open.

Prior to this incident, GitHub made a number of enhancements to mitigate attacks like this. We deployed Perfect Forward Secrecy at the end of last year, which ensures that stolen encryption keys cannot be used to read previously recorded encrypted communication. We are working to find more opportunities like this.

What should you do about Heartbleed right now?

Right now, GitHub has no indication that the vulnerability has been used outside of testing scenarios. However, out of an abundance of caution, you can:

  1. Change your GitHub password. Be sure your password is strong; for more information, see What is a strong password?
  2. Enable Two-Factor Authentication.
  3. Revoke and recreate personal access and application tokens.

Stay tuned

GitHub works hard to keep your code safe. We are continuing to respond to this vulnerability and will post updates as things progress. For more information as it's available, keep an eye on Twitter or the GitHub Blog.

Denial of Service Attacks

On Tuesday, March 11th, GitHub was largely unreachable for roughly 2 hours as the result of an evolving distributed denial of service (DDoS) attack. I know that you rely on GitHub to be available all the time, and I'm sorry we let you down. I'd like to explain what happened, how we responded to it, and what we're doing to reduce the impact of future attacks like this.

Background

Over the last year, we have seen a large number and variety of denial of service attacks against various parts of the GitHub infrastructure. There are two broad types of attack that we think about when we're building our mitigation strategy: volumetric and complex.

We have designed our DDoS mitigation capabilities to allow us to respond to both volumetric and complex attacks.

Volumetric Attacks

Volumetric attacks are intended to exhaust some resource through the sheer weight of the attack. This type of attack has been seen with increasing frequency lately through UDP-based amplification attacks using protocols like DNS, SNMP, or NTP. The only way to withstand an attack like this is to have more available network capacity than the sum of all of the attacking nodes or to filter the attack traffic before it reaches your network.

Dealing with volumetric attacks is a game of numbers. Whoever has more capacity wins. With that in mind, we have taken a few steps to allow us to defend against these types of attacks.

We operate our external network connections at very low utilization. Our internet transit circuits are able to handle almost an order of magnitude more traffic than our normal daily peak. We also continually evaluate opportunities to expand our network capacity. This helps to give us some headroom for larger attacks, especially since they tend to ramp up over a period of time to their ultimate peak throughput.

In addition to managing the capacity of our own network, we've contracted with a leading DDoS mitigation service provider. A simple Hubot command can reroute our traffic to their network which can handle terabits per second. They're able to absorb the attack, filter out the malicious traffic, and forward the legitimate traffic on to us for normal processing.
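
To give a sense of how that reroute is triggered, here is a rough Hubot script sketch. The chat command, the provider hostname, and the API shape are all hypothetical; the real integration is specific to our mitigation provider.

    // scripts/ddos-reroute.js -- hypothetical Hubot script; the chat command,
    // provider endpoint, and payload are illustrative, not our actual setup.
    var https = require('https');

    module.exports = function (robot) {
      // In chat: "hubot ddos reroute on" / "hubot ddos reroute off"
      robot.respond(/ddos reroute (on|off)$/i, function (msg) {
        var enable = msg.match[1].toLowerCase() === 'on';

        // Placeholder call asking the provider to start (or stop) announcing
        // our prefixes so traffic flows through their scrubbing centers.
        var req = https.request({
          hostname: 'api.mitigation-provider.example', // hypothetical
          path: '/v1/reroute',
          method: 'POST',
          headers: { 'Content-Type': 'application/json' }
        }, function (res) {
          msg.send('Traffic reroute ' + (enable ? 'enabled' : 'disabled') +
                   ' (provider responded ' + res.statusCode + ')');
        });

        req.on('error', function (err) {
          msg.send('Reroute request failed: ' + err.message);
        });

        req.end(JSON.stringify({ enabled: enable }));
      });
    };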

Complex Attacks

Complex attacks are also designed to exhaust resources, but generally by performing expensive operations rather than saturating a network connection. Examples of these are things like SSL negotiation attacks, requests against computationally intensive parts of web applications, and the "Slowloris" attack. These kinds of attacks often require significant understanding of the application architecture to mitigate, so we prefer to handle them ourselves. This allows us to make the best decisions when choosing countermeasures and tuning them to minimize the impact on legitimate traffic.

First, we devote significant engineering effort to hardening all parts of our computing infrastructure. This involves things like tuning Linux network buffer sizes, configuring load balancers with appropriate timeouts, applying rate limiting within our application tier, and so on. Building resilience into our infrastructure is a core engineering value for us that requires continuous iteration and improvement.
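
As one small example of application-tier rate limiting, a per-client token bucket in front of expensive endpoints looks roughly like the sketch below. This is a generic Node.js illustration with made-up limits, not our actual middleware.

    // Hypothetical Express-style middleware: a per-IP token bucket.
    // Capacity and refill rate are illustrative numbers only.
    var buckets = {};

    function rateLimit(opts) {
      var capacity = opts.capacity;         // maximum burst, e.g. 20 requests
      var refillPerSec = opts.refillPerSec; // sustained rate, e.g. 5 req/s

      return function (req, res, next) {
        var now = Date.now();
        var b = buckets[req.ip] || (buckets[req.ip] = { tokens: capacity, last: now });

        // Refill tokens for the time elapsed, capped at the bucket capacity.
        b.tokens = Math.min(capacity, b.tokens + ((now - b.last) / 1000) * refillPerSec);
        b.last = now;

        if (b.tokens >= 1) {
          b.tokens -= 1;
          return next();
        }
        res.writeHead(429, { 'Retry-After': '1' });
        res.end('Too Many Requests\n');
      };
    }

    // Usage: app.use(rateLimit({ capacity: 20, refillPerSec: 5 }));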

We've also purchased and installed a software and hardware platform for detecting and mitigating complex DDoS attacks. This allows us to perform detailed inspection of our traffic so that we can apply traffic filtering and access control rules to block attack traffic. Having operational control of the platform allows us to very quickly adjust our countermeasures to deal with evolving attacks.

Our DDoS mitigation partner is also able to assist with these types of attacks, and we use them as a final line of defense.

So what happened?

At 21:25 UTC we began investigating reports of connectivity problems to github.com. We opened an incident on our status site at 21:29 UTC to let customers know we were aware of the problem and working to resolve it.

As we began investigating we noticed an apparent backlog of connections at our load balancing tier. When we see this, it typically corresponds with a performance problem with some part of our backend applications.

After some investigation, we discovered that we were seeing several thousand HTTP requests per second distributed across thousands of IP addresses for a crafted URL. These requests were being sent to the non-SSL HTTP port and were then being redirected to HTTPS, which was consuming capacity in our load balancers and in our application tier. Unfortunately, we did not have a pre-configured way to block these requests and it took us a while to deploy a change to block them.

By 22:35 UTC we had blocked the malicious requests and the site appeared to be operating normally.

Despite the fact that things appeared to be stabilizing, we were still seeing a very high number of SSL connections on our load balancers. After some further investigation, we determined that this was an additional vector that the attack was using in an effort to exhaust our SSL processing capacity. We were able to respond quickly using our mitigation platform, but the countermeasures required significant tuning to reduce false positives which impacted legitimate customers. This resulted in approximately 25 more minutes of downtime between 23:05-23:30 UTC.

By 23:34 UTC, the site was fully operational. The attack continued for quite some time even once we had successfully mitigated it, but there were no further customer impacts.

What did we learn?

The vast majority of attacks that we've seen in the last several months have been volumetric in terms of bandwidth, and we'd grown accustomed to using throughput as a way of confirming that we were under attack. This attack did not generate significantly more bandwidth but it did generate significantly more packets per second. It didn't look like what we had grown to expect an attack to look like and we did not have the monitoring we needed to detect it as quickly as we would have liked.

Once we had identified the problem, it took us much longer than we'd like to mitigate it. We had the ability to mitigate attacks of this nature in our load balancing tier and in our DDoS mitigation platform, but they were not configured in advance. It took us valuable minutes to configure, test, and tune these countermeasures which resulted in a longer than necessary downtime.

We're happy that we were able to successfully mitigate the attack but we have a lot of room to improve in terms of how long the process takes.

Next steps?

  1. We have already made adjustments to our monitoring to better detect and alert us to traffic pattern changes that are indicative of an attack. In addition, our robots are now able to automatically enable mitigation for the specific traffic pattern that we saw during the attack. These changes should dramatically reduce the amount of time it takes to respond to a wide variety of attacks in the future and reduce their impact on our service.
  2. We are investigating ways to simulate attacks in a controlled manner so that we can test our countermeasures on a regular basis, build additional confidence in our mitigation tools, and improve our response time in bringing them to bear.
  3. We are talking to third-party security consultants to review our DDoS detection and mitigation capability. We do a good job mitigating attacks we've seen before, but we'd like to more proactively plan for attacks that we haven't yet encountered.
  4. Hubot is able to route our traffic through our mitigation partner and to apply templates to operate our mitigation platform for known attack types. We've leveled him up with some new templates for attacks like this one so that he can help us recover faster in the future.

Summary

This attack was painful, and even though we were able to successfully mitigate the effects of it, it took us far too long. We know that you depend on GitHub and our entire company is focused on living up to the trust you place in us. I take problems like this personally. We will do whatever it takes to improve how we respond to problems to ensure that you can rely on GitHub being available when you need us.

Thanks for your support!

Passion Projects Short Documentary: Timoni West

We're now 11 installments into our talk series Passion Projects, which we created to help surface and celebrate the work of incredible women in the tech industry.

We sat down with past speaker Timoni West to talk a little more about her background in design and more specifically, the role the Internet is playing in making data available and consumable for everyday people.

Since filming, Timoni has started working with Alphaworks.

Timezone-aware contribution graphs

Today we've made your contribution graphs timezone-aware. GitHub is used everywhere and we want to reflect that in our features. If you happen to work from Japan, Australia or Ulan Bator, we want to count your contributions from your perspective.

When counting commits, we use the timezone information present in the timestamps for those commits. Pull requests and issues opened on the web will use the timezone of your browser. If you use the API you can also specify your timezone.
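
For example, a commit authored at 07:30 on March 11th in Tokyo (UTC+9) falls on March 10th in UTC, but with timezone-aware counting it lands on March 11th. A rough sketch of the idea, not our actual implementation:

    // Illustrative only: derive the contribution day from a commit timestamp
    // that carries the author's UTC offset (ISO 8601).
    function contributionDay(timestamp) {
      // The date portion of the timestamp is already in the author's local
      // time, so it can be used directly.
      return timestamp.slice(0, 10);
    }

    function utcDay(timestamp) {
      return new Date(timestamp).toISOString().slice(0, 10);
    }

    var ts = '2014-03-11T07:30:00+09:00'; // authored in Tokyo

    console.log(utcDay(ts));          // "2014-03-10" -- the old UTC bucket
    console.log(contributionDay(ts)); // "2014-03-11" -- the timezone-aware bucket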

We don't want to mess up your current contribution streaks, so only contributions after Monday 10 March 2014 (Coordinated Universal Time) will be timezone-aware.

Enjoy your time(zone)!

Free Public Speaking Workshop For Women

We're hosting our first ever free public speaking workshop for women in San Francisco! If you're interested in leveling up your public speaking skills, join us on Saturday, February 22nd for a day of inspiring talks from women who rock, workshopping with incredible mentors from the tech community, and (only if you're up for it) getting on stage to deliver your first lightning talk.


Conferences are notable not only for the prominent people on stage, but also for those who are missing.

— Sarah Millstein in Putting An End To Conferences Dominated By White Men

Changing the ratio starts with increasing the visibility of those people who are missing from tech conference lineups. With this workshop, we're hoping to give you the tools not only to feel comfortable talking about the work you do, but help you to increase your own visibility within the community.

Meet Our Keynote Speakers:

  • Denise Jacobs, Speaker, Author, Creativity Evangelist, Passionate Diversity Advocate
  • Diana Kimball, Expert Novice, Bright Soul, and Harvard MBA Set Out on Making the World A Better Place

Our Awesome Mentors For The Day:

  • Ana Hevesi, Community Developer at StackExchange, Conference Organizer, Brilliant Wordsmith, So Damn Well-Spoken
  • Andi Galpern, Expert Web Designer, Rockin' Musician, and Passionate Tech Educator
  • Alexis Finch, Sketch Artist, Has Probably Seen More Conference Talks Than Ted Himself, Badass Women's Advocate
  • Alice Lee, Designer and Illustrator at Dropbox, Super Talented Letterer, and Organizer of Origins
  • Anita Sarkeesian, Creator and Host of Feminist Frequency, Pop Culture Trope Expert, Probably the Most Hilarious Human Alive
  • Angelina Fabbro, Engineer/Developer and Developer Advocate at Mozilla. Writes Code/Writes Words About Code/Speaks About Code
  • Ash Huang, Designer at Pinterest, Really Quite Handy with Gifs IRL
  • C J Silverio, Cats, Weightlifting, and Node.js, Not Necessarily In That Order.
  • Divya Manian, Crazy Talented Speaker, Avid Coder, and Armchair Anarchist
  • Garann Means, JavaScript Developer, Incredible Writer, Proud Austin-ite, and Beyond Powerful Speaker
  • Emily Nakashima, Resides in the East Bay, Programs at GitHub
  • Jackie Balzer, Writes CSS Like It's Her Job (It Is), Leads An Army of CSS Badasses at Behance
  • Jen Myers, Former Passion Projects Speaker, Dev Bootcamp Instructor, Fantastic Keynoter, and Starter of Brilliant Things
  • Jesse Toth, Developer at GitHub, Cal CS Grad
  • Jessica Dillon, Lover, Fighter, Javascript Writer
  • Jessica Lord, Open Sourcerer, Former Code For America Fellow, Changing The Way The World Interacts With GitHub/Code/Javascript
  • Julie Ann Horvath, Passion Projects Creator, Developer, and Designer of Websites and Also Slides
  • Kelly Shearon, All Things Marketing and Content Strategy at GitHub, Could Write You Under A Table, Super Cool Mom
  • Luz Bratcher, Helvetica-loving UX designer at Design Commission, Event Admin for Seattle Creative Mornings
  • Mina Markham, Badass Lady Dev, Girl Develop It Founder/Instructor, Generally Rad Person
  • Netta Marshall, Lead Designer at Watsi, Formerly Rdio, Professional Ninja, Owner Of Best Website Footer On The Internet
  • Raquel Vélez, Hacker of The Web (node.js), Robotics Engineer, Polyglot, (Cal)Techer
  • Sara Pyle, Supportocat at GitHub, Amateur Shapeshifter, and Professional Superhero
  • Sonya Green, Chief Empathy Officer, Leads Support at GitHub
  • Tatiana Simonian, VP of Music at Nielsen, Formerly Music at Twitter and Disney
  • Willo O'Brien, Heart-Centered Entrepreneur, Speaker, Coach, Seriously Positive Person

The Pertinent Details:

  • GitHub’s First Public Speaking Workshop For Women
  • At GitHub HQ in San Francisco, CA
  • Saturday, February 22nd, from 11:00am-4:00pm
  • Food, beverages, moral support and also plenty of fun provided.
  • You must register interest here if you'd like to attend. The last day to register interest is Sunday, February 16th. You will be notified on Monday, February 17th if* you've been selected to participate.

*Because we can only host so many people in our space, we're using a lottery system to select participants to ensure the process is fair and balanced.

If you can't make our workshop but are interested in leveling up as a speaker, here are a few resources:

If you're a conference organizer who is looking for some resources to help diversify your lineups this year, these are all great places to start:

Video from Passion Projects Talk #10 with Dana McCallum

Dana McCallum joined us in January of 2014 for the 10th installment of our Passion Projects talk series. Dana's talk revealed how she brought her non-tech passions to life through programming. Check out the full video of her talk and our panel discussion below.

Photos from the event

Thanks to everyone who came out for Dana's talk, including our musical performance for the evening, Running in the Fog.


Photos courtesy of our fab photog :sparkles:Mona Brooks :sparkles: of Mona Brooks Photography.

Proxying User Images

A while back, we started proxying all non-HTTPS images through a custom node server called camo to avoid mixed-content warnings. We're making a small change today and proxying HTTPS images as well.

Proxying these images helps protect your privacy: your browser information won't be leaked to third-party services. And since we're routing images through our CDN, you should see faster overall load times across GitHub, as well as fewer broken images in the future.
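
The general idea behind camo-style proxying is that the application rewrites every external image URL into a proxy URL that includes an HMAC of the original URL, so the proxy will only fetch URLs the application has signed. A hedged sketch follows; the proxy host and URL layout are hypothetical, not GitHub's exact scheme.

    // Hypothetical camo-style URL signing; the host and URL format are
    // illustrative, not GitHub's production scheme.
    var crypto = require('crypto');

    var CAMO_HOST = 'https://camo.example.com';             // hypothetical proxy host
    var CAMO_KEY  = process.env.CAMO_KEY || 'shared-secret';

    function proxyImageUrl(originalUrl) {
      // The HMAC proves to the proxy that the app, not an arbitrary third
      // party, generated this proxied URL.
      var digest = crypto.createHmac('sha1', CAMO_KEY)
                         .update(originalUrl)
                         .digest('hex');
      return CAMO_HOST + '/' + digest + '?url=' + encodeURIComponent(originalUrl);
    }

    console.log(proxyImageUrl('https://example.org/avatar.png'));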

Related open source patches

DNS Outage Post Mortem

Last week on Wednesday, January 8th, GitHub experienced an outage of our DNS infrastructure. As a result, our customers experienced 42 minutes of downtime across our services, along with an additional 1 hour and 35 minutes of downtime for a subset of repositories as we worked to restore full service. I would like to apologize to our customers for the impact of this outage on your daily operations. Unplanned downtime of any length is unacceptable to us. In this case we fell short of both our customers' expectations and our own. For that, I am truly sorry.

I would like to take a moment and explain what caused the outage, what happened during the outage, and what we are doing to help prevent events like this in the future.

Some background…

For some time we've been working to identify places in our infrastructure that are vulnerable to Distributed Denial of Service (DDoS) attacks. One of the things we specifically investigated was options for improving our defenses against DNS amplification attacks, which have become very common across the internet. In order to simplify our access control rules, we decided to reduce the number of hosts which are allowed to make DNS requests and receive DNS replies to a very small number of name servers. This change allows us to explicitly reject DNS traffic that we receive for any address that isn't explicitly whitelisted, reducing our potential attack surface area.

What happened...

In order to roll out these changes, we had prepared changes to our firewall and router configuration to update the IP addresses our name servers used to send queries and receive responses. In addition, we prepared similar changes to our DNS server configuration to allow them to use these new IP addresses. The plan was to roll out this set of changes for one of our name servers, validate the new configuration worked as expected, and proceed to make the same change to the second server.

Our rollout began on the afternoon of the 8th at 13:20 PST. Changes were deployed to the first DNS server, and an initial verification led us to believe the changes had been rolled out successfully. We proceeded to deploy to the second name server at 13:29 PST, and again performed the same verification. However, problems began manifesting almost immediately.

We began to observe that certain DNS queries were timing out. We quickly investigated, and discovered a bug in our rollout procedure. We expected that when our change was applied, both our caching name servers and authoritative name servers would receive updated configuration - including their new IP addresses - and restart to apply this configuration. Both name servers received the appropriate configuration changes, but only the authoritative name server was restarted due to a bug in our Puppet manifests. As a result, our caching name server was requesting authoritative DNS records from an IP that was no longer serving DNS. This bug created the initial connection timeouts we observed, and began a cascade of events.

Our caching and authoritative name servers were reloaded at 13:49 PST, resolving DNS query timeouts. However, we observed that certain queries were now incorrectly returning NXDOMAIN. Further investigation found that our DNS zone files had become corrupted due to a circular dependency between our internal provisioning service and DNS.

During the investigation of the first phase of this incident, we triggered a deployment of our DNS system, which performs an API call against our internal provisioning system and uses the result of this call to construct a zone file. However, this query requires a functioning DNS infrastructure to complete successfully. Further, the output of this API call was not adequately sanity-checked before being converted into a zone file. As a result, this deployment removed a significant number of records from our name servers, causing the NXDOMAIN results we observed. The missing DNS records were restored by performing the API call manually, validating the output, and updating the affected zones.
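
To make the kind of validation we added concrete, a deploy script can simply refuse to write a new zone file whose record count has dropped sharply compared to the zone currently being served. The sketch below is generic and hypothetical; the threshold and function names are not our actual tooling.

    // Hypothetical sanity check before replacing a zone file.
    function validateZoneUpdate(currentRecords, newRecords) {
      if (newRecords.length === 0) {
        throw new Error('Refusing to deploy an empty zone');
      }
      // If the provisioning API returned far fewer records than we are
      // currently serving, assume the API call failed and abort the deploy.
      var allowedShrinkage = 0.10; // tolerate at most a 10% drop (illustrative)
      var floor = currentRecords.length * (1 - allowedShrinkage);
      if (newRecords.length < floor) {
        throw new Error('New zone has ' + newRecords.length + ' records, expected at least ' +
                        Math.ceil(floor) + '; aborting');
      }
      return true;
    }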

Many of our servers recovered gracefully once DNS service began responding appropriately. However, we quickly noted that github.com performance had not returned to normal, and our error rates were far higher than usual. Further investigation found that a subset of our fileservers were actively refusing connections due to what we later discovered was memory exhaustion, exacerbated by the significant number of processes spawned during the DNS outage.

Total number of processes across fileservers

Total memory footprint across fileservers

The failing fileservers began creating back pressure in our routing layer that prevented connections to healthy fileservers. Our team began manually removing all misbehaving fileservers from the routing layer, restoring service for the fileservers that had survived the spike in processes and memory during the DNS event.

The team split up the pool of disabled fileservers and triaged their status. Collectively, we found each node in one of two repairable states: either the node had calmed down enough after DNS service was restored for one of our engineers to log into the box and forcefully kill hung processes to restore service, or the node had become so exhausted that our HA daemon kicked in to STONITH the active node and bring up our secondary node. In both situations, our team performed checks against our low-level DRBD block devices to ensure there were no inconsistencies or errors in data replication. Full service was restored for all of our customers by 15:47 PST.

What we’re doing about it...

This small problem uncovered quite a bit about our infrastructure that we will be critically reviewing over the next few weeks. This includes:

  1. We are investigating further decoupling of our internal and external DNS infrastructure. While the pattern of forwarding requests to an upstream DNS server is not uncommon, the tight dependency that exists between our internal name servers and our external name servers needs to be broken up to allow changes to happen independently of each other.
  2. We are reviewing our configuration management code for other service restart bugs. In many cases, this means the improvement of our backend testing. We will be reviewing critical code for appropriate tests using rspec-puppet, as well as looking at integration tests to ensure that service management behaves as intended.
  3. We are reviewing the cyclic dependency between our internal provisioning system and our DNS resolvers, and have already updated the deployment procedure to verify the results returned from the API call before removing a large number of records.
  4. We are reviewing and testing all of the designed safety release valves in our fileserver management systems and routing layers. During the failure, when fileservers became so exhausted that the routing layer failed due to back pressure, we should have seen several protective measures kick in to automatically remove these servers from service. These mechanisms did not fire as designed, and need to be revisited.
  5. We are implementing process accounting controls to appropriately limit the resources consumed by our application processes. Specifically, we are testing Linux cgroups to further isolate application processes from administrative system functionality. In the event of a similar event in the future, this should allow us to restore full access much more quickly.
  6. We are reviewing the code deployed to our fileservers to analyze it for tight dependencies on DNS. We reviewed the DNS time-outs on our fileservers and found that DNS requests should have timed out after 1 second and only been retried 2 times in total. This analysis, along with the cgroup implementation, should provide a better barrier against runaway processes in the first place, and a safety valve to manage them if processing becomes unruly in the future.

Summary

We realize that GitHub is an important part of your development workflow. Again, I would like to apologize for the impact that this outage had on your operations. We take great pride in providing the best possible service quality to our customers. Occasionally, we run into problems as detailed above. These incidents further drive us to continually improve the quality of our own internal operations and to ensure that we are living up to the trust you have placed in us. We are working diligently to provide you with a stable, fast, and pleasant GitHub experience. Thank you for your continued support of GitHub!

Optimizing large selector sets

CSS selectors are to frontend development as SQL statements are to the backend. Aside from their origin in CSS, we use them all over our JavaScript. Importantly, selectors are declarative, which makes them prime candidates for optimizations.

Browsers have a number of ways of dealing with parsing, processing, and matching large numbers of CSS selectors. Modern web apps are now using thousands of selectors in their stylesheets. In order to calculate the styles of a single element, a huge number of CSS rules need to be considered. Browsers don't just iterate over every selector and test it. That would be way too slow.

Most browsers implement some kind of grouping data structure to sort out obvious rules that would not match. In WebKit, it's called a RuleSet.

SelectorSet

SelectorSet is a JavaScript implementation of the grouping technique browsers already use. If you have a set of selectors known upfront, it makes matching and querying elements against that set of selectors much more efficient.

Selectors added to the set are quickly analyzed and indexed under a key. This key is derived from a significant part of the rightmost side of the selector. If the selector targets an id, the id name is used as the key. If there's a class, the class name is used, and so forth. The selector is then put into a map indexed by this key. Looking up the key is constant time.

When it's time to match an element against the group, the element's properties are examined for possible keys. These keys are then looked up in the map, which returns a smaller set of selectors; only those selectors undergo a full matches test against the element.
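
A stripped-down sketch of that idea is below. The real SelectorSet library handles compound selectors, specificity ordering, and older browsers; this toy version only shows the indexing and lookup steps.

    // Toy version of the grouping technique; illustrative only.
    function TinySelectorSet() {
      this.byId = {};
      this.byClass = {};
      this.other = [];
    }

    TinySelectorSet.prototype.add = function (selector) {
      // Key on a significant part of the rightmost compound selector.
      var id = selector.match(/#([\w-]+)\s*$/);
      var cls = selector.match(/\.([\w-]+)\s*$/);
      if (id) {
        (this.byId[id[1]] = this.byId[id[1]] || []).push(selector);
      } else if (cls) {
        (this.byClass[cls[1]] = this.byClass[cls[1]] || []).push(selector);
      } else {
        this.other.push(selector);
      }
    };

    TinySelectorSet.prototype.matches = function (el) {
      // Only selectors indexed under this element's keys get the full test.
      var candidates = this.other.slice();
      if (el.id && this.byId[el.id]) candidates = candidates.concat(this.byId[el.id]);
      for (var i = 0; i < el.classList.length; i++) {
        var bucket = this.byClass[el.classList[i]];
        if (bucket) candidates = candidates.concat(bucket);
      }
      return candidates.filter(function (sel) {
        return el.matches(sel); // may need a vendor-prefixed matchesSelector in older browsers
      });
    };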

Speeding up document delegated events

jQuery's original $.fn.live function and its modern form, $.fn.on, are probably the most well-known delegation APIs. The main advantage of a delegated event handler over a directly bound one is that new elements added after DOMContentLoaded will still trigger the handler. A technique like this is essential when using a pattern such as pjax, where the entire page never fully reloads.
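
For example, instead of binding directly to elements that may not exist yet, the handler is attached once at the document and filtered by selector at dispatch time (the selector and handler names here are made up):

    function expandDiff(event) {
      $(event.currentTarget).closest('.diff').toggleClass('expanded');
    }

    // Directly bound: only matches elements that exist right now.
    $('.js-expand-diff').on('click', expandDiff);

    // Document-delegated: also fires for matching elements added later,
    // for example content swapped in by pjax.
    $(document).on('click', '.js-expand-diff', expandDiff);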

Extensive use of document-delegated event handlers is considered controversial. This includes applications with a large number of $('.foo').live('click') or $(document).on('click', '.foo') registrations. The common performance argument is that the selector has to be matched against the entire ancestor chain of the event target. On an application with a large, deeply nested DOM like github.com, this chain can be as deep as 15 elements. However, this is likely not the most significant factor; the bigger cost comes when the number of delegated selectors themselves is large. GitHub has 100+ and Basecamp has 300+ document-delegated events.

Using the selector set technique described above, installing this jQuery patch could massively speed up your app's event dispatch. Here's a fun little jsPerf test using real GitHub selectors and markup to demonstrate how much faster the patched jQuery is.

Conclusion

Both of these libraries should be unnecessary and hopefully obsoleted by browsers someday. Browsers already implement techniques like this to process CSS styles efficiently. It's still unfortunate we have no native implementation of declarative event handlers, even though people have been doing this since 2006.

References

Video from Passion Projects Talk #7 with Jen Myers

Jen Myers joined us in December of 2013 for the 7th installment of our Passion Projects talk series. Jen taught us the importance of not being an expert and how to be responsible for our own learning and personal and professional growth. Check out the full video of her talk and our panel discussion below.

Photos from the event


Improving our SSL setup

As we announced previously, we've improved our SSL setup by deploying forward secrecy and improving the list of supported ciphers. Deploying forward secrecy and up-to-date cipher lists comes with a number of considerations, which makes doing it properly non-trivial.

This is why we thought it would be worth expanding on the discussions we've had, the choices we've made, and the feedback we've received.

Support newer versions of TLS

A lot of the Internet's traffic is still secured by TLS 1.0. This version has been attacked numerous times and doesn't support the newer algorithms you'd want to deploy.

We were glad that we were already on a recent enough OpenSSL version to support TLS 1.1 and 1.2 as well. If you're looking at improving your SSL setup, making sure that you can support TLS 1.2 is the first step you should take, because it makes the other improvements possible. TLS 1.2 is supported in OpenSSL 1.0.1 and newer.

To BEAST or not to BEAST

When the BEAST attack was first published, the recommended way to mitigate it was to switch to RC4. Since then, however, additional attacks against RC4 have been devised.

This has led more and more people, like those behind SSL Labs and Mozilla, to recommend moving away from RC4.

Attacks against RC4 will only get better over time, and the vast majority of browsers have implemented client-side protections against BEAST. This is why we have decided to move RC4 to the bottom of our cipher prioritization, keeping it only for backwards compatibility.

The only cipher mode that is relatively broadly supported and hasn't been compromised by attacks is AES-GCM. This mode of AES doesn't suffer from the keystream biases found in RC4 or the CBC weaknesses that resulted in BEAST and Lucky 13.

Currently AES-GCM is supported in Chrome, and support is in the works for other browsers like Firefox. We've given priority to these ciphers and, given our usage patterns, we now see a large majority of our connections secured by them.
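
To make the ordering concrete, here is roughly what a forward-secrecy-first cipher preference looks like in a Node.js TLS server. This is an illustration of the prioritization principle only; it is not our HAProxy configuration or our exact cipher string.

    // Illustration only: prefer ECDHE with AES-GCM, keep RC4 last for
    // legacy clients. Key/cert paths are placeholders.
    var https = require('https');
    var fs = require('fs');

    var server = https.createServer({
      key: fs.readFileSync('server-key.pem'),
      cert: fs.readFileSync('server-cert.pem'),
      ciphers: [
        'ECDHE-RSA-AES128-GCM-SHA256',
        'ECDHE-RSA-AES256-GCM-SHA384',
        'ECDHE-RSA-AES128-SHA256',
        'ECDHE-RSA-AES256-SHA384',
        'AES128-GCM-SHA256',
        'RC4-SHA'                    // backwards compatibility only
      ].join(':'),
      honorCipherOrder: true         // the server's preference order wins
    }, function (req, res) {
      res.end('hello\n');
    });

    server.listen(443);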

Forward secrecy pitfalls

So the recommendations on which ciphers to use are fairly straightforward. But choosing the right ciphers is only one step towards ensuring forward secrecy. There are some pitfalls that can cause you to not actually provide any additional security to customers.

In order to explain these potential problems, we first need to introduce the concept of session resumption. Session resumption is a mechanism used to significantly shorten the handshake when a new connection is opened. This means that if a client connects again to the same server, we can do a shorter setup and greatly reduce the time it takes to set up a secure connection.

There are two mechanisms for implementing session resumption: the first uses session IDs, the second uses session tickets.

SSL Session IDs

Using session IDs means that the server keeps track of state: if a client reconnects with a session ID the server has given out, the server can reuse the existing state it tracked for that session. Let's see how that looks when we connect to a server supporting session IDs.

openssl s_client -tls1_2 -connect github.com:443 < /dev/null

...

New, TLSv1/SSLv3, Cipher is ECDHE-RSA-AES256-GCM-SHA384
Server public key is 2048 bit
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
SSL-Session:
    Protocol  : TLSv1.2
    Cipher    : ECDHE-RSA-AES256-GCM-SHA384
    Session-ID: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

...

What you can see here is that the server hands out a Session-ID that the client can send back when it reconnects. The downside, of course, is that the server needs to keep track of this state.

This state tracking also means that if you have a site with multiple front ends for SSL termination, you might not get the benefits you expect. If a client ends up on a different front end the second time, that front end doesn't know about the session ID and will have to set up a completely new connection.

SSL Session tickets

SSL Session tickets are described in RFC5077 and provide a mechanism that means we don't have to keep the same state at the server.

How this mechanism works is that the state is encrypted by the server and handed to the client. This means the server doesn't have to keep all this state in memory. It does mean, however, that the key used to encrypt session tickets needs to be tracked server-side. This is how it looks when we connect to a server supporting session tickets.


New, TLSv1/SSLv3, Cipher is ECDHE-RSA-AES256-GCM-SHA384
Server public key is 2048 bit
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
SSL-Session:
    Protocol  : TLSv1.2
    Cipher    : ECDHE-RSA-AES256-GCM-SHA384
    ...
    TLS session ticket:
    0000 - XX XX XX XX XX XX XX XX-XX XX XX XX XX XX XX XX   ................
    0010 - XX XX XX XX XX XX XX XX-XX XX XX XX XX XX XX XX   ................
    0020 - XX XX XX XX XX XX XX XX-XX XX XX XX XX XX XX XX   ................
    0030 - XX XX XX XX XX XX XX XX-XX XX XX XX XX XX XX XX   ................
    0040 - XX XX XX XX XX XX XX XX-XX XX XX XX XX XX XX XX   ................
    0050 - XX XX XX XX XX XX XX XX-XX XX XX XX XX XX XX XX   ................
    0060 - XX XX XX XX XX XX XX XX-XX XX XX XX XX XX XX XX   ................
    0070 - XX XX XX XX XX XX XX XX-XX XX XX XX XX XX XX XX   ................
    0080 - XX XX XX XX XX XX XX XX-XX XX XX XX XX XX XX XX   ................
    0090 - XX XX XX XX XX XX XX XX-XX XX XX XX XX XX XX XX   ................

With a shared session ticket key, it is possible to use these tickets across multiple front ends. This way you can have the performance benefits of session resumption even across different servers. If you don't share the ticket key, it has the same performance benefits as using session IDs.

How this applies to GitHub

Not carefully considering the session resumption mechanism can lead to not getting the benefits of forward secrecy. If you keep track of the state for too long, it can be used to decrypt prior sessions, even when deploying forward secrecy.

This is described well by Adam Langley on his blog. Twitter also did a technical deep dive describing how they deployed a setup with sharing the session ticket key.

So, we had to decide whether developing a secure means of sharing ticket keys (à la Twitter) was necessary to maintain acceptable performance given our current traffic patterns. We found that clients usually end up on the same load balancer when they make a new connection shortly after a previous one. As a result, we decided that we can rely on session IDs as our resumption mechanism and still maintain a sufficient level of performance for clients.

This is also where we got tripped up. We currently use HAProxy for SSL termination, which ends up using the default OpenSSL settings if you don't specify any additional options. This means that both session IDs and session tickets are enabled by default.

The problem here lies with session tickets being enabled. Even though we didn't set up key sharing across servers, HAProxy still uses an in-memory key to encrypt session tickets. This encryption key is initialized when the process starts up and stays the same for the lifetime of the process.

This means that if an HAProxy process runs for a long time, an attacker who obtains the session ticket key can decrypt traffic from any prior session whose ticket was encrypted using that key. This, of course, doesn't provide the forward secrecy properties we were aiming for.

Session IDs don't have this problem, since they have a lifetime of 5 minutes (on our platform), making the window for this attack only 5 minutes wide instead of the entire process lifetime.

Given that session tickets don't provide any additional value for us at this point, we decided to disable them and only rely on session IDs. This way we get the benefits of forward secrecy while also maintaining an acceptable level of performance for clients.
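
The same trade-off can be illustrated with a Node.js TLS server: turn off session tickets and let resumption rely on short-lived, server-side session IDs. This is an analogy for the behaviour we configured, not our actual HAProxy setup.

    // Illustrative analogy, not our HAProxy configuration: disable RFC 5077
    // session tickets and cap session ID lifetime so resumption state ages
    // out quickly. Key/cert paths are placeholders.
    var tls = require('tls');
    var fs = require('fs');
    var constants = require('constants');

    var server = tls.createServer({
      key: fs.readFileSync('server-key.pem'),
      cert: fs.readFileSync('server-cert.pem'),
      secureOptions: constants.SSL_OP_NO_TICKET, // no session tickets
      sessionTimeout: 300                        // session IDs live ~5 minutes
    }, function (socket) {
      socket.end('hello\n');
    });

    server.listen(443);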

Acknowledgements

We would like to thank Jeff Hodges for reaching out to us and pointing out what we had missed in our initial setup.

Introducing Forward Secrecy and Authenticated Encryption Ciphers

As of yesterday we've updated our SSL setup on the systems that serve traffic for GitHub. The changes introduce support for Forward Secrecy and Authenticated Encryption Ciphers.

So what is Forward Secrecy? The EFF provides a good explanation of what it is and why it is important. Authenticated Encryption means that we provide ciphers that are much less vulnerable to attacks. These are already supported in Chrome.

Also check SSL Labs if you want to know more details of the setup we've deployed.

Since this article was published, we've also written a more extensive post on what we've done.

The Ghost of Issues Past

The end of the year is fast approaching, and this is a good time to review open issues from long ago. A great way to find older issues and pull requests is our wonderful search system. Here are a few examples:

That last group, the ones not touched in the past year, should probably just be closed. If it's remained untouched in 2013, it probably won't be touched in 2014. There are 563,600 open issues across GitHub that have not been touched in the past year.

So go forth and close with impunity!

Join our Octostudy!

There are a lot of interesting people on GitHub today. Since we can't meet everyone at a conference, drinkup, or charity dodgeball game, we are hoping you can tell us a little more about yourself.

Please take a minute to fill out this short survey. You'll be helping us learn how we can make GitHub even better for you.


Cheers & Octocats!

(Also: tell your friends.)

Weak passwords brute forced

Some GitHub user accounts with weak passwords were recently compromised due to a brute force password-guessing attack. I want to take this opportunity to talk about our response to this specific incident and account security in general.

We sent an email to users with compromised accounts letting them know what to do. Their passwords have been reset and personal access tokens, OAuth authorizations, and SSH keys have all been revoked. Affected users will need to create a new, strong password and review their account for any suspicious activity. This investigation is ongoing and we will notify you if at any point we discover unauthorized activity relating to source code or sensitive account information.

Out of an abundance of caution, some user accounts may have been reset even if a strong password was being used. Activity on these accounts showed logins from IP addresses involved in this incident.

The Security History page logs important events involving your account. If you had a strong password or GitHub's two-factor authentication enabled, you may still have seen failed attempts to access your account.

This is a great opportunity for you to review your account, ensure that you have a strong password and enable two-factor authentication.

While we aggressively rate-limit login attempts and passwords are stored properly, this incident involved the use of nearly 40,000 unique IP addresses. These addresses were used to slowly brute force weak passwords or passwords used on multiple sites. We are working on additional rate-limiting measures to address this. In addition, you will no longer be able to log in to GitHub.com with commonly-used weak passwords.
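
As a rough sketch of what rejecting commonly-used weak passwords looks like at sign-in or password-change time (the wordlist and length threshold below are tiny stand-ins, not our actual rules):

    // Hypothetical check against a dictionary of commonly-used passwords.
    var COMMON_PASSWORDS = ['password', '123456', 'qwerty', 'letmein', 'monkey'];

    function isAcceptablePassword(candidate) {
      if (candidate.length < 8) return false;
      if (COMMON_PASSWORDS.indexOf(candidate.toLowerCase()) !== -1) return false;
      return true;
    }

    // At password change or account creation:
    // if (!isAcceptablePassword(newPassword)) { reject with an explanatory error }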

If you have any questions or concerns please let us know.
