As of December 2017, all Google Cloud Platform (GCP) services had protections in place for all known variants of the vulnerability. The updates implemented by Google have no material effect on workloads.
If you've been keeping up with the latest tech news, you've no doubt heard about the CPU security flaw that Google's Project Zero disclosed on January 3, 2018. Google then answered some of the resulting questions and detailed how it protects Google Cloud customers (Google Cloud Platform and G Suite). This note goes into more detail on how Google has protected Google Cloud products against these speculative execution vulnerabilities, and what was done to make sure Google Cloud customers saw minimal performance impact from these mitigations.
Modern CPUs and operating systems protect programs and users by placing a "wall" around them so that an application or user cannot read what is stored in the memory of another application. These limits are imposed by the CPU.
But as reported in early January 2018, Project Zero discovered techniques that can bypass these protections in some cases, allowing one application to read the private memory of another and potentially exposing sensitive information.
The vulnerabilities come in three variants, each of which must be mitigated individually. Variant 1 and Variant 2 are also known as "Spectre". Variant 3 is known as "Meltdown". Project Zero described the technical details, the Google Security blog described how we protect users across all Google products, and we explained how we protect Google Cloud customers and provided guidance on security best practices for customers using their own operating systems with Google Cloud services.
Surprisingly, these vulnerabilities have been present in most computers for nearly 20 years. Because the vulnerabilities exploit features that are fundamental to most modern CPUs, and were believed to be secure, they were not only hard to find, but even harder to fix. For months, hundreds of engineers at Google and other companies worked continuously to understand these new vulnerabilities and find mitigations for them.
In September 2017, we began deploying Variant 1 and Variant 3 fixes to the production infrastructure that underpins all Google products, from cloud services to Gmail, Search and Drive, with more refined fixes in October. Thanks to extensive performance tuning work, these protections caused no discernible impact on our cloud and required no customer downtime, in part due to Google Cloud Platform's Live Migration technology. No performance degradation has been reported by any customer or internal GCP team.
While those solutions addressed Variants 1 and 3, it was clear from the beginning that Variant 2 was going to be much more difficult to mitigate. For several months, it appeared that disabling vulnerable CPU features would be the only option to protect all of our workloads against Variant 2. While that would work, it would also disable key performance-enhancing CPU features and slow down many applications considerably. Performance was also inconsistent, because the speed of one application could be affected by the behavior of other applications running on the same core. Implementing these mitigations would have negatively impacted many customers.
With performance characteristics uncertain, Google began looking for an alternative: a way to mitigate Variant 2 without hardware support. Finally, inspiration appeared in the form of "Retpoline," a novel binary software modification technique that prevents branch target injection, created by Paul Turner, a software engineer who is part of the Technical Infrastructure group. With Retpoline, it was not necessary to disable speculative execution or other hardware features. Instead, this solution modifies programs to ensure that execution cannot be influenced by an attacker.
With Retpoline, the infrastructure can be protected at compile time, without modifications to the source code. Furthermore, testing of this feature, particularly when combined with optimizations such as software branch prediction hints, showed that this protection produces almost no performance loss.
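To make the compile-time point concrete, here is a minimal sketch of the kind of code Retpoline targets: an indirect call through a function pointer. The names `handler_fn`, `handlers`, and `dispatch` are hypothetical and chosen only for illustration; they are not taken from Google's code. The source stays the same whether or not the mitigation is enabled. Only the compiler invocation changes, for example with GCC's `-mindirect-branch=thunk` or Clang's `-mretpoline`, which emit the indirect branch through a retpoline thunk instead of a plain indirect call.

```c
#include <stddef.h>

/* Hypothetical handler type, used only for this illustration. */
typedef int (*handler_fn)(int, int);

static int add_op(int a, int b) { return a + b; }
static int mul_op(int a, int b) { return a * b; }

/* Table of function pointers: calling through it is an indirect branch. */
static const handler_fn handlers[] = { add_op, mul_op };

/*
 * The call `handlers[op](a, b)` is the kind of indirect branch that
 * Variant 2 (branch target injection) tries to steer during speculation.
 * A retpoline-enabled build rewrites how this call is emitted; the C
 * source below is unchanged.
 */
int dispatch(size_t op, int a, int b)
{
    const size_t count = sizeof handlers / sizeof handlers[0];
    if (op >= count) {
        return -1; /* unknown operation (sketch-level error value) */
    }
    return handlers[op](a, b);
}
```

Calling `dispatch(0, 2, 3)` goes through `add_op` and `dispatch(1, 2, 3)` through `mul_op`. The results are identical with or without the retpoline flag, since the mitigation changes how the branch is executed, not what the program computes.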
By December 2017, all Google Cloud Platform (GCP) services had protections in place for all known variants of the vulnerability. During the entire update process, no one noticed: no customer support tickets related to the updates were received. This confirmed the internal assessment that, in real-world use, the performance-optimized updates deployed by Google do not have a material effect on workloads.
We believe that Retpoline-based protection is the best solution for Variant 2 on current hardware. Retpoline fully protects against Variant 2 on all of our platforms without impacting customer performance. By sharing our research publicly, we hope this can be universally deployed to improve the cloud experience across the industry.
This set of vulnerabilities was perhaps the most challenging to fix in a decade, requiring changes to many layers of the software stack. It also required extensive industry collaboration because the vulnerabilities were so pervasive. Due to the high impact and the complexity involved in developing solutions, the response to this problem has been one of the few times Project Zero made an exception to its 90-day disclosure policy.
While these vulnerabilities represent a new class of attack, they are just a few of the many types of threats that the Google Cloud infrastructure is designed to defend against. The infrastructure includes mitigations by design and defense in depth, and Google is committed to ongoing research, to contributions to the security community, and to protecting customers as new vulnerabilities are discovered.
Reference: Protecting our Google Cloud customers from new vulnerabilities without impacting performance