Mitigating risk in the hardware supply chain

At Google, security is among our primary design criteria as we build hardware, software, and services. We think comprehensively about potential risks, no matter how small, and about the best ways to mitigate them and stay ahead of attackers.

We take a "defense in depth" approach to security, which means that we don't rely on any one thing to keep us secure, but instead build layers of checks and controls. Even if an attacker were to circumvent one of our safeguards, they would be met with many more carefully designed protections to keep them out.

One area where we've put a lot of thought, and which we continue to focus on, is the security of our hardware supply chain. Today, I'd like to go into a few of the things we do specifically in this area.

Hardware design and provenance

A Google data center consists of thousands of servers connected to a local network. In most cases, both the server boards and the networking equipment are custom-designed by Google. We vet component vendors and choose components with care, working with vendors to audit and validate the security properties provided by the components. We also design custom chips, such as the Titan hardware security chip that we're rolling out on both servers and peripherals, which help us securely identify and authenticate legitimate Google devices at the hardware level.

Hardware tracking and disposal

Google meticulously tracks the location and status of all equipment within our data centers, from acquisition to installation to retirement to destruction, via barcodes and asset tags. Metal detectors and video surveillance help ensure that no equipment leaves the data center floor without authorization.
If a component fails a performance test at any point during its lifecycle, it is removed from inventory and retired. When a hard drive is retired, authorized individuals verify that the disk has been erased by writing zeros to the drive and performing a multiple-step verification process to ensure the drive contains no data. If the drive cannot be erased for any reason, it is stored securely until it can be physically destroyed. Depending on available equipment, we either crush and deform the drive or shred it into small pieces. Each data center adheres to a strict disposal policy, and any variances are promptly addressed.

Secure boot stack and machine identity

Google servers use a variety of technologies to ensure that they boot the correct software stack. We use cryptographic signatures over low-level components such as the BIOS, bootloader, kernel, and base operating system image; these signatures can be validated during each boot or update. The components are all Google-controlled, Google-built, and hardened. With each new generation of hardware we strive to continually improve security: for example, depending on the generation and type of server, we root the trust of the boot chain in a lockable firmware chip, a microcontroller running Google-written security code, or the above-mentioned Google-designed Titan security chip.

Each server in the data center has its own specific identity that can be tied to the hardware root of trust and the software with which the machine booted.
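The chain-of-trust idea behind boot validation can be illustrated with a toy hash-based check. This is a simplified sketch with made-up component images and a plain digest manifest; a real secure boot implementation like the one described here verifies cryptographic signatures anchored in a hardware root of trust, not a flat dictionary of hashes.

```python
import hashlib

def digest(blob: bytes) -> str:
    """SHA-256 digest of a boot component image."""
    return hashlib.sha256(blob).hexdigest()

# Toy boot images standing in for real BIOS, bootloader, and kernel blobs.
components = {
    "bios": b"bios-image-v7",
    "bootloader": b"bootloader-image-v3",
    "kernel": b"kernel-image-v42",
}

# Trusted manifest built from known-good images. In practice this would be
# a signed artifact whose signature is validated against a key rooted in
# hardware, rather than a locally computed table.
trusted = {name: digest(blob) for name, blob in components.items()}

def verify_boot_chain(images: dict, manifest: dict) -> bool:
    """Fail closed: every component, in boot order, must match exactly."""
    for name in ("bios", "bootloader", "kernel"):
        if digest(images.get(name, b"")) != manifest.get(name):
            return False
    return True

assert verify_boot_chain(components, trusted)          # clean boot passes
tampered = dict(components, kernel=b"kernel-image-evil")
assert not verify_boot_chain(tampered, trusted)        # modified kernel fails
```

The key property the sketch preserves is fail-closed behavior: any single mismatched component halts the chain, which is what lets a later stage trust everything beneath it.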
This identity is used to authenticate API calls to and from low-level management services on the machine.

Google has developed automated systems that ensure servers run up-to-date versions of their software stacks (including security patches), detect and diagnose hardware and software problems, and remove machines from service if necessary.

Defense-in-depth

As mentioned, these protections are designed to address specific attack vectors in a potential supply chain attack, but they are by no means our only defense. Google's infrastructure and Google Cloud have been designed with a defense-in-depth approach so that we have opportunities to mitigate potential vulnerabilities at other layers of our stack. For example, even if a piece of server hardware were compromised, our network infrastructure is designed to detect and automatically block the command-and-control communications that are often necessary to take advantage of compromised hardware. Similarly, by encrypting and authenticating network traffic, we can prevent a compromised network device from accessing sensitive data.

Google will continue to invest in our platform so that you can benefit from our services in a secure and transparent manner. To learn more about our approach to infrastructure security, visit our Infrastructure Security page and download our Infrastructure Security whitepaper.
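To make the machine-identity idea more concrete, here is a toy sketch of authenticating a management API call with a per-machine credential. All names and keys are hypothetical, and the HMAC shared-secret scheme is only illustrative; production systems of the kind described above use per-machine asymmetric credentials tied to the hardware root of trust.

```python
import hashlib
import hmac

# Hypothetical per-machine identity. In a real deployment the credential
# would be derived from, and attestable against, the hardware root of trust.
MACHINE_ID = "server-0042"
IDENTITY_KEY = b"per-machine-secret-key"  # illustrative only

def sign_request(machine_id: str, payload: bytes, key: bytes) -> str:
    """Bind the request payload to the claimed machine identity."""
    msg = machine_id.encode() + b"|" + payload
    return hmac.new(key, msg, hashlib.sha256).hexdigest()

def verify_request(machine_id: str, payload: bytes, tag: str, key: bytes) -> bool:
    """Constant-time check that the tag matches identity + payload."""
    expected = sign_request(machine_id, payload, key)
    return hmac.compare_digest(expected, tag)

payload = b'{"action": "drain"}'
tag = sign_request(MACHINE_ID, payload, IDENTITY_KEY)

assert verify_request(MACHINE_ID, payload, tag, IDENTITY_KEY)        # accepted
assert not verify_request("server-9999", payload, tag, IDENTITY_KEY)  # wrong identity
assert not verify_request(MACHINE_ID, b'{"action": "wipe"}', tag, IDENTITY_KEY)  # altered payload
```

Because the tag covers both the machine identity and the payload, a compromised device can neither impersonate another machine nor replay a tag against a different command.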
Source: Google Cloud Platform

Helping SaaS partners run reliably with new SRE tools and training

Our Customer Reliability Engineering (CRE) team is on a mission to help make everyone more reliable by making it easy to adopt Site Reliability Engineering (SRE) principles and practices. Lately, we've been spending a lot of time with our SaaS company partners, helping them reduce the operational burden on their systems, become more agile, and run reliable services for their users and customers.

We've been doing this work with these SaaS partners for more than a year now, and we've learned some lessons along the way:

- Most companies are still in the early stages of their SRE journey. Interest in learning more about SRE principles, best practices, and tooling comes from a wide variety of roles, many of which aren't specifically called "SRE." We've received consistent feedback that companies want self-paced, interactive online resources, such as a Coursera course, to learn more about SRE.
- While companies have unique combinations of customer requirements and solutions, we've found that they share many common architectural patterns as they relate to their customers' experiences. Overwhelmingly, customers want to be able to build service-level objectives (SLOs) quickly and effectively.
- The concept of reliability goes beyond defining and monitoring metrics. We've heard that companies want to prevent unanticipated failures and build resilient systems that can gracefully handle previously unknown failure modes when they first occur. They also want to take advantage of the collective knowledge and experience of Google engineers.

As we continue our mission to help all SaaS companies operate reliably on Google Cloud, we have been working on making it easy for newcomers to get started on their SRE journey in several ways.

Introducing a new Coursera course on Site Reliability Engineering

We want to make it easy for developers to start learning the basics of SRE concepts and to help the larger SRE community establish baselines.
We designed this new course to distill years of collective Google SRE experience in designing and managing complex systems that meet their reliability targets. We hope that it helps developers learn at their own pace and provides insight for new and experienced SREs alike. You can enroll for the class here.

Introducing SLO Guide, a tool that helps you discover what you should measure

At Google, we've always believed in building tools to solve complex problems at scale. A goal of our CRE team, our first customer-facing SRE team, is to help every SaaS company in the world run reliably on Google Cloud Platform (GCP). In pursuit of this mission, we've built SLO Guide, a new tool to help SaaS companies discover what they should measure based on common architectures and critical user journeys (CUJs). Simply put, it helps you quickly create SLOs that measure what your users actually care about.

The SRE course and SLO Guide are available now as two of the key benefits for our Google Cloud SaaS partners. If you're an existing partner, you can request access to the tool here. If you're not a Google Cloud SaaS partner yet, you can become one here.
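To make the SLO idea concrete, here is a toy availability calculation with an error budget. The target and request counts are made-up illustrative numbers, not output from SLO Guide or any real service.

```python
# Hypothetical numbers for a 30-day window; illustrative only.
slo_target = 0.999          # 99.9% of requests should succeed
total_requests = 1_000_000
failed_requests = 700

availability = 1 - failed_requests / total_requests
error_budget = 1 - slo_target                      # fraction of requests allowed to fail
budget_consumed = (failed_requests / total_requests) / error_budget

print(f"availability: {availability:.4%}")              # availability: 99.9300%
print(f"error budget consumed: {budget_consumed:.0%}")  # error budget consumed: 70%
```

Framing reliability as a budget is what makes an SLO actionable: when 70% of the budget is gone mid-window, a team can slow down risky launches before the objective is actually missed.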
Source: Google Cloud Platform