September 2021 - Page 16 of 56 - Cloud Computing Köln

Continuous delivery is frequently top-of-mind for organizations adopting Google Kubernetes Engine (GKE). However, continuous delivery (deploying container image artifacts into your various environments) remains complex, particularly in Kubernetes environments. With little in the way of accepted best practices, building and scaling continuous delivery tooling, pipelines, and repeatable processes is hard work that requires a lot of on-the-job experience.

It doesn’t have to be this way. Today, we are pleased to announce Google Cloud Deploy, a managed, opinionated continuous delivery service that makes continuous delivery to GKE easier, faster, and more reliable.

Solving for continuous delivery challenges

Google Cloud Deploy is the product of discussions with more than 50 customers to better understand the challenges they face doing continuous delivery to GKE. From cloud-native to more traditional businesses, three themes consistently emerged: cost of ownership, security and audit, and integration.

Let’s take a deeper look at these challenges and how we address them with Google Cloud Deploy.

Cost of ownership

Time and again we heard that the operational cost of Kubernetes continuous delivery is high. Identifying best and repeatable practices, scaling delivery tooling and pipelines, and staying current (to say nothing of maintenance) are resource-intensive and take time away from the core business. “We can’t afford to be innovating in continuous delivery,” one customer told us. “We want an opinionated product that supports best practices out of the box.”

Google Cloud Deploy addresses cost of ownership head-on.

As a managed service, Google Cloud Deploy eliminates the scaling and maintenance responsibilities that typically come with self-managed continuous delivery solutions. Now you can reclaim the time spent maintaining your continuous delivery tooling and spend it delivering value to your customers. Google Cloud Deploy also provides structure.
Delivery pipelines and targets are defined declaratively and are stored alongside each release. That means if your delivery pipeline changes, the release’s path to production remains durable. No more time lost troubleshooting issues on in-flight releases caused by changes made to the delivery pipeline.

We have found that a variety of GKE roles and personas interact with continuous delivery processes. A DevOps engineer may be focused on release promotion and rollback decisions, while a business decision maker thinks about delivery pipeline health and velocity. Google Cloud Deploy’s user experience keeps these multiple perspectives in mind, making it easier for various personas to perform contextualized reviews and make decisions, improving efficiency and reducing cost of ownership.

Contextualized deployment approvals

Security and audit

Lots of different users interact with a continuous delivery system, making a variety of decisions. Not all users and decisions carry the same authority, however. Being able to define a delivery pipeline and make updates doesn’t always mean you can create releases, for example, nor does being able to promote a release to staging mean you can approve it to production. Modern continuous delivery is full of security and audit considerations. Restricting who can access what, where, and how is necessary to maintain release integrity and safety.

Throughout, Google Cloud Deploy enables fine-grained restriction, with discrete resource access control and execution-level security. For additional safeguards against unwanted approvals, you can also take advantage of flow management features such as release promotion, rollback, and approvals.

Auditing with Google Cloud Deploy works just like it does for other Google Cloud services.
Cloud Audit Logs audits user-invoked Google Cloud Deploy activities, providing centralized awareness into who promoted a specific release or made an update to a delivery pipeline.

Integration

Whether or not you already have continuous delivery capabilities, you likely already have continuous integration (CI), approval and/or operation workflows, and other systems that intersect with your software delivery practices.

Google Cloud Deploy embraces the GKE delivery tooling ecosystems in three ways: connectivity to CI systems, support for leading configuration (rendering) tooling, and Pub/Sub notifications to enable third-party integrations.

Connecting Google Cloud Deploy to existing CI tools is straightforward. After you build your containers, Google Cloud Deploy creates a delivery pipeline release that initiates the Kubernetes manifest configuration (render) and deployment process to the first environment in a progression sequence. Whether you are using Jenkins, Cloud Build, or another CI tool, this is usually a simple `gcloud beta deploy releases create`.

Delivering to Kubernetes often changes over time. To help, Google Cloud Deploy leverages Skaffold, allowing you to standardize your configuration between development and production environments. Organizations new to Kubernetes typically deploy using raw manifests, but as they become more sophisticated, may want to use more advanced tooling (Helm, Kustomize, kpt). The combination of Google Cloud Deploy and Skaffold lets you transition to these tools without impacting your delivery pipelines.

Finally, to facilitate other integrations, such as a post-deployment test execution or third-party approval workflows, Google Cloud Deploy emits Pub/Sub messages throughout a release’s lifecycle.

The future

Comprehensive, easy-to-use, and cost-effective DevOps tools are key to building an efficient software development team, and it’s our hope that Google Cloud Deploy will help you complete your CI/CD pipelines.
And we’re just getting started! Stay tuned as we continue to introduce exciting new capabilities and features to Google Cloud Deploy in the months and quarters to come.

In the meantime, to get started with the Preview, check out the product page, documentation, quickstart, and tutorials. Finally, if you have feedback on Google Cloud Deploy, you can join the conversation. We look forward to hearing from you!
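To make the CI hand-off described under Integration more concrete, here is a minimal sketch of how a CI job might assemble the `gcloud beta deploy releases create` call. The release name, pipeline name, region, and image URI are made-up placeholders, and the `--delivery-pipeline`, `--region`, and `--images` flags reflect our reading of the gcloud CLI rather than anything stated in this post, so check them against the current reference before relying on them.

```python
import shlex
from typing import Dict, List


def build_release_command(
    release: str,
    delivery_pipeline: str,
    region: str,
    images: Dict[str, str],
) -> List[str]:
    """Build the argument list for `gcloud beta deploy releases create`."""
    # --images is assumed to take comma-separated NAME=IMAGE_URI pairs.
    image_flag = ",".join(f"{name}={uri}" for name, uri in sorted(images.items()))
    return [
        "gcloud", "beta", "deploy", "releases", "create", release,
        f"--delivery-pipeline={delivery_pipeline}",
        f"--region={region}",
        f"--images={image_flag}",
    ]


# All values below are hypothetical placeholders.
command = build_release_command(
    release="rel-001",
    delivery_pipeline="web-app",
    region="us-central1",
    images={"web": "gcr.io/example-project/web:1.0.0"},
)
print(shlex.join(command))
```

A CI step (Jenkins, Cloud Build, or similar) could then pass `command` to `subprocess.run(command, check=True)` once the container build has finished.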
Source: Google Cloud Platform

September 23, 2021

da Agency

Dual deployments on Vertex AI

In this post, we will cover an end-to-end workflow enabling dual model deployment scenarios using Kubeflow, TensorFlow Extended (TFX), and Vertex AI. We will start with the motivation behind the project and then move on to the approaches we realized as part of it. We will conclude the post by going over the cost breakdown for each of the approaches. While this post will not include exhaustive code snippets and reviews, you can always find the entire code in this GitHub repository.

To fully follow this post, we assume that you are already familiar with the basics of TFX, Vertex AI, and Kubeflow. It would also be helpful to have some familiarity with TensorFlow and Keras, since we will be using them as our primary deep learning framework.

Motivation

Scenario #1 (Online / offline prediction)

Let’s say you want to allow your users to run an application in both online and offline modes. Your mobile application would use an on-device TensorFlow Lite (TFLite) model depending on network bandwidth, battery, and so on, and if sufficient network coverage and internet bandwidth are available, it would instead use the online cloud model. This way your application stays resilient and can ensure high availability.

Scenario #2 (Layered predictions)

Sometimes we also do layered predictions, where we first divide a problem into smaller tasks: 1) predict whether it’s a yes/no, and 2) depending on the output of 1), run the final model. In these cases, 1) takes place on-device and 2) takes place in the cloud to ensure a smooth user experience. Furthermore, it’s good practice to use a mobile-friendly network architecture (such as MobileNetV3) when considering mobile deployments. A detailed analysis of this situation is discussed in the book ML Design Patterns.

The discussions above lead us to the following question:

Can we train two different models within the same deployment pipeline and manage them seamlessly?

This project is motivated by this question.
The rest of this post will walk you through the different components that were pulled in to make such a pipeline operate in a self-contained and seamless manner.

Dataset and models

We use the Flowers dataset in this project, which consists of 3670 examples of flowers categorized into five classes: daisy, dandelion, roses, sunflowers, and tulips. Our task is therefore to build flower classification models, which are multi-class classifiers in this case.

Recall that we will be using two different models. One will be deployed in the cloud and consumed via REST API calls. The other will sit inside mobile phones and be consumed by mobile applications. For the first model, we will use a DenseNet121, and for the mobile-friendly model, we will use a MobileNetV3. We will make use of transfer learning to speed up the model training process. You can study the entire training pipeline from this notebook.

We also make use of AutoML-based training pipelines for the same workflow, where the tooling automatically discovers the best models for the given task within a preconfigured compute budget. Note that the dataset remains the same in this case. You can find the AutoML-based training pipeline in this notebook.

Approaches

Different organizations have people with varied technical backgrounds. We wanted to provide the easiest solution first and then move on to something that is more customizable.

AutoML

Figure 1: Schematic representation of the overall workflow with AutoML components.

To this end, we leverage standard components from the Google Cloud Pipeline Components library to build, train, and deploy models for different production use cases. With AutoML, developers can delegate a large part of their workflows to the SDKs, and the codebase also stays comparatively smaller.
Figure 1 depicts a sample system architecture for this scenario.

For reference, Vertex AI supports a number of tasks, ranging from image classification to object tracking.

TFX

But the story does not end here. What if we wanted to have better control over the models to be built, trained, and deployed? Enter TFX! TFX provides the flexibility of writing custom components and including them inside a pipeline. This way Machine Learning Engineers can focus on building and training their favorite models and delegate a part of the heavy lifting to TFX and Vertex AI. On Vertex AI (acting as an orchestrator), this pipeline looks like the graph in Figure 2.

Figure 2: Computation graph of the TFX components required for our workflow.

You are probably wondering why there is Firebase in both of the approaches we just discussed. The model used by mobile applications needs to be a TFLite model because of its tremendous interoperability with mobile platforms. Firebase provides excellent tooling and integration for TFLite models, such as canary rollouts and A/B testing. You can learn more about how Firebase can enhance your TFLite deployments from this blog post.

So far we have developed a brief idea of the approaches followed in this project. In the next section, we will dive a bit more into the code and the various nuts and bolts that had to be adjusted to make things work. You can find all the code shown in the coming section here.

Implementation details

Since this project uses two distinct setups (AutoML-based minimal code and TFX-based custom code), we will divide this section into two. First, we will introduce the AutoML side of things, and then we will head over to TFX. Both setups provide similar outputs and implement identical functionality.

Vertex AI Pipelines with Kubeflow’s AutoML Components

The Google Cloud Pipeline Components library comes with a variety of predefined components supporting services built into Vertex AI.
For instance, you can directly import a dataset from Vertex AI’s managed dataset feature into the pipeline, or you can create a model training job to be delegated to Vertex AI’s training feature. You can follow along with the rest of this section with the entire notebook. This project uses the following components: ImageDatasetCreateOp, AutoMLImageTrainingJobRunOp, ModelDeployOp, and ModelExportOp.

We use ImageDatasetCreateOp to create a dataset to be injected into the next component, AutoMLImageTrainingJobRunOp. It supports all kinds of datasets from Vertex AI. The import_schema_uri argument determines the type of the target dataset. For instance, it is set to multi_label_classification for this project.

The AutoMLImageTrainingJobRunOp delegates model training jobs to Vertex AI training with the specified configuration. Since the AutoML model can grow very large, we can set some constraints with the budget_milli_node_hours and model_type arguments. The budget_milli_node_hours argument sets how much training time is allowed, expressed in milli node hours (1,000 milli node hours equal one node hour). The model_type argument tells the training job what the target environment is and which format the trained model should have. We created two instances of AutoMLImageTrainingJobRunOp, with model_type set to “CLOUD” and “MOBILE_TF_VERSATILE_1” respectively. As you can see, the string parameter itself describes what it is. There are more options, so please take a look at the official API document.

The ModelDeployOp does three jobs in one place. It uploads a trained model to Vertex AI Models, creates an endpoint, and deploys the trained model to the endpoint. With ModelDeployOp, you can deploy your model in the cloud quickly and easily. The ModelExportOp, on the other hand, only exports a trained model to a designated location such as a GCS bucket. Because the mobile model is not going to be deployed in the cloud, we explicitly need to get the saved model so that we can directly embed it on a device or publish it to Firebase ML.
To make the trained model usable on-device, export_format_id should be set appropriately in ModelExportOp. The possible values are “tflite”, “edgetpu-tflite”, “tf-saved-model”, “tf-js”, “core-ml”, and “custom-trained”, and it is set to “tflite” for this project.

With these four components, you can create a dataset, train cloud and mobile models with AutoML, deploy the trained model to the cloud, and export the trained model to a .tflite file. The last step would be to embed the exported model into the mobile application project. However, that is not flexible, since you would have to compile the application and upload it to the marketplace every time the model changes.

Firebase

Instead, we can publish a trained model to Firebase ML. We are not going to explain Firebase ML in depth, but it basically lets the application download and update the machine learning model on the fly. This makes the user experience much smoother.

In order to integrate publishing capability into the pipeline, we have created custom components, one for KFP native and the other for TFX. Let’s explore what it looks like in KFP native now; the one for TFX will be discussed in the next section. Please make sure you read the general instructions under the “Before you begin” section of the official Firebase document as a prerequisite.

In this project, we have written Python function-based custom components for the KFP native environment. The first step is to mark a function with the @component decorator, specifying which packages need to be installed. When compiling the pipeline, KFP will wrap this function as a Docker image, which means everything inside the function is completely isolated, so we have to declare the dependencies the function needs via packages_to_install.

The beginning of the function is omitted here; it downloads the Firebase credential file and the saved model from firebase_credential_uri and model_bucket respectively.
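The body of such a component might look roughly like the following sketch. It is not the authors' exact code: the function names, parameters, and the split into two functions are our own, chosen so the create-or-update logic can be exercised without Firebase credentials. The `firebase_admin` calls are the ones named in this post; in a KFP function-based component the imports would sit inside the decorated function, as they do in `push_to_firebase` below.

```python
from typing import Any


def publish_tflite_model(ml: Any, model_path: str, display_name: str, tag: str) -> str:
    """Create or update a Firebase ML model from a local .tflite file, then publish it.

    `ml` is expected to behave like the `firebase_admin.ml` module.
    """
    # Uploads the local file to the Firebase storage bucket and wraps it as a model source.
    source = ml.TFLiteGCSModelSource.from_tflite_model_file(model_path)
    model_format = ml.TFLiteFormat(model_source=source)

    # Look for an already deployed model with the same display name.
    page = ml.list_models(list_filter=f"display_name={display_name}")
    existing = list(page.iterate_all())

    if existing:
        # Update the existing model instead of creating a duplicate.
        model = existing[0]
        model.model_format = model_format
        model = ml.update_model(model)
    else:
        model = ml.Model(display_name=display_name, tags=[tag], model_format=model_format)
        model = ml.create_model(model)

    ml.publish_model(model.model_id)
    return model.model_id


def push_to_firebase(credential_path: str, storage_bucket: str, model_path: str,
                     display_name: str, tag: str) -> str:
    """Entry point using the real SDK; all argument names are hypothetical."""
    import firebase_admin
    from firebase_admin import credentials, ml

    firebase_admin.initialize_app(
        credentials.Certificate(credential_path),
        options={"storageBucket": storage_bucket},
    )
    return publish_tflite_model(ml, model_path, display_name, tag)
```

Passing the `ml` module in as an argument is only a convenience for testing; the published component could equally call `firebase_admin.ml` directly.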
You can assume that the downloaded files are named credential.json and model.tflite. We found that the files cannot be referenced directly while they are stored in GCS, which is why we download them locally.

The firebase_admin.initialize_app method initializes authorization to Firebase with the given credential and the GCS bucket that is used to store the model file temporarily. The GCS bucket is required by Firebase, and you can simply create one within the storage menu in the Firebase dashboard.

The ml.list_models method returns a list of models deployed in Firebase ML, and you can filter the items by display_name or tags. The purpose of this call is to check whether a model with the same name has already been deployed, because in that case we have to update the model instead of creating a new one.

The update and create routines have one thing in common: the local model file is loaded and uploaded into the temporary GCS bucket by calling the ml.TFLiteGCSModelSource.from_tflite_model_file method. After the loading process, you call either ml.create_model or ml.update_model. Then you are ready to publish the model with the ml.publish_model method.

Putting things together

We have explored five components, including the custom one, push_to_firebase. It is time to jump into the pipeline to see how these components are connected. First of all, we need a separate set of configurations for each of the two deployments. We could hard-code them, but it is much better to keep them in a list of dictionaries, one per deployment.

In the pipeline definition, you should be able to recognize each individual component and what it does. What you need to focus on this time is how the components are connected, how to run parallel jobs for each deployment, and how to make a conditional branch to handle each deployment-specific job. Each component except for push_to_firebase has an argument that takes its input from the output of the previous component.
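As a rough illustration of the list-of-dictionaries idea and the per-deployment branch, the sketch below plans which components would run for each configuration. It deliberately avoids the KFP SDK so it can run anywhere; the dictionary keys `target` and `display_name` and the helper `plan_steps` are names invented for this sketch, while the component names, `model_type` values, and `export_format_id` come from the post.

```python
from typing import Dict, List

# One entry per deployment target.
DEPLOYMENT_CONFIGS: List[Dict[str, str]] = [
    {"target": "cloud", "model_type": "CLOUD", "display_name": "flowers-cloud"},
    {
        "target": "mobile",
        "model_type": "MOBILE_TF_VERSATILE_1",
        "display_name": "flowers-mobile",
        "export_format_id": "tflite",
    },
]


def plan_steps(config: Dict[str, str]) -> List[str]:
    """Return the component names that would run for one deployment config."""
    steps = ["AutoMLImageTrainingJobRunOp"]
    if config["target"] == "cloud":
        # Cloud model: upload, create an endpoint, and deploy.
        steps.append("ModelDeployOp")
    else:
        # Mobile model: export a .tflite file, then publish it to Firebase ML.
        steps += ["ModelExportOp", "push_to_firebase"]
    return steps


# The dataset is created once and shared by both training jobs.
plan = {c["target"]: ["ImageDatasetCreateOp"] + plan_steps(c) for c in DEPLOYMENT_CONFIGS}
for target, steps in plan.items():
    print(target, "->", " -> ".join(steps))
```

In a real pipeline function, the same loop would instantiate the components themselves, and the `if`/`else` would typically be expressed with the pipeline DSL's conditional construct so that the branch is evaluated at run time.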
For instance, the AutoMLImageTrainingJobRunOp launches a model training process based on the dataset parameter, and its value is injected from the output of ImageDatasetCreateOp. You might wonder why there is no dependency between the ModelExportOp and push_to_firebase components. That is because the GCS location for the exported model is defined manually with the artifact_destination parameter in ModelExportOp, so the same GCS location can be passed down to the push_to_firebase component manually.

With the pipeline function defined with the @kfp.dsl.pipeline decorator, we can compile the pipeline via the kfp.v2.compiler.compile method. The compiler converts all the details about how the pipeline is constructed into a JSON file. You can safely store the JSON file in a GCS bucket if you want to keep track of different versions. Why version the JSON file rather than the pipeline code itself? Because the pipeline can be run just by referring to the JSON file, using the create_run_from_job_spec method of kfp.v2.google.client.AIPlatformClient.

Vertex AI Pipelines with TFX’s pre-built and custom components

TFX provides a number of useful pre-built components that are crucial for orchestrating a machine learning project end-to-end. Here you can find a list of the standard components offered by TFX. This project leverages the following stock TFX components: ImportExampleGen, Trainer, and Pusher.

We use ImportExampleGen to read TFRecords from a Google Cloud Storage (GCS) bucket. The Trainer component trains models, and Pusher exports the trained model to a pre-specified location (which is a GCS bucket in this case).
For the purpose of this project, the data preprocessing steps are performed within the training component, but TFX provides first-class support for data preprocessing.

Note: Since we will be using Vertex AI to orchestrate the entire pipeline, the Trainer component here is tfx.extensions.google_cloud_ai_platform.Trainer, which lets us take advantage of Vertex AI’s serverless infrastructure to train models.

Recall from Figures 1 and 2 that once the models have been trained, they need to go down two different paths: 1) Endpoint (more on this in a moment), and 2) Firebase. So, after training and pushing the models, we need to:

1. Deploy one of the models to Vertex AI as an Endpoint so that it can be consumed via REST API calls. To deploy a model using Vertex AI, you first need to import it if it’s not already there. Once the right model is imported (or identified), it needs to be deployed to an Endpoint. Endpoints provide a flexible way to version control the different models that you may deploy during the entire production life-cycle.
2. Push the other model to Firebase so that mobile developers can use it to build their applications.

As per these requirements, we need to develop at least three custom components:

One that takes a pre-trained model as input and imports it into Vertex AI (VertexUploader).
Another that is responsible for deploying it to an Endpoint, creating the Endpoint automatically if it’s not present (VertexDeployer).
A final one that pushes the mobile-friendly model to Firebase (FirebasePublisher).

Let’s now go through the main parts of each of these one by one.

Model upload

We will be using Vertex AI’s Python SDK to import a model of choice into Vertex AI. The code to accomplish this is fairly straightforward and centers on a single call to vertex_ai.Model.upload(). Learn more about the different arguments of vertex_ai.Model.upload() from here.
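A minimal sketch of such an upload, assuming the `google-cloud-aiplatform` package is installed and imported as `vertex_ai`, could look like this. The helper names, display name, bucket path, and choice of serving image are illustrative placeholders, not values taken from the project.

```python
from typing import Dict


def build_upload_args(display_name: str, artifact_uri: str, serving_image_uri: str) -> Dict[str, str]:
    """Collect the keyword arguments passed to vertex_ai.Model.upload()."""
    if not artifact_uri.startswith("gs://"):
        raise ValueError("artifact_uri should point to a GCS directory holding the SavedModel")
    return {
        "display_name": display_name,
        "artifact_uri": artifact_uri,
        "serving_container_image_uri": serving_image_uri,
    }


def upload_model(project: str, region: str, **upload_args: str):
    """Import a trained model into Vertex AI and return the resulting Model resource."""
    # Imported lazily so the helper above stays usable without the SDK installed.
    from google.cloud import aiplatform as vertex_ai

    vertex_ai.init(project=project, location=region)
    return vertex_ai.Model.upload(**upload_args)


# Placeholder values; the serving image should be one of the pre-built prediction containers.
args = build_upload_args(
    display_name="densenet-flowers",
    artifact_uri="gs://example-bucket/pushed-model/densenet",
    serving_image_uri="us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-5:latest",
)
print(args)
```

Calling `upload_model("example-project", "us-central1", **args)` would then perform the actual import, given valid credentials and an existing SavedModel at the artifact location.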
Now, in order to turn this into a custom TFX component (so that it runs as a part of the pipeline), we need to put this code inside a Python function and decorate it with the component decorator. And that is it! The full snippet is available here for reference. One important detail to note here is that serving_image_uri should be one of the pre-built containers as listed here.

Model deploy

Now that our model is imported into Vertex AI, we can proceed with its deployment. First, we create an Endpoint, and then we deploy the imported model to that Endpoint. The full snippet, including the utilities omitted from this post, can be found here. Explore the different arguments used inside endpoint.deploy() from here. You might actually enjoy them, because they provide many production-friendly features such as autoscaling, hardware configurations, and traffic splitting right off the bat.

Thanks to this repository, which we used as a reference for implementing these two components.

Firebase

This part shows how to create a custom Python function-based TFX component. The underlying logic is pretty much the same as the one introduced in the AutoML section. We omit the internal details in this post, but you can find the complete source code here.

We just want to point out the usage of the type annotation tfx.dsl.components.InputArtifact[tfx.types.standard_artifacts.PushedModel]. The tfx.dsl.components.InputArtifact part means the parameter is a TFX artifact that is used as an input to the component. Likewise, there is tfx.dsl.components.OutputArtifact, with which you can specify what kind of output the component should produce.

Then, within the square brackets, we have to state which type of artifact the input is. In this case, we want to publish the pushed model to Firebase ML, so tfx.types.standard_artifacts.PushedModel is used.
You can hard-code the URI, but that is not flexible, and it is recommended to take this information from the PushedModel artifact instead.

Custom Docker image

TFX provides pre-built Docker images in which pipelines can be run. But to execute a pipeline that contains custom components leveraging various external libraries, we need to build a custom Docker image. Surprisingly, the changes needed to accommodate this are minor. The Dockerfile used to build a custom Docker image that supports the custom TFX components discussed above refers to a custom_components directory, which contains the .py files of our custom components.

Now, we just need to build the image and push it to Google Container Registry (one can use Docker Hub as well). For building and pushing the image, we can either use the docker build and docker push commands or use Cloud Build, which is a serverless CI/CD platform from Google Cloud. A single Cloud Build command triggers the build. Note that TFX_IMAGE_URI, as the name suggests, is the URI of our custom Docker image that will be used to execute the final pipeline. The builds are available in the form of a nice dashboard along with all the build logs.

Figure 3: Docker image build output from Cloud Build.

Putting things together

Now that we have all the important pieces together, we need to make them a part of a TFX pipeline so that it can be executed end-to-end. The entire code can be found in this notebook.

Before putting things together into the pipeline, it is better to define some constants separately for readability. The names of the model_display_name, pushed_model_location, and pushed_location_mobilenet variables pretty much explain what they are. The TRAINING_JOB_SPEC, on the other hand, is somewhat verbose, so let’s go through it.

TRAINING_JOB_SPEC basically sets up the hardware and software infrastructure for model training.
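For orientation, a spec of this kind might look like the sketch below. The project ID and image tag are placeholders we made up, and the exact layout used in the project may differ; the machine type, accelerator, and field names follow the description in this post and the Vertex AI custom job format as we understand it.

```python
# Hypothetical placeholders; replace with your own project and image.
GOOGLE_CLOUD_PROJECT = "example-project"
TFX_IMAGE_URI = "gcr.io/tfx-oss-public/tfx:1.2.0"

TRAINING_JOB_SPEC = {
    "project": GOOGLE_CLOUD_PROJECT,
    "worker_pool_specs": [
        {
            # First entry: the primary replica, the only pool set in this project.
            "machine_spec": {
                "machine_type": "n1-standard-4",
                "accelerator_type": "NVIDIA_TESLA_K80",
                "accelerator_count": 1,
            },
            "replica_count": 1,
            # Software side: the Docker image the training code runs in.
            "container_spec": {"image_uri": TFX_IMAGE_URI},
        }
    ],
}

primary_pool = TRAINING_JOB_SPEC["worker_pool_specs"][0]
print(primary_pool["machine_spec"]["machine_type"])
```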
The worker_pool_specs field lets you define different types of clusters if you want to leverage distributed training features on Vertex AI. For instance, the first entry is reserved for the primary cluster, and the fourth entry is reserved for evaluators. In this project, we have set only the primary cluster. For each entry in worker_pool_specs, the machine_spec and the container_spec define the hardware and software infrastructure respectively. We used a single NVIDIA_TESLA_K80 GPU within an n1-standard-4 instance, and we set the base Docker image to an official TFX image. You can learn more about these specifications here.

We will use these configurations in the pipeline. Note that the model training infrastructure is completely different from the GKE cluster where Vertex AI internally runs each component’s job. That is why we need to set base Docker images in multiple places rather than via a unified API.

When reading the pipeline code, focus on how the components are connected and which special parameters are necessary to leverage Vertex AI. Each standard component has at least one special parameter that takes its input from the output of another component. For instance, the Trainer has the examples parameter, and its value comes from ImportExampleGen. Likewise, Pusher has the model parameter, and its value comes from the Trainer. If a component doesn’t define such a parameter, you can set the dependencies explicitly via the add_upstream_node method. You can find example usages of add_upstream_node with VertexUploader and VertexDeployer.

After defining and connecting the TFX components, the next step is to put those components in a list. A pipeline function should return an object of type tfx.dsl.Pipeline, which can be instantiated with that list.
With tfx.dsl.Pipeline, we can finally create a pipeline specification with KubeflowV2DagRunner under the tfx.orchestration.experimental module. When you call the run method of the KubeflowV2DagRunner with the tfx.dsl.Pipeline object, it creates a pipeline specification file in JSON format. The JSON file can be passed to the create_run_from_job_spec method of kfp.v2.google.AIPlatformClient, which then creates a pipeline run on Vertex AI Pipelines.

Once these steps are executed, you should be able to see a pipeline on the Vertex AI Pipelines dashboard. One very important detail to note here is that the pipeline needs to be compiled such that it runs on the custom TFX Docker image we built in one of the earlier steps.

Cost

Vertex AI Training is a separate service from Vertex AI Pipelines. We need to pay for Vertex AI Pipelines individually, and it costs about $0.03 per pipeline run. The type of compute instance for each component was e2-standard-4, which costs about $0.134 per hour. Since the whole pipeline took less than an hour to finish, we can estimate that the total cost was about $0.164 for a Vertex AI Pipelines run.

The cost of AutoML training depends on the type of task and the target environment. For instance, the AutoML training job for the cloud model costs about $3.15 per hour, whereas the AutoML training job for the on-device mobile model costs about $4.95 per hour. The training jobs were done in less than an hour for this project, so it cost about $10 for the two models fully trained.

The cost of custom model training, on the other hand, depends on the type of machine and the number of hours. Also, you have to consider that you pay for the server and the accelerator separately. For this project, we chose the n1-standard-4 machine type, whose price is $0.19 per hour, and the NVIDIA_TESLA_K80 accelerator type, whose price is $0.45 per hour.
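The prices quoted so far can be combined into a quick upper-bound estimate. The sketch below assumes each step is billed as one full hour and that two custom models are trained; the constant and function names are ours, and only the per-hour prices come from the text.

```python
# Prices quoted in the text, in USD.
PIPELINE_RUN_FEE = 0.03
E2_STANDARD_4_PER_HOUR = 0.134
N1_STANDARD_4_PER_HOUR = 0.19
TESLA_K80_PER_HOUR = 0.45


def pipeline_run_cost(hours: float) -> float:
    """Flat per-run fee plus the compute used by the pipeline components."""
    return round(PIPELINE_RUN_FEE + E2_STANDARD_4_PER_HOUR * hours, 3)


def custom_training_cost(hours_per_model: float, num_models: int) -> float:
    """Machine and accelerator are billed separately, per model trained."""
    hourly = N1_STANDARD_4_PER_HOUR + TESLA_K80_PER_HOUR
    return round(hourly * hours_per_model * num_models, 2)


print(pipeline_run_cost(1.0))        # 0.164
print(custom_training_cost(1.0, 2))  # 1.28
```

Because the actual jobs finished in under an hour, these figures are ceilings rather than exact charges.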
The training for each model was done in less than an hour, so it cost about $1.28 in total.

The cost of model prediction is defined separately for AutoML and custom-trained models. Online and batch predictions for an AutoML model cost about $1.25 and $2.02 per hour respectively. The prediction cost of a custom-trained model, on the other hand, is roughly determined by the machine type. In this project, we specified n1-standard-4, whose price is $0.1901 per hour without an accelerator in the us-central1 region.

If we sum up the cost spent on this project, it is about $12.13 for the two pipeline runs to be completed. Please refer to the official document for further information.

Firebase ML doesn’t cost anything. You can use it for free for Custom Model Deployment. Please find more information about the price of Firebase services here.

Conclusion

In this post, we covered why having two different types of models may be necessary to serve users. We realized a simple but scalable automated pipeline for this purpose using two different approaches on Vertex AI on GCP. In the first, we used Kubeflow’s AutoML SDK, delegating much of the heavy lifting to the frameworks. In the second, we leveraged TFX’s custom components to customize various parts of the pipeline as per our requirements. Hopefully, this post provided you with a few recipes that are important to have in your Machine Learning Engineering toolbox. Feel free to try out our code here and let us know what you think.

Acknowledgements

We are grateful to the ML-GDE program that provided GCP credits for supporting our experiments. We sincerely thank Karl Weinmeister and Robert Crowe of Google for their help with the review.
Source: Google Cloud Platform

September 23, 2021

da Agency

What’s your org’s reliability mindset? Insights from Google SREs

Editor’s note: There’s more to ensuring a product’s reliability than following a bunch of prescriptive rules. Today, we hear from some Google SREs (Vartika Agarwal, Senior Technical Program Manager, Development; Tracy Ferrell, Senior SRE Manager; Mahesh Palekar, Director SRE; and Magi Agrama, Senior Technical Program Manager, SRE) about how to evaluate your team’s current reliability mindset, and what you want it to be.

Having a reliable software product can improve users’ trust in your organization, the effectiveness of your development processes, and the quality of your products overall. More than ever, product reliability is front and center, as outages negatively impact customers and their businesses. But in an effort to develop new features, many organizations limit their reliability efforts to what happens after an outage, and tactically solve for the immediate problems that sparked it. They often fail to realize that they can move quickly while still improving their product’s reliability.

At Google, we’ve given a lot of thought to product reliability, and several of its aspects are well understood, for example product or system design. What people think about less is the culture and the mindset of the organization that creates a reliable product in the first place. We believe that the reliability of a product is a property of the architecture of its system, processes, culture, as well as the mindset of the product team or organization that built it. In other words, reliability should be woven into the fabric of an organization, not just the result of a strong design ethos.

In this blog post, we discuss the lessons we’ve learned relevant to organizational or product leads who have the ability to influence the culture of the entire product team, from (but not limited to) engineering, product management, marketing, reliability engineering, and support organizations.

Goals

Reliability should be woven into the fabric of how an organization executes.
At Google, we’ve developed a terminology to categorize and describe an organization’s reliability mindset, to help you understand how intentional your organization is in this respect. Our ultimate goal is to help you improve and adopt product reliability practices that will permeate the ethos of the organization.

By identifying these reliability phases, we do not mean to offer a prescriptive list of things to do that will improve your product’s reliability. Nor should they be read as a set of mandated principles that everyone should apply, or be used to publicly label a team, spurring competition between teams. Rather, leaders should consider these phases as a way to help them develop their team’s culture, on the road to sustainably building reliable products.

The organizational reliability continuum

Based on our observations here at Google, there are five basic stages of organizational reliability, and they are based on the classic organizational model of absent, reactive, proactive, strategic and visionary. These phases describe the mindset of an organization at a point in time. Each one is characterized by a series of attributes and is appropriate for different classes of workloads.

Absent: Reliability is a secondary consideration for the organization. A feature launch is the key organizational metric and is the focus for incentives. The majority of issues are found by users or testers. This organization is not aware of its long-term reliability risks.
Developer velocity is rarely exchanged for reliability. This reliability phase may be appropriate for products and projects that are still under development.

Reactive: Responses to reliability issues and risks are tied to recent outages, with sporadic follow-through and rarely any longer-term investment in fixing system issues. Teams have some reliability metrics defined and react when required. They write postmortems for outages and create action items for tactical fixes. Reasonable availability is maintained through heroic efforts by a few individuals or teams. Developer productivity is throttled because outages temporarily shift priority to reliability work. Feature development may be frozen for a short period of time. This level is appropriate for products and projects in pre-launch or in a stable long-term maintenance phase.

Proactive: Potential reliability risks are identified and addressed through regular organizational processes. Risks are regularly reviewed and prioritized. Teams proactively manage dependencies and review their reliability metrics (SLOs). New designs are assessed for known risks and failure modes early on. Graceful degradation is a basic requirement. The business understands the need to continuously invest in reliability and maintain its balance with developer velocity. Most services and products should be at this level, particularly if they have a large blast radius or are critical to the business.

Strategic: Organizations at this level manage classes of risk via systemic changes to architectures, products and processes. Reliability is inherent and ingrained in how the organization designs, operates and develops software. Reliability is systemic. Complexity is addressed holistically through product architecture.
Dependencies are constantly reduced or improved. The cross-functional organization can sustain reliability and developer velocity simultaneously. Organizations widely celebrate quality and stability milestones. This level is appropriate for services and products that need very high availability to meet business-critical needs.

Visionary: The organization has reached the highest order of reliability and is able to drive broader reliability efforts within and outside the company (e.g., writing papers, sharing knowledge), based on its best practices and experiences. Reliability knowledge exists broadly across all engineers and teams at a fairly advanced level and is carried forward as they move across organizations. Systems are self-healing. Architectural improvements for reliability positively impact productivity (release velocity) due to reduction of maintenance work and toil. Very few services or products are at this level, and those that are lead the industry.

Where should you be on the reliability spectrum?

It is very important to understand that your organization does not necessarily need to be at the strategic or visionary phase. There is a significant cost associated with moving from one phase to another, and a cost to remain very high on this curve. In our experience, being proactive is a healthy level to target and is ideal for most products.

To illustrate this point, if you plot where various Google product teams are on the organizational reliability spectrum, the result is a standard bell-curve distribution. While many of Google’s product teams have a reactive or proactive reliability culture, most can be described as proactive. You, as an organizational leader, must consciously decide which level to be at, based on the product requirements and client expectations.

Further, it’s common to have attributes across several phases; for example, an organization may be largely reactive with a few proactive attributes.
Team culture will wax and wane between phases, as it takes effort to maintain a strategic reliability culture. However, as more of the organization embraces and celebrates reliability as a key feature, the cost of maintenance decreases. The key to success is making an honest assessment of what phase you’re in, and then doing concerted work to move to the phase that makes sense for your product. If your organization is in the absent or reactive phase, remember that many products may be comfortable there, both in the nascent stages of their life cycle and in the long-term maintenance of a stable product.

Reliability phases in action

To illustrate the reliability phases in practice, it is interesting to look at examples of organizations and how they have progressed or regressed through them. All companies and teams are different, and progress through these phases can take varying amounts of time. It is not uncommon to take two to three years to move into a truly proactive state. In a proactive state, all parts of the organization contribute to reliability without worrying that it will negatively impact feature velocity. Staying in the proactive phase also takes time and effort.

Nobody can be a hero forever

One infrastructure services team started small with a few well-understood APIs. One key member of the team, a product architect, understood the system well and kept things running smoothly by making sure design decisions were sound and by being present at each major incident to rapidly mitigate the issue. This was the one person who understood the entire system and was able to predict what could and could not impact its stability. But when they left the team, the system complexity grew by leaps and bounds. Suddenly there were many critical user-facing and internal outages. Organizational leaders initiated both short- and long-term reliability programs to restore stability. They focused on reducing the blast radius and the impact of global outages.
Leadership recognized that to sustain this trajectory, they had to go beyond engineering solutions and implement cultural changes such as recognizing reliability as their number-one feature. This led to broad training around reliability best practices, incorporating reliability in architectural and design reviews, and recognizing and rewarding reliability beyond hero moments. As a result, the organization evolved from a reactive to a strategic reliability mindset, aided by setting reliability as their number-one feature, recognizing and rewarding long-term reliability improvements, and adopting the systemic belief that reliability is everyone’s responsibility, not just that of a few heroes.

If you think you are done, think again

For one Google product, end users are highly dependent on its reliability, which ties directly to user trust. For this reason, reliability was top of mind for the organization behind it for years, and the product was held as the gold standard of reliability by other Google teams. The org was deemed visionary in its reliability processes and work. However, over the years, new products were added to the base service. The high level of reliability did not come as freely and easily as it did with the simpler product. Reliability suffered in favor of developer velocity, and the organization moved to a more reactive reliability mindset.

To turn the ship around, the organization’s leaders had to be intentional about their reliability posture and overall practices, for example, how much they thought about and prioritized reliability. It took several years to move the team back to a strategic mindset.

Embrace reliability principles from the start

Another team with a new user-facing product was focused on adding features and growing their user base.
Before they knew it, the product took off and saw exponential growth.

Unfortunately, their laser focus on managing user requirements and growing user adoption led to high technical debt and reliability issues. Since the service didn’t start off with reliability as a primary focus, it was very hard to incorporate it after the fact. Much of the code had to be rewritten and re-architected to reach a sustainable state. The team’s leaders incentivized attention to reliability throughout the organization, from product management through to development and UX domains, constantly reminding the organization about the importance of reliability to the long-term success of the product. This mindset shift took years to set in.

Conclusion

It is important that cross-functional organizations be honest about their reliability journeys and determine what is appropriate for their business and product. It is not uncommon for organizations to move from one level to another and then back again as the product matures, stabilizes and then is sunset for the next generation. Getting to a strategic level can take 4+ years and require very high levels of investment from all aspects of the business. Leaders should ensure their product requires this level of continued investment.

We encourage you to study your culture of reliability, assess what phase you are in, determine where you should be on the continuum, and carefully and thoughtfully move there. Changing culture is hard and cannot be done by edicts or penalties. Most of all, remember that this is a journey and the business is ever-evolving; you cannot set reliability on the shelf and expect it to maintain itself in perpetuity.
Source: Google Cloud Platform