Benchmarking your Dataflow jobs for performance, cost and capacity planning

Calling all Dataflow developers, operators, and users… So you've developed your Dataflow job, and you're now wondering how it will perform in the wild. In particular:

- How many workers does it need to handle your peak load, and is there sufficient capacity (e.g. CPU quota)?
- What is your pipeline's total cost of ownership (TCO), and is there room to optimize the performance/cost ratio?
- Will the pipeline meet your expected service-level objectives (SLOs), e.g. daily volume, event throughput, and/or end-to-end latency?

To answer these questions, you need to performance test your pipeline with real data to measure things like throughput and the expected number of workers. Only then can you optimize performance and cost. However, performance testing data pipelines has historically been hard, as it involves: 1) configuring non-trivial environments including sources and sinks, 2) staging realistic datasets, 3) setting up and running a variety of tests, including batch and/or streaming, 4) collecting relevant metrics, and 5) analyzing and reporting on all test results.

We're excited to share that PerfKit Benchmarker (PKB) now supports testing Dataflow jobs! As an open-source benchmarking tool used to measure and compare cloud offerings, PKB takes care of provisioning (and cleaning up) resources in the cloud, selecting and executing benchmark tests, and collecting and publishing results for actionable reporting. PKB is a mature toolset that has been around since 2015, with community contributions from over 30 industry and academic participants such as Intel, ARM, Canonical, Cisco, Stanford, MIT and many more.

We'll go over the testing methodology and how to use PKB to benchmark a Dataflow job. As an example, we'll present sample test results from benchmarking one of the popular Google-provided Dataflow templates, the Pub/Sub Subscription to BigQuery template, and show how we identified its throughput and optimum worker size. There are no performance or cost guarantees, since the results presented are specific to this demo use case.

Quantifying pipeline performance

"You can't improve what you don't measure." One common way to quantify pipeline performance is to measure its throughput per vCPU core in elements per second (EPS). This throughput value depends on your specific pipeline and your data, such as:

- the pipeline's data processing steps
- the pipeline's sources/sinks (and their configurations and limits)
- worker machine size
- data element size

It's important to test your pipeline with your expected real-world data (type and size), and in a testbed that mirrors your actual environment, including a similarly configured network, sources, and sinks. You can then benchmark your pipeline by varying parameters such as worker machine size. PKB makes it easy to A/B test different machine sizes and determine which one provides the maximum throughput per vCPU.

Note: What about measuring pipeline throughput in MB/s instead of EPS? While either unit works, measuring throughput in EPS draws a clear line to both the underlying performance dependency (the element size in your particular data) and the target performance requirement (the number of individual elements processed by your pipeline). Similar to how disk performance depends on I/O block size (KB), pipeline throughput depends on element size (KB). For pipelines processing primarily small elements (on the order of KBs), EPS is likely the limiting performance factor. The ultimate choice between EPS and MB/s depends on your use case and data.
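To make the relationship between the two units concrete, here is a small sketch. It is not part of PKB; the helper names and numbers are purely illustrative, and it simply converts between EPS and MB/s for a given element size and normalizes throughput per vCPU:

```python
# Illustrative helpers only, not part of PKB. Assumes you already measured
# total pipeline throughput and know your average element size and worker shape.

def eps_to_mbps(eps: float, element_size_kb: float) -> float:
    """Convert elements/second to MB/s for a given average element size."""
    return eps * element_size_kb / 1024.0

def eps_per_vcpu(total_eps: float, num_workers: int, vcpus_per_worker: int) -> float:
    """Normalize pipeline throughput to a per-vCPU figure for A/B comparisons."""
    return total_eps / (num_workers * vcpus_per_worker)

# Example: ~1 KB log elements on a single n1-standard-4 worker (4 vCPUs).
total_eps = 15_200.0  # measured pipeline throughput (hypothetical)
print(eps_to_mbps(total_eps, element_size_kb=1.0))                  # ~14.8 MB/s
print(eps_per_vcpu(total_eps, num_workers=1, vcpus_per_worker=4))   # 3800.0 EPS/vCPU
```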
Note: The approach presented here expands on this prior post from 2020 on predicting Dataflow cost. However, we also recommend varying worker machine sizes to identify potential CPU, network, or memory bottlenecks, and to determine the optimum machine size for your specific job and input profile, rather than assuming the default machine size (i.e. n1-standard-2). The same applies to any other relevant pipeline configuration option, such as custom parameters.

The following are sample PKB results from benchmarking the Pub/Sub Subscription to BigQuery Dataflow template across n1-standard-{2,4,8,16}, using the same input data: logs with an element size of ~1 KB. While n1-standard-16 offers the maximum total throughput at 28.9k EPS, the maximum throughput per vCPU is provided by n1-standard-4 at around 3.8k EPS/core, slightly beating n1-standard-2 (at 3.7k EPS/core) by 2.6%.

Figure: Latency and throughput results from PKB testing of the Pub/Sub to BigQuery Dataflow template.

What about pipeline cost? Which machine size offers the best performance/cost ratio? Let's look at resource utilization and total cost to quantify this. After each test run, PKB collects standard Dataflow metrics such as average CPU utilization and calculates the total cost based on the resources reported by the job. In our case, jobs running on n1-standard-4 incurred on average 5.3% more cost than jobs running on n1-standard-2. With an increased performance of only 2.6%, one might argue that, from a performance/cost point of view, n1-standard-4 is less optimal than n1-standard-2. However, looking at CPU utilization, n1-standard-2 was highly utilized at over 80% on average, while n1-standard-4 utilization averaged a healthy 68.57%, offering room to respond faster to small load changes without potentially spinning up a new instance.

Figure: Utilization and cost results from PKB testing of the Pub/Sub to BigQuery Dataflow template.

Choosing the optimum worker size sometimes involves a tradeoff between cost, throughput, and freshness of data. The choice depends on your specific workload profile and target requirements, namely throughput and event latency. In our case, the extra 5.3% in cost for n1-standard-4 is worth it, given the added performance and responsiveness. Therefore, for our specific use case and input data, we chose n1-standard-4 as the pipeline's unit worker size, with a throughput of 3.8k EPS per vCPU.

Sizing & costing pipelines

"Provision for peak, and pay only for what you need." Now that you have measured (and hopefully optimized) your pipeline's throughput per vCPU, you can deduce the pipeline size necessary to process your expected input workload: the required number of workers is roughly your target throughput divided by the per-vCPU throughput times the number of vCPUs per worker. Since your pipeline's input workload is likely variable, you need to calculate both the average and the maximum pipeline size. The maximum pipeline size helps with capacity planning for peak load. The average pipeline size is needed for cost estimation: you can plug the average number of workers and the chosen instance type into the Google Cloud Pricing Calculator to determine TCO. A small sketch of this calculation follows.
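As a rough illustration (our own sketch, not part of PKB, using the workload numbers from the example that follows), the sizing arithmetic looks like this:

```python
import math

def required_workers(target_eps: float, eps_per_vcpu: float, vcpus_per_worker: int) -> int:
    """Approximate number of workers needed to sustain a target throughput.

    Assumes throughput scales roughly linearly with vCPUs, which you should
    verify with your own benchmark runs.
    """
    return math.ceil(target_eps / (eps_per_vcpu * vcpus_per_worker))

# n1-standard-4 workers (4 vCPUs) at the ~3.8k EPS/vCPU measured above.
avg_workers = required_workers(target_eps=162_500, eps_per_vcpu=3_800, vcpus_per_worker=4)
peak_workers = required_workers(target_eps=500_000, eps_per_vcpu=3_800, vcpus_per_worker=4)
print(avg_workers, peak_workers)  # -> 11 workers on average, 33 at peak
```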
Let's go through an example. For our specific use case, let's assume the following input workload profile:

- Daily volume to be processed: 10 TB/day
- Average element size: 1 KB
- Target steady-state throughput: 125k EPS
- Target peak throughput: 500k EPS (or 4x steady state)
- Peak load occurs 10% of the time

In other words, the average throughput is expected to be around 90% x 125k + 10% x 500k = 162.5k EPS.

Let's calculate the average pipeline size. At 3.8k EPS per vCPU on n1-standard-4 workers (4 vCPUs each), 162.5k EPS works out to 162,500 / (3,800 x 4) ≈ 10.7, or roughly 11 workers on average.

To determine the pipeline's monthly cost, we can now plug the average number of workers (11) and the instance type (n1-standard-4) into the pricing calculator. Note the number of hours per month (730 on average), given that this is a continuously running streaming pipeline.

How to get started

To get up and running with PKB, refer to the public PKB docs. If you prefer walkthrough tutorials, check out this beginner lab, which covers PKB setup, PKB command-line options, and how to visualize test results in Data Studio, similar to what we did above.

The repo includes example PKB config files, including dataflow_template.yaml, which you can use to re-run the sequence of tests above. You need to replace all <MY_PROJECT> and <MY_BUCKET> instances with your own GCP project and bucket. You also need to create an input Pub/Sub subscription pre-provisioned with your own test data (since test results vary based on your data), and an output BigQuery table with the correct schema to receive the test data. The PKB benchmark handles saving and restoring a snapshot of that Pub/Sub subscription for every test run iteration. You can run the entire benchmark directly from the PKB root directory:

```
./pkb.py --project=$PROJECT_ID \
    --benchmark_config_file=dataflow_template.yaml
```

To benchmark Dataflow jobs from a jar file (instead of a staged Dataflow template), refer to the wordcount_template.yaml PKB config file as an example, which you can run as follows:

```
./pkb.py --project=$PROJECT_ID \
    --benchmark_config_file=wordcount_template.yaml
```

To publish test results to BigQuery for further analysis, append BigQuery-specific arguments to the above commands. For example:

```
./pkb.py --project=$PROJECT_ID \
    --benchmark_config_file=dataflow_template.yaml \
    --bq_project=$PROJECT_ID \
    --bigquery_table=example_dataset.dataflow_tests
```

What's next?

We've covered how performance benchmarking can help ensure your pipeline is properly sized and configured, in order to:

- meet your expected data volumes,
- without hitting capacity limits, and
- without breaking your cost budget.

In practice, there may be many more parameters that impact your pipeline's performance beyond machine size, so we encourage you to take advantage of PKB to benchmark different configurations of your pipeline and make data-driven decisions around things like:

- planned pipeline feature development
- default and recommended values for your pipeline parameters. See this sizing guideline for one of the Google-provided Dataflow templates as an example of PKB benchmark results synthesized into deployment best practices.

You can also incorporate these performance tests into your pipeline development process to quickly identify and avoid performance regressions, and even automate such regression testing as part of your CI/CD pipeline – no pun intended. A minimal sketch of such a check follows.
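Here is one way such a regression gate might look. This is a hypothetical sketch: the result files, field name, and threshold are placeholders we chose for illustration, not something PKB produces out of the box.

```python
import json

# Hypothetical CI/CD regression gate. Assumes each PKB run exports its key
# metrics (e.g. throughput per vCPU) to a small JSON file; the file layout and
# the 5% threshold below are placeholders, not part of PKB.
BASELINE_FILE = "baseline_metrics.json"
CURRENT_FILE = "current_metrics.json"
MAX_REGRESSION = 0.05  # fail the build if throughput drops more than 5%

def load_eps_per_vcpu(path: str) -> float:
    with open(path) as f:
        return json.load(f)["eps_per_vcpu"]

baseline = load_eps_per_vcpu(BASELINE_FILE)
current = load_eps_per_vcpu(CURRENT_FILE)
drop = (baseline - current) / baseline

if drop > MAX_REGRESSION:
    raise SystemExit(f"Throughput regression of {drop:.1%} exceeds {MAX_REGRESSION:.0%}")
print(f"Throughput OK: {current:.0f} EPS/vCPU (baseline {baseline:.0f})")
```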
Finally, there's a lot of opportunity to further enhance PKB for Dataflow benchmarking, such as collecting more stats and adding more realistic benchmarks that are in line with your pipeline's expected input workload. While we tested the pipeline's unit performance (max EPS/vCPU) under peak load here, you might also want to test your pipeline's autoscaling and responsiveness (e.g. 95th percentile event latency) under varying load, which could be just as critical for your use case. You can file tickets to suggest features or submit pull requests and join the 100+ strong PKB developer community.

On that note, we'd like to acknowledge the following individuals who helped make PKB available to Dataflow end users:

- Diego Orellana, Software Engineer @ Google, PerfKit Benchmarker
- Rodd Zurcher, Cloud Solutions Architect @ Google, App/Infra Modernization
- Pablo Rodriguez Defino, PSO Cloud Consultant @ Google, Data & Analytics

Related Article: What's New with Google's Unified, Open and Intelligent Data Cloud. Google's unified, open and intelligent data cloud provides insights at every level of the enterprise to empower leaders to drive results.
Source: Google Cloud Platform

Building a Machine Learning Platform with Kubeflow and Ray on Google Kubernetes Engine

More and more enterprises are adopting Machine Learning (ML) capabilities to enhance their services, products, and operations. As their ML capabilities mature, they build centralized ML platforms to serve many teams and users across their organization. Machine learning is inherently an experimental process requiring repeated iterations. An ML platform standardizes the model development and deployment workflow to offer greater consistency for this repeated process, which improves productivity and reduces the time from prototype to production.

When first trying ML in the cloud, many practitioners start with fully managed ML platforms like Google Cloud's Vertex AI. Fully managed platforms abstract away many complexities to simplify the end-to-end workflow. However, as with most decisions, there are tradeoffs. Organizations may choose to build their own custom, self-managed ML platform for various reasons, such as control and flexibility. Building your own platform gives you more control over your resources: you can implement resource utilization constraints, access permissions, and infrastructure strategies that fit your organization's specific needs. You also get more flexibility over tools and frameworks: since the system is completely open, you can integrate any ML tools you are already using. Lastly, these benefits help avoid vendor lock-in, because cloud-native platforms are by definition portable across cloud providers.

For self-managed ML platforms, open source software is an important driver of digital innovation. If you are following the evolution of ML technologies, you are probably aware of the ever-growing ecosystem of open source machine learning frameworks, platforms, and tools. However, no single open source library delivers a complete ML solution, so we must integrate multiple open source projects to build an ML platform.

At a minimum, an ML platform should support the basic ML user journey of notebook prototyping to scaled training to online serving. For organizations with multiple teams, it additionally needs to meet administrative requirements such as multi-user support with identity-based authentication and authorization. Two popular open source projects – Kubeflow and Ray – together can support these needs. Kubeflow provides the multi-user environment and interactive notebook management. Ray orchestrates distributed computing workloads across the entire ML lifecycle, including training and serving.

Google Kubernetes Engine (GKE) simplifies deploying open source ML software in the cloud with autoscaling and auto-provisioning. GKE reduces the effort to deploy and manage the underlying infrastructure at scale and offers the flexibility to use your ML frameworks of choice. In this article, we will show how Kubeflow and Ray can be assembled into a seamless experience, and demonstrate how platform builders can deploy both to GKE to provide a comprehensive, production-ready ML platform.

Kubeflow and Ray

First, let's take a closer look at these two open source projects. While both Kubeflow and Ray deal with the problem of enabling ML at scale, they focus on very different aspects of the puzzle. Kubeflow is a Kubernetes-native ML platform aimed at simplifying the build-train-deploy lifecycle of ML models. As such, its focus is on general MLOps.
Some of the unique features offered by Kubeflow include:

- Built-in integration with Jupyter notebooks for prototyping
- Multi-user isolation support
- Workflow orchestration with Kubeflow Pipelines
- Identity-based authentication and authorization through Istio integration
- Out-of-the-box integration with major cloud providers such as GCP, Azure, and AWS

(Architecture diagram, source: https://www.kubeflow.org/docs/started/architecture/)

Ray is a general-purpose distributed computing framework with a rich set of libraries for large-scale data processing, model training, reinforcement learning, and model serving. It is popular with customers as a simple API for building and scaling AI and Python workloads. Its focus is on the application itself – allowing users to build distributed computing software with a unified and flexible set of APIs. Some of the advanced libraries offered by Ray include:

- RLlib for reinforcement learning
- Ray Tune for hyperparameter tuning
- Ray Train for distributed deep learning
- Ray Serve for scalable model serving
- Ray Data for preprocessing

(Diagram, source: https://docs.ray.io/en/latest/index.html#what-is-ray)

It should be noted that Ray is not a Kubernetes-native project. To deploy Ray on Kubernetes, the open source community has created KubeRay, which is exactly what it sounds like – a toolkit for deploying Ray in Kubernetes. KubeRay offers a powerful set of tools that includes many great features, like custom resource APIs and a scalable operator. You can learn more about it here.

Now that we have examined the differences between Kubeflow and Ray, you might be asking which is the right platform for your organization. Kubeflow's MLOps capabilities and Ray's distributed computing libraries are both independently useful, with different advantages. What if we could combine the benefits of both systems? Imagine having an environment that:

- supports Ray Train with autoscaling and resource provisioning,
- is integrated with identity-based authentication and authorization,
- supports multi-user isolation and collaboration, and
- contains an interactive notebook server.

Let's now take a look at how we can put these two platforms together and take advantage of the useful features offered by each. Specifically, we will deploy KubeRay in a GKE cluster with Kubeflow installed. The system looks something like this:

(Architecture diagram of KubeRay and Kubeflow deployed on GKE)

In this system, the Kubernetes cluster is partitioned into logically isolated workspaces called "profiles". Each new user creates their own profile, which is a container for all their resources in this Kubernetes cluster. The user can then provision their own resources within their designated namespace, including Ray clusters and Jupyter notebooks. If the user's resources are provisioned through the Kubeflow dashboard, Kubeflow will automatically place these resources in their profile namespace.

Under this setup, each Ray cluster is by default protected by role-based access control policies (enforced with Istio) that prevent unauthorized access. This allows each user to interact with their own Ray clusters independently of each other, and also allows them to share Ray clusters with other team members.

For this setup, I used the following versions:

- Google Kubernetes Engine 1.21.12-gke.2200
- Kubeflow 1.5.0
- KubeRay 0.3.0
- Python 3.7
- Ray 1.13.1

The configuration files used for this deployment can be found here.

Deploying Kubeflow and KubeRay

For deploying Kubeflow, we will be using the GCP instructions here. For simplicity, I have used mostly default configuration settings.
You can freely experiment with customizations before deploying; for example, you can enable GPU nodes in your cluster by following these instructions.

Deploying the KubeRay operator is pretty straightforward. We will be using the latest released version:

```
export KUBERAY_VERSION=v0.3.0
kubectl create -k "github.com/ray-project/kuberay/manifests/cluster-scope-resources?ref=${KUBERAY_VERSION}"
kubectl apply -k "github.com/ray-project/kuberay/manifests/base?ref=${KUBERAY_VERSION}"
```

This deploys the KubeRay operator in the "ray-systems" namespace in your cluster.

Creating Your Kubeflow User Profile

Before you can deploy and use resources in Kubeflow, you first need to create your user profile. If you followed the GKE installation instructions, you should be able to navigate to https://[cluster].endpoints.[project].cloud.goog/ in your browser, where [cluster] is the name of your GKE cluster and [project] is your GCP project name. This should redirect you to a web page where you can use your GCP credentials to authenticate yourself. Follow the dialog, and Kubeflow will create a namespace with you as the administrator. We'll discuss later in this article how to invite others to your workspace.

Build the Ray Worker Image

Next, let's build the image we'll be using for the Ray cluster. Ray is very sensitive when it comes to version compatibility (for example, the head and worker nodes must use the same versions of Ray and Python), so it is highly recommended to prepare and version-control your own worker images. Look for the base image you want from their Docker page here: rayproject/ray – Docker Image. The following is a functioning worker image using Ray 1.13 and Python 3.7:

```
FROM rayproject/ray:1.13.1-py37

RUN pip install numpy tensorflow

CMD ["bin/bash"]
```

Here is the same Dockerfile for a worker image running on GPUs, if you prefer GPUs instead of CPUs:

```
FROM rayproject/ray:1.13.1-py37-gpu

RUN pip install numpy tensorflow

CMD ["bin/bash"]
```

Use Docker to build and push both images to your image repository:

```
$ docker build -t <path-to-your-image> -f Dockerfile .
$ docker push <path-to-your-image>
```

Build the Jupyter Notebook Image

Similarly, we need to build the notebook image that we are going to use. Because we are going to use this notebook to interact with the Ray cluster, we need to ensure that it uses the same versions of Ray and Python as the Ray workers (a quick way to double-check this from a running notebook is sketched below). The Kubeflow example Jupyter notebooks can be found at Example Notebook Servers.
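As a quick sanity check (our own snippet, not part of the original walkthrough), once a notebook is up and running later in this guide, you can print both versions and compare them against the worker image you built:

```python
# Quick compatibility check to run in a notebook cell (illustrative only).
# The Ray client requires the notebook's Ray and Python versions to match the cluster's.
import sys
import ray

print("Python:", sys.version.split()[0])   # expect 3.7.x, matching rayproject/ray:1.13.1-py37
print("Ray:", ray.__version__)             # expect 1.13.1, matching the worker image
```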
For this example, I changed the PYTHON_VERSION in components/example-notebook-servers/jupyter/Dockerfile to the following:

```
ARG MINIFORGE_VERSION=4.10.1-4
ARG PIP_VERSION=21.1.2
ARG PYTHON_VERSION=3.7.10
```

Use Docker to build and push the notebook image to your image repository, similar to the previous step:

```
$ docker build -t <path-to-your-image> -f Dockerfile .
$ docker push <path-to-your-image>
```

Deploy a Ray Cluster

Now we are ready to configure and deploy our Ray cluster.

1. Copy the following sample YAML file from GitHub:

```
curl https://github.com/richardsliu/ray-on-gke/blob/main/manifests/ray-cluster.serve.yaml -o ray-cluster.serve.yaml
```

2. Edit the settings in the file:

a. For the user namespace, change the value to match your Kubeflow profile name:

```
namespace: %your_name%
```

b. For the Ray head and worker settings, change the value to point to the image you built previously:

```
image: %your_image%
```

c. Edit resource requests and limits as required. For example, you can change the CPU or GPU requirements for worker nodes here:

```
resources:
  limits:
    cpu: 1
  requests:
    cpu: 200m
```

3. Deploy the cluster:

```
kubectl apply -f ray-cluster.serve.yaml
```

4. Your cluster should be ready to go momentarily. If you have enabled node auto-provisioning on your GKE cluster, you should see the cluster dynamically scale up and down according to usage. You can check the status of your cluster with:

```
$ kubectl get pods -n <user name>
NAME                                       READY   STATUS    RESTARTS   AGE
example-cluster-head-8cbwb                 1/1     Running   0          12s
example-cluster-worker-large-group-75lsr   1/1     Running   0          12s
example-cluster-worker-large-group-jqvtp   1/1     Running   0          11s
example-cluster-worker-large-group-t7t4n   1/1     Running   0          12s
```

You can also verify that the service endpoints are created:

```
$ kubectl get services -n <user name>
NAME                       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                                AGE
example-cluster-head-svc   ClusterIP   10.52.9.88   <none>        8265/TCP,10001/TCP,8000/TCP,6379/TCP   18s
```

Remember this service name – we will come back to it later (a quick connectivity check is sketched below). Now our ML platform is all set up, and we are ready to start training a model.
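Once you have a notebook or pod with a matching Ray version (see the next section), you can optionally verify that the cluster is reachable through this service with a quick check like the following. This snippet is ours, not from the original walkthrough:

```python
# Optional connectivity check (illustrative). Run from a notebook/pod in the same
# namespace, with ray==1.13.x installed locally to match the cluster.
import ray

ray.init("ray://example-cluster-head-svc:10001")  # Ray client port exposed by the head service
print(ray.cluster_resources())                     # e.g. {'CPU': 4.0, 'memory': ..., ...}
ray.shutdown()
```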
Training an ML Model

We are going to use a notebook to orchestrate our model training. We can access Ray from a Jupyter notebook session.

1. In the Kubeflow dashboard, navigate to the "Notebooks" tab.

2. Click on "New Notebook".

3. In the "Image" section, click on "Custom Image" and enter the path to the Jupyter notebook image that you built earlier.

4. Configure resource requirements for the notebook as needed. The default notebook uses half a CPU and 1 GB of memory. Note that these resources are only for the notebook session, not for training; later, we use Ray to orchestrate resources at scale on GKE.

5. Click on "LAUNCH".

6. When the notebook finishes deploying, click on "Connect" to start a new notebook session.

7. Inside the notebook, open a terminal by clicking on File -> New -> Terminal.

8. Install Ray 1.13 in the terminal:

```
pip install ray==1.13
```

9. Now you are ready to run an actual Ray application, using this notebook and the Ray cluster you just deployed in the previous section. I have made a .ipynb file using the canonical Ray trainer example here.

10. Run through the cells in the notebook. The magic line that connects to the Ray cluster is:

```python
ray.init("ray://example-cluster-head-svc:10001")
```

This should match the service endpoint that you created earlier. If you have several different Ray clusters, you can simply change the endpoint here to connect to a different one.

11. The next few lines will start a Ray Trainer process on the cluster:

```python
trainer = Trainer(backend="tensorflow", num_workers=4)
trainer.start()
results = trainer.run(train_func_distributed)
trainer.shutdown()
```

Note that we specify 4 workers, which matches our Ray cluster's number of replicas. If we change this number, the Ray cluster will automatically scale up or down according to resource demands.

Serving an ML Model

In this section, we will look at how to serve the machine learning model that we just trained.

1. Using the same notebook, wait for the training steps to complete. You should see output logs with metrics for the model that we have trained.

2. Run the next cell:

```python
serve.start(detached=True, http_options={"host": "0.0.0.0"})
TFMnistModel.deploy(TRAINED_MODEL_PATH)
```

This starts serving the model that we just trained, using the same service endpoint we created before (a simplified sketch of such a deployment class is shown after these steps).

3. To verify that the inference endpoint is working, we can create a new notebook. You can use this one here.

4. Note that we are calling the same inference endpoint as before, but on a different port:

```python
resp = requests.get(
    "http://example-cluster-head-svc:8000/mnist",
    json={"array": np.random.randn(28 * 28).tolist()})
```

5. You should see the inference results displayed in your notebook session.
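For context, TFMnistModel is defined in the example notebook rather than in this post. A simplified sketch of what such a Ray Serve (1.13) deployment class might look like is shown below; it is based on Ray's TensorFlow serving tutorial, and the notebook's actual implementation may differ:

```python
import numpy as np
from ray import serve

# Simplified sketch of a Ray Serve deployment for the trained MNIST model.
# Illustrative only; the example notebook's actual class may differ.
@serve.deployment(route_prefix="/mnist")
class TFMnistModel:
    def __init__(self, model_path: str):
        import tensorflow as tf
        self.model = tf.keras.models.load_model(model_path)

    async def __call__(self, starlette_request):
        payload = await starlette_request.json()
        input_array = np.array(payload["array"]).reshape(-1, 28, 28)
        prediction = self.model(input_array).numpy().tolist()
        return {"prediction": prediction}
```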
Sharing the Ray Cluster with Others

Now that you have a functional workspace with an interactive notebook and a Ray cluster, let's invite others to collaborate.

1. On Cloud Console, grant the user minimal cluster access here.

2. In the left-hand panel of the Kubeflow dashboard, select "Manage Contributors".

3. In the "Contributors to your namespace" section, enter the email address of the user to whom you are granting access, and press Enter.

4. That user can now select your namespace and access your notebooks, including your Ray cluster.

Using the Ray Dashboard

Finally, you can also bring up the Ray Dashboard using Istio virtual services. With these steps, you can display the dashboard UI inside the Kubeflow central dashboard console:

1. Create an Istio VirtualService config file:

```
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: example-cluster-virtual-service
  namespace: kubeflow
spec:
  gateways:
  - kubeflow-gateway
  hosts:
  - '*'
  http:
  - match:
    - uri:
        prefix: /example-cluster/
    rewrite:
      uri: /
    route:
    - destination:
        host: example-cluster-head-svc.$(USER_NAMESPACE).svc.local
        port:
          number: 8265
```

Replace $(USER_NAMESPACE) with the namespace of your user profile, and save this to a local file.

2. Deploy the virtual service:

```
kubectl apply -f virtual_service.yaml
```

3. In your browser window, navigate to https://<host>/_/example-cluster/. The Ray dashboard should be displayed in the window.

Conclusion

Let's take a minute to recap what we have done. In this article, we have demonstrated how to deploy two popular ML frameworks, Kubeflow and Ray, in the same GCP Kubernetes cluster. The setup also takes advantage of GCP features like IAP (Identity-Aware Proxy) for user authentication, which protects your applications while simplifying the experience for cloud admins. The end result is a well-integrated, production-ready system that pulls in useful features offered by each component:

- orchestrating distributed computing workloads using Ray APIs;
- multi-user isolation using Kubeflow;
- an interactive notebook environment using Kubeflow notebooks;
- cluster autoscaling and auto-provisioning using Google Kubernetes Engine.

We've only scratched the surface of the possibilities, and you can expand from here:

- integrations with other MLOps offerings, such as Vertex AI Model Monitoring;
- faster and safer image storage and management through Artifact Registry;
- high-throughput storage for unstructured data using GCSFuse;
- improved network throughput for collective communication with NCCL Fast Socket.

We look forward to the growth of your ML platform and how your team innovates with machine learning. Look out for future articles on how to enable additional ML platform features.

Related Article: Enabling real-time AI with Streaming Ingestion in Vertex AI. Many machine learning (ML) use cases, like fraud detection, ad targeting, and recommendation engines, require near real-time predictions….
Source: Google Cloud Platform

View policy enforcement metrics for ACM Policy Controller

Policy Controller enables the enforcement of fully programmable policies for your clusters. These policies act as "guardrails" that prevent changes from violating security, operational, or compliance controls, both at admission time and post-admission through continuous audit.

Through ongoing conversations with platform and security administrators, we have received feedback asking for more visibility into how policies are applied, i.e. enforced or audited, across Anthos or GKE clusters. Starting with Anthos Config Management (ACM) 1.12.0, we have made it easier to export and visualize Policy Controller metrics.

Policy Controller Metrics

Policy Controller includes metrics related to policy usage, such as the number of constraints, the number of constraint templates, and the number of audit violations detected, just to name a few (see the list of metrics exposed).

Exporting the metrics

Policy Controller uses OpenCensus to create and record metrics related to its processes and policy usage. Policy Controller can easily be configured at install time to export these metrics to Prometheus and/or Cloud Monitoring. By default, Policy Controller exports its metrics to both Prometheus and Cloud Monitoring.

Viewing the metrics

These metrics are exported to the customer's Cloud Monitoring project in Prometheus format. As a result, customers can view them in the Cloud Monitoring UI or query them via the Cloud Monitoring API using either PromQL (the de facto query language for Kubernetes metrics) or MQL (Google's proprietary metrics query language). There is also a newly added Cloud Monitoring dashboard for viewing your metrics, which can be further edited to meet your business or operational needs. This dashboard can be imported from within the Cloud Console:

1. Log in to the Cloud Console, click on the hamburger (collapsed) menu, and click More Products to expand the list of products in the menu.
2. Select Monitoring > Dashboards, then click the Sample Library tab on the page. This shows all the samples available by category.
3. Select Anthos Config Management from the list.
4. Check Policy Controller from the list and click Import.
5. Confirm that you want to import the dashboard. This creates a new dashboard.
6. View it by clicking on the Dashboards menu item and then selecting the newly created Policy Controller dashboard from the list.

Pricing

These metrics are available at no additional cost to our customers.

Alerting on the metrics

You can create alerting policies in Cloud Alerting so you are notified in case something needs your attention.

Third-party integration

Any third-party observability tool can ingest these metrics using the Cloud Monitoring API. If you are using Grafana dashboards, all you have to do is point Grafana at the Cloud Monitoring API for it to work. A small sketch of querying these metrics programmatically follows.
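For example, here is a minimal sketch of pulling a Policy Controller metric through the Cloud Monitoring API with the Python client library. The metric type in the filter below is a placeholder we chose for illustration; substitute the actual metric name from the list of exposed Policy Controller metrics for your ACM version:

```python
# Minimal sketch: query a Policy Controller metric via the Cloud Monitoring API.
# Requires: pip install google-cloud-monitoring
# The metric.type filter below is a placeholder; use the actual metric name from
# the Policy Controller metrics list for your ACM version.
import time
from google.cloud import monitoring_v3

project_id = "my-project"  # replace with your project ID
client = monitoring_v3.MetricServiceClient()

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)

results = client.list_time_series(
    request={
        "name": f"projects/{project_id}",
        "filter": 'metric.type = "custom.googleapis.com/opencensus/gatekeeper_constraints"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    print(series.metric.type, series.points[0].value)
```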
Next steps

- Install Policy Controller
- Implement the CIS benchmark using Policy Controller
- Explore the Policy Controller constraint template library
- Config Sync metrics

Related Article: Extending Anthos to manage on-premises edge VMs: now generally available. VM support in Anthos extends Anthos on bare metal (Google Distributed Cloud Virtual) to run and manage both containers and VMs on a singl…

Source: Google Cloud Platform

AWS Backup Audit Manager is now available in the Africa (Cape Town) and Europe (Milan) Regions

Today we are announcing the availability of AWS Backup Audit Manager in the AWS Regions Africa (Cape Town) and Europe (Milan). AWS Backup Audit Manager is a new feature of the AWS Backup service that lets you audit and report on the compliance of your data protection policies, helping you meet your business and regulatory requirements. AWS Backup enables you to centralize and automate data protection policies across AWS services, based on organizational best practices and regulatory standards, and AWS Backup Audit Manager helps you maintain and demonstrate compliance with those policies. With AWS Backup Audit Manager, you can generate audit-ready reports to demonstrate that your data protection policies comply with your defined industry-specific regulatory requirements.
Source: aws.amazon.com