Kata Containers and gVisor with k0s

medium.com – Because not all pods can be trusted, this article shows different options for enhancing process isolation by using container runtimes other than the default one (runc). We will use Kubernetes…
Source: news.kubernauts.io

Helm — Kubernetes Package Manager

nethminiromina.medium.com – In Docker, we build Docker images and store them in a remote repository such as Docker Hub, GCR, or ECR. Similarly, with Helm you can generate Kubernetes packages, called charts. These charts can…
Source: news.kubernauts.io

Analyze AWS EKS Audit logs with Falco

ismailyenigul.medium.com – Falco is a Kubernetes threat-detection engine. It supports Kubernetes audit events to track changes made to your cluster, as defined in k8s audit rules. Unfortunately, AWS EKS is a managed…
Source: news.kubernauts.io

Helping users keep their organization secure with their phone's built-in security key

Phishing remains among an organization’s most prevalent security threats. At Google, we’ve developed advanced tools to help protect users from phishing attacks, including our Titan Security Keys. With the goal of making security keys even easier to use and more ubiquitous, we’ve recently made it possible to use your phone’s built-in security key to secure your account. Security keys based on FIDO standards are a form of 2-Step Verification (2SV), and we consider them to be the strongest, most phishing-resistant method of 2SV because they leverage public key cryptography to verify a user’s identity, and that of the login page, blocking attackers from accessing an account even if they have the username and password.

We want as many of our customers as possible to adopt this essential protection and to make them aware of the potential risks they are exposed to if they don’t. That’s why today we’re launching a new Recommender into Active Assist, our portfolio of services that help teams operate and optimize their cloud deployments with proactive intelligence instead of unnecessary manual effort. This new “Account security” recommender will automatically detect when a user with elevated permissions, such as a Project Owner, is eligible to use their phone’s built-in security key to better protect their account, but has not yet turned on this important safeguard. Users will see a notification prompting them to enable their phone as a phishing-resistant second factor. This allows organizations to immediately implement this protection and strengthen their security posture using a device end-users almost certainly always have at hand: their phones.

The notification appears in the Cloud Console, and acting on the recommendation takes just three simple steps:

Click on “Secure Now”, which will open the account’s Security Checkup tool.
Follow the instructions located in the “2-Step Verification” tab.
Finish the enrollment process.

As with all of the recommenders within Active Assist, the goal is to make these recommendations easy to see, understand, and take action on. That means you spend less time on cloud administration, while still achieving a more performant, secure cloud. Here, users can bolster their security posture with just a few clicks by enabling their phone’s built-in security keys. This is similar to what we’ve already empowered security teams to do with Active Assist’s IAM Recommender, which helps greatly reduce unnecessary permissions across your user accounts.

This feature will start rolling out to eligible users over the next several weeks. For more information on how to start using your phone’s built-in security key, read our documentation. To learn more about other ways Active Assist can help optimize your cloud operations, check out this blog.
Source: Google Cloud Platform

New whitepaper: CISO’s guide to Cloud Security Transformation

Whether you’re a CISO actively pursuing a cloud security transformation or a CISO supporting a wider digital transformation, you’re responsible for securing information for your company, your partners, and your customers. At Google Cloud, we help you stay ahead of emerging threats, giving you the tools you need to strengthen your security and maintain trust in your company. Enabling a successful digital transformation and migration to the cloud by executing a parallel security transformation ensures that not only can you manage risks in the new environment, but you can also fully leverage the opportunities cloud security offers to modernize your approach and net-reduce your security risk. Our new whitepaper shares our thinking, based on our experiences working with Google Cloud customers, their CISOs, and their teams, on how best to approach a security transformation with this in mind. Here are the key highlights:

Prepare your company for cloud security

Whilst it is true that cloud generally, and cloud security specifically, involves the use of sophisticated technologies, it would be wrong to consider cloud security as only a technical problem to solve. In this whitepaper we describe a number of organisational, procedural, people and policy considerations that are critical to achieving the levels of security and risk mitigation you require. As your company starts on, or significantly expands, its cloud journey, consider the following:

Security Culture. Is security an afterthought, or nice to have, or deemed to be the exclusive responsibility of the security team? Are peer security design and code reviews common and positively viewed, and is it accepted that a culture of inevitability will better prepare you for worst-case scenarios?

Thinking Differently. Cloud security approaches provide a significant opportunity to debunk a number of longstanding security myths and to adopt modern security practices. By letting go of the traditional security perimeter model, you can direct investments into architectures and models that leverage zero trust concepts, and so dramatically increase the security of your technology more broadly. And by adopting a data-driven assurance approach you can leverage the fact that all deployed cloud technology is explicitly declared and discoverable in data, and build velocity and scale into your assurance processes.

Understand how companies evolve with cloud

When your business moves to the cloud, the way that your whole company works—not just the security team—evolves. As CISO, you need to understand and prepare for these new ways of working so you can integrate and collaborate with your partners and the rest of your company. For example:

Accelerated development timelines. Developing and deploying in the cloud can significantly reduce the time between releases, often creating a continuous, iterative release cycle. The shift to this development process—whether it’s called Agile, DevOps, or something else—also represents an opportunity for you to accelerate the development and release of new security features. To take this opportunity, security teams must understand—or even drive—the new release process and timeline, collaborate closely or integrate with development teams, and adopt an iterative approach to security development.

Infrastructure managed as code. When servers, racks, and data centers are managed for you in the cloud, your code becomes your infrastructure. Deploying and managing infrastructure as code represents a clear opportunity for your security organization to improve its processes and to integrate more effectively with the software development process. When you deploy infrastructure as code, you can integrate your security policies directly in the code, making security central to both your company’s development process and to any software that your company develops.

Evolve your security operating model

Transforming in the cloud also transforms how your security organization works. For example, manual security work will be automated, new roles and responsibilities will emerge, and security experts will partner more closely with development teams. Your organization will also have a new collaborator to work with: your cloud service provider. There are three key considerations:

Collaboration with your cloud service provider. Understanding the responsibilities your cloud provider has (“security of the cloud”), and the responsibilities you retain (“security in the cloud”), are important steps to take. Equally, so are the methods you will use to assure the responsibilities that both parties have, including working with your cloud service provider to consume solutions, updates and best practices so that you and your provider have a “shared fate”.

Evolving how security roles are performed. In addition to working with a new collaborator in your cloud service provider, your security organization will also change how it works from within. While every organization is different, it is important to consider all parts of the security organisation, from policies and risk management, to security architecture, engineering, operations and assurance, as most roles and responsibilities will need to evolve to some extent.

Identifying the optimal security operating model. Your transformation to cloud security is an opportunity to rethink your security operating model. How should security teams work with development teams? Should security functions and operations be centralized or federated? As CISO, you should answer these questions and design your security operating model before you begin moving to the cloud. Our whitepaper helps you choose a cloud-appropriate security operating model by describing the pros and cons of three approaches.

Moving to the cloud represents a huge opportunity to transform your company’s approach to security. To lead your security organization and your company through this transformation, you need to think differently about how you work, how you manage risk, and how you deploy your security infrastructure. As CISO, you need to instill a culture of security throughout the company and manage changes in how your company thinks about security and how your company is organized. The recommendations throughout this whitepaper come from Google’s years of leading and innovating in cloud security, in addition to the experience that Google Cloud experts have from their previous roles as CISOs and lead security engineers in major companies that have successfully navigated the journey to cloud. We are excited to collaborate with you on your cloud security transformation.

Related article: New whitepaper: Designing and deploying a data security strategy with Google Cloud
Source: Google Cloud Platform

Service Directory is generally available: Simplify your service inventory

Enterprises are increasingly adopting a service-oriented approach to building applications, composing several different services that span multiple products and environments. For example, a typical deployment can include:

Services on Google Cloud, fronted by load balancers
Third-party services, such as Redis
Services on-premises
Services on other clouds

As the number and diversity of services grows, it becomes increasingly challenging to maintain an inventory of all of the services across an organization. Last year, we launched Service Directory in beta to help simplify the problem of service management, and it’s now generally available. Service Directory allows you to easily register these services to a single fully managed registry, build a rich ecosystem of services, and uplevel your environment from an infrastructure-centric to a service-centric model.

Simplify service naming and lookup

With Service Directory, you can maintain a flexible runtime service inventory. Some of the benefits of using Service Directory include:

Human-friendly service naming: Customers can associate human-readable names with their services in Service Directory, as opposed to autogenerated default names. For example, your payments service can be called payments, instead of something like service-b3ada17a-9ada-46b2, making it easier to reference and reason about your services.

Enrich service data with additional properties: In addition to control over names, Service Directory also allows you to annotate a service and its endpoints with additional information beyond names. For example, new services can be given an experimental annotation until they are ready for production, or be given a hipaa-compliant annotation if they are able to handle PHI. Customers can also filter services based on their annotations; for example, if you have services using multiple types of weather data, you can annotate those data sources with fields like sunnyvale-temp, sunnyvale-precipitation, and paloalto-temp. You could then use Service Directory’s query API to find services using only Sunnyvale weather data, by searching for all services annotated with sunnyvale-temp or sunnyvale-precipitation, but not paloalto-temp.

Easily resolve services from a variety of clients: Service Directory allows you to resolve services via REST, gRPC, and DNS lookups. In addition, Service Directory’s private DNS zones automatically update DNS records as services change, instead of needing to manually add DNS entries as you add new services.

Fully managed: Service Directory is fully managed, allowing you to maintain your service registry with minimal operational overhead.

New: automatic service registration

In this release, you can now automatically register services in Service Directory without needing to write any orchestration code. This feature is available today for Internal TCP/UDP and Internal HTTP(S) load balancers, and will be extended to several other products going forward. Registering services with Service Directory is easy. When you create an Internal Load Balancer forwarding rule, register it with Service Directory by specifying a --service-directory-registration flag with the name of the Service Directory service you want your load balancer to be registered in. This automatically creates a Service Directory entry for your ILB service, and populates it with data such as the forwarding rule’s IP and port. When you delete the forwarding rule, the Service Directory entry is automatically removed as well, without needing to write any cleanup or teardown code.

To learn more about Service Directory, visit the documentation, or walk through the configuration guide to get started.
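To make the automatic registration flow concrete, here is a minimal sketch of creating an internal forwarding rule that registers itself in Service Directory. All names are placeholders, and the exact resource-path format expected by the registration flag should be checked against the Service Directory documentation:

# Hypothetical names; adjust the project, region, network, and backend service.
gcloud compute forwarding-rules create payments-ilb \
    --load-balancing-scheme=INTERNAL \
    --network=default \
    --subnet=default \
    --region=us-central1 \
    --backend-service=payments-backend \
    --ports=80 \
    --service-directory-registration=SERVICE_DIRECTORY_SERVICE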
Source: Google Cloud Platform

Ship your Go applications faster to Cloud Run with ko

As developers work more and more with containers, it is becoming increasingly important to reduce the time to move from source code to a deployed application. To make building container images faster and easier, we have built technologies like Cloud Build, ko, Jib, Nixery and added support for cloud-native Buildpacks. Some of these tools focus specifically on building container images directly from the source code without a Docker engine or a Dockerfile.

The Go programming language specifically makes building container images from source code much easier. This article focuses on how a tool we developed named “ko” can help you deploy services written in Go to Cloud Run faster than Docker build/push, and how it compares to alternatives like Buildpacks.

How does ko work?

ko is an open-source tool developed at Google that helps you build container images from Go programs and push them to container registries (including Container Registry and Artifact Registry). ko does its job without requiring you to write a Dockerfile or even install Docker itself on your machine.

ko is spun off of the go-containerregistry library, which helps you interact with container registries and images. This is for a good reason: The majority of ko’s functionality is implemented using this Go module. Most notably, this is what ko does:

Download a base image from a container registry
Statically compile your Go binary
Create a new container image layer with the Go binary
Append that layer to the base image to create a new image
Push the new image to the remote container registry

Building and pushing a container image from a Go program is quite simple with ko: we specify a registry for the resulting image to be published to and then a Go import path (the same as what we would use in a “go build” command, i.e. the current directory in this case) to refer to the application we want to build. By default, the ko command uses a secure and lean base image from the Distroless collection of images (the gcr.io/distroless/static:nonroot image), which doesn’t contain a shell or other executables in order to reduce the attack surface of the container. With this base image, the resulting container will have CA certificates, timezone data, and your statically-compiled Go application binary.

ko also works with Kubernetes quite well. For example, with “ko resolve” and “ko apply” commands you can hydrate your YAML manifests as ko replaces your “image:” references in YAML automatically with the image it builds, so you can deploy the resulting YAML to the Kubernetes cluster with kubectl.

Using ko with Cloud Run

Because of ko’s composable nature, you can use ko with gcloud command-line tools to build and push images to Cloud Run with a single command. This works because ko outputs the full pushed image reference to the stdout stream, which gets captured by the shell and passed as an argument to gcloud via the --image flag.

Similar to Kubernetes, ko can hydrate your YAML manifests for Cloud Run if you are deploying your services declaratively using YAML: “ko resolve” replaces the Go import paths in the “image: …” values of your YAML file, and sends the output to stdout, which is passed to gcloud over a pipe. gcloud reads the hydrated YAML from stdin (due to the “-” argument) and deploys the service to Cloud Run. For this to work, the “image:” field in the YAML file needs to list the import path of your Go program rather than a regular image reference.

Ko, compared to its alternatives

As we mentioned earlier, accelerating the refactor-build-deploy-test loop is crucial for developers iterating on their applications. To illustrate the speed gains made possible by using ko (in addition to the time and system resources you’ll save by not having to write a Dockerfile or run Docker), we compared it to two common alternatives:

Local docker build and docker push commands (with a Dockerfile)
Buildpacks (no Dockerfile, but runs on Docker)

Below is the performance comparison for building a sample Go application into a container image and pushing this image to Artifact Registry.

Note: In this chart, “cold” builds do not cache layers either in the build machine or in the container registry. In contrast, “warm” builds cache both layers (if caching is enabled by default) and skip pushing the layer blobs to the registry if they already exist.

ko vs local Docker Engine: ko wins here by a small margin. This is because the “docker build” command packages your source code into a tarball and sends it to the Docker engine, which either runs natively on Linux or inside a VM on macOS/Windows. Then, Docker builds the image by spinning up a new container for each Dockerfile instruction and snapshots the filesystem of the resulting container into an image layer. These steps can take a while. ko does not have these shortcomings; it directly creates the image layers without spinning up any containers and pushes the resulting layer tarballs and image manifest to the registry. In this approach we built and pushed the Go application using docker build and docker push.

ko vs Buildpacks (on local Docker): Buildpacks help you build images for many languages without having to write a Dockerfile. It’s worth noting that Buildpacks still require Docker to work. Buildpacks work by detecting your language and using a “builder image” that has all the build tools installed, before finally copying the resulting artifacts into a smaller image. In this case, the builder image (gcr.io/buildpacks/builder:v1) is around 500 MB, so it will show up in the “cold” builds. However, even for “warm” builds, Buildpacks use a local Docker engine, which is already slower than ko. And similarly, Buildpacks will run custom logic during the build phase, so it is also slower than Docker. In this approach we built and pushed the Go application using the Buildpacks tooling.

Conclusion

ko is part of a larger effort to make developers’ lives easier by simplifying how container images are built. With buildpacks support, you can build container images out of many programming languages without writing Dockerfiles at all, and then you can deploy these images to Cloud Run with a single command. ko helps you build your Go applications into container images and makes it easy to deploy them to Kubernetes or Cloud Run. ko is not limited to the Google Cloud ecosystem: It can authenticate to any container registry and works with any Kubernetes cluster.

To learn more, make sure to check out the ko documentation at the GitHub repository and try deploying some of your own Go services to Cloud Run.

Related article: Streamlining Cloud Run development with Cloud Code
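To make the workflow described above concrete, here is a rough sketch of the commands involved; the registry, import paths, service name, and region are placeholders, and the exact flags should be verified against the ko and gcloud documentation:

# Tell ko where to push images (placeholder project).
export KO_DOCKER_REPO=gcr.io/my-project

# Build the Go program in the current directory and push the resulting image.
ko publish .

# Hydrate Kubernetes manifests and apply them: image references are replaced
# with the image ko builds and pushes.
ko resolve -f config/ | kubectl apply -f -

# Deploy to Cloud Run in one step: ko prints the pushed image reference to
# stdout, which the shell passes to gcloud via --image.
gcloud run deploy my-service --image=$(ko publish .) --platform=managed --region=us-central1

# Declarative Cloud Run deployment from hydrated YAML read on stdin.
ko resolve -f service.yaml | gcloud run services replace - --platform=managed --region=us-central1

# In that YAML, the image field references the Go import path, for example:
#   image: ko://github.com/example/app/cmd/server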
Source: Google Cloud Platform

Discover and invoke services across clusters with GKE multi-cluster services

Do you have a Kubernetes application that needs to span multiple clusters? Whether for privacy, scalability, availability, cost management, and data sovereignty reasons, it can be hard for platform teams to architect, implement, operate, and maintain applications across cluster boundaries, as Kubernetes’ Service primitive only enables service discovery within the confines of a single Kubernetes cluster. Today, we are announcing the general availability of multi-cluster services (MCS), a Kubernetes-native cross-cluster service discovery and invocation mechanism for Google Kubernetes Engine (GKE), the most scalable managed Kubernetes offering. MCS extends the reach of the Kubernetes Service primitive beyond the cluster boundary, so you can easily build Kubernetes applications that span multiple clusters. This is especially important for cloud-native applications, which are typically built using containerized microservices. The one constant with microservices is change—microservices are constantly being updated, scaled up, scaled down, and redeployed throughout the lifecycle of an application, and the ability for microservices to discover one another is critical. GKE’s new multi-cluster services capability makes managing cross-cluster microservices-based apps simple.

How does GKE MCS work?

GKE MCS leverages the existing service primitive that developers and operators are already familiar with, making expanding into multiple clusters consistent and intuitive. Services enabled with this feature are discoverable and accessible across clusters with a virtual IP, matching the behavior of a ClusterIP service within a cluster. Just like your existing services, services configured to use MCS are compatible with community-driven, open APIs, ensuring your workloads remain portable. The GKE MCS solution leverages environs to group clusters and is powered by the same technology offered by Traffic Director, Google Cloud’s fully managed, enterprise-grade platform for global application networking.

Common MCS use cases

Mercari, a leading e-commerce company and an early adopter of MCS: “We have been running all our microservices in a single multi-tenant GKE cluster. For our next-generation Kubernetes infrastructure, we are designing multi-region homogeneous and heterogeneous clusters. Seamless inter-cluster east-west communication is a prerequisite and multi-cluster Services promise to deliver. Developers will not need to think about where the service is running. We are very excited at the prospect.” – Vishal Banthia, Engineering Manager, Platform Infra, Mercari

We are excited to see how you use MCS to deploy services that span multiple clusters to deliver solutions optimized for your business needs. Here are some popular use cases we have seen our customers enable with GKE MCS.

High availability – Running the same service across clusters in multiple regions provides improved fault tolerance. In the event that a service in one cluster is unavailable, the request can fail over and be served from another cluster (or clusters). With MCS, it’s now possible to manage the communication between services across clusters, to improve the availability and resiliency of your applications to meet service level objectives.

Stateful and stateless services – Stateful and stateless services have different operational dependencies and complexities and present different operational tradeoffs. Typically, stateless services have less of a dependency on migrating storage, making it easier to scale, upgrade and migrate a workload with high availability. MCS lets you separate an application into separate clusters for stateful and stateless workloads, making them easier to manage.

Shared services – Increasingly, customers are spinning up separate Kubernetes clusters to get higher availability, better management of stateful and stateless services, and easier compliance with data sovereignty requirements. However, many services such as logging, monitoring (Prometheus), secrets management (Vault), or DNS are often shared amongst all clusters to simplify operations and reduce costs. Instead of each cluster requiring its own local service replica, MCS makes it easy to set up common shared services in a separate cluster that is used by all functional clusters.

Migration – Modernizing an existing application into a containerized microservices-based architecture often requires services to be deployed across multiple Kubernetes clusters. MCS provides a mechanism to help bridge the communication between those services, making it easier to migrate your applications—especially when the same service can be deployed in two different clusters and traffic is allowed to shift.

Multi-cluster Services & Multi-cluster Ingress

MCS also complements Multi-cluster Ingress with multi-cluster load balancing for both East-West and North-South traffic flows. Whether your traffic flows from the internet across clusters, within the VPC between clusters, or both, GKE provides multi-cluster networking that is deeply integrated and Kubernetes-native.

Get started with GKE multi-cluster services, today

You can start using multi-cluster services in GKE today and gain the benefits of higher availability, better management of shared services, and easier compliance for data sovereignty requirements of your applications.

Thanks to Maulin Patel, Product Manager, for his contributions to this blog post.

Related article: GKE best practices: Exposing GKE applications through Ingress and Services
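As a rough sketch of what enabling a service for MCS looks like in practice, an existing Service is exported by creating a ServiceExport object with the same name and namespace in the cluster that runs it. The API group, kind, and names below follow the GKE multi-cluster Services documentation rather than this post, so treat them as assumptions:

# Hypothetical example: export the existing "payments" Service in the
# "billing" namespace so other clusters in the same environ can discover it.
kind: ServiceExport
apiVersion: net.gke.io/v1
metadata:
  namespace: billing
  name: payments

Other clusters registered to the same environ can then reach the exported service through the MCS-provided virtual IP and DNS name, matching the ClusterIP-like behavior described above.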
Source: Google Cloud Platform

How to Deploy GPU-Accelerated Applications on Amazon ECS with Docker Compose

Many applications can take advantage of GPU acceleration, in particular resource-intensive Machine Learning (ML) applications. The development time of such applications may vary based on the hardware of the machine we use for development. Containerization will facilitate development due to reproducibility and will make the setup easily transferable to other machines. Most importantly, a containerized application is easily deployable to platforms such as Amazon ECS, where it can take advantage of different hardware configurations.

In this tutorial, we discuss how to develop GPU-accelerated applications in containers locally and how to use Docker Compose to easily deploy them to the cloud (the Amazon ECS platform). We make the transition from the local environment to the cloud effortless: the GPU-accelerated application is packaged with all its dependencies in a Docker image and deployed in the same way regardless of the target environment.

Requirements

In order to follow this tutorial, we need the following tools installed locally:

Windows and MacOS: install Docker Desktop
Linux: install Docker Engine and Compose CLI
To deploy to Amazon ECS: an AWS account

For deploying to a cloud platform, we rely on the new Docker Compose implementation embedded into the Docker CLI binary. Therefore, when targeting a cloud platform we are going to run docker compose commands instead of docker-compose. For local commands, both implementations of Docker Compose should work. If you find a missing feature that you use, report it on the issue tracker.

Sample application

Keep in mind that what we want to showcase is how to structure and manage a GPU accelerated application with Docker Compose, and how we can deploy it to the cloud. We do not focus on GPU programming or the AI/ML algorithms, but rather on how to structure and containerize such an application to facilitate portability, sharing and deployment.

For this tutorial, we rely on sample code provided in the TensorFlow documentation to simulate a GPU-accelerated translation service that we can orchestrate with Docker Compose. The original code is documented at https://www.tensorflow.org/tutorials/text/nmt_with_attention. For this exercise, we have reorganized the code so that we can easily manage it with Docker Compose.

This sample uses the TensorFlow platform, which can automatically use GPU devices if they are available on the host. Next, we will discuss how to organize this sample into services to containerize them easily and what the challenges are when we run such a resource-intensive application locally.

Note: The sample code to use throughout this tutorial can be found here. It needs to be downloaded locally to exercise the commands we are going to discuss.

1. Local environment

Let’s assume we want to build and deploy a service that can translate simple sentences to a language of our choice. For such a service, we need to train an ML model to translate from one language to another and then use this model to translate new inputs. 

Application setup

We choose to separate the phases of the ML process into two different Compose services:

A training service that trains a model to translate between two languages (this includes the data gathering, preprocessing and all the necessary steps before the actual training process).
A translation service that loads a model and uses it to `translate` a sentence.

This structure is defined in the docker-compose.dev.yaml from the downloaded sample application which has the following content:

docker-compose.yml

services:

  training:
    build: backend
    command: python model.py
    volumes:
      - models:/checkpoints

  translator:
    build: backend
    volumes:
      - models:/checkpoints
    ports:
      - 5000:5000

volumes:
  models:

We want the training service to train a model to translate from English to French and to save this model to a named volume models that is shared between the two services. The translator service has a published port to allow us to query it easily.

Deploy locally with Docker Compose

The reason for starting with the simplified compose file is that it can be deployed locally whether a GPU is present or not. We will see later how to add the GPU resource reservation to it.

Before deploying, rename the docker-compose.dev.yaml to docker-compose.yaml to avoid setting the file path with the flag -f for every compose command.
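Either option works; a quick sketch of both, using the file names from this tutorial:

# Option 1: rename the file so Compose picks it up by default.
mv docker-compose.dev.yaml docker-compose.yaml

# Option 2: keep the original name and pass the file explicitly on each command.
docker compose -f docker-compose.dev.yaml up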

To deploy the Compose file, all we need to do is open a terminal, go to its base directory and run:

$ docker compose up
The new 'docker compose' command is currently experimental.
To provide feedback or request new features please open
issues at https://github.com/docker/compose-cli
[+] Running 4/0
⠿ Network "gpu_default" Created 0.0s
⠿ Volume "gpu_models" Created 0.0s
⠿ gpu_translator_1 Created 0.0s
⠿ gpu_training_1 Created 0.0s
Attaching to gpu_training_1, gpu_translator_1

translator_1 | * Running on http://0.0.0.0:5000/ (Press CTRL+C
to quit)

HTTP/1.1" 200 –
training_1 | Epoch 1 Batch 0 Loss 3.3540
training_1 | Epoch 1 Batch 100 Loss 1.6044
training_1 | Epoch 1 Batch 200 Loss 1.3441
training_1 | Epoch 1 Batch 300 Loss 1.1679
training_1 | Epoch 1 Loss 1.4679
training_1 | Time taken for 1 epoch 218.06381964683533 sec
training_1 |
training_1 | Epoch 2 Batch 0 Loss 0.9957
training_1 | Epoch 2 Batch 100 Loss 1.0288
training_1 | Epoch 2 Batch 200 Loss 0.8737
training_1 | Epoch 2 Batch 300 Loss 0.8971
training_1 | Epoch 2 Loss 0.9668
training_1 | Time taken for 1 epoch 211.0763041973114 sec

training_1 | Checkpoints saved in /checkpoints/eng-fra
training_1 | Requested translator service to reload its model,
response status: 200
translator_1 | 172.22.0.2 – – [18/Dec/2020 10:23:46]
"GET /reload?lang=eng-fra

Docker Compose deploys a container for each service and attaches us to their logs, which allows us to follow the progress of the training service.

Every 10 cycles (epochs), the training service requests the translator to reload its model from the last checkpoint. If the translator is queried before the first training phase (10 cycles) is completed, we should get the following message. 

$ curl -d "text=hello" localhost:5000/
No trained model found / training may be in progress…

From the logs, we can see that each training cycle is resource-intensive and may take a long time (depending on the parameter setup of the ML algorithm).

The training service runs continuously and checkpoints the model periodically to a named volume shared between the two services. 

$ docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
f11fc947a90a gpu_training "python model.py" 14 minutes ago Up 54 minutes gpu_training_1
baf147fbdf18 gpu_translator "/bin/bash -c ‘pytho…" 14 minutes ago Up 54 minutes 0.0.0.0:5000->5000/tcp gpu_translator_1

We can now query the translator service which uses the trained model:

$ curl -d "text=hello" localhost:5000/
salut !
$ curl -d "text=I want a vacation" localhost:5000/
je veux une autre .
$ curl -d "text=I am a student" localhost:5000/
je suis etudiant .

Keep in mind that, for this exercise, we are not concerned with the accuracy of the translation, but with how to set up the entire process following a service-based approach that makes it easy to deploy with Docker Compose.

During development, we may have to re-run the training process and evaluate it each time we tweak the algorithm. This is a very time-consuming task if we do not use development machines built for high performance.

An alternative is to use on-demand cloud resources. For example, we could use cloud instances hosting GPU devices to run the resource-intensive components of our application. Running our sample application on a machine with access to a GPU will automatically train the model on the GPU. This will speed up the process and significantly reduce the development time.

The first step to deploy this application to some faster cloud instances is to pack it as a Docker image and push it to Docker Hub, from where we can access it from cloud instances.

Build and Push images to Docker Hub

During the deployment with compose up, the application is packed as a Docker image which is then used to create the containers. We need to tag the built images and push them to Docker Hub.

A simple way to do this is by setting the image property for services in the Compose file. Previously, we had only set the build property for our services; however, we had no image defined. Docker Compose requires at least one of these properties to be defined in order to deploy the application.

We set the image property following the pattern <account>/<name>:<tag>, where the tag is optional (it defaults to 'latest'). As an example, we use the Docker Hub account ID myhubuser and the application name gpudemo. Edit the compose file and set the image property for the two services as below:

docker-compose.yml

services:

  training:
    image: myhubuser/gpudemo
    build: backend
    command: python model.py
    volumes:
      - models:/checkpoints

  translator:
    image: myhubuser/gpudemo
    build: backend
    volumes:
      - models:/checkpoints
    ports:
      - 5000:5000

volumes:
  models:

To build the images run:

$ docker compose build
The new 'docker compose' command is currently experimental. To
provide feedback or request new features please open issues
at https://github.com/docker/compose-cli
[+] Building 1.0s (10/10) FINISHED
=> [internal] load build definition from Dockerfile
0.0s
=> => transferring dockerfile: 206B

=> exporting to image
0.8s
=> => exporting layers
0.8s
=> => writing image sha256:b53b564ee0f1986f6a9108b2df0d810f28bfb209
4743d8564f2667066acf3d1f
0.0s
=> => naming to docker.io/myhubuser/gpudemo

$ docker images | grep gpudemo
myhubuser/gpudemo latest b53b564ee0f1 2 minutes ago
5.83GB

Notice the image has been named according to what we set in the Compose file.

Before pushing this image to Docker Hub, we need to make sure we are logged in. For this we run:

$ docker login

Login Succeeded

Push the image we built:

$ docker compose push
Pushing training (myhubuser/gpudemo:latest)…
The push refers to repository [docker.io/myhubuser/gpudemo]
c765bf51c513: Pushed
9ccf81c8f6e0: Layer already exists

latest: digest: sha256:c40a3ca7388d5f322a23408e06bddf14b7242f9baf7fb
e7201944780a028df76 size: 4306

The image pushed is public unless we set it to private in Docker Hub’s repository settings. The Docker documentation covers this in more detail.

With the image stored in a public image registry, we will look now at how we can use it to deploy our application on Amazon ECS and how we can use GPUs to accelerate it.

2. Deploy to Amazon ECS for GPU-acceleration

To deploy the application to Amazon ECS, we need to have credentials for accessing an AWS account and to have the Docker CLI set to target the platform.

Let’s assume we have a valid set of AWS credentials that we can use to connect to AWS services. We now need to create an ECS Docker context to redirect all Docker CLI commands to Amazon ECS.

Create an ECS context

To create an ECS context run the following command:

$ docker context create ecs cloud
? Create a Docker context using: [Use arrows to move, type
to filter]
> AWS environment variables
An existing AWS profile
A new AWS profile

This prompts the user with three options, depending on their familiarity with the AWS credentials setup.

For this exercise, to skip the details of AWS credential setup, we choose the first option. This requires us to have the AWS_ACCESS_KEY and AWS_SECRET_KEY set in our environment when running Docker commands that target Amazon ECS.

We can now run Docker commands and set the context flag for all commands targeting the platform, or we can switch it to be the context in use to avoid setting the flag on each command.

Set Docker CLI to target ECS

Set the context we created previously as the context in use by running:

$ docker context use cloud

$ docker context ls
NAME TYPE DESCRIPTION DOCKER ENDPOINT KUBERNETES ENDPOINT ORCHESTRATOR
default moby Current DOCKER_HOST based configuration unix:///var/run/docker.sock swarm
cloud * ecs credentials read from environment

Starting from here, all the subsequent Docker commands are going to target Amazon ECS. To switch back to the default context targeting the local environment, we can run the following:

$ docker context use default

For the following commands, we keep the ECS context as the current context in use. We can now run a command to check that we can successfully access ECS.

$ AWS_ACCESS_KEY="*****" AWS_SECRET_KEY="******" docker compose ls
NAME STATUS

Before deploying the application to Amazon ECS, let’s have a look at how to update the Compose file to request GPU access for the training service. This blog post describes a way to define GPU reservations. In the next section, we cover the new format supported in the local compose and the legacy docker-compose.

Define GPU reservation in the Compose file

TensorFlow can make use of NVIDIA GPUs with CUDA compute capabilities to speed up computations. To reserve NVIDIA GPUs, we edit the docker-compose.yaml that we defined previously and add the deploy property under the training service as follows:


training:
  image: myhubuser/gpudemo
  command: python model.py eng-fra
  volumes:
    - models:/checkpoints
  deploy:
    resources:
      reservations:
        memory: 32Gb
        devices:
          - driver: nvidia
            count: 2
            capabilities: [gpu]

For this example, we defined a reservation of 2 NVIDIA GPUs and 32 GB of memory dedicated to the container. We can tweak these parameters according to the resources of the machine we target for deployment. If our local dev machine hosts an NVIDIA GPU, we can tweak the reservation accordingly and deploy the Compose file locally. Ensure you have installed the NVIDIA container runtime and set up the Docker Engine to use it before deploying the Compose file.
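For reference, registering the NVIDIA runtime with a local Docker Engine typically looks like the sketch below. The package name assumes a Debian/Ubuntu host with the NVIDIA repositories already configured, so follow the NVIDIA container runtime documentation for your distribution:

# Install the NVIDIA container runtime (assumes the NVIDIA apt repository is set up).
sudo apt-get install -y nvidia-container-runtime

# Declare the runtime in /etc/docker/daemon.json and restart the engine.
sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF
sudo systemctl restart docker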

We focus in the next part on how to make use of GPU cloud instances to run our sample application.

Note: We assume the image we pushed to Docker Hub is public. If so, there is no need to authenticate in order to pull it (unless we exceed the pull rate limit). For images that need to be kept private, we need to define the x-aws-pull_credentials property with a reference to the credentials to use for authentication. Details on how to set it can be found in the documentation.
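As an illustration, the property is set per service next to the image; the ARN below is a placeholder, and the exact value format is described in the Docker ECS integration documentation:

services:
  translator:
    image: myhubuser/gpudemo
    x-aws-pull_credentials: "arn:aws:secretsmanager:eu-west-3:123456789012:secret:dockerhub-credentials"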

Deploy to Amazon ECS

Export the AWS credentials to avoid setting them for every command.

$ export AWS_ACCESS_KEY="*****"
$ export AWS_SECRET_KEY="******"

When deploying the Compose file, Docker Compose will also reserve an EC2 instance with GPU capabilities that satisfies the reservation parameters. In the example we provided, we ask to reserve an instance with 32 GB of memory and 2 NVIDIA GPUs. Docker Compose matches this reservation with the instance type that satisfies this requirement. Before setting the reservation property in the Compose file, we recommend checking the Amazon GPU instance types and setting your reservation accordingly. Ensure you are targeting an Amazon region that contains such instances.

WARNING: Aside from ECS containers, we will have a `g4dn.12xlarge` EC2 instance reserved. Before deploying to the cloud, check the Amazon documentation for the resource cost this will incur.

To deploy the application, we run the same command as in the local environment.

$ docker compose up
[+] Running 29/29
⠿ gpu CreateComplete 423.0s
⠿ LoadBalancer CreateComplete 152.0s
⠿ ModelsAccessPoint CreateComplete 6.0s
⠿ DefaultNetwork CreateComplete 5.0s

⠿ TranslatorService CreateComplete 205.0s
⠿ TrainingService CreateComplete 161.0s

Check the status of the services:

$ docker compose ps
NAME SERVICE STATE PORTS
task/gpu/3311e295b9954859b4c4576511776593 training Running
task/gpu/78e1d482a70e47549237ada1c20cc04d translator Running gpu-LoadBal-6UL1B4L7OZB1-d2f05c385ceb31e2.elb.eu-west-3.amazonaws.com:5000->5000/tcp

Query the exposed translator endpoint. We notice the same behaviour as in the local deployment (the model reload has not been triggered yet by the training service).

$ curl -d "text=hello" gpu-LoadBal-6UL1B4L7OZB1-d2f05c385ceb31e2.elb.eu-west-3.amazonaws.com:5000/
No trained model found / training may be in progress…

Check the logs for the GPU devices TensorFlow detected. We can easily identify the 2 GPU devices we reserved and see that the training is almost 10X faster than our CPU-based local training.

$ docker compose logs

training | 2021-01-08 20:50:51.595796: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
training | pciBusID: 0000:00:1c.0 name: Tesla T4 computeCapability: 7.5
training | coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s

training | 2021-01-08 20:50:51.596743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties:
training | pciBusID: 0000:00:1d.0 name: Tesla T4 computeCapability: 7.5
training | coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s

training | Epoch 1 Batch 300 Loss 1.2269
training | Epoch 1 Loss 1.4794
training | Time taken for 1 epoch 42.98397183418274 sec

training | Epoch 2 Loss 0.9750
training | Time taken for 1 epoch 35.13995909690857 sec

training | Epoch 9 Batch 0 Loss 0.1375

training | Epoch 9 Loss 0.1558
training | Time taken for 1 epoch 32.444278955459595 sec

training | Epoch 10 Batch 300 Loss 0.1663
training | Epoch 10 Loss 0.1383
training | Time taken for 1 epoch 35.29659080505371 sec
training | Checkpoints saved in /checkpoints/eng-fra
training | Requested translator service to reload its model, response status: 200.

The training service runs continuously and triggers the model reload on the translation service every 10 cycles (epochs). Once the translation service has been notified at least once, we can stop and remove the training service and release the GPU instances at any time we choose. 

We can easily do this by removing the service from the Compose file:

services:
  translator:
    image: myhubuser/gpudemo
    build: backend
    volumes:
      - models:/checkpoints
    ports:
      - 5000:5000
volumes:
  models:

and then run docker compose up again to update the running application. This will apply the changes and remove the training service.

$ docker compose up
[+] Running 0/0
⠋ gpu UpdateInProgress User Initiated
⠋ LoadBalancer CreateComplete
⠋ ModelsAccessPoint CreateComplete

⠋ Cluster CreateComplete
⠋ TranslatorService CreateComplete

We can list the running services to see that the training service has been removed and only the translator remains:

$ docker compose ps
NAME SERVICE STATE PORTS
task/gpu/78e1d482a70e47549237ada1c20cc04d translator Running gpu-LoadBal-6UL1B4L7OZB1-d2f05c385ceb31e2.elb.eu-west-3.amazonaws.com:5000->5000/tcp

Query the translator:

$ curl -d "text=hello" gpu-LoadBal-6UL1B4L7OZB1-d2f05c385ceb31e2.elb.eu-west-3.amazonaws.com:5000/
salut !

To remove the application from Amazon ECS run:

$ docker compose down

Summary

We discussed how to set up a resource-intensive ML application so that it is easily deployable in different environments with Docker Compose. We showed how to define the use of GPUs in a Compose file and how to deploy the application to Amazon ECS.

Resources:

https://docs.docker.com/cloud/ecs-integration
Docker Compose
https://github.com/docker/compose-cli
Sample application and Compose Files
Source: https://blog.docker.com/feed/

AWS CloudFormation StackSets is now available in the Japan (Osaka) Region

AWS CloudFormation has extended the availability of StackSets to Japan (Osaka). StackSets is a CloudFormation feature that lets you centrally manage the deployment of cloud resources across multiple AWS accounts and Regions in a single operation. StackSets is also integrated with AWS Organizations, so you can take advantage of automatic deployment when an account joins an organization.
Source: aws.amazon.com