Unify data lakes and warehouses with BigLake, now generally available

Data continues to grow in volume and is increasingly distributed across lakes, warehouses, clouds, and file formats. As more users bring more use cases, the traditional approach of building data-movement infrastructure is proving difficult to scale. Unlocking the full potential of data requires breaking down these silos, and doing so is increasingly a top priority for enterprises. Earlier this year, we previewed BigLake, a storage engine that extends innovations in BigQuery storage to open file formats running on public cloud object stores. It allows customers to build secure, multi-cloud data lakes over open file formats. BigLake provides consistent, fine-grained security controls for Google Cloud and open-source query engines that interact with data. Today, we are excited to announce the general availability of BigLake, along with a set of new capabilities to help you build a differentiated data platform.

"We are using GCP to build and extend one of the street's largest risk systems. During several tests we have seen the great potential and scale of BigLake. It is one of the products that could support our cloud journey and drive application's future efficiency" – Scott Condit, Director, Risk CTO, Deutsche Bank

Build a distributed data lake that spans warehouses, object stores, and clouds with BigLake

Customers can create BigLake tables on Google Cloud Storage (GCS), Amazon S3, and ADLS Gen 2 over supported open file formats such as Parquet, ORC, and Avro. BigLake tables are a new type of external table that can be managed much like data warehouse tables. Administrators do not need to grant end users access to files in object stores; instead, they manage access at the table, row, or column level. These tables can be created from the query engine of your choice, such as BigQuery or open-source engines using the BigLake connector. Once created, BigLake and BigQuery tables can be centrally discovered in Data Catalog and managed at scale using Dataplex.

BigLake extends the BigQuery Storage API to object stores to help you build a multi-compute architecture. BigLake connectors are built on the BigQuery Storage API and enable Google Cloud Dataflow and open-source query engines (such as Spark, Trino, Presto, and Hive) to query BigLake tables while enforcing security. This eliminates the need to move data for each query-engine-specific use case: security is configured in one place and enforced everywhere.

"We are using GCP to design data lake solutions for our customers and transform their digital strategy to create a data-driven enterprise. BigLake has been critical for our customers to quickly realize the value of analytical solutions by reducing the need to build ETL pipelines and cutting down time-to-market. The performance and governance features of BigLake enabled a variety of data lake use cases for our customers." – Sureet Bhurat, Founding Board Member, Synapse LLC

BigLake unlocks new use cases using Google Cloud and OSS query engines

During the preview, we saw a large number of customers use BigLake in various ways. Some of the top use cases include:

Building secure and governed data lakes for open-source workloads – Workloads migrating from Hadoop, Spark-first customers, or those using Presto/Trino can now use BigLake to build secure, governed, and performant data lakes on GCS. BigLake tables on GCS provide fine-grained security, table-level management (rather than granting access to files), better query performance, and integrated governance with Dataplex. These characteristics are accessible across multiple OSS query engines when using the BigLake connectors.

"To support our data-driven organization, Wizard needs a data lake solution that leverages open file formats and can grow to meet our needs. BigLake allows us to build and query on open file formats, scales to meet our needs, and accelerates our insight discovery. We look forward to expanding our use cases with future BigLake features" – Rich Archer, Senior Data Engineer, Wizard

Eliminating or reducing data duplication across data warehouses and lakes – Customers who use GCS and BigQuery managed storage previously had to create two copies of data to serve both BigQuery users and OSS engines. BigLake makes GCS tables behave more consistently with BigQuery tables, reducing the need to duplicate data. Customers can now keep a single copy of data split across BigQuery storage and GCS, and that data can be accessed by BigQuery or OSS engines in either place in a consistent, secure manner.

Fine-grained security for multi-cloud use cases – BigQuery Omni customers can now use BigLake tables on Amazon S3 and ADLS Gen 2 to configure fine-grained access control, and take advantage of localized data processing and cross-cloud transfer capabilities for multi-cloud analytics. Tables created on other clouds are centrally discoverable in Data Catalog for ease of management and governance.

Interoperability between analytics and data science workloads – Data science workloads, using either Spark or Vertex AI notebooks, can now directly access data in BigQuery or GCS through the API connector, enforcing security and eliminating the need to import data for training models. For BigQuery customers, these models can be imported back into BigQuery ML to produce inferences.

Build a differentiated data platform with new BigLake capabilities

We are also excited to announce new capabilities as part of this general availability launch. These include:

Analytics Hub support: Customers can now share BigLake tables on GCS with partners, vendors, or suppliers as linked datasets. Consumers can access this data in place through the query engine of their choice (BigQuery, Spark, Presto, Trino, TensorFlow).

Default table type for BigQuery Omni: BigLake tables are now the default table type for BigQuery Omni, replacing the previous default of external tables.

BigQuery ML support: BigQuery customers can now train models on GCS BigLake tables using BigQuery ML without needing to import data, and access follows the policies defined on the table.

Performance acceleration (preview): Queries over GCS BigLake tables can now be accelerated using the underlying BigQuery infrastructure. If you would like to use this feature, please get in touch with your account team or fill out this form.

Cloud Data Loss Prevention (DLP) profiling support (coming soon): Cloud DLP will soon be able to scan BigLake tables to identify and protect sensitive data at scale. If you would like to use this feature, please get in touch with your account team or fill out this form.

Data masking and audit logging (coming soon): BigLake tables will support dynamic data masking, enabling you to mask sensitive data elements to meet compliance needs. End-user query requests to GCS for BigLake tables are now audit logged and available to query via logs.
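To make the table-creation workflow above concrete, here is a minimal sketch (ours, not from the announcement) that issues the BigLake CREATE EXTERNAL TABLE ... WITH CONNECTION DDL through the BigQuery Go client. The project, dataset, connection, and bucket names are placeholders you would replace with your own, and the Cloud resource connection must already exist:

package main

import (
    "context"
    "fmt"
    "log"

    "cloud.google.com/go/bigquery"
)

func main() {
    ctx := context.Background()

    // Placeholder project ID; substitute your own.
    client, err := bigquery.NewClient(ctx, "my-project")
    if err != nil {
        log.Fatalf("bigquery.NewClient: %v", err)
    }
    defer client.Close()

    // Create a BigLake table over Parquet files in GCS via a Cloud resource connection.
    ddl := "CREATE EXTERNAL TABLE `my-project.my_dataset.biglake_orders`" +
        " WITH CONNECTION `my-project.us.my-connection`" +
        " OPTIONS (format = 'PARQUET', uris = ['gs://my-bucket/orders/*.parquet'])"

    job, err := client.Query(ddl).Run(ctx)
    if err != nil {
        log.Fatalf("query.Run: %v", err)
    }
    status, err := job.Wait(ctx)
    if err != nil {
        log.Fatalf("job.Wait: %v", err)
    }
    if err := status.Err(); err != nil {
        log.Fatalf("job failed: %v", err)
    }
    fmt.Println("BigLake table created")
}

The same statement can equally be executed from the BigQuery console, the bq command-line tool, or other client libraries.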
Next steps

Refer to the BigLake documentation to learn more, or get started with this quick start tutorial. If you are already using external tables today, consider upgrading them to BigLake tables to take advantage of the new features mentioned above. For more information, reach out to your Google Cloud account team to see how BigLake can add value to your data platform.

Special mention to Anoop Johnson, Thibaud Hottelier, Yuri Volobuev, and the rest of the BigLake engineering team for making this launch possible.
Source: Google Cloud Platform

Azure empowers easy-to-use, high-performance, and hyperscale model training using DeepSpeed

This blog was written in collaboration with the DeepSpeed team, the Azure ML team, and the Azure HPC team at Microsoft.

Large-scale transformer-based deep learning models trained on large amounts of data have shown great results in recent years in several cognitive tasks and are behind new products and features that augment human capabilities. These models have grown several orders of magnitude in size during the last five years, from the few million parameters of the original transformer model all the way to the latest 530-billion-parameter Megatron-Turing (MT-NLG 530B) model, as shown in Figure 1. There is a growing need for customers to train and fine-tune large models at an unprecedented scale.

Figure 1: Landscape of large models and hardware capabilities.

Azure Machine Learning (AzureML) brings large fleets of the latest GPUs powered by the InfiniBand interconnect to tackle large-scale AI training. We already train some of the largest models including Megatron/Turing and GPT-3 on Azure. Previously, to train these models, users needed to set up and maintain a complex distributed training infrastructure that usually required several manual and error-prone steps. This led to a subpar experience both in terms of usability and performance.

Today, we are proud to announce a breakthrough in our software stack, using DeepSpeed and 1024 A100s to scale the training of a 2T parameter model with a streamlined user experience at 1K+ GPU scale. We are bringing these software innovations to you through AzureML (including a fully optimized PyTorch environment) that offers great performance and an easy-to-use interface for large-scale training.

Customers can now use DeepSpeed on Azure with simple-to-use training pipelines that utilize either the recommended AzureML recipes or bash scripts for VMSS-based environments. As shown in Figure 2, Microsoft is taking a full-stack optimization approach where all the necessary pieces, including the hardware, the OS, the VM image, the Docker image (containing optimized PyTorch, DeepSpeed, ONNX Runtime, and other Python packages), and the user-facing Azure ML APIs, have been optimized, integrated, and well-tested for excellent performance and scalability without unnecessary complexity.

Figure 2: Microsoft full-stack optimizations for scalable distributed training on Azure.

This optimized stack enabled us to efficiently scale training of large models using DeepSpeed on Azure. We are happy to share our performance results supporting 2x larger model sizes (2 trillion vs. 1 trillion parameters), scaling to 2x more GPUs (1024 vs. 512), and up to 1.8x higher compute throughput/GPU (150 TFLOPs vs. 81 TFLOPs) compared to those published on other cloud providers.

We offer near-linear scalability both as the model size grows and as the number of GPUs increases. As shown in Figure 3a, with DeepSpeed ZeRO-3, its novel CPU offloading capabilities, and a high-performance Azure stack powered by InfiniBand interconnects and A100 GPUs, we were able to maintain an efficient throughput/GPU (>157 TFLOPs) in a near-linear fashion as the model size increased from 175 billion parameters to 2 trillion parameters. Likewise, for a given model size, for example 175B, we achieve near-linear scaling as we increase the number of GPUs from 128 all the way to 1024, as shown in Figure 3b. The key takeaway from the results presented in this blog is that Azure and DeepSpeed together are breaking the GPU memory wall and enabling our customers to easily and efficiently train trillion-parameter models at scale.
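As a rough back-of-the-envelope check (our arithmetic, not a figure from the original results), near-linear scaling means aggregate training throughput is approximately the per-GPU throughput multiplied by the number of devices:

\[
\text{Throughput}_{\text{aggregate}} \approx \text{Throughput}_{\text{GPU}} \times N_{\text{GPU}} \approx 157\ \text{TFLOPs} \times 1024 \approx 1.6 \times 10^{5}\ \text{TFLOPs} \approx 160\ \text{PFLOPs},
\]

and any shortfall from this linear estimate reflects communication and memory overheads.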


Figure 3: (a) Near-perfect throughput/GPU as we increase the model size from 175 billion to 2 trillion parameters (BS/GPU=8), (b) Near-perfect performance scaling with the increase in number of GPU devices for the 175B model (BS/GPU=16). The sequence length is 1024 for both cases.

Learn more

To learn more about the optimizations, technologies, and detailed performance trends presented above, please refer to our extended technical blog.

Learn more about DeepSpeed, which is part of Microsoft’s AI at Scale initiative.
Learn more about Azure HPC + AI.
To get started with DeepSpeed on Azure, please follow our getting started tutorial.
The results presented in this blog were produced on Azure by following the recipes and scripts published as part of the Megatron-DeepSpeed repository. The recommended and easiest way to run the training experiments is to use the AzureML recipe.
If you are running experiments on a custom environment built using Azure VMs or VMSS, please refer to the bash scripts we provide in Megatron-DeepSpeed.

Source: Azure

How to Build and Deploy a Task Management Application Using Go

Golang is designed to let developers rapidly build scalable and secure web applications. Go ships with an easy-to-use, secure, and performant web server, alongside its own web templating library. Enterprise users also leverage the language for rapid, cross-platform deployment. With its goroutines, native compilation, and URI-based package namespacing, Go code compiles to a single, small binary with zero dependencies, making it very fast.
Developers also favor Go's performance, which stems from its concurrency model and CPU scalability. Whenever developers need to process an internal request, they use separate goroutines, which consume just one-tenth of the resources that Python threads do. Through static linking, Go combines all dependent libraries and modules into a single binary file for the target OS and architecture.
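As a tiny, self-contained illustration of that goroutine model (not part of the task system we build below), each unit of work can be handled on its own lightweight goroutine:

package main

import (
    "fmt"
    "sync"
)

// process simulates handling one request on its own goroutine.
func process(id int, wg *sync.WaitGroup) {
    defer wg.Done()
    fmt.Println("handled request", id)
}

func main() {
    var wg sync.WaitGroup
    for i := 1; i <= 5; i++ {
        wg.Add(1)
        go process(i, &wg) // each request runs concurrently
    }
    wg.Wait() // wait for all goroutines to finish
}

The sync.WaitGroup ensures main doesn't exit before the goroutines finish; our task system later solves the same coordination problem with channels.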
Why is containerizing your Go application important?
Go binaries are small and self-contained executables. However, your application code inevitably grows over time as it’s adapted for additional programs and web applications. These apps may ship with templates, assets and database configuration files. There’s a higher risk of getting out-of-sync, encountering dependency hell, and pushing faulty deployments.
Containers let you synchronize these files with your binary. They also help you create a single deployable unit for your complete application. This includes the code (or binary), the runtime, and its system tools or libraries. Finally, they let you code and test locally while ensuring consistency between development and production.
We’ll walk through our Go application setup, and discuss the Docker SDK’s role during containerization.
Table of Contents

Building the Application
Key Components
Getting Started
Define a Task
Create a Task Runner
Container Manager
Entrypoint
Building the Task System
Sequence Diagram
Conclusion

Building the Application
In this tutorial, you’ll learn how to build a basic task system (Gopher) using Go.
First, we’ll create a system in Go that uses Docker to run its tasks. Next, we’ll build a Docker image for our application. This example will demonstrate how the Docker SDK helps you build cool projects. Let’s get started.
Key Components

Go

Go Docker SDK

Microsoft Visual Studio Code

Docker Desktop

Getting Started
Before getting started, you’ll need to install Go on your system. Once you’ve finished up, follow these steps to build a basic task management system with the Docker SDK.
Here’s the directory structure that we’ll have at the end:
➜ tree gopher
gopher
├── go.mod
├── go.sum
├── internal
│ ├── container-manager
│ │ └── container_manager.go
│ ├── task-runner
│ │ └── runner.go
│ └── types
│ └── task.go
├── main.go
└── task.yaml

4 directories, 7 files

You can click here to access the complete source code developed for this example. This guide highlights the important snippets; the full code isn't reproduced throughout. The task.yaml file below describes the tasks our system will run:
version: v0.0.1
tasks:
  - name: hello-gopher
    runner: busybox
    command: ["echo", "Hello, Gopher!"]
    cleanup: false
  - name: gopher-loops
    runner: busybox
    command:
      [
        "sh",
        "-c",
        "for i in `seq 0 5`; do echo 'gopher is working'; sleep 1; done",
      ]
    cleanup: false
 
Define a Task
First and foremost, we need to define our task structure. Each task is a YAML definition, as shown in task.yaml above. The following table describes the fields of a task definition:

Field      Description
name       The name of the task
runner     The Docker image used to run the task
command    The command to execute inside the task container
cleanup    Whether to remove the task container once it completes

Now that we have a task definition, let’s create some equivalent Go structs.
Structs in Go are typed collections of fields. They're useful for grouping data together to form records. For example, the Task struct type below has Name, Runner, Command, and Cleanup fields.
// internal/types/task.go

package types

// TaskDefinition represents a task definition document.
type TaskDefinition struct {
    Version string `yaml:"version,omitempty"`
    Tasks   []Task `yaml:"tasks,omitempty"`
}

// Task provides a task definition for gopher.
type Task struct {
    Name    string   `yaml:"name,omitempty"`
    Runner  string   `yaml:"runner,omitempty"`
    Command []string `yaml:"command,omitempty"`
    Cleanup bool     `yaml:"cleanup,omitempty"`
}
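
To see how these structs get populated from task.yaml, here is a minimal, hedged sketch of a loader in the spirit of the readTaskDefinition helper used later in main.go (the repository's actual implementation and import paths may differ); it assumes the gopkg.in/yaml.v3 package:

// illustrative sketch; the import path for the types package is assumed
package main

import (
    "os"

    "gopkg.in/yaml.v3"

    "github.com/dockersamples/gopher-task-system/internal/types"
)

func readTaskDefinition(path string) (types.TaskDefinition, error) {
    var def types.TaskDefinition

    data, err := os.ReadFile(path)
    if err != nil {
        return def, err
    }

    // Unmarshal maps the YAML keys onto the struct tags defined in task.go.
    if err := yaml.Unmarshal(data, &def); err != nil {
        return def, err
    }

    return def, nil
}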
 
Create a Task Runner
The next thing we need is a component that can run our tasks for us. We’ll use interfaces for this, which are named collections of method signatures. For this example task runner, we’ll simply call it Runner and define it below:

// internal/task-runner/runner.go

type Runner interface {
    Run(ctx context.Context, doneCh chan<- bool)
}

Note that we're using a done channel (doneCh). This lets us run our tasks asynchronously, and it notifies us once each task is complete.
You can find your task runner’s complete definition here. In this example, however, we’ll stick to highlighting specific pieces of code:

// internal/task-runner/runner.go

func NewRunner(def types.TaskDefinition) (Runner, error) {
    client, err := initDockerClient()
    if err != nil {
        return nil, err
    }

    return &runner{
        def:              def,
        containerManager: cm.NewContainerManager(client),
    }, nil
}

func initDockerClient() (cm.DockerClient, error) {
    cli, err := client.NewClientWithOpts(client.FromEnv)
    if err != nil {
        return nil, err
    }

    return cli, nil
}

NewRunner returns an instance of the runner struct, which implements the Runner interface. The instance also holds a connection to the Docker Engine; the initDockerClient function initializes that connection by creating a Docker API client instance from environment variables.
By default, this creates an HTTP connection over the Unix socket unix:///var/run/docker.sock (the default Docker host). If you'd like to change the host, you can set the DOCKER_HOST environment variable; the FromEnv option reads the environment variables and configures the client accordingly.
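If you need more than FromEnv provides, the client constructor accepts additional functional options. Here is a small, hedged sketch using option helpers from the github.com/docker/docker/client package (our example, not code from the gopher repository):

package main

import (
    "fmt"
    "log"

    "github.com/docker/docker/client"
)

func main() {
    // FromEnv honors DOCKER_HOST, DOCKER_API_VERSION, DOCKER_CERT_PATH, etc.;
    // WithAPIVersionNegotiation picks an API version the daemon supports.
    cli, err := client.NewClientWithOpts(
        client.FromEnv,
        client.WithAPIVersionNegotiation(),
    )
    if err != nil {
        log.Fatal(err)
    }
    defer cli.Close()

    fmt.Println("client API version:", cli.ClientVersion())
}

WithAPIVersionNegotiation is a common companion to FromEnv because it lets the client fall back to an API version the local daemon supports.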
The Run function defined below is relatively basic. It loops over the list of tasks and executes each one. It also uses a channel named taskDoneCh to learn when a task completes; before returning from the function, we check that we've received a done signal from every task.

// internal/task-runner/runner.go

func (r *runner) Run(ctx context.Context, doneCh chan<- bool) {
    taskDoneCh := make(chan bool)
    for _, task := range r.def.Tasks {
        go r.run(ctx, task, taskDoneCh)
    }

    taskCompleted := 0
    for {
        if <-taskDoneCh {
            taskCompleted++
        }

        if taskCompleted == len(r.def.Tasks) {
            doneCh <- true
            return
        }
    }
}

func (r *runner) run(ctx context.Context, task types.Task, taskDoneCh chan<- bool) {
    defer func() {
        taskDoneCh <- true
    }()

    fmt.Println("preparing task - ", task.Name)
    if err := r.containerManager.PullImage(ctx, task.Runner); err != nil {
        fmt.Println(err)
        return
    }

    id, err := r.containerManager.CreateContainer(ctx, task)
    if err != nil {
        fmt.Println(err)
        return
    }

    fmt.Println("starting task - ", task.Name)
    err = r.containerManager.StartContainer(ctx, id)
    if err != nil {
        fmt.Println(err)
        return
    }

    statusSuccess, err := r.containerManager.WaitForContainer(ctx, id)
    if err != nil {
        fmt.Println(err)
        return
    }

    if statusSuccess {
        fmt.Println("completed task - ", task.Name)

        // cleanup by removing the task container
        if task.Cleanup {
            fmt.Println("cleanup task - ", task.Name)
            err = r.containerManager.RemoveContainer(ctx, id)
            if err != nil {
                fmt.Println(err)
            }
        }
    } else {
        fmt.Println("failed task - ", task.Name)
    }
}

 
The internal run function does the heavy lifting for the runner. It accepts a task and, through the ContainerManager, executes it as a Docker container.
Container Manager
The container manager is responsible for:

Pulling a Docker image for a task

Creating the task container

Starting the task container

Waiting for the container to complete

Removing the container, if required

With that in mind, we can define our container manager in Go as shown below:
// internal/container-manager/container_manager.go

type ContainerManager interface {
    PullImage(ctx context.Context, image string) error
    CreateContainer(ctx context.Context, task types.Task) (string, error)
    StartContainer(ctx context.Context, id string) error
    WaitForContainer(ctx context.Context, id string) (bool, error)
    RemoveContainer(ctx context.Context, id string) error
}

type DockerClient interface {
    client.ImageAPIClient
    client.ContainerAPIClient
}

type ImagePullStatus struct {
    Status         string `json:"status"`
    Error          string `json:"error"`
    Progress       string `json:"progress"`
    ProgressDetail struct {
        Current int `json:"current"`
        Total   int `json:"total"`
    } `json:"progressDetail"`
}

type containermanager struct {
    cli DockerClient
}
 
The containermanager struct has a field called cli of type DockerClient. That interface in turn embeds two interfaces from the Docker API client package, ImageAPIClient and ContainerAPIClient. Why do we need these interfaces?
For the ContainerManager implementation to work properly, it must act as a client for the Docker Engine API. For that client to work effectively with both images and containers, it must be a type that provides the required methods, so we embed the Docker client's core interfaces into a new one.
The initDockerClient function (seen above in runner.go) returns an instance that seamlessly implements those required interfaces. Check out the documentation here to better understand what’s returned upon creating a Docker client.
Meanwhile, you can view the container manager’s complete definition here.
Note: We haven't covered every container manager function individually here; otherwise, this blog would get too long.
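As an illustration of what one of these methods can look like, here is a hedged sketch of PullImage (the repository's actual implementation may differ). It pulls the task's runner image and decodes the progress stream into the ImagePullStatus struct shown earlier; it assumes the encoding/json, io, and github.com/docker/docker/api/types packages are imported in this file:

// internal/container-manager/container_manager.go (illustrative sketch)

func (m *containermanager) PullImage(ctx context.Context, image string) error {
    // ImagePull returns a stream of JSON progress messages; the pull only
    // completes once the stream has been fully consumed.
    reader, err := m.cli.ImagePull(ctx, image, types.ImagePullOptions{})
    if err != nil {
        return err
    }
    defer reader.Close()

    dec := json.NewDecoder(reader)
    for {
        var status ImagePullStatus
        if err := dec.Decode(&status); err != nil {
            if err == io.EOF {
                break // stream finished, image pulled
            }
            return err
        }
        if status.Error != "" {
            return fmt.Errorf("image pull failed: %s", status.Error)
        }
    }

    return nil
}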
Entrypoint
Since we’ve covered each individual component, let’s assemble everything in our main.go, which is our entrypoint. The package main tells the Go compiler that the package should compile as an executable program instead of a shared library. The main() function in the main package is the entry point of the program.

// main.go

package main

func main() {
    args := os.Args[1:]

    if len(args) < 2 || args[0] != argRun {
        fmt.Println(helpMessage)
        return
    }

    // read the task definition file
    def, err := readTaskDefinition(args[1])
    if err != nil {
        fmt.Printf(errReadTaskDef, err)
    }

    // create a task runner for the task definition
    ctx := context.Background()
    runner, err := taskrunner.NewRunner(def)
    if err != nil {
        fmt.Printf(errNewRunner, err)
    }

    doneCh := make(chan bool)
    go runner.Run(ctx, doneCh)

    <-doneCh
}

 
Here’s what our Go program does:

Validates arguments

Reads the task definition

Initializes a task runner, which in turn initializes our container manager

Creates a done channel to receive the final signal from the runner

Runs our tasks

Building the Task System
1) Clone the repository
The source code is hosted on GitHub. Use the following command to clone the repository to your local machine.
git clone https://github.com/dockersamples/gopher-task-system.git
 
2) Build your task system
The go build command compiles the packages, along with their dependencies.
go build -o gopher

3) Run your tasks
You can execute the gopher binary directly to run the tasks, as shown below:
$ ./gopher run task.yaml

preparing task - gopher-loops
preparing task - hello-gopher
starting task - gopher-loops
starting task - hello-gopher
completed task - hello-gopher
completed task - gopher-loops

 
4) View all task containers  
You can view the full list of containers within Docker Desktop. The Dashboard displays this information clearly:

5) View all task containers via CLI
Alternatively, running docker ps -a also lets you view all task containers:
$ docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
396e25d3cea8 busybox "sh -c 'for i in `se…" 6 minutes ago Exited (0) 6 minutes ago gopher-loops
aba428b48a0c busybox "echo 'Hello, Gopher…" 6 minutes ago Exited (0) 6 minutes ago hello-gopher

Note that in task.yaml the cleanup flag is set to false for both tasks. We’ve purposefully done this to retrieve a container list after task completion. Setting this to true automatically removes your task containers.
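If you do set cleanup to true, the runner's cleanup branch calls the container manager's RemoveContainer method. Here is a hedged sketch of what that method can look like (again, the repository's actual implementation may differ) using the Docker SDK's ContainerRemove call, assuming the github.com/docker/docker/api/types package is imported:

// illustrative sketch of the cleanup path used when cleanup: true
func (m *containermanager) RemoveContainer(ctx context.Context, id string) error {
    // Delete the exited task container; default options are enough here
    // because the container has already stopped by the time cleanup runs.
    return m.cli.ContainerRemove(ctx, id, types.ContainerRemoveOptions{})
}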
Sequence Diagram
 

Conclusion
Docker is a collection of software development tools for building, sharing, and running individual containers. With the Docker SDK's help, you can build and scale Docker-based apps and solutions quickly and easily, and you'll gain a better understanding of how Docker works under the hood. We look forward to sharing more examples like this and showcasing other projects you can tackle with the Docker SDK soon!
Want to start leveraging the Docker SDK yourself? Check out our documentation for install instructions, a quick-start guide, and library information.
References

Docker SDK
Go SDK Reference
Getting Started with Go

Source: https://blog.docker.com/feed/

AWS re:Post introduces community-authored Articles

re:Post now gives expert community members an expanded way to share technical guidance and knowledge beyond answering questions, through the Articles feature. With this feature, community members can share best practices and troubleshooting procedures and address customer needs around AWS technology in depth. The Articles feature is available to community members who have achieved Rising Star status on re:Post, or to subject-matter experts who have built a reputation in the community through their contributions and certifications. Every article published on re:Post contributes to the growth of public AWS knowledge, makes it easier for all customers to find help on their own, and helps accelerate the journey to the cloud.
Source: aws.amazon.com

Amazon Braket SDK adds support for near real-time cost tracking

Amazon Braket, the quantum computing service from AWS, makes it easier for customers to conduct scientific research and software development with quantum computers. Today, we are excited to announce the launch of a new cost-tracking capability in our Braket SDK that allows customers to monitor their quantum computing costs faster and more easily. Instead of having to wait for an AWS bill, customers can obtain a cost estimate with just a few lines of code immediately after a quantum task has been processed, whether on a quantum processing unit (QPU) or an on-demand simulator.
Source: aws.amazon.com

AWS Glue Streaming ETL Auto Scaling is now generally available

Auto scaling for AWS Glue streaming ETL is now generally available. Streaming ETL jobs in AWS Glue can now dynamically scale resources up and down based on the input stream. With auto scaling, customers can reduce the cost and the manual effort required for resource optimization by allocating exactly the resources their streaming ETL jobs need.
Source: aws.amazon.com

AWS Single Sign-On (AWS SSO) adds support for AWS Identity and Access Management (IAM) customer managed policies (CMPs)

AWS Single Sign-On (AWS SSO) now supports AWS Identity and Access Management (IAM) customer managed policies (CMPs) and permissions boundary policies within AWS SSO permission sets. With this new capability, AWS SSO customers can improve their security posture by crafting larger and finer-grained policies for least-privilege access and by tailoring policies to reference the resources of the account to which they are applied. With CMPs, AWS SSO customers can maintain policy consistency, since CMP changes are applied automatically to all permission sets and roles that use the CMP. This lets customers manage their CMPs and permissions boundaries centrally, while auditors can find, monitor, and review them. Customers who already have CMPs for roles they manage in AWS IAM can reuse those CMPs without having to create, review, and approve new inline policies for permission sets.
Source: aws.amazon.com