How to efficiently process both real-time and aggregate data with Dataflow

Pub/Sub, Dataflow, and BigQuery form a common stack that enables analytic workloads over event data streams. When choosing the right implementation, however, many businesses need to weigh real-time constraints against historical analysis over the whole dataset, which usually means trade-offs. It doesn’t have to be this way.

Imagine that we face a scenario where data can be conveniently divided into two categories: (1) actionable events that need to be delivered with stringent latency requirements, and (2) not-so-urgent data that can tolerate some delay. Should we opt for streaming inserts or go with load jobs? Is there a better solution? Spoiler alert: with a clever and simple pipeline design that combines these two worlds, we can meet all our requirements and achieve significant cost savings.

Where can this be applied?

Before we continue, let’s examine some of the business use cases that can benefit from our approach.

Fraud detection—Potentially fraudulent activity can be flagged immediately, while all other transactions are logged to later derive insights or train ML models.
Monitoring systems—Anomalies can be detected and alerted on instantly, while data under normal conditions can tolerate some delay. Applications range from earthquake detection to SRE dashboards.
Customer service ticketing systems—Critical issues filed by customers can be prioritized, while non-critical issues (like feature requests) can be delayed without impacting the customer experience.
Online gaming health checks—By using a representative fraction of the incoming data for quick analysis, we can check that everything is in order while preserving the rest of the data for future, deeper analysis or ML projects.

In three of the scenarios above, incoming data is classified as either urgent (when there is a need for low-latency data) or non-urgent. But this approach can also be applied in other ways. For example, let’s say you need early speculative results (as in the online gaming health check use case described above). By sampling all incoming events, we can get an early analysis while preserving the complete dataset for deeper future analysis. In other words, this approach can easily be adapted to stream a representative sample of the data while the rest is completed afterwards with load jobs.

Architecture concepts

Within our Pub/Sub, Dataflow, and BigQuery stack, Dataflow provides simple ways to connect to Pub/Sub and BigQuery via the built-in IO connectors of the Apache Beam SDK for Java.

In our pipeline, we will read the real-time events published to a Pub/Sub topic with the PubsubIO connector. Once the data has been processed, we will insert it into the BigQuery destination tables. The BigQueryIO connector provides two ways to insert our data: load jobs or streaming inserts. With load jobs, elements are buffered in Cloud Storage and each batch is written to BigQuery in a single atomic update. With streaming inserts, each record is appended to the BigQuery table immediately and is available to be queried within seconds.
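To make the two insert paths concrete, here is a minimal sketch of how each write method can be configured with BigQueryIO in the Beam Java SDK. The table names are hypothetical, and schema and disposition settings are omitted for brevity; the full pipeline later in this post builds on these transforms.

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method;
import org.joda.time.Duration;

public class WriteMethods {

  // Streaming inserts (the default method): each row is appended to the table
  // as it arrives and is queryable within seconds.
  static BigQueryIO.Write<TableRow> streamingInserts() {
    return BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.streamed_table")  // hypothetical destination
        .withMethod(Method.STREAMING_INSERTS);
  }

  // Load jobs: rows are buffered as files and loaded in periodic batches, free of charge.
  // With an unbounded source, a triggering frequency and a number of file shards are required.
  static BigQueryIO.Write<TableRow> loadJobs() {
    return BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.loaded_table")    // hypothetical destination
        .withMethod(Method.FILE_LOADS)
        .withTriggeringFrequency(Duration.standardMinutes(10))
        .withNumFileShards(1);
  }
}
```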
Choosing the right implementation

We could favor a play-it-safe design in which we stream all data directly into BigQuery. Streaming insert quotas are generous and easy to stay within, but we would be paying for each inserted row, regardless of its urgency. In some of the examples above, the fraction of high-priority events can be very low. Also, operations such as DML updates are disallowed (at the partition level) while a streaming buffer is attached to the table.

Alternatively, we could rely on load jobs, which are free. But to satisfy the real-time view of the data, we would need to write data very frequently, which can exhaust the daily load-jobs-per-table quota and hinder query performance by fragmenting the table into an excessive number of files.

An interesting solution is to combine both: use streaming inserts to send urgent events right away and load jobs for the remaining events. Here we develop and briefly explain this design choice.

We read JSON-formatted messages from Cloud Pub/Sub with an attribute that indicates the event’s urgency. Events with an urgency factor at or above the threshold are stream-ingested into a BigQuery table using a side output. Depending on its urgency category, each event is emitted to a different table. If we need to query data from both tables, a simple UNION statement will suffice.

We add a timestamp field to all elements when the row is created, so we can retrieve the actual processing time even if two events belong to the same batch and were inserted simultaneously.

We redirect each output to the corresponding table according to its tag. The changes are straightforward: if we don’t specify the write method, it defaults to streaming inserts, and for load jobs we add Method.FILE_LOADS, where the triggering frequency can be adjusted at will to better suit our use case.

In the alternative case where there is no explicit priority field, we can modify the example to sample the data and send some immediate, representative results while the rest is completed afterwards. By using a random function instead of an urgency value, we can get a desired percentage of our data for real-time analysis. There may be cases where another sampling strategy is preferred, and for that you would need to implement your own logic.
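Below is a sketch of what such a pipeline could look like with the Beam Java SDK. It assumes the urgency factor arrives as a Pub/Sub message attribute named urgency, and the threshold value, project, subscription, and table names are placeholders. A multi-output ParDo tags each element as urgent or non-urgent, the urgent branch is written with streaming inserts, and everything else goes through periodic load jobs into a day-partitioned table.

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.api.services.bigquery.model.TimePartitioning;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class MixedInsertPipeline {

  // Hypothetical threshold: events at or above it are treated as urgent.
  private static final double URGENCY_THRESHOLD = 0.8;

  // Output tags for the multi-output ParDo: urgent is the main output, non-urgent the side output.
  private static final TupleTag<TableRow> URGENT = new TupleTag<TableRow>() {};
  private static final TupleTag<TableRow> NON_URGENT = new TupleTag<TableRow>() {};

  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    // Schema shared by both destination tables.
    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("payload").setType("STRING"),
        new TableFieldSchema().setName("urgency").setType("FLOAT"),
        new TableFieldSchema().setName("processing_time").setType("TIMESTAMP")));

    // Read JSON messages; the urgency factor travels as a Pub/Sub attribute.
    PCollectionTuple tagged = p
        .apply("ReadFromPubSub", PubsubIO.readMessagesWithAttributes()
            .fromSubscription("projects/my-project/subscriptions/events-sub"))  // hypothetical
        .apply("ClassifyByUrgency", ParDo.of(new DoFn<PubsubMessage, TableRow>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            PubsubMessage msg = c.element();
            double urgency = Double.parseDouble(msg.getAttribute("urgency"));  // assumed attribute
            TableRow row = new TableRow()
                .set("payload", new String(msg.getPayload(), StandardCharsets.UTF_8))
                .set("urgency", urgency)
                // Timestamp added at row creation, so we keep the actual processing time
                // even when several rows land in the same load job.
                .set("processing_time", Instant.now().toString());
            if (urgency >= URGENCY_THRESHOLD) {
              c.output(row);               // main output: streamed right away
            } else {
              c.output(NON_URGENT, row);   // side output: batched via load jobs
            }
          }
        }).withOutputTags(URGENT, TupleTagList.of(NON_URGENT)));

    // Urgent events: streaming inserts, queryable within seconds.
    tagged.get(URGENT).apply("StreamUrgentEvents", BigQueryIO.writeTableRows()
        .to("my-project:events.urgent")        // hypothetical table
        .withSchema(schema)
        .withMethod(Method.STREAMING_INSERTS)
        .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(WriteDisposition.WRITE_APPEND));

    // Everything else: periodic load jobs into a day-partitioned table.
    tagged.get(NON_URGENT).apply("LoadRemainingEvents", BigQueryIO.writeTableRows()
        .to("my-project:events.non_urgent")    // hypothetical table
        .withSchema(schema)
        .withTimePartitioning(new TimePartitioning().setType("DAY"))
        .withMethod(Method.FILE_LOADS)
        .withTriggeringFrequency(Duration.standardMinutes(10))  // adjust to the use case
        .withNumFileShards(1)
        .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(WriteDisposition.WRITE_APPEND));

    p.run();
  }
}
```

For the sampling variant described above, the same DoFn can branch on a random draw (for example, ThreadLocalRandom.current().nextDouble() below a chosen sample rate) instead of the urgency attribute, and the combined view of both tables is then a simple UNION ALL over them at query time.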
What are the benefits of this solution?

Here are some of the advantages we’ve experienced taking this approach.

Direct control over data availability—We can decide upfront which events will be streamed into our destination table.
Easier-to-accommodate quotas—Since we are splitting data into streamed rows and batched loads, we relax both rates.
Cost savings—Load jobs are free of charge, so we only pay for the important data that we choose to stream.
No duplicate work—We process elements once and send them to the corresponding side output. BigQueryIO makes the changes for each insert method trivial.

Sounds great, doesn’t it?

Optimizing the pipeline further

Considering additional best practices in an upfront design phase can be the icing on the cake in terms of optimizing performance and cost:

The number of records written to the table can be huge and can drive up the amount of data scanned by queries over time. With partitioned tables, our queries can target only the days we want to analyze, reducing the analytics cost.
Another possible approach would be to have one table that hosts only high-urgency events and another that hosts all events regardless of urgency. In that case, even if it’s just a small fraction, we would be paying for extra storage. Again, we can resort to partitions and set a low TTL (partition expiration time) so that we don’t have to manually clean up the data.
Depending on the nature of our data, we can also add clustering into the equation. In this case, we can force better collocation of data with a daily query that overwrites a “closed” partition (one for which we don’t expect new data to arrive), as explained in this documentation.

Wrapping up

In this post, we explored a mixed streaming and batch approach in Dataflow to get the best performance out of our data pipeline, taking into consideration project needs and the latest BigQuery features. We considered many factors, such as data availability requirements, code simplicity, scalability, and cost, and determined the optimal architecture for our use case. To learn more about Dataflow and data analytics on Google Cloud, visit our website.

Acknowledgements: Guillem Xercavins and Alvaro Gomez, Big Data SMEs, and Berta Izquierdo, Big Data Team Lead, contributed to this post.
Quelle: Google Cloud Platform

Google Cloud named a leader in the Forrester Wave: Data Security Portfolio Vendors, Q2 2019 report

Today, we’re honored to share that Google Cloud was named a Leader in The Forrester Wave™: Data Security Portfolio Vendors, Q2 2019 report. The report evaluates a vendor’s portfolio of offerings specific to data security and includes both cloud and on-premises offerings. Of the 13 vendors evaluated, Google Cloud scored highest in the Strategy category.

Making data security easier and scalable for enterprises

The report notes that Google Cloud customers appreciate our ease of deployment and the scalability of our capabilities. This includes services like Cloud Data Loss Prevention (DLP) that help you discover, classify, and redact sensitive data across your organization. We also continue to work to provide easy-to-adopt ways for our customers to increase visibility into data use, sharing, and protection in their cloud environments. This includes Cloud Security Command Center for Google Cloud Platform (GCP) and Security Center for G Suite, products that help to surface actionable security insights.

Security at the heart of Google Cloud

The report recognizes that we put security at the center of our strategy at Google Cloud. We’ve written at length in the past about our belief that if you put security first, all else will follow. And we are explicit in our commitment to our Cloud customers: you own your data, and we put you in control.

The report also recognizes Google’s strengths around access control granularity when it comes to supporting a “Zero Trust” approach via our BeyondCorp model and Context-Aware Access solutions.

To learn more about how Forrester evaluates Google Cloud’s data security portfolio, you can download a complimentary copy of the report here.

Google Cloud is rated a Leader by industry analyst firms in many areas. Learn more at our analyst reports page.
Quelle: Google Cloud Platform

Kubernetes Operators Best Practices

Introduction

Kubernetes Operators are processes connecting to the master API and watching for events, typically on a limited number of resource types. When a relevant event occurs, the operator reacts and performs a specific action. This may be limited to interacting with the master API only, but will often involve performing some action on some […]
The post Kubernetes Operators Best Practices appeared first on Red Hat OpenShift Blog.
Quelle: OpenShift

Sprint: Telekom fears for billion-dollar takeover in the US

A group of 10 state attorneys general plans to sue T-Mobile US and Sprint in order to block their merger. New York Attorney General Letitia James is planning an announcement on the matter. The US market is of paramount importance to Deutsche Telekom. (T-Mobile, Telekom)
Quelle: Golem

On a quest: Learn GKE security and monitoring best practices

Whether you’re running Kubernetes yourself, using our Google Kubernetes Engine (GKE) managed service, or using Anthos, you need visibility into your environment, and you need to know how to secure it. To help you on your way, there are two new educational resources to teach you application observability and security best practices for using Kubernetes at scale.

Fashioned as a series of self-paced labs, this learning content guides you through the most common activities associated with monitoring and securing Kubernetes through complementary hands-on exercises that we call quests.

Quest for migration and observability best practices

For migration and observability best practices, enroll in the Cloud Kubernetes Best Practice quest, which includes the following labs:

GKE Migrating to Containers demonstrates containers’ central premise of isolation, resource restriction, and portability.
Monitoring with Stackdriver on Kubernetes Engine explores how to obtain useful deployment information from code by using Stackdriver’s extensive real-time tooling.
Tracing with Stackdriver on Kubernetes Engine explores how to follow application trace events to find potential algorithm improvements.
Logging with Stackdriver on Kubernetes Engine presents common techniques for resource identification and export sinks, including an overview of the powerful resource filter.
Connect to Cloud SQL from an Application in Kubernetes Engine helps to bridge the divide between containers and non-containers, leveraging design patterns such as the sidecar or ambassador to connect to external resources via the Kubernetes API.

On a quest for secure Kubernetes applications

Similarly, the Google Kubernetes Engine Security Best Practice quest provides actionable guidance on how to approach Kubernetes security, and includes the following labs:

How to Use a Network Policy on GKE discusses the “principle of least privilege” as applied to Kubernetes network policy, illustrating how to achieve granular control over intra-cluster communication.
Using Role-based Access Control in Kubernetes Engine shows you how to use RBAC to restrict things such as cluster state changes.
Google Kubernetes Engine Security: Binary Authorization highlights a new GKE feature that helps to determine and enforce the provenance of container images.
Securing Applications on Kubernetes Engine – Three Examples demonstrates how to use AppArmor to secure an Nginx web server; how to apply policies to unspecified resources using a Kubernetes DaemonSet; and how to update pod metadata associated with a deployment using the Kubernetes API’s ServiceAccount, Role, and RoleBinding features.
Kubernetes Engine Communication Through VPC Peering walks through the process of exposing services between distinct clusters using VPC Peering.
Hardening Default GKE Cluster Configurations explores mitigating security issues that can arise from running a cluster based on default settings.

When working with infrastructure and application environments, sophisticated observability tools like Stackdriver provide a unified method of monitoring, tracing, and logging. Likewise, securing an environment is an ongoing challenge, but Google Cloud Platform offers a number of tools that help reduce the complexity and ensure that deployments follow generally accepted best practices.

Ready to begin? Get started with the Cloud Kubernetes Best Practice and GKE Security Best Practice quests.
On completion of the quest, you’ll be presented with a Qwiklabs digital badge that you can share on social media.
Quelle: Google Cloud Platform

Azure Shared Image Gallery now generally available

At Microsoft Build 2019, we announced the general availability of Azure Shared Image Gallery, making it easier to manage, share, and globally distribute custom virtual machine (VM) images in Azure.

Shared Image Gallery provides a simple way to share your applications with others in your organization, within or across Azure Active Directory (AD) tenants and regions. This enables you to expedite regional expansion or DevOps processes and simplify your cross-region HA/DR setup.

Shared Image Gallery also supports larger deployments. You can now deploy up to 1,000 virtual machine instances in a scale set, up from 600 with managed images.

Here is what one of our customers had to say about the feature:

“Shared Image Gallery enables us to build all our VM images from a single Azure DevOps pipeline and to deploy IaaS VMs from these images in any subscription in any tenant in any region, without the added complexity of managing and distributing copies of managed images or VHDs across multiple subscriptions or regions.”

– Stanley Merkx, an Infrastructure Engineer at VIVAT, a Netherlands-based insurance company

Regional availability

Shared Image Gallery now supports all Azure public cloud regions as target regions, and all generally available Azure public cloud regions, except the South Africa regions, as source regions. Check the list of source and target regions.
In the coming months, this feature will also be available in sovereign clouds.

Quota

The default quotas for Shared Image Gallery resources are:

100 shared image galleries per subscription per region
1,000 image definitions per subscription per region
10,000 image versions per subscription per region

Users can request a higher quota based on their requirements. Learn how you can track usage in your subscription.

Pricing

There is no extra charge for using the Shared Image Gallery service. You will only pay for the following:

Storage charges for image versions and replicas in each region, source and target
Network egress charges for replication across regions

Getting started

CLI
PowerShell
Azure portal
API
Quickstart templates
.NET
Java

Let’s take a quick look at what you can do with Shared Image Gallery.

Manage your images better

We introduced three new Azure Resource Manager resources as part of the feature—gallery, image definition, and image version—which help you organize images in logical groups. You can also publish multiple versions of your images as and when you update or patch the applications.

Share images across subscriptions and Azure Active Directory tenants

One of the key capabilities that Shared Image Gallery provides is a way to share your images across subscriptions. Since all three newly introduced constructs are Azure Resource Manager resources, you can use Azure role-based access control (RBAC) to share your galleries or image definitions with other users who can then deploy VMs in their subscriptions, even across Azure Active Directory tenants.

A few common scenarios where sharing images across tenants becomes useful are:

A company acquires another and suddenly the Azure infrastructure is spread across Azure AD tenants.
A company with multiple subsidiaries that use Azure is likely to have multiple Azure AD tenants.

Learn more about how to share your images across tenants.

Distribute your images globally

We understand that business happens at a global scale and you don’t want your organization to be limited by the platform. Shared Image Gallery provides a way to globally distribute your images based on your organizational needs. You only need to specify the target regions and Shared Image Gallery will replicate your image versions to the regions specified.

Scale your deployments

With Shared Image Gallery, you can now deploy up to 1,000 VM instances in a VM scale set, an increase from 600 with managed images. We also introduced the concept of image replicas for better deployment performance, reliability, and consistency. You can set a different replica count in each target region based on your regional scale needs. Since each replica is a deep copy of your image, your deployments scale linearly with each extra replica, compared to using a single managed image.

Learn more about how to use replicas.

Make your images highly available

With the general availability of Shared Image Gallery, you can choose to store your images in zone-redundant storage (ZRS) accounts in regions with Availability Zones. You can also choose to specify storage account type for each of the target regions. Check the regional availability of zone-redundant storage.
Quelle: Azure