Resource governance in Azure SQL Database

This blog post continues the Azure SQL Database architecture series where we share background on how we run the service, as described by the architects who originally created the service. The first two posts covered data integrity in Azure SQL Database and how cloud speed helps SQL Server database administrators. In this blog post, we will talk about how we use governance to help achieve a balanced system.

Allocated and governed resources

When you choose a specific Azure SQL Database service tier, you are selecting a pre-defined set of allocated resources across several dimensions such as CPU, storage type, storage limit, memory, and more. Ideally, you will select a service tier that meets the workload demands of your application; however, if you over- or under-size your selection, you can easily scale up or down accordingly.
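Resizing can be initiated directly from T-SQL. The following is a minimal sketch, assuming a hypothetical database named MyAppDb and a Gen5 Business Critical target; the ALTER DATABASE syntax and the sys.dm_operation_status view are documented Azure SQL Database features:

```sql
-- Connect to the logical server's master database.
-- Scale MyAppDb (hypothetical name) to Business Critical, 8 vCores.
-- The operation runs online and completes asynchronously.
ALTER DATABASE [MyAppDb]
MODIFY (EDITION = 'BusinessCritical', SERVICE_OBJECTIVE = 'BC_Gen5_8');

-- Track the progress of the scaling operation (queried from master).
SELECT operation, state_desc, percent_complete, start_time
FROM sys.dm_operation_status
WHERE major_resource_id = 'MyAppDb'
ORDER BY start_time DESC;
```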

With each service tier selection, you are also inherently selecting a set of resource usage boundaries or limits. For example, a Business Critical Gen4 database with eight cores has the following resource allocations and associated limits:

Compute size: BC_Gen4_8
Memory: 56 GB
In-memory OLTP storage: 8 GB
Storage type: Local SSD
Max data size: 650 GB
Max log size: 195 GB
TempDB size: 256 GB
IO latency (approximate): 1-2 ms (write), 1-2 ms (read)
Target IOPS (64 KB): 40,000
Log rate limit: 48 MBps
Max concurrent workers (requests): 1,600
Max concurrent logins (requests): 1,600
Max allowed sessions: 30,000
Number of replicas: 4
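You can also observe the governance limits in effect for your own database from T-SQL. Below is a small sketch using sys.dm_user_db_resource_governance, a view Azure SQL Database exposes for exactly this purpose; the exact set of columns can vary with service version, so treat the selected columns as illustrative:

```sql
-- Inspect the configuration and capacity settings used by
-- resource governance for the current database.
SELECT slo_name,                  -- service level objective name
       primary_max_log_rate,      -- log rate cap on the primary, bytes/sec
       primary_group_max_workers, -- cap on concurrent workers
       max_db_max_size_in_mb      -- largest allowed max data size
FROM sys.dm_user_db_resource_governance;
```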

As you increase the resources in your tier, you may also see the limits change up to a certain threshold. Furthermore, these limits may be relaxed automatically over time, but they are never tightened in a way that penalizes the customer.

We document resource allocation by service tier and also the associated resource governance limits in the following resources:

vCore Model: Azure SQL Database vCore-based purchasing model limits for a single database
DTU Model: Resource limits for single databases using the DTU-based purchasing model

While resource allocation by service tier is intuitive to customers because the more you pay, the more resources you get, resource governance and its boundaries have historically been a less transparent subject for customers. While we are increasing transparency around these governing mechanisms, it is important to understand the broader purposes behind resource governance in a database as a service (DBaaS). For this, we’ll talk next about what it takes to achieve a balanced system.

Providing a balanced database as a service (DBaaS)

For the context of this blog post, we define a system as balanced if all resources can be highly utilized without encountering bottlenecks. This balance reflects the interplay of resources such as CPU, IO, memory, and network, paired with an application’s workload characteristics, maximum tolerated latency, and desired throughput.

With Azure SQL Database, our view of a balanced system must also take a broad and comprehensive perspective in order to meet articulated DBaaS requirements and customer expectations.

Azure SQL Database surfaces a familiar and popular database ecosystem with the intent of giving customers the following additional benefits:

Elasticity of scale – Customers can provision a database based on the throughput requirements of their application. As throughput requirements change, the customer can easily scale up or down.
Automated backups with self-service restore to any point in time – Database backups are automatically handled by the service, with log backups generally occurring every five to ten minutes.
High availability – Azure SQL Database supports a differentiated availability SLA with a maximum of 99.995 percent, backed by availability zone resilience to infrastructure failures.
Predictable performance – Customers on the same provisioned resource level always get the same performance with the same workload.
Predictable scalability – Customers using the hyperscale service tier can rely on predictable latency of online scaling operations, backed by a verifiable scaling SLA. This gives the customer a reliable tool to react to changing compute capacity demands in a timely manner.
Automatic upgrades – Azure SQL Database is designed to facilitate transparent hardware and software upgrades, as well as periodic, lightweight software updates.
Global scale – Customers can deploy databases around the world and easily provision geographically distributed database replicas enabling regional data access and disaster recovery solutions. These solutions are backed by strong geo-replication and failover SLAs.

For the Azure SQL Database engineering team, providing a balanced DBaaS system for customers goes well beyond simply providing the purchased CPU, IO, memory, and storage. We must also honor all of the aforementioned commitments and aim to balance these key DBaaS factors along with overall performance requirements.

The following figure shows some of the key resources that are governed within the service.

Figure 1: Governed resources in Azure SQL Database

We need to provide this balanced system in such a way that allows us to continually improve the service over time. This requirement for continual improvement implies a necessary level of component abstraction and over-arching governance. Governance in Azure SQL Database ensures that we properly balance requirements around scale, high availability, recoverability, disaster recovery, and predictable performance.

To illustrate, let’s use transaction log rate governance as an example of why we actively manage resources in order to provide a balanced DBaaS. Transaction log rate governance is a process in Azure SQL Database used to limit high ingestion rates for workloads such as bulk insert, SELECT INTO, and index builds.

Why govern this type of activity? Consider the following dimensions and the impact of transaction log generation rate.

Database recoverability – We make guarantees around the maximum window of possible data loss based on transaction log backup frequency.
High availability – Local replicas must remain within a recoverability and availability (up-time) range that aligns with our SLAs.
Disaster recovery – Globally distributed replicas must remain within a recoverability range that minimizes data loss.
Predictable performance – Log generation rates must not over-saturate the system or create unpredictable performance.

Log rates are set such that they can be achieved and sustained in a variety of scenarios, while the overall system maintains its functionality with minimal impact to the user load. Log rate governance ensures that transaction log backups stay within published recoverability SLAs and prevents an excessive backlog on secondary replicas. We have similar impacts and interdependencies across other governed areas, including CPU, memory, and data IOPS.
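One practical consequence for monitoring: when a workload pushes against the log rate cap (or any other governed resource), the pressure is visible as utilization percentages relative to the service tier limits. Here is a short sketch using sys.dm_db_resource_stats, which samples roughly every 15 seconds and retains about an hour of history:

```sql
-- Recent resource usage as a percentage of the service tier's limits.
-- Sustained values near 100 in avg_log_write_percent indicate the
-- workload is being paced by log rate governance.
SELECT end_time,
       avg_cpu_percent,
       avg_data_io_percent,
       avg_log_write_percent,
       max_worker_percent
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC;
```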

How we govern resources in Azure SQL Database

While we use a multi-faceted approach to governance, today we rely primarily on three main technologies: Job Objects, File Server Resource Manager (FSRM), and SQL Server Resource Governor.

Job Objects

Azure SQL Database leverages multiple mechanisms for governing the overall performance of a database. One of the features we leverage is Windows Job Objects, which allow a group of processes to be managed and governed as a unit. We use this functionality to govern virtual memory commit, working set caps, CPU affinity, and rate caps. We onboard new governance capabilities as the Windows team releases them.

File Server Resource Manager (FSRM)

We use FSRM, a feature available in Windows Server, to govern file directory quotas.

SQL Server Resource Governor

A SQL Server instance has multiple consumers of resources, including user requests and system tasks. SQL Server Resource Governor was introduced in SQL Server years ago to ensure fair sharing of resources and prevent out-of-control requests from starving other requests, and over time it was extended to help govern several resources for an instance, including CPU, physical IO, memory, and more. We use this functionality in Azure SQL Database as well, to help govern IOPS (both local and remote), CPU caps, memory, worker counts, session counts, memory grant limits, and the maximum number of concurrent requests.
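In Azure SQL Database, the Resource Governor configuration is managed internally by the service, but the same feature is user-configurable in on-premises SQL Server. As an illustrative sketch (with hypothetical names ReportingPool, ReportingGroup, and reporting_user), the following caps an ad hoc reporting workload’s CPU so it cannot starve other requests:

```sql
-- Create a pool that caps CPU for reporting work (hypothetical names).
CREATE RESOURCE POOL ReportingPool WITH (MAX_CPU_PERCENT = 20);
GO
CREATE WORKLOAD GROUP ReportingGroup USING ReportingPool;
GO
-- Classifier function (must live in master): route sessions from the
-- hypothetical reporting_user login into the capped group.
CREATE FUNCTION dbo.fnClassifier() RETURNS sysname
WITH SCHEMABINDING
AS
BEGIN
    RETURN (CASE WHEN SUSER_SNAME() = N'reporting_user'
                 THEN N'ReportingGroup'
                 ELSE N'default' END);
END;
GO
ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.fnClassifier);
ALTER RESOURCE GOVERNOR RECONFIGURE;
```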

Beyond the three main technologies, we also created additional mechanisms for governing transaction log rate.

Configurations for safe and predictable operations

Consider all the settings one must configure for a well-tuned on-premises SQL Server instance, including database file settings, max memory, max degree of parallelism, and more. In Azure SQL Database, we pre-configure several such settings based on similar best practices. And as mentioned earlier, we pre-configure SQL Server Resource Governor, FSRM, and Job Objects to deliver fairness and prevent starvation. The reasoning behind this is to aim for safe and predictable operation. We can also provide varying settings for customers based on their workloads and specific needs, assuming they conform to the safety limits defined for the service.
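For instance, one setting that remains customer-adjustable within those safety limits is the max degree of parallelism, which can be scoped per database. A minimal sketch; ALTER DATABASE SCOPED CONFIGURATION is a documented feature of Azure SQL Database:

```sql
-- Cap parallelism for the current database within service safety limits.
ALTER DATABASE SCOPED CONFIGURATION SET MAXDOP = 4;

-- Review the database-scoped configuration values currently in effect.
SELECT name, value, is_value_default
FROM sys.database_scoped_configurations;
```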

Improvements over time

Sometimes we deploy software changes that improve the performance and scalability of specific operations. Customers benefit automatically, and actual performance might exceed the defined limits; we may also increase those limits for all customers in the future. Furthermore, as we enhance the hardware of machines, storage, and network, these benefits may also be transparently available to an application. This is possible because we have defined this DBaaS abstraction layer instead of just providing a specific physical machine.

Evolving governance

The Azure SQL Database engineering team regularly enhances governance capabilities used in the service. We continually review our models based on feedback and production telemetry and we modify our limits to maximize available resources, increase safety, and reduce the impact of system tasks.

If you have feedback to share, we would like to hear from you. To contact the engineering team with feedback or comments on this subject, please email SQLDBArchitects@microsoft.com.
Source: Azure

Umanis lifts the hood on their AI implementation methodology

Microsoft creates deep, technical content to help developers enhance their proficiency when building solutions using the Azure AI Platform. Our preferred training partners redeliver our LearnAI Bootcamps for customers around the globe on topics including Azure Databricks, Azure Machine Learning service, Azure Search, and Cognitive Services. Umanis, a systems integrator and preferred AI training partner based in France, has been innovating in Big Data and Analytics in numerous verticals for more than 25 years and has developed an effective methodology for guiding customers into the Intelligent Cloud. Here, Philippe Harel, the AI Practice Director at Umanis, describes this methodology and shares lessons learned to empower customers to do more with data and AI.

2019 is the year when artificial intelligence (AI) and machine learning (ML) are shifting from being mere buzzwords to real-world adoption and rollouts across the enterprise. This year reminds us of the cloud adoption curve a few years ago, when it was no longer an option to stay on-premises alone, but a question of how to make the shift. As you draw up plans on how to best use AI, here are some learnings and methodologies that Umanis is following.

Given the ever-increasing speed of change in technology, along with the variety of sectors and industries Umanis works in, the company focused on building a methodology that could be standardized across AI implementations from project to project. This methodology follows an iterative cycle: assimilate, learn, and act, with the goal of adding value with each iteration.

The Azure platform acts as an enabler of this methodology as seen in the image below.

In most data and AI projects implemented at Umanis, several trends are gaining momentum and are likely to amplify in 2019:

More unstructured, big, and real-time data.
An increased need for fast and reliable AI solutions to scale up.
Increasing expectations from customers.

In this blog post, we will explain how you can address these kinds of projects, and how Umanis maps their approach to the Azure offering to deliver solutions that are easy to use, operationalize, and maintain.

The 3 phases of the AI implementation methodology

1. Assimilate

In this initial phase, you can be hit by anything. From the good to the big, bad, and ugly: databases, text, logs, telemetry, images, videos, social networks, and more are flowing in. The challenge is to make sense of everything, so you can serve the next phase (Learn) successfully. By assimilating, we mean:

Ingest: The performance of an algorithm depends on the quality of the data. We consider “ingesting” to be checking the quality of the data, the quality of the transmission, and building the pipelines to feed the subsequent parts.
Store: Since the data will be used by highly demanding algorithms (I/O, processing power) that will mix data from various sources, you need to store the data in the most efficient way for future access by algorithms or data visualizations.
Structure: Finally, you’ll need to prepare the data for the algorithms’ consumption and execute as many transformation, preprocessing, and cleaning tasks as you can to speed up the data scientists’ activities and algorithms.

2. Learn

This is the heart of any AI project: Creating, deploying, and managing models.

Create: Data scientists use available data to design algorithms, train their models, and compare the results. There are two key points to this:

Don’t make them wait for results! Data scientists are rare resources and their time is precious.
Allow any language or combination of languages. From that perspective, Azure Databricks is a great solution, as it natively allows different languages to be combined within a single notebook.

Use: Once algorithms are deployed as APIs and consumed, the need for parallelization goes up. Meeting SLAs and testing the performance of the sending, processing, and receiving pipeline are crucial.
Refine: Refining the quality of algorithms ensures reliable results over time. The easy part of this activity is automatic re-training on a regular basis. The less obvious part is what we call the “human in the loop” activity: in short, a Power BI report shows the results of predictions, a human quickly re-classifies them as needed, and the machine uses this human expertise to get better at its task.

3. Act

All of the above phases are useless unless you actually make good use of the algorithm’s added value.

Inform: Any mistake in code, misunderstanding of requirements, or bug can be devastating, as first user impressions are crucial. Therefore, instead of a “big bang” of visualizations, start very small, iterate very quickly, and bring a few key users on board to secure adoption before widening the audience.
Connect: Systems that use the information from algorithms need to be plugged in. This is called RPA, IPA, or automation in general, and the architectures can vary greatly from project to project. Don’t overlook the need for human monitoring of this activity. Consider the impact of the worst possible answer from an algorithm, and you will get a good feel for the need for human supervision.
Dialog: When dealing with human interaction, so much comes into play that, to be successful, the scope of the interaction needs to be narrowed down to the actions that really add value, are not trivial, and are not easily possible via classic interfaces.

Conclusion

This methodology will certainly change and adapt over time. Nevertheless, Umanis has found it to be a robust way of rolling out end-to-end data and AI projects while minimizing friction and risk. By using this approach to present a data and AI project to both customers and internal teams, everyone can get a good sense of what activities, technologies, and challenges are involved. It’s one way to address the “urgent need to build shared context, trust, and credibility with your team,” as Satya Nadella states in his book, Hit Refresh. This methodology is a great way to build trust in your relationships.

If you want more information about the methodology used by Umanis, you can find them at upcoming conferences in the next two months (in French) discussing this topic in Luxembourg, Paris, and Nantes.

Learn More

Learn more about the Azure Machine Learning service

Get started with a free trial of Azure Machine Learning service
Source: Azure