Advancing memory leak detection with AIOps—introducing RESIN

“Operating a cloud infrastructure at global scale is a large and complex task, particularly when it comes to service standard and quality. In a previous blog, we shared how AIOps was leveraged to improve service quality, engineering efficiency, and customer experience. In this blog, I’ve asked Jian Zhang, Principal Program Manager from the AIOps Platform and Experiences team to share how AI and machine learning is used to automate memory leak detection, diagnosis, and mitigation for service quality.”—Mark Russinovich, Chief Technology Officer, Azure.

In the ever-evolving landscape of cloud computing, memory leaks represent a persistent challenge, affecting performance, stability, and ultimately, the user experience. Memory leak detection is therefore important to cloud service quality. Memory leaks happen when allocated memory is unintentionally not released in a timely manner. A leak can degrade the component's performance and even crash the operating system (OS). Worse, it often affects other processes running on the same machine, slowing them down or causing them to be killed.

Given the impact of memory leak issues, many studies and solutions exist for memory leak detection. Traditional detection solutions fall into two categories: static and dynamic detection. Static leak detection techniques analyze software source code and deduce potential leaks, whereas dynamic methods detect leaks by instrumenting a program and tracking object references at runtime.

However, these conventional techniques are not adequate for leak detection in a cloud environment. Static approaches have limited accuracy and scalability, especially for leaks that result from cross-component contract violations, which require rich domain knowledge to capture statically. Dynamic approaches are generally better suited to a cloud environment, but they are intrusive, require extensive instrumentation, and introduce high runtime overhead, which is costly for cloud services.

Introducing RESIN

Today, we are introducing RESIN, an end-to-end memory leak detection service designed to holistically address memory leaks in large cloud infrastructure. RESIN has been used in Microsoft Azure production and demonstrated effective leak detection with high accuracy and low overhead.

RESIN system workflow

A large cloud infrastructure can consist of hundreds of software components owned by different teams. Prior to RESIN, memory leak detection was an individual team's effort in Microsoft Azure. As shown in Figure 1, RESIN takes a centralized approach, conducting leak detection in multiple stages to achieve low overhead, high accuracy, and scalability. This approach requires no access to components' source code and no extensive instrumentation or recompilation.

Figure 1: RESIN workflow

RESIN conducts low-overhead monitoring using agents that collect memory telemetry data at the host level. A remote service aggregates and analyzes the data from different hosts using a bucketization-pivot scheme. When a leak is detected in a bucket, RESIN triggers an analysis of the process instances in that bucket. For highly suspicious leaks, RESIN performs live heap snapshotting and compares the result against regular heap snapshots in a reference database. After generating multiple heap snapshots, RESIN runs a diagnosis algorithm to localize the root cause of the leak and generates a diagnosis report, which is attached to the alert ticket to assist developers with further analysis. Finally, RESIN automatically mitigates the leaking process.
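The staged flow above can be sketched as a simple pipeline in which each later, more expensive stage runs only for the suspects surfaced by the previous one. This is an illustrative sketch, not RESIN's actual code; all names (`run_pipeline`, `LeakCase`, and the stage callbacks) are hypothetical:

```python
# Hypothetical sketch of RESIN's multi-stage workflow: cheap global detection
# first, expensive snapshotting and diagnosis only for confirmed suspects.
from dataclasses import dataclass, field

@dataclass
class LeakCase:
    component: str
    suspect_processes: list = field(default_factory=list)
    snapshots: list = field(default_factory=list)
    report: str = ""

def run_pipeline(telemetry, detect_bucket, detect_process, take_snapshot, diagnose):
    """Chain the stages; costlier stages run only when earlier ones fire."""
    cases = []
    for component, series in telemetry.items():
        if not detect_bucket(series):                    # stage 1: bucket-level check
            continue
        case = LeakCase(component)
        case.suspect_processes = detect_process(series)  # stage 2: per-process check
        for proc in case.suspect_processes:
            case.snapshots.append(take_snapshot(proc))   # stage 3: live heap snapshot
        case.report = diagnose(case.snapshots)           # stage 4: root-cause report
        cases.append(case)
    return cases
```

Passing the detectors in as callbacks mirrors the separation the post describes between low-overhead global monitoring and the costly per-host diagnosis that only runs for suspected leaks.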

Detection algorithms

There are unique challenges in memory leak detection in cloud infrastructure:

Noisy memory usage, caused by changing workloads and interference in the environment, makes static threshold-based detection unreliable.

Memory leaks in production systems are usually fail-slow faults that can last days, weeks, or even months, and such gradual change is difficult to capture over long periods in a timely manner.

At the scale of the Azure global cloud, it's not practical to collect fine-grained data over long periods of time.

To address these challenges, RESIN uses a two-level scheme to detect memory leak symptoms: a global bucket-based pivot analysis to identify suspicious components, and a local per-process leak detection to identify leaking processes.

With the bucket-based pivot analysis at the component level, we categorize raw memory usage into a number of buckets and transform the usage data into a summary of the number of hosts in each bucket. A severity score for each bucket is then calculated based on the deviations and the host count in the bucket. Anomaly detection is performed on the time-series data of each bucket of each component. The bucketization approach not only represents the workload trend robustly, with tolerance to noise, but also reduces the computational load of the anomaly detection.
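As a rough illustration of the bucketization idea, the sketch below maps raw per-host usage to coarse buckets and scores a bucket by the growth of its host count. The bucket width, severity formula, and function names are assumptions of this sketch, not RESIN's actual definitions:

```python
# Illustrative bucketization: anomaly detection then runs on the per-bucket
# host-count time series instead of raw per-host memory series.
from collections import Counter

def bucketize(host_usage_gb, bucket_size_gb=2.0):
    """Map each host's raw memory usage to a coarse bucket index and count
    hosts per bucket -- the summary analyzed instead of raw series."""
    return Counter(int(u // bucket_size_gb) for u in host_usage_gb)

def severity(bucket_counts_over_time, bucket):
    """Assumed severity: growth of the bucket's host count relative to its
    historical mean, weighted by the current host count in the bucket."""
    series = [c.get(bucket, 0) for c in bucket_counts_over_time]
    baseline = sum(series[:-1]) / max(len(series) - 1, 1)
    deviation = series[-1] - baseline
    return deviation * series[-1]
```

Because the summary is a handful of bucket counts per component rather than one series per host, the volume of data the central anomaly detector must process shrinks dramatically, which is the scalability benefit the paragraph above describes.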

However, detection at the component level alone is not sufficient for developers to investigate a leak efficiently because many processes normally run on a component. When a leaking bucket is identified at the component level, RESIN runs a second-level detection scheme at process granularity to narrow the scope of investigation. It outputs the suspected leaking process, its start and end times, and a severity score.
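A minimal sketch of a per-process second-level check, assuming a least-squares trend fit over a single process's memory series; the growth threshold, the severity definition (total fitted growth), and the function name are hypothetical, not RESIN's actual algorithm:

```python
def detect_process_leak(ts, usage_mb, min_growth_mb=100.0):
    """Fit a least-squares trend line to one process's memory series and flag
    the process when the fitted growth over the window exceeds a threshold.
    Returns (start_time, end_time, severity) for a suspect, else None."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_u = sum(usage_mb) / n
    cov = sum((t - mean_t) * (u - mean_u) for t, u in zip(ts, usage_mb))
    var = sum((t - mean_t) ** 2 for t in ts)
    slope = cov / var                      # MB per time unit
    growth = slope * (ts[-1] - ts[0])      # fitted growth over the window
    if growth < min_growth_mb:
        return None
    return ts[0], ts[-1], growth
```

A trend fit rather than a point threshold tolerates the noisy, fail-slow behavior described earlier: a process with flat but high usage is not flagged, while slow steady growth accumulates into a detectable fitted slope.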

Diagnosis of detected leaks

Once a memory leak is detected, RESIN takes a snapshot of the live heap, which contains all memory allocations referenced by the running application, and analyzes the snapshots to pinpoint the root cause of the detected leak. This makes memory leak alerts actionable.

RESIN leverages the Windows heap manager's snapshot capability to perform live profiling. However, heap collection is expensive and can be intrusive to the host's performance. To minimize this overhead, several measures govern how snapshots are taken:

The heap manager stores only limited information in each snapshot, such as the stack trace and size of each active allocation.

RESIN prioritizes candidate hosts for snapshotting based on leak severity, noise level, and customer impact. By default, the top three hosts on the suspect list are selected to help ensure successful collection.

RESIN utilizes a long-term, trigger-based strategy to ensure the snapshots capture the complete leak. To facilitate the decision regarding when to stop the trace collection, RESIN analyzes memory growth patterns (such as steady, spike, or stair) and takes a pattern-based approach to decide the trace completion triggers.

RESIN uses a periodic fingerprinting process to build reference snapshots, which are compared with the snapshot of a suspected leaking process to support diagnosis.

RESIN analyzes the collected snapshots to output the stack traces of the root cause allocations.
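Two of the steps above, the pattern-based trace-completion trigger and the reference-snapshot comparison, can each be illustrated with a short sketch. The classification thresholds, the snapshot representation (a mapping from allocating stack trace to total outstanding bytes), and both function names are assumptions of this sketch rather than RESIN's actual implementation:

```python
def classify_growth(samples):
    """Roughly label a memory series as one of the growth patterns named
    above. 'spike': one step dominates total growth; 'stair': growth is
    concentrated in a few jumps between flat stretches; 'steady': roughly
    uniform increments. Thresholds are arbitrary for this sketch."""
    deltas = [b - a for a, b in zip(samples, samples[1:])]
    total = sum(deltas)
    if total <= 0:
        return "flat"
    if max(deltas) / total > 0.8:
        return "spike"
    flat_steps = sum(1 for d in deltas if d <= 0.05 * total)
    if flat_steps / len(deltas) >= 0.5:
        return "stair"
    return "steady"

def rank_leak_suspects(reference, suspect, top=3):
    """Diff a suspect heap snapshot against a reference snapshot; the stack
    traces whose outstanding bytes grew the most are the likeliest leaks.
    Returns up to `top` (growth_bytes, stack) pairs, largest first."""
    growth = {
        stack: suspect.get(stack, 0) - reference.get(stack, 0)
        for stack in set(reference) | set(suspect)
    }
    return sorted(((g, s) for s, g in growth.items() if g > 0), reverse=True)[:top]
```

The pattern label would drive when to stop collecting: a spike is captured quickly, while a stair or steady pattern needs a longer window to include several growth steps.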

Mitigation of detected leaks

When a memory leak is detected, RESIN attempts to mitigate the issue automatically to avoid further customer impact. Depending on the nature of the leak, different mitigation actions are available, and RESIN uses a rule-based decision tree to choose the action that minimizes impact.

If the memory leak is localized to a single process or Windows service, RESIN attempts the lightest mitigation: simply restarting the process or the service. An OS reboot can also resolve software memory leaks, but it takes much longer and can cause virtual machine downtime, so it is normally reserved as a last resort. For a non-empty host, RESIN instead uses solutions such as Project Tardigrade, which performs a kernel soft reboot that skips hardware initialization, after live virtual machine migration, to minimize user impact. A full OS reboot is performed only when the soft reboot is ineffective.
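The escalation described above can be summarized as a small rule-based decision tree, ordered from least to most disruptive. This is an illustrative sketch; the action names and rules are assumptions, not RESIN's actual mitigation policy:

```python
def choose_mitigation(leak_scope, host_has_vms, soft_reboot_failed=False):
    """Pick the least disruptive action that can clear the leak.
    leak_scope: 'process', 'service', or 'host' (leak not localizable)."""
    if leak_scope in ("process", "service"):
        return "restart_" + leak_scope          # lightest: restart the leaker
    if host_has_vms:
        if soft_reboot_failed:
            return "full_os_reboot"             # last resort after escalation
        return "live_migrate_then_kernel_soft_reboot"  # Tardigrade-style path
    return "full_os_reboot" if soft_reboot_failed else "kernel_soft_reboot"
```

Encoding the escalation as explicit rules keeps the automated mitigation auditable: each action taken on a production host can be traced back to the branch that selected it.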

RESIN stops applying mitigation actions to a target once the detection engine no longer considers the target leaking.

Result and impact of memory leak detection

RESIN has been running in production in Azure since late 2018, and to date it has monitored millions of host nodes and hundreds of host processes daily. Overall, RESIN memory leak detection achieved 85% precision and 91% recall,1 despite the rapidly growing scale of the monitored cloud infrastructure.

The end-to-end benefits brought by RESIN are clearly demonstrated by two key metrics:

Virtual machine unexpected reboots: the average number of reboots per one hundred thousand hosts per day due to low memory.

Virtual machine allocation error: the ratio of erroneous virtual machine allocation requests due to low memory.

Between September 2020 and December 2023, the virtual machine reboots were reduced by nearly 100 times, and allocation error rates were reduced by over 30 times. Furthermore, since 2020, no severe outages have been caused by Azure host memory leaks.1

Learn more about RESIN

RESIN's end-to-end memory leak detection capabilities, designed to holistically address memory leaks in large cloud infrastructure, can improve the reliability and performance of your cloud infrastructure and prevent issues caused by memory leaks. To learn more, read the publication.

1 RESIN: A Holistic Service for Dealing with Memory Leaks in Production Cloud Infrastructure, Chang Lou, Johns Hopkins University; Cong Chen, Microsoft Azure; Peng Huang, Johns Hopkins University; Yingnong Dang, Microsoft Azure; Si Qin, Microsoft Research; Xinsheng Yang, Meta; Xukun Li, Microsoft Azure; Qingwei Lin, Microsoft Research; Murali Chintalapati, Microsoft Azure, OSDI’22.
The post Advancing memory leak detection with AIOps—introducing RESIN appeared first on Azure Blog.
Source: Azure

Microsoft Cost Management updates—March 2024 

Whether you’re a new student, a thriving startup, or the largest enterprise, you have financial constraints, and you need to know what you’re spending, where it’s being spent, and how to plan. Nobody wants a surprise when it comes to the bill, and this is where Cost Management comes in. 

We’re always looking for ways to learn more about your challenges and how Cost Management can help you better understand where you’re accruing costs in the cloud, identify and prevent bad spending patterns, and optimize costs to empower you to do more with less. Here are a few of the latest improvements and updates based on your feedback: 

Microsoft Azure Kubernetes Service (AKS) costs

Auto renewal of Azure Reservations 

Connector for AWS—Retirement date: March 31, 2025 

Pricing updates on Azure.com 

Cost Management Labs 

New ways to save money in the Microsoft Cloud 

New videos and learning opportunities 

Documentation updates 

Let’s dig into the details. 

Cost Management solutions

Learn how to optimize your cloud investments with confidence

Microsoft Azure Kubernetes Service (AKS) costs 

Cost views

I am pleased to share that the AKS cost views are now generally available in Cost analysis. General availability was officially announced at KubeCon in Paris last month, following the preview announcement at Microsoft Ignite in November 2023.

AKS users have always had visibility into the infrastructure costs of running their clusters. With these new views, they also gain visibility into the costs of the namespaces running in their clusters, along with an aggregated view of cluster costs across their subscription. With these additional insights, users can allocate and optimize their AKS costs more efficiently, maximizing the benefits of running their workloads on shared infrastructure. To enable these views, users must install the cost analysis add-on on their clusters.

Figure 1: Kubernetes clusters view 

Figure 2: Kubernetes namespaces view

Please refer to the two articles below for more information: 

Azure Kubernetes Service cost analysis – Azure Kubernetes Service | Microsoft Learn 

View Kubernetes costs (Preview) – Cost Management | Microsoft Learn 

Fleet workload placement

An additional announcement from KubeCon that I want to highlight is the extension of fleet workload placement to schedule workloads to clusters based on new heuristics such as cost and resource availability. For more information, please refer to “Open-Source Fleet Workload Placement Scheduling and Override.”

Auto renewal of Azure Reservations 

Azure Reservations can significantly reduce your resource costs, by up to 72% compared to pay-as-you-go prices. To simplify the management of reservations and keep receiving reservation discounts, you can now set up auto-renewal of your reservations at the time of purchase. Please note that the setting is turned off by default, so make sure to turn it on before your reservation expires. To learn more, refer to “Automatically renew Azure reservations – Cost Management | Microsoft Learn.”

Connector for Amazon Web Services (AWS)—Retirement date: March 31, 2025  

Please note that we will be retiring the connector for AWS in Cost Management on March 31, 2025. You will not have access to AWS data through the API or portal beyond the retirement date; however, you will continue to have access to the data stored in your S3 bucket in the AWS console. To prepare for the retirement date, we have removed the ability to add a new connector from Cost Management. We encourage you to look at alternative solutions to access your AWS costs. For more information, please refer to “Support for Connector for AWS in Cost Management is ending on 31 March 2025.”

Pricing updates on Azure.com 

We’ve been working hard to make some changes to our Azure pricing experiences, and we’re excited to share them with you. These changes will help make it easier for you to estimate the costs of your solutions:

Azure savings plan has now been extended to Microsoft Azure Spring apps, offering more flexibility and cost optimization on both the pricing page and calculator. 

We’ve added a calculator entry for Azure Kubernetes Service Edge Essentials. 

We’ve added pricing for many new offers on Microsoft Azure, including: 

Microsoft Azure Application Gateway: general availability (GA) of Application Gateway for Containers. 

Microsoft Azure Virtual Machines: new Dasv6, Easv6, and Fasv6 series, all in preview. 

Microsoft Azure Red Hat OpenShift: added virtual machine (VM) families and an improved search experience in the pricing calculator. 

Microsoft Azure SQL Database: HA replica pricing for elastic pools and Hyperscale. 

Microsoft Azure Databricks: “Model Training” workload for premium-tier workspaces. 

Microsoft Azure Managed Grafana: Standard and Essential plan types added to the pricing calculator. 

Microsoft Azure Backup: pricing for the Enhanced policy type. 

Microsoft Azure Private 5G Core: new RAN Overage and Devices Overage offers on both the pricing page and calculator. 

We’re constantly working to improve our pricing tools and make them more accessible and user-friendly. We hope you find these changes helpful in estimating the costs of your Azure solutions. If you have any feedback or suggestions for future improvements, please let us know!

Cost Management Labs 

With Cost Management Labs, you get a sneak peek at what’s coming in Cost Management and can engage directly with us to share feedback and help us better understand how you use the service, so we can deliver more tuned and optimized experiences. Here are a few features you can see in Cost Management Labs:  

 Currency selection in Cost analysis smart views. View your non-USD charges in USD or switch between the currencies you have charges in to view the total cost for that currency only. To change currency, select “Customize” at the top of the view and select the currency you would like to apply. Currency selection is not applicable to those with only USD charges. Currency selection is enabled by default in Labs.   

 Streamlined Cost Management menu. Organize Cost Management tools into related sections for reporting, monitoring, optimization, and configuration settings.   

Recent and pinned views in the cost analysis preview. Show all classic and smart views in cost analysis and streamline navigation by prioritizing recently used and pinned views.   

Forecast in Cost analysis smart views. Show your forecast cost for the period at the top of Cost analysis preview.   

Charts in Cost analysis smart views. View your daily or monthly cost over time in Cost analysis smart views.   

Open configuration items in the menu. Experimental option to show the selected configuration screen as a nested menu item in the Cost Management menu. Please share feedback.  

New ways to save money in the Microsoft Cloud 

Here are a couple of important updates for you to review that can help reduce costs:

“Generally Available: Azure Kubernetes Service (AKS) support for 5K Node limit by default for standard tier clusters” 

“Public Preview: Well-Architected Framework assessment on Azure Advisor” 

New videos and learning opportunities 

Check out “Leverage anomaly management processes with Microsoft Cost Management,” a great video for managing anomalies and reservations. You can also follow the Cost Management YouTube channel to stay in the loop with new videos as they’re released and let us know what you’d like to see next. Want a more guided experience? Start with “Control Azure spending and manage bills with Microsoft Cost Management.”

To learn about using Microsoft tools for FinOps best practices, refer to the blog post “Combine FinOps best practices and Microsoft tools to streamline and optimize your workloads.”

Documentation updates  

Here are a few documentation updates you might be interested in: 

New: “Azure Hybrid Benefit documentation”

Update: “Transfer Azure Enterprise enrollment accounts and subscriptions”

Update: Azure EA pricing – Cost Management

Update: Review your Azure Enterprise Agreement bill

Update: Understand usage details fields

Update: “Organize your costs by customizing your billing account”

Want to keep an eye on all documentation updates? Check out the Cost Management and Billing documentation change history in the azure-docs repository on GitHub. If you see something missing, select “Edit” at the top of the document and submit a quick pull request. You can also submit a GitHub issue. We welcome and appreciate all contributions! 

What’s next? 

These are just a few of the big updates from last month. Don’t forget to check out the previous Cost Management updates. We’re always listening and making constant improvements based on your feedback, so please keep the feedback coming. 

Best wishes, 

Cost Management team 
The post Microsoft Cost Management updates—March 2024  appeared first on Azure Blog.
Source: Azure