Join Microsoft at ISC2019 in Frankfurt

The world of computing works deep and wide on issues related to our environment, economy, energy, and public health systems. These needs require modern, advanced solutions that can be hard to scale, take a long time to deliver, and have traditionally been limited to a few organizations. Microsoft Azure delivers high-performance computing (HPC) capability and tools, integrated into a global-scale cloud platform, to power solutions that address these challenges.

Join us in Frankfurt, Germany from June 17–19, 2019 at the world's second-largest supercomputing show, ISC High Performance 2019. Learn how Azure customers take advantage of the flexibility and elasticity of the cloud, and how to integrate both our specialized compute virtual machines (VMs) and bare-metal offerings from Cray.

Microsoft booth presentations and topics include:

How to achieve high-performance computing on Azure
Cray Supercomputing on Azure
Cray ClusterStor on Azure with H-Series VMs
AI and HPC with NVIDIA
Autonomous driving
Live demos
Case studies from partners and customers
More about our recently launched HB and HC virtual machines

To learn more, please come by the Microsoft booth, K-530, and say "hello" any time between June 17 and June 19.

Microsoft, AMD, and Cray breakfast at ISC

Please join us for a co-hosted breakfast with Microsoft, AMD, and Cray on June 19, 2019, where we will discuss how to successfully support your large-scale HPC jobs in the cloud. In this session we will discuss our recently launched offers with Cray in Azure, as well as the Azure HB-series VMs optimized for applications driven by memory bandwidth, all powered by AMD EPYC processors. The breakfast is at the Frankfurt Marriott in Gold I-III (1st Floor) from 7:45 AM to 9:00 AM. Please feel free to register for this event.

Supercomputing in the cloud

Building on our strong relationship with Cray, we’re excited to showcase three new dedicated offerings at ISC and to demonstrate our accelerated innovation and delivery of next-generation HPC and AI technologies to Azure customers.

We’re looking forward to seeing you at ISC.

Microsoft's ISC schedule

Tuesday, June 18, 2019

10:30 AM – 10:50 AM
Speaker: Burak Yenier, CEO, TheUberCloud Inc.
Title: UberCloud and Microsoft are helping customers move their engineering workload to Azure

11:30 AM – 11:50 AM
Speaker: Mohammad Zamaninasab, AI TSP GBB, Microsoft
Title: Artificial intelligence with Azure Machine Learning, Cognitive Services, DataBricks

12:30 PM – 12:50 PM
Speaker: Dr. Ulrich Knechtel, CSP Manager – EMEA, NVIDIA
Title: Accelerate your HPC workloads with NVIDIA GPUs on Azure

1:30 PM – 1:50 PM
Speaker: Uli Plechschmidt, Storage Marketing, Cray
Title: Why moving large scale, extremely I/O intensive HPC applications to Microsoft Azure is now possible

2:30 PM – 2:50 PM
Speaker: Joseph George, Executive Director, Cray Inc.
Title: 5 reasons why you can maximize your manufacturing environment with Cray in Azure

3:30 PM – 3:50 PM
Speaker: Martin Hilgeman, Senior Manager, AMD HPC Centre of Excellence
Title: Turbocharging HPC in the cloud with AMD EPYC

4:30 PM – 4:50 PM
Speaker: Evan Burness, Principal Program Manager, Azure HPC, Microsoft
Title: HPC infrastructure in Azure

5:30 PM – 5:50 PM
Speaker: Niko Dukic, Senior Program Manager for Azure Storage, Microsoft
Title: Azure Storage ready for HPC

Wednesday, June 19, 2019

10:30 AM – 10:50 AM
Speaker: Gabriel Broner, Vice President and General Manager of HPC, Rescale Inc.
Title: Rescale HPC platform on Microsoft Azure

11:30 AM – 11:50 AM
Speaker: Martijn de Vries, CTO, Bright Computing
Title: Enabling hybrid clusters that span on-premises and Microsoft Azure

12:30 PM – 12:50 PM
Speaker: Rob Futrik, Program Manager, Microsoft
Title: HPC cluster management in Azure via Microsoft programs: CycleCloud / Azure Batch / HPC Pack

1:30 PM – 1:50 PM
Speaker: Christopher Woll, CTO, GNS Systems
Title: Digital Engineering Center – the HPC workplace of tomorrow, available today

2:30 PM – 2:50 PM
Speaker: Addison Snell, CEO, Intersect360 Research
Title: HPC and AI market update

3:30 PM – 3:50 PM
Speaker: Rick Watkins, Director of Appliance and Cloud Solutions, Altair
Title: Altair HyperWorks Unlimited Virtual Appliance (HWUL-VA) – easy-to-use HPC-powered CAE solvers running on Azure

4:30 PM – 4:50 PM
Speaker: Gabriel Sallah, PSE GBB, Microsoft
Title: Deploying autonomous driving on Azure

5:30 PM – 5:50 PM
Speaker: Brock Taylor, Engineering Director and HPC Solutions Architect
Title: HPC as a service: on-premises and off-premises considerations for the cloud

Quelle: Azure

Google Cloud networking in-depth: How Andromeda 2.2 enables high-throughput VMs

Here at Google Cloud, we’ve always aimed to provide great network bandwidth for Compute Engine VMs, thanks in large part to our custom Jupiter network fabric and Andromeda virtual network stack. During Google Cloud Next ‘19, we improved that bandwidth even further by doubling the maximum network egress data rate to 32 Gbps for common VM types. We also announced VMs with up to 100 Gbps bandwidth on V100 and T4 GPU accelerator platforms—all without raising prices or requiring you to use premium VMs.

Specifically, for any Skylake or newer VM with at least 16 vCPUs, we raised the egress bandwidth cap to 32 Gbps for same-zone VM-to-VM traffic; this capability is now generally available. This includes n1-ultramem VMs, which provide more compute resources and memory than any other Compute Engine VM instance type. No additional configuration is needed to get that 32 Gbps throughput.

Meanwhile, 100 Gbps accelerator VMs are in alpha, and soon in beta. Any VM with eight V100 or four T4 GPUs attached will have its bandwidth cap raised to 100 Gbps.

These high-throughput VMs are ideal for running compute-intensive workloads that also need a lot of networking bandwidth. Key applications and workloads that can leverage these high-throughput VMs include:

High-performance computing applications, batch processing, scientific modeling
High-performance web servers
Virtual network appliances (firewalls, load balancers)
Highly scalable multiplayer gaming
Video encoding services
Distributed analytics
Machine learning and deep learning

In addition, services built on top of Compute Engine like Cloud SQL, Cloud Filestore, and some partner solutions can already leverage 32 Gbps throughput.

One use case that is particularly network- and compute-intensive is distributed machine learning (ML). To train large datasets or models, ML workloads use a distributed ML framework, e.g., TensorFlow. The dataset is divided and trained by separate workers, which exchange model parameters with each other. These ML jobs consume substantial network bandwidth due to large model sizes and frequent data exchanges among workers. Likewise, the compute instances that run the worker nodes place high throughput requirements on the VMs and on the fabric serving them. One customer, a large chip manufacturer, leverages 100 Gbps GPU-based VMs to run these massively parallel ML jobs, while another customer uses our 100 Gbps GPU machines to test a massively parallel seismic analysis application.
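As a rough sketch of that worker pattern — our illustration, not an official recipe or benchmark — a TensorFlow job might distribute training across VMs like this. In a real cluster, each worker would also carry a TF_CONFIG environment variable describing the other workers, which is omitted here:

```python
import tensorflow as tf

# Gradients are all-reduced across workers every training step; with large
# models, this parameter exchange is exactly the traffic that benefits
# from 32-100 Gbps VM-to-VM bandwidth.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(512, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# Each worker sees the full dataset in this toy example; a real job
# would shard the data across workers.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
model.fit(x_train, y_train, epochs=1, batch_size=256)
```

Run as-is, this trains on a single machine; pointing several VMs at each other via TF_CONFIG turns the same script into the bandwidth-hungry multi-worker job described above.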
Making it all possible: Jupiter and Andromeda

Our highly scalable Jupiter network fabric and high-performance, flexible Andromeda virtual network stack are the same technologies that power Google’s internal infrastructure and services.

Jupiter provides Google with tremendous bandwidth and scale. For example, Jupiter fabrics can deliver more than 1 Petabit/sec of total bisection bandwidth. To put this in perspective, that is enough capacity for 100,000 servers to exchange information at a rate of 10 Gbps each, or enough to read the entire scanned contents of the Library of Congress in less than 1/10th of a second.

Andromeda, meanwhile, is a Software Defined Networking (SDN) substrate for our network virtualization platform, acting as the orchestration point for provisioning, configuring, and managing virtual networks and in-network packet processing. Andromeda lets us share Jupiter networks across many different uses, including Compute Engine and bandwidth-intensive products like BigQuery and Cloud Bigtable.

Since we last blogged about Andromeda, we launched Andromeda 2.2. Among other infrastructure improvements, Andromeda 2.2 features increased performance and improved performance isolation through the use of hardware offloads, enabling you to achieve the network performance you want, even in a multi-tenant environment.

Increasing performance with offload engines

In particular, Andromeda now takes full advantage of the Intel QuickData DMA Engines to offload payload copies of larger packets. Driving the DMA hardware directly from our OS-bypassed Andromeda SDN lets the SDN spend more time processing packets rather than moving data around. We employ the processor’s IOMMU to provide security and safety isolation for DMA Engine copies.

In Google Cloud Platform (GCP), we encrypt all network traffic in transit that leaves a physical boundary not controlled by Google or on behalf of Google. Andromeda 2.2 now utilizes special-purpose network hardware in the Network Interface Card (NIC) to offload that encryption, freeing the host machine’s CPUs to run guest vCPUs more efficiently.

Furthermore, Andromeda’s unique architecture allows us to offload other virtual network processing to hardware opportunistically, improving performance and efficiency under the hood without requiring the use of SR-IOV or other specifications that tie a VM to a physical machine for its lifetime. This architecture also enables us to perform a “hitless upgrade” of the Andromeda SDN as needed to improve performance, add features, or fix bugs.

Combined, these capabilities have allowed us to seamlessly upgrade our network infrastructure across five generations of virtual networking—increasing VM-to-VM bandwidth by nearly 18X (and more than 50X for certain accelerator VMs) and reducing latency by 8X—all without introducing downtime for our customers.

Performance isolation

All that performance is meaningless if your VM is scheduled on a host with other VMs that are overloading or abusing the network, preventing your VM from achieving the performance you expect. Within Andromeda 2.2, we’ve made several improvements to provide isolation, ensuring that each VM receives its expected share of bandwidth. Then, for the rare cases when too many VMs try to push massive amounts of network traffic simultaneously, we reengineered the algorithm to optimize for fairness.

For VM egress traffic, we schedule the act of looking for work on each VM’s transmit queues such that each VM gets its fair share of bandwidth. If we need to throttle a VM because it has reached its network throughput limits, we provide momentary back-pressure to the VM, which causes a well-behaved guest TCP stack to reduce its offered load slightly without causing packet loss.
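That guest-side behavior is ordinary TCP flow control, and you can observe the same effect on any machine. The toy script below — our own illustration, unrelated to Andromeda's actual implementation — pairs a fast sender with a deliberately slow receiver, so the sender's sendall() calls begin to block once the socket buffers fill and its offered load drops:

```python
import socket
import threading
import time

def slow_receiver(server_sock):
    # Drain the connection slowly so the receive buffer stays full and
    # TCP flow control pushes back on the sender.
    conn, _ = server_sock.accept()
    while conn.recv(4096):
        time.sleep(0.05)
    conn.close()

server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
threading.Thread(target=slow_receiver, args=(server,), daemon=True).start()

sender = socket.create_connection(server.getsockname())
chunk = b"x" * 65536
start = time.time()
for i in range(1, 51):
    sender.sendall(chunk)  # blocks once buffers fill: back-pressure
    if i % 10 == 0:
        rate = i * len(chunk) * 8 / (time.time() - start) / 1e6
        print(f"after {i} chunks: ~{rate:.1f} Mbit/s offered load")
sender.close()
```

The printed rate falls toward the receiver's drain rate without a single packet being lost — the same graceful slowdown a throttled VM experiences.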
For VM ingress traffic, we use offloads in the NIC to steer packets into per-VM NIC receive queues. Then, similarly to egress, we look for work on each of those queues in proportion to each VM’s fair share of network bandwidth. In the rare event that a VM is receiving an excessive amount of traffic, its per-VM queue fills up and eventually starts dropping packets. Those drops will again cause a well-behaved TCP connection, originating perhaps from another VM or the internet, to back off slightly, preserving performance for that connection. A VM with a badly behaved connection might not back off, possibly due to bugs in a customer’s workload or even malicious intent. Either way, per-VM receive queues mean we don’t need to drop packets for other VMs on the host, protecting those VMs from the performance pathologies of a bad actor.

You can never have too good a network

At Google we’re constantly working to improve the performance and reliability of our network infrastructure. Stay tuned for new advances from Google Cloud, including low-latency products focused on HPC use cases and even higher-bandwidth VMs. We’d love to hear your feedback and what else you’d like to see in networking. You can reach us at gcp-networking@google.com.
Quelle: Google Cloud Platform

Jupyter Notebook Manifesto: Best practices that can improve the life of any developer using Jupyter notebooks

Many data science teams, both inside and outside of Google, find that it’s easiest to build accurate models when teammates can collaborate and suggest new hyperparameters, layers, and other optimizations. And notebooks are quickly becoming the common platform for the data science community, whether in the form of AI Platform Notebooks, Kaggle Kernels, Colab, or the notebook that started it all, Jupyter.

A Jupyter Notebook is an open-source web application that helps you create and share documents that contain live code, equations, visualizations, and narrative text. Because Jupyter Notebooks are a relatively recently developed tool, they don’t (yet) follow or encourage consensus-based software development best practices.

Data scientists, typically collaborating on a small project that involves experimentation, often feel they don’t need to adhere to any engineering best practices. For example, your team may have the odd Python or shell script that has neither test coverage nor any CI/CD integration. However, if you’re using Jupyter Notebooks in a larger project that involves many engineers, you may soon find it challenging to scale your environment or deploy to production.

To set up a more robust environment, we established a manifesto that incorporates best practices that can help simplify and improve the life of any developer who uses Jupyter tools. It’s often possible to share best practices across multiple industries, since the fundamentals remain the same. Logically, data scientists, ML researchers, and developers using Jupyter Notebooks should carry over the best practices already established by the older fields of computer science and scientific research. Here is a list of best practices adopted by those communities, with a focus on those that still apply today.

Our Jupyter Notebooks development manifesto

0. There should be an easy way to use Jupyter Notebooks in your organization, where you can “just write code” within seconds.
1. Follow established software development best practices: OOP, style guides, documentation.
2. Institute version control for your notebooks.
3. Reproducible notebooks.
4. Continuous integration (CI).
5. Parameterized notebooks.
6. Continuous deployment (CD).
7. Log all experiments automatically.

Note: Security is a critical part of software development practices. This post does not cover best practices for secure software development with Jupyter Notebooks; we will cover that critical topic in a future post.

Principles

Easy access to Jupyter Notebooks

Creating and using a new Jupyter Notebook instance should be very easy. On Google Cloud Platform (GCP), we just launched a new service called AI Platform Notebooks, a managed service that offers an integrated JupyterLab environment in which you can create instances running JupyterLab, pre-installed with the latest data science and machine learning frameworks, in a single click.

Follow established software development best practices

This is essential. A Jupyter notebook is just a new development environment for writing code, and all the best practices of software development still apply:

Version control and code review systems (e.g., git, mercurial).
Separate environments: split production and development artifacts.
A comprehensive test suite (e.g., unit tests, doctests) for your Jupyter Notebooks — see the sketch after this list.
Continuous integration (CI) for faster development: automate the compilation and testing of Jupyter notebooks every time a team member commits changes to version control.
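As a minimal illustration of the test-suite bullet above, a notebook cell can carry its own doctests. The function and values here are hypothetical, not from the original post:

```python
import doctest

def normalize(values):
    """Scale a list of numbers to the [0, 1] range.

    >>> normalize([0, 5, 10])
    [0.0, 0.5, 1.0]
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# In a notebook cell, this runs every doctest defined above and reports
# failures inline -- a lightweight first step toward a real test suite.
doctest.testmod()
```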
Just as an Android developer would need to follow the above best practices to build a scalable and successful mobile app, a Jupyter notebook focused on sustainable data science should follow them, too.

Using a version control system with your Jupyter Notebooks

Version control systems record changes to your code over time, so that you can revisit specific versions later. They also let you develop separate branches in parallel, perform code reviews, and use the CI/CD revision history to know who is the expert in certain code areas.

To unblock effective use of a version control system like git, there should be a tool well integrated into the Jupyter UI that allows every data scientist on your team to effectively resolve conflicts for the notebook, view the history for each cell, and commit and push particular parts of the notebook to your notebook’s repository right from the cell.

Don’t worry, though: if you perform a diff operation in git and suddenly see that multiple lines have changed instead of one, this is the intended behavior, as of today. With Jupyter notebooks, there is a lot of metadata that can change with a simple one-line edit, including the kernel spec, execution info, and visualization parameters. To apply the principles and corresponding workflows of traditional version control to Jupyter notebooks, you need the help of two additional tools:

nbdime: a tool for diffing and merging Jupyter notebooks
jupyterlab-git: a JupyterLab extension for version control using git

In this demo, we clone a GitHub repository and, once that step is completed, modify some minor parts of the code. If you execute a diff command, you would normally expect git to show only the lines that changed, but as explained above, this is not true for Jupyter notebooks. nbdime lets you perform a diff from the Jupyter notebook and also from the CLI, without the distraction of extraneous JSON output.

Reproducible notebooks

You and your team should write notebooks in such a way that anyone can rerun them on the same inputs and produce the same outputs. A notebook should be executable from top to bottom and should contain the information required to set up the correct, consistent environment.

How do you do it? If you are using AI Platform Notebooks, for example on the TensorFlow M22 image, this platform information should be embedded in your notebook’s metadata for future use. Say you create a notebook and install TensorFlow’s nightly version: if you execute the same notebook on a different Compute Engine instance, you need to make sure that this dependency is already installed. A notebook should have a notion of dependencies, tracked appropriately either in the environment or in the notebook metadata.

In summary, a notebook is reproducible if it meets the following requirements:

The Compute Engine image and underlying hardware used for creating the notebook should be embedded in the notebook itself.
All dependencies should be installed by the notebook itself.
A notebook should be executable from top to bottom without any errors (the sketch below shows one way to check this automatically).
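One way to check that last requirement mechanically is a short script built on nbclient, which executes a notebook top to bottom and fails loudly on the first broken cell. This is a sketch under our own assumptions, not part of the original demo; the file names are hypothetical:

```python
import nbformat
from nbclient import NotebookClient

# Load the notebook (hypothetical file name) and execute every cell in
# order, exactly as "Restart & Run All" would in the Jupyter UI.
nb = nbformat.read("analysis.ipynb", as_version=4)
client = NotebookClient(nb, timeout=600, kernel_name="python3")
client.execute()  # raises nbclient.exceptions.CellExecutionError on failure

# Persist the executed notebook, outputs included, as a build artifact.
nbformat.write(nb, "analysis-executed.ipynb")
```

Dropped into a CI job, a script like this turns "executable from top to bottom" from a convention into an enforced check.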
In this demo, we clone a GitHub repository that contains a few notebooks, and then activate the new Nova plugin, which allows you to execute notebooks directly from your Jupyter UI. Nova and its corresponding compute workload run on a separate Compute Engine instance using Nteract papermill. AI Platform Notebooks support this plugin by default—to enable it, run the enable_notebook_submission.sh script.

Continuous integration

Continuous integration is a software development practice that requires developers to integrate code into a shared repository. Each check-in is verified by an automated build system, allowing teams to detect problems at early stages. Each change to a Jupyter notebook should be validated by a continuous integration system before being checked in; this can be done using different setups (a non-master remote branch, remote execution in a local branch, etc.).

In this demo, we modify a notebook so that it contains invalid Python code, and then we commit the results to git. This particular git repository is connected to Cloud Build. The notebook executes and the commit step fails as the engine finds an invalid cell at runtime. Cloud Build creates a new notebook to help you troubleshoot your mistake. Once you correct the code, you’ll find that your notebook runs successfully, and Cloud Build can then integrate your code.

Parameterized notebooks

Reusability of code is another software development best practice. You can think of a production-grade notebook as a function or a job specification: a notebook takes a series of inputs, processes them, and generates some outputs—consistently. If you’re a data scientist, you might start running grid search to find your model’s optimal hyperparameters for training, stepping through different parameters such as learning rate, num_steps, or batch_size.

During notebook execution, you can pass different parameters to your models and, once results are generated, pick the best options using the same notebook. For these execution steps, consider using Papermill and its ability to configure different parameters, which the notebook then uses during execution. This means you can override the default source of data for training, or submit the same notebook with different inputs (for example, a different learning rate, number of epochs, etc.).

In this demo, we execute a notebook passing different extra parameters. Here we’re using information about bike rentals in San Francisco, with the bike rental data stored in BigQuery. The notebook queries the data and generates a top-ten list and a station map of the most popular bike rental stations, using start and end dates as parameters. By tagging the relevant cells with a “parameters” tag so Papermill can use these options, you can reuse your notebook without making any updates to it, yet still generate a different dashboard.
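The core Papermill call behind a run like that looks roughly as follows; the notebook names and parameter names here are placeholders, not the actual demo files:

```python
import papermill as pm

# Execute the notebook with injected parameters. Papermill copies the
# source notebook, overrides the cell tagged "parameters", and writes
# the fully executed result to the output path.
pm.execute_notebook(
    "bike_rentals.ipynb",            # hypothetical input notebook
    "bike_rentals_june_2019.ipynb",  # output notebook with results
    parameters=dict(start_date="2019-06-01", end_date="2019-06-30"),
)
```

Because each run writes a separate output notebook, the same source notebook can produce as many dated dashboards as you have parameter sets.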
Continuous deployment

Each version of a Jupyter notebook that has passed all the tests should be used to automatically generate a new artifact and deploy it to staging and production environments. In this demo, we show you how to perform continuous deployment on GCP, incorporating Cloud Functions, Cloud Pub/Sub, and Cloud Scheduler.

Now that you’ve established a CI system that generates a tested, reproducible, and parameterized notebook, let’s automate the generation of artifacts for a continuous deployment system. Building on the previous CI system, there is an additional step that uploads a payload to Cloud Functions when tests are successful. When triggered, this payload sends the same artifact build request, with parameters, to Cloud Build, spinning up the instance and storing the results.

To add the automation, we orchestrate using Cloud Pub/Sub (message passing) and Cloud Scheduler (cron). The first time the cloud function is deployed, it creates a new Pub/Sub topic and subscribes to it; thereafter, any published message starts the cloud function. This notification is published by Cloud Scheduler, which sends messages based on time. Cloud Scheduler can use different triggers, for example new data arriving in Cloud Storage or a manual job request.

Log all experiments

Every time you try to train a model, metadata about the training session should be automatically logged. You’ll want to keep track of things like the code you ran, hyperparameters, data sources, results, and training time. This way, you remember past results and won’t find yourself wondering whether you already ran that experiment.

Conclusion

By following the guidelines defined above, you can make your Jupyter notebook deployments more efficient. To learn more, read our AI Platform Notebooks overview.

Acknowledgements: Gonzalo Gasca Meza, Developer Programs Engineer, and Karthik Ramachandran, Product Manager, contributed to this post.
Quelle: Google Cloud Platform

Are you up for the challenge? Get Google Cloud Certified in 3 months

There’s no doubt that cloud skills are in demand. Google Cloud skills are in especially high demand, with a 66.74% increase in job listings over the past year, which is why we rolled out four new certifications at the beginning of this year. So today we’re excited to announce that we are reaffirming our commitment to prepare millions of workers to thrive in a cloud-first world by launching the Google Cloud certification challenge, available in 25 countries (details at the bottom).

By signing up for the certification challenge, you’ll get access to a series of free learning resources on Qwiklabs and Coursera to sharpen your cloud architecture knowledge. You’ll also receive additional tips and resources to help prepare you for the Google Cloud Certified Associate Cloud Engineer or Professional Cloud Architect exam. If you successfully certify within 12 weeks of starting the certification challenge, we’ll send you a $100 Google Store voucher to redeem toward the product of your choice.

Why get Google Cloud certified?

Cloud certifications are a great way to demonstrate your skills to the larger IT market. Not only does certification validate your cloud skills and experience to recruiters, it demonstrates your value to your current employer. Getting certified can open up opportunities to progress within your company and could help in the next review of your compensation package. For example, the Google Cloud Professional Cloud Architect certification debuted at number one on the top-paying certifications list in the 2019 Global Knowledge survey.

Hear from the Google Cloud certified community

Here’s what a few community members had to say about the certification.

Sign up for the certification challenge today

Visit our certification challenge site to sign up, and start thinking about how you’ll spend that $100! We’ll be cheering you on.

Qualifying countries for the certification challenge are: U.S., Canada, Puerto Rico, Australia, Hong Kong, Japan, New Zealand, Singapore, Taiwan, Austria, Belgium, Denmark, Finland, France, Germany, Ireland, Italy, South Korea, Netherlands, Norway, Portugal, Spain, Sweden, Switzerland, and the U.K.
Quelle: Google Cloud Platform