Apache Hadoop has become an established and long-running framework for distributed storage and data processing. Google's Cloud Dataproc is a fast, easy-to-use, fully managed cloud service for running Apache Spark and Apache Hadoop clusters in a simple, cost-efficient way. With Cloud Dataproc, you can set up a distributed storage platform without worrying about the underlying infrastructure. But what if you want to train TensorFlow workloads directly on your distributed data store?

This post explains how to set up a Hadoop cluster with TonY (TensorFlow on YARN), LinkedIn's open-source project for running deep learning jobs on Hadoop. You will deploy a Hadoop cluster using Cloud Dataproc and use TonY to launch a distributed machine learning job. We'll explore how you can use two of the most popular machine learning frameworks: TensorFlow and PyTorch.

TensorFlow supports distributed training, allowing portions of the model's graph to be computed on different nodes. This distributed property can be used to split up computation and run it on multiple servers in parallel. Orchestrating distributed TensorFlow is not a trivial task, and not something that all data scientists and machine learning engineers have the expertise, or desire, to do, particularly since it must be done manually. TonY provides a flexible and sustainable way to bridge the gap between the analytics power of distributed TensorFlow and the scaling power of Hadoop. With TonY, you no longer need to configure your cluster specification manually, a task that can be tedious, especially for large clusters.

The components of our system:

### Apache Hadoop

Apache Hadoop is an open-source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. Hadoop services provide data storage, data processing, data access, data governance, security, and operations.

### Cloud Dataproc

Cloud Dataproc is a managed Spark and Hadoop service that lets you take advantage of open-source data tools for batch processing, querying, streaming, and machine learning. Cloud Dataproc's automation capability helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them. With less time and money spent on administration, you can focus on your jobs and your data.

### TonY

TonY is a framework that enables you to natively run deep learning jobs on Apache Hadoop. It currently supports TensorFlow and PyTorch. TonY enables running either single-node or distributed training as a Hadoop application. This native connector, together with other TonY features, runs machine learning jobs reliably and flexibly.

## Installation

### Set up a Google Cloud Platform project

Get started on Google Cloud Platform (GCP) by creating a new project, using the instructions found here.

### Create a Cloud Storage bucket

Then create a Cloud Storage bucket. Reference here.

### Create a Hadoop cluster via Cloud Dataproc using initialization actions

You can create your Hadoop cluster directly from Cloud Console or via an appropriate `gcloud` command. When creating the cluster, you can pass TonY's initialization-action script, which Cloud Dataproc runs on all nodes immediately after the cluster is set up. The following command initializes a cluster that consists of 1 master and 2 workers.
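As a sketch (the cluster name and machine types here are placeholders, and the initialization-action path is an assumption about where you have staged the TonY script):

```bash
# Sketch: substitute your own cluster name, machine types, and the
# Cloud Storage path where the TonY initialization action is staged.
# Depending on your gcloud version, you may also need to pass --region.
gcloud dataproc clusters create tony-staging \
    --image-version=1.3-deb9 \
    --num-workers=2 \
    --master-machine-type=n1-standard-8 \
    --worker-machine-type=n1-standard-8 \
    --initialization-actions=gs://<YOUR_BUCKET>/tony/tony.sh
```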
Note: Use Cloud Dataproc version 1.3-deb9, which is supported for this deployment. Cloud Dataproc version 1.3-deb9 provides Hadoop version 2.9.0. Check this version list for details.

Once your cluster is created, you can verify under Cloud Console > Big Data > Cloud Dataproc > Clusters that installation completed and your cluster's status is Running. Go to Cloud Console > Big Data > Cloud Dataproc > Clusters and select your new cluster: you will see the Master and Worker nodes.

### Connect to your Cloud Dataproc master server via SSH

Click on SSH and connect remotely to the Master server.

### Verify that your YARN nodes are active

For example, running `yarn node -list` on the master should show all worker nodes in the RUNNING state.

### Installing TonY

TonY's Cloud Dataproc initialization action will do the following:

- Install and build TonY from the GitHub repository.
- Create a sample folder containing TonY examples for the following frameworks: TensorFlow and PyTorch.

The initialization action creates two folders on the cluster: the TonY install folder (referred to below as TONY_INSTALL_FOLDER) and the TonY samples folder (TONY_SAMPLES_FOLDER). The TonY samples folder provides two examples for running distributed machine learning jobs:

- TensorFlow MNIST example
- PyTorch MNIST example

## Running a TensorFlow distributed job

### Launch a TensorFlow training job

You will launch the Dataproc job using a `gcloud` command. The folder structure created during installation in TONY_SAMPLES_FOLDER contains a sample Python script for running the distributed TensorFlow job.

This is a basic MNIST model, but it serves as a good example of using TonY with distributed TensorFlow. This MNIST example uses "data parallelism," by which you use the same model in every device, using different training samples to train the model in each device. There are many ways to specify this structure in TensorFlow, but in this case we use "between-graph replication" with `tf.train.replica_device_setter`.

### Dependencies

- TensorFlow version 1.9

Note: If you require a more recent TensorFlow and TensorBoard version, take a look at the progress of this issue to be able to upgrade to the latest TensorFlow version.

### Connect to Cloud Shell

Open Cloud Shell via the console UI. Use a `gcloud` command like the one below to create a new job. Once launched, you can monitor the job. (See the section below on where to find the job monitoring dashboard in Cloud Console.)
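A sketch of the submission command follows. The jar version, sample paths, and file names shown here are placeholders; check the actual values laid down by the initialization action in TONY_INSTALL_FOLDER and TONY_SAMPLES_FOLDER on your cluster:

```bash
# Sketch: replace <TONY_INSTALL_FOLDER>, <TONY_SAMPLES_FOLDER>, the jar
# version, and file names with the values actually present on your cluster.
gcloud dataproc jobs submit hadoop \
    --cluster=tony-staging \
    --class=com.linkedin.tony.cli.ClusterSubmitter \
    --jars=file://<TONY_INSTALL_FOLDER>/tony-cli-x.y.z-all.jar \
    -- \
    --python_venv=<TONY_SAMPLES_FOLDER>/deps/tf.zip \
    --python_binary_path=tf/bin/python3.5 \
    --src_dir=<TONY_SAMPLES_FOLDER>/jobs/TFJob/src \
    --executes=mnist_distributed.py \
    --task_params="--data_dir /tmp/data --working_dir /tmp/model" \
    --conf_file=<TONY_SAMPLES_FOLDER>/jobs/TFJob/tony.xml
```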
## Running a PyTorch distributed job

### Launch your PyTorch training job

For PyTorch as well, you launch your Cloud Dataproc job using a `gcloud` command. The folder structure created during installation in TONY_SAMPLES_FOLDER contains a sample script for running the distributed PyTorch job.

### Dependencies

- PyTorch version 0.4
- torchvision 0.2.1

### Launch a PyTorch training job
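The submission mirrors the TensorFlow case; again, treat the jar version, paths, and file names as placeholders to be replaced with what the initialization action actually installed on your cluster:

```bash
# Sketch: as above, substitute the actual jar and sample paths from your cluster.
gcloud dataproc jobs submit hadoop \
    --cluster=tony-staging \
    --class=com.linkedin.tony.cli.ClusterSubmitter \
    --jars=file://<TONY_INSTALL_FOLDER>/tony-cli-x.y.z-all.jar \
    -- \
    --python_venv=<TONY_SAMPLES_FOLDER>/deps/pytorch.zip \
    --python_binary_path=pytorch/bin/python3.5 \
    --src_dir=<TONY_SAMPLES_FOLDER>/jobs/PTJob/src \
    --executes=mnist_pytorch.py \
    --task_params="--root /tmp/data" \
    --conf_file=<TONY_SAMPLES_FOLDER>/jobs/PTJob/tony.xml
```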
### Verify your job is running successfully

You can track job status from the Dataproc Jobs tab: navigate to Cloud Console > Big Data > Dataproc > Jobs.

### Access your Hadoop UI

You can also track jobs in the Hadoop UI by browsing to your Cloud Dataproc master node on port 8088: http://<Node_IP>:8088. Please take a look at this section to see how to access the Cloud Dataproc UI.

## Cleanup resources

Delete your Cloud Dataproc cluster when you are finished, for example with `gcloud dataproc clusters delete <cluster-name>`.

## Conclusion

Deploying TensorFlow on YARN enables you to train models straight from the data infrastructure that lives in HDFS and Cloud Storage. If you'd like to learn more about some of the related topics mentioned in this post, feel free to check out the following documentation links:

- Machine Learning with TensorFlow on GCP
- Hyperparameter tuning on GCP
- How to train ML models using GCP

Acknowledgements: Anthony Hsu, LinkedIn Software Engineer; and Zhe Zhang, LinkedIn Core Big Data Infra team manager.

Source: Google Cloud Platform