A Tech News Site Has Been Using AI To Write Articles, So We Did The Same Thing Here
BuzzFeed News would like to thank ChatGPT.
Quelle: BuzzFeed
As a Google Cloud practitioner, you typically spend a lot of time in the documentation reading guides, commands, tutorials, and more. The documentation team has introduced several features over the years to make it easier to be productive while working with Google Cloud documentation. A few of these tips will be well known to some of you, but I hope there is at least one tip you go away with that helps you. In no particular order, here is my personal list of tips that I have found useful.

Interactive Tutorials or Walkthroughs

This is an excellent feature of the documentation: an interactive tutorial opens up right in the Google Cloud Console, and you complete it as a sequence of steps. Several tutorials are available from the Google Cloud Console via the Support icon in the top action bar.

Search

The Search bar at the top of the Google Cloud console is an efficient way to search for product services, documentation pages, tutorials, and even Google Cloud resources (e.g. Compute Engine VM names). While you can locate a specific product page from the hamburger menu on the top left and the subsequent left-navigation bar, the Search bar is probably the quickest way to get to a product. (Extra points to the power users who have used the "pin" feature to lock frequently used products at the top of the list in the left-navigation bar.) Here is a screencast demonstrating how to search for a specific product. You will notice that it's not just about going to a specific product; the results are also grouped into different sections (Tutorials, Google Cloud Resources, etc.). If you would like to look straight away at all the products and their related documentation, check out the View all Products link in the left-navigation bar. The screencast below demonstrates that.

Need more tutorials, Quickstarts, and reference guides?

You have probably noticed that as you navigate across the documentation, there is a list of tutorials, Quickstarts, and reference guides available for each product. There are a couple of ways I use to get more information on a specific product. First up, some of our product pages have a Learn icon; here is a sample of the Compute Engine product home page. Click the Learn button to get access to a set of related documentation for the product. At times, I want to try out a few more interactive tutorials (walkthroughs). As we saw earlier, the Support icon in the top action bar gives you access to some interactive tutorials via the Start a tutorial link. That list is limited, and you can find the other available interactive tutorials as follows. Let's say you are interested in learning more about IAM and want to check out the interactive tutorials available for this service. Go to the main Search bar at the top and enter IAM. This presents a list of search results as we saw earlier; you will notice a few results under the Documentation and Tutorials sections as shown above. The keyword here is Interactive Tutorial. If you click See more results, you land on a search results page where you can filter down to interactive tutorials only.

Saving your favorite documentation pages

At the top of each documentation page, you will see a Bookmark icon that you can click to save the page to your collection of documentation pages, which you can then reference easily from your Google profile. For example, here is a documentation page on how to create and start a VM instance in Compute Engine. I wish to bookmark this document, so all I need to do is click the Bookmark icon as shown below. You can choose to save it to My saved pages or create a new collection and save it there. In my case, I created a new collection named Compute Engine and chose to bookmark this page under it. How do you access all your bookmarked pages? On the top bar, next to your Google profile picture, you will see a set of three dots; click on that. This gives you a way to visit the Google Developer Profile associated with that account, and one of the options, as you can see below, is Saved pages. When you visit that page, you will see your saved pages as shown below. You can tap on any of the collections you have created, and all your bookmarks will be available under it.

Providing Feedback

Your feedback is valuable, and Google Cloud documentation makes it easy for you to submit it. Notice the Send feedback button on the documentation pages. Click it to give us feedback on the specific page or on the particular product documentation in general.

Interactive Code samples

This one continues to be one of my favorites, and it boosts developer productivity by multiple levels, especially when you are trying out the various gcloud commands. The specific feature is the use of placeholder variables in the commands, e.g. project ID, region, etc., that you need to repeat across a series of commands. The feature is well over two years old and has been well documented in the following blog post. I reproduce a screencast of it here, along with the text from that blog post describing the feature: "If a page has multiple code samples with the same placeholder variable, you only need to replace the variable once. For example, when you replace a PROJECT_ID variable with your own Google Cloud project ID, all instances of the PROJECT_ID variable (including in any other command line samples on the page) will use the same Google Cloud project ID."
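To make the quoted behavior concrete, here is a hypothetical pair of commands of the kind a documentation page might show. PROJECT_ID is the placeholder: replace it in one sample and every other sample on the page that uses it picks up the same value (the commands themselves are illustrative and not taken from any specific page).

    # Replace PROJECT_ID once on the page; the other samples reuse the value.
    gcloud config set project PROJECT_ID
    gcloud compute instances list --project=PROJECT_ID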
Hope this set of tips was useful to you. If you would like to try out an interactive tutorial, try the Compute Engine quickstart. I am sure you have a list of your own tips that you have found useful while working with Google Cloud documentation. Do reach out on Twitter (@iRomin) with them; I'd love to hear about them.
Quelle: Google Cloud Platform
According to a court ruling, employees are not obliged to read text messages from their employer during their free time. (Arbeit, Wirtschaft)
Quelle: Golem
Suzuki is taking a shot at an electric car with the eVX SUV and has initially presented a concept study. (Suzuki, Elektroauto)
Quelle: Golem
Euro NCAP has rated cars in various safety-relevant areas. Two new Chinese manufacturers take top spots. (Elektromobilität, Elektroauto)
Quelle: Golem
For the time being, some Paramount film classics will only be available with a Paramount+ subscription. An exclusive report by Ingo Pakalski. (Prime Video, Amazon)
Quelle: Golem
With fresh investor money, the German company DeepL is preparing for a "new age of artificial intelligence." (DeepL, KI)
Quelle: Golem
At WordPress.com we're always looking for ways to make building and running your website simpler and more impactful (and more fun!).
One of the biggest challenges for any site owner is finding your readers, fans, customers, or subscribers. Until now, promoting your WordPress.com website required multiple tools, online accounts, professional design and marketing skills, and – yes – lots of money.
That’s why we’re excited to announce Blaze, a new tool allowing anyone with a WordPress blog to advertise on WordPress.com and Tumblr in just a few clicks. How? By turning your site content into clean, compelling ads that run across our millions-strong network of blogs.
Create an ad today!
How Blaze works
If your website is hosted on WordPress.com, then head to wordpress.com/advertising and select your website — you’ll see a list of recent posts and pages you can promote. If your WordPress site isn’t hosted on WordPress.com, you can take advantage of Blaze through the Jetpack plugin.
Alternatively, when viewing the post or page list in your WordPress.com dashboard, click the ellipsis (three dots) next to any individual post or page to bring up a new menu, then click "Promote with Blaze."
Now you’ll be in the Blaze Campaign Wizard.
Step 1: Design your ad. The wizard automatically formats your content into a beautiful ad, but you can adjust the image and text however you like.
Step 2: Select your audience. Want to target the whole world? Only people in certain areas? Folks who are reading content about a specific category, like movies or sports? As you adjust these settings, you’ll see our estimate of how many people you’ll reach.
Step 3: Select your dates and set your budget. Run your ad for 6 months or for just a few days — it’s up to you.
Step 4: Finish and pay. We may offer the lowest ad prices in the industry, but we also protect your content with a system backed by Verity and Grapeshot. So rest easy knowing that your ads will only show up where they’re supposed to — and nowhere you’d feel strange about.
Once your ad is running, you can check how it’s doing in the “Campaigns” tab of the advertising page.
Our campaigns are billed weekly based on how many times your ad is shown, so you’ll only ever pay for what you signed up for. As always, you can find even more details about this tool on our support page.
This feature is currently only available to users with “English” set as their primary language, but we’re working hard on bringing it to other languages as well.
Try Blaze today
Let us know what you think about Blaze!
We're excited to launch this powerful new feature, and we're eager to get feedback. If you have any questions about it, challenges while using it, or ideas to make it better, please share them with our team. A real person will read through all of the feedback, and we'll work tirelessly to make this tool valuable for you.
Quelle: RedHat Stack
Dataproc is a fully managed service for hosting open-source distributed processing platforms such as Apache Hive, Apache Spark, Presto, Apache Flink, and Apache Hadoop on Google Cloud. Dataproc provides the flexibility to provision and configure clusters of varying sizes on demand. In addition, Dataproc has powerful features that enable your organization to lower costs, increase performance, and streamline operational management of workloads running on the cloud. Dataproc is an important service in any data lake modernization effort. Many customers begin their journey to the cloud by migrating their Hadoop workloads to Dataproc and continue to modernize their solutions by incorporating the full suite of Google Cloud's data offerings.

This guide demonstrates how you can optimize Dataproc job stability, performance, and cost-effectiveness. You can achieve this by using a workflow template to deploy a configured ephemeral cluster that runs a Dataproc job with calculated application-specific properties.

Before you begin

Prerequisites:
- A Google Cloud project
- A 100-level understanding of Dataproc (FAQ)
- Experience with shell scripting, YAML templates, and the Hadoop ecosystem
- An existing Dataproc application, referred to as "the job" or "the application"
- Sufficient project quotas (CPUs, disks, etc.) to create clusters

Consider Dataproc Serverless or BigQuery

Before getting started with Dataproc, determine whether your application is suitable for (or portable to) Dataproc Serverless or BigQuery. These managed services save you time spent on maintenance and configuration. This blog assumes you have identified Dataproc as the best choice for your scenario. For more information about the other solutions, check out some of our other guides, such as Migrating Apache Hive to BigQuery and Running an Apache Spark Batch Workload on Dataproc Serverless.

Separate data from computation

Consider the advantages of using Cloud Storage. Using this persistent storage for your workflows has the following advantages:
- It's a Hadoop Compatible File System (HCFS), so it's easy to use with your existing jobs.
- Cloud Storage can be faster than HDFS. In HDFS, a MapReduce job can't start until the NameNode is out of safe mode, a process that can take a few seconds to many minutes, depending on the size and state of your data.
- It requires less maintenance than HDFS.
- It enables you to easily use your data with the whole range of Google Cloud products.
- It's considerably less expensive than keeping your data in replicated (3x) HDFS on a persistent Dataproc cluster.

Pricing comparison examples (North America, as of 11/2022):
- GCS: $0.004 – $0.02 per GB, depending on the tier
- Persistent Disk: $0.04 – $0.34 per GB, plus compute VM costs

Here are some guides on Migrating On-Premises Hadoop Infrastructure to Google Cloud and HDFS vs. Cloud Storage: Pros, cons, and migration tips. Google Cloud has also developed an open-source tool for performing HDFS-to-GCS data migration.
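If you are moving existing HDFS data into Cloud Storage, a minimal sketch of such a copy looks like the following, assuming it runs on a cluster node with the Cloud Storage connector available (it ships with Dataproc) and that gs://my-migration-bucket is a placeholder bucket name:

    # Copy a directory from cluster-local HDFS into a Cloud Storage bucket.
    hadoop distcp hdfs:///user/my-app/data gs://my-migration-bucket/my-app/data

Running the copy from the cluster keeps the transfer close to the data and lets you validate the results in Cloud Storage before retiring the HDFS copy.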
Optimize your Cloud Storage

When using Dataproc, you can create external tables in Hive, HBase, etc., where the schema resides in Dataproc but the data resides in Google Cloud Storage. Separating compute and storage enables you to scale your data independently of compute power. In on-premises HDFS/Hive setups, compute and storage were closely tied together, either on the same machine or on a nearby machine. When using Google Cloud Storage instead of HDFS, you separate compute and storage at the expense of latency, because it takes time for Dataproc to retrieve files from Google Cloud Storage. Many small files (e.g. millions of files under 1 MB) can negatively affect query performance, and file type and compression can also affect query performance. When performing data analytics on Google Cloud, it is important to be deliberate in choosing your Cloud Storage file strategy.

Monitoring Dataproc Jobs

As you work through the following guide, you'll submit Dataproc jobs and continue to optimize runtime and cost for your use case. Monitor the Dataproc Jobs console during and after job submissions to get in-depth information on cluster performance. There you will find specific metrics that help identify opportunities for optimization, notably YARN Pending Memory, YARN NodeManagers, CPU Utilization, HDFS Capacity, and Disk Operations. Throughout this guide you will see how these metrics influence changes in cluster configurations.
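In addition to the console, the gcloud CLI is handy for checking on submitted jobs. A minimal sketch, assuming cluster-name and region are replaced with your values and JOB_ID with an ID taken from the list output:

    # List jobs that ran on a given cluster.
    gcloud dataproc jobs list --cluster=cluster-name --region=region

    # Inspect the status and driver output location of a specific job.
    gcloud dataproc jobs describe JOB_ID --region=region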
Guide: Run Faster and Cost-Effective Dataproc Jobs

1. Getting started

This guide demonstrates how to optimize the performance and cost of applications running on Dataproc clusters. Because Dataproc supports many big data technologies, each with its own intricacies, this guide is intended as trial-and-error experimentation. It begins with a generic Dataproc cluster with defaults set; as you proceed, you'll increasingly customize the cluster configuration to fit your specific workload.

Plan to separate Dataproc jobs into different clusters: each data processing platform uses resources differently, and platforms can impact each other's performance when run simultaneously. Even better, isolating single jobs to single clusters sets you up for ephemeral clusters, where jobs can run in parallel on their own dedicated resources.

Once your job is running successfully, you can safely iterate on the configuration to improve runtime and cost, falling back to the last successful run whenever experimental changes have a negative impact.

You can export an existing cluster's configuration to a file during experimentation and use that configuration to create new clusters through the import command:

    gcloud dataproc clusters export my-cluster \
        --region=region \
        --destination=my-cluster.yaml

    gcloud dataproc clusters import my-new-cluster \
        --region=us-central1 \
        --source=my-cluster.yaml

Keep these files as a reference to the last successful configuration in case drift occurs.

2. Calculate Dataproc cluster size

a. Via on-prem workload (if applicable)

View the YARN UI. If you've been running this job on-premises, you can identify the resources used for the job in the YARN UI. The image below shows a Spark job that ran successfully on-prem, and the table below lists the key performance indicators for the job, from which you can calculate the required cluster size. Now that you have the on-prem cluster sizing, the next step is to identify the initial cluster size on Google Cloud.

Calculate initial Dataproc cluster size

For this exercise, assume you are using n2-standard-8, although a different machine type might be more appropriate depending on the type of workload. n2-standard-8 has 8 vCPUs and 32 GiB of memory. View the other Dataproc-supported machine types here. Calculate the number of machines required based on the number of vCores required, and take note of the equivalent calculations for your own job or workload.
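As a worked example with assumed numbers (your own YARN figures will differ): suppose the on-prem job peaked at 120 vCores and 480 GiB of memory.

    # Hypothetical on-prem peaks: 120 vCores, 480 GiB of memory.
    # n2-standard-8 provides 8 vCPUs and 32 GiB per worker.
    #   workers needed by vCPU:   120 / 8  = 15
    #   workers needed by memory: 480 / 32 = 15
    # Start with roughly 15 primary workers and refine from there.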
b. Via an autoscaling cluster

Alternatively, an autoscaling cluster can help determine the right number of workers for your application. This cluster has an autoscaling policy attached. Set the autoscaling policy min/max values to whatever your project or organization allows, then run your jobs on this cluster. Autoscaling will continue to add nodes until the YARN pending memory metric reaches zero. A perfectly sized cluster minimizes the amount of YARN pending memory while also minimizing excess compute resources.

Deploying a sizing Dataproc cluster

Example:
- 2 primary workers (n2-standard-8)
- 0 secondary workers (n2-standard-8)
- pd-standard 1000 GB
- Autoscaling policy: 0 min, 100 max
- No application properties set

sample-autoscaling-policy.yml:

    workerConfig:
      minInstances: 2
      maxInstances: 2
    secondaryWorkerConfig:
      minInstances: 0
      maxInstances: 100
    basicAlgorithm:
      cooldownPeriod: 5m
      yarnConfig:
        scaleUpFactor: 1.0
        scaleDownFactor: 1.0
        gracefulDecommissionTimeout: 1h

    gcloud dataproc autoscaling-policies import policy-name \
        --source=sample-autoscaling-policy.yml \
        --region=region

    gcloud dataproc clusters create cluster-name \
        --master-machine-type=n2-standard-8 \
        --worker-machine-type=n2-standard-8 \
        --master-boot-disk-type=pd-standard \
        --master-boot-disk-size=1000GB \
        --autoscaling-policy=policy-name \
        --region=region

Submitting jobs to the Dataproc cluster:

    gcloud dataproc jobs submit spark \
        --cluster=cluster-name \
        --region=region \
        --jar=<your-spark-jar-path> \
        --properties='spark.executor.cores=5,spark.executor.memory=4608mb' \
        -- arg1 arg2

Monitoring Worker Count / YARN NodeManagers

Observe the peak number of workers required to complete your job. To calculate the number of required cores, multiply the machine size (2, 8, 16, etc.) by the peak number of NodeManagers.

3. Optimize Dataproc cluster configuration

Using a non-autoscaling cluster during this experimentation phase can lead to the discovery of more accurate machine types, persistent disks, application properties, etc. For now, build an isolated non-autoscaling cluster for your job that has the optimized number of primary workers.

Example:
- N primary workers (n2-standard-8)
- 0 secondary workers (n2-standard-8)
- pd-standard 1000 GB
- No autoscaling policy
- No application properties set

Deploying a non-autoscaling Dataproc cluster:

    gcloud dataproc clusters create cluster-name \
        --master-machine-type=n2-standard-8 \
        --worker-machine-type=n2-standard-8 \
        --master-boot-disk-type=pd-standard \
        --master-boot-disk-size=1000GB \
        --region=region \
        --num-workers=x

Choose the right machine type and machine size

Run your job on this appropriately sized non-autoscaling cluster. If the CPU is maxing out, consider using the C2 machine type. If memory is maxing out, consider using N2D-highmem machine types. Prefer smaller machine types (e.g. switch n2-highmem-32 to n2-highmem-8); it's okay to have clusters with hundreds of small machines. For Dataproc clusters, choose the smallest machine with maximum network bandwidth (32 Gbps); typically these machines are n2-standard-8 or n2d-standard-16. On rare occasions you may need to increase the machine size to 32 or 64 cores, for example if your organization is running low on IP addresses or you have heavy ML or processing workloads. Refer to the Machine families resource and comparison guide in the Compute Engine documentation for more information.

Submitting jobs to the Dataproc cluster:

    gcloud dataproc jobs submit spark \
        --cluster=cluster-name \
        --region=region \
        --jar=<your-spark-jar-path> \
        -- arg1 arg2

Monitoring cluster metrics: monitor memory and CPU utilization to validate your machine-type choice.

Choose the right persistent disk

If you're still observing performance issues, consider moving from pd-standard to pd-balanced or pd-ssd:
- Standard persistent disks (pd-standard) are best for large data processing workloads that primarily use sequential I/O. For pd-standard without local SSDs, we strongly recommend provisioning 1 TB (1000 GB) or larger to ensure consistently high I/O performance.
- Balanced persistent disks (pd-balanced) are an alternative to SSD persistent disks that balance performance and cost. With the same maximum IOPS as SSD persistent disks and lower IOPS per GB, a balanced persistent disk offers performance levels suitable for most general-purpose applications at a price point between that of standard and SSD persistent disks.
- SSD persistent disks (pd-ssd) are best for enterprise applications and high-performance database needs that require lower latency and more IOPS than standard persistent disks provide.

For similar costs, pd-standard 1000 GB == pd-balanced 500 GB == pd-ssd 250 GB. Be certain to review the performance impact when configuring disks. See Configure Disks to Meet Performance Requirements for information on disk I/O performance, and Machine Type Disk Limits for information on the relationships between machine types and persistent disks. If you are using machines with 32 cores or more, consider switching to multiple local SSDs per node to get enough performance for your workload.

Monitor HDFS Capacity to determine disk size; if HDFS Capacity ever drops to zero, you'll need to increase the persistent disk size. If you observe any throttling of disk bytes or disk operations, you may need to change your cluster's persistent disks to balanced or SSD.
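Boot disk type is fixed at cluster creation, so trying pd-balanced or pd-ssd means recreating the experimental cluster. A minimal sketch with placeholder names, mirroring the earlier non-autoscaling example but with balanced disks on both master and workers:

    gcloud dataproc clusters create cluster-name \
        --master-machine-type=n2-standard-8 \
        --worker-machine-type=n2-standard-8 \
        --master-boot-disk-type=pd-balanced \
        --master-boot-disk-size=500GB \
        --worker-boot-disk-type=pd-balanced \
        --worker-boot-disk-size=500GB \
        --region=region \
        --num-workers=x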
Choose the right ratio of primary workers vs. secondary workers

Your cluster must have primary workers; if you create a cluster and do not specify the number of primary workers, Dataproc adds two. You must then decide whether to prioritize performance or cost optimization. If you prioritize performance, use 100% primary workers; if you prioritize cost optimization, specify the remaining workers as secondary workers. Primary worker machines are dedicated to your cluster and provide HDFS capacity. Secondary worker machines come in three types: spot VMs, standard preemptible VMs, and non-preemptible VMs. By default, secondary workers are created with the smaller of 100 GB or the primary worker boot disk size; this disk space is used for local caching of data and does not run HDFS. Be aware that secondary workers may not be dedicated to your cluster and may be removed at any time, so ensure that your application is fault-tolerant when using secondary workers.
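As a sketch of a cost-leaning but still majority-primary split, with placeholder names and counts chosen only for illustration:

    # 6 dedicated primary workers plus 4 secondary workers.
    # Secondary workers default to preemptible VMs.
    gcloud dataproc clusters create cluster-name \
        --worker-machine-type=n2-standard-8 \
        --num-workers=6 \
        --num-secondary-workers=4 \
        --region=region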
Consider attaching Local SSDs

Some applications may require higher throughput than persistent disks provide. In these scenarios, experiment with local SSDs. Local SSDs are physically attached to the cluster nodes and provide higher throughput than persistent disks (see the performance table). Local SSDs are available at a fixed size of 375 gigabytes, but you can add multiple SSDs to increase performance. Local SSDs do not persist data after a cluster is shut down. If persistent storage is desired, you can use SSD persistent disks, which provide higher throughput for their size than standard persistent disks. SSD persistent disks are also a good choice if the partition size will be smaller than 8 KB (however, avoid small partitions). As with persistent disks, continue to monitor any throttling of disk bytes or disk operations to determine whether local SSDs are appropriate.
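Local SSDs are requested per node at cluster creation time. A minimal sketch with placeholder names (two SSDs per worker is only an example, not a recommendation):

    # Attach two 375 GB local SSDs to each worker node.
    gcloud dataproc clusters create cluster-name \
        --worker-machine-type=n2-standard-8 \
        --num-workers=x \
        --num-worker-local-ssds=2 \
        --region=region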
Consider attaching GPUs

For even more processing power, consider attaching GPUs to your cluster. Dataproc provides the ability to attach graphics processing units (GPUs) to the master and worker Compute Engine nodes in a Dataproc cluster. You can use these GPUs to accelerate specific workloads on your instances, such as machine learning and data processing. GPU drivers are required to utilize any GPUs attached to Dataproc nodes; you can install them by following the instructions for this initialization action.

Creating a cluster with GPUs:

    gcloud dataproc clusters create cluster-name \
        --region=region \
        --master-accelerator type=nvidia-tesla-k80 \
        --worker-accelerator type=nvidia-tesla-k80,count=4 \
        --secondary-worker-accelerator type=nvidia-tesla-k80,count=4 \
        --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/gpu/install_gpu_driver.sh

Sample cluster for a compute-heavy workload:

    gcloud dataproc clusters create cluster-name \
        --master-machine-type=c2-standard-30 \
        --worker-machine-type=c2-standard-30 \
        --master-boot-disk-type=pd-balanced \
        --master-boot-disk-size=500GB \
        --region=region \
        --num-workers=10

4. Optimize application-specific properties

If you're still observing performance issues, you can begin to adjust application properties. Ideally these properties are set on the job submission, isolating properties to their respective jobs. View the best practices for your application:
- Spark Job Tuning
- Hive Performance Tuning
- Tez Memory Tuning
- Performance and Efficiency in Apache Pig

Submitting Dataproc jobs with properties:

    gcloud dataproc jobs submit spark \
        --cluster=cluster-name \
        --region=region \
        --jar=my_jar.jar \
        --properties='spark.executor.cores=5,spark.executor.memory=4608mb' \
        -- arg1 arg2

5. Handle edge-case workload spikes via an autoscaling policy

Now that you have an optimally sized, configured, and tuned cluster, you can choose to introduce autoscaling. Autoscaling should not be viewed as a cost-optimization technique, because aggressive up- and down-scaling can lead to Dataproc job instability. However, conservative autoscaling can improve Dataproc cluster performance during edge cases that require more worker nodes.
- Use ephemeral clusters (see the next step) to allow clusters to scale up, and delete them when the job or workflow is complete.
- Ensure primary workers make up more than 50% of your cluster.
- Avoid autoscaling primary workers. Primary workers run HDFS DataNodes, while secondary workers are compute-only. HDFS's NameNode has multiple race conditions that can leave HDFS in a corrupted state where decommissioning gets stuck forever. Primary workers are more expensive but provide job stability and better performance; the ratio of primary to secondary workers is a tradeoff you can make between stability and cost. Note that having too many secondary workers can create job instability; best practice is to avoid having a majority of secondary workers.
- Prefer ephemeral, non-autoscaled clusters where possible. Allow these to scale up and delete them when jobs are complete.
- As stated earlier, avoid scaling down workers because it can lead to job instability. Set scaleDownFactor to 0.0 for ephemeral clusters.

Creating and attaching autoscaling policies

sample-autoscaling-policy.yml:

    workerConfig:
      minInstances: 10
      maxInstances: 10
    secondaryWorkerConfig:
      maxInstances: 50
    basicAlgorithm:
      cooldownPeriod: 4m
      yarnConfig:
        scaleUpFactor: 1.0
        scaleDownFactor: 0
        gracefulDecommissionTimeout: 0

    gcloud dataproc autoscaling-policies import policy-name \
        --source=sample-autoscaling-policy.yml \
        --region=region

    gcloud dataproc clusters update cluster-name \
        --autoscaling-policy=policy-name \
        --region=region

6. Optimize cost and reusability via ephemeral Dataproc clusters

There are several key advantages of using ephemeral clusters:
- You can use different cluster configurations for individual jobs, eliminating the administrative burden of managing tools across jobs.
- You can scale clusters to suit individual jobs or groups of jobs.
- You only pay for resources when your jobs are using them.
- You don't need to maintain clusters over time, because they are freshly configured every time you use them.
- You don't need to maintain separate infrastructure for development, testing, and production. You can use the same definitions to create as many different versions of a cluster as you need, when you need them.

Build a custom image

Once you have satisfactory cluster performance, you can begin to transition from a non-autoscaling cluster to an ephemeral cluster. Does your cluster have init scripts that install various software? Use Dataproc custom images. This allows you to create ephemeral clusters with faster startup times. Google Cloud provides an open-source tool to generate custom images.

Generate a custom image:

    git clone https://github.com/GoogleCloudDataproc/custom-images.git

    cd custom-images || exit

    python generate_custom_image.py \
        --image-name "<image-name>" \
        --dataproc-version 2.0-debian10 \
        --customization-script ../scripts/customize.sh \
        --zone zone \
        --gcs-bucket gs://"<gcs-bucket-name>" \
        --disk-size 50 \
        --no-smoke-test

Using custom images:

    gcloud dataproc clusters create cluster-name \
        --image=projects/<PROJECT_ID>/global/images/<IMAGE_NAME> \
        --region=region

    gcloud dataproc workflow-templates instantiate-from-file \
        --file ../templates/pyspark-workflow-template.yaml \
        --region region

Create a Workflow Template

To create an ephemeral cluster, you'll need to set up a Dataproc workflow template. A workflow template is a reusable workflow configuration; it defines a graph of jobs with information on where to run those jobs. Use the gcloud dataproc clusters export command to generate YAML for your cluster config:

    gcloud dataproc clusters export my-cluster \
        --region=region \
        --destination=my-cluster.yaml

Use this cluster config in your workflow template. Point it at your newly created custom image and your application, and add your job-specific properties.

Sample workflow template (with custom image):

    ---
    jobs:
      - pysparkJob:
          properties:
            spark.pyspark.driver.python: '/usr/bin/python3'
          args:
            - "arg1"
          mainPythonFileUri: gs://<path-to-python-script>
        stepId: step1
    placement:
      managedCluster:
        clusterName: cluster-name
        config:
          gceClusterConfig:
            zoneUri: zone
          masterConfig:
            diskConfig:
              bootDiskSizeGb: 500
            machineTypeUri: n1-standard-4
            imageUri: projects/<project-id>/global/images/<image-name>
          workerConfig:
            diskConfig:
              bootDiskSizeGb: 500
            machineTypeUri: n1-standard-4
            numInstances: 2
            imageUri: projects/<project-id>/global/images/<image-name>
          initializationActions:
            - executableFile: gs://<path-to-init-script>
              executionTimeout: '3600s'

Deploying an ephemeral cluster via a workflow template:

    gcloud dataproc workflow-templates instantiate-from-file \
        --file ../templates/pyspark-workflow-template.yaml \
        --region region
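You can also inspect registered templates and watch workflow runs from the CLI. A small sketch, assuming the same placeholder region:

    # List the workflow templates registered in the region.
    gcloud dataproc workflow-templates list --region=region

    # Workflow instantiations surface as operations; list them to check progress.
    gcloud dataproc operations list --region=region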
Dataproc workflow templates provide an orchestration solution for use cases such as:
- Automation of repetitive tasks
- A transactional, fire-and-forget API interaction model
- Support for ephemeral and long-lived clusters
- Granular IAM security

For broader data orchestration strategies, consider a more comprehensive data orchestration service like Cloud Composer.

Next steps

This post demonstrated how you can optimize Dataproc job stability, performance, and cost-effectiveness: use workflow templates to deploy a configured ephemeral cluster that runs a Dataproc job with calculated application-specific properties. Finally, there are many ways you can continue striving for optimal performance. Please review and consider the guidance laid out in the Google Cloud Blog. For general best practices, check out Dataproc best practices | Google Cloud Blog. For guidance on running in production, check out 7 best practices for running Cloud Dataproc in production | Google Cloud Blog.
Quelle: Google Cloud Platform
One of the most important real-time strategy games of 2023 can now be tried out in a playtest on Windows PC: Company of Heroes 3. (Company of Heroes, Steam)
Quelle: Golem