Schedule Connectivity Tests for continuous networking reachability diagnostics

As the scope and size of your cloud deployments expand, the need for automation to quickly and consistently diagnose service-affecting issues grows with them. Connectivity Tests, part of the Network Intelligence Center capabilities focused on Google Cloud network observability, monitoring, and troubleshooting, help you quickly troubleshoot network connectivity issues by analyzing your configuration and, in some cases, validating the data plane by sending synthetic traffic.

It's common to start using Connectivity Tests in an ad hoc manner, for example, to determine whether an issue reported by your users is caused by a recent configuration change. Another popular use case is verifying that applications and services are reachable after a migration, which confirms that the cloud networking design is working as intended. Once workloads are migrated to Google Cloud, Connectivity Tests help prevent regressions caused by misconfiguration or maintenance issues.

As you become more familiar with the power of Connectivity Tests, you may discover different use cases for running them on a continuous basis. In this post, we'll walk through a solution to continuously run Connectivity Tests. Scheduling Connectivity Tests leverages existing Google Cloud tools to continuously execute tests and surface failures through Cloud Monitoring alerts.

We use the following products and tools as part of this solution:

- One or more Connectivity Tests to check connectivity between network endpoints by analyzing the cloud networking configuration and (when eligible) performing live data plane analysis between the endpoints.
- A single Cloud Function to programmatically run the Connectivity Tests using the Network Management API, and publish results to Cloud Logging.
- One or more Cloud Scheduler jobs that run the Connectivity Tests on a continuous schedule that you define.
- Operations Suite, which integrates logging, log-based metrics, and alerting to surface test results that require your attention.

Let's get started.

In this example there are two virtual machines running in different cloud regions of the same VPC.

Connectivity Tests

We configure a connectivity test to verify that the VM instance in cloud region us-east4 can reach the VM instance in cloud region europe-west1 on port 443 using the TCP protocol. The following Connectivity Test UI example shows the complete configuration of the test.

For more detailed information on the available test parameters, see the Connectivity Tests documentation.

At this point you can verify that the test passes both the configuration and data plane analysis steps. This tells you that the cloud network is configured to allow the VM instances to communicate and that the packets transmitted between the VM instances were successfully passed through the network.

Before moving on to the next step, note the name of the connectivity test in URI format, which is visible in the equivalent REST response output (in our example, projects/project6/locations/global/connectivityTests/inter-region-test-1). We'll use this value as part of the Cloud Scheduler configuration in a later step.

Create Cloud Function

Cloud Functions provide a way to interact with the Network Management API to run a connectivity test. While there are other approaches for interacting with the API, we take advantage of the flexibility in Cloud Functions to run the test and enrich the output we send to Cloud Logging.
Cloud Functions also provide support for numerous programming languages, so you can adapt these instructions to the language of your choice. In this example, we use Python for interfacing with the Network Management API.

Let's walk through the high-level functionality of the code.

First, the Cloud Function receives an HTTP request with the name of the connectivity test that you want to execute. By providing the name of the connectivity test as a variable, we can reuse the same Cloud Function for running any of your configured connectivity tests.

```python
if http_request.method != 'GET':
    return flask.abort(
        flask.Response(
            http_request.method +
            ' requests are not supported, use GET instead',
            status=405))
if 'name' not in http_request.args:
    return flask.abort(
        flask.Response("Missing 'name' URL parameter", status=400))
test_name = http_request.args['name']
```

Next, the code runs the specified connectivity test using the Network Management API.

```python
client = network_management_v1.ReachabilityServiceClient()
rerun_request = network_management_v1.RerunConnectivityTestRequest(
    name=test_name)
try:
    response = client.rerun_connectivity_test(request=rerun_request).result(
        timeout=60)
```

And finally, if the connectivity test fails for any reason, a log entry is created that we'll later configure to generate an alert.

```python
if (response.reachability_details.result !=
        types.ReachabilityDetails.Result.REACHABLE):
    entry = {
        'message':
            f'Reran connectivity test {test_name!r} and the result was '
            'unreachable',
        'logging.googleapis.com/labels': {
            'test_resource_id': test_name
        }
    }
    print(json.dumps(entry))
```

There are a couple of things to note about this last portion of sample code:

- We define a custom label (test_resource_id: test_name) that is attached when a log entry is written. We'll use this as part of the logs-based metric in a later step.
- We only write a log entry when the connectivity test fails. You can customize the logic for other use cases, for example logging when tests that you expect to fail succeed, or writing logs for both successful and unsuccessful test results to generate a ratio metric.

The full example code for the Cloud Function is below.

```python
import json

import flask
from google.api_core import exceptions
from google.cloud import network_management_v1
from google.cloud.network_management_v1 import types


def rerun_test(http_request):
    """Reruns a connectivity test and prints an error message if the test fails."""
    # Only GET requests are supported.
    if http_request.method != 'GET':
        return flask.abort(
            flask.Response(
                http_request.method +
                ' requests are not supported, use GET instead',
                status=405))
    # The connectivity test to rerun is passed in the 'name' URL parameter.
    if 'name' not in http_request.args:
        return flask.abort(
            flask.Response("Missing 'name' URL parameter", status=400))
    test_name = http_request.args['name']
    client = network_management_v1.ReachabilityServiceClient()
    rerun_request = network_management_v1.RerunConnectivityTestRequest(
        name=test_name)
    try:
        # Rerun the test and wait up to 60 seconds for the operation to finish.
        response = client.rerun_connectivity_test(request=rerun_request).result(
            timeout=60)
        if (response.reachability_details.result !=
                types.ReachabilityDetails.Result.REACHABLE):
            # Structured log entry; the label feeds the logs-based metric.
            entry = {
                'message':
                    f'Reran connectivity test {test_name!r} and the result was '
                    'unreachable',
                'logging.googleapis.com/labels': {
                    'test_resource_id': test_name
                }
            }
            print(json.dumps(entry))
        return flask.Response(status=200)
    except exceptions.GoogleAPICallError as e:
        print(e)
        return flask.abort(500)
```

We use the code above to create a Cloud Function named run_connectivity_test. Use the default trigger type of HTTP and make note of the trigger URL to use in a later step:

```
https://us-east4-project6.cloudfunctions.net/run_connectivity_test
```

Under Runtime, build, connections and security settings, increase the Runtime Timeout to 120 seconds.

For the function code, select Python for the Runtime and set the Entry point to rerun_test, the function defined in the sample code.

For main.py, use the sample code provided above and configure the following dependencies for the Cloud Function in requirements.txt.

```
# Function dependencies, for example:
# package>=version
google-cloud-network-management>=1.3.1
google-api-core>=2.7.2
```

Click Deploy and wait for the Cloud Function deployment to complete.

Cloud Scheduler

Cloud Scheduler executes the Cloud Function on a periodic schedule. Create a separate Cloud Scheduler job for each connectivity test you want to schedule.

The following Cloud Console example shows the Cloud Scheduler configuration for our example.

Note that the Frequency is specified in unix-cron format; in our example, it schedules the Cloud Function to run once an hour. Make sure you take the Connectivity Tests pricing into consideration when configuring the frequency of the tests.

The URL parameter of the execution configuration in the example below is where we bring together the name of the connectivity test and the Cloud Function trigger from the previous steps.
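
As a rough illustration of how these two values fit together, the sketch below joins a trigger URL and a connectivity test name into a single target URL, and also derives the same URL without its query string (the value the Audience setting expects). The helper names `build_scheduler_url` and `audience_for` are hypothetical and exist only for this example; Cloud Scheduler itself just needs the resulting strings.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_scheduler_url(trigger_url: str, test_name: str) -> str:
    """Appends the test name to the trigger URL as a 'name' query parameter."""
    # safe='/' keeps the slashes of the resource name readable; any other
    # special characters are still percent-encoded.
    return f"{trigger_url}?{urlencode({'name': test_name}, safe='/')}"


def audience_for(scheduler_url: str) -> str:
    """Returns the URL with its query parameters and fragment removed."""
    return urlsplit(scheduler_url)._replace(query="", fragment="").geturl()


# Values taken from the example in this post.
trigger = "https://us-east4-project6.cloudfunctions.net/run_connectivity_test"
test = "projects/project6/locations/global/connectivityTests/inter-region-test-1"

url = build_scheduler_url(trigger, test)
print(url)
# The Cloud Function reads this value back from the 'name' URL parameter.
print(parse_qs(urlsplit(url).query)["name"][0])
print(audience_for(url))
```

Building the URL programmatically is mainly useful when you create many Scheduler jobs, one per connectivity test, since it avoids copy-and-paste mistakes in the query string.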
The format of the URL is:

{cloud_function_trigger}?name={connectivity-test-name}

In our example, the URL is configured as:

https://us-east4-project6.cloudfunctions.net/run_connectivity_test?name=projects/project6/locations/global/connectivityTests/inter-region-test-1

The following configuration options complete the Cloud Scheduler configuration:

- Change the HTTP method to GET.
- Select Add OIDC token for the Auth header.
- Specify a service account that has the Cloud Functions Invoker role for your Cloud Function.
- Set the Audience to the URL minus the query parameters, e.g.: https://us-east4-project6.cloudfunctions.net/run_connectivity_test

Logs-based Metric

The logs-based metric converts the unreachable log entries created by our Cloud Function into a Cloud Monitoring metric that we can use to create an alert. We start by configuring a Counter logs-based metric named unreachable_connectivity_tests. Next, configure a filter to match the `test_resource_id` label that is included in the unreachable log messages.

The complete metric configuration is shown below.

Alerting Policy

The Alerting Policy is triggered any time the logs-based metric increments, indicating that one of the continuous connectivity tests has failed.
The alert includes the name of the test that failed, allowing you to quickly focus your effort on the resources and traffic included in the test parameters.

To create a new Alerting Policy, select the logging/user/unreachable_connectivity_tests metric for the Cloud Function resource. Under Transform data, configure the following parameters:

- Within each time series
  - Rolling window = 2 minutes
  - Rolling window function = rate
- Across time series
  - Time series aggregation = sum
  - Time series group by = test_resource_id

Next, configure the alert trigger using the parameters shown in the figure below.

Finally, configure the Documentation text field to include the name of the specific test that logged an unreachable result.

Connectivity Tests provide critical insights into the configuration and operation of your cloud networking environment. By combining multiple Google Cloud services, you can transform your Connectivity Tests usage from an ad hoc troubleshooting tool into a solution for ongoing service validation and issue detection.

We hope you found this information useful. For a more in-depth look into Network Intelligence Center, check out the What is Network Intelligence Center? post and our documentation.
Source: Google Cloud Platform
