Application logs are a treasure trove of information. Beyond debugging and troubleshooting, they can provide valuable insights into your application’s performance, user behavior, and overall health. However, raw logs are often unstructured and difficult to analyze directly. This is where log-based metrics come in.

Log-based metrics allow you to extract numerical data from your logs, transforming unstructured data into quantifiable measurements that can be used for monitoring, alerting, and visualization. This post will guide you through creating log-based metrics using Cloud Logging and demonstrate how to leverage them for proactive monitoring and alerting.

What are Log-Based Metrics?

Log-based metrics are aggregated numerical values derived from log entries. Instead of manually searching through logs, you define filters and extraction rules that automatically count events or extract specific data points based on the content of your log messages. These metrics can then be used to create dashboards, set up alerts, and perform trend analysis, giving you a near-real-time view of your application’s behavior.
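
To make the idea concrete, here is a minimal local sketch of what a counter metric does conceptually: it applies a filter to each entry and counts the matches per time interval. The entries, field names, and the count_per_minute helper are hypothetical and only illustrate the concept; Cloud Logging performs this aggregation for you.

```python
from collections import Counter
from datetime import datetime

# Hypothetical log entries; real entries would come from Cloud Logging.
ENTRIES = [
    {"timestamp": "2024-01-01T10:00:12Z", "status": 500},
    {"timestamp": "2024-01-01T10:00:48Z", "status": 200},
    {"timestamp": "2024-01-01T10:01:05Z", "status": 500},
    {"timestamp": "2024-01-01T10:01:30Z", "status": 500},
]


def count_per_minute(entries, predicate):
    """Count the entries that satisfy `predicate`, grouped by minute."""
    counts = Counter()
    for entry in entries:
        if predicate(entry):
            ts = datetime.strptime(entry["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
            counts[ts.strftime("%H:%M")] += 1
    return dict(counts)


# The predicate plays the role of the metric's filter.
errors_per_minute = count_per_minute(ENTRIES, lambda e: e["status"] == 500)
print(errors_per_minute)
```

Each key in the result corresponds to one data point of the metric's time series.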

Benefits of Log-Based Metrics

  • Proactive Monitoring: Identify issues before they impact users by setting up alerts based on metric thresholds.
  • Performance Analysis: Track key performance indicators (KPIs) such as request latency, error rates, and resource utilization.
  • Trend Identification: Detect patterns and anomalies in your application’s behavior over time.
  • Improved Debugging: Correlate log events with metric data to pinpoint the root cause of problems.
  • Cost Optimization: Identify areas where you can improve resource utilization and reduce costs.

Tutorial: Counting 500 Error Codes with Cloud Logging

This tutorial shows how to create a log-based metric in Cloud Logging that counts HTTP 500 responses in your application logs.

Prerequisites

  • A Google Cloud Platform (GCP) project with Cloud Logging enabled.
  • Application logs that include HTTP response codes (e.g., 500, 200).
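
If your application does not yet log response codes in a structured form, the following sketch shows one way to emit a JSON log line per request. The field names (status, url, latency_ms) are assumptions chosen to match the filters used in this post, and whether they end up under jsonPayload depends on how your logging agent or runtime is configured to parse output.

```python
import json
import sys
from datetime import datetime, timezone


def format_request_log(status, url, latency_ms):
    """Return one JSON log line describing a handled HTTP request."""
    return json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "severity": "ERROR" if status >= 500 else "INFO",
        "status": status,          # numeric HTTP response code
        "url": url,                # request path (hypothetical field name)
        "latency_ms": latency_ms,  # request duration (hypothetical field name)
    })


line = format_request_log(500, "/api/v1/users", 123.4)
sys.stdout.write(line + "\n")
```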

Step 1: Access Cloud Logging

  1. Go to the Google Cloud Console: https://console.cloud.google.com/
  2. Navigate to Logging > Logs Explorer.

Step 2: Create a Log-Based Metric

  1. In the left navigation pane of the Logging section, click Log-based Metrics.
  2. Click on the Create Metric button.

Step 3: Configure the Metric

  1. Metric Type: Choose either Counter metric to simply count the occurrences of a specific log event, or Distribution metric if you need to calculate statistics like average, min, max, or percentiles from numerical data extracted from logs. For this example, select Counter metric.

  2. Name: Enter a descriptive name for your metric (e.g., http-500-errors).

  3. Description: Provide a brief description of the metric (e.g., “Counts the number of HTTP 500 error responses”).

  4. Units: For counter metrics, leave this field blank or enter 1, since the value is a plain count.

  5. Filter: This is the most important part. The filter is the query that selects which log entries the metric counts. The query below matches entries with an HTTP status code of 500; conditions on separate lines are implicitly combined with AND. This example assumes each log entry has a status field in its JSON payload, so adjust the query to match how your logs are structured:

    resource.type="gce_instance"
    jsonPayload.status=500
    

    Explanation:

    • resource.type="gce_instance": This line filters logs from Google Compute Engine instances. Adjust this to the resource type where your application is running (e.g., k8s_container for Kubernetes).
    • jsonPayload.status=500: This line filters log entries where the status field in the jsonPayload (assuming your logs are in JSON format) is equal to 500. If your logs use a different field or format for the status code, modify this line accordingly. For example, if the field is http_response_code, you would use jsonPayload.http_response_code=500.
  6. Create Metric: Click the Create Metric button to save your new log-based metric.
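
The same configuration can also be expressed as data, which is useful if you prefer to create metrics through the API or keep their definitions in version control. The sketch below only builds and prints a request body; it sends nothing. The field names follow the LogMetric resource of the Cloud Logging API as I understand it, so treat them as assumptions and check the API reference before relying on them. The counter_metric_body helper is hypothetical.

```python
import json


def counter_metric_body(name, description, log_filter):
    """Build a dict shaped like a LogMetric resource for a counter metric."""
    return {
        "name": name,
        "description": description,
        "filter": log_filter,
        "metricDescriptor": {
            "metricKind": "DELTA",  # counter metrics report per-interval counts
            "valueType": "INT64",
            "unit": "1",
        },
    }


# Separate filter lines are implicitly ANDed; here the AND is written out.
log_filter = " AND ".join([
    'resource.type="gce_instance"',
    "jsonPayload.status=500",
])

body = counter_metric_body(
    "http-500-errors",
    "Counts the number of HTTP 500 error responses",
    log_filter,
)
print(json.dumps(body, indent=2))
```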

Step 4: View and Monitor the Metric

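  1. After creating the metric, it might take a few minutes for data to populate. A log-based metric only counts log entries received after it was created; earlier entries are not backfilled.
  2. Go to Logging > Log-based Metrics in the Cloud Console.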
  3. Find your newly created metric in the list.
  4. Click the three dots menu next to the metric and select View in Metrics Explorer. This will open the Metrics Explorer, where you can visualize the metric’s data over time.

Step 5: Create an Alert (Optional)

To proactively monitor your application for 500 errors, you can create an alert based on the log-based metric.

  1. In the Metrics Explorer, click on the Create Alert button.
  2. Configure the alert policy:
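    • Target: The metric you just created (http-500-errors). In Cloud Monitoring its full metric type is logging.googleapis.com/user/http-500-errors.
    • Configuration: Define a threshold and condition (e.g., “Trigger if the sum of http-500-errors over a 5-minute window is greater than 100”).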
    • Notifications: Choose how you want to be notified when the alert is triggered (e.g., email, Slack, PagerDuty).
    • Documentation: Add any relevant information to help troubleshoot the issue when the alert is triggered.
  3. Click Save to create the alert policy.
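
For teams that manage alerts as code, the policy above could be described roughly as follows. This is a sketch that builds a dict resembling a Cloud Monitoring AlertPolicy resource; the field names reflect my understanding of that API and should be verified against its reference. The threshold_alert_policy helper and the documentation text are hypothetical, and notification channels are omitted because they are project-specific.

```python
import json


def threshold_alert_policy(metric_name, threshold, window_seconds):
    """Build a dict shaped like an AlertPolicy with one threshold condition."""
    metric_type = f"logging.googleapis.com/user/{metric_name}"
    return {
        "displayName": f"{metric_name} above {threshold}",
        "combiner": "OR",
        "conditions": [{
            "displayName": f"{metric_name} > {threshold} in {window_seconds}s",
            "conditionThreshold": {
                "filter": (
                    f'metric.type="{metric_type}" '
                    'AND resource.type="gce_instance"'
                ),
                "comparison": "COMPARISON_GT",
                "thresholdValue": threshold,
                "duration": "0s",
                # Sum the counter over each window before comparing.
                "aggregations": [{
                    "alignmentPeriod": f"{window_seconds}s",
                    "perSeriesAligner": "ALIGN_SUM",
                }],
            },
        }],
        "documentation": {
            "content": "Check recent deployments and the application logs.",
            "mimeType": "text/markdown",
        },
    }


policy = threshold_alert_policy("http-500-errors", 100, 300)
print(json.dumps(policy, indent=2))
```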

Alternative Log Structures and Filters

The example above assumes your logs are in JSON format with a status field. Here are some variations and how to adapt the filter:

  • Text-based Logs: If your logs are plain text, you can use regular expressions to match the error code. For example, if your logs contain lines like "Error: 500 Internal Server Error", you could use the following filter (assuming textPayload contains the log message):

    resource.type="gce_instance"
    textPayload=~"Error: 500"
    

    The =~ operator performs a regular expression match (Cloud Logging uses RE2 syntax).

  • Different Field Names: If your status code is in a field called http_code, your filter would be:

    resource.type="gce_instance"
    jsonPayload.http_code=500
    
  • Specific URL Paths: You can combine filters to count 500 errors only for specific URL paths. For example, to count 500 errors only for requests to /api/v1/users, assuming the URL is in a field called url:

    resource.type="gce_instance"
    jsonPayload.status=500
    jsonPayload.url="/api/v1/users"
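
Before saving a filter, it can help to check the matching logic against a few sample entries. The sketch below imitates the three variations above in plain Python. It is only an approximation: the sample entries and helper names are hypothetical, and Python's re module stands in for RE2, which differs in some features.

```python
import re

# Hypothetical sample entries in the shapes discussed above.
ENTRIES = [
    {"textPayload": "Error: 500 Internal Server Error"},
    {"textPayload": "OK: 200"},
    {"jsonPayload": {"status": 500, "url": "/api/v1/users"}},
    {"jsonPayload": {"status": 500, "url": "/api/v1/orders"}},
    {"jsonPayload": {"http_code": 500}},
]


def matches_text(entry, pattern):
    """Approximate textPayload=~"pattern" with an unanchored regex search."""
    return re.search(pattern, entry.get("textPayload", "")) is not None


def matches_fields(entry, **expected):
    """Approximate ANDed jsonPayload.<field>=<value> comparisons."""
    payload = entry.get("jsonPayload", {})
    return all(payload.get(key) == value for key, value in expected.items())


text_hits = [e for e in ENTRIES if matches_text(e, r"Error: 500")]
code_hits = [e for e in ENTRIES if matches_fields(e, http_code=500)]
path_hits = [
    e for e in ENTRIES
    if matches_fields(e, status=500, url="/api/v1/users")
]
print(len(text_hits), len(code_hits), len(path_hits))
```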
    

Beyond Counting: Distribution Metrics

While counter metrics are useful for counting events, distribution metrics are powerful for analyzing the distribution of numerical data. For example, you can use a distribution metric to track request latency and calculate the average, minimum, maximum, and percentiles.

To create a distribution metric:

  1. Select “Distribution metric” as the Metric Type.
  2. Specify the field name: This is the field in your log entry that contains the numerical value you want to analyze (e.g., jsonPayload.latency_ms).
  3. Optional: Choose how values are grouped into histogram buckets. Cloud Logging supports linear, exponential (for example, power-of-two), and explicit bucket layouts.
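
To see how bucket settings shape the resulting histogram, the following sketch computes exponential bucket boundaries and sorts some sample latencies into them. The parameter names (num_finite_buckets, growth_factor, scale) mirror the exponential bucket options as I understand them from Cloud Monitoring, and the latency values are made up for illustration.

```python
from bisect import bisect_right


def exponential_bounds(num_finite_buckets, growth_factor, scale):
    """Return the boundaries scale * growth_factor**i for i = 0..N."""
    return [scale * growth_factor ** i for i in range(num_finite_buckets + 1)]


def histogram(values, bounds):
    """Count values per bucket, including underflow and overflow buckets."""
    counts = [0] * (len(bounds) + 1)
    for value in values:
        # Index 0 is the underflow bucket; the last index is the overflow bucket.
        counts[bisect_right(bounds, value)] += 1
    return counts


# Growth factor 2 gives power-of-two style buckets starting at 10 ms.
bounds = exponential_bounds(num_finite_buckets=4, growth_factor=2, scale=10)
latencies_ms = [4, 12, 18, 35, 90, 95, 400]  # hypothetical samples
counts = histogram(latencies_ms, bounds)
print(bounds)
print(counts)
```

Percentiles shown in dashboards are estimated from bucket counts like these, so bucket boundaries that are too coarse around your typical latency will make those estimates less precise.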

Conclusion

Log-based metrics are an indispensable tool for gaining insights into your applications’ performance and health. By extracting and aggregating data from your logs, you can proactively monitor your systems, identify trends, and respond quickly to issues. Cloud Logging provides a flexible and powerful platform for creating and managing log-based metrics, empowering you to build more reliable and observable applications. Remember to tailor the filters to your specific log structure for best results.