Introduction

In today’s data-driven world, understanding the performance and behavior of your BigQuery deployments is crucial. Effective observability and monitoring allow you to proactively identify issues, optimize resource utilization, and ensure the reliability of your data pipelines. This blog post will guide you through setting up observability and monitoring for BigQuery using Terraform and PromQL. We’ll delve into key metrics, explore how to query them with PromQL, and discuss how these metrics can be leveraged for alerting.

What is PromQL?

PromQL (Prometheus Query Language) is a functional query language used by Prometheus to select and aggregate time-series data in real time. It provides a powerful way to explore and analyze metrics collected by Prometheus and other monitoring systems. With PromQL, you can filter, aggregate, and perform mathematical operations on your metrics to gain valuable insights into system behavior.

Key features of PromQL include:

  • Metric Selection: Select time series based on labels and metric names.
  • Aggregation: Calculate sums, averages, minimums, maximums, and other aggregations over time ranges.
  • Operators: Perform arithmetic, logical, and set-based operations on metrics.
  • Functions: Use built-in functions for rate calculation, moving averages, and more.
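As a quick illustration, the following query combines several of these features, using http_requests_total, the conventional Prometheus example metric:

```promql
# Per-second request rate over the last 5 minutes,
# filtered by job label and summed per HTTP status code.
sum(rate(http_requests_total{job="api-server"}[5m])) by (code)
```

The same selection/rate/aggregation pattern recurs throughout the BigQuery queries below.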

BigQuery Metrics for Monitoring

Google Cloud provides a rich set of metrics for monitoring BigQuery performance and usage. These metrics can be accessed through Cloud Monitoring and queried using PromQL-compatible tools. Let’s explore some of the key metrics that are useful for monitoring:

1. query/biengine_fallback_count

  • Description: BI Engine query fallback count.
  • Metric Type: DELTA, INT64
  • Unit: 1
  • Labels:
    • project_id: The BigQuery project ID (a label of the bigquery_project monitored resource).
    • reason: BI Engine fallback reasons.
  • Sampling: Every 60 seconds; data visible after 120 seconds.
  • Use Case: Alerting on high fallback counts to investigate BI Engine performance bottlenecks. A high number of fallbacks indicates that queries are falling back to standard BigQuery execution rather than being accelerated by BI Engine.

    Example PromQL Query:

    sum(rate(bigquery_googleapis_com:query_biengine_fallback_count[5m])) by (project_id, reason)
    

    This query calculates the rate of fallback counts over the last 5 minutes, grouped by project and reason.

2. query/count

  • Description: Query count.
  • Metric Type: GAUGE, INT64
  • Unit: 1
  • Labels:
    • project_id: The BigQuery project ID (a label of the bigquery_project monitored resource).
    • priority: Query priority (batch or interactive).
  • Sampling: Every 60 seconds; data visible after 420 seconds.
  • Use Case: Monitoring the number of in-flight queries. A high number of in-flight queries sustained over a long period can lead to performance degradation.

    Example PromQL Query:

    bigquery_googleapis_com:query_count{project_id="your-project-id"}
    

    This query retrieves the current number of queries running in a specific project.
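As a rough starting point for concurrency monitoring, a threshold expression can surface sustained load; the value 50 below is an arbitrary placeholder to tune for your workload:

```promql
# Returns series only while in-flight queries exceed 50 in the project.
bigquery_googleapis_com:query_count{project_id="your-project-id"} > 50
```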

3. query/execution_count

  • Description: Query execution count.
  • Metric Type: DELTA, INT64
  • Unit: 1
  • Labels:
    • project_id: The BigQuery project ID (a label of the bigquery_project monitored resource).
    • priority: Query priority (batch or interactive).
    • caching_mode: The query caching mode (for example, BI Engine).
  • Sampling: Every 60 seconds; data visible after 420 seconds.
  • Use Case: Tracking the total number of queries executed, broken down by priority and caching mode. Can be used to identify inefficiencies or unexpected query patterns.

    Example PromQL Query:

    sum(rate(bigquery_googleapis_com:query_execution_count[5m])) by (project_id, priority, caching_mode)
    

    This query calculates the rate of query executions over the last 5 minutes, grouped by project, priority, and caching mode.

4. query/execution_times

  • Description: Query execution times.
  • Metric Type: GAUGE, DISTRIBUTION
  • Unit: s (seconds)
  • Labels:
    • project_id: The BigQuery project ID (a label of the bigquery_project monitored resource).
    • priority: Query priority (batch or interactive).
  • Sampling: Every 60 seconds; data visible after 420 seconds.
  • Use Case: Monitoring the distribution of query execution times. Crucial for identifying slow queries and potential performance bottlenecks.

    Example PromQL Query:

    histogram_quantile(0.95, sum(rate(bigquery_googleapis_com:query_execution_times_bucket[5m])) by (le, project_id, priority))
    

    This query estimates the 95th percentile of query execution times over the last 5 minutes, grouped by project and priority. Depending on your setup, the histogram buckets may be queryable directly, or a recording rule may be needed to pre-aggregate them for performance.
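To make the interpolation behind histogram_quantile concrete, here is a minimal Python sketch of the calculation it performs on cumulative buckets (simplified: it ignores edge cases such as the +Inf bucket, which Prometheus handles):

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (le_upper_bound, cumulative_count), sorted by bound.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = upper, count
    return buckets[-1][0]

# 100 queries bucketed by execution time: 40 finished under 1s, 75 under 5s, 95 under 30s.
buckets = [(1.0, 40), (5.0, 75), (30.0, 95), (60.0, 100)]
print(histogram_quantile(0.95, buckets))  # -> 30.0
```

This also shows why histogram quantiles are estimates: the result depends on the bucket boundaries, so buckets should bracket the thresholds you intend to alert on.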

5. query/scanned_bytes

  • Description: Scanned bytes.
  • Metric Type: DELTA, INT64
  • Unit: By (bytes)
  • Labels:
    • project_id: The BigQuery project ID (a label of the bigquery_project monitored resource).
    • priority: Query priority (batch or interactive).
  • Sampling: Every 60 seconds; data visible after 21720 seconds.
  • Use Case: Tracking the amount of data scanned by queries. Helps identify queries that scan excessive amounts of data so they can be optimized.

    Example PromQL Query:

    sum(rate(bigquery_googleapis_com:query_scanned_bytes[5m])) by (project_id, priority)
    

    This query calculates the rate of scanned bytes over the last 5 minutes, grouped by project and priority.

6. query/scanned_bytes_billed

  • Description: Scanned bytes billed.
  • Metric Type: DELTA, INT64
  • Unit: By (bytes)
  • Labels:
    • project_id: The BigQuery project ID (a label of the bigquery_project monitored resource).
    • priority: Query priority (batch or interactive).
  • Sampling: Every 60 seconds; data visible after 21720 seconds.
  • Use Case: Monitoring the amount of data billed for queries. Helps understand and control BigQuery costs.

    Example PromQL Query:

    sum(rate(bigquery_googleapis_com:query_scanned_bytes_billed[5m])) by (project_id, priority)
    

    This query calculates the rate of scanned bytes billed over the last 5 minutes, grouped by project and priority.
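Because on-demand pricing is driven directly by bytes billed, this metric translates straightforwardly into spend. The sketch below assumes an on-demand rate of $6.25 per TiB; verify the current rate for your region and pricing model (on-demand vs. capacity) before relying on it:

```python
# Assumed on-demand rate in USD per TiB -- check current BigQuery pricing.
ON_DEMAND_USD_PER_TIB = 6.25

def estimate_query_cost_usd(bytes_billed: int) -> float:
    """Rough on-demand cost estimate from the scanned_bytes_billed metric."""
    tib = bytes_billed / 2**40  # bytes -> TiB (binary)
    return tib * ON_DEMAND_USD_PER_TIB

# Example: a query billed for 1 TiB of scanned data.
print(round(estimate_query_cost_usd(2**40), 2))  # -> 6.25
```

Pairing an estimate like this with the per-priority breakdown from the PromQL query above makes it easier to attribute cost to batch versus interactive workloads.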

Alerting with PromQL

PromQL is instrumental in defining alerting rules. By querying metrics and setting thresholds, you can configure alerts that trigger when specific conditions are met. For instance, you might create an alert that fires when the 95th percentile of query/execution_times exceeds a certain threshold, indicating a potential performance issue.

Example Alerting Rule (Conceptual):

groups:
  - name: BigQueryAlerts
    rules:
      - alert: HighQueryExecutionTime
        expr: histogram_quantile(0.95, sum(rate(bigquery_googleapis_com:query_execution_times_bucket[5m])) by (le, project_id, priority)) > 60
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High query execution time detected"
          description: "95th percentile of query execution times is above 60 seconds for project {{ $labels.project_id }} and priority {{ $labels.priority }}."

This example demonstrates a Prometheus alerting rule. This rule will trigger an alert named “HighQueryExecutionTime” if the 95th percentile of query execution times exceeds 60 seconds for 5 minutes. It is crucial to test and refine thresholds based on your specific workload characteristics.

Terraform for Infrastructure as Code

Terraform allows you to define and manage your infrastructure as code. This includes setting up monitoring resources such as:

  • Cloud Monitoring Dashboards: Visualize BigQuery metrics.
  • Alerting Policies: Define alert conditions based on PromQL queries.
  • Metric Exporters: If needed, configure custom metric exports.

While providing a full Terraform configuration is beyond the scope of this post, here’s a conceptual example of how you might define a Cloud Monitoring alert policy using Terraform:

Sample Terraform code

resource "google_monitoring_alert_policy" "alert_policy" {
  display_name = "BI Engine Fallback Alert"
  combiner     = "OR"
  conditions {
    display_name = "High BI Engine fallback rate"
    condition_prometheus_query_language {
      # This query calculates the rate of BI Engine fallbacks over a 5-minute window.
      # The alert will trigger if this rate is greater than 0. 
      # You should adjust the threshold ('> 0') to a value appropriate for your use case.
      query    = "sum(rate(bigquery_googleapis_com:query_biengine_fallback_count[5m])) by (project_id, reason) > 0"
      
      # Duration is set to 300s (5 minutes) to align with the query's time window.
      # The condition must be true for this entire duration to trigger an alert.
      duration = "300s"
      
      evaluation_interval = "60s"
      # alert_rule and rule_group are optional fields for exporting to a Prometheus RuleFile.
      alert_rule = "HighBiEngineFallbackRate"
      rule_group = "BigQueryPerformance"
    }
  }

  alert_strategy {
    auto_close = "1800s" # 30 minutes
  }
}

As the inline comments note, this example triggers whenever the fallback rate is greater than zero; tune the threshold and duration to values appropriate for your workload before deploying the policy.
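The same workflow extends to dashboards. The following is a conceptual sketch, not a tested configuration: the field names follow the Cloud Monitoring dashboard JSON schema, and the layout values are placeholders to adjust for your needs. It charts the query execution rate used earlier in this post:

```terraform
resource "google_monitoring_dashboard" "bigquery" {
  dashboard_json = jsonencode({
    displayName = "BigQuery Overview"
    mosaicLayout = {
      tiles = [{
        width  = 24
        height = 16
        widget = {
          title = "Query executions by priority"
          xyChart = {
            dataSets = [{
              # PromQL-based chart using the execution-count query from above.
              timeSeriesQuery = {
                prometheusQuery = "sum(rate(bigquery_googleapis_com:query_execution_count[5m])) by (priority)"
              }
            }]
          }
        }
      }]
    }
  })
}
```

Keeping dashboards in Terraform alongside the alert policies means the entire monitoring setup is reviewable, versioned, and reproducible across projects.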

Conclusion

Effective observability and monitoring are essential for managing BigQuery deployments. By leveraging PromQL and Terraform, you can gain deep insights into your query performance, proactively identify issues, and optimize your resource utilization. Remember to carefully select the appropriate metrics, define meaningful alerting rules, and continuously refine your monitoring strategy based on your evolving needs.