TL;DR In this post we will set up a scheduled job to back up a Bigtable table in Avro format.
Dataflow is a managed service for running streaming and batch jobs; in this post we will use a batch job to take the backups in Avro.
We will use Cloud Scheduler to trigger the backup process.
Assumptions.
- Bigtable instance bt-instance-a is already present and has a table called table-to-backup in it.
- Bucket bt-backup-bucket is ready, with the necessary permissions for the service account bt-backups to write to it.
- The project ID of the project is PROJECT_ID.
We will do this in the below steps.
- Create a service account for this workflow and assign permissions (roles) to it.
- Create a location for the Dataflow template in the GCS bucket.
- Create a Terraform script to create the scheduled job.
- Run the job to create the schedule.
Create service account
We can create a service account with the below command.
gcloud iam service-accounts create bt-backups \
--description="BigTable backups service account" \
--display-name="Bigtable Backups"
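Optionally, we can check that the account was created; the below describe command should return its details (using the same PROJECT_ID as assumed above).
gcloud iam service-accounts describe bt-backups@PROJECT_ID.iam.gserviceaccount.com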
Next, assign permissions to the service account. We will need the below roles:
- Dataflow Worker.
- Bigtable User [also access to the specific instance].
- Cloud Scheduler permissions.
Dataflow permissions.
gcloud projects add-iam-policy-binding PROJECT_ID \
--member="serviceAccount:bt-backups@PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/dataflow.worker"
Cloud Scheduler permissions.
gcloud projects add-iam-policy-binding PROJECT_ID \
--member="serviceAccount:bt-backups@PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/cloudscheduler.jobRunner"
or (as required, based on the use case)
gcloud projects add-iam-policy-binding PROJECT_ID \
--member="serviceAccount:bt-backups@PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/cloudscheduler.admin"
Bigtable permissions.
Project-level permissions
gcloud projects add-iam-policy-binding PROJECT_ID \
--member="serviceAccount:bt-backups@PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/bigtable.user"
Assign instance-level permissions.
- We need to create a policy file.
- Assign the policy to the instance.
Let's create a policy file called bt-policy.yaml.
bindings:
- members:
- serviceAccount:bt-backups@PROJECT_ID.iam.gserviceaccount.com
role: roles/bigtable.user
etag: BwWWja0YfJA=
version: 1
Or the equivalent policy in JSON:
{
"bindings": [
{
"role": "roles/resourcemanager.organizationAdmin",
"members": [
"serviceAccount:bt-backups@PROJECT_ID.iam.gserviceaccount.com"
]
}
],
"etag": "BwWWja0YfJA=",
"version": 3
}
Setting the policy for the instance.
#
# gcloud bigtable instances set-iam-policy INSTANCE_ID POLICY_FILE
#
gcloud bigtable instances set-iam-policy bt-instance-a bt-policy.yaml
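We can read the policy back to confirm the binding was applied to the instance.
gcloud bigtable instances get-iam-policy bt-instance-a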
Now we have a service account with permission to trigger a scheduler job, run a Dataflow job, and read data from the Bigtable instance.
Create a location for the Dataflow template in the GCS bucket.
For this task we will be using the default templates.
gsutil cp gs://dataflow-templates/latest/Cloud_Bigtable_to_GCS_Avro gs://bt-backup-bucket/dataflow-templates/20211009/Cloud_Bigtable_to_GCS_Avro
We make a copy of the current version of the template because of the below caution from the Google documentation.
Caution: The latest template version is kept in the non-dated parent folder gs://dataflow-templates/latest, which may update with breaking changes. Your production environments should use templates kept in the most recent dated parent folder to prevent these breaking changes from affecting your production workflows. All dated template versions can be found nested inside their dated parent folder in the Cloud Storage bucket: gs://dataflow-templates/.
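A quick listing should confirm the copy is in place (using the bucket and dated path from above).
gsutil ls gs://bt-backup-bucket/dataflow-templates/20211009/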
Create a Terraform script to create the scheduled job.
The Bigtable to Cloud Storage Avro template is a pipeline that reads data from a Bigtable table and writes it to a Cloud Storage bucket in Avro format. You can use the template to move data from Bigtable to Cloud Storage.
We can get more information from the POST API request.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Cloud_Bigtable_to_GCS_Avro
{
"jobName": "JOB_NAME",
"parameters": {
"bigtableProjectId": "BIGTABLE_PROJECT_ID",
"bigtableInstanceId": "INSTANCE_ID",
"bigtableTableId": "TABLE_ID",
"outputDirectory": "OUTPUT_DIRECTORY",
"filenamePrefix": "FILENAME_PREFIX",
},
"environment": { "zone": "us-central1-f" }
}
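For a one-off manual test, roughly the same launch request can be sent with curl; a sketch, assuming the JSON body above is saved locally as launch.json (a hypothetical filename) and the caller has permission to launch Dataflow templates.
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d @launch.json \
    "https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://bt-backup-bucket/dataflow-templates/20211009/Cloud_Bigtable_to_GCS_Avro"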
Requirements for this pipeline:
- The Bigtable table must exist.
- The output Cloud Storage bucket must exist before running the pipeline.
Parameters used by this template.
Parameter | Description
---|---
bigtableProjectId | The ID of the Google Cloud project of the Bigtable instance that you want to read data from.
bigtableInstanceId | The ID of the Bigtable instance that contains the table.
bigtableTableId | The ID of the Bigtable table to export.
outputDirectory | The Cloud Storage path where data is written. For example, gs://mybucket/somefolder.
filenamePrefix | The prefix of the Avro filename. For example, output-.
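The same template can also be launched directly from the CLI; a sketch with the values assumed in this post (bt-backup-manual-test is just an illustrative job name).
gcloud dataflow jobs run bt-backup-manual-test \
    --gcs-location=gs://bt-backup-bucket/dataflow-templates/20211009/Cloud_Bigtable_to_GCS_Avro \
    --region=us-central1 \
    --parameters=bigtableProjectId=PROJECT_ID,bigtableInstanceId=bt-instance-a,bigtableTableId=table-to-backup,outputDirectory=gs://bt-backup-bucket/backups,filenamePrefix=bt-backups-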
Environment for the Dataflow job.
The environment values to set at runtime for the Dataflow job.
{
"numWorkers": integer,
"maxWorkers": integer,
"zone": string,
"serviceAccountEmail": string,
"tempLocation": string,
"bypassTempDirValidation": boolean,
"machineType": string,
"additionalExperiments": [
string
],
"network": string,
"subnetwork": string,
"additionalUserLabels": {
string: string,
...
},
"kmsKeyName": string,
"ipConfiguration": enum (WorkerIPAddressConfiguration),
"workerRegion": string,
"workerZone": string,
"enableStreamingEngine": boolean
}
Terraform script
Putting it all together.
resource "google_cloud_scheduler_job" "bt-backups-scheduler" {
name = "scheduler-bt-backups"
schedule = "0 0 * * *"
# This needs to be us-central1 even if App Engine is in us-central.
# You will get a resource not found error if just using us-central.
region = "us-central1"
http_target {
http_method = "POST"
uri = "https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://bt-backup-bucket/dataflow-templates/20211009/Cloud_Bigtable_to_GCS_Avro"
oauth_token {
service_account_email = bt-backups@PROJECT_ID.iam.gserviceaccount.com
}
# need to encode the string
body = base64encode(<<-EOT
{
"jobName": "bt-backups-cloud-scheduler",
"parameters": {
"bigtableProjectId": "PROJECT_ID",
"bigtableInstanceId": "bt-instance-a",
"bigtableTableId": "table-to-backup",
"outputDirectory": "gs://bt-backup-bucket/backups",
"filenamePrefix" : "bt-backups-"
},
"environment": {
"numWorkers": "3",
"maxWorkers": "10",
"tempLocation": "gs://bt-backup-bucket/temp",
"serviceAccountEmail": "bt-backups@PROJECT_ID.iam.gserviceaccount.com",
"additionalExperiments": ["use_network_tags=my-net-tag-name"], # Tag for any firewall rules on the shared VPC.
"subnetwork": "https://www.googleapis.com/compute/v1/projects/HOST_PROJECT_ID/regions/REGION/subnetworks/SUBNETWORK", # Have your network as a share VPC, and with the same region as the bigtable instance.
"ipConfiguration": "WORKER_IP_PRIVATE", # not allowing public IPs.
"workerRegion": "us-central1" # Have this in the bigtable instance location
}
}
EOT
)
}
}
IMPORTANT NOTE:
Enable network tags
You can specify the network tags only when you run the Dataflow job template to create a job.
--experiments=use_network_tags=TAG-NAME
Replace TAG-NAME with the names of your tags. If you add more than one tag, separate each tag with a semicolon (;), as shown in the following format: TAG-NAME-1;TAG-NAME-2;TAG-NAME-3;….
In Terraform
{
..
"additionalExperiments": ["use_network_tags=my-net-tag-name", "other=tags"],
..
}
Enable network tags for Flex Template Launcher VMs
When using Flex Templates, network tags are only applied to Dataflow worker VMs and not to the launcher VM.
--additional-experiments=use_network_tags_for_flex_templates=TAG-NAME
Replace TAG-NAME with the names of your tags. If you add more than one tag, separate each tag with a semicolon (;), as shown in the following format: TAG-NAME-1;TAG-NAME-2;TAG-NAME-3;….
In Terraform.
{
..
"additionalExperiments": ["use_network_tags_for_flex_templates=my-net-tag-name", "other=tags"],
..
}
After you enable the network tags, the tags are parsed and attached to the launcher VMs.
Run the job to create the schedule.
Run Terraform from the directory containing the script.
terraform init
terraform plan
terraform apply --auto-approve
This should create the scheduled job in Cloud Scheduler.
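We can also confirm the job from the CLI.
gcloud scheduler jobs describe scheduler-bt-backups --location=us-central1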
Click RUN NOW in the Cloud Scheduler console to trigger the job immediately.
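The CLI equivalent of the RUN NOW button is roughly the below.
gcloud scheduler jobs run scheduler-bt-backups --location=us-central1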
This should trigger the Dataflow job and copy data from the Bigtable instance bt-instance-a, table table-to-backup, to the GCS bucket gs://bt-backup-bucket/backups.
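Once the Dataflow job finishes, the Avro files should show up under the output directory; a quick check, using the paths assumed above.
gcloud dataflow jobs list --region=us-central1
gsutil ls gs://bt-backup-bucket/backups/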