TL;DR In this post we will set up a scheduled job to back up a Bigtable table in Avro format.
Dataflow is a managed service for running streaming and batch jobs; in this post we will use a batch job to take the backups in Avro.
We will use Cloud Scheduler to trigger the backup process.
Assumptions.
- Bigtable instance bt-instance-a is already present and has a table called table-to-backup in it.
- Bucket bt-backup-bucket is ready, with the necessary permissions for the service account bt-backups to write to it.
- The project ID of the project is PROJECT_ID.
We will do this in the below steps.
- Create a service account for this workflow and assign permissions (roles) to it.
- Create a location for the Dataflow template in the GCS bucket.
- Create a Terraform script to create the scheduled job.
- Run the job to create the schedule.
Create service account
We can create a service account with the below command.
gcloud iam service-accounts create bt-backups \
--description="BigTable backups service account" \
--display-name="Bigtable Backups"
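Optionally, we can check that the account was created; the below describe command should return its details (using the same PROJECT_ID as assumed above).
gcloud iam service-accounts describe bt-backups@PROJECT_ID.iam.gserviceaccount.com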
Next, assign permissions to the service account. We will need the below roles:
- Dataflow Worker.
- Bigtable User [also access to the specific instance].
- Cloud Scheduler permissions.
Dataflow permissions.
gcloud projects add-iam-policy-binding PROJECT_ID \
--member="serviceAccount:bt-backups@PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/dataflow.worker"
Cloud Scheduler permissions.
gcloud projects add-iam-policy-binding PROJECT_ID \
--member="serviceAccount:bt-backups@PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/cloudscheduler.jobRunner"
or (as required, based on the use case)
gcloud projects add-iam-policy-binding PROJECT_ID \
--member="serviceAccount:bt-backups@PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/cloudscheduler.admin"
Bigtable permissions.
Project-level permissions
gcloud projects add-iam-policy-binding PROJECT_ID \
--member="serviceAccount:bt-backups@PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/bigtable.user"
Assign instance-level permissions.
- We need to create a policy file.
- Assign the policy to the instance.
Let's create a policy file called bt-policy.yaml.
bindings:
- members:
- serviceAccount:bt-backups@PROJECT_ID.iam.gserviceaccount.com
role: roles/bigtable.user
etag: BwWWja0YfJA=
version: 1
Or the equivalent policy in JSON:
{
"bindings": [
{
"role": "roles/resourcemanager.organizationAdmin",
"members": [
"serviceAccount:bt-backups@PROJECT_ID.iam.gserviceaccount.com"
]
}
],
"etag": "BwWWja0YfJA=",
"version": 3
}
Setting the policy for the instance.
#
# gcloud bigtable instances set-iam-policy INSTANCE_ID POLICY_FILE
#
gcloud bigtable instances set-iam-policy bt-instance-a bt-policy.yaml
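We can read the policy back to confirm the binding was applied to the instance.
gcloud bigtable instances get-iam-policy bt-instance-a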
Now we have a service account with permission to trigger a scheduler job, run a Dataflow job, and read data from the Bigtable instance.
Create a location for the Dataflow template in the GCS bucket.
For this task we will be using the default templates.
gsutil cp gs://dataflow-templates/latest/Cloud_Bigtable_to_GCS_Avro gs://bt-backup-bucket/dataflow-templates/20211009/Cloud_Bigtable_to_GCS_Avro
We make a copy of the current version of the template because of the below caution from the Google documentation.
Caution: The latest template version is kept in the non-dated parent folder gs://dataflow-templates/latest, which may update with breaking changes. Your production environments should use templates kept in the most recent dated parent folder to prevent these breaking changes from affecting your production workflows. All dated template versions can be found nested inside their dated parent folder in the Cloud Storage bucket: gs://dataflow-templates/.
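A quick listing should confirm the copy is in place (using the bucket and dated path from above).
gsutil ls gs://bt-backup-bucket/dataflow-templates/20211009/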
Create a Terraform script to create the scheduled job.
The Bigtable to Cloud Storage Avro template is a pipeline that reads data from a Bigtable table and writes it to a Cloud Storage bucket in Avro format. You can use the template to move data from Bigtable to Cloud Storage.
We can get more information from the POST API request.
POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://dataflow-templates/latest/Cloud_Bigtable_to_GCS_Avro
{
"jobName": "JOB_NAME",
"parameters": {
"bigtableProjectId": "BIGTABLE_PROJECT_ID",
"bigtableInstanceId": "INSTANCE_ID",
"bigtableTableId": "TABLE_ID",
"outputDirectory": "OUTPUT_DIRECTORY",
"filenamePrefix": "FILENAME_PREFIX",
},
"environment": { "zone": "us-central1-f" }
}
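For a one-off manual test, roughly the same launch request can be sent with curl; a sketch, assuming the JSON body above is saved locally as launch.json (a hypothetical filename) and the caller has permission to launch Dataflow templates.
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d @launch.json \
    "https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://bt-backup-bucket/dataflow-templates/20211009/Cloud_Bigtable_to_GCS_Avro"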
Requirements for this pipeline:
- The Bigtable table must exist.
- The output Cloud Storage bucket must exist before running the pipeline.
Parameters used by this template.
Parameter | Description
---|---
bigtableProjectId | The ID of the Google Cloud project of the Bigtable instance that you want to read data from.
bigtableInstanceId | The ID of the Bigtable instance that contains the table.
bigtableTableId | The ID of the Bigtable table to export.
outputDirectory | The Cloud Storage path where data is written. For example, gs://mybucket/somefolder.
filenamePrefix | The prefix of the Avro filename. For example, output-.
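The same template can also be launched directly from the CLI; a sketch with the values assumed in this post (bt-backup-manual-test is just an illustrative job name).
gcloud dataflow jobs run bt-backup-manual-test \
    --gcs-location=gs://bt-backup-bucket/dataflow-templates/20211009/Cloud_Bigtable_to_GCS_Avro \
    --region=us-central1 \
    --parameters=bigtableProjectId=PROJECT_ID,bigtableInstanceId=bt-instance-a,bigtableTableId=table-to-backup,outputDirectory=gs://bt-backup-bucket/backups,filenamePrefix=bt-backups-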
Environment for the Dataflow job.
The environment values to set at runtime for the Dataflow job.
{
"numWorkers": integer,
"maxWorkers": integer,
"zone": string,
"serviceAccountEmail": string,
"tempLocation": string,
"bypassTempDirValidation": boolean,
"machineType": string,
"additionalExperiments": [
string
],
"network": string,
"subnetwork": string,
"additionalUserLabels": {
string: string,
...
},
"kmsKeyName": string,
"ipConfiguration": enum (WorkerIPAddressConfiguration),
"workerRegion": string,
"workerZone": string,
"enableStreamingEngine": boolean
}
Terraform script
Putting it all together.
resource "google_cloud_scheduler_job" "bt-backups-scheduler" {
name = "scheduler-bt-backups"
schedule = "0 0 * * *"
# This needs to be us-central1 even if App Engine is in us-central.
# You will get a resource not found error if just using us-central.
region = "us-central1"
http_target {
http_method = "POST"
uri = "https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/templates:launch?gcsPath=gs://bt-backup-bucket/dataflow-templates/20211009/Cloud_Bigtable_to_GCS_Avro"
oauth_token {
service_account_email = bt-backups@PROJECT_ID.iam.gserviceaccount.com
}
# need to encode the string
body = base64encode(<<-EOT
{
"jobName": "bt-backups-cloud-scheduler",
"parameters": {
"bigtableProjectId": "PROJECT_ID",
"bigtableInstanceId": "bt-instance-a",
"bigtableTableId": "table-to-backup",
"outputDirectory": "gs://bt-backup-bucket/backups",
"filenamePrefix" : "bt-backups-"
},
"environment": {
"numWorkers": "3",
"maxWorkers": "10",
"tempLocation": "gs://bt-backup-bucket/temp",
"serviceAccountEmail": "bt-backups@PROJECT_ID.iam.gserviceaccount.com",
"additionalExperiments": ["use_network_tags=my-net-tag-name"], # Tag for any firewall rules on the shared VPC.
"subnetwork": "https://www.googleapis.com/compute/v1/projects/HOST_PROJECT_ID/regions/REGION/subnetworks/SUBNETWORK", # Have your network as a share VPC, and with the same region as the bigtable instance.
"ipConfiguration": "WORKER_IP_PRIVATE", # not allowing public IPs.
"workerRegion": "us-central1" # Have this in the bigtable instance location
}
}
EOT
)
}
}
IMPORTANT NOTE:
Enable network tags
You can specify the network tags only when you run the Dataflow job template to create a job.
--experiments=use_network_tags=TAG-NAME
Replace TAG-NAME with the names of your tags. If you add more than one tag, separate each tag with a semicolon (;), as shown in the following format: TAG-NAME-1;TAG-NAME-2;TAG-NAME-3;….
In Terraform
{
..
"additionalExperiments": ["use_network_tags=my-net-tag-name", "other=tags"],
..
}
Enable network tags for Flex Template Launcher VMs
When using Flex Templates, network tags are only applied to Dataflow worker VMs and not to the launcher VM.
--additional-experiments=use_network_tags_for_flex_templates=TAG-NAME
Replace TAG-NAME with the names of your tags. If you add more than one tag, separate each tag with a semicolon (;), as shown in the following format: TAG-NAME-1;TAG-NAME-2;TAG-NAME-3;….
In Terraform.
{
..
"additionalExperiments": ["use_network_tags_for_flex_templates=my-net-tag-name", "other=tags"],
..
}
After you enable the network tags, the tags are parsed and attached to the launcher VMs.
Run the job to create the schedule.
Run Terraform from the directory containing the script.
terraform init
terraform plan
terraform apply --auto-approve
This should create the scheduled job in Cloud Scheduler.
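We can also confirm the job from the CLI.
gcloud scheduler jobs describe scheduler-bt-backups --location=us-central1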
Click RUN NOW in the Cloud Scheduler console to trigger the job immediately.
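The CLI equivalent of the RUN NOW button is roughly the below.
gcloud scheduler jobs run scheduler-bt-backups --location=us-central1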
This should trigger the Dataflow job and copy data from the Bigtable instance bt-instance-a, table table-to-backup, to the GCS bucket gs://bt-backup-bucket/backups.
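Once the Dataflow job finishes, the Avro files should show up under the output directory; a quick check, using the paths assumed above.
gcloud dataflow jobs list --region=us-central1
gsutil ls gs://bt-backup-bucket/backups/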