Cloud Storage is Google Cloud’s object storage service; it allows worldwide storage and retrieval of any amount of data at any time.

Object Versioning

To support the retrieval of objects that are deleted or replaced, Cloud Storage offers the Object Versioning feature.

Overview

  1. You are charged for each archived version as if it were a separate object.
  2. Cloud Storage creates an archived version of an object each time the live version is overwritten or deleted.
  3. You can turn versioning on or off for a bucket at any time (a sketch using the Python client follows this list).
  4. Turning versioning off leaves existing archived versions in place and stops the bucket from accumulating new ones.
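
A minimal sketch of toggling versioning with the Python client library (google-cloud-storage is assumed to be installed; the bucket name is a placeholder):

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-bucket")

# Turn Object Versioning on; set to False to turn it off again.
bucket.versioning_enabled = True
bucket.patch()

print("Versioning enabled: {}".format(bucket.versioning_enabled))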

Properties

  • generation Identifies the content (data) generation; it changes each time a new version of the object's data is written.
  • metageneration Identifies the metadata generation; it changes each time the metadata of an existing object is added to or updated (see the sketch below).
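
Both fields can be read from the Python client. The sketch below (bucket name is a placeholder) lists live and archived versions of each object and prints their generation and metageneration values:

from google.cloud import storage

client = storage.Client()

# versions=True includes noncurrent (archived) object versions in the listing.
for blob in client.list_blobs("my-bucket", versions=True):
    print(blob.name, blob.generation, blob.metageneration)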

Notes

  • Object Versioning cannot be enabled on a bucket that currently has a retention policy.
  • There is no default limit on the number of object versions you can have. Each noncurrent version of an object is charged at the same rate as the live version of the object.
  • If you enable versioning, consider using Object Lifecycle Management, which can remove the oldest versions of an object as newer versions become noncurrent.

Object Lifecycle Management

To support common use cases such as setting a Time to Live (TTL) for objects or deleting the oldest noncurrent versions, Cloud Storage offers Object Lifecycle Management.

Overview

  1. You can assign a lifecycle management configuration to a bucket.
  2. The configuration is a set of rules that apply to all objects in the bucket.
  3. When an object meets the criteria of one of the rules, Cloud Storage automatically performs the specified action on the object.
  4. Updates to a lifecycle configuration can take up to 24 hours to go into effect.
  5. During that window, Object Lifecycle Management may still act on objects based on the old configuration.

Lifecycle conditions

A lifecycle rule includes conditions that an object must meet before the action defined in the rule is taken on the object. Lifecycle rules support the following conditions (a sketch of setting rules with the Python client follows the list):

  • Age
  • CreatedBefore
  • CustomTimeBefore
  • DaysSinceCustomTime
  • DaysSinceNoncurrentTime
  • IsLive
  • MatchesStorageClass
  • NoncurrentTimeBefore
  • NumberOfNewerVersions
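
A minimal sketch of attaching lifecycle rules with the Python client library (google-cloud-storage is assumed; the bucket name and thresholds are placeholders), deleting objects older than 365 days and keeping at most two noncurrent versions:

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-bucket")

# Each helper appends a delete rule; the keyword arguments map to the
# conditions listed above (Age, NumberOfNewerVersions, ...).
bucket.add_lifecycle_delete_rule(age=365)
bucket.add_lifecycle_delete_rule(number_of_newer_versions=2)

# Persist the updated lifecycle configuration on the bucket.
bucket.patch()

for rule in bucket.lifecycle_rules:
    print(rule)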

Object Change Notification

Object Change Notification can be used to notify an application when an object is added to or updated in a bucket, via a watch request.

  1. Trigger an event based on a change in the storage bucket.
  2. Push a notification from the storage bucket to an application endpoint [webhook] (a hedged sketch of a watch request follows).
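
A hedged sketch of opening a notification channel (watch request) through the JSON API using the discovery-based google-api-python-client; the channel id and webhook address are placeholders, and the receiving endpoint must be an HTTPS URL you control:

import uuid
from googleapiclient.discovery import build

service = build("storage", "v1")

channel = {
    "id": str(uuid.uuid4()),        # client-chosen channel identifier
    "type": "WEB_HOOK",             # notifications are pushed to a webhook
    "address": "https://example.com/notifications",  # placeholder endpoint
}

# objects().watchAll() opens an Object Change Notification channel for the bucket.
response = service.objects().watchAll(bucket="my-bucket", body=channel).execute()
print(response)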

Example

Load CSV data into BigQuery when a new file is added to a storage bucket.

  1. Create a storage bucket.
  2. Create a Cloud Function that receives an event when a new file is added.
  3. Process the data from Cloud Storage in the Cloud Function.
  4. Load the data into BigQuery.

Event handling in the Cloud Function

More details on Cloud Storage triggers are available in the Cloud Functions documentation.

def hello_gcs(event, context):
    """Background Cloud Function to be triggered by Cloud Storage.
       This generic function logs relevant data when a file is changed.

    Args:
        event (dict):  The dictionary with data specific to this type of event.
                       The `data` field contains a description of the event in
                       the Cloud Storage `object` format described here:
                       https://cloud.google.com/storage/docs/json_api/v1/objects#resource
        context (google.cloud.functions.Context): Metadata of triggering event.
    Returns:
        None; the output is written to Stackdriver Logging
    """

    print('Event ID: {}'.format(context.event_id))
    print('Event type: {}'.format(context.event_type))
    print('Bucket: {}'.format(event['bucket']))
    print('File: {}'.format(event['name']))
    print('Metageneration: {}'.format(event['metageneration']))
    print('Created: {}'.format(event['timeCreated']))
    print('Updated: {}'.format(event['updated']))

Loading CSV data to BigQuery.

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("name", "STRING"),
        bigquery.SchemaField("post_abbr", "STRING"),
    ],
    skip_leading_rows=1,
    # The source format defaults to CSV, so the line below is optional.
    source_format=bigquery.SourceFormat.CSV,
)
uri = "gs://cloud-samples-data/bigquery/us-states/us-states.csv"

load_job = client.load_table_from_uri(
    uri, table_id, job_config=job_config
)  # Make an API request.

load_job.result()  # Waits for the job to complete.

destination_table = client.get_table(table_id)  # Make an API request.
print("Loaded {} rows.".format(destination_table.num_rows))

Combining the two.

def csv_to_bigquery(event, context):
    """Background Cloud Function triggered by Cloud Storage.
       Logs the event data and loads the newly added CSV file into BigQuery.

    Args:
        event (dict):  The dictionary with data specific to this type of event.
                       The `data` field contains a description of the event in
                       the Cloud Storage `object` format described here:
                       https://cloud.google.com/storage/docs/json_api/v1/objects#resource
        context (google.cloud.functions.Context): Metadata of triggering event.
    Returns:
        None; the output is written to Stackdriver Logging
    """
    from google.cloud import bigquery

    print('Event ID: {}'.format(context.event_id))
    print('Event type: {}'.format(context.event_type))
    print('Bucket: {}'.format(event['bucket']))
    print('File: {}'.format(event['name']))
    print('Metageneration: {}'.format(event['metageneration']))
    print('Created: {}'.format(event['timeCreated']))
    print('Updated: {}'.format(event['updated']))

    # Construct a BigQuery client object.
    client = bigquery.Client()

    # TODO(developer): Set table_id to the ID of the table to create.
    table_id = "your-project.your_dataset.your_table_name"

    job_config = bigquery.LoadJobConfig(
        # The schema below matches the us-states sample CSV; adjust it for your data.
        schema=[
            bigquery.SchemaField("name", "STRING"),
            bigquery.SchemaField("post_abbr", "STRING"),
        ],
        skip_leading_rows=1,
        # The source format defaults to CSV, so the line below is optional.
        source_format=bigquery.SourceFormat.CSV,
    )

    # Build the gs:// URI of the file that triggered the function.
    uri = "gs://{}/{}".format(event['bucket'], event['name'])

    load_job = client.load_table_from_uri(
        uri, table_id, job_config=job_config
    )  # Make an API request.

    load_job.result()  # Waits for the job to complete.

    destination_table = client.get_table(table_id)  # Make an API request.
    print("Loaded {} rows.".format(destination_table.num_rows))

Note: The recommended way to pass events on to other services is via Pub/Sub notifications.
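
A minimal sketch of the Pub/Sub approach with the Python client library (google-cloud-storage is assumed; the bucket and topic names are placeholders, and the Cloud Storage service agent needs the pubsub.publisher role on the topic):

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-bucket")

notification = bucket.notification(
    topic_name="my-topic",            # Pub/Sub topic in the same project
    event_types=["OBJECT_FINALIZE"],  # notify only on new or overwritten objects
    payload_format="JSON_API_V1",     # include full object metadata as JSON
)
notification.create()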

Data Import Services

  • Transfer Appliance: roughly 100 TB to 1 PB of data.
  • Storage Transfer Service: Amazon S3, another Cloud Storage bucket, or an HTTP/HTTPS web source.
  • Offline Media Import: storage arrays, hard disk drives, tapes, and USB flash drives.