Kerberos on Google Cloud Dataproc adds user authentication, isolation between users, and encryption of traffic inside the cluster, which makes it a key building block for secure multi-tenancy. In this guide, we’ll walk through setting up Kerberos on a Dataproc cluster step by step.

Prerequisites

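Before you begin, ensure you have:

  1. A Google Cloud project with the Dataproc and Cloud KMS APIs enabled.
  2. The gcloud CLI (which includes gsutil) installed and authenticated against that project.
  3. A Cloud Storage bucket for the encrypted password file (gs://dp-run-bucket in the examples below).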

Step 1: Create a Service Account

Start by creating a service account for Dataproc:

gcloud iam service-accounts create dp-svc-runner \
    --description="Dataproc service account" \
    --display-name="Dataproc SA"
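
The service account's email address is reused in every later IAM command. As a minimal sketch, assuming the placeholder project ID YOUR_PROJECT_ID used throughout this guide, you can derive the address once in a shell variable and define a helper that confirms the account exists. The helper name confirm_service_account is made up for illustration.

```shell
# Placeholder project ID from this guide; replace it with your own.
PROJECT_ID="YOUR_PROJECT_ID"

# Service account emails have the form NAME@PROJECT_ID.iam.gserviceaccount.com.
SA_EMAIL="dp-svc-runner@${PROJECT_ID}.iam.gserviceaccount.com"

# Hypothetical helper: prints the account's email if it exists and
# returns a non-zero exit status otherwise.
confirm_service_account() {
  gcloud iam service-accounts describe "$SA_EMAIL" \
    --project="$PROJECT_ID" \
    --format="value(email)"
}

echo "Service account: ${SA_EMAIL}"
```

Calling confirm_service_account after the create command should echo the same address back.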

Step 2: Create a KMS Key

If you don’t have one already, create a key ring:

gcloud kms keyrings create ahmed-keyring \
    --location us-east1

Next, create a key within the key ring:

gcloud kms keys create dataproc-ahmed-key \
    --location us-east1 \
    --keyring ahmed-keyring \
    --purpose encryption

Grant the Cloud KMS CryptoKey Decrypter role (roles/cloudkms.cryptoKeyDecrypter) to the service account so the cluster can decrypt the password file:

gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
    --member serviceAccount:dp-svc-runner@YOUR_PROJECT_ID.iam.gserviceaccount.com \
    --role roles/cloudkms.cryptoKeyDecrypter
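
The project-level binding above lets the service account decrypt with every key in the project. If that is broader than you want, a narrower alternative is to bind the role on the key itself. The sketch below assumes the key and key ring names from this guide. The run wrapper and the DRY_RUN variable are conventions of this example, not gcloud features: by default the command is only printed so you can review it first.

```shell
PROJECT_ID="YOUR_PROJECT_ID"   # placeholder; replace with your project ID
SA_EMAIL="dp-svc-runner@${PROJECT_ID}.iam.gserviceaccount.com"

# Print the command instead of executing it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Bind the decrypter role on the single key rather than on the whole project.
grant_key_decrypter() {
  run gcloud kms keys add-iam-policy-binding dataproc-ahmed-key \
    --project="$PROJECT_ID" \
    --location=us-east1 \
    --keyring=ahmed-keyring \
    --member="serviceAccount:${SA_EMAIL}" \
    --role=roles/cloudkms.cryptoKeyDecrypter
}

grant_key_decrypter
```

Run it once as-is to inspect the command, then again with DRY_RUN=0 to apply the binding.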

Step 3: Grant the Dataproc Worker Role

Grant the Dataproc Worker role (roles/dataproc.worker) to the service account, which the cluster VMs need in order to run jobs:

gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
    --member="serviceAccount:dp-svc-runner@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/dataproc.worker"

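Step 4: Create the Kerberos Root Principal Password

Encrypt the Kerberos root principal password with the KMS key and write the ciphertext to a file. This command runs under your own gcloud credentials, so your user needs permission to encrypt with the key: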

echo "my-strong-and-complex-password" | \
  gcloud kms encrypt \
    --location=us-east1 \
    --keyring=ahmed-keyring \
    --key=dataproc-ahmed-key \
    --plaintext-file=- \
    --ciphertext-file=kerberos-root-principal-password.encrypted
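
Typing the password inline leaves it in your shell history. One possible alternative, sketched here, is to generate a random password and pipe it straight into the same gcloud kms encrypt command. The function names generate_password and encrypt_password are hypothetical, and the sketch assumes head, base64, and /dev/urandom are available on your machine.

```shell
# Generate a random password: 24 random bytes encode to 32 base64 characters.
generate_password() {
  head -c 24 /dev/urandom | base64
}

# Encrypt whatever arrives on stdin with the key from Step 2.
# Reading from stdin keeps the password out of the command line.
encrypt_password() {
  gcloud kms encrypt \
    --location=us-east1 \
    --keyring=ahmed-keyring \
    --key=dataproc-ahmed-key \
    --plaintext-file=- \
    --ciphertext-file=kerberos-root-principal-password.encrypted
}

# Usage (requires gcloud and the key created in Step 2):
#   generate_password | encrypt_password
```

With this approach the plaintext is never displayed. Anyone with decrypt permission on the key can recover it later by running gcloud kms decrypt against the encrypted file.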

Copy the kerberos-root-principal-password.encrypted file to a GCS bucket:

gsutil cp kerberos-root-principal-password.encrypted gs://dp-run-bucket/kerberos-rt-file/

Step 5: Create the Cluster

Now create the Dataproc cluster with either Terraform or the gcloud command line. Here’s a Terraform configuration snippet:

resource "google_dataproc_cluster" "simplecluster" {
  name    = "simplecluster"
  region  = "us-east1"
  project = "YOUR_PROJECT_ID"

  cluster_config {
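    # Run the cluster VMs as the service account created in Step 1.
    gce_cluster_config {
      service_account = "dp-svc-runner@YOUR_PROJECT_ID.iam.gserviceaccount.com"
    }

    # Other configuration settings...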

    security_config {
      kerberos_config {
        # Explicitly turn Kerberos on; the two URIs alone only supply the password.
        enable_kerberos             = true
        kms_key_uri                 = "projects/YOUR_PROJECT_ID/locations/us-east1/keyRings/ahmed-keyring/cryptoKeys/dataproc-ahmed-key"
        root_principal_password_uri = "gs://dp-run-bucket/kerberos-rt-file/kerberos-root-principal-password.encrypted"
      }
    }
  }
}

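Or use the gcloud command line. Kerberos requires image version 1.3 or later, and the cluster should run as the service account from Step 1 so that it can decrypt the password:

gcloud dataproc clusters create cluster-name \
    --region=us-east1 \
    --image-version=2.1 \
    --service-account=dp-svc-runner@YOUR_PROJECT_ID.iam.gserviceaccount.com \
    --kerberos-root-principal-password-uri=gs://dp-run-bucket/kerberos-rt-file/kerberos-root-principal-password.encrypted \
    --kerberos-kms-key=projects/YOUR_PROJECT_ID/locations/us-east1/keyRings/ahmed-keyring/cryptoKeys/dataproc-ahmed-key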
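
Once the cluster is up, it is worth checking that Kerberos was actually enabled. The sketch below reads the enableKerberos field from the cluster description. It assumes the cluster is named cluster-name and lives in us-east1. The helper names kerberos_status and is_enabled are made up for this example, and the exact text gcloud prints for a boolean is also an assumption, so both capitalizations are accepted.

```shell
CLUSTER="cluster-name"   # placeholder cluster name
REGION="us-east1"        # assumed region; match the one you created the cluster in

# Query the Kerberos flag from the cluster's security configuration.
kerberos_status() {
  gcloud dataproc clusters describe "$CLUSTER" \
    --region="$REGION" \
    --format="value(config.securityConfig.kerberosConfig.enableKerberos)"
}

# Succeeds only when the reported value is a boolean true.
is_enabled() {
  case "$1" in
    True|true) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (requires gcloud and an existing cluster):
#   if is_enabled "$(kerberos_status)"; then echo "Kerberos is enabled"; fi
```

An empty result suggests the cluster was created without a Kerberos configuration.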

Congratulations! You’ve successfully set up Kerberos on your Dataproc cluster, enhancing its security and authentication capabilities.

Remember to configure other aspects of your cluster as needed and explore further documentation to optimize your Dataproc cluster for your specific use cases.