GCP Dataproc Cluster

Google Cloud Dataproc is a managed Spark / Hadoop service that provisions clusters of Compute Engine VMs on demand, runs distributed data-processing jobs, and tears the infrastructure down again when it is no longer required. A Dataproc Cluster represents that fleet of VMs together with their configuration (networking, IAM, autoscaling rules, encryption settings, initialisation actions, etc.).
For full details see the official documentation: https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters#Cluster

Terraform Mappings

  • google_dataproc_cluster.name
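
The sketch below shows where that mapped attribute lives in a Terraform configuration; apart from the `google_dataproc_cluster.name` mapping itself, the resource label and all values are placeholders.

```hcl
# Minimal sketch, assuming the standard hashicorp/google provider.
resource "google_dataproc_cluster" "example" {
  name   = "example-cluster" # the attribute this item maps to
  region = "us-central1"     # placeholder region
}
```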

Supported Methods

  • GET: Get a gcp-dataproc-cluster by its "name"
  • LIST: List all gcp-dataproc-clusters
  • SEARCH

gcp-compute-network

Every Dataproc Cluster is launched inside a specific VPC network (and usually a sub-network) which controls its private IP range, routing and firewall behaviour.
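
As an illustrative sketch (resource labels, names and region are placeholders, assuming the standard hashicorp/google provider), the link is expressed through `gce_cluster_config`:

```hcl
resource "google_compute_network" "dataproc_vpc" {
  name                    = "dataproc-vpc"
  auto_create_subnetworks = true
}

resource "google_dataproc_cluster" "example" {
  name   = "example-cluster"
  region = "us-central1"

  cluster_config {
    gce_cluster_config {
      # Places the cluster's VMs in this VPC network; a "subnetwork"
      # argument can be used instead when a specific subnet is required.
      network = google_compute_network.dataproc_vpc.name
    }
  }
}
```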

gcp-storage-bucket

A Dataproc Cluster references one or more Cloud Storage buckets, e.g. the optional “cluster staging bucket” used for job jars and logs, or user-provided buckets mounted via Hadoop/Spark connectors.
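
A hedged Terraform sketch of the staging-bucket link (bucket and cluster names are placeholders):

```hcl
resource "google_storage_bucket" "dataproc_staging" {
  name     = "example-dataproc-staging" # bucket names are globally unique
  location = "US"
}

resource "google_dataproc_cluster" "example" {
  name   = "example-cluster"
  region = "us-central1"

  cluster_config {
    # Bucket used for job dependencies, config files and driver output;
    # Dataproc creates one automatically if this is omitted.
    staging_bucket = google_storage_bucket.dataproc_staging.name
  }
}
```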

gcp-compute-instance-group-manager

Each node pool (master, worker, secondary worker) in a Dataproc Cluster is implemented as a managed instance group created and controlled on the cluster’s behalf.

gcp-dataproc-autoscaling-policy

Clusters can be attached to an autoscaling policy that automatically adds or removes workers based on YARN metrics; the policy resource is linked to the cluster.
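
For example, the attachment might look like the following Terraform sketch (names, region and scaling values are placeholders):

```hcl
resource "google_dataproc_autoscaling_policy" "example" {
  policy_id = "example-policy"
  location  = "us-central1"

  worker_config {
    max_instances = 10
  }

  basic_algorithm {
    yarn_config {
      graceful_decommission_timeout = "30s"
      scale_up_factor               = 0.5
      scale_down_factor             = 0.5
    }
  }
}

resource "google_dataproc_cluster" "example" {
  name   = "example-cluster"
  region = "us-central1"

  cluster_config {
    autoscaling_config {
      # Attach the policy; Dataproc then adds/removes workers on YARN metrics.
      policy_uri = google_dataproc_autoscaling_policy.example.name
    }
  }
}
```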

gcp-compute-node-group

If a cluster is deployed on sole-tenant nodes, the underlying VMs belong to a Compute Node Group which is referenced in the cluster specification.
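
A possible Terraform sketch of that reference (the URI is a placeholder for an existing sole-tenant node group):

```hcl
resource "google_dataproc_cluster" "example" {
  name   = "example-cluster"
  region = "us-central1"

  cluster_config {
    gce_cluster_config {
      node_group_affinity {
        # Resource URI of an existing sole-tenant node group (placeholder shown).
        node_group_uri = "projects/my-project/zones/us-central1-a/nodeGroups/my-node-group"
      }
    }
  }
}
```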

gcp-iam-service-account

VMs in a Dataproc Cluster run under a default or user-supplied service account that grants them access to Storage, BigQuery, Pub/Sub and other Google Cloud APIs.
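
For illustration (account ID, names and region are placeholders), a user-supplied service account is set in `gce_cluster_config`:

```hcl
resource "google_service_account" "dataproc_nodes" {
  account_id   = "dataproc-nodes"
  display_name = "Dataproc cluster VM service account"
}

resource "google_dataproc_cluster" "example" {
  name   = "example-cluster"
  region = "us-central1"

  cluster_config {
    gce_cluster_config {
      # Cluster VMs call Google Cloud APIs as this service account.
      service_account        = google_service_account.dataproc_nodes.email
      service_account_scopes = ["cloud-platform"]
    }
  }
}
```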

gcp-cloud-kms-crypto-key

Customer-managed encryption keys (CMEK) from Cloud KMS can be configured to encrypt the cluster’s persistent disks (and, optionally, cluster and job data), creating a dependency on the Crypto Key.
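
A minimal sketch of the CMEK link (key ring, key and cluster names are placeholders); in practice the relevant service agents also need the `cloudkms.cryptoKeyEncrypterDecrypter` role on the key, which is omitted here:

```hcl
resource "google_kms_key_ring" "example" {
  name     = "example-ring"
  location = "us-central1"
}

resource "google_kms_crypto_key" "dataproc" {
  name     = "dataproc-cmek"
  key_ring = google_kms_key_ring.example.id
}

resource "google_dataproc_cluster" "example" {
  name   = "example-cluster"
  region = "us-central1"

  cluster_config {
    encryption_config {
      # CMEK used to encrypt the persistent disks of the cluster's VMs.
      kms_key_name = google_kms_crypto_key.dataproc.id
    }
  }
}
```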

gcp-compute-image

A Dataproc Cluster can use a custom or publicly available Compute Engine image for its node VMs (via the imageUri field of the master/worker instance group configuration), linking it to the corresponding Image resource.
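
An illustrative sketch (the image URI is a placeholder for a custom Dataproc image built on a Compute Engine image):

```hcl
resource "google_dataproc_cluster" "example" {
  name   = "example-cluster"
  region = "us-central1"

  cluster_config {
    master_config {
      num_instances = 1
      # Custom image used for the master VM (placeholder URI).
      image_uri = "projects/my-project/global/images/my-custom-dataproc-image"
    }
    worker_config {
      num_instances = 2
      # Same custom image for worker VMs.
      image_uri = "projects/my-project/global/images/my-custom-dataproc-image"
    }
  }
}
```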