GCP Dataproc Cluster

A Google Cloud Dataproc Cluster is a managed cluster of Compute Engine virtual machines that runs open-source data-processing frameworks such as Apache Spark, Apache Hadoop, Presto and Trino. Dataproc handles the provisioning, configuration and ongoing management of the cluster, allowing you to submit jobs or create ephemeral clusters on demand while paying only for the compute you use. For full feature details see the official documentation: https://docs.cloud.google.com/dataproc/docs/concepts/overview.

Terraform Mappings

  • google_dataproc_cluster.name
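
As a minimal sketch, the mapped attribute corresponds to the `name` argument of the Terraform resource; the resource label, cluster name and region below are illustrative placeholders only.

```hcl
resource "google_dataproc_cluster" "example" {
  # "name" is the attribute mapped to gcp-dataproc-cluster.
  name   = "example-cluster"
  region = "us-central1"
}
```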

Supported Methods

  • GET: Get a gcp-dataproc-cluster by its "name"
  • LIST: List all gcp-dataproc-cluster
  • SEARCH

gcp-cloud-kms-crypto-key

A Dataproc cluster can be configured to use a customer-managed encryption key (CMEK) from Cloud KMS to encrypt the persistent disks attached to its nodes as well as the cluster’s Cloud Storage staging bucket.
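
In Terraform this is typically expressed through the cluster's encryption_config block; the key path below is a placeholder, not a real key.

```hcl
resource "google_dataproc_cluster" "cmek_example" {
  name   = "cmek-example"
  region = "us-central1"

  cluster_config {
    encryption_config {
      # Placeholder CMEK; persistent disks attached to cluster nodes are encrypted with it.
      kms_key_name = "projects/my-project/locations/us-central1/keyRings/my-ring/cryptoKeys/my-key"
    }
  }
}
```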

gcp-compute-image

Each Dataproc cluster is built from a specific Dataproc image (e.g., 2.1-debian11). The image determines the operating system and the versions of Hadoop, Spark and other components installed on the VM instances.
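
Pinning the image version in Terraform might look like the following sketch; the version string and names are illustrative.

```hcl
resource "google_dataproc_cluster" "image_example" {
  name   = "image-example"
  region = "us-central1"

  cluster_config {
    software_config {
      # Selects the OS plus the bundled Hadoop/Spark component versions.
      image_version = "2.1-debian11"
    }
  }
}
```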

gcp-compute-instance-group-manager

Behind the scenes, Dataproc creates managed instance groups (MIGs) for the primary worker pool and the optional secondary (preemptible or Spot) worker pool. These MIGs handle instance creation, health checking and replacement.
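
The pools that these instance groups back are declared on the cluster itself; a sketch with placeholder sizes:

```hcl
resource "google_dataproc_cluster" "workers_example" {
  name   = "workers-example"
  region = "us-central1"

  cluster_config {
    worker_config {
      num_instances = 2   # primary workers
    }
    preemptible_worker_config {
      num_instances = 2   # optional secondary (preemptible) workers
    }
  }
}
```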

gcp-compute-machine-type

When you define a cluster you choose the machine type for the master and worker nodes (e.g., n2-standard-8). This choice controls CPU, RAM and cost.
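
A sketch of choosing machine types per node role; the sizes and counts are placeholders.

```hcl
resource "google_dataproc_cluster" "machine_type_example" {
  name   = "machine-type-example"
  region = "us-central1"

  cluster_config {
    master_config {
      machine_type = "n2-standard-8"
    }
    worker_config {
      num_instances = 2
      machine_type  = "n2-standard-8"
    }
  }
}
```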

gcp-compute-network

The cluster’s VMs are attached to a specific VPC network, determining their routability and ability to reach other Google Cloud services or on-premises systems.
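
Attaching the cluster VMs to a VPC is done via gce_cluster_config; "default" below is only a placeholder for your own network name or self link.

```hcl
resource "google_dataproc_cluster" "network_example" {
  name   = "network-example"
  region = "us-central1"

  cluster_config {
    gce_cluster_config {
      # VPC network the cluster VMs are attached to.
      network = "default"
    }
  }
}
```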

gcp-compute-node-group

If you run Dataproc on sole-tenant nodes, the cluster associates each VM with a Compute Node Group to guarantee dedicated physical hardware.
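
Assuming a provider version that supports node_group_affinity, sole-tenant placement could be sketched as follows; the zone and node group URI are placeholders.

```hcl
resource "google_dataproc_cluster" "sole_tenant_example" {
  name   = "sole-tenant-example"
  region = "us-central1"

  cluster_config {
    gce_cluster_config {
      zone = "us-central1-a"

      node_group_affinity {
        # Placeholder sole-tenant node group; cluster VMs are scheduled onto it.
        node_group_uri = "projects/my-project/zones/us-central1-a/nodeGroups/my-node-group"
      }
    }
  }
}
```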

gcp-compute-subnetwork

Within the chosen VPC, the cluster can be pinned to a particular subnetwork to control IP address ranges, firewall rules and routing.
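
Pinning the cluster to a specific subnetwork (rather than a whole network) also happens in gce_cluster_config; the self link below is a placeholder, and network and subnetwork are normally mutually exclusive.

```hcl
resource "google_dataproc_cluster" "subnet_example" {
  name   = "subnet-example"
  region = "us-central1"

  cluster_config {
    gce_cluster_config {
      # Placeholder subnetwork self link; implies the parent VPC network.
      subnetwork       = "projects/my-project/regions/us-central1/subnetworks/my-subnet"
      internal_ip_only = true   # optionally keep nodes off public IPs
    }
  }
}
```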

gcp-dataproc-autoscaling-policy

Clusters may reference an Autoscaling Policy that automatically adds or removes worker nodes based on YARN or Spark metrics, optimising performance and cost.
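
A sketch of wiring a policy to a cluster, following the usual Terraform pattern; limits, timeouts and names are placeholders.

```hcl
resource "google_dataproc_autoscaling_policy" "example" {
  policy_id = "example-policy"
  location  = "us-central1"

  worker_config {
    max_instances = 10
  }

  basic_algorithm {
    yarn_config {
      graceful_decommission_timeout = "30s"
      scale_up_factor               = 0.5
      scale_down_factor             = 0.5
    }
  }
}

resource "google_dataproc_cluster" "autoscaling_example" {
  name   = "autoscaling-example"
  region = "us-central1"

  cluster_config {
    autoscaling_config {
      # Reference the policy defined above.
      policy_uri = google_dataproc_autoscaling_policy.example.name
    }
  }
}
```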

gcp-iam-service-account

Every Dataproc node runs under a Compute Engine service account. This account’s IAM roles determine the cluster’s permission to read/write Cloud Storage, publish metrics, access BigQuery, etc.
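
A sketch of running the cluster nodes under a dedicated service account; the account email is a placeholder.

```hcl
resource "google_dataproc_cluster" "sa_example" {
  name   = "sa-example"
  region = "us-central1"

  cluster_config {
    gce_cluster_config {
      # Placeholder service account; its IAM roles bound what cluster jobs can reach.
      service_account        = "dataproc-worker@my-project.iam.gserviceaccount.com"
      service_account_scopes = ["cloud-platform"]
    }
  }
}
```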

gcp-storage-bucket

Dataproc uses Cloud Storage buckets for staging job files, storing cluster logs and, optionally, as an HDFS replacement via the Cloud Storage connector (gs:// paths). The cluster therefore references one or more buckets during its lifecycle.
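
For example, pointing the cluster at a specific staging bucket; the bucket name is a placeholder, and if it is omitted Dataproc creates a bucket automatically.

```hcl
resource "google_dataproc_cluster" "bucket_example" {
  name   = "bucket-example"
  region = "us-central1"

  cluster_config {
    # Placeholder bucket used for job dependencies, driver output and config files.
    staging_bucket = "my-dataproc-staging-bucket"
  }
}
```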