
GCP Dataproc Cluster

Deploys a standard (GCE-based) Google Cloud Dataproc cluster for running Apache Spark, Hadoop, and related data processing frameworks. The component supports master/worker node configuration, optional spot secondary workers for cost optimization, software component selection, CMEK encryption, and automatic lifecycle management for ephemeral clusters.

What Gets Created

When you deploy a GcpDataprocCluster resource, OpenMCF provisions:

  • Dataproc Cluster — a google_dataproc_cluster resource with master nodes, primary workers, and optional secondary (spot/preemptible) workers
  • GCS Staging Bucket — auto-created by GCP if not specified; stores job dependencies and intermediate data
  • GCS Temp Bucket — auto-created by GCP if not specified; stores ephemeral shuffle and spill data
  • Component Gateway Endpoints — authenticated HTTPS URLs for Spark UI, YARN ResourceManager, HDFS NameNode, Jupyter, and other web UIs (when endpointConfig.enableHttpPortAccess is true)

Prerequisites

  • GCP credentials configured via environment variables or OpenMCF provider config
  • A GCP project with the Dataproc API enabled (dataproc.googleapis.com)
  • VPC network or subnetwork if specifying custom networking (otherwise GCP uses the default network)
  • A service account with Dataproc Worker role if using a custom service account
  • A Cloud KMS key if enabling CMEK encryption for persistent disks
  • Initialization scripts in GCS if using init actions

Quick Start

Create a file dataproc.yaml:

apiVersion: gcp.openmcf.org/v1
kind: GcpDataprocCluster
metadata:
  name: my-spark-cluster
  labels:
    openmcf.org/provisioner: pulumi
    pulumi.openmcf.org/organization: my-org
    pulumi.openmcf.org/project: my-project
    pulumi.openmcf.org/stack.name: dev.GcpDataprocCluster.my-spark-cluster
spec:
  projectId:
    value: "my-gcp-project"
  region: us-central1
  clusterName: my-spark-cluster
  clusterConfig:
    masterConfig:
      machineType: n2-standard-4
    workerConfig:
      numInstances: 2
      machineType: n2-standard-4
    softwareConfig:
      imageVersion: "2.2-debian12"
    endpointConfig:
      enableHttpPortAccess: true
    lifecycleConfig:
      idleDeleteTtl: "1800s"

Deploy:

openmcf apply -f dataproc.yaml

This creates a Dataproc cluster with 1 master, 2 workers, Spark 3.5, Component Gateway enabled, and auto-delete after 30 minutes idle.
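Duration fields such as idleDeleteTtl and gracefulDecommissionTimeout use seconds-suffixed strings ("1800s" is the 30 minutes mentioned above). A minimal parsing sketch, assuming only the plain whole-seconds form used throughout this document (the API also accepts fractional seconds, which this helper does not handle):

```python
def duration_seconds(d: str) -> int:
    """Parse a seconds-suffixed duration string like '1800s' into an int.

    Fields such as idleDeleteTtl and gracefulDecommissionTimeout use
    this form; '1800s' is 30 minutes, '3600s' is one hour.
    """
    if not d.endswith("s"):
        raise ValueError(f"expected a seconds-suffixed duration, got {d!r}")
    return int(d[:-1])
```

For example, `duration_seconds("1800s")` returns 1800, i.e. 30 minutes of allowed idle time before auto-delete.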

Configuration Reference

Required Fields

| Field | Type | Description | Validation |
|---|---|---|---|
| projectId | StringValueOrRef | GCP project where the cluster is created. | Required |
| projectId.value | string | Direct project ID value. | — |
| projectId.valueFrom | object | Foreign key reference to a GcpProject resource. | Default kind: GcpProject |
| region | string | GCP region for the cluster (e.g., us-central1). | Required |
| clusterName | string | Cluster name. Lowercase letters, numbers, hyphens. | 2-55 chars, ^[a-z][a-z0-9-]{0,53}[a-z0-9]$ |
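The clusterName constraint can be checked locally before a deploy is attempted. A small sketch using the validation pattern from the table above:

```python
import re

# Pattern from the clusterName validation rule: 2-55 characters,
# lowercase letters, digits, and hyphens; must start with a letter
# and end with a letter or digit.
CLUSTER_NAME_RE = re.compile(r"^[a-z][a-z0-9-]{0,53}[a-z0-9]$")

def is_valid_cluster_name(name: str) -> bool:
    """Return True if `name` satisfies the Dataproc cluster-name rules."""
    return bool(CLUSTER_NAME_RE.fullmatch(name))
```

For instance, `my-spark-cluster` passes, while `My_Cluster` (uppercase, underscore) and `spark-` (trailing hyphen) do not.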

Optional Fields

| Field | Type | Default | Description |
|---|---|---|---|
| gracefulDecommissionTimeout | string | "0s" | Duration for YARN graceful decommissioning during scale-down (e.g., "3600s"). |
| clusterConfig.stagingBucket | StringValueOrRef | Auto-created | GCS bucket for staging job dependencies. Can reference a GcpGcsBucket. |
| clusterConfig.tempBucket | StringValueOrRef | Auto-created | GCS bucket for ephemeral data. Can reference a GcpGcsBucket. |
| clusterConfig.gceConfig.network | StringValueOrRef | Default VPC | VPC network for nodes. Mutually exclusive with subnetwork. Can reference GcpVpc. |
| clusterConfig.gceConfig.subnetwork | StringValueOrRef | — | VPC subnetwork for nodes. Mutually exclusive with network. Can reference GcpSubnetwork. |
| clusterConfig.gceConfig.serviceAccount | StringValueOrRef | Default Compute Engine SA | Service account for node VMs. Can reference GcpServiceAccount. |
| clusterConfig.gceConfig.zone | string | Auto-selected | Zone within the region for node placement. |
| clusterConfig.gceConfig.internalIpOnly | bool | false | Restrict nodes to internal IP addresses only. |
| clusterConfig.gceConfig.tags | string[] | [] | GCE network tags for firewall targeting. |
| clusterConfig.gceConfig.metadata | map | {} | Instance metadata key-value pairs. |
| clusterConfig.masterConfig.numInstances | int | 1 | Number of masters. Use 3 for HA mode. |
| clusterConfig.masterConfig.machineType | string | GCP default | Machine type (e.g., n2-standard-4). |
| clusterConfig.masterConfig.diskConfig.bootDiskSizeGb | int | 500 | Boot disk size in GB (min 10). |
| clusterConfig.masterConfig.diskConfig.bootDiskType | string | pd-standard | Disk type: pd-standard, pd-ssd, pd-balanced. |
| clusterConfig.masterConfig.diskConfig.numLocalSsds | int | 0 | Local SSDs (375 GB each). |
| clusterConfig.masterConfig.accelerators | object[] | [] | GPU/TPU accelerators (acceleratorType, acceleratorCount). |
| clusterConfig.workerConfig.numInstances | int | 2 | Number of primary workers. |
| clusterConfig.workerConfig.machineType | string | GCP default | Machine type for workers. |
| clusterConfig.workerConfig.minNumInstances | int | — | Minimum workers for autoscaling. |
| clusterConfig.workerConfig.diskConfig | object | — | Same structure as master disk config. |
| clusterConfig.workerConfig.accelerators | object[] | [] | GPU/TPU accelerators on workers. |
| clusterConfig.secondaryWorkerConfig.numInstances | int | 0 | Number of secondary (spot/preemptible) workers. |
| clusterConfig.secondaryWorkerConfig.preemptibility | string | PREEMPTIBLE | SPOT, PREEMPTIBLE, or NON_PREEMPTIBLE. |
| clusterConfig.secondaryWorkerConfig.diskConfig | object | — | Disk config for secondary workers. |
| clusterConfig.softwareConfig.imageVersion | string | Latest stable | Dataproc image version (e.g., 2.2-debian12). |
| clusterConfig.softwareConfig.optionalComponents | string[] | [] | Components: JUPYTER, DOCKER, PRESTO, ZEPPELIN, FLINK, TRINO. |
| clusterConfig.softwareConfig.properties | map | {} | Hadoop/Spark/YARN property overrides (e.g., "spark:spark.executor.memory": "4g"). |
| clusterConfig.initializationActions | object[] | [] | Startup scripts (script GCS URI, optional timeoutSec). |
| clusterConfig.autoscalingPolicyUri | string | — | URI of a Dataproc autoscaling policy resource. |
| clusterConfig.encryptionKmsKeyName | StringValueOrRef | Google-managed | Cloud KMS key for CMEK disk encryption. Can reference GcpKmsKey. |
| clusterConfig.endpointConfig.enableHttpPortAccess | bool | false | Enable Component Gateway for web UI access. |
| clusterConfig.lifecycleConfig.idleDeleteTtl | string | — | Auto-delete after idle (e.g., "1800s" for 30 min). |
| clusterConfig.lifecycleConfig.autoDeleteTime | string | — | Scheduled deletion timestamp (RFC3339). |
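The softwareConfig and lifecycleConfig fields above combine inside a single clusterConfig block. A sketch of how they nest (the property value and timestamp are illustrative, not recommendations):

```yaml
clusterConfig:
  softwareConfig:
    imageVersion: "2.2-debian12"
    properties:
      # Overrides use a "prefix:key" form, e.g. spark:, yarn:, hdfs:
      "spark:spark.executor.memory": "4g"
  lifecycleConfig:
    idleDeleteTtl: "1800s"                   # delete after 30 idle minutes
    autoDeleteTime: "2025-12-31T23:59:00Z"   # hard RFC3339 deadline
```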

Examples

Development Cluster with Jupyter

apiVersion: gcp.openmcf.org/v1
kind: GcpDataprocCluster
metadata:
  name: dev-jupyter
  labels:
    openmcf.org/provisioner: pulumi
    pulumi.openmcf.org/organization: my-org
    pulumi.openmcf.org/project: my-project
    pulumi.openmcf.org/stack.name: dev.GcpDataprocCluster.dev-jupyter
spec:
  projectId:
    value: "my-gcp-project"
  region: us-central1
  clusterName: dev-jupyter
  clusterConfig:
    masterConfig:
      machineType: e2-standard-4
    workerConfig:
      numInstances: 2
      machineType: e2-standard-4
    softwareConfig:
      imageVersion: "2.2-debian12"
      optionalComponents:
        - JUPYTER
    endpointConfig:
      enableHttpPortAccess: true
    lifecycleConfig:
      idleDeleteTtl: "1800s"

HA Production Cluster

apiVersion: gcp.openmcf.org/v1
kind: GcpDataprocCluster
metadata:
  name: prod-spark
  labels:
    openmcf.org/provisioner: pulumi
    pulumi.openmcf.org/organization: my-org
    pulumi.openmcf.org/project: my-project
    pulumi.openmcf.org/stack.name: prod.GcpDataprocCluster.prod-spark
spec:
  projectId:
    value: "my-gcp-project"
  region: us-central1
  clusterName: prod-spark
  gracefulDecommissionTimeout: "3600s"
  clusterConfig:
    gceConfig:
      subnetwork:
        value: "projects/my-project/regions/us-central1/subnetworks/dataproc"
      serviceAccount:
        value: "dataproc-sa@my-project.iam.gserviceaccount.com"
      internalIpOnly: true
    masterConfig:
      numInstances: 3
      machineType: n2-standard-8
      diskConfig:
        bootDiskSizeGb: 200
        bootDiskType: pd-ssd
    workerConfig:
      numInstances: 5
      machineType: n2-standard-8
      diskConfig:
        bootDiskSizeGb: 500
        bootDiskType: pd-ssd
        numLocalSsds: 2
    softwareConfig:
      imageVersion: "2.2-debian12"
    encryptionKmsKeyName:
      value: "projects/my-project/locations/us-central1/keyRings/my-ring/cryptoKeys/my-key"
    endpointConfig:
      enableHttpPortAccess: true

Cost-Optimized Batch with Spot Workers

apiVersion: gcp.openmcf.org/v1
kind: GcpDataprocCluster
metadata:
  name: batch-spark
  labels:
    openmcf.org/provisioner: pulumi
    pulumi.openmcf.org/organization: my-org
    pulumi.openmcf.org/project: my-project
    pulumi.openmcf.org/stack.name: prod.GcpDataprocCluster.batch-spark
spec:
  projectId:
    value: "my-gcp-project"
  region: us-central1
  clusterName: batch-spark
  clusterConfig:
    masterConfig:
      machineType: n2-standard-4
    workerConfig:
      numInstances: 2
      machineType: n2-standard-4
    secondaryWorkerConfig:
      numInstances: 10
      preemptibility: SPOT
    softwareConfig:
      imageVersion: "2.2-debian12"
    lifecycleConfig:
      idleDeleteTtl: "900s"

Foreign Key References

apiVersion: gcp.openmcf.org/v1
kind: GcpDataprocCluster
metadata:
  name: composed-spark
  labels:
    openmcf.org/provisioner: pulumi
    pulumi.openmcf.org/organization: my-org
    pulumi.openmcf.org/project: my-project
    pulumi.openmcf.org/stack.name: prod.GcpDataprocCluster.composed-spark
spec:
  projectId:
    valueFrom:
      kind: GcpProject
      name: my-project
  region: us-central1
  clusterName: composed-spark
  clusterConfig:
    stagingBucket:
      valueFrom:
        kind: GcpGcsBucket
        name: staging-bucket
    gceConfig:
      subnetwork:
        valueFrom:
          kind: GcpSubnetwork
          name: dataproc-subnet
      serviceAccount:
        valueFrom:
          kind: GcpServiceAccount
          name: dataproc-sa
    encryptionKmsKeyName:
      valueFrom:
        kind: GcpKmsKey
        name: dataproc-key
    masterConfig:
      machineType: n2-standard-4
    workerConfig:
      numInstances: 4
      machineType: n2-standard-8

Stack Outputs

After deployment, the following outputs are available in status.outputs:

| Output | Type | Description |
|---|---|---|
| cluster_id | string | Fully qualified cluster resource name (projects/{project}/regions/{region}/clusters/{cluster}) |
| cluster_name | string | Short cluster name (same as spec.clusterName) |
| cluster_uuid | string | Server-generated unique identifier |
| staging_bucket | string | GCS bucket used for staging (user-supplied or auto-created) |
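Downstream automation often needs the project, region, and short name back out of cluster_id. A minimal parsing sketch, assuming the documented projects/{project}/regions/{region}/clusters/{cluster} layout:

```python
def parse_cluster_id(cluster_id: str) -> dict:
    """Split a cluster_id output of the form
    projects/{project}/regions/{region}/clusters/{cluster}
    into its components, raising on anything unexpected.
    """
    parts = cluster_id.split("/")
    if (len(parts) != 6 or parts[0] != "projects"
            or parts[2] != "regions" or parts[4] != "clusters"):
        raise ValueError(f"unexpected cluster_id format: {cluster_id!r}")
    return {"project": parts[1], "region": parts[3], "cluster": parts[5]}
```

The "cluster" component this returns equals the cluster_name output.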

Related Components

  • GcpGcsBucket — Staging and temp bucket for job artifacts
  • GcpVpc — VPC network for cluster node placement
  • GcpSubnetwork — Subnetwork for controlled IP range allocation
  • GcpServiceAccount — Custom IAM identity for cluster VMs
  • GcpKmsKey — Customer-managed encryption keys for disk encryption
