
GCP Dataproc Virtual Cluster

Deploys a Dataproc on GKE virtual cluster that schedules Spark, PySpark, and SparkR workloads as Kubernetes pods on an existing GKE cluster. Instead of managing dedicated Compute Engine VMs, the virtual cluster shares GKE infrastructure with other workloads.

What Gets Created

When you deploy a GcpDataprocVirtualCluster resource, OpenMCF provisions:

  • Dataproc Cluster — a google_dataproc_cluster resource with virtual_cluster_config pointing to the specified GKE cluster and namespace
  • Node Pool Target Bindings — one or more GKE node pool assignments with Dataproc roles (DEFAULT, CONTROLLER, SPARK_DRIVER, SPARK_EXECUTOR) controlling where workloads are scheduled
  • Auxiliary Services — created only when auxiliaryServicesConfig is specified; integrates an existing Dataproc Metastore and/or Spark History Server with the virtual cluster

Prerequisites

  • GCP credentials configured via environment variables or OpenMCF provider config
  • A GCP project with the Dataproc API enabled (dataproc.googleapis.com)
  • A GKE cluster in the same project and region, referenced via gkeClusterTarget
  • At least one GKE node pool assigned the DEFAULT role
  • A Kubernetes namespace (optional — Dataproc creates one automatically if not specified)
  • A GCS bucket if specifying a custom staging bucket
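One way to verify the first prerequisites with the gcloud CLI; the project and cluster names below are placeholders, substitute your own:

```shell
# Enable the Dataproc API in the target project.
gcloud services enable dataproc.googleapis.com --project=my-gcp-project

# Confirm the target GKE cluster exists in the expected region.
gcloud container clusters describe my-gke-cluster \
  --location=us-central1 --project=my-gcp-project \
  --format="value(name,location)"
```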

Quick Start

Create a file dataproc-virtual-cluster.yaml:

apiVersion: gcp.openmcf.org/v1
kind: GcpDataprocVirtualCluster
metadata:
  name: my-spark-on-gke
  labels:
    openmcf.org/provisioner: pulumi
    pulumi.openmcf.org/organization: my-org
    pulumi.openmcf.org/project: my-project
    pulumi.openmcf.org/stack.name: dev.GcpDataprocVirtualCluster.my-spark-on-gke
spec:
  projectId:
    value: my-gcp-project
  region: us-central1
  gkeClusterTarget:
    value: projects/my-gcp-project/locations/us-central1/clusters/my-gke-cluster
  softwareConfig:
    componentVersion:
      SPARK: "3.5"
  nodePoolTargets:
    - nodePool:
        value: projects/my-gcp-project/locations/us-central1/clusters/my-gke-cluster/nodePools/default-pool
      roles:
        - DEFAULT

Deploy:

openmcf apply -f dataproc-virtual-cluster.yaml

This creates a Dataproc virtual cluster on an existing GKE cluster, scheduling Spark workloads on the default node pool.
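Once the virtual cluster is running, jobs are submitted through the standard Dataproc API. A hedged sketch using gcloud and the bundled SparkPi example (cluster and region match the Quick Start above):

```shell
# Submit a sample Spark job to the virtual cluster.
# local:/// paths resolve inside the Spark container image.
gcloud dataproc jobs submit spark \
  --cluster=my-spark-on-gke \
  --region=us-central1 \
  --class=org.apache.spark.examples.SparkPi \
  --jars=local:///usr/lib/spark/examples/jars/spark-examples.jar \
  -- 1000
```

The driver and executors for the job run as pods on the node pools bound to the SPARK_DRIVER and SPARK_EXECUTOR (or DEFAULT) roles.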

Configuration Reference

Required Fields

| Field | Type | Description | Validation |
|---|---|---|---|
| projectId | StringValueOrRef | GCP project where the virtual cluster is created. Can reference a GcpProject resource via valueFrom. | Required |
| region | string | GCP region. Must match the region of the target GKE cluster. | Required |
| gkeClusterTarget | StringValueOrRef | Fully qualified GKE cluster resource ID in the format projects/{project}/locations/{location}/clusters/{name}. Can reference a GcpGkeCluster resource via valueFrom. | Required |
| softwareConfig.componentVersion | map<string, string> | Component versions. The SPARK key is mandatory (e.g., {"SPARK": "3.5"}). | Required |
| nodePoolTargets | object[] | GKE node pool assignments with Dataproc roles. At least one must have the DEFAULT role. | Minimum 1 item |
| nodePoolTargets[].nodePool | StringValueOrRef | GKE node pool reference (short name or fully qualified path). Can reference a GcpGkeNodePool resource via valueFrom. | Required |
| nodePoolTargets[].roles | string[] | Dataproc roles for this node pool. Valid values: DEFAULT, CONTROLLER, SPARK_DRIVER, SPARK_EXECUTOR. | Minimum 1 item |

Optional Fields

| Field | Type | Default | Description |
|---|---|---|---|
| clusterName | string | metadata.name | Explicit Dataproc cluster name. Lowercase letters, numbers, hyphens; starts with a letter. |
| kubernetesNamespace | StringValueOrRef | Auto-created | Kubernetes namespace for the virtual cluster. Can reference a KubernetesNamespace resource via valueFrom. |
| stagingBucket | StringValueOrRef | Default bucket | GCS bucket for staging job dependencies. Can reference a GcpGcsBucket resource via valueFrom. |
| softwareConfig.properties | map<string, string> | {} | Daemon config properties in prefix:property format (e.g., {"spark:spark.kubernetes.container.image": "custom:latest"}). |
| nodePoolTargets[].nodePoolConfig.locations | string[] | — | Compute Engine zones for node pool nodes. |
| nodePoolTargets[].nodePoolConfig.machineType | string | — | Machine type for nodes (e.g., n1-standard-4). |
| nodePoolTargets[].nodePoolConfig.localSsdCount | int | 0 | Local SSD disks per node. |
| nodePoolTargets[].nodePoolConfig.minCpuPlatform | string | — | Minimum CPU platform (e.g., Intel Haswell). |
| nodePoolTargets[].nodePoolConfig.preemptible | bool | false | Use preemptible VMs. Cannot be used with CONTROLLER or sole DEFAULT role. |
| nodePoolTargets[].nodePoolConfig.spot | bool | false | Use Spot VMs. Same restrictions as preemptible. |
| nodePoolTargets[].nodePoolConfig.autoscaling.minNodeCount | int | — | Minimum nodes. Must be >= 0. |
| nodePoolTargets[].nodePoolConfig.autoscaling.maxNodeCount | int | — | Maximum nodes. Must be >= minNodeCount. |
| auxiliaryServicesConfig.metastoreService | string | — | Fully qualified Dataproc Metastore service name for Hive metastore integration. |
| auxiliaryServicesConfig.sparkHistoryServerCluster | string | — | Fully qualified Dataproc cluster name serving as the Spark History Server. |
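Optional fields compose with the required ones. A sketch of a spec combining several of them — a named cluster, a custom container image property, and an autoscaled Spot executor pool (all resource names are placeholders):

```yaml
spec:
  projectId:
    value: my-gcp-project
  region: us-central1
  clusterName: tuned-spark
  stagingBucket:
    value: my-staging-bucket
  softwareConfig:
    componentVersion:
      SPARK: "3.5"
    properties:
      "spark:spark.kubernetes.container.image": "custom:latest"
  nodePoolTargets:
    - nodePool:
        value: default-pool
      roles:
        - DEFAULT
    - nodePool:
        value: burst-pool
      roles:
        - SPARK_EXECUTOR
      nodePoolConfig:
        machineType: n1-standard-4
        spot: true
        autoscaling:
          minNodeCount: 0
          maxNodeCount: 10
```

Note that the Spot pool carries only the SPARK_EXECUTOR role, since spot and preemptible pools cannot hold the CONTROLLER role or be the sole DEFAULT pool.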

Examples

Multi-Pool Cluster with Role Separation

Separate Spark drivers and executors onto different node pools for resource isolation:

apiVersion: gcp.openmcf.org/v1
kind: GcpDataprocVirtualCluster
metadata:
  name: multi-pool-spark
  labels:
    openmcf.org/provisioner: pulumi
    pulumi.openmcf.org/organization: my-org
    pulumi.openmcf.org/project: my-project
    pulumi.openmcf.org/stack.name: prod.GcpDataprocVirtualCluster.multi-pool-spark
spec:
  projectId:
    value: my-gcp-project
  region: us-central1
  clusterName: multi-pool-spark
  gkeClusterTarget:
    value: projects/my-gcp-project/locations/us-central1/clusters/shared-gke
  softwareConfig:
    componentVersion:
      SPARK: "3.5"
  nodePoolTargets:
    - nodePool:
        value: driver-pool
      roles:
        - DEFAULT
        - CONTROLLER
        - SPARK_DRIVER
    - nodePool:
        value: executor-pool
      roles:
        - SPARK_EXECUTOR
      nodePoolConfig:
        autoscaling:
          minNodeCount: 2
          maxNodeCount: 20

Metastore-Integrated Cluster

A virtual cluster connected to an existing Dataproc Metastore for shared Hive table access:

apiVersion: gcp.openmcf.org/v1
kind: GcpDataprocVirtualCluster
metadata:
  name: metastore-spark
  labels:
    openmcf.org/provisioner: pulumi
    pulumi.openmcf.org/organization: my-org
    pulumi.openmcf.org/project: my-project
    pulumi.openmcf.org/stack.name: prod.GcpDataprocVirtualCluster.metastore-spark
spec:
  projectId:
    value: my-gcp-project
  region: us-central1
  clusterName: metastore-spark
  gkeClusterTarget:
    value: projects/my-gcp-project/locations/us-central1/clusters/shared-gke
  softwareConfig:
    componentVersion:
      SPARK: "3.5"
  nodePoolTargets:
    - nodePool:
        value: spark-pool
      roles:
        - DEFAULT
  auxiliaryServicesConfig:
    metastoreService: projects/my-gcp-project/locations/us-central1/services/shared-metastore
    sparkHistoryServerCluster: projects/my-gcp-project/regions/us-central1/clusters/history-server

Using Foreign Key References

Reference other OpenMCF-managed resources for fully composable infrastructure:

apiVersion: gcp.openmcf.org/v1
kind: GcpDataprocVirtualCluster
metadata:
  name: composed-spark
  labels:
    openmcf.org/provisioner: pulumi
    pulumi.openmcf.org/organization: my-org
    pulumi.openmcf.org/project: my-project
    pulumi.openmcf.org/stack.name: prod.GcpDataprocVirtualCluster.composed-spark
spec:
  projectId:
    valueFrom:
      kind: GcpProject
      name: my-project
      field: status.outputs.project_id
  region: us-central1
  gkeClusterTarget:
    valueFrom:
      kind: GcpGkeCluster
      name: shared-gke
      field: status.outputs.cluster_id
  kubernetesNamespace:
    valueFrom:
      kind: KubernetesNamespace
      name: spark-ns
      field: spec.name
  stagingBucket:
    valueFrom:
      kind: GcpGcsBucket
      name: spark-staging
      field: status.outputs.bucket_id
  softwareConfig:
    componentVersion:
      SPARK: "3.5"
  nodePoolTargets:
    - nodePool:
        valueFrom:
          kind: GcpGkeNodePool
          name: spark-pool
          field: status.outputs.node_pool_id
      roles:
        - DEFAULT

Stack Outputs

After deployment, the following outputs are available in status.outputs:

| Output | Type | Description |
|---|---|---|
| cluster_id | string | Fully qualified Dataproc cluster resource name (projects/{project}/regions/{region}/clusters/{name}) |
| cluster_name | string | Short name of the Dataproc cluster |
| cluster_uuid | string | Server-generated UUID for the cluster |
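Downstream OpenMCF resources can consume these outputs through the same valueFrom mechanism shown in the examples above. A sketch, where the consuming resource and its someClusterRef field are hypothetical:

```yaml
someClusterRef:
  valueFrom:
    kind: GcpDataprocVirtualCluster
    name: my-spark-on-gke
    field: status.outputs.cluster_id
```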

Related Components

  • GcpGkeCluster — provides the target GKE cluster for virtual cluster deployment
  • GcpGkeNodePool — provides node pools assigned to Dataproc roles
  • GcpGcsBucket — staging bucket for job dependencies
  • GcpDataprocCluster — standard GCE-based alternative for dedicated Spark clusters
  • GcpProject — provides the GCP project
