Dataproc¶
Create Google Cloud Dataproc jobs from within Vertex AI Pipelines.
Components:

- DataprocPySparkBatchOp: Create a Dataproc PySpark batch workload and wait for it to finish.
- DataprocSparkBatchOp: Create a Dataproc Spark batch workload and wait for it to finish.
- DataprocSparkRBatchOp: Create a Dataproc SparkR batch workload and wait for it to finish.
- DataprocSparkSqlBatchOp: Create a Dataproc Spark SQL batch workload and wait for it to finish.
- v1.dataproc.DataprocPySparkBatchOp(project: str, main_python_file_uri: str, gcp_resources: dsl.OutputPath(str), location: str = 'us-central1', batch_id: str = '', labels: dict[str, str] = {}, container_image: str = '', runtime_config_version: str = '', runtime_config_properties: dict[str, str] = {}, service_account: str = '', network_tags: list[str] = [], kms_key: str = '', network_uri: str = '', subnetwork_uri: str = '', metastore_service: str = '', spark_history_dataproc_cluster: str = '', python_file_uris: list[str] = [], jar_file_uris: list[str] = [], file_uris: list[str] = [], archive_uris: list[str] = [], args: list[str] = [])¶
Create a Dataproc PySpark batch workload and wait for it to finish.
- Parameters¶
- project: str¶
Project to run the Dataproc batch workload.
- location: str = 'us-central1'¶
Location of the Dataproc batch workload. If not set, defaults to 'us-central1'.
- batch_id: str = ''¶
The ID to use for the batch, which will become the final component of the batch's resource name. If none is specified, a default name will be generated by the component. This value must be 4-63 characters. Valid characters are /[a-z][0-9]-/.
- labels: dict[str, str] = {}¶
The labels to associate with this batch. Label keys must contain 1 to 63 characters and must conform to RFC 1035. Label values may be empty, but, if present, must contain 1 to 63 characters and must conform to RFC 1035. No more than 32 labels can be associated with a batch. An object containing a list of "key": value pairs. Example: { "name": "wrench", "mass": "1.3kg", "count": "3" }.
- container_image: str = ''¶
Optional custom container image for the job runtime environment. If not specified, a default container image will be used.
- runtime_config_version: str = ''¶
Version of the batch runtime.
- runtime_config_properties: dict[str, str] = {}¶
Runtime configuration for the workload.
- service_account: str = ''¶
Service account that is used to execute the workload.
- network_tags: list[str] = []¶
Tags used for network traffic control.
- kms_key: str = ''¶
The Cloud KMS key to use for encryption.
- network_uri: str = ''¶
Network URI to connect the workload to.
- subnetwork_uri: str = ''¶
Subnetwork URI to connect the workload to.
- metastore_service: str = ''¶
Resource name of an existing Dataproc Metastore service.
- spark_history_dataproc_cluster: str = ''¶
The Spark History Server configuration for the workload.
- main_python_file_uri: str¶
The HCFS URI of the main Python file to use as the Spark driver. Must be a .py file.
- python_file_uris: list[str] = []¶
HCFS file URIs of Python files to pass to the PySpark framework. Supported file types: .py, .egg, and .zip.
- jar_file_uris: list[str] = []¶
HCFS URIs of jar files to add to the classpath of the Spark driver and tasks.
- file_uris: list[str] = []¶
HCFS URIs of files to be placed in the working directory of each executor.
- archive_uris: list[str] = []¶
HCFS URIs of archives to be extracted into the working directory of each executor. Supported file types: .jar, .tar, .tar.gz, .tgz, and .zip.
- args: list[str] = []¶
The arguments to pass to the driver. Do not include arguments that can be set as batch properties, such as --conf, since a collision can occur that causes an incorrect batch submission.
- Returns¶
gcp_resources: dsl.OutputPath(str)
Serialized gcp_resources proto tracking the Dataproc batch workload. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.
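A minimal usage sketch, assuming the import path implied by the component name above (google_cloud_pipeline_components.v1.dataproc) and a KFP v2 pipeline; the project ID and gs:// URIs are placeholders, not values from this reference:

```python
from kfp import compiler, dsl
from google_cloud_pipeline_components.v1.dataproc import DataprocPySparkBatchOp


@dsl.pipeline(name="dataproc-pyspark-batch-example")
def pyspark_batch_pipeline():
    # Placeholder project, batch ID, and GCS URIs; substitute your own values.
    DataprocPySparkBatchOp(
        project="my-project",
        location="us-central1",
        batch_id="pyspark-batch-001",
        main_python_file_uri="gs://my-bucket/jobs/word_count.py",
        args=["--input", "gs://my-bucket/data/input.txt"],
        runtime_config_properties={"spark.executor.instances": "2"},
    )


# Compile the pipeline for submission to Vertex AI Pipelines.
compiler.Compiler().compile(pyspark_batch_pipeline, "pyspark_batch_pipeline.json")
```

Note that the gcp_resources output is produced by the component itself and is not passed as an argument.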
- v1.dataproc.DataprocSparkBatchOp(project: str, gcp_resources: dsl.OutputPath(str), location: str = 'us-central1', batch_id: str = '', labels: dict[str, str] = {}, container_image: str = '', runtime_config_version: str = '', runtime_config_properties: dict[str, str] = {}, service_account: str = '', network_tags: list[str] = [], kms_key: str = '', network_uri: str = '', subnetwork_uri: str = '', metastore_service: str = '', spark_history_dataproc_cluster: str = '', main_jar_file_uri: str = '', main_class: str = '', jar_file_uris: list[str] = [], file_uris: list[str] = [], archive_uris: list[str] = [], args: list[str] = [])¶
Create a Dataproc Spark batch workload and wait for it to finish.
- Parameters¶
- project: str¶
Project to run the Dataproc batch workload.
- location: str = 'us-central1'¶
Location of the Dataproc batch workload. If not set, defaults to 'us-central1'.
- batch_id: str = ''¶
The ID to use for the batch, which will become the final component of the batch's resource name. If none is specified, a default name will be generated by the component. This value must be 4-63 characters. Valid characters are /[a-z][0-9]-/.
- labels: dict[str, str] = {}¶
The labels to associate with this batch. Label keys must contain 1 to 63 characters and must conform to RFC 1035. Label values may be empty, but, if present, must contain 1 to 63 characters and must conform to RFC 1035. No more than 32 labels can be associated with a batch. An object containing a list of "key": value pairs. Example: { "name": "wrench", "mass": "1.3kg", "count": "3" }.
- container_image: str = ''¶
Optional custom container image for the job runtime environment. If not specified, a default container image will be used.
- runtime_config_version: str = ''¶
Version of the batch runtime.
- runtime_config_properties: dict[str, str] = {}¶
Runtime configuration for the workload.
- service_account: str = ''¶
Service account that is used to execute the workload.
- network_tags: list[str] = []¶
Tags used for network traffic control.
- kms_key: str = ''¶
The Cloud KMS key to use for encryption.
- network_uri: str = ''¶
Network URI to connect the workload to.
- subnetwork_uri: str = ''¶
Subnetwork URI to connect the workload to.
- metastore_service: str = ''¶
Resource name of an existing Dataproc Metastore service.
- spark_history_dataproc_cluster: str = ''¶
The Spark History Server configuration for the workload.
- main_jar_file_uri: str = ''¶
The HCFS URI of the jar file that contains the main class.
- main_class: str = ''¶
The name of the driver main class. The jar file that contains the class must be in the classpath or specified in jar_file_uris.
- jar_file_uris: list[str] = []¶
HCFS URIs of jar files to add to the classpath of the Spark driver and tasks.
- file_uris: list[str] = []¶
HCFS URIs of files to be placed in the working directory of each executor.
- archive_uris: list[str] = []¶
HCFS URIs of archives to be extracted into the working directory of each executor. Supported file types: .jar, .tar, .tar.gz, .tgz, and .zip.
- args: list[str] = []¶
The arguments to pass to the driver. Do not include arguments that can be set as batch properties, such as --conf, since a collision can occur that causes an incorrect batch submission.
- Returns¶
gcp_resources: dsl.OutputPath(str)
Serialized gcp_resources proto tracking the Dataproc batch workload. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.
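A minimal sketch of this component inside a KFP pipeline, assuming the same import layout as above; the project ID, class name, and jar URI are placeholders. Per the parameter descriptions, the driver entry point may be given either as main_jar_file_uri or as main_class with the jar supplied via jar_file_uris, as shown here:

```python
from kfp import dsl
from google_cloud_pipeline_components.v1.dataproc import DataprocSparkBatchOp


@dsl.pipeline(name="dataproc-spark-batch-example")
def spark_batch_pipeline():
    # Placeholder values; the jar containing the main class is put on the
    # classpath through jar_file_uris.
    DataprocSparkBatchOp(
        project="my-project",
        location="us-central1",
        main_class="com.example.WordCount",
        jar_file_uris=["gs://my-bucket/jars/wordcount.jar"],
        args=["gs://my-bucket/data/input.txt"],
    )
```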
- v1.dataproc.DataprocSparkRBatchOp(project: str, gcp_resources: dsl.OutputPath(str), location: str = 'us-central1', batch_id: str = '', labels: dict[str, str] = {}, container_image: str = '', runtime_config_version: str = '', runtime_config_properties: dict[str, str] = {}, service_account: str = '', network_tags: list[str] = [], kms_key: str = '', network_uri: str = '', subnetwork_uri: str = '', metastore_service: str = '', spark_history_dataproc_cluster: str = '', main_r_file_uri: str = '', file_uris: list[str] = [], archive_uris: list[str] = [], args: list[str] = [])¶
Create a Dataproc SparkR batch workload and wait for it to finish.
- Parameters¶
- project: str¶
Project to run the Dataproc batch workload.
- location: str = 'us-central1'¶
Location of the Dataproc batch workload. If not set, defaults to 'us-central1'.
- batch_id: str = ''¶
The ID to use for the batch, which will become the final component of the batch's resource name. If none is specified, a default name will be generated by the component. This value must be 4-63 characters. Valid characters are /[a-z][0-9]-/.
- labels: dict[str, str] = {}¶
The labels to associate with this batch. Label keys must contain 1 to 63 characters and must conform to RFC 1035. Label values may be empty, but, if present, must contain 1 to 63 characters and must conform to RFC 1035. No more than 32 labels can be associated with a batch. An object containing a list of "key": value pairs. Example: { "name": "wrench", "mass": "1.3kg", "count": "3" }.
- container_image: str = ''¶
Optional custom container image for the job runtime environment. If not specified, a default container image will be used.
- runtime_config_version: str = ''¶
Version of the batch runtime.
- runtime_config_properties: dict[str, str] = {}¶
Runtime configuration for the workload.
- service_account: str = ''¶
Service account that is used to execute the workload.
- network_tags: list[str] = []¶
Tags used for network traffic control.
- kms_key: str = ''¶
The Cloud KMS key to use for encryption.
- network_uri: str = ''¶
Network URI to connect the workload to.
- subnetwork_uri: str = ''¶
Subnetwork URI to connect the workload to.
- metastore_service: str = ''¶
Resource name of an existing Dataproc Metastore service.
- spark_history_dataproc_cluster: str = ''¶
The Spark History Server configuration for the workload.
- main_r_file_uri: str = ''¶
The HCFS URI of the main R file to use as the driver. Must be a .R or .r file.
- file_uris: list[str] = []¶
HCFS URIs of files to be placed in the working directory of each executor.
- archive_uris: list[str] = []¶
HCFS URIs of archives to be extracted into the working directory of each executor. Supported file types: .jar, .tar, .tar.gz, .tgz, and .zip.
- args: list[str] = []¶
The arguments to pass to the driver. Do not include arguments that can be set as batch properties, such as --conf, since a collision can occur that causes an incorrect batch submission.
- Returns¶
gcp_resources: dsl.OutputPath(str)
Serialized gcp_resources proto tracking the Dataproc batch workload. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.
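A minimal sketch for the SparkR component under the same assumptions; the project ID and gs:// URIs below are placeholders:

```python
from kfp import dsl
from google_cloud_pipeline_components.v1.dataproc import DataprocSparkRBatchOp


@dsl.pipeline(name="dataproc-sparkr-batch-example")
def sparkr_batch_pipeline():
    # Placeholder project and GCS URIs; the driver script must be a .R or .r file.
    DataprocSparkRBatchOp(
        project="my-project",
        location="us-central1",
        main_r_file_uri="gs://my-bucket/jobs/analysis.R",
        file_uris=["gs://my-bucket/data/input.csv"],
        args=["--output", "gs://my-bucket/results/"],
    )
```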
- v1.dataproc.DataprocSparkSqlBatchOp(project: str, gcp_resources: dsl.OutputPath(str), location: str = 'us-central1', batch_id: str = '', labels: dict[str, str] = {}, container_image: str = '', runtime_config_version: str = '', runtime_config_properties: dict[str, str] = {}, service_account: str = '', network_tags: list[str] = [], kms_key: str = '', network_uri: str = '', subnetwork_uri: str = '', metastore_service: str = '', spark_history_dataproc_cluster: str = '', query_file_uri: str = '', query_variables: dict[str, str] = {}, jar_file_uris: list[str] = [])¶
Create a Dataproc Spark SQL batch workload and wait for it to finish.
- Parameters¶
- project: str¶
Project to run the Dataproc batch workload.
- location: str = 'us-central1'¶
Location of the Dataproc batch workload. If not set, defaults to 'us-central1'.
- batch_id: str = ''¶
The ID to use for the batch, which will become the final component of the batch's resource name. If none is specified, a default name will be generated by the component. This value must be 4-63 characters. Valid characters are /[a-z][0-9]-/.
- labels: dict[str, str] = {}¶
The labels to associate with this batch. Label keys must contain 1 to 63 characters and must conform to RFC 1035. Label values may be empty, but, if present, must contain 1 to 63 characters and must conform to RFC 1035. No more than 32 labels can be associated with a batch. An object containing a list of "key": value pairs. Example: { "name": "wrench", "mass": "1.3kg", "count": "3" }.
- container_image: str = ''¶
Optional custom container image for the job runtime environment. If not specified, a default container image will be used.
- runtime_config_version: str = ''¶
Version of the batch runtime.
- runtime_config_properties: dict[str, str] = {}¶
Runtime configuration for the workload.
- service_account: str = ''¶
Service account that is used to execute the workload.
- network_tags: list[str] = []¶
Tags used for network traffic control.
- kms_key: str = ''¶
The Cloud KMS key to use for encryption.
- network_uri: str = ''¶
Network URI to connect the workload to.
- subnetwork_uri: str = ''¶
Subnetwork URI to connect the workload to.
- metastore_service: str = ''¶
Resource name of an existing Dataproc Metastore service.
- spark_history_dataproc_cluster: str = ''¶
The Spark History Server configuration for the workload.
- query_file_uri: str = ''¶
The HCFS URI of the script that contains Spark SQL queries to execute.
- query_variables: dict[str, str] = {}¶
Mapping of query variable names to values (equivalent to the Spark SQL command: SET name="value";). An object containing a list of "key": value pairs. Example: { "name": "wrench", "mass": "1.3kg", "count": "3" }.
- jar_file_uris: list[str] = []¶
HCFS URIs of jar files to be added to the Spark CLASSPATH.
- Returns¶
gcp_resources: dsl.OutputPath(str)
Serialized gcp_resources proto tracking the Dataproc batch workload. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.
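A minimal sketch for the Spark SQL component under the same assumptions; the project ID, script URI, and variable values are placeholders:

```python
from kfp import dsl
from google_cloud_pipeline_components.v1.dataproc import DataprocSparkSqlBatchOp


@dsl.pipeline(name="dataproc-spark-sql-batch-example")
def spark_sql_batch_pipeline():
    # Placeholder values. Per the query_variables description above, each
    # entry is applied like the Spark SQL command SET table="events_2024_01_01";
    # before the script at query_file_uri runs.
    DataprocSparkSqlBatchOp(
        project="my-project",
        location="us-central1",
        query_file_uri="gs://my-bucket/sql/daily_report.sql",
        query_variables={"table": "events_2024_01_01"},
    )
```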