Dataproc
Create Google Cloud Dataproc jobs from within Vertex AI Pipelines.
Components:
- DataprocPySparkBatchOp: Create a Dataproc PySpark batch workload and wait for it to finish.
- DataprocSparkBatchOp: Create a Dataproc Spark batch workload and wait for it to finish.
- DataprocSparkRBatchOp: Create a Dataproc SparkR batch workload and wait for it to finish.
- DataprocSparkSqlBatchOp: Create a Dataproc Spark SQL batch workload and wait for it to finish.
v1.dataproc.DataprocPySparkBatchOp(main_python_file_uri: str, gcp_resources: dsl.OutputPath(str), location: str = 'us-central1', batch_id: str = '', labels: dict[str, str] = {}, container_image: str = '', runtime_config_version: str = '', runtime_config_properties: dict[str, str] = {}, autotuning_config: dict[str, str] = {}, cohort: str = '', service_account: str = '', network_tags: list[str] = [], kms_key: str = '', network_uri: str = '', subnetwork_uri: str = '', metastore_service: str = '', spark_history_dataproc_cluster: str = '', python_file_uris: list[str] = [], jar_file_uris: list[str] = [], file_uris: list[str] = [], archive_uris: list[str] = [], args: list[str] = [], project: str = '{{$.pipeline_google_cloud_project_id}}')
Create a Dataproc PySpark batch workload and wait for it to finish.
- Parameters:
  - location: str = 'us-central1'
    Location of the Dataproc batch workload. If not set, defaults to "us-central1".
  - batch_id: str = ''
    The ID to use for the batch, which becomes the final component of the batch’s resource name. If none is specified, the component generates a default name. This value must be 4-63 characters. Valid characters are /[a-z][0-9]-/.
  - labels: dict[str, str] = {}
    The labels to associate with this batch. Label keys must contain 1 to 63 characters and must conform to RFC 1035. Label values may be empty but, if present, must contain 1 to 63 characters and must conform to RFC 1035. No more than 32 labels can be associated with a batch. An object containing a list of "key": value pairs. Example: { "name": "wrench", "mass": "1.3kg", "count": "3" }.
  - container_image: str = ''
    Optional custom container image for the job runtime environment. If not specified, a default container image is used.
  - runtime_config_version: str = ''
    Version of the batch runtime.
  - runtime_config_properties: dict[str, str] = {}
    Runtime configuration for the workload.
  - autotuning_config: dict[str, str] = {}
    Autotuning configuration for the workload.
  - cohort: str = ''
    Cohort identifier for the workload.
  - service_account: str = ''
    Service account that is used to execute the workload.
  - network_tags: list[str] = []
    Tags used for network traffic control.
  - kms_key: str = ''
    The Cloud KMS key to use for encryption.
  - network_uri: str = ''
    Network URI to connect the workload to.
  - subnetwork_uri: str = ''
    Subnetwork URI to connect the workload to.
  - metastore_service: str = ''
    Resource name of an existing Dataproc Metastore service.
  - spark_history_dataproc_cluster: str = ''
    The Spark History Server configuration for the workload.
  - main_python_file_uri: str
    The HCFS URI of the main Python file to use as the Spark driver. Must be a .py file.
  - python_file_uris: list[str] = []
    HCFS file URIs of Python files to pass to the PySpark framework. Supported file types: .py, .egg, and .zip.
  - jar_file_uris: list[str] = []
    HCFS URIs of jar files to add to the classpath of the Spark driver and tasks.
  - file_uris: list[str] = []
    HCFS URIs of files to be placed in the working directory of each executor.
  - archive_uris: list[str] = []
    HCFS URIs of archives to be extracted into the working directory of each executor. Supported file types: .jar, .tar, .tar.gz, .tgz, and .zip.
  - args: list[str] = []
    The arguments to pass to the driver. Do not include arguments that can be set as batch properties, such as --conf, since a collision can occur that causes an incorrect batch submission.
  - project: str = '{{$.pipeline_google_cloud_project_id}}'
    Project to run the Dataproc batch workload. Defaults to the project in which the PipelineJob is run.
- Returns:
  - gcp_resources: dsl.OutputPath(str)
    Serialized gcp_resources proto tracking the Dataproc batch workload. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.
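As an illustration of how the parameters above fit together, the following is a minimal sketch of a pipeline that submits one PySpark batch. The bucket name, file paths, label, and pipeline name are hypothetical, and `build_pipeline` assumes the `kfp` and `google-cloud-pipeline-components` packages are installed. `project` is omitted so that it falls back to the pipeline's own project, and `gcp_resources` is an output, so it is not passed.

```python
def pyspark_batch_kwargs(bucket: str) -> dict:
    """Build keyword arguments for DataprocPySparkBatchOp.

    All URIs and label values here are hypothetical placeholders.
    """
    return {
        "location": "us-central1",
        # Required: the driver script, which must be a .py file.
        "main_python_file_uri": f"gs://{bucket}/jobs/wordcount.py",
        # Optional helper code shipped alongside the driver.
        "python_file_uris": [f"gs://{bucket}/jobs/helpers.zip"],
        # Application arguments only; batch properties such as --conf
        # should not be passed here.
        "args": ["--input", f"gs://{bucket}/data/input.txt"],
        "labels": {"team": "analytics"},
    }


def build_pipeline(bucket: str):
    """Return a KFP pipeline function that runs the PySpark batch.

    The SDK imports are local so that pyspark_batch_kwargs can be used
    and tested without the SDKs installed.
    """
    from kfp import dsl
    from google_cloud_pipeline_components.v1.dataproc import (
        DataprocPySparkBatchOp,
    )

    @dsl.pipeline(name="dataproc-pyspark-batch")
    def pipeline():
        # The component creates the batch and waits for it to finish.
        DataprocPySparkBatchOp(**pyspark_batch_kwargs(bucket))

    return pipeline
```

The returned function can then be compiled with the KFP compiler, for example `kfp.compiler.Compiler().compile(pipeline_func=build_pipeline("my-bucket"), package_path="pipeline.yaml")`, and submitted as a PipelineJob.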
v1.dataproc.DataprocSparkBatchOp(gcp_resources: dsl.OutputPath(str), location: str = 'us-central1', batch_id: str = '', labels: dict[str, str] = {}, container_image: str = '', runtime_config_version: str = '', runtime_config_properties: dict[str, str] = {}, autotuning_config: dict[str, str] = {}, cohort: str = '', service_account: str = '', network_tags: list[str] = [], kms_key: str = '', network_uri: str = '', subnetwork_uri: str = '', metastore_service: str = '', spark_history_dataproc_cluster: str = '', main_jar_file_uri: str = '', main_class: str = '', jar_file_uris: list[str] = [], file_uris: list[str] = [], archive_uris: list[str] = [], args: list[str] = [], project: str = '{{$.pipeline_google_cloud_project_id}}')
Create a Dataproc Spark batch workload and wait for it to finish.
- Parameters:
  - location: str = 'us-central1'
    Location of the Dataproc batch workload. If not set, defaults to "us-central1".
  - batch_id: str = ''
    The ID to use for the batch, which becomes the final component of the batch’s resource name. If none is specified, the component generates a default name. This value must be 4-63 characters. Valid characters are /[a-z][0-9]-/.
  - labels: dict[str, str] = {}
    The labels to associate with this batch. Label keys must contain 1 to 63 characters and must conform to RFC 1035. Label values may be empty but, if present, must contain 1 to 63 characters and must conform to RFC 1035. No more than 32 labels can be associated with a batch. An object containing a list of "key": value pairs. Example: { "name": "wrench", "mass": "1.3kg", "count": "3" }.
  - container_image: str = ''
    Optional custom container image for the job runtime environment. If not specified, a default container image is used.
  - runtime_config_version: str = ''
    Version of the batch runtime.
  - runtime_config_properties: dict[str, str] = {}
    Runtime configuration for the workload.
  - autotuning_config: dict[str, str] = {}
    Autotuning configuration for the workload.
  - cohort: str = ''
    Cohort identifier for the workload.
  - service_account: str = ''
    Service account that is used to execute the workload.
  - network_tags: list[str] = []
    Tags used for network traffic control.
  - kms_key: str = ''
    The Cloud KMS key to use for encryption.
  - network_uri: str = ''
    Network URI to connect the workload to.
  - subnetwork_uri: str = ''
    Subnetwork URI to connect the workload to.
  - metastore_service: str = ''
    Resource name of an existing Dataproc Metastore service.
  - spark_history_dataproc_cluster: str = ''
    The Spark History Server configuration for the workload.
  - main_jar_file_uri: str = ''
    The HCFS URI of the jar file that contains the main class.
  - main_class: str = ''
    The name of the driver main class. The jar file that contains the class must be in the classpath or specified in jar_file_uris.
  - jar_file_uris: list[str] = []
    HCFS URIs of jar files to add to the classpath of the Spark driver and tasks.
  - file_uris: list[str] = []
    HCFS URIs of files to be placed in the working directory of each executor.
  - archive_uris: list[str] = []
    HCFS URIs of archives to be extracted into the working directory of each executor. Supported file types: .jar, .tar, .tar.gz, .tgz, and .zip.
  - args: list[str] = []
    The arguments to pass to the driver. Do not include arguments that can be set as batch properties, such as --conf, since a collision can occur that causes an incorrect batch submission.
  - project: str = '{{$.pipeline_google_cloud_project_id}}'
    Project to run the Dataproc batch workload. Defaults to the project in which the PipelineJob is run.
- Returns:
  - gcp_resources: dsl.OutputPath(str)
    Serialized gcp_resources proto tracking the Dataproc batch workload. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.
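The batch_id rule stated in the parameter list (4-63 characters, with valid characters given as /[a-z][0-9]-/) applies to every component on this page and can be checked locally before a pipeline is compiled. The helper below is a sketch under one assumption: that the rule means "lowercase letters, digits, and hyphens". The function name is hypothetical, and the service may enforce further constraints that are not listed here.

```python
import re

# Assumption: "/[a-z][0-9]-/" is read as "lowercase letters, digits, and
# hyphens"; the documented length limit is 4 to 63 characters.
_BATCH_ID_PATTERN = re.compile(r"[a-z0-9-]{4,63}")


def is_valid_batch_id(batch_id: str) -> bool:
    """Return True if batch_id is empty or matches the documented rule."""
    if batch_id == "":
        # An empty ID is allowed: the component then generates a name.
        return True
    return _BATCH_ID_PATTERN.fullmatch(batch_id) is not None


if __name__ == "__main__":
    for candidate in ["", "nightly-etl-01", "abc", "Nightly_ETL"]:
        print(repr(candidate), is_valid_batch_id(candidate))
```

Running the check at pipeline-construction time surfaces a malformed ID before any batch is submitted.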
v1.dataproc.DataprocSparkRBatchOp(gcp_resources: dsl.OutputPath(str), location: str = 'us-central1', batch_id: str = '', labels: dict[str, str] = {}, container_image: str = '', runtime_config_version: str = '', runtime_config_properties: dict[str, str] = {}, autotuning_config: dict[str, str] = {}, cohort: str = '', service_account: str = '', network_tags: list[str] = [], kms_key: str = '', network_uri: str = '', subnetwork_uri: str = '', metastore_service: str = '', spark_history_dataproc_cluster: str = '', main_r_file_uri: str = '', file_uris: list[str] = [], archive_uris: list[str] = [], args: list[str] = [], project: str = '{{$.pipeline_google_cloud_project_id}}')
Create a Dataproc SparkR batch workload and wait for it to finish.
- Parameters:
  - location: str = 'us-central1'
    Location of the Dataproc batch workload. If not set, defaults to "us-central1".
  - batch_id: str = ''
    The ID to use for the batch, which becomes the final component of the batch’s resource name. If none is specified, the component generates a default name. This value must be 4-63 characters. Valid characters are /[a-z][0-9]-/.
  - labels: dict[str, str] = {}
    The labels to associate with this batch. Label keys must contain 1 to 63 characters and must conform to RFC 1035. Label values may be empty but, if present, must contain 1 to 63 characters and must conform to RFC 1035. No more than 32 labels can be associated with a batch. An object containing a list of "key": value pairs. Example: { "name": "wrench", "mass": "1.3kg", "count": "3" }.
  - container_image: str = ''
    Optional custom container image for the job runtime environment. If not specified, a default container image is used.
  - runtime_config_version: str = ''
    Version of the batch runtime.
  - runtime_config_properties: dict[str, str] = {}
    Runtime configuration for the workload.
  - autotuning_config: dict[str, str] = {}
    Autotuning configuration for the workload.
  - cohort: str = ''
    Cohort identifier for the workload.
  - service_account: str = ''
    Service account that is used to execute the workload.
  - network_tags: list[str] = []
    Tags used for network traffic control.
  - kms_key: str = ''
    The Cloud KMS key to use for encryption.
  - network_uri: str = ''
    Network URI to connect the workload to.
  - subnetwork_uri: str = ''
    Subnetwork URI to connect the workload to.
  - metastore_service: str = ''
    Resource name of an existing Dataproc Metastore service.
  - spark_history_dataproc_cluster: str = ''
    The Spark History Server configuration for the workload.
  - main_r_file_uri: str = ''
    The HCFS URI of the main R file to use as the driver. Must be a .R or .r file.
  - file_uris: list[str] = []
    HCFS URIs of files to be placed in the working directory of each executor.
  - archive_uris: list[str] = []
    HCFS URIs of archives to be extracted into the working directory of each executor. Supported file types: .jar, .tar, .tar.gz, .tgz, and .zip.
  - args: list[str] = []
    The arguments to pass to the driver. Do not include arguments that can be set as batch properties, such as --conf, since a collision can occur that causes an incorrect batch submission.
  - project: str = '{{$.pipeline_google_cloud_project_id}}'
    Project to run the Dataproc batch workload. Defaults to the project in which the PipelineJob is run.
- Returns:
  - gcp_resources: dsl.OutputPath(str)
    Serialized gcp_resources proto tracking the Dataproc batch workload. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.
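The file-type constraints on main_r_file_uri and archive_uris described above lend themselves to a quick local check. The sketch below only compares URI suffixes against the documented lists; the function name and the example URIs are hypothetical, and it does not verify that the objects exist or that the service accepts them.

```python
# Suffixes taken from the parameter descriptions above.
R_SUFFIXES = (".R", ".r")
ARCHIVE_SUFFIXES = (".jar", ".tar", ".tar.gz", ".tgz", ".zip")


def sparkr_uri_problems(main_r_file_uri: str, archive_uris: list[str]) -> list[str]:
    """Return human-readable problems found in the SparkR batch URIs."""
    problems = []
    # The driver must be a .R or .r file.
    if not main_r_file_uri.endswith(R_SUFFIXES):
        problems.append(f"main_r_file_uri is not an R file: {main_r_file_uri}")
    # Each archive must use one of the supported archive types.
    for uri in archive_uris:
        if not uri.endswith(ARCHIVE_SUFFIXES):
            problems.append(f"unsupported archive type: {uri}")
    return problems


if __name__ == "__main__":
    print(sparkr_uri_problems("gs://my-bucket/jobs/model.R",
                              ["gs://my-bucket/deps/libs.tar.gz"]))
    print(sparkr_uri_problems("gs://my-bucket/jobs/model.py",
                              ["gs://my-bucket/deps/libs.rar"]))
```

An empty result means the URIs pass the suffix check and can be handed to DataprocSparkRBatchOp as main_r_file_uri and archive_uris.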
v1.dataproc.DataprocSparkSqlBatchOp(gcp_resources: dsl.OutputPath(str), location: str = 'us-central1', batch_id: str = '', labels: dict[str, str] = {}, container_image: str = '', runtime_config_version: str = '', runtime_config_properties: dict[str, str] = {}, autotuning_config: dict[str, str] = {}, cohort: str = '', service_account: str = '', network_tags: list[str] = [], kms_key: str = '', network_uri: str = '', subnetwork_uri: str = '', metastore_service: str = '', spark_history_dataproc_cluster: str = '', query_file_uri: str = '', query_variables: dict[str, str] = {}, jar_file_uris: list[str] = [], project: str = '{{$.pipeline_google_cloud_project_id}}')
Create a Dataproc Spark SQL batch workload and wait for it to finish.
- Parameters:
  - location: str = 'us-central1'
    Location of the Dataproc batch workload. If not set, defaults to "us-central1".
  - batch_id: str = ''
    The ID to use for the batch, which becomes the final component of the batch’s resource name. If none is specified, the component generates a default name. This value must be 4-63 characters. Valid characters are /[a-z][0-9]-/.
  - labels: dict[str, str] = {}
    The labels to associate with this batch. Label keys must contain 1 to 63 characters and must conform to RFC 1035. Label values may be empty but, if present, must contain 1 to 63 characters and must conform to RFC 1035. No more than 32 labels can be associated with a batch. An object containing a list of "key": value pairs. Example: { "name": "wrench", "mass": "1.3kg", "count": "3" }.
  - container_image: str = ''
    Optional custom container image for the job runtime environment. If not specified, a default container image is used.
  - runtime_config_version: str = ''
    Version of the batch runtime.
  - runtime_config_properties: dict[str, str] = {}
    Runtime configuration for the workload.
  - autotuning_config: dict[str, str] = {}
    Autotuning configuration for the workload.
  - cohort: str = ''
    Cohort identifier for the workload.
  - service_account: str = ''
    Service account that is used to execute the workload.
  - network_tags: list[str] = []
    Tags used for network traffic control.
  - kms_key: str = ''
    The Cloud KMS key to use for encryption.
  - network_uri: str = ''
    Network URI to connect the workload to.
  - subnetwork_uri: str = ''
    Subnetwork URI to connect the workload to.
  - metastore_service: str = ''
    Resource name of an existing Dataproc Metastore service.
  - spark_history_dataproc_cluster: str = ''
    The Spark History Server configuration for the workload.
  - query_file_uri: str = ''
    The HCFS URI of the script that contains Spark SQL queries to execute.
  - query_variables: dict[str, str] = {}
    Mapping of query variable names to values (equivalent to the Spark SQL command: SET name="value";). An object containing a list of "key": value pairs. Example: { "name": "wrench", "mass": "1.3kg", "count": "3" }.
  - jar_file_uris: list[str] = []
    HCFS URIs of jar files to be added to the Spark CLASSPATH.
  - project: str = '{{$.pipeline_google_cloud_project_id}}'
    Project to run the Dataproc batch workload. Defaults to the project in which the PipelineJob is run.
- Returns:
  - gcp_resources: dsl.OutputPath(str)
    Serialized gcp_resources proto tracking the Dataproc batch workload. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.
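The query_variables description above equates each entry with a Spark SQL `SET name="value";` command. The sketch below spells out that equivalence for a hypothetical set of variables. It is purely illustrative: the helper name, URI, and variable names are assumptions, no escaping is applied, and the component does not require you to generate these statements yourself.

```python
def as_set_statements(query_variables: dict[str, str]) -> list[str]:
    """Render each query variable as the equivalent Spark SQL SET command."""
    return [f'SET {name}="{value}";' for name, value in query_variables.items()]


# Hypothetical arguments for DataprocSparkSqlBatchOp.
sql_batch_kwargs = {
    "query_file_uri": "gs://my-bucket/sql/daily_report.sql",
    "query_variables": {"run_date": "2024-01-01", "region": "emea"},
}

if __name__ == "__main__":
    for statement in as_set_statements(sql_batch_kwargs["query_variables"]):
        print(statement)
```

Under this reading, the script at query_file_uri sees `run_date` and `region` as if those SET commands had been issued before its queries.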