Dataproc
Create Google Cloud Dataproc jobs from within Vertex AI Pipelines.
Components:
| DataprocPySparkBatchOp | Create a Dataproc PySpark batch workload and wait for it to finish. |
| DataprocSparkBatchOp | Create a Dataproc Spark batch workload and wait for it to finish. |
| DataprocSparkRBatchOp | Create a Dataproc SparkR batch workload and wait for it to finish. |
| DataprocSparkSqlBatchOp | Create a Dataproc Spark SQL batch workload and wait for it to finish. |
v1.dataproc.DataprocPySparkBatchOp(main_python_file_uri: str, gcp_resources: dsl.OutputPath(str), location: str = 'us-central1', batch_id: str = '', labels: dict[str, str] = {}, container_image: str = '', runtime_config_version: str = '', runtime_config_properties: dict[str, str] = {}, service_account: str = '', network_tags: list[str] = [], kms_key: str = '', network_uri: str = '', subnetwork_uri: str = '', metastore_service: str = '', spark_history_dataproc_cluster: str = '', python_file_uris: list[str] = [], jar_file_uris: list[str] = [], file_uris: list[str] = [], archive_uris: list[str] = [], args: list[str] = [], project: str = '{{$.pipeline_google_cloud_project_id}}')
Create a Dataproc PySpark batch workload and wait for it to finish.
Parameters
:param location: Location of the Dataproc batch workload. If not set, defaults to "us-central1".
:param batch_id: The ID to use for the batch, which will become the final component of the batch’s resource name. If none is specified, a default name will be generated by the component. This value must be 4-63 characters. Valid characters are /[a-z][0-9]-/.
:param labels: The labels to associate with this batch. Label keys must contain 1 to 63 characters, and must conform to RFC 1035. Label values may be empty, but, if present, must contain 1 to 63 characters, and must conform to RFC 1035. No more than 32 labels can be associated with a batch. An object containing a list of "key": value pairs. Example: { "name": "wrench", "mass": "1.3kg", "count": "3" }.
:param container_image: Optional custom container image for the job runtime environment. If not specified, a default container image will be used.
:param runtime_config_version: Version of the batch runtime.
:param runtime_config_properties: Runtime configuration for the workload.
:param service_account: Service account that is used to execute the workload.
:param network_tags: Tags used for network traffic control.
:param kms_key: The Cloud KMS key to use for encryption.
:param network_uri: Network URI to connect workload to.
:param subnetwork_uri: Subnetwork URI to connect workload to.
:param metastore_service: Resource name of an existing Dataproc Metastore service.
:param spark_history_dataproc_cluster: The Spark History Server configuration for the workload.
:param main_python_file_uri: The HCFS URI of the main Python file to use as the Spark driver. Must be a .py file.
:param python_file_uris: HCFS file URIs of Python files to pass to the PySpark framework. Supported file types: .py, .egg, and .zip.
:param jar_file_uris: HCFS URIs of jar files to add to the classpath of the Spark driver and tasks.
:param file_uris: HCFS URIs of files to be placed in the working directory of each executor.
:param archive_uris: HCFS URIs of archives to be extracted into the working directory of each executor. Supported file types: .jar, .tar, .tar.gz, .tgz, and .zip.
:param args: The arguments to pass to the driver. Do not include arguments that can be set as batch properties, such as --conf, since a collision can occur that causes an incorrect batch submission.
:param project: Project to run the Dataproc batch workload. Defaults to the project in which the PipelineJob is run.
Returns
``gcp_resources: dsl.OutputPath(str)`` Serialized gcp_resources proto tracking the Dataproc batch workload. For more details, see
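The batch_id and labels constraints described above can be checked before a pipeline is compiled. The following is a minimal sketch, not part of the component library: the helper names `validate_batch_id` and `validate_labels` are hypothetical, the regular expression is one assumed reading of "4-63 characters" from /[a-z][0-9]-/, and RFC 1035 conformance of label keys and values is not checked.

```python
import re
from typing import Dict, List

# Assumed reading of the documented rule: 4-63 characters drawn from
# lowercase letters, digits, and hyphens.
_BATCH_ID_RE = re.compile(r"[a-z0-9-]{4,63}")
_MAX_LABELS = 32  # "No more than 32 labels can be associated with a batch."


def validate_batch_id(batch_id: str) -> bool:
    """Return True if batch_id is empty (a default is generated) or valid."""
    if batch_id == "":
        return True
    return _BATCH_ID_RE.fullmatch(batch_id) is not None


def validate_labels(labels: Dict[str, str]) -> List[str]:
    """Return a list of problems found; an empty list means no problems."""
    problems: List[str] = []
    if len(labels) > _MAX_LABELS:
        problems.append(f"too many labels: {len(labels)} > {_MAX_LABELS}")
    for key, value in labels.items():
        if not 1 <= len(key) <= 63:
            problems.append(f"label key {key!r} must be 1 to 63 characters")
        # Values may be empty, so only the upper bound is checked.
        if len(value) > 63:
            problems.append(f"label value for {key!r} exceeds 63 characters")
    return problems
```

Values that pass these checks could then be supplied as the `batch_id` and `labels` arguments of the component; the service may still reject them for reasons this sketch does not cover.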
v1.dataproc.DataprocSparkBatchOp(gcp_resources: dsl.OutputPath(str), location: str = 'us-central1', batch_id: str = '', labels: dict[str, str] = {}, container_image: str = '', runtime_config_version: str = '', runtime_config_properties: dict[str, str] = {}, service_account: str = '', network_tags: list[str] = [], kms_key: str = '', network_uri: str = '', subnetwork_uri: str = '', metastore_service: str = '', spark_history_dataproc_cluster: str = '', main_jar_file_uri: str = '', main_class: str = '', jar_file_uris: list[str] = [], file_uris: list[str] = [], archive_uris: list[str] = [], args: list[str] = [], project: str = '{{$.pipeline_google_cloud_project_id}}')
Create a Dataproc Spark batch workload and wait for it to finish.
Parameters
:param location: Location of the Dataproc batch workload. If not set, defaults to "us-central1".
:param batch_id: The ID to use for the batch, which will become the final component of the batch’s resource name. If none is specified, a default name will be generated by the component. This value must be 4-63 characters. Valid characters are /[a-z][0-9]-/.
:param labels: The labels to associate with this batch. Label keys must contain 1 to 63 characters, and must conform to RFC 1035. Label values may be empty, but, if present, must contain 1 to 63 characters, and must conform to RFC 1035. No more than 32 labels can be associated with a batch. An object containing a list of "key": value pairs. Example: { "name": "wrench", "mass": "1.3kg", "count": "3" }.
:param container_image: Optional custom container image for the job runtime environment. If not specified, a default container image will be used.
:param runtime_config_version: Version of the batch runtime.
:param runtime_config_properties: Runtime configuration for the workload.
:param service_account: Service account that is used to execute the workload.
:param network_tags: Tags used for network traffic control.
:param kms_key: The Cloud KMS key to use for encryption.
:param network_uri: Network URI to connect workload to.
:param subnetwork_uri: Subnetwork URI to connect workload to.
:param metastore_service: Resource name of an existing Dataproc Metastore service.
:param spark_history_dataproc_cluster: The Spark History Server configuration for the workload.
:param main_jar_file_uri: The HCFS URI of the jar file that contains the main class.
:param main_class: The name of the driver main class. The jar file that contains the class must be in the classpath or specified in jar_file_uris.
:param jar_file_uris: HCFS URIs of jar files to add to the classpath of the Spark driver and tasks.
:param file_uris: HCFS URIs of files to be placed in the working directory of each executor.
:param archive_uris: HCFS URIs of archives to be extracted into the working directory of each executor. Supported file types: .jar, .tar, .tar.gz, .tgz, and .zip.
:param args: The arguments to pass to the driver. Do not include arguments that can be set as batch properties, such as --conf, since a collision can occur that causes an incorrect batch submission.
:param project: Project to run the Dataproc batch workload. Defaults to the project in which the PipelineJob is run.
Returns
``gcp_resources: dsl.OutputPath(str)`` Serialized gcp_resources proto tracking the Dataproc batch workload. For more details, see
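Because args should not carry options such as --conf, one way to follow that advice is to pull those options out of an existing argument list before calling the component. The sketch below is an illustration only: `split_conf_args` is a hypothetical helper, and it assumes that each extracted key=value pair is something you would instead place in runtime_config_properties.

```python
from typing import Dict, List, Tuple


def split_conf_args(args: List[str]) -> Tuple[List[str], Dict[str, str]]:
    """Separate `--conf key=value` options from the remaining driver args.

    Accepts both the two-token form (`--conf`, `key=value`) and the
    single-token form (`--conf=key=value`).
    """
    driver_args: List[str] = []
    properties: Dict[str, str] = {}
    i = 0
    while i < len(args):
        arg = args[i]
        if arg == "--conf":
            if i + 1 >= len(args):
                raise ValueError("--conf must be followed by key=value")
            conf = args[i + 1]
            i += 2
        elif arg.startswith("--conf="):
            conf = arg[len("--conf="):]
            i += 1
        else:
            # Anything else is passed through to the driver unchanged.
            driver_args.append(arg)
            i += 1
            continue
        key, sep, value = conf.partition("=")
        if not sep or not key:
            raise ValueError(f"malformed --conf value: {conf!r}")
        properties[key] = value
    return driver_args, properties
```

The first element of the result is a candidate for the `args` parameter and the second for `runtime_config_properties`; whether a given property is accepted by the batch runtime is outside what this sketch checks.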
v1.dataproc.DataprocSparkRBatchOp(gcp_resources: dsl.OutputPath(str), location: str = 'us-central1', batch_id: str = '', labels: dict[str, str] = {}, container_image: str = '', runtime_config_version: str = '', runtime_config_properties: dict[str, str] = {}, service_account: str = '', network_tags: list[str] = [], kms_key: str = '', network_uri: str = '', subnetwork_uri: str = '', metastore_service: str = '', spark_history_dataproc_cluster: str = '', main_r_file_uri: str = '', file_uris: list[str] = [], archive_uris: list[str] = [], args: list[str] = [], project: str = '{{$.pipeline_google_cloud_project_id}}')
Create a Dataproc SparkR batch workload and wait for it to finish.
Parameters
:param location: Location of the Dataproc batch workload. If not set, defaults to "us-central1".
:param batch_id: The ID to use for the batch, which will become the final component of the batch’s resource name. If none is specified, a default name will be generated by the component. This value must be 4-63 characters. Valid characters are /[a-z][0-9]-/.
:param labels: The labels to associate with this batch. Label keys must contain 1 to 63 characters, and must conform to RFC 1035. Label values may be empty, but, if present, must contain 1 to 63 characters, and must conform to RFC 1035. No more than 32 labels can be associated with a batch. An object containing a list of "key": value pairs. Example: { "name": "wrench", "mass": "1.3kg", "count": "3" }.
:param container_image: Optional custom container image for the job runtime environment. If not specified, a default container image will be used.
:param runtime_config_version: Version of the batch runtime.
:param runtime_config_properties: Runtime configuration for the workload.
:param service_account: Service account that is used to execute the workload.
:param network_tags: Tags used for network traffic control.
:param kms_key: The Cloud KMS key to use for encryption.
:param network_uri: Network URI to connect workload to.
:param subnetwork_uri: Subnetwork URI to connect workload to.
:param metastore_service: Resource name of an existing Dataproc Metastore service.
:param spark_history_dataproc_cluster: The Spark History Server configuration for the workload.
:param main_r_file_uri: The HCFS URI of the main R file to use as the driver. Must be a .R or .r file.
:param file_uris: HCFS URIs of files to be placed in the working directory of each executor.
:param archive_uris: HCFS URIs of archives to be extracted into the working directory of each executor. Supported file types: .jar, .tar, .tar.gz, .tgz, and .zip.
:param args: The arguments to pass to the driver. Do not include arguments that can be set as batch properties, such as --conf, since a collision can occur that causes an incorrect batch submission.
:param project: Project to run the Dataproc batch workload. Defaults to the project in which the PipelineJob is run.
Returns
``gcp_resources: dsl.OutputPath(str)`` Serialized gcp_resources proto tracking the Dataproc batch workload. For more details, see
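The file-type rules for main_r_file_uri and archive_uris lend themselves to a quick local check. This is a minimal sketch under the assumption that matching on the URI suffix is sufficient; `is_r_file` and `unsupported_archives` are hypothetical helper names, and the gs:// paths in the usage note are placeholders.

```python
from typing import Iterable, List

# Suffixes taken from the parameter descriptions above.
R_SUFFIXES = (".R", ".r")
ARCHIVE_SUFFIXES = (".jar", ".tar", ".tar.gz", ".tgz", ".zip")


def is_r_file(uri: str) -> bool:
    """Return True if the URI ends in .R or .r."""
    return uri.endswith(R_SUFFIXES)


def unsupported_archives(uris: Iterable[str]) -> List[str]:
    """Return the URIs whose suffix is not a supported archive type."""
    return [uri for uri in uris if not uri.endswith(ARCHIVE_SUFFIXES)]
```

For example, `unsupported_archives(["gs://my-bucket/deps.tar.gz", "gs://my-bucket/data.rar"])` reports only the `.rar` entry, which can be fixed before the values are passed as `archive_uris`.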
v1.dataproc.DataprocSparkSqlBatchOp(gcp_resources: dsl.OutputPath(str), location: str = 'us-central1', batch_id: str = '', labels: dict[str, str] = {}, container_image: str = '', runtime_config_version: str = '', runtime_config_properties: dict[str, str] = {}, service_account: str = '', network_tags: list[str] = [], kms_key: str = '', network_uri: str = '', subnetwork_uri: str = '', metastore_service: str = '', spark_history_dataproc_cluster: str = '', query_file_uri: str = '', query_variables: dict[str, str] = {}, jar_file_uris: list[str] = [], project: str = '{{$.pipeline_google_cloud_project_id}}')
Create a Dataproc Spark SQL batch workload and wait for it to finish.
Parameters
:param location: Location of the Dataproc batch workload. If not set, defaults to "us-central1".
:param batch_id: The ID to use for the batch, which will become the final component of the batch’s resource name. If none is specified, a default name will be generated by the component. This value must be 4-63 characters. Valid characters are /[a-z][0-9]-/.
:param labels: The labels to associate with this batch. Label keys must contain 1 to 63 characters, and must conform to RFC 1035. Label values may be empty, but, if present, must contain 1 to 63 characters, and must conform to RFC 1035. No more than 32 labels can be associated with a batch. An object containing a list of "key": value pairs. Example: { "name": "wrench", "mass": "1.3kg", "count": "3" }.
:param container_image: Optional custom container image for the job runtime environment. If not specified, a default container image will be used.
:param runtime_config_version: Version of the batch runtime.
:param runtime_config_properties: Runtime configuration for the workload.
:param service_account: Service account that is used to execute the workload.
:param network_tags: Tags used for network traffic control.
:param kms_key: The Cloud KMS key to use for encryption.
:param network_uri: Network URI to connect workload to.
:param subnetwork_uri: Subnetwork URI to connect workload to.
:param metastore_service: Resource name of an existing Dataproc Metastore service.
:param spark_history_dataproc_cluster: The Spark History Server configuration for the workload.
:param query_file_uri: The HCFS URI of the script that contains Spark SQL queries to execute.
:param query_variables: Mapping of query variable names to values (equivalent to the Spark SQL command: SET name="value";). An object containing a list of "key": value pairs. Example: { "name": "wrench", "mass": "1.3kg", "count": "3" }.
:param jar_file_uris: HCFS URIs of jar files to be added to the Spark CLASSPATH.
:param project: Project to run the Dataproc batch workload. Defaults to the project in which the PipelineJob is run.
Returns
``gcp_resources: dsl.OutputPath(str)`` Serialized gcp_resources proto tracking the Dataproc batch workload. For more details, see
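To make the query_variables equivalence concrete, the mapping can be rendered as the SET commands it stands for. This is an illustrative sketch only: `as_set_statements` is a hypothetical helper, it performs no quoting or escaping of values, and it is not how the component itself applies the variables.

```python
from typing import Dict, List


def as_set_statements(query_variables: Dict[str, str]) -> List[str]:
    """Render each name/value pair as a Spark SQL `SET name="value";` line."""
    return [f'SET {name}="{value}";' for name, value in query_variables.items()]


if __name__ == "__main__":
    # Uses the example mapping from the parameter description above.
    for statement in as_set_statements({"name": "wrench", "mass": "1.3kg", "count": "3"}):
        print(statement)
```

Reading the mapping this way can help when checking that a script referenced by `query_file_uri` uses the same variable names.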