Dataproc

Create Google Cloud Dataproc jobs from within Vertex AI Pipelines.

Components:

`DataprocPySparkBatchOp`(main_python_file_uri, ...)	Create a Dataproc PySpark batch workload and wait for it to finish.
`DataprocSparkBatchOp`(gcp_resources[, ...])	Create a Dataproc Spark batch workload and wait for it to finish.
`DataprocSparkRBatchOp`(gcp_resources[, ...])	Create a Dataproc SparkR batch workload and wait for it to finish.
`DataprocSparkSqlBatchOp`(gcp_resources[, ...])	Create a Dataproc Spark SQL batch workload and wait for it to finish.

v1.dataproc.DataprocPySparkBatchOp(main_python_file_uri: str, gcp_resources: dsl.OutputPath(str), location: str = 'us-central1', batch_id: str = '', labels: dict[str, str] = {}, container_image: str = '', runtime_config_version: str = '', runtime_config_properties: dict[str, str] = {}, autotuning_config: dict[str, str] = {}, cohort: str = '', service_account: str = '', network_tags: list[str] = [], kms_key: str = '', network_uri: str = '', subnetwork_uri: str = '', metastore_service: str = '', spark_history_dataproc_cluster: str = '', python_file_uris: list[str] = [], jar_file_uris: list[str] = [], file_uris: list[str] = [], archive_uris: list[str] = [], args: list[str] = [], project: str = '{{$.pipeline_google_cloud_project_id}}')

Create a Dataproc PySpark batch workload and wait for it to finish.

Parameters:

location: str = 'us-central1': Location of the Dataproc batch workload. If not set, defaults to "us-central1".
batch_id: str = '': The ID to use for the batch, which becomes the final component of the batch’s resource name. If none is specified, the component generates a default name. The value must be 4 to 63 characters long. Valid characters are lowercase letters (a-z), digits (0-9), and hyphens (-).
labels: dict[str, str] = {}: The labels to associate with this batch. Label keys must contain 1 to 63 characters and must conform to RFC 1035. Label values may be empty but, if present, must contain 1 to 63 characters and must conform to RFC 1035. No more than 32 labels can be associated with a batch. An object containing a list of "key": value pairs. Example: { "name": "wrench", "mass": "1.3kg", "count": "3" }.
container_image: str = '': Optional custom container image for the job runtime environment. If not specified, a default container image is used.
runtime_config_version: str = '': Version of the batch runtime.
runtime_config_properties: dict[str, str] = {}: Runtime configuration properties for the workload, as a mapping of property names to values.
autotuning_config: dict[str, str] = {}: Autotuning configuration for the workload.
cohort: str = '': Cohort identifier for the workload.
service_account: str = '': Service account used to execute the workload.
network_tags: list[str] = []: Tags used for network traffic control.
kms_key: str = '': The Cloud KMS key to use for encryption.
network_uri: str = '': URI of the network to connect the workload to.
subnetwork_uri: str = '': URI of the subnetwork to connect the workload to.
metastore_service: str = '': Resource name of an existing Dataproc Metastore service.
spark_history_dataproc_cluster: str = '': Resource name of an existing Dataproc cluster to act as the Spark History Server for the workload.
main_python_file_uri: str: The HCFS URI of the main Python file to use as the Spark driver. Must be a .py file.
python_file_uris: list[str] = []: HCFS file URIs of Python files to pass to the PySpark framework. Supported file types: .py, .egg, and .zip.
jar_file_uris: list[str] = []: HCFS URIs of jar files to add to the classpath of the Spark driver and tasks.
file_uris: list[str] = []: HCFS URIs of files to be placed in the working directory of each executor.
archive_uris: list[str] = []: HCFS URIs of archives to be extracted into the working directory of each executor. Supported file types: .jar, .tar, .tar.gz, .tgz, and .zip.
args: list[str] = []: The arguments to pass to the driver. Do not include arguments that can be set as batch properties, such as --conf, because a collision can cause an incorrect batch submission.
project: str = '{{$.pipeline_google_cloud_project_id}}': Project in which to run the Dataproc batch workload. Defaults to the project in which the PipelineJob is run.
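The batch_id and labels limits above can be checked locally before a pipeline is submitted. The following is a minimal sketch, not part of the component: the helper names `is_valid_batch_id` and `label_problems` are hypothetical, and the code covers only the length, count, and character limits stated in the parameter descriptions. RFC 1035 conformance of label keys and values is not checked.

```python
import re
from typing import List, Mapping

# 4 to 63 characters drawn from lowercase letters, digits, and hyphens.
_BATCH_ID_RE = re.compile(r"[a-z0-9-]{4,63}")

MAX_LABELS = 32
MAX_LABEL_LENGTH = 63


def is_valid_batch_id(batch_id: str) -> bool:
    """Return True if batch_id is empty or meets the documented limits."""
    # An empty string is accepted: the component then generates a default name.
    return batch_id == "" or _BATCH_ID_RE.fullmatch(batch_id) is not None


def label_problems(labels: Mapping[str, str]) -> List[str]:
    """Return human-readable problems found in labels (empty list if none)."""
    problems = []
    if len(labels) > MAX_LABELS:
        problems.append(f"{len(labels)} labels given, at most {MAX_LABELS} allowed")
    for key, value in labels.items():
        if not 1 <= len(key) <= MAX_LABEL_LENGTH:
            problems.append(f"key {key!r} must contain 1 to 63 characters")
        # Values may be empty, so only the upper bound is checked.
        if len(value) > MAX_LABEL_LENGTH:
            problems.append(f"value for {key!r} must contain at most 63 characters")
    return problems


print(is_valid_batch_id("nightly-etl-01"))
print(label_problems({"team": "data", "env": ""}))
```

Returning a list of problems rather than raising on the first one lets a caller report every violation at once.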

Returns:

gcp_resources: dsl.OutputPath(str): Serialized gcp_resources proto tracking the Dataproc batch workload. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.
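As an illustration of how this component might be called, the sketch below assembles a few keyword arguments and uses them inside a one-step pipeline. The helper names `pyspark_batch_kwargs` and `build_pipeline`, the bucket, and the file paths are hypothetical. The kfp and google_cloud_pipeline_components imports sit inside `build_pipeline` so that the argument checks can run without those packages installed. The check on args is a rough guard for the --conf caveat above, not a rule enforced by the component.

```python
from typing import Any, Dict, Sequence


def pyspark_batch_kwargs(
    main_python_file_uri: str, args: Sequence[str] = ()
) -> Dict[str, Any]:
    """Assemble keyword arguments for DataprocPySparkBatchOp after two basic checks."""
    if not main_python_file_uri.endswith(".py"):
        raise ValueError("main_python_file_uri must point to a .py file")
    if any(arg.startswith("--conf") for arg in args):
        raise ValueError("pass batch properties as parameters, not through args")
    return {
        "main_python_file_uri": main_python_file_uri,
        "args": list(args),
        "location": "us-central1",
    }


def build_pipeline():
    """Define a one-step pipeline; the SDK packages are imported only when called."""
    from kfp import dsl
    from google_cloud_pipeline_components.v1.dataproc import DataprocPySparkBatchOp

    @dsl.pipeline(name="pyspark-batch-sketch")
    def pipeline():
        DataprocPySparkBatchOp(
            **pyspark_batch_kwargs(
                "gs://example-bucket/jobs/word_count.py",  # hypothetical URI
                args=["gs://example-bucket/input.txt"],  # hypothetical argument
            )
        )

    return pipeline


print(pyspark_batch_kwargs("gs://example-bucket/jobs/word_count.py"))
```

The project parameter is left at its default here, so the workload runs in the project of the PipelineJob.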

v1.dataproc.DataprocSparkBatchOp(gcp_resources: dsl.OutputPath(str), location: str = 'us-central1', batch_id: str = '', labels: dict[str, str] = {}, container_image: str = '', runtime_config_version: str = '', runtime_config_properties: dict[str, str] = {}, autotuning_config: dict[str, str] = {}, cohort: str = '', service_account: str = '', network_tags: list[str] = [], kms_key: str = '', network_uri: str = '', subnetwork_uri: str = '', metastore_service: str = '', spark_history_dataproc_cluster: str = '', main_jar_file_uri: str = '', main_class: str = '', jar_file_uris: list[str] = [], file_uris: list[str] = [], archive_uris: list[str] = [], args: list[str] = [], project: str = '{{$.pipeline_google_cloud_project_id}}')

Create a Dataproc Spark batch workload and wait for it to finish.

Parameters:

location: str = 'us-central1': Location of the Dataproc batch workload. If not set, defaults to "us-central1".
batch_id: str = '': The ID to use for the batch, which becomes the final component of the batch’s resource name. If none is specified, the component generates a default name. The value must be 4 to 63 characters long. Valid characters are lowercase letters (a-z), digits (0-9), and hyphens (-).
labels: dict[str, str] = {}: The labels to associate with this batch. Label keys must contain 1 to 63 characters and must conform to RFC 1035. Label values may be empty but, if present, must contain 1 to 63 characters and must conform to RFC 1035. No more than 32 labels can be associated with a batch. An object containing a list of "key": value pairs. Example: { "name": "wrench", "mass": "1.3kg", "count": "3" }.
container_image: str = '': Optional custom container image for the job runtime environment. If not specified, a default container image is used.
runtime_config_version: str = '': Version of the batch runtime.
runtime_config_properties: dict[str, str] = {}: Runtime configuration properties for the workload, as a mapping of property names to values.
autotuning_config: dict[str, str] = {}: Autotuning configuration for the workload.
cohort: str = '': Cohort identifier for the workload.
service_account: str = '': Service account used to execute the workload.
network_tags: list[str] = []: Tags used for network traffic control.
kms_key: str = '': The Cloud KMS key to use for encryption.
network_uri: str = '': URI of the network to connect the workload to.
subnetwork_uri: str = '': URI of the subnetwork to connect the workload to.
metastore_service: str = '': Resource name of an existing Dataproc Metastore service.
spark_history_dataproc_cluster: str = '': Resource name of an existing Dataproc cluster to act as the Spark History Server for the workload.
main_jar_file_uri: str = '': The HCFS URI of the jar file that contains the main class.
main_class: str = '': The name of the driver main class. The jar file that contains the class must be in the classpath or specified in jar_file_uris.
jar_file_uris: list[str] = []: HCFS URIs of jar files to add to the classpath of the Spark driver and tasks.
file_uris: list[str] = []: HCFS URIs of files to be placed in the working directory of each executor.
archive_uris: list[str] = []: HCFS URIs of archives to be extracted into the working directory of each executor. Supported file types: .jar, .tar, .tar.gz, .tgz, and .zip.
args: list[str] = []: The arguments to pass to the driver. Do not include arguments that can be set as batch properties, such as --conf, because a collision can cause an incorrect batch submission.
project: str = '{{$.pipeline_google_cloud_project_id}}': Project in which to run the Dataproc batch workload. Defaults to the project in which the PipelineJob is run.
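Both main_jar_file_uri and main_class default to an empty string, so nothing in the signature shows which one names the driver. The sketch below assumes that exactly one of the two should be set, as in the Spark batch message of the Dataproc Batches API. That rule is an assumption here, not something the parameter descriptions state. The helper name `spark_driver_kwargs` and the sample values are hypothetical.

```python
from typing import Any, Dict, Sequence


def spark_driver_kwargs(
    main_jar_file_uri: str = "",
    main_class: str = "",
    jar_file_uris: Sequence[str] = (),
) -> Dict[str, Any]:
    """Build driver-related keyword arguments for DataprocSparkBatchOp.

    Assumption: exactly one of main_jar_file_uri and main_class is set.
    """
    if bool(main_jar_file_uri) == bool(main_class):
        raise ValueError("set exactly one of main_jar_file_uri or main_class")
    kwargs: Dict[str, Any] = {"jar_file_uris": list(jar_file_uris)}
    if main_jar_file_uri:
        kwargs["main_jar_file_uri"] = main_jar_file_uri
    else:
        # The jar that holds the class must be on the classpath or in jar_file_uris.
        kwargs["main_class"] = main_class
    return kwargs


print(
    spark_driver_kwargs(
        main_class="org.example.Main",  # hypothetical class name
        jar_file_uris=["gs://example-bucket/app.jar"],  # hypothetical URI
    )
)
```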

Returns:

gcp_resources: dsl.OutputPath(str): Serialized gcp_resources proto tracking the Dataproc batch workload. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.
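A downstream step might read the serialized gcp_resources output to find the resource the component tracked. The sketch below assumes the payload is the JSON form of the proto, with a top-level "resources" list whose entries carry "resourceType" and "resourceUri" fields. Confirm the field names against the README linked above. The helper name `list_tracked_resources` and the sample values are made up.

```python
import json
from typing import List, Tuple


def list_tracked_resources(gcp_resources_json: str) -> List[Tuple[str, str]]:
    """Return (resource type, resource URI) pairs from a serialized payload."""
    payload = json.loads(gcp_resources_json)
    return [
        (item.get("resourceType", ""), item.get("resourceUri", ""))
        for item in payload.get("resources", [])
    ]


# Made-up sample payload; real type and URI values come from the component.
sample = json.dumps({
    "resources": [{
        "resourceType": "ExampleType",
        "resourceUri": "https://example.invalid/v1/projects/p/locations/us-central1/batches/b",
    }]
})
print(list_tracked_resources(sample))
```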

v1.dataproc.DataprocSparkRBatchOp(gcp_resources: dsl.OutputPath(str), location: str = 'us-central1', batch_id: str = '', labels: dict[str, str] = {}, container_image: str = '', runtime_config_version: str = '', runtime_config_properties: dict[str, str] = {}, autotuning_config: dict[str, str] = {}, cohort: str = '', service_account: str = '', network_tags: list[str] = [], kms_key: str = '', network_uri: str = '', subnetwork_uri: str = '', metastore_service: str = '', spark_history_dataproc_cluster: str = '', main_r_file_uri: str = '', file_uris: list[str] = [], archive_uris: list[str] = [], args: list[str] = [], project: str = '{{$.pipeline_google_cloud_project_id}}')

Create a Dataproc SparkR batch workload and wait for it to finish.

Parameters:

location: str = 'us-central1': Location of the Dataproc batch workload. If not set, defaults to "us-central1".
batch_id: str = '': The ID to use for the batch, which becomes the final component of the batch’s resource name. If none is specified, the component generates a default name. The value must be 4 to 63 characters long. Valid characters are lowercase letters (a-z), digits (0-9), and hyphens (-).
labels: dict[str, str] = {}: The labels to associate with this batch. Label keys must contain 1 to 63 characters and must conform to RFC 1035. Label values may be empty but, if present, must contain 1 to 63 characters and must conform to RFC 1035. No more than 32 labels can be associated with a batch. An object containing a list of "key": value pairs. Example: { "name": "wrench", "mass": "1.3kg", "count": "3" }.
container_image: str = '': Optional custom container image for the job runtime environment. If not specified, a default container image is used.
runtime_config_version: str = '': Version of the batch runtime.
runtime_config_properties: dict[str, str] = {}: Runtime configuration properties for the workload, as a mapping of property names to values.
autotuning_config: dict[str, str] = {}: Autotuning configuration for the workload.
cohort: str = '': Cohort identifier for the workload.
service_account: str = '': Service account used to execute the workload.
network_tags: list[str] = []: Tags used for network traffic control.
kms_key: str = '': The Cloud KMS key to use for encryption.
network_uri: str = '': URI of the network to connect the workload to.
subnetwork_uri: str = '': URI of the subnetwork to connect the workload to.
metastore_service: str = '': Resource name of an existing Dataproc Metastore service.
spark_history_dataproc_cluster: str = '': Resource name of an existing Dataproc cluster to act as the Spark History Server for the workload.
main_r_file_uri: str = '': The HCFS URI of the main R file to use as the driver. Must be a .R or .r file.
file_uris: list[str] = []: HCFS URIs of files to be placed in the working directory of each executor.
archive_uris: list[str] = []: HCFS URIs of archives to be extracted into the working directory of each executor. Supported file types: .jar, .tar, .tar.gz, .tgz, and .zip.
args: list[str] = []: The arguments to pass to the driver. Do not include arguments that can be set as batch properties, such as --conf, because a collision can cause an incorrect batch submission.
project: str = '{{$.pipeline_google_cloud_project_id}}': Project in which to run the Dataproc batch workload. Defaults to the project in which the PipelineJob is run.
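The file-type rules for main_r_file_uri and archive_uris can be screened with plain suffix checks before submission. This is a minimal sketch: the helper names `is_r_driver` and `unsupported_archives` and the sample URIs are hypothetical, and a matching suffix says nothing about whether the file exists or is well-formed.

```python
from typing import List, Sequence

# Suffixes listed in the archive_uris description above.
ARCHIVE_SUFFIXES = (".jar", ".tar", ".tar.gz", ".tgz", ".zip")


def is_r_driver(main_r_file_uri: str) -> bool:
    """Return True if the URI ends in .R or .r."""
    return main_r_file_uri.endswith((".R", ".r"))


def unsupported_archives(archive_uris: Sequence[str]) -> List[str]:
    """Return the URIs whose suffix is not a supported archive type."""
    return [uri for uri in archive_uris if not uri.endswith(ARCHIVE_SUFFIXES)]


print(is_r_driver("gs://example-bucket/jobs/report.R"))
print(unsupported_archives(["gs://example-bucket/deps.tar.gz", "gs://example-bucket/deps.7z"]))
```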

Returns:

gcp_resources: dsl.OutputPath(str): Serialized gcp_resources proto tracking the Dataproc batch workload. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.

v1.dataproc.DataprocSparkSqlBatchOp(gcp_resources: dsl.OutputPath(str), location: str = 'us-central1', batch_id: str = '', labels: dict[str, str] = {}, container_image: str = '', runtime_config_version: str = '', runtime_config_properties: dict[str, str] = {}, autotuning_config: dict[str, str] = {}, cohort: str = '', service_account: str = '', network_tags: list[str] = [], kms_key: str = '', network_uri: str = '', subnetwork_uri: str = '', metastore_service: str = '', spark_history_dataproc_cluster: str = '', query_file_uri: str = '', query_variables: dict[str, str] = {}, jar_file_uris: list[str] = [], project: str = '{{$.pipeline_google_cloud_project_id}}')

Create a Dataproc Spark SQL batch workload and wait for it to finish.

Parameters:

location: str = 'us-central1': Location of the Dataproc batch workload. If not set, defaults to "us-central1".
batch_id: str = '': The ID to use for the batch, which becomes the final component of the batch’s resource name. If none is specified, the component generates a default name. The value must be 4 to 63 characters long. Valid characters are lowercase letters (a-z), digits (0-9), and hyphens (-).
labels: dict[str, str] = {}: The labels to associate with this batch. Label keys must contain 1 to 63 characters and must conform to RFC 1035. Label values may be empty but, if present, must contain 1 to 63 characters and must conform to RFC 1035. No more than 32 labels can be associated with a batch. An object containing a list of "key": value pairs. Example: { "name": "wrench", "mass": "1.3kg", "count": "3" }.
container_image: str = '': Optional custom container image for the job runtime environment. If not specified, a default container image is used.
runtime_config_version: str = '': Version of the batch runtime.
runtime_config_properties: dict[str, str] = {}: Runtime configuration properties for the workload, as a mapping of property names to values.
autotuning_config: dict[str, str] = {}: Autotuning configuration for the workload.
cohort: str = '': Cohort identifier for the workload.
service_account: str = '': Service account used to execute the workload.
network_tags: list[str] = []: Tags used for network traffic control.
kms_key: str = '': The Cloud KMS key to use for encryption.
network_uri: str = '': URI of the network to connect the workload to.
subnetwork_uri: str = '': URI of the subnetwork to connect the workload to.
metastore_service: str = '': Resource name of an existing Dataproc Metastore service.
spark_history_dataproc_cluster: str = '': Resource name of an existing Dataproc cluster to act as the Spark History Server for the workload.
query_file_uri: str = '': The HCFS URI of the script that contains Spark SQL queries to execute.
query_variables: dict[str, str] = {}: Mapping of query variable names to values (equivalent to the Spark SQL command: SET name="value";). An object containing a list of "key": value pairs. Example: { "name": "wrench", "mass": "1.3kg", "count": "3" }.
jar_file_uris: list[str] = []: HCFS URIs of jar files to be added to the Spark CLASSPATH.
project: str = '{{$.pipeline_google_cloud_project_id}}': Project in which to run the Dataproc batch workload. Defaults to the project in which the PipelineJob is run.
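To make the query_variables equivalence concrete, the sketch below renders a mapping as the SET commands it is described as standing for. It is for illustration only: the component takes the mapping itself, not rendered SQL. The helper name `render_set_statements` and the sample variables are hypothetical, and values are not escaped.

```python
from typing import List, Mapping


def render_set_statements(query_variables: Mapping[str, str]) -> List[str]:
    """Render each name/value pair as a Spark SQL SET command (no escaping)."""
    return [f'SET {name}="{value}";' for name, value in query_variables.items()]


for statement in render_set_statements({"run_date": "2024-01-01", "region": "emea"}):
    print(statement)
```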

Returns:

gcp_resources: dsl.OutputPath(str): Serialized gcp_resources proto tracking the Dataproc batch workload. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.