google_cloud_pipeline_components.v1.dataproc module

Google Cloud Pipeline Dataproc Batch components.

google_cloud_pipeline_components.v1.dataproc.DataprocPySparkBatchOp()

dataproc_create_pyspark_batch: Create a Dataproc PySpark batch workload and wait for it to finish.

Args:
project (str):

Required: Project to run the Dataproc batch workload.

location (Optional[str]):

Location of the Dataproc batch workload. If not set, defaults to us-central1.

batch_id (Optional[str]):

The ID to use for the batch, which will become the final component of the batch’s resource name. If none is specified, a default name will be generated by the component.

This value must be 4-63 characters. Valid characters are /[a-z][0-9]-/.

labels (Optional[dict]):

The labels to associate with this batch. Label keys must contain 1 to 63 characters, and must conform to RFC 1035. Label values may be empty, but, if present, must contain 1 to 63 characters, and must conform to RFC 1035. No more than 32 labels can be associated with a batch.

An object containing a list of "key": value pairs. Example: { "name": "wrench", "mass": "1.3kg", "count": "3" }.

container_image (Optional[str]):

Optional custom container image for the job runtime environment. If not specified, a default container image will be used.

runtime_config_version (Optional[str]):

Version of the batch runtime.

runtime_config_properties (Optional[dict]):

Runtime configuration for a workload.

service_account (Optional[str]):

Service account used to execute the workload.

network_tags (Optional[Sequence]):

Tags used for network traffic control.

kms_key (Optional[str]):

The Cloud KMS key to use for encryption.

network_uri (Optional[str]):

Network URI to connect the workload to.

subnetwork_uri (Optional[str]):

Subnetwork URI to connect the workload to.

metastore_service (Optional[str]):

Resource name of an existing Dataproc Metastore service.

spark_history_dataproc_cluster (Optional[str]):

Resource name of an existing Dataproc cluster to act as a Spark History Server for the workload.

main_python_file_uri (str):

Required. The HCFS URI of the main Python file to use as the Spark driver. Must be a .py file.

python_file_uris (Optional[Sequence]):

HCFS file URIs of Python files to pass to the PySpark framework. Supported file types: .py, .egg, and .zip.

jar_file_uris (Optional[Sequence]):

HCFS URIs of jar files to add to the classpath of the Spark driver and tasks.

file_uris (Optional[Sequence]):

HCFS URIs of files to be placed in the working directory of each executor.

archive_uris (Optional[Sequence]):

HCFS URIs of archives to be extracted into the working directory of each executor. Supported file types: .jar, .tar, .tar.gz, .tgz, and .zip.

args (Optional[Sequence]):

The arguments to pass to the driver. Do not include arguments that can be set as batch properties, such as --conf, since a collision can occur that causes an incorrect batch submission.

Returns:
gcp_resources (str):

Serialized gcp_resources proto tracking the Dataproc batch workload. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.
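
For illustration, a minimal pipeline sketch that submits a PySpark batch with this component. The project ID, batch ID, and gs:// paths below are placeholders, and the import style assumes the Kubeflow Pipelines SDK (kfp) is installed:

from kfp import dsl
from google_cloud_pipeline_components.v1.dataproc import DataprocPySparkBatchOp

@dsl.pipeline(name="dataproc-pyspark-batch-example")
def pyspark_batch_pipeline(project: str = "my-project", location: str = "us-central1"):
    # Submit a PySpark batch workload and wait for it to finish.
    DataprocPySparkBatchOp(
        project=project,
        location=location,
        batch_id="example-pyspark-batch",  # 4-63 chars, [a-z][0-9]- only
        main_python_file_uri="gs://my-bucket/jobs/word_count.py",  # placeholder path
        args=["gs://my-bucket/data/input.txt"],  # placeholder argument
    )

The compiled pipeline can then be run on Vertex AI Pipelines; the gcp_resources output tracks the underlying batch workload.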

google_cloud_pipeline_components.v1.dataproc.DataprocSparkBatchOp(project: str, location: str = 'us-central1', batch_id: str = '', labels: dict = '{}', container_image: str = '', runtime_config_version: str = '', runtime_config_properties: dict = '{}', service_account: str = '', network_tags: list = '[]', kms_key: str = '', network_uri: str = '', subnetwork_uri: str = '', metastore_service: str = '', spark_history_dataproc_cluster: str = '', main_jar_file_uri: str = '', main_class: str = '', jar_file_uris: list = '[]', file_uris: list = '[]', archive_uris: list = '[]', args: list = '[]')

dataproc_create_spark_batch: Create a Dataproc Spark batch workload and wait for it to finish.

Args:
project (str):

Required: Project to run the Dataproc batch workload.

location (Optional[str]):

Location of the Dataproc batch workload. If not set, defaults to us-central1.

batch_id (Optional[str]):

The ID to use for the batch, which will become the final component of the batch’s resource name. If none is specified, a default name will be generated by the component.

This value must be 4-63 characters. Valid characters are /[a-z][0-9]-/.

labels (Optional[dict]):

The labels to associate with this batch. Label keys must contain 1 to 63 characters, and must conform to RFC 1035. Label values may be empty, but, if present, must contain 1 to 63 characters, and must conform to RFC 1035. No more than 32 labels can be associated with a batch.

An object containing a list of "key": value pairs. Example: { "name": "wrench", "mass": "1.3kg", "count": "3" }.

container_image (Optional[str]):

Optional custom container image for the job runtime environment. If not specified, a default container image will be used.

runtime_config_version (Optional[str]):

Version of the batch runtime.

runtime_config_properties (Optional[dict]):

Runtime configuration for a workload.

service_account (Optional[str]):

Service account used to execute the workload.

network_tags (Optional[Sequence]):

Tags used for network traffic control.

kms_key (Optional[str]):

The Cloud KMS key to use for encryption.

network_uri (Optional[str]):

Network URI to connect the workload to.

subnetwork_uri (Optional[str]):

Subnetwork URI to connect the workload to.

metastore_service (Optional[str]):

Resource name of an existing Dataproc Metastore service.

spark_history_dataproc_cluster (Optional[str]):

Resource name of an existing Dataproc cluster to act as a Spark History Server for the workload.

main_jar_file_uri (Optional[str]):

The HCFS URI of the jar file that contains the main class.

main_class (Optional[str]):

The name of the driver main class. The jar file that contains the class must be in the classpath or specified in jar_file_uris.

jar_file_uris (Optional[Sequence]):

HCFS URIs of jar files to add to the classpath of the Spark driver and tasks.

file_uris (Optional[Sequence]):

HCFS URIs of files to be placed in the working directory of each executor.

archive_uris (Optional[Sequence]):

HCFS URIs of archives to be extracted into the working directory of each executor.

args (Optional[Sequence]):

The arguments to pass to the driver.

Returns:
gcp_resources (str):

Serialized gcp_resources proto tracking the Dataproc batch workload. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.
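
A hedged sketch of how the signature above might be used in a pipeline; the driver class, jar path, and project ID are placeholders:

from kfp import dsl
from google_cloud_pipeline_components.v1.dataproc import DataprocSparkBatchOp

@dsl.pipeline(name="dataproc-spark-batch-example")
def spark_batch_pipeline(project: str = "my-project"):
    # The driver entry point is given either by main_jar_file_uri or by
    # main_class together with a jar listed in jar_file_uris.
    DataprocSparkBatchOp(
        project=project,
        location="us-central1",
        main_class="org.example.WordCount",  # placeholder driver class
        jar_file_uris=["gs://my-bucket/jars/word-count.jar"],  # placeholder jar
        args=["gs://my-bucket/data/input.txt"],  # placeholder argument
    )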

google_cloud_pipeline_components.v1.dataproc.DataprocSparkRBatchOp()

dataproc_create_spark_r_batch: Create a Dataproc SparkR batch workload and wait for it to finish.

Args:
project (str):

Required: Project to run the Dataproc batch workload.

location (Optional[str]):

Location of the Dataproc batch workload. If not set, defaults to us-central1.

batch_id (Optional[str]):

The ID to use for the batch, which will become the final component of the batch’s resource name. If none is specified, a default name will be generated by the component.

This value must be 4-63 characters. Valid characters are /[a-z][0-9]-/.

labels (Optional[dict]):

The labels to associate with this batch. Label keys must contain 1 to 63 characters, and must conform to RFC 1035. Label values may be empty, but, if present, must contain 1 to 63 characters, and must conform to RFC 1035. No more than 32 labels can be associated with a batch.

An object containing a list of "key": value pairs. Example: { "name": "wrench", "mass": "1.3kg", "count": "3" }.

container_image (Optional[str]):

Optional custom container image for the job runtime environment. If not specified, a default container image will be used.

runtime_config_version (Optional[str]):

Version of the batch runtime.

runtime_config_properties (Optional[dict]):

Runtime configuration for a workload.

service_account (Optional[str]):

Service account used to execute the workload.

network_tags (Optional[Sequence]):

Tags used for network traffic control.

kms_key (Optional[str]):

The Cloud KMS key to use for encryption.

network_uri (Optional[str]):

Network URI to connect the workload to.

subnetwork_uri (Optional[str]):

Subnetwork URI to connect the workload to.

metastore_service (Optional[str]):

Resource name of an existing Dataproc Metastore service.

spark_history_dataproc_cluster (Optional[str]):

Resource name of an existing Dataproc cluster to act as a Spark History Server for the workload.

main_r_file_uri (str):

Required. The HCFS URI of the main R file to use as the driver. Must be a .R or .r file.

file_uris (Optional[Sequence]):

HCFS URIs of files to be placed in the working directory of each executor.

archive_uris (Optional[Sequence]):

HCFS URIs of archives to be extracted into the working directory of each executor.

args (Optional[Sequence]):

The arguments to pass to the driver.

Returns:
gcp_resources (str):

Serialized gcp_resources proto tracking the Dataproc batch workload. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.
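
For illustration, a minimal sketch submitting a SparkR batch; the project ID and gs:// paths are placeholders:

from kfp import dsl
from google_cloud_pipeline_components.v1.dataproc import DataprocSparkRBatchOp

@dsl.pipeline(name="dataproc-sparkr-batch-example")
def spark_r_batch_pipeline(project: str = "my-project"):
    # Submit a SparkR batch workload driven by a .R script.
    DataprocSparkRBatchOp(
        project=project,
        location="us-central1",
        main_r_file_uri="gs://my-bucket/jobs/analysis.R",  # placeholder .R driver script
        args=["gs://my-bucket/data/input.csv"],  # placeholder argument
    )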

google_cloud_pipeline_components.v1.dataproc.DataprocSparkSqlBatchOp()

dataproc_create_spark_sql_batch: Create a Dataproc Spark SQL batch workload and wait for it to finish.

Args:
project (str):

Required: Project to run the Dataproc batch workload.

location (Optional[str]):

Location of the Dataproc batch workload. If not set, defaults to us-central1.

batch_id (Optional[str]):

The ID to use for the batch, which will become the final component of the batch’s resource name. If none is specified, a default name will be generated by the component.

This value must be 4-63 characters. Valid characters are /[a-z][0-9]-/.

labels (Optional[dict]):

The labels to associate with this batch. Label keys must contain 1 to 63 characters, and must conform to RFC 1035. Label values may be empty, but, if present, must contain 1 to 63 characters, and must conform to RFC 1035. No more than 32 labels can be associated with a batch.

An object containing a list of "key": value pairs. Example: { "name": "wrench", "mass": "1.3kg", "count": "3" }.

container_image (Optional[str]):

Optional custom container image for the job runtime environment. If not specified, a default container image will be used.

runtime_config_version (Optional[str]):

Version of the batch runtime.

runtime_config_properties (Optional[dict]):

Runtime configuration for a workload.

service_account (Optional[str]):

Service account used to execute the workload.

network_tags (Optional[Sequence]):

Tags used for network traffic control.

kms_key (Optional[str]):

The Cloud KMS key to use for encryption.

network_uri (Optional[str]):

Network URI to connect the workload to.

subnetwork_uri (Optional[str]):

Subnetwork URI to connect the workload to.

metastore_service (Optional[str]):

Resource name of an existing Dataproc Metastore service.

spark_history_dataproc_cluster (Optional[str]):

Resource name of an existing Dataproc cluster to act as a Spark History Server for the workload.

query_file_uri (str):

Required. The HCFS URI of the script that contains Spark SQL queries to execute.

query_variables (Optional[dict]):

Mapping of query variable names to values (equivalent to the Spark SQL command: SET name="value";).

jar_file_uris (Optional[Sequence]):

HCFS URIs of jar files to be added to the Spark CLASSPATH.

Returns:
gcp_resources (str):

Serialized gcp_resources proto tracking the Dataproc batch workload. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.
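
A hedged usage sketch for this component; the project ID, script path, and query variable are placeholders:

from kfp import dsl
from google_cloud_pipeline_components.v1.dataproc import DataprocSparkSqlBatchOp

@dsl.pipeline(name="dataproc-spark-sql-batch-example")
def spark_sql_batch_pipeline(project: str = "my-project"):
    # Run the Spark SQL script, substituting query variables.
    DataprocSparkSqlBatchOp(
        project=project,
        location="us-central1",
        query_file_uri="gs://my-bucket/sql/daily_report.sql",  # placeholder script
        # Equivalent to running: SET run_date="2024-01-01";
        query_variables={"run_date": "2024-01-01"},  # placeholder variable
    )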