google_cloud_pipeline_components.v1.custom_job module

Module for supporting Google Vertex AI Custom Training Job Op.

google_cloud_pipeline_components.v1.custom_job.CustomTrainingJobOp()

Launch a custom training job using the Vertex AI CustomJob API.

Args:

project (str): Required. Project to create the custom training job in.

location (Optional[str]): Location for creating the custom training job. If not set, defaults to us-central1.

display_name (str): The name of the custom training job.

worker_pool_specs (Optional[Sequence[str]]): Serialized JSON spec of the worker pools, including machine type and Docker image. All worker pools except the first one are optional and can be skipped by providing an empty value. For more details about the WorkerPoolSpec, see https://cloud.google.com/vertex-ai/docs/reference/rest/v1/CustomJobSpec#WorkerPoolSpec

timeout (Optional[str]): The maximum job running time. The default is 7 days. A duration in seconds with up to nine fractional digits, terminated by 's', for example: "3.5s".

restart_job_on_worker_restart (Optional[bool]): Restarts the entire CustomJob if a worker gets restarted. This feature can be used by distributed training jobs that are not resilient to workers leaving and joining a job.

service_account (Optional[str]): Sets the default service account for the workload run-as account. The service account running the pipeline (https://cloud.google.com/vertex-ai/docs/pipelines/configure-project#service-account) submitting jobs must have act-as permission on this run-as account. If unspecified, the Vertex AI Custom Code Service Agent (https://cloud.google.com/vertex-ai/docs/general/access-control#service-agents) for the CustomJob's project is used.

tensorboard (Optional[str]): The name of a Vertex AI Tensorboard resource to which this CustomJob will upload Tensorboard logs.

enable_web_access (Optional[bool]): Whether you want Vertex AI to enable [interactive shell access](https://cloud.google.com/vertex-ai/docs/training/monitor-debug-interactive-shell) to training containers. If set to true, you can access interactive shells at the URIs given by [CustomJob.web_access_uris][].

network (Optional[str]): The full name of the Compute Engine network to which the job should be peered, for example projects/12345/global/networks/myVPC. The format is projects/{project}/global/networks/{network}, where {project} is a project number, as in 12345, and {network} is a network name. Private services access must already be configured for the network. If left unspecified, the job is not peered with any network.

reserved_ip_ranges (Optional[Sequence[str]]): A list of names for the reserved IP ranges under the VPC network that can be used for this job. If set, the job is deployed within the provided IP ranges. Otherwise, the job is deployed to any IP range under the provided VPC network.

base_output_directory (Optional[str]): The Cloud Storage location to store the output of this CustomJob or HyperparameterTuningJob. For more details, see https://cloud.google.com/vertex-ai/docs/reference/rest/v1/GcsDestination

labels (Optional[Dict[str, str]]): The labels with user-defined metadata to organize CustomJobs. See https://goo.gl/xmQnxf for more information.

encryption_spec_key_name (Optional[str]): Customer-managed encryption key options for the CustomJob. If this is set, then all resources created by the CustomJob will be encrypted with the provided encryption key.

Returns:

gcp_resources (str): Serialized gcp_resources proto tracking the custom training job. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.
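As an illustration of the worker_pool_specs format described above, here is a minimal sketch of assembling and serializing a single-pool spec. The machine type, image URI, and container args are placeholder assumptions, not values from this module; see the WorkerPoolSpec reference for the full schema.

```python
import json

# One worker pool (the first pool is the master). Later pools are
# optional and can be skipped by providing an empty value.
# All concrete values below are placeholders.
worker_pool_specs = [
    {
        "machine_spec": {"machine_type": "n1-standard-4"},
        "replica_count": 1,
        "container_spec": {
            "image_uri": "gcr.io/my-project/my-trainer:latest",
            "args": ["--epochs", "10"],
        },
    }
]

# The component takes the spec serialized as JSON.
serialized = json.dumps(worker_pool_specs)
print(serialized)
```

Inside a pipeline, this serialized value would be passed to CustomTrainingJobOp as worker_pool_specs, alongside project, location, and display_name.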

google_cloud_pipeline_components.v1.custom_job.create_custom_training_job_from_component(component_spec: Callable, display_name: Optional[str] = '', replica_count: Optional[int] = 1, machine_type: Optional[str] = 'n1-standard-4', accelerator_type: Optional[str] = '', accelerator_count: Optional[int] = 1, boot_disk_type: Optional[str] = 'pd-ssd', boot_disk_size_gb: Optional[int] = 100, timeout: Optional[str] = '604800s', restart_job_on_worker_restart: Optional[bool] = False, service_account: Optional[str] = '', network: Optional[str] = '', encryption_spec_key_name: Optional[str] = '', tensorboard: Optional[str] = '', enable_web_access: Optional[bool] = False, reserved_ip_ranges: Optional[Sequence[str]] = None, nfs_mounts: Optional[Sequence[Dict[str, str]]] = None, base_output_directory: Optional[str] = '', labels: Optional[Dict[str, str]] = None) → Callable

Create a component spec that runs a custom training job in Vertex AI.

This utility converts a given component to a CustomTrainingJobOp that runs a custom training in Vertex AI. This simplifies the creation of custom training jobs. All Inputs and Outputs of the supplied component will be copied over to the constructed training job.

Note that this utility constructs a ClusterSpec where the master and all the workers use the same spec, meaning all disk/machine spec related parameters will apply to all replicas. This is suitable for use cases such as training with MultiWorkerMirroredStrategy or MirroredStrategy.

This component does not support Vertex AI Python training applications.

For more details on Vertex AI Training service, please refer to https://cloud.google.com/vertex-ai/docs/training/create-custom-job

Args:

component_spec: The task (ContainerOp) object to run as a Vertex AI custom job.

display_name (Optional[str]): The name of the custom job. If not provided, component_spec.name will be used instead.

replica_count (Optional[int]): The count of instances in the cluster. One replica always counts towards the master in worker_pool_spec[0], and the remaining replicas are allocated in worker_pool_spec[1]. For more details, see https://cloud.google.com/vertex-ai/docs/training/distributed-training#configure_a_distributed_training_job.

machine_type (Optional[str]): The type of machine to run the custom job. The default value is "n1-standard-4". For more details about this input config, see https://cloud.google.com/vertex-ai/docs/training/configure-compute#machine-types.

accelerator_type (Optional[str]): The type of accelerator(s) that may be attached to the machine, as per accelerator_count. For more details about this input config, see https://cloud.google.com/vertex-ai/docs/reference/rest/v1/MachineSpec#acceleratortype.

accelerator_count (Optional[int]): The number of accelerators to attach to the machine. Defaults to 1 if accelerator_type is set.

boot_disk_type (Optional[str]): Type of the boot disk (default is "pd-ssd"). Valid values: "pd-ssd" (Persistent Disk Solid State Drive) or "pd-standard" (Persistent Disk Hard Disk Drive). boot_disk_type is set as a static value and cannot be changed as a pipeline parameter.

boot_disk_size_gb (Optional[int]): Size in GB of the boot disk (default is 100GB). boot_disk_size_gb is set as a static value and cannot be changed as a pipeline parameter.

timeout (Optional[str]): The maximum job running time. The default is 7 days. A duration in seconds with up to nine fractional digits, terminated by 's', for example: "3.5s".

restart_job_on_worker_restart (Optional[bool]): Restarts the entire CustomJob if a worker gets restarted. This feature can be used by distributed training jobs that are not resilient to workers leaving and joining a job.

service_account (Optional[str]): Sets the default service account for the workload run-as account. The service account running the pipeline (https://cloud.google.com/vertex-ai/docs/pipelines/configure-project#service-account) submitting jobs must have act-as permission on this run-as account. If unspecified, the Vertex AI Custom Code Service Agent (https://cloud.google.com/vertex-ai/docs/general/access-control#service-agents) for the CustomJob's project is used.

network (Optional[str]): The full name of the Compute Engine network to which the job should be peered, for example projects/12345/global/networks/myVPC. The format is projects/{project}/global/networks/{network}, where {project} is a project number, as in 12345, and {network} is a network name. Private services access must already be configured for the network. If left unspecified, the job is not peered with any network.

encryption_spec_key_name (Optional[str]): Customer-managed encryption key options for the CustomJob. If this is set, then all resources created by the CustomJob will be encrypted with the provided encryption key.

tensorboard (Optional[str]): The name of a Vertex AI Tensorboard resource to which this CustomJob will upload Tensorboard logs.

enable_web_access (Optional[bool]): Whether you want Vertex AI to enable [interactive shell access](https://cloud.google.com/vertex-ai/docs/training/monitor-debug-interactive-shell) to training containers. If set to true, you can access interactive shells at the URIs given by [CustomJob.web_access_uris][].

reserved_ip_ranges (Optional[Sequence[str]]): A list of names for the reserved IP ranges under the VPC network that can be used for this job. If set, the job is deployed within the provided IP ranges. Otherwise, the job is deployed to any IP range under the provided VPC network.

nfs_mounts (Optional[Sequence[Dict]]): A list of NFS mount specs in JSON dict format. nfs_mounts is set as a static value and cannot be changed as a pipeline parameter. For the API spec, see https://cloud.google.com/vertex-ai/docs/reference/rest/v1/CustomJobSpec#NfsMount. For more details about mounting NFS for CustomJob, see https://cloud.google.com/vertex-ai/docs/training/train-nfs-share

base_output_directory (Optional[str]): The Cloud Storage location to store the output of this CustomJob or HyperparameterTuningJob. For more details, see https://cloud.google.com/vertex-ai/docs/reference/rest/v1/GcsDestination

labels (Optional[Dict[str, str]]): The labels with user-defined metadata to organize CustomJobs. See https://goo.gl/xmQnxf for more information.

Returns:

A Custom Job component operator corresponding to the input component operator.
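The replica layout described above (one replica counted towards the master pool, the remainder placed in an identical worker pool) can be sketched in plain Python. This is a schematic, not the GCPC implementation; FakeComponentSpec and to_custom_job_spec are hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FakeComponentSpec:
    """Stand-in for a KFP component spec (hypothetical, for illustration)."""
    name: str
    image_uri: str


def to_custom_job_spec(component: FakeComponentSpec,
                       display_name: str = "",
                       replica_count: int = 1,
                       machine_type: str = "n1-standard-4") -> Dict:
    # Master and workers share the same machine spec, mirroring the
    # ClusterSpec behavior noted above.
    pool = {
        "machine_spec": {"machine_type": machine_type},
        "replica_count": 1,
        "container_spec": {"image_uri": component.image_uri},
    }
    pools: List[Dict] = [pool]
    if replica_count > 1:
        # One replica counts towards the master; the rest form
        # worker_pool_spec[1] with identical settings.
        pools.append(dict(pool, replica_count=replica_count - 1))
    return {
        "display_name": display_name or component.name,
        "job_spec": {"worker_pool_specs": pools},
    }


spec = to_custom_job_spec(FakeComponentSpec("trainer", "gcr.io/my-project/img"),
                          replica_count=3)
print(spec["job_spec"]["worker_pool_specs"][1]["replica_count"])  # 2
```

With replica_count=3, one replica lands in the master pool and the remaining two in the worker pool, which is the split the real utility produces for MultiWorkerMirroredStrategy-style training.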

google_cloud_pipeline_components.v1.custom_job.create_custom_training_job_op_from_component(component_spec: Callable, display_name: Optional[str] = '', replica_count: Optional[int] = 1, machine_type: Optional[str] = 'n1-standard-4', accelerator_type: Optional[str] = '', accelerator_count: Optional[int] = 1, boot_disk_type: Optional[str] = 'pd-ssd', boot_disk_size_gb: Optional[int] = 100, timeout: Optional[str] = '604800s', restart_job_on_worker_restart: Optional[bool] = False, service_account: Optional[str] = '', network: Optional[str] = '', encryption_spec_key_name: Optional[str] = '', tensorboard: Optional[str] = '', enable_web_access: Optional[bool] = False, reserved_ip_ranges: Optional[Sequence[str]] = None, nfs_mounts: Optional[Sequence[Dict[str, str]]] = None, base_output_directory: Optional[str] = '', labels: Optional[Dict[str, str]] = None) → Callable

Create a component spec that runs a custom training job in Vertex AI.

This utility converts a given component to a CustomTrainingJobOp that runs a custom training in Vertex AI. This simplifies the creation of custom training jobs. All Inputs and Outputs of the supplied component will be copied over to the constructed training job.

Note that this utility constructs a ClusterSpec where the master and all the workers use the same spec, meaning all disk/machine spec related parameters will apply to all replicas. This is suitable for use cases such as training with MultiWorkerMirroredStrategy or MirroredStrategy.

This component does not support Vertex AI Python training applications.

For more details on Vertex AI Training service, please refer to https://cloud.google.com/vertex-ai/docs/training/create-custom-job

Args:

component_spec: The task (ContainerOp) object to run as a Vertex AI custom job.

display_name (Optional[str]): The name of the custom job. If not provided, component_spec.name will be used instead.

replica_count (Optional[int]): The count of instances in the cluster. One replica always counts towards the master in worker_pool_spec[0], and the remaining replicas are allocated in worker_pool_spec[1]. For more details, see https://cloud.google.com/vertex-ai/docs/training/distributed-training#configure_a_distributed_training_job.

machine_type (Optional[str]): The type of machine to run the custom job. The default value is "n1-standard-4". For more details about this input config, see https://cloud.google.com/vertex-ai/docs/training/configure-compute#machine-types.

accelerator_type (Optional[str]): The type of accelerator(s) that may be attached to the machine, as per accelerator_count. For more details about this input config, see https://cloud.google.com/vertex-ai/docs/reference/rest/v1/MachineSpec#acceleratortype.

accelerator_count (Optional[int]): The number of accelerators to attach to the machine. Defaults to 1 if accelerator_type is set.

boot_disk_type (Optional[str]): Type of the boot disk (default is "pd-ssd"). Valid values: "pd-ssd" (Persistent Disk Solid State Drive) or "pd-standard" (Persistent Disk Hard Disk Drive). boot_disk_type is set as a static value and cannot be changed as a pipeline parameter.

boot_disk_size_gb (Optional[int]): Size in GB of the boot disk (default is 100GB). boot_disk_size_gb is set as a static value and cannot be changed as a pipeline parameter.

timeout (Optional[str]): The maximum job running time. The default is 7 days. A duration in seconds with up to nine fractional digits, terminated by 's', for example: "3.5s".

restart_job_on_worker_restart (Optional[bool]): Restarts the entire CustomJob if a worker gets restarted. This feature can be used by distributed training jobs that are not resilient to workers leaving and joining a job.

service_account (Optional[str]): Sets the default service account for the workload run-as account. The service account running the pipeline (https://cloud.google.com/vertex-ai/docs/pipelines/configure-project#service-account) submitting jobs must have act-as permission on this run-as account. If unspecified, the Vertex AI Custom Code Service Agent (https://cloud.google.com/vertex-ai/docs/general/access-control#service-agents) for the CustomJob's project is used.

network (Optional[str]): The full name of the Compute Engine network to which the job should be peered, for example projects/12345/global/networks/myVPC. The format is projects/{project}/global/networks/{network}, where {project} is a project number, as in 12345, and {network} is a network name. Private services access must already be configured for the network. If left unspecified, the job is not peered with any network.

encryption_spec_key_name (Optional[str]): Customer-managed encryption key options for the CustomJob. If this is set, then all resources created by the CustomJob will be encrypted with the provided encryption key.

tensorboard (Optional[str]): The name of a Vertex AI Tensorboard resource to which this CustomJob will upload Tensorboard logs.

enable_web_access (Optional[bool]): Whether you want Vertex AI to enable [interactive shell access](https://cloud.google.com/vertex-ai/docs/training/monitor-debug-interactive-shell) to training containers. If set to true, you can access interactive shells at the URIs given by [CustomJob.web_access_uris][].

reserved_ip_ranges (Optional[Sequence[str]]): A list of names for the reserved IP ranges under the VPC network that can be used for this job. If set, the job is deployed within the provided IP ranges. Otherwise, the job is deployed to any IP range under the provided VPC network.

nfs_mounts (Optional[Sequence[Dict]]): A list of NFS mount specs in JSON dict format. nfs_mounts is set as a static value and cannot be changed as a pipeline parameter. For the API spec, see https://cloud.google.com/vertex-ai/docs/reference/rest/v1/CustomJobSpec#NfsMount. For more details about mounting NFS for CustomJob, see https://cloud.google.com/vertex-ai/docs/training/train-nfs-share

base_output_directory (Optional[str]): The Cloud Storage location to store the output of this CustomJob or HyperparameterTuningJob. For more details, see https://cloud.google.com/vertex-ai/docs/reference/rest/v1/GcsDestination

labels (Optional[Dict[str, str]]): The labels with user-defined metadata to organize CustomJobs. See https://goo.gl/xmQnxf for more information.

Returns:

A Custom Job component operator corresponding to the input component operator.
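The timeout parameter in both functions uses a Duration string: seconds with up to nine fractional digits, terminated by 's'. A small helper can produce it from a day count; to_duration is a made-up name for illustration, and "604800s" matches the documented 7-day default.

```python
from datetime import timedelta


def to_duration(days: float) -> str:
    """Format a day count as the 's'-terminated duration string."""
    seconds = timedelta(days=days).total_seconds()
    # Keep up to nine fractional digits, trimming trailing zeros so
    # whole-second values render as e.g. "604800s".
    text = f"{seconds:.9f}".rstrip("0").rstrip(".")
    return text + "s"


print(to_duration(7))  # "604800s", the documented default timeout
```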