google_cloud_pipeline_components.experimental.automl.tabular package

Submodules

google_cloud_pipeline_components.experimental.automl.tabular.utils module

Util functions for AutoML Tabular pipeline.

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_builtin_algorithm_hyperparameter_tuning_job_pipeline_and_parameters(project: str, location: str, root_dir: str, algorithm_name: str, target_column: str, prediction_type: str, transformations: Dict[str, Any], split_spec: Dict[str, Any], data_source: Dict[str, Any], study_spec_metrics: List[Dict[str, Any]], study_spec_parameters: List[Dict[str, Any]], max_trial_count: int, parallel_trial_count: int, enable_profiler: bool = False, seed: int = 1, weight_column: str = '', max_failed_trial_count: int = 0, study_spec_algorithm: str = 'ALGORITHM_UNSPECIFIED', study_spec_measurement_selection_type: str = 'BEST_MEASUREMENT', stats_and_example_gen_dataflow_machine_type: str = 'n1-standard-16', stats_and_example_gen_dataflow_max_num_workers: int = 25, stats_and_example_gen_dataflow_disk_size_gb: int = 40, transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, training_machine_spec: Optional[Dict[str, Any]] = None, training_replica_count: int = 1, dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '') Tuple[str, Dict[str, Any]]

Get the built-in algorithm HyperparameterTuningJob pipeline.

Args:

project: The GCP project that runs the pipeline components.
location: The GCP region that runs the pipeline components.
root_dir: The root GCS directory for the pipeline components.
algorithm_name: The name of the algorithm. One of "TabNet" or "Wide & Deep".
target_column: The target column name.
prediction_type: The type of prediction the Model is to produce. "classification" or "regression".
transformations: The transformations to apply.
split_spec: The split spec.
data_source: The data source.
study_spec_metrics: List of dictionaries representing metrics to optimize. Each dictionary contains the metric_id, which is reported by the training job, and the optimization goal of the metric, one of "minimize" or "maximize".
study_spec_parameters: List of dictionaries representing parameters to optimize. The dictionary key is the parameter_id, which is passed to the training job as a command line argument, and the dictionary value is the parameter specification of the metric.
max_trial_count: The desired total number of trials.
parallel_trial_count: The desired number of trials to run in parallel.
enable_profiler: Enables profiling and saves a trace during evaluation.
seed: Seed to be used for this run.
weight_column: The weight column name.
max_failed_trial_count: The number of failed trials that need to be seen before failing the HyperparameterTuningJob. If set to 0, Vertex AI decides how many trials must fail before the whole job fails.
study_spec_algorithm: The search algorithm specified for the study. One of "ALGORITHM_UNSPECIFIED", "GRID_SEARCH", or "RANDOM_SEARCH".
study_spec_measurement_selection_type: Which measurement to use if/when the service automatically selects the final measurement from previously reported intermediate measurements. One of "BEST_MEASUREMENT" or "LAST_MEASUREMENT".
stats_and_example_gen_dataflow_machine_type: The Dataflow machine type for the stats_and_example_gen component.
stats_and_example_gen_dataflow_max_num_workers: The max number of Dataflow workers for the stats_and_example_gen component.
stats_and_example_gen_dataflow_disk_size_gb: Dataflow worker's disk size in GB for the stats_and_example_gen component.
transform_dataflow_machine_type: The Dataflow machine type for the transform component.
transform_dataflow_max_num_workers: The max number of Dataflow workers for the transform component.
transform_dataflow_disk_size_gb: Dataflow worker's disk size in GB for the transform component.
training_machine_spec: The machine spec for the trainer component.
training_replica_count: The replica count for the trainer component.
dataflow_subnetwork: Dataflow's fully qualified subnetwork name; when empty, the default subnetwork will be used. More details: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications
dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.
encryption_spec_key_name: The KMS key name.

Returns:

Tuple of pipeline_definition_path and parameter_values.
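
The sketch below is one plausible way to call this helper; it is illustrative only. The project ID, bucket path, and the contents of the configuration dictionaries are hypothetical placeholders, and the study spec dictionaries must follow the formats described in the Args section above.

    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    # Placeholder configuration dicts -- fill in for your dataset.
    transformations = {}        # per-column transformation config
    split_spec = {}             # train/validation/test split spec
    data_source = {}            # CSV or BigQuery source reference
    study_spec_metrics = []     # one dict per metric (metric_id + goal)
    study_spec_parameters = []  # one dict per hyperparameter to search

    template_path, parameter_values = (
        utils.get_builtin_algorithm_hyperparameter_tuning_job_pipeline_and_parameters(
            project="my-project",                     # hypothetical project ID
            location="us-central1",
            root_dir="gs://my-bucket/pipeline_root",  # hypothetical GCS path
            algorithm_name="TabNet",
            target_column="label",
            prediction_type="classification",
            transformations=transformations,
            split_spec=split_spec,
            data_source=data_source,
            study_spec_metrics=study_spec_metrics,
            study_spec_parameters=study_spec_parameters,
            max_trial_count=10,
            parallel_trial_count=5,
        )
    )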

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_distill_skip_evaluation_pipeline_and_parameters(*args, distill_batch_predict_machine_type: str = 'n1-standard-16', distill_batch_predict_starting_replica_count: int = 25, distill_batch_predict_max_replica_count: int = 25, **kwargs) Tuple[str, Dict[str, Any]]

Get the AutoML Tabular training pipeline that distills and skips evaluation.

Args:

*args: All arguments in get_skip_evaluation_pipeline_and_parameters.
distill_batch_predict_machine_type: The prediction server machine type for the batch predict component in the model distillation.
distill_batch_predict_starting_replica_count: The initial number of prediction servers for the batch predict component in the model distillation.
distill_batch_predict_max_replica_count: The max number of prediction servers for the batch predict component in the model distillation.
**kwargs: All arguments in get_skip_evaluation_pipeline_and_parameters.

Returns:

Tuple of pipeline_definition_path and parameter_values.
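
A brief, hypothetical sketch: the arguments of get_skip_evaluation_pipeline_and_parameters are forwarded unchanged, and only the distillation-specific knobs are added. All concrete values are placeholders.

    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    transformations, split_spec, data_source = {}, {}, {}  # placeholder configs

    template_path, parameter_values = (
        utils.get_distill_skip_evaluation_pipeline_and_parameters(
            project="my-project",
            location="us-central1",
            root_dir="gs://my-bucket/pipeline_root",
            target_column_name="label",
            prediction_type="classification",
            optimization_objective="maximize-au-roc",
            transformations=transformations,
            split_spec=split_spec,
            data_source=data_source,
            train_budget_milli_node_hours=1000,       # 1 node hour
            # Distillation-specific overrides:
            distill_batch_predict_machine_type="n1-standard-16",
            distill_batch_predict_starting_replica_count=25,
            distill_batch_predict_max_replica_count=25,
        )
    )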

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_feature_selection_skip_evaluation_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column_name: str, prediction_type: str, optimization_objective: str, transformations: Dict[str, Any], split_spec: Dict[str, Any], data_source: Dict[str, Any], max_selected_features: int, train_budget_milli_node_hours: float, stage_1_num_parallel_trials: int = 35, stage_2_num_parallel_trials: int = 35, stage_2_num_selected_trials: int = 5, weight_column_name: str = '', study_spec_override: Optional[Dict[str, Any]] = None, optimization_objective_recall_value: float = - 1, optimization_objective_precision_value: float = - 1, stage_1_tuner_worker_pool_specs_override: Optional[Dict[str, Any]] = None, cv_trainer_worker_pool_specs_override: Optional[Dict[str, Any]] = None, export_additional_model_without_custom_ops: bool = False, stats_and_example_gen_dataflow_machine_type: str = 'n1-standard-16', stats_and_example_gen_dataflow_max_num_workers: int = 25, stats_and_example_gen_dataflow_disk_size_gb: int = 40, transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '') Tuple[str, Dict[str, Any]]

Get the AutoML Tabular training pipeline that skips evaluation.

Args:

project: The GCP project that runs the pipeline components.
location: The GCP region that runs the pipeline components.
root_dir: The root GCS directory for the pipeline components.
target_column_name: The target column name.
prediction_type: The type of prediction the Model is to produce. "classification" or "regression".
optimization_objective: For binary classification, "maximize-au-roc", "minimize-log-loss", "maximize-au-prc", "maximize-precision-at-recall", or "maximize-recall-at-precision". For multi class classification, "minimize-log-loss". For regression, "minimize-rmse", "minimize-mae", or "minimize-rmsle".
transformations: The transformations to apply.
split_spec: The split spec.
data_source: The data source.
max_selected_features: Number of features to be selected.
train_budget_milli_node_hours: The train budget of creating this model, expressed in milli node hours, i.e. a value of 1,000 in this field means 1 node hour.
stage_1_num_parallel_trials: Number of parallel trials for stage 1.
stage_2_num_parallel_trials: Number of parallel trials for stage 2.
stage_2_num_selected_trials: Number of selected trials for stage 2.
weight_column_name: The weight column name.
study_spec_override: The dictionary for overriding the study spec.
optimization_objective_recall_value: Required when optimization_objective is "maximize-precision-at-recall". Must be between 0 and 1, inclusive.
optimization_objective_precision_value: Required when optimization_objective is "maximize-recall-at-precision". Must be between 0 and 1, inclusive.
stage_1_tuner_worker_pool_specs_override: The dictionary for overriding the stage 1 tuner worker pool spec. The dictionary should be of the format https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.
cv_trainer_worker_pool_specs_override: The dictionary for overriding the stage cv trainer worker pool spec. The dictionary should be of the format https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.
export_additional_model_without_custom_ops: Whether to export an additional model without custom TensorFlow operators.
stats_and_example_gen_dataflow_machine_type: The Dataflow machine type for the stats_and_example_gen component.
stats_and_example_gen_dataflow_max_num_workers: The max number of Dataflow workers for the stats_and_example_gen component.
stats_and_example_gen_dataflow_disk_size_gb: Dataflow worker's disk size in GB for the stats_and_example_gen component.
transform_dataflow_machine_type: The Dataflow machine type for the transform component.
transform_dataflow_max_num_workers: The max number of Dataflow workers for the transform component.
transform_dataflow_disk_size_gb: Dataflow worker's disk size in GB for the transform component.
dataflow_subnetwork: Dataflow's fully qualified subnetwork name; when empty, the default subnetwork will be used. More details: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications
dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.
encryption_spec_key_name: The KMS key name.

Returns:

Tuple of pipeline_definition_path and parameter_values.
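
An illustrative call with placeholder values; max_selected_features controls how many features the ranking step keeps for the downstream training stages.

    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    transformations, split_spec, data_source = {}, {}, {}  # placeholder configs

    template_path, parameter_values = (
        utils.get_feature_selection_skip_evaluation_pipeline_and_parameters(
            project="my-project",
            location="us-central1",
            root_dir="gs://my-bucket/pipeline_root",
            target_column_name="label",
            prediction_type="classification",
            optimization_objective="maximize-au-roc",
            transformations=transformations,
            split_spec=split_spec,
            data_source=data_source,
            max_selected_features=100,           # keep the top 100 features
            train_budget_milli_node_hours=1000,
        )
    )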

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_skip_architecture_search_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column_name: str, prediction_type: str, optimization_objective: str, transformations: Dict[str, Any], split_spec: Dict[str, Any], data_source: Dict[str, Any], train_budget_milli_node_hours: float, stage_1_tuning_result_artifact_uri: str, stage_2_num_parallel_trials: int = 35, stage_2_num_selected_trials: int = 5, weight_column_name: str = '', optimization_objective_recall_value: float = - 1, optimization_objective_precision_value: float = - 1, cv_trainer_worker_pool_specs_override: Optional[Dict[str, Any]] = None, export_additional_model_without_custom_ops: bool = False, stats_and_example_gen_dataflow_machine_type: str = 'n1-standard-16', stats_and_example_gen_dataflow_max_num_workers: int = 25, transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '') Tuple[str, Dict[str, Any]]

Get the AutoML Tabular training pipeline that skips architecture search.

Args:

project: The GCP project that runs the pipeline components.
location: The GCP region that runs the pipeline components.
root_dir: The root GCS directory for the pipeline components.
target_column_name: The target column name.
prediction_type: The type of prediction the Model is to produce. "classification" or "regression".
optimization_objective: For binary classification, "maximize-au-roc", "minimize-log-loss", "maximize-au-prc", "maximize-precision-at-recall", or "maximize-recall-at-precision". For multi class classification, "minimize-log-loss". For regression, "minimize-rmse", "minimize-mae", or "minimize-rmsle".
transformations: The transformations to apply.
split_spec: The split spec.
data_source: The data source.
train_budget_milli_node_hours: The train budget of creating this model, expressed in milli node hours, i.e. a value of 1,000 in this field means 1 node hour.
stage_1_tuning_result_artifact_uri: The stage 1 tuning result artifact GCS URI.
stage_2_num_parallel_trials: Number of parallel trials for stage 2.
stage_2_num_selected_trials: Number of selected trials for stage 2.
weight_column_name: The weight column name.
optimization_objective_recall_value: Required when optimization_objective is "maximize-precision-at-recall". Must be between 0 and 1, inclusive.
optimization_objective_precision_value: Required when optimization_objective is "maximize-recall-at-precision". Must be between 0 and 1, inclusive.
cv_trainer_worker_pool_specs_override: The dictionary for overriding the stage cv trainer worker pool spec. The dictionary should be of the format https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.
export_additional_model_without_custom_ops: Whether to export an additional model without custom TensorFlow operators.
stats_and_example_gen_dataflow_machine_type: The Dataflow machine type for the stats_and_example_gen component.
stats_and_example_gen_dataflow_max_num_workers: The max number of Dataflow workers for the stats_and_example_gen component.
transform_dataflow_machine_type: The Dataflow machine type for the transform component.
transform_dataflow_max_num_workers: The max number of Dataflow workers for the transform component.
dataflow_subnetwork: Dataflow's fully qualified subnetwork name; when empty, the default subnetwork will be used. More details: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications
dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.
encryption_spec_key_name: The KMS key name.

Returns:

Tuple of pipeline_definition_path and parameter_values.
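
An illustrative call, assuming a previous run already produced a stage 1 tuning result artifact; the URI below is a hypothetical placeholder.

    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    transformations, split_spec, data_source = {}, {}, {}  # placeholder configs

    template_path, parameter_values = (
        utils.get_skip_architecture_search_pipeline_and_parameters(
            project="my-project",
            location="us-central1",
            root_dir="gs://my-bucket/pipeline_root",
            target_column_name="label",
            prediction_type="classification",
            optimization_objective="maximize-au-roc",
            transformations=transformations,
            split_spec=split_spec,
            data_source=data_source,
            train_budget_milli_node_hours=1000,
            # Artifact produced by the stage 1 tuner of an earlier run (placeholder URI).
            stage_1_tuning_result_artifact_uri="gs://my-bucket/previous_run/stage_1_tuning_result",
        )
    )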

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_skip_evaluation_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column_name: str, prediction_type: str, optimization_objective: str, transformations: Dict[str, Any], split_spec: Dict[str, Any], data_source: Dict[str, Any], train_budget_milli_node_hours: float, stage_1_num_parallel_trials: int = 35, stage_2_num_parallel_trials: int = 35, stage_2_num_selected_trials: int = 5, weight_column_name: str = '', study_spec_override: Optional[Dict[str, Any]] = None, optimization_objective_recall_value: float = - 1, optimization_objective_precision_value: float = - 1, stage_1_tuner_worker_pool_specs_override: Optional[Dict[str, Any]] = None, cv_trainer_worker_pool_specs_override: Optional[Dict[str, Any]] = None, export_additional_model_without_custom_ops: bool = False, stats_and_example_gen_dataflow_machine_type: str = 'n1-standard-16', stats_and_example_gen_dataflow_max_num_workers: int = 25, stats_and_example_gen_dataflow_disk_size_gb: int = 40, transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '') Tuple[str, Dict[str, Any]]

Get the AutoML Tabular training pipeline that skips evaluation.

Args:

project: The GCP project that runs the pipeline components.
location: The GCP region that runs the pipeline components.
root_dir: The root GCS directory for the pipeline components.
target_column_name: The target column name.
prediction_type: The type of prediction the Model is to produce. "classification" or "regression".
optimization_objective: For binary classification, "maximize-au-roc", "minimize-log-loss", "maximize-au-prc", "maximize-precision-at-recall", or "maximize-recall-at-precision". For multi class classification, "minimize-log-loss". For regression, "minimize-rmse", "minimize-mae", or "minimize-rmsle".
transformations: The transformations to apply.
split_spec: The split spec.
data_source: The data source.
train_budget_milli_node_hours: The train budget of creating this model, expressed in milli node hours, i.e. a value of 1,000 in this field means 1 node hour.
stage_1_num_parallel_trials: Number of parallel trials for stage 1.
stage_2_num_parallel_trials: Number of parallel trials for stage 2.
stage_2_num_selected_trials: Number of selected trials for stage 2.
weight_column_name: The weight column name.
study_spec_override: The dictionary for overriding the study spec.
optimization_objective_recall_value: Required when optimization_objective is "maximize-precision-at-recall". Must be between 0 and 1, inclusive.
optimization_objective_precision_value: Required when optimization_objective is "maximize-recall-at-precision". Must be between 0 and 1, inclusive.
stage_1_tuner_worker_pool_specs_override: The dictionary for overriding the stage 1 tuner worker pool spec. The dictionary should be of the format https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.
cv_trainer_worker_pool_specs_override: The dictionary for overriding the stage cv trainer worker pool spec. The dictionary should be of the format https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.
export_additional_model_without_custom_ops: Whether to export an additional model without custom TensorFlow operators.
stats_and_example_gen_dataflow_machine_type: The Dataflow machine type for the stats_and_example_gen component.
stats_and_example_gen_dataflow_max_num_workers: The max number of Dataflow workers for the stats_and_example_gen component.
stats_and_example_gen_dataflow_disk_size_gb: Dataflow worker's disk size in GB for the stats_and_example_gen component.
transform_dataflow_machine_type: The Dataflow machine type for the transform component.
transform_dataflow_max_num_workers: The max number of Dataflow workers for the transform component.
transform_dataflow_disk_size_gb: Dataflow worker's disk size in GB for the transform component.
dataflow_subnetwork: Dataflow's fully qualified subnetwork name; when empty, the default subnetwork will be used. More details: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications
dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.
encryption_spec_key_name: The KMS key name.

Returns:

Tuple of pipeline_definition_path and parameter_values.
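
The returned tuple is a compiled pipeline template path plus the parameter values to run it with. A minimal sketch of calling the helper and submitting the result with the Vertex AI SDK follows; the project, bucket, and dict contents are placeholders, and the PipelineJob usage follows the standard google-cloud-aiplatform API rather than anything specific to this module.

    from google.cloud import aiplatform
    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    transformations = {}   # placeholder per-column transformation config
    split_spec = {}        # placeholder split spec
    data_source = {}       # placeholder data source

    template_path, parameter_values = (
        utils.get_skip_evaluation_pipeline_and_parameters(
            project="my-project",
            location="us-central1",
            root_dir="gs://my-bucket/pipeline_root",
            target_column_name="label",
            prediction_type="classification",
            optimization_objective="maximize-au-roc",
            transformations=transformations,
            split_spec=split_spec,
            data_source=data_source,
            train_budget_milli_node_hours=1000,
        )
    )

    # Submit the compiled pipeline with the Vertex AI SDK.
    aiplatform.init(project="my-project", location="us-central1")
    job = aiplatform.PipelineJob(
        display_name="automl-tabular-skip-evaluation",
        template_path=template_path,
        pipeline_root="gs://my-bucket/pipeline_root",
        parameter_values=parameter_values,
    )
    job.run()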

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_tabnet_trainer_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, transformations: Dict[str, Any], split_spec: Dict[str, Any], data_source: Dict[str, Any], learning_rate: float, optimizer_type: str = 'adam', max_steps: int = - 1, max_train_secs: int = - 1, l1_regularization_strength: float = 0, l2_regularization_strength: float = 0, l2_shrinkage_regularization_strength: float = 0, beta_1: float = 0.9, beta_2: float = 0.999, large_category_dim: int = 1, large_category_thresh: int = 300, yeo_johnson_transform: bool = True, feature_dim: int = 64, feature_dim_ratio: float = 0.5, num_decision_steps: int = 6, relaxation_factor: float = 1.5, decay_every: float = 100, gradient_thresh: float = 2000, sparsity_loss_weight: float = 1e-05, batch_momentum: float = 0.95, batch_size_ratio: float = 0.25, num_transformer_layers: int = 4, num_transformer_layers_ratio: float = 0.25, class_weight: float = 1.0, loss_function_type: str = 'default', alpha_focal_loss: float = 0.25, gamma_focal_loss: float = 2.0, enable_profiler: bool = False, seed: int = 1, eval_steps: int = 0, batch_size: int = 100, eval_frequency_secs: int = 600, weight_column: str = '', stats_and_example_gen_dataflow_machine_type: str = 'n1-standard-16', stats_and_example_gen_dataflow_max_num_workers: int = 25, stats_and_example_gen_dataflow_disk_size_gb: int = 40, transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, training_machine_spec: Optional[Dict[str, Any]] = None, training_replica_count: int = 1, dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '') Tuple[str, Dict[str, Any]]

Get the TabNet training pipeline.

Args:

project: The GCP project that runs the pipeline components.
location: The GCP region that runs the pipeline components.
root_dir: The root GCS directory for the pipeline components.
target_column: The target column name.
prediction_type: The type of prediction the Model is to produce. "classification" or "regression".
transformations: The transformations to apply.
split_spec: The split spec.
data_source: The data source.
learning_rate: The learning rate used by the linear optimizer.
optimizer_type: The type of optimizer to use. Choices are "adam", "ftrl" and "sgd" for the Adam, FTRL, and Gradient Descent Optimizers, respectively.
max_steps: Number of steps (batches) to run the trainer for.
max_train_secs: Amount of time in seconds to run the trainer for.
l1_regularization_strength: L1 regularization strength for optimizer_type="ftrl".
l2_regularization_strength: L2 regularization strength for optimizer_type="ftrl".
l2_shrinkage_regularization_strength: L2 shrinkage regularization strength for optimizer_type="ftrl".
beta_1: Beta 1 value for optimizer_type="adam".
beta_2: Beta 2 value for optimizer_type="adam".
large_category_dim: Embedding dimension for categorical feature with large number of categories.
large_category_thresh: Threshold for number of categories to apply large_category_dim embedding dimension to.
yeo_johnson_transform: Enables trainable Yeo-Johnson power transform.
feature_dim: Dimensionality of the hidden representation in the feature transformation block.
feature_dim_ratio: The ratio of Output Dimension (dimensionality of the outputs of each decision step) to Feature Dimension.
num_decision_steps: Number of sequential decision steps.
relaxation_factor: Relaxation factor that promotes the reuse of each feature at different decision steps. When it is 1, a feature is enforced to be used only at one decision step; as it increases, more flexibility is provided to use a feature at multiple decision steps.
decay_every: Number of iterations for periodically applying learning rate decay.
gradient_thresh: Threshold for the norm of gradients for clipping.
sparsity_loss_weight: Weight of the loss for sparsity regularization (increasing it will yield more sparse feature selection).
batch_momentum: Momentum in ghost batch normalization.
batch_size_ratio: The ratio of Virtual Batch Size (size of the ghost batch normalization) to Batch Size.
num_transformer_layers: The number of transformer layers for each decision step.
num_transformer_layers_ratio: The ratio of Shared Transformer Layers to Transformer Layers.
class_weight: The class weight is used to compute a weighted cross entropy, which is helpful for classifying imbalanced datasets.
loss_function_type: Loss function type. Loss functions in classification: [cross_entropy, weighted_cross_entropy, focal_loss]; default is cross_entropy. Loss functions in regression: [rmse, mae, mse]; default is mse.
alpha_focal_loss: Alpha value (balancing factor) in the focal_loss function.
gamma_focal_loss: Gamma value (modulating factor) for focal loss.
enable_profiler: Enables profiling and saves a trace during evaluation.
seed: Seed to be used for this run.
eval_steps: Number of steps (batches) to run evaluation for. If not specified, evaluation runs on the whole validation dataset; otherwise this value must be >= 1.
batch_size: Batch size for training.
eval_frequency_secs: Frequency at which evaluation and checkpointing will take place.
weight_column: The weight column name.
stats_and_example_gen_dataflow_machine_type: The Dataflow machine type for the stats_and_example_gen component.
stats_and_example_gen_dataflow_max_num_workers: The max number of Dataflow workers for the stats_and_example_gen component.
stats_and_example_gen_dataflow_disk_size_gb: Dataflow worker's disk size in GB for the stats_and_example_gen component.
transform_dataflow_machine_type: The Dataflow machine type for the transform component.
transform_dataflow_max_num_workers: The max number of Dataflow workers for the transform component.
transform_dataflow_disk_size_gb: Dataflow worker's disk size in GB for the transform component.
training_machine_spec: The machine spec for the trainer component.
training_replica_count: The replica count for the trainer component.
dataflow_subnetwork: Dataflow's fully qualified subnetwork name; when empty, the default subnetwork will be used. More details: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications
dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.
encryption_spec_key_name: The KMS key name.

Returns:

Tuple of pipeline_definition_path and parameter_values.
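
An illustrative call with placeholder values; only the required arguments plus a few commonly tuned TabNet knobs are shown.

    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    transformations, split_spec, data_source = {}, {}, {}  # placeholder configs

    template_path, parameter_values = (
        utils.get_tabnet_trainer_pipeline_and_parameters(
            project="my-project",
            location="us-central1",
            root_dir="gs://my-bucket/pipeline_root",
            target_column="label",
            prediction_type="classification",
            transformations=transformations,
            split_spec=split_spec,
            data_source=data_source,
            learning_rate=0.01,
            # Optional TabNet hyperparameters (defaults shown in the signature above):
            feature_dim=64,
            num_decision_steps=6,
            max_steps=10000,
        )
    )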

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_wide_and_deep_trainer_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, transformations: Dict[str, Any], split_spec: Dict[str, Any], data_source: Dict[str, Any], learning_rate: float, dnn_learning_rate: float, optimizer_type: str = 'adam', max_steps: int = - 1, max_train_secs: int = - 1, l1_regularization_strength: float = 0, l2_regularization_strength: float = 0, l2_shrinkage_regularization_strength: float = 0, beta_1: float = 0.9, beta_2: float = 0.999, hidden_units: str = '30,30,30', use_wide: bool = True, embed_categories: bool = True, dnn_dropout: float = 0, dnn_optimizer_type: str = 'ftrl', dnn_l1_regularization_strength: float = 0, dnn_l2_regularization_strength: float = 0, dnn_l2_shrinkage_regularization_strength: float = 0, dnn_beta_1: float = 0.9, dnn_beta_2: float = 0.999, enable_profiler: bool = False, seed: int = 1, eval_steps: int = 0, batch_size: int = 100, eval_frequency_secs: int = 600, weight_column: str = '', stats_and_example_gen_dataflow_machine_type: str = 'n1-standard-16', stats_and_example_gen_dataflow_max_num_workers: int = 25, stats_and_example_gen_dataflow_disk_size_gb: int = 40, transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, training_machine_spec: Optional[Dict[str, Any]] = None, training_replica_count: int = 1, dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '') Tuple[str, Dict[str, Any]]

Get the Wide & Deep training pipeline.

Args:

project: The GCP project that runs the pipeline components.
location: The GCP region that runs the pipeline components.
root_dir: The root GCS directory for the pipeline components.
target_column: The target column name.
prediction_type: The type of prediction the Model is to produce. "classification" or "regression".
transformations: The transformations to apply.
split_spec: The split spec.
data_source: The data source.
learning_rate: The learning rate used by the linear optimizer.
dnn_learning_rate: The learning rate for training the deep part of the model.
optimizer_type: The type of optimizer to use. Choices are "adam", "ftrl" and "sgd" for the Adam, FTRL, and Gradient Descent Optimizers, respectively.
max_steps: Number of steps (batches) to run the trainer for.
max_train_secs: Amount of time in seconds to run the trainer for.
l1_regularization_strength: L1 regularization strength for optimizer_type="ftrl".
l2_regularization_strength: L2 regularization strength for optimizer_type="ftrl".
l2_shrinkage_regularization_strength: L2 shrinkage regularization strength for optimizer_type="ftrl".
beta_1: Beta 1 value for optimizer_type="adam".
beta_2: Beta 2 value for optimizer_type="adam".
hidden_units: Hidden layer sizes to use for DNN feature columns, provided in comma-separated layers.
use_wide: If set to True, the categorical columns will be used in the wide part of the DNN model.
embed_categories: If set to True, the categorical columns will be embedded and used in the deep part of the model. Embedding size is the square root of the column cardinality.
dnn_dropout: The probability we will drop out a given coordinate.
dnn_optimizer_type: The type of optimizer to use for the deep part of the model. Choices are "adam", "ftrl" and "sgd" for the Adam, FTRL, and Gradient Descent Optimizers, respectively.
dnn_l1_regularization_strength: L1 regularization strength for dnn_optimizer_type="ftrl".
dnn_l2_regularization_strength: L2 regularization strength for dnn_optimizer_type="ftrl".
dnn_l2_shrinkage_regularization_strength: L2 shrinkage regularization strength for dnn_optimizer_type="ftrl".
dnn_beta_1: Beta 1 value for dnn_optimizer_type="adam".
dnn_beta_2: Beta 2 value for dnn_optimizer_type="adam".
enable_profiler: Enables profiling and saves a trace during evaluation.
seed: Seed to be used for this run.
eval_steps: Number of steps (batches) to run evaluation for. If not specified, evaluation runs on the whole validation dataset; otherwise this value must be >= 1.
batch_size: Batch size for training.
eval_frequency_secs: Frequency at which evaluation and checkpointing will take place.
weight_column: The weight column name.
stats_and_example_gen_dataflow_machine_type: The Dataflow machine type for the stats_and_example_gen component.
stats_and_example_gen_dataflow_max_num_workers: The max number of Dataflow workers for the stats_and_example_gen component.
stats_and_example_gen_dataflow_disk_size_gb: Dataflow worker's disk size in GB for the stats_and_example_gen component.
transform_dataflow_machine_type: The Dataflow machine type for the transform component.
transform_dataflow_max_num_workers: The max number of Dataflow workers for the transform component.
transform_dataflow_disk_size_gb: Dataflow worker's disk size in GB for the transform component.
training_machine_spec: The machine spec for the trainer component.
training_replica_count: The replica count for the trainer component.
dataflow_subnetwork: Dataflow's fully qualified subnetwork name; when empty, the default subnetwork will be used. More details: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications
dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.
encryption_spec_key_name: The KMS key name.

Returns:

Tuple of pipeline_definition_path and parameter_values.
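
An illustrative call with placeholder values; learning_rate drives the wide (linear) part and dnn_learning_rate the deep part.

    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    transformations, split_spec, data_source = {}, {}, {}  # placeholder configs

    template_path, parameter_values = (
        utils.get_wide_and_deep_trainer_pipeline_and_parameters(
            project="my-project",
            location="us-central1",
            root_dir="gs://my-bucket/pipeline_root",
            target_column="label",
            prediction_type="classification",
            transformations=transformations,
            split_spec=split_spec,
            data_source=data_source,
            learning_rate=0.01,        # wide (linear) part
            dnn_learning_rate=0.001,   # deep part
            hidden_units="128,64,32",  # comma-separated DNN layer sizes
        )
    )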

google_cloud_pipeline_components.experimental.automl.tabular.utils.input_dictionary_to_parameter(input_dict: Optional[Dict[str, Any]]) str

Convert a JSON input dict to an encoded parameter string.

This function is required due to a limitation of YAML component definitions: there is no keyword for applying quote escaping, so the JSON argument's quotes must be manually escaped using this function.

Args:

input_dict: The input JSON dictionary.

Returns:

The encoded string used as the parameter value.
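
A short, illustrative example; the input dict is arbitrary and the exact escaping of the returned string is an implementation detail of this helper.

    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    # Encode a JSON-style dict so it can be passed as a plain string
    # pipeline parameter (quotes are escaped for the YAML component spec).
    encoded = utils.input_dictionary_to_parameter({"key": {"nested": "value"}})
    print(encoded)  # a quote-escaped JSON string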

Module contents

Module for AutoML Tables KFP components.

google_cloud_pipeline_components.experimental.automl.tabular.BuiltinAlgorithmHyperparameterTuningJobOp()

automl_tabular_builtin_algorithm_hyperparameter_tuning_job: Launch a built-in algorithm hyperparameter tuning job using the Vertex HyperparameterTuningJob API.

Args:
project (str): Required. The GCP project that runs the pipeline components.
location (str): Required. The GCP region that runs the pipeline components.
image_uri (str): Required. The training image URI.
root_dir (str): Required. The root GCS directory for the pipeline components.
target_column (str): Required. The target column name.
prediction_type (str): Required. The type of prediction the Model is to produce. "classification" or "regression".
weight_column (Optional[str]): The weight column name.
enable_profiler (Optional[bool]): Enables profiling and saves a trace during evaluation.
seed (Optional[int]): Seed to be used for this run.
study_spec_metrics (list[dict]): Required. List of dictionaries representing metrics to optimize. Each dictionary contains the metric_id, which is reported by the training job, and the optimization goal of the metric, one of "minimize" or "maximize".
study_spec_parameters (list[str]): Required. List of dictionaries representing parameters to optimize. The dictionary key is the parameter_id, which is passed to the training job as a command line argument, and the dictionary value is the parameter specification of the metric.
max_trial_count (int): Required. The desired total number of trials.
parallel_trial_count (int): Required. The desired number of trials to run in parallel.
max_failed_trial_count (Optional[int]): The number of failed trials that need to be seen before failing the HyperparameterTuningJob. If set to 0, Vertex AI decides how many trials must fail before the whole job fails.
study_spec_algorithm (Optional[str]): The search algorithm specified for the study. One of "ALGORITHM_UNSPECIFIED", "GRID_SEARCH", or "RANDOM_SEARCH".
study_spec_measurement_selection_type (Optional[str]): Which measurement to use if/when the service automatically selects the final measurement from previously reported intermediate measurements. One of "BEST_MEASUREMENT" or "LAST_MEASUREMENT".
replica_count (Optional[int]): The replica count.
machine_spec (Optional[Dict[str, Any]]): The machine spec.
instance_baseline (AutoMLTabularInstanceBaseline): The path to a JSON file for baseline values.
metadata (TabularExampleGenMetadata): The tabular example gen metadata.
materialized_train_split (MaterializedSplit): The path to the materialized train split.
materialized_eval_split (MaterializedSplit): The path to the materialized validation split.
materialized_test_split (MaterializedSplit): The path to the materialized test split.
transform_output (TransformOutput): The path to transform output.
training_schema_uri (TrainingSchema): The path to the training schema.
encryption_spec_key_name (Optional[str]): The KMS key name.

Returns:
gcp_resources (str): Serialized gcp_resources proto tracking the custom training job.
instance_schema_uri (str): The path to the instance schema.
prediction_schema_uri (str): The path to the prediction schema.
trials (str): The path to the hyperparameter tuning trials.
prediction_docker_uri_output (str): The URI of the prediction container.

google_cloud_pipeline_components.experimental.automl.tabular.CvTrainerOp()

automl_tabular_cv_trainer: AutoML Tabular cross-validation trainer.

Args:
project (str): Required. Project to run Cross-validation trainer.
location (str): Location for running the Cross-validation trainer. If not set, default to us-central1.
root_dir (str): The Cloud Storage location to store the output.
worker_pool_specs_override (str): Quote escaped JSON string for the worker pool specs. An example of the worker pool specs JSON is: [{"machine_spec": {"machine_type": "n1-standard-16"}},{},{},{"machine_spec": {"machine_type": "n1-standard-16"}}]
deadline_hours (float): Number of hours the cross-validation trainer should run.
num_parallel_trials (int): Number of parallel training trials.
single_run_max_secs (int): Max number of seconds each training trial runs.
num_selected_trials (int): Number of selected trials. The number of weak learners in the final model is 5 * num_selected_trials.
transform_output (TransformOutput): The transform output artifact.
metadata (TabularExampleGenMetadata): The tabular example gen metadata.
materialized_cv_splits (MaterializedSplit): The materialized cross-validation splits.
tuning_result_input (AutoMLTabularTuningResult): AutoML Tabular tuning result.
encryption_spec_key_name (Optional[str]): Customer-managed encryption key.

Returns:

tuning_result_output (AutoMLTabularTuningResult): The trained model and architectures.
gcp_resources (str): GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.

google_cloud_pipeline_components.experimental.automl.tabular.EnsembleOp()

automl_tabular_ensemble: Ensemble AutoML Tabular models.

Args:
project (str): Required. Project to run Cross-validation trainer.
location (str): Location for running the Cross-validation trainer. If not set, default to us-central1.
root_dir (str): The Cloud Storage location to store the output.
transform_output (TransformOutput): The transform output artifact.
metadata (TabularExampleGenMetadata): The tabular example gen metadata.
dataset_schema (DatasetSchema): The schema of the dataset.
tuning_result_input (AutoMLTabularTuningResult): AutoML Tabular tuning result.
instance_baseline (AutoMLTabularInstanceBaseline): The instance baseline used to calculate explanations.
warmup_data (Dataset): The warm up data. The Ensemble component will save the warm up data together with the model artifact, used to warm up the model when the prediction server starts.
encryption_spec_key_name (Optional[str]): Customer-managed encryption key.
export_additional_model_without_custom_ops (Optional[str]): True to export an additional model without custom TF operators to the model_without_custom_ops output.

Returns:
gcp_resources (str): GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.
model_architecture (AutoMLTabularModelArchitecture): The architecture of the output model.
model (system.Model): The output model.
model_without_custom_ops (system.Model): The output model without custom TF operators; this output will be empty unless export_additional_model_without_custom_ops is set.
model_uri (str): The URI of the output model.
instance_schema_uri (str): The URI of the instance schema.
prediction_schema_uri (str): The URI of the prediction schema.
explanation_metadata (str): The explanation metadata used by Vertex online and batch explanations.
explanation_parameters (str): The explanation parameters used by Vertex online and batch explanations.

google_cloud_pipeline_components.experimental.automl.tabular.FeatureSelectionOp()

tabular_feature_ranking_and_selection: Launch a feature selection task to pick top features.

Args:
project (str): Required. Project to run feature selection.
location (str): Location for running the feature selection. If not set, default to us-central1.
root_dir: The Cloud Storage location to store the output.
dataflow_machine_type (Optional[str]): The machine type used for dataflow jobs. If not set, default to n1-standard-16.
dataflow_max_num_workers (Optional[int]): The number of workers to run the dataflow job. If not set, default to 25.
dataflow_disk_size_gb (Optional[int]): The disk size, in gigabytes, to use on each Dataflow worker instance. If not set, default to 40.
encryption_spec_key_name (Optional[str]): Customer-managed encryption key. If this is set, then all resources will be encrypted with the provided encryption key.
data_source (Dataset): The input dataset artifact which references csv, BigQuery, or TF Records.
target_column_name (str): Target column name of the input dataset.
max_selected_features (Optional[int]): Number of features to select by the algorithm. If not set, default to 1000.

Returns:
feature_ranking (TabularFeatureRanking): The dictionary of feature names and feature ranking values.
selected_features (JsonObject): A JSON array of selected feature names.
gcp_resources (str): GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.

google_cloud_pipeline_components.experimental.automl.tabular.FinalizerOp()

automl_tabular_finalizer: Finalizer for AutoML Tabular pipelines.

Args:
project (str): Required. Project to run Cross-validation trainer.
location (str): Location for running the Cross-validation trainer. If not set, default to us-central1.
root_dir (str): The Cloud Storage location to store the output.
encryption_spec_key_name (Optional[str]): Customer-managed encryption key.

Returns:
gcp_resources (str): GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.

google_cloud_pipeline_components.experimental.automl.tabular.InfraValidatorOp()

automl_tabular_infra_validator: Validates that the trained AutoML Tabular model is a valid model.

Args:

model (str): Path to the model to be validated.

google_cloud_pipeline_components.experimental.automl.tabular.Stage1TunerOp()

automl_tabular_stage_1_tuner: AutoML Tabular stage 1 tuner.

Args:
project (str): Required. Project to run Cross-validation trainer.
location (str): Location for running the Cross-validation trainer. If not set, default to us-central1.
root_dir (str): The Cloud Storage location to store the output.
study_spec_override (str): Quote escaped JSON string for the study spec. An example of the study specs JSON is: {"parameters": [{"parameter_id": "model_type", "categorical_value_spec": {"values": ["nn"]}}]}
worker_pool_specs_override (str): Quote escaped JSON string for the worker pool specs. An example of the worker pool specs JSON is: [{"machine_spec": {"machine_type": "n1-standard-16"}},{},{},{"machine_spec": {"machine_type": "n1-standard-16"}}]
reduce_search_space_mode (str): The reduce search space mode. Possible values: "regular" (default), "minimal", "full".
num_selected_trials (int): Number of selected trials. The number of weak learners in the final model is 5 * num_selected_trials.
deadline_hours (float): Number of hours the cross-validation trainer should run.
disable_early_stopping (bool): True to disable early stopping. Default value is false.
num_parallel_trials (int): Number of parallel training trials.
single_run_max_secs (int): Max number of seconds each training trial runs.
metadata (TabularExampleGenMetadata): The tabular example gen metadata.
transform_output (TransformOutput): The transform output artifact.
materialized_train_split (MaterializedSplit): The materialized train split.
materialized_eval_split (MaterializedSplit): The materialized eval split.
encryption_spec_key_name (Optional[str]): Customer-managed encryption key.
is_distill (bool): True if in distillation mode. The default value is false.

Returns:
gcp_resources (str): GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.
tuning_result_output (AutoMLTabularTuningResult): The trained model and architectures.

google_cloud_pipeline_components.experimental.automl.tabular.StatsAndExampleGenOp(project: str, location: str, root_dir: str, target_column_name: str, prediction_type: str, transformations: str, split_spec: str, data_source: str, weight_column_name: str = '', optimization_objective: str = '', optimization_objective_recall_value: float = '-1', optimization_objective_precision_value: float = '-1', dataflow_machine_type: str = 'n1-standard-16', dataflow_max_num_workers: int = '25', dataflow_disk_size_gb: int = '40', dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = 'true', encryption_spec_key_name: str = '', is_distill: bool = 'false')

tabular_stats_and_example_gen: Statistics and example gen for tabular data.

Args:
project (str): Required. Project to run Cross-validation trainer.
location (str): Location for running the Cross-validation trainer. If not set, default to us-central1.
root_dir (str): The Cloud Storage location to store the output.
target_column_name (str): The target column name.
weight_column_name (str): The weight column name.
prediction_type (str): The prediction type. Supported values: "classification", "regression".
optimization_objective (str): Objective function the model is optimizing towards. The training process creates a model that maximizes/minimizes the value of the objective function over the validation set. The supported optimization objectives depend on the prediction type. If the field is not set, a default objective function is used.
  classification (binary):
    "maximize-au-roc" (default) - Maximize the area under the receiver operating characteristic (ROC) curve.
    "minimize-log-loss" - Minimize log loss.
    "maximize-au-prc" - Maximize the area under the precision-recall curve.
    "maximize-precision-at-recall" - Maximize precision for a specified recall value.
    "maximize-recall-at-precision" - Maximize recall for a specified precision value.
  classification (multi-class):
    "minimize-log-loss" (default) - Minimize log loss.
  regression:
    "minimize-rmse" (default) - Minimize root-mean-squared error (RMSE).
    "minimize-mae" - Minimize mean-absolute error (MAE).
    "minimize-rmsle" - Minimize root-mean-squared log error (RMSLE).
optimization_objective_recall_value (str): Required when optimization_objective is "maximize-precision-at-recall". Must be between 0 and 1, inclusive.
optimization_objective_precision_value (str): Required when optimization_objective is "maximize-recall-at-precision". Must be between 0 and 1, inclusive.
transformations (str): Quote escaped JSON string for transformations. Each transformation will apply the transform function to a given input column, and the result will be used for training. When creating a transformation for a BigQuery Struct column, the column should be flattened using "." as the delimiter.
split_spec (str): Quote escaped JSON string for the split spec.
data_source (str): Quote escaped JSON string for the data source.
dataflow_machine_type (Optional[str]): The machine type used for dataflow jobs. If not set, default to n1-standard-16.
dataflow_max_num_workers (Optional[int]): The number of workers to run the dataflow job. If not set, default to 25.
dataflow_disk_size_gb (Optional[int]): The disk size, in gigabytes, to use on each Dataflow worker instance. If not set, default to 40.
dataflow_subnetwork (Optional[str]): Dataflow's fully qualified subnetwork name; when empty, the default subnetwork will be used. More details: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications
dataflow_use_public_ips (Optional[bool]): Specifies whether Dataflow workers use public IP addresses.
encryption_spec_key_name (Optional[str]): Customer-managed encryption key.
is_distill (bool): True if in distillation mode. The default value is false.

Returns:

dataset_schema (DatasetSchema): The schema of the dataset.
dataset_stats (AutoMLTabularDatasetStats): The stats of the dataset.
train_split (Dataset): The train split.
eval_split (Dataset): The eval split.
test_split (Dataset): The test split.
test_split_json (JsonObject): The test split JSON object.
instance_baseline (AutoMLTabularInstanceBaseline): The instance baseline used to calculate explanations.
metadata (TabularExampleGenMetadata): The tabular example gen metadata.
gcp_resources (str): GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.

google_cloud_pipeline_components.experimental.automl.tabular.TabNetTrainerOp(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, learning_rate: float, instance_baseline: AutoMLTabularInstanceBaseline, metadata: TabularExampleGenMetadata, materialized_train_split: MaterializedSplit, materialized_eval_split: MaterializedSplit, transform_output: TransformOutput, training_schema_uri: TrainingSchema, weight_column: str = '', max_steps: int = - 1, max_train_secs: int = - 1, optimizer_type: str = 'adam', l1_regularization_strength: float = 0, l2_regularization_strength: float = 0, l2_shrinkage_regularization_strength: float = 0, beta_1: float = 0.9, beta_2: float = 0.999, large_category_dim: int = 1, large_category_thresh: int = 300, yeo_johnson_transform: bool = 'true', feature_dim: int = 64, feature_dim_ratio: float = 0.5, num_decision_steps: int = 6, relaxation_factor: float = 1.5, decay_every: float = 100, gradient_thresh: float = 2000, sparsity_loss_weight: float = 1e-05, batch_momentum: float = 0.95, batch_size_ratio: float = 0.25, num_transformer_layers: int = 4, num_transformer_layers_ratio: float = 0.25, class_weight: float = 1.0, loss_function_type: str = 'default', alpha_focal_loss: float = 0.25, gamma_focal_loss: float = 2.0, enable_profiler: bool = 'false', seed: int = 1, eval_steps: int = 0, batch_size: int = 100, eval_frequency_secs: int = 600, replica_count: int = 1, machine_spec: dict = '{"machine_type": "n1-standard-16"}', materialized_test_split: MaterializedSplit = '', encryption_spec_key_name: str = '')

automl_tabular_tabnet_trainer: Launch a TabNet custom training job using the Vertex CustomJob API.

Args:
project (str): Required. The GCP project that runs the pipeline components.
location (str): Required. The GCP region that runs the pipeline components.
root_dir (str): Required. The root GCS directory for the pipeline components.
target_column (str): Required. The target column name.
prediction_type (str): Required. The type of prediction the Model is to produce. "classification" or "regression".
weight_column (Optional[str]): The weight column name.
max_steps (Optional[int]): Number of steps (batches) to run the trainer for.
max_train_secs (Optional[int]): Amount of time in seconds to run the trainer for.
learning_rate (float): The learning rate used by the linear optimizer.
optimizer_type (Optional[str]): The type of optimizer to use. Choices are "adam", "ftrl" and "sgd" for the Adam, FTRL, and Gradient Descent Optimizers, respectively.
l1_regularization_strength (Optional[float]): L1 regularization strength for optimizer_type="ftrl".
l2_regularization_strength (Optional[float]): L2 regularization strength for optimizer_type="ftrl".
l2_shrinkage_regularization_strength (Optional[float]): L2 shrinkage regularization strength for optimizer_type="ftrl".
beta_1 (Optional[float]): Beta 1 value for optimizer_type="adam".
beta_2 (Optional[float]): Beta 2 value for optimizer_type="adam".
large_category_dim (Optional[int]): Embedding dimension for categorical feature with large number of categories.
large_category_thresh (Optional[int]): Threshold for number of categories to apply large_category_dim embedding dimension to.
yeo_johnson_transform (Optional[bool]): Enables trainable Yeo-Johnson power transform.
feature_dim (Optional[int]): Dimensionality of the hidden representation in the feature transformation block.
feature_dim_ratio (Optional[float]): The ratio of Output Dimension (dimensionality of the outputs of each decision step) to Feature Dimension.
num_decision_steps (Optional[int]): Number of sequential decision steps.
relaxation_factor (Optional[float]): Relaxation factor that promotes the reuse of each feature at different decision steps. When it is 1, a feature is enforced to be used only at one decision step; as it increases, more flexibility is provided to use a feature at multiple decision steps.
decay_every (Optional[float]): Number of iterations for periodically applying learning rate decay.
gradient_thresh (Optional[float]): Threshold for the norm of gradients for clipping.
sparsity_loss_weight (Optional[float]): Weight of the loss for sparsity regularization (increasing it will yield more sparse feature selection).
batch_momentum (Optional[float]): Momentum in ghost batch normalization.
batch_size_ratio (Optional[float]): The ratio of Virtual Batch Size (size of the ghost batch normalization) to Batch Size.
num_transformer_layers (Optional[int]): The number of transformer layers for each decision step.
num_transformer_layers_ratio (Optional[float]): The ratio of Shared Transformer Layers to Transformer Layers.
class_weight (Optional[float]): The class weight is used to compute a weighted cross entropy, which is helpful for classifying imbalanced datasets.
loss_function_type (Optional[str]): Loss function type. Loss functions in classification: [cross_entropy, weighted_cross_entropy, focal_loss]; default is cross_entropy. Loss functions in regression: [rmse, mae, mse]; default is mse.
alpha_focal_loss (Optional[float]): Alpha value (balancing factor) in the focal_loss function.
gamma_focal_loss (Optional[float]): Gamma value (modulating factor) for focal loss.
enable_profiler (Optional[bool]): Enables profiling and saves a trace during evaluation.
seed (Optional[int]): Seed to be used for this run.
eval_steps (Optional[int]): Number of steps (batches) to run evaluation for. If not specified, evaluation runs on the whole validation dataset; otherwise this value must be >= 1.
batch_size (Optional[int]): Batch size for training.
eval_frequency_secs (Optional[int]): Frequency at which evaluation and checkpointing will take place.
replica_count (Optional[int]): The replica count.
machine_spec (Optional[Dict[str, Any]]): The machine spec.
instance_baseline (AutoMLTabularInstanceBaseline): The path to a JSON file for baseline values.
metadata (TabularExampleGenMetadata): The tabular example gen metadata.
materialized_train_split (MaterializedSplit): The path to the materialized train split.
materialized_eval_split (MaterializedSplit): The path to the materialized validation split.
materialized_test_split (MaterializedSplit): The path to the materialized test split.
transform_output (TransformOutput): The path to transform output.
training_schema_uri (TrainingSchema): The path to the training schema.
encryption_spec_key_name (Optional[str]): The KMS key name.

Returns:
gcp_resources (str): Serialized gcp_resources proto tracking the custom training job.
unmanaged_container_model (google.UnmanagedContainerModel): The UnmanagedContainerModel artifact.

google_cloud_pipeline_components.experimental.automl.tabular.TransformOp()

automl_tabular_transform: Transforms raw features into engineered features.

Args:
project (str): Required. Project to run Cross-validation trainer.
location (str): Location for running the Cross-validation trainer. If not set, default to us-central1.
root_dir (str): The Cloud Storage location to store the output.
metadata (TabularExampleGenMetadata): The tabular example gen metadata.
dataset_schema (DatasetSchema): The schema of the dataset.
train_split (Dataset): The train split.
eval_split (Dataset): The eval split.
test_split (Dataset): The test split.
dataflow_machine_type (Optional[str]): The machine type used for dataflow jobs. If not set, default to n1-standard-16.
dataflow_max_num_workers (Optional[int]): The number of workers to run the dataflow job. If not set, default to 25.
dataflow_disk_size_gb (Optional[int]): The disk size, in gigabytes, to use on each Dataflow worker instance. If not set, default to 40.
dataflow_subnetwork (Optional[str]): Dataflow's fully qualified subnetwork name; when empty, the default subnetwork will be used. More details: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications
dataflow_use_public_ips (Optional[bool]): Specifies whether Dataflow workers use public IP addresses.
encryption_spec_key_name (Optional[str]): Customer-managed encryption key.

Returns:

materialized_train_split (MaterializedSplit): The materialized train split.
materialized_eval_split (MaterializedSplit): The materialized eval split.
materialized_test_split (MaterializedSplit): The materialized test split.
training_schema_uri (TrainingSchema): The training schema.
transform_output (TransformOutput): The transform output artifact.
gcp_resources (str): GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.
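
A minimal sketch of wiring StatsAndExampleGenOp into TransformOp inside a KFP pipeline, assuming a KFP v2-compatible DSL. The output keys follow the Returns sections documented above, the quote-escaped JSON strings would typically be produced with utils.input_dictionary_to_parameter, and all concrete values are placeholders.

    from kfp.v2 import dsl
    from google_cloud_pipeline_components.experimental.automl.tabular import (
        StatsAndExampleGenOp,
        TransformOp,
    )

    @dsl.pipeline(name="tabular-preprocessing-sketch")
    def preprocessing(
        project: str,
        location: str,
        root_dir: str,
        target_column_name: str,
        transformations: str,  # quote-escaped JSON string
        split_spec: str,       # quote-escaped JSON string
        data_source: str,      # quote-escaped JSON string
    ):
        # Generate dataset statistics, schema, and the train/eval/test splits.
        stats_op = StatsAndExampleGenOp(
            project=project,
            location=location,
            root_dir=root_dir,
            target_column_name=target_column_name,
            prediction_type="classification",
            transformations=transformations,
            split_spec=split_spec,
            data_source=data_source,
        )
        # Feed the generated splits and metadata into the transform component.
        TransformOp(
            project=project,
            location=location,
            root_dir=root_dir,
            metadata=stats_op.outputs["metadata"],
            dataset_schema=stats_op.outputs["dataset_schema"],
            train_split=stats_op.outputs["train_split"],
            eval_split=stats_op.outputs["eval_split"],
            test_split=stats_op.outputs["test_split"],
        )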

google_cloud_pipeline_components.experimental.automl.tabular.WideAndDeepTrainerOp()

automl_tabular_wide_and_deep_trainer: Launch a Wide & Deep custom training job using the Vertex CustomJob API.

Args:
project (str): Required. The GCP project that runs the pipeline components.
location (str): Required. The GCP region that runs the pipeline components.
root_dir (str): Required. The root GCS directory for the pipeline components.
target_column (str): Required. The target column name.
prediction_type (str): Required. The type of prediction the Model is to produce. "classification" or "regression".
weight_column (Optional[str]): The weight column name.
max_steps (Optional[int]): Number of steps (batches) to run the trainer for.
max_train_secs (Optional[int]): Amount of time in seconds to run the trainer for.
learning_rate (float): The learning rate used by the linear optimizer.
optimizer_type (Optional[str]): The type of optimizer to use. Choices are "adam", "ftrl" and "sgd" for the Adam, FTRL, and Gradient Descent Optimizers, respectively.
l1_regularization_strength (Optional[float]): L1 regularization strength for optimizer_type="ftrl".
l2_regularization_strength (Optional[float]): L2 regularization strength for optimizer_type="ftrl".
l2_shrinkage_regularization_strength (Optional[float]): L2 shrinkage regularization strength for optimizer_type="ftrl".
beta_1 (Optional[float]): Beta 1 value for optimizer_type="adam".
beta_2 (Optional[float]): Beta 2 value for optimizer_type="adam".
hidden_units (Optional[str]): Hidden layer sizes to use for DNN feature columns, provided in comma-separated layers.
use_wide (Optional[bool]): If set to True, the categorical columns will be used in the wide part of the DNN model.
embed_categories (Optional[bool]): If set to True, the categorical columns will be embedded and used in the deep part of the model. Embedding size is the square root of the column cardinality.
dnn_dropout (Optional[float]): The probability we will drop out a given coordinate.
dnn_learning_rate (Optional[float]): The learning rate for training the deep part of the model.
dnn_optimizer_type (Optional[str]): The type of optimizer to use for the deep part of the model. Choices are "adam", "ftrl" and "sgd" for the Adam, FTRL, and Gradient Descent Optimizers, respectively.
dnn_l1_regularization_strength (Optional[float]): L1 regularization strength for dnn_optimizer_type="ftrl".
dnn_l2_regularization_strength (Optional[float]): L2 regularization strength for dnn_optimizer_type="ftrl".
dnn_l2_shrinkage_regularization_strength (Optional[float]): L2 shrinkage regularization strength for dnn_optimizer_type="ftrl".
dnn_beta_1 (Optional[float]): Beta 1 value for dnn_optimizer_type="adam".
dnn_beta_2 (Optional[float]): Beta 2 value for dnn_optimizer_type="adam".
enable_profiler (Optional[bool]): Enables profiling and saves a trace during evaluation.
seed (Optional[int]): Seed to be used for this run.
eval_steps (Optional[int]): Number of steps (batches) to run evaluation for. If not specified, evaluation runs on the whole validation dataset; otherwise this value must be >= 1.
batch_size (Optional[int]): Batch size for training.
eval_frequency_secs (Optional[int]): Frequency at which evaluation and checkpointing will take place.
replica_count (Optional[int]): The replica count.
machine_spec (Optional[Dict[str, Any]]): The machine spec.
instance_baseline (AutoMLTabularInstanceBaseline): The path to a JSON file for baseline values.
metadata (TabularExampleGenMetadata): The tabular example gen metadata.
materialized_train_split (MaterializedSplit): The path to the materialized train split.
materialized_eval_split (MaterializedSplit): The path to the materialized validation split.
materialized_test_split (MaterializedSplit): The path to the materialized test split.
transform_output (TransformOutput): The path to transform output.
training_schema_uri (TrainingSchema): The path to the training schema.
encryption_spec_key_name (Optional[str]): The KMS key name.

Returns:
gcp_resources (str): Serialized gcp_resources proto tracking the custom training job.
unmanaged_container_model (google.UnmanagedContainerModel): The UnmanagedContainerModel artifact.