google_cloud_pipeline_components.experimental.automl.tabular package

Submodules

google_cloud_pipeline_components.experimental.automl.tabular.utils module

Util functions for AutoML Tabular pipeline.

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_automl_tabular_feature_selection_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, optimization_objective: str, transformations: str, train_budget_milli_node_hours: float, stage_1_num_parallel_trials: int | None = None, stage_2_num_parallel_trials: int | None = None, stage_2_num_selected_trials: int | None = None, data_source_csv_filenames: str | None = None, data_source_bigquery_table_path: str | None = None, predefined_split_key: str | None = None, timestamp_split_key: str | None = None, stratified_split_key: str | None = None, training_fraction: float | None = None, validation_fraction: float | None = None, test_fraction: float | None = None, weight_column: str | None = None, study_spec_parameters_override: List[Dict[str, Any]] | None = None, optimization_objective_recall_value: float | None = None, optimization_objective_precision_value: float | None = None, stage_1_tuner_worker_pool_specs_override: Dict[str, Any] | None = None, cv_trainer_worker_pool_specs_override: Dict[str, Any] | None = None, export_additional_model_without_custom_ops: bool = False, stats_and_example_gen_dataflow_machine_type: str | None = None, stats_and_example_gen_dataflow_max_num_workers: int | None = None, stats_and_example_gen_dataflow_disk_size_gb: int | None = None, transform_dataflow_machine_type: str | None = None, transform_dataflow_max_num_workers: int | None = None, transform_dataflow_disk_size_gb: int | None = None, dataflow_subnetwork: str | None = None, dataflow_use_public_ips: bool = True, encryption_spec_key_name: str | None = None, additional_experiments: Dict[str, Any] | None = None, dataflow_service_account: str | None = None, run_evaluation: bool = True, evaluation_batch_predict_machine_type: str | None = None, evaluation_batch_predict_starting_replica_count: int | None = None, evaluation_batch_predict_max_replica_count: int | None = None, evaluation_batch_explain_machine_type: str | None = None, evaluation_batch_explain_starting_replica_count: int | None = None, evaluation_batch_explain_max_replica_count: int | None = None, evaluation_dataflow_machine_type: str | None = None, evaluation_dataflow_starting_num_workers: int | None = None, evaluation_dataflow_max_num_workers: int | None = None, evaluation_dataflow_disk_size_gb: int | None = None, max_selected_features: int = 1000, apply_feature_selection_tuning: bool = False, run_distillation: bool = False, distill_batch_predict_machine_type: str | None = None, distill_batch_predict_starting_replica_count: int | None = None, distill_batch_predict_max_replica_count: int | None = None) Tuple[str, Dict[str, Any]]

Get the AutoML Tabular v1 training pipeline with feature selection.

Args:

project: The GCP project that runs the pipeline components.
location: The GCP region that runs the pipeline components.
root_dir: The root GCS directory for the pipeline components.
target_column: The target column name.
prediction_type: The type of prediction the model is to produce, either “classification” or “regression”.
optimization_objective: For binary classification, “maximize-au-roc”, “minimize-log-loss”, “maximize-au-prc”, “maximize-precision-at-recall”, or “maximize-recall-at-precision”. For multi-class classification, “minimize-log-loss”. For regression, “minimize-rmse”, “minimize-mae”, or “minimize-rmsle”.
transformations: The path to a GCS file containing the transformations to apply.
train_budget_milli_node_hours: The train budget for creating this model, expressed in milli node hours, i.e. a value of 1,000 in this field means 1 node hour.
stage_1_num_parallel_trials: Number of parallel trials for stage 1.
stage_2_num_parallel_trials: Number of parallel trials for stage 2.
stage_2_num_selected_trials: Number of selected trials for stage 2.
data_source_csv_filenames: The CSV data source.
data_source_bigquery_table_path: The BigQuery data source.
predefined_split_key: The predefined_split column name.
timestamp_split_key: The timestamp_split column name.
stratified_split_key: The stratified_split column name.
training_fraction: The training fraction.
validation_fraction: The validation fraction.
test_fraction: The test fraction.
weight_column: The weight column name.
study_spec_parameters_override: The list for overriding the study spec.
optimization_objective_recall_value: Required when optimization_objective is “maximize-precision-at-recall”. Must be between 0 and 1, inclusive.
optimization_objective_precision_value: Required when optimization_objective is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.
stage_1_tuner_worker_pool_specs_override: The dictionary for overriding the stage 1 tuner worker pool spec. The dictionary should be of the format described at https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.
cv_trainer_worker_pool_specs_override: The dictionary for overriding the stage 2 CV trainer worker pool spec. The dictionary should be of the format described at https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.
export_additional_model_without_custom_ops: Whether to export an additional model without custom TensorFlow operators.
stats_and_example_gen_dataflow_machine_type: The Dataflow machine type for the stats_and_example_gen component.
stats_and_example_gen_dataflow_max_num_workers: The max number of Dataflow workers for the stats_and_example_gen component.
stats_and_example_gen_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for the stats_and_example_gen component.
transform_dataflow_machine_type: The Dataflow machine type for the transform component.
transform_dataflow_max_num_workers: The max number of Dataflow workers for the transform component.
transform_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for the transform component.
dataflow_subnetwork: Dataflow’s fully qualified subnetwork name; when empty, the default subnetwork will be used. Example: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications
dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.
encryption_spec_key_name: The KMS key name.
additional_experiments: Use this field to configure private preview features.
dataflow_service_account: Custom service account to run Dataflow jobs.
run_evaluation: Whether to run evaluation in the training pipeline.
evaluation_batch_predict_machine_type: The prediction server machine type for batch predict components during evaluation.
evaluation_batch_predict_starting_replica_count: The initial number of prediction servers for batch predict components during evaluation.
evaluation_batch_predict_max_replica_count: The max number of prediction servers for batch predict components during evaluation.
evaluation_batch_explain_machine_type: The prediction server machine type for batch explain components during evaluation.
evaluation_batch_explain_starting_replica_count: The initial number of prediction servers for batch explain components during evaluation.
evaluation_batch_explain_max_replica_count: The max number of prediction servers for batch explain components during evaluation.
evaluation_dataflow_machine_type: The Dataflow machine type for evaluation components.
evaluation_dataflow_starting_num_workers: The initial number of Dataflow workers for evaluation components.
evaluation_dataflow_max_num_workers: The max number of Dataflow workers for evaluation components.
evaluation_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for evaluation components.
max_selected_features: Number of features to select for training.
apply_feature_selection_tuning: Tune the feature selection rate if true.
run_distillation: Whether to run distillation in the training pipeline.
distill_batch_predict_machine_type: The prediction server machine type for the batch predict component in model distillation.
distill_batch_predict_starting_replica_count: The initial number of prediction servers for the batch predict component in model distillation.
distill_batch_predict_max_replica_count: The max number of prediction servers for the batch predict component in model distillation.

Returns:

Tuple of pipeline_definition_path and parameter_values.
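
A minimal, hedged sketch of calling this helper and submitting the resulting template with the Vertex AI SDK. The project, bucket, table, and file paths below are placeholders, and the submission step assumes the google-cloud-aiplatform package is installed; adjust both to your environment.

    from google.cloud import aiplatform
    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    # Placeholder identifiers -- replace with real resources.
    PROJECT = "my-project"                           # hypothetical project ID
    REGION = "us-central1"
    PIPELINE_ROOT = "gs://my-bucket/pipeline_root"   # hypothetical GCS bucket

    template_path, parameter_values = (
        utils.get_automl_tabular_feature_selection_pipeline_and_parameters(
            project=PROJECT,
            location=REGION,
            root_dir=PIPELINE_ROOT,
            target_column="label",
            prediction_type="classification",
            optimization_objective="maximize-au-roc",
            transformations="gs://my-bucket/transformations.json",  # hypothetical path
            train_budget_milli_node_hours=1000,  # 1 node hour
            data_source_bigquery_table_path="bq://my-project.my_dataset.my_table",
            max_selected_features=500,
        )
    )

    # One way to run the compiled template (assumes google-cloud-aiplatform).
    aiplatform.init(project=PROJECT, location=REGION)
    job = aiplatform.PipelineJob(
        display_name="automl-tabular-feature-selection",
        template_path=template_path,
        pipeline_root=PIPELINE_ROOT,
        parameter_values=parameter_values,
    )
    job.run()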

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_automl_tabular_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, optimization_objective: str, transformations: str, train_budget_milli_node_hours: float, stage_1_num_parallel_trials: int | None = None, stage_2_num_parallel_trials: int | None = None, stage_2_num_selected_trials: int | None = None, data_source_csv_filenames: str | None = None, data_source_bigquery_table_path: str | None = None, predefined_split_key: str | None = None, timestamp_split_key: str | None = None, stratified_split_key: str | None = None, training_fraction: float | None = None, validation_fraction: float | None = None, test_fraction: float | None = None, weight_column: str | None = None, study_spec_parameters_override: List[Dict[str, Any]] | None = None, optimization_objective_recall_value: float | None = None, optimization_objective_precision_value: float | None = None, stage_1_tuner_worker_pool_specs_override: Dict[str, Any] | None = None, cv_trainer_worker_pool_specs_override: Dict[str, Any] | None = None, export_additional_model_without_custom_ops: bool = False, stats_and_example_gen_dataflow_machine_type: str | None = None, stats_and_example_gen_dataflow_max_num_workers: int | None = None, stats_and_example_gen_dataflow_disk_size_gb: int | None = None, transform_dataflow_machine_type: str | None = None, transform_dataflow_max_num_workers: int | None = None, transform_dataflow_disk_size_gb: int | None = None, dataflow_subnetwork: str | None = None, dataflow_use_public_ips: bool = True, encryption_spec_key_name: str | None = None, additional_experiments: Dict[str, Any] | None = None, dataflow_service_account: str | None = None, run_evaluation: bool = True, evaluation_batch_predict_machine_type: str | None = None, evaluation_batch_predict_starting_replica_count: int | None = None, evaluation_batch_predict_max_replica_count: int | None = None, evaluation_batch_explain_machine_type: str | None = None, evaluation_batch_explain_starting_replica_count: int | None = None, evaluation_batch_explain_max_replica_count: int | None = None, evaluation_dataflow_machine_type: str | None = None, evaluation_dataflow_starting_num_workers: int | None = None, evaluation_dataflow_max_num_workers: int | None = None, evaluation_dataflow_disk_size_gb: int | None = None, run_distillation: bool = False, distill_batch_predict_machine_type: str | None = None, distill_batch_predict_starting_replica_count: int | None = None, distill_batch_predict_max_replica_count: int | None = None, stage_1_tuning_result_artifact_uri: str | None = None, quantiles: List[float] | None = None, enable_probabilistic_inference: bool = False) Tuple[str, Dict[str, Any]]

Get the AutoML Tabular v1 default training pipeline.

Args:

project: The GCP project that runs the pipeline components.
location: The GCP region that runs the pipeline components.
root_dir: The root GCS directory for the pipeline components.
target_column: The target column name.
prediction_type: The type of prediction the model is to produce, either “classification” or “regression”.
optimization_objective: For binary classification, “maximize-au-roc”, “minimize-log-loss”, “maximize-au-prc”, “maximize-precision-at-recall”, or “maximize-recall-at-precision”. For multi-class classification, “minimize-log-loss”. For regression, “minimize-rmse”, “minimize-mae”, or “minimize-rmsle”.
transformations: The path to a GCS file containing the transformations to apply.
train_budget_milli_node_hours: The train budget for creating this model, expressed in milli node hours, i.e. a value of 1,000 in this field means 1 node hour.
stage_1_num_parallel_trials: Number of parallel trials for stage 1.
stage_2_num_parallel_trials: Number of parallel trials for stage 2.
stage_2_num_selected_trials: Number of selected trials for stage 2.
data_source_csv_filenames: The CSV data source.
data_source_bigquery_table_path: The BigQuery data source.
predefined_split_key: The predefined_split column name.
timestamp_split_key: The timestamp_split column name.
stratified_split_key: The stratified_split column name.
training_fraction: The training fraction.
validation_fraction: The validation fraction.
test_fraction: The test fraction.
weight_column: The weight column name.
study_spec_parameters_override: The list for overriding the study spec.
optimization_objective_recall_value: Required when optimization_objective is “maximize-precision-at-recall”. Must be between 0 and 1, inclusive.
optimization_objective_precision_value: Required when optimization_objective is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.
stage_1_tuner_worker_pool_specs_override: The dictionary for overriding the stage 1 tuner worker pool spec. The dictionary should be of the format described at https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.
cv_trainer_worker_pool_specs_override: The dictionary for overriding the stage 2 CV trainer worker pool spec. The dictionary should be of the format described at https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.
export_additional_model_without_custom_ops: Whether to export an additional model without custom TensorFlow operators.
stats_and_example_gen_dataflow_machine_type: The Dataflow machine type for the stats_and_example_gen component.
stats_and_example_gen_dataflow_max_num_workers: The max number of Dataflow workers for the stats_and_example_gen component.
stats_and_example_gen_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for the stats_and_example_gen component.
transform_dataflow_machine_type: The Dataflow machine type for the transform component.
transform_dataflow_max_num_workers: The max number of Dataflow workers for the transform component.
transform_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for the transform component.
dataflow_subnetwork: Dataflow’s fully qualified subnetwork name; when empty, the default subnetwork will be used. Example: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications
dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.
encryption_spec_key_name: The KMS key name.
additional_experiments: Use this field to configure private preview features.
dataflow_service_account: Custom service account to run Dataflow jobs.
run_evaluation: Whether to run evaluation in the training pipeline.
evaluation_batch_predict_machine_type: The prediction server machine type for batch predict components during evaluation.
evaluation_batch_predict_starting_replica_count: The initial number of prediction servers for batch predict components during evaluation.
evaluation_batch_predict_max_replica_count: The max number of prediction servers for batch predict components during evaluation.
evaluation_batch_explain_machine_type: The prediction server machine type for batch explain components during evaluation.
evaluation_batch_explain_starting_replica_count: The initial number of prediction servers for batch explain components during evaluation.
evaluation_batch_explain_max_replica_count: The max number of prediction servers for batch explain components during evaluation.
evaluation_dataflow_machine_type: The Dataflow machine type for evaluation components.
evaluation_dataflow_starting_num_workers: The initial number of Dataflow workers for evaluation components.
evaluation_dataflow_max_num_workers: The max number of Dataflow workers for evaluation components.
evaluation_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for evaluation components.
run_distillation: Whether to run distillation in the training pipeline.
distill_batch_predict_machine_type: The prediction server machine type for the batch predict component in model distillation.
distill_batch_predict_starting_replica_count: The initial number of prediction servers for the batch predict component in model distillation.
distill_batch_predict_max_replica_count: The max number of prediction servers for the batch predict component in model distillation.
stage_1_tuning_result_artifact_uri: The stage 1 tuning result artifact GCS URI.
quantiles: Quantiles to use for probabilistic inference. Up to 5 quantiles are allowed, with values between 0 and 1, exclusive. Represents the quantiles to use for that objective. Quantiles must be unique.
enable_probabilistic_inference: If probabilistic inference is enabled, the model will fit a distribution that captures the uncertainty of a prediction. At inference time, the predictive distribution is used to make a point prediction that minimizes the optimization objective. For example, the mean of a predictive distribution is the point prediction that minimizes RMSE loss. If quantiles are specified, then the quantiles of the distribution are also returned.

Returns:

Tuple of pipeline_definition_path and parameter_values.
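
The quantile and probabilistic-inference options apply to regression objectives. Below is a hedged sketch, with placeholder project, bucket, and data paths, of requesting a probabilistic model that also reports three quantiles:

    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    template_path, parameter_values = utils.get_automl_tabular_pipeline_and_parameters(
        project="my-project",                      # hypothetical project ID
        location="us-central1",
        root_dir="gs://my-bucket/pipeline_root",   # hypothetical GCS bucket
        target_column="trip_duration",
        prediction_type="regression",
        optimization_objective="minimize-rmse",
        transformations="gs://my-bucket/transformations.json",  # hypothetical path
        train_budget_milli_node_hours=2000,        # 2 node hours
        data_source_csv_filenames="gs://my-bucket/train.csv",   # hypothetical CSV
        training_fraction=0.8,
        validation_fraction=0.1,
        test_fraction=0.1,
        # Fit a predictive distribution and report its 10th/50th/90th percentiles.
        enable_probabilistic_inference=True,
        quantiles=[0.1, 0.5, 0.9],
    )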

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_builtin_algorithm_hyperparameter_tuning_job_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, study_spec_metric_id: str, study_spec_metric_goal: str, study_spec_parameters_override: List[Dict[str, Any]], max_trial_count: int, parallel_trial_count: int, algorithm: str, enable_profiler: bool = False, seed: int = 1, eval_steps: int = 0, eval_frequency_secs: int = 600, transform_config: str | None = None, dataset_level_custom_transformation_definitions: List[Dict[str, Any]] | None = None, dataset_level_transformations: List[Dict[str, Any]] | None = None, predefined_split_key: str | None = None, stratified_split_key: str | None = None, training_fraction: float | None = None, validation_fraction: float | None = None, test_fraction: float | None = None, tf_auto_transform_features: List[str] | None = None, tf_custom_transformation_definitions: List[Dict[str, Any]] | None = None, tf_transformations_path: str | None = None, data_source_csv_filenames: str | None = None, data_source_bigquery_table_path: str | None = None, bigquery_staging_full_dataset_id: str | None = None, weight_column: str = '', max_failed_trial_count: int = 0, study_spec_algorithm: str = 'ALGORITHM_UNSPECIFIED', study_spec_measurement_selection_type: str = 'BEST_MEASUREMENT', transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, worker_pool_specs_override: Dict[str, Any] | None = None, run_evaluation: bool = True, evaluation_batch_predict_machine_type: str = 'n1-highmem-8', evaluation_batch_predict_starting_replica_count: int = 20, evaluation_batch_predict_max_replica_count: int = 20, evaluation_dataflow_machine_type: str = 'n1-standard-4', evaluation_dataflow_starting_num_workers: int = 10, evaluation_dataflow_max_num_workers: int = 100, evaluation_dataflow_disk_size_gb: int = 50, dataflow_service_account: str = '', dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '') Tuple[str, Dict[str, Any]]

Get the built-in algorithm HyperparameterTuningJob pipeline.

Args:

project: The GCP project that runs the pipeline components.
location: The GCP region that runs the pipeline components.
root_dir: The root GCS directory for the pipeline components.
target_column: The target column name.
prediction_type: The type of prediction the model is to produce, either “classification” or “regression”.
study_spec_metric_id: Metric to optimize. Possible values: ‘loss’, ‘average_loss’, ‘rmse’, ‘mae’, ‘mql’, ‘accuracy’, ‘auc’, ‘precision’, ‘recall’.
study_spec_metric_goal: Optimization goal of the metric. Possible values: “MAXIMIZE”, “MINIMIZE”.
study_spec_parameters_override: List of dictionaries representing parameters to optimize. The dictionary key is the parameter_id, which is passed to the training job as a command-line argument, and the dictionary value is the parameter specification of the metric.
max_trial_count: The desired total number of trials.
parallel_trial_count: The desired number of trials to run in parallel.
algorithm: Algorithm to train. One of “tabnet” and “wide_and_deep”.
enable_profiler: Enables profiling and saves a trace during evaluation.
seed: Seed to be used for this run.
eval_steps: Number of steps to run evaluation for. If not specified or negative, it means run evaluation on the whole validation dataset. If set to 0, it means run evaluation for a fixed number of samples.
eval_frequency_secs: Frequency at which evaluation and checkpointing will take place.
transform_config: Path to the v1 TF transformation configuration.
dataset_level_custom_transformation_definitions: Dataset-level custom transformation definitions in string format.
dataset_level_transformations: Dataset-level transformation configuration in string format.
predefined_split_key: Predefined split key.
stratified_split_key: Stratified split key.
training_fraction: Training fraction.
validation_fraction: Validation fraction.
test_fraction: Test fraction.
tf_auto_transform_features: List of auto transform features in the comma-separated string format.
tf_custom_transformation_definitions: TF custom transformation definitions in string format.
tf_transformations_path: Path to the TF transformation configuration.
data_source_csv_filenames: The CSV data source.
data_source_bigquery_table_path: The BigQuery data source.
bigquery_staging_full_dataset_id: The BigQuery staging full dataset id for storing intermediate tables.
weight_column: The weight column name.
max_failed_trial_count: The number of failed trials that need to be seen before failing the HyperparameterTuningJob. If set to 0, Vertex AI decides how many trials must fail before the whole job fails.
study_spec_algorithm: The search algorithm specified for the study. One of “ALGORITHM_UNSPECIFIED”, “GRID_SEARCH”, or “RANDOM_SEARCH”.
study_spec_measurement_selection_type: Which measurement to use if/when the service automatically selects the final measurement from previously reported intermediate measurements. One of “BEST_MEASUREMENT” or “LAST_MEASUREMENT”.
transform_dataflow_machine_type: The Dataflow machine type for the transform component.
transform_dataflow_max_num_workers: The max number of Dataflow workers for the transform component.
transform_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for the transform component.
worker_pool_specs_override: The dictionary for overriding training and evaluation worker pool specs. The dictionary should be of the format described at https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.
run_evaluation: Whether to run evaluation steps during training.
evaluation_batch_predict_machine_type: The prediction server machine type for batch predict components during evaluation.
evaluation_batch_predict_starting_replica_count: The initial number of prediction servers for batch predict components during evaluation.
evaluation_batch_predict_max_replica_count: The max number of prediction servers for batch predict components during evaluation.
evaluation_dataflow_machine_type: The Dataflow machine type for evaluation components.
evaluation_dataflow_starting_num_workers: The initial number of Dataflow workers for evaluation components.
evaluation_dataflow_max_num_workers: The max number of Dataflow workers for evaluation components.
evaluation_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for evaluation components.
dataflow_service_account: Custom service account to run Dataflow jobs.
dataflow_subnetwork: Dataflow’s fully qualified subnetwork name; when empty, the default subnetwork will be used. Example: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications
dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.
encryption_spec_key_name: The KMS key name.

Returns:

Tuple of pipeline_definition_path and parameter_values.
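
A hedged sketch of building the Wide & Deep variant of this pipeline. The single study_spec_parameters_override entry below is illustrative only; it is shaped like a Vertex AI StudySpec parameter specification, but the parameter names your training job accepts, and the exact entry shape, should be verified against the algorithm documentation for your version.

    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    # Illustrative search space: tune the learning rate on a log scale.
    # (Hypothetical entry; verify the expected parameter-spec shape for your version.)
    learning_rate_spec = {
        "parameter_id": "learning_rate",
        "double_value_spec": {"min_value": 1e-4, "max_value": 1e-1},
        "scale_type": "UNIT_LOG_SCALE",
    }

    template_path, parameter_values = (
        utils.get_builtin_algorithm_hyperparameter_tuning_job_pipeline_and_parameters(
            project="my-project",                    # hypothetical project ID
            location="us-central1",
            root_dir="gs://my-bucket/pipeline_root", # hypothetical GCS bucket
            target_column="label",
            prediction_type="classification",
            study_spec_metric_id="auc",
            study_spec_metric_goal="MAXIMIZE",
            study_spec_parameters_override=[learning_rate_spec],
            max_trial_count=20,
            parallel_trial_count=5,
            algorithm="wide_and_deep",
            data_source_bigquery_table_path="bq://my-project.my_dataset.my_table",
            training_fraction=0.8,
            validation_fraction=0.1,
            test_fraction=0.1,
            tf_auto_transform_features=["feature_a", "feature_b"],  # hypothetical columns
        )
    )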

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_default_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column_name: str, prediction_type: str, optimization_objective: str, transformations: Dict[str, Any], split_spec: Dict[str, Any], data_source: Dict[str, Any], train_budget_milli_node_hours: float, stage_1_num_parallel_trials: int = 35, stage_2_num_parallel_trials: int = 35, stage_2_num_selected_trials: int = 5, weight_column_name: str = '', study_spec_override: Dict[str, Any] | None = None, optimization_objective_recall_value: float = -1, optimization_objective_precision_value: float = -1, stage_1_tuner_worker_pool_specs_override: Dict[str, Any] | None = None, cv_trainer_worker_pool_specs_override: Dict[str, Any] | None = None, export_additional_model_without_custom_ops: bool = False, stats_and_example_gen_dataflow_machine_type: str = 'n1-standard-16', stats_and_example_gen_dataflow_max_num_workers: int = 25, stats_and_example_gen_dataflow_disk_size_gb: int = 40, transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '', additional_experiments: Dict[str, Any] | None = None, dataflow_service_account: str = '', run_evaluation: bool = True, evaluation_batch_predict_machine_type: str = 'n1-highmem-8', evaluation_batch_predict_starting_replica_count: int = 20, evaluation_batch_predict_max_replica_count: int = 20, evaluation_dataflow_machine_type: str = 'n1-standard-4', evaluation_dataflow_max_num_workers: int = 100, evaluation_dataflow_disk_size_gb: int = 50, run_distillation: bool = False, distill_batch_predict_machine_type: str = 'n1-standard-16', distill_batch_predict_starting_replica_count: int = 25, distill_batch_predict_max_replica_count: int = 25) Tuple[str, Dict[str, Any]]

Get the AutoML Tabular default training pipeline.

Args:

project: The GCP project that runs the pipeline components.
location: The GCP region that runs the pipeline components.
root_dir: The root GCS directory for the pipeline components.
target_column_name: The target column name.
prediction_type: The type of prediction the model is to produce, either “classification” or “regression”.
optimization_objective: For binary classification, “maximize-au-roc”, “minimize-log-loss”, “maximize-au-prc”, “maximize-precision-at-recall”, or “maximize-recall-at-precision”. For multi-class classification, “minimize-log-loss”. For regression, “minimize-rmse”, “minimize-mae”, or “minimize-rmsle”.
transformations: The transformations to apply.
split_spec: The split spec.
data_source: The data source.
train_budget_milli_node_hours: The train budget for creating this model, expressed in milli node hours, i.e. a value of 1,000 in this field means 1 node hour.
stage_1_num_parallel_trials: Number of parallel trials for stage 1.
stage_2_num_parallel_trials: Number of parallel trials for stage 2.
stage_2_num_selected_trials: Number of selected trials for stage 2.
weight_column_name: The weight column name.
study_spec_override: The dictionary for overriding the study spec.
optimization_objective_recall_value: Required when optimization_objective is “maximize-precision-at-recall”. Must be between 0 and 1, inclusive.
optimization_objective_precision_value: Required when optimization_objective is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.
stage_1_tuner_worker_pool_specs_override: The dictionary for overriding the stage 1 tuner worker pool spec. The dictionary should be of the format described at https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.
cv_trainer_worker_pool_specs_override: The dictionary for overriding the stage 2 CV trainer worker pool spec. The dictionary should be of the format described at https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.
export_additional_model_without_custom_ops: Whether to export an additional model without custom TensorFlow operators.
stats_and_example_gen_dataflow_machine_type: The Dataflow machine type for the stats_and_example_gen component.
stats_and_example_gen_dataflow_max_num_workers: The max number of Dataflow workers for the stats_and_example_gen component.
stats_and_example_gen_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for the stats_and_example_gen component.
transform_dataflow_machine_type: The Dataflow machine type for the transform component.
transform_dataflow_max_num_workers: The max number of Dataflow workers for the transform component.
transform_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for the transform component.
dataflow_subnetwork: Dataflow’s fully qualified subnetwork name; when empty, the default subnetwork will be used. Example: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications
dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.
encryption_spec_key_name: The KMS key name.
additional_experiments: Use this field to configure private preview features.
dataflow_service_account: Custom service account to run Dataflow jobs.
run_evaluation: Whether to run evaluation in the training pipeline.
evaluation_batch_predict_machine_type: The prediction server machine type for batch predict components during evaluation.
evaluation_batch_predict_starting_replica_count: The initial number of prediction servers for batch predict components during evaluation.
evaluation_batch_predict_max_replica_count: The max number of prediction servers for batch predict components during evaluation.
evaluation_dataflow_machine_type: The Dataflow machine type for evaluation components.
evaluation_dataflow_max_num_workers: The max number of Dataflow workers for evaluation components.
evaluation_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for evaluation components.
run_distillation: Whether to run distillation in the training pipeline.
distill_batch_predict_machine_type: The prediction server machine type for the batch predict component in model distillation.
distill_batch_predict_starting_replica_count: The initial number of prediction servers for the batch predict component in model distillation.
distill_batch_predict_max_replica_count: The max number of prediction servers for the batch predict component in model distillation.

Returns:

Tuple of pipeline_definition_path and parameter_values.
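
This older entry point takes the transformations, split spec, and data source as dictionaries rather than GCS paths. Their exact schemas are not reproduced in this reference, so the hedged sketch below simply loads them from JSON files you have prepared (hypothetical paths) and passes them through:

    import json
    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    # Load pre-built specs from files; the schemas are defined by the pipeline,
    # not by this sketch (file paths are hypothetical).
    with open("transformations.json") as f:
        transformations = json.load(f)
    with open("split_spec.json") as f:
        split_spec = json.load(f)
    with open("data_source.json") as f:
        data_source = json.load(f)

    template_path, parameter_values = utils.get_default_pipeline_and_parameters(
        project="my-project",                    # hypothetical project ID
        location="us-central1",
        root_dir="gs://my-bucket/pipeline_root", # hypothetical GCS bucket
        target_column_name="label",
        prediction_type="classification",
        optimization_objective="maximize-au-prc",
        transformations=transformations,
        split_spec=split_spec,
        data_source=data_source,
        train_budget_milli_node_hours=1000,
    )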

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_distill_skip_evaluation_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column_name: str, prediction_type: str, optimization_objective: str, transformations: Dict[str, Any], split_spec: Dict[str, Any], data_source: Dict[str, Any], train_budget_milli_node_hours: float, stage_1_num_parallel_trials: int = 35, stage_2_num_parallel_trials: int = 35, stage_2_num_selected_trials: int = 5, weight_column_name: str = '', study_spec_override: Dict[str, Any] | None = None, optimization_objective_recall_value: float = -1, optimization_objective_precision_value: float = -1, stage_1_tuner_worker_pool_specs_override: Dict[str, Any] | None = None, cv_trainer_worker_pool_specs_override: Dict[str, Any] | None = None, export_additional_model_without_custom_ops: bool = False, stats_and_example_gen_dataflow_machine_type: str = 'n1-standard-16', stats_and_example_gen_dataflow_max_num_workers: int = 25, stats_and_example_gen_dataflow_disk_size_gb: int = 40, transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '', additional_experiments: Dict[str, Any] | None = None, distill_batch_predict_machine_type: str = 'n1-standard-16', distill_batch_predict_starting_replica_count: int = 25, distill_batch_predict_max_replica_count: int = 25) Tuple[str, Dict[str, Any]]

Get the AutoML Tabular training pipeline that distills and skips evaluation.

Args:

project: The GCP project that runs the pipeline components.
location: The GCP region that runs the pipeline components.
root_dir: The root GCS directory for the pipeline components.
target_column_name: The target column name.
prediction_type: The type of prediction the model is to produce, either “classification” or “regression”.
optimization_objective: For binary classification, “maximize-au-roc”, “minimize-log-loss”, “maximize-au-prc”, “maximize-precision-at-recall”, or “maximize-recall-at-precision”. For multi-class classification, “minimize-log-loss”. For regression, “minimize-rmse”, “minimize-mae”, or “minimize-rmsle”.
transformations: The transformations to apply.
split_spec: The split spec.
data_source: The data source.
train_budget_milli_node_hours: The train budget for creating this model, expressed in milli node hours, i.e. a value of 1,000 in this field means 1 node hour.
stage_1_num_parallel_trials: Number of parallel trials for stage 1.
stage_2_num_parallel_trials: Number of parallel trials for stage 2.
stage_2_num_selected_trials: Number of selected trials for stage 2.
weight_column_name: The weight column name.
study_spec_override: The dictionary for overriding the study spec.
optimization_objective_recall_value: Required when optimization_objective is “maximize-precision-at-recall”. Must be between 0 and 1, inclusive.
optimization_objective_precision_value: Required when optimization_objective is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.
stage_1_tuner_worker_pool_specs_override: The dictionary for overriding the stage 1 tuner worker pool spec. The dictionary should be of the format described at https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.
cv_trainer_worker_pool_specs_override: The dictionary for overriding the stage 2 CV trainer worker pool spec. The dictionary should be of the format described at https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.
export_additional_model_without_custom_ops: Whether to export an additional model without custom TensorFlow operators.
stats_and_example_gen_dataflow_machine_type: The Dataflow machine type for the stats_and_example_gen component.
stats_and_example_gen_dataflow_max_num_workers: The max number of Dataflow workers for the stats_and_example_gen component.
stats_and_example_gen_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for the stats_and_example_gen component.
transform_dataflow_machine_type: The Dataflow machine type for the transform component.
transform_dataflow_max_num_workers: The max number of Dataflow workers for the transform component.
transform_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for the transform component.
dataflow_subnetwork: Dataflow’s fully qualified subnetwork name; when empty, the default subnetwork will be used. Example: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications
dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.
encryption_spec_key_name: The KMS key name.
additional_experiments: Use this field to configure private preview features.
distill_batch_predict_machine_type: The prediction server machine type for the batch predict component in model distillation.
distill_batch_predict_starting_replica_count: The initial number of prediction servers for the batch predict component in model distillation.
distill_batch_predict_max_replica_count: The max number of prediction servers for the batch predict component in model distillation.

Returns:

Tuple of pipeline_definition_path and parameter_values.

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_feature_selection_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, algorithm: str, prediction_type: str, data_source_csv_filenames: str | None = None, data_source_bigquery_table_path: str | None = None, max_selected_features: int | None = None, dataflow_machine_type: str = 'n1-standard-16', dataflow_max_num_workers: int = 25, dataflow_disk_size_gb: int = 40, dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, dataflow_service_account: str = '')

Get the feature selection pipeline that generates feature ranking and selected features.

Args:

project: The GCP project that runs the pipeline components.
location: The GCP region that runs the pipeline components.
root_dir: The root GCS directory for the pipeline components.
target_column: The target column name.
algorithm: The algorithm used to select features; defaults to AMI.
prediction_type: The type of prediction the model is to produce, either “classification” or “regression”.
data_source_csv_filenames: A string representing a list of comma-separated CSV filenames.
data_source_bigquery_table_path: The BigQuery table path.
max_selected_features: Number of features to be selected.
dataflow_machine_type: The Dataflow machine type for the feature_selection component.
dataflow_max_num_workers: The max number of Dataflow workers for the feature_selection component.
dataflow_disk_size_gb: Dataflow worker’s disk size in GB for the feature_selection component.
dataflow_subnetwork: Dataflow’s fully qualified subnetwork name; when empty, the default subnetwork will be used. Example: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications
dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.
dataflow_service_account: Custom service account to run Dataflow jobs.

Returns:

Tuple of pipeline_definition_path and parameter_values.
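
A short, hedged sketch of generating the standalone feature-ranking pipeline; the project, bucket, and table names are placeholders.

    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    template_path, parameter_values = utils.get_feature_selection_pipeline_and_parameters(
        project="my-project",                    # hypothetical project ID
        location="us-central1",
        root_dir="gs://my-bucket/pipeline_root", # hypothetical GCS bucket
        target_column="label",
        algorithm="AMI",                         # the default named above; check accepted values
        prediction_type="classification",
        data_source_bigquery_table_path="bq://my-project.my_dataset.my_table",
        max_selected_features=100,
    )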

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_model_comparison_pipeline_and_parameters(project: str, location: str, root_dir: str, prediction_type: str, training_jobs: Dict[str, Dict[str, Any]], data_source_csv_filenames: str = '', data_source_bigquery_table_path: str = '', evaluation_data_source_csv_filenames: str = '', evaluation_data_source_bigquery_table_path: str = '', experiment: str = '', service_account: str = '', network: str = '') Tuple[str, Dict[str, Any]]

Returns a compiled model comparison pipeline and formatted parameters.

Args:

project: The GCP project that runs the pipeline components.
location: The GCP region that runs the pipeline components.
root_dir: The root GCS directory for the pipeline components.
prediction_type: The type of problem being solved. Can be one of: regression, classification, or forecasting.
training_jobs: A dict mapping name to a dict of training job inputs.
data_source_csv_filenames: Comma-separated paths to CSVs stored in GCS to use as the training dataset for all training pipelines. This should be None if data_source_bigquery_table_path is not None. This should only contain data from the training and validation split and not from the test split.
data_source_bigquery_table_path: Path to the BigQuery table to use as the training dataset for all training pipelines. This should be None if data_source_csv_filenames is not None. This should only contain data from the training and validation split and not from the test split.
evaluation_data_source_csv_filenames: Comma-separated paths to CSVs stored in GCS to use as the evaluation dataset for all training pipelines. This should be None if evaluation_data_source_bigquery_table_path is not None. This should only contain data from the test split and not from the training and validation split.
evaluation_data_source_bigquery_table_path: Path to the BigQuery table to use as the evaluation dataset for all training pipelines. This should be None if evaluation_data_source_csv_filenames is not None. This should only contain data from the test split and not from the training and validation split.
experiment: Vertex Experiment to add training pipeline runs to. A new Experiment will be created if none is provided.
service_account: Specifies the service account for the sub-pipeline jobs.
network: The full name of the Compute Engine network to which the sub-pipeline jobs should be peered.

Returns:

Tuple of pipeline_definition_path and parameter_values.
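
A hedged sketch of comparing two training configurations on the same BigQuery sources. The structure of each training_jobs entry is only loosely specified above, so the inner dictionaries below are placeholders whose keys must match the inputs of the training pipelines you intend to launch.

    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    # Placeholder training-job inputs: the keys inside each dict must match the
    # inputs of the corresponding training pipeline (values are hypothetical).
    training_jobs = {
        "automl-baseline": {"optimization_objective": "maximize-au-roc"},
        "automl-logloss": {"optimization_objective": "minimize-log-loss"},
    }

    template_path, parameter_values = utils.get_model_comparison_pipeline_and_parameters(
        project="my-project",                    # hypothetical project ID
        location="us-central1",
        root_dir="gs://my-bucket/pipeline_root", # hypothetical GCS bucket
        prediction_type="classification",
        training_jobs=training_jobs,
        # Train/validation data only; the test split goes to the evaluation source.
        data_source_bigquery_table_path="bq://my-project.my_dataset.train_val",
        evaluation_data_source_bigquery_table_path="bq://my-project.my_dataset.test",
        experiment="model-comparison-demo",      # hypothetical Vertex Experiment name
    )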

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_skip_architecture_search_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, optimization_objective: str, transformations: str, train_budget_milli_node_hours: float, stage_1_tuning_result_artifact_uri: str, stage_2_num_parallel_trials: int | None = None, stage_2_num_selected_trials: int | None = None, data_source_csv_filenames: str | None = None, data_source_bigquery_table_path: str | None = None, predefined_split_key: str | None = None, timestamp_split_key: str | None = None, stratified_split_key: str | None = None, training_fraction: float | None = None, validation_fraction: float | None = None, test_fraction: float | None = None, weight_column: str | None = None, optimization_objective_recall_value: float | None = None, optimization_objective_precision_value: float | None = None, cv_trainer_worker_pool_specs_override: Dict[str, Any] | None = None, export_additional_model_without_custom_ops: bool = False, stats_and_example_gen_dataflow_machine_type: str | None = None, stats_and_example_gen_dataflow_max_num_workers: int | None = None, stats_and_example_gen_dataflow_disk_size_gb: int | None = None, transform_dataflow_machine_type: str | None = None, transform_dataflow_max_num_workers: int | None = None, transform_dataflow_disk_size_gb: int | None = None, dataflow_subnetwork: str | None = None, dataflow_use_public_ips: bool = True, encryption_spec_key_name: str | None = None, additional_experiments: Dict[str, Any] | None = None, dataflow_service_account: str | None = None, run_evaluation: bool = True, evaluation_batch_predict_machine_type: str | None = None, evaluation_batch_predict_starting_replica_count: int | None = None, evaluation_batch_predict_max_replica_count: int | None = None, evaluation_batch_explain_machine_type: str | None = None, evaluation_batch_explain_starting_replica_count: int | None = None, evaluation_batch_explain_max_replica_count: int | None = None, evaluation_dataflow_machine_type: str | None = None, evaluation_dataflow_starting_num_workers: int | None = None, evaluation_dataflow_max_num_workers: int | None = None, evaluation_dataflow_disk_size_gb: int | None = None) Tuple[str, Dict[str, Any]]

Get the AutoML Tabular training pipeline that skips architecture search.

Args:

project: The GCP project that runs the pipeline components.
location: The GCP region that runs the pipeline components.
root_dir: The root GCS directory for the pipeline components.
target_column: The target column name.
prediction_type: The type of prediction the model is to produce, either “classification” or “regression”.
optimization_objective: For binary classification, “maximize-au-roc”, “minimize-log-loss”, “maximize-au-prc”, “maximize-precision-at-recall”, or “maximize-recall-at-precision”. For multi-class classification, “minimize-log-loss”. For regression, “minimize-rmse”, “minimize-mae”, or “minimize-rmsle”.
transformations: The transformations to apply.
train_budget_milli_node_hours: The train budget for creating this model, expressed in milli node hours, i.e. a value of 1,000 in this field means 1 node hour.
stage_1_tuning_result_artifact_uri: The stage 1 tuning result artifact GCS URI.
stage_2_num_parallel_trials: Number of parallel trials for stage 2.
stage_2_num_selected_trials: Number of selected trials for stage 2.
data_source_csv_filenames: The CSV data source.
data_source_bigquery_table_path: The BigQuery data source.
predefined_split_key: The predefined_split column name.
timestamp_split_key: The timestamp_split column name.
stratified_split_key: The stratified_split column name.
training_fraction: The training fraction.
validation_fraction: The validation fraction.
test_fraction: The test fraction.
weight_column: The weight column name.
optimization_objective_recall_value: Required when optimization_objective is “maximize-precision-at-recall”. Must be between 0 and 1, inclusive.
optimization_objective_precision_value: Required when optimization_objective is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.
cv_trainer_worker_pool_specs_override: The dictionary for overriding the stage 2 CV trainer worker pool spec. The dictionary should be of the format described at https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.
export_additional_model_without_custom_ops: Whether to export an additional model without custom TensorFlow operators.
stats_and_example_gen_dataflow_machine_type: The Dataflow machine type for the stats_and_example_gen component.
stats_and_example_gen_dataflow_max_num_workers: The max number of Dataflow workers for the stats_and_example_gen component.
stats_and_example_gen_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for the stats_and_example_gen component.
transform_dataflow_machine_type: The Dataflow machine type for the transform component.
transform_dataflow_max_num_workers: The max number of Dataflow workers for the transform component.
transform_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for the transform component.
dataflow_subnetwork: Dataflow’s fully qualified subnetwork name; when empty, the default subnetwork will be used. Example: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications
dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.
encryption_spec_key_name: The KMS key name.
additional_experiments: Use this field to configure private preview features.
dataflow_service_account: Custom service account to run Dataflow jobs.
run_evaluation: Whether to run evaluation in the training pipeline.
evaluation_batch_predict_machine_type: The prediction server machine type for batch predict components during evaluation.
evaluation_batch_predict_starting_replica_count: The initial number of prediction servers for batch predict components during evaluation.
evaluation_batch_predict_max_replica_count: The max number of prediction servers for batch predict components during evaluation.
evaluation_batch_explain_machine_type: The prediction server machine type for batch explain components during evaluation.
evaluation_batch_explain_starting_replica_count: The initial number of prediction servers for batch explain components during evaluation.
evaluation_batch_explain_max_replica_count: The max number of prediction servers for batch explain components during evaluation.
evaluation_dataflow_machine_type: The Dataflow machine type for evaluation components.
evaluation_dataflow_starting_num_workers: The initial number of Dataflow workers for evaluation components.
evaluation_dataflow_max_num_workers: The max number of Dataflow workers for evaluation components.
evaluation_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for evaluation components.

Returns:

Tuple of pipeline_definition_path and parameter_values.
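
When an earlier run of the full pipeline has already produced a stage 1 tuning result, that artifact’s GCS URI can be fed back in to skip architecture search. A hedged sketch with placeholder paths:

    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    template_path, parameter_values = (
        utils.get_skip_architecture_search_pipeline_and_parameters(
            project="my-project",                    # hypothetical project ID
            location="us-central1",
            root_dir="gs://my-bucket/pipeline_root", # hypothetical GCS bucket
            target_column="label",
            prediction_type="classification",
            optimization_objective="maximize-au-prc",
            transformations="gs://my-bucket/transformations.json",  # hypothetical path
            train_budget_milli_node_hours=1000,
            # Artifact produced by the stage 1 tuner of a previous pipeline run.
            stage_1_tuning_result_artifact_uri=(
                "gs://my-bucket/previous_run/stage_1_tuning_result"  # hypothetical URI
            ),
            data_source_csv_filenames="gs://my-bucket/data.csv",     # hypothetical CSV
            training_fraction=0.8,
            validation_fraction=0.1,
            test_fraction=0.1,
        )
    )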

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_skip_evaluation_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column_name: str, prediction_type: str, optimization_objective: str, transformations: Dict[str, Any], split_spec: Dict[str, Any], data_source: Dict[str, Any], train_budget_milli_node_hours: float, stage_1_num_parallel_trials: int = 35, stage_2_num_parallel_trials: int = 35, stage_2_num_selected_trials: int = 5, weight_column_name: str = '', study_spec_override: Dict[str, Any] | None = None, optimization_objective_recall_value: float = -1, optimization_objective_precision_value: float = -1, stage_1_tuner_worker_pool_specs_override: Dict[str, Any] | None = None, cv_trainer_worker_pool_specs_override: Dict[str, Any] | None = None, export_additional_model_without_custom_ops: bool = False, stats_and_example_gen_dataflow_machine_type: str = 'n1-standard-16', stats_and_example_gen_dataflow_max_num_workers: int = 25, stats_and_example_gen_dataflow_disk_size_gb: int = 40, transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '', additional_experiments: Dict[str, Any] | None = None) Tuple[str, Dict[str, Any]]

Get the AutoML Tabular training pipeline that skips evaluation.

Args:

project: The GCP project that runs the pipeline components.
location: The GCP region that runs the pipeline components.
root_dir: The root GCS directory for the pipeline components.
target_column_name: The target column name.
prediction_type: The type of prediction the model is to produce, either “classification” or “regression”.
optimization_objective: For binary classification, “maximize-au-roc”, “minimize-log-loss”, “maximize-au-prc”, “maximize-precision-at-recall”, or “maximize-recall-at-precision”. For multi-class classification, “minimize-log-loss”. For regression, “minimize-rmse”, “minimize-mae”, or “minimize-rmsle”.
transformations: The transformations to apply.
split_spec: The split spec.
data_source: The data source.
train_budget_milli_node_hours: The train budget for creating this model, expressed in milli node hours, i.e. a value of 1,000 in this field means 1 node hour.
stage_1_num_parallel_trials: Number of parallel trials for stage 1.
stage_2_num_parallel_trials: Number of parallel trials for stage 2.
stage_2_num_selected_trials: Number of selected trials for stage 2.
weight_column_name: The weight column name.
study_spec_override: The dictionary for overriding the study spec.
optimization_objective_recall_value: Required when optimization_objective is “maximize-precision-at-recall”. Must be between 0 and 1, inclusive.
optimization_objective_precision_value: Required when optimization_objective is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.
stage_1_tuner_worker_pool_specs_override: The dictionary for overriding the stage 1 tuner worker pool spec. The dictionary should be of the format described at https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.
cv_trainer_worker_pool_specs_override: The dictionary for overriding the stage 2 CV trainer worker pool spec. The dictionary should be of the format described at https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.
export_additional_model_without_custom_ops: Whether to export an additional model without custom TensorFlow operators.
stats_and_example_gen_dataflow_machine_type: The Dataflow machine type for the stats_and_example_gen component.
stats_and_example_gen_dataflow_max_num_workers: The max number of Dataflow workers for the stats_and_example_gen component.
stats_and_example_gen_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for the stats_and_example_gen component.
transform_dataflow_machine_type: The Dataflow machine type for the transform component.
transform_dataflow_max_num_workers: The max number of Dataflow workers for the transform component.
transform_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for the transform component.
dataflow_subnetwork: Dataflow’s fully qualified subnetwork name; when empty, the default subnetwork will be used. Example: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications
dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.
encryption_spec_key_name: The KMS key name.
additional_experiments: Use this field to configure private preview features.

Returns:

Tuple of pipeline_definition_path and parameter_values.
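
This legacy variant takes the same dictionary-based specs as get_default_pipeline_and_parameters. The hedged sketch below (placeholder values throughout) focuses on the Dataflow networking and CMEK arguments, which follow the descriptions above:

    import json
    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    # Pre-built spec dictionaries, prepared as in the
    # get_default_pipeline_and_parameters example above (hypothetical files).
    specs = {}
    for name in ("transformations", "split_spec", "data_source"):
        with open(f"{name}.json") as f:
            specs[name] = json.load(f)

    template_path, parameter_values = utils.get_skip_evaluation_pipeline_and_parameters(
        project="my-project",                    # hypothetical project ID
        location="us-central1",
        root_dir="gs://my-bucket/pipeline_root", # hypothetical GCS bucket
        target_column_name="label",
        prediction_type="regression",
        optimization_objective="minimize-rmse",
        transformations=specs["transformations"],
        split_spec=specs["split_spec"],
        data_source=specs["data_source"],
        train_budget_milli_node_hours=1000,
        # Keep Dataflow workers on a private subnetwork and encrypt with a CMEK key.
        dataflow_subnetwork=(
            "https://www.googleapis.com/compute/v1/projects/my-project/"
            "regions/us-central1/subnetworks/my-subnet"   # hypothetical subnetwork
        ),
        dataflow_use_public_ips=False,
        encryption_spec_key_name=(
            "projects/my-project/locations/us-central1/"
            "keyRings/my-ring/cryptoKeys/my-key"          # hypothetical KMS key
        ),
    )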

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_tabnet_hyperparameter_tuning_job_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, study_spec_metric_id: str, study_spec_metric_goal: str, study_spec_parameters_override: List[Dict[str, Any]], max_trial_count: int, parallel_trial_count: int, transform_config: str | None = None, dataset_level_custom_transformation_definitions: List[Dict[str, Any]] | None = None, dataset_level_transformations: List[Dict[str, Any]] | None = None, run_feature_selection: bool = False, feature_selection_algorithm: str | None = None, materialized_examples_format: str | None = None, max_selected_features: int | None = None, predefined_split_key: str | None = None, stratified_split_key: str | None = None, training_fraction: float | None = None, validation_fraction: float | None = None, test_fraction: float | None = None, tf_auto_transform_features: List[str] | None = None, tf_custom_transformation_definitions: List[Dict[str, Any]] | None = None, tf_transformations_path: str | None = None, enable_profiler: bool = False, cache_data: str = 'auto', seed: int = 1, eval_steps: int = 0, eval_frequency_secs: int = 600, data_source_csv_filenames: str | None = None, data_source_bigquery_table_path: str | None = None, bigquery_staging_full_dataset_id: str | None = None, weight_column: str = '', max_failed_trial_count: int = 0, study_spec_algorithm: str = 'ALGORITHM_UNSPECIFIED', study_spec_measurement_selection_type: str = 'BEST_MEASUREMENT', transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, worker_pool_specs_override: Dict[str, Any] | None = None, run_evaluation: bool = True, evaluation_batch_predict_machine_type: str = 'n1-highmem-8', evaluation_batch_predict_starting_replica_count: int = 20, evaluation_batch_predict_max_replica_count: int = 20, evaluation_dataflow_machine_type: str = 'n1-standard-4', evaluation_dataflow_starting_num_workers: int = 10, evaluation_dataflow_max_num_workers: int = 100, evaluation_dataflow_disk_size_gb: int = 50, dataflow_service_account: str = '', dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '') Tuple[str, Dict[str, Any]]

Get the TabNet HyperparameterTuningJob pipeline.

Args:

project: The GCP project that runs the pipeline components.
location: The GCP region that runs the pipeline components.
root_dir: The root GCS directory for the pipeline components.
target_column: The target column name.
prediction_type: The type of prediction the model is to produce, either “classification” or “regression”.
study_spec_metric_id: Metric to optimize. Possible values: ‘loss’, ‘average_loss’, ‘rmse’, ‘mae’, ‘mql’, ‘accuracy’, ‘auc’, ‘precision’, ‘recall’.
study_spec_metric_goal: Optimization goal of the metric. Possible values: “MAXIMIZE”, “MINIMIZE”.
study_spec_parameters_override: List of dictionaries representing parameters to optimize. The dictionary key is the parameter_id, which is passed to the training job as a command-line argument, and the dictionary value is the parameter specification of the metric.
max_trial_count: The desired total number of trials.
parallel_trial_count: The desired number of trials to run in parallel.
transform_config: Path to the v1 TF transformation configuration.
dataset_level_custom_transformation_definitions: Dataset-level custom transformation definitions in string format.
dataset_level_transformations: Dataset-level transformation configuration in string format.
run_feature_selection: Whether to enable feature selection.
feature_selection_algorithm: Feature selection algorithm.
materialized_examples_format: The format for the materialized examples.
max_selected_features: Maximum number of features to select.
predefined_split_key: Predefined split key.
stratified_split_key: Stratified split key.
training_fraction: Training fraction.
validation_fraction: Validation fraction.
test_fraction: Test fraction.
tf_auto_transform_features: List of auto transform features in the comma-separated string format.
tf_custom_transformation_definitions: TF custom transformation definitions in string format.
tf_transformations_path: Path to the TF transformation configuration.
enable_profiler: Enables profiling and saves a trace during evaluation.
cache_data: Whether to cache data or not. If set to ‘auto’, caching is determined based on the dataset size.
seed: Seed to be used for this run.
eval_steps: Number of steps to run evaluation for. If not specified or negative, it means run evaluation on the whole validation dataset. If set to 0, it means run evaluation for a fixed number of samples.
eval_frequency_secs: Frequency at which evaluation and checkpointing will take place.
data_source_csv_filenames: The CSV data source.
data_source_bigquery_table_path: The BigQuery data source.
bigquery_staging_full_dataset_id: The BigQuery staging full dataset id for storing intermediate tables.
weight_column: The weight column name.
max_failed_trial_count: The number of failed trials that need to be seen before failing the HyperparameterTuningJob. If set to 0, Vertex AI decides how many trials must fail before the whole job fails.
study_spec_algorithm: The search algorithm specified for the study. One of “ALGORITHM_UNSPECIFIED”, “GRID_SEARCH”, or “RANDOM_SEARCH”.
study_spec_measurement_selection_type: Which measurement to use if/when the service automatically selects the final measurement from previously reported intermediate measurements. One of “BEST_MEASUREMENT” or “LAST_MEASUREMENT”.
transform_dataflow_machine_type: The Dataflow machine type for the transform component.
transform_dataflow_max_num_workers: The max number of Dataflow workers for the transform component.
transform_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for the transform component.
worker_pool_specs_override: The dictionary for overriding training and evaluation worker pool specs. The dictionary should be of the format described at https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.
run_evaluation: Whether to run evaluation steps during training.
evaluation_batch_predict_machine_type: The prediction server machine type for batch predict components during evaluation.
evaluation_batch_predict_starting_replica_count: The initial number of prediction servers for batch predict components during evaluation.
evaluation_batch_predict_max_replica_count: The max number of prediction servers for batch predict components during evaluation.
evaluation_dataflow_machine_type: The Dataflow machine type for evaluation components.
evaluation_dataflow_starting_num_workers: The initial number of Dataflow workers for evaluation components.
evaluation_dataflow_max_num_workers: The max number of Dataflow workers for evaluation components.
evaluation_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for evaluation components.
dataflow_service_account: Custom service account to run Dataflow jobs.
dataflow_subnetwork: Dataflow’s fully qualified subnetwork name; when empty, the default subnetwork will be used. Example: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications
dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.
encryption_spec_key_name: The KMS key name.

Returns:

Tuple of pipeline_definition_path and parameter_values.
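
Example (a minimal, hedged sketch: the project, bucket, and column names are placeholders, and submitting the compiled pipeline with the google-cloud-aiplatform SDK is an assumption about the surrounding workflow rather than part of this function's contract):

    from google.cloud import aiplatform  # assumed: google-cloud-aiplatform is installed
    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    # Build a search space with the helper documented below, then compile the
    # hyperparameter tuning pipeline. All resource names are placeholders.
    study_spec_parameters = utils.get_tabnet_study_spec_parameters_override(
        dataset_size_bucket='small',
        prediction_type='classification',
        training_budget_bucket='small',
    )
    template_path, parameter_values = (
        utils.get_tabnet_hyperparameter_tuning_job_pipeline_and_parameters(
            project='my-project',
            location='us-central1',
            root_dir='gs://my-bucket/pipeline_root',
            target_column='label',
            prediction_type='classification',
            study_spec_metric_id='loss',
            study_spec_metric_goal='MINIMIZE',
            study_spec_parameters_override=study_spec_parameters,
            max_trial_count=10,
            parallel_trial_count=5,
            data_source_csv_filenames='gs://my-bucket/data/train.csv',
        )
    )

    # The returned pair can then be submitted as a Vertex AI pipeline run.
    aiplatform.init(project='my-project', location='us-central1')
    aiplatform.PipelineJob(
        display_name='tabnet-hpt',
        template_path=template_path,
        parameter_values=parameter_values,
        pipeline_root='gs://my-bucket/pipeline_root',
    ).run()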

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_tabnet_study_spec_parameters_override(dataset_size_bucket: str, prediction_type: str, training_budget_bucket: str) List[Dict[str, Any]]

Get study_spec_parameters_override for a TabNet hyperparameter tuning job.

Args:
dataset_size_bucket: Size of the dataset. One of "small" (< 1M rows), "medium" (1M - 100M rows), or "large" (> 100M rows).
prediction_type: The type of prediction the model is to produce. "classification" or "regression".
training_budget_bucket: Bucket of the estimated training budget. One of "small" (< $600), "medium" ($600 - $2400), or "large" (> $2400). This parameter is only used as a hint for the hyperparameter search space, unrelated to the real cost.

Returns:

List of study_spec_parameters_override.

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_tabnet_trainer_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, learning_rate: float, transform_config: str | None = None, dataset_level_custom_transformation_definitions: List[Dict[str, Any]] | None = None, dataset_level_transformations: List[Dict[str, Any]] | None = None, run_feature_selection: bool = False, feature_selection_algorithm: str | None = None, materialized_examples_format: str | None = None, max_selected_features: int | None = None, predefined_split_key: str | None = None, stratified_split_key: str | None = None, training_fraction: float | None = None, validation_fraction: float | None = None, test_fraction: float | None = None, tf_auto_transform_features: List[str] | None = None, tf_custom_transformation_definitions: List[Dict[str, Any]] | None = None, tf_transformations_path: str | None = None, max_steps: int = -1, max_train_secs: int = -1, large_category_dim: int = 1, large_category_thresh: int = 300, yeo_johnson_transform: bool = True, feature_dim: int = 64, feature_dim_ratio: float = 0.5, num_decision_steps: int = 6, relaxation_factor: float = 1.5, decay_every: float = 100, decay_rate: float = 0.95, gradient_thresh: float = 2000, sparsity_loss_weight: float = 1e-05, batch_momentum: float = 0.95, batch_size_ratio: float = 0.25, num_transformer_layers: int = 4, num_transformer_layers_ratio: float = 0.25, class_weight: float = 1.0, loss_function_type: str = 'default', alpha_focal_loss: float = 0.25, gamma_focal_loss: float = 2.0, enable_profiler: bool = False, cache_data: str = 'auto', seed: int = 1, eval_steps: int = 0, batch_size: int = 100, measurement_selection_type: str | None = None, optimization_metric: str | None = None, eval_frequency_secs: int = 600, data_source_csv_filenames: str | None = None, data_source_bigquery_table_path: str | None = None, bigquery_staging_full_dataset_id: str | None = None, weight_column: str = '', transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, worker_pool_specs_override: Dict[str, Any] | None = None, run_evaluation: bool = True, evaluation_batch_predict_machine_type: str = 'n1-highmem-8', evaluation_batch_predict_starting_replica_count: int = 20, evaluation_batch_predict_max_replica_count: int = 20, evaluation_dataflow_machine_type: str = 'n1-standard-4', evaluation_dataflow_starting_num_workers: int = 10, evaluation_dataflow_max_num_workers: int = 100, evaluation_dataflow_disk_size_gb: int = 50, dataflow_service_account: str = '', dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '') Tuple[str, Dict[str, Any]]

Get the TabNet training pipeline.

Args:

project: The GCP project that runs the pipeline components.
location: The GCP region that runs the pipeline components.
root_dir: The root GCS directory for the pipeline components.
target_column: The target column name.
prediction_type: The type of prediction the model is to produce. "classification" or "regression".
learning_rate: The learning rate used by the linear optimizer.
transform_config: Path to v1 TF transformation configuration.
dataset_level_custom_transformation_definitions: Dataset-level custom transformation definitions in string format.
dataset_level_transformations: Dataset-level transformation configuration in string format.
run_feature_selection: Whether to enable feature selection.
feature_selection_algorithm: Feature selection algorithm.
materialized_examples_format: The format for the materialized examples.
max_selected_features: Maximum number of features to select.
predefined_split_key: Predefined split key.
stratified_split_key: Stratified split key.
training_fraction: Training fraction.
validation_fraction: Validation fraction.
test_fraction: Test fraction.
tf_auto_transform_features: List of auto transform features in the comma-separated string format.
tf_custom_transformation_definitions: TF custom transformation definitions in string format.
tf_transformations_path: Path to TF transformation configuration.
max_steps: Number of steps to run the trainer for.
max_train_secs: Amount of time in seconds to run the trainer for.
large_category_dim: Embedding dimension for categorical features with a large number of categories.
large_category_thresh: Threshold for the number of categories at which the large_category_dim embedding dimension is applied.
yeo_johnson_transform: Enables trainable Yeo-Johnson power transform.
feature_dim: Dimensionality of the hidden representation in the feature transformation block.
feature_dim_ratio: The ratio of output dimension (dimensionality of the outputs of each decision step) to feature dimension.
num_decision_steps: Number of sequential decision steps.
relaxation_factor: Relaxation factor that promotes the reuse of each feature at different decision steps. When it is 1, a feature is enforced to be used only at one decision step; as it increases, more flexibility is provided to use a feature at multiple decision steps.
decay_every: Number of iterations for periodically applying learning rate decay.
decay_rate: Learning rate decay rate.
gradient_thresh: Threshold for the norm of gradients for clipping.
sparsity_loss_weight: Weight of the loss for sparsity regularization (increasing it will yield more sparse feature selection).
batch_momentum: Momentum in ghost batch normalization.
batch_size_ratio: The ratio of virtual batch size (size of the ghost batch normalization) to batch size.
num_transformer_layers: The number of transformer layers for each decision step.
num_transformer_layers_ratio: The ratio of shared transformer layers to transformer layers.
class_weight: The class weight used to compute a weighted cross entropy, which is helpful for classifying imbalanced datasets. Only used for classification.
loss_function_type: Loss function type. For classification: [cross_entropy, weighted_cross_entropy, focal_loss]; default is cross_entropy. For regression: [rmse, mae, mse]; default is mse.
alpha_focal_loss: Alpha value (balancing factor) in the focal_loss function. Only used for classification.
gamma_focal_loss: Gamma value (modulating factor) for focal loss. Only used for classification.
enable_profiler: Enables profiling and saves a trace during evaluation.
cache_data: Whether to cache data or not. If set to 'auto', caching is determined based on the dataset size.
seed: Seed to be used for this run.
eval_steps: Number of steps to run evaluation for. If not specified or negative, it means run evaluation on the whole validation dataset. If set to 0, it means run evaluation for a fixed number of samples.
batch_size: Batch size for training.
measurement_selection_type: Which measurement to use if/when the service automatically selects the final measurement from previously reported intermediate measurements. One of "BEST_MEASUREMENT" or "LAST_MEASUREMENT".
optimization_metric: Optimization metric used for measurement_selection_type. Default is "rmse" for regression and "auc" for classification.
eval_frequency_secs: Frequency at which evaluation and checkpointing will take place.
data_source_csv_filenames: The CSV data source.
data_source_bigquery_table_path: The BigQuery data source.
bigquery_staging_full_dataset_id: The BigQuery staging full dataset id for storing intermediate tables.
weight_column: The weight column name.
transform_dataflow_machine_type: The Dataflow machine type for the transform component.
transform_dataflow_max_num_workers: The max number of Dataflow workers for the transform component.
transform_dataflow_disk_size_gb: Dataflow worker's disk size in GB for the transform component.
worker_pool_specs_override: The dictionary for overriding training and evaluation worker pool specs. The dictionary should follow the format defined at https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.
run_evaluation: Whether to run evaluation steps during training.
evaluation_batch_predict_machine_type: The prediction server machine type for batch predict components during evaluation.
evaluation_batch_predict_starting_replica_count: The initial number of prediction servers for batch predict components during evaluation.
evaluation_batch_predict_max_replica_count: The max number of prediction servers for batch predict components during evaluation.
evaluation_dataflow_machine_type: The Dataflow machine type for evaluation components.
evaluation_dataflow_starting_num_workers: The initial number of Dataflow workers for evaluation components.
evaluation_dataflow_max_num_workers: The max number of Dataflow workers for evaluation components.
evaluation_dataflow_disk_size_gb: Dataflow worker's disk size in GB for evaluation components.
dataflow_service_account: Custom service account to run Dataflow jobs.
dataflow_subnetwork: Dataflow's fully qualified subnetwork name; when empty, the default subnetwork is used.
dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.
encryption_spec_key_name: The KMS key name.

Returns:

Tuple of pipeline_definition_path and parameter_values.
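
Example (a minimal sketch; the project, bucket, BigQuery table, and column names are placeholders, and the hyperparameter values are illustrative only):

    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    # Compile a single TabNet training run with a fixed learning rate.
    # The returned tuple can be submitted the same way as in the tuning example above.
    template_path, parameter_values = utils.get_tabnet_trainer_pipeline_and_parameters(
        project='my-project',
        location='us-central1',
        root_dir='gs://my-bucket/pipeline_root',
        target_column='label',
        prediction_type='classification',
        learning_rate=0.01,
        max_steps=10000,
        batch_size=1024,
        data_source_bigquery_table_path='bq://my-project.my_dataset.training_table',
    )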

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_wide_and_deep_hyperparameter_tuning_job_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, study_spec_metric_id: str, study_spec_metric_goal: str, study_spec_parameters_override: List[Dict[str, Any]], max_trial_count: int, parallel_trial_count: int, transform_config: str | None = None, dataset_level_custom_transformation_definitions: List[Dict[str, Any]] | None = None, dataset_level_transformations: List[Dict[str, Any]] | None = None, run_feature_selection: bool = False, feature_selection_algorithm: str | None = None, materialized_examples_format: str | None = None, max_selected_features: int | None = None, predefined_split_key: str | None = None, stratified_split_key: str | None = None, training_fraction: float | None = None, validation_fraction: float | None = None, test_fraction: float | None = None, tf_auto_transform_features: List[str] | None = None, tf_custom_transformation_definitions: List[Dict[str, Any]] | None = None, tf_transformations_path: str | None = None, enable_profiler: bool = False, cache_data: str = 'auto', seed: int = 1, eval_steps: int = 0, eval_frequency_secs: int = 600, data_source_csv_filenames: str | None = None, data_source_bigquery_table_path: str | None = None, bigquery_staging_full_dataset_id: str | None = None, weight_column: str = '', max_failed_trial_count: int = 0, study_spec_algorithm: str = 'ALGORITHM_UNSPECIFIED', study_spec_measurement_selection_type: str = 'BEST_MEASUREMENT', transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, worker_pool_specs_override: Dict[str, Any] | None = None, run_evaluation: bool = True, evaluation_batch_predict_machine_type: str = 'n1-highmem-8', evaluation_batch_predict_starting_replica_count: int = 20, evaluation_batch_predict_max_replica_count: int = 20, evaluation_dataflow_machine_type: str = 'n1-standard-4', evaluation_dataflow_starting_num_workers: int = 10, evaluation_dataflow_max_num_workers: int = 100, evaluation_dataflow_disk_size_gb: int = 50, dataflow_service_account: str = '', dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '') Tuple[str, Dict[str, Any]]

Get the Wide & Deep algorithm HyperparameterTuningJob pipeline.

Args:

project: The GCP project that runs the pipeline components.
location: The GCP region that runs the pipeline components.
root_dir: The root GCS directory for the pipeline components.
target_column: The target column name.
prediction_type: The type of prediction the model is to produce. "classification" or "regression".
study_spec_metric_id: Metric to optimize, possible values: ['loss', 'average_loss', 'rmse', 'mae', 'mql', 'accuracy', 'auc', 'precision', 'recall'].
study_spec_metric_goal: Optimization goal of the metric, possible values: "MAXIMIZE", "MINIMIZE".
study_spec_parameters_override: List of dictionaries representing parameters to optimize. The dictionary key is the parameter_id, which is passed to the training job as a command-line argument, and the dictionary value is the parameter specification of the metric.
max_trial_count: The desired total number of trials.
parallel_trial_count: The desired number of trials to run in parallel.
transform_config: Path to v1 TF transformation configuration.
dataset_level_custom_transformation_definitions: Dataset-level custom transformation definitions in string format.
dataset_level_transformations: Dataset-level transformation configuration in string format.
run_feature_selection: Whether to enable feature selection.
feature_selection_algorithm: Feature selection algorithm.
materialized_examples_format: The format for the materialized examples.
max_selected_features: Maximum number of features to select.
predefined_split_key: Predefined split key.
stratified_split_key: Stratified split key.
training_fraction: Training fraction.
validation_fraction: Validation fraction.
test_fraction: Test fraction.
tf_auto_transform_features: List of auto transform features in the comma-separated string format.
tf_custom_transformation_definitions: TF custom transformation definitions in string format.
tf_transformations_path: Path to TF transformation configuration.
enable_profiler: Enables profiling and saves a trace during evaluation.
cache_data: Whether to cache data or not. If set to 'auto', caching is determined based on the dataset size.
seed: Seed to be used for this run.
eval_steps: Number of steps to run evaluation for. If not specified or negative, it means run evaluation on the whole validation dataset. If set to 0, it means run evaluation for a fixed number of samples.
eval_frequency_secs: Frequency at which evaluation and checkpointing will take place.
data_source_csv_filenames: The CSV data source.
data_source_bigquery_table_path: The BigQuery data source.
bigquery_staging_full_dataset_id: The BigQuery staging full dataset id for storing intermediate tables.
weight_column: The weight column name.
max_failed_trial_count: The number of failed trials that need to be seen before failing the HyperparameterTuningJob. If set to 0, Vertex AI decides how many trials must fail before the whole job fails.
study_spec_algorithm: The search algorithm specified for the study. One of "ALGORITHM_UNSPECIFIED", "GRID_SEARCH", or "RANDOM_SEARCH".
study_spec_measurement_selection_type: Which measurement to use if/when the service automatically selects the final measurement from previously reported intermediate measurements. One of "BEST_MEASUREMENT" or "LAST_MEASUREMENT".
transform_dataflow_machine_type: The Dataflow machine type for the transform component.
transform_dataflow_max_num_workers: The max number of Dataflow workers for the transform component.
transform_dataflow_disk_size_gb: Dataflow worker's disk size in GB for the transform component.
worker_pool_specs_override: The dictionary for overriding training and evaluation worker pool specs. The dictionary should follow the format defined at https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.
run_evaluation: Whether to run evaluation steps during training.
evaluation_batch_predict_machine_type: The prediction server machine type for batch predict components during evaluation.
evaluation_batch_predict_starting_replica_count: The initial number of prediction servers for batch predict components during evaluation.
evaluation_batch_predict_max_replica_count: The max number of prediction servers for batch predict components during evaluation.
evaluation_dataflow_machine_type: The Dataflow machine type for evaluation components.
evaluation_dataflow_starting_num_workers: The initial number of Dataflow workers for evaluation components.
evaluation_dataflow_max_num_workers: The max number of Dataflow workers for evaluation components.
evaluation_dataflow_disk_size_gb: Dataflow worker's disk size in GB for evaluation components.
dataflow_service_account: Custom service account to run Dataflow jobs.
dataflow_subnetwork: Dataflow's fully qualified subnetwork name; when empty, the default subnetwork is used.
dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.
encryption_spec_key_name: The KMS key name.

Returns:

Tuple of pipeline_definition_path and parameter_values.

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_wide_and_deep_study_spec_parameters_override() List[Dict[str, Any]]

Get study_spec_parameters_override for a Wide & Deep hyperparameter tuning job.

Returns:

List of study_spec_parameters_override.
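
Example (a minimal sketch; the resource names are placeholders). The override returned here can be passed straight into the Wide & Deep tuning pipeline documented above:

    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    # Default Wide & Deep search space, fed into the tuning pipeline above.
    params = utils.get_wide_and_deep_study_spec_parameters_override()
    template_path, parameter_values = (
        utils.get_wide_and_deep_hyperparameter_tuning_job_pipeline_and_parameters(
            project='my-project',
            location='us-central1',
            root_dir='gs://my-bucket/pipeline_root',
            target_column='label',
            prediction_type='classification',
            study_spec_metric_id='loss',
            study_spec_metric_goal='MINIMIZE',
            study_spec_parameters_override=params,
            max_trial_count=20,
            parallel_trial_count=5,
            data_source_bigquery_table_path='bq://my-project.my_dataset.training_table',
        )
    )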

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_wide_and_deep_trainer_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, learning_rate: float, dnn_learning_rate: float, transform_config: str | None = None, dataset_level_custom_transformation_definitions: List[Dict[str, Any]] | None = None, dataset_level_transformations: List[Dict[str, Any]] | None = None, run_feature_selection: bool = False, feature_selection_algorithm: str | None = None, materialized_examples_format: str | None = None, max_selected_features: int | None = None, predefined_split_key: str | None = None, stratified_split_key: str | None = None, training_fraction: float | None = None, validation_fraction: float | None = None, test_fraction: float | None = None, tf_auto_transform_features: List[str] | None = None, tf_custom_transformation_definitions: List[Dict[str, Any]] | None = None, tf_transformations_path: str | None = None, optimizer_type: str = 'adam', max_steps: int = -1, max_train_secs: int = -1, l1_regularization_strength: float = 0, l2_regularization_strength: float = 0, l2_shrinkage_regularization_strength: float = 0, beta_1: float = 0.9, beta_2: float = 0.999, hidden_units: str = '30,30,30', use_wide: bool = True, embed_categories: bool = True, dnn_dropout: float = 0, dnn_optimizer_type: str = 'adam', dnn_l1_regularization_strength: float = 0, dnn_l2_regularization_strength: float = 0, dnn_l2_shrinkage_regularization_strength: float = 0, dnn_beta_1: float = 0.9, dnn_beta_2: float = 0.999, enable_profiler: bool = False, cache_data: str = 'auto', seed: int = 1, eval_steps: int = 0, batch_size: int = 100, measurement_selection_type: str | None = None, optimization_metric: str | None = None, eval_frequency_secs: int = 600, data_source_csv_filenames: str | None = None, data_source_bigquery_table_path: str | None = None, bigquery_staging_full_dataset_id: str | None = None, weight_column: str = '', transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, worker_pool_specs_override: Dict[str, Any] | None = None, run_evaluation: bool = True, evaluation_batch_predict_machine_type: str = 'n1-highmem-8', evaluation_batch_predict_starting_replica_count: int = 20, evaluation_batch_predict_max_replica_count: int = 20, evaluation_dataflow_machine_type: str = 'n1-standard-4', evaluation_dataflow_starting_num_workers: int = 10, evaluation_dataflow_max_num_workers: int = 100, evaluation_dataflow_disk_size_gb: int = 50, dataflow_service_account: str = '', dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '') Tuple[str, Dict[str, Any]]

Get the Wide & Deep training pipeline.

Args:

project: The GCP project that runs the pipeline components.
location: The GCP region that runs the pipeline components.
root_dir: The root GCS directory for the pipeline components.
target_column: The target column name.
prediction_type: The type of prediction the model is to produce. "classification" or "regression".
learning_rate: The learning rate used by the linear optimizer.
dnn_learning_rate: The learning rate for training the deep part of the model.
transform_config: Path to v1 TF transformation configuration.
dataset_level_custom_transformation_definitions: Dataset-level custom transformation definitions in string format.
dataset_level_transformations: Dataset-level transformation configuration in string format.
run_feature_selection: Whether to enable feature selection.
feature_selection_algorithm: Feature selection algorithm.
materialized_examples_format: The format for the materialized examples.
max_selected_features: Maximum number of features to select.
predefined_split_key: Predefined split key.
stratified_split_key: Stratified split key.
training_fraction: Training fraction.
validation_fraction: Validation fraction.
test_fraction: Test fraction.
tf_auto_transform_features: List of auto transform features in the comma-separated string format.
tf_custom_transformation_definitions: TF custom transformation definitions in string format.
tf_transformations_path: Path to TF transformation configuration.
optimizer_type: The type of optimizer to use. Choices are "adam", "ftrl", and "sgd" for the Adam, FTRL, and Gradient Descent optimizers, respectively.
max_steps: Number of steps to run the trainer for.
max_train_secs: Amount of time in seconds to run the trainer for.
l1_regularization_strength: L1 regularization strength for optimizer_type="ftrl".
l2_regularization_strength: L2 regularization strength for optimizer_type="ftrl".
l2_shrinkage_regularization_strength: L2 shrinkage regularization strength for optimizer_type="ftrl".
beta_1: Beta 1 value for optimizer_type="adam".
beta_2: Beta 2 value for optimizer_type="adam".
hidden_units: Hidden layer sizes to use for DNN feature columns, provided as comma-separated layer sizes.
use_wide: If set to true, the categorical columns will be used in the wide part of the DNN model.
embed_categories: If set to true, the categorical columns will be embedded and used in the deep part of the model. Embedding size is the square root of the column cardinality.
dnn_dropout: The probability that a given coordinate will be dropped out.
dnn_optimizer_type: The type of optimizer to use for the deep part of the model. Choices are "adam", "ftrl", and "sgd" for the Adam, FTRL, and Gradient Descent optimizers, respectively.
dnn_l1_regularization_strength: L1 regularization strength for dnn_optimizer_type="ftrl".
dnn_l2_regularization_strength: L2 regularization strength for dnn_optimizer_type="ftrl".
dnn_l2_shrinkage_regularization_strength: L2 shrinkage regularization strength for dnn_optimizer_type="ftrl".
dnn_beta_1: Beta 1 value for dnn_optimizer_type="adam".
dnn_beta_2: Beta 2 value for dnn_optimizer_type="adam".
enable_profiler: Enables profiling and saves a trace during evaluation.
cache_data: Whether to cache data or not. If set to 'auto', caching is determined based on the dataset size.
seed: Seed to be used for this run.
eval_steps: Number of steps to run evaluation for. If not specified or negative, it means run evaluation on the whole validation dataset. If set to 0, it means run evaluation for a fixed number of samples.
batch_size: Batch size for training.
measurement_selection_type: Which measurement to use if/when the service automatically selects the final measurement from previously reported intermediate measurements. One of "BEST_MEASUREMENT" or "LAST_MEASUREMENT".
optimization_metric: Optimization metric used for measurement_selection_type. Default is "rmse" for regression and "auc" for classification.
eval_frequency_secs: Frequency at which evaluation and checkpointing will take place.
data_source_csv_filenames: The CSV data source.
data_source_bigquery_table_path: The BigQuery data source.
bigquery_staging_full_dataset_id: The BigQuery staging full dataset id for storing intermediate tables.
weight_column: The weight column name.
transform_dataflow_machine_type: The Dataflow machine type for the transform component.
transform_dataflow_max_num_workers: The max number of Dataflow workers for the transform component.
transform_dataflow_disk_size_gb: Dataflow worker's disk size in GB for the transform component.
worker_pool_specs_override: The dictionary for overriding training and evaluation worker pool specs. The dictionary should follow the format defined at https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.
run_evaluation: Whether to run evaluation steps during training.
evaluation_batch_predict_machine_type: The prediction server machine type for batch predict components during evaluation.
evaluation_batch_predict_starting_replica_count: The initial number of prediction servers for batch predict components during evaluation.
evaluation_batch_predict_max_replica_count: The max number of prediction servers for batch predict components during evaluation.
evaluation_dataflow_machine_type: The Dataflow machine type for evaluation components.
evaluation_dataflow_starting_num_workers: The initial number of Dataflow workers for evaluation components.
evaluation_dataflow_max_num_workers: The max number of Dataflow workers for evaluation components.
evaluation_dataflow_disk_size_gb: Dataflow worker's disk size in GB for evaluation components.
dataflow_service_account: Custom service account to run Dataflow jobs.
dataflow_subnetwork: Dataflow's fully qualified subnetwork name; when empty, the default subnetwork is used.
dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.
encryption_spec_key_name: The KMS key name.

Returns:

Tuple of pipeline_definition_path and parameter_values.
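
Example (a minimal sketch; the resource names, column names, and hyperparameter values are placeholders):

    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    # Compile a Wide & Deep regression training run.
    template_path, parameter_values = utils.get_wide_and_deep_trainer_pipeline_and_parameters(
        project='my-project',
        location='us-central1',
        root_dir='gs://my-bucket/pipeline_root',
        target_column='price',
        prediction_type='regression',
        learning_rate=0.01,
        dnn_learning_rate=0.001,
        hidden_units='128,64,32',
        optimization_metric='rmse',
        data_source_csv_filenames='gs://my-bucket/data/train.csv',
    )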

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_xgboost_hyperparameter_tuning_job_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, objective: str, study_spec_metric_id: str, study_spec_metric_goal: str, max_trial_count: int, parallel_trial_count: int, study_spec_parameters_override: List[Dict[str, Any]] | None = None, eval_metric: str | None = None, disable_default_eval_metric: int | None = None, seed: int | None = None, seed_per_iteration: bool | None = None, dataset_level_custom_transformation_definitions: List[Dict[str, Any]] | None = None, dataset_level_transformations: List[Dict[str, Any]] | None = None, run_feature_selection: bool | None = None, feature_selection_algorithm: str | None = None, max_selected_features: int | None = None, predefined_split_key: str | None = None, stratified_split_key: str | None = None, training_fraction: float | None = None, validation_fraction: float | None = None, test_fraction: float | None = None, tf_auto_transform_features: List[str] | None = None, tf_custom_transformation_definitions: List[Dict[str, Any]] | None = None, tf_transformations_path: str | None = None, data_source_csv_filenames: str | None = None, data_source_bigquery_table_path: str | None = None, bigquery_staging_full_dataset_id: str | None = None, weight_column: str | None = None, max_failed_trial_count: int | None = None, training_machine_type: str | None = None, training_total_replica_count: int | None = None, training_accelerator_type: str | None = None, training_accelerator_count: int | None = None, study_spec_algorithm: str | None = None, study_spec_measurement_selection_type: str | None = None, transform_dataflow_machine_type: str | None = None, transform_dataflow_max_num_workers: int | None = None, transform_dataflow_disk_size_gb: int | None = None, run_evaluation: bool | None = None, evaluation_batch_predict_machine_type: str | None = None, evaluation_batch_predict_starting_replica_count: int | None = None, evaluation_batch_predict_max_replica_count: int | None = None, evaluation_dataflow_machine_type: str | None = None, evaluation_dataflow_starting_num_workers: int | None = None, evaluation_dataflow_max_num_workers: int | None = None, evaluation_dataflow_disk_size_gb: int | None = None, dataflow_service_account: str | None = None, dataflow_subnetwork: str | None = None, dataflow_use_public_ips: bool | None = None, encryption_spec_key_name: str | None = None)

Get the XGBoost HyperparameterTuningJob pipeline.

Args:

project: The GCP project that runs the pipeline components.
location: The GCP region that runs the pipeline components.
root_dir: The root GCS directory for the pipeline components.
target_column: The target column name.
objective: Specifies the learning task and the learning objective. Must be one of [reg:squarederror, reg:squaredlogerror, reg:logistic, reg:gamma, reg:tweedie, reg:pseudohubererror, binary:logistic, multi:softprob].
study_spec_metric_id: Metric to optimize. For options, please look under 'eval_metric' at https://xgboost.readthedocs.io/en/stable/parameter.html#learning-task-parameters.
study_spec_metric_goal: Optimization goal of the metric, possible values: "MAXIMIZE", "MINIMIZE".
max_trial_count: The desired total number of trials.
parallel_trial_count: The desired number of trials to run in parallel.
study_spec_parameters_override: List of dictionaries representing parameters to optimize. The dictionary key is the parameter_id, which is passed to the training job as a command-line argument, and the dictionary value is the parameter specification of the metric.
eval_metric: Evaluation metrics for validation data represented as a comma-separated string.
disable_default_eval_metric: Flag to disable the default metric. Set to >0 to disable. Defaults to 0.
seed: Random seed.
seed_per_iteration: Seed the PRNG deterministically via the iteration number.
dataset_level_custom_transformation_definitions: Dataset-level custom transformation definitions in string format.
dataset_level_transformations: Dataset-level transformation configuration in string format.
run_feature_selection: Whether to enable feature selection.
feature_selection_algorithm: Feature selection algorithm.
max_selected_features: Maximum number of features to select.
predefined_split_key: Predefined split key.
stratified_split_key: Stratified split key.
training_fraction: Training fraction.
validation_fraction: Validation fraction.
test_fraction: Test fraction.
tf_auto_transform_features: List of auto transform features in the comma-separated string format.
tf_custom_transformation_definitions: TF custom transformation definitions in string format.
tf_transformations_path: Path to TF transformation configuration.
data_source_csv_filenames: The CSV data source.
data_source_bigquery_table_path: The BigQuery data source.
bigquery_staging_full_dataset_id: The BigQuery staging full dataset id for storing intermediate tables.
weight_column: The weight column name.
max_failed_trial_count: The number of failed trials that need to be seen before failing the HyperparameterTuningJob. If set to 0, Vertex AI decides how many trials must fail before the whole job fails.
training_machine_type: Machine type.
training_total_replica_count: Number of workers.
training_accelerator_type: Accelerator type.
training_accelerator_count: Accelerator count.
study_spec_algorithm: The search algorithm specified for the study. One of "ALGORITHM_UNSPECIFIED", "GRID_SEARCH", or "RANDOM_SEARCH".
study_spec_measurement_selection_type: Which measurement to use if/when the service automatically selects the final measurement from previously reported intermediate measurements. One of "BEST_MEASUREMENT" or "LAST_MEASUREMENT".
transform_dataflow_machine_type: The Dataflow machine type for the transform component.
transform_dataflow_max_num_workers: The max number of Dataflow workers for the transform component.
transform_dataflow_disk_size_gb: Dataflow worker's disk size in GB for the transform component.
run_evaluation: Whether to run evaluation steps during training.
evaluation_batch_predict_machine_type: The prediction server machine type for batch predict components during evaluation.
evaluation_batch_predict_starting_replica_count: The initial number of prediction servers for batch predict components during evaluation.
evaluation_batch_predict_max_replica_count: The max number of prediction servers for batch predict components during evaluation.
evaluation_dataflow_machine_type: The Dataflow machine type for evaluation components.
evaluation_dataflow_starting_num_workers: The initial number of Dataflow workers for evaluation components.
evaluation_dataflow_max_num_workers: The max number of Dataflow workers for evaluation components.
evaluation_dataflow_disk_size_gb: Dataflow worker's disk size in GB for evaluation components.
dataflow_service_account: Custom service account to run Dataflow jobs.
dataflow_subnetwork: Dataflow's fully qualified subnetwork name; when empty, the default subnetwork is used.
dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.
encryption_spec_key_name: The KMS key name.

Returns:

Tuple of pipeline_definition_path and parameter_values.
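
Example (a minimal sketch; the resource names are placeholders and the objective/metric choices are illustrative):

    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    # Compile an XGBoost tuning pipeline over the default search space
    # returned by the helper documented below.
    template_path, parameter_values = (
        utils.get_xgboost_hyperparameter_tuning_job_pipeline_and_parameters(
            project='my-project',
            location='us-central1',
            root_dir='gs://my-bucket/pipeline_root',
            target_column='label',
            objective='reg:squarederror',
            study_spec_metric_id='rmse',
            study_spec_metric_goal='MINIMIZE',
            max_trial_count=20,
            parallel_trial_count=5,
            study_spec_parameters_override=utils.get_xgboost_study_spec_parameters_override(),
            data_source_csv_filenames='gs://my-bucket/data/train.csv',
        )
    )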

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_xgboost_study_spec_parameters_override() List[Dict[str, Any]]

Get study_spec_parameters_override for an XGBoost hyperparameter tuning job.

Returns:

List of study_spec_parameters_override.

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_xgboost_trainer_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, objective: str, eval_metric: str | None = None, num_boost_round: int | None = None, early_stopping_rounds: int | None = None, base_score: float | None = None, disable_default_eval_metric: int | None = None, seed: int | None = None, seed_per_iteration: bool | None = None, booster: str | None = None, eta: float | None = None, gamma: float | None = None, max_depth: int | None = None, min_child_weight: float | None = None, max_delta_step: float | None = None, subsample: float | None = None, colsample_bytree: float | None = None, colsample_bylevel: float | None = None, colsample_bynode: float | None = None, reg_lambda: float | None = None, reg_alpha: float | None = None, tree_method: str | None = None, scale_pos_weight: float | None = None, updater: str | None = None, refresh_leaf: int | None = None, process_type: str | None = None, grow_policy: str | None = None, sampling_method: str | None = None, monotone_constraints: str | None = None, interaction_constraints: str | None = None, sample_type: str | None = None, normalize_type: str | None = None, rate_drop: float | None = None, one_drop: int | None = None, skip_drop: float | None = None, num_parallel_tree: int | None = None, feature_selector: str | None = None, top_k: int | None = None, max_cat_to_onehot: int | None = None, max_leaves: int | None = None, max_bin: int | None = None, tweedie_variance_power: float | None = None, huber_slope: float | None = None, dataset_level_custom_transformation_definitions: List[Dict[str, Any]] | None = None, dataset_level_transformations: List[Dict[str, Any]] | None = None, run_feature_selection: bool | None = None, feature_selection_algorithm: str | None = None, max_selected_features: int | None = None, predefined_split_key: str | None = None, stratified_split_key: str | None = None, training_fraction: float | None = None, validation_fraction: float | None = None, test_fraction: float | None = None, tf_auto_transform_features: List[str] | None = None, tf_custom_transformation_definitions: List[Dict[str, Any]] | None = None, tf_transformations_path: str | None = None, data_source_csv_filenames: str | None = None, data_source_bigquery_table_path: str | None = None, bigquery_staging_full_dataset_id: str | None = None, weight_column: str | None = None, training_machine_type: str | None = None, training_total_replica_count: int | None = None, training_accelerator_type: str | None = None, training_accelerator_count: int | None = None, transform_dataflow_machine_type: str | None = None, transform_dataflow_max_num_workers: int | None = None, transform_dataflow_disk_size_gb: int | None = None, run_evaluation: bool | None = None, evaluation_batch_predict_machine_type: str | None = None, evaluation_batch_predict_starting_replica_count: int | None = None, evaluation_batch_predict_max_replica_count: int | None = None, evaluation_dataflow_machine_type: str | None = None, evaluation_dataflow_starting_num_workers: int | None = None, evaluation_dataflow_max_num_workers: int | None = None, evaluation_dataflow_disk_size_gb: int | None = None, dataflow_service_account: str | None = None, dataflow_subnetwork: str | None = None, dataflow_use_public_ips: bool | None = None, encryption_spec_key_name: str | None = None)

Get the XGBoost training pipeline.

Args:

project: The GCP project that runs the pipeline components.
location: The GCP region that runs the pipeline components.
root_dir: The root GCS directory for the pipeline components.
target_column: The target column name.
objective: Specifies the learning task and the learning objective. Must be one of [reg:squarederror, reg:squaredlogerror, reg:logistic, reg:gamma, reg:tweedie, reg:pseudohubererror, binary:logistic, multi:softprob].
eval_metric: Evaluation metrics for validation data represented as a comma-separated string.
num_boost_round: Number of boosting iterations.
early_stopping_rounds: Activates early stopping. Validation error needs to decrease at least every early_stopping_rounds round(s) to continue training.
base_score: The initial prediction score of all instances, global bias.
disable_default_eval_metric: Flag to disable the default metric. Set to >0 to disable. Defaults to 0.
seed: Random seed.
seed_per_iteration: Seed the PRNG deterministically via the iteration number.
booster: Which booster to use; can be gbtree, gblinear, or dart. gbtree and dart use a tree-based model while gblinear uses a linear function.
eta: Learning rate.
gamma: Minimum loss reduction required to make a further partition on a leaf node of the tree.
max_depth: Maximum depth of a tree.
min_child_weight: Minimum sum of instance weight (hessian) needed in a child.
max_delta_step: Maximum delta step we allow each tree's weight estimation to be.
subsample: Subsample ratio of the training instances.
colsample_bytree: Subsample ratio of columns when constructing each tree.
colsample_bylevel: Subsample ratio of columns for each split, in each level.
colsample_bynode: Subsample ratio of columns for each node (split).
reg_lambda: L2 regularization term on weights.
reg_alpha: L1 regularization term on weights.
tree_method: The tree construction algorithm used in XGBoost. Choices: ["auto", "exact", "approx", "hist", "gpu_exact", "gpu_hist"].
scale_pos_weight: Control the balance of positive and negative weights.
updater: A comma-separated string defining the sequence of tree updaters to run.
refresh_leaf: Refresh updater plugin. Updates tree leaf and node stats if True. When it is False, only node stats are updated.
process_type: A type of boosting process to run. Choices: ["default", "update"].
grow_policy: Controls the way new nodes are added to the tree. Only supported if tree_method is hist. Choices: ["depthwise", "lossguide"].
sampling_method: The method to use to sample the training instances.
monotone_constraints: Constraint of variable monotonicity.
interaction_constraints: Constraints for interaction representing permitted interactions.
sample_type: [dart booster only] Type of sampling algorithm. Choices: ["uniform", "weighted"].
normalize_type: [dart booster only] Type of normalization algorithm. Choices: ["tree", "forest"].
rate_drop: [dart booster only] Dropout rate.
one_drop: [dart booster only] When this flag is enabled, at least one tree is always dropped during the dropout (allows Binomial-plus-one or epsilon-dropout from the original DART paper).
skip_drop: [dart booster only] Probability of skipping the dropout procedure during a boosting iteration.
num_parallel_tree: Number of parallel trees constructed during each iteration. This option is used to support boosted random forest.
feature_selector: [linear booster only] Feature selection and ordering method.
top_k: The number of top features to select in the greedy and thrifty feature selectors. The value of 0 means using all the features.
max_cat_to_onehot: A threshold for deciding whether XGBoost should use one-hot-encoding-based splits for categorical data.
max_leaves: Maximum number of nodes to be added.
max_bin: Maximum number of discrete bins to bucket continuous features.
tweedie_variance_power: Parameter that controls the variance of the Tweedie distribution.
huber_slope: A parameter used for Pseudo-Huber loss to define the delta term.
dataset_level_custom_transformation_definitions: Dataset-level custom transformation definitions in string format.
dataset_level_transformations: Dataset-level transformation configuration in string format.
run_feature_selection: Whether to enable feature selection.
feature_selection_algorithm: Feature selection algorithm.
max_selected_features: Maximum number of features to select.
predefined_split_key: Predefined split key.
stratified_split_key: Stratified split key.
training_fraction: Training fraction.
validation_fraction: Validation fraction.
test_fraction: Test fraction.
tf_auto_transform_features: List of auto transform features in the comma-separated string format.
tf_custom_transformation_definitions: TF custom transformation definitions in string format.
tf_transformations_path: Path to TF transformation configuration.
data_source_csv_filenames: The CSV data source.
data_source_bigquery_table_path: The BigQuery data source.
bigquery_staging_full_dataset_id: The BigQuery staging full dataset id for storing intermediate tables.
weight_column: The weight column name.
training_machine_type: Machine type.
training_total_replica_count: Number of workers.
training_accelerator_type: Accelerator type.
training_accelerator_count: Accelerator count.
transform_dataflow_machine_type: The Dataflow machine type for the transform component.
transform_dataflow_max_num_workers: The max number of Dataflow workers for the transform component.
transform_dataflow_disk_size_gb: Dataflow worker's disk size in GB for the transform component.
run_evaluation: Whether to run evaluation steps during training.
evaluation_batch_predict_machine_type: The prediction server machine type for batch predict components during evaluation.
evaluation_batch_predict_starting_replica_count: The initial number of prediction servers for batch predict components during evaluation.
evaluation_batch_predict_max_replica_count: The max number of prediction servers for batch predict components during evaluation.
evaluation_dataflow_machine_type: The Dataflow machine type for evaluation components.
evaluation_dataflow_starting_num_workers: The initial number of Dataflow workers for evaluation components.
evaluation_dataflow_max_num_workers: The max number of Dataflow workers for evaluation components.
evaluation_dataflow_disk_size_gb: Dataflow worker's disk size in GB for evaluation components.
dataflow_service_account: Custom service account to run Dataflow jobs.
dataflow_subnetwork: Dataflow's fully qualified subnetwork name; when empty, the default subnetwork is used.
dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.
encryption_spec_key_name: The KMS key name.

Returns:

Tuple of pipeline_definition_path and parameter_values.
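
Example (a minimal sketch; the resource names and booster settings are placeholders):

    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    # Compile an XGBoost binary-classification training run.
    template_path, parameter_values = utils.get_xgboost_trainer_pipeline_and_parameters(
        project='my-project',
        location='us-central1',
        root_dir='gs://my-bucket/pipeline_root',
        target_column='label',
        objective='binary:logistic',
        eval_metric='auc',
        num_boost_round=200,
        max_depth=6,
        eta=0.3,
        data_source_bigquery_table_path='bq://my-project.my_dataset.training_table',
        run_evaluation=True,
    )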

google_cloud_pipeline_components.experimental.automl.tabular.utils.input_dictionary_to_parameter(input_dict: Dict[str, Any] | None) str

Convert json input dict to encoded parameter string.

This function is required because of a limitation of YAML component definitions: YAML has no keyword for applying quote escaping, so the quotes in the JSON argument must be escaped manually using this function.

Args:

input_dict: The input json dictionary.

Returns:

The encoded string used for parameter.
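
Example (a minimal sketch; the dictionary content is an arbitrary illustration, not a required schema):

    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    # Escape a JSON dictionary so it can be passed as a single string parameter
    # to a YAML-defined component.
    encoded = utils.input_dictionary_to_parameter({'key': 'value with "quotes"'})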

Module contents

Module for AutoML Tables KFP components.