google_cloud_pipeline_components.experimental.automl.tabular package
Submodules
google_cloud_pipeline_components.experimental.automl.tabular.utils module
Util functions for AutoML Tabular pipeline.
- google_cloud_pipeline_components.experimental.automl.tabular.utils.get_automl_tabular_feature_selection_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, optimization_objective: str, transformations: str, train_budget_milli_node_hours: float, stage_1_num_parallel_trials: int | None = None, stage_2_num_parallel_trials: int | None = None, stage_2_num_selected_trials: int | None = None, data_source_csv_filenames: str | None = None, data_source_bigquery_table_path: str | None = None, predefined_split_key: str | None = None, timestamp_split_key: str | None = None, stratified_split_key: str | None = None, training_fraction: float | None = None, validation_fraction: float | None = None, test_fraction: float | None = None, weight_column: str | None = None, study_spec_parameters_override: List[Dict[str, Any]] | None = None, optimization_objective_recall_value: float | None = None, optimization_objective_precision_value: float | None = None, stage_1_tuner_worker_pool_specs_override: Dict[str, Any] | None = None, cv_trainer_worker_pool_specs_override: Dict[str, Any] | None = None, export_additional_model_without_custom_ops: bool = False, stats_and_example_gen_dataflow_machine_type: str | None = None, stats_and_example_gen_dataflow_max_num_workers: int | None = None, stats_and_example_gen_dataflow_disk_size_gb: int | None = None, transform_dataflow_machine_type: str | None = None, transform_dataflow_max_num_workers: int | None = None, transform_dataflow_disk_size_gb: int | None = None, dataflow_subnetwork: str | None = None, dataflow_use_public_ips: bool = True, encryption_spec_key_name: str | None = None, additional_experiments: Dict[str, Any] | None = None, dataflow_service_account: str | None = None, run_evaluation: bool = True, evaluation_batch_predict_machine_type: str | None = None, evaluation_batch_predict_starting_replica_count: int | None = None, evaluation_batch_predict_max_replica_count: int | None = None, evaluation_batch_explain_machine_type: str | None = None, evaluation_batch_explain_starting_replica_count: int | None = None, evaluation_batch_explain_max_replica_count: int | None = None, evaluation_dataflow_machine_type: str | None = None, evaluation_dataflow_starting_num_workers: int | None = None, evaluation_dataflow_max_num_workers: int | None = None, evaluation_dataflow_disk_size_gb: int | None = None, max_selected_features: int = 1000, apply_feature_selection_tuning: bool = False, run_distillation: bool = False, distill_batch_predict_machine_type: str | None = None, distill_batch_predict_starting_replica_count: int | None = None, distill_batch_predict_max_replica_count: int | None = None) Tuple[str, Dict[str, Any]]
Get the AutoML Tabular v1 default training pipeline.
- Args:
- project: The GCP project that runs the pipeline components.
- location: The GCP region that runs the pipeline components.
- root_dir: The root GCS directory for the pipeline components.
- target_column: The target column name.
- prediction_type: The type of prediction the model is to produce:
“classification” or “regression”.
- optimization_objective: For binary classification, “maximize-au-roc”,
“minimize-log-loss”, “maximize-au-prc”, “maximize-precision-at-recall”, or “maximize-recall-at-precision”. For multi-class classification, “minimize-log-loss”. For regression, “minimize-rmse”, “minimize-mae”, or “minimize-rmsle”.
- transformations: The path to a GCS file containing the transformations to
apply.
- train_budget_milli_node_hours: The train budget for creating this model,
expressed in milli node hours, i.e. a value of 1,000 in this field means 1 node hour.
- stage_1_num_parallel_trials: Number of parallel trials for stage 1.
- stage_2_num_parallel_trials: Number of parallel trials for stage 2.
- stage_2_num_selected_trials: Number of selected trials for stage 2.
- data_source_csv_filenames: The CSV data source.
- data_source_bigquery_table_path: The BigQuery data source.
- predefined_split_key: The predefined_split column name.
- timestamp_split_key: The timestamp_split column name.
- stratified_split_key: The stratified_split column name.
- training_fraction: The training fraction.
- validation_fraction: The validation fraction.
- test_fraction: The test fraction.
- weight_column: The weight column name.
- study_spec_parameters_override: The list for overriding the study spec.
- optimization_objective_recall_value: Required when optimization_objective is
“maximize-precision-at-recall”. Must be between 0 and 1, inclusive.
- optimization_objective_precision_value: Required when optimization_objective
is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.
- stage_1_tuner_worker_pool_specs_override: The dictionary for overriding the
stage 1 tuner worker pool spec.
- cv_trainer_worker_pool_specs_override: The dictionary for overriding the
stage CV trainer worker pool spec.
- export_additional_model_without_custom_ops: Whether to export additional
model without custom TensorFlow operators.
- stats_and_example_gen_dataflow_machine_type: The dataflow machine type for
stats_and_example_gen component.
- stats_and_example_gen_dataflow_max_num_workers: The max number of Dataflow
workers for stats_and_example_gen component.
- stats_and_example_gen_dataflow_disk_size_gb: Dataflow worker’s disk size in
GB for stats_and_example_gen component.
- transform_dataflow_machine_type: The dataflow machine type for transform
component.
- transform_dataflow_max_num_workers: The max number of Dataflow workers for
transform component.
- transform_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for
transform component.
- dataflow_subnetwork: Dataflow’s fully qualified subnetwork name; when empty,
the default subnetwork will be used.
- dataflow_use_public_ips: Specifies whether Dataflow workers use public IP
addresses.
- encryption_spec_key_name: The KMS key name.
- additional_experiments: Use this field to configure private preview features.
- dataflow_service_account: Custom service account to run Dataflow jobs.
- run_evaluation: Whether to run evaluation in the training pipeline.
- evaluation_batch_predict_machine_type: The prediction server machine type
for batch predict components during evaluation.
- evaluation_batch_predict_starting_replica_count: The initial number of
prediction servers for batch predict components during evaluation.
- evaluation_batch_predict_max_replica_count: The max number of prediction
servers for batch predict components during evaluation.
- evaluation_batch_explain_machine_type: The prediction server machine type
for batch explain components during evaluation.
- evaluation_batch_explain_starting_replica_count: The initial number of
prediction servers for batch explain components during evaluation.
- evaluation_batch_explain_max_replica_count: The max number of prediction
servers for batch explain components during evaluation.
- evaluation_dataflow_machine_type: The dataflow machine type for evaluation
components.
- evaluation_dataflow_starting_num_workers: The initial number of Dataflow
workers for evaluation components.
- evaluation_dataflow_max_num_workers: The max number of Dataflow workers for
evaluation components.
- evaluation_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for
evaluation components.
- max_selected_features: Number of features to select for training.
- apply_feature_selection_tuning: Tune the feature selection rate if true.
- run_distillation: Whether to run distillation in the training pipeline.
- distill_batch_predict_machine_type: The prediction server machine type for
the batch predict component in model distillation.
- distill_batch_predict_starting_replica_count: The initial number of
prediction servers for the batch predict component in model distillation.
- distill_batch_predict_max_replica_count: The max number of prediction
servers for the batch predict component in model distillation.
- Returns:
Tuple of pipeline_definition_path and parameter_values.
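For illustration, a minimal sketch of assembling the required arguments for this helper. All concrete values (project id, bucket, column names) are hypothetical placeholders, not values from this reference:

```python
# Hypothetical placeholder arguments for
# get_automl_tabular_feature_selection_pipeline_and_parameters.
feature_selection_kwargs = {
    "project": "example-project",        # GCP project (placeholder)
    "location": "us-central1",           # GCP region (placeholder)
    "root_dir": "gs://example-bucket/pipeline_root",
    "target_column": "label",
    "prediction_type": "classification",
    "optimization_objective": "maximize-au-roc",
    "transformations": "gs://example-bucket/transformations.json",
    "train_budget_milli_node_hours": 1000,  # i.e. 1 node hour
    "max_selected_features": 500,           # optional: cap selected features
}

# With google-cloud-pipeline-components installed, the call would be:
# from google_cloud_pipeline_components.experimental.automl.tabular import utils
# template_path, parameter_values = (
#     utils.get_automl_tabular_feature_selection_pipeline_and_parameters(
#         **feature_selection_kwargs))
```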
- google_cloud_pipeline_components.experimental.automl.tabular.utils.get_automl_tabular_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, optimization_objective: str, transformations: str, train_budget_milli_node_hours: float, stage_1_num_parallel_trials: int | None = None, stage_2_num_parallel_trials: int | None = None, stage_2_num_selected_trials: int | None = None, data_source_csv_filenames: str | None = None, data_source_bigquery_table_path: str | None = None, predefined_split_key: str | None = None, timestamp_split_key: str | None = None, stratified_split_key: str | None = None, training_fraction: float | None = None, validation_fraction: float | None = None, test_fraction: float | None = None, weight_column: str | None = None, study_spec_parameters_override: List[Dict[str, Any]] | None = None, optimization_objective_recall_value: float | None = None, optimization_objective_precision_value: float | None = None, stage_1_tuner_worker_pool_specs_override: Dict[str, Any] | None = None, cv_trainer_worker_pool_specs_override: Dict[str, Any] | None = None, export_additional_model_without_custom_ops: bool = False, stats_and_example_gen_dataflow_machine_type: str | None = None, stats_and_example_gen_dataflow_max_num_workers: int | None = None, stats_and_example_gen_dataflow_disk_size_gb: int | None = None, transform_dataflow_machine_type: str | None = None, transform_dataflow_max_num_workers: int | None = None, transform_dataflow_disk_size_gb: int | None = None, dataflow_subnetwork: str | None = None, dataflow_use_public_ips: bool = True, encryption_spec_key_name: str | None = None, additional_experiments: Dict[str, Any] | None = None, dataflow_service_account: str | None = None, run_evaluation: bool = True, evaluation_batch_predict_machine_type: str | None = None, evaluation_batch_predict_starting_replica_count: int | None = None, evaluation_batch_predict_max_replica_count: int | None = None, evaluation_batch_explain_machine_type: str | None = None, evaluation_batch_explain_starting_replica_count: int | None = None, evaluation_batch_explain_max_replica_count: int | None = None, evaluation_dataflow_machine_type: str | None = None, evaluation_dataflow_starting_num_workers: int | None = None, evaluation_dataflow_max_num_workers: int | None = None, evaluation_dataflow_disk_size_gb: int | None = None, run_distillation: bool = False, distill_batch_predict_machine_type: str | None = None, distill_batch_predict_starting_replica_count: int | None = None, distill_batch_predict_max_replica_count: int | None = None, stage_1_tuning_result_artifact_uri: str | None = None, quantiles: List[float] | None = None, enable_probabilistic_inference: bool = False) Tuple[str, Dict[str, Any]]
Get the AutoML Tabular v1 default training pipeline.
- Args:
- project: The GCP project that runs the pipeline components.
- location: The GCP region that runs the pipeline components.
- root_dir: The root GCS directory for the pipeline components.
- target_column: The target column name.
- prediction_type: The type of prediction the model is to produce:
“classification” or “regression”.
- optimization_objective: For binary classification, “maximize-au-roc”,
“minimize-log-loss”, “maximize-au-prc”, “maximize-precision-at-recall”, or “maximize-recall-at-precision”. For multi-class classification, “minimize-log-loss”. For regression, “minimize-rmse”, “minimize-mae”, or “minimize-rmsle”.
- transformations: The path to a GCS file containing the transformations to
apply.
- train_budget_milli_node_hours: The train budget for creating this model,
expressed in milli node hours, i.e. a value of 1,000 in this field means 1 node hour.
- stage_1_num_parallel_trials: Number of parallel trials for stage 1.
- stage_2_num_parallel_trials: Number of parallel trials for stage 2.
- stage_2_num_selected_trials: Number of selected trials for stage 2.
- data_source_csv_filenames: The CSV data source.
- data_source_bigquery_table_path: The BigQuery data source.
- predefined_split_key: The predefined_split column name.
- timestamp_split_key: The timestamp_split column name.
- stratified_split_key: The stratified_split column name.
- training_fraction: The training fraction.
- validation_fraction: The validation fraction.
- test_fraction: The test fraction.
- weight_column: The weight column name.
- study_spec_parameters_override: The list for overriding the study spec.
- optimization_objective_recall_value: Required when optimization_objective is
“maximize-precision-at-recall”. Must be between 0 and 1, inclusive.
- optimization_objective_precision_value: Required when optimization_objective
is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.
- stage_1_tuner_worker_pool_specs_override: The dictionary for overriding the
stage 1 tuner worker pool spec.
- cv_trainer_worker_pool_specs_override: The dictionary for overriding the
stage CV trainer worker pool spec.
- export_additional_model_without_custom_ops: Whether to export additional
model without custom TensorFlow operators.
- stats_and_example_gen_dataflow_machine_type: The dataflow machine type for
stats_and_example_gen component.
- stats_and_example_gen_dataflow_max_num_workers: The max number of Dataflow
workers for stats_and_example_gen component.
- stats_and_example_gen_dataflow_disk_size_gb: Dataflow worker’s disk size in
GB for stats_and_example_gen component.
- transform_dataflow_machine_type: The dataflow machine type for transform
component.
- transform_dataflow_max_num_workers: The max number of Dataflow workers for
transform component.
- transform_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for
transform component.
- dataflow_subnetwork: Dataflow’s fully qualified subnetwork name; when empty,
the default subnetwork will be used.
- dataflow_use_public_ips: Specifies whether Dataflow workers use public IP
addresses.
- encryption_spec_key_name: The KMS key name.
- additional_experiments: Use this field to configure private preview features.
- dataflow_service_account: Custom service account to run Dataflow jobs.
- run_evaluation: Whether to run evaluation in the training pipeline.
- evaluation_batch_predict_machine_type: The prediction server machine type
for batch predict components during evaluation.
- evaluation_batch_predict_starting_replica_count: The initial number of
prediction servers for batch predict components during evaluation.
- evaluation_batch_predict_max_replica_count: The max number of prediction
servers for batch predict components during evaluation.
- evaluation_batch_explain_machine_type: The prediction server machine type
for batch explain components during evaluation.
- evaluation_batch_explain_starting_replica_count: The initial number of
prediction servers for batch explain components during evaluation.
- evaluation_batch_explain_max_replica_count: The max number of prediction
servers for batch explain components during evaluation.
- evaluation_dataflow_machine_type: The dataflow machine type for evaluation
components.
- evaluation_dataflow_starting_num_workers: The initial number of Dataflow
workers for evaluation components.
- evaluation_dataflow_max_num_workers: The max number of Dataflow workers for
evaluation components.
- evaluation_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for
evaluation components.
- run_distillation: Whether to run distillation in the training pipeline.
- distill_batch_predict_machine_type: The prediction server machine type for
the batch predict component in model distillation.
- distill_batch_predict_starting_replica_count: The initial number of
prediction servers for the batch predict component in model distillation.
- distill_batch_predict_max_replica_count: The max number of prediction
servers for the batch predict component in model distillation.
- stage_1_tuning_result_artifact_uri: The stage 1 tuning result artifact GCS
URI.
- quantiles: Quantiles to use for probabilistic inference. Up to 5 quantiles
are allowed, with values between 0 and 1, exclusive. Represents the quantiles to use for that objective. Quantiles must be unique.
- enable_probabilistic_inference: If probabilistic inference is enabled, the
model will fit a distribution that captures the uncertainty of a prediction. At inference time, the predictive distribution is used to make a point prediction that minimizes the optimization objective. For example, the mean of a predictive distribution is the point prediction that minimizes RMSE loss. If quantiles are specified, then the quantiles of the distribution are also returned.
- Returns:
Tuple of pipeline_definition_path and parameter_values.
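As a sketch, the documented constraints on quantiles (at most 5 unique values, each strictly between 0 and 1) can be checked before building the arguments. All concrete values below are hypothetical placeholders:

```python
# Helper mirroring the documented constraints on `quantiles`.
def validate_quantiles(quantiles):
    assert len(quantiles) <= 5, "up to 5 quantiles are allowed"
    assert len(set(quantiles)) == len(quantiles), "quantiles must be unique"
    assert all(0.0 < q < 1.0 for q in quantiles), "values must be in (0, 1), exclusive"
    return quantiles

# Hypothetical placeholder arguments for a regression pipeline with
# probabilistic inference enabled:
pipeline_kwargs = {
    "project": "example-project",
    "location": "us-central1",
    "root_dir": "gs://example-bucket/pipeline_root",
    "target_column": "target",
    "prediction_type": "regression",
    "optimization_objective": "minimize-rmse",
    "transformations": "gs://example-bucket/transformations.json",
    "train_budget_milli_node_hours": 1000,
    "enable_probabilistic_inference": True,
    "quantiles": validate_quantiles([0.1, 0.5, 0.9]),
}

# With google-cloud-pipeline-components installed, the call would be:
# from google_cloud_pipeline_components.experimental.automl.tabular import utils
# template_path, parameter_values = (
#     utils.get_automl_tabular_pipeline_and_parameters(**pipeline_kwargs))
```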
- google_cloud_pipeline_components.experimental.automl.tabular.utils.get_builtin_algorithm_hyperparameter_tuning_job_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, study_spec_metric_id: str, study_spec_metric_goal: str, study_spec_parameters_override: List[Dict[str, Any]], max_trial_count: int, parallel_trial_count: int, algorithm: str, enable_profiler: bool = False, seed: int = 1, eval_steps: int = 0, eval_frequency_secs: int = 600, transform_config: str | None = None, dataset_level_custom_transformation_definitions: List[Dict[str, Any]] | None = None, dataset_level_transformations: List[Dict[str, Any]] | None = None, predefined_split_key: str | None = None, stratified_split_key: str | None = None, training_fraction: float | None = None, validation_fraction: float | None = None, test_fraction: float | None = None, tf_auto_transform_features: List[str] | None = None, tf_custom_transformation_definitions: List[Dict[str, Any]] | None = None, tf_transformations_path: str | None = None, data_source_csv_filenames: str | None = None, data_source_bigquery_table_path: str | None = None, bigquery_staging_full_dataset_id: str | None = None, weight_column: str = '', max_failed_trial_count: int = 0, study_spec_algorithm: str = 'ALGORITHM_UNSPECIFIED', study_spec_measurement_selection_type: str = 'BEST_MEASUREMENT', transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, worker_pool_specs_override: Dict[str, Any] | None = None, run_evaluation: bool = True, evaluation_batch_predict_machine_type: str = 'n1-highmem-8', evaluation_batch_predict_starting_replica_count: int = 20, evaluation_batch_predict_max_replica_count: int = 20, evaluation_dataflow_machine_type: str = 'n1-standard-4', evaluation_dataflow_starting_num_workers: int = 10, evaluation_dataflow_max_num_workers: int = 100, evaluation_dataflow_disk_size_gb: int = 50, dataflow_service_account: str = '', dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '') Tuple[str, Dict[str, Any]]
Get the built-in algorithm HyperparameterTuningJob pipeline.
- Args:
- project: The GCP project that runs the pipeline components.
- location: The GCP region that runs the pipeline components.
- root_dir: The root GCS directory for the pipeline components.
- target_column: The target column name.
- prediction_type: The type of prediction the model is to produce:
“classification” or “regression”.
- study_spec_metric_id: Metric to optimize, possible values: [ ‘loss’,
‘average_loss’, ‘rmse’, ‘mae’, ‘mql’, ‘accuracy’, ‘auc’, ‘precision’, ‘recall’].
- study_spec_metric_goal: Optimization goal of the metric, possible values:
“MAXIMIZE”, “MINIMIZE”.
- study_spec_parameters_override: List of dictionaries representing parameters
to optimize. The dictionary key is the parameter_id, which is passed to training job as a command line argument, and the dictionary value is the parameter specification of the metric.
- max_trial_count: The desired total number of trials.
- parallel_trial_count: The desired number of trials to run in parallel.
- algorithm: Algorithm to train. One of “tabnet” and “wide_and_deep”.
- enable_profiler: Enables profiling and saves a trace during evaluation.
- seed: Seed to be used for this run.
- eval_steps: Number of steps to run evaluation for. If not specified or
negative, evaluation runs on the whole validation dataset. If set to 0, evaluation runs for a fixed number of samples.
- eval_frequency_secs: Frequency at which evaluation and checkpointing will
take place.
- transform_config: Path to the v1 TF transformation configuration.
- dataset_level_custom_transformation_definitions: Dataset-level custom
transformation definitions in string format.
- dataset_level_transformations: Dataset-level transformation configuration in
string format.
- predefined_split_key: Predefined split key.
- stratified_split_key: Stratified split key.
- training_fraction: Training fraction.
- validation_fraction: Validation fraction.
- test_fraction: Test fraction.
- tf_auto_transform_features: List of auto transform features in the
comma-separated string format.
- tf_custom_transformation_definitions: TF custom transformation definitions
in string format.
- tf_transformations_path: Path to the TF transformation configuration.
- data_source_csv_filenames: The CSV data source.
- data_source_bigquery_table_path: The BigQuery data source.
- bigquery_staging_full_dataset_id: The BigQuery staging full dataset id for
storing intermediate tables.
- weight_column: The weight column name.
- max_failed_trial_count: The number of failed trials that need to be seen
before failing the HyperparameterTuningJob. If set to 0, Vertex AI decides how many trials must fail before the whole job fails.
- study_spec_algorithm: The search algorithm specified for the study. One of
“ALGORITHM_UNSPECIFIED”, “GRID_SEARCH”, or “RANDOM_SEARCH”.
- study_spec_measurement_selection_type: Which measurement to use if/when the
service automatically selects the final measurement from previously reported intermediate measurements. One of “BEST_MEASUREMENT” or “LAST_MEASUREMENT”.
- transform_dataflow_machine_type: The dataflow machine type for transform
component.
- transform_dataflow_max_num_workers: The max number of Dataflow workers for
transform component.
- transform_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for
transform component.
- worker_pool_specs_override: The dictionary for overriding the training and
evaluation worker pool specs.
- run_evaluation: Whether to run evaluation steps during training.
- evaluation_batch_predict_machine_type: The prediction server machine type
for batch predict components during evaluation.
- evaluation_batch_predict_starting_replica_count: The initial number of
prediction servers for batch predict components during evaluation.
- evaluation_batch_predict_max_replica_count: The max number of prediction
servers for batch predict components during evaluation.
- evaluation_dataflow_machine_type: The dataflow machine type for evaluation
components.
- evaluation_dataflow_starting_num_workers: The initial number of Dataflow
workers for evaluation components.
- evaluation_dataflow_max_num_workers: The max number of Dataflow workers for
evaluation components.
- evaluation_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for
evaluation components.
- dataflow_service_account: Custom service account to run Dataflow jobs.
- dataflow_subnetwork: Dataflow’s fully qualified subnetwork name; when empty,
the default subnetwork will be used.
- dataflow_use_public_ips: Specifies whether Dataflow workers use public IP
addresses.
encryption_spec_key_name: The KMS key name.
- Returns:
Tuple of pipeline_definition_path and parameter_values.
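For illustration, a sketch of assembling the required arguments, including a study_spec_parameters_override entry. The parameter specification fields shown (double_value_spec, scale_type) follow the Vertex AI study spec schema but are illustrative assumptions here, as are all concrete values:

```python
# Hypothetical study spec override: each entry names a parameter_id (passed to
# the training job as a command line argument) and its parameter specification.
study_spec_parameters_override = [
    {
        "parameter_id": "learning_rate",
        "double_value_spec": {"min_value": 1e-4, "max_value": 1e-1},
        "scale_type": "UNIT_LOG_SCALE",
    },
]

# Hypothetical placeholder arguments for the HyperparameterTuningJob pipeline.
hpt_kwargs = {
    "project": "example-project",
    "location": "us-central1",
    "root_dir": "gs://example-bucket/pipeline_root",
    "target_column": "label",
    "prediction_type": "classification",
    "study_spec_metric_id": "auc",
    "study_spec_metric_goal": "MAXIMIZE",
    "study_spec_parameters_override": study_spec_parameters_override,
    "max_trial_count": 20,
    "parallel_trial_count": 5,
    "algorithm": "tabnet",
}

# With google-cloud-pipeline-components installed, the call would be:
# from google_cloud_pipeline_components.experimental.automl.tabular import utils
# template_path, parameter_values = (
#     utils.get_builtin_algorithm_hyperparameter_tuning_job_pipeline_and_parameters(
#         **hpt_kwargs))
```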
- google_cloud_pipeline_components.experimental.automl.tabular.utils.get_default_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column_name: str, prediction_type: str, optimization_objective: str, transformations: Dict[str, Any], split_spec: Dict[str, Any], data_source: Dict[str, Any], train_budget_milli_node_hours: float, stage_1_num_parallel_trials: int = 35, stage_2_num_parallel_trials: int = 35, stage_2_num_selected_trials: int = 5, weight_column_name: str = '', study_spec_override: Dict[str, Any] | None = None, optimization_objective_recall_value: float = -1, optimization_objective_precision_value: float = -1, stage_1_tuner_worker_pool_specs_override: Dict[str, Any] | None = None, cv_trainer_worker_pool_specs_override: Dict[str, Any] | None = None, export_additional_model_without_custom_ops: bool = False, stats_and_example_gen_dataflow_machine_type: str = 'n1-standard-16', stats_and_example_gen_dataflow_max_num_workers: int = 25, stats_and_example_gen_dataflow_disk_size_gb: int = 40, transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '', additional_experiments: Dict[str, Any] | None = None, dataflow_service_account: str = '', run_evaluation: bool = True, evaluation_batch_predict_machine_type: str = 'n1-highmem-8', evaluation_batch_predict_starting_replica_count: int = 20, evaluation_batch_predict_max_replica_count: int = 20, evaluation_dataflow_machine_type: str = 'n1-standard-4', evaluation_dataflow_max_num_workers: int = 100, evaluation_dataflow_disk_size_gb: int = 50, run_distillation: bool = False, distill_batch_predict_machine_type: str = 'n1-standard-16', distill_batch_predict_starting_replica_count: int = 25, distill_batch_predict_max_replica_count: int = 25) Tuple[str, Dict[str, Any]]
Get the AutoML Tabular default training pipeline.
- Args:
- project: The GCP project that runs the pipeline components.
- location: The GCP region that runs the pipeline components.
- root_dir: The root GCS directory for the pipeline components.
- target_column_name: The target column name.
- prediction_type: The type of prediction the model is to produce:
“classification” or “regression”.
- optimization_objective: For binary classification, “maximize-au-roc”,
“minimize-log-loss”, “maximize-au-prc”, “maximize-precision-at-recall”, or “maximize-recall-at-precision”. For multi-class classification, “minimize-log-loss”. For regression, “minimize-rmse”, “minimize-mae”, or “minimize-rmsle”.
- transformations: The transformations to apply.
- split_spec: The split spec.
- data_source: The data source.
- train_budget_milli_node_hours: The train budget for creating this model,
expressed in milli node hours, i.e. a value of 1,000 in this field means 1 node hour.
- stage_1_num_parallel_trials: Number of parallel trials for stage 1.
- stage_2_num_parallel_trials: Number of parallel trials for stage 2.
- stage_2_num_selected_trials: Number of selected trials for stage 2.
- weight_column_name: The weight column name.
- study_spec_override: The dictionary for overriding the study spec.
- optimization_objective_recall_value: Required when optimization_objective is
“maximize-precision-at-recall”. Must be between 0 and 1, inclusive.
- optimization_objective_precision_value: Required when optimization_objective
is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.
- stage_1_tuner_worker_pool_specs_override: The dictionary for overriding the
stage 1 tuner worker pool spec.
- cv_trainer_worker_pool_specs_override: The dictionary for overriding the
stage CV trainer worker pool spec.
- export_additional_model_without_custom_ops: Whether to export additional
model without custom TensorFlow operators.
- stats_and_example_gen_dataflow_machine_type: The dataflow machine type for
stats_and_example_gen component.
- stats_and_example_gen_dataflow_max_num_workers: The max number of Dataflow
workers for stats_and_example_gen component.
- stats_and_example_gen_dataflow_disk_size_gb: Dataflow worker’s disk size in
GB for stats_and_example_gen component.
- transform_dataflow_machine_type: The dataflow machine type for transform
component.
- transform_dataflow_max_num_workers: The max number of Dataflow workers for
transform component.
- transform_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for
transform component.
- dataflow_subnetwork: Dataflow’s fully qualified subnetwork name; when empty,
the default subnetwork will be used.
- dataflow_use_public_ips: Specifies whether Dataflow workers use public IP
addresses.
- encryption_spec_key_name: The KMS key name.
- additional_experiments: Use this field to configure private preview features.
- dataflow_service_account: Custom service account to run Dataflow jobs.
- run_evaluation: Whether to run evaluation in the training pipeline.
- evaluation_batch_predict_machine_type: The prediction server machine type
for batch predict components during evaluation.
- evaluation_batch_predict_starting_replica_count: The initial number of
prediction servers for batch predict components during evaluation.
- evaluation_batch_predict_max_replica_count: The max number of prediction
servers for batch predict components during evaluation.
- evaluation_dataflow_machine_type: The dataflow machine type for evaluation
components.
- evaluation_dataflow_max_num_workers: The max number of Dataflow workers for
evaluation components.
- evaluation_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for
evaluation components.
- run_distillation: Whether to run distillation in the training pipeline.
- distill_batch_predict_machine_type: The prediction server machine type for
the batch predict component in model distillation.
- distill_batch_predict_starting_replica_count: The initial number of
prediction servers for the batch predict component in model distillation.
- distill_batch_predict_max_replica_count: The max number of prediction
servers for the batch predict component in model distillation.
- Returns:
Tuple of pipeline_definition_path and parameter_values.
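A sketch of the scalar arguments for this helper; the transformations, split_spec, and data_source dictionary schemas are not reproduced in this reference, so they are left elided below. All concrete values are hypothetical placeholders:

```python
# Hypothetical placeholder arguments for get_default_pipeline_and_parameters.
default_kwargs = {
    "project": "example-project",
    "location": "us-central1",
    "root_dir": "gs://example-bucket/pipeline_root",
    "target_column_name": "label",
    "prediction_type": "regression",
    "optimization_objective": "minimize-rmse",
    "train_budget_milli_node_hours": 2000,  # i.e. 2 node hours
    "run_distillation": True,               # also produce a distilled model
}

# With google-cloud-pipeline-components installed, the call would be:
# template_path, parameter_values = utils.get_default_pipeline_and_parameters(
#     transformations=..., split_spec=..., data_source=..., **default_kwargs)
# The returned template path and parameter values can then be submitted as a
# Vertex AI pipeline run.
```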
- google_cloud_pipeline_components.experimental.automl.tabular.utils.get_distill_skip_evaluation_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column_name: str, prediction_type: str, optimization_objective: str, transformations: Dict[str, Any], split_spec: Dict[str, Any], data_source: Dict[str, Any], train_budget_milli_node_hours: float, stage_1_num_parallel_trials: int = 35, stage_2_num_parallel_trials: int = 35, stage_2_num_selected_trials: int = 5, weight_column_name: str = '', study_spec_override: Dict[str, Any] | None = None, optimization_objective_recall_value: float = -1, optimization_objective_precision_value: float = -1, stage_1_tuner_worker_pool_specs_override: Dict[str, Any] | None = None, cv_trainer_worker_pool_specs_override: Dict[str, Any] | None = None, export_additional_model_without_custom_ops: bool = False, stats_and_example_gen_dataflow_machine_type: str = 'n1-standard-16', stats_and_example_gen_dataflow_max_num_workers: int = 25, stats_and_example_gen_dataflow_disk_size_gb: int = 40, transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '', additional_experiments: Dict[str, Any] | None = None, distill_batch_predict_machine_type: str = 'n1-standard-16', distill_batch_predict_starting_replica_count: int = 25, distill_batch_predict_max_replica_count: int = 25) Tuple[str, Dict[str, Any]]
Get the AutoML Tabular training pipeline that distills and skips evaluation.
- Args:
- project: The GCP project that runs the pipeline components.
- location: The GCP region that runs the pipeline components.
- root_dir: The root GCS directory for the pipeline components.
- target_column_name: The target column name.
- prediction_type: The type of prediction the model is to produce:
“classification” or “regression”.
- optimization_objective: For binary classification, “maximize-au-roc”,
“minimize-log-loss”, “maximize-au-prc”, “maximize-precision-at-recall”, or “maximize-recall-at-precision”. For multi-class classification, “minimize-log-loss”. For regression, “minimize-rmse”, “minimize-mae”, or “minimize-rmsle”.
transformations: The transformations to apply. split_spec: The split spec. data_source: The data source. train_budget_milli_node_hours: The train budget of creating this model,
expressed in milli node hours i.e. 1,000 value in this field means 1 node hour.
stage_1_num_parallel_trials: Number of parallel trails for stage 1. stage_2_num_parallel_trials: Number of parallel trails for stage 2. stage_2_num_selected_trials: Number of selected trials for stage 2. weight_column_name: The weight column name. study_spec_override: The dictionary for overriding study spec. The
- optimization_objective_recall_value: Required when optimization_objective is
“maximize-precision-at-recall”. Must be between 0 and 1, inclusive.
- optimization_objective_precision_value: Required when optimization_objective
is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.
- stage_1_tuner_worker_pool_specs_override: The dictionary for overriding.
- stage 1 tuner worker pool spec. The dictionary should be of format
- cv_trainer_worker_pool_specs_override: The dictionary for overriding stage
- cv trainer worker pool spec. The dictionary should be of format
- export_additional_model_without_custom_ops: Whether to export additional
model without custom TensorFlow operators.
- stats_and_example_gen_dataflow_machine_type: The dataflow machine type for
stats_and_example_gen component.
- stats_and_example_gen_dataflow_max_num_workers: The max number of Dataflow
workers for stats_and_example_gen component.
- stats_and_example_gen_dataflow_disk_size_gb: Dataflow worker’s disk size in
GB for stats_and_example_gen component.
- transform_dataflow_machine_type: The dataflow machine type for transform
component.
- transform_dataflow_max_num_workers: The max number of Dataflow workers for
transform component.
- transform_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for
transform component.
- dataflow_subnetwork: Dataflow’s fully qualified subnetwork name, when empty
- the default subnetwork will be used. Example:
- dataflow_use_public_ips: Specifies whether Dataflow workers use public IP
addresses.
encryption_spec_key_name: The KMS key name. additional_experiments: Use this field to config private preview features. distill_batch_predict_machine_type: The prediction server machine type for
batch predict component in the model distillation.
- distill_batch_predict_starting_replica_count: The initial number of
prediction server for batch predict component in the model distillation.
- distill_batch_predict_max_replica_count: The max number of prediction server
for batch predict component in the model distillation.
- Returns:
Tuple of pipeline_definiton_path and parameter_values.
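A minimal invocation sketch, assuming the google-cloud-pipeline-components package is installed. All project, bucket, and column values are hypothetical, and the dict shapes for transformations, split_spec, and data_source are illustrative assumptions rather than the exact schemas:

```python
# Hypothetical values throughout; replace with your own project resources.
pipeline_kwargs = {
    "project": "my-gcp-project",
    "location": "us-central1",
    "root_dir": "gs://my-bucket/pipeline_root",
    "target_column_name": "churned",
    "prediction_type": "classification",
    "optimization_objective": "maximize-au-roc",
    # Assumed dict shapes, for illustration only:
    "transformations": {"auto": [{"column_name": "tenure"}]},
    "split_spec": {"fraction_split": {"training_fraction": 0.8,
                                      "validation_fraction": 0.1,
                                      "test_fraction": 0.1}},
    "data_source": {"csv_data_source": {"csv_filenames": ["gs://my-bucket/train.csv"]}},
    "train_budget_milli_node_hours": 1000,  # 1 node hour
}

try:
    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    template_path, parameter_values = (
        utils.get_distill_skip_evaluation_pipeline_and_parameters(**pipeline_kwargs)
    )
except ImportError:
    # Package not installed in this environment.
    template_path = parameter_values = None
```

The returned tuple is meant to be passed to a Vertex AI pipeline run, e.g. as the `template_path` and `parameter_values` arguments of `aiplatform.PipelineJob`.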
- google_cloud_pipeline_components.experimental.automl.tabular.utils.get_feature_selection_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, algorithm: str, prediction_type: str, data_source_csv_filenames: str | None = None, data_source_bigquery_table_path: str | None = None, max_selected_features: int | None = None, dataflow_machine_type: str = 'n1-standard-16', dataflow_max_num_workers: int = 25, dataflow_disk_size_gb: int = 40, dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, dataflow_service_account: str = '')
Get the feature selection pipeline that generates feature ranking and selected features.
- Args:
- project: The GCP project that runs the pipeline components.
- location: The GCP region that runs the pipeline components.
- root_dir: The root GCS directory for the pipeline components.
- target_column: The target column name.
- algorithm: Algorithm to select features; defaults to AMI.
- prediction_type: The type of prediction the model is to produce. “classification” or “regression”.
- data_source_csv_filenames: A string that represents a list of comma-separated CSV filenames.
- data_source_bigquery_table_path: The BigQuery table path.
- max_selected_features: Number of features to be selected.
- dataflow_machine_type: The Dataflow machine type for the feature_selection component.
- dataflow_max_num_workers: The max number of Dataflow workers for the feature_selection component.
- dataflow_disk_size_gb: Dataflow worker’s disk size in GB for the feature_selection component.
- dataflow_subnetwork: Dataflow’s fully qualified subnetwork name; when empty, the default subnetwork will be used.
- dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.
- dataflow_service_account: Custom service account to run Dataflow jobs.
- Returns:
Tuple of pipeline_definition_path and parameter_values.
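A short sketch of calling the feature selection helper, assuming the package is installed; the project, bucket, and column names are hypothetical:

```python
# Hypothetical arguments for the feature selection pipeline helper.
fs_kwargs = dict(
    project="my-gcp-project",
    location="us-central1",
    root_dir="gs://my-bucket/pipeline_root",
    target_column="churned",
    algorithm="AMI",                       # the documented default algorithm
    prediction_type="classification",
    data_source_csv_filenames="gs://my-bucket/train.csv",
    max_selected_features=100,
)

try:
    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    pipeline_definition_path, parameter_values = (
        utils.get_feature_selection_pipeline_and_parameters(**fs_kwargs)
    )
except ImportError:
    # Package not installed in this environment.
    pipeline_definition_path = parameter_values = None
```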
- google_cloud_pipeline_components.experimental.automl.tabular.utils.get_model_comparison_pipeline_and_parameters(project: str, location: str, root_dir: str, prediction_type: str, training_jobs: Dict[str, Dict[str, Any]], data_source_csv_filenames: str = '', data_source_bigquery_table_path: str = '', evaluation_data_source_csv_filenames: str = '', evaluation_data_source_bigquery_table_path: str = '', experiment: str = '', service_account: str = '', network: str = '') Tuple[str, Dict[str, Any]]
Returns a compiled model comparison pipeline and formatted parameters.
- Args:
- project: The GCP project that runs the pipeline components.
- location: The GCP region that runs the pipeline components.
- root_dir: The root GCS directory for the pipeline components.
- prediction_type: The type of problem being solved. Can be one of: regression, classification, or forecasting.
- training_jobs: A dict mapping name to a dict of training job inputs.
- data_source_csv_filenames: Comma-separated paths to CSVs stored in GCS to use as the training dataset for all training pipelines. This should be None if data_source_bigquery_table_path is not None. This should only contain data from the training and validation split and not from the test split.
- data_source_bigquery_table_path: Path to the BigQuery table to use as the training dataset for all training pipelines. This should be None if data_source_csv_filenames is not None. This should only contain data from the training and validation split and not from the test split.
- evaluation_data_source_csv_filenames: Comma-separated paths to CSVs stored in GCS to use as the evaluation dataset for all training pipelines. This should be None if evaluation_data_source_bigquery_table_path is not None. This should only contain data from the test split and not from the training and validation split.
- evaluation_data_source_bigquery_table_path: Path to the BigQuery table to use as the evaluation dataset for all training pipelines. This should be None if evaluation_data_source_csv_filenames is not None. This should only contain data from the test split and not from the training and validation split.
- experiment: Vertex Experiment to add training pipeline runs to. A new Experiment will be created if none is provided.
- service_account: Specifies the service account for the sub-pipeline jobs.
- network: The full name of the Compute Engine network to which the sub-pipeline jobs should be peered.
- Returns:
Tuple of pipeline_definition_path and parameter_values.
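A hedged sketch of a model comparison call, assuming the package is installed. The training job names and their input dicts are illustrative assumptions; consult the training pipeline helpers above for the actual inputs each job type accepts:

```python
# Hypothetical mapping of job names to training job inputs (shapes assumed).
training_jobs = {
    "automl": {"optimization_objective": "maximize-au-roc"},
    "tabnet": {"learning_rate": 0.01},
}

comparison_kwargs = dict(
    project="my-gcp-project",
    location="us-central1",
    root_dir="gs://my-bucket/pipeline_root",
    prediction_type="classification",
    training_jobs=training_jobs,
    # Training/validation data only; test split goes in the evaluation source.
    data_source_csv_filenames="gs://my-bucket/train.csv",
    evaluation_data_source_csv_filenames="gs://my-bucket/test.csv",
)

try:
    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    pipeline_definition_path, parameter_values = (
        utils.get_model_comparison_pipeline_and_parameters(**comparison_kwargs)
    )
except ImportError:
    pipeline_definition_path = parameter_values = None
```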
- google_cloud_pipeline_components.experimental.automl.tabular.utils.get_skip_architecture_search_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, optimization_objective: str, transformations: str, train_budget_milli_node_hours: float, stage_1_tuning_result_artifact_uri: str, stage_2_num_parallel_trials: int | None = None, stage_2_num_selected_trials: int | None = None, data_source_csv_filenames: str | None = None, data_source_bigquery_table_path: str | None = None, predefined_split_key: str | None = None, timestamp_split_key: str | None = None, stratified_split_key: str | None = None, training_fraction: float | None = None, validation_fraction: float | None = None, test_fraction: float | None = None, weight_column: str | None = None, optimization_objective_recall_value: float | None = None, optimization_objective_precision_value: float | None = None, cv_trainer_worker_pool_specs_override: Dict[str, Any] | None = None, export_additional_model_without_custom_ops: bool = False, stats_and_example_gen_dataflow_machine_type: str | None = None, stats_and_example_gen_dataflow_max_num_workers: int | None = None, stats_and_example_gen_dataflow_disk_size_gb: int | None = None, transform_dataflow_machine_type: str | None = None, transform_dataflow_max_num_workers: int | None = None, transform_dataflow_disk_size_gb: int | None = None, dataflow_subnetwork: str | None = None, dataflow_use_public_ips: bool = True, encryption_spec_key_name: str | None = None, additional_experiments: Dict[str, Any] | None = None, dataflow_service_account: str | None = None, run_evaluation: bool = True, evaluation_batch_predict_machine_type: str | None = None, evaluation_batch_predict_starting_replica_count: int | None = None, evaluation_batch_predict_max_replica_count: int | None = None, evaluation_batch_explain_machine_type: str | None = None, evaluation_batch_explain_starting_replica_count: int | None = None, 
evaluation_batch_explain_max_replica_count: int | None = None, evaluation_dataflow_machine_type: str | None = None, evaluation_dataflow_starting_num_workers: int | None = None, evaluation_dataflow_max_num_workers: int | None = None, evaluation_dataflow_disk_size_gb: int | None = None) Tuple[str, Dict[str, Any]]
Get the AutoML Tabular training pipeline that skips architecture search.
- Args:
- project: The GCP project that runs the pipeline components.
- location: The GCP region that runs the pipeline components.
- root_dir: The root GCS directory for the pipeline components.
- target_column: The target column name.
- prediction_type: The type of prediction the model is to produce. “classification” or “regression”.
- optimization_objective: For binary classification, “maximize-au-roc”, “minimize-log-loss”, “maximize-au-prc”, “maximize-precision-at-recall”, or “maximize-recall-at-precision”. For multi-class classification, “minimize-log-loss”. For regression, “minimize-rmse”, “minimize-mae”, or “minimize-rmsle”.
- transformations: The transformations to apply.
- train_budget_milli_node_hours: The train budget of creating this model, expressed in milli node hours, i.e. a value of 1,000 in this field means 1 node hour.
- stage_1_tuning_result_artifact_uri: The stage 1 tuning result artifact GCS URI.
- stage_2_num_parallel_trials: Number of parallel trials for stage 2.
- stage_2_num_selected_trials: Number of selected trials for stage 2.
- data_source_csv_filenames: The CSV data source.
- data_source_bigquery_table_path: The BigQuery data source.
- predefined_split_key: The predefined_split column name.
- timestamp_split_key: The timestamp_split column name.
- stratified_split_key: The stratified_split column name.
- training_fraction: The training fraction.
- validation_fraction: The validation fraction.
- test_fraction: The test fraction.
- weight_column: The weight column name.
- optimization_objective_recall_value: Required when optimization_objective is “maximize-precision-at-recall”. Must be between 0 and 1, inclusive.
- optimization_objective_precision_value: Required when optimization_objective is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.
- cv_trainer_worker_pool_specs_override: The dictionary for overriding the stage cv trainer worker pool spec.
- export_additional_model_without_custom_ops: Whether to export an additional model without custom TensorFlow operators.
- stats_and_example_gen_dataflow_machine_type: The Dataflow machine type for the stats_and_example_gen component.
- stats_and_example_gen_dataflow_max_num_workers: The max number of Dataflow workers for the stats_and_example_gen component.
- stats_and_example_gen_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for the stats_and_example_gen component.
- transform_dataflow_machine_type: The Dataflow machine type for the transform component.
- transform_dataflow_max_num_workers: The max number of Dataflow workers for the transform component.
- transform_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for the transform component.
- dataflow_subnetwork: Dataflow’s fully qualified subnetwork name; when empty, the default subnetwork will be used.
- dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.
- encryption_spec_key_name: The KMS key name.
- additional_experiments: Use this field to configure private preview features.
- dataflow_service_account: Custom service account to run Dataflow jobs.
- run_evaluation: Whether to run evaluation in the training pipeline.
- evaluation_batch_predict_machine_type: The prediction server machine type for batch predict components during evaluation.
- evaluation_batch_predict_starting_replica_count: The initial number of prediction servers for batch predict components during evaluation.
- evaluation_batch_predict_max_replica_count: The max number of prediction servers for batch predict components during evaluation.
- evaluation_batch_explain_machine_type: The prediction server machine type for batch explain components during evaluation.
- evaluation_batch_explain_starting_replica_count: The initial number of prediction servers for batch explain components during evaluation.
- evaluation_batch_explain_max_replica_count: The max number of prediction servers for batch explain components during evaluation.
- evaluation_dataflow_machine_type: The Dataflow machine type for evaluation components.
- evaluation_dataflow_starting_num_workers: The initial number of Dataflow workers for evaluation components.
- evaluation_dataflow_max_num_workers: The max number of Dataflow workers for evaluation components.
- evaluation_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for evaluation components.
- Returns:
Tuple of pipeline_definition_path and parameter_values.
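A sketch of reusing a previous stage 1 tuning result to skip architecture search, assuming the package is installed; the artifact URI, data paths, and transformations path are hypothetical (note that `transformations` is a string path for this function):

```python
# Hypothetical reuse of an earlier run's stage 1 tuning result artifact.
skip_search_kwargs = dict(
    project="my-gcp-project",
    location="us-central1",
    root_dir="gs://my-bucket/pipeline_root",
    target_column="churned",
    prediction_type="classification",
    optimization_objective="maximize-au-roc",
    transformations="gs://my-bucket/transformations.json",  # string path here
    train_budget_milli_node_hours=1000,
    stage_1_tuning_result_artifact_uri="gs://my-bucket/stage_1_tuning_result",
    data_source_csv_filenames="gs://my-bucket/train.csv",
    training_fraction=0.8,
    validation_fraction=0.1,
    test_fraction=0.1,
)

try:
    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    template_path, parameter_values = (
        utils.get_skip_architecture_search_pipeline_and_parameters(**skip_search_kwargs)
    )
except ImportError:
    template_path = parameter_values = None
```

Skipping the stage 1 search is useful when iterating on downstream settings: the expensive architecture search runs once, and later pipelines reuse its result artifact.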
- google_cloud_pipeline_components.experimental.automl.tabular.utils.get_skip_evaluation_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column_name: str, prediction_type: str, optimization_objective: str, transformations: Dict[str, Any], split_spec: Dict[str, Any], data_source: Dict[str, Any], train_budget_milli_node_hours: float, stage_1_num_parallel_trials: int = 35, stage_2_num_parallel_trials: int = 35, stage_2_num_selected_trials: int = 5, weight_column_name: str = '', study_spec_override: Dict[str, Any] | None = None, optimization_objective_recall_value: float = -1, optimization_objective_precision_value: float = -1, stage_1_tuner_worker_pool_specs_override: Dict[str, Any] | None = None, cv_trainer_worker_pool_specs_override: Dict[str, Any] | None = None, export_additional_model_without_custom_ops: bool = False, stats_and_example_gen_dataflow_machine_type: str = 'n1-standard-16', stats_and_example_gen_dataflow_max_num_workers: int = 25, stats_and_example_gen_dataflow_disk_size_gb: int = 40, transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '', additional_experiments: Dict[str, Any] | None = None) Tuple[str, Dict[str, Any]]
Get the AutoML Tabular training pipeline that skips evaluation.
- Args:
- project: The GCP project that runs the pipeline components.
- location: The GCP region that runs the pipeline components.
- root_dir: The root GCS directory for the pipeline components.
- target_column_name: The target column name.
- prediction_type: The type of prediction the model is to produce. “classification” or “regression”.
- optimization_objective: For binary classification, “maximize-au-roc”, “minimize-log-loss”, “maximize-au-prc”, “maximize-precision-at-recall”, or “maximize-recall-at-precision”. For multi-class classification, “minimize-log-loss”. For regression, “minimize-rmse”, “minimize-mae”, or “minimize-rmsle”.
- transformations: The transformations to apply.
- split_spec: The split spec.
- data_source: The data source.
- train_budget_milli_node_hours: The train budget of creating this model, expressed in milli node hours, i.e. a value of 1,000 in this field means 1 node hour.
- stage_1_num_parallel_trials: Number of parallel trials for stage 1.
- stage_2_num_parallel_trials: Number of parallel trials for stage 2.
- stage_2_num_selected_trials: Number of selected trials for stage 2.
- weight_column_name: The weight column name.
- study_spec_override: The dictionary for overriding the study spec.
- optimization_objective_recall_value: Required when optimization_objective is “maximize-precision-at-recall”. Must be between 0 and 1, inclusive.
- optimization_objective_precision_value: Required when optimization_objective is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.
- stage_1_tuner_worker_pool_specs_override: The dictionary for overriding the stage 1 tuner worker pool spec.
- cv_trainer_worker_pool_specs_override: The dictionary for overriding the stage cv trainer worker pool spec.
- export_additional_model_without_custom_ops: Whether to export an additional model without custom TensorFlow operators.
- stats_and_example_gen_dataflow_machine_type: The Dataflow machine type for the stats_and_example_gen component.
- stats_and_example_gen_dataflow_max_num_workers: The max number of Dataflow workers for the stats_and_example_gen component.
- stats_and_example_gen_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for the stats_and_example_gen component.
- transform_dataflow_machine_type: The Dataflow machine type for the transform component.
- transform_dataflow_max_num_workers: The max number of Dataflow workers for the transform component.
- transform_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for the transform component.
- dataflow_subnetwork: Dataflow’s fully qualified subnetwork name; when empty, the default subnetwork will be used.
- dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.
- encryption_spec_key_name: The KMS key name.
- additional_experiments: Use this field to configure private preview features.
- Returns:
Tuple of pipeline_definition_path and parameter_values.
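A sketch of compiling this pipeline and handing the result to a Vertex AI pipeline run, assuming the package is installed. The resource names are hypothetical and the dict shapes for transformations, split_spec, and data_source are illustrative assumptions:

```python
# Hypothetical values; replace with your own project resources.
skip_eval_kwargs = dict(
    project="my-gcp-project",
    location="us-central1",
    root_dir="gs://my-bucket/pipeline_root",
    target_column_name="churned",
    prediction_type="classification",
    optimization_objective="maximize-au-roc",
    # Assumed dict shapes, for illustration only:
    transformations={"auto": [{"column_name": "tenure"}]},
    split_spec={"fraction_split": {"training_fraction": 0.8,
                                   "validation_fraction": 0.1,
                                   "test_fraction": 0.1}},
    data_source={"csv_data_source": {"csv_filenames": ["gs://my-bucket/train.csv"]}},
    train_budget_milli_node_hours=1000,
)

try:
    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    template_path, parameter_values = (
        utils.get_skip_evaluation_pipeline_and_parameters(**skip_eval_kwargs)
    )
    # The tuple plugs directly into a Vertex AI pipeline run
    # (requires credentials, so shown here as a comment):
    # from google.cloud import aiplatform
    # aiplatform.PipelineJob(
    #     display_name="automl-tabular-skip-eval",
    #     template_path=template_path,
    #     parameter_values=parameter_values,
    # ).run()
except ImportError:
    template_path = parameter_values = None
```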
- google_cloud_pipeline_components.experimental.automl.tabular.utils.get_tabnet_hyperparameter_tuning_job_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, study_spec_metric_id: str, study_spec_metric_goal: str, study_spec_parameters_override: List[Dict[str, Any]], max_trial_count: int, parallel_trial_count: int, transform_config: str | None = None, dataset_level_custom_transformation_definitions: List[Dict[str, Any]] | None = None, dataset_level_transformations: List[Dict[str, Any]] | None = None, run_feature_selection: bool = False, feature_selection_algorithm: str | None = None, materialized_examples_format: str | None = None, max_selected_features: int | None = None, predefined_split_key: str | None = None, stratified_split_key: str | None = None, training_fraction: float | None = None, validation_fraction: float | None = None, test_fraction: float | None = None, tf_auto_transform_features: List[str] | None = None, tf_custom_transformation_definitions: List[Dict[str, Any]] | None = None, tf_transformations_path: str | None = None, enable_profiler: bool = False, cache_data: str = 'auto', seed: int = 1, eval_steps: int = 0, eval_frequency_secs: int = 600, data_source_csv_filenames: str | None = None, data_source_bigquery_table_path: str | None = None, bigquery_staging_full_dataset_id: str | None = None, weight_column: str = '', max_failed_trial_count: int = 0, study_spec_algorithm: str = 'ALGORITHM_UNSPECIFIED', study_spec_measurement_selection_type: str = 'BEST_MEASUREMENT', transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, worker_pool_specs_override: Dict[str, Any] | None = None, run_evaluation: bool = True, evaluation_batch_predict_machine_type: str = 'n1-highmem-8', evaluation_batch_predict_starting_replica_count: int = 20, evaluation_batch_predict_max_replica_count: int = 20, 
evaluation_dataflow_machine_type: str = 'n1-standard-4', evaluation_dataflow_starting_num_workers: int = 10, evaluation_dataflow_max_num_workers: int = 100, evaluation_dataflow_disk_size_gb: int = 50, dataflow_service_account: str = '', dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '') Tuple[str, Dict[str, Any]]
Get the TabNet HyperparameterTuningJob pipeline.
- Args:
- project: The GCP project that runs the pipeline components.
- location: The GCP region that runs the pipeline components.
- root_dir: The root GCS directory for the pipeline components.
- target_column: The target column name.
- prediction_type: The type of prediction the model is to produce. “classification” or “regression”.
- study_spec_metric_id: Metric to optimize. Possible values: [“loss”, “average_loss”, “rmse”, “mae”, “mql”, “accuracy”, “auc”, “precision”, “recall”].
- study_spec_metric_goal: Optimization goal of the metric. Possible values: “MAXIMIZE”, “MINIMIZE”.
- study_spec_parameters_override: List of dictionaries representing parameters to optimize. The dictionary key is the parameter_id, which is passed to the training job as a command line argument, and the dictionary value is the parameter specification of the metric.
- max_trial_count: The desired total number of trials.
- parallel_trial_count: The desired number of trials to run in parallel.
- transform_config: Path to v1 TF transformation configuration.
- dataset_level_custom_transformation_definitions: Dataset-level custom transformation definitions in string format.
- dataset_level_transformations: Dataset-level transformation configuration in string format.
- run_feature_selection: Whether to enable feature selection.
- feature_selection_algorithm: Feature selection algorithm.
- materialized_examples_format: The format for the materialized examples.
- max_selected_features: Maximum number of features to select.
- predefined_split_key: Predefined split key.
- stratified_split_key: Stratified split key.
- training_fraction: Training fraction.
- validation_fraction: Validation fraction.
- test_fraction: Test fraction.
- tf_auto_transform_features: List of auto transform features in the comma-separated string format.
- tf_custom_transformation_definitions: TF custom transformation definitions in string format.
- tf_transformations_path: Path to TF transformation configuration.
- enable_profiler: Enables profiling and saves a trace during evaluation.
- cache_data: Whether to cache data or not. If set to “auto”, caching is determined based on the dataset size.
- seed: Seed to be used for this run.
- eval_steps: Number of steps to run evaluation for. If not specified or negative, it means run evaluation on the whole validation dataset. If set to 0, it means run evaluation for a fixed number of samples.
- eval_frequency_secs: Frequency at which evaluation and checkpointing will take place.
- data_source_csv_filenames: The CSV data source.
- data_source_bigquery_table_path: The BigQuery data source.
- bigquery_staging_full_dataset_id: The BigQuery staging full dataset id for storing intermediate tables.
- weight_column: The weight column name.
- max_failed_trial_count: The number of failed trials that need to be seen before failing the HyperparameterTuningJob. If set to 0, Vertex AI decides how many trials must fail before the whole job fails.
- study_spec_algorithm: The search algorithm specified for the study. One of “ALGORITHM_UNSPECIFIED”, “GRID_SEARCH”, or “RANDOM_SEARCH”.
- study_spec_measurement_selection_type: Which measurement to use if/when the service automatically selects the final measurement from previously reported intermediate measurements. One of “BEST_MEASUREMENT” or “LAST_MEASUREMENT”.
- transform_dataflow_machine_type: The Dataflow machine type for the transform component.
- transform_dataflow_max_num_workers: The max number of Dataflow workers for the transform component.
- transform_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for the transform component.
- worker_pool_specs_override: The dictionary for overriding training and evaluation worker pool specs.
- run_evaluation: Whether to run evaluation steps during training.
- evaluation_batch_predict_machine_type: The prediction server machine type for batch predict components during evaluation.
- evaluation_batch_predict_starting_replica_count: The initial number of prediction servers for batch predict components during evaluation.
- evaluation_batch_predict_max_replica_count: The max number of prediction servers for batch predict components during evaluation.
- evaluation_dataflow_machine_type: The Dataflow machine type for evaluation components.
- evaluation_dataflow_starting_num_workers: The initial number of Dataflow workers for evaluation components.
- evaluation_dataflow_max_num_workers: The max number of Dataflow workers for evaluation components.
- evaluation_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for evaluation components.
- dataflow_service_account: Custom service account to run Dataflow jobs.
- dataflow_subnetwork: Dataflow’s fully qualified subnetwork name; when empty, the default subnetwork will be used.
- dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.
- encryption_spec_key_name: The KMS key name.
- Returns:
Tuple of pipeline_definition_path and parameter_values.
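A hedged sketch of a TabNet hyperparameter tuning call, assuming the package is installed. The parameter-spec dicts below are modeled on the Vertex AI StudySpec parameter format, but their exact shape and all values here are assumptions:

```python
# Hypothetical search space; dict shapes follow the Vertex AI StudySpec
# parameter format (assumed), values are illustrative.
study_spec_parameters_override = [
    {
        "parameter_id": "learning_rate",
        "double_value_spec": {"min_value": 1e-4, "max_value": 1e-2},
        "scale_type": "UNIT_LOG_SCALE",
    },
    {
        "parameter_id": "num_decision_steps",
        "integer_value_spec": {"min_value": 3, "max_value": 8},
        "scale_type": "UNIT_LINEAR_SCALE",
    },
]

hpt_kwargs = dict(
    project="my-gcp-project",
    location="us-central1",
    root_dir="gs://my-bucket/pipeline_root",
    target_column="churned",
    prediction_type="classification",
    study_spec_metric_id="auc",
    study_spec_metric_goal="MAXIMIZE",
    study_spec_parameters_override=study_spec_parameters_override,
    max_trial_count=20,
    parallel_trial_count=5,
    data_source_csv_filenames="gs://my-bucket/train.csv",
)

try:
    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    template_path, parameter_values = (
        utils.get_tabnet_hyperparameter_tuning_job_pipeline_and_parameters(**hpt_kwargs)
    )
except ImportError:
    template_path = parameter_values = None
```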
- google_cloud_pipeline_components.experimental.automl.tabular.utils.get_tabnet_study_spec_parameters_override(dataset_size_bucket: str, prediction_type: str, training_budget_bucket: str) List[Dict[str, Any]]
Get study_spec_parameters_override for a TabNet hyperparameter tuning job.
- Args:
- dataset_size_bucket: Size of the dataset. One of “small” (< 1M rows),
“medium” (1M - 100M rows), or “large” (> 100M rows).
- prediction_type: The type of prediction the model is to produce.
“classification” or “regression”.
- training_budget_bucket: Bucket of the estimated training budget. One of
“small” (< $600), “medium” ($600 - $2400), or “large” (> $2400). This parameter is only used as a hint for the hyperparameter search space, unrelated to the real cost.
- Returns:
List of study_spec_parameters_override.
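A sketch of generating a TabNet search space from the size/budget buckets and feeding it into the tuning helper above, assuming the package is installed; bucket choices are hypothetical:

```python
# Hypothetical bucket choices describing the dataset and budget.
dataset_size_bucket = "medium"       # 1M - 100M rows
training_budget_bucket = "small"     # hint only; unrelated to the real cost
prediction_type = "classification"

try:
    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    # Returns a list suitable for study_spec_parameters_override in
    # get_tabnet_hyperparameter_tuning_job_pipeline_and_parameters.
    overrides = utils.get_tabnet_study_spec_parameters_override(
        dataset_size_bucket=dataset_size_bucket,
        prediction_type=prediction_type,
        training_budget_bucket=training_budget_bucket,
    )
except ImportError:
    overrides = []
```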
- google_cloud_pipeline_components.experimental.automl.tabular.utils.get_tabnet_trainer_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, learning_rate: float, transform_config: str | None = None, dataset_level_custom_transformation_definitions: List[Dict[str, Any]] | None = None, dataset_level_transformations: List[Dict[str, Any]] | None = None, run_feature_selection: bool = False, feature_selection_algorithm: str | None = None, materialized_examples_format: str | None = None, max_selected_features: int | None = None, predefined_split_key: str | None = None, stratified_split_key: str | None = None, training_fraction: float | None = None, validation_fraction: float | None = None, test_fraction: float | None = None, tf_auto_transform_features: List[str] | None = None, tf_custom_transformation_definitions: List[Dict[str, Any]] | None = None, tf_transformations_path: str | None = None, max_steps: int = -1, max_train_secs: int = -1, large_category_dim: int = 1, large_category_thresh: int = 300, yeo_johnson_transform: bool = True, feature_dim: int = 64, feature_dim_ratio: float = 0.5, num_decision_steps: int = 6, relaxation_factor: float = 1.5, decay_every: float = 100, decay_rate: float = 0.95, gradient_thresh: float = 2000, sparsity_loss_weight: float = 1e-05, batch_momentum: float = 0.95, batch_size_ratio: float = 0.25, num_transformer_layers: int = 4, num_transformer_layers_ratio: float = 0.25, class_weight: float = 1.0, loss_function_type: str = 'default', alpha_focal_loss: float = 0.25, gamma_focal_loss: float = 2.0, enable_profiler: bool = False, cache_data: str = 'auto', seed: int = 1, eval_steps: int = 0, batch_size: int = 100, measurement_selection_type: str | None = None, optimization_metric: str | None = None, eval_frequency_secs: int = 600, data_source_csv_filenames: str | None = None, data_source_bigquery_table_path: str | None = None, bigquery_staging_full_dataset_id: str | None = None, 
weight_column: str = '', transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, worker_pool_specs_override: Dict[str, Any] | None = None, run_evaluation: bool = True, evaluation_batch_predict_machine_type: str = 'n1-highmem-8', evaluation_batch_predict_starting_replica_count: int = 20, evaluation_batch_predict_max_replica_count: int = 20, evaluation_dataflow_machine_type: str = 'n1-standard-4', evaluation_dataflow_starting_num_workers: int = 10, evaluation_dataflow_max_num_workers: int = 100, evaluation_dataflow_disk_size_gb: int = 50, dataflow_service_account: str = '', dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '') Tuple[str, Dict[str, Any]]
Get the TabNet training pipeline.
- Args:
- project: The GCP project that runs the pipeline components.
- location: The GCP region that runs the pipeline components.
- root_dir: The root GCS directory for the pipeline components.
- target_column: The target column name.
- prediction_type: The type of prediction the model is to produce. “classification” or “regression”.
- learning_rate: The learning rate used by the linear optimizer.
- transform_config: Path to v1 TF transformation configuration.
- dataset_level_custom_transformation_definitions: Dataset-level custom transformation definitions in string format.
- dataset_level_transformations: Dataset-level transformation configuration in string format.
- run_feature_selection: Whether to enable feature selection.
- feature_selection_algorithm: Feature selection algorithm.
- materialized_examples_format: The format for the materialized examples.
- max_selected_features: Maximum number of features to select.
- predefined_split_key: Predefined split key.
- stratified_split_key: Stratified split key.
- training_fraction: Training fraction.
- validation_fraction: Validation fraction.
- test_fraction: Test fraction.
- tf_auto_transform_features: List of auto transform features in the comma-separated string format.
- tf_custom_transformation_definitions: TF custom transformation definitions in string format.
- tf_transformations_path: Path to TF transformation configuration.
- max_steps: Number of steps to run the trainer for.
- max_train_secs: Amount of time in seconds to run the trainer for.
- large_category_dim: Embedding dimension for categorical features with a large number of categories.
- large_category_thresh: Threshold on the number of categories above which the large_category_dim embedding dimension is applied.
- yeo_johnson_transform: Enables trainable Yeo-Johnson power transform.
- feature_dim: Dimensionality of the hidden representation in the feature transformation block.
- feature_dim_ratio: The ratio of output dimension (dimensionality of the outputs of each decision step) to feature dimension.
- num_decision_steps: Number of sequential decision steps.
- relaxation_factor: Relaxation factor that promotes the reuse of each feature at different decision steps. When it is 1, a feature is enforced to be used only at one decision step; as it increases, more flexibility is provided to use a feature at multiple decision steps.
- decay_every: Number of iterations for periodically applying learning rate decay.
- decay_rate: Learning rate decay rate.
- gradient_thresh: Threshold for the norm of gradients for clipping.
- sparsity_loss_weight: Weight of the loss for sparsity regularization (increasing it will yield more sparse feature selection).
- batch_momentum: Momentum in ghost batch normalization.
- batch_size_ratio: The ratio of virtual batch size (size of the ghost batch normalization) to batch size.
- num_transformer_layers: The number of transformer layers for each decision step.
- num_transformer_layers_ratio: The ratio of shared transformer layers to total transformer layers.
- class_weight: The class weight is used to compute a weighted cross entropy, which is helpful in classifying imbalanced datasets. Only used for classification.
- loss_function_type: Loss function type. Loss functions for classification: [cross_entropy, weighted_cross_entropy, focal_loss]; default is cross_entropy. Loss functions for regression: [rmse, mae, mse]; default is mse.
- alpha_focal_loss: Alpha value (balancing factor) in the focal_loss function. Only used for classification.
- gamma_focal_loss: Gamma value (modulating factor) for the focal loss function. Only used for classification.
- enable_profiler: Enables profiling and saves a trace during evaluation.
- cache_data: Whether to cache data or not. If set to “auto”, caching is determined based on the dataset size.
- seed: Seed to be used for this run.
- eval_steps: Number of steps to run evaluation for. If not specified or negative, it means run evaluation on the whole validation dataset. If set to 0, it means run evaluation for a fixed number of samples.
- batch_size: Batch size for training.
- measurement_selection_type: Which measurement to use if/when the service automatically selects the final measurement from previously reported intermediate measurements. One of “BEST_MEASUREMENT” or “LAST_MEASUREMENT”.
- optimization_metric: Optimization metric used for measurement_selection_type. Default is “rmse” for regression and “auc” for classification.
- eval_frequency_secs: Frequency at which evaluation and checkpointing will take place.
- data_source_csv_filenames: The CSV data source.
- data_source_bigquery_table_path: The BigQuery data source.
- bigquery_staging_full_dataset_id: The BigQuery staging full dataset id for
storing intermediate tables.
weight_column: The weight column name. transform_dataflow_machine_type: The dataflow machine type for transform
component.
- transform_dataflow_max_num_workers: The max number of Dataflow workers for
transform component.
- transform_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for
transform component.
- worker_pool_specs_override: The dictionary for overriding training and
- evaluation worker pool specs. The dictionary should be of format
run_evaluation: Whether to run evaluation steps during training. evaluation_batch_predict_machine_type: The prediction server machine type
for batch predict components during evaluation.
- evaluation_batch_predict_starting_replica_count: The initial number of
prediction server for batch predict components during evaluation.
- evaluation_batch_predict_max_replica_count: The max number of prediction
server for batch predict components during evaluation.
- evaluation_dataflow_machine_type: The dataflow machine type for evaluation
components.
- evaluation_dataflow_starting_num_workers: The initial number of Dataflow
workers for evaluation components.
- evaluation_dataflow_max_num_workers: The max number of Dataflow workers for
evaluation components.
- evaluation_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for
evaluation components.
dataflow_service_account: Custom service account to run dataflow jobs. dataflow_subnetwork: Dataflow’s fully qualified subnetwork name, when empty
- the default subnetwork will be used. Example:
- dataflow_use_public_ips: Specifies whether Dataflow workers use public IP
addresses.
encryption_spec_key_name: The KMS key name.
- Returns:
Tuple of pipeline_definition_path and parameter_values.
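The Args list above accepts either a CSV or a BigQuery data source. A minimal sketch of assembling the data-source keyword arguments before calling the function; the helper and the exactly-one-source rule are illustrative assumptions, not part of the library, and the bucket/table names are placeholders:

```python
def build_data_source_args(csv_filenames=None, bigquery_table_path=None):
    """Return the data-source kwargs for the pipeline function,
    enforcing that exactly one of the two sources is provided
    (an assumption made for this sketch)."""
    if (csv_filenames is None) == (bigquery_table_path is None):
        raise ValueError("provide exactly one of CSV or BigQuery source")
    if csv_filenames is not None:
        return {"data_source_csv_filenames": csv_filenames}
    return {"data_source_bigquery_table_path": bigquery_table_path}

# Placeholder GCS path; merge the result into the other pipeline kwargs.
source_args = build_data_source_args(csv_filenames="gs://my-bucket/train.csv")
```

The same pattern applies to the other pipeline functions in this module, which take the identical pair of data-source parameters.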
- google_cloud_pipeline_components.experimental.automl.tabular.utils.get_wide_and_deep_hyperparameter_tuning_job_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, study_spec_metric_id: str, study_spec_metric_goal: str, study_spec_parameters_override: List[Dict[str, Any]], max_trial_count: int, parallel_trial_count: int, transform_config: str | None = None, dataset_level_custom_transformation_definitions: List[Dict[str, Any]] | None = None, dataset_level_transformations: List[Dict[str, Any]] | None = None, run_feature_selection: bool = False, feature_selection_algorithm: str | None = None, materialized_examples_format: str | None = None, max_selected_features: int | None = None, predefined_split_key: str | None = None, stratified_split_key: str | None = None, training_fraction: float | None = None, validation_fraction: float | None = None, test_fraction: float | None = None, tf_auto_transform_features: List[str] | None = None, tf_custom_transformation_definitions: List[Dict[str, Any]] | None = None, tf_transformations_path: str | None = None, enable_profiler: bool = False, cache_data: str = 'auto', seed: int = 1, eval_steps: int = 0, eval_frequency_secs: int = 600, data_source_csv_filenames: str | None = None, data_source_bigquery_table_path: str | None = None, bigquery_staging_full_dataset_id: str | None = None, weight_column: str = '', max_failed_trial_count: int = 0, study_spec_algorithm: str = 'ALGORITHM_UNSPECIFIED', study_spec_measurement_selection_type: str = 'BEST_MEASUREMENT', transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, worker_pool_specs_override: Dict[str, Any] | None = None, run_evaluation: bool = True, evaluation_batch_predict_machine_type: str = 'n1-highmem-8', evaluation_batch_predict_starting_replica_count: int = 20, evaluation_batch_predict_max_replica_count: int = 20, 
evaluation_dataflow_machine_type: str = 'n1-standard-4', evaluation_dataflow_starting_num_workers: int = 10, evaluation_dataflow_max_num_workers: int = 100, evaluation_dataflow_disk_size_gb: int = 50, dataflow_service_account: str = '', dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '') Tuple[str, Dict[str, Any]]
Get the Wide & Deep algorithm HyperparameterTuningJob pipeline.
- Args:
- project: The GCP project that runs the pipeline components.
- location: The GCP region that runs the pipeline components.
- root_dir: The root GCS directory for the pipeline components.
- target_column: The target column name.
- prediction_type: The type of prediction the model is to produce. "classification" or "regression".
- study_spec_metric_id: Metric to optimize, possible values: ['loss', 'average_loss', 'rmse', 'mae', 'mql', 'accuracy', 'auc', 'precision', 'recall'].
- study_spec_metric_goal: Optimization goal of the metric, possible values: "MAXIMIZE", "MINIMIZE".
- study_spec_parameters_override: List of dictionaries representing parameters to optimize. The dictionary key is the parameter_id, which is passed to the training job as a command line argument, and the dictionary value is the parameter specification of the metric.
- max_trial_count: The desired total number of trials.
- parallel_trial_count: The desired number of trials to run in parallel.
- transform_config: Path to v1 TF transformation configuration.
- dataset_level_custom_transformation_definitions: Dataset-level custom transformation definitions in string format.
- dataset_level_transformations: Dataset-level transformation configuration in string format.
- run_feature_selection: Whether to enable feature selection.
- feature_selection_algorithm: Feature selection algorithm.
- materialized_examples_format: The format for the materialized examples.
- max_selected_features: Maximum number of features to select.
- predefined_split_key: Predefined split key.
- stratified_split_key: Stratified split key.
- training_fraction: Training fraction.
- validation_fraction: Validation fraction.
- test_fraction: Test fraction.
- tf_auto_transform_features: List of auto transform features in the comma-separated string format.
- tf_custom_transformation_definitions: TF custom transformation definitions in string format.
- tf_transformations_path: Path to TF transformation configuration.
- enable_profiler: Enables profiling and saves a trace during evaluation.
- cache_data: Whether to cache data or not. If set to 'auto', caching is determined based on the dataset size.
- seed: Seed to be used for this run.
- eval_steps: Number of steps to run evaluation for. If not specified or negative, it means run evaluation on the whole validation dataset. If set to 0, it means run evaluation for a fixed number of samples.
- eval_frequency_secs: Frequency at which evaluation and checkpointing will take place.
- data_source_csv_filenames: The CSV data source.
- data_source_bigquery_table_path: The BigQuery data source.
- bigquery_staging_full_dataset_id: The BigQuery staging full dataset id for storing intermediate tables.
- weight_column: The weight column name.
- max_failed_trial_count: The number of failed trials that need to be seen before failing the HyperparameterTuningJob. If set to 0, Vertex AI decides how many trials must fail before the whole job fails.
- study_spec_algorithm: The search algorithm specified for the study. One of "ALGORITHM_UNSPECIFIED", "GRID_SEARCH", or "RANDOM_SEARCH".
- study_spec_measurement_selection_type: Which measurement to use if/when the service automatically selects the final measurement from previously reported intermediate measurements. One of "BEST_MEASUREMENT" or "LAST_MEASUREMENT".
- transform_dataflow_machine_type: The Dataflow machine type for the transform component.
- transform_dataflow_max_num_workers: The max number of Dataflow workers for the transform component.
- transform_dataflow_disk_size_gb: Dataflow worker's disk size in GB for the transform component.
- worker_pool_specs_override: The dictionary for overriding training and evaluation worker pool specs. The dictionary should be of format
- run_evaluation: Whether to run evaluation steps during training.
- evaluation_batch_predict_machine_type: The prediction server machine type for batch predict components during evaluation.
- evaluation_batch_predict_starting_replica_count: The initial number of prediction servers for batch predict components during evaluation.
- evaluation_batch_predict_max_replica_count: The max number of prediction servers for batch predict components during evaluation.
- evaluation_dataflow_machine_type: The Dataflow machine type for evaluation components.
- evaluation_dataflow_starting_num_workers: The initial number of Dataflow workers for evaluation components.
- evaluation_dataflow_max_num_workers: The max number of Dataflow workers for evaluation components.
- evaluation_dataflow_disk_size_gb: Dataflow worker's disk size in GB for evaluation components.
- dataflow_service_account: Custom service account to run Dataflow jobs.
- dataflow_subnetwork: Dataflow's fully qualified subnetwork name; when empty, the default subnetwork will be used. Example:
- dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.
- encryption_spec_key_name: The KMS key name.
- Returns:
Tuple of pipeline_definition_path and parameter_values.
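The study_spec_parameters_override argument described above is a list of dictionaries keyed by parameter_id. A plausible shape for two entries, following the Vertex AI StudySpec parameter format; the exact field names and ranges here are an illustrative assumption, not taken from this page:

```python
# Hypothetical study_spec_parameters_override entries for the Wide & Deep
# hyperparameter tuning pipeline. Field names follow the Vertex AI
# StudySpec parameter format and should be treated as an assumption.
study_spec_parameters_override = [
    {
        "parameter_id": "learning_rate",
        "double_value_spec": {"min_value": 1e-4, "max_value": 1e-1},
        "scale_type": "UNIT_LOG_SCALE",
    },
    {
        "parameter_id": "batch_size",
        "discrete_value_spec": {"values": [64, 128, 256]},
    },
]

# Each entry names the training-job command line argument it tunes.
assert all("parameter_id" in spec for spec in study_spec_parameters_override)
```

In practice, get_wide_and_deep_study_spec_parameters_override() (below) returns a ready-made list in this shape that can be edited instead of written from scratch.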
- google_cloud_pipeline_components.experimental.automl.tabular.utils.get_wide_and_deep_study_spec_parameters_override() List[Dict[str, Any]]
Get study_spec_parameters_override for a Wide & Deep hyperparameter tuning job.
- Returns:
List of study_spec_parameters_override.
- google_cloud_pipeline_components.experimental.automl.tabular.utils.get_wide_and_deep_trainer_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, learning_rate: float, dnn_learning_rate: float, transform_config: str | None = None, dataset_level_custom_transformation_definitions: List[Dict[str, Any]] | None = None, dataset_level_transformations: List[Dict[str, Any]] | None = None, run_feature_selection: bool = False, feature_selection_algorithm: str | None = None, materialized_examples_format: str | None = None, max_selected_features: int | None = None, predefined_split_key: str | None = None, stratified_split_key: str | None = None, training_fraction: float | None = None, validation_fraction: float | None = None, test_fraction: float | None = None, tf_auto_transform_features: List[str] | None = None, tf_custom_transformation_definitions: List[Dict[str, Any]] | None = None, tf_transformations_path: str | None = None, optimizer_type: str = 'adam', max_steps: int = -1, max_train_secs: int = -1, l1_regularization_strength: float = 0, l2_regularization_strength: float = 0, l2_shrinkage_regularization_strength: float = 0, beta_1: float = 0.9, beta_2: float = 0.999, hidden_units: str = '30,30,30', use_wide: bool = True, embed_categories: bool = True, dnn_dropout: float = 0, dnn_optimizer_type: str = 'adam', dnn_l1_regularization_strength: float = 0, dnn_l2_regularization_strength: float = 0, dnn_l2_shrinkage_regularization_strength: float = 0, dnn_beta_1: float = 0.9, dnn_beta_2: float = 0.999, enable_profiler: bool = False, cache_data: str = 'auto', seed: int = 1, eval_steps: int = 0, batch_size: int = 100, measurement_selection_type: str | None = None, optimization_metric: str | None = None, eval_frequency_secs: int = 600, data_source_csv_filenames: str | None = None, data_source_bigquery_table_path: str | None = None, bigquery_staging_full_dataset_id: str | None = None, weight_column: str = '', 
transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, worker_pool_specs_override: Dict[str, Any] | None = None, run_evaluation: bool = True, evaluation_batch_predict_machine_type: str = 'n1-highmem-8', evaluation_batch_predict_starting_replica_count: int = 20, evaluation_batch_predict_max_replica_count: int = 20, evaluation_dataflow_machine_type: str = 'n1-standard-4', evaluation_dataflow_starting_num_workers: int = 10, evaluation_dataflow_max_num_workers: int = 100, evaluation_dataflow_disk_size_gb: int = 50, dataflow_service_account: str = '', dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '') Tuple[str, Dict[str, Any]]
Get the Wide & Deep training pipeline.
- Args:
- project: The GCP project that runs the pipeline components.
- location: The GCP region that runs the pipeline components.
- root_dir: The root GCS directory for the pipeline components.
- target_column: The target column name.
- prediction_type: The type of prediction the model is to produce. 'classification' or 'regression'.
- learning_rate: The learning rate used by the linear optimizer.
- dnn_learning_rate: The learning rate for training the deep part of the model.
- transform_config: Path to v1 TF transformation configuration.
- dataset_level_custom_transformation_definitions: Dataset-level custom transformation definitions in string format.
- dataset_level_transformations: Dataset-level transformation configuration in string format.
- run_feature_selection: Whether to enable feature selection.
- feature_selection_algorithm: Feature selection algorithm.
- materialized_examples_format: The format for the materialized examples.
- max_selected_features: Maximum number of features to select.
- predefined_split_key: Predefined split key.
- stratified_split_key: Stratified split key.
- training_fraction: Training fraction.
- validation_fraction: Validation fraction.
- test_fraction: Test fraction.
- tf_auto_transform_features: List of auto transform features in the comma-separated string format.
- tf_custom_transformation_definitions: TF custom transformation definitions in string format.
- tf_transformations_path: Path to TF transformation configuration.
- optimizer_type: The type of optimizer to use. Choices are "adam", "ftrl", and "sgd" for the Adam, FTRL, and Gradient Descent optimizers, respectively.
- max_steps: Number of steps to run the trainer for.
- max_train_secs: Amount of time in seconds to run the trainer for.
- l1_regularization_strength: L1 regularization strength for optimizer_type="ftrl".
- l2_regularization_strength: L2 regularization strength for optimizer_type="ftrl".
- l2_shrinkage_regularization_strength: L2 shrinkage regularization strength for optimizer_type="ftrl".
- beta_1: Beta 1 value for optimizer_type="adam".
- beta_2: Beta 2 value for optimizer_type="adam".
- hidden_units: Hidden layer sizes to use for DNN feature columns, provided as a comma-separated string.
- use_wide: If set to true, the categorical columns will be used in the wide part of the DNN model.
- embed_categories: If set to true, the categorical columns will be embedded and used in the deep part of the model. Embedding size is the square root of the column cardinality.
- dnn_dropout: The probability we will drop out a given coordinate.
- dnn_optimizer_type: The type of optimizer to use for the deep part of the model. Choices are "adam", "ftrl", and "sgd" for the Adam, FTRL, and Gradient Descent optimizers, respectively.
- dnn_l1_regularization_strength: L1 regularization strength for dnn_optimizer_type="ftrl".
- dnn_l2_regularization_strength: L2 regularization strength for dnn_optimizer_type="ftrl".
- dnn_l2_shrinkage_regularization_strength: L2 shrinkage regularization strength for dnn_optimizer_type="ftrl".
- dnn_beta_1: Beta 1 value for dnn_optimizer_type="adam".
- dnn_beta_2: Beta 2 value for dnn_optimizer_type="adam".
- enable_profiler: Enables profiling and saves a trace during evaluation.
- cache_data: Whether to cache data or not. If set to 'auto', caching is determined based on the dataset size.
- seed: Seed to be used for this run.
- eval_steps: Number of steps to run evaluation for. If not specified or negative, it means run evaluation on the whole validation dataset. If set to 0, it means run evaluation for a fixed number of samples.
- batch_size: Batch size for training.
- measurement_selection_type: Which measurement to use if/when the service automatically selects the final measurement from previously reported intermediate measurements. One of "BEST_MEASUREMENT" or "LAST_MEASUREMENT".
- optimization_metric: Optimization metric used for measurement_selection_type. Default is "rmse" for regression and "auc" for classification.
- eval_frequency_secs: Frequency at which evaluation and checkpointing will take place.
- data_source_csv_filenames: The CSV data source.
- data_source_bigquery_table_path: The BigQuery data source.
- bigquery_staging_full_dataset_id: The BigQuery staging full dataset id for storing intermediate tables.
- weight_column: The weight column name.
- transform_dataflow_machine_type: The Dataflow machine type for the transform component.
- transform_dataflow_max_num_workers: The max number of Dataflow workers for the transform component.
- transform_dataflow_disk_size_gb: Dataflow worker's disk size in GB for the transform component.
- worker_pool_specs_override: The dictionary for overriding training and evaluation worker pool specs. The dictionary should be of format
- run_evaluation: Whether to run evaluation steps during training.
- evaluation_batch_predict_machine_type: The prediction server machine type for batch predict components during evaluation.
- evaluation_batch_predict_starting_replica_count: The initial number of prediction servers for batch predict components during evaluation.
- evaluation_batch_predict_max_replica_count: The max number of prediction servers for batch predict components during evaluation.
- evaluation_dataflow_machine_type: The Dataflow machine type for evaluation components.
- evaluation_dataflow_starting_num_workers: The initial number of Dataflow workers for evaluation components.
- evaluation_dataflow_max_num_workers: The max number of Dataflow workers for evaluation components.
- evaluation_dataflow_disk_size_gb: Dataflow worker's disk size in GB for evaluation components.
- dataflow_service_account: Custom service account to run Dataflow jobs.
- dataflow_subnetwork: Dataflow's fully qualified subnetwork name; when empty, the default subnetwork will be used. Example:
- dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.
- encryption_spec_key_name: The KMS key name.
- Returns:
Tuple of pipeline_definition_path and parameter_values.
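The hidden_units argument above is passed as a comma-separated string (default '30,30,30'). A small sketch of how such a value maps to layer widths; the parsing helper is illustrative and not part of the library:

```python
def parse_hidden_units(hidden_units: str) -> list:
    """Convert a comma-separated hidden_units string, e.g. the
    default '30,30,30', into a list of integer layer widths."""
    return [int(width) for width in hidden_units.split(",")]

# Three hidden layers of 30 units each, matching the default.
layers = parse_hidden_units("30,30,30")
```

The same comma-separated-string convention shows up elsewhere in this module, e.g. tf_auto_transform_features and the XGBoost eval_metric argument.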
- google_cloud_pipeline_components.experimental.automl.tabular.utils.get_xgboost_hyperparameter_tuning_job_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, objective: str, study_spec_metric_id: str, study_spec_metric_goal: str, max_trial_count: int, parallel_trial_count: int, study_spec_parameters_override: List[Dict[str, Any]] | None = None, eval_metric: str | None = None, disable_default_eval_metric: int | None = None, seed: int | None = None, seed_per_iteration: bool | None = None, dataset_level_custom_transformation_definitions: List[Dict[str, Any]] | None = None, dataset_level_transformations: List[Dict[str, Any]] | None = None, run_feature_selection: bool | None = None, feature_selection_algorithm: str | None = None, max_selected_features: int | None = None, predefined_split_key: str | None = None, stratified_split_key: str | None = None, training_fraction: float | None = None, validation_fraction: float | None = None, test_fraction: float | None = None, tf_auto_transform_features: List[str] | None = None, tf_custom_transformation_definitions: List[Dict[str, Any]] | None = None, tf_transformations_path: str | None = None, data_source_csv_filenames: str | None = None, data_source_bigquery_table_path: str | None = None, bigquery_staging_full_dataset_id: str | None = None, weight_column: str | None = None, max_failed_trial_count: int | None = None, training_machine_type: str | None = None, training_total_replica_count: int | None = None, training_accelerator_type: str | None = None, training_accelerator_count: int | None = None, study_spec_algorithm: str | None = None, study_spec_measurement_selection_type: str | None = None, transform_dataflow_machine_type: str | None = None, transform_dataflow_max_num_workers: int | None = None, transform_dataflow_disk_size_gb: int | None = None, run_evaluation: bool | None = None, evaluation_batch_predict_machine_type: str | None = None, evaluation_batch_predict_starting_replica_count: 
int | None = None, evaluation_batch_predict_max_replica_count: int | None = None, evaluation_dataflow_machine_type: str | None = None, evaluation_dataflow_starting_num_workers: int | None = None, evaluation_dataflow_max_num_workers: int | None = None, evaluation_dataflow_disk_size_gb: int | None = None, dataflow_service_account: str | None = None, dataflow_subnetwork: str | None = None, dataflow_use_public_ips: bool | None = None, encryption_spec_key_name: str | None = None)
Get the XGBoost HyperparameterTuningJob pipeline.
- Args:
- project: The GCP project that runs the pipeline components.
- location: The GCP region that runs the pipeline components.
- root_dir: The root GCS directory for the pipeline components.
- target_column: The target column name.
- objective: Specifies the learning task and the learning objective. Must be one of [reg:squarederror, reg:squaredlogerror, reg:logistic, reg:gamma, reg:tweedie, reg:pseudohubererror, binary:logistic, multi:softprob].
- study_spec_metric_id: Metric to optimize. For options, please look under 'eval_metric' at https://xgboost.readthedocs.io/en/stable/parameter.html#learning-task-parameters.
- study_spec_metric_goal: Optimization goal of the metric, possible values: "MAXIMIZE", "MINIMIZE".
- max_trial_count: The desired total number of trials.
- parallel_trial_count: The desired number of trials to run in parallel.
- study_spec_parameters_override: List of dictionaries representing parameters to optimize. The dictionary key is the parameter_id, which is passed to the training job as a command line argument, and the dictionary value is the parameter specification of the metric.
- eval_metric: Evaluation metrics for validation data represented as a comma-separated string.
- disable_default_eval_metric: Flag to disable default metric. Set to >0 to disable. Defaults to 0.
- seed: Random seed.
- seed_per_iteration: Seed PRNG deterministically via iteration number.
- dataset_level_custom_transformation_definitions: Dataset-level custom transformation definitions in string format.
- dataset_level_transformations: Dataset-level transformation configuration in string format.
- run_feature_selection: Whether to enable feature selection.
- feature_selection_algorithm: Feature selection algorithm.
- max_selected_features: Maximum number of features to select.
- predefined_split_key: Predefined split key.
- stratified_split_key: Stratified split key.
- training_fraction: Training fraction.
- validation_fraction: Validation fraction.
- test_fraction: Test fraction.
- tf_auto_transform_features: List of auto transform features in the comma-separated string format.
- tf_custom_transformation_definitions: TF custom transformation definitions in string format.
- tf_transformations_path: Path to TF transformation configuration.
- data_source_csv_filenames: The CSV data source.
- data_source_bigquery_table_path: The BigQuery data source.
- bigquery_staging_full_dataset_id: The BigQuery staging full dataset id for storing intermediate tables.
- weight_column: The weight column name.
- max_failed_trial_count: The number of failed trials that need to be seen before failing the HyperparameterTuningJob. If set to 0, Vertex AI decides how many trials must fail before the whole job fails.
- training_machine_type: Machine type.
- training_total_replica_count: Number of workers.
- training_accelerator_type: Accelerator type.
- training_accelerator_count: Accelerator count.
- study_spec_algorithm: The search algorithm specified for the study. One of 'ALGORITHM_UNSPECIFIED', 'GRID_SEARCH', or 'RANDOM_SEARCH'.
- study_spec_measurement_selection_type: Which measurement to use if/when the service automatically selects the final measurement from previously reported intermediate measurements. One of "BEST_MEASUREMENT" or "LAST_MEASUREMENT".
- transform_dataflow_machine_type: The Dataflow machine type for the transform component.
- transform_dataflow_max_num_workers: The max number of Dataflow workers for the transform component.
- transform_dataflow_disk_size_gb: Dataflow worker's disk size in GB for the transform component.
- run_evaluation: Whether to run evaluation steps during training.
- evaluation_batch_predict_machine_type: The prediction server machine type for batch predict components during evaluation.
- evaluation_batch_predict_starting_replica_count: The initial number of prediction servers for batch predict components during evaluation.
- evaluation_batch_predict_max_replica_count: The max number of prediction servers for batch predict components during evaluation.
- evaluation_dataflow_machine_type: The Dataflow machine type for evaluation components.
- evaluation_dataflow_starting_num_workers: The initial number of Dataflow workers for evaluation components.
- evaluation_dataflow_max_num_workers: The max number of Dataflow workers for evaluation components.
- evaluation_dataflow_disk_size_gb: Dataflow worker's disk size in GB for evaluation components.
- dataflow_service_account: Custom service account to run Dataflow jobs.
- dataflow_subnetwork: Dataflow's fully qualified subnetwork name; when empty, the default subnetwork will be used. Example:
- dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.
- encryption_spec_key_name: The KMS key name.
- Returns:
Tuple of pipeline_definition_path and parameter_values.
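The objective string must be one of the XGBoost learning-task objectives enumerated in the Args list above. A quick client-side guard can fail fast before a pipeline run is launched; the helper itself is illustrative, not part of the library:

```python
# Objectives accepted by the XGBoost pipelines, as enumerated in the
# Args section above.
SUPPORTED_OBJECTIVES = {
    "reg:squarederror", "reg:squaredlogerror", "reg:logistic",
    "reg:gamma", "reg:tweedie", "reg:pseudohubererror",
    "binary:logistic", "multi:softprob",
}

def check_objective(objective: str) -> str:
    """Raise a clear error for unsupported objectives instead of
    letting the pipeline fail remotely."""
    if objective not in SUPPORTED_OBJECTIVES:
        raise ValueError(f"unsupported objective: {objective!r}")
    return objective
```

For example, check_objective("binary:logistic") passes, while a multi-class metric typo such as "multi:softmax" (not in the list above) would be rejected up front.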
- google_cloud_pipeline_components.experimental.automl.tabular.utils.get_xgboost_study_spec_parameters_override() List[Dict[str, Any]]
Get study_spec_parameters_override for an XGBoost hyperparameter tuning job.
- Returns:
List of study_spec_parameters_override.
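A common pattern is to start from the defaults returned by get_xgboost_study_spec_parameters_override() and narrow one parameter's range. Since the real default list is not reproduced on this page, the sketch below uses a stand-in list whose entry shape follows the Vertex AI StudySpec format (an assumption):

```python
# Stand-in for the list returned by
# get_xgboost_study_spec_parameters_override(); entry shape is assumed.
defaults = [
    {"parameter_id": "max_depth",
     "integer_value_spec": {"min_value": 2, "max_value": 10}},
    {"parameter_id": "eta",
     "double_value_spec": {"min_value": 1e-3, "max_value": 0.3}},
]

def narrow_range(specs, parameter_id, **bounds):
    """Return a copy of the spec list with one parameter's bounds
    replaced, leaving the original list untouched."""
    out = []
    for spec in specs:
        if spec["parameter_id"] == parameter_id:
            spec = dict(spec)  # shallow copy so defaults stay intact
            (key,) = [k for k in spec if k.endswith("_value_spec")]
            spec[key] = {**spec[key], **bounds}
        out.append(spec)
    return out

# Cap tree depth at 6 while keeping all other defaults.
narrowed = narrow_range(defaults, "max_depth", max_value=6)
```

The resulting list can then be passed as study_spec_parameters_override to get_xgboost_hyperparameter_tuning_job_pipeline_and_parameters.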
- google_cloud_pipeline_components.experimental.automl.tabular.utils.get_xgboost_trainer_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, objective: str, eval_metric: str | None = None, num_boost_round: int | None = None, early_stopping_rounds: int | None = None, base_score: float | None = None, disable_default_eval_metric: int | None = None, seed: int | None = None, seed_per_iteration: bool | None = None, booster: str | None = None, eta: float | None = None, gamma: float | None = None, max_depth: int | None = None, min_child_weight: float | None = None, max_delta_step: float | None = None, subsample: float | None = None, colsample_bytree: float | None = None, colsample_bylevel: float | None = None, colsample_bynode: float | None = None, reg_lambda: float | None = None, reg_alpha: float | None = None, tree_method: str | None = None, scale_pos_weight: float | None = None, updater: str | None = None, refresh_leaf: int | None = None, process_type: str | None = None, grow_policy: str | None = None, sampling_method: str | None = None, monotone_constraints: str | None = None, interaction_constraints: str | None = None, sample_type: str | None = None, normalize_type: str | None = None, rate_drop: float | None = None, one_drop: int | None = None, skip_drop: float | None = None, num_parallel_tree: int | None = None, feature_selector: str | None = None, top_k: int | None = None, max_cat_to_onehot: int | None = None, max_leaves: int | None = None, max_bin: int | None = None, tweedie_variance_power: float | None = None, huber_slope: float | None = None, dataset_level_custom_transformation_definitions: List[Dict[str, Any]] | None = None, dataset_level_transformations: List[Dict[str, Any]] | None = None, run_feature_selection: bool | None = None, feature_selection_algorithm: str | None = None, max_selected_features: int | None = None, predefined_split_key: str | None = None, stratified_split_key: str | None = None, training_fraction: float 
| None = None, validation_fraction: float | None = None, test_fraction: float | None = None, tf_auto_transform_features: List[str] | None = None, tf_custom_transformation_definitions: List[Dict[str, Any]] | None = None, tf_transformations_path: str | None = None, data_source_csv_filenames: str | None = None, data_source_bigquery_table_path: str | None = None, bigquery_staging_full_dataset_id: str | None = None, weight_column: str | None = None, training_machine_type: str | None = None, training_total_replica_count: int | None = None, training_accelerator_type: str | None = None, training_accelerator_count: int | None = None, transform_dataflow_machine_type: str | None = None, transform_dataflow_max_num_workers: int | None = None, transform_dataflow_disk_size_gb: int | None = None, run_evaluation: bool | None = None, evaluation_batch_predict_machine_type: str | None = None, evaluation_batch_predict_starting_replica_count: int | None = None, evaluation_batch_predict_max_replica_count: int | None = None, evaluation_dataflow_machine_type: str | None = None, evaluation_dataflow_starting_num_workers: int | None = None, evaluation_dataflow_max_num_workers: int | None = None, evaluation_dataflow_disk_size_gb: int | None = None, dataflow_service_account: str | None = None, dataflow_subnetwork: str | None = None, dataflow_use_public_ips: bool | None = None, encryption_spec_key_name: str | None = None)
Get the XGBoost training pipeline.
- Args:
- project: The GCP project that runs the pipeline components.
- location: The GCP region that runs the pipeline components.
- root_dir: The root GCS directory for the pipeline components.
- target_column: The target column name.
- objective: Specifies the learning task and the learning objective. Must be one of [reg:squarederror, reg:squaredlogerror, reg:logistic, reg:gamma, reg:tweedie, reg:pseudohubererror, binary:logistic, multi:softprob].
- eval_metric: Evaluation metrics for validation data represented as a comma-separated string.
- num_boost_round: Number of boosting iterations.
- early_stopping_rounds: Activates early stopping. Validation error needs to decrease at least every early_stopping_rounds round(s) to continue training.
- base_score: The initial prediction score of all instances, global bias.
- disable_default_eval_metric: Flag to disable the default metric. Set to >0 to disable. Defaults to 0.
- seed: Random seed.
- seed_per_iteration: Seed the PRNG deterministically via the iteration number.
- booster: Which booster to use; can be gbtree, gblinear or dart. gbtree and dart use tree-based models while gblinear uses a linear function.
- eta: Learning rate.
- gamma: Minimum loss reduction required to make a further partition on a leaf node of the tree.
- max_depth: Maximum depth of a tree.
- min_child_weight: Minimum sum of instance weight (hessian) needed in a child.
- max_delta_step: Maximum delta step we allow each tree's weight estimation to be.
- subsample: Subsample ratio of the training instances.
- colsample_bytree: Subsample ratio of columns when constructing each tree.
- colsample_bylevel: Subsample ratio of columns for each split, in each level.
- colsample_bynode: Subsample ratio of columns for each node (split).
- reg_lambda: L2 regularization term on weights.
- reg_alpha: L1 regularization term on weights.
- tree_method: The tree construction algorithm used in XGBoost. Choices: ["auto", "exact", "approx", "hist", "gpu_exact", "gpu_hist"].
- scale_pos_weight: Controls the balance of positive and negative weights.
- updater: A comma-separated string defining the sequence of tree updaters to run.
- refresh_leaf: Refresh updater plugin. Updates tree leaf and node stats if True; when False, only node stats are updated.
- process_type: The type of boosting process to run. Choices: ["default", "update"].
- grow_policy: Controls the way new nodes are added to the tree. Only supported if tree_method is hist. Choices: ["depthwise", "lossguide"].
- sampling_method: The method used to sample the training instances.
- monotone_constraints: Constraint of variable monotonicity.
- interaction_constraints: Constraints for interaction representing permitted interactions.
- sample_type: [dart booster only] Type of sampling algorithm. Choices: ["uniform", "weighted"].
- normalize_type: [dart booster only] Type of normalization algorithm. Choices: ["tree", "forest"].
- rate_drop: [dart booster only] Dropout rate.
- one_drop: [dart booster only] When this flag is enabled, at least one tree is always dropped during the dropout (allows Binomial-plus-one or epsilon-dropout from the original DART paper).
- skip_drop: [dart booster only] Probability of skipping the dropout procedure during a boosting iteration.
- num_parallel_tree: Number of parallel trees constructed during each iteration. This option is used to support boosted random forests.
- feature_selector: [linear booster only] Feature selection and ordering method.
- top_k: The number of top features to select in the greedy and thrifty feature selectors. A value of 0 means using all the features.
- max_cat_to_onehot: A threshold for deciding whether XGBoost should use one-hot-encoding-based splits for categorical data.
- max_leaves: Maximum number of nodes to be added.
- max_bin: Maximum number of discrete bins to bucket continuous features.
- tweedie_variance_power: Parameter that controls the variance of the Tweedie distribution.
- huber_slope: A parameter used for Pseudo-Huber loss to define the delta term.
- dataset_level_custom_transformation_definitions: Dataset-level custom transformation definitions in string format.
- dataset_level_transformations: Dataset-level transformation configuration in string format.
- run_feature_selection: Whether to enable feature selection.
- feature_selection_algorithm: Feature selection algorithm.
- max_selected_features: Maximum number of features to select.
- predefined_split_key: Predefined split key.
- stratified_split_key: Stratified split key.
- training_fraction: Training fraction.
- validation_fraction: Validation fraction.
- test_fraction: Test fraction.
- tf_auto_transform_features: List of auto transform features in the comma-separated string format.
- tf_custom_transformation_definitions: TF custom transformation definitions in string format.
- tf_transformations_path: Path to the TF transformation configuration.
- data_source_csv_filenames: The CSV data source.
- data_source_bigquery_table_path: The BigQuery data source.
- bigquery_staging_full_dataset_id: The BigQuery staging full dataset id for storing intermediate tables.
- weight_column: The weight column name.
- training_machine_type: Machine type.
- training_total_replica_count: Number of workers.
- training_accelerator_type: Accelerator type.
- training_accelerator_count: Accelerator count.
- transform_dataflow_machine_type: The Dataflow machine type for the transform component.
- transform_dataflow_max_num_workers: The max number of Dataflow workers for the transform component.
- transform_dataflow_disk_size_gb: Dataflow worker's disk size in GB for the transform component.
- run_evaluation: Whether to run evaluation steps during training.
- evaluation_batch_predict_machine_type: The prediction server machine type for batch predict components during evaluation.
- evaluation_batch_predict_starting_replica_count: The initial number of prediction servers for batch predict components during evaluation.
- evaluation_batch_predict_max_replica_count: The max number of prediction servers for batch predict components during evaluation.
- evaluation_dataflow_machine_type: The Dataflow machine type for evaluation components.
- evaluation_dataflow_starting_num_workers: The initial number of Dataflow workers for evaluation components.
- evaluation_dataflow_max_num_workers: The max number of Dataflow workers for evaluation components.
- evaluation_dataflow_disk_size_gb: Dataflow worker's disk size in GB for evaluation components.
- dataflow_service_account: Custom service account to run Dataflow jobs.
- dataflow_subnetwork: Dataflow's fully qualified subnetwork name; when empty, the default subnetwork will be used.
- dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.
- encryption_spec_key_name: The KMS key name.
- Returns:
Tuple of pipeline_definition_path and parameter_values.
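The returned tuple is typically handed to a Vertex AI `PipelineJob`. A minimal sketch of the return contract and how it is consumed; the builder below is a hypothetical stub standing in for this module's pipeline-and-parameters function so the snippet stays self-contained, and the project, bucket, and column names are placeholders:

```python
from typing import Any, Dict, Tuple

def build_pipeline_and_parameters() -> Tuple[str, Dict[str, Any]]:
    # Hypothetical stub mirroring the documented return contract:
    # a path to a compiled pipeline definition plus the parameter
    # values to run it with. In real use, call the corresponding
    # utils.get_*_pipeline_and_parameters function instead.
    pipeline_definition_path = (
        "gs://my-bucket/pipeline_root/xgboost_trainer_pipeline.json"
    )
    parameter_values = {
        "project": "my-project",
        "location": "us-central1",
        "root_dir": "gs://my-bucket/pipeline_root",
        "target_column": "label",
        "objective": "binary:logistic",
        "eval_metric": "auc",
    }
    return pipeline_definition_path, parameter_values

template_path, parameter_values = build_pipeline_and_parameters()

# The tuple then plugs into a Vertex AI pipeline run, e.g.:
# from google.cloud import aiplatform
# aiplatform.PipelineJob(
#     display_name="xgboost-train",
#     template_path=template_path,
#     parameter_values=parameter_values,
# ).run()
```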
- google_cloud_pipeline_components.experimental.automl.tabular.utils.input_dictionary_to_parameter(input_dict: Dict[str, Any] | None) → str
Convert json input dict to encoded parameter string.
This function is required due to a limitation of YAML component definitions: YAML has no keyword for applying quote escaping, so the JSON argument's quotes must be manually escaped using this function.
- Args:
input_dict: The input json dictionary.
- Returns:
The encoded string used for parameter.
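The escaping can be reproduced in plain Python. This is a sketch of the documented behavior (serialize to JSON, JSON-encode the resulting string so its quotes are escaped, then strip the outer quotes), not necessarily the library's exact implementation:

```python
import json
from typing import Any, Dict, Optional

def input_dictionary_to_parameter(input_dict: Optional[Dict[str, Any]]) -> str:
    """Encode a JSON dict as a quote-escaped parameter string.

    Sketch: the second json.dumps escapes the quotes produced by the
    first, and the enclosing double quotes it adds are stripped.
    """
    if not input_dict:
        return ""
    out = json.dumps(json.dumps(input_dict))
    return out[1:-1]  # drop the enclosing double quotes

print(input_dictionary_to_parameter({"a": 1}))  # prints {\"a\": 1}
```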
Module contents
Module for AutoML Tables KFP components.