AutoML Tabular

GA AutoML tabular components.

Components:

CvTrainerOp(project, location, root_dir, ...)

Tunes AutoML Tabular models and selects top trials using cross-validation.

EnsembleOp(project, location, root_dir, ...)

Ensembles AutoML Tabular models.

FinalizerOp(project, location, root_dir, ...)

Finalizes AutoML Tabular pipelines.

InfraValidatorOp(unmanaged_container_model)

Validates that the trained AutoML Tabular model is a valid model.

SplitMaterializedDataOp(materialized_data, ...)

Splits materialized dataset into train, eval, and test data splits.

Stage1TunerOp(project, location, root_dir, ...)

Searches AutoML Tabular architectures and selects the top trials.

StatsAndExampleGenOp(project, location, ...)

Generates stats and training instances for tabular data.

TrainingConfiguratorAndValidatorOp(...[, ...])

Configures training and validates data and user-input configurations.

TransformOp(project, location, root_dir, ...)

Transforms raw features to engineered features.

Functions:

get_automl_tabular_pipeline_and_parameters(...)

Get the AutoML Tabular v1 default training pipeline.

v1.automl.tabular.CvTrainerOp(project: str, location: str, root_dir: str, deadline_hours: float, num_parallel_trials: int, single_run_max_secs: int, num_selected_trials: int, transform_output: dsl.Input[system.Artifact], metadata: dsl.Input[system.Artifact], materialized_cv_splits: dsl.Input[system.Artifact], tuning_result_input: dsl.Input[system.Artifact], gcp_resources: dsl.OutputPath(str), tuning_result_output: dsl.Output[system.Artifact], execution_metrics: dsl.OutputPath(dict), worker_pool_specs_override_json: list | None = [], num_selected_features: int | None = 0, encryption_spec_key_name: str | None = '')

Tunes AutoML Tabular models and selects top trials using cross-validation.

Parameters:
project: str

Project to run Cross-validation trainer.

location: str

Location for running the Cross-validation trainer.

root_dir: str

The Cloud Storage location to store the output.

worker_pool_specs_override_json: list | None = []

JSON worker pool specs. E.g., [{"machine_spec": {"machine_type": "n1-standard-16"}},{},{},{"machine_spec": {"machine_type": "n1-standard-16"}}]

deadline_hours: float

Number of hours the cross-validation trainer should run.

num_parallel_trials: int

Number of parallel training trials.

single_run_max_secs: int

Max number of seconds each training trial runs.

num_selected_trials: int

Number of selected trials. The number of weak learners in the final model is 5 * num_selected_trials.

num_selected_features: int | None = 0

Number of selected features. The number of features to learn in the NN models.

transform_output: dsl.Input[system.Artifact]

The transform output artifact.

metadata: dsl.Input[system.Artifact]

The tabular example gen metadata.

materialized_cv_splits: dsl.Input[system.Artifact]

The materialized cross-validation splits.

tuning_result_input: dsl.Input[system.Artifact]

AutoML Tabular tuning result.

encryption_spec_key_name: str | None = ''

Customer-managed encryption key.

Returns:

tuning_result_output: dsl.Output[system.Artifact]

The trained model and architectures.

gcp_resources: dsl.OutputPath(str)

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.

execution_metrics: dsl.OutputPath(dict)

Core metrics of the component execution, as a dictionary.
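A minimal usage sketch is shown below. It assumes the component can be imported from google_cloud_pipeline_components.v1.automl.tabular, that the upstream artifacts already exist at the hypothetical Cloud Storage URIs, and it passes the worker pool override as a Python list rather than a quoted JSON string.

```python
# Sketch only: wiring CvTrainerOp inside a KFP pipeline. The import path and
# the gs:// URIs below are assumptions for illustration, not values from this page.
from kfp import dsl
from google_cloud_pipeline_components.v1.automl import tabular

@dsl.pipeline(name="cv-trainer-sketch")
def cv_trainer_sketch(project: str, location: str, root_dir: str):
    # Bring in pre-existing upstream artifacts (hypothetical URIs).
    transform_output = dsl.importer(
        artifact_uri="gs://my-bucket/transform_output", artifact_class=dsl.Artifact)
    metadata = dsl.importer(
        artifact_uri="gs://my-bucket/metadata", artifact_class=dsl.Artifact)
    cv_splits = dsl.importer(
        artifact_uri="gs://my-bucket/materialized_cv_splits", artifact_class=dsl.Artifact)
    stage1_result = dsl.importer(
        artifact_uri="gs://my-bucket/stage1_tuning_result", artifact_class=dsl.Artifact)

    tabular.CvTrainerOp(
        project=project,
        location=location,
        root_dir=root_dir,
        deadline_hours=1.0,
        num_parallel_trials=5,
        single_run_max_secs=3600,
        num_selected_trials=5,
        transform_output=transform_output.output,
        metadata=metadata.output,
        materialized_cv_splits=cv_splits.output,
        tuning_result_input=stage1_result.output,
        # Same shape as the worker_pool_specs_override_json example above,
        # written as a Python list instead of a JSON string.
        worker_pool_specs_override_json=[
            {"machine_spec": {"machine_type": "n1-standard-16"}}, {}, {},
            {"machine_spec": {"machine_type": "n1-standard-16"}},
        ],
    )
```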

v1.automl.tabular.EnsembleOp(project: str, location: str, root_dir: str, transform_output: dsl.Input[system.Artifact], metadata: dsl.Input[system.Artifact], dataset_schema: dsl.Input[system.Artifact], tuning_result_input: dsl.Input[system.Artifact], instance_baseline: dsl.Input[system.Artifact], gcp_resources: dsl.OutputPath(str), model_architecture: dsl.Output[system.Artifact], model: dsl.Output[system.Artifact], unmanaged_container_model: dsl.Output[google.UnmanagedContainerModel], model_without_custom_ops: dsl.Output[system.Artifact], explanation_metadata: dsl.OutputPath(dict), explanation_metadata_artifact: dsl.Output[system.Artifact], explanation_parameters: dsl.OutputPath(dict), warmup_data: dsl.Input[system.Dataset] | None = None, encryption_spec_key_name: str | None = '', export_additional_model_without_custom_ops: bool | None = False)

Ensembles AutoML Tabular models.

Parameters:
project: str

Project to run the ensemble job.

location: str

Location for running the ensemble job.

root_dir: str

The Cloud Storage location to store the output.

transform_output: dsl.Input[system.Artifact]

The transform output artifact.

metadata: dsl.Input[system.Artifact]

The tabular example gen metadata.

dataset_schema: dsl.Input[system.Artifact]

The schema of the dataset.

tuning_result_input: dsl.Input[system.Artifact]

AutoML Tabular tuning result.

instance_baseline: dsl.Input[system.Artifact]

The instance baseline used to calculate explanations.

warmup_data: dsl.Input[system.Dataset] | None = None

The warm-up data. The Ensemble component saves the warm-up data together with the model artifact, and it is used to warm up the model when the prediction server starts.

encryption_spec_key_name: str | None = ''

Customer-managed encryption key.

export_additional_model_without_custom_ops: bool | None = False

True to export an additional model without custom TF operators to the model_without_custom_ops output.

Returns:

gcp_resources: dsl.OutputPath(str)

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.

model_architecture: dsl.Output[system.Artifact]

The architecture of the output model.

model: dsl.Output[system.Artifact]

The output model.

model_without_custom_ops: dsl.Output[system.Artifact]

The output model without custom TF operators. This output is empty unless export_additional_model_without_custom_ops is set.

model_uri: Unknown

The URI of the output model.

instance_schema_uri: Unknown

The URI of the instance schema.

prediction_schema_uri: Unknown

The URI of the prediction schema.

explanation_metadata: dsl.OutputPath(dict)

The explanation metadata used by Vertex online and batch explanations.

explanation_parameters: dsl.OutputPath(dict)

The explanation parameters used by Vertex online and batch explanations.

v1.automl.tabular.FinalizerOp(project: str, location: str, root_dir: str, gcp_resources: dsl.OutputPath(str), encryption_spec_key_name: str | None = '')

Finalizes AutoML Tabular pipelines.

Parameters:
project: str

Project to run the pipeline finalizer.

location: str

Location for running the pipeline finalizer.

root_dir: str

The Cloud Storage location to store the output.

encryption_spec_key_name: str | None = ''

Customer-managed encryption key.

Returns:

gcp_resources: dsl.OutputPath(str)

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.

v1.automl.tabular.InfraValidatorOp(unmanaged_container_model: dsl.Input[google.UnmanagedContainerModel])

Validates that the trained AutoML Tabular model is a valid model.

Parameters:
unmanaged_container_model: dsl.Input[google.UnmanagedContainerModel]

google.UnmanagedContainerModel for model to be validated.
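A minimal sketch of pairing EnsembleOp with InfraValidatorOp, assuming the components are importable from google_cloud_pipeline_components.v1.automl.tabular and that the upstream artifacts exist at the hypothetical URIs:

```python
# Sketch only: ensembling tuned models and validating the resulting
# unmanaged container model. Import path and gs:// URIs are assumptions.
from kfp import dsl
from google_cloud_pipeline_components.v1.automl import tabular

@dsl.pipeline(name="ensemble-and-validate-sketch")
def ensemble_and_validate(project: str, location: str, root_dir: str):
    transform_output = dsl.importer(
        artifact_uri="gs://my-bucket/transform_output", artifact_class=dsl.Artifact)
    metadata = dsl.importer(
        artifact_uri="gs://my-bucket/metadata", artifact_class=dsl.Artifact)
    dataset_schema = dsl.importer(
        artifact_uri="gs://my-bucket/dataset_schema", artifact_class=dsl.Artifact)
    tuning_result = dsl.importer(
        artifact_uri="gs://my-bucket/tuning_result", artifact_class=dsl.Artifact)
    instance_baseline = dsl.importer(
        artifact_uri="gs://my-bucket/instance_baseline", artifact_class=dsl.Artifact)

    ensemble = tabular.EnsembleOp(
        project=project,
        location=location,
        root_dir=root_dir,
        transform_output=transform_output.output,
        metadata=metadata.output,
        dataset_schema=dataset_schema.output,
        tuning_result_input=tuning_result.output,
        instance_baseline=instance_baseline.output,
    )

    # The infra validator only needs the ensembled unmanaged container model.
    tabular.InfraValidatorOp(
        unmanaged_container_model=ensemble.outputs["unmanaged_container_model"])
```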

v1.automl.tabular.SplitMaterializedDataOp(materialized_data: dsl.Input[system.Dataset], materialized_train_split: dsl.Output[system.Artifact], materialized_eval_split: dsl.Output[system.Artifact], materialized_test_split: dsl.Output[system.Artifact])

Splits materialized dataset into train, eval, and test data splits.

The materialized dataset generated by the Feature Transform Engine consists of all the splits that were combined into the input transform dataset (i.e., train, eval, and test splits). This component splits the output materialized dataset into the corresponding materialized data splits so that they can be used by downstream training or evaluation components.

Parameters:
materialized_data: dsl.Input[system.Dataset]

Materialized dataset output by the Feature Transform Engine.

Returns:

materialized_train_split: dsl.Output[system.Artifact]

Path pattern to materialized train split.

materialized_eval_split: dsl.Output[system.Artifact]

Path pattern to materialized eval split.

materialized_test_split: dsl.Output[system.Artifact]

Path pattern to materialized test split.
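A minimal sketch, assuming the component is importable from google_cloud_pipeline_components.v1.automl.tabular and that the Feature Transform Engine output exists at the hypothetical URI:

```python
# Sketch only: splitting an FTE-materialized dataset into its train/eval/test parts.
from kfp import dsl
from google_cloud_pipeline_components.v1.automl import tabular

@dsl.pipeline(name="split-materialized-data-sketch")
def split_materialized_data_sketch():
    materialized_data = dsl.importer(
        artifact_uri="gs://my-bucket/fte/materialized_data",  # hypothetical URI
        artifact_class=dsl.Dataset)

    split = tabular.SplitMaterializedDataOp(
        materialized_data=materialized_data.output)
    # split.outputs["materialized_train_split"], ["materialized_eval_split"], and
    # ["materialized_test_split"] can feed downstream training or evaluation steps.
```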

v1.automl.tabular.Stage1TunerOp(project: str, location: str, root_dir: str, num_selected_trials: int, deadline_hours: float, num_parallel_trials: int, single_run_max_secs: int, metadata: dsl.Input[system.Artifact], transform_output: dsl.Input[system.Artifact], materialized_train_split: dsl.Input[system.Artifact], materialized_eval_split: dsl.Input[system.Artifact], gcp_resources: dsl.OutputPath(str), tuning_result_output: dsl.Output[system.Artifact], execution_metrics: dsl.OutputPath(dict), study_spec_parameters_override: list | None = [], worker_pool_specs_override_json: list | None = [], reduce_search_space_mode: str | None = 'regular', num_selected_features: int | None = 0, disable_early_stopping: bool | None = False, feature_ranking: dsl.Input[system.Artifact] | None = None, tune_feature_selection_rate: bool | None = False, encryption_spec_key_name: str | None = '', run_distillation: bool | None = False)

Searches AutoML Tabular architectures and selects the top trials.

Parameters:
project: str

Project to run the stage 1 tuner.

location: str

Location for running the stage 1 tuner.

root_dir: str

The Cloud Storage location to store the output.

study_spec_parameters_override: list | None = []

JSON study spec. E.g., [{"parameter_id": "model_type", "categorical_value_spec": {"values": ["nn"]}}]

worker_pool_specs_override_json: list | None = []

JSON worker pool specs. E.g., [{"machine_spec": {"machine_type": "n1-standard-16"}},{},{},{"machine_spec": {"machine_type": "n1-standard-16"}}]

reduce_search_space_mode: str | None = 'regular'

The reduce search space mode. Possible values: “regular” (default), “minimal”, “full”.

num_selected_trials: int

Number of selected trials. The number of weak learners in the final model is 5 * num_selected_trials.

num_selected_features: int | None = 0

Number of selected features. The number of features to learn in the NN models.

deadline_hours: float

Number of hours the stage 1 tuner should run.

disable_early_stopping: bool | None = False

True to disable early stopping. The default value is false.

num_parallel_trials: int

Number of parallel training trials.

single_run_max_secs: int

Max number of seconds each training trial runs.

metadata: dsl.Input[system.Artifact]

The tabular example gen metadata.

transform_output: dsl.Input[system.Artifact]

The transform output artifact.

materialized_train_split: dsl.Input[system.Artifact]

The materialized train split.

materialized_eval_split: dsl.Input[system.Artifact]

The materialized eval split.

encryption_spec_key_name: str | None = ''

Customer-managed encryption key.

run_distillation: bool | None = False

True if running in distillation mode. The default value is false.

Returns:

gcp_resources: dsl.OutputPath(str)

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.

tuning_result_output: dsl.Output[system.Artifact]

The trained model and architectures.

execution_metrics: dsl.OutputPath(dict)

Core metrics of the component execution, as a dictionary.
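A minimal sketch with a study spec override restricting the search to neural-network model types; the import path and gs:// URIs are assumptions:

```python
# Sketch only: Stage1TunerOp with study_spec_parameters_override.
from kfp import dsl
from google_cloud_pipeline_components.v1.automl import tabular

@dsl.pipeline(name="stage1-tuner-sketch")
def stage1_tuner_sketch(project: str, location: str, root_dir: str):
    metadata = dsl.importer(
        artifact_uri="gs://my-bucket/metadata", artifact_class=dsl.Artifact)
    transform_output = dsl.importer(
        artifact_uri="gs://my-bucket/transform_output", artifact_class=dsl.Artifact)
    train_split = dsl.importer(
        artifact_uri="gs://my-bucket/materialized_train", artifact_class=dsl.Artifact)
    eval_split = dsl.importer(
        artifact_uri="gs://my-bucket/materialized_eval", artifact_class=dsl.Artifact)

    tabular.Stage1TunerOp(
        project=project,
        location=location,
        root_dir=root_dir,
        num_selected_trials=5,
        deadline_hours=1.0,
        num_parallel_trials=5,
        single_run_max_secs=3600,
        metadata=metadata.output,
        transform_output=transform_output.output,
        materialized_train_split=train_split.output,
        materialized_eval_split=eval_split.output,
        # Same shape as the study_spec_parameters_override example above.
        study_spec_parameters_override=[
            {"parameter_id": "model_type",
             "categorical_value_spec": {"values": ["nn"]}},
        ],
    )
```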

v1.automl.tabular.StatsAndExampleGenOp(project: str, location: str, root_dir: str, target_column_name: str, prediction_type: str, transformations: str, dataset_schema: dsl.Output[system.Artifact], dataset_stats: dsl.Output[system.Artifact], train_split: dsl.Output[system.Dataset], eval_split: dsl.Output[system.Dataset], test_split: dsl.Output[system.Dataset], test_split_json: dsl.OutputPath(list), downsampled_test_split_json: dsl.OutputPath(list), instance_baseline: dsl.Output[system.Artifact], metadata: dsl.Output[system.Artifact], gcp_resources: dsl.OutputPath(str), weight_column_name: str | None = '', optimization_objective: str | None = '', optimization_objective_recall_value: float | None = -1, optimization_objective_precision_value: float | None = -1, transformations_path: str | None = '', request_type: str | None = 'COLUMN_STATS_ONLY', dataflow_machine_type: str | None = 'n1-standard-16', dataflow_max_num_workers: int | None = 25, dataflow_disk_size_gb: int | None = 40, dataflow_subnetwork: str | None = '', dataflow_use_public_ips: bool | None = True, dataflow_service_account: str | None = '', encryption_spec_key_name: str | None = '', run_distillation: bool | None = False, additional_experiments: str | None = '', additional_experiments_json: dict | None = {}, data_source_csv_filenames: str | None = '', data_source_bigquery_table_path: str | None = '', predefined_split_key: str | None = '', timestamp_split_key: str | None = '', stratified_split_key: str | None = '', training_fraction: float | None = -1, validation_fraction: float | None = -1, test_fraction: float | None = -1, quantiles: list | None = [], enable_probabilistic_inference: bool | None = False)

Generates stats and training instances for tabular data.

Parameters:
project: str

Project to run dataset statistics and example generation.

location: str

Location for running dataset statistics and example generation.

root_dir: str

The Cloud Storage location to store the output.

target_column_name: str

The target column name.

weight_column_name: str | None = ''

The weight column name.

prediction_type: str

The prediction type. Supported values: “classification”, “regression”.

optimization_objective: str | None = ''

Objective function the model is optimizing towards. The training process creates a model that maximizes/minimizes the value of the objective function over the validation set. The supported optimization objectives depend on the prediction type. If the field is not set, a default objective function is used.

classification (binary): "maximize-au-roc" (default) - maximize the area under the receiver operating characteristic (ROC) curve; "minimize-log-loss" - minimize log loss; "maximize-au-prc" - maximize the area under the precision-recall curve; "maximize-precision-at-recall" - maximize precision for a specified recall value; "maximize-recall-at-precision" - maximize recall for a specified precision value.

classification (multi-class): "minimize-log-loss" (default) - minimize log loss.

regression: "minimize-rmse" (default) - minimize root-mean-squared error (RMSE); "minimize-mae" - minimize mean-absolute error (MAE); "minimize-rmsle" - minimize root-mean-squared log error (RMSLE).

optimization_objective_recall_value: float | None = -1

Required when optimization_objective is “maximize-precision-at-recall”. Must be between 0 and 1, inclusive.

optimization_objective_precision_value: float | None = -1

Required when optimization_objective is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.

transformations: str

Quote-escaped JSON string for transformations. Each transformation applies a transform function to a given input column, and the result is used for training. When creating a transformation for a BigQuery Struct column, the column should be flattened using "." as the delimiter.

transformations_path: str | None = ''

Path to a GCS file containing JSON string for transformations.

dataflow_machine_type: str | None = 'n1-standard-16'

The machine type used for Dataflow jobs. If not set, defaults to n1-standard-16.

dataflow_max_num_workers: int | None = 25

The number of workers to run the Dataflow job. If not set, defaults to 25.

dataflow_disk_size_gb: int | None = 40

The disk size, in gigabytes, to use on each Dataflow worker instance. If not set, defaults to 40.

dataflow_subnetwork: str | None = ''

Dataflow’s fully qualified subnetwork name; when empty, the default subnetwork will be used. More details: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications

dataflow_use_public_ips: bool | None = True

Specifies whether Dataflow workers use public IP addresses.

dataflow_service_account: str | None = ''

Custom service account to run dataflow jobs.

encryption_spec_key_name: str | None = ''

Customer-managed encryption key.

run_distillation: bool | None = False

True if running in distillation mode. The default value is false.

Returns:

dataset_schema: dsl.Output[system.Artifact]

The schema of the dataset.

dataset_stats: dsl.Output[system.Artifact]

The stats of the dataset.

train_split: dsl.Output[system.Dataset]

The train split.

eval_split: dsl.Output[system.Dataset]

The eval split.

test_split: dsl.Output[system.Dataset]

The test split.

test_split_json: dsl.OutputPath(list)

The test split JSON object.

downsampled_test_split_json: dsl.OutputPath(list)

The downsampled test split JSON object.

instance_baseline: dsl.Output[system.Artifact]

The instance baseline used to calculate explanations.

metadata: dsl.Output[system.Artifact]

The tabular example gen metadata.

gcp_resources: dsl.OutputPath(str)

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.
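A minimal sketch showing how the quote-escaped transformations string is typically built from a Python list of per-column transformations. The import path, source table, and column names are assumptions:

```python
# Sketch only: StatsAndExampleGenOp with a JSON transformations string and a
# BigQuery data source. Column names and the table path are hypothetical.
import json

from kfp import dsl
from google_cloud_pipeline_components.v1.automl import tabular

# Per-column transformations, serialized to the JSON string the component expects.
transformations = json.dumps([
    {"auto": {"column_name": "feature_1"}},
    {"numeric": {"column_name": "feature_2"}},
    {"categorical": {"column_name": "feature_3"}},
])

@dsl.pipeline(name="stats-and-example-gen-sketch")
def stats_and_example_gen_sketch(project: str, location: str, root_dir: str):
    tabular.StatsAndExampleGenOp(
        project=project,
        location=location,
        root_dir=root_dir,
        target_column_name="label",
        prediction_type="classification",
        transformations=transformations,
        data_source_bigquery_table_path="bq://my-project.my_dataset.my_table",
        training_fraction=0.8,
        validation_fraction=0.1,
        test_fraction=0.1,
    )
```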

v1.automl.tabular.TrainingConfiguratorAndValidatorOp(dataset_stats: dsl.Input[system.Artifact], split_example_counts: str, training_schema: dsl.Input[system.Artifact], instance_schema: dsl.Input[system.Artifact], metadata: dsl.Output[system.Artifact], instance_baseline: dsl.Output[system.Artifact], target_column: str | None = '', weight_column: str | None = '', prediction_type: str | None = '', optimization_objective: str | None = '', optimization_objective_recall_value: float | None = -1, optimization_objective_precision_value: float | None = -1, run_evaluation: bool | None = False, run_distill: bool | None = False, enable_probabilistic_inference: bool | None = False, time_series_identifier_column: str | None = None, time_series_identifier_columns: list | None = [], time_column: str | None = '', time_series_attribute_columns: list | None = [], available_at_forecast_columns: list | None = [], unavailable_at_forecast_columns: list | None = [], quantiles: list | None = [], context_window: int | None = -1, forecast_horizon: int | None = -1, forecasting_model_type: str | None = '', forecasting_transformations: dict | None = {}, stage_1_deadline_hours: float | None = None, stage_2_deadline_hours: float | None = None, group_columns: list | None = None, group_total_weight: float = 0.0, temporal_total_weight: float = 0.0, group_temporal_total_weight: float = 0.0)

Configures training and validates data and user-input configurations.

Parameters:
dataset_stats: dsl.Input[system.Artifact]

Dataset stats generated by feature transform engine.

split_example_counts: str

JSON string of data split example counts for train, validate, and test splits.

training_schema: dsl.Input[system.Artifact]

Schema of input data to the tf_model at training time.

instance_schema: dsl.Input[system.Artifact]

Schema of input data to the tf_model at serving time.

target_column: str | None = ''

Target column of input data.

weight_column: str | None = ''

Weight column of input data.

prediction_type: str | None = ''

Model prediction type. One of “classification”, “regression”, “time_series”.

optimization_objective: str | None = ''

Objective function the model is optimizing towards. The training process creates a model that maximizes/minimizes the value of the objective function over the validation set. The supported optimization objectives depend on the prediction type. If the field is not set, a default objective function is used.

classification (binary): "maximize-au-roc" (default) - maximize the area under the receiver operating characteristic (ROC) curve; "minimize-log-loss" - minimize log loss; "maximize-au-prc" - maximize the area under the precision-recall curve; "maximize-precision-at-recall" - maximize precision for a specified recall value; "maximize-recall-at-precision" - maximize recall for a specified precision value.

classification (multi-class): "minimize-log-loss" (default) - minimize log loss.

regression: "minimize-rmse" (default) - minimize root-mean-squared error (RMSE); "minimize-mae" - minimize mean-absolute error (MAE); "minimize-rmsle" - minimize root-mean-squared log error (RMSLE).

optimization_objective_recall_value: float | None = -1

Required when optimization_objective is “maximize-precision-at-recall”. Must be between 0 and 1, inclusive.

optimization_objective_precision_value: float | None = -1

Required when optimization_objective is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.

run_evaluation: bool | None = False

Whether we are running evaluation in the training pipeline.

run_distill: bool | None = False

Whether the distillation should be applied to the training.

enable_probabilistic_inference: bool | None = False

If probabilistic inference is enabled, the model will fit a distribution that captures the uncertainty of a prediction. At inference time, the predictive distribution is used to make a point prediction that minimizes the optimization objective. For example, the mean of a predictive distribution is the point prediction that minimizes RMSE loss. If quantiles are specified, then the quantiles of the distribution are also returned.

time_series_identifier_column: str | None = None

[Deprecated] The time series identifier column. Used by forecasting only. Raises an exception if used; use the time_series_identifier_columns field instead.

time_series_identifier_columns: list | None = []

The list of time series identifier columns. Used by forecasting only.

time_column: str | None = ''

The column that indicates the time. Used by forecasting only.

time_series_attribute_columns: list | None = []

The column names of the time series attributes.

available_at_forecast_columns: list | None = []

The names of the columns that are available at forecast time.

unavailable_at_forecast_columns: list | None = []

The names of the columns that are not available at forecast time.

quantiles: list | None = []

All quantiles that the model needs to predict.

context_window: int | None = -1

The length of the context window.

forecast_horizon: int | None = -1

The length of the forecast horizon.

forecasting_model_type: str | None = ''

The forecasting model type, e.g., l2l, seq2seq, tft.

forecasting_transformations: dict | None = {}

Dict mapping auto and/or type-resolutions to feature columns. The supported types are auto, categorical, numeric, text, and timestamp.

stage_1_deadline_hours: float | None = None

Stage 1 training budget in hours.

stage_2_deadline_hours: float | None = None

Stage 2 training budget in hours.

group_columns: list | None = None

A list of time series attribute column names that define the time series hierarchy.

group_total_weight: float = 0.0

The weight of the loss for predictions aggregated over time series in the same group.

temporal_total_weight: float = 0.0

The weight of the loss for predictions aggregated over the horizon for a single time series.

group_temporal_total_weight: float = 0.0

The weight of the loss for predictions aggregated over both the horizon and time series in the same hierarchy group.

Returns:

metadata: dsl.Output[system.Artifact]

The tabular example gen metadata.
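A minimal sketch, assuming the component is importable from google_cloud_pipeline_components.v1.automl.tabular and that the feature-transform-engine outputs exist at the hypothetical URIs:

```python
# Sketch only: configuring and validating training inputs produced by the
# feature transform engine. Import path, URIs, and example counts are assumptions.
from kfp import dsl
from google_cloud_pipeline_components.v1.automl import tabular

@dsl.pipeline(name="training-configurator-sketch")
def training_configurator_sketch():
    dataset_stats = dsl.importer(
        artifact_uri="gs://my-bucket/fte/dataset_stats", artifact_class=dsl.Artifact)
    training_schema = dsl.importer(
        artifact_uri="gs://my-bucket/fte/training_schema", artifact_class=dsl.Artifact)
    instance_schema = dsl.importer(
        artifact_uri="gs://my-bucket/fte/instance_schema", artifact_class=dsl.Artifact)

    tabular.TrainingConfiguratorAndValidatorOp(
        dataset_stats=dataset_stats.output,
        # Hypothetical example counts; the exact JSON layout follows your split step.
        split_example_counts='{"train": 8000, "validate": 1000, "test": 1000}',
        training_schema=training_schema.output,
        instance_schema=instance_schema.output,
        target_column="label",
        prediction_type="classification",
        optimization_objective="maximize-au-roc",
    )
```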

v1.automl.tabular.TransformOp(project: str, location: str, root_dir: str, metadata: dsl.Input[system.Artifact], dataset_schema: dsl.Input[system.Artifact], train_split: dsl.Input[system.Dataset], eval_split: dsl.Input[system.Dataset], test_split: dsl.Input[system.Dataset], materialized_train_split: dsl.Output[system.Artifact], materialized_eval_split: dsl.Output[system.Artifact], materialized_test_split: dsl.Output[system.Artifact], training_schema_uri: dsl.Output[system.Artifact], transform_output: dsl.Output[system.Artifact], gcp_resources: dsl.OutputPath(str), dataflow_machine_type: str | None = 'n1-standard-16', dataflow_max_num_workers: int | None = 25, dataflow_disk_size_gb: int | None = 40, dataflow_subnetwork: str | None = '', dataflow_use_public_ips: bool | None = True, dataflow_service_account: str | None = '', encryption_spec_key_name: str | None = '')

Transforms raw features to engineered features.

Parameters:
project: str

Project to run the transform job.

location: str

Location for running the transform job.

root_dir: str

The Cloud Storage location to store the output.

metadata: dsl.Input[system.Artifact]

The tabular example gen metadata.

dataset_schema: dsl.Input[system.Artifact]

The schema of the dataset.

train_split: dsl.Input[system.Dataset]

The train split.

eval_split: dsl.Input[system.Dataset]

The eval split.

test_split: dsl.Input[system.Dataset]

The test split.

dataflow_machine_type: str | None = 'n1-standard-16'

The machine type used for Dataflow jobs. If not set, defaults to n1-standard-16.

dataflow_max_num_workers: int | None = 25

The number of workers to run the Dataflow job. If not set, defaults to 25.

dataflow_disk_size_gb: int | None = 40

The disk size, in gigabytes, to use on each Dataflow worker instance. If not set, defaults to 40.

dataflow_subnetwork: str | None = ''

Dataflow’s fully qualified subnetwork name; when empty, the default subnetwork will be used. More details: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications

dataflow_use_public_ips: bool | None = True

Specifies whether Dataflow workers use public IP addresses.

dataflow_service_account: str | None = ''

Custom service account to run dataflow jobs.

encryption_spec_key_name: str | None = ''

Customer-managed encryption key.

Returns:

materialized_train_split: dsl.Output[system.Artifact]

The materialized train split.

materialized_eval_split: dsl.Output[system.Artifact]

The materialized eval split.

materialized_test_split: dsl.Output[system.Artifact]

The materialized test split.

training_schema_uri: dsl.Output[system.Artifact]

The training schema.

transform_output: dsl.Output[system.Artifact]

The transform output artifact.

gcp_resources: dsl.OutputPath(str)

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.
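A minimal sketch chaining StatsAndExampleGenOp into TransformOp inside one pipeline; the import path, data source, and column names are assumptions:

```python
# Sketch only: raw features in, engineered features out, by wiring
# StatsAndExampleGenOp outputs directly into TransformOp.
import json

from kfp import dsl
from google_cloud_pipeline_components.v1.automl import tabular

@dsl.pipeline(name="stats-then-transform-sketch")
def stats_then_transform(project: str, location: str, root_dir: str):
    stats = tabular.StatsAndExampleGenOp(
        project=project,
        location=location,
        root_dir=root_dir,
        target_column_name="label",
        prediction_type="classification",
        transformations=json.dumps([{"auto": {"column_name": "feature_1"}}]),
        data_source_csv_filenames="gs://my-bucket/data/train.csv",
    )

    tabular.TransformOp(
        project=project,
        location=location,
        root_dir=root_dir,
        metadata=stats.outputs["metadata"],
        dataset_schema=stats.outputs["dataset_schema"],
        train_split=stats.outputs["train_split"],
        eval_split=stats.outputs["eval_split"],
        test_split=stats.outputs["test_split"],
    )
    # The materialized splits and transform_output from TransformOp then feed
    # Stage1TunerOp, CvTrainerOp, and EnsembleOp.
```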

v1.automl.tabular.get_automl_tabular_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, optimization_objective: str, transformations: str, train_budget_milli_node_hours: float, stage_1_num_parallel_trials: int | None = None, stage_2_num_parallel_trials: int | None = None, stage_2_num_selected_trials: int | None = None, data_source_csv_filenames: str | None = None, data_source_bigquery_table_path: str | None = None, predefined_split_key: str | None = None, timestamp_split_key: str | None = None, stratified_split_key: str | None = None, training_fraction: float | None = None, validation_fraction: float | None = None, test_fraction: float | None = None, weight_column: str | None = None, study_spec_parameters_override: list[dict[str, Any]] | None = None, optimization_objective_recall_value: float | None = None, optimization_objective_precision_value: float | None = None, stage_1_tuner_worker_pool_specs_override: dict[str, Any] | None = None, cv_trainer_worker_pool_specs_override: dict[str, Any] | None = None, export_additional_model_without_custom_ops: bool = False, stats_and_example_gen_dataflow_machine_type: str | None = None, stats_and_example_gen_dataflow_max_num_workers: int | None = None, stats_and_example_gen_dataflow_disk_size_gb: int | None = None, transform_dataflow_machine_type: str | None = None, transform_dataflow_max_num_workers: int | None = None, transform_dataflow_disk_size_gb: int | None = None, dataflow_subnetwork: str | None = None, dataflow_use_public_ips: bool = True, encryption_spec_key_name: str | None = None, additional_experiments: dict[str, Any] | None = None, dataflow_service_account: str | None = None, run_evaluation: bool = True, evaluation_batch_predict_machine_type: str | None = None, evaluation_batch_predict_starting_replica_count: int | None = None, evaluation_batch_predict_max_replica_count: int | None = None, evaluation_batch_explain_machine_type: str | None = None, evaluation_batch_explain_starting_replica_count: int | None = None, evaluation_batch_explain_max_replica_count: int | None = None, evaluation_dataflow_machine_type: str | None = None, evaluation_dataflow_starting_num_workers: int | None = None, evaluation_dataflow_max_num_workers: int | None = None, evaluation_dataflow_disk_size_gb: int | None = None, run_distillation: bool = False, distill_batch_predict_machine_type: str | None = None, distill_batch_predict_starting_replica_count: int | None = None, distill_batch_predict_max_replica_count: int | None = None, stage_1_tuning_result_artifact_uri: str | None = None, quantiles: list[float] | None = None, enable_probabilistic_inference: bool = False, num_selected_features: int | None = None, model_display_name: str = '', model_description: str = '') -> tuple[str, dict[str, Any]]

Get the AutoML Tabular v1 default training pipeline.

Parameters:
project: str

The GCP project that runs the pipeline components.

location: str

The GCP region that runs the pipeline components.

root_dir: str

The root GCS directory for the pipeline components.

target_column: str

The target column name.

prediction_type: str

The type of prediction the model is to produce. “classification” or “regression”.

optimization_objective: str

For binary classification, “maximize-au-roc”, “minimize-log-loss”, “maximize-au-prc”, “maximize-precision-at-recall”, or “maximize-recall-at-precision”. For multi class classification, “minimize-log-loss”. For regression, “minimize-rmse”, “minimize-mae”, or “minimize-rmsle”.

transformations: str

The path to a GCS file containing the transformations to apply.

train_budget_milli_node_hours: float

The train budget for creating this model, expressed in milli node hours, i.e., a value of 1,000 in this field means 1 node hour.

stage_1_num_parallel_trials: int | None = None

Number of parallel trials for stage 1.

stage_2_num_parallel_trials: int | None = None

Number of parallel trials for stage 2.

stage_2_num_selected_trials: int | None = None

Number of selected trials for stage 2.

data_source_csv_filenames: str | None = None

The CSV data source.

data_source_bigquery_table_path: str | None = None

The BigQuery data source.

predefined_split_key: str | None = None

The predefined_split column name.

timestamp_split_key: str | None = None

The timestamp_split column name.

stratified_split_key: str | None = None

The stratified_split column name.

training_fraction: float | None = None

The training fraction.

validation_fraction: float | None = None

The validation fraction.

test_fraction: float | None = None

The test fraction.

weight_column: str | None = None

The weight column name.

study_spec_parameters_override: list[dict[str, Any]] | None = None

The list for overriding study spec. The list should be of format: https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/study.proto#L181.

optimization_objective_recall_value: float | None = None

Required when optimization_objective is “maximize-precision-at-recall”. Must be between 0 and 1, inclusive.

optimization_objective_precision_value: float | None = None

Required when optimization_objective is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.

stage_1_tuner_worker_pool_specs_override: dict[str, Any] | None = None

The dictionary for overriding the stage 1 tuner worker pool spec. The dictionary should be of format: https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.

cv_trainer_worker_pool_specs_override: dict[str, Any] | None = None

The dictionary for overriding the cv trainer worker pool spec. The dictionary should be of format: https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.

export_additional_model_without_custom_ops: bool = False

Whether to export additional model without custom TensorFlow operators.

stats_and_example_gen_dataflow_machine_type: str | None = None

The dataflow machine type for stats_and_example_gen component.

stats_and_example_gen_dataflow_max_num_workers: int | None = None

The max number of Dataflow workers for stats_and_example_gen component.

stats_and_example_gen_dataflow_disk_size_gb: int | None = None

Dataflow worker’s disk size in GB for stats_and_example_gen component.

transform_dataflow_machine_type: str | None = None

The dataflow machine type for transform component.

transform_dataflow_max_num_workers: int | None = None

The max number of Dataflow workers for transform component.

transform_dataflow_disk_size_gb: int | None = None

Dataflow worker’s disk size in GB for transform component.

dataflow_subnetwork: str | None = None

Dataflow’s fully qualified subnetwork name; when empty, the default subnetwork will be used. Example: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications

dataflow_use_public_ips: bool = True

Specifies whether Dataflow workers use public IP addresses.

encryption_spec_key_name: str | None = None

The KMS key name.

additional_experiments: dict[str, Any] | None = None

Use this field to configure private preview features.

dataflow_service_account: str | None = None

Custom service account to run dataflow jobs.

run_evaluation: bool = True

Whether to run evaluation in the training pipeline.

evaluation_batch_predict_machine_type: str | None = None

The prediction server machine type for batch predict components during evaluation.

evaluation_batch_predict_starting_replica_count: int | None = None

The initial number of prediction servers for batch predict components during evaluation.

evaluation_batch_predict_max_replica_count: int | None = None

The max number of prediction servers for batch predict components during evaluation.

evaluation_batch_explain_machine_type: str | None = None

The prediction server machine type for batch explain components during evaluation.

evaluation_batch_explain_starting_replica_count: int | None = None

The initial number of prediction servers for batch explain components during evaluation.

evaluation_batch_explain_max_replica_count: int | None = None

The max number of prediction servers for batch explain components during evaluation.

evaluation_dataflow_machine_type: str | None = None

The dataflow machine type for evaluation components.

evaluation_dataflow_starting_num_workers: int | None = None

The initial number of Dataflow workers for evaluation components.

evaluation_dataflow_max_num_workers: int | None = None

The max number of Dataflow workers for evaluation components.

evaluation_dataflow_disk_size_gb: int | None = None

Dataflow worker’s disk size in GB for evaluation components.

run_distillation: bool = False

Whether to run distill in the training pipeline.

distill_batch_predict_machine_type: str | None = None

The prediction server machine type for batch predict component in the model distillation.

distill_batch_predict_starting_replica_count: int | None = None

The initial number of prediction servers for the batch predict component in model distillation.

distill_batch_predict_max_replica_count: int | None = None

The max number of prediction servers for the batch predict component in model distillation.

stage_1_tuning_result_artifact_uri: str | None = None

The stage 1 tuning result artifact GCS URI.

quantiles: list[float] | None = None

Quantiles to use for probabilistic inference. Up to 5 quantiles are allowed, with values between 0 and 1, exclusive. Quantiles must be unique.

enable_probabilistic_inference: bool = False

If probabilistic inference is enabled, the model will fit a distribution that captures the uncertainty of a prediction. At inference time, the predictive distribution is used to make a point prediction that minimizes the optimization objective. For example, the mean of a predictive distribution is the point prediction that minimizes RMSE loss. If quantiles are specified, then the quantiles of the distribution are also returned.

num_selected_features: int | None = None

Number of selected features for feature selection, defaults to None, in which case all features are used.

model_display_name: str = ''

The display name of the uploaded Vertex model.

model_description: str = ''

The description for the uploaded model.

Returns:

Tuple of pipeline_definition_path and parameter_values.
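A minimal end-to-end sketch of building the default pipeline spec and submitting it to Vertex AI Pipelines, assuming the function is exposed at the google_cloud_pipeline_components.v1.automl.tabular module level; the project, bucket, and BigQuery table below are placeholders:

```python
# Sketch only: compile-and-submit flow for the default AutoML Tabular v1 pipeline.
from google.cloud import aiplatform
from google_cloud_pipeline_components.v1.automl import tabular

template_path, parameter_values = tabular.get_automl_tabular_pipeline_and_parameters(
    project="my-project",
    location="us-central1",
    root_dir="gs://my-bucket/automl-tabular",
    target_column="label",
    prediction_type="classification",
    optimization_objective="maximize-au-roc",
    transformations="gs://my-bucket/transformations.json",
    train_budget_milli_node_hours=1000,  # 1 node hour
    data_source_bigquery_table_path="bq://my-project.my_dataset.my_table",
    training_fraction=0.8,
    validation_fraction=0.1,
    test_fraction=0.1,
)

aiplatform.init(project="my-project", location="us-central1")
job = aiplatform.PipelineJob(
    display_name="automl-tabular",
    template_path=template_path,
    pipeline_root="gs://my-bucket/pipeline-root",
    parameter_values=parameter_values,
)
job.run()
```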