AutoML Tabular

GA AutoML tabular components.

Components:

CvTrainerOp(project, location, root_dir, ...)

Tunes AutoML Tabular models and selects top trials using cross-validation.

EnsembleOp(project, location, root_dir, ...)

Ensembles AutoML Tabular models.

FinalizerOp(project, location, root_dir, ...)

Finalizes AutoML Tabular pipelines.

InfraValidatorOp(unmanaged_container_model)

Validates that the trained AutoML Tabular model is a valid model.

SplitMaterializedDataOp(materialized_data, ...)

Splits materialized dataset into train, eval, and test data splits.

Stage1TunerOp(project, location, root_dir, ...)

Searches AutoML Tabular architectures and selects the top trials.

StatsAndExampleGenOp(project, location, ...)

Generates stats and training instances for tabular data.

TrainingConfiguratorAndValidatorOp(...[, ...])

Configures training and validates data and user-input configurations.

TransformOp(project, location, root_dir, ...)

Transforms raw features to engineered features.

v1.automl.tabular.CvTrainerOp(project: str, location: str, root_dir: str, deadline_hours: float, num_parallel_trials: int, single_run_max_secs: int, num_selected_trials: int, transform_output: dsl.Input[system.Artifact], metadata: dsl.Input[system.Artifact], materialized_cv_splits: dsl.Input[system.Artifact], tuning_result_input: dsl.Input[system.Artifact], gcp_resources: dsl.OutputPath(str), tuning_result_output: dsl.Output[system.Artifact], execution_metrics: dsl.OutputPath(dict), worker_pool_specs_override_json: list | None = [], num_selected_features: int | None = 0, encryption_spec_key_name: str | None = '')

Tunes AutoML Tabular models and selects top trials using cross-validation.

Parameters
project: str

Project to run the cross-validation trainer.

location: str

Location for running the cross-validation trainer.

root_dir: str

The Cloud Storage location to store the output.

worker_pool_specs_override_json: list | None = []

JSON worker pool specs. E.g., [{"machine_spec": {"machine_type": "n1-standard-16"}},{},{},{"machine_spec": {"machine_type": "n1-standard-16"}}]

deadline_hours: float

Number of hours the cross-validation trainer should run.

num_parallel_trials: int

Number of parallel training trials.

single_run_max_secs: int

Max number of seconds each training trial runs.

num_selected_trials: int

Number of selected trials. The number of weak learners in the final model is 5 * num_selected_trials.

num_selected_features: int | None = 0

Number of selected features. The number of features to learn in the NN models.

transform_output: dsl.Input[system.Artifact]

The transform output artifact.

metadata: dsl.Input[system.Artifact]

The tabular example gen metadata.

materialized_cv_splits: dsl.Input[system.Artifact]

The materialized cross-validation splits.

tuning_result_input: dsl.Input[system.Artifact]

AutoML Tabular tuning result.

encryption_spec_key_name: str | None = ''

Customer-managed encryption key.

Returns

tuning_result_output: dsl.Output[system.Artifact]

The trained model and architectures.

gcp_resources: dsl.OutputPath(str)

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.

execution_metrics: dsl.OutputPath(dict)

Core metrics of the component execution, as a dictionary.
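
A minimal usage sketch, assuming the component is imported from the google_cloud_pipeline_components.v1.automl.tabular module (as the qualified name above suggests) and compiled with the KFP v2 SDK. The gs:// URIs passed to dsl.importer are illustrative placeholders for artifacts that would normally come from upstream components such as StatsAndExampleGenOp, TransformOp, and Stage1TunerOp:

    from kfp import dsl
    from google_cloud_pipeline_components.v1.automl import tabular as automl_tabular


    @dsl.pipeline(name="automl-tabular-cv-trainer-demo")
    def cv_trainer_demo(project: str, location: str, root_dir: str):
        # Placeholder imports standing in for artifacts produced by upstream
        # components (StatsAndExampleGenOp, TransformOp, Stage1TunerOp).
        transform_output = dsl.importer(
            artifact_uri="gs://my-bucket/transform_output",  # illustrative URI
            artifact_class=dsl.Artifact,
        )
        metadata = dsl.importer(
            artifact_uri="gs://my-bucket/metadata", artifact_class=dsl.Artifact
        )
        cv_splits = dsl.importer(
            artifact_uri="gs://my-bucket/materialized_cv_splits",
            artifact_class=dsl.Artifact,
        )
        stage_1_result = dsl.importer(
            artifact_uri="gs://my-bucket/tuning_result", artifact_class=dsl.Artifact
        )

        cv_trainer = automl_tabular.CvTrainerOp(
            project=project,
            location=location,
            root_dir=root_dir,
            deadline_hours=1.0,
            num_parallel_trials=5,
            single_run_max_secs=3600,
            num_selected_trials=5,
            transform_output=transform_output.output,
            metadata=metadata.output,
            materialized_cv_splits=cv_splits.output,
            tuning_result_input=stage_1_result.output,
            # Optional worker pool override, written as plain dicts.
            worker_pool_specs_override_json=[
                {"machine_spec": {"machine_type": "n1-standard-16"}},
                {},
                {},
                {"machine_spec": {"machine_type": "n1-standard-16"}},
            ],
        )
        # cv_trainer.outputs["tuning_result_output"] feeds EnsembleOp downstream.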

v1.automl.tabular.EnsembleOp(project: str, location: str, root_dir: str, transform_output: dsl.Input[system.Artifact], metadata: dsl.Input[system.Artifact], dataset_schema: dsl.Input[system.Artifact], tuning_result_input: dsl.Input[system.Artifact], instance_baseline: dsl.Input[system.Artifact], gcp_resources: dsl.OutputPath(str), model_architecture: dsl.Output[system.Artifact], model: dsl.Output[system.Artifact], unmanaged_container_model: dsl.Output[google.UnmanagedContainerModel], model_without_custom_ops: dsl.Output[system.Artifact], explanation_metadata: dsl.OutputPath(dict), explanation_metadata_artifact: dsl.Output[system.Artifact], explanation_parameters: dsl.OutputPath(dict), warmup_data: dsl.Input[system.Dataset] | None = None, encryption_spec_key_name: str | None = '', export_additional_model_without_custom_ops: bool | None = False)

Ensembles AutoML Tabular models.

Parameters
project: str

Project to run the model ensembling job.

location: str

Location for running the model ensembling job.

root_dir: str

The Cloud Storage location to store the output.

transform_output: dsl.Input[system.Artifact]

The transform output artifact.

metadata: dsl.Input[system.Artifact]

The tabular example gen metadata.

dataset_schema: dsl.Input[system.Artifact]

The schema of the dataset.

tuning_result_input: dsl.Input[system.Artifact]

AutoML Tabular tuning result.

instance_baseline: dsl.Input[system.Artifact]

The instance baseline used to calculate explanations.

warmup_data: dsl.Input[system.Dataset] | None = None

The warm-up data. The ensemble component saves the warm-up data together with the model artifact; it is used to warm up the model when the prediction server starts.

encryption_spec_key_name: str | None = ''

Customer-managed encryption key.

export_additional_model_without_custom_ops: bool | None = False

Whether to export an additional model without custom TF operators to the model_without_custom_ops output.

Returns

gcp_resources: dsl.OutputPath(str)

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.

model_architecture: dsl.Output[system.Artifact]

The architecture of the output model.

model: dsl.Output[system.Artifact]

The output model.

model_without_custom_ops: dsl.Output[system.Artifact]

The output model without custom TF operators. This output is empty unless export_additional_model_without_custom_ops is set.

model_uri: Unknown

The URI of the output model.

instance_schema_uri: Unknown

The URI of the instance schema.

prediction_schema_uri: Unknown

The URI of the prediction schema.

explanation_metadata: dsl.OutputPath(dict)

The explanation metadata used by Vertex online and batch explanations.

explanation_parameters: dsl.OutputPath(dict)

The explanation parameters used by Vertex online and batch explanations.
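
A sketch of wiring EnsembleOp under the same assumptions as the CvTrainerOp example above; the import_artifact helper and its gs:// URIs are illustrative stand-ins for upstream task outputs:

    from kfp import dsl
    from google_cloud_pipeline_components.v1.automl import tabular as automl_tabular


    def import_artifact(uri: str):
        # Stand-in for an upstream task output (illustrative placeholder).
        return dsl.importer(artifact_uri=uri, artifact_class=dsl.Artifact).output


    @dsl.pipeline(name="automl-tabular-ensemble-demo")
    def ensemble_demo(project: str, location: str, root_dir: str):
        ensemble = automl_tabular.EnsembleOp(
            project=project,
            location=location,
            root_dir=root_dir,
            transform_output=import_artifact("gs://my-bucket/transform_output"),
            metadata=import_artifact("gs://my-bucket/metadata"),
            dataset_schema=import_artifact("gs://my-bucket/dataset_schema"),
            tuning_result_input=import_artifact("gs://my-bucket/tuning_result"),
            instance_baseline=import_artifact("gs://my-bucket/instance_baseline"),
            export_additional_model_without_custom_ops=True,
        )
        # ensemble.outputs["unmanaged_container_model"] is the artifact expected
        # by InfraValidatorOp below.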

v1.automl.tabular.FinalizerOp(project: str, location: str, root_dir: str, gcp_resources: dsl.OutputPath(str), encryption_spec_key_name: str | None = '')

Finalizes AutoML Tabular pipelines.

Parameters
project: str

Project to run the pipeline finalizer.

location: str

Location for running the pipeline finalizer.

root_dir: str

The Cloud Storage location to store the output.

encryption_spec_key_name: str | None = ''

Customer-managed encryption key.

Returns

gcp_resources: dsl.OutputPath(str)

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.

v1.automl.tabular.InfraValidatorOp(unmanaged_container_model: dsl.Input[google.UnmanagedContainerModel])

Validates that the trained AutoML Tabular model is a valid model.

Parameters
unmanaged_container_model: dsl.Input[google.UnmanagedContainerModel]

The google.UnmanagedContainerModel artifact for the model to be validated.
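
A short sketch, assuming the google.UnmanagedContainerModel artifact type is importable from google_cloud_pipeline_components.types.artifact_types; the artifact URI is an illustrative placeholder for the ensemble output:

    from kfp import dsl
    from google_cloud_pipeline_components.types import artifact_types
    from google_cloud_pipeline_components.v1.automl import tabular as automl_tabular


    @dsl.pipeline(name="automl-tabular-infra-validator-demo")
    def infra_validator_demo():
        # In the full pipeline this artifact is EnsembleOp's
        # unmanaged_container_model output; here a placeholder URI is imported.
        model = dsl.importer(
            artifact_uri="gs://my-bucket/model",  # illustrative URI
            artifact_class=artifact_types.UnmanagedContainerModel,
        )
        automl_tabular.InfraValidatorOp(unmanaged_container_model=model.output)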

v1.automl.tabular.SplitMaterializedDataOp(materialized_data: dsl.Input[system.Dataset], materialized_train_split: dsl.Output[system.Artifact], materialized_eval_split: dsl.Output[system.Artifact], materialized_test_split: dsl.Output[system.Artifact])

Splits materialized dataset into train, eval, and test data splits.

The materialized dataset generated by the Feature Transform Engine consists of all the splits that were combined into the input transform dataset (i.e., train, eval, and test splits). This component splits the output materialized dataset into corresponding materialized data splits so that the splits can be used by downstream training or evaluation components.

Parameters
materialized_data: dsl.Input[system.Dataset]

Materialized dataset output by the Feature Transform Engine.

Returns

materialized_train_split: dsl.Output[system.Artifact]

Path pattern to materialized train split.

materialized_eval_split: dsl.Output[system.Artifact]

Path pattern to materialized eval split.

materialized_test_split: dsl.Output[system.Artifact]

Path pattern to materialized test split.
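
A minimal sketch under the same import assumptions as above; the materialized_data URI is an illustrative placeholder for the Feature Transform Engine output:

    from kfp import dsl
    from google_cloud_pipeline_components.v1.automl import tabular as automl_tabular


    @dsl.pipeline(name="split-materialized-data-demo")
    def split_demo():
        # Placeholder for the materialized dataset produced by the Feature
        # Transform Engine (illustrative URI).
        materialized_data = dsl.importer(
            artifact_uri="gs://my-bucket/materialized_data",
            artifact_class=dsl.Dataset,
        )
        split = automl_tabular.SplitMaterializedDataOp(
            materialized_data=materialized_data.output
        )
        # split.outputs["materialized_train_split"], ["materialized_eval_split"],
        # and ["materialized_test_split"] feed downstream training components.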

v1.automl.tabular.Stage1TunerOp(project: str, location: str, root_dir: str, num_selected_trials: int, deadline_hours: float, num_parallel_trials: int, single_run_max_secs: int, metadata: dsl.Input[system.Artifact], transform_output: dsl.Input[system.Artifact], materialized_train_split: dsl.Input[system.Artifact], materialized_eval_split: dsl.Input[system.Artifact], gcp_resources: dsl.OutputPath(str), tuning_result_output: dsl.Output[system.Artifact], execution_metrics: dsl.OutputPath(dict), study_spec_parameters_override: list | None = [], worker_pool_specs_override_json: list | None = [], reduce_search_space_mode: str | None = 'regular', num_selected_features: int | None = 0, disable_early_stopping: bool | None = False, feature_ranking: dsl.Input[system.Artifact] | None = None, tune_feature_selection_rate: bool | None = False, encryption_spec_key_name: str | None = '', run_distillation: bool | None = False)

Searches AutoML Tabular architectures and selects the top trials.

Parameters
project: str

Project to run the stage 1 tuner.

location: str

Location for running the stage 1 tuner.

root_dir: str

The Cloud Storage location to store the output.

study_spec_parameters_override: list | None = []

JSON study spec. E.g., [{"parameter_id": "model_type", "categorical_value_spec": {"values": ["nn"]}}]

worker_pool_specs_override_json: list | None = []

JSON worker pool specs. E.g., [{"machine_spec": {"machine_type": "n1-standard-16"}},{},{},{"machine_spec": {"machine_type": "n1-standard-16"}}]

reduce_search_space_mode: str | None = 'regular'

The reduce search space mode. Possible values: "regular" (default), "minimal", "full".

num_selected_trials: int

Number of selected trials. The number of weak learners in the final model is 5 * num_selected_trials.

num_selected_features: int | None = 0

Number of selected features. The number of features to learn in the NN models.

deadline_hours: float

Number of hours the cross-validation trainer should run.

disable_early_stopping: bool | None = False

True if disable early stopping. Default value is false.

num_parallel_trials: int

Number of parallel training trials.

single_run_max_secs: int

Max number of seconds each training trial runs.

metadata: dsl.Input[system.Artifact]

The tabular example gen metadata.

transform_output: dsl.Input[system.Artifact]

The transform output artifact.

materialized_train_split: dsl.Input[system.Artifact]

The materialized train split.

materialized_eval_split: dsl.Input[system.Artifact]

The materialized eval split.

encryption_spec_key_name: str | None = ''

Customer-managed encryption key.

run_distillation: bool | None = False

True if in distillation mode. The default value is false.

Returns

gcp_resources: dsl.OutputPath(str)

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.

tuning_result_output: dsl.Output[system.Artifact]

The trained model and architectures.

execution_metrics: dsl.OutputPath(dict)

Core metrics of the component execution, as a dictionary.
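
When called from Python, the study spec and worker pool overrides are passed as plain lists of dicts; the values below mirror the quoted examples above and are illustrative:

    # Illustrative override values, written as plain Python literals that are
    # equivalent to the quoted JSON examples above.
    study_spec_parameters_override = [
        {"parameter_id": "model_type", "categorical_value_spec": {"values": ["nn"]}}
    ]
    worker_pool_specs_override_json = [
        {"machine_spec": {"machine_type": "n1-standard-16"}},
        {},
        {},
        {"machine_spec": {"machine_type": "n1-standard-16"}},
    ]

These literals are passed directly as the study_spec_parameters_override and worker_pool_specs_override_json arguments when calling Stage1TunerOp; the same worker pool override shape applies to CvTrainerOp.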

v1.automl.tabular.StatsAndExampleGenOp(project: str, location: str, root_dir: str, target_column_name: str, prediction_type: str, transformations: str, dataset_schema: dsl.Output[system.Artifact], dataset_stats: dsl.Output[system.Artifact], train_split: dsl.Output[system.Dataset], eval_split: dsl.Output[system.Dataset], test_split: dsl.Output[system.Dataset], test_split_json: dsl.OutputPath(list), downsampled_test_split_json: dsl.OutputPath(list), instance_baseline: dsl.Output[system.Artifact], metadata: dsl.Output[system.Artifact], gcp_resources: dsl.OutputPath(str), weight_column_name: str | None = '', optimization_objective: str | None = '', optimization_objective_recall_value: float | None = - 1, optimization_objective_precision_value: float | None = - 1, transformations_path: str | None = '', request_type: str | None = 'COLUMN_STATS_ONLY', dataflow_machine_type: str | None = 'n1-standard-16', dataflow_max_num_workers: int | None = 25, dataflow_disk_size_gb: int | None = 40, dataflow_subnetwork: str | None = '', dataflow_use_public_ips: bool | None = True, dataflow_service_account: str | None = '', encryption_spec_key_name: str | None = '', run_distillation: bool | None = False, additional_experiments: str | None = '', additional_experiments_json: dict | None = {}, data_source_csv_filenames: str | None = '', data_source_bigquery_table_path: str | None = '', predefined_split_key: str | None = '', timestamp_split_key: str | None = '', stratified_split_key: str | None = '', training_fraction: float | None = - 1, validation_fraction: float | None = - 1, test_fraction: float | None = - 1, quantiles: list | None = [], enable_probabilistic_inference: bool | None = False)

Generates stats and training instances for tabular data.

Parameters
project: str

Project to run dataset statistics and example generation.

location: str

Location for running dataset statistics and example generation.

root_dir: str

The Cloud Storage location to store the output.

target_column_name: str

The target column name.

weight_column_name: str | None = ''

The weight column name.

prediction_type: str

The prediction type. Supported values: "classification", "regression".

optimization_objective: str | None = ''

Objective function the model is optimizing towards. The training process creates a model that maximizes/minimizes the value of the objective function over the validation set. The supported optimization objectives depend on the prediction type. If the field is not set, a default objective function is used.

classification: "maximize-au-roc" (default) - Maximize the area under the receiver operating characteristic (ROC) curve. "minimize-log-loss" - Minimize log loss. "maximize-au-prc" - Maximize the area under the precision-recall curve. "maximize-precision-at-recall" - Maximize precision for a specified recall value. "maximize-recall-at-precision" - Maximize recall for a specified precision value.

classification (multi-class): "minimize-log-loss" (default) - Minimize log loss.

regression: "minimize-rmse" (default) - Minimize root-mean-squared error (RMSE). "minimize-mae" - Minimize mean-absolute error (MAE). "minimize-rmsle" - Minimize root-mean-squared log error (RMSLE).

optimization_objective_recall_value: float | None = -1

Required when optimization_objective is "maximize-precision-at-recall". Must be between 0 and 1, inclusive.

optimization_objective_precision_value: float | None = -1

Required when optimization_objective is "maximize-recall-at-precision". Must be between 0 and 1, inclusive.

transformations: str

Quote escaped JSON string for transformations. Each transformation will apply transform function to given input column. And the result will be used for training. When creating transformation for BigQuery Struct column, the column should be flattened using "." as the delimiter.

transformations_path: str | None = ''

Path to a GCS file containing JSON string for transformations.

dataflow_machine_type: str | None = 'n1-standard-16'

The machine type used for dataflow jobs. If not set, default to n1-standard-16.

dataflow_max_num_workers: int | None = 25

The number of workers to run the dataflow job. If not set, default to 25.

dataflow_disk_size_gb: int | None = 40

The disk size, in gigabytes, to use on each Dataflow worker instance. If not set, default to 40.

dataflow_subnetwork: str | None = ''

Dataflow's fully qualified subnetwork name, when empty the default subnetwork will be used. More details: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications

dataflow_use_public_ips: bool | None = True

Specifies whether Dataflow workers use public IP addresses.

dataflow_service_account: str | None = ''

Custom service account to run dataflow jobs.

encryption_spec_key_name: str | None = ''

Customer-managed encryption key.

run_distillation: bool | None = False

True if in distillation mode. The default value is false.

Returns

dataset_schema: dsl.Output[system.Artifact]

The schema of the dataset.

dataset_stats: dsl.Output[system.Artifact]

The stats of the dataset.

train_split: dsl.Output[system.Dataset]

The train split.

eval_split: dsl.Output[system.Dataset]

The eval split.

test_split: dsl.Output[system.Dataset]

The test split.

test_split_json: dsl.OutputPath(list)

The test split JSON object.

downsampled_test_split_json: dsl.OutputPath(list)

The downsampled test split JSON object.

instance_baseline: dsl.Output[system.Artifact]

The instance baseline used to calculate explanations.

metadata: dsl.Output[system.Artifact]

The tabular example gen metadata.

gcp_resources: dsl.OutputPath(str)

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.
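
A minimal sketch under the same import assumptions as above; the BigQuery table, column name, and transformation entries are illustrative, and the exact transformation schema should be checked against the transformations parameter description:

    import json

    from kfp import dsl
    from google_cloud_pipeline_components.v1.automl import tabular as automl_tabular

    # Illustrative transformation spec; check the transformations parameter
    # description for the exact schema expected by the component.
    TRANSFORMATIONS = json.dumps(
        [
            {"auto": {"column_name": "feature_1"}},
            {"auto": {"column_name": "feature_2"}},
        ]
    )


    @dsl.pipeline(name="stats-and-example-gen-demo")
    def stats_demo(project: str, location: str, root_dir: str):
        stats = automl_tabular.StatsAndExampleGenOp(
            project=project,
            location=location,
            root_dir=root_dir,
            target_column_name="label",  # illustrative column name
            prediction_type="classification",
            transformations=TRANSFORMATIONS,
            optimization_objective="maximize-au-prc",
            data_source_bigquery_table_path="bq://my-project.my_dataset.my_table",
        )
        # stats.outputs["metadata"], ["dataset_schema"], and the split Datasets
        # feed TransformOp (see the example after TransformOp below).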

v1.automl.tabular.TrainingConfiguratorAndValidatorOp(dataset_stats: dsl.Input[system.Artifact], split_example_counts: str, training_schema: dsl.Input[system.Artifact], instance_schema: dsl.Input[system.Artifact], metadata: dsl.Output[system.Artifact], instance_baseline: dsl.Output[system.Artifact], target_column: str | None = '', weight_column: str | None = '', prediction_type: str | None = '', optimization_objective: str | None = '', optimization_objective_recall_value: float | None = - 1, optimization_objective_precision_value: float | None = - 1, run_evaluation: bool | None = False, run_distill: bool | None = False, enable_probabilistic_inference: bool | None = False, time_series_identifier_column: str | None = None, time_series_identifier_columns: list | None = [], time_column: str | None = '', time_series_attribute_columns: list | None = [], available_at_forecast_columns: list | None = [], unavailable_at_forecast_columns: list | None = [], quantiles: list | None = [], context_window: int | None = - 1, forecast_horizon: int | None = - 1, forecasting_model_type: str | None = '', forecasting_transformations: dict | None = {}, stage_1_deadline_hours: float | None = None, stage_2_deadline_hours: float | None = None, group_columns: list | None = None, group_total_weight: float = 0.0, temporal_total_weight: float = 0.0, group_temporal_total_weight: float = 0.0)

Configures training and validates data and user-input configurations.

Parameters
dataset_stats: dsl.Input[system.Artifact]

Dataset stats generated by feature transform engine.

split_example_counts: str

JSON string of data split example counts for train, validate, and test splits.

training_schema: dsl.Input[system.Artifact]

Schema of input data to the tf_model at training time.

instance_schema: dsl.Input[system.Artifact]

Schema of input data to the tf_model at serving time.

target_column: str | None = ''

Target column of input data.

weight_column: str | None = ''

Weight column of input data.

prediction_type: str | None = ''

Model prediction type. One of "classification", "regression", "time_series".

optimization_objective: str | None = ''

Objective function the model is optimizing towards. The training process creates a model that maximizes/minimizes the value of the objective function over the validation set. The supported optimization objectives depend on the prediction type. If the field is not set, a default objective function is used.

classification: "maximize-au-roc" (default) - Maximize the area under the receiver operating characteristic (ROC) curve. "minimize-log-loss" - Minimize log loss. "maximize-au-prc" - Maximize the area under the precision-recall curve. "maximize-precision-at-recall" - Maximize precision for a specified recall value. "maximize-recall-at-precision" - Maximize recall for a specified precision value.

classification (multi-class): "minimize-log-loss" (default) - Minimize log loss.

regression: "minimize-rmse" (default) - Minimize root-mean-squared error (RMSE). "minimize-mae" - Minimize mean-absolute error (MAE). "minimize-rmsle" - Minimize root-mean-squared log error (RMSLE).

optimization_objective_recall_value: float | None = -1

Required when optimization_objective is "maximize-precision-at-recall". Must be between 0 and 1, inclusive.

optimization_objective_precision_value: float | None = -1

Required when optimization_objective is "maximize-recall-at-precision". Must be between 0 and 1, inclusive.

run_evaluation: bool | None = False

Whether we are running evaluation in the training pipeline.

run_distill: bool | None = False

Whether the distillation should be applied to the training.

enable_probabilistic_inference: bool | None = False

If probabilistic inference is enabled, the model will fit a distribution that captures the uncertainty of a prediction. At inference time, the predictive distribution is used to make a point prediction that minimizes the optimization objective. For example, the mean of a predictive distribution is the point prediction that minimizes RMSE loss. If quantiles are specified, then the quantiles of the distribution are also returned.

time_series_identifier_column: str | None = None

[Deprecated] The time series identifier column. Used by forecasting only. Raises an exception if used; use the time_series_identifier_columns field instead.

time_series_identifier_columns: list | None = []

The list of time series identifier columns. Used by forecasting only.

time_column: str | None = ''

The column that indicates the time. Used by forecasting only.

time_series_attribute_columns: list | None = []

The column names of the time series attributes.

available_at_forecast_columns: list | None = []

The names of the columns that are available at forecast time.

unavailable_at_forecast_columns: list | None = []

The names of the columns that are not available at forecast time.

quantiles: list | None = []

All quantiles that the model needs to predict.

context_window: int | None = -1

The length of the context window.

forecast_horizon: int | None = -1

The length of the forecast horizon.

forecasting_model_type: str | None = ''

The model types, e.g. l2l, seq2seq, tft.

forecasting_transformations: dict | None = {}

Dict mapping auto and/or type-resolutions to feature columns. The supported types are auto, categorical, numeric, text, and timestamp.

stage_1_deadline_hours: float | None = None

Stage 1 training budget in hours.

stage_2_deadline_hours: float | None = None

Stage 2 training budget in hours.

group_columns: list | None = None

A list of time series attribute column names that define the time series hierarchy.

group_total_weight: float = 0.0

The weight of the loss for predictions aggregated over time series in the same group.

temporal_total_weight: float = 0.0

The weight of the loss for predictions aggregated over the horizon for a single time series.

group_temporal_total_weight: float = 0.0

The weight of the loss for predictions aggregated over both the horizon and time series in the same hierarchy group.

Returns

metadata: dsl.Output[system.Artifact]

The tabular example gen metadata.
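
A minimal sketch under the same import assumptions as above; the placeholder URIs and the split_example_counts JSON are illustrative:

    from kfp import dsl
    from google_cloud_pipeline_components.v1.automl import tabular as automl_tabular


    @dsl.pipeline(name="training-configurator-demo")
    def configurator_demo():
        # Placeholders for artifacts normally produced by the feature transform
        # engine (illustrative URIs).
        dataset_stats = dsl.importer(
            artifact_uri="gs://my-bucket/dataset_stats", artifact_class=dsl.Artifact
        )
        training_schema = dsl.importer(
            artifact_uri="gs://my-bucket/training_schema", artifact_class=dsl.Artifact
        )
        instance_schema = dsl.importer(
            artifact_uri="gs://my-bucket/instance_schema", artifact_class=dsl.Artifact
        )

        configurator = automl_tabular.TrainingConfiguratorAndValidatorOp(
            dataset_stats=dataset_stats.output,
            split_example_counts='{"train": 800, "validate": 100, "test": 100}',  # illustrative JSON
            training_schema=training_schema.output,
            instance_schema=instance_schema.output,
            target_column="label",  # illustrative column name
            prediction_type="regression",
            optimization_objective="minimize-rmse",
        )
        # configurator.outputs["metadata"] is consumed by downstream trainers.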

v1.automl.tabular.TransformOp(project: str, location: str, root_dir: str, metadata: dsl.Input[system.Artifact], dataset_schema: dsl.Input[system.Artifact], train_split: dsl.Input[system.Dataset], eval_split: dsl.Input[system.Dataset], test_split: dsl.Input[system.Dataset], materialized_train_split: dsl.Output[system.Artifact], materialized_eval_split: dsl.Output[system.Artifact], materialized_test_split: dsl.Output[system.Artifact], training_schema_uri: dsl.Output[system.Artifact], transform_output: dsl.Output[system.Artifact], gcp_resources: dsl.OutputPath(str), dataflow_machine_type: str | None = 'n1-standard-16', dataflow_max_num_workers: int | None = 25, dataflow_disk_size_gb: int | None = 40, dataflow_subnetwork: str | None = '', dataflow_use_public_ips: bool | None = True, dataflow_service_account: str | None = '', encryption_spec_key_name: str | None = '')

Transforms raw features to engineered features.

Parameters
project: str

Project to run the transform job.

location: str

Location for running the transform job.

root_dir: str

The Cloud Storage location to store the output.

metadata: dsl.Input[system.Artifact]

The tabular example gen metadata.

dataset_schema: dsl.Input[system.Artifact]

The schema of the dataset.

train_split: dsl.Input[system.Dataset]

The train split.

eval_split: dsl.Input[system.Dataset]

The eval split.

test_split: dsl.Input[system.Dataset]

The test split.

dataflow_machine_type: str | None = 'n1-standard-16'

The machine type used for dataflow jobs. If not set, default to n1-standard-16.

dataflow_max_num_workers: int | None = 25

The number of workers to run the dataflow job. If not set, default to 25.

dataflow_disk_size_gb: int | None = 40

The disk size, in gigabytes, to use on each Dataflow worker instance. If not set, default to 40.

dataflow_subnetwork: str | None = ''

Dataflow's fully qualified subnetwork name, when empty the default subnetwork will be used. More details: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications

dataflow_use_public_ips: bool | None = True

Specifies whether Dataflow workers use public IP addresses.

dataflow_service_account: str | None = ''

Custom service account to run dataflow jobs.

encryption_spec_key_name: str | None = ''

Customer-managed encryption key.

Returns

materialized_train_split: dsl.Output[system.Artifact]

The materialized train split.

materialized_eval_split: dsl.Output[system.Artifact]

The materialized eval split.

materialized_test_split: dsl.Output[system.Artifact]

The materialized test split.

training_schema_uri: dsl.Output[system.Artifact]

The training schema.

transform_output: dsl.Output[system.Artifact]

The transform output artifact.

gcp_resources: dsl.OutputPath(str)

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.
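
A minimal sketch of the documented hand-off from StatsAndExampleGenOp to TransformOp, under the same import assumptions as above; column names, the transformation entry, and the CSV path are illustrative:

    import json

    from kfp import dsl
    from google_cloud_pipeline_components.v1.automl import tabular as automl_tabular


    @dsl.pipeline(name="automl-tabular-transform-demo")
    def transform_demo(project: str, location: str, root_dir: str):
        stats = automl_tabular.StatsAndExampleGenOp(
            project=project,
            location=location,
            root_dir=root_dir,
            target_column_name="label",  # illustrative column name
            prediction_type="classification",
            transformations=json.dumps([{"auto": {"column_name": "feature_1"}}]),
            data_source_csv_filenames="gs://my-bucket/data.csv",  # illustrative
        )
        transform = automl_tabular.TransformOp(
            project=project,
            location=location,
            root_dir=root_dir,
            metadata=stats.outputs["metadata"],
            dataset_schema=stats.outputs["dataset_schema"],
            train_split=stats.outputs["train_split"],
            eval_split=stats.outputs["eval_split"],
            test_split=stats.outputs["test_split"],
            dataflow_machine_type="n1-standard-16",
        )
        # transform.outputs["transform_output"] and the materialized splits feed
        # Stage1TunerOp and CvTrainerOp.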