AutoML Tabular

GA AutoML tabular components.

Components:

CvTrainerOp(project, location, root_dir, ...)

Tunes AutoML Tabular models and selects top trials using cross-validation.

EnsembleOp(project, location, root_dir, ...)

Ensembles AutoML Tabular models.

FinalizerOp(project, location, root_dir, ...)

Finalizes AutoML Tabular pipelines.

InfraValidatorOp(unmanaged_container_model)

Validates that the trained AutoML Tabular model is a valid model.

SplitMaterializedDataOp(materialized_data, ...)

Splits materialized dataset into train, eval, and test data splits.

Stage1TunerOp(project, location, root_dir, ...)

Searches AutoML Tabular architectures and selects the top trials.

StatsAndExampleGenOp(project, location, ...)

Generates stats and training instances for tabular data.

TrainingConfiguratorAndValidatorOp(...[, ...])

Configures training and validates data and user-input configurations.

TransformOp(project, location, root_dir, ...)

Transforms raw features to engineered features.

v1.automl.tabular.CvTrainerOp(project: str, location: str, root_dir: str, deadline_hours: float, num_parallel_trials: int, single_run_max_secs: int, num_selected_trials: int, transform_output: dsl.Input[system.Artifact], metadata: dsl.Input[system.Artifact], materialized_cv_splits: dsl.Input[system.Artifact], tuning_result_input: dsl.Input[system.Artifact], gcp_resources: dsl.OutputPath(str), tuning_result_output: dsl.Output[system.Artifact], execution_metrics: dsl.OutputPath(dict), worker_pool_specs_override_json: list | None = [], num_selected_features: int | None = 0, encryption_spec_key_name: str | None = '')

Tunes AutoML Tabular models and selects top trials using cross-validation.

Parameters
project: str

Project to run the cross-validation trainer.

location: str

Location for running the cross-validation trainer.

root_dir: str

The Cloud Storage location to store the output.

worker_pool_specs_override_json: list | None = []

JSON worker pool specs. E.g., [{"machine_spec": {"machine_type": "n1-standard-16"}},{},{},{"machine_spec": {"machine_type": "n1-standard-16"}}]

deadline_hours: float

Number of hours the cross-validation trainer should run.

num_parallel_trials: int

Number of parallel training trials.

single_run_max_secs: int

Max number of seconds each training trial runs.

num_selected_trials: int

Number of selected trials. The number of weak learners in the final model is 5 * num_selected_trials.

num_selected_features: int | None = 0

Number of selected features. The number of features to learn in the NN models.

transform_output: dsl.Input[system.Artifact]

The transform output artifact.

metadata: dsl.Input[system.Artifact]

The tabular example gen metadata.

materialized_cv_splits: dsl.Input[system.Artifact]

The materialized cross-validation splits.

tuning_result_input: dsl.Input[system.Artifact]

AutoML Tabular tuning result.

encryption_spec_key_name: str | None = ''

Customer-managed encryption key.

Returns

tuning_result_output: dsl.Output[system.Artifact]

The trained model and architectures.

gcp_resources: dsl.OutputPath(str)

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.

execution_metrics: dsl.OutputPath(dict)

Core metrics of the component execution, as a dictionary.
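
A minimal usage sketch, assuming the component is imported from the google_cloud_pipeline_components.v1.automl.tabular module (as the qualified name above suggests) and compiled with the KFP v2 SDK. The gs:// URIs passed to dsl.importer are illustrative placeholders for artifacts that would normally come from upstream components such as StatsAndExampleGenOp, TransformOp, and Stage1TunerOp:

    from kfp import dsl
    from google_cloud_pipeline_components.v1.automl import tabular as automl_tabular


    @dsl.pipeline(name="automl-tabular-cv-trainer-demo")
    def cv_trainer_demo(project: str, location: str, root_dir: str):
        # Placeholder imports standing in for artifacts produced by upstream
        # components (StatsAndExampleGenOp, TransformOp, Stage1TunerOp).
        transform_output = dsl.importer(
            artifact_uri="gs://my-bucket/transform_output",  # illustrative URI
            artifact_class=dsl.Artifact,
        )
        metadata = dsl.importer(
            artifact_uri="gs://my-bucket/metadata", artifact_class=dsl.Artifact
        )
        cv_splits = dsl.importer(
            artifact_uri="gs://my-bucket/materialized_cv_splits",
            artifact_class=dsl.Artifact,
        )
        stage_1_result = dsl.importer(
            artifact_uri="gs://my-bucket/tuning_result", artifact_class=dsl.Artifact
        )

        cv_trainer = automl_tabular.CvTrainerOp(
            project=project,
            location=location,
            root_dir=root_dir,
            deadline_hours=1.0,
            num_parallel_trials=5,
            single_run_max_secs=3600,
            num_selected_trials=5,
            transform_output=transform_output.output,
            metadata=metadata.output,
            materialized_cv_splits=cv_splits.output,
            tuning_result_input=stage_1_result.output,
            # Optional worker pool override, written as plain dicts.
            worker_pool_specs_override_json=[
                {"machine_spec": {"machine_type": "n1-standard-16"}},
                {},
                {},
                {"machine_spec": {"machine_type": "n1-standard-16"}},
            ],
        )
        # cv_trainer.outputs["tuning_result_output"] feeds EnsembleOp downstream.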

v1.automl.tabular.EnsembleOp(project: str, location: str, root_dir: str, transform_output: dsl.Input[system.Artifact], metadata: dsl.Input[system.Artifact], dataset_schema: dsl.Input[system.Artifact], tuning_result_input: dsl.Input[system.Artifact], instance_baseline: dsl.Input[system.Artifact], gcp_resources: dsl.OutputPath(str), model_architecture: dsl.Output[system.Artifact], model: dsl.Output[system.Artifact], unmanaged_container_model: dsl.Output[google.UnmanagedContainerModel], model_without_custom_ops: dsl.Output[system.Artifact], explanation_metadata: dsl.OutputPath(dict), explanation_metadata_artifact: dsl.Output[system.Artifact], explanation_parameters: dsl.OutputPath(dict), warmup_data: dsl.Input[system.Dataset] | None = None, encryption_spec_key_name: str | None = '', export_additional_model_without_custom_ops: bool | None = False)

Ensembles AutoML Tabular models.

Parameters
project: str

Project to run the model ensembling job.

location: str

Location for running the model ensembling job.

root_dir: str

The Cloud Storage location to store the output.

transform_output: dsl.Input[system.Artifact]

The transform output artifact.

metadata: dsl.Input[system.Artifact]

The tabular example gen metadata.

dataset_schema: dsl.Input[system.Artifact]

The schema of the dataset.

tuning_result_input: dsl.Input[system.Artifact]

AutoML Tabular tuning result.

instance_baseline: dsl.Input[system.Artifact]

The instance baseline used to calculate explanations.

warmup_data: dsl.Input[system.Dataset] | None = None

The warm-up data. The ensemble component saves the warm-up data together with the model artifact; it is used to warm up the model when the prediction server starts.

encryption_spec_key_name: str | None = ''

Customer-managed encryption key.

export_additional_model_without_custom_ops: bool | None = False

Whether to export an additional model without custom TF operators to the model_without_custom_ops output.

Returns

gcp_resources: dsl.OutputPath(str)

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.

model_architecture: dsl.Output[system.Artifact]

The architecture of the output model.

model: dsl.Output[system.Artifact]

The output model.

model_without_custom_ops: dsl.Output[system.Artifact]

The output model without custom TF operators. This output is empty unless export_additional_model_without_custom_ops is set.

model_uri: Unknown

The URI of the output model.

instance_schema_uri: Unknown

The URI of the instance schema.

prediction_schema_uri: Unknown

The URI of the prediction schema.

explanation_metadata: dsl.OutputPath(dict)

The explanation metadata used by Vertex online and batch explanations.

explanation_parameters: dsl.OutputPath(dict)

The explanation parameters used by Vertex online and batch explanations.
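
A sketch of wiring EnsembleOp under the same assumptions as the CvTrainerOp example above; the import_artifact helper and its gs:// URIs are illustrative stand-ins for upstream task outputs:

    from kfp import dsl
    from google_cloud_pipeline_components.v1.automl import tabular as automl_tabular


    def import_artifact(uri: str):
        # Stand-in for an upstream task output (illustrative placeholder).
        return dsl.importer(artifact_uri=uri, artifact_class=dsl.Artifact).output


    @dsl.pipeline(name="automl-tabular-ensemble-demo")
    def ensemble_demo(project: str, location: str, root_dir: str):
        ensemble = automl_tabular.EnsembleOp(
            project=project,
            location=location,
            root_dir=root_dir,
            transform_output=import_artifact("gs://my-bucket/transform_output"),
            metadata=import_artifact("gs://my-bucket/metadata"),
            dataset_schema=import_artifact("gs://my-bucket/dataset_schema"),
            tuning_result_input=import_artifact("gs://my-bucket/tuning_result"),
            instance_baseline=import_artifact("gs://my-bucket/instance_baseline"),
            export_additional_model_without_custom_ops=True,
        )
        # ensemble.outputs["unmanaged_container_model"] is the artifact expected
        # by InfraValidatorOp below.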

v1.automl.tabular.FinalizerOp(project: str, location: str, root_dir: str, gcp_resources: dsl.OutputPath(str), encryption_spec_key_name: str | None = '')

Finalizes AutoML Tabular pipelines.

Parameters
project: str

Project to run the pipeline finalizer.

location: str

Location for running the pipeline finalizer.

root_dir: str

The Cloud Storage location to store the output.

encryption_spec_key_name: str | None = ''

Customer-managed encryption key.

Returns

gcp_resources: dsl.OutputPath(str)

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.

v1.automl.tabular.InfraValidatorOp(unmanaged_container_model: dsl.Input[google.UnmanagedContainerModel])

Validates that the trained AutoML Tabular model is a valid model.

Parameters
unmanaged_container_model: dsl.Input[google.UnmanagedContainerModel]

The google.UnmanagedContainerModel artifact for the model to be validated.
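
A short sketch, assuming the google.UnmanagedContainerModel artifact type is importable from google_cloud_pipeline_components.types.artifact_types; the artifact URI is an illustrative placeholder for the ensemble output:

    from kfp import dsl
    from google_cloud_pipeline_components.types import artifact_types
    from google_cloud_pipeline_components.v1.automl import tabular as automl_tabular


    @dsl.pipeline(name="automl-tabular-infra-validator-demo")
    def infra_validator_demo():
        # In the full pipeline this artifact is EnsembleOp's
        # unmanaged_container_model output; here a placeholder URI is imported.
        model = dsl.importer(
            artifact_uri="gs://my-bucket/model",  # illustrative URI
            artifact_class=artifact_types.UnmanagedContainerModel,
        )
        automl_tabular.InfraValidatorOp(unmanaged_container_model=model.output)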

v1.automl.tabular.SplitMaterializedDataOp(materialized_data: dsl.Input[system.Dataset], materialized_train_split: dsl.Output[system.Artifact], materialized_eval_split: dsl.Output[system.Artifact], materialized_test_split: dsl.Output[system.Artifact])

Splits materialized dataset into train, eval, and test data splits.

The materialized dataset generated by the Feature Transform Engine consists of all the splits that were combined into the input transform dataset (i.e., train, eval, and test splits). This component splits the output materialized dataset into corresponding materialized data splits so that the splits can be used by downstream training or evaluation components.

Parameters
materialized_data: dsl.Input[system.Dataset]

Materialized dataset output by the Feature Transform Engine.

Returns

materialized_train_split: dsl.Output[system.Artifact]

Path pattern to materialized train split.

materialized_eval_split: dsl.Output[system.Artifact]

Path pattern to materialized eval split.

materialized_test_split: dsl.Output[system.Artifact]

Path pattern to materialized test split.
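
A minimal sketch under the same import assumptions as above; the materialized_data URI is an illustrative placeholder for the Feature Transform Engine output:

    from kfp import dsl
    from google_cloud_pipeline_components.v1.automl import tabular as automl_tabular


    @dsl.pipeline(name="split-materialized-data-demo")
    def split_demo():
        # Placeholder for the materialized dataset produced by the Feature
        # Transform Engine (illustrative URI).
        materialized_data = dsl.importer(
            artifact_uri="gs://my-bucket/materialized_data",
            artifact_class=dsl.Dataset,
        )
        split = automl_tabular.SplitMaterializedDataOp(
            materialized_data=materialized_data.output
        )
        # split.outputs["materialized_train_split"], ["materialized_eval_split"],
        # and ["materialized_test_split"] feed downstream training components.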

v1.automl.tabular.Stage1TunerOp(project: str, location: str, root_dir: str, num_selected_trials: int, deadline_hours: float, num_parallel_trials: int, single_run_max_secs: int, metadata: dsl.Input[system.Artifact], transform_output: dsl.Input[system.Artifact], materialized_train_split: dsl.Input[system.Artifact], materialized_eval_split: dsl.Input[system.Artifact], gcp_resources: dsl.OutputPath(str), tuning_result_output: dsl.Output[system.Artifact], execution_metrics: dsl.OutputPath(dict), study_spec_parameters_override: list | None = [], worker_pool_specs_override_json: list | None = [], reduce_search_space_mode: str | None = 'regular', num_selected_features: int | None = 0, disable_early_stopping: bool | None = False, feature_ranking: dsl.Input[system.Artifact] | None = None, tune_feature_selection_rate: bool | None = False, encryption_spec_key_name: str | None = '', run_distillation: bool | None = False)

Searches AutoML Tabular architectures and selects the top trials.

Parameters
project: str

Project to run the stage 1 tuner.

location: str

Location for running the stage 1 tuner.

root_dir: str

The Cloud Storage location to store the output.

study_spec_parameters_override: list | None = []

JSON study spec. E.g., [{"parameter_id": "model_type", "categorical_value_spec": {"values": ["nn"]}}]

worker_pool_specs_override_json: list | None = []

JSON worker pool specs. E.g., [{"machine_spec": {"machine_type": "n1-standard-16"}},{},{},{"machine_spec": {"machine_type": "n1-standard-16"}}]

reduce_search_space_mode: str | None = 'regular'

The reduce search space mode. Possible values: "regular" (default), "minimal", "full".

num_selected_trials: int

Number of selected trials. The number of weak learners in the final model is 5 * num_selected_trials.

num_selected_features: int | None = 0

Number of selected features. The number of features to learn in the NN models.

deadline_hours: float

Number of hours the cross-validation trainer should run.

disable_early_stopping: bool | None = False

True if disable early stopping. Default value is false.

num_parallel_trials: int

Number of parallel training trials.

single_run_max_secs: int

Max number of seconds each training trial runs.

metadata: dsl.Input[system.Artifact]

The tabular example gen metadata.

transform_output: dsl.Input[system.Artifact]

The transform output artifact.

materialized_train_split: dsl.Input[system.Artifact]

The materialized train split.

materialized_eval_split: dsl.Input[system.Artifact]

The materialized eval split.

encryption_spec_key_name: str | None = ''

Customer-managed encryption key.

run_distillation: bool | None = False

True if in distillation mode. The default value is false.

Returns

gcp_resources: dsl.OutputPath(str)

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.

tuning_result_output: dsl.Output[system.Artifact]

The trained model and architectures.

execution_metrics: dsl.OutputPath(dict)

Core metrics of the component execution, as a dictionary.
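
When called from Python, the study spec and worker pool overrides are passed as plain lists of dicts; the values below mirror the quoted examples above and are illustrative:

    # Illustrative override values, written as plain Python literals that are
    # equivalent to the quoted JSON examples above.
    study_spec_parameters_override = [
        {"parameter_id": "model_type", "categorical_value_spec": {"values": ["nn"]}}
    ]
    worker_pool_specs_override_json = [
        {"machine_spec": {"machine_type": "n1-standard-16"}},
        {},
        {},
        {"machine_spec": {"machine_type": "n1-standard-16"}},
    ]

These literals are passed directly as the study_spec_parameters_override and worker_pool_specs_override_json arguments when calling Stage1TunerOp; the same worker pool override shape applies to CvTrainerOp.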

v1.automl.tabular.StatsAndExampleGenOp(project: str, location: str, root_dir: str, target_column_name: str, prediction_type: str, transformations: str, dataset_schema: dsl.Output[system.Artifact], dataset_stats: dsl.Output[system.Artifact], train_split: dsl.Output[system.Dataset], eval_split: dsl.Output[system.Dataset], test_split: dsl.Output[system.Dataset], test_split_json: dsl.OutputPath(list), downsampled_test_split_json: dsl.OutputPath(list), instance_baseline: dsl.Output[system.Artifact], metadata: dsl.Output[system.Artifact], gcp_resources: dsl.OutputPath(str), weight_column_name: str | None = '', optimization_objective: str | None = '', optimization_objective_recall_value: float | None = - 1, optimization_objective_precision_value: float | None = - 1, transformations_path: str | None = '', request_type: str | None = 'COLUMN_STATS_ONLY', dataflow_machine_type: str | None = 'n1-standard-16', dataflow_max_num_workers: int | None = 25, dataflow_disk_size_gb: int | None = 40, dataflow_subnetwork: str | None = '', dataflow_use_public_ips: bool | None = True, dataflow_service_account: str | None = '', encryption_spec_key_name: str | None = '', run_distillation: bool | None = False, additional_experiments: str | None = '', additional_experiments_json: dict | None = {}, data_source_csv_filenames: str | None = '', data_source_bigquery_table_path: str | None = '', predefined_split_key: str | None = '', timestamp_split_key: str | None = '', stratified_split_key: str | None = '', training_fraction: float | None = - 1, validation_fraction: float | None = - 1, test_fraction: float | None = - 1, quantiles: list | None = [], enable_probabilistic_inference: bool | None = False)

Generates stats and training instances for tabular data.

Parameters
project: str

Project to run dataset statistics and example generation.

location: str

Location for running dataset statistics and example generation.

root_dir: str

The Cloud Storage location to store the output.

target_column_name: str

The target column name.

weight_column_name: str | None = ''

The weight column name.

prediction_type: str

The prediction type. Supported values: "classification", "regression".

optimization_objective: str | None = ''

Objective function the model is optimizing towards. The training process creates a model that maximizes/minimizes the value of the objective function over the validation set. The supported optimization objectives depend on the prediction type. If the field is not set, a default objective function is used.

classification: "maximize-au-roc" (default) - Maximize the area under the receiver operating characteristic (ROC) curve. "minimize-log-loss" - Minimize log loss. "maximize-au-prc" - Maximize the area under the precision-recall curve. "maximize-precision-at-recall" - Maximize precision for a specified recall value. "maximize-recall-at-precision" - Maximize recall for a specified precision value.

classification (multi-class): "minimize-log-loss" (default) - Minimize log loss.

regression: "minimize-rmse" (default) - Minimize root-mean-squared error (RMSE). "minimize-mae" - Minimize mean-absolute error (MAE). "minimize-rmsle" - Minimize root-mean-squared log error (RMSLE).

optimization_objective_recall_value: float | None = -1

Required when optimization_objective is "maximize-precision-at-recall". Must be between 0 and 1, inclusive.

optimization_objective_precision_value: float | None = -1

Required when optimization_objective is "maximize-recall-at-precision". Must be between 0 and 1, inclusive.

transformations: str

Quote escaped JSON string for transformations. Each transformation will apply transform function to given input column. And the result will be used for training. When creating transformation for BigQuery Struct column, the column should be flattened using "." as the delimiter.

transformations_path: str | None = ''

Path to a GCS file containing JSON string for transformations.

dataflow_machine_type: str | None = 'n1-standard-16'

The machine type used for dataflow jobs. If not set, default to n1-standard-16.

dataflow_max_num_workers: int | None = 25

The number of workers to run the dataflow job. If not set, default to 25.

dataflow_disk_size_gb: int | None = 40

The disk size, in gigabytes, to use on each Dataflow worker instance. If not set, default to 40.

dataflow_subnetwork: str | None = ''

Dataflow's fully qualified subnetwork name, when empty the default subnetwork will be used. More details: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications

dataflow_use_public_ips: bool | None = True

Specifies whether Dataflow workers use public IP addresses.

dataflow_service_account: str | None = ''

Custom service account to run dataflow jobs.

encryption_spec_key_name: str | None = ''

Customer-managed encryption key.

run_distillation: bool | None = False

True if in distillation mode. The default value is false.

Returns

dataset_schema: dsl.Output[system.Artifact]

The schema of the dataset.

dataset_stats: dsl.Output[system.Artifact]

The stats of the dataset.

train_split: dsl.Output[system.Dataset]

The train split.

eval_split: dsl.Output[system.Dataset]

The eval split.

test_split: dsl.Output[system.Dataset]

The test split.

test_split_json: dsl.OutputPath(list)

The test split JSON object.

downsampled_test_split_json: dsl.OutputPath(list)

The downsampled test split JSON object.

instance_baseline: dsl.Output[system.Artifact]

The instance baseline used to calculate explanations.

metadata: dsl.Output[system.Artifact]

The tabular example gen metadata.

gcp_resources: dsl.OutputPath(str)

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.
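
A minimal sketch under the same import assumptions as above; the BigQuery table, column name, and transformation entries are illustrative, and the exact transformation schema should be checked against the transformations parameter description:

    import json

    from kfp import dsl
    from google_cloud_pipeline_components.v1.automl import tabular as automl_tabular

    # Illustrative transformation spec; check the transformations parameter
    # description for the exact schema expected by the component.
    TRANSFORMATIONS = json.dumps(
        [
            {"auto": {"column_name": "feature_1"}},
            {"auto": {"column_name": "feature_2"}},
        ]
    )


    @dsl.pipeline(name="stats-and-example-gen-demo")
    def stats_demo(project: str, location: str, root_dir: str):
        stats = automl_tabular.StatsAndExampleGenOp(
            project=project,
            location=location,
            root_dir=root_dir,
            target_column_name="label",  # illustrative column name
            prediction_type="classification",
            transformations=TRANSFORMATIONS,
            optimization_objective="maximize-au-prc",
            data_source_bigquery_table_path="bq://my-project.my_dataset.my_table",
        )
        # stats.outputs["metadata"], ["dataset_schema"], and the split Datasets
        # feed TransformOp (see the example after TransformOp below).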

v1.automl.tabular.TrainingConfiguratorAndValidatorOp(dataset_stats: dsl.Input[system.Artifact], split_example_counts: str, training_schema: dsl.Input[system.Artifact], instance_schema: dsl.Input[system.Artifact], metadata: dsl.Output[system.Artifact], instance_baseline: dsl.Output[system.Artifact], target_column: str | None = '', weight_column: str | None = '', prediction_type: str | None = '', optimization_objective: str | None = '', optimization_objective_recall_value: float | None = - 1, optimization_objective_precision_value: float | None = - 1, run_evaluation: bool | None = False, run_distill: bool | None = False, enable_probabilistic_inference: bool | None = False, time_series_identifier_column: str | None = None, time_series_identifier_columns: list | None = [], time_column: str | None = '', time_series_attribute_columns: list | None = [], available_at_forecast_columns: list | None = [], unavailable_at_forecast_columns: list | None = [], quantiles: list | None = [], context_window: int | None = - 1, forecast_horizon: int | None = - 1, forecasting_model_type: str | None = '', forecasting_transformations: dict | None = {}, stage_1_deadline_hours: float | None = None, stage_2_deadline_hours: float | None = None, group_columns: list | None = None, group_total_weight: float = 0.0, temporal_total_weight: float = 0.0, group_temporal_total_weight: float = 0.0)

Configures training and validates data and user-input configurations.

Parameters
dataset_stats: dsl.Input[system.Artifact]

Dataset stats generated by feature transform engine.

split_example_counts: str

JSON string of data split example counts for train, validate, and test splits.

training_schema: dsl.Input[system.Artifact]

Schema of input data to the tf_model at training time.

instance_schema: dsl.Input[system.Artifact]

Schema of input data to the tf_model at serving time.

target_column: str | None = ''

Target column of input data.

weight_column: str | None = ''

Weight column of input data.

prediction_type: str | None = ''

Model prediction type. One of "classification", "regression", "time_series".

optimization_objective: str | None = ''

Objective function the model is optimizing towards. The training process creates a model that maximizes/minimizes the value of the objective function over the validation set. The supported optimization objectives depend on the prediction type. If the field is not set, a default objective function is used.

classification: "maximize-au-roc" (default) - Maximize the area under the receiver operating characteristic (ROC) curve. "minimize-log-loss" - Minimize log loss. "maximize-au-prc" - Maximize the area under the precision-recall curve. "maximize-precision-at-recall" - Maximize precision for a specified recall value. "maximize-recall-at-precision" - Maximize recall for a specified precision value.

classification (multi-class): "minimize-log-loss" (default) - Minimize log loss.

regression: "minimize-rmse" (default) - Minimize root-mean-squared error (RMSE). "minimize-mae" - Minimize mean-absolute error (MAE). "minimize-rmsle" - Minimize root-mean-squared log error (RMSLE).

optimization_objective_recall_value: float | None = -1

Required when optimization_objective is "maximize-precision-at-recall". Must be between 0 and 1, inclusive.

optimization_objective_precision_value: float | None = -1

Required when optimization_objective is "maximize-recall-at-precision". Must be between 0 and 1, inclusive.

run_evaluation: bool | None = False

Whether we are running evaluation in the training pipeline.

run_distill: bool | None = False

Whether the distillation should be applied to the training.

enable_probabilistic_inference: bool | None = False

If probabilistic inference is enabled, the model will fit a distribution that captures the uncertainty of a prediction. At inference time, the predictive distribution is used to make a point prediction that minimizes the optimization objective. For example, the mean of a predictive distribution is the point prediction that minimizes RMSE loss. If quantiles are specified, then the quantiles of the distribution are also returned.

time_series_identifier_column: str | None = None

[Deprecated] The time series identifier column. Used by forecasting only. Raises an exception if used; use the time_series_identifier_columns field instead.

time_series_identifier_columns: list | None = []

The list of time series identifier columns. Used by forecasting only.

time_column: str | None = ''

The column that indicates the time. Used by forecasting only.

time_series_attribute_columns: list | None = []

The column names of the time series attributes.

available_at_forecast_columns: list | None = []

The names of the columns that are available at forecast time.

unavailable_at_forecast_columns: list | None = []

The names of the columns that are not available at forecast time.

quantiles: list | None = []

All quantiles that the model needs to predict.

context_window: int | None = -1

The length of the context window.

forecast_horizon: int | None = -1

The length of the forecast horizon.

forecasting_model_type: str | None = ''

The model types, e.g. l2l, seq2seq, tft.

forecasting_transformations: dict | None = {}

Dict mapping auto and/or type-resolutions to feature columns. The supported types are auto, categorical, numeric, text, and timestamp.

stage_1_deadline_hours: float | None = None

Stage 1 training budget in hours.

stage_2_deadline_hours: float | None = None

Stage 2 training budget in hours.

group_columns: list | None = None

A list of time series attribute column names that define the time series hierarchy.

group_total_weight: float = 0.0

The weight of the loss for predictions aggregated over time series in the same group.

temporal_total_weight: float = 0.0

The weight of the loss for predictions aggregated over the horizon for a single time series.

group_temporal_total_weight: float = 0.0

The weight of the loss for predictions aggregated over both the horizon and time series in the same hierarchy group.

Returns

metadata: dsl.Output[system.Artifact]

The tabular example gen metadata.
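
A minimal sketch under the same import assumptions as above; the placeholder URIs and the split_example_counts JSON are illustrative:

    from kfp import dsl
    from google_cloud_pipeline_components.v1.automl import tabular as automl_tabular


    @dsl.pipeline(name="training-configurator-demo")
    def configurator_demo():
        # Placeholders for artifacts normally produced by the feature transform
        # engine (illustrative URIs).
        dataset_stats = dsl.importer(
            artifact_uri="gs://my-bucket/dataset_stats", artifact_class=dsl.Artifact
        )
        training_schema = dsl.importer(
            artifact_uri="gs://my-bucket/training_schema", artifact_class=dsl.Artifact
        )
        instance_schema = dsl.importer(
            artifact_uri="gs://my-bucket/instance_schema", artifact_class=dsl.Artifact
        )

        configurator = automl_tabular.TrainingConfiguratorAndValidatorOp(
            dataset_stats=dataset_stats.output,
            split_example_counts='{"train": 800, "validate": 100, "test": 100}',  # illustrative JSON
            training_schema=training_schema.output,
            instance_schema=instance_schema.output,
            target_column="label",  # illustrative column name
            prediction_type="regression",
            optimization_objective="minimize-rmse",
        )
        # configurator.outputs["metadata"] is consumed by downstream trainers.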

v1.automl.tabular.TransformOp(project: str, location: str, root_dir: str, metadata: dsl.Input[system.Artifact], dataset_schema: dsl.Input[system.Artifact], train_split: dsl.Input[system.Dataset], eval_split: dsl.Input[system.Dataset], test_split: dsl.Input[system.Dataset], materialized_train_split: dsl.Output[system.Artifact], materialized_eval_split: dsl.Output[system.Artifact], materialized_test_split: dsl.Output[system.Artifact], training_schema_uri: dsl.Output[system.Artifact], transform_output: dsl.Output[system.Artifact], gcp_resources: dsl.OutputPath(str), dataflow_machine_type: str | None = 'n1-standard-16', dataflow_max_num_workers: int | None = 25, dataflow_disk_size_gb: int | None = 40, dataflow_subnetwork: str | None = '', dataflow_use_public_ips: bool | None = True, dataflow_service_account: str | None = '', encryption_spec_key_name: str | None = '')

Transforms raw features to engineered features.

Parameters
project: str

Project to run the transform job.

location: str

Location for running the transform job.

root_dir: str

The Cloud Storage location to store the output.

metadata: dsl.Input[system.Artifact]

The tabular example gen metadata.

dataset_schema: dsl.Input[system.Artifact]

The schema of the dataset.

train_split: dsl.Input[system.Dataset]

The train split.

eval_split: dsl.Input[system.Dataset]

The eval split.

test_split: dsl.Input[system.Dataset]

The test split.

dataflow_machine_type: str | None = 'n1-standard-16'

The machine type used for dataflow jobs. If not set, default to n1-standard-16.

dataflow_max_num_workers: int | None = 25

The number of workers to run the dataflow job. If not set, default to 25.

dataflow_disk_size_gb: int | None = 40

The disk size, in gigabytes, to use on each Dataflow worker instance. If not set, default to 40.

dataflow_subnetwork: str | None = ''

Dataflow's fully qualified subnetwork name, when empty the default subnetwork will be used. More details: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications

dataflow_use_public_ips: bool | None = True

Specifies whether Dataflow workers use public IP addresses.

dataflow_service_account: str | None = ''

Custom service account to run dataflow jobs.

encryption_spec_key_name: str | None = ''

Customer-managed encryption key.

Returns

materialized_train_split: dsl.Output[system.Artifact]

The materialized train split.

materialized_eval_split: dsl.Output[system.Artifact]

The materialized eval split.

materialized_test_split: dsl.Output[system.Artifact]

The materialized test split.

training_schema_uri: dsl.Output[system.Artifact]

The training schema.

transform_output: dsl.Output[system.Artifact]

The transform output artifact.

gcp_resources: dsl.OutputPath(str)

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.
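
A minimal sketch of the documented hand-off from StatsAndExampleGenOp to TransformOp, under the same import assumptions as above; column names, the transformation entry, and the CSV path are illustrative:

    import json

    from kfp import dsl
    from google_cloud_pipeline_components.v1.automl import tabular as automl_tabular


    @dsl.pipeline(name="automl-tabular-transform-demo")
    def transform_demo(project: str, location: str, root_dir: str):
        stats = automl_tabular.StatsAndExampleGenOp(
            project=project,
            location=location,
            root_dir=root_dir,
            target_column_name="label",  # illustrative column name
            prediction_type="classification",
            transformations=json.dumps([{"auto": {"column_name": "feature_1"}}]),
            data_source_csv_filenames="gs://my-bucket/data.csv",  # illustrative
        )
        transform = automl_tabular.TransformOp(
            project=project,
            location=location,
            root_dir=root_dir,
            metadata=stats.outputs["metadata"],
            dataset_schema=stats.outputs["dataset_schema"],
            train_split=stats.outputs["train_split"],
            eval_split=stats.outputs["eval_split"],
            test_split=stats.outputs["test_split"],
            dataflow_machine_type="n1-standard-16",
        )
        # transform.outputs["transform_output"] and the materialized splits feed
        # Stage1TunerOp and CvTrainerOp.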