AutoML Tabular¶
GA AutoML tabular components.
Components:
- CvTrainerOp: Tunes AutoML Tabular models and selects top trials using cross-validation.
- EnsembleOp: Ensembles AutoML Tabular models.
- FinalizerOp: Finalizes AutoML Tabular pipelines.
- InfraValidatorOp: Validates that the trained AutoML Tabular model is a valid model.
- SplitMaterializedDataOp: Splits materialized dataset into train, eval, and test data splits.
- Stage1TunerOp: Searches AutoML Tabular architectures and selects the top trials.
- StatsAndExampleGenOp: Generates stats and training instances for tabular data.
- TrainingConfiguratorAndValidatorOp: Configures training and validates data and user-input configurations.
- TransformOp: Transforms raw features to engineered features.
Functions:
- get_automl_tabular_pipeline_and_parameters: Get the AutoML Tabular v1 default training pipeline.
-
v1.automl.tabular.CvTrainerOp(project: str, location: str, root_dir: str, deadline_hours: float, num_parallel_trials: int, single_run_max_secs: int, num_selected_trials: int, transform_output: dsl.Input[system.Artifact], metadata: dsl.Input[system.Artifact], materialized_cv_splits: dsl.Input[system.Artifact], tuning_result_input: dsl.Input[system.Artifact], gcp_resources: dsl.OutputPath(str), tuning_result_output: dsl.Output[system.Artifact], execution_metrics: dsl.OutputPath(dict), worker_pool_specs_override_json: list | None =
[]
, num_selected_features: int | None =0
, encryption_spec_key_name: str | None =''
)¶ Tunes AutoML Tabular models and selects top trials using cross-validation.
- Parameters¶:
- project: str¶
Project to run Cross-validation trainer.
- location: str¶
Location for running the Cross-validation trainer.
- root_dir: str¶
The Cloud Storage location to store the output.
- worker_pool_specs_override_json: list | None =
[]
¶ JSON worker pool specs. E.g., [{"machine_spec": {"machine_type": "n1-standard-16"}}, {}, {}, {"machine_spec": {"machine_type": "n1-standard-16"}}]
- deadline_hours: float¶
Number of hours the cross-validation trainer should run.
- num_parallel_trials: int¶
Number of parallel training trials.
- single_run_max_secs: int¶
Max number of seconds each training trial runs.
- num_selected_trials: int¶
Number of selected trials. The number of weak learners in the final model is 5 * num_selected_trials.
- num_selected_features: int | None =
0
¶ Number of selected features. The number of features to learn in the NN models.
- transform_output: dsl.Input[system.Artifact]¶
The transform output artifact.
- metadata: dsl.Input[system.Artifact]¶
The tabular example gen metadata.
- materialized_cv_splits: dsl.Input[system.Artifact]¶
The materialized cross-validation splits.
- tuning_result_input: dsl.Input[system.Artifact]¶
AutoML Tabular tuning result.
- encryption_spec_key_name: str | None =
''
¶ Customer-managed encryption key.
- Returns¶:
tuning_result_output: dsl.Output[system.Artifact]
The trained model and architectures.
gcp_resources: dsl.OutputPath(str)
GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.
execution_metrics: dsl.OutputPath(dict)
Core metrics in dictionary of component execution.
-
v1.automl.tabular.EnsembleOp(project: str, location: str, root_dir: str, transform_output: dsl.Input[system.Artifact], metadata: dsl.Input[system.Artifact], dataset_schema: dsl.Input[system.Artifact], tuning_result_input: dsl.Input[system.Artifact], instance_baseline: dsl.Input[system.Artifact], gcp_resources: dsl.OutputPath(str), model_architecture: dsl.Output[system.Artifact], model: dsl.Output[system.Artifact], unmanaged_container_model: dsl.Output[google.UnmanagedContainerModel], model_without_custom_ops: dsl.Output[system.Artifact], explanation_metadata: dsl.OutputPath(dict), explanation_metadata_artifact: dsl.Output[system.Artifact], explanation_parameters: dsl.OutputPath(dict), warmup_data: dsl.Input[system.Dataset] | None =
None
, encryption_spec_key_name: str | None =''
, export_additional_model_without_custom_ops: bool | None =False
)¶ Ensembles AutoML Tabular models.
- Parameters¶:
- project: str¶
Project to run the AutoML Tabular ensemble.
- location: str¶
Location for running the AutoML Tabular ensemble.
- root_dir: str¶
The Cloud Storage location to store the output.
- transform_output: dsl.Input[system.Artifact]¶
The transform output artifact.
- metadata: dsl.Input[system.Artifact]¶
The tabular example gen metadata.
- dataset_schema: dsl.Input[system.Artifact]¶
The schema of the dataset.
- tuning_result_input: dsl.Input[system.Artifact]¶
AutoML Tabular tuning result.
- instance_baseline: dsl.Input[system.Artifact]¶
The instance baseline used to calculate explanations.
- warmup_data: dsl.Input[system.Dataset] | None =
None
¶ The warm-up data. The ensemble component saves the warm-up data together with the model artifact; it is used to warm up the model when the prediction server starts.
- encryption_spec_key_name: str | None =
''
¶ Customer-managed encryption key.
- export_additional_model_without_custom_ops: bool | None =
False
¶ True to export an additional model without custom TF operators to the model_without_custom_ops output.
- Returns¶:
gcp_resources: dsl.OutputPath(str)
GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.
model_architecture: dsl.Output[system.Artifact]
The architecture of the output model.
model: dsl.Output[system.Artifact]
The output model.
model_without_custom_ops: dsl.Output[system.Artifact]
The output model without custom TF operators; this output will be empty unless export_additional_model_without_custom_ops is set.
model_uri: Unknown
The URI of the output model.
instance_schema_uri: Unknown
The URI of the instance schema.
prediction_schema_uri: Unknown
The URI of the prediction schema.
explanation_metadata: dsl.OutputPath(dict)
The explanation metadata used by Vertex online and batch explanations.
explanation_parameters: dsl.OutputPath(dict)
The explanation parameters used by Vertex online and batch explanations.
-
v1.automl.tabular.FinalizerOp(project: str, location: str, root_dir: str, gcp_resources: dsl.OutputPath(str), encryption_spec_key_name: str | None =
''
)¶ Finalizes AutoML Tabular pipelines.
- Parameters¶:
- project: str¶
Project to run the AutoML Tabular pipeline finalizer.
- location: str¶
Location for running the AutoML Tabular pipeline finalizer.
- root_dir: str¶
The Cloud Storage location to store the output.
- encryption_spec_key_name: str | None =
''
¶ Customer-managed encryption key.
- Returns¶:
gcp_resources: dsl.OutputPath(str)
GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.
- v1.automl.tabular.InfraValidatorOp(unmanaged_container_model: dsl.Input[google.UnmanagedContainerModel])¶
Validates that the trained AutoML Tabular model is a valid model.
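A minimal sketch of invoking the validator on an existing model artifact follows; in a full pipeline the artifact would normally come from EnsembleOp's unmanaged_container_model output. The GCS URI, the serving image, and the use of dsl.importer here are illustrative assumptions rather than part of this component's contract, and the import paths mirror the names used on this page.

from kfp import dsl
from google_cloud_pipeline_components.types import artifact_types
from google_cloud_pipeline_components.v1.automl.tabular import InfraValidatorOp


@dsl.pipeline(name='automl-tabular-infra-validation-sketch')
def infra_validation_pipeline():
    # Import an existing unmanaged container model (placeholder URI and image).
    model_importer = dsl.importer(
        artifact_uri='gs://my-bucket/automl-tabular/model',
        artifact_class=artifact_types.UnmanagedContainerModel,
        metadata={'containerSpec': {'imageUri': 'gcr.io/my-project/my-prediction-server:latest'}},
    )
    # Check that the trained model can actually be served.
    InfraValidatorOp(unmanaged_container_model=model_importer.output)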
- v1.automl.tabular.SplitMaterializedDataOp(materialized_data: dsl.Input[system.Dataset], materialized_train_split: dsl.Output[system.Artifact], materialized_eval_split: dsl.Output[system.Artifact], materialized_test_split: dsl.Output[system.Artifact])¶
Splits materialized dataset into train, eval, and test data splits.
The materialized dataset generated by the Feature Transform Engine consists of all the splits that were combined into the input transform dataset (i.e., the train, eval, and test splits). This component splits that materialized dataset into the corresponding materialized data splits so that they can be used by downstream training or evaluation components (see the sketch below).
- Parameters¶:
- materialized_data: dsl.Input[system.Dataset]¶
Materialized dataset output by the Feature Transform Engine.
- Returns¶:
materialized_train_split: dsl.Output[system.Artifact]
Path pattern to the materialized train split.
materialized_eval_split: dsl.Output[system.Artifact]
Path pattern to the materialized eval split.
materialized_test_split: dsl.Output[system.Artifact]
Path pattern to the materialized test split.
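A minimal wiring sketch follows, assuming a materialized dataset already exists in Cloud Storage (the URI is a placeholder); in the prebuilt pipelines this input comes from the Feature Transform Engine, and the import path mirrors the name used on this page.

from kfp import dsl
from google_cloud_pipeline_components.v1.automl.tabular import SplitMaterializedDataOp


@dsl.pipeline(name='split-materialized-data-sketch')
def split_materialized_data_pipeline():
    # Import a materialized dataset produced earlier by the Feature Transform
    # Engine (placeholder URI).
    materialized_data = dsl.importer(
        artifact_uri='gs://my-bucket/fte/materialized_data',
        artifact_class=dsl.Dataset,
    )
    split = SplitMaterializedDataOp(materialized_data=materialized_data.output)
    # Downstream trainers consume the individual splits, e.g.
    # split.outputs['materialized_train_split'] and
    # split.outputs['materialized_eval_split'].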
-
v1.automl.tabular.Stage1TunerOp(project: str, location: str, root_dir: str, num_selected_trials: int, deadline_hours: float, num_parallel_trials: int, single_run_max_secs: int, metadata: dsl.Input[system.Artifact], transform_output: dsl.Input[system.Artifact], materialized_train_split: dsl.Input[system.Artifact], materialized_eval_split: dsl.Input[system.Artifact], gcp_resources: dsl.OutputPath(str), tuning_result_output: dsl.Output[system.Artifact], execution_metrics: dsl.OutputPath(dict), study_spec_parameters_override: list | None =
[]
, worker_pool_specs_override_json: list | None =[]
, reduce_search_space_mode: str | None ='regular'
, num_selected_features: int | None =0
, disable_early_stopping: bool | None =False
, feature_ranking: dsl.Input[system.Artifact] | None =None
, tune_feature_selection_rate: bool | None =False
, encryption_spec_key_name: str | None =''
, run_distillation: bool | None =False
)¶ Searches AutoML Tabular architectures and selects the top trials.
- Parameters¶:
- project: str¶
Project to run the stage 1 tuner.
- location: str¶
Location for running the stage 1 tuner.
- root_dir: str¶
The Cloud Storage location to store the output.
- study_spec_parameters_override: list | None =
[]
¶ JSON study spec. E.g., [{"parameter_id": "model_type", "categorical_value_spec": {"values": ["nn"]}}] (see the sketch after this component's entry).
- worker_pool_specs_override_json: list | None =
[]
¶ JSON worker pool specs. E.g., [{"machine_spec": {"machine_type": "n1-standard-16"}}, {}, {}, {"machine_spec": {"machine_type": "n1-standard-16"}}]
- reduce_search_space_mode: str | None =
'regular'
¶ The reduce search space mode. Possible values: “regular” (default), “minimal”, “full”.
- num_selected_trials: int¶
Number of selected trials. The number of weak learners in the final model is 5 * num_selected_trials.
- num_selected_features: int | None =
0
¶ Number of selected features. The number of features to learn in the NN models.
- deadline_hours: float¶
Number of hours the stage 1 tuner should run.
- disable_early_stopping: bool | None =
False
¶ True to disable early stopping. The default value is false.
- num_parallel_trials: int¶
Number of parallel training trials.
- single_run_max_secs: int¶
Max number of seconds each training trial runs.
- metadata: dsl.Input[system.Artifact]¶
The tabular example gen metadata.
- transform_output: dsl.Input[system.Artifact]¶
The transform output artifact.
- materialized_train_split: dsl.Input[system.Artifact]¶
The materialized train split.
- materialized_eval_split: dsl.Input[system.Artifact]¶
The materialized eval split.
- encryption_spec_key_name: str | None =
''
¶ Customer-managed encryption key.
- run_distillation: bool | None =
False
¶ True if in distillation mode. The default value is false.
- Returns¶:
gcp_resources: dsl.OutputPath(str)
GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.
tuning_result_output: dsl.Output[system.Artifact]
The trained model and architectures.
execution_metrics: dsl.OutputPath(dict)
Core metrics in dictionary of component execution.
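Written as plain Python rather than the typographically quoted inline examples above, the two override parameters might look like the following sketch; the machine types and the restriction to NN models are purely illustrative values, not recommendations.

# Illustrative override values only.
study_spec_parameters_override = [
    {
        'parameter_id': 'model_type',
        'categorical_value_spec': {'values': ['nn']},
    },
]

worker_pool_specs_override_json = [
    {'machine_spec': {'machine_type': 'n1-standard-16'}},  # override first worker pool
    {},  # keep defaults
    {},  # keep defaults
    {'machine_spec': {'machine_type': 'n1-standard-16'}},  # override last worker pool
]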
-
v1.automl.tabular.StatsAndExampleGenOp(project: str, location: str, root_dir: str, target_column_name: str, prediction_type: str, transformations: str, dataset_schema: dsl.Output[system.Artifact], dataset_stats: dsl.Output[system.Artifact], train_split: dsl.Output[system.Dataset], eval_split: dsl.Output[system.Dataset], test_split: dsl.Output[system.Dataset], test_split_json: dsl.OutputPath(list), downsampled_test_split_json: dsl.OutputPath(list), instance_baseline: dsl.Output[system.Artifact], metadata: dsl.Output[system.Artifact], gcp_resources: dsl.OutputPath(str), weight_column_name: str | None =
''
, optimization_objective: str | None =''
, optimization_objective_recall_value: float | None =-1
, optimization_objective_precision_value: float | None =-1
, transformations_path: str | None =''
, request_type: str | None ='COLUMN_STATS_ONLY'
, dataflow_machine_type: str | None ='n1-standard-16'
, dataflow_max_num_workers: int | None =25
, dataflow_disk_size_gb: int | None =40
, dataflow_subnetwork: str | None =''
, dataflow_use_public_ips: bool | None =True
, dataflow_service_account: str | None =''
, encryption_spec_key_name: str | None =''
, run_distillation: bool | None =False
, additional_experiments: str | None =''
, additional_experiments_json: dict | None ={}
, data_source_csv_filenames: str | None =''
, data_source_bigquery_table_path: str | None =''
, predefined_split_key: str | None =''
, timestamp_split_key: str | None =''
, stratified_split_key: str | None =''
, training_fraction: float | None =-1
, validation_fraction: float | None =-1
, test_fraction: float | None =-1
, quantiles: list | None =[]
, enable_probabilistic_inference: bool | None =False
)¶ Generates stats and training instances for tabular data.
- Parameters¶:
- project: str¶
Project to run dataset statistics and example generation.
- location: str¶
Location for running dataset statistics and example generation.
- root_dir: str¶
The Cloud Storage location to store the output.
- target_column_name: str¶
The target column name.
- weight_column_name: str | None =
''
¶ The weight column name.
- prediction_type: str¶
The prediction type. Supported values: “classification”, “regression”.
- optimization_objective: str | None =
''
¶ Objective function the model is optimizing towards. The training process creates a model that maximizes/minimizes the value of the objective function over the validation set. The supported optimization objectives depend on the prediction type. If the field is not set, a default objective function is used.
classification: “maximize-au-roc” (default) - Maximize the area under the receiver operating characteristic (ROC) curve; “minimize-log-loss” - Minimize log loss; “maximize-au-prc” - Maximize the area under the precision-recall curve; “maximize-precision-at-recall” - Maximize precision for a specified recall value; “maximize-recall-at-precision” - Maximize recall for a specified precision value.
classification (multi-class): “minimize-log-loss” (default) - Minimize log loss.
regression: “minimize-rmse” (default) - Minimize root-mean-squared error (RMSE); “minimize-mae” - Minimize mean-absolute error (MAE); “minimize-rmsle” - Minimize root-mean-squared log error (RMSLE).
- optimization_objective_recall_value: float | None =
-1
¶ Required when optimization_objective is “maximize-precision-at-recall”. Must be between 0 and 1, inclusive.
- optimization_objective_precision_value: float | None =
-1
¶ Required when optimization_objective is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.
- transformations: str¶
Quote-escaped JSON string of transformations. Each transformation applies a transform function to a given input column, and the result is used for training. When creating a transformation for a BigQuery STRUCT column, the column should be flattened using “.” as the delimiter (see the sketch after this component's entry).
- transformations_path: str | None =
''
¶ Path to a GCS file containing JSON string for transformations.
- dataflow_machine_type: str | None =
'n1-standard-16'
¶ The machine type used for dataflow jobs. If not set, default to n1-standard-16.
- dataflow_max_num_workers: int | None =
25
¶ The number of workers to run the dataflow job. If not set, default to 25.
- dataflow_disk_size_gb: int | None =
40
¶ The disk size, in gigabytes, to use on each Dataflow worker instance. If not set, default to 40.
- dataflow_subnetwork: str | None =
''
¶ Dataflow’s fully qualified subnetwork name, when empty the default subnetwork will be used. More details: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications
- dataflow_use_public_ips: bool | None =
True
¶ Specifies whether Dataflow workers use public IP addresses.
- dataflow_service_account: str | None =
''
¶ Custom service account to run dataflow jobs.
- encryption_spec_key_name: str | None =
''
¶ Customer-managed encryption key.
- run_distillation: bool | None =
False
¶ True if in distillation mode. The default value is false.
- Returns¶:
dataset_schema: dsl.Output[system.Artifact]
The schema of the dataset.
dataset_stats: dsl.Output[system.Artifact]
The stats of the dataset.
train_split: dsl.Output[system.Dataset]
The train split.
eval_split: dsl.Output[system.Dataset]
The eval split.
test_split: dsl.Output[system.Dataset]
The test split.
test_split_json: dsl.OutputPath(list)
The test split JSON object.
downsampled_test_split_json: dsl.OutputPath(list)
The downsampled test split JSON object.
instance_baseline: dsl.Output[system.Artifact]
The instance baseline used to calculate explanations.
metadata: dsl.Output[system.Artifact]
The tabular example gen metadata.
gcp_resources: dsl.OutputPath(str)
GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.
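The transformations parameter above expects a quote-escaped JSON string. A minimal sketch of building one with json.dumps follows; the column names, the chosen transformation types, and the {"<type>": {"column_name": ...}} entry shape are illustrative assumptions rather than a complete specification.

import json

# Each entry applies one transformation type to one input column; the column
# names below are placeholders.
transformations = json.dumps([
    {'auto': {'column_name': 'age'}},
    {'numeric': {'column_name': 'income'}},
    {'categorical': {'column_name': 'country'}},
    # A BigQuery STRUCT field is flattened with "." as the delimiter.
    {'auto': {'column_name': 'address.zip_code'}},
])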
-
v1.automl.tabular.TrainingConfiguratorAndValidatorOp(dataset_stats: dsl.Input[system.Artifact], split_example_counts: str, training_schema: dsl.Input[system.Artifact], instance_schema: dsl.Input[system.Artifact], metadata: dsl.Output[system.Artifact], instance_baseline: dsl.Output[system.Artifact], target_column: str | None =
''
, weight_column: str | None =''
, prediction_type: str | None =''
, optimization_objective: str | None =''
, optimization_objective_recall_value: float | None =-1
, optimization_objective_precision_value: float | None =-1
, run_evaluation: bool | None =False
, run_distill: bool | None =False
, enable_probabilistic_inference: bool | None =False
, time_series_identifier_column: str | None =None
, time_series_identifier_columns: list | None =[]
, time_column: str | None =''
, time_series_attribute_columns: list | None =[]
, available_at_forecast_columns: list | None =[]
, unavailable_at_forecast_columns: list | None =[]
, quantiles: list | None =[]
, context_window: int | None =-1
, forecast_horizon: int | None =-1
, forecasting_model_type: str | None =''
, forecasting_transformations: dict | None ={}
, stage_1_deadline_hours: float | None =None
, stage_2_deadline_hours: float | None =None
, group_columns: list | None =None
, group_total_weight: float =0.0
, temporal_total_weight: float =0.0
, group_temporal_total_weight: float =0.0
)¶ Configures training and validates data and user-input configurations.
- Parameters¶:
- dataset_stats: dsl.Input[system.Artifact]¶
Dataset stats generated by feature transform engine.
- split_example_counts: str¶
JSON string of data split example counts for train, validate, and test splits.
- training_schema: dsl.Input[system.Artifact]¶
Schema of input data to the tf_model at training time.
- instance_schema: dsl.Input[system.Artifact]¶
Schema of input data to the tf_model at serving time.
- target_column: str | None =
''
¶ Target column of input data.
- weight_column: str | None =
''
¶ Weight column of input data.
- prediction_type: str | None =
''
¶ Model prediction type. One of “classification”, “regression”, “time_series”.
- optimization_objective: str | None =
''
¶ Objective function the model is optimizing towards. The training process creates a model that maximizes/minimizes the value of the objective function over the validation set. The supported optimization objectives depend on the prediction type. If the field is not set, a default objective function is used.
classification: “maximize-au-roc” (default) - Maximize the area under the receiver operating characteristic (ROC) curve; “minimize-log-loss” - Minimize log loss; “maximize-au-prc” - Maximize the area under the precision-recall curve; “maximize-precision-at-recall” - Maximize precision for a specified recall value; “maximize-recall-at-precision” - Maximize recall for a specified precision value.
classification (multi-class): “minimize-log-loss” (default) - Minimize log loss.
regression: “minimize-rmse” (default) - Minimize root-mean-squared error (RMSE); “minimize-mae” - Minimize mean-absolute error (MAE); “minimize-rmsle” - Minimize root-mean-squared log error (RMSLE).
- optimization_objective_recall_value: float | None =
-1
¶ Required when optimization_objective is “maximize-precision-at-recall”. Must be between 0 and 1, inclusive.
- optimization_objective_precision_value: float | None =
-1
¶ Required when optimization_objective is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.
- run_evaluation: bool | None =
False
¶ Whether we are running evaluation in the training pipeline.
- run_distill: bool | None =
False
¶ Whether the distillation should be applied to the training.
- enable_probabilistic_inference: bool | None =
False
¶ If probabilistic inference is enabled, the model will fit a distribution that captures the uncertainty of a prediction. At inference time, the predictive distribution is used to make a point prediction that minimizes the optimization objective. For example, the mean of a predictive distribution is the point prediction that minimizes RMSE loss. If quantiles are specified, then the quantiles of the distribution are also returned.
- time_series_identifier_column: str | None =
None
¶ [Deprecated] The time series identifier column. Used by forecasting only. Raises an exception if used; use the “time_series_identifier_columns” field instead.
- time_series_identifier_columns: list | None =
[]
¶ The list of time series identifier columns. Used by forecasting only.
- time_column: str | None =
''
¶ The column that indicates the time. Used by forecasting only.
- time_series_attribute_columns: list | None =
[]
¶ The column names of the time series attributes.
- available_at_forecast_columns: list | None =
[]
¶ The names of the columns that are available at forecast time.
- unavailable_at_forecast_columns: list | None =
[]
¶ The names of the columns that are not available at forecast time.
- quantiles: list | None =
[]
¶ All quantiles that the model needs to predict.
- context_window: int | None =
-1
¶ The length of the context window.
- forecast_horizon: int | None =
-1
¶ The length of the forecast horizon.
- forecasting_model_type: str | None =
''
¶ The model type, e.g., l2l, seq2seq, tft.
- forecasting_transformations: dict | None =
{}
¶ Dict mapping auto and/or type-resolutions to feature columns. The supported types are auto, categorical, numeric, text, and timestamp.
- stage_1_deadline_hours: float | None =
None
¶ Stage 1 training budget in hours.
- stage_2_deadline_hours: float | None =
None
¶ Stage 2 training budget in hours.
- group_columns: list | None =
None
¶ A list of time series attribute column names that define the time series hierarchy.
- group_total_weight: float =
0.0
¶ The weight of the loss for predictions aggregated over time series in the same group.
- temporal_total_weight: float =
0.0
¶ The weight of the loss for predictions aggregated over the horizon for a single time series.
- group_temporal_total_weight: float =
0.0
¶ The weight of the loss for predictions aggregated over both the horizon and time series in the same hierarchy group.
- Returns¶:
metadata: dsl.Output[system.Artifact]
The tabular example gen metadata.
-
v1.automl.tabular.TransformOp(project: str, location: str, root_dir: str, metadata: dsl.Input[system.Artifact], dataset_schema: dsl.Input[system.Artifact], train_split: dsl.Input[system.Dataset], eval_split: dsl.Input[system.Dataset], test_split: dsl.Input[system.Dataset], materialized_train_split: dsl.Output[system.Artifact], materialized_eval_split: dsl.Output[system.Artifact], materialized_test_split: dsl.Output[system.Artifact], training_schema_uri: dsl.Output[system.Artifact], transform_output: dsl.Output[system.Artifact], gcp_resources: dsl.OutputPath(str), dataflow_machine_type: str | None =
'n1-standard-16'
, dataflow_max_num_workers: int | None =25
, dataflow_disk_size_gb: int | None =40
, dataflow_subnetwork: str | None =''
, dataflow_use_public_ips: bool | None =True
, dataflow_service_account: str | None =''
, encryption_spec_key_name: str | None =''
)¶ Transforms raw features to engineered features.
- Parameters¶:
- project: str¶
Project to run the transform.
- location: str¶
Location for running the transform.
- root_dir: str¶
The Cloud Storage location to store the output.
- metadata: dsl.Input[system.Artifact]¶
The tabular example gen metadata.
- dataset_schema: dsl.Input[system.Artifact]¶
The schema of the dataset.
- train_split: dsl.Input[system.Dataset]¶
The train split.
- eval_split: dsl.Input[system.Dataset]¶
The eval split.
- test_split: dsl.Input[system.Dataset]¶
The test split.
- dataflow_machine_type: str | None =
'n1-standard-16'
¶ The machine type used for dataflow jobs. If not set, default to n1-standard-16.
- dataflow_max_num_workers: int | None =
25
¶ The number of workers to run the dataflow job. If not set, default to 25.
- dataflow_disk_size_gb: int | None =
40
¶ The disk size, in gigabytes, to use on each Dataflow worker instance. If not set, default to 40.
- dataflow_subnetwork: str | None =
''
¶ Dataflow’s fully qualified subnetwork name, when empty the default subnetwork will be used. More details: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications
- dataflow_use_public_ips: bool | None =
True
¶ Specifies whether Dataflow workers use public IP addresses.
- dataflow_service_account: str | None =
''
¶ Custom service account to run dataflow jobs.
- encryption_spec_key_name: str | None =
''
¶ Customer-managed encryption key.
- Returns¶:
materialized_train_split: dsl.Output[system.Artifact]
The materialized train split.
materialized_eval_split: dsl.Output[system.Artifact]
The materialized eval split.
materialized_test_split: dsl.Output[system.Artifact]
The materialized test split.
training_schema_uri: dsl.Output[system.Artifact]
The training schema.
transform_output: dsl.Output[system.Artifact]
The transform output artifact.
gcp_resources: dsl.OutputPath(str)
GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.
-
v1.automl.tabular.get_automl_tabular_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, optimization_objective: str, transformations: str, train_budget_milli_node_hours: float, stage_1_num_parallel_trials: int | None =
None
, stage_2_num_parallel_trials: int | None =None
, stage_2_num_selected_trials: int | None =None
, data_source_csv_filenames: str | None =None
, data_source_bigquery_table_path: str | None =None
, predefined_split_key: str | None =None
, timestamp_split_key: str | None =None
, stratified_split_key: str | None =None
, training_fraction: float | None =None
, validation_fraction: float | None =None
, test_fraction: float | None =None
, weight_column: str | None =None
, study_spec_parameters_override: list[dict[str, Any]] | None =None
, optimization_objective_recall_value: float | None =None
, optimization_objective_precision_value: float | None =None
, stage_1_tuner_worker_pool_specs_override: dict[str, Any] | None =None
, cv_trainer_worker_pool_specs_override: dict[str, Any] | None =None
, export_additional_model_without_custom_ops: bool =False
, stats_and_example_gen_dataflow_machine_type: str | None =None
, stats_and_example_gen_dataflow_max_num_workers: int | None =None
, stats_and_example_gen_dataflow_disk_size_gb: int | None =None
, transform_dataflow_machine_type: str | None =None
, transform_dataflow_max_num_workers: int | None =None
, transform_dataflow_disk_size_gb: int | None =None
, dataflow_subnetwork: str | None =None
, dataflow_use_public_ips: bool =True
, encryption_spec_key_name: str | None =None
, additional_experiments: dict[str, Any] | None =None
, dataflow_service_account: str | None =None
, run_evaluation: bool =True
, evaluation_batch_predict_machine_type: str | None =None
, evaluation_batch_predict_starting_replica_count: int | None =None
, evaluation_batch_predict_max_replica_count: int | None =None
, evaluation_batch_explain_machine_type: str | None =None
, evaluation_batch_explain_starting_replica_count: int | None =None
, evaluation_batch_explain_max_replica_count: int | None =None
, evaluation_dataflow_machine_type: str | None =None
, evaluation_dataflow_starting_num_workers: int | None =None
, evaluation_dataflow_max_num_workers: int | None =None
, evaluation_dataflow_disk_size_gb: int | None =None
, run_distillation: bool =False
, distill_batch_predict_machine_type: str | None =None
, distill_batch_predict_starting_replica_count: int | None =None
, distill_batch_predict_max_replica_count: int | None =None
, stage_1_tuning_result_artifact_uri: str | None =None
, quantiles: list[float] | None =None
, enable_probabilistic_inference: bool =False
, num_selected_features: int | None =None
, model_display_name: str =''
, model_description: str =''
) → tuple[str, dict[str, Any]]¶ Get the AutoML Tabular v1 default training pipeline.
- Parameters¶:
- project: str¶
The GCP project that runs the pipeline components.
- location: str¶
The GCP region that runs the pipeline components.
- root_dir: str¶
The root GCS directory for the pipeline components.
- target_column: str¶
The target column name.
- prediction_type: str¶
The type of prediction the model is to produce. “classification” or “regression”.
- optimization_objective: str¶
For binary classification, “maximize-au-roc”, “minimize-log-loss”, “maximize-au-prc”, “maximize-precision-at-recall”, or “maximize-recall-at-precision”. For multi class classification, “minimize-log-loss”. For regression, “minimize-rmse”, “minimize-mae”, or “minimize-rmsle”.
- transformations: str¶
The path to a GCS file containing the transformations to apply.
- train_budget_milli_node_hours: float¶
The train budget for creating this model, expressed in milli node hours, i.e., a value of 1,000 in this field means 1 node hour.
- stage_1_num_parallel_trials: int | None =
None
¶ Number of parallel trials for stage 1.
- stage_2_num_parallel_trials: int | None =
None
¶ Number of parallel trials for stage 2.
- stage_2_num_selected_trials: int | None =
None
¶ Number of selected trials for stage 2.
- data_source_csv_filenames: str | None =
None
¶ The CSV data source.
- data_source_bigquery_table_path: str | None =
None
¶ The BigQuery data source.
- predefined_split_key: str | None =
None
¶ The predefined_split column name.
- timestamp_split_key: str | None =
None
¶ The timestamp_split column name.
- stratified_split_key: str | None =
None
¶ The stratified_split column name.
- training_fraction: float | None =
None
¶ The training fraction.
- validation_fraction: float | None =
None
¶ The validation fraction.
- test_fraction: float | None =
None
¶ The test fraction.
- weight_column: str | None =
None
¶ The weight column name.
- study_spec_parameters_override: list[dict[str, Any]] | None =
None
¶ The list for overriding study spec. The list should be of format: https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/study.proto#L181.
- optimization_objective_recall_value: float | None =
None
¶ Required when optimization_objective is “maximize-precision-at-recall”. Must be between 0 and 1, inclusive.
- optimization_objective_precision_value: float | None =
None
¶ Required when optimization_objective is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.
- stage_1_tuner_worker_pool_specs_override: dict[str, Any] | None =
None
¶ The dictionary for overriding the stage 1 tuner worker pool spec. The dictionary should be of format: https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.
- cv_trainer_worker_pool_specs_override: dict[str, Any] | None =
None
¶ The dictionary for overriding the cv trainer worker pool spec. The dictionary should be of format: https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.
- export_additional_model_without_custom_ops: bool =
False
¶ Whether to export an additional model without custom TensorFlow operators.
- stats_and_example_gen_dataflow_machine_type: str | None =
None
¶ The dataflow machine type for stats_and_example_gen component.
- stats_and_example_gen_dataflow_max_num_workers: int | None =
None
¶ The max number of Dataflow workers for stats_and_example_gen component.
- stats_and_example_gen_dataflow_disk_size_gb: int | None =
None
¶ Dataflow worker’s disk size in GB for stats_and_example_gen component.
- transform_dataflow_machine_type: str | None =
None
¶ The dataflow machine type for transform component.
- transform_dataflow_max_num_workers: int | None =
None
¶ The max number of Dataflow workers for transform component.
- transform_dataflow_disk_size_gb: int | None =
None
¶ Dataflow worker’s disk size in GB for transform component.
- dataflow_subnetwork: str | None =
None
¶ Dataflow’s fully qualified subnetwork name, when empty the default subnetwork will be used. Example: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications
- dataflow_use_public_ips: bool =
True
¶ Specifies whether Dataflow workers use public IP addresses.
- encryption_spec_key_name: str | None =
None
¶ The KMS key name.
- additional_experiments: dict[str, Any] | None =
None
¶ Use this field to configure private preview features.
- dataflow_service_account: str | None =
None
¶ Custom service account to run dataflow jobs.
- run_evaluation: bool =
True
¶ Whether to run evaluation in the training pipeline.
- evaluation_batch_predict_machine_type: str | None =
None
¶ The prediction server machine type for batch predict components during evaluation.
- evaluation_batch_predict_starting_replica_count: int | None =
None
¶ The initial number of prediction servers for batch predict components during evaluation.
- evaluation_batch_predict_max_replica_count: int | None =
None
¶ The max number of prediction servers for batch predict components during evaluation.
- evaluation_batch_explain_machine_type: str | None =
None
¶ The prediction server machine type for batch explain components during evaluation.
- evaluation_batch_explain_starting_replica_count: int | None =
None
¶ The initial number of prediction servers for batch explain components during evaluation.
- evaluation_batch_explain_max_replica_count: int | None =
None
¶ The max number of prediction servers for batch explain components during evaluation.
- evaluation_dataflow_machine_type: str | None =
None
¶ The dataflow machine type for evaluation components.
- evaluation_dataflow_starting_num_workers: int | None =
None
¶ The initial number of Dataflow workers for evaluation components.
- evaluation_dataflow_max_num_workers: int | None =
None
¶ The max number of Dataflow workers for evaluation components.
- evaluation_dataflow_disk_size_gb: int | None =
None
¶ Dataflow worker’s disk size in GB for evaluation components.
- run_distillation: bool =
False
¶ Whether to run distillation in the training pipeline.
- distill_batch_predict_machine_type: str | None =
None
¶ The prediction server machine type for batch predict component in the model distillation.
- distill_batch_predict_starting_replica_count: int | None =
None
¶ The initial number of prediction servers for the batch predict component in the model distillation.
- distill_batch_predict_max_replica_count: int | None =
None
¶ The max number of prediction servers for the batch predict component in the model distillation.
- stage_1_tuning_result_artifact_uri: str | None =
None
¶ The stage 1 tuning result artifact GCS URI.
- quantiles: list[float] | None =
None
¶ Quantiles to use for probabilistic inference. Up to 5 quantiles are allowed, with values between 0 and 1, exclusive. Quantiles must be unique.
- enable_probabilistic_inference: bool =
False
¶ If probabilistic inference is enabled, the model will fit a distribution that captures the uncertainty of a prediction. At inference time, the predictive distribution is used to make a point prediction that minimizes the optimization objective. For example, the mean of a predictive distribution is the point prediction that minimizes RMSE loss. If quantiles are specified, then the quantiles of the distribution are also returned.
- num_selected_features: int | None =
None
¶ Number of selected features for feature selection, defaults to None, in which case all features are used.
- model_display_name: str =
''
¶ The display name of the uploaded Vertex model.
- model_description: str =
''
¶ The description for the uploaded model.
- Returns¶:
Tuple of pipeline_definition_path and parameter_values.
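A minimal end-to-end sketch follows: build the default pipeline with this function and submit it with the Vertex AI SDK. The import path mirrors the name used on this page, and every project, bucket, table, and column name below is a placeholder assumption.

from google.cloud import aiplatform
from google_cloud_pipeline_components.v1.automl.tabular import (
    get_automl_tabular_pipeline_and_parameters,
)

# Compile-time configuration; all resource names are placeholders.
template_path, parameter_values = get_automl_tabular_pipeline_and_parameters(
    project='my-project',
    location='us-central1',
    root_dir='gs://my-bucket/automl-tabular',
    target_column='label',
    prediction_type='classification',
    optimization_objective='maximize-au-roc',
    transformations='gs://my-bucket/config/transformations.json',
    train_budget_milli_node_hours=1000,  # 1 node hour
    data_source_bigquery_table_path='bq://my-project.my_dataset.my_table',
)

# Submit the compiled pipeline to Vertex AI Pipelines.
aiplatform.init(project='my-project', location='us-central1')
job = aiplatform.PipelineJob(
    display_name='automl-tabular-training',
    template_path=template_path,
    parameter_values=parameter_values,
    pipeline_root='gs://my-bucket/automl-tabular',
)
job.submit()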