google_cloud_pipeline_components.experimental.automl.tabular package

Submodules

google_cloud_pipeline_components.experimental.automl.tabular.utils module

Util functions for AutoML Tabular pipeline.

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_automl_tabular_feature_selection_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, optimization_objective: str, transformations: str, train_budget_milli_node_hours: float, stage_1_num_parallel_trials: Optional[int] = None, stage_2_num_parallel_trials: Optional[int] = None, stage_2_num_selected_trials: Optional[int] = None, data_source_csv_filenames: Optional[str] = None, data_source_bigquery_table_path: Optional[str] = None, predefined_split_key: Optional[str] = None, timestamp_split_key: Optional[str] = None, stratified_split_key: Optional[str] = None, training_fraction: Optional[float] = None, validation_fraction: Optional[float] = None, test_fraction: Optional[float] = None, weight_column: Optional[str] = None, study_spec_parameters_override: Optional[List[Dict[str, Any]]] = None, optimization_objective_recall_value: Optional[float] = None, optimization_objective_precision_value: Optional[float] = None, stage_1_tuner_worker_pool_specs_override: Optional[Dict[str, Any]] = None, cv_trainer_worker_pool_specs_override: Optional[Dict[str, Any]] = None, export_additional_model_without_custom_ops: bool = False, stats_and_example_gen_dataflow_machine_type: Optional[str] = None, stats_and_example_gen_dataflow_max_num_workers: Optional[int] = None, stats_and_example_gen_dataflow_disk_size_gb: Optional[int] = None, transform_dataflow_machine_type: Optional[str] = None, transform_dataflow_max_num_workers: Optional[int] = None, transform_dataflow_disk_size_gb: Optional[int] = None, dataflow_subnetwork: Optional[str] = None, dataflow_use_public_ips: bool = True, encryption_spec_key_name: Optional[str] = None, additional_experiments: Optional[Dict[str, Any]] = None, dataflow_service_account: Optional[str] = None, run_evaluation: bool = True, evaluation_batch_predict_machine_type: Optional[str] = None, evaluation_batch_predict_starting_replica_count: Optional[int] = None, evaluation_batch_predict_max_replica_count: Optional[int] = None, evaluation_dataflow_machine_type: Optional[str] = None, evaluation_dataflow_max_num_workers: Optional[int] = None, evaluation_dataflow_disk_size_gb: Optional[int] = None, max_selected_features: int = 1000, apply_feature_selection_tuning: bool = False, run_distillation: bool = False, distill_batch_predict_machine_type: Optional[str] = None, distill_batch_predict_starting_replica_count: Optional[int] = None, distill_batch_predict_max_replica_count: Optional[int] = None) Tuple[str, Dict[str, Any]]

Get the AutoML Tabular v1 default training pipeline.

Args:
project:

The GCP project that runs the pipeline components.

location:

The GCP region that runs the pipeline components.

root_dir:

The root GCS directory for the pipeline components.

target_column:

The target column name.

prediction_type:

The type of prediction the model is to produce. “classification” or “regression”.

optimization_objective:

For binary classification, “maximize-au-roc”, “minimize-log-loss”, “maximize-au-prc”, “maximize-precision-at-recall”, or “maximize-recall-at-precision”. For multi class classification, “minimize-log-loss”. For regression, “minimize-rmse”, “minimize-mae”, or “minimize-rmsle”.

transformations:

The path to a GCS file containing the transformations to apply.

train_budget_milli_node_hours:

The train budget of creating this model, expressed in milli node hours, i.e. a value of 1,000 in this field means 1 node hour.

stage_1_num_parallel_trials:

Number of parallel trials for stage 1.

stage_2_num_parallel_trials:

Number of parallel trials for stage 2.

stage_2_num_selected_trials:

Number of selected trials for stage 2.

data_source_csv_filenames:

The CSV data source.

data_source_bigquery_table_path:

The BigQuery data source.

predefined_split_key:

The predefined_split column name.

timestamp_split_key:

The timestamp_split column name.

stratified_split_key:

The stratified_split column name.

training_fraction:

The training fraction.

validation_fraction:

The validation fraction.

test_fraction:

The test fraction.

weight_column:

The weight column name.

study_spec_parameters_override:

The list for overriding study spec. The list should be of format https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/study.proto#L181.

optimization_objective_recall_value:

Required when optimization_objective is “maximize-precision-at-recall”. Must be between 0 and 1, inclusive.

optimization_objective_precision_value:

Required when optimization_objective is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.

stage_1_tuner_worker_pool_specs_override:

The dictionary for overriding the stage 1 tuner worker pool spec. The dictionary should be of format https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.

cv_trainer_worker_pool_specs_override:

The dictionary for overriding the stage CV trainer worker pool spec. The dictionary should be of format https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.

export_additional_model_without_custom_ops:

Whether to export additional model without custom TensorFlow operators.

stats_and_example_gen_dataflow_machine_type:

The dataflow machine type for stats_and_example_gen component.

stats_and_example_gen_dataflow_max_num_workers:

The max number of Dataflow workers for stats_and_example_gen component.

stats_and_example_gen_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for stats_and_example_gen component.

transform_dataflow_machine_type:

The dataflow machine type for transform component.

transform_dataflow_max_num_workers:

The max number of Dataflow workers for transform component.

transform_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for transform component.

dataflow_subnetwork:

Dataflow’s fully qualified subnetwork name, when empty the default subnetwork will be used. Example: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications

dataflow_use_public_ips:

Specifies whether Dataflow workers use public IP addresses.

encryption_spec_key_name:

The KMS key name.

additional_experiments:

Use this field to configure private preview features.

dataflow_service_account:

Custom service account to run dataflow jobs.

run_evaluation:

Whether to run evaluation in the training pipeline.

evaluation_batch_predict_machine_type:

The prediction server machine type for batch predict components during evaluation.

evaluation_batch_predict_starting_replica_count:

The initial number of prediction servers for batch predict components during evaluation.

evaluation_batch_predict_max_replica_count:

The max number of prediction servers for batch predict components during evaluation.

evaluation_dataflow_machine_type:

The dataflow machine type for evaluation components.

evaluation_dataflow_max_num_workers:

The max number of Dataflow workers for evaluation components.

evaluation_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for evaluation components.

max_selected_features:

Number of features to select for training.

apply_feature_selection_tuning:

If true, tune the feature selection rate.

run_distillation:

Whether to run distillation in the training pipeline.

distill_batch_predict_machine_type:

The prediction server machine type for batch predict component in the model distillation.

distill_batch_predict_starting_replica_count:

The initial number of prediction servers for the batch predict component in the model distillation.

distill_batch_predict_max_replica_count:

The max number of prediction servers for the batch predict component in the model distillation.

Returns:

Tuple of pipeline_definition_path and parameter_values.
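
The helper only builds the compiled pipeline path and its parameter values; running the pipeline is up to the caller. As an illustrative sketch (project, bucket, and table names below are placeholders, and the submission step assumes the google-cloud-aiplatform SDK’s PipelineJob class), it might be wired up like this:

    # Sketch only: all resource names are placeholders.
    from google.cloud import aiplatform
    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    template_path, parameter_values = (
        utils.get_automl_tabular_feature_selection_pipeline_and_parameters(
            project='my-project',                      # placeholder project ID
            location='us-central1',
            root_dir='gs://my-bucket/pipeline_root',   # placeholder GCS root
            target_column='label',
            prediction_type='classification',
            optimization_objective='maximize-au-roc',
            transformations='gs://my-bucket/transformations.json',
            train_budget_milli_node_hours=1000,        # 1 node hour
            data_source_bigquery_table_path='bq://my-project.my_dataset.my_table',
            max_selected_features=500,
        )
    )

    # Submit the compiled pipeline with the generated parameter values.
    aiplatform.init(project='my-project', location='us-central1')
    aiplatform.PipelineJob(
        display_name='automl-tabular-feature-selection',
        template_path=template_path,
        parameter_values=parameter_values,
        pipeline_root='gs://my-bucket/pipeline_root',
    ).run()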

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_automl_tabular_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, optimization_objective: str, transformations: str, train_budget_milli_node_hours: float, stage_1_num_parallel_trials: Optional[int] = None, stage_2_num_parallel_trials: Optional[int] = None, stage_2_num_selected_trials: Optional[int] = None, data_source_csv_filenames: Optional[str] = None, data_source_bigquery_table_path: Optional[str] = None, predefined_split_key: Optional[str] = None, timestamp_split_key: Optional[str] = None, stratified_split_key: Optional[str] = None, training_fraction: Optional[float] = None, validation_fraction: Optional[float] = None, test_fraction: Optional[float] = None, weight_column: Optional[str] = None, study_spec_parameters_override: Optional[List[Dict[str, Any]]] = None, optimization_objective_recall_value: Optional[float] = None, optimization_objective_precision_value: Optional[float] = None, stage_1_tuner_worker_pool_specs_override: Optional[Dict[str, Any]] = None, cv_trainer_worker_pool_specs_override: Optional[Dict[str, Any]] = None, export_additional_model_without_custom_ops: bool = False, stats_and_example_gen_dataflow_machine_type: Optional[str] = None, stats_and_example_gen_dataflow_max_num_workers: Optional[int] = None, stats_and_example_gen_dataflow_disk_size_gb: Optional[int] = None, transform_dataflow_machine_type: Optional[str] = None, transform_dataflow_max_num_workers: Optional[int] = None, transform_dataflow_disk_size_gb: Optional[int] = None, dataflow_subnetwork: Optional[str] = None, dataflow_use_public_ips: bool = True, encryption_spec_key_name: Optional[str] = None, additional_experiments: Optional[Dict[str, Any]] = None, dataflow_service_account: Optional[str] = None, run_evaluation: bool = True, evaluation_batch_predict_machine_type: Optional[str] = None, evaluation_batch_predict_starting_replica_count: Optional[str] = None, evaluation_batch_predict_max_replica_count: Optional[str] = None, evaluation_dataflow_machine_type: Optional[str] = None, evaluation_dataflow_max_num_workers: Optional[int] = None, evaluation_dataflow_disk_size_gb: Optional[int] = None, run_distillation: bool = False, distill_batch_predict_machine_type: Optional[str] = None, distill_batch_predict_starting_replica_count: Optional[int] = None, distill_batch_predict_max_replica_count: Optional[int] = None, stage_1_tuning_result_artifact_uri: Optional[str] = None, quantiles: Optional[List[float]] = None, enable_probabilistic_inference: bool = False) Tuple[str, Dict[str, Any]]

Get the AutoML Tabular v1 default training pipeline.

Args:
project:

The GCP project that runs the pipeline components.

location:

The GCP region that runs the pipeline components.

root_dir:

The root GCS directory for the pipeline components.

target_column:

The target column name.

prediction_type:

The type of prediction the model is to produce. “classification” or “regression”.

optimization_objective:

For binary classification, “maximize-au-roc”, “minimize-log-loss”, “maximize-au-prc”, “maximize-precision-at-recall”, or “maximize-recall-at-precision”. For multi class classification, “minimize-log-loss”. For regression, “minimize-rmse”, “minimize-mae”, or “minimize-rmsle”.

transformations:

The path to a GCS file containing the transformations to apply.

train_budget_milli_node_hours:

The train budget of creating this model, expressed in milli node hours, i.e. a value of 1,000 in this field means 1 node hour.

stage_1_num_parallel_trials:

Number of parallel trials for stage 1.

stage_2_num_parallel_trials:

Number of parallel trials for stage 2.

stage_2_num_selected_trials:

Number of selected trials for stage 2.

data_source_csv_filenames:

The CSV data source.

data_source_bigquery_table_path:

The BigQuery data source.

predefined_split_key:

The predefined_split column name.

timestamp_split_key:

The timestamp_split column name.

stratified_split_key:

The stratified_split column name.

training_fraction:

The training fraction.

validation_fraction:

The validation fraction.

test_fraction:

The test fraction.

weight_column:

The weight column name.

study_spec_parameters_override:

The list for overriding study spec. The list should be of format https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/study.proto#L181.

optimization_objective_recall_value:

Required when optimization_objective is “maximize-precision-at-recall”. Must be between 0 and 1, inclusive.

optimization_objective_precision_value:

Required when optimization_objective is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.

stage_1_tuner_worker_pool_specs_override:

The dictionary for overriding the stage 1 tuner worker pool spec. The dictionary should be of format https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.

cv_trainer_worker_pool_specs_override:

The dictionary for overriding the stage CV trainer worker pool spec. The dictionary should be of format https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.

export_additional_model_without_custom_ops:

Whether to export additional model without custom TensorFlow operators.

stats_and_example_gen_dataflow_machine_type:

The dataflow machine type for stats_and_example_gen component.

stats_and_example_gen_dataflow_max_num_workers:

The max number of Dataflow workers for stats_and_example_gen component.

stats_and_example_gen_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for stats_and_example_gen component.

transform_dataflow_machine_type:

The dataflow machine type for transform component.

transform_dataflow_max_num_workers:

The max number of Dataflow workers for transform component.

transform_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for transform component.

dataflow_subnetwork:

Dataflow’s fully qualified subnetwork name, when empty the default subnetwork will be used. Example: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications

dataflow_use_public_ips:

Specifies whether Dataflow workers use public IP addresses.

encryption_spec_key_name:

The KMS key name.

additional_experiments:

Use this field to configure private preview features.

dataflow_service_account:

Custom service account to run dataflow jobs.

run_evaluation:

Whether to run evaluation in the training pipeline.

evaluation_batch_predict_machine_type:

The prediction server machine type for batch predict components during evaluation.

evaluation_batch_predict_starting_replica_count:

The initial number of prediction servers for batch predict components during evaluation.

evaluation_batch_predict_max_replica_count:

The max number of prediction servers for batch predict components during evaluation.

evaluation_dataflow_machine_type:

The dataflow machine type for evaluation components.

evaluation_dataflow_max_num_workers:

The max number of Dataflow workers for evaluation components.

evaluation_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for evaluation components.

run_distillation:

Whether to run distillation in the training pipeline.

distill_batch_predict_machine_type:

The prediction server machine type for batch predict component in the model distillation.

distill_batch_predict_starting_replica_count:

The initial number of prediction servers for the batch predict component in the model distillation.

distill_batch_predict_max_replica_count:

The max number of prediction servers for the batch predict component in the model distillation.

stage_1_tuning_result_artifact_uri:

The stage 1 tuning result artifact GCS URI.

quantiles:

Quantiles to use for probabilistic inference. Up to 5 quantiles are allowed of values between 0 and 1, exclusive. Represents the quantiles to use for that objective. Quantiles must be unique.

enable_probabilistic_inference:

If probabilistic inference is enabled, the model will fit a distribution that captures the uncertainty of a prediction. At inference time, the predictive distribution is used to make a point prediction that minimizes the optimization objective. For example, the mean of a predictive distribution is the point prediction that minimizes RMSE loss. If quantiles are specified, then the quantiles of the distribution are also returned.

Returns:

Tuple of pipeline_definition_path and parameter_values.
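
As with the feature-selection variant above, the returned tuple can be handed to a Vertex AI PipelineJob. A minimal sketch using a CSV data source and explicit split fractions (all resource names and values are placeholders):

    # Sketch only: resource names are placeholders.
    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    template_path, parameter_values = utils.get_automl_tabular_pipeline_and_parameters(
        project='my-project',
        location='us-central1',
        root_dir='gs://my-bucket/pipeline_root',
        target_column='price',
        prediction_type='regression',
        optimization_objective='minimize-rmse',
        transformations='gs://my-bucket/transformations.json',
        train_budget_milli_node_hours=8000,   # 8 node hours
        data_source_csv_filenames='gs://my-bucket/data/train-00.csv,gs://my-bucket/data/train-01.csv',
        training_fraction=0.8,
        validation_fraction=0.1,
        test_fraction=0.1,
    )
    # template_path and parameter_values are then submitted exactly as in the
    # PipelineJob sketch shown earlier.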

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_builtin_algorithm_hyperparameter_tuning_job_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, study_spec_metric_id: str, study_spec_metric_goal: str, study_spec_parameters_override: List[Dict[str, Any]], max_trial_count: int, parallel_trial_count: int, algorithm: str, enable_profiler: bool = False, seed: int = 1, eval_steps: int = 0, eval_frequency_secs: int = 600, transform_config: Optional[str] = None, dataset_level_custom_transformation_definitions: Optional[List[Dict[str, Any]]] = None, dataset_level_transformations: Optional[List[Dict[str, Any]]] = None, predefined_split_key: Optional[str] = None, stratified_split_key: Optional[str] = None, training_fraction: Optional[float] = None, validation_fraction: Optional[float] = None, test_fraction: Optional[float] = None, tf_auto_transform_features: Optional[List[str]] = None, tf_custom_transformation_definitions: Optional[List[Dict[str, Any]]] = None, tf_transformations_path: Optional[str] = None, data_source_csv_filenames: Optional[str] = None, data_source_bigquery_table_path: Optional[str] = None, bigquery_staging_full_dataset_id: Optional[str] = None, weight_column: str = '', max_failed_trial_count: int = 0, study_spec_algorithm: str = 'ALGORITHM_UNSPECIFIED', study_spec_measurement_selection_type: str = 'BEST_MEASUREMENT', transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, worker_pool_specs_override: Optional[Dict[str, Any]] = None, run_evaluation: bool = True, evaluation_batch_predict_machine_type: str = 'n1-standard-16', evaluation_batch_predict_starting_replica_count: int = 25, evaluation_batch_predict_max_replica_count: int = 25, evaluation_dataflow_machine_type: str = 'n1-standard-4', evaluation_dataflow_max_num_workers: int = 25, evaluation_dataflow_disk_size_gb: int = 50, dataflow_service_account: str = '', dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '') Tuple[str, Dict[str, Any]]

Get the built-in algorithm HyperparameterTuningJob pipeline.

Args:
project:

The GCP project that runs the pipeline components.

location:

The GCP region that runs the pipeline components.

root_dir:

The root GCS directory for the pipeline components.

target_column:

The target column name.

prediction_type:

The type of prediction the model is to produce. “classification” or “regression”.

study_spec_metric_id:

Metric to optimize, possible values: [ ‘loss’, ‘average_loss’, ‘rmse’, ‘mae’, ‘mql’, ‘accuracy’, ‘auc’, ‘precision’, ‘recall’].

study_spec_metric_goal:

Optimization goal of the metric, possible values: “MAXIMIZE”, “MINIMIZE”.

study_spec_parameters_override:

List of dictionaries representing parameters to optimize. The dictionary key is the parameter_id, which is passed to the training job as a command line argument, and the dictionary value is the parameter specification of the metric.

max_trial_count:

The desired total number of trials.

parallel_trial_count:

The desired number of trials to run in parallel.

algorithm:

Algorithm to train. One of “tabnet” and “wide_and_deep”.

enable_profiler:

Enables profiling and saves a trace during evaluation.

seed:

Seed to be used for this run.

eval_steps:

Number of steps to run evaluation for. If not specified or negative, it means run evaluation on the whole validation dataset. If set to 0, it means run evaluation for a fixed number of samples.

eval_frequency_secs:

Frequency at which evaluation and checkpointing will take place.

transform_config:

Path to v1 TF transformation configuration.

dataset_level_custom_transformation_definitions:

Dataset-level custom transformation definitions in string format.

dataset_level_transformations:

Dataset-level transformation configuration in string format.

predefined_split_key:

Predefined split key.

stratified_split_key:

Stratified split key.

training_fraction:

Training fraction.

validation_fraction:

Validation fraction.

test_fraction:

Test fraction.

tf_auto_transform_features:

List of auto transform features in the comma-separated string format.

tf_custom_transformation_definitions:

TF custom transformation definitions in string format.

tf_transformations_path:

Path to TF transformation configuration.

data_source_csv_filenames:

The CSV data source.

data_source_bigquery_table_path:

The BigQuery data source.

bigquery_staging_full_dataset_id:

The BigQuery staging full dataset id for storing intermediate tables.

weight_column:

The weight column name.

max_failed_trial_count:

The number of failed trials that need to be seen before failing the HyperparameterTuningJob. If set to 0, Vertex AI decides how many trials must fail before the whole job fails.

study_spec_algorithm:

The search algorithm specified for the study. One of “ALGORITHM_UNSPECIFIED”, “GRID_SEARCH”, or “RANDOM_SEARCH”.

study_spec_measurement_selection_type:

Which measurement to use if/when the service automatically selects the final measurement from previously reported intermediate measurements. One of “BEST_MEASUREMENT” or “LAST_MEASUREMENT”.

transform_dataflow_machine_type:

The dataflow machine type for transform component.

transform_dataflow_max_num_workers:

The max number of Dataflow workers for transform component.

transform_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for transform component.

worker_pool_specs_override:

The dictionary for overriding training and evaluation worker pool specs. The dictionary should be of format https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.

run_evaluation:

Whether to run evaluation steps during training.

evaluation_batch_predict_machine_type:

The prediction server machine type for batch predict components during evaluation.

evaluation_batch_predict_starting_replica_count:

The initial number of prediction servers for batch predict components during evaluation.

evaluation_batch_predict_max_replica_count:

The max number of prediction servers for batch predict components during evaluation.

evaluation_dataflow_machine_type:

The dataflow machine type for evaluation components.

evaluation_dataflow_max_num_workers:

The max number of Dataflow workers for evaluation components.

evaluation_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for evaluation components.

dataflow_service_account:

Custom service account to run dataflow jobs.

dataflow_subnetwork:

Dataflow’s fully qualified subnetwork name, when empty the default subnetwork will be used. Example: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications

dataflow_use_public_ips:

Specifies whether Dataflow workers use public IP addresses.

encryption_spec_key_name:

The KMS key name.

Returns:

Tuple of pipeline_definition_path and parameter_values.
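
The study_spec_parameters_override argument follows the StudySpec.ParameterSpec JSON layout from the study.proto file linked above. As a hedged illustration (the parameter_ids ‘learning_rate’ and ‘batch_size’ are hypothetical and depend on the chosen algorithm’s command line arguments), an override list might look like:

    # Illustrative guess at the ParameterSpec JSON layout; confirm against the
    # linked study.proto before relying on it.
    study_spec_parameters_override = [
        {
            'parameter_id': 'learning_rate',   # hypothetical parameter name
            'double_value_spec': {'min_value': 1e-4, 'max_value': 1e-1},
            'scale_type': 'UNIT_LOG_SCALE',
        },
        {
            'parameter_id': 'batch_size',      # hypothetical parameter name
            'discrete_value_spec': {'values': [32, 64, 128]},
        },
    ]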

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_default_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column_name: str, prediction_type: str, optimization_objective: str, transformations: Dict[str, Any], split_spec: Dict[str, Any], data_source: Dict[str, Any], train_budget_milli_node_hours: float, stage_1_num_parallel_trials: int = 35, stage_2_num_parallel_trials: int = 35, stage_2_num_selected_trials: int = 5, weight_column_name: str = '', study_spec_override: Optional[Dict[str, Any]] = None, optimization_objective_recall_value: float = -1, optimization_objective_precision_value: float = -1, stage_1_tuner_worker_pool_specs_override: Optional[Dict[str, Any]] = None, cv_trainer_worker_pool_specs_override: Optional[Dict[str, Any]] = None, export_additional_model_without_custom_ops: bool = False, stats_and_example_gen_dataflow_machine_type: str = 'n1-standard-16', stats_and_example_gen_dataflow_max_num_workers: int = 25, stats_and_example_gen_dataflow_disk_size_gb: int = 40, transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '', additional_experiments: Optional[Dict[str, Any]] = None, dataflow_service_account: str = '', run_evaluation: bool = True, evaluation_batch_predict_machine_type: str = 'n1-standard-16', evaluation_batch_predict_starting_replica_count: int = 25, evaluation_batch_predict_max_replica_count: int = 25, evaluation_dataflow_machine_type: str = 'n1-standard-4', evaluation_dataflow_max_num_workers: int = 25, evaluation_dataflow_disk_size_gb: int = 50, run_distillation: bool = False, distill_batch_predict_machine_type: str = 'n1-standard-16', distill_batch_predict_starting_replica_count: int = 25, distill_batch_predict_max_replica_count: int = 25) Tuple[str, Dict[str, Any]]

Get the AutoML Tabular default training pipeline.

Args:
project:

The GCP project that runs the pipeline components.

location:

The GCP region that runs the pipeline components.

root_dir:

The root GCS directory for the pipeline components.

target_column_name:

The target column name.

prediction_type:

The type of prediction the model is to produce. “classification” or “regression”.

optimization_objective:

For binary classification, “maximize-au-roc”, “minimize-log-loss”, “maximize-au-prc”, “maximize-precision-at-recall”, or “maximize-recall-at-precision”. For multi class classification, “minimize-log-loss”. For regression, “minimize-rmse”, “minimize-mae”, or “minimize-rmsle”.

transformations:

The transformations to apply.

split_spec:

The split spec.

data_source:

The data source.

train_budget_milli_node_hours:

The train budget of creating this model, expressed in milli node hours, i.e. a value of 1,000 in this field means 1 node hour.

stage_1_num_parallel_trials:

Number of parallel trials for stage 1.

stage_2_num_parallel_trials:

Number of parallel trials for stage 2.

stage_2_num_selected_trials:

Number of selected trials for stage 2.

weight_column_name:

The weight column name.

study_spec_override:

The dictionary for overriding study spec. The dictionary should be of format https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/study.proto#L181.

optimization_objective_recall_value:

Required when optimization_objective is “maximize-precision-at-recall”. Must be between 0 and 1, inclusive.

optimization_objective_precision_value:

Required when optimization_objective is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.

stage_1_tuner_worker_pool_specs_override:

The dictionary for overriding the stage 1 tuner worker pool spec. The dictionary should be of format https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.

cv_trainer_worker_pool_specs_override:

The dictionary for overriding the stage CV trainer worker pool spec. The dictionary should be of format https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.

export_additional_model_without_custom_ops:

Whether to export additional model without custom TensorFlow operators.

stats_and_example_gen_dataflow_machine_type:

The dataflow machine type for stats_and_example_gen component.

stats_and_example_gen_dataflow_max_num_workers:

The max number of Dataflow workers for stats_and_example_gen component.

stats_and_example_gen_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for stats_and_example_gen component.

transform_dataflow_machine_type:

The dataflow machine type for transform component.

transform_dataflow_max_num_workers:

The max number of Dataflow workers for transform component.

transform_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for transform component.

dataflow_subnetwork:

Dataflow’s fully qualified subnetwork name, when empty the default subnetwork will be used. Example: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications

dataflow_use_public_ips:

Specifies whether Dataflow workers use public IP addresses.

encryption_spec_key_name:

The KMS key name.

additional_experiments:

Use this field to configure private preview features.

dataflow_service_account:

Custom service account to run dataflow jobs.

run_evaluation:

Whether to run evaluation in the training pipeline.

evaluation_batch_predict_machine_type:

The prediction server machine type for batch predict components during evaluation.

evaluation_batch_predict_starting_replica_count:

The initial number of prediction servers for batch predict components during evaluation.

evaluation_batch_predict_max_replica_count:

The max number of prediction servers for batch predict components during evaluation.

evaluation_dataflow_machine_type:

The dataflow machine type for evaluation components.

evaluation_dataflow_max_num_workers:

The max number of Dataflow workers for evaluation components.

evaluation_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for evaluation components.

run_distillation:

Whether to run distillation in the training pipeline.

distill_batch_predict_machine_type:

The prediction server machine type for batch predict component in the model distillation.

distill_batch_predict_starting_replica_count:

The initial number of prediction servers for the batch predict component in the model distillation.

distill_batch_predict_max_replica_count:

The max number of prediction servers for the batch predict component in the model distillation.

Returns:

Tuple of pipeline_definition_path and parameter_values.
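
Several of these helpers accept worker pool spec overrides whose layout follows the WorkerPoolSpec message in the custom_job.proto link referenced above. As a rough, unverified sketch (the machine type and replica count are arbitrary examples, and the exact container shape expected by the component should be checked against the linked proto), such an override could look like:

    # Illustrative guess, mirroring fields that exist on WorkerPoolSpec
    # (machine_spec, replica_count); verify the exact structure the component
    # expects before relying on it.
    stage_1_tuner_worker_pool_specs_override = [
        {
            'machine_spec': {'machine_type': 'n1-standard-32'},  # larger tuner machines
            'replica_count': 1,
        },
    ]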

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_distill_skip_evaluation_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column_name: str, prediction_type: str, optimization_objective: str, transformations: Dict[str, Any], split_spec: Dict[str, Any], data_source: Dict[str, Any], train_budget_milli_node_hours: float, stage_1_num_parallel_trials: int = 35, stage_2_num_parallel_trials: int = 35, stage_2_num_selected_trials: int = 5, weight_column_name: str = '', study_spec_override: Optional[Dict[str, Any]] = None, optimization_objective_recall_value: float = -1, optimization_objective_precision_value: float = -1, stage_1_tuner_worker_pool_specs_override: Optional[Dict[str, Any]] = None, cv_trainer_worker_pool_specs_override: Optional[Dict[str, Any]] = None, export_additional_model_without_custom_ops: bool = False, stats_and_example_gen_dataflow_machine_type: str = 'n1-standard-16', stats_and_example_gen_dataflow_max_num_workers: int = 25, stats_and_example_gen_dataflow_disk_size_gb: int = 40, transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '', additional_experiments: Optional[Dict[str, Any]] = None, distill_batch_predict_machine_type: str = 'n1-standard-16', distill_batch_predict_starting_replica_count: int = 25, distill_batch_predict_max_replica_count: int = 25) Tuple[str, Dict[str, Any]]

Get the AutoML Tabular training pipeline that distills and skips evaluation.

Args:

project: The GCP project that runs the pipeline components.

location: The GCP region that runs the pipeline components.

root_dir: The root GCS directory for the pipeline components.

target_column_name: The target column name.

prediction_type: The type of prediction the model is to produce. “classification” or “regression”.

optimization_objective: For binary classification, “maximize-au-roc”, “minimize-log-loss”, “maximize-au-prc”, “maximize-precision-at-recall”, or “maximize-recall-at-precision”. For multi class classification, “minimize-log-loss”. For regression, “minimize-rmse”, “minimize-mae”, or “minimize-rmsle”.

transformations: The transformations to apply.

split_spec: The split spec.

data_source: The data source.

train_budget_milli_node_hours: The train budget of creating this model, expressed in milli node hours, i.e. a value of 1,000 in this field means 1 node hour.

stage_1_num_parallel_trials: Number of parallel trials for stage 1.

stage_2_num_parallel_trials: Number of parallel trials for stage 2.

stage_2_num_selected_trials: Number of selected trials for stage 2.

weight_column_name: The weight column name.

study_spec_override: The dictionary for overriding study spec. The dictionary should be of format https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/study.proto#L181.

optimization_objective_recall_value: Required when optimization_objective is “maximize-precision-at-recall”. Must be between 0 and 1, inclusive.

optimization_objective_precision_value: Required when optimization_objective is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.

stage_1_tuner_worker_pool_specs_override: The dictionary for overriding the stage 1 tuner worker pool spec. The dictionary should be of format https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.

cv_trainer_worker_pool_specs_override: The dictionary for overriding the stage CV trainer worker pool spec. The dictionary should be of format https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.

export_additional_model_without_custom_ops: Whether to export additional model without custom TensorFlow operators.

stats_and_example_gen_dataflow_machine_type: The dataflow machine type for stats_and_example_gen component.

stats_and_example_gen_dataflow_max_num_workers: The max number of Dataflow workers for stats_and_example_gen component.

stats_and_example_gen_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for stats_and_example_gen component.

transform_dataflow_machine_type: The dataflow machine type for transform component.

transform_dataflow_max_num_workers: The max number of Dataflow workers for transform component.

transform_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for transform component.

dataflow_subnetwork: Dataflow’s fully qualified subnetwork name, when empty the default subnetwork will be used. Example: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications

dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.

encryption_spec_key_name: The KMS key name.

additional_experiments: Use this field to configure private preview features.

distill_batch_predict_machine_type: The prediction server machine type for the batch predict component in the model distillation.

distill_batch_predict_starting_replica_count: The initial number of prediction servers for the batch predict component in the model distillation.

distill_batch_predict_max_replica_count: The max number of prediction servers for the batch predict component in the model distillation.

Returns:

Tuple of pipeline_definition_path and parameter_values.

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_feature_selection_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, algorithm: str, prediction_type: str, data_source_csv_filenames: Optional[str] = None, data_source_bigquery_table_path: Optional[str] = None, max_selected_features: Optional[int] = None, dataflow_machine_type: str = 'n1-standard-16', dataflow_max_num_workers: int = 25, dataflow_disk_size_gb: int = 40, dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, dataflow_service_account: str = '')

Get the feature selection pipeline that generates feature ranking and selected features.

Args:
project:

The GCP project that runs the pipeline components.

location:

The GCP region that runs the pipeline components.

root_dir:

The root GCS directory for the pipeline components.

target_column:

The target column name.

algorithm:

Algorithm to select features; defaults to AMI.

prediction_type:

The type of prediction the model is to produce. “classification” or “regression”.

data_source_csv_filenames:

A string that represents a list of comma separated CSV filenames.

data_source_bigquery_table_path:

The BigQuery table path.

max_selected_features:

Number of features to be selected.

dataflow_machine_type:

The dataflow machine type for feature_selection component.

dataflow_max_num_workers:

The max number of Dataflow workers for feature_selection component.

dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for feature_selection component.

dataflow_subnetwork:

Dataflow’s fully qualified subnetwork name, when empty the default subnetwork will be used. Example: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications

dataflow_use_public_ips:

Specifies whether Dataflow workers use public IP addresses.

dataflow_service_account:

Custom service account to run dataflow jobs.

Returns:

Tuple of pipeline_definition_path and parameter_values.
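
A minimal sketch of calling this helper (all resource names are placeholders; the ‘AMI’ algorithm string is taken from the default mentioned above):

    # Sketch only: resource names are placeholders.
    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    template_path, parameter_values = utils.get_feature_selection_pipeline_and_parameters(
        project='my-project',
        location='us-central1',
        root_dir='gs://my-bucket/pipeline_root',
        target_column='label',
        algorithm='AMI',                   # per the default noted above
        prediction_type='classification',
        data_source_csv_filenames='gs://my-bucket/data/train.csv',
        max_selected_features=100,
    )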

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_model_comparison_pipeline_and_parameters(project: str, location: str, root_dir: str, prediction_type: str, training_jobs: Dict[str, Dict[str, Any]], data_source_csv_filenames: str = '-', data_source_bigquery_table_path: str = '-', evaluation_data_source_csv_filenames: str = '-', evaluation_data_source_bigquery_table_path: str = '-', experiment: str = '-', service_account: str = '-', network: str = '-') Tuple[str, Dict[str, Any]]

Returns a compiled model comparison pipeline and formatted parameters.

Args:

project: The GCP project that runs the pipeline components.

location: The GCP region that runs the pipeline components.

root_dir: The root GCS directory for the pipeline components.

prediction_type: The type of problem being solved. Can be one of: regression, classification, or forecasting.

training_jobs: A dict mapping name to a dict of training job inputs.

data_source_csv_filenames: Comma-separated paths to CSVs stored in GCS to use as the training dataset for all training pipelines. This should be None if data_source_bigquery_table_path is not None. This should only contain data from the training and validation split and not from the test split.

data_source_bigquery_table_path: Path to BigQuery Table to use as the training dataset for all training pipelines. This should be None if data_source_csv_filenames is not None. This should only contain data from the training and validation split and not from the test split.

evaluation_data_source_csv_filenames: Comma-separated paths to CSVs stored in GCS to use as the evaluation dataset for all training pipelines. This should be None if evaluation_data_source_bigquery_table_path is not None. This should only contain data from the test split and not from the training and validation split.

evaluation_data_source_bigquery_table_path: Path to BigQuery Table to use as the evaluation dataset for all training pipelines. This should be None if evaluation_data_source_csv_filenames is not None. This should only contain data from the test split and not from the training and validation split.

experiment: Vertex Experiment to add training pipeline runs to. A new Experiment will be created if none is provided.

service_account: Specifies the service account for the sub-pipeline jobs.

network: The full name of the Compute Engine network to which the sub-pipeline jobs should be peered.

Returns:

Tuple of pipeline_definition_path and parameter_values.

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_skip_architecture_search_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, optimization_objective: str, transformations: str, train_budget_milli_node_hours: float, stage_1_tuning_result_artifact_uri: str, stage_2_num_parallel_trials: Optional[int] = None, stage_2_num_selected_trials: Optional[int] = None, data_source_csv_filenames: Optional[str] = None, data_source_bigquery_table_path: Optional[str] = None, predefined_split_key: Optional[str] = None, timestamp_split_key: Optional[str] = None, stratified_split_key: Optional[str] = None, training_fraction: Optional[float] = None, validation_fraction: Optional[float] = None, test_fraction: Optional[float] = None, weight_column: Optional[str] = None, optimization_objective_recall_value: Optional[float] = None, optimization_objective_precision_value: Optional[float] = None, cv_trainer_worker_pool_specs_override: Optional[Dict[str, Any]] = None, export_additional_model_without_custom_ops: bool = False, stats_and_example_gen_dataflow_machine_type: Optional[str] = None, stats_and_example_gen_dataflow_max_num_workers: Optional[int] = None, stats_and_example_gen_dataflow_disk_size_gb: Optional[int] = None, transform_dataflow_machine_type: Optional[str] = None, transform_dataflow_max_num_workers: Optional[int] = None, transform_dataflow_disk_size_gb: Optional[int] = None, dataflow_subnetwork: Optional[str] = None, dataflow_use_public_ips: bool = True, encryption_spec_key_name: Optional[str] = None, additional_experiments: Optional[Dict[str, Any]] = None, dataflow_service_account: Optional[str] = None, run_evaluation: bool = True, evaluation_batch_predict_machine_type: Optional[str] = None, evaluation_batch_predict_starting_replica_count: Optional[int] = None, evaluation_batch_predict_max_replica_count: Optional[int] = None, evaluation_dataflow_machine_type: Optional[str] = None, evaluation_dataflow_max_num_workers: Optional[int] = None, evaluation_dataflow_disk_size_gb: Optional[int] = None) Tuple[str, Dict[str, Any]]

Get the AutoML Tabular training pipeline that skips architecture search.

Args:
project:

The GCP project that runs the pipeline components.

location:

The GCP region that runs the pipeline components.

root_dir:

The root GCS directory for the pipeline components.

target_column:

The target column name.

prediction_type:

The type of prediction the model is to produce. “classification” or “regression”.

optimization_objective:

For binary classification, “maximize-au-roc”, “minimize-log-loss”, “maximize-au-prc”, “maximize-precision-at-recall”, or “maximize-recall-at-precision”. For multi class classification, “minimize-log-loss”. For regression, “minimize-rmse”, “minimize-mae”, or “minimize-rmsle”.

transformations:

The transformations to apply.

train_budget_milli_node_hours:

The train budget of creating this model, expressed in milli node hours, i.e. a value of 1,000 in this field means 1 node hour.

stage_1_tuning_result_artifact_uri:

The stage 1 tuning result artifact GCS URI.

stage_2_num_parallel_trials:

Number of parallel trials for stage 2.

stage_2_num_selected_trials:

Number of selected trials for stage 2.

data_source_csv_filenames:

The CSV data source.

data_source_bigquery_table_path:

The BigQuery data source.

predefined_split_key:

The predefined_split column name.

timestamp_split_key:

The timestamp_split column name.

stratified_split_key:

The stratified_split column name.

training_fraction:

The training fraction.

validation_fraction:

The validation fraction.

test_fraction:

The test fraction.

weight_column:

The weight column name.

optimization_objective_recall_value:

Required when optimization_objective is “maximize-precision-at-recall”. Must be between 0 and 1, inclusive.

optimization_objective_precision_value:

Required when optimization_objective is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.

cv_trainer_worker_pool_specs_override:

The dictionary for overriding the stage CV trainer worker pool spec. The dictionary should be of format https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.

export_additional_model_without_custom_ops:

Whether to export additional model without custom TensorFlow operators.

stats_and_example_gen_dataflow_machine_type:

The dataflow machine type for stats_and_example_gen component.

stats_and_example_gen_dataflow_max_num_workers:

The max number of Dataflow workers for stats_and_example_gen component.

stats_and_example_gen_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for stats_and_example_gen component.

transform_dataflow_machine_type:

The dataflow machine type for transform component.

transform_dataflow_max_num_workers:

The max number of Dataflow workers for transform component.

transform_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for transform component.

dataflow_subnetwork:

Dataflow’s fully qualified subnetwork name, when empty the default subnetwork will be used. Example: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications

dataflow_use_public_ips:

Specifies whether Dataflow workers use public IP addresses.

encryption_spec_key_name:

The KMS key name.

additional_experiments:

Use this field to configure private preview features.

dataflow_service_account:

Custom service account to run dataflow jobs.

run_evaluation:

Whether to run evaluation in the training pipeline.

evaluation_batch_predict_machine_type:

The prediction server machine type for batch predict components during evaluation.

evaluation_batch_predict_starting_replica_count:

The initial number of prediction servers for batch predict components during evaluation.

evaluation_batch_predict_max_replica_count:

The max number of prediction servers for batch predict components during evaluation.

evaluation_dataflow_machine_type:

The dataflow machine type for evaluation components.

evaluation_dataflow_max_num_workers:

The max number of Dataflow workers for evaluation components.

evaluation_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for evaluation components.

Returns:

Tuple of pipeline_definition_path and parameter_values.
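
A brief sketch of reusing a previous run’s stage 1 tuning result so the architecture search is skipped (the artifact URI and other resource names are placeholders):

    # Sketch only: resource names are placeholders.
    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    template_path, parameter_values = utils.get_skip_architecture_search_pipeline_and_parameters(
        project='my-project',
        location='us-central1',
        root_dir='gs://my-bucket/pipeline_root',
        target_column='label',
        prediction_type='classification',
        optimization_objective='minimize-log-loss',
        transformations='gs://my-bucket/transformations.json',
        train_budget_milli_node_hours=2000,
        # GCS URI of the stage 1 tuning result artifact from an earlier run (placeholder).
        stage_1_tuning_result_artifact_uri='gs://my-bucket/prev_run/stage_1_tuning_result',
        data_source_bigquery_table_path='bq://my-project.my_dataset.my_table',
    )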

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_skip_evaluation_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column_name: str, prediction_type: str, optimization_objective: str, transformations: Dict[str, Any], split_spec: Dict[str, Any], data_source: Dict[str, Any], train_budget_milli_node_hours: float, stage_1_num_parallel_trials: int = 35, stage_2_num_parallel_trials: int = 35, stage_2_num_selected_trials: int = 5, weight_column_name: str = '', study_spec_override: Optional[Dict[str, Any]] = None, optimization_objective_recall_value: float = -1, optimization_objective_precision_value: float = -1, stage_1_tuner_worker_pool_specs_override: Optional[Dict[str, Any]] = None, cv_trainer_worker_pool_specs_override: Optional[Dict[str, Any]] = None, export_additional_model_without_custom_ops: bool = False, stats_and_example_gen_dataflow_machine_type: str = 'n1-standard-16', stats_and_example_gen_dataflow_max_num_workers: int = 25, stats_and_example_gen_dataflow_disk_size_gb: int = 40, transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '', additional_experiments: Optional[Dict[str, Any]] = None) Tuple[str, Dict[str, Any]]

Get the AutoML Tabular training pipeline that skips evaluation.

Args:
project:

The GCP project that runs the pipeline components.

location:

The GCP region that runs the pipeline components.

root_dir:

The root GCS directory for the pipeline components.

target_column_name:

The target column name.

prediction_type:

The type of prediction the model is to produce. “classification” or “regression”.

optimization_objective:

For binary classification, “maximize-au-roc”, “minimize-log-loss”, “maximize-au-prc”, “maximize-precision-at-recall”, or “maximize-recall-at-precision”. For multi class classification, “minimize-log-loss”. For regression, “minimize-rmse”, “minimize-mae”, or “minimize-rmsle”.

transformations:

The transformations to apply.

split_spec:

The split spec.

data_source:

The data source.

train_budget_milli_node_hours:

The train budget of creating this model, expressed in milli node hours, i.e. a value of 1,000 in this field means 1 node hour.

stage_1_num_parallel_trials:

Number of parallel trials for stage 1.

stage_2_num_parallel_trials:

Number of parallel trials for stage 2.

stage_2_num_selected_trials:

Number of selected trials for stage 2.

weight_column_name:

The weight column name.

study_spec_override:

The dictionary for overriding study spec. The dictionary should be of format https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/study.proto#L181.

optimization_objective_recall_value:

Required when optimization_objective is “maximize-precision-at-recall”. Must be between 0 and 1, inclusive.

optimization_objective_precision_value:

Required when optimization_objective is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.

stage_1_tuner_worker_pool_specs_override:

The dictionary for overriding the stage 1 tuner worker pool spec. The dictionary should be of format https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.

cv_trainer_worker_pool_specs_override:

The dictionary for overriding the stage CV trainer worker pool spec. The dictionary should be of format https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.

export_additional_model_without_custom_ops:

Whether to export additional model without custom TensorFlow operators.

stats_and_example_gen_dataflow_machine_type:

The dataflow machine type for stats_and_example_gen component.

stats_and_example_gen_dataflow_max_num_workers:

The max number of Dataflow workers for stats_and_example_gen component.

stats_and_example_gen_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for stats_and_example_gen component.

transform_dataflow_machine_type:

The dataflow machine type for transform component.

transform_dataflow_max_num_workers:

The max number of Dataflow workers for transform component.

transform_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for transform component.

dataflow_subnetwork:

Dataflow’s fully qualified subnetwork name, when empty the default subnetwork will be used. Example: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications

dataflow_use_public_ips:

Specifies whether Dataflow workers use public IP addresses.

encryption_spec_key_name:

The KMS key name.

additional_experiments:

Use this field to configure private preview features.

Returns:

Tuple of pipeline_definition_path and parameter_values.

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_tabnet_hyperparameter_tuning_job_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, study_spec_metric_id: str, study_spec_metric_goal: str, study_spec_parameters_override: List[Dict[str, Any]], max_trial_count: int, parallel_trial_count: int, transform_config: Optional[str] = None, dataset_level_custom_transformation_definitions: Optional[List[Dict[str, Any]]] = None, dataset_level_transformations: Optional[List[Dict[str, Any]]] = None, run_feature_selection: bool = False, feature_selection_algorithm: Optional[str] = None, max_selected_features: Optional[int] = None, predefined_split_key: Optional[str] = None, stratified_split_key: Optional[str] = None, training_fraction: Optional[float] = None, validation_fraction: Optional[float] = None, test_fraction: Optional[float] = None, tf_auto_transform_features: Optional[List[str]] = None, tf_custom_transformation_definitions: Optional[List[Dict[str, Any]]] = None, tf_transformations_path: Optional[str] = None, enable_profiler: bool = False, cache_data: str = 'auto', seed: int = 1, eval_steps: int = 0, eval_frequency_secs: int = 600, data_source_csv_filenames: Optional[str] = None, data_source_bigquery_table_path: Optional[str] = None, bigquery_staging_full_dataset_id: Optional[str] = None, weight_column: str = '', max_failed_trial_count: int = 0, study_spec_algorithm: str = 'ALGORITHM_UNSPECIFIED', study_spec_measurement_selection_type: str = 'BEST_MEASUREMENT', transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, worker_pool_specs_override: Optional[Dict[str, Any]] = None, run_evaluation: bool = True, evaluation_batch_predict_machine_type: str = 'n1-standard-16', evaluation_batch_predict_starting_replica_count: int = 25, evaluation_batch_predict_max_replica_count: int = 25, evaluation_dataflow_machine_type: str = 'n1-standard-4', evaluation_dataflow_max_num_workers: int = 25, evaluation_dataflow_disk_size_gb: int = 50, dataflow_service_account: str = '', dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '') Tuple[str, Dict[str, Any]]

Get the TabNet HyperparameterTuningJob pipeline.

Args:

project: The GCP project that runs the pipeline components.

location: The GCP region that runs the pipeline components.

root_dir: The root GCS directory for the pipeline components.

target_column: The target column name.

prediction_type: The type of prediction the model is to produce. “classification” or “regression”.

study_spec_metric_id: Metric to optimize, possible values: [‘loss’, ‘average_loss’, ‘rmse’, ‘mae’, ‘mql’, ‘accuracy’, ‘auc’, ‘precision’, ‘recall’].

study_spec_metric_goal: Optimization goal of the metric, possible values: “MAXIMIZE”, “MINIMIZE”.

study_spec_parameters_override: List of dictionaries representing parameters to optimize. The dictionary key is the parameter_id, which is passed to the training job as a command line argument, and the dictionary value is the parameter specification of the metric.

max_trial_count: The desired total number of trials.

parallel_trial_count: The desired number of trials to run in parallel.

transform_config: Path to v1 TF transformation configuration.

dataset_level_custom_transformation_definitions: Dataset-level custom transformation definitions in string format.

dataset_level_transformations: Dataset-level transformation configuration in string format.

run_feature_selection: Whether to enable feature selection.

feature_selection_algorithm: Feature selection algorithm.

max_selected_features: Maximum number of features to select.

predefined_split_key: Predefined split key.

stratified_split_key: Stratified split key.

training_fraction: Training fraction.

validation_fraction: Validation fraction.

test_fraction: Test fraction.

tf_auto_transform_features: List of auto transform features in the comma-separated string format.

tf_custom_transformation_definitions: TF custom transformation definitions in string format.

tf_transformations_path: Path to TF transformation configuration.

enable_profiler: Enables profiling and saves a trace during evaluation.

cache_data: Whether to cache data or not. If set to ‘auto’, caching is determined based on the dataset size.

seed: Seed to be used for this run.

eval_steps: Number of steps to run evaluation for. If not specified or negative, it means run evaluation on the whole validation dataset. If set to 0, it means run evaluation for a fixed number of samples.

eval_frequency_secs: Frequency at which evaluation and checkpointing will take place.

data_source_csv_filenames: The CSV data source.

data_source_bigquery_table_path: The BigQuery data source.

bigquery_staging_full_dataset_id: The BigQuery staging full dataset id for storing intermediate tables.

weight_column: The weight column name.

max_failed_trial_count: The number of failed trials that need to be seen before failing the HyperparameterTuningJob. If set to 0, Vertex AI decides how many trials must fail before the whole job fails.

study_spec_algorithm: The search algorithm specified for the study. One of “ALGORITHM_UNSPECIFIED”, “GRID_SEARCH”, or “RANDOM_SEARCH”.

study_spec_measurement_selection_type: Which measurement to use if/when the service automatically selects the final measurement from previously reported intermediate measurements. One of “BEST_MEASUREMENT” or “LAST_MEASUREMENT”.

transform_dataflow_machine_type: The dataflow machine type for transform component.

transform_dataflow_max_num_workers: The max number of Dataflow workers for transform component.

transform_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for transform component.

worker_pool_specs_override: The dictionary for overriding training and evaluation worker pool specs. The dictionary should be of format https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.

run_evaluation: Whether to run evaluation steps during training.

evaluation_batch_predict_machine_type: The prediction server machine type for batch predict components during evaluation.

evaluation_batch_predict_starting_replica_count: The initial number of prediction servers for batch predict components during evaluation.

evaluation_batch_predict_max_replica_count: The max number of prediction servers for batch predict components during evaluation.

evaluation_dataflow_machine_type: The dataflow machine type for evaluation components.

evaluation_dataflow_max_num_workers: The max number of Dataflow workers for evaluation components.

evaluation_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for evaluation components.

dataflow_service_account: Custom service account to run dataflow jobs.

dataflow_subnetwork: Dataflow’s fully qualified subnetwork name, when empty the default subnetwork will be used. Example: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications

dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.

encryption_spec_key_name: The KMS key name.

Returns:

Tuple of pipeline_definition_path and parameter_values.

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_tabnet_study_spec_parameters_override(dataset_size_bucket: str, prediction_type: str, training_budget_bucket: str) List[Dict[str, Any]]

Get study_spec_parameters_override for a TabNet hyperparameter tuning job.

Args:
dataset_size_bucket:

Size of the dataset. One of “small” (< 1M rows), “medium” (1M - 100M rows), or “large” (> 100M rows).

prediction_type:

The type of prediction the model is to produce. “classification” or “regression”.

training_budget_bucket:

Bucket of the estimated training budget. One of “small” (< $600), “medium” ($600 - $2400), or “large” (> $2400). This parameter is only used as a hint for the hyperparameter search space, unrelated to the real cost.

Returns:

List of study_spec_parameters_override.
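
For example, a short sketch retrieving the default TabNet search space for a medium-sized classification dataset with a medium training budget (the bucket strings follow the argument descriptions above); the returned list can then be passed as a study_spec_parameters_override argument:

from google_cloud_pipeline_components.experimental.automl.tabular import utils

# Default TabNet search space for a medium dataset, classification task,
# and a medium training budget (values per the buckets documented above).
tabnet_search_space = utils.get_tabnet_study_spec_parameters_override(
    dataset_size_bucket="medium",
    prediction_type="classification",
    training_budget_bucket="medium",
)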

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_tabnet_trainer_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, learning_rate: float, transform_config: Optional[str] = None, dataset_level_custom_transformation_definitions: Optional[List[Dict[str, Any]]] = None, dataset_level_transformations: Optional[List[Dict[str, Any]]] = None, run_feature_selection: bool = False, feature_selection_algorithm: Optional[str] = None, max_selected_features: Optional[int] = None, predefined_split_key: Optional[str] = None, stratified_split_key: Optional[str] = None, training_fraction: Optional[float] = None, validation_fraction: Optional[float] = None, test_fraction: Optional[float] = None, tf_auto_transform_features: Optional[List[str]] = None, tf_custom_transformation_definitions: Optional[List[Dict[str, Any]]] = None, tf_transformations_path: Optional[str] = None, max_steps: int = -1, max_train_secs: int = -1, large_category_dim: int = 1, large_category_thresh: int = 300, yeo_johnson_transform: bool = True, feature_dim: int = 64, feature_dim_ratio: float = 0.5, num_decision_steps: int = 6, relaxation_factor: float = 1.5, decay_every: float = 100, decay_rate: float = 0.95, gradient_thresh: float = 2000, sparsity_loss_weight: float = 1e-05, batch_momentum: float = 0.95, batch_size_ratio: float = 0.25, num_transformer_layers: int = 4, num_transformer_layers_ratio: float = 0.25, class_weight: float = 1.0, loss_function_type: str = 'default', alpha_focal_loss: float = 0.25, gamma_focal_loss: float = 2.0, enable_profiler: bool = False, cache_data: str = 'auto', seed: int = 1, eval_steps: int = 0, batch_size: int = 100, measurement_selection_type: Optional[str] = None, optimization_metric: Optional[str] = None, eval_frequency_secs: int = 600, data_source_csv_filenames: Optional[str] = None, data_source_bigquery_table_path: Optional[str] = None, bigquery_staging_full_dataset_id: Optional[str] = None, weight_column: str = '', transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, worker_pool_specs_override: Optional[Dict[str, Any]] = None, run_evaluation: bool = True, evaluation_batch_predict_machine_type: str = 'n1-standard-16', evaluation_batch_predict_starting_replica_count: int = 25, evaluation_batch_predict_max_replica_count: int = 25, evaluation_dataflow_machine_type: str = 'n1-standard-4', evaluation_dataflow_max_num_workers: int = 25, evaluation_dataflow_disk_size_gb: int = 50, dataflow_service_account: str = '', dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '') Tuple[str, Dict[str, Any]]

Get the TabNet training pipeline.

Args:
project:

The GCP project that runs the pipeline components.

location:

The GCP region that runs the pipeline components.

root_dir:

The root GCS directory for the pipeline components.

target_column:

The target column name.

prediction_type:

The type of prediction the model is to produce. “classification” or “regression”.

learning_rate:

The learning rate used by the linear optimizer.

transform_config:

Path to v1 TF transformation configuration.

dataset_level_custom_transformation_definitions:

Dataset-level custom transformation definitions in string format.

dataset_level_transformations:

Dataset-level transformation configuration in string format.

run_feature_selection: Whether to enable feature selection.

feature_selection_algorithm: Feature selection algorithm.

max_selected_features: Maximum number of features to select.

predefined_split_key: Predefined split key.

stratified_split_key:

Stratified split key.

training_fraction:

Training fraction.

validation_fraction:

Validation fraction.

test_fraction:

Test fraction.

tf_auto_transform_features:

List of auto transform features in the comma-separated string format.

tf_custom_transformation_definitions:

TF custom transformation definitions in string format.

tf_transformations_path:

Path to TF transformation configuration.

max_steps:

Number of steps to run the trainer for.

max_train_secs:

Amount of time in seconds to run the trainer for.

large_category_dim:

Embedding dimension for categorical feature with large number of categories.

large_category_thresh:

Threshold for number of categories to apply large_category_dim embedding dimension to.

yeo_johnson_transform:

Enables trainable Yeo-Johnson power transform.

feature_dim:

Dimensionality of the hidden representation in feature transformation block.

feature_dim_ratio:

The ratio of output dimension (dimensionality of the outputs of each decision step) to feature dimension.

num_decision_steps:

Number of sequential decision steps.

relaxation_factor:

Relaxation factor that promotes the reuse of each feature at different decision steps. When it is 1, a feature is enforced to be used only at one decision step and as it increases, more flexibility is provided to use a feature at multiple decision steps.

decay_every:

Number of iterations for periodically applying learning rate decaying.

decay_rate:

Learning rate decaying.

gradient_thresh:

Threshold for the norm of gradients for clipping.

sparsity_loss_weight:

Weight of the loss for sparsity regularization (increasing it will yield more sparse feature selection).

batch_momentum:

Momentum in ghost batch normalization.

batch_size_ratio:

The ratio of virtual batch size (size of the ghost batch normalization) to batch size.

num_transformer_layers:

The number of transformer layers for each decision step.

num_transformer_layers_ratio:

The ratio of shared transformer layer to transformer layers.

class_weight:

The class weight is used to compute a weighted cross entropy, which is helpful for classifying imbalanced datasets. Only used for classification.

loss_function_type:

Loss function type. Loss function in classification: [cross_entropy, weighted_cross_entropy, focal_loss], default is cross_entropy. Loss function in regression: [rmse, mae, mse], default is mse.

alpha_focal_loss:

Alpha value (balancing factor) in focal_loss function. Only used for classification.

gamma_focal_loss:

Gamma value (modulating factor) in the focal_loss function. Only used for classification.

enable_profiler:

Enables profiling and saves a trace during evaluation.

cache_data:

Whether to cache data or not. If set to ‘auto’, caching is determined based on the dataset size.

seed:

Seed to be used for this run.

eval_steps:

Number of steps to run evaluation for. If not specified or negative, it means run evaluation on the whole validation dataset. If set to 0, it means run evaluation for a fixed number of samples.

batch_size:

Batch size for training.

measurement_selection_type:

Which measurement to use if/when the service automatically selects the final measurement from previously reported intermediate measurements. One of “BEST_MEASUREMENT” or “LAST_MEASUREMENT”.

optimization_metric:

Optimization metric used for measurement_selection_type. Default is “rmse” for regression and “auc” for classification.

eval_frequency_secs:

Frequency at which evaluation and checkpointing will take place.

data_source_csv_filenames:

The CSV data source.

data_source_bigquery_table_path:

The BigQuery data source.

bigquery_staging_full_dataset_id:

The BigQuery staging full dataset id for storing intermediate tables.

weight_column:

The weight column name.

transform_dataflow_machine_type:

The dataflow machine type for transform component.

transform_dataflow_max_num_workers:

The max number of Dataflow workers for transform component.

transform_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for transform component.

worker_pool_specs_override:

The dictionary for overriding training and evaluation worker pool specs. The dictionary should be of format https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.

run_evaluation:

Whether to run evaluation steps during training.

evaluation_batch_predict_machine_type:

The prediction server machine type for batch predict components during evaluation.

evaluation_batch_predict_starting_replica_count:

The initial number of prediction server for batch predict components during evaluation.

evaluation_batch_predict_max_replica_count:

The max number of prediction server for batch predict components during evaluation.

evaluation_dataflow_machine_type:

The dataflow machine type for evaluation components.

evaluation_dataflow_max_num_workers:

The max number of Dataflow workers for evaluation components.

evaluation_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for evaluation components.

dataflow_service_account:

Custom service account to run dataflow jobs.

dataflow_subnetwork:

Dataflow’s fully qualified subnetwork name; when empty, the default subnetwork will be used. Example: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications

dataflow_use_public_ips:

Specifies whether Dataflow workers use public IP addresses.

encryption_spec_key_name:

The KMS key name.

Returns:

Tuple of pipeline_definition_path and parameter_values.
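
A minimal sketch of calling this helper, assuming placeholder project, bucket, and BigQuery table values and that the google-cloud-aiplatform SDK is available to submit the compiled pipeline; all helper arguments come from the signature above:

from google.cloud import aiplatform
from google_cloud_pipeline_components.experimental.automl.tabular import utils

# Placeholder project, bucket, and table values.
template_path, parameter_values = utils.get_tabnet_trainer_pipeline_and_parameters(
    project="my-project",
    location="us-central1",
    root_dir="gs://my-bucket/pipeline_root",
    target_column="label",
    prediction_type="classification",
    learning_rate=0.01,
    data_source_bigquery_table_path="bq://my-project.my_dataset.training_data",
    tf_auto_transform_features=["feature_1", "feature_2"],
    training_fraction=0.8,
    validation_fraction=0.1,
    test_fraction=0.1,
)

# Submit the compiled pipeline definition with the generated parameter values.
aiplatform.init(project="my-project", location="us-central1")
aiplatform.PipelineJob(
    display_name="tabnet-trainer",
    template_path=template_path,
    pipeline_root="gs://my-bucket/pipeline_root",
    parameter_values=parameter_values,
).run()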

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_wide_and_deep_hyperparameter_tuning_job_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, study_spec_metric_id: str, study_spec_metric_goal: str, study_spec_parameters_override: List[Dict[str, Any]], max_trial_count: int, parallel_trial_count: int, transform_config: Optional[str] = None, dataset_level_custom_transformation_definitions: Optional[List[Dict[str, Any]]] = None, dataset_level_transformations: Optional[List[Dict[str, Any]]] = None, run_feature_selection: bool = False, feature_selection_algorithm: Optional[str] = None, max_selected_features: Optional[int] = None, predefined_split_key: Optional[str] = None, stratified_split_key: Optional[str] = None, training_fraction: Optional[float] = None, validation_fraction: Optional[float] = None, test_fraction: Optional[float] = None, tf_auto_transform_features: Optional[List[str]] = None, tf_custom_transformation_definitions: Optional[List[Dict[str, Any]]] = None, tf_transformations_path: Optional[str] = None, enable_profiler: bool = False, cache_data: str = 'auto', seed: int = 1, eval_steps: int = 0, eval_frequency_secs: int = 600, data_source_csv_filenames: Optional[str] = None, data_source_bigquery_table_path: Optional[str] = None, bigquery_staging_full_dataset_id: Optional[str] = None, weight_column: str = '', max_failed_trial_count: int = 0, study_spec_algorithm: str = 'ALGORITHM_UNSPECIFIED', study_spec_measurement_selection_type: str = 'BEST_MEASUREMENT', transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, worker_pool_specs_override: Optional[Dict[str, Any]] = None, run_evaluation: bool = True, evaluation_batch_predict_machine_type: str = 'n1-standard-16', evaluation_batch_predict_starting_replica_count: int = 25, evaluation_batch_predict_max_replica_count: int = 25, evaluation_dataflow_machine_type: str = 'n1-standard-4', evaluation_dataflow_max_num_workers: int = 25, evaluation_dataflow_disk_size_gb: int = 50, dataflow_service_account: str = '', dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '') Tuple[str, Dict[str, Any]]

Get the Wide & Deep algorithm HyperparameterTuningJob pipeline.

Args:

project: The GCP project that runs the pipeline components.

location: The GCP region that runs the pipeline components.

root_dir: The root GCS directory for the pipeline components.

target_column: The target column name.

prediction_type: The type of prediction the model is to produce. “classification” or “regression”.

study_spec_metric_id: Metric to optimize, possible values: [‘loss’, ‘average_loss’, ‘rmse’, ‘mae’, ‘mql’, ‘accuracy’, ‘auc’, ‘precision’, ‘recall’].

study_spec_metric_goal: Optimization goal of the metric, possible values: “MAXIMIZE”, “MINIMIZE”.

study_spec_parameters_override: List of dictionaries representing parameters to optimize. The dictionary key is the parameter_id, which is passed to training job as a command line argument, and the dictionary value is the parameter specification of the metric.

max_trial_count: The desired total number of trials.

parallel_trial_count: The desired number of trials to run in parallel.

transform_config: Path to v1 TF transformation configuration.

dataset_level_custom_transformation_definitions:

Dataset-level custom transformation definitions in string format.

dataset_level_transformations:

Dataset-level transformation configuration in string format.

run_feature_selection: Whether to enable feature selection.

feature_selection_algorithm: Feature selection algorithm.

max_selected_features: Maximum number of features to select.

predefined_split_key: Predefined split key.

stratified_split_key:

Stratified split key.

training_fraction:

Training fraction.

validation_fraction:

Validation fraction.

test_fraction:

Test fraction.

tf_auto_transform_features:

List of auto transform features in the comma-separated string format.

tf_custom_transformation_definitions:

TF custom transformation definitions in string format.

tf_transformations_path:

Path to TF transformation configuration.

enable_profiler: Enables profiling and saves a trace during evaluation.

cache_data: Whether to cache data or not. If set to ‘auto’, caching is determined based on the dataset size.

seed: Seed to be used for this run.

eval_steps: Number of steps to run evaluation for. If not specified or negative, it means run evaluation on the whole validation dataset. If set to 0, it means run evaluation for a fixed number of samples.

eval_frequency_secs: Frequency at which evaluation and checkpointing will take place.

data_source_csv_filenames: The CSV data source.

data_source_bigquery_table_path: The BigQuery data source.

bigquery_staging_full_dataset_id: The BigQuery staging full dataset id for storing intermediate tables.

weight_column: The weight column name.

max_failed_trial_count: The number of failed trials that need to be seen before failing the HyperparameterTuningJob. If set to 0, Vertex AI decides how many trials must fail before the whole job fails.

study_spec_algorithm: The search algorithm specified for the study. One of “ALGORITHM_UNSPECIFIED”, “GRID_SEARCH”, or “RANDOM_SEARCH”.

study_spec_measurement_selection_type: Which measurement to use if/when the service automatically selects the final measurement from previously reported intermediate measurements. One of “BEST_MEASUREMENT” or “LAST_MEASUREMENT”.

transform_dataflow_machine_type: The dataflow machine type for transform component.

transform_dataflow_max_num_workers: The max number of Dataflow workers for transform component.

transform_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for transform component.

worker_pool_specs_override: The dictionary for overriding training and evaluation worker pool specs. The dictionary should be of format https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.

run_evaluation: Whether to run evaluation steps during training.

evaluation_batch_predict_machine_type: The prediction server machine type for batch predict components during evaluation.

evaluation_batch_predict_starting_replica_count: The initial number of prediction servers for batch predict components during evaluation.

evaluation_batch_predict_max_replica_count: The max number of prediction servers for batch predict components during evaluation.

evaluation_dataflow_machine_type: The dataflow machine type for evaluation components.

evaluation_dataflow_max_num_workers: The max number of Dataflow workers for evaluation components.

evaluation_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for evaluation components.

dataflow_service_account: Custom service account to run dataflow jobs.

dataflow_subnetwork: Dataflow’s fully qualified subnetwork name; when empty, the default subnetwork will be used.

dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.

encryption_spec_key_name: The KMS key name.

Returns:

Tuple of pipeline_definition_path and parameter_values.
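
A minimal sketch of assembling this tuning pipeline, assuming placeholder project, bucket, and data source values; the default Wide & Deep search space comes from get_wide_and_deep_study_spec_parameters_override, documented below:

from google_cloud_pipeline_components.experimental.automl.tabular import utils

# Placeholder project, bucket, and data source values.
template_path, parameter_values = (
    utils.get_wide_and_deep_hyperparameter_tuning_job_pipeline_and_parameters(
        project="my-project",
        location="us-central1",
        root_dir="gs://my-bucket/pipeline_root",
        target_column="label",
        prediction_type="classification",
        study_spec_metric_id="auc",
        study_spec_metric_goal="MAXIMIZE",
        study_spec_parameters_override=(
            utils.get_wide_and_deep_study_spec_parameters_override()
        ),
        max_trial_count=10,
        parallel_trial_count=5,
        data_source_bigquery_table_path="bq://my-project.my_dataset.training_data",
    )
)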

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_wide_and_deep_study_spec_parameters_override() List[Dict[str, Any]]

Get study_spec_parameters_override for a Wide & Deep hyperparameter tuning job.

Returns:

List of study_spec_parameters_override.

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_wide_and_deep_trainer_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, learning_rate: float, dnn_learning_rate: float, transform_config: Optional[str] = None, dataset_level_custom_transformation_definitions: Optional[List[Dict[str, Any]]] = None, dataset_level_transformations: Optional[List[Dict[str, Any]]] = None, run_feature_selection: bool = False, feature_selection_algorithm: Optional[str] = None, max_selected_features: Optional[int] = None, predefined_split_key: Optional[str] = None, stratified_split_key: Optional[str] = None, training_fraction: Optional[float] = None, validation_fraction: Optional[float] = None, test_fraction: Optional[float] = None, tf_auto_transform_features: Optional[List[str]] = None, tf_custom_transformation_definitions: Optional[List[Dict[str, Any]]] = None, tf_transformations_path: Optional[str] = None, optimizer_type: str = 'adam', max_steps: int = -1, max_train_secs: int = -1, l1_regularization_strength: float = 0, l2_regularization_strength: float = 0, l2_shrinkage_regularization_strength: float = 0, beta_1: float = 0.9, beta_2: float = 0.999, hidden_units: str = '30,30,30', use_wide: bool = True, embed_categories: bool = True, dnn_dropout: float = 0, dnn_optimizer_type: str = 'adam', dnn_l1_regularization_strength: float = 0, dnn_l2_regularization_strength: float = 0, dnn_l2_shrinkage_regularization_strength: float = 0, dnn_beta_1: float = 0.9, dnn_beta_2: float = 0.999, enable_profiler: bool = False, cache_data: str = 'auto', seed: int = 1, eval_steps: int = 0, batch_size: int = 100, measurement_selection_type: Optional[str] = None, optimization_metric: Optional[str] = None, eval_frequency_secs: int = 600, data_source_csv_filenames: Optional[str] = None, data_source_bigquery_table_path: Optional[str] = None, bigquery_staging_full_dataset_id: Optional[str] = None, weight_column: str = '', transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, worker_pool_specs_override: Optional[Dict[str, Any]] = None, run_evaluation: bool = True, evaluation_batch_predict_machine_type: str = 'n1-standard-16', evaluation_batch_predict_starting_replica_count: int = 25, evaluation_batch_predict_max_replica_count: int = 25, evaluation_dataflow_machine_type: str = 'n1-standard-4', evaluation_dataflow_max_num_workers: int = 25, evaluation_dataflow_disk_size_gb: int = 50, dataflow_service_account: str = '', dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '') Tuple[str, Dict[str, Any]]

Get the Wide & Deep training pipeline.

Args:
project:

The GCP project that runs the pipeline components.

location:

The GCP region that runs the pipeline components.

root_dir:

The root GCS directory for the pipeline components.

target_column:

The target column name.

prediction_type:

The type of prediction the model is to produce. ‘classification’ or ‘regression’.

learning_rate:

The learning rate used by the linear optimizer.

dnn_learning_rate:

The learning rate for training the deep part of the model.

transform_config:

Path to v1 TF transformation configuration.

dataset_level_custom_transformation_definitions:

Dataset-level custom transformation definitions in string format.

dataset_level_transformations:

Dataset-level transformation configuration in string format.

run_feature_selection: Whether to enable feature selection.

feature_selection_algorithm: Feature selection algorithm.

max_selected_features: Maximum number of features to select.

predefined_split_key: Predefined split key.

stratified_split_key:

Stratified split key.

training_fraction:

Training fraction.

validation_fraction:

Validation fraction.

test_fraction:

Test fraction.

tf_auto_transform_features:

List of auto transform features in the comma-separated string format.

tf_custom_transformation_definitions:

TF custom transformation definitions in string format.

tf_transformations_path:

Path to TF transformation configuration.

optimizer_type:

The type of optimizer to use. Choices are “adam”, “ftrl” and “sgd” for the Adam, FTRL, and Gradient Descent Optimizers, respectively.

max_steps:

Number of steps to run the trainer for.

max_train_secs:

Amount of time in seconds to run the trainer for.

l1_regularization_strength:

L1 regularization strength for optimizer_type=”ftrl”.

l2_regularization_strength:

L2 regularization strength for optimizer_type=”ftrl”.

l2_shrinkage_regularization_strength:

L2 shrinkage regularization strength for optimizer_type=”ftrl”.

beta_1:

Beta 1 value for optimizer_type=”adam”.

beta_2:

Beta 2 value for optimizer_type=”adam”.

hidden_units:

Hidden layer sizes to use for DNN feature columns, provided in comma-separated layers.

use_wide:

If set to true, the categorical columns will be used in the wide part of the DNN model.

embed_categories:

If set to true, the categorical columns will be embedded and used in the deep part of the model. Embedding size is the square root of the column cardinality.

dnn_dropout:

The probability we will drop out a given coordinate.

dnn_optimizer_type:

The type of optimizer to use for the deep part of the model. Choices are “adam”, “ftrl” and “sgd” for the Adam, FTRL, and Gradient Descent Optimizers, respectively.

dnn_l1_regularization_strength:

L1 regularization strength for dnn_optimizer_type=”ftrl”.

dnn_l2_regularization_strength:

L2 regularization strength for dnn_optimizer_type=”ftrl”.

dnn_l2_shrinkage_regularization_strength:

L2 shrinkage regularization strength for dnn_optimizer_type=”ftrl”.

dnn_beta_1:

Beta 1 value for dnn_optimizer_type=”adam”.

dnn_beta_2:

Beta 2 value for dnn_optimizer_type=”adam”.

enable_profiler:

Enables profiling and saves a trace during evaluation.

cache_data:

Whether to cache data or not. If set to ‘auto’, caching is determined based on the dataset size.

seed:

Seed to be used for this run.

eval_steps:

Number of steps to run evaluation for. If not specified or negative, it means run evaluation on the whole validation dataset. If set to 0, it means run evaluation for a fixed number of samples.

batch_size:

Batch size for training.

measurement_selection_type:

Which measurement to use if/when the service automatically selects the final measurement from previously reported intermediate measurements. One of “BEST_MEASUREMENT” or “LAST_MEASUREMENT”.

optimization_metric:

Optimization metric used for measurement_selection_type. Default is “rmse” for regression and “auc” for classification.

eval_frequency_secs:

Frequency at which evaluation and checkpointing will take place.

data_source_csv_filenames:

The CSV data source.

data_source_bigquery_table_path:

The BigQuery data source.

bigquery_staging_full_dataset_id:

The BigQuery staging full dataset id for storing intermediate tables.

weight_column:

The weight column name.

transform_dataflow_machine_type:

The dataflow machine type for transform component.

transform_dataflow_max_num_workers:

The max number of Dataflow workers for transform component.

transform_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for transform component.

worker_pool_specs_override:
The dictionary for overriding training and
evaluation worker pool specs. The dictionary should be of format

https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/custom_job.proto#L172.

run_evaluation:

Whether to run evaluation steps during training.

evaluation_batch_predict_machine_type:

The prediction server machine type for batch predict components during evaluation.

evaluation_batch_predict_starting_replica_count:

The initial number of prediction server for batch predict components during evaluation.

evaluation_batch_predict_max_replica_count:

The max number of prediction server for batch predict components during evaluation.

evaluation_dataflow_machine_type:

The dataflow machine type for evaluation components.

evaluation_dataflow_max_num_workers:

The max number of Dataflow workers for evaluation components.

evaluation_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for evaluation components.

dataflow_service_account:

Custom service account to run dataflow jobs.

dataflow_subnetwork:

Dataflow’s fully qualified subnetwork name; when empty, the default subnetwork will be used. Example: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications

dataflow_use_public_ips:

Specifies whether Dataflow workers use public IP addresses.

encryption_spec_key_name:

The KMS key name.

Returns:

Tuple of pipeline_definition_path and parameter_values.
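
A minimal sketch of calling this helper with placeholder project, bucket, and CSV values; arguments are taken from the signature above:

from google_cloud_pipeline_components.experimental.automl.tabular import utils

# Placeholder project, bucket, and CSV paths.
template_path, parameter_values = utils.get_wide_and_deep_trainer_pipeline_and_parameters(
    project="my-project",
    location="us-central1",
    root_dir="gs://my-bucket/pipeline_root",
    target_column="label",
    prediction_type="regression",
    learning_rate=0.01,
    dnn_learning_rate=0.001,
    data_source_csv_filenames="gs://my-bucket/data/train.csv",
    tf_auto_transform_features=["feature_1", "feature_2"],
)

The returned tuple can be submitted with aiplatform.PipelineJob as in the TabNet example above.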

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_xgboost_hyperparameter_tuning_job_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, objective: str, study_spec_metric_id: str, study_spec_metric_goal: str, max_trial_count: int, parallel_trial_count: int, study_spec_parameters_override: Optional[List[Dict[str, Any]]] = None, eval_metric: Optional[str] = None, disable_default_eval_metric: Optional[int] = None, seed: Optional[int] = None, seed_per_iteration: Optional[bool] = None, dataset_level_custom_transformation_definitions: Optional[List[Dict[str, Any]]] = None, dataset_level_transformations: Optional[List[Dict[str, Any]]] = None, run_feature_selection: Optional[bool] = None, feature_selection_algorithm: Optional[str] = None, max_selected_features: Optional[int] = None, predefined_split_key: Optional[str] = None, stratified_split_key: Optional[str] = None, training_fraction: Optional[float] = None, validation_fraction: Optional[float] = None, test_fraction: Optional[float] = None, tf_auto_transform_features: Optional[List[str]] = None, tf_custom_transformation_definitions: Optional[List[Dict[str, Any]]] = None, tf_transformations_path: Optional[str] = None, data_source_csv_filenames: Optional[str] = None, data_source_bigquery_table_path: Optional[str] = None, bigquery_staging_full_dataset_id: Optional[str] = None, weight_column: Optional[str] = None, max_failed_trial_count: Optional[int] = None, training_machine_type: Optional[str] = None, training_total_replica_count: Optional[int] = None, training_accelerator_type: Optional[str] = None, training_accelerator_count: Optional[int] = None, study_spec_algorithm: Optional[str] = None, study_spec_measurement_selection_type: Optional[str] = None, transform_dataflow_machine_type: Optional[str] = None, transform_dataflow_max_num_workers: Optional[int] = None, transform_dataflow_disk_size_gb: Optional[int] = None, run_evaluation: Optional[bool] = None, evaluation_batch_predict_machine_type: Optional[str] = None, evaluation_batch_predict_starting_replica_count: Optional[int] = None, evaluation_batch_predict_max_replica_count: Optional[int] = None, evaluation_dataflow_machine_type: Optional[str] = None, evaluation_dataflow_max_num_workers: Optional[int] = None, evaluation_dataflow_disk_size_gb: Optional[int] = None, dataflow_service_account: Optional[str] = None, dataflow_subnetwork: Optional[str] = None, dataflow_use_public_ips: Optional[bool] = None, encryption_spec_key_name: Optional[str] = None)

Get the XGBoost HyperparameterTuningJob pipeline.

Args:

project: The GCP project that runs the pipeline components.

location: The GCP region that runs the pipeline components.

root_dir: The root GCS directory for the pipeline components.

target_column: The target column name.

objective: Specifies the learning task and the learning objective. Must be one of [reg:squarederror, reg:squaredlogerror, reg:logistic, reg:gamma, reg:tweedie, reg:pseudohubererror, binary:logistic, multi:softprob].

study_spec_metric_id: Metric to optimize. For options, please look under ‘eval_metric’ at https://xgboost.readthedocs.io/en/stable/parameter.html#learning-task-parameters.

study_spec_metric_goal: Optimization goal of the metric, possible values: “MAXIMIZE”, “MINIMIZE”.

max_trial_count: The desired total number of trials.

parallel_trial_count: The desired number of trials to run in parallel.

study_spec_parameters_override: List of dictionaries representing parameters to optimize. The dictionary key is the parameter_id, which is passed to the training job as a command line argument, and the dictionary value is the parameter specification of the metric.

eval_metric: Evaluation metrics for validation data represented as a comma-separated string.

disable_default_eval_metric: Flag to disable default metric. Set to >0 to disable. Defaults to 0.

seed: Random seed.

seed_per_iteration: Seed the PRNG deterministically via the iteration number.

dataset_level_custom_transformation_definitions: Dataset-level custom transformation definitions in string format.

dataset_level_transformations: Dataset-level transformation configuration in string format.

run_feature_selection: Whether to enable feature selection.

feature_selection_algorithm: Feature selection algorithm.

max_selected_features: Maximum number of features to select.

predefined_split_key: Predefined split key.

stratified_split_key: Stratified split key.

training_fraction: Training fraction.

validation_fraction: Validation fraction.

test_fraction: Test fraction.

tf_auto_transform_features: List of auto transform features in the comma-separated string format.

tf_custom_transformation_definitions: TF custom transformation definitions in string format.

tf_transformations_path: Path to TF transformation configuration.

data_source_csv_filenames: The CSV data source.

data_source_bigquery_table_path: The BigQuery data source.

bigquery_staging_full_dataset_id: The BigQuery staging full dataset id for storing intermediate tables.

weight_column: The weight column name.

max_failed_trial_count: The number of failed trials that need to be seen before failing the HyperparameterTuningJob. If set to 0, Vertex AI decides how many trials must fail before the whole job fails.

training_machine_type: Machine type.

training_total_replica_count: Number of workers.

training_accelerator_type: Accelerator type.

training_accelerator_count: Accelerator count.

study_spec_algorithm: The search algorithm specified for the study. One of ‘ALGORITHM_UNSPECIFIED’, ‘GRID_SEARCH’, or ‘RANDOM_SEARCH’.

study_spec_measurement_selection_type: Which measurement to use if/when the service automatically selects the final measurement from previously reported intermediate measurements. One of “BEST_MEASUREMENT” or “LAST_MEASUREMENT”.

transform_dataflow_machine_type: The dataflow machine type for transform component.

transform_dataflow_max_num_workers: The max number of Dataflow workers for transform component.

transform_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for transform component.

run_evaluation: Whether to run evaluation steps during training.

evaluation_batch_predict_machine_type: The prediction server machine type for batch predict components during evaluation.

evaluation_batch_predict_starting_replica_count: The initial number of prediction servers for batch predict components during evaluation.

evaluation_batch_predict_max_replica_count: The max number of prediction servers for batch predict components during evaluation.

evaluation_dataflow_machine_type: The dataflow machine type for evaluation components.

evaluation_dataflow_max_num_workers: The max number of Dataflow workers for evaluation components.

evaluation_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for evaluation components.

dataflow_service_account: Custom service account to run dataflow jobs.

dataflow_subnetwork: Dataflow’s fully qualified subnetwork name; when empty, the default subnetwork will be used.

dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.

encryption_spec_key_name: The KMS key name.

Returns:

Tuple of pipeline_definition_path and parameter_values.
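
A minimal sketch, assuming placeholder project, bucket, and data source values; arguments are taken from the signature above:

from google_cloud_pipeline_components.experimental.automl.tabular import utils

# Placeholder project, bucket, and data source values.
template_path, parameter_values = (
    utils.get_xgboost_hyperparameter_tuning_job_pipeline_and_parameters(
        project="my-project",
        location="us-central1",
        root_dir="gs://my-bucket/pipeline_root",
        target_column="label",
        objective="binary:logistic",
        study_spec_metric_id="auc",
        study_spec_metric_goal="MAXIMIZE",
        max_trial_count=10,
        parallel_trial_count=5,
        data_source_csv_filenames="gs://my-bucket/data/train.csv",
    )
)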

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_xgboost_study_spec_parameters_override() List[Dict[str, Any]]

Get study_spec_parameters_override for an XGBoost hyperparameter tuning job.

Returns:

List of study_spec_parameters_override.
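
For example, the default search space can be retrieved and passed to the tuning pipeline helper above through its study_spec_parameters_override argument:

from google_cloud_pipeline_components.experimental.automl.tabular import utils

# Default XGBoost search space; pass it to the hyperparameter tuning job
# pipeline helper above via study_spec_parameters_override.
xgboost_search_space = utils.get_xgboost_study_spec_parameters_override()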

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_xgboost_trainer_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, objective: str, eval_metric: Optional[str] = None, num_boost_round: Optional[int] = None, early_stopping_rounds: Optional[int] = None, base_score: Optional[float] = None, disable_default_eval_metric: Optional[int] = None, seed: Optional[int] = None, seed_per_iteration: Optional[bool] = None, booster: Optional[str] = None, eta: Optional[float] = None, gamma: Optional[float] = None, max_depth: Optional[int] = None, min_child_weight: Optional[float] = None, max_delta_step: Optional[float] = None, subsample: Optional[float] = None, colsample_bytree: Optional[float] = None, colsample_bylevel: Optional[float] = None, colsample_bynode: Optional[float] = None, reg_lambda: Optional[float] = None, reg_alpha: Optional[float] = None, tree_method: Optional[str] = None, scale_pos_weight: Optional[float] = None, updater: Optional[str] = None, refresh_leaf: Optional[int] = None, process_type: Optional[str] = None, grow_policy: Optional[str] = None, sampling_method: Optional[str] = None, monotone_constraints: Optional[str] = None, interaction_constraints: Optional[str] = None, sample_type: Optional[str] = None, normalize_type: Optional[str] = None, rate_drop: Optional[float] = None, one_drop: Optional[int] = None, skip_drop: Optional[float] = None, num_parallel_tree: Optional[int] = None, feature_selector: Optional[str] = None, top_k: Optional[int] = None, max_cat_to_onehot: Optional[int] = None, max_leaves: Optional[int] = None, max_bin: Optional[int] = None, tweedie_variance_power: Optional[float] = None, huber_slope: Optional[float] = None, dataset_level_custom_transformation_definitions: Optional[List[Dict[str, Any]]] = None, dataset_level_transformations: Optional[List[Dict[str, Any]]] = None, run_feature_selection: Optional[bool] = None, feature_selection_algorithm: Optional[str] = None, max_selected_features: Optional[int] = None, predefined_split_key: Optional[str] = None, stratified_split_key: Optional[str] = None, training_fraction: Optional[float] = None, validation_fraction: Optional[float] = None, test_fraction: Optional[float] = None, tf_auto_transform_features: Optional[List[str]] = None, tf_custom_transformation_definitions: Optional[List[Dict[str, Any]]] = None, tf_transformations_path: Optional[str] = None, data_source_csv_filenames: Optional[str] = None, data_source_bigquery_table_path: Optional[str] = None, bigquery_staging_full_dataset_id: Optional[str] = None, weight_column: Optional[str] = None, training_machine_type: Optional[str] = None, training_total_replica_count: Optional[int] = None, training_accelerator_type: Optional[str] = None, training_accelerator_count: Optional[int] = None, transform_dataflow_machine_type: Optional[str] = None, transform_dataflow_max_num_workers: Optional[int] = None, transform_dataflow_disk_size_gb: Optional[int] = None, run_evaluation: Optional[bool] = None, evaluation_batch_predict_machine_type: Optional[str] = None, evaluation_batch_predict_starting_replica_count: Optional[int] = None, evaluation_batch_predict_max_replica_count: Optional[int] = None, evaluation_dataflow_machine_type: Optional[str] = None, evaluation_dataflow_max_num_workers: Optional[int] = None, evaluation_dataflow_disk_size_gb: Optional[int] = None, dataflow_service_account: Optional[str] = None, dataflow_subnetwork: Optional[str] = None, dataflow_use_public_ips: Optional[bool] = None, 
encryption_spec_key_name: Optional[str] = None)

Get the XGBoost training pipeline.

Args:

project: The GCP project that runs the pipeline components.

location: The GCP region that runs the pipeline components.

root_dir: The root GCS directory for the pipeline components.

target_column: The target column name.

objective: Specifies the learning task and the learning objective. Must be one of [reg:squarederror, reg:squaredlogerror, reg:logistic, reg:gamma, reg:tweedie, reg:pseudohubererror, binary:logistic, multi:softprob].

eval_metric: Evaluation metrics for validation data represented as a comma-separated string.

num_boost_round: Number of boosting iterations.

early_stopping_rounds: Activates early stopping. Validation error needs to decrease at least every early_stopping_rounds round(s) to continue training.

base_score: The initial prediction score of all instances, global bias.

disable_default_eval_metric: Flag to disable default metric. Set to >0 to disable. Defaults to 0.

seed: Random seed.

seed_per_iteration: Seed the PRNG deterministically via the iteration number.

booster: Which booster to use, can be gbtree, gblinear or dart. gbtree and dart use a tree-based model while gblinear uses a linear function.

eta: Learning rate.

gamma: Minimum loss reduction required to make a further partition on a leaf node of the tree.

max_depth: Maximum depth of a tree.

min_child_weight: Minimum sum of instance weight (hessian) needed in a child.

max_delta_step: Maximum delta step we allow each tree’s weight estimation to be.

subsample: Subsample ratio of the training instance.

colsample_bytree: Subsample ratio of columns when constructing each tree.

colsample_bylevel: Subsample ratio of columns for each split, in each level.

colsample_bynode: Subsample ratio of columns for each node (split).

reg_lambda: L2 regularization term on weights.

reg_alpha: L1 regularization term on weights.

tree_method: The tree construction algorithm used in XGBoost. Choices: [“auto”, “exact”, “approx”, “hist”, “gpu_exact”, “gpu_hist”].

scale_pos_weight: Control the balance of positive and negative weights.

updater: A comma separated string defining the sequence of tree updaters to run.

refresh_leaf: Refresh updater plugin. Update tree leaf and nodes’ stats if True. When it is False, only node stats are updated.

process_type: A type of boosting process to run. Choices: [“default”, “update”].

grow_policy: Controls the way new nodes are added to the tree. Only supported if tree_method is hist. Choices: [“depthwise”, “lossguide”].

sampling_method: The method to use to sample the training instances.

monotone_constraints: Constraint of variable monotonicity.

interaction_constraints: Constraints for interaction representing permitted interactions.

sample_type: [dart booster only] Type of sampling algorithm. Choices: [“uniform”, “weighted”].

normalize_type: [dart booster only] Type of normalization algorithm. Choices: [“tree”, “forest”].

rate_drop: [dart booster only] Dropout rate.

one_drop: [dart booster only] When this flag is enabled, at least one tree is always dropped during the dropout (allows Binomial-plus-one or epsilon-dropout from the original DART paper).

skip_drop: [dart booster only] Probability of skipping the dropout procedure during a boosting iteration.

num_parallel_tree: Number of parallel trees constructed during each iteration. This option is used to support boosted random forest.

feature_selector: [linear booster only] Feature selection and ordering method.

top_k: The number of top features to select in greedy and thrifty feature selector. The value of 0 means using all the features.

max_cat_to_onehot: A threshold for deciding whether XGBoost should use one-hot encoding based split for categorical data.

max_leaves: Maximum number of nodes to be added.

max_bin: Maximum number of discrete bins to bucket continuous features.

tweedie_variance_power: Parameter that controls the variance of the Tweedie distribution.

huber_slope: A parameter used for Pseudo-Huber loss to define the delta term.

dataset_level_custom_transformation_definitions: Dataset-level custom transformation definitions in string format.

dataset_level_transformations: Dataset-level transformation configuration in string format.

run_feature_selection: Whether to enable feature selection.

feature_selection_algorithm: Feature selection algorithm.

max_selected_features: Maximum number of features to select.

predefined_split_key: Predefined split key.

stratified_split_key: Stratified split key.

training_fraction: Training fraction.

validation_fraction: Validation fraction.

test_fraction: Test fraction.

tf_auto_transform_features: List of auto transform features in the comma-separated string format.

tf_custom_transformation_definitions: TF custom transformation definitions in string format.

tf_transformations_path: Path to TF transformation configuration.

data_source_csv_filenames: The CSV data source.

data_source_bigquery_table_path: The BigQuery data source.

bigquery_staging_full_dataset_id: The BigQuery staging full dataset id for storing intermediate tables.

weight_column: The weight column name.

training_machine_type: Machine type.

training_total_replica_count: Number of workers.

training_accelerator_type: Accelerator type.

training_accelerator_count: Accelerator count.

transform_dataflow_machine_type: The dataflow machine type for transform component.

transform_dataflow_max_num_workers: The max number of Dataflow workers for transform component.

transform_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for transform component.

run_evaluation: Whether to run evaluation steps during training.

evaluation_batch_predict_machine_type: The prediction server machine type for batch predict components during evaluation.

evaluation_batch_predict_starting_replica_count: The initial number of prediction servers for batch predict components during evaluation.

evaluation_batch_predict_max_replica_count: The max number of prediction servers for batch predict components during evaluation.

evaluation_dataflow_machine_type: The dataflow machine type for evaluation components.

evaluation_dataflow_max_num_workers: The max number of Dataflow workers for evaluation components.

evaluation_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for evaluation components.

dataflow_service_account: Custom service account to run dataflow jobs.

dataflow_subnetwork: Dataflow’s fully qualified subnetwork name; when empty, the default subnetwork will be used.

dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.

encryption_spec_key_name: The KMS key name.

Returns:

Tuple of pipeline_definition_path and parameter_values.
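
A minimal sketch of calling this helper with placeholder project, bucket, and BigQuery table values; the XGBoost hyperparameters shown here are illustrative only:

from google_cloud_pipeline_components.experimental.automl.tabular import utils

# Placeholder project, bucket, and table values; hyperparameters are examples.
template_path, parameter_values = utils.get_xgboost_trainer_pipeline_and_parameters(
    project="my-project",
    location="us-central1",
    root_dir="gs://my-bucket/pipeline_root",
    target_column="label",
    objective="reg:squarederror",
    max_depth=6,
    eta=0.1,
    num_boost_round=100,
    data_source_bigquery_table_path="bq://my-project.my_dataset.training_data",
)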

google_cloud_pipeline_components.experimental.automl.tabular.utils.input_dictionary_to_parameter(input_dict: Optional[Dict[str, Any]]) str

Convert json input dict to encoded parameter string.

This function is required due to a limitation of the YAML component definition: YAML has no keyword for applying quote escaping, so the JSON argument’s quotes must be manually escaped using this function.

Args:
input_dict:

The input json dictionary.

Returns:

The encoded string used for parameter.
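
For example, a hypothetical worker pool override dictionary can be encoded before being passed as a pipeline parameter:

from google_cloud_pipeline_components.experimental.automl.tabular import utils

# Encode a JSON dictionary (here a hypothetical machine spec override) into the
# quote-escaped string form expected as a pipeline parameter.
encoded = utils.input_dictionary_to_parameter(
    {"machine_spec": {"machine_type": "n1-standard-16"}}
)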

Module contents

Module for AutoML Tables KFP components.

google_cloud_pipeline_components.experimental.automl.tabular.CvTrainerOp()

automl_tabular_cv_trainer AutoML Tabular cross-validation trainer

Args:
project (str):

Required. Project to run Cross-validation trainer.

location (str):

Location for running the Cross-validation trainer.

root_dir (str):

The Cloud Storage location to store the output.

worker_pool_specs_override_json (JsonArray):

JSON worker pool specs. E.g., [{“machine_spec”: {“machine_type”: “n1-standard-16”}},{},{},{“machine_spec”: {“machine_type”: “n1-standard-16”}}]

deadline_hours (float):

Number of hours the cross-validation trainer should run.

num_parallel_trials (int):

Number of parallel training trials.

single_run_max_secs (int):

Max number of seconds each training trial runs.

num_selected_trials (int):

Number of selected trials. The number of weak learners in the final model is 5 * num_selected_trials.

num_selected_features (int):

Number of selected features. The number of features to learn in the NN models.

transform_output (TransformOutput):

The transform output artifact.

metadata (TabularExampleGenMetadata):

The tabular example gen metadata.

materialized_cv_splits (MaterializedSplit):

The materialized cross-validation splits.

tuning_result_input (AutoMLTabularTuningResult):

AutoML Tabular tuning result.

encryption_spec_key_name (Optional[str]):

Customer-managed encryption key.

Returns:
tuning_result_output (AutoMLTabularTuningResult):

The trained model and architectures.

gcp_resources (str):

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.

execution_metrics (JsonObject):

Core metrics in dictionary of component execution.

google_cloud_pipeline_components.experimental.automl.tabular.EnsembleOp()

automl_tabular_ensemble Ensemble AutoML Tabular models

Args:
project (str):

Required. Project to run the ensemble job.

location (str):

Location for running the ensemble job.

root_dir (str):

The Cloud Storage location to store the output.

transform_output (TransformOutput):

The transform output artifact.

metadata (TabularExampleGenMetadata):

The tabular example gen metadata.

dataset_schema (DatasetSchema):

The schema of the dataset.

tuning_result_input (AutoMLTabularTuningResult):

AutoML Tabular tuning result.

instance_baseline (AutoMLTabularInstanceBaseline):

The instance baseline used to calculate explanations.

warmup_data (Dataset):

The warm up data. Ensemble component will save the warm up data together with the model artifact, used to warm up the model when prediction server starts.

encryption_spec_key_name (Optional[str]):

Customer-managed encryption key.

export_additional_model_without_custom_ops (Optional[str]):

True if an additional model without custom TF operators should be exported to the model_without_custom_ops output.

Returns:
gcp_resources (str):

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.

model_architecture (AutoMLTabularModelArchitecture):

The architecture of the output model.

model (system.Model):

The output model.

model_without_custom_ops (system.Model):

The output model without custom TF operators, this output will be empty unless export_additional_model_without_custom_ops is set.

model_uri (str):

The URI of the output model.

instance_schema_uri (str):

The URI of the instance schema.

prediction_schema_uri (str):

The URI of the prediction schema.

explanation_metadata (str):

The explanation metadata used by Vertex online and batch explanations.

explanation_parameters (str):

The explanation parameters used by Vertex online and batch explanations.

google_cloud_pipeline_components.experimental.automl.tabular.FeatureSelectionOp()

tabular_feature_ranking_and_selection Launch a feature selection task to pick top features.

Args:
project (str):

Required. Project to run feature selection.

location (str):

Location for running the feature selection. If not set, default to us-central1.

root_dir (str):

The Cloud Storage location to store the output.

dataflow_machine_type (Optional[str]):

The machine type used for dataflow jobs. If not set, default to n1-standard-16.

dataflow_max_num_workers (Optional[int]):

The number of workers to run the dataflow job. If not set, default to 25.

dataflow_disk_size_gb (Optional[int]):

The disk size, in gigabytes, to use on each Dataflow worker instance. If not set, default to 40.

dataflow_subnetwork (Optional[str]):

Dataflow’s fully qualified subnetwork name, when empty the default subnetwork will be used. More details: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications

dataflow_use_public_ips (Optional[bool]):

Specifies whether Dataflow workers use public IP addresses.

dataflow_service_account (Optional[str]):

Custom service account to run dataflow jobs.

encryption_spec_key_name (Optional[str]):

Customer-managed encryption key. If this is set, then all resources will be encrypted with the provided encryption key.

data_source(Dataset):

The input dataset artifact which references csv, BigQuery, or TF Records.

target_column_name(str):

Target column name of the input dataset.

max_selected_features (Optional[int]):

Number of features to select by the algorithm. If not set, defaults to 1000.

Returns:
feature_ranking (TabularFeatureRanking):

the dictionary of feature names and feature ranking values.

selected_features (JsonObject):

A json array of selected feature names.

gcp_resources (str):

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.

google_cloud_pipeline_components.experimental.automl.tabular.FeatureTransformEngineOp(root_dir: str, project: str, location: str, dataset_level_custom_transformation_definitions: list = '[]', dataset_level_transformations: list = '[]', forecasting_time_column: str = '', forecasting_time_series_identifier_column: str = '', forecasting_time_series_attribute_columns: str = '', forecasting_unavailable_at_forecast_columns: str = '', forecasting_available_at_forecast_columns: str = '', forecasting_forecast_horizon: int = '-1', forecasting_context_window: int = '-1', forecasting_predefined_window_column: str = '', forecasting_window_stride_length: int = '-1', forecasting_window_max_count: int = '-1', forecasting_apply_windowing: bool = 'true', predefined_split_key: str = '', stratified_split_key: str = '', timestamp_split_key: str = '', training_fraction: float = '-1', validation_fraction: float = '-1', test_fraction: float = '-1', tf_auto_transform_features: list = '[]', tf_custom_transformation_definitions: list = '[]', tf_transformations_path: str = '', target_column: str = '', weight_column: str = '', prediction_type: str = '', model_type: str = None, run_distill: bool = 'false', run_feature_selection: bool = 'false', feature_selection_algorithm: str = 'AMI', max_selected_features: int = 1000, data_source_csv_filenames: str = '', data_source_bigquery_table_path: str = '', bigquery_staging_full_dataset_id: str = '', dataflow_machine_type: str = 'n1-standard-16', dataflow_max_num_workers: int = '25', dataflow_disk_size_gb: int = '40', dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = 'true', dataflow_service_account: str = '', encryption_spec_key_name: str = '', autodetect_csv_schema: bool = 'false')

feature_transform_engine Feature Transform Engine (FTE) component to transform raw data to engineered features.

FTE performs dataset level transformations, data splitting, data statistic generation, and TensorFlow-based row level transformations on the input dataset based on the provided transformation configuration.

Args:
root_dir (str):

The Cloud Storage location to store the output.

project (str):

Project to run feature transform engine.

location (str):

Location for the created GCP services.

dataset_level_custom_transformation_definitions (Optional[JsonArray]):

List of dataset-level custom transformation definitions.

Custom, bring-your-own dataset-level transform functions, where users can define and import their own transform function and use it with FTE’s built-in transformations. Using custom transformations is an experimental feature and it is currently not supported during batch prediction.

Example:

[
  {
    "transformation": "ConcatCols",
    "module_path": "/path/to/custom_transform_fn_dlt.py",
    "function_name": "concat_cols"
  }
]

Using custom transform function together with FTE’s built-in transformations:

[
  {
    "transformation": "Join",
    "right_table_uri": "bq://test-project.dataset_test.table",
    "join_keys": [["join_key_col", "join_key_col"]]
  },{
    "transformation": "ConcatCols",
    "cols": ["feature_1", "feature_2"],
    "output_col": "feature_1_2"
  }
]
dataset_level_transformations (Optional[JsonArray]):

List of dataset-level transformations.

Example:

[
  {
    "transformation": "Join",
    "right_table_uri": "bq://test-project.dataset_test.table",
    "join_keys": [["join_key_col", "join_key_col"]]
  },
  ...
]

Additional information about FTE’s currently supported built-in transformations:

Join:

Joins features from right_table_uri. For each join key, the left table keys will be included and the right table keys will be dropped.

Example:

{
  "transformation": "Join",
  "right_table_uri": "bq://test-project.dataset_test.table",
  "join_keys": [["join_key_col", "join_key_col"]]
}
Arguments:
right_table_uri (str):

Right table BigQuery uri to join with input_full_table_id.

join_keys (List[List[str]]):

Features to join on. For each nested list, the first element is a left table column and the second is its corresponding right table column.

TimeAggregate:

Creates a new feature composed of values of an existing feature from a fixed time period ago or in the future. Ex: A feature for sales by store 1 year ago.

Example:

{
  "transformation": "TimeAggregate",
  "time_difference": 40,
  "time_difference_units": "DAY",
  "time_series_identifier_columns": ["store_id"],
  "time_column": "time_col",
  "time_difference_target_column": "target_col",
  "output_column": "output_col"
}
Arguments:
time_difference (int):

Number of time_difference_units to look back or into the future on our time_difference_target_column.

time_difference_units (str):

Units of time_difference to look back or into the future on our time_difference_target_column. Must be one of

  • ‘DAY’

  • ‘WEEK’ (Equivalent to 7 DAYs)

  • ‘MONTH’

  • ‘QUARTER’

  • ‘YEAR’

time_series_identifier_columns (List[str]):

Names of the time series identifier columns.

time_column (str):

Name of the time column.

time_difference_target_column (str):

Column we wish to get the value of time_difference time_difference_units in the past or future.

output_column (str):

Name of our new time aggregate feature.

is_future (Optional[bool]):

Whether we wish to look forward in time. Defaults to False.

PartitionByMax/PartitionByMin/PartitionByAvg/PartitionBySum:

Performs a partition by reduce operation (one of max, min, avg, or sum) with a fixed historic time period. Ex: Getting avg sales (the reduce column) for each store (partition_by_column) over the previous 5 days (time_column, time_ago_units, and time_ago).

Example:

{
  "transformation": "PartitionByMax",
  "reduce_column": "sell_price",
  "partition_by_columns": ["store_id", "state_id"],
  "time_column": "date",
  "time_ago": 1,
  "time_ago_units": "WEEK",
  "output_column": "partition_by_reduce_max_output"
}
Arguments:
reduce_column (str):

Column to apply the reduce operation on. Reduce operations include the following: Max, Min, Avg, Sum.

partition_by_columns (List[str]):

List of columns to partition by.

time_column (str):

Time column for the partition by operation’s window function.

time_ago (int):

Number of time_ago_units to look back on our target_column, starting from time_column (inclusive).

time_ago_units (str):

Units of time_ago to look back on our target_column. Must be one of

  • ‘DAY’

  • ‘WEEK’

output_column (str):

Name of our output feature.

forecasting_time_column (Optional[str]):

Forecasting time column.

forecasting_time_series_identifier_column (Optional[str]):

Forecasting time series identifier column.

forecasting_time_series_attribute_columns (Optional[str]):

Forecasting time series attribute columns.

forecasting_unavailable_at_forecast_columns (Optional[str]):

Forecasting unavailable at forecast columns.

forecasting_available_at_forecast_columns (Optional[str]):

Forecasting available at forecast columns.

forecasting_forecast_horizon (Optional[int]):

Forecasting horizon.

forecasting_context_window (Optional[int]):

Forecasting context window.

forecasting_predefined_window_column (Optional[str]):

Forecasting predefined window column.

forecasting_window_stride_length (Optional[int]):

Forecasting window stride length.

forecasting_window_max_count (Optional[int]):

Forecasting window max count.

forecasting_apply_windowing (Optional[bool]):

Whether to apply window strategy.

predefined_split_key (Optional[str]):

Predefined split key.

stratified_split_key (Optional[str]):

Stratified split key.

timestamp_split_key (Optional[str]):

Timestamp split key.

training_fraction (Optional[float]):

Fraction of input data for training.

validation_fraction (Optional[float]):

Fraction of input data for validation.

test_fraction (Optional[float]):

Fraction of input data for testing.

tf_auto_transform_features (Optional[JsonArray]):

List of auto TF transform features.

FTE will automatically configure a set of built-in transformations for each feature based on its data statistics.
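
A minimal illustration of this parameter’s value, assuming it is a JSON list of the column names to auto-transform (the column names below are hypothetical):

[
  "feature_1",
  "feature_2",
  "feature_3"
]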

tf_custom_transformation_definitions (Optional[JsonArray]):

List of TensorFlow-based custom transformation definitions.

Custom, bring-your-own transform functions, where users can define and import their own transform function and use it with FTE’s built-in transformations.

Example:

[
  {
    "transformation": "PlusOne",
    "module_path": "gs://bucket/custom_transform_fn.py",
    "function_name": "plus_one_transform"
  },
  {
    "transformation": "MultiplyTwo",
    "module_path": "gs://bucket/custom_transform_fn.py",
    "function_name": "multiply_two_transform"
  }
]

Using custom transform function together with FTE’s built-in transformations:

[
  {
    "transformation": "CastToFloat",
    "input_columns": ["feature_1"],
    "output_columns": ["feature_1"]
  },{
    "transformation": "PlusOne",
    "input_columns": ["feature_1"]
    "output_columns": ["feature_1_plused_one"]
  },{
    "transformation": "MultiplyTwo",
    "input_columns": ["feature_1"]
    "output_columns": ["feature_1_multiplied_two"]
  }
]
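
Purely as an illustration of what such a bring-your-own module might contain, below is a hedged sketch of a gs://bucket/custom_transform_fn.py with the two functions referenced above. The exact signature FTE expects for custom TF transform functions is not reproduced in this documentation; the element-wise, single-argument form below is an assumption.

# custom_transform_fn.py -- illustrative sketch only.
# Assumption: each custom transform function receives the values of its input
# column (e.g. as a TensorFlow tensor at transform time) and returns the
# transformed values for the corresponding output column.

def plus_one_transform(x):
  """Adds one to every element of the input column."""
  return x + 1


def multiply_two_transform(x):
  """Multiplies every element of the input column by two."""
  return x * 2
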
tf_transformations_path (Optional[str]):

Path to TensorFlow-based transformation configuration.

Path to a JSON file used to specify FTE’s TF transformation configurations. In the following, we provide some sample transform configurations to demonstrate FTE’s capabilities.

All transformations on input columns are explicitly specified with FTE’s built-in transformations. Chaining of multiple transformations on a single column is also supported. For example:

[
  {
    "transformation": "ZScale",
    "input_columns": ["feature_1"]
  }, {
    "transformation": "ZScale",
    "input_columns": ["feature_2"]
  }
]

Additional information about FTE’s currently supported built-in transformations:

Datetime:

Extracts datetime features from a column containing timestamp strings.

Example:

{
  "transformation": "Datetime",
  "input_columns": ["feature_1"],
  "time_format": "%Y-%m-%d"
}
Arguments:
input_columns (List[str]):

A list with a single column to perform the datetime transformation on.

output_columns (Optional[List[str]]):

Names of output columns, one for each datetime_features element.

time_format (str):

Datetime format string. Time format is a combination of Date + Time Delimiter (optional) + Time (optional) directives. Valid date directives are as follows

  • ‘%Y-%m-%d’ # 2018-11-30

  • ‘%Y/%m/%d’ # 2018/11/30

  • ‘%y-%m-%d’ # 18-11-30

  • ‘%y/%m/%d’ # 18/11/30

  • ‘%m-%d-%Y’ # 11-30-2018

  • ‘%m/%d/%Y’ # 11/30/2018

  • ‘%m-%d-%y’ # 11-30-18

  • ‘%m/%d/%y’ # 11/30/18

  • ‘%d-%m-%Y’ # 30-11-2018

  • ‘%d/%m/%Y’ # 30/11/2018

  • ‘%d-%B-%Y’ # 30-November-2018

  • ‘%d-%m-%y’ # 30-11-18

  • ‘%d/%m/%y’ # 30/11/18

  • ‘%d-%B-%y’ # 30-November-18

  • ‘%d%m%Y’ # 30112018

  • ‘%m%d%Y’ # 11302018

  • ‘%Y%m%d’ # 20181130

Valid time delimiters are as follows
  • ‘T’

  • ‘ ‘

Valid time directives are as follows
  • ‘%H:%M’ # 23:59

  • ‘%H:%M:%S’ # 23:59:58

  • ‘%H:%M:%S.%f’ # 23:59:58[.123456]

  • ‘%H:%M:%S.%f%z’ # 23:59:58[.123456]+0000

  • ‘%H:%M:%S%z’, # 23:59:58+0000

datetime_features (Optional[List[str]]):
List of datetime features to be extracted. Each entry must be one of
  • ‘YEAR’

  • ‘MONTH’

  • ‘DAY’

  • ‘DAY_OF_WEEK’

  • ‘DAY_OF_YEAR’

  • ‘WEEK_OF_YEAR’

  • ‘QUARTER’

  • ‘HOUR’

  • ‘MINUTE’

  • ‘SECOND’

Defaults to [‘YEAR’, ‘MONTH’, ‘DAY’, ‘DAY_OF_WEEK’, ‘DAY_OF_YEAR’, ‘WEEK_OF_YEAR’]
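
Putting the arguments above together, a fuller Datetime configuration might look like the following (the column names, format, and feature selection are illustrative only):

{
  "transformation": "Datetime",
  "input_columns": ["purchase_time"],
  "time_format": "%Y-%m-%d %H:%M:%S",
  "datetime_features": ["YEAR", "MONTH", "DAY_OF_WEEK", "HOUR"],
  "output_columns": [
    "purchase_year",
    "purchase_month",
    "purchase_day_of_week",
    "purchase_hour"
  ]
}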

Log:

Performs the natural log on a numeric column.

Example:

{
  "transformation": "Log",
  "input_columns": ["feature_1"]
}
Arguments:
input_columns (List[str]):

A list with a single column to perform the log transformation on.

output_columns (Optional[List[str]]):

A list with a single output column name, corresponding to the output of our transformation.

ZScale:

Performs Z-scale normalization on a numeric column.

Example:

{
  "transformation": "ZScale",
  "input_columns": ["feature_1"]
}
Arguments:
input_columns (List[str]):

A list with a single column to perform the z-scale transformation on.

output_columns (Optional[List[str]]):

A list with a single output column name, corresponding to the output of our transformation.

Vocabulary:

Converts strings to integers, where each unique string gets a unique integer representation.

Example:

{
  "transformation": "Vocabulary",
  "input_columns": ["feature_1"]
}
Arguments:
input_columns (List[str]):

A list with a single column to perform the vocabulary transformation on.

output_columns (Optional[List[str]]):

A list with a single output column name, corresponding to the output of our transformation.

top_k (Optional[int]):

Number of the most frequent words in the vocabulary to use for generating dictionary lookup indices. If not specified, all words in the vocabulary will be used. Defaults to None.

frequency_threshold (Optional[int]):

Limit the vocabulary only to words whose number of occurrences in the input exceeds frequency_threshold. If not specified, all words in the vocabulary will be included. If both top_k and frequency_threshold are specified, a word must satisfy both conditions to be included. Defaults to None.
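
For example, combining top_k and frequency_threshold as described above restricts the vocabulary to the 100 most frequent tokens that also occur more than 5 times (illustrative values):

{
  "transformation": "Vocabulary",
  "input_columns": ["feature_1"],
  "top_k": 100,
  "frequency_threshold": 5
}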

Categorical:

Transforms categorical columns to integer columns.

Example:

{
  "transformation": "Categorical",
  "input_columns": ["feature_1"],
  "top_k": 10
}
Arguments:
input_columns (List[str]):

A list with a single column to perform the categorical transformation on.

output_columns (Optional[List[str]]):

A list with a single output column name, corresponding to the output of our transformation.

top_k (Optional[int]):

Number of the most frequent words in the vocabulary to use for generating dictionary lookup indices. If not specified, all words in the vocabulary will be used.

frequency_threshold (Optional[int]):

Limit the vocabulary only to words whose number of occurrences in the input exceeds frequency_threshold. If not specified, all words in the vocabulary will be included. If both top_k and frequency_threshold are specified, a word must satisfy both conditions to be included.

Reduce:

Given a column where each entry is a numeric array, reduces arrays according to our reduce_mode.

Example:

{
  "transformation": "Reduce",
  "input_columns": ["feature_1"],
  "reduce_mode": "MEAN",
  "output_columns": ["feature_1_mean"]
}
Arguments:
input_columns (List[str]):

A list with a single column to perform the reduce transformation on.

output_columns (Optional[List[str]]):

A list with a single output column name, corresponding to the output of our transformation.

reduce_mode (Optional[str]):
One of
  • ‘MAX’

  • ‘MIN’

  • ‘MEAN’

  • ‘LAST_K’

Defaults to ‘MEAN’.

last_k (Optional[int]):

The number of last k elements when ‘LAST_K’ reduce mode is used. Defaults to 1.
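
For instance, a LAST_K configuration that keeps the last 3 elements of each array (column names are illustrative):

{
  "transformation": "Reduce",
  "input_columns": ["feature_1"],
  "reduce_mode": "LAST_K",
  "last_k": 3,
  "output_columns": ["feature_1_last_3"]
}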

SplitString:

Given a column of strings, splits strings into token arrays.

Example:

{
  "transformation": "SplitString",
  "input_columns": ["feature_1"],
  "separator": "$"
}
Arguments:
input_columns (List[str]):

A list with a single column to perform the split string transformation on.

output_columns (Optional[List[str]]):

A list with a single output column name, corresponding to the output of our transformation.

separator (Optional[str]):

Separator to split input string into tokens. Defaults to ‘ ‘.

missing_token (Optional[str]):

Missing token to use when no string is included. Defaults to ‘ _MISSING_ ‘.

NGram:

Given a column of strings, splits strings into token arrays where each token is an integer.

Example:

{
  "transformation": "NGram",
  "input_columns": ["feature_1"],
  "min_ngram_size": 1,
  "max_ngram_size": 2,
  "separator": " "
}
Arguments:
input_columns (List[str]):

A list with a single column to perform the n-gram transformation on.

output_columns (Optional[List[str]]):

A list with a single output column name, corresponding to the output of our transformation.

min_ngram_size (Optional[int]):

Minimum n-gram size. Must be a positive number and <= max_ngram_size. Defaults to 1.

max_ngram_size (Optional[int]):

Maximum n-gram size. Must be a positive number and >= min_ngram_size. Defaults to 2.

top_k (Optional[int]):

Number of the most frequent words in the vocabulary to use for generating dictionary lookup indices. If not specified, all words in the vocabulary will be used. Defaults to None.

frequency_threshold (Optional[int]):

Limit the dictionary’s vocabulary only to words whose number of occurrences in the input exceeds frequency_threshold. If not specified, all words in the vocabulary will be included. If both top_k and frequency_threshold are specified, a word must satisfy both conditions to be included. Defaults to None.

separator (Optional[str]):

Separator to split input string into tokens. Defaults to ‘ ‘.

missing_token (Optional[str]):

Missing token to use when no string is included. Defaults to ‘ _MISSING_ ‘.

Clip:

Given a numeric column, clips elements such that elements < min_value are assigned min_value, and elements > max_value are assigned max_value.

Example:

{
  "transformation": "Clip",
  "input_columns": ["col1"],
  "output_columns": ["col1_clipped"],
  "min_value": 1.,
  "max_value": 10.,
}
Arguments:
input_columns (List[str]):

A list with a single column to perform the clip transformation on.

output_columns (Optional[List[str]]):

A list with a single output column name, corresponding to the output of our transformation.

min_value (Optional[float]):

Number where all values below min_value are set to min_value. If no min_value is provided, min clipping will not occur. Defaults to None.

max_value (Optional[float]):

Number where all values above max_value are set to max_value. If no max_value is provided, max clipping will not occur. Defaults to None.

MultiHotEncoding:

Performs multi-hot encoding on a categorical array column.

Example:

{
  "transformation": "MultiHotEncoding",
  "input_columns": ["col1"],
}

The number of classes is determined by the largest number included in the input if it is numeric, or by the total number of unique values of the input if it is of type str.

If the input has type str and an element contains separator tokens, the input will be split at separator indices, and each element of the split list will be considered a separate class. For example,

Input:

[
  ["foo bar"],      # Example 0
  ["foo", "bar"],   # Example 1
  ["foo"],          # Example 2
  ["bar"],          # Example 3
]

Output (with default separator=” “):

[
  [1, 1],          # Example 0
  [1, 1],          # Example 1
  [1, 0],          # Example 2
  [0, 1],          # Example 3
]
Arguments:
input_columns (List[str]):

A list with a single column to perform the multi-hot-encoding on.

output_columns (Optional[List[str]]):

A list with a single output column name, corresponding to the output of our transformation.

top_k (Optional[int]):

Number of the most frequent words in the vocabulary to use for generating dictionary lookup indices. If not specified, all words in the vocabulary will be used. Defaults to None.

frequency_threshold (Optional[int]):

Limit the dictionary’s vocabulary only to words whose number of occurrences in the input exceeds frequency_threshold. If not specified, all words in the vocabulary will be included. If both top_k and frequency_threshold are specified, a word must satisfy both conditions to be included. Defaults to None.

separator (Optional[str]):

Separator to split input string into tokens. Defaults to ‘ ‘.

MaxAbsScale:

Performs maximum absolute scaling on a numeric column.

Example:

{
  "transformation": "MaxAbsScale",
  "input_columns": ["col1"],
  "output_columns": ["col1_max_abs_scaled"]
}
Arguments:
input_columns (List[str]):

A list with a single column to perform max-abs-scale on.

output_columns (Optional[List[str]]):

A list with a single output column name, corresponding to the output of our transformation.

Custom:

Transformations defined in tf_custom_transformation_definitions are included here in the TensorFlow-based transformation configuration. For example, given the following tf_custom_transformation_definitions:

[
  {
    "transformation": "PlusX",
    "module_path": "gs://bucket/custom_transform_fn.py",
    "function_name": "plus_one_transform"
  }
]

We can include the following transformation:

{
  "transformation": "PlusX",
  "input_columns": ["col1"],
  "output_columns": ["col1_max_abs_scaled"]
  "x": 5
}

Note that input_columns must still be included in our arguments and output_columns is optional. All other arguments are those defined in custom_transform_fn.py, which includes “x” in this case. See tf_custom_transformation_definitions above.

target_column (Optional[str]):

Target column of input data.

weight_column (Optional[str]):

Weight column of input data.

prediction_type (Optional[str]):

Model prediction type. One of “classification”, “regression”, “time_series”.

run_distill (Optional[bool]):

Whether the distillation should be applied to the training.

run_feature_selection (Optional[bool]):

Whether the feature selection should be applied to the dataset.

feature_selection_algorithm (Optional[str]):

The feature selection algorithm. One of “AMI”, “CMIM”, “JMIM”, or “MRMR”. Defaults to “AMI”.

The available algorithms are:

AMI (Adjusted Mutual Information)

CMIM (Conditional Mutual Information Maximization):

Reference paper: Mohamed Bennasar, Yulia Hicks, Rossitza Setchi, “Feature selection using Joint Mutual Information Maximisation,” Expert Systems with Applications, vol. 42, issue 22, 1 December 2015, Pages 8520-8532.

JMIM(Joint Mutual Information Maximization):

Reference paper: Mohamed Bennasar, Yulia Hicks, Rossitza Setchi, “Feature selection using Joint Mutual Information Maximisation,” Expert Systems with Applications, vol. 42, issue 22, 1 December 2015, Pages 8520-8532.

MRMR(MIQ Minimum-redundancy Maximum-relevance):

Reference paper: Hanchuan Peng, Fuhui Long, and Chris Ding. “Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy.” IEEE Transactions on pattern analysis and machine intelligence 27, no. 8 (2005): 1226-1238.

max_selected_features (Optional[int]):

Maximum number of features to select.

If specified, the transform config is pruned to keep only the features that rank highest in the feature ranking, which contains a ranking value for every supported feature. If the number of input features is smaller than the specified max_selected_features, the feature selection process still runs and generates the feature ranking, but no features are excluded.

Defaults to 1000 if run_feature_selection is enabled.

data_source_csv_filenames (Optional[str]):

CSV input data source to run feature transform on.

data_source_bigquery_table_path (Optional[str]):

BigQuery input data source to run feature transform on.

bigquery_staging_full_dataset_id (Optional[str]):

Dataset in “projectId.datasetId” format for storing intermediate-FTE BigQuery tables.

If the specified dataset does not exist in BigQuery, FTE will create the dataset. If no bigquery_staging_full_dataset_id is specified, all intermediate tables will be stored in a dataset named “vertex_feature_transform_engine_staging_{location.replace(‘-’, ‘_’)}”, created under the provided project in the provided location during FTE execution.

All tables generated by FTE will have a 30 day TTL.

model_type (Optional[str]):

Model type, which we wish to engineer features for. Can be one of: neural_network, boosted_trees.

dataflow_machine_type (Optional[str]):

The machine type used for dataflow jobs. If not set, default to n1-standard-16.

dataflow_max_num_workers (Optional[int]):

The number of workers to run the dataflow job. If not set, default to 25.

dataflow_disk_size_gb (Optional[int]):

The disk size, in gigabytes, to use on each Dataflow worker instance. If not set, default to 40.

dataflow_subnetwork (Optional[str]):

Dataflow’s fully qualified subnetwork name, when empty the default subnetwork will be used. More details: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications

dataflow_use_public_ips (Optional[bool]):

Specifies whether Dataflow workers use public IP addresses.

dataflow_service_account (Optional[str]):

Custom service account to run Dataflow jobs.

encryption_spec_key_name (Optional[str]):

Customer-managed encryption key.

autodetect_csv_schema (Optional[bool]):

If True, infers the column types when importing CSVs into BigQuery.

Returns:
dataset_stats (AutoMLTabularDatasetStats):

The stats of the dataset.

materialized_data (Dataset):

The materialized dataset.

transform_output (TransformOutput):

The transform output artifact.

split_example_counts (str):

JSON string of data split example counts for train, validate, and test splits.

bigquery_test_split_uri (str):

BigQuery URI for the test split to pass to the batch prediction component during evaluation.

bigquery_downsampled_test_split_uri (str):

BigQuery URI for the downsampled test split to pass to the batch prediction component during batch explain.

instance_schema_path (DatasetSchema):

Schema of input data to the tf_model at serving time.

training_schema_path (DatasetSchema):

Schema of input data to the tf_model at training time.

feature_ranking (TabularFeatureRanking):

The ranking of features, all features supported in the dataset will be included.

For the “AMI” algorithm, array features won’t be available in the ranking, as arrays are not yet supported.

gcp_resources (str):

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.

google_cloud_pipeline_components.experimental.automl.tabular.FinalizerOp()

automl_tabular_finalizer Finalizer for AutoML Tabular pipelines

Args:
project (str):

Required. Project to run Cross-validation trainer.

location (str):

Location for running the Cross-validation trainer.

root_dir (str):

The Cloud Storage location to store the output.

encryption_spec_key_name (Optional[str]):

Customer-managed encryption key.

Returns:
gcp_resources (str):

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.

google_cloud_pipeline_components.experimental.automl.tabular.InfraValidatorOp()

automl_tabular_infra_validator Validates that the trained AutoML Tabular model is a valid model.

Args:
unmanaged_container_model (str):

google.UnmanagedContainerModel for model to be validated.

google_cloud_pipeline_components.experimental.automl.tabular.SplitMaterializedDataOp(materialized_data: Dataset)

Split materialized data Splits the materialized dataset into materialized train, eval, and test data splits.

The materialized dataset generated by the Feature Transform Engine consists of all the splits that were combined into the input transform dataset (i.e., train, eval, and test splits). This component splits the output materialized dataset into the corresponding materialized data splits so that the splits can be used by downstream training or evaluation components.

Args:
materialized_data (Dataset):

Materialized dataset output by the Feature Transform Engine.

Returns:
materialized_train_split (MaterializedSplit):

Path pattern to materialized train split.

materialized_eval_split (MaterializedSplit):

Path pattern to materialized eval split.

materialized_test_split (MaterializedSplit):

Path pattern to materialized test split.
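
A minimal KFP v2 sketch of wiring this component downstream of the Feature Transform Engine documented above. It assumes the FeatureTransformEngineOp component from this module and the kfp v2 SDK; the column names, BigQuery table, and the inclusion of project/location/root_dir arguments are illustrative assumptions, not a complete production pipeline.

from kfp.v2 import dsl

from google_cloud_pipeline_components.experimental.automl import tabular


@dsl.pipeline(name="fte-and-split-sketch")
def fte_and_split(project: str, location: str, root_dir: str):
  # Run the Feature Transform Engine on a hypothetical BigQuery source. Only a
  # few of the arguments documented above are shown.
  fte = tabular.FeatureTransformEngineOp(
      project=project,
      location=location,
      root_dir=root_dir,
      target_column="label",  # hypothetical column name
      prediction_type="classification",
      data_source_bigquery_table_path="bq://my-project.my_dataset.my_table",
      tf_auto_transform_features=["feature_1", "feature_2"],
      run_feature_selection=True,
      feature_selection_algorithm="AMI",
      max_selected_features=1000,
  )

  # Split the combined materialized dataset into train/eval/test splits.
  split = tabular.SplitMaterializedDataOp(
      materialized_data=fte.outputs["materialized_data"],
  )
  # split.outputs["materialized_train_split"], ["materialized_eval_split"], and
  # ["materialized_test_split"] can then feed downstream training or evaluation.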

google_cloud_pipeline_components.experimental.automl.tabular.Stage1TunerOp()

automl_tabular_stage_1_tuner AutoML Tabular stage 1 tuner

Args:
project (str):

Required. Project to run Cross-validation trainer.

location (str):

Location for running the Cross-validation trainer.

root_dir (str):

The Cloud Storage location to store the output.

study_spec_parameters_override (JsonArray):

JSON study spec. E.g., [{"parameter_id": "model_type", "categorical_value_spec": {"values": ["nn"]}}]

worker_pool_specs_override_json (JsonArray):

JSON worker pool specs. E.g., [{"machine_spec": {"machine_type": "n1-standard-16"}},{},{},{"machine_spec": {"machine_type": "n1-standard-16"}}]

reduce_search_space_mode (str):

The reduce search space mode. Possible values: “regular” (default), “minimal”, “full”.

num_selected_trials (int):

Number of selected trials. The number of weak learners in the final model is 5 * num_selected_trials.

num_selected_features (int):

Number of selected features. The number of features to learn in the NN models.

deadline_hours (float):

Number of hours the cross-validation trainer should run.

disable_early_stopping (bool):

True if disable early stopping. Default value is false.

num_parallel_trials (int):

Number of parallel training trials.

single_run_max_secs (int):

Max number of seconds each training trial runs.

metadata (TabularExampleGenMetadata):

The tabular example gen metadata.

transform_output (TransformOutput):

The transform output artifact.

materialized_train_split (MaterializedSplit):

The materialized train split.

materialized_eval_split (MaterializedSplit):

The materialized eval split.

encryption_spec_key_name (Optional[str]):

Customer-managed encryption key.

run_distillation (bool):

True if in distillation mode. The default value is false.

Returns:
gcp_resources (str):

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.

tuning_result_output (AutoMLTabularTuningResult):

The trained model and architectures.

execution_metrics (JsonObject):

Core metrics in dictionary of component execution.

google_cloud_pipeline_components.experimental.automl.tabular.StatsAndExampleGenOp(project: str, location: str, root_dir: str, target_column_name: str, prediction_type: str, transformations: str, weight_column_name: str = '', optimization_objective: str = '', optimization_objective_recall_value: float = '-1', optimization_objective_precision_value: float = '-1', transformations_path: str = '', split_spec: str = None, data_source: str = None, request_type: str = 'COLUMN_STATS_ONLY', dataflow_machine_type: str = 'n1-standard-16', dataflow_max_num_workers: int = '25', dataflow_disk_size_gb: int = '40', dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = 'true', dataflow_service_account: str = '', encryption_spec_key_name: str = '', run_distillation: bool = 'false', additional_experiments: str = '', additional_experiments_json: dict = '{}', data_source_csv_filenames: str = '', data_source_bigquery_table_path: str = '', predefined_split_key: str = '', timestamp_split_key: str = '', stratified_split_key: str = '', training_fraction: float = '-1', validation_fraction: float = '-1', test_fraction: float = '-1', quantiles: list = '[]', enable_probabilistic_inference: bool = 'false')

tabular_stats_and_example_gen Statistics and example gen for tabular data

Args:
project (str):

Required. Project to run dataset statistics and example generation.

location (str):

Location for running dataset statistics and example generation.

root_dir (str):

The Cloud Storage location to store the output.

target_column_name (str):

The target column name.

weight_column_name (str):

The weight column name.

prediction_type (str):

The prediction type. Supported values: “classification”, “regression”.

optimization_objective (str):

Objective function the model is optimizing towards. The training process creates a model that maximizes/minimizes the value of the objective function over the validation set. The supported optimization objectives depend on the prediction type. If the field is not set, a default objective function is used.

classification (binary):

  • “maximize-au-roc” (default) - Maximize the area under the receiver operating characteristic (ROC) curve.

  • “minimize-log-loss” - Minimize log loss.

  • “maximize-au-prc” - Maximize the area under the precision-recall curve.

  • “maximize-precision-at-recall” - Maximize precision for a specified recall value.

  • “maximize-recall-at-precision” - Maximize recall for a specified precision value.

classification (multi-class):

  • “minimize-log-loss” (default) - Minimize log loss.

regression:

  • “minimize-rmse” (default) - Minimize root-mean-squared error (RMSE).

  • “minimize-mae” - Minimize mean-absolute error (MAE).

  • “minimize-rmsle” - Minimize root-mean-squared log error (RMSLE).

optimization_objective_recall_value (str):

Required when optimization_objective is “maximize-precision-at-recall”. Must be between 0 and 1, inclusive.

optimization_objective_precision_value (str):

Required when optimization_objective is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.

transformations (str):

Quote-escaped JSON string for transformations. Each transformation applies a transform function to the given input column, and the result is used for training. When creating a transformation for a BigQuery Struct column, the column should be flattened using “.” as the delimiter.

transformations_path (Optional[str]):

Path to a GCS file containing JSON string for transformations.

split_spec (str):

Quote escaped JSON string for split spec.

data_source (str):

Quote escaped JSON string for data source.

dataflow_machine_type (Optional[str]):

The machine type used for dataflow jobs. If not set, default to n1-standard-16.

dataflow_max_num_workers (Optional[int]):

The number of workers to run the dataflow job. If not set, default to 25.

dataflow_disk_size_gb (Optional[int]):

The disk size, in gigabytes, to use on each Dataflow worker instance. If not set, default to 40.

dataflow_subnetwork (Optional[str]):

Dataflow’s fully qualified subnetwork name, when empty the default subnetwork will be used. More details: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications

dataflow_use_public_ips (Optional[bool]):

Specifies whether Dataflow workers use public IP addresses.

dataflow_service_account (Optional[str]):

Custom service account to run dataflow jobs.

encryption_spec_key_name (Optional[str]):

Customer-managed encryption key.

run_distillation (bool):

True if in distillation mode. The default value is false.

Returns:
dataset_schema (DatasetSchema):

The schema of the dataset.

dataset_stats (AutoMLTabularDatasetStats):

The stats of the dataset.

train_split (Dataset):

The train split.

eval_split (Dataset):

The eval split.

test_split (Dataset):

The test split.

test_split_json (JsonObject):

The test split JSON object.

downsampled_test_split_json (JsonObject):

The downsampled test split JSON object.

instance_baseline (AutoMLTabularInstanceBaseline):

The instance baseline used to calculate explanations.

metadata (TabularExampleGenMetadata):

The tabular example gen metadata.

gcp_resources (str):

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.

google_cloud_pipeline_components.experimental.automl.tabular.TabNetHyperparameterTuningJobOp()

tabnet_hyperparameter_tuning_job Launch a TabNet hyperparameter tuning job using Vertex HyperparameterTuningJob API.

Args:
project (str):

Required. The GCP project that runs the pipeline components.

location (str):

Required. The GCP region that runs the pipeline components.

root_dir (str):

Required. The root GCS directory for the pipeline components.

target_column (str):

Required. The target column name.

prediction_type (str):

Required. The type of prediction the model is to produce. “classification” or “regression”.

weight_column (Optional[str]):

The weight column name.

enable_profiler (Optional[bool]):

Enables profiling and saves a trace during evaluation.

cache_data (Optional[str]):

Whether to cache data or not. If set to ‘auto’, caching is determined based on the dataset size.

seed (Optional[int]):

Seed to be used for this run.

eval_steps (Optional[int]):

Number of steps to run evaluation for. If not specified or negative, evaluation runs on the whole validation dataset. If set to 0, evaluation runs for a fixed number of samples.

eval_frequency_secs (Optional[int]):

Frequency at which evaluation and checkpointing will take place.

study_spec_metric_id (str):

Required. Metric to optimize. Possible values: [‘loss’, ‘average_loss’, ‘rmse’, ‘mae’, ‘mql’, ‘accuracy’, ‘auc’, ‘precision’, ‘recall’].

study_spec_metric_goal (str):

Required. Optimization goal of the metric, possible values: “MAXIMIZE”, “MINIMIZE”.

study_spec_parameters_override (list[str]):

List of dictionaries representing parameters to optimize. The dictionary key is the parameter_id, which is passed to training job as a command line argument, and the dictionary value is the parameter specification of the metric.

max_trial_count (int):

Required. The desired total number of trials.

parallel_trial_count (int):

Required. The desired number of trials to run in parallel.

max_failed_trial_count (Optional[int]):

The number of failed trials that need to be seen before failing the HyperparameterTuningJob. If set to 0, Vertex AI decides how many trials must fail before the whole job fails.

study_spec_algorithm (Optional[str]):

The search algorithm specified for the study. One of ‘ALGORITHM_UNSPECIFIED’, ‘GRID_SEARCH’, or ‘RANDOM_SEARCH’.

study_spec_measurement_selection_type (Optional[str]):

Which measurement to use if/when the service automatically selects the final measurement from previously reported intermediate measurements. One of “BEST_MEASUREMENT” or “LAST_MEASUREMENT”.

training_machine_spec (Optional[Dict[str, Any]]):

The training machine spec. See https://cloud.google.com/compute/docs/machine-types for options.

training_disk_spec (Optional[Dict[str, Any]]):

The training disk spec.

instance_baseline (AutoMLTabularInstanceBaseline):

The path to a JSON file for baseline values.

metadata (TabularExampleGenMetadata):

The tabular example gen metadata.

materialized_train_split (MaterializedSplit):

The path to the materialized train split.

materialized_eval_split (MaterializedSplit):

The path to the materialized validation split.

transform_output (TransformOutput):

The path to transform output.

training_schema_uri (TrainingSchema):

The path to the training schema.

encryption_spec_key_name (Optional[str]):

The KMS key name.

Returns:
gcp_resources (str):

Serialized gcp_resources proto tracking the custom training job.

instance_schema_uri (str):

The path to the instance schema.

prediction_schema_uri (str):

The path to the prediction schema.

trials (str):

The path to the hyperparameter tuning trials

prediction_docker_uri_output (str):

The URI of the prediction container.

execution_metrics (JsonObject):

Core metrics in dictionary of hyperparameter tuning job execution.

google_cloud_pipeline_components.experimental.automl.tabular.TabNetTrainerOp(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, learning_rate: float, instance_baseline: AutoMLTabularInstanceBaseline, metadata: TabularExampleGenMetadata, materialized_train_split: MaterializedSplit, materialized_eval_split: MaterializedSplit, transform_output: TransformOutput, training_schema_uri: TrainingSchema, weight_column: str = '', max_steps: int = -1, max_train_secs: int = -1, large_category_dim: int = 1, large_category_thresh: int = 300, yeo_johnson_transform: bool = 'true', feature_dim: int = 64, feature_dim_ratio: float = 0.5, num_decision_steps: int = 6, relaxation_factor: float = 1.5, decay_every: float = 100, decay_rate: float = 0.95, gradient_thresh: float = 2000, sparsity_loss_weight: float = 1e-05, batch_momentum: float = 0.95, batch_size_ratio: float = 0.25, num_transformer_layers: int = 4, num_transformer_layers_ratio: float = 0.25, class_weight: float = 1.0, loss_function_type: str = 'default', alpha_focal_loss: float = 0.25, gamma_focal_loss: float = 2.0, enable_profiler: bool = 'false', cache_data: str = 'auto', seed: int = 1, eval_steps: int = 0, batch_size: int = 100, measurement_selection_type: str = 'BEST_MEASUREMENT', optimization_metric: str = '', eval_frequency_secs: int = 600, training_machine_spec: dict = '{"machine_type": "c2-standard-16"}', training_disk_spec: dict = '{"boot_disk_type": "pd-ssd", "boot_disk_size_gb": 100}', encryption_spec_key_name: str = '')

tabnet_trainer Launch a TabNet custom training job using Vertex CustomJob API.

Args:
project (str):

Required. The GCP project that runs the pipeline components.

location (str):

Required. The GCP region that runs the pipeline components.

root_dir (str):

Required. The root GCS directory for the pipeline components.

target_column (str):

Required. The target column name.

prediction_type (str):

Required. The type of prediction the model is to produce. “classification” or “regression”.

weight_column (Optional[str]):

The weight column name.

max_steps (Optional[int]):

Number of steps to run the trainer for.

max_train_secs (Optional[int]):

Amount of time in seconds to run the trainer for.

learning_rate (float):

The learning rate used by the linear optimizer.

large_category_dim (Optional[int]):

Embedding dimension for categorical feature with large number of categories.

large_category_thresh (Optional[int]):

Threshold on the number of categories above which the large_category_dim embedding dimension is applied.

yeo_johnson_transform (Optional[bool]):

Enables trainable Yeo-Johnson power transform.

feature_dim (Optional[int]):

Dimensionality of the hidden representation in feature transformation block.

feature_dim_ratio (Optional[float]):

The ratio of output dimension (dimensionality of the outputs of each decision step) to feature dimension.

num_decision_steps (Optional[int]):

Number of sequential decision steps.

relaxation_factor (Optional[float]):

Relaxation factor that promotes the reuse of each feature at different decision steps. When it is 1, a feature is enforced to be used only at one decision step and as it increases, more flexibility is provided to use a feature at multiple decision steps.

decay_every (Optional[float]):

Number of iterations for periodically applying learning rate decaying.

decay_rate (Optional[float]):

The rate at which the learning rate decays.

gradient_thresh (Optional[float]):

Threshold for the norm of gradients for clipping.

sparsity_loss_weight (Optional[float]):

Weight of the loss for sparsity regularization (increasing it will yield more sparse feature selection).

batch_momentum (Optional[float]):

Momentum in ghost batch normalization.

batch_size_ratio (Optional[float]):

The ratio of virtual batch size (size of the ghost batch normalization) to batch size.

num_transformer_layers (Optional[int]):

The number of transformer layers for each decision step.

num_transformer_layers_ratio (Optional[float]):

The ratio of shared transformer layer to transformer layers.

class_weight (Optional[float]):

The class weight is used to compute a weighted cross entropy, which is helpful for classifying imbalanced datasets. Only used for classification.

loss_function_type (Optional[str]):

Loss function type. For classification: [cross_entropy, weighted_cross_entropy, focal_loss]; default is cross_entropy. For regression: [rmse, mae, mse]; default is mse.

alpha_focal_loss (Optional[float]):

Alpha value (balancing factor) in focal_loss function. Only used for classification.

gamma_focal_loss (Optional[float]):

Gamma value (modulating factor) in the focal_loss function. Only used for classification.

enable_profiler (Optional[bool]):

Enables profiling and saves a trace during evaluation.

cache_data (Optional[str]):

Whether to cache data or not. If set to ‘auto’, caching is determined based on the dataset size.

seed (Optional[int]):

Seed to be used for this run.

eval_steps (Optional[int]):

Number of steps to run evaluation for. If not specified or negative, evaluation runs on the whole validation dataset. If set to 0, evaluation runs for a fixed number of samples.

batch_size (Optional[int]):

Batch size for training.

measurement_selection_type (Optional[str]):

Which measurement to use if/when the service automatically selects the final measurement from previously reported intermediate measurements. One of “BEST_MEASUREMENT” or “LAST_MEASUREMENT”.

optimization_metric (Optional[str]):

Optimization metric used for measurement_selection_type. Default is “rmse” for regression and “auc” for classification.

eval_frequency_secs (Optional[int]):

Frequency at which evaluation and checkpointing will take place.

training_machine_spec (Optional[Dict[str, Any]]):

The training machine spec. See https://cloud.google.com/compute/docs/machine-types for options.

training_disk_spec (Optional[Dict[str, Any]]):

The training disk spec.

instance_baseline (AutoMLTabularInstanceBaseline):

The path to a JSON file for baseline values.

metadata (TabularExampleGenMetadata):

The tabular example gen metadata.

materialized_train_split (MaterializedSplit):

The path to the materialized train split.

materialized_eval_split (MaterializedSplit):

The path to the materialized validation split.

transform_output (TransformOutput):

The path to transform output.

training_schema_uri (TrainingSchema):

The path to the training schema.

encryption_spec_key_name (Optional[str]):

The KMS key name.

Returns:
gcp_resources (str):

Serialized gcp_resources proto tracking the custom training job.

unmanaged_container_model (google.UnmanagedContainerModel):

The UnmanagedContainerModel artifact.
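
A hedged sketch of invoking TabNetTrainerOp inside a KFP v2 pipeline. The upstream artifacts are pulled in with dsl.importer purely for illustration (a real pipeline would wire them from the upstream components above, which expect their specialized artifact types rather than generic Artifacts); the URIs, column names, and hyperparameter values are placeholders.

from kfp.v2 import dsl

from google_cloud_pipeline_components.experimental.automl import tabular


@dsl.pipeline(name="tabnet-trainer-sketch")
def tabnet_trainer_sketch(project: str, location: str, root_dir: str):
  # Import previously produced artifacts by URI (placeholder URIs).
  def load(uri: str):
    return dsl.importer(artifact_uri=uri, artifact_class=dsl.Artifact)

  instance_baseline = load("gs://my-bucket/fte/instance_baseline")
  metadata = load("gs://my-bucket/fte/metadata")
  train_split = load("gs://my-bucket/fte/materialized_train_split")
  eval_split = load("gs://my-bucket/fte/materialized_eval_split")
  transform_output = load("gs://my-bucket/fte/transform_output")
  training_schema = load("gs://my-bucket/fte/training_schema")

  tabular.TabNetTrainerOp(
      project=project,
      location=location,
      root_dir=root_dir,
      target_column="label",  # hypothetical column name
      prediction_type="classification",
      learning_rate=0.01,
      max_steps=10000,
      instance_baseline=instance_baseline.output,
      metadata=metadata.output,
      materialized_train_split=train_split.output,
      materialized_eval_split=eval_split.output,
      transform_output=transform_output.output,
      training_schema_uri=training_schema.output,
  )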

google_cloud_pipeline_components.experimental.automl.tabular.TrainingConfiguratorAndValidatorOp(dataset_stats: AutoMLTabularDatasetStats, split_example_counts: str, training_schema: TrainingSchema, instance_schema: InstanceSchema, target_column: str = '', weight_column: str = '', prediction_type: str = '', optimization_objective: str = '', optimization_objective_recall_value: float = '-1', optimization_objective_precision_value: float = '-1', run_evaluation: bool = 'false', run_distill: bool = 'false', enable_probabilistic_inference: bool = 'false', time_series_identifier_column: str = '', time_column: str = '', time_series_attribute_columns: str = '', available_at_forecast_columns: str = '', unavailable_at_forecast_columns: str = '', quantiles: str = '', context_window: int = '-1', forecast_horizon: int = '-1', forecasting_model_type: str = '', forecasting_transformations_path: str = '')

training_configurator_and_validator Component to configure training and validate data and user-input configurations.

Args:
dataset_stats (AutoMLTabularDatasetStats):

Dataset stats generated by feature transform engine.

split_example_counts (str):

JSON string of data split example counts for train, validate, and test splits.

training_schema (TrainingSchema):

Schema of input data to the tf_model at training time.

instance_schema (InstanceSchema):

Schema of input data to the tf_model at serving time.

target_column (str):

Target column of input data.

weight_column (str):

Weight column of input data.

prediction_type (str):

Model prediction type. One of “classification”, “regression”, “time_series”.

optimization_objective (str):

Objective function the model is optimizing towards. The training process creates a model that maximizes/minimizes the value of the objective function over the validation set. The supported optimization objectives depend on the prediction type. If the field is not set, a default objective function is used.

classification (binary):

  • “maximize-au-roc” (default) - Maximize the area under the receiver operating characteristic (ROC) curve.

  • “minimize-log-loss” - Minimize log loss.

  • “maximize-au-prc” - Maximize the area under the precision-recall curve.

  • “maximize-precision-at-recall” - Maximize precision for a specified recall value.

  • “maximize-recall-at-precision” - Maximize recall for a specified precision value.

classification (multi-class):

  • “minimize-log-loss” (default) - Minimize log loss.

regression:

  • “minimize-rmse” (default) - Minimize root-mean-squared error (RMSE).

  • “minimize-mae” - Minimize mean-absolute error (MAE).

  • “minimize-rmsle” - Minimize root-mean-squared log error (RMSLE).

optimization_objective_recall_value (str):

Required when optimization_objective is “maximize-precision-at-recall”. Must be between 0 and 1, inclusive.

optimization_objective_precision_value (str):

Required when optimization_objective is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.

run_evaluation (bool):

Whether we are running evaluation in the training pipeline.

run_distill (bool):

Whether the distillation should be applied to the training.

enable_probabilistic_inference (bool):

If probabilistic inference is enabled, the model will fit a distribution that captures the uncertainty of a prediction. At inference time, the predictive distribution is used to make a point prediction that minimizes the optimization objective. For example, the mean of a predictive distribution is the point prediction that minimizes RMSE loss. If quantiles are specified, then the quantiles of the distribution are also returned.

time_series_identifier_column (str):

Time series identifier column. Used by forecasting only.

time_column (str):

The column that indicates the time. Used by forecasting only.

time_series_attribute_columns (str):

The column names of the time series attributes.

available_at_forecast_columns (str):

The names of the columns that are available at forecast time.

unavailable_at_forecast_columns (str):

The names of the columns that are not available at forecast time.

quantiles (str):

All quantiles that the model needs to predict.

context_window (int):

The length of the context window.

forecast_horizon (int):

The length of the forecast horizon.

forecasting_model_type (str):

The model type, e.g. l2l, seq2seq, tft.

forecasting_transformations_path (str):

The path to the JSON format forecasting transformations. Used by forecasting only.

Returns:
metadata (TabularExampleGenMetadata):

The tabular example gen metadata.

google_cloud_pipeline_components.experimental.automl.tabular.TransformOp()

automl_tabular_transform Transforms raw features into engineered features.

Args:
project (str):

Required. Project to run Cross-validation trainer.

location (str):

Location for running the Cross-validation trainer.

root_dir (str):

The Cloud Storage location to store the output.

metadata (TabularExampleGenMetadata):

The tabular example gen metadata.

dataset_schema (DatasetSchema):

The schema of the dataset.

train_split (Dataset):

The train split.

eval_split (Dataset):

The eval split.

test_split (Dataset):

The test split.

dataflow_machine_type (Optional[str]):

The machine type used for dataflow jobs. If not set, default to n1-standard-16.

dataflow_max_num_workers (Optional[int]):

The number of workers to run the dataflow job. If not set, default to 25.

dataflow_disk_size_gb (Optional[int]):

The disk size, in gigabytes, to use on each Dataflow worker instance. If not set, default to 40.

dataflow_subnetwork (Optional[str]):

Dataflow’s fully qualified subnetwork name, when empty the default subnetwork will be used. More details: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications

dataflow_use_public_ips (Optional[bool]):

Specifies whether Dataflow workers use public IP addresses.

dataflow_service_account (Optional[str]):

Custom service account to run dataflow jobs.

encryption_spec_key_name (Optional[str]):

Customer-managed encryption key.

Returns:
materialized_train_split (MaterializedSplit):

The materialized train split.

materialized_eval_split (MaterializedSplit):

The materialized eval split.

materialized_test_split (MaterializedSplit):

The materialized test split.

training_schema_uri (TrainingSchema):

The training schema.

transform_output (TransformOutput):

The transform output artifact.

gcp_resources (str):

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.

google_cloud_pipeline_components.experimental.automl.tabular.WideAndDeepHyperparameterTuningJobOp()

wide_and_deep_hyperparameter_tuning_job Launch a Wide & Deep hyperparameter tuning job using Vertex HyperparameterTuningJob API.

Args:
project (str):

Required. The GCP project that runs the pipeline components.

location (str):

Required. The GCP region that runs the pipeline components.

root_dir (str):

Required. The root GCS directory for the pipeline components.

target_column (str):

Required. The target column name.

prediction_type (str):

Required. The type of prediction the model is to produce. “classification” or “regression”.

weight_column (Optional[str]):

The weight column name.

enable_profiler (Optional[bool]):

Enables profiling and saves a trace during evaluation.

cache_data (Optional[str]):

Whether to cache data or not. If set to ‘auto’, caching is determined based on the dataset size.

seed (Optional[int]):

Seed to be used for this run.

eval_steps (Optional[int]):

Number of steps to run evaluation for. If not specified or negative, evaluation runs on the whole validation dataset. If set to 0, evaluation runs for a fixed number of samples.

eval_frequency_secs (Optional[int]):

Frequency at which evaluation and checkpointing will take place.

study_spec_metric_id (str):

Required. Metric to optimize. Possible values: [‘loss’, ‘average_loss’, ‘rmse’, ‘mae’, ‘mql’, ‘accuracy’, ‘auc’, ‘precision’, ‘recall’].

study_spec_metric_goal (str):

Required. Optimization goal of the metric, possible values: “MAXIMIZE”, “MINIMIZE”.

study_spec_parameters_override (list[str]):

List of dictionaries representing parameters to optimize. The dictionary key is the parameter_id, which is passed to training job as a command line argument, and the dictionary value is the parameter specification of the metric.

max_trial_count (int):

Required. The desired total number of trials.

parallel_trial_count (int):

Required. The desired number of trials to run in parallel.

max_failed_trial_count (Optional[int]):

The number of failed trials that need to be seen before failing the HyperparameterTuningJob. If set to 0, Vertex AI decides how many trials must fail before the whole job fails.

study_spec_algorithm (Optional[str]):

The search algorithm specified for the study. One of ‘ALGORITHM_UNSPECIFIED’, ‘GRID_SEARCH’, or ‘RANDOM_SEARCH’.

study_spec_measurement_selection_type (Optional[str]):

Which measurement to use if/when the service automatically selects the final measurement from previously reported intermediate measurements. One of “BEST_MEASUREMENT” or “LAST_MEASUREMENT”.

training_machine_spec (Optional[Dict[str, Any]]):

The training machine spec. See https://cloud.google.com/compute/docs/machine-types for options.

training_disk_spec (Optional[Dict[str, Any]]):

The training disk spec.

instance_baseline (AutoMLTabularInstanceBaseline):

The path to a JSON file for baseline values.

metadata (TabularExampleGenMetadata):

The tabular example gen metadata.

materialized_train_split (MaterializedSplit):

The path to the materialized train split.

materialized_eval_split (MaterializedSplit):

The path to the materialized validation split.

transform_output (TransformOutput):

The path to transform output.

training_schema_uri (TrainingSchema):

The path to the training schema.

encryption_spec_key_name (Optional[str]):

The KMS key name.

Returns:
gcp_resources (str):

Serialized gcp_resources proto tracking the custom training job.

instance_schema_uri (str):

The path to the instance schema.

prediction_schema_uri (str):

The path to the prediction schema.

trials (str):

The path to the hyperparameter tuning trials

prediction_docker_uri_output (str):

The URI of the prediction container.

execution_metrics (JsonObject):

Core metrics in dictionary of hyperparameter tuning job execution.

google_cloud_pipeline_components.experimental.automl.tabular.WideAndDeepTrainerOp()

wide_and_deep_trainer Launch a Wide & Deep custom training job using Vertex CustomJob API.

Args:
project (str):

Required. The GCP project that runs the pipeline components.

location (str):

Required. The GCP region that runs the pipeline components.

root_dir (str):

Required. The root GCS directory for the pipeline components.

target_column (str):

Required. The target column name.

prediction_type (str):

Required. The type of prediction the model is to produce. “classification” or “regression”.

weight_column (Optional[str]):

The weight column name.

max_steps (Optional[int]):

Number of steps to run the trainer for.

max_train_secs (Optional[int]):

Amount of time in seconds to run the trainer for.

learning_rate (float):

The learning rate used by the linear optimizer.

optimizer_type (Optional[str]):

The type of optimizer to use. Choices are “adam”, “ftrl” and “sgd” for the Adam, FTRL, and Gradient Descent Optimizers, respectively.

l1_regularization_strength (Optional[float]):

L1 regularization strength for optimizer_type=”ftrl”.

l2_regularization_strength (Optional[float]):

L2 regularization strength for optimizer_type=”ftrl”

l2_shrinkage_regularization_strength (Optional[float]):

L2 shrinkage regularization strength for optimizer_type=”ftrl”.

beta_1 (Optional[float]):

Beta 1 value for optimizer_type=”adam”.

beta_2 (Optional[float]):

Beta 2 value for optimizer_type=”adam”.

hidden_units (Optional[str]):

Hidden layer sizes to use for DNN feature columns, provided in comma-separated layers.

use_wide (Optional[bool]):

If set to true, the categorical columns will be used in the wide part of the DNN model.

embed_categories (Optional[bool]):

If set to true, the categorical columns will be embedded and used in the deep part of the model. Embedding size is the square root of the column cardinality.

dnn_dropout (Optional[float]):

The probability we will drop out a given coordinate.

dnn_learning_rate (Optional[float]):

The learning rate for training the deep part of the model.

dnn_optimizer_type (Optional[str]):

The type of optimizer to use for the deep part of the model. Choices are “adam”, “ftrl” and “sgd” for the Adam, FTRL, and Gradient Descent Optimizers, respectively.

dnn_l1_regularization_strength (Optional[float]):

L1 regularization strength for dnn_optimizer_type=”ftrl”.

dnn_l2_regularization_strength (Optional[float]):

L2 regularization strength for dnn_optimizer_type=”ftrl”.

dnn_l2_shrinkage_regularization_strength (Optional[float]):

L2 shrinkage regularization strength for dnn_optimizer_type=”ftrl”.

dnn_beta_1 (Optional[float]):

Beta 1 value for dnn_optimizer_type=”adam”.

dnn_beta_2 (Optional[float]):

Beta 2 value for dnn_optimizer_type=”adam”.

enable_profiler (Optional[bool]):

Enables profiling and saves a trace during evaluation.

cache_data (Optional[str]):

Whether to cache data or not. If set to ‘auto’, caching is determined based on the dataset size.

seed (Optional[int]):

Seed to be used for this run.

eval_steps (Optional[int]):

Number of steps to run evaluation for. If not specified or negative, evaluation runs on the whole validation dataset. If set to 0, evaluation runs for a fixed number of samples.

batch_size (Optional[int]):

Batch size for training.

measurement_selection_type (Optional[str]):

Which measurement to use if/when the service automatically selects the final measurement from previously reported intermediate measurements. One of “BEST_MEASUREMENT” or “LAST_MEASUREMENT”.

optimization_metric (Optional[str]):

Optimization metric used for measurement_selection_type. Default is “rmse” for regression and “auc” for classification.

eval_frequency_secs (Optional[int]):

Frequency at which evaluation and checkpointing will take place.

training_machine_spec (Optional[Dict[str, Any]]):

The training machine spec. See https://cloud.google.com/compute/docs/machine-types for options.

training_disk_spec (Optional[Dict[str, Any]]):

The training disk spec.

instance_baseline (AutoMLTabularInstanceBaseline):

The path to a JSON file for baseline values.

metadata (TabularExampleGenMetadata):

The tabular example gen metadata.

materialized_train_split (MaterializedSplit):

The path to the materialized train split.

materialized_eval_split (MaterializedSplit):

The path to the materialized validation split.

transform_output (TransformOutput):

The path to transform output.

training_schema_uri (TrainingSchema):

The path to the training schema.

encryption_spec_key_name (Optional[str]):

The KMS key name.

Returns:
gcp_resources (str):

Serialized gcp_resources proto tracking the custom training job.

unmanaged_container_model (google.UnmanagedContainerModel):

The UnmanagedContainerModel artifact.

google_cloud_pipeline_components.experimental.automl.tabular.XGBoostHyperparameterTuningJobOp()

xgboost_hyperparameter_tuning_job Launch an XGBoost hyperparameter tuning job using Vertex HyperparameterTuningJob API.

Args:
project (str):

Required. The GCP project that runs the pipeline components.

location (str):

Required. The GCP region that runs the pipeline components.

study_spec_metric_id (str):

Required. Metric to optimize. For options, please look under ‘eval_metric’ at https://xgboost.readthedocs.io/en/stable/parameter.html#learning-task-parameters.

study_spec_metric_goal (str):

Required. Optimization goal of the metric, possible values: “MAXIMIZE”, “MINIMIZE”.

study_spec_parameters_override (list[str]):

List of dictionaries representing parameters to optimize. The dictionary key is the parameter_id, which is passed to training job as a command line argument, and the dictionary value is the parameter specification of the metric.

max_trial_count (int):

Required. The desired total number of trials.

parallel_trial_count (int):

Required. The desired number of trials to run in parallel.

max_failed_trial_count (Optional[int]):

The number of failed trials that need to be seen before failing the HyperparameterTuningJob. If set to 0, Vertex AI decides how many trials must fail before the whole job fails.

study_spec_algorithm (Optional[str]):

The search algorithm specified for the study. One of ‘ALGORITHM_UNSPECIFIED’, ‘GRID_SEARCH’, or ‘RANDOM_SEARCH’.

study_spec_measurement_selection_type (Optional[str]):

Which measurement to use if/when the service automatically selects the final measurement from previously reported intermediate measurements. One of “BEST_MEASUREMENT” or “LAST_MEASUREMENT”.

worker_pool_specs (JsonArray):

The worker pool specs.

encryption_spec_key_name (Optional[str]):

The KMS key name.

Returns:
gcp_resources (str):

Serialized gcp_resources proto tracking the custom training job.

google_cloud_pipeline_components.experimental.automl.tabular.XGBoostTrainerOp()

xgboost_trainer Launch an XGBoost custom training job using Vertex CustomJob API.

Args:
project (str):

Required. The GCP project that runs the pipeline components.

location (str):

Required. The GCP region that runs the pipeline components.

worker_pool_specs (JsonArray):

The worker pool specs. See the illustrative example at the end of this entry.

encryption_spec_key_name (Optional[str]):

The KMS key name.

Returns:
gcp_resources (str):

Serialized gcp_resources proto tracking the custom training job.
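
As a rough illustration of the worker_pool_specs argument above, here is a single-pool spec in the Vertex AI CustomJob worker pool format. The machine type, image URI, and trainer arguments are placeholders chosen for the sketch, not values prescribed by this component.

worker_pool_specs = [
  {
    "machine_spec": {"machine_type": "n1-standard-16"},
    "replica_count": 1,
    "container_spec": {
        # Placeholder training image and arguments.
        "image_uri": "us-docker.pkg.dev/my-project/my-repo/xgboost-trainer:latest",
        "args": ["--objective=reg:squarederror", "--num_boost_round=100"],
    },
  }
]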