google_cloud_pipeline_components.experimental.automl.tabular package

Submodules

google_cloud_pipeline_components.experimental.automl.tabular.utils module

Util functions for AutoML Tabular pipeline.

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_automl_tabular_feature_selection_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, optimization_objective: str, transformations: str, train_budget_milli_node_hours: float, stage_1_num_parallel_trials: Optional[int] = None, stage_2_num_parallel_trials: Optional[int] = None, stage_2_num_selected_trials: Optional[int] = None, data_source_csv_filenames: Optional[str] = None, data_source_bigquery_table_path: Optional[str] = None, predefined_split_key: Optional[str] = None, timestamp_split_key: Optional[str] = None, stratified_split_key: Optional[str] = None, training_fraction: Optional[float] = None, validation_fraction: Optional[float] = None, test_fraction: Optional[float] = None, weight_column: Optional[str] = None, study_spec_parameters_override: Optional[List[Dict[str, Any]]] = None, optimization_objective_recall_value: Optional[float] = None, optimization_objective_precision_value: Optional[float] = None, stage_1_tuner_worker_pool_specs_override: Optional[Dict[str, Any]] = None, cv_trainer_worker_pool_specs_override: Optional[Dict[str, Any]] = None, export_additional_model_without_custom_ops: bool = False, stats_and_example_gen_dataflow_machine_type: Optional[str] = None, stats_and_example_gen_dataflow_max_num_workers: Optional[int] = None, stats_and_example_gen_dataflow_disk_size_gb: Optional[int] = None, transform_dataflow_machine_type: Optional[str] = None, transform_dataflow_max_num_workers: Optional[int] = None, transform_dataflow_disk_size_gb: Optional[int] = None, dataflow_subnetwork: Optional[str] = None, dataflow_use_public_ips: bool = True, encryption_spec_key_name: Optional[str] = None, additional_experiments: Optional[Dict[str, Any]] = None, dataflow_service_account: Optional[str] = None, run_evaluation: bool = True, evaluation_batch_predict_machine_type: Optional[str] = None, evaluation_batch_predict_starting_replica_count: Optional[int] = None, evaluation_batch_predict_max_replica_count: Optional[int] = None, evaluation_dataflow_machine_type: Optional[str] = None, evaluation_dataflow_max_num_workers: Optional[int] = None, evaluation_dataflow_disk_size_gb: Optional[int] = None, max_selected_features: int = 1000, apply_feature_selection_tuning: bool = False, run_distillation: bool = False, distill_batch_predict_machine_type: Optional[str] = None, distill_batch_predict_starting_replica_count: Optional[int] = None, distill_batch_predict_max_replica_count: Optional[int] = None) Tuple[str, Dict[str, Any]]

Get the AutoML Tabular v1 default training pipeline with feature selection.

Args:
project:

The GCP project that runs the pipeline components.

location:

The GCP region that runs the pipeline components.

root_dir:

The root GCS directory for the pipeline components.

target_column:

The target column name.

prediction_type:

The type of prediction the model is to produce. “classification” or “regression”.

optimization_objective:

For binary classification, “maximize-au-roc”, “minimize-log-loss”, “maximize-au-prc”, “maximize-precision-at-recall”, or “maximize-recall-at-precision”. For multi class classification, “minimize-log-loss”. For regression, “minimize-rmse”, “minimize-mae”, or “minimize-rmsle”.

transformations:

The path to a GCS file containing the transformations to apply.

train_budget_milli_node_hours:

The train budget for creating this model, expressed in milli node hours, i.e. a value of 1,000 in this field means 1 node hour.

stage_1_num_parallel_trials:

Number of parallel trials for stage 1.

stage_2_num_parallel_trials:

Number of parallel trials for stage 2.

stage_2_num_selected_trials:

Number of selected trials for stage 2.

data_source_csv_filenames:

The CSV data source.

data_source_bigquery_table_path:

The BigQuery data source.

predefined_split_key:

The predefined_split column name.

timestamp_split_key:

The timestamp_split column name.

stratified_split_key:

The stratified_split column name.

training_fraction:

The training fraction.

validation_fraction:

The validation fraction.

test_fraction:

The test fraction.

weight_column:

The weight column name.

study_spec_parameters_override:

The list for overriding study spec. The list should be of format https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/study.proto#L181.

optimization_objective_recall_value:

Required when optimization_objective is “maximize-precision-at-recall”. Must be between 0 and 1, inclusive.

optimization_objective_precision_value:

Required when optimization_objective is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.

stage_1_tuner_worker_pool_specs_override:

The dictionary for overriding the stage 1 tuner worker pool spec.

cv_trainer_worker_pool_specs_override:

The dictionary for overriding the cv trainer worker pool spec.

export_additional_model_without_custom_ops:

Whether to export an additional model without custom TensorFlow operators.

stats_and_example_gen_dataflow_machine_type:

The dataflow machine type for stats_and_example_gen component.

stats_and_example_gen_dataflow_max_num_workers:

The max number of Dataflow workers for stats_and_example_gen component.

stats_and_example_gen_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for stats_and_example_gen component.

transform_dataflow_machine_type:

The dataflow machine type for transform component.

transform_dataflow_max_num_workers:

The max number of Dataflow workers for transform component.

transform_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for transform component.

dataflow_subnetwork:

Dataflow’s fully qualified subnetwork name; when empty, the default subnetwork will be used.

dataflow_use_public_ips:

Specifies whether Dataflow workers use public IP addresses.

encryption_spec_key_name:

The KMS key name.

additional_experiments:

Use this field to configure private preview features.

dataflow_service_account:

Custom service account to run dataflow jobs.

run_evaluation:

Whether to run evaluation in the training pipeline.

evaluation_batch_predict_machine_type:

The prediction server machine type for batch predict components during evaluation.

evaluation_batch_predict_starting_replica_count:

The initial number of prediction server replicas for batch predict components during evaluation.

evaluation_batch_predict_max_replica_count:

The max number of prediction server replicas for batch predict components during evaluation.

evaluation_dataflow_machine_type:

The dataflow machine type for evaluation components.

evaluation_dataflow_max_num_workers:

The max number of Dataflow workers for evaluation components.

evaluation_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for evaluation components.

max_selected_features:

Number of features to select for training.

apply_feature_selection_tuning:

If true, the feature selection rate is tuned.

run_distillation:

Whether to run distillation in the training pipeline.

distill_batch_predict_machine_type:

The prediction server machine type for batch predict component in the model distillation.

distill_batch_predict_starting_replica_count:

The initial number of prediction server replicas for the batch predict component in the model distillation.

distill_batch_predict_max_replica_count:

The max number of prediction server replicas for the batch predict component in the model distillation.

Returns:

Tuple of pipeline_definition_path and parameter_values.
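
For illustration, a minimal usage sketch follows. The project, bucket, and column names are placeholders, and submitting the returned pipeline definition through the Vertex AI SDK (google.cloud.aiplatform.PipelineJob) is one common pattern rather than part of this function:

    from google.cloud import aiplatform
    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    # Build the pipeline definition and its parameter values (placeholder inputs).
    template_path, parameter_values = (
        utils.get_automl_tabular_feature_selection_pipeline_and_parameters(
            project="my-project",
            location="us-central1",
            root_dir="gs://my-bucket/pipeline_root",
            target_column="label",
            prediction_type="classification",
            optimization_objective="maximize-au-roc",
            transformations="gs://my-bucket/transformations.json",
            train_budget_milli_node_hours=1000,  # 1 node hour
            data_source_csv_filenames="gs://my-bucket/train.csv",
            max_selected_features=500,
            apply_feature_selection_tuning=True,
        )
    )

    # Submit the compiled pipeline with the returned parameter values.
    aiplatform.init(project="my-project", location="us-central1")
    aiplatform.PipelineJob(
        display_name="automl-tabular-feature-selection",
        template_path=template_path,
        parameter_values=parameter_values,
        pipeline_root="gs://my-bucket/pipeline_root",
    ).run()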

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_automl_tabular_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, optimization_objective: str, transformations: str, train_budget_milli_node_hours: float, stage_1_num_parallel_trials: Optional[int] = None, stage_2_num_parallel_trials: Optional[int] = None, stage_2_num_selected_trials: Optional[int] = None, data_source_csv_filenames: Optional[str] = None, data_source_bigquery_table_path: Optional[str] = None, predefined_split_key: Optional[str] = None, timestamp_split_key: Optional[str] = None, stratified_split_key: Optional[str] = None, training_fraction: Optional[float] = None, validation_fraction: Optional[float] = None, test_fraction: Optional[float] = None, weight_column: Optional[str] = None, study_spec_parameters_override: Optional[List[Dict[str, Any]]] = None, optimization_objective_recall_value: Optional[float] = None, optimization_objective_precision_value: Optional[float] = None, stage_1_tuner_worker_pool_specs_override: Optional[Dict[str, Any]] = None, cv_trainer_worker_pool_specs_override: Optional[Dict[str, Any]] = None, export_additional_model_without_custom_ops: bool = False, stats_and_example_gen_dataflow_machine_type: Optional[str] = None, stats_and_example_gen_dataflow_max_num_workers: Optional[int] = None, stats_and_example_gen_dataflow_disk_size_gb: Optional[int] = None, transform_dataflow_machine_type: Optional[str] = None, transform_dataflow_max_num_workers: Optional[int] = None, transform_dataflow_disk_size_gb: Optional[int] = None, dataflow_subnetwork: Optional[str] = None, dataflow_use_public_ips: bool = True, encryption_spec_key_name: Optional[str] = None, additional_experiments: Optional[Dict[str, Any]] = None, dataflow_service_account: Optional[str] = None, run_evaluation: bool = True, evaluation_batch_predict_machine_type: Optional[str] = None, evaluation_batch_predict_starting_replica_count: Optional[str] = None, evaluation_batch_predict_max_replica_count: Optional[str] = None, evaluation_dataflow_machine_type: Optional[str] = None, evaluation_dataflow_max_num_workers: Optional[int] = None, evaluation_dataflow_disk_size_gb: Optional[int] = None, run_distillation: bool = False, distill_batch_predict_machine_type: Optional[str] = None, distill_batch_predict_starting_replica_count: Optional[int] = None, distill_batch_predict_max_replica_count: Optional[int] = None) Tuple[str, Dict[str, Any]]

Get the AutoML Tabular v1 default training pipeline.

Args:
project:

The GCP project that runs the pipeline components.

location:

The GCP region that runs the pipeline components.

root_dir:

The root GCS directory for the pipeline components.

target_column:

The target column name.

prediction_type:

The type of prediction the model is to produce. “classification” or “regression”.

optimization_objective:

For binary classification, “maximize-au-roc”, “minimize-log-loss”, “maximize-au-prc”, “maximize-precision-at-recall”, or “maximize-recall-at-precision”. For multi class classification, “minimize-log-loss”. For regression, “minimize-rmse”, “minimize-mae”, or “minimize-rmsle”.

transformations:

The path to a GCS file containing the transformations to apply.

train_budget_milli_node_hours:

The train budget for creating this model, expressed in milli node hours, i.e. a value of 1,000 in this field means 1 node hour.

stage_1_num_parallel_trials:

Number of parallel trials for stage 1.

stage_2_num_parallel_trials:

Number of parallel trials for stage 2.

stage_2_num_selected_trials:

Number of selected trials for stage 2.

data_source_csv_filenames:

The CSV data source.

data_source_bigquery_table_path:

The BigQuery data source.

predefined_split_key:

The predefined_split column name.

timestamp_split_key:

The timestamp_split column name.

stratified_split_key:

The stratified_split column name.

training_fraction:

The training fraction.

validation_fraction:

The validation fraction.

test_fraction:

The test fraction.

weight_column:

The weight column name.

study_spec_parameters_override:

The list for overriding study spec. The list should be of format https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/study.proto#L181.

optimization_objective_recall_value:

Required when optimization_objective is “maximize-precision-at-recall”. Must be between 0 and 1, inclusive.

optimization_objective_precision_value:

Required when optimization_objective is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.

stage_1_tuner_worker_pool_specs_override:

The dictionary for overriding the stage 1 tuner worker pool spec.

cv_trainer_worker_pool_specs_override:

The dictionary for overriding the cv trainer worker pool spec.

export_additional_model_without_custom_ops:

Whether to export an additional model without custom TensorFlow operators.

stats_and_example_gen_dataflow_machine_type:

The dataflow machine type for stats_and_example_gen component.

stats_and_example_gen_dataflow_max_num_workers:

The max number of Dataflow workers for stats_and_example_gen component.

stats_and_example_gen_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for stats_and_example_gen component.

transform_dataflow_machine_type:

The dataflow machine type for transform component.

transform_dataflow_max_num_workers:

The max number of Dataflow workers for transform component.

transform_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for transform component.

dataflow_subnetwork:

Dataflow’s fully qualified subnetwork name; when empty, the default subnetwork will be used.

dataflow_use_public_ips:

Specifies whether Dataflow workers use public IP addresses.

encryption_spec_key_name:

The KMS key name.

additional_experiments:

Use this field to configure private preview features.

dataflow_service_account:

Custom service account to run dataflow jobs.

run_evaluation:

Whether to run evaluation in the training pipeline.

evaluation_batch_predict_machine_type:

The prediction server machine type for batch predict components during evaluation.

evaluation_batch_predict_starting_replica_count:

The initial number of prediction server replicas for batch predict components during evaluation.

evaluation_batch_predict_max_replica_count:

The max number of prediction server replicas for batch predict components during evaluation.

evaluation_dataflow_machine_type:

The dataflow machine type for evaluation components.

evaluation_dataflow_max_num_workers:

The max number of Dataflow workers for evaluation components.

evaluation_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for evaluation components.

run_distillation:

Whether to run distillation in the training pipeline.

distill_batch_predict_machine_type:

The prediction server machine type for batch predict component in the model distillation.

distill_batch_predict_starting_replica_count:

The initial number of prediction server replicas for the batch predict component in the model distillation.

distill_batch_predict_max_replica_count:

The max number of prediction server replicas for the batch predict component in the model distillation.

Returns:

Tuple of pipeline_definition_path and parameter_values.
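
As a hedged sketch of typical usage, the example below assumes a BigQuery data source and an explicit fraction split; all resource names are placeholders, and the returned tuple can be submitted the same way as shown for the feature selection pipeline above:

    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    template_path, parameter_values = utils.get_automl_tabular_pipeline_and_parameters(
        project="my-project",
        location="us-central1",
        root_dir="gs://my-bucket/pipeline_root",
        target_column="median_house_value",
        prediction_type="regression",
        optimization_objective="minimize-rmse",
        transformations="gs://my-bucket/transformations.json",
        train_budget_milli_node_hours=2000,  # 2 node hours
        data_source_bigquery_table_path="bq://my-project.my_dataset.my_table",
        training_fraction=0.8,
        validation_fraction=0.1,
        test_fraction=0.1,
    )

    # parameter_values is a plain dict, so individual entries can be inspected or
    # overridden before the pipeline definition is submitted.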

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_builtin_algorithm_hyperparameter_tuning_job_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, transform_config: str, study_spec_metrics: List[Dict[str, Any]], study_spec_parameters_override: List[Dict[str, Any]], max_trial_count: int, parallel_trial_count: int, algorithm: str, enable_profiler: bool = False, seed: int = 1, eval_steps: int = 0, eval_frequency_secs: int = 600, data_source_csv_filenames: Optional[str] = None, data_source_bigquery_table_path: Optional[str] = None, predefined_split_key: Optional[str] = None, timestamp_split_key: Optional[str] = None, stratified_split_key: Optional[str] = None, training_fraction: Optional[float] = None, validation_fraction: Optional[float] = None, test_fraction: Optional[float] = None, weight_column: str = '', max_failed_trial_count: int = 0, study_spec_algorithm: str = 'ALGORITHM_UNSPECIFIED', study_spec_measurement_selection_type: str = 'BEST_MEASUREMENT', stats_and_example_gen_dataflow_machine_type: str = 'n1-standard-16', stats_and_example_gen_dataflow_max_num_workers: int = 25, stats_and_example_gen_dataflow_disk_size_gb: int = 40, transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, training_machine_spec: Optional[Dict[str, Any]] = None, training_replica_count: int = 1, run_evaluation: bool = True, evaluation_batch_predict_machine_type: str = 'n1-standard-16', evaluation_batch_predict_starting_replica_count: int = 25, evaluation_batch_predict_max_replica_count: int = 25, evaluation_dataflow_machine_type: str = 'n1-standard-4', evaluation_dataflow_max_num_workers: int = 25, evaluation_dataflow_disk_size_gb: int = 50, dataflow_service_account: str = '', dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '') Tuple[str, Dict[str, Any]]

Get the built-in algorithm HyperparameterTuningJob pipeline.

Args:
project:

The GCP project that runs the pipeline components.

location:

The GCP region that runs the pipeline components.

root_dir:

The root GCS directory for the pipeline components.

target_column:

The target column name.

prediction_type:

The type of prediction the model is to produce. “classification” or “regression”.

transform_config:

The path to a GCS file containing the transformations to apply.

study_spec_metrics:

List of dictionaries representing metrics to optimize. The dictionary contains the metric_id, which is reported by the training job, and the optimization goal of the metric. One of “minimize” or “maximize”.

study_spec_parameters_override:

List of dictionaries representing parameters to optimize. The dictionary key is the parameter_id, which is passed to the training job as a command line argument, and the dictionary value is the parameter specification.

max_trial_count:

The desired total number of trials.

parallel_trial_count:

The desired number of trials to run in parallel.

algorithm:

Algorithm to train. One of “tabnet” and “wide_and_deep”.

enable_profiler:

Enables profiling and saves a trace during evaluation.

seed:

Seed to be used for this run.

eval_steps:

Number of steps to run evaluation for. If not specified or negative, evaluation runs on the whole validation dataset. If set to 0, evaluation runs for a fixed number of samples.

eval_frequency_secs:

Frequency at which evaluation and checkpointing will take place.

data_source_csv_filenames:

The CSV data source.

data_source_bigquery_table_path:

The BigQuery data source.

predefined_split_key:

The predefined_split column name.

timestamp_split_key:

The timestamp_split column name.

stratified_split_key:

The stratified_split column name.

training_fraction:

The training fraction.

validation_fraction:

The validation fraction.

test_fraction:

The test fraction.

weight_column:

The weight column name.

max_failed_trial_count:

The number of failed trials that need to be seen before failing the HyperparameterTuningJob. If set to 0, Vertex AI decides how many trials must fail before the whole job fails.

study_spec_algorithm:

The search algorithm specified for the study. One of “ALGORITHM_UNSPECIFIED”, “GRID_SEARCH”, or “RANDOM_SEARCH”.

study_spec_measurement_selection_type:

Which measurement to use if/when the service automatically selects the final measurement from previously reported intermediate measurements. One of “BEST_MEASUREMENT” or “LAST_MEASUREMENT”.

stats_and_example_gen_dataflow_machine_type:

The dataflow machine type for stats_and_example_gen component.

stats_and_example_gen_dataflow_max_num_workers:

The max number of Dataflow workers for stats_and_example_gen component.

stats_and_example_gen_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for stats_and_example_gen component.

transform_dataflow_machine_type:

The dataflow machine type for transform component.

transform_dataflow_max_num_workers:

The max number of Dataflow workers for transform component.

transform_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for transform component.

training_machine_spec:

The machine spec for trainer component. See https://cloud.google.com/compute/docs/machine-types for options.

training_replica_count:

The replica count for the trainer component.

run_evaluation:

Whether to run evaluation steps during training.

evaluation_batch_predict_machine_type:

The prediction server machine type for batch predict components during evaluation.

evaluation_batch_predict_starting_replica_count:

The initial number of prediction server replicas for batch predict components during evaluation.

evaluation_batch_predict_max_replica_count:

The max number of prediction server replicas for batch predict components during evaluation.

evaluation_dataflow_machine_type:

The dataflow machine type for evaluation components.

evaluation_dataflow_max_num_workers:

The max number of Dataflow workers for evaluation components.

evaluation_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for evaluation components.

dataflow_service_account:

Custom service account to run dataflow jobs.

dataflow_subnetwork:

Dataflow’s fully qualified subnetwork name; when empty, the default subnetwork will be used.

dataflow_use_public_ips:

Specifies whether Dataflow workers use public IP addresses.

encryption_spec_key_name:

The KMS key name.

Returns:

Tuple of pipeline_definition_path and parameter_values.
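
A sketch of wiring the study spec arguments together is shown below; the metric_id value and all resource names are placeholder assumptions, and the parameter search space is taken from the get_wide_and_deep_study_spec_parameters_override helper documented later in this module:

    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    # Metrics to optimize: each entry names a metric reported by the training job
    # and the optimization goal ("minimize" or "maximize", per the docstring above).
    study_spec_metrics = [{"metric_id": "loss", "goal": "minimize"}]

    # Parameter search space, taken from the Wide & Deep helper in this module.
    study_spec_parameters_override = utils.get_wide_and_deep_study_spec_parameters_override()

    template_path, parameter_values = (
        utils.get_builtin_algorithm_hyperparameter_tuning_job_pipeline_and_parameters(
            project="my-project",
            location="us-central1",
            root_dir="gs://my-bucket/pipeline_root",
            target_column="label",
            prediction_type="classification",
            transform_config="gs://my-bucket/transform_config.json",
            study_spec_metrics=study_spec_metrics,
            study_spec_parameters_override=study_spec_parameters_override,
            max_trial_count=20,
            parallel_trial_count=5,
            algorithm="wide_and_deep",
            data_source_csv_filenames="gs://my-bucket/train.csv",
            training_fraction=0.8,
            validation_fraction=0.1,
            test_fraction=0.1,
        )
    )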

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_default_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column_name: str, prediction_type: str, optimization_objective: str, transformations: Dict[str, Any], split_spec: Dict[str, Any], data_source: Dict[str, Any], train_budget_milli_node_hours: float, stage_1_num_parallel_trials: int = 35, stage_2_num_parallel_trials: int = 35, stage_2_num_selected_trials: int = 5, weight_column_name: str = '', study_spec_override: Optional[Dict[str, Any]] = None, optimization_objective_recall_value: float = -1, optimization_objective_precision_value: float = -1, stage_1_tuner_worker_pool_specs_override: Optional[Dict[str, Any]] = None, cv_trainer_worker_pool_specs_override: Optional[Dict[str, Any]] = None, export_additional_model_without_custom_ops: bool = False, stats_and_example_gen_dataflow_machine_type: str = 'n1-standard-16', stats_and_example_gen_dataflow_max_num_workers: int = 25, stats_and_example_gen_dataflow_disk_size_gb: int = 40, transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '', additional_experiments: Optional[Dict[str, Any]] = None, dataflow_service_account: str = '', run_evaluation: bool = True, evaluation_batch_predict_machine_type: str = 'n1-standard-16', evaluation_batch_predict_starting_replica_count: int = 25, evaluation_batch_predict_max_replica_count: int = 25, evaluation_dataflow_machine_type: str = 'n1-standard-4', evaluation_dataflow_max_num_workers: int = 25, evaluation_dataflow_disk_size_gb: int = 50, run_distillation: bool = False, distill_batch_predict_machine_type: str = 'n1-standard-16', distill_batch_predict_starting_replica_count: int = 25, distill_batch_predict_max_replica_count: int = 25) Tuple[str, Dict[str, Any]]

Get the AutoML Tabular default training pipeline.

Args:
project:

The GCP project that runs the pipeline components.

location:

The GCP region that runs the pipeline components.

root_dir:

The root GCS directory for the pipeline components.

target_column_name:

The target column name.

prediction_type:

The type of prediction the model is to produce. “classification” or “regression”.

optimization_objective:

For binary classification, “maximize-au-roc”, “minimize-log-loss”, “maximize-au-prc”, “maximize-precision-at-recall”, or “maximize-recall-at-precision”. For multi class classification, “minimize-log-loss”. For regression, “minimize-rmse”, “minimize-mae”, or “minimize-rmsle”.

transformations:

The transformations to apply.

split_spec:

The split spec.

data_source:

The data source.

train_budget_milli_node_hours:

The train budget for creating this model, expressed in milli node hours, i.e. a value of 1,000 in this field means 1 node hour.

stage_1_num_parallel_trials:

Number of parallel trials for stage 1.

stage_2_num_parallel_trials:

Number of parallel trials for stage 2.

stage_2_num_selected_trials:

Number of selected trials for stage 2.

weight_column_name:

The weight column name.

study_spec_override:

The dictionary for overriding study spec. The dictionary should be of format https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/study.proto#L181.

optimization_objective_recall_value:

Required when optimization_objective is “maximize-precision-at-recall”. Must be between 0 and 1, inclusive.

optimization_objective_precision_value:

Required when optimization_objective is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.

stage_1_tuner_worker_pool_specs_override:

The dictionary for overriding the stage 1 tuner worker pool spec.

cv_trainer_worker_pool_specs_override:

The dictionary for overriding the cv trainer worker pool spec.

export_additional_model_without_custom_ops:

Whether to export an additional model without custom TensorFlow operators.

stats_and_example_gen_dataflow_machine_type:

The dataflow machine type for stats_and_example_gen component.

stats_and_example_gen_dataflow_max_num_workers:

The max number of Dataflow workers for stats_and_example_gen component.

stats_and_example_gen_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for stats_and_example_gen component.

transform_dataflow_machine_type:

The dataflow machine type for transform component.

transform_dataflow_max_num_workers:

The max number of Dataflow workers for transform component.

transform_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for transform component.

dataflow_subnetwork:

Dataflow’s fully qualified subnetwork name; when empty, the default subnetwork will be used.

dataflow_use_public_ips:

Specifies whether Dataflow workers use public IP addresses.

encryption_spec_key_name:

The KMS key name.

additional_experiments:

Use this field to configure private preview features.

dataflow_service_account:

Custom service account to run dataflow jobs.

run_evaluation:

Whether to run evaluation in the training pipeline.

evaluation_batch_predict_machine_type:

The prediction server machine type for batch predict components during evaluation.

evaluation_batch_predict_starting_replica_count:

The initial number of prediction server replicas for batch predict components during evaluation.

evaluation_batch_predict_max_replica_count:

The max number of prediction server replicas for batch predict components during evaluation.

evaluation_dataflow_machine_type:

The dataflow machine type for evaluation components.

evaluation_dataflow_max_num_workers:

The max number of Dataflow workers for evaluation components.

evaluation_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for evaluation components.

run_distillation:

Whether to run distillation in the training pipeline.

distill_batch_predict_machine_type:

The prediction server machine type for batch predict component in the model distillation.

distill_batch_predict_starting_replica_count:

The initial number of prediction server replicas for the batch predict component in the model distillation.

distill_batch_predict_max_replica_count:

The max number of prediction server replicas for the batch predict component in the model distillation.

Returns:

Tuple of pipeline_definition_path and parameter_values.

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_distill_skip_evaluation_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column_name: str, prediction_type: str, optimization_objective: str, transformations: Dict[str, Any], split_spec: Dict[str, Any], data_source: Dict[str, Any], train_budget_milli_node_hours: float, stage_1_num_parallel_trials: int = 35, stage_2_num_parallel_trials: int = 35, stage_2_num_selected_trials: int = 5, weight_column_name: str = '', study_spec_override: Optional[Dict[str, Any]] = None, optimization_objective_recall_value: float = -1, optimization_objective_precision_value: float = -1, stage_1_tuner_worker_pool_specs_override: Optional[Dict[str, Any]] = None, cv_trainer_worker_pool_specs_override: Optional[Dict[str, Any]] = None, export_additional_model_without_custom_ops: bool = False, stats_and_example_gen_dataflow_machine_type: str = 'n1-standard-16', stats_and_example_gen_dataflow_max_num_workers: int = 25, stats_and_example_gen_dataflow_disk_size_gb: int = 40, transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '', additional_experiments: Optional[Dict[str, Any]] = None, distill_batch_predict_machine_type: str = 'n1-standard-16', distill_batch_predict_starting_replica_count: int = 25, distill_batch_predict_max_replica_count: int = 25) Tuple[str, Dict[str, Any]]

Get the AutoML Tabular training pipeline that distills and skips evaluation.

Args:
project:

The GCP project that runs the pipeline components.

location:

The GCP region that runs the pipeline components.

root_dir:

The root GCS directory for the pipeline components.

target_column_name:

The target column name.

prediction_type:

The type of prediction the model is to produce. “classification” or “regression”.

optimization_objective:

For binary classification, “maximize-au-roc”, “minimize-log-loss”, “maximize-au-prc”, “maximize-precision-at-recall”, or “maximize-recall-at-precision”. For multi class classification, “minimize-log-loss”. For regression, “minimize-rmse”, “minimize-mae”, or “minimize-rmsle”.

transformations:

The transformations to apply.

split_spec:

The split spec.

data_source:

The data source.

train_budget_milli_node_hours:

The train budget for creating this model, expressed in milli node hours, i.e. a value of 1,000 in this field means 1 node hour.

stage_1_num_parallel_trials:

Number of parallel trials for stage 1.

stage_2_num_parallel_trials:

Number of parallel trials for stage 2.

stage_2_num_selected_trials:

Number of selected trials for stage 2.

weight_column_name:

The weight column name.

study_spec_override:

The dictionary for overriding study spec. The dictionary should be of format https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/study.proto#L181.

optimization_objective_recall_value:

Required when optimization_objective is “maximize-precision-at-recall”. Must be between 0 and 1, inclusive.

optimization_objective_precision_value:

Required when optimization_objective is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.

stage_1_tuner_worker_pool_specs_override:

The dictionary for overriding the stage 1 tuner worker pool spec.

cv_trainer_worker_pool_specs_override:

The dictionary for overriding the cv trainer worker pool spec.

export_additional_model_without_custom_ops:

Whether to export an additional model without custom TensorFlow operators.

stats_and_example_gen_dataflow_machine_type:

The dataflow machine type for stats_and_example_gen component.

stats_and_example_gen_dataflow_max_num_workers:

The max number of Dataflow workers for stats_and_example_gen component.

stats_and_example_gen_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for stats_and_example_gen component.

transform_dataflow_machine_type:

The dataflow machine type for transform component.

transform_dataflow_max_num_workers:

The max number of Dataflow workers for transform component.

transform_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for transform component.

dataflow_subnetwork:

Dataflow’s fully qualified subnetwork name; when empty, the default subnetwork will be used.

dataflow_use_public_ips:

Specifies whether Dataflow workers use public IP addresses.

encryption_spec_key_name:

The KMS key name.

additional_experiments:

Use this field to configure private preview features.

distill_batch_predict_machine_type:

The prediction server machine type for batch predict component in the model distillation.

distill_batch_predict_starting_replica_count:

The initial number of prediction server replicas for the batch predict component in the model distillation.

distill_batch_predict_max_replica_count:

The max number of prediction server replicas for the batch predict component in the model distillation.

Returns:

Tuple of pipeline_definition_path and parameter_values.

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_feature_selection_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, algorithm: str, prediction_type: str, data_source_csv_filenames: Optional[str] = None, data_source_bigquery_table_path: Optional[str] = None, max_selected_features: Optional[int] = None)

Get the feature selection pipeline that generates feature ranking and selected features.

Args:
project:

The GCP project that runs the pipeline components.

location:

The GCP region that runs the pipeline components.

root_dir:

The root GCS directory for the pipeline components.

target_column:

The target column name.

algorithm:

The algorithm used to select features. Defaults to AMI.

prediction_type:

The type of prediction the model is to produce. “classification” or “regression”.

data_source_csv_filenames:

A string that represents a list of comma separated CSV filenames.

data_source_bigquery_table_path:

The BigQuery table path.

max_selected_features:

Number of features to be selected.

Returns:

Tuple of pipeline_definition_path and parameter_values.
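
A minimal sketch, with placeholder project, bucket, and table names:

    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    template_path, parameter_values = utils.get_feature_selection_pipeline_and_parameters(
        project="my-project",
        location="us-central1",
        root_dir="gs://my-bucket/pipeline_root",
        target_column="label",
        algorithm="AMI",
        prediction_type="classification",
        data_source_bigquery_table_path="bq://my-project.my_dataset.my_table",
        max_selected_features=100,
    )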

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_skip_architecture_search_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, optimization_objective: str, transformations: str, train_budget_milli_node_hours: float, stage_1_tuning_result_artifact_uri: str, stage_2_num_parallel_trials: Optional[int] = None, stage_2_num_selected_trials: Optional[int] = None, data_source_csv_filenames: Optional[str] = None, data_source_bigquery_table_path: Optional[str] = None, predefined_split_key: Optional[str] = None, timestamp_split_key: Optional[str] = None, stratified_split_key: Optional[str] = None, training_fraction: Optional[float] = None, validation_fraction: Optional[float] = None, test_fraction: Optional[float] = None, weight_column: Optional[str] = None, optimization_objective_recall_value: Optional[float] = None, optimization_objective_precision_value: Optional[float] = None, cv_trainer_worker_pool_specs_override: Optional[Dict[str, Any]] = None, export_additional_model_without_custom_ops: bool = False, stats_and_example_gen_dataflow_machine_type: Optional[str] = None, stats_and_example_gen_dataflow_max_num_workers: Optional[int] = None, stats_and_example_gen_dataflow_disk_size_gb: Optional[int] = None, transform_dataflow_machine_type: Optional[str] = None, transform_dataflow_max_num_workers: Optional[int] = None, transform_dataflow_disk_size_gb: Optional[int] = None, dataflow_subnetwork: Optional[str] = None, dataflow_use_public_ips: bool = True, encryption_spec_key_name: Optional[str] = None, additional_experiments: Optional[Dict[str, Any]] = None, dataflow_service_account: Optional[str] = None, run_evaluation: bool = True, evaluation_batch_predict_machine_type: Optional[str] = None, evaluation_batch_predict_starting_replica_count: Optional[int] = None, evaluation_batch_predict_max_replica_count: Optional[int] = None, evaluation_dataflow_machine_type: Optional[str] = None, evaluation_dataflow_max_num_workers: Optional[int] = None, evaluation_dataflow_disk_size_gb: Optional[int] = None) Tuple[str, Dict[str, Any]]

Get the AutoML Tabular training pipeline that skips architecture search.

Args:
project:

The GCP project that runs the pipeline components.

location:

The GCP region that runs the pipeline components.

root_dir:

The root GCS directory for the pipeline components.

target_column:

The target column name.

prediction_type:

The type of prediction the model is to produce. “classification” or “regression”.

optimization_objective:

For binary classification, “maximize-au-roc”, “minimize-log-loss”, “maximize-au-prc”, “maximize-precision-at-recall”, or “maximize-recall-at-precision”. For multi class classification, “minimize-log-loss”. For regression, “minimize-rmse”, “minimize-mae”, or “minimize-rmsle”.

transformations:

The transformations to apply.

train_budget_milli_node_hours:

The train budget for creating this model, expressed in milli node hours, i.e. a value of 1,000 in this field means 1 node hour.

stage_1_tuning_result_artifact_uri:

The stage 1 tuning result artifact GCS URI.

stage_2_num_parallel_trials:

Number of parallel trials for stage 2.

stage_2_num_selected_trials:

Number of selected trials for stage 2.

data_source_csv_filenames:

The CSV data source.

data_source_bigquery_table_path:

The BigQuery data source.

predefined_split_key:

The predefined_split column name.

timestamp_split_key:

The timestamp_split column name.

stratified_split_key:

The stratified_split column name.

training_fraction:

The training fraction.

validation_fraction:

The validation fraction.

test_fraction:

The test fraction.

weight_column:

The weight column name.

optimization_objective_recall_value:

Required when optimization_objective is “maximize-precision-at-recall”. Must be between 0 and 1, inclusive.

optimization_objective_precision_value:

Required when optimization_objective is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.

cv_trainer_worker_pool_specs_override:

The dictionary for overriding the cv trainer worker pool spec.

export_additional_model_without_custom_ops:

Whether to export an additional model without custom TensorFlow operators.

stats_and_example_gen_dataflow_machine_type:

The dataflow machine type for stats_and_example_gen component.

stats_and_example_gen_dataflow_max_num_workers:

The max number of Dataflow workers for stats_and_example_gen component.

stats_and_example_gen_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for stats_and_example_gen component.

transform_dataflow_machine_type:

The dataflow machine type for transform component.

transform_dataflow_max_num_workers:

The max number of Dataflow workers for transform component.

transform_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for transform component.

dataflow_subnetwork:

Dataflow’s fully qualified subnetwork name; when empty, the default subnetwork will be used.

dataflow_use_public_ips:

Specifies whether Dataflow workers use public IP addresses.

encryption_spec_key_name:

The KMS key name.

additional_experiments:

Use this field to configure private preview features.

dataflow_service_account:

Custom service account to run dataflow jobs.

run_evaluation:

Whether to run evaluation in the training pipeline.

evaluation_batch_predict_machine_type:

The prediction server machine type for batch predict components during evaluation.

evaluation_batch_predict_starting_replica_count:

The initial number of prediction server replicas for batch predict components during evaluation.

evaluation_batch_predict_max_replica_count:

The max number of prediction server replicas for batch predict components during evaluation.

evaluation_dataflow_machine_type:

The dataflow machine type for evaluation components.

evaluation_dataflow_max_num_workers:

The max number of Dataflow workers for evaluation components.

evaluation_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for evaluation components.

Returns:

Tuple of pipeline_definition_path and parameter_values.
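
A hedged sketch follows; the stage 1 tuning result URI would come from the tuning-result artifact of a previous AutoML Tabular run, and every path below is a placeholder:

    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    template_path, parameter_values = (
        utils.get_skip_architecture_search_pipeline_and_parameters(
            project="my-project",
            location="us-central1",
            root_dir="gs://my-bucket/pipeline_root",
            target_column="label",
            prediction_type="classification",
            optimization_objective="minimize-log-loss",
            transformations="gs://my-bucket/transformations.json",
            train_budget_milli_node_hours=1000,
            # Tuning result artifact produced by an earlier run (placeholder path).
            stage_1_tuning_result_artifact_uri="gs://my-bucket/previous_run/stage_1_tuning_result",
            data_source_csv_filenames="gs://my-bucket/train.csv",
            training_fraction=0.8,
            validation_fraction=0.1,
            test_fraction=0.1,
        )
    )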

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_skip_evaluation_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column_name: str, prediction_type: str, optimization_objective: str, transformations: Dict[str, Any], split_spec: Dict[str, Any], data_source: Dict[str, Any], train_budget_milli_node_hours: float, stage_1_num_parallel_trials: int = 35, stage_2_num_parallel_trials: int = 35, stage_2_num_selected_trials: int = 5, weight_column_name: str = '', study_spec_override: Optional[Dict[str, Any]] = None, optimization_objective_recall_value: float = -1, optimization_objective_precision_value: float = -1, stage_1_tuner_worker_pool_specs_override: Optional[Dict[str, Any]] = None, cv_trainer_worker_pool_specs_override: Optional[Dict[str, Any]] = None, export_additional_model_without_custom_ops: bool = False, stats_and_example_gen_dataflow_machine_type: str = 'n1-standard-16', stats_and_example_gen_dataflow_max_num_workers: int = 25, stats_and_example_gen_dataflow_disk_size_gb: int = 40, transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '', additional_experiments: Optional[Dict[str, Any]] = None) Tuple[str, Dict[str, Any]]

Get the AutoML Tabular training pipeline that skips evaluation.

Args:
project:

The GCP project that runs the pipeline components.

location:

The GCP region that runs the pipeline components.

root_dir:

The root GCS directory for the pipeline components.

target_column_name:

The target column name.

prediction_type:

The type of prediction the model is to produce. “classification” or “regression”.

optimization_objective:

For binary classification, “maximize-au-roc”, “minimize-log-loss”, “maximize-au-prc”, “maximize-precision-at-recall”, or “maximize-recall-at-precision”. For multi class classification, “minimize-log-loss”. For regression, “minimize-rmse”, “minimize-mae”, or “minimize-rmsle”.

transformations:

The transformations to apply.

split_spec:

The split spec.

data_source:

The data source.

train_budget_milli_node_hours:

The train budget for creating this model, expressed in milli node hours, i.e. a value of 1,000 in this field means 1 node hour.

stage_1_num_parallel_trials:

Number of parallel trials for stage 1.

stage_2_num_parallel_trials:

Number of parallel trials for stage 2.

stage_2_num_selected_trials:

Number of selected trials for stage 2.

weight_column_name:

The weight column name.

study_spec_override:

The dictionary for overriding study spec. The dictionary should be of format https://github.com/googleapis/googleapis/blob/4e836c7c257e3e20b1de14d470993a2b1f4736a8/google/cloud/aiplatform/v1beta1/study.proto#L181.

optimization_objective_recall_value:

Required when optimization_objective is “maximize-precision-at-recall”. Must be between 0 and 1, inclusive.

optimization_objective_precision_value:

Required when optimization_objective is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.

stage_1_tuner_worker_pool_specs_override:

The dictionary for overriding the stage 1 tuner worker pool spec.

cv_trainer_worker_pool_specs_override:

The dictionary for overriding the cv trainer worker pool spec.

export_additional_model_without_custom_ops:

Whether to export an additional model without custom TensorFlow operators.

stats_and_example_gen_dataflow_machine_type:

The dataflow machine type for stats_and_example_gen component.

stats_and_example_gen_dataflow_max_num_workers:

The max number of Dataflow workers for stats_and_example_gen component.

stats_and_example_gen_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for stats_and_example_gen component.

transform_dataflow_machine_type:

The dataflow machine type for transform component.

transform_dataflow_max_num_workers:

The max number of Dataflow workers for transform component.

transform_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for transform component.

dataflow_subnetwork:

Dataflow’s fully qualified subnetwork name; when empty, the default subnetwork will be used.

dataflow_use_public_ips:

Specifies whether Dataflow workers use public IP addresses.

encryption_spec_key_name:

The KMS key name.

additional_experiments:

Use this field to configure private preview features.

Returns:

Tuple of pipeline_definition_path and parameter_values.

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_tabnet_study_spec_parameters_override(dataset_size_bucket: str, prediction_type: str, training_budget_bucket: str) List[Dict[str, Any]]

Get study_spec_parameters_override for a TabNet hyperparameter tuning job.

Args:
dataset_size_bucket:

Size of the dataset. One of “small” (< 1M rows), “medium” (1M - 100M rows), or “large” (> 100M rows).

prediction_type:

The type of prediction the model is to produce. “classification” or “regression”.

training_budget_bucket:

Bucket of the estimated training budget. One of “small” (< $600), “medium” ($600 - $2400), or “large” (> $2400). This parameter is only used as a hint for the hyperparameter search space, unrelated to the real cost.

Returns:

List of study_spec_parameters_override.
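
The returned list is meant to be passed as study_spec_parameters_override to the built-in algorithm hyperparameter tuning pipeline above; a short sketch with illustrative bucket choices:

    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    tabnet_search_space = utils.get_tabnet_study_spec_parameters_override(
        dataset_size_bucket="medium",       # 1M - 100M rows
        prediction_type="classification",
        training_budget_bucket="small",     # < $600 estimated budget
    )
    # Pass tabnet_search_space as study_spec_parameters_override (with
    # algorithm="tabnet") to
    # get_builtin_algorithm_hyperparameter_tuning_job_pipeline_and_parameters.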

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_tabnet_trainer_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, transform_config: str, learning_rate: float, max_steps: int = -1, max_train_secs: int = -1, large_category_dim: int = 1, large_category_thresh: int = 300, yeo_johnson_transform: bool = True, feature_dim: int = 64, feature_dim_ratio: float = 0.5, num_decision_steps: int = 6, relaxation_factor: float = 1.5, decay_every: float = 100, decay_rate: float = 0.95, gradient_thresh: float = 2000, sparsity_loss_weight: float = 1e-05, batch_momentum: float = 0.95, batch_size_ratio: float = 0.25, num_transformer_layers: int = 4, num_transformer_layers_ratio: float = 0.25, class_weight: float = 1.0, loss_function_type: str = 'default', alpha_focal_loss: float = 0.25, gamma_focal_loss: float = 2.0, enable_profiler: bool = False, seed: int = 1, eval_steps: int = 0, batch_size: int = 100, eval_frequency_secs: int = 600, data_source_csv_filenames: Optional[str] = None, data_source_bigquery_table_path: Optional[str] = None, predefined_split_key: Optional[str] = None, timestamp_split_key: Optional[str] = None, stratified_split_key: Optional[str] = None, training_fraction: Optional[float] = None, validation_fraction: Optional[float] = None, test_fraction: Optional[float] = None, weight_column: str = '', stats_and_example_gen_dataflow_machine_type: str = 'n1-standard-16', stats_and_example_gen_dataflow_max_num_workers: int = 25, stats_and_example_gen_dataflow_disk_size_gb: int = 40, transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, training_machine_spec: Optional[Dict[str, Any]] = None, training_replica_count: int = 1, run_evaluation: bool = True, evaluation_batch_predict_machine_type: str = 'n1-standard-16', evaluation_batch_predict_starting_replica_count: int = 25, evaluation_batch_predict_max_replica_count: int = 25, evaluation_dataflow_machine_type: str = 'n1-standard-4', evaluation_dataflow_max_num_workers: int = 25, evaluation_dataflow_disk_size_gb: int = 50, dataflow_service_account: str = '', dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '') Tuple[str, Dict[str, Any]]

Get the TabNet training pipeline.

Args:
project:

The GCP project that runs the pipeline components.

location:

The GCP region that runs the pipeline components.

root_dir:

The root GCS directory for the pipeline components.

target_column:

The target column name.

prediction_type:

The type of prediction the model is to produce. “classification” or “regression”.

transform_config:

The path to a GCS file containing the transformations to apply.

learning_rate:

The learning rate used by the linear optimizer.

max_steps:

Number of steps to run the trainer for.

max_train_secs:

Amount of time in seconds to run the trainer for.

large_category_dim:

Embedding dimension for categorical features with a large number of categories.

large_category_thresh:

Threshold on the number of categories above which the large_category_dim embedding dimension is applied.

yeo_johnson_transform:

Enables trainable Yeo-Johnson power transform.

feature_dim:

Dimensionality of the hidden representation in feature transformation block.

feature_dim_ratio:

The ratio of output dimension (dimensionality of the outputs of each decision step) to feature dimension.

num_decision_steps:

Number of sequential decision steps.

relaxation_factor:

Relaxation factor that promotes the reuse of each feature at different decision steps. When it is 1, a feature is enforced to be used only at one decision step and as it increases, more flexibility is provided to use a feature at multiple decision steps.

decay_every:

Number of iterations for periodically applying learning rate decaying.

decay_rate:

The decay rate for the learning rate.

gradient_thresh:

Threshold for the norm of gradients for clipping.

sparsity_loss_weight:

Weight of the loss for sparsity regularization (increasing it will yield more sparse feature selection).

batch_momentum:

Momentum in ghost batch normalization.

batch_size_ratio:

The ratio of virtual batch size (size of the ghost batch normalization) to batch size.

num_transformer_layers:

The number of transformer layers for each decision step.

num_transformer_layers_ratio:

The ratio of shared transformer layers to total transformer layers.

class_weight:

The class weight is used to compute a weighted cross entropy, which is helpful for classifying imbalanced datasets. Only used for classification.

loss_function_type:

Loss function type. For classification, one of [cross_entropy, weighted_cross_entropy, focal_loss]; default is cross_entropy. For regression, one of [rmse, mae, mse]; default is mse.

alpha_focal_loss:

Alpha value (balancing factor) in focal_loss function. Only used for classification.

gamma_focal_loss:

Gamma value (modulating factor) for focal loss. Only used for classification.

enable_profiler:

Enables profiling and saves a trace during evaluation.

seed:

Seed to be used for this run.

eval_steps:

Number of steps to run evaluation for. If not specified or negative, evaluation runs on the whole validation dataset. If set to 0, evaluation runs for a fixed number of samples.

batch_size:

Batch size for training.

eval_frequency_secs:

Frequency at which evaluation and checkpointing will take place.

data_source_csv_filenames:

The CSV data source.

data_source_bigquery_table_path:

The BigQuery data source.

predefined_split_key:

The predefined_split column name.

timestamp_split_key:

The timestamp_split column name.

stratified_split_key:

The stratified_split column name.

training_fraction:

The training fraction.

validation_fraction:

The validation fraction.

test_fraction:

The test fraction.

weight_column:

The weight column name.

stats_and_example_gen_dataflow_machine_type:

The dataflow machine type for stats_and_example_gen component.

stats_and_example_gen_dataflow_max_num_workers:

The max number of Dataflow workers for stats_and_example_gen component.

stats_and_example_gen_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for stats_and_example_gen component.

transform_dataflow_machine_type:

The dataflow machine type for transform component.

transform_dataflow_max_num_workers:

The max number of Dataflow workers for transform component.

transform_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for transform component.

training_machine_spec:

The machine spec for trainer component. See https://cloud.google.com/compute/docs/machine-types for options.

training_replica_count:

The replica count for the trainer component.

run_evaluation:

Whether to run evaluation steps during training.

evaluation_batch_predict_machine_type:

The prediction server machine type for batch predict components during evaluation.

evaluation_batch_predict_starting_replica_count:

The initial number of prediction server replicas for batch predict components during evaluation.

evaluation_batch_predict_max_replica_count:

The max number of prediction server replicas for batch predict components during evaluation.

evaluation_dataflow_machine_type:

The dataflow machine type for evaluation components.

evaluation_dataflow_max_num_workers:

The max number of Dataflow workers for evaluation components.

evaluation_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for evaluation components.

dataflow_service_account:

Custom service account to run dataflow jobs.

dataflow_subnetwork:

Dataflow’s fully qualified subnetwork name; when empty, the default subnetwork will be used.

dataflow_use_public_ips:

Specifies whether Dataflow workers use public IP addresses.

encryption_spec_key_name:

The KMS key name.

Returns:

Tuple of pipeline_definition_path and parameter_values.
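
A minimal sketch; the hyperparameter values and machine spec below are illustrative assumptions, not recommended defaults:

    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    template_path, parameter_values = utils.get_tabnet_trainer_pipeline_and_parameters(
        project="my-project",
        location="us-central1",
        root_dir="gs://my-bucket/pipeline_root",
        target_column="label",
        prediction_type="classification",
        transform_config="gs://my-bucket/transform_config.json",
        learning_rate=0.01,
        max_steps=10000,
        data_source_csv_filenames="gs://my-bucket/train.csv",
        training_fraction=0.8,
        validation_fraction=0.1,
        test_fraction=0.1,
        # Trainer machine spec; keys are assumed to follow the Vertex AI MachineSpec format.
        training_machine_spec={"machine_type": "n1-standard-16"},
    )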

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_wide_and_deep_study_spec_parameters_override() List[Dict[str, Any]]

Get study_spec_parameters_override for a Wide & Deep hyperparameter tuning job.

Returns:

List of study_spec_parameters_override.
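
This helper mirrors the TabNet helper above but takes no arguments; the returned list plugs directly into study_spec_parameters_override of the built-in algorithm hyperparameter tuning pipeline with algorithm="wide_and_deep":

    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    wide_and_deep_search_space = utils.get_wide_and_deep_study_spec_parameters_override()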

google_cloud_pipeline_components.experimental.automl.tabular.utils.get_wide_and_deep_trainer_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, transform_config: str, learning_rate: float, dnn_learning_rate: float, optimizer_type: str = 'adam', max_steps: int = -1, max_train_secs: int = -1, l1_regularization_strength: float = 0, l2_regularization_strength: float = 0, l2_shrinkage_regularization_strength: float = 0, beta_1: float = 0.9, beta_2: float = 0.999, hidden_units: str = '30,30,30', use_wide: bool = True, embed_categories: bool = True, dnn_dropout: float = 0, dnn_optimizer_type: str = 'ftrl', dnn_l1_regularization_strength: float = 0, dnn_l2_regularization_strength: float = 0, dnn_l2_shrinkage_regularization_strength: float = 0, dnn_beta_1: float = 0.9, dnn_beta_2: float = 0.999, enable_profiler: bool = False, seed: int = 1, eval_steps: int = 0, batch_size: int = 100, eval_frequency_secs: int = 600, data_source_csv_filenames: Optional[str] = None, data_source_bigquery_table_path: Optional[str] = None, predefined_split_key: Optional[str] = None, timestamp_split_key: Optional[str] = None, stratified_split_key: Optional[str] = None, training_fraction: Optional[float] = None, validation_fraction: Optional[float] = None, test_fraction: Optional[float] = None, weight_column: str = '', stats_and_example_gen_dataflow_machine_type: str = 'n1-standard-16', stats_and_example_gen_dataflow_max_num_workers: int = 25, stats_and_example_gen_dataflow_disk_size_gb: int = 40, transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, training_machine_spec: Optional[Dict[str, Any]] = None, training_replica_count: int = 1, run_evaluation: bool = True, evaluation_batch_predict_machine_type: str = 'n1-standard-16', evaluation_batch_predict_starting_replica_count: int = 25, evaluation_batch_predict_max_replica_count: int = 25, evaluation_dataflow_machine_type: str = 'n1-standard-4', evaluation_dataflow_max_num_workers: int = 25, evaluation_dataflow_disk_size_gb: int = 50, dataflow_service_account: str = '', dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '') Tuple[str, Dict[str, Any]]

Get the Wide & Deep training pipeline.

Args:
project:

The GCP project that runs the pipeline components.

location:

The GCP region that runs the pipeline components.

root_dir:

The root GCS directory for the pipeline components.

target_column:

The target column name.

prediction_type:

The type of prediction the model is to produce. ‘classification’ or ‘regression’.

transform_config:

The path to a GCS file containing the transformations to apply.

learning_rate:

The learning rate used by the linear optimizer.

dnn_learning_rate:

The learning rate for training the deep part of the model.

optimizer_type:

The type of optimizer to use. Choices are “adam”, “ftrl” and “sgd” for the Adam, FTRL, and Gradient Descent Optimizers, respectively.

max_steps:

Number of steps to run the trainer for.

max_train_secs:

Amount of time in seconds to run the trainer for.

l1_regularization_strength:

L1 regularization strength for optimizer_type=”ftrl”.

l2_regularization_strength:

L2 regularization strength for optimizer_type=”ftrl”.

l2_shrinkage_regularization_strength:

L2 shrinkage regularization strength for optimizer_type=”ftrl”.

beta_1:

Beta 1 value for optimizer_type=”adam”.

beta_2:

Beta 2 value for optimizer_type=”adam”.

hidden_units:

Hidden layer sizes to use for DNN feature columns, provided in comma-separated layers.

use_wide:

If set to true, the categorical columns will be used in the wide part of the DNN model.

embed_categories:

If set to true, the categorical columns will be embedded and used in the deep part of the model. Embedding size is the square root of the column cardinality.

dnn_dropout:

The probability we will drop out a given coordinate.

dnn_optimizer_type:

The type of optimizer to use for the deep part of the model. Choices are “adam”, “ftrl” and “sgd” for the Adam, FTRL, and Gradient Descent Optimizers, respectively.

dnn_l1_regularization_strength:

L1 regularization strength for dnn_optimizer_type=”ftrl”.

dnn_l2_regularization_strength:

L2 regularization strength for dnn_optimizer_type=”ftrl”.

dnn_l2_shrinkage_regularization_strength:

L2 shrinkage regularization strength for dnn_optimizer_type=”ftrl”.

dnn_beta_1:

Beta 1 value for dnn_optimizer_type=”adam”.

dnn_beta_2:

Beta 2 value for dnn_optimizer_type=”adam”.

enable_profiler:

Enables profiling and saves a trace during evaluation.

seed:

Seed to be used for this run.

eval_steps:

Number of steps to run evaluation for. If not specified or negative, evaluation runs on the whole validation dataset. If set to 0, evaluation runs for a fixed number of samples.

batch_size:

Batch size for training.

eval_frequency_secs:

Frequency at which evaluation and checkpointing will take place.

data_source_csv_filenames:

The CSV data source.

data_source_bigquery_table_path:

The BigQuery data source.

predefined_split_key:

The predefined_split column name.

timestamp_split_key:

The timestamp_split column name.

stratified_split_key:

The stratified_split column name.

training_fraction:

The training fraction.

validation_fraction:

The validation fraction.

test_fraction:

The test fraction.

weight_column:

The weight column name.

stats_and_example_gen_dataflow_machine_type:

The dataflow machine type for stats_and_example_gen component.

stats_and_example_gen_dataflow_max_num_workers:

The max number of Dataflow workers for stats_and_example_gen component.

stats_and_example_gen_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for stats_and_example_gen component.

transform_dataflow_machine_type:

The dataflow machine type for transform component.

transform_dataflow_max_num_workers:

The max number of Dataflow workers for transform component.

transform_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for transform component.

training_machine_spec:

The machine spec for trainer component. See https://cloud.google.com/compute/docs/machine-types for options.

training_replica_count:

The replica count for the trainer component.

run_evaluation:

Whether to run evaluation steps during training.

evaluation_batch_predict_machine_type:

The prediction server machine type for batch predict components during evaluation.

evaluation_batch_predict_starting_replica_count:

The initial number of prediction servers for the batch predict components used during evaluation.

evaluation_batch_predict_max_replica_count:

The maximum number of prediction servers for the batch predict components used during evaluation.

evaluation_dataflow_machine_type:

The dataflow machine type for evaluation components.

evaluation_dataflow_max_num_workers:

The max number of Dataflow workers for evaluation components.

evaluation_dataflow_disk_size_gb:

Dataflow worker’s disk size in GB for evaluation components.

dataflow_service_account:

Custom service account to run dataflow jobs.

dataflow_subnetwork:

Dataflow’s fully qualified subnetwork name; when empty, the default subnetwork will be used. Example: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications

dataflow_use_public_ips:

Specifies whether Dataflow workers use public IP addresses.

encryption_spec_key_name:

The KMS key name.

Returns:

Tuple of pipeline_definition_path and parameter_values.
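
A hedged sketch of a call with a CSV data source and an explicit split; the bucket, file pattern, and column names are placeholders, and the returned tuple can be submitted with google.cloud.aiplatform.PipelineJob in the same way as the earlier feature selection example:

from google_cloud_pipeline_components.experimental.automl.tabular import utils

template_path, parameter_values = (
    utils.get_wide_and_deep_trainer_pipeline_and_parameters(
        project='my-project',
        location='us-central1',
        root_dir='gs://my-bucket/pipeline_root',
        target_column='label',
        prediction_type='classification',
        transform_config='gs://my-bucket/transform_config.json',
        learning_rate=0.01,
        dnn_learning_rate=0.01,
        data_source_csv_filenames='gs://my-bucket/data/train-*.csv',
        training_fraction=0.8,
        validation_fraction=0.1,
        test_fraction=0.1,
    )
)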

google_cloud_pipeline_components.experimental.automl.tabular.utils.input_dictionary_to_parameter(input_dict: Optional[Dict[str, Any]]) str

Convert json input dict to encoded parameter string.

This function is required because the YAML component definition has no keyword for applying quote escaping, so the quotes in the JSON argument must be manually escaped using this function.

Args:
input_dict:

The input json dictionary.

Returns:

The encoded string used for parameter.
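
A minimal usage sketch; the input dictionary below is an arbitrary worker pool spec override used only to show that a plain JSON-style dict goes in and a quote-escaped string comes out:

from google_cloud_pipeline_components.experimental.automl.tabular import utils

# An arbitrary JSON-style dictionary to be embedded as a component parameter.
worker_pool_specs_override = {
    'machine_spec': {'machine_type': 'n1-standard-16'}
}

# The returned string has its quotes escaped so it can be passed directly
# as a YAML component parameter value.
encoded = utils.input_dictionary_to_parameter(worker_pool_specs_override)
print(encoded)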

Module contents

Module for AutoML Tables KFP components.

google_cloud_pipeline_components.experimental.automl.tabular.BuiltinAlgorithmHyperparameterTuningJobOp()

automl_tabular_builtin_algorithm_hyperparameter_tuning_job Launch a built-in algorithm hyperparameter tuning job using Vertex HyperparameterTuningJob API.

Args:
project (str):

Required. The GCP project that runs the pipeline components.

location (str):

Required. The GCP region that runs the pipeline components.

root_dir (str):

Required. The root GCS directory for the pipeline components.

target_column (str):

Required. The target column name.

prediction_type (str):

Required. The type of prediction the model is to produce. “classification” or “regression”.

weight_column (Optional[str]):

The weight column name.

enable_profiler (Optional[bool]):

Enables profiling and saves a trace during evaluation.

seed (Optional[int]):

Seed to be used for this run.

eval_steps (Optional[int]):

Number of steps to run evaluation for. If not specified or negative, evaluation runs on the whole validation dataset. If set to 0, evaluation runs for a fixed number of samples.

eval_frequency_secs (Optional[int]):

Frequency at which evaluation and checkpointing will take place.

study_spec_metrics (list[dict]):

Required. List of dictionaries representing metrics to optimize. The dictionary contains the metric_id, which is reported by the training job, and the optimization goal of the metric, one of “minimize” or “maximize”.

study_spec_parameters_override (list[str]):

List of dictionaries representing parameters to optimize. The dictionary key is the parameter_id, which is passed to training job as a command line argument, and the dictionary value is the parameter specification of the metric.

max_trial_count (int):

Required. The desired total number of trials.

parallel_trial_count (int):

Required. The desired number of trials to run in parallel.

tabnet (Optional[bool]):

Train the TabNet model.

wide_and_deep (Optional[bool]):

Train the Wide & Deep model.

max_failed_trial_count (Optional[int]):

The number of failed trials that need to be seen before failing the HyperparameterTuningJob. If set to 0, Vertex AI decides how many trials must fail before the whole job fails.

study_spec_algorithm (Optional[str]):

The search algorithm specified for the study. One of ‘ALGORITHM_UNSPECIFIED’, ‘GRID_SEARCH’, or ‘RANDOM_SEARCH’.

study_spec_measurement_selection_type (Optional[str]):

Which measurement to use if/when the service automatically selects the final measurement from previously reported intermediate measurements. One of “BEST_MEASUREMENT” or “LAST_MEASUREMENT”.

replica_count (Optional[int]):

The replica count.

machine_spec (Optional[Dict[str, Any]]):

The machine spec. See https://cloud.google.com/compute/docs/machine-types for options.

instance_baseline (AutoMLTabularInstanceBaseline):

The path to a JSON file for baseline values.

metadata (TabularExampleGenMetadata):

The tabular example gen metadata.

materialized_train_split (MaterializedSplit):

The path to the materialized train split.

materialized_eval_split (MaterializedSplit):

The path to the materialized validation split.

materialized_test_split (MaterializedSplit):

The path to the materialized test split.

transform_output (TransformOutput):

The path to transform output.

training_schema_uri (TrainingSchema):

The path to the training schema.

encryption_spec_key_name (Optional[str]):

The KMS key name.

Returns:
gcp_resources (str):

Serialized gcp_resources proto tracking the custom training job.

instance_schema_uri (str):

The path to the instance schema.

prediction_schema_uri (str):

The path to the prediction schema.

trials (str):

The path to the hyperparameter tuning trials

prediction_docker_uri_output (str):

The URI of the prediction container.

google_cloud_pipeline_components.experimental.automl.tabular.CvTrainerOp()

automl_tabular_cv_trainer AutoML Tabular cross-validation trainer

Args:
project (str):

Required. Project to run Cross-validation trainer.

location (str):

Location for running the Cross-validation trainer.

root_dir (str):

The Cloud Storage location to store the output.

worker_pool_specs_override (str):

Quote escaped JSON string for the worker pool specs. An example of the worker pool specs JSON is: [{“machine_spec”: {“machine_type”: “n1-standard-16”}},{},{},{“machine_spec”: {“machine_type”: “n1-standard-16”}}]

deadline_hours (float):

Number of hours the cross-validation trainer should run.

num_parallel_trials (int):

Number of parallel training trials.

single_run_max_secs (int):

Max number of seconds each training trial runs.

num_selected_trials (int):

Number of selected trials. The number of weak learners in the final model is 5 * num_selected_trials.

transform_output (TransformOutput):

The transform output artifact.

metadata (TabularExampleGenMetadata):

The tabular example gen metadata.

materialized_cv_splits (MaterializedSplit):

The materialized cross-validation splits.

tuning_result_input (AutoMLTabularTuningResult):

AutoML Tabular tuning result.

encryption_spec_key_name (Optional[str]):

Customer-managed encryption key.

Returns:
tuning_result_output (AutoMLTabularTuningResult):

The trained model and architectures.

gcp_resources (str):

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.

google_cloud_pipeline_components.experimental.automl.tabular.EnsembleOp()

automl_tabular_ensemble Ensemble AutoML Tabular models

Args:
project (str):

Required. Project to run the AutoML Tabular ensemble.

location (str):

Location for running the AutoML Tabular ensemble.

root_dir (str):

The Cloud Storage location to store the output.

transform_output (TransformOutput):

The transform output artifact.

metadata (TabularExampleGenMetadata):

The tabular example gen metadata.

dataset_schema (DatasetSchema):

The schema of the dataset.

tuning_result_input (AutoMLTabularTuningResult):

AutoML Tabular tuning result.

instance_baseline (AutoMLTabularInstanceBaseline):

The instance baseline used to calculate explanations.

warmup_data (Dataset):

The warm up data. Ensemble component will save the warm up data together with the model artifact, used to warm up the model when prediction server starts.

encryption_spec_key_name (Optional[str]):

Customer-managed encryption key.

export_additional_model_without_custom_ops (Optional[str]):

True to export an additional model without custom TF operators to the model_without_custom_ops output.

Returns:
gcp_resources (str):

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.

model_architecture (AutoMLTabularModelArchitecture):

The architecture of the output model.

model (system.Model):

The output model.

model_without_custom_ops (system.Model):

The output model without custom TF operators, this output will be empty unless export_additional_model_without_custom_ops is set.

model_uri (str):

The URI of the output model.

instance_schema_uri (str):

The URI of the instance schema.

prediction_schema_uri (str):

The URI of the prediction schema.

explanation_metadata (str):

The explanation metadata used by Vertex online and batch explanations.

explanation_parameters (str):

The explanation parameters used by Vertex online and batch explanations.

google_cloud_pipeline_components.experimental.automl.tabular.FeatureSelectionOp()

tabular_feature_ranking_and_selection Launch a feature selection task to pick top features.

Args:
project (str):

Required. Project to run feature selection.

location (str):

Location for running the feature selection. If not set, default to us-central1.

root_dir (str):

The Cloud Storage location to store the output.

dataflow_machine_type (Optional[str]):

The machine type used for dataflow jobs. If not set, default to n1-standard-16.

dataflow_max_num_workers (Optional[int]):

The number of workers to run the dataflow job. If not set, default to 25.

dataflow_disk_size_gb (Optional[int]):

The disk size, in gigabytes, to use on each Dataflow worker instance. If not set, default to 40.

dataflow_subnetwork (Optional[str]):

Dataflow’s fully qualified subnetwork name, when empty the default subnetwork will be used. More details: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications

dataflow_use_public_ips (Optional[bool]):

Specifies whether Dataflow workers use public IP addresses.

encryption_spec_key_name (Optional[str]):

Customer-managed encryption key. If this is set, then all resources will be encrypted with the provided encryption key.

data_source (Dataset):

The input dataset artifact which references csv, BigQuery, or TF Records.

target_column_name (str):

Target column name of the input dataset.

max_selected_features (Optional[int]):

Number of features to select by the algorithm. If not set, default to 1000.

Returns:
feature_ranking (TabularFeatureRanking):

The dictionary of feature names and feature ranking values.

selected_features (JsonObject):

A json array of selected feature names.

gcp_resources (str):

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.

google_cloud_pipeline_components.experimental.automl.tabular.FeatureTransformEngineOp()

feature_transform_engine Feature transform engine to transform raw data to engineered features.

Feature Transform Engine (FTE) expects input data in the form of an analyze dataset (i.e., the dataset to be analyzed to compute dataset-level statistics such as min, max, average, or vocabulary) and a transform dataset (i.e., the dataset to be transformed into engineered features). FTE performs transformations on the transform dataset based on the provided transformation configurations.

Args:
project (str):

Required. Project to run feature transform engine.

location (Optional[str]):

Location for running the feature transform engine.

root_dir (str):

The Cloud Storage location to store the output.

analyze_data (Dataset):

Configuration of the dataset to be analyzed.

transform_data (Dataset):

Configuration of the dataset to be transformed.

transform_config (str):

Feature transformation configurations.

Path to a JSON file used to specify FTE’s transformation configurations. In the following, we provide some sample transform configurations to demonstrate FTE’s capabilities.

Full auto transformations: FTE automatically configures a set of built-in transformations for each input column based on its data statistics. For example:

{
  "auto_transforms": ["feature_1", "feature_2", ... ]
}

Fully specified transformations: All transformations on input columns are explicitly specified with FTE’s built-in transformations. Chaining of multiple transformations on a single column is also supported. For example:

{
  "transforms": [{
      "transform": "ZScaleTransform",
      "input_column_names": ["feature_1"]
  }, {
      "transform": "ZScaleTransform",
      "input_column_names": ["feature_2"]
  }]
}

Mix of auto and explicit transformations:

{
  "auto_transforms": ["feature_1", "feature_2", ... ]
  "transforms": [{
      "transform": "ZScaleTransform",
      "input_column_names": ["feature_3"]
  }, {
      "transform": "ZScaleTransform",
      "input_column_names": ["feature_4"]
  }]
}

Custom transformations: Custom, bring-your-own transform function, where users can define and import their own transform function and use it with FTE’s built-in transformations. For example:

{
  "modules": [{
    "transform": "PlusOneTransform",
    "module_path": "gs://bucket/custom_transform_fn.py",
    "function_name": "plus_one_transform"
  }],
  "transforms": [{
      "transform": "CastToFloatTransform",
      "input_column_names": ["feature_1"],
      "output_column_names": ["feature_1"]
  },{
      "transform": "PlusOneTransform",
      "input_column_names": ["feature_1"]
  }]
}

Additional information about FTE’s built-in transformations:

DatetimeTransform:

{
  "transform": "DatetimeTransform",
  "input_column_names": ["feature_1"],
  "time_format": "%Y-%m-%d"
}

LogTransform:

{
  "transform": "LogTransform",
  "input_column_names": ["feature_1"]
}

ZScaleTransform:

{
  "transform": "ZScaleTransform",
  "input_column_names": ["feature_1"]
}

VocabularyTransform:

{
  "transform": "VocabularyTransform",
  "input_column_names": ["feature_1"]
}

CategoricalTransform:

{
  "transform": "CategoricalTransform",
  "input_column_names": ["feature_1"],
  "top_k": 10
}

ReduceTransform:

{
  "transform": "ReduceTransform",
  "input_column_names": ["feature_1"],
  "reduce_mode": "MEAN",
  "output_column_names": ["feature_1_mean"]
}

SplitStringTransform:

{
  "transform": "SplitStringTransform",
  "input_column_names": ["feature_1"],
  "separator": "$"
}

NGramTransform:

{
  "transform": "NGramTransform",
  "input_column_names": ["feature_1"],
  "min_ngram_size": 1,
  "max_ngram_size": 2,
  "separator": " "
}

ClipTransform:

{
  "transform": "ClipTransform",
  "input_column_names": ["col1"],
  "output_column_names": ["col1_clipped"],
  "min_value": 1.,
  "max_value": 10.,
}

MultiHotEncodingTransform:

{
  "transform": "MultiHotEncodingTransform",
  "input_column_names": ["col1"],
}

MaxAbsScaleTransform:

{
  "transform": "MaxAbsScaleTransform",
  "input_column_names": ["col1"],
  "output_column_names": ["col1_max_abs_scaled"]
}
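
The JSON samples above can also be assembled programmatically. A hedged sketch, assuming a placeholder bucket and feature names, that builds the mixed auto/explicit configuration and uploads it to GCS (via google.cloud.storage, one of several ways to stage the file) so its gs:// URI can be passed as transform_config:

import json
from google.cloud import storage

# Mixed auto/explicit configuration, mirroring the sample above;
# feature names and bucket are placeholders.
transform_config = {
    'auto_transforms': ['feature_1', 'feature_2'],
    'transforms': [{
        'transform': 'ZScaleTransform',
        'input_column_names': ['feature_3'],
    }],
}

# Stage the config in GCS; pass 'gs://my-bucket/configs/transform_config.json'
# as the transform_config argument.
bucket = storage.Client().bucket('my-bucket')
bucket.blob('configs/transform_config.json').upload_from_string(
    json.dumps(transform_config))
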
dataflow_machine_type (Optional[str]):

The machine type used for dataflow jobs. If not set, default to n1-standard-16.

dataflow_max_num_workers (Optional[int]):

The number of workers to run the dataflow job. If not set, default to 25.

dataflow_disk_size_gb (Optional[int]):

The disk size, in gigabytes, to use on each Dataflow worker instance. If not set, default to 40.

dataflow_subnetwork (Optional[str]):

Dataflow’s fully qualified subnetwork name, when empty the default subnetwork will be used. More details: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications

dataflow_use_public_ips (Optional[bool]):

Specifies whether Dataflow workers use public IP addresses.

encryption_spec_key_name (Optional[str]):

Customer-managed encryption key.

Returns:
materialized_data (Dataset):

The materialized dataset.

transform_output (TransformOutput):

The transform output artifact.

training_schema (TrainingSchema):

The training schema.

gcp_resources (str):

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.

google_cloud_pipeline_components.experimental.automl.tabular.FinalizerOp()

automl_tabular_finalizer Finalizer for AutoML Tabular pipelines

Args:
project (str):

Required. Project to run the AutoML Tabular pipeline finalizer.

location (str):

Location for running the AutoML Tabular pipeline finalizer.

root_dir (str):

The Cloud Storage location to store the output.

encryption_spec_key_name (Optional[str]):

Customer-managed encryption key.

Returns:
gcp_resources (str):

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.

google_cloud_pipeline_components.experimental.automl.tabular.GenerateAnalyzeAndTransformDataOp(train_split: Dataset, eval_split: Dataset, test_split: Dataset)

Generate analyze and transform data Generates analyze and transform data for the Feature Transform Engine.

Feature Transform Engine (FTE) expects input data in the form of an analyze dataset (i.e., the dataset to be analyzed to compute dataset-level statistics such as min, max, average, or vocabulary) and a transform dataset (i.e., the dataset to be transformed into engineered features).

This component takes the common set of training, evaluation, and testing splits and generates the analyze dataset (consisting of the train split) and the transform dataset (consisting of all the splits).

Args:
train_split (Dataset):

Train split dataset output by stats gen component.

eval_split (Dataset):

Eval split dataset output by stats gen component.

test_split (Dataset):

Test split dataset output by stats gen component.

Returns:
analyze_data (Dataset):

Analyze data as input for Feature Transform Engine.

transform_data (Dataset):

Transform data as input for Feature Transform Engine.

google_cloud_pipeline_components.experimental.automl.tabular.InfraValidatorOp()

automl_tabular_infra_validator Validates that the trained AutoML Tabular model is a valid model.

Args:
unmanaged_container_model (str):

google.UnmanagedContainerModel for the model to be validated.

google_cloud_pipeline_components.experimental.automl.tabular.SplitMaterializedDataOp(materialized_data: Dataset)

Split materialized data Splits materialized dataset into materialized train, eval, and test data splits.

The materialized dataset generated by the Feature Transform Engine consists of all the splits that were combined into the input transform dataset (i.e., train, eval, and test splits). This component splits the output materialized dataset into the corresponding materialized data splits so that they can be used by downstream training or evaluation components.

Args:
materialized_data (Dataset):

Materialized dataset output by the Feature Transform Engine.

Returns:
materialized_train_split (MaterializedSplit):

Path pattern to the materialized train split.

materialized_eval_split (MaterializedSplit):

Path pattern to the materialized eval split.

materialized_test_split (MaterializedSplit):

Path pattern to the materialized test split.
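
A hedged sketch of how the three FTE-related components described above might be wired together inside a KFP pipeline. The dsl.importer steps stand in for an upstream stats and example gen task, the URIs and pipeline name are placeholders, and only the inputs documented here are wired; additional arguments may be needed in practice:

from kfp.v2 import dsl
from google_cloud_pipeline_components.experimental.automl import tabular

@dsl.pipeline(name='fte-demo')
def fte_pipeline(project: str, location: str, root_dir: str,
                 transform_config: str, train_split_uri: str,
                 eval_split_uri: str, test_split_uri: str):
    # Import the three Dataset splits normally produced by the stats and
    # example gen step so the FTE chain can be sketched on its own.
    train_split = dsl.importer(artifact_uri=train_split_uri,
                               artifact_class=dsl.Dataset)
    eval_split = dsl.importer(artifact_uri=eval_split_uri,
                              artifact_class=dsl.Dataset)
    test_split = dsl.importer(artifact_uri=test_split_uri,
                              artifact_class=dsl.Dataset)

    # Combine the splits into analyze and transform datasets.
    gen_task = tabular.GenerateAnalyzeAndTransformDataOp(
        train_split=train_split.output,
        eval_split=eval_split.output,
        test_split=test_split.output)

    # Run the Feature Transform Engine over the combined datasets.
    fte_task = tabular.FeatureTransformEngineOp(
        project=project, location=location, root_dir=root_dir,
        analyze_data=gen_task.outputs['analyze_data'],
        transform_data=gen_task.outputs['transform_data'],
        transform_config=transform_config)

    # Split the materialized dataset back into train/eval/test splits.
    tabular.SplitMaterializedDataOp(
        materialized_data=fte_task.outputs['materialized_data'])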

google_cloud_pipeline_components.experimental.automl.tabular.Stage1TunerOp()

automl_tabular_stage_1_tuner AutoML Tabular stage 1 tuner

Args:
project (str):

Required. Project to run the AutoML Tabular stage 1 tuner.

location (str):

Location for running the AutoML Tabular stage 1 tuner.

root_dir (str):

The Cloud Storage location to store the output.

study_spec_override (str):

Quote escaped JSON string for the study spec. An example of the study specs JSON is: {“parameters”:[{“parameter_id”: “model_type”,”categorical_value_spec”: {“values”: [“nn”]}}]}

worker_pool_specs_override (str):

Quote escaped JSON string for the worker pool specs. An example of the worker pool specs JSON is: [{“machine_spec”: {“machine_type”: “n1-standard-16”}},{},{},{“machine_spec”: {“machine_type”: “n1-standard-16”}}]

reduce_search_space_mode (str):

The reduce search space mode. Possible values: “regular” (default), “minimal”, “full”.

num_selected_trials (int):

Number of selected trials. The number of weak learners in the final model is 5 * num_selected_trials.

deadline_hours (float):

Number of hours the stage 1 tuner should run.

disable_early_stopping (bool):

True if disable early stopping. Default value is false.

num_parallel_trials (int):

Number of parallel training trials.

single_run_max_secs (int):

Max number of seconds each training trial runs.

metadata (TabularExampleGenMetadata):

The tabular example gen metadata.

transform_output (TransformOutput):

The transform output artifact.

materialized_train_split (MaterializedSplit):

The materialized train split.

materialized_eval_split (MaterializedSplit):

The materialized eval split.

encryption_spec_key_name (Optional[str]):

Customer-managed encryption key.

run_distillation (bool):

True if in distillation mode. The default value is false.

Returns:
gcp_resources (str):

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.

tuning_result_output (AutoMLTabularTuningResult):

The trained model and architectures.

google_cloud_pipeline_components.experimental.automl.tabular.StatsAndExampleGenOp(project: str, location: str, root_dir: str, target_column_name: str, prediction_type: str, transformations: str, weight_column_name: str = '', optimization_objective: str = '', optimization_objective_recall_value: float = '-1', optimization_objective_precision_value: float = '-1', transformations_path: str = '', split_spec: str = None, data_source: str = None, request_type: str = 'COLUMN_STATS_ONLY', dataflow_machine_type: str = 'n1-standard-16', dataflow_max_num_workers: int = '25', dataflow_disk_size_gb: int = '40', dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = 'true', dataflow_service_account: str = '', encryption_spec_key_name: str = '', run_distillation: bool = 'false', additional_experiments: str = '', additional_experiments_json: dict = '{}', data_source_csv_filenames: str = '', data_source_bigquery_table_path: str = '', predefined_split_key: str = '', timestamp_split_key: str = '', stratified_split_key: str = '', training_fraction: float = '-1', validation_fraction: float = '-1', test_fraction: float = '-1')

tabular_stats_and_example_gen Statistics and example gen for tabular data

Args:
project (str):

Required. Project to run dataset statistics and example generation.

location (str):

Location for running dataset statistics and example generation.

root_dir (str):

The Cloud Storage location to store the output.

target_column_name (str):

The target column name.

weight_column_name (str):

The weight column name.

prediction_type (str):

The prediction type. Supported values: “classification”, “regression”.

optimization_objective (str):

Objective function the model is optimizing towards. The training process creates a model that maximizes/minimizes the value of the objective function over the validation set. The supported optimization objectives depend on the prediction type. If the field is not set, a default objective function is used.

classification (binary):

“maximize-au-roc” (default) - Maximize the area under the receiver operating characteristic (ROC) curve.
“minimize-log-loss” - Minimize log loss.
“maximize-au-prc” - Maximize the area under the precision-recall curve.
“maximize-precision-at-recall” - Maximize precision for a specified recall value.
“maximize-recall-at-precision” - Maximize recall for a specified precision value.

classification (multi-class):

“minimize-log-loss” (default) - Minimize log loss.

regression:

“minimize-rmse” (default) - Minimize root-mean-squared error (RMSE).
“minimize-mae” - Minimize mean-absolute error (MAE).
“minimize-rmsle” - Minimize root-mean-squared log error (RMSLE).

optimization_objective_recall_value (str):

Required when optimization_objective is “maximize-precision-at-recall”. Must be between 0 and 1, inclusive.

optimization_objective_precision_value (str):

Required when optimization_objective is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.

transformations (str):

Quote escaped JSON string for transformations. Each transformation applies a transform function to the given input column, and the result is used for training. When creating a transformation for a BigQuery Struct column, the column should be flattened using “.” as the delimiter.

transformations_path (Optional[str]):

Path to a GCS file containing JSON string for transformations.

split_spec (str):

Quote escaped JSON string for split spec.

data_source (str):

Quote escaped JSON string for data source.

dataflow_machine_type (Optional[str]):

The machine type used for dataflow jobs. If not set, default to n1-standard-16.

dataflow_max_num_workers (Optional[int]):

The number of workers to run the dataflow job. If not set, default to 25.

dataflow_disk_size_gb (Optional[int]):

The disk size, in gigabytes, to use on each Dataflow worker instance. If not set, default to 40.

dataflow_subnetwork (Optional[str]):

Dataflow’s fully qualified subnetwork name, when empty the default subnetwork will be used. More details: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications

dataflow_use_public_ips (Optional[bool]):

Specifies whether Dataflow workers use public IP addresses.

dataflow_service_account (Optional[str]):

Custom service account to run dataflow jobs.

encryption_spec_key_name (Optional[str]):

Customer-managed encryption key.

run_distillation (bool): True if in distillation mode. The default value is false.

Returns:
dataset_schema (DatasetSchema):

The schema of the dataset.

dataset_stats (AutoMLTabularDatasetStats):

The stats of the dataset.

train_split (Dataset):

The train split.

eval_split (Dataset):

The eval split.

test_split (Dataset):

The test split.

test_split_json (JsonObject):

The test split JSON object.

downsampled_test_split_json (JsonObject):

The downsampled test split JSON object.

instance_baseline (AutoMLTabularInstanceBaseline):

The instance baseline used to calculate explanations.

metadata (TabularExampleGenMetadata):

The tabular example gen metadata.

gcp_resources (str):

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.

google_cloud_pipeline_components.experimental.automl.tabular.TabNetTrainerOp(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, learning_rate: float, instance_baseline: AutoMLTabularInstanceBaseline, metadata: TabularExampleGenMetadata, materialized_train_split: MaterializedSplit, materialized_eval_split: MaterializedSplit, transform_output: TransformOutput, training_schema_uri: TrainingSchema, weight_column: str = '', max_steps: int = -1, max_train_secs: int = -1, large_category_dim: int = 1, large_category_thresh: int = 300, yeo_johnson_transform: bool = 'true', feature_dim: int = 64, feature_dim_ratio: float = 0.5, num_decision_steps: int = 6, relaxation_factor: float = 1.5, decay_every: float = 100, decay_rate: float = 0.95, gradient_thresh: float = 2000, sparsity_loss_weight: float = 1e-05, batch_momentum: float = 0.95, batch_size_ratio: float = 0.25, num_transformer_layers: int = 4, num_transformer_layers_ratio: float = 0.25, class_weight: float = 1.0, loss_function_type: str = 'default', alpha_focal_loss: float = 0.25, gamma_focal_loss: float = 2.0, enable_profiler: bool = 'false', seed: int = 1, eval_steps: int = 0, batch_size: int = 100, eval_frequency_secs: int = 600, replica_count: int = 1, machine_spec: dict = '{"machine_type": "c2-standard-16"}', materialized_test_split: MaterializedSplit = '', encryption_spec_key_name: str = '')

automl_tabular_tabnet_trainer Launch a TabNet custom training job using Vertex CustomJob API.

Args:
project (str):

Required. The GCP project that runs the pipeline components.

location (str):

Required. The GCP region that runs the pipeline components.

root_dir (str):

Required. The root GCS directory for the pipeline components.

target_column (str):

Required. The target column name.

prediction_type (str):

Required. The type of prediction the model is to produce. “classification” or “regression”.

weight_column (Optional[str]):

The weight column name.

max_steps (Optional[int]):

Number of steps to run the trainer for.

max_train_secs (Optional[int]):

Amount of time in seconds to run the trainer for.

learning_rate (float):

The learning rate used by the linear optimizer.

large_category_dim (Optional[int]):

Embedding dimension for categorical feature with large number of categories.

large_category_thresh (Optional[int]):

Threshold on the number of categories above which the large_category_dim embedding dimension is applied.

yeo_johnson_transform (Optional[bool]):

Enables trainable Yeo-Johnson power transform.

feature_dim (Optional[int]):

Dimensionality of the hidden representation in feature transformation block.

feature_dim_ratio (Optional[float]):

The ratio of output dimension (dimensionality of the outputs of each decision step) to feature dimension.

num_decision_steps (Optional[int]):

Number of sequential decision steps.

relaxation_factor (Optional[float]):

Relaxation factor that promotes the reuse of each feature at different decision steps. When it is 1, a feature is enforced to be used only at one decision step and as it increases, more flexibility is provided to use a feature at multiple decision steps.

decay_every (Optional[float]):

Number of iterations for periodically applying learning rate decaying.

decay_rate (Optional[float]):

The rate at which the learning rate decays.

gradient_thresh (Optional[float]):

Threshold for the norm of gradients for clipping.

sparsity_loss_weight (Optional[float]):

Weight of the loss for sparsity regularization (increasing it will yield more sparse feature selection).

batch_momentum (Optional[float]):

Momentum in ghost batch normalization.

batch_size_ratio (Optional[float]):

The ratio of virtual batch size (size of the ghost batch normalization) to batch size.

num_transformer_layers (Optional[int]):

The number of transformer layers for each decision step.

num_transformer_layers_ratio (Optional[float]):

The ratio of shared transformer layer to transformer layers.

class_weight (Optional[float]):

The class weight is used to compute a weighted cross entropy, which is helpful for classifying imbalanced datasets. Only used for classification.

loss_function_type (Optional[str]):

Loss function type. Loss function in classification [cross_entropy, weighted_cross_entropy, focal_loss], default is cross_entropy. Loss function in regression: [rmse, mae, mse], default is mse.

alpha_focal_loss (Optional[float]):

Alpha value (balancing factor) in focal_loss function. Only used for classification.

gamma_focal_loss (Optional[float]):

Gamma value (modulating factor) for the focal loss function. Only used for classification.

enable_profiler (Optional[bool]):

Enables profiling and saves a trace during evaluation.

seed (Optional[int]):

Seed to be used for this run.

eval_steps (Optional[int]):

Number of steps to run evaluation for. If not specified or negative, evaluation runs on the whole validation dataset. If set to 0, evaluation runs for a fixed number of samples.

batch_size (Optional[int]):

Batch size for training.

eval_frequency_secs (Optional[int]):

Frequency at which evaluation and checkpointing will take place.

replica_count (Optional[int]):

The replica count.

machine_spec (Optional[Dict[str, Any]]):

The machine spec. See https://cloud.google.com/compute/docs/machine-types for options.

instance_baseline (AutoMLTabularInstanceBaseline):

The path to a JSON file for baseline values.

metadata (TabularExampleGenMetadata):

The tabular example gen metadata.

materialized_train_split (MaterializedSplit):

The path to the materialized train split.

materialized_eval_split (MaterializedSplit):

The path to the materialized validation split.

materialized_test_split (MaterializedSplit):

The path to the materialized test split.

transform_output (TransformOutput):

The path to transform output.

training_schema_uri (TrainingSchema):

The path to the training schema.

encryption_spec_key_name (Optional[str]):

The KMS key name.

Returns:
gcp_resources (str):

Serialized gcp_resources proto tracking the custom training job.

unmanaged_container_model (google.UnmanagedContainerModel):

The UnmanagedContainerModel artifact.

google_cloud_pipeline_components.experimental.automl.tabular.TransformConfigurationPlannerOp()

transform_configuration_planner Automatically generates transform configuration for the Feature Transform Engine.

When configuring transformations for the Feature Transform Engine, users have the option to specify “auto” transformation on input columns. In such a case, this transform configuration planner component automatically identifies the most appropriate set of transformations for those columns and generates the transformation configurations that can be used by the Feature Transform Engine.

Args:
project (str):

Required. Project to run the FTE transform configuration planner.

location (Optional[str]):

Location for running the FTE transform configuration planner.

root_dir (str):

The Cloud Storage location to store the output.

transform_config (str):

Feature transformation configurations.

target_column_name (str):

The target column name.

weight_column_name (str):

The weight column name.

prediction_type (str):

The prediction type. Supported values: “classification”, “regression”.

is_distill (bool):

True if in distillation mode. The default value is false.

dataset_stats (AutoMLTabularDatasetStats):

The stats of the dataset.

encryption_spec_key_name (Optional[str]):

Customer-managed encryption key

Returns:
fte_transform_configuration_artifact_path (str):

The path to the FTE transform configuration.

gcp_resources (str):

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.

google_cloud_pipeline_components.experimental.automl.tabular.TransformOp()

automl_tabular_transform Transforms raw features into engineered features

Args:
project (str):

Required. Project to run the AutoML Tabular transform.

location (str):

Location for running the AutoML Tabular transform.

root_dir (str):

The Cloud Storage location to store the output.

metadata (TabularExampleGenMetadata):

The tabular example gen metadata.

dataset_schema (DatasetSchema):

The schema of the dataset.

train_split (Dataset):

The train split.

eval_split (Dataset):

The eval split.

test_split (Dataset):

The test split.

dataflow_machine_type (Optional[str]):

The machine type used for dataflow jobs. If not set, default to n1-standard-16.

dataflow_max_num_workers (Optional[int]):

The number of workers to run the dataflow job. If not set, default to 25.

dataflow_disk_size_gb (Optional[int]):

The disk size, in gigabytes, to use on each Dataflow worker instance. If not set, default to 40.

dataflow_subnetwork (Optional[str]):

Dataflow’s fully qualified subnetwork name, when empty the default subnetwork will be used. More details: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications

dataflow_use_public_ips (Optional[bool]):

Specifies whether Dataflow workers use public IP addresses.

dataflow_service_account (Optional[str]):

Custom service account to run dataflow jobs.

encryption_spec_key_name (Optional[str]):

Customer-managed encryption key.

Returns:
materialized_train_split (MaterializedSplit):

The materialized train split.

materialized_eval_split (MaterializedSplit):

The materialized eval split.

materialized_test_split (MaterializedSplit):

The materialized test split.

training_schema_uri (TrainingSchema):

The training schema.

transform_output (TransformOutput):

The transform output artifact.

gcp_resources (str):

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.
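
A hedged sketch of the classic stats-and-example-gen to transform hand-off inside a KFP pipeline; the CSV pattern and pipeline name are placeholders, the transformations string is assumed to be a pre-built quote-escaped JSON configuration, and split-related arguments that a real run may require are omitted:

from kfp.v2 import dsl
from google_cloud_pipeline_components.experimental.automl import tabular

@dsl.pipeline(name='transform-demo')
def transform_pipeline(project: str, location: str, root_dir: str,
                       target_column: str, transformations: str):
    # Generate statistics, splits, and example gen metadata.
    stats_task = tabular.StatsAndExampleGenOp(
        project=project, location=location, root_dir=root_dir,
        target_column_name=target_column,
        prediction_type='classification',
        transformations=transformations,
        data_source_csv_filenames='gs://my-bucket/data/train-*.csv')

    # Materialize engineered features from the raw splits.
    tabular.TransformOp(
        project=project, location=location, root_dir=root_dir,
        metadata=stats_task.outputs['metadata'],
        dataset_schema=stats_task.outputs['dataset_schema'],
        train_split=stats_task.outputs['train_split'],
        eval_split=stats_task.outputs['eval_split'],
        test_split=stats_task.outputs['test_split'])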

google_cloud_pipeline_components.experimental.automl.tabular.WideAndDeepTrainerOp()

automl_tabular_wide_and_deep_trainer Launch a Wide & Deep custom training job using Vertex CustomJob API.

Args:
project (str):

Required. The GCP project that runs the pipeline components.

location (str):

Required. The GCP region that runs the pipeline components.

root_dir (str):

Required. The root GCS directory for the pipeline components.

target_column (str):

Required. The target column name.

prediction_type (str):

Required. The type of prediction the model is to produce. “classification” or “regression”.

weight_column (Optional[str]):

The weight column name.

max_steps (Optional[int]):

Number of steps to run the trainer for.

max_train_secs (Optional[int]):

Amount of time in seconds to run the trainer for.

learning_rate (float):

The learning rate used by the linear optimizer.

optimizer_type (Optional[str]):

The type of optimizer to use. Choices are “adam”, “ftrl” and “sgd” for the Adam, FTRL, and Gradient Descent Optimizers, respectively.

l1_regularization_strength (Optional[float]):

L1 regularization strength for optimizer_type=”ftrl”.

l2_regularization_strength (Optional[float]):

L2 regularization strength for optimizer_type=”ftrl”.

l2_shrinkage_regularization_strength (Optional[float]):

L2 shrinkage regularization strength for optimizer_type=”ftrl”.

beta_1 (Optional[float]):

Beta 1 value for optimizer_type=”adam”.

beta_2 (Optional[float]):

Beta 2 value for optimizer_type=”adam”.

hidden_units (Optional[str]):

Hidden layer sizes to use for DNN feature columns, provided in comma-separated layers.

use_wide (Optional[bool]):

If set to true, the categorical columns will be used in the wide part of the DNN model.

embed_categories (Optional[bool]):

If set to true, the categorical columns will be embedded and used in the deep part of the model. Embedding size is the square root of the column cardinality.

dnn_dropout (Optional[float]):

The probability we will drop out a given coordinate.

dnn_learning_rate (Optional[float]):

The learning rate for training the deep part of the model.

dnn_optimizer_type (Optional[str]):

The type of optimizer to use for the deep part of the model. Choices are “adam”, “ftrl” and “sgd” for the Adam, FTRL, and Gradient Descent Optimizers, respectively.

dnn_l1_regularization_strength (Optional[float]):

L1 regularization strength for dnn_optimizer_type=”ftrl”.

dnn_l2_regularization_strength (Optional[float]):

L2 regularization strength for dnn_optimizer_type=”ftrl”.

dnn_l2_shrinkage_regularization_strength (Optional[float]):

L2 shrinkage regularization strength for dnn_optimizer_type=”ftrl”.

dnn_beta_1 (Optional[float]):

Beta 1 value for dnn_optimizer_type=”adam”.

dnn_beta_2 (Optional[float]):

Beta 2 value for dnn_optimizer_type=”adam”.

enable_profiler (Optional[bool]):

Enables profiling and saves a trace during evaluation.

seed (Optional[int]):

Seed to be used for this run.

eval_steps (Optional[int]):

Number of steps to run evaluation for. If not specified or negative, evaluation runs on the whole validation dataset. If set to 0, evaluation runs for a fixed number of samples.

batch_size (Optional[int]):

Batch size for training.

eval_frequency_secs (Optional[int]):

Frequency at which evaluation and checkpointing will take place.

replica_count (Optional[int]):

The replica count.

machine_spec (Optional[Dict[str, Any]]):

The machine spec. See https://cloud.google.com/compute/docs/machine-types for options.

instance_baseline (AutoMLTabularInstanceBaseline):

The path to a JSON file for baseline values.

metadata (TabularExampleGenMetadata):

The tabular example gen metadata.

materialized_train_split (MaterializedSplit):

The path to the materialized train split.

materialized_eval_split (MaterializedSplit):

The path to the materialized validation split.

materialized_test_split (MaterializedSplit):

The path to the materialized test split.

transform_output (TransformOutput):

The path to transform output.

training_schema_uri (TrainingSchema):

The path to the training schema.

encryption_spec_key_name (Optional[str]):

The KMS key name.

Returns:
gcp_resources (str):

Serialized gcp_resources proto tracking the custom training job.

unmanaged_container_model (google.UnmanagedContainerModel):

The UnmanagedContainerModel artifact.