google_cloud_pipeline_components.experimental.automl.tabular package
Submodules
google_cloud_pipeline_components.experimental.automl.tabular.utils module
Util functions for AutoML Tabular pipeline.
- google_cloud_pipeline_components.experimental.automl.tabular.utils.get_automl_tabular_feature_selection_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, optimization_objective: str, transformations: str, train_budget_milli_node_hours: float, stage_1_num_parallel_trials: int | None = None, stage_2_num_parallel_trials: int | None = None, stage_2_num_selected_trials: int | None = None, data_source_csv_filenames: str | None = None, data_source_bigquery_table_path: str | None = None, predefined_split_key: str | None = None, timestamp_split_key: str | None = None, stratified_split_key: str | None = None, training_fraction: float | None = None, validation_fraction: float | None = None, test_fraction: float | None = None, weight_column: str | None = None, study_spec_parameters_override: List[Dict[str, Any]] | None = None, optimization_objective_recall_value: float | None = None, optimization_objective_precision_value: float | None = None, stage_1_tuner_worker_pool_specs_override: Dict[str, Any] | None = None, cv_trainer_worker_pool_specs_override: Dict[str, Any] | None = None, export_additional_model_without_custom_ops: bool = False, stats_and_example_gen_dataflow_machine_type: str | None = None, stats_and_example_gen_dataflow_max_num_workers: int | None = None, stats_and_example_gen_dataflow_disk_size_gb: int | None = None, transform_dataflow_machine_type: str | None = None, transform_dataflow_max_num_workers: int | None = None, transform_dataflow_disk_size_gb: int | None = None, dataflow_subnetwork: str | None = None, dataflow_use_public_ips: bool = True, encryption_spec_key_name: str | None = None, additional_experiments: Dict[str, Any] | None = None, dataflow_service_account: str | None = None, run_evaluation: bool = True, evaluation_batch_predict_machine_type: str | None = None, evaluation_batch_predict_starting_replica_count: int | None = None, evaluation_batch_predict_max_replica_count: int | None = None, evaluation_batch_explain_machine_type: str | None = None, evaluation_batch_explain_starting_replica_count: int | None = None, evaluation_batch_explain_max_replica_count: int | None = None, evaluation_dataflow_machine_type: str | None = None, evaluation_dataflow_starting_num_workers: int | None = None, evaluation_dataflow_max_num_workers: int | None = None, evaluation_dataflow_disk_size_gb: int | None = None, max_selected_features: int = 1000, apply_feature_selection_tuning: bool = False, run_distillation: bool = False, distill_batch_predict_machine_type: str | None = None, distill_batch_predict_starting_replica_count: int | None = None, distill_batch_predict_max_replica_count: int | None = None) Tuple[str, Dict[str, Any]]
Get the AutoML Tabular v1 default training pipeline.
- Args:
- project: The GCP project that runs the pipeline components.
- location: The GCP region that runs the pipeline components.
- root_dir: The root GCS directory for the pipeline components.
- target_column: The target column name.
- prediction_type: The type of prediction the model is to produce:
“classification” or “regression”.
- optimization_objective: For binary classification, “maximize-au-roc”,
“minimize-log-loss”, “maximize-au-prc”, “maximize-precision-at-recall”, or “maximize-recall-at-precision”. For multi-class classification, “minimize-log-loss”. For regression, “minimize-rmse”, “minimize-mae”, or “minimize-rmsle”.
- transformations: The path to a GCS file containing the transformations to
apply.
- train_budget_milli_node_hours: The train budget for creating this model,
expressed in milli node hours, i.e. a value of 1,000 in this field means 1 node hour.
- stage_1_num_parallel_trials: Number of parallel trials for stage 1.
- stage_2_num_parallel_trials: Number of parallel trials for stage 2.
- stage_2_num_selected_trials: Number of selected trials for stage 2.
- data_source_csv_filenames: The CSV data source.
- data_source_bigquery_table_path: The BigQuery data source.
- predefined_split_key: The predefined_split column name.
- timestamp_split_key: The timestamp_split column name.
- stratified_split_key: The stratified_split column name.
- training_fraction: The training fraction.
- validation_fraction: The validation fraction.
- test_fraction: The test fraction.
- weight_column: The weight column name.
- study_spec_parameters_override: The list for overriding the study spec.
- optimization_objective_recall_value: Required when optimization_objective is
“maximize-precision-at-recall”. Must be between 0 and 1, inclusive.
- optimization_objective_precision_value: Required when optimization_objective
is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.
- stage_1_tuner_worker_pool_specs_override: The dictionary for overriding the
stage 1 tuner worker pool spec.
- cv_trainer_worker_pool_specs_override: The dictionary for overriding the
stage CV trainer worker pool spec.
- export_additional_model_without_custom_ops: Whether to export additional
model without custom TensorFlow operators.
- stats_and_example_gen_dataflow_machine_type: The dataflow machine type for
stats_and_example_gen component.
- stats_and_example_gen_dataflow_max_num_workers: The max number of Dataflow
workers for stats_and_example_gen component.
- stats_and_example_gen_dataflow_disk_size_gb: Dataflow worker’s disk size in
GB for stats_and_example_gen component.
- transform_dataflow_machine_type: The dataflow machine type for transform
component.
- transform_dataflow_max_num_workers: The max number of Dataflow workers for
transform component.
- transform_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for
transform component.
- dataflow_subnetwork: Dataflow’s fully qualified subnetwork name; when empty,
the default subnetwork will be used.
- dataflow_use_public_ips: Specifies whether Dataflow workers use public IP
addresses.
- encryption_spec_key_name: The KMS key name.
- additional_experiments: Use this field to configure private preview features.
- dataflow_service_account: Custom service account to run Dataflow jobs.
- run_evaluation: Whether to run evaluation in the training pipeline.
- evaluation_batch_predict_machine_type: The prediction server machine type
for batch predict components during evaluation.
- evaluation_batch_predict_starting_replica_count: The initial number of
prediction servers for batch predict components during evaluation.
- evaluation_batch_predict_max_replica_count: The max number of prediction
servers for batch predict components during evaluation.
- evaluation_batch_explain_machine_type: The prediction server machine type
for batch explain components during evaluation.
- evaluation_batch_explain_starting_replica_count: The initial number of
prediction servers for batch explain components during evaluation.
- evaluation_batch_explain_max_replica_count: The max number of prediction
servers for batch explain components during evaluation.
- evaluation_dataflow_machine_type: The dataflow machine type for evaluation
components.
- evaluation_dataflow_starting_num_workers: The initial number of Dataflow
workers for evaluation components.
- evaluation_dataflow_max_num_workers: The max number of Dataflow workers for
evaluation components.
- evaluation_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for
evaluation components.
- max_selected_features: Number of features to select for training.
- apply_feature_selection_tuning: Tune the feature selection rate if true.
- run_distillation: Whether to run distillation in the training pipeline.
- distill_batch_predict_machine_type: The prediction server machine type for
the batch predict component in model distillation.
- distill_batch_predict_starting_replica_count: The initial number of
prediction servers for the batch predict component in model distillation.
- distill_batch_predict_max_replica_count: The max number of prediction
servers for the batch predict component in model distillation.
- Returns:
Tuple of pipeline_definition_path and parameter_values.
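For illustration, a minimal sketch of assembling the required arguments for this helper. All concrete values (project id, bucket, column names) are hypothetical placeholders, not values from this reference:

```python
# Hypothetical placeholder arguments for
# get_automl_tabular_feature_selection_pipeline_and_parameters.
feature_selection_kwargs = {
    "project": "example-project",        # GCP project (placeholder)
    "location": "us-central1",           # GCP region (placeholder)
    "root_dir": "gs://example-bucket/pipeline_root",
    "target_column": "label",
    "prediction_type": "classification",
    "optimization_objective": "maximize-au-roc",
    "transformations": "gs://example-bucket/transformations.json",
    "train_budget_milli_node_hours": 1000,  # i.e. 1 node hour
    "max_selected_features": 500,           # optional: cap selected features
}

# With google-cloud-pipeline-components installed, the call would be:
# from google_cloud_pipeline_components.experimental.automl.tabular import utils
# template_path, parameter_values = (
#     utils.get_automl_tabular_feature_selection_pipeline_and_parameters(
#         **feature_selection_kwargs))
```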
- google_cloud_pipeline_components.experimental.automl.tabular.utils.get_automl_tabular_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, optimization_objective: str, transformations: str, train_budget_milli_node_hours: float, stage_1_num_parallel_trials: int | None = None, stage_2_num_parallel_trials: int | None = None, stage_2_num_selected_trials: int | None = None, data_source_csv_filenames: str | None = None, data_source_bigquery_table_path: str | None = None, predefined_split_key: str | None = None, timestamp_split_key: str | None = None, stratified_split_key: str | None = None, training_fraction: float | None = None, validation_fraction: float | None = None, test_fraction: float | None = None, weight_column: str | None = None, study_spec_parameters_override: List[Dict[str, Any]] | None = None, optimization_objective_recall_value: float | None = None, optimization_objective_precision_value: float | None = None, stage_1_tuner_worker_pool_specs_override: Dict[str, Any] | None = None, cv_trainer_worker_pool_specs_override: Dict[str, Any] | None = None, export_additional_model_without_custom_ops: bool = False, stats_and_example_gen_dataflow_machine_type: str | None = None, stats_and_example_gen_dataflow_max_num_workers: int | None = None, stats_and_example_gen_dataflow_disk_size_gb: int | None = None, transform_dataflow_machine_type: str | None = None, transform_dataflow_max_num_workers: int | None = None, transform_dataflow_disk_size_gb: int | None = None, dataflow_subnetwork: str | None = None, dataflow_use_public_ips: bool = True, encryption_spec_key_name: str | None = None, additional_experiments: Dict[str, Any] | None = None, dataflow_service_account: str | None = None, run_evaluation: bool = True, evaluation_batch_predict_machine_type: str | None = None, evaluation_batch_predict_starting_replica_count: int | None = None, evaluation_batch_predict_max_replica_count: int | None = None, evaluation_batch_explain_machine_type: str | None = None, evaluation_batch_explain_starting_replica_count: int | None = None, evaluation_batch_explain_max_replica_count: int | None = None, evaluation_dataflow_machine_type: str | None = None, evaluation_dataflow_starting_num_workers: int | None = None, evaluation_dataflow_max_num_workers: int | None = None, evaluation_dataflow_disk_size_gb: int | None = None, run_distillation: bool = False, distill_batch_predict_machine_type: str | None = None, distill_batch_predict_starting_replica_count: int | None = None, distill_batch_predict_max_replica_count: int | None = None, stage_1_tuning_result_artifact_uri: str | None = None, quantiles: List[float] | None = None, enable_probabilistic_inference: bool = False) Tuple[str, Dict[str, Any]]
Get the AutoML Tabular v1 default training pipeline.
- Args:
- project: The GCP project that runs the pipeline components.
- location: The GCP region that runs the pipeline components.
- root_dir: The root GCS directory for the pipeline components.
- target_column: The target column name.
- prediction_type: The type of prediction the model is to produce:
“classification” or “regression”.
- optimization_objective: For binary classification, “maximize-au-roc”,
“minimize-log-loss”, “maximize-au-prc”, “maximize-precision-at-recall”, or “maximize-recall-at-precision”. For multi-class classification, “minimize-log-loss”. For regression, “minimize-rmse”, “minimize-mae”, or “minimize-rmsle”.
- transformations: The path to a GCS file containing the transformations to
apply.
- train_budget_milli_node_hours: The train budget for creating this model,
expressed in milli node hours, i.e. a value of 1,000 in this field means 1 node hour.
- stage_1_num_parallel_trials: Number of parallel trials for stage 1.
- stage_2_num_parallel_trials: Number of parallel trials for stage 2.
- stage_2_num_selected_trials: Number of selected trials for stage 2.
- data_source_csv_filenames: The CSV data source.
- data_source_bigquery_table_path: The BigQuery data source.
- predefined_split_key: The predefined_split column name.
- timestamp_split_key: The timestamp_split column name.
- stratified_split_key: The stratified_split column name.
- training_fraction: The training fraction.
- validation_fraction: The validation fraction.
- test_fraction: The test fraction.
- weight_column: The weight column name.
- study_spec_parameters_override: The list for overriding the study spec.
- optimization_objective_recall_value: Required when optimization_objective is
“maximize-precision-at-recall”. Must be between 0 and 1, inclusive.
- optimization_objective_precision_value: Required when optimization_objective
is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.
- stage_1_tuner_worker_pool_specs_override: The dictionary for overriding the
stage 1 tuner worker pool spec.
- cv_trainer_worker_pool_specs_override: The dictionary for overriding the
stage CV trainer worker pool spec.
- export_additional_model_without_custom_ops: Whether to export additional
model without custom TensorFlow operators.
- stats_and_example_gen_dataflow_machine_type: The dataflow machine type for
stats_and_example_gen component.
- stats_and_example_gen_dataflow_max_num_workers: The max number of Dataflow
workers for stats_and_example_gen component.
- stats_and_example_gen_dataflow_disk_size_gb: Dataflow worker’s disk size in
GB for stats_and_example_gen component.
- transform_dataflow_machine_type: The dataflow machine type for transform
component.
- transform_dataflow_max_num_workers: The max number of Dataflow workers for
transform component.
- transform_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for
transform component.
- dataflow_subnetwork: Dataflow’s fully qualified subnetwork name; when empty,
the default subnetwork will be used.
- dataflow_use_public_ips: Specifies whether Dataflow workers use public IP
addresses.
- encryption_spec_key_name: The KMS key name.
- additional_experiments: Use this field to configure private preview features.
- dataflow_service_account: Custom service account to run Dataflow jobs.
- run_evaluation: Whether to run evaluation in the training pipeline.
- evaluation_batch_predict_machine_type: The prediction server machine type
for batch predict components during evaluation.
- evaluation_batch_predict_starting_replica_count: The initial number of
prediction servers for batch predict components during evaluation.
- evaluation_batch_predict_max_replica_count: The max number of prediction
servers for batch predict components during evaluation.
- evaluation_batch_explain_machine_type: The prediction server machine type
for batch explain components during evaluation.
- evaluation_batch_explain_starting_replica_count: The initial number of
prediction servers for batch explain components during evaluation.
- evaluation_batch_explain_max_replica_count: The max number of prediction
servers for batch explain components during evaluation.
- evaluation_dataflow_machine_type: The dataflow machine type for evaluation
components.
- evaluation_dataflow_starting_num_workers: The initial number of Dataflow
workers for evaluation components.
- evaluation_dataflow_max_num_workers: The max number of Dataflow workers for
evaluation components.
- evaluation_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for
evaluation components.
- run_distillation: Whether to run distillation in the training pipeline.
- distill_batch_predict_machine_type: The prediction server machine type for
the batch predict component in model distillation.
- distill_batch_predict_starting_replica_count: The initial number of
prediction servers for the batch predict component in model distillation.
- distill_batch_predict_max_replica_count: The max number of prediction
servers for the batch predict component in model distillation.
- stage_1_tuning_result_artifact_uri: The stage 1 tuning result artifact GCS
URI.
- quantiles: Quantiles to use for probabilistic inference. Up to 5 quantiles
are allowed, with values between 0 and 1, exclusive. Represents the quantiles to use for that objective. Quantiles must be unique.
- enable_probabilistic_inference: If probabilistic inference is enabled, the
model will fit a distribution that captures the uncertainty of a prediction. At inference time, the predictive distribution is used to make a point prediction that minimizes the optimization objective. For example, the mean of a predictive distribution is the point prediction that minimizes RMSE loss. If quantiles are specified, then the quantiles of the distribution are also returned.
- Returns:
Tuple of pipeline_definition_path and parameter_values.
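As a sketch, the documented constraints on quantiles (at most 5 unique values, each strictly between 0 and 1) can be checked before building the arguments. All concrete values below are hypothetical placeholders:

```python
# Helper mirroring the documented constraints on `quantiles`.
def validate_quantiles(quantiles):
    assert len(quantiles) <= 5, "up to 5 quantiles are allowed"
    assert len(set(quantiles)) == len(quantiles), "quantiles must be unique"
    assert all(0.0 < q < 1.0 for q in quantiles), "values must be in (0, 1), exclusive"
    return quantiles

# Hypothetical placeholder arguments for a regression pipeline with
# probabilistic inference enabled:
pipeline_kwargs = {
    "project": "example-project",
    "location": "us-central1",
    "root_dir": "gs://example-bucket/pipeline_root",
    "target_column": "target",
    "prediction_type": "regression",
    "optimization_objective": "minimize-rmse",
    "transformations": "gs://example-bucket/transformations.json",
    "train_budget_milli_node_hours": 1000,
    "enable_probabilistic_inference": True,
    "quantiles": validate_quantiles([0.1, 0.5, 0.9]),
}

# With google-cloud-pipeline-components installed, the call would be:
# from google_cloud_pipeline_components.experimental.automl.tabular import utils
# template_path, parameter_values = (
#     utils.get_automl_tabular_pipeline_and_parameters(**pipeline_kwargs))
```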
- google_cloud_pipeline_components.experimental.automl.tabular.utils.get_builtin_algorithm_hyperparameter_tuning_job_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, study_spec_metric_id: str, study_spec_metric_goal: str, study_spec_parameters_override: List[Dict[str, Any]], max_trial_count: int, parallel_trial_count: int, algorithm: str, enable_profiler: bool = False, seed: int = 1, eval_steps: int = 0, eval_frequency_secs: int = 600, transform_config: str | None = None, dataset_level_custom_transformation_definitions: List[Dict[str, Any]] | None = None, dataset_level_transformations: List[Dict[str, Any]] | None = None, predefined_split_key: str | None = None, stratified_split_key: str | None = None, training_fraction: float | None = None, validation_fraction: float | None = None, test_fraction: float | None = None, tf_auto_transform_features: List[str] | None = None, tf_custom_transformation_definitions: List[Dict[str, Any]] | None = None, tf_transformations_path: str | None = None, data_source_csv_filenames: str | None = None, data_source_bigquery_table_path: str | None = None, bigquery_staging_full_dataset_id: str | None = None, weight_column: str = '', max_failed_trial_count: int = 0, study_spec_algorithm: str = 'ALGORITHM_UNSPECIFIED', study_spec_measurement_selection_type: str = 'BEST_MEASUREMENT', transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, worker_pool_specs_override: Dict[str, Any] | None = None, run_evaluation: bool = True, evaluation_batch_predict_machine_type: str = 'n1-highmem-8', evaluation_batch_predict_starting_replica_count: int = 20, evaluation_batch_predict_max_replica_count: int = 20, evaluation_dataflow_machine_type: str = 'n1-standard-4', evaluation_dataflow_starting_num_workers: int = 10, evaluation_dataflow_max_num_workers: int = 100, evaluation_dataflow_disk_size_gb: int = 50, dataflow_service_account: str = '', dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '') Tuple[str, Dict[str, Any]]
Get the built-in algorithm HyperparameterTuningJob pipeline.
- Args:
- project: The GCP project that runs the pipeline components.
- location: The GCP region that runs the pipeline components.
- root_dir: The root GCS directory for the pipeline components.
- target_column: The target column name.
- prediction_type: The type of prediction the model is to produce:
“classification” or “regression”.
- study_spec_metric_id: Metric to optimize, possible values: [ ‘loss’,
‘average_loss’, ‘rmse’, ‘mae’, ‘mql’, ‘accuracy’, ‘auc’, ‘precision’, ‘recall’].
- study_spec_metric_goal: Optimization goal of the metric, possible values:
“MAXIMIZE”, “MINIMIZE”.
- study_spec_parameters_override: List of dictionaries representing parameters
to optimize. The dictionary key is the parameter_id, which is passed to training job as a command line argument, and the dictionary value is the parameter specification of the metric.
- max_trial_count: The desired total number of trials.
- parallel_trial_count: The desired number of trials to run in parallel.
- algorithm: Algorithm to train. One of “tabnet” and “wide_and_deep”.
- enable_profiler: Enables profiling and saves a trace during evaluation.
- seed: Seed to be used for this run.
- eval_steps: Number of steps to run evaluation for. If not specified or
negative, evaluation runs on the whole validation dataset. If set to 0, evaluation runs for a fixed number of samples.
- eval_frequency_secs: Frequency at which evaluation and checkpointing will
take place.
- transform_config: Path to the v1 TF transformation configuration.
- dataset_level_custom_transformation_definitions: Dataset-level custom
transformation definitions in string format.
- dataset_level_transformations: Dataset-level transformation configuration in
string format.
- predefined_split_key: Predefined split key.
- stratified_split_key: Stratified split key.
- training_fraction: Training fraction.
- validation_fraction: Validation fraction.
- test_fraction: Test fraction.
- tf_auto_transform_features: List of auto transform features in the
comma-separated string format.
- tf_custom_transformation_definitions: TF custom transformation definitions
in string format.
- tf_transformations_path: Path to the TF transformation configuration.
- data_source_csv_filenames: The CSV data source.
- data_source_bigquery_table_path: The BigQuery data source.
- bigquery_staging_full_dataset_id: The BigQuery staging full dataset id for
storing intermediate tables.
- weight_column: The weight column name.
- max_failed_trial_count: The number of failed trials that need to be seen
before failing the HyperparameterTuningJob. If set to 0, Vertex AI decides how many trials must fail before the whole job fails.
- study_spec_algorithm: The search algorithm specified for the study. One of
“ALGORITHM_UNSPECIFIED”, “GRID_SEARCH”, or “RANDOM_SEARCH”.
- study_spec_measurement_selection_type: Which measurement to use if/when the
service automatically selects the final measurement from previously reported intermediate measurements. One of “BEST_MEASUREMENT” or “LAST_MEASUREMENT”.
- transform_dataflow_machine_type: The dataflow machine type for transform
component.
- transform_dataflow_max_num_workers: The max number of Dataflow workers for
transform component.
- transform_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for
transform component.
- worker_pool_specs_override: The dictionary for overriding the training and
evaluation worker pool specs.
- run_evaluation: Whether to run evaluation steps during training.
- evaluation_batch_predict_machine_type: The prediction server machine type
for batch predict components during evaluation.
- evaluation_batch_predict_starting_replica_count: The initial number of
prediction servers for batch predict components during evaluation.
- evaluation_batch_predict_max_replica_count: The max number of prediction
servers for batch predict components during evaluation.
- evaluation_dataflow_machine_type: The dataflow machine type for evaluation
components.
- evaluation_dataflow_starting_num_workers: The initial number of Dataflow
workers for evaluation components.
- evaluation_dataflow_max_num_workers: The max number of Dataflow workers for
evaluation components.
- evaluation_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for
evaluation components.
- dataflow_service_account: Custom service account to run Dataflow jobs.
- dataflow_subnetwork: Dataflow’s fully qualified subnetwork name; when empty,
the default subnetwork will be used.
- dataflow_use_public_ips: Specifies whether Dataflow workers use public IP
addresses.
encryption_spec_key_name: The KMS key name.
- Returns:
Tuple of pipeline_definition_path and parameter_values.
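For illustration, a sketch of assembling the required arguments, including a study_spec_parameters_override entry. The parameter specification fields shown (double_value_spec, scale_type) follow the Vertex AI study spec schema but are illustrative assumptions here, as are all concrete values:

```python
# Hypothetical study spec override: each entry names a parameter_id (passed to
# the training job as a command line argument) and its parameter specification.
study_spec_parameters_override = [
    {
        "parameter_id": "learning_rate",
        "double_value_spec": {"min_value": 1e-4, "max_value": 1e-1},
        "scale_type": "UNIT_LOG_SCALE",
    },
]

# Hypothetical placeholder arguments for the HyperparameterTuningJob pipeline.
hpt_kwargs = {
    "project": "example-project",
    "location": "us-central1",
    "root_dir": "gs://example-bucket/pipeline_root",
    "target_column": "label",
    "prediction_type": "classification",
    "study_spec_metric_id": "auc",
    "study_spec_metric_goal": "MAXIMIZE",
    "study_spec_parameters_override": study_spec_parameters_override,
    "max_trial_count": 20,
    "parallel_trial_count": 5,
    "algorithm": "tabnet",
}

# With google-cloud-pipeline-components installed, the call would be:
# from google_cloud_pipeline_components.experimental.automl.tabular import utils
# template_path, parameter_values = (
#     utils.get_builtin_algorithm_hyperparameter_tuning_job_pipeline_and_parameters(
#         **hpt_kwargs))
```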
- google_cloud_pipeline_components.experimental.automl.tabular.utils.get_default_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column_name: str, prediction_type: str, optimization_objective: str, transformations: Dict[str, Any], split_spec: Dict[str, Any], data_source: Dict[str, Any], train_budget_milli_node_hours: float, stage_1_num_parallel_trials: int = 35, stage_2_num_parallel_trials: int = 35, stage_2_num_selected_trials: int = 5, weight_column_name: str = '', study_spec_override: Dict[str, Any] | None = None, optimization_objective_recall_value: float = -1, optimization_objective_precision_value: float = -1, stage_1_tuner_worker_pool_specs_override: Dict[str, Any] | None = None, cv_trainer_worker_pool_specs_override: Dict[str, Any] | None = None, export_additional_model_without_custom_ops: bool = False, stats_and_example_gen_dataflow_machine_type: str = 'n1-standard-16', stats_and_example_gen_dataflow_max_num_workers: int = 25, stats_and_example_gen_dataflow_disk_size_gb: int = 40, transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '', additional_experiments: Dict[str, Any] | None = None, dataflow_service_account: str = '', run_evaluation: bool = True, evaluation_batch_predict_machine_type: str = 'n1-highmem-8', evaluation_batch_predict_starting_replica_count: int = 20, evaluation_batch_predict_max_replica_count: int = 20, evaluation_dataflow_machine_type: str = 'n1-standard-4', evaluation_dataflow_max_num_workers: int = 100, evaluation_dataflow_disk_size_gb: int = 50, run_distillation: bool = False, distill_batch_predict_machine_type: str = 'n1-standard-16', distill_batch_predict_starting_replica_count: int = 25, distill_batch_predict_max_replica_count: int = 25) Tuple[str, Dict[str, Any]]
Get the AutoML Tabular default training pipeline.
- Args:
- project: The GCP project that runs the pipeline components.
- location: The GCP region that runs the pipeline components.
- root_dir: The root GCS directory for the pipeline components.
- target_column_name: The target column name.
- prediction_type: The type of prediction the model is to produce:
“classification” or “regression”.
- optimization_objective: For binary classification, “maximize-au-roc”,
“minimize-log-loss”, “maximize-au-prc”, “maximize-precision-at-recall”, or “maximize-recall-at-precision”. For multi-class classification, “minimize-log-loss”. For regression, “minimize-rmse”, “minimize-mae”, or “minimize-rmsle”.
- transformations: The transformations to apply.
- split_spec: The split spec.
- data_source: The data source.
- train_budget_milli_node_hours: The train budget for creating this model,
expressed in milli node hours, i.e. a value of 1,000 in this field means 1 node hour.
- stage_1_num_parallel_trials: Number of parallel trials for stage 1.
- stage_2_num_parallel_trials: Number of parallel trials for stage 2.
- stage_2_num_selected_trials: Number of selected trials for stage 2.
- weight_column_name: The weight column name.
- study_spec_override: The dictionary for overriding the study spec.
- optimization_objective_recall_value: Required when optimization_objective is
“maximize-precision-at-recall”. Must be between 0 and 1, inclusive.
- optimization_objective_precision_value: Required when optimization_objective
is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.
- stage_1_tuner_worker_pool_specs_override: The dictionary for overriding the
stage 1 tuner worker pool spec.
- cv_trainer_worker_pool_specs_override: The dictionary for overriding the
stage CV trainer worker pool spec.
- export_additional_model_without_custom_ops: Whether to export additional
model without custom TensorFlow operators.
- stats_and_example_gen_dataflow_machine_type: The dataflow machine type for
stats_and_example_gen component.
- stats_and_example_gen_dataflow_max_num_workers: The max number of Dataflow
workers for stats_and_example_gen component.
- stats_and_example_gen_dataflow_disk_size_gb: Dataflow worker’s disk size in
GB for stats_and_example_gen component.
- transform_dataflow_machine_type: The dataflow machine type for transform
component.
- transform_dataflow_max_num_workers: The max number of Dataflow workers for
transform component.
- transform_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for
transform component.
- dataflow_subnetwork: Dataflow’s fully qualified subnetwork name; when empty,
the default subnetwork will be used.
- dataflow_use_public_ips: Specifies whether Dataflow workers use public IP
addresses.
- encryption_spec_key_name: The KMS key name.
- additional_experiments: Use this field to configure private preview features.
- dataflow_service_account: Custom service account to run Dataflow jobs.
- run_evaluation: Whether to run evaluation in the training pipeline.
- evaluation_batch_predict_machine_type: The prediction server machine type
for batch predict components during evaluation.
- evaluation_batch_predict_starting_replica_count: The initial number of
prediction servers for batch predict components during evaluation.
- evaluation_batch_predict_max_replica_count: The max number of prediction
servers for batch predict components during evaluation.
- evaluation_dataflow_machine_type: The dataflow machine type for evaluation
components.
- evaluation_dataflow_max_num_workers: The max number of Dataflow workers for
evaluation components.
- evaluation_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for
evaluation components.
- run_distillation: Whether to run distillation in the training pipeline.
- distill_batch_predict_machine_type: The prediction server machine type for
the batch predict component in model distillation.
- distill_batch_predict_starting_replica_count: The initial number of
prediction servers for the batch predict component in model distillation.
- distill_batch_predict_max_replica_count: The max number of prediction
servers for the batch predict component in model distillation.
- Returns:
Tuple of pipeline_definition_path and parameter_values.
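A sketch of the scalar arguments for this helper; the transformations, split_spec, and data_source dictionary schemas are not reproduced in this reference, so they are left elided below. All concrete values are hypothetical placeholders:

```python
# Hypothetical placeholder arguments for get_default_pipeline_and_parameters.
default_kwargs = {
    "project": "example-project",
    "location": "us-central1",
    "root_dir": "gs://example-bucket/pipeline_root",
    "target_column_name": "label",
    "prediction_type": "regression",
    "optimization_objective": "minimize-rmse",
    "train_budget_milli_node_hours": 2000,  # i.e. 2 node hours
    "run_distillation": True,               # also produce a distilled model
}

# With google-cloud-pipeline-components installed, the call would be:
# template_path, parameter_values = utils.get_default_pipeline_and_parameters(
#     transformations=..., split_spec=..., data_source=..., **default_kwargs)
# The returned template path and parameter values can then be submitted as a
# Vertex AI pipeline run.
```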
- google_cloud_pipeline_components.experimental.automl.tabular.utils.get_distill_skip_evaluation_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column_name: str, prediction_type: str, optimization_objective: str, transformations: Dict[str, Any], split_spec: Dict[str, Any], data_source: Dict[str, Any], train_budget_milli_node_hours: float, stage_1_num_parallel_trials: int = 35, stage_2_num_parallel_trials: int = 35, stage_2_num_selected_trials: int = 5, weight_column_name: str = '', study_spec_override: Dict[str, Any] | None = None, optimization_objective_recall_value: float = -1, optimization_objective_precision_value: float = -1, stage_1_tuner_worker_pool_specs_override: Dict[str, Any] | None = None, cv_trainer_worker_pool_specs_override: Dict[str, Any] | None = None, export_additional_model_without_custom_ops: bool = False, stats_and_example_gen_dataflow_machine_type: str = 'n1-standard-16', stats_and_example_gen_dataflow_max_num_workers: int = 25, stats_and_example_gen_dataflow_disk_size_gb: int = 40, transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '', additional_experiments: Dict[str, Any] | None = None, distill_batch_predict_machine_type: str = 'n1-standard-16', distill_batch_predict_starting_replica_count: int = 25, distill_batch_predict_max_replica_count: int = 25) Tuple[str, Dict[str, Any]]
Get the AutoML Tabular training pipeline that distills and skips evaluation.
- Args:
- project: The GCP project that runs the pipeline components.
- location: The GCP region that runs the pipeline components.
- root_dir: The root GCS directory for the pipeline components.
- target_column_name: The target column name.
- prediction_type: The type of prediction the model is to produce:
“classification” or “regression”.
- optimization_objective: For binary classification, “maximize-au-roc”,
“minimize-log-loss”, “maximize-au-prc”, “maximize-precision-at-recall”, or “maximize-recall-at-precision”. For multi-class classification, “minimize-log-loss”. For regression, “minimize-rmse”, “minimize-mae”, or “minimize-rmsle”.
transformations: The transformations to apply. split_spec: The split spec. data_source: The data source. train_budget_milli_node_hours: The train budget of creating this model,
expressed in milli node hours i.e. 1,000 value in this field means 1 node hour.
stage_1_num_parallel_trials: Number of parallel trails for stage 1. stage_2_num_parallel_trials: Number of parallel trails for stage 2. stage_2_num_selected_trials: Number of selected trials for stage 2. weight_column_name: The weight column name. study_spec_override: The dictionary for overriding study spec. The
- optimization_objective_recall_value: Required when optimization_objective is
“maximize-precision-at-recall”. Must be between 0 and 1, inclusive.
- optimization_objective_precision_value: Required when optimization_objective
is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.
- stage_1_tuner_worker_pool_specs_override: The dictionary for overriding.
- stage 1 tuner worker pool spec. The dictionary should be of format
- cv_trainer_worker_pool_specs_override: The dictionary for overriding stage
- cv trainer worker pool spec. The dictionary should be of format
- export_additional_model_without_custom_ops: Whether to export additional
model without custom TensorFlow operators.
- stats_and_example_gen_dataflow_machine_type: The dataflow machine type for
stats_and_example_gen component.
- stats_and_example_gen_dataflow_max_num_workers: The max number of Dataflow
workers for stats_and_example_gen component.
- stats_and_example_gen_dataflow_disk_size_gb: Dataflow worker’s disk size in
GB for stats_and_example_gen component.
- transform_dataflow_machine_type: The dataflow machine type for transform
component.
- transform_dataflow_max_num_workers: The max number of Dataflow workers for
transform component.
- transform_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for
transform component.
- dataflow_subnetwork: Dataflow’s fully qualified subnetwork name, when empty
- the default subnetwork will be used. Example:
- dataflow_use_public_ips: Specifies whether Dataflow workers use public IP
addresses.
encryption_spec_key_name: The KMS key name. additional_experiments: Use this field to config private preview features. distill_batch_predict_machine_type: The prediction server machine type for
batch predict component in the model distillation.
- distill_batch_predict_starting_replica_count: The initial number of
prediction server for batch predict component in the model distillation.
- distill_batch_predict_max_replica_count: The max number of prediction server
for batch predict component in the model distillation.
- Returns:
Tuple of pipeline_definiton_path and parameter_values.
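A minimal invocation sketch, assuming the google-cloud-pipeline-components package is installed. All project, bucket, and column values are hypothetical, and the dict shapes for transformations, split_spec, and data_source are illustrative assumptions rather than the exact schemas:

```python
# Hypothetical values throughout; replace with your own project resources.
pipeline_kwargs = {
    "project": "my-gcp-project",
    "location": "us-central1",
    "root_dir": "gs://my-bucket/pipeline_root",
    "target_column_name": "churned",
    "prediction_type": "classification",
    "optimization_objective": "maximize-au-roc",
    # Assumed dict shapes, for illustration only:
    "transformations": {"auto": [{"column_name": "tenure"}]},
    "split_spec": {"fraction_split": {"training_fraction": 0.8,
                                      "validation_fraction": 0.1,
                                      "test_fraction": 0.1}},
    "data_source": {"csv_data_source": {"csv_filenames": ["gs://my-bucket/train.csv"]}},
    "train_budget_milli_node_hours": 1000,  # 1 node hour
}

try:
    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    template_path, parameter_values = (
        utils.get_distill_skip_evaluation_pipeline_and_parameters(**pipeline_kwargs)
    )
except ImportError:
    # Package not installed in this environment.
    template_path = parameter_values = None
```

The returned tuple is meant to be passed to a Vertex AI pipeline run, e.g. as the `template_path` and `parameter_values` arguments of `aiplatform.PipelineJob`.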
- google_cloud_pipeline_components.experimental.automl.tabular.utils.get_feature_selection_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, algorithm: str, prediction_type: str, data_source_csv_filenames: str | None = None, data_source_bigquery_table_path: str | None = None, max_selected_features: int | None = None, dataflow_machine_type: str = 'n1-standard-16', dataflow_max_num_workers: int = 25, dataflow_disk_size_gb: int = 40, dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, dataflow_service_account: str = '')
Get the feature selection pipeline that generates feature ranking and selected features.
- Args:
- project: The GCP project that runs the pipeline components.
- location: The GCP region that runs the pipeline components.
- root_dir: The root GCS directory for the pipeline components.
- target_column: The target column name.
- algorithm: Algorithm to select features; defaults to AMI.
- prediction_type: The type of prediction the model is to produce. “classification” or “regression”.
- data_source_csv_filenames: A string that represents a list of comma-separated CSV filenames.
- data_source_bigquery_table_path: The BigQuery table path.
- max_selected_features: Number of features to be selected.
- dataflow_machine_type: The Dataflow machine type for the feature_selection component.
- dataflow_max_num_workers: The max number of Dataflow workers for the feature_selection component.
- dataflow_disk_size_gb: Dataflow worker’s disk size in GB for the feature_selection component.
- dataflow_subnetwork: Dataflow’s fully qualified subnetwork name; when empty, the default subnetwork will be used.
- dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.
- dataflow_service_account: Custom service account to run Dataflow jobs.
- Returns:
Tuple of pipeline_definition_path and parameter_values.
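A short sketch of calling the feature selection helper, assuming the package is installed; the project, bucket, and column names are hypothetical:

```python
# Hypothetical arguments for the feature selection pipeline helper.
fs_kwargs = dict(
    project="my-gcp-project",
    location="us-central1",
    root_dir="gs://my-bucket/pipeline_root",
    target_column="churned",
    algorithm="AMI",                       # the documented default algorithm
    prediction_type="classification",
    data_source_csv_filenames="gs://my-bucket/train.csv",
    max_selected_features=100,
)

try:
    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    pipeline_definition_path, parameter_values = (
        utils.get_feature_selection_pipeline_and_parameters(**fs_kwargs)
    )
except ImportError:
    # Package not installed in this environment.
    pipeline_definition_path = parameter_values = None
```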
- google_cloud_pipeline_components.experimental.automl.tabular.utils.get_model_comparison_pipeline_and_parameters(project: str, location: str, root_dir: str, prediction_type: str, training_jobs: Dict[str, Dict[str, Any]], data_source_csv_filenames: str = '', data_source_bigquery_table_path: str = '', evaluation_data_source_csv_filenames: str = '', evaluation_data_source_bigquery_table_path: str = '', experiment: str = '', service_account: str = '', network: str = '') Tuple[str, Dict[str, Any]]
Returns a compiled model comparison pipeline and formatted parameters.
- Args:
- project: The GCP project that runs the pipeline components.
- location: The GCP region that runs the pipeline components.
- root_dir: The root GCS directory for the pipeline components.
- prediction_type: The type of problem being solved. Can be one of: regression, classification, or forecasting.
- training_jobs: A dict mapping name to a dict of training job inputs.
- data_source_csv_filenames: Comma-separated paths to CSVs stored in GCS to use as the training dataset for all training pipelines. This should be None if data_source_bigquery_table_path is not None. This should only contain data from the training and validation split and not from the test split.
- data_source_bigquery_table_path: Path to the BigQuery table to use as the training dataset for all training pipelines. This should be None if data_source_csv_filenames is not None. This should only contain data from the training and validation split and not from the test split.
- evaluation_data_source_csv_filenames: Comma-separated paths to CSVs stored in GCS to use as the evaluation dataset for all training pipelines. This should be None if evaluation_data_source_bigquery_table_path is not None. This should only contain data from the test split and not from the training and validation split.
- evaluation_data_source_bigquery_table_path: Path to the BigQuery table to use as the evaluation dataset for all training pipelines. This should be None if evaluation_data_source_csv_filenames is not None. This should only contain data from the test split and not from the training and validation split.
- experiment: Vertex Experiment to add training pipeline runs to. A new Experiment will be created if none is provided.
- service_account: Specifies the service account for the sub-pipeline jobs.
- network: The full name of the Compute Engine network to which the sub-pipeline jobs should be peered.
- Returns:
Tuple of pipeline_definition_path and parameter_values.
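A hedged sketch of a model comparison call, assuming the package is installed. The training job names and their input dicts are illustrative assumptions; consult the training pipeline helpers above for the actual inputs each job type accepts:

```python
# Hypothetical mapping of job names to training job inputs (shapes assumed).
training_jobs = {
    "automl": {"optimization_objective": "maximize-au-roc"},
    "tabnet": {"learning_rate": 0.01},
}

comparison_kwargs = dict(
    project="my-gcp-project",
    location="us-central1",
    root_dir="gs://my-bucket/pipeline_root",
    prediction_type="classification",
    training_jobs=training_jobs,
    # Training/validation data only; test split goes in the evaluation source.
    data_source_csv_filenames="gs://my-bucket/train.csv",
    evaluation_data_source_csv_filenames="gs://my-bucket/test.csv",
)

try:
    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    pipeline_definition_path, parameter_values = (
        utils.get_model_comparison_pipeline_and_parameters(**comparison_kwargs)
    )
except ImportError:
    pipeline_definition_path = parameter_values = None
```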
- google_cloud_pipeline_components.experimental.automl.tabular.utils.get_skip_architecture_search_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, optimization_objective: str, transformations: str, train_budget_milli_node_hours: float, stage_1_tuning_result_artifact_uri: str, stage_2_num_parallel_trials: int | None = None, stage_2_num_selected_trials: int | None = None, data_source_csv_filenames: str | None = None, data_source_bigquery_table_path: str | None = None, predefined_split_key: str | None = None, timestamp_split_key: str | None = None, stratified_split_key: str | None = None, training_fraction: float | None = None, validation_fraction: float | None = None, test_fraction: float | None = None, weight_column: str | None = None, optimization_objective_recall_value: float | None = None, optimization_objective_precision_value: float | None = None, cv_trainer_worker_pool_specs_override: Dict[str, Any] | None = None, export_additional_model_without_custom_ops: bool = False, stats_and_example_gen_dataflow_machine_type: str | None = None, stats_and_example_gen_dataflow_max_num_workers: int | None = None, stats_and_example_gen_dataflow_disk_size_gb: int | None = None, transform_dataflow_machine_type: str | None = None, transform_dataflow_max_num_workers: int | None = None, transform_dataflow_disk_size_gb: int | None = None, dataflow_subnetwork: str | None = None, dataflow_use_public_ips: bool = True, encryption_spec_key_name: str | None = None, additional_experiments: Dict[str, Any] | None = None, dataflow_service_account: str | None = None, run_evaluation: bool = True, evaluation_batch_predict_machine_type: str | None = None, evaluation_batch_predict_starting_replica_count: int | None = None, evaluation_batch_predict_max_replica_count: int | None = None, evaluation_batch_explain_machine_type: str | None = None, evaluation_batch_explain_starting_replica_count: int | None = None, 
evaluation_batch_explain_max_replica_count: int | None = None, evaluation_dataflow_machine_type: str | None = None, evaluation_dataflow_starting_num_workers: int | None = None, evaluation_dataflow_max_num_workers: int | None = None, evaluation_dataflow_disk_size_gb: int | None = None) Tuple[str, Dict[str, Any]]
Get the AutoML Tabular training pipeline that skips architecture search.
- Args:
- project: The GCP project that runs the pipeline components.
- location: The GCP region that runs the pipeline components.
- root_dir: The root GCS directory for the pipeline components.
- target_column: The target column name.
- prediction_type: The type of prediction the model is to produce. “classification” or “regression”.
- optimization_objective: For binary classification, “maximize-au-roc”, “minimize-log-loss”, “maximize-au-prc”, “maximize-precision-at-recall”, or “maximize-recall-at-precision”. For multi-class classification, “minimize-log-loss”. For regression, “minimize-rmse”, “minimize-mae”, or “minimize-rmsle”.
- transformations: The transformations to apply.
- train_budget_milli_node_hours: The train budget of creating this model, expressed in milli node hours, i.e. a value of 1,000 in this field means 1 node hour.
- stage_1_tuning_result_artifact_uri: The stage 1 tuning result artifact GCS URI.
- stage_2_num_parallel_trials: Number of parallel trials for stage 2.
- stage_2_num_selected_trials: Number of selected trials for stage 2.
- data_source_csv_filenames: The CSV data source.
- data_source_bigquery_table_path: The BigQuery data source.
- predefined_split_key: The predefined_split column name.
- timestamp_split_key: The timestamp_split column name.
- stratified_split_key: The stratified_split column name.
- training_fraction: The training fraction.
- validation_fraction: The validation fraction.
- test_fraction: The test fraction.
- weight_column: The weight column name.
- optimization_objective_recall_value: Required when optimization_objective is “maximize-precision-at-recall”. Must be between 0 and 1, inclusive.
- optimization_objective_precision_value: Required when optimization_objective is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.
- cv_trainer_worker_pool_specs_override: The dictionary for overriding the stage cv trainer worker pool spec.
- export_additional_model_without_custom_ops: Whether to export an additional model without custom TensorFlow operators.
- stats_and_example_gen_dataflow_machine_type: The Dataflow machine type for the stats_and_example_gen component.
- stats_and_example_gen_dataflow_max_num_workers: The max number of Dataflow workers for the stats_and_example_gen component.
- stats_and_example_gen_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for the stats_and_example_gen component.
- transform_dataflow_machine_type: The Dataflow machine type for the transform component.
- transform_dataflow_max_num_workers: The max number of Dataflow workers for the transform component.
- transform_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for the transform component.
- dataflow_subnetwork: Dataflow’s fully qualified subnetwork name; when empty, the default subnetwork will be used.
- dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.
- encryption_spec_key_name: The KMS key name.
- additional_experiments: Use this field to configure private preview features.
- dataflow_service_account: Custom service account to run Dataflow jobs.
- run_evaluation: Whether to run evaluation in the training pipeline.
- evaluation_batch_predict_machine_type: The prediction server machine type for batch predict components during evaluation.
- evaluation_batch_predict_starting_replica_count: The initial number of prediction servers for batch predict components during evaluation.
- evaluation_batch_predict_max_replica_count: The max number of prediction servers for batch predict components during evaluation.
- evaluation_batch_explain_machine_type: The prediction server machine type for batch explain components during evaluation.
- evaluation_batch_explain_starting_replica_count: The initial number of prediction servers for batch explain components during evaluation.
- evaluation_batch_explain_max_replica_count: The max number of prediction servers for batch explain components during evaluation.
- evaluation_dataflow_machine_type: The Dataflow machine type for evaluation components.
- evaluation_dataflow_starting_num_workers: The initial number of Dataflow workers for evaluation components.
- evaluation_dataflow_max_num_workers: The max number of Dataflow workers for evaluation components.
- evaluation_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for evaluation components.
- Returns:
Tuple of pipeline_definition_path and parameter_values.
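A sketch of reusing a previous stage 1 tuning result to skip architecture search, assuming the package is installed; the artifact URI, data paths, and transformations path are hypothetical (note that `transformations` is a string path for this function):

```python
# Hypothetical reuse of an earlier run's stage 1 tuning result artifact.
skip_search_kwargs = dict(
    project="my-gcp-project",
    location="us-central1",
    root_dir="gs://my-bucket/pipeline_root",
    target_column="churned",
    prediction_type="classification",
    optimization_objective="maximize-au-roc",
    transformations="gs://my-bucket/transformations.json",  # string path here
    train_budget_milli_node_hours=1000,
    stage_1_tuning_result_artifact_uri="gs://my-bucket/stage_1_tuning_result",
    data_source_csv_filenames="gs://my-bucket/train.csv",
    training_fraction=0.8,
    validation_fraction=0.1,
    test_fraction=0.1,
)

try:
    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    template_path, parameter_values = (
        utils.get_skip_architecture_search_pipeline_and_parameters(**skip_search_kwargs)
    )
except ImportError:
    template_path = parameter_values = None
```

Skipping the stage 1 search is useful when iterating on downstream settings: the expensive architecture search runs once, and later pipelines reuse its result artifact.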
- google_cloud_pipeline_components.experimental.automl.tabular.utils.get_skip_evaluation_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column_name: str, prediction_type: str, optimization_objective: str, transformations: Dict[str, Any], split_spec: Dict[str, Any], data_source: Dict[str, Any], train_budget_milli_node_hours: float, stage_1_num_parallel_trials: int = 35, stage_2_num_parallel_trials: int = 35, stage_2_num_selected_trials: int = 5, weight_column_name: str = '', study_spec_override: Dict[str, Any] | None = None, optimization_objective_recall_value: float = -1, optimization_objective_precision_value: float = -1, stage_1_tuner_worker_pool_specs_override: Dict[str, Any] | None = None, cv_trainer_worker_pool_specs_override: Dict[str, Any] | None = None, export_additional_model_without_custom_ops: bool = False, stats_and_example_gen_dataflow_machine_type: str = 'n1-standard-16', stats_and_example_gen_dataflow_max_num_workers: int = 25, stats_and_example_gen_dataflow_disk_size_gb: int = 40, transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '', additional_experiments: Dict[str, Any] | None = None) Tuple[str, Dict[str, Any]]
Get the AutoML Tabular training pipeline that skips evaluation.
- Args:
- project: The GCP project that runs the pipeline components.
- location: The GCP region that runs the pipeline components.
- root_dir: The root GCS directory for the pipeline components.
- target_column_name: The target column name.
- prediction_type: The type of prediction the model is to produce. “classification” or “regression”.
- optimization_objective: For binary classification, “maximize-au-roc”, “minimize-log-loss”, “maximize-au-prc”, “maximize-precision-at-recall”, or “maximize-recall-at-precision”. For multi-class classification, “minimize-log-loss”. For regression, “minimize-rmse”, “minimize-mae”, or “minimize-rmsle”.
- transformations: The transformations to apply.
- split_spec: The split spec.
- data_source: The data source.
- train_budget_milli_node_hours: The train budget of creating this model, expressed in milli node hours, i.e. a value of 1,000 in this field means 1 node hour.
- stage_1_num_parallel_trials: Number of parallel trials for stage 1.
- stage_2_num_parallel_trials: Number of parallel trials for stage 2.
- stage_2_num_selected_trials: Number of selected trials for stage 2.
- weight_column_name: The weight column name.
- study_spec_override: The dictionary for overriding the study spec.
- optimization_objective_recall_value: Required when optimization_objective is “maximize-precision-at-recall”. Must be between 0 and 1, inclusive.
- optimization_objective_precision_value: Required when optimization_objective is “maximize-recall-at-precision”. Must be between 0 and 1, inclusive.
- stage_1_tuner_worker_pool_specs_override: The dictionary for overriding the stage 1 tuner worker pool spec.
- cv_trainer_worker_pool_specs_override: The dictionary for overriding the stage cv trainer worker pool spec.
- export_additional_model_without_custom_ops: Whether to export an additional model without custom TensorFlow operators.
- stats_and_example_gen_dataflow_machine_type: The Dataflow machine type for the stats_and_example_gen component.
- stats_and_example_gen_dataflow_max_num_workers: The max number of Dataflow workers for the stats_and_example_gen component.
- stats_and_example_gen_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for the stats_and_example_gen component.
- transform_dataflow_machine_type: The Dataflow machine type for the transform component.
- transform_dataflow_max_num_workers: The max number of Dataflow workers for the transform component.
- transform_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for the transform component.
- dataflow_subnetwork: Dataflow’s fully qualified subnetwork name; when empty, the default subnetwork will be used.
- dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.
- encryption_spec_key_name: The KMS key name.
- additional_experiments: Use this field to configure private preview features.
- Returns:
Tuple of pipeline_definition_path and parameter_values.
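A sketch of compiling this pipeline and handing the result to a Vertex AI pipeline run, assuming the package is installed. The resource names are hypothetical and the dict shapes for transformations, split_spec, and data_source are illustrative assumptions:

```python
# Hypothetical values; replace with your own project resources.
skip_eval_kwargs = dict(
    project="my-gcp-project",
    location="us-central1",
    root_dir="gs://my-bucket/pipeline_root",
    target_column_name="churned",
    prediction_type="classification",
    optimization_objective="maximize-au-roc",
    # Assumed dict shapes, for illustration only:
    transformations={"auto": [{"column_name": "tenure"}]},
    split_spec={"fraction_split": {"training_fraction": 0.8,
                                   "validation_fraction": 0.1,
                                   "test_fraction": 0.1}},
    data_source={"csv_data_source": {"csv_filenames": ["gs://my-bucket/train.csv"]}},
    train_budget_milli_node_hours=1000,
)

try:
    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    template_path, parameter_values = (
        utils.get_skip_evaluation_pipeline_and_parameters(**skip_eval_kwargs)
    )
    # The tuple plugs directly into a Vertex AI pipeline run
    # (requires credentials, so shown here as a comment):
    # from google.cloud import aiplatform
    # aiplatform.PipelineJob(
    #     display_name="automl-tabular-skip-eval",
    #     template_path=template_path,
    #     parameter_values=parameter_values,
    # ).run()
except ImportError:
    template_path = parameter_values = None
```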
- google_cloud_pipeline_components.experimental.automl.tabular.utils.get_tabnet_hyperparameter_tuning_job_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, study_spec_metric_id: str, study_spec_metric_goal: str, study_spec_parameters_override: List[Dict[str, Any]], max_trial_count: int, parallel_trial_count: int, transform_config: str | None = None, dataset_level_custom_transformation_definitions: List[Dict[str, Any]] | None = None, dataset_level_transformations: List[Dict[str, Any]] | None = None, run_feature_selection: bool = False, feature_selection_algorithm: str | None = None, materialized_examples_format: str | None = None, max_selected_features: int | None = None, predefined_split_key: str | None = None, stratified_split_key: str | None = None, training_fraction: float | None = None, validation_fraction: float | None = None, test_fraction: float | None = None, tf_auto_transform_features: List[str] | None = None, tf_custom_transformation_definitions: List[Dict[str, Any]] | None = None, tf_transformations_path: str | None = None, enable_profiler: bool = False, cache_data: str = 'auto', seed: int = 1, eval_steps: int = 0, eval_frequency_secs: int = 600, data_source_csv_filenames: str | None = None, data_source_bigquery_table_path: str | None = None, bigquery_staging_full_dataset_id: str | None = None, weight_column: str = '', max_failed_trial_count: int = 0, study_spec_algorithm: str = 'ALGORITHM_UNSPECIFIED', study_spec_measurement_selection_type: str = 'BEST_MEASUREMENT', transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, worker_pool_specs_override: Dict[str, Any] | None = None, run_evaluation: bool = True, evaluation_batch_predict_machine_type: str = 'n1-highmem-8', evaluation_batch_predict_starting_replica_count: int = 20, evaluation_batch_predict_max_replica_count: int = 20, 
evaluation_dataflow_machine_type: str = 'n1-standard-4', evaluation_dataflow_starting_num_workers: int = 10, evaluation_dataflow_max_num_workers: int = 100, evaluation_dataflow_disk_size_gb: int = 50, dataflow_service_account: str = '', dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '') Tuple[str, Dict[str, Any]]
Get the TabNet HyperparameterTuningJob pipeline.
- Args:
- project: The GCP project that runs the pipeline components.
- location: The GCP region that runs the pipeline components.
- root_dir: The root GCS directory for the pipeline components.
- target_column: The target column name.
- prediction_type: The type of prediction the model is to produce. “classification” or “regression”.
- study_spec_metric_id: Metric to optimize. Possible values: [“loss”, “average_loss”, “rmse”, “mae”, “mql”, “accuracy”, “auc”, “precision”, “recall”].
- study_spec_metric_goal: Optimization goal of the metric. Possible values: “MAXIMIZE”, “MINIMIZE”.
- study_spec_parameters_override: List of dictionaries representing parameters to optimize. The dictionary key is the parameter_id, which is passed to the training job as a command line argument, and the dictionary value is the parameter specification of the metric.
- max_trial_count: The desired total number of trials.
- parallel_trial_count: The desired number of trials to run in parallel.
- transform_config: Path to v1 TF transformation configuration.
- dataset_level_custom_transformation_definitions: Dataset-level custom transformation definitions in string format.
- dataset_level_transformations: Dataset-level transformation configuration in string format.
- run_feature_selection: Whether to enable feature selection.
- feature_selection_algorithm: Feature selection algorithm.
- materialized_examples_format: The format for the materialized examples.
- max_selected_features: Maximum number of features to select.
- predefined_split_key: Predefined split key.
- stratified_split_key: Stratified split key.
- training_fraction: Training fraction.
- validation_fraction: Validation fraction.
- test_fraction: Test fraction.
- tf_auto_transform_features: List of auto transform features in the comma-separated string format.
- tf_custom_transformation_definitions: TF custom transformation definitions in string format.
- tf_transformations_path: Path to TF transformation configuration.
- enable_profiler: Enables profiling and saves a trace during evaluation.
- cache_data: Whether to cache data or not. If set to “auto”, caching is determined based on the dataset size.
- seed: Seed to be used for this run.
- eval_steps: Number of steps to run evaluation for. If not specified or negative, it means run evaluation on the whole validation dataset. If set to 0, it means run evaluation for a fixed number of samples.
- eval_frequency_secs: Frequency at which evaluation and checkpointing will take place.
- data_source_csv_filenames: The CSV data source.
- data_source_bigquery_table_path: The BigQuery data source.
- bigquery_staging_full_dataset_id: The BigQuery staging full dataset id for storing intermediate tables.
- weight_column: The weight column name.
- max_failed_trial_count: The number of failed trials that need to be seen before failing the HyperparameterTuningJob. If set to 0, Vertex AI decides how many trials must fail before the whole job fails.
- study_spec_algorithm: The search algorithm specified for the study. One of “ALGORITHM_UNSPECIFIED”, “GRID_SEARCH”, or “RANDOM_SEARCH”.
- study_spec_measurement_selection_type: Which measurement to use if/when the service automatically selects the final measurement from previously reported intermediate measurements. One of “BEST_MEASUREMENT” or “LAST_MEASUREMENT”.
- transform_dataflow_machine_type: The Dataflow machine type for the transform component.
- transform_dataflow_max_num_workers: The max number of Dataflow workers for the transform component.
- transform_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for the transform component.
- worker_pool_specs_override: The dictionary for overriding training and evaluation worker pool specs.
- run_evaluation: Whether to run evaluation steps during training.
- evaluation_batch_predict_machine_type: The prediction server machine type for batch predict components during evaluation.
- evaluation_batch_predict_starting_replica_count: The initial number of prediction servers for batch predict components during evaluation.
- evaluation_batch_predict_max_replica_count: The max number of prediction servers for batch predict components during evaluation.
- evaluation_dataflow_machine_type: The Dataflow machine type for evaluation components.
- evaluation_dataflow_starting_num_workers: The initial number of Dataflow workers for evaluation components.
- evaluation_dataflow_max_num_workers: The max number of Dataflow workers for evaluation components.
- evaluation_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for evaluation components.
- dataflow_service_account: Custom service account to run Dataflow jobs.
- dataflow_subnetwork: Dataflow’s fully qualified subnetwork name; when empty, the default subnetwork will be used.
- dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.
- encryption_spec_key_name: The KMS key name.
- Returns:
Tuple of pipeline_definition_path and parameter_values.
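A hedged sketch of a TabNet hyperparameter tuning call, assuming the package is installed. The parameter-spec dicts below are modeled on the Vertex AI StudySpec parameter format, but their exact shape and all values here are assumptions:

```python
# Hypothetical search space; dict shapes follow the Vertex AI StudySpec
# parameter format (assumed), values are illustrative.
study_spec_parameters_override = [
    {
        "parameter_id": "learning_rate",
        "double_value_spec": {"min_value": 1e-4, "max_value": 1e-2},
        "scale_type": "UNIT_LOG_SCALE",
    },
    {
        "parameter_id": "num_decision_steps",
        "integer_value_spec": {"min_value": 3, "max_value": 8},
        "scale_type": "UNIT_LINEAR_SCALE",
    },
]

hpt_kwargs = dict(
    project="my-gcp-project",
    location="us-central1",
    root_dir="gs://my-bucket/pipeline_root",
    target_column="churned",
    prediction_type="classification",
    study_spec_metric_id="auc",
    study_spec_metric_goal="MAXIMIZE",
    study_spec_parameters_override=study_spec_parameters_override,
    max_trial_count=20,
    parallel_trial_count=5,
    data_source_csv_filenames="gs://my-bucket/train.csv",
)

try:
    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    template_path, parameter_values = (
        utils.get_tabnet_hyperparameter_tuning_job_pipeline_and_parameters(**hpt_kwargs)
    )
except ImportError:
    template_path = parameter_values = None
```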
- google_cloud_pipeline_components.experimental.automl.tabular.utils.get_tabnet_study_spec_parameters_override(dataset_size_bucket: str, prediction_type: str, training_budget_bucket: str) List[Dict[str, Any]]
Get study_spec_parameters_override for a TabNet hyperparameter tuning job.
- Args:
- dataset_size_bucket: Size of the dataset. One of “small” (< 1M rows),
“medium” (1M - 100M rows), or “large” (> 100M rows).
- prediction_type: The type of prediction the model is to produce.
“classification” or “regression”.
- training_budget_bucket: Bucket of the estimated training budget. One of
“small” (< $600), “medium” ($600 - $2400), or “large” (> $2400). This parameter is only used as a hint for the hyperparameter search space, unrelated to the real cost.
- Returns:
List of study_spec_parameters_override.
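A sketch of generating a TabNet search space from the size/budget buckets and feeding it into the tuning helper above, assuming the package is installed; bucket choices are hypothetical:

```python
# Hypothetical bucket choices describing the dataset and budget.
dataset_size_bucket = "medium"       # 1M - 100M rows
training_budget_bucket = "small"     # hint only; unrelated to the real cost
prediction_type = "classification"

try:
    from google_cloud_pipeline_components.experimental.automl.tabular import utils

    # Returns a list suitable for study_spec_parameters_override in
    # get_tabnet_hyperparameter_tuning_job_pipeline_and_parameters.
    overrides = utils.get_tabnet_study_spec_parameters_override(
        dataset_size_bucket=dataset_size_bucket,
        prediction_type=prediction_type,
        training_budget_bucket=training_budget_bucket,
    )
except ImportError:
    overrides = []
```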
- google_cloud_pipeline_components.experimental.automl.tabular.utils.get_tabnet_trainer_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, learning_rate: float, transform_config: str | None = None, dataset_level_custom_transformation_definitions: List[Dict[str, Any]] | None = None, dataset_level_transformations: List[Dict[str, Any]] | None = None, run_feature_selection: bool = False, feature_selection_algorithm: str | None = None, materialized_examples_format: str | None = None, max_selected_features: int | None = None, predefined_split_key: str | None = None, stratified_split_key: str | None = None, training_fraction: float | None = None, validation_fraction: float | None = None, test_fraction: float | None = None, tf_auto_transform_features: List[str] | None = None, tf_custom_transformation_definitions: List[Dict[str, Any]] | None = None, tf_transformations_path: str | None = None, max_steps: int = -1, max_train_secs: int = -1, large_category_dim: int = 1, large_category_thresh: int = 300, yeo_johnson_transform: bool = True, feature_dim: int = 64, feature_dim_ratio: float = 0.5, num_decision_steps: int = 6, relaxation_factor: float = 1.5, decay_every: float = 100, decay_rate: float = 0.95, gradient_thresh: float = 2000, sparsity_loss_weight: float = 1e-05, batch_momentum: float = 0.95, batch_size_ratio: float = 0.25, num_transformer_layers: int = 4, num_transformer_layers_ratio: float = 0.25, class_weight: float = 1.0, loss_function_type: str = 'default', alpha_focal_loss: float = 0.25, gamma_focal_loss: float = 2.0, enable_profiler: bool = False, cache_data: str = 'auto', seed: int = 1, eval_steps: int = 0, batch_size: int = 100, measurement_selection_type: str | None = None, optimization_metric: str | None = None, eval_frequency_secs: int = 600, data_source_csv_filenames: str | None = None, data_source_bigquery_table_path: str | None = None, bigquery_staging_full_dataset_id: str | None = None, 
weight_column: str = '', transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, worker_pool_specs_override: Dict[str, Any] | None = None, run_evaluation: bool = True, evaluation_batch_predict_machine_type: str = 'n1-highmem-8', evaluation_batch_predict_starting_replica_count: int = 20, evaluation_batch_predict_max_replica_count: int = 20, evaluation_dataflow_machine_type: str = 'n1-standard-4', evaluation_dataflow_starting_num_workers: int = 10, evaluation_dataflow_max_num_workers: int = 100, evaluation_dataflow_disk_size_gb: int = 50, dataflow_service_account: str = '', dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '') Tuple[str, Dict[str, Any]]
Get the TabNet training pipeline.
- Args:
- project: The GCP project that runs the pipeline components.
- location: The GCP region that runs the pipeline components.
- root_dir: The root GCS directory for the pipeline components.
- target_column: The target column name.
- prediction_type: The type of prediction the model is to produce. “classification” or “regression”.
- learning_rate: The learning rate used by the linear optimizer.
- transform_config: Path to v1 TF transformation configuration.
- dataset_level_custom_transformation_definitions: Dataset-level custom transformation definitions in string format.
- dataset_level_transformations: Dataset-level transformation configuration in string format.
- run_feature_selection: Whether to enable feature selection.
- feature_selection_algorithm: Feature selection algorithm.
- materialized_examples_format: The format for the materialized examples.
- max_selected_features: Maximum number of features to select.
- predefined_split_key: Predefined split key.
- stratified_split_key: Stratified split key.
- training_fraction: Training fraction.
- validation_fraction: Validation fraction.
- test_fraction: Test fraction.
- tf_auto_transform_features: List of auto transform features in the comma-separated string format.
- tf_custom_transformation_definitions: TF custom transformation definitions in string format.
- tf_transformations_path: Path to TF transformation configuration.
- max_steps: Number of steps to run the trainer for.
- max_train_secs: Amount of time in seconds to run the trainer for.
- large_category_dim: Embedding dimension for categorical features with a large number of categories.
- large_category_thresh: Threshold on the number of categories above which the large_category_dim embedding dimension is applied.
- yeo_johnson_transform: Enables trainable Yeo-Johnson power transform.
- feature_dim: Dimensionality of the hidden representation in the feature transformation block.
- feature_dim_ratio: The ratio of output dimension (dimensionality of the outputs of each decision step) to feature dimension.
- num_decision_steps: Number of sequential decision steps.
- relaxation_factor: Relaxation factor that promotes the reuse of each feature at different decision steps. When it is 1, a feature is enforced to be used only at one decision step; as it increases, more flexibility is provided to use a feature at multiple decision steps.
- decay_every: Number of iterations for periodically applying learning rate decay.
- decay_rate: Learning rate decay rate.
- gradient_thresh: Threshold for the norm of gradients for clipping.
- sparsity_loss_weight: Weight of the loss for sparsity regularization (increasing it will yield more sparse feature selection).
- batch_momentum: Momentum in ghost batch normalization.
- batch_size_ratio: The ratio of virtual batch size (size of the ghost batch normalization) to batch size.
- num_transformer_layers: The number of transformer layers for each decision step.
- num_transformer_layers_ratio: The ratio of shared transformer layers to total transformer layers.
- class_weight: The class weight is used to compute a weighted cross entropy, which is helpful in classifying imbalanced datasets. Only used for classification.
- loss_function_type: Loss function type. Loss functions for classification: [cross_entropy, weighted_cross_entropy, focal_loss]; default is cross_entropy. Loss functions for regression: [rmse, mae, mse]; default is mse.
- alpha_focal_loss: Alpha value (balancing factor) in the focal_loss function. Only used for classification.
- gamma_focal_loss: Gamma value (modulating factor) for the focal loss function. Only used for classification.
- enable_profiler: Enables profiling and saves a trace during evaluation.
- cache_data: Whether to cache data or not. If set to “auto”, caching is determined based on the dataset size.
- seed: Seed to be used for this run.
- eval_steps: Number of steps to run evaluation for. If not specified or negative, it means run evaluation on the whole validation dataset. If set to 0, it means run evaluation for a fixed number of samples.
- batch_size: Batch size for training.
- measurement_selection_type: Which measurement to use if/when the service automatically selects the final measurement from previously reported intermediate measurements. One of “BEST_MEASUREMENT” or “LAST_MEASUREMENT”.
- optimization_metric: Optimization metric used for measurement_selection_type. Default is “rmse” for regression and “auc” for classification.
- eval_frequency_secs: Frequency at which evaluation and checkpointing will take place.
- data_source_csv_filenames: The CSV data source.
- data_source_bigquery_table_path: The BigQuery data source.
- bigquery_staging_full_dataset_id: The BigQuery staging full dataset id for
storing intermediate tables.
weight_column: The weight column name. transform_dataflow_machine_type: The dataflow machine type for transform
component.
- transform_dataflow_max_num_workers: The max number of Dataflow workers for
transform component.
- transform_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for
transform component.
- worker_pool_specs_override: The dictionary for overriding training and
- evaluation worker pool specs. The dictionary should be of format
run_evaluation: Whether to run evaluation steps during training. evaluation_batch_predict_machine_type: The prediction server machine type
for batch predict components during evaluation.
- evaluation_batch_predict_starting_replica_count: The initial number of
prediction server for batch predict components during evaluation.
- evaluation_batch_predict_max_replica_count: The max number of prediction
server for batch predict components during evaluation.
- evaluation_dataflow_machine_type: The dataflow machine type for evaluation
components.
- evaluation_dataflow_starting_num_workers: The initial number of Dataflow
workers for evaluation components.
- evaluation_dataflow_max_num_workers: The max number of Dataflow workers for
evaluation components.
- evaluation_dataflow_disk_size_gb: Dataflow worker’s disk size in GB for
evaluation components.
dataflow_service_account: Custom service account to run dataflow jobs. dataflow_subnetwork: Dataflow’s fully qualified subnetwork name, when empty
- the default subnetwork will be used. Example:
- dataflow_use_public_ips: Specifies whether Dataflow workers use public IP
addresses.
encryption_spec_key_name: The KMS key name.
- Returns:
Tuple of pipeline_definition_path and parameter_values.
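The Args list above accepts either a CSV or a BigQuery data source. A minimal sketch of assembling the data-source keyword arguments before calling the function; the helper and the exactly-one-source rule are illustrative assumptions, not part of the library, and the bucket/table names are placeholders:

```python
def build_data_source_args(csv_filenames=None, bigquery_table_path=None):
    """Return the data-source kwargs for the pipeline function,
    enforcing that exactly one of the two sources is provided
    (an assumption made for this sketch)."""
    if (csv_filenames is None) == (bigquery_table_path is None):
        raise ValueError("provide exactly one of CSV or BigQuery source")
    if csv_filenames is not None:
        return {"data_source_csv_filenames": csv_filenames}
    return {"data_source_bigquery_table_path": bigquery_table_path}

# Placeholder GCS path; merge the result into the other pipeline kwargs.
source_args = build_data_source_args(csv_filenames="gs://my-bucket/train.csv")
```

The same pattern applies to the other pipeline functions in this module, which take the identical pair of data-source parameters.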
- google_cloud_pipeline_components.experimental.automl.tabular.utils.get_wide_and_deep_hyperparameter_tuning_job_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, study_spec_metric_id: str, study_spec_metric_goal: str, study_spec_parameters_override: List[Dict[str, Any]], max_trial_count: int, parallel_trial_count: int, transform_config: str | None = None, dataset_level_custom_transformation_definitions: List[Dict[str, Any]] | None = None, dataset_level_transformations: List[Dict[str, Any]] | None = None, run_feature_selection: bool = False, feature_selection_algorithm: str | None = None, materialized_examples_format: str | None = None, max_selected_features: int | None = None, predefined_split_key: str | None = None, stratified_split_key: str | None = None, training_fraction: float | None = None, validation_fraction: float | None = None, test_fraction: float | None = None, tf_auto_transform_features: List[str] | None = None, tf_custom_transformation_definitions: List[Dict[str, Any]] | None = None, tf_transformations_path: str | None = None, enable_profiler: bool = False, cache_data: str = 'auto', seed: int = 1, eval_steps: int = 0, eval_frequency_secs: int = 600, data_source_csv_filenames: str | None = None, data_source_bigquery_table_path: str | None = None, bigquery_staging_full_dataset_id: str | None = None, weight_column: str = '', max_failed_trial_count: int = 0, study_spec_algorithm: str = 'ALGORITHM_UNSPECIFIED', study_spec_measurement_selection_type: str = 'BEST_MEASUREMENT', transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, worker_pool_specs_override: Dict[str, Any] | None = None, run_evaluation: bool = True, evaluation_batch_predict_machine_type: str = 'n1-highmem-8', evaluation_batch_predict_starting_replica_count: int = 20, evaluation_batch_predict_max_replica_count: int = 20, 
evaluation_dataflow_machine_type: str = 'n1-standard-4', evaluation_dataflow_starting_num_workers: int = 10, evaluation_dataflow_max_num_workers: int = 100, evaluation_dataflow_disk_size_gb: int = 50, dataflow_service_account: str = '', dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '') Tuple[str, Dict[str, Any]]
Get the Wide & Deep algorithm HyperparameterTuningJob pipeline.
- Args:
- project: The GCP project that runs the pipeline components.
- location: The GCP region that runs the pipeline components.
- root_dir: The root GCS directory for the pipeline components.
- target_column: The target column name.
- prediction_type: The type of prediction the model is to produce. "classification" or "regression".
- study_spec_metric_id: Metric to optimize, possible values: ['loss', 'average_loss', 'rmse', 'mae', 'mql', 'accuracy', 'auc', 'precision', 'recall'].
- study_spec_metric_goal: Optimization goal of the metric, possible values: "MAXIMIZE", "MINIMIZE".
- study_spec_parameters_override: List of dictionaries representing parameters to optimize. The dictionary key is the parameter_id, which is passed to the training job as a command line argument, and the dictionary value is the parameter specification of the metric.
- max_trial_count: The desired total number of trials.
- parallel_trial_count: The desired number of trials to run in parallel.
- transform_config: Path to v1 TF transformation configuration.
- dataset_level_custom_transformation_definitions: Dataset-level custom transformation definitions in string format.
- dataset_level_transformations: Dataset-level transformation configuration in string format.
- run_feature_selection: Whether to enable feature selection.
- feature_selection_algorithm: Feature selection algorithm.
- materialized_examples_format: The format for the materialized examples.
- max_selected_features: Maximum number of features to select.
- predefined_split_key: Predefined split key.
- stratified_split_key: Stratified split key.
- training_fraction: Training fraction.
- validation_fraction: Validation fraction.
- test_fraction: Test fraction.
- tf_auto_transform_features: List of auto transform features in the comma-separated string format.
- tf_custom_transformation_definitions: TF custom transformation definitions in string format.
- tf_transformations_path: Path to TF transformation configuration.
- enable_profiler: Enables profiling and saves a trace during evaluation.
- cache_data: Whether to cache data or not. If set to 'auto', caching is determined based on the dataset size.
- seed: Seed to be used for this run.
- eval_steps: Number of steps to run evaluation for. If not specified or negative, it means run evaluation on the whole validation dataset. If set to 0, it means run evaluation for a fixed number of samples.
- eval_frequency_secs: Frequency at which evaluation and checkpointing will take place.
- data_source_csv_filenames: The CSV data source.
- data_source_bigquery_table_path: The BigQuery data source.
- bigquery_staging_full_dataset_id: The BigQuery staging full dataset id for storing intermediate tables.
- weight_column: The weight column name.
- max_failed_trial_count: The number of failed trials that need to be seen before failing the HyperparameterTuningJob. If set to 0, Vertex AI decides how many trials must fail before the whole job fails.
- study_spec_algorithm: The search algorithm specified for the study. One of "ALGORITHM_UNSPECIFIED", "GRID_SEARCH", or "RANDOM_SEARCH".
- study_spec_measurement_selection_type: Which measurement to use if/when the service automatically selects the final measurement from previously reported intermediate measurements. One of "BEST_MEASUREMENT" or "LAST_MEASUREMENT".
- transform_dataflow_machine_type: The Dataflow machine type for the transform component.
- transform_dataflow_max_num_workers: The max number of Dataflow workers for the transform component.
- transform_dataflow_disk_size_gb: Dataflow worker's disk size in GB for the transform component.
- worker_pool_specs_override: The dictionary for overriding training and evaluation worker pool specs. The dictionary should be of format
- run_evaluation: Whether to run evaluation steps during training.
- evaluation_batch_predict_machine_type: The prediction server machine type for batch predict components during evaluation.
- evaluation_batch_predict_starting_replica_count: The initial number of prediction servers for batch predict components during evaluation.
- evaluation_batch_predict_max_replica_count: The max number of prediction servers for batch predict components during evaluation.
- evaluation_dataflow_machine_type: The Dataflow machine type for evaluation components.
- evaluation_dataflow_starting_num_workers: The initial number of Dataflow workers for evaluation components.
- evaluation_dataflow_max_num_workers: The max number of Dataflow workers for evaluation components.
- evaluation_dataflow_disk_size_gb: Dataflow worker's disk size in GB for evaluation components.
- dataflow_service_account: Custom service account to run Dataflow jobs.
- dataflow_subnetwork: Dataflow's fully qualified subnetwork name; when empty, the default subnetwork will be used. Example:
- dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.
- encryption_spec_key_name: The KMS key name.
- Returns:
Tuple of pipeline_definition_path and parameter_values.
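The study_spec_parameters_override argument described above is a list of dictionaries keyed by parameter_id. A plausible shape for two entries, following the Vertex AI StudySpec parameter format; the exact field names and ranges here are an illustrative assumption, not taken from this page:

```python
# Hypothetical study_spec_parameters_override entries for the Wide & Deep
# hyperparameter tuning pipeline. Field names follow the Vertex AI
# StudySpec parameter format and should be treated as an assumption.
study_spec_parameters_override = [
    {
        "parameter_id": "learning_rate",
        "double_value_spec": {"min_value": 1e-4, "max_value": 1e-1},
        "scale_type": "UNIT_LOG_SCALE",
    },
    {
        "parameter_id": "batch_size",
        "discrete_value_spec": {"values": [64, 128, 256]},
    },
]

# Each entry names the training-job command line argument it tunes.
assert all("parameter_id" in spec for spec in study_spec_parameters_override)
```

In practice, get_wide_and_deep_study_spec_parameters_override() (below) returns a ready-made list in this shape that can be edited instead of written from scratch.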
- google_cloud_pipeline_components.experimental.automl.tabular.utils.get_wide_and_deep_study_spec_parameters_override() List[Dict[str, Any]]
Get study_spec_parameters_override for a Wide & Deep hyperparameter tuning job.
- Returns:
List of study_spec_parameters_override.
- google_cloud_pipeline_components.experimental.automl.tabular.utils.get_wide_and_deep_trainer_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, learning_rate: float, dnn_learning_rate: float, transform_config: str | None = None, dataset_level_custom_transformation_definitions: List[Dict[str, Any]] | None = None, dataset_level_transformations: List[Dict[str, Any]] | None = None, run_feature_selection: bool = False, feature_selection_algorithm: str | None = None, materialized_examples_format: str | None = None, max_selected_features: int | None = None, predefined_split_key: str | None = None, stratified_split_key: str | None = None, training_fraction: float | None = None, validation_fraction: float | None = None, test_fraction: float | None = None, tf_auto_transform_features: List[str] | None = None, tf_custom_transformation_definitions: List[Dict[str, Any]] | None = None, tf_transformations_path: str | None = None, optimizer_type: str = 'adam', max_steps: int = -1, max_train_secs: int = -1, l1_regularization_strength: float = 0, l2_regularization_strength: float = 0, l2_shrinkage_regularization_strength: float = 0, beta_1: float = 0.9, beta_2: float = 0.999, hidden_units: str = '30,30,30', use_wide: bool = True, embed_categories: bool = True, dnn_dropout: float = 0, dnn_optimizer_type: str = 'adam', dnn_l1_regularization_strength: float = 0, dnn_l2_regularization_strength: float = 0, dnn_l2_shrinkage_regularization_strength: float = 0, dnn_beta_1: float = 0.9, dnn_beta_2: float = 0.999, enable_profiler: bool = False, cache_data: str = 'auto', seed: int = 1, eval_steps: int = 0, batch_size: int = 100, measurement_selection_type: str | None = None, optimization_metric: str | None = None, eval_frequency_secs: int = 600, data_source_csv_filenames: str | None = None, data_source_bigquery_table_path: str | None = None, bigquery_staging_full_dataset_id: str | None = None, weight_column: str = '', 
transform_dataflow_machine_type: str = 'n1-standard-16', transform_dataflow_max_num_workers: int = 25, transform_dataflow_disk_size_gb: int = 40, worker_pool_specs_override: Dict[str, Any] | None = None, run_evaluation: bool = True, evaluation_batch_predict_machine_type: str = 'n1-highmem-8', evaluation_batch_predict_starting_replica_count: int = 20, evaluation_batch_predict_max_replica_count: int = 20, evaluation_dataflow_machine_type: str = 'n1-standard-4', evaluation_dataflow_starting_num_workers: int = 10, evaluation_dataflow_max_num_workers: int = 100, evaluation_dataflow_disk_size_gb: int = 50, dataflow_service_account: str = '', dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, encryption_spec_key_name: str = '') Tuple[str, Dict[str, Any]]
Get the Wide & Deep training pipeline.
- Args:
- project: The GCP project that runs the pipeline components.
- location: The GCP region that runs the pipeline components.
- root_dir: The root GCS directory for the pipeline components.
- target_column: The target column name.
- prediction_type: The type of prediction the model is to produce. 'classification' or 'regression'.
- learning_rate: The learning rate used by the linear optimizer.
- dnn_learning_rate: The learning rate for training the deep part of the model.
- transform_config: Path to v1 TF transformation configuration.
- dataset_level_custom_transformation_definitions: Dataset-level custom transformation definitions in string format.
- dataset_level_transformations: Dataset-level transformation configuration in string format.
- run_feature_selection: Whether to enable feature selection.
- feature_selection_algorithm: Feature selection algorithm.
- materialized_examples_format: The format for the materialized examples.
- max_selected_features: Maximum number of features to select.
- predefined_split_key: Predefined split key.
- stratified_split_key: Stratified split key.
- training_fraction: Training fraction.
- validation_fraction: Validation fraction.
- test_fraction: Test fraction.
- tf_auto_transform_features: List of auto transform features in the comma-separated string format.
- tf_custom_transformation_definitions: TF custom transformation definitions in string format.
- tf_transformations_path: Path to TF transformation configuration.
- optimizer_type: The type of optimizer to use. Choices are "adam", "ftrl", and "sgd" for the Adam, FTRL, and Gradient Descent optimizers, respectively.
- max_steps: Number of steps to run the trainer for.
- max_train_secs: Amount of time in seconds to run the trainer for.
- l1_regularization_strength: L1 regularization strength for optimizer_type="ftrl".
- l2_regularization_strength: L2 regularization strength for optimizer_type="ftrl".
- l2_shrinkage_regularization_strength: L2 shrinkage regularization strength for optimizer_type="ftrl".
- beta_1: Beta 1 value for optimizer_type="adam".
- beta_2: Beta 2 value for optimizer_type="adam".
- hidden_units: Hidden layer sizes to use for DNN feature columns, provided as a comma-separated string.
- use_wide: If set to true, the categorical columns will be used in the wide part of the DNN model.
- embed_categories: If set to true, the categorical columns will be embedded and used in the deep part of the model. Embedding size is the square root of the column cardinality.
- dnn_dropout: The probability we will drop out a given coordinate.
- dnn_optimizer_type: The type of optimizer to use for the deep part of the model. Choices are "adam", "ftrl", and "sgd" for the Adam, FTRL, and Gradient Descent optimizers, respectively.
- dnn_l1_regularization_strength: L1 regularization strength for dnn_optimizer_type="ftrl".
- dnn_l2_regularization_strength: L2 regularization strength for dnn_optimizer_type="ftrl".
- dnn_l2_shrinkage_regularization_strength: L2 shrinkage regularization strength for dnn_optimizer_type="ftrl".
- dnn_beta_1: Beta 1 value for dnn_optimizer_type="adam".
- dnn_beta_2: Beta 2 value for dnn_optimizer_type="adam".
- enable_profiler: Enables profiling and saves a trace during evaluation.
- cache_data: Whether to cache data or not. If set to 'auto', caching is determined based on the dataset size.
- seed: Seed to be used for this run.
- eval_steps: Number of steps to run evaluation for. If not specified or negative, it means run evaluation on the whole validation dataset. If set to 0, it means run evaluation for a fixed number of samples.
- batch_size: Batch size for training.
- measurement_selection_type: Which measurement to use if/when the service automatically selects the final measurement from previously reported intermediate measurements. One of "BEST_MEASUREMENT" or "LAST_MEASUREMENT".
- optimization_metric: Optimization metric used for measurement_selection_type. Default is "rmse" for regression and "auc" for classification.
- eval_frequency_secs: Frequency at which evaluation and checkpointing will take place.
- data_source_csv_filenames: The CSV data source.
- data_source_bigquery_table_path: The BigQuery data source.
- bigquery_staging_full_dataset_id: The BigQuery staging full dataset id for storing intermediate tables.
- weight_column: The weight column name.
- transform_dataflow_machine_type: The Dataflow machine type for the transform component.
- transform_dataflow_max_num_workers: The max number of Dataflow workers for the transform component.
- transform_dataflow_disk_size_gb: Dataflow worker's disk size in GB for the transform component.
- worker_pool_specs_override: The dictionary for overriding training and evaluation worker pool specs. The dictionary should be of format
- run_evaluation: Whether to run evaluation steps during training.
- evaluation_batch_predict_machine_type: The prediction server machine type for batch predict components during evaluation.
- evaluation_batch_predict_starting_replica_count: The initial number of prediction servers for batch predict components during evaluation.
- evaluation_batch_predict_max_replica_count: The max number of prediction servers for batch predict components during evaluation.
- evaluation_dataflow_machine_type: The Dataflow machine type for evaluation components.
- evaluation_dataflow_starting_num_workers: The initial number of Dataflow workers for evaluation components.
- evaluation_dataflow_max_num_workers: The max number of Dataflow workers for evaluation components.
- evaluation_dataflow_disk_size_gb: Dataflow worker's disk size in GB for evaluation components.
- dataflow_service_account: Custom service account to run Dataflow jobs.
- dataflow_subnetwork: Dataflow's fully qualified subnetwork name; when empty, the default subnetwork will be used. Example:
- dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.
- encryption_spec_key_name: The KMS key name.
- Returns:
Tuple of pipeline_definition_path and parameter_values.
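The hidden_units argument above is passed as a comma-separated string (default '30,30,30'). A small sketch of how such a value maps to layer widths; the parsing helper is illustrative and not part of the library:

```python
def parse_hidden_units(hidden_units: str) -> list:
    """Convert a comma-separated hidden_units string, e.g. the
    default '30,30,30', into a list of integer layer widths."""
    return [int(width) for width in hidden_units.split(",")]

# Three hidden layers of 30 units each, matching the default.
layers = parse_hidden_units("30,30,30")
```

The same comma-separated-string convention shows up elsewhere in this module, e.g. tf_auto_transform_features and the XGBoost eval_metric argument.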
- google_cloud_pipeline_components.experimental.automl.tabular.utils.get_xgboost_hyperparameter_tuning_job_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, objective: str, study_spec_metric_id: str, study_spec_metric_goal: str, max_trial_count: int, parallel_trial_count: int, study_spec_parameters_override: List[Dict[str, Any]] | None = None, eval_metric: str | None = None, disable_default_eval_metric: int | None = None, seed: int | None = None, seed_per_iteration: bool | None = None, dataset_level_custom_transformation_definitions: List[Dict[str, Any]] | None = None, dataset_level_transformations: List[Dict[str, Any]] | None = None, run_feature_selection: bool | None = None, feature_selection_algorithm: str | None = None, max_selected_features: int | None = None, predefined_split_key: str | None = None, stratified_split_key: str | None = None, training_fraction: float | None = None, validation_fraction: float | None = None, test_fraction: float | None = None, tf_auto_transform_features: List[str] | None = None, tf_custom_transformation_definitions: List[Dict[str, Any]] | None = None, tf_transformations_path: str | None = None, data_source_csv_filenames: str | None = None, data_source_bigquery_table_path: str | None = None, bigquery_staging_full_dataset_id: str | None = None, weight_column: str | None = None, max_failed_trial_count: int | None = None, training_machine_type: str | None = None, training_total_replica_count: int | None = None, training_accelerator_type: str | None = None, training_accelerator_count: int | None = None, study_spec_algorithm: str | None = None, study_spec_measurement_selection_type: str | None = None, transform_dataflow_machine_type: str | None = None, transform_dataflow_max_num_workers: int | None = None, transform_dataflow_disk_size_gb: int | None = None, run_evaluation: bool | None = None, evaluation_batch_predict_machine_type: str | None = None, evaluation_batch_predict_starting_replica_count: 
int | None = None, evaluation_batch_predict_max_replica_count: int | None = None, evaluation_dataflow_machine_type: str | None = None, evaluation_dataflow_starting_num_workers: int | None = None, evaluation_dataflow_max_num_workers: int | None = None, evaluation_dataflow_disk_size_gb: int | None = None, dataflow_service_account: str | None = None, dataflow_subnetwork: str | None = None, dataflow_use_public_ips: bool | None = None, encryption_spec_key_name: str | None = None)
Get the XGBoost HyperparameterTuningJob pipeline.
- Args:
- project: The GCP project that runs the pipeline components.
- location: The GCP region that runs the pipeline components.
- root_dir: The root GCS directory for the pipeline components.
- target_column: The target column name.
- objective: Specifies the learning task and the learning objective. Must be one of [reg:squarederror, reg:squaredlogerror, reg:logistic, reg:gamma, reg:tweedie, reg:pseudohubererror, binary:logistic, multi:softprob].
- study_spec_metric_id: Metric to optimize. For options, please look under 'eval_metric' at https://xgboost.readthedocs.io/en/stable/parameter.html#learning-task-parameters.
- study_spec_metric_goal: Optimization goal of the metric, possible values: "MAXIMIZE", "MINIMIZE".
- max_trial_count: The desired total number of trials.
- parallel_trial_count: The desired number of trials to run in parallel.
- study_spec_parameters_override: List of dictionaries representing parameters to optimize. The dictionary key is the parameter_id, which is passed to the training job as a command line argument, and the dictionary value is the parameter specification of the metric.
- eval_metric: Evaluation metrics for validation data represented as a comma-separated string.
- disable_default_eval_metric: Flag to disable default metric. Set to >0 to disable. Defaults to 0.
- seed: Random seed.
- seed_per_iteration: Seed PRNG deterministically via iteration number.
- dataset_level_custom_transformation_definitions: Dataset-level custom transformation definitions in string format.
- dataset_level_transformations: Dataset-level transformation configuration in string format.
- run_feature_selection: Whether to enable feature selection.
- feature_selection_algorithm: Feature selection algorithm.
- max_selected_features: Maximum number of features to select.
- predefined_split_key: Predefined split key.
- stratified_split_key: Stratified split key.
- training_fraction: Training fraction.
- validation_fraction: Validation fraction.
- test_fraction: Test fraction.
- tf_auto_transform_features: List of auto transform features in the comma-separated string format.
- tf_custom_transformation_definitions: TF custom transformation definitions in string format.
- tf_transformations_path: Path to TF transformation configuration.
- data_source_csv_filenames: The CSV data source.
- data_source_bigquery_table_path: The BigQuery data source.
- bigquery_staging_full_dataset_id: The BigQuery staging full dataset id for storing intermediate tables.
- weight_column: The weight column name.
- max_failed_trial_count: The number of failed trials that need to be seen before failing the HyperparameterTuningJob. If set to 0, Vertex AI decides how many trials must fail before the whole job fails.
- training_machine_type: Machine type.
- training_total_replica_count: Number of workers.
- training_accelerator_type: Accelerator type.
- training_accelerator_count: Accelerator count.
- study_spec_algorithm: The search algorithm specified for the study. One of 'ALGORITHM_UNSPECIFIED', 'GRID_SEARCH', or 'RANDOM_SEARCH'.
- study_spec_measurement_selection_type: Which measurement to use if/when the service automatically selects the final measurement from previously reported intermediate measurements. One of "BEST_MEASUREMENT" or "LAST_MEASUREMENT".
- transform_dataflow_machine_type: The Dataflow machine type for the transform component.
- transform_dataflow_max_num_workers: The max number of Dataflow workers for the transform component.
- transform_dataflow_disk_size_gb: Dataflow worker's disk size in GB for the transform component.
- run_evaluation: Whether to run evaluation steps during training.
- evaluation_batch_predict_machine_type: The prediction server machine type for batch predict components during evaluation.
- evaluation_batch_predict_starting_replica_count: The initial number of prediction servers for batch predict components during evaluation.
- evaluation_batch_predict_max_replica_count: The max number of prediction servers for batch predict components during evaluation.
- evaluation_dataflow_machine_type: The Dataflow machine type for evaluation components.
- evaluation_dataflow_starting_num_workers: The initial number of Dataflow workers for evaluation components.
- evaluation_dataflow_max_num_workers: The max number of Dataflow workers for evaluation components.
- evaluation_dataflow_disk_size_gb: Dataflow worker's disk size in GB for evaluation components.
- dataflow_service_account: Custom service account to run Dataflow jobs.
- dataflow_subnetwork: Dataflow's fully qualified subnetwork name; when empty, the default subnetwork will be used. Example:
- dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.
- encryption_spec_key_name: The KMS key name.
- Returns:
Tuple of pipeline_definition_path and parameter_values.
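The objective string must be one of the XGBoost learning-task objectives enumerated in the Args list above. A quick client-side guard can fail fast before a pipeline run is launched; the helper itself is illustrative, not part of the library:

```python
# Objectives accepted by the XGBoost pipelines, as enumerated in the
# Args section above.
SUPPORTED_OBJECTIVES = {
    "reg:squarederror", "reg:squaredlogerror", "reg:logistic",
    "reg:gamma", "reg:tweedie", "reg:pseudohubererror",
    "binary:logistic", "multi:softprob",
}

def check_objective(objective: str) -> str:
    """Raise a clear error for unsupported objectives instead of
    letting the pipeline fail remotely."""
    if objective not in SUPPORTED_OBJECTIVES:
        raise ValueError(f"unsupported objective: {objective!r}")
    return objective
```

For example, check_objective("binary:logistic") passes, while a multi-class metric typo such as "multi:softmax" (not in the list above) would be rejected up front.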
- google_cloud_pipeline_components.experimental.automl.tabular.utils.get_xgboost_study_spec_parameters_override() List[Dict[str, Any]]
Get study_spec_parameters_override for an XGBoost hyperparameter tuning job.
- Returns:
List of study_spec_parameters_override.
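A common pattern is to start from the defaults returned by get_xgboost_study_spec_parameters_override() and narrow one parameter's range. Since the real default list is not reproduced on this page, the sketch below uses a stand-in list whose entry shape follows the Vertex AI StudySpec format (an assumption):

```python
# Stand-in for the list returned by
# get_xgboost_study_spec_parameters_override(); entry shape is assumed.
defaults = [
    {"parameter_id": "max_depth",
     "integer_value_spec": {"min_value": 2, "max_value": 10}},
    {"parameter_id": "eta",
     "double_value_spec": {"min_value": 1e-3, "max_value": 0.3}},
]

def narrow_range(specs, parameter_id, **bounds):
    """Return a copy of the spec list with one parameter's bounds
    replaced, leaving the original list untouched."""
    out = []
    for spec in specs:
        if spec["parameter_id"] == parameter_id:
            spec = dict(spec)  # shallow copy so defaults stay intact
            (key,) = [k for k in spec if k.endswith("_value_spec")]
            spec[key] = {**spec[key], **bounds}
        out.append(spec)
    return out

# Cap tree depth at 6 while keeping all other defaults.
narrowed = narrow_range(defaults, "max_depth", max_value=6)
```

The resulting list can then be passed as study_spec_parameters_override to get_xgboost_hyperparameter_tuning_job_pipeline_and_parameters.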
- google_cloud_pipeline_components.experimental.automl.tabular.utils.get_xgboost_trainer_pipeline_and_parameters(project: str, location: str, root_dir: str, target_column: str, objective: str, eval_metric: str | None = None, num_boost_round: int | None = None, early_stopping_rounds: int | None = None, base_score: float | None = None, disable_default_eval_metric: int | None = None, seed: int | None = None, seed_per_iteration: bool | None = None, booster: str | None = None, eta: float | None = None, gamma: float | None = None, max_depth: int | None = None, min_child_weight: float | None = None, max_delta_step: float | None = None, subsample: float | None = None, colsample_bytree: float | None = None, colsample_bylevel: float | None = None, colsample_bynode: float | None = None, reg_lambda: float | None = None, reg_alpha: float | None = None, tree_method: str | None = None, scale_pos_weight: float | None = None, updater: str | None = None, refresh_leaf: int | None = None, process_type: str | None = None, grow_policy: str | None = None, sampling_method: str | None = None, monotone_constraints: str | None = None, interaction_constraints: str | None = None, sample_type: str | None = None, normalize_type: str | None = None, rate_drop: float | None = None, one_drop: int | None = None, skip_drop: float | None = None, num_parallel_tree: int | None = None, feature_selector: str | None = None, top_k: int | None = None, max_cat_to_onehot: int | None = None, max_leaves: int | None = None, max_bin: int | None = None, tweedie_variance_power: float | None = None, huber_slope: float | None = None, dataset_level_custom_transformation_definitions: List[Dict[str, Any]] | None = None, dataset_level_transformations: List[Dict[str, Any]] | None = None, run_feature_selection: bool | None = None, feature_selection_algorithm: str | None = None, max_selected_features: int | None = None, predefined_split_key: str | None = None, stratified_split_key: str | None = None, training_fraction: float 
| None = None, validation_fraction: float | None = None, test_fraction: float | None = None, tf_auto_transform_features: List[str] | None = None, tf_custom_transformation_definitions: List[Dict[str, Any]] | None = None, tf_transformations_path: str | None = None, data_source_csv_filenames: str | None = None, data_source_bigquery_table_path: str | None = None, bigquery_staging_full_dataset_id: str | None = None, weight_column: str | None = None, training_machine_type: str | None = None, training_total_replica_count: int | None = None, training_accelerator_type: str | None = None, training_accelerator_count: int | None = None, transform_dataflow_machine_type: str | None = None, transform_dataflow_max_num_workers: int | None = None, transform_dataflow_disk_size_gb: int | None = None, run_evaluation: bool | None = None, evaluation_batch_predict_machine_type: str | None = None, evaluation_batch_predict_starting_replica_count: int | None = None, evaluation_batch_predict_max_replica_count: int | None = None, evaluation_dataflow_machine_type: str | None = None, evaluation_dataflow_starting_num_workers: int | None = None, evaluation_dataflow_max_num_workers: int | None = None, evaluation_dataflow_disk_size_gb: int | None = None, dataflow_service_account: str | None = None, dataflow_subnetwork: str | None = None, dataflow_use_public_ips: bool | None = None, encryption_spec_key_name: str | None = None)
Get the XGBoost training pipeline.
- Args:
- project: The GCP project that runs the pipeline components.
- location: The GCP region that runs the pipeline components.
- root_dir: The root GCS directory for the pipeline components.
- target_column: The target column name.
- objective: Specifies the learning task and the learning objective. Must be one of [reg:squarederror, reg:squaredlogerror, reg:logistic, reg:gamma, reg:tweedie, reg:pseudohubererror, binary:logistic, multi:softprob].
- eval_metric: Evaluation metrics for validation data represented as a comma-separated string.
- num_boost_round: Number of boosting iterations.
- early_stopping_rounds: Activates early stopping. Validation error needs to decrease at least every early_stopping_rounds round(s) to continue training.
- base_score: The initial prediction score of all instances, global bias.
- disable_default_eval_metric: Flag to disable the default metric. Set to >0 to disable. Defaults to 0.
- seed: Random seed.
- seed_per_iteration: Seed the PRNG deterministically via the iteration number.
- booster: Which booster to use; can be gbtree, gblinear or dart. gbtree and dart use tree-based models while gblinear uses a linear function.
- eta: Learning rate.
- gamma: Minimum loss reduction required to make a further partition on a leaf node of the tree.
- max_depth: Maximum depth of a tree.
- min_child_weight: Minimum sum of instance weight (hessian) needed in a child.
- max_delta_step: Maximum delta step we allow each tree's weight estimation to be.
- subsample: Subsample ratio of the training instances.
- colsample_bytree: Subsample ratio of columns when constructing each tree.
- colsample_bylevel: Subsample ratio of columns for each split, in each level.
- colsample_bynode: Subsample ratio of columns for each node (split).
- reg_lambda: L2 regularization term on weights.
- reg_alpha: L1 regularization term on weights.
- tree_method: The tree construction algorithm used in XGBoost. Choices: ["auto", "exact", "approx", "hist", "gpu_exact", "gpu_hist"].
- scale_pos_weight: Controls the balance of positive and negative weights.
- updater: A comma-separated string defining the sequence of tree updaters to run.
- refresh_leaf: Refresh updater plugin. Updates tree leaf and node stats if True; when False, only node stats are updated.
- process_type: The type of boosting process to run. Choices: ["default", "update"].
- grow_policy: Controls the way new nodes are added to the tree. Only supported if tree_method is hist. Choices: ["depthwise", "lossguide"].
- sampling_method: The method used to sample the training instances.
- monotone_constraints: Constraint of variable monotonicity.
- interaction_constraints: Constraints for interaction representing permitted interactions.
- sample_type: [dart booster only] Type of sampling algorithm. Choices: ["uniform", "weighted"].
- normalize_type: [dart booster only] Type of normalization algorithm. Choices: ["tree", "forest"].
- rate_drop: [dart booster only] Dropout rate.
- one_drop: [dart booster only] When this flag is enabled, at least one tree is always dropped during the dropout (allows Binomial-plus-one or epsilon-dropout from the original DART paper).
- skip_drop: [dart booster only] Probability of skipping the dropout procedure during a boosting iteration.
- num_parallel_tree: Number of parallel trees constructed during each iteration. This option is used to support boosted random forests.
- feature_selector: [linear booster only] Feature selection and ordering method.
- top_k: The number of top features to select in the greedy and thrifty feature selectors. A value of 0 means using all the features.
- max_cat_to_onehot: A threshold for deciding whether XGBoost should use one-hot-encoding-based splits for categorical data.
- max_leaves: Maximum number of nodes to be added.
- max_bin: Maximum number of discrete bins to bucket continuous features.
- tweedie_variance_power: Parameter that controls the variance of the Tweedie distribution.
- huber_slope: A parameter used for Pseudo-Huber loss to define the delta term.
- dataset_level_custom_transformation_definitions: Dataset-level custom transformation definitions in string format.
- dataset_level_transformations: Dataset-level transformation configuration in string format.
- run_feature_selection: Whether to enable feature selection.
- feature_selection_algorithm: Feature selection algorithm.
- max_selected_features: Maximum number of features to select.
- predefined_split_key: Predefined split key.
- stratified_split_key: Stratified split key.
- training_fraction: Training fraction.
- validation_fraction: Validation fraction.
- test_fraction: Test fraction.
- tf_auto_transform_features: List of auto transform features in the comma-separated string format.
- tf_custom_transformation_definitions: TF custom transformation definitions in string format.
- tf_transformations_path: Path to the TF transformation configuration.
- data_source_csv_filenames: The CSV data source.
- data_source_bigquery_table_path: The BigQuery data source.
- bigquery_staging_full_dataset_id: The BigQuery staging full dataset id for storing intermediate tables.
- weight_column: The weight column name.
- training_machine_type: Machine type.
- training_total_replica_count: Number of workers.
- training_accelerator_type: Accelerator type.
- training_accelerator_count: Accelerator count.
- transform_dataflow_machine_type: The Dataflow machine type for the transform component.
- transform_dataflow_max_num_workers: The max number of Dataflow workers for the transform component.
- transform_dataflow_disk_size_gb: Dataflow worker's disk size in GB for the transform component.
- run_evaluation: Whether to run evaluation steps during training.
- evaluation_batch_predict_machine_type: The prediction server machine type for batch predict components during evaluation.
- evaluation_batch_predict_starting_replica_count: The initial number of prediction servers for batch predict components during evaluation.
- evaluation_batch_predict_max_replica_count: The max number of prediction servers for batch predict components during evaluation.
- evaluation_dataflow_machine_type: The Dataflow machine type for evaluation components.
- evaluation_dataflow_starting_num_workers: The initial number of Dataflow workers for evaluation components.
- evaluation_dataflow_max_num_workers: The max number of Dataflow workers for evaluation components.
- evaluation_dataflow_disk_size_gb: Dataflow worker's disk size in GB for evaluation components.
- dataflow_service_account: Custom service account to run Dataflow jobs.
- dataflow_subnetwork: Dataflow's fully qualified subnetwork name; when empty, the default subnetwork will be used.
- dataflow_use_public_ips: Specifies whether Dataflow workers use public IP addresses.
- encryption_spec_key_name: The KMS key name.
- Returns:
Tuple of pipeline_definition_path and parameter_values.
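The returned tuple is typically handed to a Vertex AI `PipelineJob`. A minimal sketch of the return contract and how it is consumed; the builder below is a hypothetical stub standing in for this module's pipeline-and-parameters function so the snippet stays self-contained, and the project, bucket, and column names are placeholders:

```python
from typing import Any, Dict, Tuple

def build_pipeline_and_parameters() -> Tuple[str, Dict[str, Any]]:
    # Hypothetical stub mirroring the documented return contract:
    # a path to a compiled pipeline definition plus the parameter
    # values to run it with. In real use, call the corresponding
    # utils.get_*_pipeline_and_parameters function instead.
    pipeline_definition_path = (
        "gs://my-bucket/pipeline_root/xgboost_trainer_pipeline.json"
    )
    parameter_values = {
        "project": "my-project",
        "location": "us-central1",
        "root_dir": "gs://my-bucket/pipeline_root",
        "target_column": "label",
        "objective": "binary:logistic",
        "eval_metric": "auc",
    }
    return pipeline_definition_path, parameter_values

template_path, parameter_values = build_pipeline_and_parameters()

# The tuple then plugs into a Vertex AI pipeline run, e.g.:
# from google.cloud import aiplatform
# aiplatform.PipelineJob(
#     display_name="xgboost-train",
#     template_path=template_path,
#     parameter_values=parameter_values,
# ).run()
```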
- google_cloud_pipeline_components.experimental.automl.tabular.utils.input_dictionary_to_parameter(input_dict: Dict[str, Any] | None) → str
Convert json input dict to encoded parameter string.
This function is required due to a limitation of YAML component definitions: YAML has no keyword for applying quote escaping, so the JSON argument's quotes must be manually escaped using this function.
- Args:
input_dict: The input json dictionary.
- Returns:
The encoded string used for parameter.
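The escaping can be reproduced in plain Python. This is a sketch of the documented behavior (serialize to JSON, JSON-encode the resulting string so its quotes are escaped, then strip the outer quotes), not necessarily the library's exact implementation:

```python
import json
from typing import Any, Dict, Optional

def input_dictionary_to_parameter(input_dict: Optional[Dict[str, Any]]) -> str:
    """Encode a JSON dict as a quote-escaped parameter string.

    Sketch: the second json.dumps escapes the quotes produced by the
    first, and the enclosing double quotes it adds are stripped.
    """
    if not input_dict:
        return ""
    out = json.dumps(json.dumps(input_dict))
    return out[1:-1]  # drop the enclosing double quotes

print(input_dictionary_to_parameter({"a": 1}))  # prints {\"a\": 1}
```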
Module contents
Module for AutoML Tables KFP components.