AutoML Forecasting

GA AutoML forecasting components.

Components:

ProphetTrainerOp(project, location, ...[, ...])

Trains and tunes one Prophet model per time series using Dataflow.

Functions:

get_bqml_arima_predict_pipeline_and_parameters(...)

Get the BQML ARIMA_PLUS prediction pipeline.

get_bqml_arima_train_pipeline_and_parameters(...)

Get the BQML ARIMA_PLUS training pipeline.

get_prophet_prediction_pipeline_and_parameters(...)

Returns Prophet prediction pipeline and formatted parameters.

get_prophet_train_pipeline_and_parameters(...)

Returns Prophet train pipeline and formatted parameters.

v1.automl.forecasting.ProphetTrainerOp(project: str, location: str, root_dir: str, target_column: str, time_column: str, time_series_identifier_column: str, forecast_horizon: int, window_column: str, data_granularity_unit: str, predefined_split_column: str, source_bigquery_uri: str, gcp_resources: dsl.OutputPath(str), unmanaged_container_model: dsl.Output[google.UnmanagedContainerModel], evaluated_examples_directory: dsl.Output[system.Artifact], optimization_objective: str | None = 'rmse', max_num_trials: int | None = 6, encryption_spec_key_name: str | None = '', dataflow_max_num_workers: int | None = 10, dataflow_machine_type: str | None = 'n1-standard-1', dataflow_disk_size_gb: int | None = 40, dataflow_service_account: str | None = '', dataflow_subnetwork: str | None = '', dataflow_use_public_ips: bool | None = True)

Trains and tunes one Prophet model per time series using Dataflow.

Parameters:
project: str

The GCP project that runs the pipeline components.

location: str

The GCP region for Vertex AI.

root_dir: str

The Cloud Storage location to store the output.

time_column: str

Name of the column that identifies time order in the time series.

time_series_identifier_column: str

Name of the column that identifies the time series.

target_column: str

Name of the column that the model is to predict values for.

forecast_horizon: int

The number of time periods into the future for which forecasts will be created. Future periods start after the latest timestamp for each time series.

optimization_objective: str | None = 'rmse'

Optimization objective for tuning. Supported metrics come from Prophet’s performance_metrics function. These are mse, rmse, mae, mape, mdape, smape, and coverage.

data_granularity_unit: str

String representing the units of time for the time column.

predefined_split_column: str

The predefined_split column name.

source_bigquery_uri: str

The BigQuery table path of format: bq://bq_project.bq_dataset.bq_table.

window_column: str

Name of the column that should be used to filter input rows. The column should contain either booleans or string booleans; if the value of the row is True, generate a sliding window from that row.

max_num_trials: int | None = 6

Maximum number of tuning trials to perform per time series. There are up to 100 possible combinations to explore for each time series. Recommended values to try are 3, 6, and 24.

encryption_spec_key_name: str | None = ''

Customer-managed encryption key.

dataflow_machine_type: str | None = 'n1-standard-1'

The Dataflow machine type used for training.

dataflow_max_num_workers: int | None = 10

The max number of Dataflow workers used for training.

dataflow_disk_size_gb: int | None = 40

Dataflow worker’s disk size in GB during training.

dataflow_service_account: str | None = ''

Custom service account to run Dataflow jobs.

dataflow_subnetwork: str | None = ''

Dataflow’s fully qualified subnetwork name; when empty, the default subnetwork will be used.

dataflow_use_public_ips: bool | None = True

Specifies whether Dataflow workers use public IP addresses.

Returns:

gcp_resources: dsl.OutputPath(str)

Serialized gcp_resources proto tracking the custom training job.

unmanaged_container_model: dsl.Output[google.UnmanagedContainerModel]

The UnmanagedContainerModel artifact.
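
A minimal usage sketch follows, showing the component wired into a KFP pipeline. The project, bucket, table, and column names are placeholders, and only a representative subset of arguments is set:

    from kfp import dsl
    from google_cloud_pipeline_components.v1.automl.forecasting import ProphetTrainerOp

    @dsl.pipeline(name='prophet-trainer-example')
    def prophet_training_pipeline():
        # Trains and tunes one Prophet model per time series on Dataflow.
        trainer = ProphetTrainerOp(
            project='my-project',                # placeholder project ID
            location='us-central1',
            root_dir='gs://my-bucket/prophet',   # placeholder Cloud Storage root
            target_column='sales',
            time_column='date',
            time_series_identifier_column='store_id',
            forecast_horizon=30,
            window_column='window',
            data_granularity_unit='day',
            predefined_split_column='split',
            source_bigquery_uri='bq://my-project.my_dataset.my_table',
            optimization_objective='rmse',
            max_num_trials=6,
        )
        # trainer.outputs['unmanaged_container_model'] can be consumed by a
        # downstream model-upload or batch-prediction step.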

v1.automl.forecasting.get_bqml_arima_predict_pipeline_and_parameters(project: str, location: str, model_name: str, data_source_csv_filenames: str = '', data_source_bigquery_table_path: str = '', bigquery_destination_uri: str = '', generate_explanation: bool = False) → tuple[str, dict[str, Any]]

Get the BQML ARIMA_PLUS prediction pipeline.

Parameters:
project: str

The GCP project that runs the pipeline components.

location: str

The GCP region for Vertex AI.

model_name: str

ARIMA_PLUS BQML model URI.

data_source_csv_filenames: str = ''

A string that represents a list of comma-separated CSV filenames.

data_source_bigquery_table_path: str = ''

The BigQuery table path of format: bq://bq_project.bq_dataset.bq_table.

bigquery_destination_uri: str = ''

URI of the desired destination dataset. If not specified, a resource will be created under a new dataset in the project.

generate_explanation: bool = False

Generate explanation along with the batch prediction results. This will cause the batch prediction output to include explanations.

Returns:

Tuple of pipeline_definition_path and parameter_values.
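
A minimal sketch of the intended usage, assuming the Vertex AI SDK (google-cloud-aiplatform) is installed; the project, model, and table names are placeholders:

    from google.cloud import aiplatform
    from google_cloud_pipeline_components.v1.automl.forecasting import (
        get_bqml_arima_predict_pipeline_and_parameters,
    )

    # Build the pipeline definition path and its parameter values.
    template_path, parameter_values = get_bqml_arima_predict_pipeline_and_parameters(
        project='my-project',                                    # placeholder
        location='us-central1',
        model_name='bq://my-project.my_dataset.my_arima_model',  # placeholder model URI
        data_source_bigquery_table_path='bq://my-project.my_dataset.predict_input',
    )

    # Submit the compiled pipeline to Vertex AI Pipelines.
    job = aiplatform.PipelineJob(
        display_name='bqml-arima-predict',
        template_path=template_path,
        parameter_values=parameter_values,
    )
    job.submit()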

v1.automl.forecasting.get_bqml_arima_train_pipeline_and_parameters(project: str, location: str, root_dir: str, time_column: str, time_series_identifier_column: str, target_column: str, forecast_horizon: int, data_granularity_unit: str, predefined_split_key: str = '', timestamp_split_key: str = '', training_fraction: float = -1.0, validation_fraction: float = -1.0, test_fraction: float = -1.0, data_source_csv_filenames: str = '', data_source_bigquery_table_path: str = '', window_column: str = '', window_stride_length: int = -1, window_max_count: int = -1, bigquery_destination_uri: str = '', override_destination: bool = False, max_order: int = 5, run_evaluation: bool = True) → tuple[str, dict[str, Any]]

Get the BQML ARIMA_PLUS training pipeline.

Parameters:
project: str

The GCP project that runs the pipeline components.

location: str

The GCP region for Vertex AI.

root_dir: str

The Cloud Storage location to store the output.

time_column: str

Name of the column that identifies time order in the time series.

time_series_identifier_column: str

Name of the column that identifies the time series.

target_column: str

Name of the column that the model is to predict values for.

forecast_horizon: int

The number of time periods into the future for which forecasts will be created. Future periods start after the latest timestamp for each time series.

data_granularity_unit: str

The data granularity unit. Accepted values are: minute, hour, day, week, month, year.

predefined_split_key: str = ''

The predefined_split column name.

timestamp_split_key: str = ''

The timestamp_split column name.

training_fraction: float = -1.0

The training fraction.

validation_fraction: float = -1.0

The validation fraction.

test_fraction: float = -1.0

The test fraction.

data_source_csv_filenames: str = ''

A string that represents a list of comma-separated CSV filenames.

data_source_bigquery_table_path: str = ''

The BigQuery table path of format: bq://bq_project.bq_dataset.bq_table.

window_column: str = ''

Name of the column that should be used to filter input rows. The column should contain either booleans or string booleans; if the value of the row is True, generate a sliding window from that row.

window_stride_length: int = -1

Step length used to generate input examples. Every window_stride_length rows will be used to generate a sliding window.

window_max_count: int = -1

Number of rows that should be used to generate input examples. If the total row count is larger than this number, the input data will be randomly sampled to hit the count.

bigquery_destination_uri: str = ''

URI of the desired destination dataset. If not specified, resources will be created under a new dataset in the project. Unlike in Vertex Forecasting, all resources will be given hardcoded names under this dataset, and the model artifact will also be exported here.

override_destination: bool = False

Whether to overwrite the metrics and evaluated examples tables if they already exist. If this is False and the tables exist, this pipeline will fail.

max_order: int = 5

Integer between 1 and 5 representing the size of the parameter search space for ARIMA_PLUS. 5 would result in the highest accuracy model, but also the longest training runtime.

run_evaluation: bool = True

Whether to run evaluation steps during training.

Returns:

Tuple of pipeline_definition_path and parameter_values.
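
A minimal sketch of building the training parameters; all names and values are placeholders, and the resulting tuple is submitted with aiplatform.PipelineJob exactly as in the prediction example above:

    from google_cloud_pipeline_components.v1.automl.forecasting import (
        get_bqml_arima_train_pipeline_and_parameters,
    )

    template_path, parameter_values = get_bqml_arima_train_pipeline_and_parameters(
        project='my-project',                    # placeholder
        location='us-central1',
        root_dir='gs://my-bucket/arima',         # placeholder Cloud Storage root
        time_column='date',
        time_series_identifier_column='store_id',
        target_column='sales',
        forecast_horizon=30,
        data_granularity_unit='day',
        # Fraction-based splits; alternatively pass predefined_split_key.
        training_fraction=0.8,
        validation_fraction=0.1,
        test_fraction=0.1,
        data_source_bigquery_table_path='bq://my-project.my_dataset.train_input',
        max_order=5,
        run_evaluation=True,
    )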

v1.automl.forecasting.get_prophet_prediction_pipeline_and_parameters(project: str, location: str, model_name: str, time_column: str, time_series_identifier_column: str, target_column: str, data_source_csv_filenames: str = '', data_source_bigquery_table_path: str = '', bigquery_destination_uri: str = '', machine_type: str = 'n1-standard-2', max_num_workers: int = 10) → tuple[str, dict[str, Any]]

Returns Prophet prediction pipeline and formatted parameters.

Unlike the prediction server for Vertex Forecasting, the Prophet prediction server returns predictions batched by time series id. This pipeline shows how these predictions can be disaggregated to get results similar to what Vertex Forecasting provides.

Parameters:
project: str

The GCP project that runs the pipeline components.

location: str

The GCP region for Vertex AI.

model_name: str

The name of the Model resource, in the form projects/{project}/locations/{location}/models/{model}.

time_column: str

Name of the column that identifies time order in the time series.

time_series_identifier_column: str

Name of the column that identifies the time series.

target_column: str

Name of the column that the model is to predict values for.

data_source_csv_filenames: str = ''

A string that represents a list of comma-separated CSV filenames.

data_source_bigquery_table_path: str = ''

The BigQuery table path of format: bq://bq_project.bq_dataset.bq_table.

bigquery_destination_uri: str = ''

URI of the desired destination dataset. If not specified, resources will be created under a new dataset in the project.

machine_type: str = 'n1-standard-2'

The machine type used for batch prediction.

max_num_workers: int = 10

The max number of workers used for batch prediction.

Returns:

Tuple of pipeline_definition_path and parameter_values.
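
A minimal sketch; the model_name value is a placeholder for the resource name of a Prophet model produced by the training pipeline below:

    from google_cloud_pipeline_components.v1.automl.forecasting import (
        get_prophet_prediction_pipeline_and_parameters,
    )

    template_path, parameter_values = get_prophet_prediction_pipeline_and_parameters(
        project='my-project',                                                # placeholder
        location='us-central1',
        model_name='projects/my-project/locations/us-central1/models/123',  # placeholder
        time_column='date',
        time_series_identifier_column='store_id',
        target_column='sales',
        data_source_bigquery_table_path='bq://my-project.my_dataset.predict_input',
        machine_type='n1-standard-2',
        max_num_workers=10,
    )
    # Submit with aiplatform.PipelineJob as shown for the ARIMA_PLUS prediction
    # pipeline above.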

v1.automl.forecasting.get_prophet_train_pipeline_and_parameters(project: str, location: str, root_dir: str, time_column: str, time_series_identifier_column: str, target_column: str, forecast_horizon: int, optimization_objective: str, data_granularity_unit: str, predefined_split_key: str = '', timestamp_split_key: str = '', training_fraction: float = -1.0, validation_fraction: float = -1.0, test_fraction: float = -1.0, data_source_csv_filenames: str = '', data_source_bigquery_table_path: str = '', window_column: str = '', window_stride_length: int = -1, window_max_count: int = -1, max_num_trials: int = 6, trainer_dataflow_machine_type: str = 'n1-standard-1', trainer_dataflow_max_num_workers: int = 10, trainer_dataflow_disk_size_gb: int = 40, evaluation_dataflow_machine_type: str = 'n1-standard-1', evaluation_dataflow_max_num_workers: int = 10, evaluation_dataflow_disk_size_gb: int = 40, dataflow_service_account: str = '', dataflow_subnetwork: str = '', dataflow_use_public_ips: bool = True, run_evaluation: bool = True) → tuple[str, dict[str, Any]]

Returns Prophet train pipeline and formatted parameters.

Parameters:
project: str

The GCP project that runs the pipeline components.

location: str

The GCP region for Vertex AI.

root_dir: str

The Cloud Storage location to store the output.

time_column: str

Name of the column that identifies time order in the time series.

time_series_identifier_column: str

Name of the column that identifies the time series.

target_column: str

Name of the column that the model is to predict values for.

forecast_horizon: int

The number of time periods into the future for which forecasts will be created. Future periods start after the latest timestamp for each time series.

optimization_objective: str

Optimization objective for the model.

data_granularity_unit: str

String representing the units of time for the time column.

predefined_split_key: str = ''

The predefined_split column name.

timestamp_split_key: str = ''

The timestamp_split column name.

training_fraction: float = -1.0

The training fraction.

validation_fraction: float = -1.0

The validation fraction.

test_fraction: float = -1.0

The test fraction.

data_source_csv_filenames: str = ''

A string that represents a list of comma-separated CSV filenames.

data_source_bigquery_table_path: str = ''

The BigQuery table path of format: bq://bq_project.bq_dataset.bq_table.

window_column: str = ''

Name of the column that should be used to filter input rows. The column should contain either booleans or string booleans; if the value of the row is True, generate a sliding window from that row.

window_stride_length: int = -1

Step length used to generate input examples. Every window_stride_length rows will be used to generate a sliding window.

window_max_count: int = -1

Number of rows that should be used to generate input examples. If the total row count is larger than this number, the input data will be randomly sampled to hit the count.

max_num_trials: int = 6

Maximum number of tuning trials to perform per time series.

trainer_dataflow_machine_type: str = 'n1-standard-1'

The Dataflow machine type used for training.

trainer_dataflow_max_num_workers: int = 10

The max number of Dataflow workers used for training.

trainer_dataflow_disk_size_gb: int = 40

Dataflow worker’s disk size in GB during training.

evaluation_dataflow_machine_type: str = 'n1-standard-1'

The Dataflow machine type used for evaluation.

evaluation_dataflow_max_num_workers: int = 10

The max number of Dataflow workers used for evaluation.

evaluation_dataflow_disk_size_gb: int = 40

Dataflow worker’s disk size in GB during evaluation.

dataflow_service_account: str = ''

Custom service account to run Dataflow jobs.

dataflow_subnetwork: str = ''

Dataflow’s fully qualified subnetwork name; when empty, the default subnetwork will be used.

dataflow_use_public_ips: bool = True

Specifies whether Dataflow workers use public IP addresses.

run_evaluation: bool = True

Whether to run evaluation steps during training.

Returns:

Tuple of pipeline_definition_path and parameter_values.
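
A minimal sketch of building the Prophet training parameters; the project, bucket, table, and column names are placeholders, and the Dataflow sizing values are illustrative:

    from google_cloud_pipeline_components.v1.automl.forecasting import (
        get_prophet_train_pipeline_and_parameters,
    )

    template_path, parameter_values = get_prophet_train_pipeline_and_parameters(
        project='my-project',                        # placeholder
        location='us-central1',
        root_dir='gs://my-bucket/prophet',           # placeholder Cloud Storage root
        time_column='date',
        time_series_identifier_column='store_id',
        target_column='sales',
        forecast_horizon=30,
        optimization_objective='rmse',
        data_granularity_unit='day',
        data_source_bigquery_table_path='bq://my-project.my_dataset.train_input',
        training_fraction=0.8,
        validation_fraction=0.1,
        test_fraction=0.1,
        max_num_trials=6,
        trainer_dataflow_machine_type='n1-standard-4',  # larger workers for tuning
        trainer_dataflow_max_num_workers=20,
    )
    # Submit with aiplatform.PipelineJob as in the examples above.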