AutoML Tabular

Preview AutoML tabular components.

Components:

AutoFeatureEngineeringOp(root_dir, project, ...)

Find the top features from the dataset.

DistillationStageFeatureTransformEngineOp(...)

Feature Transform Engine (FTE) component to transform raw data to engineered features during model distillation.

FeatureSelectionOp(project, location, ...[, ...])

Launches a feature selection task to pick top features.

FeatureTransformEngineOp(root_dir, project, ...)

Transforms raw data to engineered features.

TabNetHyperparameterTuningJobOp(project, ...)

Tunes TabNet hyperparameters using Vertex HyperparameterTuningJob API.

TabNetTrainerOp(project, location, root_dir, ...)

Trains a TabNet model using Vertex CustomJob API.

WideAndDeepHyperparameterTuningJobOp(...[, ...])

Tunes Wide & Deep hyperparameters using Vertex HyperparameterTuningJob API.

WideAndDeepTrainerOp(project, location, ...)

Trains a Wide & Deep model using Vertex CustomJob API.

XGBoostHyperparameterTuningJobOp(project, ...)

Tunes XGBoost hyperparameters using Vertex HyperparameterTuningJob API.

XGBoostTrainerOp(project, location, ...[, ...])

Trains an XGBoost model using Vertex CustomJob API.

preview.automl.tabular.AutoFeatureEngineeringOp(root_dir: str, project: str, location: str, gcp_resources: dsl.OutputPath(str), materialized_data: dsl.Output[system.Dataset], feature_ranking: dsl.Output[system.Artifact], target_column: str | None = '', weight_column: str | None = '', data_source_csv_filenames: str | None = '', data_source_bigquery_table_path: str | None = '', bigquery_staging_full_dataset_id: str | None = '', materialized_examples_format: str | None = 'tfrecords_gzip')

Find the top features from the dataset.
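
The reference above only lists the signature; the following is a minimal, hedged sketch of how this component might be wired into a KFP pipeline. The import path assumes the usual google_cloud_pipeline_components package layout, and the project, bucket, and BigQuery table names are placeholders.

    # Sketch only: placeholder project/bucket/table names; assumes the
    # google_cloud_pipeline_components package provides this preview module.
    from kfp import dsl
    from google_cloud_pipeline_components.preview.automl.tabular import (
        AutoFeatureEngineeringOp,
    )

    @dsl.pipeline(name="auto-feature-engineering-demo")
    def auto_fe_pipeline():
        feat_eng = AutoFeatureEngineeringOp(
            root_dir="gs://my-bucket/pipeline_root",  # hypothetical bucket
            project="my-project",                     # hypothetical project
            location="us-central1",
            target_column="label",
            data_source_bigquery_table_path="bq://my-project.my_dataset.my_table",
        )
        # feat_eng.outputs["materialized_data"] and feat_eng.outputs["feature_ranking"]
        # can be passed to downstream training components.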

preview.automl.tabular.DistillationStageFeatureTransformEngineOp(root_dir: str, project: str, location: str, transform_config_path: str, bigquery_train_full_table_uri: str, bigquery_validate_full_table_uri: str, target_column: str, prediction_type: str, materialized_data: dsl.Output[system.Dataset], transform_output: dsl.Output[system.Artifact], gcp_resources: dsl.OutputPath(str), bigquery_staging_full_dataset_id: str | None = '', weight_column: str | None = '', dataflow_machine_type: str | None = 'n1-standard-16', dataflow_max_num_workers: int | None = 25, dataflow_disk_size_gb: int | None = 40, dataflow_subnetwork: str | None = '', dataflow_use_public_ips: bool | None = True, dataflow_service_account: str | None = '', encryption_spec_key_name: str | None = '', autodetect_csv_schema: bool | None = False)

Feature Transform Engine (FTE) component to transform raw data to engineered features during model distillation.

The FTE transform configuration is generated as part of the FTE stage prior to distillation. This distillation-stage FTE component re-uses this config to transform the input datasets with predicted outputs included (soft targets).

Parameters
root_dir: str

The Cloud Storage location to store the output.

project: str

Project to run feature transform engine.

location: str

Location for the created GCP services.

transform_config_path: str

Path to the transform config output by the pre-distillation FTE component.

bigquery_train_full_table_uri: str

BigQuery full table ID for the train split output by the pre-distillation FTE, with soft targets included.

bigquery_validate_full_table_uri: str

BigQuery full table ID for the validation split output by the pre-distillation FTE, with soft targets included.

target_column: str

Target column of input data.

prediction_type: str

Model prediction type. One of "classification", "regression", or "time_series".

bigquery_staging_full_dataset_id: str | None = ''

Dataset in ‘projectId.datasetId’ format for storing intermediate-FTE BigQuery tables. If the specified dataset does not exist in BigQuery, FTE will create the dataset. If no bigquery_staging_full_dataset_id is specified, all intermediate tables will be stored in a dataset created under the provided project in the input data source’s location during FTE execution called ‘vertex_feature_transform_engine_staging_{location.replace(‘-’, ‘_’)}’. All tables generated by FTE will have a 30 day TTL.

weight_column: str | None = ''

Weight column of input data.

dataflow_machine_type: str | None = 'n1-standard-16'

The machine type used for Dataflow jobs. If not set, defaults to n1-standard-16.

dataflow_max_num_workers: int | None = 25

The number of workers to run the Dataflow job. If not set, defaults to 25.

dataflow_disk_size_gb: int | None = 40

The disk size, in gigabytes, to use on each Dataflow worker instance. If not set, defaults to 40.

dataflow_subnetwork: str | None = ''

Dataflow’s fully qualified subnetwork name; when empty the default subnetwork will be used. More details: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications

dataflow_use_public_ips: bool | None = True

Specifies whether Dataflow workers use public IP addresses.

dataflow_service_account: str | None = ''

Custom service account to run Dataflow jobs.

encryption_spec_key_name: str | None = ''

Customer-managed encryption key.

Returns

materialized_data: dsl.Output[system.Dataset]

The materialized dataset.

transform_output: dsl.Output[system.Artifact]

The transform output artifact.

gcp_resources: dsl.OutputPath(str)

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.
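
As a rough illustration of the description above, the sketch below shows the distillation-stage component consuming a transform config and the soft-target train/validation tables produced earlier in a distillation pipeline. All names and URIs are placeholders, and the import path assumes the standard google_cloud_pipeline_components layout.

    # Sketch only: the transform config and BigQuery tables would come from the
    # pre-distillation FTE step and a batch prediction step, respectively.
    from kfp import dsl
    from google_cloud_pipeline_components.preview.automl.tabular import (
        DistillationStageFeatureTransformEngineOp,
    )

    @dsl.pipeline(name="distillation-fte-demo")
    def distillation_fte_pipeline(
        transform_config_path: str,
        bigquery_train_full_table_uri: str,
        bigquery_validate_full_table_uri: str,
    ):
        DistillationStageFeatureTransformEngineOp(
            root_dir="gs://my-bucket/pipeline_root",  # hypothetical bucket
            project="my-project",                     # hypothetical project
            location="us-central1",
            transform_config_path=transform_config_path,
            bigquery_train_full_table_uri=bigquery_train_full_table_uri,
            bigquery_validate_full_table_uri=bigquery_validate_full_table_uri,
            target_column="label",
            prediction_type="classification",
        )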

preview.automl.tabular.FeatureSelectionOp(project: str, location: str, root_dir: str, data_source: dsl.Input[system.Dataset], target_column_name: str, feature_ranking: dsl.Output[system.Artifact], selected_features: dsl.Output[system.Artifact], gcp_resources: dsl.OutputPath(str), dataflow_machine_type: str | None = 'n1-standard-16', dataflow_max_num_workers: int | None = 25, dataflow_disk_size_gb: int | None = 40, dataflow_subnetwork: str | None = '', dataflow_use_public_ips: bool | None = True, dataflow_service_account: str | None = '', encryption_spec_key_name: str | None = '', algorithm: str | None = 'AMI', prediction_type: str | None = 'unknown', binary_classification: str | None = 'false', max_selected_features: int | None = 1000)

Launches a feature selection task to pick top features.

Parameters
project: str

Project to run feature selection.

location: str

Location for running the feature selection. If not set, defaults to us-central1.

root_dir: str

The Cloud Storage location to store the output.

dataflow_machine_type: str | None = 'n1-standard-16'

The machine type used for Dataflow jobs. If not set, defaults to n1-standard-16.

dataflow_max_num_workers: int | None = 25

The number of workers to run the Dataflow job. If not set, defaults to 25.

dataflow_disk_size_gb: int | None = 40

The disk size, in gigabytes, to use on each Dataflow worker instance. If not set, defaults to 40.

dataflow_subnetwork: str | None = ''

Dataflow’s fully qualified subnetwork name, when empty the default subnetwork will be used. More details: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications

dataflow_use_public_ips: bool | None = True

Specifies whether Dataflow workers use public IP addresses.

dataflow_service_account: str | None = ''

Custom service account to run dataflow jobs.

encryption_spec_key_name: str | None = ''

Customer-managed encryption key. If this is set, then all resources will be encrypted with the provided encryption key.

data_source: dsl.Input[system.Dataset]

The input dataset artifact which references CSV, BigQuery, or TF Records.

target_column_name: str

Target column name of the input dataset.

max_selected_features: int | None = 1000

Number of features to select by the algorithm. If not set, defaults to 1000.

Returns

feature_ranking: dsl.Output[system.Artifact]

The dictionary of feature names and feature ranking values.

selected_features: dsl.Output[system.Artifact]

A JSON array of selected feature names.

gcp_resources: dsl.OutputPath(str)

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.
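
A hedged usage sketch for this component follows. dsl.importer merely stands in for whichever upstream step actually produces the Dataset artifact; the import path, URIs, and column names are placeholders.

    # Sketch only: placeholder URIs and names.
    from kfp import dsl
    from google_cloud_pipeline_components.preview.automl.tabular import (
        FeatureSelectionOp,
    )

    @dsl.pipeline(name="feature-selection-demo")
    def feature_selection_pipeline():
        dataset = dsl.importer(
            artifact_uri="gs://my-bucket/datasets/my_dataset",  # hypothetical URI
            artifact_class=dsl.Dataset,
        )
        fs = FeatureSelectionOp(
            project="my-project",                     # hypothetical project
            location="us-central1",
            root_dir="gs://my-bucket/pipeline_root",
            data_source=dataset.output,
            target_column_name="label",
            prediction_type="classification",
            max_selected_features=100,
        )
        # fs.outputs["selected_features"] and fs.outputs["feature_ranking"] can be
        # consumed by the feature transform engine or custom downstream steps.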

preview.automl.tabular.FeatureTransformEngineOp(root_dir: str, project: str, location: str, dataset_stats: dsl.Output[system.Artifact], materialized_data: dsl.Output[system.Dataset], transform_output: dsl.Output[system.Artifact], split_example_counts: dsl.OutputPath(str), instance_schema: dsl.Output[system.Artifact], training_schema: dsl.Output[system.Artifact], bigquery_train_split_uri: dsl.OutputPath(str), bigquery_validation_split_uri: dsl.OutputPath(str), bigquery_test_split_uri: dsl.OutputPath(str), bigquery_downsampled_test_split_uri: dsl.OutputPath(str), feature_ranking: dsl.Output[system.Artifact], gcp_resources: dsl.OutputPath(str), dataset_level_custom_transformation_definitions: list | None = [], dataset_level_transformations: list | None = [], forecasting_time_column: str | None = '', forecasting_time_series_identifier_column: str | None = None, forecasting_time_series_identifier_columns: list | None = [], forecasting_time_series_attribute_columns: list | None = [], forecasting_unavailable_at_forecast_columns: list | None = [], forecasting_available_at_forecast_columns: list | None = [], forecasting_forecast_horizon: int | None = - 1, forecasting_context_window: int | None = - 1, forecasting_predefined_window_column: str | None = '', forecasting_window_stride_length: int | None = - 1, forecasting_window_max_count: int | None = - 1, forecasting_holiday_regions: list | None = [], forecasting_apply_windowing: bool | None = True, predefined_split_key: str | None = '', stratified_split_key: str | None = '', timestamp_split_key: str | None = '', training_fraction: float | None = - 1, validation_fraction: float | None = - 1, test_fraction: float | None = - 1, stats_gen_execution_engine: str | None = 'dataflow', tf_transform_execution_engine: str | None = 'dataflow', tf_auto_transform_features: dict | None = {}, tf_custom_transformation_definitions: list | None = [], tf_transformations_path: str | None = '', legacy_transformations_path: str | None = '', target_column: str | None = '', weight_column: str | None = '', prediction_type: str | None = '', model_type: str | None = None, multimodal_tabular_columns: list | None = [], multimodal_timeseries_columns: list | None = [], multimodal_text_columns: list | None = [], multimodal_image_columns: list | None = [], run_distill: bool | None = False, run_feature_selection: bool | None = False, feature_selection_algorithm: str | None = 'AMI', feature_selection_execution_engine: str | None = 'dataflow', materialized_examples_format: str | None = 'tfrecords_gzip', max_selected_features: int | None = 1000, data_source_csv_filenames: str | None = '', data_source_bigquery_table_path: str | None = '', bigquery_staging_full_dataset_id: str | None = '', dataflow_machine_type: str | None = 'n1-standard-16', dataflow_max_num_workers: int | None = 25, dataflow_disk_size_gb: int | None = 40, dataflow_subnetwork: str | None = '', dataflow_use_public_ips: bool | None = True, dataflow_service_account: str | None = '', encryption_spec_key_name: str | None = '', autodetect_csv_schema: bool | None = False, group_columns: list | None = None, group_total_weight: float = 0.0, temporal_total_weight: float = 0.0, group_temporal_total_weight: float = 0.0)

Transforms raw data to engineered features.

FTE performs dataset level transformations, data splitting, data statistic generation, and TensorFlow-based row level transformations on the input dataset based on the provided transformation configuration.

Parameters
root_dir: str

The Cloud Storage location to store the output.

project: str

Project to run feature transform engine.

location: str

Location for the created GCP services.

dataset_level_custom_transformation_definitions: list | None = []

List of dataset-level custom transformation definitions. Custom, bring-your-own dataset-level transform functions, where users can define and import their own transform function and use it with FTE’s built-in transformations. Using custom transformations is an experimental feature and it is currently not supported during batch prediction.

[ { "transformation": "ConcatCols", "module_path": "/path/to/custom_transform_fn_dlt.py", "function_name": "concat_cols" } ]  Using custom transform function together with FTE's built-in transformations:  .. code-block:: python  [ { "transformation": "Join", "right_table_uri": "bq://test-project.dataset_test.table", "join_keys": [["join_key_col", "join_key_col"]] },{ "transformation": "ConcatCols", "cols": ["feature_1", "feature_2"], "output_col": "feature_1_2" } ]
Parameters
dataset_level_transformations: list | None = []

List of dataset-level transformations.

[ { "transformation": "Join", "right_table_uri": "bq://test-project.dataset_test.table", "join_keys": [["join_key_col", "join_key_col"]] }, ... ]  Additional information about FTE's currently supported built-in
    transformations:
    Join: Joins features from right_table_uri. For each join key, the left table keys will be included and the right table keys will be dropped.
        Example:  .. code-block:: python  { "transformation": "Join", "right_table_uri": "bq://test-project.dataset_test.table", "join_keys": [["join_key_col", "join_key_col"]] }
        Arguments:
            right_table_uri: Right table BigQuery uri to join with input_full_table_id.
            join_keys: Features to join on. For each nested list, the first element is a left table column and the second is its corresponding right table column.
    TimeAggregate: Creates a new feature composed of values of an existing feature from a fixed time period ago or in the future.
      Ex: A feature for sales by store 1 year ago.
        Example:  .. code-block:: python  { "transformation": "TimeAggregate", "time_difference": 40, "time_difference_units": "DAY", "time_series_identifier_columns": ["store_id"], "time_column": "time_col", "time_difference_target_column": "target_col", "output_column": "output_col" }
        Arguments:
            time_difference: Number of time_difference_units to look back or into the future on our time_difference_target_column.
            time_difference_units: Units of time_difference to look back or into the future on our time_difference_target_column. Must be one of * 'DAY' * 'WEEK' (Equivalent to 7 DAYs) * 'MONTH' * 'QUARTER' * 'YEAR'
            time_series_identifier_columns: Names of the time series identifier columns.
            time_column: Name of the time column.
            time_difference_target_column: Column we wish to get the value of time_difference time_difference_units in the past or future.
            output_column: Name of our new time aggregate feature.
            is_future: Whether we wish to look forward in time. Defaults to False. PartitionByMax/PartitionByMin/PartitionByAvg/PartitionBySum: Performs a partition by reduce operation (one of max, min, avg, or sum) with a fixed historic time period. Ex: Getting avg sales (the reduce column) for each store (partition_by_column) over the previous 5 days (time_column, time_ago_units, and time_ago).
        Example:  .. code-block:: python  { "transformation": "PartitionByMax", "reduce_column": "sell_price", "partition_by_columns": ["store_id", "state_id"], "time_column": "date", "time_ago": 1, "time_ago_units": "WEEK", "output_column": "partition_by_reduce_max_output" }
        Arguments:
            reduce_column: Column to apply the reduce operation on. Reduce operations include the
                following: Max, Min, Avg, Sum.
            partition_by_columns: List of columns to partition by.
            time_column: Time column for the partition by operation's window function.
            time_ago: Number of time_ago_units to look back on our target_column, starting from time_column (inclusive).
            time_ago_units: Units of time_ago to look back on our target_column. Must be one of * 'DAY' * 'WEEK'
            output_column: Name of our output feature.
Parameters
forecasting_time_column: str | None = ''

Forecasting time column.

forecasting_time_series_identifier_column: str | None = None

[Deprecated] A forecasting time series identifier column. Raises an exception if used - use the “time_series_identifier_column” field instead.

forecasting_time_series_identifier_columns: list | None = []

The list of forecasting time series identifier columns.

forecasting_time_series_attribute_columns: list | None = []

Forecasting time series attribute columns.

forecasting_unavailable_at_forecast_columns: list | None = []

Forecasting unavailable at forecast columns.

forecasting_available_at_forecast_columns: list | None = []

Forecasting available at forecast columns.

forecasting_forecast_horizon: int | None = - 1

Forecasting horizon.

forecasting_context_window: int | None = - 1

Forecasting context window.

forecasting_predefined_window_column: str | None = ''

Forecasting predefined window column.

forecasting_window_stride_length: int | None = - 1

Forecasting window stride length.

forecasting_window_max_count: int | None = - 1

Forecasting window max count.

forecasting_holiday_regions: list | None = []

The geographical region based on which the holiday effect is applied in modeling, by adding a holiday categorical array feature that includes all holidays matching the date. This option is only allowed when the data granularity is day. By default, holiday effect modeling is disabled. To turn it on, specify the holiday region using this option.

Top level: 'GLOBAL'. Second level, continental regions: 'NA' (North America), 'JAPAC' (Japan and Asia Pacific), 'EMEA' (Europe, the Middle East and Africa), 'LAC' (Latin America and the Caribbean). Third level: countries from ISO 3166-1 country codes. Valid regions: 'GLOBAL', 'NA', 'JAPAC', 'EMEA', 'LAC', 'AE', 'AR', 'AT', 'AU', 'BE', 'BR', 'CA', 'CH', 'CL', 'CN', 'CO', 'CZ', 'DE', 'DK', 'DZ', 'EC', 'EE', 'EG', 'ES', 'FI', 'FR', 'GB', 'GR', 'HK', 'HU', 'ID', 'IE', 'IL', 'IN', 'IR', 'IT', 'JP', 'KR', 'LV', 'MA', 'MX', 'MY', 'NG', 'NL', 'NO', 'NZ', 'PE', 'PH', 'PK', 'PL', 'PT', 'RO', 'RS', 'RU', 'SA', 'SE', 'SG', 'SI', 'SK', 'TH', 'TR', 'TW', 'UA', 'US', 'VE', 'VN', 'ZA'.

forecasting_apply_windowing: bool | None = True

Whether to apply the window strategy.

predefined_split_key: str | None = ''

Predefined split key.

stratified_split_key: str | None = ''

Stratified split key.

timestamp_split_key: str | None = ''

Timestamp split key.

training_fraction: float | None = - 1

Fraction of input data for training.

validation_fraction: float | None = - 1

Fraction of input data for validation.

test_fraction: float | None = - 1

Fraction of input data for testing.

stats_gen_execution_engine: str | None = 'dataflow'

Execution engine to perform statistics generation. Can be one of "dataflow" (the default) or "bigquery". Using "bigquery" as the execution engine is experimental.

tf_transform_execution_engine: str | None = 'dataflow'

Execution engine to perform row-level TF transformations. Can be one of "dataflow" (the default) or "bigquery". Using "bigquery" as the execution engine is experimental and is for allowlisted customers only. In addition, executing on "bigquery" only supports auto transformations (i.e., specified by tf_auto_transform_features) and will raise an error when tf_custom_transformation_definitions or tf_transformations_path is set.

tf_auto_transform_features: dict | None = {}

Dict mapping auto and/or type-resolutions to TF transform features. FTE will automatically configure a set of built-in transformations for each feature based on its data statistics. If users do not want auto type resolution, but want the set of transformations for a given type to be automatically generated, they may specify pre-resolved transformation types. The following type hint dict keys are supported: 'auto', 'categorical', 'numeric', 'text', 'timestamp'. Example: { "auto": ["feature1"], "categorical": ["feature2", "feature3"] }. Note that the target and weight column may not be included as an auto transformation unless users are running forecasting.

tf_custom_transformation_definitions: list | None = []

List of TensorFlow-based custom transformation definitions. Custom, bring-your-own transform functions, where users can define and import their own transform function and use it with FTE's built-in transformations. Example:

    [
      { "transformation": "PlusOne", "module_path": "gs://bucket/custom_transform_fn.py", "function_name": "plus_one_transform" },
      { "transformation": "MultiplyTwo", "module_path": "gs://bucket/custom_transform_fn.py", "function_name": "multiply_two_transform" }
    ]

Using custom transform functions together with FTE's built-in transformations:

    [
      { "transformation": "CastToFloat", "input_columns": ["feature_1"], "output_columns": ["feature_1"] },
      { "transformation": "PlusOne", "input_columns": ["feature_1"], "output_columns": ["feature_1_plused_one"] },
      { "transformation": "MultiplyTwo", "input_columns": ["feature_1"], "output_columns": ["feature_1_multiplied_two"] }
    ]

tf_transformations_path: str | None = ''

Path to the TensorFlow-based transformation configuration, a JSON file used to specify FTE's TF transformation configurations. In the following, we provide some sample transform configurations to demonstrate FTE's capabilities. All transformations on input columns are explicitly specified with FTE's built-in transformations. Chaining of multiple transformations on a single column is also supported. For example:

    [
      { "transformation": "ZScale", "input_columns": ["feature_1"] },
      { "transformation": "ZScale", "input_columns": ["feature_2"] }
    ]

Additional information about FTE's currently supported built-in transformations:

    Datetime: Extracts datetime features from a column containing timestamp strings.
        Example: { "transformation": "Datetime", "input_columns": ["feature_1"], "time_format": "%Y-%m-%d" }
        Arguments:
            input_columns: A list with a single column to perform the datetime transformation on.
            output_columns: Names of output columns, one for each datetime_features element.
            time_format: Datetime format string. Time format is a combination of Date + Time Delimiter (optional) + Time (optional) directives. Valid date directives are '%Y-%m-%d' (2018-11-30), '%Y/%m/%d' (2018/11/30), '%y-%m-%d' (18-11-30), '%y/%m/%d' (18/11/30), '%m-%d-%Y' (11-30-2018), '%m/%d/%Y' (11/30/2018), '%m-%d-%y' (11-30-18), '%m/%d/%y' (11/30/18), '%d-%m-%Y' (30-11-2018), '%d/%m/%Y' (30/11/2018), '%d-%B-%Y' (30-November-2018), '%d-%m-%y' (30-11-18), '%d/%m/%y' (30/11/18), '%d-%B-%y' (30-November-18), '%d%m%Y' (30112018), '%m%d%Y' (11302018), '%Y%m%d' (20181130). Valid time delimiters are 'T' and ' '. Valid time directives are '%H:%M' (23:59), '%H:%M:%S' (23:59:58), '%H:%M:%S.%f' (23:59:58[.123456]), '%H:%M:%S.%f%z' (23:59:58[.123456]+0000), and '%H:%M:%S%z' (23:59:58+0000).
            datetime_features: List of datetime features to be extracted. Each entry must be one of 'YEAR', 'MONTH', 'DAY', 'DAY_OF_WEEK', 'DAY_OF_YEAR', 'WEEK_OF_YEAR', 'QUARTER', 'HOUR', 'MINUTE', 'SECOND'. Defaults to ['YEAR', 'MONTH', 'DAY', 'DAY_OF_WEEK', 'DAY_OF_YEAR', 'WEEK_OF_YEAR'].
    Log: Performs the natural log on a numeric column.
        Example: { "transformation": "Log", "input_columns": ["feature_1"] }
        Arguments:
            input_columns: A list with a single column to perform the log transformation on.
            output_columns: A list with a single output column name, corresponding to the output of our transformation.
    ZScale: Performs Z-scale normalization on a numeric column.
        Example: { "transformation": "ZScale", "input_columns": ["feature_1"] }
        Arguments:
            input_columns: A list with a single column to perform the z-scale transformation on.
            output_columns: A list with a single output column name, corresponding to the output of our transformation.
    Vocabulary: Converts strings to integers, where each unique string gets a unique integer representation.
        Example: { "transformation": "Vocabulary", "input_columns": ["feature_1"] }
        Arguments:
            input_columns: A list with a single column to perform the vocabulary transformation on.
            output_columns: A list with a single output column name, corresponding to the output of our transformation.
            top_k: Number of the most frequent words in the vocabulary to use for generating dictionary lookup indices. If not specified, all words in the vocabulary will be used. Defaults to None.
            frequency_threshold: Limit the vocabulary only to words whose number of occurrences in the input exceeds frequency_threshold. If not specified, all words in the vocabulary will be included. If both top_k and frequency_threshold are specified, a word must satisfy both conditions to be included. Defaults to None.
    Categorical: Transforms categorical columns to integer columns.
        Example: { "transformation": "Categorical", "input_columns": ["feature_1"], "top_k": 10 }
        Arguments:
            input_columns: A list with a single column to perform the categorical transformation on.
            output_columns: A list with a single output column name, corresponding to the output of our transformation.
            top_k: Number of the most frequent words in the vocabulary to use for generating dictionary lookup indices. If not specified, all words in the vocabulary will be used.
            frequency_threshold: Limit the vocabulary only to words whose number of occurrences in the input exceeds frequency_threshold. If not specified, all words in the vocabulary will be included. If both top_k and frequency_threshold are specified, a word must satisfy both conditions to be included.
    Reduce: Given a column where each entry is a numeric array, reduces arrays according to our reduce_mode.
        Example: { "transformation": "Reduce", "input_columns": ["feature_1"], "reduce_mode": "MEAN", "output_columns": ["feature_1_mean"] }
        Arguments:
            input_columns: A list with a single column to perform the reduce transformation on.
            output_columns: A list with a single output column name, corresponding to the output of our transformation.
            reduce_mode: One of 'MAX', 'MIN', 'MEAN', or 'LAST_K'. Defaults to 'MEAN'.
            last_k: The number of last k elements when the 'LAST_K' reduce mode is used. Defaults to 1.
    SplitString: Given a column of strings, splits strings into token arrays.
        Example: { "transformation": "SplitString", "input_columns": ["feature_1"], "separator": "$" }
        Arguments:
            input_columns: A list with a single column to perform the split string transformation on.
            output_columns: A list with a single output column name, corresponding to the output of our transformation.
            separator: Separator to split the input string into tokens. Defaults to ' '.
            missing_token: Missing token to use when no string is included. Defaults to ' MISSING '.
    NGram: Given a column of strings, splits strings into token arrays where each token is an integer.
        Example: { "transformation": "NGram", "input_columns": ["feature_1"], "min_ngram_size": 1, "max_ngram_size": 2, "separator": " " }
        Arguments:
            input_columns: A list with a single column to perform the n-gram transformation on.
            output_columns: A list with a single output column name, corresponding to the output of our transformation.
            min_ngram_size: Minimum n-gram size. Must be a positive number and <= max_ngram_size. Defaults to 1.
            max_ngram_size: Maximum n-gram size. Must be a positive number and >= min_ngram_size. Defaults to 2.
            top_k: Number of the most frequent words in the vocabulary to use for generating dictionary lookup indices. If not specified, all words in the vocabulary will be used. Defaults to None.
            frequency_threshold: Limit the dictionary's vocabulary only to words whose number of occurrences in the input exceeds frequency_threshold. If not specified, all words in the vocabulary will be included. If both top_k and frequency_threshold are specified, a word must satisfy both conditions to be included. Defaults to None.
            separator: Separator to split the input string into tokens. Defaults to ' '.
            missing_token: Missing token to use when no string is included. Defaults to ' MISSING '.
    Clip: Given a numeric column, clips elements such that elements < min_value are assigned min_value, and elements > max_value are assigned max_value.
        Example: { "transformation": "Clip", "input_columns": ["col1"], "output_columns": ["col1_clipped"], "min_value": 1., "max_value": 10. }
        Arguments:
            input_columns: A list with a single column to perform the clip transformation on.
            output_columns: A list with a single output column name, corresponding to the output of our transformation.
            min_value: Number where all values below min_value are set to min_value. If no min_value is provided, min clipping will not occur. Defaults to None.
            max_value: Number where all values above max_value are set to max_value. If no max_value is provided, max clipping will not occur. Defaults to None.
    MultiHotEncoding: Performs multi-hot encoding on a categorical array column.
        Example: { "transformation": "MultiHotEncoding", "input_columns": ["col1"] }
        The number of classes is determined by the largest number included in the input if it is numeric, or the total number of unique values of the input if it is type str. If the input is of type str and an element contains separator tokens, the input will be split at separator indices, and each element of the split list will be considered a separate class. For example, the input [ ["foo bar"], ["foo", "bar"], ["foo"], ["bar"] ] yields the output (with default separator ' ') [ [1, 1], [1, 1], [1, 0], [0, 1] ].
        Arguments:
            input_columns: A list with a single column to perform the multi-hot encoding on.
            output_columns: A list with a single output column name, corresponding to the output of our transformation.
            top_k: Number of the most frequent words in the vocabulary to use for generating dictionary lookup indices. If not specified, all words in the vocabulary will be used. Defaults to None.
            frequency_threshold: Limit the dictionary's vocabulary only to words whose number of occurrences in the input exceeds frequency_threshold. If not specified, all words in the vocabulary will be included. If both top_k and frequency_threshold are specified, a word must satisfy both conditions to be included. Defaults to None.
            separator: Separator to split the input string into tokens. Defaults to ' '.
    MaxAbsScale: Performs maximum absolute scaling on a numeric column.
        Example: { "transformation": "MaxAbsScale", "input_columns": ["col1"], "output_columns": ["col1_max_abs_scaled"] }
        Arguments:
            input_columns: A list with a single column to perform max-abs-scale on.
            output_columns: A list with a single output column name, corresponding to the output of our transformation.
    Custom: Transformations defined in tf_custom_transformation_definitions are included here in the TensorFlow-based transformation configuration. For example, given the following tf_custom_transformation_definitions:
        [ { "transformation": "PlusX", "module_path": "gs://bucket/custom_transform_fn.py", "function_name": "plus_one_transform" } ]
        We can include the following transformation:
        { "transformation": "PlusX", "input_columns": ["col1"], "output_columns": ["col1_max_abs_scaled"], "x": 5 }
        Note that input_columns must still be included in our arguments and output_columns is optional. All other arguments are those defined in custom_transform_fn.py, which includes "x" in this case. See tf_custom_transformation_definitions above.

legacy_transformations_path: str | None = ''

Deprecated. Prefer tf_auto_transform_features. Path to a GCS file containing a JSON string for legacy-style transformations. Note that legacy_transformations_path and tf_auto_transform_features cannot both be specified.

target_column: str | None = ''

Target column of input data.

weight_column: str | None = ''

Weight column of input data.

prediction_type: str | None = ''

Model prediction type. One of "classification", "regression", "time_series".

run_distill: bool | None = False

(Deprecated) Whether distillation should be applied to the training.

run_feature_selection: bool | None = False

Whether feature selection should be applied to the dataset.

feature_selection_algorithm: str | None = 'AMI'

The feature selection algorithm. One of "AMI", "CMIM", "JMIM", or "MRMR"; defaults to "AMI". The algorithms available are: AMI (Adjusted Mutual Information), reference: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_mutual_info_score.html (arrays are not yet supported in this algorithm); CMIM (Conditional Mutual Information Maximization), reference paper: Mohamed Bennasar, Yulia Hicks, Rossitza Setchi, "Feature selection using Joint Mutual Information Maximisation," Expert Systems with Applications, vol. 42, issue 22, 1 December 2015, pages 8520-8532; JMIM (Joint Mutual Information Maximization), reference paper: Mohamed Bennasar, Yulia Hicks, Rossitza Setchi, "Feature selection using Joint Mutual Information Maximisation," Expert Systems with Applications, vol. 42, issue 22, 1 December 2015, pages 8520-8532; MRMR (MIQ Minimum-redundancy Maximum-relevance), reference paper: Hanchuan Peng, Fuhui Long, and Chris Ding, "Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence 27, no. 8: 1226-1238.

feature_selection_execution_engine: str | None = 'dataflow'

Execution engine to run feature selection; can be "dataflow" or "bigquery".

materialized_examples_format: str | None = 'tfrecords_gzip'

The format to use for the materialized examples. Should be either 'tfrecords_gzip' (default) or 'parquet'.

max_selected_features: int | None = 1000

Maximum number of features to select. If specified, the transform config will be purged by only using the selected features that ranked top in the feature ranking, which has the ranking value for all supported features. If the number of input features is smaller than the max_selected_features specified, we will still run the feature selection process and generate the feature ranking; no features will be excluded. The value will be set to 1000 by default if run_feature_selection is enabled.

data_source_csv_filenames: str | None = ''

CSV input data source to run feature transform on.

data_source_bigquery_table_path: str | None = ''

BigQuery input data source to run feature transform on.

bigquery_staging_full_dataset_id: str | None = ''

Dataset in "projectId.datasetId" format for storing intermediate-FTE BigQuery tables. If the specified dataset does not exist in BigQuery, FTE will create the dataset. If no bigquery_staging_full_dataset_id is specified, all intermediate tables will be stored in a dataset created under the provided project in the input data source's location during FTE execution, called "vertex_feature_transform_engine_staging_{location.replace('-', '_')}". All tables generated by FTE will have a 30-day TTL.

model_type: str | None = None

Model type we wish to engineer features for. Can be one of: neural_network, boosted_trees, l2l, seq2seq, tft, or tide. Defaults to the empty value, None.

multimodal_tabular_columns: list | None = []

List of multimodal tabular columns. Defaults to an empty list.

multimodal_timeseries_columns: list | None = []

List of multimodal timeseries columns. Defaults to an empty list.

multimodal_text_columns: list | None = []

List of multimodal text columns. Defaults to an empty list.

multimodal_image_columns: list | None = []

List of multimodal image columns. Defaults to an empty list.

dataflow_machine_type: str | None = 'n1-standard-16'

The machine type used for Dataflow jobs. If not set, defaults to n1-standard-16.

dataflow_max_num_workers: int | None = 25

The number of workers to run the Dataflow job. If not set, defaults to 25.

dataflow_disk_size_gb: int | None = 40

The disk size, in gigabytes, to use on each Dataflow worker instance. If not set, defaults to 40.

dataflow_subnetwork: str | None = ''

Dataflow's fully qualified subnetwork name; when empty the default subnetwork will be used. More details: https://cloud.google.com/dataflow/docs/guides/specifying-networks#example_network_and_subnetwork_specifications

dataflow_use_public_ips: bool | None = True

Specifies whether Dataflow workers use public IP addresses.

dataflow_service_account: str | None = ''

Custom service account to run Dataflow jobs.

encryption_spec_key_name: str | None = ''

Customer-managed encryption key.

autodetect_csv_schema: bool | None = False

If True, infers the column types when importing CSVs into BigQuery.

Returns

dataset_stats: dsl.Output[system.Artifact]

The stats of the dataset.

materialized_data: dsl.Output[system.Dataset]

The materialized dataset.

transform_output: dsl.Output[system.Artifact]

The transform output artifact.

split_example_counts: dsl.OutputPath(str)

JSON string of data split example counts for train, validate, and test splits.

bigquery_train_split_uri: dsl.OutputPath(str)

BigQuery URI for the train split to pass to the batch prediction component during distillation.

bigquery_validation_split_uri: dsl.OutputPath(str)

BigQuery URI for the validation split to pass to the batch prediction component during distillation.

bigquery_test_split_uri: dsl.OutputPath(str)

BigQuery URI for the test split to pass to the batch prediction component during evaluation.

bigquery_downsampled_test_split_uri: dsl.OutputPath(str)

BigQuery URI for the downsampled test split to pass to the batch prediction component during batch explain.

instance_schema: dsl.Output[system.Artifact]

Schema of input data to the tf_model at serving time.

training_schema: dsl.Output[system.Artifact]

Schema of input data to the tf_model at training time.

feature_ranking: dsl.Output[system.Artifact]

The ranking of features, all features supported in the dataset will be included. For “AMI” algorithm, array features won’t be available in the ranking as arrays are not supported yet.

gcp_resources: dsl.OutputPath(str)

GCP resources created by this component. For more details, see https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/proto/README.md.

group_columns: list | None = None

A list of time series attribute column names that define the time series hierarchy.

group_total_weight: float = 0.0

The weight of the loss for predictions aggregated over time series in the same group.

temporal_total_weight: float = 0.0

The weight of the loss for predictions aggregated over the horizon for a single time series.

group_temporal_total_weight: float = 0.0

The weight of the loss for predictions aggregated over both the horizon and time series in the same hierarchy group.
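
To make the long parameter list above concrete, here is a minimal, hedged invocation of the feature transform engine with auto transformations and feature selection enabled; the project, bucket, and table names are placeholders, and the import path assumes the standard google_cloud_pipeline_components layout.

    # Sketch only: a minimal FTE call; many optional parameters are omitted.
    from kfp import dsl
    from google_cloud_pipeline_components.preview.automl.tabular import (
        FeatureTransformEngineOp,
    )

    @dsl.pipeline(name="fte-demo")
    def fte_pipeline():
        fte = FeatureTransformEngineOp(
            root_dir="gs://my-bucket/pipeline_root",  # hypothetical bucket
            project="my-project",                     # hypothetical project
            location="us-central1",
            target_column="label",
            prediction_type="classification",
            data_source_bigquery_table_path="bq://my-project.my_dataset.my_table",
            tf_auto_transform_features={"auto": ["feature_1", "feature_2"]},
            run_feature_selection=True,
            max_selected_features=100,
            training_fraction=0.8,
            validation_fraction=0.1,
            test_fraction=0.1,
        )
        # Downstream trainers typically consume fte.outputs["transform_output"],
        # fte.outputs["materialized_data"], and the generated schema artifacts.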

preview.automl.tabular.TabNetHyperparameterTuningJobOp(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, study_spec_metric_id: str, study_spec_metric_goal: str, study_spec_parameters_override: list, max_trial_count: int, parallel_trial_count: int, instance_baseline: dsl.Input[system.Artifact], metadata: dsl.Input[system.Artifact], materialized_train_split: dsl.Input[system.Artifact], materialized_eval_split: dsl.Input[system.Artifact], transform_output: dsl.Input[system.Artifact], training_schema_uri: dsl.Input[system.Artifact], gcp_resources: dsl.OutputPath(str), instance_schema_uri: dsl.OutputPath(str), prediction_schema_uri: dsl.OutputPath(str), trials: dsl.OutputPath(str), prediction_docker_uri_output: dsl.OutputPath(str), execution_metrics: dsl.OutputPath(dict), weight_column: str | None = '', enable_profiler: bool | None = False, cache_data: str | None = 'auto', seed: int | None = 1, eval_steps: int | None = 0, eval_frequency_secs: int | None = 600, max_failed_trial_count: int | None = 0, study_spec_algorithm: str | None = 'ALGORITHM_UNSPECIFIED', study_spec_measurement_selection_type: str | None = 'BEST_MEASUREMENT', training_machine_spec: dict | None = {'machine_type': 'c2-standard-16'}, training_disk_spec: dict | None = {'boot_disk_size_gb': 100, 'boot_disk_type': 'pd-ssd'}, encryption_spec_key_name: str | None = '')

Tunes TabNet hyperparameters using Vertex HyperparameterTuningJob API.

Parameters
project: str

The GCP project that runs the pipeline components.

location: str

The GCP region that runs the pipeline components.

root_dir: str

The root GCS directory for the pipeline components.

target_column: str

The target column name.

prediction_type: str

The type of prediction the model is to produce. “classification” or “regression”.

weight_column: str | None = ''

The weight column name.

enable_profiler: bool | None = False

Enables profiling and saves a trace during evaluation.

cache_data: str | None = 'auto'

Whether to cache data or not. If set to ‘auto’, caching is determined based on the dataset size.

seed: int | None = 1

Seed to be used for this run.

eval_steps: int | None = 0

Number of steps to run evaluation for. If not specified or negative, it means run evaluation on the whole validation dataset. If set to 0, it means run evaluation for a fixed number of samples.

eval_frequency_secs: int | None = 600

Frequency at which evaluation and checkpointing will take place.

study_spec_metric_id: str

Metric to optimize, possible values: [ ‘loss’, ‘average_loss’, ‘rmse’, ‘mae’, ‘mql’, ‘accuracy’, ‘auc’, ‘precision’, ‘recall’].

study_spec_metric_goal: str

Optimization goal of the metric, possible values: “MAXIMIZE”, “MINIMIZE”.

study_spec_parameters_override: list

List of dictionaries representing parameters to optimize. The dictionary key is the parameter_id, which is passed to training job as a command line argument, and the dictionary value is the parameter specification of the metric.

max_trial_count: int

The desired total number of trials.

parallel_trial_count: int

The desired number of trials to run in parallel.

max_failed_trial_count: int | None = 0

The number of failed trials that need to be seen before failing the HyperparameterTuningJob. If set to 0, Vertex AI decides how many trials must fail before the whole job fails.

study_spec_algorithm: str | None = 'ALGORITHM_UNSPECIFIED'

The search algorithm specified for the study. One of ‘ALGORITHM_UNSPECIFIED’, ‘GRID_SEARCH’, or ‘RANDOM_SEARCH’.

study_spec_measurement_selection_type: str | None = 'BEST_MEASUREMENT'

Which measurement to use if/when the service automatically selects the final measurement from previously reported intermediate measurements. One of “BEST_MEASUREMENT” or “LAST_MEASUREMENT”.

training_machine_spec: dict | None = {'machine_type': 'c2-standard-16'}

The training machine spec. See https://cloud.google.com/compute/docs/machine-types for options.

training_disk_spec: dict | None = {'boot_disk_size_gb': 100, 'boot_disk_type': 'pd-ssd'}

The training disk spec.

instance_baseline: dsl.Input[system.Artifact]

The path to a JSON file for baseline values.

metadata: dsl.Input[system.Artifact]

The tabular example gen metadata.

materialized_train_split: dsl.Input[system.Artifact]

The path to the materialized train split.

materialized_eval_split: dsl.Input[system.Artifact]

The path to the materialized validation split.

transform_output: dsl.Input[system.Artifact]

The path to transform output.

training_schema_uri: dsl.Input[system.Artifact]

The path to the training schema.

encryption_spec_key_name: str | None = ''

The KMS key name.

Returns

gcp_resources: dsl.OutputPath(str)

Serialized gcp_resources proto tracking the custom training job.

instance_schema_uri: dsl.OutputPath(str)

The path to the instance schema.

prediction_schema_uri: dsl.OutputPath(str)

The path to the prediction schema.

trials: dsl.OutputPath(str)

The path to the hyperparameter tuning trials.

prediction_docker_uri_output: dsl.OutputPath(str)

The URI of the prediction container.

execution_metrics: dsl.OutputPath(dict)

Core metrics of the hyperparameter tuning job execution, as a dictionary.
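
The study_spec_parameters_override parameter above expects Vertex AI StudySpec.ParameterSpec entries expressed as JSON-like dicts. The sketch below shows the general shape with two TabNet hyperparameters taken from the trainer signature; the ranges and the choice of parameters are illustrative assumptions, not prescriptions.

    # Sketch only: illustrative parameter ranges.
    tabnet_study_spec_parameters_override = [
        {
            "parameter_id": "learning_rate",
            "double_value_spec": {"min_value": 1e-4, "max_value": 1e-1},
            "scale_type": "UNIT_LOG_SCALE",
        },
        {
            "parameter_id": "feature_dim",
            "integer_value_spec": {"min_value": 32, "max_value": 128},
            "scale_type": "UNIT_LINEAR_SCALE",
        },
    ]
    # This list is passed as study_spec_parameters_override to
    # TabNetHyperparameterTuningJobOp, together with the artifact inputs
    # (instance_baseline, metadata, materialized splits, transform output,
    # training schema) produced by the upstream feature transform steps.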

preview.automl.tabular.TabNetTrainerOp(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, learning_rate: float, instance_baseline: dsl.Input[system.Artifact], metadata: dsl.Input[system.Artifact], materialized_train_split: dsl.Input[system.Artifact], materialized_eval_split: dsl.Input[system.Artifact], transform_output: dsl.Input[system.Artifact], training_schema_uri: dsl.Input[system.Artifact], gcp_resources: dsl.OutputPath(str), unmanaged_container_model: dsl.Output[google.UnmanagedContainerModel], weight_column: str | None = '', max_steps: int | None = - 1, max_train_secs: int | None = - 1, large_category_dim: int | None = 1, large_category_thresh: int | None = 300, yeo_johnson_transform: bool | None = True, feature_dim: int | None = 64, feature_dim_ratio: float | None = 0.5, num_decision_steps: int | None = 6, relaxation_factor: float | None = 1.5, decay_every: float | None = 100, decay_rate: float | None = 0.95, gradient_thresh: float | None = 2000, sparsity_loss_weight: float | None = 1e-05, batch_momentum: float | None = 0.95, batch_size_ratio: float | None = 0.25, num_transformer_layers: int | None = 4, num_transformer_layers_ratio: float | None = 0.25, class_weight: float | None = 1.0, loss_function_type: str | None = 'default', alpha_focal_loss: float | None = 0.25, gamma_focal_loss: float | None = 2.0, enable_profiler: bool | None = False, cache_data: str | None = 'auto', seed: int | None = 1, eval_steps: int | None = 0, batch_size: int | None = 100, measurement_selection_type: str | None = 'BEST_MEASUREMENT', optimization_metric: str | None = '', eval_frequency_secs: int | None = 600, training_machine_spec: dict | None = {'machine_type': 'c2-standard-16'}, training_disk_spec: dict | None = {'boot_disk_size_gb': 100, 'boot_disk_type': 'pd-ssd'}, encryption_spec_key_name: str | None = '')

Trains a TabNet model using Vertex CustomJob API.

Parameters
project: str

The GCP project that runs the pipeline components.

location: str

The GCP region that runs the pipeline components.

root_dir: str

The root GCS directory for the pipeline components.

target_column: str

The target column name.

prediction_type: str

The type of prediction the model is to produce. “classification” or “regression”.

weight_column: str | None = ''

The weight column name.

max_steps: int | None = - 1

Number of steps to run the trainer for.

max_train_secs: int | None = - 1

Amount of time in seconds to run the trainer for.

learning_rate: float

The learning rate used by the linear optimizer.

large_category_dim: int | None = 1

Embedding dimension for categorical features with a large number of categories.

large_category_thresh: int | None = 300

Category-count threshold at which the large_category_dim embedding dimension is applied.

yeo_johnson_transform: bool | None = True

Enables trainable Yeo-Johnson power transform.

feature_dim: int | None = 64

Dimensionality of the hidden representation in feature transformation block.

feature_dim_ratio: float | None = 0.5

The ratio of output dimension (dimensionality of the outputs of each decision step) to feature dimension.

num_decision_steps: int | None = 6

Number of sequential decision steps.

relaxation_factor: float | None = 1.5

Relaxation factor that promotes the reuse of each feature at different decision steps. When it is 1, a feature is enforced to be used only at one decision step and as it increases, more flexibility is provided to use a feature at multiple decision steps.

decay_every: float | None = 100

Number of iterations for periodically applying learning rate decaying.

decay_rate: float | None = 0.95

Learning rate decay factor.

gradient_thresh: float | None = 2000

Threshold for the norm of gradients for clipping.

sparsity_loss_weight: float | None = 1e-05

Weight of the loss for sparsity regularization (increasing it will yield more sparse feature selection).

batch_momentum: float | None = 0.95

Momentum in ghost batch normalization.

batch_size_ratio: float | None = 0.25

The ratio of virtual batch size (size of the ghost batch normalization) to batch size.

num_transformer_layers: int | None = 4

The number of transformer layers for each decision step.

num_transformer_layers_ratio: float | None = 0.25

The ratio of shared transformer layer to transformer layers.

class_weight: float | None = 1.0

The class weight is used to compute a weighted cross entropy, which is helpful for classifying imbalanced datasets. Only used for classification.

loss_function_type: str | None = 'default'

Loss function type. For classification, one of [cross_entropy, weighted_cross_entropy, focal_loss]; default is cross_entropy. For regression, one of [rmse, mae, mse]; default is mse.

alpha_focal_loss: float | None = 0.25

Alpha value (balancing factor) in focal_loss function. Only used for classification.

gamma_focal_loss: float | None = 2.0

Gamma value (modulating factor) in the focal_loss function. Only used for classification.

enable_profiler: bool | None = False

Enables profiling and saves a trace during evaluation.

cache_data: str | None = 'auto'

Whether to cache data or not. If set to ‘auto’, caching is determined based on the dataset size.

seed: int | None = 1

Seed to be used for this run.

eval_steps: int | None = 0

Number of steps to run evaluation for. If not specified or negative, it means run evaluation on the whole validation dataset. If set to 0, it means run evaluation for a fixed number of samples.

batch_size: int | None = 100

Batch size for training.

measurement_selection_type: str | None = 'BEST_MEASUREMENT'

Which measurement to use if/when the service automatically selects the final measurement from previously reported intermediate measurements. One of “BEST_MEASUREMENT” or “LAST_MEASUREMENT”.

optimization_metric: str | None = ''

Optimization metric used for measurement_selection_type. Default is “rmse” for regression and “auc” for classification.

eval_frequency_secs: int | None = 600

Frequency at which evaluation and checkpointing will take place.

training_machine_spec: dict | None = {'machine_type': 'c2-standard-16'}

The training machine spec. See https://cloud.google.com/compute/docs/machine-types for options.

training_disk_spec: dict | None = {'boot_disk_size_gb': 100, 'boot_disk_type': 'pd-ssd'}

The training disk spec.

instance_baseline: dsl.Input[system.Artifact]

The path to a JSON file for baseline values.

metadata: dsl.Input[system.Artifact]

The tabular example gen metadata.

materialized_train_split: dsl.Input[system.Artifact]

The path to the materialized train split.

materialized_eval_split: dsl.Input[system.Artifact]

The path to the materialized validation split.

transform_output: dsl.Input[system.Artifact]

The path to transform output.

training_schema_uri: dsl.Input[system.Artifact]

The path to the training schema.

encryption_spec_key_name: str | None = ''

The KMS key name.

Returns

gcp_resources: dsl.OutputPath(str)

Serialized gcp_resources proto tracking the custom training job.

unmanaged_container_model: dsl.Output[google.UnmanagedContainerModel]

The UnmanagedContainerModel artifact.
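
The trainer emits an UnmanagedContainerModel artifact; a common follow-up step is to upload it to the Vertex AI Model Registry. The helper below is a hedged sketch: tabnet_trainer is assumed to be a TabNetTrainerOp task created earlier in the pipeline, and the ModelUploadOp import path reflects the v1 components package as I understand it.

    # Sketch only: call this from inside a @dsl.pipeline function.
    from kfp import dsl
    from google_cloud_pipeline_components.v1.model import ModelUploadOp

    def upload_tabnet_model(tabnet_trainer: dsl.PipelineTask):
        """Attach a model-upload step to an existing TabNetTrainerOp task."""
        return ModelUploadOp(
            project="my-project",      # hypothetical project
            location="us-central1",
            display_name="tabnet-model",
            unmanaged_container_model=tabnet_trainer.outputs[
                "unmanaged_container_model"
            ],
        )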

preview.automl.tabular.WideAndDeepHyperparameterTuningJobOp(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, study_spec_metric_id: str, study_spec_metric_goal: str, study_spec_parameters_override: list, max_trial_count: int, parallel_trial_count: int, instance_baseline: dsl.Input[system.Artifact], metadata: dsl.Input[system.Artifact], materialized_train_split: dsl.Input[system.Artifact], materialized_eval_split: dsl.Input[system.Artifact], transform_output: dsl.Input[system.Artifact], training_schema_uri: dsl.Input[system.Artifact], gcp_resources: dsl.OutputPath(str), instance_schema_uri: dsl.OutputPath(str), prediction_schema_uri: dsl.OutputPath(str), trials: dsl.OutputPath(str), prediction_docker_uri_output: dsl.OutputPath(str), execution_metrics: dsl.OutputPath(dict), weight_column: str | None = '', enable_profiler: bool | None = False, cache_data: str | None = 'auto', seed: int | None = 1, eval_steps: int | None = 0, eval_frequency_secs: int | None = 600, max_failed_trial_count: int | None = 0, study_spec_algorithm: str | None = 'ALGORITHM_UNSPECIFIED', study_spec_measurement_selection_type: str | None = 'BEST_MEASUREMENT', training_machine_spec: dict | None = {'machine_type': 'c2-standard-16'}, training_disk_spec: dict | None = {'boot_disk_size_gb': 100, 'boot_disk_type': 'pd-ssd'}, encryption_spec_key_name: str | None = '')

Tunes Wide & Deep hyperparameters using Vertex HyperparameterTuningJob API.

Parameters
project: str

The GCP project that runs the pipeline components.

location: str

The GCP region that runs the pipeline components.

root_dir: str

The root GCS directory for the pipeline components.

target_column: str

The target column name.

prediction_type: str

The type of prediction the model is to produce. “classification” or “regression”.

weight_column: str | None = ''

The weight column name.

enable_profiler: bool | None = False

Enables profiling and saves a trace during evaluation.

cache_data: str | None = 'auto'

Whether to cache data or not. If set to ‘auto’, caching is determined based on the dataset size.

seed: int | None = 1

Seed to be used for this run.

eval_steps: int | None = 0

Number of steps to run evaluation for. If not specified or negative, it means run evaluation on the whole validation dataset. If set to 0, it means run evaluation for a fixed number of samples.

eval_frequency_secs: int | None = 600

Frequency at which evaluation and checkpointing will take place.

study_spec_metric_id: str

Metric to optimize, possible values: [ ‘loss’, ‘average_loss’, ‘rmse’, ‘mae’, ‘mql’, ‘accuracy’, ‘auc’, ‘precision’, ‘recall’].

study_spec_metric_goal: str

Optimization goal of the metric, possible values: “MAXIMIZE”, “MINIMIZE”.

study_spec_parameters_override: list

List of dictionaries representing parameters to optimize. The dictionary key is the parameter_id, which is passed to training job as a command line argument, and the dictionary value is the parameter specification of the metric.

max_trial_count: int

The desired total number of trials.

parallel_trial_count: int

The desired number of trials to run in parallel.

max_failed_trial_count: int | None = 0

The number of failed trials that need to be seen before failing the HyperparameterTuningJob. If set to 0, Vertex AI decides how many trials must fail before the whole job fails.

study_spec_algorithm: str | None = 'ALGORITHM_UNSPECIFIED'

The search algorithm specified for the study. One of ‘ALGORITHM_UNSPECIFIED’, ‘GRID_SEARCH’, or ‘RANDOM_SEARCH’.

study_spec_measurement_selection_type: str | None = 'BEST_MEASUREMENT'

Which measurement to use if/when the service automatically selects the final measurement from previously reported intermediate measurements. One of “BEST_MEASUREMENT” or “LAST_MEASUREMENT”.

training_machine_spec: dict | None = {'machine_type': 'c2-standard-16'}

The training machine spec. See https://cloud.google.com/compute/docs/machine-types for options.

training_disk_spec: dict | None = {'boot_disk_size_gb': 100, 'boot_disk_type': 'pd-ssd'}

The training disk spec.

instance_baseline: dsl.Input[system.Artifact]

The path to a JSON file for baseline values.

metadata: dsl.Input[system.Artifact]

The tabular example gen metadata.

materialized_train_split: dsl.Input[system.Artifact]

The path to the materialized train split.

materialized_eval_split: dsl.Input[system.Artifact]

The path to the materialized validation split.

transform_output: dsl.Input[system.Artifact]

The path to transform output.

training_schema_uri: dsl.Input[system.Artifact]

The path to the training schema.

encryption_spec_key_name: str | None = ''

The KMS key name.

Returns

gcp_resources: dsl.OutputPath(str)

Serialized gcp_resources proto tracking the custom training job.

instance_schema_uri: dsl.OutputPath(str)

The path to the instance schema.

prediction_schema_uri: dsl.OutputPath(str)

The path to the prediction schema.

trials: dsl.OutputPath(str)

The path to the hyperparameter tuning trials.

prediction_docker_uri_output: dsl.OutputPath(str)

The URI of the prediction container.

execution_metrics: dsl.OutputPath(dict)

Core metrics of the hyperparameter tuning job execution, as a dictionary.
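
The same study-spec pattern shown for TabNet applies here; a hedged example override for two Wide & Deep learning-rate hyperparameters (names taken from the trainer signature below, ranges illustrative) might look like:

    # Sketch only: illustrative parameter ranges.
    wide_and_deep_study_spec_parameters_override = [
        {
            "parameter_id": "learning_rate",
            "double_value_spec": {"min_value": 1e-4, "max_value": 1e-1},
            "scale_type": "UNIT_LOG_SCALE",
        },
        {
            "parameter_id": "dnn_learning_rate",
            "double_value_spec": {"min_value": 1e-4, "max_value": 1e-1},
            "scale_type": "UNIT_LOG_SCALE",
        },
    ]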

preview.automl.tabular.WideAndDeepTrainerOp(project: str, location: str, root_dir: str, target_column: str, prediction_type: str, learning_rate: float, dnn_learning_rate: float, instance_baseline: dsl.Input[system.Artifact], metadata: dsl.Input[system.Artifact], materialized_train_split: dsl.Input[system.Artifact], materialized_eval_split: dsl.Input[system.Artifact], transform_output: dsl.Input[system.Artifact], training_schema_uri: dsl.Input[system.Artifact], gcp_resources: dsl.OutputPath(str), unmanaged_container_model: dsl.Output[google.UnmanagedContainerModel], weight_column: str | None = '', max_steps: int | None = - 1, max_train_secs: int | None = - 1, optimizer_type: str | None = 'adam', l1_regularization_strength: float | None = 0, l2_regularization_strength: float | None = 0, l2_shrinkage_regularization_strength: float | None = 0, beta_1: float | None = 0.9, beta_2: float | None = 0.999, hidden_units: str | None = '30,30,30', use_wide: bool | None = True, embed_categories: bool | None = True, dnn_dropout: float | None = 0, dnn_optimizer_type: str | None = 'ftrl', dnn_l1_regularization_strength: float | None = 0, dnn_l2_regularization_strength: float | None = 0, dnn_l2_shrinkage_regularization_strength: float | None = 0, dnn_beta_1: float | None = 0.9, dnn_beta_2: float | None = 0.999, enable_profiler: bool | None = False, cache_data: str | None = 'auto', seed: int | None = 1, eval_steps: int | None = 0, batch_size: int | None = 100, measurement_selection_type: str | None = 'BEST_MEASUREMENT', optimization_metric: str | None = '', eval_frequency_secs: int | None = 600, training_machine_spec: dict | None = {'machine_type': 'c2-standard-16'}, training_disk_spec: dict | None = {'boot_disk_size_gb': 100, 'boot_disk_type': 'pd-ssd'}, encryption_spec_key_name: str | None = '')

Trains a Wide & Deep model using Vertex CustomJob API.

Parameters
project: str

The GCP project that runs the pipeline components.

location: str

The GCP region that runs the pipeline components.

root_dir: str

The root GCS directory for the pipeline components.

target_column: str

The target column name.

prediction_type: str

The type of prediction the model is to produce. “classification” or “regression”.

weight_column: str | None = ''

The weight column name.

max_steps: int | None = -1

Number of steps to run the trainer for.

max_train_secs: int | None = -1

Amount of time in seconds to run the trainer for.

learning_rate: float

The learning rate used by the linear optimizer.

optimizer_type: str | None = 'adam'

The type of optimizer to use. Choices are “adam”, “ftrl” and “sgd” for the Adam, FTRL, and Gradient Descent Optimizers, respectively.

l1_regularization_strength: float | None = 0

L1 regularization strength for optimizer_type=”ftrl”.

l2_regularization_strength: float | None = 0

L2 regularization strength for optimizer_type=”ftrl”.

l2_shrinkage_regularization_strength: float | None = 0

L2 shrinkage regularization strength for optimizer_type=”ftrl”.

beta_1: float | None = 0.9

Beta 1 value for optimizer_type=”adam”.

beta_2: float | None = 0.999

Beta 2 value for optimizer_type=”adam”.

hidden_units: str | None = '30,30,30'

Hidden layer sizes to use for DNN feature columns, provided as a comma-separated list of layer sizes.

use_wide: bool | None = True

If set to true, the categorical columns will be used in the wide part of the DNN model.

embed_categories: bool | None = True

If set to true, the categorical columns will be embedded and used in the deep part of the model. Embedding size is the square root of the column cardinality.

dnn_dropout: float | None = 0

The probability we will drop out a given coordinate.

dnn_learning_rate: float

The learning rate for training the deep part of the model.

dnn_optimizer_type: str | None = 'ftrl'

The type of optimizer to use for the deep part of the model. Choices are “adam”, “ftrl” and “sgd” for the Adam, FTRL, and Gradient Descent Optimizers, respectively.

dnn_l1_regularization_strength: float | None = 0

L1 regularization strength for dnn_optimizer_type=”ftrl”.

dnn_l2_regularization_strength: float | None = 0

L2 regularization strength for dnn_optimizer_type=”ftrl”.

dnn_l2_shrinkage_regularization_strength: float | None = 0

L2 shrinkage regularization strength for dnn_optimizer_type=”ftrl”.

dnn_beta_1: float | None = 0.9

Beta 1 value for dnn_optimizer_type=”adam”.

dnn_beta_2: float | None = 0.999

Beta 2 value for dnn_optimizer_type=”adam”.

enable_profiler: bool | None = False

Enables profiling and saves a trace during evaluation.

cache_data: str | None = 'auto'

Whether to cache data or not. If set to ‘auto’, caching is determined based on the dataset size.

seed: int | None = 1

Seed to be used for this run.

eval_steps: int | None = 0

Number of steps to run evaluation for. If not specified or negative, evaluation runs on the whole validation dataset. If set to 0, evaluation runs for a fixed number of samples.

batch_size: int | None = 100

Batch size for training.

measurement_selection_type: str | None = 'BEST_MEASUREMENT'

Which measurement to use if/when the service automatically selects the final measurement from previously reported intermediate measurements. One of “BEST_MEASUREMENT” or “LAST_MEASUREMENT”.

optimization_metric: str | None = ''

Optimization metric used for measurement_selection_type. Default is “rmse” for regression and “auc” for classification.

eval_frequency_secs: int | None = 600

Frequency at which evaluation and checkpointing will take place.

training_machine_spec: dict | None = {'machine_type': 'c2-standard-16'}

The training machine spec. See https://cloud.google.com/compute/docs/machine-types for options.

training_disk_spec: dict | None = {'boot_disk_size_gb': 100, 'boot_disk_type': 'pd-ssd'}

The training disk spec.

instance_baseline: dsl.Input[system.Artifact]

The path to a JSON file for baseline values.

metadata: dsl.Input[system.Artifact]

The tabular example gen metadata.

materialized_train_split: dsl.Input[system.Artifact]

The path to the materialized train split.

materialized_eval_split: dsl.Input[system.Artifact]

The path to the materialized validation split.

transform_output: dsl.Input[system.Artifact]

The path to transform output.

training_schema_uri: dsl.Input[system.Artifact]

The path to the training schema.

encryption_spec_key_name: str | None = ''

The KMS key name.

Returns

gcp_resources: dsl.OutputPath(str)

Serialized gcp_resources proto tracking the custom training job.

unmanaged_container_model: dsl.Output[google.UnmanagedContainerModel]

The UnmanagedContainerModel artifact.
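For orientation, here is a minimal pipeline sketch. It assumes a recent KFP v2 SDK (where dsl.importer accepts runtime URIs) and the google_cloud_pipeline_components preview namespace; the artifact URIs, the target column name, and the learning rates are illustrative placeholders for values produced by your own feature-engineering steps.

from kfp import dsl
from google_cloud_pipeline_components.preview.automl.tabular import WideAndDeepTrainerOp


@dsl.pipeline(name="wide-and-deep-train")
def train_pipeline(
    project: str,
    location: str,
    root_dir: str,
    instance_baseline_uri: str,
    metadata_uri: str,
    train_split_uri: str,
    eval_split_uri: str,
    transform_output_uri: str,
    training_schema_uri: str,
):
    # Re-import artifacts that earlier feature-engineering steps wrote to Cloud Storage.
    def gcs_artifact(uri: str):
        return dsl.importer(artifact_uri=uri, artifact_class=dsl.Artifact).output

    trainer = WideAndDeepTrainerOp(
        project=project,
        location=location,
        root_dir=root_dir,
        target_column="label",            # illustrative column name
        prediction_type="classification",
        learning_rate=0.01,               # wide (linear) part
        dnn_learning_rate=0.001,          # deep part
        instance_baseline=gcs_artifact(instance_baseline_uri),
        metadata=gcs_artifact(metadata_uri),
        materialized_train_split=gcs_artifact(train_split_uri),
        materialized_eval_split=gcs_artifact(eval_split_uri),
        transform_output=gcs_artifact(transform_output_uri),
        training_schema_uri=gcs_artifact(training_schema_uri),
    )
    # trainer.outputs["unmanaged_container_model"] holds the trained model artifact;
    # trainer.outputs["gcp_resources"] tracks the underlying CustomJob.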

preview.automl.tabular.XGBoostHyperparameterTuningJobOp(project: str, location: str, study_spec_metric_id: str, study_spec_metric_goal: str, study_spec_parameters_override: list, max_trial_count: int, parallel_trial_count: int, worker_pool_specs: list, gcp_resources: dsl.OutputPath(str), max_failed_trial_count: int | None = 0, study_spec_algorithm: str | None = 'ALGORITHM_UNSPECIFIED', study_spec_measurement_selection_type: str | None = 'BEST_MEASUREMENT', encryption_spec_key_name: str | None = '')

Tunes XGBoost hyperparameters using Vertex HyperparameterTuningJob API.

Parameters
project: str

The GCP project that runs the pipeline components.

location: str

The GCP region that runs the pipeline components.

study_spec_metric_id: str

Metric to optimize. For options, please look under ‘eval_metric’ at https://xgboost.readthedocs.io/en/stable/parameter.html#learning-task-parameters.

study_spec_metric_goal: str

Optimization goal of the metric, possible values: “MAXIMIZE”, “MINIMIZE”.

study_spec_parameters_override: list

List of dictionaries representing parameters to optimize. The dictionary key is the parameter_id, which is passed to the training job as a command line argument, and the dictionary value is the parameter specification of the metric (see the sketch after this component's Returns section).

max_trial_count: int

The desired total number of trials.

parallel_trial_count: int

The desired number of trials to run in parallel.

max_failed_trial_count: int | None = 0

The number of failed trials that need to be seen before failing the HyperparameterTuningJob. If set to 0, Vertex AI decides how many trials must fail before the whole job fails.

study_spec_algorithm: str | None = 'ALGORITHM_UNSPECIFIED'

The search algorithm specified for the study. One of ‘ALGORITHM_UNSPECIFIED’, ‘GRID_SEARCH’, or ‘RANDOM_SEARCH’.

study_spec_measurement_selection_type: str | None = 'BEST_MEASUREMENT'

Which measurement to use if/when the service automatically selects the final measurement from previously reported intermediate measurements. One of “BEST_MEASUREMENT” or “LAST_MEASUREMENT”.

worker_pool_specs: list

The worker pool specs.

encryption_spec_key_name: str | None = ''

The KMS key name.

Returns

gcp_resources: dsl.OutputPath(str)

Serialized gcp_resources proto tracking the custom training job.
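The study-spec inputs are plain Python values. A minimal sketch, assuming the override dictionaries follow the snake_case Vertex AI StudySpec parameter-spec shape (parameter_id plus a *_value_spec); the metric and parameter ranges below are illustrative, not recommended defaults, and should be checked against the Vertex AI documentation for your SDK version.

# Metric to optimize: any value accepted under XGBoost's eval_metric.
study_spec_metric_id = "rmse"
study_spec_metric_goal = "MINIMIZE"

# Each entry names one hyperparameter (passed to the trainer as a command line
# argument) and the range to search over. Field names assume the Vertex AI
# StudySpec ParameterSpec dictionary shape.
study_spec_parameters_override = [
    {
        "parameter_id": "max_depth",
        "integer_value_spec": {"min_value": 4, "max_value": 10},
    },
    {
        "parameter_id": "eta",
        "double_value_spec": {"min_value": 0.01, "max_value": 0.3},
        "scale_type": "UNIT_LOG_SCALE",
    },
]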

preview.automl.tabular.XGBoostTrainerOp(project: str, location: str, worker_pool_specs: list, gcp_resources: dsl.OutputPath(str), encryption_spec_key_name: str | None = '')

Trains an XGBoost model using Vertex CustomJob API.

Parameters
project: str

The GCP project that runs the pipeline components.

location: str

The GCP region that runs the pipeline components.

worker_pool_specs: list

The worker pool specs (an example follows this entry's Returns section).

encryption_spec_key_name: str | None = ''

The KMS key name.

Returns

gcp_resources: dsl.OutputPath(str)

Serialized gcp_resources proto tracking the custom training job.
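Both XGBoost components take worker_pool_specs rather than individual machine parameters. A minimal sketch, assuming the dictionaries follow the Vertex AI CustomJob WorkerPoolSpec shape (machine_spec / replica_count / container_spec); the image URI and trainer arguments are placeholders rather than the component's real defaults.

# Single-replica training pool. Keys assume the Vertex AI CustomJob
# WorkerPoolSpec dictionary shape.
worker_pool_specs = [
    {
        "replica_count": 1,
        "machine_spec": {"machine_type": "c2-standard-16"},
        "container_spec": {
            "image_uri": "<xgboost-training-image-uri>",   # placeholder, not a real image
            "args": ["--target_column=label"],             # placeholder trainer args
        },
    }
]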