Dataset

Manage datasets via Vertex AI Datasets.

Components:

GetVertexDatasetOp(dataset_resource_name, ...)

Gets an existing managed Dataset and outputs it as a Vertex Dataset artifact.

ImageDatasetCreateOp(display_name, dataset)

Creates a new image Dataset and optionally imports data into the Dataset when gcs_source and import_schema_uri are passed.

ImageDatasetExportDataOp(dataset, ...[, ...])

Exports Dataset to a GCS output directory.

ImageDatasetImportDataOp(dataset, ...[, ...])

Uploads data to an existing managed Dataset.

TabularDatasetCreateOp(display_name, dataset)

Creates a new tabular Dataset.

TabularDatasetExportDataOp(dataset, ...[, ...])

Exports Dataset to a GCS output directory.

TextDatasetCreateOp(display_name, dataset[, ...])

Creates a new text [Dataset](https://cloud.google.com/vertex-ai/docs/reference/rest/v1/projects.locations.datasets) and optionally imports data into the Dataset when gcs_source and import_schema_uri are passed.

TextDatasetExportDataOp(dataset, output_dir, ...)

Exports Dataset to a GCS output directory.

TextDatasetImportDataOp(dataset, ...[, ...])

Uploads data to an existing managed Dataset.

TimeSeriesDatasetCreateOp(display_name, dataset)

Creates a new time series Dataset.

TimeSeriesDatasetExportDataOp(dataset, ...)

Exports Dataset to a GCS output directory.

VideoDatasetCreateOp(display_name, dataset)

Creates a new video Dataset and optionally imports data into the Dataset when gcs_source and import_schema_uri are passed.

VideoDatasetExportDataOp(dataset, ...[, ...])

Exports Dataset to a GCS output directory.

VideoDatasetImportDataOp(dataset, ...[, ...])

Uploads data to an existing managed Dataset.

v1.dataset.GetVertexDatasetOp(dataset_resource_name: str, dataset: dsl.Output[google.VertexDataset], gcp_resources: dsl.OutputPath(str))

Gets an existing managed Dataset and outputs it as a Vertex Dataset artifact.

Parameters:
dataset_resource_name: str

Vertex Dataset resource name in the format of projects/{project}/locations/{location}/datasets/{dataset}.

Returns:

dataset: dsl.Output[google.VertexDataset]

Vertex Dataset artifact with a resourceName metadata field in the format of projects/{project}/locations/{location}/datasets/{dataset}.
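
For orientation, here is a minimal sketch of wiring GetVertexDatasetOp into a KFP pipeline. It assumes kfp v2 and google-cloud-pipeline-components are installed; the resource name shown is a placeholder.

```python
from kfp import dsl
from google_cloud_pipeline_components.v1.dataset import GetVertexDatasetOp


@dsl.pipeline(name="get-dataset-example")
def get_dataset_pipeline(
    # Placeholder; substitute your own project, location, and dataset ID.
    dataset_resource_name: str = "projects/PROJECT/locations/us-central1/datasets/DATASET_ID",
):
    # Resolves the resource name into a google.VertexDataset artifact that
    # downstream components can consume via get_op.outputs["dataset"].
    get_op = GetVertexDatasetOp(dataset_resource_name=dataset_resource_name)
```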

v1.dataset.ImageDatasetCreateOp(display_name: str, dataset: dsl.Output[google.VertexDataset], location: str | None = 'us-central1', data_item_labels: dict[str, str] | None = {}, gcs_source: str | None = None, import_schema_uri: str | None = None, labels: dict[str, str] | None = {}, encryption_spec_key_name: str | None = None, project: str = '{{$.pipeline_google_cloud_project_id}}')

Creates a new image Dataset and optionally imports data into the Dataset when gcs_source and import_schema_uri are passed.

Parameters:
display_name: str

The user-defined name of the Dataset. The name can be up to 128 characters long and can consist of any UTF-8 characters.

gcs_source: str | None = None

Google Cloud Storage URI(-s) to the input file(s). May contain wildcards. For more information on wildcards, see https://cloud.google.com/storage/docs/gsutil/addlhelp/WildcardNames. For example, "gs://bucket/file.csv" or ["gs://bucket/file1.csv", "gs://bucket/file2.csv"].

import_schema_uri: str | None = None

Points to a YAML file stored on Google Cloud Storage describing the import format. Validation will be done against the schema. The schema is defined as an OpenAPI 3.0.2 Schema Object.

data_item_labels: dict[str, str] | None = {}

Labels that will be applied to newly imported DataItems. If an identical DataItem already exists in the Dataset, these labels will be appended to those of the existing DataItem; if a label with the same key was imported before, its old value will be overwritten. If two DataItems in the same import operation are identical, their labels will be combined, and if a key collision occurs, one of the values will be picked randomly. Two DataItems are considered identical if their content bytes are identical (e.g. image bytes or pdf bytes). These labels will be overridden by Annotation labels specified inside the index file referenced by import_schema_uri, e.g. a JSONL file.

location: str | None = 'us-central1'

Optional location to retrieve Dataset from.

labels: dict[str, str] | None = {}

Labels with user-defined metadata to organize your Datasets. Label keys and values can be no longer than 64 characters (Unicode codepoints) and can only contain lowercase letters, numeric characters, underscores and dashes. International characters are allowed. No more than 64 user labels can be associated with one Dataset (system labels are excluded). See https://goo.gl/xmQnxf for more information and examples of labels. System reserved label keys are prefixed with "aiplatform.googleapis.com/" and are immutable.

encryption_spec_key_name: str | None = None

The Cloud KMS resource identifier of the customer managed encryption key used to protect the Dataset. Has the form: projects/my-project/locations/my-region/keyRings/my-kr/cryptoKeys/my-key. The key needs to be in the same region as where the compute resource is created. If set, this Dataset and all sub-resources of this Dataset will be secured by this key. Overrides encryption_spec_key_name set in aiplatform.init.

project: str = '{{$.pipeline_google_cloud_project_id}}'

Project to retrieve Dataset from. Defaults to the project in which the PipelineJob is run.

Returns:

dataset: dsl.Output[google.VertexDataset]

Instantiated representation of the managed image Dataset resource.
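
A minimal sketch of creating an image Dataset and importing data in one step. The bucket path and display name are placeholders, and the import schema URI shown is illustrative; pick the schema that matches your annotation format from the Vertex AI documentation.

```python
from kfp import dsl
from google_cloud_pipeline_components.v1.dataset import ImageDatasetCreateOp

# Illustrative schema for single-label image classification; verify against
# the Vertex AI docs for your annotation format.
IMPORT_SCHEMA = (
    "gs://google-cloud-aiplatform/schema/dataset/ioformat/"
    "image_classification_single_label_io_format_1.0.0.yaml"
)


@dsl.pipeline(name="image-dataset-create-example")
def create_image_dataset(
    project: str,
    gcs_source: str = "gs://my-bucket/annotations.jsonl",  # placeholder path
):
    # Creates the Dataset and, because gcs_source and import_schema_uri are
    # both set, imports the referenced data in the same step.
    create_op = ImageDatasetCreateOp(
        project=project,
        display_name="flowers",
        gcs_source=gcs_source,
        import_schema_uri=IMPORT_SCHEMA,
    )
```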

v1.dataset.ImageDatasetExportDataOp(dataset: dsl.Input[google.VertexDataset], output_dir: str, exported_dataset: dsl.Output[google.VertexDataset], location: str | None = 'us-central1', project: str = '{{$.pipeline_google_cloud_project_id}}')

Exports Dataset to a GCS output directory.

Parameters:
output_dir: str

The Google Cloud Storage location where the output is to be written. In the given directory, a new directory will be created with the name export-data-<dataset-display-name>-<timestamp-of-export-call>, where the timestamp is in YYYYMMDDHHMMSS format. All export output will be written into that directory. Inside that directory, annotations with the same schema will be grouped into subdirectories named with the corresponding annotations' schema title. Inside these subdirectories, a schema.yaml will be created to describe the output format. If the URI doesn't end with '/', a '/' will be automatically appended. The directory is created if it doesn't exist.

location: str | None = 'us-central1'

Optional location to retrieve Dataset from.

project: str = '{{$.pipeline_google_cloud_project_id}}'

Project to retrieve Dataset from. Defaults to the project in which the PipelineJob is run.

Returns:

exported_dataset: dsl.Output[google.VertexDataset]

All of the files that are exported in this export operation.
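
A sketch of exporting an existing Dataset, under the same placeholder conventions as above: GetVertexDatasetOp resolves the resource name, and its output artifact feeds the export component.

```python
from kfp import dsl
from google_cloud_pipeline_components.v1.dataset import (
    GetVertexDatasetOp,
    ImageDatasetExportDataOp,
)


@dsl.pipeline(name="image-dataset-export-example")
def export_image_dataset(
    project: str,
    dataset_resource_name: str,
    output_dir: str = "gs://my-bucket/exports/",  # placeholder path
):
    get_op = GetVertexDatasetOp(dataset_resource_name=dataset_resource_name)
    # Exported files land under output_dir/export-data-<display-name>-<timestamp>/.
    ImageDatasetExportDataOp(
        project=project,
        dataset=get_op.outputs["dataset"],
        output_dir=output_dir,
    )
```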

v1.dataset.ImageDatasetImportDataOp(dataset: dsl.Input[google.VertexDataset], dataset: dsl.Output[google.VertexDataset], location: str | None = 'us-central1', data_item_labels: dict[str, str] | None = {}, gcs_source: str | None = None, import_schema_uri: str | None = None, project: str = '{{$.pipeline_google_cloud_project_id}}')

Uploads data to an existing managed Dataset.

Parameters:
location: str | None = 'us-central1'

Optional location to retrieve Dataset from.

dataset: dsl.Input[google.VertexDataset]

The Dataset to be updated.

gcs_source: str | None = None

Google Cloud Storage URI(-s) to the input file(s). May contain wildcards. For more information on wildcards, see https://cloud.google.com/storage/docs/gsutil/addlhelp/WildcardNames. For example, "gs://bucket/file.csv" or ["gs://bucket/file1.csv", "gs://bucket/file2.csv"].

import_schema_uri: str | None = None

Points to a YAML file stored on Google Cloud Storage describing the import format. Validation will be done against the schema. The schema is defined as an OpenAPI 3.0.2 Schema Object.

data_item_labels: dict[str, str] | None = {}

Labels that will be applied to newly imported DataItems. If an identical DataItem already exists in the Dataset, these labels will be appended to those of the existing DataItem; if a label with the same key was imported before, its old value will be overwritten. If two DataItems in the same import operation are identical, their labels will be combined, and if a key collision occurs, one of the values will be picked randomly. Two DataItems are considered identical if their content bytes are identical (e.g. image bytes or pdf bytes). These labels will be overridden by Annotation labels specified inside the index file referenced by import_schema_uri, e.g. a JSONL file.

project: str = '{{$.pipeline_google_cloud_project_id}}'

Project to retrieve Dataset from. Defaults to the project in which the PipelineJob is run.

Returns:

dataset: dsl.Output[google.VertexDataset]

Instantiated representation of the managed Dataset resource.
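
A sketch of appending data to an existing Dataset; as above, the paths are placeholders and the schema URI is illustrative.

```python
from kfp import dsl
from google_cloud_pipeline_components.v1.dataset import (
    GetVertexDatasetOp,
    ImageDatasetImportDataOp,
)


@dsl.pipeline(name="image-dataset-import-example")
def import_into_dataset(
    project: str,
    dataset_resource_name: str,
    gcs_source: str = "gs://my-bucket/new-annotations.jsonl",  # placeholder
):
    get_op = GetVertexDatasetOp(dataset_resource_name=dataset_resource_name)
    # Uploads the new items into the resolved Dataset.
    ImageDatasetImportDataOp(
        project=project,
        dataset=get_op.outputs["dataset"],
        gcs_source=gcs_source,
        # Illustrative schema; match it to your annotation format.
        import_schema_uri=(
            "gs://google-cloud-aiplatform/schema/dataset/ioformat/"
            "image_classification_single_label_io_format_1.0.0.yaml"
        ),
    )
```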

v1.dataset.TabularDatasetCreateOp(display_name: str, dataset: dsl.Output[google.VertexDataset], location: str | None = 'us-central1', gcs_source: str | None = None, bq_source: str | None = None, labels: dict | None = {}, encryption_spec_key_name: str | None = None, project: str = '{{$.pipeline_google_cloud_project_id}}')

Creates a new tabular Dataset.

Parameters:
display_name: str

The user-defined name of the Dataset. The name can be up to 128 characters long and can consist of any UTF-8 characters.

gcs_source: str | None = None

Google Cloud Storage URI(-s) to the input file(s). May contain wildcards. For more information on wildcards, see https://cloud.google.com/storage/docs/gsutil/addlhelp/WildcardNames. For example, "gs://bucket/file.csv" or ["gs://bucket/file1.csv", "gs://bucket/file2.csv"].

bq_source: str | None = None

BigQuery URI to the input table. For example, “bq://project.dataset.table_name”.

location: str | None = 'us-central1'

Optional location to retrieve Dataset from.

labels: dict | None = {}

Labels with user-defined metadata to organize your Datasets. Label keys and values can be no longer than 64 characters (Unicode codepoints) and can only contain lowercase letters, numeric characters, underscores and dashes. International characters are allowed. No more than 64 user labels can be associated with one Dataset (system labels are excluded). See https://goo.gl/xmQnxf for more information and examples of labels. System reserved label keys are prefixed with "aiplatform.googleapis.com/" and are immutable.

encryption_spec_key_name: str | None = None

The Cloud KMS resource identifier of the customer managed encryption key used to protect the Dataset. Has the form: projects/my-project/locations/my-region/keyRings/my-kr/cryptoKeys/my-key. The key needs to be in the same region as where the compute resource is created. If set, this Dataset and all sub-resources of this Dataset will be secured by this key. Overrides encryption_spec_key_name set in aiplatform.init.

project: str = '{{$.pipeline_google_cloud_project_id}}'

Project to retrieve Dataset from. Defaults to the project in which the PipelineJob is run.

Returns:

dataset: dsl.Output[google.VertexDataset]

Instantiated representation of the managed tabular Dataset resource.
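
A sketch of creating a tabular Dataset from a BigQuery table; the table URI is a placeholder. Typically exactly one of gcs_source or bq_source is supplied.

```python
from kfp import dsl
from google_cloud_pipeline_components.v1.dataset import TabularDatasetCreateOp


@dsl.pipeline(name="tabular-dataset-create-example")
def create_tabular_dataset(
    project: str,
    bq_table: str = "bq://my-project.my_dataset.my_table",  # placeholder
):
    # Tabular datasets reference the source in place; no import schema is needed.
    TabularDatasetCreateOp(
        project=project,
        display_name="sales-data",
        bq_source=bq_table,
    )
```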

v1.dataset.TabularDatasetExportDataOp(dataset: dsl.Input[google.VertexDataset], output_dir: str, exported_dataset: dsl.Output[google.VertexDataset], location: str | None = 'us-central1', project: str = '{{$.pipeline_google_cloud_project_id}}')

Exports Dataset to a GCS output directory.

Parameters:
output_dir: str

The Google Cloud Storage location where the output is to be written. In the given directory, a new directory will be created with the name export-data-<dataset-display-name>-<timestamp-of-export-call>, where the timestamp is in YYYYMMDDHHMMSS format. All export output will be written into that directory. Inside that directory, annotations with the same schema will be grouped into subdirectories named with the corresponding annotations' schema title. Inside these subdirectories, a schema.yaml will be created to describe the output format. If the URI doesn't end with '/', a '/' will be automatically appended. The directory is created if it doesn't exist.

location: str | None = 'us-central1'

Optional location to retrieve Dataset from.

project: str = '{{$.pipeline_google_cloud_project_id}}'

Project to retrieve Dataset from. Defaults to the project in which the PipelineJob is run.

Returns:

exported_dataset: dsl.Output[google.VertexDataset]

All of the files that are exported in this export operation.

v1.dataset.TextDatasetCreateOp(display_name: str, dataset: dsl.Output[google.VertexDataset], location: str | None = 'us-central1', data_item_labels: dict | None = {}, gcs_source: str | None = None, import_schema_uri: str | None = None, labels: dict | None = {}, encryption_spec_key_name: str | None = None, project: str = '{{$.pipeline_google_cloud_project_id}}')

Creates a new text [Dataset](https://cloud.google.com/vertex-ai/docs/reference/rest/v1/projects.locations.datasets) and optionally imports data into the Dataset when gcs_source and import_schema_uri are passed.

Parameters:
display_name: str

The user-defined name of the Dataset. The name can be up to 128 characters long and can consist of any UTF-8 characters.

gcs_source: str | None = None

Google Cloud Storage URI(-s) to the input file(s). May contain wildcards. For more information on wildcards, see https://cloud.google.com/storage/docs/gsutil/addlhelp/WildcardNames. For example, "gs://bucket/file.csv" or ["gs://bucket/file1.csv", "gs://bucket/file2.csv"].

import_schema_uri: str | None = None

Points to a YAML file stored on Google Cloud Storage describing the import format. Validation will be done against the schema. The schema is defined as an OpenAPI 3.0.2 Schema Object.

data_item_labels: dict | None = {}

Labels that will be applied to newly imported DataItems. If an identical DataItem already exists in the Dataset, these labels will be appended to those of the existing DataItem; if a label with the same key was imported before, its old value will be overwritten. If two DataItems in the same import operation are identical, their labels will be combined, and if a key collision occurs, one of the values will be picked randomly. Two DataItems are considered identical if their content bytes are identical (e.g. image bytes or pdf bytes). These labels will be overridden by Annotation labels specified inside the index file referenced by import_schema_uri, e.g. a JSONL file.

location: str | None = 'us-central1'

Optional location to retrieve Dataset from.

labels: dict | None = {}

Labels with user-defined metadata to organize your Datasets. Label keys and values can be no longer than 64 characters (Unicode codepoints) and can only contain lowercase letters, numeric characters, underscores and dashes. International characters are allowed. No more than 64 user labels can be associated with one Dataset (system labels are excluded). See https://goo.gl/xmQnxf for more information and examples of labels. System reserved label keys are prefixed with "aiplatform.googleapis.com/" and are immutable.

encryption_spec_key_name: str | None = None

The Cloud KMS resource identifier of the customer managed encryption key used to protect the Dataset. Has the form: projects/my-project/locations/my-region/keyRings/my-kr/cryptoKeys/my-key. The key needs to be in the same region as where the compute resource is created. If set, this Dataset and all sub-resources of this Dataset will be secured by this key. Overrides encryption_spec_key_name set in aiplatform.init.

project: str = '{{$.pipeline_google_cloud_project_id}}'

Project to retrieve Dataset from. Defaults to the project in which the PipelineJob is run.

Returns:

dataset: dsl.Output[google.VertexDataset]

Instantiated representation of the managed text Dataset resource.
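
A sketch of creating a text Dataset with data-item labels and user labels attached; all names, label values, and URIs below are placeholders, and the schema URI is illustrative.

```python
from kfp import dsl
from google_cloud_pipeline_components.v1.dataset import TextDatasetCreateOp


@dsl.pipeline(name="text-dataset-create-example")
def create_text_dataset(
    project: str,
    gcs_source: str = "gs://my-bucket/tickets.jsonl",  # placeholder path
):
    TextDatasetCreateOp(
        project=project,
        display_name="support-tickets",
        gcs_source=gcs_source,
        # Illustrative schema for single-label text classification.
        import_schema_uri=(
            "gs://google-cloud-aiplatform/schema/dataset/ioformat/"
            "text_classification_single_label_io_format_1.0.0.yaml"
        ),
        # Applied to every newly imported DataItem.
        data_item_labels={"source": "import-2024"},
        # Applied to the Dataset resource itself.
        labels={"team": "ml-platform"},
    )
```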

v1.dataset.TextDatasetExportDataOp(dataset: dsl.Input[google.VertexDataset], output_dir: str, exported_dataset: dsl.Output[google.VertexDataset], location: str | None = 'us-central1', project: str = '{{$.pipeline_google_cloud_project_id}}')

Exports Dataset to a GCS output directory.

Parameters:
output_dir: str

The Google Cloud Storage location where the output is to be written. In the given directory, a new directory will be created with the name export-data-<dataset-display-name>-<timestamp-of-export-call>, where the timestamp is in YYYYMMDDHHMMSS format. All export output will be written into that directory. Inside that directory, annotations with the same schema will be grouped into subdirectories named with the corresponding annotations' schema title. Inside these subdirectories, a schema.yaml will be created to describe the output format. If the URI doesn't end with '/', a '/' will be automatically appended. The directory is created if it doesn't exist.

location: str | None = 'us-central1'

Optional location to retrieve Dataset from.

project: str = '{{$.pipeline_google_cloud_project_id}}'

Project to retrieve Dataset from. Defaults to the project in which the PipelineJob is run.

Returns:

exported_dataset: dsl.Output[google.VertexDataset]

All of the files that are exported in this export operation.

v1.dataset.TextDatasetImportDataOp(dataset: dsl.Input[google.VertexDataset], dataset: dsl.Output[google.VertexDataset], location: str | None = 'us-central1', data_item_labels: dict[str, str] | None = {}, gcs_source: str | None = None, import_schema_uri: str | None = None, project: str = '{{$.pipeline_google_cloud_project_id}}')

Uploads data to an existing managed Dataset.

Parameters:
location: str | None = 'us-central1'

Optional location to retrieve Dataset from.

dataset: dsl.Input[google.VertexDataset]

The Dataset to be updated.

gcs_source: str | None = None

Google Cloud Storage URI(-s) to the input file(s). May contain wildcards. For more information on wildcards, see https://cloud.google.com/storage/docs/gsutil/addlhelp/WildcardNames. For example, "gs://bucket/file.csv" or ["gs://bucket/file1.csv", "gs://bucket/file2.csv"].

import_schema_uri: str | None = None

Points to a YAML file stored on Google Cloud Storage describing the import format. Validation will be done against the schema. The schema is defined as an OpenAPI 3.0.2 Schema Object.

data_item_labels: dict[str, str] | None = {}

Labels that will be applied to newly imported DataItems. If an identical DataItem already exists in the Dataset, these labels will be appended to those of the existing DataItem; if a label with the same key was imported before, its old value will be overwritten. If two DataItems in the same import operation are identical, their labels will be combined, and if a key collision occurs, one of the values will be picked randomly. Two DataItems are considered identical if their content bytes are identical (e.g. image bytes or pdf bytes). These labels will be overridden by Annotation labels specified inside the index file referenced by import_schema_uri, e.g. a JSONL file.

project: str = '{{$.pipeline_google_cloud_project_id}}'

Project to retrieve Dataset from. Defaults to the project in which the PipelineJob is run.

Returns:

dataset: dsl.Output[google.VertexDataset]

Instantiated representation of the managed Dataset resource.

v1.dataset.TimeSeriesDatasetCreateOp(display_name: str, dataset: dsl.Output[google.VertexDataset], location: str | None = 'us-central1', gcs_source: str | None = None, bq_source: str | None = None, labels: dict | None = {}, encryption_spec_key_name: str | None = None, project: str = '{{$.pipeline_google_cloud_project_id}}')

Creates a new time series Dataset.

Parameters:
display_name: str

The user-defined name of the Dataset. The name can be up to 128 characters long and can consist of any UTF-8 characters.

gcs_source: str | None = None

Google Cloud Storage URI(-s) to the input file(s). May contain wildcards. For more information on wildcards, see https://cloud.google.com/storage/docs/gsutil/addlhelp/WildcardNames. For example, "gs://bucket/file.csv" or ["gs://bucket/file1.csv", "gs://bucket/file2.csv"].

bq_source: str | None = None

BigQuery URI to the input table. For example, "bq://project.dataset.table_name".

location: str | None = 'us-central1'

Optional location to retrieve Dataset from.

labels: dict | None = {}

Labels with user-defined metadata to organize your Datasets. Label keys and values can be no longer than 64 characters (Unicode codepoints) and can only contain lowercase letters, numeric characters, underscores and dashes. International characters are allowed. No more than 64 user labels can be associated with one Dataset (system labels are excluded). See https://goo.gl/xmQnxf for more information and examples of labels. System reserved label keys are prefixed with "aiplatform.googleapis.com/" and are immutable.

encryption_spec_key_name: str | None = None

The Cloud KMS resource identifier of the customer managed encryption key used to protect the Dataset. Has the form: projects/my-project/locations/my-region/keyRings/my-kr/cryptoKeys/my-key. The key needs to be in the same region as where the compute resource is created. If set, this Dataset and all sub-resources of this Dataset will be secured by this key. Overrides encryption_spec_key_name set in aiplatform.init.

project: str = '{{$.pipeline_google_cloud_project_id}}'

Project to retrieve Dataset from. Defaults to the project in which the PipelineJob is run.

Returns:

dataset: dsl.Output[google.VertexDataset]

Instantiated representation of the managed time series Dataset resource.
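
A sketch of creating a time series Dataset and compiling the pipeline to a spec file; it assumes the kfp v2 compiler, and the table URI and file name are placeholders.

```python
from kfp import compiler, dsl
from google_cloud_pipeline_components.v1.dataset import TimeSeriesDatasetCreateOp


@dsl.pipeline(name="timeseries-dataset-create-example")
def create_timeseries_dataset(
    project: str,
    bq_source: str = "bq://my-project.forecasting.demand_history",  # placeholder
):
    TimeSeriesDatasetCreateOp(
        project=project,
        display_name="demand-history",
        bq_source=bq_source,
    )


# Compile to a pipeline spec that can be submitted as a Vertex AI PipelineJob.
compiler.Compiler().compile(create_timeseries_dataset, "timeseries_dataset.json")
```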

v1.dataset.TimeSeriesDatasetExportDataOp(dataset: dsl.Input[google.VertexDataset], output_dir: str, exported_dataset: dsl.Output[google.VertexDataset], location: str | None = 'us-central1', project: str = '{{$.pipeline_google_cloud_project_id}}')

Exports Dataset to a GCS output directory.

Parameters:
output_dir: str

The Google Cloud Storage location where the output is to be written. In the given directory, a new directory will be created with the name export-data-<dataset-display-name>-<timestamp-of-export-call>, where the timestamp is in YYYYMMDDHHMMSS format. All export output will be written into that directory. Inside that directory, annotations with the same schema will be grouped into subdirectories named with the corresponding annotations' schema title. Inside these subdirectories, a schema.yaml will be created to describe the output format. If the URI doesn't end with '/', a '/' will be automatically appended. The directory is created if it doesn't exist.

location: str | None = 'us-central1'

Optional location to retrieve Dataset from.

project: str = '{{$.pipeline_google_cloud_project_id}}'

Project to retrieve Dataset from. Defaults to the project in which the PipelineJob is run.

Returns:

exported_dataset: dsl.Output[google.VertexDataset]

All of the files that are exported in this export operation.

v1.dataset.VideoDatasetCreateOp(display_name: str, dataset: dsl.Output[google.VertexDataset], location: str | None = 'us-central1', data_item_labels: dict | None = {}, gcs_source: str | None = None, import_schema_uri: str | None = None, labels: dict | None = {}, encryption_spec_key_name: str | None = None, project: str = '{{$.pipeline_google_cloud_project_id}}')

Creates a new video Dataset and optionally imports data into the Dataset when gcs_source and import_schema_uri are passed.

Parameters:
display_name: str

The user-defined name of the Dataset. The name can be up to 128 characters long and can consist of any UTF-8 characters.

gcs_source: str | None = None

Google Cloud Storage URI(-s) to the input file(s). May contain wildcards. For more information on wildcards, see https://cloud.google.com/storage/docs/gsutil/addlhelp/WildcardNames. For example, "gs://bucket/file.csv" or ["gs://bucket/file1.csv", "gs://bucket/file2.csv"].

import_schema_uri: str | None = None

Points to a YAML file stored on Google Cloud Storage describing the import format. Validation will be done against the schema. The schema is defined as an OpenAPI 3.0.2 Schema Object.

data_item_labels: dict | None = {}

Labels that will be applied to newly imported DataItems. If an identical DataItem already exists in the Dataset, these labels will be appended to those of the existing DataItem; if a label with the same key was imported before, its old value will be overwritten. If two DataItems in the same import operation are identical, their labels will be combined, and if a key collision occurs, one of the values will be picked randomly. Two DataItems are considered identical if their content bytes are identical (e.g. image bytes or pdf bytes). These labels will be overridden by Annotation labels specified inside the index file referenced by import_schema_uri, e.g. a JSONL file.

location: str | None = 'us-central1'

Optional location to retrieve Dataset from.

labels: dict | None = {}

Labels with user-defined metadata to organize your Datasets. Label keys and values can be no longer than 64 characters (Unicode codepoints) and can only contain lowercase letters, numeric characters, underscores and dashes. International characters are allowed. No more than 64 user labels can be associated with one Dataset (system labels are excluded). See https://goo.gl/xmQnxf for more information and examples of labels. System reserved label keys are prefixed with "aiplatform.googleapis.com/" and are immutable.

encryption_spec_key_name: str | None = None

The Cloud KMS resource identifier of the customer managed encryption key used to protect the Dataset. Has the form: projects/my-project/locations/my-region/keyRings/my-kr/cryptoKeys/my-key. The key needs to be in the same region as where the compute resource is created. If set, this Dataset and all sub-resources of this Dataset will be secured by this key. Overrides encryption_spec_key_name set in aiplatform.init.

project: str = '{{$.pipeline_google_cloud_project_id}}'

Project to retrieve Dataset from. Defaults to the project in which the PipelineJob is run.

Returns:

dataset: dsl.Output[google.VertexDataset]

Instantiated representation of the managed video Dataset resource.

v1.dataset.VideoDatasetExportDataOp(dataset: dsl.Input[google.VertexDataset], output_dir: str, exported_dataset: dsl.Output[google.VertexDataset], location: str | None = 'us-central1', project: str = '{{$.pipeline_google_cloud_project_id}}')

Exports Dataset to a GCS output directory.

Parameters:
output_dir: str

The Google Cloud Storage location where the output is to be written. In the given directory, a new directory will be created with the name export-data-<dataset-display-name>-<timestamp-of-export-call>, where the timestamp is in YYYYMMDDHHMMSS format. All export output will be written into that directory. Inside that directory, annotations with the same schema will be grouped into subdirectories named with the corresponding annotations' schema title. Inside these subdirectories, a schema.yaml will be created to describe the output format. If the URI doesn't end with '/', a '/' will be automatically appended. The directory is created if it doesn't exist.

location: str | None = 'us-central1'

Optional location to retrieve Dataset from.

project: str = '{{$.pipeline_google_cloud_project_id}}'

Project to retrieve Dataset from. Defaults to the project in which the PipelineJob is run.

Returns:

exported_dataset: dsl.Output[google.VertexDataset]

All of the files that are exported in this export operation.

v1.dataset.VideoDatasetImportDataOp(dataset: dsl.Input[google.VertexDataset], dataset: dsl.Output[google.VertexDataset], location: str | None = 'us-central1', data_item_labels: dict[str, str] | None = {}, gcs_source: str | None = None, import_schema_uri: str | None = None, project: str = '{{$.pipeline_google_cloud_project_id}}')

Uploads data to an existing managed Dataset.

Parameters:
location: str | None = 'us-central1'

Optional location to retrieve Dataset from.

dataset: dsl.Input[google.VertexDataset]

The Dataset to be updated.

gcs_source: str | None = None

Google Cloud Storage URI(-s) to the input file(s). May contain wildcards. For more information on wildcards, see https://cloud.google.com/storage/docs/gsutil/addlhelp/WildcardNames. For example, "gs://bucket/file.csv" or ["gs://bucket/file1.csv", "gs://bucket/file2.csv"].

import_schema_uri: str | None = None

Points to a YAML file stored on Google Cloud Storage describing the import format. Validation will be done against the schema. The schema is defined as an OpenAPI 3.0.2 Schema Object.

data_item_labels: dict[str, str] | None = {}

Labels that will be applied to newly imported DataItems. If an identical DataItem already exists in the Dataset, these labels will be appended to those of the existing DataItem; if a label with the same key was imported before, its old value will be overwritten. If two DataItems in the same import operation are identical, their labels will be combined, and if a key collision occurs, one of the values will be picked randomly. Two DataItems are considered identical if their content bytes are identical (e.g. image bytes or pdf bytes). These labels will be overridden by Annotation labels specified inside the index file referenced by import_schema_uri, e.g. a JSONL file.

project: str = '{{$.pipeline_google_cloud_project_id}}'

Project to retrieve Dataset from. Defaults to the project in which the PipelineJob is run.

Returns:

dataset: dsl.Output[google.VertexDataset]

Instantiated representation of the managed Dataset resource.
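
Finally, a sketch chaining the video components: create a Dataset with an import, then export it, passing the output artifact between steps. The schema URI is illustrative and the paths are placeholders.

```python
from kfp import dsl
from google_cloud_pipeline_components.v1.dataset import (
    VideoDatasetCreateOp,
    VideoDatasetExportDataOp,
)


@dsl.pipeline(name="video-dataset-roundtrip-example")
def video_dataset_roundtrip(
    project: str,
    gcs_source: str = "gs://my-bucket/videos.jsonl",  # placeholder
    output_dir: str = "gs://my-bucket/video-exports/",  # placeholder
):
    create_op = VideoDatasetCreateOp(
        project=project,
        display_name="traffic-cams",
        gcs_source=gcs_source,
        # Illustrative schema for video classification.
        import_schema_uri=(
            "gs://google-cloud-aiplatform/schema/dataset/ioformat/"
            "video_classification_io_format_1.0.0.yaml"
        ),
    )
    # The create step's Dataset artifact feeds the export step directly.
    VideoDatasetExportDataOp(
        project=project,
        dataset=create_op.outputs["dataset"],
        output_dir=output_dir,
    )
```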