tfds.core.DatasetInfo

Information about a dataset.

tfds.core.DatasetInfo(
    *,
    builder: Union[DatasetIdentity, Any],
    description: Optional[str] = None,
    features: Optional[feature_lib.FeatureConnector] = None,
    supervised_keys: Optional[SupervisedKeysType] = None,
    disable_shuffling: bool = False,
    homepage: Optional[str] = None,
    citation: Optional[str] = None,
    metadata: Optional[Metadata] = None,
    license: Optional[str] = None,
    redistribution_info: Optional[Dict[str, str]] = None,
    split_dict: Optional[splits_lib.SplitDict] = None
)

DatasetInfo documents a dataset, including its name, version, and features. See the constructor arguments and properties for the full list.

Args
builder DatasetBuilder or DatasetIdentity. The dataset builder or identity will be used to populate this info.
description str, description of this dataset.
features tfds.features.FeaturesDict, Information on the feature dict of the tf.data.Dataset() object from the builder.as_dataset() method.
supervised_keys Specifies the input structure for supervised learning, if applicable for the dataset, used with as_supervised. The keys correspond to the feature names to select in info.features. When calling tfds.core.DatasetBuilder.as_dataset() with as_supervised=True, the tf.data.Dataset object will yield the structure defined by the keys passed here, instead of that defined by the features argument. Typically this is a (input_key, target_key) tuple, and the dataset yields a tuple of (input, target) tensors. To yield a more complex structure, pass a tuple of tf.nest-compatible structures of feature keys. The resulting Dataset will yield structures with each key replaced by the corresponding tensor. For example, passing a triple of keys would return a dataset that yields (feature, target, sample_weights) triples for Keras. Using supervised_keys=({'a': 'a', 'b': 'b'}, 'c') would create a dataset yielding a tuple with a dictionary of features in the features position. Note that selecting features in nested tfds.features.FeaturesDict objects is not supported.
disable_shuffling bool, if True, the examples are not shuffled and keep their generation order.
homepage str, optional, the homepage for this dataset.
citation str, optional, the citation to use for this dataset.
metadata tfds.core.Metadata, additional object which will be stored/restored with the dataset. This allows for storing additional information with the dataset.
license license of the dataset.
redistribution_info information needed for redistribution, as specified in dataset_info_pb2.RedistributionInfo. The content of the license subfield will automatically be written to a LICENSE file stored with the dataset.
split_dict information about the splits in this dataset.
Attributes
as_json
as_proto
as_proto_with_features
citation
config_description
config_name
config_tags
data_dir
dataset_size Generated dataset files size, in bytes.
description
disable_shuffling
download_size Downloaded files size, in bytes.
features
file_format
full_name Full canonical name: (<dataset_name>/<config_name>/<version>).
homepage
initialized Whether DatasetInfo has been fully initialized.
metadata
module_name
name
redistribution_info
release_notes
splits
supervised_keys
version

Methods

add_file_data_source_access

View source

add_file_data_source_access(
    path: Union[epath.PathLike, Iterable[epath.PathLike]],
    url: Optional[str] = None
) -> None

Records that the given query was used to generate this dataset.

Arguments
path path or paths of files that were read. Can be a file pattern. Multiple paths or patterns can be specified as a comma-separated string or a list.
url URL referring to the data being used.

add_sql_data_source_access

View source

add_sql_data_source_access(
    sql_query: str
) -> None

Records that the given query was used to generate this dataset.

add_tfds_data_source_access

View source

add_tfds_data_source_access(
    dataset_reference: naming.DatasetReference, url: Optional[str] = None
) -> None

Records that the given query was used to generate this dataset.

Args
dataset_reference
url a URL referring to the TFDS dataset.

add_url_access

View source

add_url_access(
    url: str, checksum: Optional[str] = None
) -> None

Records the URL used to generate this dataset.

from_proto

View source

@classmethod
from_proto(
    builder, proto: dataset_info_pb2.DatasetInfo
) -> 'DatasetInfo'

Instantiates DatasetInfo from the given builder and proto.

initialize_from_bucket

View source

initialize_from_bucket() -> None

Initialize DatasetInfo from GCS bucket info files.

read_from_directory

View source

read_from_directory(
    dataset_info_dir: epath.PathLike
) -> None

Update DatasetInfo from the metadata files in dataset_info_dir.

This function updates all the dynamically generated fields (num_examples, hash, time of creation,...) of the DatasetInfo.

This will overwrite all previous metadata.

Args
dataset_info_dir The directory containing the metadata file. This should be the root directory of a specific dataset version.
Raises
FileNotFoundError If the dataset_info.json can't be found.

set_file_format

View source

set_file_format(
    file_format: Union[None, str, file_adapters.FileFormat],
    override: bool = False
) -> None

Internal function to define the file format.

The file format is set during FileReaderBuilder.__init__, not DatasetInfo.__init__.

Args
file_format The file format.
override Whether the file format should be overridden if it is already set.
Raises
ValueError if the file format was already set and the override parameter was False.
RuntimeError if an incorrect combination of options is given, e.g. override=True when the DatasetInfo is already fully initialized.

set_splits

View source

set_splits(
    split_dict: splits_lib.SplitDict
) -> None

Split setter (private method).

update_data_dir

View source

update_data_dir(
    data_dir: str
) -> None

Updates the data dir for each split.

write_to_directory

View source

write_to_directory(
    dataset_info_dir: epath.PathLike, all_metadata=True
) -> None

Write DatasetInfo as JSON to dataset_info_dir + labels & features.

Args
dataset_info_dir path to directory in which to save the dataset_info.json file, as well as features.json and *.labels.txt if applicable.
all_metadata defaults to True. If False, will not write metadata which may have an impact on how the data is read (features.json). Should be set to True whenever write_to_directory is called for the first time for a new dataset.