tfds.core.DatasetInfo

Information about a dataset.

tfds.core.DatasetInfo(
    *,
    builder: Union[DatasetIdentity, Any],
    description: Optional[str] = None,
    features: Optional[feature_lib.FeatureConnector] = None,
    supervised_keys: Optional[SupervisedKeysType] = None,
    disable_shuffling: bool = False,
    homepage: Optional[str] = None,
    citation: Optional[str] = None,
    metadata: Optional[Metadata] = None,
    license: Optional[str] = None,
    redistribution_info: Optional[Dict[str, str]] = None,
    split_dict: Optional[splits_lib.SplitDict] = None
)

DatasetInfo documents a dataset, including its name, version, and features. See the constructor arguments and properties for the full list.

Args
builder DatasetBuilder or DatasetIdentity. The dataset builder or identity will be used to populate this info.
description str, description of this dataset.
features tfds.features.FeaturesDict, Information on the feature dict of the tf.data.Dataset() object from the builder.as_dataset() method.
supervised_keys Specifies the input structure for supervised learning, if applicable for the dataset, used with as_supervised. The keys correspond to the feature names to select in info.features. When calling tfds.core.DatasetBuilder.as_dataset() with as_supervised=True, the tf.data.Dataset object will yield the structure defined by the keys passed here, instead of that defined by the features argument. Typically this is a (input_key, target_key) tuple, and the dataset yields a tuple of (input, target) tensors. To yield a more complex structure, pass a tuple of tf.nest-compatible structures of feature keys. The resulting Dataset will yield structures with each key replaced by the corresponding tensor. For example, passing a triple of keys would return a dataset that yields (feature, target, sample_weights) triples for Keras. Using supervised_keys=({'a': 'a', 'b': 'b'}, 'c') would create a dataset yielding a tuple with a dictionary of features in the features position. Note that selecting features in nested tfds.features.FeaturesDict objects is not supported.
disable_shuffling bool, if True, the examples are not shuffled and keep their generation order.
homepage str, optional, the homepage for this dataset.
citation str, optional, the citation to use for this dataset.
metadata tfds.core.Metadata, additional object which will be stored/restored with the dataset. This allows for storing additional information with the dataset.
license license of the dataset.
redistribution_info information needed for redistribution, as specified in dataset_info_pb2.RedistributionInfo. The content of the license subfield will automatically be written to a LICENSE file stored with the dataset.
split_dict information about the splits in this dataset.
Attributes
as_json
as_proto
as_proto_with_features
citation
config_description
config_name
config_tags
data_dir
dataset_size Generated dataset files size, in bytes.
description
disable_shuffling
download_size Downloaded files size, in bytes.
features
file_format
full_name Full canonical name: (<dataset_name>/<config_name>/<version>).
homepage
initialized Whether DatasetInfo has been fully initialized.
metadata
module_name
name
redistribution_info
release_notes
splits
supervised_keys
version

Methods

add_file_data_source_access

View source

add_file_data_source_access(
    path: Union[epath.PathLike, Iterable[epath.PathLike]],
    url: Optional[str] = None
) -> None

Records that the given query was used to generate this dataset.

Arguments
path path or paths of files that were read. Can be a file pattern. Multiple paths or patterns can be specified as a comma-separated string or a list.
url URL referring to the data being used.

add_sql_data_source_access

View source

add_sql_data_source_access(
    sql_query: str
) -> None

Records that the given query was used to generate this dataset.

add_tfds_data_source_access

View source

add_tfds_data_source_access(
    dataset_reference: naming.DatasetReference, url: Optional[str] = None
) -> None

Records that the given query was used to generate this dataset.

Args
dataset_reference
url a URL referring to the TFDS dataset.

add_url_access

View source

add_url_access(
    url: str, checksum: Optional[str] = None
) -> None

Records the URL used to generate this dataset.

from_proto

View source

@classmethod
from_proto(
    builder, proto: dataset_info_pb2.DatasetInfo
) -> 'DatasetInfo'

Instantiates DatasetInfo from the given builder and proto.

initialize_from_bucket

View source

initialize_from_bucket() -> None

Initialize DatasetInfo from GCS bucket info files.

read_from_directory

View source

read_from_directory(
    dataset_info_dir: epath.PathLike
) -> None

Update DatasetInfo from the metadata files in dataset_info_dir.

This function updates all the dynamically generated fields (num_examples, hash, time of creation,...) of the DatasetInfo.

This will overwrite all previous metadata.

Args
dataset_info_dir The directory containing the metadata file. This should be the root directory of a specific dataset version.
Raises
FileNotFoundError If the dataset_info.json can't be found.

set_file_format

View source

set_file_format(
    file_format: Union[None, str, file_adapters.FileFormat],
    override: bool = False
) -> None

Internal function to define the file format.

The file format is set during FileReaderBuilder.__init__, not DatasetInfo.__init__.

Args
file_format The file format.
override Whether the file format should be overridden if it is already set.
Raises
ValueError if the file format was already set and the override parameter was False.
RuntimeError if an incorrect combination of options is given, e.g. override=True when the DatasetInfo is already fully initialized.

set_splits

View source

set_splits(
    split_dict: splits_lib.SplitDict
) -> None

Split setter (private method).

update_data_dir

View source

update_data_dir(
    data_dir: str
) -> None

Updates the data dir for each split.

write_to_directory

View source

write_to_directory(
    dataset_info_dir: epath.PathLike, all_metadata=True
) -> None

Write DatasetInfo as JSON to dataset_info_dir + labels & features.

Args
dataset_info_dir path to directory in which to save the dataset_info.json file, as well as features.json and *.labels.txt if applicable.
all_metadata defaults to True. If False, will not write metadata which may have an impact on how the data is read (features.json). Should be set to True whenever write_to_directory is called for the first time for a new dataset.