base datasets

load_dataset(dataset, unpack_dataset_columns=False, **kwargs)

Load a dataset as an np.ndarray of shape (nr_of_samples, 2).

The result is a 2D array in which each row represents one point in the time series. The first column is the x-variable and the second column is the y-variable.

If unpack_dataset_columns=True, the dataset is returned as two separate arrays, x and y.

The list of available datasets is in the traffic_weaver.datasets.data_description module.

Parameters:
  • dataset (str) – Name of the dataset to load.

  • unpack_dataset_columns (bool, default=False) – If True, the dataset is unpacked to two separate arrays x and y.

Returns:

dataset – 2D array in which each row represents one point in the time series. The first column is the x-variable and the second column is the y-variable.

Return type:

np.ndarray of shape (nr_of_samples, 2)

Examples

>>> data = load_dataset('sandvine_audio')
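
As a rough illustration of the two return layouts described above, the sketch below builds a small array by hand and splits it into its columns. The sample values are made up, and the slicing is only an assumption about what unpack_dataset_columns=True amounts to, not the library's actual code.

```python
import numpy as np

# Hand-made stand-in for a loaded dataset: one row per sample, columns (x, y).
dataset = np.array([[0.0, 1.5], [1.0, 2.5], [2.0, 0.5]])

# Packed layout: a single array of shape (nr_of_samples, 2).
print(dataset.shape)

# Unpacked layout (assumed equivalent of unpack_dataset_columns=True):
# two 1D arrays, one per column.
x, y = dataset[:, 0], dataset[:, 1]
print(x, y)
```

The unpacked form is convenient when x and y are passed on separately, for example to a plotting or interpolation routine.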

get_data_home(data_home: str = None) → str

Return the path of the data directory.

Datasets are stored in the ‘.traffic-weaver-data’ directory in the user's home directory.

This directory can be changed by setting the TRAFFIC_WEAVER_DATA environment variable.

Parameters:

data_home (str, default=None) – The path to the data directory. If None, the default directory is .traffic-weaver-data.

Returns:

data_home – The path to the data directory.

Return type:

str

Examples

>>> import os
>>> from traffic_weaver.datasets import get_data_home
>>> data_home = get_data_home()
>>> os.path.exists(data_home)
True
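
The lookup order described above can be sketched roughly as follows. This is a minimal sketch, not the library's implementation: resolve_data_home, ENV_VAR and DEFAULT_DIR are hypothetical names, and both the precedence (explicit argument, then environment variable, then default) and the creation of the directory are assumptions.

```python
import os
import tempfile

# Only the variable name and the folder name come from the documentation;
# everything else here is illustrative.
ENV_VAR = "TRAFFIC_WEAVER_DATA"
DEFAULT_DIR = os.path.join("~", ".traffic-weaver-data")


def resolve_data_home(data_home=None):
    """Return the data directory path, creating it if it does not exist."""
    if data_home is None:
        # Assumed precedence: explicit argument, then environment, then default.
        data_home = os.environ.get(ENV_VAR, DEFAULT_DIR)
    data_home = os.path.expanduser(data_home)
    os.makedirs(data_home, exist_ok=True)
    return data_home


# Usage: an explicit path is used as given.
example_home = resolve_data_home(os.path.join(tempfile.mkdtemp(), "cache"))
print(os.path.isdir(example_home))
```

Creating the directory on lookup would explain why os.path.exists returns True in the example above, but that behaviour is inferred rather than documented.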

clear_data_home(data_home: str = None)

Remove all files in the data directory.

Parameters:

data_home (str, default=None) – The path to the data directory. If None, the default directory is .traffic-weaver-data.
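
For illustration, clearing a directory of its contents could look like the sketch below. clear_directory is a hypothetical helper, and whether the library keeps the directory itself or removes it along with its contents is an assumption here.

```python
import os
import shutil
import tempfile


def clear_directory(path):
    """Remove every file and subfolder inside path, keeping path itself."""
    with os.scandir(path) as entries:
        for entry in list(entries):
            if entry.is_dir(follow_symlinks=False):
                shutil.rmtree(entry.path)
            else:
                os.remove(entry.path)


# Usage on a throwaway directory so that no real cache is touched.
scratch = tempfile.mkdtemp()
os.makedirs(os.path.join(scratch, "some_dataset"))
with open(os.path.join(scratch, "some_dataset", "data.pkl"), "wb") as handle:
    handle.write(b"placeholder")
with open(os.path.join(scratch, "notes.txt"), "w") as handle:
    handle.write("placeholder")

clear_directory(scratch)
print(os.listdir(scratch))
```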

load_csv_dataset_from_resources(file_name, resources_module='traffic_weaver.datasets.data', unpack_dataset_columns=False)

Load dataset from resources.

Parameters:
  • file_name (str) – Name of the file to load.

  • resources_module (str, default='traffic_weaver.datasets.data') – The package name where the resources are located.

  • unpack_dataset_columns (bool, default=False) – If True, the dataset is unpacked to two separate arrays x and y.

Returns:

dataset – 2D array in which each row represents one point in the time series. The first column is the x-variable and the second column is the y-variable.

Return type:

np.ndarray of shape (nr_of_samples, 2)

load_csv_dataset_from_remote(remote: RemoteFileMetadata, dataset_filename, dataset_folder, data_home=None, download_if_missing: bool = True, download_even_if_available: bool = False, validate_checksum: bool = True, n_retries=3, delay=1.0, gzip=False, unpack_dataset_columns=False)

Load a dataset in CSV format from a remote location; the remote file may be gzip-compressed (see the gzip parameter). After download, the dataset is stored in the cache folder in pickle format for later use.
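
The caching behaviour can be pictured with the sketch below, which uses the download_if_missing and download_even_if_available parameters documented for this function. It is a sketch under stated assumptions rather than the library's code: load_with_cache is a hypothetical helper, and fetch is a hypothetical callable standing in for the download-and-parse step.

```python
import os
import pickle
import tempfile


def load_with_cache(cache_path, fetch, download_if_missing=True,
                    download_even_if_available=False):
    """Return the cached object, calling fetch() only when needed."""
    cached = os.path.exists(cache_path)
    if cached and not download_even_if_available:
        # Cache hit: read the pickled dataset and skip the download.
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)
    if not cached and not download_if_missing:
        raise OSError(f"Data not found at {cache_path} and download is disabled")
    data = fetch()
    os.makedirs(os.path.dirname(os.path.abspath(cache_path)), exist_ok=True)
    with open(cache_path, "wb") as handle:
        pickle.dump(data, handle)
    return data


calls = []


def fake_fetch():
    # Stand-in for the real download; records each call and returns toy rows.
    calls.append("download")
    return [(0.0, 1.5), (1.0, 2.5)]


cache_file = os.path.join(tempfile.mkdtemp(), "demo", "dataset.pkl")
first = load_with_cache(cache_file, fake_fetch)   # fetches and writes the cache
second = load_with_cache(cache_file, fake_fetch)  # served from the pickle
print(len(calls))
```

Injecting fetch keeps the sketch free of network access; in practice that step would download and parse the CSV file.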

Parameters:
  • remote (RemoteFileMetadata) – Named tuple containing remote dataset meta information: url, filename, checksum.

  • dataset_filename (str) – Name for the dataset file.

  • dataset_folder (str) – Folder in data_home where the dataset is stored.

  • data_home (str, default=None) – Download cache folder for the dataset. By default, data is stored in ~/.traffic-weaver-data.

  • download_if_missing (bool, default=True) – If False, raise an OSError if the data is not locally available instead of trying to download the data from the source.

  • download_even_if_available (bool, default=False) – If True, download the data even if it is already available locally.

  • validate_checksum (bool, default=True) – If True, check the SHA256 checksum of the downloaded file.

  • n_retries (int, default=3) – Number of retries in case of HTTPError or URLError when downloading the data.

  • delay (float, default=1.0) – Number of seconds between retries.

  • gzip (bool, default=False) – If True, the file is assumed to be compressed in gzip format in the remote.

  • unpack_dataset_columns (bool, default=False) – If True, the dataset is unpacked to two separate arrays x and y.

Returns:

dataset – 2D array in which each row represents one point in the time series. The first column is the x-variable and the second column is the y-variable.

Return type:

np.ndarray of shape (nr_of_samples, 2)
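
The validate_checksum, n_retries and delay parameters suggest a download step along the lines of the sketch below. sha256_of_file, call_with_retries and flaky_download are hypothetical names introduced for illustration, and the exception raised on a checksum mismatch is an assumption.

```python
import hashlib
import os
import tempfile
import time
from urllib.error import HTTPError, URLError


def sha256_of_file(path, chunk_size=8192):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def call_with_retries(action, n_retries=3, delay=1.0):
    """Call action(); on HTTPError or URLError retry up to n_retries times."""
    while True:
        try:
            return action()
        except (HTTPError, URLError):
            if n_retries == 0:
                raise
            n_retries -= 1
            time.sleep(delay)


attempts = []


def flaky_download():
    # Hypothetical download that fails twice before writing the file.
    attempts.append(1)
    if len(attempts) < 3:
        raise URLError("temporary failure")
    path = os.path.join(tempfile.mkdtemp(), "dataset.csv")
    with open(path, "wb") as handle:
        handle.write(b"0,1.5\n1,2.5\n")
    return path


downloaded = call_with_retries(flaky_download, n_retries=3, delay=0.0)
expected = hashlib.sha256(b"0,1.5\n1,2.5\n").hexdigest()
if sha256_of_file(downloaded) != expected:
    # Assumed failure mode; the library may raise a different exception.
    raise OSError(f"{downloaded} has an unexpected SHA256 checksum")
```

Reading the file in chunks keeps memory use flat for large downloads, and comparing the digest before caching avoids storing a corrupted file.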

load_dataset_description(datasetsource_filename, resources_module='traffic_weaver.datasets.data_description')

Load the description of a dataset's source from a resource file.

Parameters:
  • datasetsource_filename (str) – Name of the file to load.

  • resources_module (str, default='traffic_weaver.datasets.data_description') – The package name where the resources are located.

Returns:

description – Source of the dataset.

Return type:

str