Multiview dataset management

class Dataset

This is the base class for all types of multiview datasets in SuMMIT.

get_shape(view_index=0, sample_indices=None)

Gets the shape of the requested view, restricted to the requested samples.

Parameters
  • view_index (int) – The index of the view to extract

  • sample_indices (numpy.ndarray) – The array containing the indices of the samples to extract.

Return type

Tuple containing the shape

init_sample_indices(sample_indices=None)

If no sample indices are provided, selects all the available samples.

Parameters

sample_indices (np.array) – An array-like containing the indices of the samples.
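The defaulting behaviour described above can be sketched as follows. This is a minimal stand-alone illustration, not SuMMIT's code: the free function and its `nb_samples` argument are hypothetical (the real method presumably reads the sample count from the dataset itself).

```python
import numpy as np


def init_sample_indices(nb_samples, sample_indices=None):
    """Hypothetical stand-alone helper: default to every sample index."""
    if sample_indices is None:
        # No selection given: use 0, 1, ..., nb_samples - 1.
        return np.arange(nb_samples)
    # Otherwise normalise the array-like into a numpy array.
    return np.asarray(sample_indices)


all_indices = init_sample_indices(5)
subset = init_sample_indices(5, [0, 3])
```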

to_numpy_array(sample_indices=None, view_indices=None)

Concatenates the needed views in one big numpy array while saving the limits of each view in a list, to be able to retrieve them later.

Parameters
  • sample_indices (array-like) – The indices of the samples to extract from the dataset.

  • view_indices (array-like) – The indices of the views to concatenate in the numpy array.

Returns

  • concat_views (numpy array) – The numpy array containing all the needed views.

  • view_limits (list of int) – The limits of each slice used to extract the views.
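To make the role of view_limits concrete, here is a minimal sketch of the idea using plain numpy arrays. The helper name `concat_views` and the choice to start the limits list with 0 are assumptions made for illustration; they are not taken from the SuMMIT source.

```python
import numpy as np


def concat_views(views, sample_indices=None, view_indices=None):
    """Hypothetical helper: stack the selected views column-wise."""
    if view_indices is None:
        view_indices = range(len(views))
    if sample_indices is None:
        sample_indices = np.arange(views[0].shape[0])
    selected = [views[i][sample_indices, :] for i in view_indices]
    # Cumulative column offsets (assumed layout):
    # view k occupies columns view_limits[k]:view_limits[k + 1].
    view_limits = [0]
    for view in selected:
        view_limits.append(view_limits[-1] + view.shape[1])
    return np.concatenate(selected, axis=1), view_limits


# Two toy views over the same 4 samples, with 2 and 3 features.
views = [np.zeros((4, 2)), np.ones((4, 3))]
concat, limits = concat_views(views, sample_indices=[0, 2])

# Recover the second view from the concatenated array.
second_view = concat[:, limits[1]:limits[2]]
```

With this layout, any view can be sliced back out of the concatenated array without keeping the original list of views around.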

class HDF5Dataset(views=None, labels=None, are_sparse=False, file_name='dataset.hdf5', view_names=None, path='', hdf5_file=None, labels_names=None, is_temp=False, sample_ids=None, feature_ids=None)

Dataset class

This is used to encapsulate the multiview dataset while keeping it stored on the disk instead of in RAM.

Parameters
  • views (list of numpy arrays or None) – The list containing each view of the dataset as a numpy array of shape (nb samples, nb features).

  • labels (numpy array or None) – The labels for the multiview dataset, of shape (nb samples, ).

  • are_sparse (list of bool, or None) – The list of booleans indicating whether each view is sparse.

  • file_name (str, or None) – The name of the hdf5 file that will be created to store the multiview dataset.

  • view_names (list of str, or None) – The name of each view.

  • path (str, or None) – The path where the hdf5 dataset file will be stored

  • hdf5_file (h5py.File object, or None) – If not None, the dataset will be imported directly from this file.

  • labels_names (list of str, or None) – The name for each unique value of the labels given in labels.

  • is_temp (bool) – Used if a temporary dataset has to be stored by the benchmark.

dataset

The h5py file object that points to the hdf5 dataset on the disk.

Type

h5py.File object

nb_view

The number of views in the dataset.

Type

int

view_dict

The dictionary with the name of each view as keys and their indices as values.

Type

dict
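As an illustration of the nb_view and view_dict attributes, the sketch below builds the same kind of name-to-index mapping from a list of view names. The view names are made up, and how HDF5Dataset actually populates these attributes is not shown in this page.

```python
# Hypothetical view names, in the order the views are stored.
view_names = ["sound", "image", "text"]

# Name of each view as key, its index as value.
view_dict = {name: index for index, name in enumerate(view_names)}

# The number of views follows directly from the mapping.
nb_view = len(view_dict)
```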

get_label_names(decode=True, sample_indices=None)

Gets the list of label names for the given set of samples.

Parameters
  • decode (bool) – If True, will decode the label names before listing them

  • sample_indices (numpy.ndarray) – The array containing the indices of the needed samples

Returns

The selected labels’ names.

Return type

list
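A rough sketch of what the decode flag could mean in practice, assuming the label names are stored as byte strings (as is common for HDF5 attributes) and that only the names of classes present among the selected samples are returned. Both points are assumptions, and the module-level data below is invented for the example.

```python
import numpy as np

# Hypothetical stored data: one byte-string name per class, one label per sample.
stored_names = [b"cat", b"dog", b"bird"]
labels = np.array([0, 2, 2, 1])


def get_label_names(decode=True, sample_indices=None):
    """Sketch: names of the classes present among the selected samples."""
    if sample_indices is None:
        sample_indices = np.arange(len(labels))
    present = np.unique(labels[sample_indices])
    names = [stored_names[int(i)] for i in present]
    if decode:
        # Turn the byte strings into regular Python strings.
        return [name.decode("utf-8") for name in names]
    return names


decoded_subset = get_label_names(sample_indices=[0, 1])
raw_all = get_label_names(decode=False)
```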

get_labels(sample_indices=None)

Gets the label array for the requested samples.

Parameters

sample_indices (numpy.ndarray) – The array containing the indices of the samples to extract.

Return type

numpy.ndarray containing the labels of the requested samples

get_name()

Gets the name of the dataset hdf5 file

get_nb_class(sample_indices=None)

Gets the number of classes of the dataset for the requested samples.

Parameters

sample_indices (numpy.ndarray) – The array containing the indices of the samples to extract.

Returns

The number of classes

Return type

int
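Counting classes on a subset of samples can be sketched with numpy.unique. This is an illustrative stand-alone function that takes the label array as an explicit argument, which the real method does not; the labels are made up.

```python
import numpy as np


def get_nb_class(labels, sample_indices=None):
    """Sketch: number of distinct labels among the selected samples."""
    if sample_indices is None:
        sample_indices = np.arange(labels.shape[0])
    return len(np.unique(labels[sample_indices]))


labels = np.array([0, 1, 1, 2, 0])
nb_all = get_nb_class(labels)
nb_subset = get_nb_class(labels, [1, 2])
```

Note that the count depends on the selection: a subset that misses some classes reports fewer classes than the full dataset.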

get_nb_samples()

Gets the number of samples available in the dataset.

Return type

int

get_shape(view_index=0, sample_indices=None)

Gets the shape of the requested view, restricted to the requested samples.

Parameters
  • view_index (int) – The index of the view to extract

  • sample_indices (numpy.ndarray) – The array containing the indices of the samples to extract.

Return type

Tuple containing the shape

get_v(view_index, sample_indices=None)

Extracts the view and returns a numpy.ndarray containing the description of the samples specified in sample_indices.

Parameters
  • view_index (int) – The index of the view to extract

  • sample_indices (numpy.ndarray) – The array containing the indices of the samples to extract.

Return type

A numpy.ndarray containing the view data for the needed samples
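The extraction performed by get_v, and the shape reported by get_shape, can be pictured with in-memory arrays. The sketch below is a simplification under assumptions: views are held in a plain Python list rather than in an HDF5 file, and the data is invented.

```python
import numpy as np

# Hypothetical dataset: two views describing the same 4 samples.
views = [np.arange(12).reshape(4, 3), np.arange(8).reshape(4, 2)]


def get_v(view_index, sample_indices=None):
    """Sketch: rows of one view for the selected samples."""
    view = views[view_index]
    if sample_indices is None:
        return view
    return view[np.asarray(sample_indices), :]


def get_shape(view_index=0, sample_indices=None):
    """Sketch: shape of the extracted block."""
    return get_v(view_index, sample_indices).shape


extracted = get_v(1, [0, 3])
shape_subset = get_shape(0, [0, 3])
```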

get_view_dict()

Returns the dictionary containing view names as keys and their corresponding indices as values, as described for view_dict.

get_view_name(view_idx)

Method to get a view’s name from its index.

Parameters

view_idx (int) – The index of the view in the dataset

Return type

The view’s name.

init_attrs()

Used to init the attributes that are modified when self.dataset changes

init_sample_indices(sample_indices=None)

If no sample indices are provided, selects all the available samples.

Parameters

sample_indices (np.array) – An array-like containing the indices of the samples.

rm()

Method used to delete the dataset file on the disk if the dataset is temporary.
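The guarded deletion can be sketched as below. The class `TempFileHolder` and its `file_path` attribute are hypothetical stand-ins; how HDF5Dataset actually decides that a file is temporary is not specified in this page, so the `is_temp` check is an assumption.

```python
import os
import tempfile


class TempFileHolder:
    """Hypothetical stand-in for a dataset object backed by a file on disk."""

    def __init__(self, file_path, is_temp=False):
        self.file_path = file_path
        self.is_temp = is_temp

    def rm(self):
        # Only temporary files are deleted; permanent ones are left untouched.
        if self.is_temp and os.path.exists(self.file_path):
            os.remove(self.file_path)
            return True
        return False


# Create an empty placeholder file, then remove it through rm().
handle, temp_path = tempfile.mkstemp(suffix=".hdf5")
os.close(handle)
removed = TempFileHolder(temp_path, is_temp=True).rm()
still_there = os.path.exists(temp_path)
```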

to_numpy_array(sample_indices=None, view_indices=None)

Concatenates the needed views in one big numpy array while saving the limits of each view in a list, to be able to retrieve them later.

Parameters
  • sample_indices (array-like) – The indices of the samples to extract from the dataset.

  • view_indices (array-like) – The indices of the views to concatenate in the numpy array.

Returns

  • concat_views (numpy array) – The numpy array containing all the needed views.

  • view_limits (list of int) – The limits of each slice used to extract the views.

class RAMDataset(views=None, labels=None, are_sparse=False, view_names=None, labels_names=None, sample_ids=None, name=None, feature_ids=None)

get_shape(view_index=0, sample_indices=None)

Gets the shape of the requested view, restricted to the requested samples.

Parameters
  • view_index (int) – The index of the view to extract

  • sample_indices (numpy.ndarray) – The array containing the indices of the samples to extract.

Return type

Tuple containing the shape

init_sample_indices(sample_indices=None)

If no sample indices are provided, selects all the available samples.

Parameters

sample_indices (np.array) – An array-like containing the indices of the samples.

to_numpy_array(sample_indices=None, view_indices=None)

Concatenates the needed views in one big numpy array while saving the limits of each view in a list, to be able to retrieve them later.

Parameters
  • sample_indices (array-like) – The indices of the samples to extract from the dataset.

  • view_indices (array-like) – The indices of the views to concatenate in the numpy array.

Returns

  • concat_views (numpy array) – The numpy array containing all the needed views.

  • view_limits (list of int) – The limits of each slice used to extract the views.

confirm(resp=True, timeout=15)

Used to process the user's answer.

copy_hdf5(pathF, name, nbCores)

Used to copy an HDF5 database in case of multicore computing.

datasets_already_exist(pathF, name, nbCores)

Used to check if it’s necessary to copy datasets

delete_HDF5(benchmarkArgumentsDictionaries, nbCores, dataset)

Used to delete temporary copies at the end of the benchmark

extract_subset(matrix, used_indices)

Used to extract a subset of a matrix, even if it is sparse (work in progress).

get_samples_views_indices(dataset, samples_indices, view_indices)

Returns the sample indices and view indices, selecting all of them if needed.

init_multiple_datasets(path_f, name, nb_cores)

Used to create copies of the dataset if multicore computation is used.

This is a temporary solution to the shared-memory issue with HDF5 datasets.

Parameters
  • path_f (string) – Path to the original dataset directory

  • name (string) – Name of the dataset

  • nb_cores (int) – The number of threads that the benchmark can use

Returns

datasetFiles – Dictionary summarizing which mono- and multiview algorithms will be used in the benchmark.

Return type

None
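The one-copy-per-core workaround can be sketched with the standard library alone. The helper name `copy_for_cores`, the `.hdf5` extension handling, and the `_temp_<i>` naming scheme are illustrative assumptions, not SuMMIT's actual behaviour; the demo file holds placeholder bytes rather than real HDF5 content.

```python
import os
import shutil
import tempfile


def copy_for_cores(path_f, name, nb_cores):
    """Hypothetical helper: make one copy of <name>.hdf5 per core."""
    source = os.path.join(path_f, name + ".hdf5")
    copies = []
    for core_index in range(nb_cores):
        # The "_temp_<i>" suffix is an illustrative naming choice.
        target = os.path.join(path_f, f"{name}_temp_{core_index}.hdf5")
        shutil.copyfile(source, target)
        copies.append(target)
    return copies


with tempfile.TemporaryDirectory() as directory:
    # Stand-in for a dataset file; the content is not valid HDF5.
    with open(os.path.join(directory, "demo.hdf5"), "wb") as handle:
        handle.write(b"placeholder bytes, not a real HDF5 file")
    created = copy_for_cores(directory, "demo", nb_cores=2)
    all_exist = all(os.path.exists(path) for path in created)
```

Giving each worker its own file avoids several processes sharing one open HDF5 handle, at the cost of extra disk space.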

input_(timeout=15)

Used as a UI to stop if too much HDD space would be used.