Multiview dataset management

class Dataset

This is the base class for all types of multiview datasets in SuMMIT.

get_shape(view_index=0, sample_indices=None)

Gets the shape of the requested view for the requested samples.

Parameters:

view_index (int) – The index of the view to extract
sample_indices (numpy.ndarray) – The array containing the indices of the samples to extract.

Return type:

Tuple containing the shape

init_sample_indices(sample_indices=None)

If no sample indices are provided, selects all the available samples.

Parameters: sample_indices (np.array) – An array-like containing the indices of the samples.
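
The defaulting behaviour described above can be sketched as follows. This is a standalone illustration rather than SuMMIT's own code: the function name `default_sample_indices` and its `nb_samples` argument are hypothetical.

```python
import numpy as np


def default_sample_indices(nb_samples, sample_indices=None):
    """Return every index in [0, nb_samples) when none are given.

    Hypothetical helper, not part of SuMMIT.
    """
    if sample_indices is None:
        return np.arange(nb_samples)
    # Accept any array-like and normalise it to an integer array.
    return np.asarray(sample_indices, dtype=int)


all_samples = default_sample_indices(4)
some_samples = default_sample_indices(4, [2, 0])
```

Normalising to an array up front means later code can index views with the result without checking for None again.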

to_numpy_array(sample_indices=None, view_indices=None)

Concatenates the requested views into a single numpy array and records the limits of each view in a list, so that the individual views can be retrieved later.

Parameters:

sample_indices (array like) – The indices of the samples to extract from the dataset
view_indices (array like) – The indices of the views to concatenate in the numpy array

Returns:

concat_views (numpy array) – The numpy array containing all the needed views.
view_limits (list of int) – The limits of each slice used to extract the views.
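
To make the role of view_limits concrete, here is a minimal sketch of the concatenation on plain numpy arrays. It does not use SuMMIT itself: `concat_views_with_limits` is a hypothetical function, and the convention that view_limits starts at 0 and holds one more entry than the number of views is an assumption.

```python
import numpy as np


def concat_views_with_limits(views, sample_indices=None, view_indices=None):
    """Stack the selected views column-wise and record their column boundaries.

    Hypothetical stand-in for to_numpy_array, working on a list of 2D arrays.
    """
    if view_indices is None:
        view_indices = range(len(views))
    if sample_indices is None:
        sample_indices = np.arange(views[0].shape[0])
    selected = [views[i][sample_indices, :] for i in view_indices]
    # Assumed convention: columns view_limits[k]:view_limits[k + 1]
    # belong to the k-th selected view.
    view_limits = [0]
    for view in selected:
        view_limits.append(view_limits[-1] + view.shape[1])
    concat_views = np.concatenate(selected, axis=1)
    return concat_views, view_limits


views = [np.zeros((3, 2)), np.ones((3, 4))]
concat, limits = concat_views_with_limits(views)
# Recover the second view from the concatenated array.
second_view = concat[:, limits[1]:limits[2]]
```

Under this convention, any view can be sliced back out of the concatenated array with two consecutive entries of the limits list.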

class HDF5Dataset(views=None, labels=None, are_sparse=False, file_name='dataset.hdf5', view_names=None, path='', hdf5_file=None, labels_names=None, is_temp=False, sample_ids=None, feature_ids=None)

Dataset class

This is used to encapsulate the multiview dataset while keeping it stored on the disk instead of in RAM.

Parameters:

views (list of numpy arrays or None) – The list containing each view of the dataset as a numpy array of shape (nb samples, nb features).
labels (numpy array or None) – The labels for the multiview dataset, of shape (nb samples, ).
are_sparse (list of bool, or None) – The list of booleans indicating whether each view is sparse.
file_name (str, or None) – The name of the hdf5 file that will be created to store the multiview dataset.
view_names (list of str, or None) – The name of each view.
path (str, or None) – The path where the hdf5 dataset file will be stored.
hdf5_file (h5py.File object, or None) – If not None, the dataset will be imported directly from this file.
labels_names (list of str, or None) – The name for each unique value of the labels given in labels.
is_temp (bool) – Used if a temporary dataset has to be stored by the benchmark.

dataset

The h5py file object that points to the hdf5 dataset on the disk.

Type: h5py.File object

nb_view

The number of views in the dataset.

Type: int

view_dict

The dictionary with the name of each view as keys and their indices as values.

Type: dict
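
The name-to-index mapping described for view_dict can be pictured with a small sketch. The view names below are hypothetical, and the reverse lookup is only an illustration of what get_view_name(view_idx) conceptually provides.

```python
# Hypothetical view names; a real dataset supplies its own.
view_names = ["audio", "video", "text"]

# Map each view name to its position in the dataset.
view_dict = {name: index for index, name in enumerate(view_names)}

# Reverse lookup from index to name.
index_to_name = {index: name for name, index in view_dict.items()}
```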

get_label_names(decode=False, sample_indices=None)

Gets the list of label names for the given set of samples.

Parameters:

decode (bool) – If True, will decode the label names before listing them
sample_indices (numpy.ndarray) – The array containing the indices of the needed samples

Returns:

list – The names of the selected labels.
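
The sketch below illustrates what selecting and decoding label names might look like. It rests on several assumptions not stated in this reference: labels are integer codes that index into the list of names, names stored in HDF5 come back as bytes, and decoding uses UTF-8. The function `label_names_for` is hypothetical.

```python
import numpy as np


def label_names_for(labels, labels_names, sample_indices=None, decode=False):
    """List the names of the labels that occur among the selected samples.

    Hypothetical helper; assumes integer labels indexing into labels_names.
    """
    labels = np.asarray(labels)
    if sample_indices is None:
        sample_indices = np.arange(len(labels))
    present = np.unique(labels[sample_indices])
    names = [labels_names[int(label)] for label in present]
    if decode:
        # Turn bytes into str; leave names that are already str untouched.
        names = [n.decode("utf-8") if isinstance(n, bytes) else n for n in names]
    return names


labels = [0, 2, 2, 1]
labels_names = [b"cat", b"dog", b"bird"]
decoded_subset = label_names_for(labels, labels_names, [1, 2], decode=True)
raw_all = label_names_for(labels, labels_names)
```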

get_labels(sample_indices=None)

Gets the label array for the requested samples.

Parameters: sample_indices (numpy.ndarray) – The array containing the indices of the samples to extract.
Return type: numpy.ndarray containing the labels of the requested samples

get_name(): Gets the name of the dataset hdf5 file

get_nb_class(sample_indices=None)

Gets the number of classes of the dataset for the requested samples.

Parameters: sample_indices (numpy.ndarray) – The array containing the indices of the samples to extract.
Returns: The number of classes
Return type: int
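
Because the count is taken over the requested samples only, a subset can contain fewer classes than the full dataset. A minimal sketch, assuming the number of classes is the number of distinct label values (the function `nb_classes` is hypothetical):

```python
import numpy as np


def nb_classes(labels, sample_indices=None):
    """Count the distinct label values among the selected samples."""
    labels = np.asarray(labels)
    if sample_indices is not None:
        labels = labels[sample_indices]
    return len(np.unique(labels))


full_count = nb_classes([0, 1, 2, 1])
subset_count = nb_classes([0, 1, 2, 1], [1, 3])
```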

get_nb_samples()

Used to get the number of samples available in the dataset.

Return type: int

get_shape(view_index=0, sample_indices=None)

Gets the shape of the requested view for the requested samples.

Parameters:

view_index (int) – The index of the view to extract
sample_indices (numpy.ndarray) – The array containing the indices of the samples to extract.

Return type:

Tuple containing the shape

get_v(view_index, sample_indices=None)

Extracts the view and returns a numpy.ndarray containing the description of the samples specified in sample_indices.

Parameters:

view_index (int) – The index of the view to extract
sample_indices (numpy.ndarray) – The array containing the indices of the samples to extract.

Return type:

A numpy.ndarray containing the view data for the needed samples

get_view_dict(): Returns the dictionary containing view indices as keys and their corresponding names as values

get_view_name(view_idx)

Method to get a view’s name from its index.

Parameters: view_idx (int) – The index of the view in the dataset
Return type: The view’s name.

init_attrs(): Used to init the attributes that are modified when self.dataset changes

init_sample_indices(sample_indices=None)

If no sample indices are provided, selects all the available samples.

Parameters: sample_indices (np.array) – An array-like containing the indices of the samples.

rm(): Method used to delete the dataset file on the disk if the dataset is temporary.

to_numpy_array(sample_indices=None, view_indices=None)

Concatenates the requested views into a single numpy array and records the limits of each view in a list, so that the individual views can be retrieved later.

Parameters:

sample_indices (array like) – The indices of the samples to extract from the dataset
view_indices (array like) – The indices of the views to concatenate in the numpy array

Returns:

concat_views (numpy array) – The numpy array containing all the needed views.
view_limits (list of int) – The limits of each slice used to extract the views.

class RAMDataset(views=None, labels=None, are_sparse=False, view_names=None, labels_names=None, sample_ids=None, name=None, feature_ids=None)

get_shape(view_index=0, sample_indices=None)

Gets the shape of the requested view for the requested samples.

Parameters:

view_index (int) – The index of the view to extract
sample_indices (numpy.ndarray) – The array containing the indices of the samples to extract.

Return type:

Tuple containing the shape

init_sample_indices(sample_indices=None)

If no sample indices are provided, selects all the available samples.

Parameters: sample_indices (np.array) – An array-like containing the indices of the samples.

to_numpy_array(sample_indices=None, view_indices=None)

Concatenates the requested views into a single numpy array and records the limits of each view in a list, so that the individual views can be retrieved later.

Parameters:

sample_indices (array like) – The indices of the samples to extract from the dataset
view_indices (array like) – The indices of the views to concatenate in the numpy array

Returns:

concat_views (numpy array) – The numpy array containing all the needed views.
view_limits (list of int) – The limits of each slice used to extract the views.

confirm(resp=True, timeout=15): Used to process the user's answer

copy_hdf5(pathF, name, nbCores): Used to copy an HDF5 database in case of multicore computing

datasets_already_exist(pathF, name, nbCores): Used to check if it’s necessary to copy datasets

delete_HDF5(benchmarkArgumentsDictionaries, nbCores, dataset): Used to delete temporary copies at the end of the benchmark

extract_subset(matrix, used_indices): Used to extract a subset of a matrix even if it is sparse (work in progress)

get_samples_views_indices(dataset, samples_indices, view_indices): This function is used to get all the sample indices and view indices if needed

init_multiple_datasets(path_f, name, nb_cores)

Used to create copies of the dataset if multicore computation is used.

This is a temporary solution to fix the shared-memory issue with HDF5 datasets.

Parameters:

path_f (string) – Path to the original dataset directory
name (string) – Name of the dataset
nb_cores (int) – The number of threads that the benchmark can use

Returns:

datasetFiles – Dictionary summarizing which mono- and multiview algorithms will be used in the benchmark.

Return type:

None
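
The per-core copying idea can be sketched with the standard library alone. This is not SuMMIT's implementation: the function `copy_dataset_per_core` and the `<name>_temp_<index>.hdf5` naming scheme are hypothetical, and the demonstration copies a placeholder file rather than a real HDF5 dataset.

```python
import os
import shutil
import tempfile


def copy_dataset_per_core(path_f, name, nb_cores):
    """Create one private copy of <name>.hdf5 per core and return the paths.

    Hypothetical helper illustrating one copy per worker.
    """
    source = os.path.join(path_f, name + ".hdf5")
    copies = []
    for core_index in range(nb_cores):
        # Hypothetical naming scheme for the per-core copies.
        target = os.path.join(path_f, "{}_temp_{}.hdf5".format(name, core_index))
        shutil.copyfile(source, target)
        copies.append(target)
    return copies


# Demonstration in a throwaway directory with a placeholder file.
with tempfile.TemporaryDirectory() as tmp_dir:
    with open(os.path.join(tmp_dir, "demo.hdf5"), "wb") as handle:
        handle.write(b"placeholder")
    copies = copy_dataset_per_core(tmp_dir, "demo", 2)
    created = [os.path.basename(path) for path in copies]
```

Giving each worker its own file avoids several processes sharing one open HDF5 handle, at the cost of extra disk space, which is why the module also asks for confirmation before copying.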

input_(timeout=15): Used as a UI to stop if too much HDD space will be used