Multiview dataset management
- class Dataset
This is the base class for all types of multiview datasets in SuMMIT.
- get_shape(view_index=0, sample_indices=None)
Gets the shape of the requested view, restricted to the requested samples.
- Parameters
view_index (int) – The index of the view to extract
sample_indices (numpy.ndarray) – The array containing the indices of the samples to extract.
- Return type
Tuple containing the shape
- init_sample_indices(sample_indices=None)
If no sample indices are provided, selects all the available samples.
- Parameters
sample_indices (np.array) – An array-like containing the indices of the samples.
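The defaulting rule described above can be sketched with plain NumPy. This is a minimal illustration, not SuMMIT's implementation: the standalone function and its `nb_samples` argument are assumptions made for the example.

```python
import numpy as np

def init_sample_indices(nb_samples, sample_indices=None):
    """Return every index when none is given (hypothetical standalone helper)."""
    if sample_indices is None:
        # No selection requested: use all the available samples.
        return np.arange(nb_samples)
    # Accept any array-like and normalize it to an integer array.
    return np.asarray(sample_indices, dtype=int)

all_idx = init_sample_indices(5)
some_idx = init_sample_indices(5, [0, 3])
```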
- to_numpy_array(sample_indices=None, view_indices=None)
Concatenates the requested views into a single numpy array and records the limits of each view in a list, so that the views can be retrieved later.
- Parameters
sample_indices (array like) – The indices of the samples to extract from the dataset
view_indices (array like) – The indices of the views to concatenate in the numpy array
- Returns
concat_views (numpy array) – The numpy array containing all the requested views.
view_limits (list of int) – The limits of each slice used to extract the views.
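The concatenation and its limits can be illustrated as follows. This is a sketch under stated assumptions: the free function `concat_views` is hypothetical, the views are dense in-memory arrays, and the limits are assumed to be cumulative column boundaries starting at 0, which may differ from what SuMMIT actually stores.

```python
import numpy as np

def concat_views(views, sample_indices=None, view_indices=None):
    """Stack the selected views column-wise and record their column boundaries."""
    if view_indices is None:
        view_indices = range(len(views))
    if sample_indices is None:
        sample_indices = np.arange(views[0].shape[0])
    blocks = [views[v][sample_indices, :] for v in view_indices]
    # Assumed convention: view_limits[i] and view_limits[i + 1] bound
    # the columns of the i-th selected view.
    view_limits = [0]
    for block in blocks:
        view_limits.append(view_limits[-1] + block.shape[1])
    return np.concatenate(blocks, axis=1), view_limits

views = [np.zeros((4, 2)), np.ones((4, 3))]
X, limits = concat_views(views, sample_indices=[0, 2])
# Recover the second view from the concatenated array.
second_view = X[:, limits[1]:limits[2]]
```

Keeping the boundaries next to the array lets a monoview algorithm work on the concatenation while the individual views stay recoverable by slicing.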
- class HDF5Dataset(views=None, labels=None, are_sparse=False, file_name='dataset.hdf5', view_names=None, path='', hdf5_file=None, labels_names=None, is_temp=False, sample_ids=None, feature_ids=None)
Dataset class
This is used to encapsulate the multiview dataset while keeping it stored on the disk instead of in RAM.
- Parameters
views (list of numpy arrays or None) – The list containing each view of the dataset as a numpy array of shape (nb samples, nb features).
labels (numpy array or None) – The labels for the multiview dataset, of shape (nb samples, ).
are_sparse (list of bool, or None) – The list of booleans indicating whether each view is sparse.
file_name (str, or None) – The name of the hdf5 file that will be created to store the multiview dataset.
view_names (list of str, or None) – The name of each view.
path (str, or None) – The path where the hdf5 dataset file will be stored
hdf5_file (h5py.File object, or None) – If not None, the dataset will be imported directly from this file.
labels_names (list of str, or None) – The name for each unique value of the labels given in labels.
is_temp (bool) – Used if a temporary dataset has to be stored by the benchmark.
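The expected layout of the inputs can be sketched with NumPy alone. The snippet below only builds arrays of the documented shapes and checks that they agree; it does not call SuMMIT, and the view and label names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
nb_samples = 6

# Two views describing the same six samples with different feature counts,
# each of shape (nb samples, nb features).
views = [rng.normal(size=(nb_samples, 4)), rng.normal(size=(nb_samples, 10))]
# One label per sample, of shape (nb samples, ).
labels = rng.integers(0, 2, size=nb_samples)
view_names = ["view_a", "view_b"]        # hypothetical names
labels_names = ["negative", "positive"]  # hypothetical names

# Every view needs one row per sample, in the same order as the labels.
shapes_agree = all(view.shape[0] == labels.shape[0] for view in views)
```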
- dataset
The h5py file object that points to the hdf5 dataset on the disk.
- Type
h5py.File object
- nb_view
The number of views in the dataset.
- Type
int
- view_dict
The dictionary with the name of each view as the keys and their indices as values.
- Type
dict
- get_label_names(decode=True, sample_indices=None)
Used to get the list of the label names for the given set of samples
- Parameters
decode (bool) – If True, will decode the label names before listing them
sample_indices (numpy.ndarray) – The array containing the indices of the needed samples
- Returns
list
selected labels’ names
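A possible reading of this method is sketched below. It assumes that labels are integer codes indexing the list of names and that names are stored as bytes (as is common for HDF5 attributes); both are assumptions, and `label_names_for` is a hypothetical helper, not part of SuMMIT.

```python
import numpy as np

def label_names_for(labels, stored_names, sample_indices=None, decode=True):
    """List the names of the labels present among the selected samples."""
    if sample_indices is None:
        sample_indices = np.arange(len(labels))
    # Distinct label codes that occur in the selection.
    present = np.unique(labels[sample_indices])
    names = [stored_names[code] for code in present]
    if decode:
        # Turn byte strings into regular Python strings.
        names = [n.decode("utf-8") if isinstance(n, bytes) else n for n in names]
    return names

labels = np.array([0, 2, 2, 1])
stored = [b"cat", b"dog", b"bird"]
names = label_names_for(labels, stored, sample_indices=np.array([1, 2]))
```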
- get_labels(sample_indices=None)
Gets the label array for the requested samples.
- Parameters
sample_indices (numpy.ndarray) – The array containing the indices of the samples to extract.
- Return type
numpy.ndarray containing the labels of the requested samples
- get_name()
Gets the name of the dataset hdf5 file
- get_nb_class(sample_indices=None)
Gets the number of classes of the dataset for the requested samples.
- Parameters
sample_indices (numpy.ndarray) – The array containing the indices of the samples to extract.
- Returns
The number of classes
- Return type
int
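Restricting the class count to a subset of samples matters because a subset may not contain every class. A minimal NumPy sketch of that computation, with made-up labels and indices:

```python
import numpy as np

labels = np.array([0, 1, 1, 2, 0])
sample_indices = np.array([1, 2, 4])

# Count the distinct labels among the selected samples only.
nb_class = len(np.unique(labels[sample_indices]))
# Count over the whole dataset for comparison.
nb_class_all = len(np.unique(labels))
```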
- get_nb_samples()
Used to get the number of samples available in the dataset
- Return type
int
- get_shape(view_index=0, sample_indices=None)
Gets the shape of the requested view, restricted to the requested samples.
- Parameters
view_index (int) – The index of the view to extract
sample_indices (numpy.ndarray) – The array containing the indices of the samples to extract.
- Return type
Tuple containing the shape
- get_v(view_index, sample_indices=None)
Extracts the view and returns a numpy.ndarray containing the description of the samples specified in sample_indices
- Parameters
view_index (int) – The index of the view to extract
sample_indices (numpy.ndarray) – The array containing the indices of the samples to extract.
- Return type
A numpy.ndarray containing the view data for the requested samples
- get_view_dict()
Returns the dictionary containing view indices as keys and their corresponding names as values
- get_view_name(view_idx)
Method to get a view’s name from its index.
- Parameters
view_idx (int) – The index of the view in the dataset
- Return type
The view’s name.
- init_attrs()
Used to init the attributes that are modified when self.dataset changes
- init_sample_indices(sample_indices=None)
If no sample indices are provided, selects all the available samples.
- Parameters
sample_indices (np.array) – An array-like containing the indices of the samples.
- rm()
Method used to delete the dataset file on the disk if the dataset is temporary.
- to_numpy_array(sample_indices=None, view_indices=None)
Concatenates the requested views into a single numpy array and records the limits of each view in a list, so that the views can be retrieved later.
- Parameters
sample_indices (array like) – The indices of the samples to extract from the dataset
view_indices (array like) – The indices of the views to concatenate in the numpy array
- Returns
concat_views (numpy array) – The numpy array containing all the requested views.
view_limits (list of int) – The limits of each slice used to extract the views.
- class RAMDataset(views=None, labels=None, are_sparse=False, view_names=None, labels_names=None, sample_ids=None, name=None, feature_ids=None)
- get_shape(view_index=0, sample_indices=None)
Gets the shape of the requested view, restricted to the requested samples.
- Parameters
view_index (int) – The index of the view to extract
sample_indices (numpy.ndarray) – The array containing the indices of the samples to extract.
- Return type
Tuple containing the shape
- init_sample_indices(sample_indices=None)
If no sample indices are provided, selects all the available samples.
- Parameters
sample_indices (np.array) – An array-like containing the indices of the samples.
- to_numpy_array(sample_indices=None, view_indices=None)
Concatenates the requested views into a single numpy array and records the limits of each view in a list, so that the views can be retrieved later.
- Parameters
sample_indices (array like) – The indices of the samples to extract from the dataset
view_indices (array like) – The indices of the views to concatenate in the numpy array
- Returns
concat_views (numpy array) – The numpy array containing all the requested views.
view_limits (list of int) – The limits of each slice used to extract the views.
- confirm(resp=True, timeout=15)
Used to process the answer.
- copy_hdf5(pathF, name, nbCores)
Used to copy an HDF5 database in case of multicore computing
- datasets_already_exist(pathF, name, nbCores)
Used to check if it’s necessary to copy datasets
- delete_HDF5(benchmarkArgumentsDictionaries, nbCores, dataset)
Used to delete temporary copies at the end of the benchmark
- extract_subset(matrix, used_indices)
Used to extract a subset of a matrix, even if it’s sparse (work in progress).
- get_samples_views_indices(dataset, samples_indices, view_indices)
This function is used to get all the sample indices and view indices if needed
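One plausible reading of "if needed" is that missing arguments fall back to every sample and every view. The sketch below illustrates that reading only; `TinyDataset` is a hypothetical stand-in exposing `nb_view` and `get_nb_samples()` as documented above, and the order of the returned pair is an assumption.

```python
import numpy as np

class TinyDataset:
    """Hypothetical stand-in exposing only what this sketch needs."""
    def __init__(self, nb_samples, nb_view):
        self._nb_samples = nb_samples
        self.nb_view = nb_view

    def get_nb_samples(self):
        return self._nb_samples

def get_samples_views_indices(dataset, samples_indices, view_indices):
    # Fall back to every view when no view selection is given.
    if view_indices is None:
        view_indices = np.arange(dataset.nb_view)
    # Fall back to every sample when no sample selection is given.
    if samples_indices is None:
        samples_indices = np.arange(dataset.get_nb_samples())
    return samples_indices, view_indices

s, v = get_samples_views_indices(TinyDataset(3, 2), None, [1])
```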
- init_multiple_datasets(path_f, name, nb_cores)
Used to create copies of the dataset if multicore computation is used.
This is a temporary workaround for the memory-sharing issue with HDF5 datasets.
- Parameters
path_f (string) – Path to the original dataset directory
name (string) – Name of the dataset
nb_cores (int) – The number of threads that the benchmark can use
- Returns
datasetFiles – Dictionary summarizing which mono- and multiview algorithms will be used in the benchmark.
- Return type
None
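The idea of giving each core its own copy of the file can be sketched with the standard library. This is not SuMMIT's implementation: the helper name, the `_copy<i>` naming scheme and the returned list are assumptions, and the demo file holds placeholder bytes rather than a real HDF5 dataset.

```python
import os
import shutil
import tempfile

def copy_dataset_per_core(path_f, name, nb_cores):
    """Create one copy of <name>.hdf5 per core (hypothetical naming scheme)."""
    source = os.path.join(path_f, name + ".hdf5")
    copies = []
    for core_index in range(nb_cores):
        target = os.path.join(path_f, f"{name}_copy{core_index}.hdf5")
        # Skip the copy when it already exists from a previous run.
        if not os.path.exists(target):
            shutil.copyfile(source, target)
        copies.append(target)
    return copies

with tempfile.TemporaryDirectory() as tmp_dir:
    with open(os.path.join(tmp_dir, "demo.hdf5"), "wb") as handle:
        handle.write(b"placeholder bytes, not a real HDF5 file")
    copies = copy_dataset_per_core(tmp_dir, "demo", nb_cores=2)
    nb_copies_on_disk = sum(os.path.exists(p) for p in copies)
```

Each worker then opens its own file, which sidesteps concurrent access to a single HDF5 handle at the cost of extra disk space.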
- input_(timeout=15)
Used as a UI to stop the run if too much HDD space would be used.