MAGE tutorial: the sample types
In this tutorial, we will learn how to generate a multiview dataset presenting:
redundancy,
complementarity and
mutual error.
Definitions
In this tutorial, we will denote a sample as
Redundant if all the views have enough information to classify it correctly without collaboration,
Complementary if only some of the views have enough information to classify it correctly without collaboration; these samples are useful to assess the ability to extract the relevant information among the views.
Part of the Mutual Error if none of the views has enough information to classify it correctly without collaboration. A multiview classifier able to classify these samples is apt to extract information from several features in different views and combine it to classify the samples.
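To make these three definitions concrete, here is a minimal sketch (not part of the generator's API, just an illustration) that assigns a type to a sample from a hypothetical boolean vector telling, for each view, whether that view alone classifies the sample correctly:

```python
def sample_type(view_correct):
    """Assign a sample type from per-view standalone correctness flags."""
    if all(view_correct):
        return "redundant"       # every view succeeds on its own
    if not any(view_correct):
        return "mutual error"    # no view succeeds on its own
    return "complementary"       # only some of the views succeed


print(sample_type([True, True, True]))   # redundant
print(sample_type([False, False, False]))  # mutual error
print(sample_type([True, False, True]))  # complementary
```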
Hands-on experience: initialization
We will initialize the arguments as earlier :
[3]:
from multiview_generator.gaussian_classes import MultiViewGaussianSubProblemsGenerator
from tabulate import tabulate
import numpy as np
import os
random_state = np.random.RandomState(42)
name = "tuto"
n_views = 4
n_classes = 3
error_matrix = [
    [0.4, 0.4, 0.4, 0.4],
    [0.55, 0.4, 0.4, 0.4],
    [0.4, 0.5, 0.52, 0.55]
]
n_samples = 2000
n_features = 3
class_weights = [0.333, 0.333, 0.333]
To control the three previously introduced characteristics, we have to provide three floats:
[4]:
complementarity = 0.3
redundancy = 0.2
mutual_error = 0.1
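Assuming these floats are the fractions of the dataset devoted to each sample type, with the remaining samples being the filler ones (a plausible reading; check the MAGE documentation for the exact semantics), we can sketch the expected counts for 2000 samples:

```python
n_samples = 2000
proportions = {"redundancy": 0.2, "complementarity": 0.3, "mutual_error": 0.1}

# expected number of samples of each type, the rest filling the dataset
expected = {kind: int(round(frac * n_samples)) for kind, frac in proportions.items()}
expected["filler"] = n_samples - sum(expected.values())
print(expected)
# {'redundancy': 400, 'complementarity': 600, 'mutual_error': 200, 'filler': 800}
```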
Now we can generate the dataset with the given configuration.
[5]:
generator = MultiViewGaussianSubProblemsGenerator(
    name=name, n_views=n_views, n_classes=n_classes,
    n_samples=n_samples, n_features=n_features,
    class_weights=class_weights, error_matrix=error_matrix,
    random_state=random_state, redundancy=redundancy,
    complementarity=complementarity, mutual_error=mutual_error)
dataset, y = generator.generate_multi_view_dataset()
Here, the generator distinguishes four types of samples: the three previously introduced and the ones that were used to fill the dataset.
Dataset analysis using SuMMIT
In order to differentiate them, we use generator.sample_ids. This attribute holds an array with the ids of all the generated samples, characterizing their type:
[6]:
generator.sample_ids[:10]
[6]:
['0_l_0_m-0_0.37-1_0.04-2_0.27-3_0.81',
'1_l_0_m-0_0.48-1_1.28-2_0.28-3_0.55',
'2_l_0_m-0_0.96-1_0.32-2_0.08-3_0.56',
'3_l_0_m-0_2.49-1_0.18-2_0.97-3_0.35',
'4_l_0_m-0_0.11-1_0.92-2_0.21-3_0.4',
'5_l_0_m-0_0.84-1_0.43-2_0.48-3_1.17',
'6_l_0_m-0_0.84-1_1.41-2_0.13-3_0.46',
'7_l_0_m-0_0.14-1_0.64-2_0.62-3_0.4',
'8_l_0_m-0_0.04-1_0.31-2_0.63-3_0.21',
'9_l_0_m-0_0.86-1_1.18-2_0.09-3_0.35']
Here, we printed the first 10 ids, and we have:
the redundant samples tagged
_r-
, the mutual error ones tagged
_m-
, the complementary ones tagged
_c-
, and the remaining filler samples, which carry none of these tags.
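As a small sketch built only on these string tags, we can count how many ids carry each tag (shown here on two of the ids printed above; with the full generator.sample_ids array this gives the type distribution of the whole dataset):

```python
from collections import Counter


def sample_tag(sample_id):
    # look for the type tags in the id; filler samples carry none of them
    for tag in ("_r-", "_m-", "_c-"):
        if tag in sample_id:
            return tag
    return "no tag"


sample_ids = ['0_l_0_m-0_0.37-1_0.04-2_0.27-3_0.81',
              '1_l_0_m-0_0.48-1_1.28-2_0.28-3_0.55']
print(Counter(sample_tag(sid) for sid in sample_ids))
```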
To get a visualization of these properties, we will use SuMMIT with decision trees on each view.
[7]:
from summit.execute import execute

# folder holding the supplementary material (its value is echoed below)
supp_dir = "_static/supplementary_material"
print(supp_dir)
generator.to_hdf5_mc(supp_dir)
execute(config_path=os.path.join(supp_dir, "config_summit.yml"))
_static/supplementary_material
selected labels [0 0 0 ... 2 2 2]
self.dataset ['label_1' 'label_2' 'label_3']
To extract the results, we need a small script that fetches the right folder:
[8]:
def get_iframe_path(filename):
    # detect whether we are running inside a Sphinx build (environment variable)
    if os.environ.get("SPHINX_BUILD") == "1":
        # path under _static/tuto_latest during the documentation build
        return f"_static/tuto_latest/{filename}"
    else:
        # direct path into the dynamic folder when running the notebook interactively
        base_path = os.path.join('supplementary_material', 'tuto')
        latest_dir = fetch_latest_dir(os.listdir(base_path))
        return os.path.join(base_path, latest_dir, filename)
[9]:
from datetime import datetime
from IPython.display import display
from IPython.display import IFrame
def fetch_latest_dir(experiment_directories, latest_date=datetime(1560, 12, 25, 12, 12)):
    # keep the experiment directory whose name encodes the most recent date
    for experiment_directory in experiment_directories:
        experiment_time = experiment_directory.split("-")[0].split("_")[1:]
        experiment_time += experiment_directory.split('-')[1].split("_")[:2]
        experiment_time = map(int, experiment_time)
        dt = datetime(*experiment_time)
        if dt > latest_date:
            latest_date = dt
            latest_experiment_dir = experiment_directory
    return latest_experiment_dir
experiment_directory = fetch_latest_dir(os.listdir(os.path.join(supp_dir, 'tuto')))
error_fig_path = os.path.join(supp_dir, 'tuto', experiment_directory, "error_analysis_2D.html")
if os.path.exists(error_fig_path):
    iframe_path = get_iframe_path("error_analysis_2D.html")
    display(IFrame(src=iframe_path, width=900, height=500))
else:
    print(f"File not found: {error_fig_path}")
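As an aside, fetch_latest_dir above assumes directory names of the form name_YYYY_MM_DD-HH_MM_SS (the exact convention comes from the experiment tool; the example name below is made up). The date extraction it performs boils down to:

```python
from datetime import datetime


def parse_experiment_date(directory):
    # 'name_YYYY_MM_DD' before the first '-', 'HH_MM...' after it
    date_part = directory.split("-")[0].split("_")[1:]  # ['YYYY', 'MM', 'DD']
    time_part = directory.split("-")[1].split("_")[:2]  # ['HH', 'MM']
    return datetime(*map(int, date_part + time_part))


print(parse_experiment_date("tuto_2024_03_01-14_05_30"))  # 2024-03-01 14:05:00
```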
This graph represents the failure of each classifier on each sample: a black rectangle on row i, column j means that classifier j always failed to classify sample i. By zooming in, we can focus on a few samples and see that the sample types are well defined, as the mutual error ones are systematically misclassified by the decision trees, the redundant ones are well classified, and the complementary ones are classified correctly only by a subset of the views.
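To illustrate what this 2D error plot encodes, here is a toy reconstruction of such a failure matrix, with made-up labels and predictions rather than SuMMIT's actual data:

```python
import numpy as np

y = np.array([0, 1, 2, 0])           # true labels of 4 samples
preds = np.array([[0, 1, 2, 1],      # predictions of classifier 0
                  [0, 0, 2, 0]])     # predictions of classifier 1

# failure[i, j] == 1 when classifier j misclassifies sample i
failure = (preds != y).astype(int).T
print(failure)
```

A sample whose row is all ones would be part of the mutual error; an all-zero row is a redundant sample; a mixed row is a complementary one.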
[10]:
fig_path = os.path.join(supp_dir, 'tuto', experiment_directory, r'tuto-mean_on_5_iter-accuracy_score*-class.html')
if os.path.exists(fig_path):
    iframe_path = get_iframe_path("tuto-mean_on_5_iter-accuracy_score*-class.html")
    display(IFrame(src=iframe_path, width=900, height=500))
else:
    print(f"File not found: {fig_path}")