MAGE tutorial: the sample types
In this tutorial, we will learn how to generate a multiview dataset presenting:
redundancy,
complementarity and
mutual error.
Definitions
In this tutorial, we will denote a sample as
Redundant if all the views have enough information to classify it correctly without collaboration,
Complementary if only some of the views have enough information to classify it correctly without collaboration; these samples are useful to assess the ability to extract the relevant information among the views.
Part of the Mutual Error if none of the views has enough information to classify it correctly without collaboration. A multiview classifier able to classify these samples is apt to extract information from several features in different views and combine it to classify the samples.
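To make these three definitions concrete, here is a minimal sketch (not part of the generator's API, just an illustration) that assigns a type to a sample from a hypothetical boolean vector telling, for each view, whether that view alone classifies the sample correctly:

```python
def sample_type(view_correct):
    """Assign a sample type from per-view standalone correctness flags."""
    if all(view_correct):
        return "redundant"       # every view succeeds on its own
    if not any(view_correct):
        return "mutual error"    # no view succeeds on its own
    return "complementary"       # only some of the views succeed


print(sample_type([True, True, True]))   # redundant
print(sample_type([False, False, False]))  # mutual error
print(sample_type([True, False, True]))  # complementary
```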
Hands-on experience: initialization
We will initialize the arguments as earlier :
[3]:
from multiview_generator.gaussian_classes import MultiViewGaussianSubProblemsGenerator
from tabulate import tabulate
import numpy as np
import os
random_state = np.random.RandomState(42)
name = "tuto"
n_views = 4
n_classes = 3
error_matrix = [
    [0.4, 0.4, 0.4, 0.4],
    [0.55, 0.4, 0.4, 0.4],
    [0.4, 0.5, 0.52, 0.55]
]
n_samples = 2000
n_features = 3
class_weights = [0.333, 0.333, 0.333]
To control the three previously introduced characteristics, we have to provide three floats:
[4]:
complementarity = 0.3
redundancy = 0.2
mutual_error = 0.1
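Assuming these floats are the fractions of the dataset devoted to each sample type, with the remaining samples being the filler ones (a plausible reading; check the MAGE documentation for the exact semantics), we can sketch the expected counts for 2000 samples:

```python
n_samples = 2000
proportions = {"redundancy": 0.2, "complementarity": 0.3, "mutual_error": 0.1}

# expected number of samples of each type, the rest filling the dataset
expected = {kind: int(round(frac * n_samples)) for kind, frac in proportions.items()}
expected["filler"] = n_samples - sum(expected.values())
print(expected)
# {'redundancy': 400, 'complementarity': 600, 'mutual_error': 200, 'filler': 800}
```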
Now we can generate the dataset with the given configuration.
[5]:
generator = MultiViewGaussianSubProblemsGenerator(
    name=name, n_views=n_views, n_classes=n_classes,
    n_samples=n_samples, n_features=n_features,
    class_weights=class_weights, error_matrix=error_matrix,
    random_state=random_state, redundancy=redundancy,
    complementarity=complementarity, mutual_error=mutual_error)
dataset, y = generator.generate_multi_view_dataset()
Here, the generator distinguishes four types of samples: the three previously introduced and the ones that were used to fill the dataset.
Dataset analysis using SuMMIT
In order to differentiate them, we use generator.sample_ids. This attribute holds an array with the ids of all the generated samples, characterizing their type:
[6]:
generator.sample_ids[:10]
[6]:
['0_l_0_m-0_0.37-1_0.04-2_0.27-3_0.81',
'1_l_0_m-0_0.48-1_1.28-2_0.28-3_0.55',
'2_l_0_m-0_0.96-1_0.32-2_0.08-3_0.56',
'3_l_0_m-0_2.49-1_0.18-2_0.97-3_0.35',
'4_l_0_m-0_0.11-1_0.92-2_0.21-3_0.4',
'5_l_0_m-0_0.84-1_0.43-2_0.48-3_1.17',
'6_l_0_m-0_0.84-1_1.41-2_0.13-3_0.46',
'7_l_0_m-0_0.14-1_0.64-2_0.62-3_0.4',
'8_l_0_m-0_0.04-1_0.31-2_0.63-3_0.21',
'9_l_0_m-0_0.86-1_1.18-2_0.09-3_0.35']
Here, we printed the first 10 ids, and we have:
the redundant samples tagged
_r-
, the mutual error ones tagged
_m-
, the complementary ones tagged
_c-
, and the remaining filler samples, which carry none of these tags.
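As a small sketch built only on these string tags, we can count how many ids carry each tag (shown here on two of the ids printed above; with the full generator.sample_ids array this gives the type distribution of the whole dataset):

```python
from collections import Counter


def sample_tag(sample_id):
    # look for the type tags in the id; filler samples carry none of them
    for tag in ("_r-", "_m-", "_c-"):
        if tag in sample_id:
            return tag
    return "no tag"


sample_ids = ['0_l_0_m-0_0.37-1_0.04-2_0.27-3_0.81',
              '1_l_0_m-0_0.48-1_1.28-2_0.28-3_0.55']
print(Counter(sample_tag(sid) for sid in sample_ids))
```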
To get a visualization of these properties, we will use SuMMIT with decision trees on each view.
[7]:
from summit.execute import execute

# folder holding the supplementary material (its value is echoed below)
supp_dir = "_static/supplementary_material"
print(supp_dir)
generator.to_hdf5_mc(supp_dir)
execute(config_path=os.path.join(supp_dir, "config_summit.yml"))
_static/supplementary_material
selected labels [0 0 0 ... 2 2 2]
self.dataset ['label_1' 'label_2' 'label_3']
To extract the results, we need a small script that fetches the right folder:
[8]:
def get_iframe_path(filename):
    # detect whether we are running inside a Sphinx build (environment variable)
    if os.environ.get("SPHINX_BUILD") == "1":
        # path under _static/tuto_latest during the documentation build
        return f"_static/tuto_latest/{filename}"
    else:
        # direct path into the dynamic folder when running the notebook interactively
        base_path = os.path.join('supplementary_material', 'tuto')
        latest_dir = fetch_latest_dir(os.listdir(base_path))
        return os.path.join(base_path, latest_dir, filename)
[9]:
from datetime import datetime
from IPython.display import display
from IPython.display import IFrame
def fetch_latest_dir(experiment_directories, latest_date=datetime(1560, 12, 25, 12, 12)):
    # keep the experiment directory whose name encodes the most recent date
    for experiment_directory in experiment_directories:
        experiment_time = experiment_directory.split("-")[0].split("_")[1:]
        experiment_time += experiment_directory.split('-')[1].split("_")[:2]
        experiment_time = map(int, experiment_time)
        dt = datetime(*experiment_time)
        if dt > latest_date:
            latest_date = dt
            latest_experiment_dir = experiment_directory
    return latest_experiment_dir
experiment_directory = fetch_latest_dir(os.listdir(os.path.join(supp_dir, 'tuto')))
error_fig_path = os.path.join(supp_dir, 'tuto', experiment_directory, "error_analysis_2D.html")
if os.path.exists(error_fig_path):
    iframe_path = get_iframe_path("error_analysis_2D.html")
    display(IFrame(src=iframe_path, width=900, height=500))
else:
    print(f"File not found: {error_fig_path}")
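As an aside, fetch_latest_dir above assumes directory names of the form name_YYYY_MM_DD-HH_MM_SS (the exact convention comes from the experiment tool; the example name below is made up). The date extraction it performs boils down to:

```python
from datetime import datetime


def parse_experiment_date(directory):
    # 'name_YYYY_MM_DD' before the first '-', 'HH_MM...' after it
    date_part = directory.split("-")[0].split("_")[1:]  # ['YYYY', 'MM', 'DD']
    time_part = directory.split("-")[1].split("_")[:2]  # ['HH', 'MM']
    return datetime(*map(int, date_part + time_part))


print(parse_experiment_date("tuto_2024_03_01-14_05_30"))  # 2024-03-01 14:05:00
```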
This graph represents the failure of each classifier on each sample: a black rectangle on row i, column j means that classifier j always failed to classify sample i. By zooming in, we can focus on a few samples and see that the sample types are well defined, as the mutual error ones are systematically misclassified by the decision trees, the redundant ones are well classified, and the complementary ones are classified correctly only by a subset of the views.
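To illustrate what this 2D error plot encodes, here is a toy reconstruction of such a failure matrix, with made-up labels and predictions rather than SuMMIT's actual data:

```python
import numpy as np

y = np.array([0, 1, 2, 0])           # true labels of 4 samples
preds = np.array([[0, 1, 2, 1],      # predictions of classifier 0
                  [0, 0, 2, 0]])     # predictions of classifier 1

# failure[i, j] == 1 when classifier j misclassifies sample i
failure = (preds != y).astype(int).T
print(failure)
```

A sample whose row is all ones would be part of the mutual error; an all-zero row is a redundant sample; a mixed row is a complementary one.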
[10]:
fig_path = os.path.join(supp_dir, 'tuto', experiment_directory, r'tuto-mean_on_5_iter-accuracy_score*-class.html')
if os.path.exists(fig_path):
    iframe_path = get_iframe_path("tuto-mean_on_5_iter-accuracy_score*-class.html")
    display(IFrame(src=iframe_path, width=900, height=500))
else:
    print(f"File not found: {fig_path}")