Multiview Dataset Generator Demo
Once you have installed MAGE, you are able to run it with this notebook.
[1]:
from multiview_generator.gaussian_classes import MultiViewGaussianSubProblemsGenerator
from tabulate import tabulate
import numpy as np
random_state = np.random.RandomState(42)
Basic configuration
Let us suppose that you want to build a multiview dataset with 4 views and 3 classes :
[2]:
name = "demo"
n_views = 4
n_classes = 3
In order to configure the dataset, you have to provide the error matrix, which gives the expected error of the Bayes classifier for Class i on View j as the value in row i, column j :
[3]:
error_matrix = [
[0.30, 0.32, 0.38, 0.30],
[0.35, 0.28, 0.20, 0.15],
[0.25, 0.29, 0.15, 0.21]
]
print(tabulate(error_matrix, tablefmt="grid"))
+------+------+------+------+
| 0.3 | 0.32 | 0.38 | 0.3 |
+------+------+------+------+
| 0.35 | 0.28 | 0.2 | 0.15 |
+------+------+------+------+
| 0.25 | 0.29 | 0.15 | 0.21 |
+------+------+------+------+
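Because the row/column convention (classes as rows, views as columns) is easy to invert, a quick sanity check before building the generator may help. The following is only a minimal sketch and not part of MAGE: `check_error_matrix` is a hypothetical helper, and whether MAGE performs a similar validation itself is not shown in this demo.

```python
def check_error_matrix(error_matrix, n_classes, n_views):
    """Return True if the matrix has one row per class, one column per view
    and only error rates in [0, 1]; raise ValueError otherwise."""
    if len(error_matrix) != n_classes:
        raise ValueError("expected {} rows (one per class), got {}".format(
            n_classes, len(error_matrix)))
    for class_index, row in enumerate(error_matrix):
        if len(row) != n_views:
            raise ValueError("row {} has {} columns, expected {} (one per view)".format(
                class_index, len(row), n_views))
        for view_index, error in enumerate(row):
            if not 0.0 <= error <= 1.0:
                raise ValueError("error for class {} on view {} is not in [0, 1]".format(
                    class_index, view_index))
    return True


# Same values as the error matrix of this demo
error_matrix = [
    [0.30, 0.32, 0.38, 0.30],
    [0.35, 0.28, 0.20, 0.15],
    [0.25, 0.29, 0.15, 0.21],
]
print(check_error_matrix(error_matrix, n_classes=3, n_views=4))
```

Passing the transposed matrix (4 rows of 3 values) raises a `ValueError`, which is the mistake this check is meant to catch.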
Once this has been defined, you can set all the other parameters of the dataset :
the number of samples,
the number of features of each view,
the proportion of samples in each class.
[4]:
n_samples = 2000
n_features = 3
class_weights = [0.333, 0.333, 0.333,]
Generate the dataset
With the basic configuration done, we can generate the dataset :
[5]:
generator = MultiViewGaussianSubProblemsGenerator(name=name, n_views=n_views,
                                                  n_classes=n_classes,
                                                  n_samples=n_samples,
                                                  n_features=n_features,
                                                  class_weights=class_weights,
                                                  error_matrix=error_matrix,
                                                  random_state=random_state)
dataset, y = generator.generate_multi_view_dataset()
for view_index, view_data in enumerate(dataset):
    print("View {} of shape {}".format(view_index+1, view_data.shape))
View 1 of shape (1998, 3)
View 2 of shape (1998, 3)
View 3 of shape (1998, 3)
View 4 of shape (1998, 3)
Here, we see that each view has 1998 samples instead of the 2000 requested: as the classes are supposed to be balanced, with a weight of 0.333 each, every class gets 666 samples, and 3 × 666 = 1998.
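The sample count of 1998 can be reproduced with a little arithmetic. The sketch below assumes that each class receives its weight times `n_samples`, rounded to an integer; the exact rounding rule used inside MAGE is an assumption here, not something this demo shows.

```python
n_samples = 2000
class_weights = [0.333, 0.333, 0.333]

# Assumption: the number of samples of a class is weight * n_samples,
# rounded to the nearest integer (MAGE's actual rule may differ).
samples_per_class = [round(weight * n_samples) for weight in class_weights]
total_samples = sum(samples_per_class)

print("Samples per class :", samples_per_class)
print("Total             :", total_samples)
```

Under this assumption, the weights would have to sum exactly to 1 with counts that are whole numbers (for instance 2001 samples and three equal classes of 667) for the requested and generated sizes to match.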
Get a description of it
Now, if you wish to get information about the generated dataset, run :
[6]:
description = generator.gen_report(save=False)
This will generate a markdown report on the dataset. Here, we used save=False
so the description is not saved in a file.
To print it in this notebook, we use :
[7]:
from IPython.display import display,Markdown
display(Markdown(description))
Generated dataset description
The dataset named demo
has been generated by MAGE and is comprised of
1998 samples, splitted in
3 classes, described by
4 views.
The input error matrix is
| | View 1 | View 2 | View 3 | View 4 |
|---|---|---|---|---|
| Class 1 | 0.3 | 0.32 | 0.38 | 0.3 |
| Class 2 | 0.35 | 0.28 | 0.2 | 0.15 |
| Class 3 | 0.25 | 0.29 | 0.15 | 0.21 |
The classes are balanced as :
Class 1 : 666 samples (33% of the dataset)
Class 2 : 666 samples (33% of the dataset)
Class 3 : 666 samples (33% of the dataset)
The views have
64.56% redundancy,
1.0% mutual error and
34.53% complementarity with a level of [[3] [3] [3]].
Views description
View 1
This view is generated with StumpsGenerator, with the following configuration :
class_sep: 1.0
n_clusters_per_class: 1
n_features: 3
This view has 3 features, among which 2 are relevant for classification (they are the 2 first columns of the view) the other are filled with uniform noise.
Its empirical bayesian classifier is a decision stump
View 2
This view is generated with StumpsGenerator, with the following configuration :
class_sep: 1.0
n_clusters_per_class: 1
n_features: 3
This view has 3 features, among which 2 are relevant for classification (they are the 2 first columns of the view) the other are filled with uniform noise.
Its empirical bayesian classifier is a decision stump
View 3
This view is generated with StumpsGenerator, with the following configuration :
class_sep: 1.0
n_clusters_per_class: 1
n_features: 3
This view has 3 features, among which 2 are relevant for classification (they are the 2 first columns of the view) the other are filled with uniform noise.
Its empirical bayesian classifier is a decision stump
View 4
This view is generated with StumpsGenerator, with the following configuration :
class_sep: 1.0
n_clusters_per_class: 1
n_features: 3
This view has 3 features, among which 2 are relevant for classification (they are the 2 first columns of the view) the other are filled with uniform noise.
Its empirical bayesian classifier is a decision stump
Statistical analysis
Bayes error matrix :
| | Class 1 | Class 2 | Class 3 |
|---|---|---|---|
| View 1 | 0.328829 | 0.334835 | 0.25976 |
| View 2 | 0.33033 | 0.282282 | 0.283784 |
| View 3 | 0.369369 | 0.198198 | 0.126126 |
| View 4 | 0.310811 | 0.141141 | 0.189189 |
The error, as computed by the ‘empirical bayes’ classifier of each view :
| | Class 1 | Class 2 | Class 3 |
|---|---|---|---|
| View 1 | 0.304805 | 0.297297 | 0.363363 |
| View 2 | 0.325826 | 0.280781 | 0.219219 |
| View 3 | 0.381381 | 0.160661 | 0.0975976 |
| View 4 | 0.279279 | 0.148649 | 0.171171 |
This report has been automatically generated on July 28, 2025 at 20:44:51
But if you just want to save it, you can use the following, where supp_dir is the path to the output directory :
[9]:
generator.gen_report(output_path=supp_dir, save=True)
[9]:
"# Generated dataset description\n\nThe dataset named `demo` has been generated by [MAGE](https://gitlab.lis-lab.fr/dev/multiview_generator) and is comprised of \n\n* 1998 samples, splitted in \n* 3 classes, described by \n* 4 views.\n\nThe input error matrix is \n \n| | View 1 | View 2 | View 3 | View 4 |\n|---------|----------|----------|----------|----------|\n| Class 1 | 0.3 | 0.32 | 0.38 | 0.3 |\n| Class 2 | 0.35 | 0.28 | 0.2 | 0.15 |\n| Class 3 | 0.25 | 0.29 | 0.15 | 0.21 |\n\n The classes are balanced as : \n\n* Class 1 : 666 samples (33% of the dataset)\n* Class 2 : 666 samples (33% of the dataset)\n* Class 3 : 666 samples (33% of the dataset)\n\n The views have \n\n* 64.56% redundancy, \n* 1.0% mutual error and \n* 34.53% complementarity with a level of [[3]\n [3]\n [3]].\n\n## Views description\n\n### View 1\n\nThis view is generated with StumpsGenerator, with the following configuration : \n```yaml\nclass_sep: 1.0\nn_clusters_per_class: 1\nn_features: 3\n```\n\nThis view has 3 features, among which 2 are relevant for classification (they are the 2 first columns of the view) the other are filled with uniform noise.\n\n Its empirical bayesian classifier is a decision stump\n\n### View 2\n\nThis view is generated with StumpsGenerator, with the following configuration : \n```yaml\nclass_sep: 1.0\nn_clusters_per_class: 1\nn_features: 3\n```\n\nThis view has 3 features, among which 2 are relevant for classification (they are the 2 first columns of the view) the other are filled with uniform noise.\n\n Its empirical bayesian classifier is a decision stump\n\n### View 3\n\nThis view is generated with StumpsGenerator, with the following configuration : \n```yaml\nclass_sep: 1.0\nn_clusters_per_class: 1\nn_features: 3\n```\n\nThis view has 3 features, among which 2 are relevant for classification (they are the 2 first columns of the view) the other are filled with uniform noise.\n\n Its empirical bayesian classifier is a decision stump\n\n### View 4\n\nThis view 
is generated with StumpsGenerator, with the following configuration : \n```yaml\nclass_sep: 1.0\nn_clusters_per_class: 1\nn_features: 3\n```\n\nThis view has 3 features, among which 2 are relevant for classification (they are the 2 first columns of the view) the other are filled with uniform noise.\n\n Its empirical bayesian classifier is a decision stump\n\n## Statistical analysis\n\nBayes error matrix : \n\n| | Class 1 | Class 2 | Class 3 |\n|--------|-----------|-----------|-----------|\n| View 1 | 0.328829 | 0.334835 | 0.25976 |\n| View 2 | 0.33033 | 0.282282 | 0.283784 |\n| View 3 | 0.369369 | 0.198198 | 0.126126 |\n| View 4 | 0.310811 | 0.141141 | 0.189189 |\n\n The error, as computed by the 'empirical bayes' classifier of each view : \n\n| | Class 1 | Class 2 | Class 3 |\n|--------|-----------|-----------|-----------|\n| View 1 | 0.304805 | 0.297297 | 0.363363 |\n| View 2 | 0.325826 | 0.280781 | 0.219219 |\n| View 3 | 0.381381 | 0.160661 | 0.0975976 |\n| View 4 | 0.279279 | 0.148649 | 0.171171 |\n\nThis report has been automatically generated on July 28, 2025 at 20:44:52"
This will save the description in the directory given as output_path (here, supp_dir), in a file called demo.md,
as the name of the dataset is “demo”.
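To read a saved report back, it is enough to rebuild its path from the output directory and the dataset name. The sketch below is not part of MAGE: `report_path` is a hypothetical helper, the naming rule `<dataset name>.md` is taken from the sentence above, and a temporary directory with a dummy file stands in for the real output directory so that the example runs on its own.

```python
import os
import tempfile


def report_path(output_dir, dataset_name):
    # Assumption: the report is saved as "<dataset name>.md" in output_dir
    return os.path.join(output_dir, dataset_name + ".md")


# A temporary directory stands in for the directory passed as output_path
with tempfile.TemporaryDirectory() as output_dir:
    path = report_path(output_dir, "demo")
    # Simulate a saved report so that the sketch runs without MAGE
    with open(path, "w", encoding="utf-8") as report_file:
        report_file.write("# Generated dataset description\n")
    # Read the first line of the report back
    with open(path, encoding="utf-8") as report_file:
        first_line = report_file.readline().strip()

print(first_line)
```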
Save the dataset in an HDF5 file
Moreover, it is possible to save the dataset in an HDF5 file, compatible with SuMMIT, with :
[10]:
generator.to_hdf5_mc(saving_path=supp_dir)
Visualizing the dataset with plotly
Here, we purposely used only 3 features per view, so the generated dataset is easily plottable in 3D.
Let us plot each view :
[11]:
import os
import plotly
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from plotly.colors import DEFAULT_PLOTLY_COLORS
from IPython.display import display
from IPython.display import IFrame
fig = make_subplots(rows=2, cols=2,
                    subplot_titles=["View {}".format(view_index)
                                    for view_index in range(n_views)],
                    specs=[[{'type': 'scatter3d'}, {'type': 'scatter3d'}, ],
                           [{'type': 'scatter3d'},
                            {'type': 'scatter3d'}, ]])
row = 1
col = 1
show_legend = True
# Plot the data for each view and each label
for view_index in range(n_views):
    for lab_index in range(n_classes):
        concerned_examples = np.where(generator.y == lab_index)[0]
        fig.add_trace(
            go.Scatter3d(
                x=generator.dataset[view_index][concerned_examples, 0],
                y=generator.dataset[view_index][concerned_examples, 1],
                z=generator.dataset[view_index][concerned_examples, 2],
                text=[generator.sample_ids[ind] for ind in concerned_examples],
                hoverinfo='text',
                legendgroup="Class {}".format(lab_index),
                mode='markers', marker=dict(size=1,
                                            color=DEFAULT_PLOTLY_COLORS[lab_index],
                                            opacity=0.8),
                name="Class {}".format(lab_index),
                showlegend=show_legend),
            row=row, col=col)
    # Only the traces of the first view appear in the legend
    show_legend = False
    # Move to the next cell of the 2x2 grid
    col += 1
    if col == 3:
        col = 1
        row += 1
fig_path = os.path.join(supp_dir, "fig.html")
plotly.offline.plot(fig, filename=fig_path, auto_open=False)
IFrame(src=fig_path, width=500, height=500)
[11]:
The figure shows the dataset with a 3D subplot for each view. It is possible to hide the samples of a specific class by clicking on its label in the legend. The sub-problems are of dimension 3 (3 features). However, only 2 features are needed to separate 3 classes, so the first two dimensions (x and y in the plots) are “relevant”, while the third is filled with noise.
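The row and column bookkeeping of the plotting loop (increment col, wrap to the next row after the second column) can also be expressed with a single `divmod`. This is only a sketch of an alternative: `grid_position` is a hypothetical helper, not something provided by plotly or MAGE.

```python
def grid_position(view_index, n_cols=2):
    """Map a 0-based view index to the 1-based (row, col) of a subplot grid."""
    row, col = divmod(view_index, n_cols)
    # plotly subplot rows and columns start at 1
    return row + 1, col + 1


# Positions of the 4 views of this demo in the 2x2 grid
positions = [grid_position(view_index) for view_index in range(4)]
print(positions)
```

With such a helper, the same loop keeps working if the number of views or columns changes, as long as the grid passed to make_subplots has enough rows.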
Getting the outputted error matrix
In order to measure the outputted error matrix, as the views have been generated with make_classification, the DecisionTree is a good approximation of the Bayes classifier.
In order to estimate the test error in the dataset for each class with a Decision Tree, we use a StratifiedKFold :
[12]:
from sklearn.model_selection import StratifiedKFold
n_folds = 5
folds_generator = StratifiedKFold(n_folds, random_state=random_state,
                                  shuffle=True)
# Splitting the array containing the indices of the samples
folds = folds_generator.split(np.arange(generator.y.shape[0]), generator.y)
# Getting the lists of train and test sample indices for each fold.
folds = [[list(train), list(test)] for train, test in folds]
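To make the idea of stratification concrete, here is a small pure-Python sketch in which every test fold keeps the class proportions of the whole dataset. It is not scikit-learn's implementation: `stratified_folds` is a hypothetical function that simply shuffles the indices of each class and deals them out to the folds in turn, and the toy labels are made up for the illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, n_folds, seed=42):
    """Return a list of [train_indices, test_indices], one pair per fold."""
    rng = random.Random(seed)
    # Group the sample indices by class
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)
    # Deal the shuffled indices of each class to the folds in turn
    test_folds = [[] for _ in range(n_folds)]
    for class_indices in indices_by_class.values():
        rng.shuffle(class_indices)
        for position, index in enumerate(class_indices):
            test_folds[position % n_folds].append(index)
    # The training set of a fold is everything outside its test set
    all_indices = set(range(len(labels)))
    return [[sorted(all_indices - set(test)), sorted(test)]
            for test in test_folds]


# Toy labels: 3 balanced classes of 10 samples each
labels = [0] * 10 + [1] * 10 + [2] * 10
toy_folds = stratified_folds(labels, n_folds=5)
for train, test in toy_folds:
    print(len(train), len(test), dict(Counter(labels[i] for i in test)))
```

Each of the 5 test folds contains 2 samples of every class, which is what allows a per-class error to be estimated on every fold.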
Then, we get a Decision Tree (with a maximum depth of 10 in the code below), and fit it on each view, for each fold. The outputted score is the cross-validation score on the 5 folds.
[13]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
dt = DecisionTreeClassifier(max_depth=10)
confusion_mat = np.zeros((n_folds, n_views, n_classes, n_classes))
n_sample_per_class = np.zeros((n_views, n_classes, n_folds))
# For each view
for view_index in range(n_views):
    # For each fold
    for fold_index, [train, test] in enumerate(folds):
        # Fit the decision tree on the training set
        dt.fit(generator.dataset[view_index][train, :], generator.y[train])
        # Predict on the testing set
        pred = dt.predict(generator.dataset[view_index][test, :])
        # Get the confusion matrix
        confusion_mat[fold_index, view_index, :, :] = confusion_matrix(generator.y[test], pred)
        # Count the test samples of each class in this fold
        for class_index in range(n_classes):
            n_sample_per_class[view_index, class_index, fold_index] = np.where(generator.y[test]==class_index)[0].shape[0]
# Average over the folds
confusion_mat = np.mean(confusion_mat, axis=0)
n_sample_per_class = np.mean(n_sample_per_class, axis=2)
output = np.zeros((n_classes, n_views))
# Get the error of each class from the confusion matrix
for class_index in range(n_classes):
    for view_index in range(n_views):
        output[class_index, view_index] = 1-confusion_mat[view_index, class_index, class_index]/n_sample_per_class[view_index, class_index]
print("Input error matrix : \n{}\n\nOutputted error matrix : \n{}\n\nDifference :\n{}".format(
    tabulate(error_matrix, tablefmt='grid'),
    tabulate(output, tablefmt='grid'),
    tabulate(error_matrix-output, tablefmt='grid')))
Input error matrix :
+------+------+------+------+
| 0.3 | 0.32 | 0.38 | 0.3 |
+------+------+------+------+
| 0.35 | 0.28 | 0.2 | 0.15 |
+------+------+------+------+
| 0.25 | 0.29 | 0.15 | 0.21 |
+------+------+------+------+
Outputted error matrix :
+----------+----------+----------+----------+
| 0.387387 | 0.393393 | 0.381381 | 0.324324 |
+----------+----------+----------+----------+
| 0.294294 | 0.277778 | 0.159159 | 0.147147 |
+----------+----------+----------+----------+
| 0.237237 | 0.258258 | 0.160661 | 0.205706 |
+----------+----------+----------+----------+
Difference :
+------------+-------------+-------------+-------------+
| -0.0873874 | -0.0733934 | -0.00138138 | -0.0243243 |
+------------+-------------+-------------+-------------+
| 0.0557057 | 0.00222222 | 0.0408408 | 0.00285285 |
+------------+-------------+-------------+-------------+
| 0.0127628 | 0.0317417 | -0.0106607 | 0.00429429 |
+------------+-------------+-------------+-------------+
Here, we can see that the output error matrix differs slightly from the input one: the largest absolute difference is about 0.09 (Class 1 on View 1).
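The per-class error used in this analysis is one minus the fraction of samples of a class that are correctly predicted, that is, one minus the diagonal entry of the confusion matrix divided by the sum of its row (with scikit-learn's convention, rows are true labels and columns are predictions). The sketch below illustrates this on a made-up confusion matrix; `per_class_error` is a hypothetical helper and the counts are not taken from the generated dataset.

```python
def per_class_error(confusion):
    """Return, for each class, 1 - (correct predictions / samples of the class).

    confusion[i][j] is the number of samples of true class i predicted as j.
    """
    errors = []
    for class_index, row in enumerate(confusion):
        n_class_samples = sum(row)
        errors.append(1 - row[class_index] / n_class_samples)
    return errors


# Made-up confusion matrix for 3 classes of 100 samples each
toy_confusion = [
    [70, 20, 10],
    [15, 80, 5],
    [5, 20, 75],
]
toy_errors = per_class_error(toy_confusion)
print(["{:.2f}".format(error) for error in toy_errors])
```

In the notebook, the same ratio is computed per view after averaging the confusion matrices and the class counts over the 5 folds.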
Conclusion
In this demo, we used MAGE to generate a basic multiview dataset, and we performed a naive analysis on it. The next tutorial will be focused on introducing redundancy, mutual error and complementarity.