Multiview Dataset Generator Demo

Once you have installed MAGE, you can run it from this notebook.

[1]:

from multiview_generator.gaussian_classes import MultiViewGaussianSubProblemsGenerator
from tabulate import tabulate
import numpy as np

random_state = np.random.RandomState(42)

Basic configuration

Let us suppose that you want to build a multiview dataset with 4 views and 3 classes:

[2]:

name = "demo"
n_views = 4
n_classes = 3

In order to configure the dataset, you have to provide the error matrix, which gives the expected error of the Bayes classifier for Class i on View j as the value in row i, column j:

[3]:

error_matrix = [
   [0.30, 0.32, 0.38, 0.30],
   [0.35, 0.28, 0.20, 0.15],
   [0.25, 0.29, 0.15, 0.21]
]
print(tabulate(error_matrix, tablefmt="grid"))

+------+------+------+------+
| 0.3  | 0.32 | 0.38 | 0.3  |
+------+------+------+------+
| 0.35 | 0.28 | 0.2  | 0.15 |
+------+------+------+------+
| 0.25 | 0.29 | 0.15 | 0.21 |
+------+------+------+------+
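Before passing the matrix to the generator, it can be worth checking its shape and value range. The following is a minimal sketch: `check_error_matrix` is a hypothetical helper, not part of MAGE, and the [0, 1] bound is only the generic range of an error rate (the generator itself may impose tighter constraints).

```python
import numpy as np


def check_error_matrix(error_matrix, n_classes, n_views):
    """Hypothetical helper (not part of MAGE): basic sanity checks."""
    matrix = np.asarray(error_matrix, dtype=float)
    # One row per class, one column per view
    if matrix.shape != (n_classes, n_views):
        raise ValueError(
            "Expected shape {}, got {}".format((n_classes, n_views), matrix.shape))
    # An error rate is a proportion, so it must lie in [0, 1]
    if np.any(matrix < 0) or np.any(matrix > 1):
        raise ValueError("Error rates must lie in [0, 1]")
    return matrix


error_matrix = [
    [0.30, 0.32, 0.38, 0.30],
    [0.35, 0.28, 0.20, 0.15],
    [0.25, 0.29, 0.15, 0.21],
]
checked = check_error_matrix(error_matrix, n_classes=3, n_views=4)
print(checked.shape)
```

Swapping `n_classes` and `n_views` in the call raises a `ValueError`, which catches a transposed matrix early.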

Once this has been defined, you can set all the other parameters of the dataset :

the number of samples,
the number of features of each view,
the proportion of samples in each class.

[4]:

n_samples = 2000
n_features = 3
class_weights = [0.333, 0.333, 0.333,]

Generate the dataset

With the basic configuration done, we can generate the dataset :

[5]:

generator = MultiViewGaussianSubProblemsGenerator(name=name, n_views=n_views,
                                          n_classes=n_classes,
                                          n_samples=n_samples,
                                          n_features=n_features,
                                          class_weights=class_weights,
                                          error_matrix=error_matrix,
                                          random_state=random_state)

dataset, y = generator.generate_multi_view_dataset()

for view_index, view_data in enumerate(dataset):
    print("View {} of shape {}".format(view_index+1, view_data.shape))

View 1 of shape (1998, 3)
View 2 of shape (1998, 3)
View 3 of shape (1998, 3)
View 4 of shape (1998, 3)

Here, we see that each view contains 1998 samples instead of the 2000 requested: with class weights of 0.333, each of the 3 classes receives 666 samples, and 3 × 666 = 1998.
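The sample count can be reproduced by hand. This is a sketch under an assumption: each class receives `n_samples * weight` samples, rounded to an integer. The exact rule used inside MAGE is not shown in this notebook, but this one matches the 666 samples per class listed in the report further down.

```python
n_samples = 2000
class_weights = [0.333, 0.333, 0.333]

# Assumption: each class gets n_samples * weight samples, rounded to an integer
samples_per_class = [int(round(n_samples * weight)) for weight in class_weights]
total = sum(samples_per_class)

print(samples_per_class, total)
```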

Get a description of it

Now, if you wish to get information about the generated dataset, run :

[6]:

description = generator.gen_report(save=False)

This will generate a markdown report on the dataset. Here, we used save=False so the description is not saved in a file.

To print it in this notebook, we use :

[7]:

from IPython.display import display,Markdown
display(Markdown(description))

Generated dataset description

The dataset named demo has been generated by MAGE and is comprised of

1998 samples, splitted in
3 classes, described by
4 views.

The input error matrix is

	View 1	View 2	View 3	View 4
Class 1	0.3	0.32	0.38	0.3
Class 2	0.35	0.28	0.2	0.15
Class 3	0.25	0.29	0.15	0.21

The classes are balanced as :

Class 1 : 666 samples (33% of the dataset)
Class 2 : 666 samples (33% of the dataset)
Class 3 : 666 samples (33% of the dataset)

The views have

64.56% redundancy,
1.0% mutual error and
34.53% complementarity with a level of [[3] [3] [3]].

Views description

View 1

This view is generated with StumpsGenerator, with the following configuration :

class_sep: 1.0
n_clusters_per_class: 1
n_features: 3

This view has 3 features, among which 2 are relevant for classification (they are the 2 first columns of the view) the other are filled with uniform noise.

Its empirical bayesian classifier is a decision stump

View 2

This view is generated with StumpsGenerator, with the following configuration :

class_sep: 1.0
n_clusters_per_class: 1
n_features: 3

This view has 3 features, among which 2 are relevant for classification (they are the 2 first columns of the view) the other are filled with uniform noise.

Its empirical bayesian classifier is a decision stump

View 3

This view is generated with StumpsGenerator, with the following configuration :

class_sep: 1.0
n_clusters_per_class: 1
n_features: 3

This view has 3 features, among which 2 are relevant for classification (they are the 2 first columns of the view) the other are filled with uniform noise.

Its empirical bayesian classifier is a decision stump

View 4

This view is generated with StumpsGenerator, with the following configuration :

class_sep: 1.0
n_clusters_per_class: 1
n_features: 3

This view has 3 features, among which 2 are relevant for classification (they are the 2 first columns of the view) the other are filled with uniform noise.

Its empirical bayesian classifier is a decision stump

Statistical analysis

Bayes error matrix :

	Class 1	Class 2	Class 3
View 1	0.328829	0.334835	0.25976
View 2	0.33033	0.282282	0.283784
View 3	0.369369	0.198198	0.126126
View 4	0.310811	0.141141	0.189189

The error, as computed by the ‘empirical bayes’ classifier of each view :

	Class 1	Class 2	Class 3
View 1	0.304805	0.297297	0.363363
View 2	0.325826	0.280781	0.219219
View 3	0.381381	0.160661	0.0975976
View 4	0.279279	0.148649	0.171171

This report has been automatically generated on July 28, 2025 at 20:44:51

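But if you just want to save it, you can use the following, where supp_dir must point to an existing output directory (it is not defined in the cells shown here):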

[9]:

generator.gen_report(output_path=supp_dir, save=True)

[9]:

"# Generated dataset description\n\nThe dataset named `demo` has been generated by [MAGE](https://gitlab.lis-lab.fr/dev/multiview_generator) and is comprised of \n\n* 1998 samples, splitted in \n* 3 classes, described by \n* 4 views.\n\nThe input error matrix is \n \n|         |   View 1 |   View 2 |   View 3 |   View 4 |\n|---------|----------|----------|----------|----------|\n| Class 1 |     0.3  |     0.32 |     0.38 |     0.3  |\n| Class 2 |     0.35 |     0.28 |     0.2  |     0.15 |\n| Class 3 |     0.25 |     0.29 |     0.15 |     0.21 |\n\n The classes are balanced as : \n\n* Class 1 : 666 samples (33% of the dataset)\n* Class 2 : 666 samples (33% of the dataset)\n* Class 3 : 666 samples (33% of the dataset)\n\n The views have \n\n* 64.56% redundancy, \n* 1.0% mutual error and \n* 34.53% complementarity with a level of [[3]\n [3]\n [3]].\n\n## Views description\n\n### View 1\n\nThis view is generated with StumpsGenerator, with the following configuration : \n```yaml\nclass_sep: 1.0\nn_clusters_per_class: 1\nn_features: 3\n```\n\nThis view has 3 features, among which 2 are relevant for classification (they are the 2 first columns of the view) the other are filled with uniform noise.\n\n Its empirical bayesian classifier is a decision stump\n\n### View 2\n\nThis view is generated with StumpsGenerator, with the following configuration : \n```yaml\nclass_sep: 1.0\nn_clusters_per_class: 1\nn_features: 3\n```\n\nThis view has 3 features, among which 2 are relevant for classification (they are the 2 first columns of the view) the other are filled with uniform noise.\n\n Its empirical bayesian classifier is a decision stump\n\n### View 3\n\nThis view is generated with StumpsGenerator, with the following configuration : \n```yaml\nclass_sep: 1.0\nn_clusters_per_class: 1\nn_features: 3\n```\n\nThis view has 3 features, among which 2 are relevant for classification (they are the 2 first columns of the view) the other are filled with uniform noise.\n\n Its empirical 
bayesian classifier is a decision stump\n\n### View 4\n\nThis view is generated with StumpsGenerator, with the following configuration : \n```yaml\nclass_sep: 1.0\nn_clusters_per_class: 1\nn_features: 3\n```\n\nThis view has 3 features, among which 2 are relevant for classification (they are the 2 first columns of the view) the other are filled with uniform noise.\n\n Its empirical bayesian classifier is a decision stump\n\n## Statistical analysis\n\nBayes error matrix : \n\n|        |   Class 1 |   Class 2 |   Class 3 |\n|--------|-----------|-----------|-----------|\n| View 1 |  0.328829 |  0.334835 |  0.25976  |\n| View 2 |  0.33033  |  0.282282 |  0.283784 |\n| View 3 |  0.369369 |  0.198198 |  0.126126 |\n| View 4 |  0.310811 |  0.141141 |  0.189189 |\n\n The error, as computed by the 'empirical bayes' classifier of each view : \n\n|        |   Class 1 |   Class 2 |   Class 3 |\n|--------|-----------|-----------|-----------|\n| View 1 |  0.304805 |  0.297297 | 0.363363  |\n| View 2 |  0.325826 |  0.280781 | 0.219219  |\n| View 3 |  0.381381 |  0.160661 | 0.0975976 |\n| View 4 |  0.279279 |  0.148649 | 0.171171  |\n\nThis report has been automatically generated on July 28, 2025 at 20:44:52"

This will save the description in the directory given by output_path, in a file called demo.md, as the name of the dataset is “demo”.
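If you already hold the markdown string returned with save=False, it can also be written to disk by hand. The sketch below only illustrates the `<name>.md` naming convention described above. `save_markdown` is a hypothetical helper, not a MAGE function, and a temporary directory stands in for the real output directory.

```python
import tempfile
from pathlib import Path


def save_markdown(description, output_dir, name):
    """Hypothetical helper: write a markdown string to <output_dir>/<name>.md."""
    path = Path(output_dir) / "{}.md".format(name)
    path.write_text(description, encoding="utf-8")
    return path


# A temporary directory stands in for the real output directory
with tempfile.TemporaryDirectory() as tmp_dir:
    report_path = save_markdown("# Generated dataset description\n", tmp_dir, "demo")
    saved_name = report_path.name
    saved_text = report_path.read_text(encoding="utf-8")

print(saved_name)
```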

Save the dataset in an HDF5 file

Moreover, it is possible to save the dataset in an HDF5 file, compatible with SuMMIT, with:

[10]:

generator.to_hdf5_mc(saving_path=supp_dir)

Visualizing the dataset with plotly 

Here, we purposely used only 3 features per view, so the generated dataset is easily plottable in 3D.

Let us plot each view :

[11]:

import os  # needed for os.path.join below

import plotly
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from plotly.colors import DEFAULT_PLOTLY_COLORS
from IPython.display import display
from IPython.display import IFrame


fig = make_subplots(rows=2, cols=2,
                    subplot_titles= ["View {}".format(view_index)
                                     for view_index in range(n_views)],
                    specs=[[{'type': 'scatter3d'}, {'type': 'scatter3d'}, ],
                               [{'type': 'scatter3d'},
                                {'type': 'scatter3d'}, ]])
row = 1
col = 1
show_legend = True
# Plot the data for each view and each label
for view_index in range(n_views):
    for lab_index in range(n_classes):
        concerned_examples = np.where(generator.y == lab_index)[0]
        fig.add_trace(
            go.Scatter3d(
                x=generator.dataset[view_index][concerned_examples, 0],
                y=generator.dataset[view_index][concerned_examples, 1],
                z=generator.dataset[view_index][concerned_examples, 2],
                text=[generator.sample_ids[ind] for ind in concerned_examples],
                hoverinfo='text',
                legendgroup="Class {}".format(lab_index),
                mode='markers', marker=dict(size=1,
                                            color=DEFAULT_PLOTLY_COLORS[lab_index],
                                            opacity=0.8),
                name="Class {}".format(lab_index),
                showlegend=show_legend),
            row=row, col=col)
    show_legend = False
    col += 1
    if col == 3:
        col = 1
        row += 1

fig_path = os.path.join(supp_dir, "fig.html")
plotly.offline.plot(fig, filename=fig_path, auto_open=False)
IFrame(src=fig_path , width=500, height=500)

[11]:

The figure shows the dataset with a 3D subplot for each view. It is possible to remove the samples of a specific class by clicking on its label in the legend. The sub-problems are of dimension 3 (3 features). However, only 2 features are needed to separate 3 classes, so the first two dimensions (x and y in the plots) are “relevant”, while the third is filled with noise.
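The plotting loop above tracks row and col by hand. As a sketch of an alternative, the same row-by-row placement can be derived directly from the view index. `grid_position` is a hypothetical helper introduced here for illustration.

```python
def grid_position(view_index, n_cols=2):
    """Return the 1-based (row, col) of a view in a grid filled row by row."""
    row, col = divmod(view_index, n_cols)
    return row + 1, col + 1


positions = [grid_position(view_index) for view_index in range(4)]
print(positions)  # [(1, 1), (1, 2), (2, 1), (2, 2)]
```

The returned pair can be passed as the row and col arguments of fig.add_trace, which removes the need for the manual counters.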

Getting the outputted error matrix

In order to measure the outputted error matrix, as the views have been generated with make_classification, the DecisionTree is a good approximation of the Bayes classifier.
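The per-class error used in the rest of this section can be read off a confusion matrix. The sketch below assumes scikit-learn's convention (rows are true labels, columns are predictions) and uses a made-up 3-class confusion matrix. `per_class_error` is a hypothetical helper.

```python
import numpy as np


def per_class_error(confusion):
    """Error rate of each true class: 1 - (correct predictions / class size)."""
    confusion = np.asarray(confusion, dtype=float)
    # Diagonal = correctly classified samples, row sum = samples of that class
    return 1.0 - np.diag(confusion) / confusion.sum(axis=1)


# Made-up confusion matrix, for illustration only
toy_confusion = [[8, 1, 1],
                 [2, 6, 2],
                 [0, 1, 9]]
errors = per_class_error(toy_confusion)
print(errors)
```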

In order to estimate the test error in the dataset for each class with a Decision Tree, we use a StratifiedKFold :

[12]:

from sklearn.model_selection import StratifiedKFold

n_folds = 5

folds_generator = StratifiedKFold(n_folds, random_state=random_state,
                                 shuffle=True)
# Splitting the array containing the indices of the samples
folds = folds_generator.split(np.arange(generator.y.shape[0]), generator.y)

# Getting the list of train and test sample indices for each fold.
folds = [[list(train), list(test)] for train, test in folds]
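To see what stratification provides, the following sketch runs the same kind of split on synthetic labels (3 classes of 666 samples, standing in for generator.y) and counts the test samples per class in each fold. The integer seed 42 is an arbitrary choice for this example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for generator.y: 3 classes of 666 samples each
labels = np.repeat([0, 1, 2], 666)
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

test_counts = []
for train_idx, test_idx in splitter.split(np.zeros(labels.shape[0]), labels):
    # Number of test samples of each class in this fold
    test_counts.append(np.bincount(labels[test_idx], minlength=3).tolist())

print(test_counts)
```

Each fold holds 133 or 134 test samples of every class, so the per-class error is estimated on comparable sample sizes across folds.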

Then, we get a Decision Tree with a maximum depth of 10 and fit it on each view, for each fold. The outputted score is the cross-validation score on the 5 folds.

[13]:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

dt = DecisionTreeClassifier(max_depth=10)
confusion_mat = np.zeros((n_folds, n_views, n_classes, n_classes))
n_sample_per_class = np.zeros((n_views, n_classes, n_folds))

# For each view
for view_index in range(n_views):
    # For each fold
    for fold_index, [train, test] in enumerate(folds):

        # Fit the decision tree on the training set
        dt.fit(generator.dataset[view_index][train, :], generator.y[train])
        # Predict on the testing set
        pred = dt.predict(generator.dataset[view_index][test, :])

        # Get the confusion matrix
        confusion_mat[fold_index, view_index, :, :] = confusion_matrix(generator.y[test], pred)
        for class_index in range(n_classes):
            n_sample_per_class[view_index, class_index, fold_index] = np.where(generator.y[test]==class_index)[0].shape[0]
confusion_mat = np.mean(confusion_mat, axis=0)
n_sample_per_class = np.mean(n_sample_per_class, axis=2)
output = np.zeros((n_classes, n_views))
# Get the per-class error from the confusion matrix
for class_index in range(n_classes):
    for view_index in range(n_views):
        output[class_index, view_index] = 1-confusion_mat[view_index, class_index, class_index]/n_sample_per_class[view_index, class_index]

print("Input error matrix : \n{}\n\nOutputted error matrix : \n{}\n\nDifference :\n{}".format(tabulate(error_matrix, tablefmt='grid'), tabulate(output, tablefmt='grid'), tabulate(error_matrix-output, tablefmt='grid')))

Input error matrix :
+------+------+------+------+
| 0.3  | 0.32 | 0.38 | 0.3  |
+------+------+------+------+
| 0.35 | 0.28 | 0.2  | 0.15 |
+------+------+------+------+
| 0.25 | 0.29 | 0.15 | 0.21 |
+------+------+------+------+

Outputted error matrix :
+----------+----------+----------+----------+
| 0.387387 | 0.393393 | 0.381381 | 0.324324 |
+----------+----------+----------+----------+
| 0.294294 | 0.277778 | 0.159159 | 0.147147 |
+----------+----------+----------+----------+
| 0.237237 | 0.258258 | 0.160661 | 0.205706 |
+----------+----------+----------+----------+

Difference :
+------------+-------------+-------------+-------------+
| -0.0873874 | -0.0733934  | -0.00138138 | -0.0243243  |
+------------+-------------+-------------+-------------+
|  0.0557057 |  0.00222222 |  0.0408408  |  0.00285285 |
+------------+-------------+-------------+-------------+
|  0.0127628 |  0.0317417  | -0.0106607  |  0.00429429 |
+------------+-------------+-------------+-------------+

Here, we can see that the output error matrix differs from the input one: the gap reaches about 0.09 for Class 1 on View 1, while it stays below 0.01 for several other entries.
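The size of the gap can be summarized numerically. The sketch below copies the two matrices from the printed tables above (so the measured values are rounded to 6 decimals) and locates the largest absolute difference.

```python
import numpy as np

input_errors = np.array([
    [0.30, 0.32, 0.38, 0.30],
    [0.35, 0.28, 0.20, 0.15],
    [0.25, 0.29, 0.15, 0.21],
])
# Values copied from the "Outputted error matrix" table above
measured_errors = np.array([
    [0.387387, 0.393393, 0.381381, 0.324324],
    [0.294294, 0.277778, 0.159159, 0.147147],
    [0.237237, 0.258258, 0.160661, 0.205706],
])

gap = np.abs(input_errors - measured_errors)
# Position (class index, view index) of the largest gap, both 0-based
worst_class, worst_view = np.unravel_index(np.argmax(gap), gap.shape)
print("Largest gap: {:.4f} for Class {} on View {}".format(
    gap.max(), worst_class + 1, worst_view + 1))
```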

Conclusion

In this demo, we used MAGE to generate a basic multiview dataset, and we performed a naive analysis on it. The next tutorial will be focused on introducing redundancy, mutual error and complementarity.
