Multiview Dataset Generator Demo

Once you have installed MAGE, you can run it from this notebook.

[1]:
from multiview_generator.gaussian_classes import MultiViewGaussianSubProblemsGenerator
from tabulate import tabulate
import numpy as np

random_state = np.random.RandomState(42)

Basic configuration

Let us suppose that you want to build a multiview dataset with 4 views and 3 classes :

[2]:
name = "demo"
n_views = 4
n_classes = 3

In order to configure the dataset, you have to provide the error matrix, which gives the expected error of the Bayes classifier for class i on view j as the value in row i, column j :

[3]:
error_matrix = [
   [0.30, 0.32, 0.38, 0.30],
   [0.35, 0.28, 0.20, 0.15],
   [0.25, 0.29, 0.15, 0.21]
]
print(tabulate(error_matrix, tablefmt="grid"))
+------+------+------+------+
| 0.3  | 0.32 | 0.38 | 0.3  |
+------+------+------+------+
| 0.35 | 0.28 | 0.2  | 0.15 |
+------+------+------+------+
| 0.25 | 0.29 | 0.15 | 0.21 |
+------+------+------+------+
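Before handing the matrix to the generator, a quick sanity check can catch shape mistakes. The sketch below is not part of MAGE; it only assumes the convention stated above (one row per class, one column per view) and that every entry is an error rate between 0 and 1.

```python
import numpy as np

n_views = 4
n_classes = 3
error_matrix = [
    [0.30, 0.32, 0.38, 0.30],
    [0.35, 0.28, 0.20, 0.15],
    [0.25, 0.29, 0.15, 0.21],
]

errors = np.asarray(error_matrix, dtype=float)

# One row per class, one column per view.
assert errors.shape == (n_classes, n_views)
# Error rates are probabilities.
assert np.all((errors >= 0.0) & (errors <= 1.0))

# Unweighted average of the expected error of each view over the classes.
mean_error_per_view = errors.mean(axis=0)
print(mean_error_per_view)
```

The per-view average is only a rough indicator of how hard each view is; it is not a quantity that MAGE itself reports.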

Once this has been defined, you can set all the other parameters of the dataset :

  • the number of samples,

  • the number of features of each view,

  • the proportion of samples in each class.

[4]:
n_samples = 2000
n_features = 3
class_weights = [0.333, 0.333, 0.333,]

Generate the dataset

With the basic configuration done, we can generate the dataset :

[5]:
generator = MultiViewGaussianSubProblemsGenerator(name=name, n_views=n_views,
                                          n_classes=n_classes,
                                          n_samples=n_samples,
                                          n_features=n_features,
                                          class_weights=class_weights,
                                          error_matrix=error_matrix,
                                          random_state=random_state)

dataset, y = generator.generate_multi_view_dataset()

for view_index, view_data in enumerate(dataset):
    print("View {} of shape {}".format(view_index+1, view_data.shape))

View 1 of shape (1998, 3)
View 2 of shape (1998, 3)
View 3 of shape (1998, 3)
View 4 of shape (1998, 3)

Here, we see that each view has 1998 samples instead of the 2000 requested: the three class weights of 0.333 sum to 0.999, so each class receives 666 samples and the classes stay balanced.
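The sample count can be reproduced by hand. The sketch below assumes that each class receives its weight times n_samples, rounded to an integer; the exact rounding rule used inside MAGE is not shown in this notebook, so treat this as an illustration of the arithmetic rather than the generator's implementation.

```python
n_samples = 2000
class_weights = [0.333, 0.333, 0.333]

# Assumed rule: per-class count = weight * n_samples, rounded to an integer.
per_class = [round(n_samples * weight) for weight in class_weights]
total = sum(per_class)

print(per_class, total)
```

Under this assumption, weights that sum exactly to 1 (for example 1/3 each) would be needed to approach the requested 2000 samples.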

Get a description of it

Now, if you wish to get information about the generated dataset, run :

[6]:
description = generator.gen_report(save=False)

This will generate a markdown report on the dataset. Here, we used save=False so the description is not saved in a file.

To print it in this notebook, we use :

[7]:
from IPython.display import display,Markdown
display(Markdown(description))

Generated dataset description

The dataset named demo has been generated by MAGE and is comprised of

  • 1998 samples, splitted in

  • 3 classes, described by

  • 4 views.

The input error matrix is

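|         |   View 1 |   View 2 |   View 3 |   View 4 |
|---------|----------|----------|----------|----------|
| Class 1 |     0.3  |     0.32 |     0.38 |     0.3  |
| Class 2 |     0.35 |     0.28 |     0.2  |     0.15 |
| Class 3 |     0.25 |     0.29 |     0.15 |     0.21 |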

The classes are balanced as :

  • Class 1 : 666 samples (33% of the dataset)

  • Class 2 : 666 samples (33% of the dataset)

  • Class 3 : 666 samples (33% of the dataset)

The views have

  • 64.56% redundancy,

  • 1.0% mutual error and

  • 34.53% complementarity with a level of [[3] [3] [3]].

Views description

View 1

This view is generated with StumpsGenerator, with the following configuration :

class_sep: 1.0
n_clusters_per_class: 1
n_features: 3

This view has 3 features, among which 2 are relevant for classification (they are the 2 first columns of the view) the other are filled with uniform noise.

Its empirical bayesian classifier is a decision stump

View 2

This view is generated with StumpsGenerator, with the following configuration :

class_sep: 1.0
n_clusters_per_class: 1
n_features: 3

This view has 3 features, among which 2 are relevant for classification (they are the 2 first columns of the view) the other are filled with uniform noise.

Its empirical bayesian classifier is a decision stump

View 3

This view is generated with StumpsGenerator, with the following configuration :

class_sep: 1.0
n_clusters_per_class: 1
n_features: 3

This view has 3 features, among which 2 are relevant for classification (they are the 2 first columns of the view) the other are filled with uniform noise.

Its empirical bayesian classifier is a decision stump

View 4

This view is generated with StumpsGenerator, with the following configuration :

class_sep: 1.0
n_clusters_per_class: 1
n_features: 3

This view has 3 features, among which 2 are relevant for classification (they are the 2 first columns of the view) the other are filled with uniform noise.

Its empirical bayesian classifier is a decision stump

Statistical analysis

Bayes error matrix :

|        |   Class 1 |   Class 2 |   Class 3 |
|--------|-----------|-----------|-----------|
| View 1 |  0.328829 |  0.334835 |  0.25976  |
| View 2 |  0.33033  |  0.282282 |  0.283784 |
| View 3 |  0.369369 |  0.198198 |  0.126126 |
| View 4 |  0.310811 |  0.141141 |  0.189189 |

The error, as computed by the ‘empirical bayes’ classifier of each view :

|        |   Class 1 |   Class 2 |   Class 3 |
|--------|-----------|-----------|-----------|
| View 1 |  0.304805 |  0.297297 | 0.363363  |
| View 2 |  0.325826 |  0.280781 | 0.219219  |
| View 3 |  0.381381 |  0.160661 | 0.0975976 |
| View 4 |  0.279279 |  0.148649 | 0.171171  |

This report has been automatically generated on July 28, 2025 at 20:44:51
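Note that the report indexes its Bayes error matrix by view then class, while the input error matrix is indexed by class then view. As a small sketch, the values printed in the report above can be copied into arrays and compared after a transpose; the variable names are illustrative and nothing here calls MAGE.

```python
import numpy as np

# Input error matrix: rows are classes, columns are views.
input_errors = np.array([
    [0.30, 0.32, 0.38, 0.30],
    [0.35, 0.28, 0.20, 0.15],
    [0.25, 0.29, 0.15, 0.21],
])

# Bayes error matrix copied from the report: rows are views, columns are classes.
report_bayes_errors = np.array([
    [0.328829, 0.334835, 0.25976],
    [0.33033, 0.282282, 0.283784],
    [0.369369, 0.198198, 0.126126],
    [0.310811, 0.141141, 0.189189],
])

# Transpose the report matrix so that both are indexed (class, view).
difference = report_bayes_errors.T - input_errors
max_abs_difference = np.abs(difference).max()

print(np.round(difference, 4))
print(max_abs_difference)
```

For this run, the largest gap is a little under 0.03, on class 1 of view 1.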

But if you just want to save it, you can use :

[9]:
generator.gen_report(output_path=supp_dir, save=True)

This will save the description in the directory given by output_path (here supp_dir), in a file called demo.md, as the name of the dataset is “demo”.

Save the dataset in an HDF5 file

Moreover, it is possible to save the dataset in an HDF5 file compatible with SuMMIT, with :

[10]:
generator.to_hdf5_mc(saving_path=supp_dir)

Visualizing the dataset with plotly

Here, we purposely used only 3 features per view, so the generated dataset can easily be plotted in 3D.

Let us plot each view :

[11]:
import os

import plotly
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from plotly.colors import DEFAULT_PLOTLY_COLORS
from IPython.display import IFrame

# supp_dir (used below) is assumed to be an existing output directory
# defined earlier in the session.


fig = make_subplots(rows=2, cols=2,
                    subplot_titles=["View {}".format(view_index + 1)
                                    for view_index in range(n_views)],
                    specs=[[{'type': 'scatter3d'}, {'type': 'scatter3d'}, ],
                               [{'type': 'scatter3d'},
                                {'type': 'scatter3d'}, ]])
row = 1
col = 1
show_legend = True
# Plot the data for each view and each label
for view_index in range(n_views):
    for lab_index in range(n_classes):
        concerned_examples = np.where(generator.y == lab_index)[0]
        fig.add_trace(
            go.Scatter3d(
                x=generator.dataset[view_index][concerned_examples, 0],
                y=generator.dataset[view_index][concerned_examples, 1],
                z=generator.dataset[view_index][concerned_examples, 2],
                text=[generator.sample_ids[ind] for ind in concerned_examples],
                hoverinfo='text',
                legendgroup="Class {}".format(lab_index),
                mode='markers', marker=dict(size=1,
                                            color=DEFAULT_PLOTLY_COLORS[lab_index],
                                            opacity=0.8),
                name="Class {}".format(lab_index),
                showlegend=show_legend),
            row=row, col=col)
    show_legend = False
    col += 1
    if col == 3:
        col = 1
        row += 1

fig_path = os.path.join(supp_dir, "fig.html")
plotly.offline.plot(fig, filename=fig_path, auto_open=False)
IFrame(src=fig_path , width=500, height=500)
[11]:
[Interactive plotly figure: one 3D scatter subplot per view]

The figure shows the dataset with one 3D subplot per view. It is possible to hide the samples of a specific class by clicking on its label in the legend. The sub-problems are of dimension 3 (3 features). However, only 2 features are needed to separate 3 classes, so the first two dimensions (x and y in the plots) are “relevant”, while the third is filled with noise.
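The row and column bookkeeping in the plotting loop can also be derived directly from the view index. The following sketch assumes the same 2-column grid and plotly's 1-based subplot coordinates; n_cols is a name introduced here for illustration.

```python
n_views = 4
n_cols = 2  # number of subplot columns in the grid (illustrative name)

positions = []
for view_index in range(n_views):
    # divmod gives 0-based (row, col); plotly subplots are 1-based.
    row, col = divmod(view_index, n_cols)
    positions.append((row + 1, col + 1))

print(positions)
```

Each tuple can be passed as row=... and col=... to fig.add_trace, which removes the need for the manual counters.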

Getting the outputted error matrix

In order to measure the output error matrix, we rely on a decision tree: as the views have been generated with make_classification, the DecisionTree is a good approximation of the Bayes classifier.

In order to estimate the test error in the dataset for each class with a Decision Tree, we use a StratifiedKFold :

[12]:
from sklearn.model_selection import StratifiedKFold

n_folds = 5

folds_generator = StratifiedKFold(n_folds, random_state=random_state,
                                 shuffle=True)
# Splitting the array containing the indices of the samples
folds = folds_generator.split(np.arange(generator.y.shape[0]), generator.y)

# Getting the list of the sample indices in each fold.
folds = [[list(train), list(test)] for train, test in folds]
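To see what stratification buys us, the sketch below runs the same kind of split on stand-in labels with the class sizes of this demo (666 samples per class) and counts the classes in each test fold. The labels are synthetic and the seed is arbitrary; only the StratifiedKFold behaviour is being illustrated.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Stand-in labels: 3 classes of 666 samples each, as in the generated dataset.
y = np.repeat(np.arange(3), 666)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

test_counts = []
seen = []
for train, test in skf.split(np.zeros(y.shape[0]), y):
    # Number of test samples of each class in this fold.
    test_counts.append(np.bincount(y[test], minlength=3))
    seen.extend(test.tolist())
test_counts = np.array(test_counts)

print(test_counts)
```

Every test fold holds close to one fifth of each class, and each sample appears in exactly one test fold, which is what makes the per-class error estimates comparable across folds.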

Then, we get a Decision Tree with a maximum depth of 10 (max_depth=10 in the code below) and fit it on each view, for each fold. The output score is the cross-validation score on the 5 folds.

[13]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

dt = DecisionTreeClassifier(max_depth=10)
confusion_mat = np.zeros((n_folds, n_views, n_classes, n_classes))
n_sample_per_class = np.zeros((n_views, n_classes, n_folds))

# For each view
for view_index in range(n_views):
    # For each fold
    for fold_index, [train, test] in enumerate(folds):

        # Fit the decision tree on the training set
        dt.fit(generator.dataset[view_index][train, :], generator.y[train])
        # Predict on the testing set
        pred = dt.predict(generator.dataset[view_index][test, :])

        # Get the confusion matrix
        confusion_mat[fold_index, view_index, :, :] = confusion_matrix(generator.y[test], pred)
        for class_index in range(n_classes):
            n_sample_per_class[view_index, class_index, fold_index] = np.where(generator.y[test]==class_index)[0].shape[0]
confusion_mat = np.mean(confusion_mat, axis=0)
n_sample_per_class = np.mean(n_sample_per_class, axis=2)
output = np.zeros((n_classes, n_views))
# Get the class error thanks to the confusion matrix
for class_index in range(n_classes):
    for view_index in range(n_views):
        output[class_index, view_index] = 1-confusion_mat[view_index, class_index, class_index]/n_sample_per_class[view_index, class_index]

print("Input error matrix : \n{}\n\nOutputted error matrix : \n{}\n\nDifference :\n{}".format(
    tabulate(error_matrix, tablefmt='grid'),
    tabulate(output, tablefmt='grid'),
    tabulate(np.array(error_matrix) - output, tablefmt='grid')))
Input error matrix :
+------+------+------+------+
| 0.3  | 0.32 | 0.38 | 0.3  |
+------+------+------+------+
| 0.35 | 0.28 | 0.2  | 0.15 |
+------+------+------+------+
| 0.25 | 0.29 | 0.15 | 0.21 |
+------+------+------+------+

Outputted error matrix :
+----------+----------+----------+----------+
| 0.387387 | 0.393393 | 0.381381 | 0.324324 |
+----------+----------+----------+----------+
| 0.294294 | 0.277778 | 0.159159 | 0.147147 |
+----------+----------+----------+----------+
| 0.237237 | 0.258258 | 0.160661 | 0.205706 |
+----------+----------+----------+----------+

Difference :
+------------+-------------+-------------+-------------+
| -0.0873874 | -0.0733934  | -0.00138138 | -0.0243243  |
+------------+-------------+-------------+-------------+
|  0.0557057 |  0.00222222 |  0.0408408  |  0.00285285 |
+------------+-------------+-------------+-------------+
|  0.0127628 |  0.0317417  | -0.0106607  |  0.00429429 |
+------------+-------------+-------------+-------------+

Here, we can see that there are slight differences between the input error matrix and the output one, of at most about 0.09 in absolute value.
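For reference, the per-class error computed in cell [13] is one minus the fraction of a class's test samples that are predicted correctly, that is, one minus the diagonal of the confusion matrix divided by its row sums. A minimal sketch on a small made-up confusion matrix (the numbers are illustrative, not taken from the dataset):

```python
import numpy as np

# Made-up confusion matrix: rows are true classes, columns are predictions.
confusion = np.array([
    [80, 15, 5],
    [10, 85, 5],
    [0, 10, 90],
])

# Number of samples of each true class (row sums).
n_per_class = confusion.sum(axis=1)

# Per-class error = 1 - correctly classified / total of that class.
class_error = 1.0 - np.diag(confusion) / n_per_class

print(class_error)
```

This follows scikit-learn's confusion_matrix convention, where entry (i, j) counts samples of true class i predicted as class j.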

Conclusion

In this demo, we used MAGE to generate a basic multiview dataset, and we performed a naive analysis on it. The next tutorial will be focused on introducing redundancy, mutual error and complementarity.