Selection of metafeatures for effective concept identification

The selection was based on output of Experiment 4 and the ranking of mean (accumulated for synthetic and semi-synthetic streams) F-statistic.

The F-statistic, descibing the importance of each metafeature in the concept recognition task, was calculated for each stream. The values were accumulated for each stream type and sorted.

The ranks, defined as the order of sorted, accumulated F-statistics values, had the most significant impact on the selection of metafeatures. Moreover, the diversity of the groups from which the metafeatures were selected was taken into account, trying to avoid combinations that strongly correlated with each other (e.g. C1 and C2 from complexity category, which both express imbalance ratio). The largest pool of metafeatures was determined based on rankings for real-world streams, assuming that they best represent real applications of stream processing methods.

Metafeature	Rank in synthetic	Rank in semi-synthetic	Rank in real-world
int	—	—	4
nre	—	—	5
c1	—	—	7
f1.mean	5	—	10
cl_c.mean	—	—	3
cl_c.sd	—	—	8
cl_ent	—	—	6
j_ent.mean	—	—	2
j_ent.sd	—	—	1
mi.mean	—	—	13
mi.sd	—	—	9
wn.mean	—	—	16
g_m.sd	—	—	15
mean.mean	1	1	12
mean.sd	4	—	—
med.mean	3	3	—
t_m.mean	2	2	—

Selection of metafeatures for effective concept identification

Table of contents