Improving the Stability of the Variable Selection with Small Datasets in Classification and Regression Tasks

Cateni, S.; Colla, V.; Vannucci, M.

doi:10.1007/s11063-022-10916-4

When designing a machine learning-based solution for classification or regression problems, variable selection techniques are often applied to identify the input variables that most affect the considered target. Selecting such variables offers several advantages, such as lower complexity of the model and of the learning algorithm, reduced computational time, and improved performance. Moreover, variable selection helps to gain a deeper understanding of the considered problem. High correlation among variables often produces multiple, equally optimal subsets of variables, which makes traditional variable selection methods unstable and reduces confidence in the selected variables. Stability measures the reproducibility of the variable selection method. Therefore, high stability is as important as high precision of the developed model. The paper presents an automatic procedure for variable selection in classification (binary and multi-class) and regression tasks, which provides an optimal stability index without requiring any a priori information on the data. The proposed approach has been tested on different small datasets, which are unstable by nature, and has achieved satisfactory results.
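
One common way to quantify the stability of a variable selection method, not necessarily the index used in the paper, is the mean pairwise Jaccard similarity between the variable subsets selected on bootstrap resamples of the data. The following Python sketch illustrates that idea under stated assumptions: the function names (`select_top_k`, `jaccard`, `stability_index`, `bootstrap_stability`) are hypothetical, and the simple correlation-based selector is a stand-in for whatever selection method is being evaluated, not the authors' procedure.

```python
import itertools

import numpy as np


def select_top_k(X, y, k):
    """Hypothetical selector (an assumption, not the paper's method):
    return the indices of the k variables with the largest absolute
    Pearson correlation with the target y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Constant columns (or a constant target) get correlation 0.
    corr = np.abs(Xc.T @ yc) / np.where(denom == 0, 1.0, denom)
    return frozenset(np.argsort(corr)[-k:].tolist())


def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two index sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def stability_index(subsets):
    """Mean pairwise Jaccard similarity over all pairs of selected subsets.
    1.0 means every run selected the same variables; 0.0 means no overlap."""
    pairs = list(itertools.combinations(subsets, 2))
    if not pairs:
        raise ValueError("at least two subsets are required")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def bootstrap_stability(X, y, k, n_resamples=30, seed=0):
    """Run the selector on bootstrap resamples and score the agreement."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    subsets = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # sample rows with replacement
        subsets.append(select_top_k(X[idx], y[idx], k))
    return stability_index(subsets)


if __name__ == "__main__":
    # Small synthetic dataset with two highly correlated inputs (columns 0 and 1).
    rng = np.random.default_rng(42)
    n = 25
    base = rng.normal(size=n)
    X = np.column_stack([
        base + 0.05 * rng.normal(size=n),
        base + 0.05 * rng.normal(size=n),
        rng.normal(size=n),
        rng.normal(size=n),
    ])
    y = base + 0.1 * rng.normal(size=n)
    print("stability (k=1):", bootstrap_stability(X, y, k=1))
    print("stability (k=2):", bootstrap_stability(X, y, k=2))
```

With only one variable selected, the two correlated inputs compete for the single slot, so the chosen subset may change from one resample to the next; this is the kind of instability the abstract attributes to highly correlated variables and small datasets.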