Comparison of merging strategies for building machine learning models on multiple independent gene expression data sets
High-dimensional gene expression data are regularly studied for their ability to separate different groups of samples by means of machine learning (ML) models. Meanwhile, a large number of such data are publicly available. Several approaches for meta-analysis on independent sets of gene expression data have been proposed, mainly focusing on the step of feature selection, a typical step in fitting a ML model. Here, we compare different strategies of merging the information of such independent data sets to train a classifier model. Specifically, we compare the strategy of merging data sets directly (strategy A), and the strategy of merging the classification results (strategy B). We use simulations with pure artificial data as well as evaluations based on independent gene expression data from lung fibrosis studies to compare the two merging approaches. In the simulations, the number of studies, the strength of batch effects, and the separability are varied. The comparison incorporates five standard ML techniques typically used for high-dimensional data, namely discriminant analysis, support vector machines, least absolute shrinkage and selection operator, random forest, and artificial neural networks. Using cross-study validations, we found that direct data merging yields higher accuracies when having training data of three or four studies, and merging of classification results performed better when having only two training studies. In the evaluation with the lung fibrosis data, both strategies showed a similar performance.