14:00 - 15:30
Chair/s:
Martin Kerwer (ZPID, Trier, Germany)
Sparse Common and Distinctive Covariates Logistic Regression: a classification method that disentangles underlying processes from data from multiple sources
Tue-04
Presentation time:  
Soogeun Park, Eva Ceulemans, Katrijn Van Deun
Tilburg University (Park, Van Deun); KU Leuven (Ceulemans)

Large sets of data originating from multiple sources such as demographics, social networks, genetic profiling and questionnaires are becoming increasingly common in the behavioural sciences and psychology. Such joint datasets that concern the same observation units are known as multiblock data. A nice example involving multiblock data is the method proposed by Beaton et al. (2016), where genetic, behavioural and neuroimaging data were jointly analyzed to model the risk factors of Alzheimer's disease.

Simultaneously analyzing multiblock datasets comprised of data conventional in psychology together with other kinds of data can lead to interesting insights into the complicated relationships among the different processes that govern human behaviour. It allows for the discovery of novel and integrated understanding, as in Caspi et al. (2002), who found that the MAOA gene and maltreatment play interactive roles in leading to antisocial behaviour. A possible way to further disentangle the interplay among the multiple processes is to characterize them into two types: those that concern only data from a single source and those that jointly encompass multiple data sources. We refer to these two types as distinctive and common processes, respectively. Methods rooted in principal component analysis (PCA) have been considered fit for discerning between these two kinds of processes, in part because each component can be interpreted as a representation of a certain process. Måge et al. (2019) provided a comparative study of these methods, including simultaneous component analysis (SCA) with distinctive and common components (DISCO-SCA; Schouteden et al., 2013) and joint and individual variation explained (JIVE; Lock et al., 2013).
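
To sketch the idea (in our own notation rather than that of the cited papers), suppose two data blocks X_1 and X_2 are observed on the same units and jointly decomposed as

\[
\mathbf{X}_1 = \mathbf{T}\mathbf{P}_1^{\top} + \mathbf{E}_1, \qquad
\mathbf{X}_2 = \mathbf{T}\mathbf{P}_2^{\top} + \mathbf{E}_2,
\]

where the score matrix T is shared across blocks and P_1 and P_2 hold block-specific loadings. A component can then be read as common when it has non-zero loadings in both P_1 and P_2, and as distinctive when its loadings are non-zero in only one block; the cited methods differ in how they impose and estimate this structure.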

On the other hand, multiblock data also present a challenge, since they often feature a large number of variables and/or high dimensionality. The large number of estimated coefficients makes interpretation of the underlying processes difficult. To remedy this issue, regularized versions of the abovementioned component methods have been proposed that provide sparse and therefore more interpretable solutions, on top of identifying the common and distinctive processes behind multiple blocks of data (de Schipper and Van Deun, 2018; Gu and Van Deun, 2019). These papers have also demonstrated the potential insights such multiblock component methodology could bring to psychological research by applying the methods to multiple blocks of questionnaire data or to blocks comprising clinical scales and genetic data.
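
As a rough illustration of how such regularization operates (the exact penalties differ across the cited methods), a lasso-type penalty can be added to the component loss so that many loadings are shrunk to exactly zero, for instance

\[
\min_{\mathbf{T},\,\mathbf{P}_k}\; \sum_{k} \lVert \mathbf{X}_k - \mathbf{T}\mathbf{P}_k^{\top} \rVert_F^{2} \;+\; \lambda \sum_{k} \lVert \mathbf{P}_k \rVert_1,
\]

where the tuning parameter \lambda governs the degree of sparsity. With many loadings equal to zero, each component can be interpreted from a small subset of variables, and suitable group-wise penalties can additionally enforce the zero pattern that separates common from distinctive components.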

Whilst these methods are tools for exploring the underlying mechanisms of multiblock datasets, in psychological research it is very often of interest to construct a model for the classification of individuals. Binary classification is employed especially in areas related to mental health, where various tests are used to help diagnose disorders including alcoholism (Babor et al., 2001), dementia (Folstein et al., 1975) and eating disorders (Hill et al., 2010; Botella, Huang and Suero, 2014). Constructing a classification model from a multiblock dataset can help uncover novel insights into these issues.

Along these lines of research, a method is needed that addresses the multiple challenges of a classification problem involving multiblock datasets. Such a method would find the common and distinctive processes underlying the multiblock data and perform variable selection while constructing a predictive model for a categorical outcome variable. Building on the multiblock component methods, sparse common and distinctive covariates regression (SCD-CovR; Park, Ceulemans and Van Deun, 2020) has been proposed to address this combination of research aims, but it targets a continuous outcome. The current paper adapts the SCD-CovR method to handle a categorical outcome.

Our proposed method has its roots in principal covariates regression (PCovR; De Jong and Kiers, 1992), which finds principal components that underlie both the predictor and the outcome variables. A distinctive feature of the method is the weighting parameter that balances the predictor and outcome variables when constructing the components. When specified such that the components are found by considering only the predictor variables, the method is equivalent to PCA. Likewise, it becomes equivalent to multivariate linear regression if the balancing is done completely towards the outcome variables. To improve interpretability and consistency of the coefficients estimated by PCovR when the number of variables is large, sparse PCovR (Van Deun et al., 2018) has been proposed, which imposes regularization penalties on the coefficients. An extension of this method that also caters for multiple blocks of data is the aforementioned SCD-CovR (Park, Ceulemans and Van Deun, 2020). The current paper further extends this multiblock version of sparse PCovR by reformulating it within the logistic regression framework, leading to sparse common and distinctive covariates logistic regression (SCD-Cov-logR). The logistic regression model expresses the probabilities of class membership as a function of the components and the regression coefficients.
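
To make the role of the weighting explicit, a sketch of the PCovR criterion (notation ours; see De Jong and Kiers, 1992, for the exact formulation) is

\[
\min_{\mathbf{W},\,\mathbf{P}_X,\,\mathbf{p}_y}\;
\alpha\,\frac{\lVert \mathbf{X} - \mathbf{X}\mathbf{W}\mathbf{P}_X^{\top} \rVert_F^{2}}{\lVert \mathbf{X} \rVert_F^{2}}
\;+\;(1-\alpha)\,\frac{\lVert \mathbf{y} - \mathbf{X}\mathbf{W}\mathbf{p}_y \rVert^{2}}{\lVert \mathbf{y} \rVert^{2}},
\]

where the components are T = XW and \alpha is the weighting parameter: \alpha = 1 recovers PCA of the predictors, whereas \alpha = 0 yields the regression-dominated solution described above. In SCD-Cov-logR the squared-error term for the outcome is, roughly speaking, replaced by a logistic model for the binary outcome, so that class-membership probabilities take the familiar form

\[
\Pr(y_i = 1 \mid \mathbf{x}_i) = \frac{1}{1 + \exp\{-(\beta_0 + \mathbf{x}_i^{\top}\mathbf{W}\mathbf{p}_y)\}},
\]

with sparsity penalties on the block-wise weights inducing the common and distinctive covariates.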

To assess the performance of our method, we compare it against another component-based methodological framework, DIABLO (Singh et al., 2016). DIABLO also addresses the multiple challenges of multiblock data and is an extension of partial least squares (PLS), which has been used in psychology. In the domain of differential psychology, PLS has been proposed as an alternative to SEM when sample sizes are small (Willaby et al., 2015), and Campbell and Ntobedzi (2007) constructed a predictive PLS model of the relationship between emotional intelligence, coping styles and psychological distress. A notable difference between the PCovR and PLS methodologies is the aforementioned weighting parameter. In part because of the absence of this parameter, PLS methods have been found to be prone to overfitting, as they place heavier focus on the prediction of the outcome variable (Vervloet et al., 2016; Van Deun et al., 2018). Therefore, in our evaluation on simulated and empirical datasets, we expect SCD-Cov-logR to outperform DIABLO in recovering the processes underlying the predictor variables. Moreover, given the overfitting problem, we expect our method also to be superior in out-of-sample prediction. We conclude the paper with a discussion of the potential value that multiblock PCovR methodology can bring to modern, data-rich psychological research.