A Survey of Transfer and Multitask Learning in Bioinformatics

  • ABSTRACT

    Machine learning and data mining have found many applications in biological domains, where we look to build predictive models based on labeled training data. In practice, however, high-quality labeled data is scarce, and labeling new data is costly. Transfer and multitask learning offer an attractive alternative: by allowing useful knowledge to be extracted and transferred from data in auxiliary domains, they help counter the lack of data in the target domain. In this article, we survey recent advances in transfer and multitask learning for bioinformatics applications. In particular, we survey several key bioinformatics application areas, including sequence classification, gene expression data analysis, biological network reconstruction and biomedical applications.


  • KEYWORD

    Transfer learning, Bioinformatics, Data mining

  • I. INTRODUCTION

    With the rapid development of biological technology, the need to process biological data has grown quickly. Advanced bio-sensor technologies have made this data available at lower cost, and as a result, in the next few years, we expect to witness a dramatic increase in the application of genomics in our everyday lives, for example in personalized medicine.

    Bioinformatics research is interdisciplinary in nature, in that it integrates diverse areas such as biology and biochemistry, data mining and machine learning, databases and information retrieval, computer science theory, and many others. Bioinformatics concerns a wide spectrum of biological research, including genomics and proteomics, evolutionary biology and systems biology. These research areas increasingly deal with data collected from a wide range of sources such as microarrays, genomic sequencing and medical imaging [1]. Based on this data, data mining in bioinformatics investigates how to learn statistical models that infer biological properties from new data. Various techniques such as supervised learning and unsupervised learning have been developed, with promising results and biological insights in areas such as sequence classification, gene expression data analysis and biological network reconstruction.

    A typical assumption in biological data mining is that a sufficient amount of annotated training data is available, so that a robust classifier can be trained. In many real-world biological scenarios, however, labeled training data is lacking. Sometimes this data can only be obtained at a very high cost. This lack of labeled training data, which partly causes the so-called data sparsity problem, has become a major bottleneck in practice. When data sparsity occurs, overfitting is common during model training, and as a result the performance of the statistical model degrades.

    In response to the problem of data sparsity, various novel machine learning methods have been developed. Among them is transfer learning, a machine learning framework that reuses knowledge from other domains. Transfer learning aims to extract knowledge from auxiliary domains where the data is annotated but cannot be used directly as training data for a new domain. To reuse this data, transfer learning compares the auxiliary and target problem domains in order to find useful knowledge that is common to both, and then reuses this knowledge to help boost learning performance in the target domain. A subarea of transfer learning is multi-task learning, where the target task and the source task are treated equally and are learned simultaneously. Multi-task learning stresses improving learning performance in all related domains. Transfer learning and multi-task learning have achieved great success in many different application areas over the last two decades [2, 3], including text mining [4-6], speech recognition [7, 8], computer vision (e.g., image [9] and video [10] analysis), and ubiquitous computing [11, 12].

    In this article, we survey some recent transfer learning methods in bioinformatics research, together with an additional overview of biomedical imaging and sensor-based e-health. We cover some of the most important application areas of bioinformatics, including sequence analysis, gene expression data analysis, genetic analysis, systems biology and biomedical applications. By conducting a survey in these areas, we hope to take stock of the progress made so far, and to help readers navigate future technological advances.

    II. DATA MINING IN BIOINFORMATICS: A SURVEY OF PROBLEMS

    We focus on the following biological problems in this survey: sequence analysis, gene expression data analysis and genetic analysis, systems biology, and biomedical applications.

      >  A. Biological Sequence Analysis

    Biological sequence analysis aims to assign functional annotations to sequences of DNA segments, and is important for our understanding of a genome. One example is the identification of splice sites in terms of the exon and intron boundaries, a complex task due to the number of possible alternative splicings. Other examples include the prediction of regulatory regions that allow the binding of proteins and determine their functions, the prediction of transcription start and initiation sites, and the prediction of coding regions. Another important sequence analysis problem is major histocompatibility complex (MHC) binding prediction from a sequence perspective, where sequences from pathogens provide a huge number of potential vaccine candidates [13]. MHC molecules are key players in the human immune system, and the prediction of their binding peptides helps in the design of peptide-based vaccines, but there is a major lack of high-quality labeled training data from which to build a good model for this binding prediction problem. In addition, protein subcellular localization prediction is also important in biological sequence analysis, but annotated locations are difficult to obtain.

      >  B. Gene Expression Analysis and Genetic Analysis

    Gene expression analysis and genetic analysis through microarrays or gene chips is an important task for the understanding of proteins and mRNAs. A microarray experiment measures the relative mRNA levels of genes, which allows us to compare the gene expression levels of some biological samples over time in order to understand the differences between normal cells and cancer cells [14]. One characteristic of this analysis is that the number of features, which correspond to genes, is usually larger than the number of samples. This makes it difficult to apply traditional feature selection approaches directly to this data to reduce its dimensionality [15]. Another data mining task often applied to gene expression data is co-clustering, which aims to cluster both the samples and the genes at the same time [16]. In genetic analysis, recent advances have enabled genome-wide association studies (GWAS) that can assay hundreds of thousands of single nucleotide polymorphisms (SNPs) and relate them to clinical conditions or measurable traits. A combination of statistical and machine learning methods has been exploited in this area [17, 18].

      >  C. Computational Systems Biology

    Computational systems biology refers to the tasks of modeling gene-protein regulatory networks and making inferences based on protein-protein interaction networks. The computational challenge lies in integrating and exploring large-scale, multi-dimensional data of multiple types. An important issue is how to exploit the topological features of the network data to help build a high-quality model. Besides automatic prediction based on statistical models, mixed-initiative approaches such as visualization have also been developed to tackle the large-scale data and complex modeling problems.

      >  D. Biomedical Applications

    Biomedical applications investigated in this survey include biological text mining, biomedical image classification and ubiquitous healthcare. Biological text mining refers to the task of using information retrieval techniques to extract information on genes, proteins and their functional relationships from scientific literature [19]. Today we face a vast amount of biological information and findings published as articles, journals, blogs, books and conference proceedings. PubMed and MEDLINE provide some of the most up-to-date information for biological researchers. A scientist often has to read through a large volume of published text in order to find and understand any knowledge embedded in it. With text mining technology, many new findings that are published in text format, such as new genes and their relationships, can be detected automatically, which helps humans and machines work together as a team.

    Furthermore, biomedical image classification is an important problem that appears in many medical applications. Manual classification of images is time-consuming, repetitive, and unreliable. Given a set of training images labeled into a finite number of classes, the goal is to design an automatic image classification method that can interpret future images. An example is breast cancer identification from medical imaging data using computer-aided detection in screening mammography. An important issue for such models is how to reduce the false-positive rate of the classifications.

    Ubiquitous computing is increasingly influencing health care and medicine, and aims to improve traditional health care [20]. Exercise is essential to keep people healthy. Recently, several devices have been designed to record and monitor users’ routine exercise. For example, accelerometers [21-23] and gyroscopes [24-26] are used to recognize users’ activities; Polar records heart rate to evaluate the amount of activity in terms of calories in order to manage sports training [27]. Pedometers can quantify activity in terms of the number of steps, and then suggest that users reduce or increase the amount of exercise they undertake [28].

    In all of the above-mentioned problems, transfer learning plays an important role in addressing the data sparsity problem. Below, we survey some recent transfer learning studies in these areas. We first introduce the notation used in this survey. The source domain data $D_S$ is composed of data instances $x_i \in X_S$ with their corresponding labels $y_i \in Y_S$. Thus we denote the source domain data $(X_S, y_S)$ as $\{(x_1^S, y_1^S), \ldots, (x_{n_S}^S, y_{n_S}^S)\}$. Similarly, the target domain data $D_T$ is composed of data instances $x_i \in X_T$ with their corresponding labels $y_i \in Y_T$. We denote the target domain data $(X_T, y_T)$ as $\{(x_1^T, y_1^T), \ldots, (x_{n_T}^T, y_{n_T}^T)\}$. The functions $f_S(\cdot)$ and $f_T(\cdot)$ denote predictive functions in the source domain $D_S$ and the target domain $D_T$, respectively, where $y_S \approx f_S(X_S)$ and $y_T \approx f_T(X_T)$. When considering multi-task situations, the data $(X_t, y_t)$ of each task $t \in \{1, \ldots, T_N\}$ can be represented by $\{(x_1^t, y_1^t), \ldots, (x_{n_t}^t, y_{n_t}^t)\}$, where $T_N$ is the number of tasks for multi-task learning.

    III. BIOLOGICAL SEQUENCE ANALYSIS

    In sequence classification, the goal is to annotate gene sequences or protein sequences from a given set of training data. As mentioned above, the process of learning to annotate sequences often suffers from a lack of labeled data, which often leads to overfitting. To solve this problem, multi-task learning methods are often used, where it is possible to learn to annotate two or more sets of sequence data together. In this approach, the sequence data can come from different problem domains. By learning these tasks together, the lack of labeled data is alleviated.

    In multi-task learning, task regularization is often used. Task regularization methods are formulated under a regularization framework, which consists of two objective function terms: an empirical loss term on the training data of all tasks, and a regularization term that encodes the relationships between tasks. We first introduce a general framework for regularization-based multi-task approaches. In their pioneering work, Evgeniou and Pontil [29] proposed a multi-task extension of support vector machines (SVMs), which minimizes the following objective function:

    image

    The first and second terms of Eq. (1) denote the empirical error and 2-norm of parameter vectors, respectively, which are the same as those in single-task SVMs. In biological sequence classification tasks, these terms refer to the errors committed when making a comparison to the training data. The difference between single-task SVMs and multi-task SVMs lies in the third term of Eq. (1), which is designed to penalize large deviations between each parameter vector and the mean parameter vector of all tasks. The penalized term imposes a constraint that the parameter vectors in all tasks be similar to each other. In multi-task sequence classification problems, this term characterizes the extent of knowledge sharing between the learning tasks in different sequences. With this general background on SVM-based multitask learning in mind, we now sample a few well-known works using regularization based approaches, as well as kernel design for biological sequence analysis. We will pay attention to applications in splice site recognition and MHC binding prediction.
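
    As a rough illustration of the three terms described above, the following sketch evaluates a hinge-loss objective with a 2-norm penalty and a penalty on the deviation of each task from the mean parameter vector. It is not a reproduction of Eq. (1): the function name, the trade-off weights `lam1` and `lam2`, and the exact scaling of each term are assumptions made for this example.

```python
import numpy as np

def mean_regularized_mtl_objective(W, tasks, lam1=1.0, lam2=1.0):
    """Evaluate a mean-regularized multi-task hinge-loss objective.

    W     : array of shape (n_tasks, n_features), one weight vector per task
    tasks : list of (X, y) pairs, with labels y in {-1, +1}
    lam1, lam2 : hypothetical trade-off weights (not taken from the survey)
    """
    W = np.asarray(W, dtype=float)
    w_mean = W.mean(axis=0)

    # Term 1: empirical hinge loss summed over the training data of all tasks.
    loss = 0.0
    for w_t, (X, y) in zip(W, tasks):
        margins = y * (X @ w_t)
        loss += np.maximum(0.0, 1.0 - margins).sum()

    # Term 2: squared 2-norm of every task's parameter vector.
    norm_term = lam1 * np.sum(W ** 2)

    # Term 3: squared deviation of each parameter vector from the task mean.
    deviation_term = lam2 * np.sum((W - w_mean) ** 2)

    return float(loss + norm_term + deviation_term)
```

    Setting `lam2` to zero removes the coupling, so the objective decomposes into independent single-task problems; increasing it pulls all task parameter vectors toward their mean.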

    One of the first works in multi-task biological sequence classification was by Widmer et al. [30], who proposed two regularization-based multi-task learning methods to predict splice sites across different organisms. To leverage information from related organisms, they suggested two principled approaches for incorporating relations across organisms, both implemented by modifying the regularization term in [29]. In contrast to [29], the relations between the organisms in [30] were defined by a tree or graph implied by their taxonomy or phylogeny. The models related to task t serve as prior information for the model of t, such that the parameter $w_t$ of task t is required to be close to the parameters $w_{\mathrm{others}}$ of the other models. This can be achieved by minimizing the norm of the difference of the parameter vectors, $\| w_t - w_{\mathrm{others}} \|$, along with the original loss function. The first approach trains the models in a top-down manner: a model is learned for each node in the hierarchy over the datasets of the tasks spanned by that node, with the parent node used as prior information through the term $\| w_t - w_{\mathrm{parent}(t)} \|$. The objective function was given as follows:

    image

    In biology, an organism and its ancestors should be similar due to the inheritance of properties in evolution. This is imposed as a constraint in the above formula. Another approach adopted by Widmer et al. [30] is to require that $\| w_t - w_{t'} \|$ be small for related tasks t and t':

    image

    Intuitively, the above formulation states that the properties associated with similar entities are close to each other. The weight $\gamma_{tt'}$ reflects the evolutionary history shared by the two organisms before their divergence.
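
    The pairwise constraint can be pictured as a weighted sum of squared distances between task parameter vectors. The sketch below is only an illustration of that idea: the function name is hypothetical, the use of squared norms is an assumption, and the sum runs over all ordered pairs, so a symmetric weight matrix counts each unordered pair twice.

```python
import numpy as np

def pairwise_task_penalty(W, gamma):
    """Sum over ordered task pairs (t, s) of gamma[t, s] * ||w_t - w_s||^2.

    W     : array of shape (n_tasks, n_features)
    gamma : array of shape (n_tasks, n_tasks) of task-similarity weights,
            e.g. derived from a phylogeny (assumed given here)
    """
    W = np.asarray(W, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    # diffs[t, s] holds the vector w_t - w_s, via broadcasting.
    diffs = W[:, None, :] - W[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    return float(np.sum(gamma * sq_dists))
```

    A large weight between two closely related organisms forces their models to stay close, while a zero weight leaves the corresponding pair of models uncoupled.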

    In addition to the above formulations, Widmer et al. [30] also designed a kernel function that considers not only the data of task t, but also the data of its corresponding ancestor $r_t$, according to a manually defined hierarchical structure R. The kernel can be derived by defining prediction functions over the task parameters as well as the parameters of the tasks' ancestors.

    image

    In Eq. (4), u represents the parameter vectors of leaf nodes and v represents the parameter vectors of their corresponding ancestors, which are internal nodes in the hierarchy structure.

    Schweikert et al. [31] considered a number of domain transfer learning methods for splice-site recognition across several organisms. The goal of domain transfer learning is to use a model from a well-analyzed source domain and its associated data to help train a model in the target domain. In the formulation of Schweikert et al. [31], C. elegans is the source domain and C. remanei, P. pacificus and so on are target domains (Fig. 1).

    The domain transfer learning methods include (as illustrated in Fig. 2):

    - Combination: As a baseline, the simplest way is to combine the source domain data and the target domain data directly without considering the weights.

    - Convex combination:

    image

    where α is the trade-off parameter, which is used to balance the contributions of the source data and the target data.

    - Dual-task learning:

    image

    where both $w_S$ and $w_T$ are optimized, and the problem can be solved using a standard QP solver.

    - Kernel mean matching:

    image

    where Φ is the kernel mapping, which projects the data into a reproducing kernel Hilbert space.
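
    To make the convex combination option concrete, the sketch below blends the decision scores of a source model and a target model with a trade-off parameter α. It assumes two linear models trained separately beforehand and assumes that α weights the target model; both choices, as well as the function names, are illustrative and not taken from [31].

```python
import numpy as np

def linear_scores(w, b, X):
    """Decision scores of a linear model f(x) = <w, x> + b."""
    return np.asarray(X, dtype=float) @ np.asarray(w, dtype=float) + b

def convex_combination_scores(source_scores, target_scores, alpha):
    """Blend two score vectors: alpha * target + (1 - alpha) * source."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    source_scores = np.asarray(source_scores, dtype=float)
    target_scores = np.asarray(target_scores, dtype=float)
    return alpha * target_scores + (1.0 - alpha) * source_scores
```

    With α = 0 the prediction relies entirely on the source model, and with α = 1 entirely on the target model; in practice α would be chosen on held-out target data.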

    Experimental results presented in [31] showed that the differences between the classification functions for recognizing splice sites in these organisms increase with evolutionary distance.

    Jacob and Vert [32] designed an SVM algorithm that is able to learn peptide-MHC-I binding models for many alleles simultaneously, by sharing binding information across alleles. The sharing of information is controlled by a user-defined measure of similarity between alleles, where the similarity can be defined in terms of supertypes, or more directly by comparing key residues known to play a role in peptide-MHC binding. Jacob and Vert first represented each pair of an allele a and a peptide candidate p by a feature vector. Then, using the kernel trick, they gave the inner product between allele-peptide pairs as

    image

    where for the peptide kernel $K_{\mathrm{pep}}$, any existing kernel between peptide representations can be plugged in. For the allele kernel $K_{\mathrm{all}}$, the authors exploited several inner product computation methods to model the relationships across alleles. These allele kernels include the multitask kernel and the supertype kernel [32]. In this formulation, alleles have corresponding multitask kernels such that the training peptides of alleles under the same supertype are shared and trained together, allowing the kernels to reflect the relative distances between the alleles.
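
    One common way to combine an allele kernel and a peptide kernel into a kernel on (allele, peptide) pairs is to multiply them, and the sketch below follows that assumption. The peptide kernel is a toy position-match count, and the specific forms of the multitask and supertype allele kernels shown here (a constant plus indicator terms) are plausible illustrations rather than definitions confirmed by the text above. Allele names, supertype labels and peptides are made up.

```python
def peptide_kernel(p, q):
    """Toy peptide kernel: number of positions with identical residues."""
    if len(p) != len(q):
        raise ValueError("peptides must have the same length")
    return float(sum(a == b for a, b in zip(p, q)))

def multitask_allele_kernel(a, b):
    """Assumed form: 1 for any allele pair, plus 1 if the alleles are identical."""
    return 1.0 + (1.0 if a == b else 0.0)

def make_supertype_allele_kernel(supertype_of):
    """Assumed form: multitask kernel plus 1 if both alleles share a supertype."""
    def kernel(a, b):
        same = supertype_of[a] == supertype_of[b]
        return multitask_allele_kernel(a, b) + (1.0 if same else 0.0)
    return kernel

def pair_kernel(pair1, pair2, allele_kernel):
    """Product kernel on (allele, peptide) pairs."""
    (a, p), (b, q) = pair1, pair2
    return allele_kernel(a, b) * peptide_kernel(p, q)
```

    Because the allele kernel is never zero in this sketch, every training peptide contributes to every allele's model, with more weight given to peptides measured on the same allele or on an allele of the same supertype.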

    In the work of Jacob et al. [33], regularization based models are used for MHC class-I binding prediction. A novel, multitask learning method was implemented by designing a norm or a penalty term over the set of weights, such that the weight vectors of the tasks within a group are similar to each other. To achieve this goal, the penalty function was formulated as follows:

    image

    where $\Omega_{\mathrm{mean}}(W)$ measures the average of the weight vectors, $\Omega_{\mathrm{between}}(W)$ is a measure of inter-cluster variance and $\Omega_{\mathrm{within}}(W)$ is a measure of in-cluster variance.

    Following the pioneering works of Jacob and Vert [32] and Jacob et al. [33], Widmer et al. [34] in 2010 aimed to improve the predictive power of the multitask kernel method for MHC class I binding prediction by developing an advanced kernel based on Jacob and Vert’s original. In addition, Widmer et al. [35] investigated multi-task learning scenarios in which a latent structure is shared across tasks. They modeled the overlap between tasks by defining meta-tasks, so that information is transferred between two tasks t and t' according to their relatedness and the number of meta-tasks in which t and t' co-occur. The importance of each meta-task was defined by learned mixture weights.

    As mentioned in Section II, protein subcellular localization prediction can be categorized as biological sequence analysis. Xu et al. [36] compared a multitask learning method with a common feature representation. The common feature representation approach was adapted from Argyriou et al. [37, 38]. In the latter paper, Argyriou et al. [38] proposed a multi-task feature learning method to learn common representations for multi-task learning under a regularization framework:

    image

    In the above formulation, the first term $l(\cdot, \cdot)$ denotes the loss function, and U is the common transformation used to find a common representation. The second term is a regularization term that penalizes the (2,1)-norm of the matrix A, which aims to enforce sparsity of the common features across the tasks. More specifically, $\|A\|_{2,1}^2$ first computes $\|\alpha^i\|_2$, the 2-norms of the rows of matrix A, and then computes the 1-norm of the vector of these row norms, $(\|\alpha^1\|_2, \|\alpha^2\|_2, \ldots)$. This favors solutions in which entire rows of A are zero, which encourages selecting only features that are useful for all tasks.
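
    The two-step computation of the (2,1)-norm can be written in a few lines. The sketch below assumes that rows of A index features and columns index tasks; the function name is chosen for this example only.

```python
import numpy as np

def l21_norm(A):
    """(2,1)-norm: the 1-norm of the vector of row-wise 2-norms of A."""
    A = np.asarray(A, dtype=float)
    row_norms = np.sqrt(np.sum(A ** 2, axis=1))  # 2-norm of each row
    return float(np.sum(row_norms))              # 1-norm over the rows

# Rows are features, columns are tasks; the second feature is unused by all tasks.
A = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [0.0, 5.0]])
penalty = l21_norm(A) ** 2  # the squared norm, as in the regularization term
```

    A row that is entirely zero contributes nothing to the penalty, which is why minimizing this norm tends to switch off whole features for all tasks at once rather than individual entries.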

    Xu et al. [36] conducted extensive experiments to answer the question: “Can multi-task learning generate more accurate classifiers than single task learning?” They compared the test accuracies of the proposed multi-task learning methods with those of other baseline prediction models. An example of their results is summarized in Fig. 3, which illustrates the accuracies of multi-task feature learning and those of single task learning. The columns from left to right and the rows from top to bottom represent different organisms: archaea, bacteria, gneg, gpos, bovine, dog, fish, fly, frog, human0, human, mouse, pig, rabbit, rat, fungi, plant0, plant, virus0 and virus, in that order. Each cell $C_{ij}$ in the figure is the average result over 5 random trials. More specifically, $C_{ij}$ is the result of jointly training models on organism i and organism j, and applying the trained model $f_j(\cdot)$ to the test data from organism j. For the diagonal cells $C_{ii}$ (shown in gray), Xu et al. [36] trained models on the training data of organism i only, and evaluated them on the test data of organism i as well; these were used as baselines. The cells marked in red indicate that applying multi-task learning methods gives worse performance than the baselines, whereas those in light green indicate that applying multi-task learning methods achieves better performance than the baselines. Furthermore, the cells in dark green or dark grey represent the best performance on the test data from each organism in a particular column. Finally, the cells in white mean that the result is missing because some organisms overlap, as in the case of human0 vs. human, plant0 vs. plant and virus0 vs. virus. For these pairs, multi-task learning experiments cannot be conducted.

    As shown in the figure above, the most significant improvement obtained with the multi-task learning strategy was about 25%. The performance on plants, viruses and animals can be improved by around 10% by using multi-task learning methods. The order of the organisms along the rows and columns arranges the learning tasks by supertype; for example, the organisms that are animals are placed together. Moreover, better results are often obtained near the diagonal, where organisms are similar, while worse cases are often located in cells far from the diagonal. The results in cells near the diagonal were obtained by training on two relatively similar tasks, such as dog and fly, or bacteria and archaea. To conclude, multi-task learning techniques can generally help improve prediction performance for protein subcellular localization in comparison with supervised single-task learning techniques. Furthermore, the relatedness of tasks may affect the final performance in the multi-task learning framework.

    Liu et al. [39] proposed a cross-platform model based on multi-task linear regression. They applied the model to multiple datasets for small interfering RNA (siRNA) efficacy prediction. Given a representation of siRNAs as feature vectors, a linear ridge regression model was applied to predict novel siRNA efficacy from a set of siRNAs with known efficacy. Experiments were run to compare the proposed multi-task learning with traditional single task learning based on root mean squared error. It was shown that in siRNA efficacy prediction, the efficacy distribution varies across siRNAs binding to different mRNAs, and that common properties across different siRNAs influence potent siRNA design.

    One way to represent gene expression data is via a matrix, where samples are rows and gene expression conditions are columns. Each sample (row) belongs to one of two categories: control or case. The objective of gene expression classification is to accurately classify new samples, based on the conditions, as case or control. This problem is particularly challenging because the data is highly imbalanced in its dimensions: the number of samples is often smaller than the number of features (columns). Furthermore, the data can be very noisy.

    Chen et al. [40] proposed a multi-task method, known as multi-task support vector sample learning (MTSVSL), and demonstrated that the genes selected by MTSVSL yield superior classification performance on cancer gene expression data. They started by extracting significant samples that reside on support vectors, and then learned two tasks simultaneously using multiple back-propagation neural networks. They configured their system so that the output of the second task can help refine the original model. The first task is aimed at answering “what kind of sample is this?” and the second task asks “is this sample a support vector sample?” Their experimental results show that the second task can improve classifier performance by incorporating the generated bias with the original bias obtained from the first task.

    In the area of genetic analysis, one of the main problems studied today is GWAS. In this area, Puniyani et al. [41] developed a novel multitask regression based technique to perform joint GWAS mapping on individuals from multiple populations. Initially assuming that there is no population structure, a lasso objective can be formulated as:

    image

    where $\beta_t$ denotes the in-population association strength and $t \in P$ denotes the population index. A limitation of this formulation is that the model analyzes each population separately without taking advantage of the relatedness among the shared causal SNPs, which may lead it to miss weak association signals for common SNPs. Puniyani et al. [41] subsequently introduced an L1/L2 regularizer for detecting SNPs across the different datasets:

    image

    In the above equation, B is an m × P matrix, m is the number of SNPs genotyped, and the j-th row $\beta_j$ collects the coefficients of the j-th SNP across the P populations. Here, the L1/L2 penalty ensures that the coefficients $\beta_j$ for the j-th SNP are shrunk jointly across all populations. This helps reduce the number of false positives.
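
    To see how a joint penalty on the rows of B can remove a SNP from all populations at once, the sketch below applies row-wise group soft-thresholding, the standard proximal step for a penalty equal to the sum of row 2-norms. This is a generic illustration of that penalty family and not the optimization procedure of [41]; the function name and the threshold `tau` are assumptions.

```python
import numpy as np

def group_soft_threshold(B, tau):
    """Shrink each row of B toward zero by tau in 2-norm.

    B   : array of shape (n_snps, n_populations)
    tau : non-negative threshold (hypothetical tuning parameter)
    Rows whose 2-norm is at most tau become exactly zero.
    """
    B = np.asarray(B, dtype=float)
    norms = np.linalg.norm(B, axis=1, keepdims=True)
    # Guard against division by zero for rows that are already zero.
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return B * scale
```

    A SNP with weak coefficients in every population is dropped entirely, while a SNP with a consistent signal across populations keeps a nonzero row, which matches the intuition of sharing evidence across populations.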

    Both gene expression data analysis and genetic analysis suffer from a lack of sample data due to the high cost of data collection in biology. However, the existence of gene expression data under various platforms or conditions, and genome-wide SNP data of different populations provides us with an opportunity to explore relatedness among multiple datasets to alleviate this data scarcity problem. So far, there have only been a few works in this field [40-42] that use transfer learning methods in machine learning. Future studies are expected to explore how to find the most informative genes or SNPs when classifying different sample conditions by investigating their global characteristics among multiple datasets. This objective can be achieved by regularization-based approaches, learning common feature representation approaches, and distribution matching approaches.

    IV. SYSTEMS BIOLOGY

    In recent years, the application of transfer learning to systems biology has become increasingly popular. Transfer learning techniques implemented by task regularization, distribution matching, matrix factorization and Bayesian approaches have all been employed in computational systems biology.

    Gene interaction network analysis has been very useful for gaining insights into various cellular properties. In 2005, Tamada et al. [43] utilized evolutionary information from two organisms to reconstruct their individual gene networks. Suppose that we have two organisms A and B with respective gene expression data $D_A$ and $D_B$. The networks of the two organisms, $G_A$ and $G_B$, are built simultaneously by a hill-climbing algorithm that maximizes the posterior probability $P(G_A, G_B \mid D_A, D_B, H_{AB})$, where $H_{AB}$ models the evolutionary information shared between A and B. In order to calculate $P(H_{AB} \mid G_A, G_B)$ based on the gene expression data $D_A$ and $D_B$, two free parameters need to be set. In their approach, these two parameters are chosen empirically.

    As a follow-up, in 2008, Nassar et al. [44] proposed a new score function that captures the evolutionary information shared between A and B by a single parameter β, instead of two free parameters. The inputs of their multitask learning algorithm are the data samples D of the given organism, an input directed acyclic graph (DAG) $G_{\mathrm{in}}$ of the other organism, and a similarity parameter β. The parameter β in their work represents the similarity of the true underlying Bayesian networks. The output of the learning algorithm is an improved DAG structure $G_{\mathrm{out}}$ for the target organism. The procedure can be repeated for both organisms, using the output DAG of one organism as the input for the other.

    Kato et al. [45] considered multiple assays where learning via the sharing of local knowledge occurs, reflected by learning weights in the formula below:

    image

    where $V_t$ represents the local model of a target node t. Intuitively, this formulation employs transfer learning of the target task with the help of its neighbors’ tasks.

    Qi et al. [46] proposed a semi-supervised multi-task framework to predict protein-protein interactions from partially labeled reference sets. The basic idea is to perform multitask learning on a supervised classification task and a semi-supervised auxiliary task via a regularization term. This is equivalent to learning two tasks jointly while optimizing the following loss function:

    image

    Protein-protein interaction prediction is an important aspect of systems biology. Besides the work of Qi et al. [46], Xu et al. [47] also explored how to use multitask learning in this area via a technique known as Collective Matrix Factorization (CMF) [48]. These methods use similarities of the proteins in two interaction networks as the corresponding shared knowledge. They showed that when the source matrix is sufficiently dense and similar to the target PPI network, transfer learning is effective for predicting protein-protein interactions in a sparse network. Consider a similarity matrix $S_{m \times n}$ introduced as the correspondence between networks G and P. The rows and columns of $S_{m \times n}$ correspond to proteins in networks G and P, respectively, and the element $S_{ij}$ represents the similarity between node i in network G and node j in network P. The collective matrix factorization method reconstructs the matrices $X_t \approx f_1(ZV^T)$ and $X_\alpha \approx f_2(UV^T)$ together by sharing the common factor V. The objective of collective matrix factorization is then to minimize the regularized loss:

    image

    where

    image
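
    The idea of sharing the factor V between two reconstructions can be sketched as a single loss function. The example below assumes identity link functions, squared reconstruction error, a mixing weight between the two matrices and a squared-norm regularizer; these choices and all parameter names are assumptions for illustration, not the exact objective used in [47].

```python
import numpy as np

def cmf_loss(X_t, X_a, Z, U, V, weight=0.5, reg=0.1):
    """Regularized collective matrix factorization loss with a shared factor V.

    X_t : target matrix,    approximated by Z @ V.T
    X_a : auxiliary matrix, approximated by U @ V.T
    weight : hypothetical trade-off between the two reconstruction errors
    reg    : hypothetical regularization strength on all factors
    """
    res_t = X_t - Z @ V.T
    res_a = X_a - U @ V.T
    fit = weight * np.sum(res_t ** 2) + (1.0 - weight) * np.sum(res_a ** 2)
    penalty = reg * (np.sum(Z ** 2) + np.sum(U ** 2) + np.sum(V ** 2))
    return float(fit + penalty)
```

    Because V appears in both reconstruction terms, observed entries in the dense auxiliary matrix constrain the factor that is also used to fill in the sparse target matrix.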

    A sparse multitask regression approach was presented in [49], where a co-clustering algorithm is applied to gene expression data with phenotypic signatures. This algorithm can uncover the dependency between genes and phenotypes. A multitask learning framework with L1-norm regularization is given as follows:

    image

    where $T_d = T \cdot P_d$, which represents phenotype responses under different experimental conditions being projected onto the same low-dimensional space T. In this equation, the first term enforces a fit between the gene expression and the phenotypic signature under each condition, while the second term enforces sparsity on T.

    Bickel et al. [50] studied the problem of predicting HIV therapy outcomes of different drug combinations based on observed genetic properties of the patients, where each task corresponds to a particular drug combination. They proposed to jointly train models of different drug combinations by pooling data from all tasks and then using re-sampling weights to adapt the data to each particular task. The goal is to learn a hypothesis f_t : x → y for each task t by minimizing a loss function with respect to p(x, y | t), where x describes the genotype of the virus that a patient carries, together with the patient’s treatment history, and y is a class label indicating whether the therapy is successful. Simply pooling the available data for all tasks generates a training sample D = ⟨(x_1, y_1, t_1), ..., (x_TN, y_TN, t_TN)⟩. The approach creates a task-specific re-sampling weight r_t(x, y) for each element in the pool of examples.
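    A small sketch may help make the re-sampling weights concrete. It assumes that the weight is the density ratio r_t(x, y) = p(x, y | t) / p(x, y), which by Bayes' rule equals p(t | x, y) / p(t), and it estimates that ratio by simple counting over discrete examples. The counting estimator, the function name and the toy data are illustrative assumptions; a learned discriminative model would be needed for realistic, high-dimensional genotype data.

```python
from collections import Counter

def resampling_weights(pool, task):
    """Weight each pooled example (x, y, t) for the given target task.

    Assumed definition: r_t(x, y) = p(t | x, y) / p(t), with both
    probabilities estimated by relative frequencies in the pool.
    """
    n_task = sum(1 for _, _, t in pool if t == task)
    if n_task == 0:
        raise ValueError("task does not occur in the pool")
    p_task = n_task / len(pool)
    joint = Counter((x, y) for x, y, _ in pool)
    joint_task = Counter((x, y) for x, y, t in pool if t == task)
    return [joint_task[(x, y)] / joint[(x, y)] / p_task for x, y, _ in pool]

# Hypothetical toy pool: (genotype label, therapy success, drug combination).
pool = [
    ("mutA", 1, "drug1"),
    ("mutA", 1, "drug2"),
    ("mutB", 0, "drug1"),
    ("mutB", 1, "drug2"),
    ("mutA", 1, "drug1"),
]
weights = resampling_weights(pool, "drug1")
```

    Examples that never occur under the target task receive weight zero, while examples shared with other tasks are kept with a weight that reflects how typical they are for the target task. The weights would then multiply each example's loss when training the model for that task.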

    V. BIOMEDICAL APPLICATIONS

    Text mining provides rich ground for transfer learning and multi-task learning applications. In this area, several approaches to general text mining with transfer learning have been surveyed in [3].

    In the biomedical domain, one important area is semantic role labeling (SRL), in which systems label the roles of genes, proteins and other biological entities discussed in text. These systems are usually trained on manually annotated instances, which are rare and expensive to prepare. To solve this problem, Dahlmeier and Ng [51] formulated SRL in the biomedical domain as a transfer learning problem, in order to leverage existing SRL resources from the newswire domain. They employed three domain transfer learning methods: instance weighting, the augment method and instance pruning. Instance weighting [6, 52] is aimed at correcting the probability estimate for the target domain by weighting instances that occur in the auxiliary data sets. Domain adaptation methods [53] map features from the auxiliary and target domains to a common feature space where knowledge transfer is possible.
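    The feature-space mapping mentioned above can be sketched in a few lines. The example follows the general idea of the feature augmentation approach of [53], in which each feature vector is copied into a shared block and a domain-specific block; the function name and the dense NumPy representation are illustrative assumptions.

```python
import numpy as np

def augment(X, domain):
    """Map features into a common space: [shared | source-only | target-only]."""
    zeros = np.zeros_like(X)
    if domain == "source":
        return np.hstack([X, X, zeros])
    if domain == "target":
        return np.hstack([X, zeros, X])
    raise ValueError("domain must be 'source' or 'target'")

# Hypothetical feature matrices for newswire (source) and biomedical (target) text.
X_news = np.array([[1.0, 0.0], [0.0, 2.0]])
X_bio = np.array([[3.0, 1.0]])
X_train = np.vstack([augment(X_news, "source"), augment(X_bio, "target")])
```

    Any standard classifier can then be trained on X_train. Weights on the shared block capture behavior common to both domains, while the domain-specific blocks absorb the differences.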

    In addition to biomedical text mining applications of transfer learning, Bi et al. [54] formulated the detection of different types of clinically related abnormal structures in medical images as a multitask learning problem. Their method captures task dependence by sharing common feature representations, which was shown to be effective in eliminating irrelevant features and identifying discriminative features. Given TN tasks, for each task t, the sample set is composed of input features X_t and label vector y_t. Traditional single-task learning aims to minimize the objective function L(α_t, X_t, y_t) + λP(α_t) for each task individually, where the former term is a loss function and the latter term regularizes the complexity of the model. In [54], a family of joint learning algorithms can be derived by rewriting α_t = Cβ_t, where β_t is task-specific while C is a diagonal matrix with a diagonal vector equal to c ≥ 0. The objective function can be rewritten as:

    image

    where P_1 and P_2 are regularization operators, and C is a diagonal matrix with a diagonal vector equal to c ≥ 0. In other words, c is an indicator vector showing whether a feature is used in the model, which is then used as a common feature representation across all tasks.
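    The role of the shared vector c can be illustrated with a short sketch. It assumes a squared loss, a squared L2 penalty for P_1 and an L1 penalty for P_2, with hypothetical weights lam and gamma; these concrete choices are assumptions made for illustration and are not taken from [54].

```python
import numpy as np

def task_weights(c, B):
    """alpha_t = C beta_t for every task; row t of B holds beta_t."""
    return B * c  # broadcasting applies the diagonal of C to each row

def joint_objective(c, B, Xs, ys, lam=0.1, gamma=0.1):
    """Assumed form: sum_t ||X_t alpha_t - y_t||^2 + lam*||beta_t||^2, plus gamma*||c||_1."""
    A = task_weights(c, B)
    loss = sum(np.sum((X @ a - y) ** 2) for X, a, y in zip(Xs, A, ys))
    return loss + lam * np.sum(B ** 2) + gamma * np.sum(np.abs(c))

c = np.array([1.0, 0.0])                  # feature 2 is switched off for all tasks
B = np.array([[1.0, 5.0], [2.0, 7.0]])    # task-specific beta_t, one row per task
Xs = [np.eye(2), np.eye(2)]
ys = [np.array([1.0, 0.0]), np.array([2.0, 0.0])]
A = task_weights(c, B)
value = joint_objective(c, B, Xs, ys)
```

    Setting an entry of c to zero removes the corresponding feature from every task at once, which is how a single shared vector acts as a common feature representation.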

    In the field of sensor-based ubiquitous healthcare, and particularly in motion-sensor-based activity recognition, collecting user-labeled samples requires substantial manual effort and may raise privacy issues. Transfer learning is therefore an attractive approach to the data sparsity problem. One example is work that obtains the activity models for a new user by transfer learning [55]. In earlier work, transfer learning was employed to transfer the activity models learned for one user to another user. Van Kasteren et al. [56] described a simple method for transferring the transition probabilities of Markov models between two different spaces. The aim of that work is to recognize activities in a new space without the expensive data annotation and learning processes. Van Kasteren et al. [56] first mapped data between the two sensor networks via a mapping function. Their algorithm then learned parameters using the labeled data in the source space and unlabeled data in the target space. In that work, the mapping is generated manually and the structure of the hidden Markov models (HMMs) is pre-defined. In practice, however, both the mapping and the model structure should be learned.

    To address this problem, Rashidi and Cook [57] proposed an advanced transfer learning approach to transfer activity knowledge learned in one home to another home, which they call “Home to Home Transfer Learning” (HHTL). The main components of HHTL are illustrated in Fig. 4. As the figure shows, HHTL extracts and compresses activity models from the sensor data based on label information in a source domain where data is assumed to have been collected and labeled. This data can then be used to build activity models that transfer to the target domain. After extracting activities and their corresponding sensors, source activity models are mapped to target activity models using a semi-EM framework in an iterative manner. Initially, sensor mapping probabilities are adjusted based on activity mapping probabilities. Next, the activity mapping probabilities are adjusted based on the updated sensor mapping probabilities. A target activity’s label is determined to be the same as the source activity’s label when the mapping probability is maximized.

    Subsequently, Rashidi and Cook [58] modified the HHTL algorithm to transfer the knowledge of learned activities from multiple source physical spaces, for example from two homes A and B, to a target physical space, for example home C. The modified learning model (MHTL) has the same first three steps illustrated in Fig. 4, but in the final step, MHTL adopts a weighted majority voting scheme to assign the final labels to activities in the target domain.
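    The final voting step can be sketched as follows. The sketch assumes that each source space proposes one label for a target activity together with a non-negative weight; how MHTL derives those weights is not described here, so the weights, labels and function name below are hypothetical.

```python
from collections import defaultdict

def weighted_majority_vote(votes):
    """Return the label with the largest total weight.

    votes: iterable of (label, weight) pairs, one pair per source space.
    """
    totals = defaultdict(float)
    for label, weight in votes:
        totals[label] += weight
    if not totals:
        raise ValueError("at least one vote is required")
    return max(totals, key=totals.get)

# Hypothetical votes from three source homes for one target activity.
label = weighted_majority_vote(
    [("cooking", 0.40), ("sleeping", 0.35), ("sleeping", 0.30)]
)
```

    Here two moderately confident sources outvote a single, more confident one, which is the behavior that distinguishes weighted voting from simply taking the single best mapping.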

    VI. DATA SETS FOR TRANSFER LEARNING

    Several datasets have been released for further research on transfer learning in biological domains. We list several datasets available here.

    Gene Expression Data: Yeast gene expression data and human gene expression data are available from ([59], http://genome-www.stanford.edu/cellcycle/) and ([60], http://www.stanfore.edu/Human-CellCycle/Hela/), respectively. These two sets of gene-expression data can be used to reconstruct gene networks across different organisms.

    Protein-ligand Binding Data: One group of available data is composed of enzyme-ligand interaction data, G-protein-coupled receptor (GPCR)-ligand interaction data and ion channel-ligand interaction data (http://bioinformatics.oxfordjournals.org/content/24/19/2149/suppl/DC1). Table 1 lists the statistics of the data. Another group of released ligand interaction data contains four subsets for enzymes, ion channels, GPCRs and nuclear receptors [61] (http://web.kuicr.kyoto-u.ac.jp/supp/yoshi/drugtarget/). The ligand structure similarity matrices and protein sequence similarity matrices are also included in these datasets. Interested readers can use these datasets to build models that predict protein-ligand binding via a multi-task framework, or that predict protein-ligand binding by transferring knowledge from auxiliary data such as the ligand structure similarity matrices and protein sequence similarity matrices.

    HIV-1 and Human Interaction Data: Qi et al. [46] preprocessed an HIV-1 and human protein interaction dataset (http://www.cs.cmu.edu/qyj/HIVsemi/), and exploited a semi-supervised multitask method to predict interactions between HIV-1 and human proteins.

  • 1. Larranaga P, Calvo B, Santana R, Bielza C, Galdiano J, Inza I, Lozano J. A, Armananzas R, Santafe G, Perez A, Robles V 2006 “Machine learning in bioinformatics” [Briefings in Bioinformatics] Vol.7 P.86-112
  • 2. Caruana R. A 1993 “Multitask learning: a knowledge-based source of inductive bias” [Proceedings of the Tenth International Conference on Machine Learning] P.41-48
  • 3. Pan S. J, Yang Q 2010 “A survey on transfer learning” [IEEE Transactions on Knowledge and Data Engineering] Vol.22 P.1345-1359
  • 4. Blitzer J, Mcdonald R, Pereira F 2006 “Domain adaptation with structural correspondence learning” [Proceedings of the Conference on Empirical Methods in Natural Language Processing] P.120-128
  • 5. Do C. B, Ng A. Y 2005 “Transfer learning for text classification” [Proceedings of the 19th Annual Conference on Neural Information Processing Systems]
  • 6. Dai W, Yang Q, Xue G. R, Yu Y 2007 “Boosting for transfer learning” [Proceedings of the 24th International Conference on Machine Learning] P.193-200
  • 7. Woodland P. C 2001 “Speaker adaptation for continuous density HMMs: a review” [ISCA Tutorial and Research Workshop on Adaptation Methods for Speech Recognition]
  • 8. Xiao L, Bilmes J 2006 “Regularized adaptation of discriminative classifiers” [IEEE International Conference on Acoustics, Speech and Signal Processing] P.I237-I240
  • 9. Raina R, Battle A, Lee H, Packer B, Ng A. Y 2007 “Self-taught learning: transfer learning from unlabeled data” [Proceedings of the 24th International Conference on Machine Learning] P.759-766
  • 10. Yang J, Yan R, Hauptmann A. G 2007 “Cross-domain video concept detection using adaptive SVMs” [Proceedings of the 15th ACM International Conference on Multimedia] P.188-197
  • 11. Zheng V. W, Hu D. H, Yang Q 2009 “Cross-domain activity recognition” [Proceedings of the 11th ACM International Conference on Ubiquitous Computing] P.61-70
  • 12. Wang H. Y, Zheng V. W, Zhao J, Yang Q 2010 “Indoor localization in multi-floor environments with reduced effort” [Proceedings of the 8th IEEE International Conference on Pervasive Computing and Communications] P.244-252
  • 13. Donnes P, Elofsson A 2002 “Prediction of MHC class I binding peptides using SVMHC” [BMC Bioinformatics] Vol.3 P.25
  • 14. Aas K 2001 “Microarray Data Mining: A Survey”
  • 15. Xing E. P, Jordan M. I, Karp R. M 2001 “Feature selection for high-dimensional genomic microarray data” [Proceedings of the 8th International Conference on Machine Learning] P.601-608
  • 16. Yang W. H, Dai D. Q, Yan H 2011 “Finding correlated biclusters from gene expression data” [IEEE Transactions on Knowledge and Data Engineering] Vol.23 P.568-584
  • 17. Yang C, He Z, Wan X, Yang Q, Xue H, Yu W 2009 “SNPHarvester: a filtering-based approach for detecting epistatic interactions in genome-wide association studies” [Bioinformatics] Vol.25 P.504-511
  • 18. Wan X, Yang C, Yang Q, Xue H, Tang N. L, Yu W 2009 “MegaSNPHunter: a learning approach to detect disease predisposition SNPs and high level interactions in genome wide association study” [BMC Bioinformatics] Vol.10 P.13
  • 19. Krallinger M, Valencia A 2005 “Text-mining and information-retrieval services for molecular biology” [Genome Biology] Vol.6 P.224
  • 20. Orwat C, Graefe A, Faulwasser T 2008 “Towards pervasive computing in health care: a literature review” [BMC Medical Informatics and Decision Making] Vol.8 P.26
  • 21. Bao L, Intille S. S, Ferscha A, Mattern F 2004 “Activity recognition from user-annotated acceleration data” [Pervasive Computing: Second International Conference, PERVASIVE 2004, Linz/Vienna, Austria, April 21-23, 2004. Proceedings. Lecture Notes in Computer Science Vol.3001] P.1-17
  • 22. Ravi N, D N, Mysore P, Littman M. L 2005 “Activity recognition from accelerometer data” [Proceedings of the 17th Conference on Innovative Applications of Artificial Intelligence] P.1541-1546
  • 23. Yin J, Chai X, Yang Q 2004 “High-level goal recognition in a wireless LAN” [Proceedings of the 19th National Conference on Artificial Intelligence; Sixteenth Innovative Applications of Artificial Intelligence Conference] P.578-583
  • 24. Cho S. J, Oh J. K, Bang W. C, Chang W, Choi E, Jing Y, Cho J, Kim D. Y 2004 “Magic wand: a hand-drawn gesture input device in 3-D space with inertial sensors” [Ninth International Workshop on Frontiers in Handwriting Recognition] P.106-111
  • 25. Chen M, Huang B, Xu Y 2007 “Human abnormal gait modeling via hidden Markov model” [Proceedings of the 2007 International Conference on Information Acquisition] P.517-522
  • 26. Choi S. D, Lee A. S, Lee S. Y 2006 “On-line handwritten character recognition with 3D accelerometer” [Proceedings of the IEEE International Conference on Information Acquisition] P.845-850
  • 27. 2010 “Jog falls: a pervasive healthcare platform for diabetes management” P.94-111
  • 28. Consolvo S, Everitt K, Smith I, Landay J. A 2006 “Design requirements for technologies that encourage physical activity” [Conference on Human Factors in Computing Systems] P.457-466
  • 29. Evgeniou T, Pontil M 2004 “Regularized multi-task learning” [Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining] P.109-117
  • 30. Widmer C, Leiva J, Altun Y, Ratsch G 2010 “Leveraging sequence classification by taxonomy-based multitask learning” [Research in Computational Molecular Biology: 14th Annual International Conference]
  • 31. Schweikert G, Widmer C, Scholkopf B, Ratsch G 2008 “An empirical analysis of domain adaptation algorithms for genomic sequence” [Proceedings of the 22nd Annual Conference on Neural Information Processing Systems] P.522-534
  • 32. Jacob L, Vert J. P 2007 “Efficient peptide-MHC-I binding prediction for alleles with few known binders” [Bioinformatics] Vol.24 P.358-366
  • 33. Jacob L, Bach F, Vert J. P 2009 “Clustered multi-task learning: a convex formulation” [Proceedings of the 23rd Annual Conference on Neural Information Processing Systems]
  • 34. Widmer C, Toussaint N, Altun Y, Kohlbacher O, Ratsch G 2010 “Novel machine learning methods for MHC class I binding prediction” [Pattern Recognition in Bioinformatics: 5th IAPR International Conference]
  • 35. 2010 “Inferring latent task structure for Multitask Learning by Multiple Kernel Learning” [BMC Bioinformatics] Vol.11 P.S5
  • 36. Xu Q, Pan S. J, Xue H. H, Yang Q 2011 “Multitask learning for protein subcellular location prediction” [IEEE/ACM Transactions on Computational Biology and Bioinformatics] Vol.8 P.748-759 doi:10.1109/TCBB.2010.22
  • 37. Argyriou A, Evgeniou T, Pontil M “Multi-task feature learning” [Advances in Neural Information Processing Systems Vol.19: Proceedings of the 2006 Neural Information Processing System Conference] P.41-48
  • 38. Argyriou A, Evgeniou T, Pontil M 2008 “Convex multi-task feature learning” [Machine Learning] Vol.73 P.243-272
  • 39. Liu Q, Xu Q, Zheng V, Xue H, Cao Z, Yang Q 2010 “Multi-task learning for cross-platform siRNA efficacy prediction: an in-silico study” [BMC Bioinformatics] Vol.11 P.181 doi:10.1186/1471-2105-11-181
  • 40. Chen A, Huang Z. W “A new multi-task learning technique to predict classification of leukemia and prostate cancer” [Medical Biometrics. Lecture Notes in Computer Science Vol.6165] P.11-20
  • 41. Puniyani K, Kim S, Xing E. P 2010 “Multi-population GWA mapping via multi-task regularized regression” [Bioinformatics] Vol.26 P.i208-i216
  • 42. Xu Q, Xue H, Yang Q 2010 “Multi-platform gene expression mining and marker gene analysis” [International Journal of Data Mining and Bioinformatics]
  • 43. Tamada Y, Bannai H, Imoto S, Katayama T, Kanehisa M, Miyano S 2005 “Utilizing evolutionary information and gene expression data for estimating gene networks with Bayesian network models” [Journal of Bioinformatics and Computational Biology] Vol.3 P.1295-1313 doi:10.1142/S0219720005001569
  • 44. Nassar M, Abdallah R, Zeineddine H. A, Yaacoub E, Dawy Z “A new multitask learning method for multiorganism gene network estimation” [IEEE International Symposium on Information Theory] P.2287-2291
  • 45. Kato T, Okada K, Kashima H, Sugiyama M 2010 “A transfer learning approach and selective integration of multiple types of assays for biological network inference” [International Journal of Knowledge Discovery in Bioinformatics] Vol.1 P.66-80
  • 46. Qi Y, Tastan O, Carbonell J. G, Klein-Seetharaman J, Weston J 2010 “Semi-supervised multi-task learning for predicting interactions between HIV-1 and human proteins” [Bioinformatics] Vol.26 P.i645-i652 doi:10.1093/bioinformatics/btq394
  • 47. Xu Q, Xiang E. W, Yang Q “Protein-protein interaction prediction via collective matrix factorization” [Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine] P.62-67
  • 48. Singh A. P, Gordon G. J 2008 “Relational learning via collective matrix factorization” [Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining] P.650-658
  • 49. Zhang K, Gray J. W, Parvin B 2010 “Sparse multitask regression for identifying common mechanism of response to therapeutic targets” [Bioinformatics] Vol.26 P.i97-i105 doi:10.1093/bioinformatics/btq181
  • 50. Bickel S, Bogojeska J, Lengauer T, Scheffer T “Multi-task learning for HIV therapy screening” [Proceedings of the 25th International Conference on Machine Learning] P.56-63
  • 51. Dahlmeier D, Ng H. T 2010 “Domain adaptation for semantic role labeling in the biomedical domain” [Bioinformatics] Vol.26 P.1098-1104 doi:10.1093/bioinformatics/btq075
  • 52. Jiang J, Zhai C “Instance weighting for domain adaptation in NLP” [Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics] P.264-271
  • 53. Daume III H 2007 “Frustratingly easy domain adaptation” [Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics] P.256-263
  • 54. Bi J, Xiong T, Yu S, Dundar M, Rao R 2008 “An improved multi-task learning approach with applications in medical diagnosis” [Machine Learning and Knowledge Discovery in Databases: European Conference ECML PKDD 2008]
  • 55. Rashidi P, Cook D. J 2009 “Transferring learned activities in smart environments” [Proceedings of the 5th International Conference on Intelligent Environments] P.185-192
  • 56. Van Kasteren T. L. M, Englebienne G, Krose B. J. A 2008 “Recognizing activities in multiple contexts using transfer learning” [AAAI Fall Symposium Technical Report] P.142-149
  • 57. Rashidi P, Cook D. J 2010 “Home to home transfer learning” [Proceedings of the AAAI Plan, Activity and Intent Recognition Workshop]
  • 58. Rashidi P, Cook D. J 2010 “Home to home transfer learning”
  • 59. Spellman P. T, Sherlock G, Zhang M. Q, Iyer V. R, Anders K, Eisen M. B, Brown P. O, Botstein D, Futcher B 1998 “Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization” [Molecular Biology of the Cell] Vol.9 P.3273-3297
  • 60. Whitfield M. L, Sherlock G, Saldanha A. J, Murray J. I, Ball C. A, Alexander K. E, Matese J. C, Perou C. M, Hurt M. M, Brown P. O, Botstein D 2002 “Identification of genes periodically expressed in the human cell cycle and their expression in tumors” [Molecular Biology of the Cell] Vol.13 P.1977-2000
  • 61. Yamanishi Y, Araki M, Gutteridge A, Honda W, Kanehisa M 2008 “Prediction of drug-target interaction networks from the integration of chemical and genomic spaces” [Bioinformatics] Vol.24 P.i232-i240
  • 62. Kashima H, Yamanishi Y, Kato T, Sugiyama M, Tsuda K 2009 “Simultaneous inference of biological networks of multiple species from genome-wide data and evolutionary information: a semi-supervised approach” [Bioinformatics] Vol.25 P.2962-2968
  • 63. Mei S, Fei W, Zhou S 2011 “Gene ontology based transfer learning for protein subcellular localization” [BMC Bioinformatics] Vol.12 P.44
  • 64. Jacob L, Vert J. P 2008 “Protein-ligand interaction prediction: an improved chemogenomics approach” [Bioinformatics] Vol.24 P.2149-2156
  • 65. Zhao Z, Chen Y, Liu J, Liu M 2010 “Cross-mobile ELM based activity recognition” [International Journal of Engineering and Industries] Vol.1 P.30-38
  • [Fig. 1.] Illustrating source and target domains in biology (adopted from the invited talk by Gunnar Ratsch at the NIPS Transfer Learning Workshop, December 2009, Whistler, B.C.; Schweikert et al. [31]).
  • [Fig. 2.] Four domain adaptation models (adopted from the invited talk by Gunnar Ratsch at the NIPS Transfer Learning Workshop, December 2009, Whistler, B.C.; Schweikert et al. [31]).
  • [Fig. 3.] Summary of determined performances.
  • [Fig. 4.] Main components of Home to Home Transfer Learning (HHTL) for transferring activities from a source space to a target space (Rashidi and Cook [57]).
  • [Table 1.] Statistics of datasets