Gene Expression Data Analysis Suite (GEDAS)
Motivation and Introduction
Although the primary focus of this work was the application of LVQ to modeling gene expression data, different variants of the SOM and LVQ algorithms were applied to datasets related to breast cancer, mouse (Mus musculus), Arabidopsis thaliana, Homo sapiens, sugarcane, etc. The LVQ1 algorithm gave the best results compared with the other variants of LVQ and with its unsupervised counterpart, SOM.
LVQ was then used to enhance, or fine-tune, the map generated by SOM. When SOM was used as a pattern classifier, the accuracies obtained in the various SOM experiments improved considerably after LVQ was applied.
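The fine-tuning step can be illustrated with a minimal sketch of the standard LVQ1 update rule applied to an already labelled codebook (for example, the node weights of a trained SOM). This is not the GEDAS implementation: the function names, the linearly decaying learning rate and the default parameters are assumptions made for illustration only.

```python
import math

def nearest(codebook, x):
    """Index of the codebook vector closest to x (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda i: math.dist(codebook[i], x))

def lvq1_fine_tune(codebook, labels, samples, targets, alpha=0.1, epochs=20):
    """Refine labelled codebook vectors with the LVQ1 rule; returns a new list.

    codebook: list of vectors (e.g. SOM node weights); labels: class of each vector.
    alpha and epochs are illustrative defaults, not values taken from the source.
    """
    book = [list(w) for w in codebook]
    for epoch in range(epochs):
        rate = alpha * (1 - epoch / epochs)  # assumed schedule: linear decay
        for x, t in zip(samples, targets):
            c = nearest(book, x)
            # LVQ1: attract the winner if its label is correct, repel it otherwise
            sign = 1.0 if labels[c] == t else -1.0
            book[c] = [w + sign * rate * (xi - w) for w, xi in zip(book[c], x)]
    return book

def classify(codebook, labels, x):
    """Label of the codebook vector nearest to x."""
    return labels[nearest(codebook, x)]
```

Only the winning vector is moved for each sample, which is what distinguishes LVQ1 from LVQ2 and LVQ3, where two neighbouring vectors may be updated together.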
Differentially expressed genes can also be extracted from a given dataset. This work also brings together, on one platform, several visualization techniques that are easy to learn and can help researchers track down the desired clusters/classes of genes. The popular visualization of Eisen et al. (1998), named Tree View, was enhanced with additional features, such as generating clusters with more than two genes under the same parent node and using the gene expression data matrix, instead of the distance matrix, as input.
The enormous growth of bio-molecular databases makes it increasingly important to have fast methods to process, analyze and understand such massive amounts of data. In the domain of genomics, microarray gene expression data processing is a new area of interest for many researchers. Although a lot of progress has been made in the microarray manufacturing process and, to some extent, in mining data from these microarrays, very few software suites, commercial or public, make different tools available under one environment.
The motivation behind this software is the well-established work implemented in the GEDA suite of the University of Pennsylvania, the GEPAS suite of CNIO, Spain, the Cluster 3.0 software by de Hoon et al. (2004), J-Express, Cleaver 1.0, SNOMAD of Colantuoni et al. (2000) and Expression Profiler: Next Generation by the European Bioinformatics Institute (EBI), UK. The software was primarily based on the following ideas:
· Application of learning vector quantization (LVQ), the popular ANN model for classification, in its three variants, viz., LVQ1, LVQ2 and LVQ3.
· Bringing a number of data mining algorithms under one umbrella software environment. The algorithms include SOM, LVQ, k-means, hierarchical clustering, SVM and PCA.
· Support for a number of visualization techniques and gene expression data preprocessing algorithms, together with a set of 19 distance measures.
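The 19 distance measures are not enumerated here. As an illustration of what such measures compute on expression profiles, the sketch below shows three that are commonly used in gene expression analysis; whether these exact three are among those in GEDAS is an assumption, and the function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences across conditions."""
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson_distance(a, b):
    """1 - Pearson correlation: 0 for perfectly correlated profiles, 2 for anti-correlated.

    Undefined (division by zero) when either profile is constant.
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)
```

Correlation-based distances compare the shape of two profiles regardless of magnitude, whereas Euclidean and Manhattan distances are sensitive to absolute expression levels.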
A number of software packages for processing DNA microarray gene expression data were studied, and it was found that most of them perform similar activities, differing only in certain preprocessing steps and visualization forms. This motivated the standardization of inputs, processing and outputs. An attempt was made, for the first time, to standardize all the possible visualization techniques and extend them to the available data mining algorithms (including the well-known ANN techniques) such as SOM, LVQ, k-means, HC, SVM and PCA/SVD; the same approach can be extended to any new algorithm.
During the process, the input format of the dataset was harmonized and all the data mining algorithms were brought under a single umbrella software. While plots/graphs such as the scatter plot and histogram have long existed, the temporal (or wave) graph, tree view, tree map, whole genome view, etc. have been standardized, developed and carefully integrated into the software.
Another important and noticeable inclusion was the representation of hierarchical clustering output in the form of a temporal (or wave) graph. The level number of the tree (or dendrogram) was taken as reference input from the user to generate the corresponding temporal graph. With this standardization, it became possible to exhibit the results using other techniques as well. A similar concept was described by Luo et al. (2003), who implemented a blend of hierarchical clustering and self-organizing maps called the hierarchically growing self-organizing tree (HGSOT).
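One way to picture the level-based cut is the sketch below. It assumes the dendrogram is held as nested tuples of gene names (so a parent may have more than two children) and that "level" means depth below the root; both are assumptions for illustration, as are all function names. Each resulting cluster is reduced to a mean expression profile, which is the curve a temporal graph would draw.

```python
def leaves(node):
    """All gene names under a node; a leaf is a plain string."""
    if isinstance(node, str):
        return [node]
    return [gene for child in node for gene in leaves(child)]

def cut_at_level(tree, level):
    """Clusters obtained by cutting the tree `level` steps below the root."""
    if level == 0 or isinstance(tree, str):
        return [leaves(tree)]
    return [cluster for child in tree for cluster in cut_at_level(child, level - 1)]

def mean_profiles(clusters, expression):
    """Average expression per time point for each cluster (one curve per cluster)."""
    profiles = []
    for genes in clusters:
        rows = [expression[g] for g in genes]
        profiles.append([sum(col) / len(rows) for col in zip(*rows)])
    return profiles

# Hypothetical toy data: five genes measured at three time points
tree = (("g1", "g2"), ("g3", ("g4", "g5")))
expression = {
    "g1": [1.0, 2.0, 3.0], "g2": [3.0, 4.0, 5.0],
    "g3": [0.0, 0.0, 0.0], "g4": [2.0, 2.0, 2.0], "g5": [4.0, 4.0, 4.0],
}
clusters = cut_at_level(tree, 1)   # [['g1', 'g2'], ['g3', 'g4', 'g5']]
curves = mean_profiles(clusters, expression)
```

A deeper level yields more, smaller clusters; once a branch runs out of depth, its leaves are returned as single-gene clusters.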
The support vector machine approach for solving classification tasks was introduced by Cortes and Vapnik (1995). The SVMLight software, which implements various kernels, is publicly available (Joachims, 1999). SVMTorch also provides good support for large-scale regression problems (Collobert and Bengio, 2001). The SVMLib software of Chang and Lin (2004) was used in this work; its inputs and outputs still need to be standardized. Four different forms of SVM were incorporated in GEDAS, viz., linear, polynomial, radial basis and sigmoid. The SVM portion of the software tool remains to be completed by adding suitable visualization techniques and standardizing the input and output formats.
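The four forms differ in the kernel function used to compare two expression profiles. The sketch below writes out the usual textbook definitions of these kernels; the parameter names (gamma, coef0, degree) and their default values are assumptions for illustration and are not taken from GEDAS.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def linear(u, v):
    """K(u, v) = u . v"""
    return dot(u, v)

def polynomial(u, v, gamma=1.0, coef0=0.0, degree=3):
    """K(u, v) = (gamma * u . v + coef0) ** degree"""
    return (gamma * dot(u, v) + coef0) ** degree

def radial_basis(u, v, gamma=1.0):
    """K(u, v) = exp(-gamma * ||u - v||^2)"""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def sigmoid(u, v, gamma=1.0, coef0=0.0):
    """K(u, v) = tanh(gamma * u . v + coef0)"""
    return math.tanh(gamma * dot(u, v) + coef0)

def gram_matrix(samples, kernel, **params):
    """Pairwise kernel values for a list of samples."""
    return [[kernel(u, v, **params) for v in samples] for u in samples]
```

For example, `gram_matrix(profiles, radial_basis, gamma=0.5)` would give the matrix of pairwise similarities that an SVM trainer operates on, where `profiles` is a hypothetical list of expression vectors.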