Gene Expression Data Analysis Suite (GEDAS)


Data Visualization Techniques

A study of the available visualization techniques and their importance in the analysis of gene expression data was carried out.  Different algorithms use different visualization techniques for analysing the data.  Some of the classical references are Eisen et al. (1998), Tavazoie et al. (1999), Toronen et al. (1999), Kohonen et al. (1996), GEDA, de Hoon et al. (2004), Caron et al. (2001), Chen (2004), Dopazo (1999), Luo et al. (2003), SNOMAD of Colantuoni et al. (2000), Ewing and Cherry (2001), Kapushesky et al. (2004) and Cleaver 1.0 of Stanford Biomedical Informatics (2004), among others.

Earlier, the same visualization technique was sometimes used differently.  One example is the SOM output, where the implementation in the GEPAS server differs from the initial SOM version developed by Kohonen et al.  While the former brought out clear and unique clusters as a hexagonal grid, the latter produced many null clusters as well as similarly behaving clusters, which required an altogether different visualization, with a special colour coding and mapping scheme, to generate a heat map for proper identification.  This helped in identifying similar clusters.

Kaski (1997) applied SOM to various kinds of datasets, emphasized visualization through SOM, and compared k-means with SOM.  The work also discussed non-linear projection methods such as MDS, Sammon’s mapping, principal curves (a generalization of principal components), the triangulation method and replicator neural networks.

 

Figure 1: Profile plots.  (a) One histogram for each sample; the plots indicate that all samples follow a normal distribution.  (b) The overall frequency vs. gene expression plot of the entire dataset matrix should also produce a normal distribution.

The GEDAS software provides a platform that brings all these visualization techniques together for the analysis of preprocessed data and of processed data clusters, through group plots and individual plots.  The user can switch easily from one output form to another by selecting the corresponding button.  Images of the visualizations can be saved so that they can be incorporated into reports.  The software can store the output results either individually for each visualization technique or as an integrated report.  All outputs can be viewed on screen or saved in Acrobat PDF format for future use.  For instance, histograms of all samples can be viewed in the GEDAS software to check that they follow the normal distribution, as shown in the profile plots in Figure 1.
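The per-sample histogram check described above can be illustrated with a minimal sketch. This is not the GEDAS implementation; the function names (`histogram`, `skewness`), the synthetic data and the use of skewness as a rough symmetry indicator are all assumptions made for illustration, using only the Python standard library.

```python
import random
import statistics


def histogram(values, n_bins=10):
    """Return (bin_edges, counts) for equal-width bins spanning the data range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to width 1 for constant data
    counts = [0] * n_bins
    for v in values:
        # Clamp so that the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


def skewness(values):
    """Population skewness; values near 0 are consistent with a symmetric shape."""
    m = statistics.mean(values)
    s = statistics.pstdev(values)
    if s == 0:
        return 0.0
    return sum((v - m) ** 3 for v in values) / (len(values) * s ** 3)


# Hypothetical data: 3 samples with 1000 synthetic expression values each.
random.seed(0)
samples = {
    f"sample_{i}": [random.gauss(8.0, 1.5) for _ in range(1000)]
    for i in range(1, 4)
}

for name, values in samples.items():
    edges, counts = histogram(values, n_bins=8)
    print(name, "skewness = %.2f" % skewness(values))
    # Text rendering: one row per bin, labelled with the bin's left edge.
    for left, count in zip(edges, counts):
        print("  %6.2f | %s" % (left, "#" * (count // 10)))
```

A roughly bell-shaped bar pattern and a skewness close to zero are only informal indications of normality; a formal test would be needed for a firm conclusion.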

 

Figure 2: Group plots provide flexibility to view the differential gene expression in the data by means of plots such as mean vs. mean, mean vs. variance and variance vs. variance of two or more groups.  Group comparisons shown: BRCA1 vs. BRCA2, BRCA1 vs. Sporadic, BRCA2 vs. Sporadic.  Plot types shown: Mean (Group1) vs. Mean (Group2), Mean (Group1) vs. Variance (Group2), Variance (Group1) vs. Variance (Group2).

It was observed that while one software suite presented its results well in terms of statistical computation, another presented them as equivalent graphs, and yet another presented the output in an altogether different graphical form.  Graphs of great significance in biological data mining, such as group plots and MA plots, have been incorporated in the GEDAS software (Figures 2 and 3).
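As a rough illustration of what a group plot displays, the sketch below computes, for every gene, the mean and variance within two groups of samples and pairs them up as the coordinates of mean vs. mean, mean vs. variance and variance vs. variance plots. The function names, the tiny matrix and the column assignments are hypothetical, and the use of the sample variance is an assumption; GEDAS may compute these quantities differently.

```python
import statistics


def group_stats(matrix, columns):
    """Per-gene (mean, sample variance) over the columns of one group."""
    stats = []
    for row in matrix:
        values = [row[c] for c in columns]  # needs at least 2 columns
        stats.append((statistics.mean(values), statistics.variance(values)))
    return stats


def group_plot_points(matrix, group1, group2):
    """Return the (x, y) coordinates of the three group plots, one point per gene."""
    s1 = group_stats(matrix, group1)
    s2 = group_stats(matrix, group2)
    return {
        "mean_vs_mean": [(a[0], b[0]) for a, b in zip(s1, s2)],
        "mean_vs_variance": [(a[0], b[1]) for a, b in zip(s1, s2)],
        "variance_vs_variance": [(a[1], b[1]) for a, b in zip(s1, s2)],
    }


# Hypothetical expression matrix: rows are genes, columns are samples.
matrix = [
    [1.0, 3.0, 2.0, 2.0],
    [4.0, 4.0, 1.0, 5.0],
    [0.5, 1.5, 6.0, 8.0],
]
group1 = [0, 1]  # assumed column indices of the first group
group2 = [2, 3]  # assumed column indices of the second group

points = group_plot_points(matrix, group1, group2)
for plot_name, coords in points.items():
    print(plot_name, coords)
```

In the mean vs. mean view, genes far from the diagonal are the ones whose average expression differs most between the two groups.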

The visualization techniques provided through this work are described in detail in the subsequent sections, under standardized headings such as description, visual output example, complexity, special features, advantages and demerits, applicability and references.  A number of other visualization techniques exist that have not been incorporated in this work.

 

Figure 3: The MA group plot provides a view of the amount of spread in the overall gene expression pattern in the data groups.  These plots can be obtained for data before transformation, and cannot be obtained for negative or zero data.
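The restriction to positive data is consistent with the conventional definition of an MA plot, which is assumed here rather than taken from the GEDAS source: for two intensities x and y, M = log2(x) - log2(y) and A = (log2(x) + log2(y)) / 2, and the logarithm is undefined for zero or negative values. A minimal sketch with hypothetical function names and made-up values:

```python
import math


def ma_values(x, y):
    """Return (M, A) for one pair of positive intensities.

    M is the log ratio and A the average log intensity (conventional
    definition, assumed here).
    """
    if x <= 0 or y <= 0:
        raise ValueError("MA values need strictly positive data")
    lx, ly = math.log2(x), math.log2(y)
    return lx - ly, 0.5 * (lx + ly)


def ma_points(values1, values2):
    """Return (points, skipped): MA coordinates plus the count of unusable pairs."""
    points, skipped = [], 0
    for x, y in zip(values1, values2):
        if x <= 0 or y <= 0:
            skipped += 1  # zero or negative data cannot be log-transformed
            continue
        points.append(ma_values(x, y))
    return points, skipped


# Hypothetical per-gene values for two groups, before any log transformation.
group1_values = [8.0, 0.0, 4.0]
group2_values = [2.0, 1.0, 4.0]

points, skipped = ma_points(group1_values, group2_values)
print("MA points:", points)
print("pairs skipped:", skipped)
```

Skipping unusable pairs is one possible design choice; raising an error for the whole dataset, as `ma_values` does for a single pair, is the stricter alternative.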

 

Features table

The views available for each algorithm are consolidated in the following table.  Click on a visualization technique to view its detailed description:

 

Visualization/Algorithm

Columns: Raw data, Pre-processed data, SOM, K-Means, LVQ, HC, PCA (gene), SVM

Rows: Histogram, Checks view, Microarray, Whole sample, Proximity map, Temporal (incl. zoomed cluster view), Textual, PC view, Eigen graph, Tree view, Scatter plot, Decision (or search) space, Box-Whisker plot, Gene Ontology

Look for improved tree view visualization here. 

