CSE 527, Au '03: Reading #2: What Students Found

University of Washington Department of Computer Science & Engineering

I asked for brief reports on good microarray papers. Here's what you found:

Golub, Slonim, Tamayo, Huard, Gaasenbeek, Mesirov, Coller, Loh, Downing, Caligiuri, Bloomfield and Lander, Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring, SCIENCE, Vol 286, 15 October 1999

The report describes two new approaches in cancer research using gene expression monitoring. "Class prediction" assigns samples to known tumor types (e.g. AML vs. ALL). "Class discovery" makes it possible to find new subclasses of tumor types using gene expression data alone.

Class prediction: It is crucial to apply the correct therapy to cancer patients, and different types of cancer require different treatments. So far, the distinction between AML and ALL has required a lot of effort and an experienced expert drawing on a large combination of analysis data. The authors searched for genes whose expression pattern was correlated with the class distinction. The 50 most correlated genes ("informative genes"), selected using 38 samples, were used to create a class predictor. To improve the accuracy, cross-validation was applied.
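To make the gene-ranking step concrete, here is a minimal sketch of selecting "informative genes" by how well their expression separates two classes. The scoring function (difference of class means divided by the sum of class standard deviations) is an assumption chosen for illustration, as are the function names and the toy data; the report above does not spell out the exact statistic, and this is not the authors' full predictor.

```python
from statistics import mean, stdev


def signal_to_noise(values_a, values_b):
    """Score one gene: difference of class means over the sum of class std devs."""
    spread = stdev(values_a) + stdev(values_b)
    return (mean(values_a) - mean(values_b)) / spread if spread else 0.0


def top_informative_genes(expression, labels, k):
    """Return (top-k gene names by absolute score, all scores).

    expression: dict mapping gene name -> list of expression values, one per sample
    labels: list of class labels (exactly two distinct values), one per sample
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("expected exactly two classes")
    scores = {}
    for gene, values in expression.items():
        in_a = [v for v, lab in zip(values, labels) if lab == classes[0]]
        in_b = [v for v, lab in zip(values, labels) if lab == classes[1]]
        scores[gene] = signal_to_noise(in_a, in_b)
    ranked = sorted(scores, key=lambda g: abs(scores[g]), reverse=True)
    return ranked[:k], scores


# Hypothetical toy data: three genes measured in six samples.
toy_labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
toy_expression = {
    "gene_high_in_ALL": [9.0, 10.0, 11.0, 1.0, 2.0, 3.0],
    "gene_high_in_AML": [1.0, 2.0, 1.5, 8.0, 9.0, 8.5],
    "gene_noise": [5.0, 1.0, 9.0, 5.0, 9.0, 1.0],
}
top_genes, gene_scores = top_informative_genes(toy_expression, toy_labels, k=2)
```

With 6817 genes and only 38 samples, some genes will score highly by chance, which is the concern raised in the critique; a sketch like this would need a permutation check on top of it before the ranking could be trusted.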

Class discovery: Self-organizing maps (SOMs) were used to cluster the 38 samples. First, the power of SOMs was determined by searching for the AML-ALL distinction (unsupervised learning). To check whether unknown classes found by the SOM reflect true structure, the performance of a class predictor based on these classes was assessed. The "prediction strengths" (indicating the probability of the classification) of the predictions were relatively high.

Critique:
Positive:

"Neighborhood analysis" indicates whether correlation of genes is merely a statistical artifact or indeed meaningful.
The accuracy of the predictor is improved by applying cross-validation.
The "prediction strength" can be regarded as an indicator for the quality/probability of the result.
The results that were gained -- i.e. the set of informative genes -- are of great value to future biological research.
The idea of clustering gene expression data according to clinical outcome helps in discovering the purpose of specific genes and allows better patient treatment.
The power of SOMs was assessed using a known classification.
Even for unknown classifications, the quality of the results was determined by calculating the "prediction strengths".

Negative:

6817 genes were sorted by correlation and the 50 most correlated genes were used to build a predictor. Not all 50 genes might be meaningful. Since the number of genes (6817) is much higher than the number of samples (38), it is very likely that the expression of some genes matches the class distinction coincidentally.
The number of genes used (50) is arbitrary. Further research could determine the optimal number.
Due to the small sample size (38), small errors (weak purity, ...) have a strong influence on the design of the predictor.
The prediction strengths of samples of one laboratory were significantly lower than those of others. These were not included. This indicates that the results of the classification depend heavily on the circumstances of the experiments.
The number of samples was far too low to train a neural network effectively, especially when the number of genes is so large.
The neural network is likely to be insufficiently trained.

Bergstrom, D.A., Penn, B.H., Strand A., Perry R.L., Rudnicki, M.A., Tapscott S.J., Promoter-Specific Regulation of MyoD Binding and Signal Transduction Cooperate to Pattern Gene Expression, Mol. Cell, 9: (2002) 587-600

Mammalian fibroblast cells null for myoD were made to express the protein. These cells were compared to cells not expressing the protein. The analysis involved performing array analysis at time points throughout the differentiation of the cells into muscle. This allowed the formation of clusters based on the patterns of expression. These experiments were repeated in the presence of cycloheximide to inhibit new protein synthesis. The experiments were repeated a final time and modified to include a drug that inhibits the protein p38. In the end a set of targets was determined to be myoD dependent. Additionally, a portion of those were shown to require p38 for their response to myoD. Several of the genes were further analyzed. MyoD binding was associated with increased histone H3 and H4 phosphorylation. Additionally, p38 was not required for binding of myoD to these promoters. Clusters based on similarity in the pattern of expression were shown to be enriched for genes involved in similar functions.

This neat paper on using database-obtained microarray data for further analysis might be of interest to some people in the class:
Stuart et al. 2003. "A Gene-Coexpression Network for Global Discovery of Conserved Genetic Modules." Science, 302: 249-255.

Louis M. Staudt, M.D., Ph.D., Molecular Diagnosis of the Hematologic Cancers, NEJM

A friend recommended this article to me over the summer, so this seemed like a good opportunity to finally read it. It's a review article describing the many ways that microarray analysis is impacting diagnosis and treatment of one class of cancers.

The article briefly introduced how microarrays work and how both supervised and unsupervised analysis of the data can provide valuable information. With this as background the article gives concrete examples about how microarray analysis is useful for:

identifying clinically different subtypes of a disease previously diagnosed as a single disease
predicting survival rates at the time of diagnosis
selecting or altering treatment options based on the molecular profile of the disease

The authors argue that microarray analysis should be further integrated into clinical diagnosis. Arrays capable of measuring the expression of only hundreds of chosen genes could provide great benefit for selecting treatments and understanding prognosis. Further, they argue that studies of new treatment methods and drugs should include genomic-scale gene-expression profiling in order to identify molecularly distinct subgroups of the disease which have different responses to the treatment.

Overall I found this to be a great article that surveys the many ways that microarrays are impacting cancer research and treatment.

Nguyen, Danh V et al., DNA microarray experiments: Biological and Technological aspects, Biometrics 58, 701-717, December 2002.

http://proquest.umi.com/pqdlink

For better understanding of the DNA microarray technology I decided to read this review paper. It gives a clear description of the different arrays and the different experimental procedures.

From this review I learned about several pitfalls and biases, which are related to this technique and may influence the results and the conclusions of studies using this technique. For example:

The cDNA microarray technique that uses direct labeling (fluorescent dye-labeled nucleotides are incorporated directly into the cDNA samples during reverse transcription) is associated with several problems:
- Since the labeled nucleotide is dUTP, the number of labeled nucleotides in any cDNA depends on the number of A's in the RNA strand.
- The number of labeled nucleotides also depends on the length of the transcript.
These cause a stronger signal in long transcripts or transcripts with many A's, which is not related to the level of gene expression and may affect the study results significantly.
The analysis of the signal can also be a source of bias, since the results may change significantly depending on how the signal is measured, quantified, and separated from the background.
Variation from spot to spot, array to array, and subject to subject. Many microarray studies pay too little attention to these variations, which can significantly bias the results and conclusions.
Lack of normalization in many experimental steps, which can introduce systematic biases.

R. J. Cho, M. Huang, M. J. Campbell, H. Dong, L. Steinmetz, L. Sapinoso, G. Hampton, S. J. Elledge, R. W. Davis and D. J. Lockhart, Transcriptional regulation and function during the human cell cycle, Nat Genet. 2001 Jan;27(1):48-54

http://www-sequence.stanford.edu/human_cell_cycle/index.html (supplementary data)

One aim of the experiments was to identify cell-cycle-regulated genes in human cells on a genome-wide scale. Cells were synchronized, and after release from arrest samples were taken at intervals of 2 hours over a period of 24 hours, covering almost 2 complete cell cycles. The experiments were performed using oligonucleotide arrays for 40,000 and 6,800 human transcripts and genes. The experiments on the 6,800 gene arrays were replicated under identical conditions (i.e. 2 microarrays per time step) and the resulting data sets were normalized between samples (i.e. microarrays) and then between genes and ESTs. Normalization is absolutely necessary if expression levels from different microarrays are to be compared.

The results of the replicates were averaged. However, the authors do not report whether they verified that the replicate values being averaged were similar in all cases. For example, 2 and 4 have an average of 3, but so do -10 and 16, and in the latter case at least one of the values is very likely erroneous and the average misleading. The definition of "similar" needs discussion. In this context one should also consider how Affymetrix-specific attributes like DETECTION (present, marginal or absent) are to be handled.
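The concern about averaging can be illustrated with a short sketch that averages replicate pairs but also flags pairs that disagree too much. The function name, the absolute-difference criterion, and the threshold are all assumptions for illustration; the paper does not describe such a check, and a real analysis might prefer a relative or variance-based criterion.

```python
def average_replicates(pairs, max_abs_diff):
    """Average replicate pairs and flag the ones that disagree.

    pairs: dict mapping gene name -> (replicate_1, replicate_2)
    max_abs_diff: largest absolute difference still treated as "similar"
                  (a hypothetical criterion, not taken from the paper)
    Returns (averages, flagged_gene_names).
    """
    averages = {}
    flagged = []
    for gene, (rep1, rep2) in pairs.items():
        averages[gene] = (rep1 + rep2) / 2
        if abs(rep1 - rep2) > max_abs_diff:
            flagged.append(gene)
    return averages, flagged


# The two cases from the text: both average to 3, but only one is trustworthy.
example_pairs = {"consistent_gene": (2, 4), "suspicious_gene": (-10, 16)}
example_averages, example_flagged = average_replicates(example_pairs, max_abs_diff=5)
```

Both genes receive the same average, so only the flag list distinguishes the trustworthy measurement from the suspect one.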

The new data set was clustered using both prior biological knowledge (similarity, measured by the Pearson correlation coefficient, to 12 cell-cycle-regulated seed patterns defined by well-known cell-cycle-regulated genes) and an unsupervised clustering algorithm based on a self-organizing map to verify the previous results (again using the Pearson correlation coefficient as the similarity measure). SOM clustering did in fact produce similar clusters. Hence, this combined approach, applying both biological knowledge and unsupervised clustering, gives more confidence in the results and eliminates the subjective factor in "informed" clustering based on biological knowledge.
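The "informed" half of this clustering can be sketched as follows: each expression time course is assigned to the seed pattern with which it has the highest Pearson correlation, provided the correlation clears a cutoff. The seed names, the toy time courses, and the cutoff value are hypothetical; only the use of the Pearson coefficient as the similarity measure comes from the paper as summarized above.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a flat profile carries no shape information
    return cov / (norm_x * norm_y)


def assign_to_seed(profile, seeds, min_r):
    """Return the name of the best-correlated seed pattern, or None if below min_r."""
    best_name = max(seeds, key=lambda name: pearson(profile, seeds[name]))
    if pearson(profile, seeds[best_name]) < min_r:
        return None
    return best_name


# Hypothetical seed patterns over eight time points (two cycles).
toy_seeds = {
    "peak_early": [1, 3, 1, 0, 1, 3, 1, 0],
    "peak_late": [0, 1, 3, 1, 0, 1, 3, 1],
}
scaled_profile = [2, 6, 2, 0, 2, 6, 2, 0]  # same shape as "peak_early", twice the amplitude
flat_profile = [1, 1, 1, 1, 1, 1, 1, 1]
```

Because the Pearson coefficient ignores scale and offset, `scaled_profile` matches `peak_early` perfectly even though its absolute levels differ, which is why the measure suits shape-based clustering but says nothing about whether the measured levels themselves are right.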

This gives confidence in the clustering, but not in the measured expression levels. These were not verified using traditional methods such as Northern blots or RT-PCR. Whether the newly found cell-cycle-regulated transcripts actually play a role in any of the various stages of the cell cycle should be verified experimentally using methods like gene knock-out or short interfering RNA (siRNA). But the similarity to the expression profiles of known cell-cycle-regulated genes makes this very likely. An analysis of regulatory regions of the genes in the same cluster might also yield further insights.

Michael Ochs and Andrew Godwin, Microarrays in Cancer: Research and Applications, BioTechniques, 34(suppl):S4-S15.

This was a review paper on the ways in which cancer research has benefited, and could potentially benefit, from microarray data. The authors suggest that microarrays would be of particular use in studying cancer because of the disease's many types (each with distinct genetic characteristics) and varying responses to therapy. While very optimistic about microarrays' potential, the authors believe that the greatest contributions will be made when microarray data is combined and studied together with other sources of medical or genetic information.

Wang, D., Coscoy L., Zylberberg, M., Avila, P., Boushey, H., Ganem, D., DeRisi, J., Microarray-based detection and genotyping of viral pathogens, (2002) Proc. Natl. Acad. Sci. USA 99, 15687-15692

This paper by Wang et al. reports on the use of microarrays to detect and genotype a wide range of viruses. The coolest result of this work is that, if good viral genome information is available for comparison, this microarray approach appears to be far superior to conventional techniques (viral cultures, PCR assays) because it can screen a far greater number of samples simultaneously. Additionally, the authors showed that it can be used to detect previously unknown viruses based on sequence conservation across viral families. A possible disadvantage of microarray analysis of viruses is that a set of sequenced genomes must be available before the probe sequences can be constructed, which may not always be the case. I found this to be a very good paper for illustrating the versatility of microarray experimentation. For example, the approach can be used to identify virtually any type of virus (viruses used for the study included rhinoviruses, picornaviruses, polioviruses and enteroviruses among others; also, samples were taken from patients with naturally occurring viral infections), and it shows how microarrays can be used in ways other than the genotyping of an organism or a tissue (i.e. as a way to identify disease). Certainly this work should be continued on an ever larger scale as more and more viral genome information becomes available. This could have wide-reaching applications for disease diagnosis and for the creation of new vaccines in the years to come.

Set Associations of SNP Case-Control and Microarray Data, Journal of Computational Biology, Volume 10, Numbers 3-4, 2003, Pp569-574

This paper is only indirectly associated with microarray analysis; however, as the authors point out, some of their methods do apply to microarray data.

Typically, gene mapping is done one gene at a time. However, the evaluation of complex traits may require the analysis of numerous SNPs in different genomic regions. In a way, this is a clustering problem; the authors have thousands of data points for SNP allele frequencies among both affected and unaffected individuals. Their goal is to determine the set of SNPs that most likely explains the differences between affected (diseased) and unaffected (not-diseased) individuals. Their method involves using chi-squared measurements to quantify the importance of a given SNP location in disease state. The SNP locations are then ranked in descending order. Subsequently, the researchers can take n possible locations (with the optimal n being empirically determined) and create a set of SNPs that are most likely involved in the disease state.
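The rank-then-take-the-top-n idea can be sketched with a plain 2x2 chi-squared statistic on allele counts in cases versus controls. This is a simplified illustration under stated assumptions: the exact statistic the authors use, and how they choose n, are not given in the summary above, and the SNP names and counts are made up.

```python
def chi_squared_2x2(a, b, c, d):
    """Chi-squared statistic for the 2x2 table [[a, b], [c, d]] (no continuity correction).

    Rows: cases, controls. Columns: counts of allele 1, allele 2.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # an empty row or column makes the SNP uninformative
    return n * (a * d - b * c) ** 2 / denom


def rank_snps(counts):
    """Return [(snp, statistic), ...] sorted by descending statistic.

    counts: dict mapping SNP name -> (case_allele1, case_allele2,
                                      control_allele1, control_allele2)
    """
    scored = [(snp, chi_squared_2x2(*table)) for snp, table in counts.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def top_n_set(counts, n):
    """Names of the n SNPs with the largest statistics."""
    return [snp for snp, _ in rank_snps(counts)[:n]]


# Hypothetical allele counts for three SNPs (100 case and 100 control alleles each).
toy_counts = {
    "snp_a": (50, 50, 50, 50),  # identical frequencies in both groups
    "snp_b": (80, 20, 40, 60),  # strong difference
    "snp_c": (60, 40, 50, 50),  # mild difference
}
```

Calling `top_n_set(toy_counts, 2)` keeps the two SNPs whose allele frequencies differ most between the groups; choosing n empirically, as the paper is said to do, would sit on top of this ranking.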

This same approach could potentially apply to microarray expression data for the purpose of determining locations in the genome most responsible for disease state.

John R. Hawse, James F. Hejtmancik, Quingling Huang, Nancy L. Sheets, Douglas A. Hosack, Richard A. Lempicki, Joseph Horwitz, Marc Kantorow, Identification and functional clustering of global gene expression differences between human age-related cataract and clear lenses, Molecular Vision 2003; 9:515-537

How exactly does the EASE program work? What algorithms are used to perform the calculations? It's nice to be able to download the raw data; this would allow others to easily verify the calculations using other methods of statistical analysis.

Patrick O. Brown and David Botstein, Exploring the new world of the genome with DNA microarrays, Nature Genetics Supplement, Volume 21, January 1999, pp. 33-37

A good intro to microarrays. It offers a great deal of information on biological applications but only about two sentences on computational analysis (basically that data is collected and clustered into patterns). This would be a good review for anyone who needs to remember what a microarray is and how it is used. At the end of the paper the authors make a plea for proper publishing of papers, rather than just the bottom line, and they explain three things that they themselves are doing to improve the situation.

Martin Schuster, C. Phoebe Lostroh, Tomoo Ogi, and E. P. Greenberg, Identification, Timing, and Signal Specificity of Pseudomonas aeruginosa Quorum-Controlled Genes: a Transcriptome Analysis, Journal of Bacteriology, April 2003, p. 2066-2079, Vol. 185, No. 7

Victoria E. Wagner, Daniel Bushnell, Luciano Passador, Andrew I. Brooks, and Barbara H. Iglewski, Microarray Analysis of Pseudomonas aeruginosa Quorum-Sensing Regulons: Effects of Growth Phase and Environment, Journal of Bacteriology, April 2003, p. 2080-2095, Vol. 185, No. 7

Michael L. Vasil, DNA Microarrays in Analysis of Quorum Sensing: Strengths and Limitations, Journal of Bacteriology, April 2003, p. 2061-2065, Vol. 185, No. 7

Three papers from one journal.

In April of 2003, Journal of Bacteriology ran a series of three very interesting articles: two research papers reporting microarray experiments investigating the genes regulated by the presence of quorum sensing signals, and a guest commentary with an opinion on the benefits and limitations of microarray experiments.

A few words of background. A number of bacteria have been observed to change their behavior when a large number of them are present in an environment. The mechanism that allows bacteria to find out that there are many of them is called quorum sensing and involves production, excretion into the environment and sensing of certain small molecules. The papers cited here study the change of gene expression levels in Pseudomonas aeruginosa -- a common bacterium that lives in the soil, among other places, but is also an opportunistic pathogen that commonly infects the lungs of cystic fibrosis patients. Both papers were trying to answer the same question: which genes are regulated by the presence of the signal in the environment. They employed somewhat different approaches and produced results that only partially overlapped.

The first paper (Schuster et al.) uses two pairwise comparisons to identify, as precisely as possible, the genes that respond to the presence of the signal in the environment. First, a mutant that is incapable of producing the signal is placed in two different environments: one where the signal is added by the experimenters and one where the signal is not present. Second, a mutant lacking a signal receptor is compared to the wild type.

In the second paper (Wagner, et al), only two different strains of the bacteria were used: the wild type and the mutant unable to produce signal. To ensure that the observed response was really due to the variation in the signal level, the authors also recorded the response of the cells to different kinds of media.

The two experiments confirmed a number of genes previously known to be involved in quorum sensing. There was large (but far from perfect) overlap among the more than 300 genes that each group identified as being induced by the presence of the signal. However, the two groups agreed on only 4 genes that were repressed by the signal's presence (out of 22 and 222 such genes identified by the two groups)!

A.M. Bagirov, B. Ferguson, S Ivkovic, G. Saunders and J. Yearwood, New algorithms for multi-class cancer diagnosis using tumor gene expression signatures., Bioinformatics, vol. 19, no. 14, 2003, pages 1800-1807

This paper proposes new clustering, feature selection, and classification algorithms, and tests them with microarray data taken from 14 different types of tumors. On the positive side, the algorithms are designed to be used together as a sort of complete toolkit for tumor diagnosis. All three of them are also fairly straightforward and easy to understand (and thus, hopefully, to implement). On the negative side, the feature selection algorithm seems rather sensitive to outliers. It simply looks at the minimum and maximum expression levels of each gene in each tumor class, and then looks at the size of the overlap with the other tumor classes in order to determine whether or not the gene is informative. A single outlier can greatly increase the maximum or decrease the minimum, but it doesn't necessarily mean that the gene is not informative - it is possible that all other samples from that class of tumors have expression levels in a very narrow range. Another negative is that in the results section the authors only compare their clustering algorithm with k-means, neglecting to compare it with the other clustering algorithms (or to compare it using other data sets).
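The outlier sensitivity described above is easy to see in a sketch. The function below follows the reviewer's description of the criterion (compare the [min, max] expression ranges of two tumor classes and measure their overlap); it is an illustrative reading with hypothetical names and data, not the authors' published algorithm.

```python
def range_overlap(values_a, values_b):
    """Length of the overlap between the [min, max] ranges of two classes.

    Returns 0.0 when the ranges are disjoint, i.e. the gene separates the classes.
    """
    low = max(min(values_a), min(values_b))
    high = min(max(values_a), max(values_b))
    return max(0.0, high - low)


# Hypothetical expression levels of one gene in two tumor classes.
class_a = [1.0, 1.1, 1.2]
class_b = [3.0, 3.1, 3.2]
class_a_with_outlier = class_a + [3.15]  # a single aberrant sample

clean_overlap = range_overlap(class_a, class_b)
outlier_overlap = range_overlap(class_a_with_outlier, class_b)
```

One aberrant sample turns a perfectly separating gene into one whose ranges overlap, even though every other sample in the class still lies in a narrow band. Replacing min and max with robust quantiles would be one way to soften this, though that is a suggestion here and not something the paper is reported to do.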

Mei-Ling Ting Lee, Frank C. Kuo, G. A. Whitmore, and Jeffrey Sklar, Importance of Replication in Microarray Gene Expression Studies: Statistical Methods and Evidence From Repetitive cDNA Hybridizations, PNAS August 29, 2000 vol. 97 no. 18 pages 9834-9839

This paper examined an apparently common assumption: that experiments using a single microarray analysis provide data that can reasonably be taken as consistent and reliable. Replication is often avoided because of the time, complexity, and cost involved in this type of experimentation, but as this study showed it is a worthwhile dimension of any quality study of this type. In the article numerous causes of errors in microarray data were explored in the context of some standard experimentation of gene expression levels in prepared E. coli slides. A statistical model was developed that helped show the expected error and was used to derive confidence levels over certain experiments. The authors came to the conclusion that at least 3 replications of any experiment need to be conducted before one can expect consistent and reliable results. This is of course interesting since many of the studies we have seen so far rely on the data from a single analysis being correct.

This paper was an excellent introduction to microarrays because, in the course of describing the experiments, it not only described the general process of this type of analysis in detail but also presented the many potential sources of error. I feel that this should allow me to make better-informed critical analyses of experiments utilizing this technology in the future. Possibly because the article appeared in an applied math journal, the technical details were thorough but not overwhelming, which made the paper quite readable. In general I found it to be a good paper.

Kevin P. White, Scott A. Rifkin, Patrick Hurban, and David S. Hogness, Microarray Analysis of Drosophila Development During Metamorphosis, Science 1999 December 10; 286: 2179-2184

I initially chose this paper because it seemed to focus on the application and biological conclusions - the area with which I am less familiar.

This article unfortunately did not spend much time describing or detailing its computational methods; the authors spent a couple of paragraphs talking about how they used pairwise correlation statistics and SOMs, and included diagrams, trees and graphs from these algorithms with informative captions. Aside from that, the extent of the computational discussion was that both methods produced the same results, which were consistent with the hypotheses.

On the biological side, there was extensive discussion on the conclusions. The research was intended to find which genes were expressed during the different stages of metamorphosis in fruit flies. The paper went beyond the results to speculate on why the results make sense. The bulk of the paper was actually spent detailing the biological function a particular gene serves by being expressed at a certain time.

This was not a bad article for me to read, because I would naturally have gotten caught up in the computational details and skimmed the biology; this forced me to understand the process of drawing biologically relevant conclusions after the mathematical analysis is complete.

My biggest criticism is that they seemed to fall into the trap of making a strong hypothesis and then designing experiments that would prove it true. It is possible that they did do extensive analysis, but from the quick overview in this paper, it did not seem particularly thorough, with many variations and iterations. Their speculations on the biological significance of the gene expression times were also not necessarily backed up with data, though again, it is possible that if I dove into their cited sources, I would be less skeptical.

Susmita Datta, Somnath Datta, Comparisons and validation of statistical clustering techniques for microarray gene expression data, Bioinformatics 2003 19: 459-466

This article is interesting because it addresses the exact weak point of Chu et al. (1998), clustering, and it uses the same data. The third validation method looks good, but I think it depends on the topological structure: it will probably work better for linear clusterings than for circular clusterings.

Jackson, Bartz, et al., Expression profiling reveals off-target gene regulation by RNAi, Nature Biotechnology, June 2003 21(6) pp 635 - 637 (available as E-Journal from UW Libraries)

Most studies use microarrays to measure protein levels indirectly via mRNA levels. This paper is a bit different due to its subject: the mRNA levels themselves are the central figures in the study of RNAi.

RNA interference has been described as transient gene knockout via a special type of RNA (siRNA) "interfering" with the mRNA of a given gene. The authors conclude that siRNA is not gene-specific; it will also silence genes that have sequence similarity with its primary target.

I was impressed with the different kinds of experiments the authors performed to bolster their argument and to rule out other possible causes of the phenomenon they observed.

The authors make a statement about "virtually identical gene expression patterns"; I wonder if there exists some quantitative measurement (perhaps involving p-values) of the similarity of patterns. Sometimes there are just too many red-green expression profiles to eyeball.

Towards the end of the paper, the authors admit they were "unable to identify [sequence] patterns that could help predict off-target activity of siRNAs". I remember reading about how siRNAs do not act alone; they form complexes with special proteins. A future study could examine whether those proteins favor certain patterns (e.g. if they form more stable and/or more efficient complexes).

Stefanie Fuhrman, Mary Jane Cunningham, Xiling Wen, Gary Zweiger, Jeffery J. Seilhamer, Roland Somogyi, The application of Shannon entropy in the identification of putative drug targets, Biosystems 55 (2000), pg 5-14

Fuhrman et al. propose an information-theoretic approach to identifying genes that could be drug targets. This paper attracted my attention as I have studied the subject and applied it to other problems before. On the surface their idea is very reasonable and promising: interpret the gene expression pattern over time as a probability distribution and associate an entropy measure with it. Higher entropy indicates higher variability/complexity, which implies a relatively prominent role in the biological process being charted, and from here the authors go on to conclude that "[...] genes with highest entropy are the biggest participants in a disease process ..." Entropy is appealing for several reasons: it is mathematically precise, well understood, captures our intuition about complexity/unpredictability and generalizes easily to multiple distributions. (For example, we can talk about the conditional entropy of two distributions, which allows tracking the variation in one gene given the variation in another.) Existing data from a 1998 study is charted by entropy: individual genes range from 0 to a maximum of 1.58 bits, with the majority clustered between 1.3 and 1.4 bits.

The problem is that there are several leaps of faith in the argument quoted above. The worst one is the correlation vs. causation question: variation in gene expression could be a symptom of the condition as opposed to its underlying cause. Then there are some problems with the entropy measure itself: for example, it washes away any "information" contained in the actual levels (jumping from 3 to 4 is the same as jumping from 1 to 10) as well as in the time sequence. In fact the paper has examples of two hypothetical distributions with equal entropy which are clearly distinct and arguably very different in complexity. (Gene #3 seems more complex than gene #2.) There are other complexity metrics that would capture this inherent time-dependence and sequential nature of the data. There are also practical problems, such as the fact that gene expression levels must be quantized into "bins" before the discrete entropy measure can be applied. This paper normalized the expression level into a real number between 0 and 1, and arbitrarily picked three bins by dividing the unit interval into three equal segments. The number of bins and the thresholds separating them affect the entropy: with B bins the maximum achievable entropy is log B (logarithm in base 2), reached when the distribution of gene expression into these bins is uniform. Similar quantization problems are abundant elsewhere: if the sampling frequency is too low, the experiment will miss rapid variations in gene expression, a form of the "aliasing" problem that underestimates the entropy.
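A small sketch makes the binning and the order-blindness concrete. It assumes expression levels already normalized to [0, 1] and equal-width bins, as described above; the function name and the toy time courses are hypothetical.

```python
import math
from collections import Counter


def binned_entropy(levels, num_bins=3):
    """Shannon entropy (in bits) of expression levels after equal-width binning.

    levels: expression values already normalized to the interval [0, 1]
    num_bins: number of equal-width bins (three in the paper, per the text above)
    """
    # A level of exactly 1.0 would land in bin `num_bins`, so clamp it to the last bin.
    bins = [min(int(x * num_bins), num_bins - 1) for x in levels]
    total = len(bins)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(bins).values()
    )


# Hypothetical time courses, nine time points each.
constant_gene = [0.5] * 9
stepwise_gene = [0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 0.9, 0.9, 0.9]
oscillating_gene = [0.1, 0.5, 0.9, 0.1, 0.5, 0.9, 0.1, 0.5, 0.9]

max_three_bin_entropy = math.log2(3)  # upper bound with three bins, about 1.58 bits
```

`stepwise_gene` and `oscillating_gene` have visibly different dynamics yet receive the same entropy, because only the bin occupancy counts matter. That shared value is log2(3), roughly 1.58 bits, which matches the maximum reported for the 1998 data and shows that the ceiling is set by the choice of three bins.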

Finally, even when genes with high entropy are singled out, there are too many candidates. The authors estimate that about 10% of genes have high entropy, which, applied to a human genome of > 100K genes, still leaves thousands of candidates. As such, entropy alone may not be enough to reduce the problem to a manageable size.

