Significance
Analysis of genomic expression data (from microarrays) can lead to new and improved diagnostic tools for predicting human subject disease susceptibility and patient survival as a function of genomic expression data (and also of infection type, patient history, and treatment regimens). Two types of analysis are particularly interesting: data clustering, which finds natural groupings of data, and supervised learning, which is based on tagging known, past instances with outcomes (e.g., patient survivability outcomes). We see from [Lan00] and from the improvements in [HTE+00] over the prior paper [AED+00] that supervised learning produces more precise predictions than clustering alone. In this portion of the present proposal, the emphasis will therefore be on particular supervised machine learning techniques (described below), with clustering techniques as helpers. It is also expected that these learning techniques will be employed in a useful feedback loop with human expert biomedical insights and intuitions, so that the diagnostic tools produced are thereby refined toward near perfection.
Biomedical Objectives
Our Biomedical Objectives involve providing learning tools for significantly improved diagnosis regarding the following diseases.
Below we indicate more concretely, for some of these diseases, the type of information the user would input to a tool to be developed, and the kind of general information and specific answers to diagnostic questions the tool would return to the user.
Synergy With The Other Projects
One major challenge affecting such analysis is integrating the results of high-throughput experimental approaches (such as microarrays) into knowledge. Software is of course employed, since the data sets will be too large to process feasibly by hand. Furthermore, there will be important synergy with the other Projects, so that the Learning Project will be positioned well to receive essential data from web searching Agents (from the Agents Project), including those agents which make strong use of Natural Language Processing (NLP) technology (from the NLP Project). As the Learning Project advances, it will provide information back to the Agents Project and the NLP Project toward their enhanced usefulness.
Machine Learning Background and Techniques
In the present proposal, for supervised learning, we will employ particular techniques (described below) within the field of machine learning. These machine learning techniques are, in general, computer procedures for both fitting programs to outcome-tagged data and for outputting the programs fit for subsequent use in predicting outcome-tags of future data [Mit97]. A program so fit to data is said to be learned.
For this Project we chose particular machine learning techniques (described below) for their relative transparency toward providing the user with biomedical insights. They were chosen over other popular but more insight-opaque techniques. The transparency of our chosen techniques is illustrated below through an example.
A wide variety of machine learning techniques have been employed for bioinformatics ranging from neural nets to decision tree induction [AMS+93,BO98,BB98,SDFH98,WM00]. Even potential functions are usefully applied, for example, in [GME+92]. Machine learning techniques overlap somewhat with classical statistical techniques and they compare favorably with one another, as to accuracy of predictions, over a wide variety of domains [MST94]. We have carefully chosen for the Biomedical Objectives above the machine learning software C5.0 which employs the decision tree induction technique of C4.5 [Qui93,RN95,Mit97] (studied in [MST94]) with the new option to invoke AdaBoosting [FS97,FS99]. AdaBoosting, for many problems, provides a significant improvement in prediction accuracy. It is pedagogically useful to present a previous, successful bioinformatics example employing C5.0. This example is from [OCB01]. The example will, then, help explain both C5.0 and why we chose it for the Biomedical Objectives of the present Project. Along the way we will provide, for useful and general illustration, an indication of how features of (some of the) proposed Biomedical Objectives above would compare to those of the already worked out example (from [OCB01]).
The example is next. [OCB01] presents a bioinformatics machine learning problem pertaining to predicting, for complete genomic sequences from disparate species, whether or not they are orthologous, i.e., whether or not they evolved from a common ancestor and have the same function. For the proposed Biomedical Objectives we would want to predict instead human subject susceptibility and patient survival.
The example from [OCB01] concentrated on employing data about many known orthologies between mouse and human on the one hand and, on the other, about some but fewer known orthologies between mouse, human, and chicken. For the proposed Biomedical Objectives re PTLD, the known data would be past human subject susceptibility and patient survival as a function of EBV types, differential gene expression (between controls and infected patients), and environmental conditions such as patient history and treatment regimens. Similarly for CLL. There, for example, RNA will be prepared from lymphocytes isolated from patients with either indolent or aggressive CLL. The RNAs will be appropriately labeled and then hybridized with microarrays containing human genes.
The purpose of the example study [OCB01] was, given known orthologs (with known functions) X and Y between mouse and human, to decide of an arbitrary chicken protein Z (and associated sequences) of unknown orthology whether or not X, Y, and Z were orthologous. For the first proposed Project above, given the EBV types, differential gene expression, and environmental conditions for a current patient, we would want to predict that patient's susceptibility and survivability.
Back to the example from [OCB01]: well known, quick, and deservedly popular similarity matching techniques such as BLAST and variants [AGM+90,KA90,Pea95,AMS+97] are not sufficient to detect some divergent orthologs [BCH98], and the authors of [OCB01] found that many of their interesting divergent cases between mammals and birds were not similar enough to detect with BLAST and variants alone. Here is what was done, then, in [OCB01]. From knowledge of biochemical evolution [HL98] and also from some empirically observed patterns in the data for known orthologs, 78 attributes were eventually created and tried (with success); these were expected to be of relevance for ferreting out defining similarities and differences between mammal and bird orthologs. For example, for similarity attributes the efficient version from [Got82] of the Needleman-Wunsch optimal global alignment algorithm [NW70] was used (with appropriate scoring matrices, e.g., from [BO98]) to obtain both optimal global alignments and identity match percent scores for the nucleic acid (NA) and the amino acid (AA) sequences being compared. The percents of nucleotide mutations were then calculated from the NA global alignments, and percent scores were calculated for both transitions (mutations which occur frequently for simple biochemical reasons [Li97]) and transversions (mutations less likely to occur [Li97]). Various attributes measuring ratios and differences were also created, for example, attributes featuring lengths and numbers of gaps.
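The transition and transversion percent scores described above can be illustrated with a minimal Python sketch. It assumes that the two NA sequences have already been globally aligned (the alignment step itself is not shown), that gaps are written as `-`, and that percents are taken over the aligned columns in which both sequences carry an unambiguous base. The function name and these conventions are illustrative assumptions, not details taken from [OCB01].

```python
# Assumed conventions: sequences are pre-aligned and of equal length;
# columns with a gap or an ambiguous base are skipped.
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")
BASES = PURINES | PYRIMIDINES


def substitution_percents(aligned_a, aligned_b):
    """Return identity, transition and transversion percents for two
    aligned nucleic acid sequences (hypothetical helper)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    identical = transitions = transversions = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x not in BASES or y not in BASES:
            continue  # gap or ambiguous base: column not counted
        if x == y:
            identical += 1
        elif {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1  # purine<->pyrimidine
    compared = identical + transitions + transversions
    if compared == 0:
        raise ValueError("no comparable columns in the alignment")
    scale = 100.0 / compared
    return {
        "identity": identical * scale,
        "transition": transitions * scale,
        "transversion": transversions * scale,
    }


# Toy alignment: A/A identical, C/T transition, gap skipped,
# G/G identical, T/A transversion.
toy = substitution_percents("AC-GT", "ATCGA")
```

Each returned percent could serve directly as one attribute value, and ratios or differences of such percents as further attributes.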
It was (correctly) expected that measures of bias for transversions would be
useful attributes. This was expected since transversions are rare,
so their occurrences and biases may be functionally important.
For example, certain clustering tendencies were empirically
noticed regarding transversions (between mammals and birds),
and subsequently corresponding
attributes were created and computed based on easy to compute one-dimensional
projections of the distribution-free,
scaling and rotation invariant clustering tendency
measure called simplicial depth
[BF84,Liu90,LS93,CO98].
For reference below,
one (of many)
such attributes was called CHICK-MOUSE MINOR TRANSVERSION BIAS.
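For illustration only, the one-dimensional sample version of simplicial depth can be sketched as follows: the depth of a value is the fraction of closed intervals, spanned by pairs of sample points, that contain it. How [OCB01] chose its projections and turned depths into attributes is not reproduced here; the function name and the toy numbers are hypothetical.

```python
from itertools import combinations


def simplicial_depth_1d(x, sample):
    """Fraction of closed intervals [a, b], spanned by pairs of sample
    points, that contain x (one-dimensional sample simplicial depth)."""
    pairs = list(combinations(sample, 2))
    if not pairs:
        raise ValueError("need at least two sample points")
    covering = sum(1 for a, b in pairs if min(a, b) <= x <= max(a, b))
    return covering / len(pairs)


# Toy sample: central values are deeper than extreme ones.
toy_sample = [1, 2, 3, 4, 5]
depth_center = simplicial_depth_1d(3, toy_sample)
depth_edge = simplicial_depth_1d(1, toy_sample)
depth_outside = simplicial_depth_1d(0, toy_sample)
```

Because the measure only counts order relations among the points, rescaling the sample leaves the depths unchanged.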
For each ortholog
in the data set for the 213 orthologs, a vector of the
values for that ortholog
of each of the associated 78 attributes was produced. Such vectors were
created for known non-orthologs too. Each resultant vector was additionally
tagged as to whether it represented orthologs or non-orthologs -- in cases
where that was known. For the proposed Biomedical Objective re PTLD, a number of attributes would be initially created based on types of EBV, measures of differential gene expression, and environmental conditions. The attributes can be, but need not be, numerically valued. Standard clustering techniques (e.g., Independent Component Analysis (ICA) [Lee98] with software from http://www.cis.hut.fi/projects/ica/fastica/ and Principal Component Analysis (PCA) [GW92,Oja83,Oja89]) would be employed, much as in the example from [OCB01], where the relatively simpler clustering technique of human inspection and projections of simplicial depth was applied. This would be in the interest of finding data groupings, tendencies, and combinations which would lead us to create, in conjunction with biomedical intuitions and insights, additional attributes providing enhanced prediction accuracy. Similarly for CLL. There we would select, for example, attributes regarding basic lymphocyte count, such counts in bone marrow, and attributes testing particular genes for rising or falling expression with respect to controls.
For the example from [OCB01], the tagged vectors constituted the training data to be input to C5.0. One can visualize the training data as a spreadsheet. For [OCB01] each row of the spreadsheet would have 79 columns and would represent one data point. The first 78 columns would contain the values of the respective 78 attributes. The 79th would contain the tag regarding orthology. For our Biomedical Objectives we would have a similar layout for each disease, of course with the numbers of attributes being quite different in each case.
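The spreadsheet layout just described can be sketched in a few lines of Python. This is only an illustration of the row shape (78 attribute values followed by one tag column, left empty when the tag is unknown); the helper names and the placeholder values are hypothetical and do not reflect the actual C5.0 input file format.

```python
N_ATTRIBUTES = 78  # attribute count in the [OCB01] example


def make_row(attribute_values, tag=None):
    """Build one 79-column row: 78 attribute values plus the tag.
    A tag of None marks a data point whose outcome is unknown."""
    values = list(attribute_values)
    if len(values) != N_ATTRIBUTES:
        raise ValueError(f"expected {N_ATTRIBUTES} attribute values")
    return values + [tag]


def split_rows(rows):
    """Separate tagged rows (training data) from untagged rows
    (data points whose tag is to be predicted)."""
    training = [row for row in rows if row[-1] is not None]
    to_predict = [row for row in rows if row[-1] is None]
    return training, to_predict


# Placeholder attribute values, for shape only.
rows = [
    make_row([0.0] * N_ATTRIBUTES, "ORTHOLOGOUS"),
    make_row([1.0] * N_ATTRIBUTES, "NON-ORTHOLOGOUS"),
    make_row([0.5] * N_ATTRIBUTES),  # tag unknown
]
training, to_predict = split_rows(rows)
```

For each disease in our Biomedical Objectives, only the attribute count and the set of possible tags would change.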
The C4.5 (decision tree induction) component of C5.0 was first used to fit to this data an especially convenient classification program called a decision tree.
The figure above shows a portion of the first
decision tree obtained from the run of the tagged
data from the 213 known orthologs between mouse, human, and chicken.
A decision node in a decision tree
is a point in the tree immediately below which there is a
YES or NO branching. The complete tree of the figure above
has 35 decision nodes, but the incomplete tree
actually shown in the figure
has only 4. The large subtree omitted from the figure then has
31 decision nodes. For the proposed first Project, we would similarly obtain
decision trees with decision nodes featuring instead attributes of relevance to
that project, e.g., differential gene expression measures, etc.
Importantly,
the particular computer procedure the C4.5 component of C5.0 employs to obtain
a decision tree involves producing trees where the topmost decisions made
explain the most data. For example, in the figure above, the decision as to
whether or not CHICK-HUMAN AA IDENTITY <= 25.54 is the most
informationally salient.
The next decisions down are second most salient
in explaining the data, and so on down
the branches of the tree. For example,
in the figure above, the decision as to
whether or not CHICK-MOUSE NA IDENTITY <= 49.5 is the second most
informationally salient. Further down is a decision to be made
about an attribute featuring a ratio.
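The notion that the topmost test explains the most data can be made concrete with an entropy calculation. C4.5 ranks candidate tests with an entropy-based criterion (the gain ratio of [Qui93]); the sketch below uses plain information gain, a simplification, and places candidate thresholds at midpoints between observed values, which is our assumption. The attribute names and data are toy values, not values from [OCB01].

```python
from collections import Counter
from math import log2


def entropy(tags):
    """Shannon entropy (in bits) of a list of class tags."""
    total = len(tags)
    return -sum((n / total) * log2(n / total) for n in Counter(tags).values())


def information_gain(values, tags, threshold):
    """Entropy reduction obtained by the test `value <= threshold`."""
    yes = [t for v, t in zip(values, tags) if v <= threshold]
    no = [t for v, t in zip(values, tags) if v > threshold]
    if not yes or not no:
        return 0.0  # the test does not separate anything
    total = len(tags)
    remainder = (len(yes) / total) * entropy(yes) + (len(no) / total) * entropy(no)
    return entropy(tags) - remainder


def best_test(rows, tags, attribute_names):
    """Return (gain, attribute name, threshold) of the most informative test."""
    best = (0.0, None, None)
    for col, name in enumerate(attribute_names):
        column = [row[col] for row in rows]
        distinct = sorted(set(column))
        for low, high in zip(distinct, distinct[1:]):
            threshold = (low + high) / 2  # midpoint: an assumption of this sketch
            gain = information_gain(column, tags, threshold)
            if gain > best[0]:
                best = (gain, name, threshold)
    return best


# Toy data: attribute A separates the two tags perfectly, B does not.
toy_rows = [[20.0, 60.0], [22.0, 40.0], [30.0, 55.0], [35.0, 45.0]]
toy_tags = ["NON", "NON", "ORTH", "ORTH"]
top_gain, top_attribute, top_threshold = best_test(toy_rows, toy_tags, ["A", "B"])
```

Applying the same ranking recursively to the rows on each side of the chosen test yields the less salient decisions further down the tree.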
Also shown in the figure is a decision to
be made as to whether MOUSE A TO CHICK C <= 19.09. The attribute value
tested in this decision node is about a transversion. If some
vector of attribute
values of known or unknown tagging is tested in the tree above, and
the answers to the successive questions are, in order,
NO, YES, NO, and YES, this decision tree returns
the decision NON-ORTHOLOGOUS.
If the sequence of answers is the same except that the last one is instead NO, then the decision tree makes the decision ORTHOLOGOUS. For data points for which we would want to predict the correct tagging (because we do not know it), the vector would have 78 entries and a blank 79th (for the orthology example). For our Biomedical Objectives, such a data point would have the number and type of attributes for a particular disease and be missing the tagging as to patient outcome. A tree learned from the training data for that disease could be used to predict the unknown patient outcome. Back to the orthology example.
Far down in
the omitted subtree is the decision node
asking CHICK-MOUSE MINOR TRANSVERSION BIAS > 2.292322; hence, this
decision regarding
a particular transversion bias for mutations from chicken to mouse is important
to the overall decision, but not as important as the decisions displayed in the
figure. Here is a consequence.
Decision trees created by C4.5 can be read for human insight as to the relative importance of tests of attribute values. For the present proposal this is particularly important and is the primary reason for our particular choice of C5.0 as the supervised learning tool. The group leader at AI duPont insisted on this "transparency" for our machine learning techniques. Other techniques that might have been chosen, such as neural nets, are, by contrast, "opaque."
In many cases for our Biomedical Objectives, we will be using our biochemical, medical, and clinical knowledge to select measurable or calculable entities as attributes, and some of these entities are certainly expected to be more causally salient, e.g., regarding outcomes of treatments for disease, than are others. This is expected when the entities represent gene expression along chemical pathways. Sometimes upstream gene expressions are more relevant, sometimes downstream ones are. In case decisions about one such entity are more informationally salient in a decision tree than decisions about another, that quite plausibly points to more causal significance of the former than the latter, and this potentially useful insight can then be followed up on toward further improving our machine-learning-obtained prediction and classification programs re, for example, outcomes of disease treatments. We would, for example, concentrate on developing additional attributes of relevance to the more salient (and, perhaps, then, more causally significant) entities.
As noted above, C5.0 has the option to invoke AdaBoosting. This is a newer technique for provably improving learners, including C4.5, both for fitting training data [FS96] and for generalization and prediction beyond the training data [FSBL98] (see also [FMS01]). It also handles well the presence of errors or noise in the training data [FS96]. The latter is clearly important re, for example, our Biomedical Objectives. All of our techniques also handle well missing attribute values in attribute vectors [Qui93], and this is important, for example, in our Objectives, where arrays tend regularly to miss one-half of the genes expressed.
Back to AdaBoosting: AdaBoosting first fits a sequence of decision trees to the training data, where each tree beyond the first judiciously concentrates on the cases where its predecessor made classification errors; AdaBoosting then makes its final classification decision by taking a weighted majority vote of the decisions of the separate trees in the sequence it has created. More accurate trees in the sequence get higher weight. Since boosting combines a number of decision trees, its use may involve some tolerable loss of insight and efficiency; however, boosting nonetheless looks like linear (i.e., fast) programming [FS99]. In [OCB01] an AdaBoosting-created sequence of just three trees (for majority vote between them) provided a perfect classifier for all the training data.
More importantly, [OCB01] employed 10-fold cross-validation (i.e., each random tenth of the data is in turn withheld from training and used instead for testing) with 10 repetitions and obtained, with an AdaBoosting-produced sequence of 25 trees (with 35-40 decision nodes per tree), a remarkably low error rate of 2.4% (with Standard Error less than 0.05%) on the entire data set for all 213 orthologs. This shows that the attribute selection for orthology must explain a great deal of the differential evolution between disparate species -- at least toward prediction. This error rate from the 10-fold cross-validation provides a good upper-bound estimate of how poorly one might expect a 25-tree decision maker, trained on all the data from [OCB01], to perform on totally new cases of interest. Of course, the majority vote of 25 trees, each with 35-40 decision nodes, is a quite usefully sophisticated and subtle decision maker. Similarly for our Objectives, 10-fold cross-validations would provide estimates of how much to believe predictions of associated learned decision makers. Later, of course, new data would be gathered to further test predictions.
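The reweight-and-vote procedure described above can be sketched in Python. This is a minimal two-class AdaBoost in the style of [FS97], not the C5.0 implementation: to stay short it boosts one-node trees ("stumps") rather than full C4.5 trees, and it codes the two tags as +1 and -1. All names and the toy data are illustrative assumptions.

```python
from math import exp, log


def stump_predict(stump, row):
    """A one-node tree: answer `polarity` if row[col] <= threshold."""
    col, threshold, polarity = stump
    return polarity if row[col] <= threshold else -polarity


def fit_stump(rows, labels, weights):
    """Return (weighted error, stump) for the stump with least weighted error."""
    best = None
    for col in range(len(rows[0])):
        for threshold in sorted({row[col] for row in rows}):
            for polarity in (1, -1):
                stump = (col, threshold, polarity)
                error = sum(w for row, y, w in zip(rows, labels, weights)
                            if stump_predict(stump, row) != y)
                if best is None or error < best[0]:
                    best = (error, stump)
    return best


def adaboost(rows, labels, rounds):
    """Fit up to `rounds` stumps; return a list of (vote weight, stump)."""
    n = len(rows)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        error, stump = fit_stump(rows, labels, weights)
        if error >= 0.5:
            break  # no better than chance on the reweighted data
        error = max(error, 1e-10)  # guard against division by zero
        alpha = 0.5 * log((1 - error) / error)  # more accurate => larger vote
        ensemble.append((alpha, stump))
        # Misclassified cases gain weight, so the next stump focuses on them.
        weights = [w * exp(-alpha * y * stump_predict(stump, row))
                   for row, y, w in zip(rows, labels, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def vote(ensemble, row):
    """Weighted majority vote of all stumps in the ensemble."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Toy data: no single stump classifies these ten points correctly.
toy_rows = [[float(x)] for x in range(10)]
toy_labels = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(toy_rows, toy_labels, rounds=3)
training_errors = sum(vote(ensemble, r) != y for r, y in zip(toy_rows, toy_labels))
```

On this toy set the best single stump misclassifies three points, while the weighted vote of three stumps fits all ten; this is an artifact of the toy data and says nothing about the trees in [OCB01].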
We would likewise expect that correct predictions re patients would require sophisticated and subtle decision makers -- as C5.0 can produce.
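For concreteness, repeated 10-fold cross-validation of the kind used to estimate such error rates can be sketched as follows. The learner is passed in as a pair of functions, so any classifier could be substituted. The function names, the toy majority-vote learner, and the way the standard error is computed (over the per-repetition error rates) are our assumptions and may differ from what [OCB01] or C5.0 actually report.

```python
import random
from collections import Counter
from statistics import mean, stdev


def cross_validated_error(rows, tags, train, predict,
                          folds=10, repetitions=10, seed=0):
    """Return (mean error rate, standard error) over repeated k-fold runs.
    In each run every data point is held out of training exactly once."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rates = []
    for _ in range(repetitions):
        rng.shuffle(indices)  # a fresh random partition per repetition
        errors = 0
        for k in range(folds):
            held_out = indices[k::folds]
            held_out_set = set(held_out)
            kept = [i for i in indices if i not in held_out_set]
            model = train([rows[i] for i in kept], [tags[i] for i in kept])
            errors += sum(predict(model, rows[i]) != tags[i] for i in held_out)
        rates.append(errors / len(rows))
    standard_error = stdev(rates) / len(rates) ** 0.5 if len(rates) > 1 else 0.0
    return mean(rates), standard_error


# Toy learner (hypothetical): always predict the most common training tag.
def train_majority(train_rows, train_tags):
    return Counter(train_tags).most_common(1)[0][0]


def predict_majority(model, row):
    return model


toy_rows = [[i] for i in range(20)]
toy_tags = ["A"] * 15 + ["B"] * 5
error_rate, standard_error = cross_validated_error(
    toy_rows, toy_tags, train_majority, predict_majority)
```

The toy learner always answers "A", so it errs on exactly the five "B" cases in every repetition; a real learner would show some spread between repetitions, which the standard error summarizes.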
Future Significance
The techniques we develop will be useful toward better patient care in the future, and they will be adaptable to other disease scenarios -- with reasonable expectations for success based on the success of the particular projects herein proposed.