
Analysis of genomic expression data (from microarrays) can lead to new and improved diagnostic tools for predicting human subject disease susceptibility and patient survival as a function of those data (together with infection type information, patient history, and treatment regimens). Two types of analysis are particularly interesting: data clustering, which finds natural groupings in the data, and supervised learning, which is based on tagging known, past instances with outcomes (e.g., patient survivability outcomes). We see from [Lan00] and from the improvements in [HTE⁺00] over the prior paper [AED⁺00] that supervised learning produces more precise predictions than clustering alone. In this portion of the present proposal, the emphasis will therefore be on particular supervised machine learning techniques (described below), with clustering techniques as helpers. It is also expected that these learning techniques will be employed in a feedback loop with human expert biomedical insights and intuitions, so that the diagnostic tools produced can be progressively refined toward high prediction accuracy.

Our Biomedical Objectives involve providing learning tools for significantly improved diagnosis regarding the following diseases.

Below we indicate more concretely, for some of these diseases, the type of information the user would input to the tools to be developed, and the kind of general information and specific answers to diagnostic questions the tools would return to the user.

One major challenge affecting such analysis is integrating the results of high-throughput experimental approaches (such as microarrays) into knowledge. Software is of course employed, since the data sets will be too large to process feasibly by hand. Furthermore, there will be important synergy with the other Projects: the Learning Project will be well positioned to receive essential data from web-searching agents (from the Agents Project), including those agents which make strong use of Natural Language Processing (NLP) technology (from the NLP Project). As the Learning Project advances, it will provide information back to the Agents Project and the NLP Project to enhance their usefulness.

In the present proposal, for supervised learning, we will employ particular techniques (described below) within the field of machine learning. These machine learning techniques are, in general, computer procedures for both fitting programs to outcome-tagged data and for outputting the programs fit for subsequent use in predicting outcome-tags of future data [Mit97]. A program so fit to data is said to be learned.

For this Project we chose particular machine learning techniques (described below) for their relative transparency in providing the user with biomedical insights. They were chosen over other popular but more insight-opaque techniques. The transparency of our chosen techniques is illustrated below through an example.

**Figure:** Decision Tree for Predicting Orthologs
[Figure not reproduced: a partial decision tree, beginning with a test on CHICK-HUMAN AA IDEN... and ending in the leaves NON-ORTHOLOGOUS and ORTHOLOGOUS.]

A wide variety of machine learning techniques have been employed for bioinformatics ranging from neural nets to decision tree induction [AMS⁺93,BO98,BB98,SDFH98,WM00]. Even potential functions are usefully applied, for example, in [GME⁺92]. Machine learning techniques overlap somewhat with classical statistical techniques and they compare favorably with one another, as to accuracy of predictions, over a wide variety of domains [MST94]. We have carefully chosen for the Biomedical Objectives above the machine learning software C5.0 which employs the decision tree induction technique of C4.5 [Qui93,RN95,Mit97] (studied in [MST94]) with the new option to invoke AdaBoosting [FS97,FS99]. AdaBoosting, for many problems, provides a significant improvement in prediction accuracy. It is pedagogically useful to present a previous, successful bioinformatics example employing C5.0. This example is from [OCB01]. The example will, then, help explain both C5.0 and why we chose it for the Biomedical Objectives of the present Project. Along the way we will provide, for useful and general illustration, an indication of how features of (some of the) proposed Biomedical Objectives above would compare to those of the already worked out example (from [OCB01]).

The example is next. [OCB01] presents a bioinformatics machine learning problem pertaining to predicting, for complete genomic sequences from disparate species, whether or not they are orthologous, i.e., whether or not they evolved from a common ancestor and have the same function. For the proposed Biomedical Objectives we would want to predict instead human subject susceptibility and patient survival.

The example from [OCB01] concentrated on employing data about many known orthologies between mouse and human on the one hand and, on the other, fewer known orthologies among mouse, human, and chicken. For the proposed Biomedical Objectives re PTLD, the known data would be past human subject susceptibility and patient survival as a function of EBV types, differential gene expression (between controls and infected patients), and environmental conditions such as patient history and treatment regimens. Similarly for CLL: there, for example, RNA will be prepared from lymphocytes isolated from patients with either indolent or aggressive CLL. The RNAs will be appropriately labeled and then hybridized with microarrays containing human genes.

The purpose of the example study [OCB01] was, given known orthologs (with known functions) X and Y between mouse and human, to decide of an arbitrary chicken protein Z (and associated sequences) of unknown orthology whether or not X, Y, and Z were orthologous. For the first proposed Project above, given the EBV types, differential gene expression, and environmental conditions for a current patient, we would want to predict that patient's susceptibility and survivability.

Back to the example from [OCB01]: well-known, quick, and deservedly popular similarity matching techniques such as BLAST and variants [AGM⁺90,KA90,Pea95,AMS⁺97] are not sufficient to detect some divergent orthologs [BCH98], and the authors of [OCB01] found that many of their interesting divergent cases between mammals and birds were not similar enough to detect with BLAST and variants alone. Here is what was done, then, in [OCB01]. From knowledge of biochemical evolution [HL98], and also from some empirically observed patterns in the data for known orthologs, 78 attributes were eventually created and tried (with success); these were expected to be relevant for ferreting out defining similarities and differences between mammal and bird orthologs. For example, for similarity attributes the efficient version from [Got82] of the Needleman-Wunsch optimal global alignment algorithm [NW70] was used (with appropriate scoring matrices, e.g., from [BO98]) to obtain both optimal global alignments and identity match percent scores for the nucleic acid (NA) and the amino acid (AA) sequences being compared. The percentages of nucleotide mutations were then calculated from the NA global alignments, with percent scores calculated for both transitions (mutations which occur frequently for simple biochemical reasons [Li97]) and transversions (mutations which are unlikely to occur [Li97]). Various attributes measuring ratios and differences were also created, for example, attributes featuring lengths and numbers of gaps. It was (correctly) expected that measures of bias for transversions would be useful attributes, since transversions are rare, so their occurrences and biases may be functionally important.
For example, certain clustering tendencies were empirically noticed regarding transversions (between mammals and birds), and corresponding attributes were subsequently created and computed based on easy-to-compute one-dimensional projections of the distribution-free, scaling- and rotation-invariant clustering tendency measure called simplicial depth [BF84,Liu90,LS93,CO98]. For reference below, one (of many) such attributes was called CHICK-MOUSE MINOR TRANSVERSION BIAS. For each ortholog in the data set of 213 orthologs, a vector of the values of the 78 attributes for that ortholog was produced. Such vectors were created for known non-orthologs too. Each resultant vector was additionally tagged as to whether it represented orthologs or non-orthologs, in cases where that was known. For the proposed Biomedical Objective re PTLD, a number of attributes would initially be created based on types of EBV, measures of differential gene expression, and environmental conditions. The attributes can be, but need not be, numerically valued. Standard clustering techniques (e.g., Independent Component Analysis (ICA) [Lee98], with software from http://www.cis.hut.fi/projects/ica/fastica/, and Principal Component Analysis (PCA) [GW92,Oja83,Oja89]) would be employed, just as, in the example from [OCB01], the relatively simpler clustering technique of human inspection and projections of simplicial depth was applied. The aim would be to find data groupings, tendencies, and combinations which, in conjunction with biomedical intuitions and insights, would lead us to create additional attributes providing enhanced prediction accuracy. Similarly for CLL: there we would select, for example, attributes regarding basic lymphocyte count, such counts in bone marrow, and attributes testing particular genes for rising or falling expression with respect to controls.
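The percent scores for identity, transitions, and transversions described above can be illustrated with a minimal Python sketch. It assumes the two nucleic acid sequences have already been globally aligned (for example by a Needleman-Wunsch implementation, which is not shown) and are given as equal-length strings with `-` marking gaps. The function name, the choice to skip gapped or ambiguous columns, and the use of compared columns as the denominator are assumptions made for illustration, not details taken from [OCB01].

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def mutation_percentages(aligned_a, aligned_b):
    """Return (identity %, transition %, transversion %) for an aligned NA pair.

    Columns containing a gap or a non-ACGT symbol are skipped (an assumption).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = transitions = transversions = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x not in BASES or y not in BASES:
            continue
        compared += 1
        if x == y:
            identical += 1
        elif {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    if compared == 0:
        raise ValueError("no comparable columns")
    scale = 100.0 / compared
    return identical * scale, transitions * scale, transversions * scale


# Invented toy alignment: one transition (A/G), one transversion (C/A), one gap column.
identity, ts, tv = mutation_percentages("AAGCT-", "AGGAT-")
print(identity, ts, tv)  # 60.0 20.0 20.0
```

Each of the three returned values could serve as one numerically valued entry in an attribute vector of the kind described above.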

For the example from [OCB01], the tagged vectors constituted the training data to be input to C5.0. One can visualize the training data as a spreadsheet. For [OCB01] each row of the spreadsheet would have 79 columns and would represent one data point. The first 78 columns would contain the values of the respective 78 attributes. The 79th would contain the tag regarding orthology. For our Biomedical Objectives we would have a similar layout for each disease, of course with the numbers of attributes being quite different in each case. The C4.5 (decision tree induction) component of C5.0 was first used to fit to this data an especially convenient classification program called a decision tree. The figure above shows a portion of the first decision tree obtained from the run of the tagged data from the 213 known orthologs between mouse, human, and chicken. A decision node in a decision tree is a point in the tree immediately below which there is a YES or NO branching. The complete tree of the figure above has 35 decision nodes, but the incomplete tree actually shown in the figure has only 4. The large subtree omitted from the figure then has 31 decision nodes. For the proposed first Project, we would similarly obtain decision trees with decision nodes featuring instead attributes of relevance to that project, e.g., differential gene expression measures.
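To make the spreadsheet layout and the notion of a decision node more concrete, the following is a minimal Python sketch of how one candidate YES/NO test on a single attribute column could be scored against the tags. It uses plain information gain; C4.5 itself selects tests with the related gain ratio criterion [Qui93], so this is an illustration under simplifying assumptions rather than C5.0's actual procedure. The rows (two attribute columns plus a tag column, instead of 78 plus one), the values, and the function names are all invented, and no missing values are assumed.

```python
import math
from collections import Counter


def entropy(tags):
    """Shannon entropy (in bits) of a list of outcome tags."""
    total = len(tags)
    return -sum((n / total) * math.log2(n / total) for n in Counter(tags).values())


def information_gain(rows, column, threshold):
    """Gain from the test `row[column] <= threshold`; the last entry of each row is its tag."""
    tags = [row[-1] for row in rows]
    yes = [row[-1] for row in rows if row[column] <= threshold]
    no = [row[-1] for row in rows if row[column] > threshold]
    if not yes or not no:
        return 0.0  # the test does not split the data at all
    remainder = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(rows)
    return entropy(tags) - remainder


# Invented toy training data: [attribute 0, attribute 1, tag].
rows = [
    [10.0, 5.0, "NON-ORTHOLOGOUS"],
    [20.0, 1.0, "NON-ORTHOLOGOUS"],
    [30.0, 6.0, "ORTHOLOGOUS"],
    [40.0, 2.0, "ORTHOLOGOUS"],
]
print(information_gain(rows, column=0, threshold=25.0))  # test separates the tags fully
print(information_gain(rows, column=1, threshold=3.0))   # test tells us nothing
```

In this toy table the test on attribute 0 would be preferred for the top decision node, since it separates the two tags completely while the test on attribute 1 leaves both branches evenly mixed.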

Importantly, the particular computer procedure the C4.5 component of C5.0 employs to obtain a decision tree produces trees in which the topmost decisions explain the most data. For example, in the figure above, the decision as to whether or not CHICK-HUMAN AA IDENTITY <= 25.54 is the most informationally salient. The next decisions down are second most salient in explaining the data, and so on down the branches of the tree. For example, in the figure above, the decision as to whether or not CHICK-MOUSE NA IDENTITY <= 49.5 is the second most informationally salient. Further down is a decision to be made about an attribute featuring a ratio. Also shown in the figure is a decision to be made as to whether MOUSE A TO CHICK C <= 19.09. The attribute value tested in this decision node is about a transversion. If some vector of attribute values of known or unknown tagging is tested in the tree above, and the answers to the successive questions are, in order, NO, YES, NO, and YES, this decision tree returns the decision NON-ORTHOLOGOUS. If the sequence of answers is the same except that the last one is instead NO, then the decision tree makes the decision ORTHOLOGOUS. For a data point whose correct tagging we would want to predict (because we do not know it), the vector would have 78 entries and a blank 79th (for the orthology example). For our Biomedical Objectives, such a data point would have the number and type of attributes for a particular disease and be missing the tagging as to patient outcome. A tree learned from the training data from that disease could be used to predict the unknown patient outcome. Back to the orthology example. Far down in the omitted subtree is the decision node asking CHICK-MOUSE MINOR TRANSVERSION BIAS > 2.292322; hence, this decision regarding a particular transversion bias for mutations from chicken to mouse is important to the overall decision, but not as important as the decisions displayed in the figure.
Here is a consequence. Decision trees created by C4.5 can be read for human insight as to the relative importance of tests of attribute values. For the present proposal this is particularly important and is the primary reason for our particular choice of C5.0 as the supervised learning tool. The group leader at AI duPont insisted on this "transparency" for our machine learning techniques. Other techniques that might have been chosen, such as neural nets, are, by contrast, "opaque." For many of our Biomedical Objectives, we will be using our biochemical, medical, and clinical knowledge to select measurable or calculable entities as attributes, and some of these entities are certainly expected to be more causally salient, e.g., regarding outcomes of treatments for disease, than are others. This is expected when the entities represent gene expression along chemical pathways. Sometimes upstream gene expressions are more relevant, sometimes downstream ones are. If decisions about one such entity are more informationally salient in a decision tree than those about another, that quite plausibly points to greater causal significance of the former, and this potentially useful insight can then be followed up toward further improving our machine-learned prediction and classification programs re, for example, outcomes of disease treatments. We would, for example, concentrate on developing additional attributes of relevance to the more salient (and, perhaps, then, more causally significant) entities.
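The way a learned tree is read, one YES/NO answer at a time, can be sketched in a few lines of Python. The three thresholds for CHICK-HUMAN AA IDENTITY, CHICK-MOUSE NA IDENTITY, and MOUSE A TO CHICK C are the ones quoted in the text. Everything else is an assumption made for illustration: the four displayed decision nodes are assumed to lie on a single path in the order the text mentions them, the ratio attribute's name and threshold are placeholders because the text does not give them, and the omitted subtrees are represented by a marker string. The attribute values in the example vector are invented.

```python
OMITTED = "OMITTED SUBTREE"

# A decision node is (attribute, threshold, subtree_if_yes, subtree_if_no);
# a leaf is a plain string. The arrangement below is assumed, not recovered
# from the original figure.
TREE = (
    "CHICK-HUMAN AA IDENTITY", 25.54,
    OMITTED,
    ("CHICK-MOUSE NA IDENTITY", 49.5,
     ("HYPOTHETICAL RATIO ATTRIBUTE", 1.0,  # placeholder name and threshold
      OMITTED,
      ("MOUSE A TO CHICK C", 19.09, "NON-ORTHOLOGOUS", "ORTHOLOGOUS")),
     OMITTED),
)


def classify(tree, vector):
    """Walk the tree; return the leaf reached and the YES/NO answers given."""
    answers = []
    node = tree
    while not isinstance(node, str):
        attribute, threshold, if_yes, if_no = node
        yes = vector[attribute] <= threshold
        answers.append("YES" if yes else "NO")
        node = if_yes if yes else if_no
    return node, answers


# Invented attribute values for an untagged data point.
vector = {
    "CHICK-HUMAN AA IDENTITY": 60.0,
    "CHICK-MOUSE NA IDENTITY": 45.0,
    "HYPOTHETICAL RATIO ATTRIBUTE": 1.5,
    "MOUSE A TO CHICK C": 10.0,
}
print(classify(TREE, vector))
# ('NON-ORTHOLOGOUS', ['NO', 'YES', 'NO', 'YES'])
```

Because each answer is recorded on the way down, the path itself documents which attribute tests led to the decision, which is the kind of transparency discussed above.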

As noted above, C5.0 has the option to invoke AdaBoosting. This is a newer technique for provably improving learners, including C4.5, both for fitting training data [FS96] and for generalization and prediction beyond the training data [FSBL98] (see also [FMS01]). It also handles well the presence of errors or noise in the training data [FS96]. The latter is clearly important re, for example, our Biomedical Objectives. All of our techniques also handle missing attribute values in attribute vectors well [Qui93], and this is important, for example, in our Objectives, where arrays regularly tend to miss one-half of the genes expressed.

Back to AdaBoosting: AdaBoosting first fits a sequence of decision trees to the training data, where each tree beyond the first judiciously concentrates on the cases where its predecessor made classification errors; AdaBoosting then makes its final classification decision by taking a weighted majority vote of the decisions of the separate trees in the sequence it has already created. More accurate trees in the sequence get higher weight. Since boosting combines a number of decision trees, its use may involve some tolerable loss of insight and efficiency; however, boosting nonetheless looks like linear (i.e., fast) programming [FS99]. In [OCB01] an AdaBoosting-created sequence of just three trees (with a majority vote among them) provided a perfect classifier for all the training data. More importantly, [OCB01] employed 10-fold cross-validation (i.e., a random tenth of the data is removed from training and employed instead for testing) with 10 repetitions and obtained, with an AdaBoosting-produced sequence of 25 trees (with 35-40 decision nodes per tree), a remarkably low error rate of 2.4% (with Standard Error less than 0.05%) on the entire data set for all 213 orthologs. This shows that the attribute selection for orthology must explain a great deal of the differential evolution between disparate species, at least toward prediction. This error rate from the 10-fold cross-validation provides a good upper-bound estimate of the error rate one might expect from a 25-tree decision maker, trained on all the training data from [OCB01], on totally new cases of interest. Of course, the majority vote of 25 trees, each with 35-40 decision nodes, is a quite usefully sophisticated and subtle decision maker. Similarly for our Objectives, 10-fold cross-validations would provide estimates of how much to believe predictions of associated learned decision makers. Later, of course, new data would be gathered to further test predictions.
Also, we would expect that correct predictions re patients would also require sophisticated and subtle decision makers -- as C5.0 can produce.
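The reweighting and weighted majority vote described above can be illustrated with a minimal Python sketch of binary AdaBoost. To stay short it boosts one-test "stumps" on a single numeric attribute rather than full C4.5 trees, and it uses the textbook weight formula alpha = 0.5 * ln((1 - err) / err); C5.0's own boosting implementation may differ in its details. The data, the +1/-1 tag encoding, and the function names are invented for illustration.

```python
import math


def best_stump(xs, ys, weights):
    """Return (weighted error, threshold, polarity) of the best test `x <= threshold`."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x <= threshold else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def adaboost(xs, ys, rounds):
    """Fit `rounds` stumps; each round upweights the cases the last stump got wrong."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # more accurate stump, larger vote
        ensemble.append((alpha, threshold, polarity))
        weights = [
            w * math.exp(-alpha * y * (polarity if x <= threshold else -polarity))
            for w, x, y in zip(weights, xs, ys)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted majority vote of all stumps in the ensemble."""
    score = sum(alpha * (polarity if x <= threshold else -polarity)
                for alpha, threshold, polarity in ensemble)
    return 1 if score >= 0 else -1


# Invented toy data that no single threshold test can classify perfectly.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs] == ys)  # True
```

On this toy data a single stump misclassifies at least two of the six points, while the weighted vote of three stumps fits all six. Agreement on training data says nothing by itself about new cases, which is why held-out estimates such as the 10-fold cross-validation described above are needed.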

The techniques we develop will be useful toward better patient care in the future, and they will be adaptable to other disease scenarios, with reasonable expectations for success based on the success of the particular projects herein proposed.