Summer Research Webpage

Daniel Yehdego

Week 1

Goals for the week:

* To prepare and finalize my profile and webpage for summer 2011.

* To get access to the various campus server accounts.

* To read some basics on MapReduce and Hadoop, and to learn how to set up Hadoop.

 

June 13th

* Arrived at the University of Delaware at 5:00 pm from the University of Texas at El Paso.

June 14th

* Introduced to Dr. Taufer and members of the GCLab.

* Dr. Taufer set up my accounts on different GCL and department machines, and I received a key to the GCL room.

* Read about repositories and Subversion.

June 15th

* Finalized my webpage

   - http://www.cis.udel.edu/~yehdego 

* Attended a seminar by PhD candidate Trilce Estrada about a computational docking method

    Docking@home 

* Had lunch with Dr. Taufer and GCLab members.

* Read a paper on MapReduce

   MapReduce: Simplified Data Processing on Large Clusters  

 

June 16th

* Set up and configured a single-node Hadoop installation for simple operations using Hadoop MapReduce and the Hadoop Distributed File System (HDFS).

 Single Node Setup 
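 The guide boils down to a short command sequence; as a reminder to myself, a minimal sketch, run from the Hadoop install directory (the example jar name assumes the 0.20.2 release used later in this log):

    # format a fresh HDFS filesystem, then start all daemons
    # (NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker)
    bin/hadoop namenode -format
    bin/start-all.sh

    # sanity check: copy the conf directory into HDFS and run a bundled example
    bin/hadoop fs -put conf input
    bin/hadoop jar hadoop-0.20.2-examples.jar grep input output 'dfs[a-z.]+'
    bin/hadoop fs -cat output/part-00000

    # stop the daemons when done
    bin/stop-all.sh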

 

June 17th

 * Installed, configured, and managed a non-trivial Hadoop cluster with one master node and two slave nodes.

  Cluster Setup
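 For my notes, a sketch of the conf pieces that distinguish the cluster setup from the single-node one (the hostnames master, slave1, and slave2 are placeholders for the actual GCL machines; the ports are the conventional defaults from the docs):

    # conf/slaves on the master lists the machines running DataNode/TaskTracker:
    #     slave1
    #     slave2
    # conf/core-site.xml sets   fs.default.name    = hdfs://master:9000
    # conf/mapred-site.xml sets mapred.job.tracker = master:9001

    # run from the master; the start scripts ssh out to the slaves
    bin/start-dfs.sh
    bin/start-mapred.sh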

Week 2

 

* My goal for this week was to use the already installed and configured Hadoop single-node and cluster setups, and
   to read and practice with the MapReduce Tutorial, including:
      - MapReduce
      - User Interfaces
      - Job Configuration
      - Task Execution & Environment
      - Job Submission and Monitoring
      - Job Input and Output, and Other Useful Features
      - Examples: WordCount v1.0 and WordCount v2.0

June 20th

 

    * I have read through the user-facing aspects of the MapReduce framework.
    * This will help me implement, configure, and tune jobs in a fine-grained manner.
    * Some of the interfaces I have looked at are: Payload, Mapper, Reducer, Partitioner, Reporter, OutputCollector, Job Configuration, Task Execution & Environment, Map Parameters, and Shuffle/Reduce Parameters.

June 21st

 

    * Continued reading the MapReduce tutorial on the user-facing framework.
    * Read about the process involved in job submission and monitoring (a command-line sketch follows this list), which includes:
      - Checking the input and output specifications of the job.
      - Computing the InputSplit values for the job.
      - Setting up the requisite accounting information for the Distributed Cache of the job, if necessary.
      - Copying the job's jar and configuration to the MapReduce system directory on the FileSystem.
      - Submitting the job to the JobTracker and optionally monitoring its status.
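    As a sketch of how this looks from the user's side, a job can also be submitted and monitored from the command line (the jar, driver class, and job id below are placeholders):

    # submit a job packaged as a jar
    bin/hadoop jar myjob.jar MyJobDriver input output

    # list running jobs, then poll one of them by its job id
    bin/hadoop job -list
    bin/hadoop job -status job_201106210001_0001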

June 22nd

 

   * Attended a seminar by Narayan Ganesan about phylogenetic motif identification using HMMs.
   * Read a paper on BlastReduce: High Performance Short Read Mapping with MapReduce

June 23rd

 

      * Boyu and I tried to figure out some error messages when running Hadoop jobs.
      - Replaced the examples jar file hadoop-0.20.203.0-examples.jar with hadoop-0.20.2-examples.jar, and it worked fine.
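      A quick way to confirm that the examples jar matches the installed Hadoop version is to run an example that needs no input data, such as the bundled pi estimator:

      # 2 map tasks, 10 samples each; a version mismatch shows up immediately
      bin/hadoop jar hadoop-0.20.2-examples.jar pi 2 10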

June 24th

 

* A Hadoop job wasn't running on one node of the cluster. We tried to figure out the problem, and it turned out to be an inconsistency in the Hadoop packages I had used. I replaced all the packages with version hadoop-0.20.2, and jobs now run on all nodes of the cluster. I have tested with different inputs, and the results are as expected.


Week 3

My goals for this week were:

  • Read papers on MapReduce applications in bioinformatics.
  • Brainstorm ideas on how to use MapReduce to parallelize data for RNA secondary structure prediction.

June 27th

  • I have read the background information on the distribution properties of inversions and on a segmentation algorithm for the RNA sequences under consideration. This includes how an algorithm to segment RNA sequences with minimum information loss is developed.

  • The segmentation methods used are:

 i - the peak-length centralizing method,
 ii - the optimized method, and
 iii - the regular method.

June 28th - 29th

  • Ran Hadoop-Blast on distributed Hadoop (a sketch of steps 1 and 4 follows this list):

1. Run the Hadoop Distributed File System (HDFS) and the MapReduce daemon (JobTracker)
2. Prepare for Hadoop-Blast
3. Execute Hadoop-Blast
4. Monitor Hadoop during a running job
5. Finish the MapReduce process
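 A sketch of steps 1 and 4 on our cluster (the hostname master is a placeholder):

    # step 1: bring up HDFS, then the MapReduce daemons
    bin/start-dfs.sh
    bin/start-mapred.sh

    # step 4: monitor the running job through the built-in web interfaces
    #     http://master:50030/   JobTracker (job progress, task logs)
    #     http://master:50070/   NameNode (HDFS status)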

  • Attended a seminar by Boyu about MapReduce configuration parameters.

June 30th

  • While trying to run a MapReduce Hadoop-Blast job, we encountered some errors with the executable jar file and worked together with Boyu to solve the problem.

 

July 1st

  • Started brainstorming on how to parallelize given segmented chunks of RNA sequence data using the Hadoop MapReduce framework.
  • There are some concerns to be clarified about the chunks of data and their implementation on Hadoop MapReduce:

    i- the origin of the RNA sequence data, whether there is any overlap among the chunks of files, and some other points that are important factors in parallelizing the data and predicting its secondary structure;

    ii- since Hadoop applications are written in Java and most RNA secondary structure prediction programs are written in C/C++, we are looking into whether we can make the Hadoop framework work for applications written in C/C++.

Week 4

My goals for this week were:

  • Read a tutorial on Hadoop streaming.
  • Implement some executable C/C++ and Python programs in Hadoop streaming.
  • Brainstorm ideas on how to incorporate MapReduce Hadoop streaming to parallelize programs for RNA secondary structure prediction.

 

July 4th

Break due to the Independence Day holiday

 

July 5th

  • Read about Hadoop streaming: how streaming works, packaging files with job submissions, and streaming options and usage (a minimal example follows this list):
    * Mapper-Only Jobs
    * Specifying Other Plugins for Jobs
    * Large files and archives in Hadoop Streaming
    * Specifying Additional Configuration Variables for Jobs
    * Other Supported Options
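    The canonical example from the streaming documentation shows the shape of every streaming job: the mapper and reducer are simply executables that read stdin and write stdout (the input and output paths here are placeholders). A mapper-only job is the same command with -reducer NONE.

    bin/hadoop jar hadoop-0.20.2-streaming.jar \
        -input myInputDirs \
        -output myOutputDir \
        -mapper /bin/cat \
        -reducer /usr/bin/wc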

July 6th

  • Talked with Dr. Taufer about the segmentation procedure applied to the given RNA sequence data. We agreed on how to parallelize the given chunks of data: from a given RF file from the Rfam database, we are going to distribute chunks of segments determined by a given stem length, gap size, and segmentation method.
  • Attended a talk by GCLab student Kyle on QCN EmBOINC.
  • Continued working on Hadoop streaming.

 

July 7th

  • Ran a Hadoop MapReduce program in Python using Hadoop streaming (see the sketch after the links below).

 - First, the mapper and the reducer, written as Python scripts, were tested locally before using them in a MapReduce job.

 - Then we ran the job in MapReduce successfully using Hadoop streaming.

 - We can also use executables written in languages other than Python, such as Perl and C/C++, with the Hadoop streaming technique.

 - http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/

 - http://wiki.apache.org/hadoop/HadoopStreaming
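 Since a streaming mapper and reducer just read stdin and write stdout, the whole job can be simulated locally with a pipe before touching the cluster; this is how the scripts from the tutorial above are tested (the file names follow that tutorial):

    # simulate map -> shuffle/sort -> reduce on the local machine
    cat input.txt | python mapper.py | sort | python reducer.py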

 

July 8th

  • Downloaded and installed the 2D secondary structure prediction programs UNAFOLD and mfold on the master node.
  • Got some error messages when I tried to run a UNAFOLD.pl job locally; this is because the mfold_util library is not defined or is defined incorrectly. I read some online forums during the weekend and found answers saying the problem can be corrected either by setting and exporting the environment variable $UNAFOLDDIR to point to the shared library or by copying the files to the working directory (a sketch of the first fix follows).
  • Correcting this error will be my first task next week.
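 A sketch of the first suggested fix (the install prefix and sequence file name are guesses; $UNAFOLDDIR should point at wherever the mfold_util shared files actually live):

    # tell UNAFOLD.pl where the mfold_util shared files are
    export UNAFOLDDIR=/usr/local/share/mfold_util
    UNAFOLD.pl mysequence.seq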

 

Week 5

My goals for this week were:

  • Read a tutorial on Hadoop streaming.
  • Install and run the UNAFOLD and mfold programs on a Linux machine.
  • Implement some executable C/C++ and Python programs in Hadoop streaming.
  • Brainstorm ideas on how to incorporate MapReduce Hadoop streaming to parallelize programs for RNA secondary structure prediction.

July 11 - July 15

After the UNAFOLD and mfold programs were downloaded and installed on the master node for 2D secondary structure prediction, we got error messages when running a UNAFOLD.pl job locally because the mfold_util library was not defined or was defined incorrectly. From the online forums I read over the weekend, the problem can be corrected either by setting and exporting the environment variable $UNAFOLDDIR to point to the shared library or by copying the files to the working directory (see last week's notes).

 

  • I worked on correcting some of the installation error messages. This was done by compiling the library from mfold_util and including it in the UNAFOLD package.
  • Finally everything seems to work well, but the output files from UNAFOLD are not in dot-bracket format.
  • After discussing with Dr. Taufer, we decided to use the pknotsRG and pknotsRE executables for the Hadoop streaming application.
  • We selected these programs because of the ease with which they output dot-bracket notation and can be introduced into a Hadoop streaming application.



Week 6

My goals for this week were:

  • Implement Hadoop streaming for the pknotsRG and pknotsRE 2D secondary structure prediction programs.
  • Read tutorials and online forums on how to implement and execute Hadoop streaming for binary executable programs written in C/C++ (for example, pknotsRG and pknotsRE).

July 18th - 22nd

  • Worked on some simple executable Python and Perl programs to run with Hadoop streaming; the output was successful, as needed.
  • We used pknotsRG and pknotsRE for RNA 2D structure prediction locally (on one node), and the binary executables work well and output the desired predicted structure in dot-bracket format.
  • After checking that pknotsRG/pknotsRE work well with the RF files locally, we used Hadoop streaming to parallelize the input data files and predict their 2D structure. But this did not work, and we got the following stderr logs:

 

    java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 139
        at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
        at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
        at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
        at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

  • After discussing this problem with Dr. Taufer, we agreed to test with some simple, already compiled binary executables, 'cat' and 'wc'. We used cat and wc as our mapper and reducer, respectively, and Hadoop streaming worked well.
  • I tried different hadoop-streaming command-line options and kept getting the same error message when using pknotsRG as our mapper with reducer 'NONE'. I submitted a question about this to the Hadoop team forum, and they gave me some suggestions.
  • Some of the comments I received from the Hadoop team include:

 

  • My executable needs to read lines from standard input. Try setting the mapper like this:

      -mapper "/data/yehdego/hadoop-0.20.2/pknotsRG -"

    The trailing - added to the command line says to read from STDIN. This didn't work.
  • Someone also suggested using the following command to help me debug:

      hadoop fs -cat /user/yehdego/RNAData/RF00028_B.bpseqL3G5_seg_Centered_Method.txt | head -2 | /data/yehdego/hadoop-0.20.2/pknotsRG-1.3/src/pknotsRG -

 

  • This simulates what Hadoop streaming is doing: we take the first two lines of the input file and feed them to the stdin of pknotsRG. The first step is to make sure our program runs correctly with something like this. We may need to change the pknotsRG command line to get it to read the data it is processing from stdin instead of from a file. Alternatively, we may need to write a shell script that takes the data coming from stdin, writes it to a file, and then calls pknotsRG on that temporary file. Once we have this working, we should try it again with streaming.
  • The latter comment looks plausible, because after I tried the suggested hadoop command above locally, pknotsRG failed to read from stdin. From this we can conclude that pknotsRG/RE read the data they process from a file, and for pknotsRG/RE to work successfully with Hadoop streaming, we either need to change the pknotsRG command line so it reads from stdin, or wrap it in a script.
  • Next week I will write a small C binary executable that reads its input data from stdin and see if Hadoop streaming works successfully.

 

 

Week 7

My goal for this week was:

  • Implement and execute Hadoop streaming for binary executable programs written in C/C++ (for example, pknotsRG and pknotsRE).

July 25th-29th

  • The binary executable pknotsRG/RE only reads a file with a sequence in it. This means there must be a shell script that takes the data coming from stdin and writes it to a temporary file.
  • The ideal solution would be to modify pknotsRG to read from stdin; alternatively, we can write a small shell script that takes the data coming from stdin, writes it to a file, and then calls pknotsRG on that temporary file.
  • The shell script looks something like the following:

 

    #!/bin/sh
    # collect the sequence lines arriving on stdin into a temporary file,
    # then hand that file to pknotsRG, which only reads its input from a file
    rm -f temp.txt
    while read line
    do
        echo "$line" >> temp.txt
    done
    exec pknotsRG temp.txt

  • After this script was placed in a file named hadoopPknotsRG and run with Hadoop streaming using the following command, the mapping was successful, and all the chunks of RNA segments from each RF file were distributed and their 2D structures were predicted.

 

   HADOOP_HOME$ bin/hadoop jar /data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar \
        -mapper ./hadoopPknotsRG \
        -file /data/yehdego/hadoop-0.20.2/pknotsRG \
        -file /data/yehdego/hadoop-0.20.2/hadoopPknotsRG \
        -input /user/yehdego/RF00028_B.bpseqL3G5_seg_Centered_Method.txt \
        -output /user/yehdego/RF-out \
        -reducer NONE \
        -verbose

  • The next step is to reduce the outputs of the mappers and glue together the predicted 2D structures of the chunks of segments from a given RF file. We could use a reducer to do this, but the issue here is the shuffle between the maps and the reduces: the shuffle groups the map output by key to send to the reducers and then sorts it by key. So in reality our map output looks something like:

    FROM MAP 1:
    <RNA-1>\t<STRUCTURE-1>\n
    <RNA-2>\t<STRUCTURE-2>\n

    FROM MAP 2:
    <RNA-3>\t<STRUCTURE-3>\n
    <RNA-4>\t<STRUCTURE-4>\n

    FROM MAP 3:
    <RNA-5>\t<STRUCTURE-5>\n
    <RNA-6>\t<STRUCTURE-6>\n .....

  • If we send this to a single reducer, the input to the reducer will be sorted alphabetically by the RNA sequence, and the original order of the input will be lost. We can work around this by giving each line a unique number in the order we want it to be output, but doing this requires writing some code.
  • After discussing with Dr. Taufer and Boyu, we agreed to modify pknotsRG so that it outputs a unique ID together with the RNA sequence and the predicted 2D structure. After that, a small reducer shell script can be written, and the input to the reducer will be sorted by the unique ID (a sketch follows below). Doing this is next week's objective.
  • I have also attended a weekly GCLab talk by William Killian.
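 A minimal sketch of that reducer, assuming the modified pknotsRG emits lines of the form <ID>\t<RNA>\t<STRUCTURE> with a numeric ID (this format is the plan, not something implemented yet):

    #!/bin/sh
    # streaming has already sorted the lines by key; sort -n guards against
    # IDs that are not zero-padded, then the ID column is stripped back off
    sort -n | cut -f2-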

 

 

Week 8

Aug 1st

Aug 2nd

Aug 3rd

Aug 4th

Aug 5th