Goals for the week:
* To prepare and finalize my profile and webpage for summer 2011.
* To get access to the different campus server accounts.
* To read some basics on MapReduce and Hadoop, and learn how to set up Hadoop.
June 13th
* Arrived at the University of Delaware at 5:00 pm from the University of Texas at El Paso.
June 14th
* Introduced to Dr. Taufer and the members of the GCLab.
* Dr. Taufer set up my accounts on different GCL and department machines, and I received a key to the GCL room.
* Read about repositories and Subversion.
June 15th
* Finalized my webpage.
* Attended a seminar by PhD candidate Trilce Estrada about a computational docking method.
* Had lunch with Dr. Taufer and GCLab members.
* Read a paper on MapReduce: "MapReduce: Simplified Data Processing on Large Clusters" (Dean and Ghemawat).
June 16th
June 17th
* My goal for this week was to use the already installed and configured Hadoop single-node and cluster setups, and to read and practice the MapReduce tutorial, including:
- MapReduce
- User Interfaces
- Job Configuration
- Task Execution & Environment
- Job Submission and Monitoring
- Job Input and Output, and other useful features
- Example: WordCount v1.0 and WordCount v2.0 (a sample invocation follows below)
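To exercise the installed setups, one option is to run the bundled WordCount example end to end. A minimal sketch, assuming the hadoop-0.20.2 release used later in this journal; the input/output paths and file names here are hypothetical:

HADOOP_HOME$ # copy a local text file into HDFS as the job input
HADOOP_HOME$ bin/hadoop fs -put sample.txt input/sample.txt
HADOOP_HOME$ # run the WordCount example shipped with the release
HADOOP_HOME$ bin/hadoop jar hadoop-0.20.2-examples.jar wordcount input output
HADOOP_HOME$ # print the word counts produced by the reducer
HADOOP_HOME$ bin/hadoop fs -cat output/part-00000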
June 20th
* I have read through the user-facing side of the MapReduce framework.
* This will help me implement, configure, and tune jobs in a fine-grained manner.
* Some of the interfaces and topics I have looked at are: Payload, Mapper, Reducer, Partitioner, Reporter, OutputCollector, Job Configuration, Task Execution & Environment, Map Parameters, and Shuffle/Reduce Parameters.
June 21st
* I continued reading the MapReduce tutorial on the user-interface framework.
* I have read about the process involved in job submission and monitoring, which includes (a sketch of the corresponding commands follows the list):
- Checking the input and output specifications of the job.
- Computing the InputSplit values for the job.
- Setting up the requisite accounting information for the DistributedCache of the job, if necessary.
- Copying the job's jar and configuration to the MapReduce system directory on the FileSystem.
- Submitting the job to the JobTracker and optionally monitoring its status.
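From the command line, submission and monitoring can be exercised with the job client. A minimal sketch, assuming hadoop-0.20.2; the jar, class, paths, and job id here are hypothetical:

HADOOP_HOME$ # submit a job packaged in a jar; the client performs the checks listed above
HADOOP_HOME$ bin/hadoop jar myjob.jar MyJobDriver input output
HADOOP_HOME$ # list the jobs currently known to the JobTracker
HADOOP_HOME$ bin/hadoop job -list
HADOOP_HOME$ # query the status and completion percentage of one job by its id
HADOOP_HOME$ bin/hadoop job -status job_201106210001_0001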
June 22nd
* Attended a seminar by Narayan Ganesan about phylogenetic motif identification using HMMs.
* Read a paper on BlastReduce: High Performance Short Read Mapping with MapReduce.
June 23rd
* Boyu and I tried to figure out some error messages when running Hadoop jobs.
- Replaced the examples jar file hadoop-0.20.203.0-examples.jar with hadoop-0.20.2-examples.jar, and the jobs worked fine (a quick version check is sketched below).
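The errors came from mixing jars across Hadoop releases. A minimal sketch of how to confirm that the examples jar matches the installed release (the version and jar name are taken from the entry above):

HADOOP_HOME$ # print the installed Hadoop version; it should match the jar's version suffix
HADOOP_HOME$ bin/hadoop version
HADOOP_HOME$ # run a trivial example (the pi estimator) from the matching examples jar as a smoke test
HADOOP_HOME$ bin/hadoop jar hadoop-0.20.2-examples.jar pi 2 10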
June 24th
* The Hadoop job wasn't running on one node of the cluster. While trying to figure out the problem, it turned out to be an inconsistency in the Hadoop packages I had used. I replaced the package on every node with version hadoop-0.20.2, and jobs now run on all nodes of the cluster. I have tested with different inputs and the results are as expected.
My goals for this week were:
- Read papers involving MapReduce applications in Bioinformatics
- Brainstorm ideas on how to incorporate MapReduce to parallelize data for RNA secondary structure prediction.
June 27th
- I have read the background information on the distribution properties of inversions and on a segmentation algorithm for the RNA sequences under consideration. This includes how an algorithm to segment RNA sequences with minimum information loss is developed.
- The segmentation methods used are:
i- Peak length centralizing method
ii- Optimized method
iii- Regular method
June 28th
- Ran Hadoop-Blast on distributed Hadoop (the daemon start-up commands are sketched after this list):
1. Run the Hadoop Distributed File System (HDFS) and the MapReduce daemon (JobTracker)
2. Prepare for Hadoop-Blast
3. Execute Hadoop-Blast
4. Monitor Hadoop during a running job
5. Finish the MapReduce process
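A minimal sketch of step 1 and the monitoring in step 4, assuming a hadoop-0.20.2 installation whose conf/ files already describe the cluster:

HADOOP_HOME$ # start HDFS (NameNode and DataNodes) and the MapReduce daemons (JobTracker and TaskTrackers)
HADOOP_HOME$ bin/start-dfs.sh
HADOOP_HOME$ bin/start-mapred.sh
HADOOP_HOME$ # list the Java daemons running on this node to confirm they came up
HADOOP_HOME$ jps
HADOOP_HOME$ # the stock web UIs also show cluster and job state: NameNode on port 50070, JobTracker on port 50030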
- Attended a seminar by Boyu about MapReduce configuration parameters.
June 30th
- While trying to run a MapReduce Hadoop-Blast job, we encountered some errors with the executable jar file and worked together with Boyu to solve this problem.
July 1st
- Started brainstorming on how to parallelize given segmented chunks of RNA sequence data using the Hadoop MapReduce framework.
- There are some concerns to be clarified about the chunks of data and their implementation on Hadoop MapReduce:
i- The origin of the DNA sequence data, whether there is some kind of overlap among the chunks of files, and some other points that are important factors in parallelizing the data and predicting their secondary structure.
ii- Since Hadoop applications are written in Java and most RNA secondary structure prediction programs are in C/C++, we are looking into whether the Hadoop framework can be used with applications written in C/C++.
My goals for this week were:
- Read a tutorial on Hadoop streaming.
- Implement some executable C/C++ and Python programs in Hadoop streaming.
- Brainstorm ideas on how to incorporate MapReduce Hadoop streaming to parallelize programs for RNA secondary structure prediction.
July 4th
July 5th
- Read about Hadoop streaming: how streaming works, packaging files with job submissions, and streaming options and usage (a sample command line follows the list), including:
* Mapper-Only Jobs
* Specifying Other Plugins for Jobs
* Large files and archives in Hadoop Streaming
* Specifying Additional Configuration Variables for Jobs
* Other Supported Options
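A minimal sketch of a streaming command that exercises two of these options, a mapper-only job and an extra configuration variable; it assumes the streaming jar location of a stock hadoop-0.20.2 tree, and the input/output paths are hypothetical:

HADOOP_HOME$ bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \
  -D mapred.job.name="streaming-sketch" \
  -input input/sample.txt \
  -output out-streaming \
  -mapper /bin/cat \
  -reducer NONE   # mapper-only job: map outputs go straight to HDFS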
July 6th
- Talked with Dr. Taufer about the segmentation procedure followed on the given RNA sequence data. We agreed on how to parallelize the given chunks of data: from a given RF file from the Rfam database, we are going to distribute chunks of segments, which are determined by a given stem length, gap size, and segmentation method.
- Attended a talk by GCLab student Kyle on QCN and EmBOINC.
- Continued working on Hadoop streaming.
July 7th
- Ran a Hadoop MapReduce program in Python using Hadoop streaming (the commands are sketched below).
- First, the mapper and the reducer, written as Python scripts, were tested locally before using them in a MapReduce job.
- Then we ran the job successfully in MapReduce using Hadoop streaming.
- Executables in languages other than Python, such as Perl and C/C++, can also be used with the Hadoop streaming technique.
- http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
- http://wiki.apache.org/hadoop/HadoopStreaming
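A minimal sketch of the local test and the streaming run, following the tutorial linked above; mapper.py and reducer.py are the tutorial's word-count scripts, and the paths are hypothetical:

HADOOP_HOME$ # test the pipeline locally first: map, a shuffle-like sort, then reduce
HADOOP_HOME$ cat sample.txt | ./mapper.py | sort | ./reducer.py
HADOOP_HOME$ # then run the same scripts as a streaming job, shipping them to the nodes with -file
HADOOP_HOME$ bin/hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \
  -file mapper.py -mapper mapper.py \
  -file reducer.py -reducer reducer.py \
  -input input/sample.txt -output out-wordcount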
July 8th
- Downloaded and installed the 2D secondary structure prediction programs UNAFOLD and mfold on the master node.
- Got some error messages when I tried to run a UNAFOLD.pl job locally. This is because the mfold_util library is not defined or is defined incorrectly. I read some online forums over the weekend and found answers saying that the problem can be corrected either by setting and exporting the environment variable $UNAFOLDDIR to point to the shared library, or by copying the files to the working directory (the first workaround is sketched below).
- Correcting this error will be my first task next week.
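A minimal sketch of the environment-variable workaround from the forums; the install path and sequence file name here are hypothetical:

$ # point UNAFOLD at its shared data directory before invoking UNAFOLD.pl
$ export UNAFOLDDIR=/data/yehdego/unafold/share
$ UNAFOLD.pl mysequence.seq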
My goals for this week were:
- Read a tutorial on Hadoop streaming.
- Install and run the UNAFOLD and mfold programs on a Linux machine.
- Implement some executable C/C++ and Python programs in Hadoop streaming.
- Brainstorm ideas on how to incorporate MapReduce Hadoop streaming to parallelize programs for RNA secondary structure prediction.
July 11 - July 15
After UNAFOLD and mfold were downloaded and installed on the master node for 2D secondary structure prediction, UNAFOLD.pl still failed locally because the mfold_util library was not defined or was defined incorrectly (see July 8th).
- I worked on correcting some of the installation error messages. This was done by compiling the library from mfold_util and including it in the UNAFOLD package.
- Finally everything seems to work well, but the output files from UNAFOLD are not in dot-bracket format.
- After discussing with Dr. Taufer, we decided to use the pknotsRG and pknotsRE executables for the Hadoop streaming application.
- The reason we selected these programs is that they readily output structures in dot-bracket format and are easy to introduce into a Hadoop streaming application.
My goals for this week were:
- Implement Hadoop streaming for the pknotsRG and pknotsRE 2D secondary structure prediction programs.
- Read tutorials and online forums on how to implement and execute Hadoop streaming for binary executable programs written in C/C++ (for example, pknotsRG and pknotsRE).
July 18th
- Worked on some simple executable Python and Perl programs to run with Hadoop streaming, and the output was successful as needed.
- We used pknotsRG and pknotsRE for RNA 2D structure prediction locally (with one node); the binary executables work well and output the desired predicted structure in dot-bracket format.
- After checking that pknotsRG/pknotsRE work well with the RF files locally, we used Hadoop streaming to parallelize the input data files and predict their 2D structures. But this did not work, and we got the following stderr logs:
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 139
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
- After discussing this problem with Dr. Taufer, we agreed to try some simple, already compiled binary executable programs: 'cat' and 'wc'. We used cat as our mapper and wc as our reducer, and the Hadoop streaming job worked well.
- I tried different Hadoop streaming command-line options but kept getting the same error message when using pknotsRG as the mapper with reducer NONE. I submitted a question about this to the Hadoop user forum and received some suggestions.
- Some of the comments I received from the Hadoop team include:
- My executable needs to read lines from standard in. Try setting the mapper like this:
- > -mapper "/data/yehdego/hadoop-0.20.2/pknotsRG -" The "-" added to the command line says read from STDIN.
- This didn't work.
- And someone suggested using the following command to help me debug:
- hadoop fs -cat /user/yehdego/RNAData/RF00028_B.bpseqL3G5_seg_Centered_Method.txt | head -2 | /data/yehdego/hadoop-0.20.2/pknotsRG-1.3/src/pknotsRG -
- This simulates what Hadoop streaming is doing: it takes the first 2 lines out of the input file and feeds them to the stdin of pknotsRG. The first step is to make sure that we can get our program to run correctly with something like this. We may need to change the command line to pknotsRG to get it to read the data it is processing from stdin instead of from a file. Alternatively, we may need to write a shell script that takes the data coming from stdin, writes it to a file, and then calls pknotsRG on that temporary file. Once we have this working, we should try it again with streaming.
- The latter comment looks plausible, because when I tried the suggested hadoop command above locally, pknotsRG failed to read from stdin. From this we can understand that pknotsRG/RE read the data they process from a file, and in order for pknotsRG/RE to work successfully with Hadoop streaming, we may need to change the command line to pknotsRG so that it reads its input from stdin instead of from a file.
- Next week I will write a small C binary executable that reads its input data from stdin and see if Hadoop streaming works successfully.
My goals for this week were:
- Implement and execute Hadoop streaming for binary executable programs written in C/C++ (for example, pknotsRG and pknotsRE).
July 25th
- The binary executable program pknotsRG/RE only reads a file with a sequence in it. This means there should be a shell script that takes the data coming from stdin and writes it to a temporary file.
- The ideal solution would be to modify pknotsRG to be able to read from stdin. We can also write a small shell script that takes the data coming from stdin, writes it to a file, and then calls pknotsRG on that temporary file.
- The shell script looks something like the following:
#!/bin/sh
# remove any temporary file left over from a previous map task
rm -f temp.txt
# copy every line arriving on stdin into the temporary file
while read line
do
    echo "$line" >> temp.txt
done
# replace this shell with pknotsRG, run on the buffered input
exec pknotsRG temp.txt
- After this script was placed in a file named hadoopPknotsRG and run with Hadoop streaming using the following command, the mapping was successful: all the chunks of RNA segments from each RF file were distributed and their 2D structures were predicted.
HADOOP_HOME$ bin/hadoop jar /data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar \
  -mapper ./hadoopPknotsRG \
  -file /data/yehdego/hadoop-0.20.2/pknotsRG \
  -file /data/yehdego/hadoop-0.20.2/hadoopPknotsRG \
  -input /user/yehdego/RF00028_B.bpseqL3G5_seg_Centered_Method.txt \
  -output /user/yehdego/RF-out \
  -reducer NONE \
  -verbose
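With -reducer NONE, the mapper outputs land directly in the output directory, one part file per map task. A quick way to inspect them (the part file name is the framework's default numbering):

HADOOP_HOME$ bin/hadoop fs -ls /user/yehdego/RF-out
HADOOP_HOME$ bin/hadoop fs -cat /user/yehdego/RF-out/part-00000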
- The next step is reducing the outputs of the mappers and gluing together the predicted 2D structures of the chunks of segments from a given RF file. We could use a reducer to do this, but the issue here is with the shuffle between the maps and the reduces. The shuffle groups the records by key to send to the reducers and then sorts them by key. So in reality our map output looks something like:
FROM MAP 1:
<RNA-1>\t<STRUCTURE-1>\n
<RNA-2>\t<STRUCTURE-2>\n
FROM MAP 2:
<RNA-3>\t<STRUCTURE-3>\n
<RNA-4>\t<STRUCTURE-4>\n
FROM MAP 3:
<RNA-5>\t<STRUCTURE-5>\n
<RNA-6>\t<STRUCTURE-6>\n
.....
- If we send it to a single reducer, then the input to the reducer will be sorted alphabetically by the RNA sequence, and the order of the input will be lost. We can work around this by giving each line a unique number in the order we want it to be output, but doing this requires writing some code.
- After discussing with Dr. Taufer and Boyu, we agreed to modify pknotsRG so that it outputs a unique ID together with the RNA sequence and the predicted 2D structure. After that, a small reducer shell script can be written, and the input to the reducer will be sorted based on the unique ID. Doing this is our next week's objective (a rough sketch of the idea follows below).
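A rough, hypothetical sketch of the unique-ID idea, assuming for simplicity a single input split so that one mapper sees the whole file in order; instead of modifying pknotsRG itself, it wraps the temp-file trick from above so that every output line is prefixed with a zero-padded line number (streaming's shuffle sorts keys as text, so zero-padded ids sort in numeric order):

#!/bin/sh
# hypothetical mapper wrapper: give each input sequence a zero-padded id,
# predict its structure via a temporary file, and emit one line per sequence:
# "<id><TAB><sequence><TAB><structure>"
n=0
while read seq
do
    n=`expr $n + 1`
    printf '%s\n' "$seq" > temp.txt
    printf '%08d\t%s\t%s\n' "$n" "$seq" "`pknotsRG temp.txt | tr '\n' ' '`"
done

#!/bin/sh
# hypothetical reducer: lines arrive sorted by the zero-padded id,
# so simply strip the id column to recover the ordered results
cut -f 2-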
- I have also attended a weekly GCLab talk by William Killian.
Aug 1st
Aug 2nd
Aug 3rd
Aug 4th
Aug 5th