by Forrest Sheng Bao http://fsbao.net
If you are looking for a program to localize small sequences, such as small RNAs, on a genome, you can use our program, which is described in this paper: Small RNA Deep Sequencing Reveals Role for Arabidopsis thaliana RNA-Dependent RNA Polymerases in Viral siRNA Biogenesis. PLoS ONE 4(3): e4971. http://www.plosone.org/article/info:doi/10.1371/journal.pone.0004971 Please cite our paper if you use our program in your papers. Here is a short how-to.
My program is written in Python, so you have to install a Python interpreter first. On most Linux distributions, Python is already installed. On Mac and Windows, please go to http://www.python.org/download/ to download the installer. Python 2.x is recommended because I have not tested the program on Python 3.0.
To run my program, you also need three basic Python modules: SciPy, NumPy and Matplotlib. On most Linux distributions, you can install them with the package manager. For example, on Ubuntu Linux, just execute this command: sudo apt-get install python-scipy python-numpy python-matplotlib tkinter. For Windows and Mac, please go to their download pages, http://www.scipy.org/Download and http://sourceforge.net/project/showfiles.php?group_id=80706&package_id=2...
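Before going further, it can help to confirm that the modules can actually be imported. The snippet below is only a minimal sketch and is not part of the downloaded package; the helper name missing_modules and the REQUIRED list are made up for illustration, and the list simply mirrors the three modules named above.

```python
# Hypothetical helper, not shipped with map.py or plot.py.
# The module names below are the import names of NumPy, SciPy and Matplotlib.
REQUIRED = ["numpy", "scipy", "matplotlib"]


def missing_modules(names):
    """Return the names in `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_modules(REQUIRED)
    if absent:
        print("Missing modules: " + ", ".join(absent))
    else:
        print("All required modules are installed.")
```

If the script reports a missing module, install it with your package manager and run the check again.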
I do not provide help for the Windows platform. If you have problems installing these modules or running my program on Windows, please do NOT email me.
OK, now you are ready to run my program.
Download the compressed tarball containing the program and all the sample files you will need. It is here: http://narnia.cs.ttu.edu/drupal/files/source/2009/HOWTO.tar.bz2 Extract it and enter the directory of extracted files. We will do all the rest in a UNIX shell.
There is no standard format for sequencing results. Ours looks like this:
TTTGGATTGAAGGGAGCTCTA 78904 4
TTTGGATTGAAGGGAGCTCTT 17885 4
AGAATCTTGATGATGCTGCAT 6414 4
TCGGACCAGGCTTCATTCCCC 4190 4
TGAAGCTGCCAGCATGATCTA 3422 4
TCGCTTGGTGCAGGTCGGGAA 3069 4
TTCGGACCAGGCTTCATTCCC 2277 4
TTCTTCGTGAATATCTGGCAT 2229 4
TTGGACTGAAGGGAGCTCCCT 1610 4
TTAGTCGACATGTAAACCATT 1324 4
The first column is the siRNA sequence. The second column is the number of reads from the sequencing result. The last column is a remark field set by ourselves; you can simply ignore it.
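To make the three-column layout concrete, here is a minimal sketch of how such a file could be read in Python. It is not the code used in map.py; the function name parse_reads is hypothetical, and it assumes the columns are separated by whitespace (spaces or tabs).

```python
def parse_reads(lines):
    """Parse 'sequence reads remark' lines into (sequence, reads) tuples.

    Assumes whitespace-separated columns; the remark column is ignored.
    """
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        sequence = fields[0].upper()
        reads = int(fields[1])
        records.append((sequence, reads))
    return records


# The first three records of the sample shown above.
sample = [
    "TTTGGATTGAAGGGAGCTCTA 78904 4",
    "TTTGGATTGAAGGGAGCTCTT 17885 4",
    "AGAATCTTGATGATGCTGCAT 6414 4",
]
records = parse_reads(sample)
total_reads = sum(reads for _, reads in records)
print(total_reads)  # 103203
```

To parse a real file, pass the open file object instead of the list, for example parse_reads(open("s_6-A3-21.txt")).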
Given an input file in that format and a genome in FASTA format, running this command
python map.py siRNA_sequencing_result_file genome_FASTA_file length_of_the_genome type
gives you three files. The first file is an siRNA localization result file, in this format:
TTGTTCGGTTGACTGCGACTC 339 3183 1
TGAGTTCGGTGCTGCATTGCT 339 697 1
AGCGGTTTCCAGATACAGGAT 315 5775 1
TTCGAGTTGTTGGATAAAGGC 310 3104 1
ATGAGTTCGGTGCTGCATTGC 298 696 1
AAAGGATTGGAGGAAAGGATG 282 5501 1
AGTTCGGTGCTGCATTGCTTA 277 699 1
TATTTGTCAGATAAAAGGTTG 275 4817 1
CTTAGGTAATCGACGTAGTTC 275 2158 -1
TTTTCGCTTGGCATCTGCAAC 258 5245 -1
where the first column is the siRNA, the second column is the number of reads, the third column is the location of the siRNA on the genome, and the last column is the strand: sense (1) or antisense (-1).
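For readers who wonder how a position and strand like these can be derived, the following is a rough sketch of one common approach: search for the read on the genome, then search for its reverse complement to find antisense hits. This is an illustration under stated assumptions, not a description of what map.py actually does. It assumes exact matching, 1-based positions, and that an antisense hit is reported by its leftmost coordinate on the sense strand. All function names are hypothetical.

```python
# Assumption: sequences use the DNA alphabet (A, C, G, T, N), as in the samples.
# Any other character raises a KeyError in reverse_complement.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def find_all(genome, query):
    """Yield the 1-based start of every exact match, overlaps included."""
    start = genome.find(query)
    while start != -1:
        yield start + 1
        start = genome.find(query, start + 1)


def map_read(genome, seq):
    """Return (position, strand) pairs: strand 1 is sense, -1 is antisense."""
    hits = [(pos, 1) for pos in find_all(genome, seq)]
    hits += [(pos, -1) for pos in find_all(genome, reverse_complement(seq))]
    return hits


genome = "GGATCCAAGT"                # toy genome, made up for illustration
print(map_read(genome, "CCAA"))     # [(5, 1)]
print(map_read(genome, "ACTT"))     # [(7, -1)]
```

A read that occurs more than once yields one pair per occurrence. How map.py treats such reads is not covered here.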
The data for visualizing the result are stored in the other two files, one for each strand. You can then run plot.py to plot the localization result.
OK, now let's go through one example. To localize siRNAs on the genome, run this command:
python map.py s_6-A3-21.txt tmvcg
This maps the siRNAs in the sequencing result file s_6-A3-21.txt onto the genome of Tobacco Mosaic Virus (TMV), which is stored in the FASTA format file tmvcg. We get the three files described before: mapped.tmvcg.s_6-A3-21.txt, which stores the locations of siRNAs on the TMV genome, and plotdata.tmvcg.s_6-A3-21.A.txt and plotdata.tmvcg.s_6-A3-21.AS.txt, which are for plotting later. Now plot the localization result. Execute
python plot.py plotdata.tmvcg.s_6-A3-21.A.txt 6303 0
where 6303 is the genome length. You will then get a picture, plotdata.tmvcg.s_6-A3-21.png. I used a stem graph instead of the line graph in the original paper.
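If you would rather draw a similar picture yourself from the mapped file, the sketch below shows one possible way. It is not plot.py, and it does not read the plotdata files, whose format is not described here. Instead it assumes the four-column mapped format shown earlier, adds each siRNA's reads at its reported position only, and draws antisense reads below the axis. The function names and the output file name are hypothetical.

```python
def strand_profiles(mapped_lines, genome_length):
    """Sum reads per reported position, separately for each strand.

    Assumes 'sequence reads position strand' lines, with strand 1 or -1.
    Returns two lists indexed by position (index 0 up to genome_length).
    """
    sense = [0] * (genome_length + 1)
    antisense = [0] * (genome_length + 1)
    for line in mapped_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        reads, position, strand = int(fields[1]), int(fields[2]), int(fields[3])
        if not 0 <= position <= genome_length:
            continue  # ignore positions outside the genome
        if strand == 1:
            sense[position] += reads
        else:
            antisense[position] += reads
    return sense, antisense


def plot_profiles(sense, antisense, out_png):
    """Draw a stem graph: sense reads above the axis, antisense below."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    plt.figure()
    for profile, sign in ((sense, 1), (antisense, -1)):
        # Plot only positions that have reads, to keep the figure light.
        xs = [i for i, v in enumerate(profile) if v]
        if xs:
            plt.stem(xs, [sign * profile[i] for i in xs])
    plt.xlabel("Position on genome")
    plt.ylabel("Reads")
    plt.savefig(out_png)


mapped = [
    "TTGTTCGGTTGACTGCGACTC 339 3183 1",
    "CTTAGGTAATCGACGTAGTTC 275 2158 -1",
]
sense, antisense = strand_profiles(mapped, 6303)
# plot_profiles(sense, antisense, "my_plot.png")  # requires Matplotlib
```

The two sample lines are taken from the localization result shown earlier; to use a whole file, pass open("mapped.tmvcg.s_6-A3-21.txt") as the first argument.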
Should you have any questions, please do not hesitate to contact me.