Genomic localization for small RNAs in our PLoS One paper

by Forrest Sheng Bao http://fsbao.net

If you are looking for a program to do genome localization of small sequences, such as small RNAs, you can use our program as described in this paper: Small RNA Deep Sequencing Reveals Role for Arabidopsis thaliana RNA-Dependent RNA Polymerases in Viral siRNA Biogenesis. PLoS ONE 4(3): e4971. http://www.plosone.org/article/info:doi/10.1371/journal.pone.0004971 Please cite our paper if you use our program in your papers. Here is a short How-to.

My program is written in Python, so you have to install a Python interpreter first. On most Linux distributions, Python is already installed. On Mac and Windows, please go to http://www.python.org/download/ to download the installer. Python 2.x is recommended because I have not tested the program on Python 3.0.

To run my program, you also need three basic Python modules: SciPy, NumPy and Matplotlib. On most Linux distributions, you can use the package manager to install them. For example, on Ubuntu Linux, just execute this command: sudo apt-get install python-scipy python-numpy python-matplotlib tkinter. For Windows and Mac, please go to their download pages, http://www.scipy.org/Download and http://sourceforge.net/project/showfiles.php?group_id=80706&package_id=2...

I do not provide help on the Windows platform. So if you have problems installing the modules or running my program on Windows, please do NOT email me.

OK, now you are ready to run my program.

Download the compressed tarball of all the sample files and programs you will need. It is here: http://narnia.cs.ttu.edu/drupal/files/source/2009/HOWTO.tar.bz2 Extract it and enter the directory of extracted files. We will do all the rest in a UNIX shell.

There is no standard format for sequencing results. Ours looks like this:

TTTGGATTGAAGGGAGCTCTA	78904	4
TTTGGATTGAAGGGAGCTCTT	17885	4
AGAATCTTGATGATGCTGCAT	6414	4
TCGGACCAGGCTTCATTCCCC	4190	4
TGAAGCTGCCAGCATGATCTA	3422	4
TCGCTTGGTGCAGGTCGGGAA	3069	4
TTCGGACCAGGCTTCATTCCC	2277	4
TTCTTCGTGAATATCTGGCAT	2229	4
TTGGACTGAAGGGAGCTCCCT	1610	4
TTAGTCGACATGTAAACCATT	1324	4

The first column is the siRNA sequence. The second column is the number of reads from the sequencing result. The last column is a remark field set by ourselves; you can simply ignore it.
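If you want to inspect such a file yourself before mapping, the three-column layout can be read with a few lines of Python. This is only a minimal sketch, written for Python 3 and independent of map.py; the names Read and parse_sequencing_lines are hypothetical and do not come from my program.

```python
from collections import namedtuple

# Hypothetical record type for one line of the sequencing result file.
Read = namedtuple("Read", ["sequence", "reads"])

def parse_sequencing_lines(lines):
    """Parse whitespace-separated lines of the form: sequence, reads, remark."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        # The third column (remark) is ignored, as described in the text.
        records.append(Read(fields[0].upper(), int(fields[1])))
    return records

# The first three lines of the sample shown above.
sample = [
    "TTTGGATTGAAGGGAGCTCTA\t78904\t4",
    "TTTGGATTGAAGGGAGCTCTT\t17885\t4",
    "AGAATCTTGATGATGCTGCAT\t6414\t4",
]
records = parse_sequencing_lines(sample)
total_reads = sum(r.reads for r in records)
print(len(records), total_reads)
```

To read a real file, you could pass an open file object instead of the sample list, since the function only iterates over lines.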

Given an input file in that format and a genome in FASTA format, run this command:

python map.py siRNA_sequencing_result_file genome_FASTA_file length_of_the_genome type

You will get three files. The first file is an siRNA localization result file, in this format:

TTGTTCGGTTGACTGCGACTC	339	3183	1	
TGAGTTCGGTGCTGCATTGCT	339	697	1	
AGCGGTTTCCAGATACAGGAT	315	5775	1	
TTCGAGTTGTTGGATAAAGGC	310	3104	1	
ATGAGTTCGGTGCTGCATTGC	298	696	1	
AAAGGATTGGAGGAAAGGATG	282	5501	1	
AGTTCGGTGCTGCATTGCTTA	277	699	1	
TATTTGTCAGATAAAAGGTTG	275	4817	1	
CTTAGGTAATCGACGTAGTTC	275	2158	-1	
TTTTCGCTTGGCATCTGCAAC	258	5245	-1	

where the first column is the siRNA, the second column is the number of reads, the third column is the location of the siRNA on the genome and the last column represents the strand, sense (1) or antisense (-1).
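To give a rough idea of what localization means here, the sketch below finds exact matches of an siRNA on both strands of a toy genome and prints them in the four-column format above. It is not the code of map.py. It assumes exact matching only, 1-based positions, and that an antisense hit is reported at its leftmost base on the sense strand; map.py may handle any of these differently. All function names and the toy genome are made up for illustration.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}
    return "".join(pairs[base] for base in reversed(seq.upper()))

def find_all(genome, pattern):
    """Yield every 0-based start index of pattern in genome, overlaps included."""
    start = genome.find(pattern)
    while start != -1:
        yield start
        start = genome.find(pattern, start + 1)

def localize(genome, sequence, reads):
    """Return (sequence, reads, position, strand) tuples for exact matches.

    Assumption: positions are 1-based, and an antisense hit is reported at
    its leftmost base on the sense strand. map.py may count differently.
    """
    hits = []
    for pos in find_all(genome, sequence):
        hits.append((sequence, reads, pos + 1, 1))
    for pos in find_all(genome, reverse_complement(sequence)):
        hits.append((sequence, reads, pos + 1, -1))
    return hits

# Toy genome and toy siRNAs, not real data.
toy_genome = "ACGTTGCAAGGCTTAC"
hits = localize(toy_genome, "CGTTG", 10) + localize(toy_genome, "GCCTT", 3)
for hit in hits:
    print("\t".join(str(field) for field in hit))
```

In this toy case CGTTG matches the sense strand directly, while GCCTT only matches through its reverse complement AAGGC, so it is reported with strand -1.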

The data for visualizing the result are stored in two files, one for each strand. You can then run plot.py to plot the localization result.
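If you would like to build per-strand data yourself from a localization result file, one possible approach is to sum the reads at each position separately for the two strands. The sketch below does that for lines in the four-column format shown earlier. It is only an assumption about what is useful for plotting; the actual layout of the plot data files written by map.py is not described here and may differ. The function name reads_per_position is hypothetical.

```python
from collections import defaultdict

def reads_per_position(mapped_lines):
    """Sum reads at each genome position, separately for each strand.

    Input lines follow the localization result format:
    siRNA, reads, position, strand (1 or -1).
    """
    totals = {1: defaultdict(int), -1: defaultdict(int)}
    for line in mapped_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        reads, position, strand = int(fields[1]), int(fields[2]), int(fields[3])
        totals[strand][position] += reads
    return totals

# Three lines taken from the localization result sample.
mapped = [
    "TTGTTCGGTTGACTGCGACTC\t339\t3183\t1",
    "TGAGTTCGGTGCTGCATTGCT\t339\t697\t1",
    "CTTAGGTAATCGACGTAGTTC\t275\t2158\t-1",
]
totals = reads_per_position(mapped)
for strand in (1, -1):
    for position in sorted(totals[strand]):
        print(strand, position, totals[strand][position])
```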

OK, now let's go through one example. To localize siRNAs on the genome, run this command:

python map.py s_6-A3-21.txt tmvcg

This maps the siRNAs in the sequencing result file s_6-A3-21.txt onto the genome of Tobacco Mosaic Virus (TMV), which is stored in the FASTA format file tmvcg. We get the three files described before: mapped.tmvcg.s_6-A3-21.txt, which stores the locations of siRNAs on the TMV genome, and plotdata.tmvcg.s_6-A3-21.A.txt and plotdata.tmvcg.s_6-A3-21.AS.txt, for plotting later. Now plot the localization result. Execute

python plot.py  plotdata.tmvcg.s_6-A3-21.A.txt 6303 0

where 6303 is the genome length. You then get a picture, plotdata.tmvcg.s_6-A3-21.png. I used a stem graph instead of the line graph in the original paper.
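For readers who have not seen a stem graph, here is a small Matplotlib sketch that draws one from a handful of sense-strand positions and reads taken from the sample localization result. It is not plot.py itself; the function name plot_stem, the axis labels, the figure size and the output file name are my own choices for illustration.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # render to a file without needing a display
import matplotlib.pyplot as plt

def plot_stem(positions, reads, genome_length, out_path):
    """Draw reads against genome position as a stem graph and save it."""
    fig, ax = plt.subplots(figsize=(8, 3))
    ax.stem(positions, reads)
    ax.set_xlim(0, genome_length)
    ax.set_xlabel("Position on genome (nt)")
    ax.set_ylabel("Reads")
    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)

# Sense-strand entries from the sample localization result.
positions = [696, 697, 699, 3104, 3183]
reads = [298, 339, 277, 310, 339]
out_path = os.path.join(tempfile.gettempdir(), "stem_example.png")
plot_stem(positions, reads, 6303, out_path)
```

Each stem is one genome position, and its height is the number of reads there, which keeps isolated hotspots visible where a line graph would connect them.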

Should you have any questions, please do not hesitate to contact me.