USER GUIDE TO Mosquito Small RNA Genomics v. 1.0 OUTPUTS (updated 2021-03-10)

This guide explains how to read the outputs in the MSRG database. The outputs are generated by a series of custom Shell, Perl, C and Python scripts available on this GitHub page.

Example for running the script, gene_centric_test:
$ more tt
bash /[YOUR OWN CUSTOM INSTALLATION PATH]/software/mosquitoSmallRNA/bin/gene-centric.sh AnGam_Fcarc_TN.fastq Angam AGATCGGAAG 20000000 5

Example for running the script, TE_virus_test:
$ more tt
bash /[YOUR OWN CUSTOM INSTALLATION PATH]/software/mosquitoSmallRNA/bin/TE_virus_pipeline.sh AnGam_Fcarc_TN ../gene_centric_test/summary Angam AGATCGGAAG

Example for running the script, phasing_test:
$ more tt
bash /[YOUR OWN CUSTOM INSTALLATION PATH]/software/mosquitoSmallRNA/bin/sRNA_Phasing_pipeline.sh AnGam_Fcarc_TN.fastq Angam AGATCGGAAG

Example for running the script, piRNA_target_test:
$ more tt
bash /[YOUR OWN CUSTOM INSTALLATION PATH]/software/mosquitoSmallRNA/bin/piRNA_target_pipeline.sh Angam_darkgreen.fa /[YOUR OWN CUSTOM INSTALLATION PATH]/software/mosquitoSmallRNA/database/Angam_transcript_TE_virus.fa

Each folder is named according to the library ID name. Click the links below to see detailed descriptions of each text file table. Tab-delimited text tables are easily imported into MS Excel for sorting, or into MS Access or FileMaker Pro for database queries with Structured Query Language (SQL).

(1) BigWig outputs: For loading into genome browsers like the Broad Institute IGV. Note that visualizations depend on the genome assembly presented for each species; some more recent assemblies on VectorBase will not sync with the BigWig and BigBed outputs on the MSRG. To load the files properly in the IGV, download the GFF files from the MSRG.

(2) Genecentric Outputs : The TOP LINK goes to the folder holding the standard Genecentric Outputs as generated by the pipeline described in Chirn et al., PLoS Genetics 2015. The BOTTOM LINK goes to another folder holding Consolidated Genecentric Outputs, which address the issue of overlapping and redundant gene models that confound genecentric piRNA loci determinations.

(3) Length distributions and miRNA counts : tab-delim tables (length.all.xls and miRNA.xls) of raw read counts by read length and of normalized read counts for each miRNA.

(4) piRNA Phasing Analysis : tab-delim tables (3'to5'phasing.xls and 5'to5'phasing.xls) listing the frequency of distances of trailing piRNAs in proximity to a given piRNA.

(5) Structural RNAs as small RNAs
: PDF coverage plot of the reads along each structural RNA, and a tab-delim table of the reads mapping to structural RNAs.

(6) Transposon small RNAs
: PDF coverage plot of the reads along each transposon consensus family sequence, and a tab-delim table of the reads mapping to transposon families.

(7) Virus small RNAs
: PDF coverage plot of the reads along each virus sequence, and a tab-delim table of the reads mapping to virus genomes.

(8) Wolbachia small RNAs
: PDF file (Libname_Wolbachia.PDF) and tab-delim table (Libname_Wolbachia.xls) of the reads mapping to Wolbachia genomes.


DETAILS FOR PARTICULAR MSRG OUTPUT FOLDERS:

(1) BigWig (*.bw) Files (Binary WIG files) : These are binary conversions of the WIG files originally located in the Genecentric Output folders, provided because we found that some browsers cannot handle the memory-intensive processing of the WIG file format.

There are 4 files per library. The *.nor-pstv.bw and *.nor-ngtv.bw files hold read counts normalized by reads per million (rpm) and by mapping frequency, per 25bp window interval, for the Positive and Negative genomic strands. The *.unq-pstv.bw and *.unq-ngtv.bw files hold the uniquely mapping read counts for the Positive and Negative genomic strands. You can download the files to your local computer to load into the IGV browser, or paste the URL into the UCSC Genome Browser. However, few of the latest mosquito genome assemblies are being updated in the UCSC Genome Browser.
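
As a rough illustration of what "rpm and mapping frequency normalized reads per 25bp window" could mean, here is a minimal Python sketch. It is not the MSRG pipeline code. It assumes that a read mapping to N locations contributes 1/N to each location, that coordinates are 0-based, and that windows are non-overlapping; the function name and input layout are hypothetical.

```python
from collections import defaultdict

WINDOW = 25  # window size in bp, as stated in this guide


def normalized_window_counts(alignments, total_reads, window=WINDOW):
    """Return {(chrom, strand, window_start): normalized count}.

    alignments: iterable of (read_id, chrom, strand, start) tuples, one per
    mapping location (a multi-mapping read appears once per location).
    total_reads: library size used for the reads-per-million scaling.
    """
    alignments = list(alignments)
    # Count how many locations each read maps to (its mapping frequency).
    hits = defaultdict(int)
    for read_id, _chrom, _strand, _start in alignments:
        hits[read_id] += 1

    scale = 1_000_000 / total_reads
    counts = defaultdict(float)
    for read_id, chrom, strand, start in alignments:
        window_start = (start // window) * window
        # Assumption: each location gets an equal share of the read.
        counts[(chrom, strand, window_start)] += scale / hits[read_id]
    return dict(counts)


# Made-up alignments: r1 maps uniquely, r2 maps to two locations.
example = [
    ("r1", "2L", "+", 103),
    ("r2", "2L", "+", 110),
    ("r2", "3R", "-", 40),
]
result = normalized_window_counts(example, total_reads=2_000_000)
```

Under these assumptions a uniquely mapping read counts fully toward its window, while a multi-mapper is split across its locations, which is why the *.nor-* and *.unq-* tracks can differ at repetitive loci.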


(2) Genecentric Outputs (TOP LINK: Original Genecentric outputs. BOTTOM LINK: Consolidated Genecentric output): various files of different formats to allow visualization and counting of small RNA reads overlapping gene loci. The Consolidated files address the issue of overlapping and redundant gene models that confound genecentric piRNA loci determinations.

TOP LINK goes to the folder holding LibName.wig files, *.sorted.bb files (BigBed format), library read length counts.xls files, Gene counts.collapsed.xls files, and Intergenic Coverage collapsed counts.xls files.

The Gene Counts.collapsed.xls files contain the following column headers:
Gene: VectorBase-based gene models called in the GTF transcriptome file
Locus : Concatenated string of the chromosomal coordinates of the gene locus
Chr : Chromosome
Strand: Genomic strand of gene, plus strand or minus strand
Annotated_Start : gene model's annotated 5' end start coordinate.
Annotated_End : gene model's annotated 3' end coordinate.
Locus_Start : extends upstream of gene's annotated start if there are more 5'UTR-mapping small RNAs.
Locus_End : extends downstream of gene's annotated end if there are more 3'UTR-mapping small RNAs.
Length : gene length between Locus_Start and Locus_End.
5'UTR_Length, Coding_Length, and 3'UTR_Length : lengths of the annotated segments of each gene model.
Uniq_5'UTR, Uniq_Coding, Uniq_3'UTR, and Uniq_Locus : counts of uniquely-mapping reads in each region of the gene model and in the whole locus.
Tot_5'UTR, Tot_Coding, Tot_3'UTR, and Tot_Locus : total counts of all reads, including multi-mappers, in each region of the gene model and in the whole locus.
Norm_5'UTR, Norm_Coding, Norm_3'UTR, and Norm_Locus : read counts normalized by mapping frequency and by Reads Per Million, for each region of the gene model and for the whole locus.
Iterations of the Uniq_ , Tot_, and Norm_ columns as before, but broken down by Plus and Minus strands: these polarity counts are relative to the position of the Gene, not the genomic strand.
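
Because these tables are tab-delimited text, they can also be read with a short script instead of a spreadsheet. The sketch below is a minimal example and not part of the MSRG pipeline: it ranks genes by the Norm_Locus column using Python's standard csv module. The sample rows are made up and show only a few of the columns; check the exact header spelling in your own file before relying on it.

```python
import csv
import io


def top_genes(handle, column="Norm_Locus", n=10):
    """Return the n rows with the highest value in `column`."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = list(reader)
    rows.sort(key=lambda row: float(row[column]), reverse=True)
    return rows[:n]


# Made-up excerpt standing in for a Gene counts.collapsed.xls file.
SAMPLE = (
    "Gene\tChr\tStrand\tUniq_Locus\tNorm_Locus\n"
    "geneA\t2L\t+\t12\t3.5\n"
    "geneB\t3R\t-\t250\t41.2\n"
    "geneC\tX\t+\t7\t0.9\n"
)

top = top_genes(io.StringIO(SAMPLE), n=2)
print([row["Gene"] for row in top])
```

For a real file, pass an opened file handle, for example open(path, newline=""), in place of the io.StringIO sample.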

BOTTOM LINK goes to another folder that holds a Genecentric Consolidated file. This file takes the Gene counts.collapsed.xls file and inspects the entries that represent the multiple isoforms of the same gene model name. Most of the columns are the same as in the original table, which is not consolidated for the multiple gene isoforms.

Consolidated_Annotated_region : Lists the gene isoforms and each of their normalized counts. For example: CPIJ015653-RA=34.028,CPIJ040533-RA=0.415,CPIJ040534-RA=3.870. The algorithm then chooses the isoform with the top counts as the main gene model in these consolidated genecentric tables.
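
The Consolidated_Annotated_region string can be split back into per-isoform counts. The following is a minimal sketch assuming the "name=count" pairs are comma-separated exactly as in the example above; the function names are hypothetical and this is not the consolidation script itself.

```python
def parse_isoform_counts(field):
    """Split 'name=count,name=count,...' into a {name: count} dictionary."""
    counts = {}
    for item in field.split(","):
        # rpartition splits on the last '=', in case a name contains one.
        name, _, value = item.strip().rpartition("=")
        counts[name] = float(value)
    return counts


def top_isoform(field):
    """Return the isoform name with the highest normalized count."""
    counts = parse_isoform_counts(field)
    return max(counts, key=counts.get)


example = "CPIJ015653-RA=34.028,CPIJ040533-RA=0.415,CPIJ040534-RA=3.870"
print(top_isoform(example))
```

For the example string, the isoform with the top count is CPIJ015653-RA, matching the selection rule described above.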


(3) Length distributions and miRNA counts (miRNA.xls and length.all.xls): tab-delim tables that will open in the Excel spreadsheet program.

Length.all.xls
Length: small RNA read lengths after adapter trimming; however, plots in the paper only examined reads 18-32nt long.
TotalCount: all the reads with that read length, raw counts
miRNACount : raw counts of reads mapping to miRNA list, both 5' and 3' arms
TECount : raw counts of reads mapping to transposon consensus families list
VirusCount : raw counts of reads mapping to arboviruses list
StructureRNACount : raw counts of reads mapping to structural RNAs list
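
As an example of using these columns, the sketch below computes what fraction of reads in the 18-32nt range falls into each annotated class. It is only an illustration: the rows are made-up numbers, the function name is hypothetical, and the fractions need not sum to 1 because not every read belongs to one of these four classes.

```python
CLASS_COLUMNS = ["miRNACount", "TECount", "VirusCount", "StructureRNACount"]


def class_fractions(rows, min_len=18, max_len=32):
    """Fraction of reads per class among reads of min_len..max_len nt.

    rows: list of dicts keyed by the Length.all.xls column headers,
    with values already converted to numbers.
    """
    selected = [r for r in rows if min_len <= r["Length"] <= max_len]
    total = sum(r["TotalCount"] for r in selected)
    if total == 0:
        return {}
    return {c: sum(r[c] for r in selected) / total for c in CLASS_COLUMNS}


# Made-up rows for illustration only; the 17nt row is filtered out.
rows = [
    {"Length": 17, "TotalCount": 100, "miRNACount": 5, "TECount": 1,
     "VirusCount": 0, "StructureRNACount": 50},
    {"Length": 22, "TotalCount": 600, "miRNACount": 300, "TECount": 60,
     "VirusCount": 30, "StructureRNACount": 6},
    {"Length": 27, "TotalCount": 400, "miRNACount": 0, "TECount": 240,
     "VirusCount": 10, "StructureRNACount": 4},
]
fractions = class_fractions(rows)
```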


miRNA.xls
miRNA: names of the miRNAs as pulled from miRBase and VectorBase lists. The "aga-", "aae-", "cqu-" and "aal-" prefixes refer to the species AnGam, AeAeg, CuQuin, and AeAlbo, respectively. Because the AeAeg miRNA list was most extensive, it was used to supplement the CuQuin and AeAlbo miRNA lists for new miRNA homologs.
5prime_rpm : normalized read counts to the upstream arm of the miRNA hairpin.
3prime_rpm : normalized read counts to the downstream arm of the miRNA hairpin.


(4) piRNA Phasing Analysis (3'to5'phasing.xls and 5'to5'phasing.xls) : tab-delim tables listing the frequency of distances between piRNAs.

The 3'-to-5' Phasing and 5'-to-5' Phasing files are calculated from the individual piRNA species according to the algorithms described in Gainetdinov et al (PMC6130920), and list the frequency of distances of trailing piRNAs in proximity to a given piRNA at positions -200 to +200. Final graphics in Figure 7 plotted the frequency of distances from positions -10 to +50 for the 3'-to-5' Phasing patterns, and positions +20 to +200 for the 5'-to-5' Phasing patterns.
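
To make the idea of a distance frequency table concrete, here is a simplified Python sketch of tallying 3'-to-5' distances on a single strand. It is not the Gainetdinov et al. algorithm or the MSRG phasing script. It assumes each piRNA species is counted once regardless of abundance, and it uses a convention chosen for this sketch in which distance 0 means the neighboring piRNA's 5' end sits immediately after the reference piRNA's 3' end; the pipeline may number positions differently.

```python
from collections import Counter


def three_to_five_distances(pirnas, lo=-200, hi=200):
    """Histogram of distances from each piRNA's 3' end to other 5' ends.

    pirnas: list of (five_prime_position, length) tuples on one strand of
    one sequence, with coordinates increasing in the 5'-to-3' direction.
    """
    five_primes = [start for start, _length in pirnas]
    histogram = Counter()
    for index, (start, length) in enumerate(pirnas):
        next_base = start + length  # first base after this piRNA's 3' end
        for other_index, other_start in enumerate(five_primes):
            if other_index == index:
                continue  # do not compare a piRNA with itself
            distance = other_start - next_base
            if lo <= distance <= hi:
                histogram[distance] += 1
    return histogram


# Made-up piRNAs: the first three lie head-to-tail, the fourth is far away.
example = [(100, 27), (127, 26), (153, 28), (500, 25)]
hist = three_to_five_distances(example)
```

In this toy input the two head-to-tail junctions both land at distance 0, which is the kind of peak a phased piRNA population would produce under this convention. The all-against-all comparison is quadratic and only suitable for small examples.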


(5) Structural RNAs as small RNAs (Libname_Strucure.PDF and Libname_Strucure.xls) : the PDF document shows the coverage plot of the reads along the structural RNA, and the XLS file is the tab-delimited table of the reads mapping to structural RNAs that were downloaded from VectorBase or queried from GenBank.

Structure RNA name : ribosomal RNAs, transfer RNAs, spliceosomal RNAs, and RNase P RNAs
Length : self-explanatory
18-23(+) and 18-23(-) : reads per million (rpm) counts of reads on the plus or minus strand of the structural RNA, binned by lengths of miRNAs and siRNAs.
24-32(+) and 24-32(-): rpm counts of reads on the plus or minus strand of the structural RNA, binned by lengths of piRNAs.
all small RNAs (+) and all small RNAs (-): rpm counts of reads on the plus or minus strand of the structural RNA.
Total : Total rpm counts of reads.
Number of peaks: A measure of the number of 25nt windows where there are enough reads >1rpm to be called a peak.
Average Distance Between Peaks : measures how the peaks are spread out across the structural RNA, to assess how distributed the small RNAs are across the structural RNA.

Ratio : Number of peaks divided by the Average Distance Between Peaks.
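
The three peak columns can be illustrated with a small Python sketch. This is one possible reading of the definitions above, not the pipeline's code: it assumes consecutive non-overlapping 25nt windows, calls a window a peak when its rpm is above 1, and measures distance between the start positions of neighboring peak windows. The function name is hypothetical.

```python
def peak_summary(window_rpm, window=25, threshold=1.0):
    """Return (number_of_peaks, average_distance, ratio).

    window_rpm: list of rpm values, one per consecutive window along the
    reference sequence. average_distance and ratio are None when fewer
    than two peaks exist, because no distance can be measured.
    """
    peak_starts = [i * window for i, rpm in enumerate(window_rpm)
                   if rpm > threshold]
    n_peaks = len(peak_starts)
    if n_peaks < 2:
        return n_peaks, None, None
    gaps = [b - a for a, b in zip(peak_starts, peak_starts[1:])]
    average_distance = sum(gaps) / len(gaps)
    return n_peaks, average_distance, n_peaks / average_distance


# Made-up rpm values for seven windows; peaks are windows 1, 4 and 5.
n_peaks, avg_dist, ratio = peak_summary([0.2, 3.0, 0.0, 0.0, 1.5, 2.2, 0.1])
```

Under this reading, a higher Ratio means many peaks packed closely together. The same three columns appear in the transposon, virus and Wolbachia tables.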


(6) Transposon small RNAs (Libname_TRANSPOSON.PDF and Libname_TRANSPOSON.xls) : the PDF document shows the coverage plot of the reads along the transposon consensus family sequence, and the XLS file is the tab-delimited table of the reads mapping to transposon families that were downloaded from VectorBase or regenerated by a RepeatModeler v2 run.

Structure RNA name : in this table, the name of the transposon family. The repeat family lists were all subjected to our Redundant Family Reduction procedure in Figure S1.
Length : self-explanatory
18-23(+) and 18-23(-) : reads per million (rpm) counts of reads on the plus or minus strand of the transposon RNA, binned by lengths of miRNAs and siRNAs.
24-32(+) and 24-32(-): rpm counts of reads on the plus or minus strand of the transposon RNA, binned by lengths of piRNAs.
all small RNAs (+) and all small RNAs (-): rpm counts of reads on the plus or minus strand of the transposon RNA.
Total : Total rpm counts of reads.
Number of peaks: A measure of the number of 25nt windows where there are enough reads >1rpm to be called a peak.
Average Distance Between Peaks : measures how the peaks are spread out across the transposon, to assess how distributed the small RNAs are across the transposon RNA.

Ratio : Number of peaks divided by the Average Distance Between Peaks.


(7) Virus small RNAs (Libname_Virus.PDF and Libname_Virus.xls) : the PDF document shows the coverage plot of the reads along the virus sequence, and the XLS file is the tab-delimited table of the reads mapping to virus genomes that were downloaded from GenBank/NCBI.

Virus name : these were searched on the NCBI and the ViPR (Virus Pathogen Resource) for mosquito arboviruses, and just the completed genomes were selected.
Length : self-explanatory
18-23(+) and 18-23(-) : reads per million (rpm) counts of reads on the plus or minus strand of the virus RNA, binned by lengths of miRNAs and siRNAs.
24-32(+) and 24-32(-): rpm counts of reads on the plus or minus strand of the viral RNA, binned by lengths of piRNAs.
all small RNAs (+) and all small RNAs (-): rpm counts of reads on the plus or minus strand of the viral RNA.
Total : Total rpm counts of reads.
Number of peaks: A measure of the number of 25nt windows where there are enough reads >1rpm to be called a peak.
Average Distance Between Peaks : measures how the peaks are spread out across the virus genome, to assess how distributed the small RNAs are across the viral RNA.

Ratio : Number of peaks divided by the Average Distance Between Peaks.


(8) Wolbachia small RNAs (Libname_Wolbachia.PDF and Libname_Wolbachia.xls) : the PDF document and the XLS file, a tab-delimited table, describe the reads mapping to Wolbachia genomes that were downloaded from the NCBI.

Wolbachia strain name : these were searched on the NCBI, and just the completed genomes were selected.
Length : self-explanatory
18-23(+) and 18-23(-) : reads per million (rpm) counts of reads on the plus or minus strand of the Wolbachia genome, binned by lengths of miRNAs and siRNAs.
24-32(+) and 24-32(-): rpm counts of reads on the plus or minus strand of the Wolbachia genome, binned by lengths of piRNAs.
all small RNAs (+) and all small RNAs (-): rpm counts of reads on the plus or minus strand of the Wolbachia genome.
Total : Total rpm counts of reads.
Number of peaks: A measure of the number of 25nt windows where there are enough reads >1rpm to be called a peak.
Average Distance Between Peaks : measures how the peaks are spread out across the Wolbachia genome, to assess how distributed the small RNAs are across the Wolbachia genome.

Ratio : Number of peaks divided by the Average Distance Between Peaks.


 Updated 2021-01-11