- Given a BAM file and an annotation (GTF file), this tool calculates how many reads are mapped to each region of interest.
- The user can decide:
- The level at which counting is performed (genes, transcripts...).
- What to do with reads mapped to multiple locations.
- Paired-end read status and strand-specificity.
- When a transcriptome GTF file is provided, the tool can also calculate 5’ and 3’ coverage bias.
To access the tool use Tools ‣ Compute counts.
- Input data:
- BAM file: liver.bam. RNA-seq of liver tissue from Marioni JC et al.
- GTF file: human.64.gtf. Human annotation from Ensembl (v. 64)
- Feature ID: gene_id (to count at the level of genes)
- Feature type: exon (to ignore other features like start/end codons)
- Paired-end reads counts computation and strand-specificity
- Multi-mapped reads: uniquely-mapped-reads (to ignore non-unique alignments)
- liver.counts. Two-column tab-delimited text file, with the feature IDs in the first column and the number of counts in the second column.
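As an illustration of the output format described above, the following minimal sketch reads a two-column tab-delimited counts table into a dictionary. The function name `read_counts`, the feature IDs, and the count values are hypothetical, and skipping lines that start with `#` is an assumption rather than documented behavior.

```python
import csv
import io


def read_counts(handle):
    """Parse a two-column, tab-delimited counts table into a dict."""
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment lines (assumed to start with '#').
        if not row or row[0].startswith("#"):
            continue
        # Ignore lines that do not have both columns.
        if len(row) < 2:
            continue
        feature_id, value = row[0], row[1]
        counts[feature_id] = int(value)
    return counts


# Hypothetical content standing in for a real liver.counts file.
example = io.StringIO("ENSG00000000003\t120\nENSG00000000005\t0\n")
counts = read_counts(example)
```

For a file on disk, the same function can be called with `open("liver.counts", newline="")` as the handle.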
- BAM file
- Path to the BAM alignment file.
- Annotation file
- Path to the GTF or BED file containing regions of interest.
Controls when to consider reads and features to be overlapping:
- Reads overlap features if they share genomic regions regardless of the strand.
- For single-end reads, the read and the feature must have the same strand to be overlapping. For paired-end reads, the first read of the pair must be mapped to the same strand as the feature, while the second read must be mapped to the opposite strand.
- For single-end reads, the read and the feature must be on opposite strands. For paired-end reads, the first read of the pair must be mapped to the opposite strand of the feature, while the second read of the pair must be on the same strand as the feature.
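The three strand rules above can be sketched as a single compatibility check. This is an illustration only: the function name and the protocol labels `non-strand-specific`, `forward-stranded` and `reverse-stranded` are hypothetical names chosen for this example, not necessarily the identifiers used by the tool.

```python
def strands_compatible(protocol, feature_strand, read_strand,
                       is_second_in_pair=False):
    """Return True if a read's strand is compatible with a feature's strand.

    Strands are given as '+' or '-'. For single-end data, leave
    is_second_in_pair at its default.
    """
    if protocol == "non-strand-specific":
        # Strand is ignored entirely.
        return True

    # The second read of a pair is expected on the strand opposite to the
    # first read, so flip it before comparing with the feature.
    effective = read_strand
    if is_second_in_pair:
        effective = "-" if read_strand == "+" else "+"

    if protocol == "forward-stranded":
        return effective == feature_strand
    if protocol == "reverse-stranded":
        return effective != feature_strand
    raise ValueError(f"unknown protocol: {protocol}")
```

Flipping the second read first means both stranded protocols reduce to one comparison, which keeps the single-end and paired-end cases in the same code path.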
- Feature ID
- The user can select the attribute of the GTF file to be used as the feature ID. Regions with the same ID will be aggregated as part of the same feature. The application preloads the first 1000 lines of the file, so a list of possible feature IDs is conveniently provided.
- Feature type
- The user can select the feature type (value of the third column of the GTF) considered for counting. Other types will be ignored. The application preloads the first 1000 lines of the file, so a list of possible feature types is conveniently provided.
- Paired-end reads
- This option activates counting of read pairs instead of single reads.
- Alignment sorted by name
- For correct analysis of paired-end reads, the alignment must be sorted by name. If the alignment is already sorted by name, sorting can be skipped.
- Path to the output file.
- Save computation summary
- This option controls whether to save overall computation statistics. If selected, the statistics will be saved in a file named $INPUT_BAM.counts
- Multi-mapped reads
This option controls what to do with reads mapped to multiple locations:
- Reads mapped to multiple locations will be ignored.
- Multi-mapped reads are detected based on the “NH” tag from the SAM format. Each read is weighted according to the number of mapped locations. For example, a read mapped to 4 different locations will add 0.25 to the counts of each location. After the analysis is finished, the value is converted to an integer.
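The proportional weighting described above can be illustrated with a small sketch. Each alignment record contributes 1/NH to the feature it overlaps. The function name and the input layout are hypothetical, and the final conversion uses truncation with `int()` purely as an assumption, since the rounding rule is not stated here.

```python
from collections import defaultdict


def proportional_counts(alignments):
    """Sum 1/NH per feature.

    alignments: iterable of (feature_id, nh) tuples, one per alignment
    record, where nh is the value of the NH tag for that read.
    """
    totals = defaultdict(float)
    for feature_id, nh in alignments:
        totals[feature_id] += 1.0 / nh
    return dict(totals)


# One uniquely mapped read on geneA, plus one read mapped to 4 locations.
alignments = [("geneA", 1), ("geneA", 4), ("geneB", 4),
              ("geneC", 4), ("geneD", 4)]
weighted = proportional_counts(alignments)

# Assumption: truncate to an integer; the tool may round differently.
integer_counts = {fid: int(value) for fid, value in weighted.items()}
```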
- Calculate 5’ and 3’ coverage bias
- If a GTF file is provided, the user has the possibility of computing 5’ - 3’ bias. The application automatically constructs the 5’ and 3’ UTR (100 bp) from the gene definitions of the GTF file and determines the coverage rate of the 1000 most highly expressed transcripts in the UTR regions. This information is then stored in the computation summary file, together with the statistics of the counting procedure.
This option requires a standard gene model definition. The UTRs are computed for the first and last exons of each transcript. Therefore, exon is the feature of interest (third field of the GTF) and gene_id, transcript_id should be attributes (ninth field of the GTF).
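To make the required GTF layout concrete, the sketch below parses GTF lines, keeps only `exon` records (third field), reads `transcript_id` from the attributes (ninth field), and picks the first and last exon of each transcript. The helper names and the example lines are hypothetical, and the attribute parser only handles the common `key "value";` form.

```python
import re
from collections import defaultdict

# Matches attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_gtf_line(line):
    """Split one GTF line into its fields; return None if malformed."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return None
    return {
        "feature": fields[2],                # third field: feature type
        "start": int(fields[3]),
        "end": int(fields[4]),
        "strand": fields[6],
        "attrs": dict(ATTR_RE.findall(fields[8])),  # ninth field
    }


def exons_by_transcript(lines):
    """Group exon intervals by transcript_id, sorted by start position."""
    grouped = defaultdict(list)
    for line in lines:
        if line.startswith("#"):
            continue
        rec = parse_gtf_line(line)
        if rec is None or rec["feature"] != "exon":
            continue
        transcript_id = rec["attrs"].get("transcript_id")
        if transcript_id is not None:
            grouped[transcript_id].append((rec["start"], rec["end"]))
    return {tid: sorted(exons) for tid, exons in grouped.items()}


# Hypothetical annotation lines for a single transcript.
lines = [
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\tsrc\tstart_codon\t100\t102\t.\t+\t0\tgene_id "G1"; transcript_id "T1";',
    'chr1\tsrc\texon\t300\t400\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
]
exons = exons_by_transcript(lines)
first_exon, last_exon = exons["T1"][0], exons["T1"][-1]
```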
A two-column tab-delimited text file, with the feature IDs in the first column and the number of counts in the second column, and overall calculation stats.
The calculation stats include:
- Feature counts
- Number of reads assigned to various features
- No feature
- Number of reads not aligned to any feature
- Not unique alignment
- Number of reads with non-unique alignment
- Number of reads that align to features ambiguously
The following stats are calculated only if the option Calculate 5’ and 3’ coverage bias was set:
- Median 5’ bias
- For the 1000 most expressed genes, the ratio between the coverage of the 100 leftmost bases and the mean coverage is calculated, and the median value is provided.
- Median 3’ bias
- For the 1000 most expressed genes, the ratio between the coverage of the 100 rightmost bases and the mean coverage is calculated, and the median value is provided.
- Median 5’ to 3’
- For the 1000 most expressed genes, the ratio between the coverage of the 100 leftmost and the 100 rightmost bases is calculated, and the median value is provided.
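The three bias statistics can be sketched as follows. This is a minimal illustration, not the tool's implementation: it assumes that "coverage of the 100 leftmost bases" means the mean per-base depth over that window, that per-base coverage is already available in 5’ to 3’ order, and that the caller has already selected the most expressed genes. All function names are hypothetical.

```python
from statistics import mean, median


def transcript_bias(coverage, window=100):
    """Return (5' bias, 3' bias, 5'-to-3' ratio) for one coverage profile.

    coverage: per-base depth along the transcript, 5' end first.
    Returns None when a ratio would require dividing by zero.
    """
    overall = mean(coverage)
    five_prime = mean(coverage[:window])
    three_prime = mean(coverage[-window:])
    if overall == 0 or three_prime == 0:
        return None
    return five_prime / overall, three_prime / overall, five_prime / three_prime


def median_bias(coverages, window=100):
    """Median of each bias statistic across the given profiles."""
    per_transcript = [
        bias
        for bias in (transcript_bias(c, window) for c in coverages)
        if bias is not None
    ]
    if not per_transcript:
        return None
    fives, threes, ratios = zip(*per_transcript)
    return median(fives), median(threes), median(ratios)


# Two toy profiles: one rising towards the 3' end, one uniform.
rising = [2] * 100 + [4] * 100 + [6] * 100
uniform = [5] * 300
bias_5, bias_3, bias_5_to_3 = median_bias([rising, uniform])
```

Values below 1 for the 5’ bias combined with values above 1 for the 3’ bias, as in the rising profile, indicate coverage skewed towards the 3’ end.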
- Qualimap provides the possibility of clustering genomic features according to their surrounding coverage profiles. This is particularly interesting in epigenomic studies (e.g. methylation). The user can import a set of features (e.g. TSSs or CpG Islands) together with the BAM file. The application then preprocesses the data and clusters the profiles using the Repitools package (Statham et al). The obtained groups of features are displayed as a heatmap or as line graphs and can be exported for further analysis (e.g. for measuring the correlation between promoter methylation and gene expression).
- Summary of the process:
- filter out the non-uniquely-mapped reads
- compute the smoothed coverages values of the samples at the desired locations
- apply k-means on the smoothed coverage values for the desired values of k
- To perform this analysis the user needs to provide at least two BAM files, one for the sample (enriched) and one for the control (input), and a list of features as a BED file.
- Clustering analysis can be accessed using the menu item File ‣ Tools ‣ Clustering.
Clustering coverage profiles is not a straightforward task and it may be necessary to perform a number of empirical filtering steps. To correctly interpret the approach and the results, we encourage users to read the Repitools User Manual.
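The k-means step from the process summary can be illustrated with a plain Python sketch. This is not the Repitools implementation: it starts deterministically from the first k profiles (an assumption made to keep the example reproducible, whereas k-means is usually started from random centers), and the profile values are made up.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(profiles, k, iterations=20):
    """Cluster coverage profiles into k groups; return (labels, centroids)."""
    # Assumption: initialise centroids with the first k profiles.
    centroids = [list(p) for p in profiles[:k]]
    labels = [0] * len(profiles)
    for _ in range(iterations):
        # Assignment step: each profile joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in profiles
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical smoothed coverage profiles, one row per feature.
profiles = [[0, 0, 0], [10, 10, 10], [0, 1, 0], [9, 10, 11]]

# Run the clustering once for each requested number of clusters.
labels_by_k = {k: kmeans(profiles, k)[0] for k in (2, 3)}
```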
- Experiment ID
- The experiment name
- Alignment data
Here you can provide the replicates to analyze. Each replicate includes a sample file and a control file. For example, in an epigenomics experiment, the sample file could be the MeDIP-seq data and the control file the non-enriched data (the so-called INPUT data). For each replicate the following information has to be provided:
- Replicate name
- Name of the replicate
- Sample file
- Path to sample BAM file
- Control file
- Path to control BAM file
To add a replicate, click the Add button. To remove a replicate, select it and click the Remove button. You can modify a replicate using the Edit button.
- Regions of interest
- Path to an annotation file in BED or GFF format, which contains regions of interest.
- Relative location to analyze
- Left offset
- Offset in bp upstream of the selected regions
- Right offset
- Offset in bp downstream of the selected regions
- Bin size
- Can be thought of as the resolution of the plot. Bins of the desired size will be computed and the information falling into each bin will be aggregated.
- Number of clusters
- Number of groups into which the data should be divided. Several values can be given by separating them with commas.
- Fragment length
- Length of the fragments that were initially sequenced. All reads will be enlarged to this length.
- Visualization type
- You can visualize clusters using heatmaps or line-based graphs.
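To show how the offset, bin size, and fragment length parameters interact, the sketch below extends each read to the fragment length and sums the covered bases per bin in a window around one region. It is an illustration under stated assumptions: coordinates are 0-based and half-open, the region lies on the forward strand (so upstream means lower coordinates), and the function name and read layout are hypothetical.

```python
def binned_coverage(reads, anchor, left_offset, right_offset,
                    bin_size, fragment_length):
    """Return the number of fragment-covered bases per bin.

    reads: list of (start, end, strand) tuples, 0-based half-open.
    anchor: reference position of the region (for example a TSS).
    """
    window_start = anchor - left_offset
    window_end = anchor + right_offset
    n_bins = (window_end - window_start + bin_size - 1) // bin_size
    bins = [0] * n_bins
    for start, end, strand in reads:
        # Extend the read to the fragment length in its own direction.
        if strand == "+":
            frag_start, frag_end = start, max(end, start + fragment_length)
        else:
            frag_start, frag_end = min(start, end - fragment_length), end
        # Clip the fragment to the analysed window.
        lo = max(frag_start, window_start)
        hi = min(frag_end, window_end)
        for pos in range(lo, hi):
            bins[(pos - window_start) // bin_size] += 1
    return bins


# Hypothetical reads around a region anchored at position 1000.
reads = [(950, 986, "+"), (1080, 1116, "-")]
bins = binned_coverage(reads, anchor=1000, left_offset=100,
                       right_offset=100, bin_size=50, fragment_length=100)
```

A smaller bin size gives more bins over the same window and therefore a finer profile, at the cost of noisier values per bin.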
After the analysis is performed, the regions of interest are clustered in groups based on the coverage pattern. The output graph shows the coverage pattern for each cluster either as a heatmap or a line graph. There can be multiple graphs based on the number of clusters provided as input. The name of each graph consists of the experiment name and the number of clusters.
It is possible to export the list of features belonging to a particular cluster. To do this, use the main menu item File ‣ Export gene list or the context menu item Export gene list. A dialog will then appear where you can choose a specific cluster. The list of features belonging to this cluster can either be copied to the clipboard or exported to a text file.