View on GitHub

HMMRATAC

HMMRATAC peak caller for ATAC-seq data

HMMRATAC Guide

Running the program:

java -jar HMMRATAC_V1.2_exe.jar -b <SortedBAM> -i <BAMIndex> -g <GenomeStatsFile> <options>

Version number may change.

Please download the executable file from https://github.com/LiuLabUB/HMMRATAC/releases to run the program. Source files do not contain manifest file needed to run.

Outline:

  1. Creating required files
  2. Commandline options
  3. Output files
  4. Troubleshooting

Creating required files

  1. Sorted BAM file containing the alignments of ATAC-seq reads

    The first file that is necessary is a sorted BAM file containing the pair-end ATAC-seq reads. Typically, an alignment algorithm (such as bwa/bowtie2/minimap2) will output a SAM file as default. The first step is to convert this SAM file into a BAM file. This is best accomplished using samtools. Samtools requires a SAM file to convert into a BAM file. Use the samtools view command as follows, assuming the SAM file is named as atac.sam:

    samtools view –bS -o atac.bam atac.sam

    Note the SAM file should have header lines. If not, check the step3 below to get the genome information file ready genome.info(chromosome lengths) first, and run samtools as:

    samtools view -bS -t genome.info -o atac.bam atac.sam

    The next step is to sort the BAM file. This can also be accomplished using samtools. Use the samtools sort command as follows (example using 4 threads to sort, tweak it as neccessary):

    samtools sort -@ 4 -o atac.sorted.bam atac.bam

    It is possible to remove duplicates or low Mapping quality reads before inputting into HMMRATAC, although HMMRATAC will perform these functions by default. By default HMMRATAC will remove duplicate reads (and this function is currently hard-coded and cannot be turned off). HMMRATAC also removes those reads whose MapQ (mapping quality scores) are below 30. This value is changeable through command line option (using the –q or --minmapq option)

    If you have multiple replicates and you choose to merge replicates prior to running HMMRATAC, this can also be accomplished using samtools. Use the samtools merge command as follows:

    samtools merge –n atac_merged.bam atac_rep1.sorted.bam atac_rep2.sorted.bam ... atac_repN.sorted.bam

    If you merge BAM files sorted by the same samtools tool, the merged file atac_merged.bam should be sorted already. If uncertain, you can always re-sort the resulting merged file before proceeding.

    samtools sort -@ 4 -o atac_merged.sorted.bam atac_merged.bam

    Now, either rename atac.sorted.bam if you have only one replicate, or atac_merged.bam if you have multiple replicates, or atac_merged.sorted.bam if an extra sorting is performed on merged file, to atac.forHMMRATAC.bam.

  2. The index file for the BAM.

    The next required file that HMMRATAC needs is an index file for the sorted BAM file. This file can also be easily created with samtools. Use the samtools index command as follows:

    samtools index atac.forHMMRATAC.bam

    You will have a file named atac.forHMMRATAC.bam.bai in the same working directory.

  3. Genome information file for chromosome sizes.

    The third required file is a genome information file containing chromosome sizes. This file can be downloaded from numerous online sources, including UCSC genome browser’s website. This file is a two column, tab-delimited file containing chromosome name and size in the following format:

** Please Note: The genome file must NOT contain a header line

It is possible to retrieve this file using UCSC Genome Browser’s MySQL database. To retrieve the file, for H. sapiens, use the following command:

mysql –-user=genome --host=genome-mysql.cse.ucsc.edu –A –e \ “select chrom, size from hg19.chromInfo” > genome.info

Alternatively, if your SAM file has header lines and you have successfully generated a BAM file following step 1 and 2 without input of genome information. You can extract genome information from BAM as:

samtools view -H atac.forHMMRATAC.bam| perl -ne 'if(/^@SQ.*?SN:(\w+)\s+LN:(\d+)/){print $1,"\t",$2,"\n"}' > genome.info

Commandline Options

Required Parameters:

Optional Parameters:

Output Files

  1. Name.log

This is a log file generated as HMMRATAC runs. It contains all inputted arguments, the results of the fragment distribution EM algorithm, and various status updates as HMMRATAC progresses, as well as a textual description of the generated model. When finished, it also reports the total amount of actual (real-world) time it took for HMMRATAC to run.

  1. Name.bedgraph

This is the genome-wide state annotation file created if the --bedgraph option is set to true. It is a four column, tab-delimited file, with the following fields:

  1. Chromosome Name

  2. Region Start (zero-based)

  3. Region Stop (zero-based)

  4. State annotation. This represents what state the particular region is assigned to, corresponding to the created model. The state name is prefixed with the letter “E”, to make extractions easier. For instance, to retrieve all regions that were classified as state 1, type:

    grep E1 Name.bedgraph > State1_regions.bed

  5. Name.model

This file contains the model that was generated by HMMRATAC and was used to decode the genome. It is always recommended to check this model after running HMMRATAC. Please note that this file is a binary file that can only be read by HMMRATAC. To see the actual textual description of the model, see the log file. If HMMRATAC runs to completion but produces poor results, it is usually the result of a faulty model. For instance, if the model shows “NaN” for all of the parameters, such as transition or emission probabilities, this usually means that something went wrong with either the choice of training regions or generating the signal tracks. This can occur with too low or too high of a fold-change range for choosing training regions, if the BED file of training regions was faulty or if certain signals need to be trimmed off (due to small numbers of larger fragments). Generally speaking, this poor model occurs when there is not enough separation in the signal tracks between the states. Common fixes include, trimming the signal tracks (option --trim), changing the number of states (option -k), choosing a different fold change range for training site identification (option -u and -l) or changing the list of previously annotated training regions (option -t). Additionally, checking the model ensures that the predicted model is created. Although rare, it is possible for a different state to be better indicative of the open state, than is normally the case (HMMRATAC sorts the model so the last or highest numbered state should be the open state). If this happens, then the resulting peak file may be incorrect. In such cases, it is recommended to report a bedgraph file and extract the proper state (of certain lengths and with the accompanying flanking regions) and manually create a peak BED file. We have tested numerous ATAC-seq datasets, using our default parameters and we generally don’t encounter problems, unless we radically change those settings.

  1. Name_peaks.gappedPeak

This is the standard peak file reported by HMMRATAC by default (when the -p , --peaks option is set to true). The first line in the file is a track line for genome browsers. After that, it is a 15 column, tab-delimited file containg the following fields:

  1. Chromosome Name

  2. Peak Start (zero-based). This is the beginning of the regulatory region that HMMRATAC identifies as a peak. This includes the flanking nucleosome regions. Therefore, this position represents the start of the upstream nucleosome. If there is no upstream nucleosome, this position represents the start of the open state.

  3. Peak Stop (zero-based). This is the end of the regulatory region that HMMRATAC identifies as a peak. This includes the flanking nucleosome regions. Therefore, this position represents the end of the downstream nucleosome. If there is no downstream nucleosome, this position represents the end of the open state.

  4. Peak Name. Unique name for the peak, in the format “Peak_#”. For the high coverage regions that are excluded from Viterbi decoding (described in Section 2, under the -z , --zscore option), the name is in the format “HighCoveragePeak_#”. This allows the high coverage peaks to be easily identified and extracted.

  5. Unused, denoted with “.”

  6. Unused, denoted with “.”

  7. Open state start (zero-based). This is the genomic position where the open state begins. It is therefore possible to use only the open regions, rather than the regulatory regions that HMMRATAC reports by default.

  8. Open state stop (zero-based). This is the genomic position where the open state ends. It is therefore possible to use only the open regions, rather than the regulatory regions that HMMRATAC reports by default.

  9. Color code for display. Set to 255,0,0. This would create a dark-blue color on a genome browser.

  10. The number of sub-regions in the peak region. For standard peaks this is set to 3, indicating the open state region and the two flanking nucleosome state regions. For high coverage peaks, this is set to 1. Peaks that don’t have upstream and/or downstream nucleosomes will have different values as well.

  11. Comma separated list of sub-region lengths. The first and last values are always set to one, making visualization easier. The middle value is the length of the open state region. For high coverage peaks, this is only a single value, denoting the length of the total region. If the peak lacks upstream and/or downstream nucleosomes, these values will reflect that.

  12. Comma separated list of the (zero-based) starts of each sub-region, relative to the total region’s start. For instance, the first value is always 0, meaning 0 bp from the region start (column 2). The second value is the distance from the region start (column 2) to the open region start (column 7), etc.

  13. Peak Score. This is the score for the peak, as determined by the scoring system option described in section 2 (option --score).

  14. Unused, denoted as -1.

  15. Unused, denomted as -1.

  16. Name_summits.bed

This is a BED file containing the summits of each HMMRATAC peak. Summits are determined by finding the position within the open state region, whose Gaussian-smoothed read coverage is the maximum over the entire region. This is only outputted if a peak file is also outputted. If motif analysis is being conducted, it is recommended to use these summits as the points of interest, as the summits tend to be better indicative of TF binding as opposed to peak centers. This is a four column, tab-delimited BED file containing the following fields:

  1. Chromosome Name

  2. Summit Start (zero-based)

  3. Summit Stop (zero-based)

  4. Summit Name. This is the same name as the corresponding Peak name used in the peak file (column 4 of the Name_peaks.gappedPeak file).

  5. Peak Score. Same score as in the .gappedPeak file.

Troubleshooting

  1. Java heap space error:

    The most likely cause of this error is that the genome “blocks” are too large. In order to save memory, HMMRATAC splits the genome into “blocks” and then annotates each position within the block as one of the states of the model. The default size of these blocks is 25MB. We have found, that on personal computers, this may be too large to handle, resulting in the heap space error. The best fix to this issue is to reduce the size of the block, using the --window option (See section 2). It is recommended to reduce this value to 1/10 of the default, ie 2.5MB, for personal computers. If the error continues, you can lower it further.

  2. HMMRATAC produces very large peaks (> 25KB) or results in large gaps with no peaks:

    This isssue is related to the Viterbi algorithm. In our experience, Viterbi has difficulty when encountering very high coverage regions. When it does, one of two things can happen. Either viterbi will call every position, from the high coverage region to the end of the block (block is explained in “Java heap space error” section) as a peak (thereby resulting in extremely large peaks) or as some other state (resulting in large gaps with no called peaks). HMMRATAC corrects for this behavior by skipping over high coverage regions, allowing viterbi to continue correctly. HMMRATAC determines what is or isn’t a high coverage region based on the -z option (see section 2). Lowering the -z option, will make HMMRATAC skip more high coverage regions (by lowering the threshold to identify such regions). It should be noted that all skipped high coverage regions are added back to the gappedPeak output file and are annotated as “High Coverge Peaks”.

  3. HMMRATAC produces very few peaks (< 1000) or doesn’t identify any peaks or the model shows NaN for all values:

    This is almost always due to a faulty model. Check the log file (see section 3) and see if the model is valid. A non-valid model will have NaN as the values of the initial, transition and emission probabilities. This typically occurs when there is not enough separation in the values between states. As an example (using a simplified model), say you had a model with 3 states and a single observation value. The average value for state 1 is 1, the average for state 2 is 2 and the average for state 3 is 3. If a value of 2.5 occurred, then there is an equal chance that the value is assigned to state 2 or state 3. When this happens, HMMRATAC (or really any model based algorithm) won’t be able to decide which state to assign it to. Now we remake the model with a more conservative training set and the values become 1, 5 and 10. Now 2.5 clearly can be assigned to state 1. Therefore, an easy way to correct this is to use more stringent settings in creating the model. This is accomplished with the -l and -u options (see section 2). The default setting for -u is 20 and for -l is 10. A common fix is to increase the -u to 30 or 40. This will allow stronger peaks into the training set, thereby changing the emission values of each state and correcting the problem (not enough separation between states). This is the most common error with HMMRATAC.

  4. The dataset should not be empty ERROR. This error has been encountered on occasion, and is almost always the result of user error.

    This is saying that the data matrix created by HMMRATAC, containing the values for nucleosome free and nucleosome enriched fragments, was empty. If you set the upper and lower fold change bounds for choosing training regions (-u and -l option) to values that prevent ANY training sites from being identified (i.e. -u 10 -l 20), or if you provide an empty BED file of training regions, then the data matrix will be empty and this error will occur. Alternatively, if the BAM files use one type of chromsome annotation (ie. chr1) while the genome file uses a different chromosome annotation (i.e. 1) or vice versa, then this discrepancy will also result in an empty data matrix. Finally, it may be that there were no identified training regions even with reasonable settings, than it may be neccessary to choose a different method to choose training sites, i.e. to set different -u and -l settings.

  5. Tips for checking model.

    The creation of a proper model is critial for HMMRATAC. It is always recommended to check the log file and visually examine the model to confirm that a proper one has been created. The open state should have the highest emission parameters for all 4 signal components, the nucleosome state should have the second highest emission parameters for all 4 signal components and the background state should have the lowest emission parameters for all 4 signal components. If this is not the case, the model is likely bad. An example is a case where the open state has the highest NFR and mononucleosome emission values, but the second highest dinucleosome and trinucleosome emission values. This often happens when there isn’t enough separation in the values between states (similar to item 3 above). A typcial fix would be to re-run the software with a more stringent set of parameters (-u and -l, similar to item 3 above). Another possible issue is that the log file shows that the kmeans model is identical to the final model. In the event that Baum Welch training fails, HMMRATAC will use the original Kmeans model in order to produce some result. Because the kmeans model doesn’t include transition parameters, this is also not ideal. The way to solve this issue, is the same as above, namely setting mote stringent parameters to create more separation in the values between states.

  6. HMMRATAC produces too many peaks. What to keep?

    By default, HMMRATAC will report all peak regions that match the learned model. This includes weaker peaks that have the characteristic patterns of nucleosome-free and nucleosome-enriched fragments but whose read depth is low. It is often necessary to filter the resulting peaks by their score, in order to eliminate these weak peaks, if that is desired. As an example, say you called peaks and wanted to keep only those whose score (defined by the –score option - default is maximum reads in the peak) is greater than or equal to 10. You could use the following command to filter the peak file:

    awk -v OFS="\t" '$13>=10 {print}' NAME_peaks.gappedPeak > NAME.filteredPeaks.gappedPeak

    To filter the summit file by the same threshold:

    awk -v OFS="\t" '$5>=10 {print}' NAME_summits.bed > NAME.filteredSummits.bed

    Generally speaking, the higher the threshold that is used for filtering, the higher the precision will be and the lower the sensitivity will be. A lower score will result in higher sensitivity and lower precision.