
Perl Scripts For Identifying and Organizing Repeats Found in Raw Sequencing Reads
1. cent_repeat_finder
sample input: crf_input.txt
sample output: crf_repeats_out.txt, crf_repeats_out_fasta.txt
cent_repeat_finder uses a sliding-window approach to find repeats in dropped genomic reads. By default, it looks for repeats longer than 120 bp. The input is the dropped genomic sequencing reads in fasta format (sample = "crf_input.txt"). The script produces two output files, crf_repeats_out.txt and crf_repeats_out_fasta.txt. The first file (crf_repeats_out.txt) is a summary file listing, for each repeat, the name of the sequencing read the repeat was taken from, the two 10 bp "window" sequences used to identify the repeat, the length of the repeat, and the repeat sequence. The second file (crf_repeats_out_fasta.txt) contains only the identified repeats in fasta format, with each repeat named after its original sequencing read.
user defined parameters: input/output names, minimum length used to search for repeats
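The description above does not spell out how the two windows are matched, so the following is only a minimal Python sketch of one plausible sliding-window search, not the Perl script itself. It assumes that a repeat starts at a window and ends just before the next exact occurrence of that same window, more than min_len bases downstream. The function name find_repeat and its parameters are hypothetical.

```python
def find_repeat(seq, window=10, min_len=120):
    """Return (first_window, second_window, length, repeat) or None.

    Assumption (not taken from the Perl script): a repeat is the stretch
    that starts at a window and ends just before the next exact copy of
    that window, found more than min_len bases downstream.
    """
    seq = seq.upper()
    for start in range(len(seq) - window + 1):
        probe = seq[start:start + window]
        # Look for the same window strictly more than min_len bases away,
        # so the reported repeat is longer than min_len.
        hit = seq.find(probe, start + min_len + 1)
        if hit != -1:
            repeat = seq[start:hit]
            return probe, seq[hit:hit + window], len(repeat), repeat
    return None


if __name__ == "__main__":
    # Toy read with a short window and threshold so the result is easy to check.
    read = "ACGTTGCA" * 2 + "GG"
    print(find_repeat(read, window=4, min_len=5))
```

With exact matching the two reported windows are identical; a search that tolerates mismatches would report them separately, as the summary file does.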
2. repeat_counter
sample input: crf_repeats_out.txt
sample output: repeatcount_out.txt
repeat_counter assesses the lengths of all the repeats identified by the cent_repeat_finder script.
Using the summary output file (crf_repeats_out.txt), repeat_counter tallies the number of repeats found at each repeat length. The output file (repeatcount_out.txt) is in tab-delimited format and is ordered by repeat length. The total number of repeats is also reported.
user defined parameters: input/output names
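The tallying step can be sketched in Python as follows. This is an illustration rather than the Perl script: it takes a plain list of repeat lengths, because the column layout of crf_repeats_out.txt is not described here, and the names count_repeat_lengths and format_table are hypothetical.

```python
from collections import Counter


def count_repeat_lengths(lengths):
    """Tally repeats per length; return (rows sorted by length, total)."""
    counts = Counter(lengths)
    rows = sorted(counts.items())  # [(length, count), ...] in ascending length
    return rows, sum(counts.values())


def format_table(rows, total):
    """Render the tally as tab-delimited text with a final total line."""
    lines = [f"{length}\t{count}" for length, count in rows]
    lines.append(f"total\t{total}")
    return "\n".join(lines)


if __name__ == "__main__":
    rows, total = count_repeat_lengths([178, 156, 178, 178, 156, 340])
    print(format_table(rows, total))
```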
3. repeat_extractor
sample input: crf_repeats_out_fasta.txt
sample output: repeatextractor_out.txt, repeat_percenta_out.txt, repeat_percentat_out.txt
repeat_extractor uses the repeat fasta file produced by the cent_repeat_finder script (crf_repeats_out_fasta.txt) to pull out repeats within a user-specified length range. The lower bound is set by the variable $lowerlimit, the upper bound by $upperlimit. Repeats that fall in this length range are extracted and written to a new file (repeatextractor_out.txt). repeat_extractor also reports the adenine and adenine/thymine percent content of each repeat. Graphing these values can help establish whether repeats of the same length class are found as reverse complements. For example, if the A/T percentage of the repeats differs from 50%, then the distribution of A% should indicate the two complementary forms of the repeats.
user defined parameters: input/output names, the length range used to extract repeats
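The extraction and base-composition steps can be sketched in Python as follows. This is an illustration, not the Perl script: treating both limits as inclusive is an assumption, and the function names parse_fasta and extract_repeats are hypothetical (only lowerlimit and upperlimit mirror the variables named above).

```python
def parse_fasta(text):
    """Parse fasta text into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def extract_repeats(records, lowerlimit, upperlimit):
    """Keep repeats whose length is in [lowerlimit, upperlimit] (inclusive,
    by assumption) and attach their A% and A/T% content."""
    kept = []
    for name, seq in records:
        if not seq or not (lowerlimit <= len(seq) <= upperlimit):
            continue
        a_pct = 100.0 * seq.count("A") / len(seq)
        at_pct = 100.0 * (seq.count("A") + seq.count("T")) / len(seq)
        kept.append((name, seq, a_pct, at_pct))
    return kept


if __name__ == "__main__":
    fasta = ">r1\nAATT\n>r2\nAAAC\nGG\n>r3\nAC\n"
    for row in extract_repeats(parse_fasta(fasta), 4, 6):
        print(row)
```

Because the A% of one strand equals the T% of its reverse complement, two clusters in a plot of A% for a single length class would be consistent with the repeat being present in both orientations.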
4. read_doubler and read_doubler_revcom
sample input: repeatextractor_out.txt
sample output: readdoubler_output.txt or readdoublerrevcom_output.txt
read_doubler and read_doubler_revcom double the repeats of a given length, using the output (repeatextractor_out.txt) from the repeat_extractor script as input. Doubling the repeats is helpful for alignment. In addition to doubling the repeat in its original orientation, read_doubler_revcom gives the reverse complement of the doubled repeat, which can be helpful when aligning repeats that were initially identified only by length and not by orientation. The output file is called either readdoubler_output.txt or readdoublerrevcom_output.txt.
user defined parameters: input/output names
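The doubling and reverse-complement operations can be sketched in Python as follows. This is an illustration, not the Perl scripts: the function names are hypothetical, and only A, C, G, T and N (in either case) are complemented; any other character is passed through unchanged.

```python
# Translation table for complementing bases; other characters are left as-is.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def double_read(seq):
    """Return the repeat concatenated with itself."""
    return seq + seq


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def double_read_revcom(seq):
    """Return the doubled repeat and the reverse complement of the doubled repeat."""
    doubled = double_read(seq)
    return doubled, reverse_complement(doubled)


if __name__ == "__main__":
    print(double_read("AACG"))
    print(double_read_revcom("AACG"))
```

One reason doubling can help with alignment is that a doubled tandem-repeat unit contains every rotation of that unit, so copies that were cut out of their reads at different starting positions can still align against it in full.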