Grep unique reads

9/8/2023

If I want to grep for a value in a file but display only the unique values, which option can I use?

Next we want to find the fastest way possible to count these. All timings are the average wall-clock time (real) of 10 runs, collected with the bash time builtin on an otherwise unloaded system. The methods compared are zgrep, gzip awk, and Konrad's gzip awk wc variant.

The zcat to /dev/null reference is the following: $ time zcat SRR077487_2.filt.fastq.

The slowest method has an average run-time of 125.35 seconds.

Using gzip we gain about another 10 seconds: gzip -dc ERR047740_1. | Average run-time is 116.69 seconds.

Konrad's gzip awk wc variant, which uses a fix_base_count() helper: $ time gunzip -c SRR077487_2. | awk 'NR % 4 == 2' | wc -cl | fix_base_count
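As a rough illustration of the counting pipelines being timed, here is a minimal sketch that counts reads and bases in a single awk pass. The function name count_fastq is hypothetical, not from the original post, and the sketch assumes strict four-line FASTQ records (no wrapped sequence lines). Because awk's length() does not count the trailing newline, no correction step of the fix_base_count kind is needed here.

```shell
# count_fastq FILE: print "<reads> <bases>" for a gzip-compressed FASTQ file.
# Hypothetical helper; assumes every record is exactly four lines long.
count_fastq() {
    gzip -dc "$1" | awk '
        NR % 4 == 2 {              # the sequence line of each record
            reads++
            bases += length($0)    # length() excludes the newline
        }
        END {
            # %.0f avoids integer overflow in awk implementations
            # whose %d is limited to 32 bits
            printf "%.0f %.0f\n", reads, bases
        }'
}
```

It can be timed the same way as the other variants, for example with time count_fastq reads.fastq.gz, where reads.fastq.gz stands in for your own file.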

First off, for benchmarks with FASTQ it's best to use a specific real-world example with a known answer. I've chosen this file as my test file, the correct answers being: Number of reads: 67051220

Also, this solution gives you more flexibility with what you can do with the data. And my horrible C can almost certainly be optimised.

I ran the same test with kseq.h from GitHub, as suggested in the comments. My machine is under different load this morning, so I've retested. So the most recent version of kseq.h is faster than simply zcat-ing the file (consistently in my tests).

So, I get pretty close in speed, but am likely to be more standards compliant.

If your input is in SAM format, check the XS:i: tag: it will only be present if the SAM record is for an aligned read and more than one alignment was found for the read.
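The XS:i: check can be sketched as a small filter. This assumes the tag behaves as described above, which holds for aligners such as Bowtie 2; other aligners may use XS:i: differently, so check your aligner's documentation first. The function name unique_alignments is hypothetical.

```shell
# unique_alignments [FILE...]: print SAM records that are aligned and carry
# no XS:i: tag. Hypothetical helper; the meaning of XS:i: is aligner-specific.
unique_alignments() {
    awk -F '\t' '
        /^@/ { next }                  # skip header lines
        int($2 / 4) % 2 == 1 { next }  # FLAG bit 0x4 set: read is unmapped
        {
            # optional tags start at field 12
            for (i = 12; i <= NF; i++)
                if ($i ~ /^XS:i:/) next
            print
        }
    ' "$@"
}
```

With no file arguments the function reads from standard input, so it can sit at the end of a pipeline; piping its output to wc -l gives the number of such records.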

We could instead focus on making sure we are getting the right answer. People deride them too often, but this is where a well-written parser is worth its weight in gold.

I downloaded the example tarball and modified the example code (excuse my C):

    #include
    fprintf(stderr, "Usage: %s \n", argv[0]);
    seqlen = seqlen + (long)strlen(seq->seq.s);
    printf("Number of sequences: %d\n", seqcount);
    printf("Number of bases in sequences: %ld\n", seqlen);

For my example file (~35m reads of ~75bp) this took: real 0m49.670s

Compared with your example: real 0m43.616s

Konrad's solution (in my hands): real 0m39.682s

(By the way, just zcat-ing the data file to /dev/null): real 0m38.736s

grep reads the contents of the specified files (in the order specified), finds the lines that contain the search string, and finally returns those lines.
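For the question of displaying only unique matches, grep itself has no de-duplication option, so one common approach is to pipe its output through sort -u. The sketch below assumes a grep that supports -o (print only the matched text) and -h (suppress file names), as GNU and BSD grep do; the function name grep_unique is hypothetical.

```shell
# grep_unique PATTERN [FILE...]: print each distinct match of PATTERN once.
# Hypothetical helper; relies on the -o and -h options of GNU/BSD grep.
grep_unique() {
    pattern=$1
    shift
    # -o prints only the matching text, -h omits file names,
    # -e protects patterns that begin with a dash
    grep -o -h -e "$pattern" "$@" | sort -u
}
```

Replacing sort -u with sort | uniq -c would additionally report how many times each value occurred. Dropping -o makes the pipeline de-duplicate whole matching lines instead of just the matched text.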

I think it's difficult to get this to go massively quicker: as with this question, working with large gzipped FASTQ files is mostly IO-bound.
