Specifications of Common File Formats Used by the ENCODE Consortium

FASTQ file content

FASTQ files are submitted as they come off the sequencing instrument so that downstream users retain as much freedom as possible in deciding how to process them. The files are accompanied by documentation describing how the sequencing libraries were constructed, so that end users can judge how to process the data, what the strengths and limitations of the various processing options are, and how these apply to their biological questions of interest.

ENCODE produces replicate data for most experiments to quantify reliability. Biological replicates involve different biological samples, e.g., different tissue preparations, or separate cell growth and expansion when cell lines are used. Biological replicates are contrasted with technical replicates, for which different sequencing libraries are prepared from the same sample, or the same library is sequenced on different lanes. Reads from different replicates are stored in separate files and should include flow cell and lane ID. If multiple lanes are used for the same biological or technical replicate, their reads are stored in the same file (after a QC check to eliminate failed lanes), with information on flow cell and lane ID included. For experiments that produce paired-end reads, the two reads in each pair are stored in two separate files, with the reads in the same order in the two files.
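Because the two files of a paired-end experiment must list their reads in the same order, a quick consistency check is often useful before alignment. The following is a minimal sketch, not an ENCODE tool: the function names are hypothetical, and it assumes that mates share the same identifier up to the first whitespace, optionally followed by a "/1" or "/2" suffix (a common but not universal naming convention).

```python
import io
from itertools import zip_longest


def read_ids(handle):
    """Yield the read identifier of each four-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 != 0:  # only the first line of each record is a header
            continue
        fields = line[1:].split()  # drop the leading "@" and any comment
        name = fields[0] if fields else ""
        # Assumption: mates may be marked with a trailing "/1" or "/2".
        if name.endswith(("/1", "/2")):
            name = name[:-2]
        yield name


def pairs_in_sync(handle_r1, handle_r2):
    """Return True if both files hold the same read IDs in the same order."""
    for id1, id2 in zip_longest(read_ids(handle_r1), read_ids(handle_r2)):
        if id1 != id2:  # also catches files of different length (None fill)
            return False
    return True


# Small in-memory demonstration with made-up reads.
r1 = io.StringIO("@read1/1\nACGT\n+\nIIII\n@read2/1\nTTTT\n+\nIIII\n")
r2 = io.StringIO("@read1/2\nTGCA\n+\nIIII\n@read2/2\nAAAA\n+\nIIII\n")
print(pairs_in_sync(r1, r2))
```

For real data, the two handles could be opened with gzip.open(path, "rt") when the files are compressed.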

The reads in FASTQ files are unfiltered, i.e., barcodes that are part of the read, adapter sequences, and spike-in reads remain in the files. For Illumina sequencing, however, the barcodes that are in the so-called third read position should not be present in the sequence. For bisulfite sequencing experiments, the raw FASTQ files are presented, wherein most unmethylated cytosines are converted to thymines.

Reads are not "clipped" (no bases are removed). For example, in the case of small RNAs that are shorter than the read-length, there may be adapters flanking these reads—these adapter sequences remain in the FASTQ file. Some libraries are constructed in a way such that the barcode is read out in the sequence (CSHL small RNAs were made this way during phase II of ENCODE) and will appear in the FASTQ. Even though these barcodes would need to be trimmed off prior to mapping, they are still included in the FASTQ file because different users may choose different trimming algorithms.
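To illustrate what a downstream user might do with such unclipped reads, the sketch below removes a 3' adapter by exact string matching. It is deliberately simplistic and is not the method of any particular trimming tool: real trimmers tolerate sequencing errors, and the function name, the min_overlap parameter, and the adapter sequence used in the demonstration are all assumptions chosen for illustration. The adapter documented for the actual library should be used instead.

```python
def trim_3prime_adapter(seq, qual, adapter, min_overlap=8):
    """Clip an exact 3' adapter match from a read and its quality string.

    Handles a full adapter anywhere in the read, or a partial adapter
    (at least min_overlap bases of its start) at the very end of the read.
    """
    pos = seq.find(adapter)  # full-length adapter inside the read
    if pos == -1:
        # Look for the longest adapter prefix that ends the read.
        longest = min(len(adapter), len(seq))
        for k in range(longest, min_overlap - 1, -1):
            if seq.endswith(adapter[:k]):
                pos = len(seq) - k
                break
    if pos == -1:
        return seq, qual  # no adapter found, read is left untouched
    return seq[:pos], qual[:pos]


# Demonstration with an illustrative adapter and a 22-nt insert.
adapter = "TGGAATTCTCGGGTGCCAAGG"  # assumed example sequence
insert = "TAGCTTATCAGACTGATGTTGA"
read = insert + adapter[:14]        # read runs into the adapter
qual = "I" * len(read)
trimmed_seq, trimmed_qual = trim_3prime_adapter(read, qual, adapter)
print(trimmed_seq, len(trimmed_qual))
```

Keeping the sequence and quality strings the same length after clipping matters, since every base in a FASTQ record must retain its own quality character.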

FASTQ sequencing quality

FASTQ uses four lines for each sequence, with the fourth line denoting the sequencing quality at each position. The consortium reports Phred quality scores from 0 to 93 using ASCII characters 33 to 126, i.e., the Phred score plus 33. This encoding is used by the newest versions of the Illumina pipeline, Sanger, and SRA. The Phred score of a base[3][4] is defined as -10 log10(e), where e is the estimated probability that the base call is erroneous.
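The encoding and the score definition can be made concrete with a short sketch. The function names below are hypothetical; the code only assumes the Phred+33 convention described above (ASCII 33 to 126 for scores 0 to 93) and the relation between a score and its error probability.

```python
import math


def phred33_to_scores(quality_line):
    """Decode a FASTQ quality line (Phred+33) into integer Phred scores."""
    scores = [ord(ch) - 33 for ch in quality_line]
    for q in scores:
        if not 0 <= q <= 93:
            raise ValueError("character outside the Phred+33 range")
    return scores


def error_probability(q):
    """Estimated probability that a base with Phred score q is wrong."""
    return 10 ** (-q / 10)


def score_from_error(e):
    """Phred score for an estimated error probability e (0 < e <= 1)."""
    return -10 * math.log10(e)


# "!" is ASCII 33 (score 0), "I" is ASCII 73 (score 40), "~" is ASCII 126 (score 93).
print(phred33_to_scores("!I~"))
print(error_probability(40))
```

A score of 40 thus corresponds to an estimated error rate of about 1 in 10,000, and a score of 0 to an error probability of 1.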