automatic config problem with paired samples numerically named with _samplenum · Issue #1919 · bcbio/bcbio-nextgen

I'm not sure how to describe this problem eloquently. Imagine I have paired-end (PE) sequencing data for samples 1-12 named like this, unfortunately with no leading zeros:

sample_1_1.fastq.gz	sample_1_2.fastq.gz
sample_2_1.fastq.gz	sample_2_2.fastq.gz
sample_3_1.fastq.gz	sample_3_2.fastq.gz
sample_4_1.fastq.gz	sample_4_2.fastq.gz
sample_5_1.fastq.gz	sample_5_2.fastq.gz
sample_6_1.fastq.gz	sample_6_2.fastq.gz
sample_7_1.fastq.gz	sample_7_2.fastq.gz
sample_8_1.fastq.gz	sample_8_2.fastq.gz
sample_10_1.fastq.gz	sample_10_2.fastq.gz
sample_11_1.fastq.gz	sample_11_2.fastq.gz
sample_12_1.fastq.gz	sample_12_2.fastq.gz

When I attempt automated configuration from a CSV file using -w template, I get warnings that bcbio is adding minimal metadata for samples _1 and _2, and in the YAML file it creates, the files: list is built incorrectly.

I imagine it has something to do with how the template generation script looks for _1.fastq.gz and _2.fastq.gz, and that it gets confused by the _1 and _2 in the sample names themselves.
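
To illustrate what I'd expect, here is a minimal sketch of pairing that only treats a _1 or _2 sitting directly before the extension as the read number. This is not bcbio's actual code; READ_SUFFIX, split_read_suffix and group_pairs are hypothetical names I made up for the example.

```python
import re

# Hypothetical pattern (not taken from bcbio): the read number is the
# _1 or _2 that sits immediately before the FASTQ extension.
READ_SUFFIX = re.compile(r"_(?P<read>[12])\.(?:fastq|fq)(?:\.gz)?$")


def split_read_suffix(filename):
    """Return (sample_name, read_number) for a paired-end FASTQ filename."""
    match = READ_SUFFIX.search(filename)
    if match is None:
        raise ValueError(f"no _1/_2 read suffix in {filename!r}")
    # Everything before the matched suffix is kept as the sample name,
    # including any earlier _1 or _2 that belongs to the name itself.
    return filename[:match.start()], int(match.group("read"))


def group_pairs(filenames):
    """Group filenames into {sample_name: {read_number: filename}}."""
    pairs = {}
    for name in filenames:
        sample, read = split_read_suffix(name)
        pairs.setdefault(sample, {})[read] = name
    return pairs
```

With this, sample_1_2.fastq.gz splits into the sample name sample_1 and read 2, so sample_1 and sample_10 stay separate samples with two files each.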

In any case, my workaround was simply to rename the files, or symlink them, without the _ between "sample" and the number. But this is probably not that much of an edge case, so it may be worth addressing, or at least making it obvious what's happening; it took me a few minutes to figure out what the issue was.
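
For anyone hitting the same thing, the symlink workaround can be scripted roughly like this. It is only a sketch: it assumes names of the form prefix_number_read.fastq.gz as in my listing, and workaround_name and link_renamed are hypothetical helper names, not part of bcbio.

```python
import re
from pathlib import Path

# Assumes names like sample_10_2.fastq.gz: letters, _, sample number, _, read.
NAME = re.compile(
    r"^(?P<prefix>[A-Za-z]+)_(?P<num>\d+)_(?P<read>[12])(?P<ext>\.fastq\.gz)$"
)


def workaround_name(filename):
    """Drop the _ between the prefix and the sample number."""
    m = NAME.match(filename)
    if m is None:
        return filename  # leave anything that does not fit the pattern alone
    return f"{m['prefix']}{m['num']}_{m['read']}{m['ext']}"


def link_renamed(src_dir, dest_dir):
    """Symlink every *.fastq.gz in src_dir into dest_dir under its new name."""
    src_dir, dest_dir = Path(src_dir), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    links = []
    for src in sorted(src_dir.glob("*.fastq.gz")):
        link = dest_dir / workaround_name(src.name)
        if not link.exists():
            link.symlink_to(src.resolve())
        links.append(link)
    return links
```

Calling link_renamed on the original directory and an empty one leaves the real files untouched and gives names like sample1_1.fastq.gz and sample1_2.fastq.gz to point the template at.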