removed duplicated and trailing whitespace #13

Open
wants to merge 1 commit into
base: master

pickettbd

I removed the duplicated and trailing whitespace from index files. In some cases, 2 or more tabs were present between columns. I also removed the trailing whitespace at the end of the lines. Otherwise, the text remains the same.

Issues were fixed with GNU sed like this:
sed -r -i 's,\t+,\t,g' file1 file2 ... fileN
sed -r -i 's,[\t ]+$,,' file1 file2 ... fileN
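Before editing in place, it can help to preview which lines the two expressions above would touch. The following is a minimal sketch, not part of the original change; the function name `preview_whitespace_issues` is hypothetical, and it assumes an awk that understands `\t` in regular expressions (gawk and mawk do).

```shell
# Hypothetical helper: list lines that the two sed expressions would change.
# Prints "file:line: reason" for each match and modifies nothing.
preview_whitespace_issues() {
  awk '
    /\t\t/    { printf "%s:%d: repeated tabs\n", FILENAME, FNR }
    /[\t ]+$/ { printf "%s:%d: trailing whitespace\n", FILENAME, FNR }
  ' "$@"
}
```

It takes the same file arguments as the sed commands, for example `preview_whitespace_issues file1 file2`.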

Here is the list of affected files:

AshkenazimTrio/sequence.index.AJtrio_HG002_NIST_SOLiD5500W_xsq_09042015.HG002
AshkenazimTrio/alignment.index.AJtrio_Illumina_6kb_matepair_wgs_bwamem_GRCh37_07302015.HG002
AshkenazimTrio/alignment.index.AJtrio_Illumina_6kb_matepair_wgs_bwamem_GRCh37_07302015
AshkenazimTrio/sequence.index.AJtrio_BioNano_bnx_10012015
AshkenazimTrio/sequence.index.AJtrio_HG002_Cornell_Oxford_Nanopore_fasta_fastq_10132015
AshkenazimTrio/sequence.index.AJtrio_BioNano_bnx_10012015.HG003
AshkenazimTrio/sequence.index.AJtrio_BioNano_bnx_10012015.HG004
AshkenazimTrio/sequence.index.AJtrio_HG002_NIST_SOLiD5500W_xsq_09042015
AshkenazimTrio/sequence.index.AJtrio_BioNano_bnx_10012015.HG002
AshkenazimTrio/sequence.index.AJtrio_PacBio_MtSinai_NIST_hdf5_08072015
AshkenazimTrio/sequence.index.AJtrio_PacBio_MtSinai_NIST_hdf5_08072015.HG002
AshkenazimTrio/sequence.index.AJtrio_HG002_Cornell_Oxford_Nanopore_fasta_fastq_10132015.HG002
AshkenazimTrio/sequence.index.AJtrio_PacBio_MtSinai_NIST_hdf5_08072015.HG003
AshkenazimTrio/sequence.index.AJtrio_PacBio_MtSinai_NIST_hdf5_08072015.HG004
AshkenazimTrio/alignment.index.AJtrio_Illumina_2x250bps_isaac-align_hg19_06012016.HG004
AshkenazimTrio/alignment.index.AJtrio_Illumina_2x250bps_isaac-align_hg19_06012016
ChineseTrio/sequence.index.ChineseTrio_HG005_BioNano_bnx_10012015.HG005
ChineseTrio/sequence.index.ChineseTrio_HG005_NIST_SOLiD5500W_xsq_09042015
ChineseTrio/sequence.index.ChineseTrio_HG005_BioNano_bnx_10012015
ChineseTrio/sequence.index.ChineseTrio_HG005_NIST_SOLiD5500W_xsq_09042015.HG005
NA12878/sequence.index.NA12878_PacBio_MtSinai_NIST_hdf5_08182015

@chunlinxiao
Contributor

Thanks, Brandon. I'll check how those were introduced in the index generation process.

@chunlinxiao
Contributor

For sequence.index files, we intended to have 5 columns to capture paired reads with their md5s and the sample name (column 5). For alignment.index files, we intended to have 4 columns for the bam and bam.bai with their md5s.

None of the sequence.index files you listed above contain paired reads, so two empty fields were included there.

For some reason, extra spaces or tabs were introduced while updating those 4 alignment index files; these have now been fixed.
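To illustrate why the empty fields matter, here is a small sketch of what the tab-collapsing expression from the PR description does to an unpaired record. The record contents are made up for illustration, and GNU sed is assumed, as in the PR description.

```shell
# A made-up unpaired record: columns 3 and 4 are intentionally empty.
record=$(printf 'reads.h5\t0123abcd\t\t\tHG002')

# Field count before any edit.
before=$(printf '%s\n' "$record" | awk -F'\t' '{ print NF }')

# Field count after collapsing runs of tabs (GNU sed assumed).
after=$(printf '%s\n' "$record" | sed -r 's,\t+,\t,g' | awk -F'\t' '{ print NF }')

echo "before: $before fields, after: $after fields"
```

With GNU sed the count drops from 5 to 3, so the sample name would shift from column 5 to column 3.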

@pickettbd
Author

Okay, I think that makes sense. Let me just make sure I understand. You're saying that:

  1. The sequence index files should have 5 columns regardless of whether there are paired reads or single reads
  2. Alignment index files should have 4 columns.

Is it also safe to assume the following?

  1. Columns are tab-delimited for both sequence and alignment index files
  2. No spaces or tabs should trail the end of a line

Also, thanks for fixing those 4 alignment index files 😄 🙏
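The assumptions above could be checked mechanically. This is a rough sketch under those assumptions only; `check_index` is a hypothetical name, not an existing tool in this repository.

```shell
# Hypothetical checker: check_index EXPECTED_COLUMNS FILE...
# Reports lines whose tab-separated field count differs from EXPECTED_COLUMNS
# or that end in a space or tab. Returns non-zero if anything was reported.
check_index() {
  expected=$1
  shift
  awk -F'\t' -v n="$expected" '
    BEGIN    { bad = 0 }
    NF != n  { printf "%s:%d: %d columns, expected %d\n", FILENAME, FNR, NF, n; bad = 1 }
    /[\t ]$/ { printf "%s:%d: trailing whitespace\n", FILENAME, FNR; bad = 1 }
    END      { exit bad }
  ' "$@"
}
```

For example, `check_index 5` followed by the sequence index files, or `check_index 4` followed by the alignment index files. Empty fields count as columns, so an unpaired record with two consecutive empty fields still passes the 5-column check.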

@chunlinxiao
Contributor

Yeah, your assumptions are correct!

Also, please let us know when you find anything unusual in the index files.

Personally, I really appreciate your efforts in helping us make this resource more valuable.

chunlin

@pickettbd
Author

Glad I can help 😄

If I come across any other things, I'll share my findings in an issue.

Another clarifying question for you: are the bionano alignment files supposed to have only 2 columns (XMAP_CMAP & XMAP_CMAP_MD5)?

@chunlinxiao
Contributor

Actually, the BioNano xmap/cmap index is an exception among alignment.index files, and it is described in https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/README.ftp_structure:

The format of sequence.index (if no paired data, columns 3 and 4 will be empty) is as follows:
For fastqs:
FASTQ FASTQ_MD5 PAIRED_FASTQ PAIRED_FASTQ_MD5 NIST_SAMPLE_NAME

For hdf5:
HDF5 HDF5_MD5 NIST_SAMPLE_NAME

For SOLiD xsq:
XSQ XSQ_MD5 NIST_SAMPLE_NAME

For BioNano bnx:
BNX BNX_MD5 NIST_SAMPLE_NAME

The format of alignment.index:
For BAM:
BAM BAM_MD5 BAI BAI_MD5

For BioNano XMAP or CMAP:
XMAP_CMAP XMAP_CMAP_MD5
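The column counts discussed in this thread can be summarized as a lookup keyed on the index file name. This is only a sketch: `expected_columns` is a hypothetical helper, and matching BioNano alignment indexes by the substring "BioNano" in the file name is an assumption. The README excerpt lists three names for the unpaired formats, while the thread states that sequence.index files always have 5 columns with two left empty; the sketch follows the thread.

```shell
# Hypothetical mapping from an index file name to its expected column count.
expected_columns() {
  name=$(basename "$1")
  case "$name" in
    # Assumption: BioNano alignment indexes carry "BioNano" in their name.
    alignment.index.*[Bb]io[Nn]ano*) echo 2 ;;  # XMAP_CMAP, XMAP_CMAP_MD5
    alignment.index.*)               echo 4 ;;  # BAM, BAM_MD5, BAI, BAI_MD5
    sequence.index.*)                echo 5 ;;  # 5 columns; two are empty when unpaired
    *)                               return 1 ;;  # not an index file name we recognize
  esac
}
```

For example, `expected_columns NA12878/sequence.index.NA12878_PacBio_MtSinai_NIST_hdf5_08182015` prints 5.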

Many thanks to you Brandon.

chunlin
