Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use SAM headers from index directory's <prefix>.hdr if it exists #348

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

jmarshall
Copy link
Contributor

Introduce a <prefix>.hdr file that the sysadmin or user should place alongside the other index files, useful for setting up detailed @SQ headers etc to be used by default when mapping.

The idea is to make it easy to have rich @SQ headers (and other headers if desired) in bwa mapped output. Rather than needing to use -H every time, you can set up a <prefix>.hdr file at the same time as you generate the bwa index. In particular, this can be used to add descriptive fields such as MD5, species, assembly, and alternative names (which enables querying for e.g. chr1 or 1 interchangeably; see samtools/hts-specs#100 and samtools/htslib#931).

This PR adds to the mem/samse/sampe commands so that they output SAM headers from:

  • -H options;
  • a .hdr file stored with the index files;
  • the basic hard-coded @HD/@SQ default headers.

An @HD header is always printed, with an @HD line from -H overriding one from <prefix>.hdr (if any), which overrides the default one added since #336. A set of @SQ headers is always printed, with -H options overriding <prefix>.hdr overriding the default, similarly.

Other headers specified in either -H or <prefix>.hdr are all printed.

If -H includes @SQ headers, we consider that the user has carefully set up all the headers to be output and ignore <prefix>.hdr entirely.

Also adds a brief description of <prefix>.alt and a full description of <prefix>.hdr to the manual page.

@nh13
Copy link
Contributor

nh13 commented Mar 18, 2022

What if it just uses a .dict file by default when present? while .hdr is more flexible, .dict files are already in use.

@jmarshall
Copy link
Contributor Author

jmarshall commented Mar 19, 2022

The contents (a bunch of SAM headers) would be much the same, just the filename would be different. I considered using the existing.dict convention for this, but there are several reasons not to:

  1. For a FASTA reference file GRCh38.fa, the usual name for a corresponding dict file would be GRCh38.dict but all the rest of bwa's index files are of the form GRCh38.fa.xyz. (This is the most compelling reason.)
  2. All (except one perhaps, which is shorter) of bwa's index files use three letter extensions. (I would not want to stop this working on DOS 😄)
  3. AIUI Sequence Dictionaries contain only @SQ and maybe @HD headers. The intention here is that it may contain any header — there would be use cases for adding constant @PG or @CO headers to your mapping output.
  4. The purpose of this file is just to have a bunch of headers to be copied into the output. Hence .hdr. If the filename were .dict, users might infer that it was used by bwa as a dictionary to do lookups of some kind.

(You can always use bwa mem -H foo.dict … to use a dict file with whatever filename it might have.)

@nh13
Copy link
Contributor

nh13 commented Mar 19, 2022

Are there types of lines that should be disallowed? In (1), it looks like the intention is that the file should live statically alongside the other bwa files. So RG, and perhaps even PG, should not be static. RG is obvious as to why, but PG being in the file would be misleading if we use a newer version of bwa with older and compatible index folds.

Given .dict files are found a lot in the wild, and can be generated by multiple tools, how about they get used if the .hdr file does not exist? I find having yet another auxiliary file I have to maintain when .dict already exists confusing and cumbersome. If folks want a custom header, they can just specify it similar to your .dict example with the -H option. And is this file generated by bwa, then hand modified to add additional lines, or even add AH fields to SQ lines? If so, then should it really be codified as a bwa generated static file? I don’t think so.

@jmarshall
Copy link
Contributor Author

jmarshall commented Mar 19, 2022

  1. As described, the intention is for this to be a static file alongside the other bwa index files. Like the .alt file, it is prepared separately rather than by bwa index. (It might be worth adding code so that bwa index would generate a starting point for this file, but as tools like CreateSequenceDictionary and samtools dict already do this there's not much need.)
  2. BWA already adds a @PG ID:bwa … line containing its version number. If you added some static @PG lines to this file, it would be to represent other early-stage processing such as bcl2fastq. (This is indeed a bit esoteric; @SQ will be what is common in this file.)
  3. Users could indeed get the effects of this PR by always using bwa mem -H /path/to/ref.fa.hdr … every time they run BWA to do mapping (mem users anyway; samse/sampe users do not have such an option). But the point of this PR is to make it easy: to make decorated @SQ lines something that can be set up once when you set up the index.
  4. The primary motivation for this is to add SP, AS, M5, TP, and AN fields to the @SQ lines. Especially the latter are not fully automatable, so will require hand-editing from some starting point.
  5. You are correct that manually adding AH fields corresponding to the .alt file would be painful. This is why I have a followup samtools PR coming, which will add an option to samtools dict to read the .alt file and emit AH appropriately. (And this is the real reason why this starting point is best prepared by an external dict tool: the .alt file may not exist yet when bwa index is run, so that's really too soon to create this starting point.)

This file is a dict file, just with a slightly different purpose, so the different name adds some flexibility. But if @lh3 prefers, I would add the code to strip a suffix and add .dict instead of just adding .hdr to …ref.fa in the same way that other index file names are constructed.

For the time being, maybe let's stop quibbling about the filename so that we can hear the BWA maintainer's thoughts about the proposed feature 😄

@nh13
Copy link
Contributor

nh13 commented Feb 18, 2024

@jmarshall open to whatever works here

Introduce a <prefix>.hdr file alongside the other index files, useful for
setting up detailed `@SQ` headers etc to be used by default when mapping.

Output SAM headers from: -H options; .hdr file stored with the index files;
the basic hard-coded `@HD`/`@SQ` default headers. An `@HD` header is always
printed, with -H overriding one from <prefix>.hdr (if any), which overrides
the default one. A set of `@SQ` headers is always printed, with -H options
overriding <prefix>.hdr overriding the default, similarly. Other headers
specified in either -H or <prefix>.hdr are all output.

If -H includes `@SQ` headers, we consider that the user has carefully
set up all headers to be output and ignore <prefix>.hdr entirely.

Add descriptions of <prefix>.alt (brief) and <prefix>.hdr (complete)
to the manual page. This <prefix>.hdr file is implemented for mem, bwase,
and bwape. It is not implemented for bwasw, which is deprecated.
Now outputs SAM headers from: -H options; .hdr file stored with the
index files if present, or otherwise a .dict file if that is present;
the basic hard-coded `@HD`/`@SQ` default headers.
@jmarshall
Copy link
Contributor Author

@nh13: I've updated it to accept either:— hence it reads headers from <prefix>.hdr if present, otherwise reads headers from <baseprefix>.dict if that is present; (otherwise does the default thing).

The string munching required to get from foo.fa[sta][.gz] to foo.dict was not much fun to write 😄 but you are right that accepting that filename convention as well is the best thing to do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants