nessb
builds heterogeneous networks from a variety of disparate functional
genomics data sources.
nessb [OPTIONS] <output-file>
Lets say you have a series of datasets: some structured (as a DAG) ontology, ontology annotations, and gene sets (differential expression studies, etc.). A single graph can be built from these sources:
$ nessb -o ontology.tsv -a annotations.tsv -g genesets.tsv graph.al
This above nessb
command will generate two files: graph.al
which contains
the harmonized network in the form of an adjacency list, and entity-map-graph.al
,
which maps internal NESS identifiers used for concept representation back to identifiers
from the user supplied data sources (ontology.tsv
, annotations.tsv
, and
genesets.tsv
).
-a, --annotations=FILE
: Integrate ontology annotations into the network-e, --edges=FILE
: Integrate the contents of the edge list file into the network-g, --genesets=FILE
: Integrate gene sets into the network-h, --homologs=FILE
: Integrate homology associations among genes into the network-o, --ontology=FILE
: Integrate ontology structures and relationships into the network-d, --directed
: Build a directed graph (undirected graphs are built by default)--permute=N
: Generate up to N graph permutations for permutation testing-v, --verbose
: Clutter your screen with output--help
: Display the help message
Input files use a simple tab delimited format.
All gene (which includes annotated and homologous genes) and gene set identifiers
must be integers.
If using something like gene symbols, these symbols must be mapped to some unique integer
identifier prior to their use as nessb
input.
nessb
will likely be updated in the future to accomodate string-based identifiers.
The annotation file contains ontology term and gene associations (annotations) in the following format:
term_id | gene_id | evidence_code | species_id |
---|---|---|---|
GO:0051960 | 11 | TAS | 9606 |
GO:0051960 | 16 | TAS | 9606 |
GO:0007399 | 11 | TAS | 9606 |
Each annotation file should have a minimum of four fields: term_id
,
gene_id
, evidence_code
, and species_id
.
term_id
is a unique identifier string which represents a single ontology term.gene_id
is a unique identifier which a gene, gene product, or variant. It must be an integer.evidence_code
is some identifier string which describes the evidence supporting a given term and gene association. If no evidence is provided, this field can be set to any character such as '.' or '_'.species_id
is a unique integer identifier representing the species for a given annotation. NCBI taxon IDs are commonly used.
The edge file represents gene networks using edge list pairs in the following format:
species_id | from | to |
---|---|---|
9606 | 11 | 12 |
9606 | 16 | 12 |
9606 | 12 | 13 |
All fields are required.
species_id
is a unique integer identifier representing the species for a given annotation.- The
from
andto
fields are both unique integer IDs representing a gene or bioentity. These fields indicate a directed edge originates atfrom
and points towardto
The gene set file represents a collection of genes pertaining to some biological process or state in the following format:
geneset_id | species_id | genes |
---|---|---|
246376 | 9606 | 10|11|12 |
75605 | 9606 | 16|15 |
246626 | 9606 | 13 |
All fields are required.
geneset_id
is a unique integer identifier representing a given gene setspecies_id
is a unique integer identifier representing the species for a given annotation.- The
genes
field contains a list of gene identifiers present in the set. Like all gene identifiers, these should be integers. Individual genes are pipe delimited "|".
The ontology file represents a series of child-parent ontology term relationships. Ideally, reassembly of these edges would form a directed acyclic graph (DAG). This file is in a format identical to the edge list format but allows for string based identifiers:
child | parent |
---|---|
GO:0051960 | GO:0007399 |
GO:0022008 | GO:0007399 |
GO:0021675 | GO:0007399 |
GO:0007399 | GO:0048731 |
All fields are required.
- The
child
andparent
fields should each be ontology term identifiers. They represent a child → parent (subconcept → superconcept) relationship.
- GHC 8.2.2
- Stack
See the stack website for instructions on installing stack. After installing stack, make sure it's available on your PATH.
Compile the builder:
$ make
Run tests:
$ make test
Install to the user specific bin directory (usually $HOME/.local/bin
):
$ make install