Skip to content

Latest commit

 

History

History
206 lines (139 loc) · 6.11 KB

readme.rst

File metadata and controls

206 lines (139 loc) · 6.11 KB

NESS Builder

nessb builds heterogeneous networks from a variety of disparate functional genomics data sources.

Usage

nessb [OPTIONS] <output-file>

Lets say you have a series of datasets: some structured (as a DAG) ontology, ontology annotations, and gene sets (differential expression studies, etc.). A single graph can be built from these sources:

$ nessb -o ontology.tsv -a annotations.tsv -g genesets.tsv graph.al

This above nessb command will generate two files: graph.al which contains the harmonized network in the form of an adjacency list, and entity-map-graph.al, which maps internal NESS identifiers used for concept representation back to identifiers from the user supplied data sources (ontology.tsv, annotations.tsv, and genesets.tsv).

Options

  • -a, --annotations=FILE: Integrate ontology annotations into the network
  • -e, --edges=FILE: Integrate the contents of the edge list file into the network
  • -g, --genesets=FILE: Integrate gene sets into the network
  • -h, --homologs=FILE: Integrate homology associations among genes into the network
  • -o, --ontology=FILE: Integrate ontology structures and relationships into the network
  • -d, --directed: Build a directed graph (undirected graphs are built by default)
  • --permute=N: Generate up to N graph permutations for permutation testing
  • -v, --verbose: Clutter your screen with output
  • --help: Display the help message

File Formats

Input files use a simple tab delimited format. All gene (which includes annotated and homologous genes) and gene set identifiers must be integers. If using something like gene symbols, these symbols must be mapped to some unique integer identifier prior to their use as nessb input. nessb will likely be updated in the future to accomodate string-based identifiers.

Annotations

The annotation file contains ontology term and gene associations (annotations) in the following format:

term_id gene_id evidence_code species_id
GO:0051960 11 TAS 9606
GO:0051960 16 TAS 9606
GO:0007399 11 TAS 9606

Each annotation file should have a minimum of four fields: term_id, gene_id, evidence_code, and species_id.

  • term_id is a unique identifier string which represents a single ontology term.
  • gene_id is a unique identifier which a gene, gene product, or variant. It must be an integer.
  • evidence_code is some identifier string which describes the evidence supporting a given term and gene association. If no evidence is provided, this field can be set to any character such as '.' or '_'.
  • species_id is a unique integer identifier representing the species for a given annotation. NCBI taxon IDs are commonly used.

Edges

The edge file represents gene networks using edge list pairs in the following format:

species_id from to
9606 11 12
9606 16 12
9606 12 13

All fields are required.

  • species_id is a unique integer identifier representing the species for a given annotation.
  • The from and to fields are both unique integer IDs representing a gene or bioentity. These fields indicate a directed edge originates at from and points toward to

Gene sets

The gene set file represents a collection of genes pertaining to some biological process or state in the following format:

geneset_id species_id genes
246376 9606 10|11|12
75605 9606 16|15
246626 9606 13

All fields are required.

  • geneset_id is a unique integer identifier representing a given gene set
  • species_id is a unique integer identifier representing the species for a given annotation.
  • The genes field contains a list of gene identifiers present in the set. Like all gene identifiers, these should be integers. Individual genes are pipe delimited "|".

Ontology relationsips

The ontology file represents a series of child-parent ontology term relationships. Ideally, reassembly of these edges would form a directed acyclic graph (DAG). This file is in a format identical to the edge list format but allows for string based identifiers:

child parent
GO:0051960 GO:0007399
GO:0022008 GO:0007399
GO:0021675 GO:0007399
GO:0007399 GO:0048731

All fields are required.

  • The child and parent fields should each be ontology term identifiers. They represent a child &#8594; parent (subconcept &#8594; superconcept) relationship.

Requirements

Installation

See the stack website for instructions on installing stack. After installing stack, make sure it's available on your PATH.

Compile the builder:

$ make

Run tests:

$ make test

Install to the user specific bin directory (usually $HOME/.local/bin):

$ make install