Large Scale Experiment Notes

This guide will show how to use our system along with MongoDB to run author disambiguation on tens (or hundreds) of millions of mentions.

To start, we show how to set up a single MongoDB instance which will be queried by all worker threads. In the future, we can replicate or shard the database for better performance.

Please email me with any questions/issues you encounter.

Recommended Machine Setup

  • 20-40+ GB RAM
  • 20+ CPUs
  • NUMA architecture (the setup is untested on non-NUMA machines)
  • MongoDB 3.0+ installed following any additional instructions Mongo gives for your OS.

Starting Mongo

Create a directory in which to store the database, e.g.:

mkdir -p data/mongodb/authormention_db

To start mongo on a NUMA machine use the following command:

numactl --interleave=all mongod --port 25752 --dbpath data/mongodb/authormention_db

and to start mongo on a non-NUMA machine:

mongod --port 25752 --dbpath data/mongodb/authormention_db

This starts a MongoDB instance on the machine, listening on port 25752. We will then populate the database with mentions using a Scala program described below.

The shell script scripts/db/start-mongo-server.sh can also be used to start the server.

Data Format

The coreference algorithms are designed to operate on AuthorMention data structures. AuthorMention extends Cubbie, a serializable object from factorie; Cubbies can easily be serialized to and deserialized from MongoDB. An AuthorMention has the following fields:

class AuthorMention extends CorefMention {
  
  // The author in focus for this mention
  val self = new CubbieSlot[Author]("self", () => new Author())
  // The co-authors listed on the publication
  val coauthors = new CubbieListSlot[Author]("coauthors",() => new Author())
  // The title of the publication
  val title = new StringSlot("title")
  // The words of the title that will be used in the word embedding representation of the title
  // NOTE: The casing here should match the casing used in your embedding keystore
  val titleEmbeddingKeywords = new StringListSlot("titleEmbeddingKeywords")
  // The topics discovered using LDA or similar method
  val topics = new StringListSlot("topics")
  // The raw string text containing the abstract and/or body of the publication
  val text = new StringSlot("text")
  // A tokenized version of the text
  val tokenizedText = new StringListSlot("tokenizedText")
  // The venue(s) of publication
  val venues = new CubbieListSlot[Venue]("venues", () => new Venue())
  // The keywords of the publication
  val keywords = new StringListSlot("keywords")
  
  // The canopies to which the mention belongs
  val canopies = new StringListSlot("canopies")
  // For cases where the mention belongs to a single canopy
  val canopy = new StringSlot("canopy")
  // Where the mention came from
  val source = new StringSlot("source")
...
}

The AuthorMention inherits the following fields from its parent trait CorefMention:

trait CorefMention extends CubbieWithHTMLFormatting {
  def mentionId: StringSlot = new StringSlot("mentionId")
  def entityId: StringSlot = new StringSlot("entityId")
}
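
For concreteness, here is a minimal sketch of building a single AuthorMention by hand using the Cubbie slot API (:= sets a slot, .value reads it). The field values are hypothetical, and the Author slot names (firstName, middleNames, lastName) are assumed to match the JSON example shown later in this guide.

val author = new Author()
author.firstName := "lee"
author.middleNames := Seq("a")
author.lastName := "becket"

val mention = new AuthorMention()
mention.mentionId := "P81-1005_LN_Becket_FN_Lee"
mention.self := author
mention.title := "PHONY: A Heuristic Phonological Analyzer"
// Casing should match the casing used in your embedding keystore
mention.titleEmbeddingKeywords := Seq("phony", "a", "heuristic", "phonological", "analyzer")
mention.coauthors := Seq.empty[Author]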

Note that the algorithms can be used on other data structures with some minor changes.

Loading Data

To use this system on a collection of ambiguous author records and populate MongoDB, we first need to load (or convert) the data into AuthorMention data structures. This can be done in many ways.

JSON

One possible way is to use JSON serialized data. The AuthorMention records are JSON serializable and deserializable. The schema of the JSON object aligns exactly with the schema defined by the "slots" in the AuthorMention. For instance, the JSON representation might look like:

{ 
    "mentionId" : "P81-1005_LN_Becket_FN_Lee",
    "titleEmbeddingKeywords" : [ "phony", "a", "heuristic", "phonological", "analyzer" ],
    "topics" : [ ],
    "coauthors" : [ ], 
    "title" : "PHONY: A Heuristic Phonological Analyzer*", 
    "self" : { 
        "middleNames" : [ "a" ], 
        "lastName" : "becket", 
        "firstName" : "lee" 
    }, 
    "tokenizedText" : [ ],
    "keywords" : [ ] 
}

The key of each JSON key-value pair is exactly the name specified in the corresponding slot constructor above. The value of each pair has a data type corresponding to the slot's type. Empty fields can be given empty values or omitted entirely. A full specification of the JSON structure is available here.

The class LoadJSONAuthorMentions provides methods for loading JSON serialized mentions from files. The files must store 1 AuthorMention per line.

object LoadJSONAuthorMentions {

  // Given a file with 1 JSON Author Mention per line, load the mentions in a single stream
  def load(file: File, codec: String): Iterator[AuthorMention]
  
  
  // Given a file with 1 JSON Author Mention per line, splits the file into multiple groups
  // and loads the groups in parallel
  def loadMultiple(file: File, codec: String, num: Int): Iterable[Iterator[AuthorMention]]

}

Additionally, to save AuthorMentions in JSON format you may do the following:

WriteAuthorMentionsToJSON.write(mentions: Iterator[AuthorMention], file: File, codec: String, bufferSize: Int = 1000)
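
As a quick usage sketch (the file paths are placeholders), the two helpers above can be combined to round-trip a file of mentions:

import java.io.File

val mentions: Iterator[AuthorMention] = LoadJSONAuthorMentions.load(new File("data/mentions.json"), "UTF-8")
WriteAuthorMentionsToJSON.write(mentions, new File("data/mentions-rewritten.json"), "UTF-8")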

Other Formats

Alternatively, you can write a custom loader for your data. You may either write a loader that loads directly into the AuthorMention format, or define an intermediate data structure that is then converted into the AuthorMention format. The second approach is used for the Rexa and ACL data: there are RexaAuthorMentions and ACLAuthorMentions, which are converted into AuthorMentions by the classes GenerateAuthorMentionsFromRexa and GenerateAuthorMentionsFromACL respectively. You may find it helpful to use these classes as a template.
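
As a rough sketch of the second approach, the code below converts a made-up intermediate record type into AuthorMentions. The CsvAuthorRecord class, its fields, and the tab-separated input format are purely illustrative; only the AuthorMention slot assignments mirror the fields described above.

import java.io.File
import scala.io.Source

// Hypothetical intermediate record parsed from a custom tab-separated format
case class CsvAuthorRecord(id: String, first: String, last: String, title: String)

object LoadCustomAuthorMentions {

  // Load the custom file and convert each record into an AuthorMention
  def load(file: File, codec: String): Iterator[AuthorMention] =
    Source.fromFile(file, codec).getLines().map { line =>
      val Array(id, first, last, title) = line.split("\t")
      toMention(CsvAuthorRecord(id, first, last, title))
    }

  // Conversion from the intermediate structure into the AuthorMention format
  def toMention(r: CsvAuthorRecord): AuthorMention = {
    val m = new AuthorMention()
    m.mentionId := r.id
    val a = new Author()
    a.firstName := r.first.toLowerCase
    a.lastName := r.last.toLowerCase
    m.self := a
    m.title := r.title
    m.titleEmbeddingKeywords := r.title.toLowerCase.split("\\s+").toSeq
    m
  }
}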

Populating the MongoDB

General Method

Once you have your data accessible in the AuthorMention format, you may load it into the MongoDB. This can be done with the following commands:

val yourMentions: Iterator[AuthorMention] = loadMentions()
val db = new AuthorMentionDB(host, port, dbname, collection_name, ensureIndices = false)
PopulateAuthorMentionDB.insert(yourMentions, db, bufferSize = 1000)

You can parallelize the insert if you have access to multiple iterators of AuthorMentions:

val yourMentionStreams: Iterable[Iterator[AuthorMention]] = loadMentionsPar()
val db = new AuthorMentionDB(host, port, dbname, collection_name, ensureIndices = false)
PopulateAuthorMentionDB.insertPar(yourMentionStreams, db, bufferSize = 1000)

After you insert the mentions into the database, make sure to create the index on the mentionId field:

db.addIndices()

Using JSON Loader

The script scripts/db/populate-json-mention.sh loads JSON formatted mentions into a MongoDB.

#!/bin/sh

jarpath="target/author_coref-1.1-SNAPSHOT-jar-with-dependencies.jar"

time java -Xmx40G -cp $jarpath edu.umass.cs.iesl.author_coref.db.PopulateAuthorMentionDBFromJSON \
--config=config/db/PopulateJSONMentions.config

The config file config/db/PopulateJSONMentions.config can be edited to fit your requirements. It has the following contents:

--json-file=data/mentions.json
--codec=UTF-8
--hostname=localhost
--port=25752
--dbname=authormention_db
--collection-name=authormention
--num-threads=18
--buffered-size=1000

The shell script calls the Scala class PopulateAuthorMentionDBFromJSON, which follows the same steps described above:

object PopulateAuthorMentionDBFromJSON {
  def main(args: Array[String]): Unit = {
    val opts = new PopulateAuthorMentionDBFromJSONOpts()
    opts.parse(args)
    val mentions = LoadJSONAuthorMentions.loadMultiple(new File(opts.jsonFile.value),opts.codec.value,opts.numThreads.value)
    val db = new AuthorMentionDB(opts.hostname.value,opts.port.value,opts.dbname.value,opts.collectionName.value,false)
    PopulateAuthorMentionDB.insertPar(mentions,db,opts.bufferSize.value)
    db.addIndices()
  }
}

Generating the Coref Tasks

General Method

To run parallel coreference, you need to specify the canopy/blocking structure of your mentions. To do this, you generate a CorefTask file. This is done for efficiency, so that blocks of mentions can be processed in parallel. Each CorefTask object contains the ids of a set of mentions which are possibly coreferent. The actual mentions corresponding to these ids are loaded only when the CorefTask is scheduled for processing by one of the coreference worker threads.

The general way to generate the CorefTasks is:

val yourMentionStreams: Iterable[Iterator[AuthorMention]] = loadMentionsPar()
// Your choice of canopy function, but make sure it is consistent with the canopy function you use in your coreference algorithm in the next step.
val canopyAssignment = (a: AuthorMention) => Canopies.lastAndFirstNofFirst(a.self.value, 3)
// Generate the tasks (the last argument is an optional set of mention ids to restrict to; an empty set means no restriction)
val tasks = GenerateCorefTasks.fromMultiple(yourMentionStreams, canopyAssignment, Set())
// Write the tasks to a file
GenerateCorefTasks.writeToFile(tasks, new File("data/coref-tasks.tsv"))

The result of this is a file with the following format:

Task Name                  Comma separated mention ids
LAST_liu_FIRST_hon      P05-3005_LN_Liu_FN_Hongfang,S13-1021_LN_Liu_FN_Hongfang,W04-3104_LN_Liu_FN_Hongfang
LAST_liu_FIRST_hui      C10-2082_LN_Liu_FN_Huidan,W06-1624_LN_Liu_FN_Hui,W07-1110_LN_Liu_FN_Hui
LAST_liu_FIRST_yi       C08-1093_LN_Liu_FN_Yi,P07-1047_LN_Liu_FN_Yi,P07-1059_LN_Liu_FN_Yi
LAST_liu_FIRST_yua      C08-1063_LN_Liu_FN_Yuanjie,P13-4012_LN_Liu_FN_Yuanchao,W10-4159_LN_Liu_FN_Yuanchao
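
To see how the task names relate to the canopy function, here is a small sketch with a hypothetical author. With the lastAndFirstNofFirst canopy used in the snippet above, all mentions sharing a canopy string are grouped into one task, producing names like those in the example output.

val author = new Author()
author.firstName := "hongfang"
author.lastName := "liu"

// With N = 3, all mentions sharing this canopy string fall into the same CorefTask,
// e.g. the "LAST_liu_FIRST_hon" task in the example output above
val canopy: String = Canopies.lastAndFirstNofFirst(author, 3)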

Using JSON

To generate the coref tasks from a file of JSON mentions, you can use the script:

./scripts/coref/create_coref_tasks_json.sh

The script calls the Scala class GenerateCorefTasksFromJSON:

#!/bin/sh

jarpath="target/author_coref-1.1-SNAPSHOT-jar-with-dependencies.jar"

time java -Xmx20G -cp $jarpath edu.umass.cs.iesl.author_coref.process.GenerateCorefTasksFromJSON \
--config=config/coref/CreateCorefTasksJSON.config

The script's configuration can be edited in the config file config/coref/CreateCorefTasksJSON.config. Its contents are:

--json-file=data/mentions.json
--output-file=data/coref-tasks.tsv
--num-threads=18
--canopies=lastAndFirst1ofFirst
--name-processor=CaseInsensitiveReEvaluatingNameProcessor

Note how the canopy function and name processor are specified in the command line options.

The class GenerateCorefTasksFromJSON follows the general method described above:

object GenerateCorefTasksFromJSON {
  def main(args: Array[String]): Unit = {
    val opts = new GenerateCorefTasksFromJSONOpts()
    opts.parse(args)
    val mentions = LoadJSONAuthorMentions.loadMultiple(new File(opts.jsonFile.value),opts.codec.value,opts.numThreads.value)
    val canopyAssignment = opts.canopies.value.map(Canopies.fromString).map(fn => (authorMention: AuthorMention) => fn(authorMention.self.value)).last
    val ids = if (opts.idRestrictionsFile.wasInvoked) Source.fromFile(opts.idRestrictionsFile.value,opts.codec.value).getLines().toIterable.toSet[String] else Set[String]()
    val nameProcessor = NameProcessor.fromString(opts.nameProcessor.value)
    val tasks = GenerateCorefTasks.fromMultiple(mentions,canopyAssignment,ids,nameProcessor)
    GenerateCorefTasks.writeToFile(tasks,new File(opts.outputFile.value))
  }
}

Word Embeddings

Preparing Training Data

General Method

The system also requires a word embedding model trained on the data. To build one, we need to generate training data for the model. This is done by gathering the text of our AuthorMentions, performing some normalization, and writing it out to a file. For instance:

import java.io.{BufferedWriter, File, FileOutputStream, OutputStreamWriter}

val mentions = loadMentions()
val texts = GenerateWordEmbeddingTrainingData.getTexts(mentions, true, true) // remove punctuation, lowercase
val writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(new File("data/embedding/training_data.txt")), "UTF-8"))
texts.foreach { t =>
  writer.write(t)
  writer.write("\n")
  writer.flush()
}
writer.close()

Using JSON

You can use the following script to generate the training data from JSON serialized mentions:

./scripts/embedding/generate-embedding-training-data-from-json.sh

The script calls the class GenerateWordEmbeddingTrainingDataFromJSON:

#!/bin/sh

jarpath="target/author_coref-1.1-SNAPSHOT-jar-with-dependencies.jar"

time java -Xmx20G -cp $jarpath edu.umass.cs.iesl.author_coref.embedding.GenerateWordEmbeddingTrainingDataFromJSON \
--config=config/embedding/GenerateWordEmbeddingTrainingDataFromJSON.config

With the following config file:

--json-file=data/mentions.json
--output-file=data/embedding/training_data.txt
--remove-punctuation=true
--lowercase=true

Training Embeddings

To train the embeddings using a Word2Vec-like model, use the following script:

./scripts/embedding/train-embeddings.sh

The script uses a word embedding implementation in factorie:

#!/bin/sh
echo "Training word embeddings"
START_TIME=$(date +%x_%H:%M:%S:%N)
START=$(date +%s)
jarpath="target/author_coref-1.1-SNAPSHOT-jar-with-dependencies.jar"

# Edit these settings if you need to
training_data="data/embedding/training_data.txt"
num_threads=20
output_vocabulary="data/embedding/embedding-vocab.txt"
output_embeddings="data/embedding/embeddings.txt"

java -Xmx40G -cp ${jarpath} cc.factorie.app.nlp.embeddings.WordVec \
--min-count=200 \
--train=$training_data \
--output=$output_embeddings \
--save-vocab=$output_vocabulary \
--encoding="UTF-8" \
--threads=$num_threads
 
END=$(date +%s)
END_TIME=$(date +%x_%H:%M:%S:%N)

RTSECONDS=$(($END - $START))
echo -e "Running Time (seconds) = $RTSECONDS "
echo -e "Started script at $START_TIME"
echo -e "Ended script at $END_TIME"

Running Disambiguation

Now we have generated the required inputs to the disambiguation system. The program that runs the parallel coreference takes the following settings; an example config file with these settings is config/coref/ParallelCoref.config.

# Where to find the coref task file, which specifies the separate coref jobs for execution
--coref-task-file=data/coref-tasks.tsv

# Where to write the coref output
--output-dir=data/coref-output

# The number of threads to use
--num-threads=18

# The file encoding
--codec=UTF-8

# The word embeddings
--embedding-dim=200
--embedding-file=data/embedding/embeddings.txt
--case-sensitive=true

# MongoDB
--hostname=localhost
--port=25752
--dbname=authormention_db
--collection-name=authormention

# Canopy Functions to use
--canopies=fullName,firstAndLast,lastAndFirst3ofFirst,lastAndFirst1ofFirst

# Name processor to use
--name-processor=CaseInsensitiveReEvaluatingNameProcessor

An example program for parallel disambiguation is in RunParallelCoreference:

object RunParallelCoreference {

  def main(args: Array[String]): Unit = {

    // Uses command line options from factorie
    val opts = new RunParallelOpts
    opts.parse(args)

    // Load all of the coref tasks into memory, so they can easily be distributed among the different threads
    val allWork = LoadCorefTasks.load(new File(opts.corefTaskFile.value),opts.codec.value)

    // Create the interface to the MongoDB containing the mentions
    val db = new AuthorMentionDB(opts.hostname.value, opts.port.value, opts.dbname.value, opts.collectionName.value, false)

    // The lookup table containing the embeddings. 
    val keystore = InMemoryKeystore.fromCmdOpts(opts)

    // Create the output directory
    new File(opts.outputDir.value).mkdirs()

    // Canopy Functions
    // Convert the strings into canopy functions (mappings of authors to strings) and then to functions from author mentions to strings
    val canopyFunctions = opts.canopies.value.map(Canopies.fromString).map(fn => (authorMention: AuthorMention) => fn(authorMention.self.value))

    // Name processor
    // The name processor to apply to the mentions
    val nameProcessor = NameProcessor.fromString(opts.nameProcessor.value)

    // Initialize the coreference algorithm
    val parCoref = new ParallelHierarchicalCoref(allWork,db,opts,keystore,canopyFunctions,new File(opts.outputDir.value),nameProcessor)

    // Run the algorithm on all the tasks
    parCoref.runInParallel(opts.numThreads.value)

    // Write the timing info
    val timesPW = new PrintWriter(new File(opts.outputDir.value,"timing.txt"))
    timesPW.println(parCoref.times.map(f => f._1 + "\t" + f._2).mkString("\n"))
    timesPW.close()

    // display the timing info
    parCoref.printTimes()
  }
}

Currently, the recommended model parameters for this setup are in config/coref/DefaultWeightsWithoutTopicsAndKeywords.config, but please note that this may change in the future.

The output of the coreference algorithm is a file all-results.txt in the output directory. This file is a two-column tab separated file with mention ids in the left column and entity ids in the right column:

A00-1001_LN_Amble_FN_Tore       1
A00-1002_LN_Haji_FN_Jan 2
A00-1002_LN_Hric_FN_Jan 3
A00-1003_LN_Flank_FN_Sharon     4
A00-1004_LN_Chen_FN_Jiang       5
A00-1004_LN_Nie_FN_Jian-Yun     6
A00-1005_LN_Bagga_FN_Amit       7
A00-1005_LN_Bowden_FN_G 8
A00-1005_LN_Strzalkowski_FN_Tomek       9
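
As a small post-processing sketch (the output path is a placeholder matching the config above), you might group mention ids by their assigned entity id:

import scala.io.Source

// Read the two-column output and group mention ids by entity id
val entities: Map[String, Seq[String]] =
  Source.fromFile("data/coref-output/all-results.txt", "UTF-8")
    .getLines()
    .map(_.split("\t"))
    .collect { case Array(mentionId, entityId) => entityId -> mentionId }
    .toSeq
    .groupBy(_._1)
    .map { case (entityId, pairs) => entityId -> pairs.map(_._2) }

println(s"Found ${entities.size} entities")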

Given a settings file as described above and a model parameter file, you may use the shell script scripts/coref/run_coref.sh to run the algorithm:

bash scripts/coref/run_coref.sh $setting_file $parameter_file

For example:

bash scripts/coref/run_coref.sh config/coref/ParallelCoref.config config/coref/DefaultWeightsWithoutTopicsAndKeywords.config

Distributing Coreference

The simplest (though perhaps naive) way to distribute the coreference work is to split the coref task file into several chunks and run a parallel coreference job with a different task file on each machine. One machine will run the MongoDB instance; make sure to specify the hostname of this machine instead of localhost in the config files.
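
A minimal sketch of such a split is below (the chunk count and file paths are placeholders); each resulting chunk would then be passed as the --coref-task-file on a different machine:

import java.io.{File, PrintWriter}
import scala.io.Source

// Split the coref task file into roughly equal chunks, one per machine
val numMachines = 4
val taskLines = Source.fromFile("data/coref-tasks.tsv", "UTF-8").getLines().toIndexedSeq
val chunkSize = math.ceil(taskLines.size.toDouble / numMachines).toInt
taskLines.grouped(chunkSize).zipWithIndex.foreach { case (chunk, i) =>
  val pw = new PrintWriter(new File(s"data/coref-tasks-part-$i.tsv"), "UTF-8")
  chunk.foreach(pw.println)
  pw.close()
}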