Manual
- Installation
- Manual Installation
- Configuration
- DSL for Textual Entities
- Text Processing
- Annotations
- Computers
- Support for Other Languages
- Dynamically Extending Treat
You can install Treat as a gem:
gem install treat
Or clone the git repo:
git clone git@github.com:louismullie/treat.git
A language package is a set of dependencies that Treat uses for a given language. This can include gems, JAR files and training models.
You can install a package through a Ruby program or IRB session:
require 'treat'
Treat::Core::Installer.install 'english'
Or install it from rake:
rake treat:install[english]
If no language is specified, English will be assumed.
You can run all the core spec tests:
rake treat:spec
Or choose a specific language to spec as follows ("agnostic" is also valid):
rake treat:spec[language]
The Treat core is pure Ruby and is tested on Ruby 1.9.2 and 1.9.3, as well as JRuby 1.7.1. Earlier versions of JRuby and the Rubinius implementation are not tested and may or may not work.
The workers that use C extensions cannot be run on JRuby; these include ferret, lda-ruby, rb-libsvm, ruby-fann, linkparser, tf-idf-similarity, rbtagger and ruby-stemmer.
You may want to install some of the following libraries and make the binary files available in your path. This is only necessary if you plan on using the feature supplied by the relevant binary.
- The Enju Parser for deep parsing of English text.
- The Google Ocropus OCR Engine to read image files.
- The Poppler text utilities to read PDF files.
- The Antiword MS Word Document Reader to read DOC files.
- The Graphviz Graph Visualizer to visualize and export DOT graphs.
It is easiest to use port or apt-get to install the last four libraries, e.g.:
port install ocropus poppler antiword graphviz
Alternatively, you can use brew to install all but the ocropus library, e.g.:
brew install poppler antiword graphviz
Any necessary JARs and model files will be downloaded when installing a language package (see above). The next section explains how to download them manually.
Download Stanford JAR Files and Models
If you want to use any of the Stanford tools, be sure to download either:
- The JAR files with one tagger and one parser model for English (15 MB)
- The JAR files with all models for English (150 MB)
- The JAR files with all models for all languages (300 MB)
The JAR files should be placed under /bin/stanford/ and the model files under /models/stanford/ (inside the Treat gem folder).
Download Punkt Segmenter Models
You can download trained segmenter models for the Punkt segmenter here. Models are available for Dutch, English, German, French, Greek, Italian, Polish, Portuguese, Russian, Spanish, and Swedish.
The Punkt model files should be placed under /models/punkt/.
Download Reuters Keyword Model
You can download a general topic classification model trained on a corpus of Reuters news tickers here.
Place all the model files under /models/reuters/.
Ruby 1.9 does not parse files with non-ASCII characters unless you specify the encoding.
You can do so by adding a Ruby comment at the very top of the file, with the appropriate encoding, e.g.: # encoding: utf-8
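For example, a script that uses Treat with accented text might start like this (a minimal sketch):

# encoding: utf-8
require 'treat'
include Treat::Core::DSL
s = sentence('Une phrase en français avec des accents.')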
Option | Default | Description
------ | ------- | -----------
Treat.core.verbosity.silence | true | A boolean value indicating whether to silence the output of external libraries (e.g. Stanford tools, Enju, LDA, Ruby-FANN, Schiphol) when they are used.
Treat.core.verbosity.debug | false | A boolean value indicating whether to explain the steps that Treat is performing.
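For instance, to watch what Treat is doing under the hood instead of silencing external tools, you could set:

Treat.core.verbosity.silence = false
Treat.core.verbosity.debug = true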
Option | Default | Description
------ | ------- | -----------
Treat.core.language.detect | false | A boolean value indicating whether Treat should try to detect the language of newly input text.
Treat.core.language.default | 'english' | A string representing the language to default to when detection is off.
Treat.core.language.detect_at | :document | A symbol representing the finest level at which language detection should be performed if language detection is turned on.
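As a sketch, turning detection on and asking for finer-grained detection might look like this (we assume here that :sentence is one of the accepted levels; check the valid values for your version):

Treat.core.language.detect = true
Treat.core.language.detect_at = :sentence # assumption: a finer level than :document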
Option | Default | Description
------ | ------- | -----------
Treat.paths.tmp | '/$GEM_FOLDER/tmp/' | A directory in which to create temporary files.
Treat.paths.files | '/$GEM_FOLDER/files/' | A directory in which to store downloaded files.
Treat.paths.bin | '/$GEM_FOLDER/bin/' | The directory containing executables and JAR files.
Treat.paths.models | '/$GEM_FOLDER/models/' | The directory containing training models.
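For example, you could inspect or override these paths before downloading anything (assuming the path options are writable like the other configuration options):

puts Treat.paths.models         # show where models are currently stored
Treat.paths.tmp = '/tmp/treat/' # assumption: paths can be reassigned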
Currently, Treat only supports MongoDB, although support for more database formats is on the way. You can configure MongoDB as follows:
Treat.databases.mongo.db = 'your_database'
Treat.databases.mongo.host = 'localhost'
Treat.databases.mongo.port = '27017'
Including the DSL
Most users will choose to use Treat through the DSL it provides. The Treat::Core::DSL module must be required first:
require 'treat'
include Treat::Core::DSL
The DSL provides for two things: creation of textual entities (entity builders) and machine learning functionalities (problem, question, feature and data set builders).
Creating Individual Textual Entities
In Treat, empty entities are created simply by naming them:
p = paragraph
s = sentence
w = word
# etc.
You can see the tree structure of any entity in the terminal by calling print_tree on it.
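For example (a minimal sketch):

s = sentence('A short sentence to display.')
s.tokenize
s.print_tree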
If you add a string after the name, that string becomes the text value of the entity:
p = phrase 'hello world'
e = email '[email protected]'
There are two exceptions: documents and collections (collections are covered further below). If you are creating a document, the string will be considered to refer to a readable (local or remote) file, and the entity's text value becomes the content of that file:
d = document 'local_file.txt'
d = document 'http://www.a.com/z.html'
The value of an entity is available:
- As a string, through the value or to_s methods, for nodes that have no children.
- As a string, through the to_s method, for nodes that have children.
- As an array, through to_a, for nodes that have children (calls to_s on each).
- Note that to_str and to_ary are defined as aliases for to_s and to_a, respectively.
The type of an entity is available through the type attribute.
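A short sketch tying these together (the return values shown are indicative and may differ slightly):

p = phrase 'hello', 'world'
p.to_s  # => "hello world"
p.to_a  # => ["hello", "world"]
p.type  # => :phrase
w = word 'hello'
w.value # => "hello"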
Creating Composite Entities
If you add a list of entities after the name, they will be nested under that entity:
p = phrase 'hello', 'world', '!'
You can further add entities by using the append operator:
p << 'test'
You can also mix and match strings and entities:
phra = phrase 'Obama', 'Sarkozy', 'Meeting'
para = paragraph 'Obama and Sarkozy met on January 1st to ' +
'investigate the possibility of a new rescue plan. Nicolas ' +
'Sarkozy is to meet Merkel next Tuesday in Berlin.'
sect = section title(phra), para
Transparent Casting
The Treat DSL also provides transparent casting of strings and numbers to textual entities using the to_entity method. This is used in two contexts.
In the context of a builder, any string or number will automatically be converted to its corresponding entity type. This was the case in the last section, when a string was passed to a builder method or added to an entity using the append operator.
When not in the context of a builder, casting only occurs if a method defined by Treat is called on a string or a number. For example, if a method defined by Treat is called on a Numeric object, that object will automatically be cast to a Treat::Entities::Number object.
s = 'inflection'.stem
# is equivalent to
s = 'inflection'.to_entity.stem
# which comes down to
s = word('inflection').stem
Here, since 'inflection' can be cast to a word, and since stem is a method defined by Treat on Word, casting is performed and the method is called on the resulting entity.
Textual entities can be created by using the special "builder" methods available in the global namespace. These methods are aliases for the corresponding Treat::Entities::X.build(...) methods, so that word('hello') is equivalent to Treat::Entities::Word.build('hello').
Casting Examples
Consider the following examples to further illustrate how casting works.
Operation Performed | Resulting Type
------------------- | --------------
"A syntactical phrase".to_entity | Treat::Entities::Phrase
"A super little sentence.".to_entity | Treat::Entities::Sentence
"Sentence number one. sentence number two.".to_entity | Treat::Entities::Paragraph
"A title\nA short amount of text.".to_entity | Treat::Entities::Section
20.to_entity | Treat::Entities::Number
Creating Entities from Strings
# Create a word
word = word('run')
# Create a phrase
phrase = phrase('am running')
# Create a sentence
sentence = sentence('Welcome to Treat!')
# Create a section
section = section("A small text\nA factitious paragraph.")
Creating Documents From Files or URLs
# If a filename is supplied, the file format and
# the appropriate parser to use will be determined
# based on the file extension:
d = document('text.extension')
# N.B. Supports .txt, .doc, .htm(l), .abw, .odt;
# .pdf can be parsed with poppler-utils installed
# .doc can be parsed with antiword installed
# .jpg, .gif or .png with ocropus installed
# Tip: `port install ocropus poppler antiword`
# If a URL is specified, the file will be downloaded
# and then parsed as a regular file:
d = document('http://www.example.com/XX/XX')
# N.B. By default, files will be downloaded into the
# '/files/' folder of the gem's directory. This can
# be changed by modifying 'Treat.core.paths.files'
# N.B. The format will be assumed to be HTML if the
# web page does not have a file extension. Otherwise,
# it will be determined based on the file extension.
# If a hash is provided, that hash will be used as
# a selector to retrieve a document from the DB.
# This can be done based on the ID of the document:
d = document({id: 103757301323})
# Or through any given feature of a certain document:
d = document({'features.file' => 'somefile.txt'})
# N.B. Currently, MongoDB is the only supported DB.
# You need to configure the database adapter before
# using this particular way of retrieving documents.
Creating Collections from Folders
A collection is a set of documents that are grouped together.
# If an existing folder is passed to the builder,
# that folder is recursively searched for files
# with any of the supported formats, and these
# files are loaded into the collection object:
c = collection('existing_folder')
# If a non-existing folder is passed to the builder,
# that folder is created and the collection is opened:
c = collection('new_folder')
# If a collection has been created with a folder,
# documents added to the collection are copied
# to the newly created folder:
c = collection('some_folder')
d = document 'http://www.someurl.com'
c << d
# If a hash is passed to the builder, that hash
# will be used as a selector to retrieve documents
# from the DB, and a collection containing these
# documents will be loaded. An empty hash loads
# all documents in the DB:
c = collection({})
# You can also use any arbitrary feature that
# you have defined on the documents to create
# a collection on-the-fly:
c = collection({'features.topic' => 'news'})
# N.B. Currently, MongoDB is the only supported DB.
# You need to configure the database adapter before
# using this particular way of creating collections.
Three useful ways to visualize entities (in addition, of course, to #inspect) are the tree, graph and standoff (tag-bracketed) formats.
Format | Example | Description
------ | ------- | -----------
Tree | entity.visualize :tree | Outputs a tree representation of any kind of entity in a terminal-friendly format. A shorthand for this is entity.print_tree.
Graph | entity.visualize :dot, file: 'test.dot' | Outputs a DOT graph representation of the entity in Graphviz format.
Standoff | sentence.visualize :standoff | Outputs a tag-bracketed version of a tagged sentence (only works on sentences).
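For example, to print the standoff version of a tagged sentence (a minimal sketch; the sentence must be tokenized and tagged first):

s = sentence('Obama met Sarkozy in Berlin.')
s.apply(:tokenize, :tag)
puts s.visualize(:standoff)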
Treat currently provides Mongo, XML and YAML serialization. Deserialization methods are also shown in the examples below.
Format | Serialization Example | Deserialization Example | Description
------ | --------------------- | ----------------------- | -----------
MongoDB | doc.serialize :mongo, db: 'testing' | doc = document({id: your_doc_id}) | Serializes the entity and its whole subtree in a single document, in a collection with a name derived from the type of that entity (e.g. "documents"). See the Mongo configuration options for details.
XML | doc.serialize :xml, file: 'test.xml' | doc = document('test.xml') | Serializes the entity to the Treat XML format.
YAML | sentence.serialize :yaml, file: 'test.yml' | sentence = sentence('test.yml') | Serializes the entity to YAML format using Psych.
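A round trip through the YAML serializer, based on the examples above, might look like this:

s = sentence('Welcome to Treat!')
s.serialize :yaml, file: 'test.yml'
# Deserialize by passing the file to the builder:
s2 = sentence('test.yml')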
About Text Processors
The first step once a textual entity has been created is usually to split it into smaller bits and pieces that are more useful to work with. Treat allows you to successively split a text into logical zones, sentences, syntactical phrases and, finally, tokens (which include words, numbers, punctuation, etc.). All text processors work destructively on the receiving object, returning the modified object. They add the results of their operations to the @children hash of the receiving object. Note that each type of processor is only available on specific types of entities, which are indicated in the text below.
Note on Default Workers
When called without any options, a processing task will be performed by the default worker. The default worker is determined based on the format of the supplied file (for chunkers) or the language of the text (for segmenters, tokenizers and parsers). You can specify a non-default worker by passing it as a symbol to the method, e.g. paragraph.segment :punkt, sentence.parse :enju, sentence.tokenize :stanford, etc.
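For example (a minimal sketch):

p = paragraph('A walk in the park. A trip on a boat.')
p.segment(:punkt) # explicitly request the Punkt segmenter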
Chunkers split a document or section into its logical subdivisions, which may include zones (titles, paragraphs, lists) and/or other sections (pages, blocks, lists).
d = document('doc.html').chunk
Segmenters split a zone of text (a title or a paragraph) into sentences.
p = paragraph('A walk in the park. A trip on a boat.').segment
Available Processors
Name | Description | Reference |
srx | Sentence segmentation based on a set of predefined rules defined in SRX (Segmentation Rules eXchange) format and developped by Marcin Milkowski. | Marcin Milkowski, Jaroslaw Lipski, 2009. Using SRX standard for sentence segmentation in LanguageTool, in: Human Language Technologies as a Challenge for Computer Science and Linguistics. |
tactful | Sentence segmentation based on a Naive Bayesian statistical model. Trained on Wall Street Journal news combined with the Brown Corpus, which is intended to be widely representative of written English. | Dan Gillick. 2009. sentence Boundary Detection and the Problem with the U.S. University of California, Berkeley. |
punkt | Sentence segmentation based on a set of log- likelihood-based heuristics to infer abbreviations and common sentence starters from a large text corpus. Easily adaptable but requires a large (unlabeled) indomain corpus for assembling statistics. | Kiss, Tibor and Strunk, Jan. 2006. Unsupervised Multilingual sentence Boundary Detection. Computational Linguistics 32:485-525. |
stanford | Detects sentence boundaries by first tokenizing the text and deciding whether periods are sentence ending or used for other purposes (abreviations, etc.). The obtained tokens are then grouped into sentences. | - |
scalpel | Sentence segmentation based on a set of predefined rules that handle a large number of usage cases of sentence enders. The idea is to remove all cases of .!? being used for other purposes than marking a full stop before naively segmenting the text. | - |
Tokenizers split a group of words (a sentence, phrase or fragment) into tokens.
s = sentence('An uninteresting sentence, yes it is.').tokenize
Available Processors
Name | Description | Reference
---- | ----------- | ---------
ptb | Tokenization based on the tokenizer developed by Robert MacIntyre in 1995 for the Penn Treebank project. This tokenizer follows the conventions used by the Penn Treebank, except that by default it will not change double quotes to directional quotes. | Robert MacIntyre. 1995. Reference implementation for PTB tokenization. University of Pennsylvania.
stanford | Tokenization provided by Stanford's Penn-Treebank-style tokenizer. Most punctuation is split from adjoining words, double quotes (") are changed to doubled single forward and backward quotes (`` and ''), verb contractions and the Anglo-Saxon genitive of nouns are split into their component morphemes, and each morpheme is tagged separately. | -
tactful | Tokenization script lifted from the 'tactful-tokenizer' gem. | -
punkt | Tokenization script from the 'punkt-segmenter' Ruby gem. | -
Parsers parse a group of words (a sentence, phrase or fragment) into its syntactical tree.
s = sentence('The prospect of an Asian arms race is genuinely frightening.').parse
You can chain any number of processors using do. This allows rapid splitting of any textual entity down to the desired granularity. The tree of the receiver will be recursively searched for entities to which each of the supplied processors can be applied.
sect = section "A walk in the park\n"+
'Obama and Sarkozy met this friday to investigate ' +
'the possibility of a new rescue plan. The French ' +
'president Sarkozy is to meet Merkel next Tuesday.'
sect.do(:chunk, :segment, :tokenize, :parse)
Annotations are a type of metadata that can be grafted onto a textual entity to help in further classification tasks. All annotators work destructively on the receiving object, and store their results in the @features hash of that object. Each annotator is available only on specific types of entities (it makes no sense to get the synonyms of a sentence, or the topics of a word). This section is split by entity type, and lists the annotations that are available on each particular type of entity.
Note that in many of the following examples, transparent string-to-entity casting is used. Refer to the relevant section above for more information.
You can set your own arbitrary annotations on any entity by using set, check whether an annotation is defined on an entity by using has?, and retrieve an annotation by using get:
w = word('hello')
w.set :topic, "conversation"
w.has? :topic # => true
w.get :topic # => "conversation"
Language
The "language" annotation is available on all types of entities. Note that Treat.core.language.detect
must be set to true
for language detection to be performed when the language
method is called. Otherwise, Treat.core.language.default
will be returned regardless of the actual content of the entity.
Treat.core.language.detect = true
a = "I want to know God's thoughts; the rest are details. - Albert Einstein"
b = "El mundo de hoy no tiene sentido, así que ¿por qué debería pintar cuadros que lo tuvieran? - Pablo Picasso"
c = "Un bon Allemand ne peut souffrir les Français, mais il boit volontiers les vins de France. - Goethe"
d = "Wir haben die Kunst, damit wir nicht an der Wahrheit zugrunde gehen. - Friedrich Nietzsche"
puts a.language # => :english
puts b.language # => :spanish
puts c.language # => :french
puts d.language # => :german
Part of Speech Tags
'running'.tag # => "VBG"
'running'.category # => "verb"
'inflection'.tag # => "NN"
'inflection'.category # => "noun"
Available Annotators
Name | Description | Reference
---- | ----------- | ---------
lingua | POS tagging using part-of-speech statistics from the Penn Treebank to assign POS tags to English text. The tagger applies a bigram (two-word) Hidden Markov Model to guess the appropriate POS tag for a word. | -
brill | POS tagging using a set of rules developed by Eric Brill. | Eric Brill. 1992. A simple rule-based part of speech tagger. In Proceedings of the Third Conference on Applied Natural Language Processing.
stanford | POS tagging using (i) explicit use of both preceding and following tag contexts via a dependency network representation, (ii) broad use of lexical features, including jointly conditioning on multiple consecutive words, (iii) effective use of priors in conditional loglinear models, and (iv) fine-grained modeling of unknown word features. | Toutanova, Manning, Klein and Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics.
Synonyms, Antonyms, Hypernyms and Hyponyms
'ripe'.synonyms
# => ["mature", "ripe(p)", "good", "right", "advanced"]
'ripe'.antonyms
# => ["green", "unripe", "unripened", "immature"]
'coffee'.hypernyms
# => ["beverage", "drink", [...], "drinkable", "potable"]
'juice'.hyponyms
# => ["lemon_juice", "lime_juice", [...], "digestive_fluid"]
Word Stemming
'running'.stem # => "run"
'inflection'.stem # => "inflect"
Available Annotators
Name | Description | Reference
---- | ----------- | ---------
porter | Stemming using a native Ruby implementation of the Porter stemming algorithm, a rule-based suffix-stripping stemmer which is very widely used and is considered the de facto standard algorithm for English stemming. | Porter, 1980. An algorithm for suffix stripping. Program, vol. 14, no. 3, p. 130-137.
porter_c | Stemming using a wrapper for a C implementation of the Porter stemming algorithm, a rule-based suffix-stripping stemmer which is very widely used and is considered the de facto standard algorithm for English stemming. | Porter, 1980. An algorithm for suffix stripping. Program, vol. 14, no. 3, p. 130-137.
uea | Stemming using the UEA algorithm, a stemmer that operates on a set of rules which are used as steps. There are two groups of rules: the first to clean the tokens, and the second to alter suffixes. | Jenkins, Marie-Claire and Smith, Dan. 2005. Conservative stemming for search and indexing.
Noun and Adjective Declensions
'inflection'.plural # => "inflections"
'inflections'.singular # => "inflection"
Verb Inflections
'running'.infinitive # => "run"
'run'.present_participle # => "running"
'runs'.plural_verb # => "run"
Ordinals & Cardinals
20.ordinal # => "twentieth"
20.cardinal # => "twenty"
Named Entity Tags
The :name_tag annotation allows you to retrieve person, location and time expressions in texts.
p = paragraph "Obama and Sarkozy met on January 1st to investigate the possibility of a new rescue plan. " +
"President Sarkozy is to meet Merkel next Tuesday in Berlin."
p.apply(:chunk, :segment, :tokenize, :name_tag)
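Since annotators store their results in the @features hash, the tags can then be read back with get; a sketch (the exact feature values depend on the tagger):

p.words.each do |w|
  puts "#{w} => #{w.get(:name_tag)}" if w.has?(:name_tag)
end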
Available Annotators
Name | Description | Reference
---- | ----------- | ---------
stanford | A Conditional Random Field sequence model, together with well-engineered features for Named Entity Recognition in English and German. | Heeyoung Lee, Yves Peirsman, Angel Chang, Nathanael Chambers, Mihai Surdeanu, Dan Jurafsky. Stanford's Multi-Pass Sieve Coreference Resolution System at the CoNLL-2011 Shared Task. In Proceedings of the CoNLL-2011 Shared Task, 2011.
Date and Time
The :time annotation allows you to retrieve natural language expressions describing events in time.
s = section "A bad day for leaders\n2011-12-23 - Obama and Sarkozy announced that they will start meeting every Tuesday."
s.apply(:chunk, :segment, :tokenize, :parse, :time)
Available Annotators
Name | Description
---- | -----------
chronic | Time/date extraction using a rule-based, pure Ruby natural language date parser.
ruby | Date extraction using Ruby's standard library DateTime.parse() method.
nickel | Time extraction using a pure Ruby natural language time parser.
TF*IDF
Not considered stable. API may change.
The :tf_idf annotation allows you to get the TF*IDF score of a word within its parent collection.
c = collection('economist')
c.words[0].tf_idf
Keywords
Not considered stable. API may change.
The :keywords annotation allows you to retrieve the keywords of a document, section or zone. It uses a naive TF*IDF approach, i.e. the relevant document/section/zone must be tokenized beforehand.
c = collection('economist')
c.apply(:chunk, :segment, :tokenize, :keywords)
General Topic
The :topic annotation allows you to retrieve the general topic of a document, section or zone. It uses a model trained on a large set of Reuters articles.
s = paragraph 'Michigan, Ohio (Reuters) - Unfortunately, the RadioShack is closing.'
s.apply(:segment, :tokenize, :topics)
Topic Words
The :topic_words annotation allows you to retrieve clusters of topics within documents. It uses Latent Dirichlet Allocation (LDA).
c = collection('economist')
c.apply(:chunk, :segment, :tokenize)
puts c.topic_words(
:lda,
:num_topics => 10,
:words_per_topic => 5,
:iterations => 20
).inspect
About Computers
By contrast with processors and annotators, which modify the receiving object, computers perform operations that leave the receiving entity untouched. [Work in progress.]
Serializers
Serializers allow you to persist entities on disk or in a database.
d = document('index.html')
d.serialize :xml, file: 'test.xml'
Currently, XML, YAML and MongoDB serialization is supported.
Indexers
The index computer builds a searchable index for a collection.
c = collection('folder')
c.index
Searchers
The search computer allows you to retrieve documents inside a collection by searching through a previously built index (see above).
c.search(:q => 'some query').each do |doc|
# Process the document of interest
end
By default, Treat considers any input to be in English. However, by changing the default language or turning language detection on (refer to the Configuration section above), plugins for other languages can be leveraged as well. Note that Treat's multilingual features are currently relatively limited; help is more than welcome!
Current support for other languages is as follows:
- Parsers: English, German (French, Arabic and Chinese on the way).
- Segmenters: most available models work for all languages written in Latin script.
- Taggers: English, French, German (Arabic and Chinese on the way).
Note that the Penn Treebank tag set is used for English, the Stuttgart-Tübingen tag set for German, and the Paris7 tag set for French.
Example: German, English and French Parsing
Treat.core.language.detect = true
# Penn Tree Bank tags.
sent = "This is an English sentence, prove it to me!"
sent.apply(:tokenize, :parse).print_tree
# Stuttgart-Tübingen tags.
sent = "Wegen ihres Jahrestages bereiten wir unseren " +
"Eltern eine Exkursion nach München vor."
sent.apply(:tokenize, :parse).print_tree
# Paris7 tags.
sent = "Une phrase en Français pour entourlouper les Anglais."
sent.apply(:tokenize, :tag).print_tree
Extending Treat with your own processors, annotators or computers is extremely simple. You can dynamically define your own pluggable workers by adding them in the right group. Once this is done, the algorithm will be available on the right types of entities. Due to transparent string-to-entity casting, it will also be available on the "right type" of string.
Stemmer Example
Here is a dummy stemmer that removes the last letter of a word:
Treat::Workers::Inflectors::Stemmers.add(:dummy) do |word, options={}|
word.to_s[0..-2]
end
'dummy'.stem(:dummy) # => "dumm"
Tokenizer Example
Here is a tokenizer that naively splits on space characters:
Treat::Workers::Processors::Tokenizers.add(:dummy) do |sentence, options={}|
sentence.to_s.split(' ').each do |token|
sentence << Treat::Entities::Token.from_string(token)
end
end
s = sentence('A sentence to tokenize.')
s.tokenize(:dummy)