Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens.
This library provides a fast, streaming, customizable, regular expression based tokenizer using a Deterministic Finite Automaton (DFA).
Create a new instance of Latvian.Tokenization.LatvianTokenizer
and call the Tokenize
method with the text to tokenize as a string parameter.
using System;
using Latvian.Tokenization;
using Latvian.Tokenization.Tokens;
class Program
{
static void Main()
{
string text = "Sveika, pasaule! Man iet labi. Šodienas datums ir 2014-01-01";
LatvianTokenizer tokenizer = new LatvianTokenizer();
foreach (Token token in tokenizer.Tokenize(text))
{
Console.WriteLine("Line {0}: Pos {1}: Type: {2} Token: {3}",
token.Line, token.Position, token.GetType(), token.Text);
if (token is DateToken)
{
DateToken dateToken = token as DateToken;
Console.WriteLine(dateToken.DateTime.ToString("dd/MM/yyyy"));
}
}
}
}
You can also tokenize sentences.
using Sentence = IEnumerable<Token>;
IEnumerable<Sentence> sentences = tokenizer.TokenizeSentences(text);
foreach (Sentence sentence in sentences)
{
foreach (Token token in sentence)
{
Console.WriteLine(token);
}
Console.WriteLine("-- End of sentence --");
}
If you already have tokenized text, you can break it into sentences:
Token[] tokens = tokenizer.Tokenize(text).ToArray();
Sentence[] sentences = tokenizer.BreakSentences(tokens).ToArray();
Token.Text
contains the token value.
Token token = new LatvianTokenizer().Tokenize("vārds").First();
Assert.AreEqual("vārds", token.Text);
Assert.AreEqual("vārds", token.ToString());
If you are only interested in token values:
string[] tokens = new LatvianTokenizer().Tokenize("viens divi").Select(t => t.Text).ToArray();
Assert.AreEqual("viens", tokens[0]);
Assert.AreEqual("divi", tokens[1]);
Two tokens are identical if their values match.
var tokens = new LatvianTokenizer().Tokenize("viens viens").Distinct();
Assert.AreEqual(1, tokens.Count());
Token
has starting and ending positions as well as starting and ending line numbers and the positions on the lines. Positions start at 0.
Token[] tokens = new LatvianTokenizer().Tokenize("Vārds.").ToArray();
Assert.AreEqual(0, tokens[0].Position);
Assert.AreEqual(5, tokens[0].PositionEnd);
Assert.AreEqual(0, tokens[0].Line);
Assert.AreEqual(0, tokens[0].LineEnd);
Assert.AreEqual(0, tokens[0].LinePosition);
Assert.AreEqual(5, tokens[0].LinePositionEnd);
Assert.AreEqual(".", tokens[1].Text);
Assert.AreEqual(5, tokens[1].Position);
Assert.AreEqual(6, tokens[1].PositionEnd);
Assert.AreEqual(0, tokens[1].Line);
Assert.AreEqual(0, tokens[1].LineEnd);
Assert.AreEqual(5, tokens[1].LinePosition);
Assert.AreEqual(6, tokens[1].LinePositionEnd);
Token type can be determined by looking at its class type. Casting to a more specific type will reveal some other useful properties. For example, DateToken
has a DateTime
property which returns the parsed date as System.DateTime
. Token types are in the Latvian.Tokenization.Tokens
namespace.
Token token = new LatvianTokenizer().Tokenize("2014-01-01").First();
Assert.IsTrue(token is DateToken);
Assert.AreEqual(2014, ((DateToken)token).DateTime.Year);
Token token = new LatvianTokenizer().Tokenize("123").First();
Assert.IsTrue(token is OrdinalNumberToken);
Assert.AreEqual(123 * 2, ((OrdinalNumberToken)token).Value * 2);
By default, whitespace and control characters are not returned. You can use the IncludeWhitespace
and IncludeControlChars
boolean properties to change this behavior.
Token[] tokens = new LatvianTokenizer().Tokenize("123 456").ToArray();
Assert.AreEqual("123", tokens[0].Text);
Assert.AreEqual("456", tokens[1].Text);
LatvianTokenizer tokenizer = new LatvianTokenizer() { IncludeWhitespace = true };
Token[] tokens = tokenizer.Tokenize("123 456").ToArray();
Assert.AreEqual("123", tokens[0].Text);
Assert.AreEqual(" ", tokens[1].Text);
Assert.AreEqual("456", tokens[2].Text);
There are extension methods for tokenizing a Stream
and TextReader
so that you don't have load the whole text in memory:
using (FileStream stream = new FileStream("file.txt", FileMode.Open, FileAccess.Read))) {
Token[] tokens = new LatvianTokenizer().Tokenize(stream).ToArray();
}
using (StreamReader reader = new StreamReader("file.txt")) {
Token[] tokens = new LatvianTokenizer().Tokenize(reader).ToArray();
}
LatvianTokenizer
is a collection of regular expression patterns which are represented by token type classes. See the list of default token types below. You can remove the ones you don't like and add new ones.
Token[] tokens = new LatvianTokenizer().Tokenize("A.Bērziņš").ToArray();
Assert.AreEqual("A.", tokens[0].Text);
Assert.AreEqual("Bērziņš", tokens[1].Text);
// vs
LatvianTokenizer tokenizer = new LatvianTokenizer(compile: false);
tokenizer.Remove<InitialsToken>();
tokenizer.Compile(); // optional
Token[] tokens = tokenizer.Tokenize("A.Bērziņš").ToArray();
Assert.AreEqual("A", tokens[0].Text);
Assert.AreEqual(".", tokens[1].Text);
Assert.AreEqual("Bērziņš", tokens[2].Text);
If you use your own configuration often, you can create a custom class.
public class MyTokenizer : LatvianTokenizer
{
public MyTokenizer() : base(compile: false)
{
Remove<InitialsToken>();
IncludeWhitespace = true;
Compile();
}
}
MyTokenizer tokenizer = new MyTokenizer();
Patterns use a custom regular expression language (see further below for a short tutorial). The simplest way is to create a new class. It must extend Token
and implement IHasPattern
. It also must have a parameterless constructor.
public class EmotionToken : Token, IHasPattern
{
public string Pattern
{
get { return "[:].[)(DP]"; } // :) :( :D :P
}
public bool IsHappy
{
get { return Text == ":)" || Text == ":D"; }
}
}
LatvianTokenizer tokenizer = new LatvianTokenizer(compile: false);
tokenizer.Add<EmotionToken>();
EmotionToken[] tokens = tokenizer.Tokenize("Šodien esmu :) bet vakar biju :(").OfType<EmotionToken>().ToArray();
Assert.AreEqual(":)", tokens[0].Text);
Assert.AreEqual(":(", tokens[1].Text);
The tokenizer finds the longest match. If there are two patterns that overlap and match the same string, the longest match is returned.
Token[] tokens = new LatvianTokenizer().Tokenize("2014-01-01 2014-01-01T12:00:00").ToArray();
Assert.AreEqual("2014-01-01", tokens[0].Text);
Assert.AreEqual("2014-01-01T12:00:00", tokens[1].Text);
If there are two patterns that match the same string of equal length, the first token type that was added to the tokenizer is used.
LatvianTokenizer tokenizer = new LatvianTokenizer(compile: false);
tokenizer.Clear(); // removes existing TimeToken that matches 00:00:00
tokenizer.Add<TimeSpanToken>(); // matches 00:00:00
tokenizer.Add<ClockToken>(); // matches 00:00:00
Token token = tokenizer.Tokenize("00:00:00").First();
Assert.IsTrue(token is TimeSpanToken);
LatvianTokenizer
implements IList<Type>
so you can rearrange the order. There are strongly typed versions of IList<Type>
methods available and one convenience method Move<T>()
.
// remove and insert at the first position
tokenizer.Remove<ClockToken>();
tokenizer.Insert<ClockToken>(0);
// it's is the same as using the methods from IList<Type>
tokenizer.Remove(typeof(ClockToken));
tokenizer.Insert(0, typeof(ClockToken));
// here's a convenience method for rearranging the order
tokenizer.Move<ClockToken>(0);
// all of the above result in this
token = tokenizer.Tokenize("00:00:00").First();
Assert.IsTrue(token is ClockToken);
All patterns are parsed, then converted into a non-deterministic finite automaton (NFA) that is then converted to a deterministic finite automaton (DFA). This is referred to as compiling. This takes a while but it may not be significant depending on usage.
By default, when a new instance of LatvianTokenizer
is created, the default token types and patterns are added and compiled. If you plan on modifying token types, you can pass false
in the constructor to delay compiling: new LatvianTokenizer(compile: false)
or new LatvianTokenizer(false)
. Adding new token types or removing existing token types will destroy the compiled automaton, so it will need to be compiled again. You can then compile the tokenizer by calling LatvianTokenizer.Compile()
. If this is not done, it will be called automatically when you try to tokenize text for the first time.
It's also possible to save the current configuration to a file and then load it later.
LatvianTokenizer tokenizer = new LatvianTokenizer(compile: false);
tokenizer.Remove<InitialsToken>();
tokenizer.Save("tokenizer.bin"); // compiles automatically
LatvianTokenizer tokenizer2 = new LatvianTokenizer("tokenizer.bin");
LatvianTokenizer tokenizer3 = new LatvianTokenizer("tokenizer.bin", compile: false);
tokenizer3.Compile();
LatvianTokenizer tokenizer4 = new LatvianTokenizer(compile: false);
tokenizer4.Load("tokenizer.bin");
LatvianTokenizer tokenizer5 = new LatvianTokenizer(compile: false);
tokenizer5.Load("tokenizer.bin", compile: false);
tokenizer5.Compile();
Here is the list of default token types along with the regular expression patterns used.
- WordToken:
[a-zA-ZĀ-ž]+
- WhitespaceToken:
[ \t\r\n]+
- ... todo ...
If this list is out of date, see LatvianTokenizer.cs
[x]
matches a single character:[a]
matches onlya
[xX]
matches one of the characters:[aA]
matches onlya
orA
but not both[x-x]
matches characters in the range (inclusive):[0-9]
matches a single digit*
matches zero or more occurences of the expression:[0-9]*
+
matches one or more occurrences of the expression:[0-9]+
.
concatinates expressions:[0-9].[a-z]
matches a digit followed by a letter|
matches one or the other expression:[0-9]|[a-z]
matches a digit or a letter but not both- Evaulated left to right, equal priority:
[0].[1]|[2]
is[0].([1]|[2])
and matches01
or02
()
changes order:([0].[1])|[2]
matches01
or2
All characters must be in square brackets, there are no implicit ranges. There is no implicit concatenation. There is no repetition apart from *
and +
.
Examples:
- Multi ranges:
[a-z0-9!@#]
- No need to escape the dash:
[-02]
or[02-]
but[0\-2]
(all match0
or-
or2
) - Unicode chars:
[\u0000-\uffff]
matches any character - Two or more:
[a].[a]+
or[a].[a].[a]*
Here are the benchmark results:
Method | Time |
---|---|
Compile | 33 ms |
Load | 37 ms |
Compile x100 | 3300 ms |
Load x100 | 3770 ms |
Release build, NUnit test runner, mobile i7, SSD
At this time it looks like there is no difference between loading from a file and compiling everything every time. Loading is slower than compiling probably because it's neccessary to use reflection to load types from the file.
The tokenizer data file could be compressed but the space savings are insifignant for the default configuration (6 KiB vs 24 KiB) so it's not implemented.
Source | Size | Time | Rate | Speed |
---|---|---|---|---|
string | 10 MiB | 921 ms | 10.9 MiB/s | 1 842 425 tokens/s |
file (StreamReader) | 10 MiB | 1222 ms | 8.2 MiB/s | 1 388 168 tokens/s |
Release build, NUnit test runner, mobile i7, SSD
It's possible to further optimize tokenization by inlining CharReader
and PositionCounter
but that would require sacrificing code readability.