It is easier to write things out than to read them in, since more things can go wrong: the read may fail, the text may not be valid UTF-8, a number may be malformed or simply out of range.
Lexical scanners split a stream of characters into tokens. Tokens are returned by repeatedly calling the `get` method of `Scanner` (which will return `Token::End` if no tokens are left), or by iterating over the scanner. They represent numbers, characters, identifiers, or single/double-quoted strings. There is also `Token::Error` to indicate a badly formed token.
This lexical scanner makes some simplifying assumptions, such as that a number may not be directly followed by a letter. No attempt is made in this version to decode C-style escape codes in strings. All whitespace is ignored. It is intended for processing generic structured data rather than code.
For example, the string "hello 'dolly' * 42" will be broken into four tokens:
- an identifier 'hello'
- a quoted string 'dolly'
- a character '*'
- and a number 42
```rust
extern crate scanlex;
use scanlex::{Scanner,Token};

let mut scan = Scanner::new("hello 'dolly' * 42");
assert_eq!(scan.get(),Token::Iden("hello".into()));
assert_eq!(scan.get(),Token::Str("dolly".into()));
assert_eq!(scan.get(),Token::Char('*'));
assert_eq!(scan.get(),Token::Int(42));
assert_eq!(scan.get(),Token::End);
```
To extract the values, use code like this:
```rust
let greeting = scan.get_iden()?;
let person = scan.get_string()?;
let op = scan.get_char()?;
let answer = scan.get_integer()?; // i64
```
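Since each of these methods returns a `Result`, the `?` operator needs an enclosing function with a compatible error type. A minimal sketch, using the hypothetical wrapper name `parse`:

```rust
use scanlex::{Scanner,ScanError};

// hypothetical wrapper: scan "IDEN 'STR' CHAR INT" and return the pieces
fn parse(text: &str) -> Result<(String,String,char,i64),ScanError> {
    let mut scan = Scanner::new(text);
    let greeting = scan.get_iden()?;
    let person = scan.get_string()?;
    let op = scan.get_char()?;
    let answer = scan.get_integer()?;
    Ok((greeting,person,op,answer))
}

assert_eq!(
    parse("hello 'dolly' * 42").unwrap(),
    ("hello".into(), "dolly".into(), '*', 42)
);
```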
`Scanner` implements `Iterator`. If you just want to extract the words from a string, then filtering with `as_iden` will do the trick, since it returns `Option<String>`.
```rust
let s = Scanner::new("bonzo 42 dog (cat)");
let v: Vec<_> = s.filter_map(|t| t.as_iden()).collect();
assert_eq!(v,&["bonzo","dog","cat"]);
```
Using `as_number` instead, you can use this strategy to extract all the numbers out of a document, ignoring all other structure. The `scan.rs` example shows you the tokens that would be generated by parsing the string given on the command-line. This iterator only stops at `Token::End`; you can handle `Token::Error` yourself.
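For instance, a sketch of number extraction, assuming `as_number` converts both integer and float tokens to `f64`:

```rust
// collect every number in the input, ignoring everything else
let scan = Scanner::new("10 bold numbers: 3.14 and 42");
let nums: Vec<f64> = scan.filter_map(|t| t.as_number()).collect();
assert_eq!(nums, &[10.0, 3.14, 42.0]);
```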
Usually it's important not to ignore structure. Say we have input strings that look like "(WORD) = NUMBER":
```rust
scan.skip_chars("(")?;
let word = scan.get_iden()?;
scan.skip_chars(")=")?;
let num = scan.get_number()?;
```
Any of these calls may fail!
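Putting that together, a sketch of a complete parser for this little format, assuming the hypothetical name `parse_pair` and that `get_number` returns `f64`:

```rust
use scanlex::{Scanner,ScanError};

// hypothetical helper: parse "(WORD) = NUMBER" into its parts
fn parse_pair(text: &str) -> Result<(String,f64),ScanError> {
    let mut scan = Scanner::new(text);
    scan.skip_chars("(")?;
    let word = scan.get_iden()?;
    scan.skip_chars(")=")?;
    let num = scan.get_number()?;
    Ok((word,num))
}

assert_eq!(parse_pair("(frodo) = 42").unwrap(), ("frodo".into(), 42.0));
```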
It is a common pattern to create a scanner for each line of text read from a readable source. The `scanline.rs` example shows how to use `ScanLines` to accomplish this.
```rust
use std::fs::File;
use scanlex::ScanLines;

let f = File::open("scanline.rs").expect("cannot open scanline.rs");
let mut iter = ScanLines::new(&f);
while let Some(s) = iter.next() {
    let mut s = s.expect("cannot read line");
    // show the first token of each line
    println!("{:?}",s.get());
}
```
A more serious example (taken from the tests) is parsing JSON:
```rust
use std::collections::HashMap;
use scanlex::{Scanner,Token,ScanError};

type JsonArray = Vec<Box<Value>>;
type JsonObject = HashMap<String,Box<Value>>;

#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Str(String),
    Num(f64),
    Bool(bool),
    Arr(JsonArray),
    Obj(JsonObject),
    Null
}

fn scan_json(scan: &mut Scanner) -> Result<Value,ScanError> {
    use Value::*;
    match scan.get() {
        Token::Str(s) => Ok(Str(s)),
        Token::Num(x) => Ok(Num(x)),
        Token::Int(n) => Ok(Num(n as f64)),
        Token::End => Err(scan.scan_error("unexpected end of input",None)),
        Token::Error(e) => Err(e),
        Token::Iden(s) =>
            if s == "null" {Ok(Null)}
            else if s == "true" {Ok(Bool(true))}
            else if s == "false" {Ok(Bool(false))}
            else {Err(scan.scan_error(&format!("unknown identifier '{}'",s),None))},
        Token::Char(c) =>
            if c == '[' {
                let mut ja = Vec::new();
                let mut ch = c;
                while ch != ']' {
                    let o = scan_json(scan)?;
                    ch = scan.get_ch_matching(&[',',']'])?;
                    ja.push(Box::new(o));
                }
                Ok(Arr(ja))
            } else if c == '{' {
                let mut jo = HashMap::new();
                let mut ch = c;
                while ch != '}' {
                    let key = scan.get_string()?;
                    scan.get_ch_matching(&[':'])?;
                    let o = scan_json(scan)?;
                    ch = scan.get_ch_matching(&[',','}'])?;
                    jo.insert(key,Box::new(o));
                }
                Ok(Obj(jo))
            } else {
                Err(scan.scan_error(&format!("bad char '{}'",c),None))
            }
    }
}
```
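Trying it out on a small document (recall that strings may be single- or double-quoted, so either style works here):

```rust
let mut scan = Scanner::new(r#"{"name":"bonzo", "living":true, "tags":["dog","cat"]}"#);
let value = scan_json(&mut scan).unwrap();
println!("{:?}", value);
```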
(This is of course an Illustrative Example. JSON is a solved problem.)
With `no_float` you get a barebones parser that does not recognize floats, just integers, strings, chars and identifiers. This is useful if the existing rules are too strict: e.g. "2d" is fine in `no_float` mode, but an error in the default mode. chrono-english uses this mode to parse date expressions.
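A sketch, assuming `no_float` is a builder-style method on `Scanner` and that "2d" then scans as an integer followed by an identifier:

```rust
// assumption: `no_float` puts the scanner into integer-only mode
let mut scan = Scanner::new("2d").no_float();
assert_eq!(scan.get(), Token::Int(2));
assert_eq!(scan.get(), Token::Iden("d".into()));
```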
With `line_comment` you provide a character; after this character, the rest of the current line will be ignored.
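A sketch, assuming `line_comment` is likewise a builder-style method taking the comment character:

```rust
// assumption: everything after '#' up to the end of the line is skipped
let mut scan = Scanner::new("42 # the answer").line_comment('#');
assert_eq!(scan.get(), Token::Int(42));
assert_eq!(scan.get(), Token::End);
```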