-
Hi. I'm new to Chevrotain, and I must say it's awesome. However, while trying to build with it, I've encountered a problem that I just can't solve, even after trying to trace the issue inside the Lexer itself.
import { Lexer, createToken } from 'chevrotain';
// ---------------------------------------------------------------------------
// Token definitions.
// `label` is only used for error messages / diagrams; `name` must be unique.
// ---------------------------------------------------------------------------

// Whitespace is matched but discarded (SKIPPED group), so it never appears
// in the token stream.
export const Space = createToken({
  name: 'Whitespace',
  pattern: /\s+/,
  group: Lexer.SKIPPED,
});

// The IF keyword together with its opening parenthesis, e.g. "IF(".
export const IfFunction = createToken({
  name: 'If Function',
  pattern: /IF\(/,
  label: 'IF',
});

// Punctuation.
export const LParen = createToken({
  name: 'Left Parenthesis',
  pattern: /\(/,
  label: '(',
});
export const RParen = createToken({
  name: 'Right Parenthesis',
  pattern: /\)/,
  label: ')',
});
export const Comma = createToken({
  name: 'Comma',
  pattern: /,/,
  label: ',',
});

// Spreadsheet-style cell reference, e.g. "A1", "$AB$12".
export const LocalVar = createToken({
  name: 'Local Var',
  pattern: /\$?[A-Za-z]{1,3}\$?[0-9]{1,3}/,
});

// Cyrillic identifier (underscores allowed), e.g. "ВАР".
export const GlobalVar = createToken({
  name: 'Global Var',
  pattern: /[А-Яа-я_]+/,
});

// Literals.
export const Bool = createToken({
  name: 'Boolean',
  pattern: /TRUE|FALSE/,
  label: '<bool>',
});
export const Float = createToken({
  name: 'Float',
  pattern: /[0-9]+\.?[0-9]*/,
  label: '<f64>',
});

// Operator categories. `Lexer.NA` means these are never matched directly by
// the lexer; they exist so the parser can consume a whole category at once.
export const MulOp = createToken({
  name: 'Multiplicative Operator',
  pattern: Lexer.NA,
});
export const MulSign = createToken({
  name: 'Mul Sign',
  pattern: /\*/,
  label: '*',
  categories: MulOp,
});
export const DivSign = createToken({
  name: 'Div Sign',
  pattern: /\//,
  label: '/',
  categories: MulOp,
});

export const AddOp = createToken({
  name: 'Additive Operator',
  pattern: Lexer.NA,
});
export const AddSign = createToken({
  name: 'Add Sign',
  pattern: /\+/,
  label: '+',
  categories: AddOp,
});
export const SubSign = createToken({
  name: 'Sub Sign',
  pattern: /-/,
  label: '-',
  categories: AddOp,
});

export const CompOp = createToken({
  name: 'Comparison Operator',
  pattern: Lexer.NA,
});
export const EqSign = createToken({
  name: 'Eq Sign',
  pattern: /=/,
  label: '=',
  categories: CompOp,
});
export const GteSign = createToken({
  name: 'GTE Sign',
  pattern: />=/,
  label: '>=',
  categories: CompOp,
});
export const LteSign = createToken({
  name: 'LTE Sign',
  pattern: /<=/,
  label: '<=',
  categories: CompOp,
});
export const GtSign = createToken({
  name: 'GT Sign',
  pattern: />/,
  label: '>',
  categories: CompOp,
});
export const LtSign = createToken({
  name: 'LT Sign',
  pattern: /</,
  label: '<',
  categories: CompOp,
});
export const allTokens = [
Space,
IfFunction,
LParen,
RParen,
Comma,
LocalVar,
GlobalVar,
Bool,
Float,
MulOp,
AddOp,
CompOp,
];
// Single shared Lexer instance; token order in `allTokens` defines
// matching priority.
// - positionTracking 'onlyOffset': tokens carry only offsets (no line/col).
// - ensureOptimizations: fail loudly if a pattern defeats the lexer's
//   first-character optimization.
// - traceInitPerf: log lexer-initialization timing.
const lexer = new Lexer(allTokens, {
  positionTracking: 'onlyOffset',
  ensureOptimizations: true,
  traceInitPerf: true,
});
export const lex = (input: string) => lexer.tokenize(input);
The above lexer fails to match some characters when I run `const result = lex('ВАР-IF(ВАР=0,1,0');`:
// result.errors[0].message === 'unexpected character: ->-<- at offset: 8, skipped 1 characters.'
// result.errors[1].message === 'unexpected character: ->=<- at offset: 20, skipped 1 characters.'
Beta Was this translation helpful? Give feedback.
Replies: 2 comments 2 replies
-
Hey @yohgen, The lexer errors are expected, since |
Beta Was this translation helpful? Give feedback.
-
I wonder if we could / should expand the token categories automatically |
Beta Was this translation helpful? Give feedback.
Hey @yohgen,
The lexer errors are expected, since the `-` and `=` characters cannot be lexed with the given tokens. Even if you assign your token a category, you still need to explicitly add the token type to the lexer. The tokens can later be consumed in the parser via the category (for example: `CONSUME(AddOp)`), but the lexer does not care about categories at all.