tokenx

GPT token count and context size utilities when approximations are good enough. For advanced use cases, please use a full tokenizer like gpt-tokenizer. This library is intended to be used for quick estimations and to avoid the overhead of a full tokenizer, e.g. when you want to limit your bundle size.

Benchmarks

The following table shows the accuracy of the token count approximation for different input texts:

| Description | Actual GPT Token Count | Estimated Token Count | Token Count Deviation |
| --- | --- | --- | --- |
| Short English text | 10 | 11 | 10.00% |
| German text with umlauts | 56 | 49 | 12.50% |
| Metamorphosis by Franz Kafka (English) | 31891 | 33928 | 6.39% |
| Die Verwandlung by Franz Kafka (German) | 40620 | 34908 | 14.06% |
| 道德經 by Laozi (Chinese) | 14386 | 11919 | 17.15% |
| TypeScript ES5 Type Declarations (~ 4000 loc) | 47890 | 50464 | 5.37% |
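
The deviation column appears to be the absolute difference between the estimated and actual counts, relative to the actual count. The sketch below reproduces two rows of the table under that assumption; `tokenCountDeviation` is a hypothetical helper for illustration and not part of tokenx.

```typescript
// Hypothetical helper (not part of tokenx): relative deviation of an
// estimate from the actual token count, in percent.
function tokenCountDeviation(actual: number, estimated: number): number {
  if (actual <= 0) throw new RangeError('actual must be positive')
  return (Math.abs(estimated - actual) / actual) * 100
}

// Values taken from the benchmark table
console.log(tokenCountDeviation(56, 49).toFixed(2)) // 12.50
console.log(tokenCountDeviation(31891, 33928).toFixed(2)) // 6.39
```

Because the absolute difference is used, the percentage does not show direction: in the table, the estimate is too high for the English and TypeScript inputs and too low for the German and Chinese ones.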

Features

  • ๐ŸŒ Estimate token count without a full tokenizer
  • ๐Ÿ“ Supports multiple model context sizes
  • ๐Ÿ—ฃ๏ธ Supports accented characters, like German umlauts or French accents
  • ๐Ÿชฝ Zero dependencies

Installation

Run the following command to add tokenx to your project.

# npm
npm install tokenx

# pnpm
pnpm add tokenx

# yarn
yarn add tokenx

Usage

import {
  approximateMaxTokenSize,
  approximateTokenSize,
  isWithinTokenLimit
} from 'tokenx'

const prompt = 'Your prompt goes here.'
const inputText = 'Your text goes here.'

// Estimate the number of tokens in the input text
const estimatedTokens = approximateTokenSize(inputText)
console.log(`Estimated token count: ${estimatedTokens}`)

// Calculate the maximum number of tokens allowed for a given model
const modelName = 'gpt-3.5-turbo'
const maxResponseTokens = 1000
const availableTokens = approximateMaxTokenSize({
  prompt,
  modelName,
  maxTokensInResponse: maxResponseTokens
})
console.log(`Available tokens for model ${modelName}: ${availableTokens}`)

// Check if the input text is within a specific token limit
const tokenLimit = 1024
const withinLimit = isWithinTokenLimit(inputText, tokenLimit)
console.log(`Is within token limit: ${withinLimit}`)

API

approximateTokenSize

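Estimates the number of tokens in a given input string based on common English patterns and tokenization heuristics. It also works for other languages such as German, although the benchmarks show larger deviations there (roughly 12 to 14%, compared with 5 to 10% for the English and TypeScript inputs).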

Usage:

const estimatedTokens = approximateTokenSize('Hello, world!')

Type Declaration:

function approximateTokenSize(input: string): number

approximateMaxTokenSize

Calculates the maximum number of tokens that can be included in a response given the prompt length and model's maximum context size.

Usage:

const maxTokens = approximateMaxTokenSize({
  prompt: 'Sample prompt',
  modelName: 'text-davinci-003',
  maxTokensInResponse: 500
})

Type Declaration:

function approximateMaxTokenSize({ prompt, modelName, maxTokensInResponse }: {
  prompt: string
  modelName: ModelName
  /** The maximum number of tokens to generate in the reply. 1000 tokens are roughly 750 English words. */
  maxTokensInResponse?: number
}): number
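
To make the arithmetic concrete, here is a minimal sketch of one plausible reading of this function: the model's context size minus the estimated prompt tokens minus the tokens reserved for the response, clamped at zero. This is an assumption for illustration, not tokenx's confirmed implementation. `remainingTokenBudget` is a hypothetical name, and the context size of 4096 is an assumed example value.

```typescript
// Hypothetical sketch (not tokenx's implementation): tokens left in the
// context window after the prompt and the reserved response are subtracted.
function remainingTokenBudget(options: {
  promptTokens: number
  contextSize: number
  maxTokensInResponse?: number
}): number {
  const { promptTokens, contextSize, maxTokensInResponse = 0 } = options
  // Clamp at zero so an oversized prompt never yields a negative budget
  return Math.max(0, contextSize - promptTokens - maxTokensInResponse)
}

// Assumed example values: a 4096-token context window and a 120-token prompt
console.log(remainingTokenBudget({
  promptTokens: 120,
  contextSize: 4096,
  maxTokensInResponse: 1000
})) // 2976
```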

isWithinTokenLimit

Checks if the estimated token count of the input is within a specified token limit.

Usage:

const withinLimit = isWithinTokenLimit('Check this text against a limit', 100)

Type Declaration:

function isWithinTokenLimit(input: string, tokenLimit: number): boolean
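
A limit check like this is often used to split long input into pieces that each fit a budget. The sketch below shows one way to do that with a greedy packer that takes the estimator as a parameter. `chunkByTokenLimit` and `wordCount` are hypothetical names, and the word-count estimator is a stand-in so the example runs without any dependency.

```typescript
type TokenEstimator = (input: string) => number

// Hypothetical helper: greedily packs sentences into chunks whose
// estimated token count stays within `tokenLimit`.
function chunkByTokenLimit(
  sentences: string[],
  tokenLimit: number,
  estimate: TokenEstimator
): string[] {
  const chunks: string[] = []
  let current = ''

  for (const sentence of sentences) {
    const candidate = current ? `${current} ${sentence}` : sentence
    if (estimate(candidate) <= tokenLimit) {
      current = candidate
    } else {
      if (current) chunks.push(current)
      // A single sentence over the limit still becomes its own chunk
      current = sentence
    }
  }

  if (current) chunks.push(current)
  return chunks
}

// Stand-in estimator (assumption): one token per whitespace-separated word
const wordCount: TokenEstimator = input => input.split(/\s+/).filter(Boolean).length

const chunks = chunkByTokenLimit(
  ['One two three.', 'Four five.', 'Six seven eight nine.'],
  5,
  wordCount
)
console.log(chunks) // ['One two three. Four five.', 'Six seven eight nine.']
```

In a project that has tokenx installed, `approximateTokenSize` has the same `(input: string) => number` shape and could be passed as the `estimate` argument in place of `wordCount`.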

License

MIT License © 2023-PRESENT Johann Schopplich