Tokenization Made Easy for beginners (Using tiktoken in JavaScript)


1. What Is Tokenization?

Tokenization is the process of breaking down text into smaller parts called tokens. These tokens can be words, parts of words, or punctuation, which lets a computer process text meaningfully.

Example:

"Hello, Mom!"
→ ["Hello", ",", "Mom", "!"]
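A minimal sketch of the split shown above, assuming a simple regular expression is good enough for illustration. The helper name naiveTokenize is hypothetical, and real tokenizers such as tiktoken split text into subword pieces learned from data instead of splitting on word boundaries.

```javascript
// Naive tokenizer: runs of word characters, or single punctuation marks.
// Illustrative only; this is not how tiktoken splits text.
function naiveTokenize(text) {
  // \w+ matches a whole word; [^\w\s] matches one punctuation character
  return text.match(/\w+|[^\w\s]/g) ?? [];
}

console.log(naiveTokenize("Hello, Mom!"));
// → [ 'Hello', ',', 'Mom', '!' ]
```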

2. Why Do We Tokenize Text?

Computers don't understand raw sentences like humans do. Tokenization helps by:

  • Splitting text into manageable bits

  • Letting AI models process text piece by piece

  • Helping tools like ChatGPT understand and generate responses


3. Introducing tiktoken

tiktoken is OpenAI's tokenizer library, and it is available to JavaScript projects through npm.

  • It uses a method called Byte Pair Encoding (BPE) to break text into tokens efficiently.

  • There are two npm packages:

    • tiktoken: WebAssembly (WASM) bindings to the original library, with full performance.

    • js-tiktoken: A pure JavaScript port for environments where WASM is unavailable or unwanted.
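To make the idea behind Byte Pair Encoding concrete, here is a toy sketch. It assumes character-level symbols and learns merges from the input itself, whereas tiktoken works on bytes and applies a pre-trained merge table. The function names (mostFrequentPair, mergePair, toyBpe) are hypothetical and not part of either package.

```javascript
// Toy BPE: start from single characters and repeatedly merge the most
// frequent adjacent pair. A sketch of the idea only, not tiktoken's algorithm.

// Assumes the text never contains the "\u0000" character used as a key separator.
function mostFrequentPair(symbols) {
  const counts = new Map();
  for (let i = 0; i < symbols.length - 1; i++) {
    const key = symbols[i] + "\u0000" + symbols[i + 1];
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  let best = null;
  let bestCount = 0;
  for (const [key, count] of counts) {
    if (count > bestCount) {
      best = key.split("\u0000");
      bestCount = count;
    }
  }
  return bestCount > 1 ? best : null; // stop when no pair repeats
}

function mergePair(symbols, [left, right]) {
  const merged = [];
  for (let i = 0; i < symbols.length; i++) {
    if (i < symbols.length - 1 && symbols[i] === left && symbols[i + 1] === right) {
      merged.push(left + right);
      i++; // skip the right half of the pair we just merged
    } else {
      merged.push(symbols[i]);
    }
  }
  return merged;
}

function toyBpe(text, maxMerges) {
  let symbols = Array.from(text);
  for (let step = 0; step < maxMerges; step++) {
    const pair = mostFrequentPair(symbols);
    if (pair === null) break;
    symbols = mergePair(symbols, pair);
  }
  return symbols;
}

console.log(JSON.stringify(toyBpe("low lower lowest", 2)));
// → ["low"," ","low","e","r"," ","low","e","s","t"]
```

After two merges ("l" + "o", then "lo" + "w") the frequent piece "low" has become a single symbol, which is the effect BPE relies on to keep common text short.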


4. Getting Started with Tiktoken

a) Install the package:

npm install tiktoken

Or, for the pure JS version:

npm install js-tiktoken

b) Basic Example: Encode & Decode a Sentence

import { get_encoding, encoding_for_model } from "tiktoken";

const enc = get_encoding("gpt2");
// Convert text to tokens, then back to text
const tokens = enc.encode("Hello world");
const decoded = new TextDecoder().decode(enc.decode(tokens));
console.log(tokens, decoded); 
// tokens: [ ... ], decoded: "Hello world"
enc.free();

You can also target a specific OpenAI model:

const enc2 = encoding_for_model("text-davinci-003");
enc2.free();



c) Using the Lite (Pure JavaScript) Version for Lightweight Use:

import { Tiktoken } from "js-tiktoken/lite";
import o200k_base from "js-tiktoken/ranks/o200k_base";

const enc = new Tiktoken(o200k_base);
const text = "Hello world from Mom!";
const tokens = enc.encode(text);
const decoded = enc.decode(tokens);
console.log("Tokens:", tokens);
console.log("Decoded text:", decoded);
// No enc.free() here: js-tiktoken is pure JavaScript, so there is no WASM memory to release.



5. Encoding and Decoding in Action

Tokenization is more than splitting text. It is about turning human-readable text into numbers and back again:

  1. Text → Tokens
    The tool outputs an array of numbers (token IDs).

  2. Tokens → Original Text
    The tool can reverse the process back into readable text.

This process is reversible and lossless: decoding the token IDs produced by encoding returns the original text exactly.
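The same text → numbers → text round trip can be seen without any tokenizer at all. The sketch below uses UTF-8 bytes, the raw material that byte-level BPE merges into larger tokens. Note that the numbers printed here are byte values, not tiktoken token IDs.

```javascript
// Text → numbers: encode the string as UTF-8 bytes.
const text = "Hello, Mom! 👋";
const bytes = new TextEncoder().encode(text);

// Numbers → text: decoding the same bytes restores the original string.
const restored = new TextDecoder().decode(bytes);

console.log(Array.from(bytes));
console.log(restored === text); // → true
```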


6. Why This Matters for Freshers

  • Interactive Learning: You see how AI tools like ChatGPT break and rebuild text.

  • Performance Awareness: Knowing token count helps avoid hitting model limits.

  • Costs and Limits: APIs charge per token; managing tokens controls cost.

  • Deep Understanding: Tokenization is the foundation of how models read and write.
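The limit and cost points above come down to simple arithmetic on a token count. Here is a minimal sketch, assuming a hypothetical context limit and a hypothetical price; neither number comes from a real provider, and the function names are illustrative.

```javascript
// Hypothetical numbers for illustration; check your provider's real
// context limit and pricing.
const CONTEXT_LIMIT = 8192;          // assumed max tokens per request
const PRICE_PER_MILLION_INPUT = 0.5; // assumed USD per 1,000,000 input tokens

// tokenCount would normally come from enc.encode(prompt).length
function fitsInContext(tokenCount, reservedForReply = 1024) {
  return tokenCount + reservedForReply <= CONTEXT_LIMIT;
}

function estimateInputCost(tokenCount) {
  return (tokenCount / 1_000_000) * PRICE_PER_MILLION_INPUT;
}

console.log(fitsInContext(7000));       // → true  (7000 + 1024 = 8024)
console.log(fitsInContext(7500));       // → false (7500 + 1024 = 8524)
console.log(estimateInputCost(200000)); // → 0.1
```

Reserving room for the reply matters because the prompt and the generated response typically share the same token budget.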


Conclusion

Tokenization might sound technical, but with tools like tiktoken, it's simple:

Step      What Happens
Install   Add tiktoken to your project
Encode    Text → token IDs
Decode    Token IDs → text
Apply     Used in AI APIs for prediction

Understanding this step gives you a solid foundation for working with AI tools and models.