Subword Splitting
Token problems: 2 of 20
Token problems: the workbook
What happens when a word is not in the vocabulary? Tokenizers split it into subword pieces so a fixed vocabulary can cover any text: common words stay whole, while rare or long ones break into familiar fragments. For agents handling technical jargon or code, this affects the token budget: a single long term can cost several tokens, so a jargon-heavy input often uses more tokens than a plain-English instruction with the same number of characters.
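The splitting idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the vocabulary `TOY_VOCAB` and the helper `split_word` are hypothetical, and the greedy longest-match rule used here is a simplification. Real tokenizers learn their vocabularies from data and may split words differently.

```python
def split_word(word, vocab):
    """Greedy longest-match split of one word into pieces found in vocab."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


# Hypothetical toy vocabulary, chosen only for this illustration.
TOY_VOCAB = {"run", "now", "pre", "process", "ing"}

print(split_word("run", TOY_VOCAB))            # ['run']
print(split_word("preprocessing", TOY_VOCAB))  # ['pre', 'process', 'ing']
```

The common word stays whole because it is in the vocabulary, while the long word breaks into three familiar fragments.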
Problem
A tokenizer splits "preprocessing" into three pieces: "pre", "process", and "ing". Assume every other word stays whole as one token. The agent receives the pipeline command "run preprocessing now". How many tokens does it count?
Write each token in a box
Count the filled boxes
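The two steps above can also be checked with a short script. This is a sketch, not a real tokenizer: the `SPLITS` table and the `tokenize` helper are hypothetical names, and the script assumes that any word missing from the table counts as exactly one token.

```python
# Split table taken from the problem statement.
# Assumption: words not listed here are one token each.
SPLITS = {"preprocessing": ["pre", "process", "ing"]}


def tokenize(text, splits):
    """Return the list of tokens for a whitespace-separated command."""
    tokens = []
    for word in text.split():
        # Use the given pieces if the word is split, else keep it whole.
        tokens.extend(splits.get(word, [word]))
    return tokens


tokens = tokenize("run preprocessing now", SPLITS)
print(tokens)       # ['run', 'pre', 'process', 'ing', 'now']
print(len(tokens))  # 5
```

Each list entry corresponds to one filled box. Replacing the `SPLITS` table and the command lets you check your own answers to the practice problems.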
Practice 1
A tokenizer splits "embeddings" into "embed" and "dings", and "retrieval" into "retriev" and "al". Assume every other word stays whole as one token. The agent receives the query "run embeddings retrieval". How many tokens does it count?
Write each token in a box
Count the filled boxes
Practice 2
A tokenizer splits "debugging" into "debug" and "ging", and "codebase" into "code" and "base". Assume every other word stays whole as one token. The agent receives the task "start debugging codebase issues". How many tokens does it count?







