Subword Splitting
Token problems: 2 of 20
What happens when a word is not in the vocabulary? Tokenizers split it into subword pieces, so a fixed vocabulary can cover any text: common words stay whole, while rare or long words break into familiar fragments. For agents handling technical jargon or code, a single long term can cost several tokens, so it can use more tokens than a plain-English instruction with the same number of characters.
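One way to picture the splitting is a greedy longest-match pass over a word. The sketch below is only an illustration: the function name `split_word` and the tiny `TOY_VOCAB` set are assumptions made up for this example, and real tokenizers learn their vocabularies from data and may apply different rules.

```python
def split_word(word, vocab):
    """Greedy longest-match-first split of one word into vocabulary pieces.

    Hypothetical helper for illustration; not any specific library's algorithm.
    """
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # Nothing in the vocabulary matches: fall back to one character.
            pieces.append(word[start])
            start += 1
    return pieces


# Toy vocabulary (an assumption for this example only).
TOY_VOCAB = {"run", "now", "pre", "process", "ing"}

print(split_word("run", TOY_VOCAB))            # ['run']
print(split_word("preprocessing", TOY_VOCAB))  # ['pre', 'process', 'ing']
```

The common word `run` is in the vocabulary and stays whole, while the longer `preprocessing` is not and breaks into three familiar fragments.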
Problem
A tokenizer splits `preprocessing` into three pieces: `pre`, `process`, and `ing`. The agent receives the pipeline command `run preprocessing now`. Assuming each of the other words is a single token, how many tokens does the agent count?
Step 1: Write each token in a box
Step 2: Count the filled boxes
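The two steps can also be sketched in code. This is a minimal sketch under stated assumptions: the `SPLITS` table and the `tokenize` helper are hypothetical names, every word not listed in the table is treated as a single token, and spaces are not counted as tokens.

```python
# Split table taken from the problem statement.
SPLITS = {"preprocessing": ["pre", "process", "ing"]}


def tokenize(text, splits):
    """Step 1: write each token 'in a box' (here, a list entry)."""
    tokens = []
    for word in text.split():
        # Assumption: words without a listed split are one token each.
        tokens.extend(splits.get(word, [word]))
    return tokens


tokens = tokenize("run preprocessing now", SPLITS)
print(tokens)       # ['run', 'pre', 'process', 'ing', 'now']
print(len(tokens))  # Step 2: count the filled boxes -> 5
```

The same helper can check the practice problems by swapping in their split tables and input text.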
Practice 1
A tokenizer splits `embeddings` into `embed` and `dings`, and `retrieval` into `retriev` and `al`. The agent receives the query `run embeddings retrieval`. Assuming the remaining word is a single token, how many tokens does the agent count?
Step 1: Write each token in a box
Step 2: Count the filled boxes
Practice 2
A tokenizer splits `debugging` into `debug` and `ging`, and `codebase` into `code` and `base`. The agent receives the task `start debugging codebase issues`. Assuming each of the other words is a single token, how many tokens does the agent count?
Step 1: Write each token in a box
Step 2: Count the filled boxes