1"""turning a maze into text
3- `MazeTokenizerModular` is the new recommended way to do this as of 1.0.0
4- legacy `TokenizationMode` enum and `MazeTokenizer` class for supporting existing code
5- a variety of helper classes and functions
7There are many algorithms by which one might tokenize a 2D maze into a 1D format usable by autoregressive text models. Training multiple models on the encodings output from each of these algorithms may produce very different internal representations, learned solution algorithms, and levels of performance. To explore how different maze tokenization algorithms affect these models, the `MazeTokenizerModular` class contains a rich set of options to customize how mazes are stringified. This class contains 19 discrete parameters, resulting in 5.9 million unique tokenizers. But wait, there's more! There are 6 additional parameters available in the library which are untested but further expand the the number of tokenizers by a factor of $44/3$ to 86 million.
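As a minimal usage sketch (the dataset-construction names `MazeDataset`, `MazeDatasetConfig`, and `LatticeMazeGenerators` come from the broader `maze_dataset` package, not this module, and are assumptions based on the package README):

```python
from maze_dataset import MazeDataset, MazeDatasetConfig
from maze_dataset.generation import LatticeMazeGenerators
from maze_dataset.tokenization import MazeTokenizerModular

# generate a tiny dataset of solved mazes
dataset: MazeDataset = MazeDataset.from_config(
	MazeDatasetConfig(
		name="demo",
		grid_n=4,
		n_mazes=1,
		maze_ctor=LatticeMazeGenerators.gen_dfs,
	),
)

# stringify one solved maze with the default modular tokenizer
tokenizer: MazeTokenizerModular = MazeTokenizerModular()
tokens: list[str] = dataset[0].as_tokens(tokenizer)
print(" ".join(tokens))
```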
All output sequences consist of four token regions representing different features of the maze. These regions are distinguished by color in the figure below.

- <span style="background-color:rgb(217,210,233)">Adjacency list</span>: A text representation of the lattice graph
- <span style="background-color:rgb(217,234,211)">Origin</span>: Starting coordinate
- <span style="background-color:rgb(234,209,220)">Target</span>: Ending coordinate
- <span style="background-color:rgb(207,226,243)">Path</span>: Maze solution sequence from the start to the end
*(Figure: an example tokenized maze, with the four token regions highlighted by color.)*
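For a small maze, a full output sequence might look like the following (schematic: the region delimiter tokens match the library's special tokens such as `<PATH_START>`, but the exact connection and coordinate tokens depend on the configured tokenizer):

```text
<ADJLIST_START> (0,0) <--> (0,1) ; (0,1) <--> (1,1) ; (1,1) <--> (1,0) ; <ADJLIST_END>
<ORIGIN_START> (0,0) <ORIGIN_END>
<TARGET_START> (1,0) <TARGET_END>
<PATH_START> (0,0) (0,1) (1,1) (1,0) <PATH_END>
```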
Each `MazeTokenizerModular` is constructed from a set of several `_TokenizerElement` objects, each of which specifies how different token regions or other elements of the stringification are produced.
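As a sketch of this composition (the `AOTP` prompt sequencer and `CTT` coordinate tokenizer are exported by this module, but the keyword argument names here are assumptions to check against the `PromptSequencers` source):

```python
from maze_dataset.tokenization import (
	CoordTokenizers,
	MazeTokenizerModular,
	PromptSequencers,
)

# compose a tokenizer from explicit _TokenizerElement instances;
# elements not specified fall back to their defaults (an assumption)
tokenizer = MazeTokenizerModular(
	prompt_sequencer=PromptSequencers.AOTP(
		coord_tokenizer=CoordTokenizers.CTT(),
	),
)
print(tokenizer.name)  # the name encodes the chosen elements
```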
Optional delimiter tokens may be added in many places in the output. Delimiter options are all configured using the parameters named `pre`, `intra`, and `post` in various `_TokenizerElement` classes. Each option controls a unique delimiter token.
Here we describe each `_TokenizerElement` and the behaviors they support. We also discuss some of the model behaviors and properties that may be investigated using these options.
### Coordinates

The `_CoordTokenizer` object controls how coordinates in the lattice are represented across all token regions. Options include (compared in the sketch following this list):
- **Unique tokens**: Each coordinate is represented as a single unique token `"(i,j)"`
- **Coordinate tuple tokens**: Each coordinate is represented as a sequence of 2 tokens, respectively encoding the row and column positions: `["i", ",", "j"]`
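A sketch of the difference (assuming each `_CoordTokenizer` exposes a `to_tokens` method accepting a coordinate, and default `CTT` delimiter settings; both are assumptions to verify against the `CoordTokenizers` source):

```python
from maze_dataset.tokenization import CoordTokenizers

coord = (2, 3)
# unique token: the whole coordinate as a single token, e.g. ["(2,3)"]
print(CoordTokenizers.UT().to_tokens(coord))
# coordinate tuple tokens: row and column as separate tokens,
# e.g. ["(", "2", ",", "3", ")"] under default pre/intra/post delimiters
print(CoordTokenizers.CTT().to_tokens(coord))
```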
### Adjacency List

The `_AdjListTokenizer` object controls this token region. All tokenizations represent the maze connectivity as a sequence of connections or walls between pairs of adjacent coordinates in the lattice. A configuration sketch follows the list below.
- `_EdgeSubset`: Specifies the subset of lattice edges to be tokenized
	- **All edges**: Every edge in the lattice
	- **Connections**: Only edges which contain a connection
	- **Walls**: Only edges which contain a wall
- `_EdgePermuter`: Specifies how to sequence the two coordinates in each lattice edge
	- **Random**
	- **Sorted**: The smaller coordinate always comes first
	- **Both permutations**: Each edge is represented twice, once with each permutation. This option attempts to represent connections in a more directionally symmetric manner. Including only one permutation of each edge may affect models' internal representations of edges, treating a path traversing the edge differently depending on whether the coordinate sequence in the path matches the sequence in the adjacency list.
- `shuffle_d0`: Whether to shuffle the edges randomly or sort them in the output by their first coordinate
- `connection_token_ordinal`: Location in the sequence of the token representing whether the edge is a connection or a wall
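For example, a sketch tokenizing only the connected edges, with each edge's coordinates sorted (`AdjListCoord`, `ConnectionEdges`, and `SortedCoords` are drawn from the element namespaces exported by this module; the keyword argument names are assumptions):

```python
from maze_dataset.tokenization import (
	AdjListTokenizers,
	EdgePermuters,
	EdgeSubsets,
	MazeTokenizerModular,
	PromptSequencers,
)

adj_list_tokenizer = AdjListTokenizers.AdjListCoord(
	edge_subset=EdgeSubsets.ConnectionEdges(),  # only edges with a connection
	edge_permuter=EdgePermuters.SortedCoords(),  # smaller coordinate first
	shuffle_d0=True,  # shuffle the edge order in the output
)
tokenizer = MazeTokenizerModular(
	prompt_sequencer=PromptSequencers.AOTP(adj_list_tokenizer=adj_list_tokenizer),
)
```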
### Path

The `_PathTokenizer` object controls this token region. Paths are all represented as a sequence of steps moving from the start to the end position.
- `_StepSize`: Specifies the size of each step
	- **Singles**: Every coordinate traversed between start and end is directly represented
	- **Forks**: Only coordinates at forking points in the maze are represented. The paths between forking points are implicit. Using this option might train models more directly to represent forking points differently from coordinates where the maze connectivity implies an obvious next step in the path.
- `_StepTokenizer`: Specifies how an individual step is represented
	- **Coordinate**: The coordinates of each step are directly tokenized using a `_CoordTokenizer`
	- **Cardinal direction**: A single token corresponding to the cardinal direction taken at the starting position of that step. E.g., `NORTH`, `SOUTH`. If using a `_StepSize` other than **Singles**, this direction may not correspond to the final direction traveled to arrive at the end position of the step.
	- **Relative direction**: A single token corresponding to the first-person perspective relative direction taken at the starting position of that step. E.g., `RIGHT`, `LEFT`.
	- **Distance**: A single token corresponding to the number of coordinate positions traversed in that step. E.g., using a `_StepSize` of **Singles**, the **Distance** token would be the same for each step, corresponding to a distance of 1 coordinate. This option is only of interest in combination with a `_StepSize` other than **Singles**.
A `_PathTokenizer` contains a sequence of one or more unique `_StepTokenizer` objects. Different step representations may be mixed and permuted, allowing for investigation of model representations of multiple aspects of a maze solution at once.
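A sketch of such a mixture, rendering each step as its coordinate followed by the cardinal direction taken (`StepSequence` and the keyword argument names are assumptions to verify against the `PathTokenizers` source):

```python
from maze_dataset.tokenization import (
	MazeTokenizerModular,
	PathTokenizers,
	PromptSequencers,
	StepSizes,
	StepTokenizers,
)

path_tokenizer = PathTokenizers.StepSequence(
	step_size=StepSizes.Singles(),  # one coordinate per step
	step_tokenizers=(
		StepTokenizers.Coord(),  # the coordinate of each step
		StepTokenizers.Cardinal(),  # then the direction taken
	),
)
tokenizer = MazeTokenizerModular(
	prompt_sequencer=PromptSequencers.AOTP(path_tokenizer=path_tokenizer),
)
```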
## Tokenized Outputs for Training and Evaluation {#token-training}

During deployment we provide only the prompt up to the `<PATH_START>` token.
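The `get_tokens_up_to_path_start` helper exported by this module produces such prompts; a sketch (optional arguments, e.g. whether the start coordinate is kept, are omitted and should be checked against its docstring):

```python
from maze_dataset.tokenization import get_tokens_up_to_path_start

# `tokens` is a full stringified maze, e.g. from `as_tokens` in the
# earlier sketch; the prompt keeps everything up to <PATH_START>
prompt: list[str] = get_tokens_up_to_path_start(tokens)
```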
Examples of usage of this dataset to train autoregressive transformers can be found in our `maze-transformer` library [@maze-transformer-github]. Other tokenization and vocabulary schemes are also included, such as representing each coordinate as a pair of $i,j$ index tokens.
## Extensibility

The tokenizer architecture is purposefully designed so that adding and testing a wide variety of new tokenization algorithms is fast and minimally disturbs functioning code. This is enabled by the modular architecture and the automatic inclusion of any new tokenizers in integration tests. To create a new tokenizer, developers forking the library may create their own `_TokenizerElement` subclass and implement its abstract methods, as in the sketch below. If the behavior change is sufficiently small, adding a parameter to an existing `_TokenizerElement` subclass and updating its implementation may suffice, along with new cases in the existing unit tests.
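A hypothetical sketch of such a subclass (the base-class access path and the abstract methods, assumed here to include `to_tokens`, must be checked against the `_TokenizerElement` and `CoordTokenizers` source; real elements may also require the library's serialization decorators):

```python
from maze_dataset.tokenization import CoordTokenizers


class RowColCoord(CoordTokenizers._CoordTokenizer):
	# hypothetical coordinate tokenizer: (i, j) -> ["row_i", "col_j"]
	def to_tokens(self, coord) -> list[str]:
		return [f"row_{coord[0]}", f"col_{coord[1]}"]
```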
The breadth of tokenizers is also easily scaled in the opposite direction. Due to the exponential scaling of parameter combinations, adding a small number of new features can significantly slow certain procedures which rely on constructing all possible tokenizers, such as integration tests. If any existing subclass contains features which aren't needed, a developer tool decorator is provided which can be applied to the unneeded `_TokenizerElement` subclasses to prune those features and compact the available space of tokenizers.
74"""
from maze_dataset.tokenization.maze_tokenizer_legacy import (
	MazeTokenizer,
	TokenizationMode,
	get_tokens_up_to_path_start,
)
from maze_dataset.tokenization.modular.element_base import _TokenizerElement
from maze_dataset.tokenization.modular.elements import (
	AdjListTokenizers,
	CoordTokenizers,
	EdgeGroupings,
	EdgePermuters,
	EdgeSubsets,
	PathTokenizers,
	PromptSequencers,
	StepSizes,
	StepTokenizers,
	TargetTokenizers,
)
from maze_dataset.tokenization.modular.maze_tokenizer_modular import (
	MazeTokenizerModular,
)

# we don't sort alphabetically on purpose, we sort by the type
__all__ = [
	# submodules
	"modular",
	"common",
	"maze_tokenizer_legacy",
	"maze_tokenizer",
	# legacy tokenizer
	"MazeTokenizer",
	"TokenizationMode",
	# MMT
	"MazeTokenizerModular",
	# element base
	"_TokenizerElement",
	# elements
	"PromptSequencers",
	"CoordTokenizers",
	"AdjListTokenizers",
	"EdgeGroupings",
	"EdgePermuters",
	"EdgeSubsets",
	"TargetTokenizers",
	"StepSizes",
	"StepTokenizers",
	"PathTokenizers",
	# helpers
	"get_tokens_up_to_path_start",
]