maze_dataset.tokenization.modular
implements `MazeTokenizerModular` and related code

The structure of a typical `MazeTokenizerModular` is something like this:
```
+----------------------------------------------------+
|                MazeTokenizerModular                |
| +------------------------------------------------+ |
| |                _PromptSequencer                | |
| | +-----------------------------+                | |
| | |       _CoordTokenizer       |                | |
| | +-----------------------------+                | |
| | +------------------------------------+         | |
| | |         _AdjListTokenizer          |         | |
| | | +-----------+    +-------------+   |         | |
| | | |_EdgeSubset|    |_EdgeGrouping|   |         | |
| | | +-----------+    +-------------+   |         | |
| | |          +-------------+           |         | |
| | |          |_EdgePermuter|           |         | |
| | |          +-------------+           |         | |
| | +------------------------------------+         | |
| | +-----------------------------+                | |
| | |      _TargetTokenizer       |                | |
| | +-----------------------------+                | |
| | +------------------------------------------+   | |
| | |              _PathTokenizer              |   | |
| | | +---------------+   +----------------+   |   | |
| | | |   _StepSize   |   | _StepTokenizer |   |   | |
| | | +---------------+   +----------------+   |   | |
| | |                     | _StepTokenizer |   |   | |
| | |                     +----------------+   |   | |
| | |                              :           |   | |
| | +------------------------------------------+   | |
| +------------------------------------------------+ |
+----------------------------------------------------+
```

Optional delimiter tokens may be added in many places in the output. Delimiter options are all configured using the parameters named `pre`, `intra`, and `post` in various `_TokenizerElement` classes. Each option controls a unique delimiter token.

Here we describe each `_TokenizerElement` and the behaviors it supports. We also discuss some of the model behaviors and properties that may be investigated using these options.

### Coordinates {#coordtokenizer}

The `_CoordTokenizer` object controls how coordinates in the lattice are represented across all token regions. Options include:

- **Unique tokens**: Each coordinate is represented as a single unique token: `"(i,j)"`
- **Coordinate tuple tokens**: Each coordinate is represented as a sequence of 2 tokens encoding the row and column positions, respectively, with an optional delimiter token between them: `["i", ",", "j"]`
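
The two coordinate representations, together with the `pre`, `intra`, and `post` delimiter options, can be illustrated with a minimal standalone sketch. The function names and the `"("`, `","`, `")"` delimiter strings are assumptions made for illustration; this is not the `_CoordTokenizer` implementation.

```python
def coord_unique_token(coord):
    """Unique tokens: the whole coordinate becomes one token."""
    i, j = coord
    return [f"({i},{j})"]


def coord_tuple_tokens(coord, pre=False, intra=True, post=False):
    """Coordinate tuple tokens: row and column become separate tokens.

    `pre`, `intra`, and `post` each toggle one optional delimiter token
    (hypothetical delimiter strings, chosen for this example only).
    """
    i, j = coord
    tokens = []
    if pre:
        tokens.append("(")
    tokens.append(str(i))
    if intra:
        tokens.append(",")
    tokens.append(str(j))
    if post:
        tokens.append(")")
    return tokens


if __name__ == "__main__":
    print(coord_unique_token((1, 2)))
    print(coord_tuple_tokens((1, 2)))
    print(coord_tuple_tokens((1, 2), pre=True, post=True))
```

With all delimiters disabled, the tuple form reduces to exactly two tokens, one for the row and one for the column.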

### Adjacency List {#adjlisttokenizer}

The `_AdjListTokenizer` object controls the adjacency list token region. All tokenizations represent the maze connectivity as a sequence of connections or walls between pairs of adjacent coordinates in the lattice.

- `_EdgeSubset`: Specifies the subset of lattice edges to be tokenized
  - **All edges**: Every edge in the lattice
  - **Connections**: Only edges which contain a connection
  - **Walls**: Only edges which contain a wall
- `_EdgePermuter`: Specifies how to sequence the two coordinates in each lattice edge
  - **Random**
  - **Sorted**: The smaller coordinate always comes first
  - **Both permutations**: Each edge is represented twice, once with each permutation. This option attempts to represent connections in a more directionally symmetric manner. Including only one permutation of each edge may affect models' internal representations of edges, treating a path traversing the edge differently depending on whether the coordinate sequence in the path matches the sequence in the adjacency list.
- `shuffle_d0`: Whether to shuffle the edges randomly or sort them in the output by their first coordinate
- `connection_token_ordinal`: Location in the sequence of the token representing whether the edge is a connection or a wall
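
The edge subset and edge permutation options can be sketched on a tiny lattice as follows. This is a minimal illustration under stated assumptions: the function names, the string option values, and the representation of connections as a set of `frozenset` coordinate pairs are all hypothetical and do not reflect how `_EdgeSubset` or `_EdgePermuter` are implemented.

```python
import random


def lattice_edges(rows, cols):
    """All edges between orthogonally adjacent cells of a rows x cols lattice."""
    edges = []
    for i in range(rows):
        for j in range(cols):
            if i + 1 < rows:
                edges.append(((i, j), (i + 1, j)))
            if j + 1 < cols:
                edges.append(((i, j), (i, j + 1)))
    return edges


def select_edges(edges, connections, subset="connections"):
    """Keep all edges, only connected edges, or only walled edges."""
    if subset == "all":
        return list(edges)
    if subset == "connections":
        return [e for e in edges if frozenset(e) in connections]
    if subset == "walls":
        return [e for e in edges if frozenset(e) not in connections]
    raise ValueError(f"unknown subset: {subset}")


def permute_edges(edges, permuter="sorted", rng=None):
    """Order the two coordinates within each edge."""
    if permuter == "sorted":
        return [tuple(sorted(e)) for e in edges]
    if permuter == "random":
        rng = rng or random.Random(0)
        return [e if rng.random() < 0.5 else (e[1], e[0]) for e in edges]
    if permuter == "both":
        out = []
        for a, b in edges:
            out.extend([(a, b), (b, a)])
        return out
    raise ValueError(f"unknown permuter: {permuter}")


if __name__ == "__main__":
    # 2x2 maze: every edge is a connection except (0,0)-(1,0)
    connections = {
        frozenset({(0, 0), (0, 1)}),
        frozenset({(0, 1), (1, 1)}),
        frozenset({(1, 0), (1, 1)}),
    }
    edges = lattice_edges(2, 2)
    print(select_edges(edges, connections, "walls"))
    print(permute_edges(select_edges(edges, connections), "both"))
```

In this sketch the "both" option emits the two permutations of each edge next to each other; the ordering of edges in the real output depends on the other options described above.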

### Path {#pathtokenizer}

The `_PathTokenizer` object controls the path token region. Paths are all represented as a sequence of steps moving from the start to the end position.

- `_StepSize`: Specifies the size of each step
  - **Singles**: Every coordinate traversed between start and end is directly represented
  - **Forks**: Only coordinates at forking points in the maze are represented. The paths between forking points are implicit. Using this option might train models more directly to represent forking points differently from coordinates where the maze connectivity implies an obvious next step in the path.
- `_StepTokenizer`: Specifies how an individual step is represented
  - **Coordinate**: The coordinates of each step are directly tokenized using a `_CoordTokenizer`
  - **Cardinal direction**: A single token corresponding to the cardinal direction taken at the starting position of that step. E.g., `NORTH`, `SOUTH`. If using a `_StepSize` other than **Singles**, this direction may not correspond to the final direction traveled to arrive at the end position of the step.
  - **Relative direction**: A single token corresponding to the first-person perspective relative direction taken at the starting position of that step. E.g., `RIGHT`, `LEFT`.
  - **Distance**: A single token corresponding to the number of coordinate positions traversed in that step. E.g., using a `_StepSize` of **Singles**, the **Distance** token would be the same for each step, corresponding to a distance of 1 coordinate. This option is only of interest in combination with a `_StepSize` other than **Singles**.

A `_PathTokenizer` contains a sequence of one or more unique `_StepTokenizer` objects. Different step representations may be mixed and permuted, allowing for investigation of model representations of multiple aspects of a maze solution at once.
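
The interaction between step size and step representation can be sketched as below. This is an illustrative approximation only: the function names, the explicit `forks` argument, the `(row, col)` convention with row 0 at the top, and the plain-digit distance tokens are all assumptions, not the `_PathTokenizer` API. Relative directions are omitted for brevity.

```python
# Assumed convention: coordinates are (row, col), with row 0 at the top.
CARDINALS = {(-1, 0): "NORTH", (1, 0): "SOUTH", (0, -1): "WEST", (0, 1): "EAST"}


def split_steps(path, step_size="singles", forks=frozenset()):
    """Return (start_index, end_index) pairs into `path`, one pair per step."""
    if step_size == "singles":
        stops = list(range(len(path)))
    elif step_size == "forks":
        # keep the start, the end, and every forking point along the path
        last = len(path) - 1
        stops = [k for k, c in enumerate(path) if k in (0, last) or c in forks]
    else:
        raise ValueError(f"unknown step size: {step_size}")
    return list(zip(stops[:-1], stops[1:]))


def tokenize_step(path, start, end, kinds=("cardinal", "distance")):
    """Emit one token per requested representation, in the given order."""
    tokens = []
    for kind in kinds:
        if kind == "coord":
            i, j = path[end]
            tokens.append(f"({i},{j})")
        elif kind == "cardinal":
            # direction taken when leaving the step's starting position
            (i0, j0), (i1, j1) = path[start], path[start + 1]
            tokens.append(CARDINALS[(i1 - i0, j1 - j0)])
        elif kind == "distance":
            tokens.append(str(end - start))
        else:
            raise ValueError(f"unknown step representation: {kind}")
    return tokens


if __name__ == "__main__":
    path = [(0, 0), (0, 1), (1, 1), (2, 1), (2, 2)]
    forks = {(1, 1)}  # hypothetical forking point on this path
    for step_size in ("singles", "forks"):
        tokens = []
        for start, end in split_steps(path, step_size, forks):
            tokens.extend(tokenize_step(path, start, end))
        print(step_size, tokens)
```

With the **Forks** step size, the first step in this example is tokenized as `EAST` even though the final move into the forking point is southward, which is the mismatch noted in the **Cardinal direction** description.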

__all__ = [
    # modules
    "all_instances",
    "all_tokenizers",
    "element_base",
    "elements",
    "fst_load",
    "fst",
    "hashing",
    "maze_tokenizer_modular",
    "save_hashes",
]