
maze_dataset.tokenization.modular

implements MazeTokenizerModular and related code

the structure of a typical MazeTokenizerModular is something like this:

+----------------------------------------------------+
|                  MazeTokenizerModular              |
|  +-----------------------------------------------+ |
|  |                 _PromptSequencer              | |
|  |         +-----------------------------+       | |
|  |         |       _CoordTokenizer       |       | |
|  |         +-----------------------------+       | |
|  |     +------------------------------------+    | |
|  |     |         _AdjListTokenizer          |    | |
|  |     | +-----------+    +-------------+   |    | |
|  |     | |_EdgeSubset|    |_EdgeGrouping|   |    | |
|  |     | +-----------+    +-------------+   |    | |
|  |     |          +-------------+           |    | |
|  |     |          |_EdgePermuter|           |    | |
|  |     |          +-------------+           |    | |
|  |     +------------------------------------+    | |
|  |         +-----------------------------+       | |
|  |         |      _TargetTokenizer       |       | |
|  |         +-----------------------------+       | |
|  |  +------------------------------------------+ | |
|  |  |              _PathTokenizer              | | |
|  |  |  +---------------+   +----------------+  | | |
|  |  |  |   _StepSize   |   | _StepTokenizer |  | | |
|  |  |  +---------------+   +----------------+  | | |
|  |  |                      | _StepTokenizer |  | | |
|  |  |                      +----------------+  | | |
|  |  |                             :            | | |
|  |  +------------------------------------------+ | |
|  +-----------------------------------------------+ |
+----------------------------------------------------+

Optional delimiter tokens may be added in many places in the output. All delimiter options are configured using the parameters named pre, intra, and post in the various _TokenizerElement classes. Each option controls a unique delimiter token. Here we describe each _TokenizerElement and the behaviors it supports. We also discuss some of the model behaviors and properties that may be investigated using these options.

Coordinates {#coordtokenizer}

The _CoordTokenizer object controls how coordinates in the lattice are represented across all token regions. Options include:

- Unique tokens: Each coordinate is represented as a single unique token "(i,j)"
- Coordinate tuple tokens: Each coordinate is represented as a sequence of two tokens encoding the row and column positions, respectively, shown here with an optional intra delimiter between them: ["i", ",", "j"]
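
The two coordinate representations can be illustrated with a minimal standalone sketch. It does not use the maze_dataset API: the function names `unique_token` and `tuple_tokens`, the boolean `pre`/`intra`/`post` flags, and the delimiter strings `"("`, `","`, and `")"` are assumptions chosen for illustration only.

```python
def unique_token(coord):
    """Represent a coordinate as one token, e.g. (1, 2) -> ["(1,2)"]."""
    i, j = coord
    return [f"({i},{j})"]


def tuple_tokens(coord, pre=False, intra=True, post=False):
    """Represent a coordinate as separate row and column tokens.

    The delimiter strings below are hypothetical stand-ins for the
    optional `pre`, `intra`, and `post` delimiter tokens.
    """
    i, j = coord
    tokens = []
    if pre:
        tokens.append("(")
    tokens.append(str(i))
    if intra:
        tokens.append(",")
    tokens.append(str(j))
    if post:
        tokens.append(")")
    return tokens


print(unique_token((1, 2)))
print(tuple_tokens((1, 2)))
print(tuple_tokens((1, 2), pre=True, post=True))
```

With unique tokens the vocabulary grows with the number of lattice positions, while tuple tokens keep the vocabulary small at the cost of longer sequences.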

Adjacency List {#adjlisttokenizer}

The _AdjListTokenizer object controls this token region. All tokenizations represent the maze connectivity as a sequence of connections or walls between pairs of adjacent coordinates in the lattice.

- _EdgeSubset: Specifies the subset of lattice edges to be tokenized
  - All edges: Every edge in the lattice
  - Connections: Only edges which contain a connection
  - Walls: Only edges which contain a wall
- _EdgePermuter: Specifies how to order the two coordinates in each lattice edge
  - Random
  - Sorted: The smaller coordinate always comes first
  - Both permutations: Each edge is represented twice, once with each permutation. This option attempts to represent connections in a more directionally symmetric manner. Including only one permutation of each edge may affect models' internal representations of edges, so that a path traversing an edge is treated differently depending on whether the coordinate order in the path matches the order in the adjacency list.
- shuffle_d0: Whether to shuffle the edges randomly or sort them in the output by their first coordinate
- connection_token_ordinal: Position, within each edge's token sequence, of the token indicating whether the edge is a connection or a wall
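
The edge subset and edge permutation options above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: `lattice_edges`, `select_edges`, and `permute_edge` are hypothetical helpers, coordinates are assumed to be `(row, col)` tuples, and "smaller" is assumed to mean ordinary tuple comparison.

```python
import random


def lattice_edges(rows, cols):
    """List every edge between adjacent cells of a rows x cols lattice."""
    edges = []
    for i in range(rows):
        for j in range(cols):
            if i + 1 < rows:
                edges.append(((i, j), (i + 1, j)))
            if j + 1 < cols:
                edges.append(((i, j), (i, j + 1)))
    return edges


def select_edges(rows, cols, connections, subset):
    """Pick all edges, only connections, or only walls (hypothetical names)."""
    connected = {frozenset(edge) for edge in connections}
    all_edges = lattice_edges(rows, cols)
    if subset == "all":
        return all_edges
    if subset == "connections":
        return [e for e in all_edges if frozenset(e) in connected]
    if subset == "walls":
        return [e for e in all_edges if frozenset(e) not in connected]
    raise ValueError(f"unknown subset: {subset}")


def permute_edge(edge, mode, rng=None):
    """Return the coordinate ordering(s) used to represent one edge."""
    a, b = edge
    if mode == "sorted":
        return [tuple(sorted((a, b)))]  # smaller coordinate first
    if mode == "both":
        return [(a, b), (b, a)]  # edge appears twice, once per order
    if mode == "random":
        pair = [a, b]
        (rng or random.Random()).shuffle(pair)
        return [tuple(pair)]
    raise ValueError(f"unknown mode: {mode}")


# A 2x2 maze with two connections; the remaining two lattice edges are walls.
connections = [((0, 1), (0, 0)), ((0, 1), (1, 1))]
print(select_edges(2, 2, connections, "connections"))
print(select_edges(2, 2, connections, "walls"))
print(permute_edge(((1, 0), (0, 0)), "both"))
```

Note that the "both" mode doubles the number of edge entries in the adjacency list, which lengthens the token sequence accordingly.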

Path {#pathtokenizer}

The _PathTokenizer object controls this token region. All paths are represented as a sequence of steps moving from the start position to the end position.

- _StepSize: Specifies the size of each step
  - Singles: Every coordinate traversed between start and end is directly represented
  - Forks: Only coordinates at forking points in the maze are represented. The paths between forking points are implicit. This option might train models more directly to represent forking points differently from coordinates where the maze connectivity implies an obvious next step in the path.
- _StepTokenizer: Specifies how an individual step is represented
  - Coordinate: The coordinates of each step are directly tokenized using a _CoordTokenizer
  - Cardinal direction: A single token corresponding to the cardinal direction taken at the starting position of that step. E.g., NORTH, SOUTH. If using a _StepSize other than Singles, this direction may not correspond to the final direction traveled to arrive at the end position of the step.
  - Relative direction: A single token corresponding to the first-person perspective relative direction taken at the starting position of that step. E.g., RIGHT, LEFT.
  - Distance: A single token corresponding to the number of coordinate positions traversed in that step. E.g., with a _StepSize of Singles, the Distance token would be the same for every step, corresponding to a distance of 1 coordinate. This option is only of interest in combination with a _StepSize other than Singles.

A _PathTokenizer contains a sequence of one or more unique _StepTokenizer objects. Different step representations may be mixed and permuted, allowing for investigation of model representations of multiple aspects of a maze solution at once.
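
The interaction between step size and step representation can be sketched as follows. This is a simplified illustration, not the library's code: the helper names, the string-valued options, and the token spellings are hypothetical. The mapping of `(row, col)` deltas to `NORTH`/`SOUTH`/`EAST`/`WEST` is an assumption, and fork positions are passed in explicitly rather than derived from maze connectivity. Relative directions are omitted for brevity.

```python
# Assumed convention: coordinates are (row, col) and row 0 is the north edge.
DIRECTIONS = {(-1, 0): "NORTH", (1, 0): "SOUTH", (0, 1): "EAST", (0, -1): "WEST"}


def split_steps(path, step_size, forks=frozenset()):
    """Return (start_index, end_index) pairs, one per step."""
    last = len(path) - 1
    if step_size == "singles":
        stops = list(range(len(path)))
    elif step_size == "forks":
        # Keep only the start, the end, and any forking points on the path.
        stops = [k for k, c in enumerate(path) if k in (0, last) or c in forks]
    else:
        raise ValueError(f"unknown step size: {step_size}")
    return list(zip(stops[:-1], stops[1:]))


def tokenize_step(path, start, end, kind):
    """Represent one step as a single (hypothetical) token."""
    if kind == "coord":
        i, j = path[end]
        return f"({i},{j})"
    if kind == "cardinal":
        # Direction of the first move of the step, not necessarily the last.
        (r0, c0), (r1, c1) = path[start], path[start + 1]
        return DIRECTIONS[(r1 - r0, c1 - c0)]
    if kind == "distance":
        return str(end - start)
    raise ValueError(f"unknown step kind: {kind}")


def tokenize_path(path, step_size, step_kinds, forks=frozenset()):
    """Emit the tokens of every requested step kind, in order, for each step."""
    tokens = []
    for start, end in split_steps(path, step_size, forks):
        tokens.extend(tokenize_step(path, start, end, k) for k in step_kinds)
    return tokens


path = [(0, 0), (0, 1), (1, 1), (2, 1)]
print(tokenize_path(path, "singles", ["cardinal"]))
print(tokenize_path(path, "forks", ["cardinal", "distance"], forks={(0, 1)}))
```

In the second call, each step contributes a direction token followed by a distance token, which mirrors the idea of mixing several step representations within one path.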


"""implements `MazeTokenizerModular` and related code

the structure of a typical `MazeTokenizerModular` is something like this:
```
+----------------------------------------------------+
|                  MazeTokenizerModular              |
|  +-----------------------------------------------+ |
|  |                 _PromptSequencer              | |
|  |         +-----------------------------+       | |
|  |         |       _CoordTokenizer       |       | |
|  |         +-----------------------------+       | |
|  |     +------------------------------------+    | |
|  |     |         _AdjListTokenizer          |    | |
|  |     | +-----------+    +-------------+   |    | |
|  |     | |_EdgeSubset|    |_EdgeGrouping|   |    | |
|  |     | +-----------+    +-------------+   |    | |
|  |     |          +-------------+           |    | |
|  |     |          |_EdgePermuter|           |    | |
|  |     |          +-------------+           |    | |
|  |     +------------------------------------+    | |
|  |         +-----------------------------+       | |
|  |         |      _TargetTokenizer       |       | |
|  |         +-----------------------------+       | |
|  |  +------------------------------------------+ | |
|  |  |              _PathTokenizer              | | |
|  |  |  +---------------+   +----------------+  | | |
|  |  |  |   _StepSize   |   | _StepTokenizer |  | | |
|  |  |  +---------------+   +----------------+  | | |
|  |  |                      | _StepTokenizer |  | | |
|  |  |                      +----------------+  | | |
|  |  |                             :            | | |
|  |  +------------------------------------------+ | |
|  +-----------------------------------------------+ |
+----------------------------------------------------+
```

Optional delimiter tokens may be added in many places in the output. All delimiter options are configured using the parameters named `pre`, `intra`, and `post` in the various `_TokenizerElement` classes. Each option controls a unique delimiter token.
Here we describe each `_TokenizerElement` and the behaviors it supports. We also discuss some of the model behaviors and properties that may be investigated using these options.

### Coordinates {#coordtokenizer}

The `_CoordTokenizer` object controls how coordinates in the lattice are represented across all token regions. Options include:

- **Unique tokens**: Each coordinate is represented as a single unique token `"(i,j)"`
- **Coordinate tuple tokens**: Each coordinate is represented as a sequence of two tokens encoding the row and column positions, respectively, shown here with an optional `intra` delimiter between them: `["i", ",", "j"]`

### Adjacency List {#adjlisttokenizer}

The `_AdjListTokenizer` object controls this token region. All tokenizations represent the maze connectivity as a sequence of connections or walls between pairs of adjacent coordinates in the lattice.

- `_EdgeSubset`: Specifies the subset of lattice edges to be tokenized
  - **All edges**: Every edge in the lattice
  - **Connections**: Only edges which contain a connection
  - **Walls**: Only edges which contain a wall
- `_EdgePermuter`: Specifies how to order the two coordinates in each lattice edge
  - **Random**
  - **Sorted**: The smaller coordinate always comes first
  - **Both permutations**: Each edge is represented twice, once with each permutation. This option attempts to represent connections in a more directionally symmetric manner. Including only one permutation of each edge may affect models' internal representations of edges, so that a path traversing an edge is treated differently depending on whether the coordinate order in the path matches the order in the adjacency list.
- `shuffle_d0`: Whether to shuffle the edges randomly or sort them in the output by their first coordinate
- `connection_token_ordinal`: Position, within each edge's token sequence, of the token indicating whether the edge is a connection or a wall

### Path {#pathtokenizer}

The `_PathTokenizer` object controls this token region. All paths are represented as a sequence of steps moving from the start position to the end position.

- `_StepSize`: Specifies the size of each step
  - **Singles**: Every coordinate traversed between start and end is directly represented
  - **Forks**: Only coordinates at forking points in the maze are represented. The paths between forking points are implicit. This option might train models more directly to represent forking points differently from coordinates where the maze connectivity implies an obvious next step in the path.
- `_StepTokenizer`: Specifies how an individual step is represented
  - **Coordinate**: The coordinates of each step are directly tokenized using a `_CoordTokenizer`
  - **Cardinal direction**: A single token corresponding to the cardinal direction taken at the starting position of that step. E.g., `NORTH`, `SOUTH`. If using a `_StepSize` other than **Singles**, this direction may not correspond to the final direction traveled to arrive at the end position of the step.
  - **Relative direction**: A single token corresponding to the first-person perspective relative direction taken at the starting position of that step. E.g., `RIGHT`, `LEFT`.
  - **Distance**: A single token corresponding to the number of coordinate positions traversed in that step. E.g., with a `_StepSize` of **Singles**, the **Distance** token would be the same for every step, corresponding to a distance of 1 coordinate. This option is only of interest in combination with a `_StepSize` other than **Singles**.

A `_PathTokenizer` contains a sequence of one or more unique `_StepTokenizer` objects. Different step representations may be mixed and permuted, allowing for investigation of model representations of multiple aspects of a maze solution at once.

"""

__all__ = [
	# modules
	"all_instances",
	"all_tokenizers",
	"element_base",
	"elements",
	"fst_load",
	"fst",
	"hashing",
	"maze_tokenizer_modular",
	"save_hashes",
]