1"""turning a maze into text 

2 

3- `MazeTokenizerModular` is the new recommended way to do this as of 1.0.0 

4- legacy `TokenizationMode` enum and `MazeTokenizer` class for supporting existing code 

5- a variety of helper classes and functions 

6 

7There are many algorithms by which one might tokenize a 2D maze into a 1D format usable by autoregressive text models. Training multiple models on the encodings output from each of these algorithms may produce very different internal representations, learned solution algorithms, and levels of performance. To explore how different maze tokenization algorithms affect these models, the `MazeTokenizerModular` class contains a rich set of options to customize how mazes are stringified. This class contains 19 discrete parameters, resulting in 5.9 million unique tokenizers. But wait, there's more! There are 6 additional parameters available in the library which are untested but further expand the the number of tokenizers by a factor of $44/3$ to 86 million. 
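
As a quick orientation, here is a minimal sketch that constructs a tokenizer with all-default options and stringifies a maze. It assumes the dataset-generation API shown in the package README (`MazeDatasetConfig`, `MazeDataset.from_config`, `LatticeMazeGenerators.gen_dfs`, `SolvedMaze.as_tokens`); exact names may differ between versions.

```python
from maze_dataset import MazeDataset, MazeDatasetConfig
from maze_dataset.generation import LatticeMazeGenerators
from maze_dataset.tokenization import MazeTokenizerModular

# generate a small dataset of DFS mazes
cfg = MazeDatasetConfig(
	name="example",
	grid_n=5,
	n_mazes=4,
	maze_ctor=LatticeMazeGenerators.gen_dfs,
)
dataset = MazeDataset.from_config(cfg)

# stringify one maze with an all-default modular tokenizer
tokenizer = MazeTokenizerModular()
tokens: list[str] = dataset[0].as_tokens(tokenizer)
```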

All output sequences consist of four token regions representing different features of the maze. These regions are distinguished by color in the figure below.

- <span style="background-color:rgb(217,210,233)">Adjacency list</span>: A text representation of the lattice graph
- <span style="background-color:rgb(217,234,211)">Origin</span>: Starting coordinate
- <span style="background-color:rgb(234,209,220)">Target</span>: Ending coordinate
- <span style="background-color:rgb(207,226,243)">Path</span>: Maze solution sequence from the start to the end

![Example text output format with token regions highlighted.](figures/outputs-tokens-colored.tex)
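
For concreteness, a tokenized maze in this format looks roughly like the following (abbreviated, and an illustrative sketch rather than verbatim library output):

```
<ADJLIST_START> (0,0) <--> (0,1) ; (0,1) <--> (1,1) ; ... <ADJLIST_END>
<ORIGIN_START> (0,0) <ORIGIN_END>
<TARGET_START> (1,1) <TARGET_END>
<PATH_START> (0,0) (0,1) (1,1) <PATH_END>
```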

Each `MazeTokenizerModular` is constructed from a set of several `_TokenizerElement` objects, each of which specifies how different token regions or other elements of the stringification are produced.

![Nested internal structure of `_TokenizerElement` objects inside a typical `MazeTokenizerModular` object.](figures/TokenizerElement_structure.pdf)
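
The nesting shown above corresponds to nested constructor arguments. As an illustrative sketch (the element and field names below, e.g. `PromptSequencers.AOTP` and its `coord_tokenizer` field, are assumptions based on this module's exports rather than confirmed API):

```python
from maze_dataset.tokenization import (
	CoordTokenizers,
	MazeTokenizerModular,
	PromptSequencers,
)

# hypothetical explicit construction; every element has a default,
# so MazeTokenizerModular() alone is also valid
tokenizer = MazeTokenizerModular(
	prompt_sequencer=PromptSequencers.AOTP(
		coord_tokenizer=CoordTokenizers.UT(),
	),
)
```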

Optional delimiter tokens may be added in many places in the output. Delimiter options are all configured using the parameters named `pre`, `intra`, and `post` in various `_TokenizerElement` classes. Each option controls a unique delimiter token.

Below we describe each `_TokenizerElement` and the behaviors it supports. We also discuss some of the model behaviors and properties that may be investigated using these options.

### Coordinates

The `_CoordTokenizer` object controls how coordinates in the lattice are represented across all token regions. Options include the following; a short sketch contrasting them follows the list.

- **Unique tokens**: Each coordinate is represented as a single unique token `"(i,j)"`
- **Coordinate tuple tokens**: Each coordinate is represented as a short sequence of tokens encoding the row and column positions separately, e.g. `["i", ",", "j"]`
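
A sketch contrasting the two options, which also illustrates the `pre`/`intra`/`post` delimiter parameters described above (`UT` and `CTT` are exported via `CoordTokenizers`; the delimiter keyword arguments are assumptions):

```python
from maze_dataset.tokenization import CoordTokenizers

ut = CoordTokenizers.UT()  # one unique token per coordinate: "(1,2)"
# hypothetical delimiter flags: tuple tokens like ["(", "1", ",", "2", ")"]
ctt = CoordTokenizers.CTT(pre=True, intra=True, post=True)
```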

### Adjacency List

The `_AdjListTokenizer` object controls this token region. All tokenizations represent the maze connectivity as a sequence of connections or walls between pairs of adjacent coordinates in the lattice. A construction sketch follows the list below.

- `_EdgeSubset`: Specifies the subset of lattice edges to be tokenized
	- **All edges**: Every edge in the lattice
	- **Connections**: Only edges which contain a connection
	- **Walls**: Only edges which contain a wall
- `_EdgePermuter`: Specifies how to sequence the two coordinates in each lattice edge
	- **Random**
	- **Sorted**: The smaller coordinate always comes first
	- **Both permutations**: Each edge is represented twice, once with each permutation. This option attempts to represent connections in a more directionally symmetric manner. Including only one permutation of each edge may affect models' internal representations of edges, treating a path traversing the edge differently depending on whether the coordinate sequence in the path matches the sequence in the adjacency list.
- `shuffle_d0`: Whether to shuffle the edges randomly or sort them in the output by their first coordinate
- `connection_token_ordinal`: Location in the sequence of the token representing whether the edge is a connection or a wall
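
As an illustration of these options (the concrete class names below, e.g. `AdjListTokenizers.AdjListCoord` and `EdgeSubsets.ConnectionEdges`, are assumptions inferred from this module's exports, not confirmed API):

```python
from maze_dataset.tokenization import (
	AdjListTokenizers,
	EdgePermuters,
	EdgeSubsets,
)

# hypothetical: tokenize only the connected edges, with the coordinates
# within each edge sorted, edges shuffled, and the connection/wall token
# placed second within each edge group
adj_list_tokenizer = AdjListTokenizers.AdjListCoord(
	edge_subset=EdgeSubsets.ConnectionEdges(),
	edge_permuter=EdgePermuters.SortedCoords(),
	shuffle_d0=True,
	connection_token_ordinal=1,
)
```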

### Path

The `_PathTokenizer` object controls this token region. Paths are all represented as a sequence of steps moving from the start to the end position.

- `_StepSize`: Specifies the size of each step
	- **Singles**: Every coordinate traversed between start and end is directly represented
	- **Forks**: Only coordinates at forking points in the maze are represented. The paths between forking points are implicit. Using this option might train models more directly to represent forking points differently from coordinates where the maze connectivity implies an obvious next step in the path.
- `_StepTokenizer`: Specifies how an individual step is represented
	- **Coordinate**: The coordinates of each step are directly tokenized using a `_CoordTokenizer`
	- **Cardinal direction**: A single token corresponding to the cardinal direction taken at the starting position of that step. E.g., `NORTH`, `SOUTH`. If using a `_StepSize` other than **Singles**, this direction may not correspond to the final direction traveled to arrive at the end position of the step.
	- **Relative direction**: A single token corresponding to the first-person perspective relative direction taken at the starting position of that step. E.g., `RIGHT`, `LEFT`.
	- **Distance**: A single token corresponding to the number of coordinate positions traversed in that step. E.g., using a `_StepSize` of **Singles**, the **Distance** token would be the same for each step, corresponding to a distance of 1 coordinate. This option is only of interest in combination with a `_StepSize` other than **Singles**.

A `_PathTokenizer` contains a sequence of one or more unique `_StepTokenizer` objects. Different step representations may be mixed and permuted, allowing for investigation of model representations of multiple aspects of a maze solution at once.
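
For example, a path tokenizer that emits both a coordinate and a cardinal-direction token for every single-coordinate step might be constructed as follows (a sketch; `PathTokenizers.StepSequence` and the field names are assumptions based on this module's exports):

```python
from maze_dataset.tokenization import PathTokenizers, StepSizes, StepTokenizers

# hypothetical: each step emits a coordinate token followed by a
# cardinal-direction token
path_tokenizer = PathTokenizers.StepSequence(
	step_size=StepSizes.Singles(),
	step_tokenizers=(
		StepTokenizers.Coord(),
		StepTokenizers.Cardinal(),
	),
)
```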

## Tokenized Outputs for Training and Evaluation {#token-training}

During deployment we provide only the prompt up to the `<PATH_START>` token.
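
The `get_tokens_up_to_path_start` helper exported from this module supports exactly this. A usage sketch (the helper's exact signature may include further options):

```python
from maze_dataset import MazeDataset, MazeDatasetConfig
from maze_dataset.generation import LatticeMazeGenerators
from maze_dataset.tokenization import MazeTokenizerModular, get_tokens_up_to_path_start

dataset = MazeDataset.from_config(
	MazeDatasetConfig(name="demo", grid_n=5, n_mazes=1, maze_ctor=LatticeMazeGenerators.gen_dfs),
)
tokens = dataset[0].as_tokens(MazeTokenizerModular())

# keep everything up to and including <PATH_START> as the model prompt
prompt = get_tokens_up_to_path_start(tokens)
```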

Examples of using this dataset to train autoregressive transformers can be found in our `maze-transformer` library [@maze-transformer-github]. Other tokenization and vocabulary schemes are also included, such as representing each coordinate as a pair of $i,j$ index tokens.

## Extensibility

The tokenizer architecture is purposefully designed so that adding and testing a wide variety of new tokenization algorithms is fast and minimally disturbs functioning code. This is enabled by the modular architecture and by the automatic inclusion of any new tokenizers in integration tests. To create a new tokenizer, developers forking the library may simply create their own `_TokenizerElement` subclass and implement the abstract methods. If the behavior change is sufficiently small, adding a parameter to an existing `_TokenizerElement` subclass and updating its implementation may suffice, in which case adding new cases to the existing unit tests is all the testing required.
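
A minimal skeleton of what such an extension might look like (the abstract-method names `to_tokens` and `is_valid` are assumptions; consult `maze_dataset.tokenization.modular.element_base` for the actual interface):

```python
from maze_dataset.tokenization.modular.element_base import _TokenizerElement

# hypothetical coordinate tokenizer emitting "row<i>" / "col<j>" token pairs
class _MyCoordTokenizer(_TokenizerElement):
	def to_tokens(self, coord) -> list[str]:  # assumed abstract method
		i, j = coord
		return [f"row{i}", f"col{j}"]

	def is_valid(self) -> bool:  # assumed abstract method
		return True
```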

The breadth of tokenizers is also easily scaled in the opposite direction. Due to the exponential scaling of parameter combinations, adding a small number of new features can significantly slow certain procedures which rely on constructing all possible tokenizers, such as integration tests. If any existing subclass contains features which aren't needed, a developer tool decorator is provided which can be applied to the unneeded `_TokenizerElement` subclasses to prune those features and compact the available space of tokenizers.
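
A sketch of how such pruning might look (the decorator name `mark_as_unsupported` is a hypothetical stand-in for the developer tool mentioned above; check `element_base` for the actual name and signature):

```python
from maze_dataset.tokenization.modular.element_base import (
	_TokenizerElement,
	mark_as_unsupported,  # hypothetical name for the pruning decorator
)

# hypothetical: exclude this element from the space of constructible
# tokenizers, shrinking integration-test runtime
@mark_as_unsupported
class _UnneededTokenizer(_TokenizerElement):
	...
```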

"""

from maze_dataset.tokenization.maze_tokenizer_legacy import (
	MazeTokenizer,
	TokenizationMode,
	get_tokens_up_to_path_start,
)
from maze_dataset.tokenization.modular.element_base import _TokenizerElement
from maze_dataset.tokenization.modular.elements import (
	AdjListTokenizers,
	CoordTokenizers,
	EdgeGroupings,
	EdgePermuters,
	EdgeSubsets,
	PathTokenizers,
	PromptSequencers,
	StepSizes,
	StepTokenizers,
	TargetTokenizers,
)
from maze_dataset.tokenization.modular.maze_tokenizer_modular import (
	MazeTokenizerModular,
)

# we don't sort alphabetically on purpose; we sort by type

__all__ = [
	# submodules
	"modular",
	"common",
	"maze_tokenizer_legacy",
	"maze_tokenizer",
	# legacy tokenizer
	"MazeTokenizer",
	"TokenizationMode",
	# MMT
	"MazeTokenizerModular",
	# element base
	"_TokenizerElement",
	# elements
	"PromptSequencers",
	"CoordTokenizers",
	"AdjListTokenizers",
	"EdgeGroupings",
	"EdgePermuters",
	"EdgeSubsets",
	"TargetTokenizers",
	"StepSizes",
	"StepTokenizers",
	"PathTokenizers",
	# helpers
	"get_tokens_up_to_path_start",
]