Coverage for maze_dataset/tokenization/modular/__init__.py: 100%

1 statements  


1"""implements `ModularMazeTokenizer` and related code 

2 

3the structure of a typical `MazeTokenizerModular` is something like this: 

```
+----------------------------------------------------+
|                MazeTokenizerModular                |
| +------------------------------------------------+ |
| |                _PromptSequencer                | |
| | +-----------------------------+                | |
| | |       _CoordTokenizer       |                | |
| | +-----------------------------+                | |
| | +------------------------------------+         | |
| | |         _AdjListTokenizer          |         | |
| | | +-----------+ +-------------+      |         | |
| | | |_EdgeSubset| |_EdgeGrouping|      |         | |
| | | +-----------+ +-------------+      |         | |
| | |          +-------------+           |         | |
| | |          |_EdgePermuter|           |         | |
| | |          +-------------+           |         | |
| | +------------------------------------+         | |
| | +-----------------------------+                | |
| | |      _TargetTokenizer       |                | |
| | +-----------------------------+                | |
| | +------------------------------------------+   | |
| | |              _PathTokenizer              |   | |
| | | +---------------+ +----------------+     |   | |
| | | |   _StepSize   | | _StepTokenizer |     |   | |
| | | +---------------+ +----------------+     |   | |
| | |                   | _StepTokenizer |     |   | |
| | |                   +----------------+     |   | |
| | |                            :             |   | |
| | +------------------------------------------+   | |
| +------------------------------------------------+ |
+----------------------------------------------------+
```

Optional delimiter tokens may be added in many places in the output. Delimiter options are all configured using the parameters named `pre`, `intra`, and `post` in various `_TokenizerElement` classes. Each option controls a unique delimiter token.
Here we describe each `_TokenizerElement` and the behaviors they support. We also discuss some of the model behaviors and properties that may be investigated using these options.
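
To make the `pre`/`intra`/`post` idea concrete, here is a minimal sketch of how such flags could insert delimiter tokens around a group of tokens. The function name and the delimiter strings are hypothetical illustrations, not the library API.

```python
def with_delimiters(
    tokens: list[str],
    pre: str | None = None,
    intra: str | None = None,
    post: str | None = None,
) -> list[str]:
    # optionally add a delimiter before, between, and after a group of tokens
    out: list[str] = []
    if pre is not None:
        out.append(pre)  # delimiter before the group
    for idx, tok in enumerate(tokens):
        if idx > 0 and intra is not None:
            out.append(intra)  # delimiter between adjacent tokens
        out.append(tok)
    if post is not None:
        out.append(post)  # delimiter after the group
    return out

# e.g. a row/column coordinate pair with all three delimiters enabled
assert with_delimiters(["1", "2"], pre="(", intra=",", post=")") == ["(", "1", ",", "2", ")"]
```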

### Coordinates {#coordtokenizer}

The `_CoordTokenizer` object controls how coordinates in the lattice are represented across all token regions. Options include the following (a toy sketch follows the list):

- **Unique tokens**: Each coordinate is represented as a single unique token `"(i,j)"`
- **Coordinate tuple tokens**: Each coordinate is represented as a sequence of 2 tokens, respectively encoding the row and column positions: `["i", ",", "j"]`
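
As a rough illustration of the difference between the two options (the exact token strings here are assumptions, not the real vocabulary):

```python
def coord_unique_token(i: int, j: int) -> list[str]:
    # unique-token option: the whole coordinate is a single token
    return [f"({i},{j})"]

def coord_tuple_tokens(i: int, j: int) -> list[str]:
    # coordinate-tuple option: separate row and column tokens
    return [str(i), ",", str(j)]

assert coord_unique_token(1, 2) == ["(1,2)"]
assert coord_tuple_tokens(1, 2) == ["1", ",", "2"]
```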

### Adjacency List {#adjlisttokenizer}

The `_AdjListTokenizer` object controls this token region. All tokenizations represent the maze connectivity as a sequence of connections or walls between pairs of adjacent coordinates in the lattice. A toy sketch of one such tokenization follows the list below.

- `_EdgeSubset`: Specifies the subset of lattice edges to be tokenized
    - **All edges**: Every edge in the lattice
    - **Connections**: Only edges which contain a connection
    - **Walls**: Only edges which contain a wall
- `_EdgePermuter`: Specifies how to sequence the two coordinates in each lattice edge
    - **Random**
    - **Sorted**: The smaller coordinate always comes first
    - **Both permutations**: Each edge is represented twice, once with each permutation. This option attempts to represent connections in a more directionally symmetric manner. Including only one permutation of each edge may affect models' internal representations of edges, treating a path traversing the edge differently depending on whether the coordinate sequence in the path matches the sequence in the adjacency list.
- `shuffle_d0`: Whether to shuffle the edges randomly or sort them in the output by their first coordinate
- `connection_token_ordinal`: Location in the sequence of the token representing whether the edge is a connection or a wall
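
Below is a minimal, hypothetical sketch of a single edge tokenized under the **Sorted** permutation, with the connection/wall token placed last via a `connection_token_ordinal`-style parameter; the coordinate and connection token strings are illustrative only and do not reflect the library's actual vocabulary or API.

```python
Coord = tuple[int, int]
Edge = tuple[Coord, Coord]

def tokenize_edge(
    edge: Edge,
    connected: bool,
    connection_token_ordinal: int = 2,  # 0, 1, or 2: where the connection/wall token goes
) -> list[str]:
    # tokenize one lattice edge as two coordinate tokens plus a connection-or-wall token
    a, b = sorted(edge)  # "Sorted" permutation: the smaller coordinate comes first
    conn_tok: str = "<-->" if connected else "<XX>"
    tokens: list[str] = [f"({a[0]},{a[1]})", f"({b[0]},{b[1]})"]
    tokens.insert(connection_token_ordinal, conn_tok)
    return tokens

# a connection between (0,1) and (0,0), written smaller-coordinate-first
assert tokenize_edge(((0, 1), (0, 0)), connected=True) == ["(0,0)", "(0,1)", "<-->"]
```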

### Path {#pathtokenizer}

The `_PathTokenizer` object controls this token region. Paths are all represented as a sequence of steps moving from the start to the end position.

- `_StepSize`: Specifies the size of each step
    - **Singles**: Every coordinate traversed between start and end is directly represented
    - **Forks**: Only coordinates at forking points in the maze are represented. The paths between forking points are implicit. Using this option might train models more directly to represent forking points differently from coordinates where the maze connectivity implies an obvious next step in the path.
- `_StepTokenizer`: Specifies how an individual step is represented
    - **Coordinate**: The coordinates of each step are directly tokenized using a `_CoordTokenizer`
    - **Cardinal direction**: A single token corresponding to the cardinal direction taken at the starting position of that step. E.g., `NORTH`, `SOUTH`. If using a `_StepSize` other than **Singles**, this direction may not correspond to the final direction traveled to arrive at the end position of the step.
    - **Relative direction**: A single token corresponding to the first-person perspective relative direction taken at the starting position of that step. E.g., `RIGHT`, `LEFT`.
    - **Distance**: A single token corresponding to the number of coordinate positions traversed in that step. E.g., using a `_StepSize` of **Singles**, the **Distance** token would be the same for each step, corresponding to a distance of 1 coordinate. This option is only of interest in combination with a `_StepSize` other than **Singles**.

A `_PathTokenizer` contains a sequence of one or more unique `_StepTokenizer` objects. Different step representations may be mixed and permuted, allowing for investigation of model representations of multiple aspects of a maze solution at once.
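
As an example of mixing step representations, here is a hypothetical sketch of a path whose single-coordinate steps each contribute both a **Coordinate** token and a **Cardinal direction** token; the direction names and the row-increases-southward convention are assumptions made for illustration, not the library's actual behavior.

```python
Coord = tuple[int, int]

def cardinal(frm: Coord, to: Coord) -> str:
    # cardinal-direction token for a single-coordinate step (assumes row index grows southward)
    delta = (to[0] - frm[0], to[1] - frm[1])
    return {(-1, 0): "NORTH", (1, 0): "SOUTH", (0, 1): "EAST", (0, -1): "WEST"}[delta]

def tokenize_path(path: list[Coord]) -> list[str]:
    # each step contributes a coordinate token and a cardinal-direction token
    tokens: list[str] = [f"({path[0][0]},{path[0][1]})"]  # start position
    for frm, to in zip(path[:-1], path[1:]):
        tokens.append(f"({to[0]},{to[1]})")  # Coordinate step tokenizer
        tokens.append(cardinal(frm, to))  # Cardinal-direction step tokenizer
    return tokens

assert tokenize_path([(0, 0), (0, 1), (1, 1)]) == ["(0,0)", "(0,1)", "EAST", "(1,1)", "SOUTH"]
```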

"""

__all__ = [
    # modules
    "all_instances",
    "all_tokenizers",
    "element_base",
    "elements",
    "fst_load",
    "fst",
    "hashing",
    "maze_tokenizer_modular",
    "save_hashes",
]