Coverage for maze_dataset/tokenization/modular/__init__.py: 100%

1 statements  


1"""implements `ModularMazeTokenizer` and related code 

2 

3the structure of a typical `MazeTokenizerModular` is something like this: 

```
+----------------------------------------------------+
|                MazeTokenizerModular                |
| +------------------------------------------------+ |
| |                _PromptSequencer                | |
| | +-----------------------------+                | |
| | |       _CoordTokenizer       |                | |
| | +-----------------------------+                | |
| | +------------------------------------+         | |
| | |         _AdjListTokenizer          |         | |
| | | +-----------+ +-------------+      |         | |
| | | |_EdgeSubset| |_EdgeGrouping|      |         | |
| | | +-----------+ +-------------+      |         | |
| | |          +-------------+           |         | |
| | |          |_EdgePermuter|           |         | |
| | |          +-------------+           |         | |
| | +------------------------------------+         | |
| | +-----------------------------+                | |
| | |      _TargetTokenizer       |                | |
| | +-----------------------------+                | |
| | +------------------------------------------+   | |
| | |              _PathTokenizer              |   | |
| | | +---------------+ +----------------+     |   | |
| | | |   _StepSize   | | _StepTokenizer |     |   | |
| | | +---------------+ +----------------+     |   | |
| | |                   | _StepTokenizer |     |   | |
| | |                   +----------------+     |   | |
| | |                            :             |   | |
| | +------------------------------------------+   | |
| +------------------------------------------------+ |
+----------------------------------------------------+
```

Optional delimiter tokens may be added in many places in the output. Delimiter options are all configured using the parameters named `pre`, `intra`, and `post` in various `_TokenizerElement` classes. Each option controls a unique delimiter token.
Here we describe each `_TokenizerElement` and the behaviors they support. We also discuss some of the model behaviors and properties that may be investigated using these options.
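
To make the `pre`/`intra`/`post` idea concrete, here is a minimal sketch of how such flags could insert delimiter tokens around a group of tokens. The function name and the delimiter strings are hypothetical illustrations, not the library API.

```python
def with_delimiters(
    tokens: list[str],
    pre: str | None = None,
    intra: str | None = None,
    post: str | None = None,
) -> list[str]:
    # optionally add a delimiter before, between, and after a group of tokens
    out: list[str] = []
    if pre is not None:
        out.append(pre)  # delimiter before the group
    for idx, tok in enumerate(tokens):
        if idx > 0 and intra is not None:
            out.append(intra)  # delimiter between adjacent tokens
        out.append(tok)
    if post is not None:
        out.append(post)  # delimiter after the group
    return out

# e.g. a row/column coordinate pair with all three delimiters enabled
assert with_delimiters(["1", "2"], pre="(", intra=",", post=")") == ["(", "1", ",", "2", ")"]
```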

### Coordinates {#coordtokenizer}

The `_CoordTokenizer` object controls how coordinates in the lattice are represented across all token regions. Options include the following (a toy sketch follows the list):

- **Unique tokens**: Each coordinate is represented as a single unique token `"(i,j)"`
- **Coordinate tuple tokens**: Each coordinate is represented as a sequence of 2 tokens, respectively encoding the row and column positions: `["i", ",", "j"]`
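
As a rough illustration of the difference between the two options (the exact token strings here are assumptions, not the real vocabulary):

```python
def coord_unique_token(i: int, j: int) -> list[str]:
    # unique-token option: the whole coordinate is a single token
    return [f"({i},{j})"]

def coord_tuple_tokens(i: int, j: int) -> list[str]:
    # coordinate-tuple option: separate row and column tokens
    return [str(i), ",", str(j)]

assert coord_unique_token(1, 2) == ["(1,2)"]
assert coord_tuple_tokens(1, 2) == ["1", ",", "2"]
```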

### Adjacency List {#adjlisttokenizer}

The `_AdjListTokenizer` object controls this token region. All tokenizations represent the maze connectivity as a sequence of connections or walls between pairs of adjacent coordinates in the lattice. A toy sketch of one such tokenization follows the list below.

- `_EdgeSubset`: Specifies the subset of lattice edges to be tokenized
    - **All edges**: Every edge in the lattice
    - **Connections**: Only edges which contain a connection
    - **Walls**: Only edges which contain a wall
- `_EdgePermuter`: Specifies how to sequence the two coordinates in each lattice edge
    - **Random**
    - **Sorted**: The smaller coordinate always comes first
    - **Both permutations**: Each edge is represented twice, once with each permutation. This option attempts to represent connections in a more directionally symmetric manner. Including only one permutation of each edge may affect models' internal representations of edges, treating a path traversing the edge differently depending on whether the coordinate sequence in the path matches the sequence in the adjacency list.
- `shuffle_d0`: Whether to shuffle the edges randomly or sort them in the output by their first coordinate
- `connection_token_ordinal`: Location in the sequence of the token representing whether the edge is a connection or a wall
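
Below is a minimal, hypothetical sketch of a single edge tokenized under the **Sorted** permutation, with the connection/wall token placed last via a `connection_token_ordinal`-style parameter; the coordinate and connection token strings are illustrative only and do not reflect the library's actual vocabulary or API.

```python
Coord = tuple[int, int]
Edge = tuple[Coord, Coord]

def tokenize_edge(
    edge: Edge,
    connected: bool,
    connection_token_ordinal: int = 2,  # 0, 1, or 2: where the connection/wall token goes
) -> list[str]:
    # tokenize one lattice edge as two coordinate tokens plus a connection-or-wall token
    a, b = sorted(edge)  # "Sorted" permutation: the smaller coordinate comes first
    conn_tok: str = "<-->" if connected else "<XX>"
    tokens: list[str] = [f"({a[0]},{a[1]})", f"({b[0]},{b[1]})"]
    tokens.insert(connection_token_ordinal, conn_tok)
    return tokens

# a connection between (0,1) and (0,0), written smaller-coordinate-first
assert tokenize_edge(((0, 1), (0, 0)), connected=True) == ["(0,0)", "(0,1)", "<-->"]
```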

### Path {#pathtokenizer}

The `_PathTokenizer` object controls this token region. Paths are all represented as a sequence of steps moving from the start to the end position.

- `_StepSize`: Specifies the size of each step
    - **Singles**: Every coordinate traversed between start and end is directly represented
    - **Forks**: Only coordinates at forking points in the maze are represented. The paths between forking points are implicit. Using this option might train models more directly to represent forking points differently from coordinates where the maze connectivity implies an obvious next step in the path.
- `_StepTokenizer`: Specifies how an individual step is represented
    - **Coordinate**: The coordinates of each step are directly tokenized using a `_CoordTokenizer`
    - **Cardinal direction**: A single token corresponding to the cardinal direction taken at the starting position of that step. E.g., `NORTH`, `SOUTH`. If using a `_StepSize` other than **Singles**, this direction may not correspond to the final direction traveled to arrive at the end position of the step.
    - **Relative direction**: A single token corresponding to the first-person perspective relative direction taken at the starting position of that step. E.g., `RIGHT`, `LEFT`.
    - **Distance**: A single token corresponding to the number of coordinate positions traversed in that step. E.g., using a `_StepSize` of **Singles**, the **Distance** token would be the same for each step, corresponding to a distance of 1 coordinate. This option is only of interest in combination with a `_StepSize` other than **Singles**.

A `_PathTokenizer` contains a sequence of one or more unique `_StepTokenizer` objects. Different step representations may be mixed and permuted, allowing for investigation of model representations of multiple aspects of a maze solution at once.
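
As an example of mixing step representations, here is a hypothetical sketch of a path whose single-coordinate steps each contribute both a **Coordinate** token and a **Cardinal direction** token; the direction names and the row-increases-southward convention are assumptions made for illustration, not the library's actual behavior.

```python
Coord = tuple[int, int]

def cardinal(frm: Coord, to: Coord) -> str:
    # cardinal-direction token for a single-coordinate step (assumes row index grows southward)
    delta = (to[0] - frm[0], to[1] - frm[1])
    return {(-1, 0): "NORTH", (1, 0): "SOUTH", (0, 1): "EAST", (0, -1): "WEST"}[delta]

def tokenize_path(path: list[Coord]) -> list[str]:
    # each step contributes a coordinate token and a cardinal-direction token
    tokens: list[str] = [f"({path[0][0]},{path[0][1]})"]  # start position
    for frm, to in zip(path[:-1], path[1:]):
        tokens.append(f"({to[0]},{to[1]})")  # Coordinate step tokenizer
        tokens.append(cardinal(frm, to))  # Cardinal-direction step tokenizer
    return tokens

assert tokenize_path([(0, 0), (0, 1), (1, 1)]) == ["(0,0)", "(0,1)", "EAST", "(1,1)", "SOUTH"]
```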

"""

__all__ = [
    # modules
    "all_instances",
    "all_tokenizers",
    "element_base",
    "elements",
    "fst_load",
    "fst",
    "hashing",
    "maze_tokenizer_modular",
    "save_hashes",
]