docs for maze-dataset v1.1.0

maze-dataset

This package provides utilities for generating, filtering, solving, visualizing, and processing mazes for training ML systems. It was built primarily for the maze-transformer interpretability project. You can find our paper on it here: http://arxiv.org/abs/2309.10498

This package includes a variety of maze generation algorithms, including randomized depth first search, Wilson’s algorithm for uniform spanning trees, and percolation. Datasets can be filtered to select mazes of a certain length or complexity, remove duplicates, and satisfy custom properties. A variety of output formats for visualization and training ML models are provided.

[Figures: maze generated via percolation; maze generated via constrained randomized depth first search; maze with random heatmap; MazePlot with solution]

Installation

This package is available on PyPI, and can be installed via

pip install maze-dataset

Docs

The full hosted documentation is available at https://understanding-search.github.io/maze-dataset/.

Additionally:

Usage

Creating a dataset

To create a MazeDataset, which inherits from torch.utils.data.Dataset, you first create a MazeDatasetConfig:

from maze_dataset import MazeDataset, MazeDatasetConfig
from maze_dataset.generation import LatticeMazeGenerators
cfg: MazeDatasetConfig = MazeDatasetConfig(
    name="test", # name is only for you to keep track of things
    grid_n=5, # number of rows/columns in the lattice
    n_mazes=4, # number of mazes to generate
    maze_ctor=LatticeMazeGenerators.gen_dfs, # algorithm to generate the maze
    maze_ctor_kwargs=dict(do_forks=False), # additional parameters to pass to the maze generation algorithm
)

and then pass this config to the MazeDataset.from_config method:

dataset: MazeDataset = MazeDataset.from_config(cfg)

This method can check whether a dataset with a matching config hash already exists in the expected location on your filesystem, and load it if so. It can also generate a dataset on the fly if needed.
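The load-or-generate idea can be pictured with a small standard-library sketch. This is not the library's implementation: `config_hash`, `load_or_generate`, `fake_generate`, the JSON file layout, and the SHA-256-based file name are all hypothetical, chosen only to show how a cached dataset might be keyed on a hash of its config.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def config_hash(cfg: dict) -> str:
    """Stable hash of a JSON-serializable config (hypothetical scheme)."""
    canonical = json.dumps(cfg, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def load_or_generate(cfg: dict, cache_dir: Path, generate) -> tuple[list, bool]:
    """Return (data, loaded_from_disk); generate and save on a cache miss."""
    path = cache_dir / f"{cfg['name']}-{config_hash(cfg)}.json"
    if path.exists():
        return json.loads(path.read_text()), True
    data = generate(cfg)
    path.write_text(json.dumps(data))
    return data, False


def fake_generate(cfg: dict) -> list:
    """Stand-in generator: one placeholder record per requested maze."""
    return [{"maze_index": i, "grid_n": cfg["grid_n"]} for i in range(cfg["n_mazes"])]


cfg = {"name": "test", "grid_n": 5, "n_mazes": 4}
with tempfile.TemporaryDirectory() as tmp:
    first, first_cached = load_or_generate(cfg, Path(tmp), fake_generate)
    second, second_cached = load_or_generate(cfg, Path(tmp), fake_generate)
```

The first call misses the cache and generates; the second call finds the file written by the first and loads it instead.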

Conversions to useful formats

The elements of the dataset are SolvedMaze objects:

>>> m = dataset[0]
>>> type(m)
maze_dataset.maze.lattice_maze.SolvedMaze

These can be converted to a variety of formats:

# visual representation as ascii art
m.as_ascii() 
# RGB image, optionally without solution or endpoints, suitable for CNNs
m.as_pixels() 
# text format for autoregressive transformers
from maze_dataset.tokenization import MazeTokenizerModular, TokenizationMode
m.as_tokens(maze_tokenizer=MazeTokenizerModular(
    tokenization_mode=TokenizationMode.AOTP_UT_rasterized, max_grid_size=100,
))
# advanced visualization with many features
from maze_dataset.plotting import MazePlot
MazePlot(m).plot()

[Figure: textual and visual output formats]

Development

This project uses Poetry for development. To install with dev requirements, run

poetry install --with dev

A makefile is included to simplify common development tasks:

Citing

If you use this code in your research, please cite our paper:

@misc{maze-dataset,
    title={A Configurable Library for Generating and Manipulating Maze Datasets}, 
    author={Michael Igorevich Ivanitskiy and Rusheb Shah and Alex F. Spies and Tilman Räuker and Dan Valentine and Can Rager and Lucia Quirke and Chris Mathwin and Guillaume Corlouer and Cecilia Diniz Behn and Samy Wu Fung},
    year={2023},
    eprint={2309.10498},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={http://arxiv.org/abs/2309.10498}
}

maze_dataset

class SolvedMaze(maze_dataset.maze.lattice_maze.TargetedLatticeMaze):

Stores a maze and a solution

SolvedMaze

(
    connection_list: jaxtyping.Bool[ndarray, 'lattice_dim=2 row col'],
    solution: jaxtyping.Int8[ndarray, 'coord row_col'],
    generation_meta: dict | None = None,
    start_pos: jaxtyping.Int8[ndarray, 'row_col'] | None = None,
    end_pos: jaxtyping.Int8[ndarray, 'row_col'] | None = None,
    allow_invalid: bool = False
)

def get_solution_tokens

(self) -> list[str | tuple[int, int]]

def from_lattice_maze

(
    cls,
    lattice_maze: maze_dataset.maze.lattice_maze.LatticeMaze,
    solution: list[tuple[int, int]]
) -> maze_dataset.maze.lattice_maze.SolvedMaze

def from_targeted_lattice_maze

(
    cls,
    targeted_lattice_maze: maze_dataset.maze.lattice_maze.TargetedLatticeMaze,
    solution: list[tuple[int, int]] | None = None
) -> maze_dataset.maze.lattice_maze.SolvedMaze

solves the given targeted lattice maze and returns a SolvedMaze

def get_solution_forking_points

(
    self,
    always_include_endpoints: bool = False
) -> tuple[list[int], jaxtyping.Int8[ndarray, 'coord row_col']]

coordinates and their indices from the solution where a fork is present

def get_solution_path_following_points

(self) -> tuple[list[int], jaxtyping.Int8[ndarray, 'coord row_col']]

coordinates from the solution where there is only a single (non-backtracking) point to move to

returns the complement of get_solution_forking_points from the path
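A minimal sketch of how a solution path can be split into forking and path-following points. It is illustrative only and makes an assumption about the definition: a step counts as a fork when, ignoring the cell just left, more than one connected neighbor is available. The `NEIGHBORS` table and `split_solution` are hypothetical names, and unlike the methods above this sketch returns indices only.

```python
# Neighbor lookup for a small hand-written maze (hypothetical example data):
#   (0,0) - (0,1) - (0,2)
#             |
#           (1,1)
NEIGHBORS: dict[tuple[int, int], list[tuple[int, int]]] = {
    (0, 0): [(0, 1)],
    (0, 1): [(0, 0), (0, 2), (1, 1)],
    (0, 2): [(0, 1)],
    (1, 1): [(0, 1)],
}


def split_solution(path, neighbors):
    """Split path indices into (forking, path-following) index lists."""
    forks, following = [], []
    for idx, coord in enumerate(path):
        options = len(neighbors[coord])
        if idx > 0:
            options -= 1  # do not count backtracking to the previous cell
        (forks if options > 1 else following).append(idx)
    return forks, following


solution = [(0, 0), (0, 1), (1, 1)]
fork_idxs, follow_idxs = split_solution(solution, NEIGHBORS)
```

By construction the two index lists are complements of each other within the path.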

def serialize

(self) -> dict[str, typing.Any]

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class MazeDatasetConfig(maze_dataset.dataset.dataset.GPTDatasetConfig):

config object which is passed to MazeDataset.from_config to generate or load a dataset

MazeDatasetConfig

(
    *,
    name: str,
    seq_len_min: int = 1,
    seq_len_max: int = 512,
    seed: int | None = 42,
    applied_filters: list[dict[typing.Literal['name', 'args', 'kwargs'], str | list | dict]] = <factory>,
    grid_n: int,
    n_mazes: int,
    maze_ctor: Callable = <function LatticeMazeGenerators.gen_dfs>,
    maze_ctor_kwargs: dict = <factory>,
    endpoint_kwargs: dict[typing.Literal['except_when_invalid', 'allowed_start', 'allowed_end', 'deadend_start', 'deadend_end'], bool | None | list[tuple[int, int]]] = <factory>
)

def maze_ctor

(
    grid_shape: jaxtyping.Int8[ndarray, 'row_col'],
    lattice_dim: int = 2,
    accessible_cells: int | float | None = None,
    max_tree_depth: int | float | None = None,
    do_forks: bool = True,
    randomized_stack: bool = False,
    start_coord: jaxtyping.Int8[ndarray, 'row_col'] | None = None
) -> maze_dataset.maze.lattice_maze.LatticeMaze

generate a lattice maze using depth first search, iterative

Arguments

algorithm

  1. Choose the initial cell, mark it as visited and push it to the stack
  2. While the stack is not empty
     1. Pop a cell from the stack and make it a current cell
     2. If the current cell has any neighbours which have not been visited
        1. Push the current cell to the stack
        2. Choose one of the unvisited neighbours
        3. Remove the wall between the current cell and the chosen cell
        4. Mark the chosen cell as visited and push it to the stack

def stable_hash_cfg

(self) -> int

def to_fname

(self) -> str

convert config to a filename

def summary

(self) -> dict

return a summary of the config

def serialize

(self) -> dict[str, typing.Any]

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class MazeDataset(typing.Generic[+T_co]):

a maze dataset class. This is a collection of solved mazes, and should be initialized via MazeDataset.from_config

MazeDataset

(
    cfg: maze_dataset.dataset.maze_dataset.MazeDatasetConfig,
    mazes: Sequence[maze_dataset.maze.lattice_maze.SolvedMaze],
    generation_metadata_collected: dict | None = None
)

def data_hash

(self) -> int

def as_tokens

(
    self,
    maze_tokenizer,
    limit: int | None = None,
    join_tokens_individual_maze: bool = False
) -> list[list[str]] | list[str]

return the dataset as tokens according to the passed maze_tokenizer

the maze_tokenizer should be either a MazeTokenizer or a MazeTokenizerModular

if join_tokens_individual_maze is True, then the tokens of each maze are joined with a space, and the result is a list of strings. i.e.:

>>> dataset.as_tokens(join_tokens_individual_maze=False)
[["a", "b", "c"], ["d", "e", "f"]]
>>> dataset.as_tokens(join_tokens_individual_maze=True)
["a b c", "d e f"]
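The two output shapes can be reproduced with plain lists. This sketch only mimics the joining behaviour described above; `join_tokens` is a hypothetical helper, not part of the package, and no tokenizer is involved.

```python
def join_tokens(token_lists: list[list[str]], join_individual: bool = False):
    """Return nested token lists, or one space-joined string per maze."""
    if join_individual:
        return [" ".join(tokens) for tokens in token_lists]
    return token_lists


raw = [["a", "b", "c"], ["d", "e", "f"]]
unjoined = join_tokens(raw, join_individual=False)
joined = join_tokens(raw, join_individual=True)
```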

def generate

(
    cls,
    cfg: maze_dataset.dataset.maze_dataset.MazeDatasetConfig,
    gen_parallel: bool = False,
    pool_kwargs: dict | None = None,
    verbose: bool = False
) -> maze_dataset.dataset.maze_dataset.MazeDataset

generate a maze dataset given a config and some generation parameters

def download

(
    cls,
    cfg: maze_dataset.dataset.maze_dataset.MazeDatasetConfig,
    **kwargs
) -> maze_dataset.dataset.maze_dataset.MazeDataset

def load

(
    cls,
    data: Union[bool, int, float, str, list, Dict[str, Any], NoneType]
) -> maze_dataset.dataset.maze_dataset.MazeDataset

load from zanj/json

def serialize

(self) -> Union[bool, int, float, str, list, Dict[str, Any], NoneType]

serialize to zanj/json

def update_self_config

(self)

update the config to match the current state of the dataset (number of mazes, such as after filtering)

def custom_maze_filter

(
    self,
    method: Callable[[maze_dataset.maze.lattice_maze.SolvedMaze], bool],
    **kwargs
) -> maze_dataset.dataset.maze_dataset.MazeDataset

filter the dataset using a custom method

Inherited Members

class MazeDatasetCollection(typing.Generic[+T_co]):

a collection of maze datasets

MazeDatasetCollection

(
    cfg: maze_dataset.dataset.collected_dataset.MazeDatasetCollectionConfig,
    maze_datasets: list[maze_dataset.dataset.maze_dataset.MazeDataset],
    generation_metadata_collected: dict | None = None
)

def generate

(
    cls,
    cfg: maze_dataset.dataset.collected_dataset.MazeDatasetCollectionConfig,
    **kwargs
) -> maze_dataset.dataset.collected_dataset.MazeDatasetCollection

def download

(
    cls,
    cfg: maze_dataset.dataset.collected_dataset.MazeDatasetCollectionConfig,
    **kwargs
) -> maze_dataset.dataset.collected_dataset.MazeDatasetCollection

def serialize

(self) -> Union[bool, int, float, str, list, Dict[str, Any], NoneType]

def load

(
    cls,
    data: Union[bool, int, float, str, list, Dict[str, Any], NoneType]
) -> maze_dataset.dataset.collected_dataset.MazeDatasetCollection

def as_tokens

(
    self,
    maze_tokenizer,
    limit: int | None = None,
    join_tokens_individual_maze: bool = False
) -> list[list[str]] | list[str]

return the dataset as tokens

if join_tokens_individual_maze is True, then the tokens of each maze are joined with a space, and the result is a list of strings. i.e.:

>>> dataset.as_tokens(join_tokens_individual_maze=False)
[["a", "b", "c"], ["d", "e", "f"]]
>>> dataset.as_tokens(join_tokens_individual_maze=True)
["a b c", "d e f"]

def update_self_config

(self) -> None

update the config of the dataset to match the actual data, if needed

for example, adjust number of mazes after filtering

Inherited Members

class MazeDatasetCollectionConfig(maze_dataset.dataset.dataset.GPTDatasetConfig):

maze dataset collection configuration, including tokenizers and shuffle

MazeDatasetCollectionConfig

(
    *,
    name: str,
    seq_len_min: int = 1,
    seq_len_max: int = 512,
    seed: int | None = 42,
    applied_filters: list[dict[typing.Literal['name', 'args', 'kwargs'], str | list | dict]] = <factory>,
    maze_dataset_configs: list[maze_dataset.dataset.maze_dataset.MazeDatasetConfig]
)

def summary

(self) -> dict

return a summary of the config

def stable_hash_cfg

(self) -> int

def to_fname

(self) -> str

convert config to a filename

def serialize

(self) -> dict[str, typing.Any]

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class TargetedLatticeMaze(maze_dataset.maze.lattice_maze.LatticeMaze):

A LatticeMaze with a start and end position

TargetedLatticeMaze

(
    *,
    connection_list: jaxtyping.Bool[ndarray, 'lattice_dim=2 row col'],
    generation_meta: dict | None = None,
    start_pos: jaxtyping.Int8[ndarray, 'row_col'],
    end_pos: jaxtyping.Int8[ndarray, 'row_col']
)

def get_start_pos_tokens

(self) -> list[str | tuple[int, int]]

def get_end_pos_tokens

(self) -> list[str | tuple[int, int]]

def from_lattice_maze

(
    cls,
    lattice_maze: maze_dataset.maze.lattice_maze.LatticeMaze,
    start_pos: jaxtyping.Int8[ndarray, 'row_col'],
    end_pos: jaxtyping.Int8[ndarray, 'row_col']
) -> maze_dataset.maze.lattice_maze.TargetedLatticeMaze

def serialize

(self) -> dict[str, typing.Any]

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class LatticeMaze(muutils.json_serialize.serializable_dataclass.SerializableDataclass):

lattice maze (nodes on a lattice, connections only to neighboring nodes)

Connection List represents which nodes (N) are connected in each direction.

First and second elements represent downward and rightward connections, respectively.

Example:

  Connection list:
    [
      [ # down
        [F T],
        [F F]
      ],
      [ # right
        [T F],
        [T F]
      ]
    ]

  Nodes with connections:
    N T N F
    F   T
    N T N F
    F   F

  Graph:
    N - N
        |
    N - N

Note: the bottom row connections going down, and the right-hand connections going right, will always be False.
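To make the representation concrete, here is a sketch that stores the example above as nested Python lists and looks up whether two adjacent nodes are connected. It follows the "# down" / "# right" labels in the example (index 0 is down, index 1 is right). The library itself stores a boolean array; the nested-list layout and this `nodes_connected` helper are simplifications for illustration.

```python
T, F = True, False
CONNECTION_LIST = [
    [[F, T], [F, F]],  # down
    [[T, F], [T, F]],  # right
]


def nodes_connected(conn, a, b):
    """Return True if adjacent cells a and b, given as (row, col), share an open edge."""
    (ra, ca), (rb, cb) = a, b
    if abs(ra - rb) + abs(ca - cb) != 1:
        return False  # not lattice neighbors
    dim = 0 if ra != rb else 1  # 0: vertical move, 1: horizontal move
    r, c = min(a, b)  # the upper / left cell stores the connection
    return conn[dim][r][c]
```

Because each edge is stored only once, on the upper or left cell, the bottom row's downward entries and the right column's rightward entries are never used.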

LatticeMaze

(
    *,
    connection_list: jaxtyping.Bool[ndarray, 'lattice_dim=2 row col'],
    generation_meta: dict | None = None
)

def heuristic

(a: tuple[int, int], b: tuple[int, int]) -> float

return manhattan distance between two points

def nodes_connected

(
    self,
    a: jaxtyping.Int8[ndarray, 'row_col'],
    b: jaxtyping.Int8[ndarray, 'row_col'],
    /
) -> bool

returns whether two nodes are connected

def is_valid_path

(
    self,
    path: jaxtyping.Int8[ndarray, 'coord row_col'],
    empty_is_valid: bool = False
) -> bool

check if a path is valid

def coord_degrees

(self) -> jaxtyping.Int8[ndarray, 'row col']

Returns an array with the connectivity degree of each coord. I.e., how many neighbors each coord has.
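As a rough illustration, the degree of each coord can be counted directly from a (down, right) connection list. The sketch below uses nested lists and the 2x2 example maze from the LatticeMaze description. It is not the library's array-based implementation, and this standalone `coord_degrees` function is only a stand-in for the method.

```python
T, F = True, False
CONNECTION_LIST = [
    [[F, T], [F, F]],  # down
    [[T, F], [T, F]],  # right
]


def coord_degrees(conn):
    """Count open edges touching each cell of a (down, right) connection list."""
    down, right = conn
    n_rows, n_cols = len(down), len(down[0])
    degrees = [[0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            if down[r][c] and r + 1 < n_rows:
                degrees[r][c] += 1
                degrees[r + 1][c] += 1
            if right[r][c] and c + 1 < n_cols:
                degrees[r][c] += 1
                degrees[r][c + 1] += 1
    return degrees


degrees = coord_degrees(CONNECTION_LIST)
```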

def get_coord_neighbors

(
    self,
    c: jaxtyping.Int8[ndarray, 'row_col']
) -> jaxtyping.Int8[ndarray, 'coord row_col']

Returns an array of the neighboring, connected coords of c.

def gen_connected_component_from

(
    self,
    c: jaxtyping.Int8[ndarray, 'row_col']
) -> jaxtyping.Int8[ndarray, 'coord row_col']

return the connected component from a given coordinate
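One common way to collect a connected component is a breadth-first search over open edges. The sketch below assumes a nested-list (down, right) connection list; the `neighbors` and `connected_component_from` helpers are hypothetical and may differ from how the library traverses the maze.

```python
from collections import deque

T, F = True, False


def neighbors(conn, cell):
    """Yield cells connected to `cell` in a (down, right) connection list."""
    down, right = conn
    r, c = cell
    if r + 1 < len(down) and down[r][c]:
        yield (r + 1, c)
    if r > 0 and down[r - 1][c]:
        yield (r - 1, c)
    if c + 1 < len(down[0]) and right[r][c]:
        yield (r, c + 1)
    if c > 0 and right[r][c - 1]:
        yield (r, c - 1)


def connected_component_from(conn, start):
    """Breadth-first search collecting every cell reachable from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        for nxt in neighbors(conn, cell):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(seen)


# only the top row is joined; the bottom row cells are isolated
conn = [
    [[F, F], [F, F]],  # down
    [[T, F], [F, F]],  # right
]
component = connected_component_from(conn, (0, 0))
```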

def find_shortest_path

(
    self,
    c_start: tuple[int, int],
    c_end: tuple[int, int]
) -> jaxtyping.Int8[ndarray, 'coord row_col']

find the shortest path between two coordinates, using A*
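A compact A* sketch with a Manhattan-distance heuristic, run on the 2x2 example maze from the LatticeMaze description. It works on nested lists and returns a list of tuples, whereas the method above returns a coordinate array. The helper names are illustrative and the library's internals may differ.

```python
import heapq

T, F = True, False
CONNECTION_LIST = [
    [[F, T], [F, F]],  # down
    [[T, F], [T, F]],  # right
]


def manhattan(a, b):
    """Manhattan distance between two (row, col) cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def neighbors(conn, cell):
    """Yield cells connected to `cell` in a (down, right) connection list."""
    down, right = conn
    r, c = cell
    if r + 1 < len(down) and down[r][c]:
        yield (r + 1, c)
    if r > 0 and down[r - 1][c]:
        yield (r - 1, c)
    if c + 1 < len(down[0]) and right[r][c]:
        yield (r, c + 1)
    if c > 0 and right[r][c - 1]:
        yield (r, c - 1)


def find_shortest_path(conn, start, end):
    """A* search; returns the cells from start to end, or [] if unreachable."""
    open_heap = [(manhattan(start, end), 0, start)]
    came_from = {start: None}
    g_score = {start: 0}
    while open_heap:
        _, g, current = heapq.heappop(open_heap)
        if current == end:
            path = []
            while current is not None:
                path.append(current)
                current = came_from[current]
            return path[::-1]
        if g > g_score[current]:
            continue  # stale heap entry, a cheaper route was already found
        for nxt in neighbors(conn, current):
            new_g = g + 1
            if new_g < g_score.get(nxt, float("inf")):
                g_score[nxt] = new_g
                came_from[nxt] = current
                heapq.heappush(open_heap, (new_g + manhattan(nxt, end), new_g, nxt))
    return []


path = find_shortest_path(CONNECTION_LIST, (0, 0), (1, 0))
```

The start and end cells are vertically adjacent but not connected, so the path has to go around through the right-hand column.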

def get_nodes

(self) -> jaxtyping.Int8[ndarray, 'coord row_col']

return a list of all nodes in the maze

def get_connected_component

(self) -> jaxtyping.Int8[ndarray, 'coord row_col']

get the largest (and assumed only nonsingular) connected component of the maze

TODO: other connected components?

def generate_random_path

(
    self,
    except_when_invalid: bool = True,
    allowed_start: list[tuple[int, int]] | None = None,
    allowed_end: list[tuple[int, int]] | None = None,
    deadend_start: bool = False,
    deadend_end: bool = False,
    endpoints_not_equal: bool = False
) -> jaxtyping.Int8[ndarray, 'coord row_col']

return a path between randomly chosen start and end nodes within the connected component

Note that setting special conditions on start and end positions might cause the same position to be selected as both start and end.

Parameters:

Returns:

Raises:

def as_adj_list

(
    self,
    shuffle_d0: bool = True,
    shuffle_d1: bool = True
) -> jaxtyping.Int8[ndarray, 'conn start_end coord']

def from_adj_list

(
    cls,
    adj_list: jaxtyping.Int8[ndarray, 'conn start_end coord']
) -> maze_dataset.maze.lattice_maze.LatticeMaze

create a LatticeMaze from a list of connections

Note: This has only been tested for square mazes. Might need to change some things if rectangular mazes are needed.
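The conversion between an adjacency list and a connection list can be sketched with plain Python containers. This assumes each adjacency entry is a (start, end) pair of (row, col) coords, matching the 'conn start_end coord' shape in the signature. The explicit `n_rows` and `n_cols` arguments and both helper functions are assumptions of this sketch, not the library's signature.

```python
def from_adj_list(adj_list, n_rows, n_cols):
    """Build a (down, right) connection list from (start, end) pairs."""
    down = [[False] * n_cols for _ in range(n_rows)]
    right = [[False] * n_cols for _ in range(n_rows)]
    for start, end in adj_list:
        # order the pair so the upper / left cell comes first
        (r0, c0), (r1, c1) = sorted([tuple(start), tuple(end)])
        if (r1 - r0, c1 - c0) == (1, 0):
            down[r0][c0] = True
        elif (r1 - r0, c1 - c0) == (0, 1):
            right[r0][c0] = True
        else:
            raise ValueError(f"not lattice neighbors: {start} -> {end}")
    return [down, right]


def as_adj_list(conn):
    """Inverse direction: list every open edge as a [start, end] pair."""
    down, right = conn
    edges = []
    for r, row in enumerate(down):
        for c, is_open in enumerate(row):
            if is_open:
                edges.append([(r, c), (r + 1, c)])
            if right[r][c]:
                edges.append([(r, c), (r, c + 1)])
    return edges


edges = [[(0, 1), (0, 0)], [(0, 1), (1, 1)], [(1, 0), (1, 1)]]
conn = from_adj_list(edges, n_rows=2, n_cols=2)
```

Edge direction and ordering in the input do not matter, since each pair is normalized before being written into the connection list.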

def as_adj_list_tokens

(self) -> list[str | tuple[int, int]]

def as_tokens

(
    self,
    maze_tokenizer: maze_dataset.tokenization.maze_tokenizer.MazeTokenizer | maze_dataset.tokenization.maze_tokenizer.TokenizationMode | maze_dataset.tokenization.maze_tokenizer.MazeTokenizerModular
) -> list[str]

serialize maze and solution to tokens

def from_tokens

(
    cls,
    tokens: list[str],
    maze_tokenizer: maze_dataset.tokenization.maze_tokenizer.MazeTokenizer | maze_dataset.tokenization.maze_tokenizer.TokenizationMode | maze_dataset.tokenization.maze_tokenizer.MazeTokenizerModular
) -> maze_dataset.maze.lattice_maze.LatticeMaze

Constructs a maze from a tokenization. Only legacy tokenizers and their MazeTokenizerModular analogs are supported.

def as_pixels

(
    self,
    show_endpoints: bool = True,
    show_solution: bool = True
) -> jaxtyping.Int[ndarray, 'x y rgb']

def from_pixels

(
    cls,
    pixel_grid: jaxtyping.Int[ndarray, 'x y rgb']
) -> maze_dataset.maze.lattice_maze.LatticeMaze

def as_ascii

(self, show_endpoints: bool = True, show_solution: bool = True) -> str

return an ASCII grid of the maze
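One plausible way to draw such a grid is to place each cell at odd coordinates of a (2n+1) by (2m+1) character array and open the characters between connected cells. The sketch below does this for the 2x2 example maze from the LatticeMaze description. The wall and open characters are assumptions, and endpoints and solutions are not drawn, so the output may not match the library's exact format.

```python
T, F = True, False
CONNECTION_LIST = [
    [[F, T], [F, F]],  # down
    [[T, F], [T, F]],  # right
]


def as_ascii(conn, wall="#", open_char=" "):
    """Render a (down, right) connection list as a character grid."""
    down, right = conn
    n_rows, n_cols = len(down), len(down[0])
    grid = [[wall] * (2 * n_cols + 1) for _ in range(2 * n_rows + 1)]
    for r in range(n_rows):
        for c in range(n_cols):
            grid[2 * r + 1][2 * c + 1] = open_char  # the cell itself
            if down[r][c] and r + 1 < n_rows:
                grid[2 * r + 2][2 * c + 1] = open_char  # passage to the cell below
            if right[r][c] and c + 1 < n_cols:
                grid[2 * r + 1][2 * c + 2] = open_char  # passage to the cell on the right
    return "\n".join("".join(row) for row in grid)


ascii_maze = as_ascii(CONNECTION_LIST)
```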

def from_ascii

(cls, ascii_str: str) -> maze_dataset.maze.lattice_maze.LatticeMaze

def serialize

(self) -> dict[str, typing.Any]

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

def set_serialize_minimal_threshold

(threshold: int | None) -> None

class LatticeMazeGenerators:

namespace for lattice maze generation algorithms

def gen_dfs

(
    grid_shape: jaxtyping.Int8[ndarray, 'row_col'],
    lattice_dim: int = 2,
    accessible_cells: int | float | None = None,
    max_tree_depth: int | float | None = None,
    do_forks: bool = True,
    randomized_stack: bool = False,
    start_coord: jaxtyping.Int8[ndarray, 'row_col'] | None = None
) -> maze_dataset.maze.lattice_maze.LatticeMaze

generate a lattice maze using depth first search, iterative

Arguments

algorithm

  1. Choose the initial cell, mark it as visited and push it to the stack
  2. While the stack is not empty
     1. Pop a cell from the stack and make it a current cell
     2. If the current cell has any neighbours which have not been visited
        1. Push the current cell to the stack
        2. Choose one of the unvisited neighbours
        3. Remove the wall between the current cell and the chosen cell
        4. Mark the chosen cell as visited and push it to the stack
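The steps above can be sketched in plain Python as follows. This is a simplified illustration, not the library's gen_dfs: it takes separate row and column counts plus a hypothetical `seed` argument, returns nested lists in (down, right) order, and ignores options such as accessible_cells, max_tree_depth, do_forks and randomized_stack.

```python
import random


def gen_dfs(n_rows, n_cols, seed=None):
    """Iterative randomized depth-first search over a rectangular lattice."""
    rng = random.Random(seed)
    down = [[False] * n_cols for _ in range(n_rows)]
    right = [[False] * n_cols for _ in range(n_rows)]
    # 1. choose the initial cell, mark it visited, push it to the stack
    start = (rng.randrange(n_rows), rng.randrange(n_cols))
    visited = {start}
    stack = [start]
    # 2. loop until the stack is empty
    while stack:
        r, c = stack.pop()
        unvisited = [
            (nr, nc)
            for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1))
            if 0 <= nr < n_rows and 0 <= nc < n_cols and (nr, nc) not in visited
        ]
        if unvisited:
            stack.append((r, c))
            nr, nc = rng.choice(unvisited)
            # "remove the wall": record the edge on the upper / left cell
            if nr != r:
                down[min(r, nr)][c] = True
            else:
                right[r][min(c, nc)] = True
            visited.add((nr, nc))
            stack.append((nr, nc))
    return [down, right]


maze = gen_dfs(5, 5, seed=0)
n_edges = sum(cell for layer in maze for row in layer for cell in row)
```

Every cell is visited exactly once and each visit after the first adds one edge, so a 5x5 grid ends up with 24 connections, i.e. a spanning tree.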

def gen_prim

(
    grid_shape: jaxtyping.Int8[ndarray, 'row_col'],
    lattice_dim: int = 2,
    accessible_cells: int | float | None = None,
    max_tree_depth: int | float | None = None,
    do_forks: bool = True,
    start_coord: jaxtyping.Int8[ndarray, 'row_col'] | None = None
) -> maze_dataset.maze.lattice_maze.LatticeMaze

def gen_wilson

(
    grid_shape: jaxtyping.Int8[ndarray, 'row_col']
) -> maze_dataset.maze.lattice_maze.LatticeMaze

Generate a lattice maze using Wilson’s algorithm.

Algorithm

Wilson’s algorithm generates an unbiased (random) maze sampled from the uniform distribution over all mazes, using loop-erased random walks. The generated maze is acyclic and all cells are part of a unique connected space. See https://en.wikipedia.org/wiki/Maze_generation_algorithm#Wilson's_algorithm
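A minimal sketch of the loop-erased random walk procedure, written against nested lists rather than the library's arrays. The function signature (separate row and column counts, a hypothetical `seed` argument) is an assumption for illustration and differs from gen_wilson above.

```python
import random


def gen_wilson(n_rows, n_cols, seed=None):
    """Uniform spanning tree via loop-erased random walks (Wilson's algorithm)."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    down = [[False] * n_cols for _ in range(n_rows)]
    right = [[False] * n_cols for _ in range(n_rows)]

    def lattice_neighbors(cell):
        r, c = cell
        candidates = ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1))
        return [(a, b) for a, b in candidates if 0 <= a < n_rows and 0 <= b < n_cols]

    in_tree = {rng.choice(cells)}
    for start in cells:
        if start in in_tree:
            continue
        # random walk until the tree is hit, erasing loops as they form
        path = [start]
        while path[-1] not in in_tree:
            step = rng.choice(lattice_neighbors(path[-1]))
            if step in path:
                path = path[: path.index(step) + 1]  # erase the loop
            else:
                path.append(step)
        # attach the loop-erased walk to the tree
        for (r0, c0), (r1, c1) in zip(path, path[1:]):
            if r0 != r1:
                down[min(r0, r1)][c0] = True
            else:
                right[r0][min(c0, c1)] = True
        in_tree.update(path)
    return [down, right]


maze = gen_wilson(4, 4, seed=1)
n_edges = sum(cell for layer in maze for row in layer for cell in row)
```

Each attached walk adds exactly as many edges as new cells, which is why the result is acyclic and spans all 16 cells with 15 connections.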

def gen_percolation

(
    grid_shape: jaxtyping.Int8[ndarray, 'row_col'],
    p: float = 0.4,
    lattice_dim: int = 2,
    start_coord: jaxtyping.Int8[ndarray, 'row_col'] | None = None
) -> maze_dataset.maze.lattice_maze.LatticeMaze

generate a lattice maze using simple percolation

note that p in the range (0.4, 0.7) gives the most interesting mazes
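In its simplest form, bond percolation opens each lattice edge independently with probability p. The sketch below shows only that core idea with nested lists. It is an assumption-laden simplification: the `seed` argument and the `count_edges` helper are hypothetical, and start_coord handling and any other post-processing done by gen_percolation are left out.

```python
import random


def gen_percolation(n_rows, n_cols, p=0.4, seed=None):
    """Open each lattice edge independently with probability p."""
    rng = random.Random(seed)
    # edges leaving the grid (bottom row down, right column right) stay False
    down = [[r + 1 < n_rows and rng.random() < p for _ in range(n_cols)] for r in range(n_rows)]
    right = [[c + 1 < n_cols and rng.random() < p for c in range(n_cols)] for _ in range(n_rows)]
    return [down, right]


def count_edges(conn):
    """Total number of open edges in a (down, right) connection list."""
    return sum(cell for layer in conn for row in layer for cell in row)


closed = gen_percolation(3, 3, p=0.0, seed=0)
full = gen_percolation(3, 3, p=1.0, seed=0)
```

With p=0 no edge opens and with p=1 all 12 internal edges of a 3x3 grid open; intermediate values give a mix of cycles and disconnected regions.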

Arguments

def gen_dfs_percolation

(
    grid_shape: jaxtyping.Int8[ndarray, 'row_col'],
    p: float = 0.4,
    lattice_dim: int = 2,
    accessible_cells: int | None = None,
    max_tree_depth: int | None = None,
    start_coord: jaxtyping.Int8[ndarray, 'row_col'] | None = None
) -> maze_dataset.maze.lattice_maze.LatticeMaze

dfs and then percolation (adds cycles)

maze_dataset.constants

constants and type hints used across the package

single coordinate as array

single coordinate as tuple

array of coordinates

list of tuple coordinates

single connection (pair of coords) as array

internal representation used in LatticeMaze

n_edges * 2 * 2 array of connections, like an adjacency list

class SpecialTokensError(builtins.Exception):

Common base class for all non-exit exceptions.

Inherited Members

special tokens

down, up, right, left directions for when inside a ConnectionList

down, up, right, left as vectors

public access to universal vocabulary for MazeTokenizerModular

list of VOCAB tokens, in order

map of VOCAB tokens to their indices

map of cardinal directions to appropriate tokens

maze_dataset.dataset

MazeDatasetConfigs are used to create a MazeDataset via MazeDataset.from_config(cfg)

class MazeDataset(typing.Generic[+T_co]):

a maze dataset class. This is a collection of solved mazes, and should be initialized via MazeDataset.from_config

MazeDataset

(
    cfg: maze_dataset.dataset.maze_dataset.MazeDatasetConfig,
    mazes: Sequence[maze_dataset.maze.lattice_maze.SolvedMaze],
    generation_metadata_collected: dict | None = None
)

def data_hash

(self) -> int

def as_tokens

(
    self,
    maze_tokenizer,
    limit: int | None = None,
    join_tokens_individual_maze: bool = False
) -> list[list[str]] | list[str]

return the dataset as tokens according to the passed maze_tokenizer

the maze_tokenizer should be either a MazeTokenizer or a MazeTokenizerModular

if join_tokens_individual_maze is True, then the tokens of each maze are joined with a space, and the result is a list of strings. i.e.:

>>> dataset.as_tokens(join_tokens_individual_maze=False)
[["a", "b", "c"], ["d", "e", "f"]]
>>> dataset.as_tokens(join_tokens_individual_maze=True)
["a b c", "d e f"]

def generate

(
    cls,
    cfg: maze_dataset.dataset.maze_dataset.MazeDatasetConfig,
    gen_parallel: bool = False,
    pool_kwargs: dict | None = None,
    verbose: bool = False
) -> maze_dataset.dataset.maze_dataset.MazeDataset

generate a maze dataset given a config and some generation parameters

def download

(
    cls,
    cfg: maze_dataset.dataset.maze_dataset.MazeDatasetConfig,
    **kwargs
) -> maze_dataset.dataset.maze_dataset.MazeDataset

def load

(
    cls,
    data: Union[bool, int, float, str, list, Dict[str, Any], NoneType]
) -> maze_dataset.dataset.maze_dataset.MazeDataset

load from zanj/json

def serialize

(self) -> Union[bool, int, float, str, list, Dict[str, Any], NoneType]

serialize to zanj/json

def update_self_config

(self)

update the config to match the current state of the dataset (number of mazes, such as after filtering)

def custom_maze_filter

(
    self,
    method: Callable[[maze_dataset.maze.lattice_maze.SolvedMaze], bool],
    **kwargs
) -> maze_dataset.dataset.maze_dataset.MazeDataset

View Source on GitHub

filter the dataset using a custom method

Inherited Members

class MazeDatasetConfig(maze_dataset.dataset.dataset.GPTDatasetConfig):

View Source on GitHub

config object which is passed to MazeDataset.from_config to generate or load a dataset

MazeDatasetConfig

(
    *,
    name: str,
    seq_len_min: int = 1,
    seq_len_max: int = 512,
    seed: int | None = 42,
    applied_filters: list[dict[typing.Literal['name', 'args', 'kwargs'], str | list | dict]] = <factory>,
    grid_n: int,
    n_mazes: int,
    maze_ctor: Callable = <function LatticeMazeGenerators.gen_dfs>,
    maze_ctor_kwargs: dict = <factory>,
    endpoint_kwargs: dict[typing.Literal['except_when_invalid', 'allowed_start', 'allowed_end', 'deadend_start', 'deadend_end'], bool | None | list[tuple[int, int]]] = <factory>
)

def maze_ctor

(
    grid_shape: jaxtyping.Int8[ndarray, 'row_col'],
    lattice_dim: int = 2,
    accessible_cells: int | float | None = None,
    max_tree_depth: int | float | None = None,
    do_forks: bool = True,
    randomized_stack: bool = False,
    start_coord: jaxtyping.Int8[ndarray, 'row_col'] | None = None
) -> maze_dataset.maze.lattice_maze.LatticeMaze

View Source on GitHub

generate a lattice maze using depth first search, iterative

Arguments

algorithm

  1. Choose the initial cell, mark it as visited and push it to the stack
  2. While the stack is not empty
     1. Pop a cell from the stack and make it a current cell
     2. If the current cell has any neighbours which have not been visited
        1. Push the current cell to the stack
        2. Choose one of the unvisited neighbours
        3. Remove the wall between the current cell and the chosen cell
        4. Mark the chosen cell as visited and push it to the stack

def stable_hash_cfg

(self) -> int

View Source on GitHub

def to_fname

(self) -> str

View Source on GitHub

convert config to a filename

def summary

(self) -> dict

View Source on GitHub

return a summary of the config

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class MazeDatasetCollection(typing.Generic[+T_co]):

View Source on GitHub

a collection of maze datasets

MazeDatasetCollection

(
    cfg: maze_dataset.dataset.collected_dataset.MazeDatasetCollectionConfig,
    maze_datasets: list[maze_dataset.dataset.maze_dataset.MazeDataset],
    generation_metadata_collected: dict | None = None
)

def generate

(
    cls,
    cfg: maze_dataset.dataset.collected_dataset.MazeDatasetCollectionConfig,
    **kwargs
) -> maze_dataset.dataset.collected_dataset.MazeDatasetCollection

View Source on GitHub

def download

(
    cls,
    cfg: maze_dataset.dataset.collected_dataset.MazeDatasetCollectionConfig,
    **kwargs
) -> maze_dataset.dataset.collected_dataset.MazeDatasetCollection

View Source on GitHub

def serialize

(self) -> Union[bool, int, float, str, list, Dict[str, Any], NoneType]

View Source on GitHub

def load

(
    cls,
    data: Union[bool, int, float, str, list, Dict[str, Any], NoneType]
) -> maze_dataset.dataset.collected_dataset.MazeDatasetCollection

View Source on GitHub

def as_tokens

(
    self,
    maze_tokenizer,
    limit: int | None = None,
    join_tokens_individual_maze: bool = False
) -> list[list[str]] | list[str]

View Source on GitHub

return the dataset as tokens

if join_tokens_individual_maze is True, then the tokens of each maze are joined with a space, and the result is a list of strings. i.e.:

>>> dataset.as_tokens(join_tokens_individual_maze=False)
[["a", "b", "c"], ["d", "e", "f"]]
>>> dataset.as_tokens(join_tokens_individual_maze=True)
["a b c", "d e f"]

def update_self_config

(self) -> None

View Source on GitHub

update the config of the dataset to match the actual data, if needed

for example, adjust number of mazes after filtering

Inherited Members

class MazeDatasetCollectionConfig(maze_dataset.dataset.dataset.GPTDatasetConfig):

View Source on GitHub

maze dataset collection configuration, including tokenizers and shuffle

MazeDatasetCollectionConfig

(
    *,
    name: str,
    seq_len_min: int = 1,
    seq_len_max: int = 512,
    seed: int | None = 42,
    applied_filters: list[dict[typing.Literal['name', 'args', 'kwargs'], str | list | dict]] = <factory>,
    maze_dataset_configs: list[maze_dataset.dataset.maze_dataset.MazeDatasetConfig]
)

def summary

(self) -> dict

View Source on GitHub

return a summary of the config

def stable_hash_cfg

(self) -> int

View Source on GitHub

def to_fname

(self) -> str

View Source on GitHub

convert config to a filename

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

maze_dataset.dataset.collected_dataset

collecting different maze datasets into a single dataset, for greater variety in a training or validation set

[!CAUTION] MazeDatasetCollection is not thoroughly tested and is not guaranteed to work.

View Source on GitHub

class MazeDatasetCollectionConfig(maze_dataset.dataset.dataset.GPTDatasetConfig):

View Source on GitHub

maze dataset collection configuration, including tokenizers and shuffle

MazeDatasetCollectionConfig

(
    *,
    name: str,
    seq_len_min: int = 1,
    seq_len_max: int = 512,
    seed: int | None = 42,
    applied_filters: list[dict[typing.Literal['name', 'args', 'kwargs'], str | list | dict]] = <factory>,
    maze_dataset_configs: list[maze_dataset.dataset.maze_dataset.MazeDatasetConfig]
)

def summary

(self) -> dict

View Source on GitHub

return a summary of the config

def stable_hash_cfg

(self) -> int

View Source on GitHub

def to_fname

(self) -> str

View Source on GitHub

convert config to a filename

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class MazeDatasetCollection(typing.Generic[+T_co]):

View Source on GitHub

a collection of maze datasets

MazeDatasetCollection

(
    cfg: maze_dataset.dataset.collected_dataset.MazeDatasetCollectionConfig,
    maze_datasets: list[maze_dataset.dataset.maze_dataset.MazeDataset],
    generation_metadata_collected: dict | None = None
)

def generate

(
    cls,
    cfg: maze_dataset.dataset.collected_dataset.MazeDatasetCollectionConfig,
    **kwargs
) -> maze_dataset.dataset.collected_dataset.MazeDatasetCollection

View Source on GitHub

def download

(
    cls,
    cfg: maze_dataset.dataset.collected_dataset.MazeDatasetCollectionConfig,
    **kwargs
) -> maze_dataset.dataset.collected_dataset.MazeDatasetCollection

View Source on GitHub

def serialize

(self) -> Union[bool, int, float, str, list, Dict[str, Any], NoneType]

View Source on GitHub

def load

(
    cls,
    data: Union[bool, int, float, str, list, Dict[str, Any], NoneType]
) -> maze_dataset.dataset.collected_dataset.MazeDatasetCollection

View Source on GitHub

def as_tokens

(
    self,
    maze_tokenizer,
    limit: int | None = None,
    join_tokens_individual_maze: bool = False
) -> list[list[str]] | list[str]

View Source on GitHub

return the dataset as tokens

if join_tokens_individual_maze is True, then the tokens of each maze are joined with a space, and the result is a list of strings. i.e.:

>>> dataset.as_tokens(join_tokens_individual_maze=False)
[["a", "b", "c"], ["d", "e", "f"]]
>>> dataset.as_tokens(join_tokens_individual_maze=True)
["a b c", "d e f"]

def update_self_config

(self) -> None

View Source on GitHub

update the config of the dataset to match the actual data, if needed

for example, adjust number of mazes after filtering

Inherited Members

maze_dataset.dataset.configs

MAZE_DATASET_CONFIGS contains some default configs for tests and demos

View Source on GitHub

maze_dataset.dataset.dataset

GPTDatasetConfig and GPTDataset are base classes for datasets. They implement some basic functionality: saving/loading, the from_config pipeline, and filtering.

[!NOTE] these should probably be moved into a different package, so don’t rely on them being here

View Source on GitHub

class FilterInfoMismatchError(builtins.ValueError):

View Source on GitHub

raised when the filter info in a dataset config does not match the filter info in the dataset

Inherited Members

class GPTDatasetConfig(muutils.json_serialize.serializable_dataclass.SerializableDataclass):

View Source on GitHub

base GPTDatasetConfig class

GPTDatasetConfig

(
    *,
    name: str,
    seq_len_min: int = 1,
    seq_len_max: int = 512,
    seed: int | None = 42,
    applied_filters: list[dict[typing.Literal['name', 'args', 'kwargs'], str | list | dict]] = <factory>
)

def summary

(self) -> dict

View Source on GitHub

return a summary of the config

def to_fname

(self) -> str

View Source on GitHub

convert config to a filename

def serialize

(
    self,
    *args,
    **kwargs
) -> Union[bool, int, float, str, list, Dict[str, Any], NoneType]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(*args, **kwargs) -> maze_dataset.dataset.dataset.GPTDatasetConfig

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class GPTDataset(typing.Generic[+T_co]):

View Source on GitHub

wrapper for torch dataset with some extra functionality

(meaning the functionality should be inherited in downstream classes)

[!NOTE] GPTDatasetConfig should implement a to_fname method that returns a unique filename for the config

Requires:

the following methods should be implemented in subclasses:

- `__init__(self, cfg: GPTDatasetConfig, **kwargs)`
initialize the dataset from a given config. kwargs are not passed through, the kwargs should take the actual generated or loaded data (a list of objects or sequences probably)
- `generate(cls, cfg: GPTDatasetConfig, **kwargs) -> GPTDataset`
generate the dataset from a given config. kwargs are passed through from `from_config`, and should only contain things that don't belong in the config (i.e. how many threads to use for generation)
- `serialize(self) -> JSONitem`
serialize the dataset to a ZANJ-serializable object, including:
  - config
  - data in formats specified by `self.save_formats`
- `load(cls, data: JSONitem) -> GPTDataset`
load the dataset from a ZANJ-serializable object
- `download(cls, cfg: GPTDatasetConfig, **kwargs) -> GPTDataset`
given a config, try to download a dataset from some source. kwargs are passed through from `from_config`, and should only contain things that don't belong in the config (i.e. some kind of auth token or source url)
- `__len__(self) -> int`
return the length of the dataset, required for `torch.utils.data.Dataset`
- `__getitem__(self, i: int) -> list[str]`
return the ith item in the dataset, required for `torch.utils.data.Dataset`
- `update_self_config(self) -> None`
update the config of the dataset to match the current state of the dataset, used primarily in filtering and validation
- decorating the appropriate filter namespace with `register_filter_namespace_for_dataset(your_dataset_class)` if you want to use filters

Parameters:

- `cfg : GPTDatasetConfig`
config for the dataset, used to generate the dataset
- `do_generate : bool`
whether to generate the dataset if it isn't found
(defaults to `True`)
- `load_local : bool`
whether to try finding the dataset locally
(defaults to `True`)
- `save_local : bool`
whether to save the dataset locally if it is generated or downloaded
(defaults to `True`)
- `do_download : bool`
whether to try downloading the dataset
(defaults to `True`)
- `local_base_path : Path`
where to save the dataset
(defaults to `Path("data/maze_dataset")`)

Returns:

- `GPTDataset`
the dataset, as you wanted it

Implements:

- `save(self, file_path: str) -> None`
save the dataset to a file, using ZANJ
- `read(cls, file_path: str) -> GPTDataset`
read the dataset from a file, using ZANJ
get all items in the dataset, in the specified format
- `filter_by(self)`
returns a namespace class
-  `_filter_namespace(self) -> Class`
returns a namespace class for filtering the dataset, checking that method
- `_apply_filters_from_config(self) -> None`
apply filters to the dataset, as specified in the config. used in `from_config()` but only when generating
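The `applied_filters` field is typed as a list of dicts with `name`, `args`, and `kwargs` keys, so replaying filters from a config amounts to looking each one up by name and calling it in order. Below is a minimal sketch of that idea over plain lists. The function `apply_filters_from_config`, the `toy_namespace` dict, and the `min_len` filter are hypothetical stand-ins for illustration; the real method works on dataset objects and a registered filter namespace, and its internals may differ.

```python
def apply_filters_from_config(items, applied_filters, filter_namespace):
    """Replay recorded filters, in order, over a list of items (illustrative only)."""
    for record in applied_filters:
        # each record names a filter and stores the arguments it was called with
        filter_fn = filter_namespace[record["name"]]
        items = filter_fn(items, *record["args"], **record["kwargs"])
    return items


# hypothetical stand-ins for registered filters
toy_namespace = {
    "truncate_count": lambda items, max_count: items[:max_count],
    "min_len": lambda items, min_length: [x for x in items if len(x) >= min_length],
}

applied = [
    {"name": "min_len", "args": [], "kwargs": {"min_length": 2}},
    {"name": "truncate_count", "args": [2], "kwargs": {}},
]
result = apply_filters_from_config(["a", "bb", "ccc", "dddd"], applied, toy_namespace)
```

Because the records are applied in list order, swapping the two entries above would truncate first and then drop short items, giving a different result.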

def from_config

(
    cls,
    cfg: maze_dataset.dataset.dataset.GPTDatasetConfig,
    do_generate: bool = True,
    load_local: bool = True,
    save_local: bool = True,
    zanj: zanj.zanj.ZANJ | None = None,
    do_download: bool = True,
    local_base_path: pathlib.Path = Path('data/maze_dataset'),
    except_on_config_mismatch: bool = True,
    allow_generation_metadata_filter_mismatch: bool = True,
    verbose: bool = False,
    **kwargs
) -> maze_dataset.dataset.dataset.GPTDataset

View Source on GitHub

base class for gpt datasets

priority of loading: 1. load from local 2. download 3. generate
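The loading priority above is a simple fallback chain gated by the `load_local`, `do_download`, and `do_generate` flags. The sketch below shows that control flow only. `load_with_priority` and the three callables are hypothetical, each callable is assumed to return `None` when its source is unavailable, and raising `ValueError` when every source fails is an assumption rather than documented behavior.

```python
def load_with_priority(
    load_local_fn,
    download_fn,
    generate_fn,
    load_local=True,
    do_download=True,
    do_generate=True,
):
    """Try local, then download, then generate; return (source_name, dataset)."""
    attempts = [
        ("local", load_local, load_local_fn),
        ("download", do_download, download_fn),
        ("generate", do_generate, generate_fn),
    ]
    for source_name, enabled, fn in attempts:
        if not enabled:
            continue  # this source was switched off by its flag
        result = fn()
        if result is not None:
            return source_name, result
    # assumption: failing loudly when nothing worked
    raise ValueError("no enabled source produced a dataset")
```

For example, `load_with_priority(lambda: None, lambda: None, lambda: "generated")` falls through to generation, while passing `do_generate=False` with the same callables raises instead.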

def save

(
    self,
    file_path: pathlib.Path | str,
    zanj: zanj.zanj.ZANJ | None = None
)

View Source on GitHub

def read

(
    cls,
    file_path: str,
    zanj: zanj.zanj.ZANJ | None = None
) -> maze_dataset.dataset.dataset.GPTDataset

View Source on GitHub

def serialize

(self) -> Union[bool, int, float, str, list, Dict[str, Any], NoneType]

View Source on GitHub

def data_hash

(self) -> int

View Source on GitHub

def load

(
    cls,
    data: Union[bool, int, float, str, list, Dict[str, Any], NoneType]
) -> maze_dataset.dataset.dataset.GPTDataset

View Source on GitHub

def generate

(
    cls,
    cfg: maze_dataset.dataset.dataset.GPTDatasetConfig,
    **kwargs
) -> maze_dataset.dataset.dataset.GPTDataset

View Source on GitHub

def download

(
    cls,
    cfg: maze_dataset.dataset.dataset.GPTDatasetConfig,
    **kwargs
) -> maze_dataset.dataset.dataset.GPTDataset

View Source on GitHub

def update_self_config

(self)

View Source on GitHub

update the config of the dataset to match the actual data, if needed

for example, adjust number of mazes after filtering

View Source on GitHub

class GPTDataset.FilterBy:

View Source on GitHub

thanks GPT-4

GPTDataset.FilterBy

(dataset: maze_dataset.dataset.dataset.GPTDataset)

View Source on GitHub

def register_filter_namespace_for_dataset

(
    dataset_cls: Type[maze_dataset.dataset.dataset.GPTDataset]
) -> Callable[[Type], Type]

View Source on GitHub

register the namespace class with the given dataset class

class DatasetFilterProtocol(typing.Protocol):

View Source on GitHub

Base class for protocol classes.

Protocol classes are defined as::

class Proto(Protocol):
    def meth(self) -> int:
        ...

Such classes are primarily used with static type checkers that recognize structural subtyping (static duck-typing), for example::

class C:
    def meth(self) -> int:
        return 0

def func(x: Proto) -> int:
    return x.meth()

func(C())  # Passes static type check

See PEP 544 for details. Protocol classes decorated with @typing.runtime_checkable act as simple-minded runtime protocols that check only the presence of given attributes, ignoring their type signatures. Protocol classes can be generic, they are defined as::

class GenProto(Protocol[T]):
    def meth(self) -> T:
        ...

DatasetFilterProtocol

(*args, **kwargs)

View Source on GitHub

def register_dataset_filter

(
    method: maze_dataset.dataset.dataset.DatasetFilterProtocol
) -> maze_dataset.dataset.dataset.DatasetFilterProtocol

View Source on GitHub

register a dataset filter, copying the underlying dataset and updating the config

be sure to return a COPY, not the original?

method should be a staticmethod of a namespace class registered with register_filter_namespace_for_dataset

maze_dataset.dataset.maze_dataset

MazeDatasetConfig is where you decide what your dataset should look like, then pass it to MazeDataset.from_config to generate or load the dataset.

see demo_dataset notebook

View Source on GitHub

def set_serialize_minimal_threshold

(threshold: int | None) -> None

View Source on GitHub

type hint for MazeDatasetConfig.endpoint_kwargs

class MazeDatasetConfig(maze_dataset.dataset.dataset.GPTDatasetConfig):

View Source on GitHub

config object which is passed to MazeDataset.from_config to generate or load a dataset

MazeDatasetConfig

(
    *,
    name: str,
    seq_len_min: int = 1,
    seq_len_max: int = 512,
    seed: int | None = 42,
    applied_filters: list[dict[typing.Literal['name', 'args', 'kwargs'], str | list | dict]] = <factory>,
    grid_n: int,
    n_mazes: int,
    maze_ctor: Callable = <function LatticeMazeGenerators.gen_dfs>,
    maze_ctor_kwargs: dict = <factory>,
    endpoint_kwargs: dict[typing.Literal['except_when_invalid', 'allowed_start', 'allowed_end', 'deadend_start', 'deadend_end'], bool | None | list[tuple[int, int]]] = <factory>
)

def maze_ctor

(
    grid_shape: jaxtyping.Int8[ndarray, 'row_col'],
    lattice_dim: int = 2,
    accessible_cells: int | float | None = None,
    max_tree_depth: int | float | None = None,
    do_forks: bool = True,
    randomized_stack: bool = False,
    start_coord: jaxtyping.Int8[ndarray, 'row_col'] | None = None
) -> maze_dataset.maze.lattice_maze.LatticeMaze

View Source on GitHub

generate a lattice maze using depth first search, iterative

Arguments

algorithm

  1. Choose the initial cell, mark it as visited and push it to the stack
  2. While the stack is not empty
     1. Pop a cell from the stack and make it a current cell
     2. If the current cell has any neighbours which have not been visited
        1. Push the current cell to the stack
        2. Choose one of the unvisited neighbours
        3. Remove the wall between the current cell and the chosen cell
        4. Mark the chosen cell as visited and push it to the stack

def stable_hash_cfg

(self) -> int

View Source on GitHub

def to_fname

(self) -> str

View Source on GitHub

convert config to a filename

def summary

(self) -> dict

View Source on GitHub

return a summary of the config

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class MazeDataset(typing.Generic[+T_co]):

View Source on GitHub

a maze dataset class. This is a collection of solved mazes, and should be initialized via MazeDataset.from_config

MazeDataset

(
    cfg: maze_dataset.dataset.maze_dataset.MazeDatasetConfig,
    mazes: Sequence[maze_dataset.maze.lattice_maze.SolvedMaze],
    generation_metadata_collected: dict | None = None
)

View Source on GitHub

def data_hash

(self) -> int

View Source on GitHub

def as_tokens

(
    self,
    maze_tokenizer,
    limit: int | None = None,
    join_tokens_individual_maze: bool = False
) -> list[list[str]] | list[str]

View Source on GitHub

return the dataset as tokens according to the passed maze_tokenizer

the maze_tokenizer should be either a MazeTokenizer or a MazeTokenizerModular

if join_tokens_individual_maze is True, then the tokens of each maze are joined with a space, and the result is a list of strings. i.e.:

>>> dataset.as_tokens(join_tokens_individual_maze=False)
[["a", "b", "c"], ["d", "e", "f"]]
>>> dataset.as_tokens(join_tokens_individual_maze=True)
["a b c", "d e f"]

def generate

(
    cls,
    cfg: maze_dataset.dataset.maze_dataset.MazeDatasetConfig,
    gen_parallel: bool = False,
    pool_kwargs: dict | None = None,
    verbose: bool = False
) -> maze_dataset.dataset.maze_dataset.MazeDataset

View Source on GitHub

generate a maze dataset given a config and some generation parameters

def download

(
    cls,
    cfg: maze_dataset.dataset.maze_dataset.MazeDatasetConfig,
    **kwargs
) -> maze_dataset.dataset.maze_dataset.MazeDataset

View Source on GitHub

def load

(
    cls,
    data: Union[bool, int, float, str, list, Dict[str, Any], NoneType]
) -> maze_dataset.dataset.maze_dataset.MazeDataset

View Source on GitHub

load from zanj/json

def serialize

(self) -> Union[bool, int, float, str, list, Dict[str, Any], NoneType]

View Source on GitHub

serialize to zanj/json

def update_self_config

(self)

View Source on GitHub

update the config to match the current state of the dataset (number of mazes, such as after filtering)

def custom_maze_filter

(
    self,
    method: Callable[[maze_dataset.maze.lattice_maze.SolvedMaze], bool],
    **kwargs
) -> maze_dataset.dataset.maze_dataset.MazeDataset

View Source on GitHub

filter the dataset using a custom method

Inherited Members

def register_maze_filter

(
    method: Callable[[maze_dataset.maze.lattice_maze.SolvedMaze, Any], bool]
) -> maze_dataset.dataset.dataset.DatasetFilterProtocol

View Source on GitHub

register a maze filter, casting it to operate over the whole list of mazes

method should be a staticmethod of a namespace class registered with register_filter_namespace_for_dataset

this is a more restricted version of register_dataset_filter that removes the need for boilerplate for operating over the arrays
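The idea of "casting" a per-maze predicate to operate over the whole list can be sketched as a small decorator. This is an illustration over plain Python lists only: `lift_maze_predicate` is a hypothetical name, each "maze" here is just a list standing in for its solution path, and the sketch omits the dataset copying and config updating that `register_dataset_filter` is described as doing.

```python
import functools


def lift_maze_predicate(predicate):
    """Turn predicate(maze, *args) -> bool into a filter over a list of mazes."""

    @functools.wraps(predicate)
    def dataset_filter(mazes, *args, **kwargs):
        # keep only the mazes for which the predicate holds
        return [maze for maze in mazes if predicate(maze, *args, **kwargs)]

    return dataset_filter


@lift_maze_predicate
def path_length(solution, min_length):
    # toy analogue of the path_length filter: a "maze" is its solution path
    return len(solution) >= min_length


kept = path_length([[(0, 0), (0, 1), (1, 1)], [(0, 0)]], min_length=2)
```

The author of the predicate only has to reason about one maze at a time; the wrapper supplies the iteration.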

class MazeDatasetFilters:

View Source on GitHub

namespace for filters for MazeDatasets

def path_length

(maze: maze_dataset.maze.lattice_maze.SolvedMaze, min_length: int) -> bool

View Source on GitHub

filter out mazes with a solution length less than min_length

def start_end_distance

(
    maze: maze_dataset.maze.lattice_maze.SolvedMaze,
    min_distance: int
) -> bool

View Source on GitHub

filter out mazes where the start and end positions are less than min_distance apart by Manhattan distance (ignoring walls)
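Manhattan distance ignores walls: it is the sum of the absolute row and column differences between two cells. A minimal sketch of the check, assuming coordinates are `(row, col)` pairs; both function names below are hypothetical and not part of the library.

```python
def manhattan_distance(start, end):
    """Sum of absolute coordinate differences between two (row, col) cells."""
    return abs(start[0] - end[0]) + abs(start[1] - end[1])


def passes_start_end_distance(start, end, min_distance):
    """True if the endpoints are at least min_distance apart (maze is kept)."""
    return manhattan_distance(start, end) >= min_distance


d = manhattan_distance((0, 0), (2, 3))
```

Note that the actual solution path can be much longer than this distance, since the path has to go around walls.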

def cut_percentile_shortest

(
    dataset: maze_dataset.dataset.maze_dataset.MazeDataset,
    percentile: float = 10.0
) -> maze_dataset.dataset.maze_dataset.MazeDataset

View Source on GitHub

cut the shortest percentile of mazes from the dataset

percentile is 1-100, not 0-1, as this is what np.percentile expects

def truncate_count

(
    dataset: maze_dataset.dataset.maze_dataset.MazeDataset,
    max_count: int
) -> maze_dataset.dataset.maze_dataset.MazeDataset

View Source on GitHub

truncate the dataset to be at most max_count mazes

def remove_duplicates

(
    dataset: maze_dataset.dataset.maze_dataset.MazeDataset,
    minimum_difference_connection_list: int | None = 1,
    minimum_difference_solution: int | None = 1,
    _max_dataset_len_threshold: int = 1000
) -> maze_dataset.dataset.maze_dataset.MazeDataset

View Source on GitHub

remove duplicates from a dataset, keeping the LAST unique maze

set either minimum difference to None to disable checking

if you want to avoid mazes which have more overlap, set the minimum difference to be greater

Gotchas:

- if two mazes are of different sizes, they will never be considered duplicates
- if two solutions are of different lengths, they will never be considered duplicates

TODO: check for overlap?

def remove_duplicates_fast

(
    dataset: maze_dataset.dataset.maze_dataset.MazeDataset
) -> maze_dataset.dataset.maze_dataset.MazeDataset

View Source on GitHub

remove duplicates from a dataset

def strip_generation_meta

(
    dataset: maze_dataset.dataset.maze_dataset.MazeDataset
) -> maze_dataset.dataset.maze_dataset.MazeDataset

View Source on GitHub

strip the generation meta from the dataset

def collect_generation_meta

(
    dataset: maze_dataset.dataset.maze_dataset.MazeDataset,
    clear_in_mazes: bool = True,
    inplace: bool = True,
    allow_fail: bool = False
) -> maze_dataset.dataset.maze_dataset.MazeDataset

View Source on GitHub

maze_dataset.dataset.rasterized

a special RasterizedMazeDataset that returns 2 images, one for input and one for target, for each maze

this lets you match the input and target format of the easy_2_hard dataset

see their paper:

@misc{schwarzschild2021learn,
      title={Can You Learn an Algorithm? Generalizing from Easy to Hard Problems with Recurrent Networks}, 
      author={Avi Schwarzschild and Eitan Borgnia and Arjun Gupta and Furong Huang and Uzi Vishkin and Micah Goldblum and Tom Goldstein},
      year={2021},
      eprint={2106.04537},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

View Source on GitHub

def process_maze_rasterized_input_target

(
    maze: maze_dataset.maze.lattice_maze.SolvedMaze,
    remove_isolated_cells: bool = True,
    extend_pixels: bool = True,
    endpoints_as_open: bool = False
) -> jaxtyping.Float[Tensor, 'in/tgt=2 x y rgb=3']

View Source on GitHub

class RasterizedMazeDatasetConfig(maze_dataset.dataset.maze_dataset.MazeDatasetConfig):

View Source on GitHub

RasterizedMazeDatasetConfig

(
    remove_isolated_cells: bool = True,
    extend_pixels: bool = True,
    endpoints_as_open: bool = False,
    *,
    name: str,
    seq_len_min: int = 1,
    seq_len_max: int = 512,
    seed: int | None = 42,
    applied_filters: list[dict[typing.Literal['name', 'args', 'kwargs'], str | list | dict]] = <factory>,
    grid_n: int,
    n_mazes: int,
    maze_ctor: Callable = <function LatticeMazeGenerators.gen_dfs>,
    maze_ctor_kwargs: dict = <factory>,
    endpoint_kwargs: dict[typing.Literal['except_when_invalid', 'allowed_start', 'allowed_end', 'deadend_start', 'deadend_end'], bool | None | list[tuple[int, int]]] = <factory>
)

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class RasterizedMazeDataset(typing.Generic[+T_co]):

View Source on GitHub

a maze dataset class. This is a collection of solved mazes, and should be initialized via MazeDataset.from_config

def get_batch

(
    self,
    idxs: list[int] | None
) -> jaxtyping.Float[Tensor, 'in/tgt=2 item x y rgb=3']

View Source on GitHub

def from_config_augmented

(
    cls,
    cfg: maze_dataset.dataset.rasterized.RasterizedMazeDatasetConfig,
    **kwargs
) -> torch.utils.data.dataset.Dataset

View Source on GitHub

loads either a maze transformer dataset or an easy_2_hard dataset

def from_base_MazeDataset

(
    cls,
    base_dataset: maze_dataset.dataset.maze_dataset.MazeDataset,
    added_params: dict | None = None
) -> torch.utils.data.dataset.Dataset

View Source on GitHub

loads either a maze transformer dataset or an easy_2_hard dataset

def plot

(self, count: int | None = None, show: bool = True) -> tuple

def make_numpy_collection

(
    base_cfg: maze_dataset.dataset.rasterized.RasterizedMazeDatasetConfig,
    grid_sizes: list[int],
    from_config_kwargs: dict | None = None,
    verbose: bool = True,
    key_fmt: str = '{size}x{size}'
) -> dict[typing.Literal['configs', 'arrays'], dict[str, maze_dataset.dataset.rasterized.RasterizedMazeDatasetConfig | numpy.ndarray]]

create a collection of configs and arrays for different grid sizes, in plain tensor form

output is of structure:

{
    "configs": {
        "<n>x<n>": RasterizedMazeDatasetConfig,
        ...
    },
    "arrays": {
        "<n>x<n>": np.ndarray,
        ...
    },
}
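
As a small illustration of the structure above, the sketch below builds the same two-level dictionary with placeholder values. It assumes the keys are produced by `key_fmt.format(size=n)`, which the default `'{size}x{size}'` suggests but the text does not state. `make_collection_skeleton` is a hypothetical helper, not part of maze-dataset.

```python
def make_collection_skeleton(grid_sizes, key_fmt="{size}x{size}"):
    """Build the nested output layout with None placeholders (illustration only)."""
    # Assumption: one key per grid size, formatted from key_fmt.
    keys = [key_fmt.format(size=n) for n in grid_sizes]
    return {
        "configs": {k: None for k in keys},  # would hold RasterizedMazeDatasetConfig objects
        "arrays": {k: None for k in keys},   # would hold numpy arrays
    }


skeleton = make_collection_skeleton([3, 5])
print(list(skeleton["configs"]))
```

Both sub-dictionaries share the same keys, so a config and its array can be looked up with one key.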

maze_dataset.generation

generation functions have signature (grid_shape: Coord, **kwargs) -> LatticeMaze and are methods in LatticeMazeGenerators

DEFAULT_GENERATORS is a list of generator name, generator kwargs pairs used in tests and demos

class LatticeMazeGenerators:

namespace for lattice maze generation algorithms

def gen_dfs

(
    grid_shape: jaxtyping.Int8[ndarray, 'row_col'],
    lattice_dim: int = 2,
    accessible_cells: int | float | None = None,
    max_tree_depth: int | float | None = None,
    do_forks: bool = True,
    randomized_stack: bool = False,
    start_coord: jaxtyping.Int8[ndarray, 'row_col'] | None = None
) -> maze_dataset.maze.lattice_maze.LatticeMaze

generate a lattice maze using depth first search, iterative

Arguments

algorithm

  1. Choose the initial cell, mark it as visited and push it to the stack
  2. While the stack is not empty
     1. Pop a cell from the stack and make it a current cell
     2. If the current cell has any neighbours which have not been visited
        1. Push the current cell to the stack
        2. Choose one of the unvisited neighbours
        3. Remove the wall between the current cell and the chosen cell
        4. Mark the chosen cell as visited and push it to the stack
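
The steps above can be sketched in plain Python. This is a minimal standalone illustration of iterative randomized depth-first search, not the library's `gen_dfs`. The function name `dfs_maze`, the `seed` argument, and the set-of-edges return value are assumptions made for the example.

```python
import random


def dfs_maze(rows, cols, seed=None):
    """Return the opened walls of a DFS maze as a set of frozenset cell pairs."""
    rng = random.Random(seed)
    # 1. choose the initial cell, mark it visited, push it to the stack
    start = (rng.randrange(rows), rng.randrange(cols))
    visited = {start}
    stack = [start]
    connections = set()
    # 2. loop until the stack is empty
    while stack:
        current = stack.pop()
        r, c = current
        unvisited = [
            (r + dr, c + dc)
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
            if 0 <= r + dr < rows
            and 0 <= c + dc < cols
            and (r + dr, c + dc) not in visited
        ]
        if unvisited:
            stack.append(current)                 # push the current cell back
            chosen = rng.choice(unvisited)        # choose an unvisited neighbour
            connections.add(frozenset((current, chosen)))  # remove the wall
            visited.add(chosen)                   # mark visited and push
            stack.append(chosen)
    return connections


maze_edges = dfs_maze(4, 4, seed=0)
```

Because each newly visited cell opens exactly one wall, the result is a spanning tree: a 4x4 grid always yields 15 connections.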

def gen_prim

(
    grid_shape: jaxtyping.Int8[ndarray, 'row_col'],
    lattice_dim: int = 2,
    accessible_cells: int | float | None = None,
    max_tree_depth: int | float | None = None,
    do_forks: bool = True,
    start_coord: jaxtyping.Int8[ndarray, 'row_col'] | None = None
) -> maze_dataset.maze.lattice_maze.LatticeMaze

def gen_wilson

(
    grid_shape: jaxtyping.Int8[ndarray, 'row_col']
) -> maze_dataset.maze.lattice_maze.LatticeMaze

Generate a lattice maze using Wilson’s algorithm.

Algorithm

Wilson’s algorithm generates an unbiased (random) maze sampled from the uniform distribution over all mazes, using loop-erased random walks. The generated maze is acyclic and all cells are part of a unique connected space.

https://en.wikipedia.org/wiki/Maze_generation_algorithm#Wilson's_algorithm
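
To make the loop-erased random walk concrete, here is a minimal standalone sketch in plain Python. It is not the library's `gen_wilson`. The name `wilson_maze`, the `seed` argument, and the set-of-edges return value are assumptions for illustration. Loops are erased implicitly by remembering only the most recent exit direction from each cell.

```python
import random


def wilson_maze(rows, cols, seed=None):
    """Return the connections of a uniform spanning tree on a rows x cols grid."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(rows) for c in range(cols)]

    def neighbors(cell):
        r, c = cell
        return [
            (r + dr, c + dc)
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
            if 0 <= r + dr < rows and 0 <= c + dc < cols
        ]

    in_tree = {rng.choice(cells)}  # seed the tree with one random cell
    edges = set()
    for start in cells:
        if start in in_tree:
            continue
        # random walk until the tree is hit; overwriting next_step[cell]
        # on every revisit discards any loop the walk made through that cell
        next_step = {}
        cell = start
        while cell not in in_tree:
            next_step[cell] = rng.choice(neighbors(cell))
            cell = next_step[cell]
        # retrace the loop-erased path and attach it to the tree
        cell = start
        while cell not in in_tree:
            in_tree.add(cell)
            edges.add(frozenset((cell, next_step[cell])))
            cell = next_step[cell]
    return edges


wilson_edges = wilson_maze(4, 4, seed=1)
```

As with any spanning tree on 16 cells, the result has 15 connections and no cycles.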

def gen_percolation

(
    grid_shape: jaxtyping.Int8[ndarray, 'row_col'],
    p: float = 0.4,
    lattice_dim: int = 2,
    start_coord: jaxtyping.Int8[ndarray, 'row_col'] | None = None
) -> maze_dataset.maze.lattice_maze.LatticeMaze

generate a lattice maze using simple percolation

note that p in the range (0.4, 0.7) gives the most interesting mazes
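
A rough sketch of what simple percolation means here, under the assumption that each connection between adjacent cells is opened independently with probability `p`. The `[down, right]` ordering follows the connection list example in the `LatticeMaze` docs. `percolation_connections` is a hypothetical standalone function using nested lists instead of arrays.

```python
import random


def percolation_connections(rows, cols, p=0.4, seed=None):
    """Return [down, right] boolean grids with each connection open with probability p."""
    rng = random.Random(seed)
    # bottom-row cells have no downward connection
    down = [
        [r < rows - 1 and rng.random() < p for c in range(cols)]
        for r in range(rows)
    ]
    # right-column cells have no rightward connection
    right = [
        [c < cols - 1 and rng.random() < p for c in range(cols)]
        for r in range(rows)
    ]
    return [down, right]


conn = percolation_connections(5, 5, p=0.5, seed=0)
```

Unlike the tree-based generators, nothing here guarantees connectivity or acyclicity: low `p` leaves many isolated cells, while high `p` opens almost everything.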

Arguments

def gen_dfs_percolation

(
    grid_shape: jaxtyping.Int8[ndarray, 'row_col'],
    p: float = 0.4,
    lattice_dim: int = 2,
    accessible_cells: int | None = None,
    max_tree_depth: int | None = None,
    start_coord: jaxtyping.Int8[ndarray, 'row_col'] | None = None
) -> maze_dataset.maze.lattice_maze.LatticeMaze

dfs and then percolation (adds cycles)

def get_maze_with_solution

(
    gen_name: str,
    grid_shape: jaxtyping.Int8[ndarray, 'row_col'],
    maze_ctor_kwargs: dict | None = None
) -> maze_dataset.maze.lattice_maze.SolvedMaze

helper function to get a maze already with a solution

maze_dataset.generation.default_generators

DEFAULT_GENERATORS is a list of generator name, generator kwargs pairs used in tests and demos

maze_dataset.generation.generators

generation functions have signature (grid_shape: Coord, **kwargs) -> LatticeMaze and are methods in LatticeMazeGenerators

def get_neighbors_in_bounds

(
    coord: jaxtyping.Int8[ndarray, 'row_col'],
    grid_shape: jaxtyping.Int8[ndarray, 'row_col']
) -> jaxtyping.Int8[ndarray, 'coord row_col']

get all neighbors of a coordinate that are within the bounds of the grid

class LatticeMazeGenerators:

namespace for lattice maze generation algorithms

def gen_dfs

(
    grid_shape: jaxtyping.Int8[ndarray, 'row_col'],
    lattice_dim: int = 2,
    accessible_cells: int | float | None = None,
    max_tree_depth: int | float | None = None,
    do_forks: bool = True,
    randomized_stack: bool = False,
    start_coord: jaxtyping.Int8[ndarray, 'row_col'] | None = None
) -> maze_dataset.maze.lattice_maze.LatticeMaze

generate a lattice maze using depth first search, iterative

Arguments

algorithm

  1. Choose the initial cell, mark it as visited and push it to the stack
  2. While the stack is not empty
     1. Pop a cell from the stack and make it a current cell
     2. If the current cell has any neighbours which have not been visited
        1. Push the current cell to the stack
        2. Choose one of the unvisited neighbours
        3. Remove the wall between the current cell and the chosen cell
        4. Mark the chosen cell as visited and push it to the stack

def gen_prim

(
    grid_shape: jaxtyping.Int8[ndarray, 'row_col'],
    lattice_dim: int = 2,
    accessible_cells: int | float | None = None,
    max_tree_depth: int | float | None = None,
    do_forks: bool = True,
    start_coord: jaxtyping.Int8[ndarray, 'row_col'] | None = None
) -> maze_dataset.maze.lattice_maze.LatticeMaze

def gen_wilson

(
    grid_shape: jaxtyping.Int8[ndarray, 'row_col']
) -> maze_dataset.maze.lattice_maze.LatticeMaze

Generate a lattice maze using Wilson’s algorithm.

Algorithm

Wilson’s algorithm generates an unbiased (random) maze sampled from the uniform distribution over all mazes, using loop-erased random walks. The generated maze is acyclic and all cells are part of a unique connected space.

https://en.wikipedia.org/wiki/Maze_generation_algorithm#Wilson's_algorithm

def gen_percolation

(
    grid_shape: jaxtyping.Int8[ndarray, 'row_col'],
    p: float = 0.4,
    lattice_dim: int = 2,
    start_coord: jaxtyping.Int8[ndarray, 'row_col'] | None = None
) -> maze_dataset.maze.lattice_maze.LatticeMaze

generate a lattice maze using simple percolation

note that p in the range (0.4, 0.7) gives the most interesting mazes

Arguments

def gen_dfs_percolation

(
    grid_shape: jaxtyping.Int8[ndarray, 'row_col'],
    p: float = 0.4,
    lattice_dim: int = 2,
    accessible_cells: int | None = None,
    max_tree_depth: int | None = None,
    start_coord: jaxtyping.Int8[ndarray, 'row_col'] | None = None
) -> maze_dataset.maze.lattice_maze.LatticeMaze

dfs and then percolation (adds cycles)

mapping of generator names to generator functions, useful for loading MazeDatasetConfig

def get_maze_with_solution

(
    gen_name: str,
    grid_shape: jaxtyping.Int8[ndarray, 'row_col'],
    maze_ctor_kwargs: dict | None = None
) -> maze_dataset.maze.lattice_maze.SolvedMaze

helper function to get a maze already with a solution

maze_dataset.maze

LatticeMaze and the classes like SolvedMaze that inherit from it, along with a ton of helper funcs

class SolvedMaze(maze_dataset.maze.lattice_maze.TargetedLatticeMaze):

Stores a maze and a solution

SolvedMaze

(
    connection_list: jaxtyping.Bool[ndarray, 'lattice_dim=2 row col'],
    solution: jaxtyping.Int8[ndarray, 'coord row_col'],
    generation_meta: dict | None = None,
    start_pos: jaxtyping.Int8[ndarray, 'row_col'] | None = None,
    end_pos: jaxtyping.Int8[ndarray, 'row_col'] | None = None,
    allow_invalid: bool = False
)

def get_solution_tokens

(self) -> list[str | tuple[int, int]]

def from_lattice_maze

(
    cls,
    lattice_maze: maze_dataset.maze.lattice_maze.LatticeMaze,
    solution: list[tuple[int, int]]
) -> maze_dataset.maze.lattice_maze.SolvedMaze

def from_targeted_lattice_maze

(
    cls,
    targeted_lattice_maze: maze_dataset.maze.lattice_maze.TargetedLatticeMaze,
    solution: list[tuple[int, int]] | None = None
) -> maze_dataset.maze.lattice_maze.SolvedMaze

solves the given targeted lattice maze and returns a SolvedMaze

def get_solution_forking_points

(
    self,
    always_include_endpoints: bool = False
) -> tuple[list[int], jaxtyping.Int8[ndarray, 'coord row_col']]

coordinates and their indices from the solution where a fork is present

def get_solution_path_following_points

(self) -> tuple[list[int], jaxtyping.Int8[ndarray, 'coord row_col']]

coordinates from the solution where there is only a single (non-backtracking) point to move to

returns the complement of get_solution_forking_points from the path
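
The relationship between the two methods can be sketched as follows. This assumes that a fork is a solution step with more than one connected neighbor other than the cell just left; how the library treats the endpoints is not stated here, so their handling below is also an assumption. `split_solution_points` and the `neighbors` dictionary are hypothetical and use plain Python types.

```python
def split_solution_points(neighbors, solution):
    """Split solution indices into forking and path-following steps."""
    forks, following = [], []
    for i, cell in enumerate(solution):
        previous = solution[i - 1] if i > 0 else None
        # moves available without stepping back to where we came from
        choices = [n for n in neighbors[cell] if n != previous]
        if len(choices) > 1:
            forks.append(i)
        else:
            following.append(i)
    return forks, following


# T-shaped maze: (0, 1) connects to three cells
neighbors = {
    (0, 0): [(0, 1)],
    (0, 1): [(0, 0), (0, 2), (1, 1)],
    (0, 2): [(0, 1)],
    (1, 1): [(0, 1)],
}
forks, following = split_solution_points(neighbors, [(0, 0), (0, 1), (0, 2)])
```

Every index lands in exactly one of the two lists, which is what makes them complements of each other over the path.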

def serialize

(self) -> dict[str, typing.Any]

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

class TargetedLatticeMaze(maze_dataset.maze.lattice_maze.LatticeMaze):

A LatticeMaze with a start and end position

TargetedLatticeMaze

(
    *,
    connection_list: jaxtyping.Bool[ndarray, 'lattice_dim=2 row col'],
    generation_meta: dict | None = None,
    start_pos: jaxtyping.Int8[ndarray, 'row_col'],
    end_pos: jaxtyping.Int8[ndarray, 'row_col']
)

def get_start_pos_tokens

(self) -> list[str | tuple[int, int]]

def get_end_pos_tokens

(self) -> list[str | tuple[int, int]]

def from_lattice_maze

(
    cls,
    lattice_maze: maze_dataset.maze.lattice_maze.LatticeMaze,
    start_pos: jaxtyping.Int8[ndarray, 'row_col'],
    end_pos: jaxtyping.Int8[ndarray, 'row_col']
) -> maze_dataset.maze.lattice_maze.TargetedLatticeMaze

def serialize

(self) -> dict[str, typing.Any]

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

class LatticeMaze(muutils.json_serialize.serializable_dataclass.SerializableDataclass):

lattice maze (nodes on a lattice, connections only to neighboring nodes)

Connection List represents which nodes (N) are connected in each direction.

First and second elements represent downward and rightward connections, respectively (matching the `# down` / `# right` order in the example).

Example:

    Connection list:
    [
        [ # down
            [F T],
            [F F]
        ],
        [ # right
            [T F],
            [T F]
        ]
    ]

    Nodes with connections:
    N T N F
    F   T
    N T N F
    F   F

    Graph:
    N - N
        |
    N - N

Note: the bottom row connections going down, and the right-hand connections going right, will always be False.
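
To check the example by hand, the sketch below decodes that connection list into explicit edges. It uses nested Python lists rather than a boolean array, and `connection_list_to_edges` is a hypothetical helper written for this illustration, not a library function.

```python
F, T = False, True

connection_list = [
    [  # down
        [F, T],
        [F, F],
    ],
    [  # right
        [T, F],
        [T, F],
    ],
]


def connection_list_to_edges(conn):
    """List ((row, col), (row, col)) pairs for every open connection."""
    down, right = conn
    edges = []
    for r, row in enumerate(down):
        for c, is_open in enumerate(row):
            if is_open:
                edges.append(((r, c), (r + 1, c)))  # connects to the cell below
    for r, row in enumerate(right):
        for c, is_open in enumerate(row):
            if is_open:
                edges.append(((r, c), (r, c + 1)))  # connects to the cell on the right
    return edges


edges = connection_list_to_edges(connection_list)
print(edges)
```

The three edges match the graph in the example: two horizontal links and one vertical link on the right-hand side.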

LatticeMaze

(
    *,
    connection_list: jaxtyping.Bool[ndarray, 'lattice_dim=2 row col'],
    generation_meta: dict | None = None
)

def heuristic

(a: tuple[int, int], b: tuple[int, int]) -> float

return manhattan distance between two points

def nodes_connected

(
    self,
    a: jaxtyping.Int8[ndarray, 'row_col'],
    b: jaxtyping.Int8[ndarray, 'row_col'],
    /
) -> bool

returns whether two nodes are connected

def is_valid_path

(
    self,
    path: jaxtyping.Int8[ndarray, 'coord row_col'],
    empty_is_valid: bool = False
) -> bool

check if a path is valid

def coord_degrees

(self) -> jaxtyping.Int8[ndarray, 'row col']

Returns an array with the connectivity degree of each coord. I.e., how many neighbors each coord has.
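
A minimal sketch of how such degrees follow from a connection list: every open connection adds one to the degree of both cells it joins. This uses nested lists and a hypothetical function name, `degrees_from_connection_list`. The `[down, right]` ordering is taken from the class docstring example.

```python
def degrees_from_connection_list(conn):
    """Count the connected neighbors of every cell."""
    down, right = conn
    rows, cols = len(down), len(down[0])
    degrees = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if down[r][c]:   # open connection to the cell below
                degrees[r][c] += 1
                degrees[r + 1][c] += 1
            if right[r][c]:  # open connection to the cell on the right
                degrees[r][c] += 1
                degrees[r][c + 1] += 1
    return degrees


example = [
    [[False, True], [False, False]],  # down
    [[True, False], [True, False]],   # right
]
degrees = degrees_from_connection_list(example)
```

Cells with degree 1 are dead ends, which is the kind of information options such as `deadend_start` and `deadend_end` would need.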

def get_coord_neighbors

(
    self,
    c: jaxtyping.Int8[ndarray, 'row_col']
) -> jaxtyping.Int8[ndarray, 'coord row_col']

Returns an array of the neighboring, connected coords of c.

def gen_connected_component_from

(
    self,
    c: jaxtyping.Int8[ndarray, 'row_col']
) -> jaxtyping.Int8[ndarray, 'coord row_col']

return the connected component from a given coordinate

def find_shortest_path

(
    self,
    c_start: tuple[int, int],
    c_end: tuple[int, int]
) -> jaxtyping.Int8[ndarray, 'coord row_col']

find the shortest path between two coordinates, using A*
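
For readers unfamiliar with A*, here is a compact standalone sketch that uses the Manhattan distance as its heuristic, as the `heuristic` method above does. It is not the library's implementation: the `neighbors` dictionary, the function names, and the list-of-tuples return value are assumptions for illustration.

```python
import heapq


def manhattan(a, b):
    """Manhattan distance between two (row, col) cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def a_star(neighbors, start, goal):
    """Shortest path on a unit-cost graph given as {cell: [connected cells]}."""
    open_heap = [(manhattan(start, goal), 0, start)]  # (f = g + h, g, cell)
    came_from = {}
    g_score = {start: 0}
    while open_heap:
        _, g, current = heapq.heappop(open_heap)
        if current == goal:
            path = [current]
            while current in came_from:  # walk back to the start
                current = came_from[current]
                path.append(current)
            return path[::-1]
        if g > g_score[current]:
            continue  # stale heap entry, a shorter route was already found
        for nxt in neighbors[current]:
            tentative = g + 1
            if tentative < g_score.get(nxt, float("inf")):
                g_score[nxt] = tentative
                came_from[nxt] = current
                heapq.heappush(
                    open_heap, (tentative + manhattan(nxt, goal), tentative, nxt)
                )
    return None  # goal not reachable from start


# the 2x2 maze from the LatticeMaze docstring example
neighbors = {
    (0, 0): [(0, 1)],
    (0, 1): [(0, 0), (1, 1)],
    (1, 1): [(0, 1), (1, 0)],
    (1, 0): [(1, 1)],
}
path = a_star(neighbors, (0, 0), (1, 0))
```

The Manhattan distance never overestimates the remaining steps on a lattice, so the first time the goal is popped the path is a shortest one.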

def get_nodes

(self) -> jaxtyping.Int8[ndarray, 'coord row_col']

return a list of all nodes in the maze

def get_connected_component

(self) -> jaxtyping.Int8[ndarray, 'coord row_col']

get the largest (and assumed only nonsingular) connected component of the maze

TODO: other connected components?

def generate_random_path

(
    self,
    except_when_invalid: bool = True,
    allowed_start: list[tuple[int, int]] | None = None,
    allowed_end: list[tuple[int, int]] | None = None,
    deadend_start: bool = False,
    deadend_end: bool = False,
    endpoints_not_equal: bool = False
) -> jaxtyping.Int8[ndarray, 'coord row_col']

return a path between randomly chosen start and end nodes within the connected component

Note that setting special conditions on start and end positions might cause the same position to be selected as both start and end.

Parameters:

Returns:

Raises:

def as_adj_list

(
    self,
    shuffle_d0: bool = True,
    shuffle_d1: bool = True
) -> jaxtyping.Int8[ndarray, 'conn start_end coord']

def from_adj_list

(
    cls,
    adj_list: jaxtyping.Int8[ndarray, 'conn start_end coord']
) -> maze_dataset.maze.lattice_maze.LatticeMaze

create a LatticeMaze from a list of connections

[!NOTE] This has only been tested for square mazes. Might need to change some things if rectangular mazes are needed.

def as_adj_list_tokens

(self) -> list[str | tuple[int, int]]

def as_tokens

(
    self,
    maze_tokenizer: maze_dataset.tokenization.maze_tokenizer.MazeTokenizer | maze_dataset.tokenization.maze_tokenizer.TokenizationMode | maze_dataset.tokenization.maze_tokenizer.MazeTokenizerModular
) -> list[str]

serialize maze and solution to tokens

def from_tokens

(
    cls,
    tokens: list[str],
    maze_tokenizer: maze_dataset.tokenization.maze_tokenizer.MazeTokenizer | maze_dataset.tokenization.maze_tokenizer.TokenizationMode | maze_dataset.tokenization.maze_tokenizer.MazeTokenizerModular
) -> maze_dataset.maze.lattice_maze.LatticeMaze

Constructs a maze from a tokenization. Only legacy tokenizers and their MazeTokenizerModular analogs are supported.

def as_pixels

(
    self,
    show_endpoints: bool = True,
    show_solution: bool = True
) -> jaxtyping.Int[ndarray, 'x y rgb']

def from_pixels

(
    cls,
    pixel_grid: jaxtyping.Int[ndarray, 'x y rgb']
) -> maze_dataset.maze.lattice_maze.LatticeMaze

def as_ascii

(self, show_endpoints: bool = True, show_solution: bool = True) -> str

return an ASCII grid of the maze
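
One way such a grid can be laid out is sketched below, using the `WALL` and `OPEN` defaults from `AsciiChars`. The (2n+1) by (2n+1) layout, with cells at odd positions and connections between them, is an assumption for this illustration and may differ from what `as_ascii` actually emits. `render_ascii` is a hypothetical function operating on nested lists.

```python
WALL, OPEN = "#", " "


def render_ascii(conn):
    """Draw a [down, right] connection list as a character grid."""
    down, right = conn
    rows, cols = len(down), len(down[0])
    grid = [[WALL] * (2 * cols + 1) for _ in range(2 * rows + 1)]
    for r in range(rows):
        for c in range(cols):
            grid[2 * r + 1][2 * c + 1] = OPEN      # the cell itself
            if down[r][c]:
                grid[2 * r + 2][2 * c + 1] = OPEN  # passage to the cell below
            if right[r][c]:
                grid[2 * r + 1][2 * c + 2] = OPEN  # passage to the cell on the right
    return "\n".join("".join(line) for line in grid)


example = [
    [[False, True], [False, False]],  # down
    [[True, False], [True, False]],   # right
]
ascii_maze = render_ascii(example)
print(ascii_maze)
```

For the 2x2 docstring example this prints a 5x5 block in which the only vertical passage is on the right-hand side.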

def from_ascii

(cls, ascii_str: str) -> maze_dataset.maze.lattice_maze.LatticeMaze

def serialize

(self) -> dict[str, typing.Any]

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

class AsciiChars:

standard ascii characters for mazes

AsciiChars

(
    WALL: str = '#',
    OPEN: str = ' ',
    START: str = 'S',
    END: str = 'E',
    PATH: str = 'X'
)

class PixelColors:

standard colors for pixel grids

PixelColors

(
    WALL: tuple[int, int, int] = (0, 0, 0),
    OPEN: tuple[int, int, int] = (255, 255, 255),
    START: tuple[int, int, int] = (0, 255, 0),
    END: tuple[int, int, int] = (255, 0, 0),
    PATH: tuple[int, int, int] = (0, 0, 255)
)

maze_dataset.maze.lattice_maze

rgb tuple of values 0-255

rgb grid of pixels

boolean grid of pixels

def color_in_pixel_grid

(
    pixel_grid: jaxtyping.Int[ndarray, 'x y rgb'],
    color: tuple[int, int, int]
) -> bool

class PixelColors:

standard colors for pixel grids

PixelColors

(
    WALL: tuple[int, int, int] = (0, 0, 0),
    OPEN: tuple[int, int, int] = (255, 255, 255),
    START: tuple[int, int, int] = (0, 255, 0),
    END: tuple[int, int, int] = (255, 0, 0),
    PATH: tuple[int, int, int] = (0, 0, 255)
)

class AsciiChars:

standard ascii characters for mazes

AsciiChars

(
    WALL: str = '#',
    OPEN: str = ' ',
    START: str = 'S',
    END: str = 'E',
    PATH: str = 'X'
)

map ascii characters to pixel colors

class LatticeMaze(muutils.json_serialize.serializable_dataclass.SerializableDataclass):

lattice maze (nodes on a lattice, connections only to neighboring nodes)

Connection List represents which nodes (N) are connected in each direction.

First and second elements represent downward and rightward connections, respectively (matching the `# down` / `# right` order in the example).

Example:

    Connection list:
    [
        [ # down
            [F T],
            [F F]
        ],
        [ # right
            [T F],
            [T F]
        ]
    ]

    Nodes with connections:
    N T N F
    F   T
    N T N F
    F   F

    Graph:
    N - N
        |
    N - N

Note: the bottom row connections going down, and the right-hand connections going right, will always be False.

LatticeMaze

(
    *,
    connection_list: jaxtyping.Bool[ndarray, 'lattice_dim=2 row col'],
    generation_meta: dict | None = None
)

def heuristic

(a: tuple[int, int], b: tuple[int, int]) -> float

return manhattan distance between two points

def nodes_connected

(
    self,
    a: jaxtyping.Int8[ndarray, 'row_col'],
    b: jaxtyping.Int8[ndarray, 'row_col'],
    /
) -> bool

returns whether two nodes are connected

def is_valid_path

(
    self,
    path: jaxtyping.Int8[ndarray, 'coord row_col'],
    empty_is_valid: bool = False
) -> bool

check if a path is valid

def coord_degrees

(self) -> jaxtyping.Int8[ndarray, 'row col']

Returns an array with the connectivity degree of each coord. I.e., how many neighbors each coord has.

def get_coord_neighbors

(
    self,
    c: jaxtyping.Int8[ndarray, 'row_col']
) -> jaxtyping.Int8[ndarray, 'coord row_col']

Returns an array of the neighboring, connected coords of c.

def gen_connected_component_from

(
    self,
    c: jaxtyping.Int8[ndarray, 'row_col']
) -> jaxtyping.Int8[ndarray, 'coord row_col']

return the connected component from a given coordinate

def find_shortest_path

(
    self,
    c_start: tuple[int, int],
    c_end: tuple[int, int]
) -> jaxtyping.Int8[ndarray, 'coord row_col']

find the shortest path between two coordinates, using A*

def get_nodes

(self) -> jaxtyping.Int8[ndarray, 'coord row_col']

return a list of all nodes in the maze

def get_connected_component

(self) -> jaxtyping.Int8[ndarray, 'coord row_col']

get the largest (and assumed only nonsingular) connected component of the maze

TODO: other connected components?

def generate_random_path

(
    self,
    except_when_invalid: bool = True,
    allowed_start: list[tuple[int, int]] | None = None,
    allowed_end: list[tuple[int, int]] | None = None,
    deadend_start: bool = False,
    deadend_end: bool = False,
    endpoints_not_equal: bool = False
) -> jaxtyping.Int8[ndarray, 'coord row_col']

return a path between randomly chosen start and end nodes within the connected component

Note that setting special conditions on start and end positions might cause the same position to be selected as both start and end.

Parameters:

Returns:

Raises:

def as_adj_list

(
    self,
    shuffle_d0: bool = True,
    shuffle_d1: bool = True
) -> jaxtyping.Int8[ndarray, 'conn start_end coord']

def from_adj_list

(
    cls,
    adj_list: jaxtyping.Int8[ndarray, 'conn start_end coord']
) -> maze_dataset.maze.lattice_maze.LatticeMaze

create a LatticeMaze from a list of connections

[!NOTE] This has only been tested for square mazes. Might need to change some things if rectangular mazes are needed.

def as_adj_list_tokens

(self) -> list[str | tuple[int, int]]

def as_tokens

(
    self,
    maze_tokenizer: maze_dataset.tokenization.maze_tokenizer.MazeTokenizer | maze_dataset.tokenization.maze_tokenizer.TokenizationMode | maze_dataset.tokenization.maze_tokenizer.MazeTokenizerModular
) -> list[str]

serialize maze and solution to tokens

def from_tokens

(
    cls,
    tokens: list[str],
    maze_tokenizer: maze_dataset.tokenization.maze_tokenizer.MazeTokenizer | maze_dataset.tokenization.maze_tokenizer.TokenizationMode | maze_dataset.tokenization.maze_tokenizer.MazeTokenizerModular
) -> maze_dataset.maze.lattice_maze.LatticeMaze

Constructs a maze from a tokenization. Only legacy tokenizers and their MazeTokenizerModular analogs are supported.

def as_pixels

(
    self,
    show_endpoints: bool = True,
    show_solution: bool = True
) -> jaxtyping.Int[ndarray, 'x y rgb']

def from_pixels

(
    cls,
    pixel_grid: jaxtyping.Int[ndarray, 'x y rgb']
) -> maze_dataset.maze.lattice_maze.LatticeMaze

def as_ascii

(self, show_endpoints: bool = True, show_solution: bool = True) -> str

return an ASCII grid of the maze

def from_ascii

(cls, ascii_str: str) -> maze_dataset.maze.lattice_maze.LatticeMaze

def serialize

(self) -> dict[str, typing.Any]

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

class TargetedLatticeMaze(LatticeMaze):

A LatticeMaze with a start and end position

TargetedLatticeMaze

(
    *,
    connection_list: jaxtyping.Bool[ndarray, 'lattice_dim=2 row col'],
    generation_meta: dict | None = None,
    start_pos: jaxtyping.Int8[ndarray, 'row_col'],
    end_pos: jaxtyping.Int8[ndarray, 'row_col']
)

def get_start_pos_tokens

(self) -> list[str | tuple[int, int]]

def get_end_pos_tokens

(self) -> list[str | tuple[int, int]]

def from_lattice_maze

(
    cls,
    lattice_maze: maze_dataset.maze.lattice_maze.LatticeMaze,
    start_pos: jaxtyping.Int8[ndarray, 'row_col'],
    end_pos: jaxtyping.Int8[ndarray, 'row_col']
) -> maze_dataset.maze.lattice_maze.TargetedLatticeMaze

def serialize

(self) -> dict[str, typing.Any]

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

class SolvedMaze(TargetedLatticeMaze):

Stores a maze and a solution

SolvedMaze

(
    connection_list: jaxtyping.Bool[ndarray, 'lattice_dim=2 row col'],
    solution: jaxtyping.Int8[ndarray, 'coord row_col'],
    generation_meta: dict | None = None,
    start_pos: jaxtyping.Int8[ndarray, 'row_col'] | None = None,
    end_pos: jaxtyping.Int8[ndarray, 'row_col'] | None = None,
    allow_invalid: bool = False
)

def get_solution_tokens

(self) -> list[str | tuple[int, int]]

def from_lattice_maze

(
    cls,
    lattice_maze: maze_dataset.maze.lattice_maze.LatticeMaze,
    solution: list[tuple[int, int]]
) -> maze_dataset.maze.lattice_maze.SolvedMaze

def from_targeted_lattice_maze

(
    cls,
    targeted_lattice_maze: maze_dataset.maze.lattice_maze.TargetedLatticeMaze,
    solution: list[tuple[int, int]] | None = None
) -> maze_dataset.maze.lattice_maze.SolvedMaze

solves the given targeted lattice maze and returns a SolvedMaze

def get_solution_forking_points

(
    self,
    always_include_endpoints: bool = False
) -> tuple[list[int], jaxtyping.Int8[ndarray, 'coord row_col']]

coordinates and their indices from the solution where a fork is present

def get_solution_path_following_points

(self) -> tuple[list[int], jaxtyping.Int8[ndarray, 'coord row_col']]

coordinates from the solution where there is only a single (non-backtracking) point to move to

returns the complement of get_solution_forking_points from the path

def serialize

(self) -> dict[str, typing.Any]

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

def detect_pixels_type

(
    data: jaxtyping.Int[ndarray, 'x y rgb']
) -> Type[maze_dataset.maze.lattice_maze.LatticeMaze]

Detects the type of pixels data by checking for the presence of start and end pixels

maze_dataset.plotting

utilities for plotting mazes and printing tokens

def plot_dataset_mazes

(
    ds: maze_dataset.dataset.maze_dataset.MazeDataset,
    count: int | None = None,
    figsize_mult: tuple[float, float] = (1.0, 2.0),
    title: bool | str = True
) -> tuple

(
    ds: maze_dataset.dataset.maze_dataset.MazeDataset,
    count: int | None = None
)

class MazePlot:

Class for displaying mazes and paths

MazePlot

(
    maze: maze_dataset.maze.lattice_maze.LatticeMaze,
    unit_length: int = 14
)

UNIT_LENGTH: Sets the ratio between node size and wall thickness in the image. Wall thickness is fixed at 1px. A “unit” consists of a single node plus its right and lower connection/wall. Example: ul = 14 yields a 13:1 ratio between node size and wall thickness.

def add_true_path

(
    self,
    path: list[tuple[int, int]] | jaxtyping.Int8[ndarray, 'coord row_col'] | maze_dataset.plotting.plot_maze.StyledPath,
    path_fmt: maze_dataset.plotting.plot_maze.PathFormat | None = None,
    **kwargs
) -> maze_dataset.plotting.plot_maze.MazePlot

def add_predicted_path

(
    self,
    path: list[tuple[int, int]] | jaxtyping.Int8[ndarray, 'coord row_col'] | maze_dataset.plotting.plot_maze.StyledPath,
    path_fmt: maze_dataset.plotting.plot_maze.PathFormat | None = None,
    **kwargs
) -> maze_dataset.plotting.plot_maze.MazePlot

Receive a predicted path and formatting preferences from input and save them in the predicted_path list. Default formatting depends on the number of paths already saved in the predicted path list.

def add_multiple_paths

(
    self,
    path_list: list[list[tuple[int, int]] | jaxtyping.Int8[ndarray, 'coord row_col'] | maze_dataset.plotting.plot_maze.StyledPath]
)

Function for adding multiple paths to MazePlot at once. This can be done in two ways: 1. Passing a list of

def add_node_values

(
    self,
    node_values: jaxtyping.Float[ndarray, 'grid_n grid_n'],
    color_map: str = 'Blues',
    target_token_coord: jaxtyping.Int8[ndarray, 'row_col'] | None = None,
    preceeding_tokens_coords: jaxtyping.Int8[ndarray, 'coord row_col'] = None,
    colormap_center: float | None = None,
    colormap_max: float | None = None,
    hide_colorbar: bool = False
) -> maze_dataset.plotting.plot_maze.MazePlot

View Source on GitHub

def plot

(
    self,
    dpi: int = 100,
    title: str = '',
    fig_ax: tuple | None = None,
    plain: bool = False
) -> maze_dataset.plotting.plot_maze.MazePlot

View Source on GitHub

Plot the maze and paths.

def mark_coords

(
    self,
    coords: jaxtyping.Int8[ndarray, 'coord row_col'] | list[jaxtyping.Int8[ndarray, 'row_col']],
    **kwargs
) -> maze_dataset.plotting.plot_maze.MazePlot

View Source on GitHub

def to_ascii

(self, show_endpoints: bool = True, show_solution: bool = True) -> str

View Source on GitHub

class PathFormat:

View Source on GitHub

formatting options for path plot

PathFormat

(
    *,
    label: str | None = None,
    fmt: str = 'o',
    color: str | None = None,
    cmap: str | None = None,
    line_width: float | None = None,
    quiver_kwargs: dict | None = None
)

def combine

(
    self,
    other: maze_dataset.plotting.plot_maze.PathFormat
) -> maze_dataset.plotting.plot_maze.PathFormat

View Source on GitHub

combine with other PathFormat object, overwriting attributes with non-None values.

returns a modified copy of self.

def color_tokens_cmap

(
    tokens: list[str],
    weights: Sequence[float],
    cmap: str | matplotlib.colors.Colormap = 'Blues',
    fmt: Literal['html', 'latex', 'terminal', None] = 'html',
    template: str | None = None,
    labels: bool = False
)

View Source on GitHub

color tokens given a list of weights and a colormap

def color_maze_tokens_AOTP

(
    tokens: list[str],
    fmt: Literal['html', 'latex', 'terminal', None] = 'html',
    template: str | None = None,
    **kwargs
) -> str

View Source on GitHub

color tokens assuming AOTP format

i.e., adjacency list, origin, target, path

def color_tokens_rgb

(
    tokens: list,
    colors: Sequence[Sequence[int]],
    fmt: Literal['html', 'latex', 'terminal', None] = 'html',
    template: str | None = None,
    clr_join: str | None = None,
    max_length: int | None = None
) -> str

View Source on GitHub

color tokens from a list with an RGB color array

tokens will not be escaped if fmt is None

Parameters:

maze_dataset.plotting.plot_dataset

plot_dataset_mazes will plot several mazes using as_pixels

print_dataset_mazes will use as_ascii to print several mazes

View Source on GitHub

def plot_dataset_mazes

(
    ds: maze_dataset.dataset.maze_dataset.MazeDataset,
    count: int | None = None,
    figsize_mult: tuple[float, float] = (1.0, 2.0),
    title: bool | str = True
) -> tuple

View Source on GitHub

(
    ds: maze_dataset.dataset.maze_dataset.MazeDataset,
    count: int | None = None
)

View Source on GitHub

maze_dataset.plotting.plot_maze

provides MazePlot, which has many tools for plotting mazes with multiple paths, colored nodes, and more

View Source on GitHub

class PathFormat:

View Source on GitHub

formatting options for path plot

PathFormat

(
    *,
    label: str | None = None,
    fmt: str = 'o',
    color: str | None = None,
    cmap: str | None = None,
    line_width: float | None = None,
    quiver_kwargs: dict | None = None
)

def combine

(
    self,
    other: maze_dataset.plotting.plot_maze.PathFormat
) -> maze_dataset.plotting.plot_maze.PathFormat

View Source on GitHub

combine with other PathFormat object, overwriting attributes with non-None values.

returns a modified copy of self.
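As a rough illustration of the merge semantics described here, the sketch below uses a stand-in dataclass (`Fmt`, a hypothetical name, not the library's `PathFormat`). Its `combine` copies `self` and overwrites only the fields that are non-None in `other`. Whether the real implementation works this way internally is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class Fmt:
    """Hypothetical stand-in for a path-format object (not the library class)."""

    label: str | None = None
    color: str | None = None
    line_width: float | None = None

    def combine(self, other: Fmt) -> Fmt:
        # collect only the attributes that `other` actually sets
        overrides = {
            f.name: getattr(other, f.name)
            for f in fields(other)
            if getattr(other, f.name) is not None
        }
        # return a modified copy; `self` is left untouched
        return replace(self, **overrides)


base = Fmt(label="true", color="red")
merged = base.combine(Fmt(color="blue", line_width=2.0))
print(merged)
```

Fields left as None in `other` (here `label`) keep the value from `self`, so a partially specified format can be layered over a default one.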

class StyledPath(PathFormat):

View Source on GitHub

StyledPath

(
    path: jaxtyping.Int8[ndarray, 'coord row_col'],
    *,
    label: str | None = None,
    fmt: str = 'o',
    color: str | None = None,
    cmap: str | None = None,
    line_width: float | None = None,
    quiver_kwargs: dict | None = None
)

Inherited Members

def process_path_input

(
    path: list[tuple[int, int]] | jaxtyping.Int8[ndarray, 'coord row_col'] | maze_dataset.plotting.plot_maze.StyledPath,
    _default_key: str,
    path_fmt: maze_dataset.plotting.plot_maze.PathFormat | None = None,
    **kwargs
) -> maze_dataset.plotting.plot_maze.StyledPath

View Source on GitHub

class MazePlot:

View Source on GitHub

Class for displaying mazes and paths

MazePlot

(
    maze: maze_dataset.maze.lattice_maze.LatticeMaze,
    unit_length: int = 14
)

View Source on GitHub

UNIT_LENGTH: Sets the ratio between node size and wall thickness in the image. Wall thickness is fixed to 1px. A “unit” consists of a single node and the right and lower connection/wall. Example: ul = 14 yields a 13:1 ratio between node size and wall thickness.
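The arithmetic behind the unit length can be sketched as follows. The helper `unit_layout` is hypothetical, and the total image size assumes one extra 1px wall closing the top and left border, which the text here does not confirm.

```python
def unit_layout(grid_n: int, unit_length: int = 14) -> dict[str, int]:
    """Pixel sizes implied by the unit-length convention (illustrative only)."""
    wall_px = 1  # wall thickness is fixed at 1px
    node_px = unit_length - wall_px  # the rest of each unit is the node
    # assumption: one extra wall closes the top/left border of the image
    image_px = grid_n * unit_length + wall_px
    return {"node_px": node_px, "wall_px": wall_px, "image_px": image_px}


print(unit_layout(grid_n=5))
# {'node_px': 13, 'wall_px': 1, 'image_px': 71}
```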

View Source on GitHub

def add_true_path

(
    self,
    path: list[tuple[int, int]] | jaxtyping.Int8[ndarray, 'coord row_col'] | maze_dataset.plotting.plot_maze.StyledPath,
    path_fmt: maze_dataset.plotting.plot_maze.PathFormat | None = None,
    **kwargs
) -> maze_dataset.plotting.plot_maze.MazePlot

View Source on GitHub

def add_predicted_path

(
    self,
    path: list[tuple[int, int]] | jaxtyping.Int8[ndarray, 'coord row_col'] | maze_dataset.plotting.plot_maze.StyledPath,
    path_fmt: maze_dataset.plotting.plot_maze.PathFormat | None = None,
    **kwargs
) -> maze_dataset.plotting.plot_maze.MazePlot

View Source on GitHub

Receives a predicted path and formatting preferences from the input and saves them in the predicted_path list. Default formatting depends on the number of paths already saved in the predicted path list.

def add_multiple_paths

(
    self,
    path_list: list[list[tuple[int, int]] | jaxtyping.Int8[ndarray, 'coord row_col'] | maze_dataset.plotting.plot_maze.StyledPath]
)

View Source on GitHub

Function for adding multiple paths to MazePlot at once. This can be done in two ways: 1. Passing a list of

def add_node_values

(
    self,
    node_values: jaxtyping.Float[ndarray, 'grid_n grid_n'],
    color_map: str = 'Blues',
    target_token_coord: jaxtyping.Int8[ndarray, 'row_col'] | None = None,
    preceeding_tokens_coords: jaxtyping.Int8[ndarray, 'coord row_col'] = None,
    colormap_center: float | None = None,
    colormap_max: float | None = None,
    hide_colorbar: bool = False
) -> maze_dataset.plotting.plot_maze.MazePlot

View Source on GitHub

def plot

(
    self,
    dpi: int = 100,
    title: str = '',
    fig_ax: tuple | None = None,
    plain: bool = False
) -> maze_dataset.plotting.plot_maze.MazePlot

View Source on GitHub

Plot the maze and paths.

def mark_coords

(
    self,
    coords: jaxtyping.Int8[ndarray, 'coord row_col'] | list[jaxtyping.Int8[ndarray, 'row_col']],
    **kwargs
) -> maze_dataset.plotting.plot_maze.MazePlot

View Source on GitHub

def to_ascii

(self, show_endpoints: bool = True, show_solution: bool = True) -> str

View Source on GitHub

maze_dataset.plotting.plot_tokens

plot_colored_text function to plot tokens on a matplotlib axis with colored backgrounds

View Source on GitHub

def plot_colored_text

(
    tokens: Sequence[str],
    weights: Sequence[float],
    cmap: str | typing.Any,
    ax: matplotlib.axes._axes.Axes = None,
    width_scale: float = 0.023,
    width_offset: float = 0.005,
    height_offset: float = 0.1,
    rect_height: float = 0.7,
    token_height: float = 0.7,
    label_height: float = 0.3,
    word_gap: float = 0.01,
    fontsize: int = 12,
    fig_height: float = 0.7,
    fig_width_scale: float = 0.25,
    char_min: int = 4
)

View Source on GitHub

hacky function to plot tokens on a matplotlib axis with colored backgrounds

maze_dataset.plotting.print_tokens

Functions to print tokens with colors in different formats

you can color the tokens by their:

and the output can be in different formats, specified by FormatType (html, latex, terminal)

View Source on GitHub

1D array of RGB values

output format for the tokens

templates of printing tokens in different formats

def color_tokens_rgb

(
    tokens: list,
    colors: Sequence[Sequence[int]],
    fmt: Literal['html', 'latex', 'terminal', None] = 'html',
    template: str | None = None,
    clr_join: str | None = None,
    max_length: int | None = None
) -> str

View Source on GitHub

color tokens from a list with an RGB color array

tokens will not be escaped if fmt is None

Parameters:

def color_tokens_cmap

(
    tokens: list[str],
    weights: Sequence[float],
    cmap: str | matplotlib.colors.Colormap = 'Blues',
    fmt: Literal['html', 'latex', 'terminal', None] = 'html',
    template: str | None = None,
    labels: bool = False
)

View Source on GitHub

color tokens given a list of weights and a colormap

def color_maze_tokens_AOTP

(
    tokens: list[str],
    fmt: Literal['html', 'latex', 'terminal', None] = 'html',
    template: str | None = None,
    **kwargs
) -> str

View Source on GitHub

color tokens assuming AOTP format

i.e., adjacency list, origin, target, path

def display_html

(html: str)

View Source on GitHub

def display_color_tokens_rgb

(tokens: list[str], colors: jaxtyping.UInt8[ndarray, 'n 3']) -> None

View Source on GitHub

def display_color_tokens_cmap

(
    tokens: list[str],
    weights: Sequence[float],
    cmap: str | matplotlib.colors.Colormap = 'Blues'
) -> None

View Source on GitHub

def display_color_maze_tokens_AOTP

(tokens: list[str]) -> None

View Source on GitHub

maze_dataset.testing_utils

Shared utilities for tests only. Do not import into any module outside of the tests directory

View Source on GitHub

class MANUAL_MAZE(typing.NamedTuple):

View Source on GitHub

MANUAL_MAZE(tokens, ascii, straightaway_footprints)

MANUAL_MAZE

(
    tokens: str,
    ascii: tuple[str],
    straightaway_footprints: jaxtyping.Int8[ndarray, 'coord row_col']
)

Create new instance of MANUAL_MAZE(tokens, ascii, straightaway_footprints)

Alias for field number 0

Alias for field number 1

Alias for field number 2

Inherited Members

maze_dataset.token_utils

a whole bunch of utilities for tokenization

View Source on GitHub

def remove_padding_from_token_str

(token_str: str) -> str

View Source on GitHub

def tokens_between

(
    tokens: list[str],
    start_value: str,
    end_value: str,
    include_start: bool = False,
    include_end: bool = False,
    except_when_tokens_not_unique: bool = False
) -> list[str]

View Source on GitHub

def get_adj_list_tokens

(tokens: list[str]) -> list[str]

View Source on GitHub

def get_path_tokens

(tokens: list[str], trim_end: bool = False) -> list[str]

View Source on GitHub

The path is considered everything from the first path coord to the path_end token, if it exists.

def get_context_tokens

(tokens: list[str]) -> list[str]

View Source on GitHub

def get_origin_tokens

(tokens: list[str]) -> list[str]

View Source on GitHub

def get_target_tokens

(tokens: list[str]) -> list[str]

View Source on GitHub

def get_cardinal_direction

(coords: jaxtyping.Int[ndarray, 'start_end=2 row_col=2']) -> str

View Source on GitHub

Returns the cardinal direction token corresponding to traveling from coords[0] to coords[1].
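A minimal sketch of the idea, using plain Python tuples instead of arrays. The direction names and the convention that a decreasing row index means "north" are assumptions made for illustration; the library returns its own direction tokens.

```python
# hypothetical direction names; the library returns its own direction tokens
_DELTA_TO_DIRECTION = {
    (-1, 0): "north",  # row decreases: assumed to mean "up"
    (1, 0): "south",
    (0, 1): "east",
    (0, -1): "west",
}


def cardinal_direction(start: tuple[int, int], end: tuple[int, int]) -> str:
    """Name the direction of a single step between neighboring (row, col) cells."""
    delta = (end[0] - start[0], end[1] - start[1])
    if delta not in _DELTA_TO_DIRECTION:
        raise ValueError(f"{start} and {end} are not lattice neighbors")
    return _DELTA_TO_DIRECTION[delta]


print(cardinal_direction((2, 2), (1, 2)))  # north
```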

def get_relative_direction

(coords: jaxtyping.Int[ndarray, 'prev_cur_next=3 row_col=2']) -> str

View Source on GitHub

Returns the relative first-person direction token corresponding to traveling from coords[1] to coords[2].

Parameters

- coords: Contains 3 Coords, each of which must neighbor the previous Coord.
  - coords[0]: The previous location, used to determine the current absolute direction that the “agent” is facing.
  - coords[1]: The current location.
  - coords[2]: The next location. May be equal to the current location.
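One way to picture the relative direction is as a comparison between the current heading and the next step. The sketch below is a stand-alone illustration with hypothetical return values ("forward", "left", and so on). It assumes rows grow downward and columns grow rightward, and it is not the library's implementation.

```python
Coord = tuple[int, int]


def relative_direction(prev: Coord, cur: Coord, nxt: Coord) -> str:
    """Classify the step cur -> nxt relative to the heading prev -> cur."""
    heading = (cur[0] - prev[0], cur[1] - prev[1])
    step = (nxt[0] - cur[0], nxt[1] - cur[1])
    if step == (0, 0):
        return "stay"  # next location equals the current one
    dot = heading[0] * step[0] + heading[1] * step[1]
    if dot > 0:
        return "forward"
    if dot < 0:
        return "backward"
    # perpendicular step: the sign of the 2D cross product gives the turn,
    # assuming rows grow downward and columns grow rightward
    cross = heading[0] * step[1] - heading[1] * step[0]
    return "left" if cross > 0 else "right"


# heading north (row decreasing), then stepping east is a right turn
print(relative_direction((2, 1), (1, 1), (1, 2)))  # right
```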

class TokenizerPendingDeprecationWarning(builtins.PendingDeprecationWarning):

View Source on GitHub

Pending deprecation warnings related to the MazeTokenizerModular upgrade.

Inherited Members

def str_is_coord

(coord_str: str, allow_whitespace: bool = True) -> bool

View Source on GitHub

return True if the string represents a coordinate, False otherwise

class TokenizerDeprecationWarning(builtins.DeprecationWarning):

View Source on GitHub

Deprecation warnings related to the MazeTokenizerModular upgrade.

Inherited Members

def coord_str_to_tuple

(coord_str: str, allow_whitespace: bool = True) -> tuple[int, ...]

View Source on GitHub

convert a coordinate string to a tuple

def coord_str_to_coord_np

(coord_str: str, allow_whitespace: bool = True) -> numpy.ndarray

View Source on GitHub

convert a coordinate string to a numpy array

def coord_str_to_tuple_noneable

(coord_str: str) -> tuple[int, int] | None

View Source on GitHub

convert a coordinate string to a tuple, or None if the string is not a coordinate string
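The "tuple or None" behavior can be sketched with a regular expression. The helper name `parse_coord_or_none` and the exact pattern are assumptions for illustration, and the library's accepted syntax may differ.

```python
from __future__ import annotations

import re

# matches strings such as "(1,2)" or "( 1 , 2 )"
_COORD_RE = re.compile(r"\(\s*(\d+)\s*,\s*(\d+)\s*\)")


def parse_coord_or_none(coord_str: str) -> tuple[int, int] | None:
    """Return (row, col) if the whole string is a coordinate, else None."""
    match = _COORD_RE.fullmatch(coord_str.strip())
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(parse_coord_or_none("(1,2)"))         # (1, 2)
print(parse_coord_or_none("<SOME_TOKEN>"))  # None
```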

def coords_string_split_UT

(coords: str) -> list[str]

View Source on GitHub

Splits a string of tokens into a list containing the UT tokens for each coordinate.

Not capable of producing indexed tokens (“(”, “1”, “,”, “2”, “)”), only unique tokens (“(1,2)”). Non-whitespace portions of the input string not matched are preserved in the same list: “(1,2) (5,6)” -> [“(1,2)”, “”, “(5,6)”]
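A sketch of this kind of splitting, assuming a simple regular expression that first tries to match a whole coordinate and otherwise takes any run of non-whitespace characters. The function name, the pattern, and the `<SOME_TOKEN>` placeholder are hypothetical.

```python
import re

# a coordinate like "(1,2)", or otherwise any run of non-whitespace characters
_UT_SPLIT_RE = re.compile(r"\(\s*\d+\s*,\s*\d+\s*\)|\S+")


def split_unique_tokens(text: str) -> list[str]:
    """Split text into whole-coordinate tokens plus any other non-whitespace chunks."""
    return _UT_SPLIT_RE.findall(text)


print(split_unique_tokens("(1,2) <SOME_TOKEN> (5,6)"))
# ['(1,2)', '<SOME_TOKEN>', '(5,6)']
```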

def strings_to_coords

(
    text: str | list[str],
    when_noncoord: Literal['except', 'skip', 'include'] = 'skip'
) -> list[str | tuple[int, int]]

View Source on GitHub

converts a list of tokens to a list of coordinates

Returns list[CoordTup] if when_noncoord is “skip” or “except”; returns list[str | CoordTup] if when_noncoord is “include”.
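The three `when_noncoord` modes can be illustrated with a small stand-alone function. `tokens_to_coords` is a hypothetical name, and the choice of `ValueError` for the "except" mode is an assumption of this sketch.

```python
from __future__ import annotations

import re

_COORD_RE = re.compile(r"\(\s*(\d+)\s*,\s*(\d+)\s*\)")


def tokens_to_coords(
    tokens: list[str],
    when_noncoord: str = "skip",  # one of "except", "skip", "include"
) -> list[str | tuple[int, int]]:
    """Convert coordinate-like tokens to tuples; treat the rest per `when_noncoord`."""
    out: list[str | tuple[int, int]] = []
    for tok in tokens:
        match = _COORD_RE.fullmatch(tok)
        if match is not None:
            out.append((int(match.group(1)), int(match.group(2))))
        elif when_noncoord == "include":
            out.append(tok)  # keep the non-coordinate token as a string
        elif when_noncoord == "except":
            raise ValueError(f"not a coordinate: {tok!r}")
        # any other value, including "skip": drop the token silently
    return out


toks = ["(0,0)", "<SOME_TOKEN>", "(0,1)"]
print(tokens_to_coords(toks))             # [(0, 0), (0, 1)]
print(tokens_to_coords(toks, "include"))  # [(0, 0), '<SOME_TOKEN>', (0, 1)]
```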

def coords_to_strings

(
    coords: list[str | tuple[int, int]],
    coord_to_strings_func: Callable[[tuple[int, int]], list[str]],
    when_noncoord: Literal['except', 'skip', 'include'] = 'skip'
) -> list[str]

View Source on GitHub

converts a list of coordinates to a list of strings (tokens)

Expects list[CoordTup] if when_noncoord is “except”; expects list[str | CoordTup] if when_noncoord is “include” or “skip”.

def get_token_regions

(toks: list[str]) -> tuple[list[str], list[str]]

View Source on GitHub

def equal_except_adj_list_sequence

(
    rollout1: list[str],
    rollout2: list[str],
    do_except: bool = False,
    when_counter_mismatch: muutils.errormode.ErrorMode = ErrorMode.Except,
    when_len_mismatch: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

Returns whether the rollout strings are equal, allowing for differently sequenced adjacency lists. The adjacency list start and end delimiter tokens must be present in the rollouts. Intended ONLY for determining if two tokenization schemes are the same for rollouts generated from the same maze. This function should NOT be used to determine if two rollouts encode the same LatticeMaze object.

Warning: CTT False Positives

This function is not robustly correct for some corner cases using CoordTokenizers.CTT. If rollouts are passed for identical tokenizers processing two slightly different mazes, a false positive is possible. More specifically, some cases of zero-sum adding and removing of connections in a maze within square regions along the diagonal will produce a false positive.
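To make the comparison concrete, here is one plausible approach, offered as a sketch rather than the library's method: compare the tokens outside the adjacency list exactly, and compare the tokens inside it as a multiset. The delimiter names `<ADJ_START>` and `<ADJ_END>` are placeholders chosen for this example.

```python
from collections import Counter

# hypothetical delimiter names, chosen for this sketch only
ADJ_START, ADJ_END = "<ADJ_START>", "<ADJ_END>"


def split_regions(tokens: list[str]) -> tuple[list[str], list[str]]:
    """Return (adjacency-list tokens, all other tokens)."""
    start, end = tokens.index(ADJ_START), tokens.index(ADJ_END)
    return tokens[start + 1 : end], tokens[:start] + tokens[end + 1 :]


def equal_up_to_adj_order(a: list[str], b: list[str]) -> bool:
    """Equal outside the adjacency list, and equal as a multiset inside it."""
    adj_a, rest_a = split_regions(a)
    adj_b, rest_b = split_regions(b)
    return rest_a == rest_b and Counter(adj_a) == Counter(adj_b)


r1 = [ADJ_START, "(0,0)", "<-->", "(0,1)", ";", ADJ_END, "(0,0)"]
r2 = [ADJ_START, "(0,1)", "<-->", "(0,0)", ";", ADJ_END, "(0,0)"]
print(equal_up_to_adj_order(r1, r2))  # True
```

A multiset comparison ignores how tokens are grouped into edges, so two different mazes whose adjacency lists happen to contain the same token counts would compare equal. That may be the kind of corner case the warning has in mind when coordinates are split into several tokens.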

def connection_list_to_adj_list

(
    conn_list: jaxtyping.Bool[ndarray, 'lattice_dim=2 row col'],
    shuffle_d0: bool = True,
    shuffle_d1: bool = True
) -> jaxtyping.Int8[ndarray, 'conn start_end=2 coord=2']

View Source on GitHub

converts a ConnectionList (special lattice format) to a shuffled adjacency list

Parameters:

Returns:
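To illustrate the conversion that connection_list_to_adj_list performs, the sketch below works on nested Python lists rather than arrays. It assumes that index 0 of the first axis means "connected to the cell below" and index 1 means "connected to the cell to the right". It also assumes the two shuffle flags control the order of edges and the order of each edge's endpoints, matching the first two axes of the returned array. All names are hypothetical.

```python
import random


def conn_list_to_edges(conn_list, shuffle_edges=True, shuffle_ends=True, seed=None):
    """conn_list[d][r][c] is True when cell (r, c) connects down (d=0) or right (d=1)."""
    rng = random.Random(seed)
    edges = []
    for d, (dr, dc) in enumerate([(1, 0), (0, 1)]):  # assumed axis convention
        for r, row in enumerate(conn_list[d]):
            for c, connected in enumerate(row):
                if connected:
                    edges.append([(r, c), (r + dr, c + dc)])
    if shuffle_ends:
        # randomly swap which endpoint of each edge is listed first
        edges = [e[::-1] if rng.random() < 0.5 else e for e in edges]
    if shuffle_edges:
        rng.shuffle(edges)  # randomize the order of the edges themselves
    return edges


# 2x2 lattice: (0,0)-(1,0) connected downward, (0,0)-(0,1) connected rightward
conn = [
    [[True, False], [False, False]],
    [[True, False], [False, False]],
]
print(conn_list_to_edges(conn, shuffle_edges=False, shuffle_ends=False))
# [[(0, 0), (1, 0)], [(0, 0), (0, 1)]]
```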

def is_connection

(
    edges: jaxtyping.Int8[ndarray, 'edges leading_trailing_coord=2 row_col=2'],
    connection_list: jaxtyping.Bool[ndarray, 'lattice_dim=2 row col']
) -> jaxtyping.Bool[ndarray, 'is_connection=edges']

View Source on GitHub

Returns if each edge in edges is a connection (True) or wall (False) in connection_list.

maze_dataset.tokenization

turning a maze into text

View Source on GitHub

class TokenizationMode(enum.Enum):

View Source on GitHub

legacy tokenization modes

[!CAUTION] Legacy mode of tokenization. It will still be around in future releases, but is no longer recommended for use. Use MazeTokenizerModular instead.

Abbreviations:

Modes:

def to_legacy_tokenizer

(self, max_grid_size: int | None = None)

View Source on GitHub

Inherited Members

class _TokenizerElement(muutils.json_serialize.serializable_dataclass.SerializableDataclass, abc.ABC):

View Source on GitHub

Superclass for tokenizer elements. Subclasses contain modular functionality for maze tokenization.

Development

[!TIP] Due to the functionality of get_all_tokenizers(), _TokenizerElement subclasses may only contain fields of type utils.FiniteValued. Implementing a subclass with an int or float-typed field, for example, is not supported. In the event that adding such fields is deemed necessary, get_all_tokenizers() must be updated.

View Source on GitHub

def tokenizer_elements

(
    self,
    deep: bool = True
) -> list[maze_dataset.tokenization.maze_tokenizer._TokenizerElement]

View Source on GitHub

Returns a list of all _TokenizerElement instances contained in the subtree. Currently only detects _TokenizerElement instances which are either direct attributes of another instance or which sit inside a tuple without further nesting.
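The traversal rule described here (direct attributes, plus items of a flat tuple) can be sketched with ordinary dataclasses. The classes `Element`, `Leaf`, `Branch`, and `Root` are hypothetical stand-ins, and the meaning given to `deep` (recurse into nested elements) is an assumption.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Element:
    """Hypothetical base class standing in for a tokenizer element."""

    def elements(self, deep: bool = True) -> list["Element"]:
        found: list[Element] = []
        for f in fields(self):
            value = getattr(self, f.name)
            # look at direct attributes, and at items of a flat tuple attribute
            candidates = value if isinstance(value, tuple) else (value,)
            for item in candidates:
                if isinstance(item, Element):
                    found.append(item)
                    if deep:
                        found.extend(item.elements(deep=True))
        return found


@dataclass(frozen=True)
class Leaf(Element):
    name: str = "leaf"


@dataclass(frozen=True)
class Branch(Element):
    child: Element = Leaf("a")
    others: tuple = (Leaf("b"), Leaf("c"))


@dataclass(frozen=True)
class Root(Element):
    branch: Element = Branch()


print(len(Root().elements(deep=True)))   # 4
print(len(Root().elements(deep=False)))  # 1
```

An element hidden inside a list, a dict, or a tuple of tuples would be missed by this sketch, which mirrors the limitation stated in the description.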

Parameters

def tokenizer_element_tree

(self, depth: int = 0, abstract: bool = False) -> str

View Source on GitHub

Returns a string representation of the tree of tokenizer elements contained in self.

Parameters

def tokenizer_element_dict

(self) -> dict

View Source on GitHub

Returns a dictionary representation of the tree of tokenizer elements contained in self.

def attribute_key

(cls) -> str

View Source on GitHub

Returns the binding used in MazeTokenizerModular for that type of _TokenizerElement.

def to_tokens

(self, *args, **kwargs) -> list[str]

View Source on GitHub

Converts a maze element into a list of tokens. Not all _TokenizerElement subclasses produce tokens, so this is not an abstract method. Those subclasses which do produce tokens should override this method.

def is_valid

(self) -> bool

View Source on GitHub

Returns if self contains data members capable of producing an overall valid MazeTokenizerModular. Some _TokenizerElement instances may be created which are not useful despite obeying data member type hints. is_valid allows for more precise detection of invalid _TokenizerElements beyond type hinting alone. If type hints are sufficient to constrain the possible instances of some subclass, then this method may simply return True for that subclass.

Types of Invalidity

In nontrivial implementations of this method, each conditional clause should contain a comment classifying the reason for invalidity and one of the types below. Invalidity types, in ascending order of invalidity:

- Uninteresting: These tokenizers might be used to train functional models, but the schemes are not interesting to study. E.g., _TokenizerElements which are strictly worse than some alternative.
- Duplicate: These tokenizers have identical tokenization behavior as some other valid tokenizers.
- Untrainable: Training functional models using these tokenizers would be (nearly) impossible.
- Erroneous: These tokenizers might raise exceptions during use.

Development

is_valid is implemented to always return True in some abstract classes where all currently possible subclass instances are valid. When adding new subclasses or data members, the developer should check if any such blanket statement of validity still holds and update it as necessary.

Nesting

In general, when implementing this method, there is no need to recursively call is_valid on nested _TokenizerElements contained in the class. In other words, failures of is_valid need not bubble up to the top of the nested _TokenizerElement tree. MazeTokenizerModular.is_valid calls is_valid on each of its _TokenizerElements individually, so failure at any level will be detected.

Types of Invalidity

If it’s judged to be useful, the types of invalidity could be implemented with an Enum or similar rather than only living in comments. This could be used to create more or less stringent filters on the valid _TokenizerElement instances.
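The Enum idea floated here might look like the sketch below. Nothing of the sort is stated to exist in the library; `Invalidity` and `passes_filter` are hypothetical names, and the numeric values simply encode the ascending order given in the list of invalidity types.

```python
from __future__ import annotations

from enum import IntEnum


class Invalidity(IntEnum):
    """Hypothetical enum; values follow the documented ascending order."""

    UNINTERESTING = 1
    DUPLICATE = 2
    UNTRAINABLE = 3
    ERRONEOUS = 4


def passes_filter(reason: Invalidity | None, max_allowed: Invalidity | None = None) -> bool:
    """Accept valid elements, plus invalid ones no worse than `max_allowed`."""
    if reason is None:
        return True  # no invalidity recorded
    return max_allowed is not None and reason <= max_allowed


print(passes_filter(Invalidity.DUPLICATE, max_allowed=Invalidity.DUPLICATE))    # True
print(passes_filter(Invalidity.UNTRAINABLE, max_allowed=Invalidity.DUPLICATE))  # False
```

Raising or lowering `max_allowed` would give the more or less stringent filters mentioned in the text.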

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class MazeTokenizerModular(muutils.json_serialize.serializable_dataclass.SerializableDataclass):

View Source on GitHub

Tokenizer for mazes

Parameters

Development

MazeTokenizerModular

(
    *,
    prompt_sequencer: maze_dataset.tokenization.maze_tokenizer.PromptSequencers._PromptSequencer = PromptSequencers.AOTP(coord_tokenizer=CoordTokenizers.UT(), adj_list_tokenizer=AdjListTokenizers.AdjListCoord(pre=False, post=True, shuffle_d0=True, edge_grouping=EdgeGroupings.Ungrouped(connection_token_ordinal=1), edge_subset=EdgeSubsets.ConnectionEdges(walls=False), edge_permuter=EdgePermuters.RandomCoords()), target_tokenizer=TargetTokenizers.Unlabeled(post=False), path_tokenizer=PathTokenizers.StepSequence(step_size=StepSizes.Singles(), step_tokenizers=(StepTokenizers.Coord(),), pre=False, intra=False, post=False))
)

def hash_int

(self) -> int

View Source on GitHub

def hash_b64

(self, n_bytes: int = 8) -> str

View Source on GitHub

filename-safe base64 encoding of the hash
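A filename-safe base64 encoding of an integer hash can be sketched with the standard library. How the library truncates the hash and whether it strips padding are not stated here, so both choices below are assumptions, and `hash_to_b64` is a hypothetical name.

```python
import base64


def hash_to_b64(hash_int: int, n_bytes: int = 8) -> str:
    """Encode the low `n_bytes` bytes of an integer hash as filename-safe base64."""
    # keep only the lowest n_bytes so that to_bytes cannot overflow
    truncated = hash_int % (1 << (8 * n_bytes))
    raw = truncated.to_bytes(n_bytes, byteorder="big")
    # the urlsafe alphabet uses "-" and "_" instead of "+" and "/"
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


print(hash_to_b64(2**64 - 1))
```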

View Source on GitHub

def tokenizer_element_tree

(self, abstract: bool = False) -> str

View Source on GitHub

Returns a string representation of the tree of tokenizer elements contained in self.

Parameters

View Source on GitHub

Property wrapper for tokenizer_element_tree so that it can be used in properties_to_serialize.

def tokenizer_element_dict

(self) -> dict

View Source on GitHub

Nested dictionary of the internal TokenizerElements.

View Source on GitHub

Serializes MazeTokenizer into a key for encoding in zanj

def summary

(self) -> dict[str, str]

View Source on GitHub

Single-level dictionary of the internal TokenizerElements.

def has_element

(
    self,
    *elements: Sequence[type[maze_dataset.tokenization.maze_tokenizer._TokenizerElement] | maze_dataset.tokenization.maze_tokenizer._TokenizerElement]
) -> bool

View Source on GitHub

Returns True if the MazeTokenizerModular instance contains ALL of the items specified in elements.

Querying with a partial subset of _TokenizerElement fields is not currently supported. To do such a query, assemble multiple calls to has_element.

Parameters

def is_valid

(self)

View Source on GitHub

Returns True if self is a valid tokenizer. Evaluates the validity of all of self.tokenizer_elements according to each element’s own is_valid method.

def is_legacy_equivalent

(self) -> bool

View Source on GitHub

Returns if self has identical stringification behavior as any legacy MazeTokenizer.

def is_tested_tokenizer

(self, do_assert: bool = False) -> bool

View Source on GitHub

Returns if the tokenizer is returned by all_tokenizers.get_all_tokenizers, the set of tested and reliable tokenizers.

Since evaluating all_tokenizers.get_all_tokenizers is expensive, instead checks for membership of self’s hash in get_all_tokenizer_hashes().

if do_assert is True, raises an AssertionError if the tokenizer is not tested.

def is_AOTP

(self) -> bool

View Source on GitHub

def is_UT

(self) -> bool

View Source on GitHub

def from_legacy

(
    cls,
    legacy_maze_tokenizer: maze_dataset.tokenization.maze_tokenizer.MazeTokenizer | maze_dataset.tokenization.maze_tokenizer.TokenizationMode
) -> maze_dataset.tokenization.maze_tokenizer.MazeTokenizerModular

View Source on GitHub

Maps a legacy MazeTokenizer or TokenizationMode to its equivalent MazeTokenizerModular instance.

def from_tokens

(
    cls,
    tokens: str | list[str]
) -> maze_dataset.tokenization.maze_tokenizer.MazeTokenizerModular

View Source on GitHub

Infers most MazeTokenizerModular parameters from a full sequence of tokens.

View Source on GitHub

map from index to token

View Source on GitHub

map from token to index

View Source on GitHub

Number of tokens in the static vocab

View Source on GitHub

View Source on GitHub

def to_tokens

(self, maze: maze_dataset.maze.lattice_maze.LatticeMaze) -> list[str]

View Source on GitHub

Converts maze into a list of tokens.

def coords_to_strings

(
    self,
    coords: list[tuple[int, int] | jaxtyping.Int8[ndarray, 'row_col']]
) -> list[str]

View Source on GitHub

def strings_to_coords

(
    text: str,
    when_noncoord: Literal['except', 'skip', 'include'] = 'skip'
) -> list[str | tuple[int, int]]

View Source on GitHub

def encode

(text: str | list[str]) -> list[int]

View Source on GitHub

encode a string or list of strings into a list of token IDs

def decode

(token_ids: Sequence[int], joined_tokens: bool = False) -> list[str] | str

View Source on GitHub

decode a list of token IDs into a string or list of strings
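The round trip between tokens and integer IDs can be sketched with a toy vocabulary. The vocabulary below is hypothetical and tiny. The real static vocab is defined by the library, and the real methods may handle unknown tokens differently.

```python
from __future__ import annotations

# hypothetical toy vocabulary; the real static vocab is defined by the library
VOCAB: list[str] = ["<PAD>", "(0,0)", "(0,1)", "(1,0)", "(1,1)"]
TOKEN_TO_INDEX: dict[str, int] = {tok: i for i, tok in enumerate(VOCAB)}


def encode(text: str | list[str]) -> list[int]:
    """Map tokens (or a whitespace-separated string of tokens) to integer IDs."""
    tokens = text.split() if isinstance(text, str) else text
    return [TOKEN_TO_INDEX[tok] for tok in tokens]


def decode(token_ids: list[int], joined_tokens: bool = False) -> list[str] | str:
    """Map integer IDs back to tokens, optionally joined into one string."""
    tokens = [VOCAB[i] for i in token_ids]
    return " ".join(tokens) if joined_tokens else tokens


ids = encode("(0,0) (0,1) (1,1)")
print(ids)                              # [1, 2, 4]
print(decode(ids, joined_tokens=True))  # (0,0) (0,1) (1,1)
```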

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class PromptSequencers(maze_dataset.tokenization.maze_tokenizer.__TokenizerElementNamespace):

View Source on GitHub

Namespace for _PromptSequencer subclass hierarchy used by MazeTokenizerModular.

class PromptSequencers.AOTP(maze_dataset.tokenization.maze_tokenizer.PromptSequencers._PromptSequencer):

View Source on GitHub

Sequences a prompt as [adjacency list, origin, target, path].

Parameters

PromptSequencers.AOTP

(
    *,
    coord_tokenizer: maze_dataset.tokenization.maze_tokenizer.CoordTokenizers._CoordTokenizer = CoordTokenizers.UT(),
    adj_list_tokenizer: maze_dataset.tokenization.maze_tokenizer.AdjListTokenizers._AdjListTokenizer = AdjListTokenizers.AdjListCoord(pre=False, post=True, shuffle_d0=True, edge_grouping=EdgeGroupings.Ungrouped(connection_token_ordinal=1), edge_subset=EdgeSubsets.ConnectionEdges(walls=False), edge_permuter=EdgePermuters.RandomCoords()),
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.PromptSequencers.AOTP'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.PromptSequencers.AOTP'>",
    target_tokenizer: maze_dataset.tokenization.maze_tokenizer.TargetTokenizers._TargetTokenizer = TargetTokenizers.Unlabeled(post=False),
    path_tokenizer: maze_dataset.tokenization.maze_tokenizer.PathTokenizers._PathTokenizer = PathTokenizers.StepSequence(step_size=StepSizes.Singles(), step_tokenizers=(StepTokenizers.Coord(),), pre=False, intra=False, post=False)
)
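The AOTP ordering can be pictured as a simple concatenation of four regions. The delimiter names in this sketch are placeholders chosen for readability, not confirmed library tokens, and `sequence_aotp` is a hypothetical helper.

```python
def sequence_aotp(
    adj_list: list[str],
    origin: list[str],
    target: list[str],
    path: list[str],
) -> list[str]:
    """Concatenate the regions in adjacency list, origin, target, path order."""
    # hypothetical delimiter names, used only to make the ordering visible
    return [
        "<ADJ_START>", *adj_list, "<ADJ_END>",
        "<ORIGIN_START>", *origin, "<ORIGIN_END>",
        "<TARGET_START>", *target, "<TARGET_END>",
        "<PATH_START>", *path, "<PATH_END>",
    ]


tokens = sequence_aotp(
    adj_list=["(0,0)", "<-->", "(0,1)", ";"],
    origin=["(0,0)"],
    target=["(0,1)"],
    path=["(0,0)", "(0,1)"],
)
print(" ".join(tokens))
```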

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class PromptSequencers.AOP(maze_dataset.tokenization.maze_tokenizer.PromptSequencers._PromptSequencer):

View Source on GitHub

Sequences a prompt as [adjacency list, origin, path]. Still includes “” and “” tokens, but no representation of the target itself.

Parameters

PromptSequencers.AOP

(
    *,
    coord_tokenizer: maze_dataset.tokenization.maze_tokenizer.CoordTokenizers._CoordTokenizer = CoordTokenizers.UT(),
    adj_list_tokenizer: maze_dataset.tokenization.maze_tokenizer.AdjListTokenizers._AdjListTokenizer = AdjListTokenizers.AdjListCoord(pre=False, post=True, shuffle_d0=True, edge_grouping=EdgeGroupings.Ungrouped(connection_token_ordinal=1), edge_subset=EdgeSubsets.ConnectionEdges(walls=False), edge_permuter=EdgePermuters.RandomCoords()),
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.PromptSequencers.AOP'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.PromptSequencers.AOP'>",
    path_tokenizer: maze_dataset.tokenization.maze_tokenizer.PathTokenizers._PathTokenizer = PathTokenizers.StepSequence(step_size=StepSizes.Singles(), step_tokenizers=(StepTokenizers.Coord(),), pre=False, intra=False, post=False)
)

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class CoordTokenizers(maze_dataset.tokenization.maze_tokenizer.__TokenizerElementNamespace):

View Source on GitHub

Namespace for _CoordTokenizer subclass hierarchy used by MazeTokenizerModular.

class CoordTokenizers.UT(maze_dataset.tokenization.maze_tokenizer.CoordTokenizers._CoordTokenizer):

View Source on GitHub

Unique token coordinate tokenizer.

CoordTokenizers.UT

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.CoordTokenizers.UT'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.CoordTokenizers.UT'>"
)

def to_tokens

(
    self,
    coord: jaxtyping.Int8[ndarray, 'row_col'] | tuple[int, int]
) -> list[str]

View Source on GitHub

Converts a maze element into a list of tokens. Not all _TokenizerElement subclasses produce tokens, so this is not an abstract method. Those subclasses which do produce tokens should override this method.

def serialize

(self) -> dict[str, typing.Any]

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

class CoordTokenizers.CTT(maze_dataset.tokenization.maze_tokenizer.CoordTokenizers._CoordTokenizer):

Coordinate tuple tokenizer

Parameters

CoordTokenizers.CTT

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.CoordTokenizers.CTT'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.CoordTokenizers.CTT'>",
    pre: bool = True,
    intra: bool = True,
    post: bool = True
)
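
The difference between a unique-token scheme (as in UT) and a tuple scheme with optional delimiters (as in CTT) can be pictured with a small stand-alone sketch. It does not call maze-dataset: the function names and the literal delimiter strings `(`, `,`, and `)` are assumptions chosen for illustration, and the library's actual tokens come from its own vocabulary.

```python
def ut_style_tokens(coord):
    """Hypothetical UT-like scheme: one token per coordinate."""
    row, col = coord
    return [f"({row},{col})"]


def ctt_style_tokens(coord, pre=True, intra=True, post=True):
    """Hypothetical CTT-like scheme: row and col are separate tokens.

    pre, intra, and post toggle an (assumed) delimiter token before,
    between, and after the two coordinate tokens.
    """
    row, col = coord
    tokens = []
    if pre:
        tokens.append("(")
    tokens.append(str(row))
    if intra:
        tokens.append(",")
    tokens.append(str(col))
    if post:
        tokens.append(")")
    return tokens
```

With all three flags disabled, the tuple scheme in this sketch emits only the two bare coordinate tokens.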

def to_tokens

(
    self,
    coord: jaxtyping.Int8[ndarray, 'row_col'] | tuple[int, int]
) -> list[str]

Converts a maze element into a list of tokens. Not all _TokenizerElement subclasses produce tokens, so this is not an abstract method. Those subclasses which do produce tokens should override this method.

def serialize

(self) -> dict[str, typing.Any]

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

class AdjListTokenizers(maze_dataset.tokenization.maze_tokenizer.__TokenizerElementNamespace):

Namespace for _AdjListTokenizer subclass hierarchy used by MazeTokenizerModular.

class AdjListTokenizers.AdjListCoord(maze_dataset.tokenization.maze_tokenizer.AdjListTokenizers._AdjListTokenizer):

Represents an edge group as tokens for the leading coord followed by coord tokens for the other group members.

AdjListTokenizers.AdjListCoord

(
    *,
    pre: bool = False,
    post: bool = True,
    shuffle_d0: bool = True,
    edge_grouping: maze_dataset.tokenization.maze_tokenizer.EdgeGroupings._EdgeGrouping = EdgeGroupings.Ungrouped(connection_token_ordinal=1),
    edge_subset: maze_dataset.tokenization.maze_tokenizer.EdgeSubsets._EdgeSubset = EdgeSubsets.ConnectionEdges(walls=False),
    edge_permuter: maze_dataset.tokenization.maze_tokenizer.EdgePermuters._EdgePermuter = EdgePermuters.RandomCoords(),
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.AdjListTokenizers.AdjListCoord'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.AdjListTokenizers.AdjListCoord'>"
)

def serialize

(self) -> dict[str, typing.Any]

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

class AdjListTokenizers.AdjListCardinal(maze_dataset.tokenization.maze_tokenizer.AdjListTokenizers._AdjListTokenizer):

Represents an edge group as coord tokens for the leading coord and cardinal tokens relative to the leading coord for the other group members.

Parameters

AdjListTokenizers.AdjListCardinal

(
    *,
    pre: bool = False,
    post: bool = True,
    shuffle_d0: bool = True,
    edge_grouping: maze_dataset.tokenization.maze_tokenizer.EdgeGroupings._EdgeGrouping = EdgeGroupings.Ungrouped(connection_token_ordinal=1),
    edge_subset: maze_dataset.tokenization.maze_tokenizer.EdgeSubsets._EdgeSubset = EdgeSubsets.ConnectionEdges(walls=False),
    edge_permuter: maze_dataset.tokenization.maze_tokenizer.EdgePermuters._EdgePermuter = EdgePermuters.BothCoords(),
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.AdjListTokenizers.AdjListCardinal'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.AdjListTokenizers.AdjListCardinal'>"
)

def serialize

(self) -> dict[str, typing.Any]

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

class EdgeGroupings(maze_dataset.tokenization.maze_tokenizer.__TokenizerElementNamespace):

Namespace for _EdgeGrouping subclass hierarchy used by _AdjListTokenizer.

class EdgeGroupings.Ungrouped(maze_dataset.tokenization.maze_tokenizer.EdgeGroupings._EdgeGrouping):

No grouping occurs, each edge is tokenized individually.

Parameters

EdgeGroupings.Ungrouped

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.EdgeGroupings.Ungrouped'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.EdgeGroupings.Ungrouped'>",
    connection_token_ordinal: Literal[0, 1, 2] = 1
)

def serialize

(self) -> dict[str, typing.Any]

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

class EdgeGroupings.ByLeadingCoord(maze_dataset.tokenization.maze_tokenizer.EdgeGroupings._EdgeGrouping):

All edges with the same leading coord are grouped together.
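
As a rough picture of what grouping by leading coord means, the sketch below buckets edges by their first coordinate. It is a stand-alone illustration, not the library's implementation; the function name and the representation of an edge as a `(leading, other)` pair of `(row, col)` tuples are assumptions.

```python
from collections import defaultdict


def group_by_leading_coord(edges):
    """Bucket (leading, other) edge pairs by their leading coord.

    Returns a dict mapping each leading coord to the list of other
    coords it is paired with, in first-seen order.
    """
    groups = defaultdict(list)
    for leading, other in edges:
        groups[leading].append(other)
    return dict(groups)


edges = [((0, 0), (0, 1)), ((0, 0), (1, 0)), ((1, 0), (1, 1))]
groups = group_by_leading_coord(edges)
```

In the ungrouped case, by contrast, each of the three edges above would be tokenized on its own.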

Parameters

EdgeGroupings.ByLeadingCoord

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.EdgeGroupings.ByLeadingCoord'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.EdgeGroupings.ByLeadingCoord'>",
    intra: bool = True,
    shuffle_group: bool = True,
    connection_token_ordinal: Literal[0, 1] = 0
)

def is_valid

(self_)

Returns if self contains data members capable of producing an overall valid MazeTokenizerModular. Some _TokenizerElement instances may be created which are not useful despite obeying data member type hints. is_valid allows for more precise detection of invalid _TokenizerElements beyond type hinting alone. If type hints are sufficient to constrain the possible instances of some subclass, then this method may simply return True for that subclass.

Types of Invalidity

In nontrivial implementations of this method, each conditional clause should contain a comment classifying the reason for invalidity and one of the types below. Invalidity types, in ascending order of invalidity:

- Uninteresting: These tokenizers might be used to train functional models, but the schemes are not interesting to study. E.g., _TokenizerElements which are strictly worse than some alternative.
- Duplicate: These tokenizers have identical tokenization behavior to some other valid tokenizers.
- Untrainable: Training functional models using these tokenizers would be (nearly) impossible.
- Erroneous: These tokenizers might raise exceptions during use.

Development

is_valid is implemented to always return True in some abstract classes where all currently possible subclass instances are valid. When adding new subclasses or data members, the developer should check if any such blanket statement of validity still holds and update it as necessary.

Nesting

In general, when implementing this method, there is no need to recursively call is_valid on nested _TokenizerElements contained in the class. In other words, failures of is_valid need not bubble up to the top of the nested _TokenizerElement tree. MazeTokenizerModular.is_valid calls is_valid on each of its _TokenizerElements individually, so failure at any level will be detected.

Types of Invalidity

If it’s judged to be useful, the types of invalidity could be implemented with an Enum or similar rather than only living in comments. This could be used to create more or less stringent filters on the valid _TokenizerElement instances.
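
One possible shape for such an Enum is sketched below. Nothing here exists in maze-dataset: `InvalidityType` and `passes_filter` are hypothetical names, and the numeric values only encode the ascending order of invalidity listed above.

```python
from enum import IntEnum


class InvalidityType(IntEnum):
    """Hypothetical enum; values follow ascending order of invalidity."""

    UNINTERESTING = 1
    DUPLICATE = 2
    UNTRAINABLE = 3
    ERRONEOUS = 4


def passes_filter(invalidity, tolerate_up_to=None):
    """Decide whether a tokenizer element passes a stringency filter.

    invalidity is None for a fully valid element. tolerate_up_to is the
    most severe InvalidityType still accepted; None accepts only fully
    valid elements.
    """
    if invalidity is None:
        return True
    if tolerate_up_to is None:
        return False
    return invalidity <= tolerate_up_to
```

Using an `IntEnum` keeps the severity ordering comparable, so a looser filter is just a higher threshold.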

def serialize

(self) -> dict[str, typing.Any]

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

class EdgePermuters(maze_dataset.tokenization.maze_tokenizer.__TokenizerElementNamespace):

Namespace for _EdgePermuter subclass hierarchy used by _AdjListTokenizer.

class EdgePermuters.SortedCoords(maze_dataset.tokenization.maze_tokenizer.EdgePermuters._EdgePermuter):

Returns a sorted representation of each edge. Useful for checking consistency.

EdgePermuters.SortedCoords

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.EdgePermuters.SortedCoords'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.EdgePermuters.SortedCoords'>"
)

def serialize

(self) -> dict[str, typing.Any]

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

class EdgePermuters.RandomCoords(maze_dataset.tokenization.maze_tokenizer.EdgePermuters._EdgePermuter):

Permutes each edge randomly.

EdgePermuters.RandomCoords

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.EdgePermuters.RandomCoords'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.EdgePermuters.RandomCoords'>"
)

def serialize

(self) -> dict[str, typing.Any]

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

class EdgePermuters.BothCoords(maze_dataset.tokenization.maze_tokenizer.EdgePermuters._EdgePermuter):

Includes both possible permutations of every edge in the output. Since input ConnectionList has only 1 instance of each edge, a call to BothCoords._permute will modify lattice_edges in-place, doubling shape[0].
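
The doubling described here can be illustrated with plain Python lists. This is a sketch only: the helper names are hypothetical, edges are modeled as pairs of `(row, col)` tuples rather than arrays, the helpers return new lists instead of modifying their input, and the output ordering (originals first, then flipped copies) is an assumption.

```python
def sorted_coords(edges):
    """Put the two coords of every edge in sorted order."""
    return [tuple(sorted(edge)) for edge in edges]


def both_coords(edges):
    """Return every edge in both orientations, doubling the edge count."""
    return list(edges) + [(b, a) for a, b in edges]


edges = [((1, 0), (0, 0)), ((0, 1), (1, 1))]
doubled = both_coords(edges)
```

A random permuter would instead pick one of the two orientations per edge, leaving the edge count unchanged.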

EdgePermuters.BothCoords

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.EdgePermuters.BothCoords'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.EdgePermuters.BothCoords'>"
)

def serialize

(self) -> dict[str, typing.Any]

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

class EdgeSubsets(maze_dataset.tokenization.maze_tokenizer.__TokenizerElementNamespace):

Namespace for _EdgeSubset subclass hierarchy used by _AdjListTokenizer.

class EdgeSubsets.AllLatticeEdges(maze_dataset.tokenization.maze_tokenizer.EdgeSubsets._EdgeSubset):

All 2n**2-2n edges of the lattice are tokenized. If a wall exists on that edge, the edge is tokenized in the same manner, using VOCAB.ADJLIST_WALL in place of VOCAB.CONNECTOR.
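
The edge count can be checked by direct enumeration. The sketch below is independent of the library (the function name is hypothetical) and assumes an n-by-n square lattice in which each cell is joined to its neighbor below and its neighbor to the right.

```python
def lattice_edges(n):
    """Enumerate every edge of an n x n square lattice exactly once."""
    edges = []
    for row in range(n):
        for col in range(n):
            if row + 1 < n:  # edge to the cell below
                edges.append(((row, col), (row + 1, col)))
            if col + 1 < n:  # edge to the cell on the right
                edges.append(((row, col), (row, col + 1)))
    return edges


# Each of the n rows has n - 1 horizontal edges and each of the n columns
# has n - 1 vertical edges, giving 2 * n * (n - 1) = 2n**2 - 2n in total.
counts = {n: len(lattice_edges(n)) for n in range(1, 7)}
```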

EdgeSubsets.AllLatticeEdges

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.EdgeSubsets.AllLatticeEdges'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.EdgeSubsets.AllLatticeEdges'>"
)

def serialize

(self) -> dict[str, typing.Any]

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

class EdgeSubsets.ConnectionEdges(maze_dataset.tokenization.maze_tokenizer.EdgeSubsets._EdgeSubset):

Only edges which contain a connection are tokenized. Alternatively, only edges which contain a wall are tokenized.
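
A minimal sketch of this selection follows, assuming that a boolean `walls` flag switches between the two cases. The function name and the data layout (edges as pairs of `(row, col)` tuples, connections as a list of such pairs) are illustrative assumptions, not the library's internals.

```python
def select_edges(all_edges, connections, walls=False):
    """Keep connected edges (walls=False) or walled edges (walls=True).

    Edge orientation is ignored: ((0, 0), (0, 1)) and ((0, 1), (0, 0))
    are treated as the same edge.
    """
    connected = {frozenset(edge) for edge in connections}
    return [e for e in all_edges if (frozenset(e) in connected) != walls]


# All four edges of a 2 x 2 lattice, three of which are open passages.
all_edges = [
    ((0, 0), (1, 0)),
    ((0, 0), (0, 1)),
    ((0, 1), (1, 1)),
    ((1, 0), (1, 1)),
]
connections = [((0, 1), (0, 0)), ((0, 1), (1, 1)), ((1, 0), (1, 1))]
open_edges = select_edges(all_edges, connections)
wall_edges = select_edges(all_edges, connections, walls=True)
```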

Parameters

EdgeSubsets.ConnectionEdges

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.EdgeSubsets.ConnectionEdges'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.EdgeSubsets.ConnectionEdges'>",
    walls: bool = False
)

def serialize

(self) -> dict[str, typing.Any]

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

class TargetTokenizers(maze_dataset.tokenization.maze_tokenizer.__TokenizerElementNamespace):

Namespace for _TargetTokenizer subclass hierarchy used by MazeTokenizerModular.

class TargetTokenizers.Unlabeled(maze_dataset.tokenization.maze_tokenizer.TargetTokenizers._TargetTokenizer):

Targets are simply listed as coord tokens.

- post: Whether all coords include an integral following delimiter token

TargetTokenizers.Unlabeled

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.TargetTokenizers.Unlabeled'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.TargetTokenizers.Unlabeled'>",
    post: bool = False
)

def to_tokens

(
    self,
    targets: Sequence[jaxtyping.Int8[ndarray, 'row_col']],
    coord_tokenizer: maze_dataset.tokenization.maze_tokenizer.CoordTokenizers._CoordTokenizer
) -> list[str]

Returns tokens representing the target.

def is_valid

(self) -> bool

Returns if self contains data members capable of producing an overall valid MazeTokenizerModular. Some _TokenizerElement instances may be created which are not useful despite obeying data member type hints. is_valid allows for more precise detection of invalid _TokenizerElements beyond type hinting alone. If type hints are sufficient to constrain the possible instances of some subclass, then this method may simply return True for that subclass.

Types of Invalidity

In nontrivial implementations of this method, each conditional clause should contain a comment classifying the reason for invalidity and one of the types below. Invalidity types, in ascending order of invalidity:

- Uninteresting: These tokenizers might be used to train functional models, but the schemes are not interesting to study. E.g., _TokenizerElements which are strictly worse than some alternative.
- Duplicate: These tokenizers have identical tokenization behavior to some other valid tokenizers.
- Untrainable: Training functional models using these tokenizers would be (nearly) impossible.
- Erroneous: These tokenizers might raise exceptions during use.

Development

is_valid is implemented to always return True in some abstract classes where all currently possible subclass instances are valid. When adding new subclasses or data members, the developer should check if any such blanket statement of validity still holds and update it as necessary.

Nesting

In general, when implementing this method, there is no need to recursively call is_valid on nested _TokenizerElements contained in the class. In other words, failures of is_valid need not bubble up to the top of the nested _TokenizerElement tree. MazeTokenizerModular.is_valid calls is_valid on each of its _TokenizerElements individually, so failure at any level will be detected.

Types of Invalidity

If it’s judged to be useful, the types of invalidity could be implemented with an Enum or similar rather than only living in comments. This could be used to create more or less stringent filters on the valid _TokenizerElement instances.

def serialize

(self) -> dict[str, typing.Any]

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

class StepSizes(maze_dataset.tokenization.maze_tokenizer.__TokenizerElementNamespace):

Namespace for _StepSize subclass hierarchy used by MazeTokenizerModular.

class StepSizes.Singles(maze_dataset.tokenization.maze_tokenizer.StepSizes._StepSize):

Every coord in maze.solution is represented. Legacy tokenizers all use this behavior.

StepSizes.Singles

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.StepSizes.Singles'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.StepSizes.Singles'>"
)

def serialize

(self) -> dict[str, typing.Any]

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

class StepSizes.Straightaways(maze_dataset.tokenization.maze_tokenizer.StepSizes._StepSize):

Only coords where the path turns are represented in the path. I.e., the path is represented as a sequence of straightaways, specified by the coords at the turns.
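
One way to picture this reduction is the stand-alone sketch below, which keeps a coord only when the direction into it differs from the direction out of it. The function name is hypothetical, the path is modeled as a list of `(row, col)` tuples, and keeping the first and last coords is an assumption of this sketch rather than something stated above.

```python
def straightaway_coords(path):
    """Reduce a path to its endpoints plus the coords where it turns."""
    if len(path) <= 2:
        return list(path)
    kept = [path[0]]
    for prev, cur, nxt in zip(path, path[1:], path[2:]):
        step_in = (cur[0] - prev[0], cur[1] - prev[1])
        step_out = (nxt[0] - cur[0], nxt[1] - cur[1])
        if step_in != step_out:  # direction changes, so cur is a turn
            kept.append(cur)
    kept.append(path[-1])
    return kept


path = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1)]
turns = straightaway_coords(path)
```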

StepSizes.Straightaways

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.StepSizes.Straightaways'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.StepSizes.Straightaways'>"
)

def is_valid

(self_)

Returns if self contains data members capable of producing an overall valid MazeTokenizerModular. Some _TokenizerElement instances may be created which are not useful despite obeying data member type hints. is_valid allows for more precise detection of invalid _TokenizerElements beyond type hinting alone. If type hints are sufficient to constrain the possible instances of some subclass, then this method may simply return True for that subclass.

Types of Invalidity

In nontrivial implementations of this method, each conditional clause should contain a comment classifying the reason for invalidity and one of the types below. Invalidity types, in ascending order of invalidity:

- Uninteresting: These tokenizers might be used to train functional models, but the schemes are not interesting to study. E.g., _TokenizerElements which are strictly worse than some alternative.
- Duplicate: These tokenizers have identical tokenization behavior to some other valid tokenizers.
- Untrainable: Training functional models using these tokenizers would be (nearly) impossible.
- Erroneous: These tokenizers might raise exceptions during use.

Development

is_valid is implemented to always return True in some abstract classes where all currently possible subclass instances are valid. When adding new subclasses or data members, the developer should check if any such blanket statement of validity still holds and update it as necessary.

Nesting

In general, when implementing this method, there is no need to recursively call is_valid on nested _TokenizerElements contained in the class. In other words, failures of is_valid need not bubble up to the top of the nested _TokenizerElement tree. MazeTokenizerModular.is_valid calls is_valid on each of its _TokenizerElements individually, so failure at any level will be detected.

Types of Invalidity

If it’s judged to be useful, the types of invalidity could be implemented with an Enum or similar rather than only living in comments. This could be used to create more or less stringent filters on the valid _TokenizerElement instances.

def serialize

(self) -> dict[str, typing.Any]

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

class StepSizes.Forks(maze_dataset.tokenization.maze_tokenizer.StepSizes._StepSize):

Only coords at forks, where the path has >=2 options for the next step are included. Excludes the option of backtracking. The starting and ending coords are always included.
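
The fork rule can be sketched with a plain adjacency mapping. This is an illustration under assumptions, not the library's code: `fork_coords` is a hypothetical name, and `neighbors` is assumed to map each `(row, col)` coord to the coords reachable from it through open passages.

```python
def fork_coords(path, neighbors):
    """Keep the endpoints plus every coord offering >= 2 onward options.

    Stepping back to the previous coord (backtracking) is not counted
    as an option.
    """
    if len(path) < 2:
        return list(path)
    kept = [path[0]]
    for i in range(1, len(path) - 1):
        prev, cur = path[i - 1], path[i]
        options = [n for n in neighbors[cur] if n != prev]
        if len(options) >= 2:
            kept.append(cur)
    kept.append(path[-1])
    return kept


# A small maze: (0, 1) has a dead-end side branch down to (1, 1).
neighbors = {
    (0, 0): [(0, 1)],
    (0, 1): [(0, 0), (0, 2), (1, 1)],
    (0, 2): [(0, 1), (1, 2)],
    (1, 1): [(0, 1)],
    (1, 2): [(0, 2)],
}
path = [(0, 0), (0, 1), (0, 2), (1, 2)]
forks = fork_coords(path, neighbors)
```

Here `(0, 2)` is dropped even though the path turns there, because only one onward option exists at that coord.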

StepSizes.Forks

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.StepSizes.Forks'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.StepSizes.Forks'>"
)

def serialize

(self) -> dict[str, typing.Any]

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

class StepSizes.ForksAndStraightaways(maze_dataset.tokenization.maze_tokenizer.StepSizes._StepSize):

Includes the union of the coords included by Forks and Straightaways. See documentation for those classes for details.

StepSizes.ForksAndStraightaways

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.StepSizes.ForksAndStraightaways'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.StepSizes.ForksAndStraightaways'>"
)

def is_valid

(self_)

Returns if self contains data members capable of producing an overall valid MazeTokenizerModular. Some _TokenizerElement instances may be created which are not useful despite obeying data member type hints. is_valid allows for more precise detection of invalid _TokenizerElements beyond type hinting alone. If type hints are sufficient to constrain the possible instances of some subclass, then this method may simply return True for that subclass.

Types of Invalidity

In nontrivial implementations of this method, each conditional clause should contain a comment classifying the reason for invalidity and one of the types below. Invalidity types, in ascending order of invalidity:

- Uninteresting: These tokenizers might be used to train functional models, but the schemes are not interesting to study. E.g., _TokenizerElements which are strictly worse than some alternative.
- Duplicate: These tokenizers have identical tokenization behavior to some other valid tokenizers.
- Untrainable: Training functional models using these tokenizers would be (nearly) impossible.
- Erroneous: These tokenizers might raise exceptions during use.

Development

is_valid is implemented to always return True in some abstract classes where all currently possible subclass instances are valid. When adding new subclasses or data members, the developer should check if any such blanket statement of validity still holds and update it as necessary.

Nesting

In general, when implementing this method, there is no need to recursively call is_valid on nested _TokenizerElements contained in the class. In other words, failures of is_valid need not bubble up to the top of the nested _TokenizerElement tree. MazeTokenizerModular.is_valid calls is_valid on each of its _TokenizerElements individually, so failure at any level will be detected.

Types of Invalidity

If it’s judged to be useful, the types of invalidity could be implemented with an Enum or similar rather than only living in comments. This could be used to create more or less stringent filters on the valid _TokenizerElement instances.

def serialize

(self) -> dict[str, typing.Any]

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

class StepTokenizers(maze_dataset.tokenization.maze_tokenizer.__TokenizerElementNamespace):

Namespace for _StepTokenizer subclass hierarchy used by MazeTokenizerModular.

class StepTokenizers.Coord(maze_dataset.tokenization.maze_tokenizer.StepTokenizers._StepTokenizer):

A direct tokenization of the end position coord represents the step.

StepTokenizers.Coord

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.StepTokenizers.Coord'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.StepTokenizers.Coord'>"
)

def to_tokens

(
    self,
    maze: maze_dataset.maze.lattice_maze.SolvedMaze,
    start_index: int,
    end_index: int,
    coord_tokenizer: maze_dataset.tokenization.maze_tokenizer.CoordTokenizers._CoordTokenizer
) -> list[str]

Tokenizes a single step in the solution.

Parameters

def serialize

(self) -> dict[str, typing.Any]

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class StepTokenizers.Cardinal(maze_dataset.tokenization.maze_tokenizer.StepTokenizers._StepTokenizer):

View Source on GitHub

A step is tokenized with a cardinal direction token. It is the direction of the step from the starting position along the solution.
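To make the idea of a cardinal-direction step concrete, here is a minimal sketch that maps the displacement between two coordinates to a compass label. The function name, the label strings, and the convention that row 0 is the northern edge are assumptions for illustration only; they are not the library's actual tokens or implementation, and the sketch only handles moves along a single axis.

```python
def cardinal_direction(start: tuple[int, int], end: tuple[int, int]) -> str:
    """Return the compass direction of a straight move from start to end.

    Coordinates are (row, col). Row 0 is ASSUMED to be the northern edge.
    """
    d_row = end[0] - start[0]
    d_col = end[1] - start[1]
    if (d_row == 0) == (d_col == 0):
        # Either no movement at all, or movement along both axes:
        # neither can be described by a single cardinal direction.
        raise ValueError(f"not a straight move: {start} -> {end}")
    if d_row != 0:
        return "NORTH" if d_row < 0 else "SOUTH"
    return "WEST" if d_col < 0 else "EAST"
```

For example, under these assumptions a move from (2, 1) to (1, 1) is labeled NORTH, and a move from (0, 0) to (0, 1) is labeled EAST.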

StepTokenizers.Cardinal

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.StepTokenizers.Cardinal'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.StepTokenizers.Cardinal'>"
)

def to_tokens

(
    self,
    maze: maze_dataset.maze.lattice_maze.SolvedMaze,
    start_index: int,
    end_index: int,
    **kwargs
) -> list[str]

View Source on GitHub

Tokenizes a single step in the solution.

Parameters

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class StepTokenizers.Relative(maze_dataset.tokenization.maze_tokenizer.StepTokenizers._StepTokenizer):

View Source on GitHub

Tokenizes a solution step using relative first-person directions (right, left, forward, etc.). To simplify the indeterminacy, at the start of a solution the “agent” solving the maze is assumed to be facing NORTH. Similarly to Cardinal, the direction is that of the step from the starting position.
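The relative scheme can be sketched as follows: track the agent's heading, starting from north, and label each move by how it relates to that heading. The function name, the label strings, and the (row, col) convention with row 0 at the north are assumptions made for this illustration, not the library's actual tokens or code; the sketch also assumes every move covers exactly one cell.

```python
NORTH = (-1, 0)  # ASSUMED convention: decreasing row index means moving north


def relative_directions(path: list[tuple[int, int]]) -> list[str]:
    """Label each unit move in `path` relative to the agent's current heading."""
    heading = NORTH  # the agent is assumed to face north before the first move
    labels = []
    for (r0, c0), (r1, c1) in zip(path, path[1:]):
        move = (r1 - r0, c1 - c0)
        d_row, d_col = heading
        if move == heading:
            label = "FORWARD"
        elif move == (d_col, -d_row):  # heading rotated 90 degrees clockwise
            label = "RIGHT"
        elif move == (-d_col, d_row):  # heading rotated 90 degrees counterclockwise
            label = "LEFT"
        elif move == (-d_row, -d_col):
            label = "BACKWARD"
        else:
            raise ValueError(f"not a unit move: {(r0, c0)} -> {(r1, c1)}")
        labels.append(label)
        heading = move  # the agent now faces the way it just moved
    return labels
```

Because the heading is updated after every move, the same cardinal move can produce different relative labels depending on what preceded it.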

StepTokenizers.Relative

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.StepTokenizers.Relative'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.StepTokenizers.Relative'>"
)

def to_tokens

(
    self,
    maze: maze_dataset.maze.lattice_maze.SolvedMaze,
    start_index: int,
    end_index: int,
    **kwargs
) -> list[str]

View Source on GitHub

Tokenizes a single step in the solution.

Parameters

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class StepTokenizers.Distance(maze_dataset.tokenization.maze_tokenizer.StepTokenizers._StepTokenizer):

View Source on GitHub

A count of the number of individual steps from the starting point to the end point. Contains no information about directionality, only the distance traveled in the step. Distance must be combined with at least one other _StepTokenizer in a StepTokenizerPermutation. This constraint is enforced in _PathTokenizer.is_valid.

StepTokenizers.Distance

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.StepTokenizers.Distance'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.StepTokenizers.Distance'>"
)

def to_tokens

(
    self,
    maze: maze_dataset.maze.lattice_maze.SolvedMaze,
    start_index: int,
    end_index: int,
    **kwargs
) -> list[str]

View Source on GitHub

Tokenizes a single step in the solution.

Parameters

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class PathTokenizers(maze_dataset.tokenization.maze_tokenizer.__TokenizerElementNamespace):

View Source on GitHub

Namespace for _PathTokenizer subclass hierarchy used by MazeTokenizerModular.

class PathTokenizers.StepSequence(maze_dataset.tokenization.maze_tokenizer.PathTokenizers._PathTokenizer, abc.ABC):

View Source on GitHub

Any PathTokenizer where the tokenization may be assembled from token subsequences, each of which represents a step along the path. Allows for a sequence of leading and trailing tokens which don’t fit the step pattern.

Parameters

PathTokenizers.StepSequence

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.PathTokenizers.StepSequence'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.PathTokenizers.StepSequence'>",
    step_size: maze_dataset.tokenization.maze_tokenizer.StepSizes._StepSize = StepSizes.Singles(),
    step_tokenizers: tuple[maze_dataset.tokenization.maze_tokenizer.StepTokenizers._StepTokenizer] | tuple[maze_dataset.tokenization.maze_tokenizer.StepTokenizers._StepTokenizer, maze_dataset.tokenization.maze_tokenizer.StepTokenizers._StepTokenizer] | tuple[maze_dataset.tokenization.maze_tokenizer.StepTokenizers._StepTokenizer, maze_dataset.tokenization.maze_tokenizer.StepTokenizers._StepTokenizer, maze_dataset.tokenization.maze_tokenizer.StepTokenizers._StepTokenizer] | tuple[maze_dataset.tokenization.maze_tokenizer.StepTokenizers._StepTokenizer, maze_dataset.tokenization.maze_tokenizer.StepTokenizers._StepTokenizer, maze_dataset.tokenization.maze_tokenizer.StepTokenizers._StepTokenizer, maze_dataset.tokenization.maze_tokenizer.StepTokenizers._StepTokenizer] = (StepTokenizers.Coord(),),
    pre: bool = False,
    intra: bool = False,
    post: bool = False
)

def to_tokens

(
    self,
    maze: maze_dataset.maze.lattice_maze.SolvedMaze,
    coord_tokenizer: maze_dataset.tokenization.maze_tokenizer.CoordTokenizers._CoordTokenizer
) -> list[str]

View Source on GitHub

Returns tokens representing the solution path.

def is_valid

(self) -> bool

View Source on GitHub

Returns whether self contains data members capable of producing an overall valid MazeTokenizerModular. Some _TokenizerElement instances may be created which are not useful despite obeying data member type hints. is_valid allows for more precise detection of invalid _TokenizerElements beyond type hinting alone. If type hints are sufficient to constrain the possible instances of some subclass, then this method may simply return True for that subclass.

Types of Invalidity

In nontrivial implementations of this method, each conditional clause should contain a comment classifying the reason for invalidity and one of the types below. Invalidity types, in ascending order of invalidity:

- Uninteresting: These tokenizers might be used to train functional models, but the schemes are not interesting to study. E.g., _TokenizerElements which are strictly worse than some alternative.
- Duplicate: These tokenizers have identical tokenization behavior as some other valid tokenizers.
- Untrainable: Training functional models using these tokenizers would be (nearly) impossible.
- Erroneous: These tokenizers might raise exceptions during use.

Development

is_valid is implemented to always return True in some abstract classes where all currently possible subclass instances are valid. When adding new subclasses or data members, the developer should check whether any such blanket statement of validity still holds and update it as necessary.

Nesting

In general, when implementing this method, there is no need to recursively call is_valid on nested _TokenizerElements contained in the class. In other words, failures of is_valid need not bubble up to the top of the nested _TokenizerElement tree. MazeTokenizerModular.is_valid calls is_valid on each of its _TokenizerElements individually, so failure at any level will be detected.

Types of Invalidity

If it’s judged to be useful, the types of invalidity could be implemented with an Enum or similar rather than only living in comments. This could be used to create more or less stringent filters on the valid _TokenizerElement instances.
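A minimal sketch of what such an Enum could look like is shown below. Everything here is hypothetical: the names InvalidityType and passes_filter do not exist in the library, and the numeric ordering simply mirrors the "ascending order of invalidity" listed above.

```python
from enum import IntEnum
from typing import Optional


class InvalidityType(IntEnum):
    """HYPOTHETICAL severity levels, in ascending order of invalidity."""

    UNINTERESTING = 1
    DUPLICATE = 2
    UNTRAINABLE = 3
    ERRONEOUS = 4


def passes_filter(
    reasons: list[InvalidityType],
    tolerate_up_to: Optional[InvalidityType] = None,
) -> bool:
    """Return True if every recorded reason is at or below the tolerated severity.

    With tolerate_up_to=None the filter is strictest: any reason at all fails.
    """
    if tolerate_up_to is None:
        return not reasons
    return all(reason <= tolerate_up_to for reason in reasons)
```

With this shape, a lenient filter such as `passes_filter(reasons, InvalidityType.DUPLICATE)` would admit uninteresting or duplicate tokenizers while still rejecting untrainable or erroneous ones.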

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

def get_tokens_up_to_path_start

(
    tokens: list[str],
    include_start_coord: bool = True,
    tokenization_mode: maze_dataset.tokenization.maze_tokenizer.TokenizationMode = <TokenizationMode.AOTP_UT_uniform: 'AOTP_UT_uniform'>
) -> list[str]

View Source on GitHub

class MazeTokenizer(muutils.json_serialize.serializable_dataclass.SerializableDataclass):

View Source on GitHub

LEGACY Tokenizer for mazes

[!CAUTION] MazeTokenizerModular is the new standard for tokenization. This class is no longer recommended for use, but will remain for compatibility with existing code.

Parameters:

Properties

Conditional Properties

these all return None if max_grid_size is None. Prepend _ to the name to get a guaranteed type, and cause an exception if max_grid_size is None

Methods

MazeTokenizer

(
    *,
    tokenization_mode: maze_dataset.tokenization.maze_tokenizer.TokenizationMode = <TokenizationMode.AOTP_UT_uniform: 'AOTP_UT_uniform'>,
    max_grid_size: int | None = None
)

View Source on GitHub

map a coordinate to a token

def coords_to_strings

(
    self,
    coords: list[tuple[int, int]],
    when_noncoord: Literal['except', 'skip', 'include'] = 'skip'
) -> list[str]

View Source on GitHub

def strings_to_coords

(
    text: str,
    when_noncoord: Literal['except', 'skip', 'include'] = 'skip'
) -> list[str | tuple[int, int]]

View Source on GitHub

def encode

(self, text: str | list[str]) -> list[int]

View Source on GitHub

encode a string or list of strings into a list of tokens

def decode

(
    self,
    tokens: Sequence[int],
    joined_tokens: bool = False
) -> list[str] | str

View Source on GitHub

decode a list of tokens into a string or list of strings

def summary

(self) -> dict

View Source on GitHub

returns a summary of the tokenization mode

def is_AOTP

(self) -> bool

View Source on GitHub

returns true if a tokenization mode is Adjacency list, Origin, Target, Path

def is_UT

(self) -> bool

View Source on GitHub

def clear_cache

(self)

View Source on GitHub

clears all cached properties

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

Contains get_all_tokenizers() and supporting limited-use functions.

get_all_tokenizers()

Returns a comprehensive collection of all valid MazeTokenizerModular objects, which covers the overwhelming majority of all possible MazeTokenizerModular objects. Tokenizers not contained in get_all_tokenizers() may be possible to construct, but they are untested and not guaranteed to work. This collection lives in a separate module because it is expensive to compute and will grow more expensive as features are added to MazeTokenizerModular.

Use Cases

In general, uses for this module are limited to development of the library and specific research studying many tokenization behaviors.

- Unit testing:
  - Tokenizers to use in unit tests are sampled from get_all_tokenizers()
- Large-scale tokenizer research:
  - Specific research training models on many tokenization behaviors can use get_all_tokenizers() as the maximally inclusive collection
  - get_all_tokenizers() may be subsequently filtered using MazeTokenizerModular.has_element

For other uses, it’s likely that the computational expense can be avoided by using

- maze_tokenizer.get_all_tokenizer_hashes() for membership checks
- utils.all_instances for generating smaller subsets of MazeTokenizerModular or _TokenizerElement objects

EVERY_TEST_TOKENIZERS

A collection of the tokenizers which should always be included in unit tests when test fuzzing is used. This collection should be expanded as specific tokenizers become canonical or popular.

API Documentation

View Source on GitHub

maze_dataset.tokenization.all_tokenizers

View Source on GitHub

def get_all_tokenizers

() -> list[maze_dataset.tokenization.maze_tokenizer.MazeTokenizerModular]

View Source on GitHub

Computes a complete list of all valid tokenizers. Warning: This is an expensive function.

def all_tokenizers_set

() -> set[maze_dataset.tokenization.maze_tokenizer.MazeTokenizerModular]

View Source on GitHub

Casts get_all_tokenizers() to a set.

def sample_all_tokenizers

(
    n: int
) -> list[maze_dataset.tokenization.maze_tokenizer.MazeTokenizerModular]

View Source on GitHub

Samples n tokenizers from get_all_tokenizers().

def sample_tokenizers_for_test

(
    n: int | None
) -> list[maze_dataset.tokenization.maze_tokenizer.MazeTokenizerModular]

View Source on GitHub

Returns a sample of size n of unique elements from get_all_tokenizers(), always including every element in EVERY_TEST_TOKENIZERS.
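The sampling behavior described above (unique elements, with a fixed set always included) can be sketched in plain Python. This is only an illustration under stated assumptions: the function name and parameters are hypothetical, the treatment of n=None (return everything) and of n smaller than the required set (raise an error) are guesses rather than documented behavior, and both input sequences are assumed to contain no repeats.

```python
import random
from typing import Hashable, Optional, Sequence


def sample_with_required(
    population: Sequence[Hashable],
    required: Sequence[Hashable],
    n: Optional[int],
    seed: Optional[int] = None,
) -> list:
    """Sample n unique items from population, always including `required`."""
    if n is None:
        # ASSUMPTION: no size limit means "use the whole population".
        return list(population)
    if n < len(required):
        # ASSUMPTION: the sketch refuses samples too small to hold `required`.
        raise ValueError("n is smaller than the number of required items")
    rng = random.Random(seed)
    required_set = set(required)
    # Draw the remainder only from items that are not already required,
    # so the result never contains duplicates.
    optional = [item for item in population if item not in required_set]
    extra = rng.sample(optional, n - len(required))
    return list(required) + extra
```

Passing a seed makes the draw reproducible, which is usually what a test suite wants.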

def save_hashes

(
    path: pathlib.Path | None = None,
    verbose: bool = False,
    parallelize: bool | int = False
) -> jaxtyping.Int64[ndarray, 'tokenizers']

View Source on GitHub

Computes, sorts, and saves the hashes of every member of get_all_tokenizers().
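One reason to keep hashes sorted is that membership can then be checked with a binary search instead of rebuilding the full tokenizer collection. The sketch below illustrates that idea with the standard library only. It is not the library's implementation: the helper names are hypothetical, Python's built-in hash() stands in for whatever hash the tokenizers actually use, and the file-saving step is omitted.

```python
import bisect
from typing import Hashable, Iterable


def sorted_hashes(items: Iterable[Hashable]) -> list[int]:
    """Compute a hash per item and return the hashes in ascending order."""
    return sorted(hash(item) for item in items)


def contains_hash(hashes: list[int], value: int) -> bool:
    """Binary-search a sorted hash list for `value`."""
    i = bisect.bisect_left(hashes, value)
    return i < len(hashes) and hashes[i] == value
```

A lookup against the sorted list takes logarithmic time in the number of stored hashes.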

turning a maze into text: MazeTokenizerModular and the legacy TokenizationMode enum and MazeTokenizer class

API Documentation

View Source on GitHub

maze_dataset.tokenization.maze_tokenizer

View Source on GitHub

class TokenError(builtins.ValueError):

View Source on GitHub

error for tokenization

Inherited Members

class TokenizationMode(enum.Enum):

View Source on GitHub

legacy tokenization modes

[!CAUTION] Legacy mode of tokenization. It will still be around in future releases, but is no longer recommended for use. Use MazeTokenizerModular instead.

Abbreviations:

Modes:

def to_legacy_tokenizer

(self, max_grid_size: int | None = None)

View Source on GitHub

Inherited Members

def is_UT

(
    tokenization_mode: maze_dataset.tokenization.maze_tokenizer.TokenizationMode
) -> bool

View Source on GitHub

def get_tokens_up_to_path_start

(
    tokens: list[str],
    include_start_coord: bool = True,
    tokenization_mode: maze_dataset.tokenization.maze_tokenizer.TokenizationMode = <TokenizationMode.AOTP_UT_uniform: 'AOTP_UT_uniform'>
) -> list[str]

View Source on GitHub

class MazeTokenizer(muutils.json_serialize.serializable_dataclass.SerializableDataclass):

View Source on GitHub

LEGACY Tokenizer for mazes

[!CAUTION] MazeTokenizerModular is the new standard for tokenization. This class is no longer recommended for use, but will remain for compatibility with existing code.

Parameters:

Properties

Conditional Properties

these all return None if max_grid_size is None. Prepend _ to the name to get a guaranteed type, and cause an exception if max_grid_size is None

Methods

MazeTokenizer

(
    *,
    tokenization_mode: maze_dataset.tokenization.maze_tokenizer.TokenizationMode = <TokenizationMode.AOTP_UT_uniform: 'AOTP_UT_uniform'>,
    max_grid_size: int | None = None
)

View Source on GitHub

map a coordinate to a token

def coords_to_strings

(
    self,
    coords: list[tuple[int, int]],
    when_noncoord: Literal['except', 'skip', 'include'] = 'skip'
) -> list[str]

View Source on GitHub

def strings_to_coords

(
    text: str,
    when_noncoord: Literal['except', 'skip', 'include'] = 'skip'
) -> list[str | tuple[int, int]]

View Source on GitHub

def encode

(self, text: str | list[str]) -> list[int]

View Source on GitHub

encode a string or list of strings into a list of tokens

def decode

(
    self,
    tokens: Sequence[int],
    joined_tokens: bool = False
) -> list[str] | str

View Source on GitHub

decode a list of tokens into a string or list of strings

def summary

(self) -> dict

View Source on GitHub

returns a summary of the tokenization mode

def is_AOTP

(self) -> bool

View Source on GitHub

returns true if a tokenization mode is Adjacency list, Origin, Target, Path

def is_UT

(self) -> bool

View Source on GitHub

def clear_cache

(self)

View Source on GitHub

clears all cached properties

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

def mark_as_unsupported

(is_valid: Callable[[~T], bool], *args) -> ~T

View Source on GitHub

mark a _TokenizerElement as unsupported.

Classes marked with this decorator won’t show up in get_all_tokenizers() and thus won’t be tested. The classes marked in release 1.0.0 did work reliably before being marked, but they can’t be instantiated since the decorator adds an abstract method. The decorator exists to prune the space of tokenizers returned by all_instances both for testing and usage. Previously, the space was too large, resulting in impractical runtimes. These decorators could be removed in future releases to expand the space of possible tokenizers.

class CoordTokenizers(__TokenizerElementNamespace):

View Source on GitHub

Namespace for _CoordTokenizer subclass hierarchy used by MazeTokenizerModular.

class CoordTokenizers.UT(CoordTokenizers._CoordTokenizer):

View Source on GitHub

Unique token coordinate tokenizer.

CoordTokenizers.UT

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.CoordTokenizers.UT'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.CoordTokenizers.UT'>"
)

def to_tokens

(
    self,
    coord: jaxtyping.Int8[ndarray, 'row_col'] | tuple[int, int]
) -> list[str]

View Source on GitHub

Converts a maze element into a list of tokens. Not all _TokenizerElement subclasses produce tokens, so this is not an abstract method. Those subclasses which do produce tokens should override this method.

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class CoordTokenizers.CTT(CoordTokenizers._CoordTokenizer):

View Source on GitHub

Coordinate tuple tokenizer

Parameters

CoordTokenizers.CTT

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.CoordTokenizers.CTT'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.CoordTokenizers.CTT'>",
    pre: bool = True,
    intra: bool = True,
    post: bool = True
)

def to_tokens

(
    self,
    coord: jaxtyping.Int8[ndarray, 'row_col'] | tuple[int, int]
) -> list[str]

View Source on GitHub

Converts a maze element into a list of tokens. Not all _TokenizerElement subclasses produce tokens, so this is not an abstract method. Those subclasses which do produce tokens should override this method.
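As a rough illustration of a coordinate-tuple tokenization, the sketch below splits a coordinate into separate row and column tokens with optional delimiters. The parameter descriptions are not available in this section, so the reading of pre, intra, and post as "emit a delimiter before, between, and after the two numbers" is an assumption, as are the function name and the delimiter strings.

```python
def coord_tuple_tokens(
    coord: tuple[int, int],
    pre: bool = True,
    intra: bool = True,
    post: bool = True,
) -> list[str]:
    """Tokenize (row, col) as separate number tokens with optional delimiters."""
    tokens = []
    if pre:
        tokens.append("(")  # HYPOTHETICAL leading delimiter
    tokens.append(str(coord[0]))
    if intra:
        tokens.append(",")  # HYPOTHETICAL separator between row and column
    tokens.append(str(coord[1]))
    if post:
        tokens.append(")")  # HYPOTHETICAL trailing delimiter
    return tokens
```

Under this reading, the vocabulary needs one token per row or column index rather than one per coordinate, which is the contrast with a unique-token scheme such as UT.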

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class EdgeGroupings(__TokenizerElementNamespace):

View Source on GitHub

Namespace for _EdgeGrouping subclass hierarchy used by _AdjListTokenizer.

class EdgeGroupings.Ungrouped(EdgeGroupings._EdgeGrouping):

View Source on GitHub

No grouping occurs, each edge is tokenized individually.

Parameters

EdgeGroupings.Ungrouped

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.EdgeGroupings.Ungrouped'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.EdgeGroupings.Ungrouped'>",
    connection_token_ordinal: Literal[0, 1, 2] = 1
)

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class EdgeGroupings.ByLeadingCoord(EdgeGroupings._EdgeGrouping):

View Source on GitHub

All edges with the same leading coord are grouped together.
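Grouping by leading coordinate can be pictured as building a mapping from each edge's first coordinate to the list of coordinates it connects to. The sketch below shows only that grouping step; the function name and the representation of an edge as a pair of (row, col) tuples are assumptions for illustration, not the library's data structures.

```python
from collections import defaultdict

Coord = tuple[int, int]


def group_by_leading_coord(
    edges: list[tuple[Coord, Coord]],
) -> dict[Coord, list[Coord]]:
    """Collect the trailing coords of all edges that share a leading coord."""
    groups: dict[Coord, list[Coord]] = defaultdict(list)
    for leading, trailing in edges:
        groups[leading].append(trailing)
    return dict(groups)
```

Which coordinate of an edge counts as "leading" depends on how the edges were ordered beforehand, so the result of the grouping changes with the edge permutation applied earlier.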

Parameters

EdgeGroupings.ByLeadingCoord

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.EdgeGroupings.ByLeadingCoord'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.EdgeGroupings.ByLeadingCoord'>",
    intra: bool = True,
    shuffle_group: bool = True,
    connection_token_ordinal: Literal[0, 1] = 0
)

def is_valid

(self_)

View Source on GitHub

Returns whether self contains data members capable of producing an overall valid MazeTokenizerModular. Some _TokenizerElement instances may be created which are not useful despite obeying data member type hints. is_valid allows for more precise detection of invalid _TokenizerElements beyond type hinting alone. If type hints are sufficient to constrain the possible instances of some subclass, then this method may simply return True for that subclass.

Types of Invalidity

In nontrivial implementations of this method, each conditional clause should contain a comment classifying the reason for invalidity and one of the types below. Invalidity types, in ascending order of invalidity:

- Uninteresting: These tokenizers might be used to train functional models, but the schemes are not interesting to study. E.g., _TokenizerElements which are strictly worse than some alternative.
- Duplicate: These tokenizers have identical tokenization behavior as some other valid tokenizers.
- Untrainable: Training functional models using these tokenizers would be (nearly) impossible.
- Erroneous: These tokenizers might raise exceptions during use.

Development

is_valid is implemented to always return True in some abstract classes where all currently possible subclass instances are valid. When adding new subclasses or data members, the developer should check whether any such blanket statement of validity still holds and update it as necessary.

Nesting

In general, when implementing this method, there is no need to recursively call is_valid on nested _TokenizerElements contained in the class. In other words, failures of is_valid need not bubble up to the top of the nested _TokenizerElement tree. MazeTokenizerModular.is_valid calls is_valid on each of its _TokenizerElements individually, so failure at any level will be detected.

Types of Invalidity

If it’s judged to be useful, the types of invalidity could be implemented with an Enum or similar rather than only living in comments. This could be used to create more or less stringent filters on the valid _TokenizerElement instances.

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class EdgePermuters(__TokenizerElementNamespace):

View Source on GitHub

Namespace for _EdgePermuter subclass hierarchy used by _AdjListTokenizer.

class EdgePermuters.SortedCoords(EdgePermuters._EdgePermuter):

View Source on GitHub

returns a sorted representation. useful for checking consistency

EdgePermuters.SortedCoords

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.EdgePermuters.SortedCoords'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.EdgePermuters.SortedCoords'>"
)

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class EdgePermuters.RandomCoords(EdgePermuters._EdgePermuter):

View Source on GitHub

Permutes each edge randomly.

EdgePermuters.RandomCoords

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.EdgePermuters.RandomCoords'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.EdgePermuters.RandomCoords'>"
)

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class EdgePermuters.BothCoords(EdgePermuters._EdgePermuter):

View Source on GitHub

Includes both possible permutations of every edge in the output. Since input ConnectionList has only 1 instance of each edge, a call to BothCoords._permute will modify lattice_edges in-place, doubling shape[0].
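The doubling described above can be illustrated with a small pure-Python sketch: every edge appears once as given and once with its two coordinates swapped. The function name and the list-of-tuples representation are assumptions for illustration, and unlike the in-place behavior described for BothCoords._permute, this sketch returns a new list and leaves its input untouched.

```python
Coord = tuple[int, int]
Edge = tuple[Coord, Coord]


def both_permutations(edges: list[Edge]) -> list[Edge]:
    """Return each edge in both coordinate orders (output is twice as long)."""
    flipped = [(b, a) for a, b in edges]
    return list(edges) + flipped
```

For an input of E edges the output therefore has 2E entries, matching the doubling of the first dimension mentioned above.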

EdgePermuters.BothCoords

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.EdgePermuters.BothCoords'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.EdgePermuters.BothCoords'>"
)

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class EdgeSubsets(__TokenizerElementNamespace):

View Source on GitHub

Namespace for _EdgeSubset subclass hierarchy used by _AdjListTokenizer.

class EdgeSubsets.AllLatticeEdges(EdgeSubsets._EdgeSubset):

View Source on GitHub

All 2n**2-2n edges of the lattice are tokenized. If a wall exists on that edge, the edge is tokenized in the same manner, using VOCAB.ADJLIST_WALL in place of VOCAB.CONNECTOR.
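The edge count can be checked by enumerating the edges of an n x n lattice directly. The helper below is a standalone sketch (the name `lattice_edges` is made up for this example) that lists each undirected edge once and compares the total with 2n**2-2n.

```python
def lattice_edges(n: int) -> list[tuple[tuple[int, int], tuple[int, int]]]:
    """Enumerate each undirected edge of an n x n lattice exactly once."""
    edges = []
    for row in range(n):
        for col in range(n):
            if row + 1 < n:  # edge to the neighbor below
                edges.append(((row, col), (row + 1, col)))
            if col + 1 < n:  # edge to the neighbor on the right
                edges.append(((row, col), (row, col + 1)))
    return edges


counts = {n: len(lattice_edges(n)) for n in range(1, 7)}
```

For a 5 x 5 lattice this gives 20 vertical plus 20 horizontal edges, matching 2*25 - 2*5 = 40.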

EdgeSubsets.AllLatticeEdges

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.EdgeSubsets.AllLatticeEdges'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.EdgeSubsets.AllLatticeEdges'>"
)

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class EdgeSubsets.ConnectionEdges(EdgeSubsets._EdgeSubset):

View Source on GitHub

By default, only edges which contain a connection are tokenized. Alternatively, when walls=True, only edges which contain a wall are tokenized.
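A minimal sketch of this selection, assuming edges are stored as coord pairs and the open connections are given as a set. The function `select_edges` is hypothetical; only the `walls` flag and its default mirror the signature documented here.

```python
Edge = tuple[tuple[int, int], tuple[int, int]]


def select_edges(all_edges: list[Edge], connected: set[Edge], walls: bool = False) -> list[Edge]:
    """Keep connected edges by default, or only walled edges when walls=True."""
    return [e for e in all_edges if (e in connected) != walls]


# 2x2 lattice: four candidate edges, three of which are open connections
all_edges = [((0, 0), (1, 0)), ((0, 0), (0, 1)), ((0, 1), (1, 1)), ((1, 0), (1, 1))]
connected = {((0, 0), (1, 0)), ((0, 0), (0, 1)), ((0, 1), (1, 1))}
connections = select_edges(all_edges, connected)
wall_edges = select_edges(all_edges, connected, walls=True)
```

The two selections are complementary: together they cover every lattice edge exactly once.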

Parameters

EdgeSubsets.ConnectionEdges

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.EdgeSubsets.ConnectionEdges'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.EdgeSubsets.ConnectionEdges'>",
    walls: bool = False
)

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class AdjListTokenizers(__TokenizerElementNamespace):

View Source on GitHub

Namespace for _AdjListTokenizer subclass hierarchy used by MazeTokenizerModular.

class AdjListTokenizers.AdjListCoord(AdjListTokenizers._AdjListTokenizer):

View Source on GitHub

Represents an edge group as tokens for the leading coord followed by coord tokens for the other group members.

AdjListTokenizers.AdjListCoord

(
    *,
    pre: bool = False,
    post: bool = True,
    shuffle_d0: bool = True,
    edge_grouping: maze_dataset.tokenization.maze_tokenizer.EdgeGroupings._EdgeGrouping = EdgeGroupings.Ungrouped(connection_token_ordinal=1),
    edge_subset: maze_dataset.tokenization.maze_tokenizer.EdgeSubsets._EdgeSubset = EdgeSubsets.ConnectionEdges(walls=False),
    edge_permuter: maze_dataset.tokenization.maze_tokenizer.EdgePermuters._EdgePermuter = EdgePermuters.RandomCoords(),
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.AdjListTokenizers.AdjListCoord'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.AdjListTokenizers.AdjListCoord'>"
)

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class AdjListTokenizers.AdjListCardinal(AdjListTokenizers._AdjListTokenizer):

View Source on GitHub

Represents an edge group as coord tokens for the leading coord and cardinal tokens relative to the leading coord for the other group members.
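The difference between the coord and cardinal representations of an edge group can be sketched as follows. This is an independent illustration: the direction labels, the string form of the coords, and the convention that rows increase toward the south are all assumptions, not the library's actual tokens.

```python
Coord = tuple[int, int]

# Assumed convention: coords are (row, col) and rows increase toward the south.
DELTA_TO_NAME = {(-1, 0): "NORTH", (1, 0): "SOUTH", (0, -1): "WEST", (0, 1): "EAST"}


def group_as_coords(leading: Coord, members: list[Coord]) -> list[str]:
    """Leading coord followed by the other members, all written as coords."""
    return [str(leading)] + [str(m) for m in members]


def group_as_cardinals(leading: Coord, members: list[Coord]) -> list[str]:
    """Leading coord followed by each member's direction relative to it."""
    names = [DELTA_TO_NAME[(r - leading[0], c - leading[1])] for r, c in members]
    return [str(leading)] + names


coord_form = group_as_coords((1, 1), [(0, 1), (1, 2)])
cardinal_form = group_as_cardinals((1, 1), [(0, 1), (1, 2)])
```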

Parameters

AdjListTokenizers.AdjListCardinal

(
    *,
    pre: bool = False,
    post: bool = True,
    shuffle_d0: bool = True,
    edge_grouping: maze_dataset.tokenization.maze_tokenizer.EdgeGroupings._EdgeGrouping = EdgeGroupings.Ungrouped(connection_token_ordinal=1),
    edge_subset: maze_dataset.tokenization.maze_tokenizer.EdgeSubsets._EdgeSubset = EdgeSubsets.ConnectionEdges(walls=False),
    edge_permuter: maze_dataset.tokenization.maze_tokenizer.EdgePermuters._EdgePermuter = EdgePermuters.BothCoords(),
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.AdjListTokenizers.AdjListCardinal'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.AdjListTokenizers.AdjListCardinal'>"
)

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class TargetTokenizers(__TokenizerElementNamespace):

View Source on GitHub

Namespace for _TargetTokenizer subclass hierarchy used by MazeTokenizerModular.

class TargetTokenizers.Unlabeled(TargetTokenizers._TargetTokenizer):

View Source on GitHub

Targets are simply listed as coord tokens.

- post: Whether all coords include an integral following delimiter token

TargetTokenizers.Unlabeled

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.TargetTokenizers.Unlabeled'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.TargetTokenizers.Unlabeled'>",
    post: bool = False
)

def to_tokens

(
    self,
    targets: Sequence[jaxtyping.Int8[ndarray, 'row_col']],
    coord_tokenizer: maze_dataset.tokenization.maze_tokenizer.CoordTokenizers._CoordTokenizer
) -> list[str]

View Source on GitHub

Returns tokens representing the target.

def is_valid

(self) -> bool

View Source on GitHub

Returns whether self contains data members capable of producing an overall valid MazeTokenizerModular. Some _TokenizerElement instances may be created which are not useful despite obeying data member type hints. is_valid allows for more precise detection of invalid _TokenizerElements beyond type hinting alone. If type hints are sufficient to constrain the possible instances of some subclass, then this method may simply return True for that subclass.

Types of Invalidity

In nontrivial implementations of this method, each conditional clause should contain a comment classifying the reason for invalidity and one of the types below. Invalidity types, in ascending order of invalidity:

- Uninteresting: These tokenizers might be used to train functional models, but the schemes are not interesting to study. E.g., _TokenizerElements which are strictly worse than some alternative.
- Duplicate: These tokenizers have tokenization behavior identical to that of some other valid tokenizers.
- Untrainable: Training functional models using these tokenizers would be (nearly) impossible.
- Erroneous: These tokenizers might raise exceptions during use.

Development

is_valid is implemented to always return True in some abstract classes where all currently possible subclass instances are valid. When adding new subclasses or data members, the developer should check if any such blanket statement of validity still holds and update it as necessary.

Nesting

In general, when implementing this method, there is no need to recursively call is_valid on nested _TokenizerElements contained in the class. In other words, failures of is_valid need not bubble up to the top of the nested _TokenizerElement tree. MazeTokenizerModular.is_valid calls is_valid on each of its _TokenizerElements individually, so failure at any level will be detected.

Types of Invalidity

If it’s judged to be useful, the types of invalidity could be implemented with an Enum or similar rather than only living in comments. This could be used to create more or less stringent filters on the valid _TokenizerElement instances.
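One way the Enum idea floated above might look is sketched below. Nothing like this exists in the documented API; the class `Invalidity`, its members, and the helper `passes_filter` are hypothetical and only show how an ordered enum would allow stricter or looser filters.

```python
from enum import IntEnum


class Invalidity(IntEnum):
    """Hypothetical ordering of invalidity types, least to most severe."""

    VALID = 0
    UNINTERESTING = 1
    DUPLICATE = 2
    UNTRAINABLE = 3
    ERRONEOUS = 4


def passes_filter(invalidity: Invalidity, tolerate_up_to: Invalidity = Invalidity.VALID) -> bool:
    """Accept an element whose invalidity does not exceed the tolerated level."""
    return invalidity <= tolerate_up_to


strict = [v.name for v in Invalidity if passes_filter(v)]
lenient = [v.name for v in Invalidity if passes_filter(v, Invalidity.DUPLICATE)]
```

With the default threshold only fully valid elements pass; raising the threshold to DUPLICATE also admits the uninteresting and duplicate ones.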

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class StepSizes(__TokenizerElementNamespace):

View Source on GitHub

Namespace for _StepSize subclass hierarchy used by MazeTokenizerModular.

class StepSizes.Singles(StepSizes._StepSize):

View Source on GitHub

Every coord in maze.solution is represented. Legacy tokenizers all use this behavior.

StepSizes.Singles

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.StepSizes.Singles'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.StepSizes.Singles'>"
)

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class StepSizes.Straightaways(StepSizes._StepSize):

View Source on GitHub

Only coords where the path turns are represented in the path. I.e., the path is represented as a sequence of straightaways, specified by the coords at the turns.
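The selection of turn coords can be sketched on a plain list of (row, col) tuples. This standalone version also keeps the first and last coords of the path, which is an assumption of the example rather than something stated above; the name `turn_coords` is likewise made up.

```python
Coord = tuple[int, int]


def turn_coords(path: list[Coord]) -> list[Coord]:
    """Keep the first and last coords plus every coord where the direction changes."""
    if len(path) <= 2:
        return list(path)
    kept = [path[0]]
    for prev, cur, nxt in zip(path, path[1:], path[2:]):
        step_in = (cur[0] - prev[0], cur[1] - prev[1])
        step_out = (nxt[0] - cur[0], nxt[1] - cur[1])
        if step_in != step_out:  # the path changes direction at cur
            kept.append(cur)
    kept.append(path[-1])
    return kept


path = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1)]
turns = turn_coords(path)
```

Here the six-coord path collapses to three straightaways, described by four coords.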

StepSizes.Straightaways

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.StepSizes.Straightaways'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.StepSizes.Straightaways'>"
)

def is_valid

(self_)

View Source on GitHub

Returns whether self contains data members capable of producing an overall valid MazeTokenizerModular. Some _TokenizerElement instances may be created which are not useful despite obeying data member type hints. is_valid allows for more precise detection of invalid _TokenizerElements beyond type hinting alone. If type hints are sufficient to constrain the possible instances of some subclass, then this method may simply return True for that subclass.

Types of Invalidity

In nontrivial implementations of this method, each conditional clause should contain a comment classifying the reason for invalidity and one of the types below. Invalidity types, in ascending order of invalidity:

- Uninteresting: These tokenizers might be used to train functional models, but the schemes are not interesting to study. E.g., _TokenizerElements which are strictly worse than some alternative.
- Duplicate: These tokenizers have tokenization behavior identical to that of some other valid tokenizers.
- Untrainable: Training functional models using these tokenizers would be (nearly) impossible.
- Erroneous: These tokenizers might raise exceptions during use.

Development

is_valid is implemented to always return True in some abstract classes where all currently possible subclass instances are valid. When adding new subclasses or data members, the developer should check if any such blanket statement of validity still holds and update it as necessary.

Nesting

In general, when implementing this method, there is no need to recursively call is_valid on nested _TokenizerElements contained in the class. In other words, failures of is_valid need not bubble up to the top of the nested _TokenizerElement tree. MazeTokenizerModular.is_valid calls is_valid on each of its _TokenizerElements individually, so failure at any level will be detected.

Types of Invalidity

If it’s judged to be useful, the types of invalidity could be implemented with an Enum or similar rather than only living in comments. This could be used to create more or less stringent filters on the valid _TokenizerElement instances.

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class StepSizes.Forks(StepSizes._StepSize):

View Source on GitHub

Only coords at forks, where the path has >=2 options for the next step, are included. Backtracking does not count as an option. The starting and ending coords are always included.
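A small sketch of fork detection, assuming the maze connectivity is available as a mapping from each coord to its connected neighbors. The function `fork_coords` and the adjacency-dict format are assumptions for this example and are not how the library stores mazes.

```python
Coord = tuple[int, int]


def fork_coords(path: list[Coord], neighbors: dict[Coord, set[Coord]]) -> list[Coord]:
    """Keep start, end, and every coord with >=2 onward options besides backtracking."""
    kept = [path[0]]
    for prev, cur in zip(path, path[1:-1]):
        options = neighbors[cur] - {prev}  # exclude stepping back to where we came from
        if len(options) >= 2:
            kept.append(cur)
    if len(path) > 1:
        kept.append(path[-1])
    return kept


# Connectivity of a small maze as an adjacency mapping (hypothetical example data).
neighbors = {
    (0, 0): {(0, 1)},
    (0, 1): {(0, 0), (0, 2), (1, 1)},
    (0, 2): {(0, 1)},
    (1, 1): {(0, 1), (2, 1)},
    (2, 1): {(1, 1)},
}
path = [(0, 0), (0, 1), (1, 1), (2, 1)]
forks = fork_coords(path, neighbors)
```

Only (0, 1) offers two onward choices, so it is the single interior coord that is kept.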

StepSizes.Forks

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.StepSizes.Forks'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.StepSizes.Forks'>"
)

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class StepSizes.ForksAndStraightaways(StepSizes._StepSize):

View Source on GitHub

Includes the union of the coords included by Forks and Straightaways. See documentation for those classes for details.

StepSizes.ForksAndStraightaways

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.StepSizes.ForksAndStraightaways'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.StepSizes.ForksAndStraightaways'>"
)

def is_valid

(self_)

View Source on GitHub

Returns whether self contains data members capable of producing an overall valid MazeTokenizerModular. Some _TokenizerElement instances may be created which are not useful despite obeying data member type hints. is_valid allows for more precise detection of invalid _TokenizerElements beyond type hinting alone. If type hints are sufficient to constrain the possible instances of some subclass, then this method may simply return True for that subclass.

Types of Invalidity

In nontrivial implementations of this method, each conditional clause should contain a comment classifying the reason for invalidity and one of the types below. Invalidity types, in ascending order of invalidity:

- Uninteresting: These tokenizers might be used to train functional models, but the schemes are not interesting to study. E.g., _TokenizerElements which are strictly worse than some alternative.
- Duplicate: These tokenizers have tokenization behavior identical to that of some other valid tokenizers.
- Untrainable: Training functional models using these tokenizers would be (nearly) impossible.
- Erroneous: These tokenizers might raise exceptions during use.

Development

is_valid is implemented to always return True in some abstract classes where all currently possible subclass instances are valid. When adding new subclasses or data members, the developer should check if any such blanket statement of validity still holds and update it as necessary.

Nesting

In general, when implementing this method, there is no need to recursively call is_valid on nested _TokenizerElements contained in the class. In other words, failures of is_valid need not bubble up to the top of the nested _TokenizerElement tree. MazeTokenizerModular.is_valid calls is_valid on each of its _TokenizerElements individually, so failure at any level will be detected.

Types of Invalidity

If it’s judged to be useful, the types of invalidity could be implemented with an Enum or similar rather than only living in comments. This could be used to create more or less stringent filters on the valid _TokenizerElement instances.

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class StepTokenizers(__TokenizerElementNamespace):

View Source on GitHub

Namespace for _StepTokenizer subclass hierarchy used by MazeTokenizerModular.

class StepTokenizers.Coord(StepTokenizers._StepTokenizer):

View Source on GitHub

A direct tokenization of the end position coord represents the step.

StepTokenizers.Coord

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.StepTokenizers.Coord'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.StepTokenizers.Coord'>"
)

def to_tokens

(
    self,
    maze: maze_dataset.maze.lattice_maze.SolvedMaze,
    start_index: int,
    end_index: int,
    coord_tokenizer: maze_dataset.tokenization.maze_tokenizer.CoordTokenizers._CoordTokenizer
) -> list[str]

View Source on GitHub

Tokenizes a single step in the solution.

Parameters

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class StepTokenizers.Cardinal(StepTokenizers._StepTokenizer):

View Source on GitHub

A step is tokenized with a cardinal direction token. It is the direction of the step from the starting position along the solution.
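A minimal sketch of deriving the direction of a step from the solution coords. The direction labels and the convention that rows increase toward the south are assumptions for this example; the library's actual token strings are not reproduced here.

```python
Coord = tuple[int, int]

# Assumed convention: coords are (row, col) and rows increase toward the south.
UNIT_TO_NAME = {(-1, 0): "NORTH", (1, 0): "SOUTH", (0, -1): "WEST", (0, 1): "EAST"}


def step_direction(solution: list[Coord], start_index: int) -> str:
    """Direction of travel when leaving solution[start_index] along the solution."""
    (r0, c0), (r1, c1) = solution[start_index], solution[start_index + 1]
    return UNIT_TO_NAME[(r1 - r0, c1 - c0)]


solution = [(2, 0), (1, 0), (1, 1), (2, 1)]
directions = [step_direction(solution, i) for i in range(len(solution) - 1)]
```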

StepTokenizers.Cardinal

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.StepTokenizers.Cardinal'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.StepTokenizers.Cardinal'>"
)

def to_tokens

(
    self,
    maze: maze_dataset.maze.lattice_maze.SolvedMaze,
    start_index: int,
    end_index: int,
    **kwargs
) -> list[str]

View Source on GitHub

Tokenizes a single step in the solution.

Parameters

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class StepTokenizers.Relative(StepTokenizers._StepTokenizer):

View Source on GitHub

Tokenizes a solution step using relative first-person directions (right, left, forward, etc.). To resolve the initial indeterminacy, the “agent” solving the maze is assumed to be facing NORTH at the start of a solution. As with Cardinal, the direction is that of the step from the starting position.
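The first-person scheme can be sketched by tracking which way the agent faces after each move. This is an independent illustration: the turn labels, the (row, col) convention with rows increasing southward, and the function `relative_steps` are assumptions, while the initial NORTH heading follows the description above.

```python
Coord = tuple[int, int]

# Headings in clockwise order; assumed convention: (row, col), rows increase southward.
HEADINGS = [(-1, 0), (0, 1), (1, 0), (0, -1)]  # north, east, south, west
TURN_NAMES = {0: "FORWARD", 1: "RIGHT", 2: "BACKWARD", 3: "LEFT"}


def relative_steps(solution: list[Coord]) -> list[str]:
    """Describe each single step relative to the direction the agent is facing."""
    facing = 0  # index into HEADINGS; the agent starts out facing north
    names = []
    for (r0, c0), (r1, c1) in zip(solution, solution[1:]):
        heading = HEADINGS.index((r1 - r0, c1 - c0))
        names.append(TURN_NAMES[(heading - facing) % 4])
        facing = heading  # after the step the agent faces the way it moved
    return names


solution = [(2, 0), (1, 0), (1, 1), (2, 1)]
relative = relative_steps(solution)
```

The path goes north, then east, then south, which reads as forward followed by two right turns.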

StepTokenizers.Relative

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.StepTokenizers.Relative'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.StepTokenizers.Relative'>"
)

def to_tokens

(
    self,
    maze: maze_dataset.maze.lattice_maze.SolvedMaze,
    start_index: int,
    end_index: int,
    **kwargs
) -> list[str]

View Source on GitHub

Tokenizes a single step in the solution.

Parameters

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class StepTokenizers.Distance(StepTokenizers._StepTokenizer):

View Source on GitHub

A count of the number of individual steps from the starting point to the end point. Contains no information about directionality, only the distance traveled in the step. Distance must be combined with at least one other _StepTokenizer in a StepTokenizerPermutation. This constraint is enforced in _PathTokenizer.is_valid.
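Assuming start_index and end_index are positions in the solution, the distance of a step is just the difference of the two indices. The sketch below uses a hypothetical helper `step_distances` and example indices chosen by hand.

```python
def step_distances(step_indices: list[int]) -> list[int]:
    """Distance of each step, counted in single-cell moves along the solution."""
    return [end - start for start, end in zip(step_indices, step_indices[1:])]


solution = [(0, 0), (0, 1), (0, 2), (1, 2)]
turn_indices = [0, 2, 3]  # e.g. indices that a straightaway step size might select
distances = step_distances(turn_indices)
```

A path heading south twice and then east once would produce the same distances, which illustrates why a distance on its own cannot identify the path and has to be paired with another step tokenizer.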

StepTokenizers.Distance

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.StepTokenizers.Distance'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.StepTokenizers.Distance'>"
)

def to_tokens

(
    self,
    maze: maze_dataset.maze.lattice_maze.SolvedMaze,
    start_index: int,
    end_index: int,
    **kwargs
) -> list[str]

View Source on GitHub

Tokenizes a single step in the solution.

Parameters

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class PathTokenizers(__TokenizerElementNamespace):

View Source on GitHub

Namespace for _PathTokenizer subclass hierarchy used by MazeTokenizerModular.

class PathTokenizers.StepSequence(PathTokenizers._PathTokenizer, abc.ABC):

View Source on GitHub

Any PathTokenizer where the tokenization may be assembled from token subsequences, each of which represents a step along the path. Allows for a sequence of leading and trailing tokens which don’t fit the step pattern.
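The assembly of a path from per-step token subsequences can be sketched as a loop over consecutive step boundaries, applying each step tokenizer in turn. Everything here is a simplified stand-in: the functions `coord_step`, `distance_step`, and `tokenize_path`, the token spellings, and the omission of leading, trailing, and delimiter tokens are all assumptions of this example.

```python
from typing import Callable

Coord = tuple[int, int]
StepFn = Callable[[list[Coord], int, int], list[str]]


def coord_step(solution: list[Coord], start: int, end: int) -> list[str]:
    """Represent a step by the coord where it ends."""
    row, col = solution[end]
    return [f"({row},{col})"]


def distance_step(solution: list[Coord], start: int, end: int) -> list[str]:
    """Represent a step by the number of single-cell moves it covers."""
    return [str(end - start)]


def tokenize_path(solution: list[Coord], step_indices: list[int], step_fns: tuple[StepFn, ...]) -> list[str]:
    """Concatenate, for each step, the tokens produced by every step function in order."""
    tokens: list[str] = []
    for start, end in zip(step_indices, step_indices[1:]):
        for fn in step_fns:
            tokens.extend(fn(solution, start, end))
    return tokens


solution = [(0, 0), (0, 1), (0, 2), (1, 2)]
tokens = tokenize_path(solution, [0, 2, 3], (distance_step, coord_step))
```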

Parameters

PathTokenizers.StepSequence

(
    *,
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.PathTokenizers.StepSequence'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.PathTokenizers.StepSequence'>",
    step_size: maze_dataset.tokenization.maze_tokenizer.StepSizes._StepSize = StepSizes.Singles(),
    step_tokenizers: tuple[maze_dataset.tokenization.maze_tokenizer.StepTokenizers._StepTokenizer] | tuple[maze_dataset.tokenization.maze_tokenizer.StepTokenizers._StepTokenizer, maze_dataset.tokenization.maze_tokenizer.StepTokenizers._StepTokenizer] | tuple[maze_dataset.tokenization.maze_tokenizer.StepTokenizers._StepTokenizer, maze_dataset.tokenization.maze_tokenizer.StepTokenizers._StepTokenizer, maze_dataset.tokenization.maze_tokenizer.StepTokenizers._StepTokenizer] | tuple[maze_dataset.tokenization.maze_tokenizer.StepTokenizers._StepTokenizer, maze_dataset.tokenization.maze_tokenizer.StepTokenizers._StepTokenizer, maze_dataset.tokenization.maze_tokenizer.StepTokenizers._StepTokenizer, maze_dataset.tokenization.maze_tokenizer.StepTokenizers._StepTokenizer] = (StepTokenizers.Coord(),),
    pre: bool = False,
    intra: bool = False,
    post: bool = False
)

def to_tokens

(
    self,
    maze: maze_dataset.maze.lattice_maze.SolvedMaze,
    coord_tokenizer: maze_dataset.tokenization.maze_tokenizer.CoordTokenizers._CoordTokenizer
) -> list[str]

View Source on GitHub

Returns tokens representing the solution path.

def is_valid

(self) -> bool

View Source on GitHub

Returns whether self contains data members capable of producing an overall valid MazeTokenizerModular. Some _TokenizerElement instances may be created which are not useful despite obeying data member type hints. is_valid allows for more precise detection of invalid _TokenizerElements beyond type hinting alone. If type hints are sufficient to constrain the possible instances of some subclass, then this method may simply return True for that subclass.

Types of Invalidity

In nontrivial implementations of this method, each conditional clause should contain a comment classifying the reason for invalidity and one of the types below. Invalidity types, in ascending order of invalidity:

- Uninteresting: These tokenizers might be used to train functional models, but the schemes are not interesting to study. E.g., _TokenizerElements which are strictly worse than some alternative.
- Duplicate: These tokenizers have tokenization behavior identical to that of some other valid tokenizers.
- Untrainable: Training functional models using these tokenizers would be (nearly) impossible.
- Erroneous: These tokenizers might raise exceptions during use.

Development

is_valid is implemented to always return True in some abstract classes where all currently possible subclass instances are valid. When adding new subclasses or data members, the developer should check if any such blanket statement of validity still holds and update it as necessary.

Nesting

In general, when implementing this method, there is no need to recursively call is_valid on nested _TokenizerElements contained in the class. In other words, failures of is_valid need not bubble up to the top of the nested _TokenizerElement tree. MazeTokenizerModular.is_valid calls is_valid on each of its _TokenizerElements individually, so failure at any level will be detected.

Types of Invalidity

If it’s judged to be useful, the types of invalidity could be implemented with an Enum or similar rather than only living in comments. This could be used to create more or less stringent filters on the valid _TokenizerElement instances.

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class PromptSequencers(__TokenizerElementNamespace):

View Source on GitHub

Namespace for _PromptSequencer subclass hierarchy used by MazeTokenizerModular.

class PromptSequencers.AOTP(PromptSequencers._PromptSequencer):

View Source on GitHub

Sequences a prompt as [adjacency list, origin, target, path].

Parameters

PromptSequencers.AOTP

(
    *,
    coord_tokenizer: maze_dataset.tokenization.maze_tokenizer.CoordTokenizers._CoordTokenizer = CoordTokenizers.UT(),
    adj_list_tokenizer: maze_dataset.tokenization.maze_tokenizer.AdjListTokenizers._AdjListTokenizer = AdjListTokenizers.AdjListCoord(pre=False, post=True, shuffle_d0=True, edge_grouping=EdgeGroupings.Ungrouped(connection_token_ordinal=1), edge_subset=EdgeSubsets.ConnectionEdges(walls=False), edge_permuter=EdgePermuters.RandomCoords()),
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.PromptSequencers.AOTP'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.PromptSequencers.AOTP'>",
    target_tokenizer: maze_dataset.tokenization.maze_tokenizer.TargetTokenizers._TargetTokenizer = TargetTokenizers.Unlabeled(post=False),
    path_tokenizer: maze_dataset.tokenization.maze_tokenizer.PathTokenizers._PathTokenizer = PathTokenizers.StepSequence(step_size=StepSizes.Singles(), step_tokenizers=(StepTokenizers.Coord(),), pre=False, intra=False, post=False)
)

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class PromptSequencers.AOP(PromptSequencers._PromptSequencer):

View Source on GitHub

Sequences a prompt as [adjacency list, origin, path]. Still includes the “<TARGET_START>” and “<TARGET_END>” tokens, but no representation of the target itself.
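The difference between the AOTP and AOP orderings can be illustrated with a toy sequencer. This is only a sketch: the function sequence_prompt, the delimiter token names, and the coordinate strings are assumptions made for illustration, and the real tokenizer output depends on the configured tokenizer elements.

```python
# Toy illustration only: delimiter names and region contents are placeholders,
# not the output of the real tokenizer.
def sequence_prompt(adj_list, origin, target, path, include_target=True):
    """Concatenate prompt regions in AOTP order; AOP keeps an empty target region."""
    target_contents = list(target) if include_target else []
    return [
        "<ADJLIST_START>", *adj_list, "<ADJLIST_END>",
        "<ORIGIN_START>", *origin, "<ORIGIN_END>",
        "<TARGET_START>", *target_contents, "<TARGET_END>",
        "<PATH_START>", *path, "<PATH_END>",
    ]


adj = ["(0,0)", "<-->", "(0,1)", ";"]
aotp = sequence_prompt(adj, ["(0,0)"], ["(0,1)"], ["(0,0)", "(0,1)"])
aop = sequence_prompt(adj, ["(0,0)"], ["(0,1)"], ["(0,0)", "(0,1)"], include_target=False)
```

In the AOP variant the two target delimiters are adjacent, so the sequence is shorter by exactly the tokens that would have described the target.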

Parameters

PromptSequencers.AOP

(
    *,
    coord_tokenizer: maze_dataset.tokenization.maze_tokenizer.CoordTokenizers._CoordTokenizer = CoordTokenizers.UT(),
    adj_list_tokenizer: maze_dataset.tokenization.maze_tokenizer.AdjListTokenizers._AdjListTokenizer = AdjListTokenizers.AdjListCoord(pre=False, post=True, shuffle_d0=True, edge_grouping=EdgeGroupings.Ungrouped(connection_token_ordinal=1), edge_subset=EdgeSubsets.ConnectionEdges(walls=False), edge_permuter=EdgePermuters.RandomCoords()),
    _type_: Literal["<class 'maze_dataset.tokenization.maze_tokenizer.PromptSequencers.AOP'>"] = "<class 'maze_dataset.tokenization.maze_tokenizer.PromptSequencers.AOP'>",
    path_tokenizer: maze_dataset.tokenization.maze_tokenizer.PathTokenizers._PathTokenizer = PathTokenizers.StepSequence(step_size=StepSizes.Singles(), step_tokenizers=(StepTokenizers.Coord(),), pre=False, intra=False, post=False)
)

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

class MazeTokenizerModular(muutils.json_serialize.serializable_dataclass.SerializableDataclass):

View Source on GitHub

Tokenizer for mazes

Parameters

Development

MazeTokenizerModular

(
    *,
    prompt_sequencer: maze_dataset.tokenization.maze_tokenizer.PromptSequencers._PromptSequencer = PromptSequencers.AOTP(coord_tokenizer=CoordTokenizers.UT(), adj_list_tokenizer=AdjListTokenizers.AdjListCoord(pre=False, post=True, shuffle_d0=True, edge_grouping=EdgeGroupings.Ungrouped(connection_token_ordinal=1), edge_subset=EdgeSubsets.ConnectionEdges(walls=False), edge_permuter=EdgePermuters.RandomCoords()), target_tokenizer=TargetTokenizers.Unlabeled(post=False), path_tokenizer=PathTokenizers.StepSequence(step_size=StepSizes.Singles(), step_tokenizers=(StepTokenizers.Coord(),), pre=False, intra=False, post=False))
)

def hash_int

(self) -> int

View Source on GitHub

def hash_b64

(self, n_bytes: int = 8) -> str

View Source on GitHub

filename-safe base64 encoding of the hash
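As a rough sketch of what a filename-safe base64 encoding of an integer hash can look like, the following uses only the standard library. The helper name int_to_filename_safe_b64 and the choices of big-endian byte order and stripped padding are assumptions for illustration, not necessarily what hash_b64 does.

```python
import base64


def int_to_filename_safe_b64(value: int, n_bytes: int = 8) -> str:
    """Encode the low n_bytes of an int using the URL- and filename-safe alphabet."""
    truncated = value % (1 << (8 * n_bytes))  # keep only n_bytes worth of bits
    raw = truncated.to_bytes(n_bytes, byteorder="big")
    # urlsafe_b64encode uses '-' and '_' in place of '+' and '/'
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


encoded = int_to_filename_safe_b64(2**64 - 1)
```

Avoiding '/' and '+' means the result can be embedded in a file name without escaping.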

View Source on GitHub

def tokenizer_element_tree

(self, abstract: bool = False) -> str

View Source on GitHub

Returns a string representation of the tree of tokenizer elements contained in self.

Parameters

View Source on GitHub

Property wrapper for tokenizer_element_tree so that it can be used in properties_to_serialize.

def tokenizer_element_dict

(self) -> dict

View Source on GitHub

Nested dictionary of the internal TokenizerElements.

View Source on GitHub

Serializes MazeTokenizer into a key for encoding in zanj

def summary

(self) -> dict[str, str]

View Source on GitHub

Single-level dictionary of the internal TokenizerElements.

def has_element

(
    self,
    *elements: Sequence[type[maze_dataset.tokenization.maze_tokenizer._TokenizerElement] | maze_dataset.tokenization.maze_tokenizer._TokenizerElement]
) -> bool

View Source on GitHub

Returns True if the MazeTokenizerModular instance contains ALL of the items specified in elements.

Querying with a partial subset of _TokenizerElement fields is not currently supported. To perform such a query, combine multiple calls to has_element.

Parameters

def is_valid

(self)

View Source on GitHub

Returns True if self is a valid tokenizer. Evaluates the validity of all of self.tokenizer_elements according to each element’s own validity method.

def is_legacy_equivalent

(self) -> bool

View Source on GitHub

Returns True if self has the same stringification behavior as some legacy MazeTokenizer.

def is_tested_tokenizer

(self, do_assert: bool = False) -> bool

View Source on GitHub

Returns True if the tokenizer is among those returned by all_tokenizers.get_all_tokenizers, the set of tested and reliable tokenizers.

Since evaluating all_tokenizers.get_all_tokenizers is expensive, this method instead checks whether self’s hash is a member of get_all_tokenizer_hashes().

If do_assert is True, raises an AssertionError if the tokenizer is not tested.

def is_AOTP

(self) -> bool

View Source on GitHub

def is_UT

(self) -> bool

View Source on GitHub

def from_legacy

(
    cls,
    legacy_maze_tokenizer: maze_dataset.tokenization.maze_tokenizer.MazeTokenizer | maze_dataset.tokenization.maze_tokenizer.TokenizationMode
) -> maze_dataset.tokenization.maze_tokenizer.MazeTokenizerModular

View Source on GitHub

Maps a legacy MazeTokenizer or TokenizationMode to its equivalent MazeTokenizerModular instance.

def from_tokens

(
    cls,
    tokens: str | list[str]
) -> maze_dataset.tokenization.maze_tokenizer.MazeTokenizerModular

View Source on GitHub

Infers most MazeTokenizerModular parameters from a full sequence of tokens.

View Source on GitHub

map from index to token

View Source on GitHub

map from token to index

View Source on GitHub

Number of tokens in the static vocab

View Source on GitHub

View Source on GitHub

def to_tokens

(self, maze: maze_dataset.maze.lattice_maze.LatticeMaze) -> list[str]

View Source on GitHub

Converts maze into a list of tokens.

def coords_to_strings

(
    self,
    coords: list[tuple[int, int] | jaxtyping.Int8[ndarray, 'row_col']]
) -> list[str]

View Source on GitHub

def strings_to_coords

(
    text: str,
    when_noncoord: Literal['except', 'skip', 'include'] = 'skip'
) -> list[str | tuple[int, int]]

View Source on GitHub

def encode

(text: str | list[str]) -> list[int]

View Source on GitHub

encode a string or list of strings into a list of token ids

def decode

(token_ids: Sequence[int], joined_tokens: bool = False) -> list[str] | str

View Source on GitHub

decode a sequence of token ids into a list of strings, or into a single string if joined_tokens is True
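The relationship between the index-to-token map, the token-to-index map, and encode/decode can be sketched with a toy vocabulary. The four tokens below and the space-separated string format are assumptions for illustration; the real vocabulary and splitting rules are defined by the library.

```python
# Hypothetical four-token vocabulary, used only for this illustration.
token_arr = ["<PATH_START>", "(0,0)", "(0,1)", "<PATH_END>"]     # index -> token
tokenizer_map = {tok: idx for idx, tok in enumerate(token_arr)}  # token -> index


def encode(text):
    """Map a space-separated string, or a list of token strings, to token ids."""
    tokens = text.split() if isinstance(text, str) else text
    return [tokenizer_map[tok] for tok in tokens]


def decode(token_ids, joined_tokens=False):
    """Map token ids back to token strings, optionally joined into one string."""
    tokens = [token_arr[i] for i in token_ids]
    return " ".join(tokens) if joined_tokens else tokens


ids = encode("<PATH_START> (0,0) (0,1) <PATH_END>")
```

With this toy setup, decode(encode(s), joined_tokens=True) returns s for any string made of known tokens separated by single spaces.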

def serialize

(self) -> dict[str, typing.Any]

View Source on GitHub

returns the class as a dict, implemented by using @serializable_dataclass decorator

def load

(cls, data: Union[dict[str, Any], ~T]) -> Type[~T]

View Source on GitHub

takes in an appropriately structured dict and returns an instance of the class, implemented by using @serializable_dataclass decorator

def validate_fields_types

(
    self: muutils.json_serialize.serializable_dataclass.SerializableDataclass,
    on_typecheck_error: muutils.errormode.ErrorMode = ErrorMode.Except
) -> bool

View Source on GitHub

validate the types of all the fields on a SerializableDataclass. calls SerializableDataclass__validate_field_type for each field

Inherited Members

def set_tokenizer_hashes_path

(path: pathlib.Path)

View Source on GitHub

set path to tokenizer hashes, and reload the hashes if needed

the hashes are expected to be stored in and read from _TOKENIZER_HASHES_PATH, which by default is Path(__file__).parent / "MazeTokenizerModular_hashes.npz", i.e. a file in this module’s directory.

However, this might not always work, so we provide a way to change this.

def get_all_tokenizer_hashes

() -> jaxtyping.Int64[ndarray, 'n_tokenizers']

View Source on GitHub

maze_dataset.tokenization.save_hashes

generate and save the hashes of all supported tokenizers

calls maze_dataset.tokenization.all_tokenizers.save_hashes()

Usage:

To save to the default location (inside package, maze_dataset/tokenization/MazeTokenizerModular_hashes.npy):

python -m maze_dataset.tokenization.save_hashes

to save to a custom location:

python -m maze_dataset.tokenization.save_hashes /path/to/save/to.npy

to check hashes shipped with the package:

python -m maze_dataset.tokenization.save_hashes --check

View Source on GitHub


maze_dataset.utils

misc utilities for the maze_dataset package

View Source on GitHub

def bool_array_from_string

(
    string: str,
    shape: list[int],
    true_symbol: str = 'T'
) -> jaxtyping.Bool[ndarray, '*shape']

View Source on GitHub

Transform a string into an ndarray of bools.

Parameters

string: str
    The string representation of the array
shape: list[int]
    The shape of the resulting array
true_symbol: str
    The character to parse as True. Whitespace will be removed. All other characters will be parsed as False.

Returns

np.ndarray
    An ndarray with dtype bool of shape shape

Examples

>>> bool_array_from_string(
...     "TT TF", shape=[2,2]
... )
array([[ True,  True],
       [ True, False]])

def corner_first_ndindex

(n: int, ndim: int = 2) -> list[tuple]

View Source on GitHub

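returns a list of index tuples, sorted by distance from the corner

with this ordering, corner_first_ndindex(n) is equal to the first n^2 elements of corner_first_ndindex(n+1) (for the default ndim=2), which does not hold in general for np.ndindex((n, n)) and np.ndindex((n+1, n+1))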

>>> corner_first_ndindex(1)
[(0, 0)]
>>> corner_first_ndindex(2)
[(0, 0), (0, 1), (1, 0), (1, 1)]
>>> corner_first_ndindex(3)
[(0, 0), (0, 1), (1, 0), (1, 1), (0, 2), (2, 0), (1, 2), (2, 1), (2, 2)]
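One way to obtain an ordering with this prefix property is to sort indices primarily by their largest coordinate. The sketch below is not the library's implementation: the function name corner_first_sketch and the tie-breaking rule are assumptions, chosen so that the output agrees with the three examples above; it may differ from corner_first_ndindex for larger n.

```python
import itertools


def corner_first_sketch(n: int, ndim: int = 2) -> list[tuple]:
    """Order lattice indices so that each smaller grid is a prefix of the next larger one."""
    indices = itertools.product(range(n), repeat=ndim)
    # Primary key: the largest coordinate, i.e. the smallest grid containing the index.
    # Secondary keys (coordinate sum, then the tuple itself) are an arbitrary tie-break.
    return sorted(indices, key=lambda idx: (max(idx), sum(idx), idx))


small = corner_first_sketch(2)
large = corner_first_sketch(3)
```

Because every index of the smaller grid has a strictly smaller primary key than any index added by the larger grid, the first n^2 entries of the larger result reproduce the smaller result.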

def manhattan_distance

(
    edges: jaxtyping.Int[ndarray, 'edges coord=2 row_col=2'] | jaxtyping.Int[ndarray, 'coord=2 row_col=2']
) -> jaxtyping.Int[ndarray, 'edges'] | jaxtyping.Int[ndarray, '']

View Source on GitHub

Returns the Manhattan distance between the two coords of an edge. Accepts either a single edge or an array of edges, returning one distance per edge.
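For reference, the computation for a single edge can be sketched in plain Python. The helper name manhattan_distance_sketch and the use of nested lists instead of NumPy arrays are choices made for this illustration only.

```python
def manhattan_distance_sketch(edge):
    """Manhattan distance between the two (row, col) coords of one edge."""
    (row_a, col_a), (row_b, col_b) = edge
    return abs(row_a - row_b) + abs(col_a - col_b)


edges = [[[0, 0], [0, 1]], [[1, 1], [3, 0]]]
distances = [manhattan_distance_sketch(edge) for edge in edges]
```

Adjacent lattice cells always have distance 1, so a larger value indicates that the two coords are not direct neighbours.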

def lattice_max_degrees

(n: int) -> jaxtyping.Int8[ndarray, 'row col']

View Source on GitHub

Returns an array with the maximum possible degree for each coord.
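A minimal sketch of the idea, assuming 4-connectivity (each cell can connect only to the cells directly above, below, left, and right of it). The function name max_degrees_sketch and the nested-list return type are illustrative; the library function returns a NumPy array.

```python
def max_degrees_sketch(n: int) -> list[list[int]]:
    """Maximum number of lattice neighbours for each cell of an n x n grid."""

    def axis_neighbours(i: int) -> int:
        # along one axis, a cell has a neighbour on each side unless it is on the border
        return int(i > 0) + int(i < n - 1)

    return [[axis_neighbours(r) + axis_neighbours(c) for c in range(n)] for r in range(n)]


degrees = max_degrees_sketch(3)
```

Corners get 2, border cells get 3, and interior cells get 4.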

def lattice_connection_array

(
    n: int
) -> jaxtyping.Int8[ndarray, 'edges=2*n*(n-1) leading_trailing_coord=2 row_col=2']

View Source on GitHub

Returns a 3D NumPy array containing all the edges in a 2D square lattice of size n x n. Thanks Claude.

Parameters

n: int
    The side length of the square lattice.

Returns

np.ndarray: A 3D NumPy array of shape (2*n*(n-1), 2, 2) containing the coordinates of the edges in the 2D square lattice. In each pair, the coord with the smaller sum always comes first.
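The edge count and the ordering within each pair can be checked with a plain-Python sketch. The name lattice_edges_sketch is hypothetical, and the order in which edges are listed here is an assumption that may differ from the array returned by lattice_connection_array.

```python
def lattice_edges_sketch(n: int) -> list[list[list[int]]]:
    """All edges of an n x n square lattice as [[row, col], [row, col]] pairs."""
    edges = []
    for row in range(n):
        for col in range(n):
            if col + 1 < n:  # horizontal edge to the right-hand neighbour
                edges.append([[row, col], [row, col + 1]])
            if row + 1 < n:  # vertical edge to the neighbour below
                edges.append([[row, col], [row + 1, col]])
    return edges


edges_3 = lattice_edges_sketch(3)
```

There are n*(n-1) horizontal and n*(n-1) vertical edges, which gives the 2*n*(n-1) total in the return annotation.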

def adj_list_to_nested_set

(adj_list: list) -> set

View Source on GitHub

Used for comparison of adj_lists

Adj_list looks like [[[0, 1], [1, 1]], [[0, 0], [0, 1]], …]. We don’t care about the order of coordinate pairs within the adj_list, or the order of coordinates within each coordinate pair.
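An order-insensitive comparison of this kind can be sketched with a set of frozensets. The helper name nested_set_sketch is hypothetical, and the exact nested structure returned by adj_list_to_nested_set may differ.

```python
def nested_set_sketch(adj_list):
    """Order-insensitive form of an adjacency list, suitable for equality comparison."""
    return {frozenset(tuple(coord) for coord in edge) for edge in adj_list}


a = [[[0, 1], [1, 1]], [[0, 0], [0, 1]]]
b = [[[0, 1], [0, 0]], [[1, 1], [0, 1]]]  # same edges, listed in a different order
same = nested_set_sketch(a) == nested_set_sketch(b)
```

The coordinates are converted to tuples because lists are not hashable and so cannot be set members.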

FiniteValued

The details of this type are not possible to fully define via the Python 3.10 typing library. This custom generic type is a generic domain of many types which have a finite, discrete, and well-defined range space. FiniteValued defines the domain of supported types for the all_instances function, since that function relies heavily on static typing. These types may be nested in an arbitrarily deep tree via Container Types and Superclass Types (see below). The leaves of the tree must always be Primitive Types.

FiniteValued Subtypes

*: Indicates that this subtype is not yet supported by all_instances

Non-FiniteValued (Unbounded) Types

These are NOT valid subtypes, and are listed for illustrative purposes only. This list is not comprehensive. While the finite and discrete nature of digital computers means that the cardinality of these types is technically finite, they are considered unbounded types in this context.
- No Container subtype may contain any of these unbounded subtypes.
- int
- float
- str
- list
- set: Set types without a FiniteValued argument are unbounded
- tuple: Tuple types without a fixed length are unbounded

Primitive Types

Primitive types are non-nested types which resolve directly to a concrete range of values
- bool: has 2 possible values
- *enum.Enum: The range of a concrete Enum subclass is its set of enum members
- typing.Literal: Every type constructed using Literal has a finite set of possible literal values in its definition. This is the preferred way to include limited ranges of non-FiniteValued types such as int or str in a FiniteValued hierarchy.

Container Types

Container types are types which contain zero or more fields of FiniteValued type. The range of a container type is the cartesian product of their field types, except for set[FiniteValued].
- tuple[FiniteValued]: Tuples of fixed length whose elements are each FiniteValued.
- IsDataclass: Concrete dataclasses whose fields are FiniteValued.
- Standard concrete class: Regular classes could be supported just like dataclasses if all their data members are FiniteValued-typed.
- set[FiniteValued]: Sets of fixed length of a FiniteValued type.

Superclass Types

Superclass types don’t directly contain data members like container types. Their range is the union of the ranges of their subtypes.
- Abstract dataclasses: Abstract dataclasses whose subclasses are all FiniteValued superclass or container types
- IsDataclass: Concrete dataclasses which also have their own subclasses.
- Standard abstract classes: Abstract dataclasses whose subclasses are all FiniteValued superclass or container types
- UnionType: Any union of FiniteValued types, e.g., bool | Literal[2, 3]

def all_instances

(
    type_: ~FiniteValued,
    validation_funcs: dict[~FiniteValued, typing.Callable[[~FiniteValued], bool]] | None = None
) -> Generator[~FiniteValued, NoneType, NoneType]

View Source on GitHub

Returns all possible values of an instance of type_ if finite instances exist. Uses type hinting to construct the possible values. All nested elements of type_ must themselves be typed. Do not use with types whose members contain circular references. Function is susceptible to infinite recursion if type_ is a dataclass whose member tree includes another instance of type_.

Parameters

Supported type_ Values

See docstring on FiniteValued for full details. type_ may be:
- FiniteValued
- A finite-valued, fixed-length Generic tuple type. E.g., tuple[bool], tuple[bool, MyEnum] are OK. tuple[bool, ...] is NOT supported, since the length of the tuple is not fixed.
- Nested versions of any of the types in this list
- A UnionType of any of the types in this list

validation_funcs Details