Migrate our core representation to the typed choice sequence (original) (raw)

This epic-style issue tracks our work on migrating the internals of Hypothesis from a bytestring to the typed choice sequence.

This is a huge change that touches every part of Hypothesis. The short version is that we have encountered several shortcomings of Hypothesis internals over time, which have limited the efficiency of input generation and shrinking, and made alternative backends infeasible. We expect this migration to give moderate efficiency gains and pave a clear path forward to future integration with alternative backends or other powerful features which are currently intractable.

Bytestring

Hypothesis currently works at the level of a bytestring (a sequence of bytes).

But, the bytestring has its limitations.

These limitations extend equally to alternative backends over the bytestring. e.g. for CrossHair (#3086) integration, bytes is simply too low level to get an efficient SMT algorithm out of.

Typed Choice Sequence

Enter the typed choice sequence, which replaces the bytestring. We lift up the representation from bytes to a slightly higher representation at the level of five types: boolean, integer, float, string, and bytes.

class PrimitiveProvider(abc.ABC): @abc.abstractmethod def draw_boolean( self, p: float = 0.5, ) -> bool: ...

@abc.abstractmethod
def draw_integer(
    self,
    min_value: int | None = None,
    max_value: int | None = None,
    *,
    # weights are for choosing an element index from a bounded range
    weights: Sequence[float] | None = None,
    shrink_towards: int = 0,
) -> int:
    ...

@abc.abstractmethod
def draw_float(
    self,
    *,
    min_value: float = -math.inf,
    max_value: float = math.inf,
    allow_nan: bool = True,
    smallest_nonzero_magnitude: float,
) -> float:
    ...

@abc.abstractmethod
def draw_string(
    self,
    intervals: IntervalSet,
    *,
    min_size: int = 0,
    max_size: int | None = None,
) -> str:
    ...

@abc.abstractmethod
def draw_bytes(
    self,
    min_size: int = 0,
    max_size: int | None = None,
) -> bytes:
    ...

This improves redundancy (as DataTree operates at this higher level) and precision (as we retain type and shape information about what was previously spans of the bytestring). The end goal is rebuilding input generation, shrinking, targeted pbt, the explain phase, the database, etc on the typed choice sequence, and we expect to see performance gains in each (except the database).

Roadmap