regex_automata::nfa::thompson

Enum State

Source
pub enum State {
    ByteRange {
        trans: Transition,
    },
    Sparse(SparseTransitions),
    Dense(DenseTransitions),
    Look {
        look: Look,
        next: StateID,
    },
    Union {
        alternates: Box<[StateID]>,
    },
    BinaryUnion {
        alt1: StateID,
        alt2: StateID,
    },
    Capture {
        next: StateID,
        pattern_id: PatternID,
        group_index: SmallIndex,
        slot: SmallIndex,
    },
    Fail,
    Match {
        pattern_id: PatternID,
    },
}
Available on crate feature nfa-thompson only.
Expand description

A state in an NFA.

In theory, it can help to conceptualize an NFA as a graph consisting of States. Each State contains its complete set of outgoing transitions.

In practice, it can help to conceptualize an NFA as a sequence of instructions for a virtual machine. Each State says what to do and where to go next.

Strictly speaking, the practical interpretation is the most correct one, because of the Capture state. Namely, a Capture state always forwards execution to another state unconditionally. Its only purpose is to cause a side effect: the recording of the current input position at a particular location in memory. In this sense, an NFA has more power than a theoretical non-deterministic finite automaton.

For most uses of this crate, it is likely that one may never even need to be aware of this type at all. The main use cases for looking at States directly are if you need to write your own search implementation or if you need to do some kind of analysis on the NFA.

Variants§

§

ByteRange

A state with a single transition that can only be taken if the current input symbol is in a particular range of bytes.

Fields

§trans: Transition

The transition from this state to the next.

§

Sparse(SparseTransitions)

A state with possibly many transitions represented in a sparse fashion. Transitions are non-overlapping and ordered lexicographically by input range.

In practice, this is used for encoding UTF-8 automata. Its presence is primarily an optimization that avoids many additional unconditional epsilon transitions (via Union states), and thus decreases the overhead of traversing the NFA. This can improve both matching time and DFA construction time.

§

Dense(DenseTransitions)

A dense representation of a state with multiple transitions.

§

Look

A conditional epsilon transition satisfied via some sort of look-around. Look-around is limited to anchor and word boundary assertions.

Look-around states are meant to be evaluated while performing epsilon closure (computing the set of states reachable from a particular state via only epsilon transitions). If the current position in the haystack satisfies the look-around assertion, then you’re permitted to follow that epsilon transition.

Fields

§look: Look

The look-around assertion that must be satisfied before moving to next.

§next: StateID

The state to transition to if the look-around assertion is satisfied.

§

Union

An alternation such that there exists an epsilon transition to all states in alternates, where matches found via earlier transitions are preferred over later transitions.

Fields

§alternates: Box<[StateID]>

An ordered sequence of unconditional epsilon transitions to other states. Transitions earlier in the sequence are preferred over transitions later in the sequence.

§

BinaryUnion

An alternation such that there exists precisely two unconditional epsilon transitions, where matches found via alt1 are preferred over matches found via alt2.

This state exists as a common special case of Union where there are only two alternates. In this case, we don’t need any allocations to represent the state. This saves a bit of memory and also saves an additional memory access when traversing the NFA.

Fields

§alt1: StateID

An unconditional epsilon transition to another NFA state. This is preferred over alt2.

§alt2: StateID

An unconditional epsilon transition to another NFA state. Matches reported via this transition should only be reported if no matches were found by following alt1.

§

Capture

An empty state that records a capture location.

From the perspective of finite automata, this is precisely equivalent to an unconditional epsilon transition, but serves the purpose of instructing NFA simulations to record additional state when the finite state machine passes through this epsilon transition.

slot in this context refers to the specific capture group slot offset that is being recorded. Each capturing group has two slots corresponding to the start and end of the matching portion of that group.

The pattern ID and capture group index are also included in this state in case they are useful. But mostly, all you’ll need is next and slot.

Fields

§next: StateID

The state to transition to, unconditionally.

§pattern_id: PatternID

The pattern ID that this capture belongs to.

§group_index: SmallIndex

The capture group index that this capture belongs to. Capture group indices are local to each pattern. For example, when capturing groups are enabled, every pattern has a capture group at index 0.

§slot: SmallIndex

The slot index for this capture. Every capturing group has two slots: one for the start haystack offset and one for the end haystack offset. Unlike capture group indices, slot indices are global across all patterns in this NFA. That is, each slot belongs to a single pattern, but there is only one slot at index i.

§

Fail

A state that cannot be transitioned out of. This is useful for cases where you want to prevent matching from occurring. For example, if your regex parser permits empty character classes, then one could choose a Fail state to represent them. (An empty character class can be thought of as an empty set. Since nothing is in an empty set, they can never match anything.)

§

Match

A match state. There is at least one such occurrence of this state for each regex that can match that is in this NFA.

Fields

§pattern_id: PatternID

The matching pattern ID.

Implementations§

Source§

impl State

Source

pub fn is_epsilon(&self) -> bool

Returns true if and only if this state contains one or more epsilon transitions.

In practice, a state has no outgoing transitions (like Match), has only non-epsilon transitions (like ByteRange) or has only epsilon transitions (like Union).

§Example
use regex_automata::{
    nfa::thompson::{State, Transition},
    util::primitives::{PatternID, StateID, SmallIndex},
};

// Capture states are epsilon transitions.
let state = State::Capture {
    next: StateID::ZERO,
    pattern_id: PatternID::ZERO,
    group_index: SmallIndex::ZERO,
    slot: SmallIndex::ZERO,
};
assert!(state.is_epsilon());

// ByteRange states are not.
let state = State::ByteRange {
    trans: Transition { start: b'a', end: b'z', next: StateID::ZERO },
};
assert!(!state.is_epsilon());

Trait Implementations§

Source§

impl Clone for State

Source§

fn clone(&self) -> State

Returns a copy of the value. Read more
1.0.0 · Source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source. Read more
Source§

impl Debug for State

Source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more
Source§

impl PartialEq for State

Source§

fn eq(&self, other: &State) -> bool

Tests for self and other values to be equal, and is used by ==.
1.0.0 · Source§

fn ne(&self, other: &Rhs) -> bool

Tests for !=. The default implementation is almost always sufficient, and should not be overridden without very good reason.
Source§

impl Eq for State

Source§

impl StructuralPartialEq for State

Auto Trait Implementations§

§

impl Freeze for State

§

impl RefUnwindSafe for State

§

impl Send for State

§

impl Sync for State

§

impl Unpin for State

§

impl UnwindSafe for State

Blanket Implementations§

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
Source§

impl<T> CloneToUninit for T
where T: Clone,

Source§

unsafe fn clone_to_uninit(&self, dst: *mut u8)

🔬This is a nightly-only experimental API. (clone_to_uninit)
Performs copy-assignment from self to dst. Read more
Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T> ToOwned for T
where T: Clone,

Source§

type Owned = T

The resulting type after obtaining ownership.
Source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning. Read more
Source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning. Read more
Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

Layout§

Note: Most layout information is completely unstable and may even differ between compilations. The only exception is types with certain repr(...) attributes. Please see the Rust Reference's “Type Layout” chapter for details on type layout guarantees.

Size: 24 bytes

Size for each variant:

  • ByteRange: 8 bytes
  • Sparse: 20 bytes
  • Dense: 20 bytes
  • Look: 8 bytes
  • Union: 20 bytes
  • BinaryUnion: 8 bytes
  • Capture: 16 bytes
  • Fail: 0 bytes
  • Match: 4 bytes