regex_syntax::hir

Enum Class

pub enum Class {
    Unicode(ClassUnicode),
    Bytes(ClassBytes),
}

The high-level intermediate representation of a character class.

A character class corresponds to a set of characters. A character is either defined by a Unicode scalar value or a byte.

A character class, regardless of its character type, is represented by a sequence of non-overlapping non-adjacent ranges of characters.
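The "non-overlapping non-adjacent" invariant can be pictured with a small standalone sketch. This is a toy model only: the `canonicalize` function and the use of `(u32, u32)` pairs for inclusive ranges are assumptions made for illustration, not the crate's own types or algorithm. It also assumes each input pair has start <= end.

```rust
/// Toy model (not the crate's implementation): sort inclusive ranges of
/// code points and merge any that overlap or sit directly next to each other.
fn canonicalize(mut ranges: Vec<(u32, u32)>) -> Vec<(u32, u32)> {
    ranges.sort();
    let mut out: Vec<(u32, u32)> = Vec::new();
    for (start, end) in ranges {
        if let Some(last) = out.last_mut() {
            // Overlapping (start <= last.1) or adjacent (start == last.1 + 1):
            // extend the previous range instead of adding a new one.
            if start <= last.1.saturating_add(1) {
                last.1 = last.1.max(end);
                continue;
            }
        }
        out.push((start, end));
    }
    out
}
```

With this model, the ranges a-c and d-f collapse into the single range a-f because they are adjacent, while 0-9 stays separate.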

There are no guarantees about which class variant is used. Generally speaking, the Unicode variant is used whenever a class needs to contain non-ASCII Unicode scalar values. But the Unicode variant can be used even when Unicode mode is disabled. For example, at the time of writing, the regex (?-u:a|\xc2\xa0) will compile down to HIR for the Unicode class [a\u00A0] due to optimizations.

Note that the Bytes variant may be produced even when it exclusively matches valid UTF-8. This is because a Bytes variant represents an intention by the author of the regular expression to disable Unicode mode, which in turn impacts the semantics of case-insensitive matching. For example, (?i)k and (?i-u)k will not match the same set of strings.
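The difference for k comes from the Kelvin sign (U+212A), which Unicode treats as a case variant of k. The sketch below illustrates this using only the standard library. It is an approximation: the helper names are hypothetical, and `char::to_lowercase` applies std's lowercase mapping rather than the simple case folding tables this crate uses.

```rust
/// Unicode-aware comparison: lowercase both characters with std's Unicode
/// lowercase mapping before comparing. (Hypothetical helper for illustration.)
fn eq_unicode_caseless(a: char, b: char) -> bool {
    a.to_lowercase().eq(b.to_lowercase())
}

/// ASCII-only comparison: only A-Z and a-z are treated as case variants.
/// (Hypothetical helper for illustration.)
fn eq_ascii_caseless(a: char, b: char) -> bool {
    a.eq_ignore_ascii_case(&b)
}
```

Under the Unicode-aware comparison, 'k' and '\u{212A}' are equal. Under the ASCII-only comparison they are not, which mirrors why the two regexes accept different strings.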

Variants§

§

Unicode(ClassUnicode)

A set of characters represented by Unicode scalar values.

§

Bytes(ClassBytes)

A set of characters represented by arbitrary bytes (one byte per character).

Implementations§

Source§

impl Class

Source

pub fn case_fold_simple(&mut self)

Apply Unicode simple case folding to this character class, in place. The character class will be expanded to include all simple case folded character variants.

If this is a byte oriented character class, then this will be limited to the ASCII ranges A-Z and a-z.
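As a rough picture of the byte-oriented behavior, the following sketch folds a toy class stored as a 256-entry membership table. The table representation and the `ascii_case_fold` name are assumptions for illustration; the crate stores ranges instead.

```rust
/// Toy model of a byte-oriented class: `set[b]` is true when byte `b` is a
/// member. Folding only ever touches the ASCII letters A-Z and a-z.
fn ascii_case_fold(set: &mut [bool; 256]) {
    for upper in b'A'..=b'Z' {
        let lower = upper.to_ascii_lowercase();
        // If either case is present, make sure both are.
        let present = set[upper as usize] || set[lower as usize];
        set[upper as usize] = present;
        set[lower as usize] = present;
    }
}
```

In this model, a class containing the bytes k and 0xE9 gains K after folding, but 0xC9 is not added, because bytes outside the ASCII letter ranges are left alone.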

§Panics

This routine panics when the case mapping data necessary for this routine to complete is unavailable. This occurs when the unicode-case feature is not enabled and the underlying class is Unicode oriented.

Callers should prefer using try_case_fold_simple instead, which will return an error instead of panicking.

Source

pub fn try_case_fold_simple(&mut self) -> Result<(), CaseFoldError>

Apply Unicode simple case folding to this character class, in place. The character class will be expanded to include all simple case folded character variants.

If this is a byte oriented character class, then this will be limited to the ASCII ranges A-Z and a-z.

§Error

This routine returns an error when the case mapping data necessary for this routine to complete is unavailable. This occurs when the unicode-case feature is not enabled and the underlying class is Unicode oriented.

Source

pub fn negate(&mut self)

Negate this character class in place.

After completion, this character class will contain precisely the characters that weren’t previously in the class.
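Negation over a range-based representation can be sketched as follows. This is a toy model over inclusive byte ranges, assuming the input is already sorted, non-overlapping and non-adjacent. The function returns a new list rather than mutating in place, and it is not the crate's implementation.

```rust
/// Toy model: complement a sorted, non-overlapping, non-adjacent list of
/// inclusive byte ranges over the full domain 0x00..=0xFF.
fn negate(ranges: &[(u8, u8)]) -> Vec<(u8, u8)> {
    let mut out = Vec::new();
    // The next byte not yet accounted for; None once we have passed 0xFF.
    let mut next: Option<u8> = Some(0);
    for &(start, end) in ranges {
        if let Some(n) = next {
            // Emit the gap before this range, if there is one.
            if n < start {
                out.push((n, start - 1));
            }
        }
        next = end.checked_add(1);
    }
    // Emit whatever remains after the last range.
    if let Some(n) = next {
        out.push((n, u8::MAX));
    }
    out
}
```

Negating twice gives back the original ranges, and negating the empty class yields the single range covering every byte.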

Source

pub fn is_utf8(&self) -> bool

Returns true if and only if this character class will only ever match valid UTF-8.

A character class can match invalid UTF-8 only when both of the following conditions are met:

  1. The translator was configured to permit generating an expression that can match invalid UTF-8. (By default, this is disabled.)
  2. Unicode mode (via the u flag) was disabled either in the concrete syntax or in the parser builder. By default, Unicode mode is enabled.
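One way to picture the check for a byte-oriented class: each element matches exactly one byte, and a single byte is valid UTF-8 on its own only when it is ASCII. The sketch below encodes that reasoning. It is an assumption made for illustration, with a hypothetical function name, and the text above does not confirm it as the crate's actual check.

```rust
/// Toy model: a byte class given as inclusive ranges can only match valid
/// UTF-8 if every member byte is ASCII (0x00..=0x7F).
fn bytes_class_is_utf8(ranges: &[(u8, u8)]) -> bool {
    ranges.iter().all(|&(_, end)| end <= 0x7F)
}
```

In this model, an empty class trivially passes the check, since it never matches anything.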

pub fn minimum_len(&self) -> Option<usize>

Returns the length, in bytes, of the smallest string matched by this character class.

For non-empty byte oriented classes, this always returns 1. For non-empty Unicode oriented classes, this can return 1, 2, 3 or 4. For empty classes, None is returned. It is impossible for 0 to be returned.

§Example

This example shows some regexes and their corresponding minimum length, if any. It queries the properties of a parsed Hir rather than calling this method on a Class directly.

use regex_syntax::parse;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // The empty string has a min length of 0.
    let hir = parse(r"")?;
    assert_eq!(Some(0), hir.properties().minimum_len());
    // As do other types of regexes that only match the empty string.
    let hir = parse(r"^$\b\B")?;
    assert_eq!(Some(0), hir.properties().minimum_len());
    // A regex that can match the empty string but can also match more
    // still has a minimum of 0.
    let hir = parse(r"a*")?;
    assert_eq!(Some(0), hir.properties().minimum_len());
    // A regex that matches nothing has no minimum defined.
    let hir = parse(r"[a&&b]")?;
    assert_eq!(None, hir.properties().minimum_len());
    // Character classes usually have a minimum length of 1.
    let hir = parse(r"\w")?;
    assert_eq!(Some(1), hir.properties().minimum_len());
    // But sometimes Unicode classes might be bigger!
    let hir = parse(r"\p{Cyrillic}")?;
    assert_eq!(Some(2), hir.properties().minimum_len());
    Ok(())
}

pub fn maximum_len(&self) -> Option<usize>

Returns the length, in bytes, of the longest string matched by this character class.

For non-empty byte oriented classes, this always returns 1. For non-empty Unicode oriented classes, this can return 1, 2, 3 or 4. For empty classes, None is returned. It is impossible for 0 to be returned.

§Example

This example shows some regexes and their corresponding maximum length, if any. It queries the properties of a parsed Hir rather than calling this method on a Class directly.

use regex_syntax::parse;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // The empty string has a max length of 0.
    let hir = parse(r"")?;
    assert_eq!(Some(0), hir.properties().maximum_len());
    // As do other types of regexes that only match the empty string.
    let hir = parse(r"^$\b\B")?;
    assert_eq!(Some(0), hir.properties().maximum_len());
    // A regex that matches nothing has no maximum defined.
    let hir = parse(r"[a&&b]")?;
    assert_eq!(None, hir.properties().maximum_len());
    // Bounded repeats work as you expect.
    let hir = parse(r"x{2,10}")?;
    assert_eq!(Some(10), hir.properties().maximum_len());
    // An unbounded repeat means there is no maximum.
    let hir = parse(r"x{2,}")?;
    assert_eq!(None, hir.properties().maximum_len());
    // With Unicode enabled, \w can match up to 4 bytes!
    let hir = parse(r"\w")?;
    assert_eq!(Some(4), hir.properties().maximum_len());
    // Without Unicode enabled, \w matches at most 1 byte.
    let hir = parse(r"(?-u)\w")?;
    assert_eq!(Some(1), hir.properties().maximum_len());
    Ok(())
}
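
The 1 to 4 byte bounds described for minimum_len and maximum_len on Unicode oriented classes follow from UTF-8 encoding lengths. The sketch below computes both bounds for a toy class given as inclusive ranges of char. The `utf8_len_bounds` name and the tuple representation are assumptions for illustration, not the crate's API.

```rust
/// Toy model: smallest and largest UTF-8 encoded length, in bytes, over a
/// class given as inclusive ranges of chars. Returns None for an empty class.
fn utf8_len_bounds(ranges: &[(char, char)]) -> Option<(usize, usize)> {
    // UTF-8 length never decreases as the scalar value grows, so only the
    // start of each range matters for the minimum and only the end of each
    // range matters for the maximum.
    let min = ranges.iter().map(|&(start, _)| start.len_utf8()).min()?;
    let max = ranges.iter().map(|&(_, end)| end.len_utf8()).max()?;
    Some((min, max))
}
```

For example, the range U+0400 to U+04FF has both bounds equal to 2, while a class that mixes a-z with code points at or above U+10000 has bounds 1 and 4.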

pub fn is_empty(&self) -> bool

Returns true if and only if this character class is empty. That is, it has no elements.

An empty character class can never match anything, including an empty string.

Source

pub fn literal(&self) -> Option<Vec<u8>>

If this class consists of exactly one element (whether a codepoint or a byte), then return it as a literal byte string.

If this class is empty or contains more than one element, then None is returned.
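The rule can be sketched for the Unicode case with a toy class of inclusive char ranges. The free function below is a hypothetical stand-in, not the method itself. It shows that a single-element class yields that element's UTF-8 encoding and that anything else yields None.

```rust
/// Toy model: if the class holds exactly one scalar value, return its UTF-8
/// encoding as a byte string. Otherwise return None.
fn literal(ranges: &[(char, char)]) -> Option<Vec<u8>> {
    match ranges {
        // Exactly one range that covers exactly one scalar value.
        [(start, end)] if start == end => Some(start.to_string().into_bytes()),
        // Empty class, several ranges, or one range with several elements.
        _ => None,
    }
}
```

A class containing only U+2603 produces the three bytes E2 98 83, whereas the range a-b produces None because it has two elements.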

Trait Implementations§

Source§

impl Clone for Class

Source§

fn clone(&self) -> Class

Returns a copy of the value.
1.0.0 · Source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.
Source§

impl Debug for Class

Source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.
Source§

impl PartialEq for Class

Source§

fn eq(&self, other: &Class) -> bool

Tests for self and other values to be equal, and is used by ==.
1.0.0 · Source§

fn ne(&self, other: &Rhs) -> bool

Tests for !=. The default implementation is almost always sufficient, and should not be overridden without very good reason.
Source§

impl Eq for Class

Source§

impl StructuralPartialEq for Class

Auto Trait Implementations§

§

impl Freeze for Class

§

impl RefUnwindSafe for Class

§

impl Send for Class

§

impl Sync for Class

§

impl Unpin for Class

§

impl UnwindSafe for Class

Blanket Implementations§

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self.
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value.
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.
Source§

impl<T> CloneToUninit for T
where T: Clone,

Source§

unsafe fn clone_to_uninit(&self, dst: *mut u8)

🔬This is a nightly-only experimental API. (clone_to_uninit)
Performs copy-assignment from self to dst.
Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T> ToOwned for T
where T: Clone,

Source§

type Owned = T

The resulting type after obtaining ownership.
Source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning.
Source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning.
Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

Layout§

Note: Most layout information is completely unstable and may even differ between compilations. The only exception is types with certain repr(...) attributes. Please see the Rust Reference's “Type Layout” chapter for details on type layout guarantees.

Size: 40 bytes

Size for each variant:

  • Unicode: 32 bytes
  • Bytes: 32 bytes