The meaning of regular expressions

Next: Translating regular expressions to Up: Regular expressions Previous: What are regular expressions?

The meaning of regular expressions

We now know what regular expressions are but what do they mean?

For this purpose, we shall first define an operation on languages called the Kleene star. Given a language $L\subseteq \Sigma^*$ we define

$\displaystyle L^* = \{ w_0w_1\dots w_{n-1} \mid n\in\mathbb{N}\wedge \forall i<n. w_i \in L \}$

Intuitively,

contains all the words which can be formed by concatenating an arbitrary number of words in

. This includes the empty word since the number may be 0.

As an example consider $L=\{\textrm{a},\textrm{ab}\}\subseteq \{\textrm{a},\textrm{b}\}^*$ :

$\displaystyle L^* = \{\epsilon,\textrm{a},\textrm{ab},\textrm{aab},\textrm{aba},\textrm{aaab},\textrm{aaba},\dots\}$

You should notice that we use the same symbol as in $\Sigma^*$ but there is a subtle difference: $\Sigma$ is a set of symbols but $L\subseteq \Sigma^*$ is a set of words.

Alternatively (and more abstractly) one may describe as the least language (wrt $\subseteq$ ) which contains and the empty word and is closed under concatenation:

$\displaystyle w \in L^* \wedge v\in L^* \implies wv \in L^*$

We now define the semantics of regular expressions: To each regular expression over $\Sigma$ we assign a language $L(E)\subseteq \Sigma^*$ . We do this by induction over the definition of the syntax:

$L(\emptyset) = \emptyset$
$L(\epsilon) = \{\epsilon\}$
$L(\textbf{x}) = \{ x \}$
where $x\in\Sigma$ .
$L(E+F) = L(E)\cup L(F)$
$L(EF) = \{ wv \mid w\in L(E) \wedge v\in L(F) \}$
$L(\textbf{(}E\textbf{)})=L(E)$

Subtle points: in 1. the symbol $\emptyset$ may be used as a regular expression (as in $L(\emptyset)$ ) or the empty set ( $\emptyset=\{\}$ ). Similarily, $\epsilon$ in 2. may be a regular expression or a word, in 6.

may be used to construct regular expressions or it is an operation on languages. Which alternative we mean becomes only clear from the context, there is no generally agreed mathematical notation ¹to make this difference explicit.

Let us now calculate what the examples of regular expressions from the previous section mean, i.e. what are the langauges they define:

$\epsilon$

$\displaystyle L(\epsilon) = \{\epsilon\}$

By 2.

$\textrm{hallo}$

Let's just look at $L(\textrm{ha})$ . We know from 3:

$\displaystyle L(\textrm{h})$	$\displaystyle = \{ \textrm{h}\}$
$\displaystyle L(\textrm{a})$	$\displaystyle = \{ \textrm{a}\}$

Hence by 5:

$\displaystyle L(\textrm{ha})$	$\displaystyle = \{ wv \mid w\in\{\textrm{h}\} \wedge v\in \{\textrm{a}\}\}$
$\displaystyle$	$\displaystyle = \{ \textrm{ha} \}$

Continuing the same reasoning we obtain:

$\displaystyle L(\textrm{hallo}) = \{\textrm{hallo}\}$

$\textrm{hallo}+\textrm{hello}$

From the previous point we know that:

$\displaystyle L(\textrm{hallo})$	$\displaystyle = \{\textrm{hallo}\}$
$\displaystyle L(\textrm{hello})$	$\displaystyle = \{\textrm{hello}\}$
$\displaystyle$

Hence by using 4 we get:

$\displaystyle L(\textrm{hallo}+\textrm{hello})$	$\displaystyle = \{\textrm{hallo}\}\cup \{\textrm{hello}\}\}$
$\displaystyle$	$\displaystyle = \{\textrm{hallo},\textrm{hello}\}$

$\textrm{h}(\textrm{a}+\textrm{e})\textrm{llo}$

Using 3 and 4 we know

$\displaystyle L(\textrm{a}+\textrm{e})$

$\displaystyle = \{ \textrm{a},\textrm{e} \}$

Hence using 5 we obtain:

$\displaystyle L(\textrm{h}(\textrm{a}+\textrm{e})\textrm{llo})$	$\displaystyle = \{uvw \mid u\in L(\textrm{h}) \wedge v\in L(\textrm{a}+\textrm{e}) \wedge w \in L(\textrm{llo}) \}$
	$\displaystyle = \{uvw \mid u\in \{\textrm{h}\} \wedge v\in\{ \textrm{a},\textrm{e} \} \wedge w \in \{\textrm({llo}\} \}$
	$\displaystyle = \{\textrm{hallo},\textrm{hello}\}$

$\textrm{a}^*\textrm{b}^*$

Let us introduce the following notation:

$\displaystyle w^i = \underbrace{ww \dots w}_{i \textrm{ times}}$

Now using 6 we know that

$\displaystyle L(\textrm{a}^*)$	$\displaystyle = \{ w_0w_1\dots w_{n-1} \mid n\in\mathbb{N}\wedge \forall i<n. w_i \in L(\textrm{a}) \}$
	$\displaystyle = \{ w_0w_1\dots w_{n-1} \mid n\in\mathbb{N}\wedge \forall i<n. w_i \in \{\textrm{a}\} \}$
	$\displaystyle = \{ \textrm{a}^n \mid n\in\mathbb{N}\}$

and hence using 5 we conclude

$\displaystyle L(\textrm{a}^\textrm{b}^)$	$\displaystyle = \{ uv \mid u \in L(\textrm{a}^) \wedge v \in L(\textrm{b}^) \}$
$\displaystyle = \{ uv \mid u \in \{\textrm{a}^n \mid n\in\mathbb{N}\} \wedge v \in \{\textrm{a}^m \mid m\in\mathbb{N}\} \}$
$\displaystyle = \{ \textrm{a}^n\textrm{b}^m \mid m,n \in \mathbb{N}\}$

I.e. $L(\textrm{a}^*\textrm{b}^*)$ is the set of all words which start with a (possibly empty) sequence of as followed by a (possibly empty) sequence of bs.

$(\epsilon+\textrm{b})(\textrm{ab})^*(\epsilon+\textrm{a})$

Let's analyze the parts:

$\displaystyle L(\epsilon+\textrm{b})$	$\displaystyle = \{ \epsilon, \textrm{b} \}$
$\displaystyle L((\textrm{ab})^*)$	$\displaystyle = \{ \textrm{ab}^i \mid i \in\mathbb{N}\}$
$\displaystyle L(\epsilon+\textrm{b})$	$\displaystyle = \{ \epsilon, \textrm{b} \}$

Hence, we have

$\displaystyle L((\epsilon+\textrm{b})(\textrm{ab})^*(\epsilon+\textrm{a})) = \{... ...on, \textrm{b} \} \wedge i\in\mathbb{N}\wedge v \in \{ \epsilon, \textrm{b} \}$

In english: $L((\epsilon+\textrm{b})(\textrm{ab})^*(\epsilon+\textrm{a}))$ is the set of (possibly empty) sequences of interchanging as and bs.

Next: Translating regular expressions to Up: Regular expressions Previous: What are regular expressions?

Thorsten Altenkirch 2001-05-08