Regular Languages, Regular Expressions

can be described by . Thus, these form the basis for all languages for specifying the lexical analysis. Regular languages can be recognized by finite automata. These terms will now be introduced. In what follows, we shall always assume an underlying alphabet .

In order to be recognized automatically, words of a symbol class have to be described formally. Now we will try to capture the structure of a symbol class formally. As an example, we will examine the symbol class of . Identifiers are formed by a concatenation of letters and digits, where the first character must be a letter.

How can this structure be represented?

a LETTER, then any number of (a LETTER or a DIGIT)

or shorter:

le, then any number of ( le or di )

This informal notation can be represented formally as follows:

So the expression now looks like this: le ( le | di )*. This is a .

Strictly speaking, le and di are themselves :

So the for identifiers is composed of already existing regular expressions by applying CONCATENATION, UNION, and KLEENE-STAR to these already existing regular expressions.

Therefore, , and the they describe, are defined inductively:

Regular expression Regular language
Basis: a

Ø
describes {a}
{}
Ø
Inductive hypothesis: r1

r2

describes R1

R2

Inductive step: (r1 r2)
(r1 | r2)
(r1)*
describes R1R2
R1R2
R1*
There are no other regular expressions.

Remarks on the notation:

Examples for regular expressions and languages:

Regular expression Regular language described Elements of the regular language
a|b {a, b} a,b
ab*a {a}{b}*{a} aa, aba, abba, Ø
(ab)* {ab}* , ab, abab, Ø
abba {abba} abba

Copyright © (1999 - 2001) University of Saarland, Germany