How to prove that a language is not regular?



We learned about the class of regular languages $\mathrm{REG}$. It is characterised by any one concept among regular expressions, finite automata and left-linear grammars, so it is easy to show that a given language is regular.

How do I show the opposite, though? My TA has been adamant that in order to do so, we would have to show for all regular expressions (or for all finite automata, or for all left-linear grammars) that they cannot describe the language at hand. This seems like a big task!

I have read about some pumping lemma but it looks really complicated.

This is intended to be a reference question collecting usual proof methods and application examples. See here for the same question on context-free languages.


Posted 2012-04-04T10:30:32.163

Reputation: 54 413



Proof by contradiction is often used to show that a language is not regular: let $P$ be a property that holds for all regular languages; if your specific language does not satisfy $P$, then it is not regular. The following properties can be used:

  1. The pumping lemma, as exemplified in Dave's answer;
  2. Closure properties of regular languages (set operations, concatenation, Kleene star, mirror, homomorphisms);
  3. A regular language has a finite number of prefix equivalence classes (Myhill–Nerode theorem).

To prove that a language $L$ is not regular using closure properties, the technique is to combine $L$ with regular languages by operations that preserve regularity in order to obtain a language known to be not regular, e.g., the archetypical language $I= \{ a^n b^n | n \in \mathbb{N} \}$. For instance, let $L= \{a^p b^q | p \neq q \}$. Assume $L$ is regular. As regular languages are closed under complementation, $L$'s complement $L^c$ is regular as well. Now take the intersection of $L^c$ with $a^\star b^\star$, which is regular: we obtain $I$, which would then have to be regular too. But $I$ is not regular, a contradiction.
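
As a sanity check of the set identity used above (not of the non-regularity of $I$ itself), here is a minimal Python sketch that compares $L^c \cap a^\star b^\star$ with $I$ on all words up to a bounded length. The helper names `split_ab`, `in_L`, `in_I`, `words` and `check` are made up for this illustration.

```python
from itertools import product

def split_ab(w):
    """Return (p, q) if w == 'a'*p + 'b'*q, else None."""
    p = len(w) - len(w.lstrip("a"))   # number of leading a's
    q = len(w) - p
    return (p, q) if w[p:] == "b" * q else None

def in_L(w):
    """L = { a^p b^q : p != q }."""
    pq = split_ab(w)
    return pq is not None and pq[0] != pq[1]

def in_I(w):
    """I = { a^n b^n : n >= 0 }."""
    pq = split_ab(w)
    return pq is not None and pq[0] == pq[1]

def words(max_len, alphabet="ab"):
    """All words over the alphabet of length at most max_len."""
    for n in range(max_len + 1):
        for t in product(alphabet, repeat=n):
            yield "".join(t)

def check(max_len=10):
    """True if (complement of L) ∩ a*b* agrees with I on every tested word."""
    return all(((not in_L(w)) and split_ab(w) is not None) == in_I(w)
               for w in words(max_len))
```

A bounded check like this only guards against slips in the set algebra; the actual argument is the closure-property reasoning above.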

The Myhill–Nerode theorem can be used to prove that $I$ is not regular. For $p \geq 0 $, $I/a^p= \{ a^{r}b^rb^p| r \in \mathbb{N} \}=I.\{b^p\}$. These classes are pairwise different, so there is a countable infinity of them. As a regular language must have a finite number of classes, $I$ is not regular.
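
To make "all classes are different" concrete, here is a small Python sketch (the function names are my own, chosen for illustration). For $p \neq q$ it exhibits the suffix $b^p$, which extends $a^p$ to a word of $I$ but does not do so for $a^q$; hence $a^p$ and $a^q$ lie in different classes.

```python
def in_I(w):
    """I = { a^n b^n : n >= 0 }."""
    n = len(w) // 2
    return w == "a" * n + "b" * n

def distinguishing_suffix(p, q):
    """For p != q, return a suffix s with a^p s in I but a^q s not in I."""
    s = "b" * p
    assert in_I("a" * p + s) and not in_I("a" * q + s)
    return s

def all_pairwise_distinct(bound):
    """Check that a^0, ..., a^(bound-1) are pairwise distinguishable."""
    return all(distinguishing_suffix(p, q) == "b" * p
               for p in range(bound) for q in range(bound) if p != q)
```

The check is finite, of course; the point is that the same suffix works for every pair, which is exactly why there are infinitely many classes.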


Posted 2012-04-04T10:30:32.163

Reputation: 980

Didn't know about Myhill-Nerode theorem, cool! – Daniil – 2012-04-04T13:20:46.247

Wikipedia also has a section about the number of words in a regular language: if you can prove your language doesn't match the characterization, then your language is not regular:

– Alex ten Brink – 2012-04-04T14:39:33.380

@Daniil, regular expressions can't count seems to me a popular informal formulation of Myhill-Nerode theorem. – AProgrammer – 2012-04-04T15:28:07.590

@AlextenBrink: That is neat. I guess the constants in the statement are the eigenvalues of the automaton's Laplacian? This would make a nice addition to the answers here. – Louis – 2012-04-04T20:27:18.130

@Louis: actually, we've found no reference for that theorem at all, so if you know more about it... Also see:

– Alex ten Brink – 2012-04-04T20:31:04.090

@AlextenBrink: I had never seen the statement before you pointed to it, but here is how I'd try to prove it. Suppose the start state is $1$ and $2,3,\ldots, k$ are accepting. Then the number of words of length $n$ is given by $\sum_{j=2}^k a_{1j}$ where $A^n = (a_{ij})$ and $A$ is the adjacency matrix of the automaton's graph. Maybe you can work out a recursion and solve it in that form. (So my other comment is likely not the right thing.) – Louis – 2012-04-04T21:17:56.363

See here for two examples of using closure properties.

– Raphael – 2012-05-28T22:56:03.037

It's worth noting that Myhill-Nerode is a characterisation of REG, that is it always works (in principle). That's different from the Pumping lemma, which only yields a necessary criterion. – Raphael – 2015-10-13T10:13:06.267


Based on Dave's answer, here is a step-by-step "manual" for using the pumping lemma.

Recall the pumping lemma (taken from Dave's answer, taken from Wikipedia):

Let $L$ be a regular language. Then there exists an integer $n\ge 1$ (depending only on $L$) such that every string $w$ in $L$ of length at least $n$ ($n$ is called the "pumping length") can be written as $w = xyz$ (i.e., $w$ can be divided into three substrings), satisfying the following conditions:

  1. $|y| \ge 1$
  2. $|xy| \le n$ and
  3. a "pumped" $w$ is still in $L$: for all $i \ge 0$, $xy^iz \in L$.

Assume that you are given some language $L$ and you want to show that it is not regular via the pumping lemma. The proof looks like this:

  1. Assume that $L$ is regular.
  2. If it is regular, then the pumping lemma says that there exists some number $n$ which is the pumping length.
  3. Pick a specific word $w\in L$ of length at least $n$. The difficult part is to know which word to take.
  4. Consider ALL the ways to partition $w$ into 3 parts, $w=xyz$, with $|xy|\le n$ and $y$ non empty. For each of these ways, show that it cannot be pumped: there always exists some $i\ge 0$ such that $xy^iz \notin L$.
  5. Conclude: the word $w$ cannot be "pumped" (no matter how we split it to $xyz$) in contradiction to the pumping lemma, i.e., our assumption (step 1) is wrong: $L$ is not regular.

Before we go to an example, let me reiterate Step 3 and Step 4 (this is where most people go wrong). In Step 3 you need to pick one specific word in $L$. Write it down explicitly, like "00001111" or "$a^nb^n$". Examples of things that are not a specific word: "$w$" or "a word that has 000 as a prefix".

On the other hand, in Step 4 you need to consider more than one case. For instance, if $w=000111$ it is not enough to say $x=00, y=01, z=11$, and then reach a contradiction. You must also check $x=0, y=0, z=0111$, and $x=\epsilon, y=000, z=111$, and all the other possible options.

Now let's follow the steps and prove that $L= \{ 0^k1^{2k} \mid k>0 \}$ is not regular.

  1. Assume $L$ is regular.
  2. Let $n$ be the pumping length given by the pumping lemma.
  3. Let $w = 0^n 1^{2n}$.
    (Sanity check: $|w|\gt n$ as needed. Why this word? Other words can work as well; it takes some experience to come up with the right $w$.) Again, note that $w$ is a specific word: $\underbrace{000\ldots0}_{n \text{ times}}\underbrace{111\ldots1}_{2n \text{ times}}$.
  4. Now let's consider the various ways to split $w$ into $xyz$ with $|xy|\le n$ and $|y|>0$. Since $|xy|\le n$, no matter how we split $w$, $x$ will consist of only 0's and so will $y$. Let's assume $|x|=s$ and $|y|=k$. We need to consider ALL the options, that is, all the possible $s,k$ such that $s\ge 0, k\ge 1$ and $s+k \le n$. FOR THIS $L$ the proof for all these cases is the same, but in general it might be different.
    Take $i=0$ and consider $xy^iz = xz$. This word is of the form $0^{n-k}1^{2n}$ (no matter what $s$ and $k$ were), and since $k \ge 1$ it has fewer than $n$ zeros but still $2n$ ones, so it is NOT in $L$ and we reach a contradiction.
  5. Thus, our assumption is incorrect, and $L$ is not regular.
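
The case analysis in Step 4 can be cross-checked mechanically for small values of $n$. The following Python sketch (all function names are hypothetical, introduced only for this illustration) enumerates every split $w = xyz$ with $|xy| \le n$ and $|y| \ge 1$ and looks for an exponent $i$ that pushes $xy^iz$ out of $L$. It does not replace the proof, which must hold for every $n$, but it is a handy way to test a candidate word $w$ before writing the proof down.

```python
def in_L(w):
    """L = { 0^k 1^(2k) : k > 0 }."""
    k = len(w) // 3
    return k > 0 and w == "0" * k + "1" * (2 * k)

def decompositions(w, n):
    """All (x, y, z) with w = xyz, |xy| <= n and |y| >= 1."""
    for s in range(n):                   # s = |x|
        for k in range(1, n - s + 1):    # k = |y|
            yield w[:s], w[s:s + k], w[s + k:]

def breaking_exponent(x, y, z, max_i=3):
    """Smallest i in 0..max_i with x y^i z not in L, or None if none is found."""
    for i in range(max_i + 1):
        if not in_L(x + y * i + z):
            return i
    return None

def no_split_can_be_pumped(n):
    """True if every admissible split of w = 0^n 1^(2n) fails for some i."""
    w = "0" * n + "1" * (2 * n)
    return all(breaking_exponent(x, y, z) is not None
               for x, y, z in decompositions(w, n))
```

For this language every split already fails at $i=0$, matching the argument in Step 4.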

A youtube clip that explains how to use the pumping lemma along the same lines can be found here

Ran G.

Posted 2012-04-04T10:30:32.163

Reputation: 15 978

It's n that is the pumping length in this definition! – saadtaame – 2012-08-24T01:17:28.403


From Wikipedia, the pumping lemma for regular languages is the following:

Let $L$ be a regular language. Then there exists an integer $p\ge 1$ (depending only on $L$) such that every string $w$ in $L$ of length at least $p$ ($p$ is called the "pumping length") can be written as $w = xyz$ (i.e., $w$ can be divided into three substrings), satisfying the following conditions:

  1. $|y| \ge 1$
  2. $|xy| \le p$ and
  3. for all $i \ge 0$, $xy^iz \in L$.
    $y$ is the substring that can be pumped (removed or repeated any number of times, and the resulting string is always in $L$).

(1) means the loop y to be pumped must be of length at least one; (2) means the loop must occur within the first p characters. There is no restriction on x and z.

In simple words: for any regular language $L$, any sufficiently long word $w\in L$ can be split into 3 parts, i.e. $w = xyz$, such that all the strings $xy^kz$ for $k\ge 0$ are also in $L$.

Now let's consider an example. Let $L=\{(01)^n2^n\mid n\ge0\}$.

To show that this is not regular, you need to consider what all the decompositions $w=xyz$ look like, so what are all the possible things x, y and z can be given that $xyz=(01)^p2^p$ (we choose to look at this particular word, of length $3p$, where $p$ is the pumping length). We need to consider where the $y$ part of the string occurs. It could overlap with the first part, and will thus equal either $(01)^{k+1}$, $(10)^{k+1}$, $1(01)^k$ or $0(10)^k$, for some $k\ge 0$ (don't forget that $|y|\ge 1$). It could overlap with the second part, meaning that $y=2^k$, for some $k>0$. Or it could overlap across the two parts of the word, and will have the form $(01)^{k+1} 2^l$, $(10)^{k+1} 2^l$, $1(01)^k 2^l$ or $0(10)^k 2^l$, for $k\ge0$ and $l\ge1$.

Now pump each one to obtain a contradiction, which will be a word not in your language. For example, if we take $y=0(10)^k2^l$, the pumping lemma says, for instance, that $xy^2z=x0(10)^k2^l0(10)^k2^lz$ must be in the language, for an appropriate choice of $x$ and $z$. But this word cannot be in the language as a $2$ appears before a $1$.

Other cases will result in the number of $(01)$'s being more than the number of $2$'s or vice versa, or will result in words that won't have the structure $(01)^n2^n$ by, for example, having two $0$'s in a row.

Don't forget that $|xy| \le p$. Here, it's useful to shorten the proof: many of the decompositions above are impossible because $xy$ would be longer than $p$ (equivalently, the $z$ part would be too short). In particular, $y$ cannot contain a $2$.

Each of the cases above needs to lead to such a contradiction, which would then be a contradiction of the pumping lemma. Voila! The language would not be regular.
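
To see how much the condition $|xy| \le p$ prunes the case analysis, here is a rough Python sketch (names such as `splits` and `survey` are invented for this illustration). For a small fixed $p$ it lists the splits of $w=(01)^p2^p$ with and without the bound and checks that every bounded split breaks when pumped down or up.

```python
def in_L(w):
    """L = { (01)^n 2^n : n >= 0 }."""
    n = len(w) // 3
    return w == "01" * n + "2" * n

def splits(w, max_xy):
    """All (x, y, z) with w = xyz, |y| >= 1 and |xy| <= max_xy."""
    for end in range(1, max_xy + 1):     # end = |xy|
        for start in range(end):         # start = |x|
            yield w[:start], w[start:end], w[end:]

def survey(p):
    """Compare splits of (01)^p 2^p with and without the bound |xy| <= p."""
    w = "01" * p + "2" * p
    bounded = list(splits(w, p))
    unbounded = list(splits(w, len(w)))
    return {
        "y_may_contain_2_without_bound": any("2" in y for _, y, _ in unbounded),
        "y_contains_2_with_bound": any("2" in y for _, y, _ in bounded),
        "every_bounded_split_breaks": all(
            not in_L(x + z) or not in_L(x + y * 2 + z) for x, y, z in bounded),
    }
```

With the bound in place, $y$ lies entirely inside the $(01)^p$ block, so only the first family of cases from the answer remains.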

Dave Clarke

Posted 2012-04-04T10:30:32.163

Reputation: 16 574

An example where the hypothesis $|xy|\le p$ is needed would be nice. – Gilles – 2012-04-04T15:25:58.330

@Gilles: I'm not even sure what the sentence you added means. – Dave Clarke – 2012-04-04T15:28:31.090

@Gilles: I think that all the decompositions are possible, just that $k$ will be bounded. I'm not sure what it has to do with the length of $z$. – Dave Clarke – 2012-04-04T15:32:19.493

Duh! I see it now. Thanks. It does not, however, rule out any of the forms of decomposition mentioned in the answer; it only limits what values of $k$ and $l$ I can take. – Dave Clarke – 2012-04-04T16:19:56.580

I don't like Wikipedia's formulation; due to redundancy it is very verbose. – Raphael – 2012-04-04T16:26:20.133

@Gilles: But it doesn't rule out any of the forms I enumerated. – Dave Clarke – 2012-04-04T16:30:01.763

@Gilles: I see that you have radically changed my answer, and now see what you mean. – Dave Clarke – 2012-04-04T16:31:16.690

@Dave Feel free to revert if you don't like my edits. I thought this breakdown with less overlap between the cases was both easier to work with and more natural (either $y$ contains a $2$ or it doesn't, or in other words either $|z| \le p$ or $|z| < p$). – Gilles – 2012-04-04T16:38:22.443

@Gilles: It's okay. I simply confused myself by not reading all of the changes you made. – Dave Clarke – 2012-04-04T17:57:24.273

The amount of editing that has been done to answer such an easy question makes me wonder why everybody teaches the pumping lemma as "the" way to prove non-regularity. Out of curiosity, why not just take your string to be something like $(01)^{2p}2^{2p}$? The pumping lemma tells you that $y$ has no $2$s in it, from which a contradiction is more straightforward. – Louis – 2012-04-04T19:47:47.180

@Louis: I was trying to show a more general strategy with my example. But as you can see, there are many ways. – Dave Clarke – 2012-04-04T19:53:58.343


For a given language $L \subseteq \Sigma^*$, let

$\qquad \displaystyle S_L(z) = \sum\limits_{n \geq 0} |L \cap \Sigma^n|\cdot z^n$

be the (ordinary) generating function of $L$, i.e. the generating function of its sequence of word counts per length.

The following statement holds [FlSe09, p52]:

$\qquad \displaystyle L \in \mathrm{REG} \quad \Longrightarrow \quad S_L \text{ rational}$

That is, $S_L(z) = \frac{P(z)}{Q(z)}$ with $P,Q$ polynomials.

So any language whose generating function is not rational is not regular. Unfortunately, all linear languages also have rational generating functions¹ so this method won't work for the simpler non-regular languages. Another drawback is that obtaining $S_L$ (and showing that it is not rational) can be hard.

Example: Consider the language of correctly nested parentheses words, i.e. the Dyck language. It is generated by the unambiguous grammar

$\qquad \displaystyle S \to [S]S \mid \varepsilon$

which can be translated into the equation

$\qquad \displaystyle S(z) = z^2S^2(z) + 1$

one solution (the one with all positive coefficients) of which is

$\qquad \displaystyle \mathcal{S}(z) = \frac{1 - \sqrt{1 - 4z^2}}{2z^2}$.

As $S_L = \mathcal{S}$ [Kuic70] and $\mathcal{S}$ is not rational, the Dyck language is not regular.
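
The coefficients of $\mathcal{S}$ can be checked numerically against brute-force word counts. The Python sketch below (function names are mine, not from the cited references) counts Dyck words of each length over $\{[,]\}$, extracts coefficients from the equation $S(z) = z^2S^2(z) + 1$, and compares both with the Catalan numbers. It illustrates the translation from grammar to equation; it does not prove that $\mathcal{S}$ is not rational.

```python
from itertools import product
from math import comb

def is_dyck(w):
    """True if w is a correctly nested word over '[' and ']'."""
    depth = 0
    for c in w:
        depth += 1 if c == "[" else -1
        if depth < 0:
            return False
    return depth == 0

def count_dyck(length):
    """|L ∩ Σ^length| by brute force over Σ = {[, ]}."""
    return sum(is_dyck(w) for w in product("[]", repeat=length))

def catalan(n):
    """The n-th Catalan number."""
    return comb(2 * n, n) // (n + 1)

def series_coefficients(max_len):
    """Coefficients s_0..s_max_len of S(z) determined by S = 1 + z^2 S^2."""
    s = [0] * (max_len + 1)
    s[0] = 1
    for n in range(2, max_len + 1):
        # [z^n] (z^2 S^2) = sum over i + j = n - 2 of s_i * s_j
        s[n] = sum(s[i] * s[n - 2 - i] for i in range(n - 1))
    return s
```

The coefficient of $z^{2n}$ is the $n$-th Catalan number and the odd coefficients vanish, consistent with the closed form above.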

  1. The proof for the statement for regular languages works via grammars and transfers to linear grammars immediately (commutativity of multiplication).

$\ \ $ [FlSe09] Analytic Combinatorics by P. Flajolet and R. Sedgewick (2009)
$\ \ $ [Kuic70] On the Entropy of Context-Free Languages by W. Kuich (1970)


Posted 2012-04-04T10:30:32.163

Reputation: 54 413


This is an expanded version of my answer to "Using Pumping Lemma to prove language $L = \{(01)^m 2^m \mid m \ge0\}$ is not regular", since this is supposed to be a reference question.

So, you think the pumping lemma looks complicated? Don't worry. Here's a slightly different approach, which is hidden in @Romuald's answer as well. (Quiz: where?)

Let's start by remembering that every regular language is accepted by a deterministic finite state automaton (DFA). A DFA is a finite directed graph where every vertex has exactly one out-edge for each letter in the alphabet. Strings give you a walk in the graph based at a vertex labeled "start", and the DFA accepts if this walk ends at a vertex labeled "accept". (The vertices are called "states" because different areas of math like to make up their own terminology for the same thing.)

With this way of thinking it is easy to see that: If strings $a$ and $b$ drive the DFA to the same state, then for any other string $c$, $ac$ and $bc$ drive the DFA to the same state. Why? Because the starting point of a walk and the string defining it determine the end completely.

Put slightly differently: If $L$ is regular and strings $a$ and $b$ drive a recognizing automaton to the same state, then for all strings $c$, either $ac$ and $bc$ are both in $L$ or neither is.

We can use this to show that a language isn't regular by imagining that it is and then coming up with $a$ and $b$ driving a DFA to the same state, and $c$ so that $ac$ is in the language and $bc$ isn't. Take the example language from @Dave's answer. Imagine it is regular, so it has some recognizing DFA with $m$ states. The Pigeon Hole Principle says that at least two of $\{(01)^i : 0\le i\le m+1\}$ send the DFA to the same state, say $a=(01)^p$ and $b=(01)^q$. Since $p\neq q$, we see that $a2^p$ is in the language and $b2^p$ is not, so this language can't be regular.

The nice thing is that the example is really a template for proving that languages aren't regular:

  • Find a family of strings $\{a_i :i\in\mathbb{N}\}$ with the property that each of them has a "tail" $t_i$ so that $a_it_i$ is in the language and $a_it_j$, for $i\neq j$, is not.
  • Apply the argument above verbatim. (This is allowed, since there are always enough $a_i$ to let you invoke the Pigeon Hole Principle.)
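
The pigeonhole argument is constructive: given any concrete DFA, it tells you where to look for a word the DFA gets wrong. Here is a minimal Python sketch of that idea for the language $\{(01)^n2^n\}$. The DFA encoding (a dict `delta` keyed by `(state, letter)`, states numbered `0..m-1`, start state `0`) and all function names are assumptions made for this illustration.

```python
import random

def in_L(w):
    """L = { (01)^n 2^n : n >= 0 }."""
    n = len(w) // 3
    return w == "01" * n + "2" * n

def run(delta, state, w):
    """Follow the walk defined by w from the given state."""
    for c in w:
        state = delta[state, c]
    return state

def find_mistake(delta, accepting, m, start=0):
    """Return a word on which the m-state DFA disagrees with L."""
    seen = {}
    for i in range(m + 1):               # m + 1 prefixes, but only m states
        q = run(delta, start, "01" * i)
        if q in seen:
            j = seen[q]                  # (01)^j and (01)^i collide, j < i
            # Both words end in the same state, yet only the first is in L.
            for w in ("01" * j + "2" * j, "01" * i + "2" * j):
                if (run(delta, start, w) in accepting) != in_L(w):
                    return w
        seen[q] = i
    raise AssertionError("unreachable by the pigeonhole principle")

def random_dfa(m, rng):
    """A random complete DFA over {0, 1, 2} with states 0..m-1."""
    delta = {(q, c): rng.randrange(m) for q in range(m) for c in "012"}
    accepting = {q for q in range(m) if rng.random() < 0.5}
    return delta, accepting
```

Whatever DFA you feed in, `find_mistake` returns a counterexample, which is just the argument above run on a specific machine.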

There are other tricks, but this one will work easily on most of your homework problems.

Edit: An earlier version had some discussion of how this idea relates to the Pumping Lemma.


Posted 2012-04-04T10:30:32.163

Reputation: 2 568

I don't think that reproducing the proof of Pumping Lemma is useful in general, but YMMV. Understanding the proof is good in any case; it is immediately connected with a number of closure and other interesting properties of finite automata and regular languages. I strongly disagree with the last sentence, though: automata theory is not boring at all, and it is certainly not the most boring part of theory classes. – Raphael – 2012-04-06T11:27:09.873


Following the answer here, I will describe a method of proving non-regularity based on Kolmogorov complexity.

This approach is discussed in "A New Approach to Formal Language Theory by Kolmogorov Complexity", by Ming Li and Paul M.B. Vitanyi (see section 3.1).

Let $K(x)$ denote the Kolmogorov complexity of a string $x$, i.e. the length of the shortest encoding of a Turing machine $M$, such that $M(\epsilon)=x$ (any of the usual definitions will do). One can then use the following lemma to prove non regularity:

KC-Regularity: Let $L\subseteq \Sigma^*$ be a regular language. Then there exists a constant $c$ which depends only on $L$, such that for all $x\in\Sigma^*$, if $y$ is the $n$-th string (relative to the lexicographic ordering) in $L_x=\left\{y\in \Sigma^*|xy\in L\right\}$, then $K(y)\le O(\log n)+c$.

One can understand (and prove) the above lemma as follows: for any $x\in\Sigma^*$, to describe the $n$-th string in $L_x$ one needs to specify:

  • The automaton which accepts $L$
  • The state in the automaton after processing the prefix $x$
  • The index $n$

Since we only need to remember the state after processing $x$, and not $x$ itself, we can hide this factor in the constant depending on $L$. The index $n$ requires $\log n$ bits to describe, and we get the above result (for completeness, one needs to add the specific instructions required to generate $y$, but this only adds a constant factor to the final description).

This lemma shows how to bound the Kolmogorov complexity of all strings which are members of $L_x$ for some regular language $L$ and $x\in\Sigma^*$. In order to show non-regularity, one can assume $L$ is regular, and prove that the bounds are too restrictive (e.g. bounded Kolmogorov complexity for an infinite set of strings).

The answer linked above contains an example of how to use this lemma to show $L=\left\{1^p | \text{p is prime}\right\}$ is not regular, several more examples are given in the paper. For completeness, we show here how to prove $L=\left\{0^n1^n| n\ge 0\right\}$ is not regular.

Given some $x\in\left\{0,1\right\}^*$, we denote by $y_i^x$ the $i$-th word in $L_x$. Note that $y_1^{0^i}=1^i$. Using the above lemma, focusing on prefixes $x$ of the form $x=0^i$ and fixing $n=1$, we obtain $\forall i\ge 0 : K(y_1^{0^i})\le c$. Since $y_1^{0^i}=1^i$, this means that we can bound the Kolmogorov complexity of all strings of the form $1^i$ by a constant, which is obviously false. It is worth mentioning that we could have examined a single $x$, e.g. $x=0^n$ for large enough $n$ which satisfies $K(0^n)\ge \log n $ (we start with a high complexity prefix). Since $y_1^x=1^n$, we get $K(1^n)<c$, a contradiction (suppose $n>2^c$).


Posted 2012-04-04T10:30:32.163

Reputation: 9 296


In the case of unary languages (languages over an alphabet of size 1), there is a simple criterion. Let us fix an alphabet $\{ \sigma \}$, and for $A \subseteq \mathbb{N}$, define $$ L(A) = \{ \sigma^n : n \in A \}. $$

Theorem. Let $A \subseteq \mathbb{N}$. The following are equivalent:

  1. $L(A)$ is regular.

  2. $L(A)$ is context-free.

  3. There exist $n_0,m \geq 1$ such that for all $n \geq n_0$, it holds that $n \in A$ iff $n+m \in A$. (We say that $A$ is eventually periodic.)

  4. Let $a_i = 1_{i \in A}$. Then $0.a_0a_1a_2\ldots$ is rational.

  5. The generating function $\sum_{i \in A} x^i$ is a rational function.

The theorem can be proved in many ways, for example using the pumping lemma, Myhill–Nerode theory, Parikh's theorem, the structure of DFAs on unary languages (they look like a "$\rho$", as in Pollard's $\rho$ algorithm), and so on. Here is a useful corollary.

Corollary. Let $A \subseteq \mathbb{N}$, and suppose that $L(A)$ is regular.

  1. The limit $\rho = \lim_{n\to\infty} \frac{|A \cap \{1,\ldots,n\}|}{n}$ exists. (This is the asymptotic density of $A$.)

  2. If $\rho = 0$ then $A$ is finite.

  3. If $\rho = 1$ then $A$ is cofinite (that is, $\overline{A}$ is finite).

As an example, the language $L(\{2^n : n \geq 0\})$ isn't regular, since the set has vanishing asymptotic density, yet is infinite.
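
Criterion 3 of the theorem and the density in the corollary are easy to probe experimentally. The Python sketch below (function names and the search bounds are my own choices) tests eventual periodicity on a finite window only, so a negative result is evidence rather than a proof; the proof for the powers of two is the density argument above.

```python
def eventually_periodic_on_window(A, n0, m, bound):
    """Check (n in A) == (n + m in A) for all n0 <= n <= bound - m."""
    return all((n in A) == (n + m in A) for n in range(n0, bound - m + 1))

def find_period(A, bound, max_n0, max_m):
    """First (n0, m) within the search limits that works on the window, or None."""
    for n0 in range(1, max_n0 + 1):
        for m in range(1, max_m + 1):
            if eventually_periodic_on_window(A, n0, m, bound):
                return n0, m
    return None

def density(A, n):
    """|A ∩ {1, ..., n}| / n."""
    return sum(1 for k in range(1, n + 1) if k in A) / n
```

For instance, the multiples of 3 with a few extra small elements added yield a valid pair $(n_0, m)$, while the powers of two yield none within the searched range and their density on the window is close to 0.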

Yuval Filmus

Posted 2012-04-04T10:30:32.163

Reputation: 167 283