What is a context-free language?


Wikipedia has the following statement:

In formal language theory, a context-free language (CFL) is a language generated by some context-free grammar (CFG). Different context-free grammars can generate the same context-free language; conversely, a given context-free language can be generated by different context-free grammars.

Oh God!!!

However, I found the explanation a little confusing and jumbled, so I still ask:

  • What is a context-free language?
  • How is it possible, if it is possible, to define whether a language is context-free or not?
  • Does this apply to programming languages?


    @cat programming languages and compilers.

2 answers



There is a hierarchy, defined by Chomsky and refined by other researchers, classifying languages denoted by sequences of symbols, words, letters, etc. This hierarchy contains the following levels:

  1. Finite languages.

  2. Regular languages - those of regular expressions.

  3. Context-free languages - I’ll explain below.

  4. Context-sensitive languages.

  5. Recursive languages (also called decidable).

  6. Recursively enumerable languages (also called semi-decidable).

  7. Unrestricted languages - that is, this is the most general category that encompasses any and all languages.

Each level is a superset of the previous level. So:

  • Context-free languages are more powerful and more general than regular languages. That is, with context-free grammars you can express structures/languages that cannot be represented by regular expressions.

  • All languages that can be expressed by means of a regular expression can also be expressed by means of a context-free grammar.

  • Context-free languages are still far from being languages capable of expressing anything. There are some levels above them.

This is what the hierarchy looks like:

(image: the Chomsky hierarchy)

Well, let's take a look at regular expressions first, since it's more likely you've seen them before. For example, a regular expression of the form ab(c*)(d|e)(f?)(g+) denotes sequences of symbols with the following structure:

  1. Start with ab.

  2. Then have zero or more cs.

  3. Then have a d or an e.

  4. Then optionally have a single f.

  5. End with one or more gs.
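As a quick sanity check, the structure above can be tested with Python's `re` module (a small sketch; note that the optional f is written `f?` in standard regex syntax):

```python
import re

# The regular expression from the text; in standard regex syntax,
# f? means "zero or one f".
pattern = re.compile(r"ab(c*)(d|e)(f?)(g+)")

for s in ["abdg", "abcccefggg", "abdfg", "abg", "abcd"]:
    print(s, "matches" if pattern.fullmatch(s) else "does not match")
```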

Well, we see that regular expressions can contain rules in the following formats:

  • Literal sequences of input symbols.

  • Repetition of a subexpression zero or more times.

  • Repetition of a subexpression one or more times.

  • Choice between two or more subexpressions.

  • Optional subexpressions (appearing zero or one times).

However, a regular expression cannot be recursive. A context-free grammar, on the other hand, is composed of recursive rules called productions. Each production is identified by a symbol/name called a nonterminal. One of the nonterminal symbols is the initial symbol, which is the starting point in the case of generation and the final objective in the case of acceptance (I explain this better below). In addition to the nonterminal symbols, we have the terminal symbols, which correspond to symbols (characters) represented directly in the input string.

Context-free grammars can contain all these elements that regular expressions have, but they have one more possibility. They can also:

  • Expand nonterminal symbols among the terminals, including nonterminal symbols that expand recursively.

To demonstrate, I will take as an example one of the grammars from the Wikipedia article:

T -> x
T -> y
T -> z
S -> S + T
S -> S - T
S -> S * T
S -> S / T
T -> ( S )
S -> T

A form equivalent to this grammar would be this:

T -> x | y | z | ( S )
S -> S + T | S - T | S * T | S / T | T

In this grammar, the symbols x, y, z, +, -, *, /, ( and ) are the terminals, and the symbols S and T are the nonterminals. The nonterminal symbol S is the initial symbol. This grammar models arithmetic expressions (although it has no precedence rules).

The arrow is read as "derives" or "produces". So, the rule T -> x can be read as "T derives x" or "T produces x". Note that in this example it is possible for an S to derive another S, or for a T to derive another T, directly or indirectly, because productions are applied recursively.

To see an example in practice, note that an expression such as x-(z)+(x*x) may be produced from the initial symbol S thus:

S           // Initial symbol
S+T         // Apply rule S -> S + T
S+(S)       // Apply rule T -> ( S )
S+(S*T)     // Apply rule S -> S * T
S+(T*T)     // Apply rule S -> T
S+(T*x)     // Apply rule T -> x
S+(x*x)     // Apply rule T -> x
S-T+(x*x)   // Apply rule S -> S - T
S-(S)+(x*x) // Apply rule T -> ( S )
S-(T)+(x*x) // Apply rule S -> T
S-(z)+(x*x) // Apply rule T -> z
T-(z)+(x*x) // Apply rule S -> T
x-(z)+(x*x) // Apply rule T -> x

The recognition process is the reverse:

x-(z)+(x*x) // Input string
T-(z)+(x*x) // Apply rule T -> x
S-(z)+(x*x) // Apply rule S -> T
S-(T)+(x*x) // Apply rule T -> z
S-(S)+(x*x) // Apply rule S -> T
S-T+(x*x)   // Apply rule T -> ( S )
S+(x*x)     // Apply rule S -> S - T
S+(T*x)     // Apply rule T -> x
S+(T*T)     // Apply rule T -> x
S+(S*T)     // Apply rule S -> T
S+(S)       // Apply rule S -> S * T
S+T         // Apply rule T -> ( S )
S           // Apply rule S -> S + T

Note that the process of generating or producing a string is one that, starting from the initial symbol, reaches a string of terminals, while the process of recognizing or accepting a string is one that, starting from the string, reaches the initial symbol. These generation and recognition processes also exist for regular expressions, which can likewise be used both to generate and to recognize strings.

Now note that, in the example above, the generated and/or recognized expression will always have properly nested and balanced parentheses, something regular expressions cannot express. This demonstrates that context-free grammars are more powerful/expressive than regular expressions.
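To make the recognition process concrete, here is a minimal sketch of a recognizer for this grammar in Python. The left-recursive rules S -> S op T are rewritten in the equivalent form S -> T (op T)*, which accepts the same language (the function name `accepts` is mine, purely illustrative):

```python
# A minimal recursive-descent recognizer for the expression grammar above.
# The left-recursive rule S -> S op T is rewritten as S -> T (op T)*,
# which recognizes the same language.
def accepts(s: str) -> bool:
    pos = 0

    def peek():
        return s[pos] if pos < len(s) else None

    def parse_S():
        nonlocal pos
        if not parse_T():
            return False
        while peek() in ("+", "-", "*", "/"):
            pos += 1
            if not parse_T():
                return False
        return True

    def parse_T():
        nonlocal pos
        if peek() in ("x", "y", "z"):
            pos += 1
            return True
        if peek() == "(":
            pos += 1
            if parse_S() and peek() == ")":
                pos += 1
                return True
        return False

    # The whole input must be consumed for the string to be accepted.
    return parse_S() and pos == len(s)

print(accepts("x-(z)+(x*x)"))  # True
print(accepts("x-(z"))         # False: unbalanced parenthesis
```

Notice that the parenthesis balancing falls out of the recursion in `parse_T` calling back into `parse_S`, which is exactly what a regular expression cannot do.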

As for the difference between context-free languages and context-free grammars: a grammar is what I posted above, a set of rules. A language, on the other hand, is a set of strings of symbols that are accepted or generated, regardless of which rules are used to generate or accept them. A language that can be generated or accepted by a context-free grammar is a context-free language. The same language can be generated or recognized by different grammars, which in that case are equivalent.

Maybe you're wondering why the name is "context-free". The answer is that the recognition of a subexpression does not depend on the context in which it appears; that is, it does not depend on what precedes or follows that subexpression. For example, a (sub)expression x-y may be recognized or generated as an S in the previous grammar regardless of what comes before or after this x-y. In context-sensitive languages (or higher up in the hierarchy) this may not be true.

Finally, to show that a language is context-free, you must demonstrate that there is some context-free grammar that generates it. To prove that it is not context-free, you must prove that no context-free grammar generates it.


The answer by @Victor Stafusa is excellent; I come here only to deepen some points.

Language

A language is a (finite or infinite) subset of the words generated by the Kleene star operator on a set of symbols.

Word

A word is the concatenation of 0 or more symbols.

A word with zero symbols is the empty word. It can be represented by Epsilon or lambda, or simply by "".

Lambda: λ

Epsilon: ε

A single symbol is considered a word of size 1; two concatenated symbols form a word of size 2.

It is possible to concatenate any words you like, even the empty word. I will represent concatenation with the dot operator ".". Some examples of concatenation:

"a" . "b" -> "ab"
"ban" . "ana" -> "banana"
"" . "" -> ""
"" . "a" -> "a"
"a" . "" -> "a"

Formally, every element formed by the Kleene star on a set of symbols T is a word over T.

Kleene star

The Kleene star generates concatenations of words over a previously existing set of words.

But, Jefferson, you just said that a language is a subset of the set generated by the Kleene star on symbols; why are you now saying that the star can operate on words?

Remember that a moment ago I wrote about words of size 1? Well, a single symbol is a word of size one, so the definitions do not contradict each other.

The Kleene star operator over a set V is represented by V*.

A word is generated by the Kleene star over V if:

  1. it is the empty word;
  2. it is a word of V concatenated with a word generated by V*.

Yes, the definition is recursive. More formally, the second step is given by the following mathematical expression:

v ∈ V, w ∈ V* ⟹ v . w ∈ V*
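The recursive definition can be sketched in Python as a small enumerator of V* up to a chosen length (the function name `kleene_star` and the length bound are mine, purely illustrative; the full V* is of course infinite):

```python
from itertools import product

# Enumerate the words of V* up to a given length, following the recursive
# definition above: the empty word, plus each symbol of V concatenated
# with a shorter word of V*.
def kleene_star(V, max_len):
    words = [""]  # length 0: the empty word
    for n in range(1, max_len + 1):
        words.extend("".join(p) for p in product(V, repeat=n))
    return words

print(kleene_star(["a", "b"], 2))  # ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
```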

Language as a whole

A language L is a subset of T*, where T is a set of symbols. As a set, it can be defined in several ways. For example:

L = { a^n b^n c^n : n ≥ 0 }

These are the words made up of the symbols a, b and c such that all the as come before all the bs, which in turn appear before the cs; each word also contains the same number of as, bs and cs.
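A membership test for this particular set definition is easy to sketch in Python (the name `in_L` is mine; as an aside, this language is a classic example of a language that is not context-free):

```python
# Membership test for L = { a^n b^n c^n : n >= 0 }, the language
# described above.
def in_L(w: str) -> bool:
    n = len(w) // 3
    return w == "a" * n + "b" * n + "c" * n

print(in_L("aabbcc"))  # True
print(in_L("abc"))     # True
print(in_L("aabbc"))   # False
```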

In some cases, the language needs to follow a structure, called a syntax. To define a syntax, we use a grammar. Unfortunately, set notation is poor at expressing the syntactic structure of a language, so we have another notation for that...

Generative grammar

Generative grammar, or transformational grammar, was proposed by Chomsky to study in a structured way how sentences are formed in natural language.

A grammar is defined as a set of symbol transformations. Usually, we have a transformation defined like this:

V -> W

where V is one lexical form and W is another lexical form.

What is a lexical form?

Let’s go back a little to Portuguese, our language. A reduced version of sentence formation would be:

  1. Generally speaking, we have a phrase F
  2. F can be described as SVO, where S is a subject, V is a verbal phrase and O is an object
  3. S is described by a nominal phrase Sn
  4. likewise, O is also described as Sn
  5. A Sn can be directly a name N, but it can also be a nominal phrase combined with an adnominal adjunct An before or after another Sn
  6. Therefore, a Sn can be represented by N, Sn . An or An . Sn
  7. An An is an adjective Ad which can be combined with an adverb Av, in the same way that a nominal phrase is combined with adnominal adjuncts
  8. Therefore, An can be represented by Ad, An . Av or Av . An
  9. Similarly, verbal phrases are verbs Ve or combinations of verbal phrases and adverbs: V is represented by Ve, V . Av or Av . V
  10. N is the union of the morphological classes of nouns Su and pronouns P
  11. Su, P, Ve, Ad and Av are the grammatical classes we study in school

I can turn these rules into a grammar as follows:

F -> SVO
S -> Sn
O -> Sn
Sn -> N | An Sn | Sn An
N -> Su | P
An -> Ad | Av An | An Av
V -> Ve | Av V | V Av
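As a sketch, the grammar above can be expanded at random in Python to produce lexical forms. The sample words mapped to the classes Su, P, Ve, Ad and Av are my own illustrative choices (echoing the examples further below), and the depth bound simply keeps the recursion finite:

```python
import random

# The phrase grammar above, as a dict from nonterminal to its productions.
# The class symbols (Su, P, Ve, Ad, Av) are mapped to sample Portuguese
# words purely for illustration.
rules = {
    "F":  [["S", "V", "O"]],
    "S":  [["Sn"]],
    "O":  [["Sn"]],
    "Sn": [["N"], ["An", "Sn"], ["Sn", "An"]],
    "N":  [["Su"], ["P"]],
    "An": [["Ad"], ["Av", "An"], ["An", "Av"]],
    "V":  [["Ve"], ["Av", "V"], ["V", "Av"]],
    "Su": [["ideias"], ["gato"]],
    "P":  [["elas"]],
    "Ve": [["dormem"]],
    "Ad": [["verdes"], ["incolores"]],
    "Av": [["furiosamente"]],
}

def expand(symbol, depth=0):
    if symbol not in rules:
        return [symbol]            # terminal: a final word
    options = rules[symbol]
    if depth > 4:
        options = [options[0]]     # past the bound, take the non-recursive option
    production = random.choice(options)
    return [w for s in production for w in expand(s, depth + 1)]

random.seed(1)
print(" ".join(expand("F")))
```

Each run produces a syntactically valid sentence, whether or not it makes semantic sense.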

A lexical form is any concatenation of symbols, and may contain intermediate (nonterminal) and final (terminal) symbols in any proportion. Examples of lexical forms:

F
"verde"
Sn
"incolores" . "ideias" . "verdes"
Sn . V . Sn
"incolores" . Sn . Ad
"gato" . "amarelas"

A word is a special lexical form composed only of final symbols. Therefore, of the examples above, only the following are words:

"verde"
"incolores" . "ideias" . "verdes"
"gato" . "amarelas"

Note that the grammatical structure does not require the generated word to make sense in the language, because there is no semantic evaluation, only syntactic. An example of nonsense is having something ("ideias") that has a color ("verdes") and at the same time has no color ("incolores").

The language generated by a grammar is the set of all words that can be generated with the grammar starting from the initial symbol; this language is denoted L(G).

Classification of grammars

As Victor Stafusa has already said, there are several types of grammars. They are differentiated by the form their productions may take.

A context-sensitive grammar has the following form:

A V B -> A W B
A B -> B A
V -> Z

Note that V turned into W given the context: A as prefix and B as suffix. On the other hand, V turning into Z involves no context at all. The production V -> Z is therefore said to be context-free.

A context-free grammar is a grammar whose productions are all context-free. For a production to be context-free, the left-hand side (the producer) may contain only a single nonterminal symbol, while the right-hand side (the produced) is unrestricted.

As a curiosity, a regular grammar follows the form below, where V and W are nonterminal symbols and a is any terminal symbol:

V -> a W
V -> a
V -> ε

In chapter 7 of my undergraduate monograph, I used context-sensitive grammars to demonstrate some simulation properties of Petri nets.

Context-free languages

A language is context-free if it can be described by a context-free grammar. Simple as that, isn't it?

Conversely, to show that a language is not context-free, it is necessary to demonstrate that no context-free grammar can build it, even considering all the infinitely many possible productions.

For a grammar to generate an infinite language, one thing must be true: some nonterminal symbol must produce itself. Roughly, it would look something like this (always expanding the symbol in parentheses):

S -> A(B)C -> AXT(Z)LC -> ... -> AXTPQO S AHFUQLC

From the symbol S, we reach S again, preceded by AXTPQO and followed by AHFUQLC. That means I can also obtain the lexical form AXTPQO AXTPQO S AHFUQLC AHFUQLC, and also AXTPQO AXTPQO AXTPQO S AHFUQLC AHFUQLC AHFUQLC, and in general (AXTPQO)^n S (AHFUQLC)^n. This operation (leaving S and arriving back at S) is called pumping.
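The pumping operation described above is easy to sketch: each extra pass from S back to S multiplies the prefix and the suffix (the `pump` helper is mine, purely illustrative):

```python
# Sketch of pumping: from the derivation S =>* AXTPQO S AHFUQLC,
# going around the S-loop n times yields (AXTPQO)^n S (AHFUQLC)^n.
def pump(prefix, suffix, n, core="S"):
    return prefix * n + core + suffix * n

for n in range(1, 4):
    print(pump("AXTPQO ", " AHFUQLC", n))
```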

Visually, Wikipedia has this image of the pumping lemma showing the generalization of the previous paragraph (N pumping N):

(image: the pumping lemma, N pumping N)

Every infinite context-free language obeys this pumping lemma.

EDIT: An alternative name for this pumping production is self-embedding; N is nested within the productions of N.

Direct answers

What is a context-free language?

A language generated by a context-free grammar.

How is it possible, if it is possible, to define whether a language is context-free or not?

In general, it is enough to exhibit a context-free grammar that generates it, but this task is not always trivial.

There are cases where it is impossible to know whether a context-sensitive language is context-free. Even proving that a language is not context-free is difficult, because even languages that satisfy the pumping lemma may fail to be context-free.
