Formal Languages and Grammars: Defining String Structures

Explore formal languages and grammars in discrete mathematics. This guide explains how formal grammars define formal languages, focusing on phrase structure grammars and their components (vocabulary, terminals, start symbol, productions).

Formal Languages and Grammars in Discrete Mathematics

Formal Languages

A formal language is a set of strings (sequences of symbols) defined by strict rules. Unlike natural languages (like English), formal languages have precise grammars that dictate what constitutes a valid "sentence" or string in the language.

Formal Grammars

A formal grammar provides a way to describe a formal language. It's a set of rules that define how to construct valid strings. We'll explore a type of grammar called a phrase structure grammar.

Key Definitions in Phrase Structure Grammars

Vocabulary (V): A finite set of symbols used to build strings.
Word (or Sentence): A finite-length string of symbols from the vocabulary.
Empty String (λ): A string with no symbols.
Set of Words (V*): The set of all possible words that can be formed from the vocabulary V.
Language (L): A subset of V* (a collection of specific words).

Phrase Structure Grammars

A phrase structure grammar G is defined as a 4-tuple:

G = (V, T, S, P)

V: The vocabulary (all symbols).
T: The terminal symbols (symbols that appear in the final strings of the language).
S: The start symbol (the symbol we begin with to generate strings).
P: The production rules (rules that show how to replace symbols to create strings).

N = V - T represents the non-terminal symbols (symbols that are replaced during string generation).

Derivability

A string w₁ is directly derivable from a string w₀ (written w₀ ⇒ w₁) if w₀ can be transformed into w₁ by applying one production rule. A string w_n is derivable from w₀ (written w₀ ⇒* w_n) if a sequence of production rules can transform w₀ into w_n. This sequence is called a derivation.

Language Generated by a Grammar

The language generated by a grammar G, denoted L(G), is the set of all terminal strings that can be derived from the start symbol S. Formally:

L(G) = {w ∈ T* | S ⇒* w}

Types of Grammars

Grammars are classified into types based on restrictions on their production rules (w₁ → w₂):

Type 0 (Unrestricted): No restrictions on productions.
Type 1 (Context-Sensitive): w₁ = αAβ and w₂ = αγβ, where A is a non-terminal, α, β, and γ are strings of symbols, and γ ≠ λ (unless w₁ = S and w₂ = λ, and S doesn't appear on the right side of any other production).
Type 2 (Context-Free): w₁ is a single non-terminal symbol (A).
Type 3 (Regular): w₁ is a single non-terminal symbol (A), and w₂ is either a terminal symbol followed by a non-terminal (aB) or a single terminal symbol (a).

Derivation Trees

A derivation tree (or parse tree) is a graphical representation of a derivation in a context-free grammar. The root is the start symbol, internal nodes are non-terminal symbols, leaves are terminal symbols, and the tree shows how the production rules were applied.

Backus-Naur Form (BNF)

BNF is a notation for describing context-free grammars. It's often used to specify the syntax of programming languages. Instead of listing individual productions, BNF groups productions with the same left-hand side non-terminal. It uses "::=" instead of "→" to represent productions and uses "|" to separate alternatives on the right-hand side.

(An illustrative example of BNF notation should be included here.)

Conclusion

Formal languages and grammars are fundamental to computer science, providing a framework for specifying and analyzing programming languages, defining data structures, and building compilers and interpreters.

Follow On

TutorialsArena

Formal Languages and Grammars: Defining String Structures

Formal Languages and Grammars in Discrete Mathematics