Formal Languages and Grammars: Defining String Structures

Explore formal languages and grammars in discrete mathematics. This guide explains how formal grammars define formal languages, focusing on phrase structure grammars and their components (vocabulary, terminals, start symbol, productions).



Formal Languages and Grammars in Discrete Mathematics

Formal Languages

A formal language is a set of strings (sequences of symbols) defined by strict rules. Unlike natural languages (like English), formal languages have precise grammars that dictate what constitutes a valid "sentence" or string in the language.

Formal Grammars

A formal grammar provides a way to describe a formal language. It's a set of rules that define how to construct valid strings. We'll explore a type of grammar called a phrase structure grammar.

Key Definitions in Phrase Structure Grammars

  • Vocabulary (V): A finite set of symbols used to build strings.
  • Word (or Sentence): A finite-length string of symbols from the vocabulary.
  • Empty String (λ): A string with no symbols.
  • Set of Words (V*): The set of all possible words that can be formed from the vocabulary V.
  • Language (L): A subset of V* (a collection of specific words).

Phrase Structure Grammars

A phrase structure grammar G is defined as a 4-tuple:

G = (V, T, S, P)

  • V: The vocabulary (all symbols).
  • T: The terminal symbols (symbols that appear in the final strings of the language).
  • S: The start symbol (the symbol we begin with to generate strings).
  • P: The production rules (rules that show how to replace symbols to create strings).

N = V - T represents the non-terminal symbols (symbols that are replaced during string generation).

Derivability

A string w1 is directly derivable from a string w0 (written w0 ⇒ w1) if w0 can be transformed into w1 by applying one production rule. A string wn is derivable from w0 (written w0 ⇒* wn) if a sequence of production rules can transform w0 into wn. This sequence is called a derivation.

Language Generated by a Grammar

The language generated by a grammar G, denoted L(G), is the set of all terminal strings that can be derived from the start symbol S. Formally:

L(G) = {w ∈ T* | S ⇒* w}

Types of Grammars

Grammars are classified into types based on restrictions on their production rules (w1 → w2):

  • Type 0 (Unrestricted): No restrictions on productions.
  • Type 1 (Context-Sensitive): w1 = αAβ and w2 = αγβ, where A is a non-terminal, α, β, and γ are strings of symbols, and γ ≠ λ (unless w1 = S and w2 = λ, and S doesn't appear on the right side of any other production).
  • Type 2 (Context-Free): w1 is a single non-terminal symbol (A).
  • Type 3 (Regular): w1 is a single non-terminal symbol (A), and w2 is either a terminal symbol followed by a non-terminal (aB) or a single terminal symbol (a).

Derivation Trees

A derivation tree (or parse tree) is a graphical representation of a derivation in a context-free grammar. The root is the start symbol, internal nodes are non-terminal symbols, leaves are terminal symbols, and the tree shows how the production rules were applied.

Backus-Naur Form (BNF)

BNF is a notation for describing context-free grammars. It's often used to specify the syntax of programming languages. Instead of listing individual productions, BNF groups productions with the same left-hand side non-terminal. It uses "::=" instead of "→" to represent productions and uses "|" to separate alternatives on the right-hand side.

(An illustrative example of BNF notation should be included here.)

Conclusion

Formal languages and grammars are fundamental to computer science, providing a framework for specifying and analyzing programming languages, defining data structures, and building compilers and interpreters.