NLP - Word Level Analysis

In this chapter, we will explore word-level analysis in Natural Language Processing.



Regular Expressions

A regular expression (RE) is a pattern that specifies a set of strings. It is used for searching and matching text in various applications, including UNIX and MS Word.

Properties of Regular Expressions

  • Formalized by American Mathematician Stephen Cole Kleene.
  • Defines patterns that characterize sets of strings using specific syntax.
  • Requires a search pattern and a corpus of text.

Mathematically, a Regular Expression can include:

  • ε: Represents an empty string.
  • φ: Represents an empty language.
  • X, Y: Any regular expressions can be combined with operations like:
    • X.Y: Concatenation
    • X+Y: Union
    • X*, Y*: Kleene closure

Examples of Regular Expressions

Regular Expressions Regular Set
(0 + 10*) {0, 1, 10, 100, 1000, …}
(0*10*) {1, 01, 10, 010, 0010, …}
(0 + ε)(1 + ε) {ε, 0, 1, 01}
(a+b)* Set of strings of 'a's and 'b's of any length including null.
(a+b)*abb Set of strings ending with 'abb'.

Regular Sets & Their Properties

Regular sets represent the values of regular expressions and have specific properties:

  • Union of two regular sets is regular.
  • Intersection of two regular sets is regular.
  • Complement of a regular set is regular.
  • Difference between two regular sets is regular.
  • Reversal of a regular set is regular.
  • Closure of a regular set is regular.
  • Concatenation of two regular sets is regular.

Finite State Automata (FSA)

A Finite State Automaton (FSA) is an abstract machine with a finite number of states that processes inputs according to predetermined rules. It is defined by a 5-tuple (Q, Σ, δ, q0, F):

  • Q: Finite set of states.
  • Σ: Alphabet of symbols.
  • δ: Transition function.
  • q0: Initial state (q0 ∈ Q).
  • F: Set of final states (F ⊆ Q).

Relation Between Finite Automata, Regular Grammars, and Regular Expressions

Finite automata, regular grammars, and regular expressions all describe regular languages. A regular expression can be implemented as an FSA, and any FSA can be described using a regular expression.

Types of Finite State Automata

Deterministic Finite Automaton (DFA)

A DFA has a deterministic state for each input symbol. It is defined by a 5-tuple (Q, Σ, δ, q0, F) and represented graphically by state diagrams.

Example of DFA

Given DFA:

  • Q = {a, b, c}
  • Σ = {0, 1}
  • q0 = {a}
  • F = {c}
Current State Next State for Input 0 Next State for Input 1
A A B
B B A
C C C

Non-deterministic Finite Automaton (NDFA)

In NDFA, the next state for an input symbol may be multiple states or none. It is also defined by a 5-tuple (Q, Σ, δ, q0, F) and can be represented by state diagrams similar to DFA.

Example of NDFA

Given NDFA:

  • Q = {a, b, c}
  • Σ = {0, 1}
  • q0 = {a}
  • F = {c}
Current State Next State for Input 0 Next State for Input 1
A A, B B
B C A, C
C B, C C

Morphological Parsing

Morphological parsing involves breaking down words into smaller meaningful units called morphemes. For example, "foxes" can be broken into "fox" and "-es".

Types of Morphemes

  • Stems: Core meaningful unit, e.g., "fox" in "foxes".
  • Affixes: Add meaning and grammatical function, e.g., "-es" in "foxes".

Types of Affixes

  • Prefixes: Before the stem, e.g., "un-" in "unbuckle".
  • Suffixes: After the stem, e.g., "-s" in "cats".
  • Infixes: Inserted inside the stem, e.g., "-s" in "cupsful".
  • Circumfixes: Surround the stem, e.g., "A-ing" in a word.

Requirements for Building a Morphological Parser

  • Lexicon: List of stems and affixes with basic information.
  • Morphotactics: Model explaining morpheme ordering within words.
  • Orthographic Rules: Spelling rules to model changes in word forms, e.g., "city" + "s" = "cities".