NLP - Word Level Analysis

In this chapter, we will explore word-level analysis in Natural Language Processing.

Regular Expressions

A regular expression (RE) is a pattern that specifies a set of strings. It is used for searching and matching text in various applications, including UNIX and MS Word.

Properties of Regular Expressions

Formalized by American Mathematician Stephen Cole Kleene.
Defines patterns that characterize sets of strings using specific syntax.
Requires a search pattern and a corpus of text.

Mathematically, a Regular Expression can include:

ε: Represents an empty string.
φ: Represents an empty language.
X, Y: Any regular expressions can be combined with operations like:
- X.Y: Concatenation
- X+Y: Union
- X*, Y*: Kleene closure

Examples of Regular Expressions

Regular Expressions	Regular Set
(0 + 10*)	{0, 1, 10, 100, 1000, …}
(010)	{1, 01, 10, 010, 0010, …}
(0 + ε)(1 + ε)	{ε, 0, 1, 01}
(a+b)*	Set of strings of 'a's and 'b's of any length including null.
(a+b)*abb	Set of strings ending with 'abb'.

Regular Sets & Their Properties

Regular sets represent the values of regular expressions and have specific properties:

Union of two regular sets is regular.
Intersection of two regular sets is regular.
Complement of a regular set is regular.
Difference between two regular sets is regular.
Reversal of a regular set is regular.
Closure of a regular set is regular.
Concatenation of two regular sets is regular.

Finite State Automata (FSA)

A Finite State Automaton (FSA) is an abstract machine with a finite number of states that processes inputs according to predetermined rules. It is defined by a 5-tuple (Q, Σ, δ, q0, F):

Q: Finite set of states.
Σ: Alphabet of symbols.
δ: Transition function.
q0: Initial state (q0 ∈ Q).
F: Set of final states (F ⊆ Q).

Relation Between Finite Automata, Regular Grammars, and Regular Expressions

Finite automata, regular grammars, and regular expressions all describe regular languages. A regular expression can be implemented as an FSA, and any FSA can be described using a regular expression.

Types of Finite State Automata

Deterministic Finite Automaton (DFA)

A DFA has a deterministic state for each input symbol. It is defined by a 5-tuple (Q, Σ, δ, q0, F) and represented graphically by state diagrams.

Example of DFA

Given DFA:

Q = {a, b, c}
Σ = {0, 1}
q0 = {a}
F = {c}

Current State	Next State for Input 0	Next State for Input 1
A	A	B
B	B	A
C	C	C

Non-deterministic Finite Automaton (NDFA)

In NDFA, the next state for an input symbol may be multiple states or none. It is also defined by a 5-tuple (Q, Σ, δ, q0, F) and can be represented by state diagrams similar to DFA.

Example of NDFA

Given NDFA:

Q = {a, b, c}
Σ = {0, 1}
q0 = {a}
F = {c}

Current State	Next State for Input 0	Next State for Input 1
A	A, B	B
B	C	A, C
C	B, C	C

Morphological Parsing

Morphological parsing involves breaking down words into smaller meaningful units called morphemes. For example, "foxes" can be broken into "fox" and "-es".

Types of Morphemes

Stems: Core meaningful unit, e.g., "fox" in "foxes".
Affixes: Add meaning and grammatical function, e.g., "-es" in "foxes".

Types of Affixes

Prefixes: Before the stem, e.g., "un-" in "unbuckle".
Suffixes: After the stem, e.g., "-s" in "cats".
Infixes: Inserted inside the stem, e.g., "-s" in "cupsful".
Circumfixes: Surround the stem, e.g., "A-ing" in a word.

Requirements for Building a Morphological Parser

Lexicon: List of stems and affixes with basic information.
Morphotactics: Model explaining morpheme ordering within words.
Orthographic Rules: Spelling rules to model changes in word forms, e.g., "city" + "s" = "cities".

Follow On

TutorialsArena