NLP - Word Level Analysis
In this chapter, we will explore word-level analysis in Natural Language Processing.
Regular Expressions
A regular expression (RE) is a pattern that specifies a set of strings. It is used for searching and matching text in various applications, including UNIX and MS Word.
Properties of Regular Expressions
- Formalized by American Mathematician Stephen Cole Kleene.
- Defines patterns that characterize sets of strings using specific syntax.
- Requires a search pattern and a corpus of text.
Mathematically, a Regular Expression can include:
- ε: Represents an empty string.
- φ: Represents an empty language.
- X, Y: Any regular expressions can be combined with operations like:
- X.Y: Concatenation
- X+Y: Union
- X*, Y*: Kleene closure
Examples of Regular Expressions
Regular Expressions | Regular Set |
---|---|
(0 + 10*) | {0, 1, 10, 100, 1000, …} |
(0*10*) | {1, 01, 10, 010, 0010, …} |
(0 + ε)(1 + ε) | {ε, 0, 1, 01} |
(a+b)* | Set of strings of 'a's and 'b's of any length including null. |
(a+b)*abb | Set of strings ending with 'abb'. |
Regular Sets & Their Properties
Regular sets represent the values of regular expressions and have specific properties:
- Union of two regular sets is regular.
- Intersection of two regular sets is regular.
- Complement of a regular set is regular.
- Difference between two regular sets is regular.
- Reversal of a regular set is regular.
- Closure of a regular set is regular.
- Concatenation of two regular sets is regular.
Finite State Automata (FSA)
A Finite State Automaton (FSA) is an abstract machine with a finite number of states that processes inputs according to predetermined rules. It is defined by a 5-tuple (Q, Σ, δ, q0, F):
- Q: Finite set of states.
- Σ: Alphabet of symbols.
- δ: Transition function.
- q0: Initial state (q0 ∈ Q).
- F: Set of final states (F ⊆ Q).
Relation Between Finite Automata, Regular Grammars, and Regular Expressions
Finite automata, regular grammars, and regular expressions all describe regular languages. A regular expression can be implemented as an FSA, and any FSA can be described using a regular expression.
Types of Finite State Automata
Deterministic Finite Automaton (DFA)
A DFA has a deterministic state for each input symbol. It is defined by a 5-tuple (Q, Σ, δ, q0, F) and represented graphically by state diagrams.
Example of DFA
Given DFA:
- Q = {a, b, c}
- Σ = {0, 1}
- q0 = {a}
- F = {c}
Current State | Next State for Input 0 | Next State for Input 1 |
---|---|---|
A | A | B |
B | B | A |
C | C | C |
Non-deterministic Finite Automaton (NDFA)
In NDFA, the next state for an input symbol may be multiple states or none. It is also defined by a 5-tuple (Q, Σ, δ, q0, F) and can be represented by state diagrams similar to DFA.
Example of NDFA
Given NDFA:
- Q = {a, b, c}
- Σ = {0, 1}
- q0 = {a}
- F = {c}
Current State | Next State for Input 0 | Next State for Input 1 |
---|---|---|
A | A, B | B |
B | C | A, C |
C | B, C | C |
Morphological Parsing
Morphological parsing involves breaking down words into smaller meaningful units called morphemes. For example, "foxes" can be broken into "fox" and "-es".
Types of Morphemes
- Stems: Core meaningful unit, e.g., "fox" in "foxes".
- Affixes: Add meaning and grammatical function, e.g., "-es" in "foxes".
Types of Affixes
- Prefixes: Before the stem, e.g., "un-" in "unbuckle".
- Suffixes: After the stem, e.g., "-s" in "cats".
- Infixes: Inserted inside the stem, e.g., "-s" in "cupsful".
- Circumfixes: Surround the stem, e.g., "A-ing" in a word.
Requirements for Building a Morphological Parser
- Lexicon: List of stems and affixes with basic information.
- Morphotactics: Model explaining morpheme ordering within words.
- Orthographic Rules: Spelling rules to model changes in word forms, e.g., "city" + "s" = "cities".