Natural Language Processing (NLP) with Python
Learn about Natural Language Processing (NLP) using Python. Explore Python's features, prerequisites, and key libraries such as NLTK, gensim, and pattern. Understand essential NLP techniques like tokenization, stemming, lemmatization, and chunking.
Python Features for NLP
Python stands out due to several features that make it ideal for NLP:
- Interpreted Language: Python code is executed by the interpreter at runtime without needing to compile.
- Interactive: Python allows direct interaction with the interpreter for writing and testing code.
- Object-Oriented: Python supports object-oriented programming, making it easier to write and manage code.
- Beginner-Friendly: Python is considered a beginner’s language due to its simplicity and readability, supporting a wide range of applications.
Prerequisites
Python 3 is available for all major operating systems; any reasonably recent 3.x release will work for the examples in this article. Here’s how to install Python:
- Windows: Download from Python.org
- Mac OS: Download from Python.org
- Linux: Use the package manager. For Ubuntu, run:
$ sudo apt-get install python3-minimal
Getting Started with NLTK
We will use the NLTK (Natural Language Toolkit) library for text analysis in Python. NLTK is a powerful library designed for working with human language data.
Installing NLTK
Install NLTK using pip:
pip install nltk
Or, if using Anaconda, install with Conda:
conda install -c anaconda nltk
Downloading NLTK Data
After installing NLTK, download its data using:
import nltk
nltk.download()
Other Necessary Packages
For comprehensive NLP, you might need additional packages:
- gensim: A semantic modeling library. Install with:
pip install gensim
- pattern: A web-mining and NLP library that some gensim features can use as an optional dependency. Install with:
pip install pattern
Tokenization
Tokenization is the process of breaking text into smaller units called tokens, such as words or punctuation marks. NLTK provides several tokenization tools:
- Sentence Tokenization: Splits text into sentences.
from nltk.tokenize import sent_tokenize
- Word Tokenization: Splits text into words.
from nltk.tokenize import word_tokenize
- WordPunctTokenizer: Splits text into words and punctuation marks.
from nltk.tokenize import WordPunctTokenizer
Stemming
Stemming reduces words to their root form by stripping suffixes according to heuristic rules; the result is not always a dictionary word. NLTK provides several stemming algorithms:
- PorterStemmer: Uses Porter’s algorithm for stemming.
from nltk.stem.porter import PorterStemmer
- LancasterStemmer: Uses Lancaster’s algorithm.
from nltk.stem.lancaster import LancasterStemmer
- SnowballStemmer: Uses the Snowball stemming framework, which supports stemmers for multiple languages.
from nltk.stem.snowball import SnowballStemmer
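The three stemmers can be run side by side on the same words to see how aggressively each one strips suffixes. A minimal sketch (the sample words are illustrative):

```python
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer("english")  # SnowballStemmer requires a language name

# Sample words chosen to show differing behaviour between the algorithms
for word in ["running", "flies", "happily"]:
    print(f"{word}: porter={porter.stem(word)} "
          f"lancaster={lancaster.stem(word)} snowball={snowball.stem(word)}")
```

Lancaster tends to be the most aggressive of the three, which can conflate unrelated words; Porter and Snowball are usually safer defaults.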
Lemmatization
Lemmatization reduces words to their dictionary base form (the lemma) using a vocabulary and morphological analysis, so unlike stemming it always produces a real word. The WordNetLemmatizer class in NLTK provides this functionality:
from nltk.stem import WordNetLemmatizer
Counting POS Tags – Chunking
Chunking groups part-of-speech (POS) tagged words into short phrases, such as noun phrases, based on grammatical rules. Here’s how to perform noun-phrase chunking using NLTK:
Example
Define the sentence and chunking grammar, then parse the sentence:
import nltk
sentence = [("a", "DT"), ("clever", "JJ"), ("fox", "NN"), ("was", "VBD"), ("jumping", "VBG"), ("over", "IN"), ("the", "DT"), ("wall", "NN")]
grammar = "NP: {<DT>?<JJ>*<NN>}"
parser_chunking = nltk.RegexpParser(grammar)
output = parser_chunking.parse(sentence)
output.draw()  # opens a GUI window showing the parse tree; use print(output) in non-GUI environments