Natural Language Processing (NLP) with Python

Learn about Natural Language Processing (NLP) using Python. Explore Python's features, prerequisites, and key libraries such as NLTK, gensim, and pattern. Understand essential NLP techniques like tokenization, stemming, lemmatization, and chunking.



Python Features for NLP

Python stands out due to several features that make it ideal for NLP:

  • Interpreted Language: Python code is executed by the interpreter at runtime without needing to compile.
  • Interactive: Python allows direct interaction with the interpreter for writing and testing code.
  • Object-Oriented: Python supports object-oriented programming, making it easier to write and manage code.
  • Beginner-Friendly: Python is considered a beginner’s language due to its simplicity and readability, supporting a wide range of applications.

Prerequisites

Python 3 is available for all major operating systems. Here’s how to install it:

  • Windows: Download from Python.org
  • macOS: Download from Python.org
  • Linux: Use your distribution’s package manager. For Ubuntu, run: $ sudo apt-get install python3-minimal

Getting Started with NLTK

We will use the NLTK (Natural Language Toolkit) library for text analysis in Python. NLTK is a powerful library designed for working with human language data.

Installing NLTK

Install NLTK using pip:


pip install nltk

Or, if using Anaconda, install with Conda:


conda install -c anaconda nltk

Downloading NLTK Data

After installing NLTK, download its data using:


import nltk
nltk.download()
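
Calling nltk.download() with no arguments opens an interactive downloader, which is inconvenient in scripts. As a sketch of a non-interactive alternative, you can fetch specific data packages by name (the identifiers below, 'punkt' and 'wordnet', are the NLTK data packages used later in this tutorial):

```python
import nltk

# Fetch only the data packages needed later in this tutorial:
# 'punkt' backs the tokenizers, 'wordnet' backs the lemmatizer.
nltk.download('punkt', quiet=True)
nltk.download('wordnet', quiet=True)
```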

Other Necessary Packages

For comprehensive NLP, you might need additional packages:

  • gensim: A semantic modeling library. Install with:
    
    pip install gensim
    
  • pattern: An optional package that gensim uses for some of its functionality. Install with:
    
    pip install pattern
    

Tokenization

Tokenization is the process of breaking text into smaller units called tokens, such as words or punctuation marks. NLTK provides several tokenization tools:

  • Sentence Tokenization: Splits text into sentences.
    
    from nltk.tokenize import sent_tokenize
    
  • Word Tokenization: Splits text into words.
    
    from nltk.tokenize import word_tokenize
    
  • WordPunctTokenizer: Splits text into words and punctuation marks.
    
    from nltk.tokenize import WordPunctTokenizer
    

Stemming

Stemming is used to reduce words to their base forms. NLTK provides several stemming algorithms:

  • PorterStemmer: Uses Porter’s algorithm for stemming.
    
    from nltk.stem.porter import PorterStemmer
    
  • LancasterStemmer: Uses Lancaster’s algorithm.
    
    from nltk.stem.lancaster import LancasterStemmer
    
  • SnowballStemmer: Uses the Snowball stemming algorithm, which supports multiple languages.
    
    from nltk.stem.snowball import SnowballStemmer
    
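A short sketch comparing the three stemmers on the same words (none of them require extra NLTK data downloads):

```python
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer("english")  # Snowball needs a language name

# Note that stems need not be dictionary words, e.g. "flies" -> "fli".
for word in ["running", "flies", "generously"]:
    print(word, porter.stem(word), lancaster.stem(word), snowball.stem(word))
```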

Lemmatization

Lemmatization is a more linguistically informed way to reduce words to their base forms: unlike stemming, it returns valid dictionary words (lemmas). The WordNetLemmatizer class in NLTK provides this functionality:


from nltk.stem import WordNetLemmatizer

Chunking

Chunking identifies parts of speech (POS) and short phrases. It involves creating chunks based on grammatical rules. Here’s how to perform noun-phrase chunking using NLTK:

Example

Define the sentence and chunking grammar, then parse the sentence:


import nltk
sentence = [("a", "DT"), ("clever", "JJ"), ("fox", "NN"), ("was", "VBP"), ("jumping", "VBP"), ("over", "IN"), ("the", "DT"), ("wall", "NN")]
grammar = "NP: {<DT>?<JJ>*<NN>}"
parser_chunking = nltk.RegexpParser(grammar)
output = parser_chunking.parse(sentence)
output.draw()
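
Note that output.draw() opens a Tkinter window, which is unavailable in headless environments such as servers or CI. A sketch of a text-only alternative is to iterate over the tree's NP subtrees and print them instead:

```python
import nltk

# Pre-tagged sentence: (word, POS tag) pairs, as in the example above.
sentence = [("a", "DT"), ("clever", "JJ"), ("fox", "NN"), ("was", "VBP"),
            ("jumping", "VBP"), ("over", "IN"), ("the", "DT"), ("wall", "NN")]

# NP = optional determiner, any number of adjectives, then a noun.
parser = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = parser.parse(sentence)

# Print each detected noun phrase instead of drawing the tree.
for np in tree.subtrees(filter=lambda t: t.label() == "NP"):
    print(np)
```

This prints the two noun phrases the grammar matches: "a clever fox" and "the wall".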