Introduction
Natural language processing (NLP) is a subfield of artificial intelligence (AI) that focuses on the interaction between computers and human language. It is used for tasks like sentiment analysis of product reviews, text summarization, text generation, language translation, and countless more. Essentially, if you want to solve a problem that involves natural language, it's an NLP problem.
The early days of NLP involved scientists writing if-else statements to encode large sets of hand-crafted rules, so what you could accomplish was very limited. Since then, AI has come a long way, and we can now use neural networks, like recurrent neural networks (RNNs), to work with text data. Luckily for us, frameworks like TensorFlow and PyTorch let us easily create models that learn from text data and produce whatever we're looking for. Most people use Python for deep learning, so any code mentioned will be Python. Some popular Python libraries for NLP are NLTK (Natural Language Toolkit), which is used here for cleaning text data and applying some of the techniques mentioned later, and TensorFlow for the actual model creation. In this post's case, TensorFlow is also used for tokenization, and a few other libraries handle smaller parts: spaCy, re, spellchecker, and collections.
When you want to create a neural network for an NLP task, like a Long Short-Term Memory (LSTM) recurrent neural network, you first have to clean the text so the computer can find patterns in it at all. When using a transformer such as GPT-4 or BERT, most of the text preprocessing is done inside the model's pipeline, so you won't need to clean the text yourself, but the techniques mentioned below assume you want to know how to preprocess text data for non-transformer based models. Text preprocessing usually involves cleaning, tokenizing (breaking sentences into words), and embedding (representing words as vectors where similar words are numerically closer together).
Text Cleaning
Text cleaning is the process of preparing your text before converting words into numerical representations for your model to find patterns. The primary goal is to ensure that different forms of the same word are treated as equal. For example, “long,” “Long,” and “Looong” should all be recognized as “long” by the computer. This section outlines the fundamental steps to achieve a normalized dataset, which you can adjust based on your specific needs.
Lowercasing
Convert all text to lowercase to avoid treating something like “Hot” and “hot” as different words.
Python Example:
def lowercase(text):
    return text.lower()
URLs
Remove URLs to retain only the text.
Python Example:
import re
def remove_urls(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)
Emoticons
Remove emoticons and emojis, as they do not convey meaning to the computer in the same way words do.
Python Example:
import re
def remove_emoticons(text):
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)
Special Characters & Numbers
Remove special characters and numbers, unless they are relevant to your analysis.
Python Example:
import re
def remove_characters(text):
    return re.sub(r'[^A-Za-z ]+', '', text)
Stop Words
Remove stop words, which are common words that do not contribute much to the meaning.
Python Example:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

def remove_stopwords(text):
    stop_words = set(stopwords.words("english"))
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered_words)
Spell Correction
Correct spelling mistakes to ensure consistency in word representation.
Python Example:
from spellchecker import SpellChecker
def correct_spellings(text):
    if not text.strip():
        return text
    spell = SpellChecker()
    corrected_text = []
    misspelled_words = spell.unknown(text.split())

    for word in text.split():
        if word in misspelled_words:
            suggested_correction = spell.correction(word)
            if suggested_correction and "'" not in suggested_correction:
                corrected_text.append(suggested_correction)
            else:
                corrected_text.append(word)
        else:
            corrected_text.append(word)
    return " ".join(corrected_text)
Part-of-Speech
Tag each word with its part of speech to provide context to the model.
Python Example:
import nltk
nltk.download('averaged_perceptron_tagger')

def get_wordnet_pos(text):
    tag = nltk.pos_tag([text])[0][1][0].upper()
    tag_dict = {"J": "a",  # Adjective
                "N": "n",  # Noun
                "V": "v",  # Verb
                "R": "r"}  # Adverb
    return tag_dict.get(tag, "n")
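If you want to see what this returns in practice, a quick check like the one below can help. The sample words are arbitrary placeholders; the function simply maps each word's part-of-speech tag to the single-letter code the WordNet lemmatizer expects.
# Quick check of the tagger on a few standalone words
for w in ["running", "quickly", "happy"]:
    print(w, get_wordnet_pos(w))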
Lemmatizing vs Stemming
Once you’ve made it this far, should you lemmatize or stem your words? Lemmatization maps a word to its dictionary root form (its lemma), while stemming simply chops the suffix off according to fixed rules. A good example of this is the word “studies”: a stemmer would reduce it to “studi”, while a lemmatizer would return “study”. You can see that lemmatizing converts a word into its actual root word, while stemming can create what is almost a second version of the word, which your model might treat as a completely separate token. This is why I highly suggest lemmatizing your words rather than stemming them; there is a quick comparison of the two after the example below.
Python Example:
from nltk.stem import WordNetLemmatizer
nltk.download('punkt')
nltk.download('wordnet')

def lemmatize(text):
    lemmatizer = WordNetLemmatizer()
    lemmatized_words = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in nltk.word_tokenize(text)]
    return " ".join(lemmatized_words)
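To make the contrast concrete, here is a small side-by-side sketch. It uses NLTK's PorterStemmer for the stemming half, which the rest of this post doesn't otherwise rely on:
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"))          # 'studi' -- a chopped form that isn't a real word
print(lemmatizer.lemmatize("studies"))  # 'study' -- the dictionary root word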
Other Options
There is more you can do to clean or preprocess your text data. Some examples are marking negations, handling specific elements like hashtags, converting dates to a single format, converting numbers to words, and so on. Depending on your end goal, you might even skip some of the cleaning mentioned above. If you are planning on using a transformer in the future, pretty much all of this text preprocessing is done for you through pre-trained models, which cuts a lot of time from this stage.
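If you do keep most of the steps above, one way to tie them together is a single helper that chains the functions defined earlier. This is only a sketch: the ordering is a reasonable default rather than the one true pipeline, and raw_texts is a placeholder for your own list of documents.
def clean_text(text):
    # Chain the cleaning steps defined above; drop or reorder them to fit your task
    text = lowercase(text)
    text = remove_urls(text)
    text = remove_emoticons(text)
    text = remove_characters(text)
    text = remove_stopwords(text)
    text = correct_spellings(text)
    return lemmatize(text)

cleaned_texts = [clean_text(t) for t in raw_texts]  # raw_texts: your own list of documents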
Tokenization
After cleaning, split sentences into words (tokens), create sequences, and pad them to ensure consistent vector lengths.
Code
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
# Define parameters
MAX_VOCAB_SIZE = max_vocab_size        # Maximum number of words to keep in the vocabulary
MAX_SEQUENCE_LENGTH = max_seq_length   # Maximum length of each sequence

# Assuming you have these variables defined:
# all_texts: list of all text samples
# train_texts, val_texts, test_texts: lists of text samples for each set
# train_labels, val_labels, test_labels: corresponding labels
# max_vocab_size, max_seq_length: your chosen vocabulary and sequence-length limits

# Tokenization
tokenizer = Tokenizer(num_words=MAX_VOCAB_SIZE)

# Fit tokenizer on all available texts
tokenizer.fit_on_texts(all_texts)

# Convert texts to sequences
train_sequences = tokenizer.texts_to_sequences(train_texts)
val_sequences = tokenizer.texts_to_sequences(val_texts)
test_sequences = tokenizer.texts_to_sequences(test_texts)

# Pad sequences
train_padded = pad_sequences(train_sequences, maxlen=MAX_SEQUENCE_LENGTH, padding='post')
val_padded = pad_sequences(val_sequences, maxlen=MAX_SEQUENCE_LENGTH, padding='post')
test_padded = pad_sequences(test_sequences, maxlen=MAX_SEQUENCE_LENGTH, padding='post')

# Convert labels to numpy arrays
train_labels = np.array(train_labels)
val_labels = np.array(val_labels)
test_labels = np.array(test_labels)

# Now train_padded, val_padded, and test_padded are your vectorized inputs
# train_labels, val_labels, and test_labels are your corresponding labels
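If you want a quick sanity check that everything lines up, something like the snippet below (assuming the variables above exist) prints the vocabulary size and the shape of the padded training data:
# Quick sanity check of the tokenizer output
print(len(tokenizer.word_index))  # number of unique words the tokenizer has seen
print(train_padded.shape)         # (number of training samples, MAX_SEQUENCE_LENGTH)
print(train_padded[0])            # first sample as a padded sequence of word indices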
Pre-trained Embeddings
The final part of text preprocessing is embeddings. Realistically, unless you have serious compute power, you're not going to train your own embeddings, so you can use something like GloVe, Word2Vec, or similar, where others have already embedded huge vocabularies of words to capture their spatial relationships to each other. Below is an example using GloVe; it creates the vector representations of the words you've just cleaned and tokenized, ready to feed into your deep learning model.
If in the future you are planning on using a pre-trained model like BERT or GPT, you will not need this step, as the model already does all of this work for you.
Code
import numpy as np
def load_glove_embeddings(filepath, word_index, embedding_dim):
    """
    Load GloVe embeddings for the words in the tokenizer's word index.

    Args:
        filepath (str): Path to the GloVe embeddings file.
        word_index (dict): Word-to-index mapping from the tokenizer.
        embedding_dim (int): Dimensionality of the GloVe embeddings.

    Returns:
        numpy.ndarray: Embedding matrix for the vocabulary.
    """
    vocab_size = len(word_index) + 1  # Adding 1 for padding token
    embedding_matrix = np.zeros((vocab_size, embedding_dim))

    with open(filepath, encoding="utf8") as f:
        for line in f:
            values = line.split()
            word = values[0]
            if word in word_index:
                idx = word_index[word]
                embedding_matrix[idx] = np.asarray(values[1:], dtype='float32')

    return embedding_matrix

# Example usage:
# Assuming you have already created and fitted a tokenizer
GLOVE_PATH = 'path/to/glove.42B.300d.txt'  # Update this path
EMBEDDING_DIM = 300

# Load GloVe embeddings
embedding_matrix = load_glove_embeddings(GLOVE_PATH, tokenizer.word_index, EMBEDDING_DIM)

# Now you can use this embedding_matrix in your model, for example:
# model.add(Embedding(vocab_size, EMBEDDING_DIM,
#                     weights=[embedding_matrix],
#                     input_length=MAX_SEQUENCE_LENGTH,
#                     trainable=False))
Conclusion
At this point you should have cleaned your text data, tokenized it, created padded sequences from the tokens, and used pre-trained word embeddings to vectorize the data and capture relationships within it. Now you can feed it into a neural network! I would suggest a Bidirectional Long Short-Term Memory (BiLSTM) RNN, as it tends to be a good, safe option when training an NLP model yourself.
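As a rough sketch of what that could look like with the pieces built in this post, here is a minimal BiLSTM classifier. It assumes a binary task such as sentiment analysis, and the layer sizes, dropout, and training settings are placeholders you would tune for your own data:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout

vocab_size = len(tokenizer.word_index) + 1  # matches the embedding matrix built above

model = Sequential([
    Embedding(vocab_size, EMBEDDING_DIM,
              weights=[embedding_matrix],
              input_length=MAX_SEQUENCE_LENGTH,
              trainable=False),             # keep the GloVe vectors frozen
    Bidirectional(LSTM(64)),                # placeholder layer size
    Dropout(0.5),
    Dense(1, activation='sigmoid')          # assumes a binary label like positive/negative
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(train_padded, train_labels,
          validation_data=(val_padded, val_labels),
          epochs=5, batch_size=32)          # placeholder training settings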
There’s a lot to be done when trying to process text for NLP, and now you know how to turn natural language into computer language! One that would then have to be turned from numbers into the actual computer’s language… Anyways, go train a custom model! Natural language processing is everywhere in the world now and has great applications in just about everything. If you have ever made a stock trading bot that uses machine learning to analyze trends in numeric data and make predictions, you can also use NLP to review news articles about companies and run sentiment analysis to help determine whether the stock will move. Nevertheless, remember to preprocess all text until it’s as bare as it can get, and your model will be happy!
Side note, I believe that because of the creation of transformers, most recurrent neural networks (yes, even the better optimized ones) aren’t nearly as good as transformers when it comes to pattern recognition and memorization of text. The transformer models that are currently out for consumer use like Llama 3, GPT-4o, Claude 3, etc. are all trained with billions of dollars so they work very, VERY well. You know who doesn’t have billions of dollars? Probably you (I’m sorry), so generally using a pre-trained transformer model is the way to go if you want to do any projects involving NLP. You can even fine-tune one of these models so it can perform better at a specific task. I think it’s good to know the different core neural networks before tackling something like a transformer, but transformers are kind of the way to go now. You can thank NVIDIA for that.
Support
You’ve reached the end of my blog post! 🎉🎉🎉
If you learned something and would like to support me, feel free to buy me a coffee :)