Introduction
Natural language processing (NLP) is a subfield of artificial intelligence (AI) that focuses on the interaction between computers and human language. It is used for tasks like sentiment analysis of product reviews, text summarization, text generation, language translation, and countless more. Essentially, if you want to solve a problem that involves natural language, it's an NLP problem.
The early days of NLP involved scientists writing if-else statements to encode large sets of hand-crafted rules, so what you could accomplish was very limited. Since then, AI has come a long way, and we can now use neural networks, like recurrent neural networks (RNNs), to work with text data. Luckily for us, frameworks like TensorFlow and PyTorch let us easily create models that learn from text data and produce whatever we're looking for. Most people use Python for deep learning, so any code mentioned will be Python. Some popular Python libraries for NLP are NLTK (Natural Language Toolkit), which is used here for cleaning text data and applying some of the techniques mentioned later, and TensorFlow for the actual model creation. In this post's case, TensorFlow is also used for tokenization, and a few other libraries handle smaller parts: spaCy, re, spellchecker, and collections.
When you want to create a neural network for an NLP task, like a Long Short-Term Memory (LSTM) recurrent neural network, you first have to clean the text so the computer can find patterns in it at all. When using a transformer such as GPT-4 or BERT, most of the text preprocessing is done inside the model's pipeline, so you won't need to clean the text yourself, but the techniques mentioned below assume you want to know how to preprocess text data for non-transformer based models. Text preprocessing usually involves cleaning, tokenizing (breaking sentences into words), and embedding (representing words as vectors where similar words are numerically closer together).
Text Cleaning
Text cleaning is the process of preparing your text before converting words into numerical representations for your model to find patterns. The primary goal is to ensure that different forms of the same word are treated as equal. For example, “long,” “Long,” and “Looong” should all be recognized as “long” by the computer. This section outlines the fundamental steps to achieve a normalized dataset, which you can adjust based on your specific needs.
Lowercasing
Convert all text to lowercase to avoid treating something like “Hot” and “hot” as different words.
Python Example:
def lowercase(text):
    return text.lower()
URLs
Remove URLs to retain only the text.
Python Example:
import re
def remove_urls(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)
Emoticons
Remove emoticons and emojis, as they do not convey meaning to the computer in the same way words do.
Python Example:
import re
def remove_emoticons(text):
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)
Special Characters & Numbers
Remove special characters and numbers, unless they are relevant to your analysis.
Python Example:
import re
def remove_characters(text):
    return re.sub(r'[^A-Za-z ]+', '', text)
Stop Words
Remove stop words, which are common words that do not contribute much to the meaning.
Python Example:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

def remove_stopwords(text):
    stop_words = set(stopwords.words("english"))
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered_words)
Spell Correction
Correct spelling mistakes to ensure consistency in word representation.
Python Example:
from spellchecker import SpellChecker
def correct_spellings(text):
    if not text.strip():
        return text
    spell = SpellChecker()
    corrected_text = []
    misspelled_words = spell.unknown(text.split())

    for word in text.split():
        if word in misspelled_words:
            suggested_correction = spell.correction(word)
            if suggested_correction and "'" not in suggested_correction:
                corrected_text.append(suggested_correction)
            else:
                corrected_text.append(word)
        else:
            corrected_text.append(word)
    return " ".join(corrected_text)
Part-of-Speech
Tag each word with its part of speech to provide context to the model.
Python Example:
import nltk
nltk.download('averaged_perceptron_tagger')

def get_wordnet_pos(text):
    tag = nltk.pos_tag([text])[0][1][0].upper()
    tag_dict = {"J": "a",  # Adjective
                "N": "n",  # Noun
                "V": "v",  # Verb
                "R": "r"}  # Adverb
    return tag_dict.get(tag, "n")
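If you want to see what this returns in practice, a quick check like the one below can help. The sample words are arbitrary placeholders; the function simply maps each word's part-of-speech tag to the single-letter code the WordNet lemmatizer expects.
# Quick check of the tagger on a few standalone words
for w in ["running", "quickly", "happy"]:
    print(w, get_wordnet_pos(w))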
Lemmatizing vs Stemming
Once you’ve made it this far, should you lemmatize or stem your words? Lemmatization maps a word to its dictionary root form (its lemma), while stemming simply chops the suffix off according to fixed rules. A good example of this is the word “studies”: a stemmer would reduce it to “studi”, while a lemmatizer would return “study”. You can see that lemmatizing converts a word into its actual root word, while stemming can create what is almost a second version of the word, which your model might treat as a completely separate token. This is why I highly suggest lemmatizing your words rather than stemming them; there is a quick comparison of the two after the example below.
Python Example:
from nltk.stem import WordNetLemmatizer
nltk.download('punkt')
nltk.download('wordnet')

def lemmatize(text):
    lemmatizer = WordNetLemmatizer()
    lemmatized_words = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in nltk.word_tokenize(text)]
    return " ".join(lemmatized_words)
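To make the contrast concrete, here is a small side-by-side sketch. It uses NLTK's PorterStemmer for the stemming half, which the rest of this post doesn't otherwise rely on:
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"))          # 'studi' -- a chopped form that isn't a real word
print(lemmatizer.lemmatize("studies"))  # 'study' -- the dictionary root word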
Other Options
There is more you can do to clean or preprocess your text data. Some examples are marking negations, handling specific elements like hashtags, converting dates to a single format, converting numbers to words, and so on. Depending on your end goal, you might even skip some of the cleaning mentioned above. If you are planning on using a transformer in the future, pretty much all of this text preprocessing is done for you through pre-trained models, which cuts a lot of time from this stage.
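If you do keep most of the steps above, one way to tie them together is a single helper that chains the functions defined earlier. This is only a sketch: the ordering is a reasonable default rather than the one true pipeline, and raw_texts is a placeholder for your own list of documents.
def clean_text(text):
    # Chain the cleaning steps defined above; drop or reorder them to fit your task
    text = lowercase(text)
    text = remove_urls(text)
    text = remove_emoticons(text)
    text = remove_characters(text)
    text = remove_stopwords(text)
    text = correct_spellings(text)
    return lemmatize(text)

cleaned_texts = [clean_text(t) for t in raw_texts]  # raw_texts: your own list of documents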
Tokenization
After cleaning, split sentences into words (tokens), create sequences, and pad them to ensure consistent vector lengths.
Code
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
# Define parameters
MAX_VOCAB_SIZE = max_vocab_size        # Maximum number of words to keep in the vocabulary
MAX_SEQUENCE_LENGTH = max_seq_length   # Maximum length of each sequence

# Assuming you have these variables defined:
# all_texts: list of all text samples
# train_texts, val_texts, test_texts: lists of text samples for each set
# train_labels, val_labels, test_labels: corresponding labels
# max_vocab_size, max_seq_length: your chosen vocabulary and sequence-length limits

# Tokenization
tokenizer = Tokenizer(num_words=MAX_VOCAB_SIZE)

# Fit tokenizer on all available texts
tokenizer.fit_on_texts(all_texts)

# Convert texts to sequences
train_sequences = tokenizer.texts_to_sequences(train_texts)
val_sequences = tokenizer.texts_to_sequences(val_texts)
test_sequences = tokenizer.texts_to_sequences(test_texts)

# Pad sequences
train_padded = pad_sequences(train_sequences, maxlen=MAX_SEQUENCE_LENGTH, padding='post')
val_padded = pad_sequences(val_sequences, maxlen=MAX_SEQUENCE_LENGTH, padding='post')
test_padded = pad_sequences(test_sequences, maxlen=MAX_SEQUENCE_LENGTH, padding='post')

# Convert labels to numpy arrays
train_labels = np.array(train_labels)
val_labels = np.array(val_labels)
test_labels = np.array(test_labels)

# Now train_padded, val_padded, and test_padded are your vectorized inputs
# train_labels, val_labels, and test_labels are your corresponding labels
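If you want a quick sanity check that everything lines up, something like the snippet below (assuming the variables above exist) prints the vocabulary size and the shape of the padded training data:
# Quick sanity check of the tokenizer output
print(len(tokenizer.word_index))  # number of unique words the tokenizer has seen
print(train_padded.shape)         # (number of training samples, MAX_SEQUENCE_LENGTH)
print(train_padded[0])            # first sample as a padded sequence of word indices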
Pre-trained Embeddings
The final part of text preprocessing is embeddings. Realistically, unless you have serious compute power, you're not going to train your own embeddings, so you can use something like GloVe, Word2Vec, or similar, where others have already embedded huge vocabularies of words to capture their spatial relationships to each other. Below is an example using GloVe; it creates the vector representations of the words you've just cleaned and tokenized, ready to feed into your deep learning model.
If in the future you are planning on using a pre-trained model like BERT or GPT, you will not need this step, as the model already does all of this work for you.
Code
import numpy as np
def load_glove_embeddings(filepath, word_index, embedding_dim):
    """
    Load GloVe embeddings for the words in the tokenizer's word index.

    Args:
        filepath (str): Path to the GloVe embeddings file.
        word_index (dict): Word-to-index mapping from the tokenizer.
        embedding_dim (int): Dimensionality of the GloVe embeddings.

    Returns:
        numpy.ndarray: Embedding matrix for the vocabulary.
    """
    vocab_size = len(word_index) + 1  # Adding 1 for padding token
    embedding_matrix = np.zeros((vocab_size, embedding_dim))

    with open(filepath, encoding="utf8") as f:
        for line in f:
            values = line.split()
            word = values[0]
            if word in word_index:
                idx = word_index[word]
                embedding_matrix[idx] = np.asarray(values[1:], dtype='float32')

    return embedding_matrix

# Example usage:
# Assuming you have already created and fitted a tokenizer
GLOVE_PATH = 'path/to/glove.42B.300d.txt'  # Update this path
EMBEDDING_DIM = 300

# Load GloVe embeddings
embedding_matrix = load_glove_embeddings(GLOVE_PATH, tokenizer.word_index, EMBEDDING_DIM)

# Now you can use this embedding_matrix in your model, for example:
# model.add(Embedding(vocab_size, EMBEDDING_DIM,
#                     weights=[embedding_matrix],
#                     input_length=MAX_SEQUENCE_LENGTH,
#                     trainable=False))
Conclusion
At this point you should have cleaned your text data, tokenized it, created padded sequences from the tokens, and used pre-trained word embeddings to vectorize the data and capture relationships within it. Now you can feed it into a neural network! I would suggest a Bidirectional Long Short-Term Memory (BiLSTM) RNN, as it tends to be a good, safe option when training an NLP model yourself.
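As a rough sketch of what that could look like with the pieces built in this post, here is a minimal BiLSTM classifier. It assumes a binary task such as sentiment analysis, and the layer sizes, dropout, and training settings are placeholders you would tune for your own data:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout

vocab_size = len(tokenizer.word_index) + 1  # matches the embedding matrix built above

model = Sequential([
    Embedding(vocab_size, EMBEDDING_DIM,
              weights=[embedding_matrix],
              input_length=MAX_SEQUENCE_LENGTH,
              trainable=False),             # keep the GloVe vectors frozen
    Bidirectional(LSTM(64)),                # placeholder layer size
    Dropout(0.5),
    Dense(1, activation='sigmoid')          # assumes a binary label like positive/negative
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(train_padded, train_labels,
          validation_data=(val_padded, val_labels),
          epochs=5, batch_size=32)          # placeholder training settings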
There’s a lot to be done when trying to process text for NLP, and now you know how to turn natural language into computer language! One that would then have to be turned from numbers into the actual computer’s language… Anyways, go train a custom model! Natural language processing is everywhere in the world now and has great applications in just about everything. If you have ever made a stock trading bot that uses machine learning to analyze trends in numeric data and make predictions, you can also use NLP to review news articles about companies and run sentiment analysis to help determine whether the stock will move. Nevertheless, remember to preprocess all text until it’s as bare as it can get, and your model will be happy!
Side note, I believe that because of the creation of transformers, most recurrent neural networks (yes, even the better optimized ones) aren’t nearly as good as transformers when it comes to pattern recognition and memorization of text. The transformer models that are currently out for consumer use like Llama 3, GPT-4o, Claude 3, etc. are all trained with billions of dollars so they work very, VERY well. You know who doesn’t have billions of dollars? Probably you (I’m sorry), so generally using a pre-trained transformer model is the way to go if you want to do any projects involving NLP. You can even fine-tune one of these models so it can perform better at a specific task. I think it’s good to know the different core neural networks before tackling something like a transformer, but transformers are kind of the way to go now. You can thank NVIDIA for that.
Support
You’ve reached the end of my blog post! 🎉🎉🎉
If you learned something and would like to support me, feel free to buy me a coffee :)