Natural Language Processing (NLP) is a sub-field of Artificial Intelligence. It deals with interactions between computers and humans such that computers can process and analyze our language as it is.
Every sort of intelligence represents an ability to navigate a particular set of challenges well, e.g. mathematical, linguistic, technical, commercial or emotional intelligence.
Linguistic intelligence stands tall among them. For humans as social creatures, communication is part and parcel of everyday life. Our daily activities involve interaction not only with fellow humans but also with inanimate objects such as the virtual assistants on our smartphones.
So what is it that makes it so hard for computers to understand us?
- One of the key challenges is the lack of a well-defined structure in our languages. Let us consider some examples that present a clearer picture:
- Consider the SQL command SELECT name FROM students. This statement presents no ambiguity: it clearly asks the computer to retrieve a list of all the names from the students table.
- Consider the case of mathematics: if we write 10x = 2y + 5, the relation between x and y is unambiguous. 10 times 'x' equals 2 times 'y' plus 5.
- A more familiar example: the programming languages we use to communicate with computers have a set of rules, or grammar, that makes them highly structured. The slightest defiance of these rules leads to a syntax error.
- A part of this challenge lies in the complexity and variability of our sentences. Consider, for example, one of my favorite excerpts from Charles Dickens' A Tale of Two Cities: "The time will come, the time will not be long in coming, when new ties will be formed about you - ties that will bind you yet more tenderly and strongly to the home you so adorn - the dearest ties that will ever grace and gladden you. O Miss Manette, when the little picture of a happy father's face looks up in yours, when you see your own bright beauty springing up anew at your feet, think now and then that there is a man who would give his life, to keep a life you love beside you!"
However much we love such a sentence, so deep and adorned with figures of speech, for computers such sentences are no less than a nightmare.
- Consider these lines:
1. "The sofa did not fit through the door as it was too narrow."
2. "The sofa didn't fit through the door as it was too wide."
Here, in the first sentence "it" refers to the door, while in the second, "it" refers to the sofa. We apply our knowledge of the physical world (wide objects do not fit through narrow openings) to figure out what "it" refers to in each sentence.
- Another challenge is the use of slang!
- And of course, my favorite: puns!
Though the languages we use are well within the bounds laid down by grammar, human discourse is complex, unstructured and often ambiguous. Yet we seem to have no trouble understanding each other; even ambiguity like sarcasm, satire or irony is welcomed to some extent.
So what can computers do to understand unstructured text?
That's where Natural Language Processing comes in. NLP is concerned with interactions between computers and humans such that computers can analyze and process large amounts of natural, unstructured language data and learn to understand human language as it is. Here are some preliminary ideas for carrying this out:
- Computers can process words and phrases to identify:
- Parts of speech
- Named entities
- Keywords, etc.
- Using this, they can parse sentences to gather the relevant parts of statements, questions or instructions.
- At a higher level, computers can analyze documents to find frequent and rare words, assess the overall tone or sentiment being expressed, and even cluster similar documents together.
Natural Language Processing, despite so many challenges, is one of the fastest-growing fields in AI. It finds many applications in our daily lives, for instance word suggestions in Gmail or GBoard, sorting reviews, etc.
Natural Language Processing Pipeline:
A Natural Language Processing pipeline consists of three stages:
- Text processing
- Feature extraction
- Modeling
Each stage transforms text in some way and produces a result that is appropriate to be given as input to the next stage. Let’s take a closer look at each of the stages.
1. Text Processing:
The source of input data could be a website, a PDF, a Word document or even speech. We need to extract textual data that is free of any source-specific markers or constructs that aren't relevant to our task. Furthermore, capitalization and punctuation marks should be normalized, and common words that only provide structure to a sentence (like a, an, the, of, etc.) should be removed.
Data pre-processing:
Reading text data:
- Text data present on our local machines can be read using Python's built-in file-input mechanism:
# read in a plain text file
with open("file_name.txt", "r") as f:
    text = f.read()
- For tabular data, pandas can be used to read the file:
import pandas as pd
df = pd.read_csv("file_name.csv")
- To fetch data from an online source such as an API:
Most APIs return .xml or .json data, so you'll need to be aware of the structure in order to pull out the fields you need.
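A minimal sketch using the requests library (the endpoint URL here is a made-up placeholder):

import requests

# fetch data from a hypothetical JSON API endpoint
r = requests.get("https://api.example.com/articles")
data = r.json()  # parse the JSON body into Python dicts/lists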
Cleaning the data:
If the text was fetched from a web page, HTML tags and markup can be stripped using BeautifulSoup:
from bs4 import BeautifulSoup

# r is the response object returned by requests.get() above
soup = BeautifulSoup(r.text, "html5lib")
text = soup.get_text()
Next, convert all of the text into the same case, preferably lowercase.
We can use a regular expression that replaces everything that is not a lowercase or uppercase letter or a digit from 0 to 9 with a space, as in the sketch below.
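A minimal sketch of both steps, assuming the extracted text is in a variable named text:

import re

# normalize case
text = text.lower()

# replace everything that is not a letter or a digit with a space
text = re.sub(r"[^a-zA-Z0-9]", " ", text)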
Tokenization:
In Natural Language Processing, tokenization is simply splitting each sentence into its constituent words.
The simplest way to achieve this is the built-in split() function, which returns a list of the constituent words. By default, this function splits the text on whitespace.
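For example, reusing the text variable from above:

# naive tokenization on whitespace
words = text.split()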
Another method is to use NLTK, the Natural Language Toolkit, for tokenization:
from nltk.tokenize import word_tokenize
words = word_tokenize(text)
To isolate each sentence:
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(text)
The NLTK library also provides a regular-expression tokenizer, which removes punctuation and tokenizes in a single pass, as well as a tweet tokenizer that is aware of Twitter handles, emoticons, hashtags, etc.
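A short sketch of both (the example strings are made up for illustration):

from nltk.tokenize import RegexpTokenizer, TweetTokenizer

# keep only runs of word characters, dropping punctuation in the same pass
regexp_tokenizer = RegexpTokenizer(r"\w+")
print(regexp_tokenizer.tokenize("Dr. Smith's parser works!"))

# the tweet tokenizer keeps handles, hashtags and emoticons intact
tweet_tokenizer = TweetTokenizer()
print(tweet_tokenizer.tokenize("@nlp_fan loving #NLProc :)"))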
Stop word removal:
Stop words are uninformative words that do not add meaning to a sentence but add to its structure, e.g. a, an, the, of, and, are, etc.
The NLTK library ships with a corpus of stop words for several languages, which can be used to filter them out of the token list:
from nltk.corpus import stopwords
words = [w for w in words if w not in stopwords.words("english")]
Parts of speech:
Part-of-speech tagging labels each word in a sentence with its grammatical role, capturing the relation between the words. It can be achieved using NLTK, as in the sketch below.
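A minimal sketch (the tokenizer and tagger models need a one-time download):

from nltk import pos_tag, word_tokenize

# one-time downloads: nltk.download("punkt"), nltk.download("averaged_perceptron_tagger")
sentence = "The sofa did not fit through the door"
print(pos_tag(word_tokenize(sentence)))
# e.g. [('The', 'DT'), ('sofa', 'NN'), ('did', 'VBD'), ('not', 'RB'), ...]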
Stemming and lemmatization:
Stemming, in natural language processing, means reducing a word to its root form, typically by stripping common suffixes.
NLTK provides a few different stemmers to choose from, such as the Porter stemmer, the Snowball stemmer and other language-specific stemmers. E.g.:
"branched", "branching" and "branches" are all stemmed to "branch".
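A quick sketch with the Porter stemmer:

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["branched", "branching", "branches"]])
# ['branch', 'branch', 'branch']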
Lemmatization is another tool for reducing words to their root form. It uses a dictionary to map words to their original forms. E.g.:
"is", "was", "were" and "are" are all lemmatized to "be".
The default lemmatizer in NLTK is the WordNetLemmatizer, which uses the WordNet database to reduce words to their root forms.
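A quick sketch (the WordNet corpus needs a one-time nltk.download("wordnet")):

from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# pos="v" tells the lemmatizer to treat each word as a verb
print([lemmatizer.lemmatize(w, pos="v") for w in ["is", "was", "were", "are"]])
# ['be', 'be', 'be', 'be']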
2. Feature Extraction:
So now that we have our text data cleaned and normalized, can it be fed directly to our Natural Language Processing model?
Not quite. Why? Let's see.
Text data is represented in a computer as the binary equivalent of its ASCII or Unicode encoding. Words are combinations of these individual characters, so internally they are just sequences of ASCII or Unicode values; they don't capture the meanings of words or the relationships between them. Hence, we need features that can be used for modeling, and these features depend on the kind of model we will be using. For instance:
If the model is going to be graph-based, words may be represented as symbolic nodes with the relationships between them as edges. For statistical models, however, the representation has to be numerical.
One-hot encoding:
Treat each word as a class and assign it a vector that has a one in a single, predetermined position and zeros everywhere else, as in the sketch below.
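A minimal sketch over a toy vocabulary:

# toy vocabulary; real vocabularies run to tens of thousands of words
vocab = ["happy", "bliss", "house", "car", "dog"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("bliss"))  # [0, 1, 0, 0, 0]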
One-hot encoding breaks down when we have a very large vocabulary to deal with, as the size of the word representation grows with the number of words in the vocabulary. Hence, we need to fix the size of the representation; that is, we need to find an embedding for each word in some vector space, and we want the embedding to exhibit some desired properties. For instance, two words similar in meaning (happy, bliss) should be placed closer together in the vector space than words that aren't similar to them (house, car, dog).
Also, if two pairs of words have a similar difference in meaning, the pairs should be approximately equally separated in the embedding space.
Word embeddings serve several purposes, such as classifying words, finding synonyms, antonyms, analogies, etc.
Word2Vec (word to vector):
As the name suggests, Word2Vec transforms words into their vector forms. The core idea behind Word2Vec is that a model able to predict a given word from its neighboring words, or vice versa, captures the contextual meanings of words. There are two forms of Word2Vec:
1. Continuous Bag of Words (CBoW):
The target word is predicted, given its neighboring words.
2. Continuous Skip-Gram:
The neighboring words are predicted, given the target word.
The given word is converted into its one-hot encoded form and fed to a model designed to predict a few surrounding words. Using a suitable loss function, the weights or parameters are optimized, and the process is repeated until the model learns to predict context words. We then take an intermediate representation, such as a hidden layer, and the output of that layer for a given word becomes the corresponding word vector.
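In practice, this training loop is rarely written from scratch; a library such as gensim wraps the whole procedure. A minimal sketch, assuming gensim 4.x and a toy corpus:

from gensim.models import Word2Vec

# toy corpus: a list of tokenized sentences
sentences = [["the", "sofa", "fits", "through", "the", "door"],
             ["the", "marker", "writes", "on", "the", "whiteboard"]]

# sg=1 selects skip-gram; sg=0 (the default) selects CBoW
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

vector = model.wv["sofa"]  # the learned 50-dimensional vector for "sofa"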
3. Model for predicting analogies using Natural Language Processing:
We will be building a Word2Vec-based model for predicting analogies. That is, if we know that "a" is related to "b" with relation "r", then our model has to predict the word "d" that is related to "c" with the same relation "r". E.g.:
whiteboard : marker :: notebook : pen
We can represent whiteboard as Vw and marker as Vm, so their vector difference is Vd = Vm - Vw.
Let's say notebook's vector is Vn and pen's is Vp, so their vector difference is Vdiff = Vp - Vn.
As we saw in the word embedding section, the difference between the vectors of whiteboard and marker should be approximately equal to that of notebook and pen, i.e. the cosine distance between Vd and Vdiff should be close to 0 (equivalently, their cosine similarity should be close to 1). Hence, it is safe to assume that:
Vp ≈ Vn + Vd = Vn + (Vm - Vw)
Here, the word whose vector stands closest to Vn + Vd is predicted to be pen.
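A minimal sketch of this analogy lookup, assuming embeddings is a dict mapping words to NumPy vectors (e.g. taken from a trained Word2Vec model):

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def predict_analogy(a, b, c, embeddings):
    # solve a : b :: c : ? by finding the word whose vector is
    # closest to embeddings[b] - embeddings[a] + embeddings[c]
    target = embeddings[b] - embeddings[a] + embeddings[c]
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine_similarity(embeddings[w], target))

# e.g. predict_analogy("whiteboard", "marker", "notebook", embeddings) -> "pen"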
For an in-depth understanding, visit: Natural Language Processing is Fun!
Although building an intelligent machine sits at the center of today's tech-savvy universe, our machines lag behind when it comes to language and communication; that is, computers cannot understand us the way a fellow human can.
Yet, despite all the challenges, Natural Language Processing is one of the fastest-growing fields and finds utility in a large number of spheres.
Stay tuned at NeuralAI for more Artificial Intelligence, deep learning and NLP projects.