The Basics of NLP

Nicholas Wu
7 min read · Sep 17, 2022

Natural Language Processing (NLP) is a subset of artificial intelligence which deals with how computers process and understand natural language, or human language. It is used to process human language in the form of speech or text so that it can be used by the computer and analyzed to recognize context and intent. It is far from perfect, and there is still much to be explored and improved in this field. However, it is already being applied in our everyday lives, whether we know it or not. Some examples include:

  1. Virtual personal assistant applications such as Siri, Alexa, and OK Google
  2. Word processors such as Grammarly and Google Docs grammar check
  3. Email filters to separate spam from important emails
  4. Translation programs such as Google Translate
  5. Chatbots for multiple applications, such as automated customer service

How it works

For humans, sentences and paragraphs are easily read, and we can extract meaning from them quickly. For computers, however, the countless rules of human language are hard to capture, making it extremely difficult to extract information from text or speech.

Thus, computers need algorithms to preprocess data into a structured, machine-readable form called a corpus. This involves techniques such as tokenization, stopword removal, and lemmatization.

After doing so, the structure, or syntax, of the sentence/text can be analyzed using different forms of parsing. These include parts of speech (POS) tagging, shallow parsing or chunking, and dependency parsing.

Finally, the meaning of the text can be analyzed using a variety of different methods. However, this article will focus on named-entity recognition.

Thankfully, you do not need to do all of this from scratch. Both NLTK and spaCy are great NLP libraries for Python, but I will be using NLTK in this article.

Preprocessing Data

Tokenization

Tokenization is the process by which text is separated into defined units such as sentences or words. Characters such as punctuation can also be thrown away at this stage. This is a useful first step for structuring the data into a usable form.

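An example with NLTK might look like this (keeping only alphabetic tokens is one simple way to drop punctuation; details of the original snippet may differ):

    import nltk
    nltk.download('punkt', quiet=True)  # tokenizer models

    text = ("Natural Language Processing (NLP) is a subset of artificial "
            "intelligence which deals with how computers process and "
            "understand natural language, or human language. It is used to "
            "process human language in the form of speech or text so that it "
            "can be used by the computer and analyzed to recognize context "
            "and intent. It is far from perfect, and there is still much to "
            "be explored and improved in this field.")

    # Lowercase the text, split it into word tokens, and drop punctuation
    tokens = [t for t in nltk.word_tokenize(text.lower()) if t.isalpha()]
    print(tokens)
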
This outputs:

['natural', 'language', 'processing', 'nlp', 'is', 'a', 'subset', 'of', 'artificial', 'intelligence', 'which', 'deals', 'with', 'how', 'computers', 'process', 'and', 'understand', 'natural', 'language', 'or', 'human', 'language', 'it', 'is', 'used', 'to', 'process', 'human', 'language', 'in', 'the', 'form', 'of', 'speech', 'or', 'text', 'so', 'that', 'it', 'can', 'be', 'used', 'by', 'the', 'computer', 'and', 'analyzed', 'to', 'recognize', 'context', 'and', 'intent', 'it', 'is', 'far', 'from', 'perfect', 'and', 'there', 'is', 'still', 'much', 'to', 'be', 'explored', 'and', 'improved', 'in', 'this', 'field']

As you can see, I used the first few sentences of this article as an example. Each individual word becomes an item in a list, all characters are lower-cased, and punctuation is removed.

Removal of Stopwords

Stopword removal is a way to cut the fat out of a text. Stopwords are words which do not add much meaning to a sentence (e.g. “the”, “and”, “to”). There is no universal list of stopwords to remove; typically either a library’s predefined list or a custom list is used. Here is NLTK’s English stopword list:

{'ourselves', 'hers', 'between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during', 'out', 'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some', 'for', 'do', 'its', 'yours', 'such', 'into', 'of', 'most', 'itself', 'other', 'off', 'is', 's', 'am', 'or', 'who', 'as', 'from', 'him', 'each', 'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your', 'his', 'through', 'don', 'nor', 'me', 'were', 'her', 'more', 'himself', 'this', 'down', 'should', 'our', 'their', 'while', 'above', 'both', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'when', 'at', 'any', 'before', 'them', 'same', 'and', 'been', 'have', 'in', 'will', 'on', 'does', 'yourselves', 'then', 'that', 'because', 'what', 'over', 'why', 'so', 'can', 'did', 'not', 'now', 'under', 'he', 'you', 'herself', 'has', 'just', 'where', 'too', 'only', 'myself', 'which', 'those', 'i', 'after', 'few', 'whom', 't', 'being', 'if', 'theirs', 'my', 'against', 'a', 'by', 'doing', 'it', 'how', 'further', 'was', 'here', 'than'}

This step reduces the amount of low-information text, so we can focus on the important parts of a sentence.

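Using the previously tokenized text, a sketch with NLTK’s built-in English stopword list might look like this:

    import nltk
    from nltk.corpus import stopwords
    nltk.download('stopwords', quiet=True)  # stopword lists

    stop_words = set(stopwords.words('english'))
    # Keep only the tokens that are not in the stopword list
    filtered = [t for t in tokens if t not in stop_words]
    print(filtered)
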
This outputs:

['natural', 'language', 'processing', 'nlp', 'subset', 'artificial', 'intelligence', 'deals', 'computers', 'process', 'understand', 'natural', 'language', 'human', 'language', 'used', 'process', 'human', 'language', 'form', 'speech', 'text', 'used', 'computer', 'analyzed', 'recognize', 'context', 'intent', 'far', 'perfect', 'still', 'much', 'explored', 'improved', 'field']

As you can see, there are far fewer words compared to the initial text, as words such as ‘is’ and ‘and’ are removed from the corpus.

Stopword removal can be problematic, as some meaning or context may be lost by throwing these words out. Thus, it is important to be careful about which words are chosen for removal.

Lemmatization

Lemmatization is the process of reducing the words of a text to their lemmas, or root forms. For example, the lemma of both ‘frustrating’ and ‘frustrated’ is ‘frustrate’. This technique groups words with similar meanings together, further simplifying the content of the data.

However, the lemmatizer works best when it is told the part of speech of each word being lemmatized. This will be discussed later on.

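For now, lemmatization with POS might look like the following sketch. NLTK’s WordNetLemmatizer expects WordNet-style POS labels, so a small helper (a standard pattern, assumed here rather than taken from the original) converts the Penn Treebank tags produced by nltk.pos_tag:

    import nltk
    from nltk.corpus import wordnet
    from nltk.stem import WordNetLemmatizer
    nltk.download('wordnet', quiet=True)
    nltk.download('averaged_perceptron_tagger', quiet=True)

    def wordnet_pos(treebank_tag):
        # Map a Penn Treebank tag to the POS constant WordNet expects
        if treebank_tag.startswith('J'):
            return wordnet.ADJ
        if treebank_tag.startswith('V'):
            return wordnet.VERB
        if treebank_tag.startswith('R'):
            return wordnet.ADV
        return wordnet.NOUN  # default to noun for everything else

    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(token, wordnet_pos(tag))
              for token, tag in nltk.pos_tag(filtered)]
    print(lemmas)
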
Resulting in:

['natural', 'language', 'processing', 'nlp', 'subset', 'artificial', 'intelligence', 'deal', 'computer', 'process', 'understand', 'natural', 'language', 'human', 'language', 'use', 'process', 'human', 'language', 'form', 'speech', 'text', 'use', 'computer', 'analyze', 'recognize', 'context', 'intent', 'far', 'perfect', 'still', 'much', 'explore', 'improved', 'field']

As you can see, plural nouns such as ‘computers’ are made singular, and past-tense verbs such as ‘analyzed’ and ‘explored’ are reduced to the base forms ‘analyze’ and ‘explore’. Note that ‘improved’ is left unchanged: as the tagging output below shows, it was treated as an adjective rather than a verb, which is one of the pitfalls of automatic POS tagging.

Parsing

Parsing is the process of separating a sentence into its parts and understanding the role each part plays. There are many different methods for parsing sentences, including parts of speech (POS) tagging, shallow parsing or chunking, and dependency parsing. In all of these methods, algorithms identify the individual elements of a text and determine their function.

Parts of Speech Tagging

POS tagging labels each word with its part of speech. Here is a list of some important POS tags used by NLTK:

  • NN: noun, singular
  • JJ: adjective
  • VB: verb, base form
  • RB: adverb
  • PRP: personal pronoun
  • IN: preposition/subordinating conjunction

These categories can be further divided into subcategories. For example, JJ (adjectives) can be separated into JJR (comparative adjectives) and JJS (superlative adjectives). This step allows us to use many other techniques, such as lemmatization, much more effectively. An example of POS tagging using our previously preprocessed text:

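Continuing the sketch, with NLTK this is a single call, here applied to the lemmatized tokens from earlier:

    # Tag each token with its Penn Treebank part-of-speech label
    tagged = nltk.pos_tag(lemmas)
    print(tagged)
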
This resulted in:

[('natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('nlp', 'JJ'), ('subset', 'VBN'), ('artificial', 'JJ'), ('intelligence', 'NN'), ('deal', 'NN'), ('computer', 'NN'), ('process', 'NN'), ('understand', 'JJ'), ('natural', 'JJ'), ('language', 'NN'), ('human', 'JJ'), ('language', 'NN'), ('use', 'NN'), ('process', 'NN'), ('human', 'JJ'), ('language', 'NN'), ('form', 'NN'), ('speech', 'NN'), ('text', 'NN'), ('use', 'NN'), ('computer', 'NN'), ('analyze', 'VBP'), ('recognize', 'VB'), ('context', 'JJ'), ('intent', 'NN'), ('far', 'RB'), ('perfect', 'JJ'), ('still', 'RB'), ('much', 'JJ'), ('explore', 'RBR'), ('improved', 'JJ'), ('field', 'NN')]

Shallow Parsing

Shallow parsing is similar to POS tagging, but sentences are separated into categories of phrases, such as noun phrases, rather than individual parts of speech. This allows for an even better understanding of sentence structure. Shallow parsing results in text being separated into both phrases and parts of speech, which can be visualized using tree diagrams.

In this example, I only parsed and categorized the noun phrases, using a given pattern of POS tags to recognize them. There are other ways to do this, including using machine learning to identify different types of phrases.

Code along the following lines produces such a diagram (the noun-phrase grammar here is a standard textbook pattern, used as a stand-in for the original):
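
    import nltk

    # A simple noun-phrase grammar: an optional determiner, any number of
    # adjectives, then any kind of noun
    grammar = 'NP: {<DT>?<JJ>*<NN.*>}'
    chunker = nltk.RegexpParser(grammar)

    tree = chunker.parse(tagged)  # 'tagged' is the (word, POS) list from above
    print(tree)   # text form of the chunked tree
    tree.draw()   # opens the tree diagram in a window (requires tkinter)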

Dependency Parsing

Dependency parsing, as the name suggests, determines the relationships and dependencies between the words in a sentence. This can help decode meaning as well as context, and it is especially helpful in real-world applications since it can recover dependencies even when the grammar or word order is incorrect. Dependencies can also be represented as a parse tree.
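
A quick way to see this in practice is spaCy (mentioned earlier), since NLTK leaves dependency parsing to external tools. This sketch assumes spaCy’s small English pipeline, en_core_web_sm, has been downloaded:

    import spacy
    from spacy import displacy

    # One-time model download: python -m spacy download en_core_web_sm
    nlp = spacy.load('en_core_web_sm')
    doc = nlp('The quick brown fox jumps over the lazy dog.')

    for token in doc:
        # Each word points to its syntactic head through a labeled relation
        print(f'{token.text:10} --{token.dep_}--> {token.head.text}')

    displacy.serve(doc, style='dep')  # renders the dependency tree in a browser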

Named Entity Recognition

Named entity recognition identifies the important parts of a sentence after it is preprocessed: entities such as people, locations, times, or organizations are identified and classified by algorithms. This technique is very important for gaining a general understanding of what a text is about.

This technique can be seen here, where I pulled the first sentence of a recent news article, preprocessed and parsed it, and performed named entity recognition using NLTK.

Original Sentence:

The Public Health Agency of Canada (PHAC) says Brig.-Gen. Krista Brodie will take over as the general in charge of overseeing the delivery and distribution of COVID-19 vaccines across Canada.

Result after Named Entity Recognition:

(S
  The/DT
  (ORGANIZATION Public/NNP Health/NNP Agency/NNP)
  of/IN
  (GPE Canada/NNP)
  (ORGANIZATION PHAC/NNP)
  says/VBZ
  Brig.-Gen./NNP
  (PERSON Krista/NNP Brodie/NNP)
  will/MD
  take/VB
  over/RP
  as/IN
  the/DT
  general/JJ
  in/IN
  charge/NN
  of/IN
  overseeing/VBG
  the/DT
  delivery/NN
  and/CC
  distribution/NN
  of/IN
  COVID-19/NNP
  vaccines/NNS
  across/IN
  (GPE Canada/NNP))

As shown, the algorithm was able to recognize the Public Health Agency and PHAC as organizations, Krista Brodie as a person, and Canada as a GPE (geopolitical entity: a country, city, or state). This provides a lot of information and can greatly help an algorithm understand the main ideas of a text.

The code for this looks roughly like the sketch below (the punctuation filter is an assumption, since no punctuation tokens appear in the output above):
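
    import string

    import nltk
    nltk.download('maxent_ne_chunker', quiet=True)  # named-entity chunker model
    nltk.download('words', quiet=True)              # word list the chunker needs

    sentence = ("The Public Health Agency of Canada (PHAC) says Brig.-Gen. "
                "Krista Brodie will take over as the general in charge of "
                "overseeing the delivery and distribution of COVID-19 "
                "vaccines across Canada.")

    # Tokenize, drop standalone punctuation, POS-tag, then chunk named entities
    tokens = [t for t in nltk.word_tokenize(sentence)
              if t not in string.punctuation]
    entities = nltk.ne_chunk(nltk.pos_tag(tokens))
    print(entities)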

This may seem like a lot, but it barely scratches the surface of NLP. The vast depth of the field can't be covered in one article, but the concepts shown here are among the most important for any NLP program to work. At the end of the day, machine learning algorithms and AI models still need to be applied to this data to extract real meaning, so that it can be used in real-life products and services.
