This is the second part of the NLP tutorial series, following the first part on techniques for text normalization.
In a previous article we talked about some basic techniques used in an NLP pipeline, which are the first steps we need to perform when starting an NLP-based process. Once we have done tokenization, stemming and lemmatization, we are ready to perform more advanced techniques, such as tagging and classifying the word classes, or lexical categories, of the parts of speech in the text being analyzed. The collection of tags used for a particular task is known as a tagset.
What is part-of-speech tagging?
Part-of-speech tagging is the process of classifying the words contained in a text into lexical categories. This task is performed by a specially designed POS tagger, which processes a sequence of words and attaches a part-of-speech tag to each one.
A tagged token is represented as a tuple consisting of the token and the tag, and the part-of-speech tag tells us how a word is used in a sentence. There are traditionally eight parts of speech in the English language: noun, pronoun, verb, adjective, adverb, preposition, conjunction, and interjection.
But let’s see it in action. In the following chunk of code we import the genesis corpus from NLTK, tokenize the text, and get the tags of the first 15 tokens.
Image from Author
Each tagged token is represented as a tuple consisting of the token and the tag. In the case of the first token ‘In’ of the text “In the beginning God created the heaven and the earth”, the corresponding tag is ‘IN’, which denotes a preposition.
We can also pass the parameter tagset=’universal’ to nltk.pos_tag in order to get some more descriptive tagsets, so in this case the resulting list would be [(‘In’, ‘ADP’), (‘the’, ‘DET’), (‘beginning’, ‘NOUN’), (‘God’, ‘NOUN’), (‘created’, ‘VERB’), (‘the’, ‘DET’), (‘heaven’, ‘NOUN’), (‘and’, ‘CONJ’), (‘the’, ‘DET’), (‘earth’, ‘NOUN’), (‘.’, ‘.’), (‘And’, ‘CONJ’), (‘the’, ‘DET’), (‘earth’, ‘NOUN’), (‘was’, ‘VERB’)].
In the tuples above we can infer that ‘ADP’ is an adposition or preposition, ‘DET’ corresponds to a determiner, and the remaining tags I believe are self-explanatory (‘NOUN’, ‘VERB’ and ‘CONJ’).
A simplified tagset with examples can be found in the following table:
Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with Python. O’Reilly.
In natural language processing, categorizing every word in a corpus with a particular part of speech depends on both the word itself and its context.
So, in the sentence above, if you are familiar with English you’d instantly infer that a determiner (‘DET’) is more likely to be followed by a ‘NOUN’ than by a ‘VERB’, for instance. The idea behind this behavior is that the tag assigned to a word depends on the tag of the previous word.
What is the relation between Markov chains and parts of speech?
The concept working behind the scenes in POS tagging is the Markov model, which according to Wikipedia is a “… stochastic model used to model randomly changing systems. It is assumed that future states depend only on the current state, not on the events that occurred before it (that is, it assumes the Markov property).” So, in order to get the probability of the next event, we only need the current state.
This idea makes Markov chains a natural fit for modeling transitions between POS tags, as the probability of transitioning from one POS to another depends only on the POS immediately preceding it. Andrey Markov proposed these models “…to predict whether an upcoming letter in Pushkin’s Eugene Onegin would be a vowel or a consonant. Markov classified 20,000 letters as V or C and computed the bigram and trigram probability that a given letter would be a vowel given the previous one or two letters.”
We will not discuss the details of Markov chains in depth, but just give a basic idea of how they help NLP models work; for more details read this article.
Image from https://brilliant.org
Let’s suppose we have a Markov chain built on states A and B, as depicted in the diagram above. Each directed arrow carries the probability of moving from one state to another: for example, a process currently in A moves to B with probability 0.3 and stays in A with probability 0.7 (the probabilities leaving a state sum to 1). The same happens in the case of parts of speech, as sentences are represented as a sequence of words (or as a graph, like in the picture above): the arrows indicate the direction of a transition (a “directed graph”), and the circles represent the states of the model.
For the example in the same image above, A would become NN (noun) and B would become VB (verb), so the directed edges would carry the probabilities of transitioning from one state to the next.
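The idea can be sketched with a tiny transition table. The numbers below are made up purely for illustration; a real tagger estimates them from a tagged corpus.

```python
import random

# Toy transition probabilities between POS states (illustrative numbers only).
transitions = {
    "NN": {"NN": 0.3, "VB": 0.7},
    "VB": {"NN": 0.6, "VB": 0.4},
}

def next_state(current, rng=random.random):
    """Sample the next POS state given only the current one (Markov property)."""
    r, cumulative = rng(), 0.0
    for state, p in transitions[current].items():
        cumulative += p
        if r < cumulative:
            return state
    return state  # fall through on floating-point edge cases

# The probability of NN -> VB depends only on the current state, NN.
print(transitions["NN"]["VB"])
```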
Tagging methods for NLP pipelines
In this section we will explore and practice different ways to automatically add part-of-speech tags to text. As we saw with Markov chains, the tag of a specific word depends on the word and its context within a sentence; therefore, we will be working with data at the level of tagged sentences, instead of tagged words.
Default Taggers
Default tagging is one of the most basic steps in the NLP tagging process. It is performed using the DefaultTagger class, which takes just one argument: the tag to assign. This process then assigns that same tag to every token, regardless of context. DefaultTagger is most useful as a baseline built on the most common part-of-speech tag, which is why a noun tag such as ‘NN’ (singular noun) is recommended.
To see this in action, we are going to download the Brown corpus from NLTK, and from it extract sentences from the science_fiction category that we will use with the DefaultTagger feature.
Image from Author
Then the resulting tagged_sents and sents for the Brown corpus are as follows:
[[('Now', 'RB'), ('that', 'CS'), ('he', 'PPS'), ('knew', 'VBD'), ('himself', 'PPL'), ('to', 'TO'), ('be', 'BE'), ('self', 'NN'), ('he', 'PPS'), ('was', 'BEDZ'), ('free', 'JJ'), ('to', 'TO'), ('grok', 'VB'), ('ever', 'QL'), ('closer', 'RBR'), ('to', 'IN'), ('his', 'PP$'), ('brothers', 'NNS'), (',', ','), ('merge', 'VB'), ('without', 'IN'), ('let', 'NN'), ('.', '.')], [("Self's", 'NN$'), ('integrity', 'NN'), ('was', 'BEDZ'), ('and', 'CC'), ('is', 'BEZ'), ('and', 'CC'), ('ever', 'RB'), ('had', 'HVD'), ('been', 'BEN'), ('.', '.')], ...]
[['Now', 'that', 'he', 'knew', 'himself', 'to', 'be', 'self', 'he', 'was', 'free', 'to', 'grok', 'ever', 'closer', 'to', 'his', 'brothers', ',', 'merge', 'without', 'let', '.'], ["Self's", 'integrity', 'was', 'and', 'is', 'and', 'ever', 'had', 'been', '.'], ...]
Here tagged_sents contains the tokens of the text paired with their tags, making tuples of (‘token’, ’tag’), while sents contains the bare tokens.
What we will analyze next is the frequency distribution of the tags extracted from that corpus, using the FreqDist() class available in the NLTK package, which, together with the most_common() method, gives us the count of each tag in the corpus.
Image from Author
Now we are ready to create a tagger that will tag everything as NN (noun). I’ve chosen a sentence from Brave New World for this exercise:
"""Turned, the babies at once fell silent, then began to crawl towards those
clusters of sleek colours, those shapes so gay and brilliant on the white
pages."""
The result of using the DefaultTagger() class applied to our sample sentence will be as follows:
Image from Author
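A minimal sketch of that step; for simplicity it uses a plain whitespace split instead of NLTK's tokenizer, so punctuation stays attached to the words:

```python
import nltk

sentence = ("Turned, the babies at once fell silent, then began to crawl "
            "towards those clusters of sleek colours, those shapes so gay "
            "and brilliant on the white pages.")

# A tagger that assigns 'NN' to every token it sees.
default_tagger = nltk.DefaultTagger("NN")
tagged = default_tagger.tag(sentence.split())
print(tagged)
```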
A very useful method in NLTK is accuracy(), with which you can evaluate the default tagger against a gold standard: manually tagged text, which is often used to evaluate algorithms for tasks such as part-of-speech tagging. It turns out that the accuracy of our default_tagger is a little over 10 percent.
Image from Author
Regular Expression Tagger
The regular expression tagger assigns tags to tokens based on regex matching patterns. For instance, any word ending in -ed is likely to be the past participle of a verb, and any word ending with ’s is likely to be a possessive noun.
A good starting point is to define some regex patterns as a list of (‘pattern’, ’tag’) tuples, and then pass those patterns to the NLTK RegexpTagger() class.
Image from Author
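A sketch of such a pattern list (these particular patterns are the classic ones from the NLTK book; the final catch-all falls back to ‘NN’):

```python
import nltk

patterns = [
    (r".*ing$", "VBG"),                 # gerunds
    (r".*ed$", "VBD"),                  # simple past
    (r".*es$", "VBZ"),                  # 3rd singular present
    (r".*ould$", "MD"),                 # modals
    (r".*\'s$", "NN$"),                 # possessive nouns
    (r".*s$", "NNS"),                   # plural nouns
    (r"^-?[0-9]+(\.[0-9]+)?$", "CD"),   # cardinal numbers
    (r".*", "NN"),                      # default: noun
]

regexp_tagger = nltk.RegexpTagger(patterns)
sample = regexp_tagger.tag(["created", "running", "dogs", "42", "God"])
print(sample)
```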
Then we will pass the RegexpTagger to one of the texts we analysed before:
Image from Author
We can again evaluate the tagger accuracy. As we may notice, it is more accurate than the default tagger, but it could be better still with a more complete set of patterns.
Image from Author
The Lookup Tagger
The main purpose of this tagger is to find the most frequent words used in a text, and then store their most likely tags. After that, we will use that information as the model for the NLTK built-in class UnigramTagger(), which basically finds the most likely tag for each word in the training corpus, and then uses that information to assign new tags.
So now let’s put it into action. In the code we have extracted the most frequent words present in the Brown corpus’s science_fiction category, and then computed the most likely tag for each of those words.
Image from Author
The final step would be to test the accuracy of our lookup tagger, and it turns out to be considerably more accurate than the default and regex taggers.
Image from Author
Final words
Part-of-speech (POS) tagging is a fundamental technique used in natural language processing (NLP) to analyze and understand the structure and meaning of sentences. It involves identifying and labeling the parts of speech of words in a sentence, such as nouns, verbs, adjectives, adverbs, and so on.
In this tutorial, we have learned how to apply these techniques to use POS tagging effectively in our NLP pipelines.
Overall, POS tagging is an important technique in NLP that helps in understanding the structure and meaning of sentences, which is crucial for many NLP applications.