Information extraction algorithms using Named Entity Recognition

This is the third part of the NLP tutorial that covers Tokenization, Lemmatization, Stemming, Part-of-Speech (POS) tagging, Named Entity Recognition (NER) and Sentiment Analysis.

Photo by ilgmyzi on Unsplash

In previous articles we learned how to work with the first steps of an NLP pipeline, such as Tokenization, Lemmatization, Stemming and Part-of-Speech (POS) tagging. This time we will cover the basics of every information extraction system, which in NLP terms is known as Named Entity Recognition.

Figure 1: Information extraction system architecture (Image by Author)

Figure 1 shows the basic architecture of a simple information extraction system. It begins by processing raw text with several of the procedures discussed in the previous articles: the text is split into sentences using a sentence segmenter, and each sentence is further subdivided into words using a tokenizer. This is the first phase of the process. In the second phase, part-of-speech tagging comes into action, a fundamental step for what comes next: named entity recognition.

What is Named Entity Recognition?

The term Named Entity was first proposed at the Message Understanding Conference (MUC-6) with the purpose of locating and classifying named entities in text into predefined categories, such as people, organizations, locations, dates, and so on.

Since then, interest in NER and Information Extraction (IE) techniques for text-based data has grown enormously, mainly across various fields and sectors that want to automate the extraction of essential information from unstructured text into more editable and structured data formats.

Common uses of Named Entity Recognition are Chatbots, Healthcare, HR Resume Summarization, Search Engine Algorithms, and Recommender Systems, among others.

Let’s see an example of how a NER algorithm works on a sample text. NER detects predefined entities in order to identify mentions of rigid designators in text belonging to particular semantic types such as person, location, organization, etc.

Figure 2: Screenshot of how a NER algorithm highlights and extracts particular entities (Image by Author)

NORP (nationalities, religious or political groups) is one of several named entity categories commonly used in NER algorithms, along with categories such as PERSON, LOCATION, ORG, and DATE. These categories help to extract structured information from unstructured text data, enabling automated analysis and processing of large volumes of text.

Unlike the previous articles, where we used the NLTK package, this time we will be using spaCy to work with our NLP pipeline.

What is spaCy?

spaCy is an open-source, Python-based NLP library designed to be fast, efficient, scalable and simple to use. It provides multiple NLP features, such as text classification, named entity recognition, part-of-speech tagging, and dependency parsing.

spaCy’s effectiveness and speed are two of its main features. It uses streamlined algorithms and data structures in order to process vast amounts of text really, really fast! This makes it a quite good option for real-time applications and/or analyzing huge datasets when performance is a must.

spaCy’s main advantages are:

  1. High performance, as it is optimized for speed, processing text faster than many other NLP libraries. It is designed to be scalable and can handle large volumes of text data efficiently.
  2. Its pre-trained models are optimized for Named Entity Recognition (NER), Part-of-Speech (POS) tagging, Dependency Parsing, and other NLP tasks. These models are trained on large corpora of text and can be fine-tuned on specific domains.
  3. Its user-friendly and intuitive API, which follows a simple and consistent interface, making it easy to learn and use for both beginners and experienced users.
  4. It allows the customization of its models, which makes it possible to train models for specific domains or tasks. Users can also add their custom components to the pipeline for more specific processing.

If you want to know more about spaCy and its architecture, it is strongly recommended to visit its web page:

[Library Architecture · spaCy API Documentation: The central data structures in spaCy are the Language class, the Vocab and the Doc object…](https://spacy.io/api "spacy.io/api")

If you followed along with the previous articles on NLP, you have already created a conda environment with NLTK installed. This time we will install the spaCy library by executing:

pip install spacy

You can always clone the code in the GitHub repository I have prepared for this tutorial.

Methods of NER

Any NER model is a two-step process, consisting of the following phases:

Figure 3: NER Model phases (Image by Author)

1. Dictionary-based

This is the simplest NER method for information extraction. It involves using pre-built dictionaries or lexicons of known entities, such as locations, companies, persons, products, events, etc., in order to identify named entities in the text.

A dictionary-based NER method is commonly used when dealing with specific domains or topics, like healthcare or sports, where named entities are very specific to those fields and can therefore be easily extracted using pre-built dictionaries.

Let’s try it with some code, using the pre-trained model for English language processing, en_core_web_sm, provided by the spaCy library.

spaCy has three main English pre-trained models optimized for NER tasks: en_core_web_sm, en_core_web_md and en_core_web_lg, listed in ascending order of size, as they stand for small, medium, and large models, respectively.
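These models are not bundled with the library itself; each one is downloaded separately. For example, for the small model used in this tutorial:

```shell
# Download the small English pre-trained model
python -m spacy download en_core_web_sm
```

The medium and large models are downloaded the same way, substituting en_core_web_md or en_core_web_lg.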

Image from Author

When this code is executed, spacy.load reads the previously downloaded en_core_web_sm model and stores it in the nlp local variable (if the model has not been downloaded first, spacy.load will raise an error). We then define a custom dictionary of known entities and, iterating over the document, assign the dictionary’s entity labels to the matching strings.
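The exact code from the screenshot isn’t reproduced here, but the core idea can be sketched even without spaCy: look up each token (or multi-word span) in a hand-built dictionary of known entities. The entity dictionary and sample sentence below are illustrative assumptions:

```python
# Minimal dictionary-based NER: match tokens and multi-word spans
# against a hand-built dictionary of known entities.
ENTITY_DICT = {
    "london": "LOCATION",
    "google": "ORG",
    "elon musk": "PERSON",
}

def dictionary_ner(text, entity_dict=ENTITY_DICT, max_span=3):
    tokens = text.lower().replace(",", "").replace(".", "").split()
    entities = []
    i = 0
    while i < len(tokens):
        match = None
        # Prefer the longest dictionary entry starting at position i
        for length in range(min(max_span, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + length])
            if span in entity_dict:
                match = (span, entity_dict[span], length)
                break
        if match:
            entities.append((match[0], match[1]))
            i += match[2]
        else:
            i += 1
    return entities

print(dictionary_ner("Elon Musk visited Google offices in London."))
# [('elon musk', 'PERSON'), ('google', 'ORG'), ('london', 'LOCATION')]
```

Note the longest-match-first loop: without it, "elon musk" would never be found because "elon" alone would be checked and skipped.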

This is considered the simplest NER method, and it is not commonly employed because the dictionaries it relies on need to be updated and maintained frequently.

2. Rule-based

This method involves defining a set of rules or patterns to identify named entities in text based on their linguistic properties, such as the part of speech of surrounding words or the presence of specific keywords.

Image from Author

In the code above we have defined some rules for matching products (the pattern variable). The function identify_products(), which takes a document as its parameter, identifies entities based on patterns in the text. It adds a custom entity ruler to the NER pipeline and defines the patterns the ruler should match. Finally, it processes a sample document and extracts the named entities tagged as “GPE” (geopolitical entities) in the document.
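The screenshot relies on spaCy’s entity ruler; as a framework-free illustration of the same idea, here is a rule-based matcher built on regular expressions. The labels, patterns, and sample sentence are illustrative assumptions, not the article’s exact rules:

```python
import re

# Each rule pairs an entity label with a regular expression describing it.
RULES = [
    ("PRODUCT", re.compile(r"\biPhone \d+\b")),
    ("GPE", re.compile(r"\b(?:San Francisco|New York|Paris)\b")),
    ("DATE", re.compile(r"\b\d{4}\b")),
]

def rule_based_ner(text):
    entities = []
    for label, pattern in RULES:
        for match in pattern.finditer(text):
            entities.append((match.group(), label, match.start()))
    # Return entities sorted by their position in the text
    return [(span, label) for span, label, _ in sorted(entities, key=lambda e: e[2])]

print(rule_based_ner("Apple unveiled the iPhone 15 in San Francisco in 2023."))
# [('iPhone 15', 'PRODUCT'), ('San Francisco', 'GPE'), ('2023', 'DATE')]
```

Rule-based methods avoid maintaining exhaustive dictionaries, but the rules themselves must be carefully written for each domain.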

3. Machine learning-based & Deep learning-based approaches

These methods overcome many of the limitations of the two above-described methods, as they are statistical models that build a feature-based representation of the observed data. Entity names can be recognized even with small spelling variations.

Figure 4: Machine Learning-based method phases (Image by Author)

As described in the illustration above, two phases are involved. First, the ML model is trained on annotated documents (this is, for example, how the en_core_web_sm model was built); then the trained model is used to annotate raw documents. The process is similar to a normal ML model pipeline.
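As a toy illustration of this train-then-annotate idea (deliberately simplified, and not how spaCy’s models work internally), here is a most-frequent-label baseline trained on a tiny hand-annotated corpus. The training data is invented for the example:

```python
from collections import Counter, defaultdict

# Phase 1: "train" on annotated tokens by counting, for each word,
# which entity label it most frequently received.
def train(annotated_docs):
    counts = defaultdict(Counter)
    for doc in annotated_docs:
        for token, label in doc:
            counts[token.lower()][label] += 1
    return {token: c.most_common(1)[0][0] for token, c in counts.items()}

# Phase 2: annotate raw text with the learned per-token labels
# ("O" marks tokens that are not part of any entity).
def annotate(model, text):
    return [(tok, model.get(tok.lower(), "O")) for tok in text.split()]

annotated = [
    [("Musk", "PERSON"), ("moved", "O"), ("to", "O"), ("Canada", "GPE")],
    [("Musk", "PERSON"), ("founded", "O"), ("SpaceX", "ORG")],
]
model = train(annotated)
print(annotate(model, "Musk visited Canada"))
# [('Musk', 'PERSON'), ('visited', 'O'), ('Canada', 'GPE')]
```

Real statistical NER models replace the per-token lookup with features of the surrounding context (neighboring words, capitalization, POS tags), which is what lets them generalize to unseen and misspelled names.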

Let’s go with an example in code, where we will process the sample text below using the pre-trained en_core_web_sm model from the spaCy library:

"Musk attended Waterkloof House Preparatory School, Bryanston High School,
and Pretoria Boys High School, from which he graduated. Musk applied for a
Canadian passport through his Canadian-born mother, knowing that it would
be easier to immigrate to the United States this way. While waiting for his
application to be processed, he attended the University of Pretoria for five
months."

Image from Author

When the sample text is passed as a parameter to the function spacy_ner(), the response will be the named entities extracted from the input text.

Some of the entities are self-explanatory, except for “GPE”, which stands for geo-political entities such as cities, states/provinces, and countries, and “FACILITY” (FAC in spaCy’s label scheme), which covers buildings and other works of architecture and civil engineering.

Conclusion

Named Entity Recognition (NER), as stated before, is particularly important in applications such as search engines, social media monitoring and chatbots, as it helps to identify and extract specific pieces of information by recognizing and classifying entities such as people, organizations, locations, dates, and other types of named entities in unstructured text.

For example, in search engines, NER can help to identify and extract relevant information from web pages, such as the names of people, organizations, or locations mentioned in an article. In social media monitoring, it helps to identify and track mentions of specific brands, products, or people, and in chatbots, NER helps to understand the intent of the user’s message and provide a more personalized response.

References

Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with Python. O’Reilly.
