How to fine-tune spaCy for NLP use cases

spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython. It is published under the MIT license.

spaCy excels at large-scale information extraction tasks. It’s written from the ground up in carefully memory-managed Cython.

spaCy is designed to help us build real products and gather real insights. It supports 73+ languages, works with custom models built in PyTorch and TensorFlow, and offers robust, rigorously evaluated accuracy.

Most people may not have heard of Cython, so let’s take a quick look at it.

Cython

Cython is a Python compiler that makes writing C extensions for Python as easy as Python itself. Cython is based on Pyrex, but supports more cutting edge functionality and optimizations.

To put it simply, it’s a Python-to-C compiler.

Quoting from Wikipedia,

Cython is a programming language, a superset of the Python programming language, designed to give C-like performance with code that is written mostly in Python with optional additional C-inspired syntax.

Wikipedia

“Should I learn Cython to fine-tune spaCy?” This question may have crossed your mind by now.

Don’t worry, you don’t need to learn Cython to fine-tune spaCy. Since it may be a new term for many readers, I thought a simple description was worth including.

Prerequisite

  1. Basic knowledge of spaCy
  2. Gather data (relevant and of good quality)

Basic knowledge of spaCy

The official spaCy documentation site provides a lot of information about the library. Alternatively, you can read another blog of mine that covers the basics of spaCy.

Gather data

To fine-tune any model, not just spaCy, you need to have the data ready, and it needs to be good. Throughout this blog, let’s assume we built event management software and want to add voice assistance to it. We already built a module that converts voice input into text. Our next step is to process this text and extract data from the given sentence using spaCy.

We have to gather some basic sentences that we hear from people trying to schedule an event. Here are a few:

  1. Schedule event for visit to Trivandrum on July 18
  2. Create event happening tomorrow on AI
  3. Schedule Pongal celebration event in Oaks HOA at June 20, 2023

Similarly, we have to collect prompts related to event scheduling. The more data you collect and feed in, the more accurate the model will be.

I created 7 sentences, which is far too small for an event management software company to train its model, but for a demo I feel it is enough.

Pre-process the data

Collecting data covers just one part of the equation. We also need to pre-process the data and transform it into a form that spaCy can easily understand, and define what kinds of data (tags) should be identified in a given sentence.

Let’s take the following sentence as an example,

“Schedule event for visit to Trivandrum on July 18”.

Let’s try to split the above sentence into tags:

  • Schedule – belongs to the “action” tag
  • event – belongs to the “domain” tag
  • visit to Trivandrum – belongs to the “name” tag
  • July 18 – belongs to the “date” tag

Every tag defined above may have alternative values in other sentences. For example, we may input the following sentences:

  1. Cancel client meeting scheduled tomorrow
  2. Change time of mall visit to 6 PM

In the above sentences, the action tags are “Cancel” and “Change” (edit). Similarly, the data for each tag may vary from sentence to sentence.

Our next step is to teach spaCy which words belong to each tag. We need to prepare a JSON file that contains example sentences along with the tags and their character indices.

For example, in the above sentence (“Schedule event for visit to Trivandrum on July 18”), the “action” tag starts at index 0 (indices always start at 0) and ends at index 7, the position of the last character of “Schedule”.
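A quick way to sanity-check these offsets is to slice the sentence in Python. Note that the end index is inclusive, which is why the conversion code later in this post adds 1 to it:

sentence = "Schedule event for visit to Trivandrum on July 18"
print(sentence[0:7 + 1])           # Schedule
print(sentence.index("July 18"))   # start index of the "date" tag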

Similarly, for all 7 sentences that I’ve chosen, I’ve prepared the indices for each tag and created the JSON file.
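The exact layout of the file is up to you, as long as the loading code matches it. The conversion code later in this post assumes a structure along these lines: an examples list whose entries hold the sentence (content) and a list of annotations with inclusive start and end character indices plus a tag_name. The indices below are for illustration and follow the offsets worked out above:

{
  "examples": [
    {
      "content": "Schedule event for visit to Trivandrum on July 18",
      "annotations": [
        { "start": 0,  "end": 7,  "tag_name": "action" },
        { "start": 9,  "end": 13, "tag_name": "domain" },
        { "start": 19, "end": 37, "tag_name": "name" },
        { "start": 42, "end": 48, "tag_name": "date" }
      ]
    }
  ]
}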

Fine-tune spaCy

Let’s try to fine-tune spaCy with the data that we have.

Create a folder, download the above JSON file, and place it in the folder. Then create a new notebook file named “custom_model.ipynb”.

Each of the following sections needs its own code block; create a new code block in the notebook wherever you see a heading.

Import spaCy
import spacy
Load the pre-trained model
nlp = spacy.load("en_core_web_lg")
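
As an optional sanity check, you can run the pre-trained pipeline on one of our sentences. It already recognises generic entities such as places and dates, but knows nothing about our custom tags yet:

doc = nlp("Schedule event for visit to Trivandrum on July 18")
print([(ent.text, ent.label_) for ent in doc.ents])
# typically something like [('Trivandrum', 'GPE'), ('July 18', 'DATE')]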
Import the JSON file

Load the JSON file that you downloaded above.

import json

with open('./event_schedule_data.json', 'r') as f:
    data = json.load(f)
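
A quick check that the file loaded correctly; with the demo data described above, there should be 7 examples:

print(len(data['examples']))  # 7 with the demo dataset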
Convert the data

Convert the data read from the JSON file into a list of dictionaries, each containing the original text and its entities.

training_data = []
for example in data['examples']:
    temp_dict = {}
    temp_dict['text'] = example['content']
    temp_dict['entities'] = []
    for annotation in example['annotations']:
        start = annotation['start']
        # the annotation's end index is inclusive; spaCy expects an exclusive end, hence the + 1
        end = annotation['end'] + 1
        label = annotation['tag_name'].upper()
        temp_dict['entities'].append((start, end, label))
    training_data.append(temp_dict)
print(training_data[0])

The above code converts the data to the required format and prints the first dictionary in the list, which will look something like this:

{'text': 'Schedule a calendar event in Teak oaks HOA about competitions happening tomorrow', 'entities': [(0, 8, 'ACTION'), (11, 25, 'DOMAIN'), (29, 42, 'HOA'), (49, 71, 'EVENT'), (72, 80, 'DATE')]}
Import training libraries
from spacy.tokens import DocBin
from tqdm import tqdm
from spacy.util import filter_spans

# A blank English pipeline is enough here; we only need its tokenizer
# to turn our texts into Doc objects for the training corpus.
nlp = spacy.blank('en')
Train the model

The code below converts our annotated examples into spaCy’s binary training format. Finally, it will generate a binary file named train.spacy.

doc_bin = DocBin()
for training_example in tqdm(training_data):
    text = training_example['text']
    labels = training_example['entities']
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in labels:
        # alignment_mode="contract" snaps the character span to token boundaries
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    # filter_spans drops overlapping spans, keeping the longest ones
    filtered_ents = filter_spans(ents)
    doc.ents = filtered_ents
    doc_bin.add(doc)

doc_bin.to_disk("train.spacy")
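
One more note: in the training command below I follow the original demo and reuse train.spacy as the dev set. For a real project you would hold out a separate dev set; a minimal sketch, assuming you are happy with a random 80/20 split of training_data, could look like this:

import random

random.shuffle(training_data)
split = int(len(training_data) * 0.8)

def to_docbin(examples):
    # Build a DocBin from a subset of the annotated examples
    db = DocBin()
    for ex in examples:
        doc = nlp.make_doc(ex['text'])
        spans = [doc.char_span(s, e, label=l, alignment_mode="contract")
                 for s, e, l in ex['entities']]
        doc.ents = filter_spans([s for s in spans if s is not None])
        db.add(doc)
    return db

to_docbin(training_data[:split]).to_disk("train.spacy")
to_docbin(training_data[split:]).to_disk("dev.spacy")

If you go this route, point --paths.dev at ./dev.spacy in the training command instead.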
Create the config files

Create a new file named base_config.cfg and copy the default NER configuration into it. You can generate this template from the training quickstart widget in the official documentation (choose English as the language and ner as the component).

Create another file named config.cfg. It will be filled in from base_config.cfg by the command in the next step.

Don’t worry, these are default configurations taken straight from the official documentation; I’ve not made any changes to them.
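
For reference, an abbreviated sketch of what the top of the English NER base config looks like (the full template from the quickstart has many more sections, which the fill-config command below completes):

[paths]
train = null
dev = null
vectors = null

[system]
gpu_allocator = null

[nlp]
lang = "en"
pipeline = ["tok2vec","ner"]
batch_size = 1000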

Initialize spaCy with the config files

Run the following command in a notebook code block to fill in config.cfg from base_config.cfg with spaCy’s defaults. This config file will then be used to train the model on our custom training data.

!python -m spacy init fill-config base_config.cfg config.cfg
Train spaCy model

Run the following command to train the spaCy model:

!python -m spacy train config.cfg --output ./ --paths.train ./train.spacy --paths.dev ./train.spacy

This may take some time depending on your system configuration, though ideally not too long (around 5 to 10 minutes). At the end, it’ll generate two folders named model-best and model-last.

Load the best model
nlp_ner = spacy.load("model-best")
Test our model

Let’s test our model with the following input.

“Could you please reserve a team brainstorming session on coming Wednesday at 11 AM?”

doc = nlp_ner("Could you please reserve a team brainstorming session on coming Wednesday at 11 AM?")

spacy.displacy.render(doc, style="ent")
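
If you prefer plain text over the rendered visualisation, you can also print the entities directly; the labels should now come from our custom tag set (ACTION, NAME, DATE, and so on) rather than spaCy’s built-in ones:

for ent in doc.ents:
    print(ent.text, ent.label_)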

You should be pleasantly surprised by the output: the entities in the sentence are highlighted with our custom tags.

That’s great, right?

At this point, you might raise a question:

“As a programmer, how can I get this data into my backend code?”

Well, that’s something everyone asks.

spaCy has an answer for it: you can expose the above data as JSON.

Convert extracted data to JSON
json_obj = doc.to_json()
json_obj

This will produce output similar to the example shown below.
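
The exact entities depend on what the model predicts, but the dictionary returned by doc.to_json() always has this general shape (abbreviated here), with the custom labels and character offsets under the "ents" key:

{
  "text": "Could you please reserve a team brainstorming session on coming Wednesday at 11 AM?",
  "ents": [
    {"start": ..., "end": ..., "label": "..."}
  ],
  "tokens": [...]
}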

Write a REST API and expose this data as JSON. That’s it. But remember, spaCy gives you only the character indices; you have to slice your sentence to extract the words between those indices.
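
For example, a small loop over the json_obj from above pulls out the matched text for each entity:

for ent in json_obj["ents"]:
    # "start" and "end" are character offsets into the original text
    print(json_obj["text"][ent["start"]:ent["end"]], "->", ent["label"])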

Conclusion

In this article, we learnt how to customize and fine-tune the pre-trained spaCy model with data that corresponds to our domain knowledge. You can train with your own domain-specific data in the same way. The model that you fine-tune stays private to you unless you expose it to the public, so this approach is well suited to domain data that is not publicly available.

Hope you enjoyed reading the article. If you wish to learn more about NLP and spaCy, subscribe by entering your email address in the box below.

Have a look at my site, which has a consolidated list of all my blogs.
