Automatic labelling of text data based on predefined entities



I'm new to NLP. I have a folder containing .txt files which are legal and specific documents. I want to label all these files based on four predefined labels. How can I do that automatically?


Posted 2019-03-25T13:41:46.047

Reputation: 11

What kinds of labels? You want to put the entire text into categories? Do you want to find where in the text a category can be found? Please give some details on what you are trying to achieve. – Simon Larsson – 2019-03-25T13:49:11.763

Labels are: 1) Money, 2) judge, 3) tribunal, 4) state of the sentence (a binary value: rejected or accepted). I want to label every text file with these labels, so ideally I want to put the files and their corresponding labels in the same folder. – GiuliaC. – 2019-03-25T14:07:42.747



The task you have is called named-entity recognition. From Wikipedia:

Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify named entity mentions in unstructured text into pre-defined categories such as the person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

Since this is a common NLP task there are libraries that are made to do NER out of the box. One such library is spaCy which can do NER as well as many other NLP tasks using Python.

You will not be able to recognize your custom labels/entities without first training a model on them. You need some labelled data to train on; maybe you already have this, or you can label it manually. spaCy wants the data labelled with the location of each entity, in the format:

[("legal text here", {"entities": [(Start index, End index, "Money"),
                                   (Start index, End index, "Judge"),
                                   (Start index, End index, "Tribunal"),
                                   (Start index, End index, "State")]}),
 ("legal text here", {"entities": [(Start index, End index, "Money"),
                                   (Start index, End index, "Judge"),
                                   (Start index, End index, "Tribunal"),
                                   (Start index, End index, "State")]})]
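
Counting the start and end indices by hand is error-prone, so here is a minimal sketch of a helper that looks them up for you. The function name `annotate` and the example sentence are made up for illustration (they are not part of spaCy), and the helper assumes you list the entities in the order they appear in the text:

```python
def annotate(text, entities):
    """Build one training example in the (text, {"entities": [...]}) format.

    `entities` is a list of (substring, label) pairs, given in the order
    they occur in `text`; each substring is located to get its offsets.
    """
    spans = []
    search_from = 0
    for substring, label in entities:
        start = text.find(substring, search_from)
        if start == -1:
            raise ValueError("%r not found in text" % substring)
        end = start + len(substring)  # end index is exclusive
        spans.append((start, end, label))
        search_from = end  # keeps the spans ordered and non-overlapping
    return (text, {"entities": spans})


# Hypothetical sentence, invented purely for illustration
example = annotate(
    "Judge Rossi of the Tribunal of Milan awarded 5000 euros.",
    [("Rossi", "Judge"), ("Tribunal of Milan", "Tribunal"), ("5000 euros", "Money")],
)
print(example)
```

A list of such examples is what goes into TRAIN_DATA in the training script.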

Example of how to train a spaCy model for NER (taken from the spaCy 2.x docs):

from __future__ import unicode_literals, print_function

import plac
import random
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding  

# training data
TRAIN_DATA = []  # insert your labelled training data here

@plac.annotations(
    model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int),
)
def main(model=None, output_dir=None, n_iter=100):
    """Load the model, set up the pipeline and train the entity recognizer."""
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")

    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner, last=True)
    # otherwise, get it so we can add labels
    else:
        ner = nlp.get_pipe("ner")

    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])  # ent is (start, end, label)

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
    with nlp.disable_pipes(*other_pipes):  # only train NER
        # reset and initialize the weights randomly – but only if we're
        # training a new model
        if model is None:
            nlp.begin_training()
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
            # batch up the examples using spaCy's minibatch
            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(
                    texts,  # batch of texts
                    annotations,  # batch of annotations
                    drop=0.5,  # dropout - make it harder to memorise data
                    losses=losses,
                )
            print("Losses", losses)

    # test the trained model
    for text, _ in TRAIN_DATA:
        doc = nlp(text)
        print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
        print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        for text, _ in TRAIN_DATA:
            doc = nlp2(text)
            print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
            print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])

if __name__ == "__main__":
    plac.call(main)

Then when you have a trained model you can use it to get your entities:

doc = nlp('put legal text to test your model here')

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
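
Since you want the labels stored next to the files, something like the following could work as a rough sketch. The function name `label_folder`, the `.labels.json` naming and the JSON layout are my own assumptions, not anything spaCy prescribes. The extractor is passed in as a plain callable so the folder logic stays independent of the model:

```python
import json
from pathlib import Path


def label_folder(folder, extract_entities):
    """Write a <name>.labels.json file next to every <name>.txt in `folder`.

    `extract_entities` is any callable that maps a text to a list of
    (entity_text, start_char, end_char, label) tuples.
    """
    written = []
    for txt_path in sorted(Path(folder).glob("*.txt")):
        text = txt_path.read_text(encoding="utf-8")
        records = [
            {"text": ent_text, "start": start, "end": end, "label": label}
            for ent_text, start, end, label in extract_entities(text)
        ]
        # Hypothetical naming scheme: case1.txt -> case1.labels.json
        out_path = txt_path.with_name(txt_path.stem + ".labels.json")
        out_path.write_text(
            json.dumps(records, indent=2, ensure_ascii=False), encoding="utf-8"
        )
        written.append(out_path)
    return written
```

With a trained model loaded as nlp, the extractor could be `lambda text: [(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in nlp(text).ents]`, and the call would be `label_folder("path/to/your/folder", extractor)`.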

Simon Larsson

Posted 2019-03-25T13:41:46.047

Reputation: 3 498

I think OP is talking about document classification rather than NER. – Esmailian – 2019-03-25T17:00:50.503

For me the labels only made sense with NER. Money, judge, tribunal and state of the sentence do not really seem like categories to sort entire documents into. But the question is quite vague so my assumption might be off. :) – Simon Larsson – 2019-03-25T17:07:47.057