## How to annotate text documents with meta-data?

21

4

Having a lot of text documents (in natural language, unstructured), what are the possible ways of annotating them with some semantic meta-data? For example, consider a short document:

I saw the company's manager last day.


To be able to extract information from it, it must be annotated with additional data to be less ambiguous. The process of finding such meta-data is not in question, so assume it is done manually. The question is how to store this data so that further analysis on it can be done more conveniently and efficiently.

A possible approach is to use XML tags (see below), but it seems too verbose, and maybe there are better approaches/guidelines for storing such meta-data on text documents.

<Person name="John">I</Person> saw the <Organization name="ACME">company</Organization>'s
manager <Time value="2014-5-29">last day</Time>.


2

I have used http://brat.nlplab.org/ in the past. There is a nice interface for many different types of annotations. The annotations are stored in a separate .ann file, which is a list of the words that are annotated and their positions in the document.

– user1893354 – 2014-06-09T13:09:03.373

@user1893354 Very helpful! Especially the "brat standoff format" used by it seems very suitable to my needs. I suggest posting an answer if you like.

– Amir Ali Akbari – 2014-06-14T08:09:09.150

One of SGML's major purposes (the same holds for its offspring, XML) was to provide the means for tagging text documents (POS and semantic tags). – Deer Hunter – 2014-05-30T20:47:32.157

Could you be more specific/restrictive about what kind of metadata you want to add? With your two examples, I doubt that there is a less verbose way that has the same generic expressiveness as XML tags. – ojdo – 2014-06-01T22:28:33.520

@ojdo Most of the meta-data is either for disambiguation (like the relative times), or for specifying special entities (i.e. FKs). – Amir Ali Akbari – 2014-06-02T16:59:48.337

16

Personally, I would advocate using something that is not specific to the NLP field and is general enough that it can still be used as a tool even when you've started moving beyond this level of metadata. I would especially pick a format that can be used regardless of development environment and one that can keep some basic structure if that becomes relevant (like tokenization).

It might seem strange, but I would honestly suggest JSON. It's extremely well supported, supports a lot of structure, and is flexible enough that you shouldn't have to move from it for not being powerful enough. For your example, something like this:

{"text": "I saw the company's manager last day.", "Person": [{"name": "John", "indices": [0, 1]}]}


The one big advantage you've got over any NLP-specific formats here is that JSON can be parsed in any environment, and since you'll probably have to edit your format anyway, JSON lends itself to very simple edits that give you a short distance to other formats.

You can also implicitly store tokenization information if you want:

{"text": ["I", "saw", "the", "company's", "manager", "last", "day."]}


EDIT: To clarify, the mapping of metadata is pretty open, but here's an example:

{"body": "<some_text>",
 "<entity>": {
   "<attribute>": "<value>",
   "location": [<start_index>, <end_index>]
 }
}
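As a rough illustration of the mapping above, here is a minimal Python sketch applied to the sentence from the question. The `entities` list, the `type` key, and the `covered_text` helper are assumptions made for this example, not an established schema; the offsets are 0-based and half-open, so they can be used directly as a slice.

```python
import json

# Hypothetical record layout: the raw text plus a list of entity annotations.
# "location" holds 0-based, half-open character offsets [start, end).
record = {
    "text": "I saw the company's manager last day.",
    "entities": [
        {"type": "Person", "name": "John", "location": [0, 1]},
        {"type": "Organization", "name": "ACME", "location": [10, 17]},
        {"type": "Time", "value": "2014-5-29", "location": [28, 36]},
    ],
}


def covered_text(record, entity):
    """Return the substring of the text that an entity annotation points at."""
    start, end = entity["location"]
    return record["text"][start:end]


# Round-trip through JSON to show that nothing is lost when storing the record.
restored = json.loads(json.dumps(record))
for entity in restored["entities"]:
    print(entity["type"], repr(covered_text(restored, entity)))
```

Keeping the text untouched and pointing at it by offset means the same record can carry any number of annotation layers without changing the text itself.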


Hope that helps, let me know if you've got any more questions.

Being a web developer, JSON seems completely reasonable to me, but, can you elaborate on the exact format of mapping words to entities? – Amir Ali Akbari – 2014-06-16T15:49:04.183

@AmirAliAkbari Updated answer to include more details. – indico – 2014-06-16T17:35:52.767

7

In general, you don't want to use XML tags to tag documents in this way because tags may overlap, and XML elements must nest properly.

UIMA, GATE and similar NLP frameworks store the tags separately from the text. Each tag, such as Person, ACME, John etc., is stored as the position where the tag begins and the position where it ends. So the tag ACME would be stored as starting at position 11 and ending at position 17 (counting characters from 1).
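The idea of keeping tags apart from the text can be sketched in a few lines of Python. This is not the UIMA or GATE API; the `Annotation` type, the `annotate` and `crosses` helpers, and the `Role` and `Event` labels are made up for illustration. Offsets here are 0-based and half-open, so "company" is (10, 17), which matches positions 11 to 17 when counting from 1.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    label: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


text = "I saw the company's manager last day."


def annotate(text, phrase, label):
    """Build an annotation for the first occurrence of phrase in text."""
    start = text.index(phrase)
    return Annotation(label, start, start + len(phrase))


def crosses(a, b):
    """True if two spans overlap partially, with neither containing the other."""
    overlap = a.start < b.end and b.start < a.end
    nested = (a.start <= b.start and b.end <= a.end) or (
        b.start <= a.start and a.end <= b.end
    )
    return overlap and not nested


org = annotate(text, "company", "Organization")
role = annotate(text, "company's manager", "Role")
event = annotate(text, "manager last day", "Event")

print(org)                   # nested inside role, which inline XML could express
print(crosses(role, event))  # partially overlapping, which inline XML cannot
```

Because each annotation is just a label and two offsets, crossing spans such as `role` and `event` can sit side by side in the same list.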

7

The brat annotation tool might be useful for you, as per my comment. I have tried many annotation tools and this is the best I have found. It has a nice user interface and supports a number of different types of annotations. The annotations are stored in a separate .ann file, which contains each annotation as well as its location within the original document. A word of warning though: if you ultimately want to feed the annotations into a classifier like the Stanford NER tool, you will have to do some manipulation to get the data into a format that it will accept.
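The kind of manipulation mentioned above might look roughly like the following Python sketch. It assumes text-bound annotation lines of the form `ID<TAB>Type start end<TAB>covered text`, ignores every other line type as well as discontinuous spans, and uses plain whitespace tokenization; the `parse_ann` and `to_token_labels` helpers are hypothetical, and the target layout (one token and one label per line, with `O` for untagged tokens) should be checked against the classifier you actually use.

```python
import re

text = "I saw the company's manager last day."

# Contents of a hypothetical .ann file for the sentence above.
ann = (
    "T1\tPerson 0 1\tI\n"
    "T2\tOrganization 10 17\tcompany\n"
    "T3\tTime 28 36\tlast day\n"
)


def parse_ann(ann_text):
    """Collect (start, end, label) triples from text-bound annotation lines."""
    spans = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip anything that is not a text-bound annotation
        _id, middle, _covered = line.split("\t", 2)
        if ";" in middle:
            continue  # discontinuous spans are not handled in this sketch
        label, start, end = middle.split(" ")
        spans.append((int(start), int(end), label))
    return spans


def to_token_labels(text, spans):
    """Label each whitespace-separated token with the first span it overlaps."""
    rows = []
    for match in re.finditer(r"\S+", text):
        label = "O"
        for start, end, span_label in spans:
            if match.start() < end and start < match.end():
                label = span_label
                break
        rows.append((match.group(), label))
    return rows


rows = to_token_labels(text, parse_ann(ann))
for token, label in rows:
    print(f"{token}\t{label}")
```

A real pipeline would use the same tokenizer as the downstream tool, since whitespace splitting leaves "company's" and "day." as single tokens.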

1

Describing all existing data is a difficult task, but we can use a data model such as http://schema.org/, which defines structured types of information. It was originally aimed at markup, so it seems it could be useful for your task.

0

Try Label Studio. It supports simple text and HTML NER tagging and much more.

Input to Label Studio for the task on the screenshot (HTML code packed into JSON):

{
"text": "<div style=\"max-width: 750px\"><div style=\"clear: both\"><div style=\"float: right; display: inline-block; border: 1px solid #F2F3F4; background-color: #F8F9F9; border-radius: 5px; padding: 7px; margin: 10px 0;\"><p><b>Jules</b>: No no, Mr. Wolfe, it's not like that. Your help is definitely appreciated.</p></div></div><div style=\"clear: both\"><div style=\"float: right; display: inline-block; border: 1px solid #F2F3F4; background-color: #F8F9F9; border-radius: 5px; padding: 7px; margin: 10px 0;\"><p><b>Vincent</b>: Look, Mr. Wolfe, I respect you. I just don't like people barking orders at me, that's all.</p></div></div><div style=\"clear: both\"><div style=\"display: inline-block; border: 1px solid #D5F5E3; background-color: #EAFAF1; border-radius: 5px; padding: 7px; margin: 10px 0;\"><p><b>The Wolf</b>: If I'm curt with you, it's because time is a factor. I think fast, I talk fast, and I need you two guys to act fast if you want to get out of this. So pretty please, with sugar on top, clean the car.</p></div></div></div>"
}


Output:

[
  {
    "from_name": "ner",
    "to_name": "text",
    "source": "$text",
    "type": "hypertextlabels",
    "value": {
      "start": "/div[1]/div[1]/div[1]/p[1]/b[1]/text()[1]",
      "end": "/div[1]/div[1]/div[1]/p[1]/b[1]/text()[1]",
      "text": "Jules",
      "startOffset": 0,
      "endOffset": 5,
      "htmllabels": [
        "Person"
      ]
    }
  },
  {
    "id": "YMeGv8ndLx",
    "from_name": "ner",
    "to_name": "text",
    "source": "$text",
    "type": "hypertextlabels",
    "value": {
      "start": "/div[1]/div[1]/div[1]/p[1]/text()[1]",
      "end": "/div[1]/div[1]/div[1]/p[1]/text()[1]",
      "text": "Wolfe",
      "startOffset": 13,
      "endOffset": 18,
      "htmllabels": [
        "Organization"
      ]
    }
  },
  {
    "id": "vgGGhXRFcr",
    "from_name": "ner",
    "to_name": "text",
    "source": "$text",
    "type": "hypertextlabels",
    "value": {
      "start": "/div[1]/div[2]/div[1]/p[1]/text()[1]",
      "end": "/div[1]/div[2]/div[1]/p[1]/text()[1]",
      "text": " Look, Mr. Wo",
      "startOffset": 1,
      "endOffset": 14,
      "htmllabels": [
        "Person"
      ]
    }
  },
  {
    "id": "oJxIH-ztQv",
    "from_name": "ner",
    "to_name": "text",
    "source": "$text",
    "type": "hypertextlabels",
    "value": {
      "start": "/div[1]/div[2]/div[1]/p[1]/text()[2]",
      "end": "/div[1]/div[2]/div[1]/p[1]/text()[2]",
      "text": "people bar",
      "startOffset": 38,
      "endOffset": 48,
      "htmllabels": [
        "Organization"
      ]
    }
  }
]
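For further analysis, an output like the one above can be flattened into simple tuples. The Python sketch below works on the first two results copied from that output and relies only on the keys visible there; the `labelled_spans` helper is an assumption for this example and is not part of Label Studio.

```python
import json

# The first two results from the output above, pasted in as a JSON string.
raw = """
[
  {"from_name": "ner", "to_name": "text", "source": "$text",
   "type": "hypertextlabels",
   "value": {"start": "/div[1]/div[1]/div[1]/p[1]/b[1]/text()[1]",
             "end": "/div[1]/div[1]/div[1]/p[1]/b[1]/text()[1]",
             "text": "Jules", "startOffset": 0, "endOffset": 5,
             "htmllabels": ["Person"]}},
  {"id": "YMeGv8ndLx", "from_name": "ner", "to_name": "text",
   "source": "$text", "type": "hypertextlabels",
   "value": {"start": "/div[1]/div[1]/div[1]/p[1]/text()[1]",
             "end": "/div[1]/div[1]/div[1]/p[1]/text()[1]",
             "text": "Wolfe", "startOffset": 13, "endOffset": 18,
             "htmllabels": ["Organization"]}}
]
"""


def labelled_spans(results):
    """Flatten results into (text, label, start_offset, end_offset) tuples."""
    spans = []
    for result in results:
        if result.get("type") != "hypertextlabels":
            continue  # ignore any other result types
        value = result["value"]
        for label in value["htmllabels"]:
            spans.append(
                (value["text"], label, value["startOffset"], value["endOffset"])
            )
    return spans


spans = labelled_spans(json.loads(raw))
for span in spans:
    print(span)
```

Note that the offsets are relative to the text node named by the `start` and `end` paths, not to the whole HTML string, so the path has to be kept alongside them if the spans need to be located in the document again.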