Extracting referenced documents


I'm looking to build an AI tool that can extract in-text references from standards documents to assist human research.

My use case is extracting the identifying numbers (for example, "AR 25-2") along with the title of the document ("Information Assurance"), so that a human can gather all the related research on a contract at once instead of having to keep track of references while reading through the document.
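To make the desired output concrete, here is a minimal sketch of the kind of result I'm after. It is only a rough baseline and not the model I'm asking about: the names `Reference`, `ID_PATTERN`, `TITLE_PATTERN`, and `extract_references` are hypothetical, the pattern is an assumption based on the single "AR 25-2" example, and "ABC 123.4" is a made-up placeholder identifier. It only picks up a title when it appears in quotes right after the identifier.

```python
import re
from typing import List, NamedTuple, Optional


class Reference(NamedTuple):
    identifier: str        # e.g. "AR 25-2"
    title: Optional[str]   # e.g. "Information Assurance", or None if not found


# Assumed, deliberately loose pattern: 2-6 capital letters, a space, then a
# number with optional hyphenated or dotted parts. Real identifiers may not
# follow this shape, so this will both miss documents and over-match.
ID_PATTERN = re.compile(r"\b([A-Z]{2,6} \d+(?:[-.]\d+)*)\b")

# Assumed title convention: an optional comma, then a double-quoted title
# immediately after the identifier.
TITLE_PATTERN = re.compile(r'\s*,?\s*"([^"]+)"')


def extract_references(text: str) -> List[Reference]:
    """Return every candidate identifier with its quoted title, if any."""
    refs = []
    for id_match in ID_PATTERN.finditer(text):
        # Look for a quoted title starting exactly where the identifier ends.
        title_match = TITLE_PATTERN.match(text, id_match.end())
        title = title_match.group(1) if title_match else None
        refs.append(Reference(id_match.group(1), title))
    return refs


sample = 'Systems must comply with AR 25-2, "Information Assurance", and ABC 123.4.'
for ref in extract_references(sample):
    print(ref.identifier, "|", ref.title)
```

Titles written as plain English without quotes, and identifier formats I haven't seen yet, are exactly the cases this sketch cannot handle.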

I have a pretty good idea of where to gather the names of these documents for training: I'm planning to scrape a few repositories covering different categories of these documents.

What kind of model should I use to get the best results?


Posted 2018-07-13T18:17:09.933

Reputation: 31

Welcome to AI.SE, and nice question! I've slightly edited it to remove the request for software recommendations, which are usually off-topic on Stack Exchange since they tend to go out of date quickly. If you specifically need existing software, you might be able to get help at [softwarerecs.se]. Otherwise, best of luck with your question here! – Ben N – 2018-07-14T00:17:49.097

Do the documents have a variable layout or do they share a common layout? Also, do the documents have a common format? This sounds like something that could be done with regular expressions and some parsing of document elements. – Greenstick – 2018-07-14T05:29:19.390

Unfortunately not. The documents have variably formatted title tags, and there is no common format for the titles: they come from many different branches of the government, and some of the title tags are mostly plain English. I'm also wary of regular expressions for this task because there may be standards formats that I'm not aware of, and I don't want my tools to miss a single document. I'd rather accept a high false positive rate to avoid that problem. Also, a regular expression wouldn't capture the English titles of the documents. – comp.sci.intern – 2018-07-16T13:13:21.297

No answers