Increasing spaCy's nlp.max_length limit

I'm getting this error:

[E088] Text of length 1029371 exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.

The weird thing is that even if I reduce the number of documents being lemmatized, it still says the length exceeds 1 million. Is there a way of increasing the limit past 1 million characters? The error message seems to suggest there is, but I haven't been able to figure out how.
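For reference, the check the error message suggests is just comparing the input length against the limit. A minimal sketch (my_corpus.txt is a placeholder for wherever the text actually comes from):

import spacy

nlp = spacy.load('en_core_web_sm')
text = open('my_corpus.txt').read()  # placeholder input; substitute your own source
print(len(text))        # 1029371 in my case
print(nlp.max_length)   # 1000000 by default, hence error [E088]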

D500

Posted 2018-09-24T23:33:27.883

Reputation: 181

What code exactly are you running when you get that error? Please include a sample in your post. – n1k31t4 – 2018-09-25T00:30:09.257

Facing the same issue. It would be nice if spaCy left it to the user to decide how many words their infrastructure can process. – padmalcom – 2019-01-02T13:01:21.400

See my answer. I spent many hours trying to troubleshoot this and figured it was just easier to split the document into smaller pieces. Initially, I thought it had to do with the amount of RAM I was running with, but I think it's a character limit in the library. – D500 – 2019-01-03T14:03:27.097

Answers

Try raising the nlp.max_length parameter (as your error message suggests):

import spacy

nlp = spacy.load('en_core_web_sm')
nlp.max_length = 1500000  # or any large value, as long as you don't run out of RAM

Also, when calling your spaCy pipeline, you can disable memory-intensive components that are not needed for lemmatization:

doc = nlp("The sentences we'd like to do lemmatization on", disable=['ner', 'parser'])

Finally, you should get the results you expect with:

print([x.lemma_ for x in doc])
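Putting the pieces together, a minimal end-to-end sketch (this assumes spaCy v2.x, where disable can be passed directly on the call, and that en_core_web_sm is installed):

import spacy

nlp = spacy.load('en_core_web_sm')
nlp.max_length = 1500000  # must be at least len(text)

text = "The sentences we'd like to do lemmatization on"
doc = nlp(text, disable=['ner', 'parser'])  # skip NER and parsing to save RAM
print([token.lemma_ for token in doc])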

Davide Fiocco

Posted 2018-09-24T23:33:27.883

Reputation: 425

I wasn't able to figure out how to increase the character limit, so I just split my document in half instead. By default, spaCy won't process more than 1 million characters in a single text. Since I ran into this problem during lemmatization, it doesn't matter whether the document is processed as one whole or in several parts.
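A sketch of that splitting approach (the 900000 chunk size is an arbitrary value safely under the default limit; note that a naive fixed-size split can cut a token in half at each boundary, so splitting on whitespace or sentence breaks near the limit is safer):

import spacy

nlp = spacy.load('en_core_web_sm')

def lemmatize_long_text(text, chunk_size=900000):
    # Process the text in slices that stay under spaCy's default 1,000,000-character limit.
    lemmas = []
    for start in range(0, len(text), chunk_size):
        doc = nlp(text[start:start + chunk_size], disable=['ner', 'parser'])
        lemmas.extend(token.lemma_ for token in doc)
    return lemmas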

D500

Posted 2018-09-24T23:33:27.883

Reputation: 181