Are there any good out-of-the-box language models for Python?

14


I'm prototyping an application and I need a language model to compute perplexity on some generated sentences.

Is there any trained language model in python I can readily use? Something simple like

model = LanguageModel('en')
p1 = model.perplexity('This is a well constructed sentence')
p2 = model.perplexity('Bunny lamp robert junior pancake')
assert p1 < p2

I've looked at some frameworks but couldn't find what I want. I know I can use something like:

from nltk.corpus import brown
from nltk.model.ngram import NgramModel

lm = NgramModel(3, brown.words(categories='news'))

This uses a Good-Turing probability distribution on the Brown Corpus, but I was looking for a well-crafted model trained on some big dataset, like the 1B Words dataset, something whose results I can actually trust for a general domain (not only news).
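(Aside: the nltk.model.ngram module above was removed in later NLTK releases; the closest modern equivalent is the nltk.lm API. The sketch below is only an illustration of that API under stated assumptions: Laplace smoothing stands in for Good-Turing, and the data is still only Brown news text.)

# Hedged sketch using the newer nltk.lm API (NLTK >= 3.4)
from nltk.corpus import brown
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import ngrams

n = 3
train, vocab = padded_everygram_pipeline(n, brown.sents(categories='news'))
lm = Laplace(n)          # add-one smoothing, not Good-Turing
lm.fit(train, vocab)

# Score a sentence as padded trigrams
test = list(ngrams(pad_both_ends('this is a test'.split(), n=n), n))
print(lm.perplexity(test))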

Fred

Posted 2018-09-20T13:34:22.520

Reputation: 323

TensorFlow 1B Words LM – user12075 – 2018-09-20T15:46:23.007

Well this is not at all readily usable but it's something. Thanks :) – Fred – 2018-09-20T17:57:51.493

That's a pre-trained model that you can simply download and run, and you think that is "not at all readily usable" ... – user12075 – 2018-09-20T18:06:20.133

I think you and I have very different definitions of what "readily usable" means... I would need to figure out which TensorFlow ops I want (input and output) and how they behave, figure out whether there's any preprocessing, and then wrap everything in some perplexity function. I'm not saying I can't do it, I'm just saying it is not at all the "readily usable" function I showed. But again, thanks for the pointer – Fred – 2018-09-20T18:17:36.463

Have you tried Google? I hear they get a fair amount of data :) Not sure if they have the exact metrics you're after. https://cloud.google.com/natural-language/docs/ – flyingmeatball – 2018-09-24T19:04:07.253

Yeah, I actually use Google NLP quite extensively... but no, they don't have language modeling – Fred – 2018-09-24T23:56:25.440

Answers

9

I also think that the first answer is incorrect for the reasons that @noob333 explained.

But BERT cannot be used out of the box as a language model either. BERT gives you p(word | both left and right context), whereas what you want is p(word | previous tokens), i.e. left context only. The author explains here why you cannot use it as an LM.

However, you can adapt BERT and use it as a language model, as explained here.
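For illustration, here is a minimal sketch of that idea: mask each position in turn and score the original token. It assumes bert-base-uncased and the same pytorch_pretrained_bert package used below, and it yields a "pseudo-perplexity", not a true left-to-right perplexity:

import math
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

def bert_pseudo_perplexity(sentence):
    tokens = ['[CLS]'] + tokenizer.tokenize(sentence) + ['[SEP]']
    total_log_prob = 0.0
    for i in range(1, len(tokens) - 1):          # skip [CLS] and [SEP]
        masked = list(tokens)
        masked[i] = '[MASK]'                     # hide the token being scored
        ids = torch.tensor([tokenizer.convert_tokens_to_ids(masked)])
        with torch.no_grad():
            logits = model(ids)                  # (1, seq_len, vocab_size)
        log_probs = torch.log_softmax(logits[0, i], dim=-1)
        target = tokenizer.convert_tokens_to_ids([tokens[i]])[0]
        total_log_prob += log_probs[target].item()
    return math.exp(-total_log_prob / (len(tokens) - 2))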

But you can use the OpenAI GPT or GPT-2 pre-trained models from the same repo.

Here is how you can compute the perplexity using the GPT model:

import math
import torch
from pytorch_pretrained_bert import OpenAIGPTTokenizer, OpenAIGPTLMHeadModel

# Load pre-trained model (weights)
model = OpenAIGPTLMHeadModel.from_pretrained('openai-gpt')
model.eval()
# Load pre-trained model tokenizer (vocabulary)
tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')

def score(sentence):
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    # With lm_labels, the model returns the mean cross-entropy loss over tokens
    with torch.no_grad():
        loss = model(tensor_input, lm_labels=tensor_input)
    # Perplexity is the exponentiated average negative log-likelihood
    return math.exp(loss.item())


a = ['there is a book on the desk',
     'there is a plane on the desk',
     'there is a book in the desk']
print([score(i) for i in a])
[21.31652459381952, 61.45907380241148, 26.24923942649312]

lads

Posted 2018-09-20T13:34:22.520

Reputation: 358

It's interesting how the score for "in" vs. "on" differs by about 5 – Syenix – 2020-08-06T21:07:26.207

7

I think the accepted answer is incorrect.

token.prob is the log-prob of the token being a particular type. I am guessing 'type' refers to something like a POS tag or a named-entity type (it's not clear from spaCy's documentation), and the score is a confidence measure over the space of all types.

This is not the same as the probabilities assigned by a language model. A language model gives you the probability distribution over all possible tokens (not types), saying which of them is most likely to occur next.

This repo has pretty nice documentation on using BERT (a state-of-the-art model) with pre-trained weights for the neural network.

I think the APIs don't give you perplexity directly, but you should be able to get probability scores for each token quite easily.
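For reference, once you have per-token probabilities, perplexity is just the exponentiated average negative log-probability. A minimal sketch, assuming natural-log probabilities:

import math

def perplexity_from_log_probs(log_probs):
    # Perplexity = exp(-1/N * sum of per-token natural-log probabilities)
    return math.exp(-sum(log_probs) / len(log_probs))

# e.g. three tokens with probabilities 0.1, 0.2, and 0.05
print(perplexity_from_log_probs([math.log(0.1), math.log(0.2), math.log(0.05)]))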

noob333

Posted 2018-09-20T13:34:22.520

Reputation: 71

5

The spaCy package has many language models, including ones trained on Common Crawl.

"Language model" has a specific meaning in Natural Language Processing (NLP). A language model is a probability distribution over sequences of tokens: given a specific sequence of tokens, the model assigns a probability to that sequence appearing. spaCy's language models include more than just a probability distribution.

The spaCy package needs to be installed and a language model needs to be downloaded:

$ pip install spacy 
$ python -m spacy download en

Then the language models can be used with a couple of lines of Python:

>>> import spacy
>>> nlp = spacy.load('en')

For a given model and token, a smoothed log-probability estimate of the token's word type can be found with the token.prob attribute. Note (per the comments below) that token.prob is only populated in the larger model packages.
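A minimal sketch of using token.prob to compare the question's two sentences. It assumes the large English model (en_core_web_lg), since the small models don't ship real probabilities, and note this is a unigram-style score over word frequencies, not a sequence perplexity:

import spacy

# Assumes: python -m spacy download en_core_web_lg
nlp = spacy.load('en_core_web_lg')

def mean_log_prob(text):
    doc = nlp(text)
    return sum(token.prob for token in doc) / len(doc)

# Higher (less negative) means more common words, not grammaticality
print(mean_log_prob('This is a well constructed sentence'))
print(mean_log_prob('Bunny lamp robert junior pancake'))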

Brian Spiering

Posted 2018-09-20T13:34:22.520

Reputation: 10 864

Deleted my previous comments... Apparently spaCy does include a proper language model (via the token.prob attribute), but it's only built into the large model version. If you edit your answer to include that info, I can give you the bounty. Funnily enough, I've been using spaCy for months now and nowhere did I see that it had this feature – Fred – 2018-09-28T16:27:54.777

Glad you found something that works for you. – Brian Spiering – 2018-09-29T23:04:58.267

Again, this only works if you download the large English model – Fred – 2018-09-30T20:19:14.107

1

You can use the lm_scorer package to calculate the language model probabilities using GPT-2 models.

First install the package as:

pip install lm-scorer

Then, you can create a scorer by specifying the model size.

from lm_scorer.models.auto import AutoLMScorer

# Download and wrap a pre-trained GPT-2 model
scorer = AutoLMScorer.from_pretrained("gpt2-large")

def score(sentence):
    # Sentence probability (the product of per-token probabilities by default)
    return scorer.sentence_score(sentence)

Apply it to your text and you get back the probability:

>>> score('good luck')
8.658163769270644e-11
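If you want perplexity rather than a raw probability, a hedged sketch follows; it assumes the tokens_score method with log=True returns per-token natural-log probabilities, as shown in the lm-scorer README:

import math

def perplexity(sentence):
    # tokens_score(..., log=True) returns (log_probs, ids, tokens) per the README
    log_probs, _ids, _tokens = scorer.tokens_score(sentence, log=True)
    return math.exp(-sum(log_probs) / len(log_probs))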

You can also refer to a blog post I wrote a while back if you're looking for more details.

Amit Chaudhary

Posted 2018-09-20T13:34:22.520

Reputation: 111