What is the BLEU score of professional human translators?



Machine translation models are usually evaluated using the BLEU score. I want to get some intuition for this score. What is the BLEU score of a professional human translator?

I know it depends on the languages, the translator, etc. I just want to get a sense of the scale.

Edit: to make it clear - I am asking about the expected BLEU. It's not a theoretical question, it is an experimental one.

Amit Keinan

Posted 2020-02-23T17:08:50.137

Reputation: 460



The original paper "BLEU: a Method for Automatic Evaluation of Machine Translation" contains a couple of numbers on this:

The BLEU metric ranges from 0 to 1. Few translations will attain a score of 1 unless they are identical to a reference translation. For this reason, even a human translator will not necessarily score 1. It is important to note that the more reference translations per sentence there are, the higher the score is. Thus, one must be cautious making even “rough” comparisons on evaluations with different numbers of reference translations: on a test corpus of about 500 sentences (40 general news stories), a human translator scored 0.3468 against four references and scored 0.2571 against two references.

But as their table 1 (providing the numbers compared to two references, H2 is the one mentioned in the text above) shows there is variance among human BLEU scores:

[Table 1 from the paper: BLEU scores of the two human translators (H1, H2) and three machine systems (S1, S2, S3) against two reference translations]

Unfortunately, the paper does not qualify the skill level of the translators.
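The effect the paper describes, that more references per sentence yield a higher score, can be reproduced with a small sketch using NLTK's `sentence_bleu` (the sentences below are made up for illustration):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

hypothesis = "the cat sat on the mat".split()
ref_a = "the cat is sitting on the mat".split()
ref_b = "a cat sat on the mat".split()

# Smoothing avoids a zero score when a short sentence has no 4-gram match.
smooth = SmoothingFunction().method1

one_ref = sentence_bleu([ref_a], hypothesis, smoothing_function=smooth)
two_refs = sentence_bleu([ref_a, ref_b], hypothesis, smoothing_function=smooth)

# The two-reference score is higher: each hypothesis n-gram only needs to
# match in any one of the references.
print(one_ref, two_refs)
```

This mirrors the paper's 0.2571 (two references) vs. 0.3468 (four references) observation for the same human translator.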


Posted 2020-02-23T17:08:50.137

Reputation: 3 740

Thanks. Note that the table might be confusing without saying that H1 and H2 are human translations and S1, S2, S3 are machine translations. For other readers I will also mention that it is a Chinese-English translation task. – Amit Keinan – 2020-02-24T06:31:48.877


BLEU scores are based on comparing the translation to be evaluated against a gold-standard translation. In general, the gold standard is the same source sentence translated by a professional translator, so in theory a professional human translation should always receive the maximum score of 1 (BLEU scores are normalized between 0 and 1).

However, it's important to keep in mind that:

  • Even professional translators don't always agree on what is the "correct" translation, so there's no perfect evaluation method.
  • There can be multiple valid translations for the same sentence. This can be taken into account in the BLEU score, but most of the time BLEU scores are calculated against a single reference translation. As a consequence, it's possible that a perfectly good translation gets a low score.
  • BLEU scores are based on counting the n-grams that the predicted translation has in common with the gold standard. It's quite a good proxy for translation quality, but it's also very basic. In particular, it cannot take meaning into account.
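The n-gram mechanics in the last point can be made concrete with a minimal sketch of clipped (modified) n-gram precision, the core quantity inside BLEU. The two sentences are hypothetical, equally valid translations of the same imagined source:

```python
from collections import Counter

def modified_ngram_precision(hyp, ref, n):
    """Clipped n-gram precision: each hypothesis n-gram counts only up to
    the number of times it appears in the reference."""
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    clipped = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
    return clipped / max(1, sum(hyp_ngrams.values()))

# Two plausible human translations of the same source sentence:
a = "the economy grew rapidly last year".split()
b = "last year the economy expanded quickly".split()

print(modified_ngram_precision(a, b, 1))  # 4/6: decent unigram overlap
print(modified_ngram_precision(a, b, 2))  # 2/5: lower bigram overlap
```

Even though both sentences express the same meaning, the higher-order n-gram precisions drop quickly, which is exactly why one human translation scored against another stays well below 1.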


Posted 2020-02-23T17:08:50.137

Reputation: 12 600

Thanks. I already knew what you said. I want to know the expected BLEU of a translator - if I take two different translators and measure the translation of one of them (chosen randomly) against the "gold translation" of the other, what is the average BLEU? It won't be 1 because the translations may not be exactly the same. – Amit Keinan – 2020-02-23T19:40:26.770

@AmitKeinan Yes, that's why I mentioned that professional translators don't always agree. That means that there can be no general answer to your question: since the score depends on which translation is chosen as reference, it could be anywhere between 0 and 1. But there's no reason to choose one professional translation rather than the other, so in theory the logic of the BLEU score is that a professional translation has a perfect score. – Erwan – 2020-02-23T20:03:04.900

Well, maybe you didn't understand my question. I know it can be anywhere between 0 and 1 - it is a probabilistic question. My question is about the expectation. If I take many translators, split them into pairs and measure the BLEU between their translations, what will be the average score? I want an answer with real experimental references. – Amit Keinan – 2020-02-23T20:12:39.913
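The experiment described in this comment could be sketched as follows, using NLTK's `corpus_bleu` and treating one translator's output as the hypothesis and the other's as the single reference (the sentence pairs below are invented placeholders, not real data):

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Hypothetical parallel outputs of two professional translators
# for the same two source sentences:
translator_a = [
    "the talks ended without an agreement".split(),
    "prices rose sharply in march".split(),
]
translator_b = [
    "the negotiations finished with no deal".split(),
    "in march prices increased steeply".split(),
]

# corpus_bleu expects a list of reference lists, one list per hypothesis.
score = corpus_bleu(
    [[ref] for ref in translator_b],
    translator_a,
    smoothing_function=SmoothingFunction().method1,
)
print(score)  # well below 1.0 even though both translations are valid
```

Averaging such scores over many translator pairs and a large corpus would give the experimental expectation asked about; the Papineni et al. numbers (0.1934 and 0.2571 for two human translators against two references) are one published data point of this kind.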

Oh OK, I didn't understand that indeed. Still, any experimental value depends on the data and the translators; there's no universal answer. – Erwan – 2020-02-23T21:16:22.403

@Erwan it's also worth noting that BLEU as such does not depend on which translation is chosen as reference, it's designed to take into account multiple equally valid reference translations - it's just that in most practical applications we don't have more than one reference translation available. – Peteris – 2020-02-24T20:19:04.197