What is the actual quality of machine translations?



To this day I, as an AI layman, remain confused about the gap between the promised and the actually achieved improvements in automated translation.

My impression is that there is still a very, very long way to go. Or is there another explanation for why the automated translations (offered, e.g., by Google) of quite simple Wikipedia articles still read and sound mostly silly, are hardly readable, and are only partially helpful and useful?

It may depend on personal preferences (concerning readability, helpfulness and usefulness), but my personal expectations have been sorely disappointed.

The other way around: Are Google's translations nevertheless readable, helpful and useful for a majority of users?

Or does Google have reasons to hold back its achievements (and not show users the best it can do)?

Preliminary conclusion: we are still far from being able to talk with artificial intelligences on an equal footing of understanding; we communicate only at the level of strings. So why should we be afraid of them? Because they know more than we do, without our knowing it?

Hans-Peter Stricker

Posted 2019-06-03T14:34:41.017

Reputation: 391

Comments are not for extended discussion; this conversation has been moved to chat.

– nbro – 2020-03-14T20:32:02.410



Who claimed that machine translation is as good as a human translator? For me, as a professional translator who has made his living from translation for 35 years now, MT means that my daily output of human-quality translation has grown by a factor of 3 to 5, depending on the complexity of the source text.

I cannot agree that the quality of MT decreases with the length of the foreign-language input. That used to be true of the old systems based on semantic and grammatical analysis. I don't claim to know all of the old systems (I know Systran, a trashy tool from Siemens that was passed from one company to the next like a Danaan gift, XL8, Personal Translator and Translate), but even a professional system in which I invested 28,000 DM (!!!!) failed miserably.

For example, the sentence:

On this hot summer day I had to work and it was a pain in the ass.

can be translated to German using several MT tools.

Personal Translator 20:

Auf diesem heißen Sommertag musste ich arbeiten, und es war ein Schmerz im Esel.


An diesem heißen Sommertag musste ich arbeiten, und es war ein Schmerz im Esel.


An diesem heißen Sommertag musste ich arbeiten und es war eine Qual.


An diesem heißen Sommertag musste ich arbeiten und es war ein Schmerz im Arsch.

Today, Google usually presents me with readable, nearly correct translations, and DeepL is even better. Just this morning I translated 3,500 words in 3 hours and the result is flawless, although the source text was full of mistakes (written by Chinese authors).


Posted 2019-06-03T14:34:41.017

Reputation: 221

Comments are not for extended discussion; this conversation has been moved to chat.

– nbro – 2020-03-14T20:31:51.040


Google's translations can be useful, especially if you know that they are not perfect and you just want an initial idea of the meaning of a text (even so, Google's translations can sometimes be quite misleading or incorrect). I wouldn't recommend Google Translate (or any other non-human translator) for a serious translation, unless the text consists of common sentences or words, does not involve very long passages or informal language (slang), the translation involves English, or you have no access to a human translator.

Google Translate currently uses a neural machine translation system (GNMT). To evaluate this model (and similar models), the BLEU metric (a scale from $0$ to $100$, where $100$ corresponds to the human gold-standard translation) and side-by-side evaluations (a human rates the translations) have been used. If you use only the BLEU metric, the machine translations are quite poor (but the BLEU metric is also not a perfect evaluation metric, because there is often more than one translation of a given sentence). However, GNMT reduces translation errors compared to phrase-based machine translation (PBMT).
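As a rough sketch of how BLEU behaves, here is a simplified sentence-level version (real BLEU is computed at corpus level, with up to 4-grams and smoothing; the example sentences are invented). Note how an exact match scores $100$ while a perfectly acceptable paraphrase scores far lower, which is exactly the weakness mentioned above:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU: geometric mean of modified
    n-gram precisions times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Each candidate n-gram is credited at most as often as it
        # appears in the reference ("modified" precision).
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        precisions.append(overlap / max(sum(cand_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return 100 * bp * geo_mean

# An exact match scores 100:
print(simple_bleu("the cat sat on the mat", "the cat sat on the mat"))  # 100.0
# A valid paraphrase scores much lower, despite being a fine translation:
print(simple_bleu("a cat was sitting on the mat", "the cat sat on the mat"))
```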

In the paper "Making AI Meaningful Again", the authors also discuss the difficulty of the translation task (which is believed to be an AI-complete problem). They also mention the Transformer (another state-of-the-art machine translation model), which still achieves quite poor results when evaluated with the BLEU metric.

To conclude, machine translation is a hard problem and current machine translation systems definitely do not perform as well as a professional human translator.


Posted 2019-06-03T14:34:41.017

Reputation: 19 783

100 BLEU score doesn't mean human gold-standard translation, it means it matches the reference translation exactly. As there are usually multiple ways to translate a sentence, even human translation usually does not have 100 BLEU, but more like 50-60. – justhalf – 2019-06-06T21:29:16.790

@justhalf Read my answer again. – nbro – 2019-06-06T21:30:13.340

1Thanks for the reply, and sorry if my previous comment appeared rude. My point in my previous comment was that it is inaccurate to give the impression that human translation will get 100 BLEU points, which your current answer seems to do. – justhalf – 2019-06-06T21:48:30.990

@justhalf I just said that $100$ corresponds to a human "gold-standard" translation. However, I also state that the BLEU metric is not perfect, because often there is more than one translation of a given text. – nbro – 2019-06-06T22:14:42.490


You have asked quite a lot of questions, some of which cannot be answered definitively. To give an insight into the quality (and history) of machine translation, I'd like to refer to Christopher Manning's 'one sentence benchmark' as presented in his lecture. It contains a Chinese-to-English example which is compared with Google Translate output. The correct translation for the example would be:

In 1519, six hundred Spaniards landed in Mexico to conquer the Aztec Empire with a population of a few million. They lost two thirds of their soldiers in the first clash.

Google Translate returned the following translations.

2009: 1519 600 Spaniards landed in Mexico, millions of people to conquer the Aztec empire, the first two-thirds of soldiers against their loss.

2011: 1519 600 Spaniards landed in Mexico, millions of people to conquer the Aztec empire, the initial loss of soldiers, two thirds of their encounters.

2013: 1519 600 Spaniards landed in Mexico to conquer the Aztec empire, hundreds of millions of people, the initial confrontation loss of soldiers two-thirds.

2015: 1519 600 Spaniards landed in Mexico, millions of people to conquer the Aztec empire, the first two-thirds of the loss of soldiers they clash.

2017: In 1519, 600 Spaniards landed in Mexico, to conquer the millions of people of the Aztec empire, the first confrontation they killed two-thirds.

Whether Google retains or 'hides' its best results: I doubt it. There are many excellent researchers working in the field of natural language processing (NLP). If Google had a hidden 'greatest achievement' in translation, researchers would figure it out sooner or later. (Why would Google hide its 'greatest achievement' anyway? They seem to see the benefit of open source; see the Transformer [1] or BERT [2].)

NB. For an updated list of state-of-the-art algorithms in NLP, see the SQuAD2.0 leaderboard.

[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.

[2] Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).


Posted 2019-06-03T14:34:41.017

Reputation: 161

Thanks a lot for the link to "well compensated researchers". Having compensations in mind always helps to understand things better (even though I don't know what you had in mind when setting this link). – Hans-Peter Stricker – 2019-06-04T17:39:23.233

The argument was not very solid either. I have removed the link and tried to improve the argument. I have been reading a lot of NLP papers and am quite confident in my findings, but it's difficult to find support for the argument. – RikH – 2019-06-04T18:12:16.137

Please let me know about your findings (if you don't mind). My mail address can be found on my profile page. – Hans-Peter Stricker – 2019-06-04T18:18:58.730

2019: In 1519, 600 Spaniards landed in Mexico to conquer the Aztec empire of millions of people, and they first met two-thirds of their soldiers. – Dan M. – 2019-06-05T11:08:57.847


It really depends on the language pair and the topic of the content. Translating to or from English is usually the best supported, and translating between popular languages works better: for example, English to Romanian produces a poorer translation than English to Russian. But translating from English to Russian or Romanian is better than translating Russian to Romanian, and translating Romanian to English is better than translating English to Romanian.

But if you are used to working with translators and have a passing familiarity with the languages, common translation mistakes, and the topic, it's easy to understand what was supposed to be there. At that point, it is sometimes easier to quickly scan something translated into your native language than to read it in a second language.

Less popular languages (for translation, not necessarily in number of speakers) yield much, much more literal translations, only slightly better than what you personally would produce using a dictionary for two languages you do not know.

Aaron Harun

Posted 2019-06-03T14:34:41.017

Reputation: 141


Am I wrong and Google's translations are nevertheless readable, helpful and useful for a majority of users?

Yes, they are somewhat helpful and allow you to translate faster.

Or does Google have reasons to retain its greatest achievements (and not to show to the users the best they can show)?

Maybe; I don't know. If you search for information on this, Google really does do a lot of questionable things, like learning from what users say on the internet and taking unsuitable data as trusted input data sets.


Posted 2019-06-03T14:34:41.017

Reputation: 354


Apologies for not writing in English. Please find the adapted translation here:

To give interested people an idea of the quality of MT (DeepL), see this example from a text I was working on this morning (6,300 words, started at 9 a.m., delivered today around 1 p.m., and I still found time for this post). I was working on this sentence (201 words) when I posted my comment.

"You further represent, warrant and undertake to ABC that you shall not: (a) Conduct any fraudulent, abusive, or otherwise illegal activity which may be grounds for termination of your right to access or use this Website and/or the Services; or (b) Post or transmit, or cause to be posted or transmitted, any communication or solicitation designed or intended to obtain password, account, or private information from any other user of this Website; or (c) Violate the security of any computer network, crack passwords or security encryption codes, transfer or store illegal material (including material that may be considered threatening or obscene), or engage in any kind of illegal activity that is expressly prohibited; or (d) Run maillist, listserv, or any other form of auto-responder, or "spam" on this Website, or any processes that run or are activated while you are not logged on to this Website, or that otherwise interfere with the proper working of or place an unreasonable load on this Website’s infrastructure; or (e) Use manual or automated software, devices, or other processes to "crawl," "scrape," or "spider" any page of this Website; or (f) Decompile, reverse engineer, or otherwise attempt to obtain the source code of this Website."

DeepL returns this:

Sie versichern, garantieren und verpflichten sich gegenüber ABC, dass Sie dies nicht tun werden: (a) betrügerische, missbräuchliche oder anderweitig illegale Aktivitäten durchzuführen, die Anlass für die Beendigung Ihres Rechts auf Zugang oder Nutzung dieser Website und/oder der Dienste sein können; oder (b) Mitteilungen oder Aufforderungen, die dazu bestimmt sind oder sind, Passwörter, Konten oder private Informationen von anderen Nutzern dieser Website zu erhalten, zu posten oder zu übertragen oder posten oder zu posten oder zu übertragen; oder (c) die Sicherheit eines Computernetzwerks verletzen, Passwörter oder Sicherheitsverschlüsselungscodes knacken, illegales Material (einschließlich Material, das als bedrohlich oder obszön angesehen werden kann) übertragen oder speichern oder sich an illegalen Aktivitäten beteiligen, die ausdrücklich verboten sind; oder (d) Mailliste, Listenserver oder jede andere Form von Auto-Responder oder "Spam" auf dieser Website oder Prozesse, die ausgeführt werden oder aktiviert werden, während Sie nicht auf dieser Website angemeldet sind, oder die anderweitig das ordnungsgemäße Funktionieren oder eine unangemessene Belastung der Infrastruktur dieser Website stören; oder (e) manuelle oder automatisierte Software, Geräte oder andere Prozesse verwenden, um eine Seite dieser Website zu "crawlen", zu kratzen, zu spinnen oder zu spinnen; oder (f) dekompilieren, zurückzuentwickeln oder anderweitig zu versuchen, den Quellcode dieser Website zu erhalten.

It took me about 5 to 10 minutes to adjust this paragraph.

As a translator, I know that I cannot rely on machine translation, but I have learnt the specifics and capabilities of the different systems over time, and I know what to pay attention to.

MT helps me a lot in my work.


Posted 2019-06-03T14:34:41.017

Reputation: 21

2Notice that legal texts yield better automatic translations, since there's a bucketload of multilingual texts in this area. – Quora Feans – 2019-06-04T17:29:25.263


This will be not so much an answer as a commentary.

The quality depends on several things, including (as Aaron said above) 1) the language pair and 2) the topic, but also 3) the genre and 4) the style of the original, and 5) the amount of parallel text you have to train the MT system.

To set the stage: virtually all MT these days is based on parallel texts, that is, texts in two different languages, with one presumably being a translation of the other (or both being translations of some third language), potentially using dictionaries (perhaps assisted by morphological processing) as a backoff when the parallel texts don't contain particular words.

Moreover, as others have said, an MT system in no way understands the texts it's translating; it just sees strings of characters, and sequences of words made up of characters, and it looks for similar strings and sequences in texts it's translated before. (Ok, it's slightly more complicated than that, and there have been attempts to get at semantics in computational systems, but for now it's mostly strings.)

1) Languages vary. Some languages have lots of morphology, which means they do things with a single word that other languages do with several words. A simple example would be Spanish 'cantaremos' = English "we will sing". And one language may do things that the other language doesn't even bother with, like the informal/formal (tu/ usted) distinction in Spanish, which English doesn't have an equivalent to. Or one language may do things with morphology that another language does with word order. Or the script that the language uses may not even mark word boundaries (Chinese, and a few others). The more different the two languages, the harder it will be for the MT system to translate between them. The first experiments in statistical MT were done between French and English, which are (believe it or not) very similar languages, particularly in their syntax.

2) Topic: If you have parallel texts in the Bible (which is true for nearly any pair of written languages), and you train your MT system off of those, don't expect it to do well on engineering texts. (Well, the Bible is a relatively small amount of text by the standards of training MT systems anyway, but pretend :-).) The vocabulary of the Bible is very different from that of engineering texts, and so is the frequency of various grammatical constructions. (The grammar is essentially the same, but in English, for example, you get lots more passive voice and more compound nouns in scientific and engineering texts.)

3) Genre: If your parallel text is all declarative (like tractor manuals, say), trying to use the resulting MT system on dialog won't get you good results.

4) Style: Think Hillary vs. Donald; erudite vs. popular. Training on one won't get good results on the other. Likewise training the MT system on adult-level novels and using it on children's books.

5) Amount of parallel text: English has lots of texts, and the chances of finding texts in some other language which are parallel to a given English text are much higher than the chances of finding parallel texts in, say, Russian and Igbo. (That said, there may be exceptions, like the languages of India.) As a gross generalization, the more such parallel texts you have to train the MT system, the better the results.

In sum, language is complicated (which is why I love it--I'm a linguist). So it's no surprise that MT systems don't always work well.

BTW, human translators don't always do so well, either. A decade or two ago, I was getting translations of documents from human translators into English, to be used as training materials for MT systems. Some of the translations were hard to understand, and in some cases where we got translations from two (or more) human translators, it was hard to believe the translators had been reading the same documents.

And finally, there's (almost) never just one correct translation; there are multiple ways of translating a passage, which may be more or less good, depending on what features (grammatical correctness, style, consistency of usage,...) you want. There's no easy measure of "accuracy".

Mike Maxwell

Posted 2019-06-03T14:34:41.017

Reputation: 111


Surprisingly, most of the other answers are very vague and approach this from the human translator's point of view. Let's switch over to the ML engineer's perspective.

When creating a translation tool, one of the first questions that we should consider is "How do we measure that our tool works?".

Which is essentially what the OP is asking.

Now, this is not an easy task (some other answers explain why). There is a Wikipedia article that describes different ways to evaluate machine translation results; both human and automatic scores exist (such as BLEU, NIST, LEPOR).

With the rise of neural network techniques, those scores have improved significantly.

Translation is a complex problem. There are many things that can go right (or wrong), and computer translation systems often ignore subtleties that stand out to a human speaker.

I think, if we are to think about the future, there are a few things that we can rely on:

  • Our techniques are getting better, more widely known, and more thoroughly tested. This will improve accuracy in the long run.
  • We are developing new techniques that can take into account variables previously ignored, or that simply do a better job.
  • Many existing translation models are "reused" to translate other languages. For example, try translating "JEDEN" from Polish to Chinese (traditional) using Google Translate: you will end up with "ONE", which is evidence that Google translates Polish to English and then English to Chinese. This is obviously not ideal (some information is lost in the process), but it still works, so companies like Google use it for languages where they don't have enough manpower or data. With time, more specialized models will appear, which will improve the situation.
  • Also, as the previous point suggests, more and more data will only help to improve machine translation.
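The pivot-through-English problem described above can be sketched with a toy example (the dictionaries and the collapsed distinction are invented for illustration; real systems pivot over full sentences, not single words):

```python
# Toy sketch of pivot translation: Spanish -> English -> German.
# Spanish distinguishes informal "tú" from formal "usted", but the
# English pivot collapses both to "you", so the distinction is lost
# before German (which has "du"/"Sie") is ever reached.
es_to_en = {"tú": "you", "usted": "you"}
en_to_de = {"you": "du"}  # the pivot forces one arbitrary choice

def pivot(word, src_to_en, en_to_tgt):
    """Translate source -> English -> target, word by word."""
    return en_to_tgt[src_to_en[word]]

# Both Spanish forms come out identical in German:
print(pivot("tú", es_to_en, en_to_de))     # du
print(pivot("usted", es_to_en, en_to_de))  # du
```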

To summarize: this complex problem, although not solved, is certainly on the right track, and already allows for some impressive results for well-researched language pairs.


Posted 2019-06-03T14:34:41.017

Reputation: 111

"Surprisingly all the other answers...", not all other answers. I would say "Some other answers" or "Most other answers". – nbro – 2019-06-06T22:53:20.407


"Or does Google have reasons to retain its achievements (and not to show to the users the best they can show)"

If they were holding something back, it would have to be amazing. Google publishes a lot of strong papers in natural language processing, including ones that achieve state-of-the-art results or make significant conceptual breakthroughs. They have also released very useful datasets and tools. Google is one of the few companies that is not only using the cutting edge of current research but actively contributing to the literature.

Machine translation is just a hard problem. A good human translator needs to be fluent in both languages to do the job well. Each language has its own idioms and non-literal or context-dependent meanings. Working from a dual-language dictionary alone would yield terrible results (for a human or a computer), so we need to train our models on existing parallel corpora in order to learn how words are actually used (n.b. hand-compiled phrase translation tables can be used as features; they just can't be the whole story). For some language pairs, parallel corpora are plentiful (e.g. for EU languages, we have the complete proceedings of the European Parliament). For other pairs, training data is much sparser. And even when we have training data, there will be less-used words and phrases that don't appear often enough to be learned.

This used to be an even bigger problem, since synonyms were hard to account for. If our training data had sentences for "The dog caught the ball", but not "The puppy caught the ball", we would end up with a low probability for the second sentence. Indeed, significant smoothing would be needed to prevent the probability from being zero in many such cases.
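The zero-probability problem and the simplest smoothing fix (add-one, or Laplace, smoothing) can be shown with a toy bigram model; the corpus and vocabulary below are invented for illustration:

```python
from collections import Counter

# Tiny invented corpus: "puppy" never follows "the" here.
corpus = ["the dog caught the ball", "the dog ran"]
vocab = {w for s in corpus for w in s.split()} | {"puppy"}

unigrams, bigrams = Counter(), Counter()
for s in corpus:
    toks = s.split()
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

def p_mle(w2, w1):
    """Maximum-likelihood bigram probability P(w2 | w1)."""
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

def p_laplace(w2, w1):
    """Add-one smoothed probability: every bigram count is bumped by 1,
    so unseen bigrams get a small but nonzero probability."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + len(vocab))

print(p_mle("puppy", "the"))      # 0.0 -- unseen bigram gets zero probability
print(p_laplace("puppy", "the"))  # small but nonzero
```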

The emergence of neural language models in the last 15 years or so has massively helped with this problem, by allowing words to be mapped to a real-valued semantic space before learning the connections between words. This allows models to be learned in which words that are close together in meaning are also close together in the semantic space, and thus switching a word for its synonym will not greatly affect the probability of the containing sentence. word2vec is a model that illustrated this very well; it showed that you could, e.g., take the semantic vector for "king", subtract the vector for "man", add the vector for "woman", and find that the nearest word to the resulting vector was "queen". Once the research in neural language models began in earnest, we started seeing immediate and massive drops in perplexity (i.e. how confused the models were by natural text) and we're seeing corresponding increases in BLEU score (i.e. quality of translation) now that those language models are being integrated into machine translation systems.
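The king/man/woman/queen analogy can be illustrated with a toy sketch. The 2-D vectors below are hand-crafted for the example along two invented axes (royalty, gender); real word2vec embeddings are learned from text and have hundreds of dimensions:

```python
import math

# Hand-crafted toy "embeddings": (royalty, gender). Invented values.
vecs = {
    "king":  (0.9,  0.9),
    "queen": (0.9, -0.9),
    "man":   (0.1,  0.9),
    "woman": (0.1, -0.9),
    "dog":   (-0.9, 0.0),
}

def cos(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def nearest(v, exclude):
    """Word whose vector is most similar to v, ignoring the query words."""
    return max((w for w in vecs if w not in exclude),
               key=lambda w: cos(vecs[w], v))

# king - man + woman:
target = tuple(k - m + w
               for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"]))
print(nearest(target, {"king", "man", "woman"}))  # queen
```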

Machine translations are still not as good as quality human translations, and quite possibly won't be that good until we crack fully sapient AI. But good human translators are expensive, while everyone with Internet access has machine translators available. The question isn't whether the human translation is better, but rather how close the machine gets to that level of quality. That gap has been shrinking and is continuing to shrink.


Posted 2019-06-03T14:34:41.017

Reputation: 229

I don't like this approach - but that's a matter of taste and opinion. Doing without a "learned/savant/understanding" translation just because "human translators are expensive" makes me feel sad. What then is translation all about? – Hans-Peter Stricker – 2019-06-05T16:55:36.690

@Hans-PeterStricker Translation is about being able to communicate with people with whom you do not share a common language. Machine translation is currently good enough to allow us to do that somewhat well, although the resulting translations are often ungrammatical or sound like a non-native speaker. (continued...) – Ray – 2019-06-05T23:27:32.070

Depending on what you mean by "learned/savant/understanding", we may already be doing that. That's what the mapping to a semantic vector is; the words are embedded in a vector space that represents their underlying meaning. The Sutskever paper I linked (as "conceptual") actually does translation by mapping the entire sentence onto a semantic vector and then converting that vector into a sentence in the target language. So "understanding" of a sort is definitely happening there. (continued...) – Ray – 2019-06-05T23:28:41.063

There also exist models that learn the underlying syntax (i.e. sentence structure), and there has been work on integrating that into neural models, although at the moment, models that learn what parts of the sentence they should pay attention to at any given moment seem to be more effective at handling that sort of thing than the explicit syntactic models. (continued...) – Ray – 2019-06-05T23:29:13.713

If you don't think that any of this sort of "understanding" counts as True Understanding, then what would count other than an AI that passes the Turing Test, i.e. a fully sapient one? Do note that I never said we can't make a fully sapient AI (I couldn't say how long it'll take; that's not my part of the field. But I have little doubt we'll get there eventually). But the models I'm describing here are what we're using now, and they work fairly well at allowing people to communicate. AI research is all about getting successively better versions of "good enough" – Ray – 2019-06-05T23:29:59.697