PHASE #1: Machine Translation

1.1 Parallel Corpora

The purpose of this entry is to summarize information about the Europarl v10 and News Commentary v15 datasets that we will be using, and that may be of interest to the reader.

Background

The parallel corpora we are going to focus on are part of the training data of the Fifth Conference on Machine Translation (WMT20), which aims, among many other goals, to investigate the applicability of current MT techniques when translating into languages other than English, to create publicly available corpora for MT applications and evaluation, and to compare unsupervised MT approaches in a controlled environment.

Furthermore, it focuses on news translation tasks, so one should expect the corpora to consist of short paragraphs, as online news texts usually do.

Naturally, when making use of any of these datasets for research purposes, one must cite the WMT20 shared task overview paper, as well as respect any additional citation requirements found on each specific dataset.

Dataset Structure

With respect to the released datasets' structure, all data is in Unicode (UTF-8) format. It is important to take into account that the text is not tokenized and that sentences of any length are present, including empty ones. Many tools for processing the training data can be found in the Moses Git repository, if needed.

As mentioned earlier, we are going to be focusing on the French-English (FR-EN) parallel corpus from Europarl v10 and News Commentary v15.

For the proceedings of the European Parliament (Europarl v10), the .tsv file contains the following columns: source, target, file_id, chapter_id (which indicates the referenced document), speaker_id, speaker name, language and, finally, affiliation. Some preprocessing tasks are mandatory for this data: tokenizing the text, stripping empty lines together with their correspondences, and removing lines with XML tags, which start with "<". Lowercasing the text is not required but highly recommended. Note also that some special HTML entities and noisy characters have not been removed, which can become a tedious source of bugs in any MT task. To sum up with some stats: this dataset comprises a little over 2M sentence pairs containing over 50M English words.
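As a minimal sketch of that preprocessing, assuming a two-column source/target layout and the illustrative file name europarl-v10.fr-en.tsv (a real pipeline would rather rely on the Moses tools mentioned above):

    import csv
    import re

    def load_europarl_pairs(tsv_path="europarl-v10.fr-en.tsv"):
        """Yield cleaned (source_tokens, target_tokens) pairs."""
        with open(tsv_path, encoding="utf-8", newline="") as f:
            reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
            for row in reader:
                if len(row) < 2:
                    continue  # malformed row
                src, tgt = row[0].strip(), row[1].strip()
                # Strip empty lines together with their correspondences.
                if not src or not tgt:
                    continue
                # Remove lines carrying XML tags (they start with "<").
                if src.startswith("<") or tgt.startswith("<"):
                    continue
                # Lowercase (recommended) and naively tokenize.
                yield (re.findall(r"\w+|[^\w\s]", src.lower()),
                       re.findall(r"\w+|[^\w\s]", tgt.lower()))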

On the other side, we have a parallel corpus consisting of political and economic news commentaries originating from the website Project Syndicate (News Commentary v15). The file is available in different formats and accompanied by different kinds of annotations. In addition, other useful resources, such as pre-compiled word alignments, phrase tables, bilingual dictionaries and frequency counts, can be found on the OPUS website. This corpus comprises over 209k sentence pairs and over 10M words.

1.2 Neural Network for Machine Translation

A few years ago we started using Recurrent Neural Networks (RNNs) and phrase-based machine translation. The former was used to directly learn the mapping from an input sequence to an output sequence, while the latter translated sentences by breaking the input sequence into words and treating them independently.

When neural machine translation (NMT) first came out, it showed accuracy equivalent to existing phrase-based translation systems. Moreover, as opposed to phrase-based machine translation, NMT considers the entire input sentence as a single unit, which requires fewer engineering design choices. Over time, the NMT model was improved with many techniques, such as external alignment models, attention, and breaking words into smaller units. Despite these improvements, it wasn't until Google Neural Machine Translation (GNMT) that we had a model that could be considered a production system.

The GNMT implementation is based on an encoder that represents the input as a list of vectors and a decoder that generates the output words one by one, using at each step a weighted distribution over the encoded input vectors that highlights the most relevant source words (attention).
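A toy numpy sketch of that attention step, using a plain dot product as the scoring function (GNMT itself learns its scoring function, so this is only illustrative):

    import numpy as np

    def attention_context(encoder_states, decoder_state):
        """encoder_states: (src_len, hidden); decoder_state: (hidden,)."""
        # Score every encoded source word against the current decoder state.
        scores = encoder_states @ decoder_state
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()          # softmax: a distribution over the input
        return weights @ encoder_states   # weighted sum = context vector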

Today, using human-rated side-by-side comparison as a metric, GNMT produces much better translations than the previous phrase-based translation systems, although it may still make significant mistakes that a human translator would never commit. That is why we cannot say the machine translation problem has been solved yet; there is still a lot of work to do to serve users better.

1.3 Machine Translation via Transformers

"Attention Is All You Need" [1] is, without a doubt, one of the most impactful and interesting papers of 2017. It made it possible to do seq2seq modelling without recurrent network units. The proposed "Transformer" model is built entirely on self-attention mechanisms, without any sequence-aligned recurrent architecture. The secret recipe is carried in its model architecture. But before digging into it: why?

RNNs presented several challenges …

  • Handling long-range dependencies is tough, although attention mechanisms alleviated this.
  • Vanishing and exploding gradients.
  • A large number of training steps (unrolled RNNs become very deep models, since we must unroll as many times as there are time steps); RNNs are also more costly to train than convolutional networks and other simpler architectures.
  • Recurrence prevents parallel computation: because each step depends on the previous one and weights are tied across steps, optimization is difficult.

Transformer networks handle long-range dependencies well; there is no gradient vanishing or explosion; they need fewer training steps; and, most importantly, they have no recurrence, which enables parallel computation. The Multi-Head Attention block is where all the good stuff happens: it takes every word and combines it with other words through the attention mechanism, producing a better embedding that merges information from pairs of words. If we stack these blocks, we look at pairs of pairs (and pairs of pairs of pairs, and so on), essentially combining not just pairs of words but larger and larger groups of words. In the end we obtain completely new representations of words, ones that capture contextual meaning.
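A minimal numpy sketch of one self-attention head, with random matrices standing in for the learned projections; multi-head attention simply runs several such heads in parallel and concatenates their outputs:

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def self_attention_head(X, d_k=16, seed=0):
        """X: (seq_len, d_model) embeddings for one sentence."""
        rng = np.random.default_rng(seed)
        d_model = X.shape[1]
        # Learned projections in a real Transformer; random placeholders here.
        W_q = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
        W_k = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
        W_v = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        # Row i of A: how much word i attends to every other word.
        A = softmax(Q @ K.T / np.sqrt(d_k))
        return A @ V  # new, context-aware representation of every word

    # Multi-head attention: several heads in parallel, outputs concatenated.
    X = np.random.default_rng(1).standard_normal((5, 64))  # 5 words, d_model=64
    multi = np.concatenate([self_attention_head(X, seed=s) for s in range(4)],
                           axis=1)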

Another impressive feature of Transformers, apart from computational performance and higher accuracy, is that we can visualize the attention, gaining insight into how information travels through the network while it processes or translates a given word. Let's see an example:

  • The animal didn’t cross the street because it was too tired.
  • The animal didn’t cross the street because it was too wide.

What does the pronoun "it" refer to in each phrase? This problem is called coreference resolution. Visualizing the attention in a well-trained Transformer network, we would see that in the first sentence the attention value for "animal" is high when translating "it", whereas in the second phrase the word "street" gets the higher value. This information is of great value when doing machine translation.
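As a hypothetical illustration of such a visualization, the sketch below plots a hand-made attention matrix for the first sentence; in practice the matrix would be read off a trained model's attention head:

    import matplotlib.pyplot as plt
    import numpy as np

    tokens = ["The", "animal", "didn't", "cross", "the", "street",
              "because", "it", "was", "too", "tired", "."]
    # Placeholder weights; a real matrix would come from a trained model.
    attn = np.full((len(tokens), len(tokens)), 0.02)
    attn[tokens.index("it"), tokens.index("animal")] = 0.9  # "it" -> "animal"

    plt.imshow(attn, cmap="viridis")
    plt.xticks(range(len(tokens)), tokens, rotation=90)
    plt.yticks(range(len(tokens)), tokens)
    plt.colorbar(label="attention weight")
    plt.tight_layout()
    plt.show()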

1.4 Gender Specific Translations

Most modern translators use end-to-end neural-network-based systems to perform translations, which greatly improved translation quality when they were introduced. Google Translate, probably the best-known one, also transitioned to this type of system. In this entry, we will focus on how Google is working to reduce gender bias in its translation service. [2]

Languages differ a lot in how they represent gender. These differences can cause trouble for translation systems: when ambiguity arises (e.g. going from a gender-neutral language to a language where gender is explicitly encoded), systems tend to pick gender choices that reflect some of society's asymmetries. This problem lies more in the training data than in the models themselves, as the biases are implicitly encoded in it.

About two years ago, Google Translate started tackling this problem by returning all possible gendered translations of an ambiguous query for certain languages. As an example, the Turkish sentence o bir hemşire (he/she is a nurse) was historically translated to English as he is a nurse by Google Translate. Their current system instead returns two possible translations: he is a nurse and she is a nurse.

To support gender-specific translations of single words, enriching the available data with gender attributes was enough. However, longer queries were considerably more challenging and required major changes to the whole architecture. To overcome this, Google's team focused on Turkish-to-English translation and came up with a three-step plan.

The first step was detecting gender-neutral queries. Although many Turkish sentences are gender-neutral, not all of them are, so detecting those eligible for gender-specific translations was crucial. Moreover, this neutrality is sometimes encoded explicitly and sometimes only implicitly, so a simple mapping of pronouns was not enough. Furthermore, if a machine learning system was to be used, it had to be simple enough not to introduce high latency yet expressive enough to be accurate. A convolutional neural network was finally built to solve this task.
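Google has not published the exact architecture in this entry, so the following PyTorch sketch of a small text CNN classifier is purely an assumption of what such a low-latency detector could look like (vocabulary size, dimensions and kernel widths are made up):

    import torch
    import torch.nn as nn

    class NeutralQueryCNN(nn.Module):
        def __init__(self, vocab_size=32000, emb_dim=128, n_filters=64):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)
            # Parallel convolutions over 2-, 3- and 4-gram windows.
            self.convs = nn.ModuleList(
                [nn.Conv1d(emb_dim, n_filters, k) for k in (2, 3, 4)])
            self.fc = nn.Linear(3 * n_filters, 1)

        def forward(self, token_ids):                # (batch, seq_len >= 4)
            x = self.emb(token_ids).transpose(1, 2)  # (batch, emb_dim, seq_len)
            pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
            # Probability that the query is gender-neutral.
            return torch.sigmoid(self.fc(torch.cat(pooled, dim=1)))

    model = NeutralQueryCNN()
    dummy = torch.randint(0, 32000, (1, 10))  # a fake 10-token query
    print(model(dummy))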

The next logical step was to generate the gender-specific translations. They did so by enhancing their NMT system to produce both feminine and masculine translations when needed. They achieved this by dividing their parallel training data into feminine, masculine and gender-neutral subsets, then prepending a special token for the desired translation (i.e. <2MALE>, <2FEMALE>) to the input, and finally training the NMT system on the three datasets.
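A tiny sketch of that token-prefixing scheme (the helper name and data format are illustrative assumptions):

    GENDER_TOKENS = {"male": "<2MALE>", "female": "<2FEMALE>"}

    def tag_source(sentence, gender=None):
        """Prepend the desired-gender token; None leaves the input as-is."""
        prefix = GENDER_TOKENS.get(gender)
        return f"{prefix} {sentence}" if prefix else sentence

    print(tag_source("o bir hemşire", "female"))  # <2FEMALE> o bir hemşire
    print(tag_source("o bir hemşire"))            # o bir hemşire (default)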

Finally, they assessed the quality of the translations. To do so, given a gender-neutral query, the NMT system is requested three times: once for a masculine translation, once for a feminine translation and once for a default translation (no gender token added to the input), and these outputs are compared. If the gender-specific translations differ from each other in more than just the gendered words, the default one is shown to the user. As an example, if the model returned Did he really say those words? and Did she actually say those words? as the gender-specific translations, they would not be shown, since really was swapped with actually, and the default translation would be returned instead.
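A simplified sketch of that comparison logic, assuming a translate callable and a toy list of gendered word pairs (the production system's rules are certainly more sophisticated):

    GENDER_SWAPS = {("he", "she"), ("his", "her"), ("him", "her")}

    def differ_only_in_gender(masc, fem):
        m, f = masc.lower().split(), fem.lower().split()
        if len(m) != len(f):
            return False
        # Every differing position must be a known gendered-word swap.
        return all(a == b or (a, b) in GENDER_SWAPS for a, b in zip(m, f))

    def translate_with_gender_check(translate, query):
        masc = translate(f"<2MALE> {query}")
        fem = translate(f"<2FEMALE> {query}")
        if differ_only_in_gender(masc, fem):
            return [masc, fem]     # safe: outputs differ only in gendered words
        return [translate(query)]  # fall back to the default translation

For the really/actually example above, differ_only_in_gender returns False, so only the default translation would be shown.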

This is only the first step in fighting gender bias in translation systems; Google's plan is to extend these gender-specific translations to more languages and also to include non-binary gender in translations.

1.5 Gender in, Gender out

Natural language models, such as those used for machine translation (MT), are trained on large text corpora which contain biases and stereotypes. These models are mostly trained on millions of translated text examples, which means they learn from translations that we humans have made of all kinds of documents. The direct consequence is that a model inherits the bias present in the underlying corpora; thus, neural models may learn and even amplify these unfairnesses.

We notice an analogy in Natural Language Processing (NLP) to the well-known garbage in, garbage out (GIGO) principle: gender in, gender out. An example of this gender bias is that models absorb the assumptions of their corpora, such as man is to doctor as woman is to nurse.

We can safely assume that texts on the web carry, among others, a strong gender bias, brought onto the web from its pre-existing state in society. Consequently, unless a specific method is applied to correct this inherent bias, it will show up in the translations made by the model.

It has been shown [3] that even state-of-the-art models for NLP tasks, such as Transformer models, which rely on word embeddings for language treatment, achieve great results (sometimes even outperforming humans) yet remain incapable of solving the gender issues mentioned above: they assign certain occupations to a specific gender even when the subject of the sentence was correctly identified as the opposite gender.

Gender-neutral language processing should be implicit in any NLP application. Since word embeddings are used in most neural language models, one key step is to equalize these embeddings, and several debiasing approaches have already been proposed.

Although some debiasing techniques have been applied in NLP applications, studies of debiasing MT are quite limited. One recent paper [3] suggests that fair word embeddings could be used in MT to build more neutral models that address these issues while still achieving the same great results on non-biased cases.

The technique used to identify this unfairness tests the model's preference for matching an occupation with a specific gender. This testing showed that pre-trained embeddings combined with debiasing techniques achieve the best results. Below is an example taken from the paper, showing that with the correct implementation and debiased pre-trained embeddings, both the subject and the occupation can be correctly translated:

Embedding            Prediction
GloVe (Enc+Dec)      La conozco desde hace mucho tiempo,… mi amiga trabaja como un operator de coches.
GN-GloVe (Enc+Dec)   La conozco desde hace mucho tiempo,… mi amiga trabaja como operadora de transporte de minas.
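Note how GN-GloVe produces the feminine operadora, agreeing with the female subject mi amiga (my friend), whereas plain GloVe falls back to the masculine article un and even leaves operator untranslated. A hedged sketch of such an occupation probe, where the translate callable, the template and the word list are all illustrative assumptions:

    # English source with an unambiguously female subject; the probe checks
    # whether the translated occupation word agrees with it in gender.
    SOURCE = "I've known her for a long time, my friend works as a {}."
    FEMININE = {"operator": "operadora", "doctor": "doctora",
                "nurse": "enfermera"}

    def exhibits_bias(translate, occupation):
        output = translate(SOURCE.format(occupation)).lower()
        # Biased if the feminine form is missing despite the female subject.
        return FEMININE[occupation] not in output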

1.6 Reducing Gender Bias on Google Translate

Tech companies and research centers are making a big effort to deal with this issue. Two years ago (2018), Google Translate included gender-specific translations for some languages (such as Turkish), and now returns two possibilities, one corresponding to each binary gender. For instance, if you type "o bir doktor" in Turkish, you'll now get "she is a doctor" and "he is a doctor" as the gender-specific translations.

With the growing recognition of non-binary genders, it is clear that deep-learning-based translators will have to adapt their models if they want to account for them, and to overcome the clear gender bias in most of today's content, where the existence of non-binary genders is, overwhelmingly, not even considered.

References

[1] Attention Is All You Need (Vaswani et al., 2017)

[2] Providing Gender-Specific Translations in Google Translate

[3] Equalizing Gender Bias in Neural Machine Translation with Word Embeddings Techniques


Written on April 17, 2020