PHASE #2: Characterization and documentation of translation datasets

In this task we are expected to design the generalized structure of a Datasheet that describes a corpus that has NOT been created by us (Phase 1) and that is Machine Translation related (Phase 2). In further tasks we will create a template for this and, later on, fill it in with the information on the corpora we will be using.

2.1 Re-designing Datasheets for Datasets for when one is not the corpus creator

First, we will identify those questions found in Datasheets for Datasets that can be answered even when one is not the corpus creator. Next, we will add some further questions we have come up with that could be interesting to list. The latter will be marked in italics.

We include questions that may require access to some sort of verified information. If such information is not available, one may either omit the question or state that no information on it is available.

Motivation

First of all, we need to cover the basics. The aim of this section is to explain why the dataset exists.

  • Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?

  • Did they fund it themselves? If there is an associated grant, please provide the name of the grantor and the grant name and number.

  • For what purpose was the dataset created? Was there a specific task in mind? If so, please specify the expected result type (e.g., unit).

  • Could any of these uses, or their results, interfere with human will or communicate a false reality?

  • How old is the dataset? Please also provide the current date for reference.

  • Has there been any monetary profit from the creation of this dataset?

  • Any other comments?

Composition

Now that we know where this data is coming from, let’s see what it tells us about. One must first check whether there already exists some sort of ReadMe file that explains the content of the dataset. If it doesn’t exist, one can get almost all of this information by having a (deep) look at the dataset itself; a small inspection sketch follows below. For those questions that are more of an ethical nature, one could also use their own judgement (e.g., if the dataset contains people’s names and ID information, it is obviously likely to be confidential information that identifies individuals). However, one must always clearly state that these are not verified facts but rather one’s own conclusions.
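
To illustrate what such a (deep) look might involve, here is a minimal sketch, assuming a hypothetical tab-separated parallel corpus file named corpus.tsv with one source–target pair per line; the file name and format are assumptions, not part of any particular dataset. It counts total instances, instances with missing fields, and exact duplicates, which feeds directly into several of the Composition questions below.

```python
# A minimal inspection sketch, assuming a hypothetical tab-separated parallel
# corpus "corpus.tsv" with one "source<TAB>target" pair per line (adapt the
# path and the parsing to the actual dataset at hand).
from pathlib import Path

def inspect_corpus(path: str) -> None:
    n_instances = 0   # How many instances are there in total?
    n_missing = 0     # Is any information missing from individual instances?
    n_duplicates = 0  # Are there redundancies in the dataset?
    seen = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        n_instances += 1
        fields = line.split("\t")
        # An instance with fewer than two fields, or with an empty field,
        # is counted here as having missing information.
        if len(fields) < 2 or any(not f.strip() for f in fields):
            n_missing += 1
        if line in seen:
            n_duplicates += 1
        seen.add(line)
    print(f"instances: {n_instances}")
    print(f"instances with missing fields: {n_missing}")
    print(f"exact duplicate instances: {n_duplicates}")

if __name__ == "__main__":
    inspect_corpus("corpus.tsv")
```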

  • Are there multiple types of instances or is there just one type? Please specify the type(s) (e.g., raw data, preprocessed, continuous, discrete).

  • What do the instances (of each type, if appropriate) that comprise the dataset represent (e.g., documents, photos, people, countries)?

  • How many instances (of each type, if appropriate) are there in total?

  • Does the dataset contain all possible instances or is it just a sample of a larger set?

  • Is there a label or a target associated with each of the instances? If so, please provide a description.

  • What is the format of the data (e.g., .json, .xml, .csv)?

  • Is any information missing from individual instances? If so, please provide a description, explaining why this information is missing (e.g., because it was unavailable). This does not include intentionally removed information, but might include, e.g., redacted text.

  • Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description. Do not include missing information here.

  • *Is there any verification that guarantees there is no institutionalization of unfair biases, both regarding the dataset itself and the potential algorithms that could use it?*

  • Are there recommended data splits (e.g., training, development/validation, testing)? If so, please provide a description of these splits, explaining the rationale behind them.

  • Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? If it links to or relies on external resources, a) are there guarantees that they will exist, and remain constant, over time; b) are there official archival versions of the complete dataset (i.e., including the external resources as they existed at the time the dataset was created); c) are there any restrictions (e.g., licenses, fees) associated with any of the external resources that might apply to a future user? Please provide descriptions of all external resources and any restrictions associated with them, as well as links or other access points, as appropriate.

  • Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor–patient confidentiality, data that includes the content of individuals’ non-public communications)? If so, please provide a description.

  • Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? If so, please describe why.

  • Does the dataset relate to people? If so, please specify: a) whether the dataset identifies subpopulations or not; b) whether the dataset identifies individual people or not; c) whether it contains information that could harm any individuals or violate their rights; d) any other verified information you can provide.

  • Any other comments?

Collection Process

  • If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)?

  • Are there any guarantees that prove that the acquisition of the data did not violate any law or anyone’s rights?

  • Are there any guarantees that prove that the data is reliable?

  • If the dataset relates to people, please specify any information regarding these people (e.g., Was the data collected from these people directly? Did all the involved parties give their explicit consent? Is there any mechanism available to revoke this consent in the future, if desired?).

  • Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) been conducted? If so, please provide a description of this analysis, including the outcomes, as well as a link or other access point to any supporting documentation.

  • Were any ethical review processes conducted?

  • Any other comments? (e.g., Who collected the data, whether they were compensated, what mechanisms were used.) Please specify any other information regarding the collection process that you may know (e.g., the person who created the dataset has explained it) or be able to find (e.g., there exists an informational site about the dataset). Please only include verified information.

Preprocessing/Cleaning/Labeling

Please specify any information regarding the preprocessing that you may know (e.g., the person who created the dataset has explained it) or be able to find (e.g., there exists an informational site about the dataset). Please only include verified information.

Uses

  • Has the dataset been used already? If so, please provide a description.

  • Is there a repository that links to any or all papers or systems that use the dataset? If so, please provide a link or other access point.

  • What (other) tasks could the dataset be used for? Please include your own intentions, if any.

  • Are there tasks for which the dataset should not be used? If so, please give a description.

  • Any other comments? (e.g., Do the collection or preprocessing processes impact future uses?)

Distribution

  • Please specify the source where you got the dataset from.

  • When was the dataset first released?

  • What is the quality of the source?

  • Is the dataset distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)? Please cite a verified source.

  • Any other comments? (e.g., How has the data been distributed? Who has access to the dataset? When was the dataset first distributed? Are there any other regulations on the dataset?)

Maintenance

  • Is there any verified manner of contacting the creator of the dataset?

  • Has any erratum been reported?

  • Is there any verified information on whether the dataset will be updated in any form in the future?

  • Could changes to current legislation end the right-of-use of the dataset?

  • Any other comments? (e.g., Will the dataset be updated in the future? Is there someone supporting/hosting/maintaining the dataset? If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances?)

2.2 Modifying Datasheet for Datasets for data used in Machine Translation

In the following, we list and explain several questions that could be added to the Datasheet for Datasets [1] when those are oriented to solve Machine Translation (MT) problems.

Composition

It is known that MT datasets do not include rare or uncommon languages as often as widely spoken languages such as English. Not only are these datasets unbalanced in terms of languages, but they are also biased in terms of words. Hence, we should ask these questions every time before creating a dataset in order to avoid further discrimination:

  • Does the dataset cover the included languages equally? (A minimal balance-check sketch is given after this list.)
  • Is there any evidence that the data may be somehow biased (e.g., regarding gender or ethnicity)?
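
As a rough way to answer the first question, here is a minimal sketch, assuming a hypothetical corpus layout with one plain-text file per language (train.en, train.fr, …) and one sentence per line; the directory and file names are assumptions. It simply reports how many sentences each language contributes.

```python
# A minimal language-balance sketch, assuming a hypothetical corpus laid out
# as one plain-text file per language ("train.en", "train.fr", ...) with one
# sentence per line; adjust the glob pattern to the real layout.
from pathlib import Path

def sentences_per_language(corpus_dir: str) -> dict:
    counts = {}
    for f in sorted(Path(corpus_dir).glob("train.*")):
        lang = f.suffix.lstrip(".")  # "en", "fr", ...
        with f.open(encoding="utf-8") as handle:
            counts[lang] = sum(1 for _ in handle)
    return counts

if __name__ == "__main__":
    counts = sentences_per_language("data/")
    total = sum(counts.values()) or 1
    for lang, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        # A language far below the average share is a sign of imbalance
        # that is worth documenting in the datasheet.
        print(f"{lang}: {n} sentences ({100 * n / total:.1f}% of the corpus)")
```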

It would also be good to consider the type of text contained in the data. This can also indirectly bias MT algorithms, for example in terms of representation of some words (such as contractions) or the presence of idioms.

  • Is the data made up of formal text, informal text or both equitably? Does it also contain slang terms?

Furthermore, the data may not be made up entirely of fully correct language, and there’s the possibility of it containing, for example, language so extremely informal that it is simply not correct (hi, text messages!). This can be desirable for some applications, but undesirable for others. In the case of MT, it could be useful to spare the user from having to write perfectly in order to get quality translations in an environment where the user is not used to doing so.

  • Does the data contain incorrect language expressions on purpose? If that’s the case, please indicate which instances of the data correspond to these texts.

Collection Process

The majority of data sources used to address Natural Language Processing (NLP) tasks, and more specifically MT tasks, consist of large corpora of plain text. We find it relevant to carefully consider whether a different source would yield the same dataset as a result. Logically, as one gathers more and more samples, the variance between corpora extracted from different sources should tend to shrink, but, as an example, would one get the same descriptive adjectives for politicians from different newspapers (sources)? We all know the answer is no. Thus, we encourage dataset creators to think carefully about these biased sources.

  • If the same content were to be extracted from a different source, would it be similar? (A minimal comparison sketch is given after this list.) Moreover, and linked to the previous statement, it is possible that a corpus is made of data coming from different sources. If the answer to the previous question is negative, it is then logical to ask the following question.
  • Does the data come from a single source or is it the result of a combination of data coming from different sources? If it is a compilation of different sources, please provide links to those.
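
One cheap way to probe the first question is to compare word distributions across sources. Here is a minimal sketch, assuming two hypothetical plain-text files, source_a.txt and source_b.txt; the file names, the top-k cut-off, and the Jaccard measure are all illustrative assumptions rather than a prescribed method.

```python
# A minimal source-comparison sketch, assuming two hypothetical plain-text
# files "source_a.txt" and "source_b.txt", each holding raw text from one
# source. Comparing their most frequent words gives a rough similarity signal.
import re
from collections import Counter

def top_words(path: str, k: int = 500) -> set:
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(re.findall(r"\w+", line.lower()))
    return {word for word, _ in counts.most_common(k)}

if __name__ == "__main__":
    a = top_words("source_a.txt")
    b = top_words("source_b.txt")
    jaccard = len(a & b) / (len(a | b) or 1)
    # A low overlap suggests the two sources would not yield similar corpora,
    # which is worth stating explicitly in the datasheet.
    print(f"Jaccard overlap of top words: {jaccard:.2f}")
```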

Preprocessing/Cleaning/Labeling

If any kind of bias or discrimination was detected in the data, or the dataset creator is concerned about potentially detecting it, it is strongly recommended to mention any measures that have been applied to treat it.

  • Was there any mechanism applied to obtain a neutral language?

It is possible that not all instances go through the same preprocessing pipeline, especially if the dataset is (partly) made of a compilation of several sources.

  • Were all instances preprocessed the same way?
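
If the answer to this question is meant to be "yes", one simple way to guarantee it is to route every instance through a single function. A minimal sketch follows, assuming hypothetical example sentences; the concrete steps (Unicode normalization, lowercasing, whitespace collapse) are illustrative assumptions, not a recommendation for every dataset.

```python
# A minimal uniform-pipeline sketch: one preprocess() function applied to
# every instance, regardless of which source it came from.
import re
import unicodedata

def preprocess(sentence: str) -> str:
    sentence = unicodedata.normalize("NFC", sentence)  # consistent Unicode form
    sentence = sentence.strip().lower()                # trim and lowercase
    sentence = re.sub(r"\s+", " ", sentence)           # collapse whitespace
    return sentence

if __name__ == "__main__":
    raw = ["  Hello,   WORLD! ", "C\u0327a va ?"]  # instances from different sources
    print([preprocess(s) for s in raw])
```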

Distribution

Users of MT applications can only benefit if their language is supported by the application. Hence, we should care about covering, as much as possible, all the regions where a language is spoken.

  • Was the dataset released publicly? Are there any regions where this dataset is not available?

Maintenance

It is well known that all languages evolve over time. Recent sources of plain text may use words following different distributions (trending topics, new vocabulary, and many other variations…). It is important to consider implementing an automatic process that takes care of these changes.

  • Is vocabulary enrichment carried out automatically?
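
A simple way to monitor this need, whether or not the enrichment itself is automated, is to periodically compare the dataset vocabulary against newer text. Here is a minimal sketch, assuming hypothetical files dataset.txt (the corpus text) and recent.txt (newer text from the same domain); the file names and the frequency-based criterion are assumptions.

```python
# A minimal vocabulary-drift sketch: words that are frequent in recent text
# but absent from the corpus are candidates for a vocabulary update.
import re
from collections import Counter

def vocab(path: str) -> Counter:
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(re.findall(r"\w+", line.lower()))
    return counts

if __name__ == "__main__":
    known = set(vocab("dataset.txt"))
    recent = vocab("recent.txt")
    new_words = [(w, n) for w, n in recent.most_common() if w not in known]
    print("most frequent unseen words:", new_words[:20])
```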

References

[1] T. Gebru et al., “Datasheets for Datasets”, arXiv:1803.09010.

Written on April 20, 2020