PHASE #4: The End

In this final phase, based on all the research and the activities we have carried out, we have decided to include some new Qs to our MT DataSheet Template. At the end of this post, we will also farewell this Project.

NEW Qs INCLUDED in the DataSheet

Hereunder one can find the different sections to which we have added some new Qs or modified a Q that was already included in the DataSheet, as well as the Q at issue and its justification.

DataSheet Composition

We have modified the Q Does the dataset contain all possible instances or is it just a sample of a larger set? to Does the dataset contain all possible instances or is it just a sample of a larger set? i.e. Is the dataset different than an original one due to the preprocessing process? ) In case this dataset is a subset of another one, is the original dataset available? The preprocessing process will probably be the reason why a dataset is a subset of another dataset, so it can be helpful to ask it explicitly so that the writer thinks about it. ( i.e. it is different if I have a whole dataset and take only some instances to create another dataset ( subset ) than if I have a dataset and during preprocessing some data gets collapsed/deleted/transformed creating a dataset which is not the original (it is actually also a subset) Of course, we believe it is important to refer to the original dataset if it exists.
We have included the Q Is there syntetic data in the dataset? If so, in what percentage? We believe it is important to know wether there has been any data augmentation process performed at all.

Collection Process

We have included the Q: Where was the data collected at? ( Please include as much detail; i.e. Country, City, Community, etc. ) in order to include the where aspect. We believe it is important to differentiate between the source of the data and from where it was collected ( i.e. if the data was extracted by the company Microsoft and it is data about the computer brand preferences it is important to know whether the people surveyed are residents in China or in the States or in Cameroon )

Dataset Maintenance

We have included the Q Is there any log about the changes performed in the dataset available? We believe it is very important to be able to access information on previous versions and therefore it may be very handy to have it referenced if it exists.
We have included the Q Specify any specific limitations to contributing to the dataset? ( i.e. Can anyone contribute to it? Can someone do it at all? ) MT and NLP in general are environments where researchers contribute to different projects all the time. And therefore it would not be strange if the dataset was open for anyone to contribute ( with some verification process, of course ). Then it is very important to know whether the dataset is open for contributions and whether these may be done by anyone or only by a set of people.
We have extended the Q Is there any verified information on whether the dataset will be updatated in any form in the future? to Is there any verified information on whether the dataset will be updatated in any form in the future? Is someone in charge of checking if any of the data has become irrelevant throughout the time? If so, will it be removed or labeled somehow? Because we believe that, previously, future updates were only being considered as augmentations or error correction of the dataset. Now, one will also (explicitly) consider an update as the removal of data when this has become irrelevant for the dataset usage.

THE END

The time has come for us to conclude this blog, as we have encountered the end of this project. Throughout the whole process we have learnt and discovered many new things, like what BIAS actually means, where it comes from and how to deal with it. Furthermore, we hope we have showed and taught about this all, to many more, through this blog.

On top of all the research, the creation of the augmented MT DataSheet and the activities carried out with our university class, we have also been part of a bigger project. Which has allowed us to not only work on an actual research project but also to get to know and even attend conferences given by expertise on the field. We would like to specially thank our professor Marta R. Costa-jussà for approaching the real-world situation, not only on Algorithmic Bias but on Computer Science in general, to us students.

With these words, we say farewell to all our readers. But remember you can always contact the team at mtdatasheets@cs.upc.edu and we invite you to collaborate in our Project Website.

See you soon!

HOME

Written on May 18, 2020

Algorithmic Bias Research

We are a group of Data Science students at UPC in Barcelona. We will use this blog to track our study process on the bias found in algorithms, focusing on those used in Machine Translation.