Calmcode - bad labels: dataset

Searching for Label Errors in GoEmotions


We found a dataset called GoEmotions. It was collected by Google and contains snippets of text from Reddit, each with an associated emotion label. The dataset comes with a paper that explains how it was created. Its intended use is to predict the emotion expressed in a text. In this series of videos we will focus on the "excitement" label.

Download

If you want to download the files yourself, you can run the following from the terminal:

wget -P data/full_dataset/ https://storage.googleapis.com/gresearch/goemotions/data/full_dataset/goemotions_1.csv
wget -P data/full_dataset/ https://storage.googleapis.com/gresearch/goemotions/data/full_dataset/goemotions_2.csv
wget -P data/full_dataset/ https://storage.googleapis.com/gresearch/goemotions/data/full_dataset/goemotions_3.csv

You can also visit the GitHub page for more information.
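Once the files are downloaded, you may want to check that they load before moving on. The snippet below is a minimal sketch, not the code used in the videos. It assumes pandas is installed, that the files sit in `data/full_dataset/` as in the commands above, and that each CSV has one 0/1 column per emotion, including an `excitement` column. The helper names `load_goemotions` and `label_share` are made up for this example.

```python
from pathlib import Path

import pandas as pd


def load_goemotions(data_dir="data/full_dataset"):
    """Read every goemotions_*.csv file in data_dir into a single DataFrame."""
    paths = sorted(Path(data_dir).glob("goemotions_*.csv"))
    if not paths:
        raise FileNotFoundError(f"No goemotions_*.csv files found in {data_dir}")
    # Stack the parts on top of each other and renumber the rows.
    return pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)


def label_share(df, label="excitement"):
    """Fraction of rows where the label is set.

    Assumption: the label column holds 0/1 values.
    """
    return df[label].mean()


if __name__ == "__main__":
    df = load_goemotions()
    print(df.shape)
    print(f"share of rows labelled 'excitement': {label_share(df):.3f}")
```

If the column names in your copy of the data differ, `df.columns` will show what is actually there.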

Warning

The official paper mentions that an effort was made to remove profanity from the data. In our experience, though, it has proven very hard to randomly sample 10 examples from the dataset that don't contain any profanity. In the videos we've done our best to keep slurs out of the recording, but if you're exploring the dataset on your own, be aware.