Dataset
We want to improve the ad system on Twitch. As an example, we worked with this dataset from February 2015.
Final project for the Cloud & Big Data course 2018-2019 at Universidad Complutense de Madrid. Using the Hyperspace template under Creative Commons.
We have deployed a Hadoop cluster with Spark on Amazon Web Services to help us process all this data.
You can check our results as daily charts, or download the raw data obtained during the process.
As part of the Cloud & Big Data course, we've been learning the basics of Spark programming. This is the main work pipeline we followed during the project.
First, we develop a general script in Python 3 that extracts the data we need from every file in the dataset.
We need the power of the cloud, so we run our virtual machine instances on Amazon Web Services' EC2.
We expect peak performance, so we need to tweak the configuration of our cluster a little.
After the dataset is processed, we retrieve the resulting data and organize it on our laptops.
We obtain A LOT of data, so we need a second Python script to make the results more accessible.
Finally, we can share the results on this very website, using daily charts and offering the resulting raw data for download.
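To give an idea of what the two Python scripts in this pipeline could look like, here is a minimal sketch. It is not our actual code. The record layout (`unix_timestamp<TAB>game<TAB>viewers`), the choice of "viewers per game per day" as the metric, and the function names `parse_record`, `extract_with_spark` and `group_by_day` are all assumptions made for illustration. Only the Spark calls (`textFile`, `map`, `filter`, `reduceByKey`, `collect`) are real RDD methods.

```python
from collections import defaultdict
from datetime import datetime, timezone
from operator import add


def parse_record(line):
    """Turn one raw line into ((day, game), viewers), or None if malformed.

    Assumed (hypothetical) layout: unix_timestamp<TAB>game<TAB>viewers
    """
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 3:
        return None
    try:
        timestamp = int(parts[0])
        viewers = int(parts[2])
    except ValueError:
        return None
    # Bucket records by UTC calendar day so results can be charted daily.
    day = datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m-%d")
    return ((day, parts[1]), viewers)


def extract_with_spark(sc, input_path):
    """First script: sum viewers per (day, game) across the dataset files.

    `sc` is an already created pyspark.SparkContext; `input_path` may be a
    glob that matches every file of the dataset.
    """
    return (
        sc.textFile(input_path)
        .map(parse_record)
        .filter(lambda pair: pair is not None)  # drop malformed lines
        .reduceByKey(add)
        .collect()
    )


def group_by_day(pairs):
    """Second script: reshape [((day, game), viewers), ...] into
    {day: {game: viewers}}, with days in chronological order."""
    by_day = defaultdict(dict)
    for (day, game), viewers in pairs:
        by_day[day][game] = viewers
    return {day: by_day[day] for day in sorted(by_day)}
```

On the cluster, the first function would be called with the `SparkContext` of the running job, and its output saved to disk. The second function needs no Spark at all, so it can run on a laptop over the retrieved results before they are turned into daily charts.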
We are a group of third-year students in the Game Development degree at Universidad Complutense de Madrid. You can contact us by filling in this form, or you can check out our GitHub repository!