Tools

To develop this project we used several different tools and programs and worked on two different operating systems, Linux and Windows. On Windows we relied on two programs to make the workflow easier: PuTTY, to open an SSH connection from our Windows computer to the remote machine, and WinSCP, to open an SFTP connection and transfer the required files.
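PuTTY and WinSCP are interactive tools; purely as an illustrative alternative, the same two connections could also be scripted from Python with the paramiko library. The host name, user, password and file paths below are placeholders, not the ones actually used in the project.
import paramiko

# Open an SSH session to the remote machine, as PuTTY does interactively
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("remote-host", username="user", password="secret")  # placeholder credentials

# Transfer a script over SFTP, as WinSCP does interactively
sftp = client.open_sftp()
sftp.put("top_games_hourly.py", "/home/user/top_games_hourly.py")  # placeholder paths
sftp.close()
client.close()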
At first we thought it would be enough to write the code to obtain the top 10 broadcast video games per hour of each day, but given the huge amount of results we wanted to present them in a more general and visual way. So, once we had the 24 hourly top-ten results, we wrote a second script that works on the output of the first one to obtain the top 10 broadcast games of the whole day.
from pyspark.sql import Row
from pyspark.sql.functions import col

# Build one Row per record with the fields needed for the ranking
ids = inputRDD.map(lambda p: Row(Game=p[3].lower(), CurrentViewer=float(p[1]),
                                 Followers=float(p[7]), Partner=p[8], Language=p[9]))
df = ids.toDF()
# Keep Spanish-language streams, discard "-1" placeholders, require at least 1000 followers
df = df.filter(df['Language'] == "es").filter(df['Partner'] != "-1")
df = df.filter(df['Game'] != "-1").filter(df['Followers'] >= 1000)
# Sum concurrent viewers per game, keep the top 10 and write a single CSV file
df.groupBy(col("Game")).agg({"CurrentViewer": "sum"}) \
    .orderBy("sum(CurrentViewer)", ascending=False) \
    .limit(10).coalesce(1).write.format("com.databricks.spark.csv").save('out7')
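The snippet above assumes an inputRDD that already splits each line of the raw data into its fields. A minimal sketch of how it could be built, assuming comma-separated dumps stored on HDFS (the path and the delimiter below are placeholders, not the exact values used in the project):
from pyspark import SparkContext

sc = SparkContext(appName="HourlyTop10")

# Read one hourly dump from HDFS and split every line into its fields, so that
# p[1], p[3], p[7], ... in the map() above index positions inside a record
inputRDD = sc.textFile("hdfs:///data/streams_hour.csv") \
             .map(lambda line: line.split(","))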
The code is prepared to run on a Spark cluster with Hadoop. To read the files we first convert them into dataframes, which makes it easier to operate on the data and apply the filters. Finally, we merge all the partial results distributed across the different cores of the virtual machine (the coalesce(1) call above) into a single output file. We also used different text editors, such as Sublime Text 3 and Visual Studio Code, to work on the code.
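As an illustration of the second script mentioned earlier, the following is a minimal sketch of how the 24 hourly outputs could be combined into the top 10 of the day. It assumes the hourly results were saved as headerless CSV files with a game name and a viewer total per line; the paths and column names are assumptions, not the project's exact ones.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import col, sum as sql_sum

sc = SparkContext(appName="DailyTop10")
sqlContext = SQLContext(sc)

# Load the 24 hourly top-ten CSV outputs and name the two columns
hourly = sqlContext.read.format("com.databricks.spark.csv") \
    .load("out_hour_*/part-*").toDF("Game", "Viewers")

# Add up the hourly viewer counts per game and keep the 10 largest totals
daily = hourly.withColumn("Viewers", col("Viewers").cast("double")) \
    .groupBy("Game").agg(sql_sum("Viewers").alias("TotalViewers")) \
    .orderBy("TotalViewers", ascending=False).limit(10)

daily.coalesce(1).write.format("com.databricks.spark.csv").save("out_day")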