“Meta Kaggle” Exploration by Adrian Liaw

Background

Kaggle is a platform for data prediction competitions that calls itself “The home of data science”. It hosts hundreds of competitions, many of them run by companies such as Home Depot, Yelp and Airbnb and offering prize money. Kaggle also has a jobs board, where companies can post data science jobs for members of the Kaggle community to apply to. If you’d like to learn more, please visit the About page.

Recently, Kaggle introduced Kaggle Datasets, where you can find some high-quality public datasets such as SF Salaries, Reddit Comments and US Baby Names. Kaggle also released its own Meta Kaggle dataset, “The dataset on Kaggle, on Kaggle”. This dataset contains data about competitions, submissions, users, etc. on the Kaggle platform, and that’s what I’m going to explore.
Note: This dataset is not a complete dump; it’s a small subset from which Kaggle has filtered out some rows, columns and tables.

Although the dataset is not a complete dump, it’s still pretty large. There are 10 tables, but I’m only using 6 of them: Competitions, EvaluationAlgorithms, Submissions, TeamMemberships, Teams and Users, which have 239, 29, 934345, 68500, 59231 and 365878 observations respectively. My analysis will focus on Competitions, Submissions, Teams and Users.

Univariate Plots Section

Let’s start with a histogram of the total points earned by each user.

That doesn’t look very good, so let’s apply a log scale to the y axis.

That’s much better: with a log scale it looks like a normal distribution.
Here are some descriptive statistics of points:

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      1.7    143.6    295.9   1193.0    719.7 223600.0

Quantity of the competition reward (in USD):

Since only 239 competitions are recorded in this dataset (there are actually more than 750 on Kaggle), we can’t really observe anything special from this table alone.

Duration of the competitions (Deadline - DateEnabled):

The Duration variable isn’t in the original dataset; I computed it in R. The following plot is also about a variable that I created, N.Submissions. N.Submissions is a per-team variable: the number of times each team submitted.
Note that every team is bound to a competition: there’s a CompetitionId column in the Teams table. Also, a single user counts as a team: every time you submit as a single user, Kaggle records it as a team with 1 member.
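
The R code behind these two derived variables isn’t shown in this write-up. As a rough sketch only (in Python rather than R), this is one way they could be computed. The sample rows are made up, and the `TeamId` field on submissions is an assumption about the table layout.

```python
from collections import Counter
from datetime import datetime

# Made-up rows standing in for the Competitions table.
competitions = [
    {"Id": 1, "DateEnabled": "2015-01-05", "Deadline": "2015-04-05"},
    {"Id": 2, "DateEnabled": "2015-02-01", "Deadline": "2015-02-15"},
]

def duration_days(row):
    """Duration = Deadline - DateEnabled, in whole days."""
    start = datetime.strptime(row["DateEnabled"], "%Y-%m-%d")
    end = datetime.strptime(row["Deadline"], "%Y-%m-%d")
    return (end - start).days

durations = {row["Id"]: duration_days(row) for row in competitions}

# Made-up submissions, one (assumed) TeamId per submission.
submission_team_ids = [10, 10, 11, 10, 12, 11]

# N.Submissions: how many times each team submitted.
n_submissions = Counter(submission_team_ids)
```

Because every team belongs to exactly one competition, counting by team already gives a per-competition count.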

That’s a highly skewed distribution.

The following plot shows the number of new users joining over time:

There’s a strange peak at about March 2015. Since I wasn’t familiar with Kaggle then, I can’t really explain it.

The x-axis of this plot is the z-score of each submission’s PublicScore, computed within its competition.

The following one is basically the same as the previous one, but uses the submissions’ PrivateScore instead. The private score determines the final ranking, but it isn’t shown until the competition ends. To learn more about public and private scores, please visit this page.

Here are some descriptive statistics of PublicScore.Z:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -7.2930 -0.2151  0.3730  0.0000  0.5804  4.4920

We can see that the median is above 0. It lies about 0.59 above the 1st quartile but only about 0.21 below the 3rd quartile, so the distribution is slightly skewed.
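
For readers who want to reproduce a per-competition z-score, here is a minimal Python sketch (the original work was done in R). It assumes the sample standard deviation with an n - 1 denominator, which is what R’s `sd()` computes; the rows are made up for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev

# Made-up rows standing in for the Submissions table.
submissions = [
    {"CompetitionId": 1, "PublicScore": 0.5},
    {"CompetitionId": 1, "PublicScore": 0.7},
    {"CompetitionId": 1, "PublicScore": 0.9},
    {"CompetitionId": 2, "PublicScore": 10.0},
    {"CompetitionId": 2, "PublicScore": 20.0},
    {"CompetitionId": 2, "PublicScore": 30.0},
]

def z_scores_by_competition(rows, score_key="PublicScore"):
    """Return one z-score per row, standardised within the row's competition."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["CompetitionId"]].append(row[score_key])
    # Sample standard deviation (n - 1 denominator). Each competition needs
    # at least two scores, and they must not all be identical.
    stats = {cid: (mean(vals), stdev(vals)) for cid, vals in groups.items()}
    result = []
    for row in rows:
        mu, sd = stats[row["CompetitionId"]]
        result.append((row[score_key] - mu) / sd)
    return result

public_z = z_scores_by_competition(submissions)
```

Standardising within each competition makes scores comparable across competitions that use different metrics. It also forces the mean of the z-scores to 0, which matches the Mean of 0.0000 in the summary above.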

Submissions by days before competitions’ deadlines:

Interestingly, a lot of submissions seem to happen on the last day of a competition.
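
A possible way to derive DaysBeforeDeadline and count submissions per day, sketched in Python. The timestamp format and the sample values are assumptions, not taken from the dataset.

```python
from collections import Counter
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format

def days_before_deadline(date_submitted, deadline):
    """Whole days left between a submission and its competition's deadline."""
    delta = datetime.strptime(deadline, FMT) - datetime.strptime(date_submitted, FMT)
    return delta.days  # 0 means the submission arrived in the final 24 hours

# Made-up example: one competition, three submissions.
deadline = "2015-04-05 00:00:00"
submitted = [
    "2015-03-26 00:00:00",
    "2015-04-04 08:30:00",
    "2015-04-04 23:59:00",
]
counts = Counter(days_before_deadline(s, deadline) for s in submitted)
```

Here `counts[0]` is the size of the “last day” bar in a histogram like the one above.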

The following plot shows the distribution of the difference between each user’s current ranking and highest ranking:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0    3807    9817   11030   17280   32310  314219

Here are three bar charts showing the number of users within each tier. Kaggle users are separated into 3 different tiers: Novice, Kaggler and Master. Learn more

In the first bar chart, we can see that novice users make up an extremely large proportion of all Kaggle users. In the third plot, though, I’ve removed all the users that don’t have a ranking (no activity), and you can see that the Kaggler bar is the highest one, while Masters are only a tiny portion.

Univariate Analysis

What is the structure of your dataset?

The dataset follows a relational model, and I’m using 6 of its tables. Please visit the description for the list of columns.

The Competitions table has 239 rows, EvaluationAlgorithms has 29, Submissions has 934345, Teams has 59231, TeamMemberships has 68500 and Users has 365878.

Variables like Points, PublicScore, PrivateScore, RewardQuantity, Ranking etc. are numeric and continuous. Variables like Deadline, DateEnabled, DateSubmitted, RegisterDate etc. are dates, which are also continuous. There’s one categorical variable, TierType in Users (Novice -> Kaggler -> Master).

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is the Ranking (or Score; they’re basically the same). I’ll try to explore which other variables might be related to the ranking.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Time variables like Deadline, DateEnabled and DaysBeforeDeadline will probably be helpful, because they can show the density of submissions by teams or users. When someone submits very frequently, that might be because he is getting higher and higher accuracy, and thus a higher rank. Also, RewardQuantity is another feature that could be helpful.

Did you create any new variables from existing variables in the dataset?

Yes, I created DaysBeforeDeadline, PublicScore.Z and PrivateScore.Z in the Submissions table and N.Submissions in the Teams table, as well as Duration, computed from the competitions’ Deadline and DateEnabled.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

It’s pretty hard to say there’s anything unusual. However, I think the distribution of PublicScore.Z is a bit interesting: most of the values are above 0, which means above average in the corresponding competition, but there are also some strange peaks between -0.5 and -4. I think those are probably submissions whose content is “all zeros”.

I made some adjustments to Competitions, Submissions, Users and Teams: I changed the data type of some variables, mainly the datetime ones. I also did some joins up front, so I don’t need to join every time. Finally, I converted the integer field Tier in Users to a categorical variable TierType, so it clearly separates the 3 tiers: Novice, Kaggler and Master.

Bivariate Plots Section

Let’s start with a scatterplot of PublicScore.Z and PrivateScore.Z. As you might expect, they have a linear relationship.

## 
##  Pearson's product-moment correlation
## 
## data:  PublicScore.Z and PrivateScore.Z
## t = 2614.4, df = 930190, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9379517 0.9384386
## sample estimates:
##       cor 
## 0.9381956

There are also some outliers. Some submissions get a high public score but a low private score; those might be solutions that don’t generalise well.
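
To make the test output above easier to read: the t value follows directly from the correlation coefficient r and the degrees of freedom, t = r * sqrt(df / (1 - r^2)). The Python sketch below (not the original R code) computes Pearson’s r from scratch on made-up data and then recomputes t from the r and df printed above.

```python
import math

def pearson_r(xs, ys):
    """Pearson's product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, df):
    """t statistic for H0: true correlation is 0, with df = n - 2."""
    return r * math.sqrt(df / (1.0 - r * r))

# Made-up, perfectly linear data just to exercise pearson_r.
r_demo = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])

# r and df as reported for PublicScore.Z vs PrivateScore.Z.
t_reported = t_statistic(0.9381956, 930190)
```

With more than 930,000 pairs, even a tiny correlation would give a very small p-value, so the size of r is more informative here than the p-value.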

In the following scatterplot, the x-axis represents how many days before the deadline the user submitted, and the y-axis is the z-score of the submission’s public score:

I was expecting PublicScore.Z to get higher as DaysBeforeDeadline gets smaller, but there’s no such relationship in this plot. Notice that some points sit close to each other and form curve-like paths. They might have been submitted by the same user or team trying to get a higher score in the same competition, submitting many times within a few days, hence the paths.

In the following histogram, I filled the bars with different colours for different tiers:

We can see there are no “Novice” users in the plot. That’s because novice users are those who haven’t earned any ranking points, like me.

Another way to make this plot is to draw each tier’s bar separately:

The frequency polygon plot below is basically the same as the previous ones, but instead plots the density:

As you can see, users with a very large number of points are typically Masters.

Here are some descriptive statistics of points by each tier:

## TierType: Novice
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0       0       0       0 
## -------------------------------------------------------- 
## TierType: Kaggler
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.7   141.5   287.0   798.0   675.4 64520.0 
## -------------------------------------------------------- 
## TierType: Master
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    298.4   5255.0  12400.0  21020.0  25200.0 223600.0

N.Submissions vs Ranking of teams:

## 
##  Pearson's product-moment correlation
## 
## data:  N.Submissions and Ranking
## t = -46.469, df = 59229, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1953108 -0.1797707
## sample estimates:
##        cor 
## -0.1875525

I can’t say there’s an obvious relationship between the two, but I could say that a team is less likely to end up with a low rank if it submitted more times.

The following plots are pretty interesting, though. This time I’m also comparing N.Submissions and Ranking, but for users:

They seem to have a non-linear relationship, so we can draw a conditional means line to see it more clearly:

Let’s run a correlation test on them:

## 
##  Pearson's product-moment correlation
## 
## data:  log10(N.Submissions) and cube_root(Ranking)
## t = -117.07, df = 19802, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6477023 -0.6312390
## sample estimates:
##       cor 
## -0.639544

This is the strongest correlation I’ve found that I didn’t expect, and it’s pretty interesting. However, we can’t say that submitting more causes a higher ranking, because correlation does not imply causation.
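
The transformations and the confidence interval in the test above can be sketched as follows, in Python rather than the original R. R’s `cor.test` documents its interval as based on Fisher’s z-transformation, which is what this reproduces; the three sample users are made up.

```python
import math

def cube_root(x):
    """Cube root for non-negative x."""
    return x ** (1.0 / 3.0)

def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% CI for a correlation via Fisher's z-transformation."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Made-up (N.Submissions, Ranking) pairs, transformed as in the test above.
users = [(1, 30000), (10, 12000), (100, 900)]
transformed = [(math.log10(n), cube_root(rank)) for n, rank in users]

# r as reported above; n = df + 2 = 19804.
low, high = fisher_ci(-0.639544, 19804)
```

Note that the coefficient is negative: a smaller Ranking number means a better rank, so more submissions go together with better ranks.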

N.Submissions vs RewardQuantity:

## 
##  Pearson's product-moment correlation
## 
## data:  RewardQuantity and N.Submissions
## t = 6.2367, df = 52052, p-value = 4.505e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.01873946 0.03590781
## sample estimates:
##        cor 
## 0.02732565

I was expecting more submissions when the reward (in USD) is large, but that isn’t really happening: the correlation is only about 0.03.

The following plot is about RegisterDate and Ranking. There obviously isn’t any relationship between them, but something is pretty interesting:

I think that weird line definitely caught your eye. These are users who didn’t earn any ranking points, so they are always ranked last, and since more and more users register on the platform as time goes by, their rankings get larger and larger.

Here are three boxplots showing the distribution of Ranking, Points and N.Submissions for each tier of users:

## TierType: Novice
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   32620   37810   42750   42690   47630   52460 
## -------------------------------------------------------- 
## TierType: Kaggler
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      44    8630   16620   16610   24620   32610 
## -------------------------------------------------------- 
## TierType: Master
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1     180     479    1077    1270   16200

## TierType: Novice
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0       0       0       0 
## -------------------------------------------------------- 
## TierType: Kaggler
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.7   141.5   287.0   798.0   675.4 64520.0 
## -------------------------------------------------------- 
## TierType: Master
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    298.4   5255.0  12400.0  21020.0  25200.0 223600.0

## TierType: Novice
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     1.0     3.0     6.0    10.6    13.0   296.0   15187 
## -------------------------------------------------------- 
## TierType: Kaggler
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    3.00    8.00   24.71   22.00 1616.00    4220 
## -------------------------------------------------------- 
## TierType: Master
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     1.0    70.5   160.0   319.0   388.5  3917.0      26

From these plots and statistics, you can see how a user’s tier relates to ranking, points and number of submissions.

Notice that the second plot doesn’t have a box for “Novice”. That’s because novice users all have 0 points, and I applied a log scale to the x-axis, where log(0) is -Inf.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Users’ ranking tends to correlate with the number of submissions they make. It may look like it doesn’t make sense, but I think the reason is that users who make a lot of submissions were probably trying to increase their score. Even an increase of only 0.01 is huge if you’re in the top three of the leaderboard, and that’s why they update their solutions so often.

I also observed differences between users in different tiers: they differ substantially in ranking, points and number of submissions.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

There’s nothing really crazy I’ve found in my exploration.

What was the strongest relationship you found?

The ranking and number of submissions by a user. After applying log10 to the number of submissions and a cube root to the ranking, they have a correlation coefficient of about -0.64. It’s not extremely high, but it’s the strongest I’ve found.

Multivariate Plots Section

Let’s start this section with a relationship we just observed, N.Submissions and Ranking. In this plot, I added a third dimension, Points, represented by colour:

That’s exactly what we expected to see.

The following plot has the same x and y axes as the previous one, but colours the points by TierType.

## TierType: Novice
## 
##  Pearson's product-moment correlation
## 
## data:  df$N.Submissions and df$Ranking
## t = 3.5005, df = 3873, p-value = 0.0004697
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.02471504 0.08749143
## sample estimates:
##        cor 
## 0.05615874 
## 
## -------------------------------------------------------- 
## TierType: Kaggler
## 
##  Pearson's product-moment correlation
## 
## data:  df$N.Submissions and df$Ranking
## t = -70.498, df = 27738, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3997408 -0.3797808
## sample estimates:
##        cor 
## -0.3898066 
## 
## -------------------------------------------------------- 
## TierType: Master
## 
##  Pearson's product-moment correlation
## 
## data:  df$N.Submissions and df$Ranking
## t = -8.3278, df = 609, p-value = 5.453e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3891920 -0.2466789
## sample estimates:
##        cor 
## -0.3197427

Nothing really stands out in these relationships.

Let’s look at the plot below, HighestRanking vs Ranking and N.Submissions. My thought here is that a less active user should show a bigger difference between his highest ranking and current ranking:

The second plot I just drew is a zoomed-in version of the first. I didn’t see any interesting relationship here.

Now, let’s switch the variable represented by colour to DaysBeforeSep2015. I created this variable; it tells how long ago a user joined Kaggle.

Not surprisingly, these three variables are closely related: the longer ago a user joined Kaggle, the larger the difference between highest ranking and current ranking.

The following plot changes the colour variable to DaysAfterLastSubmit, which shows how long it has been after the last submission a user made:

And yes! They’re closely related to each other.

Nothing strange in this plot. We can see some blue points (RewardType == "Knowledge") away from the main cluster of data points. That’s because many competitions with a reward type of knowledge (which just means no prize) have long durations, e.g. Titanic: Machine Learning from Disaster.

Another way of looking at this relationship is to cut DaysBeforeDeadline into several buckets and use facet plots:

Now we can clearly see how the spread of scores varies with DaysBeforeDeadline. As the deadline approaches, the scores seem to be more concentrated, because participants tend to submit more and we’re taking z-scores. And if you look at the (300,600] facet, the quartiles are wider, which means the distribution of PublicScore.Z is more spread out.
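
The bucketing step can be sketched in Python like this. It mimics the default behaviour of R’s `cut()`, which produces right-closed intervals such as (300,600]. Only the (300,600] bucket is taken from the text; the other break points and the sample rows are assumptions for illustration.

```python
from bisect import bisect_left
from collections import defaultdict
from statistics import quantiles

# Only (300,600] appears in the text; the other edges are assumed.
EDGES = [0, 30, 100, 300, 600]

def cut(value, edges=EDGES):
    """Label value with its right-closed interval (lo,hi], or None if outside."""
    i = bisect_left(edges, value)
    if i == 0 or i == len(edges):
        return None
    return f"({edges[i - 1]},{edges[i]}]"

# Made-up (DaysBeforeDeadline, PublicScore.Z) pairs.
rows = [(5, 0.4), (12, 0.5), (20, 0.3), (450, -1.2), (500, 0.9), (580, 0.1)]

by_bucket = defaultdict(list)
for days, z in rows:
    label = cut(days)
    if label is not None:
        by_bucket[label].append(z)

# Interquartile range per bucket as a simple measure of spread.
iqr = {}
for label, zs in by_bucket.items():
    q1, _, q3 = quantiles(zs, n=4)
    iqr[label] = q3 - q1
```

A wider interquartile range in one bucket corresponds to a taller box in that facet.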

In the following plot, I compare Duration and N.Submissions, with RewardType as the colour label:

## teams$RewardType: USD
## 
##  Pearson's product-moment correlation
## 
## data:  df$Duration and df$N.Submissions
## t = 5.377, df = 52052, p-value = 7.607e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.01497372 0.03214536
## sample estimates:
##        cor 
## 0.02356128 
## 
## -------------------------------------------------------- 
## teams$RewardType: Others
## 
##  Pearson's product-moment correlation
## 
## data:  df$Duration and df$N.Submissions
## t = -1.0686, df = 1313, p-value = 0.2854
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.08340305  0.02461869
## sample estimates:
##         cor 
## -0.02947825 
## 
## -------------------------------------------------------- 
## teams$RewardType: Knowledge
## 
##  Pearson's product-moment correlation
## 
## data:  df$Duration and df$N.Submissions
## t = -0.94561, df = 5860, p-value = 0.3444
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03793988  0.01325248
## sample estimates:
##         cor 
## -0.01235179

We can’t really see any interesting relationship here, but competitions with a reward type of USD typically have a wider range of submission counts than the knowledge ones.

Date registered vs highest ranking and tier:

And again, nothing fancy here.

Here, let’s explore with facets. In the following plots, I facet by type of reward, which can be USD, Knowledge or Others (Kudos, Jobs, Swag). Each scatterplot has DaysBeforeDeadline on the x-axis and Tier as the colour label. I drew two plots: the first has PublicScore.Z on the y-axis of each scatterplot, the second has PrivateScore.Z.

Notice that in the scatterplot of RewardType == "Knowledge", submissions from the different tiers are evenly spread, but in USD there are only Kagglers and Masters.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I’ve found that the time since the last submission has a strong relationship with the difference between highest ranking and current ranking. These features strengthen each other: together, we can use them to estimate what a user’s ranking should look like.

Were there any interesting or surprising interactions between features?

Reward type and score: the distribution in the USD scatterplot (in the faceted plot) seemed to be wider than in the other two. That definitely helped me understand the data better.

Final Plots and Summary

Plot One

Description One

The distribution of z-scores of submissions’ private scores within each competition is slightly skewed, with a median of 0.37, a 1st quartile of -0.21 and a 3rd quartile of 0.58. From this, we know that a large portion of submissions have private scores above the average. Some submissions share the same, pretty low scores, which makes certain z-scores have far more matching data points than you’d expect. Perhaps those are “all-zeros” submissions, which many users made. This plot helped me understand what a typical distribution of submission scores looks like.

Plot Two

Description Two

The number of submissions by a user and the user’s current ranking have an approximately linear relationship when we apply log10 to the number of submissions and a cube root to the ranking. The more submissions a user made, the more likely the ranking is to be higher (that is, a smaller ranking number). These two variables have a correlation coefficient of -0.64, which is pretty strong for a sample of 19804 (p-value < 2.2e-16 in a hypothesis test whose alternative hypothesis is that the correlation is not 0). Also, the 95% CI for the correlation is (-0.648, -0.631). The relationship may exist because users with more submissions typically work harder on these competitions and are more likely to get higher scores and rankings. It can help us infer what a user’s rank should look like based on how many submissions he made. However, there could be some traps where someone submitted many times without getting a high score.

Plot Three

Description Three

The longer a user goes without any submission activity, the more his ranking drops from his highest ranking; you can see that through the colour gradient in the plot. If we cut “days after last submission” (the colour axis in the plot) into (0,250], (250,500], (500,1000] and (1000,2000], the correlations between highest ranking and current ranking for the groups are 0.91, 0.93, 0.85 and 0.48. We can also use this to infer a user’s current ranking: holding highest ranking constant, users with more recent activity seem to have a higher current ranking, though not exceeding their highest ranking. I chose this plot because of this interesting and strong relationship.

Reflection

As a non-active Kaggle user, I found getting familiar with this dataset quite hard work; this project actually took me more than a month. The dataset was released only months ago, so there isn’t much information about it and it’s hard to find examples. There isn’t even any documentation: Kaggle only lists the variables, without describing what they are. I spent a lot of time just trying to explore something from the dataset.
Many times during the exploration, I found it really difficult because I couldn’t find anything that interested me, or that might interest other readers. I tried to tell myself that’s normal and to just keep exploring. Eventually, I found some patterns in the data.
Throughout the analysis, lots of things were in line with my expectations, but a few weren’t, e.g. scores, submissions by “days before deadline”, and number of submissions. I think the one that surprised me most is the plot of number of submissions against users’ ranking: it makes sense, but it also doesn’t.
I’m not quite sure how this dataset can be used, but I think the next step would be building a solid model that predicts who the winner will be. Maybe Kaggle will open another competition?