Kaggle is a platform for data prediction competitions; they call it “The home of data science”. There are hundreds of competitions out there, and many of them are hosted by companies such as Home Depot, Yelp and Airbnb and offer prize money. Kaggle also has a jobs board, where companies can post data science jobs and members of the Kaggle community can apply for them. If you’d like to learn more, please visit the About page.
Recently, Kaggle introduced Kaggle Datasets, where you can find some high-quality public datasets, like SF Salaries, Reddit Comments and US Baby Names. Kaggle also released its Meta Kaggle dataset, “The dataset on Kaggle, on Kaggle”. This dataset contains data about competitions, submissions, users, etc. on the Kaggle platform, and that’s what I’m going to explore.
Note: This dataset is not a complete dump; it’s just a small subset where some rows, columns and tables have been filtered out by Kaggle.
Although the dataset is not a complete dump, it’s still pretty large. There are 10 tables, but I’m only using 6 of them: Competitions, EvaluationAlgorithms, Submissions, TeamMemberships, Teams and Users, which have 239, 29, 934345, 68500, 59231 and 365878 observations respectively. My analysis will focus on Competitions, Submissions, Teams and Users.
Let’s start with making a histogram of total points made by each user.
That doesn’t look very good, so let’s apply a log scale to the y axis.
That’s much better: with a log scale, it looks like a normal distribution.
Here are some descriptive statistics of points:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.7 143.6 295.9 1193.0 719.7 223600.0
Quantity of the competition reward (in USD):
Since only 239 competitions are recorded in this dataset (there are actually more than 750 on Kaggle), we can’t really observe anything special from this table alone.
Duration of the competitions (Deadline - DateEnabled):
There is no Duration variable in the original dataset; I created it in R. The following plot is also about a variable that I created, N.Submissions. N.Submissions is a variable of teams: it’s the number of times each team submitted.
Note that every team is bound to a competition: there’s a CompetitionId column in the Teams table. Also, a single user counts as a team: every time you submit as a single user, Kaggle records it as a team with 1 member.
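A minimal sketch of how Duration and N.Submissions could be derived in base R. The toy data frames and their values are made up for illustration, the Id and TeamId key columns are assumptions about how the tables link, and this is not necessarily how the original analysis did it.

```r
# Toy stand-ins for the Competitions, Teams and Submissions tables (made-up values).
competitions <- data.frame(
  Id = c(1, 2),
  DateEnabled = as.Date(c("2015-01-01", "2015-03-01")),
  Deadline = as.Date(c("2015-04-01", "2015-03-31"))
)
teams <- data.frame(Id = c(10, 11, 12), CompetitionId = c(1, 1, 2))
submissions <- data.frame(TeamId = c(10, 10, 10, 11, 12, 12))

# Duration in days: Deadline - DateEnabled
competitions$Duration <- as.numeric(competitions$Deadline - competitions$DateEnabled)

# N.Submissions: number of rows in Submissions for each team.
# A team with no submissions would get NA here.
counts <- table(submissions$TeamId)
teams$N.Submissions <- as.integer(counts[as.character(teams$Id)])

teams
```

Because each team row carries a CompetitionId, the per-team counts can later be joined to competition-level columns such as Duration.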
That’s a highly skewed distribution.
The following plot shows the number of new users who joined over time:
There’s a strange peak around March 2015. Since I wasn’t familiar with Kaggle then, I can’t really explain this peak.
This plot’s x-axis is the z-score of submissions’ PublicScore, computed within each competition.
The following one is basically the same as the previous one, but uses submissions’ PrivateScore instead. The private score is what determines the final ranking, but it isn’t shown before the competition ends. To learn more about public score and private score, please visit this page.
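As a rough illustration of a per-competition z-score, here is a base R sketch on made-up data. It assumes a CompetitionId column has already been joined onto the submissions; z_score is a hypothetical helper name.

```r
# Made-up submissions from two competitions with very different score scales.
submissions <- data.frame(
  CompetitionId = c(1, 1, 1, 2, 2, 2),
  PublicScore = c(0.70, 0.80, 0.90, 10, 20, 30)
)

# Standardise a vector: subtract the mean, divide by the standard deviation.
z_score <- function(x) (x - mean(x)) / sd(x)

# ave() applies z_score within each competition and keeps the original row order.
submissions$PublicScore.Z <- ave(
  submissions$PublicScore,
  submissions$CompetitionId,
  FUN = z_score
)

submissions
```

Standardising within each competition is what makes scores from different competitions comparable on one axis.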
Here are some descriptive statistics of PublicScore.Z:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -7.2930 -0.2151 0.3730 0.0000 0.5804 4.4920
We can see that the median is above 0. It sits about 0.59 above the 1st quartile but only about 0.21 below the 3rd quartile, so the distribution is slightly skewed to the left.
Submissions by days before competitions’ deadlines:
Interestingly, a lot of submissions seem to happen on the last day of a competition.
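A small sketch of how a “days before deadline” value could be computed per submission. The timestamps are made up, and it assumes each submission row already carries its competition’s Deadline.

```r
# Made-up submissions with their competition deadline attached.
submissions <- data.frame(
  DateSubmitted = as.POSIXct(c("2015-03-30 12:00:00", "2015-03-01 00:00:00"), tz = "UTC"),
  Deadline = as.POSIXct(c("2015-03-31 00:00:00", "2015-03-31 00:00:00"), tz = "UTC")
)

# Time left until the deadline, expressed in (fractional) days.
submissions$DaysBeforeDeadline <- as.numeric(
  difftime(submissions$Deadline, submissions$DateSubmitted, units = "days")
)

submissions$DaysBeforeDeadline
```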
The following plot shows the distribution of difference between current ranking and highest ranking of each user:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 3807 9817 11030 17280 32310 314219
Here are three bar charts showing the number of users within each tier. Kaggle users are separated into 3 different tiers: Novice, Kaggler and Master. Learn more
In the first bar chart, we can see that novice users make up an extremely large proportion of all Kaggle users. In the third plot, I’ve removed all the users who don’t have a ranking (no activity at all), and you can see that the Kaggler bar is the highest one, while Masters are only a tiny portion.
The dataset was designed with a relational model. There are 6 tables I’m using; please visit the description for the list of columns.
The Competitions table has 239 rows, Submissions has 934345, Teams has 59231, TeamMemberships has 68500 and Users has 365878.
Variables like Points, PublicScore, PrivateScore, RewardQuantity, Ranking etc. are numeric and continuous. Variables like Deadline, DateEnabled, DateSubmitted, RegisterDate etc. are dates, which are also continuous. There’s one categorical variable, TierType in Users (Novice -> Kaggler -> Master).
The main feature of interest is the Ranking (or Score, they’re basically the same). I’ll try to explore which other variables might have a relationship with the ranking.
Time variables like Deadline, DateEnabled, DaysBeforeDeadline etc. are probably going to be helpful, because they can show the density of submissions by teams or users. When someone submits very frequently, that might be because they are getting higher and higher accuracy, and thus a higher rank. RewardQuantity is another feature that is likely to be helpful.
Yes, I created DaysBeforeDeadline, PublicScore.Z and PrivateScore.Z in the Submissions table, and N.Submissions in the Teams table.
It’s pretty hard to say there’s anything unusual. However, I think the distribution of PublicScore.Z is a bit interesting: most values are located above 0, which is above average in the corresponding competition, but there are also some strange peaks between -0.5 and -4. I think they’re probably submissions whose content is “all zeros”.
I made some adjustments to Competitions, Submissions, Users and Teams. I changed the data type of some variables, mainly datetime variables. I also did some joins up front, so there’s no need to join every time. Finally, for TierType in Users, I converted the integer field Tier to a categorical variable TierType so it clearly separates the 3 tiers: Novice, Kaggler and Master.
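The tier conversion described above might look like the following sketch. The integer codes 1, 2 and 3 are an assumption for illustration; the actual coding of Tier in the dataset is not stated here.

```r
# Made-up users; the codes 1/2/3 for Novice/Kaggler/Master are assumed.
users <- data.frame(Tier = c(1L, 2L, 3L, 1L))

# An ordered factor keeps the Novice -> Kaggler -> Master progression.
users$TierType <- factor(
  users$Tier,
  levels = 1:3,
  labels = c("Novice", "Kaggler", "Master"),
  ordered = TRUE
)

table(users$TierType)
```

An ordered factor also keeps the tiers in a sensible order on plot axes and in grouped summaries.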
Let’s start with a scatterplot of PublicScore.Z and PrivateScore.Z. As you might expect, they have a linear relationship.
##
## Pearson's product-moment correlation
##
## data: PublicScore.Z and PrivateScore.Z
## t = 2614.4, df = 930190, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9379517 0.9384386
## sample estimates:
## cor
## 0.9381956
There are also some outliers with a high public score but a low private score; they might be solutions that didn’t generalise.
In the following scatterplot, the x-axis represents how many days before the deadline the user submitted, and the y-axis is z-scores of the submissions’ public score:
I was expecting PublicScore.Z to get higher as DaysBeforeDeadline gets smaller, but that doesn’t show in this plot. Notice that some points sit close to each other and form curve-like paths. They might have been submitted by the same user or team trying to get a higher score in the same competition, submitting many times within a few days, hence the paths.
In the following histogram, I filled the bars with different colours for different tiers:
We can see there are no “Novice” users in the plot. That’s because novice users are those who haven’t earned any ranking points, like me.
Another way to make this plot is to draw each tier’s bar separately:
The frequency polygon plot below is basically the same as the previous ones, but plots the density instead:
As you can see, users with a very large number of points are typically Masters.
Here are some descriptive statistics of points by each tier:
## TierType: Novice
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 0 0 0
## --------------------------------------------------------
## TierType: Kaggler
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.7 141.5 287.0 798.0 675.4 64520.0
## --------------------------------------------------------
## TierType: Master
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 298.4 5255.0 12400.0 21020.0 25200.0 223600.0
N.Submissions vs Ranking of teams:
##
## Pearson's product-moment correlation
##
## data: N.Submissions and Ranking
## t = -46.469, df = 59229, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1953108 -0.1797707
## sample estimates:
## cor
## -0.1875525
I can’t say there’s an obvious relationship between them, but I could say that a team that submits more often is somewhat less likely to end up with a poor rank (a large Ranking number).
The following plots are pretty interesting though. This time I’m also comparing N.Submissions and Ranking, but for users:
They seem to have a non-linear relationship. We can draw a conditional means line to see it more clearly:
Let’s apply a hypothesis test to check their correlation:
##
## Pearson's product-moment correlation
##
## data: log10(N.Submissions) and cube_root(Ranking)
## t = -117.07, df = 19802, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.6477023 -0.6312390
## sample estimates:
## cor
## -0.639544
This is the strongest correlation I’ve found that I didn’t expect, and it’s pretty interesting. However, we can’t say that submitting more causes a higher ranking, because correlation does not imply causation.
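For reference, a test on the transformed variables can be set up roughly as below. The data is synthetic, and the definition of cube_root is an assumption: only its name appears in the test output.

```r
# cube_root is the name shown in the test output; this definition is assumed.
cube_root <- function(x) x^(1 / 3)

# Synthetic users: more submissions go with a smaller (better) Ranking number.
users <- data.frame(
  N.Submissions = c(1, 3, 10, 30, 100, 300, 1000),
  Ranking = c(30000, 20000, 12000, 6000, 2000, 500, 50)
)

# Pearson correlation test on the transformed scale.
res <- with(users, cor.test(log10(N.Submissions), cube_root(Ranking)))

res$estimate  # Pearson's r (negative for this synthetic data)
res$conf.int  # 95 percent confidence interval
```

A negative r here means that more submissions go with a smaller Ranking number, that is, a better rank.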
N.Submissions vs RewardQuantity:
##
## Pearson's product-moment correlation
##
## data: RewardQuantity and N.Submissions
## t = 6.2367, df = 52052, p-value = 4.505e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.01873946 0.03590781
## sample estimates:
## cor
## 0.02732565
In this plot, I was expecting more submissions when the reward quantity (USD) is large, but that’s not really happening: the correlation is only about 0.03.
The following plot is about RegisterDate and Ranking. There obviously isn’t any relationship between them, but something is pretty interesting:
I think that weird line definitely caught your eye. These are users who haven’t earned any ranking points, so their ranking is always last. Since more and more users register on the platform as time goes by, their rankings get larger and larger.
Here are three boxplots showing the distribution of Ranking, Points and N.Submissions for each tier of users:
## TierType: Novice
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 32620 37810 42750 42690 47630 52460
## --------------------------------------------------------
## TierType: Kaggler
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 44 8630 16620 16610 24620 32610
## --------------------------------------------------------
## TierType: Master
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 180 479 1077 1270 16200
## TierType: Novice
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 0 0 0
## --------------------------------------------------------
## TierType: Kaggler
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.7 141.5 287.0 798.0 675.4 64520.0
## --------------------------------------------------------
## TierType: Master
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 298.4 5255.0 12400.0 21020.0 25200.0 223600.0
## TierType: Novice
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.0 3.0 6.0 10.6 13.0 296.0 15187
## --------------------------------------------------------
## TierType: Kaggler
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 3.00 8.00 24.71 22.00 1616.00 4220
## --------------------------------------------------------
## TierType: Master
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.0 70.5 160.0 319.0 388.5 3917.0 26
From these plots and statistics, you can see how a user’s tier relates to ranking, points and number of submissions.
Notice the second plot doesn’t have a box for “Novice”. That’s because novice users all have Points of 0, and log(0) is -Inf, but I applied a log scale to the x-axis.
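A quick check in R shows why zero-point users disappear on a log scale (the point values are made up):

```r
# Two zero-point users and two users with points.
points <- c(0, 0, 1.7, 287)

log10(points)             # the zeros become -Inf
is.finite(log10(points))  # only the non-zero values are finite
```

Values that are not finite cannot be placed on a log axis, so there is nothing to draw for a group where every value is 0.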
The ranking of users tends to correlate with the number of submissions a user makes. At first it doesn’t seem to make sense, but I think the reason is that users who make a lot of submissions are probably trying to increase their score, even if only by 0.01, and that’s a huge increment if you’re in the top three of the leaderboard. That’s why they update their solutions so often.
I also observed differences between users in different tiers: they differ significantly in ranking, points and number of submissions.
I haven’t found anything really crazy in my exploration.
The ranking and the number of submissions of a user. They have a correlation coefficient of about -0.64 (after the log10 and cube root transforms). It’s not especially strong, but it’s the strongest I’ve found.
Let’s start this section with a relationship we just observed, N.Submissions to Ranking. In this plot, I added a third dimension, Points, and represented it using colours:
That’s exactly what we expected to see.
The following plot has the same x and y axes as the previous one, but labels the points using TierType.
## TierType: Novice
##
## Pearson's product-moment correlation
##
## data: df$N.Submissions and df$Ranking
## t = 3.5005, df = 3873, p-value = 0.0004697
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.02471504 0.08749143
## sample estimates:
## cor
## 0.05615874
##
## --------------------------------------------------------
## TierType: Kaggler
##
## Pearson's product-moment correlation
##
## data: df$N.Submissions and df$Ranking
## t = -70.498, df = 27738, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3997408 -0.3797808
## sample estimates:
## cor
## -0.3898066
##
## --------------------------------------------------------
## TierType: Master
##
## Pearson's product-moment correlation
##
## data: df$N.Submissions and df$Ranking
## t = -8.3278, df = 609, p-value = 5.453e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3891920 -0.2466789
## sample estimates:
## cor
## -0.3197427
Nothing really stands out in these relationships.
Let’s look at the plot below, HighestRanking vs Ranking and N.Submissions. My thought here is that if a user is less active, there should be a bigger difference between his highest ranking and his current ranking:
The second plot I just drew is a zoomed-in version of the first plot. I didn’t see any interesting relationship here.
Now, let’s switch the variable represented by colour to DaysBeforeSep2015. I created this variable; it tells how long a user has been on Kaggle.
Not surprisingly, these three variables are closely related to each other: the longer ago a user joined Kaggle, the larger the difference between highest ranking and current ranking.
The following plot changes the colour variable to DaysAfterLastSubmit, which shows how long it has been since the last submission a user made:
And yes! They are closely related to each other.
Nothing strange in this plot. We can see there are some blue points (RewardType == "Knowledge") away from the main cluster of data points. That’s because many competitions with a reward type of knowledge (well, it just means no prize) have long durations, e.g. Titanic: Machine Learning from Disaster.
Another way of looking at this relationship is to cut DaysBeforeDeadline into several buckets and use facet plots:
Now we can clearly see how the spread of scores varies with DaysBeforeDeadline. As the deadline approaches, the scores seem to be more concentrated, because participants tend to submit more and we’re taking z-scores. If you look at the plot in the (300,600] facet, the quartiles are wider, which means the distribution of PublicScore.Z is more spread out.
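Bucketing a continuous variable for faceting can be done with cut(). In this sketch the day values are made up, and only the (300,600] interval comes from the text above; the other break points are illustrative assumptions.

```r
# Made-up "days before deadline" values.
days <- c(0.5, 10, 45, 120, 400)

# Intervals are open on the left and closed on the right by default,
# so a value of exactly 0 would become NA unless include.lowest = TRUE is set.
bucket <- cut(days, breaks = c(0, 1, 30, 100, 300, 600))

table(bucket)
```

The resulting factor can then be used as the faceting variable, giving one panel per interval.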
In the following plot, I’m going to compare Duration and N.Submissions, with RewardType as the colour label:
## teams$RewardType: USD
##
## Pearson's product-moment correlation
##
## data: df$Duration and df$N.Submissions
## t = 5.377, df = 52052, p-value = 7.607e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.01497372 0.03214536
## sample estimates:
## cor
## 0.02356128
##
## --------------------------------------------------------
## teams$RewardType: Others
##
## Pearson's product-moment correlation
##
## data: df$Duration and df$N.Submissions
## t = -1.0686, df = 1313, p-value = 0.2854
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.08340305 0.02461869
## sample estimates:
## cor
## -0.02947825
##
## --------------------------------------------------------
## teams$RewardType: Knowledge
##
## Pearson's product-moment correlation
##
## data: df$Duration and df$N.Submissions
## t = -0.94561, df = 5860, p-value = 0.3444
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03793988 0.01325248
## sample estimates:
## cor
## -0.01235179
We can’t really see any interesting relationship here, but we can see that competitions with a reward type of USD typically have a wider range of number of submissions than the knowledge ones.
Date registered vs highest ranking and tier:
And again, nothing fancy here.
Here, let’s explore with facets. In the following plots, I facet by type of reward, which can be USD, Knowledge or Others (Kudos, Jobs, Swag). Each scatterplot has DaysBeforeDeadline on the x-axis and Tier as the colour label. I drew two plots: the first has PublicScore.Z on the y-axis of each scatterplot, the second has PrivateScore.Z.
Notice that in the scatterplot of RewardType == "Knowledge", submissions by different tiers are evenly spread, but in USD there are only Kagglers and Masters.
I found that the time since the last submission has a strong relationship with the difference between highest ranking and current ranking. These features reinforce each other, so we can use the two together to estimate what a user’s ranking should be.
Reward type and score: the distribution in the USD scatterplot (in the plot with facets) seemed to be wider than in the other two. That definitely helped me understand the data better.
The distribution of z-scores of submissions’ private scores within each competition is slightly skewed, with a median of 0.37, a 1st quartile of -0.21 and a 3rd quartile of 0.58. From this, we know that a large portion of submissions have private scores above the average. Some submissions share the same, pretty low score, which gives certain z-scores far more data points than their neighbours. Perhaps those are submissions where users submitted an “all-zeros” solution, and many users did that. This plot helped me understand what a typical distribution of submission scores looks like.
The number of submissions by a user and the user’s current ranking have an approximately linear relationship when we apply log10 to the number of submissions and a cube root to the ranking. The more submissions a user makes, the more likely the ranking is to be higher (a smaller Ranking number). These two variables have a correlation coefficient of about -0.64, which is pretty strong for a sample of 19804 (p-value < 0.000001 in a hypothesis test where the alternative hypothesis is that the correlation is not 0). Also, the 95% CI for the correlation is (-0.648, -0.631). The relationship may exist because users with a larger number of submissions typically work harder on these competitions and are more likely to get a higher score and ranking. This relationship can help us infer what a user’s rank should be, based on how many submissions he has made. However, there could be traps where someone submitted many times without getting a high score.
The longer a user goes without any submission activity, the more his ranking drops from his highest ranking; you can see that through the colour gradient in the plot. If we cut “days after last submission” (the colour axis in the plot) into (0,250], (250,500], (500,1000] and (1000,2000], the correlations between highest ranking and current ranking for the groups are 0.91, 0.93, 0.85 and 0.48. We can also use this to infer a user’s current ranking: holding highest ranking constant, users with more recent activity seem to have a higher current ranking, though not exceeding their highest ranking. That’s why I chose this plot: because of this interesting and strong relationship.
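The per-group correlations mentioned above could be computed along these lines. The bucket boundaries come from the text; the user data is made up, so the printed correlations will not match the reported 0.91, 0.93, 0.85 and 0.48.

```r
# Made-up users, three per bucket of days since the last submission.
users <- data.frame(
  DaysAfterLastSubmit = c(10, 100, 200, 300, 400, 450, 600, 800, 900, 1200, 1500, 1900),
  HighestRanking = c(10, 200, 1500, 50, 400, 3000, 100, 900, 2500, 20, 700, 4000),
  Ranking = c(15, 260, 1700, 120, 800, 3900, 900, 1500, 5000, 9000, 4000, 12000)
)

# dig.lab = 4 keeps labels like "(500,1000]" instead of "(500,1e+03]".
users$Bucket <- cut(
  users$DaysAfterLastSubmit,
  breaks = c(0, 250, 500, 1000, 2000),
  dig.lab = 4
)

# Pearson correlation of highest vs current ranking within each bucket.
by_bucket <- sapply(
  split(users, users$Bucket),
  function(d) cor(d$HighestRanking, d$Ranking)
)

round(by_bucket, 2)
```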
As a non-active Kaggle user, I found getting familiar with this dataset quite hard work; this project actually took me more than a month. The dataset was released only months ago, so there isn’t much information about it and it’s hard to find examples. There isn’t even documentation for the dataset: Kaggle only lists the variables, without describing what they are. I spent a lot of time trying to explore something from the dataset.
Many times during the exploration I found it really difficult, because I couldn’t find anything that would interest me and maybe other readers. I tried to tell myself that’s normal and to just keep exploring. Eventually, I found some patterns in the data.
Throughout the analysis, lots of things were in line with my expectations and a few weren’t, e.g. scores, submissions by “days before deadline” and number of submissions. I think the one that surprised me most is the plot of number of submissions against ranking of users: it makes sense, but also doesn’t.
I’m not quite sure how this dataset can be used, but I think the next step would be building a solid model that predicts who the winner is. Maybe Kaggle will open another competition?