In a baseball game, the players control the game, and the game stats record the actual performance of the teams and players. The audience controls the atmosphere, and its emotions may indirectly affect a team's performance. A crowd can raise a team's morale by cheering, or lower it by booing. So, is the size of the audience related to a team's performance?
In the following analysis, I'll be using Lahman's Baseball Database, which can be downloaded here
# These are the libraries we're going to use
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
pd.set_option("display.max_columns", 500)  # The data tables have a lot of columns,
# and pandas replaces some of them with an ellipsis in the output display.
# Raising this option avoids that (unless there are more than 500 columns).
%matplotlib inline
We need the attendance and performance statistics, which are available in Teams.csv.
# All the data we need are in Teams.csv
teams = pd.read_csv("Teams.csv")
Let's take a quick look at a few sample rows from the Teams table.
You can see there's an attendance field on the far right.
teams.sample(5)
teams["attendance"].describe()
teams["attendance"].hist()
plt.show()
As you can see in the plot, the distribution is highly skewed and shaped like a staircase, which seems unnatural. But take a look at the following plot:
teams.plot(x="yearID", y="attendance", kind="scatter")
# Overlay the mean attendance per year as a thick red line
plt.plot(teams.groupby("yearID")["attendance"].mean(), color="r", linewidth=4)
plt.xlim(right=2020)  # xmax/ymin keywords are not accepted by recent matplotlib
plt.ylim(bottom=0)
plt.show()
It's obvious that attendance varies over time: it keeps getting larger. This may be caused by many factors, such as stadium capacity, urbanisation, and transportation. We shouldn't say that "a total attendance of 900000 is small" based on the value alone. In 2014 it would be very small, but in 1900 it would be very large.
So it's a good idea to standardize the data based on the year.
As described above, some variables change over time, driven by factors that themselves shift steadily from year to year.
"Standardize by year" means turning a value into its z-score within the sample of all observations from the same year. For example, the standardized value for attendance of an observation with attendance of 3000000 and yearID of 2014 is 0.915.
def standardize_by_year(df, field):
    # The lambda returns a Series holding the standardized values of one year,
    # and transform() stitches the groups back together on the original row index,
    # so the result can be assigned straight to a new column.
    return df.groupby("yearID")[field].transform(
        lambda g: (g - g.mean()) / g.std(ddof=0)
    )
# Now apply it to our dataset
teams["standard_attendance"] = standardize_by_year(teams, "attendance")
teams["standard_attendance"].hist()
plt.show()
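To make the per-year z-score concrete, here is a minimal sketch on a hypothetical toy table (the numbers are made up, not taken from Teams.csv). It uses `transform`, which keeps the original row index, and shows that the largest crowd of 1900 and the largest crowd of 2014 end up with the same standardized value even though the raw numbers differ tenfold.

```python
import pandas as pd

# Hypothetical toy data: two seasons with very different attendance scales
toy = pd.DataFrame({
    "yearID": [1900, 1900, 1900, 2014, 2014, 2014],
    "attendance": [100_000, 200_000, 300_000, 1_000_000, 2_000_000, 3_000_000],
})

def standardize_by_year(df, field):
    # z-score of each value within its own season (population std, ddof=0)
    return df.groupby("yearID")[field].transform(
        lambda g: (g - g.mean()) / g.std(ddof=0)
    )

toy["standard_attendance"] = standardize_by_year(toy, "attendance")
print(toy)
```

Within each season the standardized values average to 0, which is what makes seasons from different eras comparable.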
The distribution is still a little skewed, but it's much more reasonable than the original one.
All the observations with yearID < 1890 are missing the attendance field, so let's just drop them.
For the rest, just fill 0 into standard_attendance, since that's the mean value anyway.
teams = teams[teams["yearID"] >= 1890].copy()  # copy so the assignment below modifies a real frame
teams["standard_attendance"] = teams["standard_attendance"].fillna(0)
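The drop-and-fill step can be sanity-checked on a small example. The sketch below uses hypothetical toy rows (not the real Teams table): one early season with no attendance records at all and one later season with a single gap. It measures the share of missing values per season, then applies the same filter and fill.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows: 1880 has no attendance records, 1890 has one gap
toy = pd.DataFrame({
    "yearID": [1880, 1880, 1890, 1890, 1890],
    "attendance": [np.nan, np.nan, 100.0, 200.0, np.nan],
})

# Per-year z-score; missing values stay missing
toy["standard_attendance"] = toy.groupby("yearID")["attendance"].transform(
    lambda g: (g - g.mean()) / g.std(ddof=0)
)

# Share of missing attendance values in each season
missing_share = toy["attendance"].isna().groupby(toy["yearID"]).mean()

# Drop the early season, then fill the remaining gap with 0,
# which is the mean of a z-score within each year
kept = toy[toy["yearID"] >= 1890].copy()
kept["standard_attendance"] = kept["standard_attendance"].fillna(0)

print(missing_share)
print(kept)
```

Filling with 0 treats a team with unknown attendance as exactly average for its season, so it neither pulls the yearly mean up nor down.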
In this analysis, we're exploring the relationship between attendance and performance. Performance can be measured in a variety of ways: runs, hits, homeruns, wins, etc. Let's make some plots and see if there's anything interesting.
Runs is the total number of runs scored by a team in the year, and it's available via the R field. We expect a positive relationship: teams that draw larger crowds should tend to score more.
sns.jointplot(x="standard_attendance", y="R", data=teams)
plt.show()
The plot points in the direction we expected, but it's not clear that the positive relationship is a significant one.
Errors is the total number of errors made by a team in the year, and it's available via the E field. This time we expect a negative relationship, since fewer errors is better.
sns.jointplot(x="standard_attendance", y="E", data=teams)
plt.show()
According to the plot, there seems to be no relationship.
Homeruns is the total number of homeruns hit by a team in the year, and it's available via the HR field. We expect a positive relationship.
sns.jointplot(x="standard_attendance", y="HR", data=teams)
plt.show()
But there's clearly no relationship.
Winning percentage is a more general way to measure a team's performance. Our dataset doesn't have this field, but the calculation is pretty straightforward.
teams["winning_percentage"] = teams["W"] / teams["G"]
sns.jointplot(x="standard_attendance", y="winning_percentage", data=teams)
plt.show()
It turns out that this relationship is pretty strong. Just by looking at the plot, we can see a positive linear relationship. Pearson's r is 0.59, which is pretty high for a sample of 2538 observations.
Seaborn also shows us the p-value, which is the two-sided p-value for a hypothesis test whose null hypothesis is $\rho = 0$ (where $\rho$ is the true correlation in the population). The p-value shown is $3.8 \times 10^{-233}$, which is effectively $0$. Given an alpha level of $0.01$ (actually, the choice hardly matters since the p-value is so small), the relationship is considered extremely significant.
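Pearson's r and its two-sided p-value can also be computed directly, which helps if the plot does not print them. The sketch below assumes SciPy is installed and runs on synthetic data that only stands in for standard_attendance and winning_percentage, so the printed numbers are not the ones quoted above. It also recomputes r from its definition as a cross-check.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Synthetic stand-ins (assumption): x plays the role of standard_attendance,
# y plays the role of winning_percentage
x = rng.normal(size=500)
y = 0.5 + 0.05 * x + rng.normal(scale=0.06, size=500)

# Pearson's r and the two-sided p-value for the null hypothesis rho = 0
r, p_value = stats.pearsonr(x, y)

# The same r from its definition: covariance over the product of standard deviations
r_manual = np.mean((x - x.mean()) * (y - y.mean())) / (x.std() * y.std())

print(f"r = {r:.3f}, r (manual) = {r_manual:.3f}, two-sided p-value = {p_value:.3g}")
```

On the real data, the call would be `stats.pearsonr(teams["standard_attendance"], teams["winning_percentage"])`.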
We can even perform a linear regression and plot the regression line so that we can see how they're actually related:
sns.regplot(x="standard_attendance", y="winning_percentage", data=teams,
            line_kws={"color": "orange"}, scatter_kws={"alpha": 0.5})
plt.ylim(top=1)  # winning percentage cannot exceed 1
plt.show()
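The regression plot draws the line but does not report its coefficients. A minimal sketch of one way to get them is an ordinary least-squares fit of a straight line with `np.polyfit`. The data here is synthetic and only stands in for the two columns, so the slope and intercept are illustrative, not results from Teams.csv.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Synthetic stand-ins (assumption): true slope 0.05, true intercept 0.5
x = rng.normal(size=500)
y = 0.5 + 0.05 * x + rng.normal(scale=0.06, size=500)

# Least-squares straight line: y is approximately slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)

# Closed-form check: slope = cov(x, y) / var(x)
slope_manual = np.mean((x - x.mean()) * (y - y.mean())) / x.var()

print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
```

Because the x variable is a z-score, the slope reads as the change in winning percentage associated with one standard deviation more attendance within a season.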
According to the visualizations and statistical summaries above, the relationship between attendance and a team's winning percentage is significant.
And now we can answer the question from the beginning:
Q: How is a team's attendance related to the team's performance?
A: Teams with larger audiences at their ballgames tend to have a higher winning percentage.