The aim of this project is to produce an interactive plot of the score of an NBA game over the course of the match:
This is part of a project on GitHub.
Data Collection
The above score data is collected from:
https://www.basketball-reference.com/boxscores/pbp/202001080CHO.html
The data is shown in the Play-By-Play table.
We use the requests library to grab the page contents. Then we use pandas to extract and parse the table using beautiful-soup.
The program scrapes match information from basketball-reference.com, extracts the scores, pre-processes the data and visualises it against time. The first step fetches the page and parses the table:
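import pandas as pd
import requests

def score_table_from_url(url):
    # Fetch URL contents
    response = requests.get(url)
    return response.content

def dataframe_from_table_html(html_str):
    # Parse the table with id "pbp" using the beautiful-soup backend
    score_table = pd.read_html(io=html_str, attrs={"id": "pbp"}, flavor="bs4")[0]
    # Select the columns under the top-level header "1st Q"
    return score_table["1st Q"]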
Data Cleaning
The output from the table parsing has several issues that we need to clean up, such as unnamed columns, NaN cells and non-score rows whose text is repeated across every column. Below is a typical example of the initial pandas dataframe:
 | Time | Toronto | Unnamed: 2_level_1 | Score | Unnamed: 4_level_1 | Charlotte |
---|---|---|---|---|---|---|
0 | 12:00.0 | Jump ball: S. Ibaka vs. B. Biyombo (P. McCaw gains possession) | Jump ball: S. Ibaka vs. B. Biyombo (P. McCaw gains possession) | Jump ball: S. Ibaka vs. B. Biyombo (P. McCaw gains possession) | Jump ball: S. Ibaka vs. B. Biyombo (P. McCaw gains possession) | Jump ball: S. Ibaka vs. B. Biyombo (P. McCaw gains possession) |
1 | 11:40.0 | O. Anunoby misses 2-pt layup from 2 ft | nan | 0-0 | nan | nan |
2 | 11:35.0 | nan | nan | 0-0 | nan | Defensive rebound by D. Graham |
The steps we take to clean the data are outlined as follows:
def clean_table(score_df) -> pd.DataFrame:
    score_df = remove_unnamed_columns(score_df)
    score_df = add_quarter_column(score_df)
    score_df = remove_nonscore_rows(score_df)
    score_df = scores_to_separate_columns(score_df)
    score_df, team_names = add_team_label(score_df)
    score_df = add_action_label(score_df, team_names)
    score_df = normalise_time_remaining(score_df)
    # Make Time index
    score_df.set_index(keys=["TimeElapsed"], inplace=True)
    return score_df
The full source can be found here: https://github.com/stanton119/nba-scores/blob/master/src/main.py
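To illustrate two of the steps above, filtering out non-score rows and splitting the score into separate columns, here is a minimal sketch in plain Python. The name parse_score is hypothetical and the logic is an assumption about what such a step might do; the actual helpers in the linked source may work differently.

```python
import re
from typing import Optional, Tuple

# Matches cells such as "0-2"; anything else (e.g. jump-ball text) is not a score.
SCORE_PATTERN = re.compile(r"^(\d+)-(\d+)$")


def parse_score(cell: object) -> Optional[Tuple[int, int]]:
    """Return the first and second number of a score cell, or None if it is not a score."""
    match = SCORE_PATTERN.match(str(cell).strip())
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Example cells taken from the Score column of the raw table
raw_cells = [
    "Jump ball: S. Ibaka vs. B. Biyombo (P. McCaw gains possession)",
    "0-0",
    "0-2",
    float("nan"),
]
parsed = [parse_score(cell) for cell in raw_cells]
# Keep only the cells that really hold a score
score_rows = [score for score in parsed if score is not None]
print(score_rows)  # [(0, 0), (0, 2)]
```

Returning None for anything that is not a score lets one function serve both purposes: rows with None can be dropped, and the remaining tuples can be written to two integer columns.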
After cleaning, the resulting dataframe is much easier to process for our plotting needs:
TimeElapsed | Time | Quarter | HomeScore | AwayScore | TeamLabel | Label |
---|---|---|---|---|---|---|
1900-01-01 00:00:20 | 11:40.0 | 1 | 0 | 0 | Toronto | O. Anunoby misses 2-pt layup from 2 ft |
1900-01-01 00:00:25 | 11:35.0 | 1 | 0 | 0 | Charlotte | Defensive rebound by D. Graham |
1900-01-01 00:00:38 | 11:22.0 | 1 | 0 | 2 | Charlotte | M. Bridges makes 2-pt layup from 2 ft (assist by P. Washington) |
1900-01-01 00:00:49 | 11:11.0 | 1 | 0 | 2 | Toronto | K. Lowry misses 3-pt jump shot from 30 ft |
1900-01-01 00:00:52 | 11:08.0 | 1 | 0 | 2 | Charlotte | Defensive rebound by M. Bridges |
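
The TimeElapsed column turns the countdown clock of each quarter into time elapsed since the start of the game. A rough sketch of that conversion follows, assuming 12-minute quarters and 5-minute overtime periods. The function time_elapsed is a hypothetical name, and the 1900-01-01 base date is chosen only to match the values shown in the table above.

```python
from datetime import datetime, timedelta

QUARTER_SECONDS = 12 * 60  # regulation NBA quarter
OVERTIME_SECONDS = 5 * 60  # NBA overtime period
BASE_DATE = datetime(1900, 1, 1)  # date part seen in the TimeElapsed column


def time_elapsed(quarter: int, time_remaining: str) -> datetime:
    """Convert a period number and an "MM:SS.s" clock reading into elapsed game time."""
    minutes, seconds = time_remaining.split(":")
    remaining = int(minutes) * 60 + float(seconds)
    if quarter <= 4:
        period_start = (quarter - 1) * QUARTER_SECONDS
        period_length = QUARTER_SECONDS
    else:
        # Assumption: periods 5, 6, ... are overtime periods
        period_start = 4 * QUARTER_SECONDS + (quarter - 5) * OVERTIME_SECONDS
        period_length = OVERTIME_SECONDS
    return BASE_DATE + timedelta(seconds=period_start + period_length - remaining)


print(time_elapsed(1, "11:40.0"))  # 1900-01-01 00:00:20
print(time_elapsed(2, "12:00.0"))  # 1900-01-01 00:12:00
```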
Plot Generation
The plots are generated using the bokeh library through hvplot. hvplot extends the standard pandas plotting API to use different backends.
This allows us to create an interactive plot.
We can zoom and hover over data points.
The hover tools are set up to list all columns of our score dataframe.
The code required for this is rather simple:
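import hvplot.pandas  # noqa: F401, registers the .hvplot accessor on dataframes
from bokeh.models import HoverTool

def create_plot(score_df):
    # Plot both scores against the dataframe index
    score_plot = score_df.hvplot(
        y=["HomeScore", "AwayScore"], hover_cols=list(score_df.columns)
    )
    hover = HoverTool(
        tooltips=[(col, "@" + col) for col in ["HomeScore", "AwayScore"]]
    )
    score_plot = score_plot.opts(tools=[hover], show_grid=True)
    return score_plot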
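The HoverTool in this function lists only HomeScore and AwayScore in its tooltips. If the hover box should show every column, the tooltip list could be built from the column names instead. The following is a minimal sketch; make_tooltips is a hypothetical helper that is not part of the original source. It relies on Bokeh's "@{name}" field syntax, where the braces also cover column names that contain spaces.

```python
from typing import Iterable, List, Tuple


def make_tooltips(columns: Iterable[str]) -> List[Tuple[str, str]]:
    """Build Bokeh-style (label, field) tooltip pairs, one per column."""
    # "@{name}" refers to a data column; the braces also cover names with spaces
    return [(col, "@{" + col + "}") for col in columns]


# Column names taken from the cleaned dataframe shown earlier
columns = ["Time", "Quarter", "HomeScore", "AwayScore", "TeamLabel", "Label"]
tooltips = make_tooltips(columns)
print(tooltips[2])  # ('HomeScore', '@{HomeScore}')
```

The result could then be passed as HoverTool(tooltips=make_tooltips(score_df.columns)) in place of the hard-coded list.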
Create a webservice
This code can be packaged up into a small web service.
This allows you to call the program via a normal web address:
127.0.0.1:5000/nba_score_plot?game_id=202001080CHO
We use Flask to set up a REST API as follows:
from flask import Flask, request

app = Flask(__name__)

@app.route("/nba_score_plot", methods=["GET"])
def process_request():
    # Read the game id from the query string, e.g. ?game_id=202001080CHO
    game_id = request.args.get("game_id")
    score_plot = generate_plot(game_id)
    score_plot_html = convert_plot_to_html(score_plot)
    return score_plot_html, 200

if __name__ == "__main__":
    app.run(port=5000)
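Because game_id comes straight from the query string, it may be worth validating it before building the play-by-play URL. The sketch below is one possible approach rather than the code in the linked source. The name pbp_url_from_game_id is hypothetical, and the id format (nine digits followed by a three-letter team code) is an assumption based on the single example 202001080CHO.

```python
import re
from typing import Optional

PBP_URL_TEMPLATE = "https://www.basketball-reference.com/boxscores/pbp/{game_id}.html"
# Assumed id format: 9 digits followed by a 3-letter team code, e.g. "202001080CHO"
GAME_ID_PATTERN = re.compile(r"\d{9}[A-Z]{3}")


def pbp_url_from_game_id(game_id: Optional[str]) -> str:
    """Return the play-by-play URL for a game id, rejecting missing or malformed ids."""
    if not isinstance(game_id, str) or not GAME_ID_PATTERN.fullmatch(game_id):
        raise ValueError(f"Unexpected game id: {game_id!r}")
    return PBP_URL_TEMPLATE.format(game_id=game_id)


print(pbp_url_from_game_id("202001080CHO"))
# https://www.basketball-reference.com/boxscores/pbp/202001080CHO.html
```

In the Flask handler, a ValueError raised here could be caught and turned into a 400 response instead of a server error.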
Conclusions
An example of the interactive plots generated can be seen at the top of the page.
The full source can be found here: https://github.com/stanton119/nba-scores/blob/master/src/main.py