Linear Regression & the NBA

Predicting NBA Player Salary using Linear Regression and Web Scraping

For my second project as part of the Metis Data Science Bootcamp, I web-scraped NBA player statistics from Basketball-Reference to build a linear regression model to predict their salaries.

Introduction

Because of my passion for basketball, I decided to focus my linear regression project on the NBA. Every free agency, we see bloated contracts handed out to underperforming players and value contracts handed out to budding stars that no one saw coming. These disparities between contract value and output can be the difference between championships and disappointment.

The goal of this project is to predict NBA players’ yearly salaries using linear regression. With a data-driven approach to valuing a player, front offices can make more calculated decisions. They are better equipped to negotiate new contracts, trade for undervalued players, or trade away players they deem underperforming.

Design

To predict NBA salaries, I first explored the relationships between various box score statistics, considering each stat's correlation with player salary as well as with the other stats to flag multicollinearity. In addition, I used feature engineering to create interaction variables and dummy indicators that aren't immediately available in the raw data.

After data cleaning and manipulation, I split the data into training and test sets and tested three different models (OLS, Ridge regression, and Lasso regression). I fit each model on the training set and judged its efficacy on the test data, comparing the models on R², Mean Absolute Error (MAE), and intuitive fit.
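As a minimal sketch of that comparison loop (assuming a prepared feature matrix `X` and salary target `y`; the names and alpha values are illustrative placeholders, not my actual pipeline):

```python
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# X is the engineered feature matrix, y the salary target (assumed prepared).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),   # placeholder alphas; tuned later
    "Lasso": Lasso(alpha=0.01),  # via cross-validation
}

# Fit on the training set, then score each model on the held-out test set.
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(f"{name}: R2={r2_score(y_test, preds):.3f}, "
          f"MAE={mean_absolute_error(y_test, preds):,.0f}")
```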

Web-Scraped Data

I web-scraped the data via BeautifulSoup, primarily from Basketball-Reference.

Luckily, Basketball-Reference has a coder-friendly interface that simplified the web scraping process. Below is a snippet of the web-scraping algorithm I used to convert player statistics on the website into a readable CSV file. It iterates through each season and extracts the relevant statistics for every player who played in that season.

[Figure: screenshot of the web-scraping code]

Algorithm
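In case the screenshot above doesn't reproduce well, here is a minimal sketch of what such a scraping loop can look like (the URL pattern, `totals_stats` table id, and season range are assumptions for illustration, not an exact copy of my code):

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Assumed URL pattern for Basketball-Reference's per-season totals pages.
BASE_URL = "https://www.basketball-reference.com/leagues/NBA_{year}_totals.html"

def scrape_season(year: int) -> pd.DataFrame:
    """Extract per-player totals for a single season into a DataFrame."""
    response = requests.get(BASE_URL.format(year=year))
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    table = soup.find("table", id="totals_stats")  # table id is an assumption

    rows = []
    for tr in table.find("tbody").find_all("tr"):
        # Basketball-Reference repeats header rows inside the table body.
        if tr.get("class") and "thead" in tr["class"]:
            continue
        row = {td["data-stat"]: td.get_text(strip=True) for td in tr.find_all("td")}
        if row:
            row["season"] = year
            rows.append(row)
    return pd.DataFrame(rows)

# Iterate through each season and write the combined stats to a CSV file.
all_seasons = pd.concat(scrape_season(y) for y in range(2017, 2021))
all_seasons.to_csv("player_stats.csv", index=False)
```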

Data Cleaning

Data Manipulation

After the initial cleaning, I worked iteratively to determine the best-fitting model. VIF analysis was used to test for multicollinearity, and R² was recalculated as existing variables were dropped and feature-engineered variables were added.
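As a sketch, a VIF check like this can be done with statsmodels (assuming `features` is a DataFrame holding the candidate predictors):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(features: pd.DataFrame) -> pd.DataFrame:
    """Compute the variance inflation factor for each candidate predictor."""
    return pd.DataFrame({
        "feature": features.columns,
        "VIF": [
            variance_inflation_factor(features.values, i)
            for i in range(features.shape[1])
        ],
    }).sort_values("VIF", ascending=False)

# A common rule of thumb: VIF above ~5-10 signals problematic multicollinearity.
print(vif_table(features))
```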

Feature Engineering

Variables Added:

- Past-peak indicator (a dummy variable flagging players beyond their athletic prime)
- 3-and-D score (capturing players who pair three-point shooting with defense)

Variable Selection

The VIF table for the modeled variables is below. The variable set was optimized to reduce multicollinearity; however, a few variables with a high VIF stayed in because of their strong correlation with salary and importance to the game (such as points).

[Figure: VIF table for modeled variables]

From here, I experimented with three models: Ordinary Least Squares (OLS), Lasso, and Ridge regression. The latter two were included to guard against potential overfitting, and their penalty term alpha was tuned via cross-validation for both performance and intuitive fit.
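A sketch of that tuning step, using scikit-learn's built-in cross-validated estimators and the training split from earlier (the alpha grid is an illustrative assumption):

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

# Search a wide logarithmic grid of penalty strengths via 5-fold CV.
alphas = np.logspace(-3, 3, 50)
ridge = RidgeCV(alphas=alphas, cv=5).fit(X_train, y_train)
lasso = LassoCV(alphas=alphas, cv=5, max_iter=10_000).fit(X_train, y_train)

print(f"Best Ridge alpha: {ridge.alpha_:.4f}")
print(f"Best Lasso alpha: {lasso.alpha_:.4f}")
```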

Results/Analysis

I conducted cross-validation on the training data and assessed each model's R², mean absolute error, and intuitive fit. After comparing the three models, the performance results are below.

[Figure: model performance results]

The squared MAE value expresses the error in real dollar terms: because the model predicts the square root of salary, squaring the MAE converts it back to the original dollar scale, which makes the model's performance easier to interpret. Given that NBA salaries run into the tens of millions of dollars per year, a margin of error under $1 million is pretty robust!
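In code terms, the back-transformation is just squaring (a sketch, assuming `mae_sqrt` holds the MAE on the square-root scale):

```python
# MAE was computed on the sqrt(salary) scale; squaring expresses it in
# dollar terms, e.g. an MAE of 950 on the sqrt scale maps to roughly
# 950 ** 2 = $902,500.
mae_dollars = mae_sqrt ** 2
```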

As we can see here, all three models have very similar performance metrics, with a slight advantage for OLS in terms of MAE and a slight advantage for Lasso in terms of test R². Because the performances are not distinguishable enough to pick one over the others, let us take a look at the coefficients to see whether these models pass the sanity test.

[Figure: coefficient comparison across OLS, Lasso, and Ridge]

Although OLS had slightly better results, Ridge regression was the best fit. OLS gave a negative coefficient for three-pointers made, which is unrealistic given the benefits of that stat, and it overestimated the fouls variable. Lasso, on the other hand, removed three-pointers made entirely, presumably due to its close link to both the 3-and-D score and points, whereas Ridge merely tapered its impact. Because I believe three-pointers are significant enough to keep and the differences in R² and MAE are very marginal, I went with Ridge. In addition, the small gap between train and test R² suggests there isn't significant overfitting, and the MAE is relatively small, providing confidence in this model.

Below is a chart that outlines the weight given to each variable. Although age has the strongest link to salary, its effect is partially offset by the past-peak indicator. Points is the second most valuable feature, but it is not the only scoring feature with a positive impact. Also, the high 3-and-D weight suggests that NBA teams value versatility highly.

[Figure: ridge regression coefficient weights]
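A chart like this can be produced directly from the fitted model's coefficients (a sketch, assuming `ridge` is the fitted estimator and `X_train` is a DataFrame with named columns):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Pair each feature name with its Ridge coefficient and plot as a bar chart.
weights = pd.Series(ridge.coef_, index=X_train.columns).sort_values()
weights.plot(kind="barh", title="Ridge regression coefficient weights")
plt.tight_layout()
plt.show()
```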

As part of model validation, I visualized some diagnostics for the ridge regression model, as shown below:

[Figure: ridge regression diagnostic plots]

The residual plot appears centered on a mean of zero, but a degree of heteroskedasticity remains even after the target variable was transformed. The Q-Q plot, meanwhile, closely follows a straight line, suggesting the residuals are approximately normally distributed with minimal skew.
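Diagnostics like these can be reproduced with matplotlib and scipy (a sketch, assuming the fitted `ridge` model and the test split from earlier):

```python
import matplotlib.pyplot as plt
from scipy import stats

preds = ridge.predict(X_test)
residuals = y_test - preds

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs. predictions: look for a mean of zero and constant spread.
ax1.scatter(preds, residuals, alpha=0.5)
ax1.axhline(0, color="red", linestyle="--")
ax1.set_xlabel("Predicted sqrt(salary)")
ax1.set_ylabel("Residual")

# Q-Q plot: a straight line indicates approximately normal residuals.
stats.probplot(residuals, dist="norm", plot=ax2)
plt.tight_layout()
plt.show()
```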

Regression Model in Action

As an example of this model performing accurately in practice, I reviewed actual vs. predicted salaries for several players. Here, we see that Spencer Dinwiddie, according to our model, is valued quite similarly to what his contract would suggest.

[Figure: Spencer Dinwiddie, actual vs. predicted salary]
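Comparisons like this come from lining up actual and predicted salaries side by side (a sketch, assuming `players_test` holds the player names for the test rows; sorting by the gap surfaces the anomalies discussed next):

```python
import pandas as pd

# Square predictions and actuals back to the dollar scale for comparison.
comparison = pd.DataFrame({
    "player": players_test,
    "actual_salary": y_test ** 2,
    "predicted_salary": ridge.predict(X_test) ** 2,
})
comparison["gap"] = comparison["predicted_salary"] - comparison["actual_salary"]

# Large positive gaps flag potential "diamonds in the rough";
# large negative gaps flag potentially overpaid players.
print(comparison.sort_values("gap", ascending=False).head(10))
```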

However, the real fun begins when we start to analyze anomalies. I would argue that this model is counterintuitive in that its real practical value comes from detecting players with large disparities: once we spot anomalies, a team using this model can gauge which players are overrated and which are "diamonds in the rough." Below are three such examples in action:

[Figure: actual vs. predicted salary anomalies]

Pascal Siakam is a budding star for the Toronto Raptors. He won Most Improved Player in 2019 and played a key role in the Raptors' championship run that year. While his actual salary in 2020 is $2.4M, my model suggests his play is worth roughly seven times that!

On the other hand, Andrew Wiggins is widely considered a mediocre player. Despite this, the Minnesota Timberwolves rewarded him with a 5-year, $148M contract in 2017. As we can see here, there is a wide disparity between his annual salary and his output, something I'm sure hindered the Timberwolves' ability to add more talent and compete.

This model does have its shortcomings, however, as it only considers box score statistics and not intangibles. For example, Draymond Green is someone who does not fill up the stat sheet but influences the game in other ways. His leadership, basketball IQ, and defense were very valuable during the Golden State Warriors' championship runs, but those contributions are hard to quantify. Despite making the NBA All-Defensive Team in 2019 and playing a key part in the Warriors' finals run that year, my model views Draymond Green as an overvalued player.

Conclusion

Using box-score statistics, it is feasible to build an effective linear regression model to predict NBA player salaries. This model is counterintuitive in that deviations from expected salary provide the real value, since this is where general managers can find "diamonds in the rough" to give their team the edge.

To enhance this project in the future, I would devise a way to include advanced stats that are not part of the standard box score. Metrics such as on/off-court team performance and opposing field goal percentage (among many others) could go a long way toward recognizing the effects of intangibles, leadership, and defense that aren't properly captured by a model built on standard box score statistics.

To see my project in further detail, please visit my GitHub Repo.

Tools