Classification & the Billboard Hot 100

Predicting Hit Songs using Machine Learning Classification

For my supervised machine learning project at the Metis Data Science Bootcamp, I built a classification model to predict whether a song will be a hit based on its audio features.

Introduction

A hit song can make an artist’s career and earn millions for record labels. While hit songs share certain tropes, such as being upbeat and catchy, not every song that fits the bill goes viral.

The goal of this project is to use machine learning classification to predict whether a song will be a hit based on its audio characteristics. With a data-driven way to evaluate a song, record labels and artists will be better equipped to make production decisions going forward.

Design

To predict whether a song will be a hit, I explored numerous audio features, considering both each feature’s correlation with hit status and the correlations among the features themselves (useful for potential feature engineering).

After data cleaning, feature engineering, and data manipulation, I split the data into training and test sets and evaluated five different models: logistic regression, decision trees, random forests, AdaBoost, and XGBoost. I fit each model on the training set and judged its efficacy on the test data. I also tuned each model’s hyperparameters via GridSearchCV and tuned the probability threshold using predict_proba in scikit-learn.
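Below is a minimal sketch of that workflow, assuming `X` and `y` are the prepared feature matrix and hit labels; the parameter grids are illustrative placeholders rather than the exact grids used in the project.

```python
# Minimal sketch of the modeling workflow: split the data, fit five
# classifiers, and tune hyperparameters with GridSearchCV (scored on F1).
# X and y are assumed to be the cleaned feature matrix and hit labels.
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from xgboost import XGBClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Illustrative grids; the project's actual search spaces may differ.
models = {
    "logistic": (LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10]}),
    "decision_tree": (DecisionTreeClassifier(), {"max_depth": [3, 5, 10, None]}),
    "random_forest": (RandomForestClassifier(),
                      {"n_estimators": [100, 300], "max_depth": [5, 10, None]}),
    "adaboost": (AdaBoostClassifier(), {"n_estimators": [50, 100, 200]}),
    "xgboost": (XGBClassifier(eval_metric="logloss"),
                {"max_depth": [3, 5], "learning_rate": [0.05, 0.1]}),
}

best = {}
for name, (estimator, grid) in models.items():
    search = GridSearchCV(estimator, grid, scoring="f1", cv=5)
    search.fit(X_train, y_train)
    best[name] = search.best_estimator_
```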

The main metrics I used to judge the models are F1 score and ROC-AUC. Given the costs associated with producing and marketing music, it is very important that there are few false positives, which makes precision important. That said, it is also important not to miss out on a golden opportunity, which makes recall important as well. For this reason, I used F1, as it incorporates both precision and recall for a more holistic view of model performance. In addition, ROC-AUC measures how effective a model is at finding hits while avoiding false positives across all probability thresholds.
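Continuing the sketch above, both metrics are available in scikit-learn; note that ROC-AUC is computed from predicted probabilities rather than hard labels.

```python
# F1 balances precision and recall; ROC-AUC summarizes hit/non-hit
# separation across all probability thresholds. Reuses `best`, `X_test`,
# and `y_test` from the workflow sketch above.
from sklearn.metrics import f1_score, roc_auc_score

model = best["xgboost"]
y_pred = model.predict(X_test)                 # hard 0/1 predictions
y_prob = model.predict_proba(X_test)[:, 1]     # probability of "hit"

print(f"F1:      {f1_score(y_test, y_pred):.3f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.3f}")
```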

Data

This Kaggle dataset contains data on over 34,000 songs extracted from Spotify, dating back to 1985. Luckily, the dataset was already curated and cleaned. Of the 34,000+ data points, half were non-hits and the other half were hits. (A song qualifies as a hit if it ever appeared on the Billboard Hot 100 or Spotify 100 list.)

Further information on the fields can be found on the Spotify for Developers website.
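As a rough sketch, loading the data and verifying the class balance might look like the following; the file name and the `target` column are assumptions about the Kaggle CSV, not confirmed details.

```python
# Load the curated Kaggle dataset and check the hit/non-hit balance.
# "spotify_songs.csv" and the "target" column name are hypothetical.
import pandas as pd

songs = pd.read_csv("spotify_songs.csv")
print(songs.shape)                      # expect 34,000+ rows
print(songs["target"].value_counts())   # roughly half hits, half non-hits
```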

Algorithm

[Figure: correlation matrix of audio features]

[Figure: pairplot of audio features by hit status]

Based on the above correlation matrix and pairplot, I was able to discern patterns and trends. Using these, I simplified the model by keeping the features most relevant to a song’s hit status and by engineering new variables that capture the observed trends. Right off the bat, I was very surprised to see that the “danceability” variable had minimal effect on whether or not a song is a hit.
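For reference, here is a minimal sketch of how such plots can be generated with seaborn, assuming the `songs` DataFrame from earlier; the feature list is a sample of Spotify’s audio features, not the project’s full set.

```python
# EDA sketch: correlation heatmap and pairplot of audio features vs. hit
# status. Column names assume the Spotify audio-feature naming.
import seaborn as sns
import matplotlib.pyplot as plt

features = ["danceability", "energy", "valence", "tempo", "loudness"]

sns.heatmap(songs[features + ["target"]].corr(), annot=True, cmap="coolwarm")
plt.show()

sns.pairplot(songs, vars=features, hue="target", corner=True)
plt.show()
```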

Variable Selection

Feature Engineering

Variables Added:

Results/Analysis

GridSearchCV was used on each model to tune hyperparameters, and threshold tuning was used to optimize F1. After this was done, the train and test F1 scores were compared to assess quality and fit, and an ROC-AUC comparison was used to get a holistic sense of model performance. Results:

| Model               | Train F1 | Test F1 |
|---------------------|----------|---------|
| Logistic Regression | 0.792    | 0.788   |
| Decision Tree       | 0.868    | 0.777   |
| Random Forest       | 0.919    | 0.826   |
| AdaBoost            | 0.805    | 0.803   |
| XGBoost             | 0.845    | 0.824   |

[Figure: ROC curve comparison across models]

The two best-performing models are the Random Forest and XGBoost. Ultimately, I went with XGBoost because it showed less overfitting (a smaller gap between train and test F1). At a probability threshold of 0.381, the F1 score is 82.4% and the ROC-AUC is 0.891. The F1 score increases as the probability threshold rises toward 0.381, then tapers off slightly beyond it. Below is a chart outlining the performance of XGBoost at all thresholds, up to the default threshold of 0.5:

[Figure: XGBoost F1 score vs. probability threshold]
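A minimal sketch of the threshold sweep behind that chart, reusing the fitted XGBoost model from the earlier sketches (ideally the threshold would be chosen on validation data rather than the test set):

```python
# Sweep candidate probability thresholds and keep the one that maximizes
# F1; the project's sweep landed at roughly 0.381.
import numpy as np
from sklearn.metrics import f1_score

y_prob = best["xgboost"].predict_proba(X_test)[:, 1]

thresholds = np.arange(0.05, 0.50, 0.005)
f1_scores = [f1_score(y_test, (y_prob >= t).astype(int)) for t in thresholds]

best_t = thresholds[int(np.argmax(f1_scores))]
print(f"best threshold: {best_t:.3f}, F1: {max(f1_scores):.3f}")
```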

Please see the confusion matrix below for further analysis:

[Figure: XGBoost confusion matrix at the 0.381 threshold]
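A sketch of how such a matrix can be produced at the tuned threshold, continuing from the sweep above:

```python
# Confusion matrix for XGBoost at the tuned 0.381 threshold.
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

y_pred_tuned = (y_prob >= 0.381).astype(int)
cm = confusion_matrix(y_test, y_pred_tuned)
ConfusionMatrixDisplay(cm, display_labels=["non-hit", "hit"]).plot()
plt.show()
```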

Model in Action

As examples of the model performing accurately (and inaccurately), I sampled a few songs and compared their actual hit status against what the algorithm predicted.

[Table: example songs with predicted vs. actual hit status]
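As a rough illustration of how one of these spot-checks could be scored (the `track_name` lookup is hypothetical and depends on the dataset’s column names):

```python
# Score a single song against the tuned 0.381 threshold.
song = songs.loc[songs["track_name"] == "goosebumps", X.columns]
prob = best["xgboost"].predict_proba(song)[:, 1][0]
print(f"hit probability: {prob:.3f} ->",
      "hit" if prob >= 0.381 else "non-hit")
```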

The model did a great job of correctly predicting hits and non-hits. For example, the song Goosebumps by Travis Scott incorporates all the traditional audio components of a hit and is correctly identified as such. One limitation of this model is that it does not include artist social media popularity or reputation data. For example, several songs from Drake, one of the most popular artists of this generation, have low energy and valence but are hits in large part due to his reputation.

Conclusion

The XGBoost model did a great job of correctly predicting hits and non-hits, although it still produced quite a few false negatives. As noted above, the model also does not account for social media popularity; incorporating social media following would be an excellent addition, as this variable has a high correlation with success.

To see my project in further detail, please visit my GitHub Repo.

Tools