INTRODUCTION

The world of music has evolved dramatically over the last 20 years. With the rise of streaming technology, demand for physical copies of music has plummeted. Companies like Spotify were built on the idea of accessing any song from your phone, anywhere, at any time. Since its creation in 2006, Spotify has grown into the biggest music streaming service in the world by number of subscribers. According to the company, 406 million people use Spotify, and that number is expected to keep growing.

As the largest streaming service, Spotify must make sure its songs are classified correctly. To ensure genres are classified as effectively as possible, we asked the question: which variables are the best indicators of the soundtrack genre? Using our model, Spotify and other streaming services could easily determine whether a track falls into the soundtrack category, and they could use the foundation of our model to classify other genres. Our model would increase the accuracy and efficiency of Spotify's classification systems, ultimately making it a more desirable platform.

In addition to making Spotify more efficient, we also wanted to look at the production side of the music industry. Our second question was: within each genre, which variables lead to the highest song popularity? In music production, the goal is to produce the next big hit. Using our model, production studios would know that songs with high levels of certain variables, or low levels of others, are more likely to be popular. Our model would replace the guessing game behind producing a hit song with decisions grounded in data.

DATA

Our analysis uses the "Spotify Tracks DB" dataset from Kaggle. It was created by Zaheen Hamidani, who pulled 176,774 unique tracks from Spotify into a data frame, choosing about 10,000 tracks for each of 18 genres. Because Hamidani did not account for the proportions of each genre relative to the others, the data are not a simple random sample: less popular genres like a cappella are represented equally to more popular genres like pop. The most important cleaning step was removing duplicate tracks and, in the process, turning the genre variable into a set of binary indicator variables. As the tables below show, the original data listed a song once per genre, while the cleaned data compresses those duplicates into a single entry. Additionally, we combined the tracks from the soundtrack and movie genres, due to their similarities, and named the merged genre "soundtrack". After cleaning, we chose to analyze the entire data set, so our models take most of the 176,774 tracks into account.

Original Data

| Genre | Artist Name | Track Name | Track ID | Popularity | Acousticness | Danceability | Duration (ms) | Energy | Instrumentalness | Key | Liveness | Loudness (dB) | Mode | Speechiness | Tempo (BPM) | Time Signature | Valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dance | Ariana Grande | 7 rings | 14msK75pk3pA33pzPVNtBF | 100 | 0.578 | 0.725 | 178640 | 0.321 | 0.0e+00 | C# | 0.0884 | -10.744 | Minor | 0.3230 | 70.142 | 4/4 | 0.319 |
| Pop | Ariana Grande | 7 rings | 14msK75pk3pA33pzPVNtBF | 100 | 0.578 | 0.725 | 178640 | 0.321 | 0.0e+00 | C# | 0.0884 | -10.744 | Minor | 0.3230 | 70.142 | 4/4 | 0.319 |
| Hip-Hop | Daddy Yankee | Con Calma | 5w9c2J52mkdntKOmRLeM2m | 98 | 0.110 | 0.737 | 193227 | 0.860 | 1.9e-06 | G# | 0.0574 | -2.652 | Minor | 0.0593 | 93.989 | 4/4 | 0.656 |
| Pop | Daddy Yankee | Con Calma | 5w9c2J52mkdntKOmRLeM2m | 98 | 0.110 | 0.737 | 193227 | 0.860 | 1.9e-06 | G# | 0.0574 | -2.652 | Minor | 0.0593 | 93.989 | 4/4 | 0.656 |
| Reggaeton | Daddy Yankee | Con Calma | 5w9c2J52mkdntKOmRLeM2m | 98 | 0.110 | 0.737 | 193227 | 0.860 | 1.9e-06 | G# | 0.0574 | -2.652 | Minor | 0.0593 | 93.989 | 4/4 | 0.656 |
Cleaned Data

| Dance | Hip-Hop | Pop | Reggaeton | Artist Name | Track Name | Track ID | Popularity | Acousticness | Danceability | Duration (ms) | Energy | Instrumentalness | Key | Liveness | Loudness (dB) | Mode | Speechiness | Tempo (BPM) | Time Signature | Valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 0 | Ariana Grande | 7 rings | 14msK75pk3pA33pzPVNtBF | 100 | 0.578 | 0.725 | 178640 | 0.321 | 0.0e+00 | C# | 0.0884 | -10.744 | Minor | 0.3230 | 70.142 | 4/4 | 0.319 |
| 0 | 1 | 1 | 1 | Daddy Yankee | Con Calma | 5w9c2J52mkdntKOmRLeM2m | 98 | 0.110 | 0.737 | 193227 | 0.860 | 1.9e-06 | G# | 0.0574 | -2.652 | Minor | 0.0593 | 93.989 | 4/4 | 0.656 |
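Our analysis was done with other tooling, but as an illustration, a minimal Python/pandas sketch of this cleaning step might look like the following. The toy data frame mirrors the tables above; the real analysis used the full Kaggle file with all of the audio features.

```python
import pandas as pd

# Toy version of the original data: one row per (track, genre) pair.
tracks = pd.DataFrame({
    "track_id": ["14msK75pk3pA33pzPVNtBF", "14msK75pk3pA33pzPVNtBF",
                 "5w9c2J52mkdntKOmRLeM2m", "5w9c2J52mkdntKOmRLeM2m",
                 "5w9c2J52mkdntKOmRLeM2m"],
    "genre": ["Dance", "Pop", "Hip-Hop", "Pop", "Reggaeton"],
    "popularity": [100, 100, 98, 98, 98],
})

# Merge the Movie genre into Soundtrack before building indicators.
tracks["genre"] = tracks["genre"].replace({"Movie": "Soundtrack"})

# One 0/1 indicator column per genre, then collapse duplicate rows so each
# track appears once; max() unions the indicators and keeps the (identical)
# audio features.
indicators = pd.get_dummies(tracks["genre"]).astype(int)
cleaned = (pd.concat([tracks.drop(columns="genre"), indicators], axis=1)
             .groupby("track_id", as_index=False)
             .max())
print(cleaned)
```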


Within the data, a specific song is measured using numeric variables: popularity, acousticness, danceability, duration, energy, instrumentalness, liveness, loudness, speechiness, tempo, and valence. Spotify computes each of these with its own algorithm and assigns one value per track. Popularity measures how popular the song is on Spotify. Acousticness measures how acoustic the track is, with higher values indicating acoustic instruments rather than electronic production. Danceability measures how suitable the track is for dancing by combining other attributes such as tempo. Duration is the length of the track in milliseconds. Energy measures the intensity of the track and how active the vocals, instruments, and electronics are. Instrumentalness reflects the absence of vocals on the track; for example, a track with lyrics would have low instrumentalness, while a purely instrumental piece would score close to 1. Liveness represents the level of audience noise in the background of the recording. Loudness, measured in decibels, captures the strength of the audio in the track. Speechiness describes the amount of spoken word on the track; an audio book, being entirely spoken word, would have a speechiness near 1. Tempo, measured in beats per minute, captures the pace of the track. Finally, valence measures the level of positivity in the music, where tracks with low valence sound more sad and depressing.

In addition, the data include the categorical variables mode, key, and time signature. Mode indicates whether the track is in a major or minor key, key gives the musical key each track uses, and time signature indicates the number of beats in each measure. The boxplots below display the distributions of the important numeric variables, which have been standardized so they can be compared to each other. We added a line at 0 to better visualize the skewness of each variable's distribution.
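A sketch of the standardization behind these boxplots, continuing from the `cleaned` data frame in the previous sketch and assuming it carries all of the numeric columns:

```python
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

numeric_cols = ["popularity", "acousticness", "danceability", "duration",
                "energy", "instrumentalness", "liveness", "loudness",
                "speechiness", "tempo", "valence"]

# Standardize every variable to mean 0 and standard deviation 1 so features
# on very different scales (tempo in BPM, valence in [0, 1]) are comparable.
scaled = StandardScaler().fit_transform(cleaned[numeric_cols])

fig, ax = plt.subplots(figsize=(10, 5))
ax.boxplot(scaled, labels=numeric_cols)
ax.axhline(0, linestyle="--")  # reference line at 0 to judge skewness
plt.setp(ax.get_xticklabels(), rotation=45, ha="right")
plt.tight_layout()
plt.show()
```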

RESULTS

Question 1: Classification Model

To answer our first question, we used k-nearest neighbors (KNN) classification and logistic regression models, focusing specifically on predicting songs in the soundtrack genre. Before building the models, we finished cleaning the data. First, we removed songs listed only under the soundtrack genre, since these are typically more obscure songs created specifically for a movie soundtrack and would never need to be run through our model. Then, we removed select genres, such as anime, that had no crossover with the merged soundtrack category; songs in these genres are rarely used in movie soundtracks, so they would also not need to be run through our model.
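A minimal sketch of this filtering in pandas, continuing from the earlier cleaning sketch; the genre list here is hypothetical and stands in for the full set of indicator columns:

```python
# All genre indicator columns in the cleaned data (hypothetical subset).
genre_cols = ["Dance", "Pop", "Hip-Hop", "Reggaeton", "Soundtrack", "Anime"]

# Drop songs tagged *only* as Soundtrack: likely purpose-made film tracks
# that would never need to be classified.
only_soundtrack = ((cleaned["Soundtrack"] == 1) &
                   (cleaned[genre_cols].sum(axis=1) == 1))
cleaned = cleaned[~only_soundtrack]

# Drop genres with no crossover into Soundtrack, e.g. Anime.
for g in ["Anime"]:
    cleaned = cleaned[cleaned[g] == 0].drop(columns=g)
```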

When constructing our classification models, we built KNN models whose predictor sets were informed by our logistic regression models. We split the data into a 70% training set and a 30% testing set and built a total of 4 KNN models and 3 logistic models. Throughout model construction, we evaluated accuracy primarily with the sensitivity rate (the true positive rate), which matters more here than the specificity rate (the true negative rate) because soundtrack songs make up such a small share of the data. Within our KNN models, we used cross-validation to determine the best k value, which turned out to be 15; the models differed only in the variables used as predictors. Initially, we included all predictors in the data, both numeric and categorical. However, most categorical predictors were not significant in the logistic model, and the corresponding KNN model was not very accurate. Furthermore, our exploratory data analysis showed that within the soundtrack genre, the categorical variables had distributions very similar to those of the entire sample, so we excluded categorical variables from subsequent models.
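A sketch of the split and the cross-validated choice of k with scikit-learn; scoring on recall corresponds to the sensitivity rate we prioritized, and the column names continue from the earlier sketches:

```python
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = cleaned[numeric_cols]   # numeric predictors
y = cleaned["Soundtrack"]   # 1 if the track is a soundtrack song, else 0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

# Cross-validate over candidate odd k values; "recall" is the sensitivity
# rate, the metric that matters most given how rare soundtrack songs are.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = GridSearchCV(
    knn, {"kneighborsclassifier__n_neighbors": range(1, 31, 2)},
    scoring="recall", cv=10)
grid.fit(X_train, y_train)
print(grid.best_params_)  # cross-validation in our analysis chose k = 15
```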

Next, we created both a logistic and a KNN model with all the numeric predictors. Compared to the full models, the logistic model's predictions got slightly worse while the KNN model's improved significantly. After constructing the full numeric logistic model, we also found that liveness was not significant at the 5% level, so we constructed an additional KNN model using all numeric predictors except liveness. This model was our most accurate, with a sensitivity rate of about 0.62. Finally, we considered interaction terms: we constructed a logistic model with all numeric predictors and all two-way interactions between them. Many of the interaction terms were not significant, so our final KNN model included all numeric predictors plus only the significant two-way interactions. The scatter plot below shows the distribution of each model's predictions: true positives, true negatives, false positives, and false negatives.
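The interaction screening could be sketched as follows, generating all two-way products and keeping the terms significant at the 5% level; statsmodels supplies the p-values, and the train split comes from the previous sketch:

```python
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import PolynomialFeatures

# All numeric predictors plus every two-way interaction between them.
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_int = pd.DataFrame(poly.fit_transform(X_train),
                     columns=poly.get_feature_names_out(X_train.columns))

# Fit the logistic model and keep the terms significant at the 5% level;
# those terms then feed the final KNN model.
logit = sm.Logit(y_train.reset_index(drop=True),
                 sm.add_constant(X_int)).fit()
pvals = logit.pvalues.drop("const")
significant_terms = pvals[pvals < 0.05].index.tolist()
print(significant_terms)
```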

Our results showed that the best variables to predict whether a song would be considered a soundtrack song or not were popularity, acousticness, danceability, duration, energy, instrumentalness, loudness, speechiness, tempo, and valence. Specifically, when these variables are used in a KNN classification model, the results are the most accurate in terms of correctly predicting soundtrack songs. The table below shows the sensitivity, specificity, false positive rates (FPR), and false negative rates (FNR) for every model constructed, sorted by highest sensitivity rate.

Classification Model Results
| Model | Sensitivity | Specificity | FPR | FNR |
|---|---|---|---|---|
| KNN Numeric Model without Liveness | 0.6188679 | 0.9855974 | 0.0144026 | 0.3811321 |
| KNN Numeric Model | 0.6113208 | 0.9855526 | 0.0144474 | 0.3886792 |
| KNN Model with 2-Way Interaction | 0.5947170 | 0.9870287 | 0.0129713 | 0.4052830 |
| Logistic Model with 2-Way Interaction | 0.5762264 | 0.9839871 | 0.0160129 | 0.4237736 |
| Full Logistic Model | 0.5207547 | 0.9784408 | 0.0215592 | 0.4792453 |
| Logistic Numeric Model | 0.5147170 | 0.9782618 | 0.0217382 | 0.4852830 |
| Full KNN Model | 0.1245283 | 0.9902491 | 0.0097509 | 0.8754717 |
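Each row of the table comes from a confusion matrix on the test set. A sketch of the computation for one model, using the fitted grid search from the earlier sketch:

```python
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, grid.predict(X_test)).ravel()
print("sensitivity:", tp / (tp + fn))  # true positive rate
print("specificity:", tn / (tn + fp))  # true negative rate
print("FPR:", fp / (fp + tn))          # 1 - specificity
print("FNR:", fn / (fn + tp))          # 1 - sensitivity
```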


Question 2: Lasso Regression

To address the second question, we used lasso regression to create a model predicting popularity for each genre. Lasso regression is a form of penalized linear regression in which the coefficient estimates are shrunk toward a central point, in our case zero, with the least useful predictors shrunk exactly to zero. We chose lasso regression because our data contain many different genres, the models fit fairly quickly, and the built-in cross-validation helps us avoid overfitting while selecting the most relevant predictors.
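Concretely, for popularity \(y_i\) and predictors \(x_{ij}\), the lasso chooses the coefficients that minimize the penalized residual sum of squares

\[
\hat{\beta} = \arg\min_{\beta_0,\,\beta} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Big)^{2} + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert,
\]

where the penalty \(\lambda \sum_{j} \lvert\beta_j\rvert\) is what forces the coefficients of weak predictors all the way to zero; a larger \(\lambda\) means more shrinkage.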

Before running the lasso regressions, we removed the less relevant genres ska, world, and anime. We then used popularity as the response variable and all other categorical and numeric variables as predictors. Of note, we included track name and artist name as predictors, since a consumer may take these into account when considering popularity. To choose the value of lambda, the tuning parameter in lasso regression, we performed 10-fold cross-validation, which finds the lambda that produces the lowest out-of-sample mean squared error; we then used that lambda to fit the best model. We repeated this process for each of the 21 genres, producing 21 different models. Finally, we used each genre's final lasso model to make predictions on new observations and calculated the root mean squared error (RMSE) of the genre-specific models. The specific coefficients and RMSE for each model can be seen in the table below.
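A sketch of the per-genre fitting loop with scikit-learn's LassoCV, whose built-in cross-validation selects the penalty (called alpha there, lambda in our write-up). Here `predictor_cols` is a hypothetical list of the encoded predictor columns, and `genre_cols` and `cleaned` continue from the earlier sketches:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

genre_models, genre_rmse = {}, {}
for genre in genre_cols:                       # one model per genre
    subset = cleaned[cleaned[genre] == 1]
    Xg = subset[predictor_cols]                # hypothetical predictor list
    yg = subset["popularity"]
    Xg_tr, Xg_te, yg_tr, yg_te = train_test_split(
        Xg, yg, test_size=0.30, random_state=1)

    # 10-fold cross-validation over a path of penalty values picks the
    # lambda with the lowest out-of-sample mean squared error.
    model = make_pipeline(StandardScaler(), LassoCV(cv=10))
    model.fit(Xg_tr, yg_tr)

    genre_models[genre] = model
    genre_rmse[genre] = np.sqrt(mean_squared_error(yg_te,
                                                   model.predict(Xg_te)))
```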

Lasso Regression Coefficients
| Variable | A Cappella | Alternative | Blues | Classical | Children’s Music | Comedy | Country | Dance | Electronic | Folk | Hip-Hop | Indie | Jazz | Soundtrack | Opera | Pop | R&B | Rock | Reggae | Reggaeton | Soul |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Artist Name | -0.2074 | 0.0002 | 0.0002 | -0.0045 | -0.0002 | 0.0010 | -0.0003 | 0.0002 | -0.0008 | 0.0007 | -0.0011 | -0.0004 | 0.0005 | 0.0057 | -0.0033 | -0.0003 | -0.0005 | 0.0000 | 0.0010 | -0.0040 | 0.0000 |
| Track Name | 0.0094 | -0.0001 | -0.0001 | 0.0000 | -0.0001 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | -0.0001 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0001 | 0.0000 |
| Acousticness | 0.0000 | 0.4754 | -0.7549 | -8.8657 | -14.6194 | 0.0000 | 0.0342 | 3.0127 | 0.6815 | 0.8558 | 0.8576 | 1.5694 | 0.0036 | -2.4965 | -6.1732 | 1.0792 | 0.0000 | 1.9142 | 1.1703 | 1.1372 | 0.1623 |
| Danceability | 0.0000 | 2.1549 | -5.0290 | 0.0000 | -11.2149 | 5.5151 | 0.0000 | 4.1264 | 3.6757 | 1.0633 | 5.1330 | 4.3254 | 6.3585 | -7.6019 | -6.4886 | 6.9886 | 3.1083 | 2.0588 | 6.3380 | 14.4078 | 3.2745 |
| Duration | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0001 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Energy | -15.6823 | -2.0653 | 1.6179 | 10.0388 | 9.0500 | 2.0176 | 2.3045 | -4.4259 | -1.6677 | 2.3305 | -7.4693 | -3.6158 | -1.7405 | 5.0390 | -3.0943 | -3.2991 | -1.7639 | 0.0000 | -2.1039 | -22.2096 | 0.0000 |
| Instrumentalness | -3.6667 | -1.3372 | -3.6663 | 6.1452 | -5.1488 | -11.2745 | 0.1500 | -5.4592 | -2.8128 | -1.9594 | -6.2774 | -2.0028 | -3.6329 | 12.3867 | 0.4659 | -3.9741 | -3.7457 | -3.5457 | -3.0664 | -6.1302 | 0.0000 |
| Key | 0.1659 | 0.0000 | 0.0000 | 0.0641 | 0.0264 | 0.0441 | -0.0582 | 0.0566 | 0.0283 | 0.0033 | 0.0051 | 0.0056 | -0.0070 | -0.0128 | -0.0029 | 0.0165 | 0.0424 | 0.0000 | 0.0000 | -0.0241 | 0.0459 |
| Liveness | -1.0938 | -1.8495 | -4.7373 | -17.5022 | -2.2588 | 0.0000 | -4.8279 | -3.4149 | -1.5925 | -3.3121 | -1.6972 | -1.6111 | -2.5461 | -5.6793 | -1.8306 | -0.7564 | -2.3013 | -0.8803 | 0.0000 | -3.7747 | -3.7309 |
| Loudness | 0.5453 | 0.1821 | -0.0897 | -0.0595 | 1.3966 | 0.3047 | 0.0318 | 0.4416 | 0.0386 | 0.0000 | 0.5870 | 0.1844 | 0.0000 | -0.1426 | 0.2398 | 0.1887 | 0.4861 | -0.0170 | 0.2771 | 1.5444 | 0.1374 |
| Mode | 0.0000 | 0.5583 | 0.0670 | 0.6269 | 5.1070 | -0.1907 | 0.7710 | 0.2595 | 0.2582 | -0.0125 | 0.1366 | 0.0000 | 0.0829 | 1.2435 | 0.0178 | 0.1830 | 0.0554 | 0.0000 | 0.7285 | 0.1296 | 0.1812 |
| Speechiness | 0.0000 | -2.2027 | -8.3965 | -29.6800 | -1.7966 | -3.3023 | -6.3106 | 0.5561 | -10.0898 | -3.6357 | -3.9845 | 1.2500 | -6.5687 | -6.5692 | -7.9102 | -0.3994 | -2.3776 | 0.0000 | -1.2300 | -7.5302 | -0.1712 |
| Tempo | -0.0351 | -0.0066 | -0.0039 | -0.0149 | -0.0479 | 0.0037 | -0.0089 | 0.0000 | 0.0000 | -0.0009 | 0.0091 | -0.0047 | 0.0029 | -0.0025 | -0.0049 | 0.0047 | 0.0000 | 0.0000 | 0.0032 | 0.0222 | -0.0010 |
| Time Signature | -0.3947 | 0.2950 | 1.2832 | 0.0000 | 1.8677 | 0.0412 | 0.9362 | -0.0833 | 0.5988 | 0.4433 | 0.2107 | 0.1036 | 0.3647 | -0.0649 | 0.1037 | 0.3106 | 0.0000 | 0.0000 | 0.0000 | 3.6056 | 1.1337 |
| Valence | 4.3737 | 2.3216 | -0.0448 | -4.4647 | -19.6664 | -5.9226 | 1.7677 | 0.3811 | 0.2462 | 0.0000 | 0.7289 | -0.3264 | -4.0722 | -11.7220 | 0.8099 | 0.1279 | 0.7463 | 1.4001 | -0.5279 | -2.1281 | -1.7912 |
| RMSE | 6.8545 | 7.6041 | 9.7980 | 13.0767 | 16.0840 | 8.0972 | 9.6511 | 11.1051 | 9.5562 | 8.1531 | 8.3126 | 7.4543 | 9.3489 | 12.1787 | 8.3099 | 7.9098 | 8.9284 | 7.9949 | 10.6584 | 13.0360 | 8.9997 |


In our analysis of the lasso regression models, we investigated two things: the RMSE values of each model and the differences in coefficients between the optimized models. Because RMSE is measured in the same units as the response, and popularity runs from 0 to 100, our RMSE values, from a minimum of 6.8545 to a maximum of 16.0840, mean the models' predictions are typically off by roughly 7 to 16 popularity points, which is larger than we would consider ideal. We suspect that some of the high RMSE values can be attributed to the large number of predictors, which inevitably introduces some unexplained variability into the model. The tile plot below displays the range of coefficient values for each genre's model.
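For reference, the RMSE of a model with predictions \(\hat{y}_i\) is

\[
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \big( y_i - \hat{y}_i \big)^2},
\]

so it is expressed in the same units as the response, here points on the 0 to 100 popularity scale.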

Across the models, track name, tempo, duration, and artist name were shrunk to zero, or very close to zero, in almost every genre, which indicates they are not important for predicting popularity. In contrast, acousticness, danceability, energy, instrumentalness, liveness, speechiness, and valence consistently remained in the models as predictors, making them key variables for predicting popularity. While the magnitudes of the coefficients cannot be compared across predictors due to their different units, the signs (negative vs. positive) can be. A negative coefficient means a higher value of that variable is associated with lower predicted popularity, while a positive coefficient means the opposite; in our results, speechiness and liveness trend negative and danceability trends positive. Additionally, certain variables, namely valence and energy, regularly flip between positive and negative coefficients across genres, which could indicate that they are much more sensitive to genre, demonstrating that these variables have different effects on popularity within each genre.

To further evaluate the models, we obtained additional songs from the Spotify API, which reports the same measurements and categories as the original data. Since the original dataset included songs only up to 2019, we pulled about 3,000 additional songs from 2020 through 2022 to see how well our models work for newer music, and we focused on the pop genre because of the high number of pop songs in the new sample. Testing our pop model on the new pop songs produced an RMSE of 33.19, far from ideal. This suggests that either the characteristics of pop songs changed significantly after 2019 or our original model was overfitted. We can speculate that the COVID-19 pandemic changed trends in the music industry, through changes in society or different recording methods such as at-home recording, in ways that shifted the measures we used. However, it is also possible that the model was overfitted despite our use of cross-validation and separate training and testing data.
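A sketch of this evaluation with the spotipy client. The credentials are placeholders, the search query and single page of results are simplifications (the real pull would loop over the offset parameter to gather roughly 3,000 songs), and `predictor_cols` and `genre_models` continue from the lasso sketch; the feature columns fed to the model must match those it was trained on:

```python
import numpy as np
import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
    client_id="YOUR_CLIENT_ID", client_secret="YOUR_CLIENT_SECRET"))

# One page of recent tracks (50 at a time is the search limit).
results = sp.search(q="year:2020-2022", type="track", limit=50)
items = results["tracks"]["items"]

# Audio features for those tracks, plus their current popularity scores.
features = sp.audio_features([t["id"] for t in items])
X_new = pd.DataFrame(features)[predictor_cols]  # must match training columns
y_new = np.array([t["popularity"] for t in items])

# Test the previously fitted pop model on the new songs.
pred = genre_models["Pop"].predict(X_new)
print("RMSE on new pop songs:", np.sqrt(np.mean((y_new - pred) ** 2)))
```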

CONCLUSION

Using a KNN model, we answered the question of how to classify whether or not a song is a soundtrack song. After running 7 different models, we found that the KNN numeric model without the liveness variable was the best at predicting soundtrack songs. However, its sensitivity was only about 0.62, meaning it still misses nearly 40% of true soundtrack songs, so the model is not an outstanding indicator on its own. We were not surprised that the sensitivity rates of all the classification models were relatively low, given the small percentage of soundtrack songs in the data. To increase the model's accuracy, we would need a larger sample of soundtrack songs; with more of them, the model could predict soundtrack songs even better. In future work, we could also try to classify more genres besides soundtrack.

By implementing the lasso regression methodology, we answered the question of how to predict the popularity of a song across genres. We fit an individual cross-validated lasso regression model for each genre and identified the optimal coefficients for each characteristic in each model. In analyzing the results, we found that track name, tempo, key, duration, and artist name were consistently shrunk out of the models. We also found that higher danceability is expected to produce higher popularity across most genres, while higher speechiness and liveness are expected to produce the opposite. We concluded that valence and energy were the most genre-sensitive predictors, as their coefficients were positive in some models and negative in others. In addition, the RMSE values of our models were larger than we would like; future work could focus on reducing the RMSE, potentially by using alternative model structures such as ridge regression or elastic net regression.

With a larger sample size, our models could become useful for music platforms to categorize genres. If we were to continue this research and ultimately work out a deal with Spotify, we would need to gather a larger sample to improve our accuracy. This is particularly important for Spotify as competitors such as Apple Music, YouTube Music, and TIDAL flood into the music streaming market. With this influx of companies aiming for the top of the industry, anything that separates Spotify from the others is beneficial; in this case, a more accurate genre classification system and popularity predictor. Better genre classification would make the platform easier to use and improve customer feedback, and Spotify could use our model to improve its music recommendation system, similar to Pandora Radio. Additionally, as Spotify continues to develop its "Recorded at Spotify Studio" series, it would benefit from being able to predict a song's popularity based on the variables within each genre. With further development, our lasso regression model could help Spotify produce more popular songs, making it a more profitable company. We hope the results of this investigation can help increase revenue for Spotify, other streaming platforms, and production studios.