The R Heels: Julia Barrow, John Kenan Bauer, Bradley Buchner, Andy Makhanov, Sarah Wooster
In this project, we applied machine learning techniques to predict how a batter will hit a baseball. We classified batted balls into three categories: ground ball, line drive, and fly ball. This paper uses logistic regression, neural network, and extreme gradient boosting algorithms to create prediction models, compares their predictive performance, and determines which algorithm performs best. We found that extreme gradient boosting has the highest predictive accuracy at 61%, while logistic regression and neural networks perform worse, with accuracies of 50% and 52%, respectively. Answering this question informs batters and pitchers about the factors and strategies that lead to more desirable hit types.
Baseball, the birthplace of sports analytics, is one of the most data-driven sports today. Using baseball data, statistical techniques can be applied to analyze and improve team and individual player performance. For our project, we used data gathered by Major League Baseball (MLB) to classify over 14,000 balls hit into play during MLB games. We classified the hits into three categories: fly ball, line drive, and ground ball. These hit types matter because their difficulty to field varies greatly. A fly ball travels high into the air, where it becomes easy to catch. A line drive is batted with a flat trajectory and is generally hit harder than other balls, making it difficult to field. A ground ball is hit downward into the ground, where it bounces and loses momentum, making it easy to field. For our predictors, we explored pre-hit measurements describing the pitch, such as release location, speed, and pitch type. We modeled our classifications using logistic regression, extreme gradient boosting, and neural networks.
Our project has a variety of practical applications within the sport. For a pitcher, it is ideal to contain hits so that they are easier for the team to field, preventing runs from being scored. A pitcher may use our models to adjust their pitching strategy so that, even when they fail to throw a strike, hits more often result in unproductive hit types (ground and fly balls). There are also applications for batters. A batter's goal is to hit more line drives, because the ball travels further and is harder to field than ground and fly balls; hitting more line drives increases a player's batting average, a commonly used measure of hitting performance. A batter might discover, for example, that they are unlikely to hit a line drive on pitches of certain types and locations, and use that knowledge to avoid swinging at those pitches during games. In addition, when facing certain hitters, a manager may use our algorithms to decide whether to substitute pitchers in order to pitch more effectively, since left-handed versus right-handed pitcher/hitter matchups tend to drive tactical substitutions. These are potential examples of ways our project can be meaningful for teams and players in baseball.
Analytics has been a commonly used tool in professional baseball for many years. In 2002, the Oakland Athletics and their general manager, Billy Beane, caused a revolution in sports management by utilizing data to enhance team performance. This breakthrough inspired Moneyball, the book and later film that showcase the impact of this innovative approach. Using analytics has proven to be an accessible way for teams to become more competitive without increasing the budget.
Machine learning algorithms in sports analytics are popular for predicting pitch type, game outcomes, hit outcomes, and player performance. A study by Sidle (2017) at N.C. State University classified pitch types using a combination of linear discriminant analysis, support vector machines, and bagged ensembles of classification trees (random forests). Another study, by Barnes and Bjarnadottir (2016), used regression models to identify undervalued and overvalued free agents within the sport and predict their future performance. A third study, by Das and Das (1994), used neural networks to evaluate the catchability of batted balls. These are a few examples of popular machine learning techniques applied to the sport. However, we found no studies designed to predict hit type specifically, which is the objective of our project.
The data used to train our models contained 14,143 observations, each representing a pitch thrown in an MLB game that was hit into play. The data was acquired through Baseball Savant, a website operated by MLB. All 30 MLB stadiums are equipped with ball-tracking technology that records hundreds of measurements for every pitch thrown, including coordinates, velocities, and angles; a condensed version of this data is made available to the public via Baseball Savant. To predict an unbiased response, all post-hit variables were excluded except one, hit type, which served as the response. Identification variables, such as player and team identifying numbers, were removed because they had minimal relevance to the research question. After cleaning, all variables were scaled so that every predictor received equal weight across the various models. To obtain an honest estimate of performance and avoid overfitting, thirty percent of the data was randomly selected for testing and the remaining seventy percent for training.
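The scaling and 70/30 split described above can be sketched in a few lines. This is an illustrative Python version rather than our actual pipeline, and the helper names are hypothetical:

```python
import random

def standardize(values):
    """Scale a numeric column to mean 0 and standard deviation 1."""
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / sd for v in values]

def train_test_split(rows, test_frac=0.30, seed=1):
    """Randomly hold out test_frac of the observations for testing;
    the remainder is used for training."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * test_frac)
    return shuffled[cut:], shuffled[:cut]  # train, test
```

Fixing the random seed makes the split reproducible, which matters when several models are compared on the same held-out test set.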
Other steps were taken to prepare the data; for example, the popup and fly ball categories of the response variable were combined. Popups consistently have high launch angles, so grouping them with fly balls was natural and also produced a more balanced dataset across classes. Graph 1 displays the values of release position x, a numerical variable that denotes the ball’s location along the x-axis upon release by the pitcher. These values sat on either side of zero depending on the pitcher’s handedness, so the variable encoded both the release position and the handedness of the pitcher. Since pitcher handedness was already captured by a categorical predictor in our models, and to remove the resulting multi-modality, values of release position x for left-handed pitchers were multiplied by negative one.
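The sign flip for left-handed pitchers amounts to a one-line transformation. A minimal sketch, using the column names release_pos_x and p_throws described below:

```python
def fold_release_x(release_pos_x, p_throws):
    """Mirror left-handed release points so the variable measures
    lateral release distance on the pitcher's arm side, collapsing
    the two handedness-driven modes into one."""
    return -release_pos_x if p_throws == "L" else release_pos_x
```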
In total, thirteen variables were kept, nine of which were numerical; their descriptive statistics can be seen in Table 1. Release_speed is a measurement of the pitch velocity at release from the pitcher’s hand. Release_spin_rate is the revolutions per minute (RPM) of the ball during flight, and release_extension is how far in front of the pitching rubber, toward the batter, the ball is when released, usually increasing with the pitcher’s wingspan.
Coordinates are an integral part of sports analytics. To contextualize the coordinates, an interactive graph is provided in Graph 2. The cross-section created by the x and z coordinates matches the perspective of the umpire facing the pitcher. Combining the information from Graphs 1 and 2, the ball leaves the pitcher’s hand on the side of their dominant hand but does not necessarily end up on that side of the plate.
Release_pos_x and release_pos_z document the x and z coordinates of the ball at the moment it is released by the pitcher. Pfx_x and pfx_z represent the distance in feet that the ball moves in the x and z directions, respectively, during its flight due to its spin rate and spin direction. Additional coordinate predictors, plate_x and plate_z, record the ball’s position when it reaches the batter at home plate: plate_x is the horizontal distance from the center of home plate, and plate_z is the height above ground level.
The data also contained four categorical variables.
Pitch_type refers to the type of pitch that was thrown,
such as curveball (CU), four-seam fastball (FA), or slider (SL). The
pitch type is directly related to and is classified by the variations in
the pfx variables, as each pitch type causes a different path in the air
to the plate as seen in Graph 3. In baseball, pitchers will often throw
different types of pitches depending on the desired outcome. Some
pitch types are easier for the pitcher to throw to certain locations,
and some tend to produce worse outcomes for the batter. To throw different
pitch types, the pitcher will adjust the orientation of the ball and
their grip. Pitch types vary in speed, spin rate, movement direction,
and more.
Stand is a categorical variable with two levels: Left
(L) or right (R). Stand documents the side of the plate on
which the batter is standing, indicating the dominant hand of the
batter. Similarly, p_throws documents the dominant hand of the pitcher, using the same two levels as stand. Finally, the response variable, bb_type (an abbreviation of batted ball type), takes three possible values: ground ball, fly ball, and line drive. Ground balls are hit with a lower launch angle, fly balls with a higher angle, and line drives in between, more parallel to the field. More specifically, a hit is classified as a ground ball if its launch angle is below 5 degrees, a line drive if between 5 and 25 degrees, and a fly ball if above 25 degrees.
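These launch-angle cutoffs can be written as a small classification rule. The function below is an illustrative sketch only; in our data the bb_type labels came pre-assigned:

```python
def classify_bb_type(launch_angle_deg):
    """Batted-ball type from launch angle, per the cutoffs above:
    below 5 degrees is a ground ball, 5 to 25 a line drive,
    above 25 a fly ball."""
    if launch_angle_deg < 5:
        return "ground_ball"
    if launch_angle_deg <= 25:
        return "line_drive"
    return "fly_ball"
```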
Multinomial logistic regression was chosen for its simplicity in performing multiclass classification. Beyond the scaling already applied, no other data transformation was necessary for this method.
Three models were fitted using the multinomial method with 5-fold cross-validation. All features, as well as additional interaction terms, were incorporated into the models. Interaction terms were chosen based on relationships discovered during exploratory data analysis or on established associations in baseball, and each interaction term tested was kept only if it improved the model’s Brier Skill Score.
The first model that showed an improved Brier Skill Score tested two
relationships including the p_throws variable. The first
was the interaction between release_pos_x and
p_throws. Release_pos_x measures how far out
to the side the pitcher releases the ball, and p_throws is
a categorical variable for pitcher handedness, so the interaction term
represents how far out and to which side the pitcher releases the ball.
The second was the interaction between pfx_x and
p_throws. Pfx_x is a numerical variable that
measures the spin-induced movement of the ball in the x direction during
its flight and its sign is partially dependent on the pitcher’s
handedness, indicating the need for an interaction with
p_throws.
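In an illustrative Python sketch (the models themselves were fit with a standard multinomial routine, which builds such terms automatically), these two interactions amount to multiplying each numeric variable by a handedness indicator. The 0/1 encoding and column names below are assumptions for illustration:

```python
def with_interactions(row):
    """Append the two interaction columns used in the first model:
    release_pos_x x p_throws and pfx_x x p_throws, with p_throws
    encoded as an indicator (1 = right-handed, 0 = left-handed)."""
    is_rhp = 1.0 if row["p_throws"] == "R" else 0.0
    out = dict(row)
    out["release_pos_x:p_throws"] = row["release_pos_x"] * is_rhp
    out["pfx_x:p_throws"] = row["pfx_x"] * is_rhp
    return out
```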
The second model includes interactions similar to the first but extends them to three-way interactions by adding the batter’s handedness. This accounts for the interplay between batter and pitcher handedness, a factor that significantly influences baseball tactics: batters are commonly understood to perform better against pitchers of the opposite handedness.
The third model included the same interactions as the second but
added interaction terms between plate_z,
plate_x, and stand, between
plate_z, plate_x, stand, and
pitch_type, and between p_throws,
pfx_x, pfx_z, and pitch_type. The
first interaction takes the ball’s location when it arrives at the
batter (plate_z and plate_x), and relates it
with the side that the batter stands on. Therefore, ball location
relative to where the batter is standing is used as a predictor rather
than absolute ball location. The second interaction takes the ball’s
location when it arrives at the batter (plate_z and
plate_x), and relates it with the side that the batter
stands on and the type of pitch that is thrown. Thus,
pitch_type is discussed in relation to its location
relative to the batter and is used as a predictor rather than just
pitch_type. The final interaction uses the relationship
between a pitch’s type, its spin-induced movement, and the pitcher’s
handedness. This allows the model to use pitch type while considering
how the pitch type moves.
Tuning the interaction terms improved the model’s test accuracy from 42% to 50% for the final model containing all 13 predictors and the interactions described above. The model’s most glaring weakness was its very low sensitivity for line drives: it struggled to identify them correctly even as it performed well on ground balls and fly balls. As a summary measure, the final model’s Brier Skill Score showed an improvement of approximately 5% over the naive model. While only a marginal improvement, given that the goal of our models was proof of pattern recognition rather than high accuracy, this was a success.
We chose extreme gradient boosting because it handles both categorical and numerical predictors effectively and scales well to a dataset of this size. Additionally, XGBoost allows us to analyze feature importance, which was of interest for this question. XGBoost was chosen over other decision tree techniques given the concern of class imbalance in the dataset.
The primary step needed to further prepare the data for an XGBoost model was one-hot encoding: creating a new 0/1 column for each level of every categorical predictor. This ensures that all levels are given equal weight, which is especially important because none of the categorical predictors has an inherent order.
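A minimal sketch of one-hot encoding for a list-of-dictionaries dataset, using nothing beyond the standard library; in practice a library routine would do this:

```python
def one_hot(rows, column):
    """Replace a categorical column with one 0/1 indicator column per
    observed level (e.g. pitch_type -> pitch_type_CU, pitch_type_SL, ...)."""
    levels = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new = {k: v for k, v in row.items() if k != column}
        for lvl in levels:
            new[f"{column}_{lvl}"] = 1 if row[column] == lvl else 0
        encoded.append(new)
    return encoded
```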
Initially, the XGBoost model was fitted using default parameters with
all 12 predictors; it performed better than random but not especially well.
Next, before tuning any parameters, the interaction between
release_pos_x and p_throws was added for the
same reasons as the logistic regression. This model performed marginally
better than the first, and given our intuition that this interaction is
important in order to account for right and left-handed pitchers, it was
kept for all subsequent models built. In addition, we constructed a
model based on the top predictors identified by cover analysis.
Surprisingly, this model turned out to be less effective than expected.
Therefore, we decided to include all the predictors in the subsequent
models. The next step was to test a different booster: XGBoost offers two tree boosters for classification, gbtree (the default) and dart. The dart booster produced worse results than gbtree; therefore, gbtree was used in all subsequent models.
The first step in tuning parameters was choosing nrounds, the number of trees built, using cross-validation; test error showed that 200 was the optimal value. Hyperparameter tuning was then used to find the optimal tree depth, minimum child node weight, subsample size for each tree, and number of features for each tree. Tuning gamma, which affects overfitting, was attempted; however, because training and testing errors were consistently similar, overfitting was not a major concern. A summary of these parameter choices is shown in Appendix 1.
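The tuned configuration can be summarized in xgboost's own parameter naming. Apart from nrounds = 200 and the gbtree booster, which come from the text, the values below are illustrative placeholders; the actual tuned values appear in Appendix 1:

```python
# Parameter summary in xgboost's naming. Only nrounds and the booster
# choice are from the text; the tuned values are placeholders here.
xgb_params = {
    "booster": "gbtree",            # dart was tried and performed worse
    "objective": "multi:softprob",  # per-class probabilities
    "num_class": 3,                 # ground ball, line drive, fly ball
    "max_depth": 6,                 # tree depth (tuned; see Appendix 1)
    "min_child_weight": 1,          # minimum child node weight (tuned)
    "subsample": 0.8,               # row subsample per tree (tuned)
    "colsample_bytree": 0.8,        # feature subsample per tree (tuned)
    "gamma": 0,                     # overfitting was not a major concern
}
nrounds = 200  # number of trees, chosen by cross-validation
```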
The results of all XGBoost models are shown in Appendix 2. Hypertuning improved the model in some respects, including considerably improving some specificities, but led to less consistency across classes. We therefore concluded that the non-hypertuned model with 200 nrounds and the interaction variable is the best. With an accuracy of approximately 60%, this model is acceptable, especially compared with the 33% baseline of random guessing in three-class classification. It also demonstrated consistent sensitivities and specificities across all classes, further supporting its reliability.
The feature importance plot below showcases three distinct levels of
predictor importance. Interestingly, the type of pitch does not have a
large importance in this prediction, which is due to many of the
predictors being characteristics that define pitch type.
However, of the pitch types, sinkers are relatively the most important,
which is logical because sinkers have a unique movement where the ball
ends up lower than the batter expects it to. Furthermore, the importance
of both pfx_x and pfx_z, which represent the
movement of the ball, compared to the lesser importance of any single pitch_type indicator, may suggest that the pfx variables capture much of the same information as pitch_type.
Interestingly, due to this observation in an initial feature importance
plot, a model was fitted without pitch_type, but this
worsened the model significantly. Finally, the most important feature
for predicting bb_type was the movement of the ball.
We opted to use a sequential neural network because of the algorithm’s capacity to model non-linear relationships between the inputs and the output. Such non-linearities were thought to exist because the problem can be partially modeled as a physics problem, in which velocity is the first derivative of position and acceleration, proportional to force, is the second. We therefore theorized that second- or third-degree polynomial relationships may explain the data well.
Similar to the XGBoost model, pitch_type was transformed
into a one-hot encoded binary variable.
The networks were trained for 1,000 epochs using Adam optimization. Both the training and validation sets reached accuracies close to 68%, and the algorithm appeared to perform well within the first several epochs.
Graph 5: Neural Network Tree
For our network, we chose three hidden layers to increase the complexity of the model, as seen in Graph 6. Of course, increasing model complexity makes it easy to overfit the data, so after each hidden layer we added a dropout layer, yielding a nine-layer neural network with three hidden layers. We then tuned the number of nodes in each hidden layer and the dropout rate using a factorial search: of 1,728 possible configurations, we randomly sampled 10% (roughly 173 models), running each for 1,000 epochs with a batch size of 2,048. The best model had 128 nodes in each hidden layer with 30% dropout, and achieved an accuracy of 52%.
Graph 6: Neural Network Graph
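One way to make the tuned architecture concrete is to count the trainable parameters implied by three 128-node hidden layers (dropout layers add no parameters). The input width used in the test is only an illustrative placeholder for the post-encoding feature count:

```python
def mlp_layer_sizes(n_features, hidden=128, n_classes=3):
    """Dense-layer widths for the tuned network: three hidden layers
    of 128 nodes each, plus the input and the 3-class output.
    Dropout layers carry no weights, so they do not appear here."""
    return [n_features, hidden, hidden, hidden, n_classes]

def n_parameters(sizes):
    """Total trainable weights + biases across the dense layers."""
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))
```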
The most successful model was the XGBoost model with one interaction term, as it had the highest accuracy and the most balanced results across classes. Its accuracy of 0.61 exceeded that of the best logistic regression (0.50) and neural network (0.52) models. While there is room for further improvement, all three models beat random guessing, which with three classes yields an accuracy around 0.33. It is also important to keep in mind that, given the nature of what is being predicted, a great deal of randomness is involved, and an extremely high accuracy is unlikely to be achievable.
Apart from accuracy, a key factor demonstrating the XGBoost model’s superiority is its sensitivity. Graph 7 shows that while all three best models performed similarly on fly balls and ground balls, with sensitivities around 0.6 and 0.7, respectively, XGBoost had a much better sensitivity for line drives: 0.59, compared to less than 0.1 for both the neural network and the logistic regression.
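Per-class sensitivity is simply the fraction of each true class that the model labels correctly, read off the confusion matrix. A minimal sketch, assuming the matrix is stored as nested dictionaries of counts:

```python
def sensitivities(confusion):
    """Per-class sensitivity (recall) from a confusion matrix given
    as {true_class: {predicted_class: count}}: for each true class,
    correct predictions divided by the class total."""
    out = {}
    for cls, preds in confusion.items():
        total = sum(preds.values())
        out[cls] = preds.get(cls, 0) / total if total else 0.0
    return out
```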
Another metric used to compare the models was the Brier skill score, which measures a model’s improvement over a naive model. XGBoost outperformed the other models on this metric as well, improving on the naive model by 19%, compared with a 5% improvement for the logistic regression and -29% for the neural network. That XGBoost performed best was not entirely surprising once the characteristics of the data are considered. Logistic regression served as a baseline and, given its simplicity, was expected to fall short of the more complex models. The neural network was included to capture non-linear relationships between the dependent and independent variables, and we were surprised to see it outperform our logistic regression model in accuracy, suggesting that the best model might be non-linear. The strong performance of the XGBoost model further supported this assumption about the data.
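Both quantities are easy to compute. A sketch of the multiclass Brier score and the skill score relative to a naive model, assuming predictions are stored as per-class probability dictionaries:

```python
def brier_score(probs, outcomes):
    """Mean multiclass Brier score: average squared distance between
    the predicted probability vector and the one-hot true outcome."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        total += sum((p[c] - (1.0 if c == y else 0.0)) ** 2 for c in p)
    return total / len(probs)

def brier_skill_score(model_bs, naive_bs):
    """Fractional improvement over the naive (class-frequency) model;
    positive means better than naive, negative means worse."""
    return 1.0 - model_bs / naive_bs
```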
One application of predictions from the XGBoost model is a visual showing how ground ball probability, for example, changes as the ball’s location relative to the batter changes. This visual is shown below:
From this visual, we learn the relationship between pitch location and ground ball likelihood. If a pitcher wants the batter to hit a ground ball, then they should try to locate the ball in the red area relative to the batter.
Another application of our model is for comparing players. Consider a scenario where a coach wants to know which of two pitchers has had the higher average ground ball probability in their last 30 batted balls. The table below shows this:
We learn that Player A had the higher average ground ball probability, indicating that their strategy and skills are more geared towards inducing ground balls than Player B.
One concern with our approach was the number of predictors removed from the original data. While this was done to focus on non-identification, pre-hit variables, some identification variables have the potential to be useful in predicting batted ball type. Specifically, identification variables could help address another issue with our approach: generalizing across individuals who often have different patterns. Finally, we focused only on balls hit into play, which overlooks the fact that the pitcher’s goal is always to have the batter strike out and thus never put the ball in play. Future development of our model should account for this, possibly by adding a fourth category to predict: strike.
From this project, we have learned that the hit type cannot be completely explained by our current set of features. A source of this variation could potentially be the individual characteristics of a player as well as general randomness that is not possible to explain given the nature of what is being predicted. Additional data could be added to get more detailed player characteristics to help explain more of this variation.
If we had infinite time and resources, we would add more predictors, specifically researching additional predictors that could have an impact on how a ball is hit. Additionally, it would be interesting to break down the data to individual players and see how the results change based on a player’s individual batting strategy. This could be useful to both batters and pitchers as each would have a preferred type of hit in various situations.
Overall, the XGBoost algorithm is the best model for forecasting hit type, with an accuracy of 61% and a strong ability to predict line drives, an underrepresented class. In the context of baseball data, this level of accuracy is enough to inform a team’s strategy for pitching and hitting. The patterns uncovered by this model help explain the choices a batter and pitcher should make, and we hope further development of this model could lead to even more useful analysis for players.
Sidle, G. D. 2017. “Using Multi-Class Machine Learning Methods to Predict Major League Baseball Pitches”. Ph.D. thesis, North Carolina State University.
Barnes, S. L., and M. V. Bjarnadóttir. 2016. “Great Expectations: An Analysis of Major League Baseball Free Agent Performance”. Statistical Analysis and Data Mining: The ASA Data Science Journal.
Das, R., and S. Das. 1994. “Catching a Baseball: A Reinforcement Learning Perspective Using a Neural Network”.
Appendix 3: Neural Network Tree
Summary of Assignments:
Kenan: Wrote the literature review and introduction.
Sarah: Wrote the methodology and data section, created EDA visuals and tables in these sections.
Bradley: Built the logistic regression model and wrote the logistic regression and practical applications sections.
Andy: Built the neural network and wrote the neural network section.
Julia: Built the XGBoost model and wrote the XGBoost model, model comparison, and limitations sections.