Stats Glossary - dggraphs.com

This is a glossary of the statistics used on this site. Terms are listed in alphabetical order. It is incomplete but will be updated regularly.

Aging Curves – One of the age-old questions in sports analytics is whether the influence of aging on player performance is uniform or differs based on player skill set or style. Another way to phrase this question is whether player performance is better predicted by a generic (uniform) aging curve or by one based on player type (skill set).

I explored this question for touring MPO player ratings (PDGA player ratings). Long story short, I found that aging curves based on three player types are more predictive of player ratings than the generic aging curve.

To determine the appropriate player types, I grouped players based on their stats (C1X, C2P, Fairway hits, C1R, C2R, Scramble, OB) using a variety of clustering methods (k-means, Gaussian mixture model (GMM), hierarchical, and spectral) and numbers of clusters, or player types (ranging from 2 to 6).

Then, I modeled aging curves for the resulting groups using generalized additive models (GAMs). Finally, I determined which method and number of clusters produced the most predictive aging curves using leave-one-out cross-validation. I found that 3 player types generated using spectral clustering were the most predictive.
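To make the leave-one-out comparison concrete, here is a minimal Python sketch. It is not the code behind the aging curves: the GAMs are replaced by simple means as a stand-in, age is ignored, and the ratings, the `type` labels, and all function names are hypothetical. It only shows how holding out one observation at a time lets you compare a generic predictor against a player-type predictor.

```python
def loocv_mse(points, predict):
    """Leave-one-out cross-validation: hold out each point once and
    average the squared prediction errors."""
    errors = []
    for i, held_out in enumerate(points):
        training = points[:i] + points[i + 1:]
        errors.append((predict(training, held_out) - held_out["rating"]) ** 2)
    return sum(errors) / len(errors)

def generic_mean(training, held_out):
    # Stand-in for a generic aging curve: one prediction for everyone.
    return sum(p["rating"] for p in training) / len(training)

def type_mean(training, held_out):
    # Stand-in for type-specific curves: predict from the same player type only.
    same = [p["rating"] for p in training if p["type"] == held_out["type"]]
    return sum(same) / len(same)

# Hypothetical ratings for two player types
points = [
    {"type": 0, "rating": 1000}, {"type": 0, "rating": 1004}, {"type": 0, "rating": 998},
    {"type": 1, "rating": 1030}, {"type": 1, "rating": 1034}, {"type": 1, "rating": 1028},
]

generic_error = loocv_mse(points, generic_mean)
typed_error = loocv_mse(points, type_mean)
print(generic_error, typed_error)
```

The predictor with the lower held-out error is the more predictive one. In this made-up data the type-based predictor wins because the two groups sit at clearly different rating levels.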

One factor that can significantly influence the quantification of aging in this context is survivor bias. In short, only players that are playing well will continue to have player ratings as they age. Others stop playing. This influence should not be as profound in professional disc golf as in other professional sports. However, it is still important to examine.

To quantify and control for survivor bias, I used the inverse probability weighting method. To do this, I used a logistic regression to quantify the probability that a player survives (continues to have a player rating) based on their age and number of rated rounds. I inverted the probabilities, normalized them, and used them as weights in the GAMs. These inverse probabilities weight some data points more than others. Data points for players who are very young, are very old, or have played fewer rounds are weighted more because those players are less likely to continue playing.
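The invert-and-normalize step can be sketched in a few lines of Python. This is an illustration under assumptions, not the site's code: the survival probabilities are taken as already estimated by the logistic regression, "normalized" is assumed to mean rescaled so the weights average to 1, and the function name and numbers are hypothetical.

```python
def inverse_probability_weights(survival_probs):
    """Turn survival probabilities into normalized inverse-probability weights."""
    if any(p <= 0 or p > 1 for p in survival_probs):
        raise ValueError("probabilities must be in (0, 1]")
    raw = [1.0 / p for p in survival_probs]
    mean_raw = sum(raw) / len(raw)
    # Assumed normalization: rescale so the weights average to 1.
    return [w / mean_raw for w in raw]

# Hypothetical survival probabilities for three player-seasons
probs = [0.9, 0.5, 0.25]
weights = inverse_probability_weights(probs)
print(weights)
```

The player-season with the lowest chance of surviving (0.25) receives the largest weight, which is how under-represented kinds of players get more say in the fitted curve.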

The survivor-bias-controlled curves should be more representative of the TRUE influence of age on player rating, and indeed, they do fit the data better (have a greater % deviance explained). However, they are also less predictive according to cross-validation. So, I provide you with both prediction plots and survivor-bias-controlled plots.

Go to the aging curve page to see the results!

Effective Putting Percentage (ePP) – The name comes from the basketball stat named effective field goal percentage, which is a modified field goal percentage (percent of shot attempts made). It is modified so that three pointers are given proportionally more value (1.5x). This additional value means that you can be less efficient at making three pointers but still provide more value to your team. I applied this concept to C1X and C2P in disc golf. A made C2 putt is worth more in terms of strokes gained on the field than a made C1X putt. To estimate the relative value of these putts, I regressed C1X putts made and C2 putts made against their respective strokes gained stats. Like the linear weights used in many other sports analytics fields, this assumes a straight-line relationship between putts made and strokes gained over the field. Then, I weighted the made putts by their relative values and divided the sum of those values by total putting attempts to calculate a raw effective putting percentage. Finally, to help with interpretation, I scaled the statistic so that its mean and standard deviation equal those of C1X putting (on a yearly basis).
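Here is a rough Python sketch of the two ePP steps described above: the raw weighted percentage, then the rescaling to the mean and standard deviation of C1X putting. The weights are placeholders (1.0 for C1X and 1.5 for C2, borrowed from the basketball analogy); the real weights come from the strokes gained regressions and are not given here. The player numbers and function names are hypothetical.

```python
from statistics import mean, pstdev

def raw_epp(c1x_made, c1x_att, c2_made, c2_att, w_c1x=1.0, w_c2=1.5):
    """Weighted made putts divided by total attempts.
    w_c1x and w_c2 are placeholder weights, not the site's values."""
    return (w_c1x * c1x_made + w_c2 * c2_made) / (c1x_att + c2_att)

def rescale_to(values, target):
    """Linearly rescale values so they share the mean and SD of target."""
    m, s = mean(values), pstdev(values)
    tm, ts = mean(target), pstdev(target)
    return [tm + ts * (v - m) / s for v in values]

# Hypothetical season: (C1X made, C1X attempts, C2 made, C2 attempts)
players = [(80, 100, 10, 40), (85, 100, 8, 40), (90, 100, 15, 50)]

c1x_pct = [made / att for made, att, _, _ in players]
raw = [raw_epp(*p) for p in players]
epp = rescale_to(raw, c1x_pct)
print(epp)
```

After rescaling, ePP can be read on the same scale as C1X putting percentage for that year, which is the point of the final step.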

Expected Strokes Gained (xSG) – This stat uses a statistical model (linear regression) to estimate how a player would be expected to score (using strokes gained) based on their underlying statistics (FH, Parked, C1R, C2R, C1X, C2P, Scrambling). What is left over, the unexplained variation in strokes gained, is Luck Factor (LF). Here is an article I wrote exploring this concept further.

Luck Factor (LF) – Luck factor has an admittedly provocative name. It is the portion of strokes gained that is NOT explained by a player’s underlying stats. Is there skill that is not reflected in the stats currently collected? Of course! However, the data suggest that a vast majority of those leftover strokes gained are due to randomness, or luck. For example, a player’s luck factor in 2023 only explained 2% of the variance in their luck factor for 2024.

The above explanations of xSG and LF left out a few details for reasons of accessibility. If you want, join me in the weeds. xSG is calculated as the predicted tSG value from the statistical model (multiple linear regression) built using the statistics mentioned above. The differences between the observed and predicted tSG values are the LFs. In technical terms, the LF values are the residuals from the statistical model. A vast majority of the variation in tSG can be explained by the underlying statistics; however, some variation in tSG is not explained by them. The interesting question is how much of that variance is due to randomness (the luck in luck factor) and how much is due to some other causal attribute (or attributes) of golfers (distance, etc.) that is not captured by the statistics. At this point, the causal variable(s), if any exist, are unknowable, but when I ran a regression between LF and tSG, I found that LF increases with tSG at a statistically significant, though tiny, rate. For every unit increase in tSG, LF increases by 0.076. This indicates that there is an underlying causal link between LF and tSG that is not captured in the original regression. To deal with this, I took the predicted LF values from the LF-tSG regression and did two things: 1) subtracted them from the original LF values and 2) added them to the xSGs. When I regress these updated LF values against tSG, there is NO statistical relationship. This indicates that the additional causal variable(s) are captured and added to the causal component of the model, xSG.

This process of running a second regression using the residuals of the first is called “two-stage residual inclusion”. Its interesting results here are an impetus for future exploration of what is underlying the relationship between LF and tSG. In other words, what player attributes are not captured in the currently available statistics?
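The xSG, LF, and adjustment steps described above can be traced in a small Python sketch. To keep it short, it assumes a single underlying stat rather than the multiple regression used on the site, and the data and the `fit_line` helper are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical data: one underlying stat and observed tSG for six players
stat = [50, 55, 60, 65, 70, 75]
tsg = [-2.0, -0.5, 0.0, 1.5, 1.0, 3.0]

# Stage 1: xSG is the model prediction; LF is the residual (observed - predicted).
b1, a1 = fit_line(stat, tsg)
xsg = [a1 + b1 * s for s in stat]
lf = [t - p for t, p in zip(tsg, xsg)]

# Stage 2: regress LF on tSG and move the predicted part from LF into xSG.
b2, a2 = fit_line(tsg, lf)
lf_hat = [a2 + b2 * t for t in tsg]
lf_adj = [l - h for l, h in zip(lf, lf_hat)]
xsg_adj = [p + h for p, h in zip(xsg, lf_hat)]

# Check: slope of adjusted LF on tSG (zero up to floating-point error).
b3, _ = fit_line(tsg, lf_adj)
print(b2, b3)
```

Two things hold after the adjustment: adjusted xSG plus adjusted LF still adds up to the observed tSG for every player, and the slope of adjusted LF on tSG is zero up to rounding, which matches the "NO statistical relationship" result described above.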

Player comps – This stat represents the top-5 comps (comparable players) for an individual player in a given year. These comps are built on “+” stats. Using these stats, a Euclidean distance matrix is generated, and the 5 players with the lowest distance values are selected as the comps for an individual player of interest.
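As an illustration, the nearest-player lookup can be sketched in Python like this. The player names, the three "+" stats per player, and the `top_comps` function are all hypothetical; the example uses k=2 only because the toy dataset is tiny.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length stat vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def top_comps(plus_stats, player, k=5):
    """Return the k players whose "+" stats are closest to the given player's."""
    target = plus_stats[player]
    distances = [
        (euclidean(target, vec), other)
        for other, vec in plus_stats.items()
        if other != player
    ]
    distances.sort()  # smallest distance first
    return [name for _, name in distances[:k]]

# Hypothetical "+" stats (for example C1R+, C1X+, Scramble+) for four players
plus_stats = {
    "A": (120, 100, 95),
    "B": (118, 102, 96),
    "C": (110, 105, 90),
    "D": (80, 90, 110),
}

comps = top_comps(plus_stats, "A", k=2)
print(comps)
```

Because "+" stats are all centered on 100, no single stat dominates the distance just because of its units.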

Plus (+) stats – These are yearly, standardized values with the player-wide mean set to 100 and a unit change equal to a 1% change in the statistic (baseball statheads will be familiar with this concept). For example, a player with a C1R+ of 120 has a C1R value 20% above the average player that year. Broadly speaking, these types of stats are considered to be detrended and allow for meaningful comparisons of players across time. Only players with 4+ events in a season are used to calculate these stats. When fewer than 4 events have been completed in a season, only players who have competed in all of the events are included. These stats are also used to generate player comps.
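A quick Python sketch of the "+" calculation, assuming the value is simply 100 times the player's stat divided by the yearly mean of eligible players. It applies only the 4+ events rule (not the special case for seasons with fewer than 4 events), and the names and numbers are hypothetical.

```python
def plus_stat(season, min_events=4):
    """season: list of (player, stat value, events played) tuples.
    Returns 100 * value / mean, computed over eligible players only."""
    eligible = {name: value for name, value, events in season if events >= min_events}
    avg = sum(eligible.values()) / len(eligible)
    return {name: 100 * value / avg for name, value in eligible.items()}

# Hypothetical C1R values; player D has too few events to be included
season = [("A", 0.48, 10), ("B", 0.40, 8), ("C", 0.32, 6), ("D", 0.60, 2)]

c1r_plus = plus_stat(season)
print({name: round(value, 1) for name, value in c1r_plus.items()})
```

Here the eligible mean is 0.40, so player A's C1R of 0.48 becomes a C1R+ of about 120, which is the 20% above average example from the text.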

POY Awards – These are season-long, statistically based player of the year (POY) awards. The McBeth Award is given to the statistically best MPO player and the PiERCE Award is given to the statistically best FPO player. The award is based on the average of two values: season-long accumulated expected true strokes gained (xtSG) and per-event average xtSG. The logic here is that this average balances how often someone plays (accumulated xtSG) and the quality of their play (per-event xtSG). The season-long and per-event values are on quite different scales, so I rescaled both metrics so that the highest value is 10. So, if a player has the highest accumulated xtSG and the highest per-event xtSG, they will have a POY value of 10.
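The POY value can be sketched in Python as follows. Two assumptions are made here that the text does not spell out: each metric is rescaled by dividing by its maximum (so the leader gets 10), and the two rescaled metrics are combined with a plain average. The players and numbers are hypothetical.

```python
def poy_scores(total_xtsg, per_event_xtsg):
    """Average of two metrics, each rescaled so its highest value is 10.
    Assumes the maximum of each metric is positive."""
    max_total = max(total_xtsg.values())
    max_event = max(per_event_xtsg.values())
    return {
        player: (10 * total_xtsg[player] / max_total
                 + 10 * per_event_xtsg[player] / max_event) / 2
        for player in total_xtsg
    }

# Hypothetical season totals and per-event averages
total_xtsg = {"A": 200.0, "B": 100.0, "C": 150.0}
per_event_xtsg = {"A": 12.0, "B": 12.0, "C": 6.0}

scores = poy_scores(total_xtsg, per_event_xtsg)
print(scores)
```

Player A leads both metrics and scores 10. Player B matches A per event but played half as much, and lands at 7.5, ahead of player C, who played more than B but at a lower per-event level.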

The names of the awards are backronyms. They may be a little confusing, but they are fun!

True Strokes Gained (tSG) – True strokes gained is an extension of the concept of strokes gained, a statistic developed in ball golf. At its core, strokes gained can directly quantify how a player’s tee-to-green skills and putting skills influence their score because, at its simplest, it is a measure of how many throws it takes for a player to get the disc in the basket from any location on a hole. True strokes gained controls for strength of field. Typically, strokes gained stats compare players to the average of the field within an event. Of course, the average player varies quite a bit from event to event. So, true strokes gained sets the benchmark of comparison to the 1000-rated player for MPO and the 930-rated player for FPO. We have this stat broken down by tee-to-green (tTG_SG) and putting (tP_SG).

Weighted Putting Percentage (wPP) – This is simply a weighted mean of C1X and C2 putting percentages.

Win Probability Model – The model has been significantly updated from last year. It still uses a random forest model to generate win probabilities, but the statistics have changed significantly. The model is based on the following statistics from 2016-present: strokes gained tee-to-green, strokes gained putting, LF, player rating, total wins (total number of wins for the player in the dataset), decayed wins (total number of wins but with a decaying value, decay rate = 0.99), performance at the location (event) in previous years, year, day in year, location, and temperature forecast. I also explored including precipitation in the model, but its inclusion significantly reduced the predictive power of the model. The statistics are weighted so that those from more recent events count more heavily.
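One way to read "decayed wins" is as a sum in which each win is discounted by 0.99 raised to its age. The Python sketch below assumes exactly that, and it also assumes age is counted in events since the win. The time unit is not stated above, so treat both the unit and the function name as hypothetical.

```python
def decayed_wins(win_ages, decay_rate=0.99):
    """Sum of wins, each discounted by decay_rate ** age.
    win_ages: how long ago each win happened (assumed unit: events)."""
    return sum(decay_rate ** age for age in win_ages)

# Hypothetical player: one win just now, one win 100 events ago
recent_and_old = decayed_wins([0, 100])
print(recent_and_old)
```

A win that just happened counts as a full win, while the win from 100 events ago counts as a little over a third of one, so the stat rewards recent winning more than total wins does.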

The random forest model was generated using the randomForest package in the R statistical program, with the number of trees set to 15000 and mtry=3. The rest of the hyperparameters were set as follows: minimum node size=1, sample fraction=0.6, and the cases were weighted to reduce the asymmetry between data on winning versus non-winning players. I have spent A LOT of time attempting to improve this model. I experimented with boosted trees, but those models were not more predictive than the random forest models. Some day I will experiment with neural networks.