The distance traveled by the pass from the line of scrimmage to the intended receiver, whether or not the pass was complete.
The average air yards traveled on targeted passes for quarterbacks and targets for receivers.
A type of statistical distribution with a binary (0/1-type) response. Examples include wins/losses for a game or sacks/no-sacks from a play.
The discrete categories of numbers used to summarize data in a histogram.
Short for sportsbook in football gambling. This is the person, group, casino, or other similar enterprise that takes wagers on sporting (and other) events.
Describes a number whose value is constrained by other values. For example, a percentage is bounded by 0% and 100%.
A data visualization that uses a “box” to show the middle 50% of data and stems for the upper and lower quartiles. Commonly, outliers are plotted as dots. The plots are also known as box-and-whisker plots.
Fans of the Green Bay Packers. For example, Richard is a Cheesehead because he likes the Packers. Eric is not a fan of the Packers and is therefore not a Cheesehead.
The final price offered by a sportsbook before a game starts. In theory, this contains all the opinions, expressed through wagers, of all bettors who have enough influence to move the line into place.
A type of statistical method for dividing data points into similar groups (clusters) based on a set of features.
The predictor estimates from a regression-type analysis. Slopes and intercepts are special cases of coefficients.
The rate at which a quarterback has successful (completed) passes, compared to what would be predicted (expected) given a situation based on an expected completion percentage model.
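As a rough sketch of the arithmetic behind this metric, the snippet below averages the difference between each pass's outcome (1 for complete, 0 for incomplete) and a modeled completion probability. The pass data and the `expected_cp` values are invented for illustration; in practice they would come from an expected completion percentage model.

```python
# Hypothetical passes: (completed? 1/0, modeled completion probability).
# These numbers are invented for illustration only.
passes = [(1, 0.70), (0, 0.60), (1, 0.50), (1, 0.80)]


def completion_pct_over_expected(passes):
    """Average of (actual outcome - expected probability), in percentage points."""
    diffs = [complete - expected_cp for complete, expected_cp in passes]
    return 100 * sum(diffs) / len(diffs)


cpoe = completion_pct_over_expected(passes)
print(round(cpoe, 1))
```

A positive value means the quarterback completed more passes than the model expected for those situations.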
A measure of uncertainty around a point estimate such as a mean or regression coefficient. For example, a 95% CI around a mean will contain the true mean 95% of the time if you repeat the observation process many, many, many times. But, you will never know which 5% of times you are wrong.
What is going on around a situation; the factors involved in a play, such as the down, yards to go, and field position.
Including one or more extra variables in a regression or regression-like model. For example, pass completion might be controlled for yards to go. See also corrected for and normalized.
A synonym for controlled for.
Data about data. Also a synonym for metadata.
The flow of data from one location to another, with the data undergoing changes such as formatting along the way. See also pipe.
The process of getting data into the format you need to solve your problems. Synonyms include data cleaning, data formatting, data tidying, data transformation, data manipulation, data munging, and data mutating.
The “extra” number of data points left over from fitting a model.
A statistical approach for reducing the number of features by creating new, independent features. Principal component analysis (PCA) is an example of one type of dimensionality reduction.
The number of variables needed to describe data. Graphically, this is the number of axes needed to describe the data. Tabularly, this is the number of columns needed to describe the data. Algebraically, this is the number of independent variables needed to describe the data.
The number of yards remaining to either obtain a new first down or score a touchdown.
A finite number of plays to advance the football a certain distance (measured in yards) and either score or obtain a new set of plays before a team loses possession of the ball.
The approximate value generated by a player picked for his drafting team. This is a metric developed by Pro Football Reference.
The resources a team uses during the NFL Draft, including the number of picks, pick rounds, and pick numbers.
An advantage over the betting markets for predicting outcomes, usually expressed as a percentage.
The estimated, or expected, value for the number of points one would expect a team to score given the current game situation on that drive.
The difference between a team’s expected points from one play to the next, measuring the success of the play.
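The subtraction described above can be sketched in a few lines. The function name and the expected points values are assumptions made for illustration; they do not come from a real expected points model.

```python
def expected_points_added(ep_before, ep_after):
    """EPA for one play: expected points after the play minus before it."""
    return ep_after - ep_before


# Hypothetical values, not from a real expected points model.
epa = expected_points_added(ep_before=1.5, ep_after=2.75)
print(epa)  # 1.25
```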
A subset of statistical analysis that analyzes data by describing or summarizing its main characteristics. Usually, this involves both graphical summaries such as plots and numerical summaries such as means and standard deviations.
A predictor variable in a model. This term is used more commonly by data scientists, whereas statisticians tend to use predictor or independent variable.
for loop
A computer programming tool that repeats (or loops) over a block of code for a predefined number of iterations.
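A minimal Python sketch of a for loop follows. The rushing gains are made-up numbers used only to give the loop something to iterate over.

```python
# Hypothetical rushing gains (in yards) on four carries.
rush_yards = [3, 7, -1, 12]

total = 0
for yards in rush_yards:  # one iteration per carry
    total += yards

print(total)  # 21
```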
An extension of linear models (such as simple linear regression and multiple regression) to include a link function and non-normal response variable such as logistic regression with binary data or Poisson regression with count data.
A synonym for American football.
A concept from SQL-type languages that describes taking data and creating subgroups (or grouping) based on (or by) a variable. For example, you might take Aaron Rodgers’s passing yards and group by season to calculate his average passing yards per season.
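The split-then-summarize idea can be sketched in plain Python. The seasons and yardage values below are invented and do not describe any real player; in pandas the same idea is usually expressed with `DataFrame.groupby`.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (season, passing yards in one game) records; values are made up.
games = [(2021, 250), (2021, 310), (2022, 200), (2022, 280), (2022, 240)]

# Step 1: split the records into groups keyed by season.
yards_by_season = defaultdict(list)
for season, yards in games:
    yards_by_season[season].append(yards)

# Step 2: summarize each group with its mean.
avg_yards = {season: mean(yards) for season, yards in yards_by_season.items()}
print(avg_yards)
```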
The total amount of money placed by bettors across all markets.
Important plays that determine the outcome of games. For example, converting the ball on third down, or fourth and goal. These plays, while of great importance, are generally not predictive game to game or season to season.
A type of plot that summarizes counts of data into discrete bins.
Two uses in this book. A football player colliding with another is a hit. Additionally, a computer script hits a web page when it tries to download from that page.
In a regression-type model, sometimes two predictors or features change together. A relation (or interaction) between these two terms allows this to be included within the model.
The value where a simple linear regression line crosses the y-axis; that is, the predicted response when the predictor equals 0. Also, sometimes used to refer to multiple regression coefficients with a discrete predictor variable.
For sports bettors, the price they would offer the game if they were a sportsbook. The discrepancy between this value and the actual price of the game determines the edge.
The difference between the first and third quartile. See also quartile.
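A small sketch of the calculation, using Python's standard `statistics.quantiles` function. The data are invented, and note that different quantile methods can give slightly different cut points for the same data.

```python
from statistics import quantiles

# Hypothetical data, invented for illustration.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# quantiles(..., n=4) returns the three cut points that split the data
# into four quartiles (default method: "exclusive").
q1, median, q3 = quantiles(data, n=4)
iqr = q3 - q1
print(iqr)  # 6.0
```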
A delay or offset. For example, comparing the number of passes per quarterback per game in one season (such as 2022) to the previous season (such as 2021) would have a lag of 1 season.
The function that maps between the observed scale and model scale in a generalized linear model. For example, a logit function can link between the observed probability of an outcome occurring on the bounded 0–1 probability scale to the log-odds scale that ranges from negative to positive infinity.
Odds on the log scale.
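To make the mapping between the probability scale and the log-odds scale concrete, here is a short sketch of the logit function and its inverse. The function names are chosen for illustration.

```python
import math


def logit(p):
    """Map a probability in (0, 1) to the log-odds scale."""
    return math.log(p / (1 - p))


def inv_logit(x):
    """Map a log-odds value back to a probability."""
    return 1 / (1 + math.exp(-x))


print(logit(0.5))  # even odds sit at 0.0 on the log-odds scale
print(inv_logit(logit(0.8)))  # round trip returns approximately 0.8
```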
A pass typically longer than 20 yards, although the actual threshold may vary.
The data describing the data. For example, metadata might indicate whether a time column displays minutes, seconds, or decimal minutes. This can often be thought of as a synonym for data dictionary.
A model with both fixed effects and random effects. See also random-effect model. Synonyms include hierarchical model, multilevel model, and repeated-measure or repeated-observation model.
In American football, a bet on a team winning straight up.
A type of regression with more than one predictor variable. Simple linear regression is a special type of multiple regression.
This term has multiple definitions. In the book, we use it to refer to accounting for other variables in regression analysis. See also corrected for and controlled for. Normalization may also be used to describe a transformation of data. Specifically, data are rescaled (or normalized) to have a mean of 0 and standard deviation of 1, which places them on the same scale as a standard normal distribution.
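The second sense (rescaling to a mean of 0 and a standard deviation of 1) can be sketched as follows. The values are invented, and the population standard deviation (`pstdev`) is an assumption made here; some workflows use the sample standard deviation instead.

```python
from statistics import mean, pstdev

# Hypothetical values, invented for illustration.
values = [10, 20, 30, 40, 50]

mu = mean(values)
sigma = pstdev(values)  # population standard deviation (an assumption here)

# Subtract the mean and divide by the standard deviation.
z_scores = [(v - mu) / sigma for v in values]
print([round(z, 2) for z in z_scores])
```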
A synonym for American football.
In betting and logistic regression, the number of times an event occurs in relation to the number of times the event does not occur. For example, if Kansas City has 4-to-1 odds of winning this week, we would expect them to win four games for every one game they lose under a similar situation. Odds can emerge either empirically, through events occurring and models estimating the odds, or through betting, where odds emerge through the “wisdom” of the crowds.
Odds in ratio format. For example, 3-to-2 odds can be written as 3:2 or 1.5 odds-ratios.
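The 3-to-2 example can be worked through in code. This sketch assumes odds are stated as occurrences to non-occurrences; the function names are illustrative.

```python
def odds_to_decimal(for_count, against_count):
    """Express 'for-to-against' odds as a single number."""
    return for_count / against_count


def odds_to_probability(for_count, against_count):
    """Convert 'for-to-against' odds to the implied probability of the event."""
    return for_count / (for_count + against_count)


print(odds_to_decimal(3, 2))      # 1.5
print(odds_to_probability(3, 2))  # 0.6
```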
Describes software whose code must be freely accessible (anybody can look at the code) and freely available (does not cost money).
The process by which an oddsmaker or a bettor sets the line.
Data points that are far away from the other data points.
Describes a model for which too many parameters have been estimated compared to the amount of data, or a model that fits one situation too well and does not apply to other situations.
The probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis of no difference is true. These values are increasingly falling out of favor with professional statisticians because of their common misuse.
A value from –1 to 1. A value of 1 means two groups are perfectly positively correlated, and as one increases, the other increases. A value of –1 means two groups are perfectly negatively correlated, and as one increases, the other decreases. A value of 0 means no correlation, and the values for one group do not have any relation to the values from another group.
To pass the outputs from one function directly to another function. See also data pipeline.
The recorded results for each play of a football game. Often this data is “row poor,” in that there are far more features (columns) than plays (rows).
A statistical tool for creating fewer independent features from a set of features.
A number between 0 and 1 that describes the chance of an event occurring. Multiple definitions of probability exist, including the frequentist definition, which is the long-term average under similar conditions (such as flipping a coin), and Bayesian, which is the belief in an event occurring (such as betting markets).
A mathematical function that assigns a value (specifically, a probability) between 0 and 1 to an event occurring.
A type of bet focusing on a specific outcome occurring, such as who will score the first touchdown. Also called prop for short.
A game for which the outcome lands on the spread and the bettor is refunded their money.
People who use the Python language.
A quarter of the data based on numerical ranking. By definition, data can be divided into four quartiles.
A model with coefficients that are assumed to come from a shared distribution.
A type of statistical model that describes the relationship between one response variable and either one predictor variable (a simple linear regression) or more than one predictor variable (a multiple regression). Also a special type of linear model.
With regression, observations are expected to regress to the mean (average) value through time. For example, a player who has a good year this year would be reasonably expected to have a year closer to average next year, especially if the source of their good year is a relatively unstable, or noisy, statistic.
A method for understanding the results from a Poisson regression, similar to the outputs from a logistic regression with odds-ratios.
The difference between a model’s predicted value for an observation and the actual value for an observation.
The number of running yards a player obtains compared to the value expected (or average) from a model given the play’s situation.
Quantitative analysis of baseball, named after the Society for American Baseball Research (SABR).
A type of plot that plots points on both axes.
To use computer programs to download data from websites (as in web scraping).
The process of the oddsmaker(s) creating the odds.
A statistical phenomenon whereby relationships between variables change based on different groupings using other variables.
The change in a trend through time and often used to describe regression coefficients with continuous predictor variables.
A betting market in American football that is the most popular and easy to understand. The spread is the point value meant to split outcomes in half over a large sample of games. This doesn’t necessarily mean the sportsbook wants half of the bets on either side of the spread though.
Within the context of this book, stability of an evaluation metric is the metric’s ability to predict itself over a predetermined time frame. Also, see stability analysis and sticky stats.
The measurement of how well a metric or model output holds up through time. For example, with football, we would care about the stability of making predictions across seasons.
A measure of the spread, or dispersion, in a distribution.
A measure of the uncertainty around an estimate such as a mean, given the variability in the data and the sample size.
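One common case is the standard error of a mean, which is the sample standard deviation divided by the square root of the sample size. The sketch below uses invented data and also builds an approximate 95% interval with the normal critical value 1.96, which is an assumption; a t critical value would give a wider interval for a sample this small.

```python
import math
from statistics import mean, stdev

# Hypothetical sample, invented for illustration.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(sample)
sample_mean = mean(sample)
standard_error = stdev(sample) / math.sqrt(n)  # sample SD divided by sqrt(n)

# Approximate 95% interval around the mean (normal approximation).
lower = sample_mean - 1.96 * standard_error
upper = sample_mean + 1.96 * standard_error
print(sample_mean, round(standard_error, 3))
```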
A term commonly used in fantasy football for numbers that are stable through time.
A pass typically less than 20 yards, although the actual threshold may vary (e.g., the first-down marker).
A running back who tends to play when only a few (or “short”) number of yards are required to obtain a first down or a touchdown.
A type of statistical and machine learning algorithm where people know the groups ahead of time and the algorithm may be trained on data.
Baseball’s first, and arguably most important, outcomes that can be modeled: walks, strikeouts, and home runs. These outcomes also do not depend on the defense, other than rare exceptions.
A simple bet on whether the sum of the two teams’ points goes over or under a specified amount.
The number of points expected by the betting market for a game.
A type of statistical and machine learning algorithm where people do not know the groups ahead of time.
People who use the R language.
Depending on context, two definitions are used in the book. First, observations can be variable. For example, pass yards might be highly variable for a quarterback, meaning the quarterback lacks consistency. Second, a model can include variables. For example, air yards might be a predictor variable for the response variable completion in a regression model.
See vigorish.
The house (casino, bookie, or other similar institution that takes bets) advantage that ensures the house almost always makes money over the long term.
A model to predict the probability that a team wins the game at a given point during the game.
A framework for estimating the number of wins a player is worth during the course of a season, set of seasons, or a career. First created in baseball.
Also known as yards per passing attempt, YPA is the average number of yards a quarterback gains per pass attempt during a defined time period, such as a game or season.
The average number of yards a player gains per rushing attempt during a defined time period, such as a game or season.
The number of yards necessary to either obtain a first down or score during a play.