Understanding Linear Weights
It’s so nice to have a night where I actually have some free time to write. I’ll be able to start writing with regularity in about a week and a half or so, right around the time I make my yearly trip to Arizona for a few days of good ol’ Spring Training. I don’t really have anything in particular to write about, and I’m not really in the mood to do player analysis, so I thought I’d write a little bit about a very important principle that’s found its way into essentially every aspect of sabermetrics: linear weights. And the more I think about it, if you understand linear weights (hereafter referred to as “LW” or “LWTS”), you’ll understand a lot about sabermetrics.
A brief history on LW, and how they are applied
The history of LW begins well before a night custodian by the name of Bill James, when a man named Ferdinand Cole Lane built a weighted system for measuring the impact of hitting events. This was later picked up by George Lindsey, who recorded detailed play-by-play data on over 1,000 Major League games and produced what is referred to as a run expectancy matrix, which tells us how many runs a team can expect to score from a particular base-out state through the end of the inning. In 2010, for example, a team was expected to score .49 runs from a man on first and no outs until the end of the inning, and 2.4 runs with the bases loaded and no outs. This merely quantifies what we know: teams are likely to score more runs in situations like 123_0 (bases loaded, no out) than they are in situations like 001_2 (man on first, two outs). Lindsey then took the average change in run expectancy produced by each type of event to find the average value of that event. And really, that’s all there is to it. Linear weights are merely the empirical average impact an event has on the run-scoring process. Pete Palmer expanded upon Lindsey’s work in the 1984 classic The Hidden Game of Baseball, which introduced the Linear Weights System. What separated Palmer from the rest of the pack is that he included negative events in the equation, so that players were held accountable for the outs they made at the plate, not just the positive outcomes. Palmer’s original equation (sans outs on bases):
LWTS = .46*1B + .80*2B + 1.02*3B + 1.40*HR + .33*(BB + HBP) + .30*SB – .60*CS – .25*(AB – H)
Singles are worth about .46 runs, doubles about .80, a home run adds about 1.40 runs on average each time, and walks and hit batsmen create about .33 runs each time, slightly less than a single. Later analysis revealed that Palmer’s original SB/CS values were too high; if I remember right, he increased the figures arbitrarily in an attempt to account for basestealing in high leverage (or pressure) situations. The reason linear weights work, compared to the traditional statistics, is explained beautifully by Palmer:
“What Linear Weights does is to take every offensive event and treat it in terms of its impact upon the team- an average team, so that a man does not benefit in his individual record for having the good fortune to bat cleanup with the Brewers or suffer for batting cleanup with the Mets. The relationship of individual performance to team play is stated poorly or not at all in conventional baseball statistics. In Linear Weights it is crystal clear: the linear progression, the sum, of the various offensive events, when weighted by their accurately predicted run values, will total the runs contributed by that batter or that team beyond the league average.” (67)
Players have absolutely no control over where they hit in the lineup. Think of it this way: Bengie Molina was the cleanup hitter for the Giants for a number of years. Had he been on another team, would he have hit in the same spot in the order? And would he have collected as many RBI? Remember, RBI opportunities are highly dependent on one’s slot in the lineup. The same goes for runs scored. Yes, good baserunners will score more runs than bad ones, but the player’s teammates are the ones that have to put the ball in play first. So it is foolish to rate players based on team-dependent numbers. Batting average doesn’t capture player value either. Yes, it tells us the rate of hits by the player, but what about the impact of the hits and the walks? OPS sure is nice, but it doesn’t tell us the number of runs the player helps generate. Linear weights provides us with a player’s runs above or below the league average based on the ratio of his positive run output to his outs created. If a player is +0 LWTS, this simply means he hit at exactly the league average rate. If a player is -10 LWTS, he provided 10 fewer runs than a league average player in the same number of opportunities, and if he’s +10, he’s provided 10 more runs than a league average hitter. That’s all there is to it.
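To make the arithmetic concrete, here is a minimal Python sketch of Palmer’s equation applied to a single batting line. The function name and the stat line are made up for illustration; only the coefficients come from the equation above.

```python
def palmer_lwts(singles, doubles, triples, hr, bb, hbp, sb, cs, ab):
    """Runs above (+) or below (-) league average, per Palmer's equation."""
    hits = singles + doubles + triples + hr
    return (0.46 * singles + 0.80 * doubles + 1.02 * triples + 1.40 * hr
            + 0.33 * (bb + hbp) + 0.30 * sb - 0.60 * cs
            - 0.25 * (ab - hits))  # every batting out costs a quarter of a run


# Hypothetical season line, not a real player.
season = dict(singles=100, doubles=30, triples=5, hr=20,
              bb=60, hbp=5, sb=10, cs=5, ab=550)
print(round(palmer_lwts(**season), 1))  # 25.8
```

In other words, this invented hitter would rate as roughly 26 runs better than a league average hitter given the same opportunities.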
How I generate LW
This is where things get technical, so you may just want to skip ahead. There are various ways to generate LW values. There is the empirical method, as outlined above (and described in more detail here). This is the most “correct” method. But since not everyone is a programming genius (guilty), there are other methods. One is to use a Markov model to simulate the impact of each event. This takes a heck of a lot of calculations, so it might not be for you, but there is one very basic Markov calculator on the internet that will spit out marginal values for you. The simpler method, and the one that I use, is the “plus-one” method outlined by Brandon Heipp, which squeezes the marginal value of each event out of a dynamic run estimator: add one of a given event to the league totals and see how much the run estimate changes. In this case, the estimator is Base Runs (BsR). Why BsR?
Because it’s a very simple run estimator, is extremely flexible for the run environment, and works with a true model of run scoring. The original dynamic run estimator, Bill James’ Runs Created, works as follows:
Runs = (A*B)/C
Where “A” is times on base, “B” is the advancement factor, and “C” is opportunities (plate appearances). The problem with RC is pretty simple: it doesn’t treat home runs correctly. The simple equation shown above works, yes, but it doesn’t model baseball as well as it could. RC seems to forget that a home run creates a run every single time, and leaving that out ignores a major aspect of the game. And this is one of the reasons why BsR works so well. It is constructed as:
Runs = A * (B/(B+C)) + D
Where “A” is baserunners (times on base, not counting home runs), “B” is the advancement factor as in RC, “C” is outs made, and “D” is home runs. In short, it is essentially:
Runs = Times on Base * Score Rate + Home Runs
And it just so happens that it spits out marginal run values that match up very closely with the empirical run values. Anyways, let’s say we use the simplest BsR formula out there to extract run values for the 2010 MLB season:
A = H – HR + BB + HBP
B = .88*1B + 2.42*2B + 3.96*3B + 2.2*HR + .11*(BB + HBP) + .99*SB – .99*CS
C = AB – H + CS
D = HR
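As a rough illustration, the four terms above drop straight into a few lines of Python. The function name and the team totals are hypothetical (they are not 2010 MLB figures), and this is the raw estimate before any adjustment to the “B” coefficients.

```python
def base_runs(singles, doubles, triples, hr, bb, hbp, sb, cs, ab):
    """Base Runs estimate: A * (B / (B + C)) + D, using the simple terms above."""
    hits = singles + doubles + triples + hr
    a = hits - hr + bb + hbp                      # baserunners, home runs excluded
    b = (0.88 * singles + 2.42 * doubles + 3.96 * triples + 2.2 * hr
         + 0.11 * (bb + hbp) + 0.99 * sb - 0.99 * cs)  # advancement factor
    c = ab - hits + cs                            # outs
    d = hr                                        # home runs always score
    return a * b / (b + c) + d


# Hypothetical team-season totals, for illustration only.
team = dict(singles=1000, doubles=300, triples=30, hr=150,
            bb=500, hbp=50, sb=100, cs=40, ab=5500)
print(round(base_runs(**team), 1))
```

The b / (b + c) piece is the score rate: the share of baserunners who come around to score.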
First, we can reconcile the coefficients in the “B” term so that the estimate matches actual league runs scored. To find our required “B,” we simply use (R – D)*C/(A – R + D), where R is actual league runs scored. Divide this by the estimated “B” and we get 0.88, which we multiply all of our “B” coefficients by. This is for accuracy purposes only. Once we have our new coefficients, we extract each run value through this (pretty darn intense) formula:
LW = ((B+C)*(A*b + B*a) – (A*B)*(b+c))/((B+C)^2) + d
Where the capitalized terms are the sum of the factor (“B,” for example, would be the frequency of the event times the coefficient in the B term), and the lower case terms are the coefficient of the factor (i.e. .88 for singles, 2.42 for doubles, etc.). Doing so yields us the following equation:
LWRC = .47*1B + .75*2B + 1.04*3B + 1.40*HR + .33*(BB + HBP) + .18*SB – .28*CS – .09* (AB – H)
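The reconciliation step and the extraction formula described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the event counts and the runs total are invented (not 2010 MLB data), all names are mine, and BB here stands for BB + HBP. The sketch also compares the formula against literally adding one event to the totals.

```python
# Coefficient of each event in the A, B, C and D terms of the simple BsR formula.
COEF = {
    "1B": (1, 0.88, 0, 0), "2B": (1, 2.42, 0, 0), "3B": (1, 3.96, 0, 0),
    "HR": (0, 2.20, 0, 1), "BB": (1, 0.11, 0, 0),  # BB means BB + HBP
    "SB": (0, 0.99, 0, 0), "CS": (0, -0.99, 1, 0),
    "OUT": (0, 0.0, 1, 0),                         # OUT means AB - H
}


def totals(counts, mult=1.0):
    """Sum the A, B, C, D terms; mult scales every B coefficient."""
    a, b, c, d = (sum(counts[e] * COEF[e][i] for e in COEF) for i in range(4))
    return a, mult * b, c, d


def bsr(counts, mult=1.0):
    a, b, c, d = totals(counts, mult)
    return a * b / (b + c) + d


def b_multiplier(counts, runs):
    """Required B, (R - D)*C/(A - R + D), divided by the estimated B."""
    a, b, c, d = totals(counts)
    return (runs - d) * c / (a - runs + d) / b


def linear_weights(counts, mult):
    """Marginal run value of each event from the extraction formula."""
    A, B, C, _ = totals(counts, mult)
    weights = {}
    for event, (a, b, c, d) in COEF.items():
        b *= mult
        weights[event] = (((B + C) * (A * b + B * a) - A * B * (b + c))
                          / (B + C) ** 2 + d)
    return weights


def plus_one(counts, mult, event):
    """Literal plus-one check: add a single event and re-run BsR."""
    bumped = dict(counts)
    bumped[event] += 1
    return bsr(bumped, mult) - bsr(counts, mult)


# Hypothetical totals, for illustration only.
counts = {"1B": 1000, "2B": 300, "3B": 30, "HR": 150,
          "BB": 550, "SB": 100, "CS": 40, "OUT": 4020}
mult = b_multiplier(counts, runs=720)
lw = linear_weights(counts, mult)
for event, value in lw.items():
    print(event, round(value, 3), round(plus_one(counts, mult, event), 3))
```

The two columns should agree closely, since the formula is the derivative of BsR with respect to one more of the event.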
You’ll notice that the title and the out term look a bit different. The title stands for “Linear Weights Runs Created,” and the out term is -.09 because it is expressed in absolute terms. In order to make it relative to average, we find the overall runs per out (runs scored divided by C) and subtract this figure from the coefficient of each event in C. For 2010, runs per out (excluding pitcher hitting) is .178. That gives us this:
LW = .47*1B + .75*2B + 1.04*3B + 1.40*HR + .33*(BB + HBP) + .18*SB – .45*CS – .27* (AB – H)
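The shift from the absolute LWRC equation to the average-relative LW equation amounts to lowering each out-event coefficient by runs per out. A small Python sketch, with names of my own choosing and the rounded coefficients printed above as inputs:

```python
def to_relative(absolute, runs_per_out, out_events=("CS", "OUT")):
    """Re-center absolute run values so a league average hitter nets zero."""
    relative = dict(absolute)
    for event in out_events:
        # Each out also uses up an opportunity worth runs_per_out on average.
        relative[event] = absolute[event] - runs_per_out
    return relative


# Rounded LWRC coefficients from the equation above; BB means BB + HBP.
lwrc = {"1B": 0.47, "2B": 0.75, "3B": 1.04, "HR": 1.40,
        "BB": 0.33, "SB": 0.18, "CS": -0.28, "OUT": -0.09}
lw = to_relative(lwrc, runs_per_out=0.178)
print({event: round(value, 3) for event, value in lw.items()})
```

With these rounded inputs the out and CS terms land at about -.27 and -.46. The -.45 shown for CS presumably comes from working with unrounded coefficients, though that is my assumption.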
And that’s all there is to it. I know it may seem like a lot, but it really isn’t, especially if you have a spreadsheet set up for it. Heipp has one in the aforementioned link, and the wOBA calculator that I published a while back does all of this for you. More terms can be added to spice things up a little bit. For example, the LW formula from Tango’s coefficients gives us this slightly more complicated equation:
Tango LW = .48*1B + .77*2B + 1.06*3B + 1.41*HR + .49*ROE + .31*NIBB + .34*HBP – .28*(AB – H – ROE – K + SF) – .29*K
And another equation developed from Retrosheet data that spans from 1911 until 2009 gives us the following formula:
Retro LW = .47*1B + .77*2B + 1.05*3B + 1.40*HR + .50*ROE + .31*NIBB + .34*HBP – .27*(AB – H – ROE – K + SF) – .29*K
All slightly different coefficients that give us slightly different results. It’s not a big deal, but I wanted to show how different datasets and BsR formulae can influence the run values provided. When all is said and done, though, you’re not going to see a big difference between them.
Applications beyond hitting
LW values have expanded beyond the realm of just offense: they are applied to defense and to pitching metrics, too. With defense, the run value of a play made above average is the difference between a batted ball and an out, or about .75 runs. For the outfield, it’s about .85 (more doubles and triples, obviously). With pitching, FIP takes the basic run values, expresses them relative to the value of a ball in play, and multiplies by 9 to attain its coefficients. Uber-stat tERA takes the linear weight value of each batted ball to estimate the pitcher’s defense-neutral runs allowed. The same approach applies to baserunning runs. So LW doesn’t apply just to offense; it has spread to other aspects of the game as well.
All in all, LW are the best way to measure a player’s offense because they are simple and theoretically sound. The process to get them is a bit complicated, sure, but it will always provide you with an outstanding overall view of the value a player provides with the bat. And it’s a construct that allows you to look at all other aspects of the game, as well.