Groundhog Day has come and gone, and that furry little prognosticator has said there will be six more weeks of winter. Since we are stuck inside, we are biding our time by being statistics nerds. That led us to take a look at Punxsutawney Phil’s analysis skills…and how the tradition hampers any real learning from his data set.

It is rather ridiculous to believe in a meteorologically gifted rodent, but it’s not too surprising since we are rather ridiculous prognosticators ourselves.  The author Daniel Kahneman points out this fact in the first 14 chapters of Thinking, Fast and Slow.

What’s Phil’s winning record?

Punxsutawney Phil, who we like to think is the REAL groundhog, has a track record that is somewhat disputed. What’s our data set? Over the 139 years there in western PA, he’s predicted six more weeks 99 times, and early spring 15 times. (Some records have been lost…those must be the ones that Bill Murray kept reliving.)

If we look at the calendar, then the little rodent was right 99 times, since winter doesn’t officially end until March 20, a good 6 weeks after his peek from the burrow. But data wonks have calculated that all of the Phils (sorry to burst your bubble, folks: it hasn’t been the same woodchuck) have a cumulative record of 39% correct. Is that bad? I don’t know. If I flipped a coin and got about 44 heads out of 114 tosses (roughly 39%), you would probably consider that pretty normal, even though a fair coin should average 57. (Pun intended.)
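Whether that gut feeling holds up can be checked with a little arithmetic. Here is a minimal sketch, assuming the 39% figure applies to the 114 recorded predictions (the records don't confirm that) and treating each prediction like an independent coin flip. The function name prob_at_most is ours, made up for illustration.

```python
from math import comb

def prob_at_most(k: int, n: int, p: float = 0.5) -> float:
    """Chance of k or fewer successes in n independent tries with success chance p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

n_predictions = 114                       # 99 "more winter" + 15 "early spring"
n_correct = round(0.39 * n_predictions)   # ASSUMPTION: 39% of the recorded calls
expected = n_predictions * 0.5            # what a fair coin averages

chance = prob_at_most(n_correct, n_predictions)
print(f"A coin flipper expects about {expected:.0f} correct; Phil got about {n_correct}.")
print(f"Chance a fair coin does that badly or worse: {chance:.4f}")
```

If that probability comes out small, Phil is doing worse than a coin, not just "about the same." Either way, a right/wrong tally tells us nothing about why.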

Why is Punxsutawney Phil a bad data analyst?

There is one fatal flaw in Phil’s data set that makes data analysis hard, or worse yet…meaningless. The data analysis great Dr. Deming long stressed that to learn from data, we need to avoid using in-out buckets and start looking at deviation from optimum. This is one reason that Deming changed his famous PDCA to PDSA, because PD“CHECK”A caused people to think good-bad, red-green, in-out…all binary type decisions. (You Six Sigma geeks will recognize this as doing more measuring and less counting.)

But how often do we groundhog ourselves? Sadly, too often. As Kahneman points out, our brains are lazy, so they like to 1) look for binary indicators (red/green; shadow/no shadow) and then 2) give binary answers (bad/good; 6 weeks/not 6). This is bad statistics and poor learning.

Do you want a team that learns from data? What’s a team leader to do?

Measure or Count What’s Meaningful

Groundhog shadows and winter have no correlation. We all know that. Measuring what’s meaningful is HARD! But here are some clues. Just because you can measure something is no reason to measure it. For example, there’s a tachometer in my car. It’s an automatic transmission, so I have no control over engine RPM. Why is that on my dashboard? My gas gauge, on the other hand, tells me distance to empty, a funky measure, but it’s meaningful and decision driving. (Yikes, another pun!)  I rent a lot of cars, and when they don’t have that readout, I realize how little I really know when the gauge says 1/4 full, since I have no idea what the tank size is in a rental car.  When your measures count something meaningful in a way that drives decisions, you can actually LEARN from what you measure!

Operational definitions

Another reason we get lazy and look for binary (on/off) indicators? Measuring deviation from optimum requires us to know a bit more about our system and our goals and to have what are called operational definitions. In other words, how do we know what we are measuring, and how do we ensure we measure it the same way every time?

So, poor Phil doesn’t really have operational definitions.  For example:

First, the measure itself: What’s his shadow? In 2017, Phil saw his shadow even though it was a snowy day! Is a barely seen shadow the same as a distinct one?

Then the ‘scared’ part. Is Phil always equally scared? If he’s just slightly annoyed by his shadow, then would there be only 4 more weeks of winter?

Second, the prediction: What’s winter? It’s funny that we really don’t know what “six more weeks of winter” really means.  The winter season won’t end for six weeks, so is that it? Is it a mean temperature or amount of snowfall?

These conundrums are no different from what each team experiences going down this road. Developing an operational definition is hard, but it’s time well spent. Otherwise, as different people collect the data and different people interpret it, you will have little basis for comparing results over time.

Deviation from optimum

In order for a team to learn about the system they are working on, they have to measure more and count less. Here’s a simple way to see the difference. Take inspecting machined parts: I could judge each one as in spec or out of spec and then count the out-of-spec ones each production period.  That’s bad, but why? Because the count tells me that something is off, not what or by how much. Preferably, I’d measure each one and plot the variation from optimum. That’ll promote understanding of the system that can drive decision making.  For instance, if all the parts are slightly larger than optimum, you’d know where to look for improvement opportunities, but if they vary wildly smaller and larger than optimum, you’ll have to look to a different process solution.
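To make the machined-parts example concrete, here is a small sketch with made-up numbers. The target, spec limits, and both production runs are hypothetical, chosen only to show that two runs with the same out-of-spec count can tell very different stories once you measure.

```python
from statistics import mean, stdev

TARGET_MM = 10.00                  # hypothetical optimum diameter
LOWER_MM, UPPER_MM = 9.95, 10.05   # hypothetical spec limits

# Made-up measurements from two production runs
run_a = [10.03, 10.04, 10.04, 10.06, 10.03, 10.04]  # everything runs large
run_b = [9.94, 10.04, 9.97, 10.03, 9.98, 10.04]     # scattered around the target

def count_out_of_spec(parts):
    """The counting view: how many parts fall outside the spec limits."""
    return sum(1 for d in parts if not LOWER_MM <= d <= UPPER_MM)

def deviation_summary(parts):
    """The measuring view: average offset from optimum and its spread."""
    deviations = [d - TARGET_MM for d in parts]
    return mean(deviations), stdev(deviations)

for name, run in (("A", run_a), ("B", run_b)):
    offset, spread = deviation_summary(run)
    print(f"Run {name}: {count_out_of_spec(run)} out of spec, "
          f"mean offset {offset:+.3f} mm, spread {spread:.3f} mm")
```

Both runs report one bad part, so the count alone can't tell them apart. The measurements suggest Run A is shifted (think setup or tool offset) while Run B is centered but noisy, which points to a different kind of fix.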

Another example: I once worked for a firm that had strict security measures, so they always counted security infractions per month. That was poor statistics and led to poor analyses. Then they started measuring the distance (in days) between security infractions. This slight difference in measurement enabled them to learn more about what was really happening.
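Here is a rough sketch of that switch from counting per month to measuring the days between events. The dates are invented for illustration; they are not the firm's data.

```python
from datetime import date

# Made-up infraction dates, for illustration only
infractions = [
    date(2023, 1, 4), date(2023, 1, 9), date(2023, 1, 30),
    date(2023, 2, 2), date(2023, 3, 20), date(2023, 3, 22),
]

def count_per_month(events):
    """The counting view: infractions bucketed by (year, month)."""
    counts = {}
    for d in events:
        key = (d.year, d.month)
        counts[key] = counts.get(key, 0) + 1
    return counts

def days_between(events):
    """The measuring view: gap in days between consecutive infractions."""
    ordered = sorted(events)
    return [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]

monthly = count_per_month(infractions)
gaps = days_between(infractions)
print("Per month:", monthly)
print("Days between infractions:", gaps)
```

The monthly buckets hide the fact that the late-January and early-February infractions happened only three days apart, and that a long quiet stretch followed. The gaps keep that information, and they can be plotted over time.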

Make a Picture!

I’m not sure what would be best for our furry Phil, but I’d think that with a process control chart, we’d at least see the correlation (or lack thereof) between his shadow phobia and winter.  Then we’d know whether to base our pea planting on the woodchuck’s prediction or find Phil a new job…maybe something in government burrow-cracy. (Ok I’ve gone too far…I really need to get outside even if it is frigid out there!)


The Makings of Statistics That Help Us Learn

To sum up our thoughts, in case we lost you in our attempt to be entertaining:

  • In order to be a team that learns from what is happening, measuring what matters helps.
  • In order for those measures to be helpful, you need to establish clear operational definitions.
  • You will learn more and make better statistical decisions when you measure, not count.
  • The picture of those measures is most helpful when it shows the data over time, with clear indications of the mean and the band of common cause noise (known as the process control limits), and an indication of where the desired (customer voice) benchmark is.
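For that last point, here is a minimal sketch of the numbers behind such a picture. It uses one common approach, the individuals (XmR) chart, where the limits sit at the mean plus or minus 2.66 times the average moving range. The weekly values and the benchmark are made up, and plotting is left out to keep the example self-contained.

```python
from statistics import mean

def xmr_limits(values):
    """Lower limit, center line, and upper limit for an individuals (XmR) chart."""
    if len(values) < 2:
        raise ValueError("need at least two points to compute a moving range")
    center = mean(values)
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    spread = 2.66 * mean(moving_ranges)   # 2.66 is the usual XmR constant (3 / 1.128)
    return center - spread, center, center + spread

# Made-up weekly measurements and a hypothetical customer benchmark
weekly_values = [12, 13, 12, 14, 13, 12, 13, 14, 12, 13, 25, 13, 12, 14]
benchmark = 12

lower, center, upper = xmr_limits(weekly_values)
signals = [(week, v) for week, v in enumerate(weekly_values) if not lower <= v <= upper]

print(f"Mean {center:.2f}, control limits {lower:.2f} to {upper:.2f}, benchmark {benchmark}")
print("Points outside the common cause band:", signals)
```

Points inside the band are routine noise and not worth a reaction one at a time; a point outside it is worth asking about. Comparing the mean to the benchmark shows whether the process as a whole sits where the customer wants it.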

If this has you thinking about revising your measurements so that you can learn and make decisions based on what your data says, let us know. We have helped many teams build indicators of team success, including helping them develop better operational definitions.  Many clients find we have a way of making that nerdy stuff easier to understand, which gives you an easier path toward putting the tools to good use.  Contact us at
