So I’m one month into my year-long sabbatical and I’m still cranking away at writing a book proposal for my guide to working with US Census data. I’m positioning the census as one of the original open datasets given its history as a transparent, public domain dataset that plays many key roles in our society, and I’m considering how it fits within the larger data universe.
I recently stumbled across this story, about Facebook telling advertisers and investors that Facebook could reach 41 million adults between the ages of 18 and 24 in the US this year with ads. They also claimed they could reach 60 million people between the ages of 25 and 34. Sounds impressive, right?
Well, an analyst at an equity research firm did some homework, and discovered a problem: according to the Census Bureau, there are currently (as of 2016) only 31 million adults between the ages of 18 and 24 in the United States. Where are Facebook’s extra 10 million people coming from? It gets worse: the Census says there are only 45 million people aged 25 to 34 in the US, 15 million fewer than Facebook’s “reach”. This isn’t just a local phenomena; Facebook says it can reach more people than actually exist in the UK, Australia, Ireland, and France.
When faced with these large discrepancies, Facebook responded with a canned statement that their tools for producing these estimates “…are designed to estimate how many people in a given area are eligible to see an ad a business might run. They are not designed to match population or census estimates. We are always working to improve our estimates.”
Ok, but that’s not how the information was originally presented to the advertising and investment firms, who were nonplussed to say the least. If you say you can reach 41 million adults between ages 18 and 24 in the US, the natural assumption is we’re talking about residents. You would also assume that this estimate, at best, could only represent a certain percentage of that population. You wouldn’t expect the ads could reach all of them, and certainly not more than what actually exists.
Here are a few take-aways from this story. First, while census data isn’t as timely or seemingly “hip” as the new social media data, it is produced by professional statisticians and geographers dedicated to high quality, transparent, demographic data. The census serves as a baseline that other population data can be measured against, and as a foundation on which other datasets can be built. If you are generating population estimates for an area, comparing your estimates to the census should be a matter of common sense. It was to the investment researcher, but apparently not to Facebook.
Second, all data sets have shortcomings and suffer from some degree of bias or error; the decennial census has historically had problems with under-counting, and the margins of error for American Community Survey data for small geographies and populations can be unacceptably large. But at least the process is transparent, so we can understand the limitations and account for them. Furthermore, the census officials know the shortcomings and they develop methodologies to amend and correct data (to the extent they can) before it’s published.
In contrast, the data produced by Facebook, Google, and the like is anything but transparent. Not only can’t we see how it’s collected, most of us don’t have access to the data itself unless the companies choose to feed us some crumbs or allow us to jump through hoops (ironic as their data is really our data in aggregate). Naturally it will suffer from limitations – people on Facebook can mis-represent their age, ad targeting is based on location data that includes residents and visitors, there can be (gasp) fake accounts! But there’s no way for end users to see, correct, or account for these shortcomings. There is little or no documentation. Users just have to hope that the company has studied and scrubbed their data, or that the company understands the data’s limitations and presents it to others appropriately.
Here are the latest population estimates for the US broken down by age and gender. There are also some neat population clocks, and a pyramid – click the pic to visit.