Hidden biases in both the collection and analysis stages present considerable risks and are as important to the big-data equation as the numbers themselves.
Biases and blind spots exist in big data as much as they do in individual perceptions and experiences. Yet there is a problematic belief that bigger data is always better data and that correlation is as good as causation.
Cell-site data - like mailing addresses, phone numbers, and IP addresses - are information that facilitate personal communications rather than part of the content of those communications themselves. The government's collection of business records containing these data, therefore, is not a search.
People in both fields operate with beliefs and biases. To the extent you can eliminate both and replace them with data, you gain a clear advantage.
There's been the emergence of a philosophy that big data is all you need. We would suggest that, actually, numbers don't speak for themselves.
Vivametrica isn't the only company vying for control of the fitness data space. There is considerable power in becoming the default standard-setter for health metrics. Any company that becomes the go-to data analysis group for brands like Fitbit and Jawbone stands to make a lot of money.
I kept a notebook, a surreptitious journal in which I jotted down phrases, technical data, miscellaneous information, names, dates, places, telephone numbers, thoughts, and a collection of other data I thought was necessary or might prove helpful.
Big data is mostly about taking numbers and using those numbers to make predictions about the future. The bigger the data set you have, the more accurate the predictions about the future will be.
Big data has been used by human beings for a long time - just in bricks-and-mortar applications. Insurance and standardized tests are both examples of big data from before the Internet.
Most people, it seems like they've only got one part of the equation down. Caring for themselves, or caring for someone else. And I've learned how important it is to have both.
I don't think bulk data collection was an enormous factor here, because generally, that deals with overseas calls to the United States. But what bulk data collection did was make the process more efficient. So there were no silver bullets there.
People think 'big data' avoids the problem of discrimination because you are dealing with big data sets, but, in fact, big data is being used for more and more precise forms of discrimination - a form of data redlining.
We should always be suspicious when machine-learning systems are described as free from bias if it's been trained on human-generated data. Our biases are built into that training data.
The biggest challenge in big data today is asking the right questions of data. There are so many questions to ask that you don't have the time to ask them all, so it doesn't even make sense to think about where to start your analysis.
It's hard to look at anything with an objective eye. I think people bring themselves into the equation when they watch a movie. They bring their own prejudices, their own biases, their own feelings toward the subject matter, the characters.
If done correctly, dynamic scoring will provide a more complete picture of Congress's actions. This is exactly the type of modeling the private sector uses, and advances in data collection and analysis create an opportunity for it to be employed accurately.
Hard numbers tell an important story; user stats and sales numbers will always be key metrics. But every day, your users are sharing a huge amount of qualitative data, too - and a lot of companies either don't know how or forget to act on it.