"Big data" is a fashionable term at present, and it is an all-round method used by the technical community to solve the most intractable problems in the world. This term is generally used to describe the skills and science of analyzing massive information to discover laws, collect valuable insights and predict answers to complex questions. This may sound a bit boring, but from stopping terrorists, to eliminating poverty, to saving the planet, nothing can't be solved by advocates of big data.
In their book Big Data: A Revolution That Will Transform How We Live, Work, and Think, Viktor Mayer-Schönberger and Kenneth Cukier cheer: "The benefits to society will be endless, because big data will solve pressing global problems to some extent, such as tackling climate change, eradicating diseases and promoting good governance and economic development."
As long as there is enough data to process, whether it comes from your iPhone, shopping records, online dating profiles, or the anonymized health records of an entire country, the argument goes that computing power can decode the raw data into countless valuable insights. Even the Obama administration has joined the trend, releasing on May 9 a large amount of "data that was difficult to obtain or manage before" to entrepreneurs, researchers, and the public.
But is big data really all it is made out to be? Can we trust that so many ones and zeros will reveal the hidden world of human behavior? What follows is the author's assessment of five common claims about big data.
1. "With enough data, numbers can tell everything."
No. Advocates of big data would have us believe that behind the lines of code and vast databases lie objective and universal insights into patterns of human behavior, be it consumer spending, criminal or terrorist acts, health habits, or employee productivity. But many of them are unwilling to face up to its shortcomings.
Numbers cannot speak for themselves, and data sets, whatever their size, are still products of human design. Big data tools such as the Apache Hadoop software framework do not free us from misinterpretation, gaps, and faulty assumptions.
These factors matter most when big data tries to reflect the social world we live in, yet we often foolishly assume that its results are always more objective than human opinions. Biases and blind spots exist in big data just as they do in individual perceptions and experiences. Even so, a questionable belief persists that bigger data is always better data and that correlation is as good as causation.
For example, social media is a common source for big data analysis, and there is undoubtedly a great deal of information to be mined there. We are told that Twitter data show people are happier the farther they are from home and most depressed on Thursday nights. But there are many reasons to question what these data mean. First, we know from the Pew Research Center that only 16% of adults in the United States use Twitter, so its users are by no means a representative sample: they skew younger and more urban than the population as a whole.
In addition, we know that many Twitter accounts are automated programs known as "bots," fake accounts, or "cyborgs" (human-controlled accounts assisted by bots). Recent estimates suggest there may be as many as 20 million fake accounts. So even before we step into the methodological minefield of how to assess the sentiment of Twitter users, we have to ask whether those emotions come from real people or from automated systems.
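The sampling concern above can be made concrete with a small sketch. Every number here is invented for illustration and none comes from Pew or Twitter; the group names and the `weighted_mean` helper are likewise hypothetical. The sketch shows how an average taken over a sample that over-represents one group differs from the same group averages reweighted to the population's makeup.

```python
# Hypothetical illustration: all figures are invented, not survey data.

def weighted_mean(group_means, weights):
    """Combine per-group means using group weights that sum to 1."""
    return sum(group_means[g] * weights[g] for g in group_means)

# Invented average "happiness" score measured for each age group.
group_means = {"young": 0.70, "old": 0.40}

# Share of each group in the sample versus in the whole population.
sample_share = {"young": 0.80, "old": 0.20}
population_share = {"young": 0.30, "old": 0.70}

# Naive estimate: treats the skewed sample as if it were the population.
naive_estimate = weighted_mean(group_means, sample_share)

# Reweighted estimate: same group means, weighted by population shares.
reweighted_estimate = weighted_mean(group_means, population_share)

print(f"naive: {naive_estimate:.2f}, reweighted: {reweighted_estimate:.2f}")
```

Reweighting only helps if the people sampled within each group resemble the rest of that group, which is itself an assumption. It does nothing about accounts that are not people at all.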
2. "Big data will make our cities smarter and more efficient."
Up to a point. Big data can provide valuable insights that help improve our cities, but it can only take us so far. Because not all data are generated or collected equally, large data sets have a "signal problem": some people and communities are overlooked or underrepresented, leaving what are known as data dark zones or shadows. The usefulness of big data in urban planning therefore depends heavily on how well municipal officials understand both the data and its limits.
For example, Boston's Street Bump app is a clever way to collect information at low cost. It gathers data from the smartphones of drivers who pass over potholes, and more apps like it are emerging. But if cities start to rely only on information from smartphone users, those citizens are a self-selected sample. The result will inevitably be less data from communities with fewer smartphone owners, which typically include older and less affluent citizens.
Although Boston's Mayor's Office of New Urban Mechanics has made many efforts to compensate for these potential data gaps, less conscientious public officials may miss such remedies and end up with lopsided data that further entrenches existing social inequities. One need only look back at Google Flu Trends, which overestimated annual flu rates in 2012, to see what relying on flawed big data could do to public services and public policy.
The same goes for "open government" initiatives that post government data online, such as Data.gov and the White House Open Government Initiative. More data will not necessarily improve any function of government, including transparency and accountability, unless there are mechanisms that keep the public engaged with public institutions, not to mention strengthen the government's capacity to interpret the data and respond with adequate resources. None of this is easy. In fact, highly skilled data scientists are still in short supply, and universities are now scrambling to define the field and develop courses to meet market demand.
3. "Big data doesn't favor one social group over another."
Hardly. Another hope pinned on the supposed objectivity of big data is that discrimination against minorities will decline, because raw data is assumed to be free of social bias and analysis can be carried out at the aggregate level, avoiding group-based discrimination. Yet big data is often used for precisely the opposite purpose: because it can reveal how groups behave differently, it is used to sort individuals into groups. For example, a recent paper pointed out that scientists have allowed their own racial biases to shape big data research on the genome.
Big data may also be used for price discrimination, raising serious civil rights concerns. Historically, this practice was known as "redlining." Recently, a University of Cambridge big data study of 58,000 Facebook "likes" predicted highly sensitive personal information about users, such as sexual orientation, race, religious and political views, personality traits, intelligence, happiness, use of addictive substances, parents' marital status, age, and gender.
Reporter Tom Foremski commented on this research: "This easily accessible and highly sensitive information may be used by employers, landlords, government departments, educational institutions and private organizations to discriminate and punish individuals. And people have no means to fight."
Finally, consider the impact on law enforcement. From Washington to New Castle County, Delaware, police are turning to big data "predictive policing" models, hoping to generate leads in cold cases and even help prevent future crimes. But directing police to the specific "hot spots" that big data identifies risks reinforcing their suspicion of already stigmatized social groups and institutionalizing differential policing.
As one police chief pointed out in an article, although predictive policing systems do not take factors such as race and gender into account, using them may in practice "lead to the deterioration of the relationship between the police and the community, make the public feel the lack of judicial procedures, lead to allegations of racial discrimination, and threaten the legitimacy of the police."
4. "Big data is anonymous, so it won't invade our privacy."
Flat-out wrong. Although many big data providers do their best to strip individual identities from data sets about people, the risk of re-identification remains high. Mobile phone data may seem fairly anonymous, but a recent study of a data set covering 1.5 million mobile phone users in Europe showed that just four reference points were enough to uniquely identify 95% of them.
The researchers point out that people trace unique paths through cities, and that, given how much can be inferred from the many public data sets available, personal privacy has become an "increasingly serious problem."
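Why a handful of points can single someone out is easier to see in code. What follows is a minimal sketch, not the method of the study cited above: the three traces are invented, and `matching_users` and `unicity` are hypothetical helper names. Each trace is a set of (place, hour) observations with no name attached, and the question is how many people remain consistent with a few known observations.

```python
import random

def matching_users(traces, known_points):
    """Return the users whose trace contains every known (place, hour) point."""
    return sorted(u for u, trace in traces.items() if known_points <= trace)

def unicity(traces, k, seed=0):
    """Fraction of users singled out by k points drawn at random from their own trace."""
    rng = random.Random(seed)
    singled_out = 0
    for user, trace in traces.items():
        # Sort first so the draw is reproducible; k must not exceed the trace length.
        points = set(rng.sample(sorted(trace), k))
        if matching_users(traces, points) == [user]:
            singled_out += 1
    return singled_out / len(traces)

# Invented "anonymous" traces for three users.
traces = {
    "a": {("home1", 8), ("work1", 9), ("cafe", 13), ("gym", 18)},
    "b": {("home1", 8), ("work1", 9), ("cafe", 13), ("bar", 21)},
    "c": {("home2", 8), ("work1", 9), ("park", 12), ("gym", 18)},
}

one_point = matching_users(traces, {("home1", 8)})
two_points = matching_users(traces, {("home1", 8), ("gym", 18)})
print(one_point, two_points, unicity(traces, k=4))
```

In this toy data a single observation leaves two candidates, while a second observation narrows the field to one person. Real data sets are far larger, but the same narrowing effect is what makes stripped-down location data hard to keep anonymous.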
However, the privacy problems of big data go far beyond conventional re-identification risks. Medical data currently sold to analytics companies could be used to trace your identity. There is much talk of personalized medicine, in which drugs and other therapies would be tailored to the individual as precisely as if they had been made from the patient's own DNA.
As a way of improving the efficacy of drugs, this is a bright prospect, but it fundamentally depends on identifying individuals at the molecular and genetic level. If that information is misused or leaked, the risks are great. And although personal health data apps such as RunKeeper and Nike+ have grown rapidly, using big data to improve medical care remains more aspiration than reality.
Highly personalized large data sets will become prime targets for hackers and leakers. WikiLeaks has been at the center of several of the most serious big data leaks in recent years. And as the massive data leak from the British offshore financial industry shows, the personal information of the world's richest 1% is as exposed as everyone else's.
5. "Big data is the future of science."
Partly true, but it still has some growing up to do. Big data opens up new avenues for science. We need only look at the discovery of the Higgs boson, the product of the largest grid computing project in history, in which CERN used the Hadoop Distributed File System to manage all the data. But unless we recognize and begin to address some of big data's inherent shortcomings in reflecting human life, we risk making major public policy and business decisions based on false assumptions.
To address this problem, data scientists have begun working with social scientists. Over time, this will mean finding new ways to combine big data strategies with small data studies. This will go well beyond practices already used in advertising and marketing, such as focus groups or A/B testing (showing users two versions of a design or result to determine which performs better).
Rather, the new hybrid methods will ask people why they do things, instead of just counting how often something happens. That means drawing on sociological analysis and deep ethnographic insight in addition to information retrieval and machine learning.
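The A/B testing mentioned above is, at bottom, exactly this kind of frequency counting. As a rough sketch, with invented visitor and conversion counts and a hypothetical helper name, a two-proportion z-test compares how often users of each version did something:

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference between two conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both versions perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Invented counts: 120 of 1000 visitors converted on version A, 150 of 1000 on B.
z, p_value = two_proportion_z_test(120, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

The test can say that version B was chosen more often, within some margin of chance. It cannot say why, which is the gap that interviews and ethnographic work are meant to fill.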
Technology companies have long recognized that social scientists can help them understand how and why people engage with their products. Xerox's research center, for example, hired the pioneering anthropologist Lucy Suchman. The next stage will deepen collaboration among computer scientists, statisticians, and social scientists of many kinds, not only to test one another's findings but also to ask fundamentally different kinds of questions with greater rigor.
Given how much information about us is collected every day, including Facebook clicks, Global Positioning System (GPS) data, medical prescriptions, and Netflix queues, we must decide whom to entrust it to and for what purpose.
We cannot escape the fact that data is never neutral and that anonymity is hard to preserve. But we can draw on expertise from across different fields to better identify biases, flaws, and faulty assumptions.