Four Theses on Big Data
April 9th, 2014 Rafael Alvarado
The following is a fragment from an internal document I am drafting on why humanists, including but not only digital humanists, should be closely involved with UVa’s new Data Science Institute. Much of this will be old hat to many digital humanists who have already embraced the rise of big data as an opportunity to be seized and a movement to be examined. However, I also think the term Big Data is often dismissed as a mere marketing term without any substance behind it. This I disagree with, and the following is meant as a plea for taking the idea seriously.
Big Data is a social fact. The expression “Big Data,” although now a marketing term with its origin in the hard sciences, indexes a genuine historical transformation in the social organization of knowledge. This transformation is the most recent episode in a decades-long development of a concrete, global, and pervasive network of electronic data-producing and data-consuming devices, embedded in society’s major sectors, including government, medicine, finance, education, and business. This network is not an abstraction; it is not virtual. It is a material development of the human biosphere with a geography and concreteness comparable to that of the free market as described in Polanyi’s The Great Transformation. We might call it the “datasphere”: a sphere of exchange in which the production, distribution, and consumption of digital data sets has developed in relation to other spheres of exchange that constitute what Castells has called the networked society. The datasphere emerges from the combination of separate trends with long histories, such as the development of computational thinking; the rise of statistical methods and world hypotheses within the sciences and society; the use of records, both paper and electronic, by organizations to “represent” and manage populations; and the construction of a network of computational devices for the sensing, storing, and analysis of data. Many symptoms of this historical development are not new: the anxiety of information overload, the millenarian belief in the transformative effects of abundant data, and so forth. But the historical moment is unique and genuine. We inhabit a situation that requires new perspectives and approaches to understand it.
When we speak of Big Data, we refer to a system of representing the world. This system inheres in an assemblage of technologies and practices that compose the datasphere, including databases, data models, best practices, user interfaces, character sets, query languages, sensors, network protocols, algorithms, software stacks, modes of office and lab work, and so forth. These elements, considered as an ecology with information flows, energy (and therefore economic) requirements, and selective pressures on behavior, are producing a series of effects in the areas of cognition and epistemology. Observers of knowledge-work in the datasphere have claimed that Big Data and associated analytical methods have made obsolete such apparently established notions as causal models (Chris Anderson), categories and systems of classification (Clay Shirky), reading (Franco Moretti), and even meaning itself (Claude Shannon). In place of these long-established ideas and practices, the emerging field of Data Science suggests a novum organum, a scienza nuova, in which the study of culture and society becomes a branch of statistical physics. When epistemologies change, especially to this degree, so do ontologies, ethical perspectives, and aesthetic sensibilities. Such transformations beg for the participation of humanists, both to join the conversation about their effects on how we think and to pursue new forms of humanistic research.
Big Data is often big social data. One of the most compelling aspects of the current moment is its social dimension. The concept of Big Data has migrated from the more esoteric hard sciences, where the phrase was coined to designate massive and logistically problematic data sets generated by new sensing technologies, to the wider worlds of policy and marketing, where it now stands for a threat and opportunity to various social constituencies, precisely because Big Data increasingly refers to big social and cultural data. We now generate real-time data about human behavior (through digitized libraries, institutional records, transactional data such as credit card use or Google searching, and social media) in quantities vastly exceeding those available through traditional methods, such as surveys, participant observation, and archival records. These data are not only massive in scale but rich in scope, including precise and exhaustive behavioral traces (think of consumer data tracked by scannable cards or discourse data generated by social media) that could not be captured without the technical apparatus described above. It is this change in both the quantity and quality of social data, which presents enormous challenges and opportunities at the technical and cultural levels, that defines Big Data as an area of concern and interest to the humanist and social scientist.
Humanists should both use and critically study Big Data. Beyond the legal and ethical issues raised by the use of Big Data, this historical moment opens up significant opportunities that are of interest to a broad range of the humanities. We may define at least three general areas in which humanists may participate alongside the emerging field of Data Science: (1) what might be called, broadly, the philosophy of data, which would focus on the epistemological and methodological issues raised by both the existence and the use of these new sets of data and their accompanying methods; (2) the history and sociology of data, which would focus on how cultural life and social organization are affected and transformed by the “data products” being developed by institutions such as governments, businesses, and hospitals; and (3) digital humanities research into data sets of cultural materials, represented by such schools of thought as distant reading, cultural analytics, macroanalysis, and culturomics.
Under the category of the philosophy of data are a series of rich interpretive questions such as: What is the relationship between data as processable records in a software environment and the cultural and social realities to which they purportedly refer? Are many aspects of these vast data sets beyond the reach of human understanding? How can we understand such dimensions as truth, accuracy, and bias in large data sets? Is a physics of culture now possible with the rise of Big Data? In what ways do the scale and variety of Big Data change the kinds of questions we can ask of a specific domain? Are models and traditional statistical methods made obsolete by the algorithmic methods developed to make sense of Big Data? How might a genuine data criticism be developed, one that would incorporate interpretive methods and concerns with quantitative techniques?
Under the category of the history and sociology of Big Data are questions such as: What are the genealogies and contours of the institutional frameworks within which the datasphere has developed? How are fundamental human associations being transformed, if at all, by the new data products and data-based platforms of participation that characterize the datasphere? How is the traditional public sphere being affected by the datasphere? What are the cultural and identity effects of Big Data, viewed as a means of representing populations to institutions that make decisions that influence the lives of people? How does Big Data affect the balance of power between governments, corporations, and publics? What happens to the res publica when the scope of res increasingly includes data? How is the social contract changed when Big Data mediates key social relationships, such as that between citizen and representative or doctor and patient?
The field of digital humanities comprises a number of developments in the humanities, with a concentration in literary studies and history. Through the vector of available computational methods and large digitized collections of primary sources, digital humanists have discovered and adapted quantitative approaches to the study of culture and society previously employed by social scientists, such as archaeologists in the study of material culture. They have applied these resources to areas normally considered unreachable by quantitative methods, such as the study of voice, genre, narrative, influence, and symbolism in literature and works of art. In addition, digital humanists have rediscovered human geography through the use of GIS and related technologies, contributing to what has been called the “spatial turn” in the humanities. Both approaches have been coupled with high-performance computing and have encountered many of the same logistical difficulties that natural scientists have encountered in their domains.
* * *