Herrenhausen Conference: "Big Data in a Transdisciplinary Perspective"

The conference was held in Herrenhausen Palace, Hanover, on March 25 to 27. Organizers: Volkswagen Foundation in collaboration with Dietmar Harhoff (Munich), Thomas Lippert (Jülich), Volker Markl (Berlin), Arnold Picot (Munich), Ralph Schroeder (Oxford), and Amir Zeldes (Georgetown University).

Summary Report

Public opinion in general is Janus-faced when it comes to Big Data: While some people talk euphorically of the economic opportunity of a fourth industrial revolution (Industry 4.0), others fear something like Big Brother, a monitoring of the individual which will ultimately lead to the demise of democracy as we know it. In the sphere of science, too, Big Data has become a buzzword that is capable of mobilizing large numbers of academics and millions in research funding – yet it is striking that in the numerous conferences dedicated to the topic, the disciplines more or less keep to themselves. The Volkswagen Foundation took steps to rectify this situation by inviting scholars to Herrenhausen Palace to participate in the event "Big Data in a Transdisciplinary Perspective". As Wilhelm Krull (Hanover), the Foundation’s Secretary General, pointed out in his welcoming address, Herrenhausen is also where the great universal scholar Gottfried Wilhelm Leibniz invented the first calculating machine and developed the binary system comprising 0s and 1s. "Calculemus!", "Let us calculate!", he is said to have proclaimed.

In her opening talk "Data, Scholarship and Disciplinary Practice", the American sociologist and communications engineer Christine Borgman (UCLA)[1] covered a wide spectrum of issues, ranging from the need to clarify the very term "Big Data", through the use of data in modern-day research, to the manifold problems confronting users of data in research practice. Borgman’s point of reference – as for so many other conference speakers – was Douglas Laney’s famous definition of the characteristics of Big Data as the "Three Vs": volume, velocity, and variety[2]. Borgman defined research data as "representations of observations, of objects, or other entities used as evidence of phenomena in research". Referring to Tony Hey’s publication "The Fourth Paradigm"[3], Borgman invoked the promise of a new Golden Age of academic research as a consequence of Big Data. At the same time, she pointed out that even in astronomy – a canonical example of "big science" – the enormous volume of data often has to be broken down into small packages before it can be analyzed: As a matter of fact, there is no fundamental difference between "big science" with large instruments, high costs, large numbers of coworkers and division of labor on the one hand, and "little science" with less equipment, low costs, small teams and local work on the other. Borgman went on to list the problems encountered in structuring data and data utility in research: She spoke of the lack of incentives to pass on data, and of the difficulty of using and building on existing data ("data is noise for another discipline"), including the unresolved legal aspects. However, the biggest problem is unquestionably the lack of proper infrastructural planning: The repositories frequently lack sufficient institutional anchoring and are therefore unsustainable. As Borgman succinctly summed up: Big Data is often under threat of becoming "no data".

Clifford A. Lynch from the Coalition for Networked Information CNI (Washington DC) focused on "The Challenges of Data Reuse: The Short and the Long Term". In his view, there can be no question that previously generated data must be archived – in the short term so that investigations can be reproduced, and in the long term, for instance via re-combination and re-annotation, so that new questions can be addressed. Indeed, he sees reuse as a central aspect of Big Data. However, given the present speed of data generation and the parallel rapid development of technology, it is simply not possible to migrate all data to the next generation of computers. On top of this, too few repositories are institutionally anchored. One must therefore first determine which data will be needed in the future, and then work out how these data can be stored by using metadata that is independent of time and culture. Lynch closed his talk with a call for a multidisciplinary discussion of criteria for the generation of data, its processing, and storage – and ultimately for the development of a new archive science in the digital age.

In his talk "Data Science: Practices and Ambitions", Peter Wittenburg from the European-American-Australian network Research Data Alliance (Nijmegen) also called for joint disciplinary efforts to overcome the manifold difficulties confronting cross-disciplinary data work. For – and here Wittenburg cited the famous metaphor coined by the Sheffield mathematician Clive Humby in 2006 – "data is the new oil", and science is no exception. Nevertheless, the current way of dealing with data is far too cost-intensive and inefficient. Work with data is frequently carried out by hand rather than automatically, at ever-increasing cost. A lot of data is still generated with no persistent identifier, and is therefore de facto already outdated the moment it is generated. As an alternative, Wittenburg outlined the model of a "data factory" in which all aspects of the corpus arrangement are coordinated with one another. He had to admit, though: Nobody knows what the situation will be ten years from now.

Andrew Prescott (Glasgow)[4] described to the conference participants the situation of "Big Data in the Arts and the Humanities". Prescott is a medievalist and an early pioneer of digital scholarship. Following a tour d’horizon of the large digital humanities projects that have been funded in Great Britain, he raised the question of whether Big Data simply constitutes more data or rather a substantial change to the structure of research. Here – in his opinion – we can draw on historical experience: Just as the keeping of records in the 11th-century Domesday Book – England’s very first land register – led to a reorganization of the way knowledge was structured, Big Data will have a correspondingly substantial impact on our present-day reality. In contrast to a science based on causalities, it is correlations which constitute the basis of Big Data and which, beyond this, enable predictions of what will happen in the future. By virtue of Big Data the humanities will certainly become more visual, haptic, and explorative.

Prescott sees the great scientific-theoretical challenge in the development of a theoretical framework he described as "critical data studies": "Big Data needs Big Theory!" The goal must be to bring about the "humanization of Big Data". This is because data is not reality; rather, it is drawn from observation. Prescott cited the Glaswegian archaeologist Jeremy Huggett: "Data is theory-laden, and relationships are constantly changing, depending on context"[5], and then went on to list the seven-point catalogue for Critical Data Studies[6] developed by Craig Dalton and Jim Thatcher, including: data must be located in time and space; they must be grasped as inherently political and serving vested interests; they can never speak for themselves and, with this in mind, there can be no such thing as "raw data".

In his talk about Trumpf Werkzeugmaschinenbau GmbH, which bore the title "Data-Value Services as a Differentiator for Machine Tools", Stephan Fischer (Ditzingen) delivered some surprisingly deep insights into industry. A former Head of Department at SAP, he became Trumpf’s director responsible for IT in 2014. The company specializes in laser technology, and his task is to lead it into the new networked digital age. Whereas in the initial stages the focus was on linking the physical with the virtual world and on using sensors to check the quality of the laser needle ("smart data"), the task today is to optimize the entire production system on the basis of mass-produced data and machine learning techniques ("smart factory"); in future, however, the focus will shift to developing the Internet of Services into a business model. In the ongoing process of digitalization, some essential questions still have to be resolved, e.g. how to transform analogue data into digital form, how data is to be administered, and how data can be safely transferred from the customer to Trumpf – or from Trumpf to research institutions. In the Smart Data Innovation Lab, Trumpf and other private-sector partners are working together with researchers to find out, for instance, when maintenance falls due. According to Fischer, industry hopes to benefit from the strategic advantages generated by exchanging data with researchers.

In making data available to researchers, the head of the Institute for Employment Research Stefan Bender (Nuremberg) also sees advantages for the planning of future policy, and thus indirectly for Germany’s branding. In his talk "Researcher Access, Economic Value and the Public Good" he called for the development of documentation standards, a definition of data reproducibility, and above all for a fitting way to deal with errors that may occur when using Big Data. In addition to this, Bender addressed the difference between "made data"/"designed data" and "found data"/"organic data". These forms do not, however, compete with each other but rather can be brought together in a complementary manner: For although Big Data may be cheaper to generate, the opposite is the case when errors have to be remedied. Bender interpreted the oil metaphor anew: Data is also capable of causing catastrophic damage in the same way as an oil spill.

According to Dirk Helbing (Zürich), a physicist and holder of a sociology chair, there is currently an imbalance between the knowledge we have gained about nature and what we know about society. He posed the question: "How can we build a smart resilient digital society?"[7] Big Data may be able to help us redress this imbalance. To this end, Helbing imagines a world in which a number of widespread and self-organized systems are subjected to a decentralized control or intelligence in order to reach decisions on the basis of data. In his view, such a "Planetary Nervous System" together with a "Living Earth Simulator" capable of simulating the different changes and influences on the world, might unveil fundamental insights into our society. At the same time, Helbing pointed out that data also has a "best before" date as certain datasets lose their value after only a short while. This would most certainly apply to some of the Twitter messages posted during the conference under the hashtag #HKBigData.

A technical aspect of Big Data was elucidated by Shivakumar Vaithyanathan from IBM Big Data Analytics (San José), who began by naming three different kinds of Big Data problems: 1) questions arising from the sheer volume of data; 2) questions answered by a large number of models covering different aspects; and 3) questions on which only small amounts of data exist but which give rise to vast amounts via simulations. These challenges are currently being addressed by data scientists who extract insights from large amounts of data. In order to do so, data scientists must be familiar with both worlds (the still "normal" IT world and that of Big Data) and be capable of translating and mediating between the two. The big goal of Big Data Analytics therefore is to carry out this translation automatically and thus transfer the data scientist’s idea automatically to the world of software environments like Hadoop and Co.

Several time slots of the Herrenhausen Conference were set aside for 29 junior researchers from 16 different countries for whom the Foundation provided travel grants. They were given the opportunity to present their research projects from different disciplines in three-minute lightning talks. At the end of the Herrenhausen Conference their talks and poster presentations were awarded prizes based on the votes of the conference participants. Historian Ian Milligan (University of Waterloo) received the prize for the best presentation for his description of the project "Finding Community in the Ruins of GeoCities"; the best poster prize went to social scientist Josh Cowls (Oxford) for "Using Big Data for Valid Research: Three Challenges".

The section of the conference dedicated to legal issues turned out to be a most lively one. Big Data very often comprises data for which researchers have not received (and might not receive at all) any informed consent from those who provide it. This is the point economist Julia Lane (Strasbourg/Melbourne)[8] picked up in her talk "Big Data, Science Policy, and Privacy". One has to be aware that analysis of Big Data might lead to completely wrong results – a thesis that Julia Lane illustrated with the events surrounding the Boston bombing, whereby an innocent man committed suicide after being wrongly accused of the attack as a result of Big Data analysis. This gives rise to a legal problem: "What is the legal framework for data on human beings?" The principle of informed consent laid down in the USA in the so-called Common Rule for the protection of human research subjects is today merely a fiction, since in times of Big Data anonymizing data is no longer an option. More often than not, individuals have absolutely no idea what data is stored about them, or that they can be identified at any time as a result of this data. How, then, will it be possible to carry out any social scientific research in future? Julia Lane called for a round table discussion at which representatives of research institutions, funding organizations, and public authorities agree on a roadmap for tackling this problem.

In his talk bearing the title "From Alibaba to Abida: Legal Issues concerning Big Data", the legal scholar Thomas Hoeren (Münster) shared the view that in today’s world it is no longer possible to obtain people’s "informed consent". In times of Big Data there is hardly any data that is not personalized. He described the German legal ruling on the Schufa [credit investigation agency] as the only existing example of correct legislation on Big Data. This is, first, because it prescribes scientific standards when dealing with data and, second, because it provides for transparency: Citizens have the right at all times to request information about the Schufa data stored on their person. Hoeren also addressed a number of other issues: Who bears the liability for incorrect data? Does data give rise to property rights – and if so, to whom do they belong? What about personality rights? Which role is played by the two great legal traditions of Anglo-Saxon Common Law and Roman Law? Big Data, according to Hoeren, will impact on the whole legal framework of our society. Hoeren is participating in a project funded by the German Ministry for Education and Research called "Assessing Big Data" (ABIDA) – hence the title of his talk. The aim of this project is to observe and monitor the multifaceted developments associated with applied Big Data.

Hoeren’s skepticism with regard to the current state of affairs in times of Big Data was also shared by his colleague Nikolaus Forgó (Hanover). His talk bore the provocative title: "Ignore the Facts, Forget the Rights: European Principles in an Era of Big Data". Forgó referred to the so-called "Volkszählungsurteil" [census ruling] of December 15, 1983. This landmark ruling of the German Constitutional Court established the basic right of informational self-determination which follows from the general right of personality and human dignity. The judgment is generally regarded as a milestone in the domain of data protection and found its way into the Charter of Fundamental Rights of the European Union in Article 7 and especially Article 8(2): Personal data "must be processed fairly for specified purposes and on the basis of the consent of the person concerned or some other legitimate basis laid down by law. Everyone has the right of access to data which has been collected concerning him or her, and the right to have it rectified." But what is today’s reality? It is characterized by individuals’ loss of control over "their" data, and thus by a loss of self: "If the product is for free, you are the product". According to Forgó, three problem areas must be addressed at the same time and clarified on an international level: issues pertaining to property rights, respect for privacy, and copyright.

In the last session of the conference the focus moved away from legal issues, returning to technical aspects and the formidable challenges facing research in general. At the opening of his talk on "Big Data and Challenges for Research and Research Funding", computer scientist Volker Markl (Berlin) said he saw the central aspect of Big Data in the coming together of two worlds: namely the world of data management and that of data analysis. He went on to shed light on two further characteristics of data. On the one hand, data can suffer a loss of value – here Markl placed a slightly different emphasis than Bender and Wittenburg – when it is shared. On the other hand, it will become increasingly difficult to shift such enormous amounts of data from one server to another as data is after all "as elastic as a brick of stone". With this he added a further aspect to the frequently cited oil metaphor: namely, the battles fought for this resource.

Markl followed this with a comprehensive description of the various challenges facing research and research funding, which led into the concluding panel discussion in which he was joined by David Carr from the Wellcome Trust (London), the Artificial Intelligence expert Oscar Corcho (Madrid), Joshua M. Greenberg from the Alfred P. Sloan Foundation (New York), and Stefan Winkler-Nees

Résumé

In summary, the Herrenhausen Conference brought together a number of outstanding international representatives across different disciplines, providing a transdisciplinary discussion of the highest intellectual caliber. Herein lay the event’s specific added value: Not only because it was an opportunity to meet each other and exchange views, but also because it succeeded in identifying a whole range of issues and challenges that the disciplines will only be able to resolve in cooperation with one another – despite the fact that Big Data constitutes a "container term" with blurred boundaries. On the level of scientific theory and sociology, the call for "Critical Data Studies" with a requisite historical-critical embeddedness of data would appear to be most important. On the technical level, it transpired that the issue of data processing, storage, and reproducibility is of central importance. On the statistical-methodological level, the issue of how to deal with errors in Big Data analyses is likely to dominate the discussion in future. In the legal dimension, it is abundantly clear that the law as it stands today is untenable and has to be adapted to the reality of the new digital age. On the societal level, the call for strengthening the skill-sets associated with data, i.e. data literacy, is overdue. Finally, on an overarching level the question arises as to what claim society can make for treating data as "common goods" which researchers are free to use, rather than ceding the opportunities to the internet economy.

Vera Szöllösi-Brenig and Christoph Kolodziejski, VolkswagenStiftung
bigdata@volkswagenstiftung.de

Notes

[1] Christine L. Borgman: "Big Data, Little Data, No Data. Scholarship in the Networked World", MIT Press 2015; Christine L. Borgman and Marianne Krasny: "Scholarship in the Digital Age. Information, Infrastructure, and the Internet", MIT Press 2007
[2] Laney, Douglas: "3D Data Management: Controlling Data Volume, Velocity and Variety" (PDF). Gartner. Retrieved 6 February 2001.
[3] "The Fourth Paradigm: Data-Intensive Scientific Discovery", edited by Tony Hey, Stewart Tansley & Kristin Tolle, Microsoft 2009
[4] http://de.slideshare.net/burgess1822/prescottherrenhausen [7.5.2015]
[5] Jeremy Huggett: "Promise and Paradox: Accessing Open Data in Archaeology", Proceedings of the Digital Humanities Congress 2012
[6] Craig Dalton and Jim Thatcher: "What does a critical data studies look like, and why do we care? Seven points for a critical approach to ‘big data’", Society and Space 2014 http://societyandspace.com/material/commentaries/craig-dalton-and-jim-thatcher-what-does-a-critical-data-studies-look-like-and-why-do-we-care-seven-points-for-a-critical-approach-to-big-data/ [7.5.2015]
[7] https://www.youtube.com/watch?v=mO-3yVKuDXs (Helbing’s talk on YouTube)
[8] Julia Lane, Victoria Stodden, Stefan Bender and Helen Nissenbaum (eds.): "Privacy, Big Data, and the Public Good: Frameworks for Engagement", Cambridge University Press 2014