Introduction

Much has been written on the impact that data infrastructures such as digital databases are having on scientific research, and particularly the biological and biomedical sciences. Most of this scholarship has emphasised the importance of quantities (of data, researchers, investments and research sites, among other factors) as a key motivation underlying the widespread adoption of these infrastructures for scientific communication.Footnote 1 Massive research efforts and resources are being devoted to the dissemination of data online by public and private funders across the globe.Footnote 2 Many scientists and science funders view databases as crucial tools for handling the vast amounts of molecular data produced by technologies such as automated sequencing and microarray experiments (often referred to as ‘big data’), and for getting those data to travel across the world quickly and easily. It is hoped that free and widespread access to large data sets will enhance the use of such data as evidence for new claims, thus generating new paths towards discovery (for example, Hey et al, 2009; Royal Society, 2012). The insistence that online databases constitute a solution to the problems posed by large quantities, however, clashes with the quality considerations attached to actually using (evaluating and interpreting) data to produce new knowledge. Biology and medicine are notoriously fragmented into countless epistemic cultures, each characterised by different interests, values, forms of reasoning, methods, material objects and standards for what counts as evidence.Footnote 3 Further, epistemic cultures are not stable objects: the combinations of individuals, expertises, interests and methods that characterise them are subject to constant change to match the ever-shifting nature of biological knowledge and of living systems themselves.Footnote 4 This extensive and dynamic pluralism makes it hard to develop databases that can bridge such diverse expertises, and thus fulfil the specific needs of each community.Footnote 5 In confronting both the high quantities and the diverse qualities of research, databases are subject to two seemingly opposite requirements: fostering the global circulation of data and facilitating their local adoption.

What is ‘big’ about online databases and the sciences that they are meant to support? This article considers how two groups of database curators have attempted to overcome this challenge, and uses this empirical material to reflect on what scale involves in contemporary biological data infrastructures. I here articulate two complementary answers to this question. First, I take a critical stance against the very idea that online databases are catering for ‘big biology’. I show that online databases are primarily responsible for providing useful support to the needs and questions emerging from the unique combinations of expertise and interests brought together within any one biological project. In other words, online databases need to cater primarily for ‘little science’, and if they fail to fulfil this requirement they eventually cease to be used and funded. As I discuss below, I do not think that de Solla Price’s (1963) seminal characterisation of ‘little science’ fits the type of projects that I am discussing here, which target one scientific question through a specific research approach for a limited period of time (in this sense they are ‘little’), but can involve a large number of scientists across different institutions and nations. I thus agree with Niki Vermeulen’s argument that biology, in its currently globalised and networked incarnation, embodies a new way of doing science (Vermeulen, 2009; see also Leonelli, Forthcoming); and yet, I disagree with her that the label ‘big science’ can be usefully applied to most research projects carried out in contemporary biology. Second, I show that in order to foster both the global dissemination and the local adoption of data, database curators strive to develop platforms through which methodological and epistemic diversity can be expressed and explored, rather than hidden and/or ignored. I conclude that one way to understand what counts as the scale of data infrastructures is to consider the range and scope of biological questions that data stored therein can be used to address; and that if scale is understood in this way, well-functioning databases are necessarily bigger than the science that they serve, which tends to remain fragmented into a wide variety of specialised projects.

A Tale of Two Databases

My arguments are grounded in a historical study of two efforts made to build ‘all-encompassing’ databases in the biological and biomedical domains over the last decade. Each of these infrastructures was intended to serve a large and cutting-edge research area; and whereas one is directed to a quintessentially biological field (plant science), the other supports what is arguably the most active area of biomedicine (cancer research). The first case is The Arabidopsis Information Resource (TAIR), which was started in 1999 with funding from the National Science Foundation (NSF) with the objective of gathering data relevant to understanding the model plant Arabidopsis thaliana. In parallel to the success of Arabidopsis as a key model organism for plant science as a whole, TAIR became a prominent resource within this area in the late 2000s. Since 2010, its structure and content have been undergoing significant reworking to guarantee its usefulness to users and long-term sustainability. My analysis is based on the consultation of publications and archives released by TAIR itself and available on its Website; and publications and reports on plant bioinformatics issued by the NSF and the Arabidopsis plant community over the past 10 years.Footnote 6 The second case study is the Cancer Biomedical Informatics Grid (caBIG), created in 2003 to function as a portal linking together data sets gathered by the research institutions and patient care centres under the purview of the National Cancer Institute (NCI). The initial goals of caBIG curators included assembling and integrating all existing data sets on all types of cancer research across a wide range of institutions; this ambitious aim was recently scaled down due to widespread critiques of its limited achievements. My analysis is based on a close study of the caBIG Website; of archival sources, including publications and minutes of meetings of caBIG curators available online; and on publications and reports on cancer bioinformatics and caBIG's role in it issued by the NCI and several cancer researchers over the past 5 years. Since their inception, both databases have aimed to serve as wide a research community as possible, while at the same time helping researchers to seamlessly fit data retrieved online into their existing projects. The achievement of these aims was complicated by the huge variety of data to be disseminated; the diversity of loci of data production (and thus the format and methods of data generation); and the ongoing tension between biological standards and protocols, and computational methods used to format, annotate and visualise data so that they are machine-readable. These cases confirm existing findings that high levels of standardisation, accessibility and visibility are key requirements for databases aimed at data re-use on a large scale; and yet, whether these databases are ultimately successful depends on how well they encompass – rather than exclude or deny – the epistemic pluralism that characterises research in the life sciences.

TAIR

TAIR was created in 1999 as a replacement for AtDB (the Arabidopsis thaliana Database), a small database containing selected genetic data on the model plant. The NSF funded TAIR to ensure that data coming out of the international sequencing project devoted to Arabidopsis would be adequately stored and made freely accessible to the plant science community. The Carnegie Institution for Science’s Plant Biology Department, home to prominent plant scientists including Arabidopsis veterans Chris and Shauna Somerville, won a national bid to create and host the database; and a former student of Chris Somerville, Seung Yon Rhee, was given the task of directing TAIR, which she undertook with a strong vision for what the database should become in the future. As an experienced experimenter, Rhee thought that TAIR should not be just a repository for sequence data; rather, it should become a repository for all data extracted from Arabidopsis research, a platform facilitating communication and exchange among researchers working on different aspects of plant biology. Further, the database should contain a set of tools for data retrieval and data analysis, which would facilitate the integration of all those data. Finally, the database should enable users to integrate Arabidopsis data with data extracted from other plant species, thus paving the way for developments in plant science as a whole (Rhee et al, 2006, p. 352).

Thanks also to the success of Arabidopsis as a model organism (Leonelli, 2007; Koornneef and Meinke, 2010), TAIR indeed became a reference point for plant researchers across the globe, assembling an impressive array of data sets concerning disparate aspects of Arabidopsis biology, ranging from morphology to metabolic pathways. The database includes various search and visualisation tools developed by the TAIR team to help plant scientists retrieve and interpret Arabidopsis data. Examples are MapViewer, which allows access to various types of mappings of Arabidopsis chromosomes; and AraCyc, which visualises data about biochemical pathways characterising Arabidopsis cellular processes. TAIR provides abundant information about how these tools have been constructed, how they should be used and which types of data are included; and allows users to order Arabidopsis seeds directly from Arabidopsis Stock Centres.Footnote 7

The importance of TAIR to plant science became a hot topic for scientific debate in 2008, when the NSF decided to cut funding to the resource. There were several motivations for this decision, including financial constraints as well as the wish for TAIR to accommodate the changing needs of the plant science community, and particularly the increasing importance of tools for the analysis and comparison of data across different plant species. At the same time, the decision to cut TAIR funds, thus effectively making it impossible for its curators to extensively review and revise its contents, was controversial within the plant science community. Scientists' protests poured in from all corners of the globe; Nature published an editorial on the key role of TAIR in plant science; and several working groups were set up to find ways to support TAIR in the long term, thanks also to the efforts of the Multinational Arabidopsis Steering Committee and the Genomic Arabidopsis Resource Network, or GARNet (Bastow et al, 2010; Ledford, 2010). These events illustrate how TAIR, in a similar way to other model organism databases such as WormBase and FlyBase, has played an exemplary role in demonstrating the value of data curation to experimental biology (Leonelli and Ankeny, 2012). This involves substantial curatorial labour, for reasons that I shall now describe.

First, consider the variety of data sets that TAIR attempts to host under the same digital umbrella. Quantity is not the main problem here. There are of course worries about securing the hardware and memory necessary to store the masses of data collected by TAIR on a weekly basis, but the most urgent problems confronted by curators concern the diversity in quality and provenance of those data. Here we can see one way in which the extensive fragmentation of biological research into different epistemic communities affects the set-up of databases: each group tends to use different methods, instruments, formats and protocols to produce data, which makes it very hard to integrate those results with data produced by other groups, even when they document the same biological aspect. For instance, the most straightforward type of data curated by TAIR is sequence data documenting Arabidopsis genome structure, and yet even in that case complications abound. Data obtained through the first sequencing project are being constantly updated and checked, in order to correct mistakes or inaccuracies arising from the attempt to merge data sets produced by different groups in different locations. These curatorial efforts include adding novel genes, updating exon/intron structures of existing genes, deleting mispredicted genes, merging and splitting genes, changing gene types, and adding splice variants (Swarbreck et al, 2008). These efforts only increase in the case of functional, metabolic and morphological data about plant mutants, whose production is even more dependent on the preferences and local conditions of data producers.
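
To make these curatorial operations more concrete, the sketch below models a gene record and two of the updates mentioned above (merging mispredicted genes and adding a splice variant). It is a minimal, hypothetical illustration in Python; the record structure, identifiers and operations are assumptions made for exposition, not TAIR's actual data model or curation pipeline.

```python
# Hypothetical gene record and two illustrative curatorial operations;
# identifiers and coordinates are invented for illustration only.
from dataclasses import dataclass, field

@dataclass
class GeneModel:
    locus: str                # a hypothetical locus identifier
    gene_type: str            # e.g. "protein_coding" or "pseudogene"
    exons: list               # exon (start, end) coordinates on the chromosome
    splice_variants: list = field(default_factory=list)

def merge_genes(a, b, new_locus):
    """Merge two adjacent, mispredicted models into a single curated gene."""
    exons = sorted(set(a.exons) | set(b.exons))
    return GeneModel(new_locus, a.gene_type, exons,
                     a.splice_variants + b.splice_variants)

def add_splice_variant(gene, variant_id):
    """Record a newly observed splice variant for an existing gene."""
    if variant_id not in gene.splice_variants:
        gene.splice_variants.append(variant_id)

# Illustrative use:
g1 = GeneModel("LOCUS_A", "protein_coding", [(100, 250), (400, 600)])
g2 = GeneModel("LOCUS_B", "protein_coding", [(700, 900)])
merged = merge_genes(g1, g2, "LOCUS_AB")
add_splice_variant(merged, "LOCUS_AB.2")
```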

Another crucial difficulty is posed by the unpredictability of the uses to which data hosted by TAIR could be put in the future. TAIR curators devoted years of effort to making TAIR as accessible and interesting as possible not only to Arabidopsis specialists but also to other plant scientists and even biologists working on other kingdoms. This is an extremely ambitious goal, which curators have pursued from the early days of TAIR development and which involved frantic consultation of literature dealing with information management, in the hope of finding suggestions about how to integrate and visualise the most diverse information in the simplest possible way, without losing sight of the diversity of cultures through which data are produced and re-used. One early strategy adopted by curators was to create several different search engines within TAIR, each of which would provide a different perspective on Arabidopsis biology. They devised a search engine visualising the location of genes on Arabidopsis chromosomes; another displaying data about gene expression; another focused on data about biochemical pathways and so on. The possibility of gathering data about the same phenomena from different perspectives, they reasoned, would maximise the information available to users while minimising losses in the accuracy or the richness of data. Most importantly, users would be allowed to formulate their queries in different ways, reflecting their own epistemic commitments: they would be able to choose among different parameters and ways to display the results of their searches (for instance, when searching for a specific gene locus on a chromosome, TAIR users can view their results in the form of a genetic, physical or sequence map).
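
The Python sketch below illustrates this 'one record, several perspectives' idea in a few lines: the same underlying locus record is rendered as a genetic, physical or sequence view depending on the user's choice. The field names, views and example values are hypothetical and far simpler than TAIR's actual search tools.

```python
# Render the same locus record under different user-chosen perspectives.
# Field names, views and values are invented for illustration only.
RESULT_VIEWS = {
    "genetic_map":  lambda r: f"{r['locus']} at {r['cM']} cM on chromosome {r['chromosome']}",
    "physical_map": lambda r: f"{r['locus']} spanning bases {r['start']}-{r['end']}",
    "sequence_map": lambda r: f"{r['locus']}: {r['sequence'][:30]}...",
}

def display_locus(record, view):
    """Return a textual rendering of the record under the requested view."""
    return RESULT_VIEWS[view](record)

example = {"locus": "LOCUS_X", "chromosome": 1, "cM": 0.5,
           "start": 3600, "end": 5900, "sequence": "ATG" * 100}
for view in RESULT_VIEWS:
    print(display_locus(example, view))
```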

Not all of these tools have been found to be equally valuable and accessible by plant researchers, and TAIR curators have reduced their ambitions over time, focusing increasingly on updating sequence and functional data on Arabidopsis rather than including new data types and tools for comparison across plant species (which might be viewed as one reason for their loss of funding). Still, what I want to stress here is that the construction of TAIR involved not only collecting diverse types of Arabidopsis data from multiple sources, but also elaborating strategies through which data could be organised, retrieved and visualised by users from various parts of plant science. In Rhee’s words:

Ultimately, our goal is to provide the common vocabulary, visualisation tools, and information retrieval mechanisms that permit integration of all knowledge about Arabidopsis into a seamless whole that can be queried from any perspective. Of equal importance for plant biologists, the ideal TAIR will permit a user to use information about one organism to develop hypotheses about less well-studied organisms.

(Rhee website, accessed January 2005)

During its 12-year-long existence, TAIR could be said to have become a virtual laboratory, experimenting, with mixed results, with different ways of visualising both data and biological phenomena. The main challenge has been to imagine what users want, as TAIR aims to reach several types of scientific audiences, ranging from developmental to molecular and theoretical biologists (not to mention the differences in epistemic cultures to be found within and across those categories). To confront this problem, TAIR curators have drawn insight from their own experience as bench researchers specialised in different areas of Arabidopsis biology. For instance, developmental biologist Eva Huala, who became TAIR director in 2005, proposed that the user should be able to “fly into” the chromosome, that is, to view and explore a three-dimensional representation of Arabidopsis chromosomes that is produced and constantly modified on the basis of incoming experimental data. This meant that TAIR should provide complex, three-dimensional visualisation tools that would allow users to click on the image of a specific chromosome and see a representation of the inside of the chromosome. On the one hand, this representation would have to be realistic enough to convey ideas about the actual structure and physiology of chromosomes; on the other, it would have to contain specific references to the data from which the model was generated, so as to allow users to trace the sources and original context of the data. Further, from its inception, TAIR has collaborated with other model organism databases, with the ultimate goal of providing a platform integrated with repositories of data on other organisms. In this sense, TAIR curators have indeed invested effort in facilitating comparative research. External collaborations included consultations with experts in each relevant field; participation in the development of the Gene Ontology as a controlled vocabulary to disseminate data about gene products across species; and meetings with curators working on other model organisms to compare strategies and develop a joint resource (the Generic Model Organism Database).
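
As a rough illustration of why a controlled vocabulary such as the Gene Ontology helps data travel across species, the Python sketch below annotates gene products from different plants with shared term identifiers, so that a single query retrieves matches regardless of organism. The gene names and annotations are invented, and the term identifiers are used only as opaque, GO-style labels, not as curated GO records.

```python
# Gene products from different species annotated with shared vocabulary terms;
# all entries are invented for illustration only.
annotations = [
    {"species": "Arabidopsis thaliana", "gene": "GENE_A", "term": "TERM:0001"},
    {"species": "Oryza sativa",         "gene": "GENE_B", "term": "TERM:0001"},
    {"species": "Arabidopsis thaliana", "gene": "GENE_C", "term": "TERM:0002"},
]

def genes_annotated_with(term):
    """Return (species, gene) pairs sharing the given vocabulary term."""
    return [(a["species"], a["gene"]) for a in annotations if a["term"] == term]

# A single query now crosses species boundaries:
print(genes_annotated_with("TERM:0001"))
```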

Initially, curators put a lot of emphasis and hope on the idea that users would recognise the benefits of a resource such as TAIR and would, as a consequence, support it both by donating data to it and by helping curators to get it right, for instance, by sending feedback and engaging with the technicalities of how data were collected, stored and visualised by the database. The idea that users would enter a strongly collaborative relationship with database curators, however, proved to be misguided. Providing input is difficult and time-consuming, as it requires familiarity with the software and classification systems used by TAIR. Users have no real incentive to do this, especially as no formal credit system is yet in place for data donation within biology. As a result, users expect curators to take full responsibility for how data are presented, so that users can access the data they need and get on with their research. Rather than asking biologists for direct feedback on the vision of the database, curators thus started asking users for queries that would likely be submitted to a database such as TAIR. The crucial issue for the TAIR team became: Can we answer this query with the current tools? By asking this question, TAIR curators effectively brought together their concerns about information management and user-friendliness, thus elaborating designs for easily accessible, and yet rich, databases.

Curators also made assumptions about how the plant community should organise itself so that a database like TAIR could reach its full potential. They strongly relied on the existence of a collaborative, open access ethos within the community, which Rhee (2004) herself aptly characterised through the motto “share and survive”. This constitutes a surprising exception in the competitive context of biological research, and of molecular biology in particular. Since the early 1980s, when Arabidopsis was rediscovered as a model organism and research efforts on it acquired momentum, prominent Arabidopsis researchers agreed that the data acquired through research on this model organism should be kept freely available (Leonelli, 2007; Koornneef and Meinke, 2010). Many biologists and research institutions involved in Arabidopsis research continue to foster this ethos. This situation has had a strong impact on how TAIR was developed and on the expectations that TAIR curators placed on their user community.

The caBIG

caBIG is an ambitious bioinformatics initiative within an area that is vastly better funded and more visible than plant science. The curatorial vision behind caBIG parallels both the commitments to open access and data re-use fostered by TAIR curators and their ongoing struggle with serving widely diverse user communities, although caBIG curators have arguably been much less successful than TAIR curators in implementing effective feedback mechanisms for their curatorial choices and thus gaining the support of prospective users. As I shall argue, this is partly due to the difficulties encountered by caBIG curators with understanding and capturing the pluralism characterising cancer research, which has meant that this resource, though widely and easily accessible online, has not yet been widely adopted as a tool to foster medical research.

caBIG was started in 2003 as a corollary of the ‘big science’ programme of the NCI in the United States. Its initial functions were to facilitate data sharing among stakeholders in NCI research, including clinicians, research scientists, pharmaceutical companies, primary healthcare providers and patients; and to showcase the results obtained by NCI in its fight against cancer. Given the scope of the enterprise, as well as the diversity of ethos, training and results characterising these groups, the caBIG remit was immediately seen by many bioinformaticians and cancer researchers as hugely ambitious and possibly unrealistic (Ochs et al, 2010). Despite this early scepticism and a rather critical review of the caBIG pilot phase (NCI, 2008), its reach and ambitions expanded towards becoming a portal to all cancer data, no matter where those data were first obtained and which national agencies saw themselves as responsible for their dissemination.Footnote 8 Andrew von Eschenbach, who was directing the NCI at the time when caBIG was started, and Kenneth Buetow, the Associate Director for Bioinformatics and Information Technology at the NCI and the founder of caBIG, summarised their vision as follows:

Medical research teams have operated, in effect, as cottage industries, each collecting and interpreting data using a unique language of their own making and in virtual isolation from other teams. Biomedical informatics has the potential to be the powerful critical means to achieve the necessary degree of integration as it provides the mechanisms and tools to support standardized sharing, management and analysis of diverse data across the bench-to-bedside continuum and back.

(Eschenbach and Buetow, 2006, p. 22)

caBIG was funded to spearhead efforts towards data re-use across diverse research and clinical contexts. In a similar way to TAIR, caBIG embodies the promise of data infrastructure to facilitate data management on a very large scale. The challenge faced by caBIG curators is, however, even more substantial. caBIG operates in a densely populated research environment, where hundreds of projects around the world are already utilising digital databases to store and disseminate their results. It therefore made little sense to re-invent the wheel and construct a central resource from scratch, as TAIR curators were tasked to do.Footnote 9 Rather, caBIG started by bringing together existing repositories of cancer data, providing a superstructure through which users can access the data they are interested in, no matter where they are actually located and who administers the relevant database. caBIG thus proposed to facilitate data re-use by providing a common point of access to data, while at the same time exploiting existing databases produced and maintained by other organisations. Beck (2005), a participant in the caBIG Strategic Planning Workspace, presents the following user scenario to illustrate the usefulness of such an endeavour:

A researcher involved in a phase II clinical trial of a new molecularly targeted therapeutic for brain tumors observes that cancers derived from one specific tissue progenitor appear to be strongly affected. The trial has been generating proteomic and microarray data. The researcher would [now] like to identify potential biochemical and signaling pathways that might be different between this cell type and other potential progenitors in cancer, deduce whether anything similar has been observed in other clinical trials involving agents known to affect these specific pathways, and identify any studies in model organisms involving tissues with similar pathway activity. [..] With caBIG compliant components now under development, the researcher would be able to perform the analysis routinely, with data flowing through systems and analysis being automatic. This analysis will yield biomarkers and potential drug targets gathered from multiple workspaces and make it possible to develop treatment modalities faster, less expensively, and more effective for patients (p. 10).

Realising this vision of seamless data dissemination and retrieval involves building standards aimed at bridging the gaps between existing databases (what Beck calls “caBIG compliant components”), thus facilitating the retrieval and visualisation of very different types of data, ranging from genomic sequences to symptoms of individual patients, coming from hundreds of different sources. The way in which caBIG has attempted to achieve this is by pushing the databases collected under its purview to adopt common formats and follow structural rules enabling a basic level of interoperability across different databases. The notion of interoperability constitutes a specific way to envisage data dissemination, integration and re-use. As in the case of TAIR, caBIG curators wish to leave their users as much room as possible for selecting and interpreting data. Interoperability, defined as “the ability of two or more systems or components to exchange information and to use the information that has been exchanged” (Covitz, 2004, p. 8), seems well suited to making data travel across diverse environments without necessarily affecting the local norms of the contexts in which data are produced. Interoperability requires only a minimal amount of consistency and coherence among existing databases: just enough to enable users to move from one data domain, type and source to another, while leaving them free to read and use the data that they find as they see fit, to answer any scientific query they may have. To make this possible, the structure of caBIG is highly modular. Since its inception, caBIG has been organised into separate pilot groups working on different areas of data management, ranging from curating data acquired through medical imaging technologies to managing tissue biobanks and data coming from clinical trials. These pilot groups were recently formalised into ‘domain workspaces’, each of which oversees the storage and management of different data types, and aims to create tools to enable user communities to overcome physical, legal and scientific barriers to data sharing (Table 1).
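
To give a concrete, if highly simplified, sense of what such minimal interoperability can look like in code, the Python sketch below has two repositories with different internal formats expose the same thin query interface, so that a federated search can gather records from both while leaving interpretation to the user. The interface, field names and repositories are assumptions made for illustration, not caBIG's actual grid services or metadata model.

```python
# Two repositories with different internal formats expose the same thin query
# interface; a federated search pools records without harmonising the sources.
# All names and fields are invented for illustration.

class ImagingRepository:
    def __init__(self, scans):
        self._scans = scans        # internal format: scan_id -> metadata dict
    def query(self, term):
        return [{"source": "imaging", "id": scan_id, "data_type": "image", "payload": meta}
                for scan_id, meta in self._scans.items() if term in str(meta)]

class TrialRepository:
    def __init__(self, rows):
        self._rows = rows          # internal format: list of flat trial records
    def query(self, term):
        return [{"source": "trials", "id": row["trial_id"], "data_type": "clinical", "payload": row}
                for row in self._rows if term in str(row)]

def federated_query(services, term):
    """Pool results from every service; interpretation is left to the user."""
    return [record for service in services for record in service.query(term)]

services = [
    ImagingRepository({"scan-1": {"tissue": "brain", "modality": "MRI"}}),
    TrialRepository([{"trial_id": "trial-7", "tissue": "brain", "phase": "II"}]),
]
print(federated_query(services, "brain"))
```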

Table 1 Synoptic depiction of the organisational structure of caBIG, and the number of participating organisations for each of the domain workspaces in 2004 and in 2009. Information extracted from the caBIG website in October 2010

Table 1 offers only a glimpse of the structural and organisational complexity and scope of caBIG. What becomes clear from even such a limited perspective is that this resource is not supposed to operate on a shared, unified understanding of what it could be used for. The idea underlying caBIG is that its use will have distinct characteristics depending on specific biological and biomedical queries and on the contexts from which the queries come. By capitalising on interoperable modules, caBIG is trying to walk the line between a centralising, top-down structure that produces universal standards and provides a focal point of reference for all users; and a decentralised, bottom-up resource informed by curators and researchers around the world. This attention to diversity and decentralisation is not only tied to the curators' awareness of the variety of epistemic cultures and research traditions involved in cancer research broadly construed; it is also tied to their awareness of the regimes of competition and intellectual property involved in high-profile biomedicine. The challenge of constructing a data-sharing community around cancer research was highlighted by John Niederhuber, the director of the National Cancer Institute, as the most significant contribution that caBIG can hope to make:

caBIG® is an example of a new approach to organizing medical research in the future that is really both an experiment and yet a transformation at the same time. No single individual or organization can manage the amount of data that we deal with now in biomedical research. Ideally, this information must be available online, in real time, so doctors, patients, and organizations like ours can use it quickly. This is really creating a new community of research. That’s what caBIG® is. It’s not just a technology; it’s a cultural change.

(Leaflet caBIG, January 2011)

Remarkably, the cultural change is viewed as emerging from the inclusion of as many stakeholders in cancer research as possible, including patients and national funding agencies across the globe:

By having this infrastructure in place, we have the capacity to both support next generation clinical research and also, more importantly, inform next generation practice outcomes and bring molecular medicine into the clinical practice environment. However, if we’re going to do this on a broader scale, we need to bring more participants to the table. […] We need more players. These include both our clinical communities, as we currently have, but also a much more meaningful engagement of care providers and consumers. We need to have a full partnership of funders – not just government as has been the case to date with caBIG – by bringing in other players to help underpin this infrastructure.

(Buetow, 2008, p. 5)

It is remarkable that despite such strong public acknowledgements of the importance of inclusion and plurality in caBIG, curators do not seem to have been particularly effective, or indeed interested, in capturing the attention of potential users. The resource is widely seen as too complex for researchers to understand and use; the modules developed to enable interoperability also tend to be viewed as too restrictive in their choice of data formats and standards, and thus unfit to encompass the very diversity of research practices that they were devised to capture. A recent review of caBIG achievements by the NCI (2011) concluded that “the level of impact for most of the tools has not been commensurate with the level of investment”. This stark critique has led to a decrease in funding; and it is particularly striking when contrasted with the NSF's argument for cutting TAIR funding, which was not tied to a similarly harsh assessment of TAIR's achievements (these are recognised to have been considerable, though of course the resource's functionality needs to improve and evolve as research moves forward), but rather stressed that the plant community should take more responsibility for securing its long-term sustainability (Bastow and Leonelli, 2010).

A detailed assessment of caBIG's failure to establish itself as the most useful digital platform in cancer research lies beyond the scope of this article. It might be argued that this is only a question of time, caBIG curators having taken a long time to think through the structure and computational requirements of their resource, which delayed vital consultations with stakeholders and the management of the data themselves. Indeed, the NCI (2011) review remarks that “perhaps the greatest impact of the caBIG® program on cancer research has been to gather several communities around a virtual table to help create and manage community-driven standards for data exchange and application interoperability”. ‘Community-driven’ is a key term here: identifying which communities are involved in cancer data analysis, and which methods, instruments and terminologies they use, is a difficult and yet foundational task for caBIG curators, and one that they will arguably take even more seriously in the future. At the same time, caBIG needs to put that knowledge to work, by implementing data searches that highlight the diverse knowledge and provenance of available data, and yet manage to address the specific scientific queries of prospective users.

The approach to data curation taken by caBIG can be viewed as resembling that of TAIR insofar as it emphasises (i) open access to data as key to advancements in biomedicine; (ii) the belief that centralised data management is not only compatible with the existing diversity in biomedical research traditions but can actually foster its development; and (iii) the belief that effective data re-use can be achieved through the successful management of data access. The similarities in the visions proposed by TAIR and caBIG might be at least partly derived from their common cultural and political embedding (these are both tools funded by US agencies and aiming to serve scientists around the world). Yet these two resources have independent histories, employ different standards and support vastly different communities, which makes the commonalities in their vision all the more striking.

Scale and the Scientific Usefulness of Data Infrastructures

The conclusion I wish to draw from my brief analysis of caBIG and TAIR is that it is digital infrastructures, rather than science itself, that need to operate on a big scale. I have shown how these databases are responsible for supporting the needs and specialised questions posed by research projects across a variety of scientific areas. Echoing the findings of STS scholarship on standards (Timmermans and Berg, 1997; Bowker and Star, 2000; Timmermans and Epstein, 2010), both resources view standards adopted for data dissemination as obsolete if they are not immediately useful to users in their current practices. If biologists with different expertises cannot use databases to collect and re-use data to address their own specific queries, the databases have failed to fulfil their role, no matter how good the rationale underlying their construction or how great the resources invested in their development. Since the start of their work in the early 2000s, TAIR and caBIG curators have been acutely aware of this overarching goal and of the difficulties of reconciling it with the requirement of making data travel far and wide across multiple epistemic communities. Their efforts to resolve this tension resulted in the development of databases where methodological and epistemic diversity is explicitly expressed and explored, though with differing degrees of effectiveness. TAIR and caBIG thus do not aim to fully standardise and/or unify the knowledge and data that they contain, nor to homogenise the instruments, methods and terminologies used by their user communities. Rather, they attempt to make such diversity visible, so that prospective users can take it into account when interpreting data retrieved through those resources for their own investigative purposes. This attempt has not yet been entirely successful, as demonstrated by the fact that both TAIR and caBIG are undergoing extensive revisions and caBIG in particular has been widely critiqued. Yet, these very shifts and critiques signal the extent to which responsiveness to users' research interests, and particularly the capacity to encompass and serve as many prospective queries as possible, is the most important factor in determining the long-term usefulness and sustainability of data infrastructures.

Incorporating a large variety of possible viewpoints and prospective queries has been, and continues to be, the most complex and labour-intensive task involved in the development of TAIR and caBIG. Both databases responded to this challenge by diversifying search tools and developing sophisticated visualisations, archives of data provenance (the methods and instruments originally used to generate data) and links to biological materials (most obviously in the case of TAIR). Setting up and updating these resources occupies much of curators' time and creative efforts. caBIG and TAIR illustrate how existing diversity in data types, disciplines, resources, instruments, methods and goals is valued as a breeding ground for innovation; and the ‘data friction’ generated by the interaction of different communities (Edwards, 2010; Edwards et al, 2011) is viewed as a fruitful terrain for scientific advance. For instance, ensuring the comparability of Arabidopsis data with data extracted from other species has been crucial to the planning and development of TAIR, although its full potential is yet to be realised; and the overly complex structure of caBIG results from the ambitious goal to include data coming from clinical and biological contexts, as well as directly from cancer patients.

So what does the notion of scale mean for data infrastructures? My observations lead towards an interpretation of scale that has less to do with the quantities of resources, data and personnel involved in the development of data infrastructures, and more to do with their scientific role as platforms towards new discoveries. In other words, the scale of data infrastructures can be measured through the range and scope of biological questions that data stored therein can be used to address – where range indicates the number of research areas and specific queries potentially served by the database and scope indicates the types of organisms whose study can thus be fostered. Under this interpretation, the scale of TAIR is not captured by the number of curators working for the resource, the number of users consulting it or the quantity of data stored therein. Rather, it relates to the range and scope of queries that TAIR can be used to address: the variety of biological questions captured by TAIR search tools (ranging from the genes involved in developmental processes to the morphologies of different ecotypes or the history of specific loci) and the number of plant species and strains that TAIR can help to study.

This definition of scale is only one among many possible interpretations, and indeed we discuss the multi-faceted nature of this term in the introduction to this special section on ‘Bigger, Faster, Better? Rhetorics and Practices of Large-Scale Research in Contemporary Bioscience’ (Davies et al, 2013). I focus on this interpretation here because it has considerable implications for the conceptualisation of databases as forms of ‘big science’, particularly when compared with the idea of ‘big science, little science’ articulated in de Solla Price's classic 1963 account and Niki Vermeulen's recent work. Vermeulen endorses de Solla Price's view that science is becoming increasingly more collaborative, international and expensive. She extends that argument to biology and argues that the quantities involved in research projects (including the number of researchers involved and their varied locations across the globe) are affecting the quality of the research under way (Vermeulen, 2009). I broadly agree with this view, and yet the focus on scale interpreted as numbers (of locations and resources) may lead to overlooking one key source of continuity between twentieth- and twenty-first-century research: the tendency of biologists to focus on very specific questions and to structure networks, expertises and collaboration around those (a tendency reinforced by current funding regimes through the focus on short-term projects). The long-term aims of contemporary biology and biomedicine involve the integrated understanding of organisms and environment as complex, interrelated systems; but this goal is pursued through division of labour, with myriads of research groups looking at different aspects of biological systems. The curators of large data infrastructures such as TAIR and caBIG are mindful of the fact that their resources need to serve such multiple questions, so as to be of use to as wide-ranging a group of scientists as possible.

Biological research relies on a web of expertises finely tuned to specific research interests, which means that the science that databases attempt to foster is unavoidably a localised and situated affair, not in terms of the quantity and geographical/disciplinary location of researchers involved, but rather in terms of its focus on very specific questions and outputs, which can vary greatly across projects and over time. By contrast, caBIG and TAIR strive to provide overarching infrastructures that serve as many specialised uses of data as possible. This implies that their scale is necessarily bigger than the scale of the science that they support, which tends to remain fragmented into a wide variety of specialised projects. Large-scale infrastructure, as embodied by these two databases, provides a platform where biological queries can be expressed and addressed on a case-by-case basis. This view of databases as platforms for dialogue and collaboration is exemplified by the popularity of the notion of database interoperability, as discussed in the case of caBIG. Whether interoperability works in practice remains an open question: what matters to my analysis is the very existence of this discourse and the ongoing attempts to enforce it.

In closing, it is important to note that emphasising the role of databases in managing and fostering epistemic diversity calls into question the rhetoric of data re-use employed by funders and database curators alike. I have illustrated how facilitating data re-use is extremely complex, requiring multiple displacements: data move from existing research projects to databases and on to new projects, with high potential for misunderstandings across different research loci. Acknowledging this means recognising that making data accessible is a different challenge from making them re-usable. This apparently simple point has often been overlooked by key science funders, such as the NSF, where until recently expectations about data re-use were not backed up by empirical studies of the needs and expectations of actual users.Footnote 10 Many curators strive to identify and serve users' wishes as well as possible, and these efforts are crucial to the success of large-scale databases; yet relatively little systematic and empirically grounded research underpins these intuitions, which is striking, given the level of investment in these resources. Database curators have to reconcile two different sets of expectations held by both users and funders: on the one hand, databases are supposed to circulate data as widely as possible, thus making it possible to conceive of biological data as global commodities, which should be exploited and re-used by researchers across the world; on the other, most biologists continue to view data as highly contingent products of a specific research group interested in addressing specific questions, and struggle with the idea of re-using such data within a different research context, to address new questions. This tension between the current urge to globalise data and the importance of locating data within specific research contexts is crucial to social scientific attempts to understand what is ‘big’ in contemporary life sciences, and what it means to expand the scale of biological research, whether in a geographical, scientific or social sense.