DataSHIELD Detailed Overview
1. What is DataSHIELD, and why use it?
DataSHIELD is open-source software for the remote/federated non-disclosive analysis of biomedical, healthcare and social-science data. Crucially, the data remain secure behind the firewalls on the system where they usually reside, and under the complete control of their primary custodian(s). Analytical commands are sent to the data rather than the data being sent to the analyst. Analysts are prevented from seeing or copying the individual-level data (microdata) that underpin the analyses required, and yet those same analyses are typically fully efficient from a statistical perspective. Embedded privacy-protection traps actively guard against, and facilitate detection of, attempts to use analytic results to identify individual data subjects or to infer the value of particular variables in given subjects. DataSHIELD therefore facilitates the analysis of microdata when their sharing might otherwise be restricted or prohibited for reasons of ethico-legal or personal sensitivity, and/or commercial or intellectual property value. It can also be useful when the underlying data objects are too large to share physically. DataSHIELD is used in both research and healthcare settings and can be applied to a wide range of data types. Its analytic capability has recently been extended to encompass high-volume ‘omics data and other large datasets. Most extant use-cases involve federated multi-study co-analysis across international boundaries (www.datashield.ac.uk/about/whousesdatashield) but DataSHIELD also supports the privacy-protected analysis of data from a single source (e.g. Oluwagbemigun, K. et al., 2019 [DOI:10.1093/jn/nxz194]).
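As a rough illustration of what a privacy-protection trap can look like, the sketch below shows a server-side summary function that returns only an aggregate and refuses to run on very small subsets. It is a minimal Python sketch, not DataSHIELD's actual implementation: the function name safe_mean and the threshold value MIN_CELL_COUNT are hypothetical, and in practice thresholds are agreed with the data custodian.

```python
from statistics import mean

# Hypothetical disclosure threshold; the real value is set by the data custodian.
MIN_CELL_COUNT = 3


def safe_mean(values, min_count=MIN_CELL_COUNT):
    """Return (mean, n) for a subset, or refuse if the subset is too small.

    Only the aggregate leaves the server; the individual values never do.
    """
    n = len(values)
    if n < min_count:
        # A mean over one or two people could reveal an individual's value.
        raise PermissionError(
            f"subset of size {n} is below the disclosure threshold of {min_count}"
        )
    return mean(values), n
```

A request such as safe_mean(ages_of_one_person) fails rather than returning a value that would identify that person, while the same request over a sufficiently large group succeeds.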
Figure 1: DataSHIELD Deployment Architecture
DataSHIELD is based on a RESTful client-server architecture (Figure 1) with its functionality encoded in R and Java. A DataSHIELD client node connects to at least one DataSHIELD server node (one at each data source) through DataSHIELD-compliant middleware embedded in Opal (OBiBa's core open-source tooled data warehouse https://cran.datashield.org/web/) or in another equivalent system. Where analyses are to be pooled across multiple sources, DataSHIELD offers two complementary approaches. (1) A full-likelihood-based individual person data (IPD) methodology, which generates the same results as if the data from all sources were physically transferred to a central warehouse and analysed jointly. This may be called “virtual IPD” because the data are effectively analysed on an individual person basis, but without physically moving them (even transiently) from their usual trusted repository. (2) Centrally commanded study level meta-analysis (SLMA), sometimes called federated meta-analysis. This is equivalent to undertaking the required analysis in each study separately and then combining the resultant estimates and standard errors using conventional study level meta-analysis methods, based on either fixed or random effects. The capacity to work with either IPD or SLMA optimizes analytic flexibility and, because the analysis is directed entirely by an analyst working on the DataSHIELD client, there is no need to wait for individual studies to respond in person to analytic questions and/or requests. Because the need for back-and-forth human interaction is negligible, the DataSHIELD approach can be far less time-consuming than a standard contemporary consortium-based SLMA.
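The difference between the two pooling approaches can be sketched with a deliberately simplified case: a straight-line regression of y on x. This is an illustrative Python sketch only, using closed-form least squares in place of the iterative full-likelihood fitting described above; all function names are hypothetical. In the virtual IPD route each site returns only sums, which the client adds together to recover exactly the slope a centralised analysis would give. In the SLMA route each site fits its own model and the client combines the estimates with fixed-effect inverse-variance weights.

```python
import math


def site_summaries(x, y):
    """Server side (virtual IPD): return only sums, never the rows themselves."""
    return {
        "n": len(x),
        "sx": sum(x),
        "sy": sum(y),
        "sxx": sum(a * a for a in x),
        "sxy": sum(a * b for a, b in zip(x, y)),
    }


def pooled_slope(summaries):
    """Client side (virtual IPD): add the per-site sums, then solve for the slope."""
    n = sum(s["n"] for s in summaries)
    sx = sum(s["sx"] for s in summaries)
    sy = sum(s["sy"] for s in summaries)
    sxx = sum(s["sxx"] for s in summaries)
    sxy = sum(s["sxy"] for s in summaries)
    return (n * sxy - sx * sy) / (n * sxx - sx * sx)


def site_estimate(x, y):
    """Server side (SLMA): fit the model locally, return (slope, standard error)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    return slope, math.sqrt(rss / (n - 2) / sxx)


def fixed_effect_meta(estimates):
    """Client side (SLMA): inverse-variance weighted average of site estimates."""
    weights = [1.0 / se ** 2 for _, se in estimates]
    pooled = sum(w * b for w, (b, _) in zip(weights, estimates)) / sum(weights)
    return pooled, math.sqrt(1.0 / sum(weights))


# Two made-up sites, for illustration only.
site_a = ([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
site_b = ([2, 4, 6, 8, 10], [4.2, 8.1, 11.9, 16.3, 19.8])

ipd_slope = pooled_slope([site_summaries(*site_a), site_summaries(*site_b)])
slma_slope, slma_se = fixed_effect_meta([site_estimate(*site_a), site_estimate(*site_b)])
```

In both routes the client never receives an individual row. The IPD route reproduces the centralised answer, whereas the SLMA route gives a weighted compromise between the site-specific answers.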
To illustrate, consider the commonly encountered need to extend the modelling underpinning a genome-wide association analysis (GWA) to include interactions between significant genomic factors identified in the first stage of the GWA and a set of key demographic or environmental determinants. In a conventional consortium-based SLMA, unless the particular set of interactions required had been correctly guessed beforehand, the centre coordinating the analysis will have to ask every study to undertake additional analyses for every interaction required. Under DataSHIELD, the analyst can instead fit the additional models directly from the client.
The OBiBa middleware that comprises the middle layer (most commonly Opal at the present time) is fundamental to DataSHIELD. It is responsible for: (1) defining the DataSHIELD configuration (e.g. the particular set of server-side ‘assignment’ and ‘aggregation’ R functions, and the R options, that the data custodian has agreed may be applied to their data); (2) authenticating each DataSHIELD user's identity; (3) authorizing access to each data source and to R services; (4) authorizing the execution of individual server-side R functions (all function calls are evaluated by an updateable parser, ensuring that only allowed functions with valid arguments are enacted); (5) managing DataSHIELD R server sessions; (6) assigning data to a DataSHIELD R server session; (7) returning aggregated results computed in the DataSHIELD R server session; (8) handling runtime errors.
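Responsibility (4) above, the checking of every function call against the custodian's configuration, can be sketched as follows. This is a simplified Python illustration and does not reproduce Opal's actual parser: the function names in ALLOWED_FUNCTIONS are hypothetical, and the argument grammar (plain symbols such as D$age, or numbers) is an assumption made to keep the example short.

```python
import re

# Hypothetical configuration agreed by a data custodian: function name -> kind.
ALLOWED_FUNCTIONS = {
    "mean_summary": "aggregate",
    "count_summary": "aggregate",
    "log_transform": "assign",
}

# A call is a name followed by a flat argument list; nested calls are rejected.
CALL_PATTERN = re.compile(r"([A-Za-z][A-Za-z0-9_.]*)\(([^()]*)\)")
# An argument is either a symbol (optionally with $ column access) or a number.
ARG_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_.$]*|-?\d+(\.\d+)?")


def validate_call(call, kind):
    """Return True only if `call` names an allowed function of the requested
    kind ('assign' or 'aggregate') and every argument is a symbol or number."""
    match = CALL_PATTERN.fullmatch(call.strip())
    if match is None:
        return False
    name, arg_text = match.groups()
    if ALLOWED_FUNCTIONS.get(name) != kind:
        return False
    args = [a.strip() for a in arg_text.split(",")] if arg_text.strip() else []
    return all(ARG_PATTERN.fullmatch(a) for a in args)
```

For example, validate_call("mean_summary(D$age)", "aggregate") is accepted, whereas a call to an unlisted function, a call of the wrong kind, or a call that smuggles in a nested expression is rejected before anything is executed.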
The connection between the DataSHIELD R client and the DataSHIELD middleware is made through the DataSHIELD Interface (DSI), an R API that provides an abstract interface which can in principle be used to connect to a wide variety of data repositories. Building on this capability, OBiBa has recently introduced another radical extension to the middle layer, namely the concept of "reference to a resource". In this context, a resource is any DataSHIELD-compliant data repository to which Opal can connect. This may be a dataset (e.g. a file stored in the local file system or in a file store server, a database table, etc.) or a server with some computation capabilities. This functionality allows flexible delegation of the data connection to each separate R server and enables any given set of data to be stored or processed with the most appropriate technology. In consequence, DataSHIELD can now be applied to data in a wide range of formats that may be held locally on the server itself, or on remote resources referred to by a URL. Perhaps most crucially, this means that DataSHIELD can now be applied to high-volume ‘omics data held in standard formats such as VCF.
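The idea of a "reference to a resource" can be pictured as a small descriptor (a name, a URL and a format) that the server resolves into a connection using whichever technology suits the data. The Python sketch below is purely conceptual and is not the OBiBa implementation: the Resource class, the describe_connection function, the URL schemes handled and the example URLs are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Resource:
    """Hypothetical descriptor: where the data live and how they are encoded."""
    name: str
    url: str
    format: str


def describe_connection(resource):
    """Pick a connection strategy from the URL scheme (illustrative mapping)."""
    scheme = urlparse(resource.url).scheme
    if scheme in ("", "file"):
        location = "local file"
    elif scheme in ("http", "https"):
        location = "remote file"
    elif scheme in ("mysql", "postgresql"):
        location = "database table"
    elif scheme == "ssh":
        location = "compute server"
    else:
        raise ValueError(f"no resolver registered for scheme '{scheme}'")
    return f"{resource.name}: {location} ({resource.format})"


# Made-up examples: a local VCF file and a remote database table.
genotypes = Resource("genotypes", "file:///data/cohort.vcf", "VCF")
participants = Resource("participants", "postgresql://db.example.org/study/participants", "table")
```

The analyst refers only to the resource name; how and where the data are actually read is decided on the server side.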
2. Funding to date, and the development of functionality
Major grants that have funded DataSHIELD development to date are listed on the DataSHIELD website (www.datashield.ac.uk/research/grants/). Initial development focused on proof-of-principle but, once this had been demonstrated (www.datashield.ac.uk/research/publications/), emphasis shifted to developing reliable functionality meeting the needs of our growing user base. Both DataSHIELD and Opal now present an extensive range of functions which are continuously being updated and extended. At present DataSHIELD provides >100 client-side functions and >85 server-side functions (see cran.datashield.org/web/). Leading up to the version 5.0 release (September 2019), the project focused on ensuring all functions had been subject to rigorous quality assurance and were comprehensively documented. A continuous integration testing infrastructure was also implemented. This has proven crucial, as the rate at which pre-existing functions are modified and new functions are created has greatly increased.
Until recently DataSHIELD development primarily targeted functionality for managing, transforming, visualising and securely analysing individual-level data from large-scale epidemiological cohort studies that could not otherwise be shared. The diversity of application, both geographically and in terms of health topic, is emphasised by the growing list of publications referring to the application of DataSHIELD to real-world problems (see www.datashield.ac.uk/research/publications/#application). We have been particularly pleased by the recent publication of papers which include no members of the primary development team based at Newcastle University, objectively demonstrating a reduced need for help and support. Another key development over the last two years has been the exploration and application of infrastructure and methods to enable DataSHIELD to be applied to Health Service Data. This is an important arena for future application and a number of additional governance considerations apply. This work has been undertaken by collaborators in Germany (e.g. Zoeller et al., 2018 [arXiv:1803.00422]; Gruendner et al., 2019 [DOI:10.1371/journal.pone.0223010]; and Gruendner et al., 2019 [DOI:10.3233/978-1-61499-959-1-115]) and in the UK via Connected Health Cities North East and North Cumbria. With the new ‘reference to a resource’ capability facilitating analysis of high-volume ‘omics datasets, an active new group of European developers has joined the DataSHIELD community with a focus on various classes of ‘omics research. With the increasing complexity of the project we have had to ensure that we maintain a focus on the milestones and deliverables of the principal grants that are actually funding DataSHIELD development, and we have an updatable road map to track our funded extant plans and progress (http://bit.ly/DS-roadmap).
3. Data preparation and data governance: non-DataSHIELD pre-requisites for an analysis under DataSHIELD
Although DataSHIELD currently provides a unique approach to the analysis of sensitive data, it is necessarily subject to the same set of pre-requisites that are faced by any valid approach to the analysis or joint co-analysis of research data, particularly sensitive health or social research data. Thus, any data to be processed via any method, including DataSHIELD, must satisfy two fundamental pre-requisites: (1) the data must be appropriately prepared from a scientific perspective (e.g. quality assured, cleaned and, if necessary, harmonized to ensure that corresponding data from different sources are inferentially equivalent); (2) all uses of the data under the analysis proposed must meet the relevant criteria specified under whatever data governance jurisdiction applies. Although these issues are in principle independent of the decision to use DataSHIELD, they are so important that we have included this section to address them. As touched on below, we anticipate that in the future functionality to address key aspects of both issues will be built into DataSHIELD, but this is currently a work in progress.
The extent to which data must be prepared from a scientific perspective before commencing any analysis is entirely context specific. Nonetheless, sound scientific practice clearly dictates that, regardless of what setting may apply, one should never embark on analysis or co-analysis via any method, including DataSHIELD, if the preparation is inadequate. Even a perfect analysis of imperfect data will lead to at best unreliable, and at worst misleading, inferences and conclusions. The preparation of a project and the data it will use from the perspective of data governance is no less important. Almost all contemporary uses of personal data, including analyses based on DataSHIELD, are subject to well thought through and rigorously applied rules, laws and evaluation mechanisms relating to data/information governance. The particular frameworks that apply in any given setting vary internationally, but the legislative structures and other rules that apply in most well-developed jurisdictions address a similar set of key principles. In Europe many of these are encapsulated in the GDPR (General Data Protection Regulation). For these reasons, any analysis of health or social data, including analysis using DataSHIELD, must be able to demonstrate that it has the required legal, ethical and other data access permissions in place before the analysis starts. These may include: ethical approval if required; permission from a study or consortium-based Data Access Committee; and confirmation of the legal basis for data processing from the relevant governing authority, which may include the academic/research institution under which the data have been collected. Moving forwards, one of the planned extensions of DataSHIELD is to encode key data governance rights and obligations into data sharing agreements between the parties involved. The OBiBa middleware and DataSHIELD platform will be extended to permit their dynamic reconfiguration.
When data sharing agreements change, this will permit the OBiBa middleware and DataSHIELD to be reconfigured automatically, thereby enforcing the new governance structure. This infrastructure can then help streamline data access where formal data sharing agreements already exist. We also plan to work towards a system of research passporting, wherein following initial review of bona fides, specific individuals and teams may be awarded permission-in-principle to work with data from a particular source, for a defined range of purposes, via an agreed mechanism (e.g. via DataSHIELD with specified disclosure thresholds), provided the particular use-case being considered meets a range of agreed criteria that have been formally defined. As well as streamlining legitimate analysis of data, such an infrastructure would also provide a powerful sanctioning mechanism should any individuals, groups or institutions violate the terms of their data governance agreements. Development work with these aims in mind has already been started under the SME Arjuna, one of the commercial collaborative partners of the DataSHIELD project. In addition, once a viable and sustainable funding mechanism for DataSHIELD has been put in place, we anticipate that support and advice relating to data governance will be one of the services that can form part of an agreed service contract, should that be requested by the user.
4. Project governance and sustainability
As the DataSHIELD project has progressed, the underlying concept and software it offers have proven to be increasingly attractive to a wide range of current and prospective users. Over the last three years in particular, interest has grown very rapidly, forcing DataSHIELD to transition from a small self-governing research software project with limited ambitions to a much larger enterprise with an active world-wide community of adopters, users, contributors and committers (as defined by our current governance policy http://bit.ly/DS-governance-policy). Although we are delighted by this success, it has introduced growing pains requiring urgent therapy. These reflect fundamental challenges relating to governance and sustainability that can only be addressed by restructuring the project as a whole. Presently the Principal Investigator sits as a ‘benevolent dictator’ as per the current governance model (Figure 2).
Figure 2: DataSHIELD Project Governance Model
For the future scientific and technological wellbeing of the project, it is now crucial that project governance transitions from the 'benevolent dictator' model to a community-driven meritocratic model overseen by some form of consortium Steering Committee. This change is essential if we are to continue to achieve scalable and sustainable engagement and strategic input across the increasingly global DataSHIELD community. At the same time, we need to ensure sustainability of resource from the perspective of on-going funding. To date we have necessarily relied on funding via a series of traditional research grants. But as the project has matured, it has become increasingly difficult to persuade traditional research funders that DataSHIELD development is a “research” activity. We have therefore for some time been targeting infrastructural development funding and, working with the broader DataSHIELD and OBiBa communities, we continue with this quest. In parallel, however, we have also recognised that we need to think commercially.
In the longer term, if DataSHIELD is to remain a viable and innovative product, its development costs will have to be covered by fees (in some sense) raised from those who wish to use it. Because DataSHIELD, Opal, R and Java are all open-source products, we see a primary route to user-derived resourcing as being based on provision of training, consultancy and support for implementation, use and data governance, coupled with a capacity to provide targeted extension of functionality to projects with particularly urgent needs. This could all be wrapped up in a package of different service contracts with pricing determined by the level of the service provided, and the nature of the user (e.g. bona fide academic users vs health service users vs fully commercial pharmaceutical or biotechnology users). At the same time, it may be possible to develop some specialized add-ons to DataSHIELD as fully commercial products under licences that are more commercially permissive than our current licence (GNU GPLv3). For example, it has been proposed that this may be one route forward to resource the development of an easy-to-use Graphical User Interface.
In the shorter term, we now believe that we should consider a move towards a more commercial philosophy relatively quickly. In light, in particular, of the infrastructural flexibility that has just been introduced through the “resources” capability, we believe that now is a perfect time to start exploring potential commercial interest in developing and supporting both a freeware-based “Community Edition” and a comprehensively supported “Professional Edition”. This would allow us to invest properly in the development of new large-scale applications for high-throughput ‘omics and in health and social care, including the pharmaceutical industry.