A tremendous amount of effort has gone into Open Source web content management systems (Web CMS) in the last decade. Blogging software like WordPress is ubiquitous, while full-blown Web CMS stacks can be found under bizarre names like Drupal, Joomla and Plone. Not that long ago, expensive proprietary systems dominated the Web CMS market, but the Open Source model has proved very successful at creating full-featured, secure CMS tools with wildly enthusiastic developer communities.
What is the possibility that Open Source CMS tools could be used to manage not just documentation, but scientific data as well?
In general, CMS tools are targeted at managing documentation created within an organization for either internal consumption or distribution to the external world. Two important concepts lie at the heart of these tools: 1) role-based authorization (who is allowed to do what), which reflects the hierarchy of an organization; and 2) workflow (who signs off on what), which reflects the flow of information within an organization.
Both of these concepts also apply to the collection and management of scientific data. In many, if not most, projects, data collection and initial cleanup are the responsibility of someone who got funding specifically for that purpose. There is often a very strong sense of ownership of the data by those who collected it. In academic and agency science there is also a general fear that data may be misinterpreted by those who don’t understand the intricacies of sample design, accuracy, outlier detection, etc. (For the most part these fears are quite well founded. Misinterpretation of data in the wider public discourse is rampant.)
Those individuals responsible for initial data collection are usually willing to publish their data to other scientists and data managers in their field, but not to the general public. Data may then go through additional stages of cleanup and harmonization to make it consistent with data collected by others. Large collections of vetted data are then made available to individuals higher up the organizational chart with less field-specific and more managerial knowledge. Eventually, perhaps, the data may even be made available to the general public.
There is often an informal process for managing the flow of scientific data through an organization. Even where a formal process exists it typically involves phone calls and emails and a certain amount of trust that data will remain confidential until there is consensus to make it public.
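To make the idea concrete, here is a minimal sketch of how the stages described above could be written down as an explicit workflow with role-based permissions. It is plain Python, not Plone's workflow API, and every state, action, and role name (`private`, `peer_shared`, `collector`, `program_manager`, and so on) is an assumption chosen for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical states, actions, and roles, loosely named after the stages
# described above. Maps (state, action) -> (next state, role allowed to act).
TRANSITIONS = {
    ("private", "share_with_peers"): ("peer_shared", "collector"),
    ("peer_shared", "approve_vetted"): ("vetted", "data_manager"),
    ("vetted", "publish"): ("public", "program_manager"),
}

# Roles allowed to read a dataset in each state (also hypothetical).
READERS = {
    "private": {"collector"},
    "peer_shared": {"collector", "scientist", "data_manager"},
    "vetted": {"collector", "scientist", "data_manager", "program_manager"},
    "public": {"collector", "scientist", "data_manager", "program_manager",
               "anonymous"},
}


@dataclass
class Dataset:
    name: str
    state: str = "private"  # new data starts out visible only to its collector
    history: list = field(default_factory=list)

    def can_view(self, role: str) -> bool:
        """Return True if the given role may read the data in its current state."""
        return role in READERS[self.state]

    def apply(self, action: str, role: str) -> None:
        """Move the dataset to its next state if the action and role are allowed."""
        key = (self.state, action)
        if key not in TRANSITIONS:
            raise ValueError(f"'{action}' is not valid from state '{self.state}'")
        next_state, required_role = TRANSITIONS[key]
        if role != required_role:
            raise PermissionError(f"'{action}' requires role '{required_role}'")
        # Record who signed off on what, so the trail outlives phone calls and email.
        self.history.append((self.state, action, role))
        self.state = next_state


if __name__ == "__main__":
    ds = Dataset("2012 field samples")
    ds.apply("share_with_peers", "collector")
    print(ds.state, ds.can_view("anonymous"))
```

The point of the sketch is only that sign-off rules and visibility rules become data that lives in the code base rather than in anyone's inbox.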
It sounds like scientific data management could benefit from all the thinking about authorization and workflow that has gone into some of the Open Source Web CMS tools.
Of these tools, Plone appears the most appropriate for the task for the following reasons:
- It is written in the Python scripting language, which has seen widespread adoption in various scientific fields. (See scipy.org for examples.)
- It has very strong support for authorization and workflow.
- It has a very enthusiastic and knowledgeable developer community, many of whom come from academic backgrounds.
So the goal would be to harness the infrastructure that Plone provides for documentation and use it to work with data. Issues related to authorized access, workflow, data cleanup and harmonization could then be captured in a code base that is independent of the specific people working on the project. This kind of institutionalization is necessary for any scientific project that needs to outlast the individuals with whom the project began.
One of the truisms we have come to accept is that we have never had an original idea. On rare occasions we have an idea that no one has yet acted upon, but this is not the case with respect to Plone and scientific data management.
We are pleased to report that a small Seattle company, Sound Data Management, has already made progress in this area. Their first full project, Hydra, involves data from hydrophone arrays in Puget Sound. From the Hydra page it is clear why such a system is needed:
Around the Pacific Northwest, researchers from a variety of federal and local agencies, universities, and tribes in aggregate are using several hundred hydrophones to conduct research studies on movement patterns of aquatic animals. Each program is characterized by numerous tagged animals that move and a relatively limited number of acoustic receivers that are located to address a significant question for individual programs. Importantly, these tagged animals move over larger domains than individual receiver arrays. These researchers have recognized the value of coordinating placement of hydrophones to improve their collective listening capability and ability to address emergent, larger-scale management questions. Researchers needed the ability to efficiently share detections of each others tag codes to enable the larger research collaboration. Hydra was developed to facilitate data sharing and research coordination for these researchers.
More importantly, it is clear that the folks at Sound Data Management understand the scientific data collection and management process. Right on the Hydra front page they state their data principles:
- Hydra ensures data integrity as it parses data directly from hydrophone-generated files to a relational database via an automated protocol.
- Hydra securely archives each hydrophone file and makes the archive available to the owner of the file.
- Tag detections are immediately available for view or download the moment the receiver file is uploaded. Tag codes are by default private. Unless a tag owner chooses to share their data, they are solely able to access detections of their tag codes.
- Hydra respects the diverse data sharing needs of individual researchers. Hydra is designed to facilitate data sharing in that tag code detections can be shared between research partners, agencies, or made public with the click of a button. Yet each research partner controls if and when they share data; Hydra has no data sharing or publishing requirement.
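The sharing rules in these principles can be illustrated with a small sketch. This is not Hydra's code or data model, which we have not seen; the `TagRegistry` class, its method names, and the tag code and user names in the usage example are all hypothetical. It only shows the idea of detections that are private by default and shared at the tag owner's discretion.

```python
class TagRegistry:
    """Toy model of default-private tag detections with owner-controlled sharing."""

    def __init__(self):
        self._owner = {}        # tag code -> owning researcher
        self._shared_with = {}  # tag code -> set of partners granted access
        self._public = set()    # tag codes the owner has made public
        self._detections = []   # (tag_code, receiver_id, timestamp) tuples

    def register_tag(self, tag_code, owner):
        self._owner[tag_code] = owner
        self._shared_with[tag_code] = set()

    def add_detection(self, tag_code, receiver_id, timestamp):
        # Any receiver may record any tag; visibility is decided at read time.
        self._detections.append((tag_code, receiver_id, timestamp))

    def _require_owner(self, tag_code, user):
        if tag_code not in self._owner or self._owner[tag_code] != user:
            raise PermissionError(f"{user} does not own tag {tag_code}")

    def share(self, tag_code, owner, partner):
        """Owner grants one partner access to detections of this tag."""
        self._require_owner(tag_code, owner)
        self._shared_with[tag_code].add(partner)

    def make_public(self, tag_code, owner):
        """Owner opens detections of this tag to everyone."""
        self._require_owner(tag_code, owner)
        self._public.add(tag_code)

    def can_view(self, tag_code, user):
        if tag_code in self._public:
            return True
        if tag_code in self._owner and self._owner[tag_code] == user:
            return True
        return user in self._shared_with.get(tag_code, set())

    def detections_for(self, user):
        """Return only the detections this user is allowed to see."""
        return [d for d in self._detections if self.can_view(d[0], user)]


if __name__ == "__main__":
    reg = TagRegistry()
    reg.register_tag("TAG-001", "alice")
    reg.add_detection("TAG-001", "receiver-7", "2012-05-01T10:00:00")
    print(len(reg.detections_for("bob")))   # bob sees nothing until alice shares
    reg.share("TAG-001", "alice", "bob")
    print(len(reg.detections_for("bob")))
```

Note that nothing in the sketch forces an owner to share: access changes only when the owner calls `share` or `make_public`, which mirrors the stated principle that there is no data sharing or publishing requirement.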
Data management, data ownership, data sharing. All these are only part of the diverse needs of scientists with respect to data management. Far too often in the past we have seen science projects get funded because they support the needs of the software engineering community.
We congratulate Sound Data Management on creating a software system that is in service to the needs of science rather than the other way around.