Session 3 – Technical and infrastructure issues

The notes from this session are available.

Welcome to the web forum for this session within the JISC Innovation Forum’s ‘Data’ strand. The (invitation-only) face-to-face component of this forum will be on 16th July, 10:30-12:00. For that to be a success, we need your comments and feedback.

This is the ‘technical’ session, whose objective is to identify the key infrastructure challenges, to share experiences of addressing them, and to prioritise future work in this area. We’ll split the face-to-face session into two parts: a brainstorm, followed by time spent prioritising the challenges identified.

Before the face-to-face session, and to help it go ahead without long explanations of everyone’s current work in the area, please could you share a 100-word project summary outlining how your work or project contributes to an effective technical infrastructure for research data management.

We’ve also prepared some questions to help provoke the brainstorm session, and we’d welcome your reactions to these. They are:
– What are the main infrastructure challenges in your area?
– Who is addressing them? (subject-oriented bodies, research funders, universities, academic departments, JISC, etc?)
– Why are these bodies addressing the main challenges? Might others do better?
– What should be prioritised over the next five years?

Please use the comments facility below to add your information by Thursday 10th July.

Relevant background reports include:
Developing the UK’s e-infrastructure for science and innovation
Publication and Quality Assurance of Research Data Outputs
Dealing with data

Many thanks
David Shotton, Matthew Dovey and Neil Jacobs

12 thoughts on “Session 3 – Technical and infrastructure issues”

  1. Neil Geddes

    Storage and management of data from a large number of scientific projects and facilities, including:
    LHC
    Diamond Light Source
    ISIS Neutron Source
    Space missions
    Biological Sciences
    Also the development and support of tools for access and exploitation of the stored data.

    – What are the main infrastructure challenges in your area?
    – connecting lightweight end user tools, often community developed, with professionally run storage systems
    – sustaining investment in the infrastructure
    – developing affordable systems to cope with the data volumes expected in 5-10 years.
    – Who is addressing them? (subject-oriented bodies, research funders, universities, academic departments, JISC, etc?)
    – no-one
    – Why are these bodies addressing the main challenges? Might others do better?
    – they are not
    – What should be prioritised over the next five years?

  2. Sam Pepler

    I am the curation manager for the British Atmospheric Data Centre and the NERC Earth Observation Data Centre. Both these data centres curate data from NERC funded projects and from other sources.

    – What are the main infrastructure challenges in your area?
    Volume: I am expecting petabytes of data and hundreds of millions of files.
    Ingestion methods.
    View services for data.
    Low level data description methods.
    Usable methods for controlling vocabulary (see the sketch at the end of this comment).

    – Who is addressing them? (subject-oriented bodies, research funders, universities, academic departments, JISC, etc?)
    NERC is funding the NERC DataGrid, which is looking at view services and description methods.
    The Open Geospatial Consortium (OGC) is looking at services for geospatial data.
    BODC is hosting a vocabulary service.

    – Why are these bodies addressing the main challenges? Might others do better?
    These are the bodies that know the issues and have the drivers to address them. There are other bodies which may help, but we can’t deal with all the people all the time.

    – What should be prioritised over the next five years?
    High volume systems.
    Usable tools for ontologies.
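
    To make the controlled-vocabulary point above concrete: here is a minimal sketch, using Python’s rdflib, of the kind of SKOS concept entry a vocabulary service such as BODC’s might serve. The namespace and term below are invented for illustration, not taken from any real service.

        from rdflib import Graph, Literal, Namespace
        from rdflib.namespace import RDF, SKOS

        # Hypothetical namespace; a real vocabulary service publishes its own URI scheme.
        VOCAB = Namespace("http://example.org/vocab/")

        g = Graph()
        term = VOCAB["air_temperature"]
        g.add((term, RDF.type, SKOS.Concept))
        g.add((term, SKOS.prefLabel, Literal("air temperature", lang="en")))
        g.add((term, SKOS.altLabel, Literal("ambient air temperature", lang="en")))
        g.add((term, SKOS.definition, Literal("Temperature of the ambient air.", lang="en")))

        print(g.serialize(format="turtle"))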

  3. Norman Gray

    Several Virtual Observatory projects

    Challenges: what Neil and Sam said, plus, I think, a general perception that data interoperability is important but secondary to the immediate problems of managing an incoming data flood. Community standardisation bodies (OGC, IVOA and others?) help share expertise and avoid reinventing wheels, but are still rather scattered.

    Ontologies and vocabularies are important, too, and will surely become more so.

  4. Mark Hedges

    Until recently I was responsible for technical/infrastructure matters at the AHDS (Arts & Humanities Data Service). This is no longer the case because of the withdrawal of the AHDS’ funding, but I mention it because my thoughts on these things were developed in this environment. For those of you who aren’t acquainted with us, the AHDS managed outputs from arts and humanities research projects in the UK. Now I am at King’s College London, with a similar sort of remit but an institutional focus rather than a subject-based one.

    – What are the main infrastructure challenges in your area? A few:
    Embedding research data management (systems) into researchers’ day-to-day processes, rather than treating them as a separate library/archive-type service.
    Dealing with complexity of research outputs, not just size (AHDS resources were not so big, but could contain a lot of individual inter-related files).
    Increased automation to cope with complexity/size – curators should be involved only when necessary (otherwise they won’t be able to keep up).
    Sustainability – particularly apposite given the AHDS’ demise. The official rationale for this said that IRs would now be able to handle all the outputs, but I am not so sure.
    Ontologies for the description of data.

    – Who is addressing them? (subject-oriented bodies, research funders, universities, academic departments, JISC, etc?)
    To some extent: Research funders, JISC, universities, …

    – Why are these bodies addressing the main challenges? Might others do better?
    To the extent that the issues are being addressed, these are the bodies who have a vested interest in making things work. Not sure who else would do it.

    – What should be prioritised over the next five years?
    See above.

    BTW, the link “Publication and Quality Assurance of Research Data Outputs” is broken.

  5. Mark Hedges

    cont.

    – access management for research data, cross-institutional access, cross-sector/federation access (e.g. between NHS and medical research depts in universities)
    – this is a procedural issue as well as a technical one

  6. Neil Jefferies

    Large data is technically rather easy, as it tends to be uniform and widely reused, so standard ontologies, formats and protocols tend to emerge and the issues are merely those of scale. The much harder problem is the arguably larger volume of data that shares few or none of those characteristics: small, highly specific datasets in proprietary or custom formats.

  7. Robin Rice

    My institution, particularly the Information Services, is trying to address both storage and curation issues for research data just now, though there is clearly a renewed interest from national bodies as well.

    Edinburgh DataShare, http://datashare.edina.ac.uk/dspace/
    is trying to address data sharing/publication of ‘small science’ datasets by UoE researchers through a DSpace repository specifically for data, and our DISC-UK partners, http://www.disc-uk.org/team.html, are enhancing their existing IRs to handle datasets.

    Locally this is a Data Library and Library collaboration, but we are also piloting the Data Audit Framework at UoE, where we have formed a steering committee representing additional institutional stakeholders – research ‘champions’ in different colleges, research office, records manager, Computing Services, and national services (DCC & EDINA).

    Information Services strategy is increasingly gearing towards support for research data management and curation, though it is early days. A Research Computing Survey completed before the data audit work indicated a need for more storage and backup services across the board. The recently purchased SAN is already fully committed and, as I understand it, IT services are looking at purchasing more space. I believe that temporary file storage space is also needed for iterative ‘work’ files for data in the process of being analysed.

    Our Data Sharing Continuum diagram, http://www.disc-uk.org/docs/data_sharing_continuum.pdf
    shows how we think that the low-level storage problems need to be addressed before higher level curation and sharing is likely to take place.

  8. Rob Sanderson

    (foresite/ORE hat on) The foresite project deals with a large amount of data: 2 million full-text articles with images and metadata from JSTOR.org. We use ORE to provide machine-readable descriptions of the journals/issues/articles as aggregations of web content.

    ORE as infrastructure hits several key points: the Atom serialization is in line with current methods of exposing collections and results (e.g. GData, OpenSearch), and at the same time the use of RDF and a graph model underneath provides an easy entry into the Semantic Web world.
    It is easy to parse and merge different descriptions of objects using the graph model, which is essential in fully analyzing scholarly communication (cf MESUR project).
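
    As a minimal sketch of that kind of merge, using Python’s rdflib (the URIs below are placeholders, not real foresite identifiers): parsing two independent descriptions into one graph yields their union.

        from rdflib import Graph

        # Two independently produced descriptions of the same aggregation.
        desc_a = """
        @prefix ore: <http://www.openarchives.org/ore/terms/> .
        <http://example.org/article/1> a ore:Aggregation ;
            ore:aggregates <http://example.org/article/1/fulltext.pdf> .
        """
        desc_b = """
        @prefix ore: <http://www.openarchives.org/ore/terms/> .
        @prefix dc:  <http://purl.org/dc/elements/1.1/> .
        <http://example.org/article/1> dc:title "An Example Article" ;
            ore:aggregates <http://example.org/article/1/figure1.png> .
        """

        # Parsing both into one Graph merges the triples: the result carries
        # every statement either source made about the article.
        g = Graph()
        g.parse(data=desc_a, format="turtle")
        g.parse(data=desc_b, format="turtle")
        print(g.serialize(format="turtle"))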

    The use of competing ontologies is always problematic, especially for bibliographic material. The exchange of information between repository systems and the very use of repository systems at all are areas of concern.

  9. Chris Rusbridge

    Comments in another stream have alluded to the controversy over whether institutional repositories or subject (or national-scale) repositories should curate science data. While subject repositories may be the best answer (although the fate of the AHDS suggests we should think carefully about this), there are clearly not enough SRs to go round all the available subjects. Therefore, I think, we have to find ways of extending institutional repositories to support data, especially the kinds of small science data (e.g. supplementary files for journal articles) that NJ mentions, while at the same time filling the curation gap through finding partnerships with domain scientists. I put this under infrastructure rather than skills, because expanding infrastructure through opening the IRs to data is the prize.

  10. Julia Chruszcz

    I am currently a member of the project team commissioned to undertake the UK Research Data Service (UKRDS) feasibility study. UKRDS is a joint project between RLUK and RUGIT. It is funded by HEFCE under its Shared Services programme, with support from JISC. The objective of the study is to assess the feasibility of providing a framework of standards and procedures to encourage researchers to submit their valuable data, and the costs of developing and maintaining a national shared digital research data service for the UK Higher Education sector. Such a research data service is seen by the project sponsors as forming a crucial component of the UK’s e-infrastructure for research and innovation, and one which will add significantly to the UK’s global competitiveness.
    The feasibility phase of the project is due to report in August, and it will fully recognise existing services, infrastructure and expertise.
    Differences across disciplines are marked and I would like to explore the issues and opportunities that arise from this more fully.

  11. David Giaretta

    I’m in the DCC (JISC) and project director of CASPAR and PARSE.Insight (EU projects), all dealing with issues of digital preservation. I also lead the working group which aims to produce the ISO standard for Audit and Certification of digital repositories.

    – What are the main infrastructure challenges in your area?
    The need for infrastructure to help share the effort and costs of ensuring that digitally encoded information remains available and usable for future generations; such infrastructure must also deal with DRM, authenticity etc., and must itself be auditable and persistable in some sense.
    It should be noted that preservation in OAIS terms requires understandability and usability, and this can be tied in with the interoperability needed right now. My work in the DCC, CASPAR and PARSE.Insight addresses various aspects of these needs.

    – Who is addressing them? (subject-oriented bodies, research funders, universities, academic departments, JISC, etc?)
    Piecemeal efforts in many areas but overall there seems to be (1) a lack of focus and (2) a tendency to ignore data and the necessary semantics associated with it.
    In addition to UK efforts there are of course EU efforts which should be integrated, in particular the e-Infrastructure work of which PARSE.Insight is part. The Alliance for Permanent Access (http://www.alliancepermanentaccess.eu) is another contributor to all this by bringing together major players across Europe.

    – Why are these bodies addressing the main challenges? Might others do better?
    This is an area in which clearly many need to contribute, but there needs to be a Roadmap which can show how all these efforts can be accommodated. I guess that’s what I like to think I’ve been working on.

    – What should be prioritised over the next five years?
    The Warwick workshop (http://www.dcc.ac.uk/events/warwick_2005/Warwick_Workshop_report.pdf) had a list of priorities which are pretty good and which were picked up by the OSI Preservation and Curation WG (http://www.nesc.ac.uk/documents/OSI/preservation.pdf).

  12. David Shotton

    I sincerely regret that I am unable to attend this JISC meeting as planned. The primary idea I wish to share is the following:

    ** Data Conditioning **

    At present, the ‘activation energy’ required to bridge between the data environment of the average bench researcher and the metadata requirements of the average institutional repository is so great that submission of research datasets by researchers to institutional repositories will never happen to any significant extent.

    To overcome this barrier, I advocate JISC-funded research to investigate ‘data conditioning’, which I envision as having three stages:

    (a) An expert in knowledge management with research domain knowledge establishes a secure local (e.g. departmental) repository to which a researcher can save datasets in the formats she currently uses (e.g. raw Excel spreadsheets with minimal metadata). The version of Fedora packaged as a virtual machine by Ben O’Steen of the Oxford Research Archive is a suitable platform for such a local repository.

    (b) The researcher is then encouraged to submit data regularly to this local repository, over which she feels a sense of ownership and control. This achieves the goal of ‘sheer curation’ (http://en.wikipedia.org/wiki/Sheer_curation), ensuring routine capture of research data without imposing a large cognitive overhead on the researcher. This is done in ways that enable easy and as far as possible automated enrichment of the data by essential metadata in conformity with agreed vocabularies, using processes previously worked out in collaboration with the researcher. For example, personal details could be automatically extracted from institutional personnel records and/or the Research Councils’ Je-S database of academics (the data within which was provided by institutions in the first place).
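
    A hypothetical sketch of this enrichment step, in Python; the personnel table, username and metadata field names are all invented for illustration rather than drawn from any real institutional system.

        # Sketch of automated metadata enrichment at deposit time. A real system
        # would query institutional personnel records or the Je-S database instead
        # of this hard-coded table.
        PERSONNEL = {
            "jbloggs": {"name": "Joanna Bloggs",
                        "department": "Department of Zoology"},
        }

        def enrich(metadata, depositor_username):
            """Fill in creator details the researcher should not have to retype."""
            person = PERSONNEL[depositor_username]
            enriched = dict(metadata)
            enriched.setdefault("dc:creator", person["name"])
            enriched.setdefault("dc:description.department", person["department"])
            return enriched

        print(enrich({"dc:title": "Field measurements, June 2008"}, "jbloggs"))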

    (c) When the researcher is ready, the conditioned datasets, in formats and with metadata suitable for the institutional repository, can then be automatically uploaded, without further effort, to the institutional repository for long-term preservation and (optionally, at the researcher’s say-so) for publication. Such ingest could be done in bulk, and could use OAI-ORE packaging, as sketched below.
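
    A minimal sketch of such packaging, using Python’s rdflib and the OAI-ORE vocabulary (the URIs and filenames are placeholders):

        from rdflib import Graph, Literal, Namespace, URIRef
        from rdflib.namespace import DC, RDF

        ORE = Namespace("http://www.openarchives.org/ore/terms/")

        # Describe one conditioned dataset (placeholder URIs) as an ORE
        # aggregation, ready for bulk transfer to the institutional repository.
        g = Graph()
        agg = URIRef("http://example.org/local-repo/dataset/42#aggregation")
        g.add((agg, RDF.type, ORE.Aggregation))
        g.add((agg, DC.title, Literal("Conditioned dataset 42")))
        for part in ("data.csv", "metadata.xml", "readme.txt"):
            g.add((agg, ORE.aggregates,
                   URIRef("http://example.org/local-repo/dataset/42/" + part)))

        g.serialize(destination="dataset42_rem.rdf", format="xml")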

    A diagram showing the increasing ease of repository submission in response to increasing amounts of data pre-conditioning is available at http://tinyurl.com/6o5f9n

    Have a good meeting!

    David
