Session 3: Technical and infrastructure issues

Audio from the session
[audio:http://www.jisc.ac.uk/media/avfiles/events/2008/07/session3b.mp3]
To download the MP3, use the link above.

This session took the form of a brainstorm, with some of the main discussions tackling issues arising from the sheer variety of ‘data’ (from raw to processed, workflows and protocols, and between disciplines, methodologies and techniques). We asked: why are we keeping it? Do we share now for reuse, preserve for the long term, or make available to validate results?

Some of the main themes included:

  • within a research environment – can we facilitate data curation using the carrot of sharing systems? That is, can we build IT systems that both support lab (or fieldwork, etc) data curation and controlled sharing where appropriate?
  • additional context beyond the metadata: data provenance is essential, and perhaps wider than previously considered. It could include, for example: a ‘make’ file containing the data and its transformations from ‘raw’ to ‘processed’ (see the sketch after this list); and an account of relevant methodologies and hypotheses, perhaps via the original proposal and related documents. This implies join-up between research information and management systems and virtual research environments/researchers’ desktops.
  • how do we help institutions understand their infrastructural needs: what are the requirements for institutions (either HEIs or Data Centres, or even hybrid approaches) taking responsibility for curating research data?
  • what has to happen with the various archive systems (fedora etc) to help them cope with research data curation, while retaining a link with the library and institutional systems? It might be useful to see whether some common requirements can be documented for IT systems for data curation.
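
To make the ‘make’ file idea above a little more concrete, here is a minimal sketch (in Python) of what recording each transformation from ‘raw’ to ‘processed’ alongside the data might look like. It is only an illustration: the file names, the single ‘calibrate’ step and the JSON provenance log are all hypothetical, not a description of any existing system.

```python
# A minimal sketch of the 'make file' provenance idea from the list above.
# Every name and path here is hypothetical; the point is only that each
# transformation from 'raw' to 'processed' is recorded next to the data,
# so the provenance can travel with the dataset.
import hashlib
import json
import time
from pathlib import Path


def checksum(path: Path) -> str:
    """Identify an exact input or output file by its SHA-1 digest."""
    return hashlib.sha1(path.read_bytes()).hexdigest()


def run_step(name, func, inputs, output, log):
    """Run one transformation and append a provenance record for it."""
    func([Path(p) for p in inputs], Path(output))
    log.append({
        "step": name,
        "inputs": {p: checksum(Path(p)) for p in inputs},
        "output": {output: checksum(Path(output))},
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    })


def calibrate(inputs, output):
    # Hypothetical processing step: here it simply copies the raw readings through.
    output.write_bytes(inputs[0].read_bytes())


if __name__ == "__main__":
    Path("processed").mkdir(exist_ok=True)
    provenance = []
    run_step("calibrate", calibrate,
             ["raw/readings.csv"], "processed/calibrated.csv", provenance)
    # The provenance log is stored beside the data it describes.
    Path("processed/provenance.json").write_text(json.dumps(provenance, indent=2))
```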

Matthew Dovey (JISC) welcomed everyone to the session, which took the form of a brainstorm to identify the key infrastructure challenges, share experiences and prioritise those challenges.

The questions we’re looking at are:

  • what are the main infrastructure challenges in your area?
  • who is addressing them?
  • why are these bodies involved? might others do better?
  • what should be prioritised over the next 5 years?

We went round the room and everyone identified who they were, where they were from and their interest in the session.

Matthew Dovey said that the first few comments on the blog were about large-scale science, all about the science – the data explosion. A few people said that, at the other end of the spectrum, it’s not about large-scale data but about the variety of different types of data. In fact the large-scale science community possibly has it easier, as their data is relatively homogeneous! (much laughter…)

It was pointed out by Sam Pepler (British Atmospheric Data Centre) that there are two separate problems: he has lots of big data, and lots of inhomogeneous data.

Matthew said the other question he had was an economic one (and he would try to insult some large-scale scientists!). Large-scale scientists have had more money in the past, so they haven’t had to worry about the curation process – they can store everything – whereas the arts and humanities have to be much more careful about what they can store.

The amount of data has grown exponentially, even the large-scale projects can’t store it, and they’re panicking.

Chris Rusbridge (DCC) made the point that they’ve always had more large-scale data than they can store. The arts and humanities are different. In small science, the data in chemical laboratories was neglected (it was private data), and management of data was left to postgrads or postdocs. He said there are different sorts of problems in different areas.

The small versus large discussion continued, with mention of Theo, who runs the Edinburgh repository and has a successful thesis deposit mandate. This is a problem that libraries have had for years (a thesis in paper form, with lots of media in a back pocket) – he’s getting a very rich variety of data as supplementary data files. It’s going to happen more and more as articles and theses go into repositories, and no-one has worked out what a library should do. Do you close your eyes, cross your fingers and put it all in a zip file, or should it be more subtle? What should a well-meaning library do?
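
One ‘more subtle’ option, sketched below purely as an illustration, is for the library to package the supplementary files with a simple manifest of checksums and declared formats, so it at least has a record of what it accepted. The paths, format labels and thesis identifier are all hypothetical.

```python
# A hedged sketch of a 'more subtle than an opaque zip file' deposit:
# the zip is still made, but it carries a manifest describing its contents.
# All paths, format labels and identifiers here are hypothetical.
import hashlib
import json
import zipfile
from pathlib import Path

SUPPLEMENTARY = {
    "data/interviews.wav": "audio/x-wav",   # formats as declared by the depositor
    "data/survey-results.csv": "text/csv",
}

manifest = {
    "thesis": "hypothetical-thesis-id-1234",
    "files": [
        {
            "path": name,
            "format": fmt,
            "sha1": hashlib.sha1(Path(name).read_bytes()).hexdigest(),
        }
        for name, fmt in SUPPLEMENTARY.items()
    ],
}

with zipfile.ZipFile("supplementary.zip", "w") as package:
    package.writestr("manifest.json", json.dumps(manifest, indent=2))
    for name in SUPPLEMENTARY:
        package.write(name)
```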

There was discussion of whether this is a technical problem. It seems unreasonable that methodological practices should remain unchanged – we may have to look at the methodological practices we have.

Who is responsible for doing data curation within a research project? Is it the researcher, or is it farmed out to a postdoc or to the library? Robin Rice from the University of Edinburgh gave the example of the DataShare project, which is carrying out a gap analysis and trying to come up with generic solutions for smaller data. She said they are going to make some arbitrary decisions – what formats, what size, how to migrate in the future.

It was pointed out that there’s lots of data where you can’t do that – data from proprietary instruments, for example, where migration will be difficult.

Research is a broad spectrum. In some specialisms (meteorology, geospatial) there are data standards, but in new areas of leading-edge research there aren’t – standards may only emerge de facto, or a long time afterwards. You need subject-specific expertise embedded in the process.

Do you do the minimum amount now and put the cost of making the data usable onto the user (that’s what has happened in history, archaeology, etc.)? The user may be the right person to extract that value, as they are the people who care. You can spend a lot of money now on data that might not turn out to be useful in the future.

Sam Pepler pointed out that they are spending money on deciding what data to keep, and on adding metadata to make it useful. A lot of money is also spent on training the user…

The point was made that we don’t know in advance which parts of the data are going to be useful – we have to make arbitrary decisions about what we can do. If more money goes into storing the data rather than into making it user-friendly, then as long as it still exists the user can curate it.

But will it be horrendously expensive to use by that point? And what is ‘just the right amount of effort’? What are your selection criteria: what value might this data have in the future (who owns it, who’s going to pay for it), and how much effort and money would it take to regenerate this data (eg do you have the equipment and skills to replicate it)?

There’s a view of some organisations (not those in the room!) that it’s not worth keeping the stuff because nobody is going to understand it apart from us.

Moving away from economics and onto the issue of methodology – the discussion looked at how to push methodology to researchers:

How does JISC draw this together? Matthew said the problem we face is that ‘data’ is quite a vague term – we’ve talked about datasets, with some references to theses and some to e-journals and published work. There’s also been reference to process and workflow (eg how you did the experiment).

There’s also the question of why we’re doing this. There’s been lots of discussion about how to store data (particularly around how people in 50 years’ time can use it), but not a lot about how your peers use it – immediate reuse and sharing.

The point was made that not all disciplines are the same – the example was given that in medical research sharing protocols is more important than sharing data. However, it’s essential for the scientific record to have original data, and a lot of publishers want scientists to provide the research data they used as evidence for their papers. Having data available can avoid publication errors being propagated in subsequent articles (eg the Chang case), but you need the process and methodology to replicate it as well. The point was also made that you don’t necessarily throw the mechanisms out just because the data is dodgy.

The discussion moved on to the fact that currently research data and methodology are kept, but background information on the researchers (which could affect their interpretation of the data) would also be useful.

Matthew summed up so far by saying that this began with a discussion about the additional metadata needed to reinterpret the data. At that point there were some concerns about the cost of creating that metadata and how much you do up front. In addition to capturing the data and how you interpret it, you’re also saying that the context is useful (who, why, historic context, motivation – particularly useful for pharma experiments).

Would this be very expensive? The suggestion was made that you could store the proposal from the person doing the research – and this is done by the funders, as they have to have electronic records management. How you get hold of those records is another story… and at some point they may get deleted.

JISC has done some work in research management and links with research councils which may prove useful here.

The point was made that ownership is a hugely complex landscape, and affects storage and how you make the data available. A whole community can be party to a hadron collider experiment, for example.

Infrastructure was touched on – both the social and the technical aspects – and the point was made that there’s a lot on the social side that we can’t lay down the law about. Infrastructure can make it easier and cheaper to encourage that social structure. People will only do this (data management) if it doesn’t take up a huge amount of time and if they are motivated to do it.

What are the difficulties and how can JISC help?

Peter Morgan (Cambridge University Library) said it was totally unrealistic to expect a library to be able to manage such complex data – it looks like something that should be done on the front line, in the department or research group, with the library acting as a backup.

There was a lot of discussion around the purpose of retaining data – is it reuse or long-term storage? Should a nearline/offline storage model be used? Infrastructure for reuse may be different from that for long-term storage. There was some agreement that the capture and dissemination processes were front-line tasks, and the hope was that it had to be reuse first – preservation would then happen on the back of reuse, sharing and integration.

Is it an awareness issue? The assumption is of long-term curation – which doesn’t attract excitement. Would sharing be a better prospect?

With respect to sharing, why are we reusing data? If it’s researchers sharing it with selected colleagues – should this still be an institutional process? Should we be supporting publication of open notebook science? (and publishing of failed experiments).

What about reuse/sharing if there’s commercial gains? (There’s discipline-specific and cultural differences…)

Data centres that look after subject-specific data can overcome these discipline-specific issues. What’s the path going forward – should it be institutionally based or discipline-specific?

Matthew said JISC’s repositories programme is trying to provide that layer in between. An institution has a good prospect of surviving into the future; subjects may be useful at the current point, but may not be studied in the future. From the perspective of the researcher, though, they’re interested in astronomy data, not University of Cambridge data. He’s hoping for a mixed economy, with federation.

Research councils have a responsibility to guide the subjects in a positive way. Many have data policies, requirements and guidance for researchers.

Discussion then centred around the data audit project: institutions have a significant problem with understanding what their infrastructure needs are – so much of the research data is distributed around departments or faculties. Could the data audit project have an insight into how institutions get a handle on the size of their problem?

Chris Rusbridge made the point that there are well-established bits of software for managing repositories of papers (filestores etc). What should be done to repository systems, in terms of the technical infrastructure that could be installed and used (file registries etc), so that that part of the library’s job is easier? Is there some development work on fedora, greenstone, e-prints etc that would make it easier?
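
As a purely illustrative sketch of the kind of ‘file registry’ support mentioned above (not tied to fedora, greenstone or e-prints, and with a hypothetical registry file and layout), a repository could record a checksum for each deposited file and re-check fixity later:

```python
# Sketch of a minimal file registry: record a checksum when a file is
# deposited, then audit later to see whether anything has gone missing or
# changed. The registry file name and layout are hypothetical.
import hashlib
import json
from pathlib import Path

REGISTRY = Path("registry.json")


def sha1(path: Path) -> str:
    return hashlib.sha1(path.read_bytes()).hexdigest()


def register(path: str) -> None:
    """Record a file's checksum at deposit time."""
    entries = json.loads(REGISTRY.read_text()) if REGISTRY.exists() else {}
    entries[path] = sha1(Path(path))
    REGISTRY.write_text(json.dumps(entries, indent=2))


def audit() -> list:
    """Return the files whose current checksum no longer matches the registry."""
    entries = json.loads(REGISTRY.read_text()) if REGISTRY.exists() else {}
    return [p for p, digest in entries.items()
            if not Path(p).exists() or sha1(Path(p)) != digest]
```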

Finally, Matthew summed up the brainstorm by stating the four main priority areas for JISC that had come out of the discussion:

  1. within a research environment – can we facilitate data curation using the carrot of sharing systems? (IT systems in the lab)
  2. additional context beyond the metadata
  3. how do we help institutions understand their infrastructural needs
  4. what has to happen with the various dataset systems (fedora etc) to help them link with the library and institutional systems

2 thoughts on “Session 3: Technical and infrastructure issues”

  1. Robin Rice

    We were discussing (perhaps tangentially to this strand) some differences in norms about data sharing within different disciplines.

    I gave the extreme example of medical researchers in universities fronting research papers for pharmaceutical companies, and how sometimes even the named authors are not allowed to access the data pertaining to the article. See Aubrey Blumsohn’s “Ghost-statistics, raw data and the meaning of authorship. Are we learning any lessons from scandals in pharmaceutical research?” He lost his job as a medical researcher at the University of Sheffield because he refused to put his name as lead author on a Procter & Gamble study of Actonel until he saw the raw data. He later found that the article was indeed misrepresenting the results shown by the data. He now monitors other similar situations in his very frank Scientific Misconduct blog.
