ECURE 2004 Keynote Address Transcript



ECURE 2004 Keynote Address

Clifford Lynch Keynote Presentation

ECURE 2004

Arizona State University

Tempe, Arizona

March 2, 2004

Thanks Rob. It really is a pleasure to be back here. I've always been a big fan of ECURE. I feel like it's taken on something that is hard and yet important and that nobody really has been eager to deal with, and we desperately need things like that. Particularly in this area, which to me sort of calls for a coming together of perspectives and interests that is really fairly unprecedented, and I will say more about that in a few minutes.

I was really pleased though when I talked with Rob about this meeting, and he told me that the focus was on research and the management of research results because this is something that really seems to be extremely timely. Just to give you a data point on this, CNI, my organization the Coalition for Networked Information, has started an effort called the Executive Roundtable, which we do in front of our member meetings, and there we bring together typically CIOs and university librarians from around ten institutions, and the institutions vary from session to session. The idea is to get a small number of leaders together in an environment where they can really have a conversation, and we've themed that around a different topic each time. The theme, last December when we convened this, was Institutional Repositories, and it really was striking to me listening to the CIOs and the chief librarians there, how much they are starting to grapple now with this issue of the management of research results in digital form. This is something that if you talked to them two years ago, particularly the CIOs, you wouldn't have probably gotten much interest, but now it's become a very real issue.

I think that what we're seeing here in some sense is a convergence of sort of traditional records concerns, the movement of a lot of the teaching and learning processes to digital form, and the real transformation of how we're doing scholarship in a lot of areas to become increasingly reliant on information technology and digital content. It's getting real hard to tell what's a record, what's research, what's teaching and learning. It's become almost a throwaway remark among faculty now to sort of make the observation that they're getting very confused about what's research and what's teaching. The sorts of things that many faculty are doing in their classrooms, particularly when you get past the first couple of years of introductory courses, suggest that this barrier between teaching and learning and research is getting awfully porous.

There's one other wild card in the mix here too, which is that it used to be we talked about scholarly publishing as kind of this other separate, orthogonal, well-organized activity that didn't really have too much to do with records and had a very well-defined kind of hand off to research, and now that whole area has gone crazy too. There are questions that are very much on the table about institutional roles in the dissemination of scholarly communication that had traditionally been handed off to scholarly and professional societies or commercial scholarly publishers. There are issues about what the nature of scholarly communications is, and how it relates to things that we've more traditionally called research or teaching and learning.

I want to in our time this morning, and I really am going to try to sort of rigorously keep this about forty minutes because I want time for discussion, I'm going to give you a bunch of data points which I hope will expose some of what's going on here. I know some of these things will be familiar to some of you, hopefully not all of them will be familiar to all of you. But I think that they may help you to understand the landscape better and some of what I hope to give you is just some sense of how much the world is changing and where some of the pressure points are.

Let me start with a kind of an interesting one from MIT. MIT as many of you know has embarked on a number of interesting initiatives that I think we can all learn from. There's the DSpace effort for institutional repositories, and I will have a little more to say later about institutional repositories, but more to the point for my immediate purposes is the OpenCourseWare initiative. This is a commitment that MIT made to put up the teaching materials for the vast majority of their classes for public access worldwide. And they are sort of chunking ahead on this now, I think that they have five or six hundred courses up and another big batch ready to go. If you go and look at the courses and what's actually up - it's quite a varied assortment of things. In a few cases you've got nothing more than a syllabus and a reading list and maybe a few notes about the exams and problem sets. In other cases you've got extensive lecture notes. They have been cautious about commitments to video for now simply because of the storage and bandwidth requirements involved in supporting and delivering that on a large scale and the costs of doing that. But one of the things that to me is so interesting about this effort is that it sort of shows us how records and scholarly communication are becoming confused. Let's look at it from kind of a life cycle basis. Somebody there is teaching a course. They have a course management system or a learning management system, and the faculty member might put the material up on there. They may be doing some other things as part of the conduct of the course that actually involve interaction, for example some threaded discussion lists involving student participation, or perhaps students are preparing papers and they all go up on the course site and the participants in the course are invited to comment on each other's papers and discuss them.

OK, so this is all sort of regular course management stuff that we're familiar with. At some point the decision is made to push the course out in OCW. Now what happens there is it goes through this sort of sanitizing process, where you strip out most of the local student stuff, unless you've cleared permission for it explicitly and it's desirable to include that, and you strip out the third party materials that the faculty member might have included without formal copyright permission to disseminate to the world. You know they may have just gotten specific permission to use it for in-classroom use, they may be operating under a fair use for teaching exemption, or some license agreement, some sort of standard license agreement from a scholarly society that permits classroom and teaching use of the material. So they've got to strip out that material if it's not eligible for distribution worldwide. This produces this thing that goes up on OCW. It's actually a lot like a lecture note. I remember a million years ago when I was in the university and I was studying mathematics, and there's a tradition in mathematics of lecture notes that are taken of somebody's lectures, and then back in the old days they were Xeroxed, actually they weren't even Xeroxed, I don't know what you'd call it, that nasty purple ink, yes mimeograph, and passed round, and they'd have a lifespan of fifteen or twenty years before somebody got around to writing this stuff up as sort of a digestible kind of textbook or research monograph. This kind of thing that I'm describing shows how classroom material can turn into a form of public scholarly communication, which I think is going to have some real significance particularly as we see this done more and more with advanced level courses. I mean this is not going to make a major difference for Calculus 1, but it will for a lot of the research seminars.

We're not done with this adventure in life cycle yet. This material has, if you will, been published through OCW. Now the intent as I understand it is to sort of go through the courses every three or four years in general, and they will adjust this depending on the nature of the course and the volatility of the material, and basically do a refresh on OCW every three or four years, or in some cases if the material is becoming dated but they're not offering the course again, they'll simply pull it from OCW. What happens then, or what's going to happen then, is it goes in the institutional repository. It goes in the archives. So the intent is also that they're going to keep sort of an institutional record of what was taught across long periods of time. It is unclear to me, and I'm not sure it's a hundred percent clear to them yet because they are trailblazing here, how much of the archival stuff is going to derive out of OCW and how much of it might derive out of the base course management system, because there's more material in the course management system than they can put up publicly in OCW, and a lot of that has to do with rights issues and privacy issues for students and things like that. But you can see that the end of this life cycle is that you build up a record of what we were teaching, what our intellectual focus was in the department and in this course over time. That's an example of how complicated the life cycle of material is getting in this new world.

Now I don't even want to delve too far into these sort of secondary products here, but those are real too. For example as you conduct a course you gather all kinds of logging and usage pattern data which is typically very touchy. Nobody is quite sure who should be able to look at this and when. Should the course designers see it? Should the faculty member in charge be able to see it? Should the professor in charge be able to see it every week so that he or she can berate students that are not visiting the website or otherwise punish them? Or should you only see that after the grades are done and the course is over, as a way of maybe helping you design a better course next year? Or do we just park this material for use only in grade disputes and things of that nature? Or do we not keep it at all? Does this go in the archives? All kinds of good questions here that nobody's too eager to explore but are gonna be real.

Let me give you another data point that shows you how interesting the world is getting. The National Institutes of Health, which pass out a great deal of grant money to support research in the health and life sciences, changed their regulations in, I believe it was November of 2003 this went final. Basically their new regulations for dealing with research grants say roughly speaking this: If you've got a grant for over half a million dollars, and you're producing data from that grant, which most of them are, then basically you've either got to have a plan for preserving and disseminating that data, or you've got to explain why you're not doing that, and there are some potential reasons why you might not want to do that, or there might be limits on it if you're doing some kinds of clinical trials or there are confidentiality issues involved in it. But basically it sort of throws down the gauntlet and says the default is that you have a responsibility to curate and disseminate data that is produced by federally funded research. And that is a mandate that's laid on the PIs and, through the mechanism of these regulations, also on the institution hosting the grant at a certain level. And this policy change has very definitely gotten the attention of a number of vice presidents or vice provosts for research. It has certainly gotten the attention of a number of CIOs. The CIOs and the vice presidents of research are looking at this and saying "Well now what's the institutional responsibility here and what's the PI responsibility here?" These kinds of institutional versus PI roles and responsibilities have over the years gotten sorted out in a number of other areas of grant administration, but this creates a whole new area, and it starts raising questions about things like "Do we need to invest in institutional infrastructure to manage, curate and disseminate this data over time?" It's starting to sound a lot like part of the underpinning case for institutional repositories.

I wish it was that simple of course. There is a lot more to it than I just suggested. If you look at the practice of science and scholarship broadly, there's no question that this is becoming more and more reliant on digital data: on simulations, on computation, on information technology. If you have not read it I urge you to go and have a look at the report on cyberinfrastructure. This is a report that was prepared for the National Science Foundation by a committee chaired by Dan Atkins of the University of Michigan, and the final version of the report issued last year after probably close to a year of circulation in draft for comment. One of the things that this report does is really lay out very eloquently and clearly the case for how the practice of science has changed and the need to invest in a number of information technology infrastructure components: computational capacity, networking capabilities, storage, data management and curation, in order to support these changes. You may also be interested to know, if you're familiar with that report, that a parallel effort has now been launched in the humanities. The American Council of Learned Societies has asked John Unsworth from the University of Illinois at Urbana-Champaign to chair a commission to take a similar look at changing practice in the humanities and to examine cyberinfrastructure requirements for the humanities in an increasingly digital world. So we have these kinds of recognitions of changing practice.

While we've got things like NIH data retention requirements taking if you will an institutional cut at things, the same kind of cut at things that you see with the commitment to institutional repositories at many institutions now, you've also got disciplinary changes: you're starting to see increasingly expectations, norms within scholarly communities that say "When you publish certain kinds of work," for example work involving genomic sequences, work involving the crystalline structure of proteins, "you are expected to deposit the data that supports your work in specific disciplinary databanks as part of the conditions of publication" so that other people can examine that, replicate your results, judge them independently.

Some of this goes back a long time, but we're seeing more and more emphasis on it. It's fascinating to look for example at what's happening in astronomy now. Used to be that observations in astronomy were fairly closely held, and the life of many astronomers was very interesting. You applied for grants and if you got your grant you got to go off to mountaintops and exotic places, and hope that the weather held, and hope that you spotted something interesting. That was basically sort of the world of astronomy at some caricature level. Now what's happening is that people are building sky surveys. You've probably seen some of the work on the digital sky surveys that Sloan has been underwriting, and that people like Jim Gray and Alex Szalay have been involved in. Basically the idea here is that when you're doing observational work the kind of norm now is that you get a couple of years to mine the observations. If you took 'em, you get sort of first crack at mining them.

It's somewhat similar to what happens if you're a principal investigator on, say, a NASA planetary probe. You get some time of exclusive use of data but then you're expected to place it in these repositories so that other people can use it. And in fact the folks at NSF told me they're now getting new kinds of grant proposals in for astronomy. Instead of a grant proposal that says pay my salary and buy me a plane ticket to some mountaintop someplace, it says I've got an algorithm that I think can identify distant quasars or brown dwarfs or something like that, and what I want is some technical support and some computer time so that I can go and mine this sky survey, mine other people's observations that are now part of the patrimony of the scientific community and see if I can spot these things. That's a very different way of thinking about science and thinking about the databases that support science.

Now what is that database? Is that a research record? Is it a scholarly publication? Is it an infrastructure component? Probably yes to all of them. Who owns it? Well that's pretty murky. It's been contributed to widely from all kinds of international sources. In the case of astronomy maybe asking who owns it isn't that important because we know that sky surveys are of fairly limited commercial value. We could talk about the genome and we might get different answers. Perhaps at least to my mind more significant than who owns it is "Who should take responsibility for it?" Right now it's in a disciplinary archive. That means it's being funded primarily by centralized science funders. If you look at the cyberinfrastructure report it calls for a lot more funding of this kind of disciplinary archive in various areas. One of the things that I worry a lot about though is that government funding is a very dangerous thing to count on the stability of across long periods of time. What happens when they decide that funding astronomy isn't a government priority, and they're not gonna pay for this archive anymore? Now presumably if it's important, and it probably is in this case, faculty at universities who need it are going to beat up on their universities and say, "well the government is wimping out on its responsibilities, you need to step up and take care of this because I can't do my research without it". How are we going to bring disciplinary content back into universities when we have funding failures? How are we going to apportion responsibility among the many interested universities, all of whom I can assure you are gonna be lining up with checks just eager to say "Oh don't worry I'll pay for all of it, the rest of you guys can just freeload off me"? I somehow don't think it's gonna be that simple, but these are some of the issues I think we're looking at as data becomes more and more essential to the practice of research.

I would not be surprised to see similar data dissemination mandates from other scientific funding agencies both in the US and abroad over the next few years. I think that NIH has taken some significant leadership here. I know that this is an issue that's very much on the table in science policy circles. There was a very nice National Research Council report that came out oh maybe eighteen months ago now, looking at this whole issue of responsibilities for data deposit and data dissemination in the life sciences. Certainly this is a topic that is getting discussed in the hallways and things in the National Science Foundation. NASA, as I indicated, certainly has its own kind of policies in this area and has had for a while. This is something that at least sits around the edges of some of the congressional adventures into accessibility of science information, things like the Sabo bill, which was really, at least as I interpreted it, primarily focused on publication and research reporting as opposed to raw data. But raw data is certainly implicated around the edges in these. I don't think this issue is going to go away.

I'd urge you to think very carefully about where the institutional role is, as opposed to the PI role, how you deal with the records management side of it, to the extent that you interpret regulation as a records management mandate, and in particular, some of these complexities about not necessarily holding things locally.

In the sort of happy, simple, halcyon days, yeah sure, of records management when we didn't worry too much about this sort of thing, one thing that I think was true was that most issues were fairly local. You didn't have to worry too much about mandates that crossed institutions or put you in a situation of relying in an integral way on a disciplinary repository that was funded from someplace else and that you had absolutely no control over. Or issues where you have faculty from four institutions collaborating and one of them is housing the data but there are no real memos of understanding about responsibility for it. Yet when you come down to the question of have you met the regulations, it's "yeah, you know, our colleagues down the freeway a ways are taking care of that, we've met the regulations". That's the kind of thing that sometimes makes general counsels really squirrelly.

So I've talked a little bit about the changing framework around data. I don't want to spend time here on the various discussions about open archives, publication of scholarly journals. I think probably you've all heard pieces of that. That to me is something that is mostly a debate focused around scholarly communication and scholarly publication and how we really want to do that in the future. It does have a potential crossover into this sort of broader area we're focused on here about stewardship of research results and records management issues there, to the extent that we may see some kind of regulatory or legal mandate that research results go out through open archive channels, through open access channels. But I think the connections there are relatively clear and I don't want to spend a lot of time speculating about that.

I've talked a little bit about the learning management side of this and how material moves through learning management into records and into institutional repositories. I simply want to again remind you there that there are a lot of issues about when is this research, when is it teaching and learning material, and there is a nasty set of issues about when is it records, which are sort of independent of when is it teaching and learning and when is it research. As far as I can tell institutions are all over the map, with many of them in denial, hiding under blankets or anything else they can find, about this question of how what's in these learning management systems that they're mandating the deployment of relates to the sort of traditional view of the institutional records of the organization. I think those are very real issues.

Now the last set of things that I just want to touch on a little bit, going back to this management of research, are two-fold: one is things that aren't in the sciences, and the other is things that don't come from faculty. As far as things that don't come from faculty, I want to just sort of put out the proposition that we tend to think of most of our research as coming from faculty. In most fields now a great deal, perhaps even the majority, of the research that comes out of our universities comes out of collaborations between faculty and students. It's a collaborative effort. And in fact as we talk about managing research, and making it available, it's not adequate at all to deal with a policy framework that addresses faculty, and rights and responsibilities of faculty with regard to data, to intellectual property, to copyright, to patents, to things of that nature. Simply dealing in that framework, which we're pretty familiar with and have had a long ongoing sort of negotiation between the university administration and faculty members about, particularly in light of this sort of encroaching digitization of everything, is not enough. We have to have this conversation with students too, about everything from the theses and dissertations that we're increasingly demanding they produce in digital form and put up on the net, all the way through the data that they capture as part of faculty research efforts, faculty-led research efforts, the contributions they make not just to faculty publications now, but to faculty data sets that we may need to curate and disseminate on an ongoing basis. This whole question of student rights, how we formulate them, how we make sure that students understand what they're getting into and agree to it, is going to be a very central issue and it's one that there are relatively few universities moving on.

I'd just put one further footnote on that. We know that faculty collaborate across institutions. From the point of view of people trying to deal with intellectual property rights, either to exploit them or simply to ensure that they have the permissions to curate and steward various kinds of content resources that are being produced by faculty collaborations, the problem of faculty sort of casually collaborating, again without agreements, informally across institutions, is a big headache. Let's keep in mind the students. What we've got now is not faculty collaboration but groups of faculty and students across institutions doing these collaborations, and we'd better think about how we're going to handle those.

Now let me just move on to my last set of issues here. We've talked a lot about the curation of scientific data, and there's something very comfortable here about, "well we have these kind of big funders, they have well-organized policies and regulations, they pump money through universities to principal investigators, principal investigators do science, there are disciplinary norms and community norms for parking the data, there are regulations now coming into play." There are what I would almost characterize as sort of affirmative expectations about data stewardship. The notion is that when you do certain kinds of experiments, when you take certain kinds of observations, one of your obligations as a scientist is to make sure that this ends up in the institutional repositories or the community databases so that other scientists can use it going forward.

Let's contrast that for a minute to what's going on in the humanities and some of the social sciences. Here you have a culture that in some cases collects data over decades with very patchwork funding: private foundations, individual bequests, sometimes funded literally out of the pocket of the principal investigators or the faculty members. They're not principal investigators because that's what you get to be when you get a big grant. They're just faculty members that somehow scrape together the money to keep collecting the data they're interested in.

So you have things like thirty-five years of tape recordings of people talking with different regional accents as somebody tries to understand how spoken language changes in some area of the world, or bird recordings, or ethnographic materials that maybe were captured without necessarily the kind of informed consent that would make people really happy today. Somebody just went out into this rainforest or something with a camera thirty years ago and a tape recorder. Much of this material is held by individual faculty members or by departments. It doesn't have a lot of standing, the rights to it are murky, the faculty members are retiring, dying off, moving on, and institutions are struggling with where they park this material, who should take responsibility for it. We don't necessarily have the disciplinary norm. Some of the toughest cases I think we're going to face here are areas where we've got usage restrictions that don't fall along traditional intellectual property lines.

Let me give you an example, and it's a little bit different but I think it beautifully illustrates the point of some of the challenges that we're going to face in curating and disseminating some of this material. There's a wonderful project which Steven Spielberg has been funding for the past few years called Survivors of the Shoah, some of you may be familiar with it. Basically what they did is they went out and found thousands of Holocaust survivors, and they did lengthy interviews with them. They videotaped them, they coded these interviews in various ways, worked up biographies, and other kinds of structural finding aids. So they've got this corpus with thousands of hours now of interviews in a number of different languages. These are very emotional interviews in many cases, many of the people involved are quite elderly and are reliving extraordinarily traumatic experiences in these interviews. The whole idea of this project is in some sense to document what happened, and they want to make it available. But they want to make it available in such a way that the materials will be used in context. They will be used respectfully. They will be used to further understanding of what occurred, and research into it, and education about it. Those are the kinds of constraints that they would like to place on the dissemination of this material. I would suggest that there's an enormous gap between that and the kinds of things we talk about in a rights management kind of context, that it can be used for educational purposes, or you can make two copies but you can't charge for it, or you're allowed to make as many copies as you need to back it up. And I think that gap is going to be very characteristic of a lot of the material that documents human cultural practices, human history, human experience of various kinds.

Dealing with this kind of research data as we move it from this sort of very private and if you will almost secret preserve of individual faculty into an institutional and a disciplinary setting as part of our management of research is going to be tremendously challenging, and is going to place some demands on us that we're really very unfamiliar with when we've looked at things historically mostly from a legal point of view.

If there's one notion I'd like to leave you with, it's that this whole question about stewardship of research data, and in particular its relationship to records as we've traditionally understood them, is very much coming to the fore. And to deal with these questions, at least as I view them, is really calling for collaboration that's going to bring in people from the research support world, your vice presidents of research, it's going to bring in your records management people, your general counsel, it's going to bring in your library and archives folks, it's going to bring in your CIO. It may bring in some new players, ethicists for example, it's going to bring in faculty from many disciplines. But there's a huge demand, I think, a huge requirement emerging for a serious conversation about this topic that brings in perspectives that we have not really heard from as we've talked about the management of research data up until now. Thanks.