ECURE 2002 Keynote Address Transcript

ECURE 2002 Keynote Address

By Clifford Lynch

It’s really a pleasure to be here. I’m so glad to see this ECURE series of conferences going forward because, as Sherrie indicated, the issues that are on the table here are really of enormous importance. And they’re issues that I feel like we’re getting a slightly better handle on year after year, as we move further into a world that is just extensively dependent on digital information and network-based information systems. The answers here aren’t simple ones, the issues here aren’t simple ones, but they are important ones, and important not just for the short term but for the long term, and I really want to compliment Sherrie and Rob and their colleagues on their focus and persistence in moving these issues forward year after year.

I wanted to talk about connections this morning mostly, about ways in which our understanding of some of the issues around preservation and stewardship of electronic records is changing, is getting deeper: places where we are starting, I think, to engage other spheres of effort that have historically been relatively disparate and are no longer able to be disparate from the considerations around managing electronic records. And I’ve got three or four examples of this, which sort of flow from one to the other. I think the place I want to start, though, particularly picking up on some of the other themes that I noticed in the conference program, is with the connection between records and information security and preservation.

You know, it’s funny when we talk about preservation of digital materials. I’ve been engaged in these discussions probably for more than a decade now. And I’ve been engaged in them well beyond the records kind of context, into the broader questions about how, for example, we preserve an intellectual record, cultural materials, things like that, that are increasingly being born in a digital form. And you get into an interesting trap when you have these discussions. People want to pose the sort of difficult intellectual problems which postulate: Here we are in the twenty-third century staring at a collection of gifts that have been bequeathed to us from two hundred years in the past, and how are we to interpret and find meaning in these? This is your sort of classic digital preservation dilemma, and it’s a very difficult one. But I think if we focus exclusively on that scenario it’s very easy to forget something else which, I think, has been underscored strongly by the discussions about cyber-security and critical infrastructure and related things in the wake of September 11, 2001. And that’s that bits get to the future a day at a time — one day at a time. And if you have got your bits that you want to bequeath to the future on insecure systems, on systems that are not well managed, that are not appropriately replicated and backed-up and redundant, and that don’t incorporate the right kinds of data integrity and consistency checks for example, it really doesn’t make much difference if we talk about that impossible dilemma in the twenty-third century, because the bits may not get there at all. Or if they do get there, they may not be the bits that you wanted to send there. They may be the bits that amuse some random hacker to replace whatever it was that you were trying to keep alive.

This is a really serious problem. The Internet, for those of you who aren’t sort of closely connected with it day by day, has become to some of us, I would almost say, an astonishingly hostile place in the last five years. For those of you who are at university campuses: talk to your network managers, and they will tell you, I believe (certainly the ones I talk to tell me), that they’re spending an appalling amount of time firefighting daily incursions. The systems at our university campuses are being continually probed for weaknesses. They are continually under attack. And it’s really a catch-up exercise at this point trying to plug those holes; it’s very hard to get out in front of this whole data security situation that we’re facing today.

I’d say two other things about this that are probably worth underscoring. The first is that if you look at our sort of — how to characterize them? — professionally managed institutional systems, mostly these are run fairly carefully and — yes, once in a while, things do hit the press when somebody fails to protect one of those and it’s broken into — but those are, I would say, relatively speaking, uncommon. We have a tremendous amount of digital information, some of it records, some of it just important material for scholarship, however, that is scattered around systems that are not professionally managed.

In fact, one of the bizarre things that has happened in the last ten years, if you think about it, has happened so gradually, again a day at a time, in a sufficiently incremental way, that I am not sure we fully recognize its import. We’ve turned our faculty members and our graduate students who seek to use the Internet as a way of disseminating materials that they create into systems administrators in their copious spare time. So much of the material they produce is running on desktop machines, under-the-desk machines, in-the-closet machines, sort of haphazard personal machines that are not being run to a sort of professional standard of paranoia and defensiveness. Your average faculty member who is entranced and engaged by the ability that the Internet provides him or her to communicate with colleagues around the world is not generally sort of getting up every morning and rushing over to the latest list of mandatory security patches for whatever their operating system is. Some of them, in fact, have sort of begged, borrowed, or stolen hardware which is just sort of sitting there, and they’re hoping that it doesn’t crash because they’ve got no place to get money to buy new hardware. That’s how precarious it is, and I think that we need to be very aware of that. At the close of my talk, I’m going to spend a little time on some developments which, I think, are going forward largely for other reasons but will have, I think, the very beneficial side effect of starting to redress that balance.

The other thing that I want to say in the area of security and how that connects into preservation and information management, and why this frames a requirement for a lot more interaction between the communities concerned with information management and preservation and those with security, is that we tend, I think, to take a very narrow view of security. If you look at a lot of the stuff that’s come out in the last year as a result of some of the conversations that have moved forward from the federal focus on cyber security, you’ll see that there is an obsession with denial of service, and break-ins, and loss of control of systems. What they’re worried about, typically, is hackers, or aggressors, or intruders, or whatever you want to call them, breaking into systems, getting control of them, using them as launching points for attacks on other systems, using them to disrupt physical things that are controlled by computers, things like oil pipelines, or gas pipelines, or power grids, or telephone grids, or things like that. So it’s very much sort of an attack-defense scenario, where often when you fail it’s very visible.

They don’t talk very much — they don’t talk enough, I don’t think — about more subtle attacks that involve information corruption. [What] if some of our crucial databases — particularly those that underpin some of our scientific, and medical, and technical enterprises — become corrupted in various ways? Not necessarily corrupt in the sense that somebody goes and zeros out an entire database, but rather someone suddenly introduces some misinformation in there. That’s, unfortunately, quite likely to go undetected for long periods of time and to be carried forward into the future. And that’s really dangerous. We see anecdotal evidence that errors in scholarship, for example, are picked up and propagated. People in the library world, particularly those who do reference and interlibrary loan will tell you wonderful apocryphal stories of someone who messed up a reference in a paper, which was cheerfully propagated to dozens of other papers over time, and about the innocent researchers who wander in periodically trying to get a copy of this cited thing that is a wrong citation or never existed in the first place. Those kinds of things can really propagate out in very troublesome ways. I think that we really need to spend time thinking about this broad area of information integrity, how to audit information integrity, how to detect corruption in databases, in collections of structured and unstructured information.

This has come up frequently in the last couple of years in discussions, for example, about the establishment of long-term archives to do things like preserve scientific and scholarly journals as they move to a digital form. The arguments that are made there are that use, large-scale public use, is one of the best measures for detecting corruption. And when you look at the scenarios under discussion that say, basically, “We’re going to build a dark archive, and other than checking for maintenance purposes, this material may not be visible for a hundred years to the public broadly, until copyright expires,” people start raising the question, “Well, how do we know the data will be good for a hundred years? How can we check it a day at a time in an environment of intensive information security threats?” And the tool kit we have for that right now is pretty barren. We need a lot more thinking in terms of audit and intrusion detection, not for the classic take-over-a-system problem, but for the broader integrity-of-data problem.
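One common building block for checking an archive a day at a time is a periodic fixity audit: record a cryptographic digest of every object once, then recompute and compare on a schedule. The sketch below is only an illustration of that idea, not something described in the talk; the function names (`build_manifest`, `audit`) and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large objects need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Record a digest for every file under root, keyed by relative path."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def audit(root: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare the archive's current state with a previously stored manifest."""
    current = build_manifest(root)
    return {
        # Objects that were recorded earlier but can no longer be found.
        "missing": sorted(set(manifest) - set(current)),
        # Objects that appeared without ever being recorded.
        "unexpected": sorted(set(current) - set(manifest)),
        # Objects whose bits no longer match the recorded digest.
        "altered": sorted(
            name
            for name in manifest.keys() & current.keys()
            if manifest[name] != current[name]
        ),
    }
```

A check like this only detects change after the manifest was taken, and only if the manifest itself is stored and protected separately from the archive. It also says nothing about whether the content was correct when it was first deposited, which is the harder misinformation problem raised above.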

And I just want to put a side note on that, that corruption of data is not always the result of hostile actions; it’s often the result of mistakes, hardware behaving strangely, program revisions that weren’t quite fully tested. Human error plays an enormous role in here too, and it aggregates over time. If you get people, mostly off the record, who have had stewardship over large databases of anything (images, sound recordings, text) and have been trying to manage these over time, they will often admit, reluctantly, “Yeah, we’ve lost half a percent or so of what’s in there. You know, things just didn’t work out, or failsafe backup systems weren’t quite failsafe… Every now and again, we messed up a little bit in doing database migration…” And they really will admit that even in environments that are not under hostile attack, the perversity of hardware and software and human error really does take a toll over time. We need to come up with ways to reduce that toll as we look at the management of information. So I think there’s a very important conversation that needs to happen there.

Let me move to my second conflux of issues, and I think the way I’d frame that is to say that we’ve focused a lot on digital preservation, and we now very sloppily use the term digital archiving as a synonym for digital preservation. I think it’s worth remembering that archiving and preservation really aren’t the same thing. Preservation is basically about keeping artifacts alive and comprehensible into the future. Archiving is about a lot more than that. It’s about maintaining some kind of context and provenance around those objects. It’s about making choices about what objects you bring forward and what objects you don’t bring forward. And it’s very easy, in a world where we now speak of digital archiving and digital archives as synonymous with systems for digital preservation, to forget those things, and to think that we’re doing all the archival things just because we’re engaged in the issues around digital preservation. I think that this is a fallacy, and that the sooner we recognize that this is a fallacy, the better off we are, and that we do need to kind of touch base with a lot of those fundamental archival objectives again, and ask ourselves, “Are we really doing those in the digital world, or are we indeed narrowly focused on the sort of, you know, foundational problem of preservation, which we also must address, to the exclusion of those other issues?”

When we look at these broader issues — of context, of meaning, of selection — I think that it also helps us to see that we need to think pretty carefully about how much we need to bring forward into the digital world in order to understand artifacts that we may preserve. I’ll give you two examples here that just kind of drive the point home. One is trivial and one is anything but.

We archive e-mail; we’ve been doing that a lot. In fact, archived e-mail is very much in the news; it seems to be always in the news, ever since Ollie North. Now, if you look at what’s going on in Wall Street, and some investigations of the big trading firms, they’re certainly getting a superb object lesson in archiving e-mail and some of its ramifications. It’s hard to really understand a massive organizational archive of e-mail, particularly as it grows more distant in time from its point of capture, without having quite a bit of ancillary information. For example, directories: some organizations assign e-mail addresses that are very directly correlated to names. Others don’t, and if you’re looking at mail from ak47 to sg92, at some organization, twenty years ago, you may actually have considerable difficulty (unless people are using signature files, unless you’ve got other context) figuring out what this is all about without a directory. You need to save directories, I think, as part of that archival context for e-mail. Yet directories are these sort of dynamic databases. They’re often maintained as kind of organizational infrastructure by a very different group than the group that worries about records and compliance, and because of their dynamic update nature they have a very different characteristic than just arranging for copies of e-mail to fall into an archive. I would suspect, although I’ve certainly not done any kind of organized survey, that these organizations that are busy archiving e-mail because of various regulatory mandates or legal requirements to do so aren’t necessarily maintaining a snapshot of directories, a trail of directory updates and images to allow us to interpret that e-mail downstream. That’s a simple example.
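To make the idea of a trail of directory updates concrete, here is a minimal sketch of a dated directory history that can answer "who held this address when this message was sent?" It is an illustration under stated assumptions, not a description of any real system: the `DirectoryHistory` class, its method names, and the personal names in the usage example are all hypothetical; only the address ak47 comes from the example above.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class DirectoryHistory:
    """Append-only trail of directory entries kept alongside a mail archive."""

    # address -> list of (effective_date, display_name), kept sorted by date
    _entries: dict[str, list[tuple[date, str]]] = field(default_factory=dict)

    def record(self, address: str, effective: date, name: str) -> None:
        """Note that `address` belonged to `name` starting on `effective`."""
        trail = self._entries.setdefault(address, [])
        trail.append((effective, name))
        trail.sort()

    def who_was(self, address: str, on: date) -> Optional[str]:
        """Return the holder of `address` on a given day, or None if unknown."""
        trail = self._entries.get(address, [])
        dates = [effective for effective, _ in trail]
        # Index of the last entry whose effective date is on or before `on`.
        i = bisect_right(dates, on)
        return trail[i - 1][1] if i else None


history = DirectoryHistory()
history.record("ak47", date(1982, 1, 1), "A. Keller")  # hypothetical person
history.record("ak47", date(1990, 6, 1), "A. Khan")    # address later reassigned
print(history.who_was("ak47", date(1985, 3, 2)))
```

Keeping dated entries rather than a single current snapshot matters because opaque addresses can be reassigned; a lookup against today's directory could attribute old mail to the wrong person.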

Let me give you a complicated example. There is a lot of work going on, in the corporate world, in the government world, and also in the higher education world, in a set of technologies that are loosely called public key infrastructure. Basically what this is about is ways to manage identity and manage the signing of documents, using various systems of digital signatures, in such a way that it is non-repudiable and verifiable who produced, or who had possession of and sent, a document. These are going to become, I suspect, quite deeply ingrained in the business processes of organizations, both internally and interorganizationally, over the next decade. This is an enormously complicated context that has to be maintained if we’re going to fully assess what the records of those organizations mean and when we can trust them and when we can’t, ten or twenty or thirty or fifty years downstream. I was absolutely delighted to learn a few months ago that Charles Dollar is actually doing a careful study of some of the issues around records management implications and PKI. This is the first real work I’m aware of in this area, and I know he’s giving a talk later today, which I’d urge you to check out because I think that this begins to shed light on one of these contextual areas that is going to be quite typical of the digital records world in the coming decade, and to give you a sense of its richness, its complexity, and its importance. And I’m really glad to see someone with that kind of intellectual horsepower taking on an area like that, that has, I think, been overlooked for too long.

Let me move to my third area, and that’s rethinking what we mean by records, particularly in the university environment. We’ve had, you know, some sort of classic definitions and understandings of what constitute records. And we’ve built up a certain — how to describe it? — theoretical underpinning that says why we consider these things to be records and why we don’t — why we consider other things not to be records. I can’t help but worry when I see some of that, about whether we haven’t selected some things as records and said some other things aren’t records because we were capable of capturing some and we weren’t capable of capturing others. You always worry about how much of this is justification for what we can do rather than what we should do. I think it’s very important to recognize that when you look at what’s happening at universities right now, there are some big systemic things that are really changing the nature of what we can capture and what we can preserve. Note: I didn’t say changing the nature of records, but I think they invite us to think about whether we want to treat these new things as institutional records or how we want to consider them.

I would say the two most prominent trends there, are first the proliferation of audio and video capture of all kinds of events, meetings, classes, lectures, all of this kind of material. This is becoming relatively commonplace. It’s slowly creeping into some government settings as well, an obvious example being those wonderful city council meetings you get on public cable television — you know, the reruns and things like that, periodically are the town zoning board or other esoterica. But we are capturing a lot more of these kinds of events, and we can ask the question about whether these should be viewed as formal records of various sorts.

Perhaps going even more deeply, though, into the core of what we do at universities is the emergence of these things that are variously called learning management systems or course management systems in the US, and in the UK they seem to like to call them virtual learning environments, for reasons that I’m not real clear on, but I got a big dose of that earlier this week. These are systems that move part of what has historically gone on face to face in the classroom into the digital medium, and like anything else that happens in the digital medium, that means that automatically, as a byproduct of doing the activity, we produce a record, or a recording, perhaps, to be more precise, since record has a very special meaning in the context we’re talking about here. So we get recordings of things that were historically classroom discussions, quizzes, tests, term papers: all of these things that we used to not be able to capture, or at best they resided somewhere in the files of the occasional professor organized enough to file them and save them. We could treat these as institutional records now. We could preserve them. We can struggle with the various questions and constraints about how we should make them available.

I could go on for a long time about the various issues involved there ranging from faculty and student intellectual property rights, to questions about privacy, to issues of how this affects spontaneity, how it could change the tenure and promotion process, at least in those institutions that are serious about considering classroom teaching and classroom conduct as part of that process. But I think, you know, having mentioned these, you can kind of think through some of the ramifications. The point here is that, there are a lot more possibilities for what we can capture, what we can preserve, what we can make available. And I think that these probably, if we’re honest, should lead us to some reconsideration of the scope of institutional records as we have historically defined them within our institutions of higher education.

Let me move to the last piece of this puzzle, the last set of connections, and talk a little bit about an idea that seems to be gaining a lot of traction recently, which is that of institutional repositories. What’s happened is that a number of trends, a number of threads of discussion, seem to be coming together, and I’ll just give you a few of them.

There’s been an ongoing set of concerns about the system of scholarly publishing and its costs and its constraints. There has been a closely related set of discussions about the emergence of a broader system of scholarly communication, which looks beyond the very paper-rooted things that we have historically considered the center of that system, that recognizes that faculty [and] students are going to be producing what are in essence digital multimedia works, which have some rather different characteristics than your traditional journal article or term paper. And these may very well be disseminated in different ways. Another piece of this has been the move to “e-prints” and preprints, to faculty wanting to more rapidly and more broadly disseminate their work — outside of the system of scholarly publishing as it has been established — through, you know, personal Web pages listing their papers, things of that nature. Yet another piece of it has been this discussion about learning materials, about on-line courses and things of that nature, and about the desire of some universities, MIT with its open courseware initiative being perhaps the, you know, the flagship institution, that’s made a commitment on this. They want to basically make most of their courses available for the public, worldwide, for free. And they need places to put these.

All of these things have combined to lead a number of institutions down the path of starting to think about the construction of systems and organizations to achieve what they’re calling institutional repositories. The idea here is that these are computer systems, plus organizations around them — and I really want to stress there’s more to this than just building a computer system. It’s the same distinction that I’d make between somebody pointing to a database system and saying, “Oh that’s a digital library,” and talking about the functions that a library’s organization does. And the key thing here is to recognize that institutional repositories have a dimension as organization, as well as simply as storage system. These are places where the intellectual assets of institutions can be placed, stored, made available, managed, preserved, on behalf of the community that makes up the institution, that produces those assets and for the good not just of the institution but of the entire world, where appropriate. The things that go in here can be papers, they can be data sets, they can be courses and courseware, they can be institutional records. Institutional records, particularly the public ones, are a very logical and very natural part of the collection of materials that could go into these kinds of repositories.

And as I indicated, there is quite a bit of energy going into thinking through the construction of such systems at some institutions now, and my sense is that this is sort of a burgeoning movement. You know, we started with a couple of dozen early adopters; now, as those roll out their work, more and more institutions are starting to think about this. It’s useful to recognize that such a system obviously requires the collaboration of information technology people and people from the library organization. It clearly, in my mind, should also require the collaboration of the records management organizations within these institutions, although it is not clear to me that they are always a presence at the table in the design of these, particularly at places where the primary motivation comes out of a tradition of advancing and altering the patterns of scholarly communication, rather than thinking more broadly about stewardship and access to the full panoply of the institution’s intellectual assets.

There are technical issues, certainly, in building these things, but there is also, I think, a very large and complex policy dimension, and that’s one of the reasons why I think it’s important for the records management voice, as well as other perspectives, including those on campus concerned with intellectual property policy, with privacy, and other matters, to all be present in the debate and the construction of policies to guide what can go into these repositories, how long it should be kept, who gets to see it, what uses can be made of it, whether deposits are mandatory and, if so, what kinds of deposits are mandatory, as opposed to what are left to the option of the faculty or other units on campus.

These institutional repositories, assuming they succeed (and I believe they will), are going to do a lot of things for us. Some of them are squarely on the agenda of the people advancing them. For example, I think that they will have a very valuable effect in legitimizing works of authorship in the digital medium as, you know, first-rate works of legitimate scholarship. Right now there is this complex set of risks that faculty face when they choose to explore the authorship possibilities and the communications possibilities of the digital medium, rather than producing traditional print monographs or journal articles. They run the risk that that effort, when time comes for tenure and promotion kinds of things, will be judged in some way questionable or of lesser importance than the more traditional publishing output. And one of the reasons for that is that there is a sense that this material is ephemeral. When we set up the system where it can be placed permanently in institutionally managed repositories, I believe that is going to provide one of the important underpinnings for addressing some of the legitimacy concerns about that kind of scholarly authorship.

Another thing that these institutional repositories are going to do for us is, I think, they will get a lot of the faculty out of the “playing system administrator in their copious spare time” bind. Instead of having to house these materials on personal desktop machines, they can move them into the institutional repository, where they’ll be managed professionally and protected in a more professional way than a faculty member can typically afford to do in his or her spare time. I think that will be an important and valuable byproduct that will help to protect our intellectual assets.

But the other byproduct we may get here is that we may see institution-wide systems for digital preservation. That may really help with the records management dilemma, in the sense that if we can get some institutional infrastructure to support preservation activities in a systematic way, perhaps we can return to some of these more archival and records-management oriented questions in trying to figure out what to place in the repositories, how long to keep it, how to establish context and provenance around it, who gets to see it, and those kinds of questions. So I think that institutional repositories are an extremely promising development for those of us who are concerned with preservation and access to records in higher education.

I hope in these comments that I’ve at least highlighted a few areas where the world is changing, where the whole context around electronic records and their preservation is changing, and I hope, if nothing else, I’ve highlighted some activities that may be happening in your own institutions, where you need to get involved, where you need to be part of the conversation, because this opens opportunities on the one hand, and fundamentally affects the framework within which you’re trying to achieve your own objectives on the other. With that, I’m going to stop, and I would welcome a couple of quick questions or comments. Questions?

Yes.

[Question unintelligible]

The question was, do I see the OAIS standard as the sort of de facto model for institutional repositories?

Let me say a couple things about that, because I think that’s actually a wonderful question. The OAIS model has been a very useful thing. If nothing else, it has given us a set of common terminologies to have conversations about a lot of the processes involved in archiving and preserving digital information and managing that information. I think that it’s introduced a couple of very useful ideas (for example, submission packages and delivery packages), which introduced some helpful discipline into thinking about the problem. Having said that, every document I look at now starts with “We are following the OAIS model,” and then proceeds to do whatever it wants. The OAIS model is a very abstract beast. It’s mostly a vocabulary. It’s not a road map to how to build things. It’s really not even a system architecture. And while it’s a useful common vocabulary, I don’t think it excuses us from the need to really think through system architectures, and to really think through implementation strategies; it’s really not a blueprint for those things. The other thing that I would say is that I get nervous whenever anything like that, this early in a field that’s this complex, gets this kind of universal sort of adoption. I think it would be very valuable to see some critical work on it, looking at “What are the limits of the OAIS model as a model?” “What kind of information, if any, doesn’t it apply to well?” “Where’s it a bad fit, where’s it a good fit?” I don’t see enough of that kind of conversation happening, so I guess, maybe in summary, I’d say it’s a very valuable tool for facilitating communication among people thinking here. It’s not a full blueprint for a solution, and it’s something that, I think, probably would benefit from some further critical study.

Yes.

[Question unintelligible]

That’s an interesting problem. You know, I think I talked a little earlier about how we keep things sometimes because we can, and don’t worry about other things because we can’t keep them, like what happens in the classroom. There are some very scary downsides of being able to capture more and more of this, and I think we do need to be careful about that. We need to be worried, I think, about mandates to keep some of this forever. I mean, one of the things that I sort of can’t get out of my head when I think about this is, imagine that all the term papers that you ever wrote, from like age six on, were preserved somewhere, and imagine that was true for everybody. Now of course there would be privacy things around these, but we all know that that breaks down occasionally, and one can just imagine, you know, in an era where no part of the private life or life history of political candidates, for example, seems to be off limits, where corporations seem to be going in for ever more elaborate and paranoid, you know, life history background checks, that having these sorts of things there is chilling, frankly. The notion of, you know, letting people dissect term papers that represent part of, you know, your attempts at intellectual evolution at age fifteen doesn’t really leave you feeling good. So I think there are some issues there we want to think real hard about; we certainly don’t want to lose the atmosphere of spontaneity and of free intellectual engagement and exploration that I think our universities strive for. And becoming too effective at record keeping there, particularly when we’re not doing it with a great emphasis on informed consent and only doing it in appropriate places, could have a very chilling impact.

I think I’m getting the signal that we’re about out of time. Thanks for letting me be with you this morning.