A Study on Architecture of Massive Information Management for Digital Library

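XING Chun-Xiao+, ZENG Chun, LI Chao, ZHOU Li-Zhu (Department of Computer Science and Technology, Tsinghua University, Beijing 100084)

Abstract

This paper analyzes the characteristics of data-intensive applications, discusses the technical challenges and key issues in managing massive digital resources, and surveys related work that supports high-performance data-intensive applications, including standards, technologies, and application systems. Based on an analysis and comparison of the related work, a new digital library architecture for massive information management is designed, and its key functional components and core service modules are described. Finally, an application designed and implemented according to this architecture is presented: the Tsinghua University Architecture Digital Library.

Key words: digital library; architecture; massive information management; interoperability; metadata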

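1 Introduction

In recorded human history, printed materials have played an important role in preserving and disseminating human information and knowledge. With the rapid development of computer, telecommunication, multimedia, and storage technologies, however, this role is giving way to digital resources. The rapid growth of information in digital form has had a major impact not only on traditional archives and their information providers, but also on government, commercial, and non-profit organizations. According to the latest report by Lyman and Varian, the world's yearly production of print, film, optical, and magnetic content would require roughly 1.5 billion gigabytes of storage, or roughly 250 megabytes for every person on earth; printed material accounts for only 0.003% of the total. With hard drive capacity growing faster every year, magnetic storage has clearly become the largest medium for storing information and the fastest-growing segment. Digital resources are of many types, including digital texts, documents, scientific data, images, animation, and audio. Their applications are also broad, including DL (digital library), movie centers, other public media (television, broadcast, newspapers, etc.), museums, and national or cooperative information centers. At the same time, the information highway, represented by the Internet, has become an important tool for disseminating digital resources, because governments, companies, groups, research institutes, non-governmental organizations, and educational institutions all over the world publish massive amounts of information on the Web.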

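Technology challenges and key issues

These massive digital resources raise many challenging issues in the area of data management technology. The following are some examples.

(1) Data model.

Traditional data model theories apply only to structured data. Massive digital resources of various types are mostly semi-structured or unstructured, so these theories do not fit them.

New data models are therefore required.

(2) System architecture.

Traditional database management systems are designed for business data processing characterized by concurrent, short, update transactions, so transaction management and concurrency control have always been at the center of the system architecture. Such an architecture is not suitable for managing digital resources, where the classical transaction concept has become less important. What is needed is an efficient and universal framework for managing massive digital resources.

(3) Massive information storage.

The volume of digital data resources is measured in terabytes or petabytes. Traditional storage devices using SCSI can no longer provide efficient storage, online migration, and persistent archiving for such massive digital resources. Research on multi-level storage systems, SAN (storage area networks), and other technologies is therefore inevitable.

(4) Query processing.

In traditional database systems, queries are expressed in a query language such as SQL. Searching massive digital resources calls for many new methods, such as keyword search, full-text search, similarity query, and content-based multimedia retrieval. How to integrate these query methods (including SQL, OQL, and different XML query languages, e.g., XQL, XML-QL, XML-GL) into an efficient and flexible query processing method has not yet been satisfactorily solved.

Solving the problems above will be a major goal for researchers in the next few years. To this end, this paper presents a new architecture for managing massive digital resources effectively. The architecture is intended to meet the requirements of managing digital resources that are distributed, dynamic, massive, and heterogeneous.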

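2 Overview of the Related Work

IEEE STD 610.12[2] defines architecture as the structure of components, their relationships, and the principles and guidelines governing their evolution over time. Much previous work has addressed the development of architectures for digital libraries. We now introduce some related standards and enabling technologies.

Digital library formal model: the 5S model

Digital libraries are complex information systems and therefore need formal foundations, lest development efforts diverge or interoperability suffer. The 5S model (streams, structures, spaces, scenarios, and societies) provides fundamental abstractions that help define digital libraries rigorously and usefully. Streams are sequences of abstract items used to describe static and dynamic content. Structures can be defined as labeled directed graphs, which impose organization. Spaces are sets of abstract items, together with operations on those sets that obey certain rules. Scenarios are sequences of events or actions that modify the state of a computation in order to accomplish a functional requirement. Societies comprise entities and the relationships among them. Together these abstractions relate and unify the concepts needed to formalize and clarify digital libraries, including digital objects, metadata, collections, and services. The formal model can clearly and formally define a minimal digital library, but it still needs to be improved and revised as digital libraries develop.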

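3 Architecture of Massive Information Management for Digital Library

Until now, there has been no common approach to developing and using architectures in large-scale digital library projects. Libraries, research institutions, and universities have traditionally built digital library architectures with techniques, vocabularies, and presentation schemes that suit their own needs and purposes. Against this background, we propose a new architecture of massive information management for digital libraries, based on an analysis of the related work. We design the architecture by complying with related standards and using enabling technologies, with the aim of providing a common framework and software platform for constructing large-scale digital libraries.

3.1 Design principles

The following principles are important in guiding the design of the architecture.

(1) Standardization. We will comply with related international and national standards in all layers of the architecture.

(2) Componentization. It is widely regarded as good software engineering practice, and most development environments provide some ready-made component model.

(3) Reusability. A technical infrastructure that provides functional specifications, object-oriented programming paradigms, and a component model is an effective way to produce reusable software.

(4) Scalability.

The scalability of the proposed architecture is achieved through a progressive, multi-layer framework, together with mechanisms for providing Web services according to service demand and resource availability.

(5) Interoperability.

Interoperability among heterogeneous systems and collections is a central theme. Different systems and collections have a wide variety of data types, metadata standards, protocols, authentication schemes, and business models. The goal of interoperability is to build coherent services for users by integrating components that are technically different and managed by different organizations. This requires cooperation at three levels: technical, content, and organizational.

(6) Architectures should be relatable, comparable, and scalable across digital libraries. This principle requires the use of common terms and definitions, such as metadata, digital object, repository, and collection. It also requires a common set of architectural building blocks as the basis for architecture descriptions.

(7) Architectures should be adaptable, reliable, maintainable, and testable.

The ultimate aim of this principle is to reduce overall software costs, improve product and service quality, and help organizations succeed, or even survive, in current and future difficult times.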

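3.2 Architecture design

Following the design principles above, we design a multi-layer architecture to support service and management in digital libraries. The architecture is a reference model similar to OSI, but it focuses on massive information service, management, and storage.

3.3 Key functional components

(1) Data ingest.

This component provides the services and functions to accept digitized, marked-up, indexed, and cataloged resources from producers, and prepares the contents for data management and knowledge management. Data ingest also performs quality assurance on metadata and digital objects, in compliance with the related data formatting and documentation standards. In addition, it gathers and catalogs new metadata and digital objects from remote digital libraries and the World Wide Web by using interoperability protocols.

(2) Data management.

This component provides the services and functions for storing, maintaining, and accessing the metadata repositories and collections received from data ingest. Its functions include administering the metadata database, multimedia databases, and file systems. It maintains schemas, view definitions, and referential integrity, and performs database updates (loading new descriptive or administrative data). The central metadata store is supplied by data ingest and by metadata gathered from other digital libraries through interoperability protocols. Through output interfaces between layers, it provides data services such as search and browse to the service layer. Object servers managed by data management (such as a video server) provide multimedia services such as video on demand. Data is stored as digital objects in file systems, or as objects in object-relational or object-oriented databases. Descriptive metadata is stored as attributes in metadata databases that comply with metadata standards such as Dublin Core, USMARC, and CNMARC. The metadata attributes use XML (eXtensible Markup Language) to label the information content and a DTD (document type definition) to build a semi-structured representation of the information.

(3) Knowledge management.

Knowledge is represented as relationships among concepts, expressed as rules used within inference engines. This component provides mechanisms for managing all types of such relationships. Every discipline deals with multiple concept spaces that describe relationships among variables, data sets, collections, applications, and domain knowledge. These concept spaces provide a hierarchy of levels of knowledge based on ontology. Ontology is the key technology for describing the semantics of information exchange, and is defined as a specification of a shared conceptualization of a particular domain. Ontologies provide a shared, common understanding of a domain that can be communicated across application systems, and thus make knowledge easier to share and reuse. Knowledge rules and their relationships are stored in the knowledge repository, and the component provides services for processing them in specific applications.

(4) Data archival.

This component provides the services and functions for storing, maintaining, and retrieving metadata, digital objects, and knowledge rules for long-term preservation. Its functions include receiving data to be archived from data ingest and data management and adding them to permanent storage, managing the storage hierarchy, refreshing the media on which archive holdings are stored, performing routine and special error checking, and providing data migration, replication, backup, and disaster recovery capabilities. The system will adopt different networked storage methods (such as NAS, FC-SAN, and IP-SAN) according to system scale and application requirements. The OAIS reference model for data archival is used in the architecture as a standard for understanding archival concepts. We also extend the model so that it meets the archival requirements of digital libraries.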

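4 Case Study: THADL

The Tsinghua University Architecture Digital Library (THADL) is being developed as a prototype system; the project formally started in March 2000. THADL seeks a balance between technology-focused research and content-based research. The project team brings together knowledge from computer science, architecture science, and library information management, and represents a collaboration among three groups with different training: computer researchers, librarians, and subject specialists. The large body of multimedia material in THADL includes papers, journals, photographs, manuscripts, drawings/blueprints, animation, and audio. We have accomplished the goals of THADL. By designing and developing the THADL prototype, we analyzed and summarized the earlier architecture, and the architecture in turn guides the extension of system functions and services. We now describe THADL in more detail. The main research contents of THADL are as follows:

(1) Exploring efficient methods and technologies by analyzing, designing, and evaluating the THADL prototype system, to pave the way for constructing a large-scale digital library;

(2) Building a Chinese architecture digital library on the Internet that provides an intelligent, interactive, and collaborative environment rather than a static repository of digital resources;

(3) Presenting an efficient method to digitize and preserve most kinds of materials for Chinese architecture study;

(4) Establishing metadata specifications and standards, and addressing their related issues, for Chinese architecture science;

(5) Providing friendly, convenient, and personalized services for users on the Internet, including students, scholars, librarians, and ordinary users.

However, THADL is a comprehensive prototype system, and on this basis further research, testing, and analysis are needed to determine whether the proposed architecture meets the needs of a large-scale digital library such as the China Digital Library.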

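5 Conclusion and Future Work

In this paper we discussed the challenging issues in managing massive digital resources, and gave an overview of the related work and enabling technologies that support high-performance data-intensive applications. We designed a new architecture of massive information management for digital libraries, and described its key components and core services. Finally, the case study THADL (Tsinghua University Architecture Digital Library) was presented as a successful instance of the architecture.

In future work, we will study and develop middleware for massive storage management, an XML-based search engine, and multilingual full-text search. We will also design and implement a lightweight interoperability protocol, based on a tailored Z39.50 protocol, to integrate large-scale distributed heterogeneous digital resources.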

References:

[1] Baldonado M, Chang CK, Gravano L, Paepcke A. The Stanford digital library metadata architecture. International Journal on Digital Libraries, 1997,1(2):108~121.

[2] Lagoze C, Hoehn W, Millman D. Core services in the architecture of the national digital library for science education (NSDL). In: Proc. of the 2nd ACM/IEEE-CS Joint Conf. on Digital Library. Portland: ACM Press, 2002. 58~65.

[3] Wactlar H. Multi-Document summarization and visualization in the informedia digital video library. In: Proc. of the 12th New Information Technology Conf. Beijing: Tsinghua University Press, 2001. 323~332.

[4] Gonçalves MA, Fox EA, Watson LT, Kipp NA. Streams, structures, spaces, scenarios, societies (5S): A formal model for digital libraries. Technical Report, TR-01-12, Virginia Tech, 2001.

[5] Thornburgh RH, Schoenborn BJ. Storage Area Networks: Designing and Implementing a Mass Storage System. Prentice Hall, 2000.

[6] Clark T. IP SANS: An Introduction to iSCSI, iFCP, and FCIP Protocols for Storage Area Networks. Addison-Wesley, 2001.

[7] Staab S, Studer R, Schnurr HP, Sure Y. Knowledge processes and ontologies. IEEE Intelligent Systems, 2001,16(1):26~34.

[8] Xing CX, Wu KH, Luo DY, Zhou LZ, Liu GL, Qin YG. THADL: A digital library for Chinese ancient architecture study. In: Proc. of the 12th Int'l. Conf. on New Information Technology. Beijing: Tsinghua University Press, 2001. 373~382.

A Study on Architecture of Massive Information Management for Digital Library

(Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China)

XING Chun-Xiao+, ZENG Chun, LI Chao, ZHOU Li-Zhu

Abstract

This paper investigates the challenging issues and technologies in managing very large digital contents and collections, and gives an overview of the works and enabling technologies in the related areas. Based on the analysis and comparison of the related work, a novel architecture of massive information management for digital library is designed. The key components and core services are described in detail. Finally, a case study THADL (Tsinghua University architecture digital library) that complies with the architectural framework is presented.

Key words: digital library; architecture; massive information management; interoperability; metadata

1 Introduction

In the recorded history of humankind, printed materials used to play a dominant role in the preservation and dissemination of human information and knowledge. However, with the rapid development of technologies in computing, communication, multimedia, and storage, this role is giving way to digital resources in the new era. The explosive growth of information in digital form has posed challenges not only to traditional archives and their information providers, but also to organizations in the government, commercial, and non-profit sectors. According to the latest report by Lyman and Varian, the world's total yearly production of print, film, optical, and magnetic content would require roughly 1.5 billion gigabytes of storage, which is roughly 250 megabytes for every person on earth. Printed documents of all kinds comprise only 0.003% of the total. Magnetic storage is by far the largest medium for storing information and is the most rapidly growing segment, with shipped hard drive capacity doubling every year. The types of digital resources are diverse. They include digital texts, documents, scientific data, images, animation, video, audio, etc. The applications of digital resources are quite broad, including DL (digital library), movie/video centers, other public media (television, broadcast, newspapers, etc.), museums, and national or cooperative information centers. At the same time the information highway, represented by the Internet, has become an important tool for the dissemination of digital resources. Governments, companies, groups, research institutes, non-governmental organizations, and educational institutions all over the world put massive amounts of information on the Web.

Technology challenges and key issues

These massive digital resources present many challenging issues in data management technology area. The following are some examples.

Data model.

Traditional data model theories apply only to structured data, not to massive digital resources of various types, which are mostly semi-structured or unstructured. New data models are therefore needed.

System architecture.

Traditional database management systems are designed for business data processing characterized by concurrent, short, update transactions. Transaction management and concurrency control therefore remain at the center of the system architecture. This architecture is not suitable for the management of digital resources, where the classical transaction concept is becoming less important. We need to pursue novel and universal frameworks for managing massive digital resources.

Massive information storage.

The volume of digital data resources is measured in terabytes or petabytes. Traditional storage devices using SCSI cannot provide efficient storage, online migration, and persistent archiving for such massive digital resources. Research on multi-level storage systems, SAN (storage area networks), and other technologies is therefore inevitable.

Query processing.

In traditional database systems, queries are expressed in a query language such as SQL. In the query and search of massive digital resources, however, many new mechanisms should be used, such as keyword search, full-text search, similarity query, and content-based multimedia retrieval. How to integrate these query methods (including SQL, OQL, and different XML query languages, e.g., XQL, XML-QL, XML-GL) into an efficient and flexible query processing method has not been satisfactorily solved yet.
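To make the integration problem concrete, the following is a minimal sketch, not the paper's method, of combining one unstructured mechanism (keyword search over an inverted index) with one structured mechanism (exact-match filters on metadata fields, as a SQL WHERE clause would do). The records, field names, and function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical in-memory collection; field names are illustrative only.
RECORDS = [
    {"id": 1, "type": "image", "year": 1998, "text": "timber frame temple hall"},
    {"id": 2, "type": "text", "year": 2001, "text": "survey of temple roof structures"},
    {"id": 3, "type": "text", "year": 1995, "text": "garden pavilion drawings"},
]


def build_inverted_index(records):
    """Map each lowercase token to the set of record ids that contain it."""
    index = defaultdict(set)
    for rec in records:
        for token in rec["text"].lower().split():
            index[token].add(rec["id"])
    return index


def search(records, index, keywords, **filters):
    """Return ids of records matching ALL keywords and ALL exact-match filters."""
    ids = {rec["id"] for rec in records}
    for kw in keywords:
        # Keyword part: intersect posting sets from the inverted index.
        ids &= index.get(kw.lower(), set())
    # Structured part: exact match on metadata fields.
    return sorted(
        rec["id"]
        for rec in records
        if rec["id"] in ids and all(rec.get(k) == v for k, v in filters.items())
    )


INDEX = build_inverted_index(RECORDS)
print(search(RECORDS, INDEX, ["temple"]))
print(search(RECORDS, INDEX, ["temple"], type="text"))
```

A real system would also have to decide in which order to evaluate the two parts and how to rank results, which is where the open problem described above lies.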

Solving the problems mentioned above will remain a major goal for researchers in the next few years. To this end, we present a novel architecture for massive information management of digital resources in this paper. This architecture is intended to meet the requirements of managing digital resources that are distributed, dynamic, massive, and heterogeneous.

2 Overview of the Related Work

IEEE STD 610.12[2] defines architecture as the structure of components, their relationships, and the principles and guidelines governing their design and evolution over time. A wealth of previous work has addressed the research and development of architectures for digital libraries. We give an overview of the related standards and enabling technologies as follows.

Digital library formal model: 5S model

Digital libraries are complex information systems and therefore demand formal foundations lest development efforts diverge and interoperability suffer. Reference [12] proposed the fundamental abstractions of 5S (streams, structures, spaces, scenarios, and societies), which help define digital libraries rigorously and usefully. Streams are sequences of abstract items used to describe static and dynamic content. Structures can be defined as labeled directed graphs, which impose organization. Spaces are sets of abstract items and operations on those sets that obey certain rules. Scenarios consist of sequences of events or actions that modify states of a computation in order to accomplish a functional requirement. Societies comprise entities and the relationships between and among them. Together these abstractions relate and unify the concepts of, among others, digital objects, metadata, collections, and services, which are required to formalize and elucidate digital libraries. The formal model shows that a minimal digital library can be defined clearly and formally. But the formal model still needs to be improved and revised as digital libraries develop.
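Two of the five abstractions can be illustrated with a short sketch. It is a deliberately simplified reading of the description above, not the formal definitions of the 5S model: a stream is a plain sequence, and a structure is a labeled directed graph stored as a dictionary of edges. The class and field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Structure:
    """A labeled directed graph: edges maps (source, target) to a label."""
    nodes: set = field(default_factory=set)
    edges: dict = field(default_factory=dict)

    def add_edge(self, source, target, label):
        self.nodes.update((source, target))
        self.edges[(source, target)] = label

    def children(self, source):
        """Targets of all edges leaving the given node, in sorted order."""
        return sorted(t for (s, t) in self.edges if s == source)


@dataclass
class DigitalObject:
    """Toy digital object: a stream plus a structure that organizes it."""
    stream: tuple          # sequence of abstract items, here words
    structure: Structure


doc = Structure()
doc.add_edge("article", "title", "has-part")
doc.add_edge("article", "body", "has-part")
obj = DigitalObject(stream=("Streams", "are", "sequences"), structure=doc)
print(obj.structure.children("article"))
```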

3 Architecture of Massive Information Management for Digital Library

Until recently, there has been no common approach to architecture development and use in the construction of large-scale digital libraries. Libraries, research institutions, and universities have traditionally developed their DL architectures using techniques, vocabularies, and presentation schemes that suit their unique needs and purposes. In this section, we propose a new architecture of massive information management for digital libraries, based on an analysis of the related work. We design the architecture by complying with related standards and making use of enabling technologies. The motivation is to provide a common framework and software platform for constructing large-scale digital libraries.

3.1 Design principles

The following principles are critical in guiding the design of the architecture.

(1) Standardization. We will comply with related international and national standards in all layers of the architecture.

(2) Componentization. Componentization is widely accepted as good software engineering practice, and most modern programming environments adopt some form of component model.

(3) Reusability. A technical infrastructure that provides functional specifications, object-oriented programming paradigms, and component models has been a particularly effective strategy for producing reusable and powerful software.

(4) Scalability. The scalability of the proposed architecture is achieved by adopting a progressive, multi-layer framework, and by providing mechanisms for scaling services appropriately based on service demand and resource availability.

(5) Interoperability. Interoperability among heterogeneous systems and collections is a central theme. Different systems and collections have a wide variety of data types, metadata standards, protocols, authentication schemes, and business models. The goal of interoperability is to build coherent services for users by integrating components that are technically different and managed by different organizations. This requires agreements to cooperate at three levels: technical, content, and organizational.

(6) Architectures should be relatable, comparable, and scalable across DLs. This principle requires the use of common terms and definitions, such as metadata, digital object, repository, and collection. It also requires that a common set of architectural building blocks be used as the basis for architecture descriptions.

(7) Architectures should be adaptable, reliable, maintainable, and testable. Achieving this principle will reduce overall software costs, improve product and service quality, and help organizations to thrive, or even survive, in current and future turbulent times.

3.2 Architecture design

Following the design principles above, we design a multi-layer architecture to support service and management in DL (digital library). The architecture is an OSI-like reference model, but it focuses on massive information service, management, and archival.

3.3 Key functional components

Data ingest.

This component provides the services and functions to accept digitized, marked, indexed, and cataloged resources from producers, and prepares the contents for data management and knowledge management. Data ingest also performs quality assurance on metadata and digital objects, in compliance with the related data formatting and documentation standards. In addition, data ingest gathers and catalogs new metadata and digital objects from remote DLs and the WWW by using interoperability protocols.

Data management.

This component provides the services and functions for storing, maintaining, and accessing both the metadata repositories and the collections received from data ingest. Data management functions include administering the metadata database, multimedia databases, and file systems. The component maintains schemas, view definitions, and referential integrity, and performs database updates (loading new descriptive information or administrative data). The central metadata store is supplied by data ingest, together with metadata gathered from other DLs through interoperability protocols. Through output interfaces between layers, the component provides data services, such as search and browse, to the service layer. Object servers (such as a video server) managed by data management provide all kinds of multimedia services, such as VOD (video on demand). Data is typically stored as digital objects in file systems or as blobs in object-relational or object-oriented databases. Descriptive metadata is typically stored as attributes in metadata databases that comply with metadata standards such as Dublin Core, USMARC, and CNMARC. The metadata attributes use XML (eXtensible Markup Language) to label the information content and a DTD (document type definition) to build a semi-structured representation of the information.
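As an illustration of labeling descriptive metadata with XML, the sketch below builds and re-reads a small Dublin Core record using Python's standard xml.etree.ElementTree module. The wrapper element name `record`, the function name, and the field values are assumptions made for the example and are not taken from THADL. ElementTree does not validate against a DTD, so that step is omitted here.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_record(fields):
    """Build a <record> element with one dc:* child per (name, value) pair."""
    record = ET.Element("record")  # wrapper element name is an assumption
    for name, value in fields:
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record


record = build_record([
    ("title", "Timber Frame Survey"),  # illustrative values, not THADL data
    ("creator", "Unknown"),
    ("type", "Image"),
])
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)

# Read one field back, as a search or browse service might.
parsed = ET.fromstring(xml_text)
print(parsed.find(f"{{{DC_NS}}}title").text)
```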

Knowledge management.

Knowledge is represented as sets of relationships among domain concepts, expressed as rules used within inference engines. This component provides mechanisms for managing all of these types of relationships. Every discipline deals with multiple concept spaces that describe relationships between physical variables, data sets, collections, applications, and domain knowledge. These concept spaces provide a hierarchy of levels of the implied knowledge based on ontology. Ontologies[15], defined as specifications of a shared conceptualization of a particular domain, are the key technology used to describe the semantics of information exchange. They provide a shared and common understanding of the domain that can be communicated across people and application systems, and thus facilitate knowledge sharing and reuse. The concepts and their relationships are stored as knowledge rules in the knowledge repository. The component provides services for manipulating them in specific applications.
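The idea of concept relationships used as inference rules can be sketched with a toy "is-a" hierarchy and one rule, transitivity. The concepts and function names are hypothetical, and a real ontology would carry many more relationship types than this single one.

```python
# Hypothetical "is-a" relationships between domain concepts.
IS_A = {
    "pagoda": "tower",
    "tower": "building",
    "temple hall": "building",
    "building": "structure",
}


def ancestors(concept, is_a=IS_A):
    """Follow is-a links upward and return all broader concepts, nearest first."""
    result = []
    seen = {concept}
    while concept in is_a:
        concept = is_a[concept]
        if concept in seen:  # guard against cycles in the rule set
            break
        seen.add(concept)
        result.append(concept)
    return result


def is_kind_of(concept, broader, is_a=IS_A):
    """Inference rule: is-a is transitive."""
    return broader in ancestors(concept, is_a)


print(ancestors("pagoda"))
print(is_kind_of("pagoda", "structure"))
```

A search service could use such a rule to expand a query for "building" so that it also matches objects cataloged as "pagoda".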

Data archival.

This component provides the services and functions for the storage, maintenance, and retrieval of metadata, digital objects, and knowledge rules for long-term preservation. The functions include receiving the data that need to be archived from the data ingest and data management components and adding them to permanent storage, managing the storage hierarchy, refreshing the media on which archive holdings are stored, performing routine and special error checking, and providing data migration, replication, backup, and disaster recovery capabilities. The system will adopt different networked storage methods (such as NAS, FC-SAN, and IP-SAN) according to system scale and application requirements. The OAIS reference model, a data archival standard, will be used in the architecture for the understanding and increased awareness of archival concepts needed for long-term digital information preservation and access. We also extend the model to make it suitable for the archival requirements of digital libraries.
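Routine error checking and replication can be sketched as a fixity audit: a checksum is recorded at ingest, replicas are re-hashed later, and a damaged replica is restored from an intact one. The use of SHA-256, the two-replica default, and the class and method names are all assumptions made for this example; the paper does not specify a mechanism.

```python
import hashlib


def checksum(data: bytes) -> str:
    """SHA-256 digest used as the fixity value (choice of hash is an assumption)."""
    return hashlib.sha256(data).hexdigest()


class Archive:
    """Toy archive: keeps several replicas per object and a checksum from ingest."""

    def __init__(self):
        self.replicas = {}  # object id -> list of byte strings (the copies)
        self.manifest = {}  # object id -> checksum recorded at ingest

    def ingest(self, object_id, data, copies=2):
        self.replicas[object_id] = [bytes(data) for _ in range(copies)]
        self.manifest[object_id] = checksum(data)

    def audit_and_repair(self, object_id):
        """Return the number of corrupted replicas repaired from an intact copy."""
        expected = self.manifest[object_id]
        copies = self.replicas[object_id]
        good = next((c for c in copies if checksum(c) == expected), None)
        if good is None:
            raise ValueError(f"no intact replica of {object_id!r}")
        repaired = 0
        for i, copy in enumerate(copies):
            if checksum(copy) != expected:
                copies[i] = good
                repaired += 1
        return repaired


archive = Archive()
archive.ingest("obj-1", b"blueprint scan")
archive.replicas["obj-1"][1] = b"blueprint scam"  # simulate media corruption
print(archive.audit_and_repair("obj-1"))
```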

3.4 Core services components

(1) Discovery and search: These components provide fundamental capabilities for locating and finding resources and collections among distributed digital libraries. The challenge is to support end-user resource discovery and information use in a variety of formats, from a number of local and remote sources, and in a seamlessly integrated way. The architecture uses metadata for resource discovery, with a keyword-indexing search engine as a complementary method. The standards and specifications for resource discovery include Z39.50, OAI, and SDARTS. The key techniques include multi-agent and middleware technologies based on metadata, including USMARC, CNMARC, and Dublin Core Element Set 1.0.

(2) Content-based retrieval: These components provide fundamental capabilities for querying multimedia resources and collections, including text, image, video, and music. Content-based image retrieval allows for image queries based on image examples, feature specifications, and primitive text-based search. Content-based video analysis, automatic video indexing, summarization, and relevance feedback are used for video retrieval.

(3) Personalized service and notification: To gather useful information and knowledge quickly and easily, and to alleviate the information overload problem, it has become necessary to provide users with active and adaptive service mechanisms that automatically extract only the relevant incoming documents. This component provides users with a personalized filtering and notification service based on user modeling and profile learning.

(4) Rights management and payment: The first generation of DRM (digital rights management) focuses on security and encryption as a means of solving the issue of unauthorized copying, that is, locking the content and limiting its distribution to only those who pay. The second generation of DRM covers the description, identification, trading, protection, monitoring, and tracking of all forms of rights usage over both tangible and intangible assets, including the management of rights holder relationships.
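The personalized filtering and notification service in item (3) can be illustrated with a very small sketch. It assumes the simplest possible user model, a bag of terms learned from documents the user has accepted, and a fixed score threshold; both are illustrative choices, not the method used in the paper, and all names are hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Split text into lowercase whitespace-separated tokens."""
    return text.lower().split()


class UserProfile:
    """Toy user model: term counts accumulated from documents the user accepted."""

    def __init__(self):
        self.weights = Counter()

    def learn(self, text):
        self.weights.update(tokenize(text))

    def score(self, text):
        """Fraction of the text's tokens that already appear in the profile."""
        tokens = tokenize(text)
        if not tokens:
            return 0.0
        return sum(1 for t in tokens if self.weights[t] > 0) / len(tokens)


def notify(profile, incoming, threshold=0.5):
    """Return only the incoming documents that score at or above the threshold."""
    return [doc for doc in incoming if profile.score(doc) >= threshold]


profile = UserProfile()
profile.learn("ancient timber temple architecture")
incoming = ["timber temple restoration", "stock market report"]
print(notify(profile, incoming))
```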

4 Case Study: THADL

The Tsinghua University Architecture Digital Library (THADL) is developed as a prototype system; the project formally started in March 2000. THADL maintains a balance between technology-focused research and content-based research. The project team brings together three research groups from different disciplines, including computer science, architecture science, and library information management, and represents a substantial cooperation among computer researchers, librarians, and subject specialists. The large amount of multimedia material in the THADL repositories includes papers, journals, photographs, manuscripts, drawings/blueprints, animation, video, and audio on Chinese ancient architecture. We have accomplished the following goals of THADL. By designing and developing the THADL prototype, we analyze and summarize the previous architecture. Meanwhile the architecture also guides the improvement and extension of system functions and services. The details of THADL have already been described. The main research contents of THADL are as follows:

(1) Exploring efficient methods and technologies by analyzing, designing, and evaluating the THADL prototype system to pave the way for constructing a future large-scale digital library;

(2) Building a Chinese architecture digital library that provides an intelligent, interactive, and collaborative learning environment on the Internet rather than the static digital resource repositories;

(3) Presenting an efficient method to digitalize, index, and preserve most kinds of materials for Chinese architecture study;

(4) Establishing metadata specifications and standards and their related issues for Chinese architecture science;

(5) Supporting friendly, active, and personalized services for different users including students, scholars, librarians, and ordinary users on the Internet.

However, THADL is a medium-sized but comprehensive prototype system, and we have a long way to go in researching, testing, and analyzing whether our proposed architecture is suitable for a large-scale digital library application such as the China Digital Library Project.

5 Conclusion and Future Work

In this paper we discuss the challenging issues and technologies in managing very large digital contents and collections, and give an overview of the related work and enabling technologies for supporting high-performance data-intensive applications. We design a novel architecture of massive information management for digital libraries, and describe the key components and core services. Finally, the case study, THADL (Tsinghua University Architecture Digital Library), is presented according to the architectural framework.

In future work, we will study and develop middleware for massive storage management, an XML-based search engine, and multilingual full-text search. We will design and implement a lightweight interoperability protocol, based on a tailored Z39.50 protocol, to support the integration of large-scale distributed heterogeneous digital resources.