
Author: 吳宜鴻 (Yi-Hung Wu)
Thesis title: 全球資訊網資料之分析、索引與擷取 (The Analysis, Indexing, and Retrieval of Web Data)
Advisor: 陳良弼 (Arbee L. P. Chen)
Degree: Doctoral
Department: College of Electrical Engineering and Computer Science, Department of Computer Science
Year of publication: 2001
Academic year of graduation: 89 (ROC calendar)
Language: English
Pages: 121
Keywords: World Wide Web, Data Mining, Information Filtering, Search Engine, Web Page Classification, Query Refinement, Relevance Feedback, Web Page Prefetching
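  • The World Wide Web holds an enormous amount of data, and as the number of users
    keeps growing, diverse, high-quality information services and high-performance
    information systems are urgently needed. This study takes a database management
    perspective and treats the entire World Wide Web as a very large database. In
    addition to traditional database management techniques, it draws on data mining
    to explore data management topics and applications related to the World Wide Web.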

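    In this thesis, we divide the data on the World Wide Web into two categories: the
    contents and related information (metadata) of web pages, and the browsing
    behavior of users. The main goal of this study is to analyze both categories with
    data mining methods to extract the useful information hidden in them, and then to
    use database indexing techniques to apply the results to web page searching,
    filtering, and prefetching.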

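    Given the difficulties search engines face in handling large volumes of data, we
    design a web page filtering mechanism that replaces the traditional practice of
    users actively querying for web pages with one in which users passively receive
    the latest pages. Under this mechanism, each user supplies in advance a profile
    describing the pages of interest, and matching the profiles against the contents
    or related information of web pages determines which users should receive a newly
    arrived page. In this part of the study, we propose several indexing methods
    designed for profiles: one family targets keywords in page contents, and the
    other targets the URLs of web pages.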

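    The rapid growth of the World Wide Web brings more opportunities for information
    sharing. As the number of users grows, analyzing their interests and behavior
    offers another way to improve the quality of service. In this part of the study,
    we use data mining techniques to derive user profiles, divide the profiles into
    six specific types, and then apply the clustering results to a personalized
    recommendation system. In this system, page contents are not what decides whether
    information is recommended; instead, the decision rests on whether the browsing
    interests and behavior of users are sufficiently similar.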

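    For the analysis of page contents and related information, we apply rough set
    theory and propose a method that classifies web pages according to hyperlink
    text. Combined with the object-oriented notion of a hierarchy, it allows web page
    data to be stored as a class hierarchy suited to different application
    requirements. By building a prototype web page database, we apply this class
    hierarchy to web page query services and, with the help of the pages' related
    information, further offer users complex queries that involve hyperlinks.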

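    With the keyword query systems common on the World Wide Web, users usually find
    it hard to describe their needs clearly in a single query. In fact, a typical
    query can be quite complex, and a system should not, and can hardly, require
    users to supply precise, error-free queries. In this part of the study, we
    propose a query refinement method that lets users start with a simple query;
    using historical information (the associations between query terms and feedback),
    the system repeatedly modifies the query until it approaches the user's real
    need.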

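    For the analysis of user browsing behavior, we work from log data and apply data
    mining concepts to uncover the information hidden behind browsing behavior. In
    this part of the study, we focus on popular browsing paths and mine the access
    logs of proxy servers; the results are applied to the querying and browsing of
    web flows and to the prediction and prefetching of web page accesses. Because a
    prefetching mechanism stresses system performance, we also design an index
    structure for prefetching rules, which guarantees that each prefetch requires
    few data reads.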


    Owing to the great popularity of the WWW, a huge number of web pages are spread
    over many web sites, and the population of web users is growing rapidly. This
    growth creates an urgent need for information services of high quality and high
    performance. In this study, we treat the WWW as a very large database and explore
    research topics for data-intensive applications on the WWW. We classify the data
    on the WWW into two categories: the contents and metadata of web pages, and the
    browsing behavior of web users. Our goal is to analyze these data for knowledge
    discovery and to apply the results to information services such as searching,
    filtering, and prefetching web pages.

    Because searching web pages scales poorly, we consider a filtering approach whose
    performance holds up no matter how the number of web pages grows. In this
    approach, users first describe what they need in the form of user profiles. By
    comparing web pages with user profiles, the users who are interested in a web
    page can be identified and notified. We consider two types of profiles. One
    contains only a set of keywords that specify the contents of web pages. The other
    considers the URLs of web pages. To address the performance issues, we devise
    several indexing methods for each type of profile.
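
    As an illustration of the keyword-profile case, the sketch below indexes profiles
    with a plain inverted index and counts keyword hits per user. It is not the
    indexing method of the thesis; the class name ProfileIndex, the user names, and
    the rule that a page matches only when it contains every keyword of a profile are
    assumptions made for this example.

```python
from collections import defaultdict


class ProfileIndex:
    """Hypothetical inverted index: keyword -> users whose profile contains it."""

    def __init__(self):
        self.postings = defaultdict(set)  # keyword -> set of user ids
        self.sizes = {}                   # user id -> number of profile keywords

    def add_profile(self, user, keywords):
        # Assumes each user is added once; keywords are compared case-insensitively.
        keywords = {k.lower() for k in keywords}
        self.sizes[user] = len(keywords)
        for k in keywords:
            self.postings[k].add(user)

    def match(self, page_words):
        """Return the users whose every profile keyword occurs in the page."""
        hits = defaultdict(int)
        for word in {w.lower() for w in page_words}:
            for user in self.postings.get(word, ()):
                hits[user] += 1
        return sorted(u for u, n in hits.items() if n == self.sizes[u])


index = ProfileIndex()
index.add_profile("alice", ["web", "mining"])
index.add_profile("bob", ["prefetching"])
matched = index.match("Web usage mining on proxy logs".split())
```

    Only the profiles that share at least one keyword with the incoming page are
    touched, so the cost of a match depends on the page rather than on the total
    number of profiles.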

    The dramatic growth of the Web has increased the opportunities for information
    sharing. As the population on the Web grows, analyzing user interests and
    behaviors will provide hints on how to improve the quality of service. In this
    study, we propose a method for deriving user profiles with data mining
    techniques. Moreover, we define six types of user profiles and a distance measure
    for grouping users into clusters. Finally, we realize several kinds of
    recommendation services that use the clustering results.

    To analyze web pages, we propose a classification scheme that takes the hyperlink
    structure and the associated text into consideration. Moreover, we design a
    rough-set-based method for discovering classification rules. Following the
    object-oriented concept, the metadata of web pages are organized into a class
    hierarchy, which can be used for specifying user queries. In addition, a user
    interface is built to support database-like queries. Both the page contents and
    the hyperlink structure can be specified in our query language.
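
    For readers unfamiliar with rough sets, the sketch below computes the standard
    lower and upper approximations of a page class. Pages described by the same
    hyperlink-text terms are indistinguishable; a class can be assigned with
    certainty only to groups that lie entirely inside it. The page names, the
    anchor-text terms, and the function name approximations are hypothetical, and the
    rule discovery method of the thesis itself is not reproduced here.

```python
from collections import defaultdict


def approximations(objects, target_class):
    """objects maps a page name to (condition_tuple, decision_class).

    Returns the rough-set lower and upper approximations of target_class.
    """
    # Group pages that share identical condition attributes.
    blocks = defaultdict(set)
    for name, (condition, _) in objects.items():
        blocks[condition].add(name)

    target = {n for n, (_, decision) in objects.items() if decision == target_class}
    lower, upper = set(), set()
    for members in blocks.values():
        if members <= target:   # the whole group belongs to the class: certain
            lower |= members
        if members & target:    # the group overlaps the class: possible
            upper |= members
    return lower, upper


# Hypothetical pages described by terms found in the hyperlinks pointing to them.
pages = {
    "p1": (("course", "syllabus"), "course"),
    "p2": (("course", "syllabus"), "course"),
    "p3": (("home",), "course"),
    "p4": (("home",), "person"),
}
lower, upper = approximations(pages, "course")
```

    Here the terms ("course", "syllabus") support a certain rule for the class
    "course", while ("home",) only supports a possible one because p3 and p4
    disagree.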

    In keyword search on the Web, users often find it difficult to specify queries
    that precisely describe what they need. In fact, such queries can be very
    complex. It is therefore unrealistic for search engines on the Web to demand
    precise queries directly from users. In this study, we propose a new method for
    query refinement, which allows users to specify simple queries and then refines
    them repeatedly. Our method takes advantage of historical information (user
    feedback and query term associations) to refine queries.
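
    One simple way to picture refinement from historical associations is sketched
    below: past sessions record which terms appeared in the results users marked as
    relevant, and a new query is expanded with the terms most often associated with
    its own terms. This co-occurrence counting is an assumption chosen for
    illustration, not the refinement method of the thesis; the function names and the
    sample history are hypothetical.

```python
from collections import Counter, defaultdict


def build_associations(history):
    """history: iterable of (query_terms, feedback_terms) pairs from past sessions."""
    assoc = defaultdict(Counter)
    for query_terms, feedback_terms in history:
        for q in set(query_terms):
            for f in set(feedback_terms):
                if f != q:
                    assoc[q][f] += 1
    return assoc


def refine(query_terms, assoc, k=2):
    """Append up to k terms most often associated with the query terms."""
    scores = Counter()
    for q in query_terms:
        scores.update(assoc.get(q, {}))
    for q in query_terms:
        scores.pop(q, None)  # never re-add a term already in the query
    # Highest count first; ties broken alphabetically so the result is stable.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return list(query_terms) + [term for term, _ in ranked[:k]]


history = [
    (["jaguar"], ["car", "engine"]),
    (["jaguar"], ["car", "dealer"]),
    (["python"], ["snake"]),
]
assoc = build_associations(history)
refined = refine(["jaguar"], assoc, k=2)
```

    Calling refine again on its own output after each round of feedback gives the
    repeated refinement described above.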

    As for the analysis of user behavior, we apply data mining techniques to log data
    in order to find popular sequences of user requests. In addition, we propose a
    framework that applies the mining results to predicting user requests. To address
    the performance issues, we devise an index structure that makes the prediction
    process fast.
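
    The idea of indexing prediction rules can be sketched with first-order rules
    ("after page A, users often request page B") stored in a dictionary keyed by the
    current page, so that a prediction is a single lookup. The thesis mines longer
    popular sequences and uses its own index structure; the simplification to
    consecutive pairs, the thresholds min_support and min_conf, and the sample
    sessions are assumptions made for this example.

```python
from collections import Counter, defaultdict


def build_prefetch_index(sessions, min_support=2, min_conf=0.5):
    """Mine page -> next-page rules from request sequences and index them by page."""
    pair_counts = defaultdict(Counter)  # current page -> Counter of next pages
    page_counts = Counter()             # how often a page is followed by any request
    for session in sessions:
        for current, following in zip(session, session[1:]):
            pair_counts[current][following] += 1
            page_counts[current] += 1

    index = {}
    for current, nexts in pair_counts.items():
        rules = [
            (following, count / page_counts[current])
            for following, count in nexts.items()
            if count >= min_support and count / page_counts[current] >= min_conf
        ]
        if rules:
            # Most confident rule first; ties broken by page name.
            index[current] = sorted(rules, key=lambda rule: (-rule[1], rule[0]))
    return index


def predict(index, current_page):
    """Return the pages worth prefetching after current_page (one dictionary lookup)."""
    return [page for page, _ in index.get(current_page, [])]


sessions = [
    ["/", "/news", "/sports"],
    ["/", "/news", "/weather"],
    ["/", "/about"],
    ["/news", "/sports"],
]
index = build_prefetch_index(sessions)
```

    All counting happens when the index is built, so the work done at request time
    stays constant regardless of how large the log is.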


    Full text: not authorized for public release (on-campus network, off-campus
    network, or the National Central Library's Taiwan theses and dissertations
    system).