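Graduate student: 呂芳誠 (Fang Cheng Lu)
Thesis title: 外部排序與外部搜尋問題之研究 (On the study of external sorting and selection problems)
Advisor: Professor Chuan Yi Tang (唐傳義)
Oral examination committee:
Degree: Doctoral
Department: College of Electrical Engineering and Computer Science, Department of Computer Science
Year of publication: 2001
Academic year of graduation: 89
Language: Chinese
Number of pages: 92
Keywords (Chinese): 外部排序, 外部搜尋, 演算法, 網際網路
Keywords (English): external sorting, external selection, algorithm, internet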
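  • Most past research on algorithms assumes that main memory is large enough to hold all of the data. In practice, however, the data an algorithm must process is often larger than main memory. Problems of this kind are called external-memory problems, and the algorithms that solve them are called external algorithms. Their cost is measured by the number of disk reads and writes rather than by CPU work, because disk access is far slower than the CPU.
    This dissertation proposes optimal algorithms for the external sorting and external selection problems, and applies external algorithms to problems of the same kind on the Internet.

    For sorting, we propose an optimal algorithm that stores sampled data in a queue and uses it to decide which disk reads and writes are needed, thereby reducing their number. Our external sorting algorithm completes the sort by reading and writing each data block only twice. We also compare our algorithm with traditional external sorting algorithms.

    For selection, we propose an external selection algorithm that adapts the ideas of the external sorting algorithm to reduce the number of disk reads and writes that selection requires. We also put forward a new idea: solutions to problems on the Internet can be sought among external algorithms. With a small modification, the external selection algorithm solves the selection problem on the Internet.

    For the external selection algorithm, we analyze the average number of disk reads. Assuming k = N/2, we analyze the average number of times each data block must be read under different data distributions.

    The dissertation examines external-memory problems broadly, proposes optimal algorithms for external sorting and external selection, and analyzes the best, worst, and average cases of the external selection algorithm.

    The proposed external sorting and selection algorithms are subject to a memory-capacity constraint, but this constraint is acceptable in current computer hardware environments.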
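As a rough illustration of the cost measure and the memory constraint mentioned in this abstract, the Python sketch below counts block transfers instead of CPU steps. The function names are hypothetical. The feasibility test is the classical two-pass condition (the number of memory-sized runs must not exceed the number of block buffers that fit in memory), used here only as a stand-in: the abstract does not state the exact constraint the dissertation imposes.

```python
def num_blocks(n_records, block_size):
    """Number of disk blocks needed to hold n_records (ceiling division)."""
    return -(-n_records // block_size)


def two_pass_sort_ios(n_records, block_size):
    """I/O count when every block is read twice and written twice."""
    return 4 * num_blocks(n_records, block_size)


def fits_two_pass(n_records, mem_records, block_size):
    """Assumed stand-in constraint: runs of size mem_records must not
    outnumber the block buffers available in memory."""
    runs = -(-n_records // mem_records)
    buffers = mem_records // block_size
    return runs <= buffers
```

For example, with 10^9 records and blocks of 10^4 records, reading and writing each block twice costs 4 x 10^5 block transfers, and under the assumed condition a memory of 10^7 records suffices while one of 10^6 records does not.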


    The problem of how to sort and select data efficiently has been widely discussed. Sorting extremely large data sets is becoming increasingly important for large corporations, banks, and government institutions, which rely ever more heavily on computers. In most such cases, sorting and selection are carried out by external sorting and selection algorithms, because the data file is too large to fit into main memory and must reside in secondary memory.
    We present an optimal external sorting algorithm for the two-level memory model. Our method differs from traditional external merge sort in that it uses sampling information to reduce the number of disk I/Os in the external phase. The algorithm is simple and makes good use of the memory available in current computer environments. Under a certain memory constraint, it runs with an optimal number of disk I/Os: each record is read exactly twice and written exactly twice.
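The abstract does not give the algorithm itself, so the following Python sketch shows only one plausible sample-guided two-pass scheme, not the author's method. Pass 1 sorts memory-sized loads and records the smallest key of each block as a sample. Pass 2 reads the blocks in ascending order of those samples and emits every record that cannot be preceded by anything still on disk. The simulated disk, the function name `sample_guided_sort`, the parameters `mem` and `blk`, and the use of a heap are all assumptions. The heap is not size-limited here, whereas a real implementation would depend on a memory constraint to bound it.

```python
import heapq
from collections import deque


def sample_guided_sort(records, mem, blk):
    """Sort `records` on a simulated disk; return (sorted_list, io_count).

    `mem` and `blk` are capacities in records (hypothetical parameters);
    this sketch assumes `mem` is a multiple of `blk`.
    """
    assert mem % blk == 0, "sketch assumes mem is a multiple of blk"
    io = 0
    disk = []      # simulated disk: each entry is one sorted block
    samples = []   # (smallest key in block, block index), kept in memory

    # Pass 1: read one memory load at a time, sort it, write it back as blocks.
    for start in range(0, len(records), mem):
        chunk = sorted(records[start:start + mem])
        n_blocks = -(-len(chunk) // blk)   # ceiling division
        io += n_blocks                     # reads of the unsorted input blocks
        for i in range(0, len(chunk), blk):
            block = chunk[i:i + blk]
            samples.append((block[0], len(disk)))
            disk.append(block)
        io += n_blocks                     # writes of the sorted blocks

    # Pass 2: fetch blocks in ascending order of their sampled minimum.
    queue = deque(sorted(samples))
    heap, out = [], []
    while queue:
        _, idx = queue.popleft()
        for rec in disk[idx]:
            heapq.heappush(heap, rec)
        io += 1                            # one read per block
        # Every unread record is >= the next sampled minimum, so anything
        # at or below that bound can be emitted now.
        while heap and (not queue or heap[0] <= queue[0][0]):
            out.append(heapq.heappop(heap))
    io += -(-len(out) // blk)              # writes of the sorted output blocks
    return out, io
```

In this sketch each block is read twice and written twice, so sorting 1000 records with `mem=100` and `blk=10` costs 400 simulated block transfers.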

    This dissertation also presents an optimal sampling-based external selection algorithm that selects the k-th smallest item in a large data set under the two-level memory model. The same algorithm is applied to the worldwide selection problem in the Internet environment. Sampling information is used to keep the algorithm simple and to reduce the number of disk I/Os. We discuss the best and worst cases of the algorithm, which is also efficient for multiple selections. Finally, we analyze its average case under an equal-probability assumption: a block is equally likely to overlap or not to overlap with the other blocks.
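To make the idea of sample-based selection concrete, here is a minimal Python sketch under stated assumptions. It is not the dissertation's algorithm. Each block is summarized by its smallest and largest key, the summaries give a lower and an upper bound on the k-th smallest record, and only blocks whose key range overlaps those bounds are read again. The names `make_run_blocks` and `select_kth`, the in-memory simulation of disk blocks, and the assumption that all candidate blocks fit in memory together are illustrative choices.

```python
def make_run_blocks(records, mem, blk):
    """Simulated first pass: sort memory-sized loads and cut them into blocks."""
    blocks = []
    for start in range(0, len(records), mem):
        run = sorted(records[start:start + mem])
        blocks.extend(run[i:i + blk] for i in range(0, len(run), blk))
    return blocks


def select_kth(blocks, k):
    """Return (k-th smallest record, number of blocks re-read); k is 1-based.

    Assumes every block is non-empty and the candidate blocks fit in memory.
    """
    n = sum(len(b) for b in blocks)
    if not 1 <= k <= n:
        raise ValueError("k out of range")
    # Samples kept in memory: (min key, max key, size) for each block.
    info = [(min(b), max(b), len(b)) for b in blocks]

    # Upper bound: at least k records are <= upper.
    seen = 0
    for lo, hi, size in sorted(info, key=lambda t: t[1]):
        seen += size
        if seen >= k:
            upper = hi
            break

    # Lower bound: at least n - k + 1 records are >= lower.
    seen = 0
    for lo, hi, size in sorted(info, key=lambda t: t[0], reverse=True):
        seen += size
        if seen >= n - k + 1:
            lower = lo
            break

    below = 0          # records known to rank before the answer
    candidates = []
    reads = 0
    for block, (lo, hi, size) in zip(blocks, info):
        if hi < lower:
            below += size              # block lies entirely below the answer
        elif lo > upper:
            continue                   # block lies entirely above the answer
        else:
            candidates.extend(block)   # block overlaps [lower, upper]: re-read
            reads += 1
    candidates.sort()
    return candidates[k - below - 1], reads
```

Blocks that do not overlap the interval between the two bounds are never re-read, which is where the saving in disk I/O comes from. How many blocks overlap depends on the data distribution, which is the quantity the average-case analysis in the abstract is concerned with.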

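    Full-text release date: this full text is not authorized for public release (campus network)
    Full-text release date: this full text is not authorized for public release (off-campus network)
    Full-text release date: this full text is not authorized for public release (National Central Library: Taiwan thesis and dissertation system)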