| Field | Value |
|---|---|
| Graduate Student | 涂智傑 Tu, Chih-Chieh |
| Thesis Title (Chinese) | 三階張量在GPU上的原地轉置 |
| Thesis Title (English) | IT3: In-place Transposition of Third-Order Tensor on Graphics Processing Units |
| Advisor | 李哲榮 Lee, Che-Rung |
| Committee Members | 韓永楷 Hon, Wing-Kai; 高英哲 Kao, Ying-Jer |
| Degree | Master (碩士) |
| Department | College of Electrical Engineering and Computer Science, Department of Computer Science |
| Year of Publication | 2021 |
| Graduation Academic Year | 109 (ROC calendar, 2020–2021) |
| Language | English |
| Pages | 42 |
| Keywords (Chinese) | 三階張量 (third-order tensor), 張量 (tensor), 原地轉置 (in-place transposition), 演算法 (algorithm), 最佳化 (optimization) |
| Keywords (English) | Third-Order Tensor |
Abstract:

Tensor transposition is a basic operation used in many high-performance tensor calculations, such as tensor contractions. However, most existing methods for tensor transposition are out-of-place: they require additional space equal in size to the input tensor to complete the transposition. This low space utilization is a serious problem for most accelerators, such as Graphics Processing Units (GPUs), whose memory capacity is relatively limited. To improve memory usage, this thesis proposes IT3, a set of in-place transposition algorithms for third-order tensors. To exploit existing optimized in-place matrix transposition kernels, IT3 first decomposes a third-order tensor into smaller units, such as matrices, rows, or columns, depending on the transposition configuration. It then combines the results of different matrix transposition methods to achieve the target permutation. Theoretical analysis shows that IT3 requires at most 50% of the input size in additional memory, at least halving the extra space needed by out-of-place methods. We also present efficient GPU implementations of IT3 tuned for modern GPUs. Experiments covering all possible permutations of third-order tensor transposition were conducted, and the GPU implementations of IT3 were compared with the state-of-the-art tensor transposition library for GPUs, CUDA Tensor Transpose (cuTT). The results show that IT3 outperforms cuTT not only in memory usage on all benchmarks but also in speed in most cases.
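The decomposition idea from the abstract is easiest to see for the permutation that only swaps the last two modes. The minimal NumPy sketch below is an illustration only, not the thesis's GPU kernels: the helper names `inplace_matrix_transpose` and `transpose_021` are hypothetical. It treats the flat buffer of an n1 × n2 × n3 row-major tensor as n1 contiguous n2 × n3 matrices and transposes each one in place by cycle-following; the `visited` bitmap is kept for clarity, whereas genuinely in-place kernels avoid or compress such bookkeeping.

```python
import numpy as np

def inplace_matrix_transpose(buf, offset, rows, cols):
    """Cycle-following in-place transpose of the rows x cols row-major
    matrix stored in buf[offset : offset + rows*cols]."""
    n = rows * cols
    visited = np.zeros(n, dtype=bool)  # for clarity; real in-place kernels avoid this
    for start in range(n):
        if visited[start]:
            continue
        cur, tmp = start, buf[offset + start]
        while True:
            visited[cur] = True
            # After transposition the matrix is cols x rows; the element
            # landing at flat index cur comes from flat index src of the
            # original rows x cols layout.
            src = (cur % rows) * cols + cur // rows
            if src == start:
                buf[offset + cur] = tmp
                break
            buf[offset + cur] = buf[offset + src]
            cur = src

def transpose_021(buf, n1, n2, n3):
    """Mode permutation (1,3,2): the flat buffer of an n1 x n2 x n3 tensor
    is n1 contiguous n2 x n3 slices, so transposing each slice in place
    realizes the whole tensor transposition."""
    for i in range(n1):
        inplace_matrix_transpose(buf, i * n2 * n3, n2, n3)

# Sanity check against NumPy's out-of-place transpose.
n1, n2, n3 = 4, 3, 5
t = np.arange(n1 * n2 * n3)
expected = t.reshape(n1, n2, n3).transpose(0, 2, 1).ravel().copy()
transpose_021(t, n1, n2, n3)
assert np.array_equal(t, expected)
```

Other mode permutations do not decompose into independent contiguous slices this cleanly, which is why the abstract describes combining several matrix, row, and column transposition steps to reach the final layout.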