| Graduate Student: | 吳竣宇 Wu, Chun-Yu |
|---|---|
| Thesis Title: | ITNT: In-place Transposition of N-order Tensor on Graphics Processing Units (N階張量在GPU上的原地轉置) |
| Advisors: | 李哲榮 Lee, Che-Rung; 韓永楷 Hon, Wing-Kai |
| Committee Members: | 王弘倫 Wang, Hung-Lung; 蔡孟宗 Tsai, Meng-Tsung |
| Degree: | Master |
| Department: | College of Electrical Engineering and Computer Science - Department of Computer Science |
| Year of Publication: | 2023 |
| Graduation Academic Year: | 111 (ROC calendar) |
| Language: | English |
| Number of Pages: | 79 |
| Keywords (Chinese): | Tensor, High-order Tensor, In-place Transposition, Algorithm, Transposition |
| Keywords (English): | N-Order Tensor, Graphics Processing Units, In-place, Transposition |
Tensor transposition is a fundamental operation widely used in high-performance tensor computation, for example in tensor contraction. However, most existing tensor transposition methods are out-of-place, meaning they need twice the memory of the original tensor to complete the transposition. Such inefficient use of memory is a serious problem on most accelerators; for instance, memory on Graphics Processing Units (GPUs) is relatively limited. This thesis proposes ITNT, an in-place transposition algorithm for high-order tensors. To make effective use of already-optimized matrix transposition kernels, ITNT first decomposes a high-order tensor into smaller sub-tensors. It then lowers the high-order tensor transposition into smaller units, such as matrix transpositions, 3-order tensor transpositions, or 4-order tensor transpositions, depending on the transposition target, and combines the results of these transpositions to obtain the final result. Theoretical analysis shows that, compared with out-of-place methods, ITNT saves at least 95% of the extra memory for sufficiently large tensors. The thesis also presents an efficient GPU implementation of ITNT, evaluated on transpositions of 2-order to 6-order tensors. The GPU implementation of ITNT is compared with the state-of-the-art GPU tensor transposition library CUDA Tensor Transpose (cuTT); the results show that, when memory usage and execution speed are considered together, ITNT outperforms cuTT.
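To make the memory figures concrete (this is only a restatement of the bounds quoted in the two abstracts, with |T| denoting the number of elements of the tensor): an out-of-place transposition keeps the source and the target simultaneously, so its extra memory is |T|, and the "at most 5% additional memory" bound below is the same statement as the "at least 95% savings" figure above:

```latex
% Extra memory of a naive out-of-place transposition vs. the ITNT bound,
% restating the abstracts' 5% / 95% figures (|T| = element count).
\[
  M^{\text{out-of-place}}_{\text{extra}} = |T|, \qquad
  M^{\text{ITNT}}_{\text{extra}} \le 0.05\,|T|
  \;\Longrightarrow\;
  1 - \frac{M^{\text{ITNT}}_{\text{extra}}}{M^{\text{out-of-place}}_{\text{extra}}} \ge 95\%.
\]
```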
Tensor transposition is a fundamental operation in tensor calculation, widely used in various applications. A naive implementation that relocates each element of the source tensor to its transposed position in a separate target tensor requires double the space, which is not suitable for large-scale tensors on memory-limited accelerators such as Graphics Processing Units (GPUs). In this thesis, we present an algorithm and an efficient implementation, called ITNT, for In-place Transposition of N-order Tensor on GPUs, which requires at most 5% additional memory for large tensors. First, ITNT utilizes a newly proposed method, called permutation decomposition, to factorize the transposition of a high-order tensor into a sequence of low-order tensor transpositions. Next, based on an estimate of the extra memory required, ITNT divides a large tensor into smaller tensors and transposes each smaller tensor separately. Last, the transposed sub-tensors are assembled into the desired result. The GPU implementation optimizes memory access using the cooperative groups programming model.
Experiments show that ITNT achieves performance competitive with state-of-the-art out-of-place GPU implementations, while scaling to tensors of nearly double the size for various transpositions of N-order tensors.
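The decomposition idea can be illustrated with a small sketch. The code below is not the thesis's ITNT implementation (which works in place on the GPU and targets matrix, 3-order, and 4-order transpositions); it is a minimal NumPy illustration, with hypothetical helper names, of the general fact the algorithm builds on: any axis permutation of an N-order tensor factors into a sequence of adjacent-axis swaps, and each such swap is just a batch of ordinary matrix transpositions, which is exactly where an optimized matrix-transpose kernel can plug in.

```python
import numpy as np

def adjacent_swaps(perm):
    """Factor an axis permutation into adjacent-axis swaps: returns the
    positions k whose swaps (k, k+1), applied in order to [0, 1, ..., n-1],
    reorder the axes into `perm`."""
    axes = list(range(len(perm)))
    swaps = []
    for i, target in enumerate(perm):
        j = axes.index(target)
        while j > i:                      # bubble the target axis leftwards
            axes[j - 1], axes[j] = axes[j], axes[j - 1]
            swaps.append(j - 1)
            j -= 1
    return swaps

def swap_axes_as_matrix_transpose(t, k):
    """Swap axes k and k+1 of tensor t by viewing it as a batch of
    m-by-n matrices; the only real data movement is a matrix transpose
    applied to every matrix in the batch."""
    shape = t.shape
    outer = int(np.prod(shape[:k], dtype=np.int64))
    m, n = shape[k], shape[k + 1]
    inner = int(np.prod(shape[k + 2:], dtype=np.int64))
    batch = t.reshape(outer, m, n, inner)
    batch = batch.transpose(0, 2, 1, 3)   # per-batch matrix transpose
    return batch.reshape(shape[:k] + (n, m) + shape[k + 2:])

# A 5-order example: the factored transposition matches numpy's transpose.
rng = np.random.default_rng(0)
t = rng.standard_normal((2, 3, 4, 5, 6))
perm = (3, 0, 4, 1, 2)
out = t
for k in adjacent_swaps(perm):
    out = swap_axes_as_matrix_transpose(out, k)
assert np.array_equal(out, t.transpose(perm))
```

The sketch only shows why lowering the order of a transposition is always possible; ITNT's permutation decomposition chooses larger (3-order and 4-order) building blocks and performs each step in place to meet the memory bound above.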
[1] Chetan Nayak et al. "Non-Abelian anyons and topological quantum computation". In: Reviews of Modern Physics 80.3 (2008), pp. 1083–1159. issn: 1539-0756. doi: 10.1103/revmodphys.80.1083. url: http://dx.doi.org/10.1103/RevModPhys.80.1083.
[2] Hans-Joachim Werner et al. "Molpro: a general-purpose quantum chemistry program package". In: WIREs Computational Molecular Science 2.2 (2012), pp. 242–253. doi: 10.1002/wcms.82. url: https://onlinelibrary.wiley.com/doi/abs/10.1002/wcms.82.
[3] M. Alex O. Vasilescu and Demetri Terzopoulos. "Multilinear Analysis of Image Ensembles: TensorFaces". In: Computer Vision — ECCV 2002. Ed. by Anders Heyden et al. Berlin, Heidelberg: Springer Berlin Heidelberg, 2002, pp. 447–460. isbn: 978-3-540-47969-7.
[4] Alexander Novikov et al. Tensorizing Neural Networks. 2015. arXiv: 1509.06569 [cs.LG].
[5] Tamara G. Kolda and Brett W. Bader. "Tensor Decompositions and Applications". In: SIAM Review 51.3 (2009), pp. 455–500. doi: 10.1137/07070111X. url: https://doi.org/10.1137/07070111X.
[6] So Hirata. "Tensor Contraction Engine: Abstraction and Automated Parallel Implementation of Configuration-Interaction, Coupled-Cluster, and Many-Body Perturbation Theories". In: The Journal of Physical Chemistry A 107 (Nov. 2003), pp. 9887–9897. doi: 10.1021/jp034596z.
[7] A. Abdelfattah et al. "High-performance Tensor Contractions for GPUs". In: Procedia Computer Science 80 (2016). International Conference on Computational Science 2016, ICCS 2016, 6-8 June 2016, San Diego, California, USA, pp. 108–118. issn: 1877-0509. doi: 10.1016/j.procs.2016.05.302. url: http://www.sciencedirect.com/science/article/pii/S1877050916306536.
[8] Edgar Solomonik et al. "A massively parallel tensor contraction framework for coupled-cluster computations". In: Journal of Parallel and Distributed Computing 74.12 (2014). Domain-Specific Languages and High-Level Frameworks for High-Performance Computing, pp. 3176–3190. issn: 0743-7315. doi: 10.1016/j.jpdc.2014.06.002. url: http://www.sciencedirect.com/science/article/pii/S074373151400104X.
[9] Yang Shi et al. "Tensor Contractions with Extended BLAS Kernels on CPU and GPU". In: 2016 IEEE 23rd International Conference on High Performance Computing (HiPC) (2016). doi: 10.1109/hipc.2016.031. url: http://dx.doi.org/10.1109/HiPC.2016.031.
[10] Paul Springer and Paolo Bientinesi. "Design of a High-Performance GEMM-like Tensor–Tensor Multiplication". In: ACM Trans. Math. Softw. 44.3 (Jan. 2018). issn: 0098-3500. doi: 10.1145/3157733. url: https://doi.org/10.1145/3157733.
[11] Devin A. Matthews. "High-Performance Tensor Contraction without Transposition". In: SIAM Journal on Scientific Computing 40.1 (2018), pp. C1–C24. doi: 10.1137/16M108968X. url: https://doi.org/10.1137/16M108968X.
[12] Dmitry I. Lyakh. "An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU". In: Computer Physics Communications 189 (Jan. 2015). doi: 10.1016/j.cpc.2014.12.013.
[13] Antti-Pekka Hynninen and Dmitry I. Lyakh. cuTT: A High-Performance Tensor Transpose Library for CUDA Compatible GPUs. 2017. arXiv: 1705.01598 [cs.MS].
[14] Paul Springer, Tong Su, and Paolo Bientinesi. "HPTT: A High-Performance Tensor Transposition C++ Library". In: ARRAY 2017. Barcelona, Spain: Association for Computing Machinery, 2017, pp. 56–62. isbn: 9781450350693. doi: 10.1145/3091966.3091968. url: https://doi.org/10.1145/3091966.3091968.
[15] J. Vedurada et al. "TTLG - An Efficient Tensor Transposition Library for GPUs". In: 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 2018, pp. 578–588. doi: 10.1109/IPDPS.2018.00067.
[16] NVIDIA Multi-Instance GPU User Guide. https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html. 2022.
[17] Fred Gustavson, Lars Karlsson, and Bo Kågström. "Parallel and Cache-Efficient In-Place Matrix Storage Format Conversion". In: ACM Trans. Math. Softw. 38.3 (Apr. 2012). issn: 0098-3500. doi: 10.1145/2168773.2168775. url: https://doi.org/10.1145/2168773.2168775.
[18] Fred G. Gustavson and David W. Walker. "Algorithms for in-place matrix transposition". In: Concurrency and Computation: Practice and Experience 31.13 (2019), e5071. doi: 10.1002/cpe.5071. url: https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.5071.
[19] I-Jui Sung et al. "In-Place Transposition of Rectangular Matrices on Accelerators". In: Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. PPoPP '14. Orlando, Florida, USA: Association for Computing Machinery, 2014, pp. 207–218. isbn: 9781450326568. doi: 10.1145/2555243.2555266. url: https://doi.org/10.1145/2555243.2555266.
[20] J. Gómez-Luna et al. "In-Place Matrix Transposition on GPUs". In: IEEE Transactions on Parallel and Distributed Systems 27.3 (2016), pp. 776–788. doi: 10.1109/TPDS.2015.2412549.
[21] Bryan Catanzaro, Alexander Keller, and Michael Garland. "A Decomposition for In-Place Matrix Transposition". In: SIGPLAN Not. 49.8 (Feb. 2014), pp. 193–206. issn: 0362-1340. doi: 10.1145/2692916.2555253. url: https://doi.org/10.1145/2692916.2555253.
[22] A.A. Tretyakov and E.E. Tyrtyshnikov. "Optimal in-place transposition of rectangular matrices". In: Journal of Complexity 25.4 (2009), pp. 377–384. issn: 0885-064X. doi: 10.1016/j.jco.2009.02.008. url: http://www.sciencedirect.com/science/article/pii/S0885064X09000120.
[23] Fred Gehrung Gustavson and John A Gunnels. "Method and structure for cache aware transposition via rectangular subsections". U.S. Patent. Feb. 2014.
[24] Jose L. Jodra, Ibai Gurrutxaga, and Javier Muguerza. "Efficient 3D Transpositions in Graphics Processing Units". In: Int. J. Parallel Program. 43.5 (Oct. 2015), pp. 876–891. issn: 0885-7458. doi: 10.1007/s10766-015-0366-5. url: https://doi.org/10.1007/s10766-015-0366-5.
[25] Muhammad Elsayed, Saleh El-shehaby, and Mohamed Abougabal. "NDPA: A generalized efficient parallel in-place N-Dimensional Permutation Algorithm". In: Alexandria Engineering Journal 32 (Apr. 2015). doi: 10.1016/j.aej.2015.03.024.
[26] Chih-Chieh Tu. "IT3: In-place Transposition of Third-Order Tensor on Graphics Processing Units". 2021.
[27] Paul Springer, Aravind Sankaran, and Paolo Bientinesi. "TTC: a tensor transposition compiler for multiple architectures". In: Proceedings of the 3rd ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming - ARRAY 2016 (2016). doi: 10.1145/2935323.2935328. url: http://dx.doi.org/10.1145/2935323.2935328.