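Graduate student: 蔡文嚴 (Tsai, Wenyen)
Thesis title: 於具備多佇列網路卡的多核心平台上對高效能封包處理之研究
High Performance Packet Processing on Multi-queue and Multi-core Platforms
Advisor: 黃能富 (Huang, Nen Fu)
Oral defense committee: 李維聰
石維寬
林華君
陳俊良
Degree: Doctoral
Department: College of Electrical Engineering and Computer Science - Institute of Communications Engineering
Year of publication: 2015
Academic year of graduation: 103 (ROC calendar)
Language: English
Number of pages: 104
Keywords (Chinese): 網路封包處理, 多核心系統, 同步化技術, 連線追蹤, 中斷綁定
Keywords (English): packet processing, multi-core, synchronization, connection tracking, interrupt affinitization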
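  • With advances in semiconductor technology, multi-core processors are now used throughout everyday life, from lightweight handheld devices such as mobile phones to large mainframes. Meanwhile, multi-queue network interface cards (NICs) emerged to overcome the performance bottleneck caused by multiple processors contending for a single transmit/receive queue on the NIC. Under the traditional hardware-interrupt-driven packet processing model, once the NIC has transferred a packet from its queue to system memory via direct memory access (DMA), it raises an interrupt to notify a processor to carry out the subsequent processing. To make the most of the computing power of multi-core platforms and multi-queue NICs, new interrupt architectures such as PCI-MSI(x) were introduced. They greatly improve the efficiency of interrupt notification and give each NIC queue its own interrupt vector; binding different interrupt vectors to individual processors can then achieve the highest system utilization and overall performance.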
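    Although multi-core processors give software designers more computing power, designing efficient packet processing programs on multi-core platforms poses many challenges not seen on single-core systems. Foremost is how to synchronize data that multiple processors access concurrently, together with the problems that follow from it, such as performance degradation and system deadlock caused by incorrect synchronization. This dissertation first introduces multi-core systems, with particular emphasis on the characteristics of the non-uniform memory access (NUMA) architecture. It then describes software synchronization techniques, from classical locks, semaphores, and lockless operations to hardware-assisted mechanisms such as transactional memory and lock elision, to give readers the necessary background knowledge and terminology.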
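    Connection tracking is the first research topic of this dissertation. Its purpose is to associate each packet with the connection it belongs to, in support of applications that need connection information, such as cross-packet deep inspection and address translation. The difficulty lies in looking up the connection tracking table at high speed in order to update an existing connection or create a new connection record. This dissertation improves on the traditional approach of a single shared tracking table by partitioning it into smaller tables, which reduces the lock/unlock overhead incurred when many processors access one table simultaneously. The measured performance matches our expectation: the fewer processors contend for the same table (lock), the higher the overall performance. In addition, this work proposes a dynamic resource allocation algorithm to avoid the loss of connection tracking capacity caused by unbalanced load.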
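    The partitioning idea can be illustrated with a minimal sketch. It assumes a hash of the flow 5-tuple selects the partition and that each partition carries its own lock; the class and method names (`PartitionedSessionTable`, `track`) are hypothetical and are not taken from the dissertation, whose actual scheme and resource balancing are not reproduced here.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class Partition:
    """One small session table guarded by its own lock."""
    lock: threading.Lock = field(default_factory=threading.Lock)
    sessions: dict = field(default_factory=dict)


class PartitionedSessionTable:
    """Illustrative only: splits one shared table into several smaller ones
    so that fewer threads contend for any single lock."""

    def __init__(self, num_partitions):
        self.partitions = [Partition() for _ in range(num_partitions)]

    def _partition_for(self, flow):
        # Assumption: a hash of the flow 5-tuple selects the partition.
        return self.partitions[hash(flow) % len(self.partitions)]

    def track(self, flow, nbytes):
        """Create or update the session record for `flow`; return its packet count."""
        part = self._partition_for(flow)
        with part.lock:  # only threads hitting this partition contend here
            entry = part.sessions.setdefault(flow, {"packets": 0, "bytes": 0})
            entry["packets"] += 1
            entry["bytes"] += nbytes
            return entry["packets"]


table = PartitionedSessionTable(num_partitions=4)
flow = ("10.0.0.1", 1234, "10.0.0.2", 80, 6)  # src, sport, dst, dport, protocol
table.track(flow, 100)
table.track(flow, 50)
```

    With a single partition this degenerates to the classic shared table behind one lock; raising the partition count trades memory and balancing effort for lower contention.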
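    Interrupt affinitization plays a key role in packet processing performance on multi-core platforms. Under a suboptimal binding, the interrupt-handling load is distributed unevenly across processors and performance drops sharply. However, designing a universally optimal interrupt affinitizer has been proven to be an NP-hard problem, so a feasible research direction is to find a near-optimal affinitization efficiently and systematically. This dissertation first proposes a systematic affinitization algorithm that combines system software and hardware information with the functional configuration of the NIC; experimental results show that its performance is close to that of the best affinitization across different network applications. To extend this algorithm to multi-queue network platforms and to take into account their new interface for providing interrupt affinity hints, we propose qcAffin, an affinitizer that binds multiple queues to multiple processor cores. Because it is optimized for multi-queue NICs, qcAffin substantially outperforms the interrupt affinitization built into the Linux kernel on systems with 1G and 10G multi-queue NICs, and it can perform dynamic interrupt affinitization according to system load.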
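    As a rough illustration of what binding an interrupt vector to a core involves on Linux, the sketch below builds the hexadecimal CPU bitmask format accepted by `/proc/irq/<irq>/smp_affinity` and pairs queue IRQs with cores round-robin. The helper names, the IRQ numbers, and the round-robin policy are assumptions made for the example; this is a naive baseline, not the port-configuration assisted algorithm or qcAffin.

```python
def smp_affinity_mask(cpus):
    """Return the hex bitmask string for a set of CPU ids, in the
    comma-separated 32-bit group format used by /proc/irq/<irq>/smp_affinity."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU ids must be non-negative")
        mask |= 1 << cpu
    digits = format(mask, "x")
    groups = []
    while digits:  # split into 8-hex-digit (32-bit) groups from the right
        groups.append(digits[-8:])
        digits = digits[:-8]
    return ",".join(reversed(groups))


def round_robin_plan(queue_irqs, cores):
    """Naive baseline (assumption): assign each queue IRQ to one core in turn."""
    return {
        irq: smp_affinity_mask([cores[i % len(cores)]])
        for i, irq in enumerate(queue_irqs)
    }


# Hypothetical IRQ numbers for three Rx queues, spread over cores 0 and 1.
plan = round_robin_plan([40, 41, 42], [0, 1])
for irq, mask in plan.items():
    # Applying the plan would mean writing `mask` to /proc/irq/<irq>/smp_affinity
    # as root; the sketch only prints it.
    print(f"IRQ {irq} -> mask {mask}")
```

    A plan like this ignores cache sharing, NUMA placement, and actual load, which is exactly the information a topology-aware affinitizer would add.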


    Advances in semiconductor technology are making way for multi-core and many-core processors that incorporate tens to hundreds of cores in a single package. Meanwhile, network interface cards (NICs) featuring multiple hardware reception (Rx) and transmission (Tx) queues are the networking community's response to the prevalence of multi-core computing. Multi-queue networking avoids the performance degradation caused by multiple cores contending for a single Rx/Tx queue by distributing packets across multiple queues. At the same time, thanks to evolving interrupt handling techniques, a NIC can now be allocated enough interrupt resources for each of its queues to be associated with a dedicated core.
    Although the number of cores in CPUs continues to climb, many difficulties remain in building systems that can keep up with the packet volume of modern medium- to large-scale network deployments. This is due to several factors, including the ever-increasing rate of network traffic (e.g., the now prevalent 10 Gbps, the cutting-edge 40 Gbps, and the upcoming 100 Gbps NICs) and some fundamental limitations in both software and hardware architectures. Software-imposed synchronization overheads in multi-core programming, such as atomic operations and locking, play a critical role in packet processing performance. On the hardware side, architectural complications such as cache coherency and NUMA effects bring new challenges that require developers to acquire new skills in order to unleash the real computing power.
    Correspondingly, researchers attack these challenges with a hardware and software co-design approach that starts by investigating the underlying hardware, gathering the knowledge needed to facilitate software development and enable optimization. In this dissertation, we focus on two problems: 1) reducing lock contention when performing session tracking and 2) affinitizing interrupts from multi-queue NICs to CPU cores with the objective of maximizing packet processing performance.
    For the first problem, we propose a simple partitioning scheme that aims to strike a balance between excessive locking and lockless manipulation. A resource balancing mechanism is also given to prevent underutilization of session tracking resources under unbalanced traffic loads. The effectiveness is demonstrated by improved performance as the number of cores contending for a single lock decreases. For the second problem, interrupt affinitization, an algorithmic approach based on a numerical cost model is proposed to find the best affinitization. Comprehensive experiments covering 1G and 10G NICs with four networking applications ranging from L2 to L7 are conducted to demonstrate its effectiveness.
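    To give a feel for what a cost-driven affinitization could look like, here is a toy greedy heuristic. Everything in it is an assumption made for illustration: the cost (accumulated core load plus the queue's load, scaled by a penalty when the core sits on a different NUMA node from the NIC), the `remote_penalty` value, and the function name. It is not the cost model or the algorithm of the dissertation.

```python
def greedy_affinitize(queue_loads, core_nodes, nic_node, remote_penalty=1.5):
    """Assign each queue to the core with the lowest resulting cost.

    queue_loads: dict queue name -> estimated load (arbitrary units)
    core_nodes:  dict core id -> NUMA node of that core
    nic_node:    NUMA node the NIC is attached to
    """
    core_load = {core: 0.0 for core in core_nodes}
    assignment = {}

    def weighted(core, load):
        # Hypothetical cost term: remote cores pay a multiplicative penalty.
        return load * (1.0 if core_nodes[core] == nic_node else remote_penalty)

    # Place the heaviest queues first so they get the best cores.
    for queue, load in sorted(queue_loads.items(), key=lambda kv: -kv[1]):
        best = min(sorted(core_load), key=lambda c: core_load[c] + weighted(c, load))
        assignment[queue] = best
        core_load[best] += weighted(best, load)
    return assignment


# Three Rx queues, two cores local to the NIC (node 0) and one remote core.
plan = greedy_affinitize(
    queue_loads={"rx0": 4, "rx1": 3, "rx2": 2},
    core_nodes={0: 0, 1: 0, 2: 1},
    nic_node=0,
)
print(plan)
```

    A greedy pass like this runs in polynomial time but offers no optimality guarantee; it only shows how a numerical cost can drive the choice of core.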

    ABSTRACT
    1. INTRODUCTION
    2. BACKGROUND
      2.1 Components of a modern multi-core system
      2.2 Challenges of packet processing on multi-core platforms
        2.2.1 Laws and metrics of parallel programming
        2.2.2 Cache misses on multi-core systems
        2.2.3 Cache-coherency on multi-core systems
        2.2.4 Remote memory and device accesses on NUMA systems
      2.3 Software synchronization techniques
        2.3.0 Classical lock primitives
        2.3.1 Lockless manipulation
        2.3.2 Lock-free data structures
        2.3.3 Transactional memory and lock elision
      2.4 Literature reviews
        2.4.1 Session tracking on multi-core platforms
        2.4.2 Interrupt affinitization on multi-core platforms
    3. LOCK-CONTROLLED SESSION TRACKING
      3.1 Session table partitioning and dynamic resource balancing
        3.1.1 Session table partitioning
        3.1.2 Dynamic resource balancing
      3.2 Experimental results
    4. INTERRUPT AFFINITIZATION ON MULTI-QUEUE AND MULTI-CORE PLATFORMS
      4.1 Port-configuration assisted IRQ affinitization
      4.2 Experimental measurements
        4.2.1 Performance of static affinitization
        4.2.2 Performance of dynamic affinitization with interrupt balancing
      4.3 qcAffin - A hardware topology aware interrupt affinitizing and balancing scheme
        4.3.1 Affinitization based on static hardware topology
        4.3.2 Affinitization cost
        4.3.3 Affinitization based on dynamic system load
        4.3.4 Complexity analysis
      4.4 Experimental results of qcAffin
        4.4.1 Performance without interrupt balancing
        4.4.2 Performance with interrupt balancing
        4.4.3 Run-time overhead
    CONCLUSION AND FUTURE WORKS
    References


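    Full text release date: full text not authorized for public release (on-campus network)
    Full text release date: full text not authorized for public release (off-campus network)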