
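Graduate student: 劉承勲
Liu, Cheng-Hsun
Thesis title: 雲原生容器遷移的實現及其在狀態任務動態負載均衡中的應用
Implementations of Cloud-Native Container Migration and Its Applications of Dynamic Load Balance for Stateful Tasks
Advisor: 李哲榮
Lee, Che-Rung
Committee members: 周志遠
Chou, Chi-Yuan
鍾武君
Zhong, Wu-Jun
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science
Year of publication: 2025
Graduation academic year: 113
Language: Chinese
Number of pages: 65
Keywords (Chinese): container, live migration, cloud native, dynamic load, reliability, state preservation, cloud
Keywords (English): migration, container, Kubernetes, Docker, checkpoint&restore, reliability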
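  • The integration of container technology with orchestration frameworks has become
    a cornerstone of cloud-native environments, letting applications seamlessly use the
    scalability, elasticity, high availability, and flexibility that cloud platforms
    provide. For stateful applications, however, dynamic load balancing, which
    automatically adjusts resource allocation according to runtime performance, remains
    a challenge, because a container cannot be migrated to another node together with
    its state (such as memory pages and the program counter). To address this problem,
    this study proposes a new approach based on a Kubernetes Operator that uses the
    checkpoint/restore mechanism to make stateful applications migratable.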
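    To achieve load balancing, we use the Operator framework to build a self-monitoring
    controller that tracks the resource usage of every node in the Kubernetes cluster.
    When a node runs short of resources, the controller automatically triggers pod
    migration, moving pods with high resource consumption from the overloaded node to
    a less loaded node.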
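    For the pod migration implementation, we likewise use the Operator framework to
    snapshot the application state and preserve the container's memory state on the
    source node. We then use Buildah to push and pull these checkpoint images between
    nodes. Strimzi deploys a Kafka message queue that captures events indicating that a
    checkpoint image is ready to be pulled on the target node, so that the new pod can
    be restored and the old pod deleted. In addition, we implement an alternative
    scheme that stores checkpoint files on an NFS persistent volume and lets a
    Kubernetes controller manage the checkpoint and restore operations during push and
    pull, which speeds up migration.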
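    We ran experiments on pod recreation and the migration process, and compared our
    implementation with Podman on several metrics, such as migration time, the number
    of memory pages in the checkpointed pod, and the time required to dump memory. The
    results show that our implementation has shorter checkpoint and restore times, but
    that image transfer takes relatively longer because of the network latency of Kafka
    messages.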


    Container technology, coupled with orchestration frameworks, has become a
    foundational component of cloud-native environments, allowing applications to
    harness the scalability, elasticity, resilience, and flexibility of cloud platforms.
    However, container migration, a process critical for dynamic load balancing and
    fault tolerance, remains challenging for stateful applications due to their
    persistent data requirements.
    In this work, we introduce a novel Kubernetes operator that leverages the
    Checkpoint/Restore mechanism to enable efficient migration of stateful
    applications. Our solution uses the Kubernetes operator framework to create
    application snapshots and preserve container memory on the source node. Next, we
    use Buildah to transfer these checkpoint images between nodes, and we use Strimzi
    to deploy a Kafka message queue that captures events signaling that the
    destination node can retrieve and restore the checkpoint image, instantiate the
    new pod, and delete the old one. To further improve performance, we also store
    checkpoint files on an NFS persistent volume, with a Kubernetes controller
    managing the checkpoint and restore processes during transfer events.
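
    The ordering of these steps can be sketched as follows. This is a minimal,
    in-memory illustration and not the thesis implementation: the `Node` class, the
    `migrate` function, the dictionary standing in for the image registry (or NFS
    volume), and the `queue.Queue` standing in for the Kafka topic are all
    hypothetical names introduced here. No Kubernetes, Buildah, or Strimzi API is
    called.

```python
import queue
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a cluster node; pods maps pod name -> state."""
    name: str
    pods: dict = field(default_factory=dict)


def migrate(pod_name, source, target, registry, events):
    """Move one pod from source to target in the order the abstract describes."""
    # 1. Checkpoint: snapshot the pod's state on the source node.
    snapshot = dict(source.pods[pod_name])
    # 2. Transfer: push the checkpoint image to a shared store
    #    (a dict here; a registry or NFS volume in a real cluster).
    image_ref = f"checkpoint/{pod_name}"
    registry[image_ref] = snapshot
    # 3. Signal: publish an event saying the image is ready to be pulled
    #    (queue.Queue is an assumption standing in for a Kafka topic).
    events.put({"pod": pod_name, "image": image_ref, "target": target.name})
    # 4. Restore: the target side consumes the event, pulls the image,
    #    and starts the new pod from it.
    event = events.get()
    target.pods[event["pod"]] = dict(registry[event["image"]])
    # 5. Clean up: delete the old pod only after the new one exists.
    del source.pods[pod_name]
    return event["image"]


src = Node("node-a", {"redis-0": {"keys": 3}})
dst = Node("node-b")
migrate("redis-0", src, dst, registry={}, events=queue.Queue())
```

    In this sketch the old pod is deleted last, so an exception raised in any
    earlier step leaves the original pod in place on the source node.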
    To validate the effectiveness of our migration mechanism, we applied it to enable
    dynamic load balancing. The operator framework supports a self-monitoring
    controller that continually tracks resource usage across nodes in the Kubernetes
    cluster. When a node faces resource constraints, the controller initiates an
    automatic migration, redistributing resource-intensive pods from the overloaded
    node to a less utilized one.
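
    A decision rule of this kind might look like the sketch below. It is only an
    illustration under stated assumptions: the abstract gives neither a threshold
    nor a selection policy, so `OVERLOAD_THRESHOLD`, `plan_migration`, and the
    choice of "most loaded node, heaviest pod, least loaded target" are
    hypothetical.

```python
from typing import Optional

# Hypothetical threshold: the abstract does not say when a node counts as overloaded.
OVERLOAD_THRESHOLD = 0.8


def plan_migration(node_usage: dict, pod_usage: dict) -> Optional[tuple]:
    """Return (pod, source, target) for one migration, or None if none is needed.

    node_usage maps node name -> utilization in [0, 1].
    pod_usage maps node name -> {pod name: share of that node's resources}.
    """
    overloaded = [n for n, u in node_usage.items() if u > OVERLOAD_THRESHOLD]
    if not overloaded:
        return None
    # Relieve the most heavily loaded node first.
    source = max(overloaded, key=node_usage.get)
    # Send the pod to the least utilized node in the cluster.
    target = min(node_usage, key=node_usage.get)
    if target == source or not pod_usage.get(source):
        return None
    # Pick the most resource-intensive pod on the overloaded node.
    pod = max(pod_usage[source], key=pod_usage[source].get)
    return pod, source, target


nodes = {"node-a": 0.92, "node-b": 0.35, "node-c": 0.60}
pods = {"node-a": {"redis-0": 0.55, "web-1": 0.20}}
plan = plan_migration(nodes, pods)  # ("redis-0", "node-a", "node-b")
```

    A real controller would read these figures from cluster metrics and would also
    need to check that the target has room for the pod, which this sketch omits.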
    Our experiments focused on pod recreation and the migration process, comparing
    our approach with Podman, using metrics such as migration time, the volume of
    memory pages dumped in the checkpointed pod, and memory dumping time. Results
    demonstrate that our implementation achieves shorter checkpointing and
    restoration times, but a longer image transfer time, due to Kafka-induced network
    latency. However, the NFS-based optimization significantly reduces image transfer
    time, underscoring its potential to improve migration efficiency in distributed
    environments.

    Abstract (Chinese)
    Abstract
    List of Figures
    List of Tables
    1 Introduction
        1.1 Motivation
        1.2 Contribution
        1.3 Thesis Organizations
    2 Literature Review
        2.1 Checkpoint and Restore
        2.2 Container Orchestration
        2.3 Reflections on Checkpoint/Restore in Container Orchestration
            2.3.1 Advantages
            2.3.2 Challenges
            2.3.3 Relative work and Comparisons
            2.3.4 Architecture
            2.3.5 Resources
            2.3.6 Operator
            2.3.7 Kubernetes Node Architecture
            2.3.8 Checkpoint/Restore
            2.3.9 CRIU
            2.3.10 How it Checkpoints the Process State
            2.3.11 Integration with container runtimes
            2.3.12 OCI Image Format
            2.3.13 Key Points of the Proposal of the image format for checkpointed pod
            2.3.14 Container Runtime
            2.3.15 cgroups
        3.1 Checkpoint/Restore Operator
        3.2 Controllers
        3.3 MigrationController
        3.4 DaemonsetController
        3.5 NodeController
        3.6 ValidatingWebhook
        3.7 Design Principles
        3.8 Components
        3.9 Interactions between each component
            3.9.1 The workflow of our first design
            3.9.2 The workflow of the second implementation
        3.10 Limitations
        3.11 Obstacles in our design
    4 Evaluation
        4.1 Description
        4.2 Environment
        4.3 Experiments
            4.3.1 Metrics Evaluation for Redis and Stress Test on Checkpoint Stage
        4.4 Comparison of different checkpoint metrics with different experiments
            4.4.1 metrics evaluation for redis and stress test on checkpointing stage
            4.4.2 metrics evaluation for redis and stress test on transferring stage
            4.4.3 metrics evaluation for redis and stress test on restore stage
            4.4.4 Expected Time for Each Stage Using Kubernetes
            4.4.5 Expected Time for Each Stage Using AWS EKS (Amazon Elastic Kubernetes Service)
            4.4.6 Expected Time for each stage using second implementation
            4.4.7 Expected Time for Each Stage Using Podman
            4.4.8 Recover high CPU rate or memory usage rate
        4.5 Discussion
            4.5.1 checkpoint Time
            4.5.2 Image Transferring Time
            4.5.3 Restoration Time
            4.5.4 Failure Handling and Expected Time Calculations
            4.5.5 Recover the node from anomaly state
            4.5.6 Conclusion
    5 Future Work
    References

