[TIL] VPP 相關 Troubleshooting 紀錄

VPP 目前只能使用 2M 的 Hugepage

依據目前以下的兩份文件,是建議使用 2M 的 Hugepage,自己測試時,即便已經宣告 1G 的 Hugepage 是沒辦法執行的

安裝 VPP 後,DOWN 的 Network Interface 使用 ifconfig (or ip a ) 會看不到。

在這之前,不僅安裝了 VPP ,還更新 i40e 的 driver
在眾多的程序後,發現 ifconfig 看不到那些有支援 SR-IOV VF 的 interface,以為是 driver 被我更新更新到壞掉了。
一直重新 make install driver 但還是沒效。
仔細一看,發現 igb 的另外一個 DOWN 的 Network Interface 也不見了。 不過硬體資訊有看到

$ sudo lshw -class network -businfo
[sudo] password for winlab:
Bus info          Device           Class      Description
=========================================================
pci@0000:01:00.0                   network    Ethernet Controller X710 for 10GbE SFP+
pci@0000:01:00.1                   network    Ethernet Controller X710 for 10GbE SFP+
pci@0000:01:00.2                   network    Ethernet Controller X710 for 10GbE SFP+
pci@0000:01:00.3                   network    Ethernet Controller X710 for 10GbE SFP+
pci@0000:08:00.0  enp8s0           network    I210 Gigabit Network Connection
pci@0000:09:00.0                   network    I210 Gigabit Network Connection

這樣就開始研究,為何 DOWN 的 Interface 會自己不見。
使用 dmesg 也看不出所以然

$ dmesg | grep igb
[    1.569802] igb: Intel(R) Gigabit Ethernet Network Driver - version 5.4.0-k
[    1.569803] igb: Copyright (c) 2007-2014 Intel Corporation.
[    5.522996] igb 0000:08:00.0: added PHC on eth0
[    5.522997] igb 0000:08:00.0: Intel(R) Gigabit Ethernet Network Connection
[    5.522999] igb 0000:08:00.0: eth0: (PCIe:2.5Gb/s:Width x1) 8c:ea:1b:30:da:09
[    5.523043] igb 0000:08:00.0: eth0: PBA No: 200500-000
[    5.523044] igb 0000:08:00.0: Using MSI-X interrupts. 4 rx queue(s), 4 tx queue(s)
[    5.757962] igb 0000:09:00.0: added PHC on eth1
[    5.757963] igb 0000:09:00.0: Intel(R) Gigabit Ethernet Network Connection
[    5.757965] igb 0000:09:00.0: eth1: (PCIe:2.5Gb/s:Width x1) 8c:ea:1b:30:da:0a
[    5.758010] igb 0000:09:00.0: eth1: PBA No: 200500-000
[    5.758012] igb 0000:09:00.0: Using MSI-X interrupts. 4 rx queue(s), 4 tx queue(s)
[    5.759074] igb 0000:08:00.0 enp8s0: renamed from eth0
[    5.780354] igb 0000:09:00.0 enp9s0: renamed from eth1
[   11.768618] igb 0000:08:00.0 enp8s0: igb: enp8s0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
[   17.134762] igb 0000:09:00.0: removed PHC on enp9s0

使用 journalctl 看的時候,發現一開機啟動 OS 確實有出現 Interface ,但當啟用了 VPP 後,DOWN 的 interface 因為 kernel 的被呼叫,導致 removed PHC,而就此也看不到 Network Interface 了!

Mar 07 17:21:30 kubecord-a vpp[2189]: /usr/bin/vpp[2189]: vlib_pci_bind_to_uio: Skipping PCI device 0000:08:00.0 as host interface
Mar 07 17:21:30 kubecord-a /usr/bin/vpp[2189]: vlib_pci_bind_to_uio: Skipping PCI device 0000:08:00.0 as host interface enp8s0 is
Mar 07 17:21:30 kubecord-a kernel: igb 0000:09:00.0: removed PHC on enp9s0

目前因為還沒仔細研究 VPP 也沒有特別要使用到,因此先移除。移除 VPP 並重開機後,DOWN 的 Interface 就可以看到了。

/etc/default/grubGRUB_CMDLINE_LINUX 同時宣告 Hugepages 1G 和 2M ,導致 kubernetes get no 不到

使用 journalctl -f -u kubelet 看到

Mar 11 13:17:56 kubecord-a kubelet[8532]: E0311 13:17:56.621316    8532 kubelet_node_status.go:92] Unable to register node "kubecord-a" with API server: Node "kubecord-a" is invalid: [status.capacity.hugepages-1Gi: Invalid value: resource.Quantity{i:resource.int64Amount{value:8589934592, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"", Format:"BinarySI"}: may not have pre-allocated hugepages for multiple page sizes, status.allocatable.hugepages-2Mi: Invalid value: resource.Quantity{i:resource.int64Amount{value:2147483648, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"2Gi", Format:"BinarySI"}: may not have pre-allocated hugepages for multiple page sizes, status.allocatable.memory: Invalid value: resource.Quantity{i:resource.int64Amount{value:56150003712, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"54833988Ki", Format:"BinarySI"}: may not have pre-allocated hugepages for multiple page sizes, status.allocatable.pods: Invalid value: resource.Quantity{i:resource.int64Amount{value:110, scal# If you change this file, run 'update-grub' afterwards to update
e:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"110", Format:"DecimalSI"}: may not have pre-allocated hugepages for multiple page sizes, status.allocatable.cpu: Invalid value: resource.Quantity{i:resource.int64Amount{value:31800, scale:-3}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"31800m", Format:"DecimalSI"}: may not have pre-allocated hugepages for multiple page sizes, status.allocatable.ephemeral-storage: Invalid value: resource.Quantity{i:resource.int64Amount{value:226240619760, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"226240619760", Format:"DecimalSI"}: may not have pre-allocated hugepages for multiple page sizes]
Mar 11 13:17:56 kubecord-a kubelet[8532]: E0311 13:17:56.693956    8532 kubelet.go:2236] node "kubecord-a" not found

原本會宣告 1G 的 Hugepages 是因為 DPDK,2M 的 Hugepages 是因為 VPP。
而因為 /etc/default/grubGRUB_CMDLINE_LINUX 沒把 2M 的移除,導致 k8s 起不來。
而把 VPP 移除,2M 的設定也不需要,所以移除 2M 的設定,update-grub 後重開機,Kubernetes 就回來了!

comments powered by Disqus