Fixing "eno1: Detected Hardware Unit Hang" on PVE

Published 2020-06-17, updated 2020-06-18

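I don't know exactly when it started, but my PVE host would lose its network connection from time to time. Because I run LEDE on PVE to handle the dial-up connection, losing the host meant losing internet access as well, and each time the only fix was a reboot. This week I finally hooked a monitor up to the machine, waited two days for the connection to drop again, and tracked down the cause. Here is how to fix it.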

Jun 17 13:09:43 pve kernel: e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
  TDH                  <bd>
  TDT                  <eb>
  next_to_use          <eb>
  next_to_clean        <bd>
buffer_info[next_to_clean]:
  time_stamp           <102c22293>
  next_to_watch        <be>
  jiffies              <102c22440>
  next_to_watch.status <0>
MAC Status             <80083>
PHY Status             <796d>
PHY 1000BASE-T Status  <3800>
PHY Extended Status    <3000>
PCI Status             <10>
Jun 17 13:09:45 pve kernel: e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
  TDH                  <bd>
  TDT                  <eb>
  next_to_use          <eb>
  next_to_clean        <bd>
buffer_info[next_to_clean]:
  time_stamp           <102c22293>
  next_to_watch        <be>
  jiffies              <102c22638>
  next_to_watch.status <0>
MAC Status             <80083>
PHY Status             <796d>
PHY 1000BASE-T Status  <3800>
PHY Extended Status    <3000>
PCI Status             <10>
Jun 17 13:09:47 pve kernel: e1000e 0000:00:19.0 eno1: Detected Hardware Unit Hang:
  TDH                  <bd>
  TDT                  <eb>
  next_to_use          <eb>
  next_to_clean        <bd>
buffer_info[next_to_clean]:
  time_stamp           <102c22293>
  next_to_watch        <be>
  jiffies              <102c22828>
  next_to_watch.status <0>
MAC Status             <80083>
PHY Status             <796d>
PHY 1000BASE-T Status  <3800>
PHY Extended Status    <3000>
PCI Status             <10>

Occasionally the NIC manages to reset itself, but most of the time the reset fails and the NIC simply stops responding.
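To see how often the hang occurs, before and after applying a workaround, the kernel log can be filtered for the message shown above. This is a minimal sketch, assuming the kernel log is readable through `journalctl -k` (text from `dmesg` or `/var/log/syslog` would work the same way); `count_unit_hangs` is a hypothetical helper name, not part of any tool.

```shell
# Hypothetical helper: count "Detected Hardware Unit Hang" messages for one
# interface in kernel log text read from stdin.
count_unit_hangs() {
    iface="$1"
    # -F matches the string literally, -c prints the number of matching lines.
    # grep exits non-zero when the count is 0, so mask that with "|| true".
    grep -cF "${iface}: Detected Hardware Unit Hang" || true
}
```

Typical use would be `journalctl -k | count_unit_hangs eno1`.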

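This problem is specific to Intel NICs: the e1000e driver runs into it under heavy network traffic. It has been around since kernel 3.10 and is still not fixed in the kernel I am running now (5.4.41).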
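Since the problem is tied to the e1000e driver, it is worth confirming that the affected interface actually uses it. The sketch below pulls the driver name out of `ethtool -i` output; `nic_driver` is a hypothetical helper name introduced only for illustration.

```shell
# Hypothetical helper: print the driver name from `ethtool -i IFACE` output
# read on stdin (the relevant line looks like "driver: e1000e").
nic_driver() {
    awk '$1 == "driver:" { print $2; exit }'
}
```

On a machine with the problem described here, `ethtool -i eno1 | nic_driver` would be expected to print `e1000e`.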

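Temporary workaround

Run the following to turn off TCP segmentation offload (TSO) and generic segmentation offload (GSO) on the interface:

ethtool -K eno1 tso off gso off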

If ethtool is not installed, install it first:

apt install ethtool

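Replace eno1 with the name of the interface that has the problem.

The setting does not survive a reboot, so the command has to be run again every time PVE restarts.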
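Because the command has to be repeated and the interface name varies from machine to machine, it can be wrapped in a small function. This is only a sketch: `disable_offloads` and the `ETHTOOL` override variable are hypothetical names, and the override exists solely so the command can be previewed without touching the NIC.

```shell
# Hypothetical helper: turn off TSO and GSO on the given interface.
# Set ETHTOOL=echo to print the arguments instead of running ethtool.
disable_offloads() {
    iface="$1"
    if [ -z "$iface" ]; then
        echo "usage: disable_offloads IFACE" >&2
        return 1
    fi
    "${ETHTOOL:-ethtool}" -K "$iface" tso off gso off
}
```

`disable_offloads eno1` applies the change (as root); `ETHTOOL=echo disable_offloads eno1` only prints the arguments that would be passed.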

Checking that it took effect

Run:

ethtool -k eno1 | grep offload

root@pve:~# ethtool -k eno1 | grep offload
tcp-segmentation-offload: off
udp-fragmentation-offload: off
generic-segmentation-offload: off
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]

tcp-segmentation-offload and generic-segmentation-offload, the two features turned off by the command, both show off, which means the change took effect.
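The same check can be scripted rather than read by eye. A minimal sketch, assuming the `ethtool -k` output format shown above; `tso_gso_disabled` is a hypothetical helper name.

```shell
# Hypothetical helper: read `ethtool -k IFACE` output on stdin and succeed
# only if both TSO and GSO are reported as "off".
tso_gso_disabled() {
    awk '
        $1 == "tcp-segmentation-offload:"     { tso = $2 }
        $1 == "generic-segmentation-offload:" { gso = $2 }
        END { exit !(tso == "off" && gso == "off") }
    '
}
```

For example: `ethtool -k eno1 | tso_gso_disabled && echo "workaround active"`.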

Making the workaround persistent

Open /etc/network/interfaces.

Directly below the line for the affected interface, add:

post-up /usr/sbin/ethtool -K eno1 tso off gso off

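/usr/sbin/ethtool is the full path to ethtool; run which ethtool on the command line to find it on your system.

root@pve:~# which ethtool
/usr/sbin/ethtool

With this in place, the command runs automatically every time PVE boots.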
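To double-check the edit, the file can be searched for the new line. The sketch below is an assumption-laden illustration: `has_offload_postup` is a hypothetical helper, and the stanza layout in the comment (an `iface eno1 inet manual` line, as PVE usually generates for a bridged port) is not taken from the article.

```shell
# Assumed layout of the stanza in /etc/network/interfaces after the edit
# (the "inet manual" line is an assumption about a typical PVE setup):
#
#   iface eno1 inet manual
#       post-up /usr/sbin/ethtool -K eno1 tso off gso off
#
# Hypothetical helper: succeed if FILE contains a post-up ethtool -K line
# for IFACE.
has_offload_postup() {
    file="$1"
    iface="$2"
    grep -Eq "^[[:space:]]*post-up[[:space:]].*ethtool[[:space:]]+-K[[:space:]]+${iface}[[:space:]]" "$file"
}
```

Usage: `has_offload_postup /etc/network/interfaces eno1 && echo "post-up line present"`.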

Permanent fix

Wait for an official patch.

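Extra peace of mind

If you are still worried, you can add the same line under the bridge interface (vmbr0) as well.

If that still isn't enough, change the line to:

post-up /usr/sbin/ethtool -K eno1 gso off gro off tso off tx off rx off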

The ultimate version:

post-up /usr/sbin/ethtool -K eno1 gso off gro off tso off tx off rx off rxvlan off txvlan off sg off

If you are not seeing the problem, running the commands above is not recommended.
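The variants above differ only in which features they switch off, so the post-up line can be generated from a feature list, which makes it easier to start small and add features only if the hang persists. A sketch with a hypothetical `postup_line` helper; the ethtool path is the one reported by which ethtool earlier in this post.

```shell
# Hypothetical helper: print a post-up line that switches off each listed
# ethtool feature on IFACE.
postup_line() {
    iface="$1"
    shift
    args=""
    for feature in "$@"; do
        args="$args $feature off"
    done
    printf 'post-up /usr/sbin/ethtool -K %s%s\n' "$iface" "$args"
}
```

For instance, `postup_line eno1 tso gso` prints the basic line used earlier, and further feature names can be appended as arguments.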

References

https://forum.proxmox.com/threads/e1000-driver-hang.58284/page-5