Linux 下的 Sync 卡死问题分析报告

Luo Xiaogang 创作于 2020/05/14

By rxd of TinyLab.org May 14, 2020

背景

最近陆续有 Linux Lab 用户报告偶发 firefox 跑不起来，偶发执行 sync 命令后不返回的情况，本文是这个 bug 的分析报告。

有部分 Linux Lab 开发者参与了现象收集和初步分析工作，最后由我汇总如下文。在此感谢他们的付出。

现象收集与汇总

我们陆续收到的现象汇总如下：

异常环境信息

Product: Gigabyte、ASUSTeK 、Hasee
ARCH: x86_64
CPU: i5-6500、i5-4210U
RAM: 7915 MiB
System: Deepin 20 Beta、 Ubuntu 18.04 LTS、Ubuntu 20.04 LTS
Kernel: 5.3、5.4
Docker: 19.03
Shell: 5.0

正常环境信息

Product: VMware、Lenovo Y50-70、X555LI、VBox
ARCH: x86_64
CPU: i7-8550U、i5-4210H、i7-6700K、i7-8550U
RAM: 3934 MiB、 2906 MiB
System: Ubuntu 16.04、Ubuntu 18.04、CentOS Linux release 7.7.1908、Deepin 15.11
Kernel: 4.4、4.15、4.19、5.0、5.2、5.3
Docker: 18.09、19.03
Shell: 4.2、4.4、5.0

问题初步分析

寻找必现路径

启动 lab
登录 lab
stop lab
systemctl stop docker
sync 卡死

调用栈初步分析

用下面的命令分别抓取了正常情况和异常情况下 sync 的调用信息

sudo perf  ftrace -t function_graph -G ksys_sync sync > 1.txt

通过分析，发现异常情况下，最后停在了这个位置：

 0)   0.364 us    |      _raw_spin_lock();
 0)               |      down_read() {
 0)               |        _cond_resched() {
 0)   0.239 us    |          rcu_all_qs();
 0)   0.887 us    |        }
 0)               |        rwsem_down_read_slowpath() {
 0)   0.186 us    |          _raw_spin_lock_irq();
 0)   0.373 us    |          wake_up_q();
 0)               |          schedule() {
 0)               |            rcu_note_context_switch() {
 0)   0.391 us    |              rcu_qs();
 0)   0.889 us    |            }
 0)   0.197 us    |            _raw_spin_lock();
 0)   0.202 us    |            update_rq_clock();

问题深入分析

获取第一现场

为了深入分析，我在笔记本上全新安装了 ubuntu 20.04 和 Linux Lab。
等到我的 ubuntu 20.04 装好，Linux Lab 装好，迫不及待的尝试了一下。非常神奇的现象出现了：sync 卡死了，有了第一现场的信息！

分析第一现场

根据多年的嵌入式研发经验，简单的命令卡死，第一反应当然是看 cpu、mem、io、/proc/<pidof sync>/status、/proc/<pidof sync>/stack、gdb attach 上去看看。发现 cpu、mem、IO 消耗都很低，sync 卡死在 ksys_sync 里面的 iterate_supers 中，状态是 disk sleep，gdb 也卡死了。

void ksys_sync(void)
{
 int nowait = 0, wait = 1;

 wakeup_flusher_threads(WB_REASON_SYNC);
 iterate_supers(sync_inodes_one_sb, NULL);
 iterate_supers(sync_fs_one_sb, &nowait);
 iterate_supers(sync_fs_one_sb, &wait);
 iterate_bdevs(fdatawrite_one_bdev, NULL);
 iterate_bdevs(fdatawait_one_bdev, NULL);
 if (unlikely(laptop_mode))
  laptop_sync_completion();
}

此时感觉和 Linux 内核还是有一定关系的（毕竟卡死在内核了）。

排除干扰，寻找元凶

我们逐个排查了 cloud-lab 下 configs/linux-lab/tools/container-run 中加载的模块：loop minix nfsd 9pnet_virtio kvm $kvm_dri。发现加载 nfsd 的时候有异常。
kernel image 是通过 apt upgrade 方式升级的，nfsd 模块是 upgrade 的时候自带的，不是 Linux Lab 中的。因此，可以先确认一下在 host 里面开启 nfs 服务并且挂载 nfs 文件系统，看有没有问题。如果没有问题，再看 Linux Lab 中的配置和 host 有什么区别。
经过验证，在 host 中加载 nfsd 没有问题。但是 Linux Lab 中 /etc/exports 和 host 中的 exports 有差异。统一配置后，在 Linux Lab 中挂载 nfsd 还是有问题。只能在 Linux Lab 中一步步的执行、排查。
实验如下( vbox ubuntu 18.04 kernel v5.3 )：

luo@luo:~/linux/cloud-lab$ lsmod | grep nsfd
luo@luo:~/linux/cloud-lab$ ./tools/docker/run linux-lab
LOG: Current Lab is linux-lab
LOG: Open the running linux-lab
LOG: Current Lab is linux-lab
LOG: Available login methods:

   bash ssh vnc webssh webvnc

LOG: Switch to another method:

   .//home/luo/linux/cloud-lab/tools/docker/login LOGIN_METHOD

LOG: Running 'tools/docker/bash'

LOG: Current Lab is linux-lab
LOG: Running '  docker exec -it -w //labs/linux-lab -u ubuntu 'linux-lab-10410' //bin/bash'
ubuntu@linux-lab:/labs/linux-lab$ sudo modprobe -d /m nfsd
ubuntu@linux-lab:/labs/linux-lab$ echo $?
137
ubuntu@linux-lab:/labs/linux-lab$

执行 modprobe，返回值137，说明执行失败。查看 dmesg，发现如下异常信息：

[  304.086144] Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
[  304.086228] BUG: kernel NULL pointer dereference, address: 0000000000000058
[  304.086231] #PF: supervisor write access in kernel mode
[  304.086233] #PF: error_code(0x0002) - not-present page
[  304.086234] PGD 0 P4D 0
[  304.086238] Oops: 0002 [#1] SMP PTI
[  304.086241] CPU: 3 PID: 9151 Comm: modprobe Tainted: G           O      5.3.0-51-generic #44~18.04.2-Ubuntu
[  304.086242] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  304.086257] RIP: 0010:nfsd_fill_super+0x71/0x90 [nfsd]
[  304.086260] Code: 85 c0 89 c3 74 09 89 d8 5b 41 5c 41 5d 5d c3 49 8b 7c 24 68 31 f6 48 c7 c2 70 84 a0 c0 e8 97 fe ff ff 48 3d 00 f0 ff ff 77 0d <49> 89 45 58 89 d8 5b 41 5c 41 5d 5d c3 89 c3 eb cb 0f 1f 40 00 66
[  304.086261] RSP: 0018:ffffba6945edfaa8 EFLAGS: 00010287
[  304.086263] RAX: ffff9256d671e600 RBX: 0000000000000000 RCX: 0000000000000002
[  304.086265] RDX: 0000000000000000 RSI: 0000000000000100 RDI: ffff9256cf8309e0
[  304.086267] RBP: ffffba6945edfac0 R08: ffff9256d671e620 R09: 0000000000000000
[  304.086268] R10: 0000000000000000 R11: fefefefefefefeff R12: ffff925796085000
[  304.086269] R13: 0000000000000000 R14: ffffffffc09cf4d0 R15: ffff92579b344680
[  304.086272] FS:  00007fb55e4b9740(0000) GS:ffff92579bb80000(0000) knlGS:0000000000000000
[  304.086273] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  304.086275] CR2: 0000000000000058 CR3: 0000000056ab4000 CR4: 00000000000406e0
[  304.086281] Call Trace:
[  304.086288]  vfs_get_super+0x5b/0xe0
[  304.086292]  ? vfs_parse_fs_param+0xdc/0x1c0
[  304.086302]  nfsd_fs_get_tree+0x2c/0x30 [nfsd]
[  304.086305]  vfs_get_tree+0x2a/0x100
[  304.086308]  fc_mount+0x12/0x40
[  304.086310]  vfs_kern_mount.part.31+0x76/0x90
[  304.086312]  vfs_kern_mount+0x13/0x20
[  304.086321]  nfsd_init_net+0x101/0x140 [nfsd]
[  304.086325]  ops_init+0x44/0x120
[  304.086327]  register_pernet_operations+0xed/0x200
[  304.086339]  ? trace_event_define_fields_nfsd_stateid_class+0xb3/0xb3 [nfsd]
[  304.086341]  register_pernet_subsys+0x28/0x40
[  304.086352]  init_nfsd+0x22/0xcbc [nfsd]
[  304.086355]  do_one_initcall+0x4a/0x1fa
[  304.086357]  ? _cond_resched+0x19/0x40
[  304.086361]  ? kmem_cache_alloc_trace+0x15c/0x210
[  304.086365]  do_init_module+0x5f/0x227
[  304.086368]  load_module+0x1aa4/0x2140
[  304.086372]  __do_sys_finit_module+0xfc/0x120
[  304.086375]  ? __do_sys_finit_module+0xfc/0x120
[  304.086379]  __x64_sys_finit_module+0x1a/0x20
[  304.086381]  do_syscall_64+0x5a/0x130
[  304.086383]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  304.086386] RIP: 0033:0x7fb55dfd93c9
[  304.086388] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8f 9a 2c 00 f7 d8 64 89 01 48
[  304.086390] RSP: 002b:00007ffd8171ba08 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[  304.086392] RAX: ffffffffffffffda RBX: 000056544ce0eee9 RCX: 00007fb55dfd93c9
[  304.086393] RDX: 0000000000000000 RSI: 000056544ce0eee9 RDI: 0000000000000003
[  304.086395] RBP: 0000000000000000 R08: 0000000000000000 R09: 000056544d238150
[  304.086396] R10: 0000000000000003 R11: 0000000000000246 R12: 000056544d230250
[  304.086398] R13: 000056544d233b20 R14: 000056544d2382f0 R15: 0000000000040000
[  304.086400] Modules linked in: nfsd(+) auth_rpcgss nfs_acl lockd grace kvm irqbypass 9pnet_virtio 9pnet minix xt_nat xt_tcpudp veth xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c bpfilter br_netfilter bridge stp llc nls_utf8 isofs vboxsf(O) aufs overlay intel_rapl_msr snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi vboxvideo snd_seq snd_seq_device snd_timer drm_vram_helper snd ttm drm_kms_helper drm fb_sys_fops soundcore syscopyarea joydev sysfillrect intel_rapl_common sysimgblt crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 crypto_simd cryptd glue_helper vboxguest(O) input_leds intel_rapl_perf serio_raw mac_hid binfmt_misc sch_fq_codel cuse parport_pc ppdev sunrpc lp parport ip_tables x_tables autofs4 hid_generic usbhid hid ahci i2c_piix4 psmouse libahci video e1000 pata_acpi
[  304.086436] CR2: 0000000000000058
[  304.086439] ---[ end trace 0deb363709117b2c ]---
[  304.086449] RIP: 0010:nfsd_fill_super+0x71/0x90 [nfsd]
[  304.086451] Code: 85 c0 89 c3 74 09 89 d8 5b 41 5c 41 5d 5d c3 49 8b 7c 24 68 31 f6 48 c7 c2 70 84 a0 c0 e8 97 fe ff ff 48 3d 00 f0 ff ff 77 0d <49> 89 45 58 89 d8 5b 41 5c 41 5d 5d c3 89 c3 eb cb 0f 1f 40 00 66
[  304.086453] RSP: 0018:ffffba6945edfaa8 EFLAGS: 00010287
[  304.086454] RAX: ffff9256d671e600 RBX: 0000000000000000 RCX: 0000000000000002
[  304.086456] RDX: 0000000000000000 RSI: 0000000000000100 RDI: ffff9256cf8309e0
[  304.086457] RBP: ffffba6945edfac0 R08: ffff9256d671e620 R09: 0000000000000000
[  304.086459] R10: 0000000000000000 R11: fefefefefefefeff R12: ffff925796085000
[  304.086460] R13: 0000000000000000 R14: ffffffffc09cf4d0 R15: ffff92579b344680
[  304.086462] FS:  00007fb55e4b9740(0000) GS:ffff92579bb80000(0000) knlGS:0000000000000000
[  304.086463] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  304.086465] CR2: 0000000000000058 CR3: 0000000056ab4000 CR4: 00000000000406e0

刚开始加载模块的时候，kernel NULL pointer dereference
supervisor write access in kernel mode，error_code(0x0002) - not-present page，PGD 0 P4D 0
Oops：backtrace 中有 init_nfsd，初始化 nfsd 失败。
sync 卡死可能和在 docker 容器中加载 nfsd 模块失败有关系。

寻找第一现场和第二现场的直接联系

sync 卡死时请求获取的锁被谁持有？

用 perf ftrace 跟踪到 sync 卡死的时候，运行在 ksys_sync 下面的第一个 iterate_supers 中。相关代码如下：

void ksys_sync(void)
{
 int nowait = 0, wait = 1;

 wakeup_flusher_threads(WB_REASON_SYNC);
 iterate_supers(sync_inodes_one_sb, NULL);
 iterate_supers(sync_fs_one_sb, &nowait);
 iterate_supers(sync_fs_one_sb, &wait);
 iterate_bdevs(fdatawrite_one_bdev, NULL);
 iterate_bdevs(fdatawait_one_bdev, NULL);
 if (unlikely(laptop_mode))
  laptop_sync_completion();
}

void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
{
 struct super_block *sb, *p = NULL;

 spin_lock(&sb_lock);
 list_for_each_entry(sb, &super_blocks, s_list) {
  if (hlist_unhashed(&sb->s_instances))
   continue;
  sb->s_count++;
  spin_unlock(&sb_lock);

  down_read(&sb->s_umount);
  if (sb->s_root && (sb->s_flags & SB_BORN))
   f(sb, arg);
  up_read(&sb->s_umount);

  spin_lock(&sb_lock);
  if (p)
   __put_super(p);
  p = sb;
 }
 if (p)
  __put_super(p);
 spin_unlock(&sb_lock);
}

iterate_supers 遍历 super_blocks 链表，执行 f(sb, arg)。perf ftrace 最后结果显示卡死在 down_read(&sb->s_umount) 下。也就是说在拿这个读写信号量锁的时候卡死了，肯定是有人 down_write 了，一直没有 up_write （写者一直持有锁，读者只能一直等）。
问题：谁持有该锁？找到它就相当于找到了导致 sync 卡死的元凶！

void __sched down_read(struct rw_semaphore *sem)
{
 might_sleep();
 rwsem_acquire_read(&sem->dep_map, 0, 0, _RET_IP_);

 LOCK_CONTENDED(sem, __down_read_trylock, __down_read);
}
EXPORT_SYMBOL(down_read);

#define LOCK_CONTENDED(_lock, try, lock) \
 lock(_lock)

inline void __down_read(struct rw_semaphore *sem)
{
 if (!rwsem_read_trylock(sem)) {
  rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE);
  DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
 } else {
  rwsem_set_reader_owned(sem);
 }
}

struct rw_semaphore {
 atomic_long_t count;
 /*
  * Write owner or one of the read owners as well flags regarding
  * the current state of the rwsem. Can be used as a speculative
  * check to see if the write owner is running on the cpu.
  */
 atomic_long_t owner;
#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
 struct optimistic_spin_queue osq; /* spinner MCS lock */
#endif
 raw_spinlock_t wait_lock;
 struct list_head wait_list;
#ifdef CONFIG_DEBUG_LOCK_ALLOC
 struct lockdep_map dep_map;
#endif
};

根据上面的代码，可以看到 trylock 失败后会走 rwsem_down_read_slowpath。用 trace 抓取 rwsem_down_read_slowpath，找到 sem 的 owner。

luo@luo:/usr/share/bcc/tools$ sudo ./trace  'rwsem_down_read_slowpath(struct rw_semaphore *sem, int state) "count=0x%lx  owner=%s", sem->count.counter, ((struct task_struct *)((sem->owner.counter)&~0x7))->comm'
/virtual/main.c:44:66: warning: comparison of array '((struct task_struct *)((sem->owner.counter) & ~7))->comm' not equal to a null pointer is
      always true [-Wtautological-pointer-compare]
        if (((struct task_struct *)((sem->owner.counter)&~0x7))->comm != 0) {
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~    ~
1 warning generated.
PID     TID     COMM            FUNC             -
10127   10127   sync            rwsem_down_read_slowpath count=0x103  owner=modprobe

元凶终于找到了, modprobe ！但是，modprobe 在加载哪个模块的时候出的事故呢？

sync 在同步哪个资源的时候卡死了？

从 down_read(&sb->s_umount) 可以发现，sem 是 super_block 的成员，可以通过 sem 获取 super_block 相关的信息。

luo@luo:/usr/share/bcc/tools$ sudo ./trace 'rwsem_down_read_slowpath(struct rw_semaphore *sem, int state) "count=0x%lx  fs name=%s", sem->count.counter, (((struct super_block*)((void *)sem-(void*)(&(((struct super_block*)0)->s_umount))))->s_id)'
PID     TID     COMM            FUNC             -
10144   10144   sync            rwsem_down_read_slowpath count=0x103  fs name=nfsd

到此，已经找到了 sync 卡死和挂载 nfsd 失败的两个直接联系。

结论

当 Linux >= v5.3 时，在 docker 镜像中加载 nfsd 模块失败，并且 modprobe 退出后 nfsd 模块的读写锁被写者 modprobe 持有，导致 sync 命令申请获取该锁时失败，一直等待。
这个 bug 目前只有一个 workaround 的方案：

针对 Linux >= 5.3 的，在 host 侧插入 nfsd 模块。

针对 win 和 macosx，暂时没有比较好的策略，保留原有策略，后续需要确认 win 和 macosx docker 所用版本，如果不超过 5.3，应该是没问题。

后续找到在 docker 中加载 nfsd 失败的原因了，会写一篇对应的文章。

工具安装

perf

sudo apt update -y && sudo apt upgrade -y
sudo apt install -y linux-tools-common linux-tools-`uname -r` linux-cloud-tools-`uname -r`

ebpf trace

docker pull zlim/bcc
docker run -it --rm \
  --privileged \
  -v /lib/modules:/lib/modules:ro \
  -v /usr/src:/usr/src:ro \
  -v /etc/localtime:/etc/localtime:ro \
  --workdir /usr/share/bcc/tools \
  zlim/bcc

[置顶] Linux Lab v1.3 升级部分内核到 v6.6，新增上游内核工具链支持，完善 riscv64 和 nolibc 开发支持，另有新增 2 款虚拟开发板：ppc64le/pseries 和 ppc64le/powernvLinux Lab 发布 v1.3 正式版，升级部分内核到 v6.6，新增 2 款 ppc64 虚拟开发板

Linux 下的 Sync 卡死问题分析报告

背景

现象收集与汇总

异常环境信息

正常环境信息

问题初步分析

寻找必现路径

调用栈初步分析

问题深入分析

获取第一现场

分析第一现场

排除干扰，寻找元凶

寻找第一现场和第二现场的直接联系

sync 卡死时请求获取的锁被谁持有？

sync 在同步哪个资源的时候卡死了？

结论

工具安装

猜你喜欢：

Read Album:

Read Related:

Read Latest:

支付宝打赏￥9.68元		微信打赏￥9.68元
	请作者喝杯咖啡吧