nixpkgs/nixos/modules/system
Graham Christensen ece5c0f304
stage-1: modprobe ext{2,3,4} before resizing
I noticed booting a system with an ext4 root which expanded to 5T took
quite a long time (12 minutes in some cases, 43(!) in others.)

I changed stage-1 to run `resize2fs -d 62` for extra debug output and
timing information. It revealed the adjust_superblock step taking
almost all of the time:

    [Fri Oct 30 11:10:15 UTC 2020] zero_high_bits_in_metadata: Memory used: 132k/0k (63k/70k), time:  0.00/ 0.00/ 0.00
    [Fri Oct 30 11:21:09 UTC 2020] adjust_superblock: Memory used: 396k/4556k (295k/102k), time: 654.21/ 0.59/ 5.13

but when I ran resize2fs on a disk with the identical content growing
to the identical target size, it would only take about 30 seconds. I
looked at what happened between those two steps in the fast case with
strace and found:

```
   235	getrusage(RUSAGE_SELF, {ru_utime={tv_sec=0, tv_usec=1795}, ru_stime={tv_sec=0, tv_usec=3590}, ...}) = 0
   236	write(1, "zero_high_bits_in_metadata: Memo"..., 84zero_high_bits_in_metadata: Memory used: 132k/0k (72k/61k), time:  0.00/ 0.00/ 0.00
   237	) = 84
   238	gettimeofday({tv_sec=1604061278, tv_usec=480147}, NULL) = 0
   239	getrusage(RUSAGE_SELF, {ru_utime={tv_sec=0, tv_usec=1802}, ru_stime={tv_sec=0, tv_usec=3603}, ...}) = 0
   240	gettimeofday({tv_sec=1604061278, tv_usec=480192}, NULL) = 0
   241	mmap(NULL, 2564096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa3c7355000
   242	access("/sys/fs/ext4/features/lazy_itable_init", F_OK) = 0
   243	brk(0xf85000)                           = 0xf85000
   244	brk(0xfa6000)                           = 0xfa6000
   245	gettimeofday({tv_sec=1604061278, tv_usec=538828}, NULL) = 0
   246	getrusage(RUSAGE_SELF, {ru_utime={tv_sec=0, tv_usec=58720}, ru_stime={tv_sec=0, tv_usec=3603}, ...}) = 0
   247	write(1, "adjust_superblock: Memory used: "..., 79adjust_superblock: Memory used: 396k/2504k (305k/92k), time:  0.06/ 0.06/ 0.00
   248	) = 79
   249	gettimeofday({tv_sec=1604061278, tv_usec=539119}, NULL) = 0
   250	getrusage(RUSAGE_SELF, {ru_utime={tv_sec=0, tv_usec=58812}, ru_stime={tv_sec=0, tv_usec=3603}, ...}) = 0
   251	gettimeofday({tv_sec=1604061279, tv_usec=939}, NULL) = 0
   252	getrusage(RUSAGE_SELF, {ru_utime={tv_sec=0, tv_usec=520411}, ru_stime={tv_sec=0, tv_usec=3603}, ...}) = 0
   253	write(1, "fix_uninit_block_bitmaps 2: Memo"..., 88fix_uninit_block_bitmaps 2: Memory used: 396k/2504k (305k/92k), time:  0.46/ 0.46/ 0.00
   254	) = 88
```

In particular the access to /sys/fs seemed interesting. Looking
at the source of resize2fs:

```
[root@ip-172-31-22-182:~/e2fsprogs-1.45.5]# rg -B2 -A1 /sys/fs/ext4/features/lazy_itable_init .
./resize/resize2fs.c
923-	if (getenv("RESIZE2FS_FORCE_LAZY_ITABLE_INIT") ||
924-	    (!getenv("RESIZE2FS_FORCE_ITABLE_INIT") &&
925:	     access("/sys/fs/ext4/features/lazy_itable_init", F_OK) == 0))
926-		lazy_itable_init = 1;
```

I confirmed /sys is mounted, and then found a bug suggesting the
ext4 module is maybe not loaded:
https://bugzilla.redhat.com/show_bug.cgi?id=1071909

My home server doesn't have ext4 loaded and had 3T to play with, so
I tried (and succeeded with) replicating the issue locally:

```
[root@kif:/scratch]# lsmod | grep -i ext

[root@kif:/scratch]# zfs create -V 3G rpool/scratch/ext4

[root@kif:/scratch]# time mkfs.ext4 /dev/zvol/rpool/scratch/ext4
mke2fs 1.45.5 (07-Jan-2020)
Discarding device blocks: done
Creating filesystem with 786432 4k blocks and 196608 inodes
Filesystem UUID: 560a4a8f-93dc-40cc-97a5-f10049bf801f
Superblock backups stored on blocks:
	32768, 98304, 163840, 229376, 294912

Allocating group tables: done
Writing inode tables: done
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done

real	0m2.261s
user	0m0.000s
sys	0m0.025s

[root@kif:/scratch]# zfs set volsize=3T rpool/scratch/ext4

[root@kif:/scratch]# time resize2fs -d 62 /dev/zvol/rpool/scratch/ext4
resize2fs 1.45.5 (07-Jan-2020)
fs has 11 inodes, 1 groups required.
fs requires 16390 data blocks.
With 1 group(s), we have 22234 blocks available.
Last group's overhead is 10534
Need 16390 data blocks in last group
Final size of last group is 26924
Estimated blocks needed: 26924
Extents safety margin: 49
Resizing the filesystem on /dev/zvol/rpool/scratch/ext4 to 805306368 (4k) blocks.
read_bitmaps: Memory used: 132k/0k (63k/70k), time:  0.00/ 0.00/ 0.00
read_bitmaps: I/O read: 1MB, write: 0MB, rate: 3802.28MB/s
fix_uninit_block_bitmaps 1: Memory used: 132k/0k (63k/70k), time:  0.00/ 0.00/ 0.00
resize_group_descriptors: Memory used: 132k/0k (68k/65k), time:  0.00/ 0.00/ 0.00
move_bg_metadata: Memory used: 132k/0k (68k/65k), time:  0.00/ 0.00/ 0.00
zero_high_bits_in_metadata: Memory used: 132k/0k (68k/65k), time:  0.00/ 0.00/ 0.00
```

here it got stuck for quite some time ... straceing this 20 minutes in revealed this in a tight loop:

```
getuid()                                = 0
geteuid()                               = 0
getgid()                                = 0
getegid()                               = 0
prctl(PR_GET_DUMPABLE)                  = 1 (SUID_DUMP_USER)
fallocate(3, FALLOC_FL_ZERO_RANGE, 2222649901056, 2097152) = 0
fsync(3)                                = 0
```

it finally ended 43(!) minutes later:

```
adjust_superblock: Memory used: 264k/3592k (210k/55k), time: 2554.03/ 0.16/15.07
fix_uninit_block_bitmaps 2: Memory used: 264k/3592k (210k/55k), time:  0.16/ 0.16/ 0.00
blocks_to_move: Memory used: 264k/3592k (211k/54k), time:  0.00/ 0.00/ 0.00
Number of free blocks: 755396/780023556, Needed: 0
block_mover: Memory used: 264k/3592k (216k/49k), time:  0.05/ 0.01/ 0.00
block_mover: I/O read: 1MB, write: 0MB, rate: 18.68MB/s
inode_scan_and_fix: Memory used: 264k/3592k (216k/49k), time:  0.00/ 0.00/ 0.00
inode_ref_fix: Memory used: 264k/3592k (216k/49k), time:  0.00/ 0.00/ 0.00
move_itables: Memory used: 264k/3592k (216k/49k), time:  0.00/ 0.00/ 0.00
calculate_summary_stats: Memory used: 264k/3592k (216k/49k), time: 16.35/16.35/ 0.00
fix_resize_inode: Memory used: 264k/3592k (222k/43k), time:  0.04/ 0.00/ 0.00
fix_resize_inode: I/O read: 1MB, write: 0MB, rate: 22.80MB/s
fix_sb_journal_backup: Memory used: 264k/3592k (222k/43k), time:  0.00/ 0.00/ 0.00
overall resize2fs: Memory used: 264k/3592k (222k/43k), time: 2570.90/16.68/15.07
overall resize2fs: I/O read: 1MB, write: 1MB, rate: 0.00MB/s
The filesystem on /dev/zvol/rpool/scratch/ext4 is now 805306368 (4k) blocks long.

real	43m1.943s
user	0m16.761s
sys	0m15.069s
```

I then cleaned up and recreated the zvol, loaded the ext4 module, created the ext4 fs,
resized the volume, and resize2fs'd and it went quite quickly:

```
[root@kif:/scratch]# zfs destroy rpool/scratch/ext4

[root@kif:/scratch]# zfs create -V 3G rpool/scratch/ext4

[root@kif:/scratch]# modprobe ext4

[root@kif:/scratch]# time resize2fs -d 62 /dev/zvol/rpool/scratch/ext4

[root@kif:/scratch]# time mkfs.ext4 /dev/zvol/rpool/scratch/ext4
mke2fs 1.45.5 (07-Jan-2020)
Discarding device blocks: done
Creating filesystem with 786432 4k blocks and 196608 inodes
Filesystem UUID: 5b415f2f-a8c4-4ba0-ac1d-78860de77610
Superblock backups stored on blocks:
	32768, 98304, 163840, 229376, 294912

Allocating group tables: done
Writing inode tables: done
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done

real	0m1.013s
user	0m0.001s
sys	0m0.023s

[root@kif:/scratch]# zfs set volsize=3T rpool/scratch/ext4

[root@kif:/scratch]# time resize2fs -d 62 /dev/zvol/rpool/scratch/ext4
resize2fs 1.45.5 (07-Jan-2020)
fs has 11 inodes, 1 groups required.
fs requires 16390 data blocks.
With 1 group(s), we have 22234 blocks available.
Last group's overhead is 10534
Need 16390 data blocks in last group
Final size of last group is 26924
Estimated blocks needed: 26924
Extents safety margin: 49
Resizing the filesystem on /dev/zvol/rpool/scratch/ext4 to 805306368 (4k) blocks.
read_bitmaps: Memory used: 132k/0k (63k/70k), time:  0.00/ 0.00/ 0.00
read_bitmaps: I/O read: 1MB, write: 0MB, rate: 3389.83MB/s
fix_uninit_block_bitmaps 1: Memory used: 132k/0k (63k/70k), time:  0.00/ 0.00/ 0.00
resize_group_descriptors: Memory used: 132k/0k (68k/65k), time:  0.00/ 0.00/ 0.00
move_bg_metadata: Memory used: 132k/0k (68k/65k), time:  0.00/ 0.00/ 0.00
zero_high_bits_in_metadata: Memory used: 132k/0k (68k/65k), time:  0.00/ 0.00/ 0.00
adjust_superblock: Memory used: 264k/1540k (210k/55k), time:  0.02/ 0.02/ 0.00
fix_uninit_block_bitmaps 2: Memory used: 264k/1540k (210k/55k), time:  0.15/ 0.15/ 0.00
blocks_to_move: Memory used: 264k/1540k (211k/54k), time:  0.00/ 0.00/ 0.00
Number of free blocks: 755396/780023556, Needed: 0
block_mover: Memory used: 264k/3592k (216k/49k), time:  0.01/ 0.01/ 0.00
block_mover: I/O read: 1MB, write: 0MB, rate: 157.11MB/s
inode_scan_and_fix: Memory used: 264k/3592k (216k/49k), time:  0.00/ 0.00/ 0.00
inode_ref_fix: Memory used: 264k/3592k (216k/49k), time:  0.00/ 0.00/ 0.00
move_itables: Memory used: 264k/3592k (216k/49k), time:  0.00/ 0.00/ 0.00

calculate_summary_stats: Memory used: 264k/3592k (216k/49k), time: 16.20/16.20/ 0.00
fix_resize_inode: Memory used: 264k/3592k (222k/43k), time:  0.00/ 0.00/ 0.00
fix_resize_inode: I/O read: 1MB, write: 0MB, rate: 5319.15MB/s
fix_sb_journal_backup: Memory used: 264k/3592k (222k/43k), time:  0.00/ 0.00/ 0.00
overall resize2fs: Memory used: 264k/3592k (222k/43k), time: 16.45/16.38/ 0.00
overall resize2fs: I/O read: 1MB, write: 1MB, rate: 0.06MB/s
The filesystem on /dev/zvol/rpool/scratch/ext4 is now 805306368 (4k) blocks long.

real	0m17.908s
user	0m16.386s
sys	0m0.079s
```

Success!
2020-10-30 12:18:23 -04:00
..
activation nixos/doc: Improve code listings 2020-09-23 01:25:25 +02:00
boot stage-1: modprobe ext{2,3,4} before resizing 2020-10-30 12:18:23 -04:00
etc treewide: completely remove types.loaOf 2020-09-02 00:42:50 +02:00