Issue Report
The kernel reported a "Bad page state" BUG in a Subspace farming process (farming-5.18), and the whole server froze shortly afterwards.
Trace below.
I checked the 6th disk (farming-5), an NVMe drive: there are no errors in the NVMe error log, no SMART errors, and xfs_repair found no issues.
The server was running the DEC-11 farmer, just to test whether the files work in preparation for the NUMA test.
I had never seen a Subspace crash take down the entire server before this one.
Environment
Ubuntu server 22.04
CLI
Farmer DEC-11 version
kernel: 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Problem
Dec 31 20:07:19 subtrx kernel: [2170950.307914] BUG: Bad page state in process farming-5.18 pfn:11f4624
Dec 31 20:07:19 subtrx kernel: [2170950.307972] page:00000000e2c9ace2 refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x11f4624
Dec 31 20:07:19 subtrx kernel: [2170950.307975] flags: 0x97ffffc2000000(idle|node=2|zone=2|lastcpupid=0x1fffff)
Dec 31 20:07:19 subtrx kernel: [2170950.307979] raw: 0097ffffc2000000 dead000000000100 dead000000000122 0000000000000000
Dec 31 20:07:19 subtrx kernel: [2170950.307980] raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
Dec 31 20:07:19 subtrx kernel: [2170950.307981] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag(s) set
Dec 31 20:07:19 subtrx kernel: [2170950.307982] Modules linked in: nfnetlink cpuid tls nvme_fabrics binfmt_misc xfs nls_iso8859_1 intel_rapl_msr intel_rapl_common iwlmvm edac_mce_amd mac80211 snd_hda_intel btusb snd_usb_audio snd_intel_dspcfg btrtl snd_intel_sdw_acpi btbcm snd_hda_codec kvm_amd snd_usbmidi_lib btintel libarc4 kvm bluetooth iwlwifi snd_rawmidi snd_hda_core snd_seq_device ecdh_generic mc snd_hwdep rapl wmi_bmof gigabyte_wmi input_leds joydev ecc cfg80211 snd_pcm snd_timer snd ccp soundcore plx_dma k10temp mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear radeon hid_generic drm_ttm_helper ttm drm_kms_helper syscopyarea usbhid sysfillrect sysimgblt hid fb_sys_fops cec rc_core crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd mxm_wmi igb cryptd atlantic ahci drm
Dec 31 20:07:19 subtrx kernel: [2170950.308045] dca libahci macsec i2c_algo_bit xhci_pci nvme xhci_pci_renesas i2c_piix4 nvme_core wmi
Dec 31 20:07:19 subtrx kernel: [2170950.308053] CPU: 7 PID: 1313160 Comm: farming-5.18 Not tainted 5.15.0-89-generic #99-Ubuntu
Dec 31 20:07:19 subtrx kernel: [2170950.308055] Hardware name: Gigabyte Technology Co., Ltd. TRX40 AORUS MASTER/TRX40 AORUS MASTER, BIOS F5k 09/25/2020
Dec 31 20:07:19 subtrx kernel: [2170950.308056] Call Trace:
Dec 31 20:07:19 subtrx kernel: [2170950.308058] <TASK>
Dec 31 20:07:19 subtrx kernel: [2170950.308061] show_stack+0x52/0x5c
Dec 31 20:07:19 subtrx kernel: [2170950.308067] dump_stack_lvl+0x4a/0x63
Dec 31 20:07:19 subtrx kernel: [2170950.308071] dump_stack+0x10/0x16
Dec 31 20:07:19 subtrx kernel: [2170950.308072] bad_page.cold+0x63/0x94
Dec 31 20:07:19 subtrx kernel: [2170950.308076] check_new_page_bad+0x6d/0x80
Dec 31 20:07:19 subtrx kernel: [2170950.308080] rmqueue_bulk+0x45f/0x770
Dec 31 20:07:19 subtrx kernel: [2170950.308082] ? nvme_queue_rq+0x13c/0x1e1 [nvme]
Dec 31 20:07:19 subtrx kernel: [2170950.308087] rmqueue+0x5a6/0xbb0
Dec 31 20:07:19 subtrx kernel: [2170950.308090] ? kmem_cache_alloc+0x1ab/0x2f0
Dec 31 20:07:19 subtrx kernel: [2170950.308092] ? xas_alloc+0xa7/0xd0
Dec 31 20:07:19 subtrx kernel: [2170950.308095] get_page_from_freelist+0xdf/0x540
Dec 31 20:07:19 subtrx kernel: [2170950.308097] ? __mod_memcg_lruvec_state+0x63/0xe0
Dec 31 20:07:19 subtrx kernel: [2170950.308100] __alloc_pages+0x17e/0x330
Dec 31 20:07:19 subtrx kernel: [2170950.308103] alloc_pages+0x9e/0x1e0
Dec 31 20:07:19 subtrx kernel: [2170950.308105] __page_cache_alloc+0x7e/0x90
Dec 31 20:07:19 subtrx kernel: [2170950.308108] page_cache_ra_unbounded+0xac/0x210
Dec 31 20:07:19 subtrx kernel: [2170950.308111] force_page_cache_ra+0xe6/0x150
Dec 31 20:07:19 subtrx kernel: [2170950.308113] page_cache_sync_ra+0x40/0xe0
Dec 31 20:07:19 subtrx kernel: [2170950.308115] filemap_get_pages+0xde/0x3f0
Dec 31 20:07:19 subtrx kernel: [2170950.308117] ? atime_needs_update+0x104/0x180
Dec 31 20:07:19 subtrx kernel: [2170950.308121] filemap_read+0xbc/0x3e0
Dec 31 20:07:19 subtrx kernel: [2170950.308123] ? uprobe_notify_resume+0x10/0x390
Dec 31 20:07:19 subtrx kernel: [2170950.308125] ? xfs_file_buffered_read+0xb1/0xc0 [xfs]
Dec 31 20:07:19 subtrx kernel: [2170950.308190] ? xfs_file_read_iter+0xb3/0x1c0 [xfs]
Dec 31 20:07:19 subtrx kernel: [2170950.308242] generic_file_read_iter+0xe5/0x150
Dec 31 20:07:19 subtrx kernel: [2170950.308244] ? down_read+0x13/0xa0
Dec 31 20:07:19 subtrx kernel: [2170950.308247] xfs_file_buffered_read+0xa1/0xc0 [xfs]
Dec 31 20:07:19 subtrx kernel: [2170950.308298] xfs_file_read_iter+0xb3/0x1c0 [xfs]
Dec 31 20:07:19 subtrx kernel: [2170950.308347] new_sync_read+0x10d/0x190
Dec 31 20:07:19 subtrx kernel: [2170950.308351] vfs_read+0x103/0x1a0
Dec 31 20:07:19 subtrx kernel: [2170950.308353] __x64_sys_pread64+0x96/0xc0
Dec 31 20:07:19 subtrx kernel: [2170950.308354] do_syscall_64+0x5c/0xc0
Dec 31 20:07:19 subtrx kernel: [2170950.308357] ? irqentry_exit+0x1d/0x30
Dec 31 20:07:19 subtrx kernel: [2170950.308359] ? common_interrupt+0x55/0xa0
Dec 31 20:07:19 subtrx kernel: [2170950.308360] entry_SYSCALL_64_after_hwframe+0x62/0xcc
Dec 31 20:07:19 subtrx kernel: [2170950.308362] RIP: 0033:0x7fe169ec759f
Dec 31 20:07:19 subtrx kernel: [2170950.308365] Code: 08 89 3c 24 48 89 4c 24 18 e8 6d e4 f7 ff 4c 8b 54 24 18 48 8b 54 24 10 41 89 c0 48 8b 74 24 08 8b 3c 24 b8 11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 04 24 e8 ad e4 f7 ff 48 8b
Dec 31 20:07:19 subtrx kernel: [2170950.308366] RSP: 002b:00007fdaee9f0cf0 EFLAGS: 00000293 ORIG_RAX: 0000000000000011
Dec 31 20:07:19 subtrx kernel: [2170950.308368] RAX: ffffffffffffffda RBX: 00007fe169ec7540 RCX: 00007fe169ec759f
Dec 31 20:07:19 subtrx kernel: [2170950.308370] RDX: 0000000000004d80 RSI: 00007fdb4003f000 RDI: 00000000000000e8
Dec 31 20:07:19 subtrx kernel: [2170950.308371] RBP: 00000000000000e8 R08: 0000000000000000 R09: 00007fdb4003f000
Dec 31 20:07:19 subtrx kernel: [2170950.308372] R10: 0000015940eb5aa0 R11: 0000000000000293 R12: 0000000000004d80
Dec 31 20:07:19 subtrx kernel: [2170950.308373] R13: 7fffffffffffffff R14: 0000015940eb5aa0 R15: 00007fdb4003f000
Dec 31 20:07:19 subtrx kernel: [2170950.308375] </TASK>
Dec 31 20:07:19 subtrx kernel: [2170950.308376] Disabling lock debugging due to kernel taint
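To make the trace easier to cross-check (for example against a memory test or the farmer's own logs), here is a minimal sketch that decodes a few raw values from it. It assumes 4 KiB base pages and the standard x86_64 Linux syscall convention (number in ORIG_RAX, arguments in RDI, RSI, RDX, R10); the variable names are illustrative only, and the script does not say anything about the root cause.

```python
# Values copied verbatim from the kernel trace above.
PFN = 0x11F4624       # "pfn:11f4624" from the "Bad page state" line
ORIG_RAX = 0x11       # syscall number at the time of the BUG
RDI = 0xE8            # 1st syscall argument
RDX = 0x4D80          # 3rd syscall argument
R10 = 0x15940EB5AA0   # 4th syscall argument

# Assumption: 4 KiB base pages, so the physical address is pfn << 12.
PAGE_SHIFT = 12
phys_addr = PFN << PAGE_SHIFT

# On x86_64 Linux, syscall 17 (0x11) is pread64(fd, buf, count, offset),
# which matches __x64_sys_pread64 in the call trace.
syscall_nr = ORIG_RAX
fd, count, offset = RDI, RDX, R10

print(f"bad page physical address: {phys_addr:#x} "
      f"({phys_addr / 2**30:.2f} GiB into physical memory)")
print(f"syscall {syscall_nr}: pread64(fd={fd}, count={count} bytes, "
      f"offset={offset} bytes, about {offset / 2**40:.2f} TiB into the file)")
```

The physical address is only a starting point for narrowing down which memory region held the page. The file descriptor and offset identify the read that was in flight, not necessarily the cause of the bad page.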
The server came to a total standstill about 5 minutes later. The latest log entry anywhere is the Subspace node log at 2023-12-31T20:12:13.284700Z;
there are no further kernel or syslog entries.
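The gap between the BUG and the last log entry can be checked from the two timestamps. This is a minimal sketch, assuming the syslog timestamp is in the same timezone as the node log (UTC) and that the year is 2023; syslog lines do not carry either, so both are assumptions.

```python
from datetime import datetime

# Timestamp of the "Bad page state" line; the year is assumed to be 2023.
bug_time = datetime.strptime("2023 Dec 31 20:07:19", "%Y %b %d %H:%M:%S")

# Last node log entry; the trailing "Z" is dropped, and both timestamps are
# treated as the same timezone (an assumption, see above).
last_entry = datetime.fromisoformat("2023-12-31T20:12:13.284700")

gap = last_entry - bug_time
print(f"time between BUG and last log entry: {gap.total_seconds():.1f} s")
```

If the server's local time is not UTC, the real gap differs by the timezone offset.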