Issue Report
Environment
Ubuntu 22.04
Advanced CLI
Problem
thread 'plotting-1.0' panicked at /home/subspace/crates/subspace-farmer-components/src/plotting.rs:540:48:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
2024-02-13T13:52:24.988357Z WARN single_disk_farm{disk_farm_index=2}: subspace_farmer::single_disk_farm::plotting: Failed to send sector index for initial plotting error=send failed because receiver is gone
Error: Background task plotting-2 panicked
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: core::result::unwrap_failed
3: <core::pin::Pin<P> as core::future::future::Future>::poll
4: subspace_farmer::single_disk_farm::plotting::plotting::{{closure}}::{{closure}}::{{closure}}::{{closure}}::{{closure}}::{{closure}}
5: tokio::runtime::context::runtime::enter_runtime
6: tokio::runtime::scheduler::multi_thread::worker::block_in_place
7: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
8: rayon_core::registry::WorkerThread::wait_until_cold
9: rayon_core::registry::ThreadBuilder::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
2024-02-13T14:02:41.081200Z WARN single_disk_farm{disk_farm_index=2}: subspace_farmer::single_disk_farm::plotting: Failed to send sector index for initial plotting error=send failed because receiver is gone
Error: Background task plotting-2 panicked
Stack backtrace:
0: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
1: std::sys_common::backtrace::__rust_begin_short_backtrace
2: core::ops::function::FnOnce::call_once{{vtable.shim}}
3: std::sys::pal::unix::thread::Thread::new::thread_start
4: <unknown>
5: <unknown>
Can you tell me the specific reason why this problem occurs?
What mechanism produces this error?
I am willing to help you solve this problem.
There is no problem for me to solve; this is a problem with your hardware. Please read the linked thread and the threads mentioned in it carefully.
This doesn’t seem to be a hardware issue; I’m quite sure of that.
As you wish, but the error you’re getting indicates in-memory data corruption. I checked the code path and so far I see no other explanation for this.
Are you running an overclocked or undervolted system? Some users on Discord who hit similar issues resolved them by adjusting those settings.
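For context on where the panic fires: during sector encoding the plotter processes the record portion of each piece as 32-byte chunks and converts each chunk into a field scalar, with an expect() that assumes the piece getter only ever returns valid history data, so a random bit flip that makes a chunk non-canonical produces exactly the panic message in the logs above. A simplified, self-contained sketch of that pattern (the Scalar type and its validity rule here are stand-ins, not the real subspace-core-primitives API):
// Simplified illustration (hypothetical Scalar stand-in, not the real
// subspace-core-primitives type): the record part of a piece is processed as
// 32-byte chunks, and each chunk must decode to a canonical field scalar.
struct Scalar([u8; 32]);

impl Scalar {
    // Stand-in validity rule; the real check is whether the 32 bytes form a
    // canonical BLS12-381 scalar, which a random bit flip can easily violate.
    fn try_from_bytes(bytes: &[u8; 32]) -> Result<Self, String> {
        if (bytes[31] & 0b1110_0000) == 0 {
            Ok(Self(*bytes))
        } else {
            Err("Invalid scalar".to_string())
        }
    }
}

fn record_to_scalars(record_chunks: &[[u8; 32]]) -> Vec<Scalar> {
    record_chunks
        .iter()
        .map(|chunk| {
            // This expect() is the source of the panic in the logs: the code
            // assumes the piece getter already handed over valid data.
            Scalar::try_from_bytes(chunk).expect(
                "Piece getter must returns valid pieces of history that \
                 contain proper scalar bytes; qed",
            )
        })
        .collect()
}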
My server is not using low-voltage memory modules or overclocking.
A similar situation occurs on three of my seven servers.
My monitoring process redirects all logs.
When the problem occurs, it appears to happen right after a plot sector is completed.
This problem occurs only rarely.
thread 'plotting-1.0' panicked at /home/subspace/crates/subspace-farmer-components/src/plotting.rs:540:48:
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
stack backtrace:
0: 0x555555d9f89f - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h854e3d9599d23b9b
1: 0x555555846a20 - core::fmt::write::hdaa13832d911494b
2: 0x555555d66e0e - std::io::Write::write_fmt::ha2d6d8f909a702b7
3: 0x555555da192e - std::sys_common::backtrace::print::h78d1eab0d976c677
4: 0x555555da0ac7 - std::panicking::default_hook::{{closure}}::h3f8628a95270c213
5: 0x555555da21ab - std::panicking::rust_panic_with_hook::hd1b06f3095c8ec01
6: 0x555555da1ca0 - std::panicking::begin_panic_handler::{{closure}}::hb82004c56d4db4fa
7: 0x555555da1bf6 - std::sys_common::backtrace::__rust_end_short_backtrace::h17b40b71bb1ece3d
8: 0x555555da1be3 - rust_begin_unwind
9: 0x555555669384 - core::panicking::panic_fmt::h9bd50ad4fc2ca95e
10: 0x555555669932 - core::result::unwrap_failed::h861383bd8d19e70e
11: 0x555555e72b0e - <core::pin::Pin<P> as core::future::future::Future>::poll::he4816b870cadeba8
12: 0x555555eda91c - subspace_farmer::single_disk_farm::plotting::plotting::{{closure}}::{{closure}}::{{closure}}::{{closure}}::{{closure}}::{{closure}}::h2e5f89001d0adedf
13: 0x555555f79836 - tokio::runtime::context::runtime::enter_runtime::h2070983c4c1f0457
14: 0x555556146b4f - tokio::runtime::scheduler::multi_thread::worker::block_in_place::hcdd8dd12015f21ef
15: 0x555556065b9e - <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute::hb6804824d14ce268
16: 0x5555556a386f - rayon_core::registry::WorkerThread::wait_until_cold::h774ae0930d9e0a00
17: 0x555555bb0c22 - rayon_core::registry::ThreadBuilder::run::hbfca6208f57be30c
18: 0x555555de3749 - std::sys_common::backtrace::__rust_begin_short_backtrace::h352810b4fc8f3ff6
19: 0x555555de4783 - core::ops::function::FnOnce::call_once{{vtable.shim}}::h31ae30d80733f54d
20: 0x555555da3cd5 - std::sys::pal::unix::thread::Thread::new::thread_start::hd551cfa6ba15fff0
21: 0x7ffff7d1bac3 - <unknown>
22: 0x7ffff7dad850 - <unknown>
23: 0x0 - <unknown>
2024-02-15T12:13:56.931045Z WARN single_disk_farm{disk_farm_index=1}: subspace_farmer::single_disk_farm::plotting: Failed to send sector index for initial plotting error=send failed because receiver is gone
Error: Background task plotting-1 panicked
Stack backtrace:
0: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
1: std::sys_common::backtrace::__rust_begin_short_backtrace
2: core::ops::function::FnOnce::call_once{{vtable.shim}}
3: std::sys::pal::unix::thread::Thread::new::thread_start
4: <unknown>
5: <unknown>
A new error occurred, directly destroying the 1 TB plot drive.
2024-02-15T09:22:04.842107Z ERROR subspace_farmer::utils::farmer_piece_getter: Failed to retrieve first segment piece from node error=Parse error: invalid value: integer `1149`, expected u8 piece_index=47992
2024-02-15T09:22:35.105039Z ERROR subspace_farmer::utils::farmer_piece_getter: Failed to retrieve first segment piece from node error=Cannot convert piece. PieceIndex=47974 piece_index=47974
I do not believe anything is destroyed. You can run scrub on it to fix errors. However, everything points to a hardware issue. The fact that you don’t use low-power or overclocked RAM doesn’t guarantee you don’t have stability issues.
There is not much the application can do if bits change in memory unexpectedly. Since you are the only one so far with such errors (with thousands of users plotting petabytes of space), and I don’t see any issues in the code, I strongly recommend thoroughly checking your hardware.
I will check the hardware. The server uses registered ECC memory. Can ECC detect memory bit errors?
In general, yes, though even ECC memory can be faulty. To the best of my understanding so far, your farmer received something either from disk or from the network, checked it, and it was good, but by the time it got to the plotting process it turned out to be invalid. It is possible that a glitch happened during piece cache sync and the farmer wrote a corrupted piece along with a valid checksum for it, in which case you may hit that error occasionally. Removing the piece cache file and syncing the piece cache again might fix that (scrub doesn’t check contents beyond a few checksums, for performance reasons).
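To illustrate that failure mode: if a bit flips in memory before the checksum is computed, the checksum is computed over the already-corrupted bytes, so any later checksum verification (which is all scrub checks for contents) still passes. A minimal, self-contained sketch of that ordering, using a hypothetical checksum_of helper rather than the farmer's actual cache code:
// Hypothetical illustration: the checksum is computed over bytes that were
// already corrupted in memory, so later verification still succeeds.
fn checksum_of(data: &[u8]) -> u64 {
    // Stand-in checksum (FNV-1a); the real cache uses a proper hash.
    data.iter().fold(0xcbf29ce484222325u64, |acc, &b| {
        (acc ^ u64::from(b)).wrapping_mul(0x100000001b3)
    })
}

fn main() {
    let original_piece = vec![0u8; 32];

    // A bit flips in RAM *before* the piece is written to the cache.
    let mut corrupted_piece = original_piece.clone();
    corrupted_piece[7] ^= 0b0000_0100;

    // Cache write: piece and checksum are both derived from the corrupted bytes.
    let stored_piece = corrupted_piece.clone();
    let stored_checksum = checksum_of(&stored_piece);

    // Later verification (e.g. scrub) only re-checks the checksum, so it passes
    // even though the piece no longer contains valid scalar bytes.
    assert_eq!(checksum_of(&stored_piece), stored_checksum);
    assert_ne!(stored_piece, original_piece);
    println!("checksum still matches, corruption undetected");
}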
You are right. It is more likely that there is some signal interference during memory operations. Removing the piece cache file and syncing the piece cache again should fix it.
This problem becomes more frequent when I call Table::generate_parallel from multiple threads at the same time.
I’ve created a LazyTable that generates 50 PosTables simultaneously across multiple threads. After generation completes, ownership of the tables is moved into a HashMap; the encoding code then fetches the table for each piece offset (a u16) from that map and calls find_proof for each SBucket on it in parallel. Tested this way, the probability of hitting the "Invalid scalar" panic is extremely high:
it happens about once every 6 hours.
use std::collections::HashMap;
use std::sync::Mutex;

use rayon::prelude::*;
use tracing::warn;
// PieceOffset, SectorId, FarmerProtocolInfo, Table and TableGenerator come
// from the surrounding plotting.rs scope (subspace workspace crates).

#[derive(Clone)]
struct LazyTable {
    current_index: u16,
    max_index: u16,
}

impl LazyTable {
    fn new(max_index: u16) -> LazyTable {
        LazyTable {
            current_index: 0,
            max_index,
        }
    }

    /// First piece offset that has not been generated yet.
    fn get_current(&self) -> u16 {
        self.current_index
    }

    /// Drops the previous batch of tables and generates the next batch
    /// (one table per available generator) in parallel.
    fn next<PosTable: Table>(
        &mut self,
        tables_maps: &mut Option<HashMap<u16, PosTable>>,
        generator_vec: &Vec<Mutex<<PosTable as Table>::Generator>>,
        sector_id: &SectorId,
        farmer_protocol_info: FarmerProtocolInfo,
    ) {
        // Free the previous batch before allocating the next one.
        tables_maps.take();

        let mut end_index = self.current_index + generator_vec.len() as u16;
        if end_index > self.max_index {
            end_index = self.max_index;
        }

        let result: HashMap<u16, PosTable> = (self.current_index..end_index)
            .into_par_iter()
            .enumerate()
            .map(|(index, current)| {
                let piece_offset = PieceOffset::from(current);
                let seed = sector_id
                    .derive_evaluation_seed(piece_offset, farmer_protocol_info.history_size);

                if let Some(generator_lock) = generator_vec.get(index) {
                    // Reuse the pre-allocated generator assigned to this slot.
                    let mut generator = generator_lock.lock().unwrap();
                    (current, generator.generate_parallel(&seed))
                } else {
                    // No generator for this slot; create a fresh one.
                    warn!("generating table from index: {} PosTable {}", index, current);
                    let mut generator = PosTable::generator();
                    (current, generator.generate_parallel(&seed))
                }
            })
            .collect();

        *tables_maps = Some(result);
        self.current_index = end_index;
    }
}
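And this is how I drive it from inside the plotting loop: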
let mut lazy_blocks = LazyTable::new(1000);
let mut _tables_maps: Option<HashMap<u16, PosTable>> = Some(HashMap::with_capacity(0));
lazy_blocks.next::<PosTable>(
    &mut _tables_maps,
    generator_vec,
    &sector_id,
    farmer_protocol_info,
);

for ((piece_offset, record), mut encoded_chunks_used) in (PieceOffset::ZERO..)
    .zip(raw_sector.records.iter_mut())
    .zip(sector_contents_map.iter_record_bitfields_mut())
{
    // Derive PoSpace table (use parallel mode because multiple tables concurrently
    // will use too much RAM)
    let index: u16 = piece_offset.get();

    // Refill the table cache once the current batch has been consumed.
    if index >= lazy_blocks.get_current() {
        lazy_blocks.next::<PosTable>(
            &mut _tables_maps,
            generator_vec,
            &sector_id,
            farmer_protocol_info,
        );
    }

    let table: &HashMap<u16, PosTable> = _tables_maps.as_ref().unwrap();
    let pos_table_cache: PosTable;
    let pos_table: &PosTable;
    if let Some(pos) = table.get(&index) {
        // Use the pre-generated table for this piece offset.
        pos_table = pos;
    } else {
        // Fallback: generate the table inline if it is missing from the batch.
        warn!("table_generator.generate_parallel index: {}", index);
        pos_table_cache = table_generator.generate_parallel(
            &sector_id.derive_evaluation_seed(piece_offset, farmer_protocol_info.history_size),
        );
        pos_table = &pos_table_cache;
    }

    // ... rest of the per-record plotting logic ...
}
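For reference, the stock code path generates each table inline, right before it is consumed; roughly like this (a sketch reconstructed from the generate_parallel call above, not verbatim plotting.rs):
// Sketch of the unmodified flow: a single table_generator per plotting task,
// with each table generated immediately before the record that needs it.
let mut table_generator = PosTable::generator();

for ((piece_offset, record), mut encoded_chunks_used) in (PieceOffset::ZERO..)
    .zip(raw_sector.records.iter_mut())
    .zip(sector_contents_map.iter_record_bitfields_mut())
{
    // Derive PoSpace table (use parallel mode because multiple tables concurrently
    // will use too much RAM)
    let pos_table = table_generator.generate_parallel(
        &sector_id.derive_evaluation_seed(piece_offset, farmer_protocol_info.history_size),
    );

    // ... rest of the per-record plotting logic, using `pos_table` ...
}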
Why does my method trigger the following panic, while directly generating tables with Table::generate_parallel in the async code path does not?
Piece getter must returns valid pieces of history that contain proper scalar bytes; qed: "Invalid scalar"
As I already mentioned, this is a hardware issue; as far as I’m concerned there is no point in looking for software issues here. You’re free to do so, but it is a waste of my time unless proven otherwise.