# Deterministic Debugging of CKB's Flaky Tests

*This doc keeps track of my experience diagnosing CKB’s unstable tests. While the issue is unlikely to affect most application developers on CKB, it may be of interest to Rust developers working on complex multithreaded systems or those curious about advanced debugging techniques.*

*A* [*PR has been created*](https://github.com/nervosnetwork/ckb/pull/4880) *in CKB to fix tests per discovery documented in this doc.*

---

Every now and then, when running CKB’s test, you might see errors like:

* `pthread lock: Invalid argument`
    
* `terminate called without an active exception`
    

These failures occur randomly (~5% of test runs) and are difficult to reproduce. I suspected the issue might be related to ckb-sync, and this post documents the debugging journey, tools used, and eventual fix.

## Reproducing the Issue

### Testing with `cargo nextest`

[Cargo-nextest](https://nexte.st/) is a test runner for Rust projects. Based on the actual command executed by `make test`, I could piece together the following command:

```bash
$ cargo nextest run  --features with_sentry --no-fail-fast \
	--hide-progress-bar --success-output immediate-final \
	--failure-output immediate-final -p ckb-sync
```

However, this command would fail to build. It seems that running the `ckb-sync` package alone, `ckb-sync` is missing a dev-time feature dependency, like [the following line](https://github.com/nervosnetwork/ckb/blob/734ce6295339cba252032180e3508cfce8c21ba7/chain/Cargo.toml#L44):

```ini
ckb-tx-pool = { workspace = true, features = ["internal"] }
```

After adding this line, the above command would succeed in building and proceed to run the tests from `ckb-sync` package.

On my machine, the test fails after ~20 iterations (5% failure rate) with either `pthread lock: Invalid argument` or `terminate called without an active exception`.

### Deterministic Debugging with `rr` Multithreaded Failures

[rr](https://rr-project.org/) is a lightweight debugging tool for recording, replaying and debugging execution of applications (trees of processes and threads)

records a program’s execution once and allows deterministic replay of that exact run multiple times. By capturing all sources of nondeterminism (including thread scheduling), `rr` enables reliable debugging of multithreaded failures, making it especially useful for diagnosing flaky or intermittent test issues in complex systems like CKB.

**Installation**

Install `rr` per the [instructions](https://github.com/rr-debugger/rr/wiki/Building-And-Installing).

> *Note:* `rr` has specific requirements on OS and your CPU, refer to the docs for more details.

To limit the traces generated by rr, I will focus on one particular test `tests::sync_shared::test_insert_parent_unknown_block`. From my local experiments, this test might throw the above mentioned errors.

**Set** `rr` as the cargo runner

```bash
$ export CARGO_TARGET_X86_64_UNKNOWN_LINUX_GNU_RUNNER="/home/ubuntu/rr-obj/bin/rr record"
```

The `rr` binary on my machine is installed to `/home/ubuntu/rr-obj/bin/rr`, you might want to adjust this accordingly.

**Run the test repeatedly**

Then we can keep running this command until one of the errors shows up:

```bash
$ cargo nextest run  --features with_sentry --no-fail-fast \\
	--hide-progress-bar --success-output immediate-final \\
	--failure-output immediate-final -p ckb-sync \\
	tests::sync_shared::test_insert_parent_unknown_block
```

As each test run would accumulate `rr` traces and test data, you might use the following command between test runs to clean up all those data:

```bash
$ rm -rf ~/.local/share/rr/ /tmp/ckb-tmp-* /tmp/.tmp*
```

It’s not hard to write a script that keeps running the test until it fails:

```bash
#!/usr/bin/env bash
set -x

# This script simply runs the test repeatedly until faliure happens

unset CARGO_TARGET_X86_64_UNKNOWN_LINUX_GNU_RUNNER
while true; do
  cargo nextest run  --features with_sentry --no-fail-fast \\
    --hide-progress-bar --success-output immediate-final \\
    --failure-output immediate-final -p \\
    ckb-sync tests::sync_shared::test_insert_parent_unknown_block
  RETURN_CODE=$?
  rm -rf /tmp/ckb-tmp-* /tmp/.tmp*

  if [[ "$RETURN_CODE" -ne 0 ]]; then
    echo "Failed with $RETURN_CODE"
    exit $RETURN_CODE
  fi
done
```

Or try this one that takes `rr` into account:

```bash
#!/usr/bin/env bash
set -x

RR="${RR:-/home/ubuntu/rr-obj/bin/rr}"

# This script runs the test wrapped with rr, so when failure happens,
# you can replay the failed test using rr

export CARGO_TARGET_X86_64_UNKNOWN_LINUX_GNU_RUNNER="$RR record"
while true; do
  cargo nextest run  --features with_sentry --no-fail-fast \\
    --hide-progress-bar --success-output immediate-final \\
    --failure-output immediate-final -p \\
    ckb-sync tests::sync_shared::test_insert_parent_unknown_block
  RETURN_CODE=$?
  rm -rf /tmp/ckb-tmp-* /tmp/.tmp*

  if [[ "$RETURN_CODE" -ne 0 ]]; then
    echo "Failed with $RETURN_CODE, use $RR replay to rerun the failure!"
    exit $RETURN_CODE
  fi

  rm -rf ~/.local/share/rr
done
```

For some reason, the test tends to fail more often running under `rr` on my machine, my guess is that `rr` alters the behavior of multi-thread scheduling. Or maybe `rr` picks a different execution path compared to what is normally tested in our daily development.

## Failures Observed

### Error Case: `pthread lock: Invalid argument`

With enough runs, one of the test outputs might look like following:

```bash
$ cargo nextest run  --features with_sentry --no-fail-fast --hide-progress-bar --success-output immediate-final --failure-output immediate-final -p ckb-sync tests::sync_shared::test_insert_parent_unknown_block
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.24s
info: using target runner `/home/ubuntu/rr-obj/bin/rr record` defined by environment variable `CARGO_TARGET_X86_64_UNKNOWN_LINUX_GNU_RUNNER`
────────────
 Nextest run ID d8f86415-500a-4b18-9093-cd21a656013a with nextest profile: default
    Starting 1 test across 1 binary (68 tests skipped)
     SIGABRT [   2.098s] ckb-sync tests::sync_shared::test_insert_parent_unknown_block
──── STDOUT:             ckb-sync tests::sync_shared::test_insert_parent_unknown_block

running 1 test
test tests::sync_shared::test_insert_parent_unknown_block ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 68 filtered out; finished in 0.93s

──── STDERR:             ckb-sync tests::sync_shared::test_insert_parent_unknown_block
pthread lock: Invalid argument

────────────
     Summary [   2.099s] 1 test run: 0 passed, 1 failed, 68 skipped
     SIGABRT [   2.098s] ckb-sync tests::sync_shared::test_insert_parent_unknown_block
──── STDOUT:             ckb-sync tests::sync_shared::test_insert_parent_unknown_block

running 1 test
test tests::sync_shared::test_insert_parent_unknown_block ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 68 filtered out; finished in 0.93s

──── STDERR:             ckb-sync tests::sync_shared::test_insert_parent_unknown_block
pthread lock: Invalid argument

error: test run failed
```

At this case, you can find that `rr` has generated traces:

```bash
$ ls -lh ~/.local/share/rr/
total 12K
drwxrwx--- 2 ubuntu ubuntu 4.0K Apr 27 05:55 ckb_sync-a645bdacec5c2bf3-0
drwxrwx--- 2 ubuntu ubuntu 4.0K Apr 27 05:55 ckb_sync-a645bdacec5c2bf3-1
drwxrwx--- 2 ubuntu ubuntu 4.0K Apr 27 05:55 ckb_sync-a645bdacec5c2bf3-2
-rw------- 1 ubuntu ubuntu    8 Apr 27 05:55 cpu_lock
lrwxrwxrwx 1 ubuntu ubuntu   27 Apr 27 05:55 latest-trace -> ckb_sync-a645bdacec5c2bf3-2
```

I run the clean command above between each run, so you’ll only see three traces here. The last one corresponds to the actual failing test, while the first two might be bookkeeping runs of `nextest`. If you didn’t run the clean command, there could be more traces.

Now you can repeat the failed case:

```bash
$ ~/rr-obj/bin/rr replay
GNU gdb (Ubuntu 15.0.50.20240403-0ubuntu1) 15.0.50.20240403-git
Copyright (C) 2024 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/ubuntu/.local/share/rr/ckb_sync-a645bdacec5c2bf3-2/mmap_hardlink_4_ckb_sync-a645bdacec5c2bf3...
warning: Missing auto-load script at offset 0 in section .debug_gdb_scripts
of file /home/ubuntu/.local/share/rr/ckb_sync-a645bdacec5c2bf3-2/mmap_hardlink_4_ckb_sync-a645bdacec5c2bf3.
Use `info auto-load python-scripts [REGEXP]' to list them.
Remote debugging using 127.0.0.1:7932
Reading symbols from /lib64/ld-linux-x86-64.so.2...
Reading symbols from /usr/lib/debug/.build-id/1c/8db5f83bba514f8fd5f1fb6d7be975be1bb855.debug...
BFD: warning: system-supplied DSO at 0x6fffd000 has a section extending past end of file

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.ubuntu.com>
Enable debuginfod for this session? (y or [n]) y
Debuginfod has been enabled.
To make this setting permanent, add 'set debuginfod enabled on' to .gdbinit.
Downloading separate debug info for system-supplied DSO at 0x6fffd000
0x00007ce375295540 in _start () from /lib64/ld-linux-x86-64.so.2
(rr)
```

`rr` starts a gdb session for the ckb-sync test. We can use `c` to continue to run the test:

```bash
(rr) c
Continuing.
Downloading separate debug info for /lib/x86_64-linux-gnu/libstdc++.so.6

running 1 test
test tests::sync_shared::test_insert_parent_unknown_block ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 68 filtered out; finished in 0.93s

pthread lock: Invalid argument
[New Thread 72541.73153]
[New Thread 72541.72544]
[New Thread 72541.73028]
[New Thread 72541.73029]
[New Thread 72541.73082]
[New Thread 72541.73109]
[New Thread 72541.73111]
[New Thread 72541.73113]
[New Thread 72541.73114]
[New Thread 72541.73115]
[New Thread 72541.73118]
[New Thread 72541.73119]
[New Thread 72541.73152]
[New Thread 72541.73156]
[New Thread 72541.73157]

Thread 2 received signal SIGABRT, Aborted.
[Switching to Thread 72541.73153]
Download failed: Invalid argument.  Continuing without source file ./nptl/./nptl/pthread_kill.c.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
warning: 44     ./nptl/pthread_kill.c: No such file or directory
(rr)
```

And use `bt` to print stack trace when the error happens:

```bash
(rr) bt
```

As the stack trace is quite long, I’m posting a screenshot instead:

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1748232267733/e93660b0-2a93-4264-891b-862770aad9a3.png align="center")

We can track the failure to [this line](https://github.com/nervosnetwork/rust-rocksdb/blob/a2d7b2e6f1efaecedd37b5363408f0f0888a0e26/src/optimistic_transaction_db.rs#L153):

```rust
ffi::rocksdb_optimistictransactiondb_close(self.inner);
```

For some reason, closing an optimistic transaction DB triggers the failure. But the cause is unclear. Just a guess: what should a proper shutdown process in RocksDB look like? Is simply closing the DB instance enough?

As long as you don’t run `rr record ...` or rerun the test, you can use `rr replay` to rerun the failed test as many times as needed. Or, you can use `rr replay ~/.local/share/rr/ckb_sync-a645bdacec5c2bf3-2` to manually pick the trace of the failure test. This way, you can continue running more `rr record ...`.

### Error Case: `terminate called without an active exception`

Enough runs of the test also reveal another error:

```bash
$ cargo nextest run  --features with_sentry --no-fail-fast --hide-progress-bar --success-output immediate-final --failure-output immediate-final -p ckb-sync tests::sync_shared::test_insert_parent_unknown_block
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.24s
info: using target runner `/home/ubuntu/rr-obj/bin/rr record` defined by environment variable `CARGO_TARGET_X86_64_UNKNOWN_LINUX_GNU_RUNNER`
────────────
 Nextest run ID 2aa60494-7933-4b99-9f5f-cdced260411a with nextest profile: default
    Starting 1 test across 1 binary (68 tests skipped)
     SIGABRT [   2.083s] ckb-sync tests::sync_shared::test_insert_parent_unknown_block
──── STDOUT:             ckb-sync tests::sync_shared::test_insert_parent_unknown_block

running 1 test
test tests::sync_shared::test_insert_parent_unknown_block ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 68 filtered out; finished in 0.92s

──── STDERR:             ckb-sync tests::sync_shared::test_insert_parent_unknown_block
terminate called without an active exception

────────────
     Summary [   2.084s] 1 test run: 0 passed, 1 failed, 68 skipped
     SIGABRT [   2.083s] ckb-sync tests::sync_shared::test_insert_parent_unknown_block
──── STDOUT:             ckb-sync tests::sync_shared::test_insert_parent_unknown_block

running 1 test
test tests::sync_shared::test_insert_parent_unknown_block ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 68 filtered out; finished in 0.92s

──── STDERR:             ckb-sync tests::sync_shared::test_insert_parent_unknown_block
terminate called without an active exception

error: test run failed
```

We can also use `rr` to jump into a gdb session:

```bash
$ ~/rr-obj/bin/rr replay
GNU gdb (Ubuntu 15.0.50.20240403-0ubuntu1) 15.0.50.20240403-git
Copyright (C) 2024 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/ubuntu/.local/share/rr/ckb_sync-a645bdacec5c2bf3-2/mmap_hardlink_4_ckb_sync-a645bdacec5c2bf3...
warning: Missing auto-load script at offset 0 in section .debug_gdb_scripts
of file /home/ubuntu/.local/share/rr/ckb_sync-a645bdacec5c2bf3-2/mmap_hardlink_4_ckb_sync-a645bdacec5c2bf3.
Use `info auto-load python-scripts [REGEXP]' to list them.
Remote debugging using 127.0.0.1:9252
Reading symbols from /lib64/ld-linux-x86-64.so.2...
Reading symbols from /usr/lib/debug/.build-id/1c/8db5f83bba514f8fd5f1fb6d7be975be1bb855.debug...
BFD: warning: system-supplied DSO at 0x6fffd000 has a section extending past end of file

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.ubuntu.com>
Enable debuginfod for this session? (y or [n]) y
Debuginfod has been enabled.
To make this setting permanent, add 'set debuginfod enabled on' to .gdbinit.
0x000072be440c4540 in _start () from /lib64/ld-linux-x86-64.so.2
(rr)
```

We can also use `c` to continue with the test, and use `bt` to check out stack trace when failure happens:

```bash
(rr) c
Continuing.

running 1 test
test tests::sync_shared::test_insert_parent_unknown_block ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 68 filtered out; finished in 0.92s

terminate called without an active exception
[New Thread 74722.74724]
[New Thread 74722.74725]
[New Thread 74722.74741]
[New Thread 74722.74742]
[New Thread 74722.74743]
[New Thread 74722.74744]
[New Thread 74722.74749]
[New Thread 74722.74750]
[New Thread 74722.74751]
[New Thread 74722.74752]
[New Thread 74722.74753]
[New Thread 74722.74754]
[New Thread 74722.74755]
[New Thread 74722.74771]
[New Thread 74722.74773]
[New Thread 74722.74775]
[New Thread 74722.74776]

Thread 1 received signal SIGABRT, Aborted.
Download failed: Invalid argument.  Continuing without source file ./nptl/./nptl/pthread_kill.c.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
warning: 44     ./nptl/pthread_kill.c: No such file or directory
(rr) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x000072be4384527e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x000072be438288ff in __GI_abort () at ./stdlib/abort.c:79
#5  0x000072be43ca5ff5 in __gnu_cxx::__verbose_terminate_handler () at ../../../../src/libstdc++-v3/libsupc++/vterminate.cc:95
#6  0x000072be43cbb0da in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../src/libstdc++-v3/libsupc++/eh_terminate.cc:48
#7  0x000072be43ca5a55 in std::terminate () at ../../../../src/libstdc++-v3/libsupc++/eh_terminate.cc:58
#8  0x00005f7ac4eab073 in std::__terminate () at /usr/include/x86_64-linux-gnu/c++/13/bits/c++config.h:322
#9  std::thread::~thread (this=0x72be3c170ea0, __in_chrg=<optimized out>) at /usr/include/c++/13/bits/std_thread.h:173
#10 0x00005f7ac4eefb68 in std::default_delete<std::thread>::operator() (this=0x5f7ac7268318 <rocksdb::PeriodicTaskScheduler::Default()::timer+152>,
    __ptr=0x72be3c170ea0) at /usr/include/c++/13/bits/unique_ptr.h:99
#11 0x00005f7ac4ede608 in std::unique_ptr<std::thread, std::default_delete<std::thread> >::~unique_ptr (
    this=0x5f7ac7268318 <rocksdb::PeriodicTaskScheduler::Default()::timer+152>, __in_chrg=<optimized out>) at /usr/include/c++/13/bits/unique_ptr.h:404
#12 0x00005f7ac505d876 in rocksdb::Timer::~Timer (this=0x5f7ac7268280 <rocksdb::PeriodicTaskScheduler::Default()::timer>, __in_chrg=<optimized out>)
    at rocksdb/util/timer.h:48
#13 0x000072be43847a76 in __run_exit_handlers (status=0, listp=<optimized out>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true)
    at ./stdlib/exit.c:108
#14 0x000072be43847bbe in __GI_exit (status=<optimized out>) at ./stdlib/exit.c:138
#15 0x000072be4382a1d1 in __libc_start_call_main (main=main@entry=0x5f7ac3eeaf40 <main>, argc=argc@entry=4, argv=argv@entry=0x7ffe436ca588)
    at ../sysdeps/nptl/libc_start_call_main.h:74
#16 0x000072be4382a28b in __libc_start_main_impl (main=0x5f7ac3eeaf40 <main>, argc=4, argv=0x7ffe436ca588, init=<optimized out>, fini=<optimized out>,
    rtld_fini=<optimized out>, stack_end=0x7ffe436ca578) at ../csu/libc-start.c:360
#17 0x00005f7ac3e1a4c5 in _start ()
(rr)
```

This is actually less obvious than the previous one, the error happens when C++ code destructs a timer instance.

Just to take a guess: maybe there is a timer left unprocessed, when closing a Rocksdb DB instance?

## Diving into the Code

Once we had a reliable way to reproduce the failure and replay it in `rr`, it is time to figure out why it was happening.

When diving into the [code](https://github.com/nervosnetwork/rust-rocksdb/blob/txn/src/optimistic_transaction_db.rs#L152-L153), the first thing I noticed is a redundant delete C++ call:

```rust
ffi::rocksdb_optimistictransactiondb_close_base_db(self.base_db);
ffi::rocksdb_optimistictransactiondb_close(self.inner);
```

If we track down the code, both [Line 152 and 153](https://github.com/nervosnetwork/rust-rocksdb/blob/txn/src/optimistic_transaction_db.rs#L152-L153) here will eventually invoke `~StackableDB` ([Line 34-41](https://github.com/facebook/rocksdb/blob/6c0e55a2a9e244ebbe71613c48eb3157ecfce9af/include/rocksdb/utilities/stackable_db.h#L34-L41)):

```cpp
  ~StackableDB() override {
    if (shared_db_ptr_ == nullptr) {
      delete db_;
    } else {
      assert(shared_db_ptr_.get() == db_);
    }
    db_ = nullptr;
  }
```

After the first invocation, `db_` will become `nullptr`, and C++’s `delete` first checks if its operand is `nullptr`. If so, the `delete` call will just be a [NOOP](https://en.wikipedia.org/wiki/NOP_\(code\)).

So the extra delete C++ call is redundant, not a double-free. In fact, even if we remove one of the double delete calls, we could still run into failures. Here’s one possible stacktrace:

![](https://cdn.hashnode.com/res/hashnode/image/upload/v1748232412657/ad99b855-8a9d-4487-a963-b16d419c4aed.png align="center")

And now, it’s mostly just repeating the following loop:

* Read the code.
    
* Add `println!` in Rust code or `printf` lines in C++ code.
    
* Use the above bash script to compile and repeatedly run the tests until a failure occurs.
    
* Use `rr` to replay the failure in a gdb session, setting breakpoints to peek into memory data. If required, restart `rr` to begin another gdb session.
    

Throughout this process, `rr` has been super reliable and helpful. As long as bash script captures a test failure, `rr` can deterministically rerun it as many times as you like.

By the end, I had likely started over 100 gdb sessions using `rr` to replay the failures.

## Failure Workflow Analysis

After enough trials, I’ve nailed down to the following workflow (I’m using `tests::sync_shared::test_insert_parent_unknown_block` as an example, I believe other tests could be fixed similarly):

* At the very start, two `shared` structures and two `chain_controller`s are created ([Line 83-94](https://github.com/nervosnetwork/ckb/blob/734ce6295339cba252032180e3508cfce8c21ba7/sync/src/tests/sync_shared.rs#L83-L94)).
    
    The `shared1` variable created at [Line 83](https://github.com/nervosnetwork/ckb/blob/734ce6295339cba252032180e3508cfce8c21ba7/sync/src/tests/sync_shared.rs#L83) and its discarded chain controller work as expected.
    
    The issue stems from the `shared` and `chain` variables created at [Line 84](https://github.com/nervosnetwork/ckb/blob/734ce6295339cba252032180e3508cfce8c21ba7/sync/src/tests/sync_shared.rs#L84).
    
* As part of the block starting at 84, [start\_chain\_services](https://github.com/nervosnetwork/ckb/blob/734ce6295339cba252032180e3508cfce8c21ba7/chain/src/init.rs#L24) is invoked. It creates a bunch of threads, some of which might hold a copy of `Shared` structure, meaning each of those threads also holds `Arc<RocksDB>` instance.
    
* A unit test might not use CKB’s full shutdown process like a normal CKB node does. When the test `tests::sync_shared::test_insert_parent_unknown_block` terminates, this thread ([Line 59-73](https://github.com/nervosnetwork/ckb/blob/734ce6295339cba252032180e3508cfce8c21ba7/chain/src/init.rs#L59-L73)) is still running.
    
    * Note: By terminate, I mean you can see `test tests::sync_shared::test_insert_parent_unknown_block ... ok` generated by Rust’s test infrastructure; in other words, you can think that control flow for the main test thread has already exited the method `test_insert_parent_unknown_block`.
        
    
    When Rust initiates thread termination (or `pthread` determines to kill all still running threads), `PreloadUnverifiedBlocksChannel` still holds a copy of `Shared` structure, which maintains a copy of `Arc<RocksDB>`.
    
* Then the `Drop` impl of the underlying `OptimisticTransactionDB` starts executing, which will invoke RocksDB’s destructor ([Line 717](https://github.com/facebook/rocksdb/blob/6c0e55a2a9e244ebbe71613c48eb3157ecfce9af/db/db_impl/db_impl.cc#L717)):
    
    ```cpp
    DBImpl::~DBImpl() {
      ThreadStatus::OperationType cur_op_type =
          ThreadStatusUtil::GetThreadOperation();
      ThreadStatusUtil::SetThreadOperation(ThreadStatus::OperationType::OP_UNKNOWN);
    ```
    
    which then invokes RocksDB’s global Timer’s Cancel method ([Line 86](https://github.com/facebook/rocksdb/blob/6c0e55a2a9e244ebbe71613c48eb3157ecfce9af/util/timer.h#L86)):
    
    ```rust
    InstrumentedMutexLock l(&mutex_);
    ```
    
    This is when `Timer::Cancel` decides to acquire a mutex lock, and pthread signals an error, aborting the test program.
    

This leads to a crash like `pthread lock: Invalid argument` or

`terminate called without an active exception`.

## What Caused the Crash

My best guess at what happened here is:

* When `test_insert_parent_unknown_block` finishes, Rust decides that all the requested tests (in this case, just one) have finished, and starts terminating the process.
    
* At this point, two threads are involved:
    
    1. The thread containing `PreloadUnverifiedBlocksChannel` runs `Drop` trait impl, which then calls RocksDB’s destructor, and eventually invokes RocksDB’s `Timer::Cancel` on the global Timer object.
        
    2. A cleanup thread (most likely the top running thread) runs cleanup hooks for RocksDB’s global Timer structure, which also target the same global timer.
        
    
    If Thread 1 finishes before 2, the termination shuts down cleanly and the test passes.
    
    However, if Thread 2 finishes first, the thread containing `PreloadUnverifiedBlocksChannel` will acquire a lock on a pthread mutex, which has already been cleaned up (in other words, this pthread mutex might not be properly initialized), and will trigger the aborting process, causing the process running the test to fail even when the Rust test itself succeeds.
    

## The Fix

Based on the above assumption, I’ve prepared a fix in [this commit](https://github.com/xxuejie/ckb/commit/91be319d33a49d64fee42004696896820a4a8bec), to ensure that all chain service’s threads are properly terminated before the testing function exits.

Now when running the following test command, the test never fails:

```bash
$ cargo nextest run  --features with_sentry --no-fail-fast \\
    --hide-progress-bar --success-output immediate-final \\
    --failure-output immediate-final -p \\
    ckb-sync tests::sync_shared::test_insert_parent_unknown_block
```

You can also run this command including bash script indefinitely until a failure happens.

With this fix, I could no longer reproduce the pthread error—even after running the command for over an hour.

## Additional Notes

* **Googling the Error** led to [this discussion](https://github.com/facebook/rocksdb/issues/11349.). Unfortunately, the proposed fix does not work for us.
    
* **Different RocksDB Versions**: Currently, CKB is running on `ckb-rocksdb` v0.21.1 using RocksDB 8.5.4. CKB also developed a newer version: `ckb-rocksdb` v0.22.0, with RocksDB 9.10.0. I’ve tried both and found the newer one has lower failure frequency, but the issue still occurs.
    

---

*✍🏻 Written by Xuejie Xiao*

*His previous posts include:*

* [**Against ROP Attacks: A Blockchain Architect’s Take on VM-Level Security**](https://blog.cryptape.com/against-rop-attacks)
    
* [**A Journey Optimizing CKB Smart Contract: Porting Bitcoin as an Example**](https://blog.cryptape.com/a-journey-optimizing-ckb-smart-contract)
    
* [**Optimizing C++ Code for CKB-VM: Porting Bitcoin as an Example**](https://blog.cryptape.com/optimizing-c-code-for-ckb-vm)
    

*Find more in his personal website* [**Less is More**](https://xuejie.space/).
