Destroyed pool fix: prevent SIGSEGV on IOC exit when pvAccess holds NDArrays after driver/pool destroyed #570

Open
kgofron wants to merge 3 commits into areaDetector:master from kgofron:destroyed-pool

@kgofron
Member

@kgofron kgofron commented Feb 19, 2026

Segmentation fault

The title, "fix: prevent SIGSEGV on IOC exit when pvAccess holds NDArrays after driver/pool destroyed", refers to a segmentation fault that occurs when the IOC exits after an acquisition has been performed (so memory has been allocated from the pool).

    epics> auto_settings.sav: 2354 of 2354 PV's connected
    ACQUIRE CHANGE: ADAcquire=1 (was 0), current ADStatus=0
    PrvHst: Checking if TCP streaming should start - WritePrvHst=0
    PrvHst: WritePrvHst is disabled (0) - TCP streaming not started
    After acquireStart: ADStatus=1
    ACQUIRE CHANGE: ADAcquire=0 (was 1), current ADStatus=1
    PrvImg TCP connection closed by peer
    PrvHst TCP disconnected
    After acquireStop: ADStatus=0
    epics> exit
    PrvHst TCP disconnected
    ./st.cmd: line 5: 2343260 Segmentation fault ../../bin/linux-x86_64/tpx3App st_base.cmd

With the fix applied to ADCore 3.14.0 master, the IOC exits cleanly:

    epics> exit
    PrvHst TCP disconnected

Problem

When an IOC exits (e.g. user types exit) after acquisition has run, the process can hit a SIGSEGV (signal 11). The crash is in NDArrayPool::release() (or equivalent use of the pool) after the detector driver and its NDArrayPool have already been destroyed.
Cause: shutdown order. The detector driver destructor runs first and deletes pNDArrayPoolPvt_. Later, the pvAccess ServerContext is torn down (atexit) while its MonitorElements still hold NDArray-derived data. The deleter used by ntndArrayConverter (freeNDArray) calls NDArray::release() on those arrays. By then the pool is gone, so release() runs against freed memory → SIGSEGV.
This has been seen with areaDetector IOCs (e.g. ADTimePix3) using ADCore 3.12.1 and 3.14.0. See issue areaDetector/ADTimePix3#5.

Approach

Two parts:
“Destroyed pool” registry

  • Before the driver deletes its pool, it registers the pool pointer in a static set.
  • In NDArray::release(), we check that set using only the pool address (no dereference).
  • If the pool was registered as destroying, we set pNDArrayPool = NULL and return without calling the pool.
    So any late release() (from PVA or elsewhere) no-ops safely, even for NDArrays that are not the driver’s pArrays[] (e.g. copies handed to PVA).
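
The registry idea can be sketched in isolation. The following is a minimal, self-contained model and not the code from this PR: Pool and Array are stand-ins for NDArrayPool and NDArray, the return values 0 and -1 stand in for success and ND_ERROR, and the never-freed, heap-allocated registry is an assumption made here so that the set itself stays valid while atexit handlers run.

```cpp
#include <mutex>
#include <set>

class Pool;  // stand-in for NDArrayPool (hypothetical model type)

// Registry storage is heap-allocated and never freed (an assumption of this
// sketch), so it remains usable while atexit handlers are running.
static std::mutex &registryMutex() {
    static std::mutex *m = new std::mutex;
    return *m;
}
static std::set<const Pool *> &destroyedPools() {
    static std::set<const Pool *> *s = new std::set<const Pool *>;
    return *s;
}

class Pool {
public:
    // Record the address of a pool that is about to be deleted.
    static void registerDestroyingPool(const Pool *p) {
        std::lock_guard<std::mutex> lock(registryMutex());
        destroyedPools().insert(p);
    }
    // Address comparison only; the pool is never dereferenced here.
    static bool isPoolDestroyed(const Pool *p) {
        std::lock_guard<std::mutex> lock(registryMutex());
        return destroyedPools().count(p) != 0;
    }
    int release() { ++releaseCalls; return 0; }
    int releaseCalls = 0;
};

struct Array {  // stand-in for NDArray
    Pool *pool = nullptr;
    // Returns 0 on success, -1 when there is no usable pool.
    int release() {
        if (pool == nullptr) return -1;
        if (Pool::isPoolDestroyed(pool)) {
            pool = nullptr;  // forget the dangling address
            return -1;       // no-op instead of touching freed memory
        }
        return pool->release();
    }
};
```

In this model, a late release() on an array whose pool was registered and then deleted returns -1 and clears the back-pointer, and any further call takes the null-pointer path. One caveat of an address-based registry in general: if the allocator later hands the same address to a new pool, that pool would be treated as destroyed, which matters only if pools are created after another one has been torn down.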

asynNDArrayDriver destructor

  • Store maxAddr in a member maxAddr_.
  • In ~asynNDArrayDriver(): call NDArrayPool::registerDestroyingPool(pNDArrayPoolPvt_), null pNDArrayPool on each pArrays[i], then delete pNDArrayPoolPvt_.
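
The ordering in the destructor is the important part: register first, detach the driver's own arrays, and delete the pool last. A minimal sketch of that sequence follows, using hypothetical stand-in types (Driver, Pool, Array) and an unlocked registry for brevity; it is a model of the steps listed above, not the actual asynNDArrayDriver code.

```cpp
#include <cstddef>
#include <set>
#include <vector>

struct Pool {};                            // stand-in for NDArrayPool
struct Array { Pool *pool = nullptr; };    // stand-in for NDArray

// Minimal stand-in for the destroyed-pool registry (no mutex in this sketch).
static std::set<const Pool *> &destroyedPools() {
    static std::set<const Pool *> *s = new std::set<const Pool *>;
    return *s;
}

class Driver {  // stand-in for asynNDArrayDriver
public:
    explicit Driver(int maxAddr)
        : maxAddr_(maxAddr),
          pool_(new Pool),
          arrays_(static_cast<std::size_t>(maxAddr), nullptr) {}
    Driver(const Driver &) = delete;
    Driver &operator=(const Driver &) = delete;

    ~Driver() {
        destroyedPools().insert(pool_);           // 1. register before deleting
        for (int i = 0; i < maxAddr_; ++i) {      // 2. detach the tracked arrays
            if (arrays_[i]) arrays_[i]->pool = nullptr;
        }
        delete pool_;                             // 3. free the pool last
    }

    Pool *pool() const { return pool_; }
    // Track an array for one address, as the driver does with pArrays[].
    void setArray(int addr, Array *a) {
        a->pool = pool_;
        arrays_[addr] = a;
    }

private:
    int maxAddr_;                  // kept so the destructor knows how many slots to visit
    Pool *pool_;
    std::vector<Array *> arrays_;  // non-owning, like pArrays[]
};
```

Step 2 only reaches arrays the driver still tracks. An array that was handed elsewhere keeps its stale pool pointer, which is why the registry lookup in release() is still needed for those.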

Changes

| File | Change |
| --- | --- |
| NDArray.h | Declare NDArrayPool::registerDestroyingPool(NDArrayPool*) and NDArrayPool::isPoolDestroyed(NDArrayPool*). |
| NDArrayPool.cpp | Implement both with a static std::set<NDArrayPool*> and a mutex. Pools are only ever added; the set is process-lifetime. |
| NDArray.cpp | At the start of NDArray::release(), if isPoolDestroyed(pNDArrayPool) then set pNDArrayPool = NULL and return ND_ERROR without calling the pool. |
| asynNDArrayDriver.h | Add private member int maxAddr_. |
| asynNDArrayDriver.cpp | Constructor: initialize maxAddr_(maxAddr) (initializer order matches member declaration). Destructor: call registerDestroyingPool(pNDArrayPoolPvt_), then loop over pArrays[0..maxAddr_-1] and set pArrays[i]->pNDArrayPool = NULL, then delete pNDArrayPoolPvt_. |

ADCore314_fix.md

References

@ericonr
Member

ericonr commented Mar 5, 2026

Hi Kaz! Have you seen #572? It would be nice to determine how these two interact, since you're currently registering your exit handler manually and could take advantage of ASYN_DESTRUCTIBLE in the future.

@exzombie

exzombie commented Mar 5, 2026

You need the latest asyn and the ADCore from the PR that Erico linked above. Then follow these guidelines.

@kgofron
Member Author

kgofron commented Mar 11, 2026

destructible-drivers

I compiled with the destructible-drivers branch, but unfortunately exit after acquisition still results in a segmentation fault.

    epics> exit
    PrvHst TCP disconnected
    ./st.cmd: line 5: 2228300 Segmentation fault ../../bin/linux-x86_64/tpx3App st_base.cmd

Perhaps I missed something.

    kg1@lap133454:/epics/support2/areaDetector/ADCore/iocBoot$ git branch -a
    * destructible-drivers
    kg1@lap133454:/epics/support2/areaDetector/ADCore/iocBoot$ git log
    commit 91d002c3afa482c6da827f3b11b95706d1d5028b (HEAD -> destructible-drivers, origin/destructible-drivers)
    Author: Jure Varlec <jure.varlec@cosylab.com>
    Date:   Fri Feb 27 09:25:06 2026 +0000

        Add a release note about port shutdown

  • asyn/asyn/asynDriver/asynDriver.h

    /* Version number names similar to those provide by base
     * These macros are always numeric */
    #define ASYN_VERSION 4
    #define ASYN_REVISION 45
    #define ASYN_MODIFICATION 0

Summary

The destructible-drivers branch (ADCore PR 572) fixes shutdown order (asyn calls shutdownPortDriver() and then deletes the driver), but it does not include the pool-safety fix from ADCore PR 570. So the crash you see is still the same: after the driver (and its NDArrayPool) are destroyed, pvAccess (PVA) can later call NDArray::release() on arrays that belonged to that pool → use-after-free → SIGSEGV.

So ADTimePix3 needs both:

  • Destructible drivers (PR 572)
  • Destroyed-pool safety (PR 570) – this is what prevents the SIGSEGV when PVA (or anything else) calls release() after the pool is gone.

What to do

Apply the ADCore PR 570 (destroyed-pool) changes on top of your current destructible-drivers ADCore. That PR adds:

  • NDArrayPool: registerDestroyingPool(NDArrayPool*) and isPoolDestroyed(NDArrayPool*) (e.g. static set + mutex).
  • NDArray::release(): at the start, if isPoolDestroyed(pNDArrayPool) then set pNDArrayPool = NULL and return (do not call the pool).
  • asynNDArrayDriver destructor: call registerDestroyingPool(pNDArrayPoolPvt_), null pNDArrayPool on each pArrays[i], then delete pNDArrayPoolPvt_.

Ways to get that into your tree:

  • Option A: In your ADCore repo (on destructible-drivers), merge or cherry-pick the commits from PR 570 (the “fix: prevent SIGSEGV on IOC exit when pvAccess holds NDArrays after driver/pool destroyed” commit and any dependencies). Then rebuild ADCore and your IOC.
  • Option B: Manually apply the same code changes from PR 570’s diff to your current ADCore (same files and logic as above), then rebuild.

After ADCore has both destructible-drivers and the pool-safety logic, exit should no longer segfault. If you paste your ADCore branch/commit and the PR 570 patch or link, I can outline exact merge/cherry-pick steps or a minimal patch for your tree.

@kgofron
Member Author

kgofron commented Mar 13, 2026

Stack trace:

“SIGSEGV on exit with ADCore master. Backtrace shows crash in NDArrayPool::release (ADCore) called from PVA teardown (freeNDArray → NDArray::release) after the driver and its pool are already destroyed. Full bt full and info sharedlibrary attached.”

Attached: The full bt full output (and optionally info sharedlibrary) from your tpx3_SIGSEGV.md file, or the relevant frames (#0–#2, #37, #49, #65–#66, #71–#72, #79–#81) to see the PVA → NDArray release → pool release path and the exit-handler order.

GDB analysis – SIGSEGV on exit (ADCore master)

Build: ADTimePix3 IOC, run with ADCore current master (no PR 570, no destructible PR 572 in this run).
Repro: Acquisition run, then exit in the IOC shell.
Crash: SIGSEGV in NDArrayPool::release() at NDArrayPool.cpp:373 (onReleaseArray(pArray)).

Where it crashes

#0 – NDArrayPool::release(this=0x555556ccc870, pArray=0x7fff3c001a00) at NDArrayPool.cpp:373
So the fault is inside the pool’s release() (use-after-free on the pool or its internals).

Call chain (who called into the pool)

#81–79 – User types exit → epicsExit(0) → C library runs atexit handlers.
#72–71 – One of those handlers destroys the pvAccess ServerContext (static/atexit cleanup).
#70–65 – PVA tears down transports and channels, then ServerMonitorRequesterImpl::destroy.
#57–48 – MonitorElementQueue and its MonitorElements are destroyed.
#37–36 – MonitorElement destructor runs (PVA held a structure that contained array data).
#30–11 – Destructors for PVStructure → PVValueArray → shared_vector; the shared storage uses a custom deleter.
#2 – freeNDArray::operator() (ADCore ntndArrayConverter.cpp:44) – deleter for the PVA copy of the NDArray data.
#1 – NDArray::release() – called from that deleter.
#0 – NDArrayPool::release() – crash.

So: PVA is shutting down (atexit), destroying MonitorElements that still hold NDArray-backed data. Their deleter calls NDArray::release(), which calls NDArrayPool::release() on a pool that no longer exists.

Why the pool is gone

Shutdown order is:

  • Earlier in exit, the ADTimePix driver is destroyed (either by epicsAtExit(exitCallbackC) or by asyn destructible teardown). That runs ~asynNDArrayDriver, which deletes the NDArrayPool (pNDArrayPoolPvt_). So the pool at 0x555556ccc870 is freed.
  • Later, the pvAccess ServerContext is destroyed (another atexit). Its MonitorElements still hold NDArrays (or views) that point at that same pool. When those are freed, freeNDArray → NDArray::release() → NDArrayPool::release() runs on the already deleted pool → use-after-free → SIGSEGV.
    So the trace matches the “destroyed pool” scenario: driver (and pool) destroyed first, PVA tears down later and calls release() on that pool.
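
The ordering problem described above can be reproduced in a small model without touching freed memory. This sketch assumes hypothetical stand-ins: Pool for NDArrayPool, Array for NDArray, and ReleaseDeleter playing the role of freeNDArray. A std::shared_ptr with a custom deleter stands in for the PVA-held shared storage. The model only records events; the place where the real code would dereference the pool is marked in a comment.

```cpp
#include <memory>
#include <string>
#include <vector>

// Event log so the ordering can be inspected safely.
static std::vector<std::string> &events() {
    static std::vector<std::string> log;
    return log;
}

struct Pool {  // stand-in for NDArrayPool
    ~Pool() { events().push_back("pool destroyed"); }
};

struct Array {  // stand-in for NDArray
    Pool *pool;  // raw, non-owning back-pointer
    void release() {
        // The real code would call into *pool here. After the pool has been
        // deleted, that call is the use-after-free. The model only logs it.
        events().push_back("release requested");
    }
};

// Plays the role of freeNDArray: release the array when the last holder
// drops the shared storage. The model also deletes its own Array object.
struct ReleaseDeleter {
    void operator()(Array *a) const {
        a->release();
        delete a;
    }
};

// Driver teardown first, holder teardown second, as in the backtrace.
std::vector<std::string> simulateShutdown() {
    events().clear();
    Pool *pool = new Pool;
    std::shared_ptr<Array> held(new Array{pool}, ReleaseDeleter{});
    delete pool;   // driver destructor deletes the pool
    held.reset();  // later teardown drops the data and fires the deleter
    return events();
}
```

Because the holder keeps only a raw back-pointer, nothing in the ownership graph prevents the pool from being deleted first. The fix discussed in this PR leaves that ordering alone and instead makes the late release() harmless.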

Conclusion

Root cause: Use-after-free in ADCore: NDArray::release() is called from PVA’s deleter after the driver’s NDArrayPool has already been destroyed. The crash is in NDArrayPool::release (ADCore), not in ADTimePix3.
Fix: This is exactly what ADCore PR 570 (destroyed-pool safety) addresses: make NDArray::release() (and optionally the pool) safe when the pool has already been marked destroyed / unregistered, so PVA’s late release no-ops instead of touching freed pool memory.

tpx3_SIGSEGV.md
tpx3_SIGSEGV_frames.md

Why fix ADCore

  • This is an ADCore/PVA lifetime bug; ADTimePix3 can at best work around it, not truly fix it.

  • The crash is in:

    • NDArrayPool::release(), reached via freeNDArray (from ntndArrayConverter) → NDArray::release() → NDArrayPool::release()
    • Called while the pvAccess ServerContext is being destroyed at atexit.
  • By that time, the driver’s NDArrayPool has already been deleted (driver shutdown/destructor ran earlier), but PVA still holds NDArrays whose deleter calls NDArray::release().

So the use‑after‑free is between PVA’s lifetime and ADCore’s pool, not inside ADTimePix3.

What ADTimePix3 can and cannot do

  • What it cannot do:

    • It cannot intercept NDArray::release() or freeNDArray calls coming from PVA. Once an NDArray is handed to PVA, only ADCore’s NDArray/NDArrayPool layer can protect against late release() calls (this is exactly what PR 570 does with the “destroyed pool” registry and early return in NDArray::release()).
  • What it can do as a workaround (no ADCore change):

    • Do not destroy the driver/pool at exit.
      • Don’t pass ASYN_DESTRUCTIBLE in the IOC (use ADTimePixConfigWithFlags(..., 0)), and
      • Don’t register epicsAtExit (or guard it behind a macro).
    • Then, when PVA shuts down at atexit, the NDArrayPool is still alive; NDArray::release() works and the process exits cleanly.
    • Trade‑off: the driver, its threads, and the pool leak until process exit (which is usually fine for IOC shutdown).
  • If we are willing to leak on exit: yes, ADTimePix3 can avoid the crash by not being destructible (no ASYN_DESTRUCTIBLE, no epicsAtExit deletion) when using an unpatched ADCore.

  • If you want a correct, leak‑free, future‑proof fix: ADCore must be changed (PR 570 or equivalent). That’s the only place that can safely handle “PVA calls NDArray::release() after the pool is gone.”
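
The leak-on-exit workaround amounts to making the exit-time deletion optional. A minimal model of that choice follows. Everything here is hypothetical: destroyOnExit is an illustrative switch, exitHooks() stands in for the process exit-hook list (in an IOC the registration would go through epicsAtExit), and Driver is a placeholder type with a live-instance counter.

```cpp
#include <functional>
#include <vector>

// Stand-in for the process exit-hook list.
static std::vector<std::function<void()>> &exitHooks() {
    static std::vector<std::function<void()>> hooks;
    return hooks;
}

struct Driver {  // placeholder for a detector driver that owns a pool
    static int liveCount;
    Driver() { ++liveCount; }
    ~Driver() { --liveCount; }
};
int Driver::liveCount = 0;

// destroyOnExit == false models the workaround: no exit hook is registered,
// so the driver (and its pool) stay alive until the process ends.
Driver *createDriver(bool destroyOnExit) {
    Driver *d = new Driver;
    if (destroyOnExit) {
        exitHooks().push_back([d] { delete d; });
    }
    return d;
}

// Simulates exit-time processing of the registered hooks.
void runExitHooks() {
    for (auto &hook : exitHooks()) hook();
    exitHooks().clear();
}
```

With destroyOnExit set to false the driver is intentionally leaked, so anything that releases arrays later during exit still finds a live pool. This trades a leak at process exit for avoiding the crash on an unpatched ADCore.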

@kgofron kgofron mentioned this pull request Mar 16, 2026

3 participants