My server crashes every two days, around early afternoon. I've tried overloading it with CPU-intensive programs, but that does not make it crash, so I believe a specific program or configuration is causing it. I've installed the crash utility and tried some simple commands on the dump, but I'm not sure how to interpret the output.

[root@resh boot]# crash /usr/lib/debug/lib/modules/2.6.32-642.1.1.el6.x86_64/vmlinux /var/crash/127.0.0.1-2016-08-02-09\:12\:20/vmcore

      KERNEL: /usr/lib/debug/lib/modules/2.6.32-642.1.1.el6.x86_64/vmlinux
    DUMPFILE: /var/crash/127.0.0.1-2016-08-02-09:12:20/vmcore  [PARTIAL DUMP]
        CPUS: 32
        DATE: Tue Aug 2 09:09:29 2016
      UPTIME: 12:47:24
LOAD AVERAGE: 4.78, 4.66, 4.55
       TASKS: 998
    NODENAME: resh.cluster.org
     RELEASE: 2.6.32-642.1.1.el6.x86_64
     VERSION: #1 SMP Tue May 31 21:57:07 UTC 2016
     MACHINE: x86_64 (2294 Mhz)
      MEMORY: 31.8 GB
       PANIC: "BUG: unable to handle kernel NULL pointer dereference at 0000000000000002"
         PID: 42993
     COMMAND: "kslowd002"
        TASK: ffff88040d88d520 [THREAD_INFO: ffff880100000000]
         CPU: 7
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 42993 TASK: ffff88040d88d520 CPU: 7 COMMAND: "kslowd002"
 #0 [ffff8801000039c0] machine_kexec at ffffffff8103fdcb
 #1 [ffff880100003a20] crash_kexec at ffffffff810d1fe2
 #2 [ffff880100003af0] oops_end at ffffffff8154bd00
 #3 [ffff880100003b20] no_context at ffffffff810518cb
 #4 [ffff880100003b70] __bad_area_nosemaphore at ffffffff81051b55
 #5 [ffff880100003bc0] bad_area_nosemaphore at ffffffff81051c23
 #6 [ffff880100003bd0] __do_page_fault at ffffffff8105231c
 #7 [ffff880100003cf0] do_page_fault at ffffffff8154dc8e
 #8 [ffff880100003d20] page_fault at ffffffff8154af95
    [exception RIP: unknown or invalid address]
    RIP: 0000000000000002 RSP: ffff880100003dd8 RFLAGS: 00010202
    RAX: ffffffffa0465a80 RBX: ffff8801bc7da200 RCX: ffff8801bc7da2a8
    RDX: 0000000000000002 RSI: 00000000ffffffff RDI: ffff8801bc7da200
    RBP: ffff880100003e20 R8: ffffffff81ad12d8 R9: fe2582cc8764a601
    R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
    R13: ffff8801bc7da248 R14: ffff8801bc7da290 R15: 00000000ffffffff
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
 #9 [ffff880100003dd8] fscache_object_slow_work_execute at ffffffffa0460e9f [fscache]
#10 [ffff880100003e28] slow_work_execute at ffffffff81121363
#11 [ffff880100003e68] slow_work_thread at ffffffff81121645
#12 [ffff880100003ee8] kthread at ffffffff810a662e
#13 [ffff880100003f48] kernel_thread at ffffffff8100c28a

Since it seems to happen every two days, I looked at the cron jobs, but none of them runs on a two-day schedule. I've also tried updating the kernel, but that has not helped either.

1 Answer

I realize this post is ancient history now, but I just had this exact same error happen tonight. Notice this stack entry:

#9 [ffff880100003dd8] fscache_object_slow_work_execute at ffffffffa0460e9f [fscache] 

It's just a hunch, but we've had problems with post-2.6.32-642 kernels when cachefilesd has been running. We turned it off on our RHEL 6.x systems with these post-Dirty COW kernels.
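To spot entries like that one quickly, a saved `bt` output can be filtered for frames that end in a bracketed module name. This is only a rough sketch: `module_frames` is a hypothetical helper name, and it assumes the usual crash layout, where frames inside a loadable module carry a `[module]` suffix and core-kernel frames do not.

```shell
#!/bin/sh
# Print only the backtrace frames that end in a bracketed module name,
# e.g. "... at ffffffffa0460e9f [fscache]".
# module_frames is a hypothetical helper; it reads bt output on stdin.
module_frames() {
    grep -E '^[[:space:]]*#[0-9]+ .*\[[A-Za-z0-9_]+\][[:space:]]*$'
}

# Example input: three frames copied from the backtrace in the question.
module_frames <<'EOF'
#9 [ffff880100003dd8] fscache_object_slow_work_execute at ffffffffa0460e9f [fscache]
#10 [ffff880100003e28] slow_work_execute at ffffffff81121363
#11 [ffff880100003e68] slow_work_thread at ffffffff81121645
EOF
```

On the backtrace above, the only frame that survives the filter is #9, which is what points at fscache.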

Update: see this Red Hat Solutions page for more info about the problem.

Is cachefilesd running on this system? If so, try stopping it and disabling it at boot.
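On RHEL 6 with its stock SysV init tools, that would look roughly like the sketch below. The `DRY_RUN` switch and the function names are my own additions for illustration; by default the script only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Sketch: stop cachefilesd now and keep it from starting at boot.
# Assumes RHEL 6 style SysV init (service/chkconfig); run as root.
# DRY_RUN is a hypothetical switch for this example: when it is 1
# (the default), the commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

disable_cachefilesd() {
    run service cachefilesd stop     # stop the running daemon
    run chkconfig cachefilesd off    # do not start it in any runlevel at boot
}

disable_cachefilesd
```

Running it with `DRY_RUN=0` as root applies the change. Afterwards, `service cachefilesd status` and `chkconfig --list cachefilesd` should confirm that the daemon is stopped and disabled.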

  • Yes! We have been turning off cachefilesd on all our hosts... We too noticed this process running away and causing issues over time. Commented Apr 28, 2017 at 22:49
