
One of our systems has a growing log file (which we will be addressing separately), but currently the application owner deletes the file with rm and then waits for the next maintenance window to reboot. I find myself weeks away from the next maintenance window and with a disk at 100% utilization.

Following guidance from this post, I located the file and truncated it. The issue now is that the program/process does not appear to be writing logs anywhere. What is the best way to get this process to stop using the old file and start using the 'new' file?

# find /proc/*/fd -ls | grep '(deleted)' | grep path
112567191    0 l-wx------   1 user1  group1  64 Feb 20 14:10 /proc/27312/fd/2 -> /path/file.log\ (deleted)
# > "/proc/27312/fd/2"
# find /proc/*/fd -ls | grep '(deleted)' | grep path
112567191    0 l-wx------   1 user1  group1  64 Feb 20 14:10 /proc/27312/fd/2 -> /path/file.log\ (deleted)
# stat /path/file.log
  File: ‘/path/file.log’
  Size: 0            Blocks: 0          IO Block: 4096   regular empty file
Device: 811h/2065d   Inode: 2890717     Links: 1
Access: (0644/-rw-r--r--)  Uid: (54322/loc_psoft)  Gid: (54321/oinstall)
Context: unconfined_u:object_r:unlabeled_t:s0
Access: 2019-02-20 12:44:42.738686325 -0500
Modify: 2019-02-08 11:38:19.741494973 -0500
Change: 2019-02-08 11:38:19.741494973 -0500
 Birth: -
# stat /proc/27312/fd/2
  File: ‘/proc/27312/fd/2’ -> ‘/path/file.log (deleted)’
  Size: 64           Blocks: 0          IO Block: 1024   symbolic link
Device: 3h/3d        Inode: 112567191   Links: 1
Access: (0300/l-wx------)  Uid: (54322/loc_psoft)  Gid: (54321/oinstall)
Context: unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
Access: 2019-02-20 14:10:45.155518866 -0500
Modify: 2019-02-20 14:10:45.154518886 -0500
Change: 2019-02-20 14:10:45.154518886 -0500
 Birth: -

At this time I don't have a disk-space issue; I only have the issue of logs not being written.

UPDATE 1: The PID can be found using lsof +L1 | grep $path, and the 'held' file also shows up under /proc/$PID/fd/N. I haven't been able to sell an interruption to the deciders yet, either as an init 6 or a kill -1 $PID. I'm going to try to recreate the issue elsewhere and test a few of the suggestions given here, plus what I've dug up.
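For anyone wanting to recreate it on a scratch box, a minimal sketch (the writer loop and /tmp paths are made up for illustration; it just holds a log open and then deletes it out from under the writer):

( exec >/tmp/test.log 2>&1; while :; do date; sleep 1; done ) &
WRITER=$!
rm /tmp/test.log
ls -l /proc/$WRITER/fd        # fd 1 now shows '/tmp/test.log (deleted)'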

  • Consider putting the app, or at least its logs, on their own filesystem. Commented Feb 21, 2019 at 0:21

5 Answers


The program in question will have to be altered or, simply, restarted.

What appears to be happening is that the program opens a file handle for writing to the log and keeps that same handle open for the duration. If the file is removed, it is, as you describe, "held" in abeyance, and is indeed still written to until the file handle is closed.

If you can alter the program to change it from (pseudocode):

LogFileHandle = OpenFileHandle( Logfile, 'wa' )

UpdateLog( log_entry ) {
    LogFileHandle.Write( log_entry )
}

do_literally_everything_forever()

LogFileHandle.Close()

to (pseudocode):

UpdateLog( log_entry ) {
    LogFileHandle = OpenFileHandle( Logfile, 'wa' )
    LogFileHandle.Write( log_entry )
    LogFileHandle.Close()
}

do_literally_everything_forever()

That will solve the issue.

If you cannot, then rather than rebooting the entire system, keep in mind that a file which has been rmed will be well and truly gone once every process holding it open has been stopped (or, more specifically, once their file handles have been closed).

Most well-written daemons will incidentally cycle their file handles if sent SIGHUP (read your program's documentation!). But simply stopping (or terminating) and restarting the program will also release any open file handles.
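For example, a minimal sketch of the SIGHUP route, assuming the daemon really does reopen its logs on that signal (verify against its documentation first):

kill -HUP "$PID"
ls -l /proc/"$PID"/fd | grep file.log    # should no longer show '(deleted)'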


You could try attaching a debugger to that process and force-redirecting its file descriptor 2 elsewhere:

gdb -batch -p PID -ex 'p $f=open("/path/to/log", 01101, 0666), dup2($f, 2), close($f)' 

Replace PID with the pid of your process, and "/path/to/log" with the file where fd 2 (stderr) should be redirected. 01101 is octal for O_WRONLY|O_CREAT|O_TRUNC. You may change the 0666 permissions to something more restrictive if the process's umask is not right. Note that the process may be buffering its output, so it may not appear immediately in the file where stderr was redirected.

This is a hack. YMMV.
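If it does work, a quick sanity check, reusing the paths from the question:

ls -l /proc/27312/fd/2    # should now point at /path/file.log with no '(deleted)' suffix
tail -f /path/file.log    # new entries should appear, subject to the buffering noted above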

  • I like this workaround hack until I can get the restart or get the process killed; I'm going to try out a few things before I mark this as answered. Sadly I'm too new to upvote this answer. Commented Feb 21, 2019 at 17:17

The underlying issue appears to be that the inode of the file stays the same after it is deleted and remains in use by the software writing the logs. It is easy enough to recreate the file, but this generates a new file with a new inode, and the process continues to write to the original one. I have not yet found a way to swap inodes so that the logging program lets go of the deleted file and starts using the new one; this is why killing the process, or a reboot, is needed.
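The mismatch is easy to confirm, reusing the PID and path from the question (stat -L follows the /proc symlink to the still-open file):

stat -c '%i' /path/file.log         # the recreated file, inode 2890717
stat -L -c '%i' /proc/27312/fd/2    # the still-open deleted file: a different inode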

As a temporary solution, copying the current content of /proc/$PID/fd/# to the original log location appears to be the best option. After trying to work with the solution proposed by @mosvy, I found another method:

# nohup tail -c +0 -f /proc/$PID/fd/# > /path/file.log & 
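Two housekeeping notes on this bridge (a sketch; the PID-file path is made up, and fd 2 is taken from the question): tail -c +0 copies the existing contents before following, and the tail should be stopped once the application is eventually restarted.

nohup tail -c +0 -f /proc/$PID/fd/2 > /path/file.log &
echo $! > /tmp/log-bridge.pid        # remember the bridging tail's PID
# later, once the application itself has been restarted:
kill "$(cat /tmp/log-bridge.pid)"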

Two references came up a lot: one from Linux.com, which covered what happened and how to recover a static file, and a second that was referenced in this superuser post.


If modifying the program is an option, then perhaps you could add a signal handler that causes it to reopen the log file, i.e. obtain a new reference to the path. This would allow you to do something like:

hup=1
pid=$(get-the-pid-somehow)
kill -n $hup $pid
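What such a handler might look like if the program were itself a shell script (a hypothetical sketch; a real daemon would do the equivalent in its own language):

#!/bin/sh
LOGFILE=/path/file.log

reopen_log() {
    # close the old handle and open a fresh one on the path
    exec >>"$LOGFILE" 2>&1
}

trap reopen_log HUP
reopen_log

while :; do    # stand-in for the real work loop
    echo "$(date): still working"
    sleep 60
done

After this, kill -HUP $pid makes the script drop the old (possibly deleted) inode and write to whatever now lives at /path/file.log.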

Is it possible to restart the application service? If yes, have you tried restarting it? This should release the old PID and create a new one.

This should also reset the application service, and I believe it should then write the logs to the mentioned file.
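For instance, a sketch assuming the application runs as a hypothetical systemd unit named app.service:

systemctl restart app.service
ls -l /proc/"$(systemctl show -p MainPID --value app.service)"/fd    # no fd should show '(deleted)'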
