7

I have a systemd service that generates a rather high volume of logs, on the order of 20-50 GB per day. The service runs on a VPS with Ubuntu 24.04 LTS. The service is written in Node.js and simply console.logs its messages. Since it runs as a systemd service, these logs are picked up by journald, and I can then view and analyze them with journalctl.

From time to time I observe the following problems:

  • The logs in journalctl are severely delayed. Namely, the service logs the current time every minute, and in the journalctl output I see records such as

    Jul 07 18:29:43 vc start-manager.sh[2733323]: Current time: 2025-07-07 18:24:43 

    The first timestamp comes from journalctl, the second from the service; the difference can be 20 minutes or more.

  • The systemd-journald process eats 20-40% CPU, as reported by htop.

  • The Node.js service gradually increases its heap size (as reported by process.memoryUsage()), until it runs out of heap and crashes.

    I've dumped the Node.js heap and saw that almost all memory is taken up by log strings (which should actually have been written to stdout immediately).

So my assumption is that journald cannot keep up with such a large flow of logs, the stdout pipe from the service to journald fills up, and Node.js apparently starts buffering log messages in memory. That would explain both the high heap usage and the log delays.
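
One rough way to test this assumption is to look at the queue counters on the unix socket between the service and journald (a sketch using ss from iproute2; the grep pattern is an assumption about how the journald stdout stream socket shows up on this system):

    # Connections to journald's stdout stream socket; a persistently large
    # Recv-Q on the journald-side sockets means journald is not reading fast
    # enough and the service ends up buffering log lines itself.
    ss -x -p | grep -F '/run/systemd/journal/stdout'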

Overall, when this happens, the system doesn't look overloaded: the 15-minute load average is around 3 (with 8 vCPUs), memory usage is around 2-5 GB out of 16 GB, and disk writes are around 5 MB/s (almost all by systemd-journald), which does not sound like much (sync; dd if=/dev/zero of=tempfile bs=1M count=1024; sync reports 600-700 MB/s).
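
A few standard commands that may help tell whether journald itself is the bottleneck (pidstat comes from the sysstat package; this assumes a single journald instance and uses an arbitrary 5-second interval):

    pidstat -u -d -p "$(pidof systemd-journald)" 5   # journald CPU and disk I/O over time
    journalctl --disk-usage                          # current on-disk size of the journal
    journalctl --verify                              # consistency check of the journal files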

There are also periods when everything works fine: logs are not delayed, systemd-journald consumes much less CPU, and the Node.js service does not use much heap. I do not see much difference between the two regimes; in particular, the service's logging rate is approximately the same.

My journald config (in /etc/systemd/journald.conf.d/journald.conf) is

[Journal]
SystemMaxUse=100G
SystemKeepFree=30G
SystemMaxFileSize=1G
SystemMaxFiles=100
RateLimitInterval=0
RateLimitBurst=0

How can I diagnose and fix this behavior? Or am I hitting inherent journald performance?

8
  • 2
    "20-50 Gb per day"? This is Bad System Design, and misuse of the "logging" system. Surely there's a better way to blast out this volume of data (redirecting STDERR, opening a file, ... ) Commented Jul 8 at 0:07
  • 2
    @waltinator redirecting STDERR?!?!?! Ooof. Are you really suggesting what's apparently a long-lived process that generates multiple gigabytes of data per day tie itself irrevocably to some output stream or file? Commented Jul 8 at 10:37
  • 1
    What journald features are you using? Can you just use rsyslog directly and pitch journald overboard? See How can journald be disabled? Commented Jul 8 at 10:41
  • Oh, and "journalctl seemingly can't handle too much logs"? There's absolutely no "seemingly" about it. Commented Jul 8 at 10:44
  • @AndrewHenle, re: what features I'm using — actually only the --since option, which lets me quickly browse logs around a given point in time. In fact, I did not intentionally choose journald; I'm using it simply because it is the default. Now that I've encountered problems, I'm asking what the reasons are, and the answer "journald is the wrong tool" is a completely valid answer... Commented Jul 8 at 15:35

2 Answers

8

Or am I hitting inherent journald performance?

You are probably hitting inherent journald performance limits.1

Consider using LogNamespace= to direct your logs to a separate journald instance (with its own process and storage) instead of overwhelming the system instance.

  1. Route logs using:

    [Service]
    LogNamespace=nodeapp

    This affects all logging interfaces (stdout, syslog, even journald API).

  2. Configure the limits in /etc/systemd/journald@nodeapp.conf (a sketch follows after this list).

  3. Check that systemd-journald@nodeapp is running (systemctl status systemd-journald@nodeapp). Defining LogNamespace= automatically adds a dependency on this service.

  4. Read the separate logs using:

    journalctl --namespace=nodeapp 
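
For reference, a minimal sketch of the per-namespace configuration from step 2 (the size and rate-limit values are placeholders to adapt, not recommendations):

    # /etc/systemd/journald@nodeapp.conf
    [Journal]
    SystemMaxUse=100G
    SystemMaxFileSize=128M
    RateLimitIntervalSec=0
    RateLimitBurst=0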

But if you aren't actually making use of journald's features (i.e. just dumping plain text messages via stdout2 instead of structured records via the API – or even worse, generating structured logs and still just dumping them into the "message text" field), then you might be better off logging directly to a file.
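
As a sketch of that last option, systemd can bypass journald entirely for this service's stdout/stderr and append to a file instead (the path below is an assumption; the append: directive requires systemd ≥ 240, and journald then no longer sees these messages at all, so journalctl --since browsing is lost for them):

    # Drop-in for the service unit, e.g. via: systemctl edit <your-service>
    [Service]
    StandardOutput=append:/var/log/nodeapp/nodeapp.log
    StandardError=append:/var/log/nodeapp/nodeapp.log

Log rotation then becomes the service's (or logrotate's) job.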


1 One of the possible causes is that a journald message has many more hidden fields than just the text and timestamp, some of them have to be retrieved by journald every time; take a look at journalctl -n 1 -o verbose to get a rough idea of what really goes into storing each of your stdout lines.

And on top of that, every field in a journald message – including even the MESSAGE field – is indexed, so writing a new message involves updating many database indices. But this is mostly a guess.

(So if you were making proper use of journald, you wouldn't be logging plain text via stdout; you would be using the API to send messages with discrete fields that can be filtered on.)
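
As an illustration of what "discrete fields" means, one can hand-craft such an entry from the shell with logger --journald from util-linux (available when logger is built with systemd support; the field names below are invented for the example):

    # Send a journal entry with custom, filterable fields
    printf '%s\n' 'MESSAGE=order processed' 'PRIORITY=6' 'ORDER_ID=12345' |
        logger --journald

Entries like this can later be filtered with e.g. journalctl ORDER_ID=12345.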

2 I suppose one could still make use of journald's filtering by message priority... but I've not seen any software send the priority tags via stdout logging.

13
  • 1
    or, potentially more powerful than just dumping to a file, dump the logs to an actual structured log ingestion system; nothing wrong with journald, but Petr's use case sounds more like InfluxDB or Prometheus would be the tool of choice here, considering how useless 2 to 5 days of this log data quickly becomes if you don't have sensible query capabilities for it. In any case, it really sounds questionable that a service with this much logging output should just write logs to stdout Commented Jul 8 at 9:12
  • 1
    OH! Your edit mentioning indices made me think! My suspicion for the differing times is that towards the end of writing a large log file, it gets more CPU- and I/O-intensive to keep the index up to date. @Petr, can you try reducing SystemMaxFileSize= to something usual, like 128M, and see whether that helps? Commented Jul 8 at 10:23
  • 1
    definitely not disagreeing with the separation into log namespaces, at all! You might be right on the reading/searching performance of the logs, but so far we can assume that the log is only written to at ludicrous speeds, not read from. A longer search matters less if you only do it once a week, vs. logging several hundred thousand times a day. Commented Jul 8 at 10:27
  • 1
    @MarcusMüller, thanks for the suggestion, I'll try it. However, I started with a small SystemMaxFileSize, but was observing journald errors, missing logs, and bad log files (checked with journalctl --verify), and suspected that too many files might be the problem. Will give it a second try. Commented Jul 8 at 15:42
  • 1
    @MarcusMüller, thanks, decreasing SystemMaxFileSize really helped; the service has been running without problems for almost two days. I also do not observe missing logs so far. Commented Jul 10 at 15:32
3

The fact that it gets slower over time, and then most of the time is good again, sounds like handling the log gets increasingly computationally intense with age – and that makes sense: a "fuller" hash table, as used for the indices in the log files, is more expensive to handle (due to hash collisions).

Reducing your SystemMaxFileSize to something more usual (like 128M) might hence solve this issue (and your comment indicates that it did the trick, indeed!).
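
In drop-in form that is a one-line change in the file the question already uses (a minimal sketch; restart systemd-journald afterwards with systemctl restart systemd-journald so it takes effect):

    # /etc/systemd/journald.conf.d/journald.conf
    [Journal]
    SystemMaxFileSize=128M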
