We operate Cassandra in an environment where the host machine occasionally loses power without a proper shutdown. Losing data is acceptable to us, but in rare cases Cassandra is not able to start up after such an unclean shutdown. Startup then fails because of commit log corruption:
ERROR [main] 2024-05-02 12:39:13,834 JVMStabilityInspector.java:196 - Exiting due to error while processing commit log during initialization.
org.apache.cassandra.db.commitlog.CommitLogReadHandler$CommitLogReadException: Mutation checksum failure at 2378 in Next section at 2016 in CommitLog-7-1714580480939.log
    at org.apache.cassandra.db.commitlog.CommitLogReader.readSection(CommitLogReader.java:387)
    at org.apache.cassandra.db.commitlog.CommitLogReader.readCommitLogSegment(CommitLogReader.java:244)
    at org.apache.cassandra.db.commitlog.CommitLogReader.readCommitLogSegment(CommitLogReader.java:147)
    at org.apache.cassandra.db.commitlog.CommitLogReplayer.replayFiles(CommitLogReplayer.java:191)
    at org.apache.cassandra.db.commitlog.CommitLog.recoverFiles(CommitLog.java:223)
    at org.apache.cassandra.db.commitlog.CommitLog.recoverSegmentsOnDisk(CommitLog.java:204)
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:353)
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:744)
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:878)

We found out about the option "cassandra.commitlog.ignorereplayerrors", which seems to affect the error handling when the commit log is processed. Some questions about this setting:
- Is it safe to enable this setting, if losing data written to Cassandra is acceptable?
- Are all Commit log corruptions ignored when enabling this setting or can it still happen that Cassandra is not able to start? Note that our key requirement is that Cassandra is able to start up after a power outage without manual intervention.
- Should "nodetool repair" be executed after Commit log corruptions have occurred?
Update to question, Oct 10, 2024:
We have enabled the "ignorereplayerrors" option and after a while we still ran into a situation where Cassandra was not able to start due to a Commit Log error.
This time the following error occurred at startup:
ERROR [main] 2024-10-21 09:18:28,276 CommitLogReplayer.java:494 - Ignoring commit log replay error
org.apache.cassandra.db.commitlog.CommitLogReadHandler$CommitLogReadException: Encountered bad header at position 606236 of commit log /opt/cassandra/data/commitlog/CommitLog-7-1729338769682.log, with invalid CRC. The end of segment marker should be zero.

We did some research and found the option "cassandra.commitlog.allow_ignore_sync_crc", which seems to affect the behavior in this case. After enabling this option, Cassandra was able to start.
Is it safe to enable this option, if data loss is acceptable?
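For context, here is a minimal Python sketch of how both properties could be passed to the JVM as -D system properties by a launcher script. It assumes the Cassandra startup scripts append the JVM_EXTRA_OPTS environment variable to the JVM arguments; that assumption, and the helper names build_jvm_opts and cassandra_env, are illustrative and not necessarily how a given installation is configured.

```python
import os

# Property names as quoted in this question; "true" enables each of them.
COMMITLOG_FLAGS = {
    "cassandra.commitlog.ignorereplayerrors": "true",
    "cassandra.commitlog.allow_ignore_sync_crc": "true",
}


def build_jvm_opts(flags):
    """Render a dict of system properties as '-Dname=value' JVM arguments."""
    return " ".join(f"-D{name}={value}" for name, value in flags.items())


def cassandra_env(base_env=None):
    """Return a copy of the environment with the flags appended to JVM_EXTRA_OPTS.

    Assumption: the startup script in use honours JVM_EXTRA_OPTS.
    """
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("JVM_EXTRA_OPTS", "")
    env["JVM_EXTRA_OPTS"] = f"{existing} {build_jvm_opts(COMMITLOG_FLAGS)}".strip()
    return env
```

The returned dictionary could then be handed to the process that launches Cassandra, for example via the env argument of subprocess.Popen.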
Furthermore, we are wondering whether there are other kinds of corruption that would still prevent Cassandra from starting. We are therefore considering adding a startup check to our application that detects when Cassandra is stuck, scans system.log for defective commit log files, and then deletes those files automatically. Is this a reasonable strategy? Note that we cannot expect the users of our product to fix the database themselves by manually deleting corrupted commit log files.
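To make the idea concrete, here is a minimal Python sketch of the log-scanning and cleanup part of such a startup check. It is only an illustration under several assumptions: the function names find_corrupt_segments and quarantine_segments are hypothetical, the file-name pattern is derived solely from the two log excerpts above, and segments are moved to a separate directory instead of being deleted so that they could still be inspected later. Detecting that Cassandra is stuck is not covered.

```python
import re
import shutil
from pathlib import Path
from typing import List

# Segment file names as they appear in the two log excerpts in this question.
SEGMENT_RE = re.compile(r"CommitLog-\d+-\d+\.log")


def find_corrupt_segments(log_text: str) -> List[str]:
    """Return segment names mentioned on CommitLogReadException lines.

    Names are returned in order of first appearance, without duplicates.
    """
    found: List[str] = []
    for line in log_text.splitlines():
        if "CommitLogReadException" not in line:
            continue
        for name in SEGMENT_RE.findall(line):
            if name not in found:
                found.append(name)
    return found


def quarantine_segments(names, commitlog_dir, quarantine_dir) -> List[Path]:
    """Move the named segments out of commitlog_dir and return their new paths.

    Names that do not exist as files in commitlog_dir are skipped.
    """
    src_dir, dst_dir = Path(commitlog_dir), Path(quarantine_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for name in names:
        src = src_dir / name
        if src.is_file():
            moved.append(Path(shutil.move(str(src), str(dst_dir / name))))
    return moved
```

A startup check along these lines would read system.log, pass its contents to find_corrupt_segments, and call quarantine_segments with the commit log directory (here /opt/cassandra/data/commitlog) before restarting Cassandra.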