21

A decade ago, in 2014, some guy named Jeff tweeted¹:

There are two hard things in computer science: cache invalidation, naming things, and off-by-one errors.

Well, somehow this seems applicable here:

Inbox with four unread messages and a counter showing 3.

The notification counter tells me I have 3 unread notifications. But as you can see in the list, there are 4 that I have not read. The rest below are indeed read, so the count is exactly off-by-one.

This is not because I left a page open and the socket died in the meantime (which usually leaves a disparity between the counter and the actual inbox messages). This is persistent - I took the screenshot after reloading the page. I have also tried hard reloading the page and doing multiple reloads. The result is the same on any page I try.

In fact, since I started writing this bug report, I got one new message. I have yet to click on any of the others, so the counter now shows 4 but there are 5 things unread (same 4 as before + 1 new one).

Since then, I tried marking the notifications as read one by one until the counter disappeared while one of them was still unread:

Inbox with one unread message and no counter.

Finally, I marked the last one as read and...I am owed an inbox message, I think:

Inbox with no unread messages. The "refresh" button shows "-1"

The "refresh" link claims that if I click it, I would get negative one new notifications.


After some experimentation: chat notifications do not add to the counter. Marking them as read still subtracts from the counter.

This is exactly what caused it - in the screenshot I do have a chat reply, which accounts for the off-by-one error.

This also means that if you only have new notifications that came from chat, you would never be able to tell until you open the inbox and see them.
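The behaviour described above can be reproduced with a toy model. This is only a sketch of what I observed from the outside, not the site's actual code; the `Inbox` class and its method names are made up for illustration.

```python
class Inbox:
    """Toy model of the observed behaviour (hypothetical, not the real code)."""

    def __init__(self):
        self.unread_items = []  # what the inbox list shows
        self.counter = 0        # what the badge shows

    def receive(self, item, from_chat=False):
        self.unread_items.append(item)
        # Observed bug: chat notifications never bump the counter.
        if not from_chat:
            self.counter += 1

    def mark_read(self, item):
        self.unread_items.remove(item)
        # Marking anything as read subtracts, chat item or not.
        self.counter -= 1


inbox = Inbox()
for i in range(3):
    inbox.receive(f"comment-{i}")
inbox.receive("chat-reply", from_chat=True)
print(len(inbox.unread_items), inbox.counter)  # 4 3

for item in list(inbox.unread_items):
    inbox.mark_read(item)
print(len(inbox.unread_items), inbox.counter)  # 0 -1
```

Four unread items with a counter of 3, and then an empty inbox with a counter of -1, match the screenshots.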


¹ For the record, that tweet was not the original, but it is how I first became aware of the quote. Also, it felt appropriate to use that exact source.

7
  • 1
    Anyone ever double posted an answer/question/comment? Happened to me yesterday, nothing I did. There's a sync and/or resend issue somewhere in the system (possibly, could be in my ISP or somewhere else in the chain I suppose). Commented Nov 20, 2024 at 19:23
  • 3
    I just got a "(-1)" as well. I feel like I have reached new heights in antisociality Commented Nov 20, 2024 at 20:52
  • 5
    This seems to be a recently-introduced issue, so I'm tossing status-review at this in the event balpha/devs are still available following the chat upgrades. Commented Nov 20, 2024 at 20:53
  • 1
    I had a repro of the -1 Commented Nov 21, 2024 at 0:01
  • 4
    This even causes the red number on the inbox icon to not show up. I had two notifications in the inbox but no indication of that. Clicking the "refresh" somehow fixes the count. Commented Nov 21, 2024 at 5:09
  • I've also been seeing an off-by-total error as of a day ago. Commented Nov 21, 2024 at 11:59
  • 7
    Reproduced. I'm looking at it; this might be related to this because StackAuth (the authentication service mentioned there) is also responsible for inserting chat notifications into the inbox. Because ... reasons. Commented Nov 21, 2024 at 12:05

1 Answer

15

This was a fun one. Fixed now.

As I mentioned in a comment, this was related to the migration of our auth service to GCP. It's the first thing we've moved that actually does non-trivial backend work (StackSnippets has been in GCP for a while, but it doesn't talk to things like Redis or databases).

This auth service, called "StackAuth", is responsible for network login functionality, but that's not the only thing it does. For largely historical reasons, it also has some small private APIs for "global network things", i.e. things that don't necessarily involve a specific Stack Exchange site.

One of those things is a private API that Chat calls to insert items into the Stack Exchange™ MultiCollider SuperDropdown™, a.k.a. the global inbox.

This involves three things:

  1. Inserting the inbox item into the database,
  2. sending a message via Redis that gets forwarded to the websocket, so that your inbox lights up immediately when you have a new notification, and
  3. incrementing a counter in Redis that stores your unread notification count, which is how the site still knows about that number on subsequent page loads.

Points 1 and 2 worked fine -- the notification was actually in your inbox (point 1), and the realtime notification also came as expected. But point 3 didn't work -- when you loaded the page after the message appeared in your inbox, you didn't get an indication of that, because the counter had never been incremented.
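As a rough sketch, the three steps look something like the following. Everything here is a stand-in: `FakeRedis` is an in-memory imitation of the two Redis features involved, and the function, channel, and key names are invented for the example rather than taken from StackAuth.

```python
from collections import defaultdict


class FakeRedis:
    """In-memory stand-in for Redis pub/sub and the keyspace (illustration only)."""

    def __init__(self):
        self.keys = defaultdict(int)          # the keyspace: key -> integer value
        self.subscribers = defaultdict(list)  # pub/sub: channel -> callbacks

    def publish(self, channel, message):
        for callback in self.subscribers[channel]:
            callback(message)
        return len(self.subscribers[channel])

    def incr(self, key):
        self.keys[key] += 1
        return self.keys[key]


def insert_inbox_item(db, redis, user_id, item):
    db.setdefault(user_id, []).append(item)   # 1. store the inbox item
    redis.publish(f"inbox:{user_id}", item)   # 2. realtime push via pub/sub
    return redis.incr(f"unread:{user_id}")    # 3. persistent unread counter


db, redis, pushed = {}, FakeRedis(), []
redis.subscribers["inbox:42"].append(pushed.append)  # pretend websocket listener
count = insert_inbox_item(db, redis, 42, "chat reply")
print(db, pushed, count)
```

Steps 2 and 3 both go through Redis, but they touch different parts of it: step 2 is a fire-and-forget message, while step 3 writes a value that a later page load has to find again under the same key.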

One detail you need to know about our cloud migration is that we start by migrating only the application itself, while initially keeping shared resources like SQL databases and Redis in our datacenter. That means that for now, StackAuth runs happily in a Kubernetes cluster in GCP, but it still talks to SQL Server and Redis on bare metal servers in New Jersey.

And Redis in particular was the culprit here. Clearly, Redis kinda worked -- that's point 2 above. So what's different about point 3?

Well, the realtime notifications use Redis pub/sub channels, but the counter from point 3 is an actual value in the Redis keyspace.

For the non-pub/sub Redis use, we utilize a Redis feature called "databases", which are basically namespaces. Redis' creator has said that the databases feature was a mistake, but that didn't stop us from using them 😬.

Each Stack Exchange site has a database for any site-specific caching needs, and there's one database (number 0) for all the network-wide stuff -- like, for example, inboxes.

Now, because Redis databases are a "please don't use this" feature, we're gradually phasing them out. Stack Overflow for Teams switched to a different approach a while ago, where we only use database 0, and instead differentiate individual sites by using a key prefix (or a "g" prefix for global stuff).
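To make the two models concrete, here is a small sketch that maps a logical key to the (database number, key) pair it would end up at. The `:` separator, the key names, and the site identifiers are assumptions for the example; only "database 0 for global stuff" and the "g" prefix come from the description above.

```python
GLOBAL_DB = 0


def old_model(key, site_db=None):
    """Old scheme: one Redis database per site, database 0 for network-wide data."""
    return (GLOBAL_DB if site_db is None else site_db, key)


def new_model(key, site=None):
    """New scheme: everything lives in database 0, told apart by a key prefix."""
    prefix = "g" if site is None else site
    return (GLOBAL_DB, f"{prefix}:{key}")


print(old_model("unread:42"))                  # (0, 'unread:42')
print(new_model("unread:42"))                  # (0, 'g:unread:42')
print(old_model("cache:q:1", site_db=7))       # (7, 'cache:q:1')
print(new_model("cache:q:1", site="cooking"))  # (0, 'cooking:cache:q:1')
```

Note that for global data, both models point at database 0 and differ only in the key name.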

For the public network, we are finally going to make that change as well, as part of migrating to GCP. And that's why the application in GCP was already configured to use the new model. But this was wrong, because it still talks to Redis in the datacenter, together with other on-prem apps that still use the old model.

And so, when a new chat inbox message came in, StackAuth in GCP incremented a counter with a "g" prefix, but the Q&A website (which displays your inbox) read a counter without the prefix.
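In miniature, the failure looks like this. A plain dict stands in for the shared Redis instance, and the key names and `:` separator are invented for the illustration.

```python
store = {}  # stands in for the shared Redis in the datacenter


def incr(key):
    store[key] = store.get(key, 0) + 1
    return store[key]


def writer_key(user_id):
    # Hypothetical key as written by the app configured for the new model.
    return f"g:unread:{user_id}"


def reader_key(user_id):
    # Hypothetical key as read by the app still on the old model.
    return f"unread:{user_id}"


incr(writer_key(42))                   # a chat notification arrives
badge = store.get(reader_key(42), 0)   # the next page load reads the counter
print(store, badge)                    # {'g:unread:42': 1} 0
```

Nothing errors on either side: the write succeeds and the read succeeds, they just never meet, which is why the counter silently stayed behind.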

This misconfiguration is fixed now -- switching everything to use the new Redis model is going to be a later step.

3
  • 5
    It's been years since I've seen anyone call the SEMCSD by its proper, correct, and official name. Commented Nov 21, 2024 at 18:20
  • 1
    @Slate In my heart it will always be the SuperCollider Commented Nov 21, 2024 at 18:21
  • 2
    Thanks for digging into this and fixing it, balpha! :) Commented Nov 21, 2024 at 20:10
