I have an application with very strict requirements around auditing and the "replayability" of user actions. For this reason, I've chosen an event-sourced architecture because of its append-only/ledger nature. The system is event-driven and uses CQRS as well. Specifically: writes are appended to Redpanda (Kafka) topics, consumers process those writes and store the results in Redis/Scylla, and the client reads exclusively from Redis, which acts as a pull-through cache backed by Scylla reads on cache misses.
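To illustrate that read path, here's a minimal sketch of the pull-through cache: try Redis first, fall back to Scylla only on a miss, and write the result back into the cache. Plain dicts stand in for the real stores, and `get_value` is just an illustrative name - this isn't my actual data-access code.

```python
# Stand-ins for the real stores (illustrative only).
redis_cache = {}                                      # read-optimized cache
scylla_table = {"order:42": {"status": "COMPLETE"}}   # durable projection store

def get_value(key):
    # 1. Try the read-optimized cache first.
    if key in redis_cache:
        return redis_cache[key]
    # 2. Cache miss: read the projection from the durable store...
    value = scylla_table.get(key)
    # 3. ...and pull it through into the cache for subsequent reads.
    if value is not None:
        redis_cache[key] = value
    return value
```

The client only ever exchanges a key for a value here; nothing on the read path does joins or scans.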
The issue is with notifying the browser/client application when a change has made its way through the back end. When a write occurs, the client application obviously needs to know when the change is completed (or rejected). A change may require multiple consumers to handle different processing steps (something like the saga pattern), any of which could be long-running (up to 10s waiting on third-party services, etc). I'm familiar with strategies where the data is simply updated optimistically at the client in a sort of "fire and forget" way, but I'm talking about something where it's critical that the user knows each step has succeeded/failed.
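For the "user knows each step has succeeded/failed" requirement, one way to sketch it is a per-request status record, keyed by the request UUID, that each consumer updates as it finishes its step - so a single read returns the state of every step plus an overall verdict. The step names and record shape here are made up for illustration, not taken from my actual system.

```python
from datetime import datetime, timezone

def new_status_record(request_id, steps):
    """Record the write endpoint would store when the request is accepted."""
    return {
        "request_id": request_id,
        "steps": {name: {"state": "PENDING", "updated_at": None} for name in steps},
        "overall": "IN_PROGRESS",
    }

def mark_step(record, step, state):
    """Called by each consumer as its saga step succeeds or fails."""
    record["steps"][step] = {
        "state": state,  # e.g. "SUCCEEDED" or "FAILED"
        "updated_at": datetime.now(timezone.utc).isoformat(),
    }
    states = [s["state"] for s in record["steps"].values()]
    if "FAILED" in states:
        record["overall"] = "REJECTED"
    elif all(s == "SUCCEEDED" for s in states):
        record["overall"] = "COMPLETE"
    return record
```

In the real system this record would live in Redis under the request UUID, so every poll shows the user exactly which steps have finished.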
My question is: why is short polling so bad for this? To be clear - the client only reads from read-optimized stores like Redis (with the exception of the cache miss explained earlier). It never hits a relational DB with a huge, multi-scan join, never parses text files, etc. It only exchanges keys for values. On writes, the mutating change request is given a UUID, written to Redpanda immediately, and the UUID is returned - all in a single request/response. The client can then poll the read store (possibly with some sort of backoff function) with that UUID to check the status of the request, which will be stored in Redis. Either the status is returned, or a timeout is reached, at which point the client stops polling. I think I've heard this referred to as the "coat check pattern"?
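Concretely, the client-side loop I have in mind looks something like this: poll with the UUID, back off between attempts, and stop on a terminal status or timeout. `check_status` stands in for the HTTP call to the read API, and the delay/timeout values are placeholders, not tuned numbers.

```python
import time

def poll_until_done(request_id, check_status,
                    initial_delay=0.25, factor=2.0, max_delay=5.0, timeout=30.0):
    """Poll the read store for a terminal status, with capped exponential backoff."""
    deadline = time.monotonic() + timeout
    delay = initial_delay
    while time.monotonic() < deadline:
        status = check_status(request_id)       # key-for-value read against Redis
        if status in ("COMPLETE", "REJECTED"):
            return status                       # terminal: stop polling
        # Sleep for the current delay, but never past the deadline.
        time.sleep(min(delay, max(0.0, deadline - time.monotonic())))
        delay = min(delay * factor, max_delay)  # back off, capped
    return "TIMED_OUT"                          # give up; client stops polling
```

Capping the backoff keeps worst-case notification latency bounded while still thinning out requests for the slower sagas.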
My reasoning is as follows:
Why not websockets?
While WS would provide real-time feedback for the client, it drastically reduces the number of clients a single machine can serve, since the connections must be held open until the result is returned. Beyond that, WebSockets have been a considerable issue for me in the past when multiple proxies, load balancers, Kubernetes pods, ingresses, etc are involved. This application will serve tens of millions of users and will need to scale elastically. The user actions being performed are largely transactional, and there are no real-time requirements outside the one I've described (no long-running chats or the like).
Why not SSE?
This would essentially be the same situation as WS. While there are none of the protocol-upgrade issues and everything happens over standard HTTP, the problem remains: once a client is connected, it must hold the connection open while the requested change makes its way through the system - tying up server resources until the change is complete.
Why short polling?
It seems like everything I read on this topic lists the options as WS, SSE, and short polling, only to categorically dismiss short polling due to its "chattiness". In summary: "It's too chatty because you have to make a bunch of requests, a majority of which don't return anything, which uses resources for nothing." Then comes what feels like a hand-wave - "just use WS/SSE because short polling is too chatty" - with no mention of the complex path a WS request actually needs to traverse to make it through all of the (ever-elastically-shifting) layers of a typical, large-scale, enterprise network. Not to mention that a WS/SSE connection may return a majority of its responses as heartbeats until the one "success/failure" response comes back - all while preventing other users' requests (also in the form of WebSockets/SSE) from even reaching the server. That sounds a lot like the complaint about short polling: most responses carry no domain-level data.
This is why short polling seems so appealing to me. Yes, it's chatty - there are more requests made and some of them will return empty response bodies. However:
- As stated before: these requests are only made to read-optimized stores.
- The back-end infra can scale/shift/etc along with load and the client won't care because there's no long-running connection to sever.
- One server can handle more users - not from the standpoint of concurrent connections, but from the standpoint that if 1,000,000 users make a request at the same time to one server, all "is it done yet?" polling sessions will eventually get their answer, assuming it arrives within the timeout. (To be clear - I'm not actually planning to throw millions of users at a single box; I'm just using the extreme for the sake of example.) I know this because I load tested it myself: I ran the test with far more users than a single machine could hold connections for, and all users ultimately received their "yes, it's done" responses well within the timeout.
- Keep-alive has been standard since HTTP/1.1. I realize this won't help on each and every connection - and will be less likely to apply as load increases - but when a client is able to make multiple short-poll requests before its connection is severed, every request after the initial one carries much lower overhead, similar to the repeat messages in a WS/SSE approach.
What am I missing that makes short polling so bad? I'm not building Discord or some sort of low-latency gaming server. I just need to get a lot of users through multi-step, transactional workflows, and I don't want to have to "hold the door open", so to speak, while the back end converges on the final state of each request.
To be clear, I'm not saying that WS and SSE don't have their place. I've built trading applications that relied on tick-level data and other monitoring/status dashboards where it's nice to just have that one pipe connected to the client and let the server emit whatever happens whenever it happens. My current requirements, however, aren't that kind of real-time.
Am I way off here? Am I missing something obvious? Like I said, I've load tested this myself (on an actual, physical rack across multiple machines with actual networking cables between them - not just on my computer) and it works quite well. It just seems like I must be missing something, considering how opposed the internet evidently is to this idea.
Thanks!