I apologize in advance for any lack of knowledge. Please feel free to edit this question to use the appropriate terminology if mine is not correct.
I recently wrote a .NET Core application that continuously pulls new data from an OData feed via HTTP requests and writes it to a SQL Server database. Despite fairly solid exception logging and handling of non-success HTTP status codes, the application regularly misses roughly 1 in 500 records. I've spent a long time thinking about why and done some research, but I haven't been able to find a fix. I'm asking for possible solutions, but any general suggestions on how to improve the application's design are also appreciated.
Here is the general flow of the application.
- A watchdog service runs the whole application repeatedly. Before each run, the watchdog reads from a table that stores some table and program metadata and gets the last time the program succeeded.
- Now for the main application. First, it does some setup (things like dependency injection and connections).
- Then the application checks the metadata table to see when the last time each table was successfully updated.
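To make the checkpoint logic concrete, here is a simplified Python sketch of that metadata lookup (the names, row shapes, and the epoch fallback are illustrative, not my actual C# code):

```python
from datetime import datetime, timezone

def get_last_success(metadata_rows, table_name):
    """Return the last successful sync time recorded for a table,
    falling back to a 'beginning of time' default if the table has
    never been synced (fallback behavior is an assumption here)."""
    for row in metadata_rows:
        if row["table"] == table_name:
            return row["last_success"]
    return datetime(1970, 1, 1, tzinfo=timezone.utc)

# In-memory stand-in for the metadata table
metadata = [
    {"table": "Orders", "last_success": datetime(2023, 5, 1, tzinfo=timezone.utc)},
]
```

Every subsequent filter on created/modified timestamps is built from this single value per table.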
The following is then done for each table, asynchronously:
- Create an empty temp table with the same schema as the target table.
- An HTTP request is sent to get the first batch of records along with the total count of records to fetch. The records are filtered to only those with a created timestamp greater than or equal to the datetime of the last successful application completion (and the count reflects this filter). If there are any records, they are written to the temp table.
- The remaining batches are fetched using the exact same filter, with a "skip" clause to page through them.
- Generate an SQL statement to insert the records from the temp table into the target table and run the statement.
- Truncate the temporary table.
- The same process is repeated, but this time filtering by the modified timestamp. On completion, the matching records in the target table are updated.
- Drop the temp table.
- If there were any exceptions, they are logged and re-thrown. The top level of the application catches them and returns a negative exit code that depends on the exception.
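The batching in steps 2–3 above looks roughly like this sketch in Python (the `fetch_page` callable stands in for the OData HTTP call; all names here are illustrative):

```python
def fetch_all_batches(fetch_page, last_success, batch_size=100):
    """Page through the feed using the same created-timestamp filter
    plus a skip offset.

    fetch_page(last_success, skip, top) stands in for the HTTP request;
    it returns (records_for_this_page, total_matching_count).
    """
    page, total = fetch_page(last_success, skip=0, top=batch_size)
    records = list(page)
    skip = batch_size
    while skip < total:
        # Same filter every time; only the skip offset changes.
        page, _ = fetch_page(last_success, skip=skip, top=batch_size)
        records.extend(page)
        skip += batch_size
    return records

# Fake feed for illustration: 250 records matching the filter.
feed = list(range(250))

def fake_fetch_page(last_success, skip, top):
    return feed[skip : skip + top], len(feed)

result = fetch_all_batches(fake_fetch_page, last_success=None, batch_size=100)
```

Note that the total count is taken once, from the first request, and the skip offsets assume the result set does not shift between requests.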
Back in the watchdog:
- If the exit code signals success (and, as a secondary check, there were no logs above the Information level), only then is the metadata updated to record a successful run. The time recorded is when the application started, so that any records uploaded while the run was in progress are picked up next time.
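That watchdog checkpointing step can be sketched like this (again a simplified Python illustration with hypothetical names, not the real service):

```python
from datetime import datetime, timezone

def run_watchdog_cycle(run_app, commit_checkpoint, now):
    """Record the time we start, run the app, and only commit that
    start time as the new checkpoint if the run succeeded (exit code 0).
    Storing the *start* time (not the finish time) means records created
    mid-run are re-queried next cycle instead of falling into a gap."""
    started_at = now()
    exit_code = run_app()
    if exit_code == 0:
        commit_checkpoint(started_at)
    return exit_code

# Illustration with fakes: a successful run commits its start time.
saved = []
code = run_watchdog_cycle(
    run_app=lambda: 0,
    commit_checkpoint=saved.append,
    now=lambda: datetime(2023, 5, 2, tzinfo=timezone.utc),
)
```

On a failing run (nonzero exit code), the checkpoint is left untouched, so the next cycle re-fetches from the previous successful start time.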
Unfortunately, I have to keep polling this feed instead of using a publish-subscribe sort of pattern. There are also no 'flags' in the feed that signal whether records have been deleted (they just disappear). I have another routine that deletes records in the SQL database, but that seems to be working appropriately.
Even on runs with no logged exceptions, records still go missing, and I can't see why. Again, any advice would be greatly appreciated. Thanks very much.