A couple of years ago, I designed an API service that limits the number of requests each user can make per month. I track each user's monthly usage in a table called `monthly_usage`, with the following structure:
| Column name | Type | Description |
|---|---|---|
| id | integer | A unique identifier for the user |
| current_period_start | timestamp | The start time of the current monthly usage period |
| current_period_end | timestamp | The end time of the current monthly usage period |
| total_requests | integer | The total number of requests made by the user in the period |
Whenever a request is served, an event is fired and the user details are used to increment a counter in Redis called `current_count`. Every 10 minutes, a scheduled job called `usage_collection_job` runs and performs the following steps:
- Collect the `current_count` value for each user from Redis.
- Add the `current_count` value to the corresponding `total_requests` value in the `monthly_usage` table.
- Reset `current_count` in Redis to zero.
- Store the updated `total_requests` value in Redis under the key `current_total`.
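
For readers who prefer code, the four steps above can be sketched roughly as follows. This is only an illustration, not the actual job: a plain dict stands in for Redis, another dict stands in for the `monthly_usage` table, and the key layout (`current_count:<id>`, `current_total:<id>`) and the `MonthlyUsage` class are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class MonthlyUsage:
    """One row of monthly_usage (period columns omitted for brevity)."""
    id: int
    total_requests: int = 0


def usage_collection_job(store: dict, table: dict) -> None:
    """Run one collection pass; `store` is a dict standing in for Redis."""
    # Snapshot the counter keys first, because the loop adds new keys to `store`.
    counter_keys = [k for k in store if k.startswith("current_count:")]
    for key in counter_keys:
        user_id = int(key.split(":", 1)[1])
        pending = store[key]                       # step 1: collect current_count
        row = table.setdefault(user_id, MonthlyUsage(id=user_id))
        row.total_requests += pending              # step 2: add to total_requests
        store[key] = 0                             # step 3: reset current_count
        store[f"current_total:{user_id}"] = row.total_requests  # step 4: publish total


# Hypothetical data: user 42 had 100 requests recorded and 7 pending in Redis.
store = {"current_count:42": 7}
table = {42: MonthlyUsage(id=42, total_requests=100)}
usage_collection_job(store, table)
print(table[42].total_requests, store["current_count:42"], store["current_total:42"])
# prints: 107 0 107
```

Note that the loop visits every user on every run, which is why the run time grows with the user count.
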
Whenever a new request is received, a `UsageFilter` reads `current_total` from Redis and checks whether the user has exceeded their quota. If so, the filter rejects the request. At the end of each month (or when the subscription renews), each user's usage in `monthly_usage` is reset to zero.
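
The check performed by the filter looks roughly like the sketch below. Again, a dict stands in for Redis, and the `quotas` mapping, the `allow` method name, and the key layout are hypothetical. I am also assuming here that a request is rejected once the stored total has reached the quota.

```python
class UsageFilter:
    """Rejects requests from users whose cached total has reached their quota."""

    def __init__(self, store: dict, quotas: dict):
        self.store = store    # dict standing in for Redis
        self.quotas = quotas  # hypothetical mapping: user id -> monthly quota

    def allow(self, user_id: int) -> bool:
        # current_total is only refreshed by the collection job,
        # so this value can be up to one job interval out of date.
        total = self.store.get(f"current_total:{user_id}", 0)
        return total < self.quotas[user_id]


usage_filter = UsageFilter(
    store={"current_total:1": 999, "current_total:2": 1000},
    quotas={1: 1000, 2: 1000},
)
print(usage_filter.allow(1))  # prints: True
print(usage_filter.allow(2))  # prints: False
```
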
This system has been working well, but as the number of users grows (over 10,000 in the last few months alone), I'm facing the following issues:
- The `usage_collection_job` is taking longer and longer to complete, which is affecting API performance.
- Since this job runs for all users at the same time, it causes high CPU usage and spikes in Redis usage.
- Since `current_total` is updated only every 10 minutes, a user can exceed their quota and still send thousands of requests before the usage limit is enforced.
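
To put a rough number on the last issue: if a user crosses their quota right after a job run, they can keep sending requests until the next run refreshes `current_total`. A back-of-the-envelope estimate, assuming a steady request rate and ignoring the time the job itself takes (the function name and the example rate are made up for illustration):

```python
def worst_case_overshoot(requests_per_second: float, interval_seconds: int = 600) -> int:
    """Requests a user can send past their quota before the next refresh."""
    return int(requests_per_second * interval_seconds)


# A user sending a steady 5 requests per second with the 10-minute interval:
print(worst_case_overshoot(5))  # prints: 3000
```
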
I'm looking for suggestions to improve the current system and overcome these limitations. Specifically, I'm interested in:
- Ways to optimize the `usage_collection_job` to improve performance.
- Alternatives to the current system that can handle a growing number of users and requests more efficiently.
- Ways to enforce the usage limit more frequently or in real-time to prevent users from exceeding their quota.
Any advice or suggestions would be greatly appreciated. Thank you!