
I'm designing a "polite" web crawler using Airflow with the Celery Executor, PostgreSQL for both Airflow metadata and the crawled content, and Redis as the Celery broker. My goal is to enforce politeness by ensuring that I only have one active request per domain at any given time, with a configurable delay between requests to that same domain.

Current Architecture

My current setup works well for parallel, non-polite crawling:

  1. A main DAG identifies domains to be crawled.
  2. For each domain, it triggers a dedicated sub-DAG (TriggerDagRunOperator).
  3. Each sub-DAG is configured with max_active_tasks = 1 for that specific domain.
  4. Celery workers execute the download tasks, saving content to a PostgreSQL database.
  5. Downstream tasks in the DAG then process this data once the download phase is complete.

The Politeness Problem

This design doesn't effectively control server load. Setting max_active_tasks = 1 limits concurrency within a domain's sub-DAG, but it doesn't enforce a specific delay between consecutive requests (e.g., one request to example.com every 5 seconds). My ideal system would manage a queue of URLs, releasing them for crawling based on domain-specific rules, without having workers sit idle waiting for a lock.

What I've Tried

I've explored two main approaches, each with significant drawbacks:

  1. External Crawler Service with a Deferred Operator:
  • Idea: The Airflow DAG would trigger a separate, standalone crawler process (potentially another Celery app) and use a deferred operator (BaseTrigger) to wait for it to finish.
  • Problem: This created architectural confusion. Should I use the existing Airflow Celery workers? When I tried, I ran into issues registering the new crawler tasks with Airflow's Celery app. Spinning up entirely new workers felt clumsy, as I'd have to manage their lifecycle (starting/stopping them) manually from the DAG.
  2. Custom Operator with Redis Lock and Deferral:
  • Idea: I created a custom operator that attempts to acquire a domain-specific lock in Redis. If the lock is taken, the operator defers itself using a custom trigger. The trigger's job is to periodically check Redis until the lock is released.
  • Problem: This was the more promising approach, but I hit a wall with Airflow's tooling. The standard RedisHook does not have async support, which is a requirement for writing an efficient, non-blocking trigger. While I could technically make it work, it feels like I'm fighting the framework and creating a fragile solution.

My Question

Given the limitations I've encountered, what is a recommended architectural pattern or best practice for implementing this kind of per-domain, asynchronous rate-limiting within the Airflow ecosystem?

Specifically:

  • Is there a better way to manage domain-specific locks and delays that integrates smoothly with Airflow's deferrable operators?
  • How can I achieve this without creating a system where worker slots are tied up by tasks that are simply waiting for a timer or a lock?

2 Answers


This design question is generic across languages and technologies; the fact that you're using specific python libs isn't relevant.

A work table contains (domain, url) pairs, and at any instant only a single crawler machine may work on the entries of a given domain. Create a current-status crawl table with domain as its primary key. Assume it has something in the ballpark of a million rows.

The work is performed by N crawler machines or processes, the "workers".

naïve central approach

You're already using a scalable RDBMS: postgres. Relying on ACID guarantees, let workers race to COMMIT an UPDATE of the crawl.crawled_at timestamp column. A worker who successfully updates it with a fresh stamp has permission to pull any of that domain's work entries and crawl them.
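The claim-by-UPDATE race can be sketched in a few lines. SQLite stands in for Postgres here purely so the example is self-contained; the table and column names (`crawl`, `crawled_at`) follow the text, and the 5-second minimum delay is an illustrative choice:

```python
import sqlite3
import time

# Self-contained sketch of the claim-by-UPDATE pattern; SQLite stands in
# for Postgres. Exactly one racing worker sees rowcount == 1 and wins.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crawl (domain TEXT PRIMARY KEY, crawled_at REAL)")
conn.execute("INSERT INTO crawl VALUES ('example.com', 0)")
conn.commit()

def try_claim(conn, domain: str, min_delay: float = 5.0) -> bool:
    """Atomically stamp crawled_at; winning the UPDATE grants permission."""
    now = time.time()
    cur = conn.execute(
        "UPDATE crawl SET crawled_at = ? WHERE domain = ? AND crawled_at <= ?",
        (now, domain, now - min_delay),
    )
    conn.commit()
    return cur.rowcount == 1

claimed = try_claim(conn, "example.com")   # stale stamp: claim succeeds
again = try_claim(conn, "example.com")     # within the delay window: fails
```

The `crawled_at <= now - min_delay` predicate doubles as the politeness delay: a freshly stamped domain cannot be re-claimed until the delay has elapsed.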

Downside is that we're leaning pretty heavily on database scaling.

sharded approach

We don't need a central database to help us create an N-way partition of the set of domains. A hash function could do that as easily, with no communication overhead.

Add a hash column to the crawl table. Compute SHA3(domain) or some other convenient message digest. Store a prefix of that hash, perhaps the first 64 bits. Arrange for workers to elect a leader, so they know the set of current workers. Crucially, they know N, the size of the set of workers. Each worker has a name, from 0 .. N-1.

A worker may work on a domain's entries iff the worker's name matches the domain's hash modulo N.
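A minimal sketch of that assignment rule, using the 64-bit SHA3 prefix described above (function names are illustrative):

```python
import hashlib

def domain_hash_prefix(domain: str) -> int:
    """First 64 bits of SHA3-256(domain), as stored in the crawl table."""
    digest = hashlib.sha3_256(domain.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def owner_of(domain: str, n_workers: int) -> int:
    """The worker name (0 .. N-1) responsible for this domain."""
    return domain_hash_prefix(domain) % n_workers

# A worker named w may work on domain d iff owner_of(d, N) == w.
```

Because the hash is stored, reassignment after N changes is a pure recomputation with no coordination beyond learning the new N.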

From time to time workers come and go, so workers will eventually need to learn a new name and a new N. If this happens frequently, it may be convenient to identify workers with a (community, id) tuple, so we shard first on the assigned community (say, East Coast datacenter vs. West Coast datacenter) and then on a numeric ID within that community. The idea is that workers frequently come and go, but communities seldom do, so re-sharding for a given worker transition only affects a single community. If there are four communities, the million domains form four shards of 250 K domains each, and each community further partitions its 250 K domains so that a single worker is responsible for any given domain.

Leader elections can be handled by Raft or Paxos, but they happen seldom enough that you can view it as just a DB problem or a Redis/Valkey problem.


Using either setup, the crawled documents can be put in a local filestore or database, so there are few scaling challenges there. Combine the results into one unified view at your leisure.


randomize across domains

The OP problem setup seems to suggest that a worker will focus on one domain at a time, fetching URLs sequentially from it, and then move on to the next domain. We might sleep() for a moment before each GET, to obey a configured rate limit.

Consider storing a prefix of a URL hash in the work table, indexing on it, and doing SELECT ... ORDER BY hash. This effectively shuffles the deck, so URLs appear in an arbitrary (random?) order, and crucially the domains are similarly randomized. If a single domain like "reddit.com" accounts for half the work entries, then this doesn't help much. But if no domain is especially dominant, then after crawling domain d there's a fair chance you will do work for several unrelated domains before returning to d. And those unrelated domains will consume enough time that the rate limit won't have need of a sleep() call. Use a local in-memory data structure like a priority queue to double-check this and to occasionally sleep(), if desired.
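The shuffle-by-hash idea can be sketched like so; the domain and URL names are made up, and `sorted(...)` plays the role of SELECT ... ORDER BY hash:

```python
import hashlib

def url_hash_prefix(url: str) -> int:
    """64-bit prefix of SHA3-256(url), the stored/indexed hash column."""
    return int.from_bytes(hashlib.sha3_256(url.encode("utf-8")).digest()[:8], "big")

# Hypothetical work entries: (domain, url) pairs from two domains.
work = [("example.com", f"https://example.com/page{i}") for i in range(5)]
work += [("other.org", f"https://other.org/item{i}") for i in range(5)]

# Equivalent of SELECT domain, url FROM work ORDER BY hash:
shuffled = sorted(work, key=lambda entry: url_hash_prefix(entry[1]))
```

Since the hash bears no relation to the domain, sorting by it tends to interleave domains, which is exactly what spaces out repeat visits to the same server.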


I don’t know much about Celery and Airflow, but the problem could be solved this way.

You can run multiple crawler servers that make requests to the target websites. Each crawler has a per-domain rate limiter, which controls the request rate to each domain. But there are challenges associated with this design.

  1. You'll have to ensure that all requests for a particular domain go to the same crawler; otherwise you cannot control the request rate.

  2. The crawler needs to buffer a domain's URLs in memory while waiting out the 5-second delay. If the crawler restarts, those buffered jobs would otherwise be skipped entirely. To avoid skipping them, record metadata about each URL in the DB/Redis, and after a restart fetch the unprocessed jobs again from the DB.


Let me know if this solves your problem, or if you see any issue with this approach!
