I have a technical question before I start writing a YouTube web scraping program with Scrapy/Python. I am aware of the various anti-scraping systems used on the web, but one of them poses a problem for me.
My script's predicted scraping rate will be about 189 data sets (pages) per second, roughly 208 MB/s. I want to avoid IP bans, blacklisting, etc. as much as possible.
I have a NordVPN subscription with more than 5,000 VPN servers at my disposal, plus the NordVPN CLI combined with the openpyn library for finer control. I know it is easy to use free proxies (none of the data going through the proxies would be sensitive, so the lack of encryption does not matter to me), but their speed seems too low, and paid proxies are too expensive (at least €180/month). I understand that I could use my VPN servers as proxies, but I'm afraid my very high page-acquisition rate would force me to switch VPN servers every second, and switching from one VPN server to another takes at least about 5 seconds (so the server would sit idle 5/6 of the time). Is there a more efficient switching method, or a paid VPN that switches faster?
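To make the proxy idea concrete, here is a rough sketch of the per-request proxy rotation I had in mind in Scrapy. The proxy endpoints, the PROXY_LIST setting, and the middleware name are placeholders I made up, not a working setup:

```python
# middlewares.py -- rough sketch of per-request proxy rotation
import itertools


class RotatingProxyMiddleware:
    """Assign a different proxy endpoint to each outgoing request."""

    def __init__(self, proxies):
        # Cycle through the pool so no single exit IP carries all the traffic.
        self._pool = itertools.cycle(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a custom setting I would define in settings.py.
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware picks this meta key up.
        request.meta["proxy"] = next(self._pool)


# settings.py -- placeholder values
# PROXY_LIST = [
#     "http://10.0.0.1:3128",   # hypothetical proxy endpoints
#     "http://10.0.0.2:3128",
# ]
# DOWNLOADER_MIDDLEWARES = {
#     "myproject.middlewares.RotatingProxyMiddleware": 610,
# }
```

The question is really about where the exit IPs behind such a pool come from (VPN servers, rented IPs, paid proxies), not about the middleware itself.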
I also thought about renting IPs (for example, 8 IPs for $7.80/month) and allocating them to my two bare-metal servers (ionos.com) in a rapid, rotating manner. Is that possible/allowed (maybe on a Cloud Server XS)?
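If the host does let me attach several public IPs to one server, my understanding is that Scrapy's `bindaddress` request meta key could rotate the outgoing source IP without any proxy at all. A sketch under that assumption (the addresses are made up, and the exact value format should be checked against the Scrapy/Twisted version in use):

```python
# Sketch: rotate the outgoing source IP across addresses bound to this server.
# Assumes the listed IPs are actually assigned to the machine's network interface.
import itertools

LOCAL_IPS = ["203.0.113.10", "203.0.113.11", "203.0.113.12"]  # hypothetical rented IPs


class BindAddressMiddleware:
    def __init__(self):
        self._pool = itertools.cycle(LOCAL_IPS)

    def process_request(self, request, spider):
        # "bindaddress" is a documented Request.meta key; I'm assuming the
        # (ip, 0) tuple form here, letting the OS pick the source port.
        request.meta["bindaddress"] = (next(self._pool), 0)
```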
What do you think? What would be the best method to sustain the highest possible scraping rate?
PS: The YouTube Data API v3 is not an option because of its 10,000-units-per-day quota, while I will need tens of millions of requests per day. I am willing to pay for VPNs, IPs, or proxies, so a (reasonably priced) paid solution interests me.
Sincerely, Kyusuke