
I have a technical question in anticipation of writing a YouTube web-scraping program with Scrapy / Python. I am well aware of the different anti-scraping systems used on the web, but one of them poses a problem for me.

The predicted scraping frequency of my script is 189 data sets (pages) per second, or around 208 MB/s. I want to have as few IP bans, blacklistings, etc. as possible.

I have a NordVPN subscription with more than 5000 VPN servers at my disposal, plus the NordVPN CLI combined with the Openpyn library for finer control. I know it is easy to use free proxies (none of the data going through the proxies will be sensitive, so the lack of encryption does not matter to me), but their speed seems too low, and paid proxies are too expensive (minimum €180/month). From what I understand, I could use my VPN servers as proxies, but I am afraid my very high rate of fetching YouTube pages would force me to change VPN every second, and that switching from one VPN server to another takes at least 5 seconds (so the server would sit unused 5/6 of the time). Is there a more optimal switching method, or a paid VPN with faster switching?

I also thought about renting IPs (for example, 8 IPs for $7.80/month) and allocating them to my two bare-metal servers (ionos.com) in a rapid, rotating manner. But is that possible / allowed (maybe on a Cloud Server XS)?
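If it is allowed, here is roughly what I have in mind on the Scrapy side, assuming each rented IP ends up fronted by a lightweight HTTP proxy on the servers (the addresses, ports and module path below are placeholders):

# Rough sketch only: round-robin a pool of proxies (one per rented IP)
# across outgoing Scrapy requests. Endpoints are placeholders.
import itertools

class RotatingProxyMiddleware:
    PROXIES = [
        "http://203.0.113.10:3128",
        "http://203.0.113.11:3128",
        "http://203.0.113.12:3128",
    ]

    def __init__(self):
        self._pool = itertools.cycle(self.PROXIES)

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware honours this meta key.
        request.meta["proxy"] = next(self._pool)

# settings.py (placeholder project path)
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RotatingProxyMiddleware": 350,
}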

What do you think? What would be the optimal method to maintain the highest possible scraping frequency?

PS: The YouTube Data API v3 is not an option because of its quota of 10,000 units per day, while I will need tens of millions of requests per day. I am ready to pay for a VPN, IPs, or proxies, so a reasonably priced paid solution interests me.

Sincerely, Kyusuke

Comment: I’m voting to close this question because it is unrelated to software development. (Jul 19, 2021 at 16:41)

1 Answer


For everything related to Google/YouTube I am using Bright Data. The good part is that they have around 72 million proxy IPs, and there are many similar services out there. Instead of using VPN services, you can easily integrate a proxy service with the Scrapy framework: you can either keep the same IP until the end of the process or change it on each request.
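For example, a minimal Scrapy sketch could look like this (the endpoint and credentials are placeholders, the same dummy values as in the curl line below; adjust them to whatever your provider gives you):

# Minimal sketch: route every Scrapy request through the proxy service via
# the "proxy" meta key. With the credentials embedded in the URL, Scrapy's
# built-in HttpProxyMiddleware adds the Proxy-Authorization header for you.
import scrapy

PROXY = ("http://lum-customer-xxxxxxxxxx-zone-residential:xxxxxxxxx"
         "@zproxy.lum-superproxy.io:22225")

class VideoSpider(scrapy.Spider):
    name = "videos"                            # placeholder spider name
    start_urls = ["https://www.youtube.com/"]  # placeholder start URL

    def start_requests(self):
        for url in self.start_urls:
            # Keep this meta key on every request to stay behind the proxy.
            # Whether the exit IP stays fixed or rotates per request is
            # configured on the provider side (sticky vs. rotating sessions).
            yield scrapy.Request(url, meta={"proxy": PROXY})

    def parse(self, response):
        # Placeholder parse: just log the page title.
        self.logger.info(response.css("title::text").get())

The same meta key can be set from a downloader middleware instead if you prefer to keep the spiders proxy-agnostic.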

I am just adding a single-line shell command to show how easy it is to integrate with a proxy:

curl --proxy zproxy.lum-superproxy.io:22225 --proxy-user lum-customer-xxxxxxxxxx-zone-residential:xxxxxxxxx "http://lumtest.com/myip.json" 

NB: I am not specifically recommending Bright Data's proxies; you can use any proxy service, just not a VPN service.
