An asynchronous crawler proxy pool built on Python asyncio, designed to take full advantage of Python's async performance.
The project uses sanic, an asynchronous web framework (a Flask version is also provided), so Python 3.5+ is recommended. sanic does not support Windows, so Windows users (like me 😄) may want to use Ubuntu on Windows.
The project uses Redis as its database. Redis is an open-source (BSD licensed), in-memory data structure store that can be used as a database, cache, and message broker. Make sure Redis is correctly installed in your environment; see the official installation guide.
```
$ git clone https://github.com/chenjiandongx/async-proxy-pool.git
```

Install the dependencies with requirements.txt:

```
$ pip install -r requirements.txt
```

The configuration file config.py holds all of the project's configuration options, shown below. Change them as needed; otherwise the defaults are fine.
```python
#!/usr/bin/env python
# coding=utf-8

# Request timeout (seconds)
REQUEST_TIMEOUT = 15
# Request delay (seconds)
REQUEST_DELAY = 0

# redis host
REDIS_HOST = "localhost"
# redis port
REDIS_PORT = 6379
# redis password
REDIS_PASSWORD = None
# redis set key
REDIS_KEY = "proxies:ranking"
# maximum number of connections in the redis connection pool
REDIS_MAX_CONNECTION = 20

# maximum REDIS SCORE
MAX_SCORE = 10
# minimum REDIS SCORE
MIN_SCORE = 0
# initial REDIS SCORE
INIT_SCORE = 9

# server web host
SERVER_HOST = "localhost"
# server web port
SERVER_PORT = 3289
# enable access logging
SERVER_ACCESS_LOG = True

# batch validation size
VALIDATOR_BATCH_COUNT = 256
# validator test site; point it at the site you actually want to crawl, e.g. Sina or Zhihu
VALIDATOR_BASE_URL = "https://httpbin.org/"
# validator run cycle (minutes)
VALIDATOR_RUN_CYCLE = 15
# crawler run cycle (minutes)
CRAWLER_RUN_CYCLE = 30

# request headers
HEADERS = {
    "X-Requested-With": "XMLHttpRequest",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36",
}
```

Run the client to start the crawler and the validator:
```
# The validator test site can be set via an environment variable
# set/export VALIDATOR_BASE_URL="https://example.com"

$ python client.py

2018-05-16 23:41:39,234 - Crawler working...
2018-05-16 23:41:40,509 - Crawler √ http://202.83.123.33:3128
2018-05-16 23:41:40,509 - Crawler √ http://123.53.118.122:61234
2018-05-16 23:41:40,510 - Crawler √ http://212.237.63.84:8888
2018-05-16 23:41:40,510 - Crawler √ http://36.73.102.245:8080
2018-05-16 23:41:40,511 - Crawler √ http://78.137.90.253:8080
2018-05-16 23:41:40,512 - Crawler √ http://5.45.70.39:1490
2018-05-16 23:41:40,512 - Crawler √ http://117.102.97.162:8080
2018-05-16 23:41:40,513 - Crawler √ http://109.185.149.65:8080
2018-05-16 23:41:40,513 - Crawler √ http://189.39.143.172:20183
2018-05-16 23:41:40,514 - Crawler √ http://186.225.112.62:20183
2018-05-16 23:41:40,514 - Crawler √ http://189.126.66.154:20183
...
2018-05-16 23:41:55,866 - Validator working...
2018-05-16 23:41:56,951 - Validator × https://114.113.126.82:80
2018-05-16 23:41:56,953 - Validator × https://114.199.125.242:80
2018-05-16 23:41:56,955 - Validator × https://114.228.75.17:6666
2018-05-16 23:41:56,957 - Validator × https://115.227.3.86:9000
2018-05-16 23:41:56,960 - Validator × https://115.229.88.191:9000
2018-05-16 23:41:56,964 - Validator × https://115.229.89.100:9000
2018-05-16 23:41:56,966 - Validator × https://103.18.180.194:8080
2018-05-16 23:41:56,967 - Validator × https://115.229.90.207:9000
2018-05-16 23:41:56,968 - Validator × https://103.216.144.17:8080
2018-05-16 23:41:56,969 - Validator × https://117.65.43.29:31588
2018-05-16 23:41:56,971 - Validator × https://103.248.232.135:8080
2018-05-16 23:41:56,972 - Validator × https://117.94.69.166:61234
2018-05-16 23:41:56,975 - Validator × https://103.26.56.109:8080
...
```

Run the server to start the web service:
```
$ python server_sanic.py

[2018-05-16 23:36:22 +0800] [108] [INFO] Goin' Fast @ http://localhost:3289
[2018-05-16 23:36:22 +0800] [108] [INFO] Starting worker [108]
```

```
$ python server_flask.py

 * Serving Flask app "async_proxy_pool.webapi_flask" (lazy loading)
 * Environment: production
   WARNING: Do not use the development server in a production environment.
   Use a production WSGI server instead.
 * Debug mode: on
 * Restarting with stat
 * Debugger is active!
 * Debugger PIN: 322-954-449
 * Running on http://localhost:3289/ (Press CTRL+C to quit)
```

The project's main modules are the crawler module, the storage module, the validator module, the scheduler module, and the API module:
- Crawler module: crawls proxy sites and stores the proxies it finds in the database; every proxy starts with a weight of INIT_SCORE.
- Storage module: wraps the Redis operations used by the project and provides a Redis connection pool.
- Validator module: checks whether a proxy IP is usable; a working proxy has its weight increased by 1 (up to MAX_SCORE), a failing one has it decreased by 1, and once the weight drops to 0 the proxy is removed from the database (a sketch of this scoring scheme follows the list).
- Scheduler module: schedules the crawler and validator runs.
- API module: exposes a web API using sanic.
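To make the scoring scheme concrete, here is a minimal sketch of how the validator's weight updates could be expressed on top of a Redis sorted set. It is illustrative only: the function name `adjust_score` and the connection setup are assumptions, not the project's actual code; only `REDIS_KEY`, `MAX_SCORE`, and `MIN_SCORE` come from config.py.

```python
# Illustrative sketch only -- not the project's actual implementation.
# Shows the weight rules described above on a Redis sorted set.
import redis

REDIS_KEY = "proxies:ranking"  # from config.py
MAX_SCORE = 10                 # from config.py
MIN_SCORE = 0                  # from config.py

redis_client = redis.Redis(host="localhost", port=6379)  # assumed connection


def adjust_score(proxy: str, usable: bool) -> None:
    """+1 (capped at MAX_SCORE) when the proxy works, -1 when it fails,
    and remove the proxy once its weight drops to MIN_SCORE."""
    score = redis_client.zscore(REDIS_KEY, proxy)
    if score is None:
        return
    if usable:
        redis_client.zadd(REDIS_KEY, {proxy: min(score + 1, MAX_SCORE)})
    elif score - 1 <= MIN_SCORE:
        redis_client.zrem(REDIS_KEY, proxy)
    else:
        redis_client.zadd(REDIS_KEY, {proxy: score - 1})
```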
/
Welcome page.
```
$ http http://localhost:3289/
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 42
Content-Type: application/json
Keep-Alive: 5

{
    "Welcome": "This is a proxy pool system."
}
```

/pop
Returns a random proxy, trying up to three times (a sketch of this selection logic follows the list):
- First, try to return a proxy with weight MAX_SCORE, i.e. the most recently verified usable proxies.
- Next, try to return a proxy whose weight falls between (MAX_SCORE - 3) and MAX_SCORE.
- Finally, try to return a proxy with weight between 0 and MAX_SCORE.
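A minimal sketch of that three-step fallback, assuming a Redis sorted set keyed by REDIS_KEY; the function name `pop_proxy` and its arguments are illustrative, not the project's actual API.

```python
# Illustrative sketch of the /pop fallback order, not the project's code.
import random


def pop_proxy(redis_client, key="proxies:ranking", max_score=10):
    # 1. Prefer proxies currently holding the maximum weight.
    proxies = redis_client.zrangebyscore(key, max_score, max_score)
    # 2. Otherwise accept weights between (max_score - 3) and max_score.
    if not proxies:
        proxies = redis_client.zrangebyscore(key, max_score - 3, max_score)
    # 3. Finally accept any weight between 0 and max_score.
    if not proxies:
        proxies = redis_client.zrangebyscore(key, 0, max_score)
    return random.choice(proxies) if proxies else None
```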
```
$ http http://localhost:3289/pop
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 38
Content-Type: application/json
Keep-Alive: 5

{
    "http": "http://46.48.105.235:8080"
}
```

/get/<count:int>
Returns the specified number of proxies, sorted by weight from highest to lowest.
```
$ http http://localhost:3289/get/10
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 393
Content-Type: application/json
Keep-Alive: 5

[
    {
        "http": "http://94.177.214.215:3128"
    },
    {
        "http": "http://94.139.242.70:53281"
    },
    {
        "http": "http://94.130.92.40:3128"
    },
    {
        "http": "http://82.78.28.139:8080"
    },
    {
        "http": "http://82.222.153.227:9090"
    },
    {
        "http": "http://80.211.228.238:8888"
    },
    {
        "http": "http://80.211.180.224:3128"
    },
    {
        "http": "http://79.101.98.2:53281"
    },
    {
        "http": "http://66.96.233.182:8080"
    },
    {
        "http": "http://61.228.45.165:8080"
    }
]
```

/count
Returns the total number of proxies in the pool.
```
$ http http://localhost:3289/count
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 15
Content-Type: application/json
Keep-Alive: 5

{
    "count": "698"
}
```

/count/<score:int>
Returns the number of proxies with the specified weight.
```
$ http http://localhost:3289/count/10
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 15
Content-Type: application/json
Keep-Alive: 5

{
    "count": "143"
}
```

/clear/<score:int>
Deletes all proxies with weight less than or equal to score.
```
$ http http://localhost:3289/clear/0
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 22
Content-Type: application/json
Keep-Alive: 5

{
    "Clear": "Successful"
}
```

To extend the crawler, add your own crawl method to crawler.py:
```python
class Crawler:

    @staticmethod
    def run():
        ...

    # Add your own crawl method
    @staticmethod
    @collect_funcs  # decorator that registers the function so it gets run
    def crawl_xxx():
        # crawling logic
```

This project uses Sanic, but developers are free to choose another web framework; the web module is completely independent, so swapping the framework does not break the rest of the project and takes only a few steps.
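As a concrete illustration of the crawl_xxx hook above, here is a hedged sketch of what a new crawl method might look like inside crawler.py (where collect_funcs is already defined). The proxy-list URL and the regex are placeholders, and how the returned proxies are stored should follow the project's existing crawl methods.

```python
# Hypothetical example only: a new crawl method added to the Crawler class
# in crawler.py. The proxy-list URL and regex are placeholders; collect_funcs
# is the project's decorator shown in the skeleton above.
import re

import requests


class Crawler:

    @staticmethod
    @collect_funcs  # register the method so run() will execute it
    def crawl_example():
        url = "https://example.com/free-proxy-list"  # placeholder source
        html = requests.get(url, timeout=15).text
        # Extract "ip:port" pairs; adapt the pattern to the real page markup.
        return [
            "http://{}:{}".format(ip, port)
            for ip, port in re.findall(r"(\d+\.\d+\.\d+\.\d+):(\d+)", html)
        ]
```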
Server load testing is done with wrk: a 30-second benchmark using 12 threads and 400 concurrent HTTP connections.
```
$ wrk -t12 -c400 -d30s http://127.0.0.1:3289/pop
Running 30s test @ http://127.0.0.1:3289/pop
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   350.37ms  118.99ms 660.41ms   60.94%
    Req/Sec     98.18     35.94   277.00     79.43%
  33694 requests in 30.10s, 4.77MB read
  Socket errors: connect 0, read 340, write 0, timeout 0
Requests/sec:   1119.44
Transfer/sec:    162.23KB
```

Testing http://127.0.0.1:3289/get/10:
```
$ wrk -t12 -c400 -d30s http://127.0.0.1:3289/get/10
Running 30s test @ http://127.0.0.1:3289/get/10
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   254.90ms   95.43ms 615.14ms   63.51%
    Req/Sec    144.84     61.52   320.00     66.58%
  46538 requests in 30.10s, 22.37MB read
  Socket errors: connect 0, read 28, write 0, timeout 0
Requests/sec:   1546.20
Transfer/sec:    761.02KB
```

Performance is decent. Next, test http://127.0.0.1:3289/, which involves no Redis operations:
```
$ wrk -t12 -c400 -d30s http://127.0.0.1:3289/
Running 30s test @ http://127.0.0.1:3289/
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   127.86ms   41.71ms 260.69ms   55.22%
    Req/Sec    258.56     92.25   520.00     68.90%
  92766 requests in 30.10s, 13.45MB read
Requests/sec:   3081.87
Transfer/sec:    457.47KB
```

⭐️ Requests/sec: 3081.87
With sanic access logging disabled (SERVER_ACCESS_LOG = False in config.py), test http://127.0.0.1:3289/ again:
```
$ wrk -t12 -c400 -d30s http://127.0.0.1:3289/
Running 30s test @ http://127.0.0.1:3289/
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    34.63ms   12.66ms  96.28ms   58.07%
    Req/Sec     0.96k   137.29     2.21k    73.29%
  342764 requests in 30.10s, 49.69MB read
Requests/sec:  11387.89
Transfer/sec:      1.65MB
```

⭐️ Requests/sec: 11387.89
test_proxy.py tests the real-world performance of the proxies:
```
$ cd test
$ python test_proxy.py

# Configurable environment variables
TEST_COUNT = os.environ.get("TEST_COUNT") or 1000
TEST_WEBSITE = os.environ.get("TEST_WEBSITE") or "https://httpbin.org/"
TEST_PROXIES = os.environ.get("TEST_PROXIES") or "http://localhost:3289/get/20"
```

Results:

```
Proxies tested: http://localhost:3289/get/20
Test site:      https://httpbin.org/
Requests:       1000
Successes:      1000
Failures:       0
Success rate:   1.0

Proxies tested: http://localhost:3289/get/20
Test site:      https://taobao.com/
Requests:       1000
Successes:      984
Failures:       16
Success rate:   0.984

Proxies tested: http://localhost:3289/get/20
Test site:      https://baidu.com
Requests:       1000
Successes:      975
Failures:       25
Success rate:   0.975

Proxies tested: http://localhost:3289/get/20
Test site:      https://zhihu.com
Requests:       1000
Successes:      1000
Failures:       0
Success rate:   1.0
```

As the numbers show, performance is excellent and the success rate is very high. 😉
Using the proxy pool in your own code:

```python
import random

import requests

# Make sure the sanic server is already running

# Fetch several proxies and pick one at random
try:
    proxies = requests.get("http://localhost:3289/get/20").json()
    req = requests.get("https://example.com", proxies=random.choice(proxies))
except:
    raise

# Or pop a single proxy
try:
    proxy = requests.get("http://localhost:3289/pop").json()
    req = requests.get("https://example.com", proxies=proxy)
except:
    raise
```

The whole project is built on the asynchronous networking library aiohttp, and aiohttp's documentation introduces proxy support like this:
Key point: "aiohttp supports HTTP/HTTPS proxies"
However, it does not actually support HTTPS proxies at all. Its own code says:
Key point: "Only http proxies are supported"
My feelings about that were, let's say, complicated. 😲 That said, having only HTTP proxies works well enough and has little practical impact; see the test data above.
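For reference, this is roughly how a proxy taken from the pool can be used with aiohttp; the target URL is a placeholder, and the proxy address comes from the /pop endpoint shown earlier.

```python
# Minimal aiohttp example using an HTTP proxy fetched from the pool.
# The target URL is a placeholder; requests is only used to query the pool.
import asyncio

import aiohttp
import requests


async def fetch(url, proxy):
    async with aiohttp.ClientSession() as session:
        # aiohttp only supports plain HTTP proxies, as noted above.
        async with session.get(url, proxy=proxy) as resp:
            return await resp.text()


proxy = requests.get("http://localhost:3289/pop").json()["http"]
loop = asyncio.get_event_loop()
print(loop.run_until_complete(fetch("http://httpbin.org/ip", proxy)))
```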
✨🍰✨
MIT © chenjiandongx

