crawl websites and extract structured data from their pages. It can be used for
a wide range of purposes, from data mining to monitoring and automated testing.
2026-03-13 12:13:57 by Adam Ciarcinski | Files touched by this commit (2)
Log message: py-scrapy: updated to 2.14.2

Scrapy 2.14.2 (2026-03-12)

Security bug fixes

- Values from the ``Referrer-Policy`` header of HTTP responses are no longer
  executed as Python callables. See the `cwxj-rr6w-m6w7`_ security advisory
  for details.

  .. _cwxj-rr6w-m6w7: https://github.com/scrapy/scrapy/security/advisories/GHSA-cwxj-rr6w-m6w7

- In line with the `standard <https://fetch.spec.whatwg.org/#http-redirect-fetch>`__,
  301 redirects of ``POST`` requests are converted into ``GET`` requests.
  Converting to a ``GET`` request implies not only a method change, but also
  omitting the body and ``Content-*`` headers in the redirect request. On
  cross-origin redirects (for example, cross-domain redirects), this is
  effectively a security bug fix for scenarios where the body contains secrets.

Deprecations

- Passing a response URL string as the first positional argument to
  :meth:`scrapy.spidermiddlewares.referer.RefererMiddleware.policy` is
  deprecated. Pass a :class:`~scrapy.http.Response` instead. The parameter has
  also been renamed to ``response`` to reflect this change. The old parameter
  name (``resp_or_url``) is deprecated.

New features

- Added a new setting, :setting:`REFERER_POLICIES`, to allow customizing
  supported referrer policies.

Bug fixes

- Made additional redirect scenarios convert to ``GET`` in line with the
  `standard <https://fetch.spec.whatwg.org/#http-redirect-fetch>`__:

  - Only ``POST`` 302 redirects are converted into ``GET`` requests; other
    methods are preserved.
  - ``HEAD`` 303 redirects are not converted into ``GET`` requests.
  - ``GET`` 303 redirects do not have their body or standard ``Content-*``
    headers removed.

- Redirects where the original request body is dropped now also have their
  ``Content-Encoding``, ``Content-Language`` and ``Content-Location`` headers
  removed, in addition to the ``Content-Type`` and ``Content-Length`` headers
  that were already being removed.
- Redirects now preserve the source URL fragment if the redirect URL does not
  include one. This is useful when using browser-based download handlers, such
  as `scrapy-playwright`_ or `scrapy-zyte-api`_, while letting Scrapy handle
  redirects.

  .. _scrapy-playwright: https://github.com/scrapy-plugins/scrapy-playwright
  .. _scrapy-zyte-api: https://scrapy-zyte-api.readthedocs.io/en/latest/

- The ``Referer`` header is now removed on redirect if
  :class:`~scrapy.spidermiddlewares.referer.RefererMiddleware` is disabled.
- The handling of the ``Referer`` header on redirects now takes into account
  the ``Referrer-Policy`` header of the response that triggers the redirect.
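The redirect behavior described above can be illustrated with a minimal, self-contained sketch. This is not Scrapy's actual implementation (the real logic lives in its redirect middleware); the function and variable names here are hypothetical, and it only models the standard-compliant rules named in the changelog: POST converted to GET on 301/302, 303 converting everything except GET and HEAD, ``Content-*`` headers dropped alongside the body, and the source URL fragment preserved when the ``Location`` value has none.

```python
from urllib.parse import urldefrag, urljoin

# Headers dropped when the request body is dropped on redirect.
CONTENT_HEADERS = {
    "Content-Type", "Content-Length", "Content-Encoding",
    "Content-Language", "Content-Location",
}

def redirected_request(method, url, headers, body, status, location):
    """Sketch of Fetch-standard redirect handling (hypothetical helper)."""
    # POST becomes GET on 301/302; 303 converts any method except GET/HEAD.
    if (status in (301, 302) and method == "POST") or (
        status == 303 and method not in ("GET", "HEAD")
    ):
        method = "GET"
        body = b""
        headers = {k: v for k, v in headers.items() if k not in CONTENT_HEADERS}
    new_url = urljoin(url, location)
    # Preserve the source URL fragment if the redirect URL has none.
    fragment = urldefrag(url).fragment
    if fragment and "#" not in location:
        new_url = f"{new_url}#{fragment}"
    return method, new_url, headers, body
```

For example, a ``POST`` with a body redirected via 301 comes back as a body-less ``GET`` with its ``Content-*`` headers stripped and the original ``#fragment`` carried over, while a ``HEAD`` hitting a 303 keeps its method.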
2026-01-13 13:39:52 by Adam Ciarcinski | Files touched by this commit (2)
Log message: py-scrapy: updated to 2.14.1

Scrapy 2.14.1 (2026-01-12)

Deprecations

- ``scrapy.utils.defer.maybeDeferred_coro()`` is deprecated. (:issue:`7212`)

Bug fixes

- Fixed custom stats collectors that require a ``spider`` argument in their
  ``open_spider()`` and ``close_spider()`` methods not receiving the argument
  when called by the engine. Note, however, that the ``spider`` argument is now
  deprecated and will stop being passed in a future version of Scrapy.

Quality assurance

- Replaced the deprecated ``codecov/test-results-action@v1`` GitHub Action with
  ``codecov/codecov-action@v5``.
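The stats-collector fix above concerns a caller that must support both hook signatures, with and without the deprecated ``spider`` argument. A minimal sketch of that compatibility pattern, using only the standard library (the class and function names here are illustrative, not Scrapy's):

```python
import inspect

class LegacyStatsCollector:
    """Old-style collector whose hook requires a spider argument."""
    def __init__(self):
        self.stats = {}

    def open_spider(self, spider):
        self.stats["spider"] = spider

class ModernStatsCollector:
    """New-style collector without the deprecated spider argument."""
    def __init__(self):
        self.stats = {}

    def open_spider(self):
        self.stats["opened"] = True

def call_open_spider(collector, spider):
    # Inspect the hook's signature and pass the spider only when the
    # implementation asks for it, so both styles keep working.
    params = inspect.signature(collector.open_spider).parameters
    if "spider" in params:
        collector.open_spider(spider)
    else:
        collector.open_spider()
```

The signature check lets the engine-side caller keep invoking legacy collectors correctly while new collectors omit the argument entirely.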
2026-01-08 17:07:35 by Adam Ciarcinski | Files touched by this commit (3)
Log message: py-scrapy: updated to 2.14.0

Scrapy 2.14.0 (2026-01-05)

Highlights:

- More coroutine-based replacements for Deferred-based APIs
- The default priority queue is now ``DownloaderAwarePriorityQueue``
- Dropped support for Python 3.9 and PyPy 3.10
- Improved and documented the API for custom download handlers
2025-12-04 17:03:50 by Adam Ciarcinski | Files touched by this commit (2)
Log message: py-scrapy: updated to 2.13.4

Scrapy 2.13.4 (2025-11-17)

Security bug fixes

- Improved protection against decompression bombs in
  :class:`~scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware`
  for responses compressed using the ``br`` and ``deflate`` methods: if a single
  compressed chunk would be larger than the response size limit (see
  :setting:`DOWNLOAD_MAXSIZE`) when decompressed, decompression is no longer
  carried out. This is especially important for the ``br`` (Brotli) method,
  which can provide a very high compression ratio. Please see the
  `CVE-2025-6176`_ and `GHSA-2qfp-q593-8484`_ security advisories for more
  information. (:issue:`7134`)

  .. _CVE-2025-6176: https://nvd.nist.gov/vuln/detail/CVE-2025-6176
  .. _GHSA-2qfp-q593-8484: https://github.com/advisories/GHSA-2qfp-q593-8484

Modified requirements

- The minimum supported version of the optional ``brotli`` package is now
  ``1.2.0``. (:issue:`7134`)
- The ``brotlicffi`` and ``brotlipy`` packages can no longer be used to
  decompress Brotli-compressed responses. Please install the ``brotli`` package
  instead. (:issue:`7134`)

Other changes

- Restricted the maximum supported Twisted version to ``25.5.0``, as Scrapy
  currently uses some private APIs changed in later Twisted versions.
  (:issue:`7142`)
- Stopped setting the ``COVERAGE_CORE`` environment variable in tests; it had
  no effect but caused the ``coverage`` module to produce a warning or an
  error. (:issue:`7137`)
- Removed the documentation build dependency on the deprecated
  ``sphinx-hoverxref`` module. (:issue:`6786`, :issue:`6922`)
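The decompression-bomb guard described above amounts to capping how much output a single compressed chunk may produce. A minimal sketch of the idea for the ``deflate`` case, using only the standard library's ``zlib`` (the names ``DOWNLOAD_MAXSIZE`` stand-in and ``safe_deflate_decompress`` are hypothetical, not Scrapy's API):

```python
import zlib

MAX_DECOMPRESSED_SIZE = 1024  # stand-in for Scrapy's DOWNLOAD_MAXSIZE limit

class DecompressionBombError(ValueError):
    """Raised when a chunk would decompress past the size limit."""

def safe_deflate_decompress(data, max_size=MAX_DECOMPRESSED_SIZE):
    """Decompress a zlib/deflate payload, refusing to expand past max_size."""
    d = zlib.decompressobj()
    # max_length caps how many bytes this call may emit; a bomb leaves
    # compressed input unconsumed instead of exhausting memory.
    out = d.decompress(data, max_size)
    if d.unconsumed_tail or not d.eof:
        raise DecompressionBombError("decompressed size exceeds limit")
    return out
```

A normal payload decompresses as usual, while a highly compressed bomb is rejected after producing at most ``max_size`` bytes rather than being fully expanded in memory.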
2025-10-09 09:58:14 by Thomas Klausner | Files touched by this commit (442)
Log message: *: remove reference to (removed) Python 3.9 |
2025-07-05 13:44:20 by Thomas Klausner | Files touched by this commit (116)
Log message: *: some more recursive Python restrictions on Python 3.11+. Reported in a SmartOS bulk build.
2025-07-03 06:42:12 by Adam Ciarcinski | Files touched by this commit (2)
Log message: py-scrapy: updated to 2.13.3

Scrapy 2.13.3 (2025-07-02)

- Changed the values for :setting:`DOWNLOAD_DELAY` (from ``0`` to ``1``) and
  :setting:`CONCURRENT_REQUESTS_PER_DOMAIN` (from ``8`` to ``1``) in the
  default project template.
- Improved :class:`scrapy.core.engine.ExecutionEngine` logic related to
  initialization and exception handling, fixing several cases where the spider
  would crash, hang or log an unhandled exception. (:issue:`6783`,
  :issue:`6784`, :issue:`6900`, :issue:`6908`, :issue:`6910`, ...)
- Fixed a Windows issue with :ref:`feed exports <topics-feed-exports>` using
  :class:`scrapy.extensions.feedexport.FileFeedStorage` that caused the file to
  be created on the wrong drive.
- Allowed running tests with Twisted 25.5.0+ again. Pytest 8.4.1+ is now
  required for running tests in non-pinned environments, as support for the new
  Twisted version was added in that version.
- Fixed running tests with lxml 6.0.0+.
- Added a deprecation notice for
  ``scrapy.spidermiddlewares.offsite.OffsiteMiddleware`` to :ref:`the Scrapy
  2.11.2 release notes <release-2.11.2>`.
- Updated the :ref:`contribution docs <topics-contributing>` to refer to ruff_
  instead of black_.
- Added ``.venv/`` and ``.vscode/`` to ``.gitignore``.
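The first change above makes new projects crawl more politely by default. In a generated project, the relevant portion of ``settings.py`` could look like the following sketch (the ``BOT_NAME`` value is a placeholder; only the two changed settings are shown):

```python
# settings.py -- sketch of the updated default project template values.
# Only the two settings changed in Scrapy 2.13.3 are shown here.

BOT_NAME = "mybot"  # hypothetical project name

# Previously 0: now a 1-second delay between requests to the same domain.
DOWNLOAD_DELAY = 1

# Previously 8: now a single concurrent request per domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1
```

Projects that need the old, more aggressive behavior can simply set these back in their own settings module.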
2025-07-01 13:50:30 by Adam Ciarcinski | Files touched by this commit (2)
Log message: py-scrapy: updated to 2.13.2

Scrapy 2.13.2 (2025-06-09)

- Fixed a bug introduced in Scrapy 2.13.0 that caused the results of request
  errbacks to be ignored when the errback was called because of a downloader
  error.
- Added a note about the behavior change of
  :func:`scrapy.utils.reactor.is_asyncio_reactor_installed` to its docs and to
  the "Backward-incompatible changes" section of :ref:`the Scrapy 2.13.0
  release notes <release-2.13.0>`.
- Improved the message in the exception raised by
  :func:`scrapy.utils.test.get_reactor_settings` when there is no reactor
  installed.
- Updated the :class:`scrapy.crawler.CrawlerRunner` examples in
  :ref:`topics-practices` to install the reactor explicitly, to fix
  reactor-related errors with Scrapy 2.13.0 and later.
- Fixed ``scrapy fetch`` not working with scrapy-poet_.
- Fixed an exception produced by :class:`scrapy.core.engine.ExecutionEngine`
  when it is closed before being fully initialized.
- Improved the README and updated the Scrapy logo in it.
- Restricted the Twisted version used in tests to below 25.5.0, as some tests
  fail with 25.5.0.
- Updated type hints for Twisted 25.5.0 changes.
- Removed the old artwork.