The NetBSD Project

CVS log for pkgsrc/www/py-scrapy/Makefile


Keyword substitution: kv
Default branch: MAIN


Revision 1.23
Mon Nov 11 07:29:27 2024 UTC (30 hours, 50 minutes ago) by wiz
Branches: MAIN
CVS tags: HEAD
Changes since revision 1.22: +1 -2 lines
py-*: remove unused tool dependency

py-setuptools includes the py-wheel functionality nowadays

Revision 1.22
Tue May 14 19:15:59 2024 UTC (5 months, 4 weeks ago) by adam
Branches: MAIN
CVS tags: pkgsrc-2024Q3-base, pkgsrc-2024Q3, pkgsrc-2024Q2-base, pkgsrc-2024Q2
Changes since revision 1.21: +8 -4 lines
py-scrapy: updated to 2.11.2

Scrapy 2.11.2 (2024-05-14)
--------------------------

Security bug fixes
~~~~~~~~~~~~~~~~~~

-   Redirects to non-HTTP protocols are no longer followed. Please, see the
    `23j4-mw76-5v7h security advisory`_ for more information. (:issue:`457`)

    .. _23j4-mw76-5v7h security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-23j4-mw76-5v7h

-   The ``Authorization`` header is now dropped on redirects to a different
    scheme (``http://`` or ``https://``) or port, even if the domain is the
    same. Please, see the `4qqq-9vqf-3h3f security advisory`_ for more
    information.

    .. _4qqq-9vqf-3h3f security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-4qqq-9vqf-3h3f

-   When using system proxy settings that are different for ``http://`` and
    ``https://``, redirects to a different URL scheme will now also trigger the
    corresponding change in proxy settings for the redirected request. Please,
    see the `jm3v-qxmh-hxwv security advisory`_ for more information.
    (:issue:`767`)

    .. _jm3v-qxmh-hxwv security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-jm3v-qxmh-hxwv

-   :attr:`Spider.allowed_domains <scrapy.Spider.allowed_domains>` is now
    enforced for all requests, and not only requests from spider callbacks.
    (:issue:`1042`, :issue:`2241`, :issue:`6358`)

-   :func:`~scrapy.utils.iterators.xmliter_lxml` no longer resolves XML
    entities. (:issue:`6265`)

-   defusedxml_ is now used to make
    :class:`scrapy.http.request.rpc.XmlRpcRequest` more secure.
    (:issue:`6250`, :issue:`6251`)

    .. _defusedxml: https://github.com/tiran/defusedxml
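
The ``Authorization`` rule above can be pictured as an origin comparison. The following is a minimal standard-library sketch, not Scrapy's implementation: the helper names ``origin`` and ``should_drop_authorization`` are hypothetical, and the default-port table is an assumption covering only ``http`` and ``https``.

```python
from urllib.parse import urlsplit

# Assumed default ports for the two schemes named in the release note.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url: str) -> tuple:
    """Return (scheme, host, effective port) for an absolute URL."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def should_drop_authorization(source_url: str, redirect_url: str) -> bool:
    """Hypothetical helper: True when a redirect changes scheme, host, or port."""
    return origin(source_url) != origin(redirect_url)
```

Under these assumptions, a redirect from ``https://example.com/a`` to ``http://example.com/a`` loses the header, while a redirect to ``https://example.com:443/b`` keeps it because the effective port is unchanged.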

Bug fixes
~~~~~~~~~

-   Restored support for brotlipy_, which had been dropped in Scrapy 2.11.1 in
    favor of brotli_. (:issue:`6261`)

    .. _brotli: https://github.com/google/brotli

    .. note:: brotlipy is deprecated, both in Scrapy and upstream. Use brotli
        instead if you can.

-   Make :setting:`METAREFRESH_IGNORE_TAGS` ``["noscript"]`` by default. This
    prevents
    :class:`~scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware` from
    following redirects that would not be followed by web browsers with
    JavaScript enabled. (:issue:`6342`, :issue:`6347`)

-   During :ref:`feed export <topics-feed-exports>`, do not close the
    underlying file from :ref:`built-in post-processing plugins
    <builtin-plugins>`.
    (:issue:`5932`, :issue:`6178`, :issue:`6239`)

-   :class:`LinkExtractor <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor>`
    now properly applies the ``unique`` and ``canonicalize`` parameters.
    (:issue:`3273`, :issue:`6221`)

-   Do not initialize the scheduler disk queue if :setting:`JOBDIR` is an empty
    string. (:issue:`6121`, :issue:`6124`)

-   Fix :attr:`Spider.logger <scrapy.Spider.logger>` not logging custom extra
    information. (:issue:`6323`, :issue:`6324`)

-   ``robots.txt`` files with a non-UTF-8 encoding no longer prevent parsing
    the UTF-8-compatible (e.g. ASCII) parts of the document.
    (:issue:`6292`, :issue:`6298`)

-   :meth:`scrapy.http.cookies.WrappedRequest.get_header` no longer raises an
    exception if ``default`` is ``None``.
    (:issue:`6308`, :issue:`6310`)

-   :class:`~scrapy.selector.Selector` now uses
    :func:`scrapy.utils.response.get_base_url` to determine the base URL of a
    given :class:`~scrapy.http.Response`. (:issue:`6265`)

-   The :meth:`media_to_download` method of :ref:`media pipelines
    <topics-media-pipeline>` now logs exceptions before stripping them.
    (:issue:`5067`, :issue:`5068`)

-   When passing a callback to the :command:`parse` command, build the callback
    callable with the right signature.
    (:issue:`6182`)
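
The ``robots.txt`` item above amounts to decoding leniently rather than failing on the first invalid byte. A rough illustration with plain Python follows; whether Scrapy uses exactly this call is an assumption, and ``decode_robotstxt_lossy`` is a hypothetical name.

```python
def decode_robotstxt_lossy(body: bytes) -> str:
    """Hypothetical helper: decode as UTF-8, dropping bytes that are not valid UTF-8."""
    return body.decode("utf-8", errors="ignore")


# The second line carries a stray Latin-1 byte (0xE9); the rest is plain ASCII.
raw = b"User-agent: *\nDisallow: /priv\xe9\nDisallow: /tmp\n"
text = decode_robotstxt_lossy(raw)
```

The invalid byte is discarded and every ASCII rule survives, so a parser still sees both ``Disallow`` lines.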

Documentation
~~~~~~~~~~~~~

-   Add a FAQ entry about :ref:`creating blank requests <faq-blank-request>`.
    (:issue:`6203`, :issue:`6208`)

-   Document that :attr:`scrapy.selector.Selector.type` can be ``"json"``.
    (:issue:`6328`, :issue:`6334`)

Quality assurance
~~~~~~~~~~~~~~~~~

-   Make builds reproducible. (:issue:`5019`, :issue:`6322`)

-   Packaging and test fixes.

Revision 1.21
Fri Feb 16 19:02:45 2024 UTC (8 months, 3 weeks ago) by adam
Branches: MAIN
CVS tags: pkgsrc-2024Q1-base, pkgsrc-2024Q1
Changes since revision 1.20: +6 -6 lines
py-scrapy: updated to 2.11.1

Scrapy 2.11.1 (2024-02-14)
--------------------------

Highlights:

-   Security bug fixes.

-   Support for Twisted >= 23.8.0.

-   Documentation improvements.

Security bug fixes
~~~~~~~~~~~~~~~~~~

-   Addressed `ReDoS vulnerabilities`_:

    -   ``scrapy.utils.iterators.xmliter`` is now deprecated in favor of
        :func:`~scrapy.utils.iterators.xmliter_lxml`, which
        :class:`~scrapy.spiders.XMLFeedSpider` now uses.

        To minimize the impact of this change on existing code,
        :func:`~scrapy.utils.iterators.xmliter_lxml` now supports indicating
        the node namespace with a prefix in the node name, and big files with
        highly nested trees when using libxml2 2.7+.

    -   Fixed regular expressions in the implementation of the
        :func:`~scrapy.utils.response.open_in_browser` function.

    Please, see the `cc65-xxvf-f7r9 security advisory`_ for more information.

    .. _ReDoS vulnerabilities: https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS
    .. _cc65-xxvf-f7r9 security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-cc65-xxvf-f7r9

-   :setting:`DOWNLOAD_MAXSIZE` and :setting:`DOWNLOAD_WARNSIZE` now also apply
    to the decompressed response body. Please, see the `7j7m-v7m3-jqm7 security
    advisory`_ for more information.

    .. _7j7m-v7m3-jqm7 security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-7j7m-v7m3-jqm7

-   Also in relation to the `7j7m-v7m3-jqm7 security advisory`_, the
    deprecated ``scrapy.downloadermiddlewares.decompression`` module has been
    removed.

-   The ``Authorization`` header is now dropped on redirects to a different
    domain. Please, see the `cw9j-q3vf-hrrv security advisory`_ for more
    information.

    .. _cw9j-q3vf-hrrv security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-cw9j-q3vf-hrrv
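
To see why a size limit has to cover the decompressed body, consider a small compressed payload that inflates to something huge. The sketch below bounds gzip decompression with the standard ``zlib`` module; it illustrates the idea only and is not Scrapy's code. ``BodyTooLarge`` and ``bounded_gunzip`` are hypothetical names, and ``max_size`` plays the role that :setting:`DOWNLOAD_MAXSIZE` plays in the note above.

```python
import gzip
import zlib


class BodyTooLarge(Exception):
    """Raised when the decompressed body exceeds the configured limit."""


def bounded_gunzip(data: bytes, max_size: int) -> bytes:
    """Decompress gzip data, refusing to produce more than max_size bytes."""
    # 16 + MAX_WBITS tells zlib to expect gzip framing.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    out = bytearray()
    buf = data
    while not decompressor.eof:
        # Ask for at most one byte beyond the remaining budget, so an
        # oversized body is detected without inflating all of it.
        budget = max_size - len(out) + 1
        chunk = decompressor.decompress(buf, budget)
        out += chunk
        if len(out) > max_size:
            raise BodyTooLarge(f"decompressed body exceeds {max_size} bytes")
        buf = decompressor.unconsumed_tail
        if not chunk and not buf:
            break  # truncated input: nothing more can be produced
    return bytes(out)
```

With this approach a body that compresses to a few kilobytes but expands to a megabyte is rejected after roughly ``max_size`` bytes of output, instead of after the whole expansion.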

Modified requirements
~~~~~~~~~~~~~~~~~~~~~

-   The Twisted dependency is no longer restricted to < 23.8.0. (:issue:`6024`,
    :issue:`6064`, :issue:`6142`)

Bug fixes
~~~~~~~~~

-   The OS signal handling code was refactored to no longer use private Twisted
    functions. (:issue:`6024`, :issue:`6064`, :issue:`6112`)

Documentation
~~~~~~~~~~~~~

-   Improved documentation for :class:`~scrapy.crawler.Crawler` initialization
    changes made in the 2.11.0 release. (:issue:`6057`, :issue:`6147`)

-   Extended documentation for :attr:`Request.meta <scrapy.http.Request.meta>`.
    (:issue:`5565`)

-   Fixed the :reqmeta:`dont_merge_cookies` documentation. (:issue:`5936`,
    :issue:`6077`)

-   Added a link to Zyte's export guides to the :ref:`feed exports
    <topics-feed-exports>` documentation. (:issue:`6183`)

-   Added a missing note about backward-incompatible changes in
    :class:`~scrapy.exporters.PythonItemExporter` to the 2.11.0 release notes.
    (:issue:`6060`, :issue:`6081`)

-   Added a missing note about removing the deprecated
    ``scrapy.utils.boto.is_botocore()`` function to the 2.8.0 release notes.
    (:issue:`6056`, :issue:`6061`)

-   Other documentation improvements. (:issue:`6128`, :issue:`6144`,
    :issue:`6163`, :issue:`6190`, :issue:`6192`)

Quality assurance
~~~~~~~~~~~~~~~~~

-   Added Python 3.12 to the CI configuration and re-enabled tests that were
    disabled when the pre-release support was added. (:issue:`5985`,
    :issue:`6083`, :issue:`6098`)

-   Fixed a test issue on PyPy 7.3.14. (:issue:`6204`, :issue:`6205`)

Revision 1.20
Tue Oct 10 17:18:23 2023 UTC (13 months ago) by triaxx
Branches: MAIN
CVS tags: pkgsrc-2023Q4-base, pkgsrc-2023Q4
Changes since revision 1.19: +2 -2 lines
py-scrapy: Update to 2.11.0

upstream changes:
-----------------
  * 2.11.0: https://docs.scrapy.org/en/latest/news.html#scrapy-2-11-0-2023-09-18
  * 2.10.0: https://docs.scrapy.org/en/2.10/news.html#scrapy-2-10-0-2023-08-04

Revision 1.19
Sun Jun 18 05:39:38 2023 UTC (16 months, 3 weeks ago) by adam
Branches: MAIN
CVS tags: pkgsrc-2023Q3-base, pkgsrc-2023Q3, pkgsrc-2023Q2-base, pkgsrc-2023Q2
Changes since revision 1.18: +2 -2 lines
py-ZopeInterface: moved to py-zope.interface

Revision 1.18
Wed May 10 12:40:44 2023 UTC (18 months ago) by adam
Branches: MAIN
Changes since revision 1.17: +2 -2 lines
py-scrapy: updated to 2.9.0

Scrapy 2.9.0 (2023-05-08)
-------------------------

Highlights:

-   Per-domain download settings.
-   Compatibility with new cryptography_ and new parsel_.
-   JMESPath selectors from the new parsel_.
-   Bug fixes.

Deprecations
~~~~~~~~~~~~

-   :class:`scrapy.extensions.feedexport._FeedSlot` is renamed to
    :class:`scrapy.extensions.feedexport.FeedSlot` and the old name is
    deprecated. (:issue:`5876`)

New features
~~~~~~~~~~~~

-   Settings corresponding to :setting:`DOWNLOAD_DELAY`,
    :setting:`CONCURRENT_REQUESTS_PER_DOMAIN` and
    :setting:`RANDOMIZE_DOWNLOAD_DELAY` can now be set on a per-domain basis
    via the new :setting:`DOWNLOAD_SLOTS` setting. (:issue:`5328`)

-   Added :meth:`TextResponse.jmespath`, a shortcut for JMESPath selectors
    available since parsel_ 1.8.1. (:issue:`5894`, :issue:`5915`)

-   Added :signal:`feed_slot_closed` and :signal:`feed_exporter_closed`
    signals. (:issue:`5876`)

-   Added :func:`scrapy.utils.request.request_to_curl`, a function to produce a
    curl command from a :class:`~scrapy.Request` object. (:issue:`5892`)

-   Values of :setting:`FILES_STORE` and :setting:`IMAGES_STORE` can now be
    :class:`pathlib.Path` instances. (:issue:`5801`)
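
As an illustration of the per-domain settings mentioned above, a ``settings.py`` fragment might look like the following. The domains are placeholders, and the key names (``concurrency``, ``delay``, ``randomize_delay``) are given from memory of the Scrapy settings reference, so treat them as an assumption and check the documentation for your version.

```python
# settings.py (fragment); the domains below are placeholders.
DOWNLOAD_SLOTS = {
    # One request at a time, two seconds apart, without random jitter.
    "quotes.toscrape.com": {"concurrency": 1, "delay": 2, "randomize_delay": False},
    # Only the delay is overridden here; other values fall back to the global settings.
    "books.toscrape.com": {"delay": 3, "randomize_delay": False},
}
```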

Bug fixes
~~~~~~~~~

-   Fixed a warning with Parsel 1.8.1+. (:issue:`5903`, :issue:`5918`)

-   Fixed an error when using feed postprocessing with S3 storage.
    (:issue:`5500`, :issue:`5581`)

-   Added the missing :meth:`scrapy.settings.BaseSettings.setdefault` method.
    (:issue:`5811`, :issue:`5821`)

-   Fixed an error when using cryptography_ 40.0.0+ and
    :setting:`DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING` is enabled.
    (:issue:`5857`, :issue:`5858`)

-   The checksums returned by :class:`~scrapy.pipelines.files.FilesPipeline`
    for files on Google Cloud Storage are no longer Base64-encoded.
    (:issue:`5874`, :issue:`5891`)

-   :func:`scrapy.utils.request.request_from_curl` now supports $-prefixed
    string values for the curl ``--data-raw`` argument, which are produced by
    browsers for data that includes certain symbols. (:issue:`5899`,
    :issue:`5901`)

-   The :command:`parse` command now also works with async generator callbacks.
    (:issue:`5819`, :issue:`5824`)

-   The :command:`genspider` command now properly works with HTTPS URLs.
    (:issue:`3553`, :issue:`5808`)

-   Improved handling of asyncio loops. (:issue:`5831`, :issue:`5832`)

-   :class:`LinkExtractor <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor>`
    now skips certain malformed URLs instead of raising an exception.
    (:issue:`5881`)

-   :func:`scrapy.utils.python.get_func_args` now supports more types of
    callables. (:issue:`5872`, :issue:`5885`)

-   Fixed an error when processing non-UTF8 values of ``Content-Type`` headers.
    (:issue:`5914`, :issue:`5917`)

-   Fixed an error breaking user handling of send failures in
    :meth:`scrapy.mail.MailSender.send()`. (:issue:`1611`, :issue:`5880`)

Documentation
~~~~~~~~~~~~~

-   Expanded contributing docs. (:issue:`5109`, :issue:`5851`)

-   Added blacken-docs_ to pre-commit and reformatted the docs with it.
    (:issue:`5813`, :issue:`5816`)

-   Fixed a JS issue. (:issue:`5875`, :issue:`5877`)

-   Fixed ``make htmlview``. (:issue:`5878`, :issue:`5879`)

-   Fixed typos and other small errors. (:issue:`5827`, :issue:`5839`,
    :issue:`5883`, :issue:`5890`, :issue:`5895`, :issue:`5904`)

Quality assurance
~~~~~~~~~~~~~~~~~

-   Extended typing hints. (:issue:`5805`, :issue:`5889`, :issue:`5896`)

-   Tests for most of the examples in the docs are now run as a part of CI,
    and the problems found were fixed. (:issue:`5816`, :issue:`5826`,
    :issue:`5919`)

-   Removed usage of deprecated Python classes. (:issue:`5849`)

-   Silenced ``include-ignored`` warnings from coverage. (:issue:`5820`)

-   Fixed a random failure of the ``test_feedexport.test_batch_path_differ``
    test. (:issue:`5855`, :issue:`5898`)

-   Updated docstrings to match output produced by parsel_ 1.8.1 so that they
    don't cause test failures. (:issue:`5902`, :issue:`5919`)

-   Other CI and pre-commit improvements. (:issue:`5802`, :issue:`5823`,
    :issue:`5908`)

Revision 1.17
Thu Apr 27 09:33:44 2023 UTC (18 months, 2 weeks ago) by adam
Branches: MAIN
Changes since revision 1.16: +11 -10 lines
py-scrapy: updated to 2.8.0

Scrapy 2.8.0 (2023-02-02)
-------------------------

This is a maintenance release, with minor features, bug fixes, and cleanups.

Deprecation removals
~~~~~~~~~~~~~~~~~~~~
-   The ``scrapy.utils.gz.read1`` function, deprecated in Scrapy 2.0, has now
    been removed. Use the :meth:`~io.BufferedIOBase.read1` method of
    :class:`~gzip.GzipFile` instead.
-   The ``scrapy.utils.python.to_native_str`` function, deprecated in Scrapy
    2.0, has now been removed. Use :func:`scrapy.utils.python.to_unicode`
    instead.
-   The ``scrapy.utils.python.MutableChain.next`` method, deprecated in Scrapy
    2.0, has now been removed. Use
    :meth:`~scrapy.utils.python.MutableChain.__next__` instead.
-   The ``scrapy.linkextractors.FilteringLinkExtractor`` class, deprecated
    in Scrapy 2.0, has now been removed. Use
    :class:`LinkExtractor <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor>`
    instead.
-   Support for using environment variables prefixed with ``SCRAPY_`` to
    override settings, deprecated in Scrapy 2.0, has now been removed.
-   Support for the ``noconnect`` query string argument in proxy URLs,
    deprecated in Scrapy 2.0, has now been removed. We expect proxies that used
    to need it to work fine without it.
-   The ``scrapy.utils.python.retry_on_eintr`` function, deprecated in Scrapy
    2.3, has now been removed.
-   The ``scrapy.utils.python.WeakKeyCache`` class, deprecated in Scrapy 2.4,
    has now been removed.
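
For the first removal in the list above, the suggested replacement is the ``read1`` method that :class:`~gzip.GzipFile` already provides. A small standard-library sketch follows; the payload and the chunk size of 4 are arbitrary choices for illustration.

```python
import gzip
import io

# An in-memory gzip stream standing in for a compressed response body.
compressed = io.BytesIO(gzip.compress(b"hello, scrapy"))

chunks = []
with gzip.GzipFile(fileobj=compressed) as gz:
    while True:
        # read1() returns up to the requested number of decompressed bytes.
        chunk = gz.read1(4)
        if not chunk:
            break
        chunks.append(chunk)

data = b"".join(chunks)
```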

Deprecations
~~~~~~~~~~~~
-   :exc:`scrapy.pipelines.images.NoimagesDrop` is now deprecated.
-   :meth:`ImagesPipeline.convert_image
    <scrapy.pipelines.images.ImagesPipeline.convert_image>` must now accept a
    ``response_body`` parameter.

New features
~~~~~~~~~~~~
-   Applied black_ coding style to files generated with the
    :command:`genspider` and :command:`startproject` commands.

    .. _black: https://black.readthedocs.io/en/stable/

-   :setting:`FEED_EXPORT_ENCODING` is now set to ``"utf-8"`` in the
    ``settings.py`` file that the :command:`startproject` command generates.
    With this value, JSON exports won’t force the use of escape sequences for
    non-ASCII characters.
-   The :class:`~scrapy.extensions.memusage.MemoryUsage` extension now logs the
    peak memory usage during checks, and the binary unit MiB is now used to
    avoid confusion.
-   The ``callback`` parameter of :class:`~scrapy.http.Request` can now be set
    to :func:`scrapy.http.request.NO_CALLBACK`, to distinguish it from
    ``None``, as the latter indicates that the default spider callback
    (:meth:`~scrapy.Spider.parse`) is to be used.
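
The effect of the :setting:`FEED_EXPORT_ENCODING` change described above can be seen with the standard ``json`` module alone. This is an analogy rather than Scrapy's exporter code: ``ensure_ascii`` is the standard-library switch that decides whether non-ASCII characters are written as escape sequences, and the sample item is made up.

```python
import json

item = {"title": "Café"}

# Default behaviour: non-ASCII characters become \uXXXX escape sequences.
escaped = json.dumps(item)

# With ensure_ascii=False the character is written as-is, which is the kind
# of output a UTF-8 feed encoding is meant to allow.
readable = json.dumps(item, ensure_ascii=False)
```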

Bug fixes
~~~~~~~~~
-   Enabled unsafe legacy SSL renegotiation to fix access to some outdated
    websites.
-   Fixed STARTTLS-based email delivery not working with Twisted 21.2.0 and
    later.
-   Fixed the :meth:`finish_exporting` method of :ref:`item exporters
    <topics-exporters>` not being called for empty files.
-   Fixed HTTP/2 responses getting only the last value for a header when
    multiple headers with the same name are received.
-   Fixed an exception raised by the :command:`shell` command in some cases
    when :ref:`using asyncio <using-asyncio>`.
-   When using :class:`~scrapy.spiders.CrawlSpider`, callback keyword arguments
    (``cb_kwargs``) added to a request in the ``process_request`` callback of a
    :class:`~scrapy.spiders.Rule` will no longer be ignored.
-   The :ref:`images pipeline <images-pipeline>` no longer re-encodes JPEG
    files.
-   Fixed the handling of transparent WebP images by the :ref:`images pipeline
    <images-pipeline>`.
-   :func:`scrapy.shell.inspect_response` no longer inhibits ``SIGINT``
    (Ctrl+C).
-   :class:`LinkExtractor <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor>`
    with ``unique=False`` no longer filters out links that have identical URL
    *and* text.
-   :class:`~scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware` now
    ignores URL protocols that do not support ``robots.txt`` (``data://``,
    ``file://``).
-   Silenced the ``filelock`` debug log messages introduced in Scrapy 2.6.
-   Fixed the output of ``scrapy -h`` showing an unintended ``**commands**``
    line.
-   Made the active project indication in the output of :ref:`commands
    <topics-commands>` clearer.

Documentation
~~~~~~~~~~~~~
-   Documented how to :ref:`debug spiders from Visual Studio Code
    <debug-vscode>`.
-   Documented how :setting:`DOWNLOAD_DELAY` affects per-domain concurrency.
-   Improved consistency.
-   Fixed typos.

Quality assurance
~~~~~~~~~~~~~~~~~
-   Applied :ref:`black coding style <coding-style>`, sorted import statements,
    and introduced :ref:`pre-commit <scrapy-pre-commit>`.
-   Switched from :mod:`os.path` to :mod:`pathlib`.
-   Addressed many issues reported by Pylint.
-   Improved code readability.
-   Improved package metadata.
-   Removed direct invocations of ``setup.py``.
-   Removed unnecessary :class:`~collections.OrderedDict` usages.
-   Removed unnecessary ``__str__`` definitions.
-   Removed obsolete code and comments.
-   Fixed test and CI issues.

Revision 1.16
Wed Oct 19 14:25:20 2022 UTC (2 years ago) by nia
Branches: MAIN
CVS tags: pkgsrc-2023Q1-base, pkgsrc-2023Q1, pkgsrc-2022Q4-base, pkgsrc-2022Q4
Changes since revision 1.15: +2 -2 lines
fighting a losing battle against the py-cryptography rustification, part 5

Convert py-OpenSSL users to versioned_dependencies.mk

Revision 1.15
Wed Oct 19 13:56:34 2022 UTC (2 years ago) by nia
Branches: MAIN
Changes since revision 1.14: +3 -2 lines
fighting a losing battle against py-cryptography rustification, part 2

Switch users to versioned_dependencies.mk.

Revision 1.14
Wed Jan 5 15:41:31 2022 UTC (2 years, 10 months ago) by wiz
Branches: MAIN
CVS tags: pkgsrc-2022Q3-base, pkgsrc-2022Q3, pkgsrc-2022Q2-base, pkgsrc-2022Q2, pkgsrc-2022Q1-base, pkgsrc-2022Q1
Changes since revision 1.13: +4 -2 lines
python: egg.mk: add USE_PKG_RESOURCES flag

This flag should be set for packages that import pkg_resources
and thus need setuptools after the build step.

Set this flag for packages that need it and bump PKGREVISION.

Revision 1.13
Tue Jan 4 20:55:35 2022 UTC (2 years, 10 months ago) by wiz
Branches: MAIN
Changes since revision 1.12: +2 -1 lines
*: bump PKGREVISION for egg.mk users

They now have a tool dependency on py-setuptools instead of a DEPENDS

Revision 1.12
Wed Oct 6 09:07:00 2021 UTC (3 years, 1 month ago) by jperkin
Branches: MAIN
CVS tags: pkgsrc-2021Q4-base, pkgsrc-2021Q4
Changes since revision 1.11: +2 -2 lines
py-scrapy: Switch to PYTHON_VERSIONS_INCOMPATIBLE.

Revision 1.11
Mon Mar 22 08:56:56 2021 UTC (3 years, 7 months ago) by triaxx
Branches: MAIN
CVS tags: pkgsrc-2021Q3-base, pkgsrc-2021Q3, pkgsrc-2021Q2-base, pkgsrc-2021Q2, pkgsrc-2021Q1-base, pkgsrc-2021Q1
Changes since revision 1.10: +6 -3 lines
py-scrapy: Update to 2.4.1

upstream changes:
-----------------
A lot of changes listed at https://github.com/scrapy/scrapy/blob/master/docs/news.rst

Revision 1.10
Wed Jan 29 22:06:30 2020 UTC (4 years, 9 months ago) by adam
Branches: MAIN
CVS tags: pkgsrc-2020Q4-base, pkgsrc-2020Q4, pkgsrc-2020Q3-base, pkgsrc-2020Q3, pkgsrc-2020Q2-base, pkgsrc-2020Q2, pkgsrc-2020Q1-base, pkgsrc-2020Q1
Changes since revision 1.9: +13 -10 lines
py-scrapy: updated to 1.8.0

Scrapy 1.8.0:

Highlights:
* Dropped Python 3.4 support and updated minimum requirements; made Python 3.8
  support official
* New :meth:`Request.from_curl <scrapy.http.Request.from_curl>` class method
* New :setting:`ROBOTSTXT_PARSER` and :setting:`ROBOTSTXT_USER_AGENT` settings
* New :setting:`DOWNLOADER_CLIENT_TLS_CIPHERS` and
  :setting:`DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING` settings

Revision 1.9
Thu Aug 22 08:21:11 2019 UTC (5 years, 2 months ago) by adam
Branches: MAIN
CVS tags: pkgsrc-2019Q4-base, pkgsrc-2019Q4, pkgsrc-2019Q3-base, pkgsrc-2019Q3
Changes since revision 1.8: +2 -2 lines
py-scrapy: updated to 1.7.3

Scrapy 1.7.3:
Enforce lxml 4.3.5 or lower for Python 3.4 (issue 3912, issue 3918).

Scrapy 1.7.2:
Fix Python 2 support (issue 3889, issue 3893, issue 3896).

Scrapy 1.7.1:
Re-packaging of Scrapy 1.7.0, which was missing some changes in PyPI.

Scrapy 1.7.0:
Highlights:
Improvements for crawls targeting multiple domains
A cleaner way to pass arguments to callbacks
A new class for JSON requests
Improvements for rule-based spiders
New features for feed exports


Backward-incompatible changes

429 is now part of the RETRY_HTTP_CODES setting by default
This change is backward incompatible. If you don’t want to retry 429, you must override RETRY_HTTP_CODES accordingly.
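
Opting out of retrying 429 responses means listing the codes you do want retried. A hedged ``settings.py`` fragment follows; the remaining codes are assumed to be the defaults of that release, so compare them against the settings reference for the version you run.

```python
# settings.py (fragment): retry the usual transient errors, but not 429.
# The codes listed are an assumption about the release's other defaults.
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408]
```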

Crawler, CrawlerRunner.crawl and CrawlerRunner.create_crawler no longer accept a Spider subclass instance; they only accept a Spider subclass now.
Spider subclass instances were never meant to work, and they were not working as one would expect: instead of using the passed Spider subclass instance, their from_crawler method was called to generate a new instance.

Non-default values for the SCHEDULER_PRIORITY_QUEUE setting may stop working. Scheduler priority queue classes now need to handle Request objects instead of arbitrary Python data structures.


New features

A new scheduler priority queue, scrapy.pqueues.DownloaderAwarePriorityQueue, may be enabled for a significant scheduling improvement on crawls targeting multiple web domains, at the cost of no CONCURRENT_REQUESTS_PER_IP support (issue 3520)
A new Request.cb_kwargs attribute provides a cleaner way to pass keyword arguments to callback methods (issue 1138, issue 3563)
A new JSONRequest class offers a more convenient way to build JSON requests (issue 3504, issue 3505)
A process_request callback passed to the Rule constructor now receives the Response object that originated the request as its second argument (issue 3682)
A new restrict_text parameter for the LinkExtractor constructor allows filtering links by linking text (issue 3622, issue 3635)
A new FEED_STORAGE_S3_ACL setting allows defining a custom ACL for feeds exported to Amazon S3 (issue 3607)
A new FEED_STORAGE_FTP_ACTIVE setting allows using FTP’s active connection mode for feeds exported to FTP servers (issue 3829)
A new METAREFRESH_IGNORE_TAGS setting allows overriding which HTML tags are ignored when searching a response for HTML meta tags that trigger a redirect (issue 1422, issue 3768)
A new redirect_reasons request meta key exposes the reason (status code, meta refresh) behind every followed redirect (issue 3581, issue 3687)
The SCRAPY_CHECK variable is now set to the true string during runs of the check command, which allows detecting contract check runs from code (issue 3704, issue 3739)
A new Item.deepcopy() method makes it easier to deep-copy items (issue 1493, issue 3671)
CoreStats also logs elapsed_time_seconds now (issue 3638)
Exceptions from ItemLoader input and output processors are now more verbose (issue 3836, issue 3840)
Crawler, CrawlerRunner.crawl and CrawlerRunner.create_crawler now fail gracefully if they receive a Spider subclass instance instead of the subclass itself (issue 2283, issue 3610, issue 3872)
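
The SCRAPY_CHECK item above lets spider code notice that it is running under the check command. A minimal sketch of such a test, using only the standard library, is shown here; ``running_contract_check`` is a hypothetical helper name, and it relies only on the variable being set to a non-empty string.

```python
import os


def running_contract_check() -> bool:
    """Hypothetical helper: True while SCRAPY_CHECK is set to a non-empty value."""
    return bool(os.environ.get("SCRAPY_CHECK"))
```

A spider could call this in its constructor to skip expensive setup that is irrelevant to contract checks.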


Bug fixes

process_spider_exception() is now also invoked for generators (issue 220, issue 2061)
System exceptions like KeyboardInterrupt are no longer caught (issue 3726)
ItemLoader.load_item() no longer makes later calls to ItemLoader.get_output_value() or ItemLoader.load_item() return empty data (issue 3804, issue 3819)
The images pipeline (ImagesPipeline) no longer ignores these Amazon S3 settings: AWS_ENDPOINT_URL, AWS_REGION_NAME, AWS_USE_SSL, AWS_VERIFY (issue 3625)
Fixed a memory leak in MediaPipeline affecting, for example, non-200 responses and exceptions from custom middlewares (issue 3813)
Requests with private callbacks are now correctly unserialized from disk (issue 3790)
FormRequest.from_response() now handles invalid methods like major web browsers

Revision 1.8
Thu Jan 31 09:07:46 2019 UTC (5 years, 9 months ago) by adam
Branches: MAIN
CVS tags: pkgsrc-2019Q2-base, pkgsrc-2019Q2, pkgsrc-2019Q1-base, pkgsrc-2019Q1
Changes since revision 1.7: +8 -8 lines
py-scrapy: updated to 1.6.0

Scrapy 1.6.0:

Highlights:
* better Windows support;
* Python 3.7 compatibility;
* big documentation improvements, including a switch
  from .extract_first() + .extract() API to .get() + .getall()
  API;
* feed exports, FilesPipeline and MediaPipeline improvements;
* better extensibility: :signal:`item_error` and
  :signal:`request_reached_downloader` signals; from_crawler support
  for feed exporters, feed storages and dupefilters;
* scrapy.contracts fixes and new features;
* telnet console security improvements, first released as a
  backport in :ref:`release-1.5.2`;
* clean-up of the deprecated code;
* various bug fixes, small new features and usability improvements across
  the codebase.

Revision 1.7
Thu Jan 24 14:11:48 2019 UTC (5 years, 9 months ago) by adam
Branches: MAIN
Changes since revision 1.6: +4 -3 lines
py-scrapy: updated to 1.5.2

Scrapy 1.5.2:

* *Security bugfix*: The Telnet console extension can be easily exploited by
  rogue websites POSTing content to http://localhost:6023. We haven't found a
  way to exploit it from Scrapy, but it is very easy to trick a browser into
  doing so, which elevates the risk for local development environments.

  *The fix is backwards incompatible*: it enables telnet user-password
  authentication by default with a randomly generated password. If you can't
  upgrade right away, please consider changing :setting:`TELNETCONSOLE_PORT`
  from its default value.

  See the :ref:`telnet console <topics-telnetconsole>` documentation for more
  info.

* Backported the fix for a CI build failure under the GCE environment caused
  by a boto import error.

Revision 1.6
Tue Aug 14 06:56:39 2018 UTC (6 years, 3 months ago) by adam
Branches: MAIN
CVS tags: pkgsrc-2018Q4-base, pkgsrc-2018Q4, pkgsrc-2018Q3-base, pkgsrc-2018Q3
Changes since revision 1.5: +5 -4 lines
py-scrapy: updated to 1.5.1

Scrapy 1.5.1:
This is a maintenance release with important bug fixes, but no new features:
* O(N^2) gzip decompression issue which affected Python 3 and PyPy
  is fixed
* skipping of TLS validation errors is improved
* Ctrl-C handling is fixed in Python 3.5+
* testing fixes
* documentation improvements

Revision 1.5
Thu Jan 4 21:31:41 2018 UTC (6 years, 10 months ago) by adam
Branches: MAIN
CVS tags: pkgsrc-2018Q2-base, pkgsrc-2018Q2, pkgsrc-2018Q1-base, pkgsrc-2018Q1
Changes since revision 1.4: +9 -3 lines
py-scrapy: updated to 1.5.0

Scrapy 1.5.0:
This release brings small new features and improvements across the codebase.
Some highlights:

* Google Cloud Storage is supported in FilesPipeline and ImagesPipeline.
* Crawling with proxy servers becomes more efficient, as connections
  to proxies can be reused now.
* Warnings, exception and logging messages are improved to make debugging
  easier.
* The scrapy parse command now allows setting custom request meta via the
  --meta argument.
* Compatibility with Python 3.6, PyPy and PyPy3 is improved;
  PyPy and PyPy3 are now supported officially, by running tests on CI.
* Better default handling of HTTP 308, 522 and 524 status codes.
* Documentation is improved, as usual.

Backwards Incompatible Changes
* Scrapy 1.5 drops support for Python 3.3.
* Default Scrapy User-Agent now uses https link to scrapy.org.
  **This is technically backwards-incompatible**; override
  :setting:`USER_AGENT` if you relied on the old value.
* Logging of settings overridden by custom_settings is fixed;
  **this is technically backwards-incompatible** because the logger
  changes from [scrapy.utils.log] to [scrapy.crawler]. If you're
  parsing Scrapy logs, please update your log parsers.
* LinkExtractor now ignores the m4v extension by default; this is a change
  in behavior.
* 522 and 524 status codes are added to RETRY_HTTP_CODES

New features
- Support <link> tags in Response.follow
- Support for ptpython REPL
- Google Cloud Storage support for FilesPipeline and ImagesPipeline
- New --meta option of the "scrapy parse" command allows passing additional
  request.meta
- Populate spider variable when using shell.inspect_response
- Handle HTTP 308 Permanent Redirect
- Add 522 and 524 to RETRY_HTTP_CODES
- Log versions information at startup
- scrapy.mail.MailSender now works in Python 3 (it requires Twisted 17.9.0)
- Connections to proxy servers are reused
- Add template for a downloader middleware
- Explicit message for NotImplementedError when parse callback not defined
- CrawlerProcess got an option to disable installation of root log handler
- LinkExtractor now ignores m4v extension by default
- Better log messages for responses over :setting:`DOWNLOAD_WARNSIZE` and
  :setting:`DOWNLOAD_MAXSIZE` limits
- Show warning when a URL is put to Spider.allowed_domains instead of
  a domain.

Bug fixes
- Fix logging of settings overridden by custom_settings;
  **this is technically backwards-incompatible** because the logger
  changes from [scrapy.utils.log] to [scrapy.crawler], so please
  update your log parsers if needed
- Default Scrapy User-Agent now uses https link to scrapy.org.
  **This is technically backwards-incompatible**; override
  :setting:`USER_AGENT` if you relied on the old value.
- Fix PyPy and PyPy3 test failures, support them officially
- Fix DNS resolver when DNSCACHE_ENABLED=False
- Add cryptography for Debian Jessie tox test env
- Add verification to check if Request callback is callable
- Port extras/qpsclient.py to Python 3
- Use getfullargspec under the hood for Python 3 to stop DeprecationWarning
- Update deprecated test aliases
- Fix SitemapSpider support for alternate links

Revision 1.4
Mon Sep 4 18:08:30 2017 UTC (7 years, 2 months ago) by wiz
Branches: MAIN
CVS tags: pkgsrc-2017Q4-base, pkgsrc-2017Q4, pkgsrc-2017Q3-base, pkgsrc-2017Q3
Changes since revision 1.3: +2 -2 lines
Follow some redirects.

Revision 1.3
Sat May 20 06:25:36 2017 UTC (7 years, 5 months ago) by adam
Branches: MAIN
CVS tags: pkgsrc-2017Q2-base, pkgsrc-2017Q2
Changes since revision 1.2: +6 -2 lines
Scrapy 1.4 does not bring that many breathtaking new features
but quite a few handy improvements nonetheless.

Scrapy now supports anonymous FTP sessions with customizable user and
password via the new :setting:`FTP_USER` and :setting:`FTP_PASSWORD` settings.
And if you're using Twisted version 17.1.0 or above, FTP is now available
with Python 3.

There's a new :meth:`response.follow <scrapy.http.TextResponse.follow>` method
for creating requests; **it is now a recommended way to create Requests
in Scrapy spiders**. This method makes it easier to write correct
spiders; ``response.follow`` has several advantages over creating
``scrapy.Request`` objects directly:

* it handles relative URLs;
* it works properly with non-ascii URLs on non-UTF8 pages;
* in addition to absolute and relative URLs it supports Selectors;
  for ``<a>`` elements it can also extract their href values.
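
The first advantage in the list above, relative URL handling, is the step that otherwise has to be done by hand when building requests directly. The standard-library sketch below shows only that resolution step and does not use Scrapy; the page URL and the href values are made up for illustration.

```python
from urllib.parse import urljoin

# URL of the page the links were found on, plus href values as they
# might appear in its markup.
page_url = "https://example.com/catalog/page-1.html"
hrefs = ["page-2.html", "/about", "https://other.example/x"]

# Resolve each href against the page URL; absolute URLs pass through unchanged.
absolute = [urljoin(page_url, href) for href in hrefs]
```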

Revision 1.2
Sun Mar 19 22:59:10 2017 UTC (7 years, 7 months ago) by adam
Branches: MAIN
CVS tags: pkgsrc-2017Q1-base, pkgsrc-2017Q1
Changes since revision 1.1: +2 -2 lines
Changes 1.3.3:
Bug fixes
- Make ``SpiderLoader`` raise ``ImportError`` again by default for missing
  dependencies and wrong :setting:`SPIDER_MODULES`.
  These exceptions were silenced as warnings since 1.3.0.
  A new setting is introduced to toggle between warning or exception if needed;
  see :setting:`SPIDER_LOADER_WARN_ONLY` for details.

Revision 1.1
Mon Feb 13 21:25:33 2017 UTC (7 years, 8 months ago) by adam
Branches: MAIN
Added www/py-scrapy version 1.3.2

Scrapy is a fast high-level web crawling and web scraping framework, used to
crawl websites and extract structured data from their pages. It can be used for
a wide range of purposes, from data mining to monitoring and automated testing.
