CVS log: pkgsrc / www / py-scrapy
Default branch: MAIN
py-*: remove unused tool dependency

py-setuptools includes the py-wheel functionality nowadays.
py-scrapy: updated to 2.11.2

Scrapy 2.11.2 (2024-05-14)
--------------------------

Security bug fixes
~~~~~~~~~~~~~~~~~~

- Redirects to non-HTTP protocols are no longer followed. Please see the `23j4-mw76-5v7h security advisory`_ for more information. (:issue:`457`)

  .. _23j4-mw76-5v7h security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-23j4-mw76-5v7h

- The ``Authorization`` header is now dropped on redirects to a different scheme (``http://`` or ``https://``) or port, even if the domain is the same. Please see the `4qqq-9vqf-3h3f security advisory`_ for more information.

  .. _4qqq-9vqf-3h3f security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-4qqq-9vqf-3h3f

- When using system proxy settings that are different for ``http://`` and ``https://``, redirects to a different URL scheme will now also trigger the corresponding change in proxy settings for the redirected request. Please see the `jm3v-qxmh-hxwv security advisory`_ for more information. (:issue:`767`)

  .. _jm3v-qxmh-hxwv security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-jm3v-qxmh-hxwv

- :attr:`Spider.allowed_domains <scrapy.Spider.allowed_domains>` is now enforced for all requests, and not only requests from spider callbacks (see the sketch after these notes). (:issue:`1042`, :issue:`2241`, :issue:`6358`)

- :func:`~scrapy.utils.iterators.xmliter_lxml` no longer resolves XML entities. (:issue:`6265`)

- defusedxml_ is now used to make :class:`scrapy.http.request.rpc.XmlRpcRequest` more secure. (:issue:`6250`, :issue:`6251`)

  .. _defusedxml: https://github.com/tiran/defusedxml

Bug fixes
~~~~~~~~~

- Restored support for brotlipy_, which had been dropped in Scrapy 2.11.1 in favor of brotli_. (:issue:`6261`)

  .. _brotli: https://github.com/google/brotli

  .. note:: brotlipy is deprecated, both in Scrapy and upstream. Use brotli instead if you can.

- Make :setting:`METAREFRESH_IGNORE_TAGS` default to ``["noscript"]``. This prevents :class:`~scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware` from following redirects that would not be followed by web browsers with JavaScript enabled. (:issue:`6342`, :issue:`6347`)

- During :ref:`feed export <topics-feed-exports>`, do not close the underlying file from :ref:`built-in post-processing plugins <builtin-plugins>`. (:issue:`5932`, :issue:`6178`, :issue:`6239`)

- :class:`LinkExtractor <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor>` now properly applies the ``unique`` and ``canonicalize`` parameters. (:issue:`3273`, :issue:`6221`)

- Do not initialize the scheduler disk queue if :setting:`JOBDIR` is an empty string. (:issue:`6121`, :issue:`6124`)

- Fix :attr:`Spider.logger <scrapy.Spider.logger>` not logging custom extra information. (:issue:`6323`, :issue:`6324`)

- ``robots.txt`` files with a non-UTF-8 encoding no longer prevent parsing the UTF-8-compatible (e.g. ASCII) parts of the document. (:issue:`6292`, :issue:`6298`)

- :meth:`scrapy.http.cookies.WrappedRequest.get_header` no longer raises an exception if ``default`` is ``None``. (:issue:`6308`, :issue:`6310`)

- :class:`~scrapy.selector.Selector` now uses :func:`scrapy.utils.response.get_base_url` to determine the base URL of a given :class:`~scrapy.http.Response`. (:issue:`6265`)

- The :meth:`media_to_download` method of :ref:`media pipelines <topics-media-pipeline>` now logs exceptions before stripping them. (:issue:`5067`, :issue:`5068`)

- When passing a callback to the :command:`parse` command, build the callback callable with the right signature. (:issue:`6182`)

Documentation
~~~~~~~~~~~~~

- Add a FAQ entry about :ref:`creating blank requests <faq-blank-request>`. (:issue:`6203`, :issue:`6208`)

- Document that :attr:`scrapy.selector.Selector.type` can be ``"json"``. (:issue:`6328`, :issue:`6334`)

Quality assurance
~~~~~~~~~~~~~~~~~

- Make builds reproducible. (:issue:`5019`, :issue:`6322`)

- Packaging and test fixes.
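For illustration of the ``allowed_domains`` hardening above, a minimal spider sketch (the spider name and URLs are hypothetical, not taken from the release notes)::

    import scrapy

    class ExampleSpider(scrapy.Spider):
        # Hypothetical spider for illustration only.
        name = "example"
        # As of 2.11.2 this filter is enforced for *all* requests,
        # including start requests, not only those issued from callbacks.
        allowed_domains = ["example.com"]
        start_urls = ["https://example.com/"]

        def parse(self, response):
            # This off-domain request is now filtered out as well.
            yield scrapy.Request("https://elsewhere.example.org/", callback=self.parse)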
py-scrapy: updated to 2.11.1

Scrapy 2.11.1 (2024-02-14)
--------------------------

Highlights:

- Security bug fixes.
- Support for Twisted >= 23.8.0.
- Documentation improvements.

Security bug fixes
~~~~~~~~~~~~~~~~~~

- Addressed `ReDoS vulnerabilities`_:

  - ``scrapy.utils.iterators.xmliter`` is now deprecated in favor of :func:`~scrapy.utils.iterators.xmliter_lxml`, which :class:`~scrapy.spiders.XMLFeedSpider` now uses. To minimize the impact of this change on existing code, :func:`~scrapy.utils.iterators.xmliter_lxml` now supports indicating the node namespace with a prefix in the node name, and big files with highly nested trees when using libxml2 2.7+.

  - Fixed regular expressions in the implementation of the :func:`~scrapy.utils.response.open_in_browser` function.

  Please see the `cc65-xxvf-f7r9 security advisory`_ for more information.

  .. _ReDoS vulnerabilities: https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS
  .. _cc65-xxvf-f7r9 security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-cc65-xxvf-f7r9

- :setting:`DOWNLOAD_MAXSIZE` and :setting:`DOWNLOAD_WARNSIZE` now also apply to the decompressed response body (see the sketch after these notes). Please see the `7j7m-v7m3-jqm7 security advisory`_ for more information.

  .. _7j7m-v7m3-jqm7 security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-7j7m-v7m3-jqm7

- Also in relation to the `7j7m-v7m3-jqm7 security advisory`_, the deprecated ``scrapy.downloadermiddlewares.decompression`` module has been removed.

- The ``Authorization`` header is now dropped on redirects to a different domain. Please see the `cw9j-q3vf-hrrv security advisory`_ for more information.

  .. _cw9j-q3vf-hrrv security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-cw9j-q3vf-hrrv

Modified requirements
~~~~~~~~~~~~~~~~~~~~~

- The Twisted dependency is no longer restricted to < 23.8.0. (:issue:`6024`, :issue:`6064`, :issue:`6142`)

Bug fixes
~~~~~~~~~

- The OS signal handling code was refactored to no longer use private Twisted functions. (:issue:`6024`, :issue:`6064`, :issue:`6112`)

Documentation
~~~~~~~~~~~~~

- Improved documentation for :class:`~scrapy.crawler.Crawler` initialization changes made in the 2.11.0 release. (:issue:`6057`, :issue:`6147`)
- Extended documentation for :attr:`Request.meta <scrapy.http.Request.meta>`. (:issue:`5565`)
- Fixed the :reqmeta:`dont_merge_cookies` documentation. (:issue:`5936`, :issue:`6077`)
- Added a link to Zyte's export guides to the :ref:`feed exports <topics-feed-exports>` documentation. (:issue:`6183`)
- Added a missing note about backward-incompatible changes in :class:`~scrapy.exporters.PythonItemExporter` to the 2.11.0 release notes. (:issue:`6060`, :issue:`6081`)
- Added a missing note about the removal of the deprecated ``scrapy.utils.boto.is_botocore()`` function to the 2.8.0 release notes. (:issue:`6056`, :issue:`6061`)
- Other documentation improvements. (:issue:`6128`, :issue:`6144`, :issue:`6163`, :issue:`6190`, :issue:`6192`)

Quality assurance
~~~~~~~~~~~~~~~~~

- Added Python 3.12 to the CI configuration and re-enabled tests that had been disabled when pre-release support was added. (:issue:`5985`, :issue:`6083`, :issue:`6098`)
- Fixed a test issue on PyPy 7.3.14. (:issue:`6204`, :issue:`6205`)
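To make the :setting:`DOWNLOAD_MAXSIZE` change above concrete, a hedged ``settings.py`` sketch (the values are illustrative, not Scrapy's defaults)::

    # settings.py -- values are illustrative, not Scrapy's defaults.
    # Since 2.11.1 both limits also apply to the *decompressed* response
    # body, so a small compressed "bomb" can no longer bypass them.
    DOWNLOAD_WARNSIZE = 32 * 1024 * 1024    # log a warning above 32 MiB
    DOWNLOAD_MAXSIZE = 100 * 1024 * 1024    # abort the download above 100 MiB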
py-scrapy: Update to 2.11.0

upstream changes:
-----------------
* 2.11.0: https://docs.scrapy.org/en/latest/news.html#scrapy-2-11-0-2023-09-18
* 2.10.0: https://docs.scrapy.org/en/2.10/news.html#scrapy-2-10-0-2023-08-04
py-ZopeInterface: moved to py-zope.interface
py-scrapy: updated to 2.9.0

Scrapy 2.9.0 (2023-05-08)
-------------------------

Highlights:

- Per-domain download settings.
- Compatibility with new cryptography_ and new parsel_.
- JMESPath selectors from the new parsel_.
- Bug fixes.

Deprecations
~~~~~~~~~~~~

- :class:`scrapy.extensions.feedexport._FeedSlot` is renamed to :class:`scrapy.extensions.feedexport.FeedSlot` and the old name is deprecated. (:issue:`5876`)

New features
~~~~~~~~~~~~

- Settings corresponding to :setting:`DOWNLOAD_DELAY`, :setting:`CONCURRENT_REQUESTS_PER_DOMAIN` and :setting:`RANDOMIZE_DOWNLOAD_DELAY` can now be set on a per-domain basis via the new :setting:`DOWNLOAD_SLOTS` setting (see the sketch after these notes). (:issue:`5328`)
- Added :meth:`TextResponse.jmespath`, a shortcut for JMESPath selectors available since parsel_ 1.8.1. (:issue:`5894`, :issue:`5915`)
- Added :signal:`feed_slot_closed` and :signal:`feed_exporter_closed` signals. (:issue:`5876`)
- Added :func:`scrapy.utils.request.request_to_curl`, a function to produce a curl command from a :class:`~scrapy.Request` object. (:issue:`5892`)
- Values of :setting:`FILES_STORE` and :setting:`IMAGES_STORE` can now be :class:`pathlib.Path` instances. (:issue:`5801`)

Bug fixes
~~~~~~~~~

- Fixed a warning with Parsel 1.8.1+. (:issue:`5903`, :issue:`5918`)
- Fixed an error when using feed post-processing with S3 storage. (:issue:`5500`, :issue:`5581`)
- Added the missing :meth:`scrapy.settings.BaseSettings.setdefault` method. (:issue:`5811`, :issue:`5821`)
- Fixed an error when using cryptography_ 40.0.0+ with :setting:`DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING` enabled. (:issue:`5857`, :issue:`5858`)
- The checksums returned by :class:`~scrapy.pipelines.files.FilesPipeline` for files on Google Cloud Storage are no longer Base64-encoded. (:issue:`5874`, :issue:`5891`)
- :func:`scrapy.utils.request.request_from_curl` now supports $-prefixed string values for the curl ``--data-raw`` argument, which browsers produce for data that includes certain symbols. (:issue:`5899`, :issue:`5901`)
- The :command:`parse` command now also works with async generator callbacks. (:issue:`5819`, :issue:`5824`)
- The :command:`genspider` command now properly works with HTTPS URLs. (:issue:`3553`, :issue:`5808`)
- Improved handling of asyncio loops. (:issue:`5831`, :issue:`5832`)
- :class:`LinkExtractor <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor>` now skips certain malformed URLs instead of raising an exception. (:issue:`5881`)
- :func:`scrapy.utils.python.get_func_args` now supports more types of callables. (:issue:`5872`, :issue:`5885`)
- Fixed an error when processing non-UTF-8 values of ``Content-Type`` headers. (:issue:`5914`, :issue:`5917`)
- Fixed an error breaking user handling of send failures in :meth:`scrapy.mail.MailSender.send()`. (:issue:`1611`, :issue:`5880`)

Documentation
~~~~~~~~~~~~~

- Expanded contributing docs. (:issue:`5109`, :issue:`5851`)
- Added blacken-docs_ to pre-commit and reformatted the docs with it. (:issue:`5813`, :issue:`5816`)
- Fixed a JS issue. (:issue:`5875`, :issue:`5877`)
- Fixed ``make htmlview``. (:issue:`5878`, :issue:`5879`)
- Fixed typos and other small errors. (:issue:`5827`, :issue:`5839`, :issue:`5883`, :issue:`5890`, :issue:`5895`, :issue:`5904`)

Quality assurance
~~~~~~~~~~~~~~~~~

- Extended typing hints. (:issue:`5805`, :issue:`5889`, :issue:`5896`)
- Tests for most of the examples in the docs are now run as part of CI, and the problems found were fixed. (:issue:`5816`, :issue:`5826`, :issue:`5919`)
- Removed usage of deprecated Python classes. (:issue:`5849`)
- Silenced ``include-ignored`` warnings from coverage. (:issue:`5820`)
- Fixed a random failure of the ``test_feedexport.test_batch_path_differ`` test. (:issue:`5855`, :issue:`5898`)
- Updated docstrings to match output produced by parsel_ 1.8.1 so that they don't cause test failures. (:issue:`5902`, :issue:`5919`)
- Other CI and pre-commit improvements. (:issue:`5802`, :issue:`5823`, :issue:`5908`)
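A sketch of the new :setting:`DOWNLOAD_SLOTS` setting mentioned above, assuming hypothetical domain names and values::

    # settings.py -- domains and numbers are hypothetical.
    # Per-slot (per-domain) overrides for the global DOWNLOAD_DELAY,
    # CONCURRENT_REQUESTS_PER_DOMAIN and RANDOMIZE_DOWNLOAD_DELAY values.
    DOWNLOAD_SLOTS = {
        "api.example.com": {"delay": 2.0, "concurrency": 2, "randomize_delay": False},
        "static.example.com": {"delay": 0.1, "concurrency": 16},
    }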
py-scrapy: updated to 2.8.0

Scrapy 2.8.0 (2023-02-02)
-------------------------

This is a maintenance release, with minor features, bug fixes, and cleanups.

Deprecation removals
~~~~~~~~~~~~~~~~~~~~

- The ``scrapy.utils.gz.read1`` function, deprecated in Scrapy 2.0, has now been removed. Use the :meth:`~io.BufferedIOBase.read1` method of :class:`~gzip.GzipFile` instead.
- The ``scrapy.utils.python.to_native_str`` function, deprecated in Scrapy 2.0, has now been removed. Use :func:`scrapy.utils.python.to_unicode` instead.
- The ``scrapy.utils.python.MutableChain.next`` method, deprecated in Scrapy 2.0, has now been removed. Use :meth:`~scrapy.utils.python.MutableChain.__next__` instead.
- The ``scrapy.linkextractors.FilteringLinkExtractor`` class, deprecated in Scrapy 2.0, has now been removed. Use :class:`LinkExtractor <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor>` instead.
- Support for using environment variables prefixed with ``SCRAPY_`` to override settings, deprecated in Scrapy 2.0, has now been removed.
- Support for the ``noconnect`` query string argument in proxy URLs, deprecated in Scrapy 2.0, has now been removed. We expect proxies that used to need it to work fine without it.
- The ``scrapy.utils.python.retry_on_eintr`` function, deprecated in Scrapy 2.3, has now been removed.
- The ``scrapy.utils.python.WeakKeyCache`` class, deprecated in Scrapy 2.4, has now been removed.

Deprecations
~~~~~~~~~~~~

- :exc:`scrapy.pipelines.images.NoimagesDrop` is now deprecated.
- :meth:`ImagesPipeline.convert_image <scrapy.pipelines.images.ImagesPipeline.convert_image>` must now accept a ``response_body`` parameter.

New features
~~~~~~~~~~~~

- Applied black_ coding style to files generated with the :command:`genspider` and :command:`startproject` commands.

  .. _black: https://black.readthedocs.io/en/stable/

- :setting:`FEED_EXPORT_ENCODING` is now set to ``"utf-8"`` in the ``settings.py`` file that the :command:`startproject` command generates. With this value, JSON exports won't force the use of escape sequences for non-ASCII characters.
- The :class:`~scrapy.extensions.memusage.MemoryUsage` extension now logs the peak memory usage during checks, and the binary unit MiB is now used to avoid confusion.
- The ``callback`` parameter of :class:`~scrapy.http.Request` can now be set to :func:`scrapy.http.request.NO_CALLBACK` to distinguish it from ``None``, as the latter indicates that the default spider callback (:meth:`~scrapy.Spider.parse`) is to be used (see the sketch after these notes).

Bug fixes
~~~~~~~~~

- Enabled unsafe legacy SSL renegotiation to fix access to some outdated websites.
- Fixed STARTTLS-based email delivery not working with Twisted 21.2.0 and later.
- Fixed the :meth:`finish_exporting` method of :ref:`item exporters <topics-exporters>` not being called for empty files.
- Fixed HTTP/2 responses getting only the last value for a header when multiple headers with the same name are received.
- Fixed an exception raised by the :command:`shell` command in some cases when :ref:`using asyncio <using-asyncio>`.
- When using :class:`~scrapy.spiders.CrawlSpider`, callback keyword arguments (``cb_kwargs``) added to a request in the ``process_request`` callback of a :class:`~scrapy.spiders.Rule` are no longer ignored.
- The :ref:`images pipeline <images-pipeline>` no longer re-encodes JPEG files.
- Fixed the handling of transparent WebP images by the :ref:`images pipeline <images-pipeline>`.
- :func:`scrapy.shell.inspect_response` no longer inhibits ``SIGINT`` (Ctrl+C).
- :class:`LinkExtractor <scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor>` with ``unique=False`` no longer filters out links that have identical URL *and* text.
- :class:`~scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware` now ignores URL protocols that do not support ``robots.txt`` (``data://``, ``file://``).
- Silenced the ``filelock`` debug log messages introduced in Scrapy 2.6.
- Fixed the output of ``scrapy -h`` showing an unintended ``**commands**`` line.
- Made the active project indication in the output of :ref:`commands <topics-commands>` clearer.

Documentation
~~~~~~~~~~~~~

- Documented how to :ref:`debug spiders from Visual Studio Code <debug-vscode>`.
- Documented how :setting:`DOWNLOAD_DELAY` affects per-domain concurrency.
- Improved consistency.
- Fixed typos.

Quality assurance
~~~~~~~~~~~~~~~~~

- Applied the :ref:`black coding style <coding-style>`, sorted import statements, and introduced :ref:`pre-commit <scrapy-pre-commit>`.
- Switched from :mod:`os.path` to :mod:`pathlib`.
- Addressed many issues reported by Pylint.
- Improved code readability.
- Improved package metadata.
- Removed direct invocations of ``setup.py``.
- Removed unnecessary :class:`~collections.OrderedDict` usages.
- Removed unnecessary ``__str__`` definitions.
- Removed obsolete code and comments.
- Fixed test and CI issues.
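A minimal sketch of the ``NO_CALLBACK`` sentinel introduced above (the URL is hypothetical)::

    import scrapy
    from scrapy.http.request import NO_CALLBACK

    # callback=None still means "use the default Spider.parse callback";
    # NO_CALLBACK explicitly means "this request has no callback", e.g. for
    # requests sent outside the regular spider callback flow.
    request = scrapy.Request("https://example.com/resource", callback=NO_CALLBACK)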
fighting a losing battle against the py-cryptography rustification, part 5

Convert py-OpenSSL users to versioned_dependencies.mk.
fighting a losing battle against py-cryptography rustification, part 2

Switch users to versioned_dependencies.mk.
python: egg.mk: add USE_PKG_RESOURCES flag

This flag should be set for packages that import pkg_resources and thus need setuptools after the build step. Set this flag for packages that need it and bump PKGREVISION.
*: bump PKGREVISION for egg.mk users

They now have a tool dependency on py-setuptools instead of a DEPENDS.
py-scrapy: Switch to PYTHON_VERSIONS_INCOMPATIBLE.
py-scrapy: Update to 2.4.1

upstream changes:
-----------------
A lot of changes, listed at https://github.com/scrapy/scrapy/blob/master/docs/news.rst
py-scrapy: updated to 1.8.0

Scrapy 1.8.0:

Highlights:
* Dropped Python 3.4 support and updated minimum requirements; made Python 3.8 support official
* New :meth:`Request.from_curl <scrapy.http.Request.from_curl>` class method (see the sketch after these notes)
* New :setting:`ROBOTSTXT_PARSER` and :setting:`ROBOTSTXT_USER_AGENT` settings
* New :setting:`DOWNLOADER_CLIENT_TLS_CIPHERS` and :setting:`DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING` settings
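A quick sketch of the new ``Request.from_curl`` class method (the curl command is hypothetical)::

    from scrapy import Request

    # Builds a Request from a browser "Copy as cURL" string; the URL,
    # method, headers and body are parsed out of the command.
    request = Request.from_curl(
        "curl 'https://example.com/api' -H 'Accept: application/json'"
    )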
py-scrapy: updated to 1.7.3

Scrapy 1.7.3:
Enforce lxml 4.3.5 or lower for Python 3.4 (issue 3912, issue 3918).

Scrapy 1.7.2:
Fix Python 2 support (issue 3889, issue 3893, issue 3896).

Scrapy 1.7.1:
Re-packaging of Scrapy 1.7.0, which was missing some changes in PyPI.

Scrapy 1.7.0:

Highlights:
- Improvements for crawls targeting multiple domains
- A cleaner way to pass arguments to callbacks
- A new class for JSON requests
- Improvements for rule-based spiders
- New features for feed exports

Backward-incompatible changes:
- 429 is now part of the RETRY_HTTP_CODES setting by default. This change is backward incompatible; if you don't want to retry 429, you must override RETRY_HTTP_CODES accordingly.
- Crawler, CrawlerRunner.crawl and CrawlerRunner.create_crawler no longer accept a Spider subclass instance; they only accept a Spider subclass now. Spider subclass instances were never meant to work, and they were not working as one would expect: instead of using the passed Spider subclass instance, their from_crawler method was called to generate a new instance.
- Non-default values for the SCHEDULER_PRIORITY_QUEUE setting may stop working. Scheduler priority queue classes now need to handle Request objects instead of arbitrary Python data structures.

New features:
- A new scheduler priority queue, scrapy.pqueues.DownloaderAwarePriorityQueue, may be enabled for a significant scheduling improvement on crawls targeting multiple web domains, at the cost of no CONCURRENT_REQUESTS_PER_IP support (issue 3520)
- A new Request.cb_kwargs attribute provides a cleaner way to pass keyword arguments to callback methods (issue 1138, issue 3563); see the sketch after these notes
- A new JSONRequest class offers a more convenient way to build JSON requests (issue 3504, issue 3505)
- A process_request callback passed to the Rule constructor now receives the Response object that originated the request as its second argument (issue 3682)
- A new restrict_text parameter for the LinkExtractor constructor allows filtering links by linking text (issue 3622, issue 3635)
- A new FEED_STORAGE_S3_ACL setting allows defining a custom ACL for feeds exported to Amazon S3 (issue 3607)
- A new FEED_STORAGE_FTP_ACTIVE setting allows using FTP's active connection mode for feeds exported to FTP servers (issue 3829)
- A new METAREFRESH_IGNORE_TAGS setting allows overriding which HTML tags are ignored when searching a response for HTML meta tags that trigger a redirect (issue 1422, issue 3768)
- A new redirect_reasons request meta key exposes the reason (status code, meta refresh) behind every followed redirect (issue 3581, issue 3687)
- The SCRAPY_CHECK variable is now set to the true string during runs of the check command, which allows detecting contract check runs from code (issue 3704, issue 3739)
- A new Item.deepcopy() method makes it easier to deep-copy items (issue 1493, issue 3671)
- CoreStats also logs elapsed_time_seconds now (issue 3638)
- Exceptions from ItemLoader input and output processors are now more verbose (issue 3836, issue 3840)
- Crawler, CrawlerRunner.crawl and CrawlerRunner.create_crawler now fail gracefully if they receive a Spider subclass instance instead of the subclass itself (issue 2283, issue 3610, issue 3872)

Bug fixes:
- process_spider_exception() is now also invoked for generators (issue 220, issue 2061)
- System exceptions like KeyboardInterrupt are no longer caught (issue 3726)
- ItemLoader.load_item() no longer makes later calls to ItemLoader.get_output_value() or ItemLoader.load_item() return empty data (issue 3804, issue 3819)
- The images pipeline (ImagesPipeline) no longer ignores these Amazon S3 settings: AWS_ENDPOINT_URL, AWS_REGION_NAME, AWS_USE_SSL, AWS_VERIFY (issue 3625)
- Fixed a memory leak in MediaPipeline affecting, for example, non-200 responses and exceptions from custom middlewares (issue 3813)
- Requests with private callbacks are now correctly unserialized from disk (issue 3790)
- FormRequest.from_response() now handles invalid methods like major web browsers
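A minimal sketch of the new ``Request.cb_kwargs`` attribute noted above (the spider, URLs and argument names are hypothetical)::

    import scrapy

    class PagesSpider(scrapy.Spider):
        # Hypothetical spider for illustration only.
        name = "pages"
        start_urls = ["https://example.com/"]

        def parse(self, response):
            # cb_kwargs passes keyword arguments straight to the callback,
            # instead of smuggling them through response.meta.
            yield scrapy.Request(
                response.urljoin("/page/2/"),
                callback=self.parse_page,
                cb_kwargs={"page_number": 2},
            )

        def parse_page(self, response, page_number):
            self.logger.info("parsed page %d", page_number)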
py-scrapy: updated to 1.6.0

Scrapy 1.6.0:

Highlights:
* better Windows support;
* Python 3.7 compatibility;
* big documentation improvements, including a switch from the .extract_first() + .extract() API to the .get() + .getall() API (see the sketch after these notes);
* feed exports, FilePipeline and MediaPipeline improvements;
* better extensibility: :signal:`item_error` and :signal:`request_reached_downloader` signals; from_crawler support for feed exporters, feed storages and dupefilters;
* scrapy.contracts fixes and new features;
* telnet console security improvements, first released as a backport in :ref:`release-1.5.2`;
* clean-up of the deprecated code;
* various bug fixes, small new features and usability improvements across the codebase.
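To illustrate the documented API switch, a small selector sketch (the HTML snippet is hypothetical)::

    from scrapy.selector import Selector

    sel = Selector(text="<ul><li>a</li><li>b</li></ul>")

    sel.css("li::text").extract_first()  # old API -> "a"
    sel.css("li::text").get()            # new API -> "a"
    sel.css("li::text").extract()        # old API -> ["a", "b"]
    sel.css("li::text").getall()         # new API -> ["a", "b"]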
py-scrapy: updated to 1.5.2

Scrapy 1.5.2:

* *Security bugfix*: the Telnet console extension could be easily exploited by rogue websites POSTing content to http://localhost:6023. We haven't found a way to exploit it from Scrapy, but it is very easy to trick a browser into doing so, which elevates the risk for local development environments. *The fix is backwards incompatible*: it enables telnet user-password authentication by default, with a randomly generated password. If you can't upgrade right away, please consider setting :setting:`TELNET_CONSOLE_PORT` to something other than its default value (see the sketch after these notes). See the :ref:`telnet console <topics-telnetconsole>` documentation for more info.
* Backported a fix for a CI build failure under the GCE environment due to a boto import error.
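For the mitigation mentioned above, a one-line ``settings.py`` sketch (the port range is an arbitrary example)::

    # settings.py -- the port range is an arbitrary example.
    # Moving the telnet console off its well-known default makes the
    # browser-based attack described above harder to aim.
    TELNET_CONSOLE_PORT = [20023, 20073]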
py-scrapy: updated to 1.5.1

Scrapy 1.5.1:

This is a maintenance release with important bug fixes, but no new features:

* O(N^2) gzip decompression issue which affected Python 3 and PyPy is fixed
* skipping of TLS validation errors is improved
* Ctrl-C handling is fixed in Python 3.5+
* testing fixes
* documentation improvements
py-scrapy: updated to 1.5.0

Scrapy 1.5.0:

This release brings small new features and improvements across the codebase. Some highlights:

* Google Cloud Storage is supported in FilesPipeline and ImagesPipeline.
* Crawling with proxy servers becomes more efficient, as connections to proxies can be reused now.
* Warnings, exception and logging messages are improved to make debugging easier.
* The scrapy parse command now allows setting custom request meta via the --meta argument.
* Compatibility with Python 3.6, PyPy and PyPy3 is improved; PyPy and PyPy3 are now supported officially, by running tests on CI.
* Better default handling of HTTP 308, 522 and 524 status codes.
* Documentation is improved, as usual.

Backwards Incompatible Changes:

* Scrapy 1.5 drops support for Python 3.3.
* The default Scrapy User-Agent now uses an https link to scrapy.org. **This is technically backwards-incompatible**; override :setting:`USER_AGENT` if you relied on the old value (see the sketch after these notes).
* Logging of settings overridden by custom_settings is fixed; **this is technically backwards-incompatible** because the logger changes from [scrapy.utils.log] to [scrapy.crawler]. If you're parsing Scrapy logs, please update your log parsers.
* LinkExtractor now ignores the m4v extension by default; this is a change in behavior.
* 522 and 524 status codes are added to RETRY_HTTP_CODES.

New features:
- Support <link> tags in Response.follow
- Support for ptpython REPL
- Google Cloud Storage support for FilesPipeline and ImagesPipeline
- The new --meta option of the "scrapy parse" command allows passing additional request.meta
- Populate the spider variable when using shell.inspect_response
- Handle HTTP 308 Permanent Redirect
- Add 522 and 524 to RETRY_HTTP_CODES
- Log version information at startup
- scrapy.mail.MailSender now works in Python 3 (it requires Twisted 17.9.0)
- Connections to proxy servers are reused
- Add a template for a downloader middleware
- Explicit message for NotImplementedError when the parse callback is not defined
- CrawlerProcess got an option to disable installation of the root log handler
- LinkExtractor now ignores the m4v extension by default
- Better log messages for responses over the :setting:`DOWNLOAD_WARNSIZE` and :setting:`DOWNLOAD_MAXSIZE` limits
- Show a warning when a URL is put in Spider.allowed_domains instead of a domain.

Bug fixes:
- Fix logging of settings overridden by custom_settings; **this is technically backwards-incompatible** because the logger changes from [scrapy.utils.log] to [scrapy.crawler], so please update your log parsers if needed
- The default Scrapy User-Agent now uses an https link to scrapy.org. **This is technically backwards-incompatible**; override :setting:`USER_AGENT` if you relied on the old value.
- Fix PyPy and PyPy3 test failures, support them officially
- Fix DNS resolver when DNSCACHE_ENABLED=False
- Add cryptography for the Debian Jessie tox test env
- Add verification to check if the Request callback is callable
- Port extras/qpsclient.py to Python 3
- Use getfullargspec under the scenes for Python 3 to stop DeprecationWarning
- Update deprecated test aliases
- Fix SitemapSpider support for alternate links
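Since the User-Agent change above is technically backwards-incompatible, a hedged sketch of pinning your own value (the string is illustrative)::

    # settings.py -- the value is illustrative.
    # Override USER_AGENT explicitly if anything downstream relied on the
    # exact default string, which changed in 1.5.0.
    USER_AGENT = "mybot/1.0 (+https://example.com/bot)"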
Follow some redirects.
Scrapy 1.4 does not bring that many breathtaking new features, but quite a few handy improvements nonetheless.

Scrapy now supports anonymous FTP sessions with customizable user and password via the new :setting:`FTP_USER` and :setting:`FTP_PASSWORD` settings. And if you're using Twisted version 17.1.0 or above, FTP is now available with Python 3.

There's a new :meth:`response.follow <scrapy.http.TextResponse.follow>` method for creating requests; **it is now a recommended way to create Requests in Scrapy spiders**. This method makes it easier to write correct spiders; ``response.follow`` has several advantages over creating ``scrapy.Request`` objects directly (see the sketch after this list):

* it handles relative URLs;
* it works properly with non-ASCII URLs on non-UTF-8 pages;
* in addition to absolute and relative URLs, it supports Selectors; for ``<a>`` elements it can also extract their href values.
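A minimal sketch of ``response.follow`` in a spider callback (the spider name and start URL are hypothetical)::

    import scrapy

    class FollowSpider(scrapy.Spider):
        # Hypothetical spider for illustration only.
        name = "follow"
        start_urls = ["https://example.com/"]

        def parse(self, response):
            # response.follow resolves relative URLs against the page and
            # accepts selectors directly; for <a> elements it uses the href.
            for link in response.css("a"):
                yield response.follow(link, callback=self.parse)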
Changes 1.3.3:

Bug fixes
- Make ``SpiderLoader`` raise ``ImportError`` again by default for missing dependencies and wrong :setting:`SPIDER_MODULES`. These exceptions had been silenced as warnings since 1.3.0. A new setting is introduced to toggle between warning and exception if needed; see :setting:`SPIDER_LOADER_WARN_ONLY` for details (sketched below).
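A one-line sketch of the toggle described above, assuming the boolean form implied by the release note::

    # settings.py -- restores the 1.3.0-1.3.2 behavior of warning instead
    # of raising ImportError for broken spider modules.
    SPIDER_LOADER_WARN_ONLY = True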
Added www/py-scrapy version 1.3.2

Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.