To add to my post, I also tried the following in requirements.txt:
git+git://github.com/scrapedia/scrapy-useragents.git
git+https://github.com/scrapedia/scrapy-useragents.git
And my scrapinghub.yml looks like this:
projects:
  default: <myid>
requirements:
  file: requirements.txt
Is anyone actually reading this? This doesn't bode well for when I actually become a paying subscriber :(
Hi MarcA! Hope you're doing well.
Instead of adding the git URL to requirements.txt, can you please try adding "scrapy-useragents" (the package name as published on pip) and let us know if it works?
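For reference, a minimal requirements.txt along those lines would contain just the package name rather than a git URL, for example:
# requirements.txt
scrapy-useragents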
@ednei.bach, thanks! That seems to do something, but now I get different errors; see my log.
Is this because it can't contact Splash, which is a separate module on Zyte?
2 questions:
1. Is there a trial available? I just want to make sure I can use Zyte before paying for nothing :)
2. If I add that module, do I keep executing via `http://localhost:8050/execute`, or is there another URL I need to call on your platform?
2021-10-02 12:59:26 INFO Log opened.
2021-10-02 12:59:26 INFO [scrapy.utils.log] Scrapy 2.0.1 started (bot: foobar)
2021-10-02 12:59:26 INFO [scrapy.utils.log] Versions: lxml 4.5.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 19.10.0, Python 3.8.2 (default, Apr 23 2020, 14:32:57) - [GCC 8.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Linux-4.15.0-76-generic-x86_64-with-glibc2.2.5
2021-10-02 12:59:26 INFO [scrapy.crawler] Overridden settings: {'AUTOTHROTTLE_ENABLED': True, 'BOT_NAME': 'foobar', 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter', 'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage', 'LOG_ENABLED': False, 'LOG_LEVEL': 'INFO', 'MEMUSAGE_LIMIT_MB': 950, 'NEWSPIDER_MODULE': 'foobar.spiders', 'SPIDER_MODULES': ['foobar.spiders'], 'STATS_CLASS': 'sh_scrapy.stats.HubStorageStatsCollector', 'TELNETCONSOLE_HOST': '0.0.0.0'}
2021-10-02 12:59:26 INFO [scrapy.extensions.telnet] Telnet Password: <anonymized>
2021-10-02 12:59:27 INFO [scrapy.middleware] Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.logstats.LogStats', 'scrapy.extensions.spiderstate.SpiderState', 'scrapy.extensions.throttle.AutoThrottle', 'scrapy.extensions.debug.StackTraceDump', 'sh_scrapy.extension.HubstorageExtension']
2021-10-02 12:59:27 INFO [scrapy.middleware] Enabled downloader middlewares: ['sh_scrapy.diskquota.DiskQuotaDownloaderMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy_useragents.downloadermiddlewares.useragents.UserAgentsMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy_splash.SplashCookiesMiddleware', 'scrapy_splash.SplashMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats', 'sh_scrapy.middlewares.HubstorageDownloaderMiddleware']
2021-10-02 12:59:27 INFO [scrapy.middleware] Enabled spider middlewares: ['sh_scrapy.diskquota.DiskQuotaSpiderMiddleware', 'sh_scrapy.middlewares.HubstorageSpiderMiddleware', 'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy_splash.SplashDeduplicateArgsMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-10-02 12:59:27 INFO [scrapy.middleware] Enabled item pipelines: []
2021-10-02 12:59:27 INFO [scrapy.core.engine] Spider opened
2021-10-02 12:59:27 INFO [scrapy.extensions.logstats] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-10-02 12:59:27 INFO TelnetConsole starting on 6023
2021-10-02 12:59:27 INFO [scrapy.extensions.telnet] Telnet console listening on 0.0.0.0:6023
2021-10-02 12:59:27 WARNING [py.warnings] /usr/local/lib/python3.8/site-packages/scrapy_splash/request.py:41: ScrapyDeprecationWarning: Call to deprecated function to_native_str. Use to_unicode instead.
  url = to_native_str(url)
2021-10-02 12:59:41 ERROR [scrapy.downloadermiddlewares.retry] Gave up retrying <GET https://www.example.com/allobjects via http://localhost:8050/execute> (failed 3 times): Connection was refused by other side: 111: Connection refused.
2021-10-02 12:59:41 ERROR [scrapy.core.scraper] Error downloading <GET https://www.example.com/allobjects via http://localhost:8050/execute>
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/scrapy/core/downloader/middleware.py", line 42, in process_request
    defer.returnValue((yield download_func(request=request, spider=spider)))
twisted.internet.error.ConnectionRefusedError: Connection was refused by other side: 111: Connection refused.
2021-10-02 12:59:41 INFO [scrapy.core.engine] Closing spider (finished)
2021-10-02 12:59:41 INFO [scrapy.statscollectors] Dumping Scrapy stats: {'downloader/exception_count': 3, 'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 3, 'downloader/request_bytes': 3813, 'downloader/request_count': 3, 'downloader/request_method_count/POST': 3, 'elapsed_time_seconds': 14.797728, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2021, 10, 2, 12, 59, 41, 913131), 'log_count/ERROR': 2, 'log_count/INFO': 10, 'log_count/WARNING': 1, 'memusage/max': 62865408, 'memusage/startup': 62865408, 'retry/count': 2, 'retry/max_reached': 1, 'retry/reason_count/twisted.internet.error.ConnectionRefusedError': 2, 'scheduler/dequeued': 4, 'scheduler/dequeued/disk': 4, 'scheduler/enqueued': 4, 'scheduler/enqueued/disk': 4, 'splash/execute/request_count': 1, 'start_time': datetime.datetime(2021, 10, 2, 12, 59, 27, 115403)}
2021-10-02 12:59:41 INFO [scrapy.core.engine] Spider closed (finished)
2021-10-02 12:59:41 INFO Main loop terminated.
You're welcome! :)
Yes, this error is because it cannot contact Splash, which needs to be running as an instance, either locally or in Scrapy Cloud.
1) Yes, there are trials for both our Smart Proxy Manager and our Zyte Data API. Since you are trying to scrape using a headless browser (Splash), we would suggest trying out our Zyte Data API, which should do a similar job and is available on a trial. You can read more about it here: https://docs.zyte.com/zyte-api/get-started.html
For Zyte Data API, you can sign up here: https://app.zyte.com/account/signup/zyteapi
For Smart Proxy Manager, you can sign up in the "Smart Proxy Manager" tab, right below the tools on the left of your Zyte dashboard: https://app.zyte.com/
Note that unfortunately we don't have trials for Splash or Scrapy Cloud. You can, however, run a job in Scrapy Cloud for free for up to 1 hour, which should be enough for testing purposes.
2) This is not exactly a module but rather a headless browser. You can think of it as something similar to Selenium, for example. Once you have it set up, you can simply point to its IP and port as the endpoint. You can find information on what it is and how to set it up locally here:
https://www.zyte.com/blog/handling-javascript-in-scrapy-with-splash/
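To give a rough idea of the local setup, here is a minimal sketch of a spider wired up with scrapy-splash against a local instance. The spider name, target URL, and wait time are placeholders, so please treat this as an illustration rather than a drop-in solution:

import scrapy
from scrapy_splash import SplashRequest


class ExampleSpider(scrapy.Spider):
    name = "example"  # placeholder spider name

    # These settings normally live in settings.py; shown inline for brevity.
    custom_settings = {
        "SPLASH_URL": "http://localhost:8050",  # local Splash instance (IP and port)
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy_splash.SplashCookiesMiddleware": 723,
            "scrapy_splash.SplashMiddleware": 725,
            "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
        },
        "SPIDER_MIDDLEWARES": {
            "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
        },
        "DUPEFILTER_CLASS": "scrapy_splash.SplashAwareDupeFilter",
    }

    def start_requests(self):
        # render.html returns the JavaScript-rendered page; the execute endpoint
        # runs a Lua script instead.
        yield SplashRequest(
            "https://www.example.com/allobjects",  # placeholder URL, as in the log above
            callback=self.parse,
            endpoint="render.html",
            args={"wait": 2},  # give the page a couple of seconds to render
        )

    def parse(self, response):
        self.logger.info("Rendered title: %s", response.css("title::text").get())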
Is this information helpful?
Hi Ednei,
Thanks.
I see that the Zyte Data API (after the 14-day trial) starts at $100/month, which is too expensive for me.
I do see that Splash starts at $25/month, so looking at my requirements, would a Splash subscription + Scrapy Cloud be sufficient? (And then later optionally add Proxy Manager if I like everything.)
Please let me know.
Hi MarcA,
Happy Monday!
It would depend a lot on the target website you would like to scrape. Without any proxies or our Smart Proxy Manager, you would likely face a lot of bans, depending on whether the website has anti-bot protection. We would first suggest trying it out locally with Splash (you can set it up locally using this guide). If it works locally and is sustainable, then it should work well in Scrapy Cloud with Splash. Otherwise, you would have to resort to a solution to avoid bans, such as Smart Proxy Manager or Zyte Data API.
Hope this helps.
Thanks again. Yes, Splash works locally. Could you point me to the relevant documentation so I know which Splash URL I need to call from my spider on Scrapy Cloud? (As you can see above, right now I use `http://localhost:8050/`.)
Hi MarcA,
You can refer to this article here. You will need to modify your Splash URL as well as use your API key, both of which are supplied in the dashboard once you have a Splash unit:
Can you please let us know if this helps?
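In the meantime, purely as a rough illustration (the host below is a placeholder; the real URL and API key come from your dashboard once you have a Splash unit), the change usually amounts to a couple of settings:

# settings.py - placeholder values only
SPLASH_URL = "https://<your-splash-instance>.example.com"  # hosted Splash endpoint instead of http://localhost:8050
SPLASH_USER = "<your-api-key>"  # scrapy-splash sends this as the HTTP Basic auth username
SPLASH_PASS = ""                # password is typically left blank when authenticating with an API key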
Hi MarcA,
Happy Friday!
Have you tried our suggestions? How are things going on your side?
I'm new here. My Scrapy spider, which I deployed from my local machine to Zyte Cloud, results in the output below.
I checked this page https://support.zyte.com/support/solutions/articles/22000200400-deploying-python-dependencies-for-your-projects-in-scrapy-cloud
I then added the line `git+git://github.com/scrapedia/scrapy-useragents` to requirements.txt (currently the only line), but the same error with the same output is generated.
What am I doing wrong?
File "/usr/local/lib/python3.8/site-packages/scrapy/crawler.py", line 177, in crawl
return self._crawl(crawler, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/scrapy/crawler.py", line 181, in _crawl
d = crawler.crawl(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator
return _cancellableInlineCallbacks(gen)
File "/usr/local/lib/python3.8/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
_inlineCallbacks(None, g, status)
--- <exception caught here> ---
File "/usr/local/lib/python3.8/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
result = g.send(result)
File "/usr/local/lib/python3.8/site-packages/scrapy/crawler.py", line 89, in crawl
self.engine = self._create_engine()
File "/usr/local/lib/python3.8/site-packages/scrapy/crawler.py", line 103, in _create_engine
return ExecutionEngine(self, lambda _: self.stop())
File "/usr/local/lib/python3.8/site-packages/scrapy/core/engine.py", line 69, in __init__
self.downloader = downloader_cls(crawler)
File "/usr/local/lib/python3.8/site-packages/scrapy/core/downloader/__init__.py", line 83, in __init__
self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
File "/usr/local/lib/python3.8/site-packages/scrapy/middleware.py", line 53, in from_crawler
return cls.from_settings(crawler.settings, crawler)
File "/usr/local/lib/python3.8/site-packages/scrapy/middleware.py", line 34, in from_settings
mwcls = load_object(clspath)
File "/usr/local/lib/python3.8/site-packages/scrapy/utils/misc.py", line 50, in load_object
mod = import_module(module)
File "/usr/local/lib/python3.8/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
File "<frozen importlib._bootstrap>", line 991, in _find_and_load
File "<frozen importlib._bootstrap>", line 961, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
File "<frozen importlib._bootstrap>", line 991, in _find_and_load
File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
builtins.ModuleNotFoundError: No module named 'scrapy_user_agents'