
Unhandled error in Deferred: No module named 'scrapy_user_agents'

I'm new here. My Scrapy spider, which I deployed from my local machine to Zyte Scrapy Cloud, produces the output below.


I checked this page: https://support.zyte.com/support/solutions/articles/22000200400-deploying-python-dependencies-for-your-projects-in-scrapy-cloud

and added the line `git+git://github.com/scrapedia/scrapy-useragents` to requirements.txt (currently the only line in the file). However, the same error with the same output is generated.

What am I doing wrong?


 File "/usr/local/lib/python3.8/site-packages/scrapy/crawler.py", line 177, in crawl

   return self._crawl(crawler, *args, **kwargs)

    File "/usr/local/lib/python3.8/site-packages/scrapy/crawler.py", line 181, in _crawl

   d = crawler.crawl(*args, **kwargs)

    File "/usr/local/lib/python3.8/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator

   return _cancellableInlineCallbacks(gen)

    File "/usr/local/lib/python3.8/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks

   _inlineCallbacks(None, g, status)

  --- <exception caught here> ---

    File "/usr/local/lib/python3.8/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks

   result = g.send(result)

    File "/usr/local/lib/python3.8/site-packages/scrapy/crawler.py", line 89, in crawl

   self.engine = self._create_engine()

    File "/usr/local/lib/python3.8/site-packages/scrapy/crawler.py", line 103, in _create_engine

   return ExecutionEngine(self, lambda _: self.stop())

    File "/usr/local/lib/python3.8/site-packages/scrapy/core/engine.py", line 69, in __init__

   self.downloader = downloader_cls(crawler)

    File "/usr/local/lib/python3.8/site-packages/scrapy/core/downloader/__init__.py", line 83, in __init__

   self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)

    File "/usr/local/lib/python3.8/site-packages/scrapy/middleware.py", line 53, in from_crawler

   return cls.from_settings(crawler.settings, crawler)

    File "/usr/local/lib/python3.8/site-packages/scrapy/middleware.py", line 34, in from_settings

   mwcls = load_object(clspath)

    File "/usr/local/lib/python3.8/site-packages/scrapy/utils/misc.py", line 50, in load_object

   mod = import_module(module)

    File "/usr/local/lib/python3.8/importlib/__init__.py", line 127, in import_module

   return _bootstrap._gcd_import(name[level:], package, level)

    File "<frozen importlib._bootstrap>", line 1014, in _gcd_import

    File "<frozen importlib._bootstrap>", line 991, in _find_and_load

    File "<frozen importlib._bootstrap>", line 961, in _find_and_load_unlocked

    File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed

    File "<frozen importlib._bootstrap>", line 1014, in _gcd_import

    File "<frozen importlib._bootstrap>", line 991, in _find_and_load

    File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked

  builtins.ModuleNotFoundError: No module named 'scrapy_user_agents'


 

 

 

 


To add to my post, I also tried the following in requirements.txt:

git+git://github.com/scrapedia/scrapy-useragents.git
git+https://github.com/scrapedia/scrapy-useragents.git

And my scrapinghub.yml looks like this:

projects:
  default: <myid>
requirements:
  file: requirements.txt

Is anyone actually reading this? This doesn't bode well for when I actually become a paying subscriber :(

Hi MarcA! Hope you're doing well.


Instead of adding a git URL to requirements.txt, can you please try adding "scrapy-useragents" (the package name as published on pip) and let us know if it works?
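For reference, here is a minimal sketch of the two pieces involved. The middleware path is the one the package registers, and the priority of 500 is a typical value; please check the package README for its exact recommendation and for how to supply the user-agent list itself:

# requirements.txt needs only the plain package name on its own line:
#
#     scrapy-useragents
#
# settings.py -- register the rotating user-agent middleware:
DOWNLOADER_MIDDLEWARES = {
    # disable Scrapy's built-in user-agent middleware
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    # enable the middleware shipped by scrapy-useragents
    "scrapy_useragents.downloadermiddlewares.useragents.UserAgentsMiddleware": 500,
}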



@ednei.bach, thanks! That seems to do something, but now I get different errors; see my log below.
Is this because it can't contact Splash, which is a separate module on Zyte?

Two questions:
1. Is there a trial available? I just want to make sure I can use Zyte before paying for nothing :)
2. If I add that module, do I keep executing via `http://localhost:8050/execute`, or is there another URL I need to call on your platform?

2021-10-02 12:59:26 INFO Log opened.
2021-10-02 12:59:26 INFO [scrapy.utils.log] Scrapy 2.0.1 started (bot: foobar)
2021-10-02 12:59:26 INFO [scrapy.utils.log] Versions: lxml 4.5.1.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 19.10.0, Python 3.8.2 (default, Apr 23 2020, 14:32:57) - [GCC 8.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020), cryptography 2.9.2, Platform Linux-4.15.0-76-generic-x86_64-with-glibc2.2.5
2021-10-02 12:59:26 INFO [scrapy.crawler] Overridden settings:
{'AUTOTHROTTLE_ENABLED': True,
 'BOT_NAME': 'foobar',
 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
 'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage',
 'LOG_ENABLED': False,
 'LOG_LEVEL': 'INFO',
 'MEMUSAGE_LIMIT_MB': 950,
 'NEWSPIDER_MODULE': 'foobar.spiders',
 'SPIDER_MODULES': ['foobar.spiders'],
 'STATS_CLASS': 'sh_scrapy.stats.HubStorageStatsCollector',
 'TELNETCONSOLE_HOST': '0.0.0.0'}
2021-10-02 12:59:26 INFO [scrapy.extensions.telnet] Telnet Password: <anonymized>
2021-10-02 12:59:27 INFO [scrapy.middleware] Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.spiderstate.SpiderState',
 'scrapy.extensions.throttle.AutoThrottle',
 'scrapy.extensions.debug.StackTraceDump',
 'sh_scrapy.extension.HubstorageExtension']
2021-10-02 12:59:27 INFO [scrapy.middleware] Enabled downloader middlewares:
['sh_scrapy.diskquota.DiskQuotaDownloaderMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy_useragents.downloadermiddlewares.useragents.UserAgentsMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy_splash.SplashCookiesMiddleware',
 'scrapy_splash.SplashMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats',
 'sh_scrapy.middlewares.HubstorageDownloaderMiddleware']
2021-10-02 12:59:27 INFO [scrapy.middleware] Enabled spider middlewares:
['sh_scrapy.diskquota.DiskQuotaSpiderMiddleware',
 'sh_scrapy.middlewares.HubstorageSpiderMiddleware',
 'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy_splash.SplashDeduplicateArgsMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-10-02 12:59:27 INFO [scrapy.middleware] Enabled item pipelines:
[]
2021-10-02 12:59:27 INFO [scrapy.core.engine] Spider opened
2021-10-02 12:59:27 INFO [scrapy.extensions.logstats] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-10-02 12:59:27 INFO TelnetConsole starting on 6023
2021-10-02 12:59:27 INFO [scrapy.extensions.telnet] Telnet console listening on 0.0.0.0:6023
2021-10-02 12:59:27 WARNING [py.warnings] /usr/local/lib/python3.8/site-packages/scrapy_splash/request.py:41: ScrapyDeprecationWarning: Call to deprecated function to_native_str. Use to_unicode instead.
  url = to_native_str(url)

2021-10-02 12:59:41 ERROR [scrapy.downloadermiddlewares.retry] Gave up retrying <GET https://www.example.com/allobjects via http://localhost:8050/execute> (failed 3 times): Connection was refused by other side: 111: Connection refused.
2021-10-02 12:59:41 ERROR [scrapy.core.scraper] Error downloading <GET https://www.example.com/allobjects via http://localhost:8050/execute>
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/scrapy/core/downloader/middleware.py", line 42, in process_request
    defer.returnValue((yield download_func(request=request, spider=spider)))
twisted.internet.error.ConnectionRefusedError: Connection was refused by other side: 111: Connection refused.
2021-10-02 12:59:41 INFO [scrapy.core.engine] Closing spider (finished)
2021-10-02 12:59:41 INFO [scrapy.statscollectors] Dumping Scrapy stats:
{'downloader/exception_count': 3,
 'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 3,
 'downloader/request_bytes': 3813,
 'downloader/request_count': 3,
 'downloader/request_method_count/POST': 3,
 'elapsed_time_seconds': 14.797728,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 10, 2, 12, 59, 41, 913131),
 'log_count/ERROR': 2,
 'log_count/INFO': 10,
 'log_count/WARNING': 1,
 'memusage/max': 62865408,
 'memusage/startup': 62865408,
 'retry/count': 2,
 'retry/max_reached': 1,
 'retry/reason_count/twisted.internet.error.ConnectionRefusedError': 2,
 'scheduler/dequeued': 4,
 'scheduler/dequeued/disk': 4,
 'scheduler/enqueued': 4,
 'scheduler/enqueued/disk': 4,
 'splash/execute/request_count': 1,
 'start_time': datetime.datetime(2021, 10, 2, 12, 59, 27, 115403)}
2021-10-02 12:59:41 INFO [scrapy.core.engine] Spider closed (finished)
2021-10-02 12:59:41 INFO Main loop terminated.


You're welcome! :)


Yes, this error occurs because the spider cannot contact Splash, which needs to be running as an instance, either locally or in Scrapy Cloud.


1) Yes, there are trials for both our Smart Proxy Manager and our Zyte Data API. Since you are trying to use a headless browser (Splash) to scrape, we would suggest trying out our Zyte Data API, which should do a similar job and offers a trial. You can read more about it here: https://docs.zyte.com/zyte-api/get-started.html


For Zyte Data API, you can sign up here: https://app.zyte.com/account/signup/zyteapi 

For Smart Proxy Manager, you can sign up in the "Smart Proxy Manager" tab, below the tools on the left-hand side of your Zyte dashboard: https://app.zyte.com/

Note that unfortunately we don't have trials for Splash or Scrapy Cloud. However, you can run a job in Scrapy Cloud for free for up to 1 hour, which should be enough for testing purposes.


2) Splash is not exactly a module but a headless browser; you can think of it as something similar to Selenium, for example. Once you have it set up, you simply point to its IP and port as the endpoint. You can find information on what it is and how to set it up (locally) here:

https://www.zyte.com/blog/handling-javascript-in-scrapy-with-splash/
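For illustration, here is a minimal sketch of a spider that sends requests through Splash via scrapy-splash. The spider name, target URL, and Lua script are placeholders, and SPLASH_URL is assumed to be set in settings.py:

import scrapy
from scrapy_splash import SplashRequest

class ExampleSpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        # SPLASH_URL in settings.py determines where this request is sent:
        # http://localhost:8050 locally, or your hosted instance's URL.
        yield SplashRequest(
            "https://www.example.com/allobjects",
            callback=self.parse,
            endpoint="execute",                      # same endpoint you call now
            args={"lua_source": "...", "wait": 2},   # "..." stands for your Lua script
        )

    def parse(self, response):
        self.logger.info("Received %d bytes", len(response.body))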


Is this information helpful?



Hi Ednei,

Thanks.
I see that the Zyte Data API (after the 14-day trial) starts at $100/month, which is too expensive for me.
I do see that Splash starts at $25/month, so, looking at my requirements, would a Splash subscription plus Scrapy Cloud be sufficient? (And then later optionally add Smart Proxy Manager if I like everything.)

Please let me know.

Hi MarcA,


Happy Monday! 


It would depend a lot on the target website you would like to scrape. Without any proxies or our Smart Proxy Manager, you would likely face a lot of bans, depending on whether the website has anti-bot protection. We would first suggest trying it out locally with Splash (you can set it up locally using this guide). If it works locally and is sustainable, then it should work well in Scrapy Cloud and Splash; otherwise, you would have to resort to a solution to avoid bans, such as Smart Proxy Manager or Zyte Data API.
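For reference, the usual way to run Splash locally is with Docker, e.g. `docker run -p 8050:8050 scrapinghub/splash` (assuming Docker is installed; see the Splash documentation for the exact options).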


Hope this helps.



Thanks again. Yes, Splash works locally. Could you point me to the relevant documentation so I know which Splash URL I need to call from my spider on Scrapy Cloud? (As you can see above, right now I use `http://localhost:8050/`.)

Hi MarcA,


You can refer to the article below. You will need to modify your Splash URL and use your API key, both of which are supplied in the dashboard once you have a Splash unit:

Using Scrapy with Splash
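For context, here is a sketch of what those settings changes typically look like. The URL and key below are placeholders (take the real values from your dashboard), and SPLASH_USER/SPLASH_PASS are the scrapy-splash settings for HTTP Basic Auth:

# settings.py -- placeholder values only; copy the real ones from the dashboard.
SPLASH_URL = "https://<your-splash-instance-url>"

# The hosted Splash authenticates with your API key as the HTTP Basic Auth
# username and an empty password.
SPLASH_USER = "<your-api-key>"
SPLASH_PASS = ""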


Can you please let us know if this helps?

Hi MarcA,


Happy Friday!


Have you tried our suggestions? How are things going on your side?
