Monitoring Web Scrapers with Spidermon: A Guide for Customers

Modified on Mon, 19 Jun, 2023 at 3:35 PM

At Zyte, we understand the importance of maintaining the quality and reliability of your web scrapers. To ensure smooth data collection and timely delivery, we have implemented a monitoring system using the Spidermon framework. This document explains how we implement monitors and the overall process we follow when monitoring your web scrapers.

Understanding the Challenges:
1. Initially, a web scraper is created and undergoes a few crawls to gather production data. This helps us identify the specific challenges faced by the scraper during data collection.
2. Challenges vary between scrapers. For example, one scraper might run for too long, prompting us to add a running-time monitor. Another might hit anti-bot measures on the target site (mechanisms a website uses to block automated programs such as web scrapers while still admitting human visitors), leading us to create monitors for the specific error codes those measures produce.
3. Spidermon provides basic built-in monitors, and we also develop custom monitors when specific subsets of data are critical to your requirements. For the complete list of built-in monitors, see the "Monitoring your jobs" page of the Spidermon documentation. Please reach out to us if you have any questions about the monitors listed on that page.
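
To make the idea of a running-time monitor and an error-code monitor more concrete, here is a minimal sketch of the kind of checks such monitors perform. It is written as plain Python over a Scrapy-style stats dictionary rather than with Spidermon's own classes, so it shows only the logic and not how our monitors are actually built. The function name `check_job_stats` and all threshold values are hypothetical, chosen only for illustration.

```python
from typing import Dict, List

# Hypothetical thresholds; real values depend on the individual scraper.
MAX_RUNTIME_SECONDS = 6 * 60 * 60
BLOCKED_STATUS_CODES = (403, 429)
MAX_BLOCKED_RESPONSES = 50


def check_job_stats(stats: Dict[str, float]) -> List[str]:
    """Return a list of human-readable problems found in a finished job's stats."""
    problems = []

    # Running-time check: flag jobs that ran longer than expected.
    elapsed = stats.get("elapsed_time_seconds", 0)
    if elapsed > MAX_RUNTIME_SECONDS:
        problems.append(
            f"job ran for {elapsed:.0f}s, limit is {MAX_RUNTIME_SECONDS}s"
        )

    # Anti-bot check: count responses whose status codes often signal blocking.
    blocked = sum(
        stats.get(f"downloader/response_status_count/{code}", 0)
        for code in BLOCKED_STATUS_CODES
    )
    if blocked > MAX_BLOCKED_RESPONSES:
        problems.append(
            f"{blocked} responses with status codes {BLOCKED_STATUS_CODES}"
        )

    return problems


# Example stats for a job that ran too long and received many 403 responses.
example_stats = {
    "elapsed_time_seconds": 30000,
    "downloader/response_status_count/403": 120,
    "item_scraped_count": 10,
}
problems = check_job_stats(example_stats)
```

In this sketch each check only appends a message to a list. In a real monitor, a failed check is what raises the alert.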

Incorporating Monitors:
1. Monitors are additional code segments integrated into your web scraping project alongside the scraper's code.
2. We deploy your web scrapers on the Scrapy Cloud platform, which allows us to leverage the monitoring capabilities of Spidermon.
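
As a rough illustration of how monitors sit alongside the scraper's code, the sketch below shows the kind of entries a Scrapy project's settings file might contain to enable Spidermon. The module path `myproject.monitors.SpiderCloseMonitorSuite` is a hypothetical placeholder, and the settings used in your own project may differ.

```python
# settings.py of a Scrapy project (illustrative sketch only)

# Turn Spidermon on for this project.
SPIDERMON_ENABLED = True

# Register the Spidermon extension with Scrapy.
EXTENSIONS = {
    "spidermon.contrib.scrapy.extensions.Spidermon": 500,
}

# Monitor suites to run when a spider finishes.
# "myproject.monitors.SpiderCloseMonitorSuite" is a hypothetical path
# standing in for wherever the project's monitor suite is defined.
SPIDERMON_SPIDER_CLOSE_MONITORS = (
    "myproject.monitors.SpiderCloseMonitorSuite",
)
```

Because these are ordinary project settings, they are deployed to Scrapy Cloud together with the rest of the project.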

Monitoring and Alert System:
1. When a monitor detects an error during the scraping process, it raises an alert within our monitoring system.
2. We have a dedicated team responsible for monitoring these alerts. They perform preliminary investigations to determine the severity of the alert, the impact on data collection and delivery, and whether further investigation by a developer is required.
3. If deeper investigation is necessary, we create a ticket to notify you about the issue. Meanwhile, our development team starts working on a solution.

Communication and Updates:
1. Throughout the process, our development team is responsible for providing regular updates on the scope of the problem, the proposed solution, and any temporary risk mitigation measures put in place.
2. You can expect clear and transparent communication from our team, ensuring you stay informed about the progress and resolution of the issue.

Implementing monitors using the Spidermon framework allows us to proactively identify and address challenges faced by your web scrapers. With our monitoring system in place, we can swiftly respond to any issues that arise, ensuring the consistent performance and reliability of your data collection process. Our dedicated team is committed to delivering high-quality results and providing you with the necessary updates and support throughout the monitoring and issue resolution process.

Please feel free to reach out to us if you have any further questions or require additional information. We are here to assist you in maximizing the effectiveness of your web scrapers.