We understand that the world of web scraping technology may be new to some of you, and it might be difficult for you to understand certain words we use in our email communication. Don't worry, this article contains a list of words we use at Zyte in our communication with you as a customer. We assume you already understand basic website terminology such as URL, web page, etc. If not, please refer to the Mozilla Foundation's guide to the internet, starting with What is a URL?
Web scraping vs. web crawling, what's the difference?
When a software program reads and collects data from a given website, it is web scraping. When a software program follows links from one page to another and creates an index of URLs, we call it web crawling.
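To make the difference concrete, here is a minimal sketch using only Python's standard library. The HTML and the page layout are made up for illustration; real spiders use a framework such as Scrapy rather than hand-rolled parsers. One parser collects links (crawling), the other extracts a piece of data (scraping):

```python
from html.parser import HTMLParser

# A made-up page, standing in for a real website's HTML.
HTML = """
<html><body>
  <h1>Python Crash Course</h1>
  <span class="price">$29.99</span>
  <a href="/books/page-2">Next page</a>
  <a href="/books/page-3">Page 3</a>
</body></html>
"""

class LinkCollector(HTMLParser):
    """Crawling: gather the URLs a page links to."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

class PriceScraper(HTMLParser):
    """Scraping: pull one specific piece of data out of a page."""
    def __init__(self):
        super().__init__()
        self._in_price = False
        self.price = None

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True

    def handle_data(self, data):
        if self._in_price:
            self.price = data.strip()
            self._in_price = False

crawler = LinkCollector()
crawler.feed(HTML)
scraper = PriceScraper()
scraper.feed(HTML)
print(crawler.links)   # the crawler's output: URLs to visit next
print(scraper.price)   # the scraper's output: extracted data
```

The crawler ends up with a list of URLs to follow, while the scraper ends up with the data itself; a real spider usually does both at once.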
What is a spider?
A spider, sometimes called a web crawler, spider-bot or just bot, is a computer program that systematically browses websites and can be used for collecting data. Search engines like Google, Bing, DuckDuckGo, etc. run hundreds of such spiders to constantly collect data about websites into an indexed repository, which makes it easy for us to find information on them when we make a search.
What is an item?
The main goal of a spider in scraping is to extract data from sources, typically web pages, and return the extracted data as items. For example, if you wish to collect data about books from a site that sells books online, the data for each book will be stored in a single item.
What is a field?
A field is a single piece of data within an item. For example, if an item stores data about a book, the details of that book, such as the name of the author, the price, the number of reviews, etc., are considered the fields of that item.
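In code, an item is often an object whose attributes are its fields. A minimal sketch for the book example above, using a plain Python dataclass (the field names and values are made up; Scrapy projects typically use its own item classes instead):

```python
from dataclasses import dataclass, fields

# A hypothetical item for one book on a book-selling site.
@dataclass
class BookItem:
    title: str
    author: str
    price: float
    review_count: int

item = BookItem(
    title="Python Crash Course",
    author="Eric Matthes",
    price=29.99,
    review_count=1250,
)

# The item holds the whole book; each attribute is one field.
print([f.name for f in fields(item)])
```

A spider would produce one such item per book it finds, and our monitoring then checks that each item's fields were actually filled in.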
What is a request?
A request is an attempt made by your computer when you try to visit a website. When you go to your browser, type in www.google.com and hit Enter, you are essentially sending a request to Google's server through your browser that says "Hey Google, I want to make a search, so could you please display your search page in my browser". If your request is successful, Google will then show their web page in your browser.
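You can see what such a request looks like with Python's standard library. This sketch only constructs the request your browser would make; it never actually sends anything over the network (to send it, you would pass it to urllib.request.urlopen):

```python
from urllib.request import Request

# Build (but do not send) the request made when you type
# www.google.com into your browser and hit Enter.
req = Request("https://www.google.com/", method="GET")

print(req.get_method())  # the kind of request: GET ("show me this page")
print(req.full_url)      # the full address being requested
print(req.host)          # the server the request is addressed to
```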
What is a threshold?
Some websites do not like spiders visiting their pages and put anti-bot measures in place to prevent them from crawling their site. In such cases, the spider may not be able to access the website on the first attempt. To work around this hurdle, spiders may send multiple requests from different IP addresses, preventing the site from identifying them as bots and allowing them to access the page.
Since there are multiple requests to access the same page in such a scenario, some of those requests fail and raise an error. These errors are expected and normal, and they can be ignored. However, if they happen too frequently, the spider may end up getting banned permanently from crawling the site. Therefore, we set a threshold for the extent to which we can ignore such errors. If the number of errors exceeds the set threshold, our monitoring systems raise an error which is then sent to the development team for further investigation.
Thresholds are also set for other metrics, such as the minimum number of items we expect to collect, the number of fields we expect to extract, etc.
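The idea can be sketched in a few lines of Python. The limits and stat names below are made up for illustration and are not Zyte's actual monitoring rules:

```python
# A minimal, hypothetical threshold check for one spider run.
def check_thresholds(stats, max_error_rate=0.05, min_items=100):
    """Return an alert for every stat that crossed its threshold."""
    alerts = []
    error_rate = stats["failed_requests"] / stats["total_requests"]
    if error_rate > max_error_rate:
        alerts.append(f"error rate {error_rate:.1%} exceeds {max_error_rate:.1%}")
    if stats["items_scraped"] < min_items:
        alerts.append(f"only {stats['items_scraped']} items, expected at least {min_items}")
    return alerts

# 80 failures out of 1000 requests is 8%, above the 5% threshold,
# so this run would be flagged for the development team.
run_stats = {"total_requests": 1000, "failed_requests": 80, "items_scraped": 950}
print(check_thresholds(run_stats))
```

A run with a 1% error rate and plenty of items would return an empty list, i.e. nothing to investigate.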
What is a headless browser?
To visit a website ourselves, we use browsers such as Google Chrome, Mozilla Firefox, Microsoft Edge, etc. But for the purpose of web scraping, a computer program like a spider does not need to actually render a website in one of these browsers. It can visit a website and read its pages in the background, while the site remains invisible to the human eye. Browsers that help spiders read data this way are known as headless browsers. The benefit of scraping data this way is that it uses less data and fewer resources, while also increasing the speed at which the data is collected.
What is rich content?
Rich content consists of media that involves sound, images and/or video. It’s content that stands out because of visual flair and design. It’s content that takes advantage of the sensory features of the viewer. Some examples of rich content are:
- Static images
- Animated GIFs
- Audio clips
So, what’s the difference between rich content and normal content?
Normal content is usually text-based. Think of text-only social media posts or blogs. Though these formats can inform and even encourage interaction with your brand, there’s not much visual flair to them. That’s why text-based content is used in conjunction with rich content, creating a combination that’s both engaging and informative.
You’ll most likely find rich content in blog posts, on social media and in emails, from ads to promotions and newsletters.
What is a response code?
Whenever you visit a website, you are basically sending a request to some computer on the internet, asking it to let you access that site. Your request will then get a response from that site telling you whether it will allow you to visit or not.
- If we are able to successfully open the web page, the site sends us a response code of 200
- If the page is no longer available, the site will send us a 404 Not Found code
- If the page has permanently moved to a different URL, the site will redirect us and give us a 308 code
Basically, a response code is a numerical way for web servers to tell your computer how your request was handled. The entire list of codes can be found at https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
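Python's standard library even ships this table of codes, so you can look up what a code means without leaving your terminal. A quick sketch for the three codes mentioned above:

```python
from http import HTTPStatus

# Look up the standard phrase for each response code.
for code in (200, 404, 308):
    status = HTTPStatus(code)
    print(code, status.phrase)
# 200 OK
# 404 Not Found
# 308 Permanent Redirect
```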