We understand that the world of web scraping may be new to you and that some of the jargon we use could be confusing. Don't worry: this article lists the words we use at Zyte in our communication with you as our customer. We assume you already understand basic website terminology such as URL, web page, etc. If not, please refer to the Mozilla Foundation's guide to the internet, starting with "What is a URL?". If you come across a word you don't understand and can't find an explanation for, please let us know. We will try to cover it in this article.
Web scraping vs. web crawling: what's the difference?
When a software program reads and collects data from a given website, that is web scraping. When a program follows links from one page to another and builds a list of URLs, we call it web crawling. Google runs multiple web crawlers to build a huge list of such URLs, which makes it easy for you to search the web.
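To make the distinction concrete, here is a small sketch using only Python's standard library. The HTML snippet, class names and URLs are made up for illustration: the parser collects the page's data (scraping) and the page's links (crawling) side by side.

```python
from html.parser import HTMLParser

# A made-up product page, used only for illustration.
PAGE = """
<html><body>
  <h1 class="title">A Light in the Attic</h1>
  <span class="price">£51.77</span>
  <a href="/catalogue/page-2.html">next</a>
  <a href="/about.html">about</a>
</body></html>
"""

class PageReader(HTMLParser):
    """Reads one page: collects data (scraping) and links (crawling)."""
    def __init__(self):
        super().__init__()
        self.links = []      # what a crawler cares about
        self.data = {}       # what a scraper cares about
        self._field = None   # field currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif attrs.get("class") in ("title", "price"):
            self._field = attrs["class"]

    def handle_data(self, data):
        if self._field:
            self.data[self._field] = data.strip()
            self._field = None

reader = PageReader()
reader.feed(PAGE)
print(reader.data)   # the scraped data for this page
print(reader.links)  # the URLs a crawler would visit next
```

A scraper stops at `reader.data`; a crawler feeds `reader.links` back into its queue and keeps going.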
What is a spider?
A spider, sometimes called a web crawler, spider bot or just bot, is a computer program that systematically browses websites and can be used to collect data. Search engines like Google, Bing and DuckDuckGo run hundreds of such programs to collect data.
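The "systematically browses" part can be sketched as a short loop. The pages below are a stand-in for real websites so the sketch works offline; a real spider would fetch each URL over HTTP instead of looking it up in a dictionary.

```python
from collections import deque

# A tiny fake "web": each URL maps to the links found on that page.
FAKE_WEB = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b"],
    "https://example.com/b": [],
}

def crawl(start_url):
    """Systematically browse pages, visiting each URL exactly once."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)          # a real spider would scrape data here
        for link in FAKE_WEB.get(url, []):
            if link not in seen:     # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return visited

print(crawl("https://example.com/"))
```

The `seen` set is what keeps a spider from going in circles when pages link back to each other.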
What is an item?
The main goal of a web scraper is to collect data from websites, typically by reading web pages and returning the extracted data as items. For example, if you wish to collect data about books from a site that sells books online, the data for each book will be collected as a single unit we call an item.
What is a field?
A field is a single piece of data within an item. For example, if an item stores data about a book, details such as the name of the author, the price and the number of reviews will each be considered a field of that item.
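The book example above can be written as a small sketch. The field names and values here are illustrative, not from any real site: the class is one item, and each attribute is one field.

```python
from dataclasses import dataclass, fields

@dataclass
class BookItem:
    """One item: all the data collected for a single book."""
    title: str          # each attribute below is one field of the item
    author: str
    price: float
    review_count: int

# One scraped book = one item with four fields.
item = BookItem(title="A Light in the Attic",
                author="Shel Silverstein",
                price=51.77,
                review_count=3)

print([f.name for f in fields(item)])  # the names of the item's fields
```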
What is a request?
A request is the message your computer sends when you try to visit a website. When you open your browser, type in www.google.com and hit Enter, you are essentially asking Google's server: "Hey Google, I want to make a search, so could you please display your search page in my browser?" If your request is successful, Google will return its web page to your browser.
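In code, a request is just a small structured message: a URL, a method and some headers. The sketch below builds one with Python's standard library but never sends it, so it works offline; the `User-Agent` string is a made-up example.

```python
from urllib.request import Request

# Construct (but don't send) the kind of request a browser would make.
req = Request("https://www.google.com/search?q=web+scraping",
              headers={"User-Agent": "my-scraper/0.1"})  # hypothetical name

print(req.get_method())  # "GET" means "please show me this page"
print(req.host)          # the server the request will be sent to
```

Sending it (for example with `urllib.request.urlopen(req)`) is what actually delivers the message to Google's server.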
What is a threshold?
Some websites do not like automated programs such as spiders collecting data from their web pages, so they put an anti-bot detector in place. The detector notices that a request is coming from a computer program rather than an actual human and prevents the program from crawling the site.
In such cases a spider may be unable to access the website at the first attempt. To work around this hurdle, spiders may send their requests through multiple computers on the internet, so that the anti-bot detector sees those requests coming from multiple sources instead of just one.
Since multiple requests are made to access the same page, some of those requests fail and produce an error. These errors are expected and normal and can usually be ignored. However, if they happen too frequently, the spider may end up permanently banned from crawling the site. Therefore, we set a threshold for the number of such errors we are willing to ignore. If the number of errors exceeds that threshold, our monitoring system sends us an alert, based on which we notify the development team for further investigation.
Thresholds are also set for other things, such as the minimum number of items we expect to collect, the number of fields we expect to collect, etc.
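A monitoring check of this kind can be sketched in a few lines. The threshold names and limits below are hypothetical, chosen only to illustrate the idea of comparing a run's numbers against agreed limits.

```python
# Hypothetical limits; real thresholds are agreed per project.
THRESHOLDS = {
    "max_request_errors": 50,  # errors we tolerate before alerting
    "min_items": 1000,         # fewest items a healthy run should collect
}

def check_run(request_errors, items_collected):
    """Return a list of alerts for every threshold that was breached."""
    alerts = []
    if request_errors > THRESHOLDS["max_request_errors"]:
        alerts.append("too many request errors")
    if items_collected < THRESHOLDS["min_items"]:
        alerts.append("too few items collected")
    return alerts

print(check_run(request_errors=12, items_collected=1500))  # healthy run: no alerts
print(check_run(request_errors=75, items_collected=400))   # two thresholds breached
```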
What is a headless browser?
To visit a website, we generally use a browser such as Google Chrome, Mozilla Firefox or Microsoft Edge. For web scraping, however, a spider does not need to actually open a website in a visible browser window. It can visit a website and read its web pages in the background, with the site remaining invisible to the human eye. Browsers that help spiders read data this way are known as headless browsers. The benefit of scraping data this way is that it uses less data and fewer resources, while also increasing the speed at which the data is collected.
What is rich content?
Rich content consists of media that involves sound, images and/or video. It’s content that stands out because of visual flair and design, and that engages the viewer’s senses. Some examples of rich content are:
- Static images
- Animated GIFs
- Audio clips
So, what’s the difference between rich content and normal content?
Normal content is usually text-based. Think of text-only social media posts or blogs. Though these formats can inform and even encourage interaction with your brand, there’s not much visual flair to them. That’s why text-based content is used in conjunction with rich content, creating a combination that’s both engaging and informative.
You’ll most likely find rich content in blog posts, on social media and in emails, from ads to promotions and newsletters.
What is a response code?
Whenever you visit a website, you are basically sending a request to some computer on the internet asking to access that site. That site then sends back a response telling you whether or not it will allow you to visit.
- If we are able to successfully open the web page, the site sends us a response code of 200.
- But if the page we tried to access is no longer available, the site will send us a 404 Not Found code.
- If the page has permanently moved to a different URL, the site will redirect us and give us a 308 (Permanent Redirect) code.
Basically, a response code is a numerical way for a website's server to tell your computer how your request was handled. The entire list of codes can be found at https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
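Python's standard library ships the official name for each response code, so the three codes above can be checked without touching the network:

```python
from http import HTTPStatus

# Look up the official phrase for each response code mentioned above.
for code in (200, 308, 404):
    status = HTTPStatus(code)
    print(code, status.phrase)
# 200 OK, 308 Permanent Redirect, 404 Not Found
```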
What are data validation errors?
Suppose we are extracting information about students from a public website that has the following information for each student: name, year of class, subject and email. Sometimes we want to validate that we are capturing the right type of information in those fields. For example, the name can never be a number, the year of class can never be letters, and the email should always be in the format of a valid email address.
Data validation monitors check exactly this: they confirm whether the data collected is in the expected format. If it is not, they raise a data validation error.
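The student example can be sketched as a simple validator. The rules and the email pattern below are deliberately simplified for illustration; real monitors use far stricter checks.

```python
import re

# Deliberately simplified email pattern, for illustration only.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_student(record):
    """Return a list of validation errors for one student record."""
    errors = []
    if any(ch.isdigit() for ch in record.get("name", "")):
        errors.append("name must not contain numbers")
    if not str(record.get("year", "")).isdigit():
        errors.append("year of class must be a number")
    if not EMAIL_RE.match(record.get("email", "")):
        errors.append("email is not a valid address")
    return errors

good = {"name": "Ada Lovelace", "year": "1842", "email": "ada@example.com"}
bad = {"name": "R2D2", "year": "first", "email": "not-an-email"}
print(validate_student(good))  # no errors: the record passes
print(validate_student(bad))   # every field fails its check
```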