Machine learning with Scrapy and MonkeyLearn

Modified on Wed, 3 Feb, 2021 at 7:58 AM

NOT TO BE CONFUSED WITH THE MONKEYLEARN ADDON

This guide shows how to apply machine learning to the data that you extract through Scrapy Cloud. MonkeyLearn is a classifier service that lets you analyze text. It provides machine learning capabilities such as product categorization and sentiment analysis, for example determining whether a customer review is positive or negative. For more details on the pipeline, see the GitHub repository: scrapy-monkeylearn.

Getting Started with MonkeyLearn

Before using MonkeyLearn with Scrapy, you need to sign up for the MonkeyLearn service. MonkeyLearn provides public modules that are ready to use, or you can create your own text analysis module by training a custom machine learning model.

Using MonkeyLearn with a Scrapy Cloud Project

scrapy-monkeylearn requires your spiders to generate Item objects from a pre-defined Item class. For example:

import scrapy


class ProductItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    description = scrapy.Field()
    category = scrapy.Field()

The Item class must also declare an additional field where MonkeyLearn will store the analysis results. In this case, the results are stored in the category field of each scraped item.

You'll also need to add:

https://github.com/scrapy-plugins/scrapy-monkeylearn/archive/366340d.zip#egg=scrapy-monkeylearn

to your requirements.txt.

Enabling MonkeyLearn

Enable the pipeline in your project's settings.py:

ITEM_PIPELINES = {
    'scrapy_monkeylearn.pipelines.MonkeyLearnPipeline': 100,
}
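
The number assigned to the pipeline sets its position relative to any other item pipelines in the project: Scrapy runs pipelines in ascending order of this value, conventionally chosen in the 0-1000 range. The following is a minimal sketch, assuming a hypothetical project pipeline named myproject.pipelines.CleanTextPipeline that should run before classification. That name is not part of scrapy-monkeylearn and appears here only for illustration.

```python
# settings.py (sketch). 'myproject.pipelines.CleanTextPipeline' is a
# hypothetical pipeline used only to illustrate ordering.
ITEM_PIPELINES = {
    'myproject.pipelines.CleanTextPipeline': 50,
    'scrapy_monkeylearn.pipelines.MonkeyLearnPipeline': 100,
}

# Illustration only: pipelines run from the lowest value to the highest.
# Scrapy ignores lowercase names in settings.py, so this line is harmless.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
```

With these values, items would pass through the hypothetical cleaning pipeline first and reach MonkeyLearnPipeline afterwards.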

Configuring MonkeyLearn

Add these to your settings.py:

MONKEYLEARN_TOKEN = 'your_API_token'
MONKEYLEARN_FIELD_TO_PROCESS = ['title', 'description']
MONKEYLEARN_FIELD_OUTPUT = 'category'
MONKEYLEARN_MODULE = 'classifier_id'
MONKEYLEARN_BATCH_SIZE = 200
MONKEYLEARN_USE_SANDBOX = False

MonkeyLearn token (MONKEYLEARN_TOKEN): your MonkeyLearn API token. You can find it in your account settings on the MonkeyLearn website.
MonkeyLearn field to process (MONKEYLEARN_FIELD_TO_PROCESS): a single Item text field or a list of fields used as input for the classifier, in this case title and description. A comma-separated string of field names is also supported.
MonkeyLearn field output (MONKEYLEARN_FIELD_OUTPUT): the name of the field added to your items to store the categories returned by the classifier, in this case category.
MonkeyLearn module (MONKEYLEARN_MODULE): the id of the classifier that you are going to use.
MonkeyLearn batch size (MONKEYLEARN_BATCH_SIZE): the number of items the pipeline retains before sending them to MonkeyLearn for analysis.
MonkeyLearn sandbox (MONKEYLEARN_USE_SANDBOX): when using a classifier, whether its sandbox version should be used.
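
To make the MONKEYLEARN_FIELD_TO_PROCESS and MONKEYLEARN_BATCH_SIZE settings more concrete, here is a rough sketch of the behaviour they describe. It is not the scrapy-monkeylearn source code: the helper names normalize_fields, build_text and batches are hypothetical, and joining the selected fields with a space is an assumption about how the input text might be assembled.

```python
# Illustrative sketch only; not the scrapy-monkeylearn implementation.
# All function names below are hypothetical.

def normalize_fields(value):
    """Accept a list of field names or a comma-separated string."""
    if isinstance(value, str):
        return [name.strip() for name in value.split(',') if name.strip()]
    return list(value)


def build_text(item, fields):
    """Join the selected, non-empty fields into one text (assumed strategy)."""
    return ' '.join(str(item[name]) for name in fields if item.get(name))


def batches(items, batch_size):
    """Yield consecutive groups of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Both forms of the setting resolve to the same list of field names.
fields = normalize_fields('title, description')

# A plain dict stands in for a scraped item in this sketch.
item = {
    'url': 'https://example.com/mug',
    'title': 'Red mug',
    'description': 'Ceramic, 300 ml',
}
text = build_text(item, fields)

# 450 scraped items with a batch size of 200 give three groups.
sizes = [len(group) for group in batches(list(range(450)), 200)]
```

Under these assumptions, a larger batch size means fewer groups sent for analysis, at the cost of holding more items in memory before each one is sent.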