This guide shows how to apply machine learning to the data you extract through Scrapy Cloud. MonkeyLearn is a text classification service. It offers machine learning capabilities such as product categorization and sentiment analysis, for example to determine whether a customer review is positive or negative. For more details on the pipeline, see the GitHub repository: scrapy-monkeylearn.

Getting Started with MonkeyLearn

Before using MonkeyLearn with Scrapy, you first need to sign up for the MonkeyLearn service. MonkeyLearn provides public modules that are ready to use, or you can create your own text analysis module by training a custom machine learning model.

Using MonkeyLearn with a Scrapy Cloud Project

scrapy-monkeylearn requires your spiders to generate Item objects from a pre-defined Item class. For example:

import scrapy


class ProductItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    description = scrapy.Field()
    category = scrapy.Field()  # filled in with MonkeyLearn's analysis results

You'll also have to declare an additional field in your Item class where MonkeyLearn will store the analysis results. In this case, the results are stored in the category field of each scraped item.
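As a rough illustration of that data flow (not the pipeline's actual code), the sketch below uses a plain dict in place of a ProductItem and a hypothetical fake_classify function in place of the MonkeyLearn call. Joining the input fields with a space is also an assumption made only for this example.

```python
def fake_classify(text):
    """Hypothetical stand-in for the MonkeyLearn classifier call."""
    # A real classifier returns categories learned from training data;
    # this stub only checks for the word "laptop" so the example runs offline.
    return ["Computers"] if "laptop" in text.lower() else ["Other"]


def classify_item(item, input_fields, output_field):
    """Join the input fields into one text and store the result on the item."""
    text = " ".join(str(item.get(name, "")) for name in input_fields)
    item[output_field] = fake_classify(text)
    return item


item = {
    "url": "https://example.com/p/1",
    "title": "Lightweight laptop",
    "description": "13-inch display, 16 GB RAM",
    "category": None,  # empty until the classification step fills it in
}
classify_item(item, ["title", "description"], "category")
print(item["category"])  # prints ['Computers']
```

The point is that the spider leaves the output field empty and the classification step fills it in afterwards.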

You'll also need to add scrapy-monkeylearn to your requirements.txt.

Enabling MonkeyLearn

Enable it in your project's ITEM_PIPELINES setting:

    ITEM_PIPELINES = {'scrapy_monkeylearn.pipelines.MonkeyLearnPipeline': 100}
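In Scrapy, the integer assigned to each pipeline in ITEM_PIPELINES sets its position in the processing order: values are conventionally in the 0 to 1000 range, and lower values run first. The sketch below assumes a hypothetical myproject.pipelines.CleanTextPipeline that tidies up text fields before MonkeyLearn sees them. Only the MonkeyLearn entry comes from this guide.

```python
# settings.py (sketch)
ITEM_PIPELINES = {
    # Hypothetical project pipeline, not part of scrapy-monkeylearn: runs first.
    'myproject.pipelines.CleanTextPipeline': 50,
    # MonkeyLearn classification runs afterwards, on the cleaned text.
    'scrapy_monkeylearn.pipelines.MonkeyLearnPipeline': 100,
}
```

If no other pipelines are enabled, the exact value matters little. It becomes relevant once other pipelines need to run before or after classification.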

Configuring MonkeyLearn

Add these to your project settings:

MONKEYLEARN_FIELD_TO_PROCESS = ['title', 'description']
MONKEYLEARN_MODULE = 'classifier_id'

The pipeline accepts the following options:

  • MonkeyLearn token: your MonkeyLearn API token. You can find it in your account settings on the MonkeyLearn website.
  • MonkeyLearn field to process: the Item text field, or list of fields, used as input for the classifier; in this case, title and description. A comma-separated string of field names is also supported.
  • MonkeyLearn field output: the name of the field added to your items to store the categories returned by the classifier; in this case, category.
  • MonkeyLearn module: the ID of the classifier you are going to use.
  • MonkeyLearn batch size: the number of items the pipeline holds before sending them to MonkeyLearn for analysis.
  • MonkeyLearn sandbox: when using a classifier, whether its sandbox version should be used.
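
Putting the options above together, a settings file might look like the sketch below. Only MONKEYLEARN_FIELD_TO_PROCESS and MONKEYLEARN_MODULE appear in this guide. The other four setting names (MONKEYLEARN_TOKEN, MONKEYLEARN_FIELD_OUTPUT, MONKEYLEARN_BATCH_SIZE, MONKEYLEARN_USE_SANDBOX) and all values are assumptions inferred from the option descriptions, so check them against the scrapy-monkeylearn repository before relying on them.

```python
# settings.py (sketch; names marked "assumed" are not confirmed by this guide)

ITEM_PIPELINES = {
    'scrapy_monkeylearn.pipelines.MonkeyLearnPipeline': 100,
}

MONKEYLEARN_TOKEN = 'your_api_token'   # assumed setting name; placeholder value
MONKEYLEARN_MODULE = 'classifier_id'   # ID of the classifier to use
# Fields used as classifier input; 'title,description' should be equivalent.
MONKEYLEARN_FIELD_TO_PROCESS = ['title', 'description']
MONKEYLEARN_FIELD_OUTPUT = 'category'  # assumed setting name
MONKEYLEARN_BATCH_SIZE = 200           # assumed setting name and value
MONKEYLEARN_USE_SANDBOX = False        # assumed setting name
```

The output field named here has to match a field declared on the Item class (category in the ProductItem example).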