Magic Fields addon

Modified on Wed, 3 Feb, 2021 at 9:38 AM

Sometimes, you need to add certain fields to your scraped data that can be derived from the context. For example, you may need a timestamp for when an item was scraped, or you need to extract an identifier from a URL. This is where the Magic Fields addon comes in.

You can enable the addon by going to Settings -> Addons and clicking Add on the Magic Fields addon.

Navigate to the settings of the spider you want to modify. Let’s use the $time magic variable as an example.

Add { "timestamp": "$time" } to the MAGIC_FIELDS setting. This will add a timestamp field containing the time at which the item was scraped.
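To make the effect concrete, here is a minimal Python sketch of what the setting above does to an item. It is not the addon's implementation: the apply_magic_fields helper, the sample item, and the fixed date are hypothetical, and only the $time variable is handled.

```python
from datetime import datetime, timezone

# Setting value taken from the example above.
MAGIC_FIELDS = {"timestamp": "$time"}


def apply_magic_fields(item, magic_fields, now):
    """Return a copy of item with magic fields added (hypothetical helper, $time only)."""
    result = dict(item)
    for field_name, expression in magic_fields.items():
        if expression == "$time":
            # Same format as documented for $time.
            result[field_name] = now.strftime("%Y-%m-%d %H:%M:%S")
    return result


# Illustrative item and scrape time; both are made up for this sketch.
scraped = {"title": "Example product"}
now = datetime(2021, 2, 3, 9, 38, 0, tzinfo=timezone.utc)
enriched = apply_magic_fields(scraped, MAGIC_FIELDS, now)
print(enriched)
```

The original item keeps its fields, and the new timestamp field is added alongside them.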

The following magic variables are available:

$time The UTC timestamp at which the item was scraped, in the format ‘%Y-%m-%d %H:%M:%S’.
$unixtime The Unix time at which the item was scraped.
$isotime The UTC timestamp at which the item was scraped, in the format ‘%Y-%m-%dT%H:%M:%S’.
$spider:<attribute> The value of the specified spider attribute or argument.
$env:<variable> The value of the specified environment variable. Note: the name of the variable will be omitted.
$jobid The job ID. Shortcut for $env:SCRAPY_JOB.
$jobtime The UTC timestamp at which the job started, in the format ‘%Y-%m-%d %H:%M:%S’.
$setting:<name> The value of the specified setting.
$field:<name> The value of the specified existing field.
$response:<property> The value of the specified property of the response.
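The three time-based variables differ only in how the same moment is rendered. The sketch below formats one UTC moment in each documented format using Python's standard library. The time_variables helper and the sample date are hypothetical, and whether the addon emits the Unix time as an integer or a float is an assumption here.

```python
from datetime import datetime, timezone


def time_variables(moment):
    """Render one UTC moment in the formats described for the time variables."""
    return {
        "time": moment.strftime("%Y-%m-%d %H:%M:%S"),
        # Seconds since the Unix epoch; shown as a float here (assumption).
        "unixtime": moment.timestamp(),
        "isotime": moment.strftime("%Y-%m-%dT%H:%M:%S"),
    }


# Illustrative moment, not taken from a real job.
example = time_variables(datetime(2021, 2, 3, 9, 38, 0, tzinfo=timezone.utc))
for name, value in example.items():
    print(name, value)
```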

You can also use a regular expression to extract a portion of a variable's value.

For example, let’s say you need to extract a parameter from a URL like this: http://www.example.com/product.html?item_no=345. The plain syntax, { "sku": "$field:url" }, stores the full URL in the sku field. To extract only the item_no value, append a regex after a comma, like this:

{ "sku": "$field:url,r'item_no=(\d+)'" }
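You can check a pattern locally before saving it in the setting. The sketch below applies the same regex to the example URL with Python's re module. The extract helper is hypothetical, and using re.search with the first capture group is an assumption about how the addon applies the pattern.

```python
import re


def extract(value, pattern):
    """Return the first capture group of pattern in value, or None if nothing matches."""
    match = re.search(pattern, value)
    return match.group(1) if match else None


url = "http://www.example.com/product.html?item_no=345"
sku = extract(url, r"item_no=(\d+)")
print(sku)  # 345
```

If the pattern does not match, this helper returns None. How the addon itself handles a non-matching pattern is not covered here.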