Sometimes, you need to add certain fields to your scraped data that can be derived from the context. For example, you may need a timestamp for when an item was scraped, or you need to extract an identifier from a URL. This is where the Magic Fields addon comes in.
You can enable the addon by going to Settings -> Addons and clicking Add on the Magic Fields addon.
Navigate to the settings of the spider you want to modify. Let’s use the $time magic variable as an example.
Add { "timestamp": "$time" } to the MAGIC_FIELDS setting. This will add a timestamp field containing the time at which the item was scraped.
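To make the effect concrete, here is a rough Python sketch of what the resulting item could look like. The add_timestamp helper and the sample item are hypothetical stand-ins for illustration only; they are not the addon's actual code, and the only thing taken from this article is the '%Y-%m-%d %H:%M:%S' UTC format used by $time.

```python
from datetime import datetime, timezone


def add_timestamp(item, now=None):
    """Hypothetical stand-in for what {"timestamp": "$time"} does to an item."""
    # Use the current UTC time unless a fixed moment is supplied.
    now = now or datetime.now(timezone.utc)
    enriched = dict(item)  # copy so the original item is left untouched
    # $time is documented as a UTC timestamp in this format.
    enriched["timestamp"] = now.strftime("%Y-%m-%d %H:%M:%S")
    return enriched


# Sample item and a fixed moment, both made up for the example.
scraped = {"name": "Example product", "price": "9.99"}
fixed_moment = datetime(2024, 1, 15, 8, 30, 0, tzinfo=timezone.utc)

print(add_timestamp(scraped, fixed_moment))
# {'name': 'Example product', 'price': '9.99', 'timestamp': '2024-01-15 08:30:00'}
```

The existing fields are kept as they are; the magic field is simply added alongside them.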
The following magic variables are available. Prefix the name with $ when you use it in the setting, as in $time:
- time: The UTC timestamp at which the item was scraped, in the format ‘%Y-%m-%d %H:%M:%S’.
- unixtime: The Unix time at which the item was scraped.
- isotime: The UTC timestamp at which the item was scraped, in the format ‘%Y-%m-%dT%H:%M:%S’.
- spider:<attribute>: The value of the specified attribute argument.
- env:<variable>: The value of the specified variable. Note: the name of the variable will be omitted.
- jobid: The job ID. Shortcut for $env:SCRAPY_JOB.
- jobtime: The UTC timestamp at which the job started, in the format ‘%Y-%m-%d %H:%M:%S’.
- setting:<name>: The value for the specified setting.
- field:<name>: The value of the existing field specified.
- response:<property>: The value of the specified property of the response.
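Several magic variables can be combined in one MAGIC_FIELDS value. The sketch below builds such a value in Python and prints it as JSON that could be pasted into the setting. The field names on the left (scraped_at, job, spider_name, source_url) are made up for illustration, and it assumes the spider has a name attribute and the response has a url property, as is usual in Scrapy.

```python
import json

# Hypothetical field names (keys) mapped to magic variables from the list above.
magic_fields = {
    "scraped_at": "$isotime",        # scrape time in ISO-like format
    "job": "$jobid",                 # ID of the job that produced the item
    "spider_name": "$spider:name",   # assumes the spider exposes `name`
    "source_url": "$response:url",   # assumes the response exposes `url`
}

# Serialize to JSON so it can be pasted into the MAGIC_FIELDS setting.
setting_value = json.dumps(magic_fields, indent=2)
print(setting_value)
```

Building the value as a dictionary first makes it easier to spot a missing $ or a stray quote before saving the setting.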
You can also use regular expressions to extract a portion of the variable.
For example, let’s say you need to extract a parameter from a URL like this: http://www.example.com/product.html?item_no=345. The normal syntax, { "sku": "$field:url" }, will store the full URL in the sku field. If we want to extract only the item_no value, we can use a regex like this:
{ "sku": "$field:url,r'item_no=(\d+)'" }
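Before saving the setting, it can help to try the pattern in plain Python. The snippet below is only a sketch: it assumes the addon searches the value for the pattern and keeps the first capture group, which this article does not spell out.

```python
import re

# URL and pattern taken from the example above.
url = "http://www.example.com/product.html?item_no=345"
pattern = r"item_no=(\d+)"

# Search anywhere in the URL and keep the first capture group, if any.
match = re.search(pattern, url)
sku = match.group(1) if match else None

print(sku)  # 345
```

If the pattern does not match, sku is None here; how the addon itself treats a non-matching value is not covered by this article.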