Sometimes you need to add fields to your scraped data that can be derived from the context. For example, you may need a timestamp for when an item was scraped, or an identifier extracted from a URL. This is where the Magic Fields addon comes in.
You can enable the addon by going to Settings -> Addons and clicking Add on the Magic Fields addon.
Navigate to the settings of the spider you want to modify. Let’s use the $time magic variable as an example.
Add { "timestamp": "$time" } to the MAGIC_FIELDS setting. This will add a timestamp field containing the time at which the item was scraped.
The following magic variables are available (prefix the name with $ when you use it, as in $time):
- time: The UTC timestamp at which the item was scraped, in the format '%Y-%m-%d %H:%M:%S'.
- unixtime: The Unix time at which the item was scraped.
- isotime: The UTC timestamp at which the item was scraped, in the format '%Y-%m-%dT%H:%M:%S'.
- spider:<attribute>: The value of the specified spider attribute.
- env:<variable>: The value of the specified environment variable. Note: the name of the variable will be omitted.
- jobid: The job ID. Shortcut for $env:SCRAPY_JOB.
- jobtime: The UTC timestamp at which the job started, in the format '%Y-%m-%d %H:%M:%S'.
- setting:<name>: The value of the specified setting.
- field:<name>: The value of the specified existing field.
- response:<property>: The value of the specified property of the response.
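To make the three timestamp formats concrete, here is a minimal Python sketch that renders a single moment the way the time, isotime and unixtime descriptions read. It is an illustration only, not the addon's code: the function name time_magic_values and the sample date are hypothetical, and whether the addon stores unixtime as a float or an integer is not stated here.

```python
from datetime import datetime, timezone


def time_magic_values(moment: datetime) -> dict:
    """Render one moment in the three timestamp formats described above.

    Hypothetical helper for illustration; not part of the addon.
    """
    utc_moment = moment.astimezone(timezone.utc)
    return {
        # Format documented for $time and $jobtime
        "time": utc_moment.strftime("%Y-%m-%d %H:%M:%S"),
        # Format documented for $isotime (note the "T" separator)
        "isotime": utc_moment.strftime("%Y-%m-%dT%H:%M:%S"),
        # Seconds since the Unix epoch; the addon's exact numeric type is an assumption
        "unixtime": utc_moment.timestamp(),
    }


# Sample moment chosen only for the example
scraped_at = datetime(2024, 1, 15, 9, 30, 0, tzinfo=timezone.utc)
values = time_magic_values(scraped_at)
print(values["time"])
print(values["isotime"])
```

The only difference between time and isotime is the separator between the date and the time of day, which matters if a downstream system parses the field strictly.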
You can also use regular expressions to extract a portion of the variable.
For example, let’s say you need to extract a parameter from a URL like this: http://www.example.com/product.html?item_no=345. The normal syntax, { "sku": "$field:url" }, stores the full URL in the sku field. To extract only the item_no value, append a regex like this:
{ "sku": "$field:url,r'item_no=(\d+)'" }
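Before adding a pattern to MAGIC_FIELDS, it can help to check it locally. The sketch below applies the same item_no pattern with Python's standard re module. It assumes the addon keeps the first capture group of the first match, which this article does not confirm; extract_first_group is a hypothetical helper, not an addon function.

```python
import re
from typing import Optional


def extract_first_group(value: str, pattern: str) -> Optional[str]:
    """Return the first capture group of the first match, or None if nothing matches.

    Hypothetical helper for testing a pattern locally; the addon's own
    behavior on a failed match is not described in this article.
    """
    match = re.search(pattern, value)
    if match is None:
        return None
    return match.group(1)


url = "http://www.example.com/product.html?item_no=345"
sku = extract_first_group(url, r"item_no=(\d+)")
print(sku)  # prints 345
```

If the pattern returns None for a real URL from your crawl, adjust it before relying on the field.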