Compiling a Mask Price Dataset With MTurk
I’ve decided to put together a dataset of KN95 mask prices after seeing masks from AliExpress being sold at a tenfold markup in some online shops.
The dataset consists of the unit prices of products advertised as KN95 masks on various online marketplaces broken down by date. Product URLs are scraped automatically, but price, quantity and availability data is gathered by MTurk workers. The data is available on GitHub and it’s updated at least once a week.
Since we cannot be sure which products advertised as KN95 masks actually meet the filtration criteria, analyze the distribution of prices carefully and filter out outliers at both the low and the high end.
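One common way to drop outliers on both ends is the interquartile range rule. The sketch below is only an illustration: the dataset does not prescribe a filtering method, and the function name and the multiplier of 1.5 are assumptions.

```python
import statistics


def filter_price_outliers(unit_prices, k=1.5):
    """Keep prices inside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(unit_prices, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [price for price in unit_prices if low <= price <= high]


prices = [1.0, 1.2, 1.1, 0.9, 1.3, 1.0, 15.0, 0.01]
kept = filter_price_outliers(prices)
```

With this sample, the suspiciously cheap 0.01 and the overpriced 15.0 are both removed, while the six prices around 1.0 are kept.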
MTurk Lessons Learned
Online marketplaces protect their product pages from scraping, so I’ve decided to leverage MTurk for data collection. These are the lessons I’ve learned in the process.
Custom MTurk Task UI
There is no predefined MTurk task UI for price data collection, so I had to create my own. You can use the layout I’ve created as a starting point (pictured below).
Include Currency in Price
Marketplaces may display prices in different currencies based on the location of the viewer. Since MTurk workers are distributed across the globe, it’s best to ask workers to include the currency symbol in the price field and to check for the presence of the symbol with a regex.
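A minimal sketch of such a check in Python, applied to submitted answers. The accepted symbols, the pattern and the function name are assumptions for illustration. Amounts with thousands separators are rejected by this simple pattern.

```python
import re

# Assumed set of accepted currency symbols; extend it as needed.
PRICE_PATTERN = re.compile(
    r"^\s*([$€£¥])?\s*(\d+(?:[.,]\d{1,2})?)\s*([$€£¥])?\s*$"
)


def parse_price(text):
    """Return (symbol, amount) or None if the answer lacks a currency symbol."""
    match = PRICE_PATTERN.match(text)
    if match is None:
        return None
    prefix, amount, suffix = match.groups()
    # Require exactly one symbol, either before or after the amount.
    if bool(prefix) == bool(suffix):
        return None
    return (prefix or suffix), float(amount.replace(",", "."))
```

For example, `parse_price("$12.99")` and `parse_price("12,50 €")` are accepted, while a bare `"12.99"` is rejected.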
Allow N/A Answers
It’s possible that a product is not in stock or not available for shipping to a worker’s location. In order to avoid receiving garbage input from workers faced with an impossible task, make sure that you accept n/a or empty answers in this case.
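When processing answers, this can be handled by mapping n/a and empty input to a single "not available" value before any price validation runs. The helper below is a hypothetical sketch, not part of any library.

```python
# Answers treated as "product not available" (from the text: n/a or empty).
NOT_AVAILABLE = {"", "n/a"}


def normalize_answer(raw):
    """Return None for a 'not available' answer, else the trimmed text."""
    cleaned = raw.strip()
    if cleaned.lower() in NOT_AVAILABLE:
        return None
    return cleaned
```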
MTurk API
Amazon exposes an API for MTurk that can be used to automate task creation and data retrieval. For Python users, I recommend the official Boto3 MTurk client, which will feel familiar if you’ve used Boto3 with other AWS APIs. I’ve implemented a CLI with Click in Python that you can use as inspiration.
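As a rough sketch of what task creation can look like with the Boto3 client's `create_hit` call: the function name, the title and description text, the reward, the durations and the `product_url` layout parameter are all placeholders, and the layout ID must come from a task layout you have already created.

```python
def create_price_hit(client, layout_id, product_url, reward="0.05", max_assignments=3):
    """Create one HIT from an existing task layout and return its HIT ID.

    `client` is expected to be a Boto3 MTurk client, e.g. boto3.client("mturk").
    """
    response = client.create_hit(
        Title="Find the price of a product",  # placeholder text
        Description="Open the product page and report its price.",
        Reward=reward,  # US dollars, passed as a string
        MaxAssignments=max_assignments,
        LifetimeInSeconds=24 * 60 * 60,  # how long the HIT stays available
        AssignmentDurationInSeconds=10 * 60,  # time allowed per worker
        HITLayoutId=layout_id,
        # Hypothetical layout parameter name; it must match your layout.
        HITLayoutParameters=[{"Name": "product_url", "Value": product_url}],
    )
    return response["HIT"]["HITId"]
```

Passing the client in as an argument keeps the function easy to try against a stub before spending money on real HITs.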
Require Multiple Assignments Per HIT
Tasks in MTurk are called Human Intelligence Tasks (HITs). For each HIT, multiple assignments can be created, ensuring that several different workers complete your task. Since verifying the output of a single worker takes the same effort as collecting the pricing information yourself, it’s recommended that you request multiple assignments (at least 3) for each HIT. This way, a completed assignment can be checked automatically by matching it against the other assignments for the same HIT.
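The matching step could be as simple as a majority vote over the answers submitted for one HIT. This is a sketch under the assumption that answers are compared as trimmed, lowercased strings; the function name and the agreement threshold are illustrative.

```python
from collections import Counter


def consensus_answer(answers, min_agreement=2):
    """Return the answer given by at least `min_agreement` workers, else None."""
    if not answers:
        return None
    normalized = [answer.strip().lower() for answer in answers]
    value, count = Counter(normalized).most_common(1)[0]
    return value if count >= min_agreement else None
```

With three assignments, a threshold of two accepts an answer as soon as two workers agree and flags the HIT for manual review otherwise.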
Save Your HITs
When creating a HIT, one can supply custom input parameters to the task layout UI that will be displayed to workers (like the product URL). Unfortunately, these input parameters cannot be retrieved through the MTurk API after creating the HIT, so they must be stored when creating it.
A SQLite database is a good solution for saving the HITs, as you can commit it directly to the repository. You can use my implementation with SQLAlchemy ORM in Python as a starting point.
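To illustrate the idea without the ORM layer, here is a minimal sketch using Python's built-in `sqlite3` module instead of SQLAlchemy. The table name, the column names and the JSON encoding of the parameters are assumptions.

```python
import json
import sqlite3


def open_hit_store(path="hits.sqlite3"):
    """Open (or create) the database file and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hits ("
        "hit_id TEXT PRIMARY KEY, batch TEXT, params TEXT NOT NULL)"
    )
    return conn


def save_hit(conn, hit_id, params, batch=None):
    """Store the input parameters of a freshly created HIT."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO hits (hit_id, batch, params) VALUES (?, ?, ?)",
            (hit_id, batch, json.dumps(params)),
        )


def load_hit_params(conn, hit_id):
    """Return the stored parameters for a HIT, or None if it is unknown."""
    row = conn.execute(
        "SELECT params FROM hits WHERE hit_id = ?", (hit_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Calling `save_hit` right after each HIT is created means the product URL can later be joined to the workers' answers by HIT ID.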
Group HITs with Requester Annotation
Each HIT has an optional string property called RequesterAnnotation that can be set to an arbitrary value. You can use this property to group HITs into batches, which makes administration easier. For example, I set the RequesterAnnotation property to something like “batch_2020-10-06_16_59_44”.
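A batch name in this style can be generated from the current time. The sketch below assumes plain ASCII hyphens in the date part; the function name is illustrative.

```python
from datetime import datetime


def batch_annotation(now=None):
    """Build a RequesterAnnotation value such as batch_2020-10-06_16_59_44."""
    now = now or datetime.now()
    return now.strftime("batch_%Y-%m-%d_%H_%M_%S")
```

Generating the value once per run and passing it to every HIT created in that run keeps all HITs of a batch under the same annotation.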
Prune HITs
Interacting with the MTurk API can become very slow if you have too many active HITs. Therefore, it’s best to regularly delete HITs whose output you’ve already exported from MTurk. You can use this code as an example to prune HITs.
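In outline, pruning means paging through `list_hits` and calling `delete_hit` for every HIT that has already been exported. The sketch below is not the linked implementation. The function name and the `exported_hit_ids` argument are hypothetical, and it only attempts HITs in the Reviewable state, since MTurk documents that other HITs cannot be deleted.

```python
def prune_exported_hits(client, exported_hit_ids):
    """Delete Reviewable HITs whose IDs are in `exported_hit_ids`.

    `client` is expected to be a Boto3 MTurk client. Returns the deleted IDs.
    """
    to_delete = []
    kwargs = {"MaxResults": 100}
    while True:
        page = client.list_hits(**kwargs)
        for hit in page["HITs"]:
            if hit["HITId"] in exported_hit_ids and hit["HITStatus"] == "Reviewable":
                to_delete.append(hit["HITId"])
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token  # continue with the next page

    # Delete only after listing has finished, so paging is not disturbed.
    for hit_id in to_delete:
        client.delete_hit(HITId=hit_id)
    return to_delete
```

Note that `delete_hit` can still fail for a Reviewable HIT if some of its submitted assignments have not been approved or rejected yet, so real code should be prepared to handle that error.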
Amazon’s Fee Can Add Up
When creating a HIT, you specify the reward for each completed assignment. The minimum reward is $0.01. Amazon also charges a fee in addition to the reward paid to workers. The fee is normally 20% of the reward, but the minimum fee is $0.01 per assignment. At the minimum reward, the fee therefore equals the reward: it is 100% of the reward, or 50% of the total cost per assignment.
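The arithmetic can be sketched as follows, using only the figures quoted above (a 20% fee with a $0.01 minimum per assignment). The rounding to whole cents is an assumption, and the function name is illustrative.

```python
from decimal import Decimal

FEE_RATE = Decimal("0.20")  # fee rate quoted in the text
MIN_FEE = Decimal("0.01")  # minimum fee per assignment quoted in the text


def assignment_cost(reward):
    """Return (fee, total) in dollars for one assignment paying `reward`."""
    reward = Decimal(str(reward))
    # Assumption: the fee is rounded to whole cents.
    fee = max(reward * FEE_RATE, MIN_FEE).quantize(Decimal("0.01"))
    return fee, reward + fee
```

At the minimum reward, `assignment_cost("0.01")` yields a fee of $0.01 and a total of $0.02 per assignment, whereas a $0.50 reward yields a fee of $0.10.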
Conclusion
MTurk is a good alternative for gathering pricing data from online marketplaces if scraping is not an option. However, you must be prepared to handle faulty input from workers, and the cost may be prohibitive. If you decide to go with MTurk, you can use my Python CLI on GitHub as a starting point for your project to automate task administration.