Compiling a Mask Price Dataset With MTurk

Agost Biro
Oct 13, 2020


I’ve decided to put together a dataset of KN95 mask prices after seeing masks from AliExpress being sold at a tenfold markup in some online shops.

Daily Median KN95 Mask Prices. Figure by author.

The dataset consists of the unit prices of products advertised as KN95 masks on various online marketplaces broken down by date. Product URLs are scraped automatically, but price, quantity and availability data is gathered by MTurk workers. The data is available on GitHub and it’s updated at least once a week.

Since we cannot be sure which products advertised as KN95 masks actually meet the filtration criteria, care must be taken to analyze the distribution of prices and filter out outliers at both the low and the high end.

Histogram for AliExpress.com KN95 Mask Unit Prices on 2020-10-06. Figure by author.
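One common way to filter outliers at both ends is an interquartile range (IQR) fence. The following is a minimal sketch of that technique, not necessarily the method used for this dataset; the function name, the multiplier k=1.5 and the sample prices are illustrative assumptions.

```python
from statistics import quantiles


def filter_outliers(prices, k=1.5):
    """Keep prices inside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    if len(prices) < 4:
        # Too few points to estimate quartiles meaningfully; keep everything.
        return list(prices)
    q1, _, q3 = quantiles(prices, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [p for p in prices if low <= p <= high]


# Made-up unit prices in USD, for illustration only.
unit_prices = [0.02, 0.45, 0.50, 0.52, 0.55, 0.60, 0.62, 0.70, 9.99]
kept = filter_outliers(unit_prices)
```

With these sample values, both the suspiciously cheap 0.02 and the expensive 9.99 fall outside the fence and are dropped.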

MTurk Lessons Learned

Online marketplaces protect their product pages from scraping, so I’ve decided to leverage MTurk for data collection. These are the lessons I’ve learned in the process.

Custom MTurk Task UI

There is no predefined MTurk task UI for price data collection, so I had to create my own. You can use the layout I’ve created as a starting point (pictured below).

MTurk UI for pricing data collection. Image by author.

Include Currency in Price

Marketplaces may display prices in different currencies based on the location of the viewer. Since MTurk workers are distributed across the globe, it’s best to ask workers to include the currency symbol in the price field and check the presence of the symbol with a regex.
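A currency check of this kind could look like the sketch below. The set of accepted symbols, the symbol-before-amount format and the helper name are assumptions made for illustration; note that a bare "$" still does not tell US dollars apart from other dollar currencies.

```python
import re
from decimal import Decimal

# Assumed format: one currency symbol, then an amount with up to two decimals.
PRICE_RE = re.compile(
    r"^\s*(?P<currency>[$€£¥])\s*(?P<amount>\d+(?:\.\d{1,2})?)\s*$"
)


def parse_price(raw):
    """Return (symbol, amount), or None if the symbol or amount is missing."""
    match = PRICE_RE.match(raw)
    if match is None:
        return None
    return match.group("currency"), Decimal(match.group("amount"))
```

For example, parse_price("$12.50") yields the symbol and the amount, while parse_price("12.50") is rejected because no currency symbol is present.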

Allow N/A Answers

It’s possible that a product is not in stock or not available for shipping to a worker’s location. In order to avoid receiving garbage input from workers faced with an impossible task, make sure that you accept n/a or empty answers in this case.
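On the processing side, such answers can be mapped to a single "not available" value before any price validation runs. This is a small sketch; the list of accepted spellings and the function name are assumptions.

```python
# Spellings treated as "not available" (an assumed, non-exhaustive list).
NOT_AVAILABLE = {"", "n/a", "na", "none", "-"}


def normalize_answer(raw):
    """Return None for empty or n/a answers, otherwise the stripped text."""
    text = (raw or "").strip()
    if text.lower() in NOT_AVAILABLE:
        return None
    return text
```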

MTurk API

Amazon conveniently exposes an API for MTurk that can be used to automate task creation and data retrieval. For Python users, I recommend the official Boto3 MTurk client, which will feel familiar if you’ve used Boto3 with other AWS APIs before. I’ve implemented a CLI with Click in Python that you can use as inspiration.

Require Multiple Assignments Per HIT

Tasks in MTurk are called Human Intelligence Tasks (HITs). For each HIT, you can request multiple assignments, which ensures that several different workers complete your task. Since verifying a worker’s output by hand takes as much effort as collecting the pricing information yourself, it’s best to request multiple assignments (at least 3) for each HIT. This way, a completed assignment can be checked automatically by matching it against the other assignments for the same HIT.
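The matching step can be as simple as a majority vote over the answers submitted for one HIT. The sketch below assumes answers are compared as normalized strings and that agreement between two workers is enough; both choices, and the function name, are illustrative.

```python
from collections import Counter


def consensus(answers, min_agree=2):
    """Return the answer given by at least min_agree workers, else None."""
    if not answers:
        return None
    # Compare answers case-insensitively and ignore surrounding whitespace.
    normalized = [a.strip().lower() for a in answers]
    value, count = Counter(normalized).most_common(1)[0]
    return value if count >= min_agree else None
```

With three assignments per HIT, two matching answers settle the value, and a HIT where all three workers disagree returns None and can be reissued.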

Save Your HITs

When creating a HIT, one can supply custom input parameters to the task layout UI that will be displayed to workers (like the product URL). Unfortunately, these input parameters cannot be retrieved through the MTurk API after creating the HIT, so they must be stored when creating it.

A SQLite database is a good solution for saving the HITs, as you can commit it directly to the repository. You can use my implementation with SQLAlchemy ORM in Python as a starting point.
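As a dependency-free illustration of the idea, the sketch below uses Python's built-in sqlite3 module rather than SQLAlchemy. The table layout, column names and helper functions are assumptions, not the schema of the linked implementation.

```python
import json
import sqlite3


def open_db(path="hits.sqlite"):
    """Open (or create) the database and make sure the hits table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS hits (
               hit_id TEXT PRIMARY KEY,
               batch  TEXT NOT NULL,
               params TEXT NOT NULL
           )"""
    )
    return conn


def save_hit(conn, hit_id, batch, params):
    """Store the input parameters of a HIT as JSON, right after creating it."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO hits (hit_id, batch, params) VALUES (?, ?, ?)",
            (hit_id, batch, json.dumps(params)),
        )


def load_params(conn, hit_id):
    """Return the stored parameters for a HIT, or None if it is unknown."""
    row = conn.execute(
        "SELECT params FROM hits WHERE hit_id = ?", (hit_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Calling save_hit immediately after each HIT is created keeps the product URL available later, when the results are matched back to products.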

Group HITs with Requester Annotation

Each HIT has an optional string property called RequesterAnnotation that can be set to an arbitrary value. You can use this property to group HITs into batches, which makes administration easier. For example, I set the RequesterAnnotation property to something like “batch_2020-10-06_16_59_44”.
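A sketch of how a batch label might be attached when creating a HIT with the Boto3 MTurk client. Here client is assumed to be a boto3.client("mturk") instance; the title, description, timings and the layout parameter name "url" are hypothetical and depend on your own task layout.

```python
from datetime import datetime, timezone


def batch_label(now=None):
    """Build a label such as batch_2020-10-06_16_59_44."""
    now = now or datetime.now(timezone.utc)
    return now.strftime("batch_%Y-%m-%d_%H_%M_%S")


def create_price_hit(client, layout_id, product_url, batch, reward="0.01"):
    """Create one HIT tagged with the batch label and return its HIT ID."""
    response = client.create_hit(
        Title="Find the price of a product",  # hypothetical wording
        Description="Record price, quantity and availability.",
        Reward=reward,  # the API expects the reward as a string
        MaxAssignments=3,
        LifetimeInSeconds=24 * 60 * 60,
        AssignmentDurationInSeconds=10 * 60,
        HITLayoutId=layout_id,
        # "url" must match a parameter name defined in your layout.
        HITLayoutParameters=[{"Name": "url", "Value": product_url}],
        RequesterAnnotation=batch,
    )
    return response["HIT"]["HITId"]
```

Generating the label once per run and passing it to every create_hit call puts all HITs of that run in the same batch.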

Prune HITs

Interacting with the MTurk API can become very slow if you have too many active HITs. Therefore, it’s best to regularly delete HITs whose output you’ve already exported from MTurk. You can use this code as an example of how to prune HITs.
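Pruning could be sketched roughly as follows, independent of the linked code. The function assumes client is a boto3.client("mturk") instance and that the results of the given batch have already been exported. MTurk only allows deleting HITs that are in the Reviewable state with all submitted assignments approved or rejected, so other HITs are skipped here.

```python
def prune_hits(client, batch):
    """Delete Reviewable HITs whose RequesterAnnotation equals batch."""
    to_delete = []
    kwargs = {"MaxResults": 100}
    while True:
        page = client.list_hits(**kwargs)
        for hit in page.get("HITs", []):
            same_batch = hit.get("RequesterAnnotation") == batch
            if same_batch and hit["HITStatus"] == "Reviewable":
                to_delete.append(hit["HITId"])
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token
    # Delete only after listing is finished, so paging is not disturbed.
    for hit_id in to_delete:
        client.delete_hit(HITId=hit_id)
    return to_delete
```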

Amazon’s Fee Can Add Up

When creating a HIT, you specify the reward for each completed assignment. The minimum reward is $0.01. Amazon also charges a fee on top of the reward paid to workers. The fee is normally 20% of the reward, but the minimum fee is $0.01 per assignment, so at the minimum reward the fee equals the reward: 100% of the reward instead of 20%, or half of the total cost.
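The arithmetic can be checked with a few lines of Python. The 20% rate and the $0.01 minimum are the figures quoted above; the function name is illustrative, and the current fee schedule should be confirmed against Amazon's pricing page.

```python
from decimal import Decimal

FEE_RATE = Decimal("0.20")  # fee rate quoted in the text
MIN_FEE = Decimal("0.01")   # minimum fee per assignment quoted in the text


def assignment_cost(reward):
    """Return (reward, fee, total) in USD for one assignment."""
    reward = Decimal(reward)
    fee = max(reward * FEE_RATE, MIN_FEE)
    return reward, fee, reward + fee


reward, fee, total = assignment_cost("0.01")
# At the minimum reward, the fee equals the reward and the total doubles.
batch_total = total * 100 * 3  # 100 HITs with 3 assignments each
```

At a $0.01 reward, each assignment costs $0.02 in total, so a batch of 100 HITs with 3 assignments each costs $6.00 rather than the $3.60 a flat 20% fee would imply.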

Conclusion

MTurk is a good alternative for gathering pricing data from online marketplaces when scraping is not an option. However, you must be prepared to handle faulty input from workers, and the cost may be prohibitive. If you decide to go with MTurk, you can use my Python CLI on GitHub as a starting point for automating task administration in your own project.
