OpenWPM - Web privacy measurement framework
Web Privacy Measurement is the observation of websites and serves to detect, characterize and quantify privacy-impacting behaviors. Applications of Web Privacy Measurement include the detection of price discrimination, targeted news articles and new forms of browser fingerprinting. Although originally focused solely on privacy violations, WPM now encompasses measuring security violations on the web as well.
For these studies to be truly large-scale and repeatable, creating an automated measurement platform is necessary. At least within the academic literature, measurement infrastructures in the field of WPM have been largely one-off and do not comprehensively address the engineering challenges within this realm.
OpenWPM, a flexible, stable, scalable and general web measurement platform, is our solution to this infrastructure vacuum. This tutorial shows how to get started with OpenWPM, gives an overview of its general functionality and lists some key engineering challenges which are still being solved. We hope that this tool will enable other researchers to perform WPM studies and welcome future collaboration.
Installation
OpenWPM has been developed and tested on Ubuntu 14.04/16.04. An installation script, install.sh, is included to install both the system and Python dependencies automatically. A few of the Python dependencies require specific versions, so you should install the dependencies in a virtual environment if you're installing on a shared machine. If you plan to develop OpenWPM's instrumentation extension or run tests, you will also need to install the development dependencies included in install-dev.sh.
OpenWPM will likely work on platforms other than Ubuntu, but we do not officially support anything else. For pointers on alternative platform support, see the wiki.
Quick Start
Once installed, it is very easy to run a quick test of OpenWPM. Check out demo.py for an example. This will use the default settings specified in automation/default_manager_params.json and automation/default_browser_params.json, with the exception of the changes specified in demo.py.
Instrumentation and Data Access
OpenWPM provides several instrumentation modules which can be enabled independently of each other for each crawl. With the exception of response body content, all instrumentation saves to a SQLite database specified by manager_params['database_name'] in the main output directory. Response bodies are saved to content.ldb. The SQLite schema is specified by automation/schema.sql; instrumentation modules may specify additional tables necessary for their measurement data (see extension tables).
- HTTP Request and Response Headers, redirects, and POST request bodies
  - Set browser_params['http_instrument'] = True
  - Data is saved to the http_requests, http_responses, and http_redirects tables.
    - http_requests schema documentation
    - channel_id can be used to link a request saved in the http_requests table to its corresponding response in the http_responses table.
    - channel_id can also be used to link a request to the subsequent request that results after an HTTP redirect (3XX response). Use the http_redirects table, which includes a mapping between old_channel_id, the channel_id of the HTTP request that resulted in a 3XX response, and new_channel_id, the HTTP request that resulted from that redirect.
  - OCSP POST request bodies are not recorded
  - Note: request and response headers for cached content are also saved, with the exception of images. See: Bug 634073.
- Javascript Calls
  - Records all method calls (with arguments) and property accesses for APIs of potential fingerprinting interest:
    - HTML5 Canvas
    - HTML5 WebRTC
    - HTML5 Audio
    - Plugin access (via navigator.plugins)
    - MIMEType access (via navigator.mimeTypes)
    - window.Storage, window.localStorage, window.sessionStorage, and window.name access.
    - Navigator properties (e.g. appCodeName, oscpu, userAgent, ...)
    - Window properties (via window.screen)
  - Set browser_params['js_instrument'] = True
  - Data is saved to the javascript table.
- Response body content
  - Saves all files encountered during the crawl to a LevelDB database de-duplicated by the md5 hash of the content.
  - Set browser_params['save_all_content'] = True
  - The content_hash column of the http_responses table contains the md5 hash for each script, and can be used to do content lookups in the LevelDB content database.
  - NOTE: this instrumentation may lead to performance issues when a large number of browsers are in use.
  - Set browser_params['save_javascript'] = True to save only Javascript files. This will lessen the performance impact of this instrumentation when a large number of browsers are used in parallel.
- Flash Cookies
  - Recorded by scanning the respective Flash directories after each page visit.
  - To enable: call the CommandSequence::dump_flash_cookies command after a page visit. Note that calling this command will close the current tab before recording the cookie changes.
  - Data is saved to the flash_cookies table.
  - NOTE: Flash cookies are shared across browsers, so this instrumentation will not correctly attribute Flash cookie changes if more than one browser is running on the machine.
- Cookie Access (Experimental -- Needs tests)
  - Set browser_params['cookie_instrument'] = True
  - Data is saved to the javascript_cookies table.
  - Will record cookies set both by Javascript and via HTTP Responses
- Content Policy Calls (Experimental -- Needs tests)
  - Set browser_params['cp_instrument'] = True
  - Data is saved to the content_policy table.
  - Provides additional information about what caused a request and what it's for
  - NOTE: This instrumentation is largely unchanged since it was ported from FourthParty, and is not linked to any other instrumentation tables.
- Cookie Access (Alternate)
  - Recorded by scanning the cookies.sqlite database in the Firefox profile directory.
  - Should contain both cookies added by Javascript and by HTTP Responses
  - To enable: call the CommandSequence::dump_profile_cookies command after a page visit. Note that calling this command will close the current tab before recording the cookie changes.
  - Data is saved to the profile_cookies table
- Log Files
  - Stored in the directory specified by manager_params['data_directory'].
  - Name specified by manager_params['log_file'].
- Browser Profile
  - Contains cookies, Flash objects, and so on that are dumped after a crawl is finished
  - Automatically saved when the platform closes or crashes by specifying browser_params['profile_archive_dir'].
  - Save on-demand with the CommandSequence::dump_profile command.
- Rendered Page Source
  - Save the top-level frame's rendered source with the CommandSequence::dump_page_source command.
  - Save the full rendered source (including all nested iframes) with the CommandSequence::recursive_dump_page_source command.
    - The page source is saved in the following nested json structure:

          {
              'document_url': "http://example.com",
              'source': "<html> ... </html>",
              'iframes': {
                  'frame_1': {'document_url': ..., 'source': ..., 'iframes': { ... }},
                  'frame_2': {'document_url': ..., 'source': ..., 'iframes': { ... }},
                  'frame_3': { ... }
              }
          }
- Screenshots
  - Selenium 3 can be used to screenshot an individual element. None of the built-in commands offer this functionality, but you can use it when writing your own. See the Selenium documentation.
  - Viewport screenshots (i.e. a screenshot of the portion of the website visible in the browser's window) are available with the CommandSequence::save_screenshot command.
  - Full-page screenshots (i.e. a screenshot of the entire rendered DOM) are available with the CommandSequence::screenshot_full_page command.
    - This functionality is not yet supported by Selenium/geckodriver, though it is planned. We produce screenshots by using JS to scroll the page and take a viewport screenshot at each location. This method will save the parts and a stitched version in the screenshot_path.
    - Since the screenshots are stitched they have some limitations:
      - Only the area of the page present when the command is called will be captured. Sites which dynamically expand when scrolled (i.e., infinite scroll) will only go as far as the original height.
      - We only scroll vertically, so pages that are wider than the viewport will be clipped.
      - In geckodriver v0.15 doing any scrolling (or having devtools open) seems to break element-only screenshots. So using this command will cause any future element-only screenshots to be misaligned.
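
The channel_id linking described for the HTTP tables can be illustrated with a small, self-contained sketch. It builds an in-memory SQLite database instead of opening a real crawl database. Only the table names and the channel_id, old_channel_id, and new_channel_id columns come from the description above; the url and response_status columns and the sample rows are assumptions made for illustration, so check automation/schema.sql for the real column names.

```python
import sqlite3

# In-memory stand-in for the crawl database. The url and response_status
# columns are assumptions for this sketch, not the documented schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE http_requests  (channel_id TEXT, url TEXT);
    CREATE TABLE http_responses (channel_id TEXT, response_status INTEGER);
    CREATE TABLE http_redirects (old_channel_id TEXT, new_channel_id TEXT);
    INSERT INTO http_requests  VALUES ('c1', 'http://example.com/'),
                                      ('c2', 'https://example.com/');
    INSERT INTO http_responses VALUES ('c1', 301), ('c2', 200);
    INSERT INTO http_redirects VALUES ('c1', 'c2');
""")

# Link each request to its response through the shared channel_id.
request_status = conn.execute("""
    SELECT req.url, resp.response_status
    FROM http_requests AS req
    JOIN http_responses AS resp ON req.channel_id = resp.channel_id
    ORDER BY req.channel_id
""").fetchall()

# Link a redirected request to the request that resulted from the redirect.
redirect_pairs = conn.execute("""
    SELECT old_req.url, new_req.url
    FROM http_redirects AS redir
    JOIN http_requests AS old_req ON redir.old_channel_id = old_req.channel_id
    JOIN http_requests AS new_req ON redir.new_channel_id = new_req.channel_id
""").fetchall()

print(request_status)
print(redirect_pairs)
```

Against a real crawl, the same two queries would be run on the database named by manager_params['database_name'], with the assumed columns replaced by those in the schema.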
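
The content_hash lookup described under Response body content relies on de-duplicating files by the md5 hash of their content. The sketch below shows that idea only: a plain dict stands in for the LevelDB database, and the use of a hex digest as the key is an assumption, since the text does not say how the key is encoded. The save_content helper is hypothetical and is not part of OpenWPM.

```python
import hashlib

# A dict stands in for the LevelDB content database in this sketch.
content_store = {}


def save_content(body: bytes) -> str:
    """Store body under its md5 hash (hex digest assumed) and return the hash."""
    content_hash = hashlib.md5(body).hexdigest()
    # setdefault keeps the first copy, so identical bodies are stored once.
    content_store.setdefault(content_hash, body)
    return content_hash


h1 = save_content(b"console.log('a');")
h2 = save_content(b"console.log('a');")  # duplicate body, same hash
h3 = save_content(b"console.log('b');")

# A hash recorded alongside a response can be used to fetch the body later.
print(len(content_store), content_store[h1])
```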
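
The nested structure produced by recursive_dump_page_source can be walked with a short recursive generator. This is a minimal sketch that assumes the saved data has already been loaded into a Python dict with the document_url, source, and iframes keys shown above. The iter_frames helper and the sample URLs are hypothetical.

```python
def iter_frames(page, depth=0):
    """Yield (depth, document_url, source) for a frame and all nested iframes."""
    yield depth, page["document_url"], page["source"]
    for child in page.get("iframes", {}).values():
        yield from iter_frames(child, depth + 1)


# Hypothetical sample following the nested structure described above.
sample = {
    "document_url": "http://example.com",
    "source": "<html>top</html>",
    "iframes": {
        "frame_1": {
            "document_url": "http://ads.example",
            "source": "<html>ad</html>",
            "iframes": {
                "frame_1": {
                    "document_url": "http://tracker.example",
                    "source": "<html>pixel</html>",
                    "iframes": {},
                },
            },
        },
        "frame_2": {
            "document_url": "http://widget.example",
            "source": "<html>widget</html>",
            "iframes": {},
        },
    },
}

# Collect every frame URL with its nesting depth, parents before children.
frame_urls = [(depth, url) for depth, url, _ in iter_frames(sample)]
print(frame_urls)
```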