Challenge: How can data connect individuals with the health providers they need?

From Demand-Driven Open Data for HHS

Background description

Since the launch of health insurance marketplaces as part of the Patient Protection and Affordable Care Act (PPACA), millions of Americans have obtained new health care coverage. There have, however, been widespread consumer complaints. In particular, people tend to choose the wrong plan because relevant information is unavailable or inaccurate. Often, patients don't discover until after a purchase that their physician isn't in-network, or that an in-network specialist they need isn't taking new patients.

In November 2015, the Centers for Medicare & Medicaid Services (CMS) enacted a new regulatory requirement for health insurers who list plans on insurance marketplaces. They must now publish a machine-readable version of their provider network directory that conforms to a specified JSON standard, and update it at least monthly. Finally, this data is becoming accessible.

But data that is accessible to computers and engineers is not necessarily accessible to the general market of health care consumers. The new challenge, then, is to transform this vast directory of provider data into insights that can guide individuals to the health care they're paying for, that they deserve, and that they often badly need.

Source: This prompt was posted as a challenge from HHS (U.S. Department of Health and Human Services) for use at BayesHack 2016.

Resources and Links

  • The CMS Health Insurance Marketplace Public Use Files, this prompt's core dataset.
    • "Network PUF" has only URLs to provider networks that are neither machine-readable nor standardized.
    • "Machine-Readable PUF" is intended to overcome these limitations. It has URLs that can be crawled to obtain machine-readable, standardized, and frequently updated provider networks.

  • Directory data schemas from the HHS DDOD.
  • Detailed analysis of the Provider Network Directories from David Portnoy, who contributed heavily to the development of data standards for the directory.
  • HIFLD geospatial data on the locations of pharmacies and hospitals in the United States and Territories.
  • Department of Health and Human Services geospatial data collected on Esri's ArcGIS open data platform. Includes a variety of county-level health indicators, ranging from low birth weight percentages to population per health care facility, as well as location data for points of care.

Data Dictionaries

2016 Data Dictionaries                    | 2014 and 2015 Data Dictionaries
Benefits and Cost Sharing Data Dictionary | Benefits and Cost Sharing Data Dictionary
Rate Data Dictionary                      | Rate Data Dictionary
Plan Attributes Data Dictionary           | Plan Attributes Data Dictionary
Business Rules Data Dictionary            | Business Rules Data Dictionary
Service Area Data Dictionary              | Service Area Data Dictionary
Network Data Dictionary                   | Network Data Dictionary
Plan ID Crosswalk Data Dictionary         | Plan ID Crosswalk Data Dictionary
Machine-readable Data Dictionary **       |

** This is the only source for listings of health care providers in a machine-readable or standardized format. See the section "Aggregated Machine-Readable Marketplace Data" below for an aggregated, tabular version of this data.

Aggregated Machine-Readable Marketplace Data

Aggregated machine-readable provider network directories and drug formularies in tabular format



The datasets presented here are the result of a separate effort to make the newly available data easily accessible for analytics and application development. This work was needed for two reasons. First, the original "Machine-readable URL PUF" seed file is just a starting point; the actual data is scattered across thousands of URLs. Second, for analytics and aggregate operations, tabular data is much easier to work with than the original JSON schema. The following steps were taken to produce these data:

  1. Crawl the thousands of URLs starting with the "Machine-readable URL PUF" seed file found on this CMS page:
    • (As of 3/1/2016, there are 636 plan URLs, 23,711 provider URLs and 1,662 formulary URLs.)
  2. Convert the data from the JSON schema into a tabular format with the same fields.
    • (There were a number of challenges converting multiple independent array fields from JSON to a tabular row.)
  3. Aggregate the results into delimited text files, producing 3 sets of files, one for each of the defined entities: Plans, Providers, and Formularies.
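
The conversion and aggregation steps above can be illustrated with a minimal Python sketch. It is not the contributors' actual code. The record layout, the field names (npi, name, specialty, plans, plan_id, network_tier), and the helper names flatten_provider and write_delimited are assumptions made for illustration only. The sketch shows one possible way to handle a record with two independent array fields: emit one row per combination of array values, then append the rows to a delimited text file.

```python
import csv
import io
from itertools import product

# Hypothetical provider record. The field names are illustrative assumptions,
# not a statement of the actual directory schema.
SAMPLE_PROVIDER = {
    "npi": "1234567890",
    "name": "Example Provider",
    "specialty": ["Cardiology", "Internal Medicine"],
    "plans": [
        {"plan_id": "PLAN-A", "network_tier": "PREFERRED"},
        {"plan_id": "PLAN-B", "network_tier": "STANDARD"},
    ],
}

FIELDS = ["npi", "name", "specialty", "plan_id", "network_tier"]


def flatten_provider(record):
    """Yield one flat dict per (specialty, plan) combination.

    A missing or empty array contributes a single blank value, so the
    provider still appears in the output at least once.
    """
    specialties = record.get("specialty") or [""]
    plans = record.get("plans") or [{}]
    for specialty, plan in product(specialties, plans):
        yield {
            "npi": record.get("npi", ""),
            "name": record.get("name", ""),
            "specialty": specialty,
            "plan_id": plan.get("plan_id", ""),
            "network_tier": plan.get("network_tier", ""),
        }


def write_delimited(records, out, delimiter="|"):
    """Write a header plus the flattened rows for all records to a text stream.

    Returns the number of data rows written.
    """
    writer = csv.DictWriter(
        out, fieldnames=FIELDS, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    count = 0
    for record in records:
        for row in flatten_provider(record):
            writer.writerow(row)
            count += 1
    return count


if __name__ == "__main__":
    buffer = io.StringIO()
    rows_written = write_delimited([SAMPLE_PROVIDER], buffer)
    print(rows_written, "rows written")
    print(buffer.getvalue())
```

Taking every combination keeps each entity in a single file but multiplies the row count: a provider with 2 specialties and 2 plans becomes 4 rows. An alternative, also only a suggestion, is to write each array to its own child file keyed on the provider identifier.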


Credit: These files have been made possible through the efforts of these members of the open data community.

  • Jeff Stewart: Code to crawl URLs (Python, RedShift), execution of code, and hosting of aggregated files
  • Mark Silverberg: Source code and Socrata platform (hosting, API, visualization)