YMCA ExpLoratory Data Analysis: what the village people have to do with obesity (part 2)
The web scraping is described in my previous post. Github of code. For this one, I will talk about how I wrangled the data together so that I could explore.
I scraped one state at a time, and some states have a LOT more zipcodes than others. My average time was about 3 zip codes per min. which meant a few nights of letting my computer stay on all night to work while I snore.
Zip + County Data
Now that I have my precious YMCA data, I can take a look at the obesity stats, right? WRONG! it turns out that the obesity data that I found (the gold standard is from the Centers for Disease Control) is collected at the county level. Why is this a problem? Well, it may not be an issue that keeps you up at night, so allow me to explain. Zip codes to not necessarily correspond to county boundaries. In fact, zip codes do not even denote a physical boundary, but a postal route, which often includes a certain area, but not always. I knew there were places that connect this information, so I located a nice .csv file (for free) that connects zip codes to county from https://www.unitedstateszipcodes.org/zip-code-database/ It lists every zip code for every US state and territory. In this file, county = county with the largest percentage of the zip code's population. So, even if a zip code straddles the line of 2 different counties, it put it into the one with the bulk of the population. AWESOME!
Zip Codes
These are a few of the columns from the file, after I did some initial cleaning:
added leading zeros back to zips that began with a zero
rename the column 'zip' to 'zipcode' b/c 'zip' is a function in python
remove the word 'County' from each county name for easier merging
pull info for one state and save the dataframe and a list of just the zip codes (for selenium)
Merge Zip code Database with YMCA locations
Next, I summed the YMCA locations dataframe over the zip codes to get a total number of locations for each zip. Then, merge it with the zip code dataframe (below left). Then sum over county to get the number of YMCA locations per county (below, right). You can see how we get 4 locations for Greene county Missouri in the sample below.
health data
Finally, I can get it merged with the health data. The statistics came from www.HealthCountyRankings.com, which pulls stats from many reliable places, and pools them in order to get a ranking of each county per state. They are awesome, so they make their data files freely available in .csv format. I had to open it in Excel and save it by the 'sheet', as there are many sheets of highly organized data and descriptions of where the data came from. My kind of people! The only real cleaning that this data needed was
convert the numbers from strings to floats.
Remove the word 'County' from county names for easier merging
Remove the rows containing aggregated state-wide stats
I decided to start with the following stats from the file:
Adult Obesity % - defined as a Body Mass Index > 30. Collected from the CDC Diabetes Interactive Atlas (2013).
Food Environment Index -A rank (1-10) based on a household's access to healthy foods. Collected by the USDA food environment atlas, Map the Meal Gap (2014).
Severe Housing Problems - % of households that have one of the following: overcrowding, high housing costs, lack of plumbing or kitchen. Collected from the Comprehensive Housing Affordability Strategy (CHAS) data (2009-2013).
Some College - % of adults aged 25-44 with some post-secondary education. Collected from the American Community Survey (2011-2015).
Access to exercise opportunities - % of adults aged 20 and over, reporting no leisure-time physical activity. Collected from the CDC Diabetes Interactive Atlas (2013).
The merged health + ymca locations data sample is below.
I thought it would be fun to plot the locations of the YMCA and the county population (irs estimated population from 2014). The size of the dots is supposed to represent the number of YMCA locations for the county, but as you can see, it is rare to find more than 1. The more densely populated areas of Kansas City and St. Louis have a higher number of YMCAs, as expected. (I plotted this with Basemap, a great toolbox for all your mapping needs).
See my next post where I put our variables into a linear regression analysis!