Cleaning and Analysis
To facilitate sharing data, we conducted both data cleaning and analysis with the open source statistical software R, which is available free of charge at http://www.r-project.org . R is widely regarded as a standard tool among statisticians, and it offers several advantages. We are able to manipulate very large data sets (>2GB) on an ordinary desktop computer, and we can produce polished graphics with minimal coding.
Clean Data is...
Cleaning Process
1. First we start with "dirty" data. (Fig. 1)
2. Next we must download the data. A section of download code is shown below. (Fig. 2)
3. Once we have the data, we clean the data as best we can according to the rules describing clean data above. A section of cleaning code is shown below. (Fig. 3)
4. Now that the data has been cleaned, it may look like the top part of the data below. (Fig. 4)
5. With clean data, we are able to explore it. The code below (Fig. 5) is the command used to produce the plot in Fig. 6.
6. With R we are able to produce complex plots with a minimal amount of code. (Fig. 6)
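The cleaning step can be illustrated with a small sketch. It is written in Python using only the standard library, not in the R code the project actually used, and everything in it is an assumption made for illustration: the sample rows, the column names, and the set of missing-value markers are invented, and the cleaning rules shown (tidy column names, trimmed text, typed numbers, one explicit missing value) are generic ones rather than the project's own rules.

```python
import csv
import io

# Hypothetical raw extract (invented for illustration): inconsistent header
# spacing and case, stray spaces, and several spellings of "missing".
RAW = """MSA , State,Year,Quarter, HPI
"Merced, CA",CA,2005,4, 250.5
"Houston, TX",TX,2005,4,NA
"Miami, FL",FL,2005,4,-
"""

# Assumed markers for missing values; a real source may use others.
MISSING = {"", "NA", "-", "."}


def clean_rows(text):
    """Return a list of dicts with tidy column names and typed values."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    # Normalize header names: trim whitespace and lower-case them.
    header = [name.strip().lower() for name in next(reader)]
    cleaned = []
    for raw in reader:
        if not raw:
            continue  # skip blank lines
        row = dict(zip(header, (value.strip() for value in raw)))
        row["year"] = int(row["year"])
        row["quarter"] = int(row["quarter"])
        # Map every missing marker to None; parse everything else as a number.
        row["hpi"] = None if row["hpi"] in MISSING else float(row["hpi"])
        cleaned.append(row)
    return cleaned


rows = clean_rows(RAW)
for row in rows:
    print(row)
```

Representing every missing value as a single `None` keeps later summaries from silently treating markers such as `NA` or `-` as text.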
Interesting Findings
Location, Location, Location...
The data graphed (Fig. 7 and Fig. 8) are from the Federal Housing Finance Agency (FHFA) house price index (HPI). Both graphs examine when the HPI peaked for each metropolitan statistical area (MSA).
Looking at both graphs, we believe that timing is very significant. If a state peaked earlier than 2006 or later than 2007, its HPI was not as greatly affected. This also supports the claim that California and Florida were affected the most.
In Figure 7, you can see that California and Florida peaked around the same time. The graph shows the year in which each MSA reached its maximum housing price.
In Figure 8, every point is an MSA, labeled by state. The graph plots the time of peak HPI against the percent change in HPI from the maximum HPI to the HPI in the first quarter of 2009. It shows that if HPI peaked between 2006 and 2007, then that state typically experienced a much larger percent change in HPI.
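The two quantities plotted in Figure 8 can be computed as in the following sketch, which assumes that percent change is measured relative to the peak value. It is a Python illustration, not the R code used for the figures; the MSA names and HPI values are invented, and the function name `peak_and_change` is hypothetical.

```python
# Hypothetical quarterly HPI series keyed by (year, quarter).
# All names and values are invented for illustration.
SERIES = {
    "Example MSA A": {(2005, 4): 180.0, (2006, 2): 200.0, (2009, 1): 120.0},
    "Example MSA B": {(2004, 1): 160.0, (2006, 2): 150.0, (2009, 1): 152.0},
}

END = (2009, 1)  # first quarter of 2009, the endpoint named in the text


def peak_and_change(series, end=END):
    """Return (peak_time, percent_change) for one MSA.

    peak_time is the quarter of maximum HPI as a decimal year.
    percent_change is the change from the peak HPI to the HPI at `end`,
    expressed as a percentage of the peak (assumed definition).
    """
    # Only consider quarters up to and including the end quarter.
    window = {t: v for t, v in series.items() if t <= end}
    (year, quarter), peak_value = max(window.items(), key=lambda kv: kv[1])
    peak_time = year + (quarter - 1) / 4  # e.g. 2006 Q2 becomes 2006.25
    change = 100.0 * (series[end] - peak_value) / peak_value
    return peak_time, change


summary = {msa: peak_and_change(s) for msa, s in SERIES.items()}
for msa, (peak_time, change) in summary.items():
    print(f"{msa}: peaked at {peak_time:.2f}, change {change:.1f}%")
```

Plotting `peak_time` on one axis and `change` on the other, one point per MSA, gives a scatter of the kind described for Figure 8.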
Merced, CA
The city with the greatest percent change in the FHFA HPI was Merced, CA. This is very unusual for a small city. Further research into Merced showed that the University of California, Merced finished construction in late 2005. Using both Figures 9 and 10, we hypothesize that construction increased because of the need for housing for UC Merced students and employees.
Myth Busters
After discovering Merced, CA, we decided to look more closely at college towns. Contrary to popular belief, college towns were not greatly impacted by the housing crisis. They were affected more by their location than by being a "college town". (Fig. 11)
Other Explorations
Communication and Future Work
It is extremely important that all of our data cleaning and findings are reproducible. We've made both the data and programming code available to the public through our PFUG's website at http://github.com/hadley/data-housing-crisis . GitHub is a code-hosting website that tracks changes made to data and code by multiple individuals.
GitHub is advantageous to both our research group and the general public. First, we are able to store large amounts of data for free. It also allows us to work on the same data without having to e-mail changes back and forth. In addition, others can view and download our data for free. We hope that by keeping the code transparent and self-replicating, we make it easy for others to build on our work.
We would like to develop a website that will allow users to easily access the data they are interested in, which would otherwise be a daunting task with a data set of this size. Because our analysis and findings also involve large amounts of information (such as construction price time series for each US metropolitan area), we are exploring interactive graphical methods for displaying this information. Our future research will involve using the internet application Many Eyes, http://manyeyes.alphaworks.ibm.com , and then eventually the program Protovis, http://vis.stanford.edu/protovis , to create this website.
Acknowledgements
This Connexions module describes work conducted as part of Rice University's VIGRE program, supported by National Science Foundation grant DMS-0739420.