It's been rainy here in NYC as of late. Just about the only thing worse than 90 degree city heat is 90 degree city heat with intense thunderstorms roaring through. So I find myself indoors when it rains, crunching data, writing code, checking into GitHub.
Of course I'm not unique here. There are thousands of other GitHub coders in New York and millions of contributors worldwide. The data scientist inside of me asks questions. Is it possible to measure these effects? And if so, exactly how much more do people code when it's rainy?
So, I pored over GitHub's publicly available data on BigQuery and found that, yes, it is possible to measure the effects of weather and, yes, these effects are sizable. But the analysis turned out to be trickier than expected.
The dataset here was collected from GitHub's data hosted on BigQuery. GitHub check-in size can vary greatly by user, and also greatly by project. While my check-ins may average a few hundred lines, others may average 2 or 3 lines. So if I write 1,000 lines of code, this may only be 4 or 5 check-ins for me, but dozens of check-ins for others. Instead of using a raw check-in count, I counted the number of unique users per location who checked in at any time during a given day.
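A minimal sketch of this counting step, assuming each check-in has already been reduced to a (day, location, user) tuple. That record layout and the function name are hypothetical, not the schema of the BigQuery tables:

```python
from collections import defaultdict

def unique_users_per_day(checkins):
    """Count distinct users per (day, location) pair.

    `checkins` is an iterable of (day, location, user) tuples. This layout
    is an assumption made for illustration only.
    """
    users = defaultdict(set)
    for day, location, user in checkins:
        # A set ignores repeat check-ins by the same user on the same day.
        users[(day, location)].add(user)
    return {key: len(names) for key, names in users.items()}

# Made-up sample: alice checks in twice on June 3 but is counted once.
sample = [
    ("2013-06-03", "New York", "alice"),
    ("2013-06-03", "New York", "alice"),
    ("2013-06-03", "New York", "bob"),
    ("2013-06-04", "New York", "alice"),
]
counts = unique_users_per_day(sample)
# counts == {("2013-06-03", "New York"): 2, ("2013-06-04", "New York"): 1}
```

Counting users rather than check-ins keeps a contributor with many tiny commits from outweighing one with a few large ones.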
Location data was geocoded using Google Maps API, and then nearest weather stations were determined via WeatherUnderground's API. Historical weather was also collected from WeatherUnderground. Since the collection of historical weather data is somewhat expensive, I restricted analysis to the most popular 1,000 locations on GitHub.
The final dataset contains day, location, count, and weather information for each data point. You can download the full GitHub weather dataset in csv format.
All analysis was done using the IPython notebook and NumPy. IPython's new notebook feature is awesome; check it out if you haven't.
Many factors drive check-in behavior on GitHub. For example, check-ins drop drastically on weekends; people check in 56% more on Fridays than they do on Saturdays.
Also noticeable are the general increase in GitHub usage over time and the drop in check-ins during the New Year's holiday.
The cyclic weekend / weekday behavior is also visible here.
Unfortunately, measuring the impact of the weather turns out to be less straightforward. Let's start by considering the naive analysis below.
A naive analysis of the weather
In the dataset at hand, it rains about 26% of the time. For each day in the dataset, compute total check-in counts for both rainy locations and non-rainy locations. Then normalize these counts across their respective probabilities:
p(rain) * E(rain_count_per_day) = observed_rain_count
p(clear) * E(clear_count_per_day) = observed_clear_count
Solving for E(x) makes the two values comparable. Intuitively, each expected value can be thought of as what the count would have looked like had it rained (or not rained) 100% of the time.
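As a small worked example of this normalization, assuming made-up observed counts (only the 26% rain probability comes from the dataset described here):

```python
def expected_count(observed_count, probability):
    """Solve p * E = observed for E, the count expected at 100% probability."""
    if not 0 < probability <= 1:
        raise ValueError("probability must be in (0, 1]")
    return observed_count / probability

p_rain = 0.26            # share of rainy observations, from the text
observed_rain = 2600.0   # hypothetical daily total over rainy locations
observed_clear = 7400.0  # hypothetical daily total over clear locations

e_rain = expected_count(observed_rain, p_rain)
e_clear = expected_count(observed_clear, 1 - p_rain)
difference = e_rain - e_clear
```

With these invented inputs both expected values land at roughly 10,000, so the difference is about zero. That is the kind of outcome the naive comparison produces when rain makes no visible difference in the raw totals.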
Unfortunately, this analysis results in an imperceptible difference between the two groups.
And not surprisingly, the average increase in check-ins, E(rain_count_per_day) - E(clear_count_per_day), was effectively zero, let alone statistically significant.
Simulating a controlled environment
Why did the above analysis show zero change? In short, there's too much other stuff going on. In the context of weather, factors such as day of week, seasonality, and location are much bigger drivers of check-in volume. And if weather is at all a factor, its effect is smaller.
In a controlled environment, you could run an A/B test: put 50% of users in a rainy bucket and put the other 50% of users in a clear bucket. Half of users would see rain; the other half see clear skies. The resulting data would look something like this: "On June 3, the rain test group within the NYC population committed 5,234 total check-ins, while the clear-skies group committed 4,923 check-ins". The resulting analysis would simply aggregate these numbers across all dates and locations, and then measure the change in rainy vs sunny environments.
Of course, in the domain of weather, this sort of experiment is impossible.
So while we can't measure these numbers directly, is it possible to estimate them? E.g., for each day and each location, if it did in fact rain on that day, how many check-ins would we have expected had it actually not rained?
Modeling clear skies with linear regression
To answer this question, I turned to one of the simplest tools in my toolbox: linear regression. To train the model, I used the following variables as inputs:
- is_weekend: 0/1 indicator variable. "1" if day is Saturday or Sunday, "0" otherwise
- days_since_beginning: # of days since first day in dataset
- is_christmas: 0/1 indicator variable. "1" for dates in the holiday dip, which actually began on Saturday, December 21 and went through January 2; "0" otherwise
- seven_day_average_count: Average check-in count over all locations for the 3 days before and after the date
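The four inputs above could be assembled into a feature matrix along these lines. This is a sketch under assumptions: the function name and argument layout are hypothetical, and the seven-day window is taken to include the day itself and to be truncated at the edges of the data, which the description above does not specify.

```python
import datetime as dt
import numpy as np

def build_features(dates, daily_totals, holiday_start, holiday_end):
    """Return one row per date: is_weekend, days_since_beginning,
    is_christmas, seven_day_average_count.

    `dates` must be consecutive days in order; `daily_totals` holds the
    check-in count over all locations for each of those days.
    """
    dates = list(dates)
    totals = np.asarray(daily_totals, dtype=float)
    first = dates[0]
    # weekday() returns 5 for Saturday and 6 for Sunday.
    is_weekend = np.array([1.0 if d.weekday() >= 5 else 0.0 for d in dates])
    days_since = np.array([(d - first).days for d in dates], dtype=float)
    is_christmas = np.array(
        [1.0 if holiday_start <= d <= holiday_end else 0.0 for d in dates]
    )
    # Centered window: 3 days before, the day itself, 3 days after
    # (an assumption), shortened where the data runs out.
    seven_day = np.array(
        [totals[max(0, i - 3): i + 4].mean() for i in range(len(totals))]
    )
    return np.column_stack([is_weekend, days_since, is_christmas, seven_day])

# Seven consecutive days starting Thursday, December 19, 2013.
dates = [dt.date(2013, 12, 19) + dt.timedelta(days=i) for i in range(7)]
totals = [1, 2, 3, 4, 5, 6, 7]  # made-up counts
X = build_features(dates, totals, dt.date(2013, 12, 21), dt.date(2014, 1, 2))
```

Each row of `X` lines up with one date, so the matrix can be handed directly to a least-squares fit.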
The graphs below show the performance of this model when trained over various subsets of these features. The first plot shows the regression over just the day number (days_since_beginning). As expected, the regression captures the overall upward trend, but a straight line in a single dimension doesn't really capture what's going on.
Adding in seven day counts does a much better job of capturing non-linear growth patterns.
But the model still doesn't capture the week-over-week pattern of lower weekend check-ins. Adding is_weekend to the model, along with the rest of the variables, improves the regression.
Notice how the root mean squared error (RMSE) decreases as more and more independent variables are added to the model.
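To make the RMSE comparison concrete, here is a sketch that fits ordinary least squares on nested feature subsets and reports the in-sample RMSE of each. The data is synthetic and the helper name is hypothetical; none of the numbers come from the GitHub dataset.

```python
import numpy as np

def rmse_for_features(X, y, columns):
    """Fit least squares on the chosen columns (plus an intercept) and
    return the in-sample root mean squared error."""
    A = np.column_stack([np.ones(len(y)), X[:, columns]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sqrt(np.mean((y - A @ coef) ** 2)))

# Synthetic data: an upward trend, a weekend dip, and noise.
rng = np.random.default_rng(0)
n = 500
days = np.arange(n, dtype=float)
is_weekend = (np.arange(n) % 7 >= 5).astype(float)
y = 100 + 0.05 * days - 30 * is_weekend + rng.normal(0, 5, size=n)
X = np.column_stack([days, is_weekend])

rmse_trend_only = rmse_for_features(X, y, [0])
rmse_with_weekend = rmse_for_features(X, y, [0, 1])
```

For nested least-squares models the in-sample RMSE can never go up when a variable is added, so a falling RMSE alone does not prove a feature is useful. Here the drop is large because the weekend dip is real in the synthetic data.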
Given that the regression models above provide a reasonable way of predicting check-in counts, we can build a model over clear days, apply it to days when it actually rained, and then compare the actual number of check-ins to the expected number of check-ins predicted by the model.
Specifically, the process is:
- Split the data into a training set consisting of sunny date / location pairs, and a test set consisting of rainy pairs.
- Train a linear model (using the variables defined above) over the sunny training set.
- Evaluate the model on the rainy test set and measure the change in actual counts compared to the counts predicted by the model.
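The three steps above can be sketched with NumPy's least-squares solver. This is an illustration, not the original notebook code: the function names are hypothetical, the data is synthetic with a 10% rain effect planted in it, and the lift is measured as total actual over total predicted, which is one reasonable reading of "average increase".

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept column."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def rain_lift(X, y, is_rainy):
    """Train on clear rows, predict the rainy rows, and return the relative
    increase of actual over predicted counts (0.10 means 10% more)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rainy = np.asarray(is_rainy, dtype=bool)
    coef = fit_linear(X[~rainy], y[~rainy])   # steps 1 and 2
    predicted = predict(coef, X[rainy])       # step 3
    return y[rainy].sum() / predicted.sum() - 1.0

# Synthetic day/location rows with a planted 10% rain effect.
rng = np.random.default_rng(0)
n = 2000
days = rng.integers(0, 365, size=n).astype(float)
is_weekend = (rng.random(n) < 2 / 7).astype(float)
rainy = rng.random(n) < 0.26
clear_count = 100 + 0.05 * days - 30 * is_weekend + rng.normal(0, 5, size=n)
y = np.where(rainy, 1.10 * clear_count, clear_count)
X = np.column_stack([is_weekend, days])

lift = rain_lift(X, y, rainy)
```

Because the model never sees a rainy row during training, its predictions act as the "had it not rained" baseline, and the recovered lift should land close to the planted 10%.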
The linear regression models account for the major sources of variance outlined above: weekends, seasonality, and top line GitHub growth. So for example, if it's sunny in New York on a Friday and rains the next day on Saturday, the model will predict a decrease based on the fact that the weekend has arrived. The above process then compares this predicted (decreased) count for Saturday with the actual observed count. A decrease in check-in counts from Friday to Saturday is expected; the model will predict exactly how large this decrease should be had it not rained on Saturday, and comparing it with the actual value shows whether this decrease was larger or smaller than expected. The model serves as a normalizing factor.
The following plot shows predicted rain values vs actual rain values.
The actual counts are higher than the predicted counts: people do in fact check in more often when it rains. The residuals of the regression (the average increase in actual rainy-day check-ins as compared to predicted clear-weather check-ins) show a 10.1% increase. Or equivalently, people code (or at least check in) about 10% more when it's raining than when it's clear.
Assessing statistical significance
The method above involves forming a training and testing set over the data, partitioned by weather (rainy vs sunny). To assess statistical significance, I ran this same process over randomly selected training and testing sets. This is in contrast to the method above, where the data was split such that all sunny data was in the training set and all rainy data was in the test set. This is a standard process in statistics known as bootstrapping.
The following graph shows the distribution of the average residual increase of actual vs predicted values over the testing set.
Since over 99.5% of the probability mass falls to the left of the 10% average residual increase value, we can conclude that the result here is in fact significant.
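A sketch of this randomized comparison, under the same assumptions as before: synthetic data with a planted 10% rain effect, hypothetical function names, and lift measured as total actual over total predicted. The random test sets are drawn without replacement and match the rainy set in size, which is an assumption about how the splits were formed.

```python
import numpy as np

def lift(X, y, test_mask):
    """Train least squares on rows outside test_mask, then return the
    relative increase of actual over predicted counts on the test rows."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A[~test_mask], y[~test_mask], rcond=None)
    return y[test_mask].sum() / (A[test_mask] @ coef).sum() - 1.0

def random_split_lifts(X, y, n_test, n_rounds=200, seed=0):
    """Repeat the train/test procedure with randomly chosen test sets."""
    rng = np.random.default_rng(seed)
    lifts = np.empty(n_rounds)
    for i in range(n_rounds):
        mask = np.zeros(len(y), dtype=bool)
        mask[rng.choice(len(y), size=int(n_test), replace=False)] = True
        lifts[i] = lift(X, y, mask)
    return lifts

# Synthetic rows with a planted 10% rain effect.
rng = np.random.default_rng(1)
n = 2000
days = rng.integers(0, 365, size=n).astype(float)
is_weekend = (rng.random(n) < 2 / 7).astype(float)
rainy = rng.random(n) < 0.26
clear_count = 100 + 0.05 * days - 30 * is_weekend + rng.normal(0, 5, size=n)
y = np.where(rainy, 1.10 * clear_count, clear_count)
X = np.column_stack([is_weekend, days])

observed = lift(X, y, rainy)                        # weather-based split
null_lifts = random_split_lifts(X, y, rainy.sum())  # random splits
share_below = float(np.mean(null_lifts < observed))
```

`share_below` plays the role of the probability mass to the left of the observed value: the closer it is to 1, the less likely it is that a random split would produce a lift as large as the weather-based one.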
Is a 10.1% increase in check-ins on rainy days a surprising result? Probably not. But understanding relationships between human behavior and these sorts of externalities can be critical. While A/B tests represent a holy grail in terms of analysis, they're not always feasible, possible, or practical.
The key idea here lies in building a model off the "A" test group and then applying it to infer what points in the "B" group would have looked like had they been in "A". But in order to do this, you need a reasonable way of modeling behavior of what's going on. In the domain of GitHub and weather, a simple linear regressor worked quite well. In other domains, more sophisticated algorithms may be necessary. And in some domains, building such a model may be impossible.
This sort of methodology can also be adapted to the context of scenario planning. What would check-in counts look like in July had it rained every day? Would online sales in Q1 be lower had a massive snowstorm not hit the northeast? Will the construction delays on the F train affect my business in August? In the GitHub analysis here, the predictive model was applied to historical data. Applying this sort of analysis to scenarios in the future could be a useful tool for all types of businesses.
But alas, if GitHub sees an uptick in July check-ins, we know that at least some of this can be blamed on the weather.