Mashing data together
Written by Jason on March 18, 2012
Mashing data together can be hard. There are lots of issues to deal with; to name just a few:
- different file formats
- different source locations
- different date formats & conventions
- sometimes even different characters as the decimal point
Today, we'll show you how Tuhunga handles these kinds of issues transparently for you. We'll use two datasets that are available to all members - the yearly commodity price dataset, and the World Bank agricultural dataset (one of the twelve that were recently added).
I mention these two because the commodity price dataset was originally in an Excel file, while the agricultural dataset came from an XML feed. The dates were stored in different formats. The agricultural dataset provided a breakdown by year and by country, while the commodity price data is purely time-dependent.
None of this matters once the data is in Tuhunga. The two datasets are tied together in a few clicks and can be used as if they were one.
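For readers curious about what this kind of reconciliation involves, here is a minimal sketch in plain Python. It is not Tuhunga's implementation, which isn't shown here. The sample rows, the three date formats, and the helper names (`parse_year`, `parse_number`, `join_by_year`) are all assumptions made up for illustration. The idea is to normalize dates and decimal marks first, then roll the per-country series up to one value per year so it lines up with the time-only price series.

```python
from collections import defaultdict
from datetime import datetime

def parse_year(text):
    """Return the year from a date written in one of a few assumed formats."""
    for fmt in ("%Y", "%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).year
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {text!r}")

def parse_number(text):
    """Parse a number that uses either '.' or ',' as the decimal point.

    Assumes there are no thousands separators in the input.
    """
    return float(text.replace(",", "."))

def join_by_year(prices, cropland):
    """Sum cropland across countries per year, then pair it with the price."""
    area_by_year = defaultdict(float)
    for date, _country, area in cropland:
        area_by_year[parse_year(date)] += parse_number(area)
    price_by_year = {parse_year(d): parse_number(p) for d, p in prices}
    return {
        year: (price_by_year[year], area_by_year[year])
        for year in sorted(price_by_year)
        if year in area_by_year
    }

# Made-up rows: prices keyed by date only, cropland keyed by date and country.
prices = [("1991", "128,7"), ("1992", "151.2")]
cropland = [
    ("1991-01-01", "A", "10.0"),
    ("1991-01-01", "B", "5,5"),
    ("31/12/1992", "A", "11.0"),
]

combined = join_by_year(prices, cropland)
print(combined)
```

Summing over countries is only one way to line the two series up. Repeating each year's price against every country row would work just as well if the per-country detail is needed.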
Let's take a look at the price of two major cereal crops - wheat and corn - and see if we can observe a response from farmers to changes in price:
Examining the chart, there appears to be a supply response to changes in price. That is, in general, when both crop prices rise, the land used in production increases in either the same year or the following one. Conversely, when prices fall, cropland area stagnates or falls.
Note the outlier in cropland in 1992. At first glance, such a large increase in a single year seems unlikely, which suggests there might be an error in the underlying data. Fortunately, finding out what caused this discrepancy is easy: we simply create a filter to find countries that have data for 1992 but not for 1991.
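Outside of Tuhunga, the same filter amounts to a set difference. The sketch below is only an illustration in plain Python. The rows and the function name `new_entrants` are invented, and `None` is assumed to mark a missing value.

```python
def new_entrants(rows, year, previous_year):
    """Return countries with a value in `year` but none in `previous_year`."""
    countries_by_year = {}
    for row_year, country, value in rows:
        if value is not None:  # assumed convention: None means "no data"
            countries_by_year.setdefault(row_year, set()).add(country)
    return countries_by_year.get(year, set()) - countries_by_year.get(previous_year, set())

# Made-up rows of (year, country, cropland area).
rows = [
    (1991, "A", 10.0),
    (1991, "B", 5.0),
    (1991, "C", None),  # listed, but no value reported
    (1992, "A", 10.5),
    (1992, "B", 5.0),
    (1992, "C", 4.0),
]

print(sorted(new_entrants(rows, 1992, 1991)))
```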
It turns out that the step up was caused by the addition of the former Soviet bloc states to the dataset, as shown in green below.
Once we back out the Soviet bloc increase, the cropland series certainly looks more reasonable.
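As a rough sketch of what backing out means here, the totals can be recomputed with the newly added countries left out. This is plain Python with invented numbers and a hypothetical helper, `yearly_totals`. It is not the adjustment Tuhunga performs internally.

```python
def yearly_totals(rows, excluded=frozenset()):
    """Sum area per year, skipping any country in `excluded`."""
    totals = {}
    for year, country, area in rows:
        if country not in excluded:
            totals[year] = totals.get(year, 0.0) + area
    return totals

# Made-up rows of (year, country, cropland area); "C" first appears in 1992.
rows = [
    (1991, "A", 10.0),
    (1991, "B", 5.0),
    (1992, "A", 10.5),
    (1992, "B", 5.0),
    (1992, "C", 4.0),
]

raw = yearly_totals(rows)
adjusted = yearly_totals(rows, excluded={"C"})
print(raw, adjusted)
```

Comparing `raw` with `adjusted` separates the jump caused by new coverage from the change among countries that were already in the dataset.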