you're reading...
Web Scraping and APIs

Google Refine: One of The Best Tools You’ve Probably Never Heard About

Lots of data that’s available online tends to not be the cleanest thing in the world, particularly if you’ve had to scrape it in the first place. At the same time, lots of internal data sets can be just as messy, with columns having different names in what should be identical spreadsheet templates, or with what should be identical values (e.g. “United Kingdom” vs “UK”) not being identical. There are lots more examples that I could mention, but

Google Refine is a great tool for dealing with messy data and turning it quickly and easily into a much better dataset, which then allows for the fun to begin with the analysis and visualisation.

Three handy Google Refine tricks.

My reason for using Refine was to clean up and append additional information to the Information is Beautiful Awards. I’m still learning how to use the software, but found these three things to be very handy and things that I’ve not encountered elsewhere (without writing some code).

Add together multiple different Excel worksheet. This can be an absolute pain when there are either a lot of them, or where each one has some additional or missing columns. In the past, I’ve dealt with this using a combination of Python and VBA. However, Refine makes it incredibly easy to join together multiple Excel worksheets (particularly if they have similarly named columns). NB: this functionality doesn’t seem to exist when plugging into Google spreadsheets.

Reconciliation. My new favourite word. The data set that was provided, whilst exhaustive in some respects, still lacked a lot of potentially interesting information, such as when in the year it was released, the stars who featured, and so on. My first approach to this was to use the URLs provided and see if I could scrape the data from the link, as most of them were from a particular website, which had a consistent format.

Then I discovered Reconciliation.

This screencast does it far more justice than I can, but essentially the idea is that my reconciling my database with another one (in this case Freebase), there’s a whole lot more information that I can easily add on to my dataset. Think of reconciling a bit like Primary Keys that are universally defined and being able to left join. Also, think of remakes of films (Casino Royale, Ocean’s Eleven, etc.) – which version of the film am I referring to?

I simply asked Refine to start Reconciling, and by looking at the data in my film name column, was able to identify that it was a list of films. It then did its best at fuzzy matching the names, and where it wasn’t sure, gave me a list of options from which to choose. I could then choose a confidence level and it would leave me to manually choose the remaining records which were lower than this level.

Facets. Without Data Validation built into data entry, lots of alternative spellings can crop up for what should be the same value (e.g. “United Kingdom” vs “UK”). There are normally a few ways to deal with these, and tools like Tableau make it easy to group such values together.

Refine takes a nice approach to these (and other data validation issues) using what it calls “Facets”, which are essentially a combination of a Summary of the data, combined with Data Manipulation. What this means, is that (a) I can see what mistakes there are in the data and (b) I can then easily correct them.



About these ads



  1. Pingback: Nova SilvaFlirten met data - Nova Silva - November 14, 2012

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Blog Stats

  • 327,381 hits

Enter your email address to follow this blog and receive notifications of new posts by email.

Join 526 other followers


Get every new post delivered to your Inbox.

Join 526 other followers

%d bloggers like this: