Tuhunga Blog - Introducing data extraction

Introducing data extraction

Written by Jason on February 22, 2011

In the last few months, we’ve been rolling out new features, and over the next little while, we’ll cover some of them on the blog. Today, we’ll highlight data extraction using regular expressions.

With this feature, you can capture the exact data you want, even if it’s not cleanly delimited in the source. In the example below, we want to capture the ticker, but selecting it on the page also picks up the exchange name and the surrounding parentheses.

[Image: web page with selected item]

Without using a data extractor, the item would be stored as:

(NasdaqGS: AAPL)

However, if we use a regular expression to isolate the ticker, we’re able to capture only what we want:

AAPL

This feature is extremely useful in cleansing input before storage. We'll look at the extractor we used (it’s easier than it looks - we promise) as well as provide some useful links that explain how to build them.

Tuhunga uses regular expressions to extract data prior to storage. If you’re not familiar with them, they can look intimidating at first, but once you’ve started using them, you’ll wonder how you ever managed data without them.

For the example above, we know that the desired ticker (AAPL) is preceded by an exchange name, in this case, NasdaqGS, then a colon, and a space; a closing parenthesis follows the ticker. We don’t necessarily want to restrict the ticker to a specific exchange, so we’ll focus on the colon and space. Let’s look at the expression first, and then explain it.

!(?<=: )[A-Za-z.-]+!

For Tuhunga, a regular expression must be surrounded by exclamation marks (or, as of late 2012, tildes: ~). Everything between them determines what is captured and what is discarded. We’ll first examine the portion of the pattern after the parenthesized group:

[A-Za-z.-]+

The contents of the square brackets form a character class, and will extract any characters that match them. In this instance, it matches any uppercase or lowercase English letter, the period, and the dash. The plus sign after the character class indicates that it must match at least one character before capturing the data, with no upper limit on the length of the match.
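To see what the character class does on its own, here is a minimal sketch in Python. Python's re module is an assumption here: Tuhunga's own engine may differ in details, and Python does not use the exclamation mark delimiters, so only the part between them is shown.

```python
import re

# The character class from the post, used on its own (without the
# "(?<=: )" portion). Python's `re` module is assumed for illustration.
CHAR_CLASS = re.compile(r"[A-Za-z.-]+")

raw = "(NasdaqGS: AAPL)"

# findall returns every run of one or more letters, periods, or dashes.
runs = CHAR_CLASS.findall(raw)
print(runs)  # ['NasdaqGS', 'AAPL']
```

The class by itself matches the exchange name as well as the ticker, which is why the full pattern needs the extra "(?<=: )" portion in front of it.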

If you wished to specify the length more precisely, you would replace the plus sign with curly braces, {x,y}, to indicate the minimum and maximum length to capture. For instance, if you wanted to capture only tickers that are at least two characters long and no more than four, you would use {2,4}. For no upper bound, you would use {2,}.
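The effect of a bounded length can be sketched in Python (again an assumption, since Tuhunga's engine is not shown here). The helper name extract and the sample inputs other than the AAPL string are made up for illustration.

```python
import re

# Same pattern as in the post, but with {2,4} in place of the plus sign.
# Python's `re` module is assumed; `extract` is a hypothetical helper.
BOUNDED = re.compile(r"(?<=: )[A-Za-z.-]{2,4}")

def extract(text):
    """Return the first match of the bounded pattern, or None."""
    m = BOUNDED.search(text)
    return m.group(0) if m else None

print(extract("(NasdaqGS: AAPL)"))   # AAPL
print(extract("(NYSE: F)"))          # None (only one character available)
print(extract("(NasdaqGS: GOOGL)"))  # GOOG (the match stops at four characters)
```

Note that in this sketch the upper bound does not reject a longer ticker; it captures only the first four characters. If a longer ticker should be skipped entirely, the pattern would also need to say what must follow the match.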

The first portion of the pattern, (?<=: ), is a positive lookbehind (sometimes called a lookback): the second portion of the pattern we discussed above will only capture data if the text immediately before it matches the lookbehind. The lookbehind text itself is not included in the capture. In our case, the ticker will only be matched if it is immediately preceded by a colon and a space.
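Putting the two portions together, the whole extractor can be tried out with a short Python sketch. Python's re module stands in for Tuhunga's engine here, which is an assumption; the exclamation mark delimiters are dropped because Python does not use them, and extract_ticker and the inputs other than the AAPL string are hypothetical.

```python
import re

# The full pattern from the post, without Tuhunga's "!" delimiters.
# "(?<=: )" requires a colon and a space right before the match,
# but leaves them out of the captured text.
TICKER = re.compile(r"(?<=: )[A-Za-z.-]+")

def extract_ticker(text):
    """Return the ticker that follows ': ', or None if there is none."""
    match = TICKER.search(text)
    return match.group(0) if match else None

print(extract_ticker("(NasdaqGS: AAPL)"))  # AAPL
print(extract_ticker("(NYSE: BRK.B)"))     # BRK.B
print(extract_ticker("(AAPL)"))            # None (no colon and space)
```

Because the pattern keys on the colon and space rather than the exchange name, the same extractor works regardless of which exchange is listed.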

Regular expressions can have a bit of a learning curve, but they are extremely powerful tools. Going forward, we’ll periodically use them in other blog posts, and explain the examples. However, for a more comprehensive overview, you might consider the following tutorials:

http://sites.google.com/site/trevizemoonflair/tutorials/regex (good place to start)
http://www.zytrax.com/tech/web/regex.htm (more comprehensive, with examples)
http://www.regular-expressions.info/tutorialcnt.html (quite comprehensive, although some of the advanced version-specific elements described do not apply)

Tags: examples, features