I’m fascinated by commodity futures contracts. I worked on a project in which we predicted the yield of grains using climate data (which exposed me to the futures markets), but we never attempted to predict the price. What fascinates me about the price data is its complexity. Every tick represents a transaction in which one entity agrees to sell something (say, 10,000 bushels of corn) and another entity agrees to buy that thing at a future point in time (I use the word entity rather than person because the markets are theoretically anonymous). Thus, price is determined by how much people think the underlying commodity is worth.
The data is complex because the variables that affect the price span many domains. The simplest variables are climatic and economic. Prices will rise if the weather is bad for a crop, supply is running thin, or there is a surge in demand. The correlations are far from perfect, however. Many other factors contribute to the price of commodities, such as the value of US currency, political sentiment, and changes in investing strategies. It is very difficult to predict the price of commodities using simple models, and thus the data is a lot of fun to toy around with.
As you might imagine, there is an entire economy surrounding commodity price data. Many people trade futures contracts on imaginary signals called “technicals” (please be prepared to cite original research if you intend to argue) and are willing to shell out large sums of money to get the latest ticks before the guy in the next suburb over. The Chicago Mercantile Exchange of course realizes this, and charges a rather hefty sum to the would-be software developer who wishes to deliver this data to their users. The result is that researchers like myself are told that rather large sums of money can be exchanged for poorly formatted text files.
Fortunately, commodity futures contract data is also sold to websites that intend to profit off banner ads, and it is remarkably easy to scrape (it’s literally structured). I realize this article was supposed to be about scraping price data and not what I ramble about to my girlfriend over dinner, so I’ll make a nice heading here with the idea that 90% of readers will skip to it.
Scraping the Data
There are a lot of ways to scrape data from the web. For old-schoolers there’s curl, sed, and awk. For magical people there’s Perl. For enterprise there’s com.important.scrapper.business.ScrapperWebPageIntegrationMatchingScrapperService. And for the no-good, standards-breaking, rogue-formatting, try-whatever-the-open-source-community-coughs-up hacker, there’s Node.js. Thus, I used Node.js.
Node.js is quite useful for getting stuff done. I don’t recommend writing your next million-line project in it, but for small to medium-sized projects there’s really no disadvantage. Some people complain about “callback hell” causing their code to become indented beyond readability (they might consider defining functions; see the sketch below), but asynchronous, non-blocking IO code is really quite sexy. You also write it in JavaScript, which can be quite concise and simple if you’re careful during implementation.
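As a minimal sketch of the “defining functions” point (not from the original project; the URL and function names are my own placeholders), naming the steps keeps the request handling flat instead of nesting anonymous callbacks several levels deep:
var http = require('http');

// Accumulate the response body, then hand it to a callback.
function collectBody(res, callback) {
  var body = '';
  res.setEncoding('utf8');
  res.on('data', function(chunk) { body += chunk; });
  res.on('end', function() { callback(null, body); });
  res.on('error', callback);
}

// A named final step instead of yet another inline function.
function report(err, body) {
  if (err) return console.error('Request failed:', err);
  console.log('Received ' + body.length + ' characters');
}

// The top level stays one callback deep rather than a pyramid.
http.get('http://www.example.com/', function(res) {
  collectBody(res, report);
});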
The application I had in mind would be very simple: fetch the HTML, match some patterns, extract the data, and insert it into a database. Node.js comes with HTTP and HTTPS layers out of the box. Making a request is simple:
var http = require('http');
var querystring = require('querystring');

var req = http.request({
  hostname: 'www.penguins.com',
  path: '/fly.php?' + querystring.stringify(yourJSONParams)
}, function(res) {
  if (res.statusCode != 200) {
    console.error('Server responded with code: ' + res.statusCode);
    return done(new Error('Could not retrieve data from server.'), '', symbol);
  }
  var data = '';
  res.setEncoding('utf8');
  res.on('data', function(chunk) {
    data += chunk;
  });
  res.on('end', function() {
    return done('', data.toString(), symbol);
  });
});
req.on('error', function(err) {
  console.error('Problem with request: ', err);
  return done(err, '');
});
req.end();
Don’t worry about ‘done’ and ‘symbol’: they are the containing function’s callback and the current contract symbol, respectively. The juice here is making the HTTP request with some parameters and a callback that handles the results. After some error checking, we add a few listeners within the response callback that append the data (HTML) to the ‘data’ variable and eventually pass it back to the containing function’s callback. It’s also a good idea to create an error listener on the request.
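To show how ‘done’ and ‘symbol’ connect back to a containing function, here is a minimal sketch that simply wraps the snippet above; the function name, host, and query parameter are my own placeholders, not the author’s.
var http = require('http');
var querystring = require('querystring');

// Hypothetical containing function: fetches the page for one contract symbol
// and reports back through done(err, html, symbol).
function fetchPriceHtml(symbol, done) {
  var req = http.request({
    hostname: 'www.penguins.com',                        // placeholder host from above
    path: '/fly.php?' + querystring.stringify({ symbol: symbol })
  }, function(res) {
    if (res.statusCode != 200) {
      return done(new Error('Could not retrieve data from server.'), '', symbol);
    }
    var data = '';
    res.setEncoding('utf8');
    res.on('data', function(chunk) { data += chunk; });
    res.on('end', function() { done('', data, symbol); });
  });
  req.on('error', function(err) { done(err, '', symbol); });
  req.end();
}

// e.g. fetchPriceHtml('ZC', function(err, html, symbol) { /* parse html here */ });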
Although it would be possible to match our data at this point, it usually makes sense to traverse the DOM a bit in case things move around or new stuff shows up. If we require that our data live in some DOM element, a failure indicates the data no longer exists, which is preferable to a false positive. For this I brought in the cheerio library, which provides core jQuery functionality and promises to be lighter than jsdom. Usage is quite straightforward:
var cheerio = require('cheerio');

var $ = cheerio.load(html);
$('area', '#someId').each(function() {
  var data = $(this).attr('irresponsibleJavascriptAttributeContainingData');
  var matched = data.match('yourFancyRegex');
});
Here we iterate over each of the area elements within the #someId element and match against a JavaScript attribute. You’d be surprised what kind of data you’ll find in these attributes…
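As a bridge to the persistence step below, here is a hypothetical sketch of how the matched values might be collected into the price object that the save function expects (keyed by timestamp, with open, high, low, close, volume, and interest); the attribute name and regex are placeholders, not the real ones.
var cheerio = require('cheerio');

// Hypothetical sketch: accumulate matches into
// { timestamp: [open, high, low, close, volume, interest], ... }
function extractPrices(html) {
  var $ = cheerio.load(html);
  var price = {};
  $('area', '#someId').each(function() {
    var data = $(this).attr('irresponsibleJavascriptAttributeContainingData');
    var m = data.match(/^(\d+)\|([\d.]+)\|([\d.]+)\|([\d.]+)\|([\d.]+)\|(\d+)\|(\d+)$/);
    if (m) {
      price[m[1]] = [m[2], m[3], m[4], m[5], m[6], m[7]];
    }
  });
  return price;
}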
The final step is data persistence. I chose to stuff my price data into a PostgreSQL database using the pg module. I was pretty happy with the process, although if the project grew any bigger I would need to employ aspects to deal with the error-handling boilerplate.
var pg = require('pg');

/**
 * Save price data into a PostgreSQL database.
 * @param connectConfig The connection parameters
 * @param symbol the symbol whose table the data will be appended to
 * @param price the price data object, keyed by timestamp
 * @param complete callback invoked as complete(err) when the load finishes
 */
exports.savePriceData = function(connectConfig, symbol, price, complete) {
  var errorMsg = 'Error saving price data for symbol ' + symbol;
  pg.connect(connectConfig, function(err, client, done) {
    if (err) {
      console.error(errorMsg, err);
      return complete(err);
    }
    var stream = client.copyFrom('COPY '
      + symbol
      + ' (timestamp, open, high, low, close, volume, interest) FROM STDIN WITH DELIMITER \'|\' NULL \'\'');
    stream.on('close', function() {
      console.log('Data load complete for symbol: ' + symbol);
      done(); // release the pooled connection
      return complete();
    });
    stream.on('error', function(err) {
      console.error(errorMsg, err);
      done(err); // pass the error so the broken connection leaves the pool
      return complete(err);
    });
    for (var i in price) {
      var r = price[i];
      stream.write(i + '|' + r[0] + '|' + r[1] + '|' + r[2] + '|' + r[3] + '|' + r[4] + '|' + r[5] + '\n');
    }
    stream.end();
  });
};
Since I have already prepared all of the data in the price object, it’s optimal to perform a bulk copy. The connect function retrieves a connection for us from the pool, given a connection configuration. The callback provides us with an error object, a client for making queries, and a callback that *must* be called to free up the connection. Note that in this case we employ the ‘copyFrom’ function to prepare our bulk copy and write to the resulting ‘stream’ object. As you can see, the error handling gets a bit cumbersome.
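Here is a hedged usage sketch, assuming the module above is saved as persist.js and that a table named after the symbol already exists; the module path, connection string, symbol, and sample row are all placeholders of mine.
// Hypothetical caller for the module above.
var persist = require('./persist');

var connectConfig = 'postgres://user:password@localhost/futures'; // placeholder
var price = {
  // made-up row: timestamp -> [open, high, low, close, volume, interest]
  '2013-10-30 14:30:00': ['430.25', '434.50', '428.75', '432.00', '15230', '120400']
};

persist.savePriceData(connectConfig, 'zc_2014h', price, function(err) {
  if (err) {
    return console.error('Load failed:', err);
  }
  console.log('All done.');
});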
After tying everything together I was very pleased with how quickly Node.js fetched, processed, and persisted the scraped data. It’s quite satisfying to watch log messages scroll rapidly through the console as this asynchronous, non-blocking runtime does its thing. I was able to scrape and persist two dozen contracts in about 10 seconds… and I never had to view a banner ad.
Source: http://cfusting.wordpress.com/2013/10/30/scraping-the-web-for-commodity-futures-contract-data/