Thursday 30 September 2010

Getting started with real-time ETL and the dark art of polling:

    There has been a lot of discussion about real-time ETL over the last few years and a lot of it can be summarised as "don't do it unless you REALLY need to". Helpful, eh? I recently had the need to deal with real-time for the first time so I thought I would summarise my approach to give you some food for thought if you're starting (or struggling) on this same journey.


Is it really real-time?
    The question often asked is "how far behind the source (in time) can you be, and still call it real-time?". I don't really care about this kind of latency. I thinks it's basically posturing; "I'm more real-time than you are". My feeling is that I want something that works continuously first and I'll worry about latency later. As long as the process is always working to catch up to the source that's a good start.


Old options are out
    Next question (and one that actually matters): "How will you know when the data has changed on the source?" This is an old classic from batch ETL; the difference is that we have taken some of our traditional options away. In batch ETL we could periodically extract the whole resource and do a complete compare. Once you go real-time, this approach will actually miss a large number of changes that update the same resource multiple times. In fact, I would say that repeated updates of a single resource are the main type of insight that real-time adds, so you better make sure you're getting it.


CDC:  awesome and out of reach
    What can you do to capture changes? Your first (and best) option is change data capture. CDC itself is beyond the scope of this discussion, however the main point is that it is tightly bound to the source system. If you've been around data warehousing or data integration for more than 5 minutes you can see how that could be a problem. There are numerous half-way house approaches which I'm won't go over; suffice it to say that most enterprise databases have metadata tables and pseudo-column values that they use internally to keep track of changes and these can be a rich seem of information for your real-time ETL quest.


Polling: painful but necessary
    You will inevitably come across some system which allows you no detailed interaction with it's backend. Web based services are the perfect case here - you're not going to get access to the remote database so you just have to cope with using their API. And that leaves you with - POLLING. Basically asking the source system: 'has this resource changed' or (when you can't aks that) extracting the resource and comparing it to your copy.
    A naive approach would be to simply iterate through the entire list of resources over a given interval. The time it takes to complete an iteration would be, roughly speaking, your latency from live. However, DON'T DO THIS unless you want to be strangled by the SysAdmin for the source or banned from API access to the web service.


My 'First law of real-time ETL'
    So I would propose the following heuristic: data changed by humans follows Nexton's first law. Restated:
'Data in motion will stay in motion, data at rest will stay at rest.' 
    Basically a resource that has changed is more likely be changed again when you next check. Conversely a resource which has not changed since you last checked is less likely to changed when you check again. To implment this in your polling process you would simply track how many times you've checked the resource without finding a change and adjust your retry interval accordingly.
For example:
> Check resource - no change - unchanged count = 1 - next retry = 4 min
> Check resource - no change - unchanged count = 2 - next retry = 8 min
> Check resource - no change - unchanged count = 3 - next retry = 16 min
> Check resource - no change - unchanged count = 4 - next retry = 32 min
> Check resource - CHANGED - unchanged count = 0 - next retry = 1 min


Keep it simple stupid
    This a simplistic approach but it can massively reduce the strain you place on the source system. You should also be aware of system driven changes (i.e. invoice generation, etc.) and data relationships (i.e. company address changes > you need to check all other company elements sooner than scheduled). You should also note that changes which are not made by humans are much less likely to obey this heuristic.


A note for the web dudes
    Finally, if you are mostly working with web services then familarise yourself with the following:
> Webhooks, basically change data capture for the web. You subscribe to a resource and changes are notified to a location you specify. Sadly, webhooks are not widely supported right now.
> RSS, that little orange icon that you see on every blog you read. Many services offer RSS feeds of recently changed data and this is a good comprise.
> E-tag and If-Modified-Since headers, HTTP header elements that push the burden of looking for changes off to the remote service (which is nice).


Good luck.

Disqus for @joeharris76