<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-7423363075024286359</id><updated>2012-01-30T15:47:36.853-08:00</updated><category term='potential'/><category term='old blog'/><category term='introduction'/><category term='analytic'/><category term='columnar'/><category term='small-business'/><category term='etl'/><category term='IT'/><category term='predictions'/><category term='about'/><category term='api'/><category term='mapreduce'/><category term='hadoop'/><category term='acquisitions'/><category term='results'/><category term='dw'/><category term='psychology psychiatry shamanism brain ai steam'/><category term='enterprise'/><category term='polling'/><category term='off topic'/><category term='adbms'/><category term='rainstor'/><category term='database'/><category term='google istant'/><category term='MySQL'/><category term='cdc'/><category term='SharePoint'/><category term='no-sql'/><category term='analyst'/><category term='real-time'/><category term='HandlerSocket'/><category term='MarkLogic'/><category term='hadapt'/><category term='bi'/><category term='oracle'/><category term='37signals'/><category term='sap'/><category term='statistician'/><category term='gpu'/><category term='parstream'/><category term='sql'/><category term='orm'/><category term='market'/><category term='saas'/><category term='Data Intensive Computing'/><category term='datamapper'/><category term='open-source'/><category term='segmentation'/><category term='opportunities'/><title type='text'>@joeharris76</title><subtitle type='html'>Data Intensive Macro-Blogging</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' 
href='http://joeharris76.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://joeharris76.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>39</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-7107900746098068495</id><published>2011-04-08T08:41:00.000-07:00</published><updated>2011-04-08T08:41:39.029-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sql'/><category scheme='http://www.blogger.com/atom/ns#' term='dw'/><category scheme='http://www.blogger.com/atom/ns#' term='IT'/><category scheme='http://www.blogger.com/atom/ns#' term='no-sql'/><category scheme='http://www.blogger.com/atom/ns#' term='database'/><title type='text'>Data Management - Data is Data is Data is…</title><content type='html'>&lt;i&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;[Sometimes I want to write about one topic but I end up writing 1,500 words of background before I even touch on the subject at hand. Sometimes the background turns out to be more interesting; hopefully this is one of those times.]&lt;/span&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;In this post I talk about the problems with mainstream data management, especially SQL databases. 
I then touch on the advantages of SQL databases and the good attributes we need to retain.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Data is Data is Data is…&lt;/b&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;Current IT practice splits data management into lots of niches: SQL databases, email platforms, network file systems, enterprise search, etc. There is plenty of overlap between niches and, in truth, the separations are artificial. They merely reflect the way systems are implemented, not fundamental data differences. Have a look at your email client; see those headers in the messages list (From, Subject, etc.)? They're just database field names and the message body is simply a BLOB field. Some email clients, e.g., Gmail, can also parse that blob and find links to previous messages, which is very much like a foreign key link.&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;File systems seem less like a database at first glance but let's consider the big file system developments of the last 10 years: ZFS and BTRFS. Both of these introduce database-like ideas to the file system such as copy-on-write (a la &lt;a href="http://en.wikipedia.org/wiki/Multiversion_concurrency_control"&gt;MVCC&lt;/a&gt;), deduplication (a la &lt;a href="http://en.wikipedia.org/wiki/Database_normalization"&gt;normalisation&lt;/a&gt;), data integrity guarantees (a la &lt;a href="http://en.wikipedia.org/wiki/ACID"&gt;ACID&lt;/a&gt;) and enhanced file metadata (a la SQL DDL).&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;The basic point I'm making is that data is data. Simple as that. It may be more or less 'structured' but structure and meaning are essentially equivalent. 
The most 'unstructured' file I can imagine is just plain text but the written word is still &lt;b&gt;very&lt;/b&gt; structured. At a high level it has a lot of metadata (name, created, changed, size, etc.), it has structure embedded in the text itself (language, punctuation, words used, etc.) and, looking deeper, we can analyse the semantic content of the text using techniques like &lt;a href="http://en.wikipedia.org/wiki/Natural_language_processing"&gt;NLP&lt;/a&gt;.&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;Data is data; it needs to be stored, changed, versioned, retrieved, backed up, restored, searched, indexed, etc. The methods may vary but &lt;b&gt;it's all just data&lt;/b&gt;.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;The SQL Database Black Box&lt;/b&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;&lt;b&gt;All&lt;/b&gt; data cannot be kept in databases because, amongst other things, SQL databases are opaque to other applications. Enterprise search illustrates the issue. Most enterprise search apps can look into JDBC/ODBC accessible databases, profile the data and include its content in search results. However, access to any given database is typically highly restricted and there is a DBA whose job hangs on keeping that data safe and secure. The DBA must be convinced that the search system will not compromise the security of his data and this typically means limiting search access to the people who also have database access. 
This is a time consuming process and we have to repeat it for every database in the company.&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;So a year later, when we have access to all SQL databases and a process to mirror access credentials, the next problem is that SQL provides no mechanism to trace data history. For example, I search for 'John Doe' and find a result from the CRM database. I look in the database and the record now has a name of 'Jane Doe'. Why did it change? When did it change? Who changed it? There is no baseline answer to these questions. The CRM application may record &lt;i&gt;some&lt;/i&gt; of this information but how much? The database has internal mechanisms that trace some of this but each product has its own scheme and, worse, the tables are often not user accessible for security reasons.&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;In my experience, 80% of the value &lt;b&gt;actually&lt;/b&gt;&amp;nbsp;gained from a data warehouse comes from resolving this issue in a single place and in a consistent way. Hence the growth of the MDM industry, but I won't digress on that. The data warehouse doesn't actually &lt;i&gt;solve&lt;/i&gt; the problem, it merely limits the number of SQL databases that must be queried to 1. 
And, of course, we never manage to get &lt;b&gt;everything&lt;/b&gt; in the DW.&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;There are many other black box attributes of SQL databases such as: 2 very similar queries may perform in drastically different ways; background tasks can make the database extremely slow without warning; the database disk format cannot be accessed by other applications; the database &amp;nbsp;may bypass the filesystem making us entirely reliant on the database to detect disk errors, etc., etc.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;The SQL Database Choke Point&lt;/b&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;Current SQL databases are also a very real constraint on day-to-day operation. For example, a large company may only be able to process bulk updates against a few percent of the customer base each night. SQL databases must be highly tuned towards high performance for a single type of &amp;nbsp;access query and that tuning usually makes other access styles unworkable. &lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;Further, the schema of a production SQL database is effectively set in stone. Although SQL provides ALTER statements, the performance and risk of using ALTER are so bad that it's effectively never used. Instead we either add a new small table and use a join when we need the additional data, or we create a new table and export the existing data into it. Both of these operations impose significant overheads when all we really want is a new field. 
So, in practice, production SQL databases satisfy a single type of access, are very resistant to other access patterns and are very difficult to change.&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;These issues are well recognised and the answer has come back that we need specialist SQL databases for each use case. Michael Stonebraker, in particular, has been beating a drum about this for at least 5 years (and, credit where it's due, Vertica paid off in spades). However, we haven't seen a huge uptake in specialist databases for markets other than analytics. In particular the mainstream OLTP market has very few specialist offerings. Perhaps it's a more difficult problem or perhaps the structure of SQL itself is less amenable to secondary innovation around OLTP. I sense a growing recognition that improvements in the OLTP space require significant re-engineering of existing applications.&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;Specialist databases have succeeded to some extent in the data warehouse and business intelligence sphere. I think this exception proves the observation. 15 years ago I would have added another complaint to my black box attributes: it was impossible to get reports and analysis from my production systems. The data warehouse was invented and gained popular acceptance simply because this was such a serious problem. The great thing about selling analytic databases for the last 15 years was that you weren't displacing a production system. Businesses don't immediately start losing money if the DW goes down. 
The same cannot be said of most other uses for SQL databases and that's why they will only be replaced slowly and only when there is a compelling reason (mainframes are still around, right?).&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;There's a baby in this bathwater!&lt;/b&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;It's worth remembering that SQL databases offer a lot of advantages. &lt;a href="http://en.wikipedia.org/wiki/Codd%27s_12_rules"&gt;Codd outlined 12 rules that relational databases should follow&lt;/a&gt;. I won't list them all here but at a high level a relational database maintains the absolute integrity of the data it stores and allows us to place constraints on that data, such as the type and length of the data or its relation to other data. We take it for granted now but this was a real breakthrough and it took years to implement in practice.&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;Just for kicks imagine a CRM system based on Word docs. When you want to update a customer's information you open their file and make whatever changes you want and then save it. The system only checks that the doc exists; you can change whatever you want and the system won't care. If you want the system to make sure you only change the right things you'll have to build that function yourself. 
That's more or less what data management was like before SQL databases.&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;&lt;br /&gt;&lt;b&gt;What to keep &amp;amp; what to throw away&lt;/b&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;So what would our ideal data management platform look like? It persists data in a format that can be freely parsed by other applications, i.e., plain text (XML? JSON? &lt;a href="http://en.wikipedia.org/wiki/Protocol_Buffers"&gt;Protocol Buffers&lt;/a&gt;?). It maintains data integrity at an atomic level, probably by storing checksums alongside each item. It lets us define stored data as strictly or loosely as we want but it enforces the definitions we set. All changes to our stored data actually create new versions and the system keeps a linked history of changes.&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;I think we're starting to see systems emerge that address some of the issues above. 
It's still early days but I'm excited about projects like &lt;a href="http://ceph.newdream.net/about/"&gt;Ceph&lt;/a&gt; and the very new &lt;a href="http://www.acunu.com/technology/"&gt;Acunu&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;In my next post I'll look at how the new breed of NoSQL databases display some of the traits we need for our ideal data management platform.&lt;/i&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-7107900746098068495?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/7107900746098068495/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2011/04/data-management-data-is-data-is-data-is.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/7107900746098068495'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/7107900746098068495'/><link rel='alternate' type='text/html' href='http://joeharris76.blogspot.com/2011/04/data-management-data-is-data-is-data-is.html' title='Data Management - Data is Data is Data is…'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-1683061309813507321</id><published>2011-04-06T02:25:00.000-07:00</published><updated>2011-04-06T02:25:18.035-07:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='opportunities'/><category scheme='http://www.blogger.com/atom/ns#' term='market'/><category scheme='http://www.blogger.com/atom/ns#' term='MarkLogic'/><category scheme='http://www.blogger.com/atom/ns#' term='no-sql'/><category scheme='http://www.blogger.com/atom/ns#' term='database'/><title type='text'>Unsolicited advice for MarkLogic - Pivot!</title><content type='html'>&lt;i&gt;[This is actually &lt;a href="http://www.dbms2.com/2011/04/05/whither-marklogic/"&gt;a really long comment on Curt Monash's post&lt;/a&gt;&amp;nbsp;but I think it's worth cross posting here.]&amp;nbsp;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://www.marklogic.com/themes/marklogic/images/logo.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://www.marklogic.com/themes/marklogic/images/logo.gif" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;Seeing as I have been doing a lot of &lt;a href="http://joeharris76.blogspot.com/2011/03/analytic-database-market-opportunities.html"&gt;thinking about database opportunities lately&lt;/a&gt; I'll wade in on MarkLogic's as well. I can't really comment about the specific verticals that MarkLogic sells into or should sell into. However, I see 2 fundamental problems with MarkLogic's product positioning.&lt;br /&gt;&lt;br /&gt;First problem; they backed the wrong horse by focusing exclusively on XML and XQuery. This has been toned down a lot but the die is cast. People who know about MarkLogic (not many) know of it as a 'really expensive XML database that you need if you have lots of XML and eXist-db is too slow for you'. They've put themselves into a niche within a niche, kind of like a talkative version of Ab Initio.&lt;br /&gt;&lt;br /&gt;This problem is obvious if you compare them to the 'document oriented' NoSQLs such as CouchDB and MongoDB. 
Admittedly they were created long after MarkLogic but the NoSQLs offer far greater flexibility, talk about XML only as a problem to be dealt with and use a storage format that the market finds more appealing (JSON).&lt;br /&gt;&lt;br /&gt;Second problem; 'Enterprise class' pricing is past its sell-by date. What does MarkLogic actually cost? You won't find any pricing on the website. I presume that the answer is that old standby 'whatever you're looking to spend'. Again, the contrast with the new NoSQLs couldn't be more stark - they're all either pure open source or open core, i.e., free to start.&lt;br /&gt;&lt;br /&gt;MarkLogic was essentially an accumulator bet: 1st bet - XML will flood the enterprise, 2nd bet - organisations will want to persist XML as XML, 3rd bet - an early, high quality XML product will move into an Oracle-like position.&lt;br /&gt;&lt;br /&gt;The first bet was a win; XML certainly has flooded the enterprise. The second bet was a loss; XML has become almost a wire protocol rather than a persistence format. Rightly or not, very few organisations choose to persist significant volumes of data in XML. And the third bet was a loss as well; the huge growth of open source and the open core model make it extremely unlikely that we'll see another Oracle in the data persistence market.&lt;br /&gt;&lt;br /&gt;The new MarkLogic CEO needs to acknowledge that the founding premise of the company has failed and they must pivot the product to find a much larger addressable market. Their underlying technology is probably very good and could certainly be put to use in other ways (Curt gives some examples). 
I would be tempted to split the company in 2; leaving a small company to continue selling and supporting MarkLogic at maximum margins (making them an acquisition target) and a new company to build a new product in start-up mode on the foundations of the existing tech.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-1683061309813507321?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/1683061309813507321/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2011/04/unsolicited-advice-for-marklogic-pivot.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/1683061309813507321'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/1683061309813507321'/><link rel='alternate' type='text/html' href='http://joeharris76.blogspot.com/2011/04/unsolicited-advice-for-marklogic-pivot.html' title='Unsolicited advice for MarkLogic - Pivot!'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-1010664852184294737</id><published>2011-03-29T09:00:00.000-07:00</published><updated>2011-03-29T12:49:02.296-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='segmentation'/><category scheme='http://www.blogger.com/atom/ns#' term='rainstor'/><category scheme='http://www.blogger.com/atom/ns#' 
term='hadapt'/><category scheme='http://www.blogger.com/atom/ns#' term='gpu'/><category scheme='http://www.blogger.com/atom/ns#' term='adbms'/><category scheme='http://www.blogger.com/atom/ns#' term='analytic'/><category scheme='http://www.blogger.com/atom/ns#' term='parstream'/><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><title type='text'>Analytic Database Market Opportunities</title><content type='html'>&lt;i&gt;In my first post in this series I gave &lt;a href="http://joeharris76.blogspot.com/2010/12/initial-thoughts-about-parstream.html"&gt;an overview of ParStream and their product&lt;/a&gt;.&amp;nbsp;&lt;/i&gt;&lt;i&gt;In the second post I gave &lt;a href="http://joeharris76.blogspot.com/2011/01/analytic-database-market-fly-over.html"&gt;an overview of the Analytic Database Market&lt;/a&gt; from my perspective.&amp;nbsp;&lt;/i&gt;&lt;i&gt;In the third post I introduced &lt;a href="http://joeharris76.blogspot.com/2011/03/analytic-database-market-segmentation.html"&gt;a simple Analytic Database Market Segmentation.&lt;/a&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;In this post I will look at the gaps in this market and the new opportunities for ParStream and RainStor to introduce differentiated offerings. First, though I'll address the positioning of Hadapt.&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-qIIbvMv3qXU/TZH_Oea-LEI/AAAAAAAAFa0/UPAdCl0lS_4/s1600/ADBMS+Opportunities2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="380" src="http://4.bp.blogspot.com/-qIIbvMv3qXU/TZH_Oea-LEI/AAAAAAAAFa0/UPAdCl0lS_4/s640/ADBMS+Opportunities2.jpg" width="640" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;b&gt;Hadapt Positioning&lt;/b&gt;&lt;br /&gt;Hadapt have recently come out of stealth and will be offering a very fast 'adaptive' version of Hadoop. 
Hadapt is a reworked and commercialized version of Daniel Abadi's HadoopDB project. You can read &lt;a href="http://www.dbms2.com/2011/03/23/hadapt-commercialized-hadoopdb/"&gt;Curt Monash's overview for more on that&lt;/a&gt;. &amp;nbsp;Basically Hadapt provides a Hadoop compatible interface (buzz phrase alert) and uses standard SQL databases (currently Postgres or VectorWise) underneath instead of HDFS. The unique part of their offering is keeping track of node performance and adapting queries to make the best use of each node. The devil is in the details, of course, and a number of questions remain unanswered: How much &amp;nbsp;of the Hadoop API will be mapped to the database? Will there be a big inflection in performance between logic that maps to the DB and logic that runs in Hadoop? Etc. At a high level Hadapt seems like a very smart play for cloud based Hadoop users. Amazon EC2 instances have notoriously inconsistent I/O performance and a product that works around that should find fertile ground.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;RainStor's Current Positioning&lt;/b&gt;&lt;br /&gt;RainStor, if you don't know, is a special archival database that features massive compression. They currently sell it as an OLDR solution (Online Data Retention) primarily aimed at companies that have large data volumes and stringent data retention requirements, e.g., anyone in Financial Services. They promise between 95% (20:1) &amp;nbsp;and 98% (40:1) compression rates for data whilst remaining fully query-able. Again &lt;a href="http://www.dbms2.com/category/products-and-vendors/clearpace/"&gt;Curt Monash has the best summary of their offering&lt;/a&gt;. I briefly met some RainStor guys a while back and I feel pretty confident that the product delivers what it promises. That said, I have never come across a RainStor client and I talk to lots of Teradata and Netezza types who would be their natural customers. 
So, though I have no direct knowledge of how they are doing, I suspect that it's been slow going to date and focusing on a different part of the market might be more productive.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Hyper Compressed Hadoop - RainStor Opportunity&lt;/b&gt;&lt;br /&gt;I tweeted a while back that "RainStor needs a MapReduce story like yesterday". I still think that's right although now I think they need a &lt;b&gt;Hadoop compatible&lt;/b&gt; story. To me, RainStor and Hadoop/MapReduce seem like a great fit. Hadoop users value the ability to process large data volumes over simple speed. Sure, they're happy with speed when they can get it but Hadoop is about processing as much data as possible. RainStor massively compresses databases while keeping them online and fully query-able. If RainStor could bring that compression to Hadoop it would be incredibly valuable. Imagine a Hadoop cluster that's maxed out at 200TB of raw data, compressed using splittable LZO to 50TB and replicated on 200TB of disk. If RainStor (replacing HDFS) could compress that same data at 20:1, half their headline rate, that cluster can now scale out to roughly 1,000TB of raw data on the same disk. And many operations in Hadoop are constrained by disk I/O so if RainStor can operate to some extent on compressed data the cluster might just run faster. Even if it runs slightly slower the potential cost savings are huge (&lt;i&gt;insert your own Amazon EC2 calculation here where you take EC2+S3 spend and divide by 20&lt;/i&gt;).&lt;br /&gt;&lt;br /&gt;&lt;b&gt;ParStream Opportunities&lt;/b&gt;&lt;br /&gt;I see 2 key opportunities for ParStream in the current market. They can co-exist but may require significant re-engineering of the product. First, a bit of background: I'm looking for &lt;a href="http://en.wikipedia.org/wiki/Blue_Ocean_Strategy"&gt;'Blue Ocean Strategies'&lt;/a&gt; where ParStream can create a temporary monopoly. 
Selling into the 'MPP Upstart' segment is not considered due to the large number of current competitors. It's interesting to note though that that is where ParStream's current marketing is targeted.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Real-Time Analytic Hadoop - ParStream Opportunity &lt;/b&gt;1&lt;br /&gt;ParStream's first opportunity is to repurpose their technology into a Hadoop compatible offering. Specifically a 'real-time analytic Hadoop' product that uses GPU acceleration to vastly speed up Hadoop processing and opens up the MapReduce concept for many different and untapped use cases. &amp;nbsp;ParStream claim to have a unique index format and to mix workloads across CPUs and GPUs to minimise response times. It should be possible to use this technology to replace HDFS with their own data layer and indexing. They should also aim to greatly simplify data loading and cluster administration work. Finally, transparent SQL access would be a very handy feature for businesses that want to provide BI directly from their 'analytic Hadoop' infrastructure. In summary: Hadoop's coding flexibility, processing speeds that approach CEP, and Data Warehouse style SQL access for downstream apps.&lt;br /&gt;&lt;br /&gt;Target customers: Algorithmic trading companies (as always…), Large-scale online ad networks, M2M communications, IP-based Telcos, etc. 
Generally businesses with&lt;i&gt; large volumes of data and high inbound data rates who need to make semi-complex decisions quickly and who have a relatively small staff&lt;/i&gt;.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Single User Data Warehouses - ParStream Opportunity &lt;/b&gt;2&lt;br /&gt;ParStream's second opportunity is to market ParStream as a single user, desk side data warehouse for analytic professionals, specifically targeting GPU powered workstations (&lt;a href="http://www.thinkmate.com/System/SuperServer_7046GT-TRF-TC4"&gt;like this one: ~$12k =&amp;gt; 4 GPUs [960 cores], 2 Quad core CPUs, 48GB RAM, 3.6TB of fast disk&lt;/a&gt;). This version of ParStream must run on Windows (preferably Win7 x64, but Win Server at a minimum). Many &amp;nbsp;IT departments will balk at having a non-Windows workstation out in the office running on the standard LAN. However they are very used to analysts requesting 'special' powerful hardware. That's why the desk side element is so critical: this strategy is designed to penetrate restrictive centralised IT regimes.&lt;br /&gt;&lt;br /&gt;In my experience a handful of users place 90% of the complex query demand on any given data warehouse. They're typically statisticians and operational researchers doing hard boiled analysis and what-if modelling. Many very large businesses have separate SAS environments that this group alone uses but that's a huge investment that many can't afford. Sophisticated analysts are a scarce and expensive resource and many companies can't fill the vacancies they have. A system that improves analyst productivity and ensures their time is well used will justify a significant premium. 
It also gives the business an excellent tool &amp;nbsp;to retain their most valuable 'quants'.&lt;br /&gt;&lt;br /&gt;This opportunity avoids the challenges of selling a large scale GPU system into a business that has never purchased one before &lt;b&gt;and&lt;/b&gt; avoids the red ocean approach of selling directly into the competitive MPP upstart segment. However it will be difficult to talk directly to these users inside the larger corporation and, when you convince them they need ParStream, you still have to work up the chain of command to get purchase authority (not the normal direction). On the plus side though these users form a fairly tight community and they will market it themselves if it makes their jobs easier.&lt;br /&gt;&lt;br /&gt;Target customers: Biotech/Bioscience start-up companies, University researchers, marketing departments or consultancies. Generally, if a business is running their data warehouse on Oracle or SQL Server, there will be an analytic professional who would give &lt;b&gt;anything&lt;/b&gt; to have a very fast database all to themselves.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;In my next post I will look at why Hadoop is getting so much press, whether the hype is warranted and, generally, the future shape of the thing we currently call the data warehouse.&lt;/i&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-1010664852184294737?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/1010664852184294737/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2011/03/analytic-database-market-opportunities.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/7423363075024286359/posts/default/1010664852184294737'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/1010664852184294737'/><link rel='alternate' type='text/html' href='http://joeharris76.blogspot.com/2011/03/analytic-database-market-opportunities.html' title='Analytic Database Market Opportunities'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-qIIbvMv3qXU/TZH_Oea-LEI/AAAAAAAAFa0/UPAdCl0lS_4/s72-c/ADBMS+Opportunities2.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-4516978289179269610</id><published>2011-03-25T10:09:00.000-07:00</published><updated>2011-03-29T09:01:52.400-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='segmentation'/><category scheme='http://www.blogger.com/atom/ns#' term='mapreduce'/><category scheme='http://www.blogger.com/atom/ns#' term='adbms'/><category scheme='http://www.blogger.com/atom/ns#' term='analytic'/><category scheme='http://www.blogger.com/atom/ns#' term='database'/><category scheme='http://www.blogger.com/atom/ns#' term='hadoop'/><title type='text'>Analytic Database Market Segmentation</title><content type='html'>&lt;div style="margin-bottom: 0cm;"&gt;&lt;span style="color: #666666;"&gt;&lt;i&gt;In my first post in this series I gave an &lt;a href="http://joeharris76.blogspot.com/2010/12/initial-thoughts-about-parstream.html"&gt;overview of ParStream and their product.&lt;/a&gt;&lt;/i&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0cm;"&gt;&lt;span 
class="Apple-style-span" style="font-family: inherit;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0cm;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;span style="color: #666666;"&gt;&lt;i&gt;In the second post I gave an &lt;a href="http://joeharris76.blogspot.com/2011/01/analytic-database-market-fly-over.html"&gt;overview of the Analytic Database Market from my perspective.&lt;/a&gt;&lt;/i&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0cm;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0cm;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;span style="color: #666666;"&gt;In this post I will briefly talk about how vendors are positioned and introduce a simple market segmentation. &lt;/span&gt; &lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0cm;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;span style="color: #666666;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0cm;"&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0cm;"&gt;&lt;a href="https://lh6.googleusercontent.com/-C9Kzwcdgo9I/TYzKuoX7xHI/AAAAAAAAFas/ft0j5ftTECY/s1600/ADBMS+Positioning.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="394" src="https://lh6.googleusercontent.com/-C9Kzwcdgo9I/TYzKuoX7xHI/AAAAAAAAFas/ft0j5ftTECY/s640/ADBMS+Positioning.jpg" width="640" /&gt;&lt;/a&gt;&lt;span style="color: #666666;"&gt;&lt;span style="font-family: inherit;"&gt;&lt;i&gt;Please note that this is not exhaustive, I've left off numerous vendors with perfectly respectable offerings that I didn't feel I could reasonably place.&lt;/i&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="color: #666666;"&gt;&lt;span style="font-family: inherit;"&gt;&lt;i&gt;&lt;br 
/&gt;&lt;/i&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="https://lh6.googleusercontent.com/-C9Kzwcdgo9I/TYzKuoX7xHI/AAAAAAAAFas/ft0j5ftTECY/s1600/ADBMS+Positioning.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;/span&gt;&lt;/a&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0cm;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;span style="color: #666666;"&gt;The chart above gives you my view of the current Analytic Database market and how the various vendors are positioned. The X axis is a log scale going from small data sizes (&amp;lt;100GB) to very large data sizes (~500TB). I have removed the scale because it is based purely on my own impressions of common customer data sizes for each vendor, drawn from published case studies and anecdotal information. &lt;/span&gt; &lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0cm;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0cm;"&gt;&lt;span class="Apple-style-span" style="clear: left; float: left; font-family: inherit; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;span style="color: #666666;"&gt;The Y axis is a log scale of the number of employees that a vendor's customers have. Employee size is highly variable for a given vendor but nonetheless each vendor seems to find a natural home in businesses of a certain size. Finally the size of each vendor's bubble represents the approximate $ cost per TB for their products (paid versions in the case of 'open core' vendors). 
Pricing information is notoriously difficult to come across so again this is very subjective but I have first hand experience with a number of these so it's not a stab in the dark.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0cm;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0cm;"&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0cm;"&gt;&lt;a href="https://lh3.googleusercontent.com/-FbOuOM9UPDo/TYzK6hNoTiI/AAAAAAAAFaw/yO_iZALqoxA/s1600/ADBMS+Segmentation.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="396" src="https://lh3.googleusercontent.com/-FbOuOM9UPDo/TYzK6hNoTiI/AAAAAAAAFaw/yO_iZALqoxA/s640/ADBMS+Segmentation.jpg" width="640" /&gt;&lt;/a&gt;&lt;span style="color: #666666;"&gt;&lt;span style="font-family: inherit;"&gt;&lt;b&gt;Market Segments&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0cm;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;span style="color: #666666;"&gt;&lt;b&gt;SMP Defenders:&lt;/b&gt;&lt;/span&gt;&lt;span style="color: #666666;"&gt; Established vendors with large bases operating on SMP platforms&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0cm;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;span style="color: #666666;"&gt;&lt;b&gt;Teradata+Aster:&lt;/b&gt;&lt;/span&gt;&lt;span style="color: #666666;"&gt; The top of the tree. 
Big companies, big data, big money.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0cm;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;span style="color: #666666;"&gt;&lt;b&gt;MPP Upstarts:&lt;/b&gt;&lt;/span&gt;&lt;span style="color: #666666;"&gt; Appliance – maybe, Columnar – maybe, Parallelism – always.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0cm;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;span style="color: #666666;"&gt;&lt;b&gt;Open Source Upstarts:&lt;/b&gt;&lt;/span&gt;&lt;span style="color: #666666;"&gt; Columnar databases, smaller businesses, free to start.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0cm;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;span style="color: #666666;"&gt;&lt;b&gt;Hadoop+Hive:&lt;/b&gt;&lt;/span&gt;&lt;span style="color: #666666;"&gt; The standard bearer for MapReduce. Big Data, small staff.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0cm;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0cm;"&gt;&lt;span style="color: #666666;"&gt;&lt;span style="font-family: inherit;"&gt;&lt;b&gt;SMP &amp;gt; MPP Inflection&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0cm;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;span style="color: #666666;"&gt;Still with me? Good, let's look at the notable segments of the market. First, there is a clear inflection point between the big single server (SMP) databases and the multi-server parallelised (MPP) databases. This point moves forward a little every year but not enough to keep up with the rising tide of data. For many years Teradata owned the MPP approach and charged a handsome rent. 
In the previous decade a bevy of new competitors jumped into the space with lower pricing and now the SMP old guard are getting into MPP offerings, e.g., Oracle Exadata and Microsoft SQL Server PDW. &lt;/span&gt; &lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0cm;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0cm;"&gt;&lt;span style="color: #666666;"&gt;&lt;span style="font-family: inherit;"&gt;&lt;b&gt;Teradata's Diminished Monopoly&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0cm;"&gt;&lt;span style="color: #666666;"&gt;&lt;span style="font-family: inherit;"&gt;Teradata have not lost their grip on the high end, however. They maintain a near monopoly on data warehouse implementations in the very largest companies with the largest volumes of 'traditional' DW data (customers &amp;amp; transactions). Even Netezza has failed to make a large dent in Teradata's customer base. Perhaps there are instances of Teradata being displaced by Netezza; however I have never actually heard of one. There are 2 vendors who have a publicised history of being 'co-deployed' with Teradata: Greenplum and Aster Data. Greenplum's performance reputation is mixed and it was acquired last year by EMC. Aster's performance reputation is solid at very large scales and their SQL/MapReduce offering has earned them a lot of attention. 
It's no surprise that Teradata decided to acquire them.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0cm;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0cm;"&gt;&lt;span style="color: #666666;"&gt;&lt;span style="font-family: inherit;"&gt;&lt;b&gt;The DBA Inflection Point&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0cm;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;span style="color: #666666;"&gt;The other inflection point in this market happens when the database becomes complex enough to need full time babysitting, e.g., a Database Administrator. This gets a lot less attention than SMP&amp;gt;MPP because it's very difficult to prove. Nevertheless word gets around fairly quickly about the effort required to keep a given product humming along. It's no surprise that vendors of notoriously fiddly products sell them primarily to large enterprises where the cost of employing such an expensive specimen as a collection of DBAs is not an issue. &lt;/span&gt; &lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0cm;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0cm;"&gt;&lt;span style="color: #666666;"&gt;&lt;span style="font-family: inherit;"&gt;&lt;b&gt;Small, Simple &amp;amp; Open(ish)&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0cm;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;span style="color: #666666;"&gt;Smaller businesses, if they &lt;/span&gt;&lt;span style="color: #666666;"&gt;&lt;b&gt;really&lt;/b&gt;&lt;/span&gt;&lt;span style="color: #666666;"&gt; need an analytic DB, choose products that have a reputation for being usable by technically inclined end users without a DBA. 
Recent columnar database vendors fall into this end of the spectrum, especially those that target the MySQL installed base. It's not that a DBA is completely unnecessary, simply that you can take a project a long way without one.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0cm;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0cm;"&gt;&lt;span style="color: #666666;"&gt;&lt;span style="font-family: inherit;"&gt;&lt;b&gt;MapReduce: Reduced to Hadoop&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0cm;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;span style="color: #666666;"&gt;Finally we have those customers in smaller businesses (or perhaps government or universities) who need to analyse truly vast quantities of data with the minimum amount of resource. In the past it was literally impossible for them to do this; they were forced to rely on gross simplifications. Now though we have the MapReduce concept of processing data in fairly simple steps, in parallel across numerous cheap machines. In many ways this is MPP minus the database, sacrificing the convenience of SQL and ACID reliability for pure scale. Hadoop has become the face of MapReduce and is effectively the SQL of MapReduce, creating a common API that alternative approaches can offer to minimise adoption barriers. 
'Hadoop compatible' is &lt;/span&gt;&lt;span style="color: #666666;"&gt;&lt;b&gt;the&lt;/b&gt;&lt;/span&gt;&lt;span style="color: #666666;"&gt; Big Data buzz phrase for 2011.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0cm;"&gt;&lt;span class="Apple-style-span" style="font-family: inherit;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0cm;"&gt;&lt;span style="color: #666666;"&gt;&lt;span style="font-family: inherit;"&gt;&lt;i&gt;In my next post I will look at the &lt;a href="http://joeharris76.blogspot.com/2011/03/analytic-database-market-opportunities.html"&gt;gaps where a differentiated offering could be introduced. I will also look at where Hadapt have pitched themselves and how ParStream and RainStor can take advantage of these market gaps.&lt;/a&gt;&lt;/i&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-4516978289179269610?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/4516978289179269610/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2011/03/analytic-database-market-segmentation.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/4516978289179269610'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/4516978289179269610'/><link rel='alternate' type='text/html' href='http://joeharris76.blogspot.com/2011/03/analytic-database-market-segmentation.html' title='Analytic Database Market Segmentation'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='https://lh6.googleusercontent.com/-C9Kzwcdgo9I/TYzKuoX7xHI/AAAAAAAAFas/ft0j5ftTECY/s72-c/ADBMS+Positioning.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-4429521623532110031</id><published>2011-01-28T08:23:00.000-08:00</published><updated>2011-01-28T08:23:54.126-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='MySQL'/><category scheme='http://www.blogger.com/atom/ns#' term='HandlerSocket'/><category scheme='http://www.blogger.com/atom/ns#' term='orm'/><category scheme='http://www.blogger.com/atom/ns#' term='no-sql'/><category scheme='http://www.blogger.com/atom/ns#' term='database'/><title type='text'>HandlerSocket - More grist for the ORM mill</title><content type='html'>A plugin called HandlerSocket was released last year that allows InnoDB to be used directly, bypassing the MySQL parsing and optimising steps. The genius of HandlerSocket is that the data is still "in" MySQL so you can use the entire MySQL toolchain (monitoring, replication, etc.). You also have your data stored in a highly reliable database, as opposed to some of the horror stories I'm seeing about newer NoSQL products.&lt;br /&gt;&lt;br /&gt;In the original blog post ( &lt;a href="http://yoshinorimatsunobu.blogspot.com/2010/10/using-mysql-as-nosql-story-for.html"&gt;here&lt;/a&gt; ) the author reports 720,000 qps on an 8 core Xeon with 32GB RAM. Granted this is all in-memory data we're talking about but that is a hell of a figure. 
He also claims it outperforms Memcached.&lt;br /&gt;&lt;br /&gt;Next, Percona added HandlerSocket to their InnoDB fork back in December ( &lt;a href="http://www.mysqlperformanceblog.com/2010/12/14/percona-server-now-both-sql-and-nosql/"&gt;here&lt;/a&gt; ) so if you're looking for someone to talk to they may be the best people.&lt;br /&gt;&lt;br /&gt;Finally, Ilya Grigorik (way-smart guy from PostRank) blogged about it a couple of weeks ago ( &lt;a href="http://www.igvita.com/2011/01/14/handlersocket-the-nosql-mysql-ruby/"&gt;here&lt;/a&gt; ) and there's a fairly interesting discussion in the comments comparing this to prepared statements in Oracle.&lt;br /&gt;&lt;br /&gt;All of this reinforces my opinion that new generation ORMs are the technology that will finally allow the RDBMS apple cart to tip all the way over. Products like Redis, Riak, CouchDB, etc. are not enough on their own.&lt;br /&gt;&lt;br /&gt;The *really* interesting thing about HandlerSocket is that it shows open source databases are perfect fodder for the next wave.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-4429521623532110031?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/4429521623532110031/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2011/01/handlersocket-more-grist-for-orm-mill.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/4429521623532110031'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/4429521623532110031'/><link rel='alternate' type='text/html' href='http://joeharris76.blogspot.com/2011/01/handlersocket-more-grist-for-orm-mill.html' 
title='HandlerSocket - More grist for the ORM mill'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-1231940129792003003</id><published>2011-01-19T08:37:00.000-08:00</published><updated>2011-03-26T03:57:09.314-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sql'/><category scheme='http://www.blogger.com/atom/ns#' term='columnar'/><category scheme='http://www.blogger.com/atom/ns#' term='dw'/><category scheme='http://www.blogger.com/atom/ns#' term='open-source'/><category scheme='http://www.blogger.com/atom/ns#' term='gpu'/><category scheme='http://www.blogger.com/atom/ns#' term='enterprise'/><category scheme='http://www.blogger.com/atom/ns#' term='bi'/><category scheme='http://www.blogger.com/atom/ns#' term='no-sql'/><category scheme='http://www.blogger.com/atom/ns#' term='database'/><category scheme='http://www.blogger.com/atom/ns#' term='oracle'/><category scheme='http://www.blogger.com/atom/ns#' term='Data Intensive Computing'/><title type='text'>Analytic Database Market 'Fly Over'</title><content type='html'>This is a follow up to my previous post where I laid out my &lt;a href="http://joeharris76.blogspot.com/2010/12/initial-thoughts-about-parstream.html"&gt;initial thoughts about ParStream&lt;/a&gt;. This is a very high level 'fly over' view of the analytic database market. 
I'll follow this up with some thoughts about how ParStream can position themselves in this market.&lt;br /&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;Powerhouse Vendors&lt;/b&gt;&lt;/div&gt;&lt;div&gt;The power players in the Analytic Database market are: Oracle (particularly Exadata), IBM (mostly Netezza, also DB2), and Teradata. Each of these vendors employs a large, very well funded and sophisticated sales force. A new vendor competing against them in accounts will find it very, very hard to win deals. They can easily put more people to work on a bid than a company like ParStream *employs*. If you are tendering for business in a Global 5000 corporation then you should expect to encounter them and you need a strategy for countering their access to the executive boards of these companies (which you will not get). In terms of technology their offerings have become very similar in recent years with all 3&amp;nbsp;emphasising&amp;nbsp;MPP appliances of one kind or another, however most of the installed base are still using their traditional SMP offerings (Netezza and Teradata excepted).&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;New MPP niche players&lt;/b&gt;&lt;/div&gt;&lt;div&gt;There are a number of recent entrants to the market who also offer MPP technology, particularly: Greenplum, AsterData and ParAccel. All 3 offer software-only MPP databases, although Greenplum's emphasis has shifted slightly since being acquired by EMC. These vendors seem to focus mostly on (or succeed with) customers who have&amp;nbsp;&lt;b&gt;very large&lt;/b&gt;&amp;nbsp;data volumes but are small companies in terms of employees. Many of these customers are in the web space. These vendors also have strong stories about supporting MapReduce/Hadoop inside their databases, which also plays to the leanings of web customers. 
According to testimonials on the vendors' websites, customers seem to choose them because they are very fast and software only.&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;Microsoft&lt;/b&gt;&lt;/div&gt;&lt;div&gt;Microsoft is a unique case. They do not employ a direct sales force (as far as I know); however, they have steadily become a major force in enterprise software. Almost all companies run Windows desktops, have at least a few Windows servers and at least a few instances of SQL Server in production. Therefore&amp;nbsp;Microsoft will be considered in virtually&amp;nbsp;&lt;b&gt;every&lt;/b&gt;&amp;nbsp;selection process you're involved in. Microsoft have been steadily adding BI-DW features to the SQL Server product line and generally those features are all "free" with a SQL Server license. This doesn't necessarily make SQL Server cheaper but it does make it&amp;nbsp;&lt;b&gt;feel&amp;nbsp;&lt;/b&gt;like very good value.&amp;nbsp;Recent improvements include the Parallel Data Warehouse appliance (with HP hardware), columnar indexing for the next release and PowerPivot for local analysis of large data volumes.&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;Proprietary columnar&lt;/b&gt;&lt;/div&gt;&lt;div&gt;Columnar databases have been the hot technology in analytic databases for the last few years. The biggest vendors are Sybase with their very mature IQ product, SAND with an equally mature product and Vertica with their newer (and reportedly much faster) product. These databases can be used in single server (SMP / scale-up) and MPP (multi-server / scale-out) configurations. They appear to be most popular with customers who appreciate the high levels of compression that these databases offer and already have relatively mature star-schema / Kimball style data warehouses in place. 
&amp;nbsp;In my experience Sybase and SAND are used most in companies where they were introduced by an OEM as part of another product. Vertica is so new that it's not clear who their 'natural' customers are yet.&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;Open Source columnar&lt;/b&gt;&lt;/div&gt;&lt;div&gt;In the open source world there are 2 MySQL storage engines and a standalone product offering&amp;nbsp;columnar databases. The MySQL engine Infobright was the first open source columnar database. It features very high compression and very fast loading; however, it is not suited to lots of joins and may be better thought of as an OLAP tool managed via SQL. The InfiniDB MySQL engine on the other hand is very good at joins and very good at squeezing all the available performance out of a server; however, it does not have any compression currently. Finally there is LucidDB which is a Java based standalone product and has performance characteristics somewhere between the other two. LucidDB features excellent compression, index support and generally good performance but can be slow to load.&lt;br /&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;Vectorised columnar&lt;/b&gt;&lt;br /&gt;There is only one player here: VectorWise. VectorWise is a columnar database (AFAIK) that has been architected from top to bottom to take advantage of the vector pipelines built into all recent CPUs. Vectorisation is a way of running many highly parallel operations through a single CPU. It basically removes all of the waiting and memory shifting that slows a CPU down. Initial testers have been very positive about the performance of VectorWise and had nothing but good things to say. 
There is also talk of an open source release so they are covering a lot of bases.&amp;nbsp;They also have the advantage of being part of Ingres, who may not be the force they once were but have a significant installed base and are well placed to sell VectorWise.&amp;nbsp;They are the biggest direct competitor to ParStream that I can see right now.&lt;br /&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;Open Source MapReduce/NoSQL&lt;/b&gt;&lt;br /&gt;ParStream will also compete with a new breed of open source MapReduce/NoSQL products, most notably Hadoop (and its variants). These products are not databases per se but they have gained a lot of mindshare among developers who need to work with large data volumes. Part of their attraction is their 'cloud friendliness'. They are perfect for the cloud because they have been designed to run on many small servers and to expect that a single server could fail at any time. There is a trade-off to be made and MapReduce products tend to be much more complex to query; however, for a technically savvy audience the trade is well worth it.&lt;/div&gt;&lt;div&gt;&lt;i&gt;&lt;br /&gt;&lt;/i&gt;&lt;br /&gt;&lt;i&gt;Next time I'll talk about where I think ParStream need to place themselves to maximise their opportunity.&lt;/i&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;UPDATE: Actually, in the next post I talk about &lt;a href="http://joeharris76.blogspot.com/2011/03/analytic-database-market-segmentation.html"&gt;how analytic database vendors are positioned and introduce a simple market segmentation&lt;/a&gt;. 
A further post about market opportunities will follow.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-1231940129792003003?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/1231940129792003003/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2011/01/analytic-database-market-fly-over.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/1231940129792003003'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/1231940129792003003'/><link rel='alternate' type='text/html' href='http://joeharris76.blogspot.com/2011/01/analytic-database-market-fly-over.html' title='Analytic Database Market &apos;Fly Over&apos;'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-1306441477668629561</id><published>2011-01-19T07:52:00.000-08:00</published><updated>2011-01-19T07:52:17.609-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sql'/><category scheme='http://www.blogger.com/atom/ns#' term='columnar'/><category scheme='http://www.blogger.com/atom/ns#' term='open-source'/><category scheme='http://www.blogger.com/atom/ns#' term='etl'/><category scheme='http://www.blogger.com/atom/ns#' term='enterprise'/><category scheme='http://www.blogger.com/atom/ns#' 
term='database'/><title type='text'>My take on why businesses have problems with ETL tools</title><content type='html'>Check out this very nice piece by Rick about the reasons &lt;a href="http://datadoghouse.typepad.com/data_doghouse/2011/01/my-take-on-why-etl-has-not-always-kept-up-with-the-integration-workload.html"&gt;why companies have failed to get the most out of their ETL tools&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;My take is from the other side of the fence. As a business user I'm often frustrated by ETL tools and have been known to campaign against them for the following reasons:&lt;br /&gt;&lt;br /&gt;&amp;gt; ETL tools have been too focused on Extract-Transform-Load and too little focused on actual data integration. I have complex integration challenges that are not necessarily a good fit for the ETL strategy and sometimes I feel like I'm pushing a square peg into a round hole.&lt;br /&gt;&lt;br /&gt;&amp;gt; It's still very challenging to generate reusable logic inside ETL tools and this really should be the easiest thing in the world (ever heard the mantra Don't Repeat Yourself!). Often the hoops that have to be jumped through are more trouble than they are worth.&lt;br /&gt;&lt;br /&gt;&amp;gt; Some ETL tools are a hodge podge of technologies and approaches with different data types and different syntaxes wherever you look. (SSIS I'm looking at you! This still is not being addressed in Denali.)&lt;br /&gt;&lt;br /&gt;&amp;gt; ETL tools are too focused on their own execution engines and fail miserably to take advantage of the processing power of columnar and MPP databases by running processes on the database. This is understandable in open source tools (database specific SQL may be a bridge too far) but in commercial tools it's pathetic.&lt;br /&gt;&lt;br /&gt;&amp;gt; Finally, where is the ETL equivalent of SQL? Why are we stuck with incompatible formats for each tool? 
The design graphs in each tool look very similar and the data they capture is near identical. Even the open source projects have failed to utilise a common format. &lt;b&gt;Very poor show&lt;/b&gt;. &lt;i&gt;This is the single biggest obstacle to more widespread ETL.&lt;/i&gt; Right now it's much easier for other parts of the stack to stick with SQL and pretend that ETL doesn't exist.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-1306441477668629561?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/1306441477668629561/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2011/01/my-take-on-why-businesses-have-problems.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/1306441477668629561'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/1306441477668629561'/><link rel='alternate' type='text/html' href='http://joeharris76.blogspot.com/2011/01/my-take-on-why-businesses-have-problems.html' title='My take on why businesses have problems with ETL tools'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-7477759189406649278</id><published>2011-01-12T13:15:00.000-08:00</published><updated>2011-01-12T13:15:25.426-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' 
term='psychology psychiatry shamanism brain ai steam'/><title type='text'>Chinese Mother: Psychology is Modern Shamanism</title><content type='html'>&lt;div&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;A couple of days ago there was a widely linked article in the WSJ called "Chinese Mother" ( &lt;a href="http://on.wsj.com/f3nh9d"&gt;http://on.wsj.com/f3nh9d&lt;/a&gt; ). &amp;nbsp;The basic premise of the article is that Western mothers are too soft and don't push their children enough and Chinese mothers are like a blacksmith's hammer cruelly pounding their children until they become brilliant swords of achievement (or something equally pathetic).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; I'm not going to deal with the premise though; it's the subtext that I'm interested in. The subtext is: '&lt;i&gt;Western people develop psychological problems because their parents make them weak, self-indulgent quitters.&lt;/i&gt;'&amp;nbsp; I've seen lots of counterpoints whose subtext is something like '&lt;i&gt;Chinese parents turn their children into soulless robots who can only take orders&lt;/i&gt;'.&amp;nbsp; The really interesting thing about both of these ideas is that they tacitly accept the current fashions of Western psychology as if they were scientifically proven facts.
&amp;nbsp;You may well expect that from Western responses but in the original piece she frames Chinese Mothers as the antidote to the 'problems' identified by Western psychological ideas.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&amp;nbsp;&lt;i&gt;&amp;nbsp; &amp;nbsp;I'm going to digress for a minute but if you do nothing else make sure you read&amp;nbsp; "The Americanization of Mental Illness" in the New York Times ( &lt;a href="http://nyti.ms/ggQKCG"&gt;http://nyti.ms/ggQKCG&lt;/a&gt; ).&lt;/i&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;blockquote&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;Let me introduce&amp;nbsp;an imaginary world in which the internal combustion engine evolved on its own. Everyone in this world is given an engine when they're born, sort of like a puppy, and the engine has to develop and eventually reach a mature state. They keep the engine throughout their lives and use it to assist with physical work.&amp;nbsp; These engines are completely sealed &amp;nbsp;(a la Honda) and cannot be opened or disassembled without destroying them. &amp;nbsp;The engines accept a few limited inputs (petroleum products, coolant and accelerator signals). &amp;nbsp;They output power and waste (heated coolant and exhaust). &amp;nbsp;Virtually all work is done by the engines and no one can conceive of life without them.&lt;/blockquote&gt;&lt;blockquote&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;People in this imaginary world are naturally very curious about engines but they basically know nothing about them. They cannot create an engine from first principles.&amp;nbsp; They have invested huge efforts in studying engines but this 'study' basically amounts to looking at engines while they're working and measuring which parts get hot.&amp;nbsp; The engines display &amp;nbsp;remarkably diverse behaviour.
&amp;nbsp;They are very sensitive to the quality of the petroleum products the user provides.&amp;nbsp; Some substitutes have been found to work but others will kill the engine.&amp;nbsp; Scientists studying engines have found that chemicals can be added to the fuel to generate different performance characteristics.&amp;nbsp; It's not known whether these additives have a long term impact on the engine. &amp;nbsp;Temperature, humidity, age, etc; many other variables&amp;nbsp;also subtly affect the engines.&lt;/blockquote&gt;&lt;blockquote&gt;&lt;/blockquote&gt;&lt;blockquote&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;Alongside the scientists, a separate field of engine philosophy has grown up.&amp;nbsp; These people develop complex theories about engine performance and how it can be influenced.&amp;nbsp; Their theories are never tested (it would be unethical to destroy an engine to test a theory).&amp;nbsp; Regardless, engine philosophies are extremely popular and wield a huge influence over people's perception of how engines should be used to best effect.&amp;nbsp; Finally there is a third group - the practical philosophers.&amp;nbsp; They are engine philosophers who also study all of the components and inputs of engines.&amp;nbsp; They are called upon to intervene when an engine is not performing as expected. 
&amp;nbsp;They use various mechanical devices and chemical cocktails depending on which school of philosophy they belong to.&amp;nbsp; No one knows if these 'treatments' actually work but many people 'feel' like they do and that seems to be good enough.&lt;/blockquote&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Back to reality: clearly my imaginary world is ridiculous.&amp;nbsp; Right?&amp;nbsp; They sound like cargo cult tribes making earphones out of coconuts and waiting for wartime planes to return.&amp;nbsp; And what does this have to do with the 'Chinese Mother' nonsense anyway?&amp;nbsp; Well the truth is that &lt;b&gt;the engine people are us &lt;/b&gt;and&amp;nbsp;this is how our culture deals with the brain.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&amp;nbsp;&lt;span class="s1"&gt;&amp;nbsp;&lt;/span&gt;&amp;nbsp; How much do we know about the brain? &amp;nbsp;Nothing. &amp;nbsp;Seriously - &lt;b&gt;NOTHING!&lt;/b&gt;&amp;nbsp;&amp;nbsp;The brain is, in many ways, the last great mystery of the natural world.&amp;nbsp; I don't want to demean the good work that scientists are doing with&amp;nbsp;fMRIs of the brain, but they are a&amp;nbsp;long way from explaining the mechanics of the brain and do not deserve sensational headlines.&amp;nbsp; If the path from superstitious farmers to an explanation of brain phenomena from first principles is a mile - we've gone about 100 feet.&amp;nbsp; Into that vacuum of understanding we have pushed a huge volume of nonsense.&amp;nbsp; The nonsense varies widely in quality from laughably stupid 'ghost in the machine' stuff to the very sophisticated but utterly meaningless 'mental illnesses' of modern psychology.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; To understand our progress in brain science let's consider a steam engine in our imaginary world.
&amp;nbsp;People have been tinkering with the idea of steam power since ancient Greece. The first workable&amp;nbsp;steam engine appeared in 1712.&amp;nbsp; In a world of natural 'engines' such machines would seem rudimentary and laughable.&amp;nbsp; Compared to the high powered and perfectly working natural engines they would be.&amp;nbsp; Many people would doubt that 'evolved' engines could possibly work on the same principles. &amp;nbsp;Perhaps they would gain acceptance because you could create new ones as needed.&amp;nbsp; Given time, steam engines could become increasingly sophisticated and perhaps eventually reach (or even surpass) the effectiveness of natural engines.&amp;nbsp; I'd like to think that this is where we are now in our understanding of the brain.&amp;nbsp; Modern computers are the 'steam age' of brain science.&amp;nbsp; Compared to the brain they are incredibly inflexible and crude.&amp;nbsp; Yet we have found them to be immensely useful and they have clearly changed our world.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; So, if our brain science is in the steam age, at least scientists are studying something real.&amp;nbsp; If you lived in a pre-Enlightenment tribe/village/etc.
someone in the tribe was designated as the shaman (or whatever you called it).&amp;nbsp; They were essentially selected at random and if you were lucky they had some knowledge about various plants that could be used if someone displayed a certain symptom.&amp;nbsp; They also had a fancy story to explain what they were doing and why it worked.&amp;nbsp; Sometimes their stuff worked, sometimes it killed the patient but they basically knew nothing.&amp;nbsp; The function of the shaman was to provide you with a reason to believe you would get better.&amp;nbsp; That works surprisingly well a lot of the time; it's called the placebo effect.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; The problem with psychology and psychiatry is that it's&amp;nbsp;&lt;b&gt;still like that&lt;/b&gt;.&amp;nbsp; There's a huge psycho-pharma industry geared up to give you a reason to believe you should feel better and charge you handsomely for the privilege.&amp;nbsp;&amp;nbsp; They're basically modern shamans!
&amp;nbsp;There is no detailed explanation for the effect of SRI anti-depressants.&amp;nbsp; They are stuffing the world's population full of chemicals whose effect cannot be adequately explained.&amp;nbsp; The use of the terms 'mental health' and 'mental illness' is basically ridiculous.&amp;nbsp; The modern psycho-pharma practitioner has no better basis to label some symptom a 'mental illness' than a shaman had to explain why a tribe member was sick.&amp;nbsp; &lt;b&gt;They fundamentally DO NOT KNOW,&amp;nbsp;they're just guessing.&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Now, you may be about to rebuke me with various double blind, statistically valid and incredibly sophisticated studies that have been done on psycho-pharma drugs and mental illnesses.&amp;nbsp; Those things are great but what are they really measuring?&amp;nbsp; They're measuring deeply subjective experiences and outcomes as reported by human beings.&amp;nbsp; These experiences and outcomes are very strongly shaped by the culture and expectations of the participants.&amp;nbsp; They do not study the &lt;i&gt;actual&lt;/i&gt;&amp;nbsp;&lt;i&gt;physical&amp;nbsp;effects&lt;/i&gt; of the compounds, &lt;i&gt;it's ALL subjective&lt;/i&gt;.&amp;nbsp; It may be sophisticated but it's not science. &amp;nbsp;Good science is not subjective. &amp;nbsp;Good science relies on verifiable and repeatable outcomes. &amp;nbsp;Good science says 'we don't know' very clearly when that's the truth. &amp;nbsp;No one in psycho-pharma ever says 'we don't know'.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;It's kind of depressing, or maybe that's&amp;nbsp;a meaningless term.
All I can say is be very careful about anyone who tries to sell you an explanation for how the brain works and remember that the placebo effect is a powerful force.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; As far as parenting and being a Chinese Mother, I don't have any advice for you but I can promise you that simplistic explanations for complex outcomes (like the success or happiness of your kids) are invariably wrong. &amp;nbsp;I guess you'll just have to do what seems best to you; know that your culture will have a huge effect that you can't really control; and&amp;nbsp;trust your kids will probably turn out a little weird and mostly OK. As far as I can tell most people do.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-7477759189406649278?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/7477759189406649278/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2011/01/chinese-mother-psychology-is-modern.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/7477759189406649278'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/7477759189406649278'/><link rel='alternate' type='text/html' href='http://joeharris76.blogspot.com/2011/01/chinese-mother-psychology-is-modern.html' title='Chinese Mother: Psychology is Modern Shamanism'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-7783737798579971761</id><published>2011-01-04T13:12:00.000-08:00</published><updated>2011-01-04T14:50:20.870-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='columnar'/><category scheme='http://www.blogger.com/atom/ns#' term='dw'/><category scheme='http://www.blogger.com/atom/ns#' term='predictions'/><category scheme='http://www.blogger.com/atom/ns#' term='SharePoint'/><category scheme='http://www.blogger.com/atom/ns#' term='gpu'/><category scheme='http://www.blogger.com/atom/ns#' term='enterprise'/><category scheme='http://www.blogger.com/atom/ns#' term='bi'/><category scheme='http://www.blogger.com/atom/ns#' term='acquisitions'/><category scheme='http://www.blogger.com/atom/ns#' term='Data Intensive Computing'/><title type='text'>2011 Preview: BI-DW Top 5</title><content type='html'>&lt;div&gt;&lt;i&gt;Here are the trends I expect to see in 2011, but beware my crystal ball is hazy and known to be biased.&lt;/i&gt;&lt;br /&gt;&lt;i&gt;&lt;br /&gt;&lt;/i&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;i&gt;&lt;/i&gt;&lt;b&gt;Top 5 for 2011&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;5)&lt;/b&gt; &lt;span class="s1"&gt;&lt;b&gt;Niche BI acquisitions take off&amp;nbsp;&lt;/b&gt;&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="s1"&gt;&amp;nbsp;&amp;nbsp;Big BI consolidation&amp;nbsp;may well be finished, but I think 2011 will be the start of&amp;nbsp;niche vendor acquisitions as&amp;nbsp;established BI&amp;nbsp;vendors seek new growth in a (hopefully) recovering economy.&amp;nbsp; I don't expect any given deal size to be huge (probably sub $100m) however we could easily see half a dozen vendors being picked 
up.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&amp;nbsp; The driver for such acquisitions should be clear; Big BI vendors have ageing product stacks and many have been through post-merger product integration pains.&amp;nbsp; Their focus on innovation has been sorely lacking (non-existent?).&amp;nbsp; Also, there is huge leverage in applying a niche product to an existing portfolio. &amp;nbsp;The Business Objects / Xcelsius acquisition is a great example of this (although BO&amp;nbsp;seems to think Xcelsius is a lot&amp;nbsp;better and more useful than I do).&lt;/div&gt;&lt;div&gt;&amp;nbsp; I will not make any predictions about who might be acquired. However,&amp;nbsp;here are some examples of companies with offerings that are not available from Big BI vendors.&amp;nbsp; Tableau's data visualisation offering is 1&lt;span class="s4"&gt;&lt;sup&gt;st&lt;/sup&gt; class IMHO and is a perfect fit for the people who &lt;b&gt;actually use&lt;/b&gt; BI products in practice.&amp;nbsp; Lyza's BI/ETL collaboration offering is unique (and hard to describe) and a great fit for &lt;b&gt;business oriented&lt;/b&gt; BI projects.&amp;nbsp; Jedox' Palo offering brings unique power to Excel power users and appears to be the only rival to Microsoft's PowerPivot offerings; I suspect a stronger US sales force would help them immensely.&lt;/span&gt;&lt;br /&gt;&lt;span class="s4"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="s1"&gt;&lt;b&gt;4) GPU based computing comes to the fore&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="s1"&gt;&lt;b&gt;&lt;/b&gt;&lt;/span&gt;&lt;span class="s1"&gt;&lt;b&gt;&amp;nbsp;&lt;/b&gt; I blogged some time ago about GPUs offering a glimpse of the many-core future.&amp;nbsp; Since then I've been waiting (and waiting) for signs that GPUs were making the jump into business servers.&amp;nbsp; Finally, in April 2010, Jedox released Palo OLAP Accelerator for GPUs.
And&amp;nbsp;this autumn I&amp;nbsp;discovered&amp;nbsp;ParStream's new GPU accelerated database (I blogged about it&amp;nbsp;last week).&amp;nbsp; Finally in December we saw the announcement of a new class of Amazon EC2 instance featuring a GPU as part of the package.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&amp;nbsp; Based on these weak signals, I think 2011 will be the year that GPU processing and GPU acceleration&amp;nbsp;start to become a widely accepted part of business computing.&amp;nbsp; The most recent GPU cards from Nvidia and AMD offer many hundreds (512+) of processing cores and multiple cards can be used in a single server.&amp;nbsp; There is a large class of business computing problems that could be addressed&amp;nbsp;by GPUs: analytic calculations (e.g. SAS / R), anything related to MapReduce / Hadoop, anything related to enterprise search&amp;nbsp;/ e-discovery, anything related to stream processing&amp;nbsp;/ CEP, etc.&amp;nbsp; As a final note, I would &lt;b&gt;strongly suggest&lt;/b&gt; that vendors who&amp;nbsp;sell columnar databases or&amp;nbsp;in-memory&amp;nbsp;BI products (or&amp;nbsp;are losing sales to&amp;nbsp;such) should point their R&amp;amp;D team at GPUs and get something together quickly.
Niche vendors have an opportunity to push the price/perform baseline up by an order of magnitude and take market share while Big BI vendors try to catch up.&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;3) Data Warehousing morphs into Data Intensive Computing&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&amp;nbsp; I once asked Netezza CTO Justin Lindsey if he considers Netezza machines to be supercomputers.&amp;nbsp; He said no he didn't but that the scientific computing 'guys' call it a "Data Intensive Supercomputer" and use it in applications where the ratio of data to calculations is very high,&amp;nbsp;i.e., the opposite of classical supercomputing applications.&amp;nbsp; That phrase really stuck with me and it seems to describe the direction that data warehousing is headed.&lt;/div&gt;&lt;div&gt;&amp;nbsp; If you've been around BI-DW for a while you'll be familiar with the Inmon v Kimball ideology war.&amp;nbsp;That fight&amp;nbsp;illustrates the&amp;nbsp;idea that data warehouses had a well defined purpose&amp;nbsp;simply because we could argue about the right way to do 'it'.&amp;nbsp; I've noticed the purpose of the data warehouse stretching out over the last few&amp;nbsp;years. 
The rise of analytics and&amp;nbsp;ever increasing data volumes mean that more&amp;nbsp;activities&amp;nbsp;are finding a home on the data warehouse&amp;nbsp;as a platform.&amp;nbsp; Either the activity cannot be done elsewhere&amp;nbsp;or the data warehouse is the most accessible platform for data driven projects&amp;nbsp;with short term data processing needs.&lt;/div&gt;&lt;div&gt;&amp;nbsp; In 2011 we need to borrow this term from the supercomputing guys and apply it to ourselves.&amp;nbsp; We need to change our thinking from&amp;nbsp;delivering&amp;nbsp;and supporting a data warehouse to offering a Data Intensive Computing service (that enables a data warehouse).&amp;nbsp; Those that&amp;nbsp;fail to&amp;nbsp;make the change&amp;nbsp;should not&amp;nbsp;be surprised when departments implement their own analytic database, make it available to the&amp;nbsp;wider business and start competing with them for funding.&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;2) SharePoint destabilises incumbent BI platforms&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&amp;nbsp; SharePoint is not typically considered a BI product and is rarely&amp;nbsp;mentioned when I talk to&amp;nbsp;fellow BI people. 
Those who specialise in Microsoft's products occasionally mention the special challenges (read headaches) associated with supporting it but it's "just a portal".&amp;nbsp; Right?&amp;nbsp; Not quite.&amp;nbsp; Microsoft has managed to drive a nuclear Trojan horse into the safety of incumbent BI installations.&amp;nbsp; SharePoint contains extensive BI capabilities and enables BI capabilities in other Microsoft products (like, um, Excel!).&amp;nbsp; Worst of all, if you're the incumbent BI vendor,&amp;nbsp;SharePoint is everywhere!&amp;nbsp; It has something like 75% market share overall and effectively 100% market share in &lt;b&gt;big&lt;/b&gt; &lt;span class="s2"&gt;companies.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&amp;nbsp; So what?&amp;nbsp; Well, when&amp;nbsp;you want to deploy a dashboard solution where is the natural home for such content?&amp;nbsp; The intranet portal.&amp;nbsp; When you need to collaborate on analysis with widely dispersed teams, what can you use that's better than email?&amp;nbsp; Excel docs on the portal.&amp;nbsp; If report bursting is filling up your inboxes&amp;nbsp;like sand in an hourglass, where can&amp;nbsp;you&amp;nbsp;put&amp;nbsp;reports instead?&amp;nbsp; Maybe the&amp;nbsp;intranet?&amp;nbsp; You get the point. We have a history in BI&amp;nbsp;of pushing &lt;b&gt;yet another friggin' portal&lt;/b&gt; onto the business when we select our BI platform.&amp;nbsp; Our chosen platform comes&amp;nbsp;with such a nice&amp;nbsp;portal, heck that's part of why we bought it. A year later we wonder why it doesn't get used.&amp;nbsp; We wonder why we spend more time unlocking expired logins than answering questions about reports.&lt;/div&gt;&lt;div&gt;&amp;nbsp; &amp;nbsp;Right now businesses are only using a small fraction of SharePoint's capability. 
But they pay for all of it and&amp;nbsp;I expect businesses to push for more return from SharePoint investments in 2011.&amp;nbsp; I expect a lot of these initiatives to involve communicating business performance (BI) and collaborating on performance analysis (BI again).&amp;nbsp; The trouble for incumbent vendors is clear: SharePoint has no substitute; your BI suite has direct substitutes; Microsoft offers some substitutes for &lt;b&gt;free&lt;/b&gt;; your BI content is going to end up on SharePoint; and once it's there, it's SharePoint content. BI vendors should expect hard conversations about maintenance fees and upgrade cycles in any account where dashboards are being hosted on SharePoint.&lt;/div&gt;&lt;div&gt;&amp;nbsp; As a final note, I would suggest that vendors who sell to large customers need to have a compelling SharePoint story.&amp;nbsp; It's basically a case of "if you can't beat them, join them".&amp;nbsp; If you have a portal as part of your suite you need to integrate with SharePoint (yesterday).&amp;nbsp; You need to make your products work better with SharePoint than Microsoft's own products do.&amp;nbsp; This will be a huge, expensive PITA - do it anyway.&amp;nbsp; You must find a way to embrace SharePoint without letting it own you.&amp;nbsp;&amp;nbsp;Good luck.&amp;nbsp;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;1) BI starts to dissolve into&amp;nbsp;other systems&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&amp;nbsp; My final trend for 2011 is&amp;nbsp;about BI becoming bifurcated (love that word) between the strategic stuff (dashboards and analysis) and everything else. That "everything else" doesn't naturally live on&amp;nbsp;a portal or in a report that gets emailed out.&amp;nbsp; It belongs in the system that generates the data in the first place; it belongs right at the point of interaction.&amp;nbsp;James Taylor and Neil Raden talked about this idea in the book "Smart Enough Systems".
I won't repeat their arguments here but I will outline some of the reasons why I think it's happening now.&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp; First, 'greenfield' BI sites are a thing of the past. Everyone now has BI; it may not work very well but they have it.&amp;nbsp; New companies use BI from day 1.&amp;nbsp; The market is effectively saturated.&amp;nbsp; Second, most of the Big BI vendors are now part of large companies that sell line of business systems.&amp;nbsp; There is a natural concern about diluting the value of the BI suite, however "BI for the masses" is a dead-end and I think they probably get that.&amp;nbsp; Third, deep integration is one of the last remaining levers that Big BI vendors can use against nimble niche vendors and against SharePoint.&amp;nbsp; They will essentially &lt;b&gt;have&lt;/b&gt; to go down this route at some point.&amp;nbsp; Finally, many system vendors have reached an impasse with their customers regarding upgrades. Customers are simply refusing to upgrade systems that work perfectly well. These vendors must&amp;nbsp;create a real, tangible reason for the customers to move.
I suspect that deep BI integration is their best bet.&lt;/div&gt;&lt;div&gt;&amp;nbsp; I have had&amp;nbsp;too many conversations about 'completing the circle' and feeding the results of analysis back into source systems.&amp;nbsp; Sadly it never happens in practice, the walls are just too high.&amp;nbsp; Once the data has left the source system it is considered tainted and pushing tainted data into production systems is never taken lightly.&amp;nbsp; Thus the ultimate answer seems to be to push the "smarts" that have been generated by analysis down into the source system instead.&amp;nbsp; Expect to see plenty of marketing talk in 2011 about systems getting 'smarter' and more integrated.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-7783737798579971761?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/7783737798579971761/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2011/01/2011-preview-bi-dw-top-5.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/7783737798579971761'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/7783737798579971761'/><link rel='alternate' type='text/html' href='http://joeharris76.blogspot.com/2011/01/2011-preview-bi-dw-top-5.html' title='2011 Preview: BI-DW Top 5'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-6196027502743753447</id><published>2010-12-30T04:12:00.000-08:00</published><updated>2010-12-30T05:25:44.891-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='columnar'/><category scheme='http://www.blogger.com/atom/ns#' term='potential'/><category scheme='http://www.blogger.com/atom/ns#' term='dw'/><category scheme='http://www.blogger.com/atom/ns#' term='enterprise'/><category scheme='http://www.blogger.com/atom/ns#' term='saas'/><category scheme='http://www.blogger.com/atom/ns#' term='bi'/><category scheme='http://www.blogger.com/atom/ns#' term='results'/><category scheme='http://www.blogger.com/atom/ns#' term='database'/><category scheme='http://www.blogger.com/atom/ns#' term='analyst'/><title type='text'>2010 Review: a BI-DW Top 5</title><content type='html'>&lt;i&gt;This post is written completely 'off the cuff' without any fact checking or referring back to sources. Just sayin'…&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Top 5 from 2010&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;5) Big BI consolidation is finished&lt;/b&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;There were no significant acquisitions of "Big BI" vendors in 2010. &amp;nbsp;Since Cognos went to IBM and BO went to SAP, the last remaining member of the old guard is MicroStrategy. (It's interesting to consider why they have not been acquired but that's for another post.) &amp;nbsp;In many ways the very definition of Big BI has shifted to encompass smaller players. Analysts, in particular, need things to talk about and they have effectively elevated a few companies to Big BI status that were previously somewhat ignored, e.g., SAS (as a BI provider), Information Builders, Pentaho, Actuate, etc.
&amp;nbsp;All of the major conglomerates now have a 'serious' BI element in their offerings and so I don't see further big spending on BI acquisitions in 2011. &amp;nbsp;The only dark horse in this race seems to be HP and it's very unclear what their intentions are, particularly with the rumours of Neoview being cancelled; if HP were to move I see them going for either a few niche players or someone like Information Builders with solid software but lacking in name recognition.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;4) Analytic database consolidation began&lt;/b&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;We've seen an explosion of specialist Analytic databases over the last ~5 years and 2010 saw the start of a consolidation phase amongst these players. The first big acquisition of 2010 was Sybase by SAP; everyone assumed Sybase's IQ product (the original columnar database) was the target but the talk since then has been largely about the Sybase mobile offerings. I suspect both products are of interest to SAP; IQ allows them to move some of their ageing product lines forward and Mobile will be an enabler for taking both SAP and Business Objects to smartphones going forward.&lt;br /&gt;&amp;nbsp;&amp;nbsp;The banner acquisition was Netezza by IBM. I've long been very critical/sceptical of IBM's claims in the Data Warehouse / Analytic space, particularly as I've worked with a number of DWs that were taken off DB2 (onto Teradata) but have never come across one actively running on DB2. I'm a big Netezza fan so my hope is that they survive the integration and are able to leverage the resources of IBM going forward.&lt;br /&gt;&amp;nbsp;&amp;nbsp;We also saw Teradata acquiring the dry husk of Kickfire's ill-fated MySQL 'DW appliance'. Kickfire's fundamental technology appeared to be quite good but sadly their market strategy was quite bad. I think this is a good sign from Teradata that they are open to external ideas and they see where the market is going.
The competition with Netezza seems to have revitalised them and given them a new enemy to focus on. A new version of Teradata database that incorporated some columnar features (and a 'free' performance boost) could be just the ticket to get their very conservative customers migrated onto the latest version.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;3) BI vendors started thinking about mobile&lt;/b&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;Mobile BI became a 'front of mind' issue in 2010. MicroStrategy has marketed aggressively in this space but other vendors are in the hunt and have more or less complete mobile offerings. Business Objects also made some big noise about mobile but everything seemed to be demos and prototypes. Cognos has had a 'mobile' offering for some time but they remained strangely quiet; my impression is that their mobile offerings are not designed for the iOS/Android touchscreen world.&lt;br /&gt;&amp;nbsp;&amp;nbsp;Niche vendors have been somewhat quiet on the mobile front, possibly waiting to see how it plays out before investing, with the notable exception of Qlikview who have embraced it with both arms. This is a great strategic move for Qlikview (who IMHO prove the koan that 'strategy trumps product') because newer mobile platforms are being embraced by their mid-market customers far faster than at the Global 5000 companies that the Big BI vendors focus on. Other niche and mid-market vendors should take note of this move and get something (anything!) ready as quickly as possible.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;2) Hadoop became the one true MapReduce&lt;/b&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;I remain somewhat nonplussed by MapReduce personally; however, a lot of attention has been lavished on it over the last 2 years and during the course of 2010 the industry has settled on Hadoop as the MapReduce of choice.
&amp;nbsp;The examples run from Daniel Abadi's HadoopDB project to Pentaho's extensive Hadoop integration to Aster's "seamless connectivity" with Hadoop to ParAccel's announcement of the same thing coming soon, and on and on. &amp;nbsp;The basic story of MapReduce was very sexy but in practice the details turned out to be "a bit more complicated" (as Ben Goldacre [read his book!] would say). &amp;nbsp;It's not clear that Hadoop is the best possible MR implementation but it looks likely to become the SQL of MapReduce. Expect other MapReduce implementations to start talking about Hadoop compatibility ad nauseam.&lt;br /&gt;&amp;nbsp;&amp;nbsp;All of this casts Cloudera in an interesting light. They are after all "the Hadoop company" according to themselves. It's far too early for a 'good' acquisition in this space; however, money talks and I wonder if we might see something happen in 2011.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;1) The Cloud got real and we all got sick of hearing about it&lt;/b&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;I'm not sure whether 2010 was truly the "year of the Cloud" but it certainly was the peak of its hype cycle. &amp;nbsp;In 2010 the reality of cloud pricing hit home; the short version is that a lot of the fundamental cost of cloud computing is operational and we shouldn't expect to see continuous price/performance gains like we have seen in the hardware world. &amp;nbsp;Savvy observers have noted that the bulk of enterprise IT spending has been non-hardware for a long time but the existence of cloud offerings brings those costs into focus.&lt;br /&gt;&amp;nbsp;&amp;nbsp;Ultimately, my hope for the Cloud is that it will drive companies toward buying results, e.g., SaaS services that require little-to-no customisation, and away from buying potential, e.g., faster hardware and COTS software that is rarely fit for purpose. The cycle should go something like: "This Cloud stuff seems expensive, how much does it cost us to do the same thing?"
&amp;gt; "OMG are you frickin' serious, we really spend that?!" &amp;gt; "Is there anyone out there that can provide the exact same thing for a monthly fee?". &amp;nbsp;Honestly, big companies are incredibly bad at hardware and even worse at software. The Cloud (as provided by Amazon, et al) is IMHO just a half step towards the endpoint, which is the use of SaaS offerings for everything.&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-6196027502743753447?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/6196027502743753447/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2010/12/2010-review-bi-dw-top-5.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/6196027502743753447'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/6196027502743753447'/><link rel='alternate' type='text/html' href='http://joeharris76.blogspot.com/2010/12/2010-review-bi-dw-top-5.html' title='2010 Review: a BI-DW Top 5'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-2708687753491203972</id><published>2010-12-15T13:59:00.000-08:00</published><updated>2011-04-05T02:14:42.655-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sql'/><category
scheme='http://www.blogger.com/atom/ns#' term='columnar'/><category scheme='http://www.blogger.com/atom/ns#' term='dw'/><category scheme='http://www.blogger.com/atom/ns#' term='open-source'/><category scheme='http://www.blogger.com/atom/ns#' term='gpu'/><category scheme='http://www.blogger.com/atom/ns#' term='enterprise'/><category scheme='http://www.blogger.com/atom/ns#' term='bi'/><title type='text'>Initial thoughts about ParStream</title><content type='html'>So here are my thoughts about ParStream based on researching their product on the internet only. I have not used the product, so I am simply assuming it lives up to all claims. As an analytics user and a BI-DW practitioner I sincerely hope that ParStream succeeds.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;I'm a GPU believer&lt;/b&gt;&lt;br /&gt;I'm a long time believer in the importance of utilising GPUs for challenging database problems. I wrote a post in July 2009 about using GPUs for databases and implored database vendors to move in that direction: "Why GPUs matter for DW/BI" (http://joeharris76.blogspot.com/2009/07/why-gpus-matter-for-dwbi.html). &amp;nbsp;Here's the key quote - "There's a new world coming. It has a lot of cores. It will require new approaches. That world is accessible today through GPUs. Database vendors who move in this direction now will gain market share and momentum. Those who think they can wait on Intel and 'traditional' CPUs to 'catch up' may live to regret it."&lt;br /&gt;&lt;br /&gt;&lt;b&gt;On the right track&lt;/b&gt;&lt;br /&gt;I think ParStream is *fundamentally* on the right track with a GPU accelerated analytic database. The ParStream presentation from Mike Hummel (http://www.youtube.com/watch?v=knicXkXd9hQ) talks about a query that took 12 minutes on Oracle taking just a few &lt;b&gt;*milliseconds*&lt;/b&gt; on ParStream.
If that is even half right, the potential to shake up the industry and radically raise the bar on database performance is very exciting.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Reminiscent of Netezza&lt;/b&gt;&lt;br /&gt;I remember the first time I used Netezza back in 2004. I had just taken a new role and my new company had recently installed a first generation Netezza appliance. In my previous job we had an Oracle data warehouse that was updated *weekly* and contained roughly 100 million rows. Queries commonly took *hours* to return. The Netezza machine held just less than 1 *billion* rows. I ran the following query: "SELECT month, COUNT(*), SUM(call_value) FROM cdr GROUP BY month;". It came back in 15 seconds! I was literally blown away.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;A fast database changes the game&lt;/b&gt;&lt;br /&gt;When you have a very fast analytic database it totally changes the game. You can ask more questions, ask more complex questions and ask them more often. Analytics requires a lot of trial and error and removing time spent waiting on the database enables a new spectrum of possibilities. For example, Netezza enabled me to reprice _every_ call in our database against _every_ one of our competitors' tariffs (i.e. an 'explosive' operation: 50 mil records in =&amp;gt; 800 mil records out) and then calculate the best *possible* price for each customer on any tariff. I used that information to benchmark my company on "value for money" and to understand the hidden drivers for customer churn.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;ParStream appliance strategy&lt;/b&gt;:&lt;br /&gt;So, given that background, let's look at the positioning of ParStream, the potential problems they may face, and the opportunities they need to pursue.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;ParStream is not Netezza&lt;/b&gt;&lt;br /&gt;I've positively compared ParStream to Netezza above so you might expect me to applaud ParStream for offering an appliance.
Sadly not; Netezza's appliance success was due to unique factors that ParStream cannot replicate. Netezza had to use custom hardware because they use a custom FPGA chip. Customers were (and are) nervous about investing heavily in such hardware; however, Netezza goes to great lengths to reassure them, providing service guarantees, plenty of spare parts and using commodity components wherever possible (power supplies, disks, host server, etc.). Also we must remember that most customers looking at Netezza were using very large servers (or server clusters) and required *very many* disks to get reasonable I/O performance for their databases. Netezza was actually reducing complexity for those customers.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;The world has changed going into 2011&lt;/b&gt;&lt;br /&gt;ParStream cannot replicate those market conditions. The world has changed considerably going into 2011 and different factors need to be emphasised. ParStream relies on Nvidia GPUs that are widely available and installed on commodity interconnects (e.g. PCIe). Moreover there are high quality server offerings available in 2 form factors that make the appliance strategy more of a liability than an asset. First, Nvidia (and others) sell 1U rack mounted 'servers' that contain 4 GPUs and connect to a 'host' server via a PCIe card. Second, Supermicro (and others) sell 4U 'super' servers that contain 2 Intel Xeons and 4 GPUs in a pre-integrated package. The ParStream appliance may well be superior to these offerings in some key way; however, such advantages will be quickly wiped out as the server manufacturers continuously refresh their product lines.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Focus on the database software business&lt;/b&gt;&lt;br /&gt;ParStream should focus on the database software business where they have a huge advantage, not the server business where they have huge disadvantages.
You should read this article if you have any further doubts: "The Power of Commodity Hardware" (http://www.svadventure.com/svadventure/2009/01/the-power-of-commodity-hardware.html). Key quotes: "Customers love commodity hardware.", "Competing with HP, IBM, and Dell is dumb.", "Commodity hardware is much more capital efficient". &amp;nbsp;Also consider the fates of Kickfire and Dataupia, who foundered on a database appliance strategy, and ParAccel, who is going strong after initially offering an appliance and quickly moving to emphasise software-only.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Position GPUs as a new commodity&lt;/b&gt;&lt;br /&gt;ParStream must position GPUs and GPU acceleration as a new commodity. Explain that GPUs are an essential part of &lt;b&gt;all&lt;/b&gt;&amp;nbsp;serious supercomputers and the technology is being embraced by everyone; Intel with Larrabee, AMD with Fusion, etc. Emphasise the option to add 'commodity' 4-GPU pizza box servers alongside a customer's existing Xeon/Opteron servers and, using ParStream, make huge performance gains. Talk to Dell customers about using a single Dell PowerEdge C410x GPU chassis (http://www.dell.com/us/en/enterprise/servers/poweredge-c410x/pd.aspx) to accelerate an entire rack of "standard" servers running ParStream. The message must be clear: ParStream runs on commodity hardware; you may not have purchased GPU hardware before but you can get exactly what ParStream needs from your preferred vendor.&lt;br /&gt;&lt;br /&gt;One final point here: ParStream needs to make Windows support a priority.
This is probably not going to be fun, technically speaking, but Windows support will be important for the markets that ParStream should target (which will have to be another post, sadly).&lt;br /&gt;&lt;br /&gt;UPDATE -&amp;nbsp;I followed this post up with:&lt;br /&gt;&lt;a href="http://joeharris76.blogspot.com/2011/01/analytic-database-market-fly-over.html"&gt;An overview of the analytic database market&lt;/a&gt;, &lt;a href="http://joeharris76.blogspot.com/2011/03/analytic-database-market-segmentation.html"&gt;a simple segmentation of the main analytic database vendors&lt;/a&gt;, and &lt;a href="http://joeharris76.blogspot.com/2011/03/analytic-database-market-opportunities.html"&gt;a summary of the key opportunities I see in the analytic databases market (esp. for ParStream and RainStor)&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-2708687753491203972?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/2708687753491203972/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2010/12/initial-thoughts-about-parstream.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/2708687753491203972'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/2708687753491203972'/><link rel='alternate' type='text/html' href='http://joeharris76.blogspot.com/2010/12/initial-thoughts-about-parstream.html' title='Initial thoughts about ParStream'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-8001866520759372455</id><published>2010-12-09T03:47:00.000-08:00</published><updated>2010-12-15T12:55:36.718-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sql'/><category scheme='http://www.blogger.com/atom/ns#' term='columnar'/><category scheme='http://www.blogger.com/atom/ns#' term='dw'/><category scheme='http://www.blogger.com/atom/ns#' term='open-source'/><category scheme='http://www.blogger.com/atom/ns#' term='bi'/><category scheme='http://www.blogger.com/atom/ns#' term='database'/><title type='text'>Comment regarding Infobright's performance problems</title><content type='html'>&lt;b&gt;UPDATE: This is a classic case of the comments being better than the post; make sure you read them! In summary, Jeff explained better and a lightbulb went off for me: Infobright is for OLAP in the classical sense with the huge advantage of being managed with a SQL interface. Cool.&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;I made a comment over on Tom Barber's blog post about a Columnar DB benchmarking exercise:&amp;nbsp;http://pentahomusings.blogspot.com/2010/12/my-very-dodgy-col-store-database.html&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Jeff Kibler said...&lt;br /&gt;Tom –&lt;br /&gt;&lt;br /&gt;Thanks for diving in! As indicated in your results, I believe your tests cater well to databases designed for star-schemas and full table-scan queries. Because a few of the benchmarked databases are engineered specifically for table scans, I would anticipate their lower query execution time. However, in analytics, companies overwhelmingly use aggregates, especially in ad-hoc fashion. Plus, they often go much higher than 90 gigs.&lt;br /&gt;&lt;br /&gt;That said, Infobright caters to the full fledged analytic. 
As needed by the standard ad-hoc analytic query, Infobright uses software intelligence to drastically reduce the required query I/O. With denormalization and a larger data set, Infobright will show its dominance.&lt;br /&gt;&lt;br /&gt;Cheers,&lt;br /&gt;&lt;br /&gt;Jeff&lt;br /&gt;Infobright Community Manager&lt;br /&gt;8 December 2010 17:04&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Joe Harris said...&lt;br /&gt;Tom,&lt;br /&gt;&lt;br /&gt;Awesome work, this is the first benchmark I've seen for VectorWise and it does look very good. Although, I'm actually surprised how close InfiniDB and LucidDB are, based on all the VW hype.&lt;br /&gt;&lt;br /&gt;NFS on Dell EqualLogic though? I always cringe when I see a database living on a SAN. So much potential for trouble (and really, really slow I/O).&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Jeff,&lt;br /&gt;&lt;br /&gt;I have to say that your comment is off base. I'm glad that Infobright has a community manager who's speaking for them but this comment is *not* helping.&lt;br /&gt;&lt;br /&gt;First, your statement that "in analytics, companies overwhelmingly use aggregates" is plain wrong. We use aggregates as a fallback when absolutely necessary. Aggregates are a maintenance nightmare and introduce a huge "average of an average" issue that is difficult to work around. I'm sure I remember reading some Infobright PR about removing the need for aggregate tables.&lt;br /&gt;&lt;br /&gt;Second, you guys have a very real performance problem with certain types of queries that should be straightforward. Just looking at it prima facie it seems that Infobright starts to struggle as soon as we introduce multiple joins and string or range predicates. The irony of the poor Infobright performance is that your compression is so good that the data could *almost* fit in RAM.&lt;br /&gt;&lt;br /&gt;What I'd like to see from Infobright is: 1) A recognition of the issue as being real.
2) An explanation of why Infobright is not as fast in these circumstances. 3) An explanation of how to rewrite the queries to get better performance (if possible). 4) A statement about how Infobright is going to address the issues and when.&lt;br /&gt;&lt;br /&gt;I like Infobright; I like MySQL; I'm an open source fan; I want you to succeed. The Star Schema Benchmark is not going away, Infobright needs to have a better response to it.&lt;br /&gt;&lt;br /&gt;Joe&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-8001866520759372455?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/8001866520759372455/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2010/12/comment-regarding-infobrights.html#comment-form' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/8001866520759372455'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/8001866520759372455'/><link rel='alternate' type='text/html' href='http://joeharris76.blogspot.com/2010/12/comment-regarding-infobrights.html' title='Comment regarding Infobright&apos;s performance problems'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-2944124552865540163</id><published>2010-11-18T06:31:00.000-08:00</published><updated>2010-11-18T06:42:06.157-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='potential'/><category scheme='http://www.blogger.com/atom/ns#' term='real-time'/><category scheme='http://www.blogger.com/atom/ns#' term='google istant'/><title type='text'>Google Instant: how I *wish* it worked</title><content type='html'>There's something very grating about a product that &lt;b&gt;could&lt;/b&gt; be really useful but just isn't. It's like the really promising kid on X-Factor / American Idol who keeps falling apart and forgetting their song. The first couple of times you're rooting for them but after blowing it repeatedly you start to wish they'd just give up. For me, this is a perfect metaphor for the current iteration of Google Instant.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Google Instant how do I hate thee? Let me count the ways.&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;i&gt;&amp;gt;&amp;gt; Too damn fast&lt;/i&gt;&lt;/b&gt;&lt;br /&gt;I'm on a fast connection and if anything GI is just too damn fast. Results are constantly flickering just below my field of focus. I tend to use very exact (e.g. long) search phrases and this gets old quick. I find myself pausing while typing to look at GI results that are irrelevant to what I actually need. The cynic in me wonders whether this is what Google want. Are they trying to be 'sticky' now, despite a decade of saying this isn't their goal?&lt;br /&gt;&lt;b&gt;&lt;i&gt;&amp;gt;&amp;gt; Very generic phrases&lt;/i&gt;&lt;/b&gt;&lt;br /&gt;It's frustrating how useless the GI suggestions are. GI only gives you very generic phrases and they seem to be based on average global searches.
The trouble is that I'm &lt;b&gt;not&lt;/b&gt; an average global searcher. I've been using Google forever; they have a huge trove of data about what I've searched for and which results I've clicked on. They know which subjects I'm interested in and which ones I'm not. They put this info to use in 'normal' search in a variety of ways but apparently not in GI.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Technologically impressive&lt;/b&gt;&lt;br /&gt;GI is clearly a very impressive bit of technology. The number of elements that have to work in harmony for it to return results that fast is honestly a little mind boggling. I take my hat off to the clever clogs who made this happen. Nevertheless, I'd prefer to wait (like a whole *second*) for even slightly better results. &lt;br /&gt;&lt;br /&gt;&lt;b&gt;An incomplete puzzle&lt;/b&gt;&lt;br /&gt;Having said that, I'm quite sure that GI can be made a lot better and, as it's bad form to complain without offering a solution, I have some suggestions about how it could be better. None of my suggestions are particularly original (or insightful?), mostly I'm just suggesting that existing elements be combined in better ways.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Instant Example&lt;/b&gt;&lt;br /&gt;Here's a sample of Google Instant in action. 5 suggestions and a ton of dead space. &amp;nbsp;It's not clear how the suggestions are ordered or whether the order has some hidden meaning.
(I should probably check out whether more costly PPC terms appear first…)&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/_W0USePhdhsc/TOUJ8BND23I/AAAAAAAAFZU/beEONwNeJ-w/s1600/google_instant-1.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://4.bp.blogspot.com/_W0USePhdhsc/TOUJ8BND23I/AAAAAAAAFZU/beEONwNeJ-w/s1600/google_instant-1.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Related Searches&lt;/b&gt;&lt;br /&gt;Here's a sample of the Related Searches option. This is buried under "More Search Tools" on the left. &amp;nbsp;Obviously there are more suggestions here but they are also different from Instant and in a different (also non-obvious) order.&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/_W0USePhdhsc/TOUN4_N4SSI/AAAAAAAAFZc/n66HVZ6APXo/s1600/google_related_searches.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://4.bp.blogspot.com/_W0USePhdhsc/TOUN4_N4SSI/AAAAAAAAFZc/n66HVZ6APXo/s1600/google_related_searches.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;b&gt;Wonder Wheel (of Doom!)&lt;/b&gt;&lt;br /&gt;Here's a sample of the Wonder Wheel option also buried under "More Search Tools". 
Again there is no context around any of the terms, and the underlines suggest links but actually trigger a new 'wheel'. The results are displayed on the left but you need a very wide screen, otherwise they only get ~200 pixels of width.&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/_W0USePhdhsc/TOUOzdTD8cI/AAAAAAAAFZg/Xwd3viBPmOY/s1600/google_wonder_wheel-1.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://1.bp.blogspot.com/_W0USePhdhsc/TOUOzdTD8cI/AAAAAAAAFZg/Xwd3viBPmOY/s1600/google_wonder_wheel-1.png" /&gt;&lt;span class="Apple-style-span" style="-webkit-text-decorations-in-effect: none; color: black;"&gt;&amp;nbsp;&lt;/span&gt;&lt;/a&gt;&lt;/div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;A mockup example&lt;/b&gt;&lt;br /&gt;Here's my mockup for your amusement. Points to note:&lt;br /&gt;1) Search suggestions appear in columns, each column depends on the column to its left. If you've used Finder on the Mac you know the score here. Users can navigate with the mouse or arrow keys.&lt;br /&gt;2) Suggested terms are greyscale to indicate some hidden metric that may help the user choose between terms. Possible metrics: number of results, popularity of the term, previous visits, etc. Previously used terms could appear in purple. There are lots of possibilities here.&lt;br /&gt;3) When a term is selected (using space or arrow right)&amp;nbsp;it's added to the search box. Terms can be removed the same way (arrow left or backspace). I strongly feel that users should be encouraged to build long and specific search terms. Long terms are far more likely to result in quality responses in my experience.&lt;br /&gt;4) Note that all aspects of Google's offering can be integrated in the Instant Search experience.
I've noticed that the video, image and social aspects have dropped to the bottom of the results. My mockup allows them to become much more front and center.&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/_W0USePhdhsc/TOU4ZLc-lbI/AAAAAAAAFZo/5UxQJgNrJY4/s1600/google_instant_mockup-1.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://4.bp.blogspot.com/_W0USePhdhsc/TOU4ZLc-lbI/AAAAAAAAFZo/5UxQJgNrJY4/s1600/google_instant_mockup-1.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;*Do* be dense&lt;/b&gt;&lt;br /&gt;Ultimately the Instant Search experience needs to become much more information dense. Sure "your mom" might not appreciate the color coding of the suggestions but it doesn't &lt;b&gt;detract&lt;/b&gt;&amp;nbsp;from her experience. Google needs to think much more holistically about Instant. 
Just getting &lt;i&gt;any old crap&lt;/i&gt;&amp;nbsp;faster is not an improvement regardless of how impressive it is, but getting the &lt;b&gt;exact right result&lt;/b&gt;&amp;nbsp;faster would be invaluable.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-2944124552865540163?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/2944124552865540163/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2010/11/google-instant-how-i-wish-it-worked.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/2944124552865540163'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/2944124552865540163'/><link rel='alternate' type='text/html' href='http://joeharris76.blogspot.com/2010/11/google-instant-how-i-wish-it-worked.html' title='Google Instant: how I *wish* it worked'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_W0USePhdhsc/TOUJ8BND23I/AAAAAAAAFZU/beEONwNeJ-w/s72-c/google_instant-1.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-6517158608775625125</id><published>2010-11-02T09:42:00.000-07:00</published><updated>2010-11-04T06:43:19.869-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' 
term='37signals'/><category scheme='http://www.blogger.com/atom/ns#' term='statistician'/><category scheme='http://www.blogger.com/atom/ns#' term='bi'/><category scheme='http://www.blogger.com/atom/ns#' term='analyst'/><title type='text'>Thoughts on 37signals requirements for a "Business Analyst"</title><content type='html'>Jason Fried posted a job requirement on Friday evening for a new "Business Analyst" role at 37signals, although in reality the role is more of a Business Intelligence Analyst than a typical BA as I have experienced it. The role presents something of a conundrum for me and I thought it would be interesting to pick it apart in writing for your enjoyment.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;UPDATE: I made a follow-up comment on the 37s blog that is a pretty good summary of this post - "&lt;i&gt;I was reflecting on the requirement from years of experience doing this kind of thing (sifting meaning from piles of data). I actually said that they’ve described 2 roles not often combined in a single person.&lt;br /&gt;&lt;br /&gt;So let me give an actual suggestion: &gt; First, do the basics, make sure the data is organised and reliable. &gt; Second, establish your metrics, build a performance baseline. &gt; Third, outsource the hard analytics on a “pay for performance” basis. &gt; Finally, if that works, then think about bringing analytic talent in-house.&lt;br /&gt;&lt;br /&gt;It’s hard to describe the indignity of hiring a genius and then forcing them to spend 95% of their time just pushing the data around.&lt;/i&gt;"&lt;br /&gt;&lt;br /&gt;I call myself a Business Intelligence professional and my last title was "Solution Architect". You can review my claim to that title on my &lt;a href="http://linkedin.com/in/joeharris76"&gt;Linked In profile&lt;/a&gt;. &amp;nbsp;This &lt;b&gt;should&lt;/b&gt;&amp;nbsp;be a great opportunity for me.
I'm a fan of 37signals' products; I love the Rework book and I pretty much agree with all of it; and I really like their take on business practices like global teams and firing the workaholics.&lt;br /&gt;&lt;br /&gt;However, it doesn't seem like this role is for me. Why not? It seems like they've left out the step where I do my best work: gather, integrate, prepare and verify the data. In the Business Intelligence industry we usually call the results of this phase the "data warehouse". A data warehouse, in practice, can't actually be defined in more detail than that. It's simply the place where we make the prepared data available for use. Nevertheless, the way that you choose to prepare the data inherently defines the outcomes that you get. It's the garbage in, garbage out axiom.&lt;br /&gt;&lt;br /&gt;Jason tells us a little bit about where their data comes from:&lt;i&gt; "[their] own databases, raw usage logs, Google Analytics, and occasional qualitative surveys."&lt;/i&gt; We're looking at very raw data sources here. Making good use of these data sources will take a lot of preparation (GA excepted) and will require a serious investment of time (and therefore money). The key is to create structures and processes that are automated and repeatable. This may seem obvious to you, but there's a sizeable number of white collar workers whose sole job is wrangling data between spreadsheets [ e.g. accountants ;) ].&lt;br /&gt;&lt;br /&gt;The content and structure of the data are largely defined by the questions that you want to answer. Jason has at least given us an indication of their questions: &amp;nbsp;"&lt;i&gt;How many&amp;nbsp;customers that joined 6 months ago are still active?"; "What’s the average lifetime value of a Basecamp customer?"; "Which upgrade paths generate the most revenue?"&lt;/i&gt;. That's a good start. I can easily imagine where I'd get that data from and how I'd organise it. 
This is 'meat &amp;amp; potatoes' BI and it's where 80% of the business value is found. These are the things you put on your "dashboard" and track closely over time.&lt;br /&gt;&lt;br /&gt;Another question is trickier: &lt;i&gt;"In the long term would it be worth picking up 20% more free customers at the expense of 5% pay customers?&lt;/i&gt;" There's a lot of implied data packed into that question: what's the long term?; what does 'worth' mean?; can you easily change that mix?; are those variables even related?; etc. This is more of a classical business analysis situation where we'd build a model of the business (usually in a spreadsheet) and then flex the various parameters to see what happens. If you want to get fancy you can then run a Monte Carlo simulation where you (effectively) jitter the variables at random to see the 'shape' of all possible outcomes. This type of analysis requires a lot of experience with the business. It's also high risk because you have to decide on the allowed range of many variables and guessing wrong invalidates the model. It &lt;b&gt;can&lt;/b&gt; reveal very interesting structural limits to growth and revenue if done correctly. Often the credibility of these models is defined by the credibility of the person who produced them, for better or worse. Would that be the same at 37signals?&lt;br /&gt;&lt;br /&gt;Now we move on to slippery territory: &lt;i&gt;"What are the key drivers that encourage people to upgrade?"; "What usage patterns lead to long-term customers?". &lt;/i&gt;We're basically moving into operational research here. We want to split customers into various cohorts and analyse the differences in behaviour. The primary success factor in this kind of analysis is experimental design and this is a specialist skill. Think briefly about the factors involved and they make the business model seem tame. How do we define usage patterns? Will we discover them via very clever statistics or just create them &lt;i&gt;a priori&lt;/i&gt;? 
What are the implications of both approaches? The people who can do this correctly from a base of zero are pretty rare, in my experience. However, this is an area where you can outsource the work &lt;b&gt;very&lt;/b&gt; effectively if you have already put the effort in to capture and organise the underlying information.&lt;br /&gt;&lt;br /&gt;And the coup de grâce: &lt;i&gt;"Which customers are likely to cancel their account in the next 7 days?"&lt;/i&gt;. It sounds reasonable on its face. However, consider your own actions: Do you subscribe to any services you no longer need or use? The odds are good that you do. Why haven't you cancelled them? Have you said to yourself: "I really need to cancel that when I get a second."? And yet you didn't do it. When you finally did cancel it, was it because you started paying for a substitute service? Or were you reminded of it at just the right time when you could take action? Predicting the behaviour of a single human is a fool's game. The best you can do is group similar people together and treat them as a whole ("we expect to lose 10% of this group"). Anyone who tells you they can do better than that is probably pulling your leg, in my humble opinion.&lt;br /&gt;&lt;br /&gt;Finally, let's have a look at their questions for the applicant's cover letter:&lt;br /&gt;&lt;i&gt;1. Explain the process of determining the value of a visitor to the basecamphq.com home page.&lt;/i&gt;&lt;br /&gt;&amp;gt;&amp;gt; Very open ended. How do we define value in this context? The cost per visit / cost per click? The cost per conversion (visitors needed to deliver a certain number of new paid/free signups)? Or perhaps the acquisition cost (usually marketing expenses as a % of year one revenue)? I'd say we need to track all of those metrics but this is pretty much baseline stuff.&lt;br /&gt;&lt;i&gt;2. 
How would you figure out which industry to target for a Highrise marketing campaign?&lt;/i&gt;&lt;br /&gt;&amp;gt;&amp;gt; This isn't particularly analytical; I'm pretty sure the standard dogma is to sell to the people who already love your stuff. It would be interesting to know how much demographic data 37signals has about their customers' industries. This is an area where you typically need to spend money to get good data.&lt;br /&gt;&lt;i&gt;3. How would you segment our customer base and what can we do with that information?&lt;/i&gt;&lt;br /&gt;&amp;gt;&amp;gt; This is a classical analytic piece of work. I've seen some amazing stuff done with segmentation (self organising maps spring to mind). However, in my experience models based on simple demographics (for individuals) or industry &amp;amp; company size (for businesses) perform nearly as well and are much easier to update and maintain.&lt;br /&gt;&lt;br /&gt;As far as I can tell they want to hire a split personality. Someone who'll A) create a reliable infrastructure for common analysis requirements, B) build high quality models of business processes and C) do deep diving 'hard stats' analytics that can throw up unexpected insights. Good luck to them. Such people do exist; simple probability essentially dictates it. And 37signals seems to have a magnetic attraction for talent so I wouldn't bet against it. On the other hand one of their koans is not hiring rockstars. This sounds like a rockstar to me.&lt;br /&gt;&lt;br /&gt;Full disclosure: I'm currently working on a web service (soon to be at &lt;a href="http://appconductor.com/"&gt;appconductor.com&lt;/a&gt;) that synchronises various web apps and also backs them up. It's going to launch with support for Basecamp, Highrise and Freshbooks (i.e. 2/3rds 37signals products). 
Make of that what you will.</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/6517158608775625125/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2010/11/thoughts-on-37signals-requirements-for.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/6517158608775625125'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/6517158608775625125'/><link rel='alternate' type='text/html' href='http://joeharris76.blogspot.com/2010/11/thoughts-on-37signals-requirements-for.html' title='Thoughts on 37signals requirements for a &quot;Business Analyst&quot;'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-6016398744640932464</id><published>2010-10-22T07:08:00.000-07:00</published><updated>2010-10-25T04:14:47.120-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='sql'/><category scheme='http://www.blogger.com/atom/ns#' term='dw'/><category scheme='http://www.blogger.com/atom/ns#' term='open-source'/><category scheme='http://www.blogger.com/atom/ns#' term='etl'/><category scheme='http://www.blogger.com/atom/ns#' term='orm'/><category scheme='http://www.blogger.com/atom/ns#' term='datamapper'/><category 
scheme='http://www.blogger.com/atom/ns#' term='bi'/><title type='text'>Disrespecting the database? ORM as disruptive technology</title><content type='html'>The premise of this post is that ORMs are a disruptive innovation for all parts of the IT industry that utilise databases, particularly relational databases. I'm particularly interested in the ultimate impact of ORMs on my work in the BI-DW-OLAP-DSS-{insert acronym here} industry.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;What makes a 'disruptive technology'?&lt;/b&gt; &lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; In case you are not familiar with &lt;a href="http://en.wikipedia.org/wiki/Innovator's_dilemma"&gt;the "innovator's dilemma" concept&lt;/a&gt;, it was originally expressed in those terms by Clayton M. Christensen in the article 'Disruptive Technologies: Catching the Wave'. &lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;&lt;i&gt; "Generally, disruptive innovations were technologically straightforward, consisting of off-the-shelf components put together in a product architecture that was often simpler than prior approaches. They offered less of what customers in established markets wanted and so could rarely be initially employed there. They offered a different package of attributes valued only in emerging markets remote from, and unimportant to, the mainstream."&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Let's talk about ORMs&lt;/b&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; If you are a true BI-DW person you may not have heard of ORMs and are unlikely to have come across one directly in your work. An &lt;a href="http://en.wikipedia.org/wiki/Object-relational_mapping"&gt;ORM [Object-Relational Mapper]&lt;/a&gt; is simply a set of code routines that 'map' tables and columns in a relational database to objects, attributes and methods in a programming language. 
Programmers can then interact with the data stored by the underlying database without writing any SQL.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;An ugly history with DBAs&lt;/b&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; Now, as you'd probably imagine, DBAs hate ORMs and for good reason. ORMs have typically produced horrible SQL and correspondingly awful performance problems for the DBAs to deal with. ORM use in "enterprise" IT environments is patchy and somewhat limited. It seems like a lot of enterprise ORM use is kept out of sight and only comes to light when the DBAs get &lt;b&gt;really&lt;/b&gt; fired up about some bad SQL that keeps reappearing every time an update to the software is released.&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; The early ORMs were rightly criticised (Hibernate seems to have taken the most heat) but ORMs haven't gone away. The sweet spot for early ORMs was small and 'simple' transactional applications: the kind of app that is needed quickly and where imperfect SQL is not a huge issue. But ORMs keep evolving and becoming more sophisticated in the way they generate SQL and deal with databases. This is where the disruptive innovation part comes in.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;The hockey stick graph&lt;/b&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://upload.wikimedia.org/wikipedia/commons/thumb/8/8e/Disruptivetechnology.gif/450px-Disruptivetechnology.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" src="http://upload.wikimedia.org/wikipedia/commons/thumb/8/8e/Disruptivetechnology.gif/450px-Disruptivetechnology.gif" /&gt;&lt;/a&gt;&lt;/div&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; Looking at the graph from the Wikipedia article I linked above you can see that ORMs started in the bottom left "low quality use" corner. 
&lt;b&gt;My entire point in this post is that ORMs are going to follow the "disruptive technology" curve and eventually they will come to be the dominant way in which ALL database access occurs.&lt;/b&gt; Seriously.&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; Let me explain why I see this happening. There are 3 good technical reasons and a human reason. As usual the human reason is the trump card.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;ORMs are getting better quickly&lt;/b&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; First, we're seeing much better ORMs developed. In particular I want to draw your attention to Datamapper. It's a Ruby ORM that's been around for about 2 and a half years. The interesting thing about Datamapper (for me) is how much respect it has for the database. DB access is designed to minimise the number of queries hitting the backend and at the same time to minimise the data being pulled out unnecessarily (i.e. only get Text/Blob fields if you really want them). Here's the kicker though: it supports foreign keys. Real (honest-to-goodness, enforced-by-the-database) foreign keys. Nice.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;ORMs are the ultimate metadata layer&lt;/b&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;Second, because an ORM is deeply involved in the application itself it can contain a much richer set of metadata about the data that's being stored. 
Compare the following SQL DDL with the equivalent ORM setup code.&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;CREATE TABLE users (&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;id &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; int(10) &amp;nbsp; &amp;nbsp; NOT NULL AUTO_INCREMENT,&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;tenant &amp;nbsp; &amp;nbsp; &amp;nbsp; int(10) &amp;nbsp; &amp;nbsp; NOT NULL,&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;first_name &amp;nbsp; varchar(50) NOT NULL,&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;last_name &amp;nbsp; &amp;nbsp;varchar(50) NOT NULL,&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;title &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;varchar(50) DEFAULT NULL,&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;email &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;varchar(99) DEFAULT 
NULL,&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;passhash &amp;nbsp; &amp;nbsp; varchar(50) DEFAULT NULL,&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;salt &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; varchar(50) DEFAULT NULL,&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;permission &amp;nbsp; int(11) &amp;nbsp; &amp;nbsp; DEFAULT '1',&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;created_at &amp;nbsp; datetime &amp;nbsp; &amp;nbsp;NOT NULL,&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;updated_at &amp;nbsp; datetime &amp;nbsp; &amp;nbsp;DEFAULT NULL,&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;deleted_at &amp;nbsp; datetime &amp;nbsp; &amp;nbsp;NOT NULL DEFAULT '2999-12-31',&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;manager_id &amp;nbsp; int(10) &amp;nbsp; &amp;nbsp; NOT NULL,&lt;/span&gt;&lt;/span&gt;&lt;br 
/&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;PRIMARY KEY (id, tenant),&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;UNIQUE INDEX &amp;nbsp;unique_users_email (email),&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;INDEX index_users_manager (manager_id),&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;INDEX users_tenant_fk (tenant),&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;CONSTRAINT users_tenant_fk &amp;nbsp;FOREIGN KEY (tenant)&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;REFERENCES &amp;nbsp;tenants (id)&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 
&amp;nbsp; &amp;nbsp;ON DELETE NO ACTION&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;ON UPDATE NO ACTION,&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;CONSTRAINT users_manager_fk FOREIGN KEY (manager_id)&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;REFERENCES &amp;nbsp;users (id)&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;ON DELETE NO ACTION&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;ON UPDATE NO ACTION&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier 
New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;);&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;What can we tell about this table? It's got an FK to itself on 'manager_id' and another to 'tenants' on 'tenant'. &amp;nbsp;We don't gain a lot of insight. Here's the Datamapper syntax:&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;class User&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;include DataMapper::Resource&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;property :id,&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;  &lt;/span&gt;Serial&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;property :tenant,&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;Integer, &amp;nbsp;:min =&amp;gt; 0, :required =&amp;gt; true, :key =&amp;gt; true&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;property :first_name,&lt;span 
class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;String, &amp;nbsp; :required =&amp;gt; true &amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;property :last_name,&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;String, &amp;nbsp; :required =&amp;gt; true &amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;property :title, &amp;nbsp; &amp;nbsp; &amp;nbsp;String &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;property :email, &amp;nbsp; &amp;nbsp; &amp;nbsp;String, &amp;nbsp; :length =&amp;gt; (5..99), :unique =&amp;gt; true,&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; :format =&amp;gt; :email_address,&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; :messages =&amp;gt; {:presence =&amp;gt; 'We need your email 
address.',&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; :is_unique =&amp;gt; 'That email is already registered.',&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; :format &amp;nbsp; &amp;nbsp;=&amp;gt; "That's not an email address" }&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;property :passhash, &amp;nbsp; String&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;property :salt, &amp;nbsp; &amp;nbsp; &amp;nbsp; String&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;property :permission, Integer, &amp;nbsp;:default =&amp;gt; 1 &amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;property :phone,&lt;span class="Apple-tab-span" style="white-space: 
pre;"&gt; &lt;/span&gt;String &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;property :mobile,&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;String &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;property :created_at,&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;DateTime, :required =&amp;gt; true&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;property :updated_at,&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;DateTime&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;property :deleted_at,&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;ParanoidDateTime, :required =&amp;gt; true&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 
'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;has 1, &amp;nbsp;:manager&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;has n, &amp;nbsp;:authorities&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;has n, &amp;nbsp;:subscriptions, :through =&amp;gt; :authorities&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;belongs_to :tenant, &amp;nbsp;:parent_key =&amp;gt; [:id], :child_key =&amp;gt; [:tenant]&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;nbsp;&amp;nbsp;belongs_to :manager, self&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;end&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;An awful lot more insightful, if you ask me, and I actually stripped out 50% of the metadata to avoid distracting you. 
We can see that:&lt;br /&gt;&amp;nbsp;&amp;nbsp;Columns&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;gt; Email must be unique, it has a Min and Max length and a specific format.&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;gt; Deleted_At has a special type ParanoidDateTime, which means deletes are logical not physical.&lt;br /&gt;&amp;nbsp;&amp;nbsp;Child Tables&amp;nbsp;&lt;i&gt;{Try finding this out in SQL…}&lt;/i&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;gt; The table has an FK that depends on it from Authorities (1 to many)&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;gt; a relationship to Subscriptions through Authorities (many to many)&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;We're getting a much richer set of metadata here and it's being specified this way because it's useful for the developer not because we're trying to specify a top-down data dictionary. The really interesting thing about the ORM example is that nothing prevents us from enriching this further. We are not bound by the constraints of SQL 99/2003/etc and the way it's been implemented by the vendor.&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; We've been talking about the power and importance of metadata for at least 10 years and, truthfully, we've made almost no progress. Every new BI-DW-ETL-OLAP project I work on still has to start more or less from nothing. The rise of ORMs creates an inflection point where we can change that if we become involved in the systems early in their lifecycle.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;An aside on "metadata layers"&lt;/b&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;There is another point here and it's important. We could make a lot of progress simply by using ORMs to interact with our existing (so called 'legacy') databases. 
Datamapper has a lot of features and tricks for accommodating these databases and it's open source so we can add anything else we need.&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;Imagine an ETL tool that interacted with the database via Datamapper instead of using ODBC/JDBC plus its own metadata. You would start by declaring a very simple model, just table-column-datatype, and then as you learned more about the data you would specify that learning (new metadata) in the ORM itself. I think that's an incredibly powerful concept. The ETL becomes purely for orchestration and all of the knowledge about how to interact with sources and destinations is held in a way that is usable by downstream tools (like a reporting tool or another ETL process). &lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;I imagine this is what the Business Objects guys had in mind when they created their metadata layer ('Universes' in BO parlance) back in the 90s. To my reckoning they didn't quite get there. The re-use of Universe metadata for other processes is (in my experience) non-existent. Yet here is a universal metadata layer, spontaneously realised and completely open for everyone to take advantage of.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;ORMs will ultimately write better SQL&lt;/b&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;Third, ORMs can generate much better SQL than people do. The human element in any technology is the most unpredictable. It's generally not a scheduled report that brings down the database. It's usually a badly formed query submitted ad-hoc by a user. Maybe ORMs can't beat hand-written SQL right now, but the existence of Datamapper indicates we're getting close.&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;Clearly SQL is a complex domain and it will take time for ORMs to be able to cover all of the edge cases, particularly in analytics. However, let me refer you to the previous discussion of Business Objects Universes. 
If you review the SQL BO generates you'll see that the bar is not set all that high.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;ORMs are&amp;nbsp;blessed by the kingmakers&lt;/b&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;Fourth, developers love ORMs. Stephen O'Grady from RedMonk wrote recently that developers are the new/old kingmakers. He has a great quote from&amp;nbsp;Abraham Lincoln:&amp;nbsp;“With public sentiment, nothing can fail; without it nothing can succeed.” ORMs have the kind of positive sentiment that your fancy data dictionary / master data / shared metadata project could never dream of. Developers want to use an ORM for their projects because it helps them. They want to stuff all of the beautiful metadata in there. They want to take huge chunks of business logic out of spaghetti code and put it into a place where we can get at it and reuse it. Who are we to say they shouldn't?&lt;br /&gt;&lt;br /&gt;&lt;b&gt;A final thought on BI becoming "operationalised"&lt;/b&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;My final thought for you is that the new 'edge' of BI is in putting classical BI functionality into operational apps, particularly web apps. If you think this post is a call for existing BI companies to get onboard with ORMs then you are only half right. 
It's also a warning that the way data is accessed is changing and a lot of core BI-DW skillsets may feel a bit like being a mainframe specialist someday soon.&lt;br /&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-6016398744640932464?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/6016398744640932464/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2010/10/can-has-dataz-orm-as-disruptive.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/6016398744640932464'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/6016398744640932464'/><link rel='alternate' type='text/html' href='http://joeharris76.blogspot.com/2010/10/can-has-dataz-orm-as-disruptive.html' title='Disrespecting the database? 
ORM as disruptive technology'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-217196899719393733</id><published>2010-10-08T04:07:00.000-07:00</published><updated>2010-10-10T02:09:32.632-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='dw'/><category scheme='http://www.blogger.com/atom/ns#' term='open-source'/><category scheme='http://www.blogger.com/atom/ns#' term='etl'/><category scheme='http://www.blogger.com/atom/ns#' term='enterprise'/><category scheme='http://www.blogger.com/atom/ns#' term='IT'/><category scheme='http://www.blogger.com/atom/ns#' term='bi'/><title type='text'>The easy way to go open source in BI-DW: slipsteaming</title><content type='html'>I'd like to propose a slightly devious strategy for getting open source Business Intelligence &amp;amp; Data Warehousing into your company. You've probably heard a lot about open source BI / DW offerings in the last few years. You're kind of self-selected into that group by simply reading this post! However, just in case, I'll wrap up a few of the leading lights for you. 
This is by no means comprehensive, consider it an invitation to do some 'googling'.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Open Source Reporting / Dashboards&lt;/b&gt;&lt;br /&gt;&amp;gt; Pentaho: Open Core offering, very complete BI suite, the standard bearer IMO&lt;br /&gt;&amp;gt; Jaspersoft: Open Core offering, very complete BI suite &lt;br /&gt;&amp;gt; Actuate/BIRT: BIRT is very open, other offerings less clear, more OEM focused&amp;nbsp; &lt;br /&gt;&amp;gt; SpagoBI: The 'most' open of the FLOSS BI offerings, possibly less mature&amp;nbsp; &lt;br /&gt;&amp;gt; Palo: I'm kind of confused about what's open/closed but I hear good things &lt;br /&gt;&lt;br /&gt;&lt;b&gt;Open Source ETL / Data Integration &lt;/b&gt;&lt;br /&gt;&amp;gt; Pentaho PDI/Kettle: Stream based, Open Core, excellent for multi-core/server &lt;br /&gt;&amp;gt; Talend: Code generator, Open Core, suited to single big machine processing&amp;nbsp; &lt;br /&gt;&amp;gt; Palo ETL: Tightly integrated with Palo suite, if you like Palo give it a look&lt;br /&gt;&amp;gt; CloverETL: New offering with a lot of features 'checked off'&lt;br /&gt;&amp;gt; Apatar: Another new offering with a lot of features claimed&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Open Source Analytic Databases&lt;/b&gt;&lt;br /&gt;&amp;gt; Infobright: MySQL storage engine, Open Core, good compression, limited &lt;s&gt;DDL&lt;/s&gt; SQL&lt;br /&gt;&amp;gt; InfiniDB: MySQL storage engine, Open Core, maximises modern CPUs , limited &lt;s&gt;DDL&lt;/s&gt;&amp;nbsp;SQL&lt;br /&gt;&amp;gt; LucidDB: Java based, completely open, good compression, supports most of SQL &lt;br /&gt;&amp;gt; MonetDB: completely open but not very active, good compression, likes a lot of RAM&lt;br /&gt;&amp;gt; VectorWise:&amp;nbsp; promising to become open soon, maximises modern CPUs, good 'buzz'&lt;br /&gt;&amp;gt; Greenplum: kinda-sorta open, free 'single node edition', good SQL support&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Open Source OLAP&lt;/b&gt;&lt;br /&gt;&amp;gt; Pentaho 
Analysis/Mondrian: Mature tool, Open Core, likes a fast DB underneath &lt;br /&gt;&amp;gt; Palo: Well regarded OLAP, nice options for Excel use, tightly integrated with suite&lt;br /&gt;&lt;br /&gt;&lt;b&gt;How do you bring it in?&lt;/b&gt; &lt;br /&gt;OK, with that out of the way, how can we bring open source into businesses that already have some sort of BI-DW infrastructure in place? One of the problems that open source faces is that free licenses don't buy fast-talking salespeople who'll come and woo senior managers and executives. So we often have to bring it in by the back door. You're not going to rip out the existing software and replace it with your shiny new open source alternative. You need to find pain points where the business is not getting what it needs but is blocked from getting something better, usually for political or financial reasons.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Typical pain points&lt;/b&gt;&lt;br /&gt;Let's talk about some typical pain points. Is the main DW constantly overloaded by queries? Are some teams' queries throttled because they're not considered important enough? Do you have an analysis team that is not allowed to run the complex queries that they'd like to? Do you have a policy of killing queries that run over a certain time, and is it killing a lot of queries? Does it take a *very* long time to produce the daily report burst? Has a certain team asked for Excel ODBC access to the DW and been blocked? Do some teams want to load their own data in the DW but are not allowed to? Do more people want access to ad-hoc reporting but you can't afford the licenses? Is your ETL development slow because you can't afford any more server licenses for your expensive tool? Are you still doing your ETL jobs as hand coded SQL?&amp;nbsp; &lt;br /&gt;&lt;br /&gt;&lt;b&gt;Finally - the slipstreaming strategy&lt;/b&gt;&lt;br /&gt;If your company has more than 500 people I bet I could easily find at least 3 of those. 
These are the areas where you can implement open source first. You will be using a strategy that I call 'slipstreaming'. Have you ever watched the Tour de France on television? Did you notice that Lance Armstrong almost never rode at the front of the group? He always sat behind his team mates (in the slipstream) to conserve energy so he could attack at the end or break away on the climbs. Sitting behind his team reportedly saved 25% of his energy.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Open source as the 'domestique'&lt;/b&gt;&lt;br /&gt;The bad news: your open source efforts are not in Lance's position. You are going to be the team mate out in front cutting the wind (a domestique). You need to find a pain point where you can put the open source solution in front of the existing solution to 'cut the wind'. Essentially you are going to make the existing solution work better by taking some of the demand away. You will then present this as a 'business-as-usual' or 'tactical' solution to your problem. You need to be very careful to position the work correctly. Your goal is to run this as a small project within your team. Be careful to keep the scope down. Talk about fire-fighting, taking the pressure off, etc. I'm sure you'll know how to position it in your company. You don't want project managers or architecture astronauts getting involved and making things complicated.&lt;br /&gt;&lt;br /&gt;How about some examples?&lt;br /&gt;&lt;br /&gt;&lt;b&gt;The data warehouse edge copy:&lt;/b&gt;&lt;br /&gt;You have an Oracle based DW. It's been around for a while and is suffering, despite hardware upgrades. The overnight load barely finishes by 8:30 and the daily report burst has been getting bigger and only finishes around 10:30 (sometimes 11). The customer insight team has been completely banned from running anything until the reports are out. They're not happy about starting their queries later and later. 
&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Edge copy slipstream strategy&lt;/b&gt;&lt;br /&gt;The slipstream solution to this is to add an edge copy of the DW, sitting between the main DW and either the daily report run or the insight team. You should be able to make use of a reclaimed server (that the DW ran on previously) or you can purchase a "super PC" (basically a gaming machine with extra hard disks). The edge copy will run one of the analytic databases I mentioned. On an older machine I'd lean towards LucidDB or Infobright because of their compression. You then add a new step to the ETL that copies over just the changed data from the DW, or a time-limited subset, to the edge machine. Finally you switch that workload over to the edge copy. If your edge copy takes a while to load (for whatever reason) then talk to the insight team about running an extra day behind everyone else. You'll probably find that they're happy to run a day behind if they have a database to themselves, no restrictions.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;The ETL co-worker:&lt;/b&gt;&lt;br /&gt;You use Business Objects Data Integrator for your ETL processing. You've got a 4-core license and the word has come down that you are &lt;b&gt;not&lt;/b&gt; getting any more. Your processing window is completely taken up with the existing run. ETL development has become a 1-in-1-out affair where new requests can only be delivered by killing something else. The DW devs have started using hand coded routines in the warehouse to deliver work that has political priority.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Co-worker slipstream strategy&lt;/b&gt; &lt;br /&gt;The slipstream solution to this is to add an open source ETL tool as a co-worker to the existing processing. The idea is to leave all of the existing processing on BODI but put new requests onto the new open source package. Again you need to identify either an older server that you can reclaim or source a super-PC to run on. 
Think carefully about the kind of work that can be best done on the co-worker process. Isolated processes are best. You can also do a lot of post loading activities like data cleanup and aggregations. Once you've established the co-worker as a valid and reliable ETL solution then you should aim to set a policy that any existing ETL processing that has to be changed is moved to the new tool at the same time.&lt;br /&gt;&lt;br /&gt;Be devious. Be political.&amp;nbsp; But be nice.&lt;br /&gt;Don't tell them, show them.&lt;br /&gt;Ask forgiveness, not permission.&lt;br /&gt;I wish you luck.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-217196899719393733?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/217196899719393733/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2010/10/devious-strategy-for-getting-to-open.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/217196899719393733'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/217196899719393733'/><link rel='alternate' type='text/html' href='http://joeharris76.blogspot.com/2010/10/devious-strategy-for-getting-to-open.html' title='The easy way to go open source in BI-DW: slipsteaming'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-5544105453172570915</id><published>2010-10-05T02:53:00.000-07:00</published><updated>2010-10-05T02:54:18.491-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='dw'/><category scheme='http://www.blogger.com/atom/ns#' term='etl'/><category scheme='http://www.blogger.com/atom/ns#' term='api'/><category scheme='http://www.blogger.com/atom/ns#' term='saas'/><category scheme='http://www.blogger.com/atom/ns#' term='bi'/><category scheme='http://www.blogger.com/atom/ns#' term='results'/><title type='text'>The trouble with SaaS BI - it's all about the data</title><content type='html'>&amp;nbsp;&amp;nbsp;&amp;nbsp; Some data was released yesterday that purports to show that SaaS BI customers are very pleased with its ease of use, etc., etc. Boring. Seriously, I really like the idea of SaaS BI but I haven't seen anyone making great leaps forward. I'd say that they *can't* take us forward because of the box that they've painted themselves into. The box actually has a name: it's called BI.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;The BI sandbox&lt;/b&gt; &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; Eh? What? Here's the thing; BI as we currently know it is the last stage in the information pipeline. It's the beautiful colours on the box that holds the cereal. But it's not the cereal and it's not even the box. It is *very* important (who would buy cereal in a plain cardboard box?) but is also *very* dependent on other elements in the pipeline. &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; I don't want to get into a long discussion about definitions of BI. Suffice it to say this: why are terms like 'data warehouse' and 'OLAP cube' still prevalent? Simply because BI does not imply data gathering, preparation and storage. Last example on this theme. 
If I tell you I'm a Business Intelligence manager, what would you guess is my remit? Does it include the entire data warehouse? The OLAP cubes? All of the ETL processing? No? It could but it rarely does.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;It's all about the data&lt;/b&gt; &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; I once worked for a clever chap whose mantra was "it's all about the data". His daily struggle was to get the business to invest more time, effort and money into the data itself. It was a hard fight. We had a very fast data warehouse (NZ) and some perfectly serviceable BI software (BO) and nearly a dozen newly minted graduates to turn out our reports. What we did not have was a strong mandate to get the data itself absolutely sorted, to get every term clearly defined and to remove all of the wiggle room from the data. As a consequence we had the same problems that so many BI teams have. Conflicting numbers, conflicting metrics, and political battles using our data as ammunition.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Data is the 'other' 90% &lt;/b&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; I'd estimate that gathering, preparing, and storing the data for BI represents at least 90% of the total effort, with analysis and presentation being the last 10%. I really hope no one is surprised by that figure. I'd think that figure is consistent for any situation in which decisions need to be made from data. For instance a scientist in a lab would have to spend a lot of time collecting and collating measurements before she could do the final work of analyzing the results. 
A research doctor conducting a study will have to collect, organize and standardize all of the study data before he can begin to evaluate the outcome.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;It's NOT about speed&lt;/b&gt; &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; One of the tragedies of the Inmon-Kimball data warehouse definition war is that the data warehouse has been conceived as something that you create because you want to speed up your data access. It's implied that we'd prefer to leave the data in its original systems if we could, but alas that would be too slow to do anything with. What a load of tosh! Anyone who's been in the trenches knows that the *real* purpose of a data warehouse is to organize and preserve the data somewhere safe away from the many delete-ers and archive-ers of the IT world. We value the data for its own sake and believe it deserves the respect of being properly stored and treated. &lt;br /&gt;&lt;br /&gt;&lt;b&gt;Nibbling at the edges&lt;/b&gt; &lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; So, back to the topic, how does SaaS BI help with this issue?&amp;nbsp; Let's assume that SaaS BI does what it claims and makes it much easier for "users" to produce reporting and analysis. Great, how much effort have we saved? Even if it takes half as much time and effort we've only knocked 5% off our total.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;The real opportunity&lt;/b&gt; &lt;br /&gt;&amp;nbsp;&amp;nbsp; And finally I come to my point: the great untapped opportunity for the SaaS [BI-DW-OLAP-ETL] acronym feast is the other 90% where most of the hard work happens. Customers are increasingly using online applications in place of their old in-house apps. Everything from ERP to Invoicing to call centre IVRs and diallers is moving to a SaaS model. 
And every SaaS service that's worth its salt offers an open API for accessing the data that they hold.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;The holy grail - instant data&lt;/b&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; This is the mother lode, the shining path for data people. Imagine an end to custom integrations for each customer. Imagine an end to customers having to configure their own ETL and design their own data warehouse before they can actually do anything with their data. The customer simply signs up to the service and you instantly present them with ready to use data. Magic. Sounds like a service worth paying for.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-5544105453172570915?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/5544105453172570915/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2010/10/trouble-with-saas-bi-its-all-about-data.html#comment-form' title='7 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/5544105453172570915'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/5544105453172570915'/><link rel='alternate' type='text/html' href='http://joeharris76.blogspot.com/2010/10/trouble-with-saas-bi-its-all-about-data.html' title='The trouble with SaaS BI - it&apos;s all about the data'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><thr:total>7</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-9040034213881802301</id><published>2010-10-04T03:14:00.000-07:00</published><updated>2010-10-04T03:14:55.028-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='potential'/><category scheme='http://www.blogger.com/atom/ns#' term='sap'/><category scheme='http://www.blogger.com/atom/ns#' term='open-source'/><category scheme='http://www.blogger.com/atom/ns#' term='small-business'/><category scheme='http://www.blogger.com/atom/ns#' term='enterprise'/><category scheme='http://www.blogger.com/atom/ns#' term='IT'/><category scheme='http://www.blogger.com/atom/ns#' term='results'/><category scheme='http://www.blogger.com/atom/ns#' term='oracle'/><title type='text'>Buying results versus buying potential in business IT</title><content type='html'>&lt;div&gt;&lt;b&gt;It's all about potential&lt;/b&gt;&lt;/div&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;You have probably noticed that business IT (in general) is very expensive. In fact we (the IT industry) invented a special word to justify the expense: Enterprise. We use the word Enterprise to imply that a product or service is: robust, sophisticated, reliable, professional, complex and (most of all) valuable. But if you look deeper you might notice something interesting about "enterprise" IT; it's all about potential. The most successful IT products and companies make all their money by selling potential. Oracle databases have the potential to handle the world's largest workloads. SAP software has the potential to handle the processes of the world's biggest companies. 
IBM servers have the potential to run the world's most demanding calculations and applications.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;You're buying what&amp;nbsp;&lt;b&gt;could be&lt;/b&gt;&lt;/b&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;Selling potential is insanely lucrative. After all you're not buying what &lt;b&gt;actually is&lt;/b&gt;, you're buying what &lt;b&gt;could be&lt;/b&gt;. You're not buying an Oracle database&amp;nbsp;simply&amp;nbsp;to keep track of your local scrap metal business; you're buying Oracle to keep working while you become the biggest scrap metal business in the world. You're not buying SAP to tame your 20 site tool hire business processes, you're buying SAP to help you become the world leader in tool hire. And when the purchase is framed like this customers actually &lt;b&gt;want to pay more&lt;/b&gt;. I've been involved in more than one discussion where suppliers were eliminated from consideration for being too cheap. It was understood that they couldn't be "enterprise enough" at that price point.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Really expensive DIY&lt;/b&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;The funny thing is that buying potential actually means you'll have to do it yourself. This is actually the defining characteristic of enterprise IT; whatever it costs you'll spend the same again to get it working. You don't simply install the Oracle database and press run. You can't just install SAP on everyone's desktop. You need to hire experts, create a strategy, run a long project, put it live over a long weekend, perform extensive tuning, etc., etc. Of course your enterprise IT supplier will be happy to help with all this, but that'll cost extra.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Small businesses need results&lt;/b&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;Ironically, small businesses want to buy the exact opposite of potential. They want to buy results. 
Strike that; they &lt;b&gt;need&lt;/b&gt;&amp;nbsp;to buy results. A specific action should result in a specific outcome. In a small business you can't afford to waste time and money on something that &lt;b&gt;might&lt;/b&gt;&amp;nbsp;be great. What is required is something that is OK&amp;nbsp;&lt;b&gt;right now&lt;/b&gt;. I think the web is perfect for delivering this kind of service and I think that's why small businesses are embracing web services like Basecamp, Freshbooks and (hopefully) my very own AppConductor.com. I think there is latent demand for better technology in small businesses. They can't risk spending big money on potential but they're happy to try out a service with a small monthly fee. If it helps them they'll keep paying the monthly fee and if it doesn't then they won't have lost much.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Harris law of IT spending&lt;/b&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;I think this is so powerful I'm going to make it a law. The Harris law of IT spending: "Pay only for results. Never buy potential".&lt;br /&gt;&lt;br /&gt;P.S. You might notice that open source embodies my law completely. 
The thing that has potential is free, and then you &amp;nbsp;spend time/money on results.&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-9040034213881802301?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/9040034213881802301/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2010/10/buying-results-versus-buying-potential.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/9040034213881802301'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/9040034213881802301'/><link rel='alternate' type='text/html' href='http://joeharris76.blogspot.com/2010/10/buying-results-versus-buying-potential.html' title='Buying results versus buying potential in business IT'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-805732953489746517</id><published>2010-10-01T03:36:00.000-07:00</published><updated>2010-10-01T03:36:33.886-07:00</updated><title type='text'>Further thoughts on real-time ETL, the Magic of Queues</title><content type='html'>&amp;nbsp;&amp;nbsp; &amp;nbsp;Now that we have polling out of the way, let's talk about integrating the data once you know something has changed. 
If you are doing this kind of thing for the first time then your inclination will be towards "one giant script"; do this &amp;gt; then this &amp;gt; then this &amp;gt; then this &amp;gt; ad infinitum. Avoid this pattern. It's not practical and it's hard to change. Worse, in real-time ETL it will come back to bite you - hard.&lt;br /&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;Real-time = inconsistent&lt;/b&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;The hard thing about real-time ETL, in my view, is not the continuous processing. It's the inconsistency. One system may receive a large number of changes in a short spell, another system may go hours without a single update. Or all of your systems could (easily) burst into life at exactly 9 AM and hammer you with updates for an hour. But these are just the obvious options.&lt;br /&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;The dreaded 'mysterious slowdown'&lt;/b&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;A more insidious problem is the dreaded 'mysterious slowdown'. Let's say you have a dependency on address checking using an app from MoMoneySys. The app runs on its own server, was installed by the MMS engineers and is not to be interfered with on penalty of death. The app is like a foreign footballer, brilliant when it works, liable to collapse under pressure. If your real-time ETL flows directly through this app then your performance is directly tied to its performance, not a nice situation to be in.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Time-to-live? Get over it&lt;/b&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;I mentioned in the previous post that I'm not really interested in talking about how far my ETL process is behind live and this pretty much explains why; it's just meaningless talk. If no updates are happening on the source and my ETL process is empty then my 'time-to-live' is 0. If the app suddenly saturates me with changes then my TTL is going to go down the pan. 
The best I can do is monitor the average performance and make sure my ETL 'pipe' is 'wide' enough to deliver a performance I'm happy with (another post perhaps…).&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Behold the QUEUE&lt;/b&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;So you need a way to cope with the vagaries of sporadic updates in the source, patchy performance in dependencies and, simply, your desire to remain sane. And behold I present to you the QUEUE. Queueing is an excellent way to break up your processing flow and handle the inconsistent, patchy nature of real-time ETL.&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;Contrary to the title, there is no magic involved in queues. A queue is simply a list of jobs that need to be worked. There are any number of queueing apps/protocols available; from the complex, like AMQP (http://en.wikipedia.org/wiki/AMQP), to the fairly simple, like BeanstalkD (http://kr.github.com/beanstalkd/), to my personal favourite - the queue table. Jobs are added to the queue and then worked in order. Typically you also give them a priority and add some logic where a job is checked out of the queue to be worked and only deleted on success.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Queues beat the real-time blues&lt;/b&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;So how can we use queues to help our real-time blues? Basically we use queues to buffer each processing step from the next step. 
Let's think about processing the steps I've mentioned (above and previous post) in a real-time ETL process that uses internal and external sources.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;A real-time ETL example&lt;/b&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;A. Check for changes&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;A.1 Poll external web service for changes&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;A.2 Receive change notices from internal database&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" 
style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;B. Retrieve changed data&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;B.1 Extract changed data from web service&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;B.2 Extract changed data from internal database&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;C. 
Validate changed data&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;C.1 Append geocoding information (using geonames.org data)&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;C.2 Process new personal data with external identity verification service&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;D. Load data into final database (e.g. 
data warehouse)&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;This time with queues&lt;/b&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;A. Check for changes&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;gt; Take job from the {poll} queue&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;A.1 Poll external web service for changes (based on job)&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span 
class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;gt; Add job to the {retrieve} queue&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;A.2 Receive change notices from internal database&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;gt; Add job to the 
{retrieve} queue&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;B. Retrieve changed data&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;gt; Take job from the {retrieve} queue&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;gt; Choose worker 1 or 2 for the job&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;B.1 Extract changed 
data from web service&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;gt; Load raw changes to the staging area&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;gt; Add job to the {validate} queue&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;B.2 Extract changed data from internal database&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: 
small;"&gt;&amp;gt; Load raw changes to the staging area&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;gt; Add job to the {validate} queue&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;C. 
Validate changed data&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&lt;b&gt;&amp;gt; Take job from the {validate} queu&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&lt;b&gt;e&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;gt; Add job to {geocode} or {identity} queues (or both)&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;C.1 Append geocoding information (using geonames.org data)&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span 
class="Apple-style-span" style="font-size: small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;gt; Take job from the {geocode} queue&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;gt; Update staging tables with geocoding.&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;gt; Add job to the {final} queue&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;C.2.a Submit new personal data to external identity verification service&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: 
pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;gt; Take job from the {identity} queue&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;gt; Update staging tables with geocoding.&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;gt; Add job to the {ready} queue&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;C.2.b Retrieve verified personal data from verification 
service&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;gt; DON'T WAIT! Use a separate listener to receive the data back&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;gt; Update staging tables as required (or any other work…)&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;gt; Add job to the {final} queue&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, 
monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;D. Load data into final database (e.g. data warehouse)&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;gt; Take jobs from {final}&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;  &lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;&amp;gt; Load data in the final database&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;A mental model to finish&lt;/b&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;I've glossed over some issues here but you get the picture. You never want to have any part of your real-time ETL process waiting on another process that is still working. I like to picture it as a warehouse 'pick line': In normal ETL you ask each person to hand off directly to the next person. They do their work and then, if the next person isn't ready, they have to wait to hand off. In real-time ETL, there is a 'rolling table' (queue) between each person. 
They pick their items off the end of their table, do their work, and then place them at the start of the next table.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-805732953489746517?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/805732953489746517/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2010/10/further-thoughts-on-real-time-etl-magic.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/805732953489746517'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/805732953489746517'/><link rel='alternate' type='text/html' href='http://joeharris76.blogspot.com/2010/10/further-thoughts-on-real-time-etl-magic.html' title='Further thoughts on real-time ETL, the Magic of Queues'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-7517033858199552721</id><published>2010-09-30T03:29:00.000-07:00</published><updated>2010-09-30T03:41:29.738-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='etl'/><category scheme='http://www.blogger.com/atom/ns#' term='api'/><category scheme='http://www.blogger.com/atom/ns#' term='cdc'/><category scheme='http://www.blogger.com/atom/ns#' term='real-time'/><category scheme='http://www.blogger.com/atom/ns#' term='polling'/><title 
type='text'>Getting started with real-time ETL and the dark art of polling:</title><content type='html'>&amp;nbsp;&amp;nbsp; &amp;nbsp;There has been a lot of discussion about real-time ETL over the last few years and a lot of it can be summarised as "don't do it unless you REALLY need to". Helpful, eh? I recently had the need to deal with real-time for the first time so I thought I would summarise my approach to give you some food for thought if you're starting (or struggling) on this same journey.&lt;br /&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;Is it really real-time?&lt;/b&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;The question often asked is "how far behind the source (in time) can you be, and still call it real-time?". I don't really care about this kind of latency. I think it's basically posturing; "I'm more real-time than you are". My feeling is that I want something that works continuously first and I'll worry about latency later. As long as the process is always working to catch up to the source that's a good start.&lt;br /&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;Old options are out&lt;/b&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;Next question (and one that actually matters): "How will you know when the data has changed on the source?" This is an old classic from batch ETL; the difference is that we have taken some of our traditional options away. In batch ETL we could periodically extract the whole resource and do a complete compare. Once you go real-time, this approach will actually miss a large number of changes that update the same resource multiple times. In fact, I would say that repeated updates of a single resource are the main type of insight that real-time adds, so you'd better make sure you're getting it.&lt;br /&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;CDC: &amp;nbsp;awesome and out of reach&lt;/b&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;What can you do to capture changes? 
Your first (and best) option is change data capture. CDC itself is beyond the scope of this discussion; however, the main point is that it is tightly bound to the source system. If you've been around data warehousing or data integration for more than 5 minutes you can see how that could be a problem. There are numerous half-way house approaches which I won't go over; suffice it to say that most enterprise databases have metadata tables and pseudo-column values that they use internally to keep track of changes and these can be a rich seam of information for your real-time ETL quest.&lt;br /&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;Polling: painful but necessary&lt;/b&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;You will inevitably come across some system which allows you no detailed interaction with its backend. Web-based services are the perfect case here - you're not going to get access to the remote database so you just have to cope with using their API. And that leaves you with - POLLING. Basically asking the source system: 'has this resource changed?' or (when you can't ask that) extracting the resource and comparing it to your copy.&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;A naive approach would be to simply iterate through the entire list of resources over a given interval. The time it takes to complete an iteration would be, roughly speaking, your latency from live. However, DON'T DO THIS unless you want to be strangled by the SysAdmin for the source or banned from API access to the web service.&lt;br /&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;My 'First law of real-time ETL'&lt;/b&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;So I would propose the following heuristic: data changed by humans follows Newton's first law. 
Restated:&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;i&gt; &lt;/i&gt;&lt;/span&gt;&lt;/span&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;i&gt;'Data in motion will stay in motion, data at rest will stay at rest.'&amp;nbsp;&lt;/i&gt;&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;Basically a resource that has changed is more likely to be changed again when you next check. Conversely a resource which has not changed since you last checked is less likely to have changed when you check again. To implement this in your polling process you would simply track how many times you've checked the resource without finding a change and lengthen your retry interval accordingly; in the example here the interval doubles on every unchanged check and resets as soon as a change is found.&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;For example:&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;&amp;gt; Check resource - no change - unchanged count = 1 - next retry = 4 min&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;&amp;gt; Check resource - no change - unchanged count = 2 - next retry = 8 min&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;&amp;gt; Check resource - no change - unchanged count = 3 - next retry = 16 min&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;&amp;gt; Check resource - no change - unchanged count = 4 - next retry = 32 min&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;&amp;gt; Check resource - CHANGED - unchanged count = 0 - next retry = 1 min&lt;br /&gt;&lt;br /&gt;&lt;b&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-weight: normal;"&gt;&lt;b&gt;Keep it simple 
stupid&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;/b&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;This is a simplistic approach but it can massively reduce the strain you place on the source system. You should also be aware of system driven changes (e.g. invoice generation) and data relationships (e.g. company address changes &amp;gt; you need to check all other company elements sooner than scheduled). You should also note that changes which are not made by humans are much less likely to obey this heuristic.&lt;br /&gt;&lt;span class="Apple-style-span" style="font-weight: 800;"&gt;&lt;span class="Apple-style-span" style="font-weight: normal;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;b&gt;A note for the web dudes&lt;/b&gt;&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;Finally, if you are mostly working with web services then familiarise yourself with the following:&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;&amp;gt; Webhooks, basically change data capture for the web. You subscribe to a resource and changes are notified to a location you specify. Sadly, webhooks are not widely supported right now.&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;&amp;gt; RSS, that little orange icon that you see on every blog you read. 
Many services offer RSS feeds of recently changed data and this is a good compromise.&lt;br /&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;&amp;gt; ETag and If-Modified-Since headers, HTTP header elements that push the burden of looking for changes off to the remote service (which is nice).&lt;br /&gt;&lt;br /&gt;&lt;b&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-weight: normal;"&gt;&lt;b&gt;Good luck.&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;/b&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-7517033858199552721?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/7517033858199552721/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2010/09/getting-started-with-real-time-etl-and.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/7517033858199552721'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/7517033858199552721'/><link rel='alternate' type='text/html' href='http://joeharris76.blogspot.com/2010/09/getting-started-with-real-time-etl-and.html' title='Getting started with real-time ETL and the dark art of polling:'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-6585094388221992342</id><published>2010-03-01T12:00:00.000-08:00</published><updated>2010-03-03T01:08:09.375-08:00</updated><title type='text'>Financialisaton:  Optimised to death.</title><content type='html'>&lt;div&gt;A few previous tweets: &lt;/div&gt;&lt;div&gt;&gt;&gt; Today's theme: financialisation. Not really a word, made it up to describe the trend of running businesses as if they were hedge funds.&lt;/div&gt;&lt;div&gt;&gt;&gt; Financialisation: #1 benchmark industry for advanced analytics is (still) the financial industry &amp;amp; financial markets. Is this a good sign?&lt;/div&gt;&lt;div&gt;&gt;&gt; Financialisation: For me Enron is the base case of financialising the energy business. There was deceit but at core it was models run amok.&lt;/div&gt;&lt;div&gt;&gt;&gt; Financialisation: CEP, Real-time BI, etc: Can you filter signal from noise in real time? Processing delay may prevent overreactions.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Financialisation; a term that I invented (AFAIK) to describe running 'normal' businesses like they're hedge funds. E.g. the use of statistics / 'quantitative models' to wring every bit of  excess / waste / inefficiency from a business. Traders on the financial markets attempt to profit from small price movements by developing complex predictive models and the ability to move on changes very, very quickly. The problem is that financial markets are not like normal businesses. They are probably more like a casino than a business that provides tangible goods or services (e.g. 
dentists, dry cleaners, demolition, design, etc.).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Many business executives seem envious of this apparent ability to turn thin air into money using leverage and very fast moving transactions. One suspects they would love to turn their own business so quickly and, perhaps, avoid messy interactions with opinionated customers. Business Intelligence and Analytic Database companies haven't failed to notice this desire and heavily market their reference customers from Finance in other sectors.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;My thoughts on this are influenced by Nassim Nicholas Taleb's books "Fooled by Randomness" and "The Black Swan". His premise is that the world is much more random and much less predictable than it appears to human observers and events often come out of left field to completely upset our ideas (hence the black swan). Taleb never mentions Business Intelligence or Analytics, but I'm struck by the &lt;b&gt;relevance&lt;/b&gt; of his ideas to our industry.  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;On the other side of the fence, Thomas H. Davenport's "Competing on Analytics" is the standard bearer for financialisation and a favourite handout of BI vendors (e.g. &lt;a href="http://www.oracle.com/technology/products/bi/odm/pdf/competing%20on%20analytics%20hbr%20article.pdf"&gt;Oracle&lt;/a&gt;,&lt;a href="http://download.microsoft.com/documents/uk/peopleready/Competing%20on%20Analytics.pdf"&gt; Microsoft&lt;/a&gt;). The choice quote: "Employees hired for their expertise with numbers …are armed with the best evidence… As a result, &lt;i&gt;they make the best decisions&lt;/i&gt;." Really? Simply applying the power of numbers to a business, using very clever people of course, is a sure-fire way to success? Does that mesh with your experience? &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Consider the case of Enron. 
They were principally involved in energy supply, which has a very real need to analyse and forecast future demand. Enron got into trouble by using models (and modellers) to make highly leveraged plays on the energy futures market. Ultimately their crimes were about deceit (they used shell companies to inflate profits and conceal losses); however, it is my understanding that their losses stemmed from deals based on very sophisticated models that did not turn out as predicted. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A counter example very relevant to Business Intelligence is disaster recovery. It's common practice in IT to run a disaster recovery copy of important systems. We keep an exact duplicate of the system in another data centre far from the primary system so that, if the worst happens, business can carry on by switching to the DR instance. This is inherently excess capacity that bears significant costs and yet we hope it will never be used. We carry the cost of all this "excess" equipment for a very good reason: the cost of not having it is potentially much, much higher.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This sums up the risk of financialisation: Can you be certain that what looks like excess (on the cost side) is not actually very important? Can you be sure that what looks like new profit (on the opportunity side) is not exposing you to a large unexpected loss? &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;{This is where I was going to go over some common types of analysis and discuss whether they are more or less likely to be financialised. But this has been in draft for long enough so that will have to wait. 
TTFN}&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-6585094388221992342?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/6585094388221992342/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2010/03/financialisaton-optimised-to-death.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/6585094388221992342'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/6585094388221992342'/><link rel='alternate' type='text/html' href='http://joeharris76.blogspot.com/2010/03/financialisaton-optimised-to-death.html' title='Financialisaton:  Optimised to death.'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-5798118528751979053</id><published>2010-01-01T02:01:00.000-08:00</published><updated>2010-01-01T05:39:48.222-08:00</updated><title type='text'>Unsolicited advice for Linked In and Stack Overflow - MERGE!</title><content type='html'>I think Linked In and Stack Overflow are on a collision course. 
They have both established impressive beachheads in the nascent market for professional reputation services, in particular for reputation that cannot be faked.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Linked In comes at this from the position of an online C.V./resume service that allows you to "connect" to people you've worked with. In theory it's a business version of Facebook, in practice it's actually a reputation service. I do not maintain highly active relationships with former colleagues and customers. We connect on Linked In because it allows us to keep in touch with little effort and verifies who we are, the roles we've held, and the work we've done.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Linked In have recently expanded their group functionality to create discussion forums so people can converse with "real" people. Sadly these groups are trending heavily towards spammy selling posts. There is no way to remove the noise from these groups and no obvious reward for high quality contributors.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The business model for Linked In seems to be selling premium access to user data for recruitment and sales professionals. In my opinion this is a short term model. They are really in direct competition with their customers. The recruitment industry exists because of a lack of quality information about potential employees and is ripe for &lt;a href="http://en.wikipedia.org/wiki/Disintermediation"&gt;disintermediation&lt;/a&gt;. They are also in competition with 'outside' sales professionals which represent a huge cost burden on B2B sales. The rewards for moving sales 'inside' are potentially huge.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Stack Overflow comes at reputation from the other side. They have created a high quality answer board for technical questions and with Stack Exchange have expanded the product into virtually any topic. 
Their key innovation is to reward high quality answers and to encourage quality contributors with incentives like badges and points. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Stack Overflow users generate a different form of reputation, they are verifying that they actually understand a specific subject. This is incredibly valuable because it verifies something that Linked In cannot: the ability to do something *again*. We've all worked with people who just scraped by, doing what they're told without necessarily understanding it. You can guarantee that those people could not generate a good reputation on Stack Overflow.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The business model for Stack Overflow seems to be around advertising and particularly job advertising via the Careers site. S.O. Careers allows users to create an online C.V./resume that is linked to their reputation. Sound familiar?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This is where they collide. Linked In has a lock on the verified C.V./resume side but the discussions functionality is poor. Stack Overflow has a lock on quality discussions and answer board functionality.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Building a reputation on Stack Overflow has some value but it's limited to that context. S.O. Careers may be relatively successful but it seems unlikely to eclipse Linked In in this respect, never mind the 800lb gorillas like Monster. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Likewise building a C.V. on Linked In has some value but participating in discussions is frustrating and has no clear payback. The depth of reputation is limited to fairly shallow "I'll scratch your back if you scratch mine" recommendations.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Potential options:&lt;/div&gt;&lt;div&gt;1. 
Stack Overflow adds social network functionality: seems unlikely to receive broad adoption for any number of reasons.&lt;/div&gt;&lt;div&gt;2. Linked In completely revamps their discussion functions to emulate S.O.: this would require their customers to recreate all of the questions and answers that exist across all S.O. sites.&lt;/div&gt;&lt;div&gt;3. Integrate using APIs. Move Linked In groups/discussions to Stack Exchange and use Linked In profiles in place of re-creating a CV on Stack Overflow.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;4. MERGE!!!! - I'm dead serious here. Merging these services would create a very strong network effect. It would be the natural home for professional questions and discussions and provide clear incentives for people to share their knowledge. The combination would pretty much own the professional reputation space. If this ever happens Monster are toast…&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It's a new year, time for thinking big. 
&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-5798118528751979053?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/5798118528751979053/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2010/01/unsolicited-advice-for-linked-in-and.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/5798118528751979053'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/5798118528751979053'/><link rel='alternate' type='text/html' href='http://joeharris76.blogspot.com/2010/01/unsolicited-advice-for-linked-in-and.html' title='Unsolicited advice for Linked In and Stack Overflow - MERGE!'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-8866935970241267382</id><published>2009-12-24T05:16:00.000-08:00</published><updated>2009-12-24T05:20:27.993-08:00</updated><title type='text'>Comment on Cringley's "DVD is Dead" post</title><content type='html'>&lt;div&gt;In regards to &lt;a href="http://tr.im/Ivrs"&gt;http://tr.im/Ivrs&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;Apple's "plan" to be "front and center" of the living room seems a lot more like an outside bet than a central strategy to me. When you see as many TV spots for the AppleTV as the iPhone then you'll know the strategy has changed. 
The living room tech cycle is super-slow compared to Apple's "normal" and thus difficult to integrate. &lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Bob's kinda missed the point here though, this actually signals the *failure* of Blu-Ray. It's just going to take over from DVD in a smooth flattish decline, no one is out there re-buying their library in Blu-Ray.  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;There's a huge pent up demand for an "iTunes" experience for video content. I.e. put DVD in the machine, machine makes digital copy, moves copy to my device(s), I make future purchases as downloads and everything lives in a single library.  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;My guess is Apple hasn't been able to swing that with the Hollywood studios yet.  It's still not clear whether you're legally allowed to make a backup copy of a DVD you bought. In the meantime they're just keeping the AppleTV on life support until something gives.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-8866935970241267382?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/8866935970241267382/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2009/12/comment-on-cringleys-dvd-is-dead-post.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/8866935970241267382'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/8866935970241267382'/><link rel='alternate' type='text/html' href='http://joeharris76.blogspot.com/2009/12/comment-on-cringleys-dvd-is-dead-post.html' title='Comment on Cringley&apos;s &quot;DVD is Dead&quot; 
post'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-4090469059329188565</id><published>2009-10-25T03:12:00.000-07:00</published><updated>2009-10-26T10:29:46.858-07:00</updated><title type='text'>Unsolicited advice for Kickfire</title><content type='html'>&lt;div&gt;Following up on the Kickfire BBBT tweetstream on Friday (23-Oct), I want to lay out my thoughts about Kickfire's positioning. I should point out that I have little experience with MySQL, no experience with Kickfire and I'm not a marketer ( but I play one on TV… ;) ).&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Kickfire should consider doing the following:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;1. Emphasise the benefits of the FPGA&lt;/b&gt;&lt;/div&gt;&lt;div&gt;We now know that Kickfire's "SQL chip" is in fact an FPGA. Great! They need to bring this out in the open and even emphasise it. This is actually a strength: FPGAs have seen major advances recently and a good argument can be made that they are not "proprietary hardware" but a commodity component advancing at Moore's Law speed (or better).&lt;/div&gt;&lt;div&gt;They should also obtain publishing rights to &lt;a href="http://tr.im/D5xH"&gt;recent research about the speed advantages of executing SQL logic on an FPGA&lt;/a&gt;. Good research foundations and advances in FPGAs make Kickfire seem much more viable long term. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;2. Pull back on the hyperbole.&lt;/b&gt;&lt;/div&gt;&lt;div&gt;Dump the P&amp;amp;G style 'Boswelox' overstatement. 
A lot of the key phrases in their copy seem tired. How many times have we heard about "revolutionary" advances? My suggestion is to use more concrete statements. Example: "Crunch 100 million web log records in under a minute". Focus on common tasks and provide concrete examples of improved performance.&lt;/div&gt;&lt;div&gt;Also, rein in the buzzwords: availability, scalability, sustainability, etc. If this is really for smaller shops and data marts then &lt;b&gt;plain English is paramount&lt;/b&gt;. "Data mart" type customers will have to ram this down the throat of IT. They need to want it more than an iPhone or they'll just give up and go with the default.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;3. Come up with a MapReduce story.&lt;/b&gt;&lt;/div&gt;&lt;div&gt;MapReduce is the new darling of the web industry. Google invented the term, Yahoo has released the main open source project and everyone just thinks it's yummy. Is it a mainstream practice? Probably not, but the bastion of MySQL is not mainstream either. &lt;/div&gt;&lt;div&gt;Kickfire's "natural" customers (e.g. web companies) may not have any experience with data warehousing. When they hit scaling issues with MySQL they may not go looking for a better MySQL. Even if they do they'll probably find and try Infobright in the first instance.&lt;/div&gt;&lt;div&gt;Kickfire needs a story about MapReduce and they need to insert themselves into the MapReduce dialogue. They need to start talking about things like "The power of MapReduce in a 4U server" or "Accelerating Hadoop with Kickfire". &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;4. Offer Kickfire as a service.&lt;/b&gt;&lt;/div&gt;&lt;div&gt;Kickfire needs to be available as a service. This may be a complete pain in the ass to do and it may seem like a distraction. I bet Kickfire policy is to offer free POCs. But IMHO their prices are too low to make this scalable. 
&lt;/div&gt;&lt;div&gt;Customers need to be able to try the product out for a small project or even some weekend analysis. When they get a taste of the amazing performance then they'll be fired up to get Kickfire onsite and willing to jump through the hoops in IT.&lt;/div&gt;&lt;div&gt;If this is absolutely out of the question, the bargain basement approach would be to put up a publicly accessible system (registration required) filled with everything from data.gov. Stick Pentaho/Jasper on top (nice PR for the partner…) and let people play around.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;5. Deliver code compatibility with Oracle and SQL Server.&lt;/b&gt;&lt;/div&gt;&lt;div&gt;There are probably compelling reasons for the choice of MySQL. However, many potential customers have never used it. They've never come across it in a previous role. It's not used anywhere in their company. Frankly, it makes them nervous. &lt;/div&gt;&lt;div&gt;Kickfire needs to maximise their code compatibility with Oracle and SQL Server and then they need to talk about it everywhere.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;That is all. 
Comments?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-4090469059329188565?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/4090469059329188565/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2009/10/unsolicited-advice-for-kickfire.html#comment-form' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/4090469059329188565'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/4090469059329188565'/><link rel='alternate' type='text/html' href='http://joeharris76.blogspot.com/2009/10/unsolicited-advice-for-kickfire.html' title='Unsolicited advice for Kickfire'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-5256213704823320091</id><published>2009-07-29T12:17:00.001-07:00</published><updated>2009-07-30T04:24:37.352-07:00</updated><title type='text'>Why GPUs matter for DW/BI</title><content type='html'>&lt;div&gt;I tweeted a few days ago that I wasn't particularly excited about either the Groovy Corp or XtremeData announcements because I think any gains they achieve by using FPGAs will be swept away by GPGPU and related developments. 
I got a few replies asking what GPGPU is and a few dismissing it as irrelevant (vis a vis Intel x64 progress). So I want to explain my thoughts on GPGPU, and how it may affect the Database / Business Intelligence / Analytics industry (industries?).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;   GPGPU stands for "general-purpose computing on graphics processing units". (&lt;a href="http://tr.im/uIFL"&gt;http://tr.im/uIFL&lt;/a&gt;) GPGPU is also referred to as "stream processing" or "stream computing" in some contexts. The idea is that you can offload the processing normally done by the CPU to the computer's graphics card(s).&lt;br /&gt;  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;   But why would you want to? Well, GPUs are on a roll. Their performance is increasing much faster than CPU performance. I don't want to overload this post with background info but suffice it to say that GPUs are *incredibly* powerful now and getting more powerful much faster than CPUs. If you doubt that this is the case have a look at this article on the Top500 supercomputing site, point 4 specifically. (&lt;a href="http://tr.im/uIGd"&gt;http://tr.im/uIGd&lt;/a&gt;)&lt;br /&gt;  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;   This is not a novel insight on my part. I've been reading about this trend since at least 2004. There was a memorable post on Coding Horror in 2006 (&lt;a href="http://tr.im/uIGk"&gt;http://tr.im/uIGk&lt;/a&gt;). Nvidia released their C compatibility layer "CUDA" in 2006 (&lt;a href="http://tr.im/uILn"&gt;http://tr.im/uILn&lt;/a&gt;) and ATI (now AMD) released their alternative "Stream SDK" in 2007 (&lt;a href="http://tr.im/uITf"&gt;http://tr.im/uITf&lt;/a&gt;). More recently the OpenCL project has been established to allow programmers to tap the power of *any* GPU (Nvidia, AMD, etc) from within high level languages. 
This is being driven by Apple and their next OSX update will delegate many tasks to the GPU using OpenCL.&lt;br /&gt;  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;   That's the &lt;i&gt;what&lt;/i&gt;. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Some people feel that GPGPU will fail to take hold because Intel will eventually catch up. This is a reasonable point of view and in fact Intel has a project called Larrabee (&lt;a href="http://tr.im/uIFW"&gt;http://tr.im/uIFW&lt;/a&gt;). They are attempting to make a hybrid chip that effectively emulates a GPU within the main processor. It's worth noting that this is very similar to the approach IBM have taken with the Cell chip used in the Playstation3 and many new supercomputers. Intel will be introducing a new set of extensions (like SSE2) that will have to be used to tap into the full functionality. The prototypes that have been demo'ed are significantly slower than current pure GPUs. The point is that Intel are aware of GPGPU and are embracing it. The issue for Intel is that the exponential growth of GPU power looks like it's going to put them on the wrong side of a technology growth curve for once.&lt;br /&gt;  &lt;/div&gt;&lt;br /&gt;  &lt;div&gt;&lt;br /&gt;   &lt;b&gt;Why are GPUs important to databases and analytics?&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;ol&gt;    &lt;li&gt;&lt;b&gt;The multi-core future is here now.&lt;/b&gt;&lt;br /&gt;    &lt;/li&gt;&lt;div&gt;I'm sure you've heard the expression "the future is already here it's just unevenly distributed". Well that applies double to GPGPU. We can all see that multi-core chips are where computing is going. The clock speed race ended in 2004. Current high end CPUs now have 4 cores and 8 cores will arrive next year and on it goes. GPUs have been pushing this trend for longer and are much further out on this curve. 
High end GPUs now contain up to 128 cores and the core count is doubling faster than CPUs.&lt;br /&gt;    &lt;/div&gt;&lt;br /&gt;    &lt;li&gt;&lt;b&gt;Core scale out is hard.&lt;/b&gt;&lt;br /&gt;    &lt;/li&gt;&lt;div&gt;Utilizing more cores is not straightforward. Current software does not utilize even 2 cores effectively. If you have a huge spreadsheet calculating on your dual core machine you'll notice that it only uses one core. So half the available power of your PC is just sitting there while you're twiddling your thumbs.&lt;br /&gt;    &lt;/div&gt;&lt;br /&gt;    &lt;div&gt;Database software has a certain amount of parallelism built in already, particularly the big 3 "enterprise" databases. But the parallel strategies they employ were designed for single core chips residing in their own sockets and having their own private supply of RAM. Can they use the cores we have right now? Yes, but the future now looks very different. Hundreds of cores on a single piece of silicon.&lt;br /&gt;    &lt;/div&gt;&lt;br /&gt;    &lt;div&gt;Daniel Abadi's recent post about HadoopDB predicts a "scalability crisis for the parallel database system". His point is that current MPP databases don't scale well past 100 nodes (&lt;a href="http://tr.im/uIFo"&gt;http://tr.im/uIFo&lt;/a&gt;). I'm predicting a similar crisis in scalability for *all database systems* at the CPU level. Strategies for dividing tasks up among 16 or 32 or even 64 processors with their own RAM will grind to a halt when used across 256 (and more) cores on a single chip with a single path to RAM.&lt;br /&gt;    &lt;/div&gt;&lt;br /&gt;    &lt;li&gt;&lt;b&gt;Main memory I/O is the new disk I/O.&lt;/b&gt;&lt;br /&gt;    &lt;/li&gt;&lt;div&gt;Disk access has long been our Achilles heel in the database industry. The rule of thumb for improving performance is to minimize the amount of disk I/O that you perform. 
This weakness has become ever more problematic as disk speeds have increased very, very slowly compared to CPU speed. Curt Monash had a great post about this a while ago (&lt;a href="http://tr.im/uIYb"&gt;http://tr.im/uIYb&lt;/a&gt;)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In our new multi-core world we will have a new problem. Every core we add increases the demand for data going into and out of RAM. Intel have doubled the width of this "pipe" in recent chips but practical considerations will constrain increases in this area in a similar manner to the constraints on disk speed seen in the past.&lt;br /&gt;    &lt;/div&gt;&lt;br /&gt;    &lt;li&gt;&lt;b&gt;Databases will have to change.&lt;/b&gt;&lt;br /&gt;    &lt;/li&gt;&lt;div&gt;Future databases will have to be heavily rewritten and probably re-architected to take advantage of multi-core processor improvements. Products that seek to fully utilize many cores will have to be just as parsimonious with RAM access as current generation columnar and "in-memory" databases are with disk. Further, they will have to become just as savvy about parallelizing their actions as current MPP databases, but they will have to co-ordinate this parallelism at 2 levels instead of just 1.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;ul&gt;     &lt;li&gt;&lt;b&gt;1st:&lt;/b&gt; Activity and data must be split and recombined across Servers/Instances (as currently)&lt;br /&gt;     &lt;/li&gt;&lt;br /&gt;     &lt;li&gt;&lt;b&gt;2nd: &lt;/b&gt;Activity and data must be split and recombined across Cores, which will probably have dedicated RAM "pools".&lt;br /&gt;     &lt;/li&gt;&lt;br /&gt;     &lt;/ul&gt;&lt;li&gt;&lt;b&gt;1st movers will gain all the momentum.&lt;/b&gt;&lt;br /&gt;    &lt;/li&gt;&lt;div&gt;So, finally, this is my basic point. There's a new world coming. It has a lot of cores. It will require new approaches. That world is accessible today through GPUs. 
Database vendors who move in this direction now will gain market share and momentum. Those who think they can wait on the Intel and "traditional" CPUs to "catch up" may live to regret it.&lt;/div&gt;   &lt;/ol&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;b&gt;   A few parting thoughts…&lt;/b&gt;&lt;br /&gt;  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;   I said at the start that I feel FPGAs will be swept away. I should make 2 caveats to that. First, I can well imagine a world where FPGAs come to the fore as a means to co-ordinate very large numbers of small simple cores. But I think we're still quite a long way from that time. Second, Netezza use FPGAs in a very specific way between the disk and CPU/RAM. This seems like a grey area to me, however Vertica are able to achieve very good performance without resorting to such tricks.&lt;br /&gt;  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;   Kickfire is a very interesting case as regards GPGPU. They are using a "GPU-like" chip as their workhorse. Justin Swanhart was very insistent that their chip is not a GPU (that is an analogy) and that it is truly a unique chip. For their sake I hope this is marketing spin and the chip is actually 99% standard GPU with small modifications. Otherwise, I can't imagine how a start-up can engage in the core count arms race long term, especially when it sells to the mid-market. Perhaps they have plans to move to a commodity GPU platform.&lt;br /&gt;  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;   A very interesting paper was published recently about performing database operations on a GPU. You can find it here (&lt;a href="http://tr.im/ufYk"&gt;http://tr.im/ufYk&lt;/a&gt;). I'd love to know what you think of the ideas presented.&lt;br /&gt;  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;   Finally, I want to point out that I'm not a database researcher nor an industry analyst. My opinion is merely that of a casual observer, albeit an observer with a vested interest. 
I hope you will do me the kindness of pointing out the flaws in my arguments in the comments.&lt;br /&gt;  &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-5256213704823320091?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/5256213704823320091/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2009/07/why-gpus-matter-for-dwbi.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/5256213704823320091'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/5256213704823320091'/><link rel='alternate' type='text/html' href='http://joeharris76.blogspot.com/2009/07/why-gpus-matter-for-dwbi.html' title='Why GPUs matter for DW/BI'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-26231362268276904</id><published>2009-07-06T23:15:00.000-07:00</published><updated>2009-07-07T02:49:50.273-07:00</updated><title type='text'>Useful benchmarks vs human nature. A final thought on the TPC-H dust-up.</title><content type='html'>There was a considerable flap recently on Twitter and in the blogosphere about TPC-H in general. It was all triggered by the new benchmark submitted by ParAccel in the 30TB class. 
You can relive the gory details on Curt Monash's DBMS2 site here (&lt;a href="http://tr.im/rbCe"&gt;http://tr.im/rbCe&lt;/a&gt;), if you're interested. &lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I stayed out of the discussion because I'm kind of burned out on benchmarks in general. I got fired up about benchmarks a while ago and even sent an email with some proposals to Curt. He was kind enough to respond and his response can be summed up as "What's in it for the DB vendor?". Great question and, to be honest, not one I could find a good answer for. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For the database buyer, a perfect benchmark tells them which database has the best mix of cost and performance, especially in data warehousing. This is what TPC-H appears to offer (leaving aside the calculation of their metrics). However, a lot of vendors have not submitted a benchmark. It's interesting to note that vendors such as Teradata, Netezza and Vertica are TPC members but have no benchmarks. The question is why not.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For a database vendor, a perfect benchmark is a benchmark that they can win. Curt has referred to Oracle's reputed policy of WAR (win all reviews). This is why their licenses specifically prohibit you from publishing benchmarks. There is simply no upside to being 3rd, 5th or anything but first in a benchmark. If Oracle are participating in a given benchmark the simple economic reality is that they know they can win it. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This is the very nature of the TPC-H: it is designed to be very elastic and to allow vendors wiggle room so that they can submit winning figures. I'm sure the TPC folks would disagree on principle but TPC is an industry group made up of vendors. Anything that denies them this wiggle room will either be vetoed or get even less participation than we currently see. 
&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This is a bitter pill to swallow but seems unlikely to change. These days I'm delivering identical solutions across Teradata, Netezza, Oracle and SQL Server. I have some very well formed thoughts on the relative cost and performance of these databases but of course I can't actually publish any data. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;By the way, the benchmark I suggested to Curt was about reducing the hardware variables. Get a hardware vendor to stand up a few common configurations (mid-size SMP using a SAN, 12 server cluster using local storage, etc.) at a few storage levels (1TB, 10TB, 100TB) and then test each database using identical hardware. The metrics would be things like max user data, aggregate performance, concurrent simple queries, concurrent complex queries, etc. Basically trying to isolate the performance elements that are driven by the database software and establish some approximate performance boundaries. With many more metrics being produced there can be a lot more winners. 
Maybe the TPC should look into it…&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-26231362268276904?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/26231362268276904/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2009/07/useful-benchmarks-vs-human-nature-final.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/26231362268276904'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/26231362268276904'/><link rel='alternate' type='text/html' href='http://joeharris76.blogspot.com/2009/07/useful-benchmarks-vs-human-nature-final.html' title='Useful benchmarks vs human nature. A final thought on the TPC-H dust-up.'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-5312099259979052508</id><published>2009-07-05T14:45:00.000-07:00</published><updated>2009-07-06T13:06:29.156-07:00</updated><title type='text'>The future of BI? It has nothing to do with business…</title><content type='html'>I've been reading and re-reading Stuart Sutherland's excellent book Irrationality for several weeks (review to come - promise). One of the things he talks about is "making the wrong connections". 
His point is that humans can't mentally evaluate evidence and make connections. We focus on the elements that are unusual or different and we massively overvalue our initial guesses. &lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;That really resonates with me. After all, that's what Business Intelligence is about, right? We provide factual, numeric, and clean data in a format that allows the user to make reasonable, rational decisions. We lambast the BI nay-sayers who operate on "gut instinct" and rightly so. But we leave that hyper-rational approach at the office door and conduct the rest of our lives in our normal irrational way. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In truth we conduct 95% of our working lives that way as well. The minute-to-minute stuff that business is *really* made of is unrecorded, unanalysed and (of course) irrational. All those conversations, relationships, emails, phone calls and meaningful looks are dealt with by instinct.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;Outside the office we're seeing an explosion in personal monitoring and self-surveillance. Devices like the iPhone can track every interaction; accessories like Nike+ allow us to track every step we take; software like RescueTime continuously monitors our computer usage. Even Facebook is a way to monitor your relationships, something that seemed completely intangible a few years ago. Etc, etc, etc.&lt;div&gt;&lt;br /&gt;&lt;div&gt;This is the future of BI: Rational Augmentation. Using tracking data to make faster, better and more rational decisions about everything in our lives. It's about dealing with huge volumes of hyper-personal data and finding the patterns that matter. It lives outside the office and outside the corporation. 
It's a dash of text-mining, a pinch of regression, a dollop of aggregation, a spoonful of advanced analytics and a heap of basic statistics.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Many people will feel uncomfortable about this but the young will adopt it without question and those who adopt it will do better. Let's face it, it's a sub-optimal world out there and an edge in rationality could be a very big edge indeed. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;As a final thought, this has the makings of a classic innovator's dilemma for the current BI players. Rational Augmentation (I'm loving this phrase but call it what you like…) is going to need to deal with large data volumes very cheaply and very locally. It will probably be service-based. It will probably be free for at least some users. But ultimately it will be a huge market, dwarfing the current BI market. The current players may have the skills to take this on but they've been swallowed by the corporate quicksand and they will sit and watch it pass them by. 
C'est la vie.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-5312099259979052508?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/5312099259979052508/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2009/07/future-of-bi-it-has-nothing-to-do-with.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/5312099259979052508'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/5312099259979052508'/><link rel='alternate' type='text/html' href='http://joeharris76.blogspot.com/2009/07/future-of-bi-it-has-nothing-to-do-with.html' title='The future of BI? It has nothing to do with business…'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-8128209908863846220</id><published>2009-05-20T07:49:00.000-07:00</published><updated>2009-05-20T08:30:48.925-07:00</updated><title type='text'>How to Fix the Newspaper Industry - everybody else is doing it…</title><content type='html'>&lt;i&gt;NOTE: Don't expect me to be doing multiple posts per day. I don't know what's come over me!&lt;/i&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Everyone seems to agree that newspapers are dead. 
Even here in the UK they're not doing great, although our papers seem to 'get' the web a lot more. One of the things that I hear quite a bit from the pundits is that they should make the physical paper free as well as the online version.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I was just reading a post on Tim Ferriss' blog about Alan Webber and his "RULE #24 - If you want to change the game, change the economics of how the game is played." In it he mentions the free paper theory.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This triggered a thought for me that &lt;b&gt;giving the paper away is nowhere near a bold enough strategy&lt;/b&gt;. The problem with the paper is not that it costs too much (except on Sundays - £2! who are they kidding?). For a lot of people, especially the core newspaper market, the cost is not an issue. The issue is having to go get the damn thing, cart it around all day and then filter through the ads just to find a few interesting tidbits.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;So here is my "fix": force people to take the paper. &lt;b&gt;Stick it through *everyone's* mailbox every single day.&lt;/b&gt; Become *the* alternative delivery provider.  I haven't bought a paper in ages but I can *guarantee* that if it came through my door I would look at it.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In the UK (and most of Europe) we have fairly strong opt-out regulation against so-called junk mail. However, there is a huge loophole called the "door drop". Marketers are still allowed to put whatever they want through all of the doors in a given area. This allows a lot of room for targeting. Millionaires all live in the same neighborhood, right? There is a big business around this. When I was involved (~2 years ago) it cost about £0.05 per door. Now I get 3 or 4 drops a week, about 20 pieces in total. Hmm… at £0.05 a piece, that sounds like £1 of revenue per house per week, minus delivery costs. 
Seems workable.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now you wouldn't want to push your paper on literally everyone. You would target the exact slice of the population that already reads you. Plus, your economics are now much more predictable. You know exactly how many papers to print and you can streamline your distribution arm. In fact, you'd want to buy or partner with someone like DHL or TNT who are already doing alternative deliveries. You also need to get your deliveries done *very* early to catch the commuters.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This is a winner-takes-all play. There is only room for a handful of players in a market like this. Once readers have your paper in their hands, why would they buy a competing paper? If you get it right it should pay back in spades. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I don't really see anyone brave enough to make the switch right now. But they'll get more adventurous (desperate) as time goes on and profits dwindle. 
&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Perhaps TNT should think about buying a newspaper group to beef up the delivery pipeline…&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-8128209908863846220?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/8128209908863846220/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2009/05/how-to-fix-newspaper-industry-everybody.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/8128209908863846220'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/8128209908863846220'/><link rel='alternate' type='text/html' href='http://joeharris76.blogspot.com/2009/05/how-to-fix-newspaper-industry-everybody.html' title='How to Fix the Newspaper Industry - everybody else is doing it…'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-4235997057078477783</id><published>2009-05-20T01:56:00.000-07:00</published><updated>2009-05-21T01:57:24.766-07:00</updated><title type='text'>How To Fix Twitter - it came to me in the shower…</title><content type='html'>&lt;p&gt;&lt;span class="Apple-style-span" style="font-style: italic; "&gt;UPDATE: One Sentence Summary - It's possible to know in advance who will 
need to receive messages and therefore to structure the Twitter application and tweet data in a way that makes it much faster to deliver them.&lt;/span&gt;&lt;/p&gt;&lt;p&gt;So I got to thinking about Twitter and the ongoing problems they have keeping the service up and running smoothly. This line of thought was triggered by Twitter removing the ability to see all @ replies. This follows a long history of removing features to "streamline" the service (Google it if you care).&lt;/p&gt;&lt;div&gt;It's worth remembering that Twitter started out as a 'plain vanilla' Ruby on Rails app. Which is great, 'cuz RoR is great. But it means that Twitter was conceived as a database-backed, single-instance app. There are tons of articles out there about the architecture you need to scale such an app. Some of them were written by Twitter people who have since been ejected. (Again, Google it if you care.)&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt; The other thing to remember is that Twitter are only keeping a few weeks of tweets online (6-8 at last reporting). This may be a practical measure but it's also insane! There is huge value in all those old tweets. I suspect they are doing this to limit the size of their databases. Which is a clue that they are still using a database (or probably a number of sharded databases) as the back-end.&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt; Here's the thing though: Twitter is not a database app. It's a messaging platform. This is not an insight but it is important. We (the IT industry) know how to run messaging platforms at scale. We know how to run huge email services. We know how to run huge IM platforms. We know how to run huge IRC instances.&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt; Of course Twitter is not exactly like any of those things. It's an asynchronous, asymmetric, instant micro-message stream. It's asynchronous because messages are simply pushed out (like email). 
It's asymmetric because there is no way to guarantee or confirm receipt (like IRC). But it's the instant streaming aspect that is key. That's what makes the experience unique.&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt; My "fix" is based on the following observation: Twitter usage forms naturally into cliques. My wife tried out Twitter and found it boring. She didn't find a tribe that she connected with. I, on the other hand, love it because I can talk trash about Bikes, Business Intelligence and Data Warehousing all day long. What could be better?&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt; Here's the architecture:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;    &lt;li&gt;Load all of the data into a huge data warehouse (MPP of course!).&lt;br /&gt;  &lt;/li&gt;&lt;li&gt;Cluster users into their natural cliques using data mining algorithms.&lt;br /&gt;  &lt;/li&gt;&lt;li&gt;The cliques I follow might be:&lt;br /&gt;  &lt;/li&gt;&lt;li style="list-style: none"&gt;&lt;ul&gt;&lt;li&gt;BI-DW (~2,000)&lt;br /&gt;    &lt;/li&gt;&lt;li&gt;UK Mountain Biking (~1,000)&lt;br /&gt;    &lt;/li&gt;&lt;li&gt;Web 2.0 (~5,000)&lt;br /&gt;    &lt;/li&gt;&lt;li&gt;Twitter Celebs (~1,000)&lt;br /&gt;    &lt;/li&gt;&lt;li&gt;&lt;i&gt;Of course cliques wouldn't really have names…&lt;/i&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;i&gt;&lt;i&gt;    &lt;li&gt;&lt;span class="Apple-style-span" style="font-style: normal;"&gt;The backend database only contains users info, not tweets.&lt;br /&gt;&lt;/span&gt;    &lt;ul&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-style: normal;"&gt;Following, Followers, Clique memberships, Bloom filter of following, etc.&lt;br /&gt;&lt;/span&gt;    &lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-style: normal;"&gt;Tweets are stored in "clique streams": all tweets for a clique in reverse order.&lt;br /&gt;&lt;/span&gt;    &lt;/li&gt;&lt;li style="list-style: none"&gt;&lt;ul&gt;&lt;li&gt;&lt;span 
class="Apple-style-span" style="font-style: normal;"&gt;New tweets are added to the top/front of the stream.&lt;br /&gt;&lt;/span&gt;      &lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-style: normal;"&gt;Tweets can exist in multiple streams as required.&lt;br /&gt;&lt;/span&gt;      &lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-style: normal;"&gt;Streams have a maximum message age.&lt;br /&gt;&lt;/span&gt;      &lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-style: normal;"&gt;To provide an update the system only has to filter a small number of streams.&lt;br /&gt;&lt;/span&gt;    &lt;/li&gt;&lt;li style="list-style: none"&gt;&lt;ul&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-style: normal;"&gt;This has got to be a 1000x reduction. (60m users to 60k possibles)&lt;br /&gt;&lt;/span&gt;      &lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-style: normal;"&gt;The system stores a bloom filter of people a user follows as the first filter for streams.&lt;br /&gt;&lt;/span&gt;    &lt;/li&gt;&lt;li style="list-style: none"&gt;&lt;ul&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-style: normal;"&gt;Probably another 10x reduction, removes bulk of non-following clique messages.&lt;br /&gt;&lt;/span&gt;      &lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-style: normal;"&gt;The detailed filter should now be running over a very small dataset.&lt;br /&gt;&lt;/span&gt;    &lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-style: normal;"&gt;Final step is to combine the filtered streams and remove any duplicates.&lt;/span&gt;&lt;/li&gt;&lt;/i&gt;&lt;/i&gt;&lt;/ul&gt;&lt;/div&gt;&lt;i&gt;&lt;i&gt;&lt;div&gt;&lt;ul&gt;    &lt;li&gt;&lt;span class="Apple-style-span" style="font-style: normal;"&gt;It should go without saying that all tweets are added to the data warehouse in real 
time. ;-)&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-style: normal;"&gt;This also answers the question of how Twitter can make money: &lt;/span&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-style: normal;"&gt;sell access to the data in that killer data warehouse.&lt;/span&gt;&lt;/b&gt;&lt;/li&gt;   &lt;/ul&gt;{I have refrained from naming any specific technologies or products in the post because that's not really what it's about. Very restrained of me, don't you think?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I also haven't talked about DMs, mentions, etc. because I think that they can easily fit in this architecture and this post doesn't need to be any longer.}&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;UPDATE 2: This approach also makes it a lot easier to spot spam accounts. Someone may *actually* want to follow 4,000 people but they will only be in a few cliques. A spam account would be following too many different cliques.&lt;/div&gt; &lt;/i&gt;&lt;/i&gt;&lt;div&gt;&lt;i&gt;&lt;br /&gt;&lt;/i&gt;&lt;/div&gt;&lt;div&gt;&lt;i&gt;&lt;br /&gt;&lt;/i&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-4235997057078477783?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/4235997057078477783/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2009/05/how-to-fix-twitter-it-came-to-me-in.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/4235997057078477783'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/4235997057078477783'/><link rel='alternate' 
type='text/html' href='http://joeharris76.blogspot.com/2009/05/how-to-fix-twitter-it-came-to-me-in.html' title='How To Fix Twitter - it came to me in the shower…'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-4088704972079357082</id><published>2009-04-29T02:39:00.000-07:00</published><updated>2010-02-05T02:43:14.637-08:00</updated><title type='text'>bokeh</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_W0USePhdhsc/S2v1dUJhCFI/AAAAAAAAFXk/0D2kJz7-02k/s1600-h/bokeh_light.jpg"&gt;&lt;img style="cursor:pointer; cursor:hand;width: 400px; height: 250px;" src="http://4.bp.blogspot.com/_W0USePhdhsc/S2v1dUJhCFI/AAAAAAAAFXk/0D2kJz7-02k/s400/bokeh_light.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5434707259326269522" /&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-4088704972079357082?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/4088704972079357082/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2010/02/bokeh.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/4088704972079357082'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/4088704972079357082'/><link rel='alternate' 
type='text/html' href='http://joeharris76.blogspot.com/2010/02/bokeh.html' title='bokeh'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_W0USePhdhsc/S2v1dUJhCFI/AAAAAAAAFXk/0D2kJz7-02k/s72-c/bokeh_light.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-8584446399620681205</id><published>2009-04-29T01:47:00.000-07:00</published><updated>2009-04-29T01:52:09.601-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='about'/><category scheme='http://www.blogger.com/atom/ns#' term='introduction'/><title type='text'>Let the macro-blogging begin...</title><content type='html'>I'm setting this blog up as a place to put thoughts that don't fit into Twitter's 140 character limit. &lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I've made a couple of abortive blog starts in the past so… no promises!  I'll also be putting up some essays that I've written in the past, probably reworked to save embarrassment. 
&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-8584446399620681205?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/8584446399620681205/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2009/04/let-macro-blogging-begin.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/8584446399620681205'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/8584446399620681205'/><link rel='alternate' type='text/html' href='http://joeharris76.blogspot.com/2009/04/let-macro-blogging-begin.html' title='Let the macro-blogging begin...'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-3178223324545025323</id><published>2008-02-29T11:01:00.000-08:00</published><updated>2011-04-14T13:03:29.242-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='old blog'/><category scheme='http://www.blogger.com/atom/ns#' term='off topic'/><title type='text'>Data can never be perfect...</title><content type='html'>[This was originally posted on an old blog: &lt;a href="http://perfectinfo.wordpress.com/2008/02/29/data-can-never-be-perfect/"&gt;Data can never be perfect...&lt;/a&gt;&amp;nbsp;]&lt;br /&gt;&lt;br /&gt;This is a re-post of a comment I made on&amp;nbsp;Doug Henschen's article 
about&amp;nbsp;&lt;a href="http://www.intelligententerprise.com/blog/archives/2008/02/is_poor_data_go.html#community"&gt;data governance and the subprime crisis.&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Doug's position (quoting from a lot of data governance vendors) is: "A first step toward avoiding such calamities... is an integrated, overarching data governance program that addresses data security, data privacy &lt;em&gt;and&lt;/em&gt; data quality so that risks can be better understood and outcomes anticipated."&lt;br /&gt;&lt;br /&gt;Basically, if the banks had better data they would have made better decisions and not got themselves into this mess.&lt;br /&gt;&lt;br /&gt;The problem is not a lack of governance but an unshakeable belief in the data and risk models. Interested readers should look at &lt;a href="http://www.portfolio.com/views/blogs/market-movers/2008/02/19/did-black-scholes-cause-the-housing-bubble"&gt;"Did Black-Scholes Cause the Housing Bubble?"&lt;/a&gt; in Portfolio.&lt;br /&gt;&lt;br /&gt;I'm a data guy, but every executive needs to understand that data is merely a map and &lt;strong&gt;the map is not the territory&lt;/strong&gt;. If an explorer has a map that does not match the territory they can see, they would do well to question the map, rather than ask the territory to change.&lt;br /&gt;&lt;br /&gt;The credit score is simply another map. There is evidence that credit scores were significantly weakened by new financial products over the last 7 years. Again, see &lt;a href="http://www.businessweek.com/magazine/content/08_07/b4071038384407.htm?chan=rss_topStories_ssi_5"&gt;"Credit Scores: Not-So-Magic Numbers"&lt;/a&gt; for details.&lt;br /&gt;&lt;br /&gt;Data quality, data governance, etc. are all &lt;strong&gt;**super**&lt;/strong&gt; important. However, as data professionals we need to build systems that incorporate common sense, human-based checks and balances. 
Trusting too much in software will eventually get you fired or indicted for criminal negligence.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-3178223324545025323?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/3178223324545025323/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2008/02/data-can-never-be-perfect.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/3178223324545025323'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/3178223324545025323'/><link rel='alternate' type='text/html' href='http://joeharris76.blogspot.com/2008/02/data-can-never-be-perfect.html' title='Data can never be perfect...'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-1204949867618594837</id><published>2008-01-18T11:26:00.000-08:00</published><updated>2011-04-14T13:02:29.485-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='old blog'/><category scheme='http://www.blogger.com/atom/ns#' term='off topic'/><title type='text'>Social Networking: The new Rock n' Roll</title><content type='html'>[ This was originally posted on an old blog: &lt;a href="http://perfectinfo.wordpress.com/2008/01/18/social-networking-the-new-rock-n-roll/"&gt;Social Networking: The new Rock n' 
Roll&lt;/a&gt; ]&lt;br /&gt;&lt;br /&gt;In the 1950s Rock n' Roll swept the (western) world and created a youth culture impact that still reverberates today. It also created a complete break between the "kids" who loved it and the older generation who just didn't get it. Some said Rock n' Roll was undermining the moral fiber of the nation, some just thought it was a bunch of noise.  The point is that music and culture were fundamentally changed by Rock n' Roll and everyone over a certain age was simply left behind. All of their objections and concerns simply became irrelevant.  The kids just didn't care and they quickly found that they could define the world on their own terms.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;Social Networking is the new Rock n' Roll&lt;/strong&gt;: it creates a complete break between generations.&lt;br /&gt;&lt;br /&gt;In our era Social Networking (Facebook, MySpace, Bebo, et al) is going to generate another break.  Our culture will be fundamentally changed and a new generation will re-define how the world works.  You don't have to look far (or listen long) to hear complaints about lost productivity in the workplace, moral decline, "infantile" behaviour and kids not understanding the "real world".  Well, once again, the kids just don't care.  They grew up with IM and email.  Everyone they know or care to know is online all the time and they expect to be there too.&lt;br /&gt;&lt;br /&gt;I suppose that some of you who are 30+ (like me) will be thinking "I get it, I'm there".  I congratulate you for being so hip, but the truth is that you don't/can't really get it.  No matter how much you try to engage with this new paradigm it will always be an effort.  Some of your friends just won't participate. Those that do won't be totally open, or totally engaged.  
You may feel like you're joining in but it will never be the same.&lt;br /&gt;&lt;br /&gt;We're like old jazz fans who appreciate Rock n' Roll for its roots in Jazz but really we long for something more nuanced and sophisticated.  Roll on you cool cats. &lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-1204949867618594837?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/1204949867618594837/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2011/04/social-networking-new-rock-n-roll.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/1204949867618594837'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/1204949867618594837'/><link rel='alternate' type='text/html' href='http://joeharris76.blogspot.com/2011/04/social-networking-new-rock-n-roll.html' title='Social Networking: The new Rock n&apos; Roll'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-7741814437636748379</id><published>2008-01-18T10:26:00.000-08:00</published><updated>2011-04-14T13:03:49.939-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='old blog'/><category scheme='http://www.blogger.com/atom/ns#' term='off topic'/><title type='text'>Blog Interrupted: It's been a while...</title><content 
type='html'>[This was originally posted on an old blog: &lt;a href="http://perfectinfo.wordpress.com/2008/01/18/blog-interrupt…s-been-a-while"&gt;Blog Interrupted: It's been a while...&lt;/a&gt; ]&lt;br /&gt;&lt;br /&gt;Well it's been almost a year and a half since my last post.  You would have been right to assume that this blog was dead but, just like Lazarus, it's alive again. Resurrected and better than ever.&lt;br /&gt;&lt;br /&gt;I've been out of the Business Intelligence / Data Warehouse arena working in Database Marketing.  It's definitely given me an extra perspective on data warehousing and the value of data for its own sake.  The first lesson was that it's not polite to laugh at people when they call a million row table "big";-).  The second lesson was that sub-queries in SQL Server do not work and will never return.&lt;br /&gt;&lt;br /&gt;Anyway, I'm back in the industry working as a Data Warehouse Designer / Architect so I'll be using this blog as a place to crystallise my thoughts on how data warehousing should be done. 
I also love to research new products in this space so I'll be putting company and product summaries here as well with links to loads of useful resources.&lt;br /&gt;&lt;br /&gt;Enjoy.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-7741814437636748379?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/7741814437636748379/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2008/01/blog-interrupted-its-been-while.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/7741814437636748379'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/7741814437636748379'/><link rel='alternate' type='text/html' href='http://joeharris76.blogspot.com/2008/01/blog-interrupted-its-been-while.html' title='Blog Interrupted: It&apos;s been a while...'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-546408276067977276</id><published>2006-08-25T09:21:00.000-07:00</published><updated>2011-04-14T13:08:14.817-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='old blog'/><title type='text'>Why is Business Intelligence innovation so slow?</title><content type='html'>[ This was originally posted on an old blog: &lt;a 
href="http://perfectinfo.wordpress.com/2006/08/25/why-is-busines…vation-so-slow/"&gt;Why is Business Intelligence innovation so slow?&lt;/a&gt; ]&lt;br /&gt;&lt;br /&gt;You might have gathered that I'm a big fan of Business Intelligence. The potential that it has to impact a business is incredible (and largely unrealized - but that's another post).&lt;br /&gt;&lt;br /&gt;I'm also a big fan of innovation. I had subscriptions to Popular Science and Omni as a kid (can you tell?) and now I get my fix for innovation thinking from the &lt;a href="http://www.techtrend.com/blog/" title="Killer Innovations homepage"&gt;Killer Innovations&lt;/a&gt; podcast.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;The point is that I've been wondering why the pace of innovation is slow in Business Intelligence.&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;One reason could be entrenchment. Classical BI/DW projects have been so costly and so complicated that the solutions, once delivered, have an incredibly high level of entrenchment. After all, the business thinks, now that we've spent so much on this solution we need to squeeze every last bit of value out of it. I see this in action across all the elements of the solution.&lt;br /&gt;&lt;br /&gt;How many BI "shops" do you know that have changed from, say, Oracle to SQL Server or from Cognos to BusinessObjects? I honestly don't know of any significant examples of an established shop switching (maybe a couple of instances of DW appliances being added on top). It seems that once the vendor is in it takes dynamite to get them out again.&lt;br /&gt;&lt;br /&gt;This level of entrenchment is potentially extremely costly. If your vendor wants to double your "support" fees or gouge you on additional licenses then you're pretty much forced to put up with it because you don't have an easy way out.&lt;br /&gt;&lt;br /&gt;Another reason may be time. BI projects take time and lots of it. 
&lt;a href="www.kimballgroup.com" title="Ralph Kimball"&gt;Kimball&lt;/a&gt; has taught us that successful BI projects are incremental and iterative (and big bang projects fail) but this implies a long term vision and the timescale to go with it. My hunch is that most shops aren't that receptive to "new and improved" tech when they're knee deep in the political quagmire half way through a project.&lt;br /&gt;&lt;br /&gt;I could probably come up with more reasons... but (at the risk of repeating myself) I think it comes down to open standards and how thin they are in the BI space.&lt;br /&gt;&lt;br /&gt;Sure we have ODBC, which is brilliant, but many key technologies don't use it, particularly if you're in a dedicated Oracle or SQL Server environment. For example, SQL Server Integration Services doesn't even give you the option to use ODBC against SQL Server databases (AFAIK). Meaning that you have to keep your warehouse on SQL Server or face up to the nightmare of re-writing all your ETL packages.&lt;br /&gt;&lt;br /&gt;My plea to the vendors is this:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;implement open standards in your products  (this will help me trust you)&lt;/li&gt;&lt;li&gt;make migration to your product easy, whatever it takes (or else I'll never change).&lt;/li&gt;&lt;/ul&gt;You (BI vendors) will have to take the risk of being the first to open up. You'll have to give me a good reason to move and proof that I can trust you. After all, most of the businesses who want BI now have it, so your future health depends on converting your competitor's customers into your customers (and surely that's cheaper than just buying your competitors!).&lt;br /&gt;&lt;br /&gt;[ All of this reminds me that I need to do a post on open source BI. 
]&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-546408276067977276?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/546408276067977276/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2006/08/why-is-business-intelligence-innovation.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/546408276067977276'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/546408276067977276'/><link rel='alternate' type='text/html' href='http://joeharris76.blogspot.com/2006/08/why-is-business-intelligence-innovation.html' title='Why is Business Intelligence innovation so slow?'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-7303437640404555547</id><published>2006-08-16T14:34:00.000-07:00</published><updated>2011-04-14T13:07:57.519-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='old blog'/><title type='text'>The enterprise software business model is broken</title><content type='html'>[This was originally posted on an old blog: &lt;a href="http://perfectinfo.wordpress.com/2006/08/16/the-enterprise-software-business-model-is-broken/"&gt;The enterprise software business model is broken&lt;/a&gt; ]&lt;br /&gt;&lt;br /&gt;&lt;a href="http://andyonenterprisesoftware.com" 
title="Andy On Enterprise Software"&gt;Andy&lt;/a&gt; asks whether enterprise software is finished. &lt;br /&gt;&lt;br /&gt;My view from the trenches is that the business model for enterprise software (and hardware…) is broken. They (particularly the BI guys) reportedly spend as much as 90% of their income on sales and marketing.&lt;br /&gt;&lt;br /&gt;The number of sales people that I deal with at these companies is mind numbing. The number of reseller layers that get a cut of my license fees is shocking.&lt;br /&gt;&lt;br /&gt;These companies could learn a huge lesson from the “web2.0” upstarts by shortening their development cycles and moving their engineers out towards their customers. How do they know what needs to be improved, what needs to be simplified or what needs to be added? It certainly isn’t because I’ve told them.&lt;br /&gt;&lt;br /&gt;What I really want is a set of building blocks that stack together nicely, use industry standards, can be interchanged if I’m not happy, can be managed by fewer (possibly better) people and can be consumed easily across networks.&lt;br /&gt;&lt;br /&gt;Is that too much to ask for?&lt;br /&gt;&lt;br /&gt;&lt;acronym title="By The Way"&gt;BTW&lt;/acronym&gt;, my favourite example of getting this right is Netezza. I load my data into their box and my queries instantly run 1000x faster. It uses no proprietary &lt;acronym title="Structured Query Language (a database standard)"&gt;SQL&lt;/acronym&gt;, has no indexes or tuning and I only need 1 DBA. Awesome.&lt;br /&gt;&lt;br /&gt;Future post to come on this topic, but I think the market is looking for a complete solution that is relatively hardware and OS agnostic but does everything else. It also needs to come totally pre-configured with best practice set-ups. 
I'm not holding my breath.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-7303437640404555547?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/7303437640404555547/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2006/08/enterprise-software-business-model-is.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/7303437640404555547'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/7303437640404555547'/><link rel='alternate' type='text/html' href='http://joeharris76.blogspot.com/2006/08/enterprise-software-business-model-is.html' title='The enterprise software business model is broken'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-8270620889463228628</id><published>2006-08-14T22:28:00.000-07:00</published><updated>2011-04-14T13:06:56.875-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='old blog'/><title type='text'>The occupation of an Analyst</title><content type='html'>[ This was originally posted on an old blog: &lt;a href="http://perfectinfo.wordpress.com/2006/08/14/the-occupation-of-an-analyst/"&gt;The occupation of an Analyst&lt;/a&gt; ]&lt;br /&gt;&lt;br /&gt;&lt;i&gt;&lt;span&gt;This is the first instalment of my "book", as explained &lt;a 
href="http://perfectinfo.wordpress.com/2006/08/13/a-book-unwritten/" title="A book unwritten…"&gt;&lt;span&gt;here&lt;/span&gt;&lt;/a&gt;. It’s a work in progress and your feedback would be a great help.&lt;/span&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;&lt;span&gt;&lt;/span&gt;&lt;/b&gt;&lt;span&gt;What is an Analyst? Recently it seems that every second job includes the term analyst.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span&gt;W&lt;/span&gt;&lt;span&gt;hen I was (briefly) &lt;/span&gt;&lt;span&gt;at &lt;a target="_blank" href="http://www.google.co.uk/url?sa=t&amp;amp;ct=res&amp;amp;cd=1&amp;amp;url=http%3A%2F%2Fwww.simons-rock.edu%2F&amp;amp;ei=CtXgRNugKoX-QdGegcYK&amp;amp;sig2=zJPBe7CBivDlKTlAvkDBvg" title="Simon's Rock"&gt;college&lt;/a&gt;, a friend had a mentor who would go on for hours (literally) about how devalued the term doctor had become. His premise was that there were too many people who were "officially" doctors of this or that but had just "cruised" through grad school and didn't deserve the accolade.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Now, 12 years later, I feel very much the same way about the title of Analyst. Everywhere I look in organisation charts I see people who are ostensibly employed as analysts. Sadly very few of these seem to actually do any analysis.&lt;span&gt; Take, for example, a &lt;/span&gt;Network Support Analyst whose job is to correctly route the LAN cabling or a Helpdesk Analyst whose job is to deal with support calls to reset user passwords. Important jobs but not really analysts.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;&lt;span&gt;The occupation of an Analyst is to analyse. 
&lt;/span&gt;&lt;/b&gt;&lt;span&gt;To paraphrase Webster's, &lt;/span&gt;&lt;span&gt;a&lt;/span&gt;&lt;i&gt;&lt;/i&gt;&lt;span&gt;nalysis is the “resolution of anything into its constituent elements” or the creation of a “brief methodical illustration of the principles” of a subject.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;For me, an Analyst seeks to:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;understand the present&lt;br /&gt;&lt;ul&gt;&lt;li&gt;by examining the past&lt;br /&gt;&lt;ul&gt;&lt;li&gt;in order to influence the future.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;For instance a Psychoanalyst seeks to understand the patient’s mental state by examining how the patient’s past life experiences have influenced them so that the patient can improve that state in the future. A Business Analyst seeks to understand a business situation by examining how it has operated in the past so that new processes can be implemented to improve future operations.&lt;br /&gt;&lt;br /&gt;If you can twist what you do into this framework then I salute you as a fellow analyst.&lt;br /&gt;&lt;br /&gt;Now get back to work.&lt;br /&gt;&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-8270620889463228628?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/8270620889463228628/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2006/08/occupation-of-analyst.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/8270620889463228628'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/8270620889463228628'/><link rel='alternate' type='text/html' 
href='http://joeharris76.blogspot.com/2006/08/occupation-of-analyst.html' title='The occupation of an Analyst'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-7648566431903075211</id><published>2006-08-13T23:03:00.000-07:00</published><updated>2011-04-14T13:08:50.606-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='old blog'/><title type='text'>A book unwritten...</title><content type='html'>[ This was originally posted on an old blog: &lt;a href="http://perfectinfo.wordpress.com/2006/08/13/a-book-unwritten/"&gt;A book unwritten...&lt;/a&gt; ]&lt;br /&gt;&lt;br /&gt;About a year ago, I wrote up the outline of a book I want to write about what it means to be an analyst and the difference between a good analyst and most analysts. ;-) You can see it below.&lt;br /&gt;&lt;br /&gt;I want it to be a Tufte style book, very tightly written and driven by good examples. Basically practicing what it preaches.&lt;br /&gt;&lt;br /&gt;Needless to say, it's gone pretty much unwritten, so in a &lt;a href="http://en.wikipedia.org/wiki/Anthony_Robbins" title="Tony Robbins"&gt;Tony Robbins&lt;/a&gt; style of "chunking" it down into manageable parts, I'll try writing it here in very small pieces. 
Let's see how it goes...&lt;br /&gt;&lt;b&gt;The Occupation of an Analyst&lt;/b&gt;&lt;br /&gt;&lt;i&gt;Discerning the Truth from Trends, Lies and Politics.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;The &lt;b&gt;“why”&lt;/b&gt; of analysis.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;A brief history of analytic thought.&lt;/li&gt;&lt;li&gt;The gap between facts and intuition.&lt;/li&gt;&lt;/ul&gt;The &lt;b&gt;“what”&lt;/b&gt; of analysis.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The occupation of an Analyst.&lt;/li&gt;&lt;li&gt;The obligations of an Analyst&lt;/li&gt;&lt;/ul&gt;The &lt;b&gt;“how”&lt;/b&gt; of analysis.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The skills of critical thought.&lt;/li&gt;&lt;li&gt;The skills of analytical calculation.&lt;/li&gt;&lt;li&gt;The skills of meaningful presentation.&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-7648566431903075211?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/7648566431903075211/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2006/08/book-unwritten.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/7648566431903075211'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/7648566431903075211'/><link rel='alternate' type='text/html' href='http://joeharris76.blogspot.com/2006/08/book-unwritten.html' title='A book unwritten...'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7423363075024286359.post-7633448202786574006</id><published>2006-08-11T20:11:00.000-07:00</published><updated>2011-04-14T13:09:11.506-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='old blog'/><title type='text'>An Introduction to the Perfect Information blog</title><content type='html'>[ This was originally posted on an old blog: &lt;a href="http://perfectinfo.wordpress.com/2006/08/11/an-introduction-to-the-perfect-information-blog/"&gt;An Introduction to the Perfect Information blog&lt;/a&gt; ]&lt;br /&gt;&lt;br /&gt;So the title is "Perfect Information"... Why? Perfect information means knowing &lt;strong&gt;everything&lt;/strong&gt; that relates to a decision.&lt;br /&gt;&lt;br /&gt;In economics (stick with me...), perfect competition can only exist when all parties are rational and have &lt;a title="perfect information" href="http://en.wikipedia.org/wiki/Perfect_information" target="_blank"&gt;perfect information&lt;/a&gt; about their available alternatives. 
Of course, in the real world this never happens (outside of &lt;a title="Chess" href="http://en.wikipedia.org/wiki/Chess"&gt;chess&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;This blog is called Perfect Information because:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;I &lt;strong&gt;love&lt;/strong&gt; it when information comes a little bit closer to being perfect.&lt;/li&gt;&lt;li&gt;I love good &lt;a title="Edward Tufte" href="http://www.edwardtufte.com/"&gt;information design&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;I love the way the internet is helping us come closer to having perfect information.&lt;/li&gt;&lt;li&gt;I work in &lt;a title="Business Intelligence" href="http://en.wikipedia.org/wiki/Business_Intelligence"&gt;Business Intelligence&lt;/a&gt; and spend way too much time thinking about this stuff.&lt;/li&gt;&lt;/ul&gt;Thanks for visiting, y'all come back now!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7423363075024286359-7633448202786574006?l=joeharris76.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://joeharris76.blogspot.com/feeds/7633448202786574006/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://joeharris76.blogspot.com/2006/08/introduction-to-perfect-information.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/7633448202786574006'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7423363075024286359/posts/default/7633448202786574006'/><link rel='alternate' type='text/html' href='http://joeharris76.blogspot.com/2006/08/introduction-to-perfect-information.html' title='An Introduction to the Perfect Information blog'/><author><name>Joe Harris</name><uri>http://www.blogger.com/profile/09242541409318280541</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/-fy-rAbK1v74/TkJNCVXP6aI/AAAAAAAAFhE/8lcMjYCvORY/s220/new_profile_square.jpg'/></author><thr:total>0</thr:total></entry></feed>
