Thursday, 8 March 2012

How To Copy A Standby Database in SQL Server

I ran across a SQL Server problem today where my Standby database was detached and would not re-attach. This problem originally arose because I wanted to make a development copy of log shipping standby database with restarting the log chain from a full backup.

I couldn’t find a complete solution online so I thought I’d share my solution here. A tip ‘o the hat to Paul Randal for his “Hack attach a damaged database” post which got me going in the right direction.

The Problem

I have a log shipping setup whereby the reporting database is kept in Standby mode. I wanted to make an up-to-date copy for a developer to use. So I (foolishly as it happens) tried to simply detach the standby database and make a copy of it.
When I finished copying it and tried to re-attach it I got the error “Cannot attach a database that was being restored.” Uh-oh, this database is on the end of a very long log shipping chain and, although I haven’t really lost anything, rebuilding the chain is going to cost me a lot of time.
Plus, I really need to make new development copies of the standby database quickly and often. This is not a one time process.

How To Copy A Log Shipping Standby

This may not be ‘the right way’ but it works and forget about detaching the DB that is definitely the wrong way to go, although you can recover from it using a variant of this process.
  1. Take the standby DB offline (SQL)
    ALTER DATABASE MyStandbyDb SET OFFLINE;
  2. Create a ‘shell’ DB that will become the new target DB (SQL)
    NOTE: that the internal ‘name’ values must match the Standby DB
    CREATE DATABASE MyNewDevDb ON PRIMARY
    ( NAME = 'MyLiveDb', --!!! Must match the Live & Standby name
    FILENAME = 'C:\My\Data\Path\MyNewDevDb.mdf' ,
    SIZE = 3072KB , FILEGROWTH = 1024KB )
    LOG ON
    ( NAME = 'MyLiveDb_log', --!!! Must match the Live & Standby name
    FILENAME = 'C:\My\Data\Path\MyNewDevDb_log.ldf' ,
    SIZE = 1024KB , FILEGROWTH = 10%) ;
  3. (OPTIONAL) Apply TDE encryption to the target DB (SQL)
    ALTER DATABASE MyNewDevDb SET ENCRYPTION ON;
  4. Backup the target DB (SQL)
    BACKUP DATABASE MyNewDevDb to disk = 'C:\My\Backup\Path\shell.bak' WITH FORMAT;
  5. Restore the target backup in standby mode (SQL) NOTE: The shell DB must match the state of the DB we want to replace it with
    RESTORE DATABASE MyNewDevDb
    FROM DISK = N'C:\My\Backup\Path\shell.bak'
    WITH FILE = 1
    ,MOVE 'MyLiveDb' TO 'C:\My\Data\Path\MyNewDevDb.mdf'
    ,MOVE 'MyLiveDb_log' TO 'C:\My\Data\Path\MyNewDevDb_log.ldf'
    ,STANDBY = 'C:\My\Backup\Path\ROLLBACK_UNDO_MyNewDevDb.BAK'
    ,NOUNLOAD , STATS = 10 ;
  6. Take the target DB offline (SQL)
    ALTER DATABASE MyNewDevDb SET OFFLINE;
  7. Rename (or delete) the target data and log files (CMD)
    RENAME "C:\My\Data\Path\MyNewDevDb.mdf" "DeleteThis.mdf"
    RENAME "C:\My\Data\Path\MyNewDevDb_log.ldf" "DeleteThis_log.ldf"
  8. Copy standby DB’s data and log files, using target’s filenames (CMD)
    COPY "C:\My\Data\Path\MyStandbyDb.mdf" "C:\My\Data\Path\MyNewDevDb.mdf"
    COPY "C:\My\Data\Path\MyStandbyDb_log.ldf" "C:\My\Data\Path\MyNewDevDb_log.ldf"
  9. Take the target DB online (SQL)
    ALTER DATABASE MyNewDevDb SET ONLINE;
  10. Restore the target DB for normal usage (SQL)
    RESTORE DATABASE MyNewDevDb WITH RECOVERY;
  11. Take the standby DB back online (SQL)
    ALTER DATABASE MyNewDevDb SET ONLINE;
  12. Resume log shipping!

Recovering a detached standby DB

If you need to attach a detached SQL Server standby DB you can do so use a slight variation on the steps above. Simply omit steps 1, 8 and 11 and treat the database you want to attach as the target in the other steps. Voila! :-)
I hope this helps someone. Let me know if you find any issues with the steps.

Friday, 8 April 2011

Data Management - Data is Data is Data is…

[Sometimes I want to write about one topic but I end up writing 1,500 words of background before I even touch on the subject at hand. Sometimes the background turns out to be more interesting; hopefully this is one of those times.]

In this post I talk about the problems with mainstream data management, especially SQL databases. I then touch on the advantages of SQL databases and the good attributes we need to retain.

Data is Data is Data is…
Current IT practice splits data management into lots of niches: SQL databases, Email platforms, Network file systems, enterprise search, etc. There is plenty of overlap between niches and, in truth, the separations are artificial. It merely reflects the way systems are implemented, not fundamental data differences. Have a look at your email client; see those headers in the messages list (From, Subject, etc) they're just database field names and the message body is simply a BLOB field. Some email clients, e.g., Gmail, can also parse that blob and find links to previous messages, which is very much like a foreign key link.

File systems seem less like a database at first glance but let's consider the big file system developments of the last 10 years ZFS and BTRFS. Both of these introduce database-like ideas to the file system such as copy-on-write (a la MVCC), deduplication (a la normalisation), data integrity guarantees (a la ACID) and enhanced file metadata (a la SQL DDL).

The basic point I'm making is that data is data. Simple as that. It may be more or less 'structured' but structure and meaning are essentially equivalent. The most 'unstructured' file I can imagine is just plain text but the written word is still very structured. At a high level it has a lot of metadata (name, created, changed, size, etc.), it has structure embedded in the text itself (language, punctuation, words used, etc.) and, looking deeper, we can analyse the semantic content of the text using techniques like NLP.

Data is data; it needs to be stored, changed, versioned, retrieved, backed up, restored, searched, indexed, etc. The methods may vary but it's all just data.


The SQL Database Black Box
All data cannot be kept in databases because, amongst other things, SQL databases are opaque to other applications. Enterprise search illustrates the issue. Most enterprise search apps can look into JDBC/ODBC accessible databases, profile the data and include its content in search results. However, access to any given database is typically highly restricted and there is a DBA whose job hangs on keeping that data safe and secure. The DBA must be convinced that the search system will not compromise the security of his data and this typically means limiting search access to the people who also have database access. This is a time consuming process and we have to repeat it for every database in the company.

So a year later, when we have access to all SQL databases and a process to mirror access credentials, the next problem is that SQL provides no mechanism to trace data history. For example, I search for 'John Doe' and find a result from the CRM database. I look in the database and the record now has a name of 'Jane Doe'. Why did it change? When did it change? Who changed it? There is no baseline answer to these questions. The CRM application may record some of this information but how much? The database has internal mechanisms that trace some of this but each product has its own scheme and, worse, the tables are often not user accessible for security reasons.

In my experience, 80% of the value actually gained from a data warehouse comes from resolving this issue in a single place and in a consistent way. Hence the growth of the MDM industry, but I won't digress on that. The data warehouse doesn't actually solve the problem, it merely limits the number of SQL databases that must be queried to 1. And, of course, we never manage to get everything in the DW.

There are many other black box attributes of SQL databases such as: 2 very similar queries may perform in drastically different ways; background tasks can make the database extremely slow without warning; the database disk format cannot be accessed by other applications; the database  may bypass the filesystem making us entirely reliant on the database to detect disk errors, etc., etc.


The SQL Database Choke Point
Current SQL databases are also a very real constraint on day-to-day operation. For example, a large company may only be able to process bulk updates against a few percent of the customer base each night. SQL databases must be highly tuned towards high performance for single type of  access query and that tuning usually makes other access styles unworkable.

Further the schema of a production SQL database is effectively set in stone. Although SQL provides ALTER statements the performance and risk of using ALTER is so bad that it's never used. Instead we either add a new small table and use a join when we need the additional data, or we create a new table and export the existing data into it. Both of these operations impose significant overheads when all we really want is a new field. So, in practice, production SQL databases satisfy a single type of access, are very resistant to other access patterns and are very difficult to change.

These issues are well recognised and the answer has come back that we need specialist SQL databases for each use case. Michael Stonebraker, in particular, has been beating a drum about this for at least 5 years (and, credit where it's due, Vertica paid off in spades). However, we haven't seen a huge uptake in specialist databases for markets other than analytics. In particular the mainstream OLTP market has very few specialist offerings. Perhaps it's a more difficult problem or perhaps the structure of SQL itself is less amenable to secondary innovation around OLTP. I sense a growing recognition that improvements in the OLTP space require significant re-engineering of existing applications.

Specialist databases have succeeded to some extent in the data warehouse and business intelligence sphere. I think this exception proves the observation. 15 years ago I would add another complaint to my black box attributes: it was impossible to get reports and analysis from my production systems. The data warehouse was invented and gained popular acceptance simply because this was such a serious problem. The great thing about selling analytic databases for the last 15 years was that you weren't displacing a production system. Businesses don't immediately start losing money if the DW goes down. The same cannot be said of most other uses for SQL databases and that's why they will only be replaced slowly and only when there is a compelling reason (mainframes are still around, right?).


There's a baby in this bathwater!
It's worth remembering the SQL databases offer a lot advantages. Codd outlined 12 rules that relational databases should follow. I won't list them all here but at a high level a relational database maintains the absolute integrity of the data it stores and allows us to place constraints on that data, such as the type and length of the data or it's relation to other data. We take it for granted now but this was a real breakthrough and it took years to implement in practice.

Just for kicks imagine a CRM system based on Word docs. When you want to update a customer's information you open their file and make whatever changes you want and then save it. The system only checks that the doc exists, you can change whatever you want and the system won't care. If you want the system to make sure you only change the right things you'll have to build that function yourself. That's more or less what data management was like before SQL databases.


What to keep & what to throw away
So what would our ideal data management platform look like? It persists data in a format that can be freely parsed by other applications, i.e., plain text (XML? JSON? Protocol Buffers? ). It maintains data integrity at an atomic level probably by storing checksums alongside each item. It lets define stored data as strictly or loosely as we want but it enforces the definitions we set. All changes to our stored data actually create new versions and the system keeps a linked history of changes.

I think we're starting to see systems emerge that address some of the issues above. It's still early days but I'm excited about projects like Ceph and the very new Acunu.


In my next post I'll look about how the new breed of NoSQL databases display some of the traits we need for our ideal data management platform.

Wednesday, 6 April 2011

Unsolicited advice for MarkLogic - Pivot!

[This is actually a really long comment on Curt Monash's post but I think it's worth cross posting here.] 


Seeing as I have been doing a lot of thinking about database opportunities lately I'll wade in on MarkLogic's as well. I can't really comment about the specific verticals that MarkLogic sells into or should sell into. However, I see 2 fundamental problems with MarkLogic's product positioning.

First problem; they backed the wrong horse by focusing exclusively on XML and XQuery. This has been toned down a lot but the die is cast. People who know about MarkLogic (not many) know of it as a 'really expensive XML database that you need if you have lots of XML and eXist-db is too slow for you'. They've put themselves into a niche within a niche, kind of like a talkative version of Ab Initio.

This problem is obvious if you compare them to the 'document oriented' NoSQLs such as CouchDB and MongoDB. Admittedly they were created long after MarkLogic but the NoSQLs offer far greater flexibility, talk about XML only as a problem to be dealt with and use a storage format that the market finds more appealing (JSON).

Second problem; 'Enterprise class' pricing is past its sell by date. What does MarkLogic actually cost? You won't find any pricing on the website. I presume that the answer is that old standby 'whatever you're looking to spend'. Again, the contrast with the new NoSQLs couldn't be more stark - they're all either pure open source or open core, e.g., free to start.

MarkLogic was essentially an accumulator bet: 1st bet - XML will flood the enterprise, 2nd bet - organisations will want to persist XML as XML, 3rd bet - an early, high quality XML product will move into an Oracle-like position.

The first bet was a win, XML certainly has flooded the enterprise. The second bet was a loss; XML has become almost a wire protocol rather than a persistence format. Rightly or not, very few organisations choose to persist significant volumes of data in XML. And the third bet was loss as well; the huge growth of open source and the open core model make it extremely unlikely that we'll see another Oracle in the data persistence market.

The new MarkLogic CEO needs to acknowledge that the founding premise of the company has failed and they must pivot the product to find a much larger addressable market. Their underlying technology is probably very good and could certainly be put to use in other ways (Curt gives some examples). I would be tempted to split the company in 2; leaving a small company to continue selling and supporting MarkLogic at maximum margins (making them an acquisition target) and a new company to build a new product in start-up mode on the foundations of the existing tech.

Tuesday, 29 March 2011

Analytic Database Market Opportunities

In my first post in this series I gave an overview of ParStream and their productIn the second post I gave an overview of the Analytic Database Market from my perspective. In the third post I introduced a simple Analytic Database Market Segmentation.

In this post I will look at the gaps in this market and the new opportunities for ParStream and RainStor to introduce differentiated offerings. First, though I'll address the positioning of Hadapt.


Hadapt Positioning
Hadapt have recently come out of stealth and will be offering a very fast 'adaptive' version of Hadoop. Hadapt is a reworked and commercialized version of the Daniel Abadi's HadoopDB project. You can read Curt Monash's overview for more on that.  Basically Hadapt provides a Hadoop compatible interface (buzz phrase alert) and uses standard SQL databases (currently Postgres or VectorWise) underneath instead of HDFS. The unique part of their offering is keeping track of node performance and adapting queries to make the best use of each node. The devil is in the details of course, but a number of questions remain unanswered: How much  of the Hadoop API will be mapped to the database? Will there be a big inflection in performance between logic that maps to the DB and logic that runs in Hadoop? Etc. At a high level Hadapt seems like a very smart play for cloud based Hadoop users. Amazon EC2 instances have notoriously inconsistent I/O performance and a product that works around that should find fertile ground.

RainStor's Current Positioning
RainStor, if you don't know, is a special archival database that features massive compression. They current sell it as an OLDR solution (Online Data Retrieval) primarily aimed at company's that have large data volumes and stringent data retention requirements, e.g., anyone in Financial Services. They promise between 95% (20:1)  and 98% (40:1) compression rates for data whilst remaining fully query-able. Again Curt Monash has the best summary of their offering. I briefly met some RainStor guys a while back and I feel pretty confident that the product delivers what it promises. That said, I have never come across a RainStor client and I talk to lots of Teradata and Netezza types who would be their natural customers. So, though I have no direct knowledge of how they are doing, I suspect that it's been slow going to date and focusing on a different part of the market might be more productive.

Hyper Compressed Hadoop - RainStor Opportunity
I tweeted a while back that "RainStor needs a MapReduce story like yesterday". I still think that's right although now I think they need a Hadoop compatible story. To me, RainStor and Hadoop/MapReduce seem like a great fit. Hadoop users value the ability to process large data volumes over simple speed. Sure, they're happy with speed when they can get it but Hadoop is about processing as much data as possible. RainStor massively compresses databases while keeping them online and fully query-able. If RainStor could bring that compression to Hadoop it would be incredibly valuable. Imagine a Hadoop cluster that's maxed out at 200TB of raw data, compressed using splittable LZO to 50TB and replicated on 200TB of disk. If RainStor (replacing HDFS) could compress that same data at 20:1, half their headline rate, that cluster can now scale out to roughly 2,000TB. And many operations in Hadoop are constrained by disk I/O so if RainStor can operate to some extent on compressed data the cluster might just run faster. Even if it runs slightly slower the potential cost savings are huge (insert your own Amazon EC2 calculation here where you take EC2+S3 spend and divide by 20).

ParStream Opportunities
I see 2 key opportunities for ParStream in the current market. They can co-exist; but may require significant re-engineering of the product. First a bit of background; I'm looking for 'Blue Ocean Strategies' where ParStream can create a temporary monopoly. Selling into the 'MPP Upstart' segment is not considered due to the large number of current competitors. It's interesting to note though that that is where ParStream's current marketing is targeted.

Real-Time Analytic Hadoop - ParStream Opportunity 1
ParStream's first opportunity is to repurpose their technology into a Hadoop compatible offering. Specifically a 'real-time analytic Hadoop' product that uses GPU acceleration to vastly speed up Hadoop processing and opens up the MapReduce concept for many different and untapped use cases.  ParStream claim to have a unique index format and to mix workloads across CPUs and GPUs to minimise response times. It should be possible to use this technology to replace HDFS with their own data layer and indexing. They should also aim to greatly simplify data loading and cluster administration work. Finally transparent SQL access would be a very handy feature for business that want to provide BI directly from their 'analytic Hadoop' infrastructure. In summary: Hadoop's coding flexibility, processing speeds that approach CEP, and Data Warehouse style SQL access for downstream apps.

Target customers: Algorithmic trading companies (as always…), Large-scale online ad networks, M2M communications, IP-based Telcos, etc . Generally businesses with large volumes of data and high inbound data rates who need to make semi-complex decisions quickly and who have a relatively small staff.

Single User Data Warehouses - ParStream Opportunity 2
ParStream's second opportunity is to market ParStream as a single user, desk side data warehouse for analytic professionals, specifically targeting GPU powered workstations (like this one: ~$12k => 4 GPUs [960 cores], 2 Quad core CPUs, 48GB RAM, 3.6TB of fast disk). This version of ParStream must run on Windows (preferably Win7 x64, but Win Server at a minimum). Many  IT departments will balk at having a non-Windows workstation out in the office running on the standard LAN. However they are very used to analysts requesting 'special' powerful hardware. That's why the desk side element is so critical, this strategy is designed to penetrate restrictive centralised IT regimes.

In my experience a handful of users place 90% of the complex query demand on any given data warehouse. They're typically statisticians and operational researchers doing hard boiled analysis and what-if modelling. Many very large businesses have separate SAS environments that this group alone uses but that's a huge investment that many can't afford. Sophisticated analysts are a scarce and expensive resource and many companies can't fill the vacancies they have. A system that improves analyst productivity and ensures their time is well used will justify a significant premium. It also gives the business an excellent retention tool  to retain their most valuable 'quants'.

This opportunity avoids the challenges of selling a large scale GPU system into a business that has never purchased one before and avoids the red ocean approach of selling directly into the competitive MPP upstart segment. However it will be difficult to talk directly to these users inside the larger corporation and, when you convince them they need ParStream; you still have to work up the chain of command to get purchase authority (not the normal direction). On the plus side though these users form a fairly tight community and they will market it themselves if it makes their jobs easier.

Target customers: Biotech/Bioscience start-up companies, University researchers, marketing departments or consultancies. Generally, if a business is running their data warehouse on Oracle or SQL Server, their will be an analytic professional who would give anything to have a very fast database all to themselves.

In my next post I will look at why Hadoop is getting so much press, whether the hype is warranted and, generally, the future shape of the thing we currently call the data warehouse.

Friday, 25 March 2011

Analytic Database Market Segmentation

In my first post in this series I gave an overview of ParStream and their product.


In this post I will briefly talk about how vendors are positioned and introduce a simple market segmentation.

Please note that this is not exhaustive, I've left off numerous vendors with perfectly respectable offerings that I didn't feel I could reasonably place.

The chart above gives you my view of the current Analytic Database market and how the various vendors are positioned. The X axis is log scale going from small data sizes (<100GB) to very large data sizes (~500TB). I have removed the scale because it based purely on my own impressions of common customer data sizes for that vendor based on published case studies and anecdotal information.

The Y axis is a log scale of the number of employees that a vendor's customers have. Employee size is highly variable for a given vendor but nonetheless each vendor seems to find a natural home in businesses of a certain size. Finally the size of each vendor's bubble represents the approximate $ cost per TB for their products (paid versions in the case of 'open core' vendors). Pricing information is notoriously difficult to come across so again this is very subjective but I have first hand experience with a number of these so it's not a stab in the dark.



Market Segments
SMP Defenders: Established vendors with large bases operating on SMP platforms
Teradata+Aster: The top of the tree. Big companies, big data, big money.
MPP Upstarts: Appliance – maybe, Columnar – maybe, Parallelism – always.
Open Source Upstarts: Columnar databases, smaller businesses, free to start.
Hadoop+Hive: The standard bearer for MapReduce. Big Data, small staff.

SMP > MPP Inflection
Still with me? Good, let's look at the notable segments of the market. First, there is a clear inflection point between the big single server (SMP) databases and the multi-server parallelised (MPP) databases. This point moves forward a little every year but not enough to keep up with the rising tide of data. For many years Teradata owned the MPP approach and charged a handsome rent. In the previous decade a bevy of new competitors jumped into the space with lower pricing and now the SMP old guard getting into MPP offerings, e.g., Oracle Exadata and Microsoft SQL Server PDW.

Teradata's Diminished Monopoly
Teradata have not their lost grip on the high end however. They maintain a near monopoly on data warehouse implementations in the very largest companies with the largest volumes of 'traditional' DW data (customers & transactions). Even Netezza has failed to make large dent into Teradata's customers. Perhaps there are instances of Teradata being displaced by Netezza; however I have never actually heard of one. There are 2 vendors who have a publicised history of being 'co-deployed' with Teradata: Greenplum and Aster Data. Greenplum's performance reputation is mixed and it was acquired last year by EMC. Aster's performance reputation is solid at very large scales and their SQL/MapReduce offering has earned them a lot of attention. It's no surprise that Teradata decided to acquire them.

The DBA Inflection Point
The other inflection point in this market happens when the database becomes complex enough to need full time babysitting, e.g., a Database Administrator. This gets a lot less attention than SMP>MPP because it's very difficult to prove. Nevertheless word gets around fairly quickly about the effort required to keep a given product humming along. It's no surprise that vendors of notoriously fiddly products sell them primarily to large enterprises where the cost of employing such an expensive specimen as a collection of DBAs is not an issue.

Small, Simple & Open(ish)
Smaller businesses, if they really need an analytic DB, choose products that have a reputation for being usable by technically inclined end users without a DBA. Recent columnar database vendors fall into this end of the spectrum, especially those that target the MySQL installed base. It's not that a DBA is completely unnecessary, simply that you can take a project a long way without one.

MapReduce: Reduced to Hadoop
Finally we have those customers in smaller businesses (or perhaps government or universities) who need to analyse truly vast quantities of data with the minimum amount of resource. In the past it was literally impossible for them to do this; they were forced to rely on gross simplifications. Now though we have the MapReduce concept of processing data in fairly simple steps, in parallel across a numerous cheap machines. In many ways this is MPP minus the database, sacrificing the convenience of SQL and ACID reliability for pure scale. Hadoop has become the face of MapReduce and is effectively the SQL of MapReduce, creating a common API that alternative approaches can offer to minimise adoption barriers. 'Hadoop compatible' is the Big Data buzz phrase for 2011.

Friday, 28 January 2011

HandlerSocket - More grist for the ORM mill

A plugin called HandlerSocket was released last year that allows InnoDB to be used to directly, bypassing the MySQL parsing and optimising steps. The genius of HandlerSocket is that the data is still "in" MySQL so you can use the entire MySQL toolchain (monitoring, replication, etc.). You also have your data stored in a highly reliable database, as opposed to some of the horror stories I'm seeing about newer NoSQL products.

In the original blog post  ( here ) it talks about 720,000 qps on an 8 core Xeon with 32GB RAM. Granted this is all in memory data we're talking about but that is a hell of a figure. He also claims it outperforms Memcached.

Next, Percona added HandlerSocket to their InnoDB fork back in December ( here ) so if you're looking for someone to talk to they may be the best people.

Finally, Ilya Grigorik (way-smart guy from PostRank) blogged about it a couple of weeks ago ( here ) and there's a fairly interesting discussion in the comments comparing this to prepared statements in Oracle.

All of this reinforces my opinion that new generation ORMs are the technology that will finally allow the RDBMS apple cart to tip all the way over. Products like Redis, Riak, CouchDB, etc. are not enough on their own.

The *really* interesting thing about HandlerSocket is that shows open source databases are perfect fodder for the next wave.

Wednesday, 19 January 2011

Analytic Database Market 'Fly Over'

This is a follow up to my previous post where I laid out my initial thoughts about ParStream. This is a very high level 'fly over' view of the analytic database market. I'll follow this up with some thoughts about how ParStream can position themselves in this market.


Powerhouse Vendors
The power players in the Analytic Database market are: Oracle (particularly Exadata), IBM (mostly Netezza, also DB2), and Teradata. Each of these vendors employs a large, very well funded and sophisticated sales force. A new vendor competing against them in accounts will find it very, very hard to win deals. They can easily put more people to work on a bid than a company like ParStream *employs*. If you are tendering for business in a Global 5000 corporation then you should expect to encounter them and you need a strategy for countering their access to the executive boards of these companies (which you will not get). In terms of technology their offerings have become very similar in recent years with all 3 emphasising MPP appliances of one kind or another, however most of the installed base are still using their traditional SMP offerings (Netezza and Teradata excepted).


New MPP niche players
There are a number of recent entrants to the market who also offer MPP technology, particularly: Greenplum, AsterData and ParAccel. All 3 offer software-only MPP databases, although Greenplum's emphasis has shifted slightly since being acquired by EMC. These vendors seem to focus mostly on (or succeed with) customers who have very large data volumes but are small companies in terms of employees. Many of these customers are in the web space. These vendors also have strong stories about supporting MapReduce/Hadoop inside their databases, which also plays to the leanings of web customers. According to testimonials on the vendor's websites customers seem to choose them because they are very fast and software only.


Microsoft
Microsoft is a unique case. They do not employ a direct sales force (as far as I know) however they have steadily become major force in enterprise software. Almost all companies run Windows desktops, have at least a few Windows servers and at least a few instances of SQL Server in production. Therefore Microsoft will be considered in virtually every selection process you're involved in. Microsoft have been steadily adding BI-DW features to the SQL Server product line and generally those features are all "free" with a SQL Server license. This doesn't necessarily make SQL Server cheaper but it does make it feel like very good value. Recent improvements include the Parallel Data Warehouse appliance (with HP hardware), columnar indexing for the next release and PowerPivot for local analysis of large data volumes.


Proprietary columnar
Columnar databases have been the hot technology in analytic databases for the last few years. The biggest vendors are Sybase with their very mature IQ product, SAND with an equally mature product and Vertica with their newer (and reportedly much faster) product. These databases can be used in single server (SMP / scale-up) and MPP (multi-server / scale-out) configurations. They appear to be most popular with customers who appreciate the high levels of compression that these databases offer and already have relatively mature star-schema / Kimball style data warehouses in place.  In my experience Sybase and SAND are used most in companies where they were introduced by an OEM as part of another product. Vertica is so new that it's not clear who their 'natural' customers are yet.


Open Source columnar
In the open source world there are 2 MySQL storage engines and a standalone product offering columnar databases. The MySQL engine Infobright was the first open source columnar database. It features very high compression and very fast loading however it is not suited for lots of joins and may be better thought of as a OLAP tool managed via SQL. The InfiniDB MySQL engine on the other hand is very good at joins and very good at squeezing all the available performance out of a server, however it does not have any compression currently. Finally there is LucidDB which is a Java based standalone product and has performance characteristics somewhere between the other two. LucidDB features excellent compression, index support and generally good performance but can be slow to load.


Vectorised columnar
There is only one player here: VectorWise. VectorWise is a columnar database (AFAIK) that has been architected from top to bottom to take advantage of the vector pipelines built into all recent CPUs. Vectorisation is a way of running many highly parallel operations through a single CPU. It basically removes all of the waiting and memory shifting that slows a CPU down. Initial testers have been very positive about the performance of VectorWise and had nothing but good things to say. There is also talk of an open source release so they are covering a lot of bases. They also have the advantage of being part of Ingres who may not be the force they once were but have a significant installed base and are well placed to sell VectorWise. They are the biggest direct competitor to ParStream that I can see right now.


Open Source MapReduce/NoSQL
ParStream will also compete with a new breed of open source MapReduce/NoSQL products, most notably Hadoop (and it's variants). These products are not databases per se but they have gained a lot of mindshare among developers who need to work with large data volumes. Part of their attraction is their 'cloud friendliness'. They are perfect for the cloud because they have been designed to run on many small servers and to expect that a single server could fail at any time. There is a trade-off to be made and MapReduce products tend to be much more complex to query, however for a technically savvy audience the trade is well worth it.


Next time I'll talk about where I think ParStream need to place themselves to maximise their opportunity.

UPDATE: Actually, in the next post I talk about how analytic database vendors are positioned and introduce a simple market segmentation. A further post about market opportunities will follow.

Disqus for @joeharris76