I have been following Google BigQuery for some time, and two weeks ago I gave it a try. The query speed is impressive, the query syntax is natural, and I really like the REST API and the security measures. The service has been in beta (US) for some time but is now (May 1, 2012) open to everyone.
But why do I hesitate to move my analytics to BigQuery from Hadoop?
- Google markets BigQuery as real-time Big Data analytics. Real time sounds awesome, but how often do you really need real-time analytics? Can you act in real time? The real need is probably automated actions based on real-time data (big or small) for targeting, recommendations, etc. I can’t see BigQuery filling that spot, and that is probably not the intention either. Also, the query speed is impressive, almost “real time”, but is the data real time? I can’t find a way to “stream” new data into BigQuery in real time. According to the documentation, you can append data to a table when uploading a new CSV file. But is frequently uploading new CSV files to BigQuery really a real-time solution? Hadoop Hive isn’t a great solution for real time either, but at least you can choose to store data directly in different storage backends (HBase, Cassandra, MongoDB, etc.) rather than in CSV files.
- Big Data requires Big Data ETL (Extract, Transform, Load) before you can even perform analytics. BigQuery doesn’t offer any tools to collect massive amounts of data or to clean and structure it. There are many other ways to do that, but I really like how I can apply Pig to carry out Big Data ETL in Hadoop, i.e. close to the data. You can probably perform similar jobs with MapReduce on Google App Engine, but I prefer sticking to Apache Hadoop and the abstractions over low-level coding that Pig and Hive offer.
- JOINS. BigQuery supports joins only when one side of the join is much smaller than the other. This is a drawback if I want to combine data from multiple data sources, especially since the ETL part of the process is lacking.
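To illustrate why a join with one small side is the easy case, here is a minimal Python sketch of a broadcast (hash) join. It is a general technique, not BigQuery's actual implementation, which I have not seen. The function name `broadcast_hash_join` and the row layout (plain dicts) are assumptions made for the example.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def broadcast_hash_join(large_rows: Iterable[Row],
                        small_rows: Iterable[Row],
                        key: str) -> Iterator[Row]:
    """Inner-join two row sets on `key`, assuming `small_rows` fits in memory.

    Hypothetical helper for illustration only.
    """
    # Build an in-memory index of the small side. This is the step that
    # only works when one side of the join is much smaller than the other.
    index: Dict[Any, List[Row]] = defaultdict(list)
    for row in small_rows:
        index[row[key]].append(row)

    # Stream the large side once and look up matches in the index.
    for row in large_rows:
        for match in index.get(row[key], ()):
            merged = dict(match)
            merged.update(row)
            yield merged
```

The large side is only streamed, so it can be arbitrarily big, while the small side must be held in memory on every worker. When both sides are large, that index no longer fits, and the join needs a shuffle or sort-merge step instead, which is the kind of job Hadoop handles.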
I am at the beginning of my journey in big data analytics and my conclusions may be wrong. I appreciate all comments correcting me.