So you have access to Azure HDInsight, but your usage pattern is such that you don’t want to run a cluster 24/7? Then you need persistent storage, both for the data you want to analyse and for your metadata store. Azure blob storage and SQL Server to the rescue.
Once you have signed up for Azure blob storage, it is time to upload your raw data (your web server log files). The easiest way to do this is with an Azure blob storage client; I’ve mostly been using Cloudberry Explorer for Windows Azure. AzCopy is an alternative I’ve used when I needed command-line access from a Talend job that collects files from a server and compresses them (7zip) before uploading to Azure blob storage. The explorer software is great because you can easily create a directory structure with folders and subfolders; you can’t do that from the Azure web interface yet (only containers). You want to set up a tier0 folder for your raw data and one or more staging folders (e.g. tier1). For production use you want an automatic job to collect your web data: either in a streaming fashion (Flume, Scribe, etc.), logging directly from your .NET web application, scheduled transfers of log files, or perhaps enabling logging on a gif object in blob storage and analysing those log files. I will have to write a separate post on that.
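As a sketch, an upload with the early (v1-style) AzCopy syntax might look like the line below; the account name, container name and storage key are placeholders you would replace with your own:

AzCopy C:\logs\weblogs https://YOUR_ACCOUNT.blob.core.windows.net/YOUR_CONTAINER/tier0 /DestKey:YOUR_STORAGE_KEY /S

The /S switch copies subfolders recursively, which preserves the tier0 directory structure described above.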
To access your blob storage containers, you specify which container to use when setting up your cluster. If you want to access more than one container, I advise you to read the excellent post by Denny Lee. But wait, isn’t Hadoop all about moving compute to data rather than the traditional moving of data to compute, so why should I use blob storage instead of local-disk HDFS? Denny Lee covers that as well. In short, it is all about the network: the performance of HDFS on local disk versus HDFS over ASV (blob storage) is comparable, at least for clusters smaller than 40 nodes.
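For reference, in the HDInsight preview a blob storage path is addressed with the asv:// scheme; the container and account names below are placeholders:

asv://YOUR_CONTAINER@YOUR_ACCOUNT.blob.core.windows.net/tier0/weblogs/

Paths in the container configured at cluster setup can also be referenced with just asv://YOUR_CONTAINER/path.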
The other challenge is to set up persistent storage of metadata. The best option is to use Azure SQL Server for that. In the Hadooponazure.com preview it was straightforward to make that setup when launching your cluster, but not so in the HDInsight feature preview (yet). Actually, when spinning up a cluster, Azure sets up a temporary SQL Server as your metastore behind the scenes. We will set up our own database, prepare it with the correct tables and point the cluster to use that as its metastore instead. Update 2013-09-13: the Azure management portal now allows you to specify a SQL Server as metastore when launching an HDInsight cluster.
1) This step is only necessary the first time you set up your own SQL Server as a metastore. Create an Azure SQL Server instance and make a note of the server name, database name, user and password. Remote desktop to your HDInsight cluster and open a terminal window. From the terminal, replace the parameters with yours and run:
%HIVE_HOME%\CreateHiveMetaStoreOnSqlAzure.cmd SERVER_NAME DATABASE_NAME USER PASSWORD %HIVE_HOME%
Now your SQL Server is populated with the tables needed to run as a Hive metastore. Make sure to check that the tables have been created before moving forward.
2) It is time to point your cluster to your newly created metastore. Locate your hive-site.xml file (C:\apps\dist\hive-0.9.0\conf) and open it. Locate the properties (SERVER_NAME, DATABASE_NAME, USER, PASSWORD) below and change them according to your credentials (keep a local copy of your new hive-site.xml and copy-replace it the next time you spin up your cluster). Update 2013-09-13: this can still be useful if you want to use a SQL Server in another subscription plan.
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:sqlserver://SERVER_NAME.database.windows.net:1433;databaseName=DATABASE_NAME;user=USER@SERVER_NAME;password=PASSWORD;encrypt=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
Then you likely need to restart your Hive server. Go to the bin folder (C:\apps\dist\hive-0.9.0\bin) and run stop_daemons followed by start_daemons. The easiest way to check whether your changes have taken effect is to:
1. Launch your hive client
2. Create a table (any simple definition will do; the single string column here is just an illustration)
CREATE EXTERNAL TABLE IF NOT EXISTS campaigns_tier1(
  line STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
3. Check that the table exists in your SQL Server database (log in to your SQL database management portal).
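For step 3, one quick check from the SQL management portal is to query the metastore schema directly; TBLS and its TBL_NAME column are part of the standard Hive metastore layout, so your new table should show up there:

SELECT TBL_NAME, TBL_TYPE FROM TBLS;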
Now you should have an HDInsight cluster with persistent data and metadata storage. If you have any feedback, questions or tips to further improve the setup, please add a comment.
If you missed it, read part one on Azure HDInsight.