Read between the lines

Read between the lines – this expression derives from a simple form of cryptography, in which a hidden meaning was conveyed by secreting it between lines of text. Unfortunately, web analytics tools often limit your possibilities to extract meaningful data by combining data from separate rows. Let me exemplify:

Internal bounce rate

You don’t want your customers to end up in dead ends on your web site, right? One sign of dead ends is what we can call internal bounce. Imagine that a visitor on your homepage notice a push for certain product, let’s say Converse sneakers on sale, and clicks on the push. However, when the visitor enter the landing page he sees that the sale is limited to kids converse sneakers and hence is not relevant for the visitor. Thus visitor chose to go back to the homepage. This is an internal bounce. The interesting information here doesn’t exist on a single logged hit or row, but three. It’s only when we study those together we can identify that the push doesn’t communicate the same message as the landing page. So, how do you get hold of an internal bounce? Well, if you define an internal bounce as page view where previous page is the same as the next page, then it’s only a matter of adding previous and next page as dimensions to your page view level. That can be done by window functions or self joins. But it is hard to accomplish in traditional web analytics tools, if you collect a lot of data you probably have to export web data to a MPP or run a big data solution such as Hadoop or Google Big Query.


Product conversion rate (calculated metrics)

Often conversion rates are calculated as a ratio of “sessions with transaction divided by sessions”. Usually it is possible to segment by campaigns and if you are lucky you even have support for attribution modelling. However, this kind of conversion rate doesn’t help you optimize yield since it is completely disconnected from products. You only know the quantity sold of a particular product, but not the figures behind. Let’s say you don’t reach sales target on a product A, what should you do? Lower the price, promote on homepage and e-mail, change product copy and images or check product availability and errors? Sales figures doesn’t tell you, but product exposure and product conversion rate give you a hint. If exposure is low for product A but conversion rate is high, then it may be a matter of promoting the product rather than lowering price that is the solution. The opposite is also true, high exposure and low conversion rate imply you may have to lower price or work on your product presentation. Otherwise you will give away valuable profit margin unnecessarily. The exposure and the transaction are logged as two different events and web analytics tools doesn’t allow you to create the calculated metrics needed to tie those together.

Load avro data to Hive in HDInsight

I tried to load avro data to Hive in HDInsight using
load data inpath 'wasb://' into table avrotable partition (ds='2013-12-08');
and got this error message:

FAILED: SemanticException [ERROR 10028]: Path is not legal ”wasb://”: Move from wasb:// to: wasb:// is not valid. Please check that values for params “” and “hive.metastore.warehouse.dir” do not conflict.

The reason is apparently that Hive disallows data to be loaded from a path that has different “schema” and “authority” than the to path. See

In this case, the authority part of the URI’s do not match since the containers (from : avro to: analytics) are different even though they belong to same storage account (myaccount).

There are two ways to solve this if you don’t want to use the default container (analytics) as from.

  1. Create an external table with location pointing to the path in the container you desire, and
  2. Ensure that the data that has to be loaded into the table is inside the same container.

Hope this helps anyone facing the same issue.

Disable Restricted Area window in digital ocean wordpress admin section

Thought I would share how to disable the restricted area window when you try to log in to the admin section on digital ocean wordpress site.
1. First SSH to your droplet (ex. using Putty) with the credentials you get when launching a droplet. From there navigate to etc/apache2 by using the cd command.
2. Then open apache2.conf by entering sudo nano apache2.conf.
3. Comment (#) the following part and save (Ctrl-X and Y):
#<DirectoryMatch ^.*/wp-admin/>
# AuthType Basic
# AuthName "Please login to your droplet via SSH for login details."
# AuthUserFile /etc/apache2/.htpasswd
# Require valid-user

4. Restart your apache server sudo service apache2 restart

That should be it!

Pig AvroStorage on HDInsight

This example shows how to load files with textloader and store them with AvroStorage on HDInsight.

-- run from Hadoop Command Line %PIG_HOME%/bin/pig -f asv://[container1]@[account]
--put your jar-files (avro-*.jar, json-simple-*.jar, piggybank.jar, etc.) in a blob storage and reference them from script
REGISTER asv://[container2]@[account]*.jar;
--Store files compressed by snappy
SET avro.output.codec snappy;
RAW_LOGS = LOAD '$PATH_TO_WEBLOGS' USING TextLoader as (line:chararray);
[extract fields from line]
"schema": {"type": "record", "name": "logrow", "namespace": "com.robertsahlin.avro",
"fields": [
{"name": "field1","type": ["null","string"]},
{"name": "field2","type": ["null","string"]},
{"name": "field3","type": ["null","string"]}

I noticed that using AvroStorage() with schema on HDInsight only works if you reference the piggybank.jar in blob storage and rename/remove the C:appsdistpig- file. Otherwise you will end up with an error telling you that AvroStorage can’t interpret the schema argument.

ERROR 1200: Pig script failed to parse:
pig script failed to validate: java.lang.RuntimeException: could not instantiate '' with arguments

I guess it has something to do with how Hadoop/Pig loads the jar-files. If piggybank is already loaded, it want load the needed jar-files (json, etc.) that are picked up by the custom loader.

Pig solution to Hadoop small file problem

A common problem with weblogs as input is that you often get a lot of relatively small files to process (ex. 40 files 10 MB each per day). That is not what Hadoop does best (small file problem). One way to get around that is to run a pig job that consolidate many small files to bigger files. This example shows how to load files with textloader and store them as 264 MB size files with PigStorage on HDInsight. The trick is to set max.CombinedSplitSize to the desired block size (preferably a multiple of 64 MB).

-- run from Hadoop Command Line %PIG_HOME%/bin/pig -f asv://[container1]@[account]
--Store as 264 MB size blocks compressed by snappy
SET pig.maxCombinedSplitSize 268435456;
RAW_LOGS = LOAD '$PATH_TO_WEBLOGS' USING TextLoader as (line:chararray);
[extract fields from line]
USING PigStorage():

Pig and hCatalog on Azure HDInsight

I’ve been trying to store data from Pig to hCatalog on Azure HDInsight and it has been easier said than done. First, launching grunt with pig -useHCatalog has been working fine when I want to store data to hCatalog. However, in production mode you really want to run pig scripts and then I hit some problems.

ERROR 1070: Could not resolve org.apache.hcatalog.pig.HCatStorer using imports

I’ve tried to run the script with additional jars
pig -Dpig.additional.jars=%HCAT_HOME%sharehcataloghcatalog-*.jar;%HIVE_HOME%libhive-metastore-*.jar;%HIVE_HOME%liblibthrift-*.jar;%HIVE_HOME%libhive-exec-*.jar;%HIVE_HOME%liblibfb303-*.jar;%HIVE_HOME%libjdo2-api-*-ec.jar;%HIVE_HOME%conf;%HADOOP_HOME%conf;%HIVE_HOME%libslf4j-api-*.jar -f myscript.pig

and by setting PIG_CLASSPATH and PIG_OPTS
set HCAT_HOME=c:appsdisthcatalog-
set PIG_CLASSPATH=%HCAT_HOME%sharehcataloghcatalog-*.jar;%HIVE_HOME%libhive-metastore-*.jar;%HIVE_HOME%liblibthrift-*.jar;%HIVE_HOME%libhive-exec-*.jar;%HIVE_HOME%liblibfb303-*.jar;%HIVE_HOME%libjdo2-api-*-ec.jar;%HIVE_HOME%conf;%HADOOP_HOME%conf;%HIVE_HOME%libslf4j-api-*.jar
set PIG_OPTS=-Dhive.metastore.uris=thrift://headnodehost:9083

but no luck.

I can’t figure out what is wrong with the file path so I resorted to registering the jar-files in the script.
REGISTER c:appsdisthcatalog-*.jar;...
but that didn’t work either.

So I copied the jar files to blob storage and register the path in the script file.
REGISTER asv://*.jar;
And now suddenly it works!

PUL och molntjänster för analys

Jag hade intressant samtal med en “Big Data”-konsult igår om Hadoop i molnet. Enligt konsulten så får företag som är registrerade i ett land inte föra över personuppgifter till annat land. Konsulten hänvisade till PUL och om det stämmer så har det stor påverkan på möjligheten att nyttja molntjänster eftersom dessa ofta har datacenter i annat land än Sverige samt att kund- och webbanalys ofta inbegriper personuppgifter i form av ID-nummer som indirekt kan kopplas till en person.

Jag var av annan uppfattning/tolkning och utan att vara juridisk expert så tänkte jag återge min tolkning, i synnerhet för att nyttja Azures Hadooptjänst i webbanalyssyfte.

PUL förklarat av Datainspektionen:

“PuL är en förkortning av personuppgiftslagen. Personuppgiftslagen innehåller regler som ska skydda människor mot att deras personliga integritet kränks genom behandling av personuppgifter. Lagen gäller för behandling av personuppgifter i hela samhället – för verksamhet som bedrivs av såväl myndigheter som enskilda…”

Vad är en personuppgift enligt Datainspektionen:

“All slags information som direkt eller indirekt kan knytas till en fysisk person som är i livet räknas enligt personuppgiftslagen som personuppgifter. Även bilder (foton) och ljudupptagningar på individer som behandlas i dator kan vara personuppgifter även om inga namn nämns. Krypterade uppgifter och olika slags elektroniska identiteter, som exempelvis IP-nummer, räknas som personuppgifter om de kan kopplas till fysiska personer.”

Med andra ord räknas kund-id, order-id, medlemskapsnummer, m.m. som personuppgifter då dessa indirekt kan knytas till en person. Så lagras dessa för analyssyfte så måste företaget leva upp till PUL.

Vilka regler gäller för överföring av personuppgifter till tredje land?

 “EU:s dataskyddsdirektiv kräver att alla medlemsstater har regler som ger ett likvärdigt skydd för personuppgifter och personlig integritet. Detta gäller även EES-länderna. Därför kan personuppgifter föras över fritt inom detta område utan begränsningar. Eftersom det inte finns några generella regler som ger motsvarande garantier utanför EU/EES har man ansett att överföring till sådana länder bör begränsas. Personuppgifter får därför föras över endast om det finns en adekvat skyddsnivå i mottagarlandet eller om det finns särskilda garantier för att uppgifterna och de registrerades rättigheter skyddas.”

Med andra ord är det fullt möjligt att nyttja molntjänster baserade i EU för nämnda ändamål. Men för molntjänster i andra länder gäller särskilda regler. Datainspektionen har en lista över vilka länder som har adekvat skyddsnivå. Eftersom Microsofts Hadoop-tjänst i dagsläget endast finns tillgänglig i deras amerikanska datacenter (US-North) så gäller Safe Harbor-principerna. Uppdatering 2013-09-13: HDInsight finns nu att tillgå i region Northern Europe vilket betyder att datacentret ska följa EU:s dataskyddsdirektiv.

Det är en samling frivilliga regler om personlig integritet och dataskydd som har tagits fram och beslutats av USA:s handelsdepartement (Department of Commerce – DoC). Organisationer i USA kan anmäla till departementet att de ansluter sig till dessa regler. EU-kommissionen har bedömt att reglerna (med tillhörande frågor och svar) utgör en adekvat skyddsnivå. Det är därmed tillåtet att föra över personuppgifter från EU/EES till organisationer i USA som har anslutit sig till reglerna. På USA:s handelsdepartements webbplats finns en lista över företag som anslutit sig till Safe Harbor-principerna.

I fallet Microsoft Azure så går det att läsa att de uppfyller nämnda principer. Jag tolkar sålunda att det är fritt fram att nyttja Azure HDInsight för webbanalys, givet att:

“Den personuppgiftsbehandling som sker i Sverige måste fortfarande följa reglerna i personuppgiftslagen. Det innebär att uppgifter bara får föras över om den personuppgiftsansvarige i Sverige har följt övriga krav i personuppgiftslagen, till exempel de grundläggande kraven på personuppgiftsbehandling och reglerna om när sådan behandling över huvud taget är tillåten.”


Azure HDInsight – persistent data with blob storage and SQL Server (part 2)

So you got access to Azure HDInsight and your usage pattern is such that you don’t want to run a cluster 24/7? Then you want a persistent data storage both in terms of the data you want to analyse, but also your meta data store. Azure blob storage and SQL Server to the rescue.

When you have signed up for azure blob storage it is time to upload your raw data (your web server log files). The easiest way to do it is by using a Azure blob storage client, I’ve mostly been using Cloudberry Explorer for Windows Azure. AZCopy is an alternative I’ve used when I needed command line access from a Talend job collecting files from a server and compressing those (7zip) before uploading to Azure blob storage. The explorer software is great becuase you can easily create a directory structure with folders and subfolders, you can’t do that from the Azure web interface yet (only containers). You want to set up a tier0 folder for your raw data and one or more staging folders (ex. tier1). For production use you want to set up an automatic job to collect your web data, either in a streaming fashion (flume, scribe, etc.), logging directly from your .net web application, scheduled transports of log files or perhaps let enable logging on a gif-object in the blob storage and analyse those log files. I will have to write a separate post on that.

To access your blob storage containers you specify what container to use when setting up your cluster. If you want to access more than one container I advise you to read the excellent post by Denny Lee. But wait, isn’t Hadoop all about moving compute to data vs. traditionally moving data to compute, so why should I use blob storage instead of local disk HDFS? Denny Lee describes that as well, in short it is all about the network: the performance of utilizing HDFS with local disk or HDFS using ASV (blob storage) is comparable (if your cluster is smaller than 40 nodes).

The other challenge is to set up persistent storage of meta data. The best option is to use Azure SQL Server for that. In the preview it was straight forward to make that setup when launcing your cluster, but not so in HDInsight feature preview (yet). Actually, when spinning up a cluster, Azure set up a temporary SQL Server as your meta store behind the scenes. We will set up our own database, prepare it with the correct tables and point the cluster to use that as meta store instead. Update 2013-09-13: the azure management portal now allows you to specify SQL server as metastore when launching a HDInsight cluster.

1) This step is only necessary the first time you set up your own SQL Server as meta store. Create a Azure SQL Server instance and make a note of server name, database name, user and password. Remote desktop to your HDInsight cluster and open a terminal window. From your terminal window, replace the parameters with yours and run:
Now your SQL Server is populated with the needed tables to run as a hive meta store, make sure to check that the tables are created before moving forward.

2) It is time to point your cluster to your newly created meta store. Locate your hive-site.xml file (C:appsdisthive-0.9.0conf) and open it. Locate the properties (SERVER_NAME, DATABASE_NAME, USER, PASSWORD) below and change them according to your credentials, (keep a local copy of your new hive-site.xml and copy-replace the next time you spin up your cluster). Update 2013-09-13: this can still be useful if you want to use a SQL server in another subscription plan.
<description>JDBC connect string for a JDBC metastore</description>


Then you likely need to restart your Hive server. Go to the bin folder (C:appsdisthive-0.9.0bin) and run stop_daemons followed by start_daemons. The easiest way to check if your changes have taken effect is to:

1. Launch your hive client
2. Create a table
session_id STRING

3. Check if the table exist in your SQL Server database (log in to your sql database management portal).
SQL database management portal

Now you should have a HDInsight cluster with persistent data and meta data storage. If you have any feedback, questions or tips to further improve the setup, please add a comment.

If you missed part one on Azure HDInsight.

Azure HDInsight – Hadoop analytics in the cloud – part 1

This is part 1 in a serie of blog posts about web analytics using Hadoop in the cloud.

Why would you like to run your Hadoop cluster in the cloud? Well, there may be several reasons for that. Perhaps you don’t have resources (time, money and skills) to operate a Hadoop cluster inhouse or your demand fluctuates frequently, the arguments are the same as for cloud services in general. The important part when applying this to Hadoop is how to keep your data and meta data when taking down your Hadoop cluster. Also, it is preferable that the service doesn’t take to long to spin up your cluster.

I have tried out both Amazon Elastic MapReduce and Azure HDInsight, and I’m impressed by both offerings. However, being more comfortable and experienced with Windows than Linux and working for a company that run pretty much all systems on Microsoft technology (even BI), the choice was never hard when deciding on a service for my Proof of Concept. I look forward to a GA-release of Azure HDInsight to fully recommend it as ready for production use cases.

So, how do you get started? First, apply for the Azure HDInsight feature preview (you need to log in to Azure first), it may take a while to get granted access, so don’t wait, just do it. Second, download the local HDInsight distribution to start develop on your local machine before throwing your code on a bigger dataset.

HDInsight Feature Preview

If you don’t already have a good use case for your Big data/Hadoop pilot I suggest that you start analysing your web logs. Reason, even if you don’t have tens of TB:s in web logs, the data is of such nature that it suits well for practicing development and analysis on Hadoop. The data doesn’t change once it is written and it is pretty easy to transport and parse. Also, in most businesses nowadays, the web presence is a crucial part of business. That use case may even be your Trojan Horse to get a Hadoop implementation through the Finance department as some web analytics solutions are rediculously overpriced data collection tools. Cost savings usually serves as a stronger argument than strategic objectives, perhaps there is som truth in “it is easier to save a buck than earning one”. Example, it is not unusual that a proprietary web analytics SAAS solution may cost 200 000 USD annually. That is approximate what you would pay for a 25-30 node cluster running 24/7, and that is a lot of computing power for a standard web analytics implementation at a medium/large company.

I strongly suggest you get your hands on “Programming Hive” and “Programming Pig” to accompany you on your big data journey. If you are interested in operating a Hadoop cluster inhouse, then “Hadoop – the definitive guide” is probably a good investment as well.

But this blog serie will focus on Hadoop in the cloud or more specifically:

  • - Azure HDInsight as Hadoop compute cluster
  • Azure Blob storage for persistent data storage
  • Azure SQL Server as persistent metastore
  • - Excel PowerPivot and PowerView as analytics frontend
  • Pig for data processing/enhancement
  • - hCatalog for meta data management
  • Hive for analysis

Please, comment, connect or send me a message if you have feedback or questions. Let the big data journey begin.

Google Tag Manager on WordPress

Well, a WordPress site that you host yourself is not perhaps the kind of site where you benefit most from using Google Tag Manager (GTM). However, it is a great way to learn and try out GTM without the risk of negative impact on your e-commerce site. In order to fully utilize GTM for Google Analytics, you need a data layer. WordPress doesn’t provide a data layer as a JSON-array as default so you have to create one (I couldn’t find any plugin at the time writing). I could probably have written a plugin for this but I didn’t have the time to figure out how.

First, find the functions.php in your theme. Add the following function in it:
if ( !function_exists( 'before_tag_manager' ) ) {
function before_tag_manager() {
global $query_string, $post;
$post_id = $post->ID;
if( have_posts() ) :
while( have_posts() ) : the_post();
$output[]= array( 'title' => get_the_title(), 'date' => get_the_date());
echo "<script>dataLayer =" . json_encode($output) . "</script>";
} // endif

Then you need to add the function call to your theme’s header template (usually header.php). Locate the body start tag and add the function call immediately after that, then add the GTM-script:
<body <?php body_class(); ?>>
<?php before_tag_manager(); ?>
<!-- Google Tag Manager -->
<noscript><iframe src="//"
height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
})(window,document,'script','dataLayer','<YOUR CONTAINER ID>');</script>
<!-- End Google Tag Manager -->

If you want to play with other variables in the data layer than “title” and “date”, you can easily add context by using the built in template tags in wordpress and put them in the $output variable in the before_tag_manager function call. If you view the source code on one of your blog posts you will now see a dataLayer containing both the “title” and “date” of the post. The variables are now easily accessed from the Google Tag Manager Macros as Data Layer variables.

Google Tag Manager – Data Layer

« Older posts

Copyright © 2015

Theme by Anders NorenUp ↑