The latest technology and data news, analysis and ideas from the DataMine Lab blog
- e-mail: [email protected]
-
Calculating unique visitors in Hadoop and HiveSeptember 6, 2011 by Radek Maciaszek
Unique visitors
One of the most important website metrics is the number of unique visitors. However, it is also one of the most difficult to calculate. In this post, I will review a sampling strategy which produces a very good estimate of unique users, yet is computationally cheap.
Non-additive data
It is relatively easy to calculate small numbers of unique visitors: all you need to do is perform a single SQL query.
To calculate the number of unique records in Hive, run the following:
1 SELECT COUNT(DISTINCT user_id) FROM page_viewsHowever, once the number of records in the table “page_views” becomes very large, this query may result in OOM errors. If this happens, there are other ways to calculate the exact number of unique visitors. Alternatively, it is possible to generate useful figures by using a sample.
Sampling
In practice, estimating the unique visitors metric gives pretty close results. In our tests on tens of millions of records, the results came within 0.1% of real values. One thing to remember is to ensure you sample visitors and not page views. The presented sampling method is a simple Bernoulli Sampling.
Having a sample can sometimes be even more useful than calculating the exact number. You can build a data warehouse around the sample and slice and dice on unique visitors — something which cannot be done on pre-calculated non-additive data. I will show at the end of this post how to create a cube that can be used to visualise unique visitors data.
Hashing
In order to sample users, we need to get every n-th user randomly from the population of records. One way to do it is to calculate the visitor hash for every record using a uniform hashing function (such as Md5). Md5 generates a random hexadecimal string on which we can filter only those users whose hash finishes with an arbitrary string, such as ’00′. Notice that since this is a uniform hashing function, the probability that the user hash finishes with ’0′ is 1/16, and so the probability that it finishes with ’00′ is 1/256.
Note that Hive (at the time of writing, version 0.7) does not implement an Md5 function, so feel free to use the following code to add an Md5 hash function to Hive:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 package com.dataminelab.hive.udf;import org.apache.hadoop.hive.ql.exec.UDF;import org.apache.hadoop.io.Text;import java.security.*;/*** Calculate md5 of the string*/public final class Md5 extends UDF {public Text evaluate(final Text s) {if (s == null) {return null;}try {MessageDigest md = MessageDigest.getInstance("MD5");md.update(s.toString().getBytes());byte[] md5hash = md.digest();StringBuilder builder = new StringBuilder();for (byte b : md5hash) {builder.append(Integer.toString((b & 0xff) + 0x100, 16).substring(1));}return new Text(builder.toString());} catch (NoSuchAlgorithmException nsae) {System.out.println("Cannot find digest algorithm");System.exit(1);}return null;}}Alternatively you may patch your Hive distribution with the code from the following ticket HIVE-1262.
HiveQL
The following query will generate a unique visitors sample:
1 2 3 4 5 6 7 ADD JAR ${CODE}/hive/udf/md5.jar;CREATE TEMPORARY FUNCTION mymd5 AS 'com.dataminelab.hive.udf.Md5';SELECTDISTINCT mymd5(user_id)FROM page_viewsWHEREsubstr(mymd5(user_id), -2) == '00';Pentaho
There are many other issues with unique visitors, such as how to present non-additive results to the end user. BI tools (such as Pentaho Mondrian) allow you to do this with the distinct aggregate function:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 <Schema name="UNIQUE_VISITORS"><Cube name="UNIQUE_VISITORS" cache="true" enabled="true"><Table name="SAMPLE_UNIQUE_VISITORS"></Table><Dimension foreignKey="day" name="Date"><Hierarchy hasAll="true" primaryKey="day_date"><Table name="DIM_DATE"></Table><Level name="Year" column="year_number" type="Numeric" uniqueMembers="false" levelType="TimeYears" hideMemberIf="Never"></Level><Level name="Month" column="month_number" nameColumn="month_name" type="Numeric" uniqueMembers="false" levelType="TimeMonths" hideMemberIf="Never"></Level><Level name="Day" column="day_of_month_number" type="Numeric" uniqueMembers="false" levelType="TimeDays" hideMemberIf="Never"></Level></Hierarchy></Dimension><Dimension name="Website"><Hierarchy hasAll="true"><Level name="Website" column="publisher_name" type="String" uniqueMembers="false" levelType="Regular" hideMemberIf="Never"></Level></Hierarchy></Dimension><Measure name="Unique users sample" column="user_md5" datatype="Integer" formatString="#,###" aggregator="distinct-count" visible="false"></Measure><CalculatedMember name="Unique users" dimension="Measures"><Formula>[Measures].[Unique users sample] * 256</Formula><CalculatedMemberProperty name="FORMAT_STRING" value="#,###"/></CalculatedMember></Cube>After loading the sample to your aggregate, the OLAP tools will allow you to report on it in a similar way to how you would report on standard additive data. See below:
-
Rlharmon
-
Radek Maciaszek
-
Rlharmon
-
http://www.dataminelab.com Radek
-
Rlharmon
