The latest technology and data news, analysis and ideas from the DataMine Lab blog

Blog

  • YCSB run against HBase 0.92 on Amazon Elastic MapReduce
    September 16, 2012 by Krystian Nowak

    In this post we show, in a few simple steps, how to use the Yahoo! Cloud Serving Benchmark (https://github.com/dataminelab/YCSB) to benchmark an HBase 0.92 cluster deployed automatically by Amazon Elastic MapReduce, and what measurements and comparisons you can obtain when choosing among the available instance types.

    We will create EMR HBase clusters using the tooling provided by Amazon:
    http://elasticmapreduce.s3.amazonaws.com/elastic-mapreduce-ruby.zip

    Note: As you can see in commands.rb, default_hadoop_version is set to 0.20(.x), but our tests found that Hadoop 1.0.3 gives a significant performance gain. Therefore, when creating the EMR cluster, we explicitly set this version.

    Let’s create one:

    elastic-mapreduce --create \
    --hbase \
    --name "EMR HBase YCSB" \
    --num-instances 2 \
    --instance-type m1.large \
    --hadoop-version 1.0.3
    Created job flow j-1PP3JU6UJ0HQ1
    

    elastic-mapreduce --list --active
    j-1PP3JU6UJ0HQ1     WAITING
    ec2-23-22-19-48.compute-1.amazonaws.com          EMR HBase YCSB
     COMPLETED      Start HBase
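
    The public DNS name of the master shown in this listing is what the scp and ssh steps need. A small sketch for pulling it out of the listing, assuming only that the output contains a hostname of the ec2-...amazonaws.com form shown above; the helper name emr_master_host is our own and not part of the EMR tooling:

```shell
#!/bin/sh
# emr_master_host < listing
# Prints the first EC2 public DNS name found in the text on stdin.
# Hypothetical helper: it only pattern-matches the hostname, so it does not
# depend on the exact column layout of `elastic-mapreduce --list`.
emr_master_host() {
    grep -Eo 'ec2-[0-9]+(-[0-9]+){3}\.[a-z0-9.-]+\.amazonaws\.com' | head -n 1
}
```

    It could be used as: elastic-mapreduce --list --active | emr_master_host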

    Build the project (the HBase master server variables should now default to localhost (127.0.0.1)).

    git clone git@github.com:dataminelab/YCSB.git
    cd YCSB
    export MAVEN_OPTS="-Xmx512m -Xms128m -Xss2m"

    (see http://jira.codehaus.org/browse/MASSEMBLY-549 for the reason)

    mvn clean install -Dcheckstyle.skip=true
    cd distribution/target
    scp -i ~/.ssh/dataminelab-ec2.pem ycsb-0.1.5-SNAPSHOT.tar.gz \
    hadoop@ec2-23-22-19-48.compute-1.amazonaws.com:/home/hadoop/ycsb.tar.gz
    ssh -i ~/.ssh/dataminelab-ec2.pem \
    hadoop@ec2-23-22-19-48.compute-1.amazonaws.com
    tar xvzf ycsb.tar.gz
    ln -s ycsb-0.1.5-SNAPSHOT ycsb
    cd ycsb
    

    Create the working table in HBase (already pre-split):

    hbase org.apache.hadoop.hbase.util.RegionSplitter usertable -c 200 -f family
    

    The split is hard to get perfect because https://issues.apache.org/jira/browse/HBASE-4163 is still not in place. Please vote! :)
    But it still seems to be better than no split at all!

    You might spot:

    12/08/25 13:39:16 ERROR metrics.MetricsSaver:
    Failed SaveRecords hdfs:/mnt/var/lib/hadoop/metrics/raw/i-694c4712_04272_raw.bin
    Shutdown in progress
    

    as in https://forums.aws.amazon.com/thread.jspa?threadID=100643 but it doesn’t seem to hurt us…

    hbase shell
    scan '.META.', {COLUMNS => 'info:regioninfo'}
    exit
    

    Load initial data into HBase

    ./bin/ycsb load hbase -p columnfamily=family -P workloads/workloada | tee load.log
    

    Check with your own eyes that the data has been loaded into HBase:

    hbase shell
    
    hbase(main):001:0> count 'usertable'
    Current count: 1000, row: user995698996184959679
    1000 row(s) in 2.3210 seconds
    

    And run the tests – only as a warm-up:

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloada \
    -p columnfamily=family \
    -p operationcount=10000 \
    -s \
    -threads 10 | tee warm-up-tests.log
    

    And now the real tests with 10 threads:

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloada \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 10 | tee real-tests-workload-a.log
    

    cat real-tests-workload-a.log
    

    [OVERALL], RunTime(ms), 47132.0
    [OVERALL], Throughput(ops/sec), 2121.700755325469
    [UPDATE], Operations, 50209
    [UPDATE], AverageLatency(us), 186.93305980999423
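
    Since every run is saved with tee, the headline figures can be pulled out of the log files instead of being read by eye. A minimal sketch, assuming the summary lines have exactly the "[SECTION], Metric, value" shape shown above; the function name ycsb_metric is our own and not part of YCSB:

```shell
#!/bin/sh
# ycsb_metric SECTION METRIC < logfile
# Prints the value of the first "[SECTION], METRIC, value" line on stdin.
# Hypothetical helper (not part of YCSB); it assumes the comma-and-space
# separated summary format seen in the logs above.
ycsb_metric() {
    awk -F', ' -v section="[$1]" -v metric="$2" \
        '$1 == section && $2 == metric { print $3; exit }'
}
```

    For example: ycsb_metric OVERALL 'Throughput(ops/sec)' < real-tests-workload-a.log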
    

    And also 10 threads, but for another workload type.

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloadf \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s -threads 10 | tee real-tests-workload-f.log
    cat real-tests-workload-f.log
    

    [OVERALL], RunTime(ms), 52748.0
    [OVERALL], Throughput(ops/sec), 1895.8064760749223
    [UPDATE], Operations, 50018
    [UPDATE], AverageLatency(us), 11.925006997480907
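
    The two [OVERALL] lines are related: the reported throughput matches the operation count divided by the run time in seconds. A quick sanity check of that arithmetic, written as a sketch (the function name throughput is ours; the inputs are the operationcount passed on the command line and the RunTime(ms) value from the log):

```shell
#!/bin/sh
# throughput OPERATIONS RUNTIME_MS
# Prints operations per second, rounded to two decimals.
# Hypothetical helper used only to cross-check the logged figures.
throughput() {
    awk -v ops="$1" -v ms="$2" 'BEGIN { printf "%.2f\n", ops / (ms / 1000) }'
}
```

    With the figures above, throughput 100000 47132.0 gives 2121.70 and throughput 100000 52748.0 gives 1895.81, in line with the logged Throughput(ops/sec) values.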
    

    Now we can check how these workload scenarios behave as the number of threads increases.
    Starting with 100 threads:

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloada \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 100 | tee real-tests-workload-a-100t.log
    cat real-tests-workload-a-100t.log
    

    [OVERALL], RunTime(ms), 24234.0
    [OVERALL], Throughput(ops/sec), 4126.433935792688
    [UPDATE], Operations, 50063
    [UPDATE], AverageLatency(us), 1076.5547010766434
    

    500 threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloada \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 500 | tee real-tests-workload-a-500t.log
    cat real-tests-workload-a-500t.log
    

    [OVERALL], RunTime(ms), 20706.0
    [OVERALL], Throughput(ops/sec), 4829.518014102193
    [UPDATE], Operations, 50099
    [UPDATE], AverageLatency(us), 6167.192359128925
    

    1000 threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloada \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 1000 | tee real-tests-workload-a-1kt.log
    cat real-tests-workload-a-1kt.log
    

    [OVERALL], RunTime(ms), 21484.0
    [OVERALL], Throughput(ops/sec), 4654.626698938745
    [UPDATE], Operations, 49988
    [UPDATE], AverageLatency(us), 9423.208390013604
    

    2000 threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloada \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 2000 | tee real-tests-workload-a-2kt.log
    cat real-tests-workload-a-2kt.log
    

    [OVERALL], RunTime(ms), 24358.0
    [OVERALL], Throughput(ops/sec), 4105.427374989737
    [UPDATE], Operations, 49957
    [UPDATE], AverageLatency(us), 7786.985767760274
    

    And the same for the other workload scenario now:
    100 threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloadf \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 100 | tee real-tests-workload-f-100t.log
    cat real-tests-workload-f-100t.log
    

    [OVERALL], RunTime(ms), 33924.0
    [OVERALL], Throughput(ops/sec), 2947.7655936799906
    [UPDATE], Operations, 50136
    [UPDATE], AverageLatency(us), 17.44125977341631
    

    1000 threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloadf \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 1000 | tee real-tests-workload-f-1kt.log
    cat real-tests-workload-f-1kt.log
    

    [OVERALL], RunTime(ms), 29309.0
    [OVERALL], Throughput(ops/sec), 3411.921252857484
    [UPDATE], Operations, 50127
    [UPDATE], AverageLatency(us), 16.611586570111914
    

    2000 threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloadf \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 2000 | tee real-tests-workload-f-2kt.log
    cat real-tests-workload-f-2kt.log
    

    [OVERALL], RunTime(ms), 29311.0
    [OVERALL], Throughput(ops/sec), 3411.688444611238
    [UPDATE], Operations, 49951
    [UPDATE], AverageLatency(us), 59.80148545574663
    

    3000 threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloadf \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 3000 | tee real-tests-workload-f-3kt.log
    cat real-tests-workload-f-3kt.log
    

    [OVERALL], RunTime(ms), 32314.0
    [OVERALL], Throughput(ops/sec), 3063.6875657609703
    [UPDATE], Operations, 49492
    [UPDATE], AverageLatency(us), 20.00127293299927
    

    4000 threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloadf \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 4000 | tee real-tests-workload-f-4kt.log
    cat real-tests-workload-f-4kt.log
    

    [OVERALL], RunTime(ms), 35051.0
    [OVERALL], Throughput(ops/sec), 2852.985649482183
    [UPDATE], Operations, 50095
    [UPDATE], AverageLatency(us), 38.50611837508733
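
    The runs above differ only in the workload file, the thread count and the log name, so the sweep can be generated rather than typed out. A sketch, assuming the flags are exactly those used in the manual runs; the function ycsb_sweep is our own illustration, and its log names use the plain thread count (for example -1000t) rather than the -1kt shorthand:

```shell
#!/bin/sh
# ycsb_sweep WORKLOAD THREADS...
# Prints one YCSB command line per thread count, with the same flags as the
# manual runs above. Nothing is executed here; pipe the output to `sh` from
# the ycsb directory to actually run the benchmarks.
# Hypothetical helper; the log file naming scheme is our own choice.
ycsb_sweep() {
    workload=$1
    shift
    for threads in "$@"; do
        printf '%s\n' "./bin/ycsb run hbase -p columnfamily=family -P workloads/workload${workload} -p operationcount=100000 -s -threads ${threads} | tee real-tests-workload-${workload}-${threads}t.log"
    done
}
```

    For example, ycsb_sweep a 100 500 1000 2000 prints the four commands for workload a, and appending | sh runs them one after another.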
    

    Let’s now try more instances: 4 slaves instead of just one, of the same type as before.

    elastic-mapreduce --create \
    --hbase \
    --name "EMR HBase YCSB" \
    --num-instances 5 \
    --instance-type m1.large \
    --hadoop-version 1.0.3
    Created job flow j-OE7G6YUHMD2I
    

    elastic-mapreduce --list --active
    j-OE7G6YUHMD2I      WAITING
    ec2-50-17-100-242.compute-1.amazonaws.com         EMR HBase YCSB
    COMPLETED      Start HBase
    

    Now just copy the already built test suite:

    scp -i ~/.ssh/dataminelab-ec2.pem ycsb-0.1.5-SNAPSHOT.tar.gz \
    hadoop@ec2-50-17-100-242.compute-1.amazonaws.com:/home/hadoop/ycsb.tar.gz
    ssh -i ~/.ssh/dataminelab-ec2.pem \
    hadoop@ec2-50-17-100-242.compute-1.amazonaws.com
    
    tar xvzf ycsb.tar.gz
    ln -s ycsb-0.1.5-SNAPSHOT ycsb
    cd ycsb
    

    Initialize table:

    hbase org.apache.hadoop.hbase.util.RegionSplitter usertable -c 200 -f family
    

    Load initial data:

    ./bin/ycsb load hbase \
    -p columnfamily=family \
    -P workloads/workloada | tee load.log
    

    And run tests:
    warm-up

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloada \
    -p columnfamily=family \
    -p operationcount=10000 \
    -s \
    -threads 10 | tee warm-up-tests.log
    

    10 threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloada \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 10 | tee real-tests-workload-a.log
    cat real-tests-workload-a.log
    

    [OVERALL], RunTime(ms), 42609.0
    [OVERALL], Throughput(ops/sec), 2346.9220117815485
    [UPDATE], Operations, 50073
    [UPDATE], AverageLatency(us), 117.53685618996265
    

    100 threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloada \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 100 | tee real-tests-workload-a-100t.log
    cat real-tests-workload-a-100t.log
    

    [OVERALL], RunTime(ms), 23500.0
    [OVERALL], Throughput(ops/sec), 4255.31914893617
    [UPDATE], Operations, 49837
    [UPDATE], AverageLatency(us), 1089.7759295302687
    

    500 threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloada \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 500 | tee real-tests-workload-a-500t.log
    cat real-tests-workload-a-500t.log
    

    [OVERALL], RunTime(ms), 19763.0
    [OVERALL], Throughput(ops/sec), 5059.960532307848
    [UPDATE], Operations, 50196
    [UPDATE], AverageLatency(us), 4854.259104311101
    

    1000 threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloada \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 1000 | tee real-tests-workload-a-1kt.log
    cat real-tests-workload-a-1kt.log
    

    [OVERALL], RunTime(ms), 20028.0
    [OVERALL], Throughput(ops/sec), 4993.0097862991815
    [UPDATE], Operations, 49904
    [UPDATE], AverageLatency(us), 9582.977617024688
    

    2000 threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloada \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 2000 | tee real-tests-workload-a-2kt.log
    cat real-tests-workload-a-2kt.log
    

    [OVERALL], RunTime(ms), 22608.0
    [OVERALL], Throughput(ops/sec), 4423.2130219391365
    [UPDATE], Operations, 49988
    [UPDATE], AverageLatency(us), 6244.29357045691
    

    5000 threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloada \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 5000 | tee real-tests-workload-a-5kt.log
    cat real-tests-workload-a-5kt.log
    

    [OVERALL], RunTime(ms), 24861.0
    [OVERALL], Throughput(ops/sec), 4022.3643457624394
    [UPDATE], Operations, 50100
    [UPDATE], AverageLatency(us), 8150.377125748503
    

    10k threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloada \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 10000 | tee real-tests-workload-a-10kt.log
    cat real-tests-workload-a-10kt.log
    

    [OVERALL], RunTime(ms), 25336.0
    [OVERALL], Throughput(ops/sec), 3946.9529523208084
    [UPDATE], Operations, 50176
    [UPDATE], AverageLatency(us), 8851.578204719388
    

    workload f, 10 threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloadf \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 10 | tee real-tests-workload-f.log
    cat real-tests-workload-f.log
    

    [OVERALL], RunTime(ms), 53310.0
    [OVERALL], Throughput(ops/sec), 1875.8206715438005
    [UPDATE], Operations, 49867
    [UPDATE], AverageLatency(us), 12.18058034371428
    

    100 threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloadf \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 100 | tee real-tests-workload-f-100t.log
    cat real-tests-workload-f-100t.log
    

    [OVERALL], RunTime(ms), 30991.0
    [OVERALL], Throughput(ops/sec), 3226.7432480397533
    [UPDATE], Operations, 50145
    [UPDATE], AverageLatency(us), 13.73040183467943
    

    1k threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloadf \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 1000 | tee real-tests-workload-f-1kt.log
    cat real-tests-workload-f-1kt.log
    

    [OVERALL], RunTime(ms), 29185.0
    [OVERALL], Throughput(ops/sec), 3426.4176803152304
    [UPDATE], Operations, 50047
    [UPDATE], AverageLatency(us), 29.82979998801127
    

    2k threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloadf \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 2000 | tee real-tests-workload-f-2kt.log
    cat real-tests-workload-f-2kt.log
    

    [OVERALL], RunTime(ms), 31906.0
    [OVERALL], Throughput(ops/sec), 3134.206732276061
    [UPDATE], Operations, 50111
    [UPDATE], AverageLatency(us), 24.55253337590549
    

    3k threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloadf \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 3000 | tee real-tests-workload-f-3kt.log
    cat real-tests-workload-f-3kt.log
    

    [OVERALL], RunTime(ms), 34410.0
    [OVERALL], Throughput(ops/sec), 2877.070619006103
    [UPDATE], Operations, 49607
    [UPDATE], AverageLatency(us), 23.37424153849255
    

    Now let’s see how the more powerful instances offered by AWS behave in this scenario!
    m1.xlarge (2x the memory and 2x the CPU of m1.large)

    elastic-mapreduce --create \
    --hbase \
    --name "EMR HBase YCSB" \
    --num-instances 5 \
    --instance-type m1.xlarge \
    --hadoop-version 1.0.3
    Created job flow j-2ICBS9029MJAV
    

    ./elastic-mapreduce --list --active
    j-2ICBS9029MJAV      WAITING
    ec2-107-21-130-111.compute-1.amazonaws.com         EMR HBase YCSB
    COMPLETED      Start HBase
    

    scp -i ~/.ssh/dataminelab-ec2.pem ycsb-0.1.5-SNAPSHOT.tar.gz \
    hadoop@ec2-107-21-130-111.compute-1.amazonaws.com:/home/hadoop/ycsb.tar.gz
    ssh -i ~/.ssh/dataminelab-ec2.pem \
    hadoop@ec2-107-21-130-111.compute-1.amazonaws.com
    
    tar xvzf ycsb.tar.gz
    ln -s ycsb-0.1.5-SNAPSHOT ycsb
    cd ycsb
    

    hbase org.apache.hadoop.hbase.util.RegionSplitter usertable -c 200 -f family
    

    ./bin/ycsb load hbase \
    -p columnfamily=family \
    -P workloads/workloada | tee load.log
    

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloada \
    -p columnfamily=family \
    -p operationcount=10000 \
    -s \
    -threads 10 | tee warm-up-tests.log
    

    10 threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloada \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 10 | tee real-tests-workload-a.log
    cat real-tests-workload-a.log
    

    [OVERALL], RunTime(ms), 39481.0
    [OVERALL], Throughput(ops/sec), 2532.8639092221574
    [UPDATE], Operations, 49981
    [UPDATE], AverageLatency(us), 62.85440467377604
    

    100 threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloada \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 100 | tee real-tests-workload-a-100t.log
    cat real-tests-workload-a-100t.log
    

    [OVERALL], RunTime(ms), 17877.0
    [OVERALL], Throughput(ops/sec), 5593.779716954747
    [UPDATE], Operations, 50100
    [UPDATE], AverageLatency(us), 640.4568662674651
    

    1k threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloada \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s -threads 1000 | tee real-tests-workload-a-1kt.log
    cat real-tests-workload-a-1kt.log
    

    [OVERALL], RunTime(ms), 13986.0
    [OVERALL], Throughput(ops/sec), 7150.00715000715
    [UPDATE], Operations, 49750
    [UPDATE], AverageLatency(us), 8759.566291457286
    

    2k threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloada \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 2000 | tee real-tests-workload-a-2kt.log
    cat real-tests-workload-a-2kt.log
    

    [OVERALL], RunTime(ms), 14783.0
    [OVERALL], Throughput(ops/sec), 6764.526821348847
    [UPDATE], Operations, 50118
    [UPDATE], AverageLatency(us), 26718.534857735744
    

    3k threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloada \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 3000 | tee real-tests-workload-a-3kt.log
    cat real-tests-workload-a-3kt.log
    

    [OVERALL], RunTime(ms), 15477.0
    [OVERALL], Throughput(ops/sec), 6396.588486140725
    [UPDATE], Operations, 49465
    [UPDATE], AverageLatency(us), 12066.01403012231
    

    4k threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloada \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 4000 | tee real-tests-workload-a-4kt.log
    cat real-tests-workload-a-4kt.log
    

    [OVERALL], RunTime(ms), 15261.0
    [OVERALL], Throughput(ops/sec), 6552.650547146321
    [UPDATE], Operations, 49883
    [UPDATE], AverageLatency(us), 22551.664294449012
    

    another workload, 10 threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloadf \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 10 | tee real-tests-workload-f.log
    cat real-tests-workload-f.log
    

    [OVERALL], RunTime(ms), 45751.0
    [OVERALL], Throughput(ops/sec), 2185.744573889095
    [UPDATE], Operations, 49950
    [UPDATE], AverageLatency(us), 9.801721721721721
    

    500 threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloadf \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 500 | tee real-tests-workload-f-500t.log
    cat real-tests-workload-f-500t.log
    

    [OVERALL], RunTime(ms), 21870.0
    [OVERALL], Throughput(ops/sec), 4572.473708276178
    [UPDATE], Operations, 49678
    [UPDATE], AverageLatency(us), 11.18187125085551
    

    1k threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloadf \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 1000 | tee real-tests-workload-f-1kt.log
    cat real-tests-workload-f-1kt.log
    

    [OVERALL], RunTime(ms), 19207.0
    [OVERALL], Throughput(ops/sec), 5206.435153850159
    [UPDATE], Operations, 49879
    [UPDATE], AverageLatency(us), 11.812406022574631
    

    2k threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloadf \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 2000 | tee real-tests-workload-f-2kt.log
    cat real-tests-workload-f-2kt.log
    

    [OVERALL], RunTime(ms), 20493.0
    [OVERALL], Throughput(ops/sec), 4879.715024642561
    [UPDATE], Operations, 50114
    [UPDATE], AverageLatency(us), 12.770423434569182
    

    And now, more CPU power!
    c1.xlarge (same memory, 5x the CPU of m1.large)

    elastic-mapreduce --create \
    --hbase \
    --name "EMR HBase YCSB" \
    --num-instances 5 \
    --instance-type c1.xlarge \
    --hadoop-version 1.0.3
    Created job flow j-3KZHQRG2D74AY
    

    ./elastic-mapreduce --list --active
    j-3KZHQRG2D74AY     WAITING
    ec2-75-101-255-226.compute-1.amazonaws.com          EMR HBase YCSB
    COMPLETED      Start HBase
    

    scp -i ~/.ssh/dataminelab-ec2.pem ycsb-0.1.5-SNAPSHOT.tar.gz \
    hadoop@ec2-75-101-255-226.compute-1.amazonaws.com:/home/hadoop/ycsb.tar.gz
    ssh -i ~/.ssh/dataminelab-ec2.pem \
    hadoop@ec2-75-101-255-226.compute-1.amazonaws.com
    
    tar xvzf ycsb.tar.gz
    ln -s ycsb-0.1.5-SNAPSHOT ycsb
    cd ycsb
    

    hbase org.apache.hadoop.hbase.util.RegionSplitter usertable -c 200 -f family
    

    ./bin/ycsb load hbase \
    -p columnfamily=family \
    -P workloads/workloada | tee load.log
    

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloada \
    -p columnfamily=family \
    -p operationcount=10000 \
    -s \
    -threads 10 | tee warm-up-tests.log
    

    10 threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloada \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 10 | tee real-tests-workload-a.log
    cat real-tests-workload-a.log
    

    [OVERALL], RunTime(ms), 32121.0
    [OVERALL], Throughput(ops/sec), 3113.228106223343
    [UPDATE], Operations, 49973
    [UPDATE], AverageLatency(us), 71.10029415884577
    

    100 threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloada \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 100 | tee real-tests-workload-a-100t.log
    cat real-tests-workload-a-100t.log
    

    [OVERALL], RunTime(ms), 15076.0
    [OVERALL], Throughput(ops/sec), 6633.059166887769
    [UPDATE], Operations, 50167
    [UPDATE], AverageLatency(us), 644.8327187194769
    

    1k threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloada \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 1000 | tee real-tests-workload-a-1kt.log
    cat real-tests-workload-a-1kt.log
    

    [OVERALL], RunTime(ms), 12864.0
    [OVERALL], Throughput(ops/sec), 7773.63184079602
    [UPDATE], Operations, 50240
    [UPDATE], AverageLatency(us), 9889.390306528663
    

    2k threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloada \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 2000 | tee real-tests-workload-a-2kt.log
    cat real-tests-workload-a-2kt.log
    

    [OVERALL], RunTime(ms), 14889.0
    [OVERALL], Throughput(ops/sec), 6716.367788300087
    [UPDATE], Operations, 50216
    [UPDATE], AverageLatency(us), 41222.41986617811
    

    3k threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloada \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 3000 | tee real-tests-workload-a-3kt.log
    cat real-tests-workload-a-3kt.log
    

    [OVERALL], RunTime(ms), 14461.0
    [OVERALL], Throughput(ops/sec), 6845.9995850909345
    [UPDATE], Operations, 49451
    [UPDATE], AverageLatency(us), 51852.53568178601
    

    5k threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloada \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 5000 | tee real-tests-workload-a-5kt.log
    cat real-tests-workload-a-5kt.log
    

    [OVERALL], RunTime(ms), 17072.0
    [OVERALL], Throughput(ops/sec), 5857.544517338331
    [UPDATE], Operations, 49835
    [UPDATE], AverageLatency(us), 82378.54861041436
    

    10k threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloada \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 10000 | tee real-tests-workload-a-10kt.log
    cat real-tests-workload-a-10kt.log
    

    [OVERALL], RunTime(ms), 20226.0
    [OVERALL], Throughput(ops/sec), 4944.131316127757
    [UPDATE], Operations, 50113
    [UPDATE], AverageLatency(us), 49147.25219005049
    

    another workload, 10 threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloadf \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 10 | tee real-tests-workload-f.log
    cat real-tests-workload-f.log
    

    [OVERALL], RunTime(ms), 40801.0
    [OVERALL], Throughput(ops/sec), 2450.920320580378
    [UPDATE], Operations, 49966
    [UPDATE], AverageLatency(us), 12.13715326421967
    

    400 threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloadf \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 400 | tee real-tests-workload-f-400t.log
    cat real-tests-workload-f-400t.log
    

    [OVERALL], RunTime(ms), 17856.0
    [OVERALL], Throughput(ops/sec), 5600.358422939068
    [UPDATE], Operations, 50071
    [UPDATE], AverageLatency(us), 14.301591739729584
    

    500 threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloadf \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 500 | tee real-tests-workload-f-500t.log
    cat real-tests-workload-f-500t.log
    

    [OVERALL], RunTime(ms), 17909.0
    [OVERALL], Throughput(ops/sec), 5583.784689262382
    [UPDATE], Operations, 50210
    [UPDATE], AverageLatency(us), 16.105915156343357
    

    1k threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloadf \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 1000 | tee real-tests-workload-f-1kt.log
    cat real-tests-workload-f-1kt.log
    

    [OVERALL], RunTime(ms), 16982.0
    [OVERALL], Throughput(ops/sec), 5888.5879166175955
    [UPDATE], Operations, 50088
    [UPDATE], AverageLatency(us), 15.313268647180962
    

    2k threads

    ./bin/ycsb run hbase \
    -p columnfamily=family \
    -P workloads/workloadf \
    -p columnfamily=family \
    -p operationcount=100000 \
    -s \
    -threads 2000 | tee real-tests-workload-f-2kt.log
    cat real-tests-workload-f-2kt.log
    

    [OVERALL], RunTime(ms), 17219.0
    [OVERALL], Throughput(ops/sec), 5807.538184563564
    [UPDATE], Operations, 49989
    [UPDATE], AverageLatency(us), 17.61469523295125
    

    Even after running these simple scenarios, we can see how, for a given configuration, the number of threads influences the throughput for each workload type:

    • workload a:
    • workload f:
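
    One way to read this relationship directly from the numbers is to tabulate thread count against Throughput(ops/sec) and pick the peak. A minimal sketch, assuming the two values have been collected as whitespace-separated columns, one run per line; the function name peak_throughput is ours:

```shell
#!/bin/sh
# peak_throughput < table
# Reads "threads throughput" pairs (one per line) from stdin and prints the
# pair with the highest throughput. Hypothetical helper, not part of YCSB.
peak_throughput() {
    awk 'NF >= 2 && (!seen || $2 + 0 > best) {
             best = $2 + 0      # numeric value used for comparison
             line = $1 " " $2   # original text kept for printing
             seen = 1
         }
         END { if (seen) print line }'
}
```

    Fed with the workload a figures from the five-node m1.large cluster above (10, 100, 500, 1000, 2000, 5000 and 10000 threads), it reports 500 threads at about 5060 ops/sec. Among the thread counts we tried, the m1.xlarge and c1.xlarge clusters peaked at 1000 threads instead.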

    You can now play with other instance types and instance counts. You can also run the YCSB benchmark code from multiple nodes at once and watch for possible saturation, either of the master’s CPU or of the network layer.

    We also invite you to play with the code or even contribute features and improvements, so that others can benefit from them too – have fun!

  • BigData events
    May 4, 2012 by Radek Maciaszek

    We are seeing an explosion of BigData events. While half a year ago London hosted maybe one interesting meetup a month, nowadays there is rarely a week without a few of them. Supply is keeping up with demand.

    There is an increasing number of monthly meetups: BigData London, HUG UK, Data Science London, London R, Cassandra London, Neo4j London, London MongoDB User Group, Oracle BigData, Data Visualisation London, Big Data Debate, DeNormalised London, LonData, CloudComputing.

    Upcoming conferences that are worth mentioning:

    We just had a London BigData week that was full of meetings and hackathons dedicated to Hadoop, Visualisations and NoSQL. In case you missed the last Big Data week, you are in for a treat: simply like us on Facebook to have a chance of winning one ticket (worth £495) for 3 days of SkillsMatter NoSQL tutorials.

    There are also a few online places where every data scientist can improve or challenge their skills:

    If you know of anything interesting coming up in London, let us know in the comments.

  • R Analytics in the Cloud
    November 21, 2011 by Radek Maciaszek

    Last week I was invited to Big Data London to talk about “R Analytics in the Cloud”. As a case study, I presented the ageing project I’ve been working on as part of my Master’s studies at Birkbeck, University of London. Ageing is one of the fundamental mysteries in biology and many scientists are already studying this process. I am excited to be part of the research group led by Eugene Schuster at UCL Institute of Healthy Ageing. This project has also given me the chance to use some of my Hadoop experience in the academic field.

    Bioinformatics is the science of applying information technology to biology in order to understand the latter. There are numerous ways in which computers can aid biologists. In this particular project, we have been using microarrays to find the connection between different genes. The use of microarray technologies has enabled us to detect changes to gene expression across the genome in thousands of experiments with hundreds of species. However, interpreting the changes identified in these experiments has been hampered by a lack of knowledge of the gene function. Even in highly studied genomes, approximately 50-60% of genes will be assigned functions, yet less than 30% will be annotated with a highly specific function. Little of the annotation will have been observed in experiments conducted with the species of interest, as most gene function annotation is based on annotations assigned to orthologous genes taken from experiments done with other species, such as yeast and mammalian cell culture.

    We are interested in building a better understanding of gene function in the worm C. elegans by harnessing the large quantity of experimental microarray data in the public database. Currently, we have a database of over fifty curated experiments. With this, we attempt to assign putative functions to genes based on the expression profile across experiments in the public repositories. My role in this project is to help expand the number of curated experiments in the database and study the functions of approximately 1000 genes known to be regulated in long-lived worms, to try to understand the functions of these genes, e.g. by showing experimental evidence of a role in nutrient sensing, innate immunity or stress response.

    Here are the slides from the presentation. Refer to slides 10 and 11 to see how to migrate your R application to the cloud in just 3 lines of code:

    Oh, and did I mention how cool our lab is? Have a look at the following ad, which was made at UCL just a couple of metres from my desk.

    Full disclosure: DataMine Lab is in no way affiliated with Birkbeck or UCL and the above project is part of my individual bioinformatics studies.