Sunday, November 13, 2011

Hot Swap and Recovery of EC2 Hadoop Cluster

One challenge of running big data infrastructure in the cloud is its unreliability. Take Amazon EC2 as an example: I read a white paper published by a heavy AWS customer saying that roughly 20% of your EC2 nodes will be "bad". If your servers are stateless (e.g. most web/API servers), that's less of a concern, but hosted data infrastructure is very hard to make stateless.

In our case, we once ran a 9-node Hadoop cluster on EC2. Over a 3-month period, we lost 4 of the nodes, which simply stopped showing up in the Hadoop JobTracker dashboard. When we went to fix a node, we usually found that the trouble was not caused by any Hadoop-related failure; rather it was EC2-related, mostly storage or network. Fixing the problem usually requires rebooting the problematic node, but that introduces another problem: data loss.

Here are the best practices and steps we figured out to contain the data loss problem; they have worked quite well for us.

First of all, you need to set the HDFS replication factor to 3. Before launching the Hadoop cluster, edit hdfs-site.xml under the conf subdirectory on your namenode and set "dfs.replication" to 3. It should look like the following:

<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>


With this setting, each block of data on HDFS will be replicated to 3 nodes. You will be fine even if you lose 2 data nodes in your cluster, but you still need to take immediate action as soon as even 1 data node disappears.
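
One caveat worth noting: dfs.replication only applies to files written after the change, so if you raise it on a cluster that already holds data, you may also want to bump the replication factor of the existing files. A command along these lines should do it (the "/" path is just an example, and the -w flag waits for replication to finish, which can take a while):
> hadoop fs -setrep -R -w 3 /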

In case 1, when you see a data node loss, first log in to the failed EC2 instance to see what's wrong there. If you can recover it by restarting the datanode and tasktracker without rebooting the instance, congratulations! Here are the steps you can take (assuming you are running CDH3):

> service hadoop-0.20 datanode stop
> service hadoop-0.20 tasktracker stop
> service hadoop-0.20 datanode start
> service hadoop-0.20 tasktracker start


Try connecting to your JobTracker dashboard now; if you see the correct number of nodes, you are done.
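
If you prefer the command line over the dashboard, you can also confirm from the namenode that the node has rejoined and that no blocks are left under-replicated, for example:
> hadoop dfsadmin -report
> hadoop fsck /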

In case 2, you are not that lucky, but don't worry. Reboot the instance, then check your instance-store storage space with:
> df -h

If you are using m2.4xlarge EC2 instances (highly recommended for building a Hadoop cluster), you will notice that you have lost all of the ephemeral storage on the instance. In our case, that's where HDFS lives (I will explain in a later post why we went with ephemeral storage). So as the first step, we need to remount this storage device:

> umount /mnt
> yes Y | mdadm --create --verbose /dev/md0 --level=0 -c256 --raid-devices=2 /dev/sdb /dev/sdc
> blockdev --setra 65536 /dev/md0
> mkfs.xfs -f /dev/md0
> mount /dev/md0 /mnt
Here we build a RAID0 array from the 2 ephemeral devices that come with the m2.4xlarge instance, create an XFS file system on it, and mount the device at /mnt. Now you will see about 1.7 TB of total space on /mnt if you run df -h.
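
Before moving on, it doesn't hurt to sanity-check that the array and the mount actually came up the way you expect, for example:
> mdadm --detail /dev/md0
> df -h /mnt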

The following steps depend on how you actually configured your Hadoop/HDFS, but if you are using CDH3, below are the recommended steps:

> mkdir -p /mnt/hadoop/dfs/name
> mkdir -p /mnt/hadoop/dfs/data
> mkdir -p /mnt/hadoop/dfs/namesecondary
> chown -R hdfs:hadoop /mnt/hadoop/dfs/name /mnt/hadoop/dfs/data /mnt/hadoop/dfs/namesecondary
> chmod -R a+rw /mnt/hadoop/dfs/name /mnt/hadoop/dfs/data /mnt/hadoop/dfs/namesecondary

> mkdir -p /mnt/tempDirForHadoop
> chmod a+w /mnt/tempDirForHadoop
> mkdir -p /mnt/tempDirForHadoop/dfs/tmp
> chmod -R a+rw /mnt/tempDirForHadoop/dfs/tmp
> mkdir -p /mnt/tempDirForHadoop/s3
> chmod -R a+rw /mnt/tempDirForHadoop/s3
> mkdir -p /mnt/tempDirForHadoop/mapred/local
> chown -R mapred:hadoop /mnt/tempDirForHadoop/mapred/local
> chmod -R a+rw /mnt/tempDirForHadoop/mapred/local
> mkdir -p /mnt/tempDirForHadoop/mapred/temp
> chmod -R a+rw /mnt/tempDirForHadoop/mapred/temp
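
These directories are only useful if they line up with what your configuration files point at. As a rough reminder of which properties own which paths on a stock CDH3/Hadoop 0.20 setup (adjust the values to match your own conf files; this is a reference sketch, not a drop-in config):

<!-- hdfs-site.xml -->
<property><name>dfs.name.dir</name><value>/mnt/hadoop/dfs/name</value></property>
<property><name>dfs.data.dir</name><value>/mnt/hadoop/dfs/data</value></property>
<property><name>fs.checkpoint.dir</name><value>/mnt/hadoop/dfs/namesecondary</value></property>

<!-- core-site.xml -->
<property><name>hadoop.tmp.dir</name><value>/mnt/tempDirForHadoop</value></property>

<!-- mapred-site.xml -->
<property><name>mapred.local.dir</name><value>/mnt/tempDirForHadoop/mapred/local</value></property>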

After this, run:
> service hadoop-0.20 datanode start
> service hadoop-0.20 tasktracker start

Check back on your JobTracker dashboard and you will see the recovered node appear there. However, we are not done yet. Even though there is no data loss for the Hadoop cluster, the newly recovered node has no data on it, and the cluster won't automatically rebalance data among the data nodes unless you instruct it to. Now log in to your namenode, go to the Hadoop home directory, and run:
> bin/start-balancer.sh -threshold 5

5 is a good number to use; the lower the threshold, the more evenly data is distributed across nodes. The tradeoff is that rebalancing consumes network bandwidth to move data among the data nodes. It runs in the background, but it can affect regular MapReduce jobs running on the cluster. The rebalance takes a while depending on how much underlying data there is. If you log in to the recovered data node after a while, you will see the data size under /mnt grow significantly.
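
If the rebalance starts getting in the way of important jobs, you can stop it and resume it later; from the same Hadoop home directory on the namenode:
> bin/stop-balancer.sh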

In case 3, you are even unluckier: the instance is completely gone or non-functional. Again, it's not the end of the world, just a few more miles to travel.
  1. Terminate the bad instance and boot a new one from the pre-built AMI used for your Hadoop cluster.
  2. Go to every node of the cluster, edit the "slaves" file under your Hadoop conf directory, remove the bad instance's IP or DNS name from the list, and add the new replacement node (a scripted version of this step is sketched below).
  3. Follow all the steps under case 2, without starting the datanode and tasktracker services on the replacement node.
  4. Login to the namenode, run following commands:
    > service hadoop-0.20 namenode stop
    > service hadoop-0.20 jobtracker stop
    > service hadoop-0.20 namenode start
    > service hadoop-0.20 jobtracker start
  5. Login to the replacement datanode, run:
    > service hadoop-0.20 datanode start
    > service hadoop-0.20 tasktracker start
  6. Login back to namenode and go to Hadoop home directory, run:
    > bin/start-balancer.sh -threshold 5
Go back to the JobTracker dashboard now and you should see a healthy cluster running.
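
For step 2, a small shell loop can save some manual editing if you have many nodes. This is only a sketch: the host names, the cluster-hosts.txt list, and the conf path are placeholders you will need to adapt to your own setup.
> # placeholders below - substitute your real dead/replacement hosts and slaves file path
> BAD=ip-10-0-0-11.ec2.internal
> NEW=ip-10-0-0-42.ec2.internal
> for h in $(cat cluster-hosts.txt); do ssh $h "sed -i '/$BAD/d' /path/to/hadoop/conf/slaves; echo $NEW >> /path/to/hadoop/conf/slaves"; done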

Except for case 3, where you have to restart your namenode, you can swap out the bad node and recover it without interrupting the cluster's running M/R jobs; that's why we call it a "hot swap".

Sunday, November 6, 2011

A Fast and Big Data Infrastructure based on Amazon EC2

When I was working in the data mining group within Yahoo! SDS, I used to think that only big Internet companies or giant enterprises needed to care about big data and large-scale data mining. This is no longer the case. Big data and data-driven apps will become the driving force of the next-gen Internet and will also revolutionize many other traditional industries. As Twitter/Square co-founder Jack Dorsey put it, "Data will be a product by itself!" He was pushing big data initiatives really hard even during the very early stage of Square so that "we are no longer flying in the dark".

So, what does this mean for other startups with equally or even more ambitious goals? Why do startups need to care about big data? How do you collect and make sense of it? How do you build the infrastructure to deal with big data? How can all the big data work be pragmatic rather than a fancy showcase? This blog sets out to address all of the above questions in a very practical way.

To start with, a typical Internet startup is usually built on some sort of cloud hosting environment. Among the major cloud vendors (Amazon AWS, Rackspace, Google App Engine, Microsoft Azure, etc.), AWS seems to be the most natural choice for startups because of its deep tech stack and battle-hardened infrastructure (though major outages do happen sometimes :) ).

Now, back to the topic: if you run everything on the AWS cloud and want your big data gig, how would you actually build it? Let's answer the question by first splitting the design requirements into 2 major categories:
  1. Big Data : able to store and access large-scale data sets and support deep mining
  2. Fast Data : support production dashboards and near real-time BI needs
Some companies might focus on one more than the other, but eventually both become critical as your business advances. Here at Tapjoy, we are experiencing high data demands in both categories. We are seeing more than 200 million mobile user sessions on a daily basis. Both our internal business folks and external partners want access to granular stats in near real-time fashion. Meanwhile, our engineering and science groups need to build models to power various data-driven applications such as ad ranking optimization, app recommendation, behavioral targeting, etc. So we came up with the architecture below and built it on Amazon AWS:

  • Syslog-ng Log Collector Cluster : responsible for collecting logs from the production web servers and producing batched logs every minute (a minimal config sketch follows this list)
  • Hadoop ETL Cluster : picks up the logs collected by syslog-ng and runs the E (extraction) and T (transformation) steps on the raw logs
  • Hadoop Data Warehouse : pulls post-ETL data from the ETL cluster and loads it into HDFS, ready to be used by various data mining/modeling processes (e.g. Mahout ML algorithms)
  • Vertica DB Cluster : a column-oriented, fast analytical DB for near real-time reporting and dashboarding; it also pulls data from the Hadoop ETL cluster
  • BI Suite : Tableau Desktop is a neat solution for internal analysts' daily drag & drop pivot-style analysis
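
To make the "batched logs every minute" part concrete, a syslog-ng file destination keyed on time macros is enough to cut the stream into per-minute files the ETL cluster can pick up. The snippet below is only an illustrative sketch, not our production config; the source name, port, and paths are made-up placeholders:

# hypothetical syslog-ng.conf fragment - names, port and paths are placeholders
source s_web { udp(ip(0.0.0.0) port(514)); };
destination d_minute_batches {
  file("/mnt/collector/raw/$YEAR$MONTH$DAY/$HOUR$MIN.log" create_dirs(yes));
};
log { source(s_web); destination(d_minute_batches); };
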
Everything in this system runs on the AWS cloud, and it uses both EBS and S3 for data redundancy and backup. There are some nice properties of this system:
  • SCALABLE : it can handle 1 billion raw events/day with a modest number of EC2 nodes. Beyond that, the system scales linearly, simply by adding more nodes.
  • ROBUST : the infrastructure is designed assuming any node can go down at any moment. Every component in the system has a backup plan.
  • FAULT-TOLERANT : system components are properly separated to keep errors from cascading downstream, which helps isolate and solve problems. Data at different stages is actively backed up, which contributes to quick recovery from system failures.
The overall end-to-end delay from production to dashboard is within 30 minutes, and users can perform deep, interactive queries that operate at the most granular level: raw events.

Building and running such a big data infrastructure purely on AWS presents a lot of challenges. In the posts that follow, I will talk about the many tricks we learned along the way to keep our big data up and running healthily on the Cloud.