Since our last blog post about the Viki analytics infrastructure, a lot of improvements have been implemented. With the old post now outdated, we've decided to write a follow-up to share some of the changes.
Overview (or TL;DR)
We switched to a more flexible Hadoop solution. We had been using Treasure Data, but we felt that we had outgrown the service and wanted to do more. So we decided to run our own Hadoop cluster using AWS EMR. We ended up with a stack that uses the following technologies:
- AWS Elastic Map Reduce
- AWS S3
Upon making the decision to move to AWS EMR, we recognized that the EMR cluster had to satisfy three main requirements.
- It should have a nice user-interface for ad-hoc queries similar to the Treasure Data console.
- The performance should be comparable or at least acceptable.
- The cost should be comparable to what we were spending at the time.
For the user-interface requirement, it so happened that AWS EMR had just released support for Apache HUE (Hadoop User Experience). It proved to be exactly what we needed, so this requirement was satisfied quite quickly.
With the user-interface requirement resolved, we were able to move on to implementing the cluster right away. For the performance and cost requirements, we knew that we had to keep each component of the system performing at an acceptable level.
The first decision we had to make was the storage format of our data. Our data was stored as JSON, but we knew that a binary format would perform better. So we did some research on the different storage formats and ended up choosing Parquet for the following reasons:
- Our event data is quite wide and quite sparse, so we would be better served by a columnar format.
- Parquet's wide adoption by popular Hadoop technologies like Hive, Pig, and Spark ensured that we wouldn't be limited in our technology options.
- Companies like Twitter and Stripe have reported very good experiences with Parquet.
Using Parquet reduced the CPU utilization of our queries by more than a factor of two and greatly reduced execution time.
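To make the columnar intuition concrete, here is a toy sketch in Python. The field names are invented for illustration (not our actual schema), and Parquet's real encoding is far more sophisticated, but the layout difference is the same: with row-oriented JSON, reading one field still means parsing every record, while a columnar layout lets a query touch only the arrays it needs.

```python
import json

# Hypothetical sparse, wide events: many possible fields,
# few present per event (field names are illustrative only).
events = [
    {"video_id": "50c", "country": "SG", "watch_ms": 4000},
    {"video_id": "50c", "device": "ios"},
    {"country": "US", "watch_ms": 1200, "app_ver": "2.1"},
]

# Row-oriented storage (what JSON gives us): every query
# parses every byte of every record.
row_storage = [json.dumps(e) for e in events]

# Column-oriented layout: one array per field, None for absent
# values. A query over only `watch_ms` reads only that array,
# and absent values compress extremely well.
fields = sorted({k for e in events for k in e})
columns = {f: [e.get(f) for e in events] for f in fields}

assert columns["watch_ms"] == [4000, None, 1200]
```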
Having decided on Parquet, we tackled the problem of how best to convert our data. At first we tried to do the transformation in Hive, but this proved too inflexible: we needed to implement logic that is very cumbersome to express in SQL. Once again we looked around for options, and we chose Cascalog because the technology looks very promising and is definitely useful to have in our tool chest. The main selling point of Cascalog for us is that its abstraction sits at a high enough level that it could be ported to other technologies like Spark, Tez, or Storm. The logic-programming DSL approach to ETL also feels very natural.
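As an illustration of the kind of logic that is awkward in SQL but trivial in a general-purpose language, consider sessionizing a stream of event timestamps with an inactivity timeout. The sketch below is plain Python rather than Cascalog, and the timeout value is hypothetical; it is only meant to show the shape of the problem.

```python
SESSION_GAP_SECS = 30 * 60  # hypothetical inactivity timeout

def sessionize(timestamps):
    """Assign a session id to each timestamp (sorted, in seconds).

    A new session starts whenever the gap since the previous event
    exceeds SESSION_GAP_SECS. This running, order-dependent state is
    a one-pass loop here, but cumbersome to express in plain SQL.
    """
    session_ids = []
    current = 0
    prev = None
    for ts in timestamps:
        if prev is not None and ts - prev > SESSION_GAP_SECS:
            current += 1
        session_ids.append(current)
        prev = ts
    return session_ids

# A 40-minute gap splits the stream into two sessions.
assert sessionize([0, 60, 120, 120 + 40 * 60]) == [0, 0, 0, 1]
```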
Storage Layer - Working with the Limitations of EMR
Having chosen the storage format, we tackled the question of where to store the data. We had two options: S3 and HDFS. A limitation of AWS EMR is that once a cluster is stopped, all the data written to HDFS goes away. We wanted a cluster that we could restart without fear of losing data, so we decided to use S3 as our main storage layer. All the data is stored in S3, and HDFS serves only as temporary storage for intermediate results. What worried us about this configuration at first was the latency of pulling data from S3, but after some tests we found the performance acceptable.
Aside from our data being persisted through cluster restarts, using S3 as the storage layer also gave us some added benefits.
- We don't have to run a cluster with very high storage capacity, just enough space for intermediate results.
- We can spin up many clusters that all have access to our data.
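To give a feel for how a shared S3 layer looks in practice, here is a minimal sketch of a date-partitioned layout (the bucket and dataset names are made up, and our actual layout may differ). Because tools like Hive and Spark on EMR can read `s3://` paths directly, any number of clusters can point at the same prefixes.

```python
from datetime import date

def partition_prefix(bucket, dataset, day):
    """Build a date-partitioned S3 prefix for a dataset.

    Hypothetical layout: Hive-style year=/month=/day= partitions,
    so daily jobs read and write only the prefixes they need.
    """
    return "s3://{}/{}/year={:04d}/month={:02d}/day={:02d}/".format(
        bucket, dataset, day.year, day.month, day.day)

assert partition_prefix("analytics-bucket", "events", date(2014, 9, 1)) \
    == "s3://analytics-bucket/events/year=2014/month=09/day=01/"
```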
Right now, we have one cluster that operates continuously and handles our daily jobs and ad-hoc querying needs. From time to time, we spin up a development cluster to try out different cluster configurations and experiment with new technologies. We are also looking into spinning up a separate cluster for the analysis needs of another team at Viki.
Applications - Taking Advantage of New Technologies
The main advantage of running our own Hadoop cluster is being able to use a lot of new technologies. Wanting to squeeze as much benefit as possible from the cluster, we took the time to evaluate Spark, Impala, and Tez.
Because of its popularity, the first application we evaluated was Spark. Getting started was easy thanks to the available EMR bootstrap script, and we had it up and running in no time. The performance improvements when running some of our regular Hive queries were really noticeable. However, we felt that Spark was better suited as an ad-hoc analytics tool than for our ETL use-case. We are satisfied with Spark and are now using it for ad-hoc analysis.
We didn't stop at Spark; we also looked at Impala. Most of our daily summarization jobs are implemented in Hive, and we really wanted to speed them up. Impala provides a very high degree of compatibility with Hive, and the prospect of speed improvements without having to rewrite our queries compelled us to give it a try. Setting up Impala was easy; the only issue was that it works only with data in HDFS. Because we were not amenable to storing our data in HDFS, we had to pass on Impala.
Tez, being an alternative execution engine for Hive and other Hadoop technologies, appealed to us for exactly the same reasons as Impala. Installing Tez wasn't as smooth as Spark or Impala because EMR does not provide a setup script, but even with our own setup script it was still quick to get started. True enough, we noticed improvements in the runtime of our queries: Tez generates better query execution plans and makes better use of the available memory to speed up execution. We now use Tez for all our ETL jobs, and it is available as an option for ad-hoc Hive queries.
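For readers curious how the per-query option works: Hive lets you switch the execution engine per session with a standard setting, so the same query can be pointed at Tez without any rewrite. A sketch (the `events` table here is hypothetical):

```sql
-- Run this session's queries on Tez instead of MapReduce;
-- set it back to mr to compare runtimes.
SET hive.execution.engine=tez;

SELECT country, COUNT(*)
FROM events            -- hypothetical table
GROUP BY country;
```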
Operating Cost - Taking Advantage of Elasticity in EMR
Despite its limitations, EMR offers a lot of flexibility. One of the features we use to reduce our cost is the ability to resize the cluster. Right now, our cluster is configured to run task nodes only during office hours. We noticed that outside office hours, cluster utilization is very low, and whatever does run neither requires many resources nor is time sensitive. This configuration has enabled us to stay within our target monthly Hadoop cost.
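A minimal sketch of the idea, with the office hours and node counts invented for illustration. Task nodes hold no HDFS data, so shrinking them is safe while the core nodes stay fixed; the actual resize goes through the EMR API (e.g. `modify_instance_groups`), which we only indicate in a comment here.

```python
OFFICE_HOURS = range(9, 19)   # hypothetical: 09:00-18:59 local time
PEAK_TASK_NODES = 20          # hypothetical node counts
OFF_PEAK_TASK_NODES = 0

def target_task_nodes(hour):
    """Desired task-node count for a given hour of day.

    Off-hours jobs are few and not time sensitive, so they run
    on the smaller cluster; only the daytime cluster is large.
    """
    if hour in OFFICE_HOURS:
        return PEAK_TASK_NODES
    return OFF_PEAK_TASK_NODES

# Applying it would look something like (not run here):
#   import boto3
#   boto3.client("emr").modify_instance_groups(InstanceGroups=[
#       {"InstanceGroupId": task_group_id,
#        "InstanceCount": target_task_nodes(now.hour)}])

assert target_task_nodes(14) == 20
assert target_task_nodes(3) == 0
```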
Engineering is all about choosing your problems. We initially put off the problem of managing our own Hadoop cluster so we could focus on building a data pipeline that works. Once the pipeline was working, we felt that getting to the next level required a more flexible solution, which in our case meant running our own Hadoop cluster on EMR. Yes, we are managing a lot more now, but given all its benefits, we'd rather have the problem of managing our own cluster than miss out.
Thinking ahead pays off. Ever since the start of development of our data pipeline, we had considered the possibility that we would move off Treasure Data, so everything was designed to be flexible enough to cover that possibility. True enough, we were able to migrate our pipeline in less than two months. Right now, though we are very happy with the flexibility EMR gives us, everything is designed so that if we want to go a step further, whether running our own bare-metal cluster or moving to another cloud-based Hadoop provider, we can.
If you like this article and are looking for an opportunity to work on data infrastructure and related problems, we're growing and looking for a Data Infrastructure Engineer and a Data Scientist to join our team.