Last November, I wrote a post called Big Data, Bigger Brains. In that post, I wrote about how we were able to make business intelligence (BI) work in a Big Data environment at Klout. That was a big step forward for Klout, but our work wasn’t yet done. Recently we launched a whole new website and a new Klout Score that substantially upped the stakes.
Really Big Data
At Klout, we need to perform quick, deep analysis on vast amounts of user data. We need to set up complex alerts and monitors to ensure that our data collection and processing is accurate and timely. With our new Score, we increased the amount of signals we collected by four times. This means that we now collect and normalize more than 12 billion signals a day into a Hive data warehouse of more than 1 trillion rows. In addition, we have hundreds of millions of user profiles that translate into a massive “customer” dimension, rich with attributes. Our existing configuration of connecting Hive to SQL Server Analysis Services (SSAS) by using MySql as a staging area was no longer feasible.
What I really want for Christmas…
So, what’s wrong with this story? What I really want is the OLAP engine itself to reside alongside of Hive/Hadoop, rather than live alone in a non-clustered environment. If the multi-dimensional engine resided inside HDFS, we could eliminate the double-write (write to HDFS, write to the cube) and leverage the aggregate memory and disk available across the Hadoop cluster for virtually unlimited scale out. As an added benefit, a single write would eliminate latency and vastly simplify the operational environment. I can dream, can’t I?
Do you want to take on Big Data challenges like this? We’re looking for great engineers who can think out of the box. Check us out.