Impala: Another Google-Inspired Platform Enters The Mainstream Data World

Amazon Web Services has added support for Impala, the Google-inspired query tool developed by big data startup Cloudera. It provides real-time, parallel processing for large amounts of data. With Impala, a developer can load new data or access existing data and run queries using an SQL-like language on AWS Elastic MapReduce clusters. It's faster and more accessible, and it shows the increasing use of SQL in Hadoop, the open-source system for distributed computing. In a broader view, Impala reflects how deeply Google influences the market and its inventors to create new data platforms and a potentially richer application ecosystem.

Introduced last year, Impala is based on Google Dremel, the successor to the search company’s pioneering work in the big data analytics space with MapReduce, the technology Google developed to query data stored across its vast cloud universe. As noted by Google Lead Product Manager William Vambenepe, Dremel is also the foundation for BigQuery, Google’s own data analytics platform. Apache Drill is also based on Google Dremel. Hortonworks has announced Tez, which is part of its Stinger Initiative, designed to work with Hive, the data warehouse system for querying data in Hadoop. Hortonworks says Stinger delivers “100x performance improvements at petabyte scale with familiar SQL semantics.”

Citus Data has its own analytics database based on Google Dremel. Its innovation is in using parallel computing with a PostgreSQL core to run its queries. MapR is also supporting Drill to provide its own capabilities. JethroData is an analytics database company based on Hadoop that is leveraging the principles of the Google Dremel project.

Hadapt preceded all of these companies with its “Adaptive Analytical Platform,” which brings a native implementation of SQL to the Apache Hadoop open-source project.

Why Is Dremel The New Inspiration?

Hadoop is an important technology for Internet companies like Twitter that process data by the petabyte. Hadoop is also of increasing importance for more traditional organizations that now must process unprecedented amounts of information. It’s for this new generation of users that Impala is useful. It gives them a way to run queries that previously required deep technical knowledge.

Hadoop has in the past been a complex undertaking, requiring people with multiple talents to unleash its potential. These people were the original data scientists who had learned the art of programming, the management of “clusters,” and data analytics. They emerged from Internet companies that needed to invent their own ways to process and analyze the vast amounts of data that they served. For example, Jeff Hammerbacher left Facebook to be one of Cloudera’s co-founders. Doug Cutting created Hadoop while at Yahoo! where he used it to help develop an open-source search engine based on Lucene, which Cutting also originally created. Cutting also now works at Cloudera.

Google led the way with MapReduce, which treats a set of nodes as a cluster that processes data in parallel. It maps the work across the nodes of the cluster and then reduces the intermediate results to answer a problem.
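To make the map-and-reduce idea concrete, here is a minimal single-machine sketch in Python. It is only an illustration of the pattern, not Google's or Hadoop's implementation: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample text are assumptions made up for this example, and a real cluster would run the map and reduce steps on many nodes at once.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # On a cluster, each chunk would be mapped on a different node in parallel.
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: two chunks standing in for data spread across nodes.
chunks = ["big data big queries", "data at Google scale"]
counts = word_count(chunks)
print(counts)
```

Even this toy word count takes several hand-written functions. In an SQL-like language the same aggregation is roughly a single GROUP BY query, which is a large part of the appeal of the tools described here.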

Going beyond MapReduce, Google Dremel represents a pillar for the next generation of Hadoop technologies, fortified by a growing ecosystem of open-source projects such as Hive and Pig, all designed to abstract the complexity of MapReduce with higher-level languages.

The strength of Dremel is in its instant analysis. But it is primarily meant for querying, while its counterpart, Google F1, is a massive relational database, originally designed to manage Google’s online advertising.

Impala’s value comes with its aptitude for analysis. That is why it is viewed as a natural complement to business intelligence tools such as Tableau, the data visualization technology. Analysts can quickly query data with Impala and then work with the results in their business intelligence tool of choice.

Hadoop has largely not been viewed as a platform for app development. But that will likely change as Impala becomes more widely used and new pieces get added to the Hadoop environment. That became evident earlier this year with the latest version of Hadoop. The new version brings YARN, which separates cluster resource management and scheduling from the MapReduce processing engine. It allows for scaling beyond what was possible with Hadoop before.

The application ecosystem that will come out of Hadoop is evident in both Impala and YARN. Both simplify Hadoop and provide a deeper capability for the end user. And then there is Cascading, the application framework for Hadoop, which Concurrent has commercialized. It counts Twitter, Etsy and Airbnb as customers.

For a long time, Google has been ahead of the market. But Hadoop and the innovation at the platform layer show that the difference between Google and its counterparts is starting to shrink.

(Feature image courtesy of Electric Sheep via Creative Commons)