Tomorrow, Hadoop will turn 10 years old. Apache Hadoop version 0.1.0 was released on April 1, 2006. It has grown from a little-known open source project into a massive ecosystem with rabid fans. I attended O’Reilly Strata+Hadoop in New York last fall, and you could feel the excitement and enthusiasm in the multitude of short keynotes, on the show floor, and in many of the technical sessions I attended.

Hadoop has also had its challenges. According to the Gartner 2015 Hadoop Adoption Study, 54% of survey respondents reported no plan to invest at the time, and only 26% claimed to be deploying, piloting, or experimenting with Hadoop. The survey highlighted skill gaps (cited by 57% of respondents) and the difficulty of figuring out how to get value from Hadoop (cited by 49%).

There’s no shortage of new projects to address Hadoop’s challenges and gaps. One of them is Apache Spark, an open source data processing platform that debuted in October 2012. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

With dozens of high-level operators, APIs in Scala, Java, and Python, in-memory processing, and good integration with the Hadoop ecosystem, many developers in the trenches hope it will address the challenges of fast batch processing of large data sets. For a good introduction to Spark, I recommend Introduction to Apache Spark with Examples and Use Cases and the Spark FAQ.
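To make that concrete, here is a minimal sketch using PySpark, Spark’s Python API. It shows the high-level operators and in-memory caching mentioned above; the HDFS path and column names are hypothetical placeholders, not part of any real data set.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-intro").getOrCreate()

# Read data that already lives in the Hadoop ecosystem (HDFS, in this case).
events = spark.read.csv("hdfs:///data/events.csv",
                        header=True, inferSchema=True)

# Cache the data set in cluster memory so repeated queries avoid re-reading disk.
events.cache()

# High-level operators: filter, group, and count across the cluster
# without writing any explicit parallel code.
daily_counts = (events
                .filter(events.status == "ok")
                .groupBy("event_date")
                .count())

daily_counts.show()
spark.stop()
```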

Spark has generated a lot of excitement and opened new possibilities for how enterprises can take advantage of all the data they store with open source software on low-cost commodity hardware. At SAP, we decided early on that we wanted to help enterprises storing masses of data in Hadoop improve the accuracy of their predictive systems and make sense of this wealth of data.

Bringing Predictive Techniques to Apache Spark

That’s why we have brought automated predictive techniques to Spark in the form of Native Spark Modeling, and the benefits are enormous. First, there is no data movement between the predictive engine and the data source: data manipulation, model training, and retraining happen directly on Hadoop data using the Spark engine.
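Native Spark Modeling itself is part of SAP Predictive Analytics, so the sketch below is only an open source analogy: it uses Spark’s MLlib to illustrate the general idea of training a model directly where the data lives, with no export step. The file path, feature columns, and label are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("train-on-hadoop-data").getOrCreate()

# Load training data directly from Hadoop storage; no movement to a
# separate predictive engine.
df = spark.read.parquet("hdfs:///data/customers.parquet")

# Assemble feature columns into the vector format MLlib expects.
assembler = VectorAssembler(inputCols=["age", "income", "tenure"],
                            outputCol="features")
train = assembler.transform(df)

# Train (and later retrain) the model with Spark's distributed engine.
model = LogisticRegression(labelCol="churned",
                           featuresCol="features").fit(train)

print(model.coefficients)
spark.stop()
```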

By doing so, you can take advantage of the inherent benefits offered by Spark: faster response times, better use of CPUs through distributed processing, and higher scalability.

Another benefit is that analysts can take advantage of Big Data through Spark without having to code. They simply use the self-service, flexible workflow provided by SAP Predictive Analytics while connected to Spark. That alone will make Big Data more accessible to the lines of business and fuel demand for value-added predictive use cases.

In these ways, Native Spark Modeling helps enterprises that use Hadoop make sense of all their data. And what’s not to like about that?

For more on Native Spark Modeling:

Join us here on the Analytics from SAP blog every Thursday for new posts about all things predictive (and read the previous posts in this series here). And follow me on Twitter at @pileroux. We look forward to hearing from you.