We are entering a new era, propelled by the growth of big data. Businesses and agencies of all sizes are looking to mine their growing stores of data for actionable insight. In fact, 87% of professionals feel that big data analytics will shift the competitive landscape of their industries within the next three years.

Data is being captured from a multitude of sources. The Internet of Things (IoT) is delivering data from sensors on all types of assets. Mobile devices are providing valuable information on user behavior. Social media carries a constant stream of new data. There is no shortage of data, and the volume will continue to grow.


Unlocking data in all of its forms.

There has been a push over the last few years to unlock all of the data that an organization stores. In 2014, Forrester estimated that companies were analyzing only 12% of the data they captured. The drive to unlock that captured data in all of its forms ushered in the Hadoop rush.

Hadoop enabled these organizations to build data lakes, drawing data from multiple sources. It could access data whether it was stored in a relational database management system (RDBMS) or a NoSQL database. The data could be highly structured, as in an ERP system, or unstructured, in the form of tweets and images.


Unlocking pools of data is only the beginning.

What quickly became apparent is that while unlocking data is important, it is only the beginning. The value comes from the insights gleaned from the data. Real-time analytics on the data an organization can access deliver this insight, enabling quick, informed decisions. To do this, algorithms need to be run on the data.

Building algorithms is a difficult task, particularly if data scientists are attempting to build their own algorithms from scratch. We have become accustomed to development platforms, and the solution to this data challenge appears to be data analysis platforms designed to simplify algorithm development. Spark is quickly emerging as the platform of choice.


Enter Apache Spark.

Apache Spark is an open source engine built specifically for data science. It was started in 2009 at UC Berkeley and now has a community of active developers from over 200 companies. Spark has grown because it helps simplify the development of algorithms. It can be used from a variety of programming languages, including Java and Python, and it can access data from multiple sources, including SQL and NoSQL databases.

But what makes Spark the natural successor to Hadoop is the speed with which it can analyze data. Analytics applications can run up to 100x faster than Hadoop MapReduce in memory and up to 10x faster on disk. Spark is designed for large-scale, scale-out data processing.


Who is using Spark?

Apache Spark is having an impact on just about every industry. It is being used at eBay to analyze transactions. NASA is using it to assess climate change. OpenTable is relying on Spark to help it understand customer reviews of the 32,000 restaurants in its network. The Apache Spark project maintains a list of users on its Powered By page that includes TripAdvisor, Samsung, MediaCrossing, IBM Almaden, and more.

IBM calls this new era of digital knowledge the insight economy. IBM views Apache Spark and the open source community as critical to delivering the benefits of actionable intelligence, and it has established the first Spark Technology Center to foster development.

To read more about how Apache Spark is fueling free markets in the insight economy, pick up your eBook here.

Written by Steve Erickson