Spark

A Simple Explanation - By Varsha Saini

What is Spark?

Apache Spark is an open-source, unified analytics engine that is widely used across industries for large-scale data processing and for building big data applications.

Why is Apache Spark a Unified Framework?

Apache Spark is a multi-language engine: it offers APIs in Scala, Java, Python, R, and SQL.

It is a single solution for executing data engineering, data science and machine learning on single-node machines or clusters.

Components of Spark

The following components are built on top of the Spark Core engine.

1. Spark SQL

  • Spark SQL is a Spark module for structured data processing, and it can also act as a distributed SQL query engine.
  • It supports many data sources, including Hive tables, Parquet files, and JSON.

2. Spark Streaming

  • Spark Streaming can be used to build analytical and interactive applications over live data streams.

3. MLlib

  • MLlib is Spark’s scalable machine-learning library, built on top of Spark.
  • It provides implementations of common machine-learning algorithms.

4. GraphX

  • GraphX is Spark’s graph computation engine for building and manipulating graphs.

Why is Industry Migrating to Spark?

The following features are the major reasons industry is migrating to Spark:

  • Spark is a Unified Engine: There is no need to learn specialized tools for different workloads. Spark provides one solution for all of them (SQL, machine learning, streaming), which saves a lot of time, effort, and cost.
  • In-Memory Execution: Spark loads data into RAM and processes it there, avoiding repeated disk I/O between steps. This makes it very fast.
  • Easy to Code: Everything can be written against one system, in the language of your choice.

RDD Fundamentals in Spark

RDD stands for Resilient Distributed Dataset. The RDD is the fundamental data structure of Spark: in Spark, data is represented as RDDs.

RDDs are immutable, i.e. an existing RDD cannot be modified in place; instead, operations produce a new RDD. The two major types of operations on RDDs are:

  1. Transformations
  2. Actions

Spark Architecture

Spark follows a master-slave architecture: the cluster consists of a single master (the driver program, which schedules work) and multiple slaves (worker nodes, which run executor processes).
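Very loosely, this division of labor can be sketched in plain Python (an analogy using threads, not real Spark: in a real cluster the master is the driver and the slaves are separate executor processes on worker nodes):

```python
from concurrent.futures import ThreadPoolExecutor

def work(partition):
    # Each "slave" worker processes its own partition independently.
    return sum(x * x for x in partition)

def run_job(data, n_workers=3):
    # The "master" splits the dataset into partitions...
    partitions = [data[i::n_workers] for i in range(n_workers)]
    # ...assigns one partition to each worker, then combines the partial results.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(work, partitions))
    return sum(partial_sums)

print(run_job(list(range(10))))  # 285
```

The key idea this analogy preserves is that the master never touches the data itself; it only partitions the work, hands it out, and combines the results.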

Why is Spark Faster than Hadoop?

Spark extends the Hadoop MapReduce model. Apache Spark can be up to 100x faster in memory and up to 10x faster on disk than Hadoop MapReduce, largely because it processes data in memory (RAM) instead of writing intermediate results to disk between stages.

Features of Spark

  1. Task Scheduling
  2. Memory Management
  3. Fault Recovery
  4. Interacting with Storage Systems