Spark

A Simple Explanation - By Varsha Saini

What is Spark?

Apache Spark is an open-source, unified analytics engine that is widely used across industries for large-scale data processing and for building big data applications.

Why is Apache Spark a Unified Framework?

Apache Spark is a multi-language engine: it offers APIs in Scala, Java, Python, R, and SQL.

It is a single solution for executing data engineering, data science and machine learning on single-node machines or clusters.

Components of Spark

The following components are built on top of the Spark Core engine.

1. Spark SQL

  • Spark SQL is a Spark module for structured data processing, and it can also act as a distributed SQL query engine.
  • It supports many data sources, including Hive tables, Parquet files, and JSON.

2. Spark Streaming

  • Spark Streaming can be used to build analytical and interactive applications over live data streams.

3. MLlib

  • MLlib is Spark’s scalable machine-learning library, built on top of Spark.
  • It provides implementations of common machine-learning algorithms.

4. GraphX

  • GraphX is Spark’s graph computation engine for building and manipulating graphs.

Why is Industry Migrating to Spark?

The following features are the major reasons industry is migrating to Spark:

  • Spark is a Unified Engine: There is no need to learn specialized tools for different workloads. Spark provides one solution for all of them (SQL, machine learning, streaming), which saves a lot of time, effort, and cost.
  • In-Memory Execution: Spark loads data into RAM and processes it there, avoiding repeated disk I/O between steps. This makes it very fast.
  • Easy to Code: Everything can be written against one system, in the language of your choice.

RDD Fundamentals in Spark

RDD stands for Resilient Distributed Dataset. The RDD is the fundamental data structure of Spark: in Spark, data is represented as RDDs.

RDDs are immutable, i.e. an existing RDD cannot be modified in place; instead, operations produce a new RDD. The two major types of operations on RDDs are:

  1. Transformations
  2. Actions

Spark Architecture

Spark follows a master-slave architecture: the cluster consists of a single master (the driver program, which schedules work) and multiple slaves (worker nodes, which run executor processes).
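Very loosely, this division of labor can be sketched in plain Python (an analogy using threads, not real Spark: in a real cluster the master is the driver and the slaves are separate executor processes on worker nodes):

```python
from concurrent.futures import ThreadPoolExecutor

def work(partition):
    # Each "slave" worker processes its own partition independently.
    return sum(x * x for x in partition)

def run_job(data, n_workers=3):
    # The "master" splits the dataset into partitions...
    partitions = [data[i::n_workers] for i in range(n_workers)]
    # ...assigns one partition to each worker, then combines the partial results.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(work, partitions))
    return sum(partial_sums)

print(run_job(list(range(10))))  # 285
```

The key idea this analogy preserves is that the master never touches the data itself; it only partitions the work, hands it out, and combines the results.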

Why is Spark Faster than Hadoop?

Spark extends the Hadoop MapReduce model. Apache Spark can be up to 100x faster in memory and up to 10x faster on disk than Hadoop MapReduce, largely because it processes data in memory (RAM) instead of writing intermediate results to disk between stages.

Features of Spark

  1. Task Scheduling
  2. Memory Management
  3. Fault Recovery
  4. Interacting with Storage Systems