Introduction to Big Data and Spark
- Introduction to Big Data
- Challenges with Big Data
- Batch vs. Real-Time Big Data Analytics
- Batch Analytics – Hadoop Ecosystem Overview
- Real-Time Analytics Options, Streaming Data – Storm
- In-Memory Data – Spark
- What is Spark?
- Modes of Spark
- Spark Installation Demo
- Overview of Spark on a Cluster
- Spark Standalone Cluster
Spark Baby Steps
- Invoking Spark Shell
- Loading a File in Shell
- Performing Some Basic Operations on Files in Spark Shell
- Building and Running a Spark Project with sbt
- Caching Overview, Distributed Persistence
- Spark Streaming Overview
- Example: Streaming Word Count
Playing with RDDs
- RDDs
- Transformations in RDD
- Actions in RDD
- Loading Data in RDD
- Saving Data through RDD
- Scala and Hadoop Integration Hands-On
Shark – When Spark Meets Hive (Spark SQL)
- Why Shark?
- Installing Shark
- Running Shark
- Loading of Data
- Hive Queries through Spark
- Testing Tips in Scala
- Performance Tuning Tips in Spark
- Shared Variables: Broadcast Variables
- Shared Variables: Accumulators
Spark Streaming
- Spark Streaming Architecture
- First Spark Streaming Program
- Transformations in Spark Streaming
- Fault Tolerance in Spark Streaming
- Checkpointing
- Level of Parallelism
Spark MLlib
- Classification Algorithm
- Clustering Algorithm
- Sequence Mining Algorithm
- Collaborative Filtering
Spark GraphX
- Graph Analysis with Spark
- GraphX for Graphs
- Graph-Parallel Computation
- Installation of Spark and Scala
- Discussion of Real-Time Use Cases Using Spark
- Mini-Project Implementation in Spark