Introduction to Big Data and Spark

  • Introduction to Big Data
  • Challenges with Big Data
  • Batch vs. Real-Time Big Data Analytics
  • Batch Analytics – Hadoop Ecosystem Overview
  • Real-Time Analytics Options, Streaming Data – Storm
  • In-Memory Data – Spark
  • What is Spark?
  • Modes of Spark
  • Spark Installation Demo
  • Overview of Spark on a cluster
  • Spark Standalone Cluster

Spark Baby Steps

  • Invoking Spark Shell
  • Loading a File in Shell
  • Performing Some Basic Operations on Files in Spark Shell
  • Building and Running a Spark Project with sbt
  • Caching Overview, Distributed Persistence
  • Spark Streaming Overview
  • Example: Streaming Word Count

Playing with RDDs

  • RDDs
  • Transformations in RDD
  • Actions in RDD
  • Loading Data in RDD
  • Saving Data through RDD
  • Scala and Hadoop Integration: Hands-On

Shark – When Spark Meets Hive (Spark SQL)

  • Why Shark?
  • Installing Shark
  • Running Shark
  • Loading of Data
  • Hive Queries through Spark
  • Testing Tips in Scala
  • Performance Tuning Tips in Spark
  • Shared Variables: Broadcast Variables
  • Shared Variables: Accumulators

Spark Streaming

  • Spark Streaming Architecture
  • First Spark Streaming Program
  • Transformations in Spark Streaming
  • Fault Tolerance in Spark Streaming
  • Checkpointing
  • Level of Parallelism

Spark MLlib

  • Classification Algorithm
  • Clustering Algorithm
  • Sequence Mining Algorithm
  • Collaborative Filtering

Spark GraphX

  • Graph Analysis with Spark
  • GraphX for Graphs
  • Graph-Parallel Computation
  • Installation of Spark and Scala
  • Discussion of Real-Time Use Cases Using Spark
  • Mini Project Implementation in Spark