Applied Machine Learning with Apache Spark


Apache Spark is one of the most useful tools for processing Big Data, both streamed and warehoused. It is a scalable engine for processing structured and semi-structured data, streams and graphs, and for training and applying machine learning models.

Semantive offers customized training courses for Apache Spark, covering the Spark Streaming, Spark SQL, Spark ML and GraphX components. Our instructors work with Spark on a daily basis and are prepared to deliver both theoretical knowledge and practical tips and solutions for real-life problems. The courses are designed to provide you with the knowledge required to design, implement and run highly efficient Spark applications.


  1. Introduction to Big Data processing
    1. Big Data problem
    2. Batch processing
    3. MapReduce paradigm and Hadoop
    4. Spark as an enhancement of MapReduce
  2. Dataset processing: Spark Core
    1. Installing Spark
    2. Distribution of data
    3. Data processing: RDD API
      1. Transforming and collecting data
      2. Broadcast variables and accumulators
      3. Caching
      4. Good and bad practices
    4. Data processing: DataFrames and SparkSQL
    5. Integration with external datastores
      1. HDFS
      2. Cassandra
  3. Stream processing: Spark Streaming
    1. Streaming over distributed data
      1. Simple data processing
      2. Windowing
      3. Stateful streaming
    2. Resilient streaming
      1. Cache configuration
      2. Checkpoints
      3. Write-ahead logs
    3. Integration with external systems
      1. Kafka
  4. Machine learning: Spark ML
    1. Introduction to machine learning
      1. Necessary math operations review
    2. Spark Pipelines API
      1. Data preparation
      2. Transformers and Estimators
      3. Saving and loading pipelines
    3. Classification and regression
      1. Feature extraction
      2. Naive Bayes
      3. Logistic and linear regression
      4. Random forests
      5. Perceptron
      6. Hyperparameter tuning
    4. Clustering
      1. K-means
      2. Bisecting k-means
    5. Collaborative filtering
  5. Spark deployment
    1. Cluster architecture
      1. Using the built-in standalone cluster manager
      2. Deploying with Mesos
    2. Cluster monitoring
