Applied Machine Learning with Apache Spark


Apache Spark is one of the most useful tools for processing Big Data, both streamed and warehoused. It is a scalable tool for processing structured or semi-structured data, streams and graphs, and for training and applying machine learning models.

Semantive offers customized training for Apache Spark, including the Spark Streaming, Spark SQL, Spark ML and GraphX components. Our instructors work with Spark on a daily basis and are prepared to deliver both theoretical knowledge and practical tips and solutions for real-life problems. The courses are designed to provide you with the knowledge required to design, implement and run highly efficient Spark applications.


  1. Introduction to Big Data processing
    • Big Data problem
    • Batch processing
    • MapReduce paradigm and Hadoop
    • Spark as an enhancement of MapReduce
  2. Datasets processing: Spark Core
    • Installing Spark
    • Distribution of data
    • Data processing: RDD API
      • Transforming and collecting data
      • Broadcasts and accumulators
      • Caching
      • Good and bad practices
    • Data processing: DataFrames and SparkSQL
    • External datastores integration
      • HDFS
      • Cassandra
  3. Streams processing: Spark Streaming
    • Streaming over distributed data
      • Simple data processing
      • Windowing
      • Stateful streaming
    • Resilient streaming
      • Cache configuration
      • Checkpoints
      • Write-ahead logs
    • External systems integration
      • Kafka
  4. Machine learning: Spark ML
    • Introduction to machine learning
      • Necessary math operations review
    • Spark Pipelines API
      • Data preparation
      • Transformers and Estimators
      • Saving and loading pipelines
    • Classification and regression
      • Feature extraction
      • Naive Bayes
      • Logistic and linear regression
      • Random forests
      • Perceptron
      • Hyperparameter tuning
    • Clustering
      • K-means
      • Bisecting k-means
    • Collaborative filtering
  5. Spark Deployment
    • Cluster architecture
      • Using built-in manager
      • Using Mesos to deploy
    • Cluster monitoring

Meet the trainers


Software architect with a background in Big Data processing and machine learning. Has experience designing, developing and deploying various solutions, from a streaming machine learning solution to an isolated software sandbox. Amadeusz conducts training on Apache Cassandra and Apache Spark libraries and holds a BSc in Computer Science as well as an Apache Spark Developer certificate.


Has experience developing web applications using Scala for the backend and AngularJS with TypeScript for the frontend. He is an enthusiast of clean and well-tested code. Marcin is an AWS Certified Solutions Architect (Associate level) and is working toward an Engineer's degree in Computer Science at Warsaw University of Technology. His thesis is related to sequential pattern mining using Spark.

