Course Outline
PySpark & Machine Learning
Module 1: Big Data & Spark Foundations
- Overview of the Big Data ecosystem and the role of Spark in modern data platforms
- Understanding Spark architecture: driver, executors, cluster manager, lazy evaluation, DAG and execution planning
- Differences between RDD and DataFrame APIs and when to use each approach
- Creating and configuring SparkSession and understanding application configuration fundamentals
Module 2: PySpark DataFrames
- Reading and writing data from enterprise sources and formats (CSV, JSON, Parquet, Delta)
- Working with PySpark DataFrames: transformations, actions, column expressions, filtering, joins and aggregations
- Implementing advanced operations such as window functions, handling timestamps and working with nested data
- Applying data quality checks and writing reusable, maintainable PySpark code
Module 3: Processing Large Datasets Efficiently
- Understanding performance fundamentals: partitioning strategies, shuffle behaviour, caching and persistence
- Using optimisation techniques including broadcast joins and execution plan analysis
- Efficient processing of large datasets and best practices for scalable data workflows
- Understanding schema evolution and modern storage formats used in enterprise environments
Module 4: Feature Engineering at Scale
- Performing feature engineering with Spark MLlib: handling missing values, encoding categorical variables and feature scaling
- Designing reusable preprocessing steps and preparing datasets for Machine Learning pipelines
- Introduction to feature selection and handling imbalanced datasets
Module 5: Machine Learning with Spark MLlib
- Understanding MLlib architecture and the Estimator/Transformer pattern
- Training regression and classification models at scale (Linear Regression, Logistic Regression, Decision Trees, Random Forest)
- Comparing models and interpreting results in distributed Machine Learning workflows
Module 6: End-to-End ML Pipelines
- Building end-to-end Machine Learning pipelines combining preprocessing, feature engineering and modelling
- Applying train/validation/test split strategies
- Performing cross-validation and hyperparameter tuning using grid search and random search
- Structuring reproducible Machine Learning experiments
Module 7: Model Evaluation & Practical ML Decision Making
- Applying appropriate evaluation metrics for regression and classification problems
- Identifying overfitting and underfitting and making practical model selection decisions
- Interpreting feature importance and understanding model behaviour
Module 8: Production & Enterprise Practices
- Persisting and loading models in Spark
- Implementing batch inference workflows on large datasets
- Understanding the Machine Learning lifecycle in enterprise environments
- Introduction to versioning, experiment tracking concepts and basic testing strategies
Practical Outcome
- Ability to work autonomously with PySpark
- Ability to process large datasets efficiently
- Ability to perform feature engineering at scale
- Ability to build scalable Machine Learning pipelines
Requirements
Participants should have the following background:
- Basic Python programming knowledge, including working with functions, data structures and libraries
- Fundamental understanding of data analysis concepts such as datasets, transformations and aggregations
- Basic knowledge of SQL and relational data concepts
- Introductory understanding of Machine Learning concepts such as training datasets, features and evaluation metrics
- Familiarity with command line environments and basic software development practices is recommended
Experience with Pandas, NumPy or similar data processing libraries is helpful but not mandatory.
Delivery Options
Private Group Training
Our identity is rooted in delivering exactly what our clients need.
- Pre-course call with your trainer
- Customisation of the learning experience to achieve your goals
- Bespoke outlines
- Practical hands-on exercises containing data / scenarios recognisable to the learners
- Training scheduled on a date of your choice
- Delivered online, onsite/classroom or hybrid by experts sharing real world experience
Private group prices: RRP from €6840 for online delivery, based on a group of 2 delegates, plus €2160 per additional delegate (excludes any certification or exam costs). We recommend a maximum group size of 12 for most learning events.
Contact us for an exact quote and to hear about our latest promotions.
Public Training
Please see our public courses