Batch Schedule : 16-Aug-2025 To 17-Sep-2025
Schedule : Mon-Sat
Duration : 50 hrs.
Timings : 7:00 PM To 9:00 PM
Fees : INR 14900/- 12400/- (Incl. 18% GST)
Data Engineers, Python Developers, Freshers
Section 1: Spark Architecture & Internals
- Distributed Computing Fundamentals
- RDD lineage, DAG scheduler, lazy evaluation
- Cluster managers overview
- Spark 4.x Updates
- Adaptive Query Execution (AQE) enhancements
- Catalyst optimizer improvements
- Performance Tuning
- Joins
- Partitioning, broadcast variables
- Memory management
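To give a feel for the lineage and lazy evaluation topics listed in this section, here is a minimal pure-Python sketch. It is not the Spark API: `LazyDataset` and its methods are hypothetical names invented for illustration. Transformations only record a step in the lineage, and nothing runs until an action (`collect`) is called.

```python
class LazyDataset:
    """Hypothetical toy class (not part of Spark): records steps, runs them on an action."""

    def __init__(self, source, lineage=()):
        self._source = source      # any iterable of input records
        self._lineage = lineage    # tuple of (kind, function) steps, oldest first

    def map(self, fn):
        # Transformation: returns a new dataset with one more lineage step; no work is done.
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only recorded, never executed here.
        return LazyDataset(self._source, self._lineage + (("filter", pred),))

    def describe(self):
        # Lists the recorded step kinds, loosely similar in spirit to inspecting a plan.
        return [kind for kind, _ in self._lineage]

    def collect(self):
        # Action: replay the whole lineage over the source and materialize the result.
        rows = iter(self._source)
        for kind, fn in self._lineage:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)


calls = []  # records every invocation of double(), to show when work actually happens


def double(x):
    calls.append(x)
    return x * 2


ds = LazyDataset(range(5)).map(double).filter(lambda x: x > 2)
lineage_before_action = ds.describe()  # steps are recorded...
calls_before_action = len(calls)       # ...but double() has not run yet
result = ds.collect()                  # the action triggers execution
```

Because each dataset keeps its full list of steps, a lost result can be recomputed from the source, which is the idea behind lineage-based recovery.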
Section 2: PySpark DataFrames & SQL
- Data Manipulation
- Complex types (JSON, arrays, maps)
- Window functions, pivot tables, UDFs/Pandas UDFs
- Spark SQL Deep Dive
- Temp views, catalog API, Hive metastore integration
- SQL syntax for Delta Lake operations
- Execution Plans
- Reading `explain()` output
- Predicate pushdown, partition pruning
Section 3: Incremental Data Processing & Apache Kafka
- Structured Streaming
- Event-time processing, watermarking, state management
- Kafka integration (source/sink)
- Delta Lake Essentials
- ACID transactions
- Schema evolution
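The watermarking topic above can be pictured with a small pure-Python sketch. It is a simplification under stated assumptions: the function name `apply_watermark` is hypothetical, and the watermark here advances after every event, whereas Structured Streaming updates it between micro-batches. The core rule is the same: watermark = latest event time seen minus the allowed delay, and events older than the watermark are treated as too late.

```python
from datetime import datetime, timedelta


def apply_watermark(events, delay):
    """Split (event_time, payload) pairs into accepted and dropped payloads.

    Hypothetical helper for illustration only; not a Spark function.
    """
    max_event_time = None
    accepted, dropped = [], []
    for event_time, payload in events:
        # Watermark trails the newest event time seen so far by `delay`.
        watermark = None if max_event_time is None else max_event_time - delay
        if watermark is not None and event_time < watermark:
            dropped.append(payload)      # too late: older than the watermark
        else:
            accepted.append(payload)
            if max_event_time is None or event_time > max_event_time:
                max_event_time = event_time
    return accepted, dropped


t0 = datetime(2025, 8, 16, 19, 0)
events = [
    (t0, "a"),
    (t0 + timedelta(minutes=10), "b"),  # moves the watermark to t0 + 5 min
    (t0 + timedelta(minutes=4), "c"),   # behind the watermark
    (t0 + timedelta(minutes=7), "d"),   # out of order, but still within the delay
]
accepted, dropped = apply_watermark(events, timedelta(minutes=5))
```

With a 5-minute delay, "d" arrives out of order but is still accepted, while "c" falls behind the watermark and is dropped.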
Section 4: Spark Optimizations
- Catalyst Internals
- Logical vs. physical plans
- Custom optimization extensions
- Performance Best Practices
- File formats (Parquet/Delta)
- Resource allocation (executors/cores)
Section 5: Databricks Lakehouse Platform
- Lakehouse fundamentals
- Workspace Navigation
- DBFS, clusters, notebooks
- Delta Lake UI
- Viewing table history/schema
- Data Governance
- Unity Catalog basics (no Admin tasks)
Section 6: Apache Kafka Fundamentals
- Architecture
- Brokers, topics, partitions, consumer groups
- Spark-Kafka Integration
- Structured Streaming with Kafka
- Job execution
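As a rough picture of how topics, partitions, and consumer groups fit together, the sketch below uses only the Python standard library. Both function names are hypothetical, CRC32 stands in for the murmur2 hash used by Kafka's default Java partitioner, and the round-robin assignment is a simplification of Kafka's real assignment strategies.

```python
import zlib


def partition_for(key, num_partitions):
    # Records with the same key always land in the same partition.
    # CRC32 is an assumption for this sketch, not what Kafka itself uses.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions, consumers):
    # Within one consumer group, each partition is read by exactly one consumer.
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


num_partitions = 6
group = ["consumer-1", "consumer-2", "consumer-3", "consumer-4"]
assignment = assign_partitions(num_partitions, group)

# The same key maps to the same partition on every call, which preserves per-key ordering.
first = partition_for("sensor-42", num_partitions)
second = partition_for("sensor-42", num_partitions)
```

With 6 partitions and 4 consumers, two consumers end up owning two partitions each and two own one; adding a seventh consumer to this group would leave one consumer idle.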
Section 7: Spark ML Introduction
- MLlib Workflow
- Transformers vs. estimators, pipelines
- Feature engineering (VectorAssembler, StringIndexer)
- Model Training
- Regression demo (no hyperparameter tuning)
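The transformer vs. estimator distinction can be sketched in plain Python. The classes below are hypothetical toys, not MLlib classes: they imitate the shape of `StringIndexer`, where an estimator's `fit()` learns something from the data and returns a model, and the model's `transform()` applies what was learned. Integer indices are used here for simplicity.

```python
from collections import Counter


class ToyStringIndexerModel:
    """Transformer: applies a label-to-index mapping that was learned earlier."""

    def __init__(self, mapping):
        self.mapping = mapping

    def transform(self, labels):
        # Unseen labels raise KeyError in this sketch.
        return [self.mapping[label] for label in labels]


class ToyStringIndexer:
    """Estimator: fit() inspects the data and returns a fitted transformer."""

    def fit(self, labels):
        counts = Counter(labels)
        # Most frequent label gets index 0; ties are broken alphabetically.
        ordered = sorted(counts, key=lambda label: (-counts[label], label))
        return ToyStringIndexerModel({label: i for i, label in enumerate(ordered)})


training = ["retail", "iot", "retail", "iot", "retail", "web"]
model = ToyStringIndexer().fit(training)     # estimator -> fitted model
indexed = model.transform(["web", "retail"])  # model is a transformer
```

A pipeline chains such stages: fitting the pipeline fits each estimator in order and yields a pipeline model made only of transformers.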
Section 8: Capstone Project
- Pipeline implementation
- Domain Examples: IoT monitoring, retail analytics
1. Python: Language Fundamentals, Functions, Collections, Pandas, ...
2. SQL: CRUD Operations, Group By, Joins, Analytical queries, …
3. Good to have: Linux basics, Hadoop/Hive knowledge
- Local Installation: Spark 4.x, Java 11, Python 3.10
- Cloud: Databricks Community/Free Edition
1. Developer-Centric Focus:
- Covers PySpark application development (coding, debugging, optimization).
- Excludes: Cluster administration, infrastructure setup (YARN/K8s), or Spark cluster tuning.
2. Machine Learning Scope:
- Only introductory-level Spark ML (pipeline structure, basic concepts).
- Excludes: Advanced ML concepts (hyperparameter tuning, etc.), DL frameworks, or MLOps.
3. Language & Environment:
- PySpark (Python API) only – Scala/Java/R APIs not covered.
- Databricks usage focuses on developer work, not account/admin management.
4. Kafka Integration:
- Covers Spark as a Kafka consumer/producer.
- Excludes: Professional Kafka cluster setup, security, or the Kafka Streams API.
5. Infrastructure Assumptions:
- All labs use local/standalone mode or Databricks Community Edition.
17th Nov 25
Sr.No | Batch Code | Start Date | End Date | Time |
---|---|---|---|---|
1 | Spark-O-04 | 16-Aug-2025 | 17-Sep-2025 | 7:00 PM To 9:00 PM |
'Sunbeam Chambers', Plot No.R/2, Market Yard Road, Behind Hotel Fulora, Gultekdi, Pune - 411 037. MH-INDIA.
'Sunbeam IT Park', Second Floor, Phase 2 of Rajiv Gandhi Infotech Park, Hinjawadi, Pune - 411 057. MH-INDIA.