Hands-On Tutorials to Implement Concepts
The best engineers have a strong appreciation for the pros and cons of a solution. Each exercise below trains that reflection muscle and further connects the dots.
Companion reading: *Delta Lake - The Definitive Guide - Modern Data Lakehouse Architectures with Data Lakes*
🔹 Foundations
- LH001: Intro to Fabric Lakehouses – Explore building a Lakehouse from scratch, uploading CSVs, and converting them into Delta tables. Compare manual UI vs. PySpark loading methods.
- LH002: Delta Lake – Learn how versioned Parquet files plus JSON transaction logs make up Delta tables with support for INSERT/UPDATE/DELETE and time travel, and why table maintenance is key (vacuuming Parquet files that are no longer referenced).
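The interplay of Parquet data files and JSON log entries can be sketched in plain Python. This is a conceptual model, not the actual Delta implementation: each commit records `add` and `remove` actions, and replaying the log up to a version reconstructs the table exactly as time travel reads it.

```python
import json

# Conceptual sketch (not the real Delta log format): each commit is a JSON
# document listing "add" and "remove" actions against Parquet files.
commits = [
    json.dumps({"version": 0, "add": ["part-000.parquet"], "remove": []}),
    json.dumps({"version": 1, "add": ["part-001.parquet"], "remove": []}),
    # An UPDATE rewrites a file: the old one is removed, a new one added.
    json.dumps({"version": 2, "add": ["part-002.parquet"],
                "remove": ["part-000.parquet"]}),
]

def files_at_version(log, version):
    """Replay the transaction log to reconstruct the file set at `version`."""
    live = set()
    for raw in log[: version + 1]:
        commit = json.loads(raw)
        live |= set(commit["add"])
        live -= set(commit["remove"])
    return sorted(live)

print(files_at_version(commits, 1))  # ['part-000.parquet', 'part-001.parquet']
print(files_at_version(commits, 2))  # ['part-001.parquet', 'part-002.parquet']
```

Note that `part-000.parquet` leaves the log at version 2 but still sits in storage, supporting time travel back to versions 0 and 1. This is exactly why VACUUM maintenance matters: those orphaned files accumulate until they are physically deleted.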
- LH003: Notebook Features – Practice core Notebook skills: multi-Lakehouse connections, language switching, parameter cells, markdown, and snippets.
- LH004: Python Package Management – Install PyPI and custom packages inline or via Environments. Compare quick inline installs vs. managed, CI/CD-friendly Environment deployments, importing the Seaborn visualization package into a notebook via an Environment to see the trade-offs first-hand.
- LH005: Tables & Schemas – Create tables with Spark SQL and the Delta Builder API. Explore schema evolution via createOrReplace and its impact on downstream analytics.
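The downstream impact is easiest to see side by side. The sketch below is illustrative plain Python, not the Delta Builder API: `createOrReplace` lets the new definition win outright, while additive evolution (the `mergeSchema` style) keeps existing columns, so reports selecting a dropped column only break under replace.

```python
def replace_schema(old, new):
    # createOrReplace semantics: the new definition wins outright.
    return dict(new)

def merge_schema(old, new):
    # Additive evolution: keep existing columns, append new ones.
    merged = dict(old)
    merged.update(new)
    return merged

# Illustrative schemas: v2 adds loaded_at but forgets email.
v1 = {"id": "bigint", "name": "string", "email": "string"}
v2 = {"id": "bigint", "name": "string", "loaded_at": "timestamp"}

# Columns that downstream queries lose under each strategy:
print(sorted(set(v1) - set(replace_schema(v1, v2))))  # ['email']
print(sorted(set(v1) - set(merge_schema(v1, v2))))    # []
```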
🔹 Data Loading & Transformation
- LH006: Incremental Loading – Implement MERGE strategies in Spark SQL and the Delta Lake Python API for inserts/updates. Compare SQL's simplicity vs. Python's control.
- LH007–LH011: PySpark Drills (DataFrames → Filtering → GroupBy → Cleaning → Reshaping) – Work through a sequence of Spark drills to master DataFrame transformations, filters, aggregations, deduplication, handling missing values, and reshaping (joins, pivots, unpivots).
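As a warm-up for the drills, the same clean → dedup → aggregate pipeline can be traced in plain Python on toy rows (the Spark DataFrame equivalents are named in the comments; data is illustrative):

```python
rows = [
    {"city": "Oslo",   "amount": 10},
    {"city": "Oslo",   "amount": 10},    # duplicate to drop
    {"city": "Bergen", "amount": 7},
    {"city": "Oslo",   "amount": 5},
    {"city": "Bergen", "amount": None},  # missing value to clean
]

# Cleaning: drop rows with missing amounts (Spark: .dropna())
clean = [r for r in rows if r["amount"] is not None]

# Dedup: keep the first occurrence of each row (Spark: .dropDuplicates())
seen, deduped = set(), []
for r in clean:
    key = (r["city"], r["amount"])
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# GroupBy: total amount per city (Spark: .groupBy("city").sum("amount"))
totals = {}
for r in deduped:
    totals[r["city"]] = totals.get(r["city"], 0) + r["amount"]

print(totals)  # {'Oslo': 15, 'Bergen': 7}
```

Doing it once by hand makes it obvious what each DataFrame method is buying you when the same steps run distributed over millions of rows.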
- LH012: Audible Data Cleaning (Spark) – Rebuild a messy-data cleaning exercise in Spark to highlight its flexibility and performance compared to Dataflow approaches.
🔹 Advanced Lakehouse Techniques
- LH013: Shortcut Strategies – Explore external and internal shortcuts into Lakehouses. Weigh convenience, cost savings, and deployment complexity.
- LH014: Mirroring Strategies – Connect Fabric to external sources (Azure SQL, Snowflake, Databricks Unity Catalog, PostgreSQL, etc.). Learn Open Mirroring for incremental writes and cost-saving scenarios.
- LH015: Orchestration with Notebooks – Use notebookutils to trigger, chain, and DAG-orchestrate multiple notebooks. Compare notebook orchestration vs. pipeline orchestration.
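The essence of DAG orchestration is resolving a valid run order from notebook dependencies, which the standard library can sketch directly. Notebook names below are illustrative; `notebookutils` accepts a similar dependency definition when running notebooks as a DAG.

```python
from graphlib import TopologicalSorter

# Each notebook maps to the notebooks it depends on (names are hypothetical):
dag = {
    "load_bronze":  [],
    "clean_silver": ["load_bronze"],
    "dim_gold":     ["clean_silver"],
    "fact_gold":    ["clean_silver"],
    "report":       ["dim_gold", "fact_gold"],
}

# Resolve an execution order where every notebook runs after its dependencies;
# an orchestrator would trigger each name in turn (or in parallel per layer).
order = list(TopologicalSorter(dag).static_order())
print(order)
```

The same structure is what a pipeline's activity graph encodes visually; the notebook route keeps the DAG in code, which is easier to diff and parameterize but has no built-in retry/monitoring UI.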
🔹 Admin & Ops
- LH016: SQL Endpoint Refresh – Explore the lag between Spark updates and the T-SQL endpoint. Review both the legacy workaround and the new official REST API for endpoint refresh.
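Whichever refresh route you take, the workaround shape is the same: trigger the refresh, then poll until the endpoint reflects the latest Delta writes. The helper below deliberately omits the actual REST URL and payload (those belong to the exercise); the status `check` is injected, so here it is exercised with a stub that succeeds on the third poll.

```python
import time

def wait_until(check, attempts=10, delay=0.0):
    """Call `check()` until it returns True or `attempts` run out."""
    for _ in range(attempts):
        if check():
            return True
        time.sleep(delay)
    return False

# Stubbed endpoint status (a real check would query the refresh API):
state = {"polls": 0}
def fake_status():
    state["polls"] += 1
    return state["polls"] >= 3

print(wait_until(fake_status))  # True
print(state["polls"])           # 3
```

Bounding the attempts matters: if the endpoint never catches up, the orchestration should fail loudly instead of hanging a pipeline run forever.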
