Hands-On Tutorials to Implement Concepts
The best engineers have a strong appreciation for the pros and cons of a solution. Each exercise below trains that reflection muscle and further connects the dots.
Companion reading: *Delta Lake - The Definitive Guide - Modern Data Lakehouse Architectures with Data Lakes*
🔹 Foundations
- LH001: Intro to Fabric Lakehouses – Explore building a Lakehouse from scratch, uploading CSVs, and converting them into Delta tables. Compare manual UI vs. PySpark loading methods.
- LH002: Delta Lake – Learn how versioned Parquet files plus JSON transaction logs make up Delta tables with support for INSERT/UPDATE/DELETE and time travel, and why table maintenance is key (vacuuming Parquet files that are no longer referenced).
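The interplay of Parquet data files and JSON log entries can be sketched in plain Python. This is a conceptual model, not the actual Delta implementation: each commit records `add` and `remove` actions, and replaying the log up to a version reconstructs the table exactly as time travel reads it.

```python
import json

# Conceptual sketch (not the real Delta log format): each commit is a JSON
# document listing "add" and "remove" actions against Parquet files.
commits = [
    json.dumps({"version": 0, "add": ["part-000.parquet"], "remove": []}),
    json.dumps({"version": 1, "add": ["part-001.parquet"], "remove": []}),
    # An UPDATE rewrites a file: the old one is removed, a new one added.
    json.dumps({"version": 2, "add": ["part-002.parquet"],
                "remove": ["part-000.parquet"]}),
]

def files_at_version(log, version):
    """Replay the transaction log to reconstruct the file set at `version`."""
    live = set()
    for raw in log[: version + 1]:
        commit = json.loads(raw)
        live |= set(commit["add"])
        live -= set(commit["remove"])
    return sorted(live)

print(files_at_version(commits, 1))  # ['part-000.parquet', 'part-001.parquet']
print(files_at_version(commits, 2))  # ['part-001.parquet', 'part-002.parquet']
```

Note that `part-000.parquet` leaves the log at version 2 but still sits in storage, supporting time travel back to versions 0 and 1. This is exactly why VACUUM maintenance matters: those orphaned files accumulate until they are physically deleted.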
- LH003: Notebook Features – Practice core Notebook skills: multi-Lakehouse connections, language switching, parameter cells, markdown, and snippets.
- LH004: Python Package Management – Install PyPI and custom packages inline or via Environments. Compare quick inline installs vs. managed, CI/CD-friendly Environment deployments, importing the Seaborn visualization package into a notebook via an Environment to see the trade-offs first-hand.
- LH005: Tables & Schemas – Create tables with Spark SQL and the Delta Builder API. Explore schema evolution via createOrReplace and its impact on downstream analytics.
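The downstream impact is easiest to see side by side. The sketch below is illustrative plain Python, not the Delta Builder API: `createOrReplace` lets the new definition win outright, while additive evolution (the `mergeSchema` style) keeps existing columns, so reports selecting a dropped column only break under replace.

```python
def replace_schema(old, new):
    # createOrReplace semantics: the new definition wins outright.
    return dict(new)

def merge_schema(old, new):
    # Additive evolution: keep existing columns, append new ones.
    merged = dict(old)
    merged.update(new)
    return merged

# Illustrative schemas: v2 adds loaded_at but forgets email.
v1 = {"id": "bigint", "name": "string", "email": "string"}
v2 = {"id": "bigint", "name": "string", "loaded_at": "timestamp"}

# Columns that downstream queries lose under each strategy:
print(sorted(set(v1) - set(replace_schema(v1, v2))))  # ['email']
print(sorted(set(v1) - set(merge_schema(v1, v2))))    # []
```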
🔹 Data Loading & Transformation
- LH006: Incremental Loading – Implement MERGE strategies in Spark SQL and the Delta Lake Python API for inserts/updates. Compare SQL's simplicity vs. Python's control.
- LH007–LH011: PySpark Drills (DataFrames → Filtering → GroupBy → Cleaning → Reshaping) – Work through a sequence of Spark drills to master DataFrame transformations, filters, aggregations, deduplication, handling missing values, and reshaping (joins, pivots, unpivots).
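As a warm-up for the drills, the same clean → dedup → aggregate pipeline can be traced in plain Python on toy rows (the Spark DataFrame equivalents are named in the comments; data is illustrative):

```python
rows = [
    {"city": "Oslo",   "amount": 10},
    {"city": "Oslo",   "amount": 10},    # duplicate to drop
    {"city": "Bergen", "amount": 7},
    {"city": "Oslo",   "amount": 5},
    {"city": "Bergen", "amount": None},  # missing value to clean
]

# Cleaning: drop rows with missing amounts (Spark: .dropna())
clean = [r for r in rows if r["amount"] is not None]

# Dedup: keep the first occurrence of each row (Spark: .dropDuplicates())
seen, deduped = set(), []
for r in clean:
    key = (r["city"], r["amount"])
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# GroupBy: total amount per city (Spark: .groupBy("city").sum("amount"))
totals = {}
for r in deduped:
    totals[r["city"]] = totals.get(r["city"], 0) + r["amount"]

print(totals)  # {'Oslo': 15, 'Bergen': 7}
```

Doing it once by hand makes it obvious what each DataFrame method is buying you when the same steps run distributed over millions of rows.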
- LH012: Audible Data Cleaning (Spark) – Rebuild a messy-data cleaning exercise in Spark to highlight its flexibility and performance compared to Dataflow approaches.
🔹 Advanced Lakehouse Techniques
- LH013: Shortcut Strategies – Explore external and internal shortcuts into Lakehouses. Weigh convenience, cost savings, and deployment complexity.
- LH014: Mirroring Strategies – Connect Fabric to external sources (Azure SQL, Snowflake, Databricks Unity Catalog, PostgreSQL, etc.). Learn Open Mirroring for incremental writes and cost-saving scenarios.
- LH015: Orchestration with Notebooks – Use notebookutils to trigger, chain, and DAG-orchestrate multiple notebooks. Compare notebook orchestration vs. pipeline orchestration.
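The essence of DAG orchestration is resolving a valid run order from notebook dependencies, which the standard library can sketch directly. Notebook names below are illustrative; `notebookutils` accepts a similar dependency definition when running notebooks as a DAG.

```python
from graphlib import TopologicalSorter

# Each notebook maps to the notebooks it depends on (names are hypothetical):
dag = {
    "load_bronze":  [],
    "clean_silver": ["load_bronze"],
    "dim_gold":     ["clean_silver"],
    "fact_gold":    ["clean_silver"],
    "report":       ["dim_gold", "fact_gold"],
}

# Resolve an execution order where every notebook runs after its dependencies;
# an orchestrator would trigger each name in turn (or in parallel per layer).
order = list(TopologicalSorter(dag).static_order())
print(order)
```

The same structure is what a pipeline's activity graph encodes visually; the notebook route keeps the DAG in code, which is easier to diff and parameterize but has no built-in retry/monitoring UI.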
🔹 Admin & Ops
- LH016: SQL Endpoint Refresh – Explore the lag between Spark updates and the T-SQL endpoint. Review both the legacy workaround and the new official REST API for endpoint refresh.
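Whichever refresh route you take, the workaround shape is the same: trigger the refresh, then poll until the endpoint reflects the latest Delta writes. The helper below deliberately omits the actual REST URL and payload (those belong to the exercise); the status `check` is injected, so here it is exercised with a stub that succeeds on the third poll.

```python
import time

def wait_until(check, attempts=10, delay=0.0):
    """Call `check()` until it returns True or `attempts` run out."""
    for _ in range(attempts):
        if check():
            return True
        time.sleep(delay)
    return False

# Stubbed endpoint status (a real check would query the refresh API):
state = {"polls": 0}
def fake_status():
    state["polls"] += 1
    return state["polls"] >= 3

print(wait_until(fake_status))  # True
print(state["polls"])           # 3
```

Bounding the attempts matters: if the endpoint never catches up, the orchestration should fail loudly instead of hanging a pipeline run forever.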
