Azure Data Fundamentals | Yael Bernstein

Key Takeaway:

Day 1 built the foundation (data types, relational vs. non-relational, OLTP vs. OLAP, intro to Cosmos DB & Fabric).
Day 2 expanded into file formats, optimized engines (Spark, Databricks), real-time analytics, and integrated solutions (Fabric + Power BI + Event Stream).

Together, the two days gave a holistic view of Azure’s data services ecosystem, showing how raw data flows from ingestion → governance → analytics → visualization.

📅 Azure Data Fundamentals (DP-900) Virtual Training

Day 1 – March 19, 2025:

Core Data Concepts & Non-Relational Foundations

1. Data Types & Formats

Structured → rows/columns in CSV or relational DBs.
Semi-Structured → JSON (web APIs, system exchange), XML (finance/healthcare), flexible schemas.
Unstructured → text, audio, video; rich in potential insight but harder to process and query.
Optimized Formats → Avro (streaming), ORC/Parquet (compressed, efficient for analytics).
Blob Storage → binary objects for media & large files.
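
To make the structured vs. semi-structured distinction concrete, here is a minimal Python sketch using only the standard library. The sample records and field names (id, name, city, contact, tags) are made up for illustration and were not part of the training material.

```python
import csv
import io
import json

# Structured: every row has the same columns, in the same order.
csv_text = "id,name,city\n1,Ada,London\n2,Grace,New York\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: each record carries its own field names, and
# records may nest values or leave fields out entirely.
json_text = """
[
  {"id": 1, "name": "Ada", "contact": {"email": "ada@example.com"}},
  {"id": 2, "name": "Grace", "tags": ["navy", "compiler"]}
]
"""
docs = json.loads(json_text)

csv_columns = [set(r.keys()) for r in rows]
json_fields = [set(d.keys()) for d in docs]

# Compare the shape of the first two records in each format.
print("CSV rows share one shape:", csv_columns[0] == csv_columns[1])
print("JSON docs share one shape:", json_fields[0] == json_fields[1])
```

The CSV rows are forced into one shape by the header line, while each JSON document describes itself, which is what "flexible schema" means in practice.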

2. Database Models

Relational (OLTP/OLAP): ACID compliance, referential integrity, star schemas for analysis.
Non-Relational: key-value, document, column-family; used for flexible/large-scale scenarios like social media.
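
A small sketch of the relational ideas above (referential integrity and a star schema), using Python's built-in sqlite3 module as a stand-in for a real Azure database. The table names dim_product and fact_sales and all values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default

# One dimension table (descriptive attributes) and one fact table (measures).
conn.executescript("""
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        category   TEXT NOT NULL
    );
    CREATE TABLE fact_sales (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER NOT NULL REFERENCES dim_product(product_id),
        amount     REAL NOT NULL
    );
    INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Games');
    INSERT INTO fact_sales VALUES (1, 1, 10.0), (2, 1, 15.0), (3, 2, 40.0);
""")

# Referential integrity: a sale that points at a missing product is rejected.
try:
    conn.execute("INSERT INTO fact_sales VALUES (4, 99, 5.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Typical analytical query on a star schema: join fact to dimension, aggregate.
totals = conn.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_id = f.product_id
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()

print("orphan row rejected:", rejected)
print(totals)
```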

3. Core Concepts

ETL Pipelines → Extract, Transform, Load for preparing clean data.
SQL Essentials → Views, stored procedures, indexes; trade-offs of Azure SQL DB vs. SQL Server vs. MariaDB.
Partitioning & Keys in non-relational storage for scalability & performance.
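
The ETL idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the raw CSV, the cleaning rules, and the orders table are invented here, and an in-memory SQLite database stands in for a real target store.

```python
import csv
import io
import sqlite3

RAW_CSV = """order_id,customer,amount
1, alice ,19.99
2,BOB,not-a-number
3,carol,5.00
"""

def extract(text):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: fix types, tidy names, drop rows that cannot be parsed."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount is not numeric
        name = row["customer"].strip().title()
        clean.append((int(row["order_id"]), name, amount))
    return clean

def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```

Keeping the three steps as separate functions mirrors how pipeline tools model them as separate activities.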

4. Azure Services Covered

Blob Storage & ADLS Gen2: scalable storage with directory support + analytics integration.
Azure Table Storage: simple NoSQL key-value store.
Azure Cosmos DB: multi-API, partitioned, globally distributed database.
Azure Synapse Analytics & Data Warehousing: structured queries, fact/dimension modeling.
Apache Spark & Databricks: batch + real-time analytics on massive datasets.
Microsoft Fabric: SaaS platform unifying ingestion, processing, analytics, and reporting.

5. Advanced Analytics & Real-Time

Streaming: compare batch (analyze after collection) vs. real-time (analyze as events flow).
Delta Lake + Spark: unify real-time streams with historical data.
Event Hubs, Event Stream, KQL: immediate insights (e.g., detecting breaches, IoT events).
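
The batch vs. real-time contrast can be shown with plain Python. This is a conceptual sketch only; the function names and sensor readings are hypothetical, and a real workload would use a streaming engine rather than a generator.

```python
def batch_average(readings):
    """Batch: wait for the full dataset, then compute once."""
    return sum(readings) / len(readings)

def streaming_average(events):
    """Streaming: update the result as each event arrives."""
    count = 0
    total = 0.0
    for value in events:
        count += 1
        total += value
        yield total / count  # an up-to-date answer after every event

readings = [20.0, 22.0, 27.0, 31.0]

final_batch = batch_average(readings)
running = list(streaming_average(iter(readings)))

print("batch result (available at the end):", final_batch)
print("streaming results (one per event):", running)
```

Both approaches end at the same value; the difference is that the streaming version has a usable answer after every event.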

6. Visualization

Power BI: build dashboards with facts/dimensions, hierarchies, scatter plots, maps, cards.
Seamless Fabric integration → instant reporting on governed data.

Day 2 – March 20, 2025:

Data Formats, Processing Engines & Real-Time Intelligence

1. File Formats & Use Cases

Delimited (CSV/TSV) → simple, widely used, no embedded schema or data types.
JSON → nested, human-readable, core for APIs/IoT.
XML → metadata-rich, schema validation (legacy systems).
Blobs → binary data like images/videos/backups.
Optimized Formats: Avro (row-based, streaming), ORC & Parquet (columnar, analytics & compression).
Tip: Format choice depends on workload (CSV for exchange, Parquet for big data, JSON for APIs, Avro for streaming).
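
Why column-oriented formats such as Parquet and ORC suit analytics can be illustrated in memory. The sketch below is not the actual Parquet or ORC on-disk layout; it only contrasts the two ways of arranging the same (invented) records.

```python
records = [
    {"id": i, "country": "US" if i % 2 else "CA", "amount": i * 1.5}
    for i in range(1000)
]

# Row-oriented layout: all values of one record are stored together.
row_layout = [[r["id"], r["country"], r["amount"]] for r in records]

# Column-oriented layout: all values of one field are stored together.
column_layout = {
    "id": [r["id"] for r in records],
    "country": [r["country"] for r in records],
    "amount": [r["amount"] for r in records],
}

# An analytic query such as SUM(amount) has to step through every row here...
total_from_rows = sum(row[2] for row in row_layout)

# ...but only needs to read a single column here.
total_from_columns = sum(column_layout["amount"])

print(total_from_rows, total_from_columns)
```

Storing similar values next to each other is also what lets columnar formats compress well, while row-based formats such as Avro are convenient when whole records are written one at a time.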

2. Relational DB Deep Dive

Azure SQL Database → managed relational service.
SQL Server (self-managed, e.g., on a VM) → maximum control, more administration and cost.
MariaDB → open-source fork of MySQL, with an Oracle compatibility mode.
Normalization → reduce redundancy, enforce schema clarity.
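
Normalization can be sketched as splitting one repetitive table into two related ones. The flat order list and the surrogate customer_id below are assumptions made for illustration.

```python
# Denormalized input: customer details repeat on every order row.
flat_orders = [
    {"order_id": 1, "customer": "Ada", "email": "ada@example.com", "item": "Book"},
    {"order_id": 2, "customer": "Ada", "email": "ada@example.com", "item": "Pen"},
    {"order_id": 3, "customer": "Grace", "email": "grace@example.com", "item": "Lamp"},
]

customer_ids = {}    # email -> surrogate customer_id
customer_rows = []   # (customer_id, name, email), one row per customer
order_rows = []      # (order_id, customer_id, item), one row per order

for row in flat_orders:
    email = row["email"]
    if email not in customer_ids:
        customer_ids[email] = len(customer_ids) + 1
        customer_rows.append((customer_ids[email], row["customer"], email))
    # Orders keep only a key that points at the customer row.
    order_rows.append((row["order_id"], customer_ids[email], row["item"]))

print(customer_rows)
print(order_rows)
```

After the split, changing a customer's email means updating one row instead of every order that customer placed.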

3. Non-Relational DB & Storage

Cosmos DB → partition key + item id (partition key + row key in the Table API), automatic indexing, global low-latency multi-region access.
ADLS Gen2 → builds on Blob Storage and adds a hierarchical namespace (directories/files) for analytics performance.
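
How a partition key spreads data can be sketched with a simple hash. This is not the hashing scheme Cosmos DB actually uses; PARTITION_COUNT, partition_for, and the deviceId field are hypothetical names chosen for the example.

```python
import hashlib

PARTITION_COUNT = 4  # hypothetical number of physical partitions

def partition_for(partition_key: str) -> int:
    """Map a partition key value to a partition number, deterministically."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % PARTITION_COUNT

items = [
    {"id": "1", "deviceId": "sensor-a", "temp": 20},
    {"id": "2", "deviceId": "sensor-b", "temp": 22},
    {"id": "3", "deviceId": "sensor-a", "temp": 21},
    {"id": "4", "deviceId": "sensor-c", "temp": 19},
]

# Group items by the partition their key hashes to.
partitions = {}
for item in items:
    partitions.setdefault(partition_for(item["deviceId"]), []).append(item)

for number, stored in sorted(partitions.items()):
    print(number, [i["id"] for i in stored])
```

Items with the same key always land together, so a query for one device touches one partition; a key with few distinct values would concentrate load on a few partitions.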

4. Big Data Processing

Apache Spark → distributed processing for both batch + real-time.
Azure Databricks → collaborative, scalable analytics workspace; integrates engineers, analysts, and scientists.
Lakehouse Architecture → combines flexibility of a data lake with the structure of a warehouse.

5. Fabric in Context

OneLake → single logical data lake for the organization; reduces silos/duplication.
Fully Managed SaaS → ingestion, analytics, and visualization on one platform.
Integration with Power BI → direct dashboards without separate infra management.

6. Streaming & Real-Time

Event Stream + KQL DB → capture events as they arrive for instant analytics.
Delta Lake → query historical + real-time data together.
Example Use Cases → fraud detection, security monitoring, live dashboards.
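
A security-monitoring use case like the ones above often comes down to counting events per time window. The sketch below uses fixed (tumbling) windows in plain Python; the event data, WINDOW_SECONDS, and THRESHOLD are invented, and in Fabric this kind of logic would typically be expressed as a KQL query rather than Python.

```python
from collections import Counter

WINDOW_SECONDS = 60  # hypothetical window size
THRESHOLD = 3        # hypothetical alert threshold per user per window

# (timestamp_in_seconds, user) pairs representing failed logins, in arrival order.
events = [
    (5, "alice"), (12, "bob"), (20, "alice"), (41, "alice"),
    (65, "bob"), (70, "alice"), (130, "bob"),
]

def alerts(stream):
    """Flag a (window, user) pair the moment it reaches the threshold."""
    counts = Counter()
    flagged = []
    for timestamp, user in stream:
        window = timestamp // WINDOW_SECONDS  # which fixed window the event falls in
        counts[(window, user)] += 1
        if counts[(window, user)] == THRESHOLD:
            flagged.append((window, user))
    return flagged

flagged = alerts(events)
print(flagged)
```

Because the check runs as each event arrives, the alert is raised inside the window in which the burst happens rather than after a nightly batch job.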

7. Visualization & Action

Power BI → interactive, real-time visuals layered onto Fabric + Event Stream data.
Outcome → end-to-end pipeline: raw data → stream ingestion → analytics → visualization.