Data Engineering Internship 2026

Chapter 5

Code Patterns & References

Use these patterns as starting points. Read the official references alongside each week's activity.

Project 1 — Requirements & Deliverables

What to Deliver

#	Deliverable / Feature	Requirement	Required?
1	Entity-Relationship Diagram (ERD)	Identifies all entities, attributes, and relationships for the given scenario. Shows cardinality (1:1, 1:N, N:M). Uses correct ERD notation.	MVP
2	Star Schema Design	Fact table clearly identified with measurable metrics. At least 3 dimension tables. Foreign keys correctly defined.	MVP
3	Design Justification Document	Written explanation (1-2 pages): why you chose this schema, trade-offs considered, what queries it is optimised for.	MVP
4	Presentation (10 min)	Walk through your ERD and star schema. Explain the design decisions. Handle 3-5 questions from instructors.	MVP
5	Snowflake schema variant	Extend the star schema to snowflake by normalising at least one dimension table. Explain the trade-off.	Stretch

Acceptance Criteria

Check	How It's Verified	Blocking?
ERD covers all entities described in the scenario	Instructor checks against the scenario spec	Yes
Cardinality is marked on every relationship	Visual review during presentation	Yes
Star schema has a clear fact table with numeric measures	Instructor checks the schema diagram	Yes
Presentation covers design decisions and trade-offs	Presentation review	Yes
You can answer questions about normalisation vs denormalisation	Live Q&A during presentation	Yes

What to Submit

Deliverable Checklist

Send to your instructor before the presentation session:
[ ] ERD diagram (image or draw.io export)
[ ] Star schema diagram (image or draw.io export)
[ ] Design justification document (Google Doc or PDF)
[ ] Presentation slides
[ ] GitHub repo link with all materials committed

5.1 ERD Design Checklist

Before finalising your ERD, verify all of these:

Check	What to verify
Every entity is a noun	Entities are things (Movie, Director, User) — not actions (Watching, Rating)
Every attribute is atomic	No multi-valued fields. Split "genres" into a separate table, not a comma-separated string.
Primary keys on every table	Every entity has a unique identifier. Prefer surrogate keys (id SERIAL).
Foreign keys are explicit	Every relationship has a FK on the child table pointing to the PK of the parent.
Cardinality is marked	1:1, 1:N, N:M clearly shown. N:M resolved into a junction table.
No redundant data	If the same data appears in two places, it should be in one place with a FK.

5.2 Star Schema vs Snowflake Schema

Dimension	Star Schema	Snowflake Schema
Structure	Denormalised — all dimension attributes in one flat table	Normalised — dimension attributes split into related tables
Query speed	Faster — fewer joins	Slower — more joins
Storage	More storage — some data repeated	Less storage — data normalised
Maintenance	Easier — fewer tables to manage	Harder — more tables and relationships
When to use	Analytics and reporting (OLAP)	When storage matters or data changes frequently

5.3 SQL Patterns to Know (Week 1)

-- Create a fact table with foreign keys
CREATE TABLE fact_movie_ratings (
  rating_id   SERIAL PRIMARY KEY,
  movie_id    INT REFERENCES dim_movies(movie_id),
  user_id     INT REFERENCES dim_users(user_id),
  date_id     INT REFERENCES dim_date(date_id),
  rating      DECIMAL(3,1),
  review_len  INT
);

-- Basic star schema query
SELECT
  m.title,
  m.genre,
  AVG(f.rating)  AS avg_rating,
  COUNT(*)       AS total_reviews
FROM fact_movie_ratings f
JOIN dim_movies m ON f.movie_id = m.movie_id
JOIN dim_date  d ON f.date_id  = d.date_id
WHERE d.year = 2024
GROUP BY m.title, m.genre
ORDER BY avg_rating DESC;

Week 1 Official References

Category	Resource	What You'll Learn	URL (click or type)
Database	PostgreSQL Docs	Full reference for CREATE TABLE, constraints, data types	postgresql.org/docs/current/
Database	draw.io	Free ERD and diagram tool (browser-based)	app.diagrams.net
Concepts	Kimball Data Warehouse Toolkit	Star schema design principles (authoritative reference)	kimballgroup.com/data-warehouse-business-intelligence-resources/
SQL	SQL Tutorial — W3Schools	Core SQL syntax reference with examples	w3schools.com/sql/
SQL	SQLBolt	Interactive SQL exercises for practice	sqlbolt.com

Project 2 — Requirements & Deliverables

What to Build

#	Deliverable / Feature	Requirement	Required?
1	Shell script: data download	Bash script that downloads a dataset (CSV or JSON) using curl or wget. Handles errors. Logs download completion.	MVP
2	PySpark loader	Reads the downloaded file using PySpark. Applies basic schema inference or explicit schema. Writes to a staging table in PostgreSQL.	MVP
3	Pandas cleaner	Reads the staging table. Applies at least 3 cleaning operations (nulls, type casting, deduplication). Writes clean data to a new table.	MVP
4	SQL queries	At least 5 analytical SQL queries against the clean table. Must include GROUP BY, JOIN, and at least one window function.	MVP
5	README	Documents how to run each step. Explains the dataset. Shows sample query outputs.	MVP
6	Unit tests for cleaning functions	pytest tests for at least 2 Pandas cleaning functions. Tests cover happy path and null/edge cases.	Stretch

Acceptance Criteria

Check	How It's Verified	Blocking?
Shell script runs without errors on a clean machine	Instructor runs it from scratch	Yes
PySpark job loads data into PostgreSQL staging table	Instructor checks the table exists and has data	Yes
Pandas cleaning is documented — what was cleaned and why	Code comments + README	Yes
All 5 SQL queries return results and use correct syntax	Instructor runs each query	Yes
Window function is present in at least one query	Code review	Yes
README explains how to reproduce the pipeline end to end	Instructor follows README on clean environment	Yes

What to Submit

Deliverable Checklist

Send to your instructor before the presentation session:
[ ] GitHub repo link with all code committed
[ ] Short video or live demo showing each pipeline step running
[ ] README with setup instructions and sample outputs
[ ] SQL queries file with comments explaining each query

5.4 Shell Script Template

#!/bin/bash
set -e   # exit immediately on any error

DATA_URL="https://example.com/movies.csv"
OUTPUT_DIR="./data/raw"
OUTPUT_FILE="$OUTPUT_DIR/movies.csv"

mkdir -p "$OUTPUT_DIR"

echo "[$(date)] Starting download from $DATA_URL"
curl -L -o "$OUTPUT_FILE" "$DATA_URL"
echo "[$(date)] Downloaded $(wc -l < $OUTPUT_FILE) lines to $OUTPUT_FILE"

5.5 PySpark: Load CSV to PostgreSQL

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType

spark = SparkSession.builder \
    .appName("MoviesLoader") \
    .config("spark.jars", "/path/to/postgresql-42.7.3.jar") \
    .getOrCreate()

schema = StructType([
    StructField("movie_id",    IntegerType(), True),
    StructField("title",       StringType(),  True),
    StructField("genre",       StringType(),  True),
    StructField("release_year",IntegerType(), True),
    StructField("rating",      FloatType(),   True),
])

df = spark.read.csv("./data/raw/movies.csv", header=True, schema=schema)

DB_URL = "jdbc:postgresql://localhost:5432/bootcamp"
PROPS = {"user":"postgres","password":"postgres","driver":"org.postgresql.Driver"}

df.write.jdbc(url=DB_URL, table="staging_movies", mode="overwrite", properties=PROPS)
print(f"Loaded {df.count()} rows into staging_movies")

5.6 Pandas: Clean and Write

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://postgres:postgres@localhost:5432/bootcamp")

# Read from staging
df = pd.read_sql("SELECT * FROM staging_movies", engine)

# 1. Drop duplicate rows
df = df.drop_duplicates(subset=["movie_id"])

# 2. Fill missing genres
df["genre"] = df["genre"].fillna("Unknown")

# 3. Remove rows with no title
df = df.dropna(subset=["title"])

# 4. Normalise text columns
df["title"] = df["title"].str.strip().str.title()
df["genre"] = df["genre"].str.strip().str.lower()

# 5. Clamp ratings to valid range
df = df[df["rating"].between(0, 10)]

# Write clean data
df.to_sql("clean_movies", engine, if_exists="replace", index=False)
print(f"Wrote {len(df)} clean rows to clean_movies")

5.7 SQL Patterns for Analytics

-- Window function: rank movies by rating within each genre
SELECT
  title,
  genre,
  rating,
  RANK() OVER (PARTITION BY genre ORDER BY rating DESC) AS genre_rank
FROM clean_movies;

-- LAG / LEAD: year-over-year change
SELECT
  release_year,
  COUNT(*)                                            AS movie_count,
  LAG(COUNT(*)) OVER (ORDER BY release_year)         AS prev_year_count,
  COUNT(*) - LAG(COUNT(*)) OVER (ORDER BY release_year) AS yoy_change
FROM clean_movies
GROUP BY release_year
ORDER BY release_year;

Week 2 Official References

Category	Resource	What You'll Learn	URL (click or type)
Python	Python 3 Official Docs	Language reference, standard library	docs.python.org/3/
Pandas	Pandas User Guide	DataFrame operations, I/O, cleaning	pandas.pydata.org/docs/user_guide/
PySpark	PySpark Documentation	SparkSession, DataFrames, SQL	spark.apache.org/docs/latest/api/python/
PostgreSQL	psycopg2 Docs	Python PostgreSQL adapter	psycopg.org/docs/
SQLAlchemy	SQLAlchemy Docs	Python ORM and database toolkit	docs.sqlalchemy.org
SQL	PostgreSQL Window Functions	RANK, LAG, LEAD, ROW_NUMBER explained	postgresql.org/docs/current/tutorial-window.html
Testing	pytest Docs	Python testing framework	docs.pytest.org

Project 3 — Requirements & Deliverables

What to Build

#	Deliverable / Feature	Requirement	Required?
1	dbt project structure	Separate folders for staging, intermediate, and marts models. Naming conventions followed: stg_, int_, fct_, dim_.	MVP
2	Staging models	One stg_ model per source table. Renames columns to snake_case. Casts data types. No business logic.	MVP
3	Intermediate models	At least one int_ model that joins or enriches staging data. Uses CTEs, no SELECT *.	MVP
4	Mart models	At least one fct_ fact table and one dim_ dimension table ready for reporting.	MVP
5	dbt tests	unique and not_null tests on all primary keys. At least one accepted_values test. schema.yml with column descriptions.	MVP
6	Airflow DAG	DAG that runs the full pipeline: extract → load → dbt run → dbt test. Scheduled to run daily.	MVP
7	dbt docs	dbt docs generate + serve. Screenshot of lineage graph submitted.	Stretch
8	Custom singular test	A .sql test file that validates a specific business rule (e.g. no rating above 10).	Stretch

Acceptance Criteria

Check	How It's Verified	Blocking?
dbt run completes with zero errors	Instructor runs: dbt run	Yes
dbt test passes with zero failures	Instructor runs: dbt test	Yes
All models follow staging/intermediate/mart structure	Instructor checks project folder layout	Yes
No SELECT * in any model	Code review	Yes
Airflow DAG runs successfully end to end	Instructor triggers the DAG and checks all tasks green	Yes
schema.yml has descriptions on all models and key columns	Code review	Yes

What to Submit

Deliverable Checklist

Send to your instructor:
[ ] GitHub repo link with full dbt project and Airflow DAG committed
[ ] Screenshot of dbt test results (all green)
[ ] Screenshot of Airflow DAG run (all tasks green)
[ ] README explaining how to run the pipeline locally

5.8 dbt Project Structure

my_project/
  dbt_project.yml
  profiles.yml
  models/
    staging/
      stg_movies.sql
      stg_ratings.sql
      _stg_models.yml      # sources + tests
    intermediate/
      int_movies_enriched.sql
    marts/
      fct_ratings.sql
      dim_movies.sql
      _mart_models.yml
  tests/
    assert_rating_range.sql
  macros/

5.9 dbt Model Templates

Staging model — stg_movies.sql

-- models/staging/stg_movies.sql
WITH source AS (
    SELECT * FROM {{ source("raw", "movies") }}
),

renamed AS (
    SELECT
        movie_id::INT           AS movie_id,
        TRIM(title)             AS title,
        LOWER(TRIM(genre))      AS genre,
        release_year::INT       AS release_year,
        rating::FLOAT           AS rating
    FROM source
    WHERE movie_id IS NOT NULL
)

SELECT * FROM renamed

Mart model — fct_ratings.sql

-- models/marts/fct_ratings.sql
WITH movies AS (
    SELECT * FROM {{ ref("stg_movies") }}
),

enriched AS (
    SELECT * FROM {{ ref("int_movies_enriched") }}
),

final AS (
    SELECT
        m.movie_id,
        m.title,
        m.genre,
        m.release_year,
        e.avg_rating,
        e.total_reviews
    FROM movies m
    LEFT JOIN enriched e USING (movie_id)
)

SELECT * FROM final

Schema tests — _mart_models.yml

# models/marts/_mart_models.yml
version: 2

models:
  - name: fct_ratings
    description: "One row per movie with aggregated rating stats."
    columns:
      - name: movie_id
        description: "Unique movie identifier"
        tests:
          - unique
          - not_null
      - name: genre
        description: "Normalised genre string"
        tests:
          - not_null
      - name: avg_rating
        description: "Mean rating across all reviews"

5.10 Airflow DAG Template

# dags/movies_pipeline.py
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "intern",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="movies_pipeline",
    default_args=default_args,
    schedule="@daily",
    start_date=datetime(2026, 1, 1),
    catchup=False,
) as dag:

    extract = BashOperator(
        task_id="extract",
        bash_command="bash /opt/airflow/scripts/download.sh",
    )

    load = BashOperator(
        task_id="load",
        bash_command="python /opt/airflow/scripts/load_pyspark.py",
    )

    transform = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/airflow/dbt",
    )

    test = BashOperator(
        task_id="dbt_test",
        bash_command="dbt test --project-dir /opt/airflow/dbt",
    )

    extract >> load >> transform >> test

Weeks 3–5 Official References

Category	Resource	What You'll Learn	URL (click or type)
dbt	dbt Core Docs	Project setup, models, tests, macros, sources	docs.getdbt.com
dbt	dbt Best Practices	Staging/intermediate/mart naming and structure	docs.getdbt.com/guides/best-practices
dbt	ref() and source()	How to reference models and declare raw sources	docs.getdbt.com/reference/dbt-jinja-functions/ref
dbt	dbt Tests	Generic tests: unique, not_null, accepted_values	docs.getdbt.com/docs/build/data-tests
Airflow	Airflow Concepts	DAGs, Operators, Tasks, Scheduling explained	airflow.apache.org/docs/apache-airflow/stable/core-concepts/
Airflow	BashOperator	Run shell commands from a DAG	airflow.apache.org/docs/apache-airflow/stable/howto/operator/bash.html
Airflow	Docker Compose Setup	Official local development environment	airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html
SQL	PostgreSQL CTEs	WITH clause for readable, modular queries	postgresql.org/docs/current/queries-with.html

Project 4 — Requirements & Deliverables

What to Build

#	Deliverable / Feature	Requirement	Required?
1	Power BI data model	Connect to PostgreSQL. Import fact and dimension tables. Define relationships. Use star schema layout in Model view.	MVP
2	Base DAX measures	At least 5 measures: Total Count, Average Rating, % of Total, period comparison, and one conditional measure.	MVP
3	Dashboard page 1: Overview	High-level KPIs visible at a glance. Cards, a bar/column chart, a trend line. Page-level filters.	MVP
4	Dashboard page 2: Deep-dive	Interactive drill-through or cross-filter behaviour. Slicers for at least 2 dimensions. One advanced visual.	MVP
5	Data story narrative	A written paragraph (in the report or as a document) explaining: what the data shows, what action it suggests, who the audience is.	MVP
6	Time intelligence measures	SAMEPERIODLASTYEAR, TOTALYTD, or DATESYTD using a proper date dimension.	Stretch

Acceptance Criteria

Check	How It's Verified	Blocking?
Data model uses star schema layout in Power BI Model view	Visual review of the .pbix file	Yes
All relationships defined correctly (no ambiguous relationships)	Instructor inspects Model view	Yes
At least 5 DAX measures defined in a dedicated Measures table	Code review	Yes
Dashboard answers at least 3 specific business questions	Presentation walkthrough	Yes
Slicers and cross-filter interactions work correctly	Live demo during presentation	Yes
Data story explains insight and recommended action	Written narrative reviewed	Yes

What to Submit

Deliverable Checklist

Send to your instructor:
[ ] .pbix file uploaded to the shared folder or GitHub repo
[ ] Screenshot of data model (Model view)
[ ] PDF export of the dashboard
[ ] Data story narrative (in the report or as a separate document)

5.11 Power BI Data Model Checklist

Check	What to verify
Star schema layout	Fact table in the centre. Dimension tables surrounding it. No circular relationships.
Relationship cardinality	Fact-to-dimension: Many-to-One (*:1). Cross-filter direction: Single.
Date table	A proper date dimension table. Mark as Date Table in Power BI. Continuous date range, no gaps.
Measures table	Create an empty table called "Measures" and place all DAX measures there. Never put measures in dimension tables.
No calculated columns for aggregations	Use measures for SUM, AVERAGE, COUNT. Calculated columns for row-level values only.
Hide FK columns	Hide foreign key columns in the fact table from report view — they are only needed for relationships.

5.12 DAX Measure Patterns

-- Basic measures
Total Movies = COUNTROWS(fct_ratings)
Avg Rating = AVERAGE(fct_ratings[avg_rating])
Total Reviews = SUM(fct_ratings[total_reviews])

-- % of total
Rating Share % =
DIVIDE(
    [Total Movies],
    CALCULATE([Total Movies], ALL(dim_movies[genre]))
)

-- Conditional measure
High Rated Movies =
CALCULATE(
    [Total Movies],
    fct_ratings[avg_rating] >= 7.5
)

-- Year-over-year comparison (requires date table)
Movies Last Year =
CALCULATE([Total Movies], SAMEPERIODLASTYEAR(dim_date[date]))

5.13 Data Storytelling Principles

Principle	What it means
Lead with the insight	Start with the conclusion, not the data. "Ratings declined 18% in 2024" not "Here is a chart of ratings over time."
Every visual earns its place	Remove any chart that does not directly support a finding. Less is more.
Use consistent colour	One colour for the primary metric. A second for comparison. Never rainbow charts.
Label the key points	Annotate the highest and lowest values. Label the most important bar or line directly.
State the recommended action	A dashboard without a recommendation is just decoration. End with what to do next.

Weeks 6–7 Official References

Category	Resource	What You'll Learn	URL (click or type)
Power BI	Power BI Desktop Docs	Full product documentation and tutorials	learn.microsoft.com/en-us/power-bi/
DAX	DAX Guide	Every DAX function with examples	dax.guide
DAX	SQLBI DAX Patterns	Production-grade DAX patterns (free)	daxpatterns.com
Power BI	Connect to PostgreSQL	Step-by-step connector setup	learn.microsoft.com/en-us/power-bi/connect-data/desktop-data-sources
Visualisation	Storytelling with Data (Cole Knaflic)	Data visualisation design principles	storytellingwithdata.com/chart-guide