Data Engineering Internship 2026
Chapter 5
Code Patterns & References
Use these patterns as starting points. Read the official references alongside each week's activity.
WEEK 1 | Database Design & Data Warehousing
Project 1 — Requirements & Deliverables
Project 1: Database Design
Duration: Week 1 | Individual project | Ends with a live presentation
Purpose: Demonstrate understanding of relational database design, data warehousing concepts, and how to communicate a data model to a non-technical audience.
What to Deliver
| # | Deliverable / Feature | Requirement | Required? |
|---|
| 1 | Entity-Relationship Diagram (ERD) | Identifies all entities, attributes, and relationships for the given scenario. Shows cardinality (1:1, 1:N, N:M). Uses correct ERD notation. | MVP |
| 2 | Star Schema Design | Fact table clearly identified with measurable metrics. At least 3 dimension tables. Foreign keys correctly defined. | MVP |
| 3 | Design Justification Document | Written explanation (1-2 pages): why you chose this schema, trade-offs considered, what queries it is optimised for. | MVP |
| 4 | Presentation (10 min) | Walk through your ERD and star schema. Explain the design decisions. Handle 3-5 questions from instructors. | MVP |
| 5 | Snowflake schema variant | Extend the star schema to snowflake by normalising at least one dimension table. Explain the trade-off. | Stretch |
Acceptance Criteria
| Check | How It's Verified | Blocking? |
|---|
| ERD covers all entities described in the scenario | Instructor checks against the scenario spec | Yes |
| Cardinality is marked on every relationship | Visual review during presentation | Yes |
| Star schema has a clear fact table with numeric measures | Instructor checks the schema diagram | Yes |
| Presentation covers design decisions and trade-offs | Presentation review | Yes |
| You can answer questions about normalisation vs denormalisation | Live Q&A during presentation | Yes |
What to Submit
Deliverable Checklist
- Send to your instructor before the presentation session:
- [ ] ERD diagram (image or draw.io export)
- [ ] Star schema diagram (image or draw.io export)
- [ ] Design justification document (Google Doc or PDF)
- [ ] Presentation slides
- [ ] GitHub repo link with all materials committed
5.1 ERD Design Checklist
Before finalising your ERD, verify all of these:
| Check | What to verify |
|---|
| Every entity is a noun | Entities are things (Movie, Director, User) — not actions (Watching, Rating) |
| Every attribute is atomic | No multi-valued fields. Split "genres" into a separate table, not a comma-separated string. |
| Primary keys on every table | Every entity has a unique identifier. Prefer surrogate keys (id SERIAL). |
| Foreign keys are explicit | Every relationship has a FK on the child table pointing to the PK of the parent. |
| Cardinality is marked | 1:1, 1:N, N:M clearly shown. N:M resolved into a junction table. |
| No redundant data | If the same data appears in two places, it should be in one place with a FK. |
5.2 Star Schema vs Snowflake Schema
| Dimension | Star Schema | Snowflake Schema |
|---|
| Structure | Denormalised — all dimension attributes in one flat table | Normalised — dimension attributes split into related tables |
| Query speed | Faster — fewer joins | Slower — more joins |
| Storage | More storage — some data repeated | Less storage — data normalised |
| Maintenance | Easier — fewer tables to manage | Harder — more tables and relationships |
| When to use | Analytics and reporting (OLAP) | When storage matters or data changes frequently |
5.3 SQL Patterns to Know (Week 1)
-- Create a fact table with foreign keys
CREATE TABLE fact_movie_ratings (
rating_id SERIAL PRIMARY KEY,
movie_id INT REFERENCES dim_movies(movie_id),
user_id INT REFERENCES dim_users(user_id),
date_id INT REFERENCES dim_date(date_id),
rating DECIMAL(3,1),
review_len INT
);
-- Basic star schema query
SELECT
m.title,
m.genre,
AVG(f.rating) AS avg_rating,
COUNT(*) AS total_reviews
FROM fact_movie_ratings f
JOIN dim_movies m ON f.movie_id = m.movie_id
JOIN dim_date d ON f.date_id = d.date_id
WHERE d.year = 2024
GROUP BY m.title, m.genre
ORDER BY avg_rating DESC;
Week 1 Official References
WEEK 2 | Core Data Engineering Tools
Project 2 — Requirements & Deliverables
Project 2: Core Tools Pipeline
Duration: Week 2 | Individual project | Deployed pipeline, presented at end of week
Purpose: Build a working end-to-end data pipeline that demonstrates proficiency with shell scripting, PySpark, Pandas, PostgreSQL, and SQL.
What to Build
| # | Deliverable / Feature | Requirement | Required? |
|---|
| 1 | Shell script: data download | Bash script that downloads a dataset (CSV or JSON) using curl or wget. Handles errors. Logs download completion. | MVP |
| 2 | PySpark loader | Reads the downloaded file using PySpark. Applies basic schema inference or explicit schema. Writes to a staging table in PostgreSQL. | MVP |
| 3 | Pandas cleaner | Reads the staging table. Applies at least 3 cleaning operations (nulls, type casting, deduplication). Writes clean data to a new table. | MVP |
| 4 | SQL queries | At least 5 analytical SQL queries against the clean table. Must include GROUP BY, JOIN, and at least one window function. | MVP |
| 5 | README | Documents how to run each step. Explains the dataset. Shows sample query outputs. | MVP |
| 6 | Unit tests for cleaning functions | pytest tests for at least 2 Pandas cleaning functions. Tests cover happy path and null/edge cases. | Stretch |
Acceptance Criteria
| Check | How It's Verified | Blocking? |
|---|
| Shell script runs without errors on a clean machine | Instructor runs it from scratch | Yes |
| PySpark job loads data into PostgreSQL staging table | Instructor checks the table exists and has data | Yes |
| Pandas cleaning is documented — what was cleaned and why | Code comments + README | Yes |
| All 5 SQL queries return results and use correct syntax | Instructor runs each query | Yes |
| Window function is present in at least one query | Code review | Yes |
| README explains how to reproduce the pipeline end to end | Instructor follows README on clean environment | Yes |
What to Submit
Deliverable Checklist
- Send to your instructor before the presentation session:
- [ ] GitHub repo link with all code committed
- [ ] Short video or live demo showing each pipeline step running
- [ ] README with setup instructions and sample outputs
- [ ] SQL queries file with comments explaining each query
5.4 Shell Script Template
#!/bin/bash
set -e # exit immediately on any error
DATA_URL="https://example.com/movies.csv"
OUTPUT_DIR="./data/raw"
OUTPUT_FILE="$OUTPUT_DIR/movies.csv"
mkdir -p "$OUTPUT_DIR"
echo "[$(date)] Starting download from $DATA_URL"
curl -L -o "$OUTPUT_FILE" "$DATA_URL"
echo "[$(date)] Downloaded $(wc -l < $OUTPUT_FILE) lines to $OUTPUT_FILE"
5.5 PySpark: Load CSV to PostgreSQL
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType
spark = SparkSession.builder \
.appName("MoviesLoader") \
.config("spark.jars", "/path/to/postgresql-42.7.3.jar") \
.getOrCreate()
schema = StructType([
StructField("movie_id", IntegerType(), True),
StructField("title", StringType(), True),
StructField("genre", StringType(), True),
StructField("release_year",IntegerType(), True),
StructField("rating", FloatType(), True),
])
df = spark.read.csv("./data/raw/movies.csv", header=True, schema=schema)
DB_URL = "jdbc:postgresql://localhost:5432/bootcamp"
PROPS = {"user":"postgres","password":"postgres","driver":"org.postgresql.Driver"}
df.write.jdbc(url=DB_URL, table="staging_movies", mode="overwrite", properties=PROPS)
print(f"Loaded {df.count()} rows into staging_movies")
5.6 Pandas: Clean and Write
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine("postgresql://postgres:postgres@localhost:5432/bootcamp")
# Read from staging
df = pd.read_sql("SELECT * FROM staging_movies", engine)
# 1. Drop duplicate rows
df = df.drop_duplicates(subset=["movie_id"])
# 2. Fill missing genres
df["genre"] = df["genre"].fillna("Unknown")
# 3. Remove rows with no title
df = df.dropna(subset=["title"])
# 4. Normalise text columns
df["title"] = df["title"].str.strip().str.title()
df["genre"] = df["genre"].str.strip().str.lower()
# 5. Clamp ratings to valid range
df = df[df["rating"].between(0, 10)]
# Write clean data
df.to_sql("clean_movies", engine, if_exists="replace", index=False)
print(f"Wrote {len(df)} clean rows to clean_movies")
5.7 SQL Patterns for Analytics
-- Window function: rank movies by rating within each genre
SELECT
title,
genre,
rating,
RANK() OVER (PARTITION BY genre ORDER BY rating DESC) AS genre_rank
FROM clean_movies;
-- LAG / LEAD: year-over-year change
SELECT
release_year,
COUNT(*) AS movie_count,
LAG(COUNT(*)) OVER (ORDER BY release_year) AS prev_year_count,
COUNT(*) - LAG(COUNT(*)) OVER (ORDER BY release_year) AS yoy_change
FROM clean_movies
GROUP BY release_year
ORDER BY release_year;
Week 2 Official References
WEEKS 3–5 | Data Pipelines: dbt + Airflow
Project 3 — Requirements & Deliverables
Project 3: ETL Pipeline with dbt + Airflow
Duration: Weeks 3–5 | Individual project
Purpose: Build a fully orchestrated ELT pipeline. dbt handles SQL transformations across staging, intermediate, and mart layers. Airflow schedules and monitors the pipeline end to end.
What to Build
| # | Deliverable / Feature | Requirement | Required? |
|---|
| 1 | dbt project structure | Separate folders for staging, intermediate, and marts models. Naming conventions followed: stg_, int_, fct_, dim_. | MVP |
| 2 | Staging models | One stg_ model per source table. Renames columns to snake_case. Casts data types. No business logic. | MVP |
| 3 | Intermediate models | At least one int_ model that joins or enriches staging data. Uses CTEs, no SELECT *. | MVP |
| 4 | Mart models | At least one fct_ fact table and one dim_ dimension table ready for reporting. | MVP |
| 5 | dbt tests | unique and not_null tests on all primary keys. At least one accepted_values test. schema.yml with column descriptions. | MVP |
| 6 | Airflow DAG | DAG that runs the full pipeline: extract → load → dbt run → dbt test. Scheduled to run daily. | MVP |
| 7 | dbt docs | dbt docs generate + serve. Screenshot of lineage graph submitted. | Stretch |
| 8 | Custom singular test | A .sql test file that validates a specific business rule (e.g. no rating above 10). | Stretch |
Acceptance Criteria
| Check | How It's Verified | Blocking? |
|---|
| dbt run completes with zero errors | Instructor runs: dbt run | Yes |
| dbt test passes with zero failures | Instructor runs: dbt test | Yes |
| All models follow staging/intermediate/mart structure | Instructor checks project folder layout | Yes |
| No SELECT * in any model | Code review | Yes |
| Airflow DAG runs successfully end to end | Instructor triggers the DAG and checks all tasks green | Yes |
| schema.yml has descriptions on all models and key columns | Code review | Yes |
What to Submit
Deliverable Checklist
- Send to your instructor:
- [ ] GitHub repo link with full dbt project and Airflow DAG committed
- [ ] Screenshot of dbt test results (all green)
- [ ] Screenshot of Airflow DAG run (all tasks green)
- [ ] README explaining how to run the pipeline locally
5.8 dbt Project Structure
my_project/
dbt_project.yml
profiles.yml
models/
staging/
stg_movies.sql
stg_ratings.sql
_stg_models.yml # sources + tests
intermediate/
int_movies_enriched.sql
marts/
fct_ratings.sql
dim_movies.sql
_mart_models.yml
tests/
assert_rating_range.sql
macros/
5.9 dbt Model Templates
Staging model — stg_movies.sql
-- models/staging/stg_movies.sql
WITH source AS (
SELECT * FROM {{ source("raw", "movies") }}
),
renamed AS (
SELECT
movie_id::INT AS movie_id,
TRIM(title) AS title,
LOWER(TRIM(genre)) AS genre,
release_year::INT AS release_year,
rating::FLOAT AS rating
FROM source
WHERE movie_id IS NOT NULL
)
SELECT * FROM renamed
Mart model — fct_ratings.sql
-- models/marts/fct_ratings.sql
WITH movies AS (
SELECT * FROM {{ ref("stg_movies") }}
),
enriched AS (
SELECT * FROM {{ ref("int_movies_enriched") }}
),
final AS (
SELECT
m.movie_id,
m.title,
m.genre,
m.release_year,
e.avg_rating,
e.total_reviews
FROM movies m
LEFT JOIN enriched e USING (movie_id)
)
SELECT * FROM final
Schema tests — _mart_models.yml
# models/marts/_mart_models.yml
version: 2
models:
- name: fct_ratings
description: "One row per movie with aggregated rating stats."
columns:
- name: movie_id
description: "Unique movie identifier"
tests:
- unique
- not_null
- name: genre
description: "Normalised genre string"
tests:
- not_null
- name: avg_rating
description: "Mean rating across all reviews"
5.10 Airflow DAG Template
# dags/movies_pipeline.py
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator
default_args = {
"owner": "intern",
"retries": 1,
"retry_delay": timedelta(minutes=5),
}
with DAG(
dag_id="movies_pipeline",
default_args=default_args,
schedule="@daily",
start_date=datetime(2026, 1, 1),
catchup=False,
) as dag:
extract = BashOperator(
task_id="extract",
bash_command="bash /opt/airflow/scripts/download.sh",
)
load = BashOperator(
task_id="load",
bash_command="python /opt/airflow/scripts/load_pyspark.py",
)
transform = BashOperator(
task_id="dbt_run",
bash_command="dbt run --project-dir /opt/airflow/dbt",
)
test = BashOperator(
task_id="dbt_test",
bash_command="dbt test --project-dir /opt/airflow/dbt",
)
extract >> load >> transform >> test
Weeks 3–5 Official References
| Category | Resource | What You'll Learn | URL (click or type) |
|---|
| dbt | dbt Core Docs | Project setup, models, tests, macros, sources | docs.getdbt.com |
| dbt | dbt Best Practices | Staging/intermediate/mart naming and structure | docs.getdbt.com/guides/best-practices |
| dbt | ref() and source() | How to reference models and declare raw sources | docs.getdbt.com/reference/dbt-jinja-functions/ref |
| dbt | dbt Tests | Generic tests: unique, not_null, accepted_values | docs.getdbt.com/docs/build/data-tests |
| Airflow | Airflow Concepts | DAGs, Operators, Tasks, Scheduling explained | airflow.apache.org/docs/apache-airflow/stable/core-concepts/ |
| Airflow | BashOperator | Run shell commands from a DAG | airflow.apache.org/docs/apache-airflow/stable/howto/operator/bash.html |
| Airflow | Docker Compose Setup | Official local development environment | airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html |
| SQL | PostgreSQL CTEs | WITH clause for readable, modular queries | postgresql.org/docs/current/queries-with.html |
WEEKS 6–7 | Data Visualisation: Power BI
Project 4 — Requirements & Deliverables
Project 4: Power BI Dashboard
Duration: Weeks 6–7 | Individual project
Purpose: Connect to your PostgreSQL data, design a proper star schema model in Power BI, write DAX measures, and build an interactive dashboard that tells a data story.
What to Build
| # | Deliverable / Feature | Requirement | Required? |
|---|
| 1 | Power BI data model | Connect to PostgreSQL. Import fact and dimension tables. Define relationships. Use star schema layout in Model view. | MVP |
| 2 | Base DAX measures | At least 5 measures: Total Count, Average Rating, % of Total, period comparison, and one conditional measure. | MVP |
| 3 | Dashboard page 1: Overview | High-level KPIs visible at a glance. Cards, a bar/column chart, a trend line. Page-level filters. | MVP |
| 4 | Dashboard page 2: Deep-dive | Interactive drill-through or cross-filter behaviour. Slicers for at least 2 dimensions. One advanced visual. | MVP |
| 5 | Data story narrative | A written paragraph (in the report or as a document) explaining: what the data shows, what action it suggests, who the audience is. | MVP |
| 6 | Time intelligence measures | SAMEPERIODLASTYEAR, TOTALYTD, or DATESYTD using a proper date dimension. | Stretch |
Acceptance Criteria
| Check | How It's Verified | Blocking? |
|---|
| Data model uses star schema layout in Power BI Model view | Visual review of the .pbix file | Yes |
| All relationships defined correctly (no ambiguous relationships) | Instructor inspects Model view | Yes |
| At least 5 DAX measures defined in a dedicated Measures table | Code review | Yes |
| Dashboard answers at least 3 specific business questions | Presentation walkthrough | Yes |
| Slicers and cross-filter interactions work correctly | Live demo during presentation | Yes |
| Data story explains insight and recommended action | Written narrative reviewed | Yes |
What to Submit
Deliverable Checklist
- Send to your instructor:
- [ ] .pbix file uploaded to the shared folder or GitHub repo
- [ ] Screenshot of data model (Model view)
- [ ] PDF export of the dashboard
- [ ] Data story narrative (in the report or as a separate document)
5.11 Power BI Data Model Checklist
| Check | What to verify |
|---|
| Star schema layout | Fact table in the centre. Dimension tables surrounding it. No circular relationships. |
| Relationship cardinality | Fact-to-dimension: Many-to-One (*:1). Cross-filter direction: Single. |
| Date table | A proper date dimension table. Mark as Date Table in Power BI. Continuous date range, no gaps. |
| Measures table | Create an empty table called "Measures" and place all DAX measures there. Never put measures in dimension tables. |
| No calculated columns for aggregations | Use measures for SUM, AVERAGE, COUNT. Calculated columns for row-level values only. |
| Hide FK columns | Hide foreign key columns in the fact table from report view — they are only needed for relationships. |
5.12 DAX Measure Patterns
-- Basic measures
Total Movies = COUNTROWS(fct_ratings)
Avg Rating = AVERAGE(fct_ratings[avg_rating])
Total Reviews = SUM(fct_ratings[total_reviews])
-- % of total
Rating Share % =
DIVIDE(
[Total Movies],
CALCULATE([Total Movies], ALL(dim_movies[genre]))
)
-- Conditional measure
High Rated Movies =
CALCULATE(
[Total Movies],
fct_ratings[avg_rating] >= 7.5
)
-- Year-over-year comparison (requires date table)
Movies Last Year =
CALCULATE([Total Movies], SAMEPERIODLASTYEAR(dim_date[date]))
5.13 Data Storytelling Principles
| Principle | What it means |
|---|
| Lead with the insight | Start with the conclusion, not the data. "Ratings declined 18% in 2024" not "Here is a chart of ratings over time." |
| Every visual earns its place | Remove any chart that does not directly support a finding. Less is more. |
| Use consistent colour | One colour for the primary metric. A second for comparison. Never rainbow charts. |
| Label the key points | Annotate the highest and lowest values. Label the most important bar or line directly. |
| State the recommended action | A dashboard without a recommendation is just decoration. End with what to do next. |
Weeks 6–7 Official References