Welcome to New South Wales Department of Education (NSW DOE) data stack in a box

This is a data-stack-in-a-box built on data from NSW Education Data. With the push of one button you can have your own data stack up and running in 5 mins! 🏎️

🚀 TL;DR - What have we achieved?

Data Stack

[!IMPORTANT] Click below 👇🏼 to set up your own free data stack packed with NSW Department of Education data.

Reports

This dashboard contains all the metrics we have collected in this data stack project. It uses a visualisation as code tool called evidence.dev.

💡 Objectives

Main quests

NSW Department of Education data stack in a box has two objectives:

Getting humans excited about the publicly available data curated by NSW Department of Education and our partners.
Providing a simple, one-click, totally free 💲 data stack to aid learning and proofs of concept.

Side quests

Help identify data quality and reliability issues with our data. This project runs daily with several tests, so hopefully we can find any issues first!
Build a report that shows all metrics related to NSW Department of Education (DOE) data, and consolidate any other publicly available data.

🤗 Audience

The project is designed to be very simple to get started with, but allows you to go as deep as you like!

I want to analyse and gain insights into the data. With the infrastructure free and deployed in one click, you don’t need to worry about any implementation details. You can skip straight to analysing and training ML models on top of your own local warehouse.
Interested in modelling via SQL? We've got you covered with an environment set up for dbt.
Love DevOps and platform engineering? Check out our orchestration, CICD pipelines, and automation such as linting, etc.

🥨 Overview of Data Stack (Architecture)

[!NOTE] We simply extract data from Data.NSW, load it into our in-memory data warehouse 🦆, and then model, clean, and analyse it. Behind the scenes, Data.NSW uses CKAN, an open source data management system used by the likes of the Government of Canada, the NHS, and the USA's Open Data.
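
As a rough illustration of the extract step, here is a minimal sketch of calling CKAN's `datastore_search` action and pulling the rows out of the response. The base URL and the helper names are assumptions made for this example and are not taken from the project's pipeline code.

```python
from urllib.parse import urlencode

# Assumption: Data.NSW serves the standard CKAN Action API under this base URL.
CKAN_BASE_URL = "https://data.nsw.gov.au/data"


def datastore_search_url(resource_id: str, limit: int = 100, offset: int = 0) -> str:
    """Build a CKAN `datastore_search` URL for one resource (hypothetical helper)."""
    query = urlencode({"resource_id": resource_id, "limit": limit, "offset": offset})
    return f"{CKAN_BASE_URL}/api/3/action/datastore_search?{query}"


def extract_records(payload: dict) -> list:
    """Return the rows from a decoded datastore_search response, or raise on failure."""
    if not payload.get("success"):
        raise ValueError(f"CKAN request failed: {payload.get('error')}")
    return payload["result"]["records"]
```

The records come back as a list of dictionaries, which could then be loaded into the warehouse. Paging through larger resources would mean increasing `offset` until no more records are returned.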

[!WARNING] Some of the datasets from ACARA and Data.NSW are based on CSV files located on their sites. This is challenging because the CSV name and URL for future datasets are unknown, so a code change is required each time new data arrives.
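
One possible way to soften this for CKAN-hosted datasets is to look up the newest CSV resource from a `package_show` response instead of hard-coding its URL. This is only a sketch: it assumes the dataset lists its files as CKAN resources with `format` and `created` fields, it is not how the project currently works, and it would not help with the ACARA files. The function name is hypothetical.

```python
from typing import Optional


def latest_csv_resource(package: dict) -> Optional[dict]:
    """Pick the most recently created CSV resource from a CKAN `package_show` result."""
    csv_resources = [
        r for r in package.get("resources", [])
        if (r.get("format") or "").strip().lower() == "csv"
    ]
    if not csv_resources:
        return None
    # ISO 8601 timestamps sort correctly as plain strings.
    return max(csv_resources, key=lambda r: r.get("created") or "")
```

Here `package` is the `result` object of the decoded response, and the returned resource's `url` field would be the file to download.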

👅 Information Management

Data Catalog

🚧 TODO likely openmetadata see example

Conceptual Data Model

erDiagram
    "STUDENT" {
        int student_id
        string name
        date date_of_birth
        string gender
        int school_id
    }
    STAFF {
        int staff_id
        string name
        string role
        date date_of_birth
        int school_id
    }
    COURSE {
        int course_id
        string course_name
        string description
        int school_id
    }
    ENROLLMENT {
        int enrollment_id
        int student_id
        int course_id
        date enrollment_date
    }
    CLASS {
        int class_id
        int course_id
        int staff_id
        string class_room
        date class_time
        int class_size
    }
    SCHOOL {
        int school_id
        string school_name
        string address
    }
    NAPLAN {
        int naplan_id
        int student_id
        date test_date
        string test_type
        int score
    }
    %% HSC {
    %%     int hsc_id
    %%     int student_id
    %%     date exam_date
    %%     string subject
    %%     int score
    %% }
    INCIDENT {
        int incident_id
        int school_id
        date incident_date
        string incident_type
        string description
    }
    ATTENDANCE {
        int attendance_id
        int student_id
        date attendance_date
        bool present
    }
    RETENTION {
        int retention_id
        int school_id
        int year
        float retention_rate
    }
    APPRENTICESHIP {
        int apprenticeship_id
        int student_id
        string trade
        date start_date
        date end_date
    }
    TRAINEESHIP {
        int traineeship_id
        int student_id
        string field
        date start_date
        date end_date
    }
    "STUDENT" ||--o{ ENROLLMENT : enrolls
    COURSE ||--o{ ENROLLMENT : includes
    STAFF ||--o{ CLASS : teaches
    COURSE ||--o{ CLASS : consists_of
    SCHOOL ||--o{ "STUDENT" : has
    SCHOOL ||--o{ STAFF : employs
    SCHOOL ||--o{ COURSE : offers
    "STUDENT" ||--o{ NAPLAN : takes
    %% "STUDENT" ||--o{ HSC : sits
    SCHOOL ||--o{ INCIDENT : reports
    "STUDENT" ||--o{ ATTENDANCE : records
    SCHOOL ||--o{ RETENTION : tracks
    "STUDENT" ||--o{ APPRENTICESHIP : undertakes
    "STUDENT" ||--o{ TRAINEESHIP : participates_in

This is a high level overview of the entities that we are going to model in this project.

[!NOTE] Limitation - The publicly available data for each entity does not go down to student level. In some cases school-level data is available, but most entities only have data published at a state-wide (NSW) aggregate level.

Sources

Education Sources

🚧 add column for asset checks

Name	Method (API, CSV, Excel)	Contract Y/N	Description	Source URL
`Apprenticeship and Traineeship training contract`	Excel	❌	Apprenticeships and Traineeships combine formal study of a nationally recognised qualification with on-the-job training.	https://data.nsw.gov.au/search/dataset/ds-nsw-ckan-f7cba3fc-6e9b-4b8b-b1fd-e7dda9b49001
`Average government primary school class sizes`	API	❌	The average class size for each grade is calculated by taking the number of students in all classes that a student from that grade is in (including composite/multi age classes) divided by the total number of classes that includes a student from that grade.	https://data.nsw.gov.au/search/dataset/ds-nsw-ckan-43438137-084e-4d50-81c0-ce741ea3b37b/details
`Enrolments`	API	❌	This data shows February census enrolment figures. All enrolments are self-reported in full-time equivalent (FTE) units and include both full-time and part-time students.	https://data.nsw.gov.au/search/dataset/ds-nsw-ckan-818ae0d8-d7fb-4b62-963c-7263fdb8e1ca
`Incidents`	API	❌	Incidents in public schools and how the department supports schools through incidents while still protecting the identity of students and staff.	https://data.nsw.gov.au/search/dataset/ds-nsw-ckan-43438137-084e-4d50-81c0-ce741ea3b37b
`Master dataset: NSW government school locations and student enrolment numbers`	CSV	✅	The master dataset contains comprehensive information for all government schools in NSW. Data items include school locations, latitude and longitude coordinates, school type, student enrolment numbers, electorate information, contact details and more.	https://data.nsw.gov.au/search/dataset/ds-nsw-ckan-78c10ea3-8d04-4c9c-b255-bbf8547e37e7
`Resource Allocation Model (RAM)`	CSV	✅	The Resource Allocation Model (RAM) was developed to ensure a fair, efficient and transparent allocation of the state public education budget for every school. The model recognises that students and school communities are not all the same and that they have different needs which require different levels of support.	https://data.nsw.gov.au/search/dataset/ds-nsw-ckan-3ea5010a-89bd-46bf-be2a-13c82cc0e1bb
`Staff`	CSV	❌	--------------------------------------	https://www.acara.edu.au/reporting/national-report-on-schooling-in-australia/staff-numbers
`Students`	CSV	❌	--------------------------------------	https://www.acara.edu.au/reporting/national-report-on-schooling-in-australia/student-numbers
`Student attendance`	CSV	❌	This dataset shows the attendance rates for all NSW government schools in Semester One by alphabetical order.	https://data.nsw.gov.au/search/dataset/ds-nsw-ckan-b558a070-09f5-4941-a140-e60a744327bf
`Student retention rates at NSW government schools`	API	❌	The full-time apparent retention rate (ARR) measures the proportion of a cohort of full-time students that moves from one grade to the next, based on an expected rate of progression of one grade per year. It does not track individual students through their final years of secondary schooling.	https://data.nsw.gov.au/search/dataset/ds-nsw-ckan-c9fd51b3-506d-4707-b607-0b1853654ce6
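
To make the class size definition in the table above concrete, here is a small worked example with made-up classes. The data and the function name are hypothetical. It only illustrates the calculation as described in the dataset description.

```python
# Hypothetical classes: each maps grade -> number of students in that class.
classes = [
    {"Year 1": 24},                # straight Year 1 class
    {"Year 1": 10, "Year 2": 14},  # composite Year 1/2 class
    {"Year 2": 26},                # straight Year 2 class
]


def average_class_size(classes: list, grade: str) -> float:
    """Average size of all classes that contain at least one student from `grade`."""
    relevant = [c for c in classes if c.get(grade, 0) > 0]
    if not relevant:
        raise ValueError(f"no classes contain students from {grade}")
    # Count every student in the class, including other grades in composite classes.
    total_students = sum(sum(c.values()) for c in relevant)
    return total_students / len(relevant)


print(average_class_size(classes, "Year 1"))  # (24 + 24) / 2 = 24.0
print(average_class_size(classes, "Year 2"))  # (24 + 26) / 2 = 25.0
```

Note that the composite class counts towards both grades, and all 24 of its students are counted each time.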

Utilisation Sources

Name	Method (API, CSV, Excel)	Contract Y/N	Description	Source URL
`Google Analytics`	API	❌	Captures all the traffic to the data visualisation via evidence.dev	https://analytics.google.com/analytics/web/?pli=1#/p438587109/reports/intelligenthome
`GitHub`	API	❌	Captures all the events that occur with the open source project nsw-doe-data-stack-in-a-box	https://github.com/wisemuffin/nsw-doe-data-stack-in-a-box

Bus Matrix

🚧 add descriptions for facts

Fact	Status	Dim School	Dim Scholastic Year	Dim Calendar Year	Description
`Resource Allocation Model (RAM)`	✅	✅	❌	✅
`Staff`	✅	❌	❌	✅
`Students`	✅	❌	❌	✅
`Incident`	✅	❌	❌	✅
`Apparent Retention Rate`	✅	❌	❌	✅
`School`	✅	❌	❌	❌
`Attendance`	✅	✅	❌	✅	We don't have the numerator and denominator, so this fact table can't be aggregated. Could just put a disclaimer on the average of averages.
`Enrolment`	✅	✅	❌	✅
`Apprenticeship and Traineeship training contract`	🚧	❌	❌	✅ partially
`Web Analytics`	✅	❌	❌	✅
`Repo Reactions`	✅	❌	❌	✅
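
The note on the Attendance fact is worth a quick worked example. With made-up numbers (the rates and student counts below are hypothetical), an unweighted average of school-level rates can differ noticeably from the rate you would get if the underlying counts were available:

```python
# Hypothetical schools: (attendance rate, number of students used as the weight).
schools = [
    (0.95, 1000),  # large school, high attendance
    (0.80, 100),   # small school, lower attendance
]

# What can be computed from published rates alone: an unweighted average of averages.
mean_of_rates = sum(rate for rate, _ in schools) / len(schools)

# What could be computed with the underlying counts: a student-weighted rate.
total_students = sum(n for _, n in schools)
weighted_rate = sum(rate * n for rate, n in schools) / total_students

print(round(mean_of_rates, 3))  # 0.875
print(round(weighted_rate, 3))  # 0.936
```

The small school pulls the unweighted figure down far more than its size justifies, which is why any aggregate built from the published rates would need a disclaimer.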

ERD

🚧 TODO Dimensional ERD check out

Give me more data!

Data that I want from DOE

The number of teachers per school was on the data hub but was removed, citing that it will now be reported by the ABS. But ABS data isn't at a school level.

Data from ACARA / NESA

NAPLAN and HSC attainment by school. You can get NAPLAN by school from ACARA's MySchool, but there is no easy way to get a view of all schools' data.

Contributing

See below. Also check out the wiki for this repo for more info on the project.

To submit your code, fork the repository, create a new branch on your fork, and open a Pull Request (PR) once your work is ready for review.

In the PR template, please describe the change, including the motivation/context, test coverage, and any other relevant information. Please note if the PR is a breaking change or if it is related to an open GitHub issue.

A core reviewer will review your PR within around five business days and give feedback on any changes required for approval. Once the PR is approved and all the tests pass, the reviewer will click the Squash and merge button in GitHub 🥳.

Contributing - Data Analyses & Reporting

To run the report locally, simply run:

task evidence_setup && evidence
task evidence

[!WARNING] Make sure that your dagster pipeline isn't running when you run evidence. Also shut down the reporting server (Ctrl+C) before you start dagster up again via task dag. This is due to a limitation with DuckDB locking.

Contributing - Data Science

🚧 TODO - currently data scientists need to know how to work with pipelines. Still experimenting with this. But you can have a go with the examples that already exist in dagster.

Contributing - Data Modeling

I have been following the GitLab data team's handbook for modeling, naming conventions, and testing.

I am pretty relaxed about standards in this project, but please read through these before developing to help standardise the modeling:

Enterprise data warehouse
Tests
SQL style guide

Differences from the GitLab data team's handbook:

Raw and other schemas: 🚧 TODO - to simplify CICD, I have just used one schema; a prefix should be enough.
A staging layer has been added between the raw and prep layers.

Make sure you lint your code with sqlfluff:

sqlfluff lint
sqlfluff fix

Contributing - Pipeline Code / Ingestion

Dagster should automatically start in your codespace (give it a couple of mins for the setup to complete). If you have exited dagster simply type task dag on the command line to get it running again.

testing

We use pytest
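
For anyone new to pytest, a test is just a function whose name starts with `test_` and that uses plain `assert` statements. The helper below is hypothetical and only there to give the tests something to check. It is not a function from this repo.

```python
def clean_school_name(raw: str) -> str:
    """Hypothetical helper: trim and collapse whitespace in a school name."""
    return " ".join(raw.split())


def test_clean_school_name_collapses_whitespace():
    assert clean_school_name("  Sydney   Boys  High School ") == "Sydney Boys High School"


def test_clean_school_name_leaves_clean_names_alone():
    assert clean_school_name("Newtown Public School") == "Newtown Public School"
```

Saved in a file whose name starts with `test_`, these would be picked up automatically when you run `pytest` from the command line.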

debugging dagster

To debug dagster, run dagster dev in debug mode. This allows you to set breakpoints in VS Code. Simply hit F5 in VS Code (just check that your debug config is set to Dagster: Debug Dagit UI).

Behind the scenes, VS Code uses launch.json with the args below to run dagster in debug mode. Then just select the assets in the dagster UI to materialise. If you set breakpoints they will be hit.

{
    "name": "dagster dev",
    "type": "python",
    "request": "launch",
    "module": "dagster",
    "args": [
        "dev"
    ],
    "subProcess": true
}

This is one of the first things I wish I knew when learning dagster!

Disclaimer

Due to the evolving nature of school information and local enrolment areas, no responsibility can be taken by the NSW Department of Education, or any of its associated departments, if information is relied upon. For example, but not limited to, real estate purchases or rentals where the school intake zone data is used as a reference source.
