This is an data-stack-in-a-box based data from NSW Education Data. With the push of one button you can have your own data stack up and running in 5 mins! ποΈ
Important
Click below ππΌ to setup your own free data stack packed with NSW Department of Education data.
This dashboard contains all the metrics we have collected in this data stack project. It uses a visualisation as code tool called evidence.dev.
NSW Department of Education data stack in a box has two objectives:
- Getting humans excited about the publicly available data curated by NSW Department of Education and our partners.
- Simple one click totally free π² data stack, aiding in learning and proof of concepts.
- Help identify data quality and reliability issues with our data. This project is being run daily with several tests so hopefully we can find any issues first!
- Building a report that shows all metrics related to (NSW) Department of Education (DOE) data, and consolidate any other publicaly available data.
The project is designed to be very simple when getting started but allows you to go as deep you like!
- I want to analyse and gain insights into the data. With the infrastructure free and deployed in one click you donβt need to worry about any implementation details. You can skip straight to analysing and training ML models on top of your own local warehouse.
- Interested in modelling via SQL? We got you covered with a environment setup for DBT.
- Love DevOps and platform engineering? Check out our Orchestration, CICD pipelines, and automation such as linting, ect.
[!NOTE] We are simply going to extract data from the Data.NSW and load it into our in memory data warehouse π¦, model, clean, and analyse our data. Data.NSW behind the scenes uses CKAN an open source data management system used by the likes of Government of Canada, NHS, and USAs Open Data.
Warning
Some of the datasets from ACARA and Data.NSW are based on CSV files located on their sites. This is challenging as the CSV name and URL for future datasets is unknown, so requires a code change each time new data arrives.
π§ TODO likley openmetadata see example
erDiagram
"STUDENT" {
int student_id
string name
date date_of_birth
string gender
int school_id
}
STAFF {
int staff_id
string name
string role
date date_of_birth
int school_id
}
COURSE {
int course_id
string course_name
string description
int school_id
}
ENROLLMENT {
int enrollment_id
int student_id
int course_id
date enrollment_date
}
CLASS {
int class_id
int course_id
int staff_id
string class_room
date class_time
int class_size
}
SCHOOL {
int school_id
string school_name
string address
}
NAPLAN {
int naplan_id
int student_id
date test_date
string test_type
int score
}
%% HSC {
%% int hsc_id
%% int student_id
%% date exam_date
%% string subject
%% int score
%% }
INCIDENT {
int incident_id
int school_id
date incident_date
string incident_type
string description
}
ATTENDANCE {
int attendance_id
int student_id
date attendance_date
bool present
}
RETENTION {
int retention_id
int school_id
int year
float retention_rate
}
APPRENTICESHIP {
int apprenticeship_id
int student_id
string trade
date start_date
date end_date
}
TRAINEESHIP {
int traineeship_id
int student_id
string field
date start_date
date end_date
}
"STUDENT" ||--o{ ENROLLMENT : enrolls
COURSE ||--o{ ENROLLMENT : includes
STAFF ||--o{ CLASS : teaches
COURSE ||--o{ CLASS : consists_of
SCHOOL ||--o{ "STUDENT" : has
SCHOOL ||--o{ STAFF : employs
SCHOOL ||--o{ COURSE : offers
"STUDENT" ||--o{ NAPLAN : takes
%% "STUDENT" ||--o{ HSC : sits
SCHOOL ||--o{ INCIDENT : reports
"STUDENT" ||--o{ ATTENDANCE : records
SCHOOL ||--o{ RETENTION : tracks
"STUDENT" ||--o{ APPRENTICESHIP : undertakes
"STUDENT" ||--o{ TRAINEESHIP : participates_in
This is a high level overview of the entities that we are going to model in this project.
[!NOTE] Limitation - The data available publically for each entitity does not go down to a student. In some cases school level data is avaiable. But most entities only have data published at a state wide (NSW) aggregate level.
π§ add column for asset checks
Name | Method (API, CSV, Excel) | Contract Y/N | Description | Source URL |
---|---|---|---|---|
Apprenticeship and Traineeship training contract |
Excel | β | Apprenticeships and Traineeships combine formal study of a nationally recognised qualification with on-the-job training. | https://data.nsw.gov.au/search/dataset/ds-nsw-ckan-f7cba3fc-6e9b-4b8b-b1fd-e7dda9b49001 |
Average government primary school class sizes |
API | β | The average class size for each grade is calculated by taking the number of students in all classes that a student from that grade is in (including composite/multi age classes) divided by the total number of classes that includes a student from that grade. | https://data.nsw.gov.au/search/dataset/ds-nsw-ckan-43438137-084e-4d50-81c0-ce741ea3b37b/details |
Enrolments |
API | β | This data shows February census enrolment figures. All enrolments are self-reported in full-time equivalent (FTE) units and include both full-time and part-time students. | https://data.nsw.gov.au/search/dataset/ds-nsw-ckan-818ae0d8-d7fb-4b62-963c-7263fdb8e1ca |
Incidents |
API | β | Incidents in public schools and how the department supports schools through incidents while still protecting the identity of students and staff. | https://data.nsw.gov.au/search/dataset/ds-nsw-ckan-43438137-084e-4d50-81c0-ce741ea3b37b |
Master dataset: NSW government school locations and student enrolment numbers |
CSV | β | The master dataset contains comprehensive information for all government schools in NSW. Data items include school locations, latitude and longitude coordinates, school type, student enrolment numbers, electorate information, contact details and more. | https://data.nsw.gov.au/search/dataset/ds-nsw-ckan-78c10ea3-8d04-4c9c-b255-bbf8547e37e7 |
Resource Allocation Model (RAM) |
CSV | β | The Resource Allocation Model (RAM) was developed to ensure a fair, efficient and transparent allocation of the state public education budget for every school. The model recognises that students and school communities are not all the same and that they have different needs which require different levels of support. | https://data.nsw.gov.au/search/dataset/ds-nsw-ckan-3ea5010a-89bd-46bf-be2a-13c82cc0e1bb |
Staff |
CSV | β | -------------------------------------- | https://www.acara.edu.au/reporting/national-report-on-schooling-in-australia/staff-numbers |
Students |
CSV | β | -------------------------------------- | https://www.acara.edu.au/reporting/national-report-on-schooling-in-australia/student-numbers |
Student attendance |
CSV | β | This dataset shows the attendance rates for all NSW government schools in Semester One by alphabetical order. | https://data.nsw.gov.au/search/dataset/ds-nsw-ckan-b558a070-09f5-4941-a140-e60a744327bf |
Student retention rates at NSW government schools |
API | β | The full-time apparent retention rate (ARR) measures the proportion of a cohort of full-time students that moves from one grade to the next, based on an expected rate of progression of one grade per year. It does not track individual students through their final years of secondary schooling. | https://data.nsw.gov.au/search/dataset/ds-nsw-ckan-c9fd51b3-506d-4707-b607-0b1853654ce6 |
Name | Method (API, CSV, Excel) | Contract Y/N | Description | Source URL |
---|---|---|---|---|
Google Analytics |
API | β | Captures all the traffic to the data visualisation via evidence.dev | https://analytics.google.com/analytics/web/?pli=1#/p438587109/reports/intelligenthome |
Github |
API | β | Captures all the events that occour with the open source project nsw-doe-data-stack-in-a-box | https://github.com/wisemuffin/nsw-doe-data-stack-in-a-box |
π§ add descriptions for facts
Fact | Status | Dim School | Dim Schoolastic Year | Dim Calendar Year | Description |
---|---|---|---|---|---|
Resource Allocation Model (RAM) |
β | β | β | β | |
Staff |
β | β | β | β | |
Students |
β | β | β | β | |
Incident |
β | β | β | β | |
Aparent Retention Rate |
β | β | β | β | |
School |
β | β | β | β | |
Attendance |
β | β | β | β | Dont have numerator and denominator so cant aggregate this fact table. Could just out disclamer on average of average |
Enrolment |
β | β | β | β | |
Apprenticeship and Traineeship training contract |
π§ | β | β | β partially | |
Web Analytics |
β | β | β | β | |
Repo Reactions |
β | β | β | β |
π§ TODO Dimensional ERD check out
Number of techers per school
was on the data hub but was removed citing will now be reported by ABS. But ABS data isnt at a school level.
NAPLAN
andHSC attainment
by school. Can get NAPLAN by school going to ACARA's MySchool but no easy way to get a view for all schools data.
See below. Also checkout the wiki with this repo for more info on the project.
To submit your code, fork the repository, create a new branch on your fork, and open a Pull Request (PR) once your work is ready for review.
In the PR template, please describe the change, including the motivation/context, test coverage, and any other relevant information. Please note if the PR is a breaking change or if it is related to an open GitHub issue.
A Core reviewer will review your PR in around five business days and provide feedback on any changes it requires to be approved. Once approved and all the tests pass, the reviewer will click the Squash and merge button in Github π₯³.
to run the report locally simply run:
task evidence_setup && evidence
task evidence
[!WARNING] make sure that your dagster pipeline isnt running, when running evidence. Also close down the reporting server
ctrl+c
before you start dagster up again viatask dag
. This is due to a limitation with duckdb locking.
π§ TODO - currently data scientists need to know how to work with pipelines. Still experimenting with this. But you can have a go with the examples that already exist in dagster.
I have been following the gitlab's data team's handbook for modeling, naming convetions and testing.
I am pretty relaxed with standards in this project. But please read through these before developing to help standise the modeling:
- Enterprise data warehouse
- Tests
- SQL style guide
Differences to gitlab's data team's handbook:
- Raw and other schema's π§ TODO - simplify CICD have just used one schema, prefix should be enough
- staging layer added between raw and prep layers.
make sure you lint your code with sqlfluff
:
sqlfluff lint
sqlfluff fix
Dagster should automatically start in your codespace (give it a couple of mins for the setup to complete). If you have exited dagster simply type task dag
on the command line to get it running again.
testing
We use pytest
debugging dagster
To debug dagster you should run dagster dev
in debug mode. This allows you to set breakpoints in vs code. Simply hit F5
in vscode (just check that your debug config is set to Dagster: Debug Dagit UI
).
Behind the scenes VSCode is using launch.json
with the following args to run dagster in debug mode. Then just select the assets in dagster UI to materialise. If you set breakpoints they will be
{
"name": "dagster dev",
"type": "python",
"request": "launch",
"module": "dagster",
"args": [
"dev",
],
"subProcess": true
}
This is one of the first things i wish i knew when learning dagster!
Due to the evolving nature of school information and local enrolment areas, no responsibility can be taken by the NSW Department of Education, or any of its associated departments, if information is relied upon. For example, but not limited to, real estate purchases or rentals where the school intake zone data is used as a reference source.