Merge pull request #3129 from catalyst-cooperative/include-quarterly-updates-in-docs

Include sub-annual updates in annual_updates docs
aesharpe authored Dec 13, 2023
2 parents 0009827 + 6293199 commit 2bd73ac
Showing 10 changed files with 821 additions and 807 deletions.
24 changes: 0 additions & 24 deletions .github/ISSUE_TEMPLATE/annual_updates.md

This file was deleted.

24 changes: 24 additions & 0 deletions .github/ISSUE_TEMPLATE/existing_data_updates.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
---
name: Integrate New Year of Data
about: Check-list for integrating a new year of data
title: ''
labels: new-data
assignees: ''

---

### New year of data integration check-list:

Based on the [Existing Data Updates Docs](https://catalystcoop-pudl.readthedocs.io/en/dev/dev/existing_data_updates.html)


- [ ] [Obtain fresh data](https://catalystcoop-pudl.readthedocs.io/en/latest/dev/existing_data_updates.html#obtain-fresh-data)
- [ ] [Map the structure of the new data](https://catalystcoop-pudl.readthedocs.io/en/latest/dev/existing_data_updates.html#map-the-structure-of-the-new-data)
- [ ] [Test data extraction](https://catalystcoop-pudl.readthedocs.io/en/latest/dev/existing_data_updates.html#test-data-extraction)
- [ ] [Update table and column transformations](https://catalystcoop-pudl.readthedocs.io/en/latest/dev/existing_data_updates.html#update-table-column-transformations)
- [ ] [Update the PUDL db schema](https://catalystcoop-pudl.readthedocs.io/en/latest/dev/existing_data_updates.html#update-the-pudl-db-schema)
- [ ] [Connect datasets](https://catalystcoop-pudl.readthedocs.io/en/latest/dev/existing_data_updates.html#connect-datasets)
- [ ] [Run the ETL](https://catalystcoop-pudl.readthedocs.io/en/latest/dev/existing_data_updates.html#run-the-etl)
- [ ] [Update the output routines and run full tests](https://catalystcoop-pudl.readthedocs.io/en/latest/dev/existing_data_updates.html#update-the-output-routines-and-run-full-tests)
- [ ] [Run and update data validations](https://catalystcoop-pudl.readthedocs.io/en/latest/dev/existing_data_updates.html#run-and-update-data-validations)
- [ ] [Update the documentation](https://catalystcoop-pudl.readthedocs.io/en/latest/dev/existing_data_updates.html#update-the-documentation)
62 changes: 36 additions & 26 deletions docs/dev/annual_updates.rst → docs/dev/existing_data_updates.rst
@@ -1,14 +1,27 @@
===============================================================================
Annual Updates
Existing Data Updates
===============================================================================
Much of the data we work with is released in a "final" state annually. We typically
integrate the new year of data over 2-4 weeks in October since, by that
time, the final release for the previous year have been published by EIA and FERC. We
also integrate EIA early release data when available. The ``data_maturity`` field will
indicate whether the data is final or provisional. To see what data we have available
for each dataset, click on the links below and look at the "Years Liberated" field.

As of spring 2023 the annual updates include:
Many of the raw data inputs for PUDL are published on an annual or monthly basis. These
instructions explain the process for integrating new versions of existing data into
PUDL.

We update EIA monthly data and EPA CEMS hourly data on a quarterly basis.

EIA typically publishes an "early release" version of their annual data in the summer
followed by a final release in the fall. Our ``data_maturity`` column indicates
which version has been integrated into PUDL ("final" vs. "provisional"). This column
also shows when data are derived from monthly updates ("monthly_update") or contain
incomplete year-to-date data ("incremental_ytd").

FERC publishes form submissions on a rolling basis, meaning there is no official
date that the data are considered final or complete. To figure out when the data are
likely complete, we compare the number of respondents from prior years to the number of
current respondents. We usually update FERC once a year around when we integrate EIA's
final release in the fall.
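
The respondent-count comparison described above can be sketched roughly as follows. This is only an illustration, not PUDL code: the function name ``likely_complete``, the use of the median as a baseline, the 95% threshold, and the example counts are all assumptions made for the sketch.

```python
from statistics import median


def likely_complete(prior_counts, current_count, threshold=0.95):
    """Guess whether a rolling-submission year is complete.

    prior_counts maps report year -> number of respondents (hypothetical data).
    The year is treated as likely complete once the current respondent count
    reaches `threshold` times the median of the prior years' counts.
    """
    if not prior_counts:
        raise ValueError("need at least one prior year to compare against")
    baseline = median(prior_counts.values())
    return current_count >= threshold * baseline


# Made-up respondent counts, for illustration only.
prior = {2020: 200, 2021: 210, 2022: 205}
print(likely_complete(prior, 150))
print(likely_complete(prior, 204))
```

A median baseline is used here so that one unusual prior year does not dominate the comparison; any similar summary statistic would serve the same purpose.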

To see what data we have available for each dataset, click on the links below and look
at the "Years Liberated" field.

* :doc:`/data_sources/eia860` (and eia860m)
* :doc:`/data_sources/eia861`
@@ -17,9 +30,6 @@ As of spring 2023 the annual updates include:
* :doc:`/data_sources/ferc1`
* :doc:`/data_sources/ferc714`

This document outlines all the tasks required to complete the annual update based on
our experience in 2022.

1. Obtain Fresh Data
--------------------
**1.1)** Add a new copy of the raw PUDL inputs from agency websites using the tools
@@ -32,18 +42,18 @@ archivers themselves.
refer to the new raw input archives.

**1.3)** In :py:const:`pudl.metadata.sources.SOURCES`, update the ``working_partitions``
to reflect the years of data that are available within each dataset and the
``records_liberated`` to show how many records are available. Check to make sure other
fields such as ``source_format`` or ``path`` are still accurate.
to reflect the years, months, or quarters of data that are available for each dataset
and the ``records_liberated`` to show how many records are available. Check to make
sure other fields such as ``source_format`` or ``path`` are still accurate.
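
As a rough illustration of step 1.3, the sketch below adds a new year to a dataset's partitions. The dictionary is a hypothetical stand-in, not the real contents of ``pudl.metadata.sources.SOURCES``: only the field names ``working_partitions`` and ``records_liberated`` come from the text above, while the nested ``years`` key, the placeholder values, and the helper ``add_partition`` are assumptions.

```python
# Hypothetical stand-in for a single entry of pudl.metadata.sources.SOURCES.
# Field names follow the step above; the nesting and values are placeholders.
sources_sketch = {
    "eia860": {
        "working_partitions": {"years": list(range(2001, 2022))},
        "records_liberated": "placeholder",
    }
}


def add_partition(sources, dataset, kind, value):
    """Add one partition value (e.g. a year), keeping the list sorted and unique."""
    partitions = sources[dataset]["working_partitions"].setdefault(kind, [])
    if value not in partitions:
        partitions.append(value)
        partitions.sort()
    return partitions


years = add_partition(sources_sketch, "eia860", "years", 2022)
print(years[-1])
```

In practice the edit is made by hand in the metadata module; the helper only shows the intended end state, with each available partition listed once and in order.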

.. note::

If you're updating EIA861, you can skip the rest of the steps in this section and
all steps after step two because 861 is not yet included in the ETL.

**1.4)** Update the years of data to be processed in the ``etl_full.yml`` and
``etl_fast.yml`` settings files stored under ``src/pudl/package_data/settings`` in the
PUDL repo.
**1.4)** Update the partitions of data to be processed in
the ``etl_full.yml`` and ``etl_fast.yml`` settings files stored under
``src/pudl/package_data/settings`` in the PUDL repo.

**1.5)** Use the ``pudl_datastore`` script (see :doc:`datastore`) to download the new
raw data archives in bulk so that network hiccups don't cause issues during the ETL.
@@ -79,21 +89,21 @@ the years (e.g. ``boiler_fuel``). However ``page_name`` does not necessarily cor
directly to PUDL database table names because we don't load the data from all pages, and
some pages result in more than one database table after normalization.

**2.A.1)** Add a column for the new year of data to each of the aforementioned files. If
there are any changes to prior years, make sure to address those too. (See note above).
If you are updating early release data with final release data, replace the values in
the appropriate year column.
**2.A.1)** If you're adding a new year, add a column for the new year of data to each of
the aforementioned files. If there are any changes to prior years, make sure to address
those too. (See note above). If you are updating early release data with final release
data, replace the values in the appropriate year column.

.. note::

If you are adding EIA's early release data, make sure the raw files have
**If you are adding EIA's early release data**, make sure the raw files have
``Early_Release`` at the end of the file name. This is how the excel extractor knows
to label the data as provisional vs. final.

Early release files also tend to have one extra row at the top and one extra column
on the right of each file indicating that it is early release. This means that the
skiprows and column map values will probably be off by 1 when you update from early
release to final release.
   **If you are updating early release data to final release data**, note that early
   release files tend to have one extra row at the top and one extra column on the right of each
file indicating that it is early release. This means that the skiprows and column map
values will probably be off by 1.
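
The file-name convention in the note above can be illustrated with a small sketch. This is not the actual Excel extractor: the function ``data_maturity_from_filename`` and the example file names are hypothetical, and the sketch assumes that "the end of the file name" means the end of the name before the extension.

```python
from pathlib import Path


def data_maturity_from_filename(filename):
    """Label a raw file as provisional if its name ends in ``Early_Release``.

    Assumes the suffix is checked against the file stem, i.e. the name
    without its extension.
    """
    stem = Path(filename).stem
    return "provisional" if stem.endswith("Early_Release") else "final"


# Hypothetical file names, for illustration only.
print(data_maturity_from_filename("example_2022_Early_Release.xlsx"))
print(data_maturity_from_filename("example_2022.xlsx"))
```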

**2.A.2)** If there are files, spreadsheet pages, or individual columns with new
semantic meaning (i.e. they don't correspond to any of the previously mapped files,
2 changes: 1 addition & 1 deletion docs/dev/index.rst
@@ -13,7 +13,7 @@ Development
build_docs
datastore
clone_ferc1
annual_updates
existing_data_updates
pudl_id_mapping
naming_conventions
data_guidelines
10 changes: 5 additions & 5 deletions docs/dev/pudl_id_mapping.rst
@@ -75,10 +75,10 @@ Checking for Unmapped Records
-----------------------------

With every new year of data comes the possibility of new plants and utilities. Once
you've integrated the new data into PUDL :doc:`(see instructions) <annual_updates>`,
you'll need to check for unmapped utility and plants. To do this,
run the glue tests with specific arguments, or directly run the following ``make``
command.
you've integrated the new data into PUDL
:doc:`(see instructions) <existing_data_updates>`, you'll need to check for unmapped
utilities and plants. To do this, run the glue tests with specific arguments, or directly
run the following ``make`` command.

.. code-block:: console
@@ -236,4 +236,4 @@ Once you’ve successfully mapped all unmapped PUDL IDs, you’ll want to rerun
This ensures that the newly mapped IDs get integrated into the PUDL database and output
tables that folks are using. Make sure to tell everyone else to do so as well so that
you can all use the newly mapped PUDL IDs. But first, make sure to head back to the
:doc:`annual_updates` page to wrap up the validation tests!
:doc:`existing_data_updates` page to wrap up the validation tests!
3 changes: 2 additions & 1 deletion docs/release_notes.rst
@@ -302,7 +302,8 @@ Deprecations
* Replace references to deprecated ``pudl-scrapers`` and
``pudl-zenodo-datastore`` repositories with references to `pudl-archiver
<https://www.github.com/catalyst-cooperative/pudl-archiver>`__ repository in
:doc:`intro`, :doc:`dev/datastore`, and :doc:`dev/annual_updates`. See :pr:`2190`.
:doc:`intro`, :doc:`dev/datastore`, and :doc:`dev/existing_data_updates`. See
:pr:`2190`.
* :mod:`pudl.etl` is now a subpackage that collects all pudl assets into a dagster
`Definition <https://docs.dagster.io/concepts/code-locations>`__. All
``pudl.etl._etl_{datasource}`` functions have been deprecated. The coordination