Installable Django apps to consume and store 990 data and metadata. Depends on IRSx.
This repo contains four installable django apps.
Two are used for loading in IRS 990 filings and returns.
irsdb.filing
irsdb.return
And two of which are used for generating the schemas for the returns.
irsdb.metadata
irsdb.schemas
To use irsdb.filing
and irsdb.return
in in your application, first
install this package. You could do that by adding
https://github.com/datamade/990-xml-db/archive/refs/heads/feature/main.zip
to your app's requirements.txt
file.
Then have the following settings in your app's settings.py file:
INSTALLED_APPS = [
...
'irsdb.filing',
'irsdb.return',
...
]
Assuming you have a manage.py
for your application, you can then run
> python manage.py makemigrations
> python manage.py migrate
And now we are finally ready to load in the data.
First, we need to load in the index of filings for a given year. That looks like
> python manage.py enter_yearly_submissions 2021
This script checks to see if the IRS' index file is any bigger than
the one one disk, and only runs if it has. You can force it to try to
enter any new filings (regardless of whether the file is updated) with
the --enter
option.
And then we can load the actual returns. If you want to return all the returns for a given year, you can do
> python manage.py load_filings 2021
If you have a list of target organizations and a list of their EINs, you can just load those in by supplying a CSV that has a column of EINs with "ein" as the field name.
> python manage.py load_filings 2021 --file=target.csv
The IRS's 990 Schema changes over time. The irsdb.metadata
and irsdb.schemas
apps
will generate a new models.py
that you can use for the return
app based on what
the irsx
package currently knows about schemas.
To use, first install this repo. You could do that by adding
https://github.com/datamade/990-xml-db/archive/refs/heads/feature/main.zip
to your app's requirements.txt
file.
Then have the following settings in your app's settings.py file:
INSTALLED_APPS = [
...
'irsdb.metadata',
'irsdb.schemas',
...
]
GENERATED_MODELS_DIR = '/where/you/want/the/output/to/go'
Assuming you have a manage.py
for your application, you can then run
> python manage.py makemigrations
> python manage.py migrate
Next, run
> python manage.py load_metadata
> python manage.py generate_schemas_from_metadata
The generate_schemas_from_metadata
command will create a file called django_models_auto.py
in the directory you set GENERATED_MODELS_DIR
to.
You can replace the irsdb/return/models.py
with this file.
There's a problem with the 2014 index file. An internal comma has "broken" the .csv format for some time. You can fix it with a perl one liner (which first backs the file up to index_2014.csv.bak before modifying it)
$ perl -i.bak -p -e 's/SILVERCREST ASSET ,AMAGEMENT/SILVERCREST ASSET MANAGEMENT/g' index_2014.csv
We can see that it worked by diffing it.
$ diff index_2014.csv index_2014.csv.bak
39569c39569
< 11146506,EFILE,136171217,201212,1/14/2014,MOSTYN FOUNDATION INC CO SILVERCREST ASSET MANAGEMENT,990PF,93491211007003,201302119349100700
---
> 11146506,EFILE,136171217,201212,1/14/2014,MOSTYN FOUNDATION INC CO SILVERCREST ASSET ,AMAGEMENT,990PF,93491211007003,201302119349100700
For more details see here.
The loading process uses columns in the filing model to track load process (and to insure the same files aren't loaded twice).
TK - explanation of keyerrors
There's a sql script that will remove all entered rows from all return tables and reset the fields in filing as if they were new.
If you want to live dangerously, you can run it from the console like this:
$ python manage.py dbshell < ./return/sql/delete_all_return.sql
There are management commands to create or drop indexes on object_id, ein and (for schedule K) documentId. Use
$ python manage.py make_indexes
or
$ python manage.py drop_indexes
. These are just conveniences to create indexes named xx_<tablename> --they won't remove other indexes.
You can remove all filings from a given index file with the remove_year. It's likely to run faster if indexes are in place.
If loading gets interrupted, you can remove only the rows where parse_started is true and parse_complete is not with the management command remove_half_loaded. It also requires a year as a command line argument.
$ python manage.py remove_half_loaded 2018
The full download of uncompressed .xml files is over ~74 gigabytes. Processing a complete year of data probably entails moving at least 15 gigs of xml.
You probably want to look into a tool to help you move these files in bulk. AWS' S3 CLI can dramatically reduce download time, but seems unhelpful when trying to pull a subset of files (it seems like --exclude '*' hangs when processing so many files). You may want to look into moving all the files to your own S3 bucket as well. There are also alternatives to AWS' CLI tool, like S3 CMD.
You'll also want to configure IRSx file cache directory to set the WORKING_DIRECTORY variable to the file path of the folder where the xml files are located.
The worst option is to download the uncompressed files one at a time. That sounds, really, really slow.
With most hosting providers, you'll need to configure additional storage to support the static files and the database that's ultimately loaded. Make sure that you set the database storage directory to that storage, and get the fastest storage type you can afford.
You may want to look into tuning your database parameters to better support data loading. And you'll get better performance if you only create indexes after loading is complete (and delete them before bulk loads take place).
One random datapoint: on an Amazon t2.medium ec2 server (~$38/month) with 150 gigs of additional storage and postgres running on the default configs and writing to an SSD EBS volume, load time for the complete set of about 490,000 filings from 2017 took about 3 hours.
This assumes no schema changes are required, which is usually the case.
Run an S3 sync to the location of the fillings. The whole collection is now over 80 GB, make sure you have room. You can also retrieve some other way (if you don't retrieve en masse the load_filings.py script will attempt to download one filing at a time). It's useful to run this with nohup, i.e.
nohup aws s3 sync s3://irs-form-990/ ./ &
Then update the index file data
$ python manage.py enter_yearly_submissions 2018
index_2018.csv has changed. Downloading updated file...
Done!
Entering xml submissions from /home/webuser/virt/env/lib/python3.5/site-packages/irsx/CSV/index_2018.csv
Committing 10000 total entered=10000
commit complete
Committing 10000 total entered=20000
commit complete
Added 24043 new entries.
Then enter the filings into the relational database with:
$ python manage.py load_filings 2018
Running filings during year 2018
Processed a total of 100 filings
Processed a total of 200 filings
Processed a total of 300 filings
...
Handled 24000 filings
Processed a total of 24000 filings
Processed a total of 24043 filings
Done
This script finds filings where submission_year
is the entered year and parse_complete
has not been set to True. It enters them in groups of 100 and sets parse_complete
to True after each batch has completed. The script is fairly fault tolerant, but if it dies in the middle it's important to remove all the half entered filings where parse_started
= True and parse_complete
is not True. (By default it is null, so don't try to match on parse_complete
= False).
--