Some modernization and tidying up #38

Open
wants to merge 42 commits into
base: master
185c3c4
point to propublica metadata submodule
asuozzo Apr 7, 2023
05f56c5
update submodules again
asuozzo Apr 10, 2023
874b6f1
update submodules again
asuozzo Apr 10, 2023
1b69c70
support 2021 versions
asuozzo Apr 11, 2023
f8fa826
Merge branch 'use-propub-submodule' of github.com:propublica/990-xml-…
asuozzo May 2, 2023
6e68ae4
Merge branch 'use-propub-submodule'
asuozzo May 2, 2023
32dcd93
add pipfile
asuozzo May 2, 2023
9305137
add allowed versionstrings
asuozzo May 7, 2023
a275d64
fix namespacing
asuozzo May 8, 2023
8308fc4
add 2022 990x version strings
asuozzo May 24, 2023
46eeef1
fix namespace function
asuozzo Aug 1, 2023
befcce0
add new schema versions
asuozzo Aug 21, 2023
52911ac
capture attributes so we can parse subsection #
asuozzo Oct 5, 2023
b570660
parse 990EZ subsection
asuozzo Oct 5, 2023
a919048
change gitmodules to relative
asuozzo Oct 26, 2023
5c6fd85
add guesses for 2023 schema versions
asuozzo Nov 30, 2023
1f989f4
use giving tuesday bucket, bring back downloads, get tests working
fgregg Jun 13, 2024
73e64e4
Create python-package.yml
fgregg Jun 13, 2024
e4759a7
flake8
fgregg Jun 13, 2024
77cfcef
gh action checkout get submodules
fgregg Jun 13, 2024
dc65dda
flake8 config
fgregg Jun 13, 2024
6b9c80d
black and isort checks
fgregg Jun 13, 2024
a7e71f0
black and isort checks
fgregg Jun 13, 2024
3285251
black and isort checks
fgregg Jun 13, 2024
48c9db7
fix imports
fgregg Jun 13, 2024
277f3c9
update testing instructions
fgregg Jun 13, 2024
2495bbd
Update README.md
fgregg Jun 13, 2024
d0be7fd
use environmental variables instead of local_settings.py
fgregg Jun 13, 2024
1a40d8b
Update text_format_utils.py
fgregg Jun 13, 2024
231637d
Update text_format_utils.py
fgregg Jun 13, 2024
0a88421
blacken
fgregg Jun 13, 2024
8361ba5
csv
fgregg Jun 13, 2024
7c7b5b2
add build steps
fgregg Jun 13, 2024
5f28011
use environmental variables instead of local_settings.py
fgregg Jun 13, 2024
ae899af
Update text_format_utils.py
fgregg Jun 13, 2024
5869383
Update text_format_utils.py
fgregg Jun 13, 2024
5ad69cc
blacken
fgregg Jun 13, 2024
769c902
csv
fgregg Jun 13, 2024
1939a40
add build steps
fgregg Jun 13, 2024
2358563
Merge branch 'env_settings'
fgregg Jun 13, 2024
29ba7c3
slight changes to metadaa
fgregg Jun 13, 2024
d8db8bb
update commit
fgregg Jun 13, 2024
8 changes: 8 additions & 0 deletions .flake8
@@ -0,0 +1,8 @@
[flake8]
exclude =
    venv,
    **/migrations/*
# So flake8 plays nicely with black
# https://black.readthedocs.io/en/stable/guides/using_black_with_other_tools.html
max-line-length = 120
extend-ignore = E203
54 changes: 54 additions & 0 deletions .github/workflows/python-package.yml
@@ -0,0 +1,54 @@
name: Python package

on:
  push:
    branches: [ "main" ]
  pull_request:
    branches: [ "main" ]
  release:

jobs:
  build:

    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.9", "3.10", "3.11"]

    steps:
    - uses: actions/checkout@v4
      with:
        submodules: true
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v3
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        python -m pip install flake8 pytest black isort
        pip install .[tests]
    - name: Lint with flake8
      run: |
        # stop the build if there are Python syntax errors or undefined names
        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
    - name: Check black
      run: black --check irs_reader tests
    - name: isort
      run: isort --profile=black --check-only irs_reader tests
    - name: Test with pytest
      run: |
        pytest
    - name: Build distribution
      if: ${{ github.event_name == 'release' }}
      run: |
        pip install build
        python -m build
    - name: Upload source distribution
      if: ${{ github.event_name == 'release' }}
      uses: softprops/action-gh-release@v2
      with:
        files: dist/*
7 changes: 1 addition & 6 deletions .gitmodules
@@ -1,8 +1,3 @@
[submodule "irs_reader/metadata"]
path = irs_reader/metadata
url = https://github.com/jsfenfen/990-xml-metadata
branch = master
[submodule "metadata"]
path = metadata
url = https://github.com/jsfenfen/990-xml-metadata
branch = master
url = https://github.com/datamade/990-xml-metadata.git
41 changes: 12 additions & 29 deletions README.md
@@ -1,8 +1,5 @@
# IRSx

Update: 12/16. The IRS has announced it will no longer post xml 990 filings to AWS, thereby undermining irsx' ability to automatically retrieve filings. The IRS does appear to make the raw filings available in [bulk format on this page](https://www.irs.gov/charities-non-profits/form-990-series-downloads). It is possible to use IRSx by retrieving the files and placing them at the location that IRSX expects to find them. We are seeking additional information from IRS and plan to address this soon.


## Table of Contents

- [Installation](#installation)
@@ -289,21 +286,11 @@ For example:

### Legacy configuration ###

You also can configure IRSx's cache location by setting the local_settings.py file. To figure out where that settings file is, log in to a terminal and type:

>>> from irsx.settings import IRSX_SETTINGS_LOCATION
>>> IRSX_SETTINGS_LOCATION
'/long/path/to/lib/python3.6/site-packages/irsx/settings.py'

[ If you get an error, try upgrading irsx with `pip install irsx --upgrade` -- this feature was added in 0.1.1. ]


Go to that directory. You can either modify the settings.py file or the local_settings.py file. To do the latter, first `cd` into the directory where the settings files live and run:

$ cp local_settings.py-example local_settings.py

Then edit local_settings.py to set WORKING\_DIRECTORY to where the raw xml files are found.
You can also configure IRSx's cache location by setting an environment variable.

```console
> export IRSX_CACHE_DIRECTORY=/where/you/like
```
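
As a rough illustration of how a settings module might pick up this variable, here is a minimal sketch. Only the variable name `IRSX_CACHE_DIRECTORY` comes from the README text above; the function name `resolve_cache_directory` and the fallback location `~/.irsx_cache` are assumptions made for this example, not the project's actual settings code.

```python
import os
from pathlib import Path


def resolve_cache_directory(env=None, default=None):
    """Return the cache directory, preferring IRSX_CACHE_DIRECTORY if it is set.

    `env` defaults to os.environ; passing a plain dict makes this easy to test.
    The fallback directory is a hypothetical default chosen for illustration.
    """
    env = os.environ if env is None else env
    fallback = Path.home() / ".irsx_cache" if default is None else Path(default)
    value = env.get("IRSX_CACHE_DIRECTORY")
    # An unset or empty variable falls back to the default location.
    return Path(value).expanduser() if value else fallback


if __name__ == "__main__":
    print(resolve_cache_directory())
```

Reading the variable inside a function, rather than once at import time, keeps the lookup testable and lets a caller override the environment without editing any file in `site-packages`.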

## IRSx from python

@@ -477,20 +464,16 @@ You can still add command line args, like this:

## Testing

Nosetests - Test coverage is incomplete, improve it with coverage.py; run 'pip install coverage'
then:

$ nosetests --with-coverage --cover-erase --cover-package=irs_reader

or

$ coverage report -m



Tox -- see tox.ini; testing for: 2.7,3.4,3.5,3.6. You may need to run `pip install tox` in the testing environment.
Install dependencies
```console
> pip install .[tests]
```

And run tests

```console
> pytest
```

## Acknowledgements

2 changes: 1 addition & 1 deletion irs_reader/_version.py
@@ -1 +1 @@
__version__ = '0.3.2'
__version__ = "0.3.2"
2 changes: 1 addition & 1 deletion irs_reader/dir_utils.py
@@ -3,7 +3,7 @@


def mkdir_p(paths):
""" Makedirs, from http://stackoverflow.com/a/600612 """
"""Makedirs, from http://stackoverflow.com/a/600612"""
for path in paths:
try:
os.makedirs(path)
35 changes: 19 additions & 16 deletions irs_reader/file_utils.py
@@ -1,11 +1,17 @@
import re
import os
import re
from datetime import datetime

import requests

from datetime import datetime
from .settings import IRS_XML_HTTP_BASE, WORKING_DIRECTORY, INDEX_DIRECTORY, IRS_INDEX_BASE
from .settings import (
INDEX_DIRECTORY,
IRS_INDEX_BASE,
IRS_XML_HTTP_BASE,
WORKING_DIRECTORY,
)

OBJECT_ID_RE = re.compile(r'20\d{16}')
OBJECT_ID_RE = re.compile(r"20\d{16}")

# Not sure how much detail we need to go into here
OBJECT_ID_MSG = """
@@ -18,40 +24,37 @@


def stream_download(url, target_path, verbose=False):
""" Download a large file without loading it into memory. """
"""Download a large file without loading it into memory."""
response = requests.get(url, stream=True)
handle = open(target_path, "wb")
if verbose:
print("Beginning streaming download of %s" % url)
start = datetime.now()
try:
content_length = int(response.headers['Content-Length'])
content_MB = content_length/1048576.0
content_length = int(response.headers["Content-Length"])
content_MB = content_length / 1048576.0
print("Total file size: %.2f MB" % content_MB)
except KeyError:
pass # allow Content-Length to be missing
pass # allow Content-Length to be missing
for chunk in response.iter_content(chunk_size=512):
if chunk: # filter out keep-alive new chunks
if chunk: # filter out keep-alive new chunks
handle.write(chunk)

if verbose:
print(
"Download completed to %s in %s" %
(target_path, datetime.now() - start))
print("Download completed to %s in %s" % (target_path, datetime.now() - start))


def validate_object_id(object_id):
""" It's easy to make a mistake entering these, validate the format """
"""It's easy to make a mistake entering these, validate the format"""
result = re.match(OBJECT_ID_RE, str(object_id))
if not result:
print("'%s' appears not to be a valid 990 object_id" % object_id)
raise RuntimeError(OBJECT_ID_MSG)
return object_id


# Files are no longer available on S3
# def get_s3_URL(object_id):
# return ("%s/%s_public.xml" % (IRS_XML_HTTP_BASE, object_id))
def get_s3_URL(object_id):
return "%s/%s_public.xml" % (IRS_XML_HTTP_BASE, object_id)


def get_local_path(object_id):
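
A note on the object ID check in `file_utils.py`: `re.match` anchors only at the start of the string, so an ID with extra trailing digits still passes `20\d{16}`. The sketch below shows one possible stricter variant using `fullmatch`, together with the URL pattern from the restored `get_s3_URL`. The helper names and the placeholder base URL are assumptions for illustration; the real base comes from `IRS_XML_HTTP_BASE` in the project's settings, which is not shown in this diff.

```python
import re

# Same pattern as in the diff: "20" followed by 16 digits (18 characters total).
OBJECT_ID_RE = re.compile(r"20\d{16}")

# Hypothetical placeholder; the project reads its base URL from settings.
EXAMPLE_XML_BASE = "https://example.invalid/990-xml"


def is_valid_object_id(object_id):
    """Return True only if the whole value is a well-formed object ID."""
    return OBJECT_ID_RE.fullmatch(str(object_id)) is not None


def build_xml_url(object_id, base=EXAMPLE_XML_BASE):
    """Build the "<base>/<object_id>_public.xml" URL after validating the ID."""
    if not is_valid_object_id(object_id):
        raise ValueError("'%s' appears not to be a valid 990 object_id" % object_id)
    return "%s/%s_public.xml" % (base, object_id)


if __name__ == "__main__":
    print(build_xml_url("201533089349301428"))
```

Whether the stricter check is wanted depends on how callers pass IDs today; it is offered as an option for review, not as a required change to this PR.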