List "base" (non-calculated) columns from PUF to synthesize #4
On Fri, 16 Nov 2018, Max Ghenis wrote:
@feenberg said in an email:
There are quite a number of variables in the PUF that are arithmetic functions of more
basic variables. These include AGI, Itemized Deductions, Taxable Income, tax, etc. We
can use all of these directly from the PUF as part of the synthesis procedure. There is
no need to synthesize them via RF or CART.
Is there a list of such variables, so we can focus on base variables?
I would suggest:
E00100 AGI
E02000 Schedule E
E03260 Deduction for Self employment tax
P04470 Total Deductions
E21040 Itemized Deduction limitation
E04800 Taxable Income
E05100 Tax on taxable income
E05200 Computed Regular Tax
E05800 Income tax before credits
E06000 Income subject to tax
E06200 Marginal tax base
E06300 Tax generated
E06500 Total income tax
E08800 Income tax after credits
E10300 Total tax liability
E09600 Alternative minimum tax
E62100 Alternative minimum taxable income
E07180 Total tax credit
E06500 Total income tax
E08800 Income Tax after Credits
E10300 Total tax liability
E59680 EIC used to offset income tax before credits
E59700 EIC used to offset all other taxes
E59720 EIC refunded
E11070 Refundable Child Credit
TXRT Tax rate code
We could also use Taxsim to calculate E18425.
I do suggest that we provide some value for every variable in the PUF,
rather than select the ones we think are "useful", because part of the job
of the synth file is to allow users to test programs for syntax before
submitting to the actual PUF. Even if we won't have a useful number for,
say "Payment with return" it is important to have some number there for
such checking.
We don't want to get too caught up in getting the right distribution for
every variable combination - that isn't the only purpose of the file.
Dan
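A minimal sketch of the syntax-checking use case described above: before a program is submitted against the real PUF, it helps to confirm that the synthetic file carries every PUF column name. The function name `missing_columns` and the sample column lists are hypothetical, chosen only for illustration.

```python
def missing_columns(puf_columns, synth_columns):
    """Return PUF column names that are absent from the synthetic file.

    Order follows puf_columns so the report is easy to compare
    against the PUF codebook.
    """
    synth = set(synth_columns)
    return [name for name in puf_columns if name not in synth]


# Illustrative column lists only; real files have many more columns.
puf_cols = ["E00100", "E00200", "TXRT"]
synth_cols = ["E00200", "E00100"]
gaps = missing_columns(puf_cols, synth_cols)
```

An empty result would mean a program written against the PUF column names can at least be parsed and run against the synthetic file, whatever the quality of the individual values.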
…
Having this for the CPS file as well--which is useful for testing in public on GH without worrying
about disclosing data--would also be useful.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
This raises some interesting questions about which variables to synthesize and how. I think at least in the near term we should focus primarily on a file that plays well with Tax-Calculator. (If there is some additional important use for the file, we should be explicit about that.) If this is the correct principle, then we should definitely synthesize all variables that are essential to the running of Tax-Calculator. Perhaps there are supplemental variables that we also want to synthesize, or construct in some other probably less sophisticated way (e.g., "populate" is a term used by TPC). I see 5 kinds of variables:
I think we should focus on synthesizing variables of types 1, 2, and 3 -- agreeing completely with @feenberg that some are more important than others, and therefore deserve more analytic attention than others. Size generally should be a good indication of degree of analytic attention needed, but not always.
We should not synthesize variables of type 4. After synthesizing types 1, 2, and 3 we should run the resulting synthetic file - which at this point should be similar to any input file to Tax-Calculator - through Tax-Calculator using tax law for the base year of the file, to get the values for variables of type 4.
As for variables of type 5, I think they should have to fight for existence. If we have a demonstrated need - or a general principle - for why a variable or set of variables should be on the file, then they could be considered. As @feenberg says, maybe tax payment should be on the file. If that is the case, perhaps some very simple post-synthesis method -- e.g., x% of tax liability - should be used to "populate" the file.
If this is the approach, I think the workflow would be something like this:
(Step 2 might be skipped or perhaps automated so that we do steps 1 and 3 before we evaluate the file seriously.)
That still leaves two questions not fully addressed: (1) how should the file constructed for Tax-Calculator relate to different PUFs -- PUF as prepared by SOI, and PUF as enhanced by @andersonfrailey?, and (2) how best to identify the type 1 and 2 elemental variables, and implicitly, the type 4 variables?
For question 1, at some point we will face a choice - should we synthesize the PUF as prepared by SOI, or the PUF as enhanced by @andersonfrailey? As the text above suggests, I think in the near term we have to synthesize the PUF as enhanced, or else we won't have a file to run through Tax-Calculator until someone enhances the file we synthesize. So that is an easy question. If and when we have a really good result, we would want to have a larger discussion about this as it would affect other workflows. Obviously it only makes sense to enhance once, so it probably always will be sensible to synthesize a post-enhancement rather than pre-enhancement PUF, but it is worth an explicit discussion at some point in the future. But the short-term answer is clear: synthesize the enhanced PUF.
For question 2, I think @feenberg's list looks good, but what we should do to be sure is start with the required inputs to Tax-Calculator and make those the synthesis variables (types 1 and 2 above), rather than trying to synthesize everything but the calculated (type 4) variables. We could look to the Tax-Calculator inputs documentation, and then verify with @andersonfrailey and/or @martinholmer.
One final note: There may be variables on the PUF, or on the CPS-based PUF, that are important beyond their Tax-Calculator need, as @feenberg mentioned. I don't know what those other needs are but I imagine one or more among us does, and it would be good to include them, too -- but I'd still suggest that they are not as important as those that are needed for Tax-Calculator. |
While I generally agree with the sentiments expressed below, I have an
additional consideration that may partially contradict some of them. I do
wish that whatever methodology we adopt be reproducible on a
different tax year without a lot of work. That is, I would hope we would
have a script or program that was sufficiently general that when the next
year of data became available the script could be run against the new PUF
without a research project to determine the details. This would be a great
advantage, even if the quality of the imputations was reduced.
I fear that the TPC project will not be reproducible once the grant money
is spent because it will take a year or more of analysis to modify the
programs to another year. If this is the case, our project could be the
long-lasting one, if it was easy to repeat.
Additional comments below.
On Sun, 18 Nov 2018, Don Boyd wrote:
This raises some interesting questions about which variables to synthesize and how.
I think at least in the near term we should focus primarily on a file that plays well with
Tax-Calculator. (If there is some additional important use for the file, we should be explicit about
that.) If this is the correct principle, then we should definitely synthesize all variables that are
essential to the running of Tax-Calculator. Perhaps there are supplemental variables that we also want
to synthesize, or construct in some other probably less sophisticated way (e.g., "populate" is a term
used by TPC).
I do hope that we synthesize all the variables, even if we don't do a good
job on all of them.
I see 5 kinds of variables:
1. Elemental variables (i.e., variables that are used in Tax-Calculator calculations) that are in the
PUF as we receive it from SOI -- for example, e00300 interest received.
2. Elemental variables that are not in the PUF as received from SOI, but are in the enhanced PUF that
is used by Tax-Calculator -- values that are constructed by @andersonfrailey and/or @martinholmer
and are essential to the operation of Tax-Calculator -- I think of the split of wages between
prime and spouse (e00200p and e00200s) as an example.
3. Elemental variables we may want to construct to make synthesis better. For example, Tax-Calculator
needs (I think) both e00600 (total dividends in AGI) and e00650 (qualified dividends). We may find
that it makes economic sense to construct unqualified dividends as e00600 - e00650, and then
synthesize qualified and unqualified dividends separately, and add them up to get total e00600.
Why? Because we might find that simple unconstrained synthesis methods can result in qualified
dividends being larger than total dividends on some synthetic returns, something I think logic and
law would not allow - so either we need to do something more complex to impose a constraint, or
else construct the components and add them together. (Yet another approach could be to synthesize
total dividends and the share of total dividends that are qualified, with that share constrained
to [0, 1].) All of this could happen unknown to Tax-Calculator - we would synthesize a variable it
doesn't know about (unqualified dividends) but only include in our data the variables it needs
(e00650 and e00600).
One more possibility - synthesize qualified and unqualified dividends,
then calculate total dividends from the components. There are 2 elemental
variables here and one to be calculated. Pick any 2 to synthesize. It may
not even matter.
4. Variables that are calculated by Tax-Calculator from these elemental variables, such as AGI
(c00100) and tax before credit (taxbc), which may or may not have counterparts in the PUF from SOI
(e.g., AGI in original PUF is e00100).
These should come from the tax calculator, not synthesized.
5. Variables that are in the PUF as received from SOI but are not used in Tax-Calculator and may or
may not be important for other purposes. TXRT tax rate code strikes me as such a variable.
Using PUF TXRT in the synthesis step may help us get the correlations
between elemental variables and marginal tax rates correct, which will
help scoring revenue.
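The dividend constraint discussed under item 3 can be sketched in a few lines. This is only an illustration of the two options mentioned there (a qualified share clipped to [0, 1], or components that are added up to a total); the function names and the argument names `total_div` and `qualified_share` are placeholders, not Tax-Calculator variables.

```python
def split_dividends(total_div, qualified_share):
    """Split total dividends using a share clipped to [0, 1].

    Clipping guarantees that qualified dividends never exceed the
    total, whatever value an unconstrained synthesis step produced
    for the share.
    """
    share = min(1.0, max(0.0, qualified_share))
    qualified = total_div * share
    unqualified = total_div - qualified
    return qualified, unqualified


def total_from_components(qualified, unqualified):
    """Alternative: synthesize the two components, then add them up."""
    return qualified + unqualified


# A synthesized share above 1 is clipped, so qualified == total.
q_hi, u_hi = split_dividends(1000.0, 1.3)
q_ok, u_ok = split_dividends(1000.0, 0.25)
```

Either route keeps the identity "qualified plus unqualified equals total" true on every synthetic record, which is the property the unconstrained approach can violate.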
I think we should focus on synthesizing variables of types 1, 2, and 3 -- agreeing completely with
@feenberg that some are more important than others, and therefore deserve more analytic attention than
others. Size generally should be a good indication of degree of analytic attention needed, but not
always.
I do fear that too much analytical attention paid to individual variables
will interfere with getting the work done quickly, and being portable to
the next PUF.
We should not synthesize variables of type 4. After synthesizing types 1, 2, and 3 we should run the
resulting synthetic file - which at this point should be similar to any input file to Tax-Calculator -
through Tax-Calculator using tax law for the base year of the file, to get the values for variables of
type 4.
Yes
As for variables of type 5, I think they should have to fight for existence. If we have a demonstrated
need - or a general principle -for why a variable or set of variables should be on the file, then they
could be considered. As @feenberg says, maybe tax payment should be on the file. If that is the case,
perhaps some very simple post-synthesis method -- e.g., x% of tax liability - should be used to
"populate" the file.
Why not use CART for these?
If this is the approach, I think the workflow would be something like this:
1. Synthesize a data set with elemental variables of types 1, 2 and 3 -- using more sophisticated
methods for some than the other
2. Initial evaluation to see if it passes laugh tests
3. Run it (with type 3 dropped) through Tax-Calculator to get the calculated values for type 4
variables
4. More-sophisticated evaluation, taking into account the values of type 4 calculated variables,
which we will really care about
5. If and when we are satisfied with step 4, we can construct those type 5 variables that justify
their existence, using simple methods.
(Step 2 might be skipped or perhaps automated so that we do steps 1 and 3 before we evaluate the file
seriously.)
I would have a different procedure:
1) Synthesize all elemental variables using the calculated variables as a
base. Use a mechanical application of RF or CART.
2) Recalculate the calculated variables and substitute the new values into
the file.
3) Calculate a revenue score by AGI class for a small finite change in each
parameter of the tax calculator using the PUF and synth.
4) Calculate the correlation between scores calculated in the two
different ways.
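Steps 3 and 4 above might look roughly like the following. This is a sketch under assumptions: the revenue totals by AGI class are taken as already computed (for example by running the tax calculator under baseline and reform law), the numbers are invented, and `score_by_class` and `pearson` are hypothetical helper names. Pearson correlation is one reasonable reading of "correlation" here, not necessarily the one intended.

```python
import math


def score_by_class(baseline, reform):
    """Revenue score per AGI class: reform revenue minus baseline revenue."""
    return [r - b for b, r in zip(baseline, reform)]


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Invented revenue totals for three AGI classes, for illustration only.
puf_scores = score_by_class([10.0, 20.0, 30.0], [11.0, 22.0, 33.0])
synth_scores = score_by_class([9.0, 21.0, 29.0], [11.0, 25.0, 35.0])
agreement = pearson(puf_scores, synth_scores)
```

Repeating this for each tax parameter gives one correlation per parameter, which can be produced mechanically for each new PUF year.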
That still leaves two questions not fully addressed: (1) how should the file constructed for
Tax-Calculator relate to different PUFs -- PUF as prepared by SOI, and PUF as enhanced by
@andersonfrailey?, and (2) how best to identify the type 1 and 2 elemental variables, and implicitly,
the type 4 variables?
If synth has all the PUF variables with the PUF names, it is a good
training dataset. If it is customized to our calculator, it isn't. If it
has all the PUF variables and some more, it serves both purposes. If we
don't overanalyze the problem, we can do that.
We could simply take the CPS imputations into the PUF before the CART, or
repeat the CPS imputations after the CART. Whichever is easier, I suppose.
Either way, they are going to be pretty far removed from their origin, and
I wouldn't have high hopes for getting representative cross-correlations.
Nor should that bother us.
Dan
|
Up above @feenberg said:
I think steps 1 and 2 are important and have moved them to a separate discussion as issue #7. |
I think we need to be careful about synthesizing only the enhanced PUF that we use in Tax-Calculator. Many of our enhancements come after we've augmented the PUF with the CPS file and I worry that trying to synthesize the PUF after we've augmented it will negatively affect our results. |
Agreed @andersonfrailey. I'd suggest we synthesize the raw PUF, then pass that to the rest of the Tax-Calculator PUF creation procedure as the raw PUF is today. My original question is whether any of the features in the raw PUF are derived from the rest of the raw PUF. If so we can synthesize the non-derived features, then calculate the derived features after synthesis. Without doing this we'll risk the derived features not making sense. |
On Tue, 20 Nov 2018, andersonfrailey wrote:
I think we need to be careful about synthesizing only the enhanced PUF that
we use in Tax-Calculator. Many of our enhancements come after we've
augmented the PUF with the CPS file and I worry that trying to synthesize
the PUF after we've augmented it will negatively affect our results.
I don't understand. Augmented adds variables and records, CART can't
improve the result, but why should the augmented material suffer worse
than the PUF data?
Dan
|
On Tue, 20 Nov 2018, Max Ghenis wrote:
Agreed @andersonfrailey. I'd suggest we synthesize the raw PUF, then pass
that to the rest of the Tax-Calculator PUF creation procedure as the raw PUF
is today.
My original question is whether any of the features in the raw PUF are
derived from the rest of the raw PUF. If so we can synthesize the
non-derived features, then calculate the derived features after synthesis.
Without doing this we'll risk the derived features not making sense.
If all calculated values are calculated after synthesis, then the return
will balance. That seems like the way to go.
Dan
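To illustrate why recalculating after synthesis makes the return balance, here is a toy sketch. The identity used (a total equal to the sum of three components) and the names `DERIVED` and `recalculate` are hypothetical stand-ins; the real derived variables would come from the tax calculator, not from a lookup table like this.

```python
# Toy identity for illustration only; not actual tax law.
DERIVED = {"total_income": ("wages", "interest", "dividends")}


def recalculate(record):
    """Return a copy of record with every derived field recomputed.

    Any synthesized value for a derived field is overwritten, so the
    identity holds on the output regardless of what synthesis produced.
    """
    out = dict(record)
    for name, parts in DERIVED.items():
        out[name] = sum(out[part] for part in parts)
    return out


# A synthetic record whose synthesized total does not balance.
synthetic = {"wages": 50000, "interest": 200, "dividends": 300,
             "total_income": 49000}
balanced = recalculate(synthetic)
```

If the derived field had been synthesized alongside its components, nothing would force the total to equal their sum; dropping it before synthesis and recomputing afterwards removes that failure mode.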
|
I think this is resolved based on the list @feenberg provided in #4 (comment). Let's discuss how to use elemental vs calculated variables in #7. And cheers to the first resolved issue in the repo! 🥇 |
Reopening as @andersonfrailey can provide a list of minimal columns needed from the raw PUF. |
Of the 209 variables that come in the raw PUF, we only keep 68 in the PUF used by Tax-Calculator. Here's a list:
There are 89 variables in the PUF used by Tax-Calculator, the rest come from either the CPS or are derived by us during file preparation. |
Thanks @andersonfrailey, good to know we need to synthesize at most 67 (skipping
|
One more: just want to confirm none of these are direct transformations of others, and can be calculated without synthesis? |
I don't see much value in synthesizing this. We could just make this equal to whatever year of the PUF we're synthesizing in my opinion.
I made a mistake keeping those in the list. Neither are ultimately used in the PUF, but
Correct. All of the variables that can be and ultimately are calculated in Tax-Calculator are dropped. |
Got it, thanks. We'll synthesize all variables listed in #4 (comment) except
|