List "base" (non-calculated) columns from PUF to synthesize #4

Closed
MaxGhenis opened this issue Nov 17, 2018 · 15 comments

Comments

@MaxGhenis
Collaborator

@feenberg said in an email:

There are quite a number of variables in the PUF that are arithmetic functions of more basic variables. These include AGI, Itemized Deductions, Taxable Income, tax, etc. We can use all of these directly from the PUF as part of the synthesis procedure. There is no need to synthesize them via RF or CART.

Is there a list of such variables, so we can focus on base variables?

Having this for the CPS file as well--which is useful for testing in public on GH without worrying about disclosing data--would also be useful.

@feenberg
Collaborator

feenberg commented Nov 18, 2018 via email

@donboyd5
Owner

donboyd5 commented Nov 18, 2018

This raises some interesting questions about which variables to synthesize and how.

I think at least in the near term we should focus primarily on a file that plays well with Tax-Calculator. (If there is some additional important use for the file, we should be explicit about that.) If this is the correct principle, then we should definitely synthesize all variables that are essential to the running of Tax-Calculator. Perhaps there are supplemental variables that we also want to synthesize, or construct in some other probably less sophisticated way (e.g., "populate" is a term used by TPC).

I see 5 kinds of variables:

  1. Elemental variables (i.e., variables that are used in Tax-Calculator calculations) that are in the PUF as we receive it from SOI -- for example, e00300 interest received.
  2. Elemental variables that are not in the PUF as received from SOI, but are in the enhanced PUF that is used by Tax-Calculator -- values that are constructed by @andersonfrailey and/or @martinholmer and are essential to the operation of Tax-Calculator -- I think of the split of wages between prime and spouse (e00200p and e00200s) as an example.
  3. Variables we may want to construct to make synthesis better. For example, Tax-Calculator needs (I think) both e00600 (total ordinary dividends in AGI) and e00650 (qualified dividends). It may make economic sense to construct unqualified dividends as e00600 - e00650, synthesize qualified and unqualified dividends separately, and add them up to get total e00600. Why? Because simple unconstrained synthesis methods can result in qualified dividends being larger than total dividends on some synthetic returns, something I think logic and law would not allow. So either we need to do something more complex to impose a constraint, or else construct the components and add them together. (Yet another approach could be to synthesize total dividends and the share of total dividends that are qualified, with that share constrained to [0, 1].) All of this could happen unknown to Tax-Calculator: we would synthesize a variable it doesn't know about (unqualified dividends) but only include in our data the variables it needs (e00600 and e00650).
  4. Variables that are calculated by Tax-Calculator from these elemental variables, such as AGI (c00100) and tax before credit (taxbc), which may or may not have counterparts in the PUF from SOI (e.g., AGI in original PUF is e00100).
  5. Variables that are in the PUF as received from SOI but are not used in Tax-Calculator and may or may not be important for other purposes. TXRT tax rate code strikes me as such a variable.
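
The dividend example in item 3 can be made concrete. The sketch below is illustrative only: it assumes, following the Tax-Calculator input documentation, that e00600 holds total ordinary dividends and e00650 the qualified portion. The helper names (`from_components`, `from_share`) and the toy values are hypothetical. It shows the two constructions mentioned in that item, either of which guarantees that qualified dividends never exceed total dividends.

```python
def from_components(qualified, unqualified):
    """Rebuild e00600 and e00650 from separately synthesized components.

    Both components are floored at zero, so total >= qualified always holds.
    """
    q = max(qualified, 0.0)
    u = max(unqualified, 0.0)
    return {"e00600": q + u, "e00650": q}


def from_share(total, qualified_share):
    """Rebuild e00600 and e00650 from a synthesized total and qualified share.

    The share is clipped to [0, 1] before it is applied to the total.
    """
    t = max(total, 0.0)
    s = min(max(qualified_share, 0.0), 1.0)
    return {"e00600": t, "e00650": t * s}


# Toy synthetic draws (made-up numbers, not PUF data).
records = [from_components(120.0, 30.0), from_share(200.0, 1.3)]
for rec in records:
    # The constraint holds by construction in both approaches.
    assert rec["e00650"] <= rec["e00600"]
```

In either variant the helper quantity (unqualified dividends or the share) would be dropped before the file goes to Tax-Calculator, so only e00600 and e00650 remain.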

I think we should focus on synthesizing variables of types 1, 2, and 3 -- agreeing completely with @feenberg that some are more important than others, and therefore deserve more analytic attention than others. Size generally should be a good indication of degree of analytic attention needed, but not always.

We should not synthesize variables of type 4. After synthesizing types 1, 2, and 3 we should run the resulting synthetic file - which at this point should be similar to any input file to Tax-Calculator - through Tax-Calculator using tax law for the base year of the file, to get the values for variables of type 4.

As for variables of type 5, I think they should have to fight for existence. If we have a demonstrated need, or a general principle, for why a variable or set of variables should be on the file, then they could be considered. As @feenberg says, maybe tax payment should be on the file. If that is the case, perhaps some very simple post-synthesis method (e.g., x% of tax liability) should be used to "populate" the file.

If this is the approach, I think the workflow would be something like this:

  1. Synthesize a data set with elemental variables of types 1, 2 and 3 -- using more sophisticated methods for some than for others
  2. Initial evaluation to see if it passes laugh tests
  3. Run it (with type 3 dropped) through Tax-Calculator to get the calculated values for type 4 variables
  4. More-sophisticated evaluation, taking into account the values of type 4 calculated variables, which we will really care about
  5. If and when we are satisfied with step 4, we can construct those type 5 variables that justify their existence, using simple methods.

(Step 2 might be skipped or perhaps automated so that we do steps 1 and 3 before we evaluate the file seriously.)
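
To make the first three steps concrete, here is a rough sketch of how they might be chained. Everything in it is a hypothetical stand-in: `run_workflow`, the toy callables, and the `unqualified_div` helper column are invented for illustration, and `toy_calculate` is a placeholder for a real Tax-Calculator run, not its actual logic.

```python
def run_workflow(synthesize, passes_laugh_test, calculate, helper_columns):
    """Chain workflow steps 1-3; every callable is a hypothetical stand-in."""
    records = synthesize()  # step 1: synthesize types 1, 2 and 3
    failed = [r for r in records if not passes_laugh_test(r)]  # step 2
    if failed:
        raise ValueError(f"{len(failed)} records failed the laugh test")
    results = []
    for rec in records:
        # Step 3: drop type 3 helper columns, then add calculated (type 4) values.
        inputs = {k: v for k, v in rec.items() if k not in helper_columns}
        results.append({**inputs, **calculate(inputs)})
    return results


# Toy stand-ins (not real synthesis and not Tax-Calculator's actual logic).
def toy_synthesize():
    return [{"e00200": 50000.0, "e00300": 100.0, "unqualified_div": 0.0}]


def toy_laugh_test(rec):
    return rec["e00200"] >= 0


def toy_calculate(rec):
    return {"c00100": rec["e00200"] + rec["e00300"]}


output = run_workflow(
    toy_synthesize, toy_laugh_test, toy_calculate, {"unqualified_div"}
)
```

Passing a laugh test that always returns `True` corresponds to skipping step 2 and going straight from synthesis to calculation.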

That still leaves two questions not fully addressed: (1) how should the file constructed for Tax-Calculator relate to the different PUFs (the PUF as prepared by SOI and the PUF as enhanced by @andersonfrailey), and (2) how best to identify the type 1 and 2 elemental variables and, implicitly, the type 4 variables?

For question 1, at some point we will face a choice: should we synthesize the PUF as prepared by SOI, or the PUF as enhanced by @andersonfrailey? As the text above suggests, I think in the near term we have to synthesize the PUF as enhanced, or else we won't have a file to run through Tax-Calculator until someone enhances the file we synthesize. So that is an easy question. If and when we have a really good result, we would want to have a larger discussion about this, as it would affect other workflows. Obviously it only makes sense to enhance once, so it probably always will be sensible to synthesize a post-enhancement rather than pre-enhancement PUF, but it is worth an explicit discussion at some point in the future. But the short-term answer is clear: synthesize the enhanced PUF.

For question 2, I think @feenberg's list looks good, but what we should do to be sure is start with the required inputs to Tax-Calculator and make those the synthesis variables (types 1 and 2 above), rather than trying to synthesize everything but the calculated (type 4) variables. We could look to the Tax-Calculator inputs documentation, and then verify with @andersonfrailey and/or @martinholmer.

One final note: There may be variables on the PUF, or on the CPS-based PUF, that are important beyond their Tax-Calculator need, as @feenberg mentioned. I don't know what those other needs are but I imagine one or more among us does, and it would be good to include them, too -- but I'd still suggest that they are not as important as those that are needed for Tax-Calculator.

@feenberg
Collaborator

feenberg commented Nov 18, 2018 via email

@donboyd5
Owner

donboyd5 commented Nov 19, 2018

Up above @feenberg said:

I would have a different procedure:

  1. Synthesize all elemental variables using the calculated variables as a
    base. Use a mechanical application of RF or CART.

  2. Recalculate the calculated variables and substitute the new values into
    the file.

  3. Calculate a revenue score by AGI class for a small finite change in each
    parameter of the tax calculator using the PUF and synth.

  4. Calculate the correlation between scores calculated in the two
    different ways.

I think steps 1 and 2 are important and have moved them to a separate discussion as issue #7.
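
Steps 3 and 4 of the quoted procedure amount to comparing two vectors of revenue scores, one entry per AGI class, computed once from the PUF and once from the synthetic file. A minimal sketch of that comparison follows, assuming a Pearson correlation is what is meant; the `pearson` helper and the score values are hypothetical and made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical revenue changes by AGI class for one small parameter change,
# scored once on the PUF and once on the synthetic file (invented numbers).
puf_scores = [1.0, 2.0, 4.0, 8.0]
synth_scores = [1.1, 1.9, 4.2, 7.9]

r = pearson(puf_scores, synth_scores)
```

Repeating this for each tax parameter would give one correlation per parameter, with values close to 1 indicating that the synthetic file scores reforms much like the PUF does.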

@andersonfrailey
Collaborator

I think we need to be careful about synthesizing only the enhanced PUF that we use in Tax-Calculator. Many of our enhancements come after we've augmented the PUF with the CPS file and I worry that trying to synthesize the PUF after we've augmented it will negatively affect our results.

@MaxGhenis
Collaborator Author

Agreed @andersonfrailey. I'd suggest we synthesize the raw PUF, then pass that to the rest of the Tax-Calculator PUF creation procedure as the raw PUF is today.

My original question is whether any of the features in the raw PUF are derived from the rest of the raw PUF. If so, we can synthesize the non-derived features, then calculate the derived features after synthesis. Otherwise we risk derived features that are inconsistent with their inputs.

@feenberg
Collaborator

feenberg commented Nov 20, 2018 via email

@feenberg
Collaborator

feenberg commented Nov 20, 2018 via email

@MaxGhenis
Collaborator Author

MaxGhenis commented Nov 21, 2018

I think this is resolved based on the list @feenberg provided in #4 (comment).

Let's discuss how to use elemental vs calculated variables in #7.

And cheers to the first resolved issue in the repo! 🥇

@MaxGhenis
Collaborator Author

Reopening as @andersonfrailey can provide a list of minimal columns needed from the raw PUF.

MaxGhenis reopened this Nov 28, 2018
@andersonfrailey
Collaborator

Of the 209 variables that come in the raw PUF, we only keep 68 in the PUF used by Tax-Calculator. Here's a list:

{'dsi',
 'e00200',
 'e00300',
 'e00400',
 'e00600',
 'e00650',
 'e00700',
 'e00800',
 'e00900',
 'e01100',
 'e01200',
 'e01400',
 'e01500',
 'e01700',
 'e02000',
 'e02100',
 'e02300',
 'e02400',
 'e03150',
 'e03210',
 'e03220',
 'e03230',
 'e03240',
 'e03270',
 'e03290',
 'e03300',
 'e03400',
 'e03500',
 'e07240',
 'e07260',
 'e07300',
 'e07400',
 'e07600',
 'e09700',
 'e09800',
 'e09900',
 'e11200',
 'e17500',
 'e18400',
 'e18500',
 'e19200',
 'e19800',
 'e20100',
 'e20400',
 'e24515',
 'e24518',
 'e26270',
 'e27200',
 'e32800',
 'e58990',
 'e62900',
 'e87521',
 'e87530',
 'eic',
 'f2441',
 'f6251',
 'fded',
 'flpdyr',
 'mars',
 'midr',
 'n24',
 'p08000',
 'p22250',
 'p23250',
 'p86421',
 'recid',
 's006',
 'xtot'}

There are 89 variables in the PUF used by Tax-Calculator; the rest either come from the CPS or are derived by us during file preparation.

@MaxGhenis
Collaborator Author

Thanks @andersonfrailey, good to know we need to synthesize at most 67 (skipping recid). Couple questions for you:

  1. Is there value in synthesizing flpdyr, or is that just to match back to the original PUF?
  2. A couple of these aren't in the Tax-Calculator documentation, like fded and p86421. Are they used in calculating other variables in the processed PUF, or can we skip them too?

@MaxGhenis
Collaborator Author

One more: just want to confirm none of these are direct transformations of others, and can be calculated without synthesis?

@andersonfrailey
Collaborator

Is there value in synthesizing flpdyr, or is that just to match back to the original PUF?

I don't see much value in synthesizing this. We could just make this equal to whatever year of the PUF we're synthesizing, in my opinion.

A couple of these aren't in the Tax-Calculator documentation, like fded and p86421. Are they used in calculating other variables in the processed PUF, or can we skip them too?

I made a mistake keeping those in the list. Neither is ultimately used in the PUF, but fded is used in puf_data/finalprep.py and p86421 is actually dropped.

One more: just want to confirm none of these are direct transformations of others, and can be calculated without synthesis?

Correct. All of the variables that can be and ultimately are calculated in Tax-Calculator are dropped.

@MaxGhenis
Collaborator Author

Got it, thanks. We'll synthesize all variables listed in #4 (comment) except recid, flpdyr, and p86421 (65 total variables).

fded is used in https://github.com/open-source-economics/taxdata/blob/master/puf_data/finalprep.py#L50, so we'll synthesize that:

  cmbtp = np.where(data['FDED'] == 1, cmbtp_itemizer, cmbtp_standard)
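
The selection agreed on in this thread can be written as simple set arithmetic. The sketch below copies the 68 column names from the list earlier in the thread and removes the three excluded ones; the names `RAW_PUF_KEPT`, `SKIP`, and `TO_SYNTHESIZE` are hypothetical and not taken from taxdata or Tax-Calculator.

```python
# The 68 raw-PUF columns kept for Tax-Calculator, as listed in this thread.
RAW_PUF_KEPT = set("""
dsi e00200 e00300 e00400 e00600 e00650 e00700 e00800 e00900 e01100
e01200 e01400 e01500 e01700 e02000 e02100 e02300 e02400 e03150 e03210
e03220 e03230 e03240 e03270 e03290 e03300 e03400 e03500 e07240 e07260
e07300 e07400 e07600 e09700 e09800 e09900 e11200 e17500 e18400 e18500
e19200 e19800 e20100 e20400 e24515 e24518 e26270 e27200 e32800 e58990
e62900 e87521 e87530 eic f2441 f6251 fded flpdyr mars midr
n24 p08000 p22250 p23250 p86421 recid s006 xtot
""".split())

# Columns excluded from synthesis per the discussion: recid is an identifier,
# flpdyr can be set to the file year, and p86421 is dropped in file preparation.
SKIP = {"recid", "flpdyr", "p86421"}

# fded stays in because finalprep.py uses it.
TO_SYNTHESIZE = sorted(RAW_PUF_KEPT - SKIP)

print(len(RAW_PUF_KEPT), len(TO_SYNTHESIZE))
```

Keeping the skip list explicit makes it easy to revisit the decision if one of these columns turns out to matter later.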
