READ_ME for "basic_extract_YEAR" SQL script
Code by David Splinter

There are three sections in the READ_ME. The first is the description of the project, which links individuals in different tax units (including non-filers) into households. The second is a step-by-step summary of the SQL code. The third is a summary of the fields and observations generated by the script.

///////////////////////////////////////////////////////////////////
// SECTION I. Project Description from Section III.C. of Larrimore, Mortenson, Splinter (2019)
///////////////////////////////////////////////////////////////////

After aggregating tax forms to the individual level and establishing that the population counts using these data are consistent with those from the Census Bureau, individuals are aggregated into households using reported addresses and ZIP codes. Prior to linking tax returns by physical address, we link all individuals who appear together on the same tax return as either a primary filer, a spouse, or a dependent. Importantly, all dependents — even adult dependents — are considered to be part of the claimant’s household and are not treated as separate economic units. This is consistent with the tax definition of a dependent, where an individual can only be claimed as a dependent if they fail to cover at least half of their own expenses. Consequently, all individuals in a tax unit are treated as being part of the primary filer’s household and as having the same address as the primary filer. This is true even if one or more individuals list a separate address on their own tax forms. Most frequently, dependents with different addresses are likely children away at college for part of the year, who can still be claimed as a dependent if they are under age 24 and are a full-time student for at least 5 months of the year.

After constructing complete tax units, we turn to linking tax units and non-filing individuals into households by physical address. We allow only one address per person. Filer addresses are always taken from tax returns if available. Non-filer addresses come from information returns. If a non-filer has multiple information returns that include both a street address and a P.O. Box address, we use the street address. In the rare case of multiple street addresses on information returns, we sort the addresses numerically and alphabetically and choose the first address after sorting.

To construct the household-level file, these addresses are recoded into a standard form (e.g., recoding “1ST ST” or “FIRST STREET” to “FIRST ST” and then removing all spaces) and individuals are considered to live together if their address and 5-digit ZIP code both match (all address corrections included in this recoding are provided in the Online Appendix). For individuals living in an apartment or multi-unit building, the unit number must match as well as the main address. To limit false matches for multi-unit buildings, we identify multi-unit building addresses that are missing unit identifiers and divide these tax units into separate households. For example, if at least three tax units have addresses of type “1 MAIN ST APT XX” with apartment numbers included and two other tax units in the same ZIP code have addresses of “1 MAIN ST” but with no apartment number, we assume the two tax returns are simply missing unit information and are treated as separate households. This approach helps reduce the likelihood of false-positive merges from address-reporting errors on the tax forms.

Even after our extensive standardization of common address abbreviations, misspelling of street names remain. To link records due to close misspellings, we implement near-year matches. First, we identify misspelled street names by comparing our addresses to a master list of street names. This master list was provided by the address verification company SmartyStreets and includes 5,087,497 ZIP code/street name combinations (SmartyStreets 2019). Before making this comparison, uncleaned street names in the tax records are edited to be letter-only street names by: (1) converting number streets to letters as described in the address standardization process above, (2) removing all remaining numbers including house and apartment numbers, and (3) removing leading and trailing characters such as “APT” or “STREET.” We then observe whether the street listed on each unmatched tax-unit household exists on the master list of street names in the taxpayer’s ZIP code. For any unmatched tax unit with an invalid address, including a missing address or a P.O. Box address, we first attempt to correct the address and ZIP code by replacing them with the next-year tax return or information return data.

The next year’s address information is used if it meets specific criteria for its similarity to the current year (invalid) address. The next year’s address must not be missing or a P.O. Box. It also must have either the same first two digits of their house/apartment number and a different ZIP code where at least 3 of 5 digits are the same (to correct an apparently small number of misreported ZIP codes), or the same first two digits of their house/apartment number and the same ZIP code (to correct misspelled street names). Since these similarity tests are not possible when the current year’s address is missing or is a P.O. Box, the next-year physical address is also used if the current-year address is missing a street name or an unmatched P.O. Box (to account for missing street addresses). This matching process is repeated with prior-year addresses. We then use these cleaned addresses to link individuals into complete households.

///////////////////////////////////////////////////////////////////
// SECTION II. Summary of the SQL Code
///////////////////////////////////////////////////////////////////

// Step 1: Generate an Aggregate List of All Individuals

	1.a. dependent filer TINs

	1.b. non-dependent filer TINs AND addresses

	1.c. all dependent TINs AND addresses, including filers and spouses
		-- drop third or fourth dependents that are iTINs.

	1.d. non-filer TINs, for ages 0-100

	1.e. non-filer addresses

	1.f. combine the results of 1.b., 1.c., and 1.e.

	1.g. create a list of all filer TINs, dependent and primary

	1.h. create a list of all dependent TINs, excluding filer TINS

// Step 2: multi unit addresses
	
	2.a. create a table of addresses of multi-unit buildings
	
	2.b. drop everything following "apt", "unit", and "#" in the address line
	
	2.c. aggregate by multi-unit address (excluding apt suffix) to prevent multiple merges
	 
	2.d. merge to original address and set multitrunc to 1 if address matches the truncated multi-unit address (tin, zipcode, state, addressorig, address, alphonly, numonly)

// Step 3: clean addresses

	3.a. standardize common address components
		-- many operations restricted to those with "non-standard" addresses
	
	3.b. remove all spaces (and shorten length by removing trailing spaces); and create number-only and letter-only addresses
	
	3.c. calculate household size distribution at this point
	
	3.d. clean number only and alpha only for fuzzy matches (needed for all all to identify invalid addresses)
		-- numonly
		-- alphaonly
		-- drop characters to the right of certain strings for alphaonly
		-- more alphaonly

// Step 4: valid address list
		
	4.a. Import valid address list for all zipcodes (street name (with street numbers), zipcode, state code
			-- this is a list of valid addresses as of 2017
			
	4.b. standardize the address list in the valid address data
	
// Step 5: bad address list

	5.a. Fraction of all tax units with invalid addresses
	
	5.b. Fraction of unmatched tax units with invalid addresses
	
	5.c. Identify primary tins in #filers1 with invalid street addresses for their zipcode
	
// Step 6: next year matches

	6.a. next year addresses for those with invalid addresses
	
	6.b. pre-clean next year addresses of those with invalid addresses
	
	6.c. select a single address for a given ptin
	
	6.d. replace with next year addresses if (a) the address is bad and (b) the same first two numbers of the zipcode match and (c) 3 of 5 digits of zipcode match and (d) the addresses do not match
	
	6.e. clean next year addresses
	
	6.f. Merge next year matches back to full list to replace address
	
// Step 7: prior year matches

	7.a. prior year addresses for those with invalid addresses
	
	7.b. pre-clean prior year addresses of those with invalid addresses
	
	7.c. select a single address for a given ptin
	
	7.d. replace with prior year addresses if (a) the address is bad and (b) the same first two numbers of the zipcode match and (c) 3 of 5 digits of zipcode match and (d) the addresses do not match
	
	7.e. clean prior year addresses
	
	7.f. Merge prior year matches back to full list to replace address
	
// Step 8: fuzzy match with valid addresses

	8.a. Create list of invalid addresses of unmatched tax units after near-year matches

	8.b. fuzzy match if invalid address and minimum street criteria in same zip and TUsize=1

	8.c. fuzzy match if invalid address and minimum street criteria in same zip and TUsize>1

	8.d. Merge fuzzy matches back to full list to replace address

// Step 9: create the individual, household extract

	9.a. create the HHid and TUid fields

	9.b. create the extract (broken into four parts) and output the rows

	9.c. check number number of individuals, households, and tax units 

///////////////////////////////////////////////////////////////////
// SECTION III. Summary of the data extract
///////////////////////////////////////////////////////////////////

The extract is broken into four parts. To create a single data set for a given year, append the four parts.

Here are the fields in the data:

-- TIN: masked individual TIN 
-- TAX_YR
-- HHid: max(TIN) in a given household
-- TUid: primary filer TIN in the case of TINs listed on returns and the observation-level TIN in the case of non-filers
-- ind_type: 1=primary filer, 2=dependent filer, 3=non-primary claimed on a tax return, 4=non-filer

In comparison, Census counts non-resident dependents as members of separate households (or members of group housing) from those claiming them as a dependent, resulting in a higher number of households.