Writing Functions and Building a Jinja Template

In previous posts I demonstrated how to pull data from a sqlite / spatialite data to generate reports using Python and Jinja, where Jinja2 is used as a template engine for creating LaTeX documents and the NYC Geodatabase is used as my test case. Up until now the scripts pulled the data “as is”. In this post I’ll demonstrate how I created derived variables, and how I created the Jinja2 template for the report. Please note – instead of duplicating all of the code I’m just going to illustrate the new pieces – you should check out the earlier posts to see how all the pieces fit together.

Aggregating Variables

Aggregating census data is a pretty common operation, and when working with American Community Survey estimates it’s also necessary to calculate a new margin of error for each derived value. I wrote two functions to accomplish this. For each function you pass in the keys for values you want to aggregate, a name which will be the name of the new variable, and a dictionary that contains all the keys and values that were taken from a database table for a specific geography.

#Functions for summing individual values and calculating margins of error
#for individual values

def calc_sums(keys,name,adict):
    tosum=[]
    for val in keys:
        tosum.append(adict.get(val))
    agg=sum(tosum)
    adict[name]=agg

def calc_moe(keys,name,adict):
    sqrd=[]
    for val in keys:
        item=adict.get(val)
        if item=='':
            pass
        else:
            sqrd.append(item**2)
    moe=round(math.sqrt(sum(sqrd)))
    adict[name]=moe

Later in the script, as we’re looping through all the geographies and gathering the necessary data into dictionaries that represent each data table, we call the function. In this example we’re combining household income brackets so that we don’t have so many categories:

for geog in geodict.keys():

    name=geodict.get(geog)
    filename='zzpuma_' + geog + '.tex'
    folder='puma_rept'
    outpath=os.path.join(folder,filename)

    acs1dict=pulltab('b_pumas_2013acs1','GEOID2',geog)
    acs2dict=pulltab('b_pumas_2013acs2','GEOID2',geog)

    calc_sums(['INC03_E','INC04_E'],'INC10K_E',acs1dict)
    calc_moe(['INC03_M','INC04_M'],'INC10K_M',acs1dict)
    calc_sums(['INC05_E','INC06_E'],'INC25K_E',acs1dict)
    calc_moe(['INC05_M','INC06_M'],'INC25K_M',acs1dict)
    calc_sums(['INC09_E','INC10_E'],'INC100K_E',acs1dict)
    calc_moe(['INC09_M','INC10_M'],'INC100K_M',acs1dict)

Rather than creating a new dictionary, these new values are simply appended to the existing dictionaries that contain the data taken from each of the ACS data tables in the database. They can be referenced in the template using their new column name.

Calculating Areas

I also want to include the geographic size of the PUMA as one of the report items. Columns for the area are included in the spatial table for the PUMAs – the features originally came from the TIGER files, and all TIGER files have an ALAND and an AWATER column that has land and water area in square meters. So we don’t have to calculate the area from the geometry – we can just use this function to convert the land and water attributes to square miles, and then calculate a total area:

def calc_area(adict,land,water,total):
    landarea=round(adict.get('aland10')*0.000000386102,2)
    waterarea=round(adict.get('awater10')*0.000000386102,2)
    totalarea=landarea+waterarea
    adict[land]=landarea
    adict[water]=waterarea
    adict[total]=totalarea

In the body of our script, we invoke our pulltab function (explained in an earlier post) to grab all the data from the PUMA spatial boundary table:

area=pulltab('c_bndy_pumas2010','geoid10',geog)

And then we can call our area function. We pass in the area dictionary, and what we want the new output column names to be – area for land, water, and total:

calc_area(area, 'LAND_SQM','WAT_SQM','TOT_SQM')

Like our previous aggregate script, this function appends our new values to the existing table-dictionary – in this case, one called area.

Aggregating Geographies

Our last function is a little more complicated. In all of our previous examples, we pulled PUMA-level data from the American Community Survey tables. What if we wanted 2010 Census data for the PUMAs? Decennial census data is not tabulated at the PUMA level, but it is tabulated at the census tract level. Since PUMAs are created by aggregating tracts, we can aggregate the census tract data in the NYC Geodatabase into PUMAs. Here’s our function:

#Function aggregates all values in a table with a group by field from a
#joined table, then creates a dictionary consisting of column names and values
#for a specific geography

def sumtab(tabname,jointab,id1,id2,gid,geog):
    query='SELECT * FROM %s LIMIT 1' %(tabname)
    curs.execute(query)
    col_names = [cn[0] for cn in curs.description]
    tosum=[]
    for var in col_names[3:]:
        tosum.append("SUM("+var+") AS '0_"+var+"'")
    summer=', '.join(str(command) for command in tosum)
    query='SELECT %s, %s FROM %s, %s WHERE %s = %s and %s = %s GROUP BY %s' %(gid,summer,tabname,jointab,id1,id2,gid,geog,gid)
    curs.execute(query)
    col_names = [cn[0] for cn in curs.description]
    rows = curs.fetchall()
    for row in rows:
        thedict=dict(zip(col_names,row))
    return thedict

What’s going on here? The first thing we need to do is associate the census tracts with the PUMAs they’re located in. The NYC Geodatabase does NOT have a relationship table for this, so I had to create one. We have to pass in the table name, the relationship table, the unique IDs for each, and then the ID and the geography that we’re interested in (remember our script is looping through PUMA geographies one by one). The first thing we do is a little trick – we get the names of every column in the existing data table, and we append them to a list where we create a new column name based on the existing one (in this case, append a 0 in front of the column name – in retrospect I realize this is a bad idea as column names should not begin with numbers, so this is something I will change). Then we can take the list of column names and create a giant string out of them.

With that giant string (called summer) we can now pass all of the parameters that we need into the SQL query. This selects all of our columns (using the summer string), the table names and join info, for the specific geographic area that we want and then groups the data by that geography (i.e. all tracts that have the same PUMA number). Then we zip the column names and values together in a dictionary that the function returns.

Later on in our script, we call the function:

   census10=sumtab('b_tracts_2010census','b_tracts_to_pumas','GEOID2','tractid','pumaid',geog)

Which creates a new dictionary called census10 that has all the 2010 census data for our PUMA. Like the rest of our dictionaries, census10 is passed out to the Jinja2 template and its values can be invoked using the dictionary keys (the column headings):

outfile=open(outpath,'w')
    outfile.write(template.render(geoid=geog, geoname=name, acs1=acs1dict, acs2=acs2dict, area=area,
                                  c2010=census10))
    outfile.close()

Designing the Template

The Jinja template is going to look pretty busy compared to our earlier examples, and in both cases they’re not complete (this is still a work in progress).

I wanted to design the entire report first, to get a sense for how to balance everything I want on the page, without including any Jinja code to reference specific variables in the database. So I initially worked just in LaTeX and focused on designing the document with placeholders. Ultimately I decided to use the LaTeX minipage environment as it seemed the best approach in giving me control in balancing items on the page. The LaTeX wikibook entries on floats, figures, and captions and on boxes was invaluable for figuring this out. I used rule to draw boxes to serve as placeholders for charts and figures. Since the report is being designed as a document (ANSI A 8 1/2 by 11 inches) I had no hang-up with specifying precise dimensions (i.e. this isn’t going into a webpage that could be stretched or mushed on any number of screens). I loaded the xcolor package so I could modify the row colors of the tables, as well as a number of other packages that make it easy to balance table and figure captions on the page (caption, subscaption, and multicol).

Once I was satisfied with the look and feel, I made a copy of this template and started modifying the copy with the Jinja references. The references look awfully busy, but this is the same thing I’ve illustrated in earlier posts. We’re just getting the values from the dictionaries we created by invoking their keys, regardless of whether we’re taking new derived values that we created or simply pulling existing values that were in the original data tables. Here’s a snippet of the LaTeX with Jinja that includes both derived (2010 Census, area) and existing (ACS) variables:

%Orientation - detail map and basic background info
begin{minipage}{textwidth}
	begin{minipage}[h]{3in}
   		centering
   		rule{3in}{3in}
    		captionof{figure}{Race by 2010 Census Tract}
	end{minipage}
  	hfill
	begin{minipage}[h]{4in}
		centering
		captionof{table}{Geography}
    	begin{tabular}{cccc}hline
		& Land & Water & Total\ hline
		Area (sq miles) &  num{VAR{area.get('LAND_SQM')}} & num {VAR{area.get('WAT_SQM')}} &  num {VAR{area.get('TOT_SQM')}} \ hline
		vspace{10pt}
	 end{tabular}
	captionof{table}{Basic Demographics}
	rowcolors{3}{SpringGreen}{white}
	 begin{tabular}{cccc}hline
		& textbf{2010 Census} & textbf{2009-2013} & textbf{ACS Margin}\
		& & textbf{ACS} & textbf{of Error}\ hline
		Population & num {VAR{c2010.get('0_HD01_S001')}} & num{VAR{acs2.get('SXAG01_E')}} & +/- num{VAR{acs2.get('SXAG01_M')}}\
		Males & num {VAR{c2010.get('0_HD01_S026')}} & num{VAR{acs2.get('SXAG02_E')}} & +/- num{VAR{acs2.get('SXAG02_M')}}\
		Females & num {VAR{c2010.get('0_HD01_S051')}}& num{VAR{acs2.get('SXAG03_E')}} & +/- num{VAR{acs2.get('SXAG03_M')}}\
		Median Age (yrs) & num {99999} & num{VAR{acs2.get('SXAG17_E')}} & +/- num{VAR{acs2.get('SXAG17_M')}}\
		Households & num {VAR{c2010.get('0_HD01_S150')}} & num{VAR{acs1.get('HSHD01_E')}} & +/- num{VAR{acs1.get('HSHD01_M')}}\
		Housing Units & num {VAR{c2010.get('0_HD01_S169')}} & num{VAR{acs2.get('HOC01_E')}} & +/- num{VAR{acs2.get('HOC01_M')}}\ hline
	end{tabular}
	end{minipage}
 end{minipage}

And here’s a snippet of the resulting PDF:

report_inprogress

What Next?

You may have noticed references to figures and charts in some of the code above. I’ll discuss my trials and tribulations with trying to use matplotlib to create charts in some future post. Ultimately I decided not to take that approach, and was experimenting with using various LaTeX packages to produce charts instead.