sqlite

Looping Through a Database to Create Reports

I’ve got a lot of ground to cover, picking up where I left off several months ago. In earlier posts I presented the concept for creating reports from sqlite databases using Python, Jinja, and LaTeX, and looked at different methods for passing data from the database to the template. I’m using the NYC Geodatabase as our test case. In this entry I’ll cover how I implemented my preferred approach – creating Python dictionaries to pass to the Jinja template.

One of the primary decisions I had to make was how to loop through the database. Since the reports we’re making are profiles (lots of different data for one geographic area), we’re going to want to loop through the database by geography. So, for each geography select all the data from a specific table, pass the data out to the template where the pertinent variables are pulled, build the report and move on to the next geography. In contrast, if we were building comparison tables (one specific variable for many geographic areas) we would want to loop through the data by variable.

In the beginning of the script we import the necessary modules, set up the Jinja environment, and specify our template (not going to repeat that code here – see the previous post). Then we have our function that creates a dictionary for a specific data table for a specific geography:

def pulltab(tabname,idcol,geog):
    query='SELECT * FROM %s WHERE %s = %s' %(tabname,idcol,geog)
    curs.execute(query)
    col_names = [cn[0] for cn in curs.description]
    rows = curs.fetchall()
    for row in rows:
        thedict=dict(zip(col_names,row))
    return thedict

We connect to the database and create a dictionary of all the geographies (limited to 3 PUMAs since this is just a test):

#Connect to database and create dictionary of all geographies

conn = sqlite3.connect('nyc_gdb_jan2015a/nyc_gdb_jan2015.sqlite')
curs = conn.cursor()
curs.execute('SELECT geoid10, namelsad10 FROM a_pumas2010 ORDER BY geoid10 LIMIT 3')
rows = curs.fetchall()
geodict=dict(rows)

And then we generate reports by looping through all the geographies in that dictionary, and we pass in the ID of each geography to pull all data from a data table for that geography out of the table and into a dictionary.

#Generate reports by looping through geographies and passing out
#dictionaries of values

for geog in geodict.keys():
    acs1dict=pulltab('b_pumas_2013acs1','GEOID2',geog)
    acs2dict=pulltab('b_pumas_2013acs2','GEOID2',geog)
    name=geodict.get(geog)
    filename='zzpuma_' + geog + '.tex'
    folder='test5'
    outpath=os.path.join(folder,filename)

Lastly, we pass the dictionaries out to the template, and run LaTeX to generate the report from the template:

  outfile=open(outpath,'w')
    outfile.write(template.render(geoid=geog, geoname=name, acs1=acs1dict, acs2=acs2dict))
    outfile.close()

    os.system("pdflatex -output-directory=" + folder + " " + outpath)

conn.close()

The Jinja template (as a LaTeX file) is below – the example here is similar to what I covered in my previous post. We passed two dictionaries into the template, one for each data table. The key is the name of the variable (the column name in the table) and the value is the American Community Survey estimate and the margin of error. We pass in the key and get the value in return. The PDF output follows.

documentclass{article}
usepackage[margin=0.5in]{geometry}
usepackage{graphicx}
usepackage[labelformat=empty]{caption}
usepackage[group-separator={,}]{siunitx}

title{VAR{acs1.get('GEOLABEL') | replace("&","&")} VAR{acs1.get('GEOID2')}}
date{}

begin{document}

maketitle
pagestyle{empty}
thispagestyle{empty}  

begin{table}[h]
centering
caption{Commuting to Work - Workers 16 years and over}
begin{tabular}{|c|c|c|c|c|}

hline
& Estimate & Margin of Error & Percent Total & Margin of Error\
hline

Car, truck, or van alone & num{VAR{acs1.get('COM02_E')}} & +/- num{VAR{acs1.get('COM02_M')}}
& num{VAR{acs1.get('COM02_PC')}} & +/- num{VAR{acs1.get('COM02_PM')}}\

Car, truck, or van carpooled & num{VAR{acs1.get('COM03_E')}} & +/- num{VAR{acs1.get('COM03_M')}}
& num{VAR{acs1.get('COM03_PC')}} & +/- num{VAR{acs1.get('COM03_PM')}}\

Public transit & num{VAR{acs1.get('COM04_E')}} & +/- num{VAR{acs1.get('COM04_M')}}
& num{VAR{acs1.get('COM04_PC')}} & +/- num{VAR{acs1.get('COM04_PM')}}\

Walked & num{VAR{acs1.get('COM05_E')}} & +/- num{VAR{acs1.get('COM05_M')}}
& num{VAR{acs1.get('COM05_PC')}} & +/- num{VAR{acs1.get('COM05_PM')}}\

Other means & num{VAR{acs1.get('COM06_E')}} & +/- num{VAR{acs1.get('COM06_M')}}
& num{VAR{acs1.get('COM06_PC')}} & +/- num{VAR{acs1.get('COM06_PM')}}\

Worked at home & num{VAR{acs1.get('COM07_E')}} & +/- num{VAR{acs1.get('COM07_M')}}
& num{VAR{acs1.get('COM07_PC')}} & +/- num{VAR{acs1.get('COM07_PM')}}\
hline

end{tabular}
end{table}

begin{table}[h]
centering
caption{Housing Tenure}
begin{tabular}{|c|c|c|c|c|}

hline
& Estimate & Margin of Error & Percent Total & Margin of Error\
hline

Occupied housing units & num{VAR{acs2.get('HTEN01_E')}} & +/- num{VAR{acs2.get('HTEN01_M')}} &  &\

Owner-occupied & num{VAR{acs2.get('HTEN02_E')}} & +/- num{VAR{acs2.get('HTEN02_M')}}
& num{VAR{acs2.get('HTEN02_PC')}} & +/- num{VAR{acs2.get('HTEN02_PM')}}\

Renter-occupied & num{VAR{acs2.get('HTEN03_E')}} & +/- num{VAR{acs2.get('HTEN03_M')}}
& num{VAR{acs2.get('HTEN03_PC')}} & +/- num{VAR{acs2.get('HTEN03_PM')}}\
hline

end{tabular}
end{table}
end{document}

In this example we took the simple approach of grabbing all the variables that were in a particular table, and then we just selected what we wanted within the template. This is fine since we’re only dealing with 55 PUMAs and a table that has 200 columns or so. If we were dealing with gigantic tables or tons of geographies, we could modify the Python script to pull just the variables we wanted to speed up the process; my inclination would be to create a list of variables in a text file, read that list into the script and modify the SQL function to just select those variables.

What if we want to modify some of the variables before we pass them into the template? I’ll cover that in the next post.

Inserting Data into Templates with Python and Jinja

In this post, I’m picking up where I left off and will cover the different methods I experimented with to get data out of a SQLite database and into a Jinja LaTeX template using Python. I’m using the NYC Geodatabase as my test case.

Standard Elements – Used Each Time

First – the Python script. For each iteration, the top half of the script remains the same. I import the necessary modules, and I set up my Jinja2 environment. This tells Jinja how to handle LaTeX syntax. I borrowed this code directly from the invaluable slides posted here. The only part that gets modified each time is the .get_template() bit, which is the actual LaTeX template with Jinja mark-up that is used for creating the reports.

import sqlite3

import jinja2
import os
from jinja2 import Template

latex_jinja_env = jinja2.Environment(
    block_start_string = 'BLOCK{',
    block_end_string = '}',
    variable_start_string = 'VAR{',
    variable_end_string = '}',
    comment_start_string = '#{',
    comment_end_string = '}',
    line_statement_prefix = '%-',
    line_comment_prefix = '%#',
    trim_blocks = True,
    autoescape = False,
    loader = jinja2.FileSystemLoader(os.path.abspath('.'))
)
# Modify to specify the template
template = latex_jinja_env.get_template('test1.tex')

The method for connecting to a SQLite database is also the same each time. There are a zillion tutorials and posts for working with Python and SQLite so I won’t belabor that here. Take a look at this excellent one or this awesome one.

conn = sqlite3.connect('nyc_gdb_jan2015a/nyc_gdb_jan2015.sqlite')
curs = conn.cursor()
curs.execute('SELECT * FROM b_pumas_2013acs1 ORDER BY GEOID2 LIMIT 3')

col_names = [cn[0] for cn in curs.description]
rows = curs.fetchall()
conn.close()

First Iteration – Pass Individual Variables to the Template

Here’s the bit that I modify each time. Using the example from the tutorial slides, I loop through the rows returned from my database, and I specify individual variables each time by slicing the elements in the row and assigning them a name which is passed out to the template with template.render(). Then I make a call to LaTeX to generate the PDF file (straightforward since I’m using Linux), one for each row (which represent geographic areas). Each file is named using the unique ID number of the geography, which we grabbed from our row list.

for row in rows:
    filename='zpuma_' + row[0] + '.tex'
    folder='test1'
    outpath=os.path.join(folder,filename)
    outfile=open(outpath,'w')
    outfile.write(template.render(geoid=row[0], geolabel=row[1], hshld=row[2], hshldmoe=row[3]))
    outfile.close()
    os.system("pdflatex -output-directory=" + folder + " " + outpath)

That’s the Python piece. The LaTeX template with the Jinja mark-up looks like this:

documentclass{article}
usepackage[margin=0.5in]{geometry}
usepackage[group-separator={,}]{siunitx}

title{VAR{geolabel | replace("&","&")} VAR{geoid}}
date{}

begin{document}

maketitle
pagestyle{empty}
thispagestyle{empty}  

begin{tabular}{|c|c|c|}
hline
& Estimate & Margin of Error\
Households: & num{VAR{hshld}} & +/- num{VAR{hshldmoe}}\
hline
end{tabular}

end{document}

You can see here where I’m passing in the variables with VAR – I’m using the same variable names that I created in the script to hold the row elements. I have to do a little bit of formatting to get this to work. First, one of my variables is text description that consistently contains an ampersand, so I have to use replace (a construct from Jinja) to replace & with & so LaTeX can properly escape it. Second, I want to format my numeric variables with a thousands separator. Here I use a LaTeX construct with the siunitx package, and every place a number appears I mark it with num. For this to work I always need to know that this variable will be a number; if it’s text or null LaTeX will throw an error and the doc won’t compile (an alternative to using this LaTeX solution would be to use Python’s formatting constructs). My simple output is below.

Second Iteration – Pass Variables to Template in a List

Since I’m going to be passing lots of variables out to my template, it would be tedious if I had to declare them all individually, one by one. It would be better if I could pass out an entire list, and then do the slicing to get what I want in the template. Here’s the Python for doing that:

for row in rows:
    filename='zzpuma_' + row[0] + '.tex'
    folder='test2'
    outpath=os.path.join(folder,filename)
    outfile=open(outpath,'w')
    outfile.write(template.render(thelist=row))
    outfile.close()
    os.system("pdflatex -output-directory=" + folder + " " + outpath)

And here’s the LaTeX template – in this example I modified the variables a bit.

documentclass{article}
usepackage[margin=0.5in]{geometry}
usepackage[group-separator={,}]{siunitx}

title{VAR{thelist[2] | replace("&","&")} VAR{thelist[1]}}
date{}

begin{document}

maketitle
pagestyle{empty}
thispagestyle{empty}  

begin{tabular}{|c|c|c|c|c|}
hline
& Estimate & Margin of Error & Percent Total & Percent Margin of Error\
Car, truck, or van alone: & num{VAR{thelist[171]}} & +/- num{VAR{thelist[172]}} & num{VAR{thelist[173]}} & +/- num{VAR{thelist[174]}}\
Car, truck, or van carpooled: & num{VAR{thelist[175]}} & +/- num{VAR{thelist[176]}} & num{VAR{thelist[177]}} & +/- num{VAR{thelist[178]}}\
hline
end{tabular}

end{document}

While this is a bit better, the template is harder to read – you can’t really figure out what’s in there as you just have a bunch of list slices. You also have to keep careful track of which indices apply to what element, so you know what you’re generating. I thought I could improve this by creating nested lists where the column headings from the database get carried along, and I could reference them somehow. Then I had a better idea.

Third Iteration – Pass Variables to Template in a Dictionary

I decided to use a dictionary instead of a list. Here’s the Python – since I grabbed the columns back in the database section of my code, I can loop through the elements in each row and create a dictionary by zipping the column names and row elements together, so the column name becomes the key and the row element is my data value. Then I pass the whole dictionary out to the template.

for row in rows:
    thedict=dict(zip(col_names,row))
    filename='zzpuma_' + row[0] + '.tex'
    folder='test3'
    outpath=os.path.join(folder,filename)
    outfile=open(outpath,'w')
    outfile.write(template.render(d=thedict))
    outfile.close()
    os.system("pdflatex -output-directory=" + folder + " " + outpath)

Now in the template, using Jinja I embed dict.get() for each variable and specify the key (column name) and the output will be the value. This is now highly readable, as I can see the names of the columns for the variables and there’s less potential for a mix-up.

documentclass{article}
usepackage[margin=0.5in]{geometry}
usepackage[group-separator={,}]{siunitx}

title{VAR{d.get('GEOLABEL') | replace("&","&")} VAR{d.get('GEOID2')}}
date{}

begin{document}

maketitle
pagestyle{empty}
thispagestyle{empty}  

begin{tabular}{|c|c|c|c|c|}
hline
& Estimate & Margin of Error & Percent Total & Margin of Error\

Car, truck, or van alone: & num{VAR{d.get('COM02_E')}} & +/- num{VAR{d.get('COM02_M')}} & num{VAR{d.get('COM02_PC')}} & +/- num{VAR{d.get('COM02_PM')}}\

Car, truck, or van carpooled: & num{VAR{d.get('COM03_E')}} & +/- num{VAR{d.get('COM03_M')}} & num{VAR{d.get('COM03_PC')}} & +/- num{VAR{d.get('COM03_PM')}}\
hline
end{tabular}

end{document}

In this case, the output looks the same as it did in our last iteration. Those are some basic methods for getting data into a template, and in my case I think the dictionary is the ideal data structure for this. In going further, my goal is to keep all the formatting and presentation issues in LaTeX, and all the data processing and selection pieces in Python.

Creating Reports with SQLite, Python, Jinja2, and LaTeX

For a long time, I’ve been wanting to figure out a way to generate reports from a SQLite / Spatialite database. For example, I’d like to reach into a database and generate profiles for different places that contain tables, charts, and maps. I know I can use Python to connect to the db and pull out variables. I also learned how to use LaTeX several years back when I revised the GIS Practicum manual, and routinely use it for writing reports, articles, and hand-outs.

I finally have time to devote to this, and am going to share what I’m learning in a series of posts. In this post I’ll describe how I got started, and will record some useful projects and posts that I’ve found.

Figuring Out What the Pieces Are

In searching the web for building reports in Python, I’ve discovered a number of solutions. Many people have written modules that are in various states of production, from active to defunct. Prettytable was something I’ve used for generating basic text-file reports. It’s absolutely great at what it does, but I’m looking for something that’s more robust. Of all the tools out there, ReportLab seemed to be the most prominent package that would appear again and again. I’ve shied away from it, because I wanted a solution that was a little more general – if that makes sense. Something where every component is not so tightly bound to a specific module.

Luckily I found this post, which was perfect for helping me to understand conceptually what I wanted to do. The author describes how he automatically generates song sheets by using a programming language (JAVA in this case) to reach into a database and insert the content into a template (LaTeX in this case) using a template engine (Apache Velocity) to produce good looking output. In this case, the template has the shell of a document and place-holders where variables will be passed in from the scripting language and rendered using the engine. He included this helpful diagram from wikimedia in his post:

I started looking for a template engine that would work well with both LaTeX and Python. The author had mentioned Cheetah as another engine, and it turns out that Cheetah is often used in conjunction with Python and LaTeX. After digging around some more, I discovered another template engine called Jinja (or Jinja2) which I’ve adopted as my solution, largely because I’ve found that the project documentation was quite good and there are numerous user examples that I can follow. Jinja2 allows you to do much more than simply passing variables into the template and rendering it; you have the option to run a lot of Pythonesque code from within the template itself.

Putting the Pieces Together

While Jinja is often used for generating HTML and XML (for example), it’s also used for LaTeX (for example). I found that this series of slides was the perfect introduction for me. They’re written in German, but since most of the syntax in the scripting and mark-up languages is in English it’s easy to grasp (and those three-years of German I took way back in high school are now reaping dividends!)

The slides break down how you can use Python to generate LaTeX reports in several iterations. The first iteration involves no templating at all – you simply use Python to generate the LaTeX code that you want (or if you prefer, Python serves as the template generator). The limit of this are obvious, in that you have to hard code variables into the output, or use string substitution to find and replace variable names with the intended output. In the next iteration, he demonstrates how to use Jinja2. This section is invaluable, as it provides an example of setting the Jinja2 environment so that you can escape all of the necessary characters and syntax that LaTeX needs to function. He demonstrates how to pass a variable from Python to render in a template that you create in LaTeX and mark-up with Jinja2 code (slides 18 to 20). He goes on to show how you can loop through lists to generate output.

The third iteration displays how you can pull data out of SQLite and then use Python and LaTeX to generate output. With a little imagination, you can combine this piece with his previous one and voila, you have a SQLite-Python-Jinja-Latex combo. He has a final piece that incorporates screen-scraping using Beautiful Soup, which is pretty neat but beyond my needs for this project.

Now that I understand the conceptual model and I have the four tools I’ll use with some examples, I’m ready to start experimenting. I know there will be several additional pieces I’ll need to incorporate, to generate charts (matplotlib) and maps (perhaps some of the Python modules from QGIS). There are some instances where I’ll also have to write functions to create derivatives of the data I’m pulling, so I imagine NumPy/SciPy and GDAL will come in handy for that. But first things first – I need to get the four basic pieces – SQLite – Python – Jinja2 – LaTeX – working together. That will be the topic of my next post.

Creating Reports with SQLite, Python, and prettytable

In addition to providing the NYC Geodatabase as a resource, I also wanted to use it to generate reports and build applications. None of the open source SQLite GUIs that I’m familiar with have built in report generating capabilities, so I thought I could use Python to connect to the database and generate them. I have some grand ambitions here, but decided to start out small.

Python has a built-in module, sqlite3, that you can use to work with SQLite databases. This is pretty well documented – do a search and you’ll find a ton of brief tutorials. Take a look at this great post for a comprehensive intro.

For generating reports I gave prettytable a shot: it lets you create nice looking ASCII text tables that you can copy and paste from the prompt or export out to a file. The tutorial for the module was pretty clear and covers the basics quite nicely. In the examples he directly embeds the data in the script and generates the table from it, which makes the tutorial readily understandable. For my purposes I wanted to pull data out of a SQLite database and into a formatted table, so that’s what I’ll demonstrate here.

Initially I had some trouble getting the module to load, primarily (I think) because I’m using Python 3.x and the setup file for the module was written for Python 2.x; the utility you use for importing 3rd party modules has changed between versions. I’m certainly no Python expert, so instead of figuring it out I just downloaded the module, dumped it into the site-packages folder (as suggested in the prettytable installation instructions under “The Harder Way” – but it wasn’t hard at all) and unzipped it. In my script I couldn’t get the simple “import prettytable” to work without throwing an error, but when I added the name of the specific function “import PrettyTable from prettytable” it worked. Your mileage may vary.

So here was my first go at it. I created a test database and loaded a table of population estimates from the US Census Bureau into it (you can download it if you want to experiment):

from prettytable import PrettyTable
import sqlite3

conn = sqlite3.connect('pop_test.sqlite')
curs = conn.cursor()
curs.execute('SELECT State, Name, ESTIMATESBASE2010 AS Est2010 FROM pop_est WHERE region="1" ORDER BY Name')

col_names = [cn[0] for cn in curs.description]
rows = curs.fetchall()

x = PrettyTable(col_names)
x.align[col_names[1]] = "l"
x.align[col_names[2]] = "r"
x.padding_width = 1
for row in rows:
    x.add_row(row)

print (x)
tabstring = x.get_string()

output=open("export.txt","w")
output.write("Population Data"+"n")
output.write(tabstring)
output.close()

conn.close()

The first piece is the standard SQLite piece – connect, activate a cursor, and execute a SQL statement. Here I’m grabbing three columns from the table for records that represent Northeastern states (Region 1). I read in the names of the columns from the first row into the col_names list, and I grab everything else and dump them into rows, a list that contains a tuple for each record:

>>> col_names
['State', 'Name', 'Est2010']
>>> rows
[('09', 'Connecticut', 3574097), ('23', 'Maine', 1328361), ('25', 'Massachusetts', 6547629),
 ('33', 'New Hampshire', 1316469), ('34', 'New Jersey', 8791898), ('36', 'New York', 19378104),
 ('42', 'Pennsylvania', 12702379), ('44', 'Rhode Island', 1052567), ('50', 'Vermont', 625741)]
>>>

The second piece will make sense after you have a quick look at the prettytable tutorial. Here I grab the list of columns names and specify how cells for the columns should be aligned (default is center) and padded (default is one space). Then I add each row from the nested list of tuples to the table, row by row. There are two outputs: print directly to the screen, and dump the whole table into a string. That string can then be dumped into a text file, along with a title. Here’s the screen output:

+-------+---------------+----------+
| State | Name          |  Est2010 |
+-------+---------------+----------+
|   09  | Connecticut   |  3574097 |
|   23  | Maine         |  1328361 |
|   25  | Massachusetts |  6547629 |
|   33  | New Hampshire |  1316469 |
|   34  | New Jersey    |  8791898 |
|   36  | New York      | 19378104 |
|   42  | Pennsylvania  | 12702379 |
|   44  | Rhode Island  |  1052567 |
|   50  | Vermont       |   625741 |
+-------+---------------+----------+

The one hangup I had was the formatting for the numbers: I really want some commas in there since the values are so large. I couldn’t figure out how to do this using the approach above – I’m writing all the rows in one swoop, and couldn’t step in and and format the last value for each row.

Unless – instead of constructing the table by rows, I construct it by columns. Here’s my second go at it:

from prettytable import PrettyTable
import sqlite3

conn = sqlite3.connect('pop_test.sqlite')
curs = conn.cursor()
curs.execute('SELECT State, Name, ESTIMATESBASE2010 AS Est2010 FROM pop_est WHERE region="1" ORDER BY Name')

col_names = [cn[0] for cn in curs.description]
rows = curs.fetchall()

y=PrettyTable()
y.padding_width = 1
y.add_column(col_names[0],[row[0] for row in rows])
y.add_column(col_names[1],[row[1] for row in rows])
y.add_column(col_names[2],[format(row[2],',d') for row in rows])
y.align[col_names[1]]="l"
y.align[col_names[2]]="r"

print(y)
tabstring = y.get_string()

output=open("export.txt","w")
output.write("Population Data"+"n")
output.write(tabstring)
output.close()

conn.close()

To add by column, you don’t provide any arguments to the PrettyTable function. You just add the columns one by one: here I call the appropriate values using the index, first for the column name and then for all of the values from the rows that are in the same position. For the last value (the population estimate) I use format to display the value like a decimal number (this works in Python 3.1+ – for earlier versions there’s a similar command – see this post for details). I tried this in my first example but I couldn’t get the format to stick, or got an error. Since I’m specifically calling these row values and then writing them I was able to get it to work in this second example. In this version the alignment specifications have to come last. Here’s the result:

+-------+---------------+------------+
| State | Name          |    Est2010 |
+-------+---------------+------------+
|   09  | Connecticut   |  3,574,097 |
|   23  | Maine         |  1,328,361 |
|   25  | Massachusetts |  6,547,629 |
|   33  | New Hampshire |  1,316,469 |
|   34  | New Jersey    |  8,791,898 |
|   36  | New York      | 19,378,104 |
|   42  | Pennsylvania  | 12,702,379 |
|   44  | Rhode Island  |  1,052,567 |
|   50  | Vermont       |    625,741 |
+-------+---------------+------------+

prettytable gives you a few other options, like the ability to sort records by a certain column or to return only the first “n” records from a table. In this example, since we’re pulling the data from a database we could (and did) specify sorting and other constraints in the SQL statement instead. prettytable also gives you the option of exporting the table as HTML, which can certainly come in handy.

At These Coordinates

Dispatches from the Geospatial Data World