Sunday Update No. 6: Generating queries from csv with psycopg2

29 Mar 2015

The Numbers

Total commits: 4
Repos contributed to: 1
Most popular repo: Smartsheet Calendar
Most popular days: Mar. 25, 2 commits

What I learned

The project I worked on most this week isn’t complete, so I wanted to go over something that didn’t end up on github but was still a learning moment for me.

We’re doing a lot of work around campaign finance in the Chicago mayoral election, and this week I wanted to quickly check if a list any of a list of names shows up in the records. In the past I’ve sometimes gone with the incredibly long sql where statement for something like this, but we were comparing a number of different lists with dozens of names, so hand-coding the query wasn’t going to work.

I needed to scrape some of the data, and in the process of writing to a csv I realized I could just as easily send that info directly to an sql query. I was writing the scraper in python and the database itself is postgres, so I was able to easily add a call to psycopg2 for each time the scraper found a matching element. (note: The project isn’t out yet so these are just example queries, but I’ll try and post a link to the project and full code next week)

from bs4 import BeautifulSoup
import urllib2
import psycopg2
import probablepeople
import csv
import re

data = []

conn = psycopg2.connect("dbname=electionmoney user=chris")
cur = conn.cursor()

def scrape():
    soup = BeautifulSoup(open("scrape.html"))
    table_body = soup.find('tbody')
    rows = table_body.find_all('tr',class_='data')
    cur = conn.cursor()

    for row in rows:
        td = row.find_all('td')
        state = td[6].string.split(',')[-1]
        peep = probablepeople.tag(td[2].string)[0]
        lastname = peep['Surname']
        cur.execute("""SELECT * FROM receipts where lower(trim("LastOnlyName"))=lower(trim(%s)) AND lower(trim("State"))=lower(trim(%s))""", (lastname,state))
        results = cur.fetchall()
        cur.close

        for item in results:
            if item not in data:
                data.append(item)

I reused that same format with a csv list of companies by replacing the beautiful soup code with with open('companies.csv', 'rb') as csvfile: and iterating on the rows.

The biggest issue I had was working with another query that required a LIKE comparison with a python variable. Since postgres uses ‘%’ for wildcards and python uses ‘%’ for variables, the psycopg2 query got confused about what exactly I meant. After some hunting, I found that creating the query itself as a variable and then passing that to psycopg2 worked.

query = "".join(('%',company,'%'))
cur.execute("""SELECT * FROM receipts where lower(trim("Employer")) LIKE lower(trim(%s))""", (query,))

This made re-querying the data much easier and as an added bonus makes the entire trail reproducible and able to be put under version control. Earlier this month I was lucky enough to go to NICAR, IRE’s computer-assisted reporting (i.e, journo-nerd) conference, and one of my main goals coming out is to have a better process for data projects. I feel like this is another step in that direction.

This week

Again, a lot of items I talked about last week are still on this week’s to-do list

I need to write up my formal goals list for this project
Continue to add features to the blog (archives, comments, etc …)
Write up my process for actually creating the blog in more depth
Write up the process for the campaign finance tracker
Finish and writeup the smartsheet calendar project