Following from yesterday’s resuscitation of the Charlottetown Building Permits RSS feed, I decided that it was finally time to get around to seeing if there was enough data locked inside the City’s PDF files to create a map of building permit approvals. It turned out to be not that difficult to do using some open source wrangling. Here’s what I did.
The goal was to take the 219 PDF files I was able to scrape from the City’s Building Permit Approval page that each look like this:
and to pull enough information out about each approval to be able to geocode it. I did this using the excellent pdftotext utility, part of the open source Xpdf package. Doing this:
pdftotext -raw Weekly_approvals_webpage_21_Oct_2011.pdf \
    Weekly_approvals_webpage_21_Oct_2011.txt
produces a plain ASCII text file that looks like this:
10-533 335067 402-bld-10 20-Oct-10 3-Oct-11 18-22 Water Street...
10-569 363556 439-BLD-10 17-Nov-10 3-Oct-11 20 Lapthorne Avenue...
11-002 1018274 001-bld-11 4-Jan-11 6-Oct-11 375 Mount Edward Road...
11-136 342436 326-bld-11 26-Aug-11 3-Oct-11 134 Kent Street...
From those files, because the Provincial Property Identification Number — the PID — is always a 6 or 7 digit number, and because such numbers rarely, if ever, appear elsewhere in the files, I was able to pull out the PID for every approval using some PHP:
preg_match('/\d{6,7}/', $line, $matches); // on a match, $matches[0] holds the PID
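For anyone who wants to try the same trick outside PHP, here is a minimal sketch of the PID extraction in Python. It uses the same pattern as the call above; the function name `extract_pid` is my own invention for illustration, and the sample lines are taken from the pdftotext output shown earlier. One caveat: a longer number, such as a phone number, would match on its first seven digits, so this relies on such numbers not appearing ahead of the PID.

```python
import re

# Same pattern as the PHP call above: a run of 6 or 7 digits.
PID_PATTERN = re.compile(r"\d{6,7}")

def extract_pid(line):
    """Return the first 6- or 7-digit run in the line, or None if there is none."""
    match = PID_PATTERN.search(line)
    return match.group(0) if match else None

sample_lines = [
    "10-533 335067 402-bld-10 20-Oct-10 3-Oct-11 18-22 Water Street...",
    "11-002 1018274 001-bld-11 4-Jan-11 6-Oct-11 375 Mount Edward Road...",
]

for line in sample_lines:
    print(extract_pid(line))
```

The permit number at the start of each line (10-533, 11-002) never trips the pattern because neither half is long enough.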
From there I looked up each PID in the freely-available Provincial Civic Address data, leaving me with a CSV file like this:
-63.12688,46.23066,22 WATER ST,"10-533 335067 402-bld-10 20-Oct-10...
-63.12606,46.24454,20 LAPTHORN AV,"10-569 363556 439-BLD-10 17-Nov-10...
-63.14558,46.27834,375 MOUNT EDWARD RD,"11-002 1018274 001-bld-11 4-Jan-11...
-63.12808,46.23572,134 KENT ST,"11-136 342436 326-bld-11 26-Aug-11 3-Oct-11...
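The lookup step is essentially a join on the PID. The sketch below shows one way it might be done in Python. The column layout of the civic address data here (`pid`, `longitude`, `latitude`, `address`) is an assumption made for illustration, not the actual layout of the Provincial file, and the function names are hypothetical. The two sample rows reuse coordinates from the CSV above.

```python
import csv
import io
import re

# Hypothetical layout for the civic address data: one row per PID with
# its coordinates and street address. The real file's columns may differ.
CIVIC_CSV = """pid,longitude,latitude,address
335067,-63.12688,46.23066,22 WATER ST
342436,-63.12808,46.23572,134 KENT ST
"""

def load_civic_index(handle):
    """Build a dict mapping PID -> (longitude, latitude, address)."""
    return {
        row["pid"]: (row["longitude"], row["latitude"], row["address"])
        for row in csv.DictReader(handle)
    }

def geocode_permits(lines, civic_index):
    """Return [longitude, latitude, address, line] rows for lines whose PID is known."""
    rows = []
    for line in lines:
        match = re.search(r"\d{6,7}", line)
        if match and match.group(0) in civic_index:
            lon, lat, address = civic_index[match.group(0)]
            rows.append([lon, lat, address, line])
    return rows

permit_lines = [
    "10-533 335067 402-bld-10 20-Oct-10 3-Oct-11 18-22 Water Street...",
    "11-136 342436 326-bld-11 26-Aug-11 3-Oct-11 134 Kent Street...",
]

civic_index = load_civic_index(io.StringIO(CIVIC_CSV))
geocoded = geocode_permits(permit_lines, civic_index)

# Write the joined rows out as CSV, one permit per line.
out = io.StringIO()
csv.writer(out).writerows(geocoded)
print(out.getvalue())
```

Lines whose PID is missing from the index are silently skipped in this sketch; logging them instead would make it easier to see which approvals fail to geocode.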
This CSV contains geocoded records of the 1,985 building permits I was able to scrape out of the PDF files. Finally, I used the open source KMLCSV Converter app to convert the CSV file into a mappable KML file, and from there it was simply a matter of doing any of:
- Feeding the KML file to Google Maps
- Adapting my PEI Schools Map to show the Building Permits on a CloudMade-driven OpenStreetMap map.
- Opening the KML file in Google Earth
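I used a ready-made converter for the CSV-to-KML step, but the conversion itself is simple enough to sketch by hand. What follows is not how the KMLCSV Converter app works internally; it is a rough Python alternative using the standard library. It assumes exactly four columns per row (longitude, latitude, address, description), as in the CSV above, and the function name `csv_to_kml` is hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"

def csv_to_kml(handle):
    """Convert rows of longitude, latitude, address, description to a KML string."""
    # Register KML as the default namespace so the tags are written unprefixed.
    ET.register_namespace("", KML_NS)
    kml = ET.Element(f"{{{KML_NS}}}kml")
    document = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for lon, lat, address, description in csv.reader(handle):
        placemark = ET.SubElement(document, f"{{{KML_NS}}}Placemark")
        ET.SubElement(placemark, f"{{{KML_NS}}}name").text = address
        ET.SubElement(placemark, f"{{{KML_NS}}}description").text = description
        point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        # KML expects coordinates as longitude,latitude.
        ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")

sample = io.StringIO(
    '-63.12688,46.23066,22 WATER ST,"10-533 335067 402-bld-10 20-Oct-10..."\n'
)
kml_text = csv_to_kml(sample)
print(kml_text)
```

Each permit becomes one Placemark, with the civic address as its name and the raw permit line as its description.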
I continue to hope that the City of Charlottetown will eventually release building permit data in an open format so that all the scripery-scrapery required to do this can be eliminated and we can all concentrate on doing interesting things with the data rather than on getting the data in the first place.