PAGC-Public Address GeoCoder / Code
/ [r360] /branches/sew-refactor/postgresql ([original](http://sourceforge.net/p/pagc/code/360/tree/branches/sew-refactor/postgresql)) ([raw](?raw))
Tree [r360] /branches /sew-refactor /postgresql / History
Read Me
This is a fork of the PAGC standardizer and a single line address parser. The code is built into a single postgresql extension library.
Portions of this code belong to their respective contributors. This code is released under an MIT-X license.
Copyright (c) 2006-2013 Stephen Woodbridge. Copyright (c) 2008 Walter Bruce Sinclair
woodbri@swoodbridge.com woodbr@imaptools.com
Also read files COPYING
Makefile - PGXS makefile mk-city-regex.pl - Perl script to create parseaddress-regex.h mk-st-regexp.pl - Perl script to create parseaddress-stcities.h README.address_standardizer - this file COPYING - License file usps-st-city-adds.txt - add local additions of STCITY NAME usps-st-city-orig.txt - Steve's extract of USPS city names
mk-city-regex - created by make mk-st-regexp - created by make usps-st-city-name.txt - created by make from usps-st-city-orig.txt and usps-st-city-adds.txt parseaddress-regex.h - created by make and mk-st-regexp parseaddress-stcities.h - created by make and mk-city-regex from usps-st-city-name.txt
PREREQUISITES:
o Postgresql headers and PGXS tools
o Perl 5 and Perl module Regexp::List which can be install with: sudo perl -MCPAN -e "install Regexp::Assemble"
o libpcre and headers sudo apt-get install libpcre3-dev libpcre3
Build and Install:
make sudo make install
For postgresql 9.1+ this will install all the files need for CREATE EXTENSION
createdb testdb psql -c "create extension address_standardizer"
How the parser works
The parser works from right to left looking first at the macro elements for postcode, state/province, city, and then looks micro elements to determine if we are dealing with a house number street or intersection or landmark. It currently does not look for a country code or name, but that could be introduced in the future.
Country code
Assumed to be US or CA based on:
postcode as US or Canada
state/province as US or Canada
else US
Postcode/zipcode
These are recognized using Perl compatible regular expressions. These regexs are currently in the parseaddress-api.c and are relatively simple to make changes to if needed.
State/province
These are recognized using Perl compatible regular expressions. These regexs are currently in the parseaddress-api.c but could get moved into includes in the future for easier maintenance.
City name
This part is rather complicated and there are lots of issues around ambiguities as to where to split a series of tokens when a token might belong to either the city or the street name. The current strategy follows something like this:
1. if we have a state, then get the city regex for that state
2. if we can match that to the end of our remaining address string then
extract the city name and continue.
3. if we do not have a state or fail to match it then
cycle through a series of regex patterns that try to separate the city
from the street, stop and extract the city if we match
Number street name
1. check for a leading house number, and extract that
2. if there is an '@' then split the string on the '@' into street and
street2 else put the rest into street
Managing the regexes
The regexes are used to recognize US states and Canadian provinces and USPS city names.
City regexes
usps-st-city-orig.txt - this file contains all the acceptable USPS city names by state. I periodically extract these from the USPS and generate this file. I do NOT recommend editing this file. usps-st-city-adds.txt - this file you can add new definitions to if you need them. The format of both these files is:
These files are assembled into usps-st-city-name.txt which is compiled by a perl script mk-city-regex.pl into parseaddress-stcities.h which is used to lookup the city regex for a specific state or province.
As I mentioned above is these fail to detect the city, then a secondary strategy is is deployed by cycling through a list of regex patterns. These patterns and regexes are generated by mk-st-regexp.pl which creates the parseaddress-regex.h include. This is a perl script so you can view and edit it if that is needed.
I think that there might be some room for improved in the area if coodinating this process with PAGC's lexicon.csv and gazeteer.csv in the future.
Author: API: Stephen Woodbridge woodbri@imaptools.com PAGC: Walter Sinclar
This is a first attempt at extracting the PAGC standardizer code into a separate library. This was done so I could attempt to wrap it into a postgresql stored procedure. (see the directory psql for that).
This code is a part of PAGC and is release under an MIT-X license.
Assumptions:
Linux Sudo is installed and user has access to it.
PAGC does compile under Windows so you might get this to compile into a dll.
Build and installation:
Read the Makefile and change as appropriate.
make clean make make install make test_main # build the CLI interactive test program
Author: Stephen Woodbridge woodbri@imaptools.com
These are postgresql wrappers for PAGC standardizer and address parser. These are released un an MIT-X style license.
Assumptions:
Linux sudo is installed and user has sudo access (see Makefile) postgresql 8.3 (make changes in the Makefile to change)
Build and Installation:
make make install
create a new database using a postgis template
createdb -T template_postgis -E LATIN1 mydb
add the stored procedures
psql mydb -f /path/to/standardize_address.sql
Install PAGC lexicon.csv, gazeteer.csv or rules.txt using a perl script.
./pagc-data-psql lex lexicon.csv | psql mydb ./pagc-data-psql gaz gazeteer.csv | psql mydb ./pagc-data-psql rules rules.txt | psql mydb
Now you should be able to test some queries like:
psql mydb
select * from parse_address('2099 university ave w, saint paul, mn, 55104-3431'); select * from parse_address('university ave w @ main st, saint paul, mn, 55104-3431');
select * from parse_address('385 Landgrove Rd Landgrove VT 05148'); -- "385";"Landgrove Rd";"";"385 Landgrove Rd";"Landgrove";"VT";"05148";"";"US"
select * from standardize_address( 'select seq, word::text, stdword::text, token from gaz union all select seq, word::text, stdword::text, token from lex ', 'select seq, word::text, stdword::text, token from gaz order by id', 'select * from rules order by id', 'select 0::int4 as id, ''1071 B Ave''::text as micro, ''Loxley, AL 36551''::text as macro');
select * from standardize_address( 'select seq, word::text, stdword::text, token from lex order by id', 'select seq, word::text, stdword::text, token from gaz order by id', 'select * from rules order by id', 'select 0::int4 as id, ''116 commonwealth ave apt a''::text as micro, ''west concord, ma 01742''::text as macro');
\q
Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.