Fixed width data is probably the most annoying data to import because you need some mechanism to break the columns at the column boundaries. A lot of people bring this kind of data
into a tool such as OpenOffice, Excel or MS Access, massage it into a delimeted format and then pull it in with PostgreSQL copy command or some other means. There is another way and one that doesn't require
anything else aside from what gets packaged with PostgreSQL. We will demonstrate this way.
Its quite simple. Pull each record in as a single column and then spit it into the columns you want with plain old SQL. We'll demonstrate this by importing Census data places fixed width file.
Although this technique we have is focused on PostgreSQL, its pretty easy to do the same steps in any other relational database.
Both David Fetter and Dimitri Fontaine have demonstrated other approaches of doing this as well
so check theirs out.
The copy commands
We will be using psql to import our data and the psql client side \copy command. Before we get into the specifics, we want to clarify a very important distinction here, that seems to confuse quite a few people new to PostgreSQL.
There are 2 kinds of copy commands. There is the SQL COPY command built into PostgreSQL server and SQL dialect and there is the client \copy command built into the psql client tool. Both can be used to import and export data and
take more or less the same arguments. However in some cases you may be able to use one and not the other. Below is a brief summary of how they defer
psql client \copy command
- Runs under the server context -- the postgres daemon account
- file path is relative to the server, not the person running it even if you are doing it from psql.
- postgres daemon account needs to have rights to read the file being imported.
- Person calling it needs to have super user rights
- Initiated with COPY .... and pretty much valid anywhere you can run PostgreSQL SQL commands
- Runs under the client OS context -- OS account of the person running the command
- file path is relative to the client, the person running the command, so for example if you are using psql in windows and your server is Linux, the path will be like C:/path/to/file and the file has to exist in your
local windows workstation, not the server.
- You don't need to be logged in as a postgres super user to use this command, though you need write rights to the database you are importing data to.
- The OS account you are logged in as needs to have rights to read the file being imported. postgres daemon account need not have rights.
- Initiated with \copy and only valid in psql or some psql like variant built with psql plumbing.
More extensive details can be found on the PostgreSQL wiki copy command.
The copy commands have been enhanced in newer versions of PostgreSQL, so you may want to check out the official manual for your version as well.
For our following exercises we'll import US Census 2000 places.zip file which is located at
http://www.census.gov/geo/www/gazetteer/places2k.html. This is a
fixed width file with record layout defined at http://www.census.gov/geo/www/gazetteer/places2k.html#places.
Specs for reference look as follows:
* Columns 1-2: United States Postal Service State Abbreviation
* Columns 3-4: State Federal Information Processing Standard (FIPS) code
* Columns 5-9: Place FIPS Code
* Columns 10-73: Name
* Columns 74-82: Total Population (2000)
* Columns 83-91: Total Housing Units (2000)
* Columns 92-105: Land Area (square meters) - Created for statistical purposes only.
* Columns 106-119: Water Area(square meters) - Created for statistical purposes only.
* Columns 120-131: Land Area (square miles) - Created for statistical purposes only.
* Columns 132-143: Water Area (square miles) - Created for statistical purposes only.
* Columns 144-153: Latitude (decimal degrees) First character is blank or "-" denoting North or South latitude respectively
* Columns 154-164: Longitude (decimal degrees) First character is blank or "-" denoting East or West longitude respectively
Here are the steps:
- Unzip file
- Launch a psql shell. TIP:If you are using PgAdmin III, version 1.9 and above has a psql icon located in Plugins that will open up a PSQL shell and connect you to the selected database.
- Create table that holds data for staging. Just create a table with a single column that allows unconstrained text.
CREATE TABLE places_staging(data text);
Now use psql to import it in. For this example, we set the client_encoding to latin1 since the file is in latin1 encoding. Your client_encoding should match that of the file, otherwise you may experience
errors such as invalid for utf-8 etc. We also specified a bogus column delimiter of '|' since we figure our data has no '|' in it. By default copy would assume tab delimeted columns.
set client_encoding = 'latin1';
\copy places_staging FROM C:/censusdata/places2k.txt DELIMITER AS '|'
Next we create our final table and we don't really care about a lot of the fields so
we will just import the fields we care about. Here we use the super useful substring function to denote where to start the text and how many characters to pick up.
CREATE TABLE places(usps char(2) NOT NULL,
fips char(2) NOT NULL,
INSERT INTO places(usps, fips, fips_code, loc_name)
SELECT substring(data,1,2) As usps, substring(data,3,2) As fips, substring(data,3,5) As fips_code,
trim(substring(data, 10,64)) As loc_name
If you did it right, you should get an answer like below which should match the number of lines in the file:
INSERT 0 25375
Tracked: Jul 14, 03:56
Tracked: May 03, 14:52