Import fixed width data into PostgreSQL with just PSQL

Fixed width data is probably the most annoying data to import because you need some mechanism to break each record at the column boundaries. Many people bring this kind of data into a tool such as OpenOffice, Excel or MS Access, massage it into a delimited format, and then pull it in with the PostgreSQL copy command or some other means. There is another way, one that doesn't require anything aside from what gets packaged with PostgreSQL. We will demonstrate that approach here.

It's quite simple: pull each record in as a single column, then split it into the columns you want with plain old SQL. We'll demonstrate this by importing the Census places fixed width file.

Although this technique is focused on PostgreSQL, it's pretty easy to follow the same steps in any other relational database.

Both David Fetter and Dimitri Fontaine have demonstrated other approaches to doing this, so check theirs out as well.


The copy commands

We will be using psql to import our data, via the psql client-side \copy command. Before we get into the specifics, we want to clarify a very important distinction that seems to confuse quite a few people new to PostgreSQL.

There are two kinds of copy commands. There is the SQL COPY command built into the PostgreSQL server and SQL dialect, and there is the client-side \copy command built into the psql client tool. Both can be used to import and export data and take more or less the same arguments. However, in some cases you may be able to use one and not the other. Below is a brief summary of how they differ:

    * SQL COPY: runs on the database server. File paths are interpreted on the server's filesystem, and reading or writing server files requires elevated privileges.
    * psql client \copy command: runs inside the psql client. Files are read from or written to the machine psql is running on, so any user with the appropriate table permissions can use it.

More extensive details can be found on the PostgreSQL wiki page for the copy command. The copy commands have been enhanced in newer versions of PostgreSQL, so you may want to check the official manual for your version as well.
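As a quick illustration of the distinction, the two statements below would load the same kind of file; the table and paths are hypothetical. The first runs on the database server and needs server-side file access; the second is psql-only and reads the file from the client machine:

```sql
-- Server-side COPY: the path is resolved on the database server,
-- and reading server files requires elevated privileges.
COPY places_staging FROM '/var/lib/postgresql/places2k.txt' DELIMITER '|';

-- Client-side \copy (psql only): the path is resolved on the machine
-- running psql, and ordinary table permissions suffice.
\copy places_staging FROM 'C:/censusdata/places2k.txt' DELIMITER AS '|'
```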

The Data

For the following exercise we'll import the US Census 2000 places.zip file, which is located at http://www.census.gov/geo/www/gazetteer/places2k.html. This is a fixed width file with the record layout defined at http://www.census.gov/geo/www/gazetteer/places2k.html#places.

Specs for reference look as follows:

    * Columns 1-2: United States Postal Service State Abbreviation
    * Columns 3-4: State Federal Information Processing Standard (FIPS) code
    * Columns 5-9: Place FIPS Code
    * Columns 10-73: Name
    * Columns 74-82: Total Population (2000)
    * Columns 83-91: Total Housing Units (2000)
    * Columns 92-105: Land Area (square meters) - Created for statistical purposes only.
    * Columns 106-119: Water Area (square meters) - Created for statistical purposes only.
    * Columns 120-131: Land Area (square miles) - Created for statistical purposes only.
    * Columns 132-143: Water Area (square miles) - Created for statistical purposes only.
    * Columns 144-153: Latitude (decimal degrees) First character is blank or "-" denoting North or South latitude respectively
    * Columns 154-164: Longitude (decimal degrees) First character is blank or "-" denoting East or West longitude respectively
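Since SQL's substring(string, start, length) takes a start position and a length rather than an end column, it helps to translate the ranges above first, using length = end - start + 1. A sketch of that mapping for a few of the fields (the pop2000 alias is our own, and isn't imported in the steps below):

```sql
-- Column range -> substring(data, start, length), with length = end - start + 1
SELECT substring(data,  1,  2) AS usps,       -- cols  1-2
       substring(data,  3,  2) AS fips,       -- cols  3-4
       substring(data,  5,  5) AS fips_code,  -- cols  5-9
       substring(data, 10, 64) AS loc_name,   -- cols 10-73
       substring(data, 74,  9) AS pop2000     -- cols 74-82
FROM places_staging;
```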

Here are the steps:

  1. Unzip the file.
  2. Launch a psql shell. TIP: If you are using PgAdmin III, version 1.9 and above has a psql icon located in Plugins that will open up a psql shell and connect you to the selected database.
  3. Create a table to hold the data for staging. Just create a table with a single column that allows unconstrained text:

    CREATE TABLE places_staging(data text);

  4. Now use psql to import it. For this example we set client_encoding to latin1, since the file is in latin1 encoding. Your client_encoding should match that of the file; otherwise you may get errors such as invalid byte sequence for encoding "UTF8". We also specify a bogus column delimiter of '|', since we figure our data has no '|' in it; by default copy assumes tab-delimited columns.

    set client_encoding = 'latin1';
    \copy places_staging FROM 'C:/censusdata/places2k.txt' DELIMITER AS '|'

  5. Next we create our final table. We don't care about all of the fields, so we will import just the fields we need. Here we use the super useful substring function to denote where to start in the text and how many characters to pick up.

    CREATE TABLE places(usps char(2) NOT NULL,
        fips char(2) NOT NULL, 
        fips_code char(5),
        loc_name varchar(64));
    INSERT INTO places(usps, fips, fips_code, loc_name)
        SELECT substring(data,1,2) As usps, substring(data,3,2) As fips, substring(data,5,5) As fips_code,
            trim(substring(data, 10,64)) As loc_name
        FROM places_staging;

    If you did it right, you should get a result like the one below, and the count should match the number of lines in the file:

    INSERT 0 25375
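The same substring-and-cast trick extends to the numeric fields in the spec. A sketch pulling population and coordinates out of the staging table (the column aliases are our own; trim strips the blank padding before the cast):

```sql
SELECT substring(data, 1, 2) AS usps,
       trim(substring(data, 10, 64)) AS loc_name,
       trim(substring(data, 74, 9))::integer AS pop2000,     -- cols  74-82
       trim(substring(data, 144, 10))::numeric AS latitude,  -- cols 144-153
       trim(substring(data, 154, 11))::numeric AS longitude  -- cols 154-164
FROM places_staging;

-- Once everything checks out, the staging table can be dropped:
-- DROP TABLE places_staging;
```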