PL/V8 for Breaking Long Lines of Text

Recently we found ourselves needing to purchase and download Zip+4 data from the USPS. Zip+4 provides a listing of mailable addresses in the US, and we intend to use it for address validation.

Each file is one single line with no linefeeds or carriage returns! Per the spec, each 182-character segment constitutes a record. The USPS was nice enough to provide a Java graphical app called CRLF that can inject line breaks at specified intervals. That's all well and good, but with hundreds of files to parse, using their interactive graphical CRLF tool is too tedious.

How could we compose a PostgreSQL function to handle the parsing? Unsure of the relative performance of the procedural languages, we wrote the function in PL/pgSQL, SQL, and PL/V8 to compare. PL/V8 processed the files an astounding 100 times faster than the rest.

PL/V8 is nothing but PL using JavaScript. V8 is the moniker Google christened its JavaScript engine with to distinguish it from all others; the language itself is really no different from any other JavaScript. PL/V8 offers a tiny footprint compared to the stalwarts PL/Python, PL/Perl, and PL/R. Plus, you can use PL/V8 to create window functions, which you can't do with PL/pgSQL or SQL. PL/V8 is sandboxed, meaning that it cannot access web services, network resources, and so on. PL/Python, PL/Perl, and PL/R have non-sandboxed (untrusted) versions. For certain applications, being sandboxed is a deal-breaker.

In our casual use of PL/V8, we found that when it comes to string, array, and mathematical operations, PL/V8 outshines PL/pgSQL, SQL, and in many cases PL/R and PL/Python.

We tested PL/V8 version 1.4.8 on PostgreSQL 9.6 beta for Windows 64-bit, on both the BigSQL and the EDB Windows 64-bit distributions.

With combined help from the Stack Exchange question "JavaScript: An Elegant Way to Split String into Segments" and the PL/V8 manual, we cooked up this function:

CREATE OR REPLACE FUNCTION split_text(p_text text, p_rec_l integer)
RETURNS SETOF text AS
$$
    // Walk the string in p_rec_l-sized steps, returning one row per segment.
    for (var i = 0, cl = p_text.length; i < cl; i += p_rec_l) {
        plv8.return_next( p_text.substring(i, i + p_rec_l) );
    }
$$
LANGUAGE plv8;

A simple test:

SELECT split_text('JohnJackPaulJaneMary', 4);

Output:

split_text
------------
John
Jack
Paul
Jane
Mary
(5 rows)
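The same loop can be exercised outside the database. Here is a standalone sketch in plain JavaScript; `splitText` is a hypothetical helper name, and the array stands in for the rows that `plv8.return_next()` streams back inside the function:

```javascript
// Standalone sketch of the loop inside split_text().
// Outside PL/V8 there is no plv8.return_next(), so segments
// are collected into an array instead of emitted as rows.
function splitText(text, recLen) {
  var segments = [];
  for (var i = 0, cl = text.length; i < cl; i += recLen) {
    // substring() clamps its end index to the string length, so a
    // trailing partial segment is returned rather than raising an error.
    segments.push(text.substring(i, i + recLen));
  }
  return segments;
}

console.log(splitText('JohnJackPaulJaneMary', 4));
// → [ 'John', 'Jack', 'Paul', 'Jane', 'Mary' ]
```

Note the clamping behavior of `substring()`: a file whose length is not an exact multiple of 182 simply yields a short final record instead of an error.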

To try out the function on our Zip+4 files, we loaded the 19 files for Massachusetts into a table. Each file contains one row, so we end up with 19 rows in our zip4raw table. Row size varies between 2 KB and 27 MB, with most around 10 MB.

Step 1 is to get a directory listing of the files:

CREATE SCHEMA staging;
CREATE TABLE staging.dir_list(filename text);
COPY staging.dir_list 
    FROM PROGRAM 'dir C:\data\USPS\zip4\ma\*.txt /b /S'  
    WITH (format 'csv');
CREATE UNLOGGED TABLE staging.zip4raw(data text);

Step 2 is to load each file as a record into zip4raw table. Speed-wise both BigSQL and EDB came out pretty close with BigSQL faster by a tiny margin.

DO language plpgsql
$$
BEGIN
    EXECUTE (
        SELECT string_agg(
            'COPY staging.zip4raw(data) FROM ' || 
            quote_literal(filename) , ';'
        )
        FROM staging.dir_list
    );
END;
$$;

-- Query returned successfully with no result in 2.3 secs. (PostgreSQL 9.6beta1 on x86_64-pc-mingw64, compiled by gcc.exe (Rev5, Built by MSYS2 project) 4.9.2, 64-bit BigSQL distrib)
-- Query returned successfully with no result in 2.6 secs. (PostgreSQL 9.6beta1, compiled by Visual C++ build 1800, 64-bit EDB distrib)
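The `string_agg()` trick builds one long script of COPY commands, which `EXECUTE` then runs in a single pass. A rough JavaScript sketch of what gets assembled, using two made-up file paths in place of the real dir_list rows:

```javascript
// Sketch of the string string_agg() assembles from dir_list.
// The file paths here are hypothetical stand-ins.
var files = [
  'C:\\data\\USPS\\zip4\\ma\\zip4a.txt',
  'C:\\data\\USPS\\zip4\\ma\\zip4b.txt'
];

// Each filename becomes its own COPY command (quote_literal() supplies
// the single quotes); string_agg() joins the commands with ';'.
var sql = files
  .map(function (f) { return "COPY staging.zip4raw(data) FROM '" + f + "'"; })
  .join(';');

console.log(sql);
```

One `EXECUTE` over the concatenated script saves a round trip per file compared to looping over dir_list and executing each COPY individually.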

At this point we have a table called staging.zip4raw with 19 rows. Now we use the split_text function to break each row into multiple records at 182-character intervals.

CREATE UNLOGGED TABLE staging.zip4raw_records AS 
SELECT r 
FROM 
    staging.zip4raw AS z CROSS JOIN LATERAL 
    split_text(z.data,182) AS r;

-- Query returned successfully: 818681 rows affected, 1.3 secs execution time. (PostgreSQL 9.6beta1 on x86_64-pc-mingw64, compiled by gcc.exe (Rev5, Built by MSYS2 project) 4.9.2, 64-bit BigSQL distrib)
-- Query returned successfully: 818681 rows affected, 1.4 secs execution time.(PostgreSQL 9.6beta1, compiled by Visual C++ build 1800, 64-bit EDB distrib)
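CROSS JOIN LATERAL against a set-returning function behaves much like `flatMap` in JavaScript: each input row's data is handed to split_text, and every returned segment becomes its own output row. A rough analogy, with made-up sample data and a record length of 4:

```javascript
// Rough analogy for CROSS JOIN LATERAL split_text(z.data, 182):
// each "row" (one file's contents) is exploded into one row per segment.
// The sample rows and record length here are invented for illustration.
var rows = [{ data: 'aaaabbbb' }, { data: 'ccccdd' }];
var recLen = 4;

var records = rows.flatMap(function (z) {
  var out = [];
  for (var i = 0; i < z.data.length; i += recLen) {
    out.push(z.data.substring(i, i + recLen));
  }
  return out;
});

console.log(records); // → [ 'aaaa', 'bbbb', 'cccc', 'dd' ]
```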

Using unlogged tables reaps a noticeable performance boost; without them, execution times climbed into the 4-6 second range.

Our final query, which parses the fields out of the detail records (type 'D'), looks like this:

CREATE TABLE zip4_detail AS
SELECT 
    substring(r,158, 2)::varchar(2) AS stusps,
    substring(r,160,3)::varchar(3) AS countyfp, 
    substring(r,1,1)::varchar(1) AS copyright_detail_code,
    substring(r,2,5)::varchar(5) AS zip_code,
    substring(r,7, 10)::varchar(10) AS update_key_no,
    substring(r,17,1)::varchar(1) AS action_code,
    substring(r,18,1)::varchar(1) AS record_type_code,
    substring(r,19,4)::varchar(4) AS carrier_route_id,
    substring(r,23,2)::varchar(2) AS street_pre_drctn_abbrev,
    substring(r,25,28)::varchar(28) AS street_name,
    substring(r,53,4)::varchar(4) AS street_suffix_abbrev,
    substring(r,57,2)::varchar(2) AS street_post_drctn_abbrev,
    substring(r,59,10)::varchar(10) AS addr_primary_low_no,
    substring(r,69,10)::varchar(10) AS addr_primary_high_no,
    substring(r,79,1)::varchar(1) AS addr_prmry_odd_even_code,
    substring(r,80,40)::varchar(40) AS building_or_firm_name,
    substring(r,120,4)::varchar(4) AS addr_secondary_abbrev,
    substring(r,124,8)::varchar(8) AS addr_secondary_low_no,
    substring(r,132,8)::varchar(8) AS addr_secondary_high_no,
    substring(r,140,1)::varchar(1) AS addr_secny_odd_even_code,
    substring(r,141,2)::varchar(2) AS low_no_zip_sector_no,
    substring(r,143,2)::varchar(2) AS low_no_zip_segment_no,
    substring(r,145,2)::varchar(2) AS high_no_zip_sector_no,
    substring(r,147,2)::varchar(2) AS high_no_zip_segment_no,
    substring(r,149,1)::varchar(1) AS base_alt_code,
    substring(r,150,1)::varchar(1) AS lacs_status_ind,
    substring(r,151,1)::varchar(1) AS govt_bldg_ind,
    substring(r,152,6)::varchar(6) AS finance_no,
    substring(r,163,2)::varchar(2) AS congressional_dist_no,
    substring(r,165,6)::varchar(6) AS muncipality_ctyst_key,
    substring(r,171,6)::varchar(6) AS urbanization_ctyst_key,
    substring(r,177,6)::varchar(6) AS prefd_last_line_ctyst_key
FROM (
    SELECT r 
    FROM 
        staging.zip4raw AS z CROSS JOIN LATERAL 
        split_text(z.data,182) AS r
) AS c
WHERE r LIKE 'D%';

-- Query returned successfully: 752794 rows affected, 14.7 secs execution time. (PostgreSQL 9.6beta1 on x86_64-pc-mingw64, compiled by gcc.exe (Rev5, Built by MSYS2 project) 4.9.2, 64-bit BigSQL distrib)
-- Query returned successfully: 752794 rows affected, 15.1 secs execution time. (PostgreSQL 9.6beta1, compiled by Visual C++ build 1800, 64-bit EDB distrib)
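The big SELECT above is plain fixed-width field extraction: `substring(r, start, length)` with 1-based offsets taken from the USPS record layout. The same slicing in JavaScript must subtract one, since its strings are 0-based. A sketch for a few of the columns, against a made-up record fragment (`parseRecord` and the field map are hypothetical helpers, not part of the article's code):

```javascript
// A few of the columns from the query above, keyed to their
// 1-based SQL substring() offsets: [start, length].
var fields = {
  copyright_detail_code: [1, 1],
  zip_code:              [2, 5],
  record_type_code:      [18, 1]
};

// JavaScript strings are 0-based, so subtract 1 from each SQL offset.
function parseRecord(r, spec) {
  var out = {};
  for (var name in spec) {
    var start = spec[name][0], len = spec[name][1];
    out[name] = r.substring(start - 1, start - 1 + len);
  }
  return out;
}

// Made-up fragment: 'D' copyright code, ZIP 02134, eleven filler
// characters, then the record type code 'S' at position 18.
var sample = 'D02134' + '0123456789A' + 'S';
console.log(parseRecord(sample, fields));
// → { copyright_detail_code: 'D', zip_code: '02134', record_type_code: 'S' }
```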

BigSQL performed slightly better than EDB VC++ in these tests. This agrees with Jim Mlodgenski's recent benchmark findings that PostgreSQL compiled with MinGW-w64 is slightly faster than PostgreSQL compiled with VC++.