Whenever you need to split a text into multiple records breaking by some delimeter, there are two common options that PostgreSQL provides. The first is
regpexp_split_to_table and then next popular is using the unnest function in combination with string_to_array.
Here is an example using regexp_split_to_table:
FROM regexp_split_to_table('john,smith,jones', ',') AS a;
You can achieve the same result by using the construct:
FROM unnest(string_to_array('john,smith,jones', ',')) AS a;
With short text you won't notice much perfomance difference. But what happens if we pass in a humungous text?
Recently we found ourselves needing to purchase and download Zip+4 from the USPS. Zip+4 provides listing of mailable addresses in the US. We intend to use it for address validation.
Each file has one single line with no linefeeds or carriage returns! From spec, each 182-character segment constitutes a record. USPS was nice enough to provide a Java graphical app called CRLF that can inject breaks at specified intervals. That's all nice and well, but with hundreds of files to parse, using their interactive graphical CRLF tool is too tedious.
How could we compose a PostgreSQL function to handle the parsing? Unsure of the performance among procedural languages, we wrote the function in PL/pgSQL, SQL, and PL/V8 to compare. PL/V8 processed the files an astounding 100 times faster than the rest.
In our casual use of PL/V8, we found that when it comes to string, array, and mathematical operations, PL/V8 outshines PL/pgSQL, SQL, and in many cases PL/R and PL/Python.
We've been playing around with new phrase text feature of full-text PostgreSQL 9.6. In doing so, I was curious how big a number one can designate as max that words can be apart. I discovered thru trial and error, that the magic number is 16384 which is much bigger than I had suspected.
To celebrate recent release of PostgreSQL 9.6beta1, we've started to experiment with our favorite extensions.