Using wget directly from PostgreSQL using COPY FROM PROGRAM

One of the features new to PostgreSQL 9.3 is the COPY FROM PROGRAM sql construct and in psql the equivalent \copy from program. Michael Paquier covered the COPY TO/FROM PROGRAM in hist Postgres 9.3 feature highlight: COPY TO/FROM PROGRAM. Depesz covered the companion psql construction in Support for piping copy to from an external program. Michael demonstrated an example using curl. I wanted to try something similar using wget since I have wget readily available on all my Linux and Unix boxes. For this example I'll demonstrate doing it on windows, but doing it on Linux is much the same and simpler since the wget and curl are generally already in Linux default path.

Wget call from PostgreSQL to load data

When you call CURL it naturally pipes the data to the stdout which is what is needed in order to insert the output to a PostgreSQL table. If you use wget, you need to add some additional commands to retarget it from file to stdout. In this example we'll load data using the World Bank API described at http://data.worldbank.org/node/11.

CREATE TABLE worldbank_json(data json);
COPY worldbank_json 
  FROM PROGRAM 'C:/wget/wget.exe -q -O - "$@" "http://api.worldbank.org/countries?format=json&per_page=100&page=1"';
COPY worldbank_json 
  FROM PROGRAM 'C:/wget/wget.exe -q -O - "$@" "http://api.worldbank.org/countries?format=json&per_page=100&page=2"';
COPY worldbank_json 
 FROM PROGRAM 'C:/wget/wget.exe -q -O - "$@" "http://api.worldbank.org/countries?format=json&per_page=100&page=3"';

Querying JSON data

The json datasets get dumped as single records for each call. To query the data we employ some PostgreSQL 9.3+ json goodie operators described in JSON Functions and Operators PostgreSQL manual page.

-- json is 0 index based, first element is page summary (0 index)
-- second element (index 1) is the data which itself is a json array
WITH je AS (SELECT json_array_elements(data->1) AS jd 
FROM  worldbank_json)
SELECT jd->>'id' AS id, jd->>'name' As country, 
  jd#>>'{adminregion,id}' As region_id
FROM je ;

The output of our query is partially shown below:

 id  |                        country                        | region_id
-----+-------------------------------------------------------+-----------
 ABW | Aruba                                                 |
 AFG | Afghanistan                                           | SAS
 AFR | Africa                                                |
 AGO | Angola                                                | SSA
 ALB | Albania                                               | ECA
 AND | Andorra                                               |
 ARB | Arab World                                            |
 ARE | United Arab Emirates                                  |
 ARG | Argentina                                             | LAC
 ARM | Armenia                                               | ECA
 ASM | American Samoa                                        | EAP
 ATG | Antigua and Barbuda                                   | LAC
 AUS | Australia                                             |
 AUT | Austria                                               |
 AZE | Azerbaijan                                            | ECA
 BDI | Burundi                                               | SSA