= Getting started with the PostgreSQL database

= Getting started with the PostgreSQL database

Databases are tools to store information in an organized but flexible way.
A spreadsheet is essentially a database, but the constraints of a graphical application render most spreadsheet applications useless to programmers.
With Edge and IoT devices becoming significant target platforms, developers need powerful but lightweight solutions for storing, processing, and querying large amounts of data.
One of my favourite combinations is the PostgreSQL database and https://github.com/arcapos/luapgsql[Lua bindings], but the possibilities are endless.
Whatever language you use, Postgres is a great choice for a database, but you need to know some basics before adopting it.

== Install Postgres

To install PostgreSQL on Linux, use your software repository.
On Fedora, CentOS, Mageia, and similar:

[source,bash]
----
$ sudo dnf install postgresql postgresql-server
----

On Debian, Linux Mint, Elementary, and similar:

[source,bash]
----
$ sudo apt install postgresql postgresql-contrib
----

On macOS and Windows, download an installer from https://www.postgresql.org/download/[postgresql.org].

== Setting up Postgres

Most distributions install the Postgres database without _starting_ it, but provide you with a script or https://opensource.com/article/21/4/sysadmins-love-systemd[systemd service] to help it start reliably.
However, before you start PostgreSQL, you must create a database cluster.

=== Fedora

On Fedora, CentOS, or similar, there's a Postgres setup script provided in the Postgres package.
Run this script for easy configuration:

[source,bash]
----
$ sudo /usr/bin/postgresql-setup --initdb
[sudo] password:
* Initializing database in '/var/lib/pgsql/data'
* Initialized, logs are in /var/lib/pgsql/initdb_postgresql.log
----

=== Debian

On Debian-based distributions, setup is performed automatically by `apt` during installation.

=== Everything else

Finally, if you're running something else, then you can just use the toolchain provided by Postgres itself.
The `initdb` command creates a database cluster, but you must run it as the `postgres` user, an identity you may temporarily assume using `sudo`:

[source,bash]
----
$ sudo -u postgres \
"initdb -D /var/lib/pgsql/data \
--locale en_US.UTF-8 --auth md5 --pwprompt"
----

== Start Postgres

Now that a cluster exists, start the Postgres server using either the command provided to you in the output of `initdb` or with systemd:

[source,bash]
----
$ sudo systemctl start postgresql
----

== Creating a database user

To create a Postgres user, use the `createuser` command.
The `postgres` user is the superuser of the Postgres install,

[source,bash]
----
$ sudo -u postgres createuser --interactive --password bogus
Shall the new role be a superuser? (y/n) n
Shall the new role be allowed to create databases? (y/n) y
Shall the new role be allowed to create more new roles? (y/n) n
Password:
----

== Create a database

To create a new database, use the `createdb` command.
In this example, I create the database `exampledb` and assign ownership of it to the user `bogus`:

[source,bash]
----
$ createdb exampledb --owner bogus
----

== Interacting with PostgreSQL

You can interact with a PostgreSQL database using the `psql` command.
This command provides an interactive shell so you can view and update your databases.
To connect to a database, specify the user and database you want to use:

[source,bash]
----
$ psql --user bogus exampledb
psql (XX.Y)
Type "help" for help.

exampledb=>
----

== Create a table

Databases contain tables, which can be visualized as a spreadsheet.
There's a series of rows (called _records_ in a database) and columns.
The intersection of a row and a column is called a _field_.

The Structured Query Language (SQL) is named after what it provides:
A method to inquire about the contents of a database in a predictable and consistent syntax to receive useful results.

Currently, your database is empty, devoid of any tables.
You can create a table with the `CREATE` query.
It's useful to combine this with the `IF NOT EXISTS` statement, which prevents PostgreSQL from clobbering an existing table.

Before you createa table, think about what kind of data (the "data type" in SQL terminology) you anticipate the table to contain.
In this example, I create a table with one column for a unique identifier and one column for some arbitrary text up to nine characters.

[source,sql]
----
exampledb=> CREATE TABLE IF NOT EXISTS my_sample_table(
exampledb(> id SERIAL,
exampledb(> wordlist VARCHAR(9) NOT NULL
);
----

The `SERIAL` keyword isn't actually a data type.
It's https://www.postgresql.org/docs/current/datatype-numeric.html#DATATYPE-SERIAL[special notation in PostgreSQL] that creates an auto-incrementing integer field.
The `VARCHAR` keyword is a data type indicating a variable number of characters within a limit.
In this code, I've specified a maximum of 9 characters.
There are lots of data types in PostgreSQL, so refer to the project documentation for a list of options.

== Insert data

You can populate your new table with some sample data by using the `INSERT` SQL keyword:

[source,sql]
----
exampledb=> INSERT INTO my_sample_table (wordlist) VALUES ('Alice');
INSERT 0 1
----

Your data entry fails, should you attempt to put more than 9 characters into the `wordlist` field:

[source,sql]
----
exampledb=> INSERT INTO my_sample_table (WORDLIST) VALUES ('Alexandria');
ERROR: value too long for type character varying(9)
----

== Alter a table or column

When you need to change a field definition, you use the `ALTER` SQL keyword.
For instance, should you decide that a nine character limit for `wordlist`, you can increase its allowance by setting its data type:

[source,sql]
----
exampledb=> ALTER TABLE my_sample_table
ALTER COLUMN wordlist SET DATA TYPE VARCHAR(10);
ALTER TABLE
exampledb=> INSERT INTO my_sample_table (WORDLIST) VALUES ('Alexandria');
INSERT 0 1
----

== View data in a table

SQL is a query language, so you view the contents of a database through queries.
Queries can be simple, or it can involve joining complex relationships between several different tables.
To see everything in a table, use the `SELECT` keyword on `*` (an asterisk is a wildcard):

[source,sql]
----
exampledb=> SELECT * FROM my_sample_table;
id | wordlist
\----+------------
1 | Alice
2 | Bob
3 | Alexandria
(3 rows)
----

== More data

PostgreSQL can handle a lot of data, but as with any database the key to success is how you design your database for storage and what you do with the data once you've got it stored.
A relatively large public data set can be found on https://stats.oecd.org/Index.aspx?DataSetCode=LAND_COVER[OECD.org], and using this you can try some advanced database techniques.

First, download the data as comma-separated values (CSV) and save the file as `land-cover.csv` in your `Downloads` folder.

Browse the data in a text editor or spreadsheet application to get an idea of what columns there are, and what kind of data each column contains.
Look at the data carefully and keep an eye out for exceptions to an apparent rule.
For instance, the `COU` column, containing a country code such as `AUS` for Australia and `GRC` for Greece, tends to be 3 characters until the oddity `BRIICS`.

Once you understand the data you're working with, you can prepare a Postgres database:

[source,bash]
----
$ createdb landcoverdb --owner bogus
$ psql --user bogus landcoverdb
landcoverdb=> create table land_cover(
country_code varchar(6),
country_name varchar(76),
small_subnational_region_code varchar(5),
small_subnational_region_name varchar(14),
large_subnational_region_code varchar(17),
large_subnational_region_name varchar(44),
measure_code varchar(13),
measure_name varchar(29),
land_cover_class_code varchar(17),
land_cover_class_name varchar(19),
year_code integer,
year_value integer,
unit_code varchar(3),
unit_name varchar(17),
power_code integer,
power_name varchar(9),
reference_period_code varchar(1),
reference_period_name varchar(1),
value float(8),
flag_codes varchar(1),
flag_names varchar(1));
----

== Importing data

Postgres can import CSV data directly using the special metacommand `\copy`:

[source,bash]
----
landcoverdb=> \copy land_cover from '~/land-cover.csv' with csv header delimiter ','
COPY 22113
----

That's 22,113 records imported.
Seems like a good start!

== Querying data

A broad `SELECT` statement to see all columns of all 22,113 records is possible, and Postgres very nicely pipes the output to a screen pager so you can scroll through the output at a leisurely pace.
However, using advanced SQL you can get some useful views of what's otherwise some pretty raw data.

[source,sql]
----
landcoverdb=> select
lcm.country_name,
lcm.year_value,
sum(lcm.value) sum_value
from land_cover lcm
join (
select
country_name,
large_subnational_region_name,
small_subnational_region_name,
max(year_value) max_year_value
from land_cover
group by country_name,
large_subnational_region_name,
small_subnational_region_name
) as lcmyv
on
lcm.country_name = lcmyv.country_name and
lcm.large_subnational_region_name = lcmyv.large_subnational_region_name and
lcm.small_subnational_region_name = lcmyv.small_subnational_region_name and
lcm.year_value = lcmyv.max_year_value
group by lcm.country_name,
lcm.large_subnational_region_name,
lcm.small_subnational_region_name,
lcm.year_value
order by country_name,
year_value;
----

Here's some sample output:

[source,text]
----
\---------------+------------+------------
Afghanistan | 2019 | 743.48425
Albania | 2019 | 128.82532
Algeria | 2019 | 2417.3281
American Samoa | 2019 | 100.2007
Andorra | 2019 | 100.45613
Angola | 2019 | 1354.2192
Anguilla | 2019 | 100.078514
Antarctica | 2019 | 12561.907
[...]
----

SQL is a rich langauge, and so it's beyond the scope of this article.
Read through the SQL code and see if you can modify it to provide a different set of data.

== Open database

PostgreSQL is one of the great open source databases.
With it, you can design repositories for structured data, and then use SQL to view it in different ways so you can gain fresh perspectives on that data.
Postgres integrates with many languages, including Python, Lua, Groovy, Java, and more, so regardless of your toolset, you can probably make use of this excellent database.