Sequential grouping of database table rows in postgreSql
“This post assumes that you have a basic understanding of SQL queries, PostgreSQL , aggregated functions like
GROUP BY, conditional expressions likeCASE, and window functions.”
In one of the projects
I worked,
we had a
PostgreSQL(version 9.4) database table
transaction_logs which stores
transaction data
generated by an ELD (Electronic Logging Device)
like this.
id | state | start_time | end_time | miles
1 | 'ON_DUTY' | '2018-05-01 00:00:00' | '2018-05-01 01:00:00' | 100
2 | 'ON_DUTY' | '2018-05-01 01:12:13' | '2018-05-01 01:58:13' | 30
3 | 'ON_DUTY' | '2018-05-01 02:13:14' | '2018-05-01 03:13:14' | 40
4 | 'OFF_DUTY'| '2018-05-01 03:14:15' | '2018-05-01 04:14:15' | 20
5 | 'SLEEPING'| '2018-05-01 04:15:16' | '2018-05-01 05:15:16' | 0
6 | 'OFF_DUTY'| '2018-05-01 05:16:17' | '2018-05-01 06:16:17' | 30
7 | 'OFF_DUTY'| '2018-05-01 06:17:18' | '2018-05-01 07:17:18' | 5
8 | 'ON_DUTY' | '2018-05-01 07:18:19' | '2018-05-01 08:18:19' | 3
These transaction logs represent a driver’s activity recorded from the vehicle where this ELD is connected. The table had around 50 million+ records.
For a business use case,
we wanted to unify
the transaction logs
having same state
column value in
a sequence ordered
ascendingly with
start_time column.
“For the sake of simplicity; we have already ordered the table records ascendingly with
start_timecolumn.”
To understand the unification, consider the first three rows in the above table
id | state | start_time | end_time | miles
1 | 'ON_DUTY' | '2018-05-01 00:00:00' | '2018-05-01 01:00:00' | 100
2 | 'ON_DUTY' | '2018-05-01 01:12:13' | '2018-05-01 01:58:13' | 30
3 | 'ON_DUTY' | '2018-05-01 02:13:14' | '2018-05-01 03:13:14' | 40
Here, we can see that
the three records in
the sequence are having
the same state ON_DUTY.
We wanted to unify
it to return single
unified record without
losing the meaning
of the data.
id | state | start_time | end_time | miles
1 | 'ON_DUTY' | '2018-05-01 00:00:00' | '2018-05-01 03:13:14' | 170
If you loot at the unified record carefully, then you can observe that
-
Value
idis taken from the first record in the sequence. -
Value
stateis the same as all three records on the basis of which we are unifying. -
Value
start_timeis taken from the first record in the sequence. -
Value
end_timeis taken from the last record(here third) in the sequence. -
Value
milesis a sum of all the miles column value.
This is unification without losing meaning. And we are calling it as a sequential grouping of database table rows.
If we apply sequential grouping on the whole table then the output is like this.
id | state | start_time | end_time | miles
3 | 'ON_DUTY' | '2018-05-01 00:00:00' | '2018-05-01 03:13:14' | 170
4 | 'OFF_DUTY'| '2018-05-01 03:14:15' | '2018-05-01 04:14:15' | 20
5 | 'SLEEPING'| '2018-05-01 04:15:16' | '2018-05-01 05:15:16' | 0
7 | 'OFF_DUTY'| '2018-05-01 05:16:17' | '2018-05-01 07:17:18' | 35
8 | 'ON_DUTY' | '2018-05-01 07:18:19' | '2018-05-01 08:18:19' | 3
In this article, we will discuss how we achieved this sequential grouping using an SQL query.
Adding required explicit aggregate functions FIRST and LAST
We needed aggregate functions
FIRST and LAST
to return first start_time
and last end_time in the
combined result of each group.
PostgreSQL doesn’t provide
these aggregate functions
by default.
We added the FIRST and LAST aggregate functions by running the following queries as given in the PostgreSQL wiki.
-- Create a function that always returns the first non-NULL item
CREATE OR REPLACE FUNCTION public.first_agg ( anyelement, anyelement )
RETURNS anyelement LANGUAGE SQL IMMUTABLE STRICT AS $$
SELECT $1;
$$;
-- And then wrap an aggregate around it
CREATE AGGREGATE public.FIRST (
sfunc = public.first_agg,
basetype = anyelement,
stype = anyelement
);
-- Create a function that always returns the last non-NULL item
CREATE OR REPLACE FUNCTION public.last_agg ( anyelement, anyelement )
RETURNS anyelement LANGUAGE SQL IMMUTABLE STRICT AS $$
SELECT $2;
$$;
-- And then wrap an aggregate around it
CREATE AGGREGATE public.LAST (
sfunc = public.last_agg,
basetype = anyelement,
stype = anyelement
);
One might have
wondered that
why didn’t
we fire direct
GROUP BY query
on the state
column.
Using GROUP BY on the state column
When we fired
a query with a GROUP BY
function on the column state
like this
SELECT LAST(id) as id,
LAST(state) as state,
FIRST(start_time) as start_time,
LAST(end_time) as end_time,
SUM(miles) as miles
FROM transaction_logs
GROUP BY state;
We got the following result
id | state | start_time | end_time | miles
8 | 'ON_DUTY' | '2018-05-01 00:00:00' | '2018-05-01 08:18:19' | 173
5 | 'SLEEPING'| '2018-05-01 04:15:16' | '2018-05-01 05:15:16' | 0
7 | 'OFF_DUTY'| '2018-05-01 03:14:15' | '2018-05-01 07:17:18' | 55
The query returned a single row for each state column value.
Consider a row
from the result
with a state ON_DUTY.
id | state | start_time | end_time | miles
8 | 'ON_DUTY' | '2018-05-01 00:00:00' | '2018-05-01 08:18:19' | 173
It says that a driver was on duty from “12:00 AM” to “8:18 AM”. But if we look at the table data then we can observe that a driver was also off duty and sleeping between the resulted time range. And it is losing the meaning of the data.
We concluded
that using
GROUP BY on
the state column
will group the records
in the scope of
the whole table
and it is not
useful for our
purpose.
We needed a column
which has a unique
value for the same
state sequence
throughout the
table which we
can use for GROUP BY
query.
For this, we came up with a subquery approach to retrieve a virtual column which can return a unique value for a sequence of same state logs.
Building a subquery
First of all, we retrieved a virtual column using subquery which identifies the same state sequence break. This is the first subquery which we used to get a column which will have a unique value for each sequence.
SELECT id, state,
CASE
WHEN LAG(state) OVER (ORDER BY start_time) = state
THEN NULL
ELSE 1
END AS group_flag
FROM transaction_logs;
The above query returns
a virtual column group_flag
which will have value 1 when
state of the previous record
is not the same with
the current one
and NULL when
it is the same.
id | state | group_flag
1 | 'ON_DUTY' | 1
2 | 'ON_DUTY' | NULL
3 | 'ON_DUTY' | NULL
4 | 'OFF_DUTY'| 1
5 | 'SLEEPING'| 1
6 | 'OFF_DUTY'| 1
7 | 'OFF_DUTY'| NULL
8 | 'ON_DUTY' | 1
We used this subquery to retrieve another virtual column which has a unique value for each sequence.
SELECT t1.id, t1.state, t1.group_flag,
SUM(group_flag) OVER (ORDER BY start_time) AS unique_sequence_value
FROM (
SELECT id, state, start_time,
CASE
WHEN lag(state) OVER (ORDER BY start_time) = state THEN NULL
ELSE 1
END AS group_flag
FROM transaction_logs
) t1;
The above query
returns a new virtual
column unique_sequence_value, which has
a cumulatvely sum of
the group_flag values
ordered by start_time.
id | state | group_flag | unique_sequence_value
1 | 'ON_DUTY' | 1 | 1
2 | 'ON_DUTY' | NULL | 1
3 | 'ON_DUTY' | NULL | 1
4 | 'OFF_DUTY'| 1 | 2
5 | 'SLEEPING'| 1 | 3
6 | 'OFF_DUTY'| 1 | 4
7 | 'OFF_DUTY'| NULL | 4
8 | 'ON_DUTY' | 1 | 5
If we look at the values
of the unique_sequence_value column
then we can see that
we’ve got
the required unique value
for the state sequence
which we can use
now for the GROUP BY query.
Final query
We used the two subqueries above to achieve the desired result
SELECT LAST(id) as id,
LAST(state) as state,
FIRST(start_time) as start_time,
LAST(end_time) as end_time,
SUM(miles) as miles
FROM (
SELECT t1.*,
SUM(group_flag) OVER (ORDER BY start_time) AS unique_sequence_value
FROM (
SELECT *,
CASE
WHEN lag(state) OVER (ORDER BY start_time) = state THEN NULL
ELSE 1
END AS group_flag
FROM transaction_logs
) t1
) t2
GROUP BY unique_sequence_value
ORDER BY start_time;
We’ve created a separate table
aggregated_transaction_logs
to store the state
change logs
returned by the query.