Sequential grouping of database table rows in postgreSql

“This post assumes that you have a basic understanding of SQL queries, PostgreSQL , aggregated functions like GROUP BY, conditional expressions like CASE, and window functions.”

In one of the projects I worked, we had a PostgreSQL(version 9.4) database table transaction_logs which stores transaction data generated by an ELD (Electronic Logging Device) like this.

id |   state   |       start_time      |       end_time        | miles
 1 | 'ON_DUTY' | '2018-05-01 00:00:00' | '2018-05-01 01:00:00' |  100
 2 | 'ON_DUTY' | '2018-05-01 01:12:13' | '2018-05-01 01:58:13' |   30
 3 | 'ON_DUTY' | '2018-05-01 02:13:14' | '2018-05-01 03:13:14' |   40
 4 | 'OFF_DUTY'| '2018-05-01 03:14:15' | '2018-05-01 04:14:15' |   20
 5 | 'SLEEPING'| '2018-05-01 04:15:16' | '2018-05-01 05:15:16' |    0
 6 | 'OFF_DUTY'| '2018-05-01 05:16:17' | '2018-05-01 06:16:17' |   30
 7 | 'OFF_DUTY'| '2018-05-01 06:17:18' | '2018-05-01 07:17:18' |    5
 8 | 'ON_DUTY' | '2018-05-01 07:18:19' | '2018-05-01 08:18:19' |    3

These transaction logs represent a driver’s activity recorded from the vehicle where this ELD is connected. The table had around 50 million+ records.

For a business use case, we wanted to unify the transaction logs having same state column value in a sequence ordered ascendingly with start_time column.

“For the sake of simplicity; we have already ordered the table records ascendingly with start_time column.”

To understand the unification, consider the first three rows in the above table

id |   state   |       start_time      |       end_time        | miles
 1 | 'ON_DUTY' | '2018-05-01 00:00:00' | '2018-05-01 01:00:00' |  100
 2 | 'ON_DUTY' | '2018-05-01 01:12:13' | '2018-05-01 01:58:13' |   30
 3 | 'ON_DUTY' | '2018-05-01 02:13:14' | '2018-05-01 03:13:14' |   40

Here, we can see that the three records in the sequence are having the same state ON_DUTY. We wanted to unify it to return single unified record without losing the meaning of the data.

id |   state   |       start_time      |       end_time        | miles
 1 | 'ON_DUTY' | '2018-05-01 00:00:00' | '2018-05-01 03:13:14' |  170

If you loot at the unified record carefully, then you can observe that

Value id is taken from the first record in the sequence.
Value state is the same as all three records on the basis of which we are unifying.
Value start_time is taken from the first record in the sequence.
Value end_time is taken from the last record(here third) in the sequence.
Value miles is a sum of all the miles column value.

This is unification without losing meaning. And we are calling it as a sequential grouping of database table rows.

If we apply sequential grouping on the whole table then the output is like this.

id |   state   |       start_time      |       end_time        | miles
 3 | 'ON_DUTY' | '2018-05-01 00:00:00' | '2018-05-01 03:13:14' |  170
 4 | 'OFF_DUTY'| '2018-05-01 03:14:15' | '2018-05-01 04:14:15' |   20
 5 | 'SLEEPING'| '2018-05-01 04:15:16' | '2018-05-01 05:15:16' |    0
 7 | 'OFF_DUTY'| '2018-05-01 05:16:17' | '2018-05-01 07:17:18' |   35
 8 | 'ON_DUTY' | '2018-05-01 07:18:19' | '2018-05-01 08:18:19' |    3

In this article, we will discuss how we achieved this sequential grouping using an SQL query.

Adding required explicit aggregate functions FIRST and LAST

We needed aggregate functions FIRST and LAST to return first start_time and last end_time in the combined result of each group. PostgreSQL doesn’t provide these aggregate functions by default.

We added the FIRST and LAST aggregate functions by running the following queries as given in the PostgreSQL wiki.

-- Create a function that always returns the first non-NULL item
CREATE OR REPLACE FUNCTION public.first_agg ( anyelement, anyelement )
RETURNS anyelement LANGUAGE SQL IMMUTABLE STRICT AS $$
        SELECT $1;
$$;

-- And then wrap an aggregate around it
CREATE AGGREGATE public.FIRST (
        sfunc    = public.first_agg,
        basetype = anyelement,
        stype    = anyelement
);

-- Create a function that always returns the last non-NULL item
CREATE OR REPLACE FUNCTION public.last_agg ( anyelement, anyelement )
RETURNS anyelement LANGUAGE SQL IMMUTABLE STRICT AS $$
        SELECT $2;
$$;

-- And then wrap an aggregate around it
CREATE AGGREGATE public.LAST (
        sfunc    = public.last_agg,
        basetype = anyelement,
        stype    = anyelement
);

One might have wondered that why didn’t we fire direct GROUP BY query on the state column.

Using GROUP BY on the state column

When we fired a query with a GROUP BY function on the column state like this

SELECT LAST(id) as id,
       LAST(state) as state,
       FIRST(start_time) as start_time,
       LAST(end_time) as end_time,
       SUM(miles) as miles
FROM transaction_logs
GROUP BY state;

We got the following result

id |   state   |       start_time      |       end_time        | miles
 8 | 'ON_DUTY' | '2018-05-01 00:00:00' | '2018-05-01 08:18:19' |  173
 5 | 'SLEEPING'| '2018-05-01 04:15:16' | '2018-05-01 05:15:16' |    0
 7 | 'OFF_DUTY'| '2018-05-01 03:14:15' | '2018-05-01 07:17:18' |   55

The query returned a single row for each state column value.

Consider a row from the result with a state ON_DUTY.

id |   state   |       start_time      |       end_time        | miles
 8 | 'ON_DUTY' | '2018-05-01 00:00:00' | '2018-05-01 08:18:19' |  173

It says that a driver was on duty from “12:00 AM” to “8:18 AM”. But if we look at the table data then we can observe that a driver was also off duty and sleeping between the resulted time range. And it is losing the meaning of the data.

We concluded that using GROUP BY on the state column will group the records in the scope of the whole table and it is not useful for our purpose.

We needed a column which has a unique value for the same state sequence throughout the table which we can use for GROUP BY query.

For this, we came up with a subquery approach to retrieve a virtual column which can return a unique value for a sequence of same state logs.

Building a subquery

First of all, we retrieved a virtual column using subquery which identifies the same state sequence break. This is the first subquery which we used to get a column which will have a unique value for each sequence.

SELECT id, state,
       CASE
       WHEN LAG(state) OVER (ORDER BY start_time) = state
         THEN NULL
       ELSE 1
       END AS group_flag
FROM transaction_logs;

The above query returns a virtual column group_flag which will have value 1 when state of the previous record is not the same with the current one and NULL when it is the same.

id |   state   | group_flag
 1 | 'ON_DUTY' |     1
 2 | 'ON_DUTY' |     NULL
 3 | 'ON_DUTY' |     NULL
 4 | 'OFF_DUTY'|     1
 5 | 'SLEEPING'|     1
 6 | 'OFF_DUTY'|     1
 7 | 'OFF_DUTY'|     NULL
 8 | 'ON_DUTY' |     1

We used this subquery to retrieve another virtual column which has a unique value for each sequence.

SELECT t1.id, t1.state, t1.group_flag,
      SUM(group_flag) OVER (ORDER BY start_time) AS unique_sequence_value
FROM (
      SELECT id, state, start_time,
             CASE
                WHEN lag(state) OVER (ORDER BY start_time) = state THEN NULL
                ELSE 1
             END AS group_flag
      FROM transaction_logs
  ) t1;

The above query returns a new virtual column unique_sequence_value, which has a cumulatvely sum of the group_flag values ordered by start_time.

id |   state   | group_flag | unique_sequence_value
 1 | 'ON_DUTY' |     1      |   1
 2 | 'ON_DUTY' |     NULL   |   1
 3 | 'ON_DUTY' |     NULL   |   1
 4 | 'OFF_DUTY'|     1      |   2
 5 | 'SLEEPING'|     1      |   3
 6 | 'OFF_DUTY'|     1      |   4
 7 | 'OFF_DUTY'|     NULL   |   4
 8 | 'ON_DUTY' |     1      |   5

If we look at the values of the unique_sequence_value column then we can see that we’ve got the required unique value for the state sequence which we can use now for the GROUP BY query.

Final query

We used the two subqueries above to achieve the desired result

SELECT LAST(id) as id,
       LAST(state) as state,
       FIRST(start_time) as start_time,
       LAST(end_time) as end_time,
       SUM(miles) as miles
FROM (
  SELECT t1.*,
      SUM(group_flag) OVER (ORDER BY start_time) AS unique_sequence_value
  FROM (
      SELECT *,
             CASE
                WHEN lag(state) OVER (ORDER BY start_time) = state THEN NULL
                ELSE 1
             END AS group_flag
      FROM transaction_logs
  ) t1
) t2
GROUP BY unique_sequence_value
ORDER BY start_time;

We’ve created a separate table aggregated_transaction_logs to store the state change logs returned by the query.