Cost of a Join - Part 2: Enums, Wider Tables

A follow-up to the previous post, in which the performance of queries with many joins was investigated.

Great discussion on Hacker News and r/programming brought up a couple of ideas I hadn't considered.

  1. What about enums?
  2. What about tables with more columns?

I also figured it would be neat if other people could run these benchmarks with their own parameters / hardware.

So I adjusted my script to support enums and wider tables, and packaged it up into a tool anyone can use. It supports three different join types: enum, foreign key, and what I'm calling "chained". The benchmark definitions are described in a JSON file which looks like this:

$ cat input.json
[
    {
        "join-type": "chained",
        "max-tables": 10,
        "max-rows": 10000,
        "extra-columns": 2,
        "max-id": 5,
        "create-indexes": true,
        "output-filename": "benchmark_1",
        "plot-title": "My Chained Benchmark Title"
    },
    {
        "join-type": "enums",
        "max-rows": 10000,
        "max-enums": 100,
        "possible-enum-values": 10,
        "extra-columns": 2,
        "where-clause": true,
        "output-filename": "benchmark_1",
        "plot-title": "My Enum Benchmark Title"
    },
    {
        "join-type": "foreign-keys",
        "max-primary-table-rows": 10000,
        "max-fk-tables": 100,
        "fk-rows": 100,
        "fk-extra-columns": 2,
        "extra-columns": 2,
        "where-clause": true,
        "output-filename": "benchmark_1",
        "plot-title": "My Foreign Key Benchmark Title"
    }
]

Since JSON doesn't allow comments, a few of the keys deserve explanation:

  • max-tables: chained queries start by joining 2 tables, increasing by one until all tables are joined. The number of tables joined becomes the X axis on the plot.
  • max-rows / max-primary-table-rows: benchmarks are performed at 10 rows, then 100 rows, and so on by factors of 10 until the maximum is reached, creating a separate line on the plot for each row count.
  • max-enums: enum queries start by selecting (and optionally filtering by) 1 enum column, increasing by one until max-enums is reached.
  • max-fk-tables: foreign key queries start by selecting from (and optionally filtering by) 1 foreign key table, increasing by one until max-fk-tables is reached.

You supply the file as input like so...

$ make build
$ PGHOST="localhost" PGDATABASE="join_test" PGUSER="brian" PGPASSWORD="pass" ./run_with_docker.sh input.json

It produces HTML charts as output, along with CSVs of the benchmark data.

results/chained_benchmark_1.html
results/chained_benchmark_1_10000_rows.csv
results/chained_benchmark_1_1000_rows.csv
results/chained_benchmark_1_100_rows.csv
results/chained_benchmark_1_10_rows.csv
results/enum_benchmark_1.html
results/enum_benchmark_1_10000_rows.csv
results/enum_benchmark_1_1000_rows.csv
results/enum_benchmark_1_100_rows.csv
results/enum_benchmark_1_10_rows.csv
results/foreign_key_benchmark_1.html
results/foreign_key_benchmark_1_10000_rows.csv
results/foreign_key_benchmark_1_1000_rows.csv
results/foreign_key_benchmark_1_100_rows.csv
results/foreign_key_benchmark_1_10_rows.csv

And that's it. You can specify as many benchmarks as you like in the JSON file, point the tool at a PostgreSQL instance, and let it rip. I went ahead and did that with an RDS db.m4.large instance. Let's take a look at some results.


Enums vs Foreign Keys

The previous post looked at queries that joined one table to another, which was joined to another, and so on. That's something we can't really do with enums, since an enum can't reference another enum. But we can change things up a bit and use foreign keys to model something similar to what you might use an enum for: one primary table with many integer columns, each referencing a different table that contains a number of text labels. We can then compare the performance of that to enums.

Foreign Keys

For the foreign key version, we'll create a primary table with 100 columns, each referencing a different table that has 10 possible values. We'll then issue queries that select a value from one reference table, then two, then three, up to all 100, and we'll repeat this with the primary table containing 10, 100, 1,000, 10,000, and finally 100,000 rows.

Our primary table looks like this...

join_test=> \d primary_table
                              Table "public.primary_table"
   Column     |  Type   | Collation | Nullable |                  Default
--------------+---------+-----------+----------+-------------------------------------------
 id           | integer |           | not null | nextval('primary_table_id_seq'::regclass)
 table_1_id   | integer |           |          |
 table_2_id   | integer |           |          |
 table_3_id   | integer |           |          |
 ...
 table_100_id | integer |           |          |
Indexes:
    "primary_table_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "primary_table_table_1_id_fkey" FOREIGN KEY (table_1_id) REFERENCES table_1(id)
    "primary_table_table_2_id_fkey" FOREIGN KEY (table_2_id) REFERENCES table_2(id)
    "primary_table_table_3_id_fkey" FOREIGN KEY (table_3_id) REFERENCES table_3(id)
    ...
    "primary_table_table_100_id_fkey" FOREIGN KEY (table_100_id) REFERENCES table_100(id)

join_test=>

Our reference tables look like this...

join_test=> \d table_5
                            Table "public.table_5"
 Column |  Type   | Collation | Nullable |               Default
--------+---------+-----------+----------+-------------------------------------
 id     | integer |           | not null | nextval('table_5_id_seq'::regclass)
 label  | text    |           | not null |
Indexes:
    "table_5_pkey" PRIMARY KEY, btree (id)
Referenced by:
    TABLE "primary_table" CONSTRAINT "primary_table_table_5_id_fkey" FOREIGN KEY (table_5_id) REFERENCES table_5(id)

join_test=>
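The tool builds all of this itself, but for reference, schemas of this shape are easy to sketch in plain SQL. Here's an illustrative (not the tool's actual) way to generate the 100 reference tables and their 10 labels each:

```sql
-- Illustrative only: create 100 reference tables of 10 labels each,
-- mirroring the shape shown above.
DO $$
BEGIN
    FOR i IN 1..100 LOOP
        EXECUTE format(
            'CREATE TABLE table_%s (
                 id serial PRIMARY KEY,
                 label text NOT NULL
             )', i);
        EXECUTE format(
            'INSERT INTO table_%s (label)
             SELECT ''My Label #'' || g FROM generate_series(1, 10) AS g', i);
    END LOOP;
END $$;
```

The primary table is then just 100 integer columns, each with a foreign key constraint pointing at one of these.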

And our queries look like this...

SELECT
    p.id,
    t1.label AS t1_label,
    t2.label AS t2_label,
    t3.label AS t3_label,
    ...
    t100.label AS t100_label
FROM
    primary_table AS p INNER JOIN
    table_1 AS t1 ON
        t1.id = p.table_1_id INNER JOIN
    table_2 AS t2 ON
        t2.id = p.table_2_id INNER JOIN
    table_3 AS t3 ON
        t3.id = p.table_3_id INNER JOIN
    ...
    table_100 AS t100 ON
        t100.id = p.table_100_id;

Here are the results:

If we look at a table with 100,000 rows joining to fetch 5 of its associated labels, we're looking at 0.1s. If we're joining to fetch all 100 associated label values, and we only have 1,000 rows, we're still in that same ballpark of 0.17s. However, if we're fetching all 100 associated label values for 100,000 rows, we're now over 6 seconds.

It's worth noting that there's no WHERE clause, though, so this is perhaps not an especially common use case in, say, a web application. Still, it gives us something to compare enum performance against.

Enums

To replicate the above functionality with enums, we'll create a single table with 100 enum columns, each enum having 10 possible values. We'll then query one of those columns, then two, then three, up to all 100, and we'll repeat this with the table containing 10, 100, 1,000, 10,000, and finally 100,000 rows.
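The enum types themselves are created by the tool, but each one looks something like this (type and label names here are illustrative, chosen to match the labels used elsewhere in this post):

```sql
-- One enum type per column, each with 10 possible labels:
CREATE TYPE enum_1 AS ENUM (
    'My Label #1', 'My Label #2', 'My Label #3', 'My Label #4', 'My Label #5',
    'My Label #6', 'My Label #7', 'My Label #8', 'My Label #9', 'My Label #10'
);

CREATE TABLE primary_table (
    id serial PRIMARY KEY,
    label_1 enum_1
    -- ...and so on, through label_100 enum_100
);
```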

Our enum-using table looks like this...

join_test=> \d primary_table
                             Table "public.primary_table"
 Column    |  Type     | Collation | Nullable |                  Default
-----------+-----------+-----------+----------+-------------------------------------------
 id        | integer   |           | not null | nextval('primary_table_id_seq'::regclass)
 label_1   | enum_1    |           |          |
 label_2   | enum_2    |           |          |
 label_3   | enum_3    |           |          |
 ...
 label_100 | enum_100  |           |          |
Indexes:
    "primary_table_pkey" PRIMARY KEY, btree (id)

join_test=>

And our queries look like this...

SELECT
    id,
    label_1,
    label_2,
    label_3,
    ...
    label_100
FROM
    primary_table;

Wow! Enums really are faster. I didn't expect that to be the case, especially after reading that they're apparently implemented with a system table behind the scenes. I can't say I'm going to run out and replace all my reference tables with enums tomorrow, since there are some drawbacks: existing values can't be removed from an enum, and (prior to PostgreSQL 12) adding a new value couldn't be executed inside a transaction block. But still. Interesting!
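That system table is pg_enum: each value of each enum type is a catalog row, and an enum column stores a compact internal identifier rather than the label text, which helps explain the speed. The catalog can be inspected, and values appended, like so:

```sql
-- List the values of one of our enum types, in declared order:
SELECT enumlabel
FROM pg_enum
WHERE enumtypid = 'enum_1'::regtype
ORDER BY enumsortorder;

-- New values can be appended (or renamed, in PostgreSQL 10+)...
ALTER TYPE enum_1 ADD VALUE 'My Label #11';
-- ...but there is no DROP VALUE; removing one means creating a
-- replacement type and migrating every column that uses the old one.
```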

Filtering on Foreign Keys and Enums

With the ability to create tables like these already set up, I thought it might be interesting to see what happens if we filter on these values. So, same setup, but our queries will now have a WHERE clause.

Foreign Key version:

SELECT
    p.id,
    t1.label AS t1_label,
    t2.label AS t2_label,
    t3.label AS t3_label,
    ...
    t100.label AS t100_label
FROM
    primary_table AS p INNER JOIN
    table_1 AS t1 ON
        t1.id = p.table_1_id INNER JOIN
    table_2 AS t2 ON
        t2.id = p.table_2_id INNER JOIN
    table_3 AS t3 ON
        t3.id = p.table_3_id INNER JOIN
    ...
    table_100 AS t100 ON
        t100.id = p.table_100_id
WHERE
    t1.label = 'My Label #1' AND
    t2.label = 'My Label #1' AND
    t3.label = 'My Label #1' AND
    ...
    t100.label = 'My Label #1';

Foreign Key results:

Performance improves, and the number of rows in the primary table doesn't seem to matter much. Here's the enum version with a WHERE clause:

SELECT
    id,
    label_1,
    label_2,
    label_3,
    ...
    label_100
FROM
    primary_table
WHERE
    label_1 = 'My Label #1' AND
    label_2 = 'My Label #1' AND
    label_3 = 'My Label #1' AND
    ...
    label_100 = 'My Label #1';

And the enum results:

Pretty great performance from enums. Whatever shortcuts PostgreSQL is able to take when filtering on these values, they certainly have an impact: filtering on all 100 enum columns across 100,000 rows takes only 0.06s.
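If you're curious where that speed comes from on your own hardware, the planner will show its work. Something like this (against the schema above) prints the chosen plan along with per-node timing and buffer counts:

```sql
-- Inspect the plan the filtered enum query actually gets:
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, label_1, label_2
FROM primary_table
WHERE label_1 = 'My Label #1'
  AND label_2 = 'My Label #1';
```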


Wider Tables

Wider tables are something we can test with the type of queries used in the prior post, the "chained" kind, where one table joins to another, which joins to another, and so on. These wider tables will look like this, with a number of extra columns added between id and the reference to the previous table...

join_test=> \d table_5
                                           Table "public.table_5"
     Column      |         Type          | Collation | Nullable |                  Default
-----------------+-----------------------+-----------+----------+-------------------------------------------
 id              | integer               |           | not null | nextval('table_5_id_seq'::regclass)
 extra_column_1  | character varying(20) |           |          | '12345678901234567890'::character varying
 extra_column_2  | character varying(20) |           |          | '12345678901234567890'::character varying
 extra_column_3  | character varying(20) |           |          | '12345678901234567890'::character varying
 extra_column_4  | character varying(20) |           |          | '12345678901234567890'::character varying
 extra_column_5  | character varying(20) |           |          | '12345678901234567890'::character varying
 extra_column_6  | character varying(20) |           |          | '12345678901234567890'::character varying
 extra_column_7  | character varying(20) |           |          | '12345678901234567890'::character varying
 extra_column_8  | character varying(20) |           |          | '12345678901234567890'::character varying
 extra_column_9  | character varying(20) |           |          | '12345678901234567890'::character varying
 extra_column_10 | character varying(20) |           |          | '12345678901234567890'::character varying
 table_4_id      | integer               |           |          |
Indexes:
    "table_5_pkey" PRIMARY KEY, btree (id)
    "table_5_table_4_id_idx" btree (table_4_id)
Foreign-key constraints:
    "table_5_table_4_id_fkey" FOREIGN KEY (table_4_id) REFERENCES table_4(id)
Referenced by:
    TABLE "table_6" CONSTRAINT "table_6_table_5_id_fkey" FOREIGN KEY (table_5_id) REFERENCES table_5(id)

join_test=>

The benchmark script is sure to fill each one with real data...

join_test=> select * from table_5 limit 1;
-[ RECORD 1 ]---+---------------------
id              | 1
extra_column_1  | 12345678901234567890
extra_column_2  | 12345678901234567890
extra_column_3  | 12345678901234567890
extra_column_4  | 12345678901234567890
extra_column_5  | 12345678901234567890
extra_column_6  | 12345678901234567890
extra_column_7  | 12345678901234567890
extra_column_8  | 12345678901234567890
extra_column_9  | 12345678901234567890
extra_column_10 | 12345678901234567890
table_4_id      | 738

Time: 1.197 ms
join_test=>

Our queries look like this...

SELECT
    count(*)
FROM
    table_1 AS t1 INNER JOIN
    table_2 AS t2 ON
        t1.id = t2.table_1_id INNER JOIN
    table_3 AS t3 ON
        t2.id = t3.table_2_id INNER JOIN
    table_4 AS t4 ON
        t3.id = t4.table_3_id INNER JOIN
    ...
    table_100 AS t100 ON
        t99.id = t100.table_99_id
WHERE
    t1.id <= 5;
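One caveat worth knowing when joining this many tables: by default the planner stops exhaustively searching join orders after join_collapse_limit tables (default 8), and switches to genetic query optimization once a query involves geqo_threshold relations (default 12), so planning behavior itself changes as the X axis grows. These settings can be inspected and, for experimentation, adjusted per session:

```sql
SHOW join_collapse_limit;   -- default 8
SHOW geqo_threshold;        -- default 12

-- For experimentation only: exhaustive planning time grows
-- rapidly as this limit is raised.
SET join_collapse_limit = 12;
```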

Results using tables with 10 extra columns...

And results using tables with 100 extra columns...

It's interesting that the extra width matters at all, but also that it doesn't seem to matter much.
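One plausible explanation for why width matters at all is plain I/O: wider rows mean fewer tuples fit on each 8 kB heap page, so scans touch more pages. You can check what the extra columns cost on disk with:

```sql
-- Compare the heap size with the table's total footprint (indexes included):
SELECT pg_size_pretty(pg_relation_size('table_5'))       AS heap_only,
       pg_size_pretty(pg_total_relation_size('table_5')) AS with_indexes;
```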

Conclusion

My takeaway from this is that enums really are faster in some cases, and that wider tables don't necessarily mean deal-breaking slowdowns.

Another takeaway is that there's an essentially infinite number of conditions and scenarios you might want to test. A few of them are even supported by the benchmarking tool, if you'd like to try these cases for yourself:

  • Enums vs Foreign Keys with extra columns added to the primary table
  • Foreign Key benchmark with extra columns added to the referenced tables
  • Increasing the number of possible values in the enums / referenced tables

And of course, the number of rows / tables can be increased in general if you have the patience.

If you try any of that, let me know how it goes!