A Bug on Aurora PostgreSQL 12 causes Autovacuum to hang forever

Amazon recently announced the release of Aurora PostgreSQL 12. This announcement may have encouraged several customers to upgrade from their existing Aurora PostgreSQL 10.x or 11.x clusters to Aurora PostgreSQL 12.4. Community PostgreSQL 12 has been one of the greatest releases ever, with a long list of features and performance optimizations, so it may be fair to assume that Aurora PostgreSQL 12.4 works along the same lines.

One of our customers also upgraded their Aurora PostgreSQL 11.9 environments to Aurora PostgreSQL 12.4. Testing always plays a key role before performing any upgrade, and the functional testing was done accordingly; everything seemed to be perfect.

The actual issue started less than a day after the upgrade. The performance of some queries started degrading, and after taking a deeper look, we noticed that queries on just one of the tables were behaving badly. We then saw a very strange behaviour on Performance Insights, similar to the following image. I have since reproduced the same issue on Aurora PostgreSQL 12, and the following is an image from my test scenario.

What is wrong with the above image? Yes, the vacuum process on one of the tables is running forever. To know more about autovacuum and why it is needed, I would recommend visiting this blog post I contributed to while working for Percona.
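Performance Insights is not the only way to spot this behaviour. A minimal sketch like the following, run against pg_stat_activity, lists autovacuum workers that have been running for a suspiciously long time (the one-hour threshold is just an arbitrary example, not a recommendation):

-- Sketch: find autovacuum workers that have been running for over an hour.
SELECT pid,
       now() - xact_start AS duration,
       query
FROM   pg_stat_activity
WHERE  backend_type = 'autovacuum worker'
AND    now() - xact_start > interval '1 hour';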

A fairly small table on a completely inactive database with perfectly tuned autovacuum settings should not make the vacuum process run forever. To understand the internals of the autovacuum settings and how to tune them well, I would recommend reading this blog post I wrote some time ago.
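To confirm that autovacuum is indeed falling behind on a particular table, a quick look at pg_stat_user_tables shows the dead tuple counts and when each table was last (auto)vacuumed. The following is only a sketch; on a small, well-tuned table you would expect n_dead_tup to stay low and last_autovacuum to be recent:

-- Sketch: tables with the most dead tuples and their last autovacuum time.
SELECT schemaname,
       relname,
       n_dead_tup,
       last_vacuum,
       last_autovacuum
FROM   pg_stat_user_tables
ORDER  BY n_dead_tup DESC
LIMIT  10;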

The following is the test case I quickly created to reproduce this issue. I created a table with a GiST index on one of the columns and then performed some inserts and updates at random.

migops=> CREATE TABLE foo.bar (id int PRIMARY KEY, sometext varchar(10), loc point);
CREATE TABLE
migops=> CREATE INDEX gist_bar_idx ON foo.bar USING GIST (loc);
CREATE INDEX
migops=> \d+ foo.bar
Table "foo.bar"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description 
----------+-----------------------+-----------+----------+---------+----------+--------------+-------------
id | integer | | not null | | plain | | 
sometext | character varying(10) | | | | extended | | 
loc | point | | | | plain | | 
Indexes:
"bar_pkey" PRIMARY KEY, btree (id)
"gist_bar_idx" gist (loc)
Access method: heap

migops=> INSERT INTO foo.bar VALUES (generate_series(1,100000),'migops',point '(1,2)');
INSERT 0 100000
migops=> UPDATE foo.bar SET sometext = 'junk' where id between 1000 and 100000;
UPDATE 99001

Now, let me run a VACUUM on the table and check whether it completes successfully.

migops=> VACUUM VERBOSE foo.bar ;
INFO: vacuuming "foo.bar"
INFO: index "bar_pkey" now contains 100000 row versions in 276 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: index "gist_bar_idx" now contains 100000 row versions in 1074 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: "bar": found 0 removable, 100000 nonremovable row versions in 736 out of 736 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 27251
There were 0 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM

Yes, the above log shows that the VACUUM completed successfully. However, let me insert some more data and see if VACUUM still succeeds.

migops=> INSERT INTO foo.bar VALUES (generate_series(100001,1000000),'migops',point '(1,2)');
INSERT 0 900000

migops=> VACUUM VERBOSE foo.bar;
INFO: vacuuming "foo.bar"
^CCancel request sent
ERROR: canceling statement due to user request
Time: 382984.688 ms (06:22.985)

As you rightly noticed in the above output, the VACUUM runs forever; it was canceled manually after more than six minutes.
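While the VACUUM is hanging, a second session can be used to see where it is spending its time. The following sketch uses the pg_stat_progress_vacuum view (available since PostgreSQL 9.6); the phase column tells whether the run is scanning the heap or vacuuming the indexes. Given that the workaround below involves the GiST index, one would expect the run to sit in an index-related phase, although that is an assumption on my part:

-- Sketch: watch the progress of any running VACUUM from another session.
SELECT pid,
       relid::regclass AS table_name,
       phase,
       heap_blks_total,
       heap_blks_scanned,
       index_vacuum_count
FROM   pg_stat_progress_vacuum;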

For a customer who runs mission-critical databases, time is money. The first thought when this happened, right after a production database was upgraded, was whether the GiST index got corrupted during the upgrade. So, the index was rebuilt using REINDEX INDEX CONCURRENTLY. Similarly, I rebuilt the index in my test setup too.

migops=> REINDEX INDEX CONCURRENTLY foo.gist_bar_idx;
REINDEX
Time: 7900.757 ms (00:07.901)

migops=> VACUUM VERBOSE foo.bar;
INFO: vacuuming "foo.bar"
INFO: index "gist_bar_idx" now contains 3853520 row versions in 12038 pages
DETAIL: 0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.01 s.
INFO: "bar": found 0 removable, 128 nonremovable row versions in 1 out of 7353 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 33309
There were 0 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.01 s.
VACUUM
Time: 53.344 ms

As noticed in the above logs, the VACUUM succeeded once the index was rebuilt. That was some relief for the customer running their production clusters on Aurora PostgreSQL 12.4.

But what about the next autovacuum? Is it stuck again? This is what the following logs tell us. While reproducing the issue, I ran some more inserts.

migops=> INSERT INTO foo.bar VALUES (generate_series(10000002,11000000), 'migops', point '(3,4)');
INSERT 0 999999

-- The following ran for more than 30 minutes before I terminated it.
migops=> VACUUM VERBOSE foo.bar ;
INFO: vacuuming "foo.bar"
^CCancel request sent
ERROR: canceling statement due to user request

So, it requires another REINDEX of the GiST index to let the vacuum succeed again.
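Since the behaviour appears to be tied to GiST indexes, it may be worth knowing which tables in a given database use them at all. The following catalog query is only a sketch for listing them:

-- Sketch: list all GiST indexes outside the system schemas.
SELECT n.nspname AS schema_name,
       t.relname AS table_name,
       i.relname AS index_name
FROM   pg_index x
JOIN   pg_class i ON i.oid = x.indexrelid
JOIN   pg_class t ON t.oid = x.indrelid
JOIN   pg_namespace n ON n.oid = t.relnamespace
JOIN   pg_am a ON a.oid = i.relam
WHERE  a.amname = 'gist'
AND    n.nspname NOT IN ('pg_catalog', 'information_schema');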

Conclusion: We will now need to wait and see whether AWS sends a notification or a public announcement to let everybody know about this issue, or perhaps removes the Aurora PostgreSQL 12.4 version from the list of options while a fix is implemented, just like it happened earlier. Please subscribe to our blog posts to follow our latest updates on migrations to Postgres, and please contact us to learn how we can assist you with migrations to PostgreSQL.
