How I Upgraded Aurora Postgres RDS Clusters in Production Using Terraform

Chirag Modi

Published in

FAUN — Developer Community 🐾

10 min read · Feb 17, 2023

AWS RDS Minor/Major version upgrade — Step by Step Guide

AWS RDS Minor/Major Version upgrade: Image by Author

Context

A database is a critical component of any organization and must be maintained correctly. A poorly managed database can be disruptive and may bring down one or more services.

Just as we keep an operating system up to date, a database should also be upgraded regularly, unless there is a strong reason not to, such as an upgrade breaking functionality you depend on.

With the arrival of managed DB solutions from cloud providers, gone are the days when Database Administrators took care of everything related to databases. You may have heard of the AWS shared responsibility model, under which customers still need to manage their databases by carrying out common DB tasks such as:

Configure automatic backup
Configure PITR (point in time recovery)
Configure backtracking of DB (Aurora RDS)
Configure alarms for different DB metrics
Apply patches for the operating system for the DB instance
Upgrade minor version for the RDS cluster if not configured to update automatically
Upgrade major version for the RDS cluster

In general, these tasks are carried out by SRE, CloudOps, or DevOps engineers, depending on how the organization is set up: either different members work with different component teams, or a centralized team takes care of these tasks for all component teams.

Scope

Today, I assume no organization still uses the AWS console or another cloud provider UI to manage its databases, because that is manual and tedious when you have a large number of DB clusters to manage.

Organizations use different automation tools, such as Ansible, CloudFormation, or Terraform, to automate their DB management tasks. This article covers Terraform.

I will be covering Aurora Postgres RDS, which is a managed DB solution offered by AWS.

In addition to the RDS version upgrade process, I will cover the problems I faced during the upgrade and how I solved them, so that you can relate to them and apply the same solutions.

I know pictures speak louder than words but the topic I am covering does not leave scope to include more pictures for explanation so pardon me for that.

Planning

First, decide which version to upgrade your RDS cluster to, based on the features available in each RDS version and the EOL (end of life) date of the version you currently use.

In my case, I was using Aurora Postgres RDS version 11.9, whose support ends in Jan 2023, so I needed to upgrade. Aurora Postgres 13 offers some performance improvements, so I decided to upgrade to 13.7.

Keep in mind that we cannot upgrade directly from 11.9 to 13.7; we have to follow the upgrade chart provided by AWS.

Postgres Version Upgrade Chart: Image from AWS Documentation

There are multiple paths from 11.9 to 13.7, so we need to pick a minor version from which the target major version can be reached directly, keeping the number of upgrade steps to a minimum instead of stepping through multiple major versions.

You can also check supported target versions by using this command.

aws rds describe-db-engine-versions \
  --engine aurora-postgresql \
  --engine-version 11.16 \
  --query 'DBEngineVersions[].ValidUpgradeTarget[?IsMajorVersionUpgrade == `true`].{EngineVersion:EngineVersion}' \
  --output text

So based on this, I identified the following path to reach the target version of 13.7 which can be different depending on your current and target version.

Upgrade minor version — 11.9 => 11.16
Upgrade major version — 11.16 => 13.7

Upgrade DB using Terraform

Once the upgrade path is identified, you can start the DB upgrade using your organization’s automation tool. In my case, that is Terraform.

I am using the “rds-aurora” Terraform module for Aurora Postgres RDS deployment: https://registry.terraform.io/modules/terraform-aws-modules/rds-aurora/aws/latest

If you are already using this terraform module then you need to investigate and find out some details like

Which version of this Terraform module supports your target version?
Does it also require you to upgrade the Terraform version (not the module)?
Will it break your existing setup or other modules?

For example, I was using module version 2.29.0 which was pretty old so in addition to DB upgrade, I needed to upgrade terraform module.

I was using terraform version 0.12.23 so I had two options for terraform module upgrade.

3.8.0: requires Terraform version >= 0.12.6, so it is the simple option.
4.3.0: requires Terraform version >= 0.13, so I would need to upgrade the Terraform version as well.

I decided to go with 4.3.0 because in the future I will need the ServerlessV2 feature of Aurora RDS, which is only supported by module version 7.0.0, and that will require Terraform version >= 0.13 anyway.

You might ask: why not go directly to 7.0.0 instead of 4.3.0?

Because 7.0.0 contains major breaking changes, including recreating your DB instances. Always take incremental baby steps instead of one giant leap (remember microservices architecture: multiple small and frequent deployments), because you never know what might break after an upgrade. Keep in mind that the DB is a very sensitive resource for all your components and services.

So we are good to start our DB upgrade process. Let’s do it!

Upgrade Minor Version

Needless to say, you should start in a Dev environment so that you can iterate multiple times without impacting any other environment. If you are worried about data-related issues during the upgrade, you can take a manual snapshot of your DB cluster, although it is not strictly required: AWS provides zero-downtime minor upgrades for the versions listed in its documentation. For other versions, a minor upgrade requires 3–5 minutes of downtime irrespective of your DB size.
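If you do want a manual snapshot and prefer to stay within Terraform, a minimal sketch could look like the following. The resource label and both identifiers are hypothetical placeholders, not values from this setup.

```hcl
# Hypothetical identifiers: replace with your own cluster and snapshot names.
resource "aws_db_cluster_snapshot" "pre_upgrade" {
  db_cluster_identifier          = "my-aurora-cluster"
  db_cluster_snapshot_identifier = "my-aurora-cluster-pre-upgrade-11-9"
}
```

Note that a snapshot managed this way lives in the Terraform state, so removing the resource later also deletes the snapshot.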

Step1 — Update Terraform Module and Terraform Version

The first step is to move to the chosen RDS module version and Terraform version and make sure terraform plan does not complain about anything. If it does, resolve those issues first.
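As a rough sketch of this step, the version constraints could be pinned as shown here. The module label "aurora" is a hypothetical name, and the existing inputs are elided on purpose.

```hcl
terraform {
  # Module 4.3.0 requires Terraform 0.13 or newer.
  required_version = ">= 0.13"
}

module "aurora" {
  source  = "terraform-aws-modules/rds-aurora/aws"
  version = "4.3.0" # previously 2.29.0

  # ... existing inputs stay unchanged in this step ...
}
```

After changing the module version, run terraform init -upgrade so the new module version is downloaded, then terraform plan to review any differences before touching the engine version.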

Step2 — Upgrade Minor Version and Instance Size

Depending on your Terraform setup, update engine_version for the RDS module and run terraform apply to upgrade to the new minor version, which should go through without any issue. I will explain later why I also update instance_type here.

engine_version = "11.16"
instance_type = "db.t3.large"
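For context, this is roughly where those two values sit inside the module block. The module label is hypothetical, and the use of apply_immediately is an assumption about your setup: without it, RDS may defer the change to the next maintenance window.

```hcl
module "aurora" {
  source  = "terraform-aws-modules/rds-aurora/aws"
  version = "4.3.0"

  engine         = "aurora-postgresql"
  engine_version = "11.16"       # minor upgrade from 11.9
  instance_type  = "db.t3.large" # temporarily up from db.t3.medium

  # Assumption: apply the change now instead of in the maintenance window.
  apply_immediately = true

  # ... remaining inputs unchanged ...
}
```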

Upgrade Major Version

Here comes the interesting part of the DB upgrade, where you might face issues that lead to multiple iterations if you are doing this for the first time.

You might say: what is so interesting? Just set engine_version = 13.7 and deploy.

engine_version = "13.7"
db_cluster_parameter_group_name = "<cluster parameter group for 13 family>"
db_parameter_group_name = "<parameter group for 13 family>"

Try it and, believe me, it is not as straightforward as it looks.

Problem-1

You will face a chicken-and-egg problem with the cluster parameter group and the instance parameter group if you are using custom parameter groups.

Two default Terraform behaviors explain this problem:

If a change requires replacement, Terraform destroys the resource and then creates a new one.
Resources that other resources depend on are created first.

A custom parameter group belongs to one engine family, so moving it to the 13 family requires replacement. Terraform will therefore try to delete the existing parameter group, which AWS does not allow because the group is still attached to running instances.

To work around this, you can add a lifecycle block to the resource so that Terraform creates the replacement first and only then destroys the old one.

lifecycle {
  create_before_destroy = true
}

This creates another issue, a duplicate resource error: the existing parameter group is still there, and you are trying to create another one with the same name before deleting it.
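To make the conflict concrete, here is a sketch of what such custom parameter groups might look like. The resource labels and group names are hypothetical. It illustrates the failing setup and is not a recommended configuration.

```hcl
# Hypothetical custom groups as they might exist before the upgrade.
resource "aws_rds_cluster_parameter_group" "custom" {
  name   = "custom-aurora-postgresql-cluster"
  family = "aurora-postgresql11" # changing this to aurora-postgresql13 forces replacement

  lifecycle {
    # The replacement is created first, but it reuses the same name,
    # so the create call collides with the group that still exists.
    create_before_destroy = true
  }
}

resource "aws_db_parameter_group" "custom" {
  name   = "custom-aurora-postgresql-instance"
  family = "aurora-postgresql11" # same replacement problem as above

  lifecycle {
    create_before_destroy = true
  }
}
```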

Solution-1

Every RDS version comes with a default parameter group. So keep the existing custom parameter group resource as it is and point the cluster at the default cluster parameter group for the new family, which should fix this issue.

db_cluster_parameter_group_name = "default.aurora-postgresql13"

Problem-2

The second problem you might see is that the major upgrade deployment fails with an error saying you cannot upgrade to the major version because maintenance is pending for your DB instance.

Solution-2

If OS maintenance is pending for your DB instances, you need to apply it first. However, you might not have access to apply the patch from the AWS console, and even if you do, you should not use it if you follow IaC (infrastructure as code) practice: anything you do from the AWS console is not recorded in the Terraform state, which can leave the state inconsistent.

The solution is to change instance_type temporarily, for example from db.t3.medium to db.t3.large, which should clear the pending maintenance. You need to do this in Step2 of the minor version upgrade; otherwise you will not be able to proceed with the major upgrade.

Step3 — Upgrade Major Version and Default Parameter Group

This is the first step of the major version upgrade, and it combines the solutions for Problem-1 and Problem-2.

engine_version = "13.7"
db_cluster_parameter_group_name = "default.aurora-postgresql13"
db_parameter_group_name = null
allow_major_version_upgrade = true

Instance parameter group will inherit values from cluster parameter group so I have set it as null.

It will require downtime of 15–20 minutes irrespective of Database size.

Should you take a snapshot before executing this? It is not required, because AWS automatically creates a manual snapshot for every major version upgrade, which you can find prefixed with “preupgrade” under the snapshot section of the AWS console.

Step4 — Update Custom Parameter Group

In this step, you switch both the cluster and the instance from the default parameter group back to custom parameter groups. You will not face the issue described in Problem-1, because the parameter group you are creating is not attached to any instance, so Terraform will first create it and then attach it to the cluster and instance.

db_cluster_parameter_group_name = "<cluster parameter group for 13 family>"
db_parameter_group_name = "<parameter group for 13 family>"
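One way this step could look, as a sketch under assumptions: the resource labels, group names, and module label are hypothetical, and parameter blocks are omitted. Since the cluster now uses the default group, the old 11 family groups are detached and can be replaced.

```hcl
resource "aws_rds_cluster_parameter_group" "custom" {
  name   = "custom-aurora-postgresql-cluster"
  family = "aurora-postgresql13" # was aurora-postgresql11
}

resource "aws_db_parameter_group" "custom" {
  name   = "custom-aurora-postgresql-instance"
  family = "aurora-postgresql13" # was aurora-postgresql11
}

module "aurora" {
  # ... source, version, and other inputs unchanged ...

  # Referencing the resources (rather than hard-coding the names) lets
  # Terraform create the groups before it updates the cluster and instances.
  db_cluster_parameter_group_name = aws_rds_cluster_parameter_group.custom.name
  db_parameter_group_name         = aws_db_parameter_group.custom.name
}
```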

Problem-3

If you look at the status of the parameter group for the cluster instance in the AWS console, it will show as “Pending”, meaning the instance must be rebooted before the custom parameter values are applied.

Solution-3

We will apply the same solution-2 here which will fix this issue.

Step5 — Cleanup by Reverting Instance Size

Change instance_type back from db.t3.large to db.t3.medium, reverting the change made in Step2. Once that is done, you should see the custom parameter group status as “In Sync”.

instance_type = "db.t3.medium"

So it took me 5 steps to upgrade from 11.9 to 13.7. As I mentioned before, this way we at least know exactly what we are doing in each step and can fix any issue that arises, instead of doing one big-bang upgrade where it is difficult to identify what went wrong.

Best Practices

Here are some best practices I learned as part of my upgrade journey…

Prepare for downtime depending on whether you are doing a minor/major upgrade.
Don’t expect a failover during a version upgrade even if you have a Multi-AZ setup: failover happens only on instance or AZ failure, not during a version upgrade.
If you want to upgrade Serverless DBs, only specific versions can be upgraded; these are listed in the referenced AWS documentation.
If you want to move from ServerlessV1 to a provisioned DB, the only way is backup and restore; the upgrade steps above will not work.
If you want to re-create a cluster by deleting it, don’t forget to take a manual snapshot first, because automated snapshots are deleted when you delete the cluster.
If you can’t take a manual snapshot and need to use an automated one, convert the automated snapshot to a manual snapshot (by copying it) before deleting the cluster.
If you have read replicas, remember that the upgrade happens first on the master and then on the read replicas. To reduce upgrade time, remove the read replicas and re-create them after the upgrade.
If you have replica autoscaling in place, autoscaled replicas are not part of your Terraform state and must be handled separately from the AWS console.
If you are using an automation tool like Terraform, stick to it instead of changing resources tracked in the Terraform state directly from the AWS console; otherwise your state file will become inconsistent, which can lead to severe issues.
You will see a performance impact after a major version upgrade, depending on your DB size and query pattern. The reason is that a major upgrade clears the cluster cache, and lazy loading into the cache takes time to catch up, so you should have a cache warmup process ready to run after the upgrade.
There is no straightforward way to upgrade the RDS major version without downtime. The alternatives are longer routes that cost more time and money, such as AWS DMS or pglogical replication.

Conclusion

I have tried to cover the overall process of upgrading Aurora Postgres RDS clusters, including the problems I faced, their solutions, and best practices. The list is not exhaustive, but it should cover the major ones. As one size does not fit all, you might face different problems depending on your setup and requirements, for which you can read the referenced AWS documentation.

Thanks for reading!