AWS Aurora Global Clusters Explained: What you wish they told you before you built it

AWS Aurora Global is, on the face of it, a decent product. Aurora is a MySQL fork with a tonne of purported performance benefits over vanilla MySQL. I was building a system in AWS which relied on a MySQL database, so I thought I’d give Aurora Global Clusters a try: I wanted high availability both within a single region and across regions (i.e. if one region failed, another took over). Here’s what I learned:

Foreword about Aurora Serverless

Everybody wants to be “serverless” these days, right? Aurora Serverless is AWS’s attempt at jumping on that bandwagon. Unlike truly serverless products like Lambda and DynamoDB, Aurora Serverless is not really serverless: when you provision a “serverless” cluster, a set of MySQL servers still gets built for you. Having used Aurora Serverless in a former job, I quickly found it isn’t a mature product, and after many, many hours of cluster downtime caused by badly behaved scaling activities we binned it off. I don’t think it’s improved much since and, personally, I’d not touch it with a 9.5 ft pole.

Basic Design

Aurora Global is, after all, just MySQL. As such, the design is broadly what you’d expect for a single-master MySQL cluster. The Global Cluster has a primary region, which is where the master/writer node runs. You then have one or more read-only nodes in your primary and secondary region(s). Aurora Global is, in effect, a way to get read-only copies of your MySQL database replicated to other regions so your applications can do low-latency reads of that data.
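
As a practical aside (this is standard MySQL/Aurora behaviour rather than anything Global-specific), it’s handy to know how to check which kind of node a connection has landed on; it becomes relevant later when talking about failover and ProxySQL health checks. The second query is Aurora-specific, so double-check it against the docs for your engine version:

-- 1 on a read-only replica, 0 on the writer. ProxySQL's innodb_read_only
-- check (used later in this post) looks at the same variable.
SELECT @@innodb_read_only;

-- Aurora also exposes the regional cluster topology; the row whose
-- session_id is 'MASTER_SESSION_ID' is the current writer.
SELECT server_id, session_id FROM information_schema.replica_host_status;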

Aurora Global Cluster Limitations

There are a few documented limitations, albeit hidden deep inside larger documentation pages. Here are a few of the key ones:

  • You cannot have multi-master global clusters. Global clusters have a single writer and multiple readers
  • You can have up to 6 regional clusters in a global cluster. If you want more, you’re stuffed
  • You can’t use RDS Proxy with Global clusters
  • You won’t get automatic minor version upgrades
  • There’s no auto-scaling
  • You often can’t do major version upgrades – for example you cannot upgrade from
    5.7.mysql_aurora.2.10.1 to 8.0.mysql_aurora.3.01.0 (a quick way to check which
    version you’re currently on is shown just below this list)
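
Before planning any upgrade, it’s worth confirming exactly which version a cluster is actually running. The standard version() function plus Aurora’s built-in aurora_version() function should give you both halves of the version string:

-- Returns the MySQL version and the Aurora release separately
SELECT version() AS mysql_version, aurora_version() AS aurora_release;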

Resiliency and Failover

The obvious advantage that you might assume exists when running a Global Cluster is that, if the primary region (the one with the writer node) were to fail, then another region would pick up the traffic. Well… you’re wrong. Aurora Global has no automatic failover: if the primary region fails, a writer node will never automatically appear in another region, so you cannot write to your database until the primary region recovers.

It is possible for you to manually initiate a “failover”… but this involves removing a region from the Global Cluster and then pointing all of your applications, globally, to this single regional cluster. Once you do this, that region will have a writer node but all other regions will be entirely useless as they will not be replicating from the newly split region. To restore other regions, you must destroy and rebuild their database clusters, making them part of a new global cluster with the region that you just promoted as the primary.

In addition to this, when the primary writer node fails/reboots for some reason (e.g. underlying instance hardware failure), all other nodes in your cluster will also reboot. This means you will have a global read outage until the writer node comes back.

Update 2023-01-05: The unavailability of reader nodes when the primary writer node fails has become particularly important during an outage today. A bug caused the primary region writer node, in one of our clusters, to crash. This caused all of the reader nodes to go offline, meaning that writing and reading, across all regions of the global cluster, was down for about 8 hours. AWS support have no ability to resolve issues like this and must escalate to internal teams. Manually initiating a “failover” to another region (by deleting the primary region’s database) would have been tempting here but the bug meant the secondary region would have crashed too and the whole cluster would need rebuilding, due to the “failover” activity. This is a terrifying prospect, for a production database…

Upgrades

As mentioned above, you often can’t do major version upgrades without building a whole new cluster and migrating your applications. Minor version upgrades are sometimes possible by simply editing the Aurora version of your cluster, but there’s a bug with CloudFormation meaning that you must manually upgrade your clusters with the AWS console/CLI before changing the Aurora version in CloudFormation, otherwise your stack will error out.

I’ve had mixed success with upgrades. Some have been smooth, with no visible downtime; others have been a bit more brutal, with a minute or two of downtime during the upgrade. Upgrading from 3.02.x to 3.03.x, for example, isn’t possible at all without destroying and rebuilding your entire global cluster, which will likely come with a very long period of downtime.

Aurora now has a Blue/Green deployment model, but that also comes with a minute or so of downtime as traffic fails over.

In essence, expect downtime during upgrades.

Parallel Query

Parallel Query is an Aurora feature which is not enabled by default (as at Feb 2023). It purportedly improves performance by automagically splitting a SELECT query into multiple smaller queries and executing them against the database in parallel. For example, it might split something like SELECT * FROM table WHERE foo IN ('bar', 'baz') into two separate queries: one searching for foo='bar' and the other for foo='baz'.

Feb 2023: Alas, Parallel Query has a propensity to crash your database servers when it is enabled and AWS support recommends turning it off:

During our investigation, we identified that the restarts were caused by an issue in the Aurora MySQL software related to parallel query. Please look for upcoming release for the fix. In the meantime, consider leaving the parallel query option disabled by setting aurora_parallel_query = OFF.

Apr 2023: Aurora 3.02.3 and 3.03.1 apparently fix this issue.
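
If you want to check where a cluster stands, the switch is a plain session variable and EXPLAIN will tell you whether a given query would use the feature. The variable name below is the one AWS support quoted; older Aurora releases used a different name, so check the docs for your version (my_table and foo are just placeholders):

-- Is parallel query currently enabled for this session?
SELECT @@aurora_parallel_query;

-- Disable it for the session while on an affected release
SET SESSION aurora_parallel_query = OFF;

-- When enabled, the Extra column of the EXPLAIN output notes whether the
-- query is eligible for parallel query
EXPLAIN SELECT * FROM my_table WHERE foo IN ('bar', 'baz');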

Write Forwarding

Write forwarding is a pretty exciting feature, on paper. It allows you to run write (INSERT, UPDATE, DELETE, etc.) statements on reader nodes and have them forwarded to, and executed on, the writer node. It does, however, have a lot of very painful limitations:

  • Contrary to the documentation, write forwarding is not available on Aurora versions based on MySQL 8. At the time of writing, the latest version that supports it is
    5.7.mysql_aurora.2.10.1
  • You do not have access to “writer” nodes in non-primary regions. The DNS record doesn’t resolve because the endpoint is “inactive”. This means that your applications always need to know whether they are running in a primary or secondary region. Update 2022-09-09: Aurora changed behaviour, without notice, to make secondary region writer endpoints active. This apparently broke quite a few customers, so they rolled back the change; however, there does seem to be a desire from them to re-implement this in the future
  • When you connect to a reader node and try to write to it, you will be told the node is read-only until you run this query: set aurora_replica_read_consistency = 'eventual'; (a full example session is shown after this list)
  • You cannot set aurora_replica_read_consistency globally. It must be set, manually, each time you connect to the database. As such, it’s unlikely you’ll be able to use write forwarding with applications whose codebase you do not have full control over
  • You cannot do DDL (create table, alter table, truncate table, etc.) on write forwarded nodes
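
To make that concrete, here’s roughly what a write-forwarded session looks like, assuming you’re on a 2.x version that supports it and you’re connected to a reader endpoint in a secondary region (the orders table is just a made-up example):

-- Required on every new connection; there is no way to set this globally.
-- The documented consistency levels are EVENTUAL, SESSION and GLOBAL; the
-- stronger levels make reads wait for forwarded writes to replicate back.
SET aurora_replica_read_consistency = 'eventual';

-- DML like this now gets forwarded to the writer in the primary region
INSERT INTO orders (id, status) VALUES (42, 'pending');
UPDATE orders SET status = 'shipped' WHERE id = 42;

-- DDL is NOT forwarded; a statement like the below will still fail here
-- CREATE TABLE audit_log (id BIGINT PRIMARY KEY, msg TEXT);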

I want solutions, not problems!

All is not entirely lost with Aurora Global. It serves a purpose if you need to run your applications geographically close to your users and, thus, want a local read-only database next to the application.

For anything more than this, an open-source product called ProxySQL solves a lot of the problems. The basic design is as follows:

  • Create a multi-region Aurora Global cluster as you would normally. Don’t worry about enabling write forwarding, because it’s a bit useless
  • Peer together the VPCs in all of your regions, and add security group rules, such that each region can connect to the Aurora instances in every other region
  • Run ProxySQL locally in each of your regions
  • Add your local region’s writer and reader endpoints to ProxySQL with a weight of 10,000,000
  • Add your writer and reader endpoints, in every other region, to ProxySQL with a weight of 1 (at time of writing, weight 0 meant the server was never used so 1 means it’s going to be used for 1 query in every 10 million… which is the best you’ll get)
  • Create an entry in mysql_replication_hostgroups with a check_type of innodb_read_only to ensure that writer endpoints which erroneously become active are never attempted to be written to
  • Create ProxySQL mysql_query_rules to send SELECT (but not SELECT FOR UPDATE) statements to the reader hostgroup whilst sending all other queries to the writer hostgroup
  • Connect your applications to the ProxySQL instance running locally in their own region

What will happen here is:

  • Applications will do write queries to ProxySQL and it will forward them to the single writer node which is currently active
  • In the event that you are forced to break a region out of the global cluster to “promote” it, ProxySQL will realise this and start sending write queries to that region
  • Read queries will always go to the reader node(s) in the local region
  • In the event that your local reader nodes fail (either because of a fault or because you manually destroyed them during “failover”) then read queries will be done from another region. This means that your applications must tolerate some kind of elevated query latency, but this is good practice anyway

Here’s the config for that:

MySQL [(none)]> select * from mysql_users;
+----------+----------+--------+---------+-------------------+----------------+---------------+------------------------+--------------+---------+----------+-----------------+------------+---------+
| username | password | active | use_ssl | default_hostgroup | default_schema | schema_locked | transaction_persistent | fast_forward | backend | frontend | max_connections | attributes | comment |
+----------+----------+--------+---------+-------------------+----------------+---------------+------------------------+--------------+---------+----------+-----------------+------------+---------+
| admin    | passw0rd | 1      | 0       | 1                 |                | 0             | 1                      | 0            | 1       | 1        | 10000           |            |         |
+----------+----------+--------+---------+-------------------+----------------+---------------+------------------------+--------------+---------+----------+-----------------+------------+---------+
1 row in set (0.000 sec)

MySQL [(none)]> select * from mysql_servers;
+--------------+--------------------------------------------------------------+------+-----------+--------+----------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname                                                     | port | gtid_port | status | weight   | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+--------------------------------------------------------------+------+-----------+--------+----------+-------------+-----------------+---------------------+---------+----------------+---------+
| 1            | thing.cluster-cdqdzbngckab.eu-west-1.rds.amazonaws.com       | 3306 | 0         | ONLINE | 10000000 | 0           | 1000            | 0                   | 0       | 0              |         |
| 1            | thing.cluster-cvbn52q4dmcd.eu-central-1.rds.amazonaws.com    | 3306 | 0         | ONLINE | 1        | 0           | 1000            | 0                   | 0       | 0              |         |
| 2            | thing.cluster-ro-cvbn52q4dmef.eu-central-1.rds.amazonaws.com | 3306 | 0         | ONLINE | 1        | 0           | 1000            | 0                   | 0       | 0              |         |
| 2            | thing.cluster-ro-cdqdzbngckgh.eu-west-1.rds.amazonaws.com    | 3306 | 0         | ONLINE | 10000000 | 0           | 1000            | 0                   | 0       | 0              |         |
+--------------+--------------------------------------------------------------+------+-----------+--------+----------+-------------+-----------------+---------------------+---------+----------------+---------+
4 rows in set (0.000 sec)

MySQL [(none)]> select * from mysql_replication_hostgroups;
+------------------+------------------+------------------+-----------------+
| writer_hostgroup | reader_hostgroup | check_type       | comment         |
+------------------+------------------+------------------+-----------------+
| 1                | 2                | innodb_read_only | Read only check |
+------------------+------------------+------------------+-----------------+
1 row in set (0.000 sec)


MySQL [(none)]> select * from mysql_query_rules;
+---------+--------+----------+------------+--------+-------------+------------+------------+--------+---------------------+---------------+----------------------+--------------+---------+-----------------+-----------------------+-----------+--------------------+---------------+-----------+---------+---------+-------+-------------------+----------------+------------------+-----------+--------+-------------+-----------+---------------------+-----+-------+------------+---------+
| rule_id | active | username | schemaname | flagIN | client_addr | proxy_addr | proxy_port | digest | match_digest | match_pattern | negate_match_pattern | re_modifiers | flagOUT | replace_pattern | destination_hostgroup | cache_ttl | cache_empty_result | cache_timeout | reconnect | timeout | retries | delay | next_query_flagIN | mirror_flagOUT | mirror_hostgroup | error_msg | OK_msg | sticky_conn | multiplex | gtid_from_hostgroup | log | apply | attributes | comment |
+---------+--------+----------+------------+--------+-------------+------------+------------+--------+---------------------+---------------+----------------------+--------------+---------+-----------------+-----------------------+-----------+--------------------+---------------+-----------+---------+---------+-------+-------------------+----------------+------------------+-----------+--------+-------------+-----------+---------------------+-----+-------+------------+---------+
| 1 | 1 | NULL | NULL | 0 | NULL | NULL | NULL | NULL | ^SELECT.*FOR UPDATE | NULL | 0 | CASELESS | NULL | NULL | 1 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 1 | NULL | NULL | 1 | | NULL |
| 2 | 1 | NULL | NULL | 0 | NULL | NULL | NULL | NULL | ^SELECT | NULL | 0 | CASELESS | NULL | NULL | 2 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 1 | NULL | NULL | 1 | | NULL |
+---------+--------+----------+------------+--------+-------------+------------+------------+--------+---------------------+---------------+----------------------+--------------+---------+-----------------+-----------------------+-----------+--------------------+---------------+-----------+---------+---------+-------+-------------------+----------------+------------------+-----------+--------+-------------+-----------+---------------------+-----+-------+------------+---------+
2 rows in set (0.000 sec)
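
For reference, a config like the one dumped above is normally put in place through ProxySQL’s admin interface with plain INSERT statements, then loaded to runtime and persisted to disk. Here’s a rough sketch (the hostnames and credentials are the placeholders from the dump above; swap in your own):

-- Local region endpoints get the huge weight; remote regions get weight 1
INSERT INTO mysql_servers (hostgroup_id, hostname, port, weight) VALUES
  (1, 'thing.cluster-cdqdzbngckab.eu-west-1.rds.amazonaws.com',       3306, 10000000),
  (1, 'thing.cluster-cvbn52q4dmcd.eu-central-1.rds.amazonaws.com',    3306, 1),
  (2, 'thing.cluster-ro-cdqdzbngckgh.eu-west-1.rds.amazonaws.com',    3306, 10000000),
  (2, 'thing.cluster-ro-cvbn52q4dmef.eu-central-1.rds.amazonaws.com', 3306, 1);

-- Hostgroup 1 = writers, hostgroup 2 = readers; the innodb_read_only check
-- moves any endpoint that is actually read-only into the reader hostgroup
INSERT INTO mysql_replication_hostgroups (writer_hostgroup, reader_hostgroup, check_type, comment)
VALUES (1, 2, 'innodb_read_only', 'Read only check');

-- Plain SELECTs go to the readers; SELECT ... FOR UPDATE (rule 1) and
-- anything unmatched falls through to the writer hostgroup
INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply)
VALUES (1, 1, '^SELECT.*FOR UPDATE', 1, 1),
       (2, 1, '^SELECT', 2, 1);

INSERT INTO mysql_users (username, password, default_hostgroup)
VALUES ('admin', 'passw0rd', 1);

-- Apply to the running config and persist; mysql_replication_hostgroups is
-- part of the servers module, so LOAD MYSQL SERVERS covers it too
LOAD MYSQL SERVERS TO RUNTIME;     SAVE MYSQL SERVERS TO DISK;
LOAD MYSQL QUERY RULES TO RUNTIME; SAVE MYSQL QUERY RULES TO DISK;
LOAD MYSQL USERS TO RUNTIME;       SAVE MYSQL USERS TO DISK;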

ProxySQL does have a feature (configured in mysql_aws_aurora_hostgroups) whereby it can discover all of your Aurora nodes, using the AWS API, and automagically populate them into mysql_servers. This is fine but because the AWS API does not return the full hostname of the endpoint, you must configure this table with a domain suffix (e.g. us-east-1.rds.amazonaws.com). When you are doing multi-region Aurora Global, this clearly doesn’t work as each region has its own suffix.
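
One way to sanity-check that the routing described above is doing what you expect is ProxySQL’s stats schema: after pushing some traffic through, the per-backend query counters show whether writes are landing on the local writer and reads on the local readers:

-- Per-backend counters, one row per server per hostgroup
SELECT hostgroup, srv_host, status, ConnUsed, ConnERR, Queries
FROM stats_mysql_connection_pool
ORDER BY hostgroup, Queries DESC;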

Summary

It’s possible, with a lot of effort, to get Aurora Global working fairly well in a multi-region HA setup. In reality, MySQL isn’t a fantastic database for any sort of highly available setup where you must share data between the regions. If you’re building a new application, maybe consider some better alternatives such as MongoDB, DynamoDB, ElasticSearch, CouchBase or Redis.

Yours sincerely, a worn down Engineer whose Cheerios have been pissed in by AWS one too many times.
