@JanVladimirMostert: Who doesn't love questions! ;) Let's go in order.
1) We didn't need any NoSQL functionality from MariaDB. One of our engineers did float that idea, but we had to reject it for two reasons: i) using a tool for something it isn't natively built for usually ends in poor performance; ii) MariaDB was our tier-II database, i.e., a persistent store. The data in tier II stays constant over a period of time, and since MariaDB gives excellent read speeds, we leveraged it for that.
2) Since the internal data network was high-throughput and the internal latency was in the single-digit milliseconds, we preferred synchronous replication. This gave us atomic data writes with no loss at all. I'd say use async replication only if you also have sync replication between two masters, or any architecture where you always have at least one copy of the latest data.
3) Yeah, Galera Cluster was the clustering option of choice, for a number of reasons. For starters, it worked out of the box. Secondly, it gave us synchronous data replication. Thirdly, on a multicore machine we had the option to leverage multithreading. And, in particular, it took less than 2 minutes to spin up a node, with no need to note the log offsets and whatnot.
4) Sadly, yes. We did deal with disaster recovery. On a Sunday afternoon, I was checking the application's error logs and that's when I noticed something strange: there were 6k responses with 5xx errors. I dug into the error logs, and lo and behold: MariaDB wasn't writing anything to the database. Luckily for us, it had only been 2-3 minutes, so we quickly fired up a new master, replicated the data from the old ones, and everything worked out just fine.
5) This depends heavily on how much data there is. We had R/W close to 10/300 (per second), so provisioning, copying, and deploying a new node would take us somewhere between 5 and 10 minutes. That was workable because we had a Redis metacache layer in between. So, if the master went down, all of the data would go into Redis' cache, and since it was cached, we were able to recover quickly.
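To make point 5 a bit more concrete, here is a minimal sketch of the "buffer writes while the master is down, replay them afterwards" idea. It is not our production code: `FakeDatabase`, `PrimaryDown`, and `BufferedWriter` are hypothetical names made up for illustration, and an in-memory `deque` stands in for the Redis layer so the example runs without any external service.

```python
from collections import deque


class PrimaryDown(Exception):
    """Raised by the stand-in database when the master is unavailable."""


class FakeDatabase:
    """Hypothetical stand-in for the MariaDB master (not a real client)."""

    def __init__(self):
        self.rows = []
        self.available = True

    def write(self, row):
        if not self.available:
            raise PrimaryDown("master is not accepting writes")
        self.rows.append(row)


class BufferedWriter:
    """Writes to the database, falling back to a buffer when it is down.

    The deque is an assumption standing in for a Redis list.
    """

    def __init__(self, db):
        self.db = db
        self.buffer = deque()

    def write(self, row):
        """Return True if the row reached the database, False if buffered."""
        # Keep ordering: once anything is buffered, queue behind it
        # until drain() has replayed the backlog.
        if self.buffer:
            self.buffer.append(row)
            return False
        try:
            self.db.write(row)
            return True
        except PrimaryDown:
            self.buffer.append(row)
            return False

    def drain(self):
        """Replay buffered rows in order; return how many were written."""
        replayed = 0
        while self.buffer:
            try:
                self.db.write(self.buffer[0])
            except PrimaryDown:
                break  # still down, keep the remaining rows buffered
            self.buffer.popleft()
            replayed += 1
        return replayed


db = FakeDatabase()
writer = BufferedWriter(db)
writer.write("order-1")   # goes straight to the database
db.available = False      # simulate the master going down
writer.write("order-2")   # buffered
db.available = True       # new master is up
writer.drain()            # replays the buffered row
```

In a real setup the buffer would have to survive an application restart and be shared between application servers, which an in-process `deque` cannot do.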
I hope this answers it! If you have any more questions, feel free to ask them! :)