Over the past two weeks we have made several infrastructure changes and upgrades. The changes touched almost every part of the system, so you may have noticed brief latency spikes of up to 200ms, although we tried to keep the upgrades as transparent as possible.
The upgrades allow us to add real-time data feeds more easily, with lower latency and better message delivery guarantees, which means better APIs and more data streams for you.
Below are some details for the techies who are interested.
Ceph Luminous
One of the upgrades brought the Ceph cluster to the latest version, which allows us to label the SSD OSDs and SAS OSDs accordingly. With the OSDs labeled, it's now easy to specify which data should live on which label. This means our DBs and critical data can live on the SSDs only, while logs and non-latency-critical data can live on the SAS drives. Snapshots and archive data can live on traditional SATA drives.
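For the curious, here is a rough sketch of how this kind of placement can be expressed. The `ceph osd crush rule create-replicated` and `ceph osd pool set ... crush_rule` commands are part of the Ceph CLI as of Luminous, but the pool names, the custom `sas` and `sata` class names, the rule names, and the `default` root are assumptions made up for illustration. The script only builds and prints the commands; it does not run them.

```python
# Hypothetical mapping from pool name to the device class its data should live on.
# Pool names and the custom "sas"/"sata" class names are illustrative assumptions.
POOL_CLASSES = {
    "databases": "ssd",
    "logs": "sas",
    "archive": "sata",
}


def placement_commands(pool_classes, root="default", failure_domain="host"):
    """Build the ceph CLI commands that pin each pool to one device class."""
    commands = []
    for pool, device_class in sorted(pool_classes.items()):
        rule = f"{device_class}-only"  # hypothetical rule naming scheme
        # Create a replicated CRUSH rule restricted to OSDs of this class.
        commands.append(
            f"ceph osd crush rule create-replicated {rule} {root} {failure_domain} {device_class}"
        )
        # Point the pool at that rule so its data is placed on matching OSDs.
        commands.append(f"ceph osd pool set {pool} crush_rule {rule}")
    return commands


if __name__ == "__main__":
    for command in placement_commands(POOL_CLASSES):
        print(command)
```

Keeping the mapping in one small table makes it easy to review which data is allowed on which class of drive before anything is applied to the cluster.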
Along with the upgrade, we have added additional monitoring to Prometheus for each storage pool and node, giving us better insight when a single node or OSD has slow read/write speeds.
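The idea behind spotting a single slow OSD can be sketched in a few lines. This is only the comparison logic, assuming per-OSD latency samples have already been collected somewhere; the sample values, the `slow_osds` helper, and the threshold factor of 3 are all hypothetical, and in practice a check like this would more likely live in an alerting rule than in a script.

```python
from statistics import median


def slow_osds(latencies_ms, factor=3.0):
    """Return the OSD ids whose latency exceeds `factor` times the cluster median."""
    if not latencies_ms:
        return []
    baseline = median(latencies_ms.values())
    return sorted(osd for osd, ms in latencies_ms.items() if ms > factor * baseline)


if __name__ == "__main__":
    # Hypothetical write latencies in milliseconds, keyed by OSD id.
    samples = {"osd.0": 4.1, "osd.1": 3.8, "osd.2": 41.0, "osd.3": 4.4}
    print(slow_osds(samples))
```

Comparing against the median rather than the mean keeps one badly behaving OSD from dragging the baseline up and hiding itself.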
We have also converted several of our OSDs to use the new BlueStore storage backend. (Read more about BlueStore here.) This eliminates the need for an SSD to hold the OSD journal, freeing up more disks for usable storage. If these show promising results after a couple of months, we will continue switching more OSDs to this format.
Apache Kafka
We have switched our internal messaging system to a 16-node Kafka cluster, taking advantage of its streaming mechanisms for data manipulation and for writing to C* (Cassandra). This gives us several benefits, including data durability, replication, and the ability to split streams between groups of services. Because a stream is split between multiple instances of an application, our uptime will not suffer due to a single server outage. We can now scale our infrastructure horizontally as needed. Currently we can handle 2 million+ ticks/second with only 1-2ms latency.
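To illustrate why splitting a stream between instances protects uptime, here is a toy model of what happens inside a consumer group. It is a simplified stand-in for the rebalancing Kafka performs, not Kafka client code; the instance names, the partition count, and the round-robin strategy are assumptions chosen for illustration.

```python
def assign_partitions(partitions, instances):
    """Spread partitions round-robin across the live instances of one group."""
    if not instances:
        raise ValueError("at least one live instance is required")
    assignment = {instance: [] for instance in instances}
    for index, partition in enumerate(partitions):
        owner = instances[index % len(instances)]
        assignment[owner].append(partition)
    return assignment


if __name__ == "__main__":
    partitions = list(range(6))  # hypothetical topic with 6 partitions

    # Three instances of the same application share the stream.
    print(assign_partitions(partitions, ["app-1", "app-2", "app-3"]))

    # The server running app-2 goes down: the survivors take over its partitions.
    print(assign_partitions(partitions, ["app-1", "app-3"]))
```

In both cases every partition has exactly one owner, so the stream keeps being processed after an instance is lost. Adding an instance works the same way in reverse, which is what makes horizontal scaling straightforward.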
While we normally try to steer away from JVM-based projects, the performance and proven reliability of Kafka are hard to beat. The streaming landscape is getting a lot of attention right now, so it will be interesting to see what projects emerge.