Deutsche Telekom PoC Report
For Deutsche Telekom's email service, the Dovecot Ceph plugin was developed to store emails in Ceph RADOS objects. The focus was on a hybrid model in which emails are stored in RADOS objects while all other metadata (lists, index, cache) is stored in a CephFS file system located in the same cluster.
To test the stability and functionality of the Dovecot Ceph plugin for use in Deutsche Telekom's email service, the plugin was subjected to intensive stability and load tests in a production-like test setup. This involved evaluating failover scenarios and use in "live operation" with around 1,000,000 mailboxes.
The aim was to improve the system and the configuration in such a way that at least the same service quality as with the existing storage system was achieved. The test environment was set up specifically for this test at the Telekom site in Ulm.
In summary, the attempt to store Dovecot mails as objects in a Ceph cluster was successful. The current storage plugin has demonstrated its readiness for production operation. The Ceph cluster passed all tests of failure and restart scenarios; its reliability is on par with the existing NetApp storage system.
The goal of comparable performance to the existing system was also met. Even though the access time for a single mail was slightly higher, customers did not notice any difference in everyday use. Delays were noticeable only during mass operations in batch mode, such as moving accounts.
For these tests, a representative test system was set up that, based on the production figures at the time, would allow a pilot operation with approx. 1 million email accounts. The production system at the time comprised:
- ~39 million accounts
- ~1.3 petabyte net storage
- ~42% usable raw space
- ~6.7 billion emails
- ~1.2 billion index/cache/metadata files
The test system is distributed across three fire compartments. The data is divided into two classes. One class comprises the Dovecot index data, which is protected by replication; the performance servers are used for this purpose.
The second class comprises the email objects, which are protected by erasure coding; the capacity servers are used for this purpose.
The MON and MDS servers handle the "administration" of data and metadata, monitor the storage servers, and are the first point of contact for the clients.
The Ceph test environment consists of a total of 33 bare-metal servers in three fire compartments, plus a virtual system used as an admin node:
- 9 performance nodes, 3 per fire compartment, each with 8 x 1.6 TB SSDs as OSDs (BlueStore)
- 18 capacity nodes, 6 per fire compartment, each with 10 x 4 TB HDDs as OSDs (BlueStore) and 2 x 400 GB SSDs for RocksDB
- 3 MON nodes, one per fire compartment
- 3 MDS nodes, one per fire compartment
- 1 admin node, which is also the Salt master, as a virtual (SuSE Linux) machine
HW Model | Node Type | Hardware Equipment |
---|---|---|
HPE DL380 Gen9 | Monitor | 2 x XEON E5-2660 (14 cores) - 64 GB RAM |
HPE DL380 Gen9 | MDS | 2 x XEON E5-2643 (6 cores) - 256 GB RAM |
HPE DL380 Gen9 | OSD (Cap) | 2 x XEON E5-2640 (10 cores) - 128 GB RAM |
HPE DL380 Gen9 | OSD (Perf) | 2 x XEON E5-2643 (6 cores) - 256 GB RAM |
- OS: SuSE Linux Enterprise Server 15 SP1, SuSE Enterprise Storage 6
- Ceph: Nautilus 14.2.11, 14.2.13, 14.2.16
Each HPE DL380 server is connected with a 4 x 10GbE LACP trunk. This connection provides two separate VLANs, which enable separation between the frontend (VLAN626) and the backend (VLAN1626).
On the Ceph side, three storage pools were created for the email application:
- 1 RADOS pool, erasure-coded (k=4, m=5)
- 1 CephFS data pool, replicated with 6 copies
- 1 CephFS metadata pool, also replicated with 6 copies
CRUSH rules are used to distribute the data evenly across all three fire compartments: the RADOS pool stores three chunks per fire compartment, and the replicated pools store two replicas each per fire compartment.
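For illustration, pools with roughly this layout could be created as follows in stock Ceph (pool names other than rados_mail, the PG counts, and the file system name are assumptions; the PoC's actual CRUSH rules for placing three chunks per fire compartment are not reproduced here):

```sh
# Hypothetical sketch of the pool layout described above.
ceph osd erasure-code-profile set mail_k4m5 k=4 m=5 crush-failure-domain=host
ceph osd pool create rados_mail 2048 2048 erasure mail_k4m5

ceph osd pool create cephfs_data 512 512 replicated
ceph osd pool set cephfs_data size 6
ceph osd pool create cephfs_metadata 128 128 replicated
ceph osd pool set cephfs_metadata size 6
ceph fs new mailfs cephfs_metadata cephfs_data
```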
The evaluation was carried out in three phases. First, the cluster and the infrastructure were optimized by synthetic loads to such an extent that a trial operation under real conditions could be considered. Since such a trial operation was to be carried out with real mailboxes, the reliability and behavior in the event of a disaster were tested in a second test phase. Only then was a pilot operation with real customer mailboxes carried out.
In this phase, the newly built Ceph cluster was to be tested far enough to ensure that the requirements of the later phases could be met. For this purpose, load was generated with various test programs while the cluster was monitored and examined for bottlenecks or instabilities. The overall solution required for the operation of the Dovecot Ceph plugin consists of a CephFS file system, in which the Dovecot index data is stored, and the underlying RADOS, the actual object store, which is used for storing the mails.
The bare CephFS, i.e., without any concrete reference to Dovecot or emails, was intensively measured with the Flexible I/O Tester (fio). Several problems in the hardware, the network, and the configuration of the OS and Ceph were fixed.
The most remarkable point about CephFS is the optimization of the ceph-mds daemon (MDS). The current version of the MDS is single-threaded and CPU-bound for most activities, including responding to client requests. This turned out to be a problem fairly soon, as the load became too high due to the high parallelism of the clients. At first, an attempt was made to work with one MDS per system and multiple systems, distributing the workload among the active MDSs via subtree pinning. Unfortunately, this was not yet sufficient: there were repeated delays in responding to MDS requests. Since measurements showed that a single MDS daemon alone cannot fully utilize the hardware of one system, a patch from SUSE made it possible to run multiple MDS daemons per system. Through this and intensive use of subtree pinning, the desired MDS performance was achieved, or at least a level at which the overall system reached other limits that could not be remedied ad hoc.
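For reference, this is how multiple active MDS ranks and subtree pinning are configured in stock Ceph (the file system name, rank, and path below are illustrative, not the PoC's actual values):

```sh
# Run several active MDS ranks (stock Ceph: one daemon per rank).
ceph fs set cephfs max_mds 3
# Pin a directory subtree (e.g. one mailbox shard) to MDS rank 1, so that
# all metadata operations below it are served by that rank.
setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/mailboxes/shard-01
```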
In the meantime, the tests were expanded to include Dovecot servers with the Dovecot Ceph plugin. In addition to the Ceph cluster, additional systems were set up as Dovecot servers and load generators. The load for the Dovecot servers was generated using imaptest. This tool was extended here with some test cases that play a role in the operation at Deutsche Telekom. During these tests, the Ceph plugin was improved in various places and made more stable through bug fixes. In addition, the Ceph and SLES versions were updated.
After the MDS limitations were overcome, the HDDs used to store the email objects turned out to be the next bottleneck. However, since the overall performance was sufficient, a switch to SSDs was not made.
These and many other findings about Ceph, as well as system overviews and performance measurements, can be found in presentations by Danny al-Gaaf, who supported the project: librmb: billions of emails in Ceph
In order to be able to guarantee secure operation with customer data during the subsequent pilot test, it was necessary to ensure that failures of subcomponents would not cause lasting damage. Numerous failure scenarios were run through.
The test cases deal exclusively with the subject of hardware. Failure and defect scenarios are run through in order to observe the behavior of the system and to determine in what way there are effects on the overall system and the clients. The test cases are set up analogously to the tests performed on the existing system, so that both platforms can be compared in terms of functionality.
A load test is performed in each of the test scenarios described. The aim is to observe whether and how the failure of components has an effect. Failures may require relocation/replication or reconstruction of data.
Impairments can be detected on the clients (IMAP response times deteriorate, connections drop, etc.). However, the PoC does not make any statements about the actual performance of the system. The throughput graphs attached to the tests are only intended to show whether any effects on performance are to be expected during the tests, but without evaluating them.
For load generation, imaptest was used with appropriate profiles and average mail and mailbox sizes. imaptest is launched on 36 clients with an increasing number of mailboxes and generates load on the Ceph cluster via a corresponding number of Dovecot servers. Within imaptest, typical user actions are generated in a synthetic mixture, as in the sketch below.
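A minimal imaptest invocation of this kind might look as follows (host, credentials, client count, duration, and the mbox seed file are placeholders, not the PoC's actual profile):

```sh
# Illustrative imaptest run: 200 concurrent clients performing a synthetic
# mix of IMAP actions for 10 minutes, seeded with a sample mbox file.
imaptest host=imap.example.test port=143 \
    user=testuser%d pass=testpass \
    mbox=dovecot-crlf clients=200 secs=600
```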
Test Description | Test Result |
---|---|
Failure of one MON node | PASSED |
Failure of 2 MON nodes | PASSED |
Failure of one MDS node | PASSED |
Failure of 2 MDS nodes | PASSED |
Failure of a storage cap node | PASSED |
Failure of an OSD | PASSED |
Failure of the frontend network at an OSD node | PASSED |
Failure of the backend network at an OSD node | PASSED |
Failure of the nodes of a fire compartment, plus 2 additional OSDs | PASSED |
Failure of a storage perf node | PASSED |
Salt master failure | PASSED |
Loss of time synchronization on the active MON | PASSED |
Ordered Ceph service restart | PASSED |
Remove and add an OSD node | PASSED |
LACP check: disable 1 to 3 network ports | PASSED |
Performance behavior under high storage load of the Ceph cluster | PASSED |
Cluster behavior under load during network switch updates | PASSED |
Ceph update | PASSED |
Check whether the parameter bluefs_buffered_io true/false has an impact on performance | PASSED |
Admin node revert to old state (from backup/clone) | PASSED |
All tests were successfully completed. The Ceph cluster met expectations in the tested functionalities (failure and restart scenarios), analogous to the existing NetApp storage system.
As described in the previous chapter, possible effects on the clients are attached in the form of I/O throughput graphs for each test (measured on the Ceph side). There were dips or dropouts here depending on the scenario, but their evaluation was not the task of this test series.
After all required tests had been passed and it was proven that the performance and reliability of the configuration were sufficient, a pilot operation with customer mailboxes was started in the summer of 2022. As anyone who deals with email knows, many problems only come to light because of the diversity that the MIME specification allows. This, together with the gradually increasing volume of data, brought the implementation of the Ceph plugin a step forward.
Telekom's email system hosts approximately 39 million accounts. From these, 1 million mailboxes were to be selected at random and gradually moved into the Ceph system.
The move of the accounts was not done abruptly; it was decided early on to increase the number of mailboxes and objects step by step, always with the option to simply move the mailboxes back to the traditional system as a fallback. The goal of this approach was to learn about the system by building a monitoring infrastructure and to identify and resolve potential performance bottlenecks at an early stage.
Several potential bottlenecks were identified during this process. Some related to the network infrastructure (e.g., jumbo frames worked better); some related to the intensive use of CephFS to store hot data such as index, locks, and cache. In particular, the use of CephFS in combination with a very large number of mailboxes and parallel access hit the MDS services quite hard (since they were/are single-threaded), and it was necessary to introduce MDS sharding to distribute the load across multiple MDS services. There were also some improvements and bug fixes on the plugin side, such as faster mail recovery, improved access to metadata, and improved handling of error situations related to the use of network storage instead of a file system.
In addition, this import step revealed a number of problems that had accumulated in the customer mailboxes over the past 20 years. During the import into the Ceph system, the home directories of the users, which were in mdbox format, could not simply be taken over as before: because the storage method changed from mdbox to individual objects in RADOS, each email had to be imported individually.
The bug fixes made in the process can be read in the [ChangeLog](https://github.com/ceph-dovecot/dovecot-ceph-plugin/blob/master/CHANGELOG.md). The most important results are summarized in the next section.
At the end of the pilot phase, there were 1.1 million accounts on the Ceph system.
What | Objects | Size |
---|---|---|
Total | 478,430,012 | 411 TiB |
Mails | 416,428,807 | 371 TiB |
CephFS data | 38,173,393 | 28 TiB |
CephFS metadata | 22,700,672 | 24 GiB |
After initial problems (see below) were overcome, pilot operation went smoothly. The target number of accounts on the system was achieved. There were no restrictions on the customer experience. At the end of the first quarter of 2023, the pilot was terminated and the accounts were moved back to the existing system.
During the pilot phase, several changes and improvements were made to the Ceph plugin and the Ceph configuration. The following topics deserve special mention.
The size of most emails is very small (<190k). For historical reasons, however, Telekom's email stock also holds emails with huge attachments (>1.7 GB). During migration this was a big challenge, as several configuration limits needed to be adjusted. While migrating mailboxes with such messages, the quota_max_mail_size limit was hit several times.
Ceph itself has two configuration settings with direct influence: osd_max_write_size and osd_max_object_size. The Luminous defaults (90 MB/128 MB) were far too small to handle such huge objects. The read and write mechanism was therefore modified to split the incoming email object into chunks and to write at most osd_max_write_size bytes per chunk to the cluster, as sketched below.
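A minimal sketch of such chunked writes via librados might look like this (function name and error handling are illustrative, not the plugin's actual code):

```cpp
#include <rados/librados.hpp>
#include <algorithm>
#include <string>

// Write a large mail object in pieces of at most `max_write` bytes
// (e.g. the cluster's osd_max_write_size), using offset writes so the
// mail still ends up as a single RADOS object.
int write_mail_chunked(librados::IoCtx &ioctx, const std::string &oid,
                       const librados::bufferlist &mail, size_t max_write) {
  uint64_t off = 0;
  while (off < mail.length()) {
    size_t len = std::min<uint64_t>(max_write, mail.length() - off);
    librados::bufferlist chunk;
    chunk.substr_of(mail, off, len);           // zero-copy slice of the mail
    int r = ioctx.write(oid, chunk, len, off); // write chunk at its offset
    if (r < 0)
      return r;                                // abort on first error
    off += len;
  }
  return 0;
}
```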
Since reading huge emails requires the process to load the full object into memory, Dovecot's default_vsz_limit needed to be raised to avoid IMAP/POP3 processes running out of memory.
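In dovecot.conf this is a single setting; the 2G value below is an illustrative assumption, not the PoC's actual limit:

```
# Raise the per-process virtual memory limit so imap/pop3 can hold huge mails.
default_vsz_limit = 2G
```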
If the Dovecot index is lost or corrupted due to an error, a fast repair mechanism is required to restore the mailbox index. Initially we started quite naively by using the librados equivalent of rados ls to list all objects in a pool and restore the index based on metadata stored in the objects. This worked reasonably well given a "small" number of placement groups for the mail storage pool and low concurrency.
However, during the evaluation phase we decided to increase the number of placement groups due to the high number of objects and concurrent client connections. What followed was poor performance using the initial approach. This poor performance comes from the fact that Ceph needs to query the metadata for each object in the pool; with the increase in placement groups, the number of metadata queries also increased.
As caching was obviously not the solution to improve the speed, another approach was required. During the pilot, we improved the repair mechanism several times: we used filters, optimized the number of scans required, and introduced parallel processing. In the end we opted to create an additional dedicated Ceph index object which holds all added oids of the customer mailbox. As with the detection of corrupt indices, the size of this index object is regularly checked by the plugin, and if a threshold is reached, the index object is regenerated from the actual mailbox index files.
With this approach, unnecessary scans and metadata queries could be reduced, and the overall emergency repair time for big mailboxes is in the range of seconds again.
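A minimal sketch of such a dedicated index object via librados, assuming an initialized IoCtx and an illustrative naming scheme (not the plugin's actual code):

```cpp
#include <rados/librados.hpp>
#include <string>

// Record a newly saved mail's oid in the mailbox's dedicated index object.
// append() avoids a read-modify-write cycle on the index object.
int record_mail_oid(librados::IoCtx &ioctx, const std::string &mailbox_guid,
                    const std::string &mail_oid) {
  librados::bufferlist bl;
  bl.append(mail_oid + "\n");
  return ioctx.append(mailbox_guid + ".idx", bl, bl.length());
}

// During emergency repair, read the oid list back instead of scanning and
// querying metadata for every object in the pool.
int read_mail_oids(librados::IoCtx &ioctx, const std::string &mailbox_guid,
                   librados::bufferlist &out) {
  uint64_t size = 0;
  time_t mtime = 0;
  int r = ioctx.stat(mailbox_guid + ".idx", &size, &mtime);
  if (r < 0)
    return r;
  return ioctx.read(mailbox_guid + ".idx", out, size, 0);
}
```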
During the pilot phase, the number of mail objects increased quite quickly due to the automatic migration scripts. It turned out that this made the number of objects per placement group too high. A significant improvement was achieved by doubling the number of PGs of the large rados_mail email pool from 2048 to 4096. At the same time, the number of objects per PG was halved from about 180,000 to about 90,000, and the size per PG was halved from approx. 45 GB to approx. 22 GB.
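In stock Ceph, this kind of change is a single pool setting (pool name taken from above; on Nautilus, the matching pgp_num adjustment is handled gradually by the manager):

```sh
# Double the placement groups of the mail pool.
ceph osd pool set rados_mail pg_num 4096
```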
More placement groups meant better distribution across the cluster and risk mitigation through improved data recovery. As mentioned earlier, increasing the number of placement groups had the disadvantage of slowing down worst-case mailbox repair object scans. However, in terms of read/write performance, this was a win.
Dovecot supports compression of emails via a plugin (e.g. Dovecot zlib plugin). The plugin hooks into the byte stream and applies compression/decompression before writing/reading mail objects. Since most mail data is pure ASCII, the compression level is quite high. In the NFS based mail system, compression was useful because it reduced R/W operations on the NFS and thus increased throughput. However, the fact that compression and decompression are performed on the Dovecot server leads to an increase in CPU time.
In the pilot configuration, we decided not to use compression because we assumed that most objects were already small enough. We estimated the average size of an email to be ~190k, which means that most emails are already below bluestore_min_alloc_size (64k) x replication size (3) = 192k, so compression would only increase fragmentation.
The pilot phase with real data will end in March 2023. The user accounts will be returned to normal operation and the test installation will be dismantled.
We believe that the trial of storing Dovecot mails as objects in a Ceph cluster has been successful. The present storage plugin has demonstrated its readiness for production operation. We plan to finalize this status with the release of version 1.0.0.
The advantages hoped for at the start of the project in the area of scalability and constant object access times have been confirmed.
But better is the enemy of good. We have gathered some points throughout the evaluation that would further improve the plugin. However, their implementation is not planned at the moment.
Depending on the customer contract and business case, the emails are stored for different lengths of time. In many cases, they are never deleted by the user. This leads to the fact that the total number of emails increases quickly.
Currently, the Ceph plugin takes the sdbox approach to storage, storing each email in its own Ceph object. This has several consequences.
First, it results in a very large number of objects being stored in the Ceph cluster. One of Danny's talks at the time reported an inventory of ~6.7 billion emails, which would result in an equal number of Ceph objects. This is a staggering number. However, we are confident that this can be handled with Ceph.
Second, depending on the nature of the business case and customer usage, the average mail size may be well below the bluestore_min_alloc_size x replication size value, resulting in a lot of wasted space: more space is consumed than is actually necessary.
One way to solve both problems of the single object approach is to take the idea of the mdbox format and implement a mrbox format for Ceph. Here, the emails would be stored in archive objects that can grow to a configurable size. Only then would a new mail archive object be started. All maintenance processes would be analogous to the mdbox format. Initial analyses have shown that this is possible, although it involves additional complexity.
Another way to reduce the amount or complexity of data per object, not the number of objects itself, in the cluster is to replace the way object metadata is stored in the plugin. Currently, the plugin uses extended attributes (xattr) to store mail metadata along with the mail object. Technically, this data is stored in omap key-value pairs and replicated along with the objects. A simpler variant would be to store this data as a prefix in the mail object itself.
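A sketch of the two layouts via librados, using illustrative attribute names (not the plugin's actual keys):

```cpp
#include <rados/librados.hpp>
#include <string>

// Current layout: mail metadata stored as xattrs next to the mail object
// (ending up in omap/RocksDB on the OSDs).
int save_metadata_xattr(librados::IoCtx &ioctx, const std::string &oid,
                        const std::string &uid) {
  librados::bufferlist bl;
  bl.append(uid);
  return ioctx.setxattr(oid, "mail_uid", bl);
}

// Simpler variant discussed above: serialize the metadata as a fixed prefix
// of the mail object itself and write both in one operation.
int save_mail_with_prefix(librados::IoCtx &ioctx, const std::string &oid,
                          const librados::bufferlist &header,
                          const librados::bufferlist &mail) {
  librados::bufferlist obj;
  obj.append(header);  // metadata prefix
  obj.append(mail);    // mail body follows directly
  return ioctx.write_full(oid, obj);
}
```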
Dovecot allows the definition of cache fields. Using cache fields with the Ceph plugin increases the size of the mailbox cache files on CephFS, which in turn puts pressure on the MDS.
The plugin itself saves metadata directly along with the object, using the object timestamp as the saved date and the object size as the physical size. Other mail metadata such as the receive date, mail UID, etc. is stored as extended attributes (xattr). If Dovecot does not already have the metadata for a mail in its cache, it queries the RADOS store for the associated xattrs, resulting in a RocksDB query.
Currently, the plugin follows a fairly simple approach to accessing metadata: every request for a value that is not cached results in a new request to fetch the attributes via RocksDB. This can create overhead due to the additional round trips to RocksDB.
As the cache fields cannot be changed after mailbox creation, it is wise to analyze the current cache usage and balance the cache size against additional object storage calls.
During the pilot phase we did not experience problems with the additional xattr lookup calls, but we see potential in decreasing the number of calls required to get mail metadata, or in choosing alternative approaches such as a database lookup to manage mail metadata.
Connecting a client to a Ceph cluster is a comparatively expensive operation. We have taken measures to avoid this if IMAP operations based on the Dovecot index data can be answered or at least delayed until object data is really needed. However, since Dovecot creates a new process for each IMAP/POP connection and this process needs a cluster connection at some point, the connection setup happens quite often.
For the dictionaries, Dovecot uses a system-local, central proxy process that performs this expensive connection setup to SQL servers, Redis, etc. only once. This is a considerable relief. Such a design could also help save connection setups for the Ceph cluster. Whether the dictionary model is sufficient needs to be tested; if not, a dedicated proxy would have to be created. Thanks to the internal layering of the APIs, this would be possible at the level of librmb, for example.
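For illustration, the existing dict proxy pattern looks roughly like this in dovecot.conf (backend and dict name are placeholders):

```
# A central dict server process holds the single expensive backend
# connection; imap/pop3 processes reach it via a local UNIX socket.
dict {
  mailmeta = redis:host=127.0.0.1:port=6379
}
# Consumers then reference the dict through the proxy:
#   proxy::mailmeta
```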
Currently, all Dovecot storage plugins are index-based, meaning they require a separate index that lists each mail in the mailbox with additional information such as UID, sequence number, date received, and so on. The index is then used for common actions such as listing mails or retrieving metadata. Dovecot stores these index files in the file system which is why we need a CephFS in the chosen hybrid model.
Dovecot synchronizes data changes for a mailbox between processes via the file system index, e.g., via file handles or Inotify.
However, given sufficient development resources, it is possible to replace this mechanism with a transactional storage solution such as a database for index data. This would remove the file system requirement from Dovecot and allow the Ceph Dovecot service to run more easily, requiring only RADOS.
In the pilot installation, HDD was used for storing email objects. SSD was used for the index data in CephFS and the omap-RocksDB on the OSD. As a result, the access time for an email has become longer than in the currently operated NFS system. This is not a problem for the individual customer, but does not improve the overall throughput.
For a new system, the use of SSD for the email objects would be desirable. This can also be layered so that only current emails are on SSD and older emails are on HDD via Dovecot alternate storage.
Due to the unexpectedly long uptime of the Ceph cluster, the trial operation ended with OS and Ceph software versions that were out of date. This should be avoided, since time was repeatedly spent investigating problems that had already been solved in newer versions.