RADOS Storage Plugin

Jan Radon edited this page May 14, 2019 · 69 revisions

The Hybrid Storage Model

The mails are saved directly as RADOS objects. All other data are stored as before in the file system. This applies in particular to the data of the lib-index of Dovecot. We assume the file system is designed as shared storage based on CephFS.

Based on the code of the Dovecot storage format Cydir, we developed a hybrid storage as a Dovecot plugin. The hybrid storage uses librados directly to store mails in Ceph objects. Each mail is immutable and is stored in a single RADOS object. Immutable metadata is stored in omap key/value pairs and XATTRs. The index data is completely managed by Dovecot’s lib-index and ends up in CephFS volumes.

Because of the way MUAs access mails, it may be necessary to provide a local cache for mail objects. The cache can be located in the main memory or on local (SSD) storage. However, this optimization is optional and will be implemented only if necessary.

The mail objects and CephFS should be placed in different RADOS pools. The mail objects are immutable and require a lot of storage, so they benefit from erasure coded pools. The index data is written frequently and should be placed on an SSD based CephFS pool.

RADOS Mail Object Format

A mail is immutable regarding its RFC5322 content and some attributes known at the time of saving. The RFC5322 content is written as RADOS object data without any modifications. The immutable attributes that Dovecot uses are stored as RADOS XATTRs. Their values are stored in string representation. Right now, the following attributes are stored with the objects:

I

rbox format version, string, currently "0.1"

G

Mail GUID, UUID stored as hex string

R

Received Date as Unix time, long stored as string

S

Save Date as Unix time, long stored as string

P

POP3 UIDL, string

O

POP3 Order, unsigned int stored as string

M

Mailbox GUID, Mailbox GUID as hex string

Z

Physical Size, mail’s physical size in bytes as uint64_t stored as string

V

Virtual Size, mail’s virtual size in bytes as uint64_t stored as string

U

Mail UID, uint32_t stored as string

B

Original mailbox name

A

From Envelope, string

K

Mail Keywords, the default handling of keywords follows the mdbox storage plugin. It is also possible to store the keyword index as Ceph omap key/value pairs (string/string) directly on the object. Keys have the format 'k_'+<keyword_idx>, and the value is the <keyword_idx>.

F

Mail Flags, string as hex with leading 0x

FLAG                   HEX
MAIL_ANSWERED          0x01
MAIL_FLAGGED           0x02
MAIL_DELETED           0x04
MAIL_SEEN              0x08
MAIL_DRAFT             0x10
MAIL_RECENT            0x20
MAIL_FLAGS_MASK        0x3f
MAIL_FLAGS_NONRECENT   (MAIL_FLAGS_MASK ^ MAIL_RECENT)
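To illustrate how the F attribute can be read, the following Python sketch decodes a flags string such as 0x19 into flag names using the values listed above. The helper name decode_flags is hypothetical and not part of the plugin.

```python
from typing import List

# Flag values as listed above.
MAIL_FLAGS = {
    "MAIL_ANSWERED": 0x01,
    "MAIL_FLAGGED": 0x02,
    "MAIL_DELETED": 0x04,
    "MAIL_SEEN": 0x08,
    "MAIL_DRAFT": 0x10,
    "MAIL_RECENT": 0x20,
}
MAIL_FLAGS_MASK = 0x3F
MAIL_FLAGS_NONRECENT = MAIL_FLAGS_MASK ^ MAIL_FLAGS["MAIL_RECENT"]


def decode_flags(value: str) -> List[str]:
    """Hypothetical helper: turn a hex string like '0x19' into flag names."""
    bits = int(value, 16) & MAIL_FLAGS_MASK
    return [name for name, bit in MAIL_FLAGS.items() if bits & bit]


print(decode_flags("0x19"))  # ['MAIL_ANSWERED', 'MAIL_SEEN', 'MAIL_DRAFT']
```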

All writable attributes like flags or keywords are stored in Dovecot index files only.

The mail objects are addressed using a UUID in hex as OID for each mail. This OID is stored in the Dovecot index data as an obox extension record. The extension record is compatible with the obox format and can be inspected using doveadm dump <path-to-mailbox>:

RECORD: seq=1, uid=568, flags=0x19 (Seen Answered Draft)
 - ext 0 keywords  :            (1000)
 - ext 1 modseq    :       4056 (d80f000000000000)
 - ext 4 cache     :        692 (b4020000)
 - ext 5 obox      :            (8eed840764b05359f12718004d2485ee8ded840764b05359f12718004d2485ee)
                   : guid = 8eed840764b05359f12718004d2485ee
                   : oid  = 8ded840764b05359f12718004d2485ee
 - ext 6 vsize     :       2214 (a6080000)
                   : vsize         = 2765958940838
                   : highest_uid   = 121556224
                   : message_count = 1404118538
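
Judging from the dump above, the raw obox value is the 16 byte mail GUID followed by the 16 byte OID. The following Python sketch splits such a value under that assumption; the helper name split_obox_record is hypothetical.

```python
from typing import Tuple


def split_obox_record(hex_value: str) -> Tuple[str, str]:
    """Hypothetical helper: split a 32 byte obox record into (guid, oid).

    Assumes the layout visible in the dump: GUID first, OID second.
    """
    raw = bytes.fromhex(hex_value)
    if len(raw) != 32:
        raise ValueError("expected 32 bytes, got %d" % len(raw))
    return raw[:16].hex(), raw[16:].hex()


record = "8eed840764b05359f12718004d2485ee8ded840764b05359f12718004d2485ee"
guid, oid = split_obox_record(record)
print("guid =", guid)
print("oid  =", oid)
```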
⚠️
If the index data for a mailbox gets lost, there are currently only limited ways to reconstruct the mailbox from the RADOS objects. We use a RADOS namespace per user to make object iteration for a recovery possible.

Restore Index

To restore an rbox index, Dovecot’s tool doveadm force-resync -u user <MAILBOX_NAME> can be used. The tool first makes a backup of the current index file if it exists. Afterwards, it tries to read the mailbox GUID from the index file to determine the unique identifier of the mailbox. If the GUID cannot be found, the mailbox name given as the doveadm command line argument is used.

The restore process works in the following way.

  1. The tool searches for all RADOS objects matching XATTR M == <Mailbox GUID> in user namespace (in case no GUID is available, it uses B == <Mailbox name>)

  2. Found mail objects are added to the newly created index file.

  3. The tool tries to find a matching mail record in cache or backup index file (seq, uid) to complete the newly created index entry.

  4. The resulting index will only include mails which have a corresponding RADOS mail object.
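
The steps above can be sketched in Python with an in-memory dictionary standing in for the RADOS objects of one user namespace. This is a simplified illustration, not the plugin's implementation: the function names are hypothetical, and matching backup records by OID is an assumption made for brevity.

```python
from typing import Dict, List, Optional

# oid -> XATTRs of the object; stands in for one user's RADOS namespace.
Objects = Dict[str, Dict[str, str]]


def find_mailbox_objects(objects: Objects, mailbox_guid: Optional[str],
                         mailbox_name: str) -> List[str]:
    """Step 1: match XATTR M (mailbox GUID), or B (mailbox name) as fallback."""
    if mailbox_guid:
        key, wanted = "M", mailbox_guid
    else:
        key, wanted = "B", mailbox_name
    return sorted(oid for oid, xattrs in objects.items()
                  if xattrs.get(key) == wanted)


def rebuild_index(objects: Objects, mailbox_guid: Optional[str],
                  mailbox_name: str,
                  backup_records: Optional[Dict[str, dict]] = None) -> List[dict]:
    """Steps 2 to 4: build new entries, completed from backup data if present."""
    backup_records = backup_records or {}
    index = []
    for oid in find_mailbox_objects(objects, mailbox_guid, mailbox_name):
        entry = {"oid": oid}
        # Assumption: backup records are looked up by OID here.
        entry.update(backup_records.get(oid, {}))
        index.append(entry)
    # Backup records without a RADOS object never reach the new index.
    return index
```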

File System Layout

The file system layout of rbox is quite similar to dbox. Without any special configuration the layout is as follows:

<mail location root>/mailboxes/INBOX/rbox-Mails/dovecot.index*

Index files for INBOX

<mail location root>/mailboxes/foo/rbox-Mails/dovecot.index*

Index files for mailbox "foo"

<mail location root>/mailboxes/foo/bar/rbox-Mails/dovecot.index*

Index files for mailbox "foo/bar"

<mail location root>/dovecot.mailbox.log*

Mailbox changelog

<mail location root>/subscriptions

subscribed mailboxes list

<mail location root>/dovecot-uidvalidity*

IMAP UID validity

Note that with rbox the Index files actually contain significant data which is held nowhere else. Index files for rbox contain message flags and keywords. This data cannot be automatically recreated, so it is important that Index files are treated with the same care as message data files.

Index files can be stored in a different location by using the INDEX parameter in the mail location specification. If the INDEX parameter is specified, it will make Dovecot look for the Index files as follows:

<INDEX location>/mailboxes/INBOX/rbox-Mails/dovecot.index*

Index files for INBOX

<INDEX location>/mailboxes/foo/rbox-Mails/dovecot.index*

Index files for mailbox "foo"

<INDEX location>/mailboxes/foo/bar/rbox-Mails/dovecot.index*

Index files for mailbox "foo/bar"
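
As a small illustration of the layout above, the following Python sketch builds the directory that holds the dovecot.index* files of a mailbox. It assumes the default fs layout and the default directory name rbox-Mails; the function name and the example path /var/mail/t1 are hypothetical.

```python
import posixpath


def index_dir(location: str, mailbox: str, dirname: str = "rbox-Mails") -> str:
    """Directory containing dovecot.index* for a mailbox.

    `location` is either the mail location root or the INDEX location.
    """
    return posixpath.join(location, "mailboxes", mailbox, dirname)


print(index_dir("/var/mail/t1", "INBOX"))
print(index_dir("/var/mail/t1", "foo/bar"))
```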

The mail messages themselves are stored in RADOS objects.

Configuration

Dovecot - 10-mail.conf

mail_plugins

To load the plugin, add storage_rbox to the list of mail plugins to be loaded. There are several ways to do this. For example, add the plugin to mail_plugins in 10-mail.conf.

mail_plugins = $mail_plugins storage_rbox

To enable or disable the plugin per user, mail_plugins can be returned as a userdb extra field. See UserDatabase/ExtraFields for examples.

mail_location

Add the plugin to 10-mail.conf as mail_location. The name of the mailbox format is rbox. See Mail location for details. Because the index management of Dovecot is used, the description of path and a lot of the optional parameters are valid for rbox, too.

The optional parameters of the mailbox location specification that differ for rbox are:

LAYOUT

specifies the directory layout to use:

  • fs: The default used by rbox

  • index: Uses mailbox GUIDs as the directory names. The mapping between mailbox names and GUIDs exists in dovecot.list.index* files.

DIRNAME

specifies the directory name used in mailbox directories to store the mailbox files. With rbox the default is "rbox-Mails/". Note that this directory is used only for the mail directory, not for index/control directories (but see below).

ALT

specifies the alternate storage pool for mail objects.

All Dovecot variables for mail_location can be applied. Add for example to 10-mail.conf:

mail_location = rbox:/home/user/dovecot/var/mail/rados/%u

Dovecot - 90-plugin.conf

rbox_cluster_name

When running Ceph with default settings, the default cluster name is ceph, which means the Ceph configuration file with the file name ceph.conf is saved in the /etc/ceph/ default directory. When several clusters are run this value might have to be changed.

rbox_cluster_name = ceph

rados_user_name

If no RADOS user name is specified, the plugin will use client.admin as default user name. The username can be changed with the rados_user_name option.

rados_user_name = client.admin

rbox_pool_name

The following setting defines the RADOS pool used to store the e-mails. The default is mail_storage. If the pool is missing, it will be created.

rbox_pool_name = mail_storage

rbox_config_obj_name

The configuration object to be used by the dovecot-ceph-plugin can be specified. The plugin will try to read the plugin Ceph configuration from a Ceph object with this oid. If it fails, it will create a default configuration.

rbox_config_obj_name=rbox_cfg

rbox_bugfix_cephfs_21652

If the CephFS file system is used to store index and cache files, a ghost mailbox INBOX.INBOX might appear. This is due to a heuristic that Dovecot uses to detect whether a mailbox has children. The POSIX standard defines the existence of the hard links ‘.’ and ‘..’ in an empty directory as optional. CephFS does not create the ‘..’ hard link, so the Dovecot heuristic wrongly assumes that INBOX has children. There is a Ceph fix available (ceph/ceph#21652). Setting the following option to true prevents Dovecot from interpreting directories with a hard link count < 2 as directories with subdirectories.

rbox_bugfix_cephfs_21652=true|false
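
The effect of this setting can be sketched as follows. This is a simplified Python model, not Dovecot's actual code: it assumes the traditional Unix convention that a directory has a link count of 2 plus one per subdirectory, and the function name is hypothetical.

```python
def assumes_children(st_nlink: int, bugfix_cephfs_21652: bool) -> bool:
    """Simplified model of a link count based 'has subdirectories' check."""
    if st_nlink > 2:
        # Each subdirectory adds one link through its '..' entry.
        return True
    if st_nlink == 2:
        # Only '.' and the entry in the parent directory.
        return False
    # Link count below 2: the file system does not maintain the count.
    # Without the bugfix this is treated as "may have children".
    return not bugfix_cephfs_21652


print(assumes_children(1, bugfix_cephfs_21652=False))  # True
print(assumes_children(1, bugfix_cephfs_21652=True))   # False
```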

rados_save_log

The plugin can be instructed to log all objects written to the object store (save, copy, move) in a separate log file. Each log entry contains the operation (save, copy, move), pool name, namespace and oid. This information is sufficient to identify the mail objects, for example to manually clean up after a broken backup restore. It is also possible to pass this file to the rmb tool, which takes care of the cleanup.

rados_save_log=full path to the log file

rados_check_empty_mailboxes

If indirect namespaces are used, the regular doveadm mailbox delete command does not clean up the user namespace mapping object. To enable automatic deletion, rados_check_empty_mailboxes can be set to true. In this case the indirect namespace mapping object will be deleted after the last mailbox of a user is deleted. The setting operates by summing up all e-mails in all user mailboxes. When a mailbox is deleted and the total number of e-mails in all user mailboxes is 0 after the deletion, the mapping object will be deleted. Alternatively, the doveadm rmb mailbox delete command can be used.

rados_check_empty_mailboxes=true|false
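
The rule described above can be expressed as a short Python sketch. The function name and the dictionary of per-mailbox mail counts are hypothetical; the plugin obtains these counts from Dovecot itself.

```python
from typing import Dict


def should_delete_mapping(mail_counts: Dict[str, int], deleted_mailbox: str) -> bool:
    """True if no e-mails remain in any mailbox of the user after the deletion."""
    remaining = {name: count for name, count in mail_counts.items()
                 if name != deleted_mailbox}
    return sum(remaining.values()) == 0


print(should_delete_mapping({"INBOX": 3, "foo": 0}, "INBOX"))  # True
print(should_delete_mapping({"INBOX": 3, "foo": 1}, "INBOX"))  # False
```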

rbox_ceph_aio_wait_for_safe_and_cb

There are two possible methods available to wait for rados aio write operations to finish.

  • aio_wait_for_complete : Block until an operation completes, this means it is in memory on all replicas.

  • aio_wait_for_safe_and_cb: Block until an operation is safe, this means it is on stable storage on all replicas.

The following setting sets the method. The default is aio_wait_for_complete.

rbox_ceph_aio_wait_for_safe_and_cb=false|true

rbox_ceph_write_chunks

There are two implemented ways to write e-mail data to Ceph. The first method loads all mail data into a librados::bufferlist and splits the buffer based on the maximum write_operation_object size defined by Ceph. In this case the write operation is triggered in the rbox_save_finish lifecycle method. This is the default behavior.

The second method uses the chunks provided by Dovecot and creates a separate write operation for each chunk in the lifecycle method rbox_write_continue.

To enable the second method, set the following setting to true. The default is false.

rbox_ceph_write_chunks=false|true
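
The splitting step of the first (default) method can be illustrated with a short Python sketch. It only shows the idea of cutting one buffer into pieces no larger than a maximum write size; the function name and the parameter max_write_size are hypothetical stand-ins for the limit defined by Ceph.

```python
from typing import List, Tuple


def split_for_write(data: bytes, max_write_size: int) -> List[Tuple[int, bytes]]:
    """Return (offset, chunk) pairs, each chunk at most max_write_size bytes."""
    if max_write_size <= 0:
        raise ValueError("max_write_size must be positive")
    return [(offset, data[offset:offset + max_write_size])
            for offset in range(0, len(data), max_write_size)]


for offset, chunk in split_for_write(b"abcdefghij", 4):
    print(offset, chunk)
```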

rbox_ceph_client_x

It is possible to pass in custom ceph client configuration using the prefix rbox_ceph_client_<client_cfg_value> to modify the ceph connection.

The following setting sets the config. The default is client config in ceph.conf (or ceph client config reference)

e.g. rbox_ceph_client_client_cache_mid=0.75
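
The prefix convention can be sketched in Python as follows: every plugin setting that starts with rbox_ceph_client_ is passed on to Ceph under the remaining name. The function name is hypothetical, and the sketch only shows the name handling, not the call into librados.

```python
from typing import Dict

PREFIX = "rbox_ceph_client_"


def ceph_client_options(plugin_settings: Dict[str, str]) -> Dict[str, str]:
    """Extract Ceph client options from Dovecot plugin settings."""
    return {key[len(PREFIX):]: value
            for key, value in plugin_settings.items()
            if key.startswith(PREFIX) and len(key) > len(PREFIX)}


settings = {
    "rbox_pool_name": "mail_storage",
    "rbox_ceph_client_client_cache_mid": "0.75",
}
print(ceph_client_options(settings))  # {'client_cache_mid': '0.75'}
```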

Example

All this can be configured in the plugin section in 90-plugin.conf:

plugin {
  rbox_cluster_name = ceph
  rados_user_name = client.admin
  rbox_pool_name = mail_storage
  rbox_cfg_object_name = rbox_cfg
  rbox_bugfix_cephfs_21652 = false
  rados_save_log = /var/mail/rbox/save_file.log
  rados_check_empty_mailboxes= false
  rbox_ceph_aio_wait_for_safe_and_cb=false
  rbox_ceph_write_chunks=false
}

Ceph

The plugin uses the default way for Ceph configuration described in Step 2: Configuring a Cluster Handle:

  1. rados_conf_parse_env(): Evaluate the CEPH_ARGS environment variable.

  2. rados_conf_read_file(): Search the default locations, and the first found is used. The locations are:

    • $CEPH_CONF (environment variable)

    • /etc/ceph/ceph.conf

    • ~/.ceph/config

    • ceph.conf (in the current working directory)
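
The search order above can be mimicked in a few lines of Python. This sketch only reproduces the documented order and is not how librados is implemented; the function name and the injectable exists parameter are hypothetical.

```python
import os
from typing import Callable, Mapping, Optional


def find_ceph_conf(environ: Optional[Mapping[str, str]] = None,
                   exists: Callable[[str], bool] = os.path.exists) -> Optional[str]:
    """Return the first configuration file found in the documented order."""
    environ = os.environ if environ is None else environ
    candidates = []
    if environ.get("CEPH_CONF"):
        candidates.append(environ["CEPH_CONF"])
    candidates += [
        "/etc/ceph/ceph.conf",
        os.path.expanduser("~/.ceph/config"),
        "ceph.conf",  # relative to the current working directory
    ]
    for path in candidates:
        if exists(path):
            return path
    return None


print(find_ceph_conf())
```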

librmb

The configuration for librmb is stored as a JSON document in a well-known RADOS object named rmb_cfg. The name can be overridden by the Dovecot plugin setting rbox_config_obj_name, and the object can be created and modified with the rmb CLI.

The following shows the default librmb configuration. This configuration will be created if the configuration object in a RADOS pool is missing.

 {
   "user_mapping": "false",
   "user_ns": "users",
   "user_suffix": "_u",
   "rbox_public_namespace": "public",
   "rbox_mail_attributes": "MGPORZVBUI",
   "rbox_updateable_attributes": "B",
   "rbox_update_attributes": "false",
   "rbox_metadata_storage": "default",
   "rbox_storage_metadata_attr": "ima"
 }
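
Note that all values in this document are JSON strings, including the boolean ones. The following Python sketch shows one way to read such a document; falling back to the defaults for missing keys is an assumption of this sketch, and the function names are hypothetical.

```python
import json
from typing import Dict, Optional

DEFAULT_CFG = {
    "user_mapping": "false",
    "user_ns": "users",
    "user_suffix": "_u",
    "rbox_public_namespace": "public",
    "rbox_mail_attributes": "MGPORZVBUI",
    "rbox_updateable_attributes": "B",
    "rbox_update_attributes": "false",
    "rbox_metadata_storage": "default",
    "rbox_storage_metadata_attr": "ima",
}


def load_rmb_cfg(raw: Optional[str]) -> Dict[str, str]:
    """Parse the JSON document; use the defaults if the object is missing."""
    cfg = dict(DEFAULT_CFG)
    if raw:
        cfg.update(json.loads(raw))
    return cfg


def is_enabled(cfg: Dict[str, str], key: str) -> bool:
    """Boolean settings are stored as the strings 'true' and 'false'."""
    return cfg[key].lower() == "true"


cfg = load_rmb_cfg('{"user_mapping": "true"}')
print(is_enabled(cfg, "user_mapping"), cfg["user_ns"])  # True users
```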

Namespace Usage

By default, user e-mails are saved using the username as the RADOS namespace. For public e-mails which do not have an owner, the namespace public is used. This leads to a problem when usernames can be renamed: all mails have to be copied from the namespace oldusername to the namespace newusername. To overcome this, it is possible to use a username indirection.

For each user a GUID will be generated and used as RADOS namespace. The mapping from the username to the internally used GUID will be done using a mapping object named username and containing the GUID as object data. If a username changes, only the object id of the mapping object has to be changed.

🔥
This is a setting which can only be changed before you start storing e-mails. This means, if you already have e-mails stored with user_mapping=false and change this value to true, the old e-mails are no longer accessible and vice versa.

user_mapping

Enable username mapping. The default is false. See rados_check_empty_mailboxes option.

user_ns

You can specify the namespace where the username mapping objects will be saved. The default is users.

user_suffix

If user_mapping is active, a new Ceph object for each user will be created, holding the namespace information. The name of this object will be <username><user_suffix>. This avoids collisions between system namespaces and username based namespaces. The default is _u.

rbox_public_namespace

For mails which are stored in a public folder, the configured public namespace will be used. No suffix will be added to this namespace to avoid collisions with username based namespaces. The default is public.
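
The namespace lookup for a user can be sketched in Python as follows. The sketch follows the description above and uses a dictionary in place of the mapping objects stored in the user_ns namespace; the function name and the example data are hypothetical.

```python
from typing import Dict


def resolve_namespace(username: str, cfg: Dict[str, str],
                      mapping_objects: Dict[str, str]) -> str:
    """Return the RADOS namespace that holds the mails of `username`.

    `mapping_objects` maps object name -> object data (the namespace GUID)
    and stands in for the objects in the cfg['user_ns'] namespace.
    """
    if cfg["user_mapping"] != "true":
        return username
    return mapping_objects[username + cfg["user_suffix"]]


cfg = {"user_mapping": "true", "user_ns": "users", "user_suffix": "_u"}
mappings = {"t1_u": "4440BDE7DC844DBC88BBCD3185A038B5"}
print(resolve_namespace("t1", cfg, mappings))

# Renaming t1 to t2 only moves the mapping object; the mails stay in place.
mappings["t2_u"] = mappings.pop("t1_u")
print(resolve_namespace("t2", cfg, mappings))
```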

Metadata Usage

Besides the mail itself, a mail object stores some immutable or mutable metadata as described above. The configuration allows you to define which attributes are saved and when.

rbox_update_attributes

It may make sense to update attributes after the initial save to keep track of changes. The default is false.

rbox_mail_attributes

You can define which metadata attributes should be saved as Ceph XATTR. See RADOS Mail Object Format for metadata details. The default is MGPORZVBUI.

rbox_updateable_attributes

Defines the metadata attributes which can be updated. An example is the original mailbox attribute 'B', which keeps track of the mailbox name of a mail when it is copied. This information may be helpful when rebuilding a lost index. The default is B.

The possible values for rbox_mail_attributes and rbox_updateable_attributes can be found under RADOS Mail Object Format. For rbox_updateable_attributes currently only RBOX_METADATA_ORIG_MAILBOX (B) is supported.

rbox_metadata_storage

To reduce the amount of metadata xattributes, it makes sense to save all immutable attributes to a single xattribute. The default value is default which means every metadata attribute is saved as a single xattribute. You can set the value to ima if you want to save all attributes not defined in rbox_updateable_attributes in a single json formatted xattribute.

rbox_storage_metadata_attr

Defines the xattribute name used by rbox_metadata_storage=ima. The default is ima.
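
The difference between the two storage modes can be sketched in Python. The exact JSON layout inside the combined xattribute is an assumption of this sketch, as is the function name; only the split between updateable and immutable attributes follows the description above.

```python
import json
from typing import Dict


def build_xattrs(attrs: Dict[str, str], updateable: str = "B",
                 storage: str = "default", ima_name: str = "ima") -> Dict[str, str]:
    """Return the xattributes to write for one mail object."""
    if storage != "ima":
        # default: every metadata attribute becomes its own xattribute
        return dict(attrs)
    immutable = {k: v for k, v in attrs.items() if k not in updateable}
    xattrs = {k: v for k, v in attrs.items() if k in updateable}
    # Assumed layout: one JSON object keyed by the attribute letters.
    xattrs[ima_name] = json.dumps(immutable, sort_keys=True)
    return xattrs


attrs = {"M": "8eed840764b05359f12718004d2485ee", "U": "568", "B": "INBOX"}
print(build_xattrs(attrs))
print(build_xattrs(attrs, storage="ima"))
```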

Ceph Object and Namespace Layout

The RADOS objects that hold the mails are stored in a per user namespace. The tree based on the default configuration would look like this.

root
├── public                                (1)
│   └── 78DA8E6FA63746FAB89271CA2AF72BA4
├── rmb_cfg                               (2)
└── <username>_u                          (3)
    └── 4463b919b3b9275a5f3100009c60b9f7. (4)

  1. This namespace contains the public mail objects.

  2. This object holds the JSON configuration.

  3. This namespace contains the objects for username.

  4. A mail object.

If indirect namespaces are configured, to be independent of username changes, the namespace rbox_ns_cfg holds the mapping objects that map from username to the generated namespace GUID.

root
├── 4440BDE7DC844DBC88BBCD3185A038B5      (1)
│   └── 4463b919b3b9275a5f3100009c60b9f7. (2)
├── 9ACDA05123BC4C5D96D3E322AF241CFE      (3)
│   └── 78DA8E6FA63746FAB89271CA2AF72BA4. (4)
├── rmb_cfg                               (5)
└── user                                  (6)
    ├── public                            (7)
    └── <username>_u                      (8)

  1. This namespace contains the mail objects of username.

  2. A mail object.

  3. This namespace contains the mail objects of public.

  4. A mail object.

  5. This object holds the JSON configuration.

  6. This namespace contains the indirection objects.

  7. This object contains the GUID of the actual namespace for public. In this example 9ACDA05123BC4C5D96D3E322AF241CFE.

  8. This object contains the GUID of the actual namespace for username. In this example 4440BDE7DC844DBC88BBCD3185A038B5.

Shared and Public Folders

To configure shared folder access, the ACL plugin needs to be activated in the Dovecot configuration as usual. In the namespace configuration you need to use rbox as the mailbox format. The configuration follows the mdbox configuration, so rbox:%%h as location is sufficient.

dovecot-ceph-plugin uses the username as RADOS namespaces. In case of the public folder the namespace public is set.

Testing

We use ImapTest for testing the plugin. The Ceph cluster we used for the first tests runs locally and has been created using vstart.sh (see ceph/README.md). We test the protocols IMAP and POP3. Before you can start the tests you have to adapt the environment.

For librmb we use the googletest C++ framework. The googletest library is added as a git submodule; you can clone it with: git submodule update --init --recursive

The configuration assumes a Ceph cluster running locally without cephx, which has for example been created using vstart.sh as described in Developer Guide (quick) or ceph/README.md.

../src/vstart.sh -X -n -l

Common

Create 100 users: Name = t1 .. t100, Password = t

etc/passwd:

t1:{PLAIN}t::::::
t2:{PLAIN}t::::::
t3:{PLAIN}t::::::
...
t100:{PLAIN}t::::::

Script to create the users:

#!/bin/bash
# Append 100 test users (t1 .. t100, password "t") to ./passwd.
for i in {1..100}; do
    echo "t$i:{PLAIN}t::::::" >> passwd
done

IMAP

It is not necessary to add or modify test profiles. ImapTest can be started with the following command.

imaptest user=t%d pass=t port=10143

POP3

If POP3 is used for the ImapTest, it is necessary to add or modify some configuration entries.

System

ulimit -n 3072
ulimit -s unlimited

Dovecot / LMTP

Enable POP3 and LMTP via etc/dovecot/dovecot.conf:

protocols = imap pop3 lmtp

Add or change the following entry of etc/dovecot/conf.d/10-master.conf:

default_process_limit = 500
default_client_limit = 3000

service lmtp {
  unix_listener lmtp {
    #mode = 0666
  }

  inet_listener lmtp {
    address = 127.0.0.1 ::1
    port = 10024
  }
}

ImapTest

To run ImapTest with POP3 you have to use a profile file which sets POP3 as the client protocol.

POP3 profile example:

lmtp_port = 10024
#lmtp_max_parallel_count = 500
total_user_count = 100
rampup_time = 30s

user aggressive {
  #username_prefix = test
  username_format = t%n
  count = 80%

  mail_inbox_delivery_interval = 10s
  mail_spam_delivery_interval = 5s
  mail_action_delay = 2s
  mail_action_repeat_delay = 1s
  mail_session_length = 3 min

  mail_send_interval = 10s
  mail_write_duration = 5s

  mail_inbox_reply_percentage = 50
  mail_inbox_delete_percentage = 5
  mail_inbox_move_percentage = 5
  mail_inbox_move_filter_percentage = 10
}

user normal {
  username_format = t%n
  count = 20%

  mail_inbox_delivery_interval = 5 min
  mail_spam_delivery_interval = 3 min
  mail_action_delay = 3 min
  mail_action_repeat_delay = 10s
  mail_session_length = 20 min

  mail_send_interval = 10 min
  mail_write_duration = 2 min

  mail_inbox_reply_percentage = 50
  mail_inbox_delete_percentage = 5
  mail_inbox_move_percentage = 5
  mail_inbox_move_filter_percentage = 10
}

client pop3 {
  count = 90%
  connection_max_count = 1
  protocol = pop3
  pop3_keep_mails = no
  login_interval = 1min
}

client pop3 {
  count = 10%
  connection_max_count = 1
  protocol = pop3
  pop3_keep_mails = yes
  login_interval = 1min
}