FIO -- The Flexible I/O Tester -- is a tool used for benchmarking and stress-testing I/O subsystems. We generally refer to this type of workload as a "microbenchmark" because it is used in a targeted way to determine the bottlenecks and limits of a system.
FIO has a native mechanism to run multiple servers concurrently against a data store. Our implementation of the workload in Ripsaw takes advantage of this feature, spawning N FIO servers based on the options provided by the user in the CR file. A single FIO client pod is then launched as the control point for executing the workload on the server pods in parallel.
The Custom Resource (CR) file for fio includes a significant number of options to offer the user flexibility.
apiVersion: ripsaw.cloudbulldozer.io/v1alpha1
kind: Benchmark
metadata:
  name: fio-benchmark
  namespace: my-ripsaw
spec:
  elasticsearch:
    url: "http://my.es.server:9200"
  clustername: myk8scluster
  test_user: my_test_user_name
  workload:
    name: "fio_distributed"
    args:
      prefill: true
      # for compressed volume uncomment the next 2 lines and make the prefill_bs same as bs
      # prefill_bs: 8KiB
      # cmp_ratio: 70
      samples: 3
      servers: 3
      # Choose to run servers in 'pod' or 'vm'
      # 'vm' needs kubevirt to be available
      # Default: pod
      kind: pod
      runtime_class: class_name
      jobs:
        - write
        - read
      bs:
        - 4KiB
        - 64KiB
      numjobs:
        - 1
        - 8
      iodepth: 4
      read_runtime: 60
      read_ramp_time: 5
      filesize: 2GiB
      log_sample_rate: 1000
      storageclass: rook-ceph-block
      storagesize: 5Gi
      rook_ceph_drop_caches: True
      rook_ceph_drop_cache_pod_ip: 192.168.111.20
#######################################
#  EXPERT AREA - MODIFY WITH CAUTION  #
#######################################
#  global_overrides:
#    - key=value
  job_params:
    - jobname_match: w
      params:
        - fsync_on_close=1
        - create_on_open=1
    - jobname_match: read
      params:
        - time_based=1
        - runtime={{ workload_args.read_runtime }}
        - ramp_time={{ workload_args.read_ramp_time }}
    - jobname_match: rw
      params:
        - rwmixread=50
        - time_based=1
        - runtime={{ workload_args.read_runtime }}
        - ramp_time={{ workload_args.read_ramp_time }}
    - jobname_match: readwrite
      params:
        - rwmixread=50
        - time_based=1
        - runtime={{ workload_args.read_runtime }}
        - ramp_time={{ workload_args.read_ramp_time }}
#    - jobname_match: <search_string>
#      params:
#        - key=value
You can add a node selector and/or taints/tolerations to the resulting Kubernetes resources like so:
spec:
  workload:
    name: fio_distributed
    args:
      nodeselector:
        foo: bar
      tolerations:
      - key: "taint-to-tolerate"
        operator: "Exists"
        effect: "NoSchedule"
Note: The implementation of nodeselector has led to the deprecation of pin_server. Use nodeselector only. You can get the same behavior as pin_server with a node selector by doing this:
nodeselector:
  kubernetes.io/hostname: 'SERVER_TO_PIN'
The options provided in the CR file are designed to allow for a nested set of job execution loops. The user can set up a series of jobs, usually of increasing intensity, and execute the job group with a single request to the Ripsaw operator. Jobs that may take hours or even days to complete will then continue through the nested loops unattended.
The workload loops are nested as follows, based on the CR options:
+-------->numjobs---------+
|                         |
| +------>bs|bsrange----+ |
| |                     | |
| | +---->job---------+ | |
| | |                 | | |
| | | +-->samples---+ | | |
| | | |             | | | |
| | | |             | | | |
| | | +-------------+ | | |
| | +-----------------+ | |
| +---------------------+ |
+-------------------------+
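The nesting in the diagram can be sketched in a few lines of Python. This is only an illustration of the loop order described here, not the operator's implementation: the function name run_order is hypothetical, and the values are taken from the example CR.

```python
from itertools import product

def run_order(numjobs, bs, jobs, samples):
    """Yield (numjobs, bs, job, sample) tuples in execution order."""
    # product() varies its rightmost argument fastest, so samples is the
    # innermost loop and numjobs the outermost, as in the diagram.
    yield from product(numjobs, bs, jobs, range(1, samples + 1))

runs = list(run_order(numjobs=[1, 8], bs=["4KiB", "64KiB"],
                      jobs=["write", "read"], samples=3))
print(len(runs))  # 2 * 2 * 2 * 3 = 24 runs in total
print(runs[0])    # first run: numjobs=1, bs=4KiB, job=write, sample 1
print(runs[3])    # after three write samples, the job loop moves on to read
```

With the example CR this yields 24 individual fio runs from a single request.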
A note about units:
For consistency in the CR file, we apply the fio option kb_base=1000 in the jobfile configmap. The effect of this is that unit names are treated per the IEC and SI standards, and thus units like KB, MB, and GB are considered base-10 (1KB = 1000B), and units like KiB, MiB, and GiB are base-2 (1KiB = 1024B).
However, note that fio (as of the versions we have tested) does not react as might be expected to "shorthand" IEC units like Ki or Mi; these will be treated as base-10 instead of base-2.
Unfortunately, the K8S resource model explicitly specifies the use of "shorthand" IEC units like Ki and Mi, and the use of the complete form of KiB or MiB will result in errors.
Therefore, when providing units in the CR values, be aware that fio options should use the MiB format while the storagesize option used for the K8S persistent volume claim should use the Mi format.
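A small pre-flight check can catch the two unit forms mixed up before the CR is submitted. The sketch below is a hypothetical helper, not part of Ripsaw; it assumes the args are available as a Python dict and only inspects bs, filesize, and storagesize.

```python
import re

# Shorthand IEC suffix (Ki, Mi, Gi, Ti): expected by K8S, advised against for fio.
SHORTHAND = re.compile(r"^\d+[KMGT]i$")
# Complete IEC suffix (KiB, MiB, GiB, TiB): expected by fio, rejected by K8S.
COMPLETE = re.compile(r"^\d+[KMGT]iB$")

def check_units(args):
    """Return warnings for unit forms that the note on units advises against."""
    warnings = []
    fio_values = [args.get("filesize")] + list(args.get("bs", []))
    for value in fio_values:
        if value and SHORTHAND.match(str(value)):
            warnings.append(f"fio value {value}: use the complete form, e.g. {value}B")
    size = args.get("storagesize")
    if size and COMPLETE.match(str(size)):
        warnings.append(f"storagesize {size}: use the shorthand form, e.g. {str(size)[:-1]}")
    return warnings

good = {"bs": ["4KiB", "64KiB"], "filesize": "2GiB", "storagesize": "5Gi"}
bad = {"bs": ["4Ki"], "filesize": "2GiB", "storagesize": "5GiB"}
print(check_units(good))  # no warnings for the example CR values
print(check_units(bad))   # one warning for bs, one for storagesize
```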
The metadata values will usually be left unchanged.
- metadata.name: The name the Ripsaw operator will use for the benchmark resource
- metadata.namespace: The namespace in which the benchmark will run
- spec.elasticsearch: (optional) Values are used to enable indexing of fio data; further details are below
- spec.clustername: (optional) An arbitrary name for your system under test (SUT) that can aid indexing
- spec.test_user: (optional) An arbitrary name for the user performing the tests that can aid indexing
- spec.workload.name: DO NOT CHANGE. This value is used by the Ripsaw operator to trigger the correct Ansible role
The options under spec.workload.args are the meat of the workload, where most of the adjustments to your needs will be made.
- samples: Number of times to run the exact same workload. This is the innermost loop, as described above
- servers: Number of fio servers that will run the specified workload concurrently
- nodeselector: K8S node selector (per kubectl get nodes) on which to run server pods
- jobs: (list) fio job types to run, per fio(1) valid values for the readwrite option
  Note: Under most circumstances, a write job should be provided as the first list item for jobs. This will ensure that subsequent jobs in the list can use the files created by the write job instead of needing to instantiate the files themselves prior to beginning the benchmark workload.
- runtime_class: If this is set, the benchmark-operator will apply the runtime_class to the podSpec runtimeClassName.
  Note: Intended for Kata containers
- kind: Can be either pod or vm, to determine whether the fio workload is run in a Pod or in a VM
  Note: For VM workloads, you need to install OpenShift Virtualization first
- vm_image: Whether to use a pre-defined VM image with pre-installed requirements. Necessary for disconnected installs.
  Note: You can use my fedora image here: quay.io/mulbc/fed-fio
  Note: Only applies when kind is set to vm
- vm_cores: The number of CPU cores that will be available inside the VM. Default=1
  Note: Only applies when kind is set to vm
- vm_memory: The amount of memory that will be available inside the VM, in Kubernetes format. Default=5G
  Note: Only applies when kind is set to vm
- bs: (list) blocksize values to use for I/O transactions
  Note: We set the direct=1 fio option in the jobfile configmap. In order to avoid errors, the bs values provided here should be a multiple of the filesystem blocksize (typically 4KiB). The note above about units applies here.
- bsrange: (list) blocksize range values to use for I/O transactions
  Note: We set the direct=1 fio option in the jobfile configmap. In order to avoid errors, the bsrange values provided here should be a multiple of the filesystem blocksize (typically 1KiB - 4KiB). The note above about units applies here.
  bsrange:
    - 1KiB-4KiB
    - 16KiB-64KiB
    - 256KiB-4096KiB
- numjobs: (list) Number of clones of the job to run on each server -- Total jobs will be numjobs * servers
- iodepth: Number of I/O units to keep in flight against a file; see fio(1)
- read_runtime: Amount of time in seconds to run read workloads (including readwrite workloads)
- read_ramp_time: Amount of time in seconds to ramp up read workloads (i.e., executing the workload without recording the data)
  Note: We intentionally run write workloads to completion of the file size specified in order to ensure that complete files are available for subsequent read workloads. All read workloads are time-based, using these parameters, but note that this behavior is configured via the EXPERT AREA section of the CR as described below, and may therefore be adjusted to user preferences.
- filesize: The size of the file used for each job in the workload (per numjobs * servers as described above)
- log_sample_rate: Applied to fio options log_avg_msec and log_hist_msec in the jobfile configmap; see fio(1)
- storageclass: (optional) The K8S StorageClass to use for persistent volume claims (PVC) per server pod
- pvcaccessmode: (optional) The AccessMode to request with the persistent volume claim (PVC) for the fio server. Can be one of ReadWriteOnce, ReadOnlyMany, ReadWriteMany. Default: ReadWriteOnce
- pvcvolumemode: (optional) The volumeMode to request with the persistent volume claim (PVC) for the fio server. Can be one of Filesystem, Block. Default: Filesystem
  Note: It is recommended to change this to Block for VM tests
- storagesize: (optional) The size of the PVCs to request from the StorageClass (note units quirk per above)
- rook_ceph_drop_caches: (optional) If set to True, the Rook-Ceph OSD caches will be dropped prior to each sample
- rook_ceph_drop_cache_pod_ip: (optional) The IP address of the pod hosting the Rook-Ceph cache drop URL -- See cache drop pod instructions below
Technical Note: If you are running kube/openshift on VMs make sure the diskimage or volume is preallocated.
- prefill: (optional) Boolean to enable/disable prefill of the SDS
  The prefill requirement stems from Ceph RBD thin-provisioning: just creating the RBD volume doesn't mean that there is space allocated to read and write out there. For example, reads to an uninitialized volume don't even talk to the Ceph OSDs; they just return immediately with zeroes in the client.
- prefill_bs: (optional) The block size to use for the prefill.
  When running against compressed volumes, the prefill operation needs to be done with the same block size as is used in the test; otherwise the compression ratio will not be as expected.
- cmp_ratio: (optional) When running against compressed volumes, the expected compression ratio (0-100)
- fio_json_to_log: (optional) Boolean to enable/disable sending job results in JSON format to the client pod log.
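As a worked example of the numjobs * servers relationship described in the list above, the following sketch computes the total job count and the aggregate file size for one combination of options. The helper names are hypothetical, and the calculation assumes, per the filesize description, that every job clone uses its own file of that size; it is not taken from the operator code.

```python
UNITS = {"KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}

def to_bytes(size):
    """Convert a complete-form IEC size such as '2GiB' to bytes."""
    for suffix, factor in UNITS.items():
        if size.endswith(suffix):
            return int(size[: -len(suffix)]) * factor
    raise ValueError(f"unsupported unit in {size!r}")

def totals(numjobs, servers, filesize):
    """Return (total jobs, total bytes of job files) across all servers."""
    total_jobs = numjobs * servers
    return total_jobs, total_jobs * to_bytes(filesize)

jobs, data = totals(numjobs=8, servers=3, filesize="2GiB")
print(jobs)               # 8 clones on each of 3 servers = 24 jobs
print(data // 1024**3)    # 24 files of 2GiB = 48 (GiB)
```

A calculation like this can help when choosing a storagesize for the PVCs.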
The key=value combinations provided in the global_overrides list will be appended to the [global] section of the fio jobfile configmap. These options will therefore override the global values for all workloads in the loop.
Under most circumstances, the options provided in the EXPERT AREA here should not be modified. The key=value pairs under params are used to append additional fio job options based on the job type. Each jobname_match in the job_params list uses a "search string" to match a job name per fio(1), and if a match is made, the key=value list items under params are appended to the [job] section of the fio jobfile configmap.
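The matching can be pictured with a short Python sketch. It is a simplified model rather than the operator's template: it assumes that the "search string" is a plain substring test against the job name and that the job type is passed to fio as rw=, and the runtime and ramp_time values are the ones from the example CR with the template variables already resolved.

```python
JOB_PARAMS = [
    {"jobname_match": "w", "params": ["fsync_on_close=1", "create_on_open=1"]},
    {"jobname_match": "read", "params": ["time_based=1", "runtime=60", "ramp_time=5"]},
]

def params_for(job, job_params=JOB_PARAMS):
    """Collect params from every entry whose search string occurs in the job name."""
    matched = []
    for entry in job_params:
        if entry["jobname_match"] in job:  # assumption: plain substring match
            matched.extend(entry["params"])
    return matched

def job_section(job):
    """Render a minimal [job] section with the matched params appended."""
    lines = [f"[{job}]", f"rw={job}"] + params_for(job)  # rw= line is an assumption
    return "\n".join(lines)

print(job_section("write"))
print(job_section("read"))
```

Under this model a short search string such as w also matches randwrite, which is why short strings should be chosen with care.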
Dropping the OSD caches before workloads is a normal and advised part of tests that involve storage I/O. Doing this with Rook-Ceph requires a privileged pod running in the same namespace as the Ceph pods and with the Ceph command tools available. To facilitate this, we provide the resources/rook_ceph_drop_cache_pod.yaml file, which will deploy a pod with the correct permissions and tools, and which runs a simple HTTP listener to trigger the cache drop by URL. You must deploy this privileged pod in order for the drop caches requests in the workload to function.
kubectl apply -f resources/rook_ceph_drop_cache_pod.yaml
Note: If Ceph is in a namespace other than rook-ceph, you will need to modify the provided YAML accordingly.
Since the cache drop pod is deployed with host networking, the pod will take on the IP address of the node on which it is running. You will need to use this IP address in the CR file as described above.
kubectl get pod -n rook-ceph rook-ceph-osd-cache-drop --template={{.status.podIP}}
You'll need to stand up the infrastructure required to index and visualize results. We are using Elasticsearch as the database and Grafana for visualization.
Currently, we have tested with Elasticsearch 7.0.1, so please deploy an Elasticsearch instance. There are many helpful guides for deploying Elasticsearch; for starters, you can follow the Elasticsearch guide for deploying with Docker.
Once you have verified that you can access Elasticsearch, you'll have to create an index template for ripsaw-fio-logs.
We send fio logs to the index ripsaw-fio-logs; the template can be found in arsenal.
Ripsaw will index the fio result JSON to the index ripsaw-fio-result. For this, no template is required. However, if you're an advanced user of Elasticsearch, you can create one and edit its settings.
Currently, for fio-distributed, we have tested with Grafana 6.3.0. A useful guide for deploying with Docker is available in the Grafana docs.
Once you've set it up, you can import the dashboard from the template in arsenal.
You can then follow the Grafana docs for importing a dashboard, including adding the data source.
Please set the data source to point to the Elasticsearch instance deployed earlier, and set the index name to ripsaw-fio-logs.
The field for timestamp will always be time_ms.
In order to index your fio results to Elasticsearch, you will need to define the parameters appropriately in your workload CR file. The spec.elasticsearch.url parameter is required.
The spec.clustername and spec.test_user values are advised to allow for better indexing of your data.