pipeline_tuning

Table of Contents Understanding pipeline stats Presentation Tunings Case study #1

Understanding pipeline stats

Presentation

When processing incoming information (from scan or changelogs), robinhood periodically dumps internal statistics to its log file (frequency for dumping stats is driven by config parameter 'logs::stats_interval). These log lines starts with 'STATS' tag which makes them easy to 'grep'.

 grep STATS /var/log/robinhood.log

The more useful part for analyzing ingest performance is the EntryProcessor Pipeline Stats section.

Example of pipeline statistics:

 ==== EntryProcessor Pipeline Stats ===
 Idle threads: 0
 Id constraints count: 9561 (hash min=0/max=3/avg=0.6)
 Name constraints count: 9561 (hash min=0/max=4/avg=0.6)
 Stage              | Wait | Curr | Done |     Total | ms/op |
  0: GET_FID        |    0 |    0 |    0 |         0 |  0.00 |
  1: GET_INFO_DB    | 9497 |    5 |    0 |     41881 |  0.17 |
  2: GET_INFO_FS    |   11 |    5 |    0 |     41870 |  0.00 |
  3: REPORTING      |    1 |    0 |    0 |     41869 |  0.00 |
  4: PRE_APPLY      |    1 |    0 |    0 |     41868 |  0.00 |
  5: DB_APPLY       |   51 |    6 |    0 |     41817 |  0.14 | 99.60% batched (avg batch size: 36.3)
  6: RM_OLD_ENTRIES |    0 |    0 |    0 |         0 |  0.00 |

Pipeline stats show the following indicators for each pipeline stage:

Wait: the number of entries waiting to be processed at this stage.
Curr: the number of entries currently being processed at this stage.
Done: the number of entries that have been processed at this stage but that cannot go to the next stage because an entry with the same 'id' or 'path' is currently being processed in next steps.
Total: the total number of entries that were processed at this step since the program started.
ms/op: the average duration of operations at this stage.

The last column is the most useful. It indicates the time spent for each kind of operation, thus giving an idea of current limitation:

Filesystem operations: GET_FID, GET_INFO_FS, CHGLOG_CLR.
Database operations: GET_INFO_DB, DB_APPLY, RM_OLD_ENTRIES.

constraints count also gives an indication of how many entries are currently in the pipeline. If it is close to 'EntryProcessor::max_pending_operations' parameter (default is 10.000) the pipeline is full, i.e. robinhood doesn't process information as fast is it arrives. If it is close to zero, it means robinhood processes information faster than it arrives.

Tunings

Some tuning can be done in EntryProcessor config block.

Main tuning in nb_threads parameter: it specifies how many worker threads process pipeline operations.

If you notice that some pipeline stages are starving, you can also have a fine tuning of threads working at each pipeline stage by setting 'stage_stagename_threads_max' parameters (see example in the cases studies below).

Case study #1

Robinhood is running filesystem scan. In EntryProcessor config block, we set nb_threads = 32.

Let's analyze the workflow stats and try to improve the scan speed:

 ==== EntryProcessor Pipeline Stats ===
 Idle threads: 0
 Id constraints count: 10000 (hash min=0/max=5/avg=0.6)
 Name constraints count: 10000 (hash min=0/max=5/avg=0.6)
 Stage              | Wait | Curr | Done |     Total | ms/op |
  0: GET_FID        |    0 |    0 |    0 |       441 | 13.12 |
  1: GET_INFO_DB    | 9958 |    2 |    0 |     55892 |  0.61 |
  2: GET_INFO_FS    |    0 |   28 |   10 |     55864 | 17.27 |
  3: REPORTING      |    0 |    0 |    0 |     55854 |  0.00 |
  4: PRE_APPLY      |    0 |    0 |    0 |     55854 |  0.00 |
  5: DB_APPLY       |    0 |    2 |    0 |     55852 |  1.43 |
  6: CHGLOG_CLR     |    0 |    0 |    0 |         0 |  0.00 |
  7: RM_OLD_ENTRIES |    0 |    0 |    0 |         0 |  0.00 |

Observations:

Pipeline is full (constraints count=10000) which may slow down the scan speed (scan threads can't push entries in the pipeline when it is full). This indicate the limiting point is currently in the pipeline sprocessing steps.
Longer operations are filesystem operations: GET_FID (13.12 ms/op) and GET_INFO_FS (17.27 ms/op)
According to 'Curr' column, most threads (28/32) are currently processing entries in 'GET_INFO_FS' stage, which may starve over pipeline steps.

To fix this second point, we can limit the number of threads running at this stage by setting STAGE_GET_INFO_FS_threads_max = 24; in EntryProcessor config block.

This is what we get then:

 ==== EntryProcessor Pipeline Stats ===
 Idle threads: 26
 Id constraints count: 7 (hash min=0/max=1/avg=0.0)
 Name constraints count: 8 (hash min=0/max=1/avg=0.0)
 Stage              | Wait | Curr | Done |     Total | ms/op |
  0: GET_FID        |    0 |    0 |    0 |      2011 |  4.25 |
  1: GET_INFO_DB    |    0 |    2 |    0 |    381537 |  0.47 |
  2: GET_INFO_FS    |    0 |    3 |    0 |    381534 |  6.90 |
  3: REPORTING      |    0 |    0 |    0 |    381534 |  0.00 |
  4: PRE_APPLY      |    0 |    0 |    0 |    381534 |  0.00 |
  5: DB_APPLY       |    0 |    2 |    0 |    381532 |  1.36 |
  6: CHGLOG_CLR     |    0 |    0 |    0 |         0 |  0.00 |
  7: RM_OLD_ENTRIES |    0 |    0 |    0 |         0 |  0.00 |

Observations:

Pipeline workflow looks better: it is no longer full so scan threads are no longer limited by robinhood processing rate to push entries to the pipeline.
Latency of filesystem operations also decreased (4.25ms and 6.90ms). This may be explained by less concurrency on filesystem operations.

At this point, we have made the processing flow more fluid. Now the limiting point is the input stream. To make it faster, the filesystem scan speed must be improved. This can be done by increasing the number of scan threads, or by tuning the fileystem itself.

Back to wiki home

Download latest version: robinhood 3.1.6 (2020/09/29)
Papers and presentations
Online documentation
Subscribe robinhood-news
Mail robinhood-support
Wiki home

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pipeline_tuning

Table of Contents

Understanding pipeline stats

Presentation

Tunings

Case study #1

Clone this wiki locally