Very high Disk read and write #3429

Open · 2 tasks done
gruselglatz opened this issue Feb 1, 2024 · 15 comments

@gruselglatz
Current Behavior

We see heavy disk usage from a default installation of Dependency-Track:
[screenshots: host disk read/write statistics]

The totals in this screenshot cover only the timeframe of the current uptime.
We also see the postgres checkpointer producing constant load.
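
For reference, a quick way to check how much of that write load is checkpoint-driven (a sketch; the container name dtrack-postgres and user dtrack are placeholders, and on PostgreSQL 13 the checkpointer statistics live in pg_stat_bgwriter):

# Checkpoint counters and buffers written by the checkpointer vs. backends
docker exec dtrack-postgres psql -U dtrack -c "SELECT checkpoints_timed, checkpoints_req, buffers_checkpoint, buffers_backend FROM pg_stat_bgwriter;"
# Settings that control checkpoint frequency
docker exec dtrack-postgres psql -U dtrack -c "SHOW checkpoint_timeout;"
docker exec dtrack-postgres psql -U dtrack -c "SHOW max_wal_size;"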

Steps to Reproduce

Projects: 974

We upload BOMs via API, no integrations enabled.
[screenshot]

Expected Behavior

Less disk write.

Dependency-Track Version

4.10.1

Dependency-Track Distribution

Container Image

Database Server

PostgreSQL

Database Server Version

13.11

Browser

N/A

Checklist

gruselglatz added the "defect (Something isn't working)" and "in triage" labels on Feb 1, 2024
@nscuro
Member

nscuro commented Feb 1, 2024

Is this disk write from the DT application, or from the database? Are they running on different hosts, such that you can pinpoint which of them is the problem?

@gruselglatz
Author

gruselglatz commented Feb 5, 2024

They are running on the same machine.

Currently I am collecting some sample data on the per-process I/O on the machine.
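
For reference, per-process I/O can be sampled with something like this (a sketch; iotop needs root, and pidstat comes from the sysstat package):

# Accumulated I/O per process; -a accumulates, -o shows only active processes, -P aggregates threads per process
iotop -aoP
# Disk read/write per process, sampled every 5 seconds
pidstat -d 5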

@gruselglatz
Author

gruselglatz commented Feb 5, 2024

After 2 hours, this is what iotop gives me:
[screenshot: iotop output after 2 hours]

After 3 hours:
[screenshot: iotop output after 3 hours]

The systemd log writes are caused by Docker's health check; see docker/for-linux#679.

Docker stats:
[screenshot: docker stats output]

Postgres was restarted; see uptime:
[screenshot: container uptime]

@gruselglatz
Author

@nscuro, we have been seeing consistent disk writes and reads for 83 days now: ~8.5 TB written, ~4.6 TB read.

@symology

symology commented Mar 5, 2024

We're also seeing very high disk usage from DependencyTrack. We have deployed one instance each of the front-end and api Docker containers into a Kubernetes cluster in Azure, using a file storage account for the /data drive, and Azure SQL for the database - so the database is not using the storage account in this case. We've not changed any of the default Task Scheduler settings.

We're currently running version 4.10.1, but from the logs it looks like the high disk usage started when we updated from 4.4.0 to 4.8.2 last year, and has continued after we updated to 4.10.1.

I've recreated the same behaviour in another test deployment (same setup as above, but where I have direct access to the storage account) and looking at the storage stats for that, DependencyTrack has performed 6.9 million file storage transactions in the last 12 hours, with no BOM files being uploaded during that period, and no other users.

From what I can see in the file storage, based on file properties, there is a set of files in the /data/index/vulnerability folder, all beginning with the prefix "_3", which are being deleted and recreated constantly. It looks like they might be related to the Lucene search index.
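
That churn can be confirmed directly on the volume, e.g. (a sketch assuming inotify-tools is installed and the index path is /data/index/vulnerability as above):

# Stream create/delete events for the vulnerability index directory
inotifywait -m -e create -e delete /data/index/vulnerability
# Or simply sample the directory listing every few seconds
watch -n 5 'ls -l /data/index/vulnerability | tail'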

@nscuro
Member

nscuro commented Mar 5, 2024

Thanks for narrowing it down, @symology!

This particular search index should be updated under the following conditions:

  • A vulnerability database (e.g. NVD) is being mirrored, and a vulnerability was created or updated
    • Apart from the initial mirroring operation, these updates should be incremental and shouldn't produce as much I/O
  • The system detects a drift between vulnerabilities in the database and vulnerabilities in the search index
    • This would trigger a complete re-index, but at worst it would result in one index write per vulnerability

Looking at your logs, are you able to spot frequent Starting reindex task messages being emitted?
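
Assuming the API server logs to stdout and the container is named dtrack-apiserver (adjust to your deployment), something like this would count them:

docker logs dtrack-apiserver 2>&1 | grep -c "Starting reindex task"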

@symology

symology commented Mar 5, 2024

Hi @nscuro.

I don't actually see the specific text Starting reindex task anywhere in the logs since I started this deployment. I'm running with the default log level, which I think is INFO.

There is an entry for [org.dependencytrack.search.IndexSubsystemInitializer] Building lucene indexes if required just after starting the test instance up, followed by a number of health checks indicating the various indexes are 'healthy'.

There has been some NVD mirroring, and there are brief spikes in the actual volume of data being written/read to the file storage which correspond to those downloads, but the actual number of file storage transactions has stayed consistently high all the way through, so I don't think it's being triggered by the updates. The indexing seems to be running constantly, regardless of changes to the vulnerabilities.

@nscuro
Member

nscuro commented Mar 6, 2024

It could be that Lucene's background merge threads are causing that. Lucene segments are append-only, so with lots of writes (many index commits) it presumably has to do lots of housekeeping.

Doing some surface-level research, it also seems like the writing patterns that DT uses (many individual index operations) are sub-optimal, and Lucene prefers batch operations instead: https://j.blaszyk.me/tech-blog/exploring-apache-lucene-index/
Presumably we currently cause (too) many segments to be created by doing individual operations.

I think that is a reasonable explanation, but it needs to be verified.

@gruselglatz
Author

Hi,
reindexing does not seem to be happening that often:
[screenshot: reindex log entries]

This is happening exactly every 3 days:
[screenshot: log entries recurring every 3 days]

I didn't find any Lucene log entries.

Is there any possibility to connect some monitoring to Lucene?

@nscuro
Member

nscuro commented Mar 6, 2024

I don't think there's a way to monitor Lucene right now, but I'll have a look at whether there is something we can add or enable to make it more observable. It is a bit of a blind spot at the moment, indeed.

nscuro added a commit to nscuro/dependency-track that referenced this issue Mar 10, 2024
Collect basic metrics:

* Total number of index operations (`add`, `update`, `delete`, `commit`), grouped by index
* Number of index documents in RAM
* Number of bytes used by the index
* Total number of documents in the index

Also, integrate Lucene's `InfoStream` with Dependency-Track's logging system. Lucene output will now be included when configuring `LOGGING_LEVEL=DEBUG`, or when the respective logger is explicitly configured in `logback.xml`.

Relates to DependencyTrack#3429

Signed-off-by: nscuro <[email protected]>
nscuro added a commit to nscuro/dependency-track that referenced this issue Mar 10, 2024 (same commit message as above)
nscuro added a commit to nscuro/dependency-track that referenced this issue Apr 16, 2024 (same commit message as above)
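
For context, per the commit message above, Lucene's InfoStream output can be surfaced by raising the log level, e.g. (a sketch; volume and port settings follow the quickstart defaults):

# Run the API server with debug logging so Lucene InfoStream output is included
docker run -d -e LOGGING_LEVEL=DEBUG -v dependency-track:/data -p 8081:8080 dependencytrack/apiserver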
@DrummerB

Hi, we are trying to set up Dependency-Track and are facing the same issue. This is on a completely fresh installation of the latest version, without ever logging in or adding any components. After Dependency-Track downloads and processes all the NIST archives, the disk I/O remains at 100% while the CPU is at around 50%.

Using fatrace, I see about 5600 reads of .dependency-track/db.mv.db per second.
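
That measurement can be reproduced with something like this (a sketch; fatrace needs root, and the path may differ per install):

# Count accesses to the embedded H2 database file over a 10-second window
timeout 10 fatrace | grep -c 'db.mv.db'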

System
# uname -a
Linux foo.bar.eu 6.1.0-21-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.90-1 (2024-05-03) x86_64 GNU/Linux
CPU
# lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          46 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   6
  On-line CPU(s) list:    0-5
Vendor ID:                GenuineIntel
  BIOS Vendor ID:         Virtuozzo
  Model name:             Intel Core Processor (Broadwell, IBRS)
    BIOS Model name:      Virtuozzo 7.12.0 PC (i440FX + PIIX, 1996)  CPU @ 2.0GHz
    BIOS CPU family:      1
    CPU family:           6
    Model:                61
    Thread(s) per core:   1
    Core(s) per socket:   6
    Socket(s):            1
    Stepping:             2
    BogoMIPS:             4399.99
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp 
                          lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic m
                          ovbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti intel_ppin ssbd i
                          brs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt md_clear
Virtualization features:  
  Hypervisor vendor:      KVM
  Virtualization type:    full
Caches (sum of all):      
  L1d:                    192 KiB (6 instances)
  L1i:                    192 KiB (6 instances)
  L2:                     24 MiB (6 instances)
NUMA:                     
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-5
Vulnerabilities:          
  Gather data sampling:   Not affected
  Itlb multihit:          KVM: Mitigation: VMX unsupported
  L1tf:                   Mitigation; PTE Inversion
  Mds:                    Mitigation; Clear CPU buffers; SMT Host state unknown
  Meltdown:               Mitigation; PTI
  Mmio stale data:        Unknown: No mitigations
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:             Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Retpoline
  Srbds:                  Unknown: Dependent on hypervisor status
  Tsx async abort:        Mitigation; Clear CPU buffers; SMT Host state unknown
Memory
free -h 
               total        used        free      shared  buff/cache   available
Mem:           7.8Gi       4.7Gi       570Mi        88Mi       3.0Gi       3.1Gi
Swap:          1.9Gi       1.3Gi       587Mi
Logs
2024-06-14 08:50:24,870 INFO [NistMirrorTask] Initiating download of https://nvd.nist.gov/feeds/json/cve/1.1/nvdcve-1.1-2004.json.gz
2024-06-14 08:50:25,073 INFO [NistMirrorTask] Downloading...
2024-06-14 08:50:25,565 INFO [NistMirrorTask] Uncompressing nvdcve-1.1-2004.json.gz
2024-06-14 08:50:25,610 INFO [NvdParser] Parsing nvdcve-1.1-2004.json
2024-06-14 08:50:38,928 INFO [NistMirrorTask] Initiating download of https://nvd.nist.gov/feeds/json/cve/1.1/nvdcve-1.1-2004.meta
2024-06-14 08:50:39,227 INFO [NistMirrorTask] Downloading...
2024-06-14 08:50:39,928 INFO [NistMirrorTask] Initiating download of https://nvd.nist.gov/feeds/json/cve/1.1/nvdcve-1.1-2003.json.gz
2024-06-14 08:50:40,126 INFO [NistMirrorTask] Downloading...
2024-06-14 08:50:41,565 INFO [NistMirrorTask] Uncompressing nvdcve-1.1-2003.json.gz
2024-06-14 08:50:41,588 INFO [NvdParser] Parsing nvdcve-1.1-2003.json
2024-06-14 08:50:55,514 INFO [NistMirrorTask] Initiating download of https://nvd.nist.gov/feeds/json/cve/1.1/nvdcve-1.1-2003.meta
2024-06-14 08:50:55,819 INFO [NistMirrorTask] Downloading...
2024-06-14 08:50:56,480 INFO [NistMirrorTask] Initiating download of https://nvd.nist.gov/feeds/json/cve/1.1/nvdcve-1.1-2002.json.gz
2024-06-14 08:50:56,678 INFO [NistMirrorTask] Downloading...
2024-06-14 08:50:58,111 INFO [NistMirrorTask] Uncompressing nvdcve-1.1-2002.json.gz
2024-06-14 08:50:58,193 INFO [NvdParser] Parsing nvdcve-1.1-2002.json
2024-06-14 08:51:25,835 INFO [NistMirrorTask] Initiating download of https://nvd.nist.gov/feeds/json/cve/1.1/nvdcve-1.1-2002.meta
2024-06-14 08:51:26,144 INFO [NistMirrorTask] Downloading...
2024-06-14 08:51:26,184 INFO [NistMirrorTask] Initiating download of https://nvd.nist.gov/feeds/json/cve/1.1/nvdcve-1.1-modified.json.gz
2024-06-14 08:51:26,383 INFO [NistMirrorTask] Downloading...
2024-06-14 08:51:26,798 INFO [NistMirrorTask] Uncompressing nvdcve-1.1-modified.json.gz
2024-06-14 08:51:26,820 INFO [NvdParser] Parsing nvdcve-1.1-modified.json
2024-06-14 08:51:32,734 INFO [NistMirrorTask] Initiating download of https://nvd.nist.gov/feeds/json/cve/1.1/nvdcve-1.1-modified.meta
2024-06-14 08:51:32,836 INFO [NistMirrorTask] Downloading...
2024-06-14 08:51:32,940 INFO [NistMirrorTask] NIST mirroring complete
2024-06-14 08:51:32,941 INFO [NistMirrorTask] Time spent (d/l):   13402ms
2024-06-14 08:51:32,941 INFO [NistMirrorTask] Time spent (parse): 1520965ms
2024-06-14 08:51:32,943 INFO [NistMirrorTask] Time spent (total): 1596497ms
2024-06-14 08:51:32,974 INFO [EpssMirrorTask] Starting EPSS mirroring task
2024-06-14 08:51:32,975 INFO [EpssMirrorTask] Mirrored data directory created successfully
2024-06-14 08:51:32,975 INFO [EpssMirrorTask] Initiating download of https://epss.cyentia.com/epss_scores-current.csv.gz
2024-06-14 08:51:33,768 INFO [EpssMirrorTask] Downloading...
2024-06-14 08:51:33,814 INFO [EpssMirrorTask] Uncompressing epss_scores-current.csv.gz
2024-06-14 08:51:34,017 INFO [EpssParser] Parsing epss_scores-current.csv
2024-06-14 08:59:37,422 INFO [EpssMirrorTask] EPSS mirroring complete
2024-06-14 08:59:42,495 INFO [EpssMirrorTask] Time spent (d/l):   793ms
2024-06-14 08:59:42,495 INFO [EpssMirrorTask] Time spent (parse): 480734ms
2024-06-14 08:59:42,618 INFO [EpssMirrorTask] Time spent (total): 484240ms
2024-06-14 09:24:05,019 INFO [InternalComponentIdentificationTask] Starting internal component identification
2024-06-14 09:24:07,012 INFO [VulnerabilityMetricsUpdateTask] Executing metrics update on vulnerability database
2024-06-14 09:24:07,651 INFO [PortfolioMetricsUpdateTask] Executing portfolio metrics update
2024-06-14 09:24:38,287 INFO [PortfolioMetricsUpdateTask] Completed portfolio metrics update in 00:28:755
2024-06-14 09:25:02,241 INFO [RepositoryMetaAnalyzerTask] Analyzing portfolio component repository metadata
2024-06-14 09:25:21,562 INFO [VulnerabilityMetricsUpdateTask] Completed metrics update on vulnerability database in 01:14:549
2024-06-14 09:25:23,853 INFO [RepositoryMetaAnalyzerTask] Portfolio component repository metadata analysis complete
2024-06-14 09:25:45,680 INFO [InternalComponentIdentificationTask] Internal component identification completed in 01:10:934

[screenshot: disk I/O usage]

I can reproduce the same on a completely clean AWS Lightsail instance with 8 GB RAM, 2 vCPUs, 160 GB SSD running Debian 12.5.

I simply install Docker, then install Dependency-Track using the manual quick-start instructions, and I get the same problem.

Installing 4.4.0, as suggested above, did not help.

@nscuro
Member

nscuro commented Jun 14, 2024

@DrummerB Please do not use the embedded H2 database (which writes to .dependency-track/db.mv.db) for "serious" deployments. There is loads of H2-internal housekeeping going on that DT has no control over. You will not have this particular issue if you use an external PostgreSQL database instead.
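
For reference, switching the API server to an external PostgreSQL database is done via the ALPINE_DATABASE_* settings, roughly like this (a sketch; host, database name, and credentials are placeholders, and the full list is in the Dependency-Track configuration docs):

docker run -d \
  -e ALPINE_DATABASE_MODE=external \
  -e ALPINE_DATABASE_URL="jdbc:postgresql://postgres.example.com:5432/dtrack" \
  -e ALPINE_DATABASE_DRIVER=org.postgresql.Driver \
  -e ALPINE_DATABASE_USERNAME=dtrack \
  -e ALPINE_DATABASE_PASSWORD=changeme \
  -v dependency-track:/data \
  -p 8081:8080 \
  dependencytrack/apiserver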

@DrummerB

@nscuro Changing to PostgreSQL seems to have fixed this issue indeed, thanks a lot!

@cd-renovate-bot

We have had a spike in disk I/O with 4.11.7, @nscuro, to the point that I'm going to see if reverting to 4.11.6 fixes it.

We're using EFS with Fargate containers for the frontend/api and RDS for a PostgreSQL database.

These are the stats from EFS:
[screenshot: EFS throughput statistics]

From the logs, I do see a lot of index rewrites back to back.

@hastarin

hastarin commented Aug 19, 2024

Sorry, the above was me (I forgot to change accounts). After reverting, the logs complained about a file lock for a while. This may have been while two containers were running at the same time as things reverted.

2024-08-19 01:37:08,008 ERROR [VulnerabilityIndexer] An error occurred while adding document to index
org.apache.lucene.store.LockObtainFailedException: Lock held by another program: /data/.dependency-track/index/vulnerability/write.lock

But following that, the continual re-indexing started again. :(

2024-08-19 01:44:47,796 INFO [VulnerabilityIndexer] Reindexing of 311570 vulnerabilities completed in PT4M2.109547586S
2024-08-19 01:44:47,798 INFO [VulnerabilityIndexer] Starting reindex task. This may take some time.
2024-08-19 01:44:47,809 INFO [IndexManager] Deleting vulnerability index
2024-08-19 01:48:52,034 INFO [VulnerabilityIndexer] Reindexing of 311570 vulnerabilities completed in PT4M3.888477161S
2024-08-19 01:48:52,036 INFO [VulnerabilityIndexer] Starting reindex task. This may take some time.
2024-08-19 01:48:52,036 INFO [IndexManager] Deleting vulnerability index
2024-08-19 01:52:55,324 INFO [VulnerabilityIndexer] Reindexing of 311570 vulnerabilities completed in PT4M2.921062536S
2024-08-19 01:52:55,325 INFO [VulnerabilityIndexer] Starting reindex task. This may take some time.
2024-08-19 01:52:55,335 INFO [IndexManager] Deleting vulnerability index

Given that I'm already using PostgreSQL, can anyone suggest any way to further troubleshoot this?

I'm about to try disabling the NVD, GitHub and Google OSV mirrors to see if that helps isolate the problem.

EDIT:
Just disabling them didn't seem to do anything. While they were still disabled, I upgraded the containers back to 4.11.7, at which point the problem stopped. I have now re-enabled each of the mirrors and will have to wait and see if the problem comes back in the next 24 hours.
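
In case it helps others hitting the same loop, two quick checks (a sketch; paths and container name as used earlier in this thread):

# Check whether a stale Lucene write lock was left behind by a previous container
ls -l /data/.dependency-track/index/vulnerability/write.lock
# Show the most recent reindex starts to gauge how often the loop runs
docker logs dtrack-apiserver 2>&1 | grep "Starting reindex task" | tail -n 5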
