
donwsampling: Optimize downsample memory usage #297

Closed · bwplotka opened this issue Apr 17, 2018 · 21 comments · Fixed by #529

bwplotka (Member) commented Apr 17, 2018:

Downsampling can easily OOM with 16 GB... I wonder if we can improve that in any way.

It uses a lot of memory when downsampling my 2-week block (on top of other memory usage, maybe after lots of compactions). I wonder whether that's not some lazy GC issue.

bwplotka (Member Author) commented:

(image attachment)

@bwplotka bwplotka changed the title compactor: Optimize compactor/downsample memory usage donwsampling: Optimize compactor/downsample memory usage Apr 17, 2018
@bwplotka bwplotka changed the title donwsampling: Optimize compactor/downsample memory usage donwsampling: Optimize downsample memory usage Apr 17, 2018
bwplotka (Member Author) commented:

20 GB is not enough either.

bwplotka (Member Author) commented:

My block stats:

$ gsutil du -sh "gs://thanos-beta/01CB8HB76TAETC189K7V77VB8G/"
13.82 GiB   gs://thanos-beta/01CB8HB76TAETC189K7V77VB8G
$ gsutil du -sh "gs://thanos-beta/01CB8HB76TAETC189K7V77VB8G/chunks/"
12.58 GiB   gs://thanos-beta/01CB8HB76TAETC189K7V77VB8G/chunks

This guy is... HEAVY

bwplotka (Member Author) commented:

go tool pprof -symbolize=remote -alloc_space thanos "(..)/debug/pprof/heap"

(pprof) top 5
Showing nodes accounting for 55.07GB, 62.78% of 87.71GB total
Dropped 356 nodes (cum <= 0.44GB)
Showing top 5 nodes out of 58
      flat  flat%   sum%        cum   cum%
   21.22GB 24.19% 24.19%    21.22GB 24.19%  github.com/improbable-eng/thanos/pkg/compact/downsample.EncodeAggrChunk
   10.97GB 12.50% 36.70%    10.97GB 12.50%  github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/chunkenc.(*bstream).writeByte
    8.49GB  9.68% 46.38%    14.14GB 16.12%  github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/chunkenc.(*XORChunk).iterator
    7.62GB  8.68% 55.06%     7.62GB  8.68%  github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/chunkenc.NewXORChunk
    6.77GB  7.72% 62.78%     6.77GB  7.72%  github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/chunkenc.(*bstream).writeBit
(pprof) 
ROUTINE ======================== github.com/improbable-eng/thanos/pkg/compact/downsample.EncodeAggrChunk in /home/bartek/go/src/github.com/improbable-eng/thanos/pkg/compact/downsample/aggr.go
   22.36GB    22.36GB (flat, cum) 24.19% of Total
         .          .     28:			b = append(b, buf[:n]...)
         .          .     29:			continue
         .          .     30:		}
         .          .     31:		l := len(c.Bytes())
         .          .     32:		n := binary.PutUvarint(buf[:], uint64(l))
  758.93MB   758.93MB     33:		b = append(b, buf[:n]...)
  175.05MB   175.05MB     34:		b = append(b, byte(c.Encoding()))
   21.15GB    21.15GB     35:		b = append(b, c.Bytes()...)
         .          .     36:	}
  307.51MB   307.51MB     37:	chk := AggrChunk(b)
         .          .     38:	return &chk
         .          .     39:}
         .          .     40:
         .          .     41:func (c AggrChunk) Bytes() []byte {
         .          .     42:	return []byte(c)
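One possible micro-optimization this profile hints at, sketched under the assumption of the chunkenc.Chunk interface (Bytes/Encoding): size the output buffer up front so the repeated appends never have to grow it. This is only an illustration, not the fix the thread later converges on.

package downsample

import (
	"encoding/binary"

	"github.com/prometheus/tsdb/chunkenc"
)

// encodeAggrChunkPrealloc is an illustrative sketch: compute the total output
// size first, allocate once, then concatenate the sub-chunks without any
// reallocation during the appends.
func encodeAggrChunkPrealloc(chks []chunkenc.Chunk) []byte {
	size := 0
	for _, c := range chks {
		size += binary.MaxVarintLen64 + 1 + len(c.Bytes())
	}
	b := make([]byte, 0, size)

	var buf [binary.MaxVarintLen64]byte
	for _, c := range chks {
		n := binary.PutUvarint(buf[:], uint64(len(c.Bytes())))
		b = append(b, buf[:n]...)        // chunk length
		b = append(b, byte(c.Encoding())) // chunk encoding
		b = append(b, c.Bytes()...)       // chunk payload
	}
	return b
}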

bwplotka (Member Author) commented Apr 17, 2018:

25-35 GB seems to be enough for my compactor.

(image attachment)

We should definitely invest time in optimizing this, though.

felipejfc commented:
I've allocated 50 GB for mine and it's dying from OOM :/

kedare commented Jul 19, 2018:

It looks like this has been improved in the latest releases? I don't know if it's related, but I haven't seen any memory spikes since one of the last upgrades.

jdfalk (Contributor) commented Aug 23, 2018:

I am still getting OOMs on large compacted blocks. We have blocks that are 207 GB, and after grabbing a block the process eventually eats all system memory until it OOMs. Server memory:

              total        used        free      shared  buff/cache   available
Mem:           125G        8.1G        1.3G        124K        116G        116G
Swap:           23G         61M         23G

The server also has 32 cores that are being dramatically underutilized.

bwplotka (Member Author) commented:

Yup, thanks for the feedback. Nothing has changed in this area, so this is becoming an important issue.

drax68 commented Sep 12, 2018:

Reproduced with >200 GB blocks: after being OOM-killed, it tries to fetch the block again and wastes S3 traffic in a loop. Is there a way to disable downsampling until the memory usage is fixed?

bwplotka (Member Author) commented Sep 12, 2018:

@drax68 Makes total sense.

Added: #515

Also looking for a volunteer to fix this issue. I can help lead this, but I won't have time to solve it soon-ish (there are higher-priority things).

xjewer (Contributor) commented Sep 12, 2018:

I can volunteer. What are you suggesting to look at first? Writing chunk data to a tmp file instead of eating up all the heap?

bwplotka added a commit that referenced this issue Sep 13, 2018
bwplotka added a commit that referenced this issue Sep 13, 2018
bwplotka (Member Author) commented Sep 13, 2018:

Thanks @xjewer!

  1. Gather data (!). We are only guessing at what is causing this, and we don't want to optimize a part of the code that isn't relevant to memory usage. The easiest way would be to write a manual unit test (no need to even commit it) that downsamples some mocked medium-sized block and prints memory usage here and there (https://golangcode.com/print-the-current-memory-usage/), so we know the current state and whether we are actually improving things (see the measurement sketch after this list). I don't have a better solution with shorter cycles - that is how I would do it.

  2. Chunk data is... not that large. By design we aggregate everything, so the result is much, much smaller than the source block we are downsampling. We should think about dumping this block to disk only when nothing else can be optimized. When users see 200 GB used while downsampling, I doubt their blocks are 1000 GB (:

  3. I assume we stream per series from the source block, but the data suggests that somehow the block is buffered. There might be different reasons. Maybe we accidentally allocate some byte slice. Maybe we free the memory just fine, but the GC is too lazy to actually collect it and we don't release the memory quickly enough. There is one good solution: use a bytes pool, like here: https://github.com/improbable-eng/thanos/blob/master/pkg/store/bucket.go#L163, and use it here: https://github.com/improbable-eng/thanos/blob/master/pkg/compact/downsample/downsample.go#L74 (a pool sketch also follows below). Again, we need to make sure we have something from point 1 to measure the result.

  4. The initial debugging in #297 (comment) shows that the byte slice here: https://github.com/improbable-eng/thanos/blob/master/pkg/compact/downsample/aggr.go#L21 holds most of the memory. Maybe this is where we are not releasing the aggr block allocations fast enough? But hm... those are just aggregated data, I believe.
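A minimal sketch of the throwaway measurement helper described in point 1; the helper name and log format are my own illustration, not project code:

package downsample_test

import (
	"runtime"
	"testing"
)

// printMemUsage logs current heap stats so a manual downsampling test can
// compare memory use before and after each stage.
func printMemUsage(t *testing.T, stage string) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	t.Logf("%s: HeapAlloc=%d MiB, TotalAlloc=%d MiB, NumGC=%d",
		stage, m.HeapAlloc/1024/1024, m.TotalAlloc/1024/1024, m.NumGC)
}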
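And a sync.Pool-based sketch of the buffer reuse suggested in point 3; this is not the project's actual pool implementation, just the idea:

package downsample

import "sync"

// chunkBufPool reuses byte buffers across series so the downsampling loop
// does not allocate a fresh slice for every chunk it re-encodes.
var chunkBufPool = sync.Pool{
	New: func() interface{} { return new([]byte) },
}

func getBuf() *[]byte { return chunkBufPool.Get().(*[]byte) }

func putBuf(b *[]byte) {
	*b = (*b)[:0] // keep capacity, drop contents
	chunkBufPool.Put(b)
}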

xjewer (Contributor) commented Sep 17, 2018:

I tested on my 9GB block. We put all the chunks in memory.
Here's a pprof list from some point during the downsampling; it continuously consumes memory, pushing all series into memBlock:

112          3MB     1.12GB           			newb.addSeries(&series{lset: lset, chunks: downsampleRaw(all, resolution)}) 
113            .          .           			continue 

and keeps them there until all that data is written back to file:

https://github.com/improbable-eng/thanos/blob/master/pkg/compact/downsample/downsample.go#L128

	id, err = comp.Write(dir, newb, origMeta.MinTime, origMeta.MaxTime)

So even a bytes pool won't help in the suggested place.
GODEBUG=schedtrace=500 shows that the GC is working properly.

bwplotka (Member Author) commented:

I tested on my 9GB block.

9 GB for the source block or for the downsampled block? I guess for the source block, right? Can you check the size of the downsampled block? It should be much smaller.

If so, why do we still have 9 GB in memory? I don't think the output block matters that much, but I might be wrong. It was clear from the beginning that we keep the output block in memory (: But the data says we keep the input... am I missing something?

xjewer (Contributor) commented Sep 17, 2018:

9 GB for the source block or for the downsampled block? I guess for the source block, right?

Sure, the source block. And yes, we're keeping the output block there.

What I see is that we initially load the XOR-encoded chunks of the source block, which means the decoded samples being aggregated to 5m are comparable in size to the source. Here is my result:

thanos-production du -h -d 1 .
632M	./01CQKYG30KE5KZJN7GCKAK18GY
5.1G	./01CQKY0N79H5572MDCMG3N0ZKS
10.0G	./01CPPF5VGX7SXDQSVXYYEJKE3Q

where:
10.0G - source
5.1G - first pass
632M - second pass

So for an extremely large source block we would waste a huge amount of RAM.

bwplotka (Member Author) commented Sep 17, 2018:

Nice, thanks for those numbers. It seems we are getting somewhere, so I can agree the output block can be significant.

We could avoid this by flushing things to a file where possible (let's name it "stream output block per series"). And indeed we are expanding (decoding) XOR chunks into some buffer here.
We could stream that part per sample as well (let's call this improvement "stream downsampling per sample"). That is not as easy, but it is possible. A rough sketch of the per-series idea follows.
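A minimal sketch of "stream output block per series", assuming hypothetical seriesSource/seriesWriter interfaces; all names here are illustrative, not actual Thanos types:

package downsample

// Illustrative stand-ins for the real TSDB types.
type seriesLabels map[string]string
type encodedChunk []byte

// seriesSource yields one series of the source block at a time.
type seriesSource interface {
	Next() bool
	At() (seriesLabels, []encodedChunk)
	Err() error
}

// seriesWriter flushes one downsampled series straight to the output block on disk.
type seriesWriter interface {
	WriteSeries(lset seriesLabels, chks []encodedChunk) error
}

// streamDownsample holds only a single series' chunks in memory at a time,
// instead of accumulating the whole output block in a memBlock.
func streamDownsample(src seriesSource, w seriesWriter, downsample func([]encodedChunk) []encodedChunk) error {
	for src.Next() {
		lset, chks := src.At()
		if err := w.WriteSeries(lset, downsample(chks)); err != nil {
			return err
		}
	}
	return src.Err()
}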

xjewer (Contributor) commented Sep 18, 2018:

From the discussion with @bwplotka in Slack:

  1. There was a misunderstanding: the diff wasn't exactly 4GB, but some memory was indeed freed, roughly 1-1.5 GB.
  2. Interesting point: https://github.com/improbable-eng/thanos/blob/3c8546ceef9cf13856d91b9897fa816303fc05b6/cmd/thanos/downsample.go#L228 https://github.com/prometheus/tsdb/blob/master/chunks/chunks.go#L371
    We read chunks using a pool, but don't put them back 🤔 (see the sketch after this list).
  3. memBlock is the target for optimisation in any case: series have to be handled one by one to avoid the memory consumption.

The next thing I found:
https://github.com/improbable-eng/thanos/blob/master/cmd/thanos/downsample.go#L228 uses mmap to open chunk files: https://github.com/prometheus/tsdb/blob/master/chunks/chunks.go#L334
In my case, the ByteSlices with block data have a capacity of ~536781921, which means series data points into the array behind this slice and won't be swept by the GC until the following series reaches the next data file. The same goes for the index file: https://github.com/prometheus/tsdb/blob/master/index/index.go#L592

$ chunks ls -l

total 19980256
-rw-r--r--  1 xjewer  staff   512M  6 Sep 03:36 000001
-rw-r--r--  1 xjewer  staff   512M  6 Sep 03:36 000002
-rw-r--r--  1 xjewer  staff   512M  6 Sep 03:37 000003
-rw-r--r--  1 xjewer  staff   512M  6 Sep 03:37 000004
-rw-r--r--  1 xjewer  staff   511M  6 Sep 03:37 000005
-rw-r--r--  1 xjewer  staff   512M  6 Sep 03:37 000006
-rw-r--r--  1 xjewer  staff   512M  6 Sep 03:37 000007
-rw-r--r--  1 xjewer  staff   512M  6 Sep 03:37 000008
-rw-r--r--  1 xjewer  staff   512M  6 Sep 03:38 000009
-rw-r--r--  1 xjewer  staff   512M  6 Sep 03:38 000010
-rw-r--r--  1 xjewer  staff   512M  6 Sep 03:38 000011
-rw-r--r--  1 xjewer  staff   512M  6 Sep 03:38 000012
-rw-r--r--  1 xjewer  staff   512M  6 Sep 03:38 000013
-rw-r--r--  1 xjewer  staff   512M  6 Sep 03:39 000014
-rw-r--r--  1 xjewer  staff   512M  6 Sep 03:39 000015
-rw-r--r--  1 xjewer  staff   512M  6 Sep 03:39 000016
-rwxr-xr-x  1 xjewer  staff   512M  6 Sep 03:39 000017
-rwxr-xr-x  1 xjewer  staff   512M  6 Sep 03:39 000018
-rwxr-xr-x  1 xjewer  staff   459M  6 Sep 03:39 000019

Summary:
For the time being, the only way is to handle the data series by series, rather than keeping the whole downsampled output block in memory.

bwplotka (Member Author) commented Sep 18, 2018 via email.

xjewer added a commit to xjewer/thanos that referenced this issue Sep 19, 2018
Add instant writer implementation to shrink memory consumption during the downsampling stage.
Encoded chunks are written to the chunks blob files right after each series is handled.
The Flush method closes the chunk writer and syncs all symbols, series, labels, postings and meta data to files.
It still works in one thread, hence operates only on one core.

Estimated memory consumption is unlikely to exceed 1 GB, but it depends on the data set, label sizes and series density: chunk data size (512 MB) + encoding buffers + index data
Fixes thanos-io#297
xjewer added a commit to xjewer/thanos that referenced this issue Sep 19, 2018
xjewer added a commit to xjewer/thanos that referenced this issue Sep 20, 2018
xjewer added a commit to xjewer/thanos that referenced this issue Sep 20, 2018
chrischdi pushed a commit to c445/thanos that referenced this issue Oct 24, 2018
chrischdi pushed a commit to c445/thanos that referenced this issue Oct 25, 2018
kabakaev pushed a commit to c445/thanos that referenced this issue Oct 31, 2018
fcvarela pushed a commit to monzo/thanos that referenced this issue Nov 5, 2018
xjewer added a commit to xjewer/thanos that referenced this issue Nov 14, 2018
tonglil commented Nov 19, 2018:

Would we be able to reopen this issue until the fix lands? :)

chrischdi pushed a commit to c445/thanos that referenced this issue Jan 2, 2019
kabakaev pushed a commit to c445/thanos that referenced this issue Jan 24, 2019
xjewer added a commit to xjewer/thanos that referenced this issue Feb 6, 2019
xjewer added a commit to xjewer/thanos that referenced this issue Feb 6, 2019
bwplotka pushed a commit that referenced this issue Feb 6, 2019
Add instant writer implementation to shrink memory consumption during the downsampling stage.
Encoded chunks are written to the chunks blob files right after each series is handled.
The Flush method closes the chunk writer and syncs all symbols, series, labels, postings and meta data to files.
It still works in one thread, hence operates only on one core.

Estimated memory consumption is unlikely to exceed 1 GB, but it depends on the data set, label sizes and series density: chunk data size (512 MB) + encoding buffers + index data

Fixes #297

* compact: clarify purpose of streamed block writer

Add comments and close resources properly.

* downsample: fix postings index

Use the proper postings index to fetch series data with its label set and chunks.

* Add to the stream writer the ability to write index data during the downsampling process.

One of the trade-offs is that we preserve symbols from raw blocks, as we have to write them before preserving the series.

The stream writer allows downsampling huge data blocks without needing to keep all series in RAM; it only needs to preserve label values and postings references.

* fix nitpicks

* downsampling: simplify StreamedBlockWriter interface

Reduce the use of a public Flush method to finalize index and meta files. In case of error, the caller has to remove the block directory with the preserved garbage inside.

Get rid of tmp directories and renaming; the final block is synced on disk before upload.
daguilarv commented:
Is there any update on this improvement?
