
donwsampling: Optimize downsample memory usage #297

Closed · bwplotka opened this issue Apr 17, 2018 · 21 comments · Fixed by #529

bwplotka (Member) commented Apr 17, 2018:

Downsampling can easily OOM with 16 GB... I wonder if we can improve that in any way.

It uses a lot of memory when downsampling my 2-week block (on top of other memory usage, maybe after lots of compactions). I wonder whether that's not some lazy GC issue.

bwplotka (Member Author) commented:

(image attachment)

@bwplotka bwplotka changed the title compactor: Optimize compactor/downsample memory usage donwsampling: Optimize compactor/downsample memory usage Apr 17, 2018
@bwplotka bwplotka changed the title donwsampling: Optimize compactor/downsample memory usage donwsampling: Optimize downsample memory usage Apr 17, 2018
bwplotka (Member Author) commented:

20 GB is not enough either.

bwplotka (Member Author) commented:

My block stats:

$ gsutil du -sh "gs://thanos-beta/01CB8HB76TAETC189K7V77VB8G/"
13.82 GiB   gs://thanos-beta/01CB8HB76TAETC189K7V77VB8G
$ gsutil du -sh "gs://thanos-beta/01CB8HB76TAETC189K7V77VB8G/chunks/"
12.58 GiB   gs://thanos-beta/01CB8HB76TAETC189K7V77VB8G/chunks

This guy is... HEAVY

bwplotka (Member Author) commented:

go tool pprof -symbolize=remote -alloc_space thanos "(..)/debug/pprof/heap"

(pprof) top 5
Showing nodes accounting for 55.07GB, 62.78% of 87.71GB total
Dropped 356 nodes (cum <= 0.44GB)
Showing top 5 nodes out of 58
      flat  flat%   sum%        cum   cum%
   21.22GB 24.19% 24.19%    21.22GB 24.19%  github.com/improbable-eng/thanos/pkg/compact/downsample.EncodeAggrChunk
   10.97GB 12.50% 36.70%    10.97GB 12.50%  github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/chunkenc.(*bstream).writeByte
    8.49GB  9.68% 46.38%    14.14GB 16.12%  github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/chunkenc.(*XORChunk).iterator
    7.62GB  8.68% 55.06%     7.62GB  8.68%  github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/chunkenc.NewXORChunk
    6.77GB  7.72% 62.78%     6.77GB  7.72%  github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/chunkenc.(*bstream).writeBit
(pprof) 
ROUTINE ======================== github.com/improbable-eng/thanos/pkg/compact/downsample.EncodeAggrChunk in /home/bartek/go/src/github.com/improbable-eng/thanos/pkg/compact/downsample/aggr.go
   22.36GB    22.36GB (flat, cum) 24.19% of Total
         .          .     28:			b = append(b, buf[:n]...)
         .          .     29:			continue
         .          .     30:		}
         .          .     31:		l := len(c.Bytes())
         .          .     32:		n := binary.PutUvarint(buf[:], uint64(l))
  758.93MB   758.93MB     33:		b = append(b, buf[:n]...)
  175.05MB   175.05MB     34:		b = append(b, byte(c.Encoding()))
   21.15GB    21.15GB     35:		b = append(b, c.Bytes()...)
         .          .     36:	}
  307.51MB   307.51MB     37:	chk := AggrChunk(b)
         .          .     38:	return &chk
         .          .     39:}
         .          .     40:
         .          .     41:func (c AggrChunk) Bytes() []byte {
         .          .     42:	return []byte(c)
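One possible micro-optimization this profile hints at, sketched under the assumption of the chunkenc.Chunk interface (Bytes/Encoding): size the output buffer up front so the repeated appends never have to grow it. This is only an illustration, not the fix the thread later converges on.

package downsample

import (
	"encoding/binary"

	"github.com/prometheus/tsdb/chunkenc"
)

// encodeAggrChunkPrealloc is an illustrative sketch: compute the total output
// size first, allocate once, then concatenate the sub-chunks without any
// reallocation during the appends.
func encodeAggrChunkPrealloc(chks []chunkenc.Chunk) []byte {
	size := 0
	for _, c := range chks {
		size += binary.MaxVarintLen64 + 1 + len(c.Bytes())
	}
	b := make([]byte, 0, size)

	var buf [binary.MaxVarintLen64]byte
	for _, c := range chks {
		n := binary.PutUvarint(buf[:], uint64(len(c.Bytes())))
		b = append(b, buf[:n]...)        // chunk length
		b = append(b, byte(c.Encoding())) // chunk encoding
		b = append(b, c.Bytes()...)       // chunk payload
	}
	return b
}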

bwplotka (Member Author) commented Apr 17, 2018:

25-35 GB seems to be enough for my compactor.

(image attachment)

We should definitely invest time in optimizing this, though.

felipejfc commented:
I've allocated 50 GB for mine and it's dying from OOM :/

kedare commented Jul 19, 2018:

It looks like this has been improved in the latest releases? I don't know if it's related, but I haven't seen any memory spikes since one of the last upgrades.

jdfalk (Contributor) commented Aug 23, 2018:

I am still getting OOMs on large compacted blocks. We have blocks that are 207 GB, and after grabbing a block the process eventually eats all system memory until it OOMs. Server memory:

              total        used        free      shared  buff/cache   available
Mem:           125G        8.1G        1.3G        124K        116G        116G
Swap:           23G         61M         23G

The server also has 32 cores that are being dramatically underutilized.

bwplotka (Member Author) commented:

Yup, thanks for the feedback. Nothing has changed in this area, so this is becoming an important issue.

drax68 commented Sep 12, 2018:

Reproduced with >200 GB blocks: after being OOM-killed, it tries to fetch the block again and wastes S3 traffic in a loop. Is there a way to disable downsampling until the memory usage is fixed?

bwplotka (Member Author) commented Sep 12, 2018:

@drax68 Makes total sense.

Added: #515

Also looking for a volunteer to fix this issue. I can help lead this, but I won't have time to solve it soon-ish (there are higher-priority things).

xjewer (Contributor) commented Sep 12, 2018:

I can volunteer. What are you suggesting to look at first? Writing chunk data to a tmp file instead of eating up all the heap?

bwplotka added a commit that referenced this issue Sep 13, 2018
bwplotka added a commit that referenced this issue Sep 13, 2018
bwplotka (Member Author) commented Sep 13, 2018:

Thanks @xjewer!

  1. Gather data (!). We are only guessing at what is causing this, and we don't want to optimize a part of the code that isn't relevant to memory usage. The easiest way would be to write a manual unit test (no need to even commit it) that downsamples some mocked medium-sized block and prints memory usage here and there (https://golangcode.com/print-the-current-memory-usage/), so we know the current state and whether we are actually improving things (see the measurement sketch after this list). I don't have a better solution with shorter cycles - that is how I would do it.

  2. Chunk data is... not that large. By design we aggregate everything, so the result is much, much smaller than the source block we are downsampling. We should think about dumping this block to disk only when nothing else can be optimized. When users see 200 GB used while downsampling, I doubt their blocks are 1000 GB (:

  3. I assume we stream per series from the source block, but the data suggests that somehow the block is buffered. There might be different reasons. Maybe we accidentally allocate some byte slice. Maybe we free the memory just fine, but the GC is too lazy to actually collect it and we don't release the memory quickly enough. There is one good solution: use a bytes pool, like here: https://github.com/improbable-eng/thanos/blob/master/pkg/store/bucket.go#L163, and use it here: https://github.com/improbable-eng/thanos/blob/master/pkg/compact/downsample/downsample.go#L74 (a pool sketch also follows below). Again, we need to make sure we have something from point 1 to measure the result.

  4. The initial debugging in #297 (comment) shows that the byte slice here: https://github.com/improbable-eng/thanos/blob/master/pkg/compact/downsample/aggr.go#L21 holds most of the memory. Maybe this is where we are not releasing the aggr block allocations fast enough? But hm... those are just aggregated data, I believe.
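A minimal sketch of the throwaway measurement helper described in point 1; the helper name and log format are my own illustration, not project code:

package downsample_test

import (
	"runtime"
	"testing"
)

// printMemUsage logs current heap stats so a manual downsampling test can
// compare memory use before and after each stage.
func printMemUsage(t *testing.T, stage string) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	t.Logf("%s: HeapAlloc=%d MiB, TotalAlloc=%d MiB, NumGC=%d",
		stage, m.HeapAlloc/1024/1024, m.TotalAlloc/1024/1024, m.NumGC)
}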
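And a sync.Pool-based sketch of the buffer reuse suggested in point 3; this is not the project's actual pool implementation, just the idea:

package downsample

import "sync"

// chunkBufPool reuses byte buffers across series so the downsampling loop
// does not allocate a fresh slice for every chunk it re-encodes.
var chunkBufPool = sync.Pool{
	New: func() interface{} { return new([]byte) },
}

func getBuf() *[]byte { return chunkBufPool.Get().(*[]byte) }

func putBuf(b *[]byte) {
	*b = (*b)[:0] // keep capacity, drop contents
	chunkBufPool.Put(b)
}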

xjewer (Contributor) commented Sep 17, 2018:

I tested on my 9GB block. We put all the chunks in memory.
Here's a pprof list from some point during the downsampling; it continuously consumes memory, pushing all series into memBlock:

112          3MB     1.12GB           			newb.addSeries(&series{lset: lset, chunks: downsampleRaw(all, resolution)}) 
113            .          .           			continue 

and keeps them there until all that data is written back to file:

https://github.com/improbable-eng/thanos/blob/master/pkg/compact/downsample/downsample.go#L128

	id, err = comp.Write(dir, newb, origMeta.MinTime, origMeta.MaxTime)

So even a bytes pool won't help in the suggested place.
GODEBUG=schedtrace=500 shows that the GC is working properly.

bwplotka (Member Author) commented:

I tested on my 9GB block.

9 GB for the source block or for the downsampled block? I guess for the source block, right? Can you check the size of the downsampled block? It should be much smaller.

If so, why do we still have 9 GB in memory? I don't think the output block matters that much, but I might be wrong. It was clear from the beginning that we keep the output block in memory (: But the data says we keep the input... am I missing something?

xjewer (Contributor) commented Sep 17, 2018:

9 GB for the source block or for the downsampled block? I guess for the source block, right?

Sure, the source block. And yes, we're keeping the output block there.

What I see is that we initially load the XOR-encoded chunks of the source block, which means the decoded samples being aggregated to 5m are comparable in size to the source. Here is my result:

thanos-production du -h -d 1 .
632M	./01CQKYG30KE5KZJN7GCKAK18GY
5.1G	./01CQKY0N79H5572MDCMG3N0ZKS
10.0G	./01CPPF5VGX7SXDQSVXYYEJKE3Q

where:
10.0G - source
5.1G - first pass
632M - second pass

So for an extremely large source block we would waste a huge amount of RAM.

bwplotka (Member Author) commented Sep 17, 2018:

Nice, thanks for those numbers. It seems we are getting somewhere, so I can agree the output block can be significant.

We could avoid this by flushing things to a file where possible (let's name it "stream output block per series"). And indeed we are expanding (decoding) XOR chunks into some buffer here.
We could stream that part per sample as well (let's call this improvement "stream downsampling per sample"). That is not as easy, but it is possible. A rough sketch of the per-series idea follows.
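A minimal sketch of "stream output block per series", assuming hypothetical seriesSource/seriesWriter interfaces; all names here are illustrative, not actual Thanos types:

package downsample

// Illustrative stand-ins for the real TSDB types.
type seriesLabels map[string]string
type encodedChunk []byte

// seriesSource yields one series of the source block at a time.
type seriesSource interface {
	Next() bool
	At() (seriesLabels, []encodedChunk)
	Err() error
}

// seriesWriter flushes one downsampled series straight to the output block on disk.
type seriesWriter interface {
	WriteSeries(lset seriesLabels, chks []encodedChunk) error
}

// streamDownsample holds only a single series' chunks in memory at a time,
// instead of accumulating the whole output block in a memBlock.
func streamDownsample(src seriesSource, w seriesWriter, downsample func([]encodedChunk) []encodedChunk) error {
	for src.Next() {
		lset, chks := src.At()
		if err := w.WriteSeries(lset, downsample(chks)); err != nil {
			return err
		}
	}
	return src.Err()
}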

xjewer (Contributor) commented Sep 18, 2018:

From the discussion with @bwplotka in Slack:

  1. There was a misunderstanding: the diff wasn't exactly 4GB, but some memory was indeed freed, roughly 1-1.5 GB.
  2. Interesting point: https://github.com/improbable-eng/thanos/blob/3c8546ceef9cf13856d91b9897fa816303fc05b6/cmd/thanos/downsample.go#L228 https://github.com/prometheus/tsdb/blob/master/chunks/chunks.go#L371
    We read chunks using a pool, but don't put them back 🤔 (see the sketch after this list).
  3. memBlock is the target for optimisation in any case: series have to be handled one by one to avoid the memory consumption.

The next thing I found:
https://github.com/improbable-eng/thanos/blob/master/cmd/thanos/downsample.go#L228 uses mmap to open chunk files: https://github.com/prometheus/tsdb/blob/master/chunks/chunks.go#L334
In my case, the ByteSlices with block data have a capacity of ~536781921, which means series data points into the array behind this slice and won't be swept by the GC until the following series reaches the next data file. The same goes for the index file: https://github.com/prometheus/tsdb/blob/master/index/index.go#L592

$ chunks ls -l

total 19980256
-rw-r--r--  1 xjewer  staff   512M  6 Sep 03:36 000001
-rw-r--r--  1 xjewer  staff   512M  6 Sep 03:36 000002
-rw-r--r--  1 xjewer  staff   512M  6 Sep 03:37 000003
-rw-r--r--  1 xjewer  staff   512M  6 Sep 03:37 000004
-rw-r--r--  1 xjewer  staff   511M  6 Sep 03:37 000005
-rw-r--r--  1 xjewer  staff   512M  6 Sep 03:37 000006
-rw-r--r--  1 xjewer  staff   512M  6 Sep 03:37 000007
-rw-r--r--  1 xjewer  staff   512M  6 Sep 03:37 000008
-rw-r--r--  1 xjewer  staff   512M  6 Sep 03:38 000009
-rw-r--r--  1 xjewer  staff   512M  6 Sep 03:38 000010
-rw-r--r--  1 xjewer  staff   512M  6 Sep 03:38 000011
-rw-r--r--  1 xjewer  staff   512M  6 Sep 03:38 000012
-rw-r--r--  1 xjewer  staff   512M  6 Sep 03:38 000013
-rw-r--r--  1 xjewer  staff   512M  6 Sep 03:39 000014
-rw-r--r--  1 xjewer  staff   512M  6 Sep 03:39 000015
-rw-r--r--  1 xjewer  staff   512M  6 Sep 03:39 000016
-rwxr-xr-x  1 xjewer  staff   512M  6 Sep 03:39 000017
-rwxr-xr-x  1 xjewer  staff   512M  6 Sep 03:39 000018
-rwxr-xr-x  1 xjewer  staff   459M  6 Sep 03:39 000019

Summary:
For the time being, the only way is to handle the data series by series, rather than keeping the whole downsampled output block in memory.

bwplotka (Member Author) commented Sep 18, 2018 via email.

xjewer added a commit to xjewer/thanos that referenced this issue Sep 19, 2018
Add instant writer implementation to shrink memory consumption during the downsampling stage.
Encoded chunks are written to the chunks blob files right after each series is handled.
The Flush method closes the chunk writer and syncs all symbols, series, labels, postings and meta data to files.
It still works in one thread, hence operates only on one core.

Estimated memory consumption is unlikely to exceed 1 GB, but it depends on the data set, label sizes and series density: chunk data size (512 MB) + encoding buffers + index data
Fixes thanos-io#297
xjewer added a commit to xjewer/thanos that referenced this issue Sep 19, 2018
xjewer added a commit to xjewer/thanos that referenced this issue Sep 20, 2018
xjewer added a commit to xjewer/thanos that referenced this issue Sep 20, 2018
chrischdi pushed a commit to c445/thanos that referenced this issue Oct 24, 2018
chrischdi pushed a commit to c445/thanos that referenced this issue Oct 25, 2018
kabakaev pushed a commit to c445/thanos that referenced this issue Oct 31, 2018
fcvarela pushed a commit to monzo/thanos that referenced this issue Nov 5, 2018
xjewer added a commit to xjewer/thanos that referenced this issue Nov 14, 2018
tonglil commented Nov 19, 2018:

Would we be able to reopen this issue until the fix lands? :)

chrischdi pushed a commit to c445/thanos that referenced this issue Jan 2, 2019
kabakaev pushed a commit to c445/thanos that referenced this issue Jan 24, 2019
xjewer added a commit to xjewer/thanos that referenced this issue Feb 6, 2019
xjewer added a commit to xjewer/thanos that referenced this issue Feb 6, 2019
bwplotka pushed a commit that referenced this issue Feb 6, 2019
Add instant writer implementation to shrink memory consumption during the downsampling stage.
Encoded chunks are written to the chunks blob files right after each series is handled.
The Flush method closes the chunk writer and syncs all symbols, series, labels, postings and meta data to files.
It still works in one thread, hence operates only on one core.

Estimated memory consumption is unlikely to exceed 1 GB, but it depends on the data set, label sizes and series density: chunk data size (512 MB) + encoding buffers + index data

Fixes #297

* compact: clarify purpose of streamed block writer

Add comments and close resources properly.

* downsample: fix postings index

Use the proper postings index to fetch series data with its label set and chunks.

* Add to the stream writer the ability to write index data during the downsampling process.

One of the trade-offs is that we preserve symbols from raw blocks, as we have to write them before preserving the series.

The stream writer allows downsampling huge data blocks without needing to keep all series in RAM; it only needs to preserve label values and postings references.

* fix nitpicks

* downsampling: simplify StreamedBlockWriter interface

Reduce the use of a public Flush method to finalize index and meta files. In case of error, the caller has to remove the block directory with the preserved garbage inside.

Get rid of tmp directories and renaming; the final block is synced on disk before upload.
daguilarv commented:
Is there any update on this improvement?
