
Change flow exporter's export expiry mechanism #2360

Merged
merged 1 commit into from
Sep 23, 2021

Conversation

heanlan
Contributor

@heanlan heanlan commented Jul 7, 2021

This PR changes the flow exporter's export mechanism from periodic exporting to priority-based expiry. We introduce activeExpireTime and idleExpireTime on every pqItem to track when it should be exported. We create two priority queues, one for the conntrack connection store and one for the deny connection store. The flow exporter sits on top of these two priority queues and exports the expired connection items based on their expire times.
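
As a rough illustration of the idea (simplified, with names that are not necessarily the ones used in this PR), each connection is tracked by a heap item ordered by its earliest expire time:

package main

import (
	"container/heap"
	"fmt"
	"time"
)

// pqItem tracks when a single connection should next be exported.
type pqItem struct {
	key              string // stand-in for the connection key
	activeExpireTime time.Time
	idleExpireTime   time.Time
	index            int // position in the heap, maintained by Swap/Push/Pop
}

// expireTime returns the earlier of the two deadlines.
func (item *pqItem) expireTime() time.Time {
	if item.activeExpireTime.Before(item.idleExpireTime) {
		return item.activeExpireTime
	}
	return item.idleExpireTime
}

// expirePriorityQueue is a min-heap ordered by the earliest expire time.
type expirePriorityQueue []*pqItem

func (pq expirePriorityQueue) Len() int { return len(pq) }
func (pq expirePriorityQueue) Less(i, j int) bool {
	return pq[i].expireTime().Before(pq[j].expireTime())
}
func (pq expirePriorityQueue) Swap(i, j int) {
	pq[i], pq[j] = pq[j], pq[i]
	pq[i].index = i
	pq[j].index = j
}
func (pq *expirePriorityQueue) Push(x interface{}) {
	item := x.(*pqItem)
	item.index = len(*pq)
	*pq = append(*pq, item)
}
func (pq *expirePriorityQueue) Pop() interface{} {
	old := *pq
	n := len(old)
	item := old[n-1]
	old[n-1] = nil
	*pq = old[:n-1]
	return item
}

func main() {
	now := time.Now()
	pq := &expirePriorityQueue{}
	heap.Push(pq, &pqItem{key: "conn-1", activeExpireTime: now.Add(30 * time.Second), idleExpireTime: now.Add(10 * time.Second)})
	heap.Push(pq, &pqItem{key: "conn-2", activeExpireTime: now.Add(5 * time.Second), idleExpireTime: now.Add(15 * time.Second)})
	// conn-2 expires first (5s), so it sits at the top of the heap.
	fmt.Println(heap.Pop(pq).(*pqItem).key)
}

Popping from the heap always yields the connection that is due to be exported next, so an expiry check never has to look at unexpired items.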

BenchmarkExportConntrackConns results for sendFlowRecords, before and after the change -
After change:

# conns iterations ns/op B/op allocs/op
20000 100 3484688 ns/op 527820 B/op 8214 allocs/op
30000 100 5868374 ns/op 788098 B/op 12313 allocs/op
40000 100 7300047 ns/op 1047562 B/op 16392 allocs/op
50000 100 9312464 ns/op 1308313 B/op 20503 allocs/op

Before change:

# conns iterations ns/op B/op allocs/op
20000 100 11875355 ns/op 1131954 B/op 13034 allocs/op
30000 100 19828705 ns/op 1752559 B/op 20148 allocs/op
40000 100 25297873 ns/op 2376111 B/op 27339 allocs/op
50000 100 32755049 ns/op 3002266 B/op 34589 allocs/op

Improvements - reduced by:

# conns iterations ns/op B/op allocs/op
20000 100 70.6% 53.3% 36.9%
30000 100 70.4% 55.0% 38.8%
40000 100 71.1% 55.9% 40.0%
50000 100 71.5% 56.4% 40.7%

Runtime improvement reasons - on main, ForAllFlowRecordsDo accounts for 48% of the runtime, because every time we poll to check for expired records we iterate through all the records in the record map. With this change we remove the FlowRecords struct and use a priority queue instead of the record map, so we no longer need to iterate through all items.
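
To make the contrast with the old full-map scan concrete, the expiry check on such a queue only pops items whose deadline has passed and stops at the first unexpired one. A rough sketch, building on the illustrative types above (not the actual method in this PR):

// popExpiredItems removes and returns every item whose deadline has passed.
// Because the queue is ordered by expire time, the loop stops at the first
// unexpired item instead of visiting every tracked connection.
func popExpiredItems(pq *expirePriorityQueue, now time.Time) []*pqItem {
	var expired []*pqItem
	for pq.Len() > 0 && !(*pq)[0].expireTime().After(now) {
		expired = append(expired, heap.Pop(pq).(*pqItem))
	}
	return expired
}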

Memory improvement reasons - on main, github.com/vmware/go-ipfix/pkg/entities.NewDataRecord consumes 32% of the memory and AddFlowRecordToMap consumes 12%; sendFlowRecords also consumes more memory compared to the after-change version. We get rid of these methods by removing the FlowRecords struct.

@heanlan heanlan added the area/flow-visibility/exporter Issues or PRs related to the Flow Exporter functions in the Agent label Jul 7, 2021
@heanlan heanlan force-pushed the flow-records-export branch 3 times, most recently from 5a82424 to 6f00feb Compare July 7, 2021 18:19
@codecov-commenter

codecov-commenter commented Jul 7, 2021

Codecov Report

Merging #2360 (dbd4f37) into main (ea26e6a) will increase coverage by 5.56%.
The diff coverage is 88.14%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2360      +/-   ##
==========================================
+ Coverage   60.41%   65.97%   +5.56%     
==========================================
  Files         283      284       +1     
  Lines       22455    26852    +4397     
==========================================
+ Hits        13566    17716    +4150     
- Misses       7462     7470       +8     
- Partials     1427     1666     +239     
Flag Coverage Δ
e2e-tests 56.18% <85.92%> (?)
kind-e2e-tests 49.23% <85.35%> (+1.81%) ⬆️
unit-tests 40.93% <41.88%> (-1.03%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
pkg/agent/controller/networkpolicy/packetin.go 73.95% <ø> (+2.81%) ⬆️
pkg/agent/flowexporter/connections/connections.go 74.02% <68.00%> (-1.53%) ⬇️
...agent/flowexporter/connections/deny_connections.go 87.32% <84.09%> (+4.63%) ⬆️
pkg/agent/flowexporter/exporter/exporter.go 83.17% <88.42%> (+3.40%) ⬆️
.../flowexporter/connections/conntrack_connections.go 84.24% <93.75%> (+2.94%) ⬆️
.../agent/flowexporter/priorityqueue/priorityqueue.go 94.11% <94.11%> (ø)
pkg/agent/flowexporter/utils.go 75.00% <100.00%> (+15.00%) ⬆️
pkg/controller/egress/ipallocator/allocator.go 65.00% <0.00%> (-17.98%) ⬇️
pkg/util/k8s/node.go 69.56% <0.00%> (-15.06%) ⬇️
pkg/controller/networkpolicy/endpoint_querier.go 77.64% <0.00%> (-13.79%) ⬇️
... and 280 more

@heanlan heanlan marked this pull request as draft July 7, 2021 20:15
@heanlan heanlan force-pushed the flow-records-export branch 2 times, most recently from 9b62770 to 5dead68 Compare July 7, 2021 21:42
@heanlan
Contributor Author

heanlan commented Jul 7, 2021

Found failures in 1) some of the e2e tests specific to Kind clusters (fixed) 2) a Windows-only unit test (fixed). Looking into those failures...

@heanlan heanlan force-pushed the flow-records-export branch 9 times, most recently from b77a20c to 31749f1 Compare July 8, 2021 07:21
@heanlan
Contributor Author

heanlan commented Jul 8, 2021

/test-e2e

@heanlan heanlan marked this pull request as ready for review July 8, 2021 21:51
@heanlan heanlan marked this pull request as draft July 8, 2021 22:53
@heanlan
Contributor Author

heanlan commented Jul 10, 2021

/test-e2e

@heanlan heanlan marked this pull request as ready for review July 10, 2021 15:19
@heanlan heanlan requested review from zyiou and srikartati July 13, 2021 01:16
pkg/agent/flowexporter/flowrecords/flow_records.go Outdated
pkg/agent/flowexporter/exporter/exporter.go Outdated
pkg/agent/flowexporter/exporter/exporter.go Outdated
srikartati added a commit to srikartati/antrea that referenced this pull request Jul 17, 2021
Deadlock is due to the access of connection map to update from exporter
go routine to update the "DoneExport" flag. This was caught in scale testing.
Resolved this through a temporary fix by adding same flag in record
data struct. The connection and record deletion logic will be re-evaluated
through PR antrea-io#2360 as it refactors the related code.
srikartati added a commit to srikartati/antrea that referenced this pull request Jul 19, 2021
Deadlock is due to the access of connection map to update from exporter
go routine to update the "DoneExport" flag. This was caught in scale testing.
Resolved this through a temporary fix by adding same flag in record
data struct. The connection and record deletion logic will be re-evaluated
through PR antrea-io#2360 as it refactors the related code.
@antoninbas
Contributor

@heanlan did you run the benchmarks again after changing the function parameters to *Connection instead of Connection?

If there is a concern about memory usage, you can look into using the connection key instead. Build a slice of connection keys, then for every key, grab the lock again to get a pointer to the connection from the store, then prepare the set to export, release the lock and send the set.
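
A minimal sketch of that key-based, two-phase pattern could look like the following; connStore, connection, buildRecordSet, and sendRecordSet are hypothetical stand-ins, not the actual store or exporter API:

// connStore is a stand-in for a connection store guarded by a mutex.
type connStore struct {
	mutex       sync.Mutex
	connections map[string]*connection // connection is assumed to carry an expireTime field
}

// exportExpired collects only the keys of expired connections under the lock,
// then re-acquires the lock per key just long enough to build the record set,
// and sends each set with the lock released.
func (cs *connStore) exportExpired(now time.Time) error {
	cs.mutex.Lock()
	keys := make([]string, 0, len(cs.connections))
	for key, conn := range cs.connections {
		if !conn.expireTime.After(now) {
			keys = append(keys, key)
		}
	}
	cs.mutex.Unlock()

	for _, key := range keys {
		cs.mutex.Lock()
		conn, ok := cs.connections[key]
		if !ok {
			// The connection may have been deleted since the keys were collected.
			cs.mutex.Unlock()
			continue
		}
		set := buildRecordSet(conn) // hypothetical helper: prepare the IPFIX set from the connection
		cs.mutex.Unlock()
		if err := sendRecordSet(set); err != nil { // hypothetical helper: send the prepared set
			return err
		}
	}
	return nil
}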

One thing to also consider is to have a max number of connections to export with each iteration. This way you pre-allocate a small amount of memory and you reduce the max memory usage:

func (exp *flowExporter) sendFlowRecords() (time.Duration, error) {
	currTime := time.Now()
	const maxConnectionsToExport = 64
	expiredConns := make([]flowexporter.Connection, 0, maxConnectionsToExport*2)
	var expireTime1, expireTime2 time.Duration
	expiredConns, expireTime1 = exp.conntrackConnStore.GetExpiredConns(expiredConns, currTime, maxConnectionsToExport)
	expiredConns, expireTime2 = exp.denyConnStore.GetExpiredConns(expiredConns, currTime, maxConnectionsToExport)
	// Select the shorter timeout of the two connection stores for the next round of export.
	nextExpireTime := getMinTime(expireTime1, expireTime2)
	for i := range expiredConns {
		if err := exp.exportConn(&expiredConns[i]); err != nil {
			klog.ErrorS(err, "Error when sending expired flow record")
			return nextExpireTime, err
		}
	}
	return nextExpireTime, nil
}

you may want to try the latter one first, as it's a pretty simple change. Another real-world advantage IMO is that you hold the lock for a bounded amount of time because GetExpiredConns will only look for at most maxConnectionsToExport before returning.
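
For reference, a bounded GetExpiredConns along these lines could look roughly like the sketch below, reusing the illustrative expirePriorityQueue from the PR description plus a sync.Mutex; the store layout and the one-minute fallback are assumptions, not the code in this PR:

// boundedConnStore pairs a mutex with the expire-time priority queue from the
// earlier sketch (illustrative, not the actual Antrea type).
type boundedConnStore struct {
	mutex sync.Mutex
	pq    expirePriorityQueue
}

// getExpiredConns appends at most maxSize expired items to conns and returns the
// updated slice plus how long to wait before the next item expires. The lock is
// only held while popping, so the amount of work under the lock is bounded.
func (cs *boundedConnStore) getExpiredConns(conns []*pqItem, now time.Time, maxSize int) ([]*pqItem, time.Duration) {
	cs.mutex.Lock()
	defer cs.mutex.Unlock()
	for i := 0; i < maxSize; i++ {
		if cs.pq.Len() == 0 || cs.pq[0].expireTime().After(now) {
			break
		}
		conns = append(conns, heap.Pop(&cs.pq).(*pqItem))
	}
	if cs.pq.Len() == 0 {
		// Queue drained: fall back to a default interval before the next check.
		return conns, time.Minute
	}
	return conns, cs.pq[0].expireTime().Sub(now)
}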

@srikartati
Member

in that case, 1) theoretically GetItemNum should be a method of the priority queue struct, but since we need to acquire the lock, we'll put it in the connections struct 2) we need to acquire & release the locks twice in sendFlowRecords.

Like minExpireTime, the output parameter of GetExpiredConns, we can get this in the current iteration and use it in the next one without taking any lock. Like Antonin, I am also interested to see the benefits you could get by pre-allocating the slice with a fixed size. If that looks good, we can see whether getting the connection count is useful or not.

@heanlan
Contributor Author

heanlan commented Sep 9, 2021

@heanlan did you run the benchmarks again after changing the function parameters to *Connection instead of Connection?

Thanks, Antonin and Srikar. Yes, the results listed in the table above are after this change; the total memory usage of sendFlowRecords was reduced by 10%.

I have tried the approach of pre-allocating the slice with a fixed size. In the case of exporting 50k connections, the memory usage of GetExpiredConns was reduced from 279.71MB to 5MB.
For testing purposes (sendFlowRecords is only called once in the benchmark), I pre-allocated 50k as the slice size, so the memory usage at this step is huge. Therefore I didn't include the total memory usage here, as it is not meaningful.

@srikartati
Member

srikartati commented Sep 10, 2021

you may want to try the latter one first, as it's a pretty simple change. Another real-world advantage IMO is that you hold the lock for a bounded amount of time because GetExpiredConns will only look for at most maxConnectionsToExport before returning.

@antoninbas Holding the lock for a bounded time is good, as the polling routine can get hold of it quickly rather than waiting for a long time. I see a tradeoff where the export of the records may be delayed a bit depending on the polling routine's runtime. You may have suggested 64 just as an example; do you have any recommendation or rationale for picking this number?
In a follow-up PR, maybe we could run some experiments with different constants to help decide.

As an alternative, I thought about using a fraction of the flow count to define the capacity for expired connections. The downside is that it allocates a large amount of memory when the flow count is high.

@antoninbas
Contributor

I feel like anything around 100 is probably a good choice. Or if you really want to get fancy, something like min(50 + 0.1 * connectionStore.size(), 200). I still think there should be a min and a max value in that case. You want to be careful about edge cases: if you just use 0.1 * connectionStore.size(), it's a bit wasteful when you have a small number of connections overall and a large number of them need to get exported.
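
That clamped sizing would be a tiny helper along these lines (hypothetical, not code from this PR):

// maxConnsToExport derives the per-iteration export budget from the store size:
// at least 50, plus 10% of the tracked connections, capped at 200 so a huge
// store does not inflate the per-iteration allocation.
func maxConnsToExport(storeSize int) int {
	n := 50 + storeSize/10
	if n > 200 {
		n = 200
	}
	return n
}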

@heanlan heanlan force-pushed the flow-records-export branch 2 times, most recently from ebd339a to ffff223 Compare September 13, 2021 18:54
@heanlan heanlan marked this pull request as draft September 13, 2021 20:02
@heanlan heanlan marked this pull request as ready for review September 14, 2021 03:21
pkg/agent/flowexporter/exporter/exporter.go Outdated
pkg/agent/flowexporter/exporter/exporter.go Outdated
pkg/agent/flowexporter/exporter/exporter.go Outdated
pkg/agent/flowexporter/exporter/exporter.go Outdated
pkg/agent/flowexporter/connections/connections.go Outdated
@@ -214,29 +233,29 @@ func TestConntrackConnectionStore_AddOrUpdateConn(t *testing.T) {
addOrUpdateConnTests := []struct {
flow flowexporter.Connection
}{
{testFlow1}, // To test update part of function.
Contributor

the more I look at this test case, the less I understand why it is structured as it is. The code is 200+ lines long, and there is too much testcase-specific code IMO. We should break the test down into separate individual tests (possibly using subtests), and place the common code into helper functions if needed. This could be done in a follow-up PR.

Contributor Author

Agree, marked as a follow-up to-do.

pkg/agent/flowexporter/connections/connections.go Outdated
pkg/agent/flowexporter/priorityqueue/priorityqueue.go Outdated
pkg/agent/flowexporter/utils.go Outdated
antoninbas
antoninbas previously approved these changes Sep 21, 2021
Contributor

@antoninbas antoninbas left a comment

LGTM

@heanlan
Contributor Author

heanlan commented Sep 21, 2021

/test-all
/test-ipv6-only-e2e
/test-ipv6-e2e

Before the change, flow exporter exports records periodically.

Signed-off-by: heanlan <[email protected]>
@heanlan
Contributor Author

heanlan commented Sep 21, 2021

Squashing the commits

/test-all
/test-ipv6-only-e2e
/test-ipv6-e2e
