Prefetch next inode number #5130

polyrabbit · 2024-09-03T05:09:57Z

Currently, inodes are fetched in synchronous batches, and the synchronous call that allocates a new batch can hurt performance. One case is writing checkpoints in LLM training, where hundreds of nodes write at the same time: they all allocate synchronously, which causes serious transaction conflicts. We observed hundreds of milliseconds of latency when allocating inodes. One recent example:

By pipelining inode fetching and allocation, the latency of the create op is now smooth. For example (this run sets inodeBatch=10, so a new batch is fetched every 10 allocations):

before:

create (9462500,1,-rw-rw-r--:0100664): (9462570,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:20] - OK <0.005308>
create (9462500,2,-rw-rw-r--:0100664): (9462571,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:22] - OK <0.002266>
create (9462500,3,-rw-rw-r--:0100664): (9462572,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:24] - OK <0.002200>
create (9462500,4,-rw-rw-r--:0100664): (9462573,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:26] - OK <0.002205>
create (9462500,5,-rw-rw-r--:0100664): (9462574,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:28] - OK <0.002206>
create (9462500,6,-rw-rw-r--:0100664): (9462575,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:30] - OK <0.002190>
create (9462500,7,-rw-rw-r--:0100664): (9462576,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:32] - OK <0.002229>
create (9462500,8,-rw-rw-r--:0100664): (9462577,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:34] - OK <0.002183>
create (9462500,9,-rw-rw-r--:0100664): (9462578,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:36] - OK <0.002180>
create (9462500,10,-rw-rw-r--:0100664): (9462579,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:38] - OK <0.002190>
create (9462500,11,-rw-rw-r--:0100664): (9462580,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:40] - OK <0.004361>
create (9462500,12,-rw-rw-r--:0100664): (9462581,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:42] - OK <0.002745>
create (9462500,13,-rw-rw-r--:0100664): (9462582,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:44] - OK <0.002181>
create (9462500,14,-rw-rw-r--:0100664): (9462583,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:46] - OK <0.002134>
create (9462500,15,-rw-rw-r--:0100664): (9462584,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:48] - OK <0.002136>
create (9462500,16,-rw-rw-r--:0100664): (9462585,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:50] - OK <0.002168>
create (9462500,17,-rw-rw-r--:0100664): (9462586,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:52] - OK <0.002209>
create (9462500,18,-rw-rw-r--:0100664): (9462587,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:54] - OK <0.002125>
create (9462500,19,-rw-rw-r--:0100664): (9462588,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:56] - OK <0.002176>
create (9462500,20,-rw-rw-r--:0100664): (9462589,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:58] - OK <0.002151>
create (9462500,21,-rw-rw-r--:0100664): (9462590,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:60] - OK <0.004324>
create (9462500,22,-rw-rw-r--:0100664): (9462591,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:62] - OK <0.002246>
create (9462500,23,-rw-rw-r--:0100664): (9462592,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:64] - OK <0.002210>
create (9462500,24,-rw-rw-r--:0100664): (9462593,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:66] - OK <0.002263>
create (9462500,25,-rw-rw-r--:0100664): (9462594,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:68] - OK <0.002166>
create (9462500,26,-rw-rw-r--:0100664): (9462595,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:70] - OK <0.002177>
create (9462500,27,-rw-rw-r--:0100664): (9462596,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:72] - OK <0.002128>
create (9462500,28,-rw-rw-r--:0100664): (9462597,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:74] - OK <0.002135>
create (9462500,29,-rw-rw-r--:0100664): (9462598,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:76] - OK <0.002223>
create (9462500,30,-rw-rw-r--:0100664): (9462599,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:78] - OK <0.002260>

after:

create (9462500,1,-rw-rw-r--:0100664): (9462600,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:12] - OK <0.004521>
create (9462500,2,-rw-rw-r--:0100664): (9462601,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:14] - OK <0.002690>
create (9462500,3,-rw-rw-r--:0100664): (9462602,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:16] - OK <0.002181>
create (9462500,4,-rw-rw-r--:0100664): (9462603,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:18] - OK <0.002068>
create (9462500,5,-rw-rw-r--:0100664): (9462604,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:20] - OK <0.002049>
create (9462500,6,-rw-rw-r--:0100664): (9462605,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:22] - OK <0.002040>
create (9462500,7,-rw-rw-r--:0100664): (9462606,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:24] - OK <0.002017>
create (9462500,8,-rw-rw-r--:0100664): (9462607,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:26] - OK <0.002075>
create (9462500,9,-rw-rw-r--:0100664): (9462608,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:28] - OK <0.002018>
create (9462500,10,-rw-rw-r--:0100664): (9462609,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:30] - OK <0.002584>
create (9462500,11,-rw-rw-r--:0100664): (9462610,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:32] - OK <0.002046>
create (9462500,12,-rw-rw-r--:0100664): (9462611,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:34] - OK <0.002076>
create (9462500,13,-rw-rw-r--:0100664): (9462612,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:36] - OK <0.002043>
create (9462500,14,-rw-rw-r--:0100664): (9462613,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:38] - OK <0.002081>
create (9462500,15,-rw-rw-r--:0100664): (9462614,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:40] - OK <0.002533>
create (9462500,16,-rw-rw-r--:0100664): (9462615,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:42] - OK <0.002011>
create (9462500,17,-rw-rw-r--:0100664): (9462616,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:44] - OK <0.002013>
create (9462500,18,-rw-rw-r--:0100664): (9462617,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:46] - OK <0.002043>
create (9462500,19,-rw-rw-r--:0100664): (9462618,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:48] - OK <0.002078>
create (9462500,20,-rw-rw-r--:0100664): (9462619,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:50] - OK <0.002677>
create (9462500,21,-rw-rw-r--:0100664): (9462620,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:52] - OK <0.002043>
create (9462500,22,-rw-rw-r--:0100664): (9462621,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:54] - OK <0.002070>
create (9462500,23,-rw-rw-r--:0100664): (9462622,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:56] - OK <0.002134>
create (9462500,24,-rw-rw-r--:0100664): (9462623,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:58] - OK <0.002106>
create (9462500,25,-rw-rw-r--:0100664): (9462624,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:60] - OK <0.002004>
create (9462500,26,-rw-rw-r--:0100664): (9462625,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:62] - OK <0.002055>
create (9462500,27,-rw-rw-r--:0100664): (9462626,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:64] - OK <0.002008>
create (9462500,28,-rw-rw-r--:0100664): (9462627,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:66] - OK <0.002670>
create (9462500,29,-rw-rw-r--:0100664): (9462628,[-rw-rw-r--:0100664,1,1002,1002,1725336481,1725336481,1725336481,0]) [fh:68] - OK <0.002151>
create (9462500,30,-rw-rw-r--:0100664): (9462629,[-rw-rw-r--:0100664,1,1002,1002,1725336481,1725336481,1725336481,0]) [fh:70] - OK <0.002099>
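One way to compare the two runs is to pull the trailing latency field out of each access-log line and list the positions that exceed a threshold. The following is a minimal sketch in Go; the helper names `latencies` and `spikes` and the 4 ms threshold are illustrative choices, not part of JuiceFS. The three sample lines are copied from the "before" run above.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// latencyRe matches the trailing "<seconds>" field of an access-log line.
var latencyRe = regexp.MustCompile(`<(\d+\.\d+)>\s*$`)

// latencies extracts the latency (in seconds) from every line that has one.
func latencies(lines []string) []float64 {
	var out []float64
	for _, l := range lines {
		m := latencyRe.FindStringSubmatch(l)
		if m == nil {
			continue // no latency field on this line
		}
		if v, err := strconv.ParseFloat(m[1], 64); err == nil {
			out = append(out, v)
		}
	}
	return out
}

// spikes returns the 1-based positions whose latency exceeds threshold.
func spikes(lat []float64, threshold float64) []int {
	var idx []int
	for i, v := range lat {
		if v > threshold {
			idx = append(idx, i+1)
		}
	}
	return idx
}

func main() {
	// Creates 10, 11 and 12 from the "before" run.
	sample := []string{
		`create (9462500,10,-rw-rw-r--:0100664): (9462579,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:38] - OK <0.002190>`,
		`create (9462500,11,-rw-rw-r--:0100664): (9462580,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:40] - OK <0.004361>`,
		`create (9462500,12,-rw-rw-r--:0100664): (9462581,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:42] - OK <0.002745>`,
	}
	// Only the second sample line (create 11) is above 4 ms.
	fmt.Println(spikes(latencies(sample), 0.004))
}
```

Applied to the full listings with the same threshold, the "before" run exceeds 4 ms at creates 1, 11 and 21, while the "after" run does so only at create 1.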

Signed-off-by: Changxin Miao <[email protected]>

davies · 2024-09-04T02:43:49Z

pkg/meta/base.go

 	}
 	n := m.freeInodes.next
 	m.freeInodes.next++
 	for n <= 1 {
 		n = m.freeInodes.next
 		m.freeInodes.next++
 	}
+	if m.freeInodes.maxid-m.freeInodes.next < uint64(utils.JitterIt(inodeBatch*0.1)) {


It will start a single goroutine if we use a fixed number here.

A goroutine will be started whenever freeInodes.next is close to freeInodes.maxid, regardless of the threshold.
With a fixed number, once one inode allocation spawns a goroutine, the next call will always spawn another one, because it sees a larger freeInodes.next.
All of those goroutines except the first are unschedulable, so they will not consume many resources.

if m.freeInodes.maxid-m.freeInodes.next == fixed_one { }

Got it, done
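The difference between the two conditions discussed in this thread can be checked with a small simulation. This is a sketch under simplifying assumptions: `countTriggers` is a hypothetical helper, the threshold is a plain constant (the jitter from the diff is left out), and the condition is evaluated right after `next` is incremented, as in the hunk above.

```go
package main

import "fmt"

// countTriggers hands out every ID of one batch [next, maxid) and counts how
// often the prefetch condition fires. With exact=false it uses
// "remaining < threshold"; with exact=true it uses "remaining == threshold".
// Names are illustrative, not the ones in pkg/meta/base.go.
func countTriggers(batch, threshold uint64, exact bool) int {
	next, maxid := uint64(0), batch
	triggers := 0
	for next < maxid {
		next++ // hand out one inode number
		remaining := maxid - next
		if exact {
			if remaining == threshold {
				triggers++
			}
		} else if remaining < threshold {
			triggers++
		}
	}
	return triggers
}

func main() {
	fmt.Println("remaining <  threshold:", countTriggers(100, 10, false)) // fires 10 times
	fmt.Println("remaining == threshold:", countTriggers(100, 10, true))  // fires once
}
```

With a batch of 100 and a threshold of 10, the `<` form fires on each of the last ten allocations, while the `==` form fires exactly once per batch.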

davies · 2024-09-04T02:47:17Z

pkg/meta/base.go

-		if err != nil {
-			return 0, err
+		m.prefetchMu.Lock() // Wait until prefetchInodes() is done
+		nextLimit := m.prefetchedInode


Use m.prefetchedInode to overwrite m.freeInodes if it's valid.

Yes, this is what the following code does.

if m.freeInodes.next >= m.freeInodes.maxid {
	m.prefetchMu.Lock()
	m.freeInodes = m.prefetchedInodes
	m.prefetchedInodes = freeID{}
	m.prefetchMu.Unlock()
}
if m.freeInodes.next >= m.freeInodes.maxid {
}

Refactored like this? 6043c1c
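The handoff suggested in this thread can be sketched end to end as follows. This is not the code merged in the PR: `allocator`, `idRange`, `fetch`, `prefetchLocked` and `newDemoAllocator` are hypothetical names, the fake `fetch` replaces the real metadata-engine transaction, and the trigger uses a fixed `batch/10` without jitter. Taking `prefetchMu` before spawning the goroutine is a choice of this sketch, so that a later handoff always waits for the prefetch that is in flight.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// idRange is a half-open range [next, maxid) of unused inode numbers.
type idRange struct{ next, maxid uint64 }

// allocator is a hypothetical stand-in for the meta client.
type allocator struct {
	mu         sync.Mutex // guards free
	free       idRange
	prefetchMu sync.Mutex // guards prefetched; held while a prefetch is in flight
	prefetched idRange
	batch      uint64
	fetch      func(n uint64) idRange // stands in for the metadata-engine call
}

// prefetchLocked expects prefetchMu to be held by the caller and releases it
// once the next range has been stored.
func (a *allocator) prefetchLocked() {
	defer a.prefetchMu.Unlock()
	a.prefetched = a.fetch(a.batch)
}

func (a *allocator) nextInode() uint64 {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.free.next >= a.free.maxid {
		a.prefetchMu.Lock() // waits until a running prefetch is done
		a.free, a.prefetched = a.prefetched, idRange{}
		a.prefetchMu.Unlock()
	}
	if a.free.next >= a.free.maxid {
		a.free = a.fetch(a.batch) // nothing prefetched: synchronous fallback
	}
	n := a.free.next
	a.free.next++
	if a.free.maxid-a.free.next == a.batch/10 { // fires once per range
		a.prefetchMu.Lock() // taken here so the next handoff must wait
		go a.prefetchLocked()
	}
	return n
}

// newDemoAllocator builds an allocator backed by an in-memory counter.
func newDemoAllocator(start, batch uint64) *allocator {
	counter := start
	return &allocator{batch: batch, fetch: func(n uint64) idRange {
		end := atomic.AddUint64(&counter, n)
		return idRange{next: end - n, maxid: end}
	}}
}

func main() {
	a := newDemoAllocator(100, 10)
	for i := 0; i < 25; i++ {
		fmt.Println(a.nextInode())
	}
}
```

In this sketch only the very first allocation pays for a synchronous fetch; every later range is requested in the background one allocation before the current range runs out, and the caller blocks only if that prefetch has not finished by then.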

davies · 2024-09-04T13:10:36Z

LGTM, thanks!

Prefetch next inode number

949a8ee

Signed-off-by: Changxin Miao <[email protected]>

davies reviewed Sep 4, 2024

polyrabbit added 2 commits September 4, 2024 17:26

Only spawn one goroutine to prefetch inode

6da6b8c

Signed-off-by: Changxin Miao <[email protected]>

Refactor prefetched inode assignment

6043c1c

Signed-off-by: Changxin Miao <[email protected]>

davies merged commit 6a4999f into juicedata:main Sep 4, 2024
39 checks passed
