Prefetch next inode number #5130

Merged 3 commits on Sep 4, 2024

polyrabbit
Contributor

Currently, inodes are fetched in synchronous batches, and the synchronous call that allocates a new batch can hurt performance. One case is writing checkpoints in LLM training, where hundreds of nodes write at the same time: they all fetch synchronously, which causes serious transaction conflicts. We observed hundreds of milliseconds of latency when allocating inodes. One recent example:

[Screenshot: img_v3_02e8_144036ae-d5fc-4326-a653-37476f2ca68g]

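By pipelining inode fetching and allocation, the create op now has smooth latency. In the traces below, inodeBatch is set to 10, so a new batch is fetched every 10 allocations. Before the change, creates 1, 11, and 21 take about 4 to 5 ms against roughly 2.2 ms for the rest; after it, only the first create is slow.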

before:

create (9462500,1,-rw-rw-r--:0100664): (9462570,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:20] - OK <0.005308>
create (9462500,2,-rw-rw-r--:0100664): (9462571,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:22] - OK <0.002266>
create (9462500,3,-rw-rw-r--:0100664): (9462572,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:24] - OK <0.002200>
create (9462500,4,-rw-rw-r--:0100664): (9462573,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:26] - OK <0.002205>
create (9462500,5,-rw-rw-r--:0100664): (9462574,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:28] - OK <0.002206>
create (9462500,6,-rw-rw-r--:0100664): (9462575,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:30] - OK <0.002190>
create (9462500,7,-rw-rw-r--:0100664): (9462576,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:32] - OK <0.002229>
create (9462500,8,-rw-rw-r--:0100664): (9462577,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:34] - OK <0.002183>
create (9462500,9,-rw-rw-r--:0100664): (9462578,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:36] - OK <0.002180>
create (9462500,10,-rw-rw-r--:0100664): (9462579,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:38] - OK <0.002190>
create (9462500,11,-rw-rw-r--:0100664): (9462580,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:40] - OK <0.004361>
create (9462500,12,-rw-rw-r--:0100664): (9462581,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:42] - OK <0.002745>
create (9462500,13,-rw-rw-r--:0100664): (9462582,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:44] - OK <0.002181>
create (9462500,14,-rw-rw-r--:0100664): (9462583,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:46] - OK <0.002134>
create (9462500,15,-rw-rw-r--:0100664): (9462584,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:48] - OK <0.002136>
create (9462500,16,-rw-rw-r--:0100664): (9462585,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:50] - OK <0.002168>
create (9462500,17,-rw-rw-r--:0100664): (9462586,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:52] - OK <0.002209>
create (9462500,18,-rw-rw-r--:0100664): (9462587,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:54] - OK <0.002125>
create (9462500,19,-rw-rw-r--:0100664): (9462588,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:56] - OK <0.002176>
create (9462500,20,-rw-rw-r--:0100664): (9462589,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:58] - OK <0.002151>
create (9462500,21,-rw-rw-r--:0100664): (9462590,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:60] - OK <0.004324>
create (9462500,22,-rw-rw-r--:0100664): (9462591,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:62] - OK <0.002246>
create (9462500,23,-rw-rw-r--:0100664): (9462592,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:64] - OK <0.002210>
create (9462500,24,-rw-rw-r--:0100664): (9462593,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:66] - OK <0.002263>
create (9462500,25,-rw-rw-r--:0100664): (9462594,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:68] - OK <0.002166>
create (9462500,26,-rw-rw-r--:0100664): (9462595,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:70] - OK <0.002177>
create (9462500,27,-rw-rw-r--:0100664): (9462596,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:72] - OK <0.002128>
create (9462500,28,-rw-rw-r--:0100664): (9462597,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:74] - OK <0.002135>
create (9462500,29,-rw-rw-r--:0100664): (9462598,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:76] - OK <0.002223>
create (9462500,30,-rw-rw-r--:0100664): (9462599,[-rw-rw-r--:0100664,1,1002,1002,1725336337,1725336337,1725336337,0]) [fh:78] - OK <0.002260>

after:

create (9462500,1,-rw-rw-r--:0100664): (9462600,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:12] - OK <0.004521>
create (9462500,2,-rw-rw-r--:0100664): (9462601,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:14] - OK <0.002690>
create (9462500,3,-rw-rw-r--:0100664): (9462602,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:16] - OK <0.002181>
create (9462500,4,-rw-rw-r--:0100664): (9462603,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:18] - OK <0.002068>
create (9462500,5,-rw-rw-r--:0100664): (9462604,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:20] - OK <0.002049>
create (9462500,6,-rw-rw-r--:0100664): (9462605,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:22] - OK <0.002040>
create (9462500,7,-rw-rw-r--:0100664): (9462606,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:24] - OK <0.002017>
create (9462500,8,-rw-rw-r--:0100664): (9462607,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:26] - OK <0.002075>
create (9462500,9,-rw-rw-r--:0100664): (9462608,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:28] - OK <0.002018>
create (9462500,10,-rw-rw-r--:0100664): (9462609,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:30] - OK <0.002584>
create (9462500,11,-rw-rw-r--:0100664): (9462610,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:32] - OK <0.002046>
create (9462500,12,-rw-rw-r--:0100664): (9462611,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:34] - OK <0.002076>
create (9462500,13,-rw-rw-r--:0100664): (9462612,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:36] - OK <0.002043>
create (9462500,14,-rw-rw-r--:0100664): (9462613,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:38] - OK <0.002081>
create (9462500,15,-rw-rw-r--:0100664): (9462614,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:40] - OK <0.002533>
create (9462500,16,-rw-rw-r--:0100664): (9462615,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:42] - OK <0.002011>
create (9462500,17,-rw-rw-r--:0100664): (9462616,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:44] - OK <0.002013>
create (9462500,18,-rw-rw-r--:0100664): (9462617,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:46] - OK <0.002043>
create (9462500,19,-rw-rw-r--:0100664): (9462618,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:48] - OK <0.002078>
create (9462500,20,-rw-rw-r--:0100664): (9462619,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:50] - OK <0.002677>
create (9462500,21,-rw-rw-r--:0100664): (9462620,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:52] - OK <0.002043>
create (9462500,22,-rw-rw-r--:0100664): (9462621,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:54] - OK <0.002070>
create (9462500,23,-rw-rw-r--:0100664): (9462622,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:56] - OK <0.002134>
create (9462500,24,-rw-rw-r--:0100664): (9462623,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:58] - OK <0.002106>
create (9462500,25,-rw-rw-r--:0100664): (9462624,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:60] - OK <0.002004>
create (9462500,26,-rw-rw-r--:0100664): (9462625,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:62] - OK <0.002055>
create (9462500,27,-rw-rw-r--:0100664): (9462626,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:64] - OK <0.002008>
create (9462500,28,-rw-rw-r--:0100664): (9462627,[-rw-rw-r--:0100664,1,1002,1002,1725336480,1725336480,1725336480,0]) [fh:66] - OK <0.002670>
create (9462500,29,-rw-rw-r--:0100664): (9462628,[-rw-rw-r--:0100664,1,1002,1002,1725336481,1725336481,1725336481,0]) [fh:68] - OK <0.002151>
create (9462500,30,-rw-rw-r--:0100664): (9462629,[-rw-rw-r--:0100664,1,1002,1002,1725336481,1725336481,1725336481,0]) [fh:70] - OK <0.002099>

Signed-off-by: Changxin Miao <[email protected]>
pkg/meta/base.go Outdated
	}
	n := m.freeInodes.next
	m.freeInodes.next++
	for n <= 1 {
		n = m.freeInodes.next
		m.freeInodes.next++
	}
	if m.freeInodes.maxid-m.freeInodes.next < uint64(utils.JitterIt(inodeBatch*0.1)) {
Contributor

It will start only a single goroutine if we use a fixed number here.

Contributor Author

A goroutine will be started whenever freeInodes.next is close to freeInodes.maxid, regardless of the threshold.
With a fixed number, once one inode allocation spawns a goroutine, the next call will always spawn another one, because the next call sees a larger freeInodes.next.
All of those goroutines except the first are unschedulable (blocked), so they will not consume many resources.

Contributor

if m.freeInodes.maxid-m.freeInodes.next  == fixed_one {
}

Contributor Author

Got it, done

pkg/meta/base.go Outdated
if err != nil {
return 0, err
m.prefetchMu.Lock() // Wait until prefetchInodes() is done
nextLimit := m.prefetchedInode
Contributor

davies, Sep 4, 2024

Use m.prefetchedInode to overwrite m.freeInodes if it's valid.

Contributor Author

Yes, this is what the following code does.

Contributor

if m.freeInodes.next >= m.freeInodes.maxid {
   m.prefetchMu.Lock()
   m.freeInodes = m.prefetchedInodes
   m.prefetchedInodes = freeID{}
   m.prefetchMu.Unlock()
}
if m.freeInodes.next >= m.freeInodes.maxid {
}

Contributor Author

Refactored like this? 6043c1c

davies merged commit 6a4999f into juicedata:main on Sep 4, 2024
39 checks passed

davies commented Sep 4, 2024

LGTM, thanks!
