MMAP for Windows (not working atm) #341

oKatanaaa · 2023-03-20T22:07:32Z

The code in this PR lets you run llama once, but the second run crashes with a memory access violation as soon as it touches any data related to the model or vocab. I don't know why, but the current implementation cannot serialize the pointers stored in the magic struct properly. I suspect this is because they are allocated with the new operator, which returns addresses outside the mapped memory range, so on the second run those pointers are not interpretable and point to nowhere.
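One common way to keep references valid when a mapping lands at a different address is to store byte offsets from the mapping's base instead of raw pointers. This is a swapped-in technique, not what the PR does (the PR relies on mapping the file back at the fixed MAGIC_ADDR in main.cpp), and the `Node`, `build_list` and `sum_list` names are hypothetical. The sketch simulates a "second run" by copying the arena to a different buffer and resolving the same offsets against the new base.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical record that refers to another record in the same arena.
// It stores a byte offset from the arena base, never a raw pointer.
struct Node {
    int32_t value;
    uint64_t next_off;  // 0 means "no next node"
};

// Read a node by copying it out of the arena (avoids aliasing pitfalls).
inline Node load_node(const unsigned char* base, uint64_t off) {
    Node n;
    std::memcpy(&n, base + off, sizeof n);
    return n;
}

// Build a two-node list. Offsets start at sizeof(Node) so that 0 can act as null.
inline void build_list(std::vector<unsigned char>& arena) {
    arena.assign(3 * sizeof(Node), 0);
    Node a{1, 2 * sizeof(Node)};
    Node b{2, 0};
    std::memcpy(arena.data() + 1 * sizeof(Node), &a, sizeof a);
    std::memcpy(arena.data() + 2 * sizeof(Node), &b, sizeof b);
}

// Walk the list from head_off, resolving every offset against the current base.
inline int sum_list(const unsigned char* base, uint64_t head_off) {
    int total = 0;
    for (uint64_t off = head_off; off != 0;) {
        Node n = load_node(base, off);
        total += n.value;
        off = n.next_off;
    }
    return total;
}
```

Because nothing in the arena encodes an absolute address, the copy at a new base address yields the same result as the original.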

Major changes:

- The custom malloc is renamed to _malloc. Otherwise the linker complains about multiple definitions of malloc when linking against the CRT. I tried #undef malloc to remove references to the original one, but have not verified yet whether it makes any difference.
- WinMap fix: set the access flag to FILE_MAP_COPY when loading the second time. See https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-mapviewoffile.
- Implement madvise for Windows. I can't verify whether this change makes any difference, because pointer serialization is broken for now, but it seems to work.
- Implement msync for Windows. Same comment as for madvise.
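For comparison, the POSIX counterpart of FILE_MAP_COPY is a MAP_PRIVATE mapping: writes go to process-private copy-on-write pages and never reach the file. The sketch below is a Linux/POSIX illustration under that assumption; the helper name is hypothetical and nothing here is taken from the PR's WinMap code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <sys/mman.h>
#include <unistd.h>

// Map fd copy-on-write, scribble over the mapping, then check the file itself.
// Returns 1 if the file is untouched, 0 if it changed, -1 on error.
inline int private_write_leaves_file_intact(int fd, size_t len) {
    void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) return -1;
    std::memset(p, 'X', len);  // only this process's private copy changes
    munmap(p, len);

    char buf[64];
    if (len > sizeof buf) return -1;
    if (pread(fd, buf, len, 0) != (ssize_t)len) return -1;
    for (size_t i = 0; i < len; ++i)
        if (buf[i] == 'X') return 0;
    return 1;
}
```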

Some of the fixes needed to make it compile:

- #define NOMINMAX
- use the appropriate stat structure on Windows
- define MS_ASYNC on Windows
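The three compile fixes above could be gathered into one small shim. This is a sketch, not the PR's code: `portable_stat`, `portable_fstat`, `file_size` and `clamp_size` are hypothetical names, and the Windows MS_ASYNC value is a placeholder that only matters to whatever msync polyfill consumes it.

```cpp
#ifdef _WIN32
#  ifndef NOMINMAX
#    define NOMINMAX          // stop <windows.h> from defining min/max macros
#  endif
#  include <windows.h>
#  include <io.h>
#  include <sys/stat.h>
typedef struct _stat64 portable_stat;
#  define portable_fstat _fstat64
#  ifndef MS_ASYNC
#    define MS_ASYNC 1        // placeholder for the msync polyfill
#  endif
#else
#  include <sys/mman.h>       // provides the real MS_ASYNC
#  include <sys/stat.h>
typedef struct stat portable_stat;
#  define portable_fstat fstat
#endif

#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstdio>

// Size of an open file, or -1 on failure, using the platform's stat type.
inline int64_t file_size(int fd) {
    portable_stat st;
    if (portable_fstat(fd, &st) != 0) return -1;
    return static_cast<int64_t>(st.st_size);
}

// Would fail to compile on WIN32 without NOMINMAX, because <windows.h>
// otherwise defines max(a, b) as a function-like macro.
inline int64_t clamp_size(int64_t n) { return std::max(n, int64_t{0}); }
```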

The code in this PR will likely not compile on Linux right now (it's also quite crappy, since I made rapid changes just to get it to compile). Still, for people using Visual Studio, I hope this PR lays some foundation for further development.

oKatanaaa · 2023-03-20T22:21:22Z

Also, I think it's time to add comments to the code (mmap specifically), as it is already becoming stupidly complicated and has a lot of magic constants. That makes it harder to contribute to and maintain.

anzz1 · 2023-03-20T22:26:00Z

...
Again, if what is wanted is more processes and the ability to share state between them, a more general approach would be a C-style API with something simple like struct state{...}, save_state(*state), load_state(*state). Any implementation could then live in a separate module and use those general functions to manipulate the state however it wishes, and this would keep the main program free of non-portable code.

Originally posted by @anzz1 in #278 (comment)

I do not think replacing the default memory allocation routines is the right way forward. If sharing memory between processes were to be implemented, it should go through a simple C API outside the main program, not be baked deep inside it, where it is hard to maintain and hard to remove if the functionality is not wanted. Any functionality that adds complexity or reduces portability should be implemented as an easy-to-remove module rather than baked in.
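A minimal sketch of the suggested C-style API, assuming a flat state struct with no pointers (which is also what makes a raw byte copy safe to reload). The field names and the FILE*-based signatures are hypothetical; the comment only specifies struct state{...}, save_state and load_state.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Hypothetical plain-data state; the fields are illustrative only.
// No pointers, so the bytes mean the same thing after reloading.
struct state {
    uint32_t version;
    uint32_t n_tokens;
    float    temperature;
};

// C linkage keeps the API usable from a separate C or C++ module.
// The caller decides where the bytes go. Both return 0 on success.
extern "C" int save_state(const struct state* s, FILE* out) {
    return std::fwrite(s, sizeof *s, 1, out) == 1 ? 0 : -1;
}

extern "C" int load_state(struct state* s, FILE* in) {
    return std::fread(s, sizeof *s, 1, in) == 1 ? 0 : -1;
}
```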

niclimcy · 2023-03-20T23:45:14Z

You mean something like importing https://github.com/alitrack/mman-win32 ?

anzz1 · 2023-03-21T00:52:37Z

Nope, quite the opposite: steer clear of any non-portable code, imported libraries, or dependencies inside the main program, and implement functionality like this as a module that lives outside it. Such a module could be easily separated from the main code and used (or not used) by adding a compiler flag and including the module's .cpp file in the project.
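The compile-flag pattern described here might look like the sketch below: the main program sees one declaration, and a flag picks which definition is built. `LLAMA_USE_MMAP_MODULE` and `load_model_bytes` are hypothetical names, and the mmap module itself is deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

// The only thing main() depends on.
bool load_model_bytes(const char* path, std::vector<unsigned char>& out);

#ifdef LLAMA_USE_MMAP_MODULE
// The mmap implementation would live in its own .cpp file that is added to
// the build only when this flag is set (omitted from this sketch).
#else
// Portable default: plain stdio, no platform headers.
bool load_model_bytes(const char* path, std::vector<unsigned char>& out) {
    FILE* f = std::fopen(path, "rb");
    if (!f) return false;
    out.clear();
    unsigned char buf[4096];
    size_t n;
    while ((n = std::fread(buf, 1, sizeof buf, f)) > 0)
        out.insert(out.end(), buf, buf + n);
    bool ok = !std::ferror(f);
    std::fclose(f);
    return ok;
}
#endif
```

Dropping the module then means deleting one file and one flag, with no changes to main().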

Explained in more detail in here: #23 (comment)

niclimcy · 2023-03-21T01:24:44Z

The CMake script is broken: it no longer detects the MSVC compiler.

oKatanaaa · 2023-03-21T12:33:23Z

@anzz1 This PR is not about multiprocessing or sharing memory, but about speeding up model loading via a memory-mapped file (see #91 for more details). That said, I do agree that all the fancy stuff should be kept outside of main to keep everything portable and maintainable.

oKatanaaa · 2023-03-21T12:41:06Z

@nicknitewolf I looked into mman-win32 and tried to use the source code. Unfortunately it doesn't work. It is pretty similar to what @jart had already written but breaks without Justine's tweaks. I tried to adapt it to match the implementations, but no luck.

Sadly I'm out of options at the moment; I don't have enough expertise for this kind of stuff. Unless someone who knows mmap on Windows can fix this, the only option is to fall back to the default loader on Windows.

anzz1 · 2023-03-21T21:31:08Z

Yes, but the goal of accelerating the loading via a memory-mapped file is the ability to preload a model to memory, thus sharing it. Let's not argue about semantics :D

In any case, a C-style API is now being implemented: #370
This will make this much cleaner and easier, since it can be implemented as a module and the default CRT memory functions do not need to be replaced. There are now llama_init_from_file and llama_free functions!

CoderRC · 2023-03-28T01:10:13Z

I have created my own library that implements mmap using MinGW32, which makes this project maintainable on Windows. You can compile the program with the library from https://github.com/CoderRC/libmingw32_extended by making changes like those in #564 and running this make command:
make LDFLAGS='-D_POSIX_MAPPED_FILES -lmingw32_extended'

jart

Good evening! I'm back on the job. I'm on a Windows computer. I've just installed MSVC 2022.

Wow! This was so much more than I anticipated, when I suggested you contribute this. I'm really happy with what I'm seeing. I'm going to run your code locally and see how close it is to being in a working state. If it's close, then I would feel comfortable merging this as you've written it, and then pushing a quick change addressing any regressions on Linux / etc. just in case anyone's tracking our mmap branch.

Then, depending on how quickly I can confirm MSVC is working properly, I'll create a separate pull request, to merge mmap into master.

If you're working this evening, then please feel free to push away, addressing the comments below, before I merge.

Also I think it's time to add commentaries for the code (mmap specifically), as it already becomes stupidly complicated and has a lot of magic constants. It simply becomes harder to contribute and maintain.

Magic numbers don't need to be maintained. WIN32 has a strong API stability promise. My goal with mmap.h has been to tuck all that stuff away, hidden within a magical header, that polyfills standard-looking POSIX code, so that WIN32 ideally needn't concern anyone in the future who works on this project. I can see from this change you get that, and you've done a good job improving upon it.
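To make the "polyfill standard-looking POSIX code" idea concrete, here is a deliberately minimal read-only mmap/munmap shim for WIN32. It is a sketch under stated assumptions, not the branch's mmap.h: it supports only PROT_READ with MAP_PRIVATE and a null address, the file must be non-empty (CreateFileMapping rejects a zero-size mapping), and a non-zero offset must be a multiple of the allocation granularity (64 KiB on typical systems). `map_readonly` is a hypothetical caller.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

#ifdef _WIN32
#  ifndef NOMINMAX
#    define NOMINMAX
#  endif
#  include <windows.h>
#  include <io.h>
#  include <cstdint>
#  define PROT_READ   1
#  define MAP_PRIVATE 2
#  define MAP_FAILED  ((void*)-1)

static void* mmap(void* addr, size_t len, int prot, int flags, int fd, long long off) {
    if (addr || prot != PROT_READ || flags != MAP_PRIVATE) return MAP_FAILED;
    HANDLE file = (HANDLE)_get_osfhandle(fd);
    if (file == INVALID_HANDLE_VALUE) return MAP_FAILED;
    HANDLE m = CreateFileMappingA(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
    if (!m) return MAP_FAILED;
    void* p = MapViewOfFile(m, FILE_MAP_READ, (DWORD)((uint64_t)off >> 32),
                            (DWORD)(off & 0xffffffff), len);
    CloseHandle(m);  // the mapped view keeps the mapping object alive
    return p ? p : MAP_FAILED;
}

static int munmap(void* addr, size_t) {
    return UnmapViewOfFile(addr) ? 0 : -1;
}
#else
#  include <sys/mman.h>
#endif

// Caller code is identical on both platforms.
inline const unsigned char* map_readonly(int fd, size_t len) {
    void* p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
    return p == MAP_FAILED ? nullptr : static_cast<const unsigned char*>(p);
}
```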

jart · 2023-03-28T01:10:57Z

main.cpp

 };

+static void winMSync(magic* addr, size_t len_bytes) {


Oh nice! Having msync() on WIN32 might not have been strictly necessary. I mostly added it because POSIX has rules about needing msync() which mostly only apply to less common UNIX platforms (for example, OpenBSD lacks read() + memory coherency) but this is really nice to have, especially since it has the error reporting code.
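On the POSIX side, the behaviour winMSync emulates is msync() on a MAP_SHARED mapping, with errors reported rather than ignored. A sketch assuming Linux, where the page cache is coherent with read(), so the check below does not depend on MS_ASYNC having finished (the OpenBSD caveat above is exactly why portable code calls msync at all). The helper name is hypothetical, and the file must already be at least `len` bytes long.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <sys/mman.h>
#include <unistd.h>

// Write msg into the start of a shared mapping of fd and schedule write-back.
inline bool write_through_mapping(int fd, const char* msg, size_t len) {
    void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { std::perror("mmap"); return false; }
    std::memcpy(p, msg, len);
    bool ok = msync(p, len, MS_ASYNC) == 0;  // MS_ASYNC: do not wait for the disk
    if (!ok) std::perror("msync");           // surface the error, as winMSync does
    munmap(p, len);
    return ok;
}
```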

jart · 2023-03-28T01:12:36Z

main.cpp

@@ -182,9 +227,9 @@ void *memalign(size_t a, size_t n) {
    i = i + sizeof(size_t);
    i = ROUNDUP(i, a);
    j = ROUNDUP(i + m, MAGIC_GRAN);
-    if (j > mag->capacity) {
+    //if (j > mag->capacity) {


This line surprises me and I suppose it relates to why this is still a draft. Since wouldn't this cause us to blow away and recreate mappings?

Oh, that's most likely draft code. I was just trying out different things based on my intuition about what's happening; not sure how it got into the PR.
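To see why that capacity check matters, here is a simplified bump allocator with the same shape as the memalign() in the diff: reserve a size_t header, round the offset up to the alignment, store the requested size just before the returned pointer, and refuse to run past capacity. It is a hypothetical fixed-size arena, not main.cpp's code: where this sketch returns nullptr, the original presumably extends the mapping (rounded to MAGIC_GRAN). The arena base must itself be aligned to the largest alignment requested.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Round n up to a multiple of a (a must be a power of two), like ROUNDUP.
constexpr size_t round_up(size_t n, size_t a) { return (n + a - 1) & ~(a - 1); }

struct Arena {
    unsigned char* base;
    size_t capacity;
    size_t offset;
};

inline void* arena_memalign(Arena* ar, size_t a, size_t n) {
    assert(a != 0 && (a & (a - 1)) == 0);          // power-of-two alignment
    if (a < alignof(size_t)) a = alignof(size_t);  // header slot stays size_t-aligned
    size_t i = round_up(ar->offset + sizeof(size_t), a);
    size_t end = i + n;
    if (end > ar->capacity) return nullptr;        // skipping this check overruns the arena
    ar->offset = end;
    unsigned char* p = ar->base + i;
    std::memcpy(p - sizeof(size_t), &n, sizeof n); // size lives at ((size_t*)p)[-1]
    return p;
}

inline size_t arena_block_size(const void* p) {
    size_t n;
    std::memcpy(&n, static_cast<const unsigned char*>(p) - sizeof(size_t), sizeof n);
    return n;
}
```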

jart · 2023-03-28T01:14:06Z

main.cpp

    mag->offset = i + m;
    spin_unlock(mag->lock);
    p = MAGIC_ADDR + i;
    ((size_t *)p)[-1] = n;
    return p;
 }

-void *malloc(size_t n) {
+void *_malloc(size_t n) {


Ah shucks! MSVC won't let us override its own malloc implementation in the linkage process? That's important to have since C++ STL depends on libc malloc(). Without this, we'd need to explicitly override the STL allocators. I'm going to see if I can work some magic here persuading the compiler to restore this.

Yep, that was actually the first thing I tried to figure out. I haven't checked yet, but maybe we could make MSVC not link against binaries that define malloc? Not sure whether that's possible, though.

I have a working solution based off this. The trick I used is to override operator new and then, to avoid changing ggml.c I supply a custom allocated array to its initializer.
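Replacing the global operator new and operator delete is standard C++, so unlike overriding malloc it does not fight the CRT at link time. The sketch below only counts calls and forwards to std::malloc; `g_new_calls` is hypothetical instrumentation and stands in for whatever arena the real fix routes to. The ggml half of the trick is not shown: in ggml versions from that time, ggml_init_params had a mem_buffer field that lets the caller supply the memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

static size_t g_new_calls = 0;  // hypothetical instrumentation

// Global replacements: every new-expression in the program now lands here.
void* operator new(std::size_t n) {
    ++g_new_calls;
    if (void* p = std::malloc(n ? n : 1)) return p;
    throw std::bad_alloc();
}

void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }
```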

jart · 2023-03-28T01:15:35Z

main.cpp

+#if defined(malloc)
+# undef malloc
+#endif
+#define malloc(x) _malloc(x)


With something like this, you could probably get away with moving these to the top of the file if you pulled the trick

#define malloc(...) _malloc(__VA_ARGS__)

That way you could avoid having to rename the symbols above.
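A sketch of the variadic-macro trick: once the macro is defined, a definition written as `malloc(size_t n)` is itself compiled as the replacement, so nothing has to be renamed and the CRT's malloc never collides with it. System headers are included before the macro so they keep the real declaration; any later `std::malloc(` spelling would break, which is why this only "probably" works at the very top of a file. `counting_malloc` is a hypothetical name (the thread uses _malloc; names beginning with an underscore are reserved at global scope in C++), and calloc stands in for the mmap-backed arena.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>   // system headers first: they keep the real malloc

#if defined(malloc)
#  undef malloc
#endif
#define malloc(...) counting_malloc(__VA_ARGS__)

static size_t g_malloc_calls = 0;

// Written as "malloc", compiled as counting_malloc.
static void* malloc(size_t n) {
    ++g_malloc_calls;
    return std::calloc(1, n ? n : 1);  // stand-in for the mmap-backed arena
}

inline char* dup_bytes(const char* s, size_t n) {
    char* p = static_cast<char*>(malloc(n));  // really counting_malloc(n)
    if (p)
        for (size_t i = 0; i < n; ++i) p[i] = s[i];
    return p;
}
```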

jart · 2023-03-28T01:16:30Z

mmap.h

+static int WinMadvise(char* fd, size_t length, int flags) {
+    auto p_handle = GetCurrentProcess();
+    struct _WIN32_MEMORY_RANGE_ENTRY entry((void*)fd, length);
+    bool success = PrefetchVirtualMemory(p_handle, 1, &entry, 0);


Once again, nice.
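A portable wrapper around this idea might look like the following sketch (`prefetch_range` is a hypothetical name). Two details differ from the diff: the range entry is filled field by field, since `entry((void*)fd, length)` is parenthesized aggregate initialization, which needs C++20; and the parameter is a pointer, not a file descriptor. PrefetchVirtualMemory requires Windows 8 or later (_WIN32_WINNT >= 0x0602); on POSIX the equivalent hint is madvise with MADV_WILLNEED, which requires a page-aligned address.

```cpp
#include <cassert>
#include <cstddef>

#ifdef _WIN32
#  ifndef NOMINMAX
#    define NOMINMAX
#  endif
#  include <windows.h>
#else
#  include <sys/mman.h>
#endif

// Hint that [addr, addr + len) will be read soon. Returns 0 on success.
inline int prefetch_range(void* addr, size_t len) {
#ifdef _WIN32
    WIN32_MEMORY_RANGE_ENTRY entry;
    entry.VirtualAddress = addr;
    entry.NumberOfBytes = len;
    return PrefetchVirtualMemory(GetCurrentProcess(), 1, &entry, 0) ? 0 : -1;
#else
    return madvise(addr, len, MADV_WILLNEED);
#endif
}
```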

jart · 2023-03-28T01:16:48Z

mmap.h

+        LPVOID lpMsgBuf;
+        LPVOID lpDisplayBuf;
+        DWORD error_code = GetLastError();
+        FormatMessage(


We'll probably want to create a helper function for this error code.
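One possible shape for that helper, as a sketch with a hypothetical name: on WIN32 it wraps GetLastError and FormatMessageA with FORMAT_MESSAGE_ALLOCATE_BUFFER (freeing the buffer with LocalFree). On POSIX it falls back to errno and strerror so both platforms share one call site.

```cpp
#include <cassert>
#include <string>

#ifdef _WIN32
#  ifndef NOMINMAX
#    define NOMINMAX
#  endif
#  include <windows.h>
#else
#  include <cerrno>
#  include <cstring>
#endif

// Turn the last OS error into "<what> failed with error <code>: <text>".
inline std::string last_error_message(const char* what) {
#ifdef _WIN32
    DWORD code = GetLastError();
    char* text = nullptr;
    DWORD n = FormatMessageA(FORMAT_MESSAGE_ALLOCATE_BUFFER | FORMAT_MESSAGE_FROM_SYSTEM |
                                 FORMAT_MESSAGE_IGNORE_INSERTS,
                             nullptr, code, 0, reinterpret_cast<char*>(&text), 0, nullptr);
    std::string msg = std::string(what) + " failed with error " + std::to_string(code) +
                      ": " + (n ? std::string(text, n) : std::string("(unknown)"));
    LocalFree(text);
    return msg;
#else
    int code = errno;
    return std::string(what) + " failed with error " + std::to_string(code) + ": " +
           std::strerror(code);
#endif
}
```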

jart · 2023-03-28T01:21:12Z

mmap.h

+            LocalSize(lpDisplayBuf) / sizeof(TCHAR),
+            TEXT("failed with error %d: %s"),
+            error_code, lpMsgBuf);
+        fprintf(stderr, (char*)lpDisplayBuf);


W.r.t. fprintf, I'd say if you're already going to the trouble of pulling out all of the above WIN32 functions, then just go for the gold and use WriteFile(GetStdHandle(STD_ERROR_HANDLE)).

[side commentary] In my work on Cosmopolitan Libc, one thing I love doing for instance, is ensuring that Cosmo never depends on the MSVC Libc, since it has a history of bundling things like telemetry, plus there's like 10 different Microsoft Libc's to choose from. But linking just KERNEL32 is awesome when it's possible, which is what Cosmo apps do.
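Following that suggestion, stderr output can bypass the C runtime on WIN32 via WriteFile(GetStdHandle(STD_ERROR_HANDLE)), with write(2) as the POSIX fallback. Writing raw bytes also avoids a real bug in the reviewed code: `fprintf(stderr, (char*)lpDisplayBuf)` treats the message as a format string, which misbehaves if it contains a '%'. In a UNICODE build, the TEXT()/TCHAR buffer would also be wide characters, not char. The helper name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

#ifdef _WIN32
#  ifndef NOMINMAX
#    define NOMINMAX
#  endif
#  include <windows.h>
#else
#  include <unistd.h>
#endif

// Write raw bytes to stderr; the message is never used as a format string.
// Returns the number of bytes written, or -1 on error.
inline long write_stderr(const char* msg, size_t len) {
#ifdef _WIN32
    DWORD written = 0;
    if (!WriteFile(GetStdHandle(STD_ERROR_HANDLE), msg, (DWORD)len, &written, nullptr))
        return -1;
    return (long)written;
#else
    ssize_t n = write(STDERR_FILENO, msg, len);
    return n < 0 ? -1 : (long)n;
#endif
}

inline long write_stderr(const char* msg) { return write_stderr(msg, std::strlen(msg)); }
```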

j-f1 · 2023-03-28T12:22:42Z

.gitignore

+*.msp
+
+# JetBrains Rider
+*.sln.iml


Does all of this stuff need to be in the gitignore? (eg I see ignores for F# which this project definitely does not use). Would it be possible to limit it to just the things that are actually likely to appear in people’s worktrees?

I agree. I'm about to merge this and push my fixes right afterwards to our mmap dev branch. I'll do my best to trim this down in the follow-up commit.

Suggestion incorporated into cbddf46. I don't think we needed to change .gitignore at all. At least not with MSVC 2022 using CMake.

jart

This change is a WIP but it's going into a dev branch. I think this is worth merging. I'll push the fixes I found to the problems we encountered right afterwards. Thank you!

jart · 2023-03-28T16:03:47Z

main.cpp

    mag->offset = i + m;
    spin_unlock(mag->lock);
    p = MAGIC_ADDR + i;
    ((size_t *)p)[-1] = n;
    return p;
 }

-void *malloc(size_t n) {
+void *_malloc(size_t n) {


I have a working solution based off this. The trick I used is to override operator new and then, to avoid changing ggml.c I supply a custom allocated array to its initializer.

jart · 2023-03-28T16:04:58Z

.gitignore

+*.msp
+
+# JetBrains Rider
+*.sln.iml


I agree. I'm about to merge this and push my fixes right afterwards to our mmap dev branch. I'll do my best to trim this down in the follow-up commit.

jart · 2023-03-28T17:17:16Z

I've cherry-picked this pull request onto the push I've just made to the mmap branch. Merging this on GitHub is no longer necessary. Thank you for your contribution!

fix: win map fixes, still not working

8793e7e

oKatanaaa mentioned this pull request Mar 20, 2023

Should use mmap for model loading #91

Closed

gjmulder added the bug Something isn't working label Mar 21, 2023

jart self-requested a review March 28, 2023 01:22

jart reviewed Mar 28, 2023

j-f1 reviewed Mar 28, 2023

jart approved these changes Mar 28, 2023

jart marked this pull request as ready for review March 28, 2023 16:07

jart pushed a commit that referenced this pull request Mar 28, 2023

Make WIN32 mmap() improvements (#341)

e488168

Still not fully working yet. Closes #341

jart added a commit that referenced this pull request Mar 28, 2023

Get mmap() working with WIN32 MSVC

cbddf46

- We have pretty high quality POSIX polyfills now
- We no longer need to override malloc()

Tracked by issue #91
Improves upon #341

jart closed this Mar 28, 2023
