Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MMAP for Windows (not working atm) #341

Closed
wants to merge 1 commit into from
Closed

Conversation

oKatanaaa
Copy link

@oKatanaaa oKatanaaa commented Mar 20, 2023

The code in PR lets you run llama the first time, but the second time the program crashes. This is due to memory access violation when trying to access any data related to the model or vocab. I don't know why, but current implementation cannot serialize pointers in the magic properly. I suspect this is due to allocating them using new operator which generates addresses outside of mapped memory range, guess on the second run those pointers are simply not interpretable and point to nowhere.

Major changes:

  • custom malloc renamed to _malloc. Otherwise linker is complaining when linking against CRT as there are multiple definitions for malloc. I tried to do undef malloc to remove references for the original one, but did not verify yet if it makes any difference.
  • WinMap fix, set access flag to FILE_MAP_COPY when loading second time. See https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-mapviewoffile.
  • Implement madvise for win. I can't verify if this change makes any difference as the pointers serialization is broken as of now, but it seems to be working.
  • Implement msync for win. Same comment as for madvise.

Some of the fixes to make it compile include:

  • #define NOMINMAX
  • using appropriate stat structure for win.
  • define MS_ASYNC for win.

This PR code will likely not compile on Linux right now (and it's also quite crappy as I did rapid changes just to make it compile at least). So if there are people using Visual Studio, I hope this PR will lay some foundation for further development.

@oKatanaaa
Copy link
Author

Also I think it's time to add commentaries for the code (mmap specifically), as it already becomes stupidly complicated and has a lot of magic constants. It simply becomes harder to contribute and maintain.

@anzz1
Copy link
Contributor

anzz1 commented Mar 20, 2023

...
Again if the case is that more processes is what is wanted and an ability to share the state between them, a more general approach would be making a C-style API with something simple like struct state{...} , save_state(*state), load_state(*state). Then any implementation could just live as a separate module and use those general funcs to manipulate the state however they wish, and this would keep the main program clean of any non-portable code.

Originally posted by @anzz1 in #278 (comment)

I do not think replacing the default memory allocation routines is the correct way forward. If sharing memory between processes were to be implemented, it should be made through a simple C api outside the main program and not deeply baked inside the main program which makes it hard to maintain and remove if the functionality is not wanted. Any such functionality which strays towards complication and less portability should be implemented as an easy to remove module rather than baking it in.

@niclimcy
Copy link

...
Again if the case is that more processes is what is wanted and an ability to share the state between them, a more general approach would be making a C-style API with something simple like struct state{...} , save_state(*state), load_state(*state). Then any implementation could just live as a separate module and use those general funcs to manipulate the state however they wish, and this would keep the main program clean of any non-portable code.

Originally posted by @anzz1 in #278 (comment)

I do not think replacing the default memory allocation routines is the correct way forward. If sharing memory between processes were to be implemented, it should be made through a simple C api outside the main program and not deeply baked inside the main program which makes it hard to maintain and remove if the functionality is not wanted. Any such functionality which strays towards complication and less portability should be implemented as an easy to remove module rather than baking it in.

You mean something like importing https://github.com/alitrack/mman-win32 ?

@anzz1
Copy link
Contributor

anzz1 commented Mar 21, 2023

You mean something like importing https://github.com/alitrack/mman-win32 ?

Nope, quite the opposite, steering clear of any non-portable code, imported libraries or dependencies inside the main program, and have any functionality like this rather implemented as a module which would live outside the main program. Being easily separated from the main code and used (or not used) by adding a compiler flag and including the module's .cpp file in the project.

Explained in more detail in here: #23 (comment)

@niclimcy
Copy link

cmake script is broken, no longer detects msvc compiler

@oKatanaaa
Copy link
Author

...
Again if the case is that more processes is what is wanted and an ability to share the state between them, a more general approach would be making a C-style API with something simple like struct state{...} , save_state(*state), load_state(*state). Then any implementation could just live as a separate module and use those general funcs to manipulate the state however they wish, and this would keep the main program clean of any non-portable code.

Originally posted by @anzz1 in #278 (comment)

I do not think replacing the default memory allocation routines is the correct way forward. If sharing memory between processes were to be implemented, it should be made through a simple C api outside the main program and not deeply baked inside the main program which makes it hard to maintain and remove if the functionality is not wanted. Any such functionality which strays towards complication and less portability should be implemented as an easy to remove module rather than baking it in.

@anzz1 This PR is not about multiprocessing or sharing memory but rather accelerating the loading of the model via a memory mapped file (see #91 for more details). Though I do agree with your point that all the fancy stuff should be kept outside of main to make everythin portable/maintainable.

@oKatanaaa
Copy link
Author

@nicknitewolf I looked into mman-win32 and tried to use the source code. Unfortunately it doesn't work. It is pretty similar to what @jart had already written but breaks without Justine's tweaks. I tried to adapt it to match the implementations, but no luck.

Sadly I'm out of options at this moment, not enough expertise for this kind of stuff. Unless there will be a person knowing mmap on win who can fix this, the only option is to fall back to default load on Windows.

@gjmulder gjmulder added the bug Something isn't working label Mar 21, 2023
@anzz1
Copy link
Contributor

anzz1 commented Mar 21, 2023

@anzz1 This PR is not about multiprocessing or sharing memory but rather accelerating the loading of the model via a memory mapped file (see #91 for more details). Though I do agree with your point that all the fancy stuff should be kept outside of main to make everythin portable/maintainable.

Yes, but the goal of accelerating the loading via a memory-mapped file is the ability to preload a model to memory, thus sharing it. Let's not argue about semantics :D

In any case, a C-style API is now being implemented: #370
This will make implementing this way cleaner and easier as it can be implemented as a module and the default crt memory funcs do not need to be replaced. There are now llama_init_from_file and llama_free functions!

@CoderRC
Copy link

CoderRC commented Mar 28, 2023

I have created my own library that implements mmap using mingw32 that makes this project maintainable for windows. It is possible to compile the program using library from https://github.com/CoderRC/libmingw32_extended, make changes like in #564 and the specific make command below:
make LDFLAGS='-D_POSIX_MAPPED_FILES -lmingw32_extended'

@jart jart self-requested a review March 28, 2023 01:22
Copy link
Contributor

@jart jart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good evening! I'm back on the job. I'm on a Windows computer. I've just installed MSVC 2022.

Wow! This was so much more than I anticipated, when I suggested you contribute this. I'm really happy with what I'm seeing. I'm going to run your code locally and see how close it is to being in a working state. If it's close, then I would feel comfortable merging this as you've written it, and then pushing a quick change addressing any regressions on Linux / etc. just in case anyone's tracking our mmap branch.

Then, depending on how quickly I can confirm MSVC is working properly, I'll create a separate pull request, to merge mmap into master.

If you're working this evening, then please feel free to push away, addressing the comments below, before I merge.

Also I think it's time to add commentaries for the code (mmap specifically), as it already becomes stupidly complicated and has a lot of magic constants. It simply becomes harder to contribute and maintain.

Magic numbers don't need to be maintained. WIN32 has a strong API stability promise. My goal with mmap.h has been to tuck all that stuff away, hidden within a magical header, that polyfills standard-looking POSIX code, so that WIN32 ideally needn't concern anyone in the future who works on this project. I can see from this change you get that, and you've done a good job improving upon it.

};

static void winMSync(magic* addr, size_t len_bytes) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh nice! Having msync() on WIN32 might not have been strictly necessary. I mostly added it because POSIX has rules about needing msync() which mostly only apply to less common UNIX platforms (for example, OpenBSD lacks read() + memory coherency) but this is really nice to have, especially since it has the error reporting code.

@@ -182,9 +227,9 @@ void *memalign(size_t a, size_t n) {
i = i + sizeof(size_t);
i = ROUNDUP(i, a);
j = ROUNDUP(i + m, MAGIC_GRAN);
if (j > mag->capacity) {
//if (j > mag->capacity) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line surprises me and I suppose it relates to why this is still a draft. Since wouldn't this cause us to blow away and recreate mappings?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, most likely it is the draft code. I just was trying out different things depending on my intuition of what's happening, not sure how it got into PR

mag->offset = i + m;
spin_unlock(mag->lock);
p = MAGIC_ADDR + i;
((size_t *)p)[-1] = n;
return p;
}

void *malloc(size_t n) {
void *_malloc(size_t n) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah shucks! MSVC won't let us override its own malloc implementation in the linkage process? That's important to have since C++ STL depends on libc malloc(). Without this, we'd need to explicitly override the STL allocators. I'm going to see if I can work some magic here persuading the compiler to restore this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that actually was the first thing I was trying to figure out. Haven't checked it out yet, but maybe we could make MSVC not link against binaries defining malloc? Not sure if it's possible though

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a working solution based off this. The trick I used is to override operator new and then, to avoid changing ggml.c I supply a custom allocated array to its initializer.

#if defined(malloc)
# undef malloc
#endif
#define malloc(x) _malloc(x)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With something like this, you could probably get away with moving these to the top of the file if you pulled the trick

#define malloc(...) _malloc(__VA_ARGS__)

That way you could avoid having to rename the symbols above.

static int WinMadvise(char* fd, size_t length, int flags) {
auto p_handle = GetCurrentProcess();
struct _WIN32_MEMORY_RANGE_ENTRY entry((void*)fd, length);
bool success = PrefetchVirtualMemory(p_handle, 1, &entry, 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once again, nice.

LPVOID lpMsgBuf;
LPVOID lpDisplayBuf;
DWORD error_code = GetLastError();
FormatMessage(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll probably want to create a helper function for this error code.

LocalSize(lpDisplayBuf) / sizeof(TCHAR),
TEXT("failed with error %d: %s"),
error_code, lpMsgBuf);
fprintf(stderr, (char*)lpDisplayBuf);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

W.r.t. fprintf, I'd say if you're already going to the trouble of pulling out all of the above WIN32 functions, then just go for the gold and use WriteFile(GetStdHandle(STD_ERROR_HANDLE)).

[side commentary] In my work on Cosmopolitan Libc, one thing I love doing for instance, is ensuring that Cosmo never depends on the MSVC Libc, since it has a history of bundling things like telemetry, plus there's like 10 different Microsoft Libc's to choose from. But linking just KERNEL32 is awesome when it's possible, which is what Cosmo apps do.

*.msp

# JetBrains Rider
*.sln.iml
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does all of this stuff need to be in the gitignore? (eg I see ignores for F# which this project definitely does not use). Would it be possible to limit it to just the things that are actually likely to appear in people’s worktrees?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I'm about to merge this and push my fixes right afterwards to our mmap dev branch. I'll do my best to trim this down in the follow-up commit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion incorporated into cbddf46. I don't think we needed to change .gitignore at all. At least not with MSVC 2022 using CMake.

Copy link
Contributor

@jart jart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is a WIP but it's going into a dev branch. I think this is worth merging. I'll push the fixes I found to the problems we encountered right afterwards. Thank you!

mag->offset = i + m;
spin_unlock(mag->lock);
p = MAGIC_ADDR + i;
((size_t *)p)[-1] = n;
return p;
}

void *malloc(size_t n) {
void *_malloc(size_t n) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a working solution based off this. The trick I used is to override operator new and then, to avoid changing ggml.c I supply a custom allocated array to its initializer.

*.msp

# JetBrains Rider
*.sln.iml
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I'm about to merge this and push my fixes right afterwards to our mmap dev branch. I'll do my best to trim this down in the follow-up commit.

@jart jart marked this pull request as ready for review March 28, 2023 16:07
jart pushed a commit that referenced this pull request Mar 28, 2023
Still not fully working yet.

Closes #341
jart added a commit that referenced this pull request Mar 28, 2023
- We have pretty high quality POSIX polyfills now
- We no longer need to override malloc()

Tracked by issue #91
Improves upon #341
@jart
Copy link
Contributor

jart commented Mar 28, 2023

I've cherry-picked this pull request onto the pushed I've just made to the mmap branch. Merging this on GitHub is no longer necessary. Thank you for your contribution!

@jart jart closed this Mar 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants