Restore training conditionals for PLKSR, RealPLKSR, SPAN #276
Conversation
How do you like the idea of cleaning up the SPAN arch a little at the same time?
https://github.com/the-database/spandrel/blob/restoretraining/libs/spandrel/spandrel/architectures/SPAN/arch/span.py#L233
https://github.com/the-database/spandrel/blob/restoretraining/libs/spandrel/spandrel/architectures/SPAN/arch/span.py#L209
I think we first have to answer the question of whether spandrel arches should include training code. Please consider:
My personal opinion:
Auto's use-case is using spandrel for all relevant arch code in traiNNer-redux, which gets us guaranteed compatibility between trained models and spandrel due to the coupling. This isn't possible if the model arch needs to be modified for something training-specific. These things are essentially part of the architecture code; we got rid of them because we weren't using spandrel for training, but that doesn't mean nobody is going to want to. And we broke the archs because of that.
@joeyballentine Sorry for wording my question in a confusing way. My question wasn't what use cases there are for including training code, but whether supporting training models is part of the scope/feature set of spandrel. If we want to support training, then we also have to do it properly. Right now:
While it isn't feasible to test actual training in CI, we have to at least test/verify the stability of the public API. Otherwise, we're just going to break someone's code in an update.
Tbh I really don't think it's that big of a deal and I think you're overthinking this. I don't necessarily think we should advertise training as a feature. Spandrel is an inference library first and foremost. I just don't think we should purposefully break the ability for people to train only certain archs using spandrel as a backend, if they know what they're doing and know how to do that.
And in terms of us "breaking training" in the future, the only way we could possibly do that is by getting rid of arch code like we did. But it's not like we're going to be doing that all the time with existing arches. We'd only be doing that with new ones, which we can then fix later if they're reported as untrainable.
Joey, I'm not talking about advertising training to normal users. I'm talking about API stability. Right now, not even importing arch classes is part of the stable API. E.g. this is the stable API for DAT and this is what traiNNer-redux does to import DAT. What traiNNer-redux does currently only works because Python doesn't have a way to limit the scope of import symbols, so those classes get leaked.
It's literally trivial to break training right now: rename an arch class, rename an arch class argument, remove an unused parameter.
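To make the concern concrete, here is a rough sketch of the kind of trainer-side code in question. The DAT import path and the constructor argument shown are assumptions based on the leaked symbols discussed above, not a guaranteed spandrel API:

```python
import torch
import torch.nn.functional as F

# Works today only because the class symbol leaks out of the module;
# it is not part of spandrel's stable API and could change in any release.
from spandrel.architectures.DAT import DAT

model = DAT(upscale=2)  # constructor argument names are likewise not guaranteed
model.train()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

lr_img = torch.randn(1, 3, 64, 64)    # stand-in low-res batch
hr_img = torch.randn(1, 3, 128, 128)  # matching 2x high-res target

optimizer.zero_grad()
loss = F.l1_loss(model(lr_img), hr_img)
loss.backward()
optimizer.step()
```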
I guess I still just don't see what the issue is. We can make breaking changes for anything at any time, why is this different? It's why we have versioning. I just don't see why we can't just at least merge this in order to make this use case work. You haven't given me a good reason other than "we might break it in the future", which, as long as the-database doesn't mind, I don't see as a problem.
Joey, what is a breaking change? The answer is: "anything that breaks API guarantees." There are no breaking changes when you don't have an API. This is why e.g. ESLint has a huge list of rules for what changes constitute which version bump (patch, minor, major). The whole point of semantic versioning is that version bumps have meaning that relate to guarantees. Anything that isn't guaranteed doesn't even constitute a patch.
Similarly, anything that isn't guaranteed also means that it can't be broken/incomplete. So what's the point of this PR if we don't support training?
Plus, "we might break it in the future" isn't what I was talking about. My point is that "we might break it in the future and not even know about it." If you can't even know whether a feature works, then you are just blindly walking in the dark hoping that nothing bad happens. But here it's even worse, because you don't even know whether just accessing a feature works. This is just going to cause future headaches for everyone. And at the risk of repeating myself: My point isn't that spandrel shouldn't support training, but that if we want spandrel to support training, then we have to do it right. This PR can be the start of this, but it can't be the end of it. This is why I want to hammer down what "spandrel supports training" even means. |
That's what this PR fixes.
So?
So does the rest of the arch code. It's quite literally just code that is an implementation detail that we have no clue how it works. We just wrap it in a cool way to make it work good
IMO, it means "you can import an arch from spandrel and train it". Simple as that. It does not mean we need to have any kind of training code whatsoever. Anyone writing a program for training already knows how to do that. You know how musl keeps breaking models in neosr, which makes models trained there incompatible with spandrel? We basically did the inverse by removing these bits of training code. This is code that needs to be there in order for the arch to be trained properly. If we are going to say "spandrel is the source of truth for this arch for all trainers, and any changes you make are considered breaking changes to an arch", but the trainers can't even use the arch from spandrel, then that feels like a really silly thing for us to claim. How about we actually make the archs trainable, and that way someone like the-database can actually use this source of truth for training and make 100% compatible models.
Different implementation detail. I'm obviously referring to our implementation details. E.g. our class names.
It literally doesn't. We already talked about some arches being designed to be trained in clusters or needing additional deps to train.
See my previous sentence as to why "any arch" doesn't work. This is why we have to document this properly.
To insert my 2 cents: just because you use spandrel for training does not mean models will automatically be compatible with it. In architectures, untracked parameters are often pulled out into dynamic parameters, and using spandrel doesn't affect compatibility in any way there. An architecture can be made as compatible as possible by simply keeping the original code and disabling the use of some parameters. If the author of the training framework doesn't change the architecture himself, compatibility won't be affected.
If by training in clusters you mean multi-GPU over a network, that literally has nothing to do with the architecture code. Otherwise, I have no idea what this means. As I explained earlier, that kind of stuff is irrelevant to the arch itself (you can train any network with multi-GPU/networking, you can use any config stuff you want for your training code, etc.) and we don't need to care about it. I don't know why you're so hung up on that stuff. It's like saying that since we based some of our filter code off of paint.net, you're wondering if we need to include their entire GUI -- it's simply irrelevant for what we're using from the originals.
Sure. But I'm not sure how any kind of abstraction specifically meant for training would help in this regard anyway. Regardless, to initialize the network you want, there will be some amount of needing to use some kind of arbitrary name of something. I get that you don't want our implementation details to be used like this, but no matter what there will be some level of "this is the name of the thing, maybe we'll change it, maybe we won't, but you gotta initialize it like x or y". That's just the nature of something like this. It's completely different from inference, where we can auto-detect the model. It's a different use case.
Again, this is just wrong, assuming I understand what you mean by clustering and those extra deps. If I'm understanding incorrectly, please explain why those are necessary for the models to train correctly, i.e. why, if someone does not have those things, the model will turn out broken. Anyway, I'm pretty sure the-database is going to use spandrel regardless, so feel free to close this PR and he can just deal with adding the SPAN, etc. code back to his trainer. I just think this whole thing is ridiculous, so I'm out. @the-database if you really want this in, please argue from your side about why you want this.
Okay, so I looked through some arches, and it seems like I misremembered. I 100% know that I removed the networking code from some arch (because I had to spend like an hour figuring out how it worked to remove it), so it must have been an arch that I ended up discarding, because it didn't work out. So that's fine. Sorry for making a panic about this.
I'm not talking about abstractions or anything. It's way simpler. Right now, class names are not part of the public API. E.g. from spandrel.architectures.DAT import DAT isn't guaranteed to work right now, because there might not be an import symbol called "DAT" under the "DAT" module. And no, you can't assume that an architecture named "FOO" will have a "FOO" class, because there are architectures with multiple classes. Under what names should those classes be exported?
Simple stuff like this is what I was talking about the entire time. Same for the hyperparameters. I renamed/removed some hyperparameters of some architectures when I added them (e.g. some arches had unused parameters, and I removed them).
This is what defining an API for architectures is all about. We make guarantees so that those things are no longer mere implementation details. And we have to talk about these things, because I previously gave zero thought to them, because they didn't matter. So here are some concrete questions that have to be answered to define the API of spandrel's arch classes:
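For illustration only, one possible shape for such an export guarantee is a simple alias re-export in each architecture package's __init__.py. The file path and internal module name below are assumptions, not spandrel's actual layout:

```python
# spandrel/architectures/ESRGAN/__init__.py  (hypothetical layout)
from .arch.RRDB import RRDBNet  # internal class keeps its research name

# expose the colloquial name that trainers and chaiNNer already use informally
ESRGAN = RRDBNet

__all__ = ["ESRGAN", "RRDBNet"]
```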
All good
Ah ok, that makes sense. Stuff like ESRGAN being called RRDBNet? In that case, it probably does make sense to have some kind of way to import it as ESRGAN. I mean, we even directly import the arches in chaiNNer as well to do things like isinstance checks, so there's probably something simple we could do there to just export each arch under its normal colloquial name.
If they aren't used, I don't see why not. Generally speaking, I think it's actually a better thing that we clean those up, since then it's clearer what params actually can get used/changed for training.
Yeah that makes sense
Then yeah, we should probably have a longer conversation about it. My general thought though is just to do whatever is simplest for us to do while also allowing training. I don't really want to give us too much extra work either. And my thoughts on the rest of the questions:
The colloquial ones
In this case, "Compact"
Not really sure tbh. But probably just the colloquial name + whatever extra thing. So KBNet_s would just stay that way, unless we want to write out what the S stands for? Maybe @the-database could weigh in on how he'd plan on having users define that in training configs.
Probably, but it also probably wouldn't matter if we didn't
Now that you mention it, I almost wonder if we should have a somewhat consistent naming convention, like picking one of ("num_in_ch", "in_nc", "in_ch", etc.) to use everywhere. But idk for sure.
IMO they should all be named so order doesn't matter.
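As a minimal sketch of what "named so order doesn't matter" could mean in practice, hyperparameters could be made keyword-only. The class and parameter names here are made up for illustration:

```python
from torch import nn

class ExampleArch(nn.Module):
    """Illustrative only: the bare * forces every hyperparameter to be passed by
    name, so reordering or inserting parameters can't silently break trainer configs."""

    def __init__(self, *, in_ch: int = 3, out_ch: int = 3, dim: int = 64, upscale: int = 4):
        super().__init__()
        self.head = nn.Conv2d(in_ch, dim, 3, padding=1)
        self.tail = nn.Conv2d(dim, out_ch * upscale * upscale, 3, padding=1)
        self.shuffle = nn.PixelShuffle(upscale)

    def forward(self, x):
        return self.shuffle(self.tail(self.head(x)))

model = ExampleArch(in_ch=3, upscale=2)  # explicit keywords, argument order irrelevant
# ExampleArch(3, 3, 64, 2)               # would raise TypeError: positional args rejected
```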
Alright, since we resolved that, let's merge this PR and continue with what needs to be done in follow-up PRs. I already started with #279.
I'm importing spandrel's architecture definitions to train models in traiNNer-redux, and noticed that a few of the architectures in spandrel removed some conditionals involving self.training. This PR restores those conditionals. I checked all archs in spandrel against their official versions, and these are the only ones I found which were missing self.training conditionals.
Note that the change to SPAN isn't part of the official code, but training with the official SPAN code results in this issue: hongyuanyu/SPAN#4. The change appears to be a safe fix, since the logic is unchanged in inference mode, models trained with this change are still compatible with this arch, and all tests are passing. Let me know if this change needs to be backed out.
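For context, here is a toy illustration of the kind of self.training conditional being restored: the training branch changes how the work is done while the eval/inference output stays the same. This is not the actual PLKSR/RealPLKSR/SPAN code, just a self-contained sketch:

```python
import torch
from torch import nn

class ToyBlock(nn.Module):
    def __init__(self, channels: int = 48):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv(x)
        if self.training:
            out = out + x        # out-of-place add keeps autograd happy during training
        else:
            out = out.add_(x)    # cheaper in-place add is fine for inference only
        return out

block = ToyBlock().eval()
with torch.no_grad():
    y = block(torch.randn(1, 48, 32, 32))  # same numerical result as the training path
```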