
add README.md #155

Merged: 16 commits merged into main from jcaip/sparsity-readme on Apr 25, 2024

Conversation

@jcaip (Contributor) commented Apr 22, 2024

add README.md to sparsity folder

@facebook-github-bot added the CLA Signed label on Apr 22, 2024
@jcaip (Contributor, author) commented Apr 22, 2024

adding images here for hosting lol:
[Three screenshots attached]


# Design

Pruning, like quantization, is an accuracy/performance trade-off, where we care not only about the speedup but also about the accuracy degradation of our architecture optimization technique.
Contributor:

Should we rename the folder to pruning?

@jcaip (author):

I think sparsity is the more widely used term, so let's keep calling it that. But I'll change Pruning -> Sparsity where it makes sense in the README.

@cpuhrsch requested a review from msaroufim on April 22, 2024
@@ -0,0 +1,664 @@
# torchao sparsity

Sparsity is the technique of removing parameters from a neural network in order to reduce its memory overhead or latency. By carefully choosing the elements that are removed, one can achieve a significant reduction in memory overhead and latency, while paying a reasonably low or no price in terms of model quality (accuracy / F1).
Contributor:

I'd call pruning the "technique of removing parameters from a neural network in order to reduce its memory overhead or latency"


Sparsity, like quantization, is an accuracy/performance trade-off, where we care not only about the speedup but also about the accuracy degradation of our architecture optimization technique.

In quantization, the theoretical performance gain is generally determined by the data type that we are quantizing to - quantizing from float32 to float16 yields a theoretical 2x speedup. For pruning/sparsity, the analogous variable would be the sparsity level / sparsity pattern. For semi-structured sparsity, the sparsity level is fixed at 50%, so we expect a theoretical 2x improvement. For block-sparse matrices and unstructured sparsity, the speedup is variable and depends on the sparsity level of the tensor.
Contributor:

nit: It's roughly a theoretical speedup of 2x. Or put differently, 2x is a very basic estimate based just on the reduced amount of memory that needs to be processed. In practice it can vary quite a bit; it could even be a lot more, because sparsity allows you to use faster caches, etc.
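
To make the 2x estimate concrete, here is a minimal sketch of accelerating a single linear layer with 2:4 semi-structured sparsity, assuming a recent PyTorch with `torch.sparse.to_sparse_semi_structured` and a CUDA GPU whose sparse kernels are supported; the layer sizes and the magnitude-based mask are illustrative, not taken from the README:

```python
import torch
from torch.sparse import to_sparse_semi_structured

# illustrative layer; semi-structured kernels need fp16/bf16 weights on a supported GPU
linear = torch.nn.Linear(10240, 3072, bias=False).half().cuda().eval()

# build a 2:4 mask: keep the 2 largest-magnitude entries in every contiguous group of 4
w = linear.weight.detach()
groups = w.reshape(-1, 4)
idx = groups.abs().topk(2, dim=1).indices
mask = torch.zeros_like(groups).scatter_(1, idx, 1).reshape_as(w)

# swap the dense weight for its compressed semi-structured representation
linear.weight = torch.nn.Parameter(to_sparse_semi_structured(w * mask))

x = torch.rand(3072, 10240, dtype=torch.float16, device="cuda")
with torch.inference_mode():
    y = linear(x)  # dispatches to the sparse matmul kernel; 2x is the theoretical ceiling
```

As the comment above notes, the realized speedup depends on shapes and hardware and can differ from the 2x estimate in either direction.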



One key difference between sparsity and quantization is in how the accuracy degradation is determined: the accuracy degradation of quantization is determined by the scale and zero_point chosen, whereas in pruning the accuracy degradation is determined by the mask. By carefully choosing the specified elements and retraining the network, pruning can achieve negligible accuracy degradation and in some cases even provide a slight accuracy gain. This is an active area of research with no consensus yet. We expect users will have a target sparsity pattern in mind and will prune to that pattern.
Contributor:

This is a bit biased towards affine quantization and sparsity-aware training, specifically for matrix multiplication. There are many other variables that influence accuracy degradation, for example the operation used and the distribution of input values.

The measure, i.e. model quality, is the same between sparsity and quantization. Some of the mitigation techniques are the same too (e.g. quantization- or sparsity-aware training). Where it differs, I'd say, is that sparsity explicitly relies on approximating a sum of numbers (hence the focus on zero), whereas in quantization you avoid allocating bits for unused numerical ranges / unnecessary numerical fidelity.

@jcaip (author):

I'll add some more context to the end of this section, but for this and the comment above, I want to keep this as newbie-friendly as possible, so I think it's okay to have a relatively flawed / forceful analogy to make a point.

I think explaining things in the most faithful way introduces a lot of jargon, which is kind of overwhelming.
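
As a concrete illustration of the claim in the quoted paragraph above (in pruning, the accuracy degradation is determined by the mask), here is a small sketch using torch.nn.utils.prune for unstructured magnitude pruning; this is a stand-in example for experimentation, not the workflow this README itself describes:

```python
import torch
import torch.nn.utils.prune as prune

linear = torch.nn.Linear(256, 256)

# zero out the 50% of weights with the smallest magnitude; the accuracy degradation
# is determined entirely by which entries this mask removes
prune.l1_unstructured(linear, name="weight", amount=0.5)
print(linear.weight_mask.float().mean())  # ~0.5: half the entries are kept

# ... fine-tune here so the surviving weights compensate for the pruned ones ...

prune.remove(linear, "weight")  # fold the mask into the weight permanently
```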


Given a target sparsity pattern, pruning a model can then be thought of as two separate subproblems:

* How can I find a set of sparse weights which satisfy my target sparsity pattern while minimizing the accuracy degradation of my model?
Contributor:

Right, so this first part is what I'd call pruning.

Given a target sparsity pattern, pruning a model can then be thought of as two separate subproblems:

* How can I find a set of sparse weights which satisfy my target sparsity pattern while minimizing the accuracy degradation of my model?
* How can I accelerate my sparse weights for inference and reduced memory overhead?
Contributor:

And then sparsity can be the task of accelerating pruned weights. It's not always necessary to use a sparse layout or sparse kernel. Sometimes you can prune in ways that obviate these specialized techniques. For example, you can just skip an entire layer.
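
A minimal sketch of that last point: if you prune whole neurons (a structured pattern), the result is just smaller dense layers, and no sparse layout or kernel is needed. The shapes and the row-norm criterion below are assumptions for illustration:

```python
import torch
import torch.nn as nn

# hypothetical two-layer MLP
fc1, fc2 = nn.Linear(512, 512), nn.Linear(512, 512)

# keep the 256 output neurons of fc1 with the largest weight norm (row-wise structured prune)
keep = fc1.weight.norm(dim=1).topk(256).indices.sort().values

small_fc1 = nn.Linear(512, 256)
small_fc1.weight.data, small_fc1.bias.data = fc1.weight.data[keep], fc1.bias.data[keep]

small_fc2 = nn.Linear(256, 512)
small_fc2.weight.data = fc2.weight.data[:, keep]  # drop the matching input columns
small_fc2.bias.data = fc2.bias.data

# the pruned network is ordinary dense compute; no sparse layout or kernel involved
x = torch.rand(8, 512)
y = small_fc2(torch.relu(small_fc1(x)))
```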

@msaroufim (Member) left a comment:

Really enjoyed reading this. It is missing code samples, but I believe what you intended to write was closer to a survey of sparsity and the parameter space a library should live in, in which case I believe this does the job well.


FakeSparsity is a parameterization which simulates unstructured sparsity, where each element has a mask. Because of this, we can use it to simulate any sparsity pattern we want.

The user will then train the prepared model using their own custom code, calling `.step()` to update the mask if necessary. Once they've found a suitable mask, they call `squash_mask()` to fuse the mask into the weights, creating a dense tensor with 0s in the right spots.
Member:

Not sure I follow this line, what's `.step()`? Also this seems to indicate that people need to change their training code, and if so, how?

Is this line also necessary for people only interested in accelerated inference?

@jcaip (author):

I updated with a code sample; that should make this a bit easier to follow.
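
For reference, here is a minimal sketch of the prepare / step / squash_mask flow discussed above, assuming the torch.ao.pruning WeightNormSparsifier; exact module paths, config keys, and defaults may differ across PyTorch versions, so treat this as illustrative rather than the code sample added to the README:

```python
import torch
from torch.ao.pruning import WeightNormSparsifier

model = torch.nn.Sequential(torch.nn.Linear(128, 128))

# request a 2:4 pattern: 2 zeros in every 1x4 block, applied to all blocks
sparsifier = WeightNormSparsifier(
    sparsity_level=1.0,
    sparse_block_shape=(1, 4),
    zeros_per_block=2,
)

# attach FakeSparsity masks to the tensors we want to prune
sparsifier.prepare(model, config=[{"tensor_fqn": "0.weight"}])

# ... the user's own training / fine-tuning loop runs here, calling sparsifier.step()
#     whenever the masks should be recomputed from the current weights ...
sparsifier.step()

# fuse the masks into the weights: plain dense tensors with zeros in the right spots
sparsifier.squash_mask()
```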


@jcaip (author) commented Apr 25, 2024

So we have a docs page now: https://github.com/pytorch/ao/blob/jcaip/sparsity-readme/docs/source/sparsity.rst (pytorch.org/ao)

I think this is conceptually the right long-term home for most of the stuff in the README, but I feel like this information will get lost right now compared to putting it in the README.

@msaroufim self-requested a review on April 25, 2024
@msaroufim merged commit 639432b into main on Apr 25, 2024
13 checks passed
@msaroufim deleted the jcaip/sparsity-readme branch on April 25, 2024
dbyoung18 pushed a commit to dbyoung18/ao that referenced this pull request Jul 31, 2024
* added readme

* update

* add README

* update

* fix images

* update

* cleaned up

* fix

* fix formatting

* update

* update readme

* fix images

* updated README again

* update

---------
Labels: CLA Signed

5 participants