Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

encoding/json: allow per-Encoder/per-Decoder registration of marshal/unmarshal functions #5901

Open
rsc opened this issue Jul 17, 2013 · 49 comments
Labels
early-in-cycle A change that should be done early in the 3 month dev cycle. Proposal Proposal-Accepted
Milestone

Comments

@rsc
Copy link
Contributor

rsc commented Jul 17, 2013

For example, if a user wants to marshal net.IP with custom code, we should provide a way
to do that, probably a method on *Encoder. Similarly for *Decoder.

Same for encoding/xml.
@robpike
Copy link
Contributor

robpike commented Aug 20, 2013

Comment 1:

Labels changed: removed go1.3.

@rsc
Copy link
Contributor Author

rsc commented Nov 27, 2013

Comment 2:

Labels changed: added go1.3maybe.

@rsc
Copy link
Contributor Author

rsc commented Dec 4, 2013

Comment 3:

Labels changed: added release-none, removed go1.3maybe.

@rsc
Copy link
Contributor Author

rsc commented Dec 4, 2013

Comment 4:

Labels changed: added repo-main.

@rsc rsc added this to the Unplanned milestone Apr 10, 2015
@odeke-em
Copy link
Member

odeke-em commented Jan 2, 2017

I did some work on this sometime back in October, with CL https://go-review.googlesource.com/c/31091 to get the conversation started.
Perhaps we can work on it for Go1.9.

In there I introduced Encoder.RegisterEncoder

func (enc *Encoder) RegisterEncoder(t reflect.Type, fn func(interface{}) ([]byte, error))

and Decoder.RegisterDecoder

func (dec *Decoder) RegisterDecoder(t reflect.Type, fn func([]byte) (interface{}, error))

@sebastien-rosset
Copy link

We have a similar requirement for custom serialization. Our specific use cases are:

  1. Data anonymization. For example, omit IP addresses in the JSON output.
  2. Data masking. For example, omit specific fields depending on user privileges.
  3. Omit empty values (e.g. empty strings). This is useful to generate compact documents.
    In all three cases, the decision to omit fields is done at runtime with contextual information.

We are using CiscoM31@1e9514f. In our case, the client interface is implemented by doing a simple lookup in a map, so there is no need to register hundreds of custom marshaller (we have lots of structs).
We could have used a technique similar to https://go-review.googlesource.com/c/31091.

@harrisonhjones
Copy link

harrisonhjones commented Dec 16, 2019

@rsc is there still interest in this feature on your end? I have a use-case for it and would be happy to implement it.

[Edit 1] Spelling
[Edit 2]

I wonder if the function signature should be

func (enc *Encoder) RegisterMarshaller(t reflect.Type, f func(reflect.Value) ([]byte, error))

with all standard Encoders exposed so that users can leverage them. Example redactor:

package main

import (
	"bytes"
	"encoding/json"
	"os"
	"reflect"
	"strings"
	"unicode/utf8"
)

func main() {
	enc := json.NewEncoder(os.Stdout)
	text := "My password, foo, is totally secure"

	enc.RegisterMarshaller(reflect.TypeOf(""), StringMarshaller)
	enc.Encode(text)
	// Output
	// "My password, foo, is totally secure"

	enc.RegisterMarshaller(reflect.TypeOf(""), func(value reflect.Value) ([]byte, error) {
		return StringMarshaller(reflect.ValueOf(strings.Replace(value.String(), "foo", "[REDACTED]", -1)))
	})
	enc.Encode(text)
	// Output
	// "My password, [REDACTED], is totally secure"
}

// Largely taken from `func (e *encodeState) `string(s string, escapeHTML bool)` in `encoding/json/encode.go`
// This would exist in encoding/json.
func StringMarshaller(value reflect.Value) ([]byte, error) {
	e := bytes.Buffer{}
	s := value.String()
	escapeHTML := false // TODO: Refactor StringEncoder into a 'htmlEscaping' one and a non 'htmlEscaping' one.

	e.WriteByte('"')
	start := 0
	for i := 0; i < len(s); {
		if b := s[i]; b < utf8.RuneSelf {
			if json.HTMLSafeSet[b] || (!escapeHTML && json.SafeSet[b]) {
				i++
				continue
			}
			if start < i {
				e.WriteString(s[start:i])
			}
			e.WriteByte('\\')
			switch b {
			case '\\', '"':
				e.WriteByte(b)
			case '\n':
				e.WriteByte('n')
			case '\r':
				e.WriteByte('r')
			case '\t':
				e.WriteByte('t')
			default:
				// This encodes bytes < 0x20 except for \t, \n and \r.
				// If escapeHTML is set, it also escapes <, >, and &
				// because they can lead to security holes when
				// user-controlled strings are rendered into JSON
				// and served to some browsers.
				e.WriteString(`u00`)
				e.WriteByte(json.Hex[b>>4])
				e.WriteByte(json.Hex[b&0xF])
			}
			i++
			start = i
			continue
		}
		c, size := utf8.DecodeRuneInString(s[i:])
		if c == utf8.RuneError && size == 1 {
			if start < i {
				e.WriteString(s[start:i])
			}
			e.WriteString(`\ufffd`)
			i += size
			start = i
			continue
		}
		// U+2028 is LINE SEPARATOR.
		// U+2029 is PARAGRAPH SEPARATOR.
		// They are both technically valid characters in JSON strings,
		// but don't work in JSONP, which has to be evaluated as JavaScript,
		// and can lead to security holes there. It is valid JSON to
		// escape them, so we do so unconditionally.
		// See http://timelessrepo.com/json-isnt-a-javascript-subset for discussion.
		if c == '\u2028' || c == '\u2029' {
			if start < i {
				e.WriteString(s[start:i])
			}
			e.WriteString(`\u202`)
			e.WriteByte(json.Hex[c&0xF])
			i += size
			start = i
			continue
		}
		i += size
	}
	if start < len(s) {
		e.WriteString(s[start:])
	}
	e.WriteByte('"')

	return e.Bytes(), nil
}

@dsnet dsnet added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Dec 18, 2019
@dsnet
Copy link
Member

dsnet commented Dec 18, 2019

I have a use-case for this as well, but it's a bit specialized. Essentially:

  1. There is a type that I own (as a library author) that is used by many users where they pass it through encoding/json when they shouldn't.
  2. I would like to make changes to my type, but it breaks these users. I have no intention to support the use of this type with encoding/json, but I can't just break these users since there are are sufficient number of them.
  3. I would like to use the feature proposed here to migrate these users to an implementation of marshal/unmarshal that preserves the exact behavior as it exists today (even if its buggy and wrong).
  4. This now frees me to make the changes I want to make.

For prior art, the cmp package has a Comparer option that allows the caller to override the comparison of any specific type in the tree. The ability for users to specify custom comparisons has proven itself to be immensely powerful and flexible.

Even though I'd like to have something like this, I do have some concerns:

  • It is already the case that users often complain that encoding/json is too slow. The addition of this feature may make things slower. For cmp, the need to check whether any given value node matches a custom comparer is a significant source of slow down.
  • A decision needs to be made whether the type override only supports concrete types or interfaces. Concrete types are easier to implement efficiently. However, supporting interfaces provides significant flexibility, but also brings in significant implementation and performance costs.
  • What's the expected behavior when json.Marshal encounters a type T, and the type override has a custom marshaler specified for *T? If the value is addressable, then it makes sense to address the value and call the override function. What if the value is not addressable?

@harrisonhjones
Copy link

harrisonhjones commented Dec 18, 2019

Thanks for the response! Some comments below:

It is already the case that users often complain that encoding/json is too slow. The addition of this feature may make things slower. For cmp, the need to check whether any given value node matches a custom comparer is a significant source of slow down.

I'd have to run a benchmark but I would expect a map lookup to be fairly quick. Perhaps others are concerned with a different scale of "slow" than I.

A decision needs to be made whether the type override only supports concrete types or interfaces. Concrete types are easier to implement efficiently. However, supporting interfaces provides significant flexibility, but also brings in significant implementation and performance costs.

Interesting. I hadn't considered supporting interfaces. At the moment I only need concrete type overrides but if there are users out there that would benefit from an interface check I'd be willing to at least prototype it.

[Edit] Perhaps we could rename RegisterMarshaller to RegisterConcreteMarshaller and introduce RegisterInterfaceMarshaller later if users needed it?

What's the expected behavior when json.Marshal encounters a type T, and the type override has a custom marshaler specified for *T? If the value is addressable, then it makes sense to address the value and call the override function. What if the value is not addressable?

This actually came up in a discussion with a colleague. My preference would be to require users that want to custom marshal both T and *T to have to declare both marshaler overrides. You could use a single custom marshaler but would require both overrides.

...
var foo = ""
enc.RegisterMarshaller(reflect.TypeOf(foo), StringMarshaller)
enc.RegisterMarshaller(reflect.TypeOf(&foo), StringMarshaller)
...

func StringMarshaller(value reflect.Value) ([]byte, error) {
    ... check if value is a *String, if so deref ...
    ...
}

Alternately perhaps we modify the signiture of RegisterMarshaller to be something like:

func (enc *Encoder) RegisterMarshaller(reflect.Type, bool, func(reflect.Value) ([]byte, error))

where, if the bool was "true" it would match both T and *T and if it was false it would only match T or *T, depending on the passed in type. Looks a bit yucky to me though could be made to look better if we used the optional pattern.

@harrisonhjones
Copy link

@dsnet I imagine you might be busy with the holidays but I wanted to give you a friendly ping on this. Any thoughts on my response?

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/212998 mentions this issue: encoding/json: implement type override for serialization

@dsnet
Copy link
Member

dsnet commented Dec 31, 2019

While the logic certainly uses reflect under the hood, I'm not sure if we should expose that in the public API. I propose the following API instead:

// RegisterFunc registers a custom encoder to use for specialized types.
// The input f must be a function of the type func(T) ([]byte, error).
//
// When marshaling a value of type R, the function f is called
// if R is identical to T for concrete types or
// if R implements T for interface types.
// Precedence is given to registered encoders that operate on concrete types,
// then registered encoders that operate on interface types
// in the order that they are registered, then the MarshalJSON method, and
// lastly the default behavior of Encode.
//
// It panics if T is already registered or if interface{} is assignable to T.
func (e *Encoder) RegisterFunc(f interface{})

// RegisterFunc registers a custom decoder to use for specialized types.
// The input f must be a function of the type func([]byte, T) error.
//
// When unmarshaling a value of type R, the function f is called
// if R is identical to T for concrete types or
// if R implements T for interface types.
// Precedence is given to registered decoders that operate on concrete types,
// then registered decoders that operate on interface types
// in the order that they are registered, then the UnmarshalJSON method, and
// lastly the default behavior of Decode.
//
// It panics if T is already registered or if interface{} is assignable to T.
func (d *Decoder) RegisterFunc(f interface{})

Arguments for this API:

  • Most users of this API probably do not want to deal with reflect.Type and reflect.Value, but rather want to deal with concrete types. For this reason, json.Unmarshal takes in an interface{} rather than an reflect.Value.
  • Using interface{} looses type safety, but the other proposed signatures also have type safety issues. For a signature like: func (enc *Encoder) RegisterMarshaller(t reflect.Type, f func(reflect.Value) ([]byte, error)), we still can't statically guarantee that t is the same type as the type that function f expects as it's input.
  • It matches what's done in other API that also have type-safety issues. For example, sort.Slices takes in an interface{} with the requirement that it be a slice kind, rather than a reflect.Value.

The proposed API combined with the ability to handle interfaces, allows you to do something like:

e := json.NewEncoder(w)
e.RegisterFunc(protojson.Marshal)
e.Encode(v)

where protojson.Marshal is a function that matches the expected function signature. It enables the standard encoding/json package to now be able to properly serialize all proto.Message types without forcing or expecting every concrete proto.Message type to implement the json.Marshaler interface.

Since this is something I need for my other work, I uploaded a CL with my prototype implementation that I've been testing with actual code.

Some additional thoughts:

  • One concern is that json.Decoder.DisallowUnknownFields specifies a top-level option that should in theory affect the results of all unmarshal operations recursively. It is already a problem today that custom json.Unmarshaler implementations do not properly respect this option (and has actually been a real problem). It's unfortunate that we're adding another way where options are not properly propagated downward. However, the severity of the problem is not as bad as custom json.Unmarshaler implementations since the location where options are specified is most likely also the place where custom encoder functions are registered, so options can be properly replicated at the top-level.
  • The implementation checks for custom functions to handle the current value type first, and if the value is addressable, also tries checking for a pointer to that value type. This matches the behavior of encoding/json for how it handles checking for json.Marshaler and json.Unmarshaler implementations.
  • The implementation does not call decoder functions if the receiver type is a pointer/interface and the input JSON is null. It also does not call encoder functions if the receiver type is a pointer/interface and is nil. This matches the behavior of encoding/json for how it handles json.Marshaler and json.Unmarshaler implementations.
  • I haven't benchmarked it yet, but expect decoding to take minimal performance hit and encoding to take no performance hit when there are no custom encoders/decoders.
  • Generics might enable greater type safety for the proposed API, but the current proposal for generics does not allow type parameterization at the granularity of individual methods.

@dsnet dsnet changed the title encoding/json: allow override type marshaling proposal: encoding/json: allow override type marshaling Dec 31, 2019
@gopherbot gopherbot removed the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Dec 31, 2019
@dsnet dsnet added NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. Proposal labels Dec 31, 2019
@dsnet
Copy link
Member

dsnet commented Dec 31, 2019

Adding Proposal label as there are at least two proposed APIs and this issue proposes added API to encoding/json.

@dmitshur dmitshur removed this from the Unplanned milestone Jan 1, 2020
@odeke-em odeke-em modified the milestones: Go1.16, Go1.17 Jan 11, 2021
@odeke-em odeke-em added the early-in-cycle A change that should be done early in the 3 month dev cycle. label Jan 11, 2021
@gopherbot
Copy link
Contributor

This issue is currently labeled as early-in-cycle for Go 1.17.
That time is now, so a friendly reminder to look at it again.

@dsnet
Copy link
Member

dsnet commented Feb 24, 2021

@mvdan and I are doing a systematic review of the entire encoding/json API. We may delay the introduction of this to make sure it fits well with the overall direction we believe json should go.

@DeadlySurgeon
Copy link

DeadlySurgeon commented Mar 17, 2021

  • Generics might enable greater type safety for the proposed API, but the current proposal for generics does not allow type parameterization at the granularity of individual methods.

I'm very late to the party, but I believe generics would probably be fine here, might reduce the amount of reflection needed, and like you said would enable greater type safety for the proposed API.

As for the type parameterization, I thought up something like this.

// Encoder
func MarshalIP(ip *net.IP) ([]byte, error) { ... }
func Register[T any](f func(T) ([]byte, error)) { ... }

// Decoder
func UnmarshalIP(data []byte) (*net.IP, error) { ... }
func Register[T any](f func([]byte) (T, error)) { ... }
// or
func UnmarshalIP(data []byte, ip *net.IP) error { ... }
func Register[T any](f func([]byte, T) error) { ... }

You'd obviously still need reflection to get the typeof out of T, but at least then we can ensure that what's being passed is of type func during compilation. You could do this by getting the type of f, and then pulling the parameter T out of that, or, you could just make a new T and throw that into reflect.TypeOf.

@neild
Copy link
Contributor

neild commented Aug 10, 2021

@mvdan and I are doing a systematic review of the entire encoding/json API. We may delay the introduction of this to make sure it fits well with the overall direction we believe json should go.

How's this going?

If this doesn't conflict with the overall package direction, it'd be nice to try to get this in for 1.18.

@dsnet
Copy link
Member

dsnet commented Aug 10, 2021

I apologize, I got side-tracked from json work quite a bit the past few months, but starting to get back into it more.

@dsnet
Copy link
Member

dsnet commented Aug 28, 2021

In implementing this for v2. I think we may want to defer on this feature until the release of generics.

I propose the following alternative API:

// Marshalers is a list of functions that each marshal a specific type.
// A nil Marshalers is equivalent to an empty list.
// For performance, marshalers should be stored in a global variable.
type Marshalers struct { ... }

// NewMarshalers constructs a list of functions that marshal a specific type.
// Functions that come earlier in the list take precedence.
func NewMarshalers(...*Marshalers) *Marshalers

// MarshalFunc constructs a marshaler that marshals values of type T.
func MarshalFunc[T any](fn func(T) ([]byte, error)) *Marshalers

// Unmarshalers is a list of functions that each unmarshal a specific type.
// A nil Unmarshalers is equivalent to an empty list.
type Unmarshalers struct { ... }

// NewUnmarshalers constructs a list of functions that unmarshal a specific type.
// Functions that come earlier in the list take precedence.
// For performance, unmarshalers should be stored in a global variable.
func NewUnmarshalers(...*Unmarshalers) *Unmarshalers

// UnmarshalFunc constructs an unmarshaler that unmarshals values of type T.
func UnmarshalFunc[T any](fn func([]byte, T) error) *Unmarshalers

// WithMarshalers configures the encoder to use any marshaler in m
// that operate on the current type that the encoder is marshaling.
func (*Encoder) WithMarshalers(m *Marshalers)

// WithUnmarshalers configures the decoder to use any unmarshaler in u
// that operate on the current type that the encoder is marshaling.
func (*Decoder) WithUnmarshalers(u *Unmarshalers)

There are several advantages of this API:

  • Performance. Without generics, the implementation must go through reflect.Value.Call. Empirical testing using the tip implementation of generics shows that it is 4x faster than an implementation using reflect.Value.Call.
  • Type safety. The API that I proposed earlier provides no type safety to ensure that users are passing functions of the right signature.
  • Composability. Marshalers and Unmarshalers can be composed with one another. For example:
    m1 := NewMarshalers(f1, f2)
    m2 := NewMarshalers(f0, m1, f3) // equivalent to NewMarshalers(f0, f1, f2, f3)
  • Immutability. Marshalers and Unmarshalers are immutable once constructed. This allows the implementation to perform aggressive caching that would be complicated by the list of marshalers/unmarshalers changing underfoot. This would allow further performance benefits that my implementation for v1 fails to unlock.

Example usage:

var protoMarshalers = json.MarshalFunc(protojson.Marshal)

enc := json.NewEncoder(...)
enc.WithMarshalers(protoMarshalers)
... = enc.Decode(...)

Footnotes:

  • Credit goes to @rogpeppe for realizing that generics provides a significant performance benefit over Go reflection in this case.
  • The experiment was compiled using GOEXPERIMENT=unified since the compiler crashes otherwise.

\cc @mvdan

@neild
Copy link
Contributor

neild commented Aug 30, 2021

Generics are arriving in 1.18. Should we add this (with generics) in 1.18, or delay a release?

@dsnet
Copy link
Member

dsnet commented Aug 30, 2021

I keep forgetting that generics is coming imminently in 1.18. Assuming newly proposed API is acceptable, I support a release for 1.18.

In the event that generics is delayed, we could go with the following signatures:

func MarshalFunc(fn interface{}) *Marshalers
func UnmarshalFunc(fn interface{}) *Unmarshalers

It must rely on Go reflection for type safety at runtime. We could add a corresponding go vet or go build check that statically enforces type-safety so that we can switch it to the generic version in a backwards compatible way.

@liggitt
Copy link
Contributor

liggitt commented Sep 15, 2021

// WithUnmarshalers configures the decoder to use any unmarshaler in u
// that operate on the current type that the encoder is marshaling.
func (*Decoder) WithUnmarshalers(u *Unmarshalers)

Using a Decoder adds ~10 additional allocs and ~doubles memory consumption for callers which already have []byte data over use of json.Unmarshal, due to the Decoder only accepting a reader and then immediately reading the data into an internal buffer. Is there a plan to make the configurable decode behavior (like these custom unmarshalers, or existing options like DisallowUnknownFields/UseNumbers) accessible without such a significant performance penalty?

@dsnet
Copy link
Member

dsnet commented Sep 15, 2021

Is there a plan to make the configurable decode behavior (like these custom unmarshalers, or existing options like > DisallowUnknownFields/UseNumbers) accessible without such a significant performance penalty?

There's been work on what a theoretical v2 json could look like (if ever). That API supports setting options apart from the Decoder. Some or all of those ideas could be back-ported into the current json package.

@kolkov
Copy link

kolkov commented Apr 6, 2022

Hi! Golang version 1.18 with generics is already out. Are there any updates now?

@dsnet
Copy link
Member

dsnet commented Oct 6, 2023

Hi all, we kicked off a discussion for a possible "encoding/json/v2" package that addresses the spirit of this proposal.

See the "Caller-specified customization" section of the discussion.
For concrete examples, see the examples under the WithMarshalers and WithUnmarshalers options.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
early-in-cycle A change that should be done early in the 3 month dev cycle. Proposal Proposal-Accepted
Projects
Status: Accepted
Development

No branches or pull requests