Mix.install([
{:jason, "> 0.0.0"},
{:vega_lite, "~> 0.1.7"},
{:kino_vega_lite, "~> 0.1.8"},
{:benchee, "~> 1.1.0"},
{:exonerate, "~> 0.3.0"}
])
~w(test.ex schema.ex)
|> Enum.each(fn file ->
__DIR__
|> Path.join("benchmark/#{file}")
|> Code.compile_file()
end)
alias Benchmark.Schema
alias Benchmark.Test
Benchmark.Test
This entire month (March 2023), I have been spending a ton of effort completing a major refactor of my json-schema library for Elixir. As I was toiling away handcrafting macros to generate optimized, bespoke, yet generalizable code, GPT-4 rolled onto the scene and awed all of us in the industry with its almost magical ability to craft code out of whole cloth. I felt a little bit like John Henry battling the steam drill, only to win but expire from the exertion.
With the advent of LLM-based code generation, programmers are leveraging LLMs such as GPT to rapidly generate difficult or fussy code. Is this a good idea? I wanted to test it out.
Note that, compared to a schema compiler, LLM-generated code may find some nice optimizations for simple schemas. This is roughly equivalent to a human claiming to be able to write better assembly language than a low-level language compiler: in some cases, the human has extra knowledge about the structure of the data being handled, and the claim may be justified.
On the other hand, JSONSchema validations are typically used at the edge of a system, especially when interfacing with a 3rd-party system (or human) whose quality control is not under the control of the publisher of the JSONSchema. In these situations, strict adherence to JSONSchema is desirable: an early 422 rejection with a reason explaining where the data are misshapen is generally preferable to a typically more opaque 500 rejection issued because the data did not match the expectations of the internal system.
With these considerations, I decided to test just how good GPT is at writing JSONSchemas, and answer the question "Should I use GPT to autogenerate schema validations?"
To test this question, the following prompt was run against ~250 JSONSchemas provided as part of the JSONSchema validation test suite. Each schema was injected into the templated query below, and GPT-3.5 and GPT-4 were each asked to provide a response.
Hi, ChatGPT! I would love your help writing an Elixir public function `validate/1`, which takes
one parameter, which is a decoded JSON value. The function should return :ok if the following
jsonschema validates, and an error if it does not:
```
#{schema}
```
The function should NOT store or parse the schema, it should translate the instructions in the schema directly as
elixir code. For example:
```
{"type": "object"}
```
should emit the following code:
```
def validate(object) when is_map(object), do: :ok
def validate(_), do: :error
```
DO NOT STORE THE SCHEMA or EXAMINE THE SCHEMA anywhere in the code. There should not be any
`schema` variables anywhere in the code. please name the module with the atom `:"#{group}-#{title}"
Thank you!
From the response, the code inside the elixir fenced block was extracted and saved into a .exs file for processing below in this livebook. GPT-3.5 was not capable of correctly wrapping the Elixir module, so an automated result-curation step was required; GPT-4 code could be used as-is. Some further manual curation was performed (see Systematic code generation issues).
The biggest limitation of this approach is the nature of the examples provided in the JSONSchema validation suite. These validations exist to help JSONSchema implementers understand "gotchas" in the JSONSchema standard. As such, they don't feature "real-world" payloads and their complexity is mostly limited to testing a single JSONSchema filter, in some cases, a handful of JSONSchema filters, where the filters have a long-distance interaction as part of the specification.
As a result, the optimizations that GPT performs may not really be scalable to real-world cases, and it's not clear if GPT will have sufficient attention to handle the more complex cases.
Future studies, possibly involving schema generation and a property-testing approach, could yield a more comprehensive understanding of GPT code generation.
Note that the source data for GPT is more heavily biased towards imperative programming languages, so despite the claim that AI-assisted code-generation is likely to be more fruitful for languages (like Elixir) with term-immutability, any deficiencies in the code may also be a result of a deficiency in the LLM's understanding of Elixir.
We're going to marshal our results into the following struct, which carries information for visualization:
defmodule Benchmark.Result do
@enforce_keys [:schema, :type]
defstruct @enforce_keys ++ [fail: [], pass: [], pct: 0.0, exception: nil]
end
{:module, Benchmark.Result, <<70, 79, 82, 49, 0, 0, 11, ...>>,
%Benchmark.Result{schema: nil, type: nil, fail: [], pass: [], pct: 0.0, exception: nil}}
The following code is used to profile our GPT-generated code. The directory structure is expected to be that of the https://github.com/E-xyza/exonerate repository, and this notebook is expected to be in the ./bench/ directory; otherwise the relative paths won't work.
Note that the Schema and Test modules should be in ./bench/benchmark/schema.ex and ./bench/benchmark/test.ex, respectively; these are loaded in the dependencies section.
defmodule Benchmark do
alias Benchmark.Result
@omit ~w(anchor.json refRemote.json dynamicRef.json)
@test_directory Path.join(__DIR__, "../test/_draft2020-12")
def get_test_content do
Schema.stream_from_directory(@test_directory, omit: @omit)
end
def run(gpt, test_content) do
code_directory = Path.join(__DIR__, gpt)
test_content
|> Stream.map(&compile_schema(&1, code_directory))
|> Stream.map(&evaluate_test/1)
|> Enum.to_list()
end
defp escape(string), do: String.replace(string, "/", "-")
defp compile_schema(schema, code_directory) do
filename = "#{schema.group}-#{escape(schema.description)}.exs"
code_path = Path.join(code_directory, filename)
module =
try do
{{:module, module, _, _}, _} = Code.eval_file(code_path)
module
rescue
error -> error
end
{schema, module}
end
defp evaluate_test({schema, exception}) when is_exception(exception) do
%Result{schema: schema, type: :compile, exception: exception}
end
defp evaluate_test({schema, module}) do
# check to make sure module exports the validate function.
if function_exported?(module, :validate, 1) do
increment = 100.0 / length(schema.tests)
schema.tests
|> Enum.reduce(%Result{schema: schema, type: :ok}, fn test, result ->
expected = if test.valid, do: :ok, else: :error
try do
if module.validate(test.data) === expected do
%{result | pct: result.pct + increment, pass: [test.description | result.pass]}
else
%{result | type: :partial, fail: [{test.description, :incorrect} | result.fail]}
end
rescue
e ->
%{result | type: :partial, fail: [{test.description, e} | result.fail]}
end
end)
|> set_total_failure
else
%Result{schema: schema, type: :compile, exception: :not_generated}
end
end
# if absolutely none of the answers is correct, then set the type to :failure
defp set_total_failure(result = %Result{pct: 0.0}), do: %{result | type: :failure}
defp set_total_failure(result), do: result
end
tests = Benchmark.get_test_content()
gpt_3_results = Benchmark.run("gpt-3.5", tests)
gpt_4_results = Benchmark.run("gpt-4", tests)
:ok
warning: variable "map" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/if-then-else-if appears at the end when serialized (keyword processing sequence).exs:30: :"if-then-else-if appears at the end when serialized (keyword processing sequence)-gpt-3.5".validate_map/1
warning: function validate_array/1 is unused
/home/ityonemo/code/exonerate/bench/gpt-3.5/if-then-else-if appears at the end when serialized (keyword processing sequence).exs:38
warning: function validate_bool/1 is unused
/home/ityonemo/code/exonerate/bench/gpt-3.5/if-then-else-if appears at the end when serialized (keyword processing sequence).exs:58
warning: function validate_null/1 is unused
/home/ityonemo/code/exonerate/bench/gpt-3.5/if-then-else-if appears at the end when serialized (keyword processing sequence).exs:62
warning: function validate_number/1 is unused
/home/ityonemo/code/exonerate/bench/gpt-3.5/if-then-else-if appears at the end when serialized (keyword processing sequence).exs:54
warning: function validate_string/1 is unused
/home/ityonemo/code/exonerate/bench/gpt-3.5/if-then-else-if appears at the end when serialized (keyword processing sequence).exs:46
warning: this clause cannot match because a previous clause at line 22 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/if-then-else-if appears at the end when serialized (keyword processing sequence).exs:25
warning: Map.get!/2 is undefined or private. Did you mean:
* get/2
* get/3
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-root pointer ref.exs:4: :"ref-root pointer ref-gpt-3.5".validate/1
warning: variable "errors" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-relative pointer ref to object.exs:13: :"ref-relative pointer ref to object-gpt-3.5".validate_map/2
warning: variable "map" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-relative pointer ref to object.exs:10: :"ref-relative pointer ref to object-gpt-3.5".validate_map/2
warning: variable "object" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-escaped pointer ref.exs:6: :"ref-escaped pointer ref-gpt-3.5".validate/1
warning: this clause cannot match because a previous clause at line 19 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-escaped pointer ref.exs:27
warning: variable "k" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-nested refs.exs:32: :"ref-nested refs-gpt-3.5".validate_object/1
warning: undefined module attribute @schemas, please remove access to @schemas or explicitly set it before access
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-nested refs.exs:50: :"ref-nested refs-gpt-3.5" (module)
warning: this clause cannot match because a previous clause at line 30 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-nested refs.exs:31
warning: module attribute @schemas was set but never used
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-nested refs.exs:57
warning: variable "schema" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-ref applies alongside sibling keywords.exs:10: :"ref-ref applies alongside sibling keywords-gpt-3.5".validate/1
warning: this clause for validate/1 cannot match because a previous clause at line 6 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-ref applies alongside sibling keywords.exs:10
warning: this clause for validate/1 cannot match because a previous clause at line 6 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-ref applies alongside sibling keywords.exs:14
warning: this clause for validate/1 cannot match because a previous clause at line 6 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-$ref to boolean schema true.exs:10
warning: this clause for validate/1 cannot match because a previous clause at line 6 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-$ref to boolean schema true.exs:14
warning: undefined module attribute @schema, please remove access to @schema or explicitly set it before access
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-refs with quote.exs:50: :"ref-refs with quote-gpt-3.5" (module)
warning: variable "object" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-ref creates new scope when adjacent to keywords.exs:21: :"ref-ref creates new scope when adjacent to keywords-gpt-3.5".validate/1
warning: variable "object" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-ref creates new scope when adjacent to keywords.exs:25: :"ref-ref creates new scope when adjacent to keywords-gpt-3.5".validate/1
warning: this clause for validate/1 cannot match because a previous clause at line 6 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-ref creates new scope when adjacent to keywords.exs:10
warning: this clause for validate/1 cannot match because a previous clause at line 6 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-ref creates new scope when adjacent to keywords.exs:17
warning: this clause for validate/1 cannot match because a previous clause at line 6 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-ref creates new scope when adjacent to keywords.exs:21
warning: this clause for validate/1 cannot match because a previous clause at line 6 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/ref-ref creates new scope when adjacent to keywords.exs:25
warning: variable "object" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedProperties-unevaluatedProperties with adjacent additionalProperties.exs:9: :"unevaluatedProperties-unevaluatedProperties with adjacent additionalProperties-gpt-3.5".validate/1
warning: variable "result" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedProperties-unevaluatedProperties with adjacent additionalProperties.exs:63: :"unevaluatedProperties-unevaluatedProperties with adjacent additionalProperties-gpt-3.5".validate_properties_with_properties/2
warning: variable "default" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedProperties-unevaluatedProperties with nested properties.exs:48: :"unevaluatedProperties-unevaluatedProperties with nested properties-gpt-3.5".validate_object_properties/4
warning: Map.equal/2 is undefined or private. Did you mean:
* equal?/2
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedProperties-unevaluatedProperties with nested properties.exs:35: :"unevaluatedProperties-unevaluatedProperties with nested properties-gpt-3.5".validate_object_properties/4
warning: MapSchema.validate/2 is undefined (module MapSchema is not available or is yet to be defined)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedProperties-cousin unevaluatedProperties, true and false, true with properties.exs:4: :"unevaluatedProperties-cousin unevaluatedProperties, true and false, true with properties-gpt-3.5".validate/1
warning: variable "schema" does not exist and is being expanded to "schema()", please use parentheses to remove the ambiguity or change the variable name
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedProperties-cousin unevaluatedProperties, true and false, false with properties.exs:31: :"unevaluatedProperties-cousin unevaluatedProperties, true and false, false with properties-gpt-3.5".validate_properties_schema/2
warning: variable "schema" does not exist and is being expanded to "schema()", please use parentheses to remove the ambiguity or change the variable name
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedProperties-cousin unevaluatedProperties, true and false, false with properties.exs:50: :"unevaluatedProperties-cousin unevaluatedProperties, true and false, false with properties-gpt-3.5".validate_unevaluated_properties/2
warning: undefined function schema/0 (expected :"unevaluatedProperties-cousin unevaluatedProperties, true and false, false with properties-gpt-3.5" to define such a function or for it to be imported, but none are available)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedProperties-cousin unevaluatedProperties, true and false, false with properties.exs:50
warning: this clause for validate_schema1/1 cannot match because a previous clause at line 13 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/allOf-allOf with base schema.exs:20
warning: variable "object" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/allOf-allOf with two empty schemas.exs:2: :"allOf-allOf with two empty schemas-gpt-3.5".validate/1
warning: variable "object" is unused (there is a variable with the same name in the context, use the pin operator (^) to match on it or prefix this variable with underscore if it is not meant to be used)
/home/ityonemo/code/exonerate/bench/gpt-3.5/allOf-nested allOf, to check validation semantics.exs:92: :"allOf-nested allOf, to check validation semantics-gpt-3.5".validate_subschema/2
warning: this clause for validate/1 cannot match because a previous clause at line 6 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/allOf-nested allOf, to check validation semantics.exs:10
warning: this clause for validate/1 cannot match because a previous clause at line 6 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/allOf-nested allOf, to check validation semantics.exs:18
warning: this clause for validate/1 cannot match because a previous clause at line 6 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/allOf-nested allOf, to check validation semantics.exs:27
warning: variable "object" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/required-required default validation.exs:2: :"required-required default validation-gpt-3.5".validate/1
warning: clauses with the same name should be grouped together, "def validate/1" was previously defined (/home/ityonemo/code/exonerate/bench/gpt-3.5/maximum-maximum validation.exs:2)
/home/ityonemo/code/exonerate/bench/gpt-3.5/maximum-maximum validation.exs:14
warning: this clause for validate/1 cannot match because a previous clause at line 2 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/maximum-maximum validation.exs:14
warning: Map.fetch/3 is undefined or private. Did you mean:
* fetch/2
/home/ityonemo/code/exonerate/bench/gpt-3.5/uniqueItems-uniqueItems with an array of items and additionalItems=false.exs:31: :"uniqueItems-uniqueItems with an array of items and additionalItems=false-gpt-3.5".is_unique_items/1
warning: incompatible types:
map() !~ [dynamic()]
in expression:
# /home/ityonemo/code/exonerate/bench/gpt-3.5/uniqueItems-uniqueItems=false validation.exs:5
is_list(object)
where "object" was given the type map() in:
# /home/ityonemo/code/exonerate/bench/gpt-3.5/uniqueItems-uniqueItems=false validation.exs:5
%{uniqueItems: false} = object
where "object" was given the type [dynamic()] in:
# /home/ityonemo/code/exonerate/bench/gpt-3.5/uniqueItems-uniqueItems=false validation.exs:5
is_list(object)
Conflict found at
/home/ityonemo/code/exonerate/bench/gpt-3.5/uniqueItems-uniqueItems=false validation.exs:5: :"uniqueItems-uniqueItems=false validation-gpt-3.5".validate/1
warning: function is_integer/1 is unused
/home/ityonemo/code/exonerate/bench/gpt-3.5/properties-object properties validation.exs:31
warning: variable "subprop" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/properties-properties, patternProperties, additionalProperties interaction.exs:64: :"properties-properties, patternProperties, additionalProperties interaction-gpt-3.5".validate_pattern_properties/2
warning: variable "object" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/additionalProperties-additionalProperties can exist by itself.exs:4: :"additionalProperties-additionalProperties can exist by itself-gpt-3.5".validate/1
warning: undefined function validate/2 (expected :"items-items should not look in applicators, valid case-gpt-3.5" to define such a function or for it to be imported, but none are available)
/home/ityonemo/code/exonerate/bench/gpt-3.5/items-items should not look in applicators, valid case.exs:86
warning: variable "null" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/anyOf-nested anyOf, to check validation semantics.exs:2: :"anyOf-nested anyOf, to check validation semantics-gpt-3.5".validate/1
warning: this clause for validate/1 cannot match because a previous clause at line 2 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/anyOf-nested anyOf, to check validation semantics.exs:6
warning: this clause for validate/1 cannot match because a previous clause at line 2 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/anyOf-nested anyOf, to check validation semantics.exs:10
warning: this clause for validate/1 cannot match because a previous clause at line 2 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/anyOf-nested anyOf, to check validation semantics.exs:17
warning: this clause for validate/1 cannot match because a previous clause at line 2 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/anyOf-nested anyOf, to check validation semantics.exs:21
warning: this clause for validate/1 cannot match because a previous clause at line 2 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/anyOf-nested anyOf, to check validation semantics.exs:25
warning: found quoted keyword "pattern" but the quotes are not required. Note that keywords are always atoms, even when quoted. Similar to atoms, keywords made exclusively of ASCII letters, numbers, and underscores and not beginning with a number do not require quotes
/home/ityonemo/code/exonerate/bench/gpt-3.5/pattern-pattern is not anchored.exs:5:6
warning: this clause for validate/1 cannot match because a previous clause at line 2 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/minimum-minimum validation with signed integer.exs:6
warning: variable "object" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with items.exs:6: :"unevaluatedItems-unevaluatedItems with items-gpt-3.5".validate/1
warning: variable "prefix" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with items.exs:31: :"unevaluatedItems-unevaluatedItems with items-gpt-3.5".validate_prefix_items/2
warning: function validate_items/2 is unused
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with items.exs:14
warning: function validate_prefix_items/2 is unused
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with items.exs:25
warning: function validate_schema/2 is unused
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with items.exs:53
warning: function validate_with_prefix_schema/2 is unused
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with items.exs:46
warning: variable "type" is unused (there is a variable with the same name in the context, use the pin operator (^) to match on it or prefix this variable with underscore if it is not meant to be used)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with nested tuple.exs:67: :"unevaluatedItems-unevaluatedItems with nested tuple-gpt-3.5".validate_prefix_items/2
warning: variable "type" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with nested tuple.exs:63: :"unevaluatedItems-unevaluatedItems with nested tuple-gpt-3.5".validate_prefix_items/2
warning: variable "data" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with not.exs:9: :"unevaluatedItems-unevaluatedItems with not-gpt-3.5".validate/1
warning: variable "prefix_items" does not exist and is being expanded to "prefix_items()", please use parentheses to remove the ambiguity or change the variable name
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with not.exs:56: :"unevaluatedItems-unevaluatedItems with not-gpt-3.5".validate_item/2
warning: variable "const" does not exist and is being expanded to "const()", please use parentheses to remove the ambiguity or change the variable name
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with not.exs:57: :"unevaluatedItems-unevaluatedItems with not-gpt-3.5".validate_item/2
warning: function validate_item/2 is unused
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with not.exs:24
warning: function validate_items/2 is unused
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with not.exs:13
warning: undefined function validate_item/3 (expected :"unevaluatedItems-unevaluatedItems with not-gpt-3.5" to define such a function or for it to be imported, but none are available)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with not.exs:62
warning: undefined function validate_item/3 (expected :"unevaluatedItems-unevaluatedItems with not-gpt-3.5" to define such a function or for it to be imported, but none are available)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with not.exs:61
warning: undefined function validate_item/3 (expected :"unevaluatedItems-unevaluatedItems with not-gpt-3.5" to define such a function or for it to be imported, but none are available)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with not.exs:60
warning: undefined function validate_item/3 (expected :"unevaluatedItems-unevaluatedItems with not-gpt-3.5" to define such a function or for it to be imported, but none are available)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with not.exs:59
warning: undefined function validate_item/3 (expected :"unevaluatedItems-unevaluatedItems with not-gpt-3.5" to define such a function or for it to be imported, but none are available)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with not.exs:58
warning: undefined function validate_item/3 (expected :"unevaluatedItems-unevaluatedItems with not-gpt-3.5" to define such a function or for it to be imported, but none are available)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with not.exs:57
warning: undefined function const/0 (expected :"unevaluatedItems-unevaluatedItems with not-gpt-3.5" to define such a function or for it to be imported, but none are available)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with not.exs:57
warning: undefined function validate_item/3 (expected :"unevaluatedItems-unevaluatedItems with not-gpt-3.5" to define such a function or for it to be imported, but none are available)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with not.exs:56
warning: undefined function prefix_items/0 (expected :"unevaluatedItems-unevaluatedItems with not-gpt-3.5" to define such a function or for it to be imported, but none are available)
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with not.exs:56
warning: this clause cannot match because a previous clause at line 23 always matches
/home/ityonemo/code/exonerate/bench/gpt-3.5/unevaluatedItems-unevaluatedItems with boolean schemas.exs:27
warning: variable "properties" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-3.5/content-validation of binary-encoded media type documents with schema.exs:30: :"content-validation of binary-encoded media type documents with schema-gpt-3.5".validate_content_schema/1
warning: expected Kernel.rem/2 to have signature:
integer() | float(), float() -> dynamic()
but it has signature:
integer(), integer() -> integer()
in expression:
# /home/ityonemo/code/exonerate/bench/gpt-3.5/multipleOf-by number.exs:2
rem(val, 1.5)
Conflict found at
/home/ityonemo/code/exonerate/bench/gpt-3.5/multipleOf-by number.exs:2: :"multipleOf-by number-gpt-3.5".validate/1
warning: expected Kernel.rem/2 to have signature:
integer() | float(), float() -> dynamic()
but it has signature:
integer(), integer() -> integer()
in expression:
# /home/ityonemo/code/exonerate/bench/gpt-3.5/multipleOf-by small number.exs:2
rem(value, 0.0001)
Conflict found at
/home/ityonemo/code/exonerate/bench/gpt-3.5/multipleOf-by small number.exs:2: :"multipleOf-by small number-gpt-3.5".validate/1
warning: expected Kernel.rem/2 to have signature:
integer(), float() -> dynamic()
but it has signature:
integer(), integer() -> integer()
in expression:
# /home/ityonemo/code/exonerate/bench/gpt-3.5/multipleOf-invalid instance should not raise error when float division = inf.exs:4
rem(object, 0.123456789)
Conflict found at
/home/ityonemo/code/exonerate/bench/gpt-3.5/multipleOf-invalid instance should not raise error when float division = inf.exs:4: :"multipleOf-invalid instance should not raise error when float division = inf-gpt-3.5".validate/1
warning: variable "object" does not exist and is being expanded to "object()", please use parentheses to remove the ambiguity or change the variable name
/home/ityonemo/code/exonerate/bench/gpt-3.5/patternProperties-regexes are not anchored by default and are case sensitive.exs:15: :"patternProperties-regexes are not anchored by default and are case sensitive-gpt-3.5".is_valid_key/1
warning: variable "rest" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-4/unevaluatedProperties-unevaluatedProperties with adjacent properties.exs:4: :"unevaluatedProperties-unevaluatedProperties with adjacent properties".validate/1
warning: unused import Regex
/home/ityonemo/code/exonerate/bench/gpt-4/unevaluatedProperties-unevaluatedProperties with adjacent patternProperties.exs:2
warning: unused import Regex
/home/ityonemo/code/exonerate/bench/gpt-4/additionalProperties-non-ASCII pattern with additionalProperties.exs:2
warning: function is_boolean/1 is unused
/home/ityonemo/code/exonerate/bench/gpt-4/additionalProperties-additionalProperties allows a schema which should validate.exs:19
warning: function is_boolean/1 is unused
/home/ityonemo/code/exonerate/bench/gpt-4/additionalProperties-additionalProperties can exist by itself.exs:12
warning: variable "object" does not exist and is being expanded to "object()", please use parentheses to remove the ambiguity or change the variable name
/home/ityonemo/code/exonerate/bench/gpt-4/additionalProperties-additionalProperties should not look in applicators.exs:29: :"additionalProperties-additionalProperties should not look in applicators".is_additional_property_valid?/1
warning: :inet.parse_address/2 is undefined or private. Did you mean:
* parse_address/1
/home/ityonemo/code/exonerate/bench/gpt-4/format-validation of IPv6 addresses.exs:3: :"format-validation of IPv6 addresses".validate/1
warning: :idna.to_ascii/1 is undefined (module :idna is not available or is yet to be defined)
/home/ityonemo/code/exonerate/bench/gpt-4/format-validation of IDN hostnames.exs:15: :"format-validation of IDN hostnames".valid_idn_hostname?/1
warning: undefined function return/1 (expected :"content-validation of binary-encoded media type documents with schema" to define such a function or for it to be imported, but none are available)
/home/ityonemo/code/exonerate/bench/gpt-4/content-validation of binary-encoded media type documents with schema.exs:39
warning: undefined function return/1 (expected :"content-validation of binary-encoded media type documents with schema" to define such a function or for it to be imported, but none are available)
/home/ityonemo/code/exonerate/bench/gpt-4/content-validation of binary-encoded media type documents with schema.exs:36
warning: undefined function return/1 (expected :"content-validation of binary-encoded media type documents with schema" to define such a function or for it to be imported, but none are available)
/home/ityonemo/code/exonerate/bench/gpt-4/content-validation of binary-encoded media type documents with schema.exs:32
warning: function is_map_key/2 is unused
/home/ityonemo/code/exonerate/bench/gpt-4/not-forbidden property.exs:5
warning: expected Kernel.rem/2 to have signature:
float(), float() -> dynamic()
but it has signature:
integer(), integer() -> integer()
in expression:
# /home/ityonemo/code/exonerate/bench/gpt-4/multipleOf-by number.exs:2
rem(object, 1.5)
Conflict found at
/home/ityonemo/code/exonerate/bench/gpt-4/multipleOf-by number.exs:2: :"multipleOf-by number".validate/1
warning: expected Kernel.rem/2 to have signature:
float(), float() -> dynamic()
but it has signature:
integer(), integer() -> integer()
in expression:
# /home/ityonemo/code/exonerate/bench/gpt-4/multipleOf-by small number.exs:2
rem(number, 0.0001)
Conflict found at
/home/ityonemo/code/exonerate/bench/gpt-4/multipleOf-by small number.exs:2: :"multipleOf-by small number".validate/1
warning: variable "null" is unused (if the variable is not meant to be used, prefix it with an underscore)
/home/ityonemo/code/exonerate/bench/gpt-4/oneOf-nested oneOf, to check validation semantics.exs:2: :"oneOf-nested oneOf, to check validation semantics".validate/1
warning: this clause for validate/1 cannot match because a previous clause at line 2 always matches
/home/ityonemo/code/exonerate/bench/gpt-4/oneOf-nested oneOf, to check validation semantics.exs:6
:ok
Both GPT-3.5 and GPT-4 sometimes use atoms in their code instead of strings. This is understandable, since various Elixir JSON implementations may use atoms instead of strings in the internal representation of JSON, especially for object keys. However, validation of JSON will most likely operate on string keys, since using atom keys for input is discouraged due to security concerns. Here is some example code that GPT-4 generated:
defmodule :"oneOf-oneOf complex types" do
def validate(object) when is_map(object) do
case Enum.filter([:bar, :foo], &Map.has_key?(object, &1)) do
[:bar] ->
case Map.fetch(object, :bar) do
{:ok, value} when is_integer(value) -> :ok
_ -> :error
end
[:foo] ->
case Map.fetch(object, :foo) do
{:ok, value} when is_binary(value) -> :ok
_ -> :error
end
_ -> :error
end
end
def validate(_), do: :error
end
Code featuring atom keys in maps was manually converted prior to the accuracy benchmark; for example, the above code was converted to:
defmodule :"oneOf-oneOf complex types" do
def validate(object) when is_map(object) do
case Enum.filter(["bar", "foo"], &Map.has_key?(object, &1)) do
["bar"] ->
case Map.fetch(object, "bar") do
{:ok, value} when is_integer(value) -> :ok
_ -> :error
end
["foo"] ->
case Map.fetch(object, "foo") do
{:ok, value} when is_binary(value) -> :ok
_ -> :error
end
_ -> :error
end
end
def validate(_), do: :error
end
Neither GPT understood that JSONSchema counts string length in UTF-8 graphemes rather than bytes. As an example, GPT-4 produced the following code:
defmodule :"maxLength-maxLength validation" do
def validate(string) when is_binary(string) do
if byte_size(string) <= 2, do: :ok, else: :error
end
def validate(_), do: :error
end
Instead, the if statement should have been:
if String.length(string) <= 2, do: :ok, else: :error
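A quick illustration of the difference (the example string here is mine, not from the test suite):

byte_size("🦊🦊")       # => 8, so a byte-based check fails maxLength: 2
String.length("🦊🦊")   # => 2 graphemes, which is what the spec intends to count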
The JSONSchema standard requires that integer-valued floating point numbers match constant integers and enumerated integers. In Elixir, while the == operator resolves as true when comparing an integral float to an integer, other operations, such as pattern matching, do not. Both GPT-3.5 and GPT-4 struggled with this; GPT-4 missed several validations because of it.
defmodule :"enum-enum with 0 does not match false" do
def validate(0), do: :ok
def validate(_), do: :error
end
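A minimal sketch of a clause that handles this correctly (the module name suffix is illustrative):

defmodule :"enum-enum with 0 does not match false-fixed" do
  # == treats 0 and 0.0 as equal, while matching on the literal 0 does not;
  # is_number/1 excludes false, which is a plain atom in Elixir.
  def validate(value) when is_number(value) and value == 0, do: :ok
  def validate(_), do: :error
end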
This error, common to both GPT-3.5 and GPT-4, stems from GPT not understanding that a filter will not reject a type it is not designed to operate on. A good example of such code is the following (derived from the schema {"maxItems": 2}):
defmodule :"maxItems-maxItems validation" do
def validate(list) when is_list(list) and length(list) <= 2, do: :ok
def validate(_), do: :error
end
Note that validate/1 will return :error when confronted with a string, even though the JSONSchema spec says that the maxItems filter should not apply, defaulting to successful validation.
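A spec-conforming sketch (not GPT output) only rejects arrays that are too long and passes every other type:

defmodule :"maxItems-maxItems validation-fixed" do
  # maxItems only constrains arrays; all other JSON types validate successfully.
  def validate(list) when is_list(list) and length(list) > 2, do: :error
  def validate(_), do: :ok
end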
When given the schema {"maxItems": 2, "maxLength": 4}
(not in the test suite), GPT-4 does
something even stranger, applying the maxLength
criterion to the inner elements of the list,
even while accepting the that the outer element can be either a list or a string.
defmodule :"maxItems-maxLength" do
def validate(value) when is_list(value) and length(value) <= 2 do
Enum.reduce(value, :ok, fn item, acc ->
if is_binary(item) and byte_size(item) <= 4, do: acc, else: :error
end)
end
def validate(value) when is_binary(value) and byte_size(value) <= 4 do
:ok
end
def validate(_), do: :error
end
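For comparison, a spec-conforming sketch applies each filter only to its own type and lets everything else through:

defmodule :"maxItems-maxLength-fixed" do
  # maxItems constrains only arrays; maxLength constrains only strings.
  def validate(list) when is_list(list) and length(list) > 2, do: :error

  def validate(string) when is_binary(string) do
    if String.length(string) <= 4, do: :ok, else: :error
  end

  def validate(_), do: :ok
end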
When given {"maxLength": 4, "maximum": 3}
, GPT-4 gets the code correct.
defmodule :"maxLength-maximum" do
def validate(value) when is_binary(value) and byte_size(value) <= 4, do: :ok
def validate(value) when is_number(value) and value <= 3, do: :ok
def validate(_), do: :error
end
In the GPT-4 accuracy benchmark, ~15 of the test schemas failed to pass all of their tests solely because of this issue.
Neither GPT knew that the format and content-* filters are off by default and that the test suite does not test these validations. Nonetheless, both GPTs reached for Elixir standard library tools (even though these do not necessarily fit the explicit requirements set forth by the JSONSchema standard) or for 3rd-party tools (despite being told explicitly not to) to perform these validations.
defmodule :"format-validation of date-time strings" do
def validate(datetime_string) when is_binary(datetime_string) do
if valid_datetime?(datetime_string) do
:ok
else
:error
end
end
def validate(_), do: :error
defp valid_datetime?(datetime_string) do
case DateTime.from_iso8601(datetime_string) do
{:ok, _} -> true
:error -> false
end
end
end
For date-time validation, the correct Elixir standard library module to use is NaiveDateTime, not DateTime: DateTime will fail without being given time-zone information.
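A quick illustration of the difference (the timestamp is a made-up example):

DateTime.from_iso8601("2023-03-15T10:00:00")       # => {:error, :missing_offset}
NaiveDateTime.from_iso8601("2023-03-15T10:00:00")  # => {:ok, ~N[2023-03-15 10:00:00]}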
defmodule :"format-validation of IDN hostnames" do
alias :idna, as: Idna
def validate(hostname) when is_binary(hostname) do
if valid_idn_hostname?(hostname) do
:ok
else
:error
end
end
def validate(_), do: :error
defp valid_idn_hostname?(hostname) do
case Idna.to_ascii(hostname) do
{:ok, ascii_hostname} -> valid_ascii_hostname?(ascii_hostname)
_ -> false
end
end
defp valid_ascii_hostname?(hostname) do
:inet.parse_strict_address(hostname) == :error and
Enum.all?(String.split(hostname, ".", trim: true), &valid_label?/1)
end
defp valid_label?(label) do
byte_size(label) in 1..63 and
String.match?(label, ~r/^[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?$/)
end
end
GPT-4 (impressively) reaches for the :idna Erlang library but, oddly, decides to alias it with an Elixir-style module name.
Next, let's look at how accurately GPT-3.5 and GPT-4 perform across all of the JSONSchema tests:
defmodule Benchmark.Plotter do
def format_passes(result) do
result.pass
|> Enum.map(&"✅ #{&1}")
|> Enum.join("\n")
end
def format_fails(result) do
result.fail
|> Enum.map(&"❌ #{elem(&1, 0)}")
|> Enum.join("\n")
end
@granularity 2
def tabularize(result) do
color =
case result.type do
:ok -> :green
:partial -> :yellow
:failure -> :orange
:compile -> :red
end
%{
group: result.schema.group,
test: result.schema.description,
schema: Jason.encode!(result.schema.schema),
pct: round(result.pct / @granularity) * @granularity,
color: color,
pass: format_passes(result),
fail: format_fails(result)
}
end
def nudge_data(results) do
# data points might overlap, so to make the visualization more effective,
# we should nudge the points apart from each other.
results
|> Enum.sort_by(&{&1.group, &1.pct})
|> Enum.map_reduce(MapSet.new(), &nudge/2)
|> elem(0)
end
@nudge 2
# points might overlap, so move them up or down accordingly for better
# visualization. Colors help us understand the qualitative results.
defp nudge(result = %{pct: pct}, seen) when pct == 100, do: nudge(result, seen, -@nudge)
defp nudge(result, seen), do: nudge(result, seen, @nudge)
defp nudge(result, seen, amount) do
if {result.group, result.pct} in seen do
nudge(%{result | pct: result.pct + amount}, seen, amount)
else
{result, MapSet.put(seen, {result.group, result.pct})}
end
end
def plot_one({title, results}) do
tabularized =
results
|> Enum.map(&tabularize/1)
|> nudge_data
VegaLite.new(title: title)
|> VegaLite.data_from_values(tabularized)
|> VegaLite.mark(:circle)
|> VegaLite.encode_field(:x, "group", type: :nominal, title: false)
|> VegaLite.encode_field(:y, "pct", type: :quantitative, title: "percent correct")
|> VegaLite.encode_field(:color, "color", legend: false)
|> VegaLite.encode(:tooltip, [
[field: "group"],
[field: "test"],
[field: "schema"],
[field: "pass"],
[field: "fail"]
])
end
def plot(list_of_results) do
VegaLite.new()
|> VegaLite.concat(Enum.map(list_of_results, &plot_one/1), :vertical)
end
end
Benchmark.Plotter.plot("gpt-3.5": gpt_3_results, "gpt-4": gpt_4_results)
In the above chart, blue dots are 100% correct, green dots are partially correct, orange dots are completely incorrect, and red dots are compilation errors.
GPT-3.5 and GPT-4 are not aware that only certain functions can be called in function guards, causing a compilation error:
defmodule :"anyOf-anyOf with base schema" do
def validate(value) when is_binary(value) and (String.length(value) <= 2 or String.length(value) >= 4), do: :ok
def validate(_), do: :error
end
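Moving the String.length/1 calls out of the guard and into the function body fixes the compilation error (a sketch, with an illustrative module name):

defmodule :"anyOf-anyOf with base schema-fixed" do
  # String.length/1 is not allowed in guards, so compute the length in the body.
  def validate(value) when is_binary(value) do
    len = String.length(value)
    if len <= 2 or len >= 4, do: :ok, else: :error
  end

  def validate(_), do: :error
end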
GPT-4 attempts to directly match the result of Map.keys/1. This likely works, but in general there is no guarantee that the result of this function will come back in any particular order.
defmodule :"oneOf-oneOf with missing optional property" do
def validate(%{"foo" => _} = object) do
case Map.keys(object) do
["foo"] -> :ok
_ -> :error
end
end
def validate(%{"bar" => _} = object) do
case Map.keys(object) do
["bar", "baz"] -> :ok
_ -> :error
end
end
def validate(_), do: :error
end
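Sorting the keys before matching removes the dependence on map ordering; a sketch of the multi-key clause rewritten that way:

def validate(%{"bar" => _} = object) do
  # Enum.sort/1 makes the comparison independent of the order Map.keys/1 returns.
  case Enum.sort(Map.keys(object)) do
    ["bar", "baz"] -> :ok
    _ -> :error
  end
end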
GPT-3.5 also often attempts to match against Map.keys/1:
defmodule :"unevaluatedProperties-nested unevaluatedProperties, outer false, inner true, properties inside-gpt-3.5" do
def validate(object) when is_map(object) do
case Map.keys(object) do
["foo" | _] -> :ok
_ -> :error
end
end
def validate(_) do
:error
end
end
In this case, simply
if Map.has_key?(object, "foo"), do: :ok, else: :error
would have done the trick.
GPT-3.5 was particularly prone to hallucinations. In one case, it hallucinated a json_schema library (and also flubbed the parameter it passed):
defmodule :"items-prefixItems with no additional items allowed-gpt-3.5" do
def validate(object) when is_map(object), do: validate_object(object)
def validate(_), do: :error
defp validate_object(object) do
case Map.has_key?(object, :items) and Map.has_key?(object, :prefixItems) and not Map.has_key?(object, :additionalItems) do
true -> Map.get(object, :prefixItems)
|> Enum.all?(fn _ -> %{} end)
|> :json_schema.validate(:#{false})
|> handle_validation_result()
false -> :error
end
end
defp handle_validation_result(result) do
case result do
{:ok, _} -> :ok
{:error, _, _} -> :error
end
end
end
{"contains":{"maximum": 5}}
GPT-4 misinterprets this schema and generates the following code:
defmodule :"contains-contains keyword validation" do
def validate(object) when is_list(object) do
if Enum.count(object) >= 5 do
:ok
else
:error
end
end
def validate(_), do: :error
end
This would be the correct code for:
{"contains": {}, "maxContains": 5}
But the semantic error that GPT-4 makes is to treat "maximum" as a qualifier on "contains", when in fact the schema calls for a new "context": the items in the list should be validated against {"maximum": 5}, and that constraint does not apply to the list itself.
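A sketch of code that follows the contains semantics for this schema:

defmodule :"contains-contains keyword validation-fixed" do
  # At least one item must validate against {"maximum": 5}; maximum only applies
  # to numbers, so any non-numeric item also satisfies the subschema.
  def validate(list) when is_list(list) do
    if Enum.any?(list, fn item -> not is_number(item) or item <= 5 end) do
      :ok
    else
      :error
    end
  end

  # contains only applies to arrays; other types validate successfully.
  def validate(_), do: :ok
end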
Several times, GPT-3.5 gave up on doing the task properly and instead wandered off into matching the schema, despite being told explicitly not to. Here is the simplest example:
defmodule :"uniqueItems-uniqueItems=false validation-gpt-3.5" do
@moduledoc "Validates a JSON object against the 'uniqueItems=false' schema.\n"
@doc "Validates the given JSON object against the schema.\n"
@spec validate(Map.t()) :: :ok | :error
def validate(%{uniqueItems: false} = object) when is_list(object) do
if Enum.uniq(object) == object do
:ok
else
:error
end
end
def validate(_) do
:error
end
end
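For reference, {"uniqueItems": false} imposes no constraint at all, so a correct implementation (a sketch) is simply:

defmodule :"uniqueItems-uniqueItems=false validation-fixed" do
  # uniqueItems: false permits duplicates, so every JSON value validates.
  def validate(_), do: :ok
end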
Using the Benchee library, here I set up a framework by which we can test the speed of a few representative samples of generated code. The "John Henry" contender will be Exonerate, the Elixir library that this notebook lives in. Here we set up a compare/2 function that runs Benchee and reports the winner (ips = iterations per second; bigger is better). The module will also host the code generated by Exonerate.
defmodule ExonerateBenchmarks do
require Exonerate
def compare(scenario, value, raw \\ false) do
[exonerate_ips, gpt_ips] =
%{
gpt4: fn -> apply(scenario, :validate, [value]) end,
exonerate: fn -> apply(__MODULE__, scenario, [value]) end
}
|> Benchee.run()
|> Map.get(:scenarios)
|> Enum.sort_by(& &1.name)
|> Enum.map(& &1.run_time_data.statistics.ips)
cond do
raw ->
exonerate_ips / gpt_ips
gpt_ips > exonerate_ips ->
"gpt-4 faster than exonerate by #{gpt_ips / exonerate_ips}x"
true ->
"exonerate faster than gpt-4 by #{exonerate_ips / gpt_ips}x"
end
end
Exonerate.function_from_string(
:def,
:"allOf-allOf simple types",
~S({"allOf": [{"maximum": 30}, {"minimum": 20}]})
)
Exonerate.function_from_string(
:def,
:"uniqueItems-uniqueItems validation",
~S({"uniqueItems": true})
)
Exonerate.function_from_string(
:def,
:"oneOf-oneOf with required",
~S({
"type": "object",
"oneOf": [
{ "required": ["foo", "bar"] },
{ "required": ["foo", "baz"] }
]
})
)
end
{:module, ExonerateBenchmarks, <<70, 79, 82, 49, 0, 0, 58, ...>>, [[]]}
{"allOf": [{"maximum": 30}, {"minimum": 20}]}
Let's take a look at a clear case where GPT-4 is the winning contender. In this code, we apply two filters to a number using the allOf construct, so that the number is subjected to both schemata. This is not the best way to write this schema (doing it without allOf is probably better), but it is very illustrative of how GPT-4 can do better.
This is GPT-4's code:
def validate(number) when is_number(number) do
if number >= 20 and number <= 30 do
:ok
else
:error
end
end
def validate(_), do: :error
Holy moly. GPT-4 was able to deduce the intent of the allOf and see clearly that the filters collapse into a single set of conditions that can be checked without indirection.
By contrast, this is what Exonerate creates:
def validate(data) do
unquote(:"exonerate://validate/#/")(data, "/")
end
defp unquote(:"exonerate://validate/#/")(array, path) when is_list(array) do
with :ok <- unquote(:"exonerate://validate/#/allOf")(array, path) do
:ok
end
end
defp unquote(:"exonerate://validate/#/")(boolean, path) when is_boolean(boolean) do
with :ok <- unquote(:"exonerate://validate/#/allOf")(boolean, path) do
:ok
end
end
defp unquote(:"exonerate://validate/#/")(integer, path) when is_integer(integer) do
with :ok <- unquote(:"exonerate://validate/#/allOf")(integer, path) do
:ok
end
end
defp unquote(:"exonerate://validate/#/")(null, path) when is_nil(null) do
with :ok <- unquote(:"exonerate://validate/#/allOf")(null, path) do
:ok
end
end
defp unquote(:"exonerate://validate/#/")(float, path) when is_float(float) do
with :ok <- unquote(:"exonerate://validate/#/allOf")(float, path) do
:ok
end
end
defp unquote(:"exonerate://validate/#/")(object, path) when is_map(object) do
with :ok <- unquote(:"exonerate://validate/#/allOf")(object, path) do
:ok
end
end
defp unquote(:"exonerate://validate/#/")(string, path) when is_binary(string) do
if String.valid?(string) do
with :ok <- unquote(:"exonerate://validate/#/allOf")(string, path) do
:ok
end
else
require Exonerate.Tools
Exonerate.Tools.mismatch(string, "exonerate://validate/", ["type"], path)
end
end
defp unquote(:"exonerate://validate/#/")(content, path) do
require Exonerate.Tools
Exonerate.Tools.mismatch(content, "exonerate://validate/", ["type"], path)
end
defp unquote(:"exonerate://validate/#/allOf")(data, path) do
require Exonerate.Tools
Enum.reduce_while([
&unquote(:"exonerate://validate/#/allOf/0")/2,
&unquote(:"exonerate://validate/#/allOf/1")/2
],
:ok,
fn fun, :ok ->
case fun.(data, path) do
:ok -> {:cont, :ok}
Exonerate.Tools.error_match(error) -> {:halt, error}
end
end)
end
defp unquote(:"exonerate://validate/#/allOf/0")(integer, path) when is_integer(integer) do
with :ok <- unquote(:"exonerate://validate/#/allOf/0/maximum")(integer, path) do
:ok
end
end
# ... SNIP ...
defp unquote(:"exonerate://validate/#/allOf/1/minimum")(number, path) do
case number do
number when number >= 20 ->
:ok
_ ->
require Exonerate.Tools
Exonerate.Tools.mismatch(number, "exonerate://validate/", ["allOf", "1", "minimum"], path)
end
end
It was so long I had to trim it down to keep from boring you, but you should get the point. The Exonerate code painstakingly goes through every single branch of the schema, giving each its own legible function; when there's an error, it also annotates the location in the schema where the error occurred and which filter the input violated. So it's legitimately doing more than the GPT-4 code, which gleefully destroyed information that could be useful to whoever is trying to send data.
Then again, I didn't ask it to do that. Let's see how much of a difference in performance all this makes.
ExonerateBenchmarks.compare(:"allOf-allOf simple types", 25)
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 7.94 M 125.94 ns ±10521.80% 113 ns 159 ns
exonerate 3.47 M 288.45 ns ±13170.20% 191 ns 463 ns
Comparison:
gpt4 7.94 M
exonerate 3.47 M - 2.29x slower +162.51 ns
"gpt-4 faster than exonerate by 2.2903531909721173x"
So above, we see that GPT-4 is ~2x faster than Exonerate. John Henry is defeated, in this round.
Hidden Regressions
{"uniqueItems": true}
Next, let's take a look at a place where the GPT-4 code contains a tough-to-spot regression in a very simple filter. Here, GPT-4 does the obvious thing:
def validate(list) when is_list(list) do
unique_list = Enum.uniq(list)
if length(list) == length(unique_list) do
:ok
else
:error
end
end
If you're not familiar with how the BEAM works, the regression occurs because Enum.uniq/1 is O(N) in the length of the list; length/1 is O(N) as well, so in the worst case this algorithm runs through the length of the list three times.
I won't show you the code Exonerate generated, but suffice it to say, the validator only loops through the list once. And it even quits early if it encounters a uniqueness violation.
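A minimal sketch of that strategy (not Exonerate's actual generated code): a single pass with Enum.reduce_while/3 that halts on the first duplicate.

def unique?(list) do
  list
  |> Enum.reduce_while(MapSet.new(), fn item, seen ->
    if MapSet.member?(seen, item) do
      {:halt, :duplicate}
    else
      {:cont, MapSet.put(seen, item)}
    end
  end)
  |> case do
    :duplicate -> false
    _ -> true
  end
end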
When we give it a short list, GPT-4 still wins.
ExonerateBenchmarks.compare(:"uniqueItems-uniqueItems validation", [1, 2, 3])
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 3.52 M 0.28 μs ±11454.15% 0.21 μs 0.50 μs
exonerate 0.67 M 1.50 μs ±2172.91% 1.22 μs 2.35 μs
Comparison:
gpt4 3.52 M
exonerate 0.67 M - 5.28x slower +1.22 μs
"gpt-4 faster than exonerate by 5.284881178051852x"
But given a longer list, we see that Exonerate wins out.
input = List.duplicate(1, 1000)
ExonerateBenchmarks.compare(:"uniqueItems-uniqueItems validation", input)
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
exonerate 964.49 K 1.04 μs ±2805.64% 0.83 μs 1.74 μs
gpt4 107.92 K 9.27 μs ±148.84% 8.80 μs 16.01 μs
Comparison:
exonerate 964.49 K
gpt4 107.92 K - 8.94x slower +8.23 μs
"exonerate faster than gpt-4 by 8.93731926728509x"
We can run different list lengths in both the best-case and worst-case scenarios and see where the performance crosses over.
list_lengths = [1, 3, 10, 30, 100, 300, 1000]
worst_case =
Enum.map(
list_lengths,
&ExonerateBenchmarks.compare(:"uniqueItems-uniqueItems validation", Enum.to_list(1..&1), true)
)
best_case =
Enum.map(
list_lengths,
&ExonerateBenchmarks.compare(
:"uniqueItems-uniqueItems validation",
List.duplicate(1, &1),
true
)
)
tabularized =
worst_case
|> Enum.zip(best_case)
|> Enum.zip(list_lengths)
|> Enum.flat_map(fn {{worst, best}, list_length} ->
[
%{
relative: :math.log10(worst),
length: :math.log10(list_length),
label: list_length,
group: :worst
},
%{
relative: :math.log10(best),
length: :math.log10(list_length),
label: list_length,
group: :best
}
]
end)
VegaLite.new(width: 500)
|> VegaLite.data_from_values(tabularized)
|> VegaLite.mark(:circle)
|> VegaLite.encode_field(:x, "length", type: :quantitative, title: "log_10(list_length)")
|> VegaLite.encode_field(:y, "relative",
type: :quantitative,
title: "log_10(exonerate_ips/gpt_ips)"
)
|> VegaLite.encode_field(:color, "group")
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 4.34 M 230.44 ns ±13813.36% 147 ns 280 ns
exonerate 1.42 M 703.86 ns ±4488.49% 524 ns 1202 ns
Comparison:
gpt4 4.34 M
exonerate 1.42 M - 3.05x slower +473.42 ns
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 3.40 M 0.29 μs ±11722.80% 0.21 μs 0.53 μs
exonerate 0.65 M 1.53 μs ±1543.42% 1.25 μs 2.46 μs
Comparison:
gpt4 3.40 M
exonerate 0.65 M - 5.21x slower +1.24 μs
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 1.37 M 0.73 μs ±5156.71% 0.52 μs 1.20 μs
exonerate 0.23 M 4.32 μs ±423.54% 3.79 μs 7.61 μs
Comparison:
gpt4 1.37 M
exonerate 0.23 M - 5.92x slower +3.59 μs
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 335.74 K 2.98 μs ±650.16% 2.54 μs 5.31 μs
exonerate 76.87 K 13.01 μs ±115.33% 12.23 μs 22.70 μs
Comparison:
gpt4 335.74 K
exonerate 76.87 K - 4.37x slower +10.03 μs
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 84.83 K 11.79 μs ±26.89% 10.86 μs 20.85 μs
exonerate 21.57 K 46.35 μs ±20.26% 44.33 μs 77.77 μs
Comparison:
gpt4 84.83 K
exonerate 21.57 K - 3.93x slower +34.57 μs
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 27.27 K 36.67 μs ±17.70% 33.74 μs 63.48 μs
exonerate 6.66 K 150.09 μs ±15.95% 142.29 μs 245.64 μs
Comparison:
gpt4 27.27 K
exonerate 6.66 K - 4.09x slower +113.42 μs
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 4.90 K 204.00 μs ±23.59% 193.91 μs 375.72 μs
exonerate 1.60 K 623.75 μs ±14.33% 595.98 μs 972.79 μs
Comparison:
gpt4 4.90 K
exonerate 1.60 K - 3.06x slower +419.75 μs
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 4.29 M 233.22 ns ±13763.63% 149 ns 290 ns
exonerate 1.45 M 691.06 ns ±4714.55% 523 ns 1117 ns
Comparison:
gpt4 4.29 M
exonerate 1.45 M - 2.96x slower +457.85 ns
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 3.82 M 0.26 μs ±14476.78% 0.166 μs 0.37 μs
exonerate 0.95 M 1.05 μs ±2504.27% 0.84 μs 1.76 μs
Comparison:
gpt4 3.82 M
exonerate 0.95 M - 4.02x slower +0.79 μs
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 3.21 M 0.31 μs ±10437.52% 0.22 μs 0.55 μs
exonerate 0.94 M 1.06 μs ±2758.44% 0.84 μs 1.75 μs
Comparison:
gpt4 3.21 M
exonerate 0.94 M - 3.42x slower +0.75 μs
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 2.03 M 0.49 μs ±4837.45% 0.40 μs 0.78 μs
exonerate 0.96 M 1.04 μs ±2452.95% 0.84 μs 1.73 μs
Comparison:
gpt4 2.03 M
exonerate 0.96 M - 2.11x slower +0.55 μs
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
exonerate 947.66 K 1.06 μs ±2855.38% 0.84 μs 1.77 μs
gpt4 873.47 K 1.14 μs ±2024.78% 1.00 μs 1.85 μs
Comparison:
exonerate 947.66 K
gpt4 873.47 K - 1.08x slower +0.0896 μs
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
exonerate 972.48 K 1.03 μs ±2688.61% 0.84 μs 1.72 μs
gpt4 346.59 K 2.89 μs ±440.43% 2.67 μs 4.98 μs
Comparison:
exonerate 972.48 K
gpt4 346.59 K - 2.81x slower +1.86 μs
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
exonerate 966.65 K 1.03 μs ±2855.32% 0.84 μs 1.73 μs
gpt4 106.10 K 9.43 μs ±91.50% 8.84 μs 16.48 μs
Comparison:
exonerate 966.65 K
gpt4 106.10 K - 9.11x slower +8.39 μs
In the worst-case scenario for Exonerate, the relative speeds stay about the same. This makes sense: both implementations are O(N) in the length of the list, and Exonerate's overhead is the same fixed cost per call, even though GPT-4 actually traverses the list more times.
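As a point of reference for why a single traversal is enough, here is a minimal sketch of a one-pass `uniqueItems` check built on a `MapSet`. This is my own illustration, not the code generated by either Exonerate or GPT-4, and it glosses over JSONSchema's number-equality rules (for example, `1` and `1.0` counting as equal):

defmodule UniqueItemsSketch do
  # One pass over the list: insert each element into a MapSet and halt
  # as soon as an element has been seen before.
  def validate(list) when is_list(list) do
    result =
      Enum.reduce_while(list, MapSet.new(), fn item, seen ->
        if MapSet.member?(seen, item) do
          {:halt, :duplicate}
        else
          {:cont, MapSet.put(seen, item)}
        end
      end)

    if result == :duplicate, do: :error, else: :ok
  end

  def validate(_), do: :error
end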
In the best-case scenario, the crossover occurs at around 80 items in the list. This isn't terribly good, but being 3x slower on a roughly 100 ns function call isn't the end of the world. Let's take a look at another example.
{
"type": "object",
"oneOf": [
{ "required": ["foo", "bar"] },
{ "required": ["foo", "baz"] }
]
}
Here is the function that GPT-4 generates:
def validate(object) when is_map(object) do
case Enum.count(["foo", "bar"] -- Map.keys(object)) do
0 -> :ok
_ -> case Enum.count(["foo", "baz"] -- Map.keys(object)) do
0 -> :ok
_ -> :error
end
end
end
This too is O(N) in the size of the object, whereas the code generated by Exonerate is O(1): checking whether a fixed set of keys is present in a map should not require walking every key.
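For contrast, here is a minimal sketch of what a constant-time version of the same check could look like. This is my own illustration, not Exonerate's actual generated code; note that it also enforces the `oneOf` exclusivity that the GPT-4 code misses:

defmodule OneOfRequiredSketch do
  # is_map_key/2 performs a key lookup without traversing the whole map,
  # so this check does not grow with the size of the object.
  def validate(object) when is_map(object) do
    matches_first = is_map_key(object, "foo") and is_map_key(object, "bar")
    matches_second = is_map_key(object, "foo") and is_map_key(object, "baz")

    # "oneOf" requires exactly one branch to match, so an object containing
    # "foo", "bar", and "baz" must be rejected.
    case {matches_first, matches_second} do
      {true, false} -> :ok
      {false, true} -> :ok
      _ -> :error
    end
  end

  def validate(_), do: :error
end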
In the next cell, we'll test several different inputs: maps with "foo" and "bar" keys, maps with "foo" and "baz" keys, and maps with only a "foo" key. To expand the size of the map, we'll pad it with stringified number keys; every key maps to the string "foo" as its value. Note that the GPT-4 code doesn't address a map with "foo", "bar", and "baz" keys, which should be rejected under oneOf. We expect the performance regression (relative to Exonerate) to be worse for the cases without "bar", because those cases run through the keys of the map twice.
with_bar =
Enum.map(
list_lengths,
fn list_length ->
input = Map.new(["foo", "bar"] ++ Enum.map(1..list_length, &"#{&1}"), &{&1, "foo"})
ExonerateBenchmarks.compare(
:"oneOf-oneOf with required",
input,
true
)
end
)
with_baz =
Enum.map(
list_lengths,
fn list_length ->
input = Map.new(["foo", "baz"] ++ Enum.map(1..list_length, &"#{&1}"), &{&1, "foo"})
ExonerateBenchmarks.compare(
:"oneOf-oneOf with required",
input,
true
)
end
)
with_none =
Enum.map(
list_lengths,
fn list_length ->
input = Map.new(["foo"] ++ Enum.map(1..list_length, &"#{&1}"), &{&1, "foo"})
ExonerateBenchmarks.compare(
:"oneOf-oneOf with required",
input,
true
)
end
)
tabularized =
with_bar
|> Enum.zip(with_baz)
|> Enum.zip(with_none)
|> Enum.zip(list_lengths)
|> Enum.flat_map(fn {{{bar, baz}, none}, list_length} ->
[
%{
relative: :math.log10(bar),
length: :math.log10(list_length),
label: list_length,
group: :bar
},
%{
relative: :math.log10(baz),
length: :math.log10(list_length),
label: list_length,
group: :baz
},
%{
relative: :math.log10(none),
length: :math.log10(list_length),
label: list_length,
group: :none
}
]
end)
VegaLite.new(width: 500)
|> VegaLite.data_from_values(tabularized)
|> VegaLite.mark(:circle)
|> VegaLite.encode_field(:x, "length", type: :quantitative, title: "log_10(list_length)")
|> VegaLite.encode_field(:y, "relative",
type: :quantitative,
title: "log_10(exonerate_ips/gpt_ips)"
)
|> VegaLite.encode_field(:color, "group")
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 4.02 M 0.25 μs ±453.97% 0.23 μs 0.43 μs
exonerate 0.63 M 1.60 μs ±1719.89% 1.19 μs 2.45 μs
Comparison:
gpt4 4.02 M
exonerate 0.63 M - 6.42x slower +1.35 μs
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 3.30 M 0.30 μs ±9092.90% 0.26 μs 0.51 μs
exonerate 0.60 M 1.66 μs ±1651.95% 1.25 μs 2.46 μs
Comparison:
gpt4 3.30 M
exonerate 0.60 M - 5.47x slower +1.35 μs
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 2.33 M 0.43 μs ±2160.37% 0.38 μs 0.74 μs
exonerate 0.54 M 1.86 μs ±1299.83% 1.49 μs 2.81 μs
Comparison:
gpt4 2.33 M
exonerate 0.54 M - 4.34x slower +1.43 μs
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 1.20 M 0.84 μs ±3875.60% 0.70 μs 1.37 μs
exonerate 0.37 M 2.73 μs ±925.57% 2.23 μs 4.12 μs
Comparison:
gpt4 1.20 M
exonerate 0.37 M - 3.27x slower +1.90 μs
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
exonerate 606.10 K 1.65 μs ±1681.92% 1.23 μs 2.52 μs
gpt4 381.09 K 2.62 μs ±634.41% 2.38 μs 4.51 μs
Comparison:
exonerate 606.10 K
gpt4 381.09 K - 1.59x slower +0.97 μs
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
exonerate 613.11 K 1.63 μs ±1672.92% 1.23 μs 2.48 μs
gpt4 108.29 K 9.23 μs ±124.20% 8.68 μs 16.16 μs
Comparison:
exonerate 613.11 K
gpt4 108.29 K - 5.66x slower +7.60 μs
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
exonerate 628.85 K 1.59 μs ±1761.49% 1.23 μs 2.44 μs
gpt4 31.44 K 31.80 μs ±17.56% 30.57 μs 52.44 μs
Comparison:
exonerate 628.85 K
gpt4 31.44 K - 20.00x slower +30.21 μs
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 2.62 M 0.38 μs ±1623.39% 0.35 μs 0.66 μs
exonerate 0.57 M 1.75 μs ±1708.48% 1.25 μs 2.66 μs
Comparison:
gpt4 2.62 M
exonerate 0.57 M - 4.58x slower +1.37 μs
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 2.19 M 0.46 μs ±3810.88% 0.41 μs 0.79 μs
exonerate 0.56 M 1.77 μs ±1688.38% 1.29 μs 2.61 μs
Comparison:
gpt4 2.19 M
exonerate 0.56 M - 3.89x slower +1.32 μs
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 1.33 M 0.75 μs ±2545.42% 0.66 μs 1.26 μs
exonerate 0.48 M 2.07 μs ±1351.64% 1.53 μs 3.10 μs
Comparison:
gpt4 1.33 M
exonerate 0.48 M - 2.75x slower +1.32 μs
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 695.17 K 1.44 μs ±1680.97% 1.28 μs 2.40 μs
exonerate 363.43 K 2.75 μs ±871.41% 2.24 μs 4.27 μs
Comparison:
gpt4 695.17 K
exonerate 363.43 K - 1.91x slower +1.31 μs
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
exonerate 565.35 K 1.77 μs ±1681.01% 1.27 μs 2.67 μs
gpt4 186.24 K 5.37 μs ±155.12% 4.91 μs 9.17 μs
Comparison:
exonerate 565.35 K
gpt4 186.24 K - 3.04x slower +3.60 μs
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
exonerate 567.24 K 1.76 μs ±1671.49% 1.28 μs 2.61 μs
gpt4 50.96 K 19.62 μs ±19.43% 18.96 μs 34.09 μs
Comparison:
exonerate 567.24 K
gpt4 50.96 K - 11.13x slower +17.86 μs
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
exonerate 552.85 K 1.81 μs ±1699.22% 1.28 μs 2.60 μs
gpt4 13.87 K 72.10 μs ±14.63% 69.24 μs 118.93 μs
Comparison:
exonerate 552.85 K
gpt4 13.87 K - 39.86x slower +70.29 μs
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 2.55 M 0.39 μs ±1614.09% 0.35 μs 0.69 μs
exonerate 0.56 M 1.78 μs ±1733.75% 1.24 μs 2.82 μs
Comparison:
gpt4 2.55 M
exonerate 0.56 M - 4.54x slower +1.38 μs
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 2.15 M 0.47 μs ±3708.10% 0.41 μs 0.82 μs
exonerate 0.56 M 1.79 μs ±1697.45% 1.30 μs 2.70 μs
Comparison:
gpt4 2.15 M
exonerate 0.56 M - 3.85x slower +1.33 μs
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 1.32 M 0.76 μs ±2584.80% 0.65 μs 1.28 μs
exonerate 0.48 M 2.08 μs ±1353.92% 1.53 μs 3.12 μs
Comparison:
gpt4 1.32 M
exonerate 0.48 M - 2.75x slower +1.33 μs
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
gpt4 693.10 K 1.44 μs ±1228.39% 1.28 μs 2.46 μs
exonerate 358.38 K 2.79 μs ±862.70% 2.27 μs 4.40 μs
Comparison:
gpt4 693.10 K
exonerate 358.38 K - 1.93x slower +1.35 μs
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
exonerate 565.23 K 1.77 μs ±1679.28% 1.27 μs 2.67 μs
gpt4 188.25 K 5.31 μs ±234.88% 4.96 μs 9.31 μs
Comparison:
exonerate 565.23 K
gpt4 188.25 K - 3.00x slower +3.54 μs
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
exonerate 571.25 K 1.75 μs ±1674.82% 1.28 μs 2.55 μs
gpt4 50.26 K 19.89 μs ±20.74% 19.00 μs 35.15 μs
Comparison:
exonerate 571.25 K
gpt4 50.26 K - 11.36x slower +18.14 μs
Operating System: Linux
CPU Information: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
Number of Available Cores: 8
Available memory: 23.35 GB
Elixir 1.14.2
Erlang 25.1.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s
Benchmarking exonerate ...
Benchmarking gpt4 ...
Name ips average deviation median 99th %
exonerate 543.67 K 1.84 μs ±1503.21% 1.30 μs 2.71 μs
gpt4 13.84 K 72.26 μs ±16.23% 68.85 μs 119.16 μs
Comparison:
exonerate 543.67 K
gpt4 13.84 K - 39.29x slower +70.42 μs
Indeed, we see exactly the relationship we expect: Exonerate's relative performance improves far more dramatically for the "foo/baz" and "foo/none" maps than for the "foo/bar" maps. Moreover, we see a "kink" in the curves around N=30. This is likely because, under the hood, the BEAM stores small maps in a flat, linearly searched representation (O(N) worst-case lookup) and switches to a hash-based representation once a map exceeds 32 entries.
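If you want to poke at that hypothesis directly, a quick probe (not part of the benchmark suite above; the labels assume the commonly cited 32-entry small-map limit) is to time a key lookup on maps just below and just above the boundary:

# Hypothetical probe: compare a single key lookup on maps of 32 vs. 33 entries,
# straddling the point where the BEAM changes map representation.
below = Map.new(1..32, &{"#{&1}", "foo"})
above = Map.new(1..33, &{"#{&1}", "foo"})

Benchee.run(%{
  "32 keys (small map)" => fn -> Map.has_key?(below, "32") end,
  "33 keys (large map)" => fn -> Map.has_key?(above, "32") end
})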
So, should you use GPT to generate your OpenAPI validations? Probably not... yet
GPT-3.5 (and, even more so, GPT-4) is very impressive at generating correct validations for OpenAPI schemas in Elixir. The most common systematic errors (e.g. not being sure whether to use atoms or strings) are easily addressable with prompt engineering. However, the GPT-generated code is not quite right in many cases, and sometimes it dangerously misunderstands the schema (see Semantic Misunderstanding).
Although GPT does appear to perform compiler-style optimizations that yield highly efficient code, that code is not composable, and the attention of current state-of-the-art LLMs may not scale to more complex schemas. Even in the small, GPT makes performance mistakes that are likely due to its lack of understanding of the VM architecture; without repeating this experiment in other languages, though, it's not entirely clear whether GPT would do better elsewhere.
The likeliest use case for autogenerating code with GPT, especially for something like this, is a developer with little experience in OpenAPI and/or little experience in Elixir. For these practitioners, using GPT in lieu of a purpose-built compiler is still generally not a good idea, though I'm looking forward to repeating this experiment with GPT-6.