
Streaming #55

Closed
wants to merge 3 commits into from

Conversation

@Deraen (Member) commented Jun 5, 2021

Continued from #53

I've added a create-parser function for accessing the low-level parser, and extended ReadValues to work with the parser.

This allows finding an array inside JSON objects etc. and then reading an array of values from it.

There could probably be similar functions for the writer.
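To make the use case concrete, here is a hedged sketch (not from the PR itself): the JSON shape and the "items" field name are invented, and the interop call mirrors what the proposed create-parser does, i.e. (.createParser (.getFactory mapper) input). It navigates the token stream to a nested array and then reads it; the point of extending ReadValues to the parser is to make this kind of read lazy per element.

```clojure
(require '[jsonista.core :as j])
(import '(com.fasterxml.jackson.core JsonParser JsonToken))

(let [mapper j/default-object-mapper
      json "{\"meta\":{\"count\":2},\"items\":[{\"a\":1},{\"a\":2}]}"
      ;; same interop as the proposed create-parser
      ^JsonParser parser (.createParser (.getFactory mapper) json)]
  ;; advance token by token until the parser sits on the "items" field
  (while (not (and (= JsonToken/FIELD_NAME (.currentToken parser))
                   (= "items" (.getCurrentName parser))))
    (.nextToken parser))
  (.nextToken parser) ; step onto START_ARRAY
  ;; read the nested array via the codec attached to the mapper's factory
  (.readValueAs parser Object))
;; => [{"a" 1} {"a" 2}]
```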

@Deraen changed the title from "Bsless streaming" to "Streaming" on Jun 5, 2021
@bsless (Contributor) commented Sep 30, 2021

Hi @Deraen, any update on this?

@Deraen (Member, Author) commented Oct 1, 2021

Trying to find my notes about this... hmm.

I think I was a bit unhappy with the -read-values implementation for JsonParser, as it needs to move the parser token to a specific point first. The create-parser part of my change is fine, but I wonder if it would be better to leave the readValuesAs part to the library's users. Or maybe it would be better as a separate function, instead of making read-values work in every case.

The parser object already has a reference to jsonista's ObjectMapper, so calling readValuesAs directly is simpler than with InputStream and the other input types, where a .readerFor call is needed first...

@bsless (Contributor) commented Mar 8, 2022

Hi @Deraen, been a while since we've both looked at it :)
Coming back to the subject, I am not sure I see the rationale in working via a parser.
Would the user API be different? Would there be a difference in performance? What are the differences in the code paths?
Via parser:

  • mapper -> factory -> parser -> read values as T -> iterable

Via reader:

  • mapper -> reader for T -> read values -> iterable

Unless you want to expose the parser's target type as an argument.
Not sure if I missed something.
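For reference, the reader path can be sketched with plain Jackson interop (a sketch only, assuming jackson-databind on the classpath; this is not jsonista's exact implementation). An ObjectReader obtained via .readerFor treats a top-level wrapper array as a sequence of elements:

```clojure
(import '(com.fasterxml.jackson.databind ObjectMapper))

;; mapper -> reader for T -> read values -> iterator
(let [mapper (ObjectMapper.)
      reader (.readerFor mapper Object)]
  (vec (iterator-seq (.readValues reader "[1,2,3]"))))
;; => [1 2 3]
```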

@Deraen (Member, Author) commented Mar 8, 2022

@bsless The last test case shows the use case for the parser.

To lazily read an array from a JSON stream when it is not the top-level array, the user needs the parser to navigate to the specific element; then the array can be read.

In #53 we already discussed the difference from the readTree version I've used with GeoJSON files.

I think the API we'll provide should cover all of these cases: top-level arrays; reading e.g. the feature arrays of large GeoJSON files; and reading log files without top-level array start, separator, and end elements. Though I think the last case is already covered by reading the file using a Reader and readLine, and then parsing each line with read-value.
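The line-by-line approach for the log-file case can be sketched like this (a sketch using jsonista's released read-value; the sample payload is invented):

```clojure
(require '[jsonista.core :as j])
(import '(java.io BufferedReader StringReader))

;; each line is a complete JSON document; parse line by line
(with-open [rdr (BufferedReader. (StringReader. "{\"a\":1}\n{\"b\":2}"))]
  (mapv j/read-value (line-seq rdr)))
;; => [{"a" 1} {"b" 2}]
```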

@viesti (Member) commented Mar 8, 2022

> read log files without top level array start, separator and end elements. Though I think the last case is already covered by reading the file using Reader, readLine and then parsing each line with read-value.

Piping in that this last one is common enough that it's called ndjson :)

@bsless (Contributor) commented Mar 8, 2022

Thank you for helping me get back into context :)
Now I see the justification for the parser, and then obviously read-values could be implemented with it to avoid duplication. If it ends up hitting the same code paths, I see no inherent problem.
We should add tests, but I think reading newline-delimited JSON could even be done without using a Reader/readLine, which I would actually prefer, because why create intermediary strings? read-values should be able to handle it directly as a stream.
Would you like me to PR these changes against this branch?

@Deraen (Member, Author) commented Mar 9, 2022

@bsless If you have an idea how to clean up the implementation, go ahead.

Maybe the problem I had was that the parser's .readValuesAs doesn't handle the array start element? The parser needs to be at the correct position when calling it.

I wonder if there is something in the Jackson API to handle these cases better.

> I think reading newline delimited json could even be done without using a reader/readLine, which I would actually prefer because why create intermediary strings?

Makes sense.
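The stream-based alternative could then look like this (a sketch, assuming the read-values from this branch accepts an InputStream, as the ReadValues protocol extension in this PR suggests):

```clojure
(require '[jsonista.core :as j])
(import '(java.io ByteArrayInputStream))

;; ndjson bytes go straight to read-values; no intermediary strings.
;; vec forces the lazy result before with-open closes the stream.
(with-open [in (ByteArrayInputStream. (.getBytes "{\"a\":1}\n{\"b\":2}" "UTF-8"))]
  (vec (j/read-values in)))
```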

@bsless (Contributor) commented Mar 21, 2022

@Deraen Digging through Jackson's API, it should be possible to get the parser from the MappingIterator returned by readValues, letting us drop that piece of code.
Also, my memory served me correctly: ndjson is parsed seamlessly:

(seq (read-values "{\"a\":1}\n{\"b\":2}"))
;; => ({"a" 1} {"b" 2})

Another opportunity I found was making read-values kv-reducible as well.
The problem with using readValues rather than readValue is that it assumes the input is a stream of values, so the start token when reading an array is nil.
The more I think of it, a parser shouldn't be derived from the mapping iterator, and going from the parser to readValuesAs doesn't work out well:

(iterator-seq (.readValuesAs (doto (create-parser (write-value-as-string [1 2]) default-object-mapper) .nextToken) Object))
;; => ([1 2])

At this point I think we'll need a good reason to connect the parser and read-values. If we want to clean up the implementation, maybe a macro would be appropriate to reduce some boilerplate, but otherwise they can stand on their own.

I managed to come up with a macro like:

(defmacro extend-types
  {:style/indent [1 :defn]}
  [types & args]
  `(do
     ~@(for [[[p] impls] (partition 2 (partition-by symbol? args))
             t types]
         `(extend-type ~t ~p ~@impls))))

(extend-types [File URL String Reader InputStream]
  ReadValue
  (-read-value [this ^ObjectMapper mapper]
    (.readValue mapper this ^Class Object))
  CreateParser
  (-create-parser [this ^ObjectMapper mapper]
    (.createParser (.getFactory mapper) this))
  ReadValues
  (-read-values [this ^ObjectMapper mapper]
    (.readValues (.readerFor mapper ^Class Object) this)))

This reduces the boilerplate significantly.
Thoughts?

@Deraen (Member, Author) commented Mar 22, 2022

Boilerplate in the library code isn't a problem. Macros can make namespace load times much worse and make the code harder to understand.

@bsless (Contributor) commented Mar 22, 2022

Alright, where does that leave us with regard to everything else? What do you think about the relation between readValues and the parser? It seems to me they should remain separate: readValues deals explicitly with streams, while the parser's main benefit is in navigating a data structure.

@opqdonut (Member) commented Nov 1, 2024

I think this can be closed since #82 is merged.

@opqdonut closed this on Nov 1, 2024