
Making C++ streams API public #921

Closed
indutny opened this issue Feb 22, 2015 · 19 comments

Labels
c++ Issues and PRs that require attention from people who are familiar with C++. discuss Issues opened for discussions and feedbacks.

Comments

@indutny
Member

indutny commented Feb 22, 2015

As of b968623, we finally have something that might be useful for external bindings. I'd like to eventually make this API public.

What are your thoughts on this? What should be improved/changed in order to make it more useful?

cc @iojs/collaborators @iojs/community-members @iojs/streams @iojs/tc

@indutny indutny added c++ Issues and PRs that require attention from people who are familiar with C++. future labels Feb 22, 2015
@mscdex
Contributor

mscdex commented Feb 22, 2015

Does this allow addon authors to create streams3 streams from C++? If so, what does the js-land part look like for hooking up to the generic C++ stream binding?

@indutny
Member Author

indutny commented Feb 22, 2015

@mscdex not yet, but that's a good suggestion. The current API is about efficiently wrapping the underlying socket to implement some protocol: tls, (maybe) http2, spdy, mysql...

@piscisaureus
Contributor

Conceptually +1 to having a public C++ streams base class.
However, I am not in favor of making the alloc/read/afterWrite events part of the public API.

@indutny
Member Author

indutny commented Feb 25, 2015

@piscisaureus why not? alloc/read is the whole point of this thing, because it allows efficient reading straight into the prepared memory. In fact, this is the reason why TLSSocket is faster than the legacy tls in v0.10 Node.js. AfterWrite could be shaved off, I guess.
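
For context, the alloc/read shape discussed here mirrors libuv's read API, which StreamBase wraps underneath: the alloc callback hands libuv a buffer up front, so incoming bytes land directly in memory prepared by the reader, with no intermediate copy. A minimal sketch in plain libuv (not the StreamBase interface itself):

```cpp
#include <cstdlib>
#include <uv.h>

static void on_alloc(uv_handle_t* handle, size_t suggested_size,
                     uv_buf_t* buf) {
  // Prepare the memory the socket will read into.
  buf->base = static_cast<char*>(malloc(suggested_size));
  buf->len = suggested_size;
}

static void on_read(uv_stream_t* stream, ssize_t nread, const uv_buf_t* buf) {
  if (nread > 0) {
    // The bytes are already in buf->base; process them in place.
  }
  // On error or EOF (nread < 0), a real implementation would also call
  // uv_read_stop(); omitted here.
  free(buf->base);
}

// Usage: uv_read_start(reinterpret_cast<uv_stream_t*>(&tcp), on_alloc, on_read);
```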

@indutny
Member Author

indutny commented Feb 25, 2015

@piscisaureus here is a first pass on making it suitable as a user API: #957. I'm not yet sure how to kill AfterWrite.

@brendanashworth brendanashworth added the discuss Issues opened for discussions and feedbacks. label Jun 24, 2015
@sgerace

sgerace commented Sep 29, 2015

@indutny We've been attempting to optimize a very performance-critical portion of our code base related to streams (specifically, stream transformations) and the idea of moving our implementation to C++ has been tossed around. Do you think your latest additions puts the code close to the point where a C++ streams3 implementation is possible? If you could provide a little guidance on what might be left to be done on the Node side, I'm sure we can spare some time to help out move the ball forward.

@indutny
Member Author

indutny commented Sep 29, 2015

@sgerace this StreamBase C++ API saves lots of time that is usually spent on allocating intermediate data and copying values around. This proved to be very useful for TLS and HTTP, but I can't say for sure that every use case will benefit from it.

Right now there are several blockers to exposing it:

  1. Usage of ReqWrap. This is an internal class that is responsible for wrapping libuv request handles and invoking asynchronous callbacks on them. It would need to become an abstract interface before the API can go public (a rough sketch of one possible shape follows this list).
  2. Heavy usage of the internal Environment class. StreamBase should not use it in any header files; however, it could still use it in stream_base.cc, since that file is compiled with core.
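
Purely for illustration, here is one hypothetical shape an abstracted request wrap could take, with the libuv type hidden behind a virtual completion hook. None of these names (WriteRequest, OnComplete) come from the actual codebase:

```cpp
#include <uv.h>

// Hypothetical abstract stand-in for ReqWrap<uv_write_t>: the libuv request
// stays an implementation detail, and subclasses only see the completion hook.
class WriteRequest {
 public:
  virtual ~WriteRequest() {}

  // Invoked once libuv finishes the underlying uv_write_t.
  virtual void OnComplete(int status) = 0;

  uv_write_t* req() { return &req_; }

 protected:
  WriteRequest() { req_.data = this; }

 private:
  uv_write_t req_;
};

// A libuv write callback can then recover the wrapper and dispatch:
static void AfterWrite(uv_write_t* req, int status) {
  static_cast<WriteRequest*>(req->data)->OnComplete(status);
}
```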

Another important thing is user feedback. I have used this API in two places in core so far, and I'm pretty happy with how it works. But perhaps your use case requires more flexibility from it. Let me know if you have any ideas for improvements!

Thanks!

@sgerace

sgerace commented Sep 29, 2015

@indutny Perhaps if I provide a bit more information about our exact use case, you can guide us down the correct path. Basically, we are trying to validate (both against the JSON spec and our own schema definitions), hash (in a consistent way), and gzip very large JSON objects in a streaming manner.

This whole thing actually came about because our initial implementation simply read the data, called JSON.parse(), then used that to produce a stable string (i.e., sorted object properties, etc.). Unfortunately, some of the JSON objects that we are working with are close to 1GB in size, so we were at times hitting the memory limits of Node.js and our server was crashing. We reimplemented everything as a Transform stream (so basically we had a pipeline similar to input.pipe(validator_and_hasher).pipe(gzip).pipe(out), where validator_and_hasher is our transform stream). That solved our memory issues, but now we are running into performance issues because we have to go character-by-character to validate and hash the stream of JSON in a single pass.

This is very CPU-intensive at the moment and takes far longer than we'd like (processing a ~600MB JSON file can take 3-4 minutes). My gut feeling is that it is a combination of the allocation and the access/manipulation of buffers on the JavaScript side that is preventing us from achieving the desired performance, but I'm not 100% certain.

@indutny
Member Author

indutny commented Sep 29, 2015

@sgerace do these JSON objects come over a TCP socket?

@sgerace

sgerace commented Sep 29, 2015

@indutny Yeah, the input streams come from both HTTP requests and direct TCP sockets, and the output is ultimately piped to an external (permanent) data store (such as a hard disk or S3).

@indutny
Member Author

indutny commented Sep 29, 2015

@sgerace this is a very interesting use case. I think it could probably be solved by using two StreamBase instances (one for receiving data, and one for sending data to another socket) and connecting them in C++.

The idea of StreamBase is to eliminate the networking <-> JS code interaction and do the processing in C++. However, please be aware that the networking API there is very low-level, so implementing on top of it may be more complicated.

@sgerace

sgerace commented Sep 29, 2015

@indutny So, let me see if I understand what you are saying: basically, we would create an add-on function (in C++) that accepts, as two of its arguments, two StreamBase instances (which would obviously be provided as streams from JS), and then somehow read the data from the first stream, process it, and write the result to the second stream, which would then find its way back to JS-land. Am I understanding that correctly?

@indutny
Member Author

indutny commented Sep 29, 2015

@sgerace not exactly:

  1. You'll need to provide your own C++ classes that inherit from StreamBase
  2. Wrap the recv and send sockets with instances of these classes
  3. Handle the incoming data in the C++ recv instance
  4. Process it
  5. Pass it to the C++ send instance
  6. StreamBase will send it straight to the socket

This way, JS may be skipped altogether.
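
To make the shape of steps 1-6 concrete, here is a rough sketch. StreamBase was internal at this point, so the class and method names below (ProtocolStream, OnData, SendProcessed) are illustrative stand-ins, not the real interface:

```cpp
#include <cstddef>
#include <vector>

// Illustrative stand-in for a StreamBase-derived wrapper around one socket.
class ProtocolStream /* : public node::StreamBase */ {
 public:
  void set_peer(ProtocolStream* peer) { peer_ = peer; }

  // Step 3: incoming bytes land here, straight from the socket's read buffer.
  void OnData(const char* data, size_t len) {
    // Step 4: process in C++ (validate, hash, transform, ...).
    std::vector<char> out = Process(data, len);
    // Step 5: hand the result to the send-side instance.
    if (peer_ != nullptr) peer_->SendProcessed(out.data(), out.size());
  }

  // Step 6: a real implementation would queue a write on the wrapped socket
  // (e.g. via uv_write); omitted here.
  void SendProcessed(const char* data, size_t len) {
    (void) data;
    (void) len;
  }

 private:
  // Placeholder pass-through; the interesting work would happen here.
  std::vector<char> Process(const char* data, size_t len) {
    return std::vector<char>(data, data + len);
  }

  ProtocolStream* peer_ = nullptr;
};
```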


If you want to leave the sending of data to JS, you can use just a single StreamBase instance to process incoming data and emit the processed data as if it had been received on that socket.

@sgerace

sgerace commented Sep 29, 2015

@indutny Okay, so I'm starting to get a better picture of this, but now I've got a few more questions.

  1. Before we get any further, is what you are describing even possible with the current implementation of StreamBase, or would the blockers you pointed out earlier need to be addressed first? (As an aside, I was just playing around with including stream_base.h in my add-on and it doesn't seem to be happy; I'm guessing that's because it isn't part of the public interface yet?)
  2. How does this compare to something more like the way (I think) zlib is implemented, where the streams are handled at the JS level and the data is then written and processed (asynchronously?) as packets on the C++ side? Do you think there could be significant improvements over something like that?

@indutny
Member Author

indutny commented Sep 29, 2015

  1. It is an internal thing until the blockers are resolved.
  2. It skips allocating a Buffer and handing it to JS, both of which take some time. The amount of improvement depends on the use case, but I would say most likely yes; I'm just not certain how large the improvement will be. Just to remind you one more time: the price for this is significantly increased code complexity.

@sgerace

sgerace commented Oct 6, 2015

@indutny Thanks for the summary. After looking over everything (and spending a few more days playing around with the code), we've decided to go with a more traditional zlib-style implementation for now. To your second point, we just don't know what real-world performance gains we would obtain at the cost of implementing a significantly more complex system. We can always move the logic into a C++ stream implementation if performance is still an issue once the interface is ironed out. In the meantime, however, I would certainly be interested in offering feedback on any proposed C++ stream API.
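
For reference, the zlib-style shape mentioned above (JS owns the Transform stream; C++ processes each chunk handed to it) can be sketched roughly as follows, here assuming the Nan addon layer; processChunk and ProcessBytes are hypothetical names:

```cpp
#include <nan.h>
#include <node_buffer.h>

// Hypothetical per-chunk processor: incremental JSON validation / hashing
// would go here, operating on the bytes without copying them.
static void ProcessBytes(const char* data, size_t len) {
  (void) data;
  (void) len;
}

// Called from a JS Transform's _transform(chunk, enc, cb) with a Buffer.
NAN_METHOD(ProcessChunk) {
  if (info.Length() < 1 || !node::Buffer::HasInstance(info[0]))
    return Nan::ThrowTypeError("expected a Buffer");
  // Operates directly on the Buffer's backing store; no extra allocation.
  ProcessBytes(node::Buffer::Data(info[0]), node::Buffer::Length(info[0]));
}

NAN_MODULE_INIT(Init) {
  Nan::Set(target, Nan::New("processChunk").ToLocalChecked(),
           Nan::GetFunction(Nan::New<v8::FunctionTemplate>(ProcessChunk))
               .ToLocalChecked());
}

NODE_MODULE(chunk_processor, Init)
```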

@indutny
Member Author

indutny commented Oct 6, 2015

Thanks for collaborating!

@Fishrock123
Contributor

See: nodejs/node-eps#2

@bnoordhuis
Member

Closing on the assumption that nodejs/node-eps#2 supersedes this issue. There's not been much movement here in the last four months either.
