[Expo] Direct stepper chunk support #7012

Closed
wants to merge 5 commits into from


colinrgodsey
Contributor

@colinrgodsey colinrgodsey commented Jun 10, 2017

moving over to #7047

This is an exposition PR intended to get feedback on a possible future addition to Marlin. I expect lots of input and there will be future changes, proper branch basing, etc. It should eventually just get closed out and not merged. (proof of concept?)

This feature adds a new G-code and serial extension that allows an external service to send direct step buffers to Marlin via USB serial. The intention here is to allow users that have the standard 8-bit control board and a more powerful external device to use the more powerful device to handle planning and step sequencing (the Raspberry Pi is the target hardware here).

This feature addition allows an external device to concurrently upload chunks of 1024 steps to the device, and trigger their sequencing by using a new G-code command (currently, 'C0'). The protocol for updating the chunk buffers themselves is a binary protocol that starts each packet with the control character '!'. This character is not used elsewhere in G-code (on the input side, anyway), which allows low-level processing of the serial stream and enables buffering independent of the command parser (all handled in the ISR).

C0 command format:
C0 I[chunk start index] R[number of chunks, defaults to 1] S[steps per second]
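
For illustration, here is a minimal sketch of pulling the three parameters out of a C0 line. The `ChunkCommand` struct and the `parse_c0` and `param_or` helpers are hypothetical names that do not come from the PR, and treating I and S as required is an assumption (the format above only states a default for R).

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical container for the three C0 parameters described above.
struct ChunkCommand {
  uint16_t start_index;    // I: first chunk buffer to play
  uint16_t count;          // R: number of chunks, defaults to 1
  uint32_t steps_per_sec;  // S: step rate for playback
  bool valid;              // true only for a C0 line with usable I and S
};

// Return the integer following `letter`, or `fallback` if the letter is absent.
static long param_or(const char *args, char letter, long fallback) {
  const char *p = std::strchr(args, letter);
  return p ? std::strtol(p + 1, nullptr, 10) : fallback;
}

ChunkCommand parse_c0(const char *line) {
  ChunkCommand cmd = {0, 1, 0, false};
  if (std::strncmp(line, "C0", 2) != 0) return cmd;  // not a chunk command
  const char *args = line + 2;
  const long i = param_or(args, 'I', -1);
  const long s = param_or(args, 'S', -1);
  if (i < 0 || s <= 0) return cmd;  // assumption: I and S are required
  cmd.start_index = static_cast<uint16_t>(i);
  cmd.count = static_cast<uint16_t>(param_or(args, 'R', 1));
  cmd.steps_per_sec = static_cast<uint32_t>(s);
  cmd.valid = true;
  return cmd;
}
```

With this sketch, `parse_c0("C0 I2 R3 S15000")` would describe three chunks starting at buffer 2, played at 15k steps/s.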

The execution of the chunks is done by extending the Marlin block format with a field and flag that let it execute the buffered chunk instead of looking for the normal trapezoid-related parameters. The step speed is configurable; I've had success with 10k-30k steps/s, although 30k seems to starve the temperature ISR, causing runaway errors.

The bandwidth/load limitations and whatnot were tested in advance, and the device seems capable of running healthy at 500kbps. 250kbps is probably also fine. My testing seemed to benchmark stable transfer bitrate at about half of the line bitrate (due to waiting for responses etc). This was with my test planner that doesn't implement the buffering pipeline as optimally as it could, otherwise effective bitrates could be closer to the max.

My external test planner: https://github.com/colinrgodsey/step-daemon

I finally got to a point where I could print a blazing fast 120mm/s benchy that seemed to provide the same dimensional accuracy as what I would normally get from Marlin (board is an MKS Base v1.4, ATmega2560). So, figured it's time I start cramming things down "ye ol' open source pipeline".

moebyusDev and others added 4 commits May 12, 2017 17:44
It was missing MSG_FILAMENT_CHANGE_HEAT_2 and MSG_FILAMENT_CHANGE_HEATING_2
fixed spanish lang not compiling w filament change
@colinrgodsey
Contributor Author

Hmm, I forgot to add literally any description of the chunk format...

The current chunk format is 256 bytes per chunk. Every 2 bytes describe 8 steps for 4 axes (XXXXYYYY ZZZZEEEE), letting each axis describe a step move delta of +/- 7 steps (we have to sacrifice a step here because we can't actually represent +/- 8 using 4 bits). Subtract 7 from each integer 'nibble' to get the step delta. When processing the chunk, each nibble is looked up in a small step table to get the step pattern for that delta (without using Bresenham counters). Unfortunately, direction needs to be checked every 8 steps, as it could change.
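
A rough sketch of the decoding described above, assuming the first byte of each pair carries X in its high nibble and Y in its low nibble, and the second byte carries Z and E the same way. `SegmentDelta` and `decode_segment` are hypothetical names. How nibble value 15 should be treated is not stated in this thread, so the sketch does not special-case it.

```cpp
#include <cstdint>

// Sizes taken from the description: 256 bytes per chunk, 2 bytes per 8-step segment.
constexpr uint16_t kChunkBytes = 256;
constexpr uint16_t kSegmentsPerChunk = kChunkBytes / 2;     // 128 segments
constexpr uint16_t kStepsPerChunk = kSegmentsPerChunk * 8;  // 1024 steps

// Signed step deltas for one 8-step segment (hypothetical helper type).
struct SegmentDelta {
  int8_t x, y, z, e;
};

// Decode one byte pair laid out as XXXXYYYY ZZZZEEEE.
// Each nibble is an index 0..14; subtracting 7 gives a delta of -7..+7.
SegmentDelta decode_segment(uint8_t a, uint8_t b) {
  SegmentDelta d;
  d.x = static_cast<int8_t>((a >> 4) - 7);
  d.y = static_cast<int8_t>((a & 0xF) - 7);
  d.z = static_cast<int8_t>((b >> 4) - 7);
  d.e = static_cast<int8_t>((b & 0xF) - 7);
  return d;
}
```

For example, the pair `0x7E 0x07` would decode to X = 0, Y = +7, Z = -7, E = 0.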

@bobc

bobc commented Jun 10, 2017 via email

@colinrgodsey
Contributor Author

colinrgodsey commented Jun 10, 2017

@bobc So comms errors are actually handled correctly: each chunk is sent with a checksum, and Marlin will eventually respond with "!ok X", "!fail X", or "!busy". Pretty similar to the gcode pipeline. The responses are different so you can effectively multiplex the command streams (chunks and gcode).
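
As a sketch of how a host could tell the chunk replies apart from ordinary G-code replies on the same serial line: the `ChunkReply` and `ParsedReply` types and the `parse_chunk_reply` helper are hypothetical, and reading "X" as a numeric chunk index is an assumption, since the thread does not say what X stands for.

```cpp
#include <cstdlib>
#include <cstring>

// Reply kinds for the chunk stream; anything else belongs to the G-code stream.
enum class ChunkReply { Ok, Fail, Busy, NotChunkReply };

struct ParsedReply {
  ChunkReply kind;
  long chunk;  // assumed to be a chunk index; -1 when not present
};

ParsedReply parse_chunk_reply(const char *line) {
  if (std::strncmp(line, "!ok ", 4) == 0)
    return {ChunkReply::Ok, std::strtol(line + 4, nullptr, 10)};
  if (std::strncmp(line, "!fail ", 6) == 0)
    return {ChunkReply::Fail, std::strtol(line + 6, nullptr, 10)};
  if (std::strncmp(line, "!busy", 5) == 0)
    return {ChunkReply::Busy, -1};
  // No '!' prefix match: leave it to the normal G-code response handling.
  return {ChunkReply::NotChunkReply, -1};
}
```

Because every chunk reply starts with '!', a plain "ok" from the G-code pipeline falls through to `NotChunkReply`, which is what makes the multiplexing workable.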

The response pipeline also lets you "batch" send as many chunks as you want, even before you get a response. This is the key to getting the full available bandwidth; I just have not implemented that in step-daemon yet because it's more complex than "one at a time" (due to error handling). That's where the 50% bandwidth comes into play; if you did it better you could achieve close to the line rate. Plus currently, if you wanted to run your chunks at 50k steps/s (impossible in marlin right now), you'd only need an effective bitrate of about 100kbps (easily covered with a 250kbps line rate).
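
The 100kbps figure follows from the chunk format: 256 bytes carry 1024 steps, so each step costs 2 payload bits. A small sketch of the arithmetic (function names are hypothetical; the 8N1 variant assumes standard UART framing of 10 line bits per byte, and packet header and checksum overhead are ignored):

```cpp
#include <cstdint>

constexpr uint32_t kChunkBytes = 256;      // bytes per chunk
constexpr uint32_t kStepsPerChunk = 1024;  // steps per chunk

// Payload bytes per second needed to sustain a given step rate.
constexpr uint32_t payload_bytes_per_sec(uint32_t steps_per_sec) {
  return static_cast<uint32_t>(
      static_cast<uint64_t>(steps_per_sec) * kChunkBytes / kStepsPerChunk);
}

// Payload bits per second (8 bits per byte, no framing).
constexpr uint32_t payload_bits_per_sec(uint32_t steps_per_sec) {
  return payload_bytes_per_sec(steps_per_sec) * 8;
}

// Line bits per second assuming 8N1 framing (start + 8 data + stop).
constexpr uint32_t line_bits_8n1(uint32_t steps_per_sec) {
  return payload_bytes_per_sec(steps_per_sec) * 10;
}
```

At 50k steps/s this gives 12,500 bytes/s, or 100,000 payload bits/s, and about 125,000 line bits/s with 8N1 framing, which still fits inside a 250kbps line rate.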

As far as when the "chunk buffers" are full in marlin, it starts responding with "!busy", in which case the external planner (step-daemon in my case) will take a calculated pause. So it does definitely waste some bandwidth here and there, but only when the chunk buffers are full up.

So, the encoding is definitely fixed at 4 stepper motors. It looks like Marlin is really set up to handle 4 steppers, except with MIXING_EXTRUDER enabled. So I think that covers the majority, but for mixing extruders, we'll probably need another format.

Which brings me to formats. I think ideally there should be some standards for formats. Think of bitmaps: bitmaps have a pretty set format (usually just a flat buffer with idx = x + y * width), but there are several actual pixel encodings available that affect how much information can be stored in each 'pixel', without really affecting the format. So basically, an expandable format with different 'pixel' encodings would probably be best.

As far as the "wave" tables (step tables) I have, and the 8-step resolution, that is a compromise that may be different for certain device situations. For this use case, I'm assuming 8 or 16 division microstepping. The actual step curve for microstepping can be rather odd, so it seemed like a great place to hide the "8 step lines" that the chunks boil down to. The lines should be multiple steps, as otherwise we waste cycles and bandwidth by having to encode and check more direction data (a single 1024 step chunk can change directions 128 times max per axis).

EDIT: and at 15k steps/sec, there's really no noticeable harmonic noise. The wave tables introduce a tiny bit of noise, but I wouldn't say it's worse than the noise produced with plain marlin. Just sounds a little different. I initially had marlin introducing some entropy into the wave table positioning, but it made a rather displeasing white noise. The wave tables I have each produce a discrete, but not horrible, tone. I think the Bresenham line algo basically produces other similar "coherent" noise.

EDIT2: Also, this format makes extensive use of 8-bit processing for performance's sake. All the math involved in "playing" the chunks is based on 8-bit bitwise operations.

@bobc

bobc commented Jun 10, 2017 via email

@colinrgodsey
Contributor Author

colinrgodsey commented Jun 10, 2017

Test video showing some of the motion, and you can hear the sounds. Definitely a bit different from Marlin, but not loud at all. Video was done at 15k steps/s for the chunks (20 and 30 are even better). My squeaky z-axis during z-lift is the loudest part. That's my general noise benchmark so far ;)

https://youtu.be/lDJzmHbLk6M

@bobc yea that's definitely a good idea. I think maybe some fixed (standard) formats would be good tho, just so you can provide optimized pipelines. Or at least fixed standards in marlin itself, on 8-bit boards you're already so restricted for processing time and space (RAM and application memory), you could maybe only have 1 or 2 encodings enabled at a time, and some of them will be more optimal (and possible) on that particular board.

This klipper project is interesting, didn't know about that. I initially thought of trying to do it that way, but the idea of writing your own firmware is incredibly scary lol. I made it a hard goal for this project that my printer wasn't really more likely to catch on fire than when running just Marlin. The klipper people are brave for attempting that. Besides, the implementation I needed really fit just fine with the normal G-code pipeline. The external planner basically only does 2 things: a) controls the g-code pipeline and monitors when it needs to sync positions with the printer, like after homing; b) takes large segments of G0 and G1 commands and uses the chunk pipeline to ultimately turn them into C0 commands (backed by chunk buffers). Turning them into C0 commands ultimately requires the planning and stepping logic.

EDIT: Even the auto bed leveling is still done by marlin, the external planner just scrapes the probe info from serial, and feeds it into its own algos in the case of step daemon (bicubic for the win!)

@bobc

bobc commented Jun 10, 2017 via email

const uint8_t dE = b & 0xF;

uint8_t steps[4] = { 0 };
steps[X_AXIS] = block_moves[dX][(block_steps + 0) & 0x7];
Contributor Author

@colinrgodsey colinrgodsey Jun 10, 2017

The 0,2,4,6 offsets here just add a permanent offset into the wave tables for each axis. Probably not needed and just wasting cycles, but the theory there was to prevent noise. If all axes are stepping at the same rate, we want the pulses to stagger a bit. But the chances of them all stepping at the same rate are probably low.
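
To make the stagger idea concrete, here is a sketch that uses a generated pattern as a stand-in for the hand-made `block_moves` table (the real table contents are not shown in this thread). `wave_step`, `axis_steps_now`, `steps_in_segment` and `kPhase` are hypothetical names; only the 0, 2, 4, 6 offsets and the `& 0x7` wrap come from the diff.

```cpp
#include <cstdint>

// Stand-in for the hand-made wave table: spreads `count` steps (0..8)
// evenly over the 8 ticks of a segment. Not the PR's actual table.
bool wave_step(uint8_t count, uint8_t tick) {
  const uint8_t t = tick & 0x7;  // wrap into the 8-tick window
  return ((t + 1) * count) / 8 != (t * count) / 8;
}

// Per-axis phase offsets (X, Y, Z, E), matching the 0, 2, 4, 6 in the diff.
constexpr uint8_t kPhase[4] = {0, 2, 4, 6};

// Whether `axis` pulses on this tick, given its step count for the segment.
bool axis_steps_now(uint8_t axis, uint8_t count, uint8_t tick) {
  return wave_step(count, static_cast<uint8_t>(tick + kPhase[axis]));
}

// Total pulses an axis emits over one full 8-tick segment.
uint8_t steps_in_segment(uint8_t axis, uint8_t count) {
  uint8_t total = 0;
  for (uint8_t tick = 0; tick < 8; ++tick)
    if (axis_steps_now(axis, count, tick)) ++total;
  return total;
}
```

The offset only rotates the pattern, so each axis still emits the same number of steps per segment; two axes moving at the same rate simply pulse on different ticks.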


#define UPDATE_DIR(AXIS) \
if(d## AXIS == 0) {} \
else if(d## AXIS < 7) SBI(dm, AXIS ##_AXIS); \
Contributor Author

i wish there was a faster way to do this

// Stop an active pulse, reset the Bresenham counter, update the position
#define PULSE_STOPC(AXIS) \
if (steps[_AXIS(AXIS)]) { \
count_position[_AXIS(AXIS)] += count_direction[_AXIS(AXIS)]; \
Contributor Author

in theory this should let us resume marlin based planning after doing chunks, provided the planner syncs the "real" position back to it.

#define BLINK_LED LED_BUILTIN

//wave tables, 4 bit move, +/- 7
uint8_t block_moves[16][8] = {
Contributor Author

wave tables for each step frequency, done by hand. this is my third version for this encoding, it seems to be the least annoying noise wise.

@colinrgodsey
Contributor Author

colinrgodsey commented Jun 10, 2017

@bobc the main advantage here is that it allows you to get 32-bit performance (and more) out of hardware you probably already have: an 8-bit or above Marlin compatible board, and a Raspberry Pi. The step-daemon project is targeted specifically at the RPi, and will offer a virtual serial interface letting you use octoprint with it. Basically, you can use hardware you already have on hand, and allow your device to handle kinematics and other advanced features past what you would normally be able to do with Marlin alone on that board. All that, plus hopefully allowing Marlin to go at a faster step rate because it has nothing to do other than handle the serial ISR, temp ISR, and stepper ISR. No planning, no real math, no bed leveling, no line counters, no linear advance; it literally has to just handle the realtime components. And all this is possible with a daemon that lives on your RPi and takes maybe 10-20% CPU.

The step daemon software uses full 64-bit precision floating point, and real vector math and linear algebra to solve planning problems. The rpi has really impressive floating point support, even for 64-bit precision. An example I gave out earlier: the math required for processing at 20k steps/sec is equivalent to simulating a 500-particle 3d system at 60fps. Which is pretty simple, the rpi can run minecraft, for example ;)

@bobc

bobc commented Jun 10, 2017 via email

@colinrgodsey
Contributor Author

colinrgodsey commented Jun 10, 2017

I find it impossible to believe that the CPU limitations are based solely on the ISR. You have 32-bit floating point math, long math, int math, tons of things happening routinely that really have no business being on an 8-bit processor. Hell, there's even long divides. This solution is virtually no different from what smoothie or the replicape do: they have an advanced processor that produces sequences for the real-time low-bit cores. Where they have the advantage of DMA and similar, we have to suffer through serial transfer. But I have already proven that 250kbps of raw transfer (through the chunk pipeline) sacrifices no more than 15% of available CPU time on an ATmega2560 (16 MHz), so there's really nothing stopping this from reaching higher step rates.

100kbps effective serial rate (which would be needed for 50k steps/s), means 12.5k ISRs a second with hardware serial. That's virtually nothing.

I'm sorry you can't see the value in this, and unfortunately this means I'll have to produce more "marketing nonsense" to showcase its worth. But the fact of the matter is, we're generally trying to run something as complex as "doom" on the equivalent of a calculator CPU, and that's where I feel the true nonsense is. You can be magnitudes more complex and exact with your calculations as soon as you offload this stuff. Just like slicing is a natural division of duties in 3d printing, I believe planning and stepping is too.

Also, I'd like to note that the hard limit in marlin of 40k steps/second is based on stepper driver limitations. For core/MVP support, that's not a limit I really want to push.

EDIT: also since submitting this, I've refined the chunk stepper routine to use about half the cycles it previously was. Real numbers soon

EDIT2: Erm, and sorry, a TI-86 is 6 MHz. So I guess an ATmega2560 would be a bit less than 3 of those, but with far more RAM

@thinkyhead thinkyhead added T: Design Concept Technical ideas about ways and methods. T: Development Makefiles, PlatformIO, Python scripts, etc. S: Don't Merge Work in progress or under discussion. S: Experimental labels Jun 10, 2017
@thinkyhead
Member

thinkyhead commented Jun 10, 2017

I'm very interested to look more closely at the code and see what's going on, and I will do so now…

I've rebased your branch onto the current bugfix-1.1.x and resolved conflicts. If you want to replace your branch with a new one, my branch is at https://github.com/thinkyhead/Marlin/tree/bf_chunk_support.

To replace your branch with the contents of mine:

git remote add thinkyhead git@github.com:thinkyhead/Marlin.git
git fetch thinkyhead
git checkout chunk_support
git reset --hard thinkyhead/bf_chunk_support
git push -f

You could then use your branch to make a new PR targeted at bugfix-1.1.x, if you want.

@colinrgodsey
Contributor Author

colinrgodsey commented Jun 10, 2017

@thinkyhead great! I'll probably rebase (and open a new PR) rather soon. I have some pending changes that should help with performance, but I wanted to push this through as it's basically what I used for my last test. Once I get the new changes tested, I'll merge and rebase. Also going to add a note on one line of the PR here (for the temp ISR): I apparently did push a line that wasn't tested, and I don't really know what the effects are currently.

//sync planner position back up with stepper positions
Planner::sync_from_steppers();
//temperature ISR can get drowned out under high step rate, make sure it gets run
Temperature::isr();
Contributor Author

!! This line has not been tested !! It snuck into my PR.

Member

It's probably not 100% safe. We've already done some tweaking to leave space for the serial UART, and have improved the reliability of Temperature::isr generally, but if the stepper ISR is run at too high a rate, the temperature ISR does get missed. There is a flag indicating that the temperature ISR is active, so if the stepper ISR is interrupting it then that can be checked.

@thinkyhead
Member

thinkyhead commented Jun 10, 2017

I'll probably rebase rather soon,

At that point you'll need to close this PR and open a new one, because you can't re-target a PR to a different branch.

@colinrgodsey
Contributor Author

@thinkyhead awesome, yea this PR will be a throwaway. Completely conceptual (but proven) right now, just for the proof of concept and RFC. Next one I post should be a functional example; this one is more to get some rather raw input. The code itself is mostly garbage water, not really much thought put into where I put globals and header fodder. Think I switched between snake and camel case a few times.

Will probably use the same branch; suppose there's not much point in keeping this one open after I move to the new one.

@thinkyhead
Member

If you can provide any hard data to sway the skeptics, that will go a long way towards justifying this concept. I haven't yet absorbed the full picture, but it has all the hallmarks of an optimization…

@thinkyhead
Member

thinkyhead commented Jun 10, 2017

The intention here is to allow users that have the standard 8-bit control board and a more powerful external device to use the more powerful device to handle planning and step sequencing

I've also developed a nice proof-of-concept that uses two RAMPS boards, both running Marlin, to share resources by coordinating via i2c. This is another easy way to take advantage of inexpensive hardware. The second Marlin instance may be completely stripped down. In my proof-of-concept the "slave" board handles all Z movement for 4 independent Z stepper motors, while the "master" board does everything else. Other divisions of labor should easily be possible, so you might have one Marlin running the UI and sensors while the other one only handles motion. In any case, offloading computation across more inexpensive boards is a great way to go, and can help extend the life of 8-bit boards even as the 32-bit boards come down in price.

@colinrgodsey
Contributor Author

@thinkyhead awesome! Yea, I think there's a great benefit to multiprocessing in general here, in almost any form. Even the external planner software (step-daemon) i wrote for this uses discrete pipeline processing for each stage of the handling, in this case to make sure the work is distributed properly for multi-process on a multi-core CPU. Marlin is just the last stage in the pipeline. Could conceptually be spread across devices too. More power is more power, as long as the serialization and transmission isn't more costly ;)

Anyways, I just figured it would be awesome if I could cram my rpi into the pipeline, given how capable a system it is, and how common it is with 3d printing anyways.

I agree on the PR. I wanted to get this in to "prime the pump" so to speak, next PR I intend to be complete, hopefully with a corresponding functional version of step-daemon so people can test. And yes, lots of numbers!

@thinkyhead
Member

thinkyhead commented Jun 11, 2017

Is this overall tech meant to be packaged as a plugin for OctoPrint, given that it's the de facto RPi printing host?

@colinrgodsey
Contributor Author

It could be. That touches on lots of usability issues, which is definitely something that I've been thinking about. I think it's perfectly reasonable that this could be deployed as an octoprint plugin. For step-daemon, it's a java application, and raspbian should come with the oracle JVM available. On the other hand, step-d could benefit from real-time or high-nice priority (which is more easily done with a system level launcher). It should be able to tolerate delays of up to about 10ms without really missing a beat. It can deal with java JIT and GC, so it should technically be happy with the longish pauses that will happen in a busier system. I ideally see it as something that would be deployed with octoprint tho.

@thinkyhead
Member

thinkyhead commented Jun 12, 2017

The idea of interfacing at the block level is good, as it keeps everything in the same destination queue. If there are unused variables from the normal stepper block, then they can be combined with the chunk data as a union so that the stepper blocks don't need to increase in size.
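
A rough sketch of the union idea, using a heavily trimmed, hypothetical stand-in for the stepper block. `DemoBlock`, `TrapezoidParams`, `ChunkParams`, `BLOCK_FLAG_CHUNK` and `is_chunk_block` are illustrative names only, and the trapezoid fields shown are just examples of parameters a chunk block would not need.

```cpp
#include <cstdint>

// Illustrative trapezoid parameters that a chunk block leaves unused.
struct TrapezoidParams {
  int32_t accelerate_until, decelerate_after;
  uint32_t initial_rate, final_rate;
};

// Chunk playback data: so far the PR needs only one extra 8-bit field.
struct ChunkParams {
  uint8_t chunk_index;
};

// Hypothetical flag bit marking a block as "play a buffered chunk".
constexpr uint8_t BLOCK_FLAG_CHUNK = 0x80;

// Trimmed stand-in for the stepper block; the union overlays both layouts.
struct DemoBlock {
  uint8_t flag;
  union {
    TrapezoidParams trapezoid;  // valid when BLOCK_FLAG_CHUNK is clear
    ChunkParams chunk;          // valid when BLOCK_FLAG_CHUNK is set
  } params;
};

// The chunk data fits inside the existing storage, so the block does not grow.
static_assert(sizeof(ChunkParams) <= sizeof(TrapezoidParams),
              "chunk data must fit in the trapezoid storage");

inline bool is_chunk_block(const DemoBlock &b) {
  return (b.flag & BLOCK_FLAG_CHUNK) != 0;
}
```

The flag decides which member of the union is meaningful, so the stepper ISR would check it once per block before touching either layout.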

I haven't seen the portion of the code where it synchronizes back to the high-level current_position — but this would of course be a necessary addition. Hooks are already in place. We built some foundational functions to get the current position from the steppers, for any kinematic system, and accounting for any active bed leveling (except UBL, currently). The functions I wrote for the purpose are:

  • set_current_from_steppers_for_axis
  • get_cartesian_from_steppers

I know you're still working on this, but as I've been studying your code, I've also updated my branch with some elements that you should include in your update. Most important is making CHUNK_SUPPORT an optional feature. I've put conditionals around all the code that needs it. This just makes continuing development easier. I'm also going to presume that ADVANCE and LIN_ADVANCE are incompatible with this feature.

@colinrgodsey
Contributor Author

colinrgodsey commented Jun 12, 2017

Great! Yea I'll probably combine any extra fields I need with a union. I'm thinking of simplifying the bitwise math a bit more just to use counters, so I'll need some extra fields somewhere, although maybe as statics. So far there's just the one extra 8-bit field; I'll probably union that with one of the unused planner fields.

The only position syncing done on the marlin side right now is syncing the stepper positions back to the planner after executing a chunk, no real automatic cartesian syncing on the marlin side, just because bed leveling and all the translation will be on the external planner. Not sure marlin has enough information to calculate the real cartesian point after running chunks.

Currently I do have step-daemon syncing positions after detecting a few different g-codes in the pipeline (like after homing etc). I'll have to make sure it catches all the possible conditions in which it needs to sync, maybe something periodic too (right now the LCD just never updates its position during printing, which is somewhat annoying). But there are many g-code entry points into the coordinate systems, which is great, and has made it really easy to do pure g-code solutions for most things, just watching, inserting and modifying the g-code pipeline as it goes by.

I really appreciate the branch updates! I feel bad I didn't get those done in the first place. I'll make sure I get that merged and get a cleaner version done tonight (and probably a new PR, with the new basing). I'll try to get some better comments in there too.

I think LIN_ADVANCE could be implemented to work with the chunk system. Something like bed leveling, for example, is basically impossible because the external planner takes on most of the cartesian handling role, but LIN_ADVANCE (from what I've seen) seems to be just stepper based, so I think it could work. I've actually looked at it a bunch, but I don't really have a good mental picture of how the implementation is done.

To be honest, I've had a hard time trying to follow the LIN_ADVANCE implementation. It seems rather complex, and 'weighty' computation wise (but does work great). It's one of the reasons I got inspired to start doing some of this work: I started thinking about the general problem of linear advance again just as a fun computational physics problem (I need these now and again or I forget how to math). And I arrived at the conclusion that it would be cool if you could handle the physics, kinematics, stepping, all in a high-level language that used real vector math and linear algebra. Was also like, it's a damn shame I can't use the RPi, which has a rather stunning capability for that kind of processing. I bounced back and forth between ideas. Shared dual-access RAM was the contender for a bit, but I thought it would be more valuable to do it with no new hardware. And, you know, maybe actually make the solution useful for somebody. Rambling... not knocking the LIN_ADVANCE implementation! It's just an inspiring physics problem.

But, that's step daemon. That project may in fact go nowhere at all lol. I want to focus on it more as a proof of concept for this firmware extension. I think it's worth it to consider LIN_ADVANCE compatibility with the extension; I'd like to preserve as much pure Marlin functionality with it as possible.

TL;DR- No linear advance yet, but I think it should be done if possible.

unsigned char dm = last_direction_bits;

#define UPDATE_DIR(AXIS) \
if(d## AXIS == 0) {} \
Contributor Author

@colinrgodsey colinrgodsey Jun 12, 2017

this line is plain wrong. Should be

d## AXIS == 7

Just so no bit is set or removed for direction. Index 7 is 0 steps (dX = index - 7). This would cause it to change the direction unnecessarily often.
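
Written out as a function instead of a macro, the corrected direction logic might look like the sketch below. `update_dir` is a hypothetical name. The set-bit-for-negative convention follows the `SBI` branch visible in the diff excerpt; the clear-bit branch for positive deltas is an assumption, since that part of the macro is not shown here.

```cpp
#include <cstdint>

// Update one axis's bit in the direction mask from its nibble index.
// Index 7 means zero steps (delta = index - 7), so the bit is left alone.
uint8_t update_dir(uint8_t dm, uint8_t axis_bit, uint8_t nibble) {
  const uint8_t mask = static_cast<uint8_t>(1u << axis_bit);
  if (nibble == 7) return dm;                            // no motion: keep direction
  if (nibble < 7) return static_cast<uint8_t>(dm | mask);  // negative delta: set bit
  return static_cast<uint8_t>(dm & ~mask);               // positive delta: clear bit (assumed)
}
```

With the original `== 0` test, an idle axis (index 7) would fall through to the positive branch and could flip its direction bit even though it has nothing to step.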

steps[Z_AXIS] = block_moves[dZ][(block_steps + 4) & 0x7];
steps[E_AXIS] = block_moves[dE][(block_steps + 6) & 0x7];

//start of block, check direction
Contributor Author

@colinrgodsey colinrgodsey Jun 12, 2017

ugh, so inside this routine, when I say 'block', I mean '8 step line'. I doubled up on the meaning here, which conflicts with the marlin block_t. I'll probably rename all these vars to use 'segment' instead of 'block'.

@colinrgodsey
Contributor Author

moving over to #7047


4 participants