Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[outdated] Add support for ENC424J600 with bumped minimq #154

Merged
merged 17 commits into from
Sep 27, 2021

Conversation

HarryMakes
Copy link
Contributor

@HarryMakes HarryMakes commented Apr 12, 2021

Summary

This PR is continued from #134 to properly implement support for the ENC424J600 controller, requiring updated crates of the enc424j600 driver (see PRs #7 and #11) and minimq (see diff).

Key Changes

  • @occheung: Introduce Cargo feature phy_enc424j600 for implementing MQTT client functionality with ENC424J600 controller
  • @occheung: Upon reboot, apply Booster's EEPROM EUI48 to ENC424J600's volatile RAM as MAC address
  • Bump minimq to quartiq/minimq@d2ec3e8
  • In following situations, eliminate panics or watchdog triggers on ENC424J600's variant while keeping trying to resume network connection:
    • MQTT broker IP address is invalid or currently cannot be connected

(New changes since 6b82f96)

  • Adopts smoltcp-nal 0.1.0 to implement the NAL stack required by minimq.
  • Implement mutex for accessing the MQTT client and the underlying NAL stack for processing the MQTT packets and polling the smoltcp interface respectively.

Testing

Network connection and MQTT features remain functional for both ENC424J600 variant on v1.4 User IFC board and W5500 variant on v1.3 User IFC board (using default crate features).

Unresolved issues (new since 6b82f96)

  • MQTT client will not reconnect to the broker after disconnection. Can potentially be solved by bumping minimq or smoltcp_nal, but would require rewriting and re-testing the w5500 crate due to bumped embedded_nal.
  • PHY linkup is not actively monitored, so if it is unplugged from the network, the socket TCP state keeps at ESTABLISHED (because there is no timeout/keepalive/socket close()), meaning in #[telemetry] the MQTT client will keep calling publish() to queue in the socket TX buffer until it is full, and keep emitting WriteFail errors.

Copy link
Member

@ryan-summers ryan-summers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see comments in-line

@@ -43,7 +43,7 @@ jobs:
with:
command: clippy

compile:
compile-w5500:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the matrix strategy to compile with both the w5500 feature and the enc424j600 instead of duplicating the pipeline

This comment was marked as resolved.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All we have to do is update the artifact names of the generated firmware images. E.g. move target/thumbv7em-none-eabihf/booster after the W5500 build to booster-w5500 or something similar. There's no need to make separate binaries.

Copy link
Contributor Author

@HarryMakes HarryMakes Jun 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be resolved by the new CI workflow definition in 3b97ec4 be8f747:

  • Uses matrix to compile an individual job for each Eth controller variant, including release (stable + beta Rust toolchain) and unstable.
  • The action/upload-artifacts step names the artifacts uniquely for each variant.

src/main.rs Outdated Show resolved Hide resolved
src/main.rs Outdated Show resolved Hide resolved
Cargo.toml Outdated Show resolved Hide resolved
Cargo.toml Outdated Show resolved Hide resolved
src/enc424j600_config.rs Outdated Show resolved Hide resolved
src/enc424j600_config.rs Outdated Show resolved Hide resolved
src/enc424j600_config.rs Outdated Show resolved Hide resolved
src/main.rs Outdated Show resolved Hide resolved
src/main.rs Outdated Show resolved Hide resolved
Copy link
Contributor Author

@HarryMakes HarryMakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review! Please see my comments below. Some main points to do:

  • Determine if our ENC424J600 driver has unnecessary operations (e.g. delays in ns, looping during NAL's connect attempt, etc)
  • Determine if certain issues can be done by pushing PR to minimq (e.g. regarding error-handling after read/write)
  • Verify control of TX/RX buffers

@@ -43,7 +43,7 @@ jobs:
with:
command: clippy

compile:
compile-w5500:

This comment was marked as resolved.

Cargo.toml Outdated Show resolved Hide resolved
Cargo.toml Outdated Show resolved Hide resolved
src/enc424j600_config.rs Outdated Show resolved Hide resolved
src/enc424j600_config.rs Outdated Show resolved Hide resolved
src/main.rs Outdated Show resolved Hide resolved
src/main.rs Outdated Show resolved Hide resolved
src/main.rs Outdated Show resolved Hide resolved
src/main.rs Outdated Show resolved Hide resolved
occheung and others added 7 commits June 16, 2021 13:16
(afdab5f) cargo: include enc424j600 dependencies

(39e0b55) main: feature gate different eth device

(e845f56) nal: implement enc424j600 variant

(befe70a) nal: fix unnecessary w5500 dependencies

(20efdcc) settings: override eeprom MAC

(0d90470) use cargo fmt

(6e13f4f) cargo: add w5500 as default

(83380a9) cargo: simplify smoltcp dependency

(83109c3) settings/mac: remove redundant code

(a3db62f) mac: use smoltcp

(15f0964) eth_cfg: split off from main

(4ae6e82) nal: migrated to enc424j600

(e66d916) phy_config: minor fix

(3869217) clean up
* This also tells the ENC424J600 driver to use Cortex-M instructions to perform SPI CS_n delays at the current firmware CPU clock frequency.
@HarryMakes HarryMakes force-pushed the feature/enc424j600_merged branch 2 times, most recently from 3b97ec4 to be8f747 Compare June 18, 2021 08:17
@HarryMakes
Copy link
Contributor Author

Hi @ryan-summers, I've rewritten most of my original PR commits to make the code more understandable and take advantage of all our other libraries under https://github.com/quartiq. Here are some major differences from my previous request for review:

Bumping minimq to quartiq/minimq@d2ec3e8

  • I specifically picked this commit because it is the newest commit right before minimq bumps embedded-nal from 0.1.0 to 0.6.
  • I do not want to bump embedded-nal because W5500 driver would have to be re-written as well, which is not the goal of this PR.
  • Originally, minimq was only bumped to quartiq/minimq@933687c, and further bumping to quartiq/minimq@d2ec3e8 actually brings in an issue that both W5500 and ENC424J600 needs to fix (in another PR):
    • Feature/session state minimq#43 adds maintained MQTT session state, swaps out the Disconnected error in case of connection loss and adds the SessionReset error in case of broker session loss after reconnection.
    • Currently (with 6b82f96), I only changed Disconnected to SessionReset, but this error cannot happen on ENC424J600+smoltcp-nal because this minimq version doesn't close() the socket when the broker is disconnected, and smoltcp-nal cannot connect() the smoltcp::TcpSocket as the socket TX half is still open (e.g. stuck in CLOSE-WAIT) and doesn't have a keep alive interval to close it automatically.
      • In src/enc424j600_config.rs, the set_keep_alive() line doesn't do anything because the keep alive value gets reset when the NAL stack does connect(). I haven't removed this line or fixed this behaviour yet because it's not needed for this PR.
      • quartiq/minimq@d7533f0 should fix this behaviour, but embedded-nal has already been bumped over there, and so this fix is not of interest to me this time.
    • This has NOT been tested on W5500 to see if any behaviour changed upon broker disconnection.

Switching to smoltcp-nal 0.1.0

  • I specifically picked 0.1.0 because, if I read it right, none of the later commits on smoltcp-nal main branch are related to our current application. For example, we don't use DHCP or port randomisation.
  • By scrutinising smoltcp-nal, I learnt that:
    • It simply calls the smoltcp API that accomplish the goals defined for each trait method on embedded-nal, such as obtaining a free socket, sending connection request to a remote, accessing the TX/RX buffers, and detecting connection status. It doesn't need to loop, for example, to ensure the signals or packets have actually taken place for each method call.
    • It also has a user-facing poll() method for the main program to process packets and signals elsewhere. Thus, as in 0c01e2d, I could implement the polling procedure is independently from the NAL library.
  • In order to handle reconnection to the broker properly, either minimq or smoltcp-nal would eventually be bumped. Other than the pending minimq fix I mentioned above, we could also try the existing smoltcp fix DHCP / PHY reset support smoltcp-nal#10 to force-close the sockets once disconnected. Again, this bug should be dealt with separately.
  • smoltcp-nal 0.1.0 also simply discards smoltcp errors per poll(), which might not be ideal for debugging the network stack.

Bumping enc424j600 to latest master

  • Since the commit in our original PRs, I fixed some SPI issues that would make SPI transactions more reliable and efficient, and make the API more user-friendly overall (e.g. removing the need to provide a nanosecond-delay function pointer, and renaming functions for clarity).
  • Detailed changes can be found in my PR #11 on the driver repo.

Implementing mutex for minimq::MqttClient and smoltcp-nal::NetworkStack

  • Because we want to use the smoltcp-nal poll() function while the MQTT client is processing publishes/subscribes, and MqttClient::new() moves the NetworkStack to the client, we have to produce 2 references for safe access in the same concurrent context.
  • To do so, I took stabilizer's current code as reference to create a mutex proxy abstraction for the NAL stack. I think Feature/telemetry stabilizer#341 introduced this idea to the Stabilizer firmware.
  • As a result, I introduced the following mutex-related structs to this PR:
    • SmoltcpNalStackProxy: the proxy that owns a mutex that's the AtomicCheckMutex struct defined within src/mutex.rs, rather than shared_bus. Maybe we should switch to shared-bus in a separate PR...
    • SmoltcpNalStackManager: the manager that can distribute proxies, implemented similarly to AtomicCheckManager and NetworkManager in the Stabilizer firmware.
    • SmoltcpNalStackController: the controller that controls the NAL stack via a proxy, implemented similarly to NetworkProcessor in the Stabilizer firmware. It also owns an embedded-time epoch clock, whose tick frequency is specified in the main program (using Rust 1.51.0+ const generics).
  • To ensure the client (MqttClient) and the independent stack controller (SmoltcpNalStackController) have mutual exclusive access to the underlying NAL stack, in the main program I defined a struct EthernetManager that replaces the mqtt_client shared RTIC resource on both W5500 and ENC424J600 variants.
    • This manager owns the MQTT client, and additionally the stack controller only for the ENC424J600 variant. In the critical sections on #[idle], lock() is now called on this manager instead of just the MQTT client.

Miscellaneous changes

  • Switched to cortex_m::singleton for the one-time use of smoltcp NetStorage for setting up the ENC424J600+smoltcp-nal MQTT client.
  • Substituted the constant SPI_CLOCK_FREQ from enc424j600 with the literal value of 14.mhz(). Later on, enc424j600 might be able to verify the frequency against the max value stored by SPI_CLOCK_FREQ when the SPI port abstraction is getting instantiated.
  • Improved GitHub workflows as per discussion on [outdated] Add support for ENC424J600 with bumped minimq #154 (comment).

I hope these new changes are implemented well enough for the purpose of adding the ENC424J600 variant to NGFW. I'll make separate PRs/issues to get into the other unhandled behaviours brought by my PR.

Copy link
Member

@ryan-summers ryan-summers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's one main design discussion I'd like to resolve before moving forward. See below

src/smoltcp_nal_stack.rs Outdated Show resolved Hide resolved
@HarryMakes
Copy link
Contributor Author

Dear all, I've edited my code up to commit 88930f4. I ran a Booster with this firmware for over 27 hours and there were no unexpected errors or panics during normal operation (i.e. no disconnection from MQTT broker during the test). However, the "unresolved issues" in #154 (comment) are still relevant, and I don't think those should be addressed within this PR.

Copy link
Member

@ryan-summers ryan-summers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking very good - thanks for the work on this front. I agree that we should eventually update smoltcp-nal, minimq, and jump over to miniconf, but that work should definitely be reserved for a future PR.

Just a few minor comments so far.

src/enc424j600_api.rs Outdated Show resolved Hide resolved
src/main.rs Outdated
#[cfg(feature = "phy_enc424j600")]
type Ethernet = NetworkStack<'static, 'static, enc424j600::smoltcp_phy::SmoltcpDevice<Enc424j600>>;
#[cfg(feature = "phy_enc424j600")]
type NalClock = enc424j600_api::EpochClock<168_000_000>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should perform some type of run-time check to verify that the template argument here matches the configured CPU clock - otherwise, if we update the CPU clock speed, we may miss this and timing would go haywire.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Ryan, I tried to implement a check by declaring the CPU frequency as a global const (CPU_FREQ), using this const as the const generic parameter for the EpochClock (e8bd495), and additionally asserting the equality between CPU_FREQ and the const generic param when instantiating the EpochClock with new() (373f6b8).

As the CPU frequency is also used elsewhere (e.g. instantiating the ENC driver, defining current task scheduled intervals), CPU_FREQ should by extension be used for instantiating those objects too. However, I can't see a useful means to verify the frequencies during run-time for those.

@HarryMakes
Copy link
Contributor Author

If there is no other obvious mistakes or oversight, and since this has been tested for numerous times, I would greatly appreciate if this PR receives an approval soon.

Since I have also tested a workaround for the fan controller I2C issue (#140), as soon as this is merged, I would like to make the corresponding PR to be rebased on this merge. Thank you!

@HarryMakes HarryMakes merged commit 373f6b8 into develop Sep 27, 2021
@HarryMakes HarryMakes deleted the feature/enc424j600_merged branch September 27, 2021 04:10
@HarryMakes HarryMakes restored the feature/enc424j600_merged branch September 27, 2021 04:11
@HarryMakes
Copy link
Contributor Author

Much apologies!! I made a big mistake where I pushed my local develop to the wrong remote.

@HarryMakes HarryMakes changed the title Add support for ENC424J600 with bumped minimq [outdated] Add support for ENC424J600 with bumped minimq Sep 27, 2021
@HarryMakes
Copy link
Contributor Author

Dear @jordens, @ryan-summers and all, I am sorry about this mistake. I wanted to merge my local develop test branch to my personal private booster fork's develop to test the GitHub CI, but didn't notice or check that the branch upstream was set to this quartiq remote - possibly due to a malpractice in the past. As a result, I changed this develop branch by mistake and by accident.

According to my own local repo clone, the original develop head should be at 4ef6c80. After the accident, I wasn't able to see any "revert" function or button on the PR, so instead I hard-reset it to this commit and force-pushed again. If this is still a mistake, please kindly advise or take actions.

To continue the discussion on this PR, I have opened a new PR on #156. I have learnt my lesson and I'd like to sincerely apologise to all parties involved.

@ryan-summers
Copy link
Member

I don't see any issues with the git history - thanks for reverting the issue. My local history aligns with the fetched history, which indicates that the unintended change appears to be reverted properly.

In general, I don't think there was any significant error on your side (this is a very understandable error to make). I'll setup branch protections to help ensure these issues don't happen in the future :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants