Pi4 gisb stalls when using genet ethernet #1219

Closed
cinaplenrek opened this issue Aug 5, 2019 · 13 comments
@cinaplenrek

cinaplenrek commented Aug 5, 2019

I'm working on Plan 9 arm64 kernel support for the Raspberry Pi 4.

I'm observing GISB arbiter errors when operating the ethernet controller on the Raspberry Pi 4.
In general, ethernet works fine under light traffic, but heavy traffic causes sporadic bus stalls
lasting 42 seconds. During a stall, any core accessing MMIO registers on the GISB (genet, pcie)
hangs and then continues once the stall ends. Even accesses to the GISB arbiter itself hang.

After such a stall, when I poll the GISB arbiter (as I don't know the INTID for the arbiter), the
capture status register (0x7c4007f4) reads 0x3D, and the bus address reported in the capture address
registers ([0x7c4007ec] | [0x7c4007e8]<<32) reads strange 12-bit bus addresses like:
0x2a0, 0xfe0, 0xea0, 0xee0, 0x6a0... (they all satisfy (x-32)%64 == 0)
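
The pattern in those captured addresses can be checked with a few lines of Python. This is only a minimal sketch: `capture_address` is a hypothetical helper that mirrors the low | high<<32 composition described above, and the sample values are the ones listed.

```python
# Hypothetical helper: combine the two 32-bit capture address registers
# (0x7c4007ec = low word, 0x7c4007e8 = high word, as described above).
def capture_address(lo: int, hi: int) -> int:
    return (lo & 0xFFFFFFFF) | ((hi & 0xFFFFFFFF) << 32)

# Addresses observed after the stalls.
observed = [0x2A0, 0xFE0, 0xEA0, 0xEE0, 0x6A0]

# Every sample fits in 12 bits and sits at offset 32 within a 64-byte block.
fits_12_bits = all(a < (1 << 12) for a in observed)
offset_32_in_64 = all((a - 32) % 64 == 0 for a in observed)
offsets = [a % 64 for a in observed]
```

In other words, each address is 32 bytes past a 64-byte boundary, which may or may not be a coincidence.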

Normally, when the ARM accesses invalid MMIO registers on the bus, I get an SError interrupt
and the arbiter capture address registers contain a proper bus address above 0x7c000000.
This is not the case here.

Is it possible for the ARM to issue bus accesses to such addresses? And if so, how?
If not, who could initiate such bus transactions?

Can someone tell me the INTID for the GISB arbiter error interrupt and how the interrupt
can be enabled, besides enabling it in the GIC? Maybe polling the arbiter results in
these bogus addresses?

What I could figure out so far:

  • stalls happen for both read and write accesses, and it doesn't matter which register is accessed
  • hanging MMIO write accesses seem to complete fine after the stall: I tested reading back
    the registers I write in the ethernet driver after each write, and the new value had been stored.
  • the 42 second stall time is unrelated to the arbiter timeout value in the arbiter
    timer register 0x7c400008
  • serializing all genet register accesses and placing barriers before and after has no effect
  • Linux works fine, and I made a trace of all MMIO register writes to check for differences
    in genet initialization, but they match: http://felloff.net/usr/cinap_lenrek/pi4iodump.txt
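
For comparing two such write traces, something along these lines would do. It is only a sketch: the one-write-per-line `address value` format is an assumption and not necessarily the layout of the linked dump, `parse_trace` and `first_difference` are hypothetical helpers, and the sample data is made up.

```python
def parse_trace(text: str) -> list[tuple[int, int]]:
    """Parse lines of '<hex address> <hex value>' into (address, value) pairs."""
    writes = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        writes.append((int(fields[0], 16), int(fields[1], 16)))
    return writes

def first_difference(a, b):
    """Return (index, write_a, write_b) of the first mismatch, or None."""
    for i, (wa, wb) in enumerate(zip(a, b)):
        if wa != wb:
            return i, wa, wb
    if len(a) != len(b):
        # One trace is a strict prefix of the other.
        i = min(len(a), len(b))
        return i, a[i] if i < len(a) else None, b[i] if i < len(b) else None
    return None

# Made-up sample data, purely for illustration.
linux = parse_trace("0x7d580000 0x1\n0x7d580004 0x2\n")
plan9 = parse_trace("0x7d580000 0x1\n0x7d580004 0x3\n")
diff = first_difference(linux, plan9)
```

A result of `None` means the two traces contain the same writes in the same order.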

Speculation:

  • the stall time of 42 seconds is about the time a 32-bit counter takes to wrap at 100MHz (2^32 / 100MHz ≈ 42.9 s)
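
The arithmetic behind that guess, as a quick sketch (the 100MHz clock is only the speculation above, not a confirmed clock rate):

```python
# Time for a free-running 32-bit counter to wrap at an assumed 100 MHz clock.
COUNTER_BITS = 32
CLOCK_HZ = 100_000_000  # assumption taken from the speculation above

wrap_seconds = (1 << COUNTER_BITS) / CLOCK_HZ
print(f"{wrap_seconds:.2f} s")  # 42.95 s
```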

@popcornmix
Contributor

@P33M any ideas?

@P33M

P33M commented Aug 5, 2019

What size of accesses are you using to read/write GISB registers?

@cinaplenrek
Author

cinaplenrek commented Aug 5, 2019 via email

@P33M

P33M commented Aug 5, 2019

Decoding the error capture status register (0x7c4007f4) - the error was not caused by a slave response timeout, the error was not caused by a slave response error, and the bus cycle was a read. Oddly, none of the 4 byte strobes in [5:2] are asserted (1 => not asserted). How can we have a read cycle with no byte strobes?
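
As a rough sketch of that decoding in Python: only the byte-strobe position [5:2] is stated in this thread. The other bit positions below are assumptions (believed to match the Linux `brcmstb_gisb` driver, but not confirmed here), and `decode_status` is a hypothetical helper.

```python
# Assumed field layout of the GISB error capture status register.
# Only the byte strobes in [5:2] come from this thread; every other bit
# position is an assumption and should be checked against hardware docs.
VALID = 1 << 0               # assumption: capture valid
WRITE = 1 << 1               # assumption: 1 = write cycle, 0 = read cycle
BS_SHIFT, BS_MASK = 2, 0xF   # byte strobes [5:2], 1 = not asserted
TEA = 1 << 11                # assumption: slave response error
TIMEOUT = 1 << 12            # assumption: slave response timeout

def decode_status(status: int) -> dict:
    return {
        "valid": bool(status & VALID),
        "write": bool(status & WRITE),
        "strobes_not_asserted": (status >> BS_SHIFT) & BS_MASK,
        "slave_error": bool(status & TEA),
        "timeout": bool(status & TIMEOUT),
    }

decoded = decode_status(0x3D)
```

Under those assumptions, 0x3D decodes to a valid capture of a read cycle with all four strobe bits set (none asserted) and neither error flag set.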

Does the status register ever change (i.e. is it the same for both read and write)?

Is Plan 9 using the firmware clock setup or have any modifications been made to any of the clock generators?

Edit: also, can you capture the GISB master source register at 0x7c4007f8? It's a bitmask of who generated the address that generated the fault.

@cinaplenrek
Author

cinaplenrek commented Aug 5, 2019 via email

@cinaplenrek
Author

cinaplenrek commented Aug 5, 2019 via email

@cinaplenrek
Author

Is there anything else I can try to rule out potential problem sources?
The clock generators were mentioned...
I have core_freq=250 in config.txt for the mini UART console to work.
Are there any config.txt properties I can try to change to rule out
clock or power issues?

@pelwell
Contributor

pelwell commented Aug 6, 2019

Try with core_freq=500 and core_freq_min=500 - 250 is possibly too low.

@cinaplenrek
Author

cinaplenrek commented Aug 6, 2019 via email

@pelwell
Contributor

pelwell commented Aug 6, 2019

I suppose suggesting using Linux is not helpful?

@cinaplenrek
Author

cinaplenrek commented Aug 7, 2019 via email

@P33M

P33M commented Aug 14, 2019

The only other thing I can think of would be the cacheability of the address space in question - what page protection bits are being used?

@cinaplenrek
Author

cinaplenrek commented Aug 16, 2019 via email


4 participants