Discussion - full duplex by splitting HID in / out into separate composite functions (no issue) #15
@RoganDawes So here are the new descriptors for the two composite HID functions: Input to host:
Output from host:
Not sure if using a collection is still necessary and if the usage on the second descriptor really needs to be changed to two, but they should work. And here comes the first problem, thanks to powershell.
Result:
Please ignore my nice hostname ;-) So if this were Linux, I guess I would be able to check which device file is readable and which is writable. As this unfortunately is Windows, I have to provide this information to CreateFile. I'm going to check the Win32_USBControllerDevice attributes for useful information on this tomorrow. Worst case: using HIDD* win32 methods for enumeration would be needed. Let me know if you have any ideas on this.
I honestly don't think it is necessary to have two raw HID interfaces, although technically it may be possible to double your throughput as a result. I think the real problems are shitty powershell and the lack of "streaming". If you get 1000 packets per second, each packet has to go out within 1 ms of the previous one. However, I measured latencies of 10-20 ms just switching from writing to reading in Powershell, which kills your throughput right there. Making the protocol less chatty, i.e. having the sender continue to send until the receiver indicates to slow down, seems like the way to go!
Hi @RoganDawes As promised, I've done some tests on synchronous transfers using two separate interfaces.
You're absolutely right on this. I've done 4 tests from Powershell:
Tests 3 and 4 were the same, but I was hoping that the FileStream could write up to 8 concurrent reports, as they are created with
To sum up: it seems the FileStream methods of .NET aren't able to reach the maximum transfer rate (1000 64-byte reports per second on USB 2.0), no matter how hard I try. So I give up on synchronous transfer, as the benefit is low while the effort is high (given that both of us have working implementations with multiple protocol layers).
Considering my tests, I doubt that there would be a speed increase with this (at least not for P4wnP1, as HID communication always runs at maximum speed, while the upper thread-based layers work on demand). Here's the output code, which has no console IO or array creation overhead, but reaches 8 KB/s max:
And here's my test script, use it as you need to. I solved the problem of enumerating device interfaces based on HID report descriptors, which took a ton of shitty csharp code. This is another reason to leave this path. The only useful thing about this code is that I'm able to spot my composite device based on serial + manufacturer string, which isn't possible with WMI enumeration (strings for interface drivers are different). This is nice because, as said, I often change VID/PID, but again creating a temporary csharp file for inline compilation renders this useless. So I guess I'm stuck at ~4 KByte maximum synchronous transfer, or could achieve ~8 KBytes at the cost of shitty .NET code (while consuming additional USB EPs). Maybe I'll use dedicated input/output reports for faster file transfer later on, and I'll still be slower than my first modem ;-). Best regards and thanks for the exchange on this. P.S. Excuse typos, my spellcheck is fighting against the English language.
Hi, It is certainly NOT the case that Windows/Powershell cannot achieve rates higher than what you are currently getting. I have been able to get up to 48 kBytes/second using powershell code (however, it was unreliable/inconsistent, and didn't include all the "refinements" required for multiple concurrent connections). It did maintain that rate for at least several seconds, so I don't think that this is an unreachable goal. Rogan
Could you provide a snippet of code reaching this rate? The snippet I provided above has no (obvious) room for improvement (it writes data out in a minimal loop).
Here is some real sample code, and some numbers to go with it: On Windows: speedtest.ps1
This waits for the first successful read from the RAW HID interface and starts the stopwatch, then exits after 16384 iterations (1024*1024/64 == 16384). I run it like so:
On the Pi:
This writes the numbers 1-17000, formatted into a 63-character zero-padded string (with a CR added to make 64 bytes), to the hid interface. I run it longer than 16384 iterations to account for any packets getting lost. The results (with ^M's removed):
In other words, 1MB successfully transferred (with no lost packets, since the last number is indeed 16384) in 16.4 seconds, a total rate of 63,937 bytes/second.
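The Pi-side command itself was lost from this export. Judging by the description above and the `seq -f "%063g"` invocation that appears later in the thread, a Python equivalent of the sender would look roughly like this (device path assumed):

```python
# Hypothetical reconstruction of the Pi-side sender: 17000 numbered,
# zero-padded 63-character lines, each newline-terminated to form one
# 64-byte report.
with open('/dev/hidg0', 'wb', buffering=0) as hid:  # device node is an assumption
    for i in range(1, 17001):
        hid.write(b'%063d\n' % i)
```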
What gets interesting is adding a Write statement into the powershell after $d.Write(), that writes the received bytes back to the device, then updating the command on the pi to:
In theory, seq would generate 16384 lines, the powershell would echo them one at a time, and "t" would contain those 16384 lines. In practice, I ended up with only around 2000 lines in "t", and the powershell waiting to receive its full 1MB. I even upped the number of lines to 65536, and the powershell still didn't terminate, indicating that even though I sent 4 times more data than required, it still didn't successfully receive a full 1MB :-( So, this seems to be a fairly fundamental limitation of USB raw hid (perhaps only on Windows, and perhaps only with Powershell. More testing required!) In fact, this might justify using two endpoints, one for reading, and one for writing, which I otherwise thought was a bad idea. ;-)
And this code, while not beautiful by any means, gets reasonable performance, while not losing any data!
On Linux:
So, 40 seconds elapsed, to send 2MB back and forth (1MB in each direction), with no errors or lost packets. That's pretty good, I think! I was hoping to use the Monitor class to allow one thread to notify the other, but I ended up with a deadlock, where the reader had already added an item and pulsed $q, while the writer had just ended the while loop and was getting around to calling Wait($q). Since the reader had read the last packet, there were no more "pulse"s sent, and the writer waited forever, even though there was actually one last packet in the queue.
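For what it's worth, the deadlock described is the classic lost-wakeup race: the pulse fires between the writer's empty-check and its call to Wait. The textbook fix is to re-check the queue under the lock before waiting. A minimal sketch in Python (names illustrative, not from the actual Proxy.ps1):

```python
import threading
from collections import deque

queue = deque()
cond = threading.Condition()

def send(packet):
    pass  # stand-in for the actual HID write

def writer_loop():
    while True:
        with cond:
            # Re-check the predicate inside the lock before waiting, so a
            # notify that fired earlier can never be lost.
            while not queue:
                cond.wait()
            packet = queue.popleft()
        send(packet)

def reader_enqueue(packet):
    with cond:
        queue.append(packet)
        cond.notify()  # the waiting writer re-checks the queue on wakeup
```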
I'll test this with the two-endpoint script provided above. It's likely that I missed the transfer interruption which occurred in your test, because I terminated the loops after 1000 reports. Additionally, I have to test the code provided by you and try to work out what causes the transfer rate difference. I'm not able to work on this during the weekend, but will report back on both next week.
Although I haven't had time to fully dive into your code, it seems you could change it. The $q object is already thread safe, I think (I use it in my stage 2 without manual synchronization, without issues). As you applied a 10 ms sleep in the write loop, you could move this sleep into an else branch of the condition. I've done this here (lines 233 and 310), but with a 50 ms delay. CPU consumption goes below 1 percent with the sleep. The sleep has no impact on throughput of continuous data, but again this is on an upper layer. So I'm going to report back next week... promise.
Ignore the last comment. I missed that the inner while loop empties the queue before the sleep is called. You've been looking for a way to avoid polling the queue count via Monitor, as far as I understand.
Although I should do other things right now, I couldn't stop thinking about your examples. So I tested the following code (only changed PID/VID and added an early out to the WMI device enumeration):
On my first attempt I thought it wasn't working (that's why I added the printout of $c). Leaving the code running for more than a minute, I realized that my issue isn't PowerShell:
I'm still stuck at ~7.8 KBytes/s with the code I'm using. I don't get it - why is my USB communication so slow? :-(
Okay, you're using the same . So I'm not sure where to go now; I have to think about what causes the speed drop.
I'm exactly at 1/8th of your speed (8 ms per report read/write). This explains my 3.2 KBytes/s upper bound. I use alternating read/write, which consumes 16 ms per 64-byte fragment, leaving me at an effective maximum of 4 KBytes/s for one direction. This drops to 3.2 KBytes/s because of my naive implementation of fragment reassembly. Could we compare Raspberry specs:
I hope the UDC is always the same ... still don't get it
I've found the root cause (wouldn't be able to sleep otherwise). New results:
It was a really silly mistake. I used this code for my HID device
The report length I used was exactly 1/8th of the size it should be (copy-paste from HID keyboard code). I've already implemented an application layer function which loads remote files into a dynamically generated powershell variable in the host process, and which could be used for file transfer testing. I'm going to revisit the full duplex approach with two EPs only if this function works too slowly (less than half of the maximum transfer rate). Anything else should be doable with more code optimization. Continuous alternating read/write seems to be okay for me. @RoganDawes I want to thank you very much for discussing these points; if you'd like me to do additional tests, ask at any time (but please avoid forcing me to use an AVR ;-)). Additionally I want to mention a new problem, which could affect both of our projects. See here
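For anyone hitting the same wall: with a configfs HID gadget, the report size lives in the function's report_length attribute, and a keyboard-sized value of 8 caps every transfer at 8 bytes instead of 64 - exactly the 1/8th observed above. A hedged illustration (the gadget name and path are examples, not necessarily P4wnP1's actual setup):

```python
# Illustrative fix: set the raw-HID function's report_length to 64.
# The gadget name and path are assumptions.
FUNC = "/sys/kernel/config/usb_gadget/g1/functions/hid.usb0"

with open(FUNC + "/report_length", "w") as f:
    f.write("64")  # not 8, which is the keyboard report size
```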
Nice work! As you can see, I am implementing the same idea both on AVR/ESP and Linux, so we can continue to collaborate ;-) For the moment, I'm happy with the default Raspbian kernel, not having run into the problems that you have yet. I'll keep it in mind should I encounter them, though! Thanks!
The continuous alternating read/write effectively halves your throughput, which you can recover by doubling the number of endpoints. An alternative is to simply reduce the number of acknowledgements required. My thinking is to eliminate the ACKs from my code entirely, and assume that the packet was received unless a NAK packet is received. This would be triggered by an out-of-sequence sequence number being received (1, 2, 4, for example). If this happens, the sender could retransmit the missing packet, and continue from that point. Of course, this means that the sender needs to keep a number of past packets available for retransmission if necessary. And, given that my sequence numbers are only 4 bits, there is a chance that the wrong packet gets retransmitted. Hmmm! I wonder how well "packing" a number of ACKs into a single packet would work, i.e. set a "P-ACK" flag, then pack a whole lot of ACKs into the data portion of the packet. At 4 bits, and 60 data bytes, one could pack up to 120 ACKs into a single packet, significantly reducing the number of ACK packets required, and boosting the one-way data rate. Instead of 1:1 (data:ack), you could get 120:1, and the one-way traffic would essentially approach the maximum. Sending actual data in the other direction would by necessity require flushing the pending ACK queue first, before sending the actual data. Regardless, implementing the blocking read thread, and updating the rest of the code in Proxy.ps1, should result in significant improvements! Let's see if I can actually achieve 32000 bytes per second?!
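To make the arithmetic concrete, here is a rough Python sketch of the P-ACK packing idea: two 4-bit sequence numbers fit per byte, so the 60 data bytes can carry on the order of 120 ACKs (118 with the count byte used below). The format details are an assumption, not an implemented protocol:

```python
def pack_acks(seqs):
    # Pack 4-bit sequence numbers two per byte. A count byte is needed
    # because all 16 values are valid sequence numbers, so none can be
    # reserved as padding.
    out = bytearray([len(seqs)])          # 1 count byte + up to 59 data bytes
    for i in range(0, len(seqs), 2):
        hi = (seqs[i] & 0x0F) << 4
        lo = seqs[i + 1] & 0x0F if i + 1 < len(seqs) else 0
        out.append(hi | lo)
    return bytes(out)

# e.g. acknowledging sequence numbers 0..9 in a 6-byte payload
payload = pack_acks(list(range(10)))
```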
I'm not sure why, but I'm at 48 KBps with alternating read/write (should be < 32 KBps). https://www.youtube.com/watch?v=MI8DFlKLHBk&yt:cc=on Used code is here (stage1_mini.ps1, stage2.ps1 and hidsrv9.py):
I have to recap your suggestions and remaining possibilities:
So for me, the alternating read/write approach seems the way to go, for the following reasons:
Maybe the report ID could be used to improve things further (the destination port, or in your case the channel, could be moved there). I don't know how much the report descriptor grows if one defines 255 possible report IDs (0 seems to be reserved for "no ID in use").
Why? As far as I understand your idea, payload data would be decoupled from logical header data (SYN, ACK, P-ACK, SEQ). The scenario of changing send direction is like PUSH...: instead of sending the P-ACK packet the old sender is waiting for, the new sender sends its data (with an urgent or push flag), and the old sender (which is now the receiver) knows that it has to reassemble an incoming stream before it continues to send its own data (and receive the P-ACK). But from my understanding, no matter how far you optimize, it ends up as a more optimized HALF-DUPLEX mode. The alternating read/write I'm using assures that a FULL DUPLEX channel is available at any time (but at half rate ... still not sure why I achieved more than 32 KB/s), as every packet could carry payloads in the direction needed (on demand).
Yes, indeed, you are absolutely right that there is no need to flush the P-ACKs. As you have observed, I am modelling my protocol on TCP, so I'm not sure where you get half-duplex from. Using the P-ACK approach, each side can get up to 120/121 (~99%) of the available bandwidth for its own transmissions, if the other side is not transmitting any actual data. If both sides need to transmit, it should balance appropriately depending on the ratio that each side requires, up to a 50:50 split. What I haven't established yet is whether having a dedicated thread for reading the HID interface in Powershell ends up effectively prioritising reads at the expense of writes, making the Linux side dominant when writing. Of course, this is all very specific to using RAW HID as a transport; using something else like a text-based printer, or CDC or ACM, would have less of an issue, I suspect, simply because the available data rate would be that much higher.
That's why I stated one ends up with half-duplex (if both sides are sending; your idea scales very well if only one side is sending).
I'm still trying to get this. I was planning to extend your read-loop example to a test with concurrent but independent read and write on both ends (separate threads). As I still don't get why I could reach a transfer rate > 32 KBps, I think it could be possible that the device file allows writing and reading at the same time (SHARED ACCESS), which essentially would mean FULL DUPLEX is possible with a single HID device interface. I have to delay this test, as it turns out that the
Sure, but moving away from HID would mean getting louder (triggering endpoint protection on the target). I like the HID approach very much, and using ACM or other communication classes would be too easy ;-).
News: Replacing
Every call to ... Going to ping back with the results.
I struggle to believe that BeginWrite/EndWrite on a MemoryStream is necessary, or even efficient. Unless you are doing it to be compatible with Async operations on other types of streams, I'd just do a Write, and be done with it.
You're again right, the benefit couldn't be measured, but I was wondering why I achieved this high rate with WriteAsync?!?! Anyway, meanwhile I've found the answer. Good news... believe it or not: the device file is FULL DUPLEX. You can send data in both directions at about 64 KB/s in parallel. Test output, powershell side (reading and writing at the same time):
And the other end (python; the first read is the trigger to start the threads):
Although I didn't measure the overall time, reading and writing were done concurrently. The difference is that I fully decoupled inbound and outbound data (no echo server). I haven't implemented any tests for packet loss, but the sending and receiving threads have to have matching report counts in order to allow the threads to terminate, so it is very likely that there is no packet loss. Testing this is easy, as I included routines to print out inbound data on both sides (disabled to reduce influence on time measurement). Test code will be up in some minutes.
Next interesting observation: starting only the powershell side of communication (no python endpoint on the Pi Zero), it turns out that ... This means there should never be any packet (= report) loss, which again means there's no need to use ACKs on a per-report basis. One could simply reassemble reports into a bigger stream. No problem on Linux, but PS is still missing something like a FIFOMemoryStream (which in fact could be implemented easily on your side, as you use inline C#). Test from PS printing out the report count on write, without a listener running on the RPi:
Note: none of the pending reports are lost if the python side is started some minutes later.
@RoganDawes here're the test files https://github.com/mame82/tests/tree/master/fullduplex I guess I have to reimplement everything, as alternating read/write is the worst approach. Have you found a replacement for named pipes to interface with the upper protocol layers (some sort of FIFO stream available in .NET 3.5)?
Very interesting results! I guess my implementation was naive: as an echo server, it introduced delays! I never bother reassembling the stream; I simply write the data portion of the packets to their destination as I receive them. So I have no need for a FIFO memory stream at all. The main issue then is failure to read packets on the Powershell side, resulting in lost data. This is easily seen by introducing a console write in the read loop - I ended up losing about 500 packets each time! If you keep that loop clean and tight, then hopefully there should be no packet loss in that direction either!
I haven't had packet loss at any time. This seems to be clear now, as write() is blocking if data isn't read on the other end. Read() was blocking, too.
As said, the receive count was capped by the `for` loops, which only terminate once exactly the number of packets sent has been received. One more note: the only case of report loss I could imagine would be if reports are written on the Linux end (slow RPi) and the Windows end reads back too slowly (unlikely, but possible). But you're again right: no listener on Windows = data loss.
@RoganDawes I guess the solution is here: Started powershell threads first, but deployed a delay in the read thread to force packet loss:

```powershell
# normal script block, should be packed into thread later on
$HIDinThread = {
    $hostui.WriteLine("Reading up to $in_count reports, with blocking read")
    $inbytes = New-Object Byte[] (65)
    $sw = New-Object Diagnostics.Stopwatch
    for ($i=0; $i -lt $in_count; $i++)
    {
        $cr = $HIDin.Read($inbytes,0,65)
        if ($i -eq 0) { $sw.Start() }
        Start-Sleep -m 100 # try to miss reports
        $utf8 = [System.Text.Encoding]::UTF8.GetString($inbytes)
        $hostui.WriteLine($utf8)
    }
    $sw.Stop()
    $timetaken = $sw.Elapsed.TotalSeconds
    $KBps = $in_count * 64 / 1024 / $timetaken
    $hostui.WriteLine("$in_count reports have been read in $timetaken seconds ($KBps KB/s)")
}
```

Additionally, I added console output before the first report is sent from the PS out thread:

```powershell
for ($i=0; $i -lt $out_count; $i++)
{
    if ($i -eq 0) { $hostui.WriteLine("Sending first report out on send thread") } # output is blocked by the other thread if it interacts with $hostui, so this line couldn't be placed exactly
    $HIDout.Write($outbytes,0,65)
    if ($i -eq 0) { $sw.Start() }
    #$hostui.WriteLine("reports written $i") # test how many reports are needed till write() blocks if there is no receiver on the other end
}
```

Starting the PS process first and running:

```
seq -f "%063g" 1 100 > /dev/hidg1
```

I get the following interesting result:

```
PS D:\del\P4wnP1\powershell> D:\del\P4wnP1\powershell\concurrent_rw2.ps1
Path: \\?\hid#vid_1d6b&pid_0137&mi_02#8&1f80c44c&1&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
Invalid handle
Path: \\?\hid#vid_1d6b&pid_0137&mi_03#8&4567976&0&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
Input: 65, Output: 65
Reading up to 16384 reports, with blocking read
Writing 16385 reports with synchronous 'Write'
Sending first report out on send thread
000000000000000000000000000000000000000000000000000000000000001
000000000000000000000000000000000000000000000000000000000000069
000000000000000000000000000000000000000000000000000000000000070
000000000000000000000000000000000000000000000000000000000000071
000000000000000000000000000000000000000000000000000000000000072
000000000000000000000000000000000000000000000000000000000000073
000000000000000000000000000000000000000000000000000000000000074
000000000000000000000000000000000000000000000000000000000000075
000000000000000000000000000000000000000000000000000000000000076
000000000000000000000000000000000000000000000000000000000000077
000000000000000000000000000000000000000000000000000000000000078
000000000000000000000000000000000000000000000000000000000000079
000000000000000000000000000000000000000000000000000000000000080
000000000000000000000000000000000000000000000000000000000000081
000000000000000000000000000000000000000000000000000000000000082
000000000000000000000000000000000000000000000000000000000000083
000000000000000000000000000000000000000000000000000000000000084
000000000000000000000000000000000000000000000000000000000000085
000000000000000000000000000000000000000000000000000000000000086
000000000000000000000000000000000000000000000000000000000000087
000000000000000000000000000000000000000000000000000000000000088
000000000000000000000000000000000000000000000000000000000000089
000000000000000000000000000000000000000000000000000000000000090
000000000000000000000000000000000000000000000000000000000000091
000000000000000000000000000000000000000000000000000000000000092
000000000000000000000000000000000000000000000000000000000000093
000000000000000000000000000000000000000000000000000000000000094
000000000000000000000000000000000000000000000000000000000000095
000000000000000000000000000000000000000000000000000000000000096
000000000000000000000000000000000000000000000000000000000000097
000000000000000000000000000000000000000000000000000000000000098
000000000000000000000000000000000000000000000000000000000000099
000000000000000000000000000000000000000000000000000000000000100
```

Seems there's no report loss if the first send has taken place from the Windows side.
Last assumption is wrong:
So ACKs have to be sent from the Pi to Windows :-( Assuring that the Windows end reads fast enough couldn't be done reliably otherwise, I guess.
Looks like you lost 68 reports to me?
Indeed, till the first report was sent from Windows. But raising the simulated read delay produces more report loss (the assumption that a first send from Windows is needed was wrong). So again I struggle on Windows, as I know Linux isn't able to send if nothing is received (remember the crash on an unresponsive IRQ when sending data to /dev/hidg1 before Windows is able to read). So one has to assure one read per millisecond on Windows, or use ACKs from the RPi to Windows. Seems your P-ACK idea is the best way to do this; again you're right.
So, as discussed, let Windows be the first to send a packet. Once the Linux side has received a packet, you know that the Windows side is ready to receive, and communications can begin. Alternatively, you can monitor the dmesg output to see when the relevant USB configuration has been selected by the Windows host, to know that the endpoint has been "activated". However, it still doesn't mean that the powershell is running yet - this you can "discover" by waiting for the powershell to send you the first packet!
Yes, I'm already doing this, both in production code and in the full duplex example provided above. Anyway, I was wrong in assuming that reports aren't missed when sending is started from Windows. It is simple... Linux writes to the HID device non-blocking, and reading from the FileStream misses reports if done too slowly. This isn't the case for sending from Windows to Linux. I'm thinking about new tests, changing the underlying mechanics away from input/output reports, but I'm running out of time.
Well, go with the 32 kBps option in the meantime, version 2 can get higher speed ;-)
Thumbs up for this comment... I'm already suffering from tunnel vision, trying to optimize low level HID communications while losing focus on other things which I wanted to implement in P4wnP1. But it doesn't get boring: another funny thing is that I'm faking RNDIS to run at 20 GBit/s, which involves different issues. If you're interested in this, here's the link https://github.com/mame82/ratepatch (applies on raspbian with kernel 4.4.50+)
Not surprisingly, I'm still thinking about the report loss occurring when writing to /dev/hidg on Linux and reading back from powershell too slowly. So please excuse the next large paste of test output. New observation: if sending is aborted on the Linux side, the last 32 reports can be read back (with a 500 ms delay) without any loss. This means the FileStream is backed by a 2048-byte buffer. If it were possible to access this buffer directly from powershell, a notification could be sent back to block writing on the other side (including the last SEQ number received). Unfortunately, I haven't found a way to access the underlying buffer... FileStream.Position and FileStream.Length are both unset. So if you're going to implement your P-ACK idea, it seems 32 reports is the magic number to track and send ACKs for. So your sequence number misses exactly one bit to cope with that. Here's the test output showing the described behaviour. The parts where large amounts of reports are missed were caused by unlimited sending. The parts with continuous reports received are the result of manually aborting sending from the Linux side (the last 32 reports are recovered with a 500 ms delay).
Here's the example output
Interesting! So, if I limited my "packets in flight without ACK" to 16 (the max of my sequence numbers), I could be sure that there would be no packet loss. Funnily enough, I instrumented my "echo loop" to indicate how many reports there were in the queue at the beginning of the while loop. Not once did I get more than 16 reports, with a 1 ms sleep once the queue was drained. The unfortunate part is that my sequence numbers are per "connection", of which I can have up to 255 at once (in theory). So I'd have to track the unacknowledged packets at a different level. Which, unfortunately, is a bit of a layering violation, I think. I think the "solution" is going to be making sure that the read loop just reads as fast as possible, and if any packets are observed to be missing, to send a RST on that channel and let it start again. Not particularly robust, but it should work, I hope!
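A minimal sketch of the "at most N packets in flight" rule, in Python (window size and names are illustrative): the sender simply blocks once `window` reports are unacknowledged, which, given the 32-report buffer observed above, should make receiver overruns impossible:

```python
import threading

class BoundedSender:
    def __init__(self, hid_write, window=16):      # 16 = the 4-bit SEQ space
        self.hid_write = hid_write
        self.slots = threading.Semaphore(window)   # free in-flight slots
        self.unacked = {}                          # seq -> report, for resends
        self.seq = 0

    def send(self, report):
        self.slots.acquire()            # blocks once `window` reports are unacked
        self.unacked[self.seq] = report
        self.hid_write(report)
        self.seq = (self.seq + 1) % 16

    def on_ack(self, seq):
        if self.unacked.pop(seq, None) is not None:
            self.slots.release()        # one more slot becomes available
```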
FWIW, by simply substituting the $device.BeginRead/EndRead pairs with dequeueing packets from the readloop/queue, I managed to get 13 kBps throughput with a cmd.exe doing "dir /s". Strangely, when writing the packets out to a socket, the throughput dropped to about 8 kBps.
While starting to implement a new lower layer communication scheme based on our observations, I put some comments (design ideas) into the source of the
As this isn't implemented in 5 minutes, I'd like to kindly ask you to review these comments before I start coding (and maybe throw it away in the end). Idea (from the Linux point of view = USB device, not host):
Good news: I implemented FULL DUPLEX similar to the suggestion above, with some improvements. Results:
So I'm at ~45500 bytes/s from Pi to Powershell (real net data, without protocol headers). Report loss detection (including blocking if the output buffer reaches 32 reports which haven't been read, and resending of unacknowledged reports) is only done from Pi to Windows (HID input reports), as we know that the other way around writes are blocked if no data is read back (the assumption still holds true in all tests; max. 4 report writes to the FileStream without a read on the Linux end). Protocol overhead is reduced to 2 header bytes on the "link layer", so the payload size is 62 bytes per report. If you're interested in work-in-progress code, ping back. Btw, I decided to interface to upper layers with synchronized input/output queues consuming/holding pure reports... this still isn't fully implemented. Fragmentation/defragmentation of larger streams is going to be handled in upper layers (based on a FIN bit in reports). DST/SOURCE fields (or the channel, in your case) will be moved to upper layers, too. I don't need this information on the link layer anymore. Reason: the endpoints of this layer are well defined and pre-known: USB host <--> USB device (or PowerShell client <--> Python server).
Forgot to mention: the measurement was on Win 10, 64 bit. On Win 7, 32 bit, throughput is far slower (going to test tomorrow; the code is PS 2.0 and .NET 3.5 compatible, not sure if this is the bottleneck on Win 7).
I think the basic idea is solid. My approach is to have a single "channel identifier", resulting in a max of 256 (concurrent!) channels, rather than 65536, which I think is reasonable in the circumstances. Packet format in my case (following some of your ideas above, so not implemented yet) would look like:
I'm still not 100% convinced about acking every packet - i.e. making a continuous stream of comms at full rate, as this will require fairly significant resources to keep up, possibly resulting in suspicious activity on the victim being detected.
And indeed, now that you mention it, having the sequence number and packet length first, and leaving the channel and payload for a higher layer, makes perfect sense, even in my implementation. Nice work!
So here's the test result on Windows 7 (a bit disappointing compared to win 10)
I still read/write packets in both directions (input and output reports) at the maximum possible rate (limited by file IO and CPU speed). I've quit the approach of ACK'ing every SEQ number received; instead I changed to accumulative ACKs. This raised the new question of how to detect report loss on the Pi, as ACK sequences aren't necessarily continuous (1, 4, 6 is a valid ACK sequence which could be received by the Pi, and it means that the last valid SEQ received was 6). So communication looks like this (reader and writer threads are decoupled, but share the current state of the SEQ number received and the last ACK sent): No report loss (PowerShell perspective; read and write threads run asynchronously with different loop times, in this example the write loop is slower): Same example with report loss: My report format: INPUT REPORTS
OUTPUT REPORTS
A resend request would be repeated till the state of the PowerShell peer changes (because reports are constantly flowing in both directions at the maximum possible speed).
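A sketch of the accumulative-ACK bookkeeping on the sender side, in Python (the exact report layout above was not captured in this export, so the field handling is assumed): an ACK for n confirms everything up to n, so a gapped ACK sequence like 1, 4, 6 is fine and only the highest value matters. Wrap-around of the SEQ space is omitted for brevity:

```python
class RetransmitBuffer:
    # Sender-side view: an ACK value acknowledges its own sequence number
    # and everything before it, so the buffer is pruned up to the ACK.
    def __init__(self):
        self.pending = {}            # seq -> report, at most 32 entries

    def sent(self, seq, report):
        self.pending[seq] = report

    def acked(self, ack_seq):
        # cumulative ACK: drop every report with seq <= ack_seq
        for s in [s for s in self.pending if s <= ack_seq]:
            del self.pending[s]

    def resend_from(self, seq, hid_write):
        # the peer detected loss: replay everything still pending from seq on
        for s in sorted(self.pending):
            if s >= seq:
                hid_write(self.pending[s])
```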
Right, but my current link layer stack is focused on robust communication and can handle large delays in the read/write threads on both sides (with the impact of data processing overhead in mind). This could be used to lower CPU load by introducing (conditional) delays (at the cost of a decrease in maximum transfer speed). So yes, I'm not sending reports on demand, but continuously ... I see HID reports as the electrical medium, compared to the ISO/OSI model. This means power is continuously flowing, but not necessarily with useful payload. This again assures that header data (or state data) is exchanged instantly.
Looking at the Win 7 results again, it seems that FileStream.Read is blocking FileStream.Write somehow. Maybe these methods are synchronized on .NET 3.5?!? I won't test whether this could be circumvented by accessing the File object directly (which should allow overlapped read/write), as this would involve too much additional csharp code. Anyway, I'm happy with the current implementation and will refocus on polishing and constructing a clean interface to upper layers. I'm still thinking about putting a socks5 proxy on the python end, to let powershell establish requested TCP connections. This would be a scenario handling channel splitting on an upper layer, and therefore a clean and robust interface has to be provided to interface with my link layer. Creating a layer interface is easy with object oriented python, but again a mess with PS < 3.0, and I'm aiming at PS 2.0. The design idea is to wrap up the link layer in a PSCustomObject, which receives a device file in its constructor and provides read and write methods for LARGE data streams (thread creation, fragmentation and report loss handling should be done internally). As this idea involves even more code, it has to be placed in stage2, and a simplified protocol will be used to deliver stage1. This involves even more code on the Raspberry side, as the server has to handle two different protocols, but I believe it is worth the effort.
One more addition. As shown in the report layout comments, I'm planning to use a FIN bit. Its purpose is to mark the end of fragmented streams. A start flag isn't needed, as the first report with a payload (length field > 0) starts a stream, and all succeeding reports are concatenated until the FIN bit is set in the terminating report. My former approach used an empty report to terminate fragmented streams, which drops transfer speed even more. In real world usage, many streams fit into a single report (for example a directory listing in a shell, where each output line is interpreted as a single stream ... this led to sending an empty report after each line in my old implementation, producing way too much overhead). Are you aware of a simple PS 2.0 compatible way to create objects with custom member functions (inline csharp code is a no-go for me)? The only option I've found so far is a PSCustomObject.
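The receiving side of this FIN-based reassembly is pleasingly small; a Python sketch (the flag position is my assumption, placed just above the 5-bit SEQ field described in the next comment):

```python
FIN = 0x20  # assumed flag bit, directly above the 5-bit SEQ field

def reassemble(reports):
    # `reports` yields (flags, payload) tuples, already in sequence order.
    # A stream starts with the first non-empty payload and ends with FIN;
    # no explicit start flag or empty terminating report is needed.
    stream = bytearray()
    for flags, payload in reports:
        stream += payload
        if flags & FIN:
            yield bytes(stream)  # one complete upper-layer stream
            stream = bytearray()
```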
Last comment on the current implementation: the sequence number range in use is 0 to 31. This allows tracking 32 reports, which is the max size of my output buffer on the Pi. This again fits the observed maximum size of the input buffer of the Windows FileStream. As this is my max, SEQ/ACK never consume more than 5 bits, leaving room to use the remaining bits as flags. The flags are extracted with cheap binary AND/OR/NAND operations.
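In the same spirit, the header packing might look like this in Python (the thread only fixes SEQ/ACK at 5 bits and 2 header bytes; the exact bit positions and flag names here are my guesses):

```python
SEQ_MASK = 0x1F   # low 5 bits: sequence number 0..31
FIN      = 0x20   # the remaining 3 bits are usable as flags
RESEND   = 0x40   # flag names/positions are assumptions

def pack_header(seq, ack, flags=0):
    # 2 link-layer header bytes + 62 payload bytes = one 64-byte report
    return bytes([(seq & SEQ_MASK) | flags, ack & SEQ_MASK])

def parse_report(report):
    seq   = report[0] & SEQ_MASK
    flags = report[0] & 0xE0              # cheap binary AND, as described
    ack   = report[1] & SEQ_MASK
    return seq, ack, flags, report[2:64]  # the remaining 62 bytes of payload
```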
My current code achieves rates > 40 KByte/s in both directions (full duplex), on Win10 as well as on Win7 (I mistakenly included Win7 console IO in the first speed tests). I've started to migrate to an object oriented approach to build a clean interface for my (now so-called) LinkLayer. Here's the test code, feel free to use it: https://github.com/mame82/tests/tree/master/fullduplex_fast
One really important fact (I didn't recognize it earlier): one has to use two dedicated file descriptors on Linux (although it's the same file) for reading and writing; otherwise the speed halves due to synchronized file access. This is at least true if python is used.
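In Python terms, the two-descriptor trick is just opening the gadget file twice, so the reader never serializes against the writer (device path as used earlier in the thread):

```python
# Two independent, unbuffered descriptors on the same HID gadget file.
hid_out = open('/dev/hidg1', 'wb', buffering=0)  # writes: device -> host
hid_in  = open('/dev/hidg1', 'rb', buffering=0)  # reads:  host -> device
```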
I'm done with the FullDuplex LinkLayer implementation. It works nicely on Windows 10; Windows 7 is untested. The code still needs polishing... Current features:
Performance on Win10 is fantastic:
That's 87% of max from Windows to Pi and 93% of max from Pi to Windows. That's insane. The measurement shown above was done by sending from both peers at the same time.
As promised, the code is here: So I'm done with this topic, but I've kept the issue open in case further questions arise. If there are no more questions, please close the issue. Again, thanks for this intensive and valuable exchange of information on HID low level development!
Remarks on the provided code at https://github.com/mame82/tests/tree/master/fullduplex_fast2: Most parts of the source are the implementation of the LinkLayer interface (as a class in python, a PSCustomObject in PowerShell). The start of the main code utilizing the LinkLayer objects is marked with the following comment in both source files:
So as shown, using the LinkLayer objects is relatively easy (not much code, besides the implementation itself). In order to test, you could replace the CSharp part (responsible for creating a FileStream to the HID device) with your own. In case you use the current CSharp code, you have to replace ... On the Linux end, my device file is
Hi @RoganDawes
Additionally, I want to point out my python helper class, which uses the approaches discussed earlier to prepare stage1 code for typing out via HID. As you can see in this class, I'm still using base64 encoded GZip streams, which are converted to PowerShell code on the fly. Thus things like loading custom assemblies without touching disk, initializing variables with code or binary data etc. became possible. Currently I'm using my
Btw, I've chosen to implement a console-based approach as the frontend for my current HID backdoor (yes, it is a bit meterpreter'ish). One connects to P4wnP1 via SSH, and the frontend is embedded into a screen session. The idea behind this approach is to implement a socks4a or socks5 server later on, which could be reached via the same SSH session and relay traffic through the target client. This would be a real airgap bridge.
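For illustration, the core of such a helper (not the actual P4wnP1 class, just the approach described) can be quite small: gzip the stage, base64 it, and emit PowerShell that inflates and runs it in memory. The generated snippet sticks to .NET 2.0-era APIs for PS 2.0 compatibility:

```python
import base64
import gzip

def ps_stage_loader(stage_source: str) -> str:
    # Compress the stage and wrap it in PowerShell that decompresses and
    # invokes it from memory, so nothing touches disk.
    b64 = base64.b64encode(gzip.compress(stage_source.encode())).decode()
    return (
        "$d=[Convert]::FromBase64String('" + b64 + "');"
        "$ms=New-Object IO.MemoryStream(,$d);"
        "$gz=New-Object IO.Compression.GZipStream($ms,"
        "[IO.Compression.CompressionMode]::Decompress);"
        "$sr=New-Object IO.StreamReader($gz);"
        "IEX $sr.ReadToEnd()"
    )
```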
@RoganDawes I finished my HID backdoor and added you to the credits https://github.com/mame82/P4wnP1/blob/master/README.md. It can be used to make the window invisible while keeping focus. As P4wnP1 types chars very fast, they run into the STDIN buffer of the target window, which can't be interrupted by user interaction. The final stage1 needs four lines of code to hide the window, and the rest gets typed out in about 2 to 6 seconds (depending on the stage1 type; the pure powershell stage is more compact than the .NET assembly version). See my readme for reference.
Very cool! Nice work! And thanks for the shoutout :-)
Yeah, no problem. This was an inspiring conversation. Seytonic demoed the payload. The final attack starts at about 5:30 in the video... look closely, the powershell window disappears really fast. Stage 2 download and execution has finished when the status changes to "client connected". I still use your WMI approach to enumerate the HID device (at least in the default version of stage 1).
Continuation of the ongoing discussion from here