Discussion - full duplex by splitting HID in / out into separate composite functions (no issue) #15
@RoganDawes So here are the new descriptors for the two composite HID functions: Input to host:
Output from host:
Not sure if using a collection is still necessary and if the usage on the second descriptor really needs to be changed to two, but they should work. And here comes the first problem, thanks to powershell.
Result:
Please ignore my nice hostname ;-) So if this were Linux, I guess I would be able to check which device file is readable and which is writable. As this unfortunately is Windows, I have to provide this information to CreateFile. I'm going to check the Win32_USBControllerDevice attributes for useful information on this tomorrow. Worst case: using HIDD* win32 methods for enumeration would be needed. Let me know if you have any ideas on this.
I honestly don't think it is necessary to have two raw HID interfaces, although technically it may be possible to double your throughput as a result. I think the real problems are shitty powershell and the lack of "streaming". If you get 1000 packets per second, each packet has to go out within 1 ms of the previous one. However, I measured latencies of 10-20 ms just switching from writing to reading in Powershell, which kills your throughput right there. Making the protocol less chatty, i.e. having the sender continue to send until the receiver indicates to slow down, seems like the way to go!
Hi @RoganDawes As promised, I've done some tests on synchronous transfers using two separate interfaces.
You're absolutely right on this. I've done 4 tests from Powershell:
Tests 3 and 4 were the same, but I was hoping that the FileStream could write up to 8 concurrent reports, as they are created with
To sum up: it seems the FileStream methods of .NET aren't able to reach the maximum transfer rate (1000 64-byte reports per second on USB 2.0), no matter how hard I try. So I give up on synchronous transfer, as the benefit is low while the effort is high (given that both of us have working implementations with multiple protocol layers).
Considering my tests, I doubt that there would be a speed increase with this (at least not for P4wnP1, as HID communication always runs at maximum speed, while the upper thread-based layers work on demand). Here's the output code, which has no console IO or array creation overhead, but reaches 8 KB/s max:
And here's my test script, use it as you need to. I solved the problem of enumerating device interfaces based on HID report descriptors, which took a ton of shitty csharp code. This is another reason to leave this path. The only useful thing about this code is that I'm able to spot my composite device based on serial + manufacturer string, which isn't possible with WMI enumeration (strings for interface drivers are different). This is nice because, as said, I often change VID/PID, but again creating a temporary csharp file for inline compilation renders this useless. So I guess I'm stuck at ~4 KByte maximum synchronous transfer, or could achieve ~8 KBytes at the cost of shitty .NET code (while consuming additional USB EPs). Maybe I'll use dedicated input/output reports for faster file transfer later on, and I'll still be slower than my first modem ;-). Best regards and thanks for the exchange on this. P.S. Excuse typos, my spellcheck is fighting against the English language.
Hi, It is certainly NOT the case that Windows/Powershell cannot achieve rates higher than what you are currently getting. I have been able to get up to 48 kBytes/second using powershell code (however, it was unreliable/inconsistent, and didn't include all the "refinements" required for multiple concurrent connections). It did maintain that rate for at least several seconds, so I don't think that this is an unreachable goal. Rogan
Could you provide a snippet of code reaching this rate? The snippet I provided above has no (obvious) room for improvement (it writes data out in a minimal loop).
Here is some real sample code, and some numbers to go with it: On Windows: speedtest.ps1
This waits for the first successful read from the RAW HID interface and starts the stopwatch, then exits after 16384 iterations (1024*1024/64 == 16384). I run it like so:
On the Pi:
This writes the numbers 1-17000, formatted into a 63-character zero-padded string (with a CR added to make 64 bytes), to the hid interface. I run it longer than 16384 iterations to account for any packets getting lost. The results (with ^M's removed):
In other words, 1MB successfully transferred (with no lost packets, since the last number is indeed 16384) in 16.4 seconds, a total rate of 63,937 bytes/second.
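The Pi-side command itself was lost from this export. Judging by the description above and the `seq -f "%063g"` invocation that appears later in the thread, a Python equivalent of the sender would look roughly like this (device path assumed):

```python
# Hypothetical reconstruction of the Pi-side sender: 17000 numbered,
# zero-padded 63-character lines, each newline-terminated to form one
# 64-byte report.
with open('/dev/hidg0', 'wb', buffering=0) as hid:  # device node is an assumption
    for i in range(1, 17001):
        hid.write(b'%063d\n' % i)
```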
What gets interesting is adding a Write statement into the powershell after $d.Write(), that writes the received bytes back to the device, then updating the command on the pi to:
In theory, seq would generate 16384 lines, the powershell would echo them one at a time, and "t" would contain those 16384 lines. In practice, I ended up with only around 2000 lines in "t", and the powershell waiting to receive its full 1MB. I even upped the number of lines to 65536, and the powershell still didn't terminate, indicating that even though I sent 4 times more data than required, it still didn't successfully receive a full 1MB :-( So, this seems to be a fairly fundamental limitation of USB raw hid (perhaps only on Windows, and perhaps only with Powershell. More testing required!) In fact, this might justify using two endpoints, one for reading, and one for writing, which I otherwise thought was a bad idea. ;-)
And this code, while not beautiful by any means, gets reasonable performance, while not losing any data!
On Linux:
So, 40 seconds elapsed, to send 2MB back and forth (1MB in each direction), with no errors or lost packets. That's pretty good, I think! I was hoping to use the Monitor class to allow one thread to notify the other, but I ended up with a deadlock, where the reader had already added an item and pulsed $q, while the writer had just ended the while loop and was getting around to calling Wait($q). Since the reader had read the last packet, there were no more "pulse"s sent, and the writer waited forever, even though there was actually one last packet in the queue.
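For what it's worth, the deadlock described is the classic lost-wakeup race: the pulse fires between the writer's empty-check and its call to Wait. The textbook fix is to re-check the queue under the lock before waiting. A minimal sketch in Python (names illustrative, not from the actual Proxy.ps1):

```python
import threading
from collections import deque

queue = deque()
cond = threading.Condition()

def send(packet):
    pass  # stand-in for the actual HID write

def writer_loop():
    while True:
        with cond:
            # Re-check the predicate inside the lock before waiting, so a
            # notify that fired earlier can never be lost.
            while not queue:
                cond.wait()
            packet = queue.popleft()
        send(packet)

def reader_enqueue(packet):
    with cond:
        queue.append(packet)
        cond.notify()  # the waiting writer re-checks the queue on wakeup
```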
I'll test this with the two-endpoint script provided above. It's likely that I missed the transfer interruption which occurred in your test, because I terminated the loops after 1000 reports. Additionally, I have to test the code provided by you and try to work out what causes the transfer rate difference. I'm not able to work on this during the weekend, but will report back on both next week.
Although I haven't had time to fully dive into your code, it seems you could change it. The $q object is already thread safe, I think (I use it in my stage 2 without manual synchronization, without issues). As you applied a 10 ms sleep in the write loop, you could move this sleep into an else branch of the condition. I've done this here (lines 233 and 310), but with a 50 ms delay. CPU consumption goes below 1 percent with the sleep. The sleep has no impact on throughput of continuous data, but again this is on an upper layer. So I'm going to report back next week... promise.
Ignore the last comment. I missed that the inner while loop empties the queue before the sleep is called. You've been looking for a way to avoid polling the queue count via Monitor, as far as I understand.
Although I should do other things right now, I couldn't stop thinking about your examples. So I tested the following code (only changed PID/VID and added an early out to the WMI device enumeration):
On my first attempt I thought it wasn't working (that's why I added the printout of $c). Leaving the code running for more than a minute, I realized that my issue isn't PowerShell:
I'm still stuck at ~7.8 KBytes/s with the code I'm using. I don't get it - why is my USB communication so slow? :-(
Okay, you're using the same . So I'm not sure where to go now; I have to think about what causes the speed drop.
I'm exactly at 1/8th of your speed (8 ms per report read/write). This explains my 3.2 KBytes/s upper bound. I use alternating read/write, which consumes 16 ms per 64-byte fragment, leaving me at an effective maximum of 4 KBytes/s for one direction. This drops to 3.2 KBytes/s because of my naive implementation of fragment reassembly. Could we compare Raspberry specs:
I hope the UDC is always the same ... still don't get it
I've found the root cause (wouldn't be able to sleep otherwise). New results:
It was a really silly mistake. I used this code for my HID device
The report length I used was exactly 1/8th of the size it should be (copy-paste from HID keyboard code). I've already implemented an application layer function which loads remote files into a dynamically generated powershell variable in the host process, and which could be used for file transfer testing. I'm going to revisit the full duplex approach with two EPs only if this function works too slowly (less than half of the maximum transfer rate). Anything else should be doable with more code optimization. Continuous alternating read/write seems to be okay for me. @RoganDawes I want to thank you very much for discussing these points; if you'd like me to do additional tests, ask at any time (but please avoid forcing me to use an AVR ;-)). Additionally I want to mention a new problem, which could affect both of our projects. See here
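For anyone hitting the same wall: with a configfs HID gadget, the report size lives in the function's report_length attribute, and a keyboard-sized value of 8 caps every transfer at 8 bytes instead of 64 - exactly the 1/8th observed above. A hedged illustration (the gadget name and path are examples, not necessarily P4wnP1's actual setup):

```python
# Illustrative fix: set the raw-HID function's report_length to 64.
# The gadget name and path are assumptions.
FUNC = "/sys/kernel/config/usb_gadget/g1/functions/hid.usb0"

with open(FUNC + "/report_length", "w") as f:
    f.write("64")  # not 8, which is the keyboard report size
```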
Nice work! As you can see, I am implementing the same idea both on AVR/ESP and Linux, so we can continue to collaborate ;-) For the moment, I'm happy with the default Raspbian kernel, not having run into the problems that you have yet. I'll keep it in mind should I encounter them, though! Thanks!
The continuous alternating read/write effectively halves your throughput, which you can recover by doubling the number of endpoints. An alternative is to simply reduce the number of acknowledgements required. My thinking is to eliminate the ACKs from my code entirely, and assume that the packet was received unless a NAK packet is received. This would be triggered by an out-of-sequence sequence number being received (1, 2, 4, for example). If this happens, the sender could retransmit the missing packet, and continue from that point. Of course, this means that the sender needs to keep a number of past packets available for retransmission if necessary. And, given that my sequence numbers are only 4 bits, there is a chance that the wrong packet gets retransmitted. Hmmm! I wonder how well "packing" a number of ACKs into a single packet would work, i.e. set a "P-ACK" flag, then pack a whole lot of ACKs into the data portion of the packet. At 4 bits, and 60 data bytes, one could pack up to 120 ACKs into a single packet, significantly reducing the number of ACK packets required, and boosting the one-way data rate. Instead of 1:1 (data:ack), you could get 120:1, and the one-way traffic would essentially approach the maximum. Sending actual data in the other direction would by necessity require flushing the pending ACK queue first, before sending the actual data. Regardless, implementing the blocking read thread, and updating the rest of the code in Proxy.ps1, should result in significant improvements! Let's see if I can actually achieve 32000 bytes per second?!
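To make the arithmetic concrete, here is a rough Python sketch of the P-ACK packing idea: two 4-bit sequence numbers fit per byte, so the 60 data bytes can carry on the order of 120 ACKs (118 with the count byte used below). The format details are an assumption, not an implemented protocol:

```python
def pack_acks(seqs):
    # Pack 4-bit sequence numbers two per byte. A count byte is needed
    # because all 16 values are valid sequence numbers, so none can be
    # reserved as padding.
    out = bytearray([len(seqs)])          # 1 count byte + up to 59 data bytes
    for i in range(0, len(seqs), 2):
        hi = (seqs[i] & 0x0F) << 4
        lo = seqs[i + 1] & 0x0F if i + 1 < len(seqs) else 0
        out.append(hi | lo)
    return bytes(out)

# e.g. acknowledging sequence numbers 0..9 in a 6-byte payload
payload = pack_acks(list(range(10)))
```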
I'm not sure why, but I'm at 48 KBps with alternating read/write (should be < 32 KBps). https://www.youtube.com/watch?v=MI8DFlKLHBk&yt:cc=on Used code is here (stage1_mini.ps1, stage2.ps1 and hidsrv9.py):
I have to recap your suggestions and remaining possibilities:
So for me, the alternating read/write approach seems the way to go, for the following reasons:
Maybe the report ID could be used to improve things further (the destination port, or in your case the channel, could be moved there). I don't know how much the report descriptor grows if one defines 255 possible report IDs (0 seems to be reserved for "no ID in use").
Why? As far as I understand your idea, payload data would be decoupled from logical header data (SYN, ACK, P-ACK, SEQ). The scenario of changing send direction is like PUSH...: instead of sending the P-ACK packet the old sender is waiting for, the new sender sends its data (with an urgent or push flag), and the old sender (which is now the receiver) knows that it has to reassemble an incoming stream before it continues to send its own data (and receive the P-ACK). But from my understanding, no matter how far you optimize, it ends up as a more optimized HALF-DUPLEX mode. The alternating read/write I'm using assures that a FULL DUPLEX channel is available at any time (but at half rate ... still not sure why I achieved more than 32 KB/s), as every packet could carry payloads in the direction needed (on demand).
Yes, indeed, you are absolutely right that there is no need to flush the P-ACKs. As you have observed, I am modelling my protocol on TCP, so I'm not sure where you get half-duplex from. Using the P-ACK approach, each side can get up to 120/121 (~99%) of the available bandwidth for its own transmissions, if the other side is not transmitting any actual data. If both sides need to transmit, it should balance appropriately depending on the ratio that each side requires, up to a 50:50 split. What I haven't established yet is whether having a dedicated thread for reading the HID interface in Powershell ends up effectively prioritising reads at the expense of writes, making the Linux side dominant when writing. Of course, this is all very specific to using RAW HID as a transport; using something else like a text-based printer, or CDC or ACM, would have less of an issue, I suspect, simply because the available data rate would be that much higher.
That's why I stated one ends up with half-duplex (if both sides are sending; your idea scales very well if only one side is sending).
I'm still trying to get this. I was planning to extend your read-loop example to a test with concurrent but independent read and write on both ends (separate threads). As I still don't get why I could reach a transfer rate > 32 KBps, I think it could be possible that the device file allows writing and reading at the same time (SHARED ACCESS), which essentially would mean FULL DUPLEX is possible with a single HID device interface. I have to delay this test, as it turns out that the
Sure, but moving away from HID would mean getting louder (triggering endpoint protection on the target). I like the HID approach very much, and using ACM or other communication classes would be too easy ;-).
News: Replacing
Every call to ... Going to ping back with the results.
I struggle to believe that BeginWrite/EndWrite on a MemoryStream is necessary, or even efficient. Unless you are doing it to be compatible with Async operations on other types of streams, I'd just do a Write, and be done with it.
You're again right, the benefit couldn't be measured, but I was wondering why I achieved this high rate with WriteAsync?!?! Anyway, meanwhile I've found the answer. Good news... believe it or not: the device file is FULL DUPLEX. You can send data in both directions at about 64 KB/s in parallel. Test output, powershell side (reading and writing at the same time):
And the other end (python; the first read is the trigger to start the threads):
Although I didn't measure the overall time, reading and writing were done concurrently. The difference is that I fully decoupled inbound and outbound data (no echo server). I haven't implemented any tests for packet loss, but the sending and receiving threads have to have matching report counts in order to allow the threads to terminate, so it is very likely that there is no packet loss. Testing this is easy, as I included routines to print out inbound data on both sides (disabled to reduce influence on time measurement). Test code will be up in some minutes.
Next interesting observation: starting only the powershell side of communication (no python endpoint on the Pi Zero), it turns out that ... This means there should never be any packet (= report) loss, which again means there's no need to use ACKs on a per-report basis. One could simply reassemble reports into a bigger stream. No problem on Linux, but PS is still missing something like a FIFOMemoryStream (which in fact could be implemented easily on your side, as you use inline C#). Test from PS printing out the report count on write, without a listener running on the RPi:
Note: none of the pending reports are lost if the python side is started some minutes later.
@RoganDawes here're the test files https://github.com/mame82/tests/tree/master/fullduplex I guess I have to reimplement everything, as alternating read/write is the worst approach. Have you found a replacement for named pipes to interface with the upper protocol layers (some sort of FIFO stream available in .NET 3.5)?
Very interesting results! I guess my implementation was naive: as an echo server, it introduced delays! I never bother reassembling the stream; I simply write the data portion of the packets to their destination as I receive them. So I have no need for a FIFO memory stream at all. The main issue then is failure to read packets on the Powershell side, resulting in lost data. This is easily seen by introducing a console write in the read loop - I ended up losing about 500 packets each time! If you keep that loop clean and tight, then hopefully there should be no packet loss in that direction either!
I haven't had packet loss at any time. This seems to be clear now, as write() is blocking if data isn't read on the other end. Read() was blocking, too.
As said, the receive count was capped by the `for` loops, which only terminate once exactly the number of packets sent has been received. One more note: the only case of report loss I could imagine would be if reports are written on the Linux end (slow RPi) and the Windows end reads back too slowly (unlikely, but possible). But you're again right: no listener on Windows = data loss.
@RoganDawes I guess the solution is here: Started powershell threads first, but deployed a delay in the read thread to force packet loss:

```powershell
# normal script block, should be packed into thread later on
$HIDinThread = {
    $hostui.WriteLine("Reading up to $in_count reports, with blocking read")
    $inbytes = New-Object Byte[] (65)
    $sw = New-Object Diagnostics.Stopwatch
    for ($i=0; $i -lt $in_count; $i++)
    {
        $cr = $HIDin.Read($inbytes,0,65)
        if ($i -eq 0) { $sw.Start() }
        Start-Sleep -m 100 # try to miss reports
        $utf8 = [System.Text.Encoding]::UTF8.GetString($inbytes)
        $hostui.WriteLine($utf8)
    }
    $sw.Stop()
    $timetaken = $sw.Elapsed.TotalSeconds
    $KBps = $in_count * 64 / 1024 / $timetaken
    $hostui.WriteLine("$in_count reports have been read in $timetaken seconds ($KBps KB/s)")
}
```

Additionally, I added console output before the first report is sent from the PS out thread:

```powershell
for ($i=0; $i -lt $out_count; $i++)
{
    if ($i -eq 0) { $hostui.WriteLine("Sending first report out on send thread") } # output is blocked by the other thread if it interacts with $hostui, so this line couldn't be placed exactly
    $HIDout.Write($outbytes,0,65)
    if ($i -eq 0) { $sw.Start() }
    #$hostui.WriteLine("reports written $i") # test how many reports are needed till write() blocks if there is no receiver on the other end
}
```

Starting the PS process first and running:

```
seq -f "%063g" 1 100 > /dev/hidg1
```

I get the following interesting result:

```
PS D:\del\P4wnP1\powershell> D:\del\P4wnP1\powershell\concurrent_rw2.ps1
Path: \\?\hid#vid_1d6b&pid_0137&mi_02#8&1f80c44c&1&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
Invalid handle
Path: \\?\hid#vid_1d6b&pid_0137&mi_03#8&4567976&0&0000#{4d1e55b2-f16f-11cf-88cb-001111000030}
Input: 65, Output: 65
Reading up to 16384 reports, with blocking read
Writing 16385 reports with synchronous 'Write'
Sending first report out on send thread
000000000000000000000000000000000000000000000000000000000000001
000000000000000000000000000000000000000000000000000000000000069
000000000000000000000000000000000000000000000000000000000000070
000000000000000000000000000000000000000000000000000000000000071
000000000000000000000000000000000000000000000000000000000000072
000000000000000000000000000000000000000000000000000000000000073
000000000000000000000000000000000000000000000000000000000000074
000000000000000000000000000000000000000000000000000000000000075
000000000000000000000000000000000000000000000000000000000000076
000000000000000000000000000000000000000000000000000000000000077
000000000000000000000000000000000000000000000000000000000000078
000000000000000000000000000000000000000000000000000000000000079
000000000000000000000000000000000000000000000000000000000000080
000000000000000000000000000000000000000000000000000000000000081
000000000000000000000000000000000000000000000000000000000000082
000000000000000000000000000000000000000000000000000000000000083
000000000000000000000000000000000000000000000000000000000000084
000000000000000000000000000000000000000000000000000000000000085
000000000000000000000000000000000000000000000000000000000000086
000000000000000000000000000000000000000000000000000000000000087
000000000000000000000000000000000000000000000000000000000000088
000000000000000000000000000000000000000000000000000000000000089
000000000000000000000000000000000000000000000000000000000000090
000000000000000000000000000000000000000000000000000000000000091
000000000000000000000000000000000000000000000000000000000000092
000000000000000000000000000000000000000000000000000000000000093
000000000000000000000000000000000000000000000000000000000000094
000000000000000000000000000000000000000000000000000000000000095
000000000000000000000000000000000000000000000000000000000000096
000000000000000000000000000000000000000000000000000000000000097
000000000000000000000000000000000000000000000000000000000000098
000000000000000000000000000000000000000000000000000000000000099
000000000000000000000000000000000000000000000000000000000000100
```

Seems there's no report loss if the first send has taken place from the Windows side.
Last assumption is wrong:
So ACKs have to be sent from the Pi to Windows :-( Assuring that the Windows end reads fast enough couldn't be done reliably otherwise, I guess.
Looks like you lost 68 reports to me?
Indeed, till the first report was sent from Windows. But raising the simulated read delay produces more report loss (the assumption that a first send from Windows is needed was wrong). So again I struggle on Windows, as I know Linux isn't able to send if nothing is received (remember the crash on an unresponsive IRQ when sending data to /dev/hidg1 before Windows is able to read). So one has to assure one read per millisecond on Windows, or use ACKs from the RPi to Windows. Seems your P-ACK idea is the best way to do this; again you're right.
So, as discussed, let Windows be the first to send a packet. Once the Linux side has received a packet, you know that the Windows side is ready to receive, and communications can begin. Alternatively, you can monitor the dmesg output to see when the relevant USB configuration has been selected by the Windows host, to know that the endpoint has been "activated". However, it still doesn't mean that the powershell is running yet - this you can "discover" by waiting for the powershell to send you the first packet!
Yes, I'm already doing this, both in production code and in the full duplex example provided above. Anyway, I was wrong in assuming that reports aren't missed when sending is started from Windows. It is simple... Linux writes to the HID device non-blocking, and reading from the FileStream misses reports if done too slowly. This isn't the case for sending from Windows to Linux. I'm thinking about new tests, changing the underlying mechanics away from input/output reports, but I'm running out of time.
Well, go with the 32 kBps option in the meantime, version 2 can get higher speed ;-)
Thumbs up for this comment... I'm already suffering from tunnel vision, trying to optimize low level HID communications while losing focus on other things which I wanted to implement in P4wnP1. But it doesn't get boring: another funny thing is that I'm faking RNDIS to run at 20 GBit/s, which involves different issues. If you're interested in this, here's the link https://github.com/mame82/ratepatch (applies on raspbian with kernel 4.4.50+)
Not surprisingly, I'm still thinking about the report loss occurring when writing to /dev/hidg on Linux and reading back from powershell too slowly. So please excuse the next large paste of test output. New observation: if sending is aborted on the Linux side, the last 32 reports can be read back (with a 500 ms delay) without any loss. This means the FileStream is backed by a 2048-byte buffer. If it were possible to access this buffer directly from powershell, a notification could be sent back to block writing on the other side (including the last SEQ number received). Unfortunately, I haven't found a way to access the underlying buffer... FileStream.Position and FileStream.Length are both unset. So if you're going to implement your P-ACK idea, it seems 32 reports is the magic number to track and send ACKs for. So your sequence number misses exactly one bit to cope with that. Here's the test output showing the described behaviour. The parts where large amounts of reports are missed were caused by unlimited sending. The parts with continuous reports received are the result of manually aborting sending from the Linux side (the last 32 reports are recovered with a 500 ms delay).
Here's the example output
Interesting! So, if I limited my "packets in flight without ACK" to 16 (the max of my sequence numbers), I could be sure that there would be no packet loss. Funnily enough, I instrumented my "echo loop" to indicate how many reports there were in the queue at the beginning of the while loop. Not once did I get more than 16 reports, with a 1 ms sleep once the queue was drained. The unfortunate part is that my sequence numbers are per "connection", of which I can have up to 255 at once (in theory). So I'd have to track the unacknowledged packets at a different level. Which, unfortunately, is a bit of a layering violation, I think. I think the "solution" is going to be making sure that the read loop just reads as fast as possible, and if any packets are observed to be missing, to send a RST on that channel and let it start again. Not particularly robust, but it should work, I hope!
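A minimal sketch of the "at most N packets in flight" rule, in Python (window size and names are illustrative): the sender simply blocks once `window` reports are unacknowledged, which, given the 32-report buffer observed above, should make receiver overruns impossible:

```python
import threading

class BoundedSender:
    def __init__(self, hid_write, window=16):      # 16 = the 4-bit SEQ space
        self.hid_write = hid_write
        self.slots = threading.Semaphore(window)   # free in-flight slots
        self.unacked = {}                          # seq -> report, for resends
        self.seq = 0

    def send(self, report):
        self.slots.acquire()            # blocks once `window` reports are unacked
        self.unacked[self.seq] = report
        self.hid_write(report)
        self.seq = (self.seq + 1) % 16

    def on_ack(self, seq):
        if self.unacked.pop(seq, None) is not None:
            self.slots.release()        # one more slot becomes available
```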
FWIW, by simply substituting the $device.BeginRead/EndRead pairs with dequeueing packets from the readloop/queue, I managed to get 13 kBps throughput with a cmd.exe doing "dir /s". Strangely, when writing the packets out to a socket, the throughput dropped to about 8 kBps.
While starting to implement a new lower layer communication scheme based on our observations, I put some comments (design ideas) into the source of the
As this isn't implemented in 5 minutes, I'd like to kindly ask you to review these comments before I start coding (and maybe throw it away in the end). Idea (from the Linux point of view = USB device, not host):
Good news: I implemented FULL DUPLEX similar to the suggestion above, with some improvements. Results:
So I'm at ~45500 bytes/s from Pi to Powershell (real net data, without protocol headers). Report loss detection (including blocking if the output buffer reaches 32 reports which haven't been read, and resending of unacknowledged reports) is only done from Pi to Windows (HID input reports), as we know that the other way around writes are blocked if no data is read back (the assumption still holds true in all tests; max. 4 report writes to the FileStream without a read on the Linux end). Protocol overhead is reduced to 2 header bytes on the "link layer", so the payload size is 62 bytes per report. If you're interested in work-in-progress code, ping back. Btw, I decided to interface to upper layers with synchronized input/output queues consuming/holding pure reports... this still isn't fully implemented. Fragmentation/defragmentation of larger streams is going to be handled in upper layers (based on a FIN bit in reports). DST/SOURCE fields (or the channel, in your case) will be moved to upper layers, too. I don't need this information on the link layer anymore. Reason: the endpoints of this layer are well defined and pre-known: USB host <--> USB device (or PowerShell client <--> Python server).
Forgot to mention: the measurement was on Win 10, 64 bit. On Win 7, 32 bit, throughput is far slower (going to test tomorrow; the code is PS 2.0 and .NET 3.5 compatible, not sure if this is the bottleneck on Win 7).
I think the basic idea is solid. My approach is to have a single "channel identifier", resulting in a max of 256 (concurrent!) channels, rather than 65536, which I think is reasonable in the circumstances. Packet format in my case (following some of your ideas above, so not implemented yet) would look like:
I'm still not 100% convinced about acking every packet - i.e. making a continuous stream of comms at full rate, as this will require fairly significant resources to keep up, possibly resulting in suspicious activity on the victim being detected.
And indeed, now that you mention it, having the sequence number and packet length first, and leaving the channel and payload for a higher layer, makes perfect sense, even in my implementation. Nice work!
So here's the test result on Windows 7 (a bit disappointing compared to win 10)
I still read/write packets in both directions (input and output reports) at the maximum possible rate (limited by file IO and CPU speed). I've quit the approach of ACK'ing every SEQ number received; instead I changed to accumulative ACKs. This raised the new question of how to detect report loss on the Pi, as ACK sequences aren't necessarily continuous (1, 4, 6 is a valid ACK sequence which could be received by the Pi, and it means that the last valid SEQ received was 6). So communication looks like this (reader and writer threads are decoupled, but share the current state of the SEQ number received and the last ACK sent): No report loss (PowerShell perspective; read and write threads run asynchronously with different loop times, in this example the write loop is slower): Same example with report loss: My report format: INPUT REPORTS
OUTPUT REPORTS
A resend request would be repeated till the state of the PowerShell peer changes (because reports are constantly flowing in both directions at the maximum possible speed).
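A sketch of the accumulative-ACK bookkeeping on the sender side, in Python (the exact report layout above was not captured in this export, so the field handling is assumed): an ACK for n confirms everything up to n, so a gapped ACK sequence like 1, 4, 6 is fine and only the highest value matters. Wrap-around of the SEQ space is omitted for brevity:

```python
class RetransmitBuffer:
    # Sender-side view: an ACK value acknowledges its own sequence number
    # and everything before it, so the buffer is pruned up to the ACK.
    def __init__(self):
        self.pending = {}            # seq -> report, at most 32 entries

    def sent(self, seq, report):
        self.pending[seq] = report

    def acked(self, ack_seq):
        # cumulative ACK: drop every report with seq <= ack_seq
        for s in [s for s in self.pending if s <= ack_seq]:
            del self.pending[s]

    def resend_from(self, seq, hid_write):
        # the peer detected loss: replay everything still pending from seq on
        for s in sorted(self.pending):
            if s >= seq:
                hid_write(self.pending[s])
```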
Right, but my current link layer stack is focused on robust communication and can handle large delays in the read/write threads on both sides (with the impact of data processing overhead in mind). This could be used to lower CPU load by introducing (conditional) delays (at the cost of a decrease in maximum transfer speed). So yes, I'm not sending reports on demand, but continuously ... I see HID reports as the electrical medium, compared to the ISO/OSI model. This means power is continuously flowing, but not necessarily with useful payload. This again assures that header data (or state data) is exchanged instantly.
Looking at the Win 7 results again, it seems that FileStream.Read is blocking FileStream.Write somehow. Maybe these methods are synchronized on .NET 3.5?!? I won't test whether this could be circumvented by accessing the File object directly (which should allow overlapped read/write), as this would involve too much additional csharp code. Anyway, I'm happy with the current implementation and will refocus on polishing and constructing a clean interface to upper layers. I'm still thinking about putting a socks5 proxy on the python end, to let powershell establish requested TCP connections. This would be a scenario handling channel splitting on an upper layer, and therefore a clean and robust interface has to be provided to interface with my link layer. Creating a layer interface is easy with object oriented python, but again a mess with PS < 3.0, and I'm aiming at PS 2.0. The design idea is to wrap up the link layer in a PSCustomObject, which receives a device file in its constructor and provides read and write methods for LARGE data streams (thread creation, fragmentation and report loss handling should be done internally). As this idea involves even more code, it has to be placed in stage2, and a simplified protocol will be used to deliver stage1. This involves even more code on the Raspberry side, as the server has to handle two different protocols, but I believe it is worth the effort.
One more addition. As shown in the report layout comments, I'm planning to use a FIN bit. Its purpose is to mark the end of fragmented streams. A start flag isn't needed, as the first report with a payload (length field > 0) starts a stream, and all succeeding reports are concatenated until the FIN bit is set in the terminating report. My former approach used an empty report to terminate fragmented streams, which drops transfer speed even more. In real world usage, many streams fit into a single report (for example a directory listing in a shell, where each output line is interpreted as a single stream ... this led to sending an empty report after each line in my old implementation, producing way too much overhead). Are you aware of a simple PS 2.0 compatible way to create objects with custom member functions (inline csharp code is a no-go for me)? The only option I've found so far is a PSCustomObject.
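The receiving side of this FIN-based reassembly is pleasingly small; a Python sketch (the flag position is my assumption, placed just above the 5-bit SEQ field described in the next comment):

```python
FIN = 0x20  # assumed flag bit, directly above the 5-bit SEQ field

def reassemble(reports):
    # `reports` yields (flags, payload) tuples, already in sequence order.
    # A stream starts with the first non-empty payload and ends with FIN;
    # no explicit start flag or empty terminating report is needed.
    stream = bytearray()
    for flags, payload in reports:
        stream += payload
        if flags & FIN:
            yield bytes(stream)  # one complete upper-layer stream
            stream = bytearray()
```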
Last comment on the current implementation: the sequence number range in use is 0 to 31. This allows tracking 32 reports, which is the max size of my output buffer on the Pi. This again fits the observed maximum size of the input buffer of the Windows FileStream. As this is my max, SEQ/ACK never consume more than 5 bits, leaving room to use the remaining bits as flags. The flags are extracted with cheap binary AND/OR/NAND operations.
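In the same spirit, the header packing might look like this in Python (the thread only fixes SEQ/ACK at 5 bits and 2 header bytes; the exact bit positions and flag names here are my guesses):

```python
SEQ_MASK = 0x1F   # low 5 bits: sequence number 0..31
FIN      = 0x20   # the remaining 3 bits are usable as flags
RESEND   = 0x40   # flag names/positions are assumptions

def pack_header(seq, ack, flags=0):
    # 2 link-layer header bytes + 62 payload bytes = one 64-byte report
    return bytes([(seq & SEQ_MASK) | flags, ack & SEQ_MASK])

def parse_report(report):
    seq   = report[0] & SEQ_MASK
    flags = report[0] & 0xE0              # cheap binary AND, as described
    ack   = report[1] & SEQ_MASK
    return seq, ack, flags, report[2:64]  # the remaining 62 bytes of payload
```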
My current code achieves rates > 40 KByte/s in both directions (full duplex), on Win10 as well as on Win7 (I mistakenly included Win7 console IO in the first speed tests). I've started to migrate to an object oriented approach to build a clean interface for my (now so-called) LinkLayer. Here's the test code, feel free to use it: https://github.com/mame82/tests/tree/master/fullduplex_fast
One really important fact (I didn't recognize it earlier): one has to use two dedicated file descriptors on Linux (although it's the same file) for reading and writing; otherwise the speed halves due to synchronized file access. This is at least true if python is used.
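In Python terms, the two-descriptor trick is just opening the gadget file twice, so the reader never serializes against the writer (device path as used earlier in the thread):

```python
# Two independent, unbuffered descriptors on the same HID gadget file.
hid_out = open('/dev/hidg1', 'wb', buffering=0)  # writes: device -> host
hid_in  = open('/dev/hidg1', 'rb', buffering=0)  # reads:  host -> device
```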
I'm done with the FullDuplex LinkLayer implementation. It works nicely on Windows 10; Windows 7 is untested. The code still needs polishing... Current features:
Performance on Win10 is fantastic:
That's 87% of max from Windows to Pi and 93% of max from Pi to Windows. That's insane. The measurement shown above was done by sending from both peers at the same time.
As promised, the code is here: So I'm done with this topic, but I've kept the issue open in case further questions arise. If there are no more questions, please close the issue. Again, thanks for this intensive and valuable exchange of information on HID low level development!
Remarks on the provided code at https://github.com/mame82/tests/tree/master/fullduplex_fast2: Most parts of the source are the implementation of the LinkLayer interface (as a class in python, a PSCustomObject in PowerShell). The start of the main code utilizing the LinkLayer objects is marked with the following comment in both source files:
So as shown, using the LinkLayer objects is relatively easy (not much code, besides the implementation itself). In order to test, you could replace the CSharp part (responsible for creating a FileStream to the HID device) with your own. In case you use the current CSharp code, you have to replace ... On the Linux end, my device file is
Hi @RoganDawes
Additionally, I want to point out my python helper class, which uses the approaches discussed earlier to prepare stage1 code for typing out via HID. As you can see in this class, I'm still using base64 encoded GZip streams, which are converted to PowerShell code on the fly. Thus things like loading custom assemblies without touching disk, initializing variables with code or binary data etc. became possible. Currently I'm using my
Btw, I've chosen to implement a console-based approach as the frontend for my current HID backdoor (yes, it is a bit meterpreter'ish). One connects to P4wnP1 via SSH, and the frontend is embedded into a screen session. The idea behind this approach is to implement a socks4a or socks5 server later on, which could be reached via the same SSH session and relay traffic through the target client. This would be a real airgap bridge.
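For illustration, the core of such a helper (not the actual P4wnP1 class, just the approach described) can be quite small: gzip the stage, base64 it, and emit PowerShell that inflates and runs it in memory. The generated snippet sticks to .NET 2.0-era APIs for PS 2.0 compatibility:

```python
import base64
import gzip

def ps_stage_loader(stage_source: str) -> str:
    # Compress the stage and wrap it in PowerShell that decompresses and
    # invokes it from memory, so nothing touches disk.
    b64 = base64.b64encode(gzip.compress(stage_source.encode())).decode()
    return (
        "$d=[Convert]::FromBase64String('" + b64 + "');"
        "$ms=New-Object IO.MemoryStream(,$d);"
        "$gz=New-Object IO.Compression.GZipStream($ms,"
        "[IO.Compression.CompressionMode]::Decompress);"
        "$sr=New-Object IO.StreamReader($gz);"
        "IEX $sr.ReadToEnd()"
    )
```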
@RoganDawes I finished my HID backdoor and added you to the credits https://github.com/mame82/P4wnP1/blob/master/README.md. It can be used to make the window invisible while keeping focus. As P4wnP1 types chars very fast, they run into the STDIN buffer of the target window, which can't be interrupted by user interaction. The final stage1 needs four lines of code to hide the window, and the rest gets typed out in about 2 to 6 seconds (depending on the stage1 type; the pure powershell stage is more compact than the .NET assembly version). See my readme for reference.
Very cool! Nice work! And thanks for the shoutout :-)
Yeah, no problem. This was an inspiring conversation. Seytonic demoed the payload. The final attack starts at about 5:30 in the video... look closely, the powershell window disappears really fast. Stage 2 download and execution has finished when the status changes to "client connected". I still use your WMI approach to enumerate the HID device (at least in the default version of stage 1).
Continuation of the ongoing discussion from here