design.txt

The POSIX "diff" and "patch" tools work line-oriented, but sometimes it would be preferable to work with word- or even character-based differences in the same way.

This project will therefore create a pre-/postprocessor, losslessly transforming input files into a format which contains only one word, one character or a one-byte hex-dump per text line.

We will also exploit fact that POSIX "diff" provides the "-b" option and POSIX "patch" provides the "-l" option, which make all sequences of one or more whitespaces (except newlines) at the end of a line compare equally as single word-separators.

The "word" transformation format shall therefore keep the original whitespace between words at the end of the lines, including all newline sequences until the start of the next word, if any.

This happens by transforming the beforementioned whitespace into a different whitespace sequence only consisting of SPACE and HT characters, using the following transformation rules:

LF    (\012) --> SPACE HT
SPACE (\040) --> SPACE SPACE HT
CR    (\015) --> SPACE SPACE SPACE HT
HT    (\011) --> SPACE SPACE SPACE SPACE HT
FF    (\014) --> SPACE SPACE SPACE SPACE SPACE HT
VT    (\013) --> SPACE SPACE SPACE SPACE SPACE SPACE HT

In addition, SPACE and HT characters which do not match any of the above patterns in the transformed output represent themselves, i. e. are to be interpreted literally.

All generated output lines will be terminated with LF.

The "character" transformation format shall do the same, so that those lines will in fact contain 2 or more characters rather than just one.

A challenge for all transformation formats are newline characters, though: We also want to compare any sequence of newline characters as if it were just a single whitespace.

And finally the case that the last line of the original file did not end with a newline character needs also be supported.

I propose a prefixed line format for transformation output, so that the following sample text (where the '$' represent newline characters for better visualizing trailing whitspace; the '$' are not present in the real input or output)

-----
Hi, world!$
$
This is a test.$
$
There are 3 empty lines after this one:$
$
$
$
End.$
-----

which is looks like

-----
0000000   H   i   ,  sp   w   o   r   l   d   !  nl  nl   T   h   i   s
0000020  sp   i   s  sp   a  sp   t   e   s   t   .  nl  nl   T   h   e
0000040   r   e  sp   a   r   e  sp   3  sp   e   m   p   t   y  sp   l
0000060   i   n   e   s  sp   a   f   t   e   r  sp   t   h   i   s  sp
0000100   o   n   e   :  nl  nl  nl  nl   E   n   d   .  nl
0000115
-----

when dumped with "od -ta" will be transformed into the following output (also dumped using "od" in the same way) when operating in "word" splitting mode:

-----
0000000   H   i   ,  sp  nl   w   o   r   l   d   !  sp  ht  sp  ht  nl
0000020   T   h   i   s  sp  nl   i   s  sp  nl   a  sp  nl   t   e   s
0000040   t   .  sp  ht  sp  ht  nl   T   h   e   r   e  sp  nl   a   r
0000060   e  sp  nl   3  sp  nl   e   m   p   t   y  sp  nl   l   i   n
0000100   e   s  sp  nl   a   f   t   e   r  sp  nl   t   h   i   s  sp
0000120  nl   o   n   e   :  sp  ht  sp  ht  sp  ht  sp  ht  nl   E   n
0000140   d   .  sp  ht  nl
0000145
-----

In character-split mode, the output would instead be:

-----
0000000   H  nl   i  nl   ,  sp  nl   w  nl   o  nl   r  nl   l  nl   d
0000020  nl   !  sp  ht  sp  ht  nl   T  nl   h  nl   i  nl   s  sp  nl
0000040   i  nl   s  sp  nl   a  sp  nl   t  nl   e  nl   s  nl   t  nl
0000060   .  sp  ht  sp  ht  nl   T  nl   h  nl   e  nl   r  nl   e  sp
0000100  nl   a  nl   r  nl   e  sp  nl   3  sp  nl   e  nl   m  nl   p
0000120  nl   t  nl   y  sp  nl   l  nl   i  nl   n  nl   e  nl   s  sp
0000140  nl   a  nl   f  nl   t  nl   e  nl   r  sp  nl   t  nl   h  nl
0000160   i  nl   s  sp  nl   o  nl   n  nl   e  nl   :  sp  ht  sp  ht
0000200  sp  ht  sp  ht  nl   E  nl   n  nl   d  nl   .  sp  ht  nl
0000217
-----

The hexdump transformation format will be simpler than the format above: All output consist of a 2-digit upper-case hexadecimal number representing the next byte.

The bitstream transformation format is the same, except that the output only consists of characters '0' and '1'.