Lots about perf testing

Neil Webber 2024-05-18 17:36:43 -05:00
parent d4b8e8d896
commit b237f9b100


@@ -168,15 +168,82 @@ I'll have to write more about this later; as a first step to getting around:
Note that the disk operations are synchronous (not threaded/asyncio). I have tried going async and it doesn't seem to make much difference; I'm not sure why just yet, but it works well enough in synchronous mode. In other words, when Unix commands a read, the read happens right then and no emulated instructions happen "during" the read ... then the interrupt is fired and Unix continues, as if the disk were infinitely fast because no instruction cycles elapsed between GO and the interrupt.
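
For illustration, a minimal sketch of that flow (hypothetical class and method names, not the emulator's actual device code):

    # Sketch of a synchronous "infinitely fast" disk: when the driver sets the
    # GO bit, the entire transfer completes immediately and the "done" interrupt
    # is requested before any further emulated instructions run.
    class SketchDisk:
        GO = 0o000001                       # hypothetical GO bit in the CSR

        def __init__(self, cpu, image):
            self.cpu = cpu                  # assumed to have request_interrupt()
            self.image = image              # bytearray holding the disk image

        def write_csr(self, value):
            if value & self.GO:
                self._transfer()            # runs to completion, synchronously
                self.cpu.request_interrupt(self)   # zero emulated cycles elapsed

        def _transfer(self):
            pass                            # copy blocks between image and memory
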
## Tests
Some (somewhat trivial) unit tests:

    % python3 pdptests.py
## Performance tests
The module `pdptests.py` can be used to run performance tests, like this:

    python3 pdptests.py -p

By default it measures 1 million MOV R1,R0 instructions. On a 2022 MacBook Air (M2) this takes about 25 seconds (the speed depends on the Python version). It will show something like:

    % python3 pdptests.py -p
    Instruction ['010100'] took 481.5 nsecs
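
One plausible way to arrive at a per-instruction figure like this (not necessarily how `pdptests.py` actually does it) is to time repeated 1-million-instruction runs and report the best one:

    import timeit

    def per_instruction_ns(run_one_million, repeats=50):
        # run_one_million is a hypothetical callable that executes the
        # instruction under test 1,000,000 times on the emulated CPU.
        # Taking the minimum over several repeats reduces timing noise;
        # the result is converted to nanoseconds per instruction.
        best = min(timeit.repeat(run_one_million, repeat=repeats, number=1))
        return (best / 1_000_000) * 1e9
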
Other instructions can be specified with `-i` or `--inst`. Any instruction understood by the methods in `pdpasmhelper` will work. For example:

    % python3 pdptests.py -p --inst 'clr r0'
    Instruction ['005000'] took 565.9 nsecs
It's interesting that CLR is slower than MOV; it seems likely this is entirely the overhead of the double-dispatch. The so-called `ssdd` (two-operand) instructions are dispatched directly (using their top four bits) to `op01_mov`, `op06_add`, etc. in `op4.py`. Something like CLR, however, dispatches to another dispatcher to further decode the next opcode digit (see `d3dispatcher` and the `op00_dispatch_table` for example). It might be interesting to see how much performance could be gained by literally building a 64K-entry dispatch table that just dispatched all instruction combinations to a direct handler ... ahhhh, the things a modern machine with gigabytes of memory can get away with!
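
For concreteness, here is a rough sketch of the two dispatch shapes being compared. The handler names echo the ones above, but the code itself is illustrative, not the emulator's actual tables:

    # Double dispatch (sketch): the top four bits pick a handler directly for
    # ssdd instructions like MOV; "0-opcode" instructions like CLR go through a
    # second decoder keyed on more opcode bits.
    def op01_mov(cpu, inst):
        pass

    def op00_clr(cpu, inst):
        pass

    op00_dispatch_table = {0o050: op00_clr}          # keyed on bits <11:6>

    def op00_dispatch(cpu, inst):
        op00_dispatch_table[(inst >> 6) & 0o77](cpu, inst)

    d4_table = {0o00: op00_dispatch, 0o01: op01_mov} # indexed by top four bits

    def execute(cpu, inst):
        d4_table[(inst >> 12) & 0o17](cpu, inst)     # MOV: one hop; CLR: two

    # The 64K-entry idea: resolve every possible 16-bit opcode to its handler
    # once, up front, so execution is always a single table lookup.
    flat_table = [None] * 65536
    for op in range(65536):
        if (op >> 12) & 0o17 == 0o01:
            flat_table[op] = op01_mov
        elif (op >> 6) & 0o1777 == 0o050:
            flat_table[op] = op00_clr

    def execute_flat(cpu, inst):
        flat_table[inst](cpu, inst)
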
Any single instruction with at most one additional operand word can be tested, so this works:

    % python3 pdptests.py -p --inst 'add $7,r0'
    Instruction ['062700', '000007'] took 1303.2 nsecs
REMINDER: `pdpasmhelper` uses unix v7 `as` syntax. Note, e.g., `$7` for the immediate constant value.
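
As a quick worked decode of the `add $7,r0` words above (standard PDP-11 two-operand field layout; this snippet is just a sanity check, not part of `pdptests.py`):

    word = 0o062700
    opcode   = (word >> 12) & 0o17   # 0o06 -> ADD
    src_mode = (word >> 9) & 0o7     # 2    -> autoincrement
    src_reg  = (word >> 6) & 0o7     # 7    -> the PC, so (PC)+ fetches 000007
    dst_mode = (word >> 3) & 0o7     # 0    -> register direct
    dst_reg  = word & 0o7            # 0    -> r0
    print(oct(opcode), src_mode, src_reg, dst_mode, dst_reg)   # 0o6 2 7 0 0
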
Register-to-register MOV operations, and some selected other instructions with register operands have been optimized. Memory operations of course take longer (already apparent in the above `add $7,r0` example because the `7` constant is a PC-immediate memory operand).
Registers r0-r3 are available for use in timing tests. Registers r0 and r1 are cleared at each major iteration of the outer timing loop; registers r2 and r3 are unmodified (but start at zero). Register r4 is used by the framework and must not be altered. Register r5 is initialized to point at a "safe" location for writing (for testing) and should not be altered.
The test code executes in USER mode with a full 64K address space. So this works just fine to test memory access speed for example:

    % python3 pdptests.py -p --inst 'mov (r0),r1'
    Instruction ['011001'] took 931.2 nsecs
That ends up looping over a read of location zero. This also works:

    % python3 pdptests.py -p --inst 'mov (r5),r0'
    Instruction ['011500'] took 930.5 nsecs
and (as mentioned) (r5) can be used as a write destination as well:

    % python3 pdptests.py -p --inst 'mov r0,(r5)'
    Instruction ['010015'] took 931.1 nsecs
Autoincrement can be tested this way, which is a bit funky but works because r0 simply wraps around and user space is fully mapped:

    % python3 pdptests.py -p --inst 'mov (r0)+,r1'
    Instruction ['012001'] took 1130.7 nsecs
The overhead of post-increment (or pre-decrement) is more than just the implied addition to the register, because of MMU (MMR1) semantics (the ability to unwind a partially-executed instruction if a page fault occurs). TODO: There might be room to optimize some of that overhead out in the (common?) case where the destination operand is a register. This would require fetching the memory operand BEFORE the autoincrement/decrement goes back into the source register, which might be messy to get right.
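
For illustration, the bookkeeping being described looks roughly like this (hypothetical names; a sketch of MMR1-style unwind, not the emulator's actual MMU code):

    class MMR1Sketch:
        # Record each autoincrement/decrement so that a page fault occurring
        # mid-instruction can be backed out and the instruction restarted.
        def __init__(self):
            self.adjustments = []                 # list of (regnum, delta)

        def note(self, regnum, delta):
            self.adjustments.append((regnum, delta))

        def unwind(self, registers):
            # Called when a page fault aborts the instruction: undo the partial
            # register modifications (in reverse order), then clear the record.
            for regnum, delta in reversed(self.adjustments):
                registers[regnum] = (registers[regnum] - delta) & 0o177777
            self.adjustments.clear()
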
Tests can be run without the MMU enabled; use option `--nommu`. CAUTION: The I/O page will be mapped to the last 8K of the test environment in this case, which implies that some tests (e.g., `--inst mov (r0)+,r1`) might be ill-advised as they will read/strobe various emulated I/O registers as r0 cycles through the 64K space.
Example:

    % python3 pdptests.py -p --nommu --inst 'mov r1,r2'
    Instruction ['010102'] took 397.6 nsecs
vs:

    % python3 pdptests.py -p --inst 'mov r1,r2'
    Instruction ['010102'] took 485.8 nsecs
A substantial amount of caching and careful coding work has gone into minimizing MMU overhead. In these test environments the MMU is configured in the way that might be typical for most operating systems (unix v5-v7 in particular).
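
A sketch of the general idea behind that kind of caching, purely illustrative (the names, and the emulator's actual strategy, may differ): translations can be remembered per (mode, space, page) and reused until the corresponding segmentation register changes.

    class TranslationCacheSketch:
        # Cache the physical base for each (mode, I/D space, page) and reuse it
        # until an APR (segmentation register) is rewritten.  This skips the
        # full descriptor decode on the hot path; access checks are omitted
        # here to keep the sketch short.
        def __init__(self, slow_lookup):
            self.slow_lookup = slow_lookup        # hypothetical full translation
            self.cache = {}

        def physaddr(self, mode, space, vaddr):
            key = (mode, space, vaddr >> 13)      # 8 pages of 8K: top 3 bits
            base = self.cache.get(key)
            if base is None:
                base = self.slow_lookup(mode, space, vaddr >> 13)
                self.cache[key] = base
            return base + (vaddr & 0o17777)       # 13-bit offset within the page

        def invalidate(self):
            self.cache.clear()                    # call whenever an APR changes
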
# TODO
Known areas that need work include:
* It would be nice to have more tests. Alternatively, it would be nice to be able to run DEC diagnostics (though this is a very high bar).
* Need to emulate more devices, especially a DL-11.
* The UNIBUS address space, and especially the UBA system, is a stub. The disk drive is, in effect, emulated as being on the Massbus and uses its BAE register (which the unix driver sets accordingly) for the 22-bit physical address extension bits.
* No floating point instructions implemented; not sure how important they are. They were an option, so presumably all code can deal with them not being present.