[cmds] Speed up Paint code drawing on slow 8088 systems by ghaerr · Pull Request #2277 · ghaerr/elks

Speeds up drawing in Paint program considerably, discussed in #2269 (comment).

This turned out to be an interesting exercise, and an ia16-elf-gcc compiler bug was found in the process of trying to get it to not translate (x << 6) + (x << 4) to (x * 80) and emit a MUL instruction. On the 8088, MUL instructions are very slow and using left shifts will be much faster.

What was found with regard to ia16-elf-gcc is:

The default code generation option -Os (optimize for size) is not well suited for 8088 code generation. Specifically, it generates small code with no regard to speed.
Replacing the global optimization option for paint to -O3 (or -O2 or -O1) caused the compiler to emit code which crashed paint when drawing. This is still being investigated.
Trying to use the GCC __attribute__((optimize(0))) to turn off optimization for a single function generated very sloppy code which was deemed too slow.
When optimizing in the default -Os (small size) mode, GCC actually replaces (x << 6) + (x << 4) with (x * 80) and generates a MUL instruction.
In the same -Os default optimization, writing (x / 8) actually generates a DIV instruction (!!!). Coding (x >> 3) generates a right shift, which was nice to see.
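
The shift forms in the list above can be checked on any host compiler. This is a minimal sketch, assuming an 80-byte scanline (640 pixels, 8 pixels per byte in each plane) and the usual layout where the leftmost pixel is the most significant bit. The names `row_offset`, `byte_offset`, `pixel_mask` and `check_shift_forms` are hypothetical and are not taken from the paint source. Unsigned values are used on purpose: for negative signed values, `x / 8` and `x >> 3` can round differently, so the two forms only agree for non-negative operands.

```c
#include <assert.h>

/* Hypothetical helpers for illustration; not the names used in paint. */

/* y * 80 written with shifts only, since 80 = 64 + 16 */
unsigned int row_offset(unsigned int y)
{
    return (y << 6) + (y << 4);
}

/* Offset of the byte that holds pixel (x, y) within one plane */
unsigned int byte_offset(unsigned int x, unsigned int y)
{
    return row_offset(y) + (x >> 3);    /* x >> 3 equals x / 8 for unsigned x */
}

/* Bit within that byte; pixel 0 of each byte is the most significant bit */
unsigned char pixel_mask(unsigned int x)
{
    return (unsigned char)(0x80 >> (x & 7));
}

/* Compare the shift forms with plain multiply and divide over 640x480 */
void check_shift_forms(void)
{
    unsigned int x, y;

    for (y = 0; y < 480; y++)
        assert(row_offset(y) == y * 80);
    for (x = 0; x < 640; x++)
        assert((x >> 3) == x / 8);
}
```

For example, `byte_offset(10, 2)` is 161 (two rows of 80 bytes plus one byte) and `pixel_mask(10)` is 0x20. The largest row offset on a 480-line screen is 479 * 80 = 38320, which still fits in a 16-bit unsigned int.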

After playing way too long trying to get GCC to work without potentially putting each graphics function in a separate file, it was finally decided that translating the entire C86 ASM language vga-4bpp.s fast draw routines from AS86 format to GCC AS format was the way to go, and it now works very well.

As a result of conversations with @Vutshi and @dbalsom in #2269, the following changes were made to greatly increase the speed of paint on 8088. This has been tested with QEMU slowing it down considerably by adding -singlestep -icount 8,align to the QEMU command line. Soon I hope to get MartyPC running on my macOS laptop for cycle-accurate 8088 emulation.

Rewrite the VGA 4-planar C86 ASM routines in vga-4bpp.s to GCC assembly in vga-ia16.S.
Rewrite the cursor show/hide routines so that only the masked cursor bits are saved and restored. This sped up cursor drawing by about 50%.
Remove MUL and DIV instructions from the main loop event_wait_timeout routine.
Replace previous y * 80 code with (y << 6) + (y << 4) in C and the new vga-ia16.S code. vga-4bpp.s not done yet (for C86).
Slightly rewrote and got working @Vutshi's C drawhline routine. This is where I later found the GCC bug when compiling in -O3 mode. This routine is commented out now since it has been replaced with fast ASM code in vga-ia16.S.
Add a mouse acceleration filter which allows the mouse to speed up when moved rapidly. Seems to work well but needs to be more thoroughly tested on systems of different speeds.
Removed x >= 0 and y >= 0 clipping checks in render.c.
Rewrote R_ClearCanvas and R_DrawPalette to use drawhline for ia16.
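
To illustrate why a horizontal line routine is a good fit for byte-oriented planar memory, here is a rough host-side sketch. It is not the drawhline routine from this PR or the ASM in vga-ia16.S: it writes to a simulated single bit plane (the hypothetical `plane` array) rather than VGA memory, and `drawhline_sketch` is an invented name. The idea it shows is that only the first and last bytes of the line need a partial mask, while every byte in between is written whole, and that the row offset uses the same (y << 6) + (y << 4) form in place of y * 80.

```c
#include <assert.h>

#define BYTES_PER_ROW 80    /* assumed: 640 pixels / 8 pixels per byte */
#define ROWS          480

/* Simulated bit plane; real code would write to VGA memory instead */
unsigned char plane[BYTES_PER_ROW * ROWS];

/* Set pixels x1..x2 (inclusive) on row y. Hypothetical name and signature. */
void drawhline_sketch(unsigned int x1, unsigned int x2, unsigned int y)
{
    unsigned char *p, *last;
    unsigned char lmask, rmask;

    assert(x1 <= x2 && x2 < 640 && y < ROWS);

    /* row offset with shifts only: y * 80 == (y << 6) + (y << 4) */
    p    = plane + (y << 6) + (y << 4) + (x1 >> 3);
    last = plane + (y << 6) + (y << 4) + (x2 >> 3);

    lmask = (unsigned char)(0xFF >> (x1 & 7));        /* x1 to end of byte   */
    rmask = (unsigned char)(0xFF << (7 - (x2 & 7)));  /* start of byte to x2 */

    if (p == last) {            /* line starts and ends in the same byte */
        *p |= lmask & rmask;
        return;
    }
    *p++ |= lmask;              /* partial first byte */
    while (p < last)
        *p++ = 0xFF;            /* whole bytes, 8 pixels per write */
    *p |= rmask;                /* partial last byte */
}
```

With a zeroed plane, `drawhline_sketch(3, 20, 1)` leaves bytes 80, 81 and 82 as 0x1F, 0xFF and 0xF8: 18 pixels set with three byte writes and no MUL or DIV.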