Skip to content

py/emitnative: Some more load/store improvements. #17477

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Jul 1, 2025

Conversation

agatti
Copy link
Contributor

@agatti agatti commented Jun 11, 2025

Summary

Here's the last - I hope - PR with Viper load/store improvements (well, for native too since they share those emitter primitives).

Some ports should, in theory, actually shrink down in size:

  • Xtensa has one single function per operation, each covering all operand sizes by flipping bits in a single instruction template (like the Thumb emitter in the previous PR in the series), still using narrow opcodes whenever possible
  • RV32 also has one single function per operation with a single function template each, with even more duplicated code removed, now emitting compressed opcodes whenever possible (even for store operations)
  • x86 already had the necessary opcode emitters to cover all load/store cases but they weren't exposed to Viper, so all operations should be one opcode each and - in theory - emit faster code
  • x64, like x86, also had all the necessary opcode emitters already there, with the same performance benefits

However, the Arm emitter is probably slightly bigger since there's one function per operation/size combination. I've tried to offset this by merging some single-opcode emitter functions into entities that generate the shortest sequence for all offsets. Halfword opcodes have a different format so I'd still have four functions rather than six - not really worth the effort for such a platform that, as I understand it, isn't as memory constrained as other ones supported by MicroPython.

Thumb only got redundant single-opcode emitters removed, so it's equivalent as far as emitted code goes.

I haven't replaced all the bit-width fit checks in the various emitters with the new macros, in order to keep this patch series more readable. For the time being I'll replace those as I encounter them in code that gets changed.

Boundary tests were cleaned up and improved as well - I didn't go as far as using unittest's sub-tests (one per bit width tested), but if that's desired I can migrate those tests to that framework.

Who would have thought that getting and setting simple variables would have triggered all this work? :)

Testing

tests/mcropython/viper_ptr* were executed successfully on the following ports:

  • QEMU (VIRT_RV32, SABRELITE, and MPS_AN385) both directly and with --via-mpy --emit native
  • ESP8266 (NodeMCU v3) both directly and with --via-mpy --emit native using the same prelude injection method introduced for natmod tests (-b was already taken in tests/run-tests.py, so only --begin is available)
  • ESP32 (generic D0WD board), both directly and with --via-mpy --emit native
  • x64 Unix using the STANDARD variant directly
  • x86 Unix using the COVERAGE variant directly

Trade-offs and Alternatives

As mentioned, the Arm emitter is slightly larger and it can be made a bit smaller if it's required - adding some complexity to that emitter backend.

Copy link

codecov bot commented Jun 11, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.56%. Comparing base (1eb27e1) to head (1b0cdc0).
Report is 10 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #17477      +/-   ##
==========================================
- Coverage   98.57%   98.56%   -0.01%     
==========================================
  Files         169      169              
  Lines       21968    21946      -22     
==========================================
- Hits        21654    21632      -22     
  Misses        314      314              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link

github-actions bot commented Jun 11, 2025

Code size report:

   bare-arm:    +0 +0.000% 
minimal x86:    +0 +0.000% 
   unix x64:  -336 -0.039% standard
      stm32:    +0 +0.000% PYBV10
     mimxrt:    +8 +0.002% TEENSY40
        rp2:   +16 +0.002% RPI_PICO_W
       samd:    +0 +0.000% ADAFRUIT_ITSYBITSY_M4_EXPRESS
  qemu rv32:   -55 -0.012% VIRT_RV32

@dpgeorge dpgeorge added the py-core Relates to py/ directory in source label Jun 16, 2025
Copy link
Member

@dpgeorge dpgeorge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this, it looks really good! A nice clean up and enhancement.

#define ASM_SUB_REG_REG(state, rd, rs) asm_rv32_opcode_sub(state, rd, rd, rs)
#define ASM_XOR_REG_REG(state, rd, rs) asm_rv32_emit_optimised_xor(state, rd, rs)
#define ASM_CLR_REG(state, rd)
#define ASM_LOAD8_REG_REG_REG(state, rd, rs1, rs2) \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this particular macro isn't needed? If it's omitted then I think emitnative.c will generate the same code using its generic fall back code.

(Same comment for ASM_STORE8_REG_REG_REG.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now they act the same as the fallback indeed, the reason for their existence is that it was some preparatory work for further changes in the RV32 emitter I planned to introduce (optional support for more RV32 C sub-extensions that introduce more compressed opcodes). This technically applies to ASM_LOAD16_REG_REG_REG and ASM_STORE16_REG_REG_REG too, as whilst the emitted code sequence is different the number of opcodes and the final sequence size is the same.

I don't mind taking them out now to potentially re-introduce them later (even if under a different form).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, it' fine to keep them.

@agatti agatti force-pushed the optimise-all-viper-emitters branch from 1371a54 to 2293a54 Compare June 26, 2025 18:56
agatti added 10 commits July 1, 2025 15:34
This commit adds two macros that lets check whether a given value can
fit an arbitrary number of bits, either as a signed or as an unsigned
number.

The native emitter code backends perform a lot of bit size checks to see
if a particular code sequence can be emitted instead of a generic one,
and each platform backend has its own ad-hoc macros (usually one per bit
count and signedness).

With these macros there's a single way to perform those checks, plus
there's no more chance for off-by-one mask length errors when dealing
with signed numbers.

Signed-off-by: Alessandro Gatti <a.gatti@frob.it>
This commit expands the implementation of Viper load/store operations
that are optimised for the RV32 platform.

Given the way opcodes are encoded, all value sizes are implemented with
only two functions - one for loads and one for stores.  This should
reduce duplication with existing operations and should, in theory, save
space as some code is removed.  Both load and store emitters will
generate the shortest possible sequence (as long as the stack pointer is
not involved), using compressed opcodes when possible.

Signed-off-by: Alessandro Gatti <a.gatti@frob.it>
This commit expands the implementation of Viper load/store operations
that are optimised for the Arm platform.

Now both load and store emitters should generate the shortest possible
sequence in all cases.  Redundant specialised operation emitters have
been folded into the general case implementation - this was the case of
integer-indexed load/store operations with a fixed offset of zero.

Signed-off-by: Alessandro Gatti <a.gatti@frob.it>
This commit expands the implementation of Viper load/store operations
that are optimised for the Xtensa platform.

Now both load and store emitters should generate the shortest possible
sequence in all cases.  Redundant specialised operation emitters have
been aliased to the general case implementation - this was the case of
integer-indexed load/store operations with a fixed offset of zero.

Signed-off-by: Alessandro Gatti <a.gatti@frob.it>
This commit removes load/store opcode implementations that have been
made redundant in 1f5ba69.

Signed-off-by: Alessandro Gatti <a.gatti@frob.it>
This commit expands the implementation of Viper load/store operations
that are optimised for the x86 platform.

Unlike other platforms, x86 already implemented all necessary
functions and all it took to expose these to Viper after the
infrastructure refactoring was to add a few defines.

Signed-off-by: Alessandro Gatti <a.gatti@frob.it>
This commit expands the implementation of Viper load/store operations
that are optimised for the x86 platform.

Like x86, x64 already implemented all necessary functions and all it
took to expose these to Viper after the infrastructure refactoring
was to add a few defines.

Signed-off-by: Alessandro Gatti <a.gatti@frob.it>
This commit reworks the Viper pointer boundary tests in order to make
them more accurate and easier to extend.

The tests are now easier to reason about in their output, using easier
to read values, and bit thresholds are now more configurable.  If a new
conditional code sequence is introduced, adding a new bit threshold is
just a matter of adding a value into a tuple at the beginning of the
relevant test file.

Load tests have also been made more accurate, with better function
templates to test register-indexed operations.

Signed-off-by: Alessandro Gatti <a.gatti@frob.it>
This commit introduces a mechanism to customise the code that is
injected to the board when performing a test file upload and execution.

A new argument, --begin", is added so regular Python code can be
inserted in the injected fragment between the module file creation and
the effective file import.  This is needed for running larger tests
(usually ones that have been pre-compiled with
"--via-mpy --emit native") on ESP8266, as that board does not have
enough memory to fit certain blocks of code unless additional
configuration is performed.

Signed-off-by: Alessandro Gatti <a.gatti@frob.it>
This commit cleans up the single entry point integer-indexed load/store
emitters that have been built by merging the single operand type
load/store functions in 1f5ba69.

To follow the same convention found in RV32 and Xtensa emitters, the
function operand size is not named after the left shift amount to apply
to the initial offset to get its true byte offset, but by a generic
"operand size".

Also, those functions were updated to use the new MP_FIT_UNSIGNED macros
to perform bit width checks when figuring out which opcode encoding is
the best one to use.

Signed-off-by: Alessandro Gatti <a.gatti@frob.it>
@dpgeorge dpgeorge force-pushed the optimise-all-viper-emitters branch from 2293a54 to 1b0cdc0 Compare July 1, 2025 05:43
@dpgeorge dpgeorge merged commit 1b0cdc0 into micropython:master Jul 1, 2025
67 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
py-core Relates to py/ directory in source
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants