Appendix D
NEON Intrinsics and Instructions
This appendix summarizes the NEON engine: the data types it supports, the lane types, the available instructions, and the naming scheme used for intrinsics.
Table D-1 lists the different data types supported on the NEON engine, and the corresponding C data types.
DATA TYPE | D-REGISTER (64 BITS) | Q-REGISTER (128 BITS) |
Signed integers | int8x8_t | int8x16_t |
 | int16x4_t | int16x8_t |
 | int32x2_t | int32x4_t |
 | int64x1_t | int64x2_t |
Unsigned integers | uint8x8_t | uint8x16_t |
 | uint16x4_t | uint16x8_t |
 | uint32x2_t | uint32x4_t |
 | uint64x1_t | uint64x2_t |
Floating-point | float16x4_t | float16x8_t |
 | float32x2_t | float32x4_t |
Polynomial | poly8x8_t | poly8x16_t |
 | poly16x4_t | poly16x8_t |
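As a minimal sketch of how the D-register and Q-register types relate, the following helper widens eight 8-bit values (one uint8x8_t) into eight 16-bit values (one uint16x8_t). The name widen8 is hypothetical and chosen for illustration; the scalar branch is an assumption added only so that the code also builds on targets without NEON.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: widen eight unsigned bytes to eight 16-bit values. */
void widen8(uint16_t dst[8], const uint8_t src[8]) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint8x8_t  d = vld1_u8(src);  /* 8 lanes x 8 bits  = 64 bits  (D register) */
    uint16x8_t q = vmovl_u8(d);   /* 8 lanes x 16 bits = 128 bits (Q register) */
    vst1q_u16(dst, q);
#else
    for (int i = 0; i < 8; ++i)   /* scalar fallback for non-NEON builds */
        dst[i] = src[i];
#endif
}
```

The type names read as lane type, then lane width in bits, then lane count, so the product of the last two numbers is always 64 for a D-register type and 128 for a Q-register type.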
Table D-2 lists the lane types in each class, and the number of types each class contains.
CLASS | COUNT | TYPES |
int | 6 | int8, int16, int32, uint8, uint16, uint32 |
int/64 | 8 | int8, int16, int32, int64, uint8, uint16, uint32, uint64 |
sint | 3 | int8, int16, int32 |
sint16/32 | 2 | int16, int32 |
int32 | 2 | int32, uint32 |
8-bit | 3 | int8, uint8, poly8 |
int/poly8 | 7 | int8, int16, int32, uint8, uint16, uint32, poly8 |
int/64/poly | 10 | int8, int16, int32, int64, uint8, uint16, uint32, uint64, poly8, poly16 |
arith | 7 | int8, int16, int32, uint8, uint16, uint32, float32 |
arith/64 | 9 | int8, int16, int32, int64, uint8, uint16, uint32, uint64, float32 |
arith/poly8 | 8 | int8, int16, int32, uint8, uint16, uint32, poly8, float32 |
floating | 1 | float32 |
any | 11 | int8, int16, int32, int64, uint8, uint16, uint32, uint64, poly8, poly16, float32 |
Table D-3 contains a list of NEON instructions, as well as a brief description of each instruction.
INSTRUCTION | DESCRIPTION |
VABA | Absolute difference and Accumulate |
VABD | Absolute difference |
VABS | Absolute Value |
VACGE | Absolute Compare Greater Than or Equal |
VACGT | Absolute Compare Greater Than |
VACLE | Absolute Compare Less Than or Equal |
VACLT | Absolute Compare Less Than |
VADD | Add |
VADDHN | Add, Select High Half |
VAND | Bitwise AND |
VBIC | Bitwise Bit Clear |
VBIF | Bitwise Insert if False |
VBIT | Bitwise Insert if True |
VBSL | Bitwise Select |
VCEQ | Compare Equal |
VCGE | Compare Greater Than or Equal |
VCGT | Compare Greater Than |
VCLE | Compare Less Than or Equal |
VCLS | Count Leading Sign bits |
VCLT | Compare Less Than |
VCLZ | Count Leading Zeroes |
VCNT | Count set bits |
VCVT | Convert between different number formats |
VDUP | Duplicate scalar to all lanes of vector |
VEOR | Bitwise Exclusive OR |
VEXT | Extract |
VFMA | Fused Multiply and Accumulate |
VFMS | Fused Multiply and Subtract |
VHADD | Halving Add |
VHSUB | Halving Subtract |
VLD | Vector Load |
VMAX | Maximum |
VMIN | Minimum |
VMLA | Multiply and Accumulate |
VMLS | Multiply and Subtract |
VMOV | Move |
VMOVL | Move Long |
VMOVN | Move Narrow |
VMUL | Multiply |
VMVN | Move Bitwise NOT |
VNEG | Negate |
VORN | Bitwise OR NOT |
VORR | Bitwise OR |
VPADAL | Pairwise Add and Accumulate |
VPADD | Pairwise Add |
VPMAX | Pairwise Maximum |
VPMIN | Pairwise Minimum |
VQABS | Absolute Value, Saturate |
VQADD | Add, Saturate |
VQDMLAL | Saturating Doubling Multiply Accumulate Long |
VQDMLSL | Saturating Doubling Multiply Subtract Long |
VQDMULL | Saturating Doubling Multiply Long |
VQDMULH | Saturating Doubling Multiply returning High half |
VQMOVN | Saturating Move and Narrow |
VQNEG | Negate, Saturate |
VQRDMULH | Saturating Rounding Doubling Multiply returning High half |
VQRSHL | Shift Left, Round, Saturate |
VQRSHRN | Shift Right, Round, Saturate, Narrow |
VQSHL | Shift Left, Saturate |
VQSHRN | Shift Right, Saturate, Narrow |
VQSUB | Subtract, Saturate |
VRADDHN | Add, Select High Half, Round |
VRECPE | Reciprocal Estimate |
VRECPS | Reciprocal Step |
VREV | Reverse Elements |
VRHADD | Halving Add, Round |
VRSHR | Shift Right and Round |
VRSQRTE | Reciprocal Square Root Estimate |
VRSQRTS | Reciprocal Square Root Step |
VRSRA | Shift Right, Round and Accumulate |
VRSUBHN | Subtract, Select High Half, Round |
VSHL | Shift Left |
VSHR | Shift Right |
VSLI | Shift Left and Insert |
VSRA | Shift Right, Accumulate |
VSRI | Shift Right and Insert |
VST | Vector Store |
VSUB | Subtract |
VSUBHN | Subtract, Select High Half |
VSWP | Swap Vectors |
VTBL | Vector Table Lookup |
VTBX | Vector Table Extension |
VTRN | Vector Transpose |
VTST | Test Bits |
VUZP | Vector Unzip |
VZIP | Vector Zip |
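The Q (saturate), H (halving), and R (round) letters in the mnemonics above change how the result of each lane is computed. The following is a minimal scalar sketch of what a few of these instructions do to a single unsigned 8-bit lane. It is a reference model for illustration only, not NEON code, and the function names are hypothetical.

```c
#include <stdint.h>

/* Scalar models of one unsigned 8-bit lane. Names are hypothetical. */

/* VADD.I8: the sum wraps modulo 256. */
uint8_t lane_vadd_u8(uint8_t a, uint8_t b) {
    return (uint8_t)(a + b);
}

/* VQADD.U8: the sum saturates at 255 instead of wrapping. */
uint8_t lane_vqadd_u8(uint8_t a, uint8_t b) {
    unsigned sum = (unsigned)a + b;
    return (uint8_t)(sum > UINT8_MAX ? UINT8_MAX : sum);
}

/* VHADD.U8: halving add, (a + b) / 2 computed without overflow, truncated. */
uint8_t lane_vhadd_u8(uint8_t a, uint8_t b) {
    return (uint8_t)(((unsigned)a + b) >> 1);
}

/* VRHADD.U8: halving add with rounding, (a + b + 1) / 2. */
uint8_t lane_vrhadd_u8(uint8_t a, uint8_t b) {
    return (uint8_t)(((unsigned)a + b + 1u) >> 1);
}

/* VABD.U8: absolute difference |a - b|. */
uint8_t lane_vabd_u8(uint8_t a, uint8_t b) {
    return (uint8_t)(a > b ? a - b : b - a);
}
```

For the inputs 200 and 100, the wrapping add gives 44 while the saturating add gives 255, which is why the saturating forms are usually preferred for pixel and audio data.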
Intrinsics provide a way to use NEON instructions directly from C, without writing assembly. NEON intrinsic names follow this pattern:
v[q][r]name[u][n][q][_lane][_n][_result]_type
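where:
q (before the name) indicates a saturating operation
r indicates a rounding operation
name is the descriptive name of the operation
u indicates signed-to-unsigned saturation
n indicates a narrowing operation
q (after the name) indicates an operation on 128-bit (Q-register) vectors
_lane indicates a scalar operand taken from a lane of a vector
_n indicates a scalar operand supplied as an argument
_result is the result type in short form
type is the type of the arguments in short form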
For example, vmul_s16 multiplies two vectors of signed 16-bit values and is equivalent to VMUL.I16. Some examples in C include:
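uint32x4_t vec128 = vld1q_u32(i); // Load four 32-bit values from the pointer i
uint8x8_t vadd_u8(uint8x8_t, uint8x8_t); // Add two vectors of eight 8-bit lanes
int8x16_t vaddq_s8(int8x16_t, int8x16_t); // Add two vectors of sixteen 8-bit lanes (q = 128-bit; the saturating form is vqaddq_s8)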
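Putting the load, add, and store intrinsics together, a typical loop might look like the following sketch. The function name add_u32 is hypothetical; the code assumes a compiler that provides arm_neon.h when NEON is enabled, and falls back to plain C otherwise so that it still builds on other targets.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for n 32-bit values. */
void add_u32(uint32_t *dst, const uint32_t *a, const uint32_t *b, size_t n) {
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Process four lanes per iteration using Q registers. */
    for (; i + 4 <= n; i += 4) {
        uint32x4_t va = vld1q_u32(a + i);        /* load 4 x 32 bits */
        uint32x4_t vb = vld1q_u32(b + i);
        vst1q_u32(dst + i, vaddq_u32(va, vb));   /* add lanes, store */
    }
#endif
    /* Scalar tail (and the whole loop on non-NEON builds). */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

The scalar tail handles lengths that are not a multiple of four, so callers do not need to pad their buffers.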