Load/Store Multiple


Data stored in memory often has a spatial organization, and an example is the easiest way to explain it. Consider an array of ASCII characters. If a greeting message is printed every time a system powers on, the characters in that string will almost certainly be stored in a single, ordered, contiguous block of memory. The string is spatially organized because once we reference one character in the array, there is a high probability that we will also access the characters adjacent to it.

Software takes advantage of spatial locality all the time: strings, structures, objects, caches, and the stack all rely on it. When we are dealing with spatially organized data, there is a high probability that we will access data found at memory locations that are adjacent to one another. Since we are likely to access related data elements together, it can be advantageous to load or store them using a single instruction.


The Cortex-M architecture supports loading and storing data from/to multiple memory locations using a single instruction. LDM (Load Multiple) takes a single base address and a variable-length list of registers to load values into. STM (Store Multiple) takes a base address and a register list and writes the contents of those registers to memory as consecutive WORDs.

The LDM and STM instructions help to decrease the size of our executable image. If we were to store 8 WORDs of data using 8 consecutive STR instructions, this would require 32 bytes of code space (assuming each STR uses a 4-byte encoding). A single 4-byte STM instruction accomplishes the same task.

The other advantage is that an equivalent STM instruction takes fewer clock cycles to execute. An LDR/STR instruction takes two clock cycles to complete, so storing 8 WORDs individually takes 16 clock cycles. The STM instruction takes N+1 clock cycles to complete, where N is the number of registers in the register list. In this example, the STM instruction takes only 9 clock cycles, a reduction of (16 - 9) / 16, or roughly 44 percent!
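
As a rough sketch of the comparison above, the two sequences below store the same 8 WORDs. The register assignments (R0 as the base address, R1-R8 as the data) are hypothetical choices for illustration, and the .W suffix is used only to force the 4-byte STR encoding assumed in the size estimate. The cycle counts in the comments are the figures quoted above, not measurements.

    ; Eight individual stores: 8 x 4 bytes = 32 bytes, 8 x 2 = 16 cycles
    STR.W   R1, [R0, #0]
    STR.W   R2, [R0, #4]
    STR.W   R3, [R0, #8]
    STR.W   R4, [R0, #12]
    STR.W   R5, [R0, #16]
    STR.W   R6, [R0, #20]
    STR.W   R7, [R0, #24]
    STR.W   R8, [R0, #28]

    ; One store multiple: 4 bytes, 8 + 1 = 9 cycles
    STM     R0, {R1-R8}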

Here are a few key details of using LDM and STM:

  • The number of WORDs to load/store is based on the number of registers in the register list.  If you specify 3 registers in the register list, it will store 3 WORDs of data.
  • The size of each LDM and STM data element is always a WORD (4 bytes).
  • The register list does not need to be sequential.
  • The order of the registers in the register list does not matter.  ARM always loads/stores the lowest-numbered register from/to the lowest memory address.
  • You can add the DB suffix (LDMDB/STMDB) if you want the address to decrement before each access.
  • The base address of a LDM/STM must be on a WORD boundary.   A word boundary is an address where the final two bits are both 0.
  • The base address register must NOT be included in the register list.
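
The short sketch below illustrates several of these rules. It assumes R0 already holds a word-aligned address in memory that can be both read and written; the register choices are hypothetical.

    ; Non-sequential list: three registers, so three WORDs are loaded
    LDM     R0, {R1, R4, R7}    ; R1 <- [R0], R4 <- [R0 + 4], R7 <- [R0 + 8]

    ; Listing the registers in a different order changes nothing
    LDM     R0, {R7, R1, R4}    ; same result as the instruction above

    ; Decrement before: the three WORDs just below R0 are written
    STMDB   R0, {R1, R4, R7}    ; [R0 - 12] <- R1, [R0 - 8] <- R4, [R0 - 4] <- R7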


The example below demonstrates how to load 6 WORDs of data from FLASH and write it out to SRAM.  In this example, the two base address registers (R0 and R7) are not modified.

    ; Load the address of the data array
    ADR     R0, SRC_ADR

    ; Load 6 consecutive WORDs starting at SRC_ADR
    ; R0 is NOT modified
    ; All three instructions below do exactly the same thing
    LDM     R0, {R1-R6}
    LDM     R0, {R1, R2, R3, R4, R5, R6}
    LDM     R0, {R6, R1, R2, R4, R5, R3}  

    ; Load the destination address.  We cannot use the ADR
    ; instruction since the address of DEST_ADDR is more than
    ; 4K from the PC
    LDR     R7, =(DEST_ADDR)
    STM     R7, {R1-R6}
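
The example above does not show where SRC_ADR and DEST_ADDR are defined. One possible set of definitions is sketched below, assuming ARM assembler (armasm) syntax; the area name and the data values are made up for illustration. SRC_ADR is assumed to sit in the same code area as the ADR instruction so that ADR can reach it, and DEST_ADDR reserves 24 bytes (6 WORDs) in read/write memory.

    ; Hypothetical definitions, not taken from the original example

    ; Source data, placed in the same code area as the ADR instruction
        ALIGN
    SRC_ADR
        DCD     1, 2, 3, 4, 5, 6    ; 6 WORDs of source data in FLASH

    ; Destination buffer in SRAM
        AREA    MyData, DATA, READWRITE
        ALIGN
    DEST_ADDR
        SPACE   24                  ; room for 6 WORDs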


This example demonstrates how to use the LDM/STM instructions that do modify the base registers.  The only functional difference between the example above and this example is that R0 and R7 both get modified.

    ; Load Src and Dest Addresses
    ADR     R0, SRC_ADR
    LDR     R7, =(DEST_ADDR3)

    ; Use '!' to write back to the base address
    ; The base address is incremented based on the
    ; number of WORDs being accessed
    LDM     R0!, {R1-R3}    ; R0 <- R0 + 12
    STM     R7!, {R1-R3}    ; R7 <- R7 + 12
    LDM     R0!, {R1-R3}    ; R0 <- R0 + 12
    STM     R7!, {R1-R3}    ; R7 <- R7 + 12
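
Write-back is most useful inside a loop, because the base registers advance on their own. The sketch below repeats the same load/store pair four times to copy 12 WORDs. It assumes R0 and R7 have been loaded as in the example above. The loop label, the use of R4 as a counter, and the block count are hypothetical, and both buffers are assumed to be at least 48 bytes long.

        MOV     R4, #4          ; number of 3-WORD blocks to copy
    copy_loop
        LDM     R0!, {R1-R3}    ; R0 <- R0 + 12
        STM     R7!, {R1-R3}    ; R7 <- R7 + 12
        SUBS    R4, R4, #1      ; decrement the counter and set the flags
        BNE     copy_loop       ; repeat until the counter reaches zero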