STM32: Using External SRAM
This article is about using external SRAM with STM32. Through the FSMC (Flexible Static Memory Controller), STM32 MCUs can access external
SRAM. With luck, this will quench our thirst for RAM.
Requirements
This article assumes you have some basic knowledge of STM32 development and the following on hand:
- The C programming language
- What SRAM is
- The official documentation on the STM32 FSMC
- Some basic tuning skills
- An STM32 dev board that supports FSMC and has an external SRAM on board
- A debugger, such as ST-Link or J-Link
- An IDE; I'm using System Workbench for STM32
SRAM/PSRAM
All right, here comes a new acronym: PSRAM. Search for PSRAM on Wikipedia and the page that comes up tells us it is not a real SRAM but a pseudo one. JEDEC has a good explanation of PSRAM. Usually, SRAM is too expensive (compared to PSRAM) and small (capacities are usually counted in KB), whereas PSRAM is far cheaper than SRAM and provides sufficient capacity. Here is my favorite one.
Even its manual won't tell you it is a PSRAM, but we can infer that from its price and capacity. Fortunately, my dev board has a PSRAM chip of the same type on board.
A real SRAM looks like this one: 32KB for 38RMB. Since we don't have an unlimited budget, let's assume the MCU's built-in RAM is as fast as we need and treat it as our reference "real" SRAM. Then we can write some code to compare the performance of SRAM and PSRAM.
Some tools make your life easier
I use STM32CubeMX a lot. With its GUI we can get the board set up and some basic code generated with just a few clicks. Furthermore, with the following Eclipse plugins installed, coding becomes more pleasant:
- TM Terminal
- RxTx
They display the log strings coming from the serial port.
Enough talk, let's code
The manual is boring, boring and boring, especially the part about timing. At the very beginning I was frustrated by the timing sequences and waveform figures. Soon, after some attempts, I found that our poor PSRAM, err, SRAM doesn't care about much except the data setup time.
According to the board manufacturer's manual, the board has a 1MB external SRAM on board, an IS62WV51216B.
Unsurprisingly, this SRAM is a PSRAM, which is easy to confirm from its listings on Taobao.com.
By the way, we use HAL everywhere, so please enable HAL support in STM32CubeMX.
FSMC Configuration
Even though our PSRAM supports 18-bit addressing, in my test 1-bit addressing worked very well, so we can use 1-bit addressing throughout. Likewise, it did not matter whether the data bus was 8 or 16 bits wide. So we can configure our chip with 1-bit addressing and an 8-bit data bus without any trouble, and also save some IO pins, which are a precious resource. Here is the FSMC initialization code generated by STM32CubeMX.
// in main.c
static void MX_FSMC_Init(void)
{
  FSMC_NORSRAM_TimingTypeDef Timing;

  /** Perform the SRAM3 memory initialization sequence
  */
  hsram3.Instance = FSMC_NORSRAM_DEVICE;
  hsram3.Extended = FSMC_NORSRAM_EXTENDED_DEVICE;
  /* hsram3.Init */
  hsram3.Init.NSBank = FSMC_NORSRAM_BANK3; // my board has the PSRAM connected to bank 3
  hsram3.Init.DataAddressMux = FSMC_DATA_ADDRESS_MUX_DISABLE; // address and data lines are not multiplexed
  hsram3.Init.MemoryType = FSMC_MEMORY_TYPE_SRAM;
  hsram3.Init.MemoryDataWidth = FSMC_NORSRAM_MEM_BUS_WIDTH_8; // 8-bit data bus
  hsram3.Init.BurstAccessMode = FSMC_BURST_ACCESS_MODE_DISABLE; // the PSRAM doesn't care
  hsram3.Init.WaitSignalPolarity = FSMC_WAIT_SIGNAL_POLARITY_LOW;
  hsram3.Init.WrapMode = FSMC_WRAP_MODE_DISABLE;
  hsram3.Init.WaitSignalActive = FSMC_WAIT_TIMING_BEFORE_WS;
  hsram3.Init.WriteOperation = FSMC_WRITE_OPERATION_ENABLE; // of course, we want to write to the memory
  hsram3.Init.WaitSignal = FSMC_WAIT_SIGNAL_DISABLE; // no wait signal; let the FSMC manage the timing
  hsram3.Init.ExtendedMode = FSMC_EXTENDED_MODE_DISABLE; // keep default: same timings for read and write
  hsram3.Init.AsynchronousWait = FSMC_ASYNCHRONOUS_WAIT_DISABLE; // not used
  hsram3.Init.WriteBurst = FSMC_WRITE_BURST_DISABLE; // write burst is not supported
  /* Timing */
  Timing.AddressSetupTime = 0; // doesn't matter
  Timing.AddressHoldTime = 0; // doesn't matter
  Timing.DataSetupTime = 3; // NOTE: the lower, the better; 2 also worked, but 1 failed
  Timing.BusTurnAroundDuration = 0; // doesn't matter
  Timing.CLKDivision = 0; // doesn't matter
  Timing.DataLatency = 0; // doesn't matter
  Timing.AccessMode = FSMC_ACCESS_MODE_A;
  .....
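The access routines later in this article use a base address named FSMC_BANK1_3, which is never defined in the listings. As a minimal sketch, assuming the usual STM32F1 memory map (FSMC bank 1 starts at 0x60000000 and is split into four 64 MB sub-banks, one per NEx chip select), the base for NE3 can be derived as below. The helper name fsmc_bank1_addr is hypothetical and not part of HAL.

```c
#include <stdint.h>

/* Start of FSMC bank 1 (NOR/PSRAM/SRAM) in the STM32F1 memory map. */
#define FSMC_BANK1_BASE   0x60000000UL
/* Each of the four NEx sub-banks spans 64 MB. */
#define FSMC_SUBBANK_SIZE 0x04000000UL

/* Hypothetical helper: start address of sub-bank ne (1..4).
 * Returns 0 when ne is out of range. */
static uint32_t fsmc_bank1_addr(unsigned ne)
{
    if (ne < 1 || ne > 4) {
        return 0;
    }
    return (uint32_t)(FSMC_BANK1_BASE + (ne - 1) * FSMC_SUBBANK_SIZE);
}
```

Under these assumptions, fsmc_bank1_addr(3) gives 0x68000000, which is presumably the value behind FSMC_BANK1_3.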
IO Pin Configuration
CubeMX is a good nanny and does a great job. We don't have to worry about which pins the FSMC uses, but we can still take a look.
// in stm32f1xx_hal_msp.c
static void HAL_FSMC_MspInit(void){
  /* USER CODE BEGIN FSMC_MspInit 0 */
  /* USER CODE END FSMC_MspInit 0 */
  GPIO_InitTypeDef GPIO_InitStruct;
  if (FSMC_Initialized) {
    return;
  }
  FSMC_Initialized = 1;
  /* Peripheral clock enable */
  __HAL_RCC_FSMC_CLK_ENABLE();

  /** FSMC GPIO Configuration
  PF0   ------> FSMC_A0   // See, 1-bit addressing
  PE7   ------> FSMC_D4
  PE8   ------> FSMC_D5
  PE9   ------> FSMC_D6
  PE10  ------> FSMC_D7
  PD14  ------> FSMC_D0
  PD15  ------> FSMC_D1
  PD0   ------> FSMC_D2
  PD1   ------> FSMC_D3
  PD4   ------> FSMC_NOE
  PD5   ------> FSMC_NWE
  PG10  ------> FSMC_NE3
  */
  GPIO_InitStruct.Pin = GPIO_PIN_0;
  GPIO_InitStruct.Mode = GPIO_MODE_AF_PP;
  GPIO_InitStruct.Speed = GPIO_SPEED_FREQ_HIGH;
  HAL_GPIO_Init(GPIOF, &GPIO_InitStruct);

  GPIO_InitStruct.Pin = GPIO_PIN_7|GPIO_PIN_8|GPIO_PIN_9|GPIO_PIN_10;
  GPIO_InitStruct.Mode = GPIO_MODE_AF_PP;
  GPIO_InitStruct.Speed = GPIO_SPEED_FREQ_HIGH;
  HAL_GPIO_Init(GPIOE, &GPIO_InitStruct);

  GPIO_InitStruct.Pin = GPIO_PIN_14|GPIO_PIN_15|GPIO_PIN_0|GPIO_PIN_1
                          |GPIO_PIN_4|GPIO_PIN_5;
  GPIO_InitStruct.Mode = GPIO_MODE_AF_PP;
  GPIO_InitStruct.Speed = GPIO_SPEED_FREQ_HIGH;
  HAL_GPIO_Init(GPIOD, &GPIO_InitStruct);

  GPIO_InitStruct.Pin = GPIO_PIN_10;
  GPIO_InitStruct.Mode = GPIO_MODE_AF_PP;
  GPIO_InitStruct.Speed = GPIO_SPEED_FREQ_HIGH;
  HAL_GPIO_Init(GPIOG, &GPIO_InitStruct);
  /* USER CODE BEGIN FSMC_MspInit 1 */
  /* USER CODE END FSMC_MspInit 1 */
}
With 12 pins, we can fully control our PSRAM.
Memory Access Code
CubeMX is so sweet that she even prepares HAL-based SRAM read/write code for us. If we look carefully, we can find the DMA routines too. However, we won't discuss DMA or interrupts here. We will test our SRAM with plain CPU accesses. The following are typical memory read/write routines.
// Write SRAM byte by byte, the normal way
void sram_write(unsigned char* pbuf, unsigned long addr, size_t size) {
    while (size--) {
        *(__IO unsigned char *)(FSMC_BANK1_3 + addr) = *pbuf;
        addr++;
        pbuf++;
    }
}
// Read byte by byte
void sram_read(unsigned char* pbuf, unsigned long addr, size_t size) {
    while (size--) {
        *pbuf = *(__IO unsigned char *)(FSMC_BANK1_3 + addr);
        pbuf++;
        addr++;
    }
}
// Faster, word by word
// NOTE: size counts 16-bit words, not bytes; a trailing odd byte is not handled.
void sram_write_word(unsigned short* pbuf, unsigned long addr, size_t size) {
    while (size--) {
        *(__IO unsigned short *)(FSMC_BANK1_3 + addr) = *pbuf;
        addr += sizeof(unsigned short); // advance by one word, not one byte
        pbuf++;
    }
}
// Read word by word
void sram_read_word(unsigned short* pbuf, unsigned long addr, size_t size) {
    while (size--) {
        *pbuf = *(__IO unsigned short *)(FSMC_BANK1_3 + addr);
        pbuf++;
        addr += sizeof(unsigned short);
    }
}
// One step further, try double word; size counts 32-bit words
void sram_write_dword(unsigned int* pbuf, unsigned long addr, size_t size) {
    while (size--) {
        *(__IO unsigned int *)(FSMC_BANK1_3 + addr) = *pbuf;
        addr += sizeof(unsigned int); // advance by one double word
        pbuf++;
    }
}
void sram_read_dword(unsigned int* pbuf, unsigned long addr, size_t size) {
    while (size--) {
        *pbuf = *(__IO unsigned int *)(FSMC_BANK1_3 + addr);
        pbuf++;
        addr += sizeof(unsigned int);
    }
}
// NOTE: the following code uses two tricks:
// 1. Fewer loop iterations (32-bit stores instead of byte stores)
// 2. Loop unrolling
// Both assume that pbuf and addr are 4-byte aligned.
// Fast write, 8 bytes per iteration
void sram_fast_write8(unsigned char* pbuf, unsigned int addr, size_t size) {
    const size_t align = 2 * sizeof(unsigned int);
    if (size <= align) {
        sram_write(pbuf, addr, size);
        return;
    }
    size_t remains = size & 7;
    size_t count = (size - remains) / sizeof(unsigned int);
    unsigned int* psrc = (unsigned int *)pbuf;
    __IO unsigned int* pdst = (__IO unsigned int *)(FSMC_BANK1_3 + addr);
    while (count) {
        // Write 2 ints each time
        *pdst++ = *psrc++; count--;
        *pdst++ = *psrc++; count--;
    }
    if (remains) {
        // Write the tail byte by byte, starting where the loop stopped
        sram_write((unsigned char *)psrc, addr + (size - remains), remains);
    }
}
// Fast write, 16 bytes per iteration
void sram_fast_write16(unsigned char* pbuf, unsigned int addr, size_t size) {
    const size_t align = 4 * sizeof(unsigned int);
    if (size <= align) {
        sram_write(pbuf, addr, size);
        return;
    }
    size_t remains = size & 15;
    size_t count = (size - remains) / sizeof(unsigned int);
    unsigned int* psrc = (unsigned int *)pbuf;
    __IO unsigned int* pdst = (__IO unsigned int *)(FSMC_BANK1_3 + addr);
    while (count) {
        // Write 4 ints each time
        *pdst++ = *psrc++; count--;
        *pdst++ = *psrc++; count--;
        *pdst++ = *psrc++; count--;
        *pdst++ = *psrc++; count--;
    }
    if (remains) {
        sram_write((unsigned char *)psrc, addr + (size - remains), remains);
    }
}
// Fast write, 32 bytes per iteration
void sram_fast_write32(unsigned char* pbuf, unsigned int addr, size_t size) {
    const size_t align = 8 * sizeof(unsigned int);
    if (size <= align) {
        sram_write(pbuf, addr, size);
        return;
    }
    size_t remains = size & 31;
    size_t count = (size - remains) / sizeof(unsigned int);
    unsigned int* psrc = (unsigned int *)pbuf;
    __IO unsigned int* pdst = (__IO unsigned int *)(FSMC_BANK1_3 + addr);
    while (count) {
        // Write 8 ints each time
        *pdst++ = *psrc++; count--;
        *pdst++ = *psrc++; count--;
        *pdst++ = *psrc++; count--;
        *pdst++ = *psrc++; count--;
        *pdst++ = *psrc++; count--;
        *pdst++ = *psrc++; count--;
        *pdst++ = *psrc++; count--;
        *pdst++ = *psrc++; count--;
    }
    if (remains) {
        sram_write((unsigned char *)psrc, addr + (size - remains), remains);
    }
}
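The unrolling idea can also be tried on a PC, without any FSMC hardware. The following is a minimal sketch, not the routine used for the measurements: it takes the destination as a plain pointer so it can be exercised against an ordinary array, and the name copy_unrolled8 is made up for illustration. Like the routines above, it assumes both pointers are 4-byte aligned.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy size bytes from src to dst, 8 bytes (two 32-bit stores) per loop
 * iteration. Assumes both pointers are 4-byte aligned. */
static void copy_unrolled8(volatile uint32_t *dst, const uint32_t *src, size_t size)
{
    size_t tail = size & 7u;            /* bytes left after the 8-byte blocks */
    size_t words = (size - tail) / 4u;  /* always an even number of words */

    while (words) {
        *dst++ = *src++;
        *dst++ = *src++;
        words -= 2;
    }

    /* Finish the remaining 0..7 bytes one at a time. */
    volatile uint8_t *d8 = (volatile uint8_t *)dst;
    const uint8_t *s8 = (const uint8_t *)src;
    while (tail--) {
        *d8++ = *s8++;
    }
}
```

On the target, dst would point into the FSMC window instead of an array. The tail loop continues from exactly where the unrolled loop stopped, which is the part that is easy to get wrong.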
Testing Code
Now for the testing code. First, the constants and buffers:
...
const unsigned int mem_size = 1 * 1024 * 1024; // We have 1MB memory
const unsigned int buf_size = 4 * 1024; // read/write buffer size: 4KB, the unit of each W/R test
const unsigned int test_loop = 16; // Run each kind of test 16 times
unsigned char pbuf[buf_size]; // Source buffer for writes (reused as the target of the read tests)
unsigned char pres[buf_size]; // Read-back buffer for validation
...
The main loop looks like this:
  /* USER CODE BEGIN WHILE */
  while (1)
  {
  /* USER CODE END WHILE */
  /* USER CODE BEGIN 3 */
    LOG("-------------------- Begin test -------------------------\r\n");
    {
      LOG("Validating....");
      memset(pbuf, 0xAB, sizeof(pbuf));
      sram_write(pbuf, 0, sizeof(pbuf));
      memset(pres, 0, sizeof(pres));
      sram_read(pres, 0, sizeof(pres));
      // Compare the full buffer
      if (0 != memcmp(pbuf, pres, sizeof(pbuf))) {
        LOG("Failed\r\n");
        HAL_Delay(1000);
        continue;
      } else {
        LOG("Success\r\n");
      }
    }
    /////////////////////////////////////////////////////
    {
      LOG("Built in:");
      unsigned int ticks = HAL_GetTick();
      for(unsigned int n = 0; n < test_loop; n++) {
        for(unsigned int addr = 0; addr < mem_size; addr += buf_size) {
          memcpy(pbuf, pres, buf_size);
        }
      }
      ticks = HAL_GetTick() - ticks;
      LOG("\tWt: %lu\tWs: %lu KB/t", ticks / test_loop, mem_size / (ticks * 1024 / test_loop));
      LOG("\r\n");
    }
    /////////////////////////////////////////////////////
    {
      LOG("Byte:");
      unsigned int ticks = HAL_GetTick();
      for(unsigned int n = 0; n < test_loop; n++) {
        for(unsigned int addr = 0; addr < mem_size; addr += buf_size) {
          sram_write(pbuf, addr, sizeof(pbuf));
        }
      }
      ticks = HAL_GetTick() - ticks;
      LOG("\tWt: %lu\tWs: %lu KB/t", ticks / test_loop, mem_size / (ticks * 1024 / test_loop));
      ticks = HAL_GetTick();
      for(unsigned int n = 0; n < test_loop; n++) {
        for(unsigned int addr = 0; addr < mem_size; addr += buf_size) {
          sram_read(pbuf, addr, sizeof(pbuf));
        }
      }
      ticks = HAL_GetTick() - ticks;
      LOG("\tRt: %lu\tRs: %lu KB/t\r\n", ticks / test_loop, mem_size / (ticks * 1024 / test_loop));
      // HAL_Delay(1000);
    }
    ///////////////////////////////
    {
      LOG("Word:");
      unsigned int ticks = HAL_GetTick();
      for(unsigned int n = 0; n < test_loop; n++) {
        for(unsigned int addr = 0; addr < mem_size; addr += buf_size) {
          sram_write_word((unsigned short *)pbuf, addr, sizeof(pbuf) / sizeof(unsigned short));
        }
      }
      ticks = HAL_GetTick() - ticks;
      LOG("\tWt: %lu\tWs: %lu KB/t", ticks / test_loop, mem_size / (ticks * 1024 / test_loop));
      ticks = HAL_GetTick();
      for(unsigned int n = 0; n < test_loop; n++) {
        for(unsigned int addr = 0; addr < mem_size; addr += buf_size) {
          sram_read_word((unsigned short *)pbuf, addr, sizeof(pbuf) / sizeof(unsigned short));
        }
      }
      ticks = HAL_GetTick() - ticks;
      LOG("\tRt: %lu\tRs: %lu KB/t\r\n", ticks / test_loop, mem_size / (ticks * 1024 / test_loop));
      // HAL_Delay(1000);
    }
    ///////////////////////////////
    {
      LOG("Dword:");
      unsigned int ticks = HAL_GetTick();
      for(unsigned int n = 0; n < test_loop; n++) {
        for(unsigned int addr = 0; addr < mem_size; addr += buf_size) {
          sram_write_dword((unsigned int *)pbuf, addr, sizeof(pbuf) / sizeof(unsigned int));
        }
      }
      ticks = HAL_GetTick() - ticks;
      LOG("\tWt: %lu\tWs: %lu KB/t", ticks / test_loop, mem_size / (ticks * 1024 / test_loop));
      ticks = HAL_GetTick();
      for(unsigned int n = 0; n < test_loop; n++) {
        for(unsigned int addr = 0; addr < mem_size; addr += buf_size) {
          sram_read_dword((unsigned int *)pbuf, addr, sizeof(pbuf) / sizeof(unsigned int));
        }
      }
      ticks = HAL_GetTick() - ticks;
      LOG("\tRt: %lu\tRs: %lu KB/t\r\n", ticks / test_loop, mem_size / (ticks * 1024 / test_loop));
      // HAL_Delay(1000);
    }
    //////////////////////////////////////////
    {
      LOG("Fast(8B):");
      unsigned int ticks = HAL_GetTick();
      for(unsigned int n = 0; n < test_loop; n++) {
        for(unsigned int addr = 0; addr < mem_size; addr += buf_size) {
          sram_fast_write8(pbuf, addr, sizeof(pbuf));
        }
      }
      ticks = HAL_GetTick() - ticks;
      LOG("\tWt: %lu\tWs: %lu KB/t", ticks / test_loop, mem_size / (ticks * 1024 / test_loop));
      LOG("\r\n");
    }
    //////////////////////////////////////////
    {
      LOG("Fast(16B):");
      unsigned int ticks = HAL_GetTick();
      for(unsigned int n = 0; n < test_loop; n++) {
        for(unsigned int addr = 0; addr < mem_size; addr += buf_size) {
          sram_fast_write16(pbuf, addr, sizeof(pbuf));
        }
      }
      ticks = HAL_GetTick() - ticks;
      LOG("\tWt: %lu\tWs: %lu KB/t", ticks / test_loop, mem_size / (ticks * 1024 / test_loop));
      LOG("\r\n");
    }
    //////////////////////////////////////////
    {
      LOG("Fast(32B):");
      unsigned int ticks = HAL_GetTick();
      for(unsigned int n = 0; n < test_loop; n++) {
        for(unsigned int addr = 0; addr < mem_size; addr += buf_size) {
          sram_fast_write32(pbuf, addr, sizeof(pbuf));
        }
      }
      ticks = HAL_GetTick() - ticks;
      LOG("\tWt: %lu\tWs: %lu KB/t", ticks / test_loop, mem_size / (ticks * 1024 / test_loop));
      LOG("\r\n");
    }
    HAL_Delay(1000); // cooling down
  } // while
  /* USER CODE END 3 */
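The validation step fills the buffer with a constant 0xAB. A constant fill reads back correctly even if several addresses land on the same memory cell, so it cannot reveal addressing problems. A stricter check writes a value derived from each byte's offset. The following is a minimal sketch of that idea; pattern_fill and pattern_check are hypothetical helpers, not part of HAL or of the test above.

```c
#include <stddef.h>
#include <stdint.h>

/* Value expected at a given offset: folds offset bits 0..23 into one byte,
 * so two offsets that differ in a single address bit get different values. */
static uint8_t pattern_at(size_t i)
{
    return (uint8_t)(i ^ (i >> 8) ^ (i >> 16));
}

/* Fill len bytes with the offset-derived pattern. */
static void pattern_fill(volatile uint8_t *mem, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        mem[i] = pattern_at(i);
    }
}

/* Return the offset of the first mismatch, or len if every byte matches. */
static size_t pattern_check(const volatile uint8_t *mem, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (mem[i] != pattern_at(i)) {
            return i;
        }
    }
    return len;
}
```

On the board, mem would be the start of the external memory window (assumed here to be the bank 3 base) and len the full 1MB. The whole region has to be filled before any of it is checked, otherwise aliased writes go unnoticed.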
Final Result
Running the test in debug mode, I got these results:
With Timing.DataSetupTime = 3;
-------------------- Begin test -------------------------
Validating....Success
Built in: Wt: 234 Ws: 4 KB/t
Byte: Wt: 204 Ws: 4 KB/t Rt: 409 Rs: 2 KB/t
Word: Wt: 175 Ws: 5 KB/t Rt: 274 Rs: 3 KB/t
Dword: Wt: 145 Ws: 7 KB/t Rt: 202 Rs: 5 KB/t
Fast(8B): Wt: 95 Ws: 10 KB/t
Fast(16B): Wt: 95 Ws: 10 KB/t
Fast(32B): Wt: 95 Ws: 10 KB/t
Surprisingly, reading is much slower than writing: read speed is only about 50-70% of write speed. Another interesting thing is that memcpy within the built-in RAM, which I believe is on-chip, is only about as fast as the byte-by-byte write to external memory. Finally, the fast method really is fast, but it has its limit: unrolling beyond 8 bytes per loop brings no further gain.
With Timing.DataSetupTime = 2;
-------------------- Begin test -------------------------
Validating....Success
Built in: Wt: 234 Ws: 4 KB/t
Byte: Wt: 204 Ws: 4 KB/t Rt: 395 Rs: 2 KB/t
Word: Wt: 164 Ws: 6 KB/t Rt: 259 Rs: 3 KB/t
Dword: Wt: 134 Ws: 7 KB/t Rt: 187 Rs: 5 KB/t
Fast(8B): Wt: 80 Ws: 12 KB/t
Fast(16B): Wt: 80 Ws: 12 KB/t
Fast(32B): Wt: 80 Ws: 12 KB/t
Note that with the latter configuration, the fast write gains 2 KB per tick.
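The KB/t figures are integer divisions and lose a lot of precision. Assuming the HAL tick runs at its default 1 ms period, and that Wt is the number of ticks for one pass over the 1MB region as the test code suggests, the same measurements can be expressed in KB/s. A minimal sketch with a hypothetical helper name:

```c
#include <stdint.h>

/* Throughput in KB/s for bytes moved in elapsed_ms milliseconds.
 * Returns 0 when elapsed_ms is 0 to avoid dividing by zero. */
static uint32_t throughput_kb_per_s(uint32_t bytes, uint32_t elapsed_ms)
{
    if (elapsed_ms == 0) {
        return 0;
    }
    /* 64-bit intermediate so that bytes * 1000 cannot overflow. */
    return (uint32_t)(((uint64_t)bytes * 1000u) / ((uint64_t)elapsed_ms * 1024u));
}
```

Under those assumptions, 1MB in 95 ticks is 10778 KB/s (about 10.5 MB/s), and 1MB in 80 ticks is 12800 KB/s (12.5 MB/s).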
Conclusion
The external "P"SRAM is fast enough. In most cases we can use it as an additional memory resource. Some applications, such as driving a color LCD, can use the external PSRAM as a double buffer to avoid lag. Furthermore, we could probably run programs from external PSRAM and have more fun.
Good luck!