The definitive MIDI controller | This is not rocket science

ATmega2560 as an SPI slave

SPI is an inter-IC bus, typically used to connect MCUs to peripherals. The usual setup consists of four synchronously clocked wires, with the device that drives the clock called “master” and the responding device called “slave”. The protocol is extensively documented.

The SPI hardware implementation on the AVR is rather curious. The way the datasheet puts it, the interface is single buffered in transmit direction and double buffered in receive direction. By “double buffered” they mean that the byte that was just completely received is immediately moved to a second register, and is available to be read while the next byte is already streaming in. The program interested in the data may then retrieve the incoming byte whenever it suits it, as long as it does it before the next byte transfer has finished.

(The bus captures were taken with one of these USBee logic analysers from DX.)

input_ready

But in the transmit direction the controller is not double buffered. Writes to the transmit register are immediately reflected in the state of the output, instead of being stored for later use. If the program writes into the transmit register while a byte transfer is ongoing, the bits that actually go out on the wire come partly from the first register value and partly from the updated one. The value in the register therefore has to stay constant (and correct) for the whole transmission of that byte.

If the device is to transmit a particular stream of bytes, the program code running on the processor must update the register value between the individual bytes on the bus. And in case of a slave device, that means after the previous byte has been finished, before the master starts clocking in/out the next byte. Another interesting detail is that in the slave device, the received bytes are used as default data to output, at least on the ATmega2560. If the program does not update the data register SPDR, it will still contain the previously received byte when the transfer of the next byte begins, and the AVR will simply start transmitting that byte.
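The echo behaviour is easy to see in a toy model. The following is a minimal host-side sketch, not AVR code: `spi_slave_model` and `spi_transfer` are names made up for this illustration, and the model assumes MSB-first transfers with a single shift register that both the master's bits and the program's SPDR writes end up in.

```c
#include <stdint.h>

/* Hypothetical model of the slave side: one shift register shared by
 * transmit and receive, plus the buffer that a read of SPDR returns. */
typedef struct {
    uint8_t shift; /* what a write to SPDR loads, and what gets shifted out */
    uint8_t rx;    /* last completely received byte */
} spi_slave_model;

/* Clock one byte over the bus, MSB first. Returns the byte the master
 * reads back from the slave. */
uint8_t spi_transfer(spi_slave_model *s, uint8_t master_out)
{
    uint8_t master_in = 0;
    for (int bit = 0; bit < 8; bit++) {
        uint8_t miso = (uint8_t)((s->shift >> 7) & 1);    /* slave drives its MSB */
        uint8_t mosi = (uint8_t)((master_out >> 7) & 1);  /* master drives its MSB */
        s->shift = (uint8_t)((s->shift << 1) | mosi);     /* master's bit shifts in */
        master_out = (uint8_t)(master_out << 1);
        master_in = (uint8_t)((master_in << 1) | miso);
    }
    s->rx = s->shift; /* the finished byte becomes readable */
    return master_in;
}
```

If the program never touches `shift` between two calls, the second call returns the byte that the master sent in the first one, which is exactly the echo described above.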

Luckily, master devices often leave a short pause between individual bytes, which gives the slave some time to set the data register value. This is also true of the Raspberry Pi.

output_ready

In a bus master device this controller setup is convenient, as the bus can be clocked (by the master itself) whenever data is ready to be sent out, and double buffering for received data is not strictly necessary. The program can always read the received byte before initiating a new transfer. But when the AVR is used as a slave, the transmit register update timing becomes a bottleneck; the master device cannot know how much time the slave needs to prepare the next data byte. But how bad can that be?

Taking the RPC as an example, the Raspberry Pi is the SPI bus master, and sends bursts of bytes to the Arduino. The bursts contain data that the ATmega MCU on the Arduino buffers and then feeds to the MIDI devices connected to its serial bus UART outputs (contrasted with the SPI bus, at a much slower bit rate). The ATmega also receives bytes from the serial MIDI inputs, processes the incoming MIDI commands, and possibly queues them for transmission to the Raspberry Pi for further processing. Below is a diagram of the setup.

rpc_queues

The whole idea of this structure is to allow the Arduino to run its other tasks while the Raspberry Pi is not talking to it. The sparser and shorter the SPI transmission bursts get, the more time the Arduino has for doing work between them. Ideally you’d then transfer data at the highest clock frequency you possibly can, so that the interruption is over as quickly as possible.

databurst

How does the ATmega program then deal with the SPI hardware? The ATmega datasheet only gives this short piece of code as a clue:

char SPI_SlaveReceive(void)
{
  /* Wait for reception complete */
  while(!(SPSR & (1 << SPIF)))
    ;
  /* Return Data Register */
  return SPDR;
}

(Yes, the Arduino libraries provide much nicer wrappers for this, but I’m going to ignore them here.)

Interestingly enough, the AVR instruction set has the instructions SBIS/SBIC that are meant as a quick way of testing bits in I/O registers, but those can’t be used with SPI on the ATmega2560 in particular. SBIS/SBIC only reach I/O addresses 0x00 to 0x1F, and SPSR sits at 0x2D, too high in the I/O space to be reachable! Instead of

Wait_Transmit:
  sbis    SPSR, SPIF
  rjmp    Wait_Transmit

something more like this is needed, increasing the number of cycles by one per iteration:

Wait_Transmit:
  in      r0, SPSR
  sbrs    r0, SPIF
  rjmp    Wait_Transmit

More instructions in the loop of course means longer latency before the program gets to react to the flag. The worst case latency with this loop is 7 clocks (if I got it right), counted from the cycle in which SPIF gets set, if the execution goes like this:

  in      r0, SPSR      ; 1) SPIF gets set during this cycle, too late to be read
  sbrs    r0, SPIF      ; 1) SPIF is set now, but r0 still holds the old value
  rjmp    Wait_Transmit ; 2) jump
Wait_Transmit:
  in      r0, SPSR      ; 1) now we read the new value with SPIF set
  sbrs    r0, SPIF      ; 2) bit is set in r0, so the rjmp is skipped
  rjmp    Wait_Transmit ; 0) skipped
                        ; ----
                        ; 7 clocks
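To double-check the count, here is a small host-side simulation of that loop. It is only a sketch under my own assumptions: the function names are made up, `sbrs` is taken to cost 1 cycle when it does not skip and 2 cycles when it skips the one-word `rjmp`, and the `in` instruction is assumed to see SPIF only if the flag was set in an earlier cycle.

```c
/* Cycle costs of the instructions in the polling loop. */
enum {
    CYC_IN          = 1, /* in   r0, SPSR */
    CYC_SBRS_NOSKIP = 1, /* sbrs with the bit clear */
    CYC_SBRS_SKIP   = 2, /* sbrs with the bit set, skipping the rjmp */
    CYC_RJMP        = 2  /* rjmp Wait_Transmit */
};

/* Cycles from the start of the cycle in which SPIF gets set (set_cycle,
 * counted from the first `in`) to the first instruction after the loop. */
int poll_latency(int set_cycle)
{
    int now = 0; /* cycle in which the next `in` executes */
    for (;;) {
        int seen = set_cycle < now; /* was the flag set before this `in`? */
        now += CYC_IN;
        if (seen)
            return now + CYC_SBRS_SKIP - set_cycle;
        now += CYC_SBRS_NOSKIP + CYC_RJMP;
    }
}

/* Try every position of the flag within one loop period. */
int worst_case_poll_latency(void)
{
    int period = CYC_IN + CYC_SBRS_NOSKIP + CYC_RJMP;
    int worst = 0;
    for (int c = 0; c < period; c++) {
        int latency = poll_latency(c);
        if (latency > worst)
            worst = latency;
    }
    return worst;
}
```

With these assumptions the latency varies between 4 and 7 cycles depending on where in the loop the flag happens to get set.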

Disregarding the loop latency, this approach is surely fine if you know that a byte is incoming, and if you have all the time in the world to spend waiting for it. It won’t work in the RPC: instead of idly waiting in the while loop, the RPC has actual work to do. Also, since the Arduino is the SPI slave, it must be available exactly when the master wants to talk to it (i.e. it had better get into that while loop at the right time, or some of the received data could be lost). Worse still, the SPI transmit buffer register has to be set properly on time, or the master will get garbage back. This calls for some preemption.

The AVR SPI controller can be configured to interrupt the CPU the moment a byte transfer is finished, as explained in the AVR151 application note and in much more detail at least on the rocketnumbernine blog. The fun starts when you implement this.

Most examples that I could find to describe the interrupt handlers suggested that the received byte(s) could be processed immediately in the interrupt handler, and nearly all of them received bytes one at a time, returning from the interrupt handler in between bytes. The closest to a proper analysis of a high performance SPI handler I could find was on matuschek.net, and I found another interesting writeup in Avian’s blog, but both were only dealing with a master device, and both suggested a loop to process the data. I got curious about what was really causing the slowness of the interrupt handlers; also I didn’t see a way around having the SPI transmission get initiated by an interrupt in the RPC.

interrupt_latency2

The problems appeared when going above 200 kHz bus clock speeds, and only got worse the higher the transfer rate got. The picture above shows a common issue: a missing bit. The cursor shows where the ‘1’ bit should have been. Sometimes the first bits (MSB) would be incorrectly high, sometimes low, depending on the last received byte.

interrupt_latency

I used an extra pin as an output to find the exact timing of the interrupt handler and the time it took to finish, by pulling the pin up while the handler was active. It turned out that exactly at those moments where the output was corrupted, the interrupt handler started to run too late. But it would be late much more often: in most cases, as in the picture above, the activity debug pin would show that the handler is late even though there is no data to be sent.

The first thing the handler did was to pull the ACT pin high. And already then it was late to react to the SPI byte burst; setting SPDR would no longer help, it had simply missed the first bit! But I did see the interrupt get triggered correctly also, as most of the time it would. Something else must have blocked it. Indeed I had also other interrupts in use, two for each UART, eight in total. So I rewrote those UART interrupt handlers, and that helped somewhat, but still the SPI interrupt would be late every now and then at higher bus clock rates. It also takes some time to set up and read the send queue before the first byte can be transmitted, on top of the interrupt latency.

To make the story short, I found that a very simple solution was to process all of the incoming bytes in the interrupt handler, in an unrolled loop, waiting for the SPIF just like in the polling examples, and to quit the loop and the handler if the SS line went high. Something like this:

ISR (SPI_STC_vect)
{
  // first byte has already been sent
  SPDR = 0; // second byte to transmit
  spibuf_receive.insert(SPDR); // keep received byte
  
  bool didtransmit = false;
  while (1)
  {
    // Wait for transfer to finish
    while (!(SPSR & (1 << SPIF)))
    {
      if (PINB & 1)
      {
        // SS is high
        SPDR = 0;
        return;
      }
    }

    if (didtransmit)
    {
      spibuf_send.popbyte(); // remove sent byte from queue
    }

    // Set output, get input
    SPDR = spibuf_send.peekbyte(); // read but don't pop frontmost byte
    spibuf_receive.insert(SPDR);
    didtransmit = true;
  }
}

The first two bytes were hard to exploit, so I just had them set to zero. The SPI interrupt handler would run with interrupts disabled, and as long as the SPIF-high-detection was fast enough, all the data bytes starting with the third were stable.

Due to the interrupt latency, the second byte was not reliable to use: the handler might miss one or two MSBs.

The first byte was also difficult, and could only be reliably transmitted if SPDR was set early enough. Since the SPI slave would simply start sending whatever was in SPDR as soon as the master began clocking the bus, it looked like the register would have to be set before the transmission begins. The register could indeed be set at the end of the last interrupt with the value of the next byte from the transmit queue. Or if the queue was empty, it could be set to zero, and updated whenever a byte was added to the queue from the main program… bah, it got really difficult beyond that, with atomic updates and whatnot. It was much easier to just leave the first byte zero as well, by clearing SPDR at the end of each SPI interrupt.

But that wasn’t enough. The compiled code was too slow. It would do unnecessary things and in the wrong order. The compiler always wanted to push all the clobbered registers on the stack before being able to set SPDR initially. It would use unnecessarily many registers, resulting in more push and pop operations than what seemed necessary. It didn’t know how to make use of the fact that my queues were aligned in memory to generate faster code.

So I wrote it in assembly. And it worked fine. But it wasn’t nice. And I found another trick that should have been completely obvious. I could hook the SS line itself: in addition to its role as SS, it also worked fine as PCINT0!

pcint0

The Raspberry Pi leaves a generous amount of time between SS-low and the beginning of the first byte. I could use that time to both set the SPDR to some useful initial value and to prepare for sending the next byte. The next critical moment was only when the first byte had been completely received (and transmitted), and SPDR had to be reloaded. With the new approach there was plenty of time to initialize all the registers and fetch the first data byte.

For the first byte I chose to send out the total number of bytes in the queue, to help schedule further requests on the Raspberry side; this takes 8 AVR instructions, some 10 clocks in total plus interrupt latency. Then the handler pushes the rest of the needed registers, gets the first data byte to send out from the queue, and waits for SPIF. On SPIF it immediately (almost…) outputs the next byte and spends some time updating queue indices and storing the received byte, before repeating the loop to wait for the next SPIF. At the 2 MHz bus speed that I aimed for, it would still sometimes take too long for the SPIF to be detected, indicating that the SPIF/SS loop was still too slow. So I unrolled it asymmetrically, with SS (LSB of PINB) checked less often than SPIF.
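The queue arithmetic behind that first byte can be sketched in C. This is a host-side model with names of my own choosing (`byte_fifo` and friends), assuming a 256-byte buffer indexed by 8-bit read and write positions that wrap around on their own; the full-queue check in `fifo_push` is an addition for the sketch.

```c
#include <stdint.h>

/* Hypothetical model of a 256-byte FIFO with 8-bit indices. */
typedef struct {
    uint8_t write_pos; /* advanced by the producer */
    uint8_t read_pos;  /* advanced by the consumer */
    uint8_t data[256]; /* uint8_t indices wrap at 256 without masking */
} byte_fifo;

/* Number of queued bytes; the unsigned subtraction wraps correctly
 * even when write_pos has already wrapped past read_pos. */
uint8_t fifo_count(const byte_fifo *f)
{
    return (uint8_t)(f->write_pos - f->read_pos);
}

/* Append a byte; returns 0 if the queue is full (one slot stays unused). */
int fifo_push(byte_fifo *f, uint8_t b)
{
    if ((uint8_t)(f->write_pos + 1) == f->read_pos)
        return 0;
    f->data[f->write_pos++] = b;
    return 1;
}

/* Read the frontmost byte without removing it; zero if the queue is empty. */
uint8_t fifo_peek(const byte_fifo *f)
{
    return fifo_count(f) ? f->data[f->read_pos] : 0;
}

/* Remove the frontmost byte once it is known to have been sent. */
void fifo_pop(byte_fifo *f)
{
    if (fifo_count(f))
        f->read_pos++;
}
```

On the AVR, placing the buffer on a 256-byte boundary additionally lets the index serve directly as the low byte of the address, so a queue access needs no address arithmetic at all.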

Here’s what the handler looks like now… The unrolling is perhaps slightly excessive.

ISR (PCINT0_vect, ISR_NAKED)
{
	__asm__(
//		"sbi	0x11, 1		$" // set ACT pin

		// prepare work registers
		"push	r24		$"
		"push	r0 		$"
		
		// store SREG
		"in	r24, 0x3f	$"
		"push	r24		$"
		
		// no other registers used yet
		
		"lds	r24, spibuf_send+1	$" // output buffer read pos
		"lds	r0, spibuf_send		$" // output buffer write pos
		
		// 1st byte to transmit is the number of bytes in queue
		"sub	r0, r24		$"
		"out	0x2e, r0	$"
		
		// now there's plenty of time to do the rest of the preparation
		
		// was #SS low at all?
		"in     r0, 0x03	$" // PINB
		"sbrc	r0, 0		$" // (PINB & 1) = #SS
		"rjmp	5f		$" // if (PINB & 1) = #SS is high, quit
		
		// push the rest of the registers
		"push	r22		$"
		"push	r23		$"
		"push	r25		$"
		"push	r26		$"
		"push	r27		$"
		"push	r28		$"
		"push	r29		$"
		"push	r30		$"

		// retrieve fifo pointers
		"lds	r26, spibuf_send+1	$" // output buffer read pos
		"lds	r30, spibuf_send	$" // output buffer write pos
		"ldi	r27, hi8(spisend_b)	$" // load output buffer address
		"lds	r28, spibuf_receive	$" // input buffer write pos
		"ldi	r29, hi8(spireceive_b)	$" // load input buffer address

		// there is no input yet, skip first part of loop
		"jmp	6f			$"

		// main receive loop
		"4:				$"
		"in	r24, 0x2e		$" // get Nth byte in

		// update output buffer read pos: one byte has been consumed
		// r26: buffer read pos after the transmit that is about to start
		// r23: buffer read pos after the transmit that just finished
		"sts	spibuf_send+1, r23	$"

		// keep input byte
		"st	Y, r24			$" // store input byte
		"inc	r28			$" // increment write pos
		"sts	spibuf_receive, r28	$" // input buffer write pos

		"6:"
		// if the xmit finishes, the output buffer read pos must be
		// updated; the xmit means that the previous transmitted
		// byte was transferred correctly, not the one we are preparing
		// here, and therefore the previous value of r26 is kept safe
		// in r23.
		"mov	r23, r26	$" // output buffer read pos
		
		// retrieve next byte to be transmitted
		"clr	r25		$" // zero if no data in queue
		"cp	r30, r26	$" // got data in send queue?
		"breq	1f		$" // jump if not
		"ld	r25, X		$" // read byte to output
		"inc	r26		$" // increment index
		"1:			$"

//		"cbi	0x11, 1		$" // clear ACT pin

		// wait for next byte (SPIF) or end of transmission
		// have to react quickly to SPIF, so interleave
		// #SS checks with multiple SPIF checks
		"1:			$"
	
		"in	r24, 0x2d	$" // SPSR
		"sbrc	r24, 7		$" // (SPSR & SPIF)
		"rjmp	2f		$" // if SPIF is set, goto 2
		
		"in	r24, 0x2d	$" // SPSR
		"sbrc	r24, 7		$" // (SPSR & SPIF)
		"rjmp	2f		$" // if SPIF is set, goto 2
		
		"in	r24, 0x2d	$" // SPSR
		"sbrc	r24, 7		$" // (SPSR & SPIF)
		"rjmp	2f		$" // if SPIF is set, goto 2
		
		"in	r24, 0x2d	$" // SPSR
		"sbrc	r24, 7		$" // (SPSR & SPIF)
		"rjmp	2f		$" // if SPIF is set, goto 2
		
		"in	r24, 0x2d	$" // SPSR
		"sbrc	r24, 7		$" // (SPSR & SPIF)
		"rjmp	2f		$" // if SPIF is set, goto 2
		
		"in	r24, 0x2d	$" // SPSR
		"sbrc	r24, 7		$" // (SPSR & SPIF)
		"rjmp	2f		$" // if SPIF is set, goto 2
		
		"in	r24, 0x2d	$" // SPSR
		"sbrc	r24, 7		$" // (SPSR & SPIF)
		"rjmp	2f		$" // if SPIF is set, goto 2
		
		"in	r24, 0x2d	$" // SPSR
		"sbrc	r24, 7		$" // (SPSR & SPIF)
		"rjmp	2f		$" // if SPIF is set, goto 2

		"in	r24, 0x2d	$" // SPSR
		"sbrc	r24, 7		$" // (SPSR & SPIF)
		"rjmp	2f		$" // if SPIF is set, goto 2
		
		"in     r0, 0x03	$" // PINB
		
		"in	r24, 0x2d	$" // SPSR
		"sbrc	r24, 7		$" // (SPSR & SPIF)
		"rjmp	2f		$" // if SPIF is set, goto 2

		"sbrc	r0, 0		$" // (PINB & 1) = #SS
		"rjmp	3f		$" // if #SS high, goto 3

		"in	r24, 0x2d	$" // SPSR
		"sbrs	r24, 7		$" // (SPSR & SPIF)
		"rjmp	1b		$" // if SPIF is cleared, goto 1
		"2:			$"

		// previous xmit is done, do the next one
		"out	0x2e, r25	$" // set Nth byte out
//		"sbi	0x11, 1		$" // set ACT pin
		"jmp	4b		$" // most critical part is done, loop

		// arrive here if SS high
		"3:			$"
//		"sbi	0x11, 1		$" // set ACT pin
		"out	0x2e, r1	$" // clear SPDR just in case
		
		// teardown
		"pop	r30		$"
		"pop	r29		$"
		"pop	r28		$"
		"pop	r27		$"
		"pop	r26		$"
		"pop	r25		$"
		"pop	r23		$"
		"pop	r22		$"

		// arrive here if SS high before first byte
		"5:			$"
		"pop	r0		$" // return old SREG
		"out	0x3f, r0	$" //
		"pop	r0		$"
		"pop	r24		$"
		"sbi	0x1b, 0		$" // ack PCINT0
//		"cbi	0x11, 1		$" // clear ACT pin
		"reti			$"
	       );
}

This code currently works perfectly at 2 MHz. At the moment I don’t think it’s possible to make it fast enough for 4 MHz, but there were also some other issues at that bitrate. It was as if the SPI hardware itself was too slow and was losing bits; perhaps my signal wires are not good enough. But 2 MHz is plenty for the mere 16 MHz ATmega2560!
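To put those numbers in perspective, here is a back-of-the-envelope calculation of how many CPU cycles fit into one SPI bit and one SPI byte. The function names are made up for this sketch, and it ignores the pauses the master leaves between bytes, so it is a lower bound on the time available.

```c
#include <stdint.h>

/* CPU cycles that pass while one SPI bit is on the wire. */
uint32_t cpu_cycles_per_spi_bit(uint32_t f_cpu_hz, uint32_t f_sck_hz)
{
    return f_cpu_hz / f_sck_hz;
}

/* CPU cycles that pass while one 8-bit SPI transfer is on the wire. */
uint32_t cpu_cycles_per_spi_byte(uint32_t f_cpu_hz, uint32_t f_sck_hz)
{
    return (uint32_t)((8ULL * f_cpu_hz) / f_sck_hz);
}
```

At 16 MHz and a 2 MHz bus clock that is 8 cycles per bit and 64 cycles per byte, so a polling loop that needs several cycles to notice SPIF already eats most of a bit time. At 4 MHz the budget halves to 4 cycles per bit and 32 per byte, while at 200 kHz there are a comfortable 640 cycles per byte.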

11 Responses

  1. gregory

    Very interesting and comprehensive analysis! Thanks a lot!

    I’m battling similar issue with randomly corrupted SPI transfers. Having read your article it sounds like this must be the case! Will try it over the next days.

    July 31, 2014 at 13:11

  2. ld

    gregory, no, there’s something more to it. The Raspberry Pi kernel SPI driver itself is hanging and losing data periodically. And it’s still not fixed.

    July 31, 2014 at 14:34

    • gregory

      so are you saying there is no problem with atmega and it was Raspberry Pi that was causing issues?

      November 14, 2014 at 23:30

  3. ld

    gregory, no, there’s no problem with the atmega at all.

    I found some time ago that there is a new SPI driver available for the Pi, and you have to compile it yourself, but that one works perfectly. No glitches whatsoever.

    I was going to write a post about it, but didn’t have a chance to do so yet. The instructions for using it are here: https://github.com/notro/rpi-source/wiki/Examples-on-how-to-build-various-modules#dma-capable-spi-master-driver-spi-bcm2708

    November 14, 2014 at 23:35

  4. gregory

    Well, this is interesting, I tried the dma capable driver and still having the random corruptions on AVR->RPi channel only.

    However I haven’t tried your AVR code yet…

    November 14, 2014 at 23:45

  5. This is currently my implementation of the interrupt:
    https://github.com/rpicopter/AvrMiniCopter/blob/master/arduino/SPIdev.h#L82

    What would you recommend to try next?

    November 24, 2014 at 14:52

  6. Well, thinking about it I tried the DMA capable SPI driver for RPi but from userspace! I will have to tackle the problem again and move to kernel space. Will update once I get some findings.

    However, from your analysis above, every aspects points to a fault in AVR SPI handler… how is this possible?

    November 24, 2014 at 15:30

    • Do you have somewhere an example that show how you use the DMA capable SPI driver on RPi?

      Did you tweak spidev to support DMA capable SPI?

      Thanks a lot!

      November 24, 2014 at 15:42

      • ld

        I’m not fully sure what is different in the new driver, but for me it works much better even in userspace. I just loaded it instead of the old driver. And I haven’t played with it lately much though, so it might be that it’s not perfect.

        I guess you can see that I went quite a bit further than what your interrupt handler now does to get it right! :-)

        You really have to get one of those cheap logic analyzers, and it will show you immediately what’s wrong. Use the Arduino GPIO pins to indicate the state of your program. In the end it’s all about timing.

        Did you already try reducing the bus speed? If I remember correctly the RPi can go to very low SPI speeds if needed, and that will leave your Arduino more time to process the data.

        Even if the speed is then insufficient for your intended purpose, it may still help you identify the cause of the issue. Are you blocking interrupts elsewhere in your program?

        November 24, 2014 at 19:20

  7. gregory

    Can I ask what kernel version were you using?

    November 24, 2014 at 19:41

    • ld

      $ uname -a
      Linux raspberrypi 3.12.29+ #714 PREEMPT Wed Oct 1 23:11:38 BST 2014 armv6l GNU/Linux

      November 29, 2014 at 19:30
