Ah, the joy of firmware debugging! I’ve been working on a USB firmware for my new upcoming product that involves a dual traditional TTL (Serial) and a USB-TO-SERIAL converter (CDC Communication Port) implemented in the same firmware. The firmware was written in C18 , using the USB stack from Microchip Applications Libraries.
Everything worked well until I suddenly discovered that while sending data over the CDC port from the computer to the device, the device would *sometimes* receive garbage data instead of what I sent. This wouldn’t be so discouraging if the data received would be the same, or at least this would happen always not *sometimes*. The bug was caught in the act on the screenshot below, the arrows show the data I actually sent ‘stop’ followed by LF (0xA) char and the character codes (in decimal) actually received. This is garbage because 92,162,62,180 are note the ASCII chars for ‘stop’. Please note that this will only occur *sometimes*, however I could almost consistently repeat it after several retries of sending a command to the device.
The tracking of this bug turned out to be so adventurous and involved that I decided to write a tutorial about it. It involved some tools that I didn’t use often before, but now they proved indispensable in locating this furry and fuzzy code pest . As you’ll see at the end of the article the bug was pretty obvious at least for me (I already made this mistake many times , but forgot about it), however it might be really useful for someone just starting working with microcontrollers, C and/or interrupts. It also provides an opportunity to review some tools I used in the process.
So where do we go from here ? Same data is send over a USB CDC Serial Emulator Port but sometimes the microcontroller receives garbage instead of the original data. A general rule of thumb is – start at the source and go up. Let’s identify the possible problems:
#1 – the computer terminal, or USB driver has gone crazy, it is sending something else instead of what was typed
#2 – there’s a hardware noise problem on the USB lines
#3 – there’s a bug in the Microchip USB Stack
#4 – there’s a bug in my code
Ok so we start with #1 , how do we see that the terminal is actually sending the correct data ? We need a way to see the actual packets send via USB. For someone who had a hardware USB protocol analyzer it would be a good time to power it up. Unfortunately these devices are really costly (let me know in the comments if I am wrong and you know any affordable solutions), and luckily for us there are decent software USB protocol analyzers that we can use instead. The tool that I used and that I highly recommend is USBTrace. It is not free , but still affordable and a great value that will pay off if you value your debug time. The reason I like USBTrace is that is so easy to use. You know a good product because after you start it you don’t have to read a manual to perform a basic function. Once I started USBTrace on the left side I was able to find a list of all USB devices, it was easy to track down my virtual CDC Serial port because in the bottom panel I could see the device descriptors. Please note that I also had connected a FTDI serial converter that I could have confused with my device. However my device had some strings in the descriptor that I recognized from my firmware such as “CDC RS-23 Emulation Demo”. Once I checked my device I was ready to click the green Run buttons. And the packets started pouring in …
Wow , that was an overwhelming number of packets! My device was in streaming mode and sending a lot of data to the EndPoint #82. I only needed data that I send to device so how do I do that ? Again USBTrace was great and very intuitive , clicked on the funnel icon (Filter) and I got following dialog:
Here I was able to set the Direction of the packet – OUT. Please note that for USB packets , by convention , the direction is as looking from host – so OUT from computer to device. I also set the CDC OUT endpoint as specified in my firmware #2, finally I set the request type to BULK_OR_INTRRUPT_TRANSFER.
Now all I had to do is to send the ‘stop’ command couple of times until the bug happened again, that was easy to spot since the streaming would not stop once garbage was received. And once it happened, I checked the contents of the last packet in USBTrace, and what do I see:
The data is completely correct, so my first guess is ruled out. Let the bug hunt continue. Moving on to #2.
#2 – there’s a hardware noise problem on the USB lines
I had a look at the cables, I had a look at the device, I knew the hardware setup worked perfectly before. If I had a hardware USB analyzer I would hook it up. But I didn’t so I took a leap of faith and I skipped #2. This wasn’t by the books. However it was all I could do in reasonable time. So moving on to #3.
#3 – there’s a bug in the Microchip USB stack
Once you download the USB stack, and integrate into your project , it becomes part of your project. So tracking Microchip’s bug was no different than tracking my own code bug. So I moved to #4 and handled #3 in the same step.
#4 – there’s a bug in my code
Ok so here we are in the MPLAB IDE 8.83. My PIC device is capable of up to 4 Hardware breakpoints. This means that the MCU can halt on specific memory locations, it also can halt when a specific memory location is written or read. By scanning through the USB Stack code, I found that it stores the incoming USB CDC data in the cdc_data_rx buffer . From the .map file I was able to determine that this buffer address is starting at 0xE08. The MAP file is basically a file the compiler creates and where you can find all your variables and code and at what address they are stored, look for it in the same folder where you output .hex file is.
Setting a memory breakpoint is accomplished by clicking on the “ADV.B” blue button in the debugger toolbar, this brings up a dialog like this:
I just punched in the address of my buffer, selected access type as Write. Now when the first byte in the incoming CDC buffer is written with good / or garbage data the code should break and point me to where the data writing happened. Again I stared sending data, and I had to click on the green debug run icon couple of times, to skip through good cases until the bad garbage write happened. The MPLAB debugger didn’t stop every time the bad case happened , I am not sure if this is a limitation of the debugger or something caused by my code bug, but finally it stopped at a place where I defined some code that was executed inside an interrupt.
This is the code for receiving the data over the other port the TTL USART, not the one having problems CDC Serial ! It basically fills out a buffer called usart_rx_buf, but this not the cdc_data_rx buffer so how could they interfere? Eureka ! Buffer overflow – the index used to access the buffer byte uasart_rx_pos was going crazy out of bounds. Indeed I looked at how usart_rx_pos was declared:
The only other place where this variable was used was in this non-interrupt code:
Now this is a big problem, let’s look at the last statement
usart_rx_pos = 0
If we look at the disable of this instruction, on a PIC 8 bit controller we can see that it takes actually 2 instruction to complete it:
This is because this is a 2 bytes variable . Now think what would happen if the interrupt happens between these two instructions. The interrupt mentioned above modifies the variable. So while the main code thinks it has “0” in usart_rx_pos, in fact it only sets to 0 the second part of the variable the first one could have been modified by an interrupt! Thus the unpredictable behavior … usart_rx_pos becomes a random variable which causes the buffer overflow. Data is written at random locations in program memory . Scarrrrrry !
So how did this happened , even if “I’ve been there done that” many times ? Well the timer interrupt code was not inside an interrupt initially , I later migrated it to an interrupt and forgot to do a sanity check of variable types. In fact I don’t need an int for usart_rx_pos (the buffer is smaller than 255). So I changed it’s declaration to:
Now that writing of one byte is done in single instruction there’s no risk of corruption by an interrupt updating same variable. However as we’ll see below this is only part of the solution. Below is the disassembly code of writing 0 to a one byte variable:
Please note that I also added an important ‘volatile’ specification in front of all variables modified by the interrupt code. Actually the lack of ‘volatile’ was my second guess why the bug happened. Without ‘volatile’ compiler might think that there’s no other place where the variable can change besides the spot it compiles. So it might optimize the code in weird ways. Here is an example, if a variable is not volatile the compiler could theoretically update this code
x = 1;
y = 0;
x = 1;
x = 1;
y = 0;
It removes the second assignment of x since it considers it redundant, while in fact if you have an interrupt executing let’s say before y=0 , and modifying x=0 thus in this case the x=1 at the end is necessary ! So ‘volatile’ basically tells the compiler that a variable can change suddenly ‘out of the blue’ and it should take care while optimizing the code and not to remove any assignments even if they look redundant.
I’ll now present my final solution to ensure the interrupt code does not modify usart_rx_pos and usart_rx_buf while we’re processing it in non-interrupt code. The solution is to prevent the interrupt from happening during that time. No worries, if interrupt has to happen while it was disabled by us - there ‘s usually a flag set and the interrupt will happen eventually once we’re re-enabling it. Because the processing in session_new_char() might take a while (and you don’t want to disable USART receiving interrupts for a long time) my strategy was to declare two SAFE non-volatile variables and copy data over quickly then re-enable the interrupt. Please note that this code would work fine even if usart_rx_pos was an int (dual instruction assignment) :
Please note that I didn’t have to disable all interrupts, it was enough to disable only the interrupts that would cause our variable to be modified (the USART Receive interrupt). It’s enabled or disabled by the RC1IE flag in PIE1 register.
After testing this new code, the problem was gone. So now I am thinking “everything is for the best” , I spent some time tracing a bug, but I learned along the way and sure thing I wrote this tutorial so I never again forget that interrupts can modify variables when you don’t expect to !
A. Declare variables modified in interrupts as ‘volatile’ to stop compiler doing nasty things to them
B. Disable the interrupts that can modify those variables, while working with these variables in a non-interrupt code (both reading and writing), this will create a so called ‘atomic’ code, called so because it cannot be split by an interrupt.
C. Keep ‘the atomic’ code short – the interrupts can wait , but not for too long in some cases. For example the hardware USART buffer on my device only has 2 bytes – so it would get filled and lost soon if the processing done in the interrupt would be halted for too long.
D. Have fun. Life is beautiful even when debugging