lukeevanslx
Joined: 11 Jun 2012 Posts: 14
|
How to detect Stack Overrun (possible variable corruption) |
Posted: Tue Jun 19, 2012 6:11 am |
|
|
How do I detect stack overrun (e.g. the stack is corrupting variables)?
What symbols am I looking for in the .lst file?
Or does CCS actually use the stack for auto variables, or are auto allocations just a scratchpad area of RAM calculated to the proper size?
(I tried searching "stack" in the help but no good hits. Perhaps I can be directed to the correct topic in the manual?)
Thanks |
|
|
RF_Developer
Joined: 07 Feb 2011 Posts: 839
|
|
Posted: Tue Jun 19, 2012 6:22 am |
|
|
You don't check for stack overflow. As far as I am aware there is no way to do so, at least on the 16s and 18s. There may be on the 24s and 32s, but don't hold your breath on that.
On 16s and 18s, and probably 24s, the stack, a hardware stack NOT implemented in data memory and therefore not going to overwrite variables on overflow, is far too small to store any variables. It more or less stores return addresses only. This is why recursion is difficult, if not practically impossible, on the lower/mid-range PICs. The 32s are likely to be different as they are MIPS based, while I have little experience of the 24s/dsPICs.
To check the stack usage look at the top of the .lst file. Here is an example from one of my projects:
Code: |
ROM used: 28236 bytes (58%)
Largest free fragment is 18864
RAM used: 1567 (47%) at main() level
1655 (50%) worst case
Stack: 14 worst case (11 in main + 3 for interrupts)
|
All variable storage is in normal data memory space. Scratchpad areas are used for local variables. Globals are allocated fixed, exclusive storage. Take a look at the .sym file for details of what's gone where.
RF Developer |
|
|
Ttelmah
Joined: 11 Mar 2010 Posts: 19613
|
|
Posted: Tue Jun 19, 2012 8:18 am |
|
|
Ongoing:
The PIC18 chips have the ability to trigger a reset on stack overflow (which can then be detected with 'restart_cause'). However, because the stack doesn't store variables, it would normally only get triggered through code/memory corruption (atomic particle for example), or a problem with the code design.
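As a rough register-level illustration of that check (not the CCS 'restart_cause' API itself), the sketch below classifies a copy of the PIC18 STKPTR register taken at startup. The bit positions (bit 7 = STKFUL, bit 6 = STKUNF) and all names are assumptions for illustration; check them against the datasheet for your device.

```c
#include <stdint.h>

/* Assumed PIC18 STKPTR layout: bit 7 = STKFUL (overflow), bit 6 = STKUNF
   (underflow). Verify against your device datasheet. */
#define STKFUL_MASK 0x80u
#define STKUNF_MASK 0x40u

typedef enum { STACK_OK, STACK_OVERFLOW, STACK_UNDERFLOW } stack_fault_t;

/* Hypothetical helper: pass in the STKPTR value read at the start of main(). */
stack_fault_t classify_stack_fault(uint8_t stkptr_at_boot)
{
    if (stkptr_at_boot & STKFUL_MASK) {
        return STACK_OVERFLOW;
    }
    if (stkptr_at_boot & STKUNF_MASK) {
        return STACK_UNDERFLOW;
    }
    return STACK_OK;
}
```

On real hardware the value would be read from the register before anything else runs, and the flags cleared afterwards so the next reset is reported correctly.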
There is also a 'caveat' on checking the stack size used in the listing - remember that if you are using the ICD, _this_ uses stack space, and if you have a bootloader, this can also add to the stack space used. You need to work out how much stack really is available, before using the listing.
Best Wishes |
|
|
FvM
Joined: 27 Aug 2008 Posts: 2337 Location: Germany
|
|
Posted: Tue Jun 19, 2012 8:24 am |
|
|
With the PIC18's dedicated hardware stack, a stack over- or underrun causes a reset, as you can review in the processor datasheet.
PIC24 has a stack in general RAM, but a stack overflow causes an exception and, by default, a reset. |
|
|
jeremiah
Joined: 20 Jul 2010 Posts: 1362
|
|
Posted: Tue Jun 19, 2012 8:48 am |
|
|
FvM wrote: | With the PIC18's dedicated hardware stack, a stack over- or underrun causes a reset, as you can review in the processor datasheet.
PIC24 has a stack in general RAM, but a stack overflow causes an exception and, by default, a reset. |
Underflow also generates a trap for PIC24:
http://ww1.microchip.com/downloads/en/DeviceDoc/39707a.pdf |
|
|
Ttelmah
Joined: 11 Mar 2010 Posts: 19613
|
|
Posted: Tue Jun 19, 2012 8:49 am |
|
|
Realistically the most likely cause for 'corrupted variables', is a pointer overrun.
You have to remember the PIC has no hardware memory management/protection, so there is nothing to stop you from writing to the wrong address in memory. So (for instance):
Code: |
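char buffer[14];        // room for 14 bytes only
int16 val=1234;         // 0x04D2, normally placed right after buffer
void main(void) {
   strcpy(buffer,"My test string");   // 14 characters + null terminator = 15 bytes
   printf("%4ld\n\r", val);
   do {
   } while(TRUE);
}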
|
_Will have corrupted 'val'_. Reason is that I'm writing a 14 character string to a 14 character storage area, and a string _requires_ an extra character for the null terminator.
Generally if variables are declared consecutively like this, they will be stored in the same order in RAM, so the null terminator will have overwritten the low byte of val, changing it to 0x400 from 0x4D2, and giving 1024 when printed.....
The same applies to any function you write, directly using pointers.
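The arithmetic above can be reproduced on a PC with a small simulation, sketched here in standard C rather than CCS C. It assumes the layout described (14-byte buffer immediately followed by a little-endian 16-bit value) and models RAM as a plain byte array; `simulate_overrun` and `copy_checked` are hypothetical names, and the second one shows one possible way to refuse a copy that does not fit.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Models 16 bytes of RAM: bytes 0..13 are 'buffer', bytes 14..15 are 'val'
   stored low byte first. Returns val after the overlong copy. */
uint16_t simulate_overrun(void)
{
    uint8_t ram[16];
    const char text[] = "My test string";   /* 14 chars + '\0' = 15 bytes */

    ram[14] = (uint8_t)(1234 & 0xFF);        /* low byte, 0xD2 */
    ram[15] = (uint8_t)(1234 >> 8);          /* high byte, 0x04 */

    memcpy(ram, text, sizeof text);          /* terminator lands on ram[14] */

    return (uint16_t)(ram[14] | ((uint16_t)ram[15] << 8));
}

/* Copies src only if it fits, terminator included. Returns 1 on success. */
int copy_checked(char *dst, size_t dst_size, const char *src)
{
    size_t needed = strlen(src) + 1;         /* +1 for the null terminator */
    if (needed > dst_size) {
        return 0;
    }
    memcpy(dst, src, needed);
    return 1;
}
```

With a 14-byte destination `copy_checked` rejects the string; declaring the buffer as 15 bytes is enough for it to succeed.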
Best Wishes |
|
|
asmboy
Joined: 20 Nov 2007 Posts: 2128 Location: albany ny
|
|
Posted: Tue Jun 19, 2012 7:42 pm |
|
|
To add to what Ttelmah has written,
another common memory cruncher is
writing to an array element at a greater index
than was allocated. Or incorrectly storing to a circular index buffer etc.
CCS does NOT do bounds checking.
As a programmer you need to pay attention to things like that as there is NO RUNTIME CHECKING .
In runtime, you are performing without a net at all times.
The smallest slip-up will throw you to the canvas.
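Since nothing checks indexes at runtime, the check has to live in your own code. A minimal sketch of a circular buffer that wraps its indexes and refuses to write when full is shown below, in standard C; `ring_t`, `RING_SIZE` and the function names are made up for illustration.

```c
#include <stdint.h>

#define RING_SIZE 8u   /* hypothetical capacity; one slot is kept empty */

typedef struct {
    uint8_t data[RING_SIZE];
    uint8_t head;      /* next position to write */
    uint8_t tail;      /* next position to read */
} ring_t;

/* Returns 1 if the byte was stored, 0 if the buffer was full. */
int ring_put(ring_t *r, uint8_t b)
{
    uint8_t next = (uint8_t)((r->head + 1u) % RING_SIZE);  /* wrap the index */
    if (next == r->tail) {
        return 0;      /* full: drop the byte instead of overwriting */
    }
    r->data[r->head] = b;
    r->head = next;
    return 1;
}

/* Returns 1 and stores the oldest byte in *out, or 0 if the buffer is empty. */
int ring_get(ring_t *r, uint8_t *out)
{
    if (r->tail == r->head) {
        return 0;
    }
    *out = r->data[r->tail];
    r->tail = (uint8_t)((r->tail + 1u) % RING_SIZE);
    return 1;
}
```

Every index is reduced modulo `RING_SIZE` before it is stored, so neither `head` nor `tail` can ever point outside `data`.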
|
|
|
Douglas Kennedy
Joined: 07 Sep 2003 Posts: 755 Location: Florida
|
|
Posted: Wed Jun 20, 2012 7:57 am |
|
|
Atomic particles messing with the PIC's movement of electrons is low on the list of causes, since most often it is human error. As suggested, first you look at your code twice. If that doesn't find the error and you have a PIC with restart on stack error, then CCS has __line__. You assign it to a variable in strategic places, and in the restart clause you send it to a monitor (pre-crash RAM is preserved on a restart, assuming it wasn't a power fail). I use ICD_U64 and the CCS debugger for this. Now you have some idea as to the last line number that your code passed before crashing. It's not perfection but it can narrow it down. Now it is possible an atomic particle alters your brain waves when troubleshooting and you miss the whole thing... so there is no perfect answer. |
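A minimal sketch of the breadcrumb idea, written in standard C using the `__LINE__` macro. The names `last_checkpoint`, `CHECKPOINT` and `sum_samples` are hypothetical, and on a real PIC the variable would have to be one that the startup code does not zero or re-initialise, otherwise the value is lost across the reset.

```c
#include <stdint.h>

/* Hypothetical breadcrumb variable. It must survive the reset to be useful,
   so it must not be cleared or initialised by the startup code. */
volatile uint16_t last_checkpoint;

/* Record the source line that was reached most recently. */
#define CHECKPOINT() (last_checkpoint = (uint16_t)__LINE__)

/* Example routine with checkpoints at strategic places. */
uint16_t sum_samples(const uint16_t *samples, uint8_t count)
{
    uint16_t total = 0;
    uint8_t i;

    CHECKPOINT();                 /* reached the start of the routine */
    for (i = 0; i < count; i++) {
        total = (uint16_t)(total + samples[i]);
    }
    CHECKPOINT();                 /* the loop completed */
    return total;
}
```

After an unexpected restart, the startup code would print or log `last_checkpoint` before overwriting it, giving the last line known to have executed.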
|
|
Ttelmah
Joined: 11 Mar 2010 Posts: 19613
|
|
Posted: Wed Jun 20, 2012 8:33 am |
|
|
Yes, though I have seen it. Only on a couple of machines fitted into a site where mining had gone into some fairly radioactive rocks, but it does happen. Also anyone involved with CCDs will testify just how often they can record a stray particle. However, put it about fiftieth 'down' the list of things likely to go wrong.
However key words to the original post, are 'stack is corrupting variables'. No, on the PIC16/18's, stack overflow _will not cause variable corruption_. There is no variable stack as such.
'Top ten', in order of likelihood:
Incorrect pointer count.
Incorrect array index.
Incorrect sizing of array/pointer passed to a function - this is both a power, and a danger of C, where you can pass (say) a pointer to an int8 to a function, and tell the function that it is a pointer to something larger, then find yourself talking to values far beyond the end of the physical array.
Incorrect handling of malloc.
Not disabling interrupts when passing multiple byte values to/from an interrupt handler.
Noisy PSU. RAM corruption through poor supply regulation.
Compiler error (There have been a few particularly when handling complex structures crossing page boundaries).
RF induced problems.
Spikes from lines into the PIC. Particularly MCLR. This _does not_ have the protection diodes present on other pins, and if used as an input, these should be provided externally, or spikes just a little over Vdd, can cause RAM corruption.
Best Wishes |
|
|
Douglas Kennedy
Joined: 07 Sep 2003 Posts: 755 Location: Florida
|
|
Posted: Wed Jun 20, 2012 10:57 am |
|
|
Ttelmah says the following is a frequent cause of trouble.
Quote: |
Not disabling interrupts when passing multiple byte values to/from an interrupt handler.
|
I've been guilty of this in the past and lost some sleep over it.
I protect the int16 int32 variables shared between main and any isr that uses them but I have a concern that is perhaps unfounded and maybe Ttelmah could comment on it.
Scenario: an int32 comes in on a CAN bus; the isr captures the packet and posts it to a circular buffer. Two things are happening. The can_bus hardware has packet-size buffers rx0 and rx1, and the interrupts #INT_CANRX0 and #INT_CANRX1 trigger the same isr if either one has data, which then posts the data to the circular buffer. In main, the circular buffer is read by calling disable_interrupts, extracting the int32, then calling enable_interrupts.
Now this holds off interrupts for a brief time so the concern is do the rx0 and rx1 flags get set during the brief interval so when the interrupt is enabled it is immediately triggered or can can_bus data be lost? |
|
|
Ttelmah
Joined: 11 Mar 2010 Posts: 19613
|
|
Posted: Wed Jun 20, 2012 11:44 am |
|
|
Disabling the interrupts and copying is the way to go. The time needed to move four bytes is only perhaps eight machine cycles, so unless the interrupt can receive a fifth byte, and not buffer this, in this time, then it gives 100% coverage. The alternative is to use alternating buffers: as soon as a packet of bytes is received, switch to the second buffer and flag this, then in the main code move the bytes from the buffer not now used, which, since it is not the buffer in use, doesn't need interrupts disabled.
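The alternating-buffer approach could look roughly like the sketch below, in standard C with the ISR modelled as an ordinary function. All names (`rx_isr_byte`, `read_packet`, `PACKET_LEN`) are invented for illustration, and the byte order is assumed to be low byte first.

```c
#include <stdint.h>

#define PACKET_LEN 4u                 /* assumed: one 32-bit value per packet */

uint8_t bufs[2][PACKET_LEN];          /* the two alternating buffers */
volatile uint8_t fill_index;          /* buffer the ISR is currently filling */
volatile uint8_t fill_count;          /* bytes received so far in that buffer */
volatile uint8_t packet_ready;        /* set by the ISR, cleared by main code */

/* Stand-in for the receive ISR body: called once per received byte. */
void rx_isr_byte(uint8_t b)
{
    bufs[fill_index][fill_count] = b;
    fill_count++;
    if (fill_count == PACKET_LEN) {
        fill_count = 0;
        fill_index ^= 1u;             /* switch to the other buffer */
        packet_ready = 1;             /* tell main a full packet is waiting */
    }
}

/* Main-line code: reads the buffer the ISR is NOT using. Returns 1 if a
   packet was available and stored in *out, otherwise 0. */
int read_packet(uint32_t *out)
{
    const uint8_t *p;

    if (!packet_ready) {
        return 0;
    }
    p = bufs[fill_index ^ 1u];
    *out = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    packet_ready = 0;
    return 1;
}
```

This only holds up if the main code collects each packet before the next one completes; otherwise the ISR swaps back onto the buffer still being read.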
Best Wishes |
|
|
Douglas Kennedy
Joined: 07 Sep 2003 Posts: 755 Location: Florida
|
|
Posted: Wed Jun 20, 2012 12:36 pm |
|
|
Thanks Ttelmah,
I believe you are confirming that the interrupt pending flags are set even when interrupts are disabled, something I assumed to be true, and the risk is only that the hardware buffer overflows while main has the interrupt blacked out. So the latency incurred by the instructions needed to get the isr up to the point it can pull in data and get back out is in fact more of an issue than the blackout time in main (a few instructions to move 4 bytes). This is good news since it is deterministic, in that data can't arrive faster than the can_bus baud rate and the isr call instructions are also determinable... so in my case with my baud rate it is mathematically impossible (to lose data). The can_bus is purported to only lose 1 bit every hundred years of continuous use (this assumes the hardware, e.g. oscillators, wiring etc., is perfect for 100 years). I didn't want to destroy this reliability with my coding. |
|
|
Ttelmah
Joined: 11 Mar 2010 Posts: 19613
|
|
Posted: Thu Jun 21, 2012 1:42 am |
|
|
Oh, yes.
This is in the chip's data sheet, and is why you can leave interrupts disabled, and poll the interrupt flag, as an alternative to using an interrupt handler. This is also why you have recommendations like clearing an interrupt, before enabling it, for things like the 'edge' interrupts, where the act of programming the edge used, can trigger the interrupt bit, though it is disabled.
The 'interrupt enable' bits only control whether the interrupt flag will result in an interrupt call. The sequence (for a PIC18) is:
Interrupt flag - set when the hardware event happens. Doesn't care about any of the other bits.
Interrupt enable - sets whether each flag is connected to the interrupt hardware
Priority bit - sets which of the two interrupt hardware sections each interrupt signal is connected to. Beware that INT_EXT, does not have this bit and always connects to the high priority hardware if priorities are enabled.
GIEH/GIE enables the hardware call logic for each of the two sources.
An actual interrupt 'call' only occurs if the entire 'tree' of bits is set correctly, but the very first one (the flag), doesn't care about any of the others.
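The 'tree' can be written down as a simplified boolean model, sketched here in standard C. It assumes priorities are enabled and that low-priority sources need both GIEH and GIEL set, which is how the PIC18 datasheets generally describe it; confirm against the datasheet for your part. The type and function names are hypothetical.

```c
#include <stdbool.h>

typedef struct {
    bool flag;        /* set by the hardware event, whatever the other bits say */
    bool enable;      /* per-source interrupt enable bit */
    bool high_prio;   /* priority bit: true = high-priority hardware */
} irq_source_t;

/* True if this source would actually cause an interrupt call.
   Assumption: a low-priority source needs GIEH and GIEL both set. */
bool irq_vectors(const irq_source_t *s, bool gieh, bool giel)
{
    bool global = s->high_prio ? gieh : (gieh && giel);
    return s->flag && s->enable && global;
}

/* Polling looks only at the flag, so it works with the interrupt disabled. */
bool irq_pending_polled(const irq_source_t *s)
{
    return s->flag;
}
```

With `enable` clear, `irq_vectors` is false while `irq_pending_polled` still reports the event, which is the behaviour that makes polling with interrupts disabled possible.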
Best Wishes |
|
|
asmboy
Joined: 20 Nov 2007 Posts: 2128 Location: albany ny
|
|
Posted: Wed Jun 27, 2012 7:14 pm |
|
|
re: disabling ints
for quite some time i have been making use of
timer 'tix' that are incremented as a 32 bit int by a timer ISR.
when i want to know the actual tix count -
i never disable ints in order to read it.
this code, while not hyper efficient , has never failed , that i know of.
Code: |
unsigned int32 getMsecs(void){
   unsigned int32 readtsecs;
   //expMsecs is the 1 msec tic of the timer ISR
   do{
      readtsecs=expMsecs;            // multi-byte copy, may be interrupted
   } while (readtsecs !=expMsecs);   // re-read until two reads agree
   return (readtsecs);
}
|
|
|
|
|