benres
Joined: 30 Jun 2004 Posts: 9 Location: Cambridge, MA
|
|
Posted: Thu Nov 18, 2004 6:30 pm |
|
|
Sorry about that -- the links should work now.
I agree this is a bizarre problem. We also have hundreds of this exact PIC in the field with PLL enabled with zero problems. I thought I had pretty much seen everything, especially after about a year of very smooth sailing.
Does your history of reliable PIC operation include devices with the PLL enabled?
Our hardware fails reliably without this fix, and it works reliably with this fix. If the problem were power supply or crystals or some other hardware issue, I wouldn't expect moving the first instruction from 0x0000 to 0x0002 to make a difference, but it clearly does. Disabling the PLL also works, but then our circuit runs too slow.
I'm comfortable this fix works. All I'm trying to do now is ascertain whether there are any negative consequences of implementing this fix. Aside from losing two bytes of program memory, is there a downside?
Thanks,
- Ben |
|
|
PCM programmer
Joined: 06 Sep 2003 Posts: 21708
|
|
Posted: Thu Nov 18, 2004 7:04 pm |
|
|
I haven't used the 4x PLL in any of our 18F projects yet.
Other people on this board have used it. |
|
|
asmallri
Joined: 12 Aug 2004 Posts: 1635 Location: Perth, Australia
|
|
Posted: Thu Nov 18, 2004 7:08 pm |
|
|
Quote: | Does your history of reliable PIC operation include devices with the PLL enabled? |
Yes. I run most of my projects on a 10 MHz crystal with the 4x PLL.
Quote: |
Our hardware fails reliably without this fix, and it works reliably with this fix. If the problem was power supply or crystals or some other hardware issue, I wouldn't expect moving the first instruction from 0x0000 to 0x0002 to make a difference, but it clearly does. Disabling the PLL also works, but then our circuit runs too slow. |
When you try it without the PLL, do you make any change whatsoever to the code? For example, do you change the SPBRG? I ask because I am wondering if the linker is remapping anything elsewhere.
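The reason the SPBRG question matters: the USART baud rate is derived from the oscillator frequency, so a value chosen for the PLL clock produces a quarter of the intended rate once the PLL is off. A minimal sketch in Python, assuming a 10 MHz crystal, a 9600 baud target and the 8-bit asynchronous low-speed formula baud = Fosc / (64 * (SPBRG + 1)). The frequencies, the baud rate and the function names are illustrative and are not taken from Ben's code:

```python
def spbrg_for(fosc_hz, baud, brgh=False):
    """Nearest SPBRG value for asynchronous mode with an 8-bit generator."""
    divisor = 16 if brgh else 64
    return round(fosc_hz / (divisor * baud)) - 1


def actual_baud(fosc_hz, spbrg, brgh=False):
    """Baud rate the USART really produces for a given SPBRG value."""
    divisor = 16 if brgh else 64
    return fosc_hz / (divisor * (spbrg + 1))


CRYSTAL_HZ = 10_000_000      # assumed crystal frequency
PLL_HZ = 4 * CRYSTAL_HZ      # clock with the 4x PLL enabled

spbrg = spbrg_for(PLL_HZ, 9600)              # value compiled for the PLL clock
baud_with_pll = actual_baud(PLL_HZ, spbrg)
baud_without_pll = actual_baud(CRYSTAL_HZ, spbrg)

print(f"SPBRG = {spbrg}")
print(f"PLL on : {baud_with_pll:.0f} baud")
print(f"PLL off: {baud_without_pll:.0f} baud")
```

With these assumed numbers the same SPBRG gives roughly 9615 baud with the PLL and a quarter of that without it, which is why an unchanged binary can look like a dead serial port when only the configuration bits change.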
Quote: |
I'm comfortable this fix works. All I'm trying to do now is ascertain whether there are any negative consequences of implementing this fix. Aside from losing two bytes of program memory, is there a downside? |
If the problem is as a result of a compiler or linker related issue then it may have other ramifications.
One possibility comes to mind. The CCS compiler will place code down in very low memory. In one case I saw it place executable code at address 0x0004. Adding a single NOP at address 0x0000 would prevent the compiler from placing code at this address, and it would be placed somewhere else in the map but not at 0x0006. One additional instruction will affect the way the compiler places code in program memory. This is why I suggested you look at the .lst file and see what is happening down in low memory.
Try the following idea. Remove the NOP at 0x0000 and add this code and see what happens.
Code: |
// dummy code to force the compiler to not use memory in the vector space
#rom 0x0020={0xffff,0xffff}
|
_________________ Regards, Andrew
http://www.brushelectronics.com/software
Home of Ethernet, SD card and Encrypted Serial Bootloaders for PICs!! |
|
|
asmallri
Joined: 12 Aug 2004 Posts: 1635 Location: Perth, Australia
|
|
Posted: Thu Nov 18, 2004 7:30 pm |
|
|
I checked out the hex files produced by both the fixed and the non-fixed versions, and the compiler is definitely placing operational code at 0x0004 (unfortunately you cannot see this from the .lst file as it is suppressed). Try my suggested mod, examine the hex file produced and see if the compiler still does it. Looks to me like a compiler / linker bug, and if so the fix I proposed is a 'better' solution as it prevents the linker placing code in the vector region. _________________ Regards, Andrew
|
|
benres
Joined: 30 Jun 2004 Posts: 9 Location: Cambridge, MA
|
|
Posted: Thu Nov 18, 2004 7:30 pm |
|
|
Inserting the "#rom()" actually makes no difference in the LST file. The compiler seems to ignore this and still puts an instruction at location zero.
Looking at low memory, adding the #build() command simply moves everything by 2 bytes. I suppose this could introduce some sort of page boundary problem, but I'm hard pressed to think of a specific scenario. Can I assume a NOP is placed at 0x0000?
I turn the PLL on and off by changing the configuration bits in the MPLAB window. I've been careful not to recompile in order to avoid the problem you describe. I also verify non-operation of the serial port by looking at the TX pin with a scope.
Now that this short test program has been written, I have TWO examples of software that fails to start when there's an instruction at location zero and the PLL is enabled. Furthermore, both examples are fixed by either turning off the PLL or putting a NOP at instruction zero.
Thanks,
- Ben
PS: There were some problems with the link I posted -- it's fixed now:
http://web.media.mit.edu/~benres/picbug/
http://web.media.mit.edu/~benres/picbug/test_noFix.lst
http://web.media.mit.edu/~benres/picbug/test_Fixed.lst |
|
|
asmallri
Joined: 12 Aug 2004 Posts: 1635 Location: Perth, Australia
|
|
Posted: Thu Nov 18, 2004 7:51 pm |
|
|
Here is the extract of the hex file produced from the no-fix source file:
Quote: |
:100000005EEF00F00B50016A0A5C03E20AC000F0E8 |
Note it places operational code at 0x0004
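To make the layout of that record concrete, here is a small Python sketch that splits it into its Intel HEX fields and decodes the first four data bytes as a PIC18 two-word GOTO (first word 0xEFkk, second word 0xFkkk, target byte address = k * 2). The function name is hypothetical and only the GOTO opcode is handled, so treat it as an illustration rather than a disassembler:

```python
def decode_goto(code: bytes):
    """Decode a PIC18 two-word GOTO from 4 bytes of little-endian program memory.

    Returns the target byte address, or None if the words are not a GOTO.
    """
    word1 = code[0] | (code[1] << 8)
    word2 = code[2] | (code[3] << 8)
    if (word1 & 0xFF00) != 0xEF00 or (word2 & 0xF000) != 0xF000:
        return None
    k = (word1 & 0x00FF) | ((word2 & 0x0FFF) << 8)   # 20-bit word address
    return k << 1                                    # convert to byte address


record = ":100000005EEF00F00B50016A0A5C03E20AC000F0E8"
raw = bytes.fromhex(record[1:])
length = raw[0]
address = (raw[1] << 8) | raw[2]
record_type = raw[3]
data = raw[4:4 + length]

goto_target = decode_goto(data[:4])   # reset vector occupies 0x0000-0x0003
following = data[4:]                  # whatever comes right after the GOTO

print(f"GOTO 0x{goto_target:04X} at 0x{address:04X}")
print(f"{len(following)} more bytes start at 0x{address + 4:04X}")
```

For the record above this reports a GOTO to 0x00BC at the reset vector, followed by 12 further bytes beginning at 0x0004.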
Here is the extract of the produced hex file with the #rom added to the source file
Quote: | :020000040000FA
:040000006DEF00F0B0
:0E0022000B50016A0A5C03E20AC000F00CD029 |
Note that there is no code installed between the reset vector and address 0x0022
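The difference between the two extracts can be checked mechanically. The following Python sketch lists the byte-address range covered by each data record; it assumes the upper 16 address bits are zero (as the type 04 record above states) and the helper name is made up for this illustration:

```python
def data_ranges(records):
    """Return (first, last) byte addresses for every Intel HEX data record."""
    ranges = []
    for line in records:
        raw = bytes.fromhex(line.strip().lstrip(":"))
        length = raw[0]
        address = (raw[1] << 8) | raw[2]
        record_type = raw[3]
        if record_type == 0x00 and length:   # type 00 = data, others skipped
            ranges.append((address, address + length - 1))
    return ranges


NO_FIX = [":100000005EEF00F00B50016A0A5C03E20AC000F0E8"]
WITH_ROM = [
    ":020000040000FA",
    ":040000006DEF00F0B0",
    ":0E0022000B50016A0A5C03E20AC000F00CD029",
]

no_fix_ranges = data_ranges(NO_FIX)
with_rom_ranges = data_ranges(WITH_ROM)

for name, ranges in (("no fix", no_fix_ranges), ("with #rom", with_rom_ranges)):
    text = ", ".join(f"0x{lo:04X}-0x{hi:04X}" for lo, hi in ranges)
    print(f"{name}: {text}")
```

The no-fix extract covers 0x0000-0x000F in one run, while the #rom extract covers only 0x0000-0x0003 and then 0x0022-0x002F, leaving the gap described above.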
Please test my suggested mod. I am sure it will work. If it does work, what does this tell you? It is not a PIC fault. You are seeing a problem with the compiler or linker. The PLL or no PLL question is a red herring; the clue is in the hex files. _________________ Regards, Andrew
|
|
benres
Joined: 30 Jun 2004 Posts: 9 Location: Cambridge, MA
|
|
Posted: Thu Nov 18, 2004 9:27 pm |
|
|
I see what you're talking about. Adding the #ROM indeed changes the .HEX file, but not the .LST file. Unfortunately, the compiler version I'm using (3.141) doesn't actually reserve the memory declared in the #ROM. I'm trying to get this reserved with #ORG or #RESERVE, but these don't seem to work in low memory. I'll play with it more tomorrow.
Upgrading to the 3.212 compiler fixes the problem in both the test version and the full code. So perhaps the PLL is indeed a red herring. The 3.212 compiler also reserves #ROM'd memory.
Given our deadline, upgrading the compiler is too scary to contemplate.
I still have a question. Even with the working 3.212 compiler, I'm seeing code placed in low memory. What gives? How come this works? Here's the first few lines of the HEX file with the 3.212 compiler. What makes some low memory placement acceptable, and other placements not acceptable?
Code: |
(spaces added for readability.)
:02 0000 04 0000 FA
:10 0000 00 4AEF00F00B50016A0A5C03E20AC000F0 FC
:10 0010 00 0CD0006A080E0C6E0A3600360B50005C DD
:10 0020 00 D8B0006E01360C2EF7D7000C015008C0 76
:10 0030 00 0AF0640E0B6EE6DF00C008F00150300E CF
:10 0040 00 07E109B00CD009B60AD009B8200E02D0 D9
|
(Attn lurkers: See appendix A of http://ww1.microchip.com/downloads/en/DeviceDoc/33014g.pdf for more info on HEX file formats)
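For anyone who wants to check the lines above by hand, here is a hedged Python sketch that strips the added spaces, verifies each record's checksum (all bytes of a record, checksum included, sum to zero modulo 256) and reports whether the posted data covers the PIC18 reset and interrupt vector addresses 0x0000, 0x0008 and 0x0018. The helper and variable names are invented for the example:

```python
def parse_record(line):
    """Parse one Intel HEX record, tolerating spaces; return (address, type, data)."""
    raw = bytes.fromhex(line.strip().replace(" ", "").lstrip(":"))
    if sum(raw) & 0xFF != 0:
        raise ValueError(f"bad checksum in {line!r}")
    length = raw[0]
    return (raw[1] << 8) | raw[2], raw[3], raw[4:4 + length]


HEX_3_212 = """\
:02 0000 04 0000 FA
:10 0000 00 4AEF00F00B50016A0A5C03E20AC000F0 FC
:10 0010 00 0CD0006A080E0C6E0A3600360B50005C DD
:10 0020 00 D8B0006E01360C2EF7D7000C015008C0 76
:10 0030 00 0AF0640E0B6EE6DF00C008F00150300E CF
:10 0040 00 07E109B00CD009B60AD009B8200E02D0 D9
"""

occupied = set()
for line in HEX_3_212.splitlines():
    address, record_type, data = parse_record(line)
    if record_type == 0x00:                      # data records only
        occupied.update(range(address, address + len(data)))

VECTORS = {
    "reset": 0x0000,
    "high-priority interrupt": 0x0008,
    "low-priority interrupt": 0x0018,
}
overlaps = {name: addr in occupied for name, addr in VECTORS.items()}

print(f"data covers 0x{min(occupied):04X}-0x{max(occupied):04X}")
for name, hit in overlaps.items():
    print(f"{name} vector occupied: {hit}")
```

All six records pass the checksum test, and the data runs without a gap from 0x0000 to 0x004F, so in this build both interrupt vector addresses hold ordinary code.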
A few other points:
* This failure is definitely linked to specific hardware. Units that pass always pass, and units that fail reliably fail about 30% of the time.
* We've been putting this code on devices for two years, and have over a thousand instances that have been running for years without problem ... until now.
My inclination is to go with the less certain #build() fix for now, and start validating the compiler upgrade for the next build.
Thanks,
- Ben |
|
|
asmallri
Joined: 12 Aug 2004 Posts: 1635 Location: Perth, Australia
|
|
Posted: Thu Nov 18, 2004 11:52 pm |
|
|
The following method also works if you do not have interrupt handlers, or if you have your own handlers not declared as type interrupt.
Code: | #build(interrupt=0x028)
#org 0x0020, 0x002f {} |
When you do this the compiler also lets you overwrite the real handler addresses at 0x0008 and 0x0018.
Quote: | I still have a question. Even with the working 3.212 compiler, I'm seeing code placed in low memory. What gives? |
Using low memory should be ok but the compiler or more likely the linker is getting something wrong - it may be that it gets screwed up because of the interrupt vectors - I don't know.
Quote: | Given our deadline, upgrading the compiler is too scary to contemplate. |
Tough call - do you use the tool that you know is flawed, without knowing the extent of the flaw, or do you use the newer tool, which is not guaranteed to have the bug fixed anyway? I note that several people have expressed confidence that 3.212 is a good build.
Quote: | My inclination is to go with the less certain #build() fix for now, and start validating the compiler upgrade for the next build. |
The compiler (or linker) stuffs around trying to fit code into fragments of memory. You have observed that the particular code segment that has caused you trouble so far can be made to work by decreasing the code space by one instruction. So the compiler (or linker) is putting this particular code segment elsewhere and instead putting some other smaller code segment in low memory. Who is to say that this has not also been corrupted in some way that has not yet manifested itself to you? I believe the #build approach to be better because it prevents the compiler from using any space around the vectors (where I surmise the real bug lies). _________________ Regards, Andrew
|
|
benres
Joined: 30 Jun 2004 Posts: 9 Location: Cambridge, MA
|
|
Posted: Sat Nov 20, 2004 8:23 am |
|
|
The solution you suggest does indeed fix the test program.
Code: | #build(interrupt=0x028)
#org 0x0020, 0x002f {} |
However, the actual program has interrupts, so this doesn't work for production. Otherwise I would definitely follow your advice.
I agree with your thinking that it's a compiler problem and not a PIC chip problem, but how to explain the non-deterministic nature? Why is the PIC exhibiting unpredictable startup behavior? Given the crystal works and the power-up timer is enabled, what explains this randomness? I would expect a linker bug to always work or always fail during a deterministic startup sequence. As can be seen from the test program, there are no inputs and no dependencies on uninitialized variables.
I hear what you're saying about the decision to upgrade compilers or not. It's important to point out the production software is doing some realtime processing that is very timing dependent. Upgrading compilers would require re-tuning some of that code. I'd be more inclined to consider an upgrade now if that wasn't the case.
I'm going to pull a dozen or so boards with this bug from production and will try to make them available to anyone (esp Microchip) who wants to experiment. I'm definitely going to revisit this in a few weeks when I have more time. Hopefully the solution will pop out.
I've learned a lot more about PICs and compilers through this bug. This thread has been a very rewarding online experience. Thank You. |
|
|
asmallri
Joined: 12 Aug 2004 Posts: 1635 Location: Perth, Australia
|
|
Posted: Sat Nov 20, 2004 9:06 am |
|
|
Quote: | but how to explain the non-deterministic nature? Why is the PIC exhibiting unpredictable startup behavior? |
One possibility is that the small segment of code the compiler is putting in very low memory is being corrupted or is overlaying one or more of the interrupt vectors. When this happens the PIC runs some unexpected code sequence. This code sequence might have an instruction that branches on the state of a flag or the contents of a memory location. As this code is corrupt you have no way of knowing if bits or memory were correctly initialised. The behaviour could appear random because it depends on the state of some uninitialised memory location or processor flag.
Quote: | I would expect a linker bug to always work or always fail during a deterministic startup sequence. |
No - your symptoms are very typical of a problem when a variable or flag is used that has not been initialised. You would expect a compiler to catch a lot of this but when the compiler is placing corrupt code into the PIC - well then anything is possible.
Quote: | As can be seen from the test program, there are no inputs and no dependencies on uninitialized variables. |
True for what you THINK the compiler is putting into memory, and assuming there are no other bugs being introduced by the compiler, but what is it really doing?
Quote: | I'm going to pull a dozen or so boards with this bug from production and will try to make them available to anyone (esp Microchip) who wants to experiment. |
I'm happy to have a look at one of these for you but to be honest I think your time would be better spent by disassembling the hex file of the small test program. You can do this by importing it with MPLAB into memory. Once you have done this you will be able to see what the code is actually telling the processor to do. The problem you have described is so classic in nature that I am sure you will find a corrupted branch instruction. _________________ Regards, Andrew
|
|