[PIC24HJ256GP210] Device ID becomes corrupted
canadidan



Joined: 13 Feb 2019
Posts: 24

Posted: Wed Feb 13, 2019 10:36 am

Research

When I search "device ID wrong", I find the following common causes:

• Issue with pulling MCLR low/high
• User trying to power PIC with programmer
• Other wiring issue with ICSP pins

These problems prevent people from programming the PIC. My problem is different.

My Issue

A perfectly working device will suddenly change identity - the device ID will change, but it will continue to function otherwise.

For this PIC, the device ID is 0x0073. After the corruption, it is 0x00FF, corresponding to a dsPIC33FJ256GP710.



Furthermore, if I tell my programmer the chip is a dsPIC33FJ256GP710 (not the PIC24HJ256GP210 it actually is), I can still load new software to the PIC's flash and run it (as shown above).

Cause

This has only occurred while running a custom "firmware upgrade" procedure: the new firmware is received over UART, stored in an external flash, the CRC is checked, then it is copied to the PIC flash using write_program_memory();
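
The post does not say which CRC the upgrade routine uses. As a rough sketch of the "check the CRC before copying to PIC flash" step, assuming CRC-16/CCITT-FALSE (polynomial 0x1021, initial value 0xFFFF) and the hypothetical function names `crc16_ccitt` and `image_crc_ok`, the check could look something like this in portable C:

```c
#include <stddef.h>
#include <stdint.h>

/* CRC-16/CCITT-FALSE: poly 0x1021, init 0xFFFF, no reflection, no final XOR.
   Assumed variant; the thread does not name the CRC actually used. */
uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000u)
                crc = (uint16_t)((crc << 1) ^ 0x1021u);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Returns 1 only if the staged image matches the CRC sent with it. */
int image_crc_ok(const uint8_t *image, size_t len, uint16_t expected_crc)
{
    return crc16_ccitt(image, len) == expected_crc;
}
```

The idea is that nothing is erased or written in PIC flash unless `image_crc_ok` returns 1 for the image held in external flash.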

Frequency

I have run this firmware upgrade routine on over 40 units and in excess of 500 times. This corruption has occurred on a total of 5 units.

Of them, 2 units were corrupted the first time I ran the upgrade procedure. The other 3 had been upgraded dozens of times before eventually being corrupted (also during/after this upgrade procedure).

Steps tried so far to recover these devices

As shown above, I first tried reading the Device ID from the device itself following these instructions:

https://www.ccsinfo.com/forum/viewtopic.php?t=43278

Then, I tried overwriting this area of flash (using both ASM and C code methods), but that hasn't worked. I suspect those addresses are protected.

Has anyone seen this before, and how could I correct the device ID?

Thank you,
Dan


Last edited by canadidan on Wed Feb 13, 2019 2:24 pm; edited 2 times in total
temtronic



Joined: 01 Jul 2010
Posts: 9243
Location: Greensville,Ontario

Posted: Wed Feb 13, 2019 10:55 am

OK, I don't have that PIC ...
but
any chance the programming cable length is the 'random' issue ?
or..
any chance the 5 bad ones have the same date/batch code info ??
Ttelmah



Joined: 11 Mar 2010
Posts: 19539

Posted: Wed Feb 13, 2019 11:13 am

Device IDs are 16-bit values, not 8-bit values.

0xFF is not the ID for the DsPIC33HJ256GP710. Its ID is 0x7FF.
0xFF suggests the area has been erased. 0xFF is generically what the
memory erases to.

How are you reading this? You talk about an on-board programming
system. Is this being read internally from the chip or using a programmer?

The device ID is protected, but can be destroyed if there is a power
spike during programming.

Normally a device ID failure with an external programmer is a connection
problem: too much capacitance on a line, a bad connection, or incorrect
supplies. You'll also get this if the Vcap signal is not being generated while
programming.

If the device ID is destroyed, it can't be rewritten.
canadidan



Joined: 13 Feb 2019
Posts: 24

Posted: Wed Feb 13, 2019 12:55 pm

@temtronic

We have 4 CCS programmers, with a variety of custom cables and a pogo-pin fixture for automated programming. Once the device ID is corrupted, it is repeatable across programmers.

I've checked 3 close to me, and the date/batch codes are varied:
2x 1811BTU
1x 1823JJ4

@Ttelmah

The on-board programming system does the following:

1. Erases a defined "application" region in flash:

Code:

void BIOS_Erase_CPU_Flash_FMWR_Sector(){
   unsigned int32 address_erase;
   // Erase the application region one flash page at a time, up to BIOS_ADDR
   for(address_erase=0x00400;address_erase<BIOS_ADDR;address_erase+=(getenv("FLASH_ERASE_SIZE")/2))
      erase_program_memory(address_erase);
}


2. Uses write_program_memory() to write to this same application region.

Reading Device ID

I have read the device ID 3 ways:

* Using CCSLOAD
* Using the assembly code from the post I linked in OP
* Using the read_program_memory() function

CCSLOAD



Assembly

Code:

      #asm
         mov #0x00FF, W0
         mov W0,    TBLPAG
         
         mov #0x0000, W1         
         tblrdl [W1],W0
         mov W0, devid           
         tblrdh [W1],W0
         mov W0, devid+2
         
         mov #0x0002, W1           
         tblrdl [W1],W0
         mov W0, rev           
         tblrdh [W1],W0
         mov W0, rev+2
      #endasm

      printf("devid = 0x%LX, rev = 0x%LX\r\n", devid, rev);


C code

Code:

      unsigned int8 mem_buffer[2];
      read_program_memory(0x00FF0000,mem_buffer,2);
      // Low byte first, then the high byte shifted into the upper 8 bits
      unsigned int16 id = mem_buffer[0] | ((unsigned int16)mem_buffer[1] << 8);
      printf("devid = 0x%LX\r\n", id);
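
The byte-combining step is easy to get wrong, so here is a small portable C sketch of just that step, kept separate from the CCS built-ins so it can be tried on a PC. The helper names `devid_from_bytes` and `devid_is_expected` are made up for illustration; the expected value 0x0073 is the one quoted in the first post for the PIC24HJ256GP210.

```c
#include <stdint.h>

#define EXPECTED_DEVID 0x0073u  /* PIC24HJ256GP210, per the first post */

/* Combine the two bytes of the low program word, low byte first. */
uint16_t devid_from_bytes(uint8_t low, uint8_t high)
{
    return (uint16_t)(low | ((uint16_t)high << 8));
}

/* Returns 1 if the value read back matches the expected device ID. */
int devid_is_expected(uint16_t devid)
{
    return devid == EXPECTED_DEVID;
}
```

With the two bytes filled in by read_program_memory(), a call such as `devid_is_expected(devid_from_bytes(mem_buffer[0], mem_buffer[1]))` would flag a corrupted part.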


Voltages / Connections / VCap

This failure has never happened during ICSP programming so I'm not sure how it could be a programmer/VCap issue.

Also, these devices program successfully when I lie about the target device type.


Could erase_program_memory() somehow erase the Device ID too, despite protection?
Ttelmah



Joined: 11 Mar 2010
Posts: 19539

Posted: Wed Feb 13, 2019 1:46 pm

There is an issue with a very close member of the PIC family, where the
write 'stall' does not function correctly. It is vital that interrupts are
disabled during a program memory write, and on chips with this problem
the code should poll the WR bit to verify the write has completed.
Worth ensuring you have interrupts disabled, and add this check after
the write.

Devid 0x00FF is actually a dsPIC33FJ256GP710. FJ, not the HJ.

It should be impossible to write the DEVID except by damaging the
cells.
canadidan



Joined: 13 Feb 2019
Posts: 24

Posted: Wed Feb 13, 2019 2:22 pm

@Ttelmah

My mistake - couldn't remember if it was FJ or HJ.

Interrupts

Before it initiates the erase/write to flash, it disables interrupts:

Code:
Disable_Interrupts(INTR_GLOBAL);


Should I do anything in addition to this?

Poll WR bit

This is good to know, and I will implement this! It may be difficult to test in the short term, but I will return with the long-term results.

Code:

void BIOS_Erase_CPU_Flash_FMWR_Sector(){
   unsigned int32 address_erase;
   for(address_erase=0x00400;address_erase<BIOS_ADDR;address_erase+=(getenv("FLASH_ERASE_SIZE")/2))
   {
      erase_program_memory(address_erase);
      while(bit_test(NVMCON, 15));   // wait for WR (NVMCON bit 15) to clear
   }
}


where

Code:

#WORD NVMCON = 0x0760


Edit: I originally had the wrong address for NVMCON. I took 0x0728 from here: https://www.ccsinfo.com/forum/viewtopic.php?t=54366

According to the datasheet, it is 0x0760.
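
One possible refinement, sketched here in portable C rather than CCS C, is to bound the wait so a stuck WR bit cannot hang the upgrade routine. The register is passed in as a pointer so the logic can be exercised off-target; `wait_wr_clear` and `max_polls` are hypothetical names, and on the PIC the pointer would refer to NVMCON.

```c
#include <stdint.h>

#define NVMCON_WR_MASK 0x8000u  /* WR is bit 15 of NVMCON */

/* Poll until WR clears or max_polls reads have been made.
   Returns 1 if WR was seen clear, 0 on timeout. */
int wait_wr_clear(volatile const uint16_t *nvmcon, uint32_t max_polls)
{
    while (max_polls-- > 0) {
        if ((*nvmcon & NVMCON_WR_MASK) == 0)
            return 1;
    }
    return 0;
}
```

A return value of 0 would be a reason to abort the upgrade and report the failure rather than carry on to the next page.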
canadidan



Joined: 13 Feb 2019
Posts: 24

Posted: Fri Feb 15, 2019 8:18 am

Here are the results of my testing, with the above changes:

Test setup

I created a "firmware upgrade stress test" routine - which continually runs our firmware upgrade method over UART until it fails.
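
In outline, such a stress loop might be structured as below. This is a portable C sketch, not the code used in these tests: `stress_upgrade` and its callbacks are hypothetical names, with the callbacks standing in for the UART upgrade and the device ID read. The `sim_` functions are a made-up simulation so the loop can be tried on a PC.

```c
#include <stdint.h>

typedef int (*upgrade_fn)(void);          /* returns nonzero on success */
typedef uint16_t (*read_devid_fn)(void);  /* returns the device ID read back */

/* Repeat the upgrade until it fails, the device ID changes, or
   max_cycles is reached. Returns the number of cycles that completed
   with the expected device ID still in place. */
uint32_t stress_upgrade(upgrade_fn run_upgrade, read_devid_fn read_devid,
                        uint16_t expected_devid, uint32_t max_cycles)
{
    uint32_t good_cycles = 0;
    while (good_cycles < max_cycles) {
        if (!run_upgrade())
            break;
        if (read_devid() != expected_devid)
            break;
        good_cycles++;
    }
    return good_cycles;
}

/* Simulated part, purely illustrative: the upgrade always succeeds and
   the ID reads back as 0x00FF from the 4th read onward. */
uint32_t sim_reads = 0;
int sim_upgrade(void) { return 1; }
uint16_t sim_read_devid(void)
{
    sim_reads++;
    return (sim_reads > 3) ? 0x00FFu : 0x0073u;
}
```

Starting from a fresh simulation state, `stress_upgrade(sim_upgrade, sim_read_devid, 0x0073, 100)` stops after 3 good cycles.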

Test without checking WR

The device ID was corrupted after 25 cycles (date code 1823JJ4)

Test with checking WR

The device ID was corrupted after 106 cycles (date code 1823JJ4)

What does this mean?

With such a low sample size, it doesn't mean very much - except that the fix wasn't sufficient. Maybe it delayed it, or maybe silicon variances just delayed the inevitable.

Next steps

I will modify the erase code to stop at the end address of the newly provided firmware, rather than erasing all of program space. It means old code might be left behind, but it is safer than wiping straight to the end and possibly damaging the device ID.
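
A minimal sketch of that range calculation, in portable C so it can be checked on a PC. The function names are hypothetical, addresses are in program-memory address units (as in the erase loop earlier in the thread), `limit` plays the role of BIOS_ADDR, and the page size should be taken from getenv("FLASH_ERASE_SIZE")/2 on the real part rather than assumed.

```c
#include <stdint.h>

/* Round the end of the new image up to the next erase-page boundary,
   but never past the start of the protected (BIOS) region.
   page must be nonzero; app_start and limit are assumed page-aligned. */
uint32_t erase_end_address(uint32_t app_start, uint32_t image_len,
                           uint32_t page, uint32_t limit)
{
    uint32_t end = app_start + image_len;
    uint32_t rounded = ((end + page - 1) / page) * page;
    return (rounded > limit) ? limit : rounded;
}

/* Number of page erases needed to cover [app_start, erase_end). */
uint32_t erase_page_count(uint32_t app_start, uint32_t erase_end,
                          uint32_t page)
{
    return (erase_end - app_start) / page;
}
```

The erase loop would then run from the application start to `erase_end_address(...)` instead of to BIOS_ADDR, so pages beyond the new image are left untouched.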

I have a whole bag of previous-gen units, so I will continue investigating.
newguy



Joined: 24 Jun 2004
Posts: 1909

Posted: Fri Feb 15, 2019 8:23 am

Throw in a little extra delay after the WR bit falls. Curious if a bit of extra time makes any difference.

That said, I do think that your plan to only erase and rewrite memory to the end of the new image, instead of to the end of the program space, will probably get rid of the issue.
temtronic



Joined: 01 Jul 2010
Posts: 9243
Location: Greensville,Ontario

Posted: Fri Feb 15, 2019 9:15 am

Any chance it's a 'marginal' power supply issue ?
gremlins and gators are NOT fun..
canadidan



Joined: 13 Feb 2019
Posts: 24

Posted: Fri Feb 15, 2019 10:16 am

@newguy

I will explore additional delay! Currently, I'm running a test with the erase routine completely bypassed, to see if it is even the source of the issue.

@temtronic

The 3V3 has been really good in our design - I've done MTBF testing for weeks straight, with noisy stepper drivers running constantly, without a single reset.



We have all decoupling caps, placed close to the pins.



Do you feel we should take additional steps to improve this?
Ttelmah



Joined: 11 Mar 2010
Posts: 19539

Posted: Fri Feb 15, 2019 12:41 pm

Are you _sure_ about the ESR of your Vcap capacitor? This is a parameter
that can cause really silly errors. I had a whole 'batch' of similar chips that
destroyed their IDs when writing to the program memory as simulated
EEPROM. It turned out the people assembling the boards had used a
substitute for this capacitor. It appears that the write does impose some
exceptional spikes on this line....
canadidan



Joined: 13 Feb 2019
Posts: 24

Posted: Fri Feb 15, 2019 1:50 pm

@Ttelmah

That's really interesting and helpful! I hadn't thought about that actually. This aspect of the design pre-dates my inheritance of the project.

From the datasheet:


We are using a CL21X106MOQNNNE from Samsung:
* 16V
* 10uF
* ESR below 1 Ohm
* Trace length approx. 2mm



You made me think that maybe the MLCC shortage had forced me to order alternates. After checking my order history, this is not the case. I've always ordered the same part. Our design has gone through 2 generations / 4 revisions, and each time I've ordered new batches of components. Surely QC isn't that bad.. but perhaps?

Since it is worth investigating, I'll order a variety of 4.7uF and 10uF caps from other vendors with low ESR and compare.
Ttelmah



Joined: 11 Mar 2010
Posts: 19539

Posted: Fri Feb 15, 2019 1:57 pm

Other thing is if the failing chips are all from one batch, you may
simply have faulty chips. Does happen...
canadidan



Joined: 13 Feb 2019
Posts: 24

Posted: Fri Feb 15, 2019 2:22 pm

Occam's razor, right..

Batch Date Codes

For units built since last year, they all have the same two date codes with equal failures from each:

3x 1811BTU / 3x 1823JJ4

Corruption has occurred in both.

I found some very early prototypes from 2014/2015 - I'll subject those chips to the same test and see.

Testing

Here is what has been tested so far:

* Original code: corrupts Device ID (25 cycles)
* Wait for WR: corrupts Device ID (106 cycles)
* Skip the erase: inconclusive (didn't fail after 96 cycles)

Here is what's to come:

* Try different VCap from other vendors
* Try original code on 2014 and 2015 batch ICs

Thanks to all of you for the help; I personally really appreciate it.
dexta64



Joined: 19 Feb 2019
Posts: 11

Posted: Tue Feb 19, 2019 2:26 pm



I cannot confirm the GND connections of capacitors C21, C22 and C24; they do not appear to be connected. Can you measure them?

If you have a switching regulator, ESR will be a serious problem for you.

Check the PCB of a hard disk. It has two motors running and doing their job at all times, which is why getting the PCB design right matters.

Also, see the PIC24HJXXXGPX06/X08/X10 Family Silicon Errata and Data Sheet Clarification, page 15:

"32. Module: Device ID Register

On a few devices, the content of the Device ID register can change from the factory programmed default value immediately after RTSP or ICSP™ Flash programming. As a result, development tools will not recognize these devices and will generate an error message indicating that the device ID and the device part number do not match. Additionally, some peripherals will be reconfigured and will not function as described in the device data sheet."