CCS C Software and Maintenance Offers
FAQFAQ   FAQForum Help   FAQOfficial CCS Support   SearchSearch  RegisterRegister 

ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 

CCS does not monitor this forum on a regular basis.

Please do not post bug reports on this forum. Send them to CCS Technical Support

math speed help.

 
Post new topic   Reply to topic    CCS Forum Index -> General CCS C Discussion
View previous topic :: View next topic  
Author Message
curt2go



Joined: 21 Nov 2003
Posts: 200

View user's profile Send private message

math speed help.
PostPosted: Mon Sep 30, 2019 9:11 pm     Reply with quote

I have a routine that I need to make faster if possible. It's an older routine which I got some good tips to get it this fast to begin with, like using the wrapper union.
-- Processor is a 24EP256GP206 running @ 140MHz.
-- Newest compiler.
-- The scales are supposed to add up to 65535 in normal cases.

-- I am audio summing and normalizing the sound from 3 files.

-- Right now this routine takes 1.6ms in the real world. I can get it way down if I set the scales to an unsigned int16 but then the math does not work properly.

-- Any help would be awesome.
Code:

unsigned int f;
signed int16 buffer[512];
signed int16 buffer2[512];
signed int16 buffer3[512];
signed int32 scale1 = 4567;
signed int32 scale2 = 16567;
signed int32 scale3 = 28300;


union {
   signed int32 wrapper;
   signed int16 parts[2];
} value;

union {
   signed int32 wrapper;
   signed int16 parts[2];
} value2;

union {
   signed int32 wrapper;
   signed int16 parts[2];
} value3;


for(f=0;f<512;f++){   
     value.wrapper = buffer[f])*scale1;
     value2.wrapper = buffer2[f])*scale2;
     value3.wrapper = buffer3[f])*scale3;
     buffer[f] = value.parts[1] + value2.parts[1] + value3.parts[1];
}
PCM programmer



Joined: 06 Sep 2003
Posts: 21708

View user's profile Send private message

PostPosted: Mon Sep 30, 2019 11:38 pm     Reply with quote

Post the .LST file code for the for() loop only. Then we can see what might be done.
Ttelmah



Joined: 11 Mar 2010
Posts: 19587

View user's profile Send private message

PostPosted: Tue Oct 01, 2019 12:39 am     Reply with quote

Are the values ever actually 'signed'?.

It takes significantly more time to 'extend' a 16bit signed value to 32bits
than it does to do the same for an unsigned value.

Your code involves taking 1536 signed int16 values, extending each to
32bits, then performing a signed int32 multiply. If you change the values
to all be unsigned (in the declarations and in the unions), you will see perhaps
a 25% improvement straight away.

The scales need to be 32bit, since it is the fact that these are 32bit, that
forces 32bit maths to be used. If you change these to int16, you only use
16bit maths, and since you are using the upper 16bits of the 32bit result,
this will never work. You could use unsigned int16 for these, but only
by casting them to int32 when used, which is another operation and will
slow things more.
curt2go



Joined: 21 Nov 2003
Posts: 200

View user's profile Send private message

PostPosted: Tue Oct 01, 2019 3:00 pm     Reply with quote

i will post the lst file section.

Yes I do have negative numbers since its audio files. I'm not sure there is any way to treat them as unsigned or not. Been racking my brain on this one.

Even trying not using a range between 0-65535 and going to 0-16 for the scales so I could just do a bitwise multiplication but it seems that takes a long time as well. Probably once again because of the signed numbers.
curt2go



Joined: 21 Nov 2003
Posts: 200

View user's profile Send private message

PostPosted: Tue Oct 01, 2019 3:08 pm     Reply with quote

Code:

for(f=0;f<512;f++){   
0A03E:  MOV     #0,W4
0A040:  MOV     W4,783A
0A042:  MOV     783A,W0
0A044:  MOV     #200,W4
0A046:  CP      W4,W0
0A048:  BRA     LEU,A0C6
.................... value.wrapper = buffer[f]*scale1;
0A04A:  MOV     783A,W0
0A04C:  SL      W0,#1,W0
0A04E:  MOV     #7848,W4
0A050:  ADD     W0,W4,W0
0A052:  MOV     [W0],W5
0A054:  CLR     W6
0A056:  BTSC    W5.F
0A058:  SETM    W6
0A05A:  MOV     W5,W0
0A05C:  MOV     W6,W1
0A05E:  MOV     7C48,W2
0A060:  MOV     7C4A,W3
0A062:  CALL    9F3C
0A066:  MOV     W0,783C
0A068:  MOV     W1,783E
.................... value2.wrapper = buffer2[f]*scale3;
0A06A:  MOV     783A,W0
0A06C:  SL      W0,#1,W0
0A06E:  MOV     #8000,W4
0A070:  ADD     W0,W4,W0
0A072:  MOV     [W0],W5
0A074:  CLR     W6
0A076:  BTSC    W5.F
0A078:  SETM    W6
0A07A:  MOV     W5,W0
0A07C:  MOV     W6,W1
0A07E:  MOV     7C50,W2
0A080:  MOV     7C52,W3
0A082:  CALL    9F3C
0A086:  MOV     W0,7840
0A088:  MOV     W1,7842
.................... value3.wrapper = buffer3[f]*scale3;
0A08A:  MOV     783A,W0
0A08C:  SL      W0,#1,W0
0A08E:  MOV     #8400,W4
0A090:  ADD     W0,W4,W0
0A092:  MOV     [W0],W5
0A094:  CLR     W6
0A096:  BTSC    W5.F
0A098:  SETM    W6
0A09A:  MOV     W5,W0
0A09C:  MOV     W6,W1
0A09E:  MOV     7C50,W2
0A0A0:  MOV     7C52,W3
0A0A2:  CALL    9F3C
0A0A6:  MOV     W0,7844
0A0A8:  MOV     W1,7846
.................... buffer[f] = value.parts[1] + value2.parts[1] + value3.parts[1];
0A0AA:  MOV     783A,W0
0A0AC:  SL      W0,#1,W0
0A0AE:  MOV     #7848,W4
0A0B0:  ADD     W0,W4,W5
0A0B2:  MOV     783E,W0
0A0B4:  MOV     7842,W4
0A0B6:  ADD     W0,W4,W6
0A0B8:  MOV     7846,W0
0A0BA:  ADD     W6,W0,[W5]
0A0BC:  MOV     783A,W0
0A0BE:  INC     W0,W0
0A0C0:  MOV     W0,783A
0A0C2:  GOTO    A042
.................... }
alan



Joined: 12 Nov 2012
Posts: 357
Location: South Africa

View user's profile Send private message

PostPosted: Wed Oct 02, 2019 12:37 am     Reply with quote

Change your scale declaration to int16 instead of int32.

Loop gets much smaller, hardware will return int32 when multiplying two int16.
Ttelmah



Joined: 11 Mar 2010
Posts: 19587

View user's profile Send private message

PostPosted: Wed Oct 02, 2019 12:47 am     Reply with quote

No it won't.
If you multiply two int16 values you get an int16 result.
This is not a chip with a hardware maths unit. Overflow has to be done
in software not hardware.
alan



Joined: 12 Nov 2012
Posts: 357
Location: South Africa

View user's profile Send private message

PostPosted: Wed Oct 02, 2019 12:55 am     Reply with quote

Sorry then if I created confusion, I have read this then wrong:
Quote:
Core: 16-Bit dsPIC33E/PIC24E CPU
• Code Efficient (C and Assembly) Architecture
• Two 40-Bit-Wide Accumulators
• Single Cycle (MAC/MPY) with Dual Data Fetch
• Single-Cycle, Mixed-Sign MUL plus Hardware Divide
• 32-Bit Multiply Support
Ttelmah



Joined: 11 Mar 2010
Posts: 19587

View user's profile Send private message

PostPosted: Wed Oct 02, 2019 1:11 am     Reply with quote

Key is part of the data sheet:

Quote:

These instructions are available in dsPIC33EPXXXMC20X/50X and PIC24EPXXXMC20X devices only.


This for the instructions using the accumulators.

The chip he has does not have the extended accumulators.
alan



Joined: 12 Nov 2012
Posts: 357
Location: South Africa

View user's profile Send private message

PostPosted: Wed Oct 02, 2019 3:47 am     Reply with quote

OK missed that, thanks Ttelmah.
Could you maybe explain what is the difference then between these instructions, except that one uses the accumulator and the other one two 16 bit registers, might just run into problems in the future and at least I then wouldn't have to pull my hair out Very Happy
Quote:
MUL MUL.SS Wb,Ws,Wnd {Wnd + 1, Wnd} = signed(Wb) * signed(Ws)

MUL.SS Wb,Ws,Acc Accumulator = signed(Wb) * signed(Ws)


I did a small program and here are the listing:
Code:
....................      value.wrapper = buffer[f]*scale1;
00278:  MOV     1002,W0
0027A:  SL      W0,#1,W0
0027C:  MOV     #1004,W4
0027E:  ADD     W0,W4,W0
00280:  MOV     [W0],W5
00282:  MOV     1C04,W4
00284:  MUL.SS  W5,W4,W0
00286:  CLR     W1
00288:  BTSC    W0.F
0028A:  SETM    W1
0028C:  MOV     W0,1C0E
0028E:  MOV     W1,1C10
Ttelmah



Joined: 11 Mar 2010
Posts: 19587

View user's profile Send private message

PostPosted: Wed Oct 02, 2019 4:12 am     Reply with quote

Testing with a 24EP256MC206, that has the extended instruction, gives
841uSec runtime for the loop. So nearly twice as fast.
Now though this chip doesn't support this, it does have the ability
to generate a 32bit reply, using a register pair. So decided to code this:
Code:

int f;
union access {
    unsigned int32 whole;
    unsigned int8 bytes[4];
};

union joiner {
   unsigned int32 wrapper;
   unsigned int16 parts[2]; 
};

struct vals {
    unsigned int16 first;
    unsigned int16 second;
    unsigned int16 third;
};

signed int16 buffer[512];
signed int16 buffer2[512];
signed int16 buffer3[512];

unsigned int16 scale1 = 4567; //must now be unsigned int16
unsigned int16 scale2 = 16567;
unsigned int16 scale3 = 28300;

union joiner value;
union joiner value2;
union joiner value3;

//Now cheat code**************************
signed int32 res; //cheat here to allow 'res' to be easily accessed
//in machine code.
BYTE * ptr=&res;

#INLINE
void mul(unsigned int16 s1, signed int16 s2)
{ //assembler multiply to multiply a signed int16 by unsigned int16
    //and generate the 32bit result
    //However do not 'return' this, instead write directly to 'res'
    #asm
    MOV s1,w0 //two source values
    MOV s2,w1
    MOV ptr, w11 //where to put the result
    MUL.SU w1,w0,W12  //Must be even register
    //Now need to move the result from W12/W13 to RAM
    MOV W12,[W11++]
    MOV W13,[W11]
    #endasm
    return;
}

//Then testing with

    for(f=0;f<512;f++)
    {
       mul(scale1,buffer[f]);
       value.wrapper = res;
       mul(scale2,buffer2[f]);
       value2.wrapper = res;
       mul(scale3,buffer3[f]);
       value3.wrapper = res;
       
       buffer[f] = value.parts[1] + value2.parts[1] + value3.parts[1];
    }


Brings the loop down to 688uSec. To go any faster one would have to
code the loop itself in assembler.

Think the values are right. Tested the first few.
curt2go



Joined: 21 Nov 2003
Posts: 200

View user's profile Send private message

PostPosted: Wed Oct 02, 2019 11:46 am     Reply with quote

Wow. Thank you. I will test this tonight. You are a mad man! That is so fast!!!
curt2go



Joined: 21 Nov 2003
Posts: 200

View user's profile Send private message

PostPosted: Wed Oct 02, 2019 11:58 am     Reply with quote

Well i snuck in a test at work... Smile

Looks like it is as fast as its supposed to be, but one thing i noticed is that with a full scale of 65535 , this will give double the value. It looks like this routine max scale is 32768, which is totally fine with me. I can just quickly scale the scales before entering the routine. Unless you have a very quick fix, but I can't see where the issue is. Once again thank you for your awesome efforts here. Its amazing!
Ttelmah



Joined: 11 Mar 2010
Posts: 19587

View user's profile Send private message

PostPosted: Wed Oct 02, 2019 12:16 pm     Reply with quote

Since the multiply is of a signed value, the result has to be signed.
This is then in the top 16bits of the result, and copied out, so yes, the
result can only support up to 32768 as a result (and to -32767).
The 'parts' in the union really should be declared as signed int16, otherwise
sign will not be maintained in the addition.
curt2go



Joined: 21 Nov 2003
Posts: 200

View user's profile Send private message

PostPosted: Wed Oct 02, 2019 12:30 pm     Reply with quote

That is just fine. I will adjust the scales accordingly. Thank you again for all your efforts.
Display posts from previous:   
Post new topic   Reply to topic    CCS Forum Index -> General CCS C Discussion All times are GMT - 6 Hours
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © 2001, 2005 phpBB Group