|
|
View previous topic :: View next topic |
Author |
Message |
curt2go
Joined: 21 Nov 2003 Posts: 200
|
math speed help. |
Posted: Mon Sep 30, 2019 9:11 pm |
|
|
I have a routine that I need to make faster if possible. It's an older routine which I got some good tips to get it this fast to begin with, like using the wrapper union.
-- Processor is a 24EP256GP206 running @ 140MHz.
-- Newest compiler.
-- The scales are supposed to add up to 65535 in normal cases.
-- I am audio summing and normalizing the sound from 3 files.
-- Right now this routine takes 1.6ms in the real world. I can get it way down if I set the scales to an unsigned int16 but then the math does not work properly.
-- Any help would be awesome.
Code: |
unsigned int f;
signed int16 buffer[512];
signed int16 buffer2[512];
signed int16 buffer3[512];
signed int32 scale1 = 4567;
signed int32 scale2 = 16567;
signed int32 scale3 = 28300;
union {
signed int32 wrapper;
signed int16 parts[2];
} value;
union {
signed int32 wrapper;
signed int16 parts[2];
} value2;
union {
signed int32 wrapper;
signed int16 parts[2];
} value3;
for(f=0;f<512;f++){
value.wrapper = buffer[f])*scale1;
value2.wrapper = buffer2[f])*scale2;
value3.wrapper = buffer3[f])*scale3;
buffer[f] = value.parts[1] + value2.parts[1] + value3.parts[1];
} |
|
|
|
PCM programmer
Joined: 06 Sep 2003 Posts: 21708
|
|
Posted: Mon Sep 30, 2019 11:38 pm |
|
|
Post the .LST file code for the for() loop only. Then we can see what might be done. |
|
|
Ttelmah
Joined: 11 Mar 2010 Posts: 19535
|
|
Posted: Tue Oct 01, 2019 12:39 am |
|
|
Are the values ever actually 'signed'?.
It takes significantly more time to 'extend' a 16bit signed value to 32bits
than it does to do the same for an unsigned value.
Your code involves taking 1536 signed int16 values, extending each to
32bits, then performing a signed int32 multiply. If you change the values
to all be unsigned (in the declarations and in the unions), you will see perhaps
a 25% improvement straight away.
The scales need to be 32bit, since it is the fact that these are 32bit, that
forces 32bit maths to be used. If you change these to int16, you only use
16bit maths, and since you are using the upper 16bits of the 32bit result,
this will never work. You could use unsigned int16 for these, but only
by casting them to int32 when used, which is another operation and will
slow things more. |
|
|
curt2go
Joined: 21 Nov 2003 Posts: 200
|
|
Posted: Tue Oct 01, 2019 3:00 pm |
|
|
i will post the lst file section.
Yes I do have negative numbers since its audio files. I'm not sure there is any way to treat them as unsigned or not. Been racking my brain on this one.
Even trying not using a range between 0-65535 and going to 0-16 for the scales so I could just do a bitwise multiplication but it seems that takes a long time as well. Probably once again because of the signed numbers. |
|
|
curt2go
Joined: 21 Nov 2003 Posts: 200
|
|
Posted: Tue Oct 01, 2019 3:08 pm |
|
|
Code: |
for(f=0;f<512;f++){
0A03E: MOV #0,W4
0A040: MOV W4,783A
0A042: MOV 783A,W0
0A044: MOV #200,W4
0A046: CP W4,W0
0A048: BRA LEU,A0C6
.................... value.wrapper = buffer[f]*scale1;
0A04A: MOV 783A,W0
0A04C: SL W0,#1,W0
0A04E: MOV #7848,W4
0A050: ADD W0,W4,W0
0A052: MOV [W0],W5
0A054: CLR W6
0A056: BTSC W5.F
0A058: SETM W6
0A05A: MOV W5,W0
0A05C: MOV W6,W1
0A05E: MOV 7C48,W2
0A060: MOV 7C4A,W3
0A062: CALL 9F3C
0A066: MOV W0,783C
0A068: MOV W1,783E
.................... value2.wrapper = buffer2[f]*scale3;
0A06A: MOV 783A,W0
0A06C: SL W0,#1,W0
0A06E: MOV #8000,W4
0A070: ADD W0,W4,W0
0A072: MOV [W0],W5
0A074: CLR W6
0A076: BTSC W5.F
0A078: SETM W6
0A07A: MOV W5,W0
0A07C: MOV W6,W1
0A07E: MOV 7C50,W2
0A080: MOV 7C52,W3
0A082: CALL 9F3C
0A086: MOV W0,7840
0A088: MOV W1,7842
.................... value3.wrapper = buffer3[f]*scale3;
0A08A: MOV 783A,W0
0A08C: SL W0,#1,W0
0A08E: MOV #8400,W4
0A090: ADD W0,W4,W0
0A092: MOV [W0],W5
0A094: CLR W6
0A096: BTSC W5.F
0A098: SETM W6
0A09A: MOV W5,W0
0A09C: MOV W6,W1
0A09E: MOV 7C50,W2
0A0A0: MOV 7C52,W3
0A0A2: CALL 9F3C
0A0A6: MOV W0,7844
0A0A8: MOV W1,7846
.................... buffer[f] = value.parts[1] + value2.parts[1] + value3.parts[1];
0A0AA: MOV 783A,W0
0A0AC: SL W0,#1,W0
0A0AE: MOV #7848,W4
0A0B0: ADD W0,W4,W5
0A0B2: MOV 783E,W0
0A0B4: MOV 7842,W4
0A0B6: ADD W0,W4,W6
0A0B8: MOV 7846,W0
0A0BA: ADD W6,W0,[W5]
0A0BC: MOV 783A,W0
0A0BE: INC W0,W0
0A0C0: MOV W0,783A
0A0C2: GOTO A042
.................... } |
|
|
|
alan
Joined: 12 Nov 2012 Posts: 357 Location: South Africa
|
|
Posted: Wed Oct 02, 2019 12:37 am |
|
|
Change your scale declaration to int16 instead of int32.
Loop gets much smaller, hardware will return int32 when multiplying two int16. |
|
|
Ttelmah
Joined: 11 Mar 2010 Posts: 19535
|
|
Posted: Wed Oct 02, 2019 12:47 am |
|
|
No it won't.
If you multiply two int16 values you get an int16 result.
This is not a chip with a hardware maths unit. Overflow has to be done
in software not hardware. |
|
|
alan
Joined: 12 Nov 2012 Posts: 357 Location: South Africa
|
|
Posted: Wed Oct 02, 2019 12:55 am |
|
|
Sorry then if I created confusion, I have read this then wrong:
Quote: | Core: 16-Bit dsPIC33E/PIC24E CPU
• Code Efficient (C and Assembly) Architecture
• Two 40-Bit-Wide Accumulators
• Single Cycle (MAC/MPY) with Dual Data Fetch
• Single-Cycle, Mixed-Sign MUL plus Hardware Divide
• 32-Bit Multiply Support |
|
|
|
Ttelmah
Joined: 11 Mar 2010 Posts: 19535
|
|
Posted: Wed Oct 02, 2019 1:11 am |
|
|
Key is part of the data sheet:
Quote: |
These instructions are available in dsPIC33EPXXXMC20X/50X and PIC24EPXXXMC20X devices only.
|
This for the instructions using the accumulators.
The chip he has does not have the extended accumulators. |
|
|
alan
Joined: 12 Nov 2012 Posts: 357 Location: South Africa
|
|
Posted: Wed Oct 02, 2019 3:47 am |
|
|
OK missed that, thanks Ttelmah.
Could you maybe explain what is the difference then between these instructions, except that one uses the accumulator and the other one two 16 bit registers, might just run into problems in the future and at least I then wouldn't have to pull my hair out
Quote: | MUL MUL.SS Wb,Ws,Wnd {Wnd + 1, Wnd} = signed(Wb) * signed(Ws)
MUL.SS Wb,Ws,Acc Accumulator = signed(Wb) * signed(Ws)
|
I did a small program and here are the listing:
Code: | .................... value.wrapper = buffer[f]*scale1;
00278: MOV 1002,W0
0027A: SL W0,#1,W0
0027C: MOV #1004,W4
0027E: ADD W0,W4,W0
00280: MOV [W0],W5
00282: MOV 1C04,W4
00284: MUL.SS W5,W4,W0
00286: CLR W1
00288: BTSC W0.F
0028A: SETM W1
0028C: MOV W0,1C0E
0028E: MOV W1,1C10 |
|
|
|
Ttelmah
Joined: 11 Mar 2010 Posts: 19535
|
|
Posted: Wed Oct 02, 2019 4:12 am |
|
|
Testing with a 24EP256MC206, that has the extended instruction, gives
841uSec runtime for the loop. So nearly twice as fast.
Now though this chip doesn't support this, it does have the ability
to generate a 32bit reply, using a register pair. So decided to code this:
Code: |
int f;
union access {
unsigned int32 whole;
unsigned int8 bytes[4];
};
union joiner {
unsigned int32 wrapper;
unsigned int16 parts[2];
};
struct vals {
unsigned int16 first;
unsigned int16 second;
unsigned int16 third;
};
signed int16 buffer[512];
signed int16 buffer2[512];
signed int16 buffer3[512];
unsigned int16 scale1 = 4567; //must now be unsigned int16
unsigned int16 scale2 = 16567;
unsigned int16 scale3 = 28300;
union joiner value;
union joiner value2;
union joiner value3;
//Now cheat code**************************
signed int32 res; //cheat here to allow 'res' to be easily accessed
//in machine code.
BYTE * ptr=&res;
#INLINE
void mul(unsigned int16 s1, signed int16 s2)
{ //assembler multiply to multiply a signed int16 by unsigned int16
//and generate the 32bit result
//However do not 'return' this, instead write directly to 'res'
#asm
MOV s1,w0 //two source values
MOV s2,w1
MOV ptr, w11 //where to put the result
MUL.SU w1,w0,W12 //Must be even register
//Now need to move the result from W12/W13 to RAM
MOV W12,[W11++]
MOV W13,[W11]
#endasm
return;
}
//Then testing with
for(f=0;f<512;f++)
{
mul(scale1,buffer[f]);
value.wrapper = res;
mul(scale2,buffer2[f]);
value2.wrapper = res;
mul(scale3,buffer3[f]);
value3.wrapper = res;
buffer[f] = value.parts[1] + value2.parts[1] + value3.parts[1];
}
|
Brings the loop down to 688uSec. To go any faster one would have to
code the loop itself in assembler.
Think the values are right. Tested the first few. |
|
|
curt2go
Joined: 21 Nov 2003 Posts: 200
|
|
Posted: Wed Oct 02, 2019 11:46 am |
|
|
Wow. Thank you. I will test this tonight. You are a mad man! That is so fast!!! |
|
|
curt2go
Joined: 21 Nov 2003 Posts: 200
|
|
Posted: Wed Oct 02, 2019 11:58 am |
|
|
Well i snuck in a test at work...
Looks like it is as fast as its supposed to be, but one thing i noticed is that with a full scale of 65535 , this will give double the value. It looks like this routine max scale is 32768, which is totally fine with me. I can just quickly scale the scales before entering the routine. Unless you have a very quick fix, but I can't see where the issue is. Once again thank you for your awesome efforts here. Its amazing! |
|
|
Ttelmah
Joined: 11 Mar 2010 Posts: 19535
|
|
Posted: Wed Oct 02, 2019 12:16 pm |
|
|
Since the multiply is of a signed value, the result has to be signed.
This is then in the top 16bits of the result, and copied out, so yes, the
result can only support up to 32768 as a result (and to -32767).
The 'parts' in the union really should be declared as signed int16, otherwise
sign will not be maintained in the addition. |
|
|
curt2go
Joined: 21 Nov 2003 Posts: 200
|
|
Posted: Wed Oct 02, 2019 12:30 pm |
|
|
That is just fine. I will adjust the scales accordingly. Thank you again for all your efforts. |
|
|
|
|
You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot vote in polls in this forum
|
Powered by phpBB © 2001, 2005 phpBB Group
|