Thursday, June 14, 2018

flac on steroids - Part 1

Today I'd like to share my little "flac on steroids" project.
...obviously inspired by "sox on steroids" ;)

I was triggered by the recent benchmark announcements on Phoronix.

The promise: flac now delivers 5% faster encoding and decoding
by introducing a faster CRC algorithm.

That sounds nice! 

Let's have a closer look at it.






There hasn't been much development on flac lately. Great that somebody made the effort.

The main issue you'll face: 

How to get the flac beast with that updated CRC algorithm on your machine!?!?

Bad luck for most of you. You have to wait. 

You'd need a flac version newer than 1.3.2 installed to get that feature.
1.3.3 is not even released yet.
And once it is released, your OS maintainers will still need ages to package it.
For LMS users it'll take even longer.

So. 99.99% of you won't have the pleasure to enjoy the extra power for now.


Ok. What now? As usual. If you want bleeding edge stuff, there's no other way
than building the binary yourself. It's pretty straightforward though.



Some background info on what affects the performance of the flac binary.

flac offers several options to seriously improve its performance - just from the code perspective!
E.g. flac can make use of SSE, SSE2 and AVX2. These CPU features only exist on x86 (Intel/AMD) platforms though!
Ever wondered why flac is that slow on an RPi?
Furthermore, flac can make use of C++ or assembler (nasm) code.
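
If you're curious which of these instruction sets your own CPU advertises, a quick check on Linux could look like the sketch below. The function name simd_flags is made up for this post, and it assumes an x86 style /proc/cpuinfo with a "flags" line. On an RPi there is no such line, so it simply prints nothing.

```shell
# Print which of the SIMD extensions mentioned above the CPU advertises.
# Assumption: an x86 Linux box whose /proc/cpuinfo has a "flags" line.
# An alternative cpuinfo file can be passed as the first argument.
simd_flags() {
  cpuinfo="${1:-/proc/cpuinfo}"
  grep -m1 '^flags' "$cpuinfo" 2>/dev/null \
    | tr ' ' '\n' \
    | grep -x -E 'sse|sse2|avx2' \
    | LC_ALL=C sort -u
}

simd_flags
```

This only tells you what the CPU could do. Whether your installed flac binary was actually built to use it is a different question.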

There are quite a few variables in play.
The usual issue: You just don't know how your flac was compiled and whether it makes use of any of these "turbos".


Bottom line: The way flac gets compiled - and that includes the target CPU architecture - can have a huge impact on its performance!
I consider compiling it yourself a pretty good idea!


I ran my own compiled flac on my Intel NUC with all performance options switched on.




Let's have a look at the benchmark. 

I am gonna try to reproduce the promised results of "+5%" first.


BTW: 
As a benchmark tool I'm using "perf" now. It seems to be more reliable and more precise
than e.g. "time", which I used for benchmarking sox earlier.

Preps:


  1. I reinstalled the Ubuntu flac and libs (dynamically linked binary)
  2. I then downloaded the Ubuntu flac sources and did a static compilation
  3. And I fetched the flac sources from git and compiled them statically
  4. I ran the encode and decode benchmarks

And here comes the result:

Binary = /tmp/flac-1.3.2-ubu
Performance counter stats for '/tmp/flac-1.3.2-ubu --totally-silent --compression-level-5 -f -o /tmp/test16.flac.flac-1.3.2-ubu /tmp/test16.wav' (10 runs):
   1031,998175      task-clock (msec)         #    1,000 CPUs utilized            ( +-  0,07% )
             6      context-switches          #    0,006 K/sec                    ( +- 16,01% )
             1      cpu-migrations            #    0,001 K/sec                    ( +- 36,85% )
           192      page-faults               #    0,186 K/sec                    ( +-  0,41% )
 2.757.615.568      cycles                    #    2,672 GHz                      ( +-  0,07% )
 5.792.144.336      instructions              #    2,10  insn per cycle           ( +-  0,03% )
   423.397.735      branches                  #  410,270 M/sec                    ( +-  0,06% )
    11.845.109      branch-misses             #    2,80% of all branches          ( +-  0,03% )

   1,032314326 seconds time elapsed                                          ( +-  0,07% )

Binary = /tmp/flac-1.3.2-ubu-static
Performance counter stats for '/tmp/flac-1.3.2-ubu-static --totally-silent --compression-level-5 -f -o /tmp/test16.flac.flac-1.3.2-ubu-static /tmp/test16.wav' (10 runs):
   1046,480818      task-clock (msec)         #    1,000 CPUs utilized            ( +-  0,07% )
             5      context-switches          #    0,005 K/sec                    ( +- 14,30% )
             0      cpu-migrations            #    0,000 K/sec                    ( +- 44,72% )
           184      page-faults               #    0,176 K/sec                    ( +-  0,24% )
 2.801.189.305      cycles                    #    2,677 GHz                      ( +-  0,07% )
 4.776.156.386      instructions              #    1,71  insn per cycle           ( +-  0,03% )
   403.541.845      branches                  #  385,618 M/sec                    ( +-  0,06% )
    11.491.004      branch-misses             #    2,85% of all branches          ( +-  0,05% )

   1,046770327 seconds time elapsed                                          ( +-  0,07% )

Binary = /tmp/flac-git-static
Performance counter stats for '/tmp/flac-git-static --totally-silent --compression-level-5 -f -o /tmp/test16.flac.flac-git-static /tmp/test16.wav' (10 runs):
    923,622729      task-clock (msec)         #    1,000 CPUs utilized            ( +-  0,09% )
             4      context-switches          #    0,005 K/sec                    ( +- 18,62% )
             0      cpu-migrations            #    0,001 K/sec                    ( +- 33,33% )
           180      page-faults               #    0,195 K/sec                    ( +-  0,21% )
 2.472.003.020      cycles                    #    2,676 GHz                      ( +-  0,07% )
 5.108.543.740      instructions              #    2,07  insn per cycle           ( +-  0,03% )
   541.381.977      branches                  #  586,151 M/sec                    ( +-  0,05% )
    11.537.502      branch-misses             #    2,13% of all branches          ( +-  0,03% )

   0,923934894 seconds time elapsed      

Result:
The flac built from git sources encodes around 11% faster than both Ubuntu versions (repo binary and self-compiled), neither of which has the CRC optimizations applied: 0.924 s versus 1.032 s and 1.047 s.
An 11% gain for the CRC-improved binary. Nice! More than expected.
Somehow the binary compiled from the Ubuntu sources performs slightly worse than the dynamically linked Ubuntu version. Let's just accept that as it is. We made our case.
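
For those who want to check the math: the percentages follow directly from the "seconds time elapsed" lines. The two helpers below are just a sketch (the names time_saved and speed_gain are made up for this post). They show the two common ways of expressing such a gain, which is why I speak of "around 11%".

```shell
# Percentage of wall-clock time saved when going from $1 seconds to $2 seconds.
time_saved() {
  LC_ALL=C awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (old - new) / old * 100 }'
}

# Percentage of extra throughput: how much more work gets done per second.
speed_gain() {
  LC_ALL=C awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (old / new - 1) * 100 }'
}

# Encode: stock Ubuntu binary vs. git build (elapsed seconds from the perf output above)
time_saved 1.032314326 0.923934894   # prints 10.5
speed_gain 1.032314326 0.923934894   # prints 11.7

# Encode: self-compiled Ubuntu sources vs. git build
time_saved 1.046770327 0.923934894   # prints 11.7
```

Note that perf printed the numbers with a decimal comma on my system. For awk they need a decimal point.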

I then also did the decode test:

Binary = /tmp/flac-1.3.2-ubu
Performance counter stats for '/tmp/flac-1.3.2-ubu --totally-silent -d -f -o /tmp/test16.wav.flac-1.3.2-ubu /tmp/test16.flac' (10 runs):
    566,553464      task-clock (msec)         #    0,999 CPUs utilized            ( +-  0,24% )
             4      context-switches          #    0,007 K/sec                    ( +- 15,09% )
             0      cpu-migrations            #    0,000 K/sec                    ( +- 66,67% )
           128      page-faults               #    0,225 K/sec                    ( +-  0,50% )
 1.511.998.785      cycles                    #    2,669 GHz                      ( +-  0,16% )
 3.580.347.563      instructions              #    2,37  insn per cycle           ( +-  0,07% )
   214.363.822      branches                  #  378,365 M/sec                    ( +-  0,20% )
     5.272.298      branch-misses             #    2,46% of all branches          ( +-  0,05% )

   0,566851320 seconds time elapsed                                          ( +-  0,24% )

Binary = /tmp/flac-1.3.2-ubu-static
Performance counter stats for '/tmp/flac-1.3.2-ubu-static --totally-silent -d -f -o /tmp/test16.wav.flac-1.3.2-ubu-static /tmp/test16.flac' (10 runs):
    516,027060      task-clock (msec)         #    0,999 CPUs utilized            ( +-  0,97% )
             3      context-switches          #    0,006 K/sec                    ( +- 13,13% )
             0      cpu-migrations            #    0,000 K/sec                    ( +-100,00% )
           119      page-faults               #    0,231 K/sec                    ( +-  0,37% )
 1.363.596.089      cycles                    #    2,642 GHz                      ( +-  0,15% )
 3.378.787.107      instructions              #    2,48  insn per cycle           ( +-  0,08% )
   213.400.313      branches                  #  413,545 M/sec                    ( +-  0,21% )
     5.093.116      branch-misses             #    2,39% of all branches          ( +-  0,03% )

   0,516293944 seconds time elapsed                                          ( +-  0,97% )

Binary = /tmp/flac-git-static
Performance counter stats for '/tmp/flac-git-static --totally-silent -d -f -o /tmp/test16.wav.flac-git-static /tmp/test16.flac' (10 runs):
    488,574913      task-clock (msec)         #    0,999 CPUs utilized            ( +-  0,37% )
             2      context-switches          #    0,005 K/sec                    ( +- 20,10% )
             0      cpu-migrations            #    0,000 K/sec                  
           118      page-faults               #    0,241 K/sec                    ( +-  0,31% )
 1.297.780.573      cycles                    #    2,656 GHz                      ( +-  0,16% )
 3.044.344.214      instructions              #    2,35  insn per cycle           ( +-  0,09% )
   180.420.141      branches                  #  369,278 M/sec                    ( +-  0,24% )
     5.077.955      branch-misses             #    2,81% of all branches          ( +-  0,16% )

   0,488829035 seconds time elapsed                                          ( +-  0,37% )

Result:

On the decode task, the new CRC-optimized flac from git sources gained 14% against the stock dynamically linked Ubuntu binary. A lot more than the folks over at flac promised.
There's "just" a 5% gain against the Ubuntu sources compiled with "-O3 -march=broadwell". Decode and encode seem to be affected differently on the two Ubuntu-based binaries. Honestly, I don't feel motivated to look deeper into it for now.
It wouldn't add much of relevance to the actual story.
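
If you don't feel like reading the numbers off the perf output by hand, a few lines of awk can summarize such a log. This is only a sketch under my own assumptions: the function name summarize_perf_log is made up, the log is expected to contain the "Binary = ..." lines printed by my benchmark loop, and the first binary in the log serves as the baseline.

```shell
# Read a benchmark log on stdin and print, per binary, the elapsed seconds
# and the time saved relative to the first binary in the log.
# The decimal comma that perf printed on my system is converted to a point.
summarize_perf_log() {
  LC_ALL=C awk '
    /^Binary = /           { name = $3 }
    /seconds time elapsed/ {
      sub(",", ".", $1)
      if (base == "") base = $1
      printf "%s %.3f s %.1f%%\n", name, $1, (base - $1) / base * 100
    }
  '
}

summarize_perf_log <<'EOF'
Binary = /tmp/flac-1.3.2-ubu
   0,566851320 seconds time elapsed                                          ( +-  0,24% )
Binary = /tmp/flac-1.3.2-ubu-static
   0,516293944 seconds time elapsed                                          ( +-  0,97% )
Binary = /tmp/flac-git-static
   0,488829035 seconds time elapsed                                          ( +-  0,37% )
EOF
```

Fed with the decode lines above, the git build comes out at 13.8% less time than the stock Ubuntu binary. Comparing the 0.516 s of the self-compiled Ubuntu sources with the 0.489 s of the git build gives the roughly 5% mentioned above.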

Bottom line. Well done flac designers! You lived up to your promises. Your efforts are highly appreciated.

Enjoy.

PS: Above exercise and results were also discussed with the flac designers. 

********************************************************************************************************
Benchmarking test procedure:

IF="/tmp/test.wav"
OF="/tmp/test.flac"

DURATION="$(soxi -d $IF)"
BITRATE="$(soxi -b $IF)"
SAMPLERATE="$(soxi -r $IF)"
COMPRESSIONLEVEL="5"

echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

echo "****************"
echo " DURATION: $DURATION"
echo " SAMPLERATE: $SAMPLERATE"
echo " BITRATE: $BITRATE"
echo " COMPRESSION: $COMPRESSIONLEVEL"

rm $OF.* 2>/dev/null

for i in flac-1.3.2-ubu flac-1.3.2-ubu-static flac-git-static ; do
BIN="/tmp/$i"
echo "****************************"
echo "Binary = $BIN"
perf stat -r 10 -B $BIN --totally-silent --compression-level-$COMPRESSIONLEVEL -f -o $OF.$i $IF
sleep 3
sync
echo
done
*************************************************************************
Compiling flac:

I'll now show you how to compile a static flac binary on Ubuntu or other Debian-based systems. Open a terminal first.

I won't compile libogg support into the binary.

*************************************

sudo su

apt-get install build-essential libtool libtool-bin nasm


BASE=/tmp

cd $BASE
git clone https://git.xiph.org/flac.git
cd $BASE/flac 
./autogen.sh

### gcc compiler settings:
### Find out your CPU specific parameter to use for your processor family and
### replace below "broadwell" entry accordingly e.g. "haswell"

export CFLAGS='-O3 -march=broadwell'

./configure --prefix=/usr --enable-static --disable-shared --disable-ogg --disable-doxygen-docs --disable-xmms-plugin


### You should now see listed in the configuration summary:
###     SSE optimizations : ................... yes

###    Asm optimizations : ................... yes

make


ls -l ./src/flac/flac

******************************************
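
One practical note on the -march setting used in the build steps: if you're not sure which value fits your CPU, gcc can tell you what "native" resolves to on the machine you're building on. A small sketch, with function names that are mine and under the assumption that gcc is installed and prints the -march= line in its usual two-column format:

```shell
# Pick the value out of the "-march=" line of gcc's target option listing (stdin).
march_from_gcc_output() {
  awk '$1 == "-march=" { print $2; exit }'
}

# Ask gcc which -march value "native" resolves to on this machine.
# Prints nothing if gcc is missing or does not report the option.
detect_march() {
  gcc -march=native -Q --help=target 2>/dev/null | march_from_gcc_output
}

detect_march
```

The printed name can then be used in place of "broadwell" in CFLAGS. Keep in mind that a binary built this way is tuned for the build machine and may not run on older CPUs.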

Here we go. It's that easy.

Now you'll have a bleeding edge high performance standalone (static) flac binary at hand. 

Note: It still says version 1.3.2 - just ignore it!

Copy it wherever you want it,
e.g. to your LMS installation:

cp ./src/flac/flac /usr/share/squeezeboxserver/Bin/x86_64-linux/













