Saturday, January 27, 2018

Raspberry Pi - The Audio Engine - Part 7 - Realtime Kernel

A realtime kernel. The icing on the cake for a Linux-based audio engine.

Did you know? Even the Squeezebox Touch was running a realtime-kernel almost
10 years ago!


A realtime kernel can preempt running tasks much sooner, which gives much lower
and more predictable latency than a normal kernel. It's usually used in
time-sensitive industrial applications.

In my experience, this extremely low latency can make a difference for our purpose as well.

Most of my mods you've seen so far follow the logic of reducing distractions to the audio stream. An rt-kernel fits that logic very well.




What makes a kernel a realtime-kernel? 
It's a huge set of patches that gets applied to the standard kernel sources. 
Numerous files are involved. 
You as a kernel maintainer have to work on configuring that patch set properly. 
Then you compile, integrate and test that kernel on your OS of choice.

A realtime kernel alone won't make a Formula1 car out of a system though.

It requires sensitive fine-tuning here and there to be able to make use of the potential
performance gain introduced by such a realtime kernel.

If a realtime kernel is not properly integrated, it can make things worse.
You might experience higher loads than you'd see on a normal kernel.
You might even end up with race conditions. Recently we've seen 5000 USB interrupts/second, where a typical number on a PI is around 300/s.
We've also seen stalled boots and other total system lock-ups. It can get tricky
with that kernel!

Keep in mind:
You can't drive your Formula1 car on a crowded freeway at full speed. 
You need a race track! 


In my experience, looking into a realtime kernel integration is well worth it,
especially for a single-purpose system as developed in Parts 1-6 of this series.

On multi-purpose systems (many apps and I/O operations running in parallel),
I'd consider a realtime kernel hardly applicable.


Many people in the past appreciated having access to such a kernel! (There was one on the SB Touch, I used to provide one for Moode, and there are others.)

The RPI kernel team has been offering official realtime kernel sources since March 2018.
Basically the RPI kernel team incorporates the rt-kernel patch themselves! Great news!

pCP can thus now introduce an rt-kernel as well. It's done already! With pCP 3.5 available you'll have an rt-kernel based OS at hand. I've been involved a little in the testing phase. A great team to work with!

Pretty much all earlier proposed tweaks and modifications should work well 
with the rt-kernel.




Benchmarking and performance tests


I did some initial benchmarking measurements on the rt-kernel (pretty much the same being used on the upcoming pCP) and the standard low latency kernel of ARCH-Linux-ARM on my PI3.


Keep in mind that the default ARCH kernel is already a low-latency kernel!

I ran the measurements using cyclictest:

* SC rt kernel 4.14.26-rt18-sc1
###CPU800MHz - isolated CPU3
./cyclictest -m -t1 -p 45 -n -a3 -i 400 -l 100000
T: 0 (  519) P:45 I:400 C: 100000 Min:      7 Act:    8 Avg:    7 Max:      16
###CPU800MHz - non-isolated CPU2
./cyclictest -m -t1 -p 45 -n -a2 -i 400  -l 100000
T: 0 (  529) P:45 I:400 C: 100000 Min:      6 Act:    8 Avg:    8 Max:      26


* ARCH LowLatency kernel 4.14.26-1-ARCH
###CPU800MHz - isolated CPU3
cyclictest -m -t1 -p 45 -n -a3 -i 400 -l 100000
T: 0 (  323) P:45 I:400 C: 100000 Min:      7 Act:    8 Avg:    7 Max:      16
###CPU800MHz - non-isolated CPU2
cyclictest -m -t1 -p 45 -n -a1 -i 400 -l 100000
T: 0 (  329) P:45 I:400 C: 100000 Min:      6 Act:    9 Avg:    8 Max:      23


What you can see is that the results are almost equally low. That surprised me!
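
To compare runs side by side, the summary lines can be reduced to their Min/Avg/Max values. This is a minimal sketch that assumes the summary format printed above. The function name "summarize" is just an illustrative choice and not part of cyclictest.

```shell
#!/bin/sh
# Reduce cyclictest summary lines ("T: 0 ( 519) P:45 ... Max: 16")
# to their Min/Avg/Max values (in microseconds).
# "summarize" is a hypothetical helper name, not part of cyclictest.
summarize() {
  awk '/^T:/ {
    # Look up each value by its label instead of a fixed column position.
    for (i = 1; i < NF; i++) {
      if ($i == "Min:") min = $(i + 1)
      if ($i == "Avg:") avg = $(i + 1)
      if ($i == "Max:") max = $(i + 1)
    }
    printf "min=%d avg=%d max=%d\n", min, avg, max
  }'
}

# Example with the isolated-CPU3 rt-kernel result from above:
echo 'T: 0 (  519) P:45 I:400 C: 100000 Min:      7 Act:    8 Avg:    7 Max:      16' | summarize
# prints: min=7 avg=7 max=16
```

When piping a live run into it, adding cyclictest's -q option (print only the summary on exit) should keep the intermediate screen updates out of the way.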

However, the results above look much better than those in the references linked
at the bottom of this article, for both standard and rt-kernels.
This has just been initial testing from my side though. I'm not 100% sure if I should trust these
numbers.
My installation had some of the optimizations that you've seen in earlier posts already enabled. To name a few:

* no USB devices, except network
* no WiFi, no BT
* HDMI off
* LEDs off
* Internal audio off

= distraction level quite low

Just to mention it: the ARCH kernel is an LL kernel already. That might explain why there is close to no difference between the rt-kernel and the ARCH standard kernel in these tests.
I currently don't have a standard Raspbian kernel at hand. Those used to be less efficient than the ARCH Linux kernels.

I also did some tests raising the CPU clock from 800 to 1000MHz.
This brings the avg/max values down (an improvement!) by about 10%. That makes sense,
doesn't it!?

What's important is that none of the scenarios shows latencies of several hundred µs, as seen in
the references below.
That speaks for a pretty homogeneous setup and also for a pretty good ARCH Linux base.
I didn't run heavy load tests yet. The situation might change then.

One of the most interesting findings, though, is the rather big impact of running cyclictest on the isolated CPU 3.
Just isolating that CPU and running cyclictest on it exclusively gets me another 30-40% improvement over the already quite low latency on my system.
If you think about it, that improvement pretty much makes sense, doesn't it!?
But that much? We get a bigger improvement than from increasing the CPU clock by 200MHz!
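
A quick sanity check on that figure, assuming it refers to the Max column of the cyclictest results above (the Avg column moves much less, from 8 to 7). The helper name "pct_drop" is made up for this illustration.

```shell
#!/bin/sh
# Percentage reduction from one latency value to another.
# "pct_drop" is a hypothetical helper name used only for this illustration.
pct_drop() {
  awk -v from="$1" -v to="$2" 'BEGIN { printf "%.0f\n", (from - to) / from * 100 }'
}

# Max latency on the non-isolated CPU vs. the isolated CPU3 (values from above):
pct_drop 26 16   # rt kernel,   prints: 38
pct_drop 23 16   # ARCH kernel, prints: 30
```

Both kernels land in the 30-40% range on the worst-case value, which is the number that matters most for an audio stream.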

Great. So...

...how to proceed!?!?

I'll do some more testing. ;)

And those who are in the process of preparing new OS releases with rt-kernels might consider the following measures:


* Introduce "isolcpus=0,3" (e.g. in the tweaks section of pCP).
  0 will get the IRQs isolated.
  3 will get squeezelite isolated (obviously you need to start squeezelite with affinity 3).

* Go for a 100Hz scheduler in the kernel. Faster switching causes a higher base load.
  On a single-purpose system that's not required. I've tried all kinds of frequencies myself
  in the past. In certain HW-critical situations it might make sense to go higher.

  A DSD128 USB DAC playing and streaming at the same time comes to mind.
  Running a server with a local USB-HDD attached might be another scenario.

* Then I would add force_turbo=1 to the config.txt.
  This nails the CPUs to one frequency, the maximum CPU frequency as defined in the config.txt.
  All this dynamic switching just causes non-homogeneous load and whatever
  side effects and "distractions".

  Power consumption will not rise much by nailing the CPU frequency.

* If we disable dynamic frequency shifting with force_turbo, I'd also go down with
  the CPU frequency. 800MHz seems to be a nice choice.

  We'll lose a bit on latency, as shown above. However, there'll also be less stress on the
  system. Since we also have squeezelite isolated, we'll end up with a quite well-balanced setup.


* squeezelite should be started with CPUSchedulingPolicy "fifo"
  and CPUSchedulingPriority > 0.
  squeezelite should then be configured with the option "-p <CPUSchedulingPriority + 1>".
  For example, I run squeezelite at prio 44 and set the "-p" option to 45.
  I stay below rt-priority 50 since most interrupts are running at 50.
  All this is already considered if you follow earlier posts! 
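
Put together, the measures above could look roughly like this. Take it as a sketch, not a tested recipe. The file names cmdline.txt and config.txt are those of a typical Raspberry Pi boot partition and may be handled differently on your OS. arm_freq=800 is an assumption about how to fix the clock at 800MHz. Using taskset and chrt is just one way to set affinity and FIFO priority when squeezelite is not started through a service manager. The helper name "rt_launch_cmd" is hypothetical.

```shell
#!/bin/sh
# Boot-time settings (shown as comments, they are not shell commands):
#   cmdline.txt : append  isolcpus=0,3  to the existing single line
#   config.txt  : force_turbo=1
#                 arm_freq=800    (assumption: fixes the clock at 800MHz)
#
# Build the launch command for squeezelite on an isolated CPU.
#   $1 = CPU to pin the process to
#   $2 = FIFO priority of the process
# The "-p" output thread priority is set one higher, as described above.
# "rt_launch_cmd" is a hypothetical helper name.
rt_launch_cmd() {
  cpu=$1
  prio=$2
  echo "taskset -c $cpu chrt -f $prio squeezelite -p $((prio + 1))"
}

rt_launch_cmd 3 44
# prints: taskset -c 3 chrt -f 44 squeezelite -p 45
```

The printed command still needs your usual squeezelite options appended, and realtime priorities normally require root. After a reboot, "cat /sys/devices/system/cpu/isolated" should list the isolated CPUs on reasonably recent kernels.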




Finally some references covering the same subject from a slightly different perspective:

https://autostatic.com/2017/06/27/rpi-3-and-the-real-time-kernel/

https://isojed.nl/blog/2017/10/25/raspberry-pi-rt-preempt/


https://medium.com/@metebalci/latency-of-raspberry-pi-3-on-standard-and-real-time-linux-4-9-kernel-2d9c20704495
























