Demystifying Voice Quality When Using Graphics Processing Units for Transcoding

July 13th, 2017

I believe Graphics Processing Units (GPUs) are the right solution for voice transcoding as VoIP services continue their migration to virtual, cloud-based solutions. However, an argument I have recently heard is that codec transcoding is more efficient and produces better voice quality with fixed-point processing, whereas GPUs are designed for floating-point calculations.

If this argument were true, it would directly imply that using GPUs results in inferior voice quality. Fortunately, actual test results show the argument to be false.
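To give a feel for the scale of the numerical differences involved, here is a minimal sketch that applies the same gain to a signal in 16-bit Q15 fixed-point arithmetic and in floating point, then reports the largest difference between the two outputs. The Q15 format, the 440 Hz test tone, the gain of 0.7, and the helper names `to_q15` and `q15_mul` are all assumptions chosen for illustration; none of this is taken from the codecs or from our test setup. Note also that Python floats are double precision, while GPU code often uses single precision, so treat this only as an indication of the order of magnitude of rounding error.

```python
import math

Q15_ONE = 1 << 15  # 32768, the scale factor for Q15 fixed-point values


def to_q15(x):
    """Quantize a float in [-1, 1) to a Q15 integer, saturating at the limits."""
    return max(-Q15_ONE, min(Q15_ONE - 1, int(round(x * Q15_ONE))))


def q15_mul(a, b):
    """Multiply two Q15 integers and round the product back to Q15."""
    return max(-Q15_ONE, min(Q15_ONE - 1, (a * b + (1 << 14)) >> 15))


# Hypothetical test signal: 10 ms of a 440 Hz tone sampled at 8 kHz.
samples = [0.5 * math.sin(2 * math.pi * 440 * n / 8000) for n in range(80)]
gain = 0.7  # illustrative gain, not a value from any codec

# Same operation, two arithmetic styles.
float_out = [s * gain for s in samples]
fixed_out = [q15_mul(to_q15(s), to_q15(gain)) / Q15_ONE for s in samples]

max_error = max(abs(f - q) for f, q in zip(float_out, fixed_out))
print(f"max difference between float and Q15 output: {max_error:.6f}")
```

A single multiply is far simpler than a full codec, so this says nothing by itself about end-to-end voice quality; that is what the PESQ measurements are for.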

We compared voice quality for transcoding on CPUs (fixed point) versus GPUs (floating point) using three codec types: G729AB, AMR-WB, and EVRC-B. The input was the speech test vector from the G.729 standard specification, and voice quality was measured with PESQ (ITU-T P.862). The highlights are:

  • G729AB
    • Testing was run without Discontinuous Transmission (DTX), a.k.a. “silence suppression”, so packets were sent during periods of silence
    • GPU measurements were within 0.4% of CPU measurements
  • EVRC-B
    • Testing was done on two bitrates: 9.3 and 8.5 kbps
    • GPU measurements differed from CPU measurements by less than 0.9%
  • AMR-WB
    • Testing was done across the full spectrum of bitrates from 6.6 to 23.85 kbps
    • GPU measurements ranged from 0.7% better to 0.55% worse than CPU measurements

In summary, our testing showed that GPUs using floating-point processing scored within 1% of CPUs using fixed-point processing, or better. In our experience, a difference of less than 1% results in no perceived degradation in voice quality.
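For readers who want to run the same kind of comparison on their own PESQ scores, the percentage figures can be computed as sketched below. The function names and the two example scores are hypothetical placeholders for illustration; they are not the measured values from our tests.

```python
def relative_difference_percent(gpu_score, cpu_score):
    """Signed difference of the GPU score relative to the CPU score, in percent."""
    if cpu_score == 0:
        raise ValueError("CPU reference score must be non-zero")
    return (gpu_score - cpu_score) / cpu_score * 100.0


def within_tolerance(gpu_score, cpu_score, tolerance_percent=1.0):
    """True when the GPU score is at most tolerance_percent worse than the CPU score."""
    return relative_difference_percent(gpu_score, cpu_score) >= -tolerance_percent


# Placeholder PESQ scores for illustration only; NOT measured results.
example_cpu_pesq = 4.00
example_gpu_pesq = 3.98

diff = relative_difference_percent(example_gpu_pesq, example_cpu_pesq)
print(f"GPU vs CPU: {diff:+.2f}%")
print("within 1%:", within_tolerance(example_gpu_pesq, example_cpu_pesq))
```

A negative result means the GPU score is lower than the CPU score, and a positive result means it is higher, which matches the "better" and "worse" wording used in the highlights.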

But if you want another source that shows similar results, check out 3GPP TR 26.976 version 10.0.0 Release 10, Performance characterization of the Adaptive Multi-Rate Wideband (AMR-WB) speech codec.

Specifically, look at Annex B, which contains the verification results for the AMR-WB floating-point codec, and section B.7, which compares AMR-WB PESQ scores for a floating-point versus a fixed-point encoder. Section B.7 concludes with the following statement:

It is most likely, from the data, that there is no significant subjective difference between V5.3.0 of the fixed-point AMR-WB encoder with CR011 implemented and V0.2.2 of the floating-point AMR-WB encoder.

Beyond the voice quality equivalence of using GPUs for voice transcoding, there are other reasons that GPUs are the right solution for VoIP services in the virtual, cloud deployment models that service providers and enterprise customers are increasingly adopting.

As we have shown in prior blog posts, when compared head-to-head, GPUs clearly exceed CPUs in performance and scale because they are designed for high-volume, compute-intensive processing, which is exactly what audio transcoding requires. They also deliver that performance and scale with lower power costs and far less rack space.

In a virtual, cloud deployment model, especially in public clouds, GPUs are readily available and easier than ever to access. They are already offered for high-volume, compute-intensive applications like machine learning and analytics, so expanding the use case to include real-time voice codec transcoding becomes simple and very attractive.

In conclusion, using GPUs for voice transcoding provides voice quality that is as good as any other option. Combined with the added benefits highlighted above, that makes GPUs clearly the right solution for voice transcoding in the cloud.
