OCI Ampere A1 Compute instances can significantly reduce video encoding costs versus modern CPUs

Introduction

Oracle Cloud Infrastructure (OCI) recently launched the Ampere A1 Compute family of Arm Neoverse N1-based VMs and bare-metal instances. These A1 instances use Ampere Altra CPUs, which were designed specifically to deliver performance, scalability, and security for cloud applications. The A1 Flex VM family supports an unmatched number of VM shapes, configurable with 1 to 80 cores and 1 to 512 GB of RAM (up to 64 GB per core). A1 is also offered in bare-metal configurations with up to two sockets and 160 cores per instance.

In this blog, we compare x264 video encoding performance and price across three OCI instance families:

  • A1.Flex: Ampere® Altra® CPUs based on Arm Neoverse N1 cores, 3.0GHz all-core sustained max
  • E4.Flex: third-generation AMD EPYC processors, 2.55GHz base, 3.5GHz single-core turbo
  • Optimized3.Flex: latest third-generation Intel Xeon Scalable (Ice Lake) processors, 3.0GHz base, 3.6GHz single-core turbo

Despite the introduction of newer video codecs, H.264 remains by far the most popular video codec for both live streaming and video on demand (VOD) playback [1], and it is supported across the widest range of playback devices. We benchmark x264 here because it is a popular open-source H.264 video encoder. It is also worth noting that reducing costs remains the top priority of video developers, according to a 2020 survey by Bitmovin [1].

Pricing information

The pricing [2] for these Intel, AMD, and Ampere-based instances is listed in Table 1. One notable aspect of the OCI Ampere A1 Compute instances is that they are the first family to offer pricing of $0.01 per core per hour. Besides offering a lower price per physical CPU core (OCPU), A1 also has a tiered pricing policy, where the first 3,000 OCPU hours and 18,000 GB hours per month are free.

OCI virtual machine    Per OCPU per hour ($)    Per GB memory per hour ($)    Tiered pricing
Intel.VM.Optimized3    0.054                    0.0015                        No
AMD.VM.E4              0.025                    0.0015                        No
Ampere.VM.A1           0.01                     0.0015                        Yes

Table 1: OCI VM Pricing information
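To make the tiered pricing concrete, here is a minimal Python sketch that estimates a monthly A1 bill from the rates and free allowances quoted above. It assumes the free allowance is simply subtracted from the month's usage before the hourly rates apply; the function name, the constants, and the 720-hour month are illustrative and not taken from OCI documentation.

```python
# Rates and free allowances for A1 as quoted in the text above.
A1_OCPU_RATE = 0.01        # $ per OCPU per hour
A1_MEM_RATE = 0.0015       # $ per GB of memory per hour
FREE_OCPU_HOURS = 3_000    # free OCPU hours per month
FREE_GB_HOURS = 18_000     # free GB hours per month


def a1_monthly_cost(ocpus, memory_gb, hours):
    """Estimated monthly A1 bill in dollars.

    Assumption: the free allowance is subtracted from the month's usage
    before the per-hour rates are applied.
    """
    billable_ocpu_hours = max(0, ocpus * hours - FREE_OCPU_HOURS)
    billable_gb_hours = max(0, memory_gb * hours - FREE_GB_HOURS)
    return billable_ocpu_hours * A1_OCPU_RATE + billable_gb_hours * A1_MEM_RATE


# Example: 8 OCPUs and 60 GB running for a hypothetical 720-hour month.
print(round(a1_monthly_cost(8, 60, 720), 2))
```

Under this assumption, a small instance that stays within both allowances for the whole month costs nothing.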

System configurations

In our evaluation, we chose three kinds of instances, each with 8 vCPUs (hardware threads, with SMT on for the x86 instances) and 60 GB of memory, as listed in Table 2. All the instances run Oracle Linux 8.3 with GCC 8.3.1. For the comparison we chose x264 [3], a popular open-source H.264 video encoder.

OCI virtual machine    OCPU    SMT    vCPU    Memory (GB)    Price ($/hr)
Intel.VM.Optimized3    4       On     8       60             0.306
AMD.VM.E4              4       On     8       60             0.19
Ampere.VM.A1           8       N/A    8       60             0.17

Table 2: OCI VM System configurations
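The hourly prices in Table 2 follow from the per-OCPU and per-GB rates in Table 1. The short Python sketch below reproduces that arithmetic; the function and variable names are illustrative.

```python
# Per-hour rates from Table 1.
OCPU_RATE = {
    "Intel.VM.Optimized3": 0.054,
    "AMD.VM.E4": 0.025,
    "Ampere.VM.A1": 0.01,
}
MEM_RATE = 0.0015  # $ per GB per hour, identical for all three families

# (OCPUs, memory in GB) for each tested instance, from Table 2.
CONFIGS = {
    "Intel.VM.Optimized3": (4, 60),
    "AMD.VM.E4": (4, 60),
    "Ampere.VM.A1": (8, 60),
}


def hourly_price(shape, ocpus, memory_gb):
    """Instance price in $/hr: OCPU charge plus memory charge."""
    return OCPU_RATE[shape] * ocpus + MEM_RATE * memory_gb


for shape, (ocpus, memory_gb) in CONFIGS.items():
    print(f"{shape}: ${hourly_price(shape, ocpus, memory_gb):.3f}/hr")
```

Note that the x86 instances reach 8 vCPUs with 4 OCPUs because SMT exposes two hardware threads per core, while A1 uses 8 OCPUs for 8 vCPUs.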

Key findings

For each instance, we selected the software configuration that delivers the best performance on that hardware. The resulting performance and price results are shown in Figure 1. Specifically, we take the aggregate frames per second of each instance, extrapolate that rate to one hour, and divide the result by the price to rent that instance for one hour.

As a result, the Ampere A1 Compute instance delivers about 98% higher performance per dollar than Optimized3, and about 22% higher than E4, for x264 encoding.

 Figure 1: x264 encode perf/cost for A1 and Optimized3 instances.

 

Instance                            Optimized3    E4         A1
Frames/hr                           121,104       121,824    133,020
Frames/$                            395,765       641,179    782,471
A1 performance/price advantage      +97.7%        +22.0%     -

Table 3: Frames/$ for Optimized3, E4, and A1 instances.
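The Frames/$ row and the advantage percentages in Table 3 can be reproduced from the Frames/hr row and the hourly prices in Table 2. The Python sketch below shows the arithmetic; the helper names are illustrative.

```python
# Frames encoded per hour (Table 3) and instance price in $/hr (Table 2).
# Frames/hr is the aggregate frames per second multiplied by 3600.
FRAMES_PER_HOUR = {"Optimized3": 121_104, "E4": 121_824, "A1": 133_020}
PRICE_PER_HOUR = {"Optimized3": 0.306, "E4": 0.19, "A1": 0.17}


def frames_per_dollar(frames_per_hour, price_per_hour):
    """Frames encoded for each dollar of instance rental."""
    return frames_per_hour / price_per_hour


def advantage_pct(value, baseline):
    """Percentage by which value exceeds baseline."""
    return (value / baseline - 1) * 100


fpd = {
    name: frames_per_dollar(FRAMES_PER_HOUR[name], PRICE_PER_HOUR[name])
    for name in FRAMES_PER_HOUR
}

for name in ("Optimized3", "E4"):
    print(f"A1 vs {name}: {advantage_pct(fpd['A1'], fpd[name]):+.1f}%")
```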

Test methodology

Figure 3 shows our measured performance (aggregate frames per second) for the Optimized3, E4, and A1 instances (8 vCPUs per instance) while encoding a reference input file (‘ducks_take_off_1080p50.y4m’) with x264 using the ‘medium’ encoding preset. To find the best-performing configuration for each instance under test, we swept two parameters:

  • Number of x264 jobs run per compute instance
  • Number of threads run per job

Figure 3: x264 encode performance (aggregated frames per second) for OCI Optimized3, E4, and A1 instances (p=jobs per instance, t=threads per job).
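The two-parameter sweep described above can be sketched in Python as follows. This is a sketch under stated assumptions, not the harness used for these measurements: the sweep values, the helper names, and the exact x264 command line (output discarded to /dev/null) are illustrative, and the aggregate rate is assumed to be the sum of the per-job frame rates of jobs running concurrently.

```python
from itertools import product

INPUT_FILE = "ducks_take_off_1080p50.y4m"  # reference clip named in the text
JOB_COUNTS = [1, 2, 4, 8]      # hypothetical values for p (jobs per instance)
THREAD_COUNTS = [1, 2, 4, 8]   # hypothetical values for t (threads per job)


def x264_command(threads, input_file=INPUT_FILE):
    """Build one x264 invocation; the encoded output is discarded (assumption)."""
    return ["x264", "--preset", "medium", "--threads", str(threads),
            "-o", "/dev/null", input_file]


def sweep_plan(job_counts, thread_counts):
    """List every (jobs, threads) combination with the commands to launch."""
    plan = []
    for jobs, threads in product(job_counts, thread_counts):
        commands = [x264_command(threads) for _ in range(jobs)]
        plan.append({"jobs": jobs, "threads": threads, "commands": commands})
    return plan


def aggregate_fps(per_job_fps):
    """Assumed aggregate: sum of the frame rates of all concurrent jobs."""
    return sum(per_job_fps)


def best_config(results):
    """Return the (jobs, threads) key with the highest aggregate fps."""
    return max(results, key=results.get)


plan = sweep_plan(JOB_COUNTS, THREAD_COUNTS)
print(len(plan), "configurations to measure")
```

Each entry in the plan would be launched with its jobs running concurrently, and the per-job frame rates reported by x264 would be fed to aggregate_fps before best_config picks the winner for that instance.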

Typical encoding tasks use four or more threads per job and run at least one job per vCPU, which naturally places most encoding jobs on the right side of Figure 3. Figure 3 also shows that, comparing the highest aggregate FPS achieved by each instance, the 8-vCPU A1 instance outperforms the 8-vCPU Optimized3 and E4 instances by around 10%.
