This post compares CPU performance and value for 18 compute instance types from 5 cloud compute platforms - AWS EC2, Google Compute Engine, Windows Azure, HP Cloud and Rackspace Cloud. The most interesting content is the data and resulting analysis. If you're in a rush, scroll down to go straight to it.
In the escalating cloud arms race, performance is a frequent topic of conversation. Often, overly simplistic test models and fuzzy logic are used to substantiate sweeping claims. In a general sense, computing performance is relative to, and dependent on, workload type. There is no single metric or measurement that encapsulates performance as a whole.
In the context of cloud, performance is also subject to variability due to nondeterministic factors such as multitenancy and hardware abstraction. These factors combined increase the complexity of cloud performance analysis because they reduce one's ability to dependably repeat and reproduce such analysis. This is not to say that cloud performance cannot be measured, rather that doing so is not a precise science, and differs somewhat from traditional hardware performance analysis where such factors are not present.
Performance is workload dependent. Cloud performance is hard to measure consistently because of variability from multitenancy and hardware abstraction.
My goal in starting CloudHarmony in 2010 was to provide a credible source of objective and reliable performance analysis of cloud services. Since then, cloud has grown extensively and become an even more confusing place. The intent of this post is to present techniques and a visual tool we're using to help assess and compare performance and value of cloud services. The focus of this post is cloud compute CPU performance and value. In the coming weeks, follow-up posts will be published covering other performance topics including block storage, network, and object storage. As is our general policy, we have not been paid or otherwise influenced in the testing or analysis presented in this post.
The focus of this post is compute CPU performance and value. Follow up posts will cover other performance topics. We were not paid to write this post.
To test performance of compute services we run a suite of about 100 benchmarks on each type of compute instance offered. These benchmarks measure various performance properties including CPU, memory and disk IO. Each test iteration takes 1 to 2 days to complete. When multiple configuration options are offered, we usually run additional test iterations for each such option (e.g. compute services often offer multiple block storage options). Linux CentOS 6.* is our operating system of choice because of its nearly ubiquitous availability across services.
Although our test suite includes many CPU benchmarks, our preferred method for compute CPU performance analysis is based on metrics provided by the CPU2006 benchmark suites. CPU2006 is an industry standard benchmark created by the Open Systems Group of the non-profit Standard Performance Evaluation Corporation (SPEC). CPU2006 consists of 2 benchmark suites that measure Integer and Floating Point CPU performance. The Integer suite contains 12 benchmarks, and the Floating Point suite 17. According to the CPU2006 website "SPEC designed CPU2006 to provide a comparative measure of compute-intensive performance across the widest practical range of hardware using workloads developed from real user applications." Thorough documentation of CPU2006, including descriptions of the individual benchmarks, is available on the CPU2006 website. CloudHarmony is a SPEC CPU2006 licensee.
The results table below contains CPU2006 SPECint (Integer) and SPECfp (Floating Point) metrics for each compute instance type included in this post. Each score is linked to a PDF report generated by the CPU2006 runtime for that specific test run. CPU2006 run and reporting rules require disclosure of settings and parameters used when compiling and running the CPU2006 test suites and this data is included in the reports. To summarize, our runs are based on the following settings:
- Compiler: Intel C++ and Fortran Compilers version 12.1.5
- Compilation Guidelines: Base
- Run Type: Rate
- Rate Copies: 1 copy per CPU core or per 1GB memory (lesser of the two)
- SSE Compiler Option: SSE4.2 or SSE4.1 (if supported by the compute instance)
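The rate copies rule above can be sketched in a few lines of Python. This is only an illustration: the function name `rate_copies`, the rounding down of fractional memory, and the floor of one copy are assumptions, not details taken from our test harness.

```python
import math


def rate_copies(cpu_cores: int, memory_gb: float) -> int:
    """Return the number of concurrent benchmark copies for a rate run.

    Rule from the post: 1 copy per CPU core or per 1GB of memory,
    whichever is the lesser of the two.
    """
    # Assumption: fractional memory is rounded down to whole GB,
    # and at least one copy is always run.
    return max(1, min(cpu_cores, math.floor(memory_gb)))


# The 8 core Rackspace 30gb instance is limited by cores, not memory.
print(rate_copies(8, 30))   # 8
# A hypothetical 4 core instance with 3.5GB is limited by memory.
print(rate_copies(4, 3.5))  # 3
```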
Our preferred method for compute CPU performance analysis is based on metrics provided by the SPEC CPU2006 benchmark suites
CPU2006 Test Results
To be considered official, CPU2006 results must adhere to specific run and reporting guidelines. One such guideline states that results should be reproducible. While this is important in the context of hardware testing, it is impractical for cloud due to performance variability resulting from multitenancy and hardware abstraction. However, CPU2006 guidelines allow for reporting of estimated results in cases where not all guidelines can be adhered to. In such cases results must be clearly designated as estimates. It is for this reason that results in the table below are designated as such.
|Compute Service||Instance Type||CPU Type||Cores||Price [2]||SPECint [1]||SPECfp [1]|
|AWS EC2||cc2.8xlarge||Intel E5-2670 2.60GHz||32||2.40||441.511194||357.602046|
|HP Cloud||double-extra-large||Intel T7700 2.40GHz||8||1.12||168.55417||132.3234|
|AWS EC2||m3.2xlarge||Intel E5-2670 2.60GHz||8||1.00||150.30509||128.159625|
|Google Compute||n1-standard-8||Intel 2.60GHz||8||1.06||149.354133||143.1015|
|HP Cloud||extra-large||Intel T7700 2.40GHz||4||0.56||98.430955||85.24574|
|Rackspace Cloud||30gb||AMD Opteron 4170||8||1.00||95.43979||83.89602|
|Windows Azure||A4||AMD Opteron 4171||8||0.48||91.33294||77.93744|
|AWS EC2||m3.xlarge||Intel E5-2670 2.60GHz||4||0.50||80.180578||71.753345|
|Google Compute||n1-standard-4||Intel 2.60GHz||4||0.53||66.945866||66.84303|
|Rackspace Cloud||8gb||AMD Opteron 4170||4||0.32||51.709779||47.562079|
|Windows Azure||A3||AMD Opteron 4171||4||0.24||51.58953||46.9475|
|HP Cloud||medium||Intel T7700 2.40GHz||2||0.14||48.825275||44.085027|
|Google Compute||n1-standard-2||Intel 2.60GHz||2||0.265||39.469478||39.094813|
|AWS EC2||m1.large||Intel E5645 2.40GHz||2||0.24||39.023586||34.7884|
|AWS EC2||m1.large||Intel E5-2650 2.00GHz||2||0.24||38.816635||37.10992|
|AWS EC2||m1.large||Intel E5430 2.66GHz||2||0.24||29.534628||23.805172|
|Windows Azure||A2||AMD Opteron 4171||2||0.18||27.38071||25.92939|
|Rackspace Cloud||4gb||AMD Opteron 4170||2||0.16||25.854861||24.25972|
1: Base/Rate - Estimate
2: Hourly, USD - On Demand
Simplifying the Results
In order to provide simple and concise analysis derived from multiple relevant performance properties, it is helpful to reduce metrics from multiple related benchmarks to a single comparable value. The CPU2006 benchmark suites produce two metrics, SPECint for Integer, and SPECfp for Floating Point performance. A naive approach might be to combine them using a mean or sum of their values. However, doing so would be inaccurate because they are dissimilar values. Although they are calculated using the same algorithms, SPECint and SPECfp are produced from different benchmarks, and thus measure different things - as the idiom goes, this would be an apples to oranges comparison. An everyday analogy might be attempting to average 1 gallon of milk with 2 dozen eggs - in doing so, the resulting value: $$(1+2)/2=1.5$$ is meaningless because they are dissimilar values to begin with.
To merge dissimilar values like metrics from different benchmarks, the values must first be normalized to a common notional scale. One method for doing so is ratio conversion using factors from a common scale. The resulting ratios represent relationships between the original metrics and the common scale. Because the values share the same scale, they may then be operated on together using mathematical functions like mean and median. Using the same milk and eggs analogy, and assuming a common scale of groceries needed for the week, defined as 2 gallons of milk and 3 dozen eggs, grocery deficiency ratios may then be calculated as follows:
\[\text{Milk deficiency} = \frac{\text{2 gallons needed}}{\text{1 gallon on hand}} = 2\]
\[\text{Eggs deficiency} = \frac{\text{3 dozen needed}}{\text{2 dozen on hand}} = 1.5\]
The resulting ratios, 2 and 1.5, may then be reduced to a single ratio representing the average grocery deficiency for both milk and eggs:
\[\text{Average grocery deficiency} = \frac{2 + 1.5}{2} = 1.75\]
In other words, in order to stock up on groceries for the week, we'll need to buy 1.75 times the milk and eggs currently on hand. Take note, however, that this ratio is only relevant in the context of milk and eggs as a whole, not separately, nor does it apply to other types of groceries.
The benefit of reducing dissimilar benchmark values to a single representative metric is to simplify the expression and comparison of related performance properties. It allows us to present cloud performance more generally, and at a level more fitting to the interests and time of cloud users. As much as we'd like users to become well versed in the intricacies of benchmarking and performance analysis, this is simply not feasible for most, and is a primary reason for our existence. Our goal is to provide users with a simple starting point to help narrow the scope from hundreds of possible cloud services.
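The milk and eggs normalization can be reproduced with a short Python sketch. The dictionary keys and the helper name `deficiency_ratios` are made up for illustration; only the quantities come from the analogy.

```python
from statistics import mean


def deficiency_ratios(needed: dict, on_hand: dict) -> dict:
    """Convert each item to a unitless ratio against the common scale."""
    return {item: needed[item] / on_hand[item] for item in needed}


# Common scale: groceries needed for the week.
needed = {"milk_gallons": 2, "eggs_dozen": 3}
# Original dissimilar values: groceries currently on hand.
on_hand = {"milk_gallons": 1, "eggs_dozen": 2}

ratios = deficiency_ratios(needed, on_hand)
average = mean(ratios.values())

print(ratios)   # {'milk_gallons': 2.0, 'eggs_dozen': 1.5}
print(average)  # 1.75
```

Once both quantities are expressed as ratios they are unitless, which is what makes taking their mean meaningful.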
In order to more generally and simply present cloud performance information we generate a single value derived from multiple related benchmarks
CPU Performance Metric
The CPU performance metric displayed in the graph below was calculated using both SPECint and SPECfp metrics and the common scale ratio normalization technique described above. The common scale was the mid 80th percentile mean of all CloudHarmony SPECint and SPECfp test results from the prior year. These results included many different compute services and compute instance types, not just those included in this post. This calculation results in the following common normalization factors:
- SPECint Factor: 64.056
- SPECfp Factor: 55.995
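One way to read "mid 80th percentile mean" is as a trimmed mean that drops the lowest and highest 10 percent of results before averaging. The Python sketch below shows that reading only; the interpretation, the function name `middle_80_mean` and the sample scores are all assumptions, and the sample data is not taken from our results.

```python
from statistics import mean


def middle_80_mean(scores) -> float:
    """Mean of the middle 80 percent of scores.

    Assumption: the lowest and highest 10 percent of results are
    discarded before averaging, to limit the effect of outliers.
    """
    ordered = sorted(scores)
    cut = int(len(ordered) * 0.10)  # results dropped from each end
    trimmed = ordered[cut:len(ordered) - cut]
    return mean(trimmed)


# Illustrative scores only, not real benchmark results.
sample = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(middle_80_mean(sample))  # 55
```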
To shorten resulting long decimal values, ratios were multiplied by 100. The meaning of the metric can thus be interpreted as CPU performance relative to the mean of compute instances from many different cloud services. A value of 100 represents performance comparable to the mean, 200 twice the mean, and 50 half the mean. For example, the HP double-extra-large compute instance produced scores of 168.55417 for SPECint, and 132.3234 for SPECfp. The resulting CPU performance metric of 249.72 was then calculated using the following formula: $$\text{CPU Performance} = \frac{\frac{168.55417}{64.056} + \frac{132.3234}{55.995}}{2} \times 100 = \frac{4.99448532}{2} \times 100 = 249.724266$$ The value 249.72 signifies this instance type performed at about 2.5 times the mean.
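The same calculation can be written as a small Python function. The factors 64.056 and 55.995 are the ones used in the formula above; the function and constant names are illustrative only.

```python
# Normalization factors taken from the formula in the post.
SPECINT_FACTOR = 64.056
SPECFP_FACTOR = 55.995


def cpu_performance(specint: float, specfp: float) -> float:
    """Combine SPECint and SPECfp into one metric where 100 equals the mean."""
    # Normalize each score to the common scale, then average the ratios.
    ratios = (specint / SPECINT_FACTOR, specfp / SPECFP_FACTOR)
    return sum(ratios) / len(ratios) * 100


# HP double-extra-large scores from the results table.
metric = cpu_performance(168.55417, 132.3234)
print(round(metric, 2))  # 249.72
```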
The CPU performance metric used below represents SPECint and SPECfp scores relative to compute instances from many cloud services. A higher value is better
Cloud compute pricing is usually tied to CPU and memory allocation, with larger instance types offering more (or faster) CPU cores and memory. The CPU2006 benchmark suites are designed to take advantage of multicore systems when compiled and run correctly. Given the same hardware type, our test results generally show a near linear correlation between CPU allocation and CPU2006 scores. Because of these factors, the CPU performance metric derived from CPU2006 is well-suited for estimating value of compute instance types. To do so, we calculate value by dividing the metric by the hourly USD instance cost. For example, the HP extra-large compute instance costs 0.56 USD per hour and has a performance metric of 152.96. The resulting value metric 273.14 is calculated using the following formula: $$\text{Fixed Value} = \frac{152.96}{0.56} \approx 273.142857$$
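As a minimal Python sketch, fixed value is a single division. The function name `fixed_value` and the guard against non-positive prices are additions for illustration.

```python
def fixed_value(cpu_performance: float, hourly_usd: float) -> float:
    """CPU performance metric per USD of hourly instance cost."""
    if hourly_usd <= 0:
        raise ValueError("hourly_usd must be positive")
    return cpu_performance / hourly_usd


# HP extra-large: performance metric 152.96 at 0.56 USD per hour.
print(round(fixed_value(152.96, 0.56), 2))  # 273.14
```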
The graph below allows selection of either Tiered or Fixed Value options. Tiered Value is Fixed Value with an adjustment applied to instances ranked in the top or bottom 20 percent. The table below lists the exact adjustments used. The concept behind tiered values is based loosely on CPU pricing models where the top end processors generally command premium per GHz pricing, while the low end is often discounted. The HP double-extra-large compute instance costs 1.12 USD per hour and has a performance metric of 249.72. It is also ranked in the 91st percentile which receives a +10% value adjustment. The resulting tiered value metric 245.256 is calculated using the following formula: $$\text{Tiered Value} = \frac{249.72}{1.12} \times 1.1 \approx 222.96 \times 1.1 = 245.256$$
|Ranking Percentile||Value Adjustment|
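A hedged Python sketch of the tiered calculation follows. The adjustment is passed in by the caller rather than looked up by percentile, because the only adjustment stated in the text is +10% for the 91st percentile; the function name and signature are illustrative.

```python
def tiered_value(cpu_performance: float, hourly_usd: float,
                 adjustment: float = 0.0) -> float:
    """Fixed value scaled by a ranking based adjustment.

    adjustment is a fraction, e.g. 0.10 for +10%. Instances outside the
    top and bottom 20 percent keep the default of 0.0 (no adjustment).
    """
    return (cpu_performance / hourly_usd) * (1 + adjustment)


# HP double-extra-large: metric 249.72, 1.12 USD per hour, +10% adjustment.
print(round(tiered_value(249.72, 1.12, adjustment=0.10), 2))  # 245.26
```

The result differs slightly from 245.256 because the worked formula rounds the intermediate fixed value to 222.96 before applying the adjustment.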
Cloud compute pricing is usually tied to CPU and memory allocation. Value metrics in the graph below are derived by dividing CPU performance by the hourly cost
Most cloud providers, including all those covered in this post, offer on demand hourly pricing for compute instances. In addition, some providers offer commit based pricing and volume discounts. AWS EC2 for example offers six 1 and 3 year reserve/commit based pricing tiers. These pricing tiers exchange lower hourly rates for a setup fee paid in advance, and in the case of heavy reserve, commitment to run the compute instance 24x7x365 for the duration of the term (light and medium reserve tiers do not have this requirement). In order to represent these pricing tiers in the graph below, the total cost was normalized to an hourly rate by amortizing the setup fee into the hourly rate. For example, the m3.xlarge instance type is offered under a 1 year heavy reserve tier for 1489 setup and 0.123 per hour. For this instance type and pricing model the hourly rate used in the graph and for value metrics was 0.293/hr calculated using the following formula: $$\text{Normalized Hourly Rate} = \frac{1489/365}{24} + 0.123 \approx 0.17 + 0.123 = 0.293$$
AWS EC2 is also available under a bid based pricing model called Spot pricing. Although spot pricing is typically substantially below standard rates, it is highly volatile and subject to transient spikes that may result in unexpected termination of instances without notice. Due to this, spot pricing is generally not recommended for long term usage. The spot pricing included in the graph below is based on a snapshot taken in early June 2013 and may not represent current rates.
Volume discount and membership based pricing, like Windows Azure MSDN, were not included in the graph and value analysis because they are not as straightforward, and often require substantial monthly spend commitments at which users would likely be able to negotiate similar discounts with any vendor.
The graph provides a drop down list allowing selection of different pricing models. When changed, the graph and table below will automatically update.
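The amortization can be sketched in Python as follows. The `term_years` parameter is an assumption added so the same function could cover 3 year terms; the post only works through the 1 year case, and the sketch assumes the instance runs continuously for the whole term.

```python
HOURS_PER_YEAR = 365 * 24


def normalized_hourly_rate(setup_fee: float, hourly_rate: float,
                           term_years: int = 1) -> float:
    """Spread the upfront setup fee across every hour of the term."""
    # Assumption: the instance runs 24x7 for the full term.
    amortized_setup = setup_fee / (HOURS_PER_YEAR * term_years)
    return amortized_setup + hourly_rate


# m3.xlarge, 1 year heavy reserve: 1489 setup and 0.123 per hour.
print(round(normalized_hourly_rate(1489, 0.123), 3))  # 0.293
```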
The AWS EC2 reserve hourly pricing in the graph below is based on a normalized hourly value calculated by amortizing the setup fee into the hourly rate
Comments and Observations
As is our general policy, we don't recommend any one service over another. However, we'd like to point out some observations about each compute service included in this post.
AWS EC2
- On demand pricing provides value similar to other compute services. However, EC2 value increases substantially for reserve pricing models
- EC2 provides a broad performance range, topping out in this post with the 16 core (32 core hyper threaded) cc2.8xlarge instance type
- CPU architecture varies between instance types, with higher end types generally running on newer and faster hardware
- Older instance types like m1.large may deploy to different hardware platforms, and thus demonstrate variable performance. For example, there was a notable difference in performance between Intel E5430 and Intel E5-2650 based m1.large instances
- The cc2.8xlarge provides good value for multithreaded workloads with high CPU demand
Google Compute Engine (GCE)
- Performance increased near linearly from small to large instance types
- The n1-standard-4 performed roughly 10% slower than we expected (112 actual CPU performance versus 120-125 expected)
- The GCE hypervisor does not pass through full CPU identifiers - but in GCE documentation Google has stated processors are based on the Intel Sandy Bridge (E5-2670) platform
- n1-standard-4 and n1-standard-8 instance types performed very similarly to comparable EC2 instance types m3.xlarge and m3.2xlarge. All are based on the same Intel Sandy Bridge platform, and on demand pricing is also nearly the same (GCE is just a few cents higher)
Windows Azure
- The A3 and particularly A4 instance types are priced notably lower than instance types from other services with comparable CPU cores. This factor contributed to the higher value rankings associated with those instance types despite their performance being generally lower
- Vertical scalability is limited, with the largest A4 VM (in terms of CPU cores) having the lowest performance ranking of all 8 core instance types - however, at half the cost, the value is still good. Exclusive use of AMD 4171 2.1GHz processors (released in 2010) is also a limiting factor. The forthcoming release of Intel Sandy Bridge Azure Big Compute instance types may address this deficiency
HP Cloud
- HP compute instances provided marginally higher performance rankings for each of the 2, 4 and 8 core instance type groups
- For on demand pricing, the medium instance type provided the highest value ranking in the graph
- Performance increased 2X from medium (2 core) to extra-large (4 core) instance types, but the price difference is 4X. The 4 core large instance type between them was not tested
Rackspace Cloud
- Rackspace and Windows Azure performed nearly the same. Both are based on the AMD 4100 processor platform. However, Azure value is much higher for the 8 core A4 instance type (versus the Rackspace 8 core 30GB) because the cost is less than half (0.48/hr versus 1.00/hr - 14GB memory Azure versus 30GB Rackspace). The same applied to a lesser extent for the 2 and 4 core instance types (Azure A2/3.5GB and A3/7GB versus Rackspace 4GB and 8GB)
- The 30GB compute instance had the lowest value of all instance types included in this post
- Like Windows Azure, vertical scalability may be limited due to observed exclusive use of AMD 4170 2.1GHz processors (released in 2010). Rackspace does offer an upgrade path through its dedicated hosting offerings, however.
Next Up - Storage IO
CPU and storage IO are generally the two most important performance characteristics for compute services. Depending on workload, one might be more important than the other. Compute services often offer multiple storage options. Many storage options are networked and thus subject to higher variability than CPU and memory. Many workloads are sensitive to IO variations and may perform poorly in such environments. In the next post, we'll present IO performance and consistency analysis for the same providers covered in this post. Storage options covered will include:
- AWS EC2: Ephemeral, EBS, EBS Provisioned IOPS, EBS Optimized
- Google Compute Engine: Local/Scratch, Persistent Storage
- HP Cloud: Local, Block/External Storage
- Windows Azure: Local Replicated, Geo Replicated
- Rackspace Cloud: Local, SATA and SSD Block/External Storage
Following storage IO, we will also release posts covering network performance (inter-region, intra-region and external), and object storage IO.