Socket vs. Core Assignment in a Virtual Machine
There is always confusion in students' minds about assigning the number of sockets vs. the number of cores to a VM during virtual machine creation.
Surprisingly, most students don't even notice this option while assigning vCPUs to a VM. A few wondered whether it really matters to assign more cores instead of more sockets. Does the socket or core configuration make any difference to VM performance?
In this article, I'll try my best to answer all these questions, and your feedback will be much appreciated.
Before digging into the VMware implementation of virtual sockets and cores, let's first clear up the basics of sockets vs. cores.
What Is a Socket?
A socket is the physical receptacle on the motherboard where a physical processor fits. One socket can contain multiple cores.
What Is a Core?
A core is a unit within a physical processor that actually performs computational work, i.e., code execution. It is a complete, private set of registers, execution units, and queues required to execute programs.
In the early days, we had one core per socket, which meant a single processing unit performed all computational tasks.
1 Socket = 1 Core = 1 Physical CPU
Drawbacks of a Single-Core Processor
A single-core configuration doesn't work well in a multi-threaded environment: with only one core on the physical processor, all threads execute sequentially.
As a result, the system becomes slower.
Evolution of Multi-Core Processors to Support Multithreaded Environments
To support multi-threaded environments and increase system performance, CPU manufacturers added additional "cores," or central processing units, to a single physical processor. For example, a quad-core CPU has four central processing units, so-called "cores," on a single chip.
These four cores appear as 4 CPUs to the operating system, which can schedule multi-threaded applications simultaneously across all 4 cores. As a result, different processes run on each core at the same time. This speeds up the system and gives us a multitasking execution environment.
1 Socket = Multiple Cores = Multiple Processing Units
Socket Limitations of General-Purpose Operating Systems
A few operating systems are hard-limited to run on a fixed number of CPUs. Due to these socket limitations, OS vendors restrict the system to a limited number of physical CPUs even when more are available.
For example, Windows Server 2003 Standard Edition is limited to running on up to 4 CPUs. If we install this operating system on an 8-socket physical box, it will run on only 4 of the CPUs.
The catch here is that these OS vendors restrict the number of physical CPUs (sockets), not the number of logical CPUs (cores).
How CPU Vendors Resolved the Socket Limitation
Industry vendors started adding more cores per socket, and operating systems took advantage of multi-core CPUs. For example, if we now install Windows Server 2003 Standard Edition on a dual-socket, quad-core system (2 sockets * 4 cores = 8 physical CPUs), the operating system can schedule instructions on all 8 physical CPUs: only 2 sockets are in use, but the number of cores per socket has been increased to avoid the OS socket limitation.
More generally, suppose a general-purpose OS has a 2-socket limit, but an application needs 8 PCPUs.
With a multi-core implementation, we can expose 8 PCPUs to the application using a 2-socket, quad-core system:
2 Sockets * 4 Cores = 8 PCPUs.
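The socket-limit arithmetic above can be sketched as a small function. This is a hypothetical illustration (the function name and 4-socket limit are mine; the limit mirrors the Windows Server 2003 Standard Edition example):

```python
def usable_cpus(sockets, cores_per_socket, os_socket_limit=None):
    """Logical CPUs the OS can actually use, given a per-socket limit.

    The OS counts sockets against its limit, not cores, so adding
    cores per socket raises the usable CPU count without hitting it.
    """
    if os_socket_limit is not None:
        sockets = min(sockets, os_socket_limit)
    return sockets * cores_per_socket

# 8 single-core sockets under a 4-socket OS limit: only 4 CPUs usable.
print(usable_cpus(sockets=8, cores_per_socket=1, os_socket_limit=4))  # 4
# 2 quad-core sockets under the same limit: all 8 CPUs usable.
print(usable_cpus(sockets=2, cores_per_socket=4, os_socket_limit=4))  # 8
```

The same core count packaged into fewer sockets slips under the license limit, which is exactly the trick the vendors used.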
I hope you now understand what OS socket limitations are and how they were overcome in the physical world.
How VMware Addressed Socket Limitations at the Virtual Machine Guest OS Level
Prior to vSphere 4.1, there were no explicit socket or core options for vCPU assignment to a VM. The only available option was "Number of logical processors," which internally translated into the number of sockets.
By default, the VMkernel created 1 core per socket for each vCPU assigned to the guest OS.
For example, when a vSphere admin needed a VM with 4 vCPUs, he specified "number of logical processors -> 4," and the VMkernel created 4 virtual sockets, each with 1 core.
Below are examples of vCPU assignment to a guest OS prior to ESXi 4.1:
16 vCPU -> 16 Socket * 1 Core
10 vCPU -> 10 Socket * 1 Core
8 vCPU -> 8 Socket * 1 Core
6 vCPU -> 6 Socket * 1 Core
VMware's approach of assigning 1 socket per core meant that some operating systems could use only a limited number of CPUs, even when more vCPUs were assigned, due to the OS socket limitations described above.
As in the physical world, VMware implemented multiple cores per socket to overcome guest OS socket limitations. Now a virtual machine running Windows 2003 Standard Edition, configured with 1 virtual socket and 8 cores per socket, allows the operating system to utilize all 8 vCPUs.
Just to show you how this works, I initially configured a VM with 8 vCPUs, each core presented as a single socket (8 sockets * 1 core).
Reviewing the CPU configuration inside the guest OS, Task Manager showed only 4 vCPUs, because of the guest's socket limitation.
I then reconfigured the machine with 8 vCPUs as 1 socket and 8 cores per socket.
After powering on the virtual machine, the guest OS saw all 8 vCPUs.
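The reconfiguration in this demo simply picks a different factorization of the same vCPU count. A small sketch (the helper names are hypothetical; the 4-socket limit mirrors the guest OS in the example) shows how a topology can be chosen to fit a guest socket limit:

```python
def topologies(vcpus):
    # Every (sockets, cores_per_socket) pair that presents `vcpus` vCPUs.
    return [(s, vcpus // s) for s in range(1, vcpus + 1) if vcpus % s == 0]

def fit_socket_limit(vcpus, guest_socket_limit):
    # Pick a topology whose socket count the guest OS will fully use.
    for sockets, cores in topologies(vcpus):
        if sockets <= guest_socket_limit:
            return sockets, cores
    return None

# 8 vCPUs, guest limited to 4 sockets: 1 socket x 8 cores fits.
print(fit_socket_limit(8, 4))  # (1, 8)
```

Any of the factorizations presents the same 8 vCPUs; only the one that stays under the guest's socket limit lets the OS use them all.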
Does Multiple Cores per Socket Affect Virtual Machine Performance?
We now understand how guest OS socket limitations are overcome by assigning more cores per virtual socket in the VMware world. But the question remains: does using more sockets vs. more cores have any impact on virtual machine performance?
In brief, the answer is NO.
There is no performance difference between using virtual sockets and virtual cores when assigning vCPUs to a virtual machine. But this statement holds only as long as the total size of the virtual machine does not exceed a physical NUMA node.
As soon as vNUMA is used, however, cores per socket can have a real impact.
Why don't virtual sockets and virtual cores have a real impact on virtual machines? Why does vNUMA affect the virtual machine socket configuration while pNUMA doesn't? I'll be covering all the details about pNUMA and vNUMA in my upcoming articles.
I hope this article helps you understand the socket vs. core choice made during vCPU assignment to any virtual machine.
Please don't forget to share your comments and rate this article.
In my last blog, we discussed virtual sockets, virtual cores, guest OS socket limitations, and how VMware addresses these limitations in a vSphere environment.
Let me ask you the same question posted in my last blog.
Below are the setup details:
ESX server configuration: 2 Sockets * 2 Cores per Socket
VM1 configuration: 1 Socket * 4 Cores per Socket
VM2 configuration: 4 Sockets * 1 Core per Socket
The assumption here is that the VMs don't have any socket limitations.
Both VMs are running CPU- and memory-intensive workloads.
The question is: which VM will perform better, and why?
Answer: Assuming you all guessed correctly based on our earlier discussion, both VMs will perform equally. In other words, the split between sockets and cores doesn't impact VM performance at all; there is no performance difference between virtual sockets and virtual cores.
Why VM performance isn't impacted by virtual socket or core allocation:
The VM is insulated by the power of the abstraction layer. Virtual sockets and virtual cores are logical entities defined by the VMkernel for the vCPU configuration at the VM level. When we run an operating system, the guest OS detects the hardware layout within the virtual machine, i.e., the number of sockets and cores available at the guest OS level, and schedules instructions accordingly. For example, under a guest OS socket limitation, it will exercise more cores rather than more sockets.
As I said, the scope of virtual sockets and virtual cores is limited to the guest OS level. The VMkernel schedules a VMM process for every vCPU assigned to a virtual machine.
From the VMkernel's perspective, the vCPU count is simply cores per socket * number of sockets. In the scenario above, VM1 requires 1 * 4 = 4 vCPUs,
and VM2 requires 4 * 1 = 4 vCPUs.
In conclusion, from the VMkernel's perspective, both VMs require an equal number of vCPUs, regardless of the number of sockets or cores per socket allocated to the virtual machine.
The scope of virtual sockets and virtual cores is limited to the guest OS level. At the VMkernel level, the total number of sockets and cores is translated into a number of vCPUs, which the CPU scheduler maps onto physical CPUs.
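The VM1/VM2 arithmetic can be written out as a one-line calculation. This is an illustrative sketch of the bookkeeping, not VMkernel code:

```python
def vcpu_count(sockets, cores_per_socket):
    # From the VMkernel's perspective a VM is just sockets * cores vCPUs;
    # one VMM world is scheduled per vCPU regardless of the split.
    return sockets * cores_per_socket

vm1 = vcpu_count(sockets=1, cores_per_socket=4)
vm2 = vcpu_count(sockets=4, cores_per_socket=1)
print(vm1, vm2, vm1 == vm2)  # 4 4 True
```

Both topologies collapse to the same four VMM worlds, which is why the scheduler treats the two VMs identically.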
Let's explore the example of a 2-virtual-socket, 2-virtual-core configuration.
The light blue box shows the configuration the virtual machine presents to the guest OS. For each vCPU, the VMkernel schedules a VMM world; when a CPU instruction leaves the virtual machine, it gets picked up by a vCPU VMM world. Socket configurations are transparent to the VMkernel.
There is another twist to the story: if your VM is configured with more than 8 vCPUs, the number of virtual sockets will impact virtual machine performance, because vNUMA gets activated.
In vSphere 5.0, vNUMA is enabled by default on VMs with more than 8 vCPUs, and the VMkernel presents the physical NUMA topology (NUMA clients and NUMA nodes) directly to the guest OS for better scheduling decisions inside the guest.
In such vNUMA scenarios, virtual machine performance depends directly on the number of sockets presented to the guest OS, because vNUMA node creation is based on the number of sockets populated to the operating system. More sockets mean more vNUMA nodes, and a virtual topology that matches the physical NUMA layout means better performance.
Let's deep-dive into NUMA architecture concepts.
WHAT IS NUMA?
Definition from Wikipedia:
“Non-Uniform Memory Access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than non-local memory, that is, memory local to another processor or memory shared between processors.”
NUMA architecture is a shared memory architecture that describes the placement of main memory modules with respect to processors in a multiprocessor system.
"Ignorance of NUMA can result in application performance issues."
Background of NUMA Architecture:
UMA (Uniform Memory Access)
Perhaps the best way to understand NUMA is to compare it with its cousin UMA, or Uniform Memory Access. In the UMA memory architecture, all processors access shared memory through a bus (or another type of interconnect) as seen in the following diagram:
UMA gets its name from the fact that each processor must use the same shared bus to access memory, resulting in a memory access time that is uniform across all processors. Note that access time is also independent of data location within memory. That is, access time remains the same regardless of which shared memory module contains the data to be retrieved.
NUMA (Non-Uniform Memory Access)
In the NUMA shared memory architecture, each processor has its own local memory module that it can access directly with a distinctive performance advantage. At the same time, it can also access any memory module belonging to another processor using a shared bus (or some other type of interconnect) as seen in the diagram below:
Why NUMA is better than UMA
In NUMA, as the name implies, non-uniform memory access means that memory access time varies with the location of the data being accessed.
If data resides in local memory, access is fast.
If data resides in remote memory, access is slower.
The advantage of the NUMA architecture as a hierarchical shared memory scheme is its potential to improve average case access time through the introduction of fast, local memory.
In conclusion, NUMA stands for Non-Uniform Memory Access, which translates into a variance of memory access latencies. Both AMD Opteron and Intel Nehalem are NUMA architectures. A processor and its memory form a NUMA node. Access to memory within the same NUMA node is considered local access; access to memory belonging to another NUMA node is considered remote access.
How NUMA Nodes Get Created
NUMA nodes are created per socket, and the memory for each NUMA node is calculated by dividing the total system memory by the number of NUMA nodes.
Suppose a physical system is configured with 4 sockets * 4 cores per socket and 12 GB of total memory.
In this case, total NUMA nodes created = 4
Memory allocated to each NUMA node = 12/4 = 3 GB
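The node and memory calculation above can be sketched as follows. This is a simplified model that assumes one NUMA node per socket and an even memory split, as in the example:

```python
def numa_layout(sockets, cores_per_socket, total_memory_gb):
    # Simplified model: one NUMA node per socket, memory split evenly.
    nodes = sockets
    return {
        "nodes": nodes,
        "cores_per_node": cores_per_socket,
        "memory_per_node_gb": total_memory_gb / nodes,
    }

layout = numa_layout(sockets=4, cores_per_socket=4, total_memory_gb=12)
print(layout["nodes"], layout["memory_per_node_gb"])  # 4 3.0
```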
Case Study 1: OS Is Not NUMA Aware
A physical system is configured with 4 sockets * 4 cores per socket and 12 GB of memory.
A multi-threaded SQL application, along with some general-purpose applications, is running on the OS installed on this system.
Since the OS is not NUMA aware, in the worst-case CPU allocation the application's 4 threads can be scheduled on 4 different cores in 4 different NUMA nodes. In this case, a lot of data will be accessed through remote memory over the interconnect link, which increases memory latencies and reduces overall application performance.
Refer to the diagram below:
Case Study 2: OS Is NUMA Aware
Since the OS is NUMA aware and has a complete view of the physical system's NUMA nodes, it will try its best to schedule multiple threads of the same application within a single NUMA node, avoiding remote memory access and using that node's local memory as much as it can for better performance.
In this example, all 4 threads of the SQL application are scheduled on the 4 cores of a single NUMA node, as decided by the OS's NUMA-aware CPU scheduler. Since all threads access the local memory assigned to that NUMA node, no data is fetched from remote memory, which improves overall application performance.
Refer to the diagram below:
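The two case studies can be compared with a toy latency model; the 100 ns local and 300 ns remote figures are illustrative assumptions, not measurements:

```python
LOCAL_NS, REMOTE_NS = 100, 300  # illustrative latencies, not measured values

def avg_latency_ns(thread_nodes, data_node):
    # Each thread accesses data held in data_node's memory; an access from
    # the same node is local, anything else crosses the interconnect.
    costs = [LOCAL_NS if n == data_node else REMOTE_NS for n in thread_nodes]
    return sum(costs) / len(costs)

# Case 1: a NUMA-unaware OS scatters 4 threads across 4 nodes; data on node 0.
scattered = avg_latency_ns([0, 1, 2, 3], data_node=0)
# Case 2: a NUMA-aware OS packs all 4 threads onto node 0 with the data.
packed = avg_latency_ns([0, 0, 0, 0], data_node=0)
print(scattered, packed)  # 250.0 100.0
```

Even in this crude model, packing the threads next to their data cuts the average access latency substantially, which is the whole point of NUMA-aware scheduling.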
That's why NUMA plays such an important role: it can seriously influence the performance of memory-intensive workloads.
I hope this article helps you understand the basics of NUMA architecture and how NUMA influences workload performance.
In my upcoming articles, I will cover a few more NUMA details in the context of an ESXi environment, such as:
How does the ESXi NUMA scheduler work? How is pNUMA different from vNUMA? How does vCPU sizing impact the NUMA scheduler in an ESXi environment? How do you read NUMA stats using the esxtop command?
Please feel free to post your queries; I would be happy to answer them. And please don't forget to leave comments or feedback about this article.