The figure shows that the single threaded non vectorized run only provided ~0.66 Gflops. However, expected DP flops for a single core run is 2(FMA)x1(DP elements-non vectorized)x1.1 GHz = 2.2 Gflops. However, as described in architecture section, the hardware cannot issue instructions back to back from the same thread in the core. To reach full execution unit utilization at least two threads must be running at all times. So running on 2 threads (OMP_NUM_THREADS=2), we are able to reach 1.45 Gflops per core. As the core still uses vector unit to perform scalar arithmetic, the code for scalar arithmetic is very inefficient. For each fma on a DP element, it has to broadcast the element to all the lanes. Operate on the vector register with a mask and then store the single element back to memory.