Citation:
The Intel Xeon Phi coprocessor uses time-multiplexed multithreading. The PPF-stage thread picker selects which thread's request is sent to the instruction cache and placed into the prefetch buffer, while a second picker selects which prefetch-buffer entries are read out and sent to the decoder. It uses a smart round-robin mechanism that picks only threads with work to do and skips threads that are inactive for various reasons, such as a cache miss. One can influence thread scheduling by putting delay loops in a thread so that the hardware thread picker will not schedule it (which is useful for optimization).
This sounds like smoke and mirrors to me. If the thread that has to wait is not scheduled anyway, then adding delay loops changes nothing...
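For what it's worth, the kind of delay loop the quoted text seems to describe would look something like this. This is only a sketch of my reading of the claim, not Intel's documented mechanism: a thread spinning here does no useful work, so the hardware thread picker would (per the quote) favor other, ready threads. The function name and iteration count are hypothetical.

```c
#include <stdint.h>

/* Hypothetical delay loop: busy-work that keeps a thread occupied
 * without touching memory shared with other threads. The volatile
 * accumulator prevents the compiler from optimizing the loop away. */
static uint64_t delay_loop(uint64_t iters) {
    volatile uint64_t sink = 0;
    for (uint64_t i = 0; i < iters; i++)
        sink += i;          /* sums 0..iters-1; the value itself is unused */
    return sink;
}
```

Whether this actually steers the hardware picker in a useful way is exactly what is being questioned above.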
Citation:
The figure shows that the single-threaded, non-vectorized run only reached ~0.66 Gflops. However, the expected DP flop rate for a single-core run is 2 (FMA) x 1 (DP element, non-vectorized) x 1.1 GHz = 2.2 Gflops. As described in the architecture section, the hardware cannot issue back-to-back instructions from the same thread on a core, so at least two threads must be running at all times to fully utilize the execution units. Running with two threads (OMP_NUM_THREADS=2), we reach 1.45 Gflops per core. Since the core still uses the vector unit to perform scalar arithmetic, the scalar code is very inefficient: for each FMA on a DP element, it has to broadcast the element to all the lanes, operate on the vector register under a mask, and then store the single element back to memory.
That's exactly what I was afraid of. The scalar code is almost twice as slow as the vector code.