Vector issues using Intel's C++ compiler in Visual Studio

The code below is part of a project I am developing. It is basically a square-matrix multiplication; however, the results I got parallelizing the application with the OpenMP API were better than the results I obtained using the SIMD directives of the same API.

What am I doing wrong? Is it the syntax?

Some information that may be pertinent to identifying the problem: I am using the Intel compiler through the Visual Studio IDE. Visual Studio's OpenMP is version 2.0 (which does not support SIMD), but I believe the version that comes with the compiler I am using is 4.0. In any case, parallel processing is a new activity for me, so I would heartily appreciate any clarification. Here is the code:

#include "stdafx.h"
#include <iostream> 
#include <time.h>
#include <omp.h>

using namespace std;

int lin = 800, col = 800; // row and column dimensions


int main()
{

    // --------------------------------------
    // Allocate matrix 1
    int** m1 = new int*[lin];
    for (int i = 0; i < lin; ++i)
        m1[i] = new int[col];
    // --------------------------------------


    // --------------------------------------
    // Allocate matrix 2
    int** m2 = new int*[lin];
    for (int i = 0; i < lin; ++i)
        m2[i] = new int[col];
    // --------------------------------------


    // --------------------------------------
    // Allocate the result matrix
    int** res = new int*[lin];
    for (int i = 0; i < lin; ++i)
        res[i] = new int[col];
    // --------------------------------------

    cout << "criou matrizes" << endl;



// FILL m1 and m2
// ----------------------------------------------------------------------------

// PARALLEL BLOCK
#pragma omp simd collapse (2)
        for (int i = 0; i < lin; ++i) {
            for (int j = 0; j < col; ++j) {
                m1[i][j] = (i + 1);
            }
        }

// END OF PARALLEL BLOCK

// BLOCO PARALELO
#pragma omp simd collapse (2)
        for (int i = 0; i < lin; ++i) {
            for (int j = 0; j < col; ++j) {
                m2[i][j] = (i + 1);
            }
        }

// END OF PARALLEL BLOCK

cout << "filled" << endl;

// ----------------------------------------------------------------------------



    // make the magic happen

    clock_t timer = clock(); // timing start value


    // ----------------------------------------------------------------------------
    cout << "started" << endl;


#pragma omp simd collapse (2)
    for (int i = 0; i < lin; i++)
    {
        for (int j = 0; j < col; j++)
        {
            res[i][j] = 0;
            for (int k = 0; k < col; k++)
                res[i][j] += m1[i][k] * m2[k][j];
        }
    }
    cout << "finished" << endl;
    // ----------------------------------------------------------------------------

    // record the final time and display it
    timer = clock() - timer;
    cout << "Program finished in " << ((float)timer) / CLOCKS_PER_SEC << " seconds" << endl;

    system("Pause");
}

  • Why should the results with OpenMP be worse than the results with SIMD?

  • Because, in addition to parallelizing the work, SIMD performs multiple vector calculations simultaneously. Therefore, SIMD should achieve a better result than plain parallelism.

1 answer

This question is similar to one asked on Stack Overflow in English, and the answer here is the same as the one it received there. Since the purpose of this forum is to provide answers in Portuguese, I will translate @Jonathan-dursi's reply:

The standard linked there is relatively clear (page 13, lines 19-20):

When any thread encounters a simd construct, the iterations of the loop associated with the construct can be executed by the SIMD lanes that are available to the thread.

SIMD is a sub-thread thing. To make it more concrete, on a CPU you can imagine using simd directives to specifically request the vectorization of chunks of loop iterations that individually belong to the same thread. This exposes the multiple levels of parallelism that exist within a single multicore processor, in a platform-independent way. See, for instance, the discussion (along with the accelerator material) in this post on the Intel blog.
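To make that concrete, here is a minimal sketch (my illustration, not part of the original answer): a single thread runs the loop, and the directive merely asks the compiler to spread the iterations across SIMD lanes.

int a[800];
for (int j = 0; j < 800; ++j)
    a[j] = j;

int sum = 0;
// No new threads are created here: one thread, many SIMD lanes.
#pragma omp simd reduction(+:sum)
for (int j = 0; j < 800; ++j)
    sum += a[j];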

So basically, you will want to use omp parallel to distribute the work onto different threads, which can then migrate to multiple cores; and you will want to use omp simd to make use of the vector pipelines (for example) within each core. Normally omp parallel goes on the outside of a chunk of code, to handle the coarser-grained distribution of work, and omp simd goes around tight loops inside it to exploit fine-grained parallelism.
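Applied to the multiplication from the question, that division of labor could look like the sketch below. It reuses the m1, m2, res, lin and col from the question, and it interchanges the k and j loops so that the innermost loop walks each row contiguously, which is friendlier to the vectorizer; treat it as an illustration, not a drop-in replacement.

#pragma omp parallel for
    for (int i = 0; i < lin; i++)
    {
        // Each thread owns whole rows of res, so there are no data races.
#pragma omp simd
        for (int j = 0; j < col; j++)
            res[i][j] = 0;

        for (int k = 0; k < col; k++)
        {
            int aik = m1[i][k]; // scalar reused across the entire vector loop
#pragma omp simd
            for (int j = 0; j < col; j++)
                res[i][j] += aik * m2[k][j]; // unit-stride access, vectorizable
        }
    }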

In short: simd alone has little chance of beating a plain parallel region. The simd directive does not create a parallel region.

SIMD code will only be more efficient than parallel code if it better addresses common bottlenecks such as cache misses, or if your processor has more SIMD lanes than cores, which is quite unusual. Moreover, the simd directive is only a hint to the compiler: there is no guarantee that your code will actually be vectorized (the Intel compiler, for example, can emit an optimization report, /Qopt-report, that tells you what was vectorized).

You may see gains if you combine the two directives.
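For example (again only a sketch: it assumes an OpenMP 4.0 compiler such as the Intel one mentioned in the question, since Visual Studio's built-in OpenMP 2.0 will reject the construct), the fill loops from the question could use the combined form:

#pragma omp parallel for simd collapse(2)
    for (int i = 0; i < lin; i++)
        for (int j = 0; j < col; j++)
            m1[i][j] = i + 1; // threads split the iterations; each chunk is vectorized

Keep in mind that with the int** jagged allocation from the question the compiler may still decline to vectorize because of the pointer indirection; a single contiguous block (for example, one new int[lin * col]) is generally friendlier to simd.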

(and I accidentally answered the question in English instead of this one :p)
