Como usar parallel_invoke para escrever uma rotina de classificação em paralelo

Artigo
10/12/2023

Este documento descreve como usar o algoritmo parallel_invoke para melhorar o desempenho do algoritmo de classificação bitônica. O algoritmo de classificação bitônica divide recursivamente a sequência de entrada em partições classificadas menores. O algoritmo de classificação bitônica pode ser executado em paralelo porque cada operação de partição é independente de todas as demais operações.

Embora a classificação bitônica seja um exemplo de uma rede de classificação que classifica todas as combinações de sequências de entrada, este exemplo classifica sequências cujos comprimentos são uma potência de dois.

Observação

Este exemplo usa uma rotina de classificação paralela para ilustração. Você também pode usar os algoritmos de classificação internos fornecidos pelo PPL: concurrency::parallel_sort, concurrency::parallel_buffered_sort e concurrency::parallel_radixsort. Para obter mais informações, consulte Algoritmos paralelos.

Seções

Este tópico descreve as seguintes tarefas:

Executando a classificação bitônica em série
Usando parallel_invoke para executar a classificação bitônica em paralelo

Executando a classificação bitônica em série

O exemplo a seguir mostra a versão serial do algoritmo de classificação bitônica. A função bitonic_sort divide a sequência em duas partições, classifica essas partições em direções opostas e mescla os resultados. Essa função chama a si mesma duas vezes recursivamente para classificar cada partição.

const bool INCREASING = true;
const bool DECREASING = false;

// Comparator function for the bitonic sort algorithm.
template <class T>
void compare(T* items, int i, int j, bool dir)
{
   if (dir == (items[i] > items[j]))
   {
      swap(items[i], items[j]);
   }
}

// Sorts a bitonic sequence in the specified order.
template <class T>
void bitonic_merge(T* items, int lo, int n, bool dir)
{
   if (n > 1)
   {
      int m = n / 2;
      for (int i = lo; i < lo + m; ++i)
      {
         compare(items, i, i + m, dir);
      }
      bitonic_merge(items, lo, m, dir);
      bitonic_merge(items, lo + m, m, dir);
   }
}

// Sorts the given sequence in the specified order.
template <class T>
void bitonic_sort(T* items, int lo, int n, bool dir)
{
   if (n > 1)
   {
      // Divide the array into two partitions and then sort 
      // the partitions in different directions.
      int m = n / 2;
      bitonic_sort(items, lo, m, INCREASING);
      bitonic_sort(items, lo + m, m, DECREASING);

      // Merge the results.
      bitonic_merge(items,lo, n, dir);
   }
}

// Sorts the given sequence in increasing order.
template <class T>
void bitonic_sort(T* items, int size)
{
    bitonic_sort(items, 0, size, INCREASING);
}

[Parte superior]

Usando parallel_invoke para executar a classificação bitônica em paralelo

Esta seção descreve como usar o algoritmo parallel_invoke para executar o algoritmo de classificação bitônica em paralelo.

Para realizar o algoritmo de classificação bitonic em paralelo

Adicione uma diretiva #include para o arquivo de cabeçalho ppl.h.
```
#include <ppl.h>
```
Adicione uma diretiva using para o namespace concurrency.
```
using namespace concurrency;
```

Crie uma nova função, chamada parallel_bitonic_mege, que usa o algoritmo parallel_invoke para mesclar as sequências em paralelo se houver quantidade suficiente de trabalho a ser feito. Caso contrário, chame bitonic_merge para mesclar as sequências serialmente.

// Sorts a bitonic sequence in the specified order.
template <class T>
void parallel_bitonic_merge(T* items, int lo, int n, bool dir)
{   
   // Merge the sequences concurrently if there is sufficient work to do.
   if (n > 500)
   {
      int m = n / 2;
      for (int i = lo; i < lo + m; ++i)
      {
         compare(items, i, i + m, dir);
      }

      // Use the parallel_invoke algorithm to merge the sequences in parallel.
      parallel_invoke(
         [&items,lo,m,dir] { parallel_bitonic_merge(items, lo, m, dir); },
         [&items,lo,m,dir] { parallel_bitonic_merge(items, lo + m, m, dir); }
      );
   }
   // Otherwise, perform the work serially.
   else if (n > 1)
   {
      bitonic_merge(items, lo, n, dir);
   }   
}

Execute um processo semelhante ao da etapa anterior, mas para a função bitonic_sort.

// Sorts the given sequence in the specified order.
template <class T>
void parallel_bitonic_sort(T* items, int lo, int n, bool dir)
{   
   if (n > 1)
   {
      // Divide the array into two partitions and then sort 
      // the partitions in different directions.
      int m = n / 2;

      // Sort the partitions in parallel.
      parallel_invoke(
         [&items,lo,m] { parallel_bitonic_sort(items, lo, m, INCREASING); },
         [&items,lo,m] { parallel_bitonic_sort(items, lo + m, m, DECREASING); }
      );
      
      // Merge the results.
      parallel_bitonic_merge(items, lo, n, dir);
   }
}

Crie uma versão sobrecarregada da função parallel_bitonic_sort que classifica a matriz em ordem crescente.

// Sorts the given sequence in increasing order.
template <class T>
void parallel_bitonic_sort(T* items, int size)
{
   parallel_bitonic_sort(items, 0, size, INCREASING);
}

O algoritmo parallel_invoke reduz a sobrecarga executando o último da série de tarefas no contexto de chamada. Por exemplo, na função parallel_bitonic_sort, a primeira tarefa é executada em um contexto separado e a segunda tarefa é executada no contexto de chamada.

// Sort the partitions in parallel.
parallel_invoke(
   [&items,lo,m] { parallel_bitonic_sort(items, lo, m, INCREASING); },
   [&items,lo,m] { parallel_bitonic_sort(items, lo + m, m, DECREASING); }
);

O exemplo completo a seguir executa as versões serial e paralela do algoritmo de classificação bitônica. O exemplo também imprime no console o tempo necessário para executar cada computação.

// parallel-bitonic-sort.cpp
// compile with: /EHsc
#include <windows.h>
#include <algorithm>
#include <iostream>
#include <random>
#include <ppl.h>

using namespace concurrency;
using namespace std;

// Calls the provided work function and returns the number of milliseconds 
// that it takes to call that function.
template <class Function>
__int64 time_call(Function&& f)
{
   __int64 begin = GetTickCount();
   f();
   return GetTickCount() - begin;
}

const bool INCREASING = true;
const bool DECREASING = false;

// Comparator function for the bitonic sort algorithm.
template <class T>
void compare(T* items, int i, int j, bool dir)
{
   if (dir == (items[i] > items[j]))
   {
      swap(items[i], items[j]);
   }
}

// Sorts a bitonic sequence in the specified order.
template <class T>
void bitonic_merge(T* items, int lo, int n, bool dir)
{
   if (n > 1)
   {
      int m = n / 2;
      for (int i = lo; i < lo + m; ++i)
      {
         compare(items, i, i + m, dir);
      }
      bitonic_merge(items, lo, m, dir);
      bitonic_merge(items, lo + m, m, dir);
   }
}

// Sorts the given sequence in the specified order.
template <class T>
void bitonic_sort(T* items, int lo, int n, bool dir)
{
   if (n > 1)
   {
      // Divide the array into two partitions and then sort 
      // the partitions in different directions.
      int m = n / 2;
      bitonic_sort(items, lo, m, INCREASING);
      bitonic_sort(items, lo + m, m, DECREASING);

      // Merge the results.
      bitonic_merge(items,lo, n, dir);
   }
}

// Sorts the given sequence in increasing order.
template <class T>
void bitonic_sort(T* items, int size)
{
    bitonic_sort(items, 0, size, INCREASING);
}

// Sorts a bitonic sequence in the specified order.
template <class T>
void parallel_bitonic_merge(T* items, int lo, int n, bool dir)
{   
   // Merge the sequences concurrently if there is sufficient work to do.
   if (n > 500)
   {
      int m = n / 2;
      for (int i = lo; i < lo + m; ++i)
      {
         compare(items, i, i + m, dir);
      }

      // Use the parallel_invoke algorithm to merge the sequences in parallel.
      parallel_invoke(
         [&items,lo,m,dir] { parallel_bitonic_merge(items, lo, m, dir); },
         [&items,lo,m,dir] { parallel_bitonic_merge(items, lo + m, m, dir); }
      );
   }
   // Otherwise, perform the work serially.
   else if (n > 1)
   {
      bitonic_merge(items, lo, n, dir);
   }   
}

// Sorts the given sequence in the specified order.
template <class T>
void parallel_bitonic_sort(T* items, int lo, int n, bool dir)
{   
   if (n > 1)
   {
      // Divide the array into two partitions and then sort 
      // the partitions in different directions.
      int m = n / 2;

      // Sort the partitions in parallel.
      parallel_invoke(
         [&items,lo,m] { parallel_bitonic_sort(items, lo, m, INCREASING); },
         [&items,lo,m] { parallel_bitonic_sort(items, lo + m, m, DECREASING); }
      );
      
      // Merge the results.
      parallel_bitonic_merge(items, lo, n, dir);
   }
}

// Sorts the given sequence in increasing order.
template <class T>
void parallel_bitonic_sort(T* items, int size)
{
   parallel_bitonic_sort(items, 0, size, INCREASING);
}

int wmain()
{  
   // For this example, the size must be a power of two.
   const int size = 0x200000;

   // Create two large arrays and fill them with random values.
   int* a1 = new int[size];
   int* a2 = new int[size];

   mt19937 gen(42);
   for(int i = 0; i < size; ++i)
   {
      a1[i] = a2[i] = gen();
   }

   __int64 elapsed;
   
   // Perform the serial version of the sort.
   elapsed = time_call([&] { bitonic_sort(a1, size); });
   wcout << L"serial time: " << elapsed << endl;

   // Now perform the parallel version of the sort.
   elapsed = time_call([&] { parallel_bitonic_sort(a2, size); });
   wcout << L"parallel time: " << elapsed << endl;
 
   delete[] a1;
   delete[] a2;
}

A saída de exemplo a seguir é para um computador que tem quatro processadores.

serial time: 4353
parallel time: 1248

[Parte superior]

Compilando o código

Para compilar o código, copie-o e cole-o em um projeto do Visual Studio, ou cole-o em um arquivo chamado parallel-bitonic-sort.cpp e execute o seguinte comando em uma janela do Prompt de comando do Visual Studio.

cl.exe /EHsc parallel-bitonic-sort.cpp

Programação robusta

Este exemplo usa o algoritmo parallel_invoke em vez da classe concurrency::task_group porque o tempo de vida de cada grupo de tarefas não se estende além de uma função. Recomendamos que você use parallel_invoke quando puder, pois ele tem menos sobrecarga de execução do que objetos task group e, portanto, permite que você escreva um código de melhor desempenho.

As versões paralelas de alguns algoritmos têm um desempenho melhor somente quando há trabalho suficiente para fazer. Por exemplo, a função parallel_bitonic_merge chamará a versão serial, bitonic_merge se houver 500 ou menos elementos na sequência. Você também pode planejar sua estratégia de classificação geral com base na quantidade de trabalho. Por exemplo, pode ser mais eficiente usar a versão serial do algoritmo de classificação rápida se a matriz contiver menos de 500 itens, conforme mostrado no exemplo a seguir:

template <class T>
void quick_sort(T* items, int lo, int n)
{
   // TODO: The function body is omitted for brevity.
}

template <class T>
void parallel_bitonic_sort(T* items, int lo, int n, bool dir)
{
   // Use the serial quick sort algorithm if there are relatively few
   // items to sort. The associated overhead for running few tasks in 
   // parallel may not overcome the benefits of parallel processing.
   if (n - lo + 1 <= 500)
   {
      quick_sort(items, lo, n);
   }
   else if (n > 1)
   {
      // Divide the array into two partitions and then sort 
      // the partitions in different directions.
      int m = n / 2;

      // Sort the partitions in parallel.
      parallel_invoke(
         [&items,lo,m] { parallel_bitonic_sort(items, lo, m, INCREASING); },
         [&items,lo,m] { parallel_bitonic_sort(items, lo + m, m, DECREASING); }
      );
      
      // Merge the results.
      parallel_bitonic_merge(items, lo, n, dir);
   }
}

Assim como acontece com qualquer algoritmo paralelo, recomendamos que você faça o perfil e ajuste seu código conforme apropriado.

Confira também

Paralelismo de tarefas
parallel_invoke Função

Compartilhar via