第一届PKU HPCGame WriteUp

A. 欢迎参赛！

签到

B. 流量席卷小土豆

{"password":"de3e27056bde306b803cc5602f4cde9d464912330b4562ae163bd21e23495bcd","job_id":"6437"}

首先尝试8nodes 8tasks未能得分

1 2	Point 0: msg: job was not run with 4 nodes, 4 tasks per node

使用4nodes 4tasks得到point 0分数：

1 2	Point 0: msg: job was run with 4 nodes, 4 tasks per node by user `hu_y68lstwp8s7q`

ssh 登录在 shell 直接执行会因为爆内存被 kill

1
2
3

** (process:1206968): WARNING **: 21:52:42.121: Dissector bug, protocol VNC, in packet 2094662: Adding vnc.hextile_anysubrects would put more than 1000000 items in the tree -- possible infinite loop
mergecap: The file "0.ssh.pcap" appears to have been cut short in the middle of a packet.
slurmstepd: error: Detected 1 oom_kill event in StepId=5600.batch. Some of the step tasks have been OOM Killed.

根据文档 https://hpc.pku.edu.cn/ug/guide/quickstart/ 写 job.sh

#!/bin/bash
#SBATCH -o job.%j.out
#SBATCH -p C064M0256G
#SBATCH --qos=high
#SBATCH -J TASK
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4

srun -N4 --ntasks-per-node=4 tshark -r /lustre/shared_data/potato_kingdom_univ_trad_cluster/pcaps/0.pcap -Y ssh -w 0.ssh.pcap
srun -N4 --ntasks-per-node=4 tshark -r /lustre/shared_data/potato_kingdom_univ_trad_cluster/pcaps/1.pcap -Y ssh -w 1.ssh.pcap
srun -N4 --ntasks-per-node=4 tshark -r /lustre/shared_data/potato_kingdom_univ_trad_cluster/pcaps/2.pcap -Y ssh -w 2.ssh.pcap
srun -N4 --ntasks-per-node=4 tshark -r /lustre/shared_data/potato_kingdom_univ_trad_cluster/pcaps/3.pcap -Y ssh -w 3.ssh.pcap
srun -N4 --ntasks-per-node=4 tshark -r /lustre/shared_data/potato_kingdom_univ_trad_cluster/pcaps/4.pcap -Y ssh -w 4.ssh.pcap
srun -N4 --ntasks-per-node=4 tshark -r /lustre/shared_data/potato_kingdom_univ_trad_cluster/pcaps/5.pcap -Y ssh -w 5.ssh.pcap
srun -N4 --ntasks-per-node=4 tshark -r /lustre/shared_data/potato_kingdom_univ_trad_cluster/pcaps/6.pcap -Y ssh -w 6.ssh.pcap
srun -N4 --ntasks-per-node=4 tshark -r /lustre/shared_data/potato_kingdom_univ_trad_cluster/pcaps/7.pcap -Y ssh -w 7.ssh.pcap
srun -N4 --ntasks-per-node=4 tshark -r /lustre/shared_data/potato_kingdom_univ_trad_cluster/pcaps/8.pcap -Y ssh -w 8.ssh.pcap
srun -N4 --ntasks-per-node=4 tshark -r /lustre/shared_data/potato_kingdom_univ_trad_cluster/pcaps/9.pcap -Y ssh -w 9.ssh.pcap
srun -N4 --ntasks-per-node=4 tshark -r /lustre/shared_data/potato_kingdom_univ_trad_cluster/pcaps/10.pcap -Y ssh -w 10.ssh.pcap
srun -N4 --ntasks-per-node=4 tshark -r /lustre/shared_data/potato_kingdom_univ_trad_cluster/pcaps/11.pcap -Y ssh -w 11.ssh.pcap
srun -N4 --ntasks-per-node=4 tshark -r /lustre/shared_data/potato_kingdom_univ_trad_cluster/pcaps/12.pcap -Y ssh -w 12.ssh.pcap
srun -N4 --ntasks-per-node=4 tshark -r /lustre/shared_data/potato_kingdom_univ_trad_cluster/pcaps/13.pcap -Y ssh -w 13.ssh.pcap
srun -N4 --ntasks-per-node=4 tshark -r /lustre/shared_data/potato_kingdom_univ_trad_cluster/pcaps/14.pcap -Y ssh -w 14.ssh.pcap
srun -N4 --ntasks-per-node=4 tshark -r /lustre/shared_data/potato_kingdom_univ_trad_cluster/pcaps/15.pcap -Y ssh -w 15.ssh.pcap
srun -N4 --ntasks-per-node=4 mergecap -w merged.pcap *.ssh.pcap

sbatch job.sh 命令提交到 SCOW 计算平台

Completed 之后 quantum-cracker merged.pcap 得到密码：

1 2	[hu_y68lstwp8s7q@l08c49n1 ~]$ quantum-cracker merged.pcap de3e27056bde306b803cc5602f4cde9d464912330b4562ae163bd21e23495bcd

同时对比8nodes 8tasks的答案，得到point 1分数

1 2	[hu_y68lstwp8s7q@l08c49n1 ~]$ quantum-cracker merged.pcap de3e27056bde306b803cc5602f4cde9d464912330b4562ae163bd21e23495bcd

成功解出B题

C. 简单的编译

Makefile: https://makefiletutorial.com/

CMakeLists: https://en.wikipedia.org/wiki/CMake#CMakeLists.txt

Makefile:

all:
	nvcc -o hello_cuda hello_cuda.cu
	mpic++ -o hello_mpi hello_mpi.cpp
	g++ -fopenmp -o hello_omp hello_omp.cpp

CMakeLists.txt:

cmake_minimum_required(VERSION 3.26)

project(HPC LANGUAGES CXX CUDA)

# packages
find_package(MPI REQUIRED)
find_package(OpenMP REQUIRED)

# CUDA
add_executable(hello_cuda hello_cuda.cu)

# MPI
add_executable(hello_mpi hello_mpi.cpp)
target_link_libraries(hello_mpi MPI::MPI_CXX)

# OpenMP
add_executable(hello_omp hello_omp.cpp)
target_link_libraries(hello_omp OpenMP::OpenMP_CXX)

D. 小北问答 Classic

［填空］ 根据 2023 年 11 月公布的 GREEN500 排行榜，能量效率最高的超级计算的能效达到了65.396（GFlops/watts）
［填空］ 根据 Amdahl 定律，在关键路径上，将占据程序运行时间 10% 的部分加速到 2 倍，则程序整体的加速比为105.26%；将占据程序运行时间40%的部分加速到1.2倍，则程序整体的加速比为107.14%。（均保留两位小数，均假设加速前后程序的关键路径不变）
［单选］ 以下哪个工具的作用与其他选项的区别最大？答案：GCC
1. Meson
2. Autoconf
3. CMake
4. GCC
［单选］ 从以下选项中选择正确答案依次填入空中：MPI 提供的是进程级别的并行；OpenMP 提供的是线程级别的并行；两者相比，线程级别的并行更加轻量。
1. 线程
2. 进程
［混合］

AVX512 指令集是现代 x86_64 CPU 上常用的用于加速 SIMD 运算的指令集，使用 AVX512 指令集，一条指令可以对最多 8（填空）个 64 位浮点数进行运算。各个厂商对 AVX512 指令集的实现有所不同，如 Intel（以 Ice Lake SP 为例）在CPU中设计独立的 AVX512 运算单元，但有可能导致调用 AVX512 指令集时因功耗过大而降频，AMD（以 ZEN4 架构为例）复用两个 AVX2 运算单元执行 AVX512 运算。
1. 在CPU中设计独立的 AVX512 运算单元，但有可能导致调用 AVX512 指令集时因功耗过大而降频
2. 复用两个 AVX2 运算单元执行 AVX512 运算
［单选］ 以下哪个不是专门用于在 GPU 上设计并行计算程序的？答案：OpenGL
1. OpenACC
2. CUDA
3. HIP
4. OpenGL
［多选］

SIMD（Single Instruction, Multiple Data）和 SIMT（Single Instruction, Multiple Threads）是并行计算的两种模型。下面哪些描述是正确的？（示例回答方式：i,ii,iii,iv）答案：iii,iv
1. SIMD 和 SIMT 都是同一种并行计算模型，只是不同的命名。
2. SIMD 每个线程执行相同的指令，但处理不同的数据；SIMT中每个线程可以执行不同的指令。
3. AVX512 指令集是 SIMD 模型的代表，而 OpenMP 是 SIMT 模型的代表。
4. 在 SIMT 模型中，线程协作执行相同的指令序列，但处理不同的数据。
［单选］

在高性能计算中，下列哪项技术不是用于节点间通信的？答案：HBM
1. InfiniBand
2. Ethernet
3. NVLink
4. HBM
［单选］

在高性能计算集群中，哪个工具通常用于作业调度和资源管理？答案：Slurm
1. 微信群
2. Slurm
3. Docker Swarm
4. Apache Mesos
［单选］

在处理矩阵乘法的并行化时，通常会使用分块矩阵乘法来减少缓存未命中，从而提高并行效率。
1. 缓存未命中
2. 通信开销
3. 浮点运算错误
4. 硬盘 I/O 延迟

E. 齐心协力

https://docs.ray.io/en/latest/ray-core/api/index.html

python实现并行即可

import os
import numpy as np
import ray

ray.init(address=f"{os.environ['RAY_CLUSTER_ADDR']}")

@ray.remote(num_cpus=4)
class MatrixMultiplier(object):
    def __init__(self, weight_file):
        self.weight = np.load(weight_file)

    def multiply(self, x):
        return np.maximum(0, x @ self.weight)

multipliers = [
    MatrixMultiplier.remote(f"weights/weight_{i}.npy") for i in range(4)
]

os.makedirs("outputs", exist_ok=True)

pipeline = [None] * 100

for i in range(100):
    x = np.load(f"inputs/input_{i}.npy")

    for multiplier in multipliers:
        x = multiplier.multiply.remote(x)

    pipeline[i] = x

for i, output in enumerate(pipeline):
    output = ray.get(output)

    np.save(f"outputs/output_{i}.npy", output)

ray.shutdown()

G. 3D生命游戏

#include <cstddef> 
#include <cstdint> 
#include <filesystem> 
#include <fstream> 
#include <iostream> 
#include <utility>

__device__ int getNeighborIndex(int x, int y, int z, int M) {
    x = (x + M) % M;
    y = (y + M) % M;
    z = (z + M) % M;
    return x * M * M + y * M + z;
}

__global__ void update(uint8_t* curr_space, uint8_t* next_space, int M) {
    extern __shared__ uint8_t shared_curr_space[];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    shared_curr_space[threadIdx.x] = curr_space[index];
    __syncthreads();
    int z = index % M;
    int y = (index / M) % M;
    int x = index / (M * M);
    int xm = (x + M - 1) % M;
    int x0 = x % M;
    int xp = (x + 1) % M;
    int ym = (y + M - 1) % M;
    int y0 = y % M;
    int yp = (y + 1) % M;
    int zm = (z + M - 1) % M;
    int z0 = z % M;
    int zp = (z + 1) % M;
    int neighbor_count = 0;
    neighbor_count += curr_space[xm * M * M + ym * M + zm];
    neighbor_count += curr_space[xm * M * M + ym * M + z0];
    neighbor_count += curr_space[xm * M * M + ym * M + zp];
    neighbor_count += curr_space[xm * M * M + y0 * M + zm];
    neighbor_count += curr_space[xm * M * M + y0 * M + z0];
    neighbor_count += curr_space[xm * M * M + y0 * M + zp];
    neighbor_count += curr_space[xm * M * M + yp * M + zm];
    neighbor_count += curr_space[xm * M * M + yp * M + z0];
    neighbor_count += curr_space[xm * M * M + yp * M + zp];
    neighbor_count += curr_space[x0 * M * M + ym * M + zm];
    neighbor_count += curr_space[x0 * M * M + ym * M + z0];
    neighbor_count += curr_space[x0 * M * M + ym * M + zp];
    neighbor_count += curr_space[x0 * M * M + y0 * M + zm];
    neighbor_count += curr_space[x0 * M * M + y0 * M + zp];
    neighbor_count += curr_space[x0 * M * M + yp * M + zm];
    neighbor_count += curr_space[x0 * M * M + yp * M + z0];
    neighbor_count += curr_space[x0 * M * M + yp * M + zp];
    neighbor_count += curr_space[xp * M * M + ym * M + zm];
    neighbor_count += curr_space[xp * M * M + ym * M + z0];
    neighbor_count += curr_space[xp * M * M + ym * M + zp];
    neighbor_count += curr_space[xp * M * M + y0 * M + zm];
    neighbor_count += curr_space[xp * M * M + y0 * M + z0];
    neighbor_count += curr_space[xp * M * M + y0 * M + zp];
    neighbor_count += curr_space[xp * M * M + yp * M + zm];
    neighbor_count += curr_space[xp * M * M + yp * M + z0];
    neighbor_count += curr_space[xp * M * M + yp * M + zp];
    uint8_t curr_state = curr_space[x * M * M + y * M + z];
    if (curr_state == 1) {
        if (neighbor_count < 5 || neighbor_count > 7)
            next_space[x * M * M + y * M + z] = 0;
        else
            next_space[x * M * M + y * M + z] = 1;
    }
    else {
        if (neighbor_count == 6) {
            next_space[x * M * M + y * M + z] = 1;
        }
        else {
            next_space[x * M * M + y * M + z] = 0;
        }
    }
}

int main(int argc, char* argv[]) {
    if (argc < 4) {
        std::cout << "Usage: " << argv[0] << " <input_path> <output_path> <N>"
            << std::endl;
        return 1;
    }
    char* input_path = argv[1];
    char* output_path = argv[2];
    int N = std::atoi(argv[3]);
    size_t M, T;
    std::ifstream input_file(input_path, std::ios::binary);
    input_file.read(reinterpret_cast<char*>(&M), sizeof(M));
    input_file.read(reinterpret_cast<char*>(&T), sizeof(T));
    uint8_t* curr_space = new uint8_t[M * M * M],
        * next_space = new uint8_t[M * M * M];
    input_file.read(reinterpret_cast<char*>(curr_space), M * M * M);
    uint8_t* d_curr_space, * d_next_space;
    cudaMalloc(&d_curr_space, M * M * M * sizeof(uint8_t));
    cudaMalloc(&d_next_space, M * M * M * sizeof(uint8_t));
    cudaMemcpy(d_curr_space, curr_space, M * M * M * sizeof(uint8_t),
        cudaMemcpyHostToDevice);
    int num_cells = M * M * M;
    int threads_per_block = 256;
    int num_blocks = (num_cells + threads_per_block - 1) / threads_per_block;
    for (int t = 0; t < N; ++t) {
        update << <num_blocks, threads_per_block, threads_per_block * sizeof(uint8_t) >> > (d_curr_space, d_next_space, M);
        std::swap(d_curr_space, d_next_space);
    }
    cudaMemcpy(curr_space, d_curr_space, M * M * M * sizeof(uint8_t),cudaMemcpyDeviceToHost);
    T += N;
    std::ofstream output_file(output_path, std::ios::binary);
    output_file.write(reinterpret_cast<char*>(&M), sizeof(M));
    output_file.write(reinterpret_cast<char*>(&T), sizeof(T));
    output_file.write(reinterpret_cast<char*>(curr_space), M * M * M);
    delete[] curr_space;
    delete[] next_space;

    cudaFree(d_curr_space);
    cudaFree(d_next_space);

    return 0;
}

编译：

1	/usr/local/cuda/bin/nvcc -O3 main.cu -o main

测试数据：

module load gcc/12

g++ -std=c++17 -o generate generate.cpp

./generate 1024 data.bin 10

测试：

1	srun -p GPU40G -N 1 --gres=gpu:1 ./main data.bin output 1024

4e2a575ae47d338562fbabebfc7cf5ca output

F. 高性能数据校验

这道题关键就在于如何高效地处理存在依赖关系的数据，题目提示“利用到现代计算机的多核特性以及并行文件系统的强悍能力”，但是后一块的SHA512校验值依赖于前一块的SHA512校验值...

想到了E题的 Pipeline Parallelism ，但是如何才能并行呢？

overlap...

加载modules

1	module load gcc/13 openmpi/4.1.6 openssl/3

生成校验文件

1 2	g++ -o generate generate.cpp -fopenmp $PWD/generate data.bin 4294967296 256

校验代码（最终还是串行

#include <algorithm> 
#include <chrono> 
#include <cstring> 
#include <filesystem> 
#include <fstream> 
#include <iostream> 
#include <mpi.h> 
#include <openssl/evp.h> 
#include <openssl/sha.h>

namespace fs = std::filesystem;

constexpr size_t BLOCK_SIZE = 1024 * 1024;

void print_checksum(std::ostream& os, uint8_t* md, size_t len) { for (int i = 0; i < len; i++) { os << std::setw(2) << std::setfill('0') << std::hex << static_cast<int>(md[i]); } }

void send_block(std::ifstream& istrm, size_t len, int dest, char* buffer) { istrm.read(buffer, std::min(BLOCK_SIZE, len)); MPI_Send(buffer, std::min(BLOCK_SIZE, len), MPI_CHAR, dest, 0, MPI_COMM_WORLD); }

void checksum(size_t len, uint8_t* obuf, int src, char* buffer) { int num_block = (len + BLOCK_SIZE - 1) / BLOCK_SIZE;

const char* precomputedHash = "cf83e1357eefb8bdf1542850d66d8007d620e4050b5715dc83f4a921d36ce9ce47d0d13c5d85f2b0ff8318d2877eec2f63b931bd47417a81a538327af927da3e";

uint8_t prev_md[SHA512_DIGEST_LENGTH] = {};
for(int i = 0; i < SHA512_DIGEST_LENGTH; i++) {
    sscanf(&precomputedHash[i*2], "%2hhx", &prev_md[i]);
}


EVP_MD_CTX* ctx = EVP_MD_CTX_new();
EVP_MD* sha512 = EVP_MD_fetch(nullptr, "SHA512", nullptr);

for (int i = 0; i < num_block; i++) {
    MPI_Status status;
    MPI_Recv(buffer, std::min(BLOCK_SIZE, len - i * BLOCK_SIZE), MPI_CHAR, src, 0, MPI_COMM_WORLD, &status);
    EVP_DigestInit_ex(ctx, sha512, nullptr);
    EVP_DigestUpdate(ctx, buffer, std::min(BLOCK_SIZE, len - i * BLOCK_SIZE));
    EVP_DigestUpdate(ctx, prev_md, SHA512_DIGEST_LENGTH);

    unsigned int md_len = 0;
    EVP_DigestFinal_ex(ctx, prev_md, &md_len);
}

std::memcpy(obuf, prev_md, SHA512_DIGEST_LENGTH);
EVP_MD_CTX_free(ctx);
EVP_MD_free(sha512);
}

int main(int argc, char* argv[]) { auto begin_time = std::chrono::high_resolution_clock::now();

MPI_Init(&argc, &argv);

int rank, nprocs;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

if (rank == 0) {
    if (argc < 2) {
        std::cout << "Usage: " << argv[0] << " <input_file>" << std::endl;
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    fs::path input_path = argv[1];

    auto file_size = fs::file_size(input_path);
    std::cout << input_path << " size: " << file_size << std::endl;

    MPI_Bcast(&file_size, 1, MPI_UNSIGNED_LONG_LONG, 0, MPI_COMM_WORLD);

    std::ifstream istrm(input_path, std::ios::binary);

    char* buffer = new char[BLOCK_SIZE];

    int num_block = (file_size + BLOCK_SIZE - 1) / BLOCK_SIZE;
    for (int i = 0; i < num_block; i++) {
        send_block(istrm, file_size - i * BLOCK_SIZE, 1, buffer);
    }

    delete[] buffer;

} else if (rank == 1) {

    uint64_t file_size;
    MPI_Bcast(&file_size, 1, MPI_UNSIGNED_LONG_LONG, 0, MPI_COMM_WORLD);

    char* buffer = new char[BLOCK_SIZE];

    uint8_t obuf[SHA512_DIGEST_LENGTH];
    checksum(file_size, obuf, 0, buffer);

    delete[] buffer;

    std::cout << "checksum: ";
    print_checksum(std::cout, obuf, SHA512_DIGEST_LENGTH);
    std::cout << std::endl;
    std::ofstream ofs("out.data");
    print_checksum(ofs, obuf, SHA512_DIGEST_LENGTH);
}

MPI_Finalize();

return 0;
}

计时：

auto begin_time = std::chrono::high_resolution_clock::now();
auto end_time = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - begin_time);
std::cout << "checksum time cost: " << std::dec << duration.count() << "ms" << std::endl;

编译代码：

1	mpicxx -O3 -L/lustre/toolchain/openssl/lib64/ -lcrypto -O3 max.cpp

运行：

1	mpirun -np 8 a.out /lustre/home/hu_y68lstwp8s7q/data.bin /lustre/home/hu_y68lstwp8s7q/outputfile

6bbc0014c35c1377c952fc70009ae2e319571b0eb3cd674efb31e41d2382047cf8f689929fd873da68a16892478033e9fa256247fee49ed5b4bc90968ef6a5f0

H. 矩阵乘法

首先分块矩阵必须满足块大小小于Intel Xeon 8358的缓存以解决缓存未命中的问题，测试分块大小在32 x 32(L1)内即可

尝试使用openblas INNER_KERNEL_k1m8n8对算法优化：https://github.com/OpenMathLib/OpenBLAS/blob/develop/kernel/x86_64/dgemm_kernel_8x8_skylakex.c

内联汇编代码:

//openblas
#define INNER_KERNEL_k1m8n8 \
    "vmovapd (%2), %%zmm0\n\t"\
    "vbroadcastsd (%0), %%zmm1\n\t"\
    "vmovapd (%1), %%zmm2\n\t"\
    "vfmadd231pd %%zmm2, %%zmm1, %%zmm0\n\t"\
    "vmovapd %%zmm0, (%2)\n\t"
    
void mul(double* a, double* b, double* c, uint64_t n1, uint64_t n2, uint64_t n3) {
    #pragma omp parallel for
    for (uint64_t ii = 0; ii < n1; ii += BLOCK_SIZE) {
        for (uint64_t jj = 0; jj < n2; jj += BLOCK_SIZE) {
            for (uint64_t kk = 0; kk < n3; kk += BLOCK_SIZE) {
                for (uint64_t i = ii; i < std::min(ii + BLOCK_SIZE, n1); ++i) {
                    for (uint64_t k = kk; k < std::min(kk + BLOCK_SIZE, n3); k += 8) {
                        for (uint64_t j = jj; j < std::min(jj + BLOCK_SIZE, n2); ++j) {
                            __asm__ __volatile__ (
                                INNER_KERNEL_k1m8n8
                                : 
                                : "r" (&a[i * n2 + j]), "r" (&b[j * n3 + k]), "r" (&c[i * n3 + k])
                                : "memory"
                            );
                        }
                    }
                }
            }
        }
    }
}

算法程序如下：

#include <fstream>
#include <iostream>
#include <omp.h>
#include <immintrin.h>

const int BLOCK_SIZE = 128; 

void mul(double* a, double* b, double* c, uint64_t n1, uint64_t n2, uint64_t n3) {
    #pragma omp parallel
    {
        uint64_t ii, jj, kk, i, j, k;

        #pragma omp for schedule(dynamic, 1) nowait
        for (ii = 0; ii < n1; ii += BLOCK_SIZE) {
            for (jj = 0; jj < n2; jj += BLOCK_SIZE) {
                for (kk = 0; kk < n3; kk += BLOCK_SIZE) {
                    for (j = jj; j < std::min(jj + BLOCK_SIZE, n2); ++j) {
                        for (i = ii; i < std::min(ii + BLOCK_SIZE, n1); ++i) {
                            __m512d a_vec = _mm512_set1_pd(a[i * n2 + j]);
                            for (k = kk; k < std::min(kk + BLOCK_SIZE, n3); k += 16) {
                                __m512d c_vec1 = _mm512_load_pd(&c[i * n3 + k]);
                                __m512d b_vec1 = _mm512_load_pd(&b[j * n3 + k]);
                                __m512d c_vec2 = _mm512_load_pd(&c[i * n3 + k + 8]);
                                __m512d b_vec2 = _mm512_load_pd(&b[j * n3 + k + 8]);
                                c_vec1 = _mm512_fmadd_pd(a_vec, b_vec1, c_vec1);
                                c_vec2 = _mm512_fmadd_pd(a_vec, b_vec2, c_vec2);
                                _mm512_store_pd(&c[i * n3 + k], c_vec1);
                                _mm512_store_pd(&c[i * n3 + k + 8], c_vec2);
                                _mm_prefetch((char*)&c[i * n3 + k + 64], _MM_HINT_T0);
                                _mm_prefetch((char*)&b[j * n3 + k + 64], _MM_HINT_T0);
                            }
                        }
                    }
                }
            }
        }
    }
}

int main() {
    uint64_t n1, n2, n3;
    FILE* fi;

    fi = fopen("conf.data", "rb");
    fread(&n1, sizeof(n1), 1, fi);
    fread(&n2, sizeof(n2), 1, fi);
    fread(&n3, sizeof(n3), 1, fi); 

    double* a = (double*)aligned_alloc(64, n1 * n2 * sizeof(double));
    double* b = (double*)aligned_alloc(64, n2 * n3 * sizeof(double));
    double* c = (double*)aligned_alloc(64, n1 * n3 * sizeof(double));

    fread(a, sizeof(double), n1 * n2, fi);
    fread(b, sizeof(double), n2 * n3, fi);
    fclose(fi); 

    mul(a, b, c, n1, n2, n3); 

    fi = fopen("out.data", "wb"); 
    fwrite(c, sizeof(double), n1 * n3, fi); 
    fclose(fi); 

    return 0;
}

编译代码：

1	g++ -std=c++2a -O3 -fopenmp -mavx512f -o matrix matrix.cpp

vtune:

vtune -collect hotspots -result-dir vtune_result ./matrix

vtune -collect threading -result-dir vtune_result -- ./matrix

vtune -report summary -result-dir vtune_result

I. logistic方程

#include <iostream>
#include <chrono>
#include <immintrin.h>
#include <omp.h>
#include <fstream>

constexpr size_t BUFFER_SIZE = 1024;

void itv(double r, double* x, int64_t n, int64_t itn) {
    __m512d rr = _mm512_set1_pd(r);
    __m512d one = _mm512_set1_pd(1.0); 

    #pragma omp parallel for schedule(dynamic)
    for (int64_t i = 0; i < n; i+=16) { 
        if (i+32 < n) {
            _mm_prefetch((char*)(x + i + 32), _MM_HINT_T0);
        }
        __m512d xx1 = _mm512_load_pd(x + i);
        __m512d xx2 = _mm512_load_pd(x + i + 8);
        for (int64_t j = 0; j < itn; j++) {
            __m512d product1 = _mm512_mul_pd(rr, xx1);
            __m512d product2 = _mm512_mul_pd(rr, xx2);
            xx1 = _mm512_mul_pd(product1, _mm512_sub_pd(one, xx1));
            xx2 = _mm512_mul_pd(product2, _mm512_sub_pd(one, xx2));
        }
        _mm512_store_pd(x + i, xx1);
        _mm512_store_pd(x + i + 8, xx2);
    }
}


int main(){
    omp_set_num_threads(8);
    FILE* fi;
    fi = fopen("conf.data", "rb");

    int64_t itn;
    double r;
    int64_t n;
    double* x;

    fread(&itn, 1, 8, fi);
    fread(&r, 1, 8, fi);
    fread(&n, 1, 8, fi);
    x = (double*)aligned_alloc(64, n * sizeof(double));
    fread(x, 1, n * 8, fi);
    fclose(fi); 
    
    itv(r, x, n, itn); 

    FILE* fo;
    fo = fopen("out.data", "wb");
    fwrite(x, 1, n * 8, fo);
    fclose(fo); 

    free(x);

    return 0;
}

编译代码：

1
2
3

g++ -std=c++20 -O3 -fopenmp -mavx512f -o logistic logistic.cpp

vtune -collect hotspots -result-dir vtune_result ./logistic

J. H-66

#include <vector>
#include <unordered_map>
#include <cstdint>
#include <cmath>
#include <chrono>

#ifdef _WIN32
#include <windows.h>
#define popcnt __popcnt64
#else
#define popcnt __builtin_popcountll
#endif

#include <iostream>

using map_t = std::unordered_map<uint64_t, size_t>;

struct restrict_t {
    int offset{}, range{}, minocc{}, maxocc{};
    int occ{};
    uint64_t substate{};
};

template <typename VT>
struct term_t {
    VT value;
    uint64_t an{}, cr{}, signmask{}, sign{};
};

struct sparse_t {
    std::vector<size_t> row;
    std::vector<size_t> col;
    std::vector<double> data;
};

int itrest(restrict_t& rest) {
    if (!rest.substate) {
        goto next;
    }
    {
        uint64_t x = rest.substate & (-(int64_t)rest.substate);
        uint64_t y = x + rest.substate;
        rest.substate = y + (y ^ rest.substate) / x / 4;
    }
    if (rest.substate >> rest.range) {
    next:
        if (rest.occ == rest.maxocc) {
            rest.occ = rest.minocc;
            rest.substate = (uint64_t(1) << rest.occ) - 1;
            return 1;
        }
        rest.occ++;
        rest.substate = (uint64_t(1) << rest.occ) - 1;
        return 0;
    }
    return 0;
}

int itrest(std::vector<restrict_t>& rest) {
    for (auto& re : rest) {
        if (!itrest(re)) {
            return 0;
        }
    }
    return 1;
}

uint64_t getstate(const std::vector<restrict_t>& rest) {
    uint64_t state = 0;
    for (const auto& re : rest) {
        state |= re.substate << re.offset;
    }
    return state;
}

int generatetable(std::vector<uint64_t>& table, map_t& map, std::vector<restrict_t>& rest) {
    for (auto& re : rest) {
        re.occ = re.minocc;
        re.substate = (uint64_t(1) << re.occ) - 1;
    }

    size_t index = 0;
    do {
        uint64_t state = getstate(rest);
        table.emplace_back(state);
        map.insert({state, index});
        index++;
    } while (!itrest(rest));

    return 0;
}

template <typename VT>
term_t<VT> getterm(VT value, const std::vector<int>& cr, const std::vector<int>& an) {
    term_t<VT> term;
    term.value = value;
    term.an = 0;
    term.cr = 0;
    term.signmask = 0;
    uint64_t signinit = 0;

    for (const auto& x : an) {
        uint64_t mark = uint64_t(1) << x;
        term.signmask ^= (mark - 1) & (~term.an);
        term.an |= mark;
    }
    for (const auto& x : cr) {
        uint64_t mark = uint64_t(1) << x;
        signinit ^= (mark - 1) & term.cr;
        term.signmask ^= (mark - 1) & (~term.an) & (~term.cr);
        term.cr |= mark;
    }
    term.sign = popcnt(signinit ^ (term.signmask & term.an));
    term.signmask = term.signmask & (~term.an) & (~term.cr);

    return term;
}

template <typename VT>
int act(std::vector<size_t>& row, std::vector<size_t>& col, std::vector<VT>& data, const std::vector<term_t<VT>>& op, const std::vector<uint64_t>& table, const map_t& map) {
    int64_t n = table.size();

    for (int64_t i = 0; i < n; i++) {
        uint64_t srcstate = table[i];

        for (const auto& term : op) {
            if ((srcstate & term.an) == term.an) {
                uint64_t dststate = srcstate ^ term.an;
                if ((dststate & term.cr) == 0) {
                    dststate ^= term.cr;

                    auto it = map.find(dststate);
                    if (it != map.end()) {
                        uint64_t sign = term.sign + popcnt(srcstate & term.signmask);
                        VT v = term.value;
                        if (sign & 1) {
                            v = -v;
                        }
                        data.emplace_back(v);
                        col.emplace_back(i);
                        row.emplace_back(it->second);
                    }
                }
            }
        }
    }

    return 0;
}

int readss(FILE* fi, std::vector<uint64_t>& table, map_t& map) {
    int n;
    fread(&n, 1, 4, fi);
    std::vector<restrict_t> restv(n);
    for (auto& rest : restv) {
        fread(&rest, 1, 16, fi);
    }
    generatetable(table, map, restv);
    return 0;
}

int readop(FILE* fi, std::vector<term_t<double>>& op) {
    int n, order;
    fread(&n, 1, 4, fi);
    fread(&order, 1, 4, fi);

    std::vector<double> v(n);
    fread(v.data(), 1, 8 * n, fi);

    std::vector<int> rawterm(order);
    std::vector<int> cr, an;

    for (int i = 0; i < n; i++) {
        fread(rawterm.data(), 1, 4 * order, fi);
        int tn = rawterm[0];

        for (int j = 0; j < tn; j++) {
            int type = rawterm[tn * 2 - 1 - j * 2];
            if (type) {
                cr.push_back(rawterm[tn * 2 - j * 2]);
            }
            else {
                an.push_back(rawterm[tn * 2 - j * 2]);
            }
        }

        op.push_back(getterm(v[i], cr, an));
        cr.clear();
        an.clear();
    }

    return 0;
}

//out=m*v;
void mmv(std::vector<double>& out, const sparse_t& m, const std::vector<double>& v) {
    for (auto& x : out) {
        x = 0;
    }

    for (size_t i = 0; i < m.data.size(); i++) {
        out[m.row[i]] += m.data[i] * v[m.col[i]];
    }
}

//v1'*v2;
double dot(const std::vector<double>& v1, const std::vector<double>& v2) {
    double s = 0;
    for (size_t i = 0; i < v1.size(); i++) {
        s += v1[i] * v2[i];
    }
    return s;
}

//v1+=s*v2;
void avv(std::vector<double>& v1, const double s, const std::vector<double>& v2) {
    for (size_t i = 0; i < v1.size(); i++) {
        v1[i] += s * v2[i];
    }
}

//v*=s;
void msv(const double s, std::vector<double>& v) {
    for (auto& x : v) {
        x *= s;
    }
}

//v'*v;
double norm2(const std::vector<double>& v) {
    double s = 0;
    for (auto& x : v) {
        s += x * x;
    }
    return s;
}

void getsp(std::vector<double>& out, int itn, const sparse_t m, std::vector<double>& v) {
    out.resize(itn * 2);
    out[0] = sqrt(norm2(v));
    msv(1.0 / out[0], v);

    std::vector<double> a(itn), b(itn - 1);

    std::vector<double> v_(v.size()), v__(v.size());
    for (int i = 0; i < itn; i++) {
        v__.swap(v_);
        v_.swap(v);
        mmv(v, m, v_);
        a[i] = dot(v, v_);

        if (i < itn - 1) {
            if (i == 0) {
                avv(v, -a[i], v_);
            }
            else {
                avv(v, -a[i], v_);
                avv(v, -b[i - 1], v__);
            }

            b[i] = sqrt(norm2(v));
            msv(1.0 / b[i], v);
        }
    }

    for (int i = 0; i < itn; i++) {
        out[1 + i] = a[i];
    }
    for (int i = 0; i < itn - 1; i++) {
        out[1 + itn + i] = b[i];
    }
}

int main()
{
    FILE* fi;
    std::vector<uint64_t> table;
    map_t map;
    std::vector<term_t<double>> op;

    fi = fopen("conf.data", "rb");
    auto t1 = std::chrono::steady_clock::now();
    readss(fi, table, map);
    auto t2 = std::chrono::steady_clock::now();
    readop(fi, op);

    sparse_t opm;
    act(opm.row, opm.col, opm.data, op, table, map);
    auto t3 = std::chrono::steady_clock::now();

    int itn;
    fread(&itn, 1, 4, fi);

    std::vector<double> v(table.size());
    fread(v.data(), 1, table.size() * 8, fi);

    fclose(fi);

    std::vector<double> result;
    getsp(result, itn, opm, v);

    auto t4 = std::chrono::steady_clock::now();
    fi = fopen("out.data", "wb");
    fwrite(result.data(), 1, 16 * itn, fi);
    fclose(fi);

    int d1 = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count();
    int d2 = std::chrono::duration_cast<std::chrono::milliseconds>(t3 - t2).count();
    int d3 = std::chrono::duration_cast<std::chrono::milliseconds>(t4 - t3).count();
    printf("%d,%d,%d\n", d1, d2, d3);
    std::cout << "Hello World!\n";
}

编译：

1	g++ -std=c++20 -O3 -fopenmp -march=native -o matrix matrix.cpp

K. 光之游戏

微调：

#include <cmath>
#include <cstdio>
#include <cstring>
#include <fstream>
#include <iostream>
#include <vector>
#include <string>

class Vector {
public:
    double x, y, z;
    Vector(const double x = 0, const double y = 0, const double z = 0) : x(x), y(y), z(z) {}
    Vector(const Vector&) = default;
    Vector& operator=(const Vector&) = default;
    Vector operator+(const Vector& b) const { return Vector(x + b.x, y + b.y, z + b.z); }
    Vector operator-(const Vector& b) const { return Vector(x - b.x, y - b.y, z - b.z); }
    Vector operator-() const { return Vector(-x, -y, -z); }
    Vector operator*(const double s) const { return Vector(x * s, y * s, z * s); }
    Vector operator/(const double s) const { return Vector(x / s, y / s, z / s); }
    bool isZero() const { return (x == 0.) && (y == 0.) && (z == 0.); }
};
inline double LengthSquared(const Vector& v) { return v.x * v.x + v.y * v.y + v.z * v.z; }
inline double Length(const Vector& v) { return std::sqrt(LengthSquared(v)); }
inline Vector operator*(const double s, const Vector& v) { return v * s; }
inline Vector Normalize(const Vector& v) { return v / Length(v); }
inline Vector Multiply(const Vector& v1, const Vector& v2) { return Vector(v1.x * v2.x, v1.y * v2.y, v1.z * v2.z); }
inline double Dot(const Vector& v1, const Vector& v2) { return v1.x * v2.x + v1.y * v2.y + v1.z * v2.z; }
inline const Vector Cross(const Vector& v1, const Vector& v2) {
    return Vector((v1.y * v2.z) - (v1.z * v2.y), (v1.z * v2.x) - (v1.x * v2.z), (v1.x * v2.y) - (v1.y * v2.x));
}
using Color = Vector;
constexpr double pi_2 = 1.5707963267948966192313216916398;
constexpr int hemicube_res = 256;

double multiplier_front[hemicube_res][hemicube_res];
double multiplier_down[hemicube_res / 2][hemicube_res];

// prospective camera
class Camera {
public:
    Vector pos, dir, up;
    double fov, aspect_ratio;
    double width, height;
    Camera(Vector pos, Vector dir, Vector up, double fov, double aspect_ratio = 1.) :
        pos(pos), dir(dir), up(up), fov(fov), aspect_ratio(aspect_ratio) {
        height = 2 * std::tan(fov / 2);
        width = height * aspect_ratio;
    }
    Vector project(const Vector& v) const {
        Vector a = v - pos;
        double z = Dot(a, dir);
        double y = Dot(a, up);
        double x = Dot(a, Cross(up, dir));
        return Vector(-x / z, y / z, z);
    }
};

// Patches are rectangle
class Patch {
public:
    Vector pos;
    Vector a, b; // regard Cross(b, a) as normal
    Color emission;
    Color reflectance;
    Color incident;
    Color excident;
    Patch(Vector pos, Vector a, Vector b, Color emission, Color reflectance) :
        pos(pos), a(a), b(b), emission(emission), reflectance(reflectance), incident(), excident(emission) {}
};

bool check_triangle(double cx, double cy, Vector a, Vector b, Vector c, double& depth) {
    Vector ab = b - a, ac = c - a;
    double tmp = ab.x * ac.y - ab.y * ac.x;
    if (std::fabs(tmp) > 1e-6) {
        double alpha = (-ac.x * (cy - a.y) + ac.y * (cx - a.x)) / tmp;
        double beta = (ab.x * (cy - a.y) - ab.y * (cx - a.x)) / tmp;
        if (alpha >= 0 && beta >= 0 && alpha + beta <= 1) {
            depth = alpha * b.z + beta * c.z + (1 - alpha - beta) * a.z;
            return true;
        }
    }
    return false;
}

bool inside_quadrilateral(double cx, double cy, Vector a, Vector b, Vector c, Vector d, double& depth) {
    return check_triangle(cx, cy, a, b, c, depth) || check_triangle(cx, cy, d, b, c, depth);
}

std::vector<Color> render_view(const Camera& camera, const std::vector<Patch>& scene, const int width, const int height) {
    std::vector<Color> image(width * height);
    std::vector<double> zbuffer(width * height, 1000000);
    for (const auto& p : scene) {
        if (Length(p.pos + 0.5 * (p.a + p.b) - camera.pos) < 1e-6) continue;
        Vector a = camera.project(p.pos);
        Vector b = camera.project(p.pos + p.a);
        Vector c = camera.project(p.pos + p.b);
        Vector d = camera.project(p.pos + p.a + p.b);
        if (a.z < 1e-6 || b.z < 1e-6 || c.z < 1e-6 || d.z < 1e-6)continue;
        double px = camera.width / width;
        double py = camera.height / height;
        for (int i = 0; i < height; ++i) {
            for (int j = 0; j < width; ++j) {
                double cx = (j + 0.5) * px - camera.width * 0.5;
                double cy = (i + 0.5) * py - camera.height * 0.5;
                double depth = 1000000;
                if (inside_quadrilateral(cx, cy, a, b, c, d, depth)) {
                    if (depth < zbuffer[i * width + j] && depth > 1e-6) {
                        image[i * width + j] = p.excident;
                        zbuffer[i * width + j] = depth;
                    }
                }
            }
        }
    }
    return image;
}

// https://stackoverflow.com/a/2654860
void save_bmp_file(const std::string& filename, const std::vector<Color>& image, const int width, const int height) {
    FILE* f = fopen(filename.c_str(), "wb");
    int filesize = 54 + 3 * width * height;
    unsigned char* img = new unsigned char[3 * width * height];
    memset(img, 0, 3 * width * height);

    for (int i = 0; i < width; i++)
    {
        for (int j = 0; j < height; j++)
        {
            double r = sqrt(image[i + j * width].x) * 255;
            double g = sqrt(image[i + j * width].y) * 255;
            double b = sqrt(image[i + j * width].z) * 255;
            if (r > 255) r = 255;
            if (g > 255) g = 255;
            if (b > 255) b = 255;
            img[(i + j * width) * 3 + 2] = (unsigned char)(r);
            img[(i + j * width) * 3 + 1] = (unsigned char)(g);
            img[(i + j * width) * 3 + 0] = (unsigned char)(b);
        }
    }
    unsigned char bmpfileheader[14] = { 'B','M', 0,0,0,0, 0,0, 0,0, 54,0,0,0 };
    unsigned char bmpinfoheader[40] = { 40,0,0,0, 0,0,0,0, 0,0,0,0, 1,0, 24,0 };
    unsigned char bmppad[3] = { 0,0,0 };

    bmpfileheader[2] = (unsigned char)(filesize);
    bmpfileheader[3] = (unsigned char)(filesize >> 8);
    bmpfileheader[4] = (unsigned char)(filesize >> 16);
    bmpfileheader[5] = (unsigned char)(filesize >> 24);

    bmpinfoheader[4] = (unsigned char)(width);
    bmpinfoheader[5] = (unsigned char)(width >> 8);
    bmpinfoheader[6] = (unsigned char)(width >> 16);
    bmpinfoheader[7] = (unsigned char)(width >> 24);
    bmpinfoheader[8] = (unsigned char)(height);
    bmpinfoheader[9] = (unsigned char)(height >> 8);
    bmpinfoheader[10] = (unsigned char)(height >> 16);
    bmpinfoheader[11] = (unsigned char)(height >> 24);

    fwrite(bmpfileheader, 1, 14, f);
    fwrite(bmpinfoheader, 1, 40, f);
    for (int i = 0; i < height; ++i) {
        fwrite(img + (i * width * 3), 3, width, f);
        fwrite(bmppad, 1, (4 - (width * 3) % 4) % 4, f);
    }
    fclose(f);
    delete[] img;
}

// Cornell Box
void load_scene(std::vector<Patch>& scene) {
    FILE* fi;
    int64_t n;
    fi = fopen("conf.data", "rb");
    fread(&n, 1, 8, fi);
    for (int i = 0; i < n; i++) {
        Vector a[5];
        fread(a, 1, 120, fi);
        scene.emplace_back(Patch(a[0], a[1], a[2], a[3], a[4]));
    }
    fclose(fi);
}

// divide patches into subpatches whose width and length are no less than given threshold
void divide_patches(std::vector<Patch>& scene, double threshold) {
    std::vector<Patch> tmp = std::move(scene);
    for (const auto& p : tmp) {
        double len_a = Length(p.a);
        double len_b = Length(p.b);
        int a = static_cast<int>(len_a / threshold);
        int b = static_cast<int>(len_b / threshold);
        for (int i = 0; i <= a; ++i) {
            for (int j = 0; j <= b; ++j) {
                Vector pos = p.pos + i * threshold * Normalize(p.a) + j * threshold * Normalize(p.b);
                Vector pa = (i + 1) * threshold > len_a ? p.a - i * threshold * Normalize(p.a) : threshold * Normalize(p.a);
                Vector pb = (j + 1) * threshold > len_b ? p.b - j * threshold * Normalize(p.b) : threshold * Normalize(p.b);
                if (pa.isZero() || pb.isZero()) continue;
                scene.emplace_back(Patch(pos, pa, pb, p.emission, p.reflectance));
            }
        }
    }
}

void cal_multiplier_map() {
    constexpr double pw = 2. / hemicube_res;

    double s=0;
    for (int i = 0; i < hemicube_res; ++i) {
        for (int j = 0; j < hemicube_res; ++j) {
            double cx = (j + 0.5) * pw - 1;
            double cy = (i + 0.5) * pw - 1;
            multiplier_front[i][j] = 1 / (cx * cx + cy * cy + 1)/ (cx * cx + cy * cy + 1);
            s = s + multiplier_front[i][j];
        }
    }
    for (int i = 0; i < hemicube_res / 2; ++i) {
        for (int j = 0; j < hemicube_res; ++j) {
            double cx = (j + 0.5) * pw - 1;
            double cz = (i + 0.5) * pw;
            multiplier_down[i][j] = cz / (cx * cx + cz * cz + 1)/(cx * cx + cz * cz + 1);
            //multiplier_down[i][j] *= multiplier_front[i + hemicube_res / 2][j];
            s = s + 4 * multiplier_down[i][j];
        }
    }
    for (int i = 0; i < hemicube_res; ++i) {
        for (int j = 0; j < hemicube_res; ++j) {
            //multiplier_front[i][j] *= multiplier_front[i][j];
        }
    }
    s = s / (hemicube_res * hemicube_res / 4) / 3.1416;
    printf("%f\n", s);
}

void cal_incident_light(std::vector<Patch>& scene) {

    for (int i = 0; i < scene.size(); i++) {
        auto& p = scene[i];

		Vector cpos = p.pos + 0.5 * (p.a + p.b);
		std::vector<Color> front = render_view(Camera(cpos, Normalize(Cross(p.b, p.a)), Normalize(p.b), pi_2), scene, hemicube_res, hemicube_res);
        std::vector<Color> up = render_view(Camera(cpos, Normalize(p.b), -Normalize(Cross(p.b, p.a)), pi_2), scene, hemicube_res, hemicube_res);
        std::vector<Color> left = render_view(Camera(cpos, -Normalize(p.a), Normalize(p.b), pi_2), scene, hemicube_res, hemicube_res);
        std::vector<Color> right = render_view(Camera(cpos, Normalize(p.a), Normalize(p.b), pi_2), scene, hemicube_res, hemicube_res);
        std::vector<Color> down = render_view(Camera(cpos, -Normalize(p.b), Normalize(Cross(p.b, p.a)), pi_2), scene, hemicube_res, hemicube_res);
        Color total_light{};
        for (int i = 0; i < hemicube_res; ++i) {
            for (int j = 0; j < hemicube_res; ++j) {
                total_light = total_light + front[i * hemicube_res + j] * multiplier_front[i][j];
                if (i < hemicube_res / 2) total_light = total_light + up[i * hemicube_res + j] * multiplier_down[hemicube_res / 2 - 1 - i][j];
                if (i >= hemicube_res / 2) total_light = total_light + down[i * hemicube_res + j] * multiplier_down[i - hemicube_res / 2][j];
                if (j < hemicube_res / 2) total_light = total_light + right[i * hemicube_res + j] * multiplier_down[hemicube_res / 2 - 1 - j][i];
                if (j >= hemicube_res / 2) total_light = total_light + left[i * hemicube_res + j] * multiplier_down[j - hemicube_res / 2][i];
            }
        }
        p.incident = total_light / (hemicube_res * hemicube_res / 4) / 3.1416;
    }
}

void cal_excident_light(std::vector<Patch>& scene) {
    for (auto& p : scene) {
        p.excident = Multiply(p.incident, p.reflectance) + p.emission;
    }
}

int main() {
    const int width = 512, height = 512;
    Camera camera(Vector(278, 273, -800), Vector(0, 0, 1), Vector(0, 1, 0), 2 * std::atan2(0.0125, 0.035));
    std::vector<Patch> scene;

    std::cout << "init scene" << std::endl;
    load_scene(scene);

    std::cout << "divide patches" << std::endl;
    divide_patches(scene, 15);
    std::cout << "total patch number: " << scene.size() << std::endl;

    int iter = 0;
    std::cout << "render view " << iter << std::endl;
    std::vector<Color> image = render_view(camera, scene, width, height);

    std::cout << "save image " << iter << std::endl;
    save_bmp_file("cornellbox" + std::to_string(iter) + ".bmp", image, width, height);

    cal_multiplier_map();

    const int max_iteration = 5;
    for (iter = 1; iter <= max_iteration; ++iter) {
        cal_incident_light(scene);
        cal_excident_light(scene);
        std::cout << "render view " << iter << std::endl;
        image = render_view(camera, scene, width, height);

        std::cout << "save image " << iter << std::endl;
        save_bmp_file("cornellbox" + std::to_string(iter) + ".bmp", image, width, height);
    }

    return 0;
}

编译：

1	g++ -O3 -fopenmp -march=native -ffast-math -std=c++20 main.cpp -o main

L. 洪水困兽

只需要在源码的基础上加上 #pragma omp parallel for 并行化for循环和 #pragma omp atomic 防止多个线程同时修改同一变量的值引起的数据竞争即可

#include <array>
#include <fstream>
#include <iostream>
#include <omp.h>
#include <vector>
#include <cmath>
#include <tuple>

using std::vector, std::array, std::tuple, std::string;

void particle2grid(int resolution, int numparticle,
                   const vector<double> &particle_position,
                   const vector<double> &particle_velocity,
                   vector<double> &velocityu, vector<double> &velocityv,
                   vector<double> &weightu, vector<double> &weightv) {
    double grid_spacing = 1.0 / resolution;
    double inv_grid_spacing = 1.0 / grid_spacing;
    auto get_frac = [&inv_grid_spacing](double x, double y) {
        int xidx = floor(x * inv_grid_spacing);
        int yidx = floor(y * inv_grid_spacing);
        double fracx = x * inv_grid_spacing - xidx;
        double fracy = y * inv_grid_spacing - yidx;
        return tuple(array<int, 2>{xidx, yidx},
                     array<double, 4>{fracx * fracy, (1 - fracx) * fracy,
                                      fracx * (1 - fracy),
                                      (1 - fracx) * (1 - fracy)});
    };

    #pragma omp parallel for
    for (int i = 0; i < numparticle; i++) {
        array<int, 4> offsetx = {0, 1, 0, 1};
        array<int, 4> offsety = {0, 0, 1, 1};

        auto [idxu, fracu] =
            get_frac(particle_position[i * 2 + 0],
                     particle_position[i * 2 + 1] - 0.5 * grid_spacing);
        auto [idxv, fracv] =
            get_frac(particle_position[i * 2 + 0] - 0.5 * grid_spacing,
                     particle_position[i * 2 + 1]);

        for (int j = 0; j < 4; j++) {
            int tmpidx = 0;
            tmpidx =
                (idxu[0] + offsetx[j]) * resolution + (idxu[1] + offsety[j]);
            #pragma omp atomic
            velocityu[tmpidx] += particle_velocity[i * 2 + 0] * fracu[j];
            #pragma omp atomic
            weightu[tmpidx] += fracu[j];

            tmpidx = (idxv[0] + offsetx[j]) * (resolution + 1) +
                     (idxv[1] + offsety[j]);
            #pragma omp atomic
            velocityv[tmpidx] += particle_velocity[i * 2 + 1] * fracv[j];
            #pragma omp atomic
            weightv[tmpidx] += fracv[j];
        }
    }
}

int main(int argc, char *argv[]) {
    if (argc < 2) {
        printf("Usage: %s inputfile\n", argv[0]);
        return -1;
    }

    string inputfile(argv[1]);
    std::ifstream fin(inputfile, std::ios::binary);
    if (!fin) {
        printf("Error opening file");
        return -1;
    }
    
    int resolution;
    int numparticle;
    vector<double> particle_position;
    vector<double> particle_velocity;

    fin.read((char *)(&resolution), sizeof(int));
    fin.read((char *)(&numparticle), sizeof(int));
    
    particle_position.resize(numparticle * 2);
    particle_velocity.resize(numparticle * 2);
    
    printf("resolution: %d\n", resolution);
    printf("numparticle: %d\n", numparticle);
    
    fin.read((char *)(particle_position.data()),
             sizeof(double) * particle_position.size());
    fin.read((char *)(particle_velocity.data()),
             sizeof(double) * particle_velocity.size());

    vector<double> velocityu((resolution + 1) * resolution, 0.0);
    vector<double> velocityv((resolution + 1) * resolution, 0.0);
    vector<double> weightu((resolution + 1) * resolution, 0.0);
    vector<double> weightv((resolution + 1) * resolution, 0.0);


    string outputfile;

    particle2grid(resolution, numparticle, particle_position,
                    particle_velocity, velocityu, velocityv, weightu,
                    weightv);
    outputfile = "output.dat";

    std::ofstream fout(outputfile, std::ios::binary);
    if (!fout) {
        printf("Error output file");
        return -1;
    }
    fout.write((char *)(&resolution), sizeof(int));
    fout.write(reinterpret_cast<char *>(velocityu.data()),
               sizeof(double) * velocityu.size());
    fout.write(reinterpret_cast<char *>(velocityv.data()),
               sizeof(double) * velocityv.size());
    fout.write(reinterpret_cast<char *>(weightu.data()),
               sizeof(double) * weightu.size());
    fout.write(reinterpret_cast<char *>(weightv.data()),
               sizeof(double) * weightv.size());

    return 0;
}

A. 欢迎参赛！

B. 流量席卷小土豆

C. 简单的编译

D. 小北问答 Classic

E. 齐心协力

G. 3D生命游戏

F. 高性能数据校验

H. 矩阵乘法

I. logistic方程

J. H-66

K. 光之游戏

L. 洪水 困兽

L. 洪水困兽