A. 欢迎参赛!

签到

B. 流量席卷小土豆

{"password":"de3e27056bde306b803cc5602f4cde9d464912330b4562ae163bd21e23495bcd","job_id":"6437"}

首先尝试8nodes 8tasks未能得分

1
2
Point 0:
msg: job was not run with 4 nodes, 4 tasks per node

使用4nodes 4tasks得到point 0分数:

1
2
Point 0:
msg: job was run with 4 nodes, 4 tasks per node by user `hu_y68lstwp8s7q`

ssh 登录在 shell 直接执行会因为爆内存被 kill

1
2
3
** (process:1206968): WARNING **: 21:52:42.121: Dissector bug, protocol VNC, in packet 2094662: Adding vnc.hextile_anysubrects would put more than 1000000 items in the tree -- possible infinite loop
mergecap: The file "0.ssh.pcap" appears to have been cut short in the middle of a packet.
slurmstepd: error: Detected 1 oom_kill event in StepId=5600.batch. Some of the step tasks have been OOM Killed.

根据文档 https://hpc.pku.edu.cn/ug/guide/quickstart/ 写 job.sh

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
#!/bin/bash
#SBATCH -o job.%j.out
#SBATCH -p C064M0256G
#SBATCH --qos=high
#SBATCH -J TASK
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4

srun -N4 --ntasks-per-node=4 tshark -r /lustre/shared_data/potato_kingdom_univ_trad_cluster/pcaps/0.pcap -Y ssh -w 0.ssh.pcap
srun -N4 --ntasks-per-node=4 tshark -r /lustre/shared_data/potato_kingdom_univ_trad_cluster/pcaps/1.pcap -Y ssh -w 1.ssh.pcap
srun -N4 --ntasks-per-node=4 tshark -r /lustre/shared_data/potato_kingdom_univ_trad_cluster/pcaps/2.pcap -Y ssh -w 2.ssh.pcap
srun -N4 --ntasks-per-node=4 tshark -r /lustre/shared_data/potato_kingdom_univ_trad_cluster/pcaps/3.pcap -Y ssh -w 3.ssh.pcap
srun -N4 --ntasks-per-node=4 tshark -r /lustre/shared_data/potato_kingdom_univ_trad_cluster/pcaps/4.pcap -Y ssh -w 4.ssh.pcap
srun -N4 --ntasks-per-node=4 tshark -r /lustre/shared_data/potato_kingdom_univ_trad_cluster/pcaps/5.pcap -Y ssh -w 5.ssh.pcap
srun -N4 --ntasks-per-node=4 tshark -r /lustre/shared_data/potato_kingdom_univ_trad_cluster/pcaps/6.pcap -Y ssh -w 6.ssh.pcap
srun -N4 --ntasks-per-node=4 tshark -r /lustre/shared_data/potato_kingdom_univ_trad_cluster/pcaps/7.pcap -Y ssh -w 7.ssh.pcap
srun -N4 --ntasks-per-node=4 tshark -r /lustre/shared_data/potato_kingdom_univ_trad_cluster/pcaps/8.pcap -Y ssh -w 8.ssh.pcap
srun -N4 --ntasks-per-node=4 tshark -r /lustre/shared_data/potato_kingdom_univ_trad_cluster/pcaps/9.pcap -Y ssh -w 9.ssh.pcap
srun -N4 --ntasks-per-node=4 tshark -r /lustre/shared_data/potato_kingdom_univ_trad_cluster/pcaps/10.pcap -Y ssh -w 10.ssh.pcap
srun -N4 --ntasks-per-node=4 tshark -r /lustre/shared_data/potato_kingdom_univ_trad_cluster/pcaps/11.pcap -Y ssh -w 11.ssh.pcap
srun -N4 --ntasks-per-node=4 tshark -r /lustre/shared_data/potato_kingdom_univ_trad_cluster/pcaps/12.pcap -Y ssh -w 12.ssh.pcap
srun -N4 --ntasks-per-node=4 tshark -r /lustre/shared_data/potato_kingdom_univ_trad_cluster/pcaps/13.pcap -Y ssh -w 13.ssh.pcap
srun -N4 --ntasks-per-node=4 tshark -r /lustre/shared_data/potato_kingdom_univ_trad_cluster/pcaps/14.pcap -Y ssh -w 14.ssh.pcap
srun -N4 --ntasks-per-node=4 tshark -r /lustre/shared_data/potato_kingdom_univ_trad_cluster/pcaps/15.pcap -Y ssh -w 15.ssh.pcap
srun -N4 --ntasks-per-node=4 mergecap -w merged.pcap *.ssh.pcap

sbatch job.sh 命令提交到 SCOW 计算平台

Completed 之后 quantum-cracker merged.pcap 得到密码:

1
2
[hu_y68lstwp8s7q@l08c49n1 ~]$ quantum-cracker merged.pcap
de3e27056bde306b803cc5602f4cde9d464912330b4562ae163bd21e23495bcd

同时对比8nodes 8tasks的答案,得到point 1分数

1
2
[hu_y68lstwp8s7q@l08c49n1 ~]$ quantum-cracker merged.pcap
de3e27056bde306b803cc5602f4cde9d464912330b4562ae163bd21e23495bcd

成功解出B题

C. 简单的编译

Makefile: https://makefiletutorial.com/

CMakeLists: https://en.wikipedia.org/wiki/CMake#CMakeLists.txt

Makefile:

1
2
3
4
all:
nvcc -o hello_cuda hello_cuda.cu
mpic++ -o hello_mpi hello_mpi.cpp
g++ -fopenmp -o hello_omp hello_omp.cpp

CMakeLists.txt:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
cmake_minimum_required(VERSION 3.26)

project(HPC LANGUAGES CXX CUDA)

# packages
find_package(MPI REQUIRED)
find_package(OpenMP REQUIRED)

# CUDA
add_executable(hello_cuda hello_cuda.cu)

# MPI
add_executable(hello_mpi hello_mpi.cpp)
target_link_libraries(hello_mpi MPI::MPI_CXX)

# OpenMP
add_executable(hello_omp hello_omp.cpp)
target_link_libraries(hello_omp OpenMP::OpenMP_CXX)

D. 小北问答 Classic

  1. [填空] 根据 2023 年 11 月公布的 GREEN500 排行榜,能量效率最高的超级计算的能效达到了65.396(GFlops/watts)

  2. [填空] 根据 Amdahl 定律,在关键路径上,将占据程序运行时间 10% 的部分加速到 2 倍,则程序整体的加速比为105.26%;将占据程序运行时间40%的部分加速到1.2倍,则程序整体的加速比为107.14%。(均保留两位小数,均假设加速前后程序的关键路径不变)

  3. [单选] 以下哪个工具的作用与其他选项的区别最大?答案:GCC

    1. Meson
    2. Autoconf
    3. CMake
    4. GCC
  4. [单选] 从以下选项中选择正确答案依次填入空中:MPI 提供的是 进程 级别的并行;OpenMP 提供的是 线程 级别的并行;两者相比,线程 级别的并行更加轻量。

    1. 线程
    2. 进程
  5. [混合]

    AVX512 指令集是现代 x86_64 CPU 上常用的用于加速 SIMD 运算的指令集,使用 AVX512 指令集,一条指令可以对最多 8(填空)个 64 位浮点数进行运算。各个厂商对 AVX512 指令集的实现有所不同,如 Intel(以 Ice Lake SP 为例)在CPU中设计独立的 AVX512 运算单元,但有可能导致调用 AVX512 指令集时因功耗过大而降频,AMD(以 ZEN4 架构为例)复用两个 AVX2 运算单元执行 AVX512 运算

    1. 在CPU中设计独立的 AVX512 运算单元,但有可能导致调用 AVX512 指令集时因功耗过大而降频
    2. 复用两个 AVX2 运算单元执行 AVX512 运算
  6. [单选] 以下哪个不是专门用于在 GPU 上设计并行计算程序的?答案:OpenGL

    1. OpenACC
    2. CUDA
    3. HIP
    4. OpenGL
  7. [多选]

    SIMD(Single Instruction, Multiple Data)和 SIMT(Single Instruction, Multiple Threads)是并行计算的两种模型。下面哪些描述是正确的?(示例回答方式:i,ii,iii,iv)答案:iii,iv

    1. SIMD 和 SIMT 都是同一种并行计算模型,只是不同的命名。
    2. SIMD 每个线程执行相同的指令,但处理不同的数据;SIMT中每个线程可以执行不同的指令。
    3. AVX512 指令集是 SIMD 模型的代表,而 OpenMP 是 SIMT 模型的代表。
    4. 在 SIMT 模型中,线程协作执行相同的指令序列,但处理不同的数据。
  8. [单选]

    在高性能计算中,下列哪项技术不是用于节点间通信的?答案:HBM

    1. InfiniBand
    2. Ethernet
    3. NVLink
    4. HBM
  9. [单选]

    在高性能计算集群中,哪个工具通常用于作业调度和资源管理?答案:Slurm

    1. 微信群
    2. Slurm
    3. Docker Swarm
    4. Apache Mesos
  10. [单选]

    在处理矩阵乘法的并行化时,通常会使用分块矩阵乘法来减少缓存未命中,从而提高并行效率。

    1. 缓存未命中
    2. 通信开销
    3. 浮点运算错误
    4. 硬盘 I/O 延迟

E. 齐心协力

https://docs.ray.io/en/latest/ray-core/api/index.html

python实现并行即可

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
import os
import numpy as np
import ray

ray.init(address=f"{os.environ['RAY_CLUSTER_ADDR']}")

@ray.remote(num_cpus=4)
class MatrixMultiplier(object):
def __init__(self, weight_file):
self.weight = np.load(weight_file)

def multiply(self, x):
return np.maximum(0, x @ self.weight)

multipliers = [
MatrixMultiplier.remote(f"weights/weight_{i}.npy") for i in range(4)
]

os.makedirs("outputs", exist_ok=True)

pipeline = [None] * 100

for i in range(100):
x = np.load(f"inputs/input_{i}.npy")

for multiplier in multipliers:
x = multiplier.multiply.remote(x)

pipeline[i] = x

for i, output in enumerate(pipeline):
output = ray.get(output)

np.save(f"outputs/output_{i}.npy", output)

ray.shutdown()

G. 3D生命游戏

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
#include <cstddef> 
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <utility>

__device__ int getNeighborIndex(int x, int y, int z, int M) {
x = (x + M) % M;
y = (y + M) % M;
z = (z + M) % M;
return x * M * M + y * M + z;
}

__global__ void update(uint8_t* curr_space, uint8_t* next_space, int M) {
extern __shared__ uint8_t shared_curr_space[];
int index = threadIdx.x + blockIdx.x * blockDim.x;
shared_curr_space[threadIdx.x] = curr_space[index];
__syncthreads();
int z = index % M;
int y = (index / M) % M;
int x = index / (M * M);
int xm = (x + M - 1) % M;
int x0 = x % M;
int xp = (x + 1) % M;
int ym = (y + M - 1) % M;
int y0 = y % M;
int yp = (y + 1) % M;
int zm = (z + M - 1) % M;
int z0 = z % M;
int zp = (z + 1) % M;
int neighbor_count = 0;
neighbor_count += curr_space[xm * M * M + ym * M + zm];
neighbor_count += curr_space[xm * M * M + ym * M + z0];
neighbor_count += curr_space[xm * M * M + ym * M + zp];
neighbor_count += curr_space[xm * M * M + y0 * M + zm];
neighbor_count += curr_space[xm * M * M + y0 * M + z0];
neighbor_count += curr_space[xm * M * M + y0 * M + zp];
neighbor_count += curr_space[xm * M * M + yp * M + zm];
neighbor_count += curr_space[xm * M * M + yp * M + z0];
neighbor_count += curr_space[xm * M * M + yp * M + zp];
neighbor_count += curr_space[x0 * M * M + ym * M + zm];
neighbor_count += curr_space[x0 * M * M + ym * M + z0];
neighbor_count += curr_space[x0 * M * M + ym * M + zp];
neighbor_count += curr_space[x0 * M * M + y0 * M + zm];
neighbor_count += curr_space[x0 * M * M + y0 * M + zp];
neighbor_count += curr_space[x0 * M * M + yp * M + zm];
neighbor_count += curr_space[x0 * M * M + yp * M + z0];
neighbor_count += curr_space[x0 * M * M + yp * M + zp];
neighbor_count += curr_space[xp * M * M + ym * M + zm];
neighbor_count += curr_space[xp * M * M + ym * M + z0];
neighbor_count += curr_space[xp * M * M + ym * M + zp];
neighbor_count += curr_space[xp * M * M + y0 * M + zm];
neighbor_count += curr_space[xp * M * M + y0 * M + z0];
neighbor_count += curr_space[xp * M * M + y0 * M + zp];
neighbor_count += curr_space[xp * M * M + yp * M + zm];
neighbor_count += curr_space[xp * M * M + yp * M + z0];
neighbor_count += curr_space[xp * M * M + yp * M + zp];
uint8_t curr_state = curr_space[x * M * M + y * M + z];
if (curr_state == 1) {
if (neighbor_count < 5 || neighbor_count > 7)
next_space[x * M * M + y * M + z] = 0;
else
next_space[x * M * M + y * M + z] = 1;
}
else {
if (neighbor_count == 6) {
next_space[x * M * M + y * M + z] = 1;
}
else {
next_space[x * M * M + y * M + z] = 0;
}
}
}

int main(int argc, char* argv[]) {
if (argc < 4) {
std::cout << "Usage: " << argv[0] << " <input_path> <output_path> <N>"
<< std::endl;
return 1;
}
char* input_path = argv[1];
char* output_path = argv[2];
int N = std::atoi(argv[3]);
size_t M, T;
std::ifstream input_file(input_path, std::ios::binary);
input_file.read(reinterpret_cast<char*>(&M), sizeof(M));
input_file.read(reinterpret_cast<char*>(&T), sizeof(T));
uint8_t* curr_space = new uint8_t[M * M * M],
* next_space = new uint8_t[M * M * M];
input_file.read(reinterpret_cast<char*>(curr_space), M * M * M);
uint8_t* d_curr_space, * d_next_space;
cudaMalloc(&d_curr_space, M * M * M * sizeof(uint8_t));
cudaMalloc(&d_next_space, M * M * M * sizeof(uint8_t));
cudaMemcpy(d_curr_space, curr_space, M * M * M * sizeof(uint8_t),
cudaMemcpyHostToDevice);
int num_cells = M * M * M;
int threads_per_block = 256;
int num_blocks = (num_cells + threads_per_block - 1) / threads_per_block;
for (int t = 0; t < N; ++t) {
update << <num_blocks, threads_per_block, threads_per_block * sizeof(uint8_t) >> > (d_curr_space, d_next_space, M);
std::swap(d_curr_space, d_next_space);
}
cudaMemcpy(curr_space, d_curr_space, M * M * M * sizeof(uint8_t),cudaMemcpyDeviceToHost);
T += N;
std::ofstream output_file(output_path, std::ios::binary);
output_file.write(reinterpret_cast<char*>(&M), sizeof(M));
output_file.write(reinterpret_cast<char*>(&T), sizeof(T));
output_file.write(reinterpret_cast<char*>(curr_space), M * M * M);
delete[] curr_space;
delete[] next_space;

cudaFree(d_curr_space);
cudaFree(d_next_space);

return 0;
}

编译:

1
/usr/local/cuda/bin/nvcc -O3 main.cu -o main

测试数据:

1
2
3
4
5
module load gcc/12

g++ -std=c++17 -o generate generate.cpp

./generate 1024 data.bin 10

测试:

1
srun -p GPU40G -N 1 --gres=gpu:1 ./main data.bin output 1024

4e2a575ae47d338562fbabebfc7cf5ca output

F. 高性能数据校验

这道题关键就在于如何高效地处理存在依赖关系的数据,题目提示“利用到现代计算机的多核特性以及并行文件系统的强悍能力”,但是后一块的SHA512校验值依赖于前一块的SHA512校验值...

想到了E题的 Pipeline Parallelism ,但是如何才能并行呢?

overlap...

加载modules

1
module load gcc/13 openmpi/4.1.6 openssl/3

生成校验文件

1
2
g++ -o generate generate.cpp -fopenmp
$PWD/generate data.bin 4294967296 256

校验代码(最终还是串行

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
#include <algorithm> 
#include <chrono>
#include <cstring>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <mpi.h>
#include <openssl/evp.h>
#include <openssl/sha.h>

namespace fs = std::filesystem;

constexpr size_t BLOCK_SIZE = 1024 * 1024;

void print_checksum(std::ostream& os, uint8_t* md, size_t len) { for (int i = 0; i < len; i++) { os << std::setw(2) << std::setfill('0') << std::hex << static_cast<int>(md[i]); } }

void send_block(std::ifstream& istrm, size_t len, int dest, char* buffer) { istrm.read(buffer, std::min(BLOCK_SIZE, len)); MPI_Send(buffer, std::min(BLOCK_SIZE, len), MPI_CHAR, dest, 0, MPI_COMM_WORLD); }

void checksum(size_t len, uint8_t* obuf, int src, char* buffer) { int num_block = (len + BLOCK_SIZE - 1) / BLOCK_SIZE;

const char* precomputedHash = "cf83e1357eefb8bdf1542850d66d8007d620e4050b5715dc83f4a921d36ce9ce47d0d13c5d85f2b0ff8318d2877eec2f63b931bd47417a81a538327af927da3e";

uint8_t prev_md[SHA512_DIGEST_LENGTH] = {};
for(int i = 0; i < SHA512_DIGEST_LENGTH; i++) {
sscanf(&precomputedHash[i*2], "%2hhx", &prev_md[i]);
}


EVP_MD_CTX* ctx = EVP_MD_CTX_new();
EVP_MD* sha512 = EVP_MD_fetch(nullptr, "SHA512", nullptr);

for (int i = 0; i < num_block; i++) {
MPI_Status status;
MPI_Recv(buffer, std::min(BLOCK_SIZE, len - i * BLOCK_SIZE), MPI_CHAR, src, 0, MPI_COMM_WORLD, &status);
EVP_DigestInit_ex(ctx, sha512, nullptr);
EVP_DigestUpdate(ctx, buffer, std::min(BLOCK_SIZE, len - i * BLOCK_SIZE));
EVP_DigestUpdate(ctx, prev_md, SHA512_DIGEST_LENGTH);

unsigned int md_len = 0;
EVP_DigestFinal_ex(ctx, prev_md, &md_len);
}

std::memcpy(obuf, prev_md, SHA512_DIGEST_LENGTH);
EVP_MD_CTX_free(ctx);
EVP_MD_free(sha512);
}

int main(int argc, char* argv[]) { auto begin_time = std::chrono::high_resolution_clock::now();

MPI_Init(&argc, &argv);

int rank, nprocs;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

if (rank == 0) {
if (argc < 2) {
std::cout << "Usage: " << argv[0] << " <input_file>" << std::endl;
MPI_Abort(MPI_COMM_WORLD, 1);
}

fs::path input_path = argv[1];

auto file_size = fs::file_size(input_path);
std::cout << input_path << " size: " << file_size << std::endl;

MPI_Bcast(&file_size, 1, MPI_UNSIGNED_LONG_LONG, 0, MPI_COMM_WORLD);

std::ifstream istrm(input_path, std::ios::binary);

char* buffer = new char[BLOCK_SIZE];

int num_block = (file_size + BLOCK_SIZE - 1) / BLOCK_SIZE;
for (int i = 0; i < num_block; i++) {
send_block(istrm, file_size - i * BLOCK_SIZE, 1, buffer);
}

delete[] buffer;

} else if (rank == 1) {

uint64_t file_size;
MPI_Bcast(&file_size, 1, MPI_UNSIGNED_LONG_LONG, 0, MPI_COMM_WORLD);

char* buffer = new char[BLOCK_SIZE];

uint8_t obuf[SHA512_DIGEST_LENGTH];
checksum(file_size, obuf, 0, buffer);

delete[] buffer;

std::cout << "checksum: ";
print_checksum(std::cout, obuf, SHA512_DIGEST_LENGTH);
std::cout << std::endl;
std::ofstream ofs("out.data");
print_checksum(ofs, obuf, SHA512_DIGEST_LENGTH);
}

MPI_Finalize();

return 0;
}

计时:

1
2
3
4
auto begin_time = std::chrono::high_resolution_clock::now();
auto end_time = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - begin_time);
std::cout << "checksum time cost: " << std::dec << duration.count() << "ms" << std::endl;

编译代码:

1
mpicxx -O3 -L/lustre/toolchain/openssl/lib64/ -lcrypto -O3 max.cpp

运行:

1
mpirun -np 8 a.out /lustre/home/hu_y68lstwp8s7q/data.bin /lustre/home/hu_y68lstwp8s7q/outputfile

6bbc0014c35c1377c952fc70009ae2e319571b0eb3cd674efb31e41d2382047cf8f689929fd873da68a16892478033e9fa256247fee49ed5b4bc90968ef6a5f0

H. 矩阵乘法

首先分块矩阵必须满足块大小小于Intel Xeon 8358的缓存以解决缓存未命中的问题,测试分块大小在32 x 32(L1)内即可

尝试使用openblas INNER_KERNEL_k1m8n8对算法优化:https://github.com/OpenMathLib/OpenBLAS/blob/develop/kernel/x86_64/dgemm_kernel_8x8_skylakex.c

内联汇编代码:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
//openblas
#define INNER_KERNEL_k1m8n8 \
"vmovapd (%2), %%zmm0\n\t"\
"vbroadcastsd (%0), %%zmm1\n\t"\
"vmovapd (%1), %%zmm2\n\t"\
"vfmadd231pd %%zmm2, %%zmm1, %%zmm0\n\t"\
"vmovapd %%zmm0, (%2)\n\t"

void mul(double* a, double* b, double* c, uint64_t n1, uint64_t n2, uint64_t n3) {
#pragma omp parallel for
for (uint64_t ii = 0; ii < n1; ii += BLOCK_SIZE) {
for (uint64_t jj = 0; jj < n2; jj += BLOCK_SIZE) {
for (uint64_t kk = 0; kk < n3; kk += BLOCK_SIZE) {
for (uint64_t i = ii; i < std::min(ii + BLOCK_SIZE, n1); ++i) {
for (uint64_t k = kk; k < std::min(kk + BLOCK_SIZE, n3); k += 8) {
for (uint64_t j = jj; j < std::min(jj + BLOCK_SIZE, n2); ++j) {
__asm__ __volatile__ (
INNER_KERNEL_k1m8n8
:
: "r" (&a[i * n2 + j]), "r" (&b[j * n3 + k]), "r" (&c[i * n3 + k])
: "memory"
);
}
}
}
}
}
}
}

算法程序如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
#include <fstream>
#include <iostream>
#include <omp.h>
#include <immintrin.h>

const int BLOCK_SIZE = 128;

void mul(double* a, double* b, double* c, uint64_t n1, uint64_t n2, uint64_t n3) {
#pragma omp parallel
{
uint64_t ii, jj, kk, i, j, k;

#pragma omp for schedule(dynamic, 1) nowait
for (ii = 0; ii < n1; ii += BLOCK_SIZE) {
for (jj = 0; jj < n2; jj += BLOCK_SIZE) {
for (kk = 0; kk < n3; kk += BLOCK_SIZE) {
for (j = jj; j < std::min(jj + BLOCK_SIZE, n2); ++j) {
for (i = ii; i < std::min(ii + BLOCK_SIZE, n1); ++i) {
__m512d a_vec = _mm512_set1_pd(a[i * n2 + j]);
for (k = kk; k < std::min(kk + BLOCK_SIZE, n3); k += 16) {
__m512d c_vec1 = _mm512_load_pd(&c[i * n3 + k]);
__m512d b_vec1 = _mm512_load_pd(&b[j * n3 + k]);
__m512d c_vec2 = _mm512_load_pd(&c[i * n3 + k + 8]);
__m512d b_vec2 = _mm512_load_pd(&b[j * n3 + k + 8]);
c_vec1 = _mm512_fmadd_pd(a_vec, b_vec1, c_vec1);
c_vec2 = _mm512_fmadd_pd(a_vec, b_vec2, c_vec2);
_mm512_store_pd(&c[i * n3 + k], c_vec1);
_mm512_store_pd(&c[i * n3 + k + 8], c_vec2);
_mm_prefetch((char*)&c[i * n3 + k + 64], _MM_HINT_T0);
_mm_prefetch((char*)&b[j * n3 + k + 64], _MM_HINT_T0);
}
}
}
}
}
}
}
}

int main() {
uint64_t n1, n2, n3;
FILE* fi;

fi = fopen("conf.data", "rb");
fread(&n1, sizeof(n1), 1, fi);
fread(&n2, sizeof(n2), 1, fi);
fread(&n3, sizeof(n3), 1, fi);

double* a = (double*)aligned_alloc(64, n1 * n2 * sizeof(double));
double* b = (double*)aligned_alloc(64, n2 * n3 * sizeof(double));
double* c = (double*)aligned_alloc(64, n1 * n3 * sizeof(double));

fread(a, sizeof(double), n1 * n2, fi);
fread(b, sizeof(double), n2 * n3, fi);
fclose(fi);

mul(a, b, c, n1, n2, n3);

fi = fopen("out.data", "wb");
fwrite(c, sizeof(double), n1 * n3, fi);
fclose(fi);

return 0;
}

编译代码:

1
g++ -std=c++2a -O3 -fopenmp -mavx512f -o matrix matrix.cpp

vtune:

1
2
3
4
5
vtune -collect hotspots -result-dir vtune_result ./matrix

vtune -collect threading -result-dir vtune_result -- ./matrix

vtune -report summary -result-dir vtune_result

I. logistic方程

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
#include <iostream>
#include <chrono>
#include <immintrin.h>
#include <omp.h>
#include <fstream>

constexpr size_t BUFFER_SIZE = 1024;

void itv(double r, double* x, int64_t n, int64_t itn) {
__m512d rr = _mm512_set1_pd(r);
__m512d one = _mm512_set1_pd(1.0);

#pragma omp parallel for schedule(dynamic)
for (int64_t i = 0; i < n; i+=16) {
if (i+32 < n) {
_mm_prefetch((char*)(x + i + 32), _MM_HINT_T0);
}
__m512d xx1 = _mm512_load_pd(x + i);
__m512d xx2 = _mm512_load_pd(x + i + 8);
for (int64_t j = 0; j < itn; j++) {
__m512d product1 = _mm512_mul_pd(rr, xx1);
__m512d product2 = _mm512_mul_pd(rr, xx2);
xx1 = _mm512_mul_pd(product1, _mm512_sub_pd(one, xx1));
xx2 = _mm512_mul_pd(product2, _mm512_sub_pd(one, xx2));
}
_mm512_store_pd(x + i, xx1);
_mm512_store_pd(x + i + 8, xx2);
}
}


int main(){
omp_set_num_threads(8);
FILE* fi;
fi = fopen("conf.data", "rb");

int64_t itn;
double r;
int64_t n;
double* x;

fread(&itn, 1, 8, fi);
fread(&r, 1, 8, fi);
fread(&n, 1, 8, fi);
x = (double*)aligned_alloc(64, n * sizeof(double));
fread(x, 1, n * 8, fi);
fclose(fi);

itv(r, x, n, itn);

FILE* fo;
fo = fopen("out.data", "wb");
fwrite(x, 1, n * 8, fo);
fclose(fo);

free(x);

return 0;
}

编译代码:

1
2
3
g++ -std=c++20 -O3 -fopenmp -mavx512f -o logistic logistic.cpp

vtune -collect hotspots -result-dir vtune_result ./logistic

J. H-66

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
#include <vector>
#include <unordered_map>
#include <cstdint>
#include <cmath>
#include <chrono>

#ifdef _WIN32
#include <windows.h>
#define popcnt __popcnt64
#else
#define popcnt __builtin_popcountll
#endif

#include <iostream>

using map_t = std::unordered_map<uint64_t, size_t>;

struct restrict_t {
int offset{}, range{}, minocc{}, maxocc{};
int occ{};
uint64_t substate{};
};

template <typename VT>
struct term_t {
VT value;
uint64_t an{}, cr{}, signmask{}, sign{};
};

struct sparse_t {
std::vector<size_t> row;
std::vector<size_t> col;
std::vector<double> data;
};

int itrest(restrict_t& rest) {
if (!rest.substate) {
goto next;
}
{
uint64_t x = rest.substate & (-(int64_t)rest.substate);
uint64_t y = x + rest.substate;
rest.substate = y + (y ^ rest.substate) / x / 4;
}
if (rest.substate >> rest.range) {
next:
if (rest.occ == rest.maxocc) {
rest.occ = rest.minocc;
rest.substate = (uint64_t(1) << rest.occ) - 1;
return 1;
}
rest.occ++;
rest.substate = (uint64_t(1) << rest.occ) - 1;
return 0;
}
return 0;
}

int itrest(std::vector<restrict_t>& rest) {
for (auto& re : rest) {
if (!itrest(re)) {
return 0;
}
}
return 1;
}

uint64_t getstate(const std::vector<restrict_t>& rest) {
uint64_t state = 0;
for (const auto& re : rest) {
state |= re.substate << re.offset;
}
return state;
}

int generatetable(std::vector<uint64_t>& table, map_t& map, std::vector<restrict_t>& rest) {
for (auto& re : rest) {
re.occ = re.minocc;
re.substate = (uint64_t(1) << re.occ) - 1;
}

size_t index = 0;
do {
uint64_t state = getstate(rest);
table.emplace_back(state);
map.insert({state, index});
index++;
} while (!itrest(rest));

return 0;
}

template <typename VT>
term_t<VT> getterm(VT value, const std::vector<int>& cr, const std::vector<int>& an) {
term_t<VT> term;
term.value = value;
term.an = 0;
term.cr = 0;
term.signmask = 0;
uint64_t signinit = 0;

for (const auto& x : an) {
uint64_t mark = uint64_t(1) << x;
term.signmask ^= (mark - 1) & (~term.an);
term.an |= mark;
}
for (const auto& x : cr) {
uint64_t mark = uint64_t(1) << x;
signinit ^= (mark - 1) & term.cr;
term.signmask ^= (mark - 1) & (~term.an) & (~term.cr);
term.cr |= mark;
}
term.sign = popcnt(signinit ^ (term.signmask & term.an));
term.signmask = term.signmask & (~term.an) & (~term.cr);

return term;
}

template <typename VT>
int act(std::vector<size_t>& row, std::vector<size_t>& col, std::vector<VT>& data, const std::vector<term_t<VT>>& op, const std::vector<uint64_t>& table, const map_t& map) {
int64_t n = table.size();

for (int64_t i = 0; i < n; i++) {
uint64_t srcstate = table[i];

for (const auto& term : op) {
if ((srcstate & term.an) == term.an) {
uint64_t dststate = srcstate ^ term.an;
if ((dststate & term.cr) == 0) {
dststate ^= term.cr;

auto it = map.find(dststate);
if (it != map.end()) {
uint64_t sign = term.sign + popcnt(srcstate & term.signmask);
VT v = term.value;
if (sign & 1) {
v = -v;
}
data.emplace_back(v);
col.emplace_back(i);
row.emplace_back(it->second);
}
}
}
}
}

return 0;
}

int readss(FILE* fi, std::vector<uint64_t>& table, map_t& map) {
int n;
fread(&n, 1, 4, fi);
std::vector<restrict_t> restv(n);
for (auto& rest : restv) {
fread(&rest, 1, 16, fi);
}
generatetable(table, map, restv);
return 0;
}

int readop(FILE* fi, std::vector<term_t<double>>& op) {
int n, order;
fread(&n, 1, 4, fi);
fread(&order, 1, 4, fi);

std::vector<double> v(n);
fread(v.data(), 1, 8 * n, fi);

std::vector<int> rawterm(order);
std::vector<int> cr, an;

for (int i = 0; i < n; i++) {
fread(rawterm.data(), 1, 4 * order, fi);
int tn = rawterm[0];

for (int j = 0; j < tn; j++) {
int type = rawterm[tn * 2 - 1 - j * 2];
if (type) {
cr.push_back(rawterm[tn * 2 - j * 2]);
}
else {
an.push_back(rawterm[tn * 2 - j * 2]);
}
}

op.push_back(getterm(v[i], cr, an));
cr.clear();
an.clear();
}

return 0;
}

//out=m*v;
void mmv(std::vector<double>& out, const sparse_t& m, const std::vector<double>& v) {
for (auto& x : out) {
x = 0;
}

for (size_t i = 0; i < m.data.size(); i++) {
out[m.row[i]] += m.data[i] * v[m.col[i]];
}
}

//v1'*v2;
double dot(const std::vector<double>& v1, const std::vector<double>& v2) {
double s = 0;
for (size_t i = 0; i < v1.size(); i++) {
s += v1[i] * v2[i];
}
return s;
}

//v1+=s*v2;
void avv(std::vector<double>& v1, const double s, const std::vector<double>& v2) {
for (size_t i = 0; i < v1.size(); i++) {
v1[i] += s * v2[i];
}
}

//v*=s;
void msv(const double s, std::vector<double>& v) {
for (auto& x : v) {
x *= s;
}
}

//v'*v;
double norm2(const std::vector<double>& v) {
double s = 0;
for (auto& x : v) {
s += x * x;
}
return s;
}

void getsp(std::vector<double>& out, int itn, const sparse_t m, std::vector<double>& v) {
out.resize(itn * 2);
out[0] = sqrt(norm2(v));
msv(1.0 / out[0], v);

std::vector<double> a(itn), b(itn - 1);

std::vector<double> v_(v.size()), v__(v.size());
for (int i = 0; i < itn; i++) {
v__.swap(v_);
v_.swap(v);
mmv(v, m, v_);
a[i] = dot(v, v_);

if (i < itn - 1) {
if (i == 0) {
avv(v, -a[i], v_);
}
else {
avv(v, -a[i], v_);
avv(v, -b[i - 1], v__);
}

b[i] = sqrt(norm2(v));
msv(1.0 / b[i], v);
}
}

for (int i = 0; i < itn; i++) {
out[1 + i] = a[i];
}
for (int i = 0; i < itn - 1; i++) {
out[1 + itn + i] = b[i];
}
}

int main()
{
FILE* fi;
std::vector<uint64_t> table;
map_t map;
std::vector<term_t<double>> op;

fi = fopen("conf.data", "rb");
auto t1 = std::chrono::steady_clock::now();
readss(fi, table, map);
auto t2 = std::chrono::steady_clock::now();
readop(fi, op);

sparse_t opm;
act(opm.row, opm.col, opm.data, op, table, map);
auto t3 = std::chrono::steady_clock::now();

int itn;
fread(&itn, 1, 4, fi);

std::vector<double> v(table.size());
fread(v.data(), 1, table.size() * 8, fi);

fclose(fi);

std::vector<double> result;
getsp(result, itn, opm, v);

auto t4 = std::chrono::steady_clock::now();
fi = fopen("out.data", "wb");
fwrite(result.data(), 1, 16 * itn, fi);
fclose(fi);

int d1 = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count();
int d2 = std::chrono::duration_cast<std::chrono::milliseconds>(t3 - t2).count();
int d3 = std::chrono::duration_cast<std::chrono::milliseconds>(t4 - t3).count();
printf("%d,%d,%d\n", d1, d2, d3);
std::cout << "Hello World!\n";
}

编译:

1
g++ -std=c++20 -O3 -fopenmp -march=native -o matrix matrix.cpp

K. 光之游戏

微调:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
#include <cmath>
#include <cstdio>
#include <cstring>
#include <fstream>
#include <iostream>
#include <vector>
#include <string>

class Vector {
public:
double x, y, z;
Vector(const double x = 0, const double y = 0, const double z = 0) : x(x), y(y), z(z) {}
Vector(const Vector&) = default;
Vector& operator=(const Vector&) = default;
Vector operator+(const Vector& b) const { return Vector(x + b.x, y + b.y, z + b.z); }
Vector operator-(const Vector& b) const { return Vector(x - b.x, y - b.y, z - b.z); }
Vector operator-() const { return Vector(-x, -y, -z); }
Vector operator*(const double s) const { return Vector(x * s, y * s, z * s); }
Vector operator/(const double s) const { return Vector(x / s, y / s, z / s); }
bool isZero() const { return (x == 0.) && (y == 0.) && (z == 0.); }
};
inline double LengthSquared(const Vector& v) { return v.x * v.x + v.y * v.y + v.z * v.z; }
inline double Length(const Vector& v) { return std::sqrt(LengthSquared(v)); }
inline Vector operator*(const double s, const Vector& v) { return v * s; }
inline Vector Normalize(const Vector& v) { return v / Length(v); }
inline Vector Multiply(const Vector& v1, const Vector& v2) { return Vector(v1.x * v2.x, v1.y * v2.y, v1.z * v2.z); }
inline double Dot(const Vector& v1, const Vector& v2) { return v1.x * v2.x + v1.y * v2.y + v1.z * v2.z; }
inline const Vector Cross(const Vector& v1, const Vector& v2) {
return Vector((v1.y * v2.z) - (v1.z * v2.y), (v1.z * v2.x) - (v1.x * v2.z), (v1.x * v2.y) - (v1.y * v2.x));
}
using Color = Vector;
constexpr double pi_2 = 1.5707963267948966192313216916398;
constexpr int hemicube_res = 256;

double multiplier_front[hemicube_res][hemicube_res];
double multiplier_down[hemicube_res / 2][hemicube_res];

// prospective camera
class Camera {
public:
Vector pos, dir, up;
double fov, aspect_ratio;
double width, height;
Camera(Vector pos, Vector dir, Vector up, double fov, double aspect_ratio = 1.) :
pos(pos), dir(dir), up(up), fov(fov), aspect_ratio(aspect_ratio) {
height = 2 * std::tan(fov / 2);
width = height * aspect_ratio;
}
Vector project(const Vector& v) const {
Vector a = v - pos;
double z = Dot(a, dir);
double y = Dot(a, up);
double x = Dot(a, Cross(up, dir));
return Vector(-x / z, y / z, z);
}
};

// Patches are rectangle
class Patch {
public:
Vector pos;
Vector a, b; // regard Cross(b, a) as normal
Color emission;
Color reflectance;
Color incident;
Color excident;
Patch(Vector pos, Vector a, Vector b, Color emission, Color reflectance) :
pos(pos), a(a), b(b), emission(emission), reflectance(reflectance), incident(), excident(emission) {}
};

bool check_triangle(double cx, double cy, Vector a, Vector b, Vector c, double& depth) {
Vector ab = b - a, ac = c - a;
double tmp = ab.x * ac.y - ab.y * ac.x;
if (std::fabs(tmp) > 1e-6) {
double alpha = (-ac.x * (cy - a.y) + ac.y * (cx - a.x)) / tmp;
double beta = (ab.x * (cy - a.y) - ab.y * (cx - a.x)) / tmp;
if (alpha >= 0 && beta >= 0 && alpha + beta <= 1) {
depth = alpha * b.z + beta * c.z + (1 - alpha - beta) * a.z;
return true;
}
}
return false;
}

bool inside_quadrilateral(double cx, double cy, Vector a, Vector b, Vector c, Vector d, double& depth) {
return check_triangle(cx, cy, a, b, c, depth) || check_triangle(cx, cy, d, b, c, depth);
}

std::vector<Color> render_view(const Camera& camera, const std::vector<Patch>& scene, const int width, const int height) {
std::vector<Color> image(width * height);
std::vector<double> zbuffer(width * height, 1000000);
for (const auto& p : scene) {
if (Length(p.pos + 0.5 * (p.a + p.b) - camera.pos) < 1e-6) continue;
Vector a = camera.project(p.pos);
Vector b = camera.project(p.pos + p.a);
Vector c = camera.project(p.pos + p.b);
Vector d = camera.project(p.pos + p.a + p.b);
if (a.z < 1e-6 || b.z < 1e-6 || c.z < 1e-6 || d.z < 1e-6)continue;
double px = camera.width / width;
double py = camera.height / height;
for (int i = 0; i < height; ++i) {
for (int j = 0; j < width; ++j) {
double cx = (j + 0.5) * px - camera.width * 0.5;
double cy = (i + 0.5) * py - camera.height * 0.5;
double depth = 1000000;
if (inside_quadrilateral(cx, cy, a, b, c, d, depth)) {
if (depth < zbuffer[i * width + j] && depth > 1e-6) {
image[i * width + j] = p.excident;
zbuffer[i * width + j] = depth;
}
}
}
}
}
return image;
}

// https://stackoverflow.com/a/2654860
void save_bmp_file(const std::string& filename, const std::vector<Color>& image, const int width, const int height) {
FILE* f = fopen(filename.c_str(), "wb");
int filesize = 54 + 3 * width * height;
unsigned char* img = new unsigned char[3 * width * height];
memset(img, 0, 3 * width * height);

for (int i = 0; i < width; i++)
{
for (int j = 0; j < height; j++)
{
double r = sqrt(image[i + j * width].x) * 255;
double g = sqrt(image[i + j * width].y) * 255;
double b = sqrt(image[i + j * width].z) * 255;
if (r > 255) r = 255;
if (g > 255) g = 255;
if (b > 255) b = 255;
img[(i + j * width) * 3 + 2] = (unsigned char)(r);
img[(i + j * width) * 3 + 1] = (unsigned char)(g);
img[(i + j * width) * 3 + 0] = (unsigned char)(b);
}
}
unsigned char bmpfileheader[14] = { 'B','M', 0,0,0,0, 0,0, 0,0, 54,0,0,0 };
unsigned char bmpinfoheader[40] = { 40,0,0,0, 0,0,0,0, 0,0,0,0, 1,0, 24,0 };
unsigned char bmppad[3] = { 0,0,0 };

bmpfileheader[2] = (unsigned char)(filesize);
bmpfileheader[3] = (unsigned char)(filesize >> 8);
bmpfileheader[4] = (unsigned char)(filesize >> 16);
bmpfileheader[5] = (unsigned char)(filesize >> 24);

bmpinfoheader[4] = (unsigned char)(width);
bmpinfoheader[5] = (unsigned char)(width >> 8);
bmpinfoheader[6] = (unsigned char)(width >> 16);
bmpinfoheader[7] = (unsigned char)(width >> 24);
bmpinfoheader[8] = (unsigned char)(height);
bmpinfoheader[9] = (unsigned char)(height >> 8);
bmpinfoheader[10] = (unsigned char)(height >> 16);
bmpinfoheader[11] = (unsigned char)(height >> 24);

fwrite(bmpfileheader, 1, 14, f);
fwrite(bmpinfoheader, 1, 40, f);
for (int i = 0; i < height; ++i) {
fwrite(img + (i * width * 3), 3, width, f);
fwrite(bmppad, 1, (4 - (width * 3) % 4) % 4, f);
}
fclose(f);
delete[] img;
}

// Cornell Box
void load_scene(std::vector<Patch>& scene) {
FILE* fi;
int64_t n;
fi = fopen("conf.data", "rb");
fread(&n, 1, 8, fi);
for (int i = 0; i < n; i++) {
Vector a[5];
fread(a, 1, 120, fi);
scene.emplace_back(Patch(a[0], a[1], a[2], a[3], a[4]));
}
fclose(fi);
}

// divide patches into subpatches whose width and length are no less than given threshold
void divide_patches(std::vector<Patch>& scene, double threshold) {
std::vector<Patch> tmp = std::move(scene);
for (const auto& p : tmp) {
double len_a = Length(p.a);
double len_b = Length(p.b);
int a = static_cast<int>(len_a / threshold);
int b = static_cast<int>(len_b / threshold);
for (int i = 0; i <= a; ++i) {
for (int j = 0; j <= b; ++j) {
Vector pos = p.pos + i * threshold * Normalize(p.a) + j * threshold * Normalize(p.b);
Vector pa = (i + 1) * threshold > len_a ? p.a - i * threshold * Normalize(p.a) : threshold * Normalize(p.a);
Vector pb = (j + 1) * threshold > len_b ? p.b - j * threshold * Normalize(p.b) : threshold * Normalize(p.b);
if (pa.isZero() || pb.isZero()) continue;
scene.emplace_back(Patch(pos, pa, pb, p.emission, p.reflectance));
}
}
}
}

void cal_multiplier_map() {
constexpr double pw = 2. / hemicube_res;

double s=0;
for (int i = 0; i < hemicube_res; ++i) {
for (int j = 0; j < hemicube_res; ++j) {
double cx = (j + 0.5) * pw - 1;
double cy = (i + 0.5) * pw - 1;
multiplier_front[i][j] = 1 / (cx * cx + cy * cy + 1)/ (cx * cx + cy * cy + 1);
s = s + multiplier_front[i][j];
}
}
for (int i = 0; i < hemicube_res / 2; ++i) {
for (int j = 0; j < hemicube_res; ++j) {
double cx = (j + 0.5) * pw - 1;
double cz = (i + 0.5) * pw;
multiplier_down[i][j] = cz / (cx * cx + cz * cz + 1)/(cx * cx + cz * cz + 1);
//multiplier_down[i][j] *= multiplier_front[i + hemicube_res / 2][j];
s = s + 4 * multiplier_down[i][j];
}
}
for (int i = 0; i < hemicube_res; ++i) {
for (int j = 0; j < hemicube_res; ++j) {
//multiplier_front[i][j] *= multiplier_front[i][j];
}
}
s = s / (hemicube_res * hemicube_res / 4) / 3.1416;
printf("%f\n", s);
}

void cal_incident_light(std::vector<Patch>& scene) {

for (int i = 0; i < scene.size(); i++) {
auto& p = scene[i];

Vector cpos = p.pos + 0.5 * (p.a + p.b);
std::vector<Color> front = render_view(Camera(cpos, Normalize(Cross(p.b, p.a)), Normalize(p.b), pi_2), scene, hemicube_res, hemicube_res);
std::vector<Color> up = render_view(Camera(cpos, Normalize(p.b), -Normalize(Cross(p.b, p.a)), pi_2), scene, hemicube_res, hemicube_res);
std::vector<Color> left = render_view(Camera(cpos, -Normalize(p.a), Normalize(p.b), pi_2), scene, hemicube_res, hemicube_res);
std::vector<Color> right = render_view(Camera(cpos, Normalize(p.a), Normalize(p.b), pi_2), scene, hemicube_res, hemicube_res);
std::vector<Color> down = render_view(Camera(cpos, -Normalize(p.b), Normalize(Cross(p.b, p.a)), pi_2), scene, hemicube_res, hemicube_res);
Color total_light{};
for (int i = 0; i < hemicube_res; ++i) {
for (int j = 0; j < hemicube_res; ++j) {
total_light = total_light + front[i * hemicube_res + j] * multiplier_front[i][j];
if (i < hemicube_res / 2) total_light = total_light + up[i * hemicube_res + j] * multiplier_down[hemicube_res / 2 - 1 - i][j];
if (i >= hemicube_res / 2) total_light = total_light + down[i * hemicube_res + j] * multiplier_down[i - hemicube_res / 2][j];
if (j < hemicube_res / 2) total_light = total_light + right[i * hemicube_res + j] * multiplier_down[hemicube_res / 2 - 1 - j][i];
if (j >= hemicube_res / 2) total_light = total_light + left[i * hemicube_res + j] * multiplier_down[j - hemicube_res / 2][i];
}
}
p.incident = total_light / (hemicube_res * hemicube_res / 4) / 3.1416;
}
}

void cal_excident_light(std::vector<Patch>& scene) {
for (auto& p : scene) {
p.excident = Multiply(p.incident, p.reflectance) + p.emission;
}
}

int main() {
const int width = 512, height = 512;
Camera camera(Vector(278, 273, -800), Vector(0, 0, 1), Vector(0, 1, 0), 2 * std::atan2(0.0125, 0.035));
std::vector<Patch> scene;

std::cout << "init scene" << std::endl;
load_scene(scene);

std::cout << "divide patches" << std::endl;
divide_patches(scene, 15);
std::cout << "total patch number: " << scene.size() << std::endl;

int iter = 0;
std::cout << "render view " << iter << std::endl;
std::vector<Color> image = render_view(camera, scene, width, height);

std::cout << "save image " << iter << std::endl;
save_bmp_file("cornellbox" + std::to_string(iter) + ".bmp", image, width, height);

cal_multiplier_map();

const int max_iteration = 5;
for (iter = 1; iter <= max_iteration; ++iter) {
cal_incident_light(scene);
cal_excident_light(scene);
std::cout << "render view " << iter << std::endl;
image = render_view(camera, scene, width, height);

std::cout << "save image " << iter << std::endl;
save_bmp_file("cornellbox" + std::to_string(iter) + ".bmp", image, width, height);
}

return 0;
}

编译:

1
g++ -O3 -fopenmp -march=native -ffast-math -std=c++20 main.cpp -o main

L. 洪水 困兽

只需要在源码的基础上加上 #pragma omp parallel for 并行化for循环和 #pragma omp atomic 防止多个线程同时修改同一变量的值引起的数据竞争即可

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
#include <array>
#include <fstream>
#include <iostream>
#include <omp.h>
#include <vector>
#include <cmath>
#include <tuple>

using std::vector, std::array, std::tuple, std::string;

void particle2grid(int resolution, int numparticle,
const vector<double> &particle_position,
const vector<double> &particle_velocity,
vector<double> &velocityu, vector<double> &velocityv,
vector<double> &weightu, vector<double> &weightv) {
double grid_spacing = 1.0 / resolution;
double inv_grid_spacing = 1.0 / grid_spacing;
auto get_frac = [&inv_grid_spacing](double x, double y) {
int xidx = floor(x * inv_grid_spacing);
int yidx = floor(y * inv_grid_spacing);
double fracx = x * inv_grid_spacing - xidx;
double fracy = y * inv_grid_spacing - yidx;
return tuple(array<int, 2>{xidx, yidx},
array<double, 4>{fracx * fracy, (1 - fracx) * fracy,
fracx * (1 - fracy),
(1 - fracx) * (1 - fracy)});
};

#pragma omp parallel for
for (int i = 0; i < numparticle; i++) {
array<int, 4> offsetx = {0, 1, 0, 1};
array<int, 4> offsety = {0, 0, 1, 1};

auto [idxu, fracu] =
get_frac(particle_position[i * 2 + 0],
particle_position[i * 2 + 1] - 0.5 * grid_spacing);
auto [idxv, fracv] =
get_frac(particle_position[i * 2 + 0] - 0.5 * grid_spacing,
particle_position[i * 2 + 1]);

for (int j = 0; j < 4; j++) {
int tmpidx = 0;
tmpidx =
(idxu[0] + offsetx[j]) * resolution + (idxu[1] + offsety[j]);
#pragma omp atomic
velocityu[tmpidx] += particle_velocity[i * 2 + 0] * fracu[j];
#pragma omp atomic
weightu[tmpidx] += fracu[j];

tmpidx = (idxv[0] + offsetx[j]) * (resolution + 1) +
(idxv[1] + offsety[j]);
#pragma omp atomic
velocityv[tmpidx] += particle_velocity[i * 2 + 1] * fracv[j];
#pragma omp atomic
weightv[tmpidx] += fracv[j];
}
}
}

int main(int argc, char *argv[]) {
if (argc < 2) {
printf("Usage: %s inputfile\n", argv[0]);
return -1;
}

string inputfile(argv[1]);
std::ifstream fin(inputfile, std::ios::binary);
if (!fin) {
printf("Error opening file");
return -1;
}

int resolution;
int numparticle;
vector<double> particle_position;
vector<double> particle_velocity;

fin.read((char *)(&resolution), sizeof(int));
fin.read((char *)(&numparticle), sizeof(int));

particle_position.resize(numparticle * 2);
particle_velocity.resize(numparticle * 2);

printf("resolution: %d\n", resolution);
printf("numparticle: %d\n", numparticle);

fin.read((char *)(particle_position.data()),
sizeof(double) * particle_position.size());
fin.read((char *)(particle_velocity.data()),
sizeof(double) * particle_velocity.size());

vector<double> velocityu((resolution + 1) * resolution, 0.0);
vector<double> velocityv((resolution + 1) * resolution, 0.0);
vector<double> weightu((resolution + 1) * resolution, 0.0);
vector<double> weightv((resolution + 1) * resolution, 0.0);


string outputfile;

particle2grid(resolution, numparticle, particle_position,
particle_velocity, velocityu, velocityv, weightu,
weightv);
outputfile = "output.dat";

std::ofstream fout(outputfile, std::ios::binary);
if (!fout) {
printf("Error output file");
return -1;
}
fout.write((char *)(&resolution), sizeof(int));
fout.write(reinterpret_cast<char *>(velocityu.data()),
sizeof(double) * velocityu.size());
fout.write(reinterpret_cast<char *>(velocityv.data()),
sizeof(double) * velocityv.size());
fout.write(reinterpret_cast<char *>(weightu.data()),
sizeof(double) * weightu.size());
fout.write(reinterpret_cast<char *>(weightv.data()),
sizeof(double) * weightv.size());

return 0;
}