I'm writing a program to analyze a social network graph, which means the program performs a lot of random memory accesses. It seems to me that prefetching should help. Here is a small piece of the code that reads values from the neighbors of a vertex.
for (size_t i = 0; i < v.get_num_edges(); i++) {
    unsigned int id = v.neighbors[i];
    res += neigh_vals[id];
}
I transformed the code above into the version below, which prefetches the values of a vertex's neighbors in batches of 128 before reading them.
int *neigh_vals = new int[num_vertices];
for (size_t i = 0; i < v.get_num_edges(); i += 128) {
    size_t this_end = std::min(v.get_num_edges(), i + 128);
    for (size_t j = i; j < this_end; j++) {
        unsigned int id = v.neighbors[j];
        __builtin_prefetch(&neigh_vals[id], 0, 2);
    }
    for (size_t j = i; j < this_end; j++) {
        unsigned int id = v.neighbors[j];
        res += neigh_vals[id];
    }
}
I didn't overload any operators in this C++ code.
Unfortunately, the change doesn't really improve performance, and I wonder why. Hardware prefetching presumably doesn't help in this case because the hardware can't predict the memory locations.
I wonder if it's caused by GCC optimization. I compile the code with -O3, and I really hope prefetching can further improve performance even with -O3 enabled. Does -O3 fuse the two loops in this case? Does -O3 already enable prefetching in this case by default?
I use GCC 4.6.3, and the program runs on an Intel Xeon E5-4620.
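For anyone who wants to reproduce this, here is a minimal self-contained sketch of the two versions side by side. The Vertex struct, the function names sum_plain and sum_batched_prefetch, and the batch parameter are stand-ins made up for this post, not the types from the real program; the only non-standard call is GCC's __builtin_prefetch, used as above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the real vertex type (an assumption for this sketch).
struct Vertex {
    std::vector<unsigned int> neighbors;
    std::size_t get_num_edges() const { return neighbors.size(); }
};

// Baseline: read each neighbor's value directly, with no software prefetch.
long long sum_plain(const Vertex &v, const int *neigh_vals) {
    long long res = 0;
    for (std::size_t i = 0; i < v.get_num_edges(); i++) {
        unsigned int id = v.neighbors[i];
        res += neigh_vals[id];
    }
    return res;
}

// Batched version: issue prefetches for up to `batch` neighbors,
// then read the same neighbors in a second pass.
long long sum_batched_prefetch(const Vertex &v, const int *neigh_vals,
                               std::size_t batch = 128) {
    assert(batch > 0);
    long long res = 0;
    const std::size_t n = v.get_num_edges();
    for (std::size_t i = 0; i < n; i += batch) {
        const std::size_t this_end = std::min(n, i + batch);
        for (std::size_t j = i; j < this_end; j++) {
            unsigned int id = v.neighbors[j];
            // Read access (0), moderate temporal locality (2).
            __builtin_prefetch(&neigh_vals[id], 0, 2);
        }
        for (std::size_t j = i; j < this_end; j++) {
            unsigned int id = v.neighbors[j];
            res += neigh_vals[id];
        }
    }
    return res;
}
```

Both functions should return the same sum for the same input, so only the timing differs. To see any effect, neigh_vals presumably has to be much larger than the last-level cache and the neighbor ids have to be scattered; the batch size is left as a parameter so that values other than 128 can be tried.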
Thanks,
Da