1. Introduction
By looking at the CPU utilization flame graph, we observe that the calculation of cross sections in hadronic processes contributes significantly to the total execution time.
When we generate a CPU utilization flame graph, we basically let the application run and capture userspace stack traces at a fixed sampling rate, e.g. 1Hz. Therefore, if a function is called frequently or if it takes much time to complete (or both), it is more likely that its stack trace will be collected and shown in the graph.
In this particular case both apply:
1.1. How many times ::GetCrossSection() is called
# dtrace -n '
pid$target::_ZN23G4CrossSectionDataStore17GetCrossSectionEPK17G4DynamicParticlePK10G4Material:entry
{
@ = count();
}' -c '/home/stathis/mainStatAccepTest ./exercise.g4'
[ ... snip ... ]
--- EndOfEventAction --- event= 20 t=1.34s random=0.693159
^C
4942549 # ~5M times for 20 events (which is a lot)
1.2. Time spent in ::GetCrossSection()
# dtrace -n '
pid$target::_ZN23G4CrossSectionDataStore17GetCrossSectionEPK17G4DynamicParticlePK10G4Material:entry
{
self->tt = vtimestamp;
}
pid$target::_ZN23G4CrossSectionDataStore17GetCrossSectionEPK17G4DynamicParticlePK10G4Material:return
/self->tt/
{
@ = sum(vtimestamp - self->tt);
self->tt = 0;
}
END
{
normalize(@, 1000000);
}' -c '/home/stathis/mainStatAccepTest ./exercise.g4'
[ ... snip ... ]
--- EndOfEventAction --- event= 20 t=1.61s random=0.320998
^C
3507 # msec (which is also a lot)
2. How to measure the repeatability of the cross section values
The above method can, in general, be applied to any other function.
The first time I looked at how cross-sections are distributed, and before doing the following elaborate analysis, I made an error in my calculations. The results I got gave me the false impression that there were a lot more repeated values than there actually are. In hindsight, perhaps it wouldn't make much sense to cache the results; we should rather attack the problem from another angle. Nonetheless, the following makes for interesting reading, in my humble opinion.
2.1. The obvious way
The obvious way would be to make ::GetCrossSection() dump the calculated value to stdout. The problem with this approach is that it would produce an overwhelming amount of data, plus we would need to write a custom tool to aggregate the results.
2.2. The DTrace way
In order to have DTrace do the aggregation for us, we first need to consider a few things. DTrace does not operate on floating-point numbers, nor can it (as of yet) access the floating-point registers of the FPU, such as ST(0).
Let us look at the disassembled code of the epilogue of ::GetCrossSection():
... snip ...
fstl (%eax,%ebx,8) ;; store ST(0), 'sigma', to address EAX + 8*EBX
addl $0x1,%ebx ;; increment EBX by 1
cmpl %ebx,-0x24(%ebp) ;; compare EBX to 'nElements'
jg -0x41 <::GetCrossSection+0x38> ;; loop if condition true
addl $0x3c,%esp ;; adjust ESP
popl %ebx
popl %esi
popl %edi
popl %ebp
ret ;; return to caller
Judging from the above, we conclude that the result of the calculation is returned via an FPU register, to which DTrace has no access. Had it been stored in some general-purpose register instead, we could have read its value.
We work around this obstacle by injecting an inline assembly statement into ::GetCrossSection(), so that the result of the calculation is copied into EAX:EDX. Note that the result is stored in a double-precision variable (G4double), hence it is 64 bits wide, whereas EAX and EDX are both 32 bits. Therefore the result needs to be split across those two registers and reassembled later. Had we compiled everything as 64-bit, we could have used the RAX register, which is 64 bits wide.
G4double
G4CrossSectionDataStore::GetCrossSection(const G4DynamicParticle* part,
const G4Material* mat)
{
G4double sigma(0);
G4int nElements = mat->GetNumberOfElements();
const G4double* nAtomsPerVolume = mat->GetVecNbOfAtomsPerVolume();
if (unlikely(G4int(xsecelm.capacity()) < nElements)) {
xsecelm.resize(nElements);
}
for (G4int i = 0; i < nElements; ++i) {
sigma += nAtomsPerVolume[i] *
GetCrossSection(part, (*mat->GetElementVector())[i], mat);
xsecelm[i] = sigma;
}
__asm__ __volatile__ ("" : : "A" (*(&sigma))); // This copies sigma to EAX:EDX
return sigma;
}
Let us look at the new assembly:
... snip ...
fstl -0x40(%ebp) ;; store ST(0) to a local variable
movl -0x40(%ebp),%eax ;; store first half to EAX
movl -0x3c(%ebp),%edx ;; store second half to EDX, 0x3c + 0x4 = 0x40
addl $0x4c,%esp
popl %ebx
popl %esi
popl %edi
popl %ebp
ret ;; return to caller
The result of the computation is copied from the ST(0) FPU register to a local variable, and then it is split into two halves and copied to EAX and EDX, which is what we wanted. DTrace can now be used to look at the values of EAX and EDX with the aid of the uregs[] array.
# dtrace -x aggrate=1000us -x aggsize=50M -n '
pid$target::_ZN23G4CrossSectionDataStore17GetCrossSectionEPK17G4DynamicParticlePK10G4Material:return
{
@[uregs[R_EAX], uregs[R_EDX]] = count();
}' -c '/home/stathis/mainStatAccepTest ./exercise.g4' -o uregs.cross
The above script will output something in the following form:
2730418801 1064925043 7
2937292920 1052248252 7
1236151356 1064336539 8
387162505 1062135956 9
3605409360 1060393923 9
....
The first column is the value of EAX, the second column is the value of EDX, and the third column is the count of the respective <EAX, EDX> pair. In order to reassemble the sigma values from the individual 32-bit integers, we use a C/C++ union:
union DoubleInt {
    uint32_t u_i[2];   // the two 32-bit halves, as read from EAX and EDX
    double   u_d;      // the reassembled G4double
};
We copy the integers into the u_i[] array, and then print the result from u_d; a minimal sketch of such a reassembly filter is shown after the sample output below. So, for instance, in a 20-event simulation of the simplified calorimeter:
# tail -n100 uregs.cross | ./reassembly
... snip ...
0.00393944 142
0.00260406 296
0.00104162 409
0.00208325 505
0.0010615 513
0.00243045 526
0.00138883 565
0.000694415 606
0.00573212 740
0.000347208 1190
3.47208e-05 2213
3.47208e-06 2614
0.000173604 3190
0 3103580
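The ./reassembly tool itself is not listed in this write-up; the following is only a rough sketch of what such a filter might look like, assuming the input is the three-column DTrace output shown above (EAX, EDX, count) and the 32-bit x86 word order from the disassembly (EAX holds the low half):

#include <cstdint>
#include <iostream>

// Sketch of a reassembly filter: read "EAX EDX count" triplets from stdin
// and print the reconstructed double next to its count.
union DoubleInt {
    uint32_t u_i[2];
    double   u_d;
};

int main()
{
    uint64_t eax, edx, cnt;
    while (std::cin >> eax >> edx >> cnt) {
        DoubleInt di;
        di.u_i[0] = static_cast<uint32_t>(eax);  // low  32 bits of the double
        di.u_i[1] = static_cast<uint32_t>(edx);  // high 32 bits of the double
        std::cout << di.u_d << " " << cnt << "\n";
    }
    return 0;
}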
2.3. Validation of the divide-then-reassembly procedure
In order to validate the above methodology, we made ::GetCrossSection() always return the same hard-coded values, and saw them being reassembled correctly at the other end of the DTrace output.
3. The caching algorithm
3.1. First attempt
We used a hash map to store tuples of the form:
((DynamicParticle, Material, Kinetic Energy), (CachedValue)).
The hash was computed from the individual components of the composite key, e.g., the energy, the name of the particle, the material, etc.
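A minimal sketch of what such a cache might look like (the key structure, hash combination, and names below are illustrative, not the actual patch):

#include <functional>
#include <string>
#include <unordered_map>

// Illustrative composite key: particle, material and kinetic energy.
struct XSKey {
    std::string particle;   // e.g. the particle definition name
    std::string material;   // e.g. the material name
    double      energy;     // the kinetic energy

    bool operator==(const XSKey& o) const {
        return energy == o.energy && particle == o.particle &&
               material == o.material;
    }
};

// Hash computed from the individual components of the key.
struct XSKeyHash {
    std::size_t operator()(const XSKey& k) const {
        std::size_t h = std::hash<std::string>()(k.particle);
        h ^= std::hash<std::string>()(k.material) + 0x9e3779b9 + (h << 6) + (h >> 2);
        h ^= std::hash<double>()(k.energy) + 0x9e3779b9 + (h << 6) + (h >> 2);
        return h;
    }
};

// ((DynamicParticle, Material, Kinetic Energy), (CachedValue))
std::unordered_map<XSKey, double, XSKeyHash> xsCache;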
At first, performance was way worse than without caching, which didn't make much sense, since I was under the false impression that there would be a lot of cache hits. Also, as the simulation progressed, the performance degraded because the map too often reached its maximum capacity and needed to reallocate memory and rehash all the existing elements.
3.2. Second attempt
After realizing that the cache hits weren't that many, and by printing statistics on the hash map utilization, I noticed that the map was overwhelmed by values that were never reused. I then made the map clear itself whenever its load factor exceeded an arbitrary upper bound, around 0.8.
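In code, the purge trigger could be as simple as the following sketch (the 0.8 threshold is the arbitrary bound mentioned above; xsCache is the illustrative map from the previous sketch):

// Throw everything away before the map gets rehashed/overwhelmed.
if (xsCache.load_factor() > 0.8f)
    xsCache.clear();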
That made things better, but caching was still slower than not caching.
3.3. Third attempt
I restructured the hash map to be of the form:
((Particle, Material, Kinetic Energy), (CachedValue, Count))
Every time a cached value was reused, its hit count was increased. Then, when the map started filling up, the values with a hit count less than T would be discarded. This favors the good values, throwing away the useless ones in an LFU-like (Least Frequently Used) manner.
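A sketch of this purge policy, again with illustrative names (the map now stores a (value, count) pair per key):

// ((Particle, Material, Kinetic Energy), (CachedValue, Count))
struct XSEntry {
    double   value;  // cached cross section
    unsigned hits;   // how many times this entry has been reused
};

std::unordered_map<XSKey, XSEntry, XSKeyHash> xsCountingCache;

// When the map starts filling up, keep only the entries that have proven
// useful, i.e. those with a hit count of at least T.
void PurgeColdEntries(std::unordered_map<XSKey, XSEntry, XSKeyHash>& cache,
                      unsigned T)
{
    for (auto it = cache.begin(); it != cache.end(); ) {
        if (it->second.hits < T)
            it = cache.erase(it);
        else
            ++it;
    }
}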
That made things better, but now the caching version is merely as fast as the non-caching one.
The following graph is not generated from the most optimized set of parameters (nor from the worst). It portrays how the size of the hash map gradually increases as more cross sections are calculated. At some point the map's utilization reaches a critical upper bound, at which point we decide to empty it, preserving the good elements though. Ideally, we would like those residual elements to increase after every purge cycle (here they don't).
3.4. Fourth attempt
Since the vast majority of recurring cross section values is zero, I decided to try caching only those (Particle, Material, Kinetic Energy) keys for which the cross section calculation results in a zero outcome. This significantly increased the hits vs. misses ratio, and also allowed the hash map to shrink in size (or, relatedly, to be purged less often).
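A sketch of this variant, with illustrative names (ComputeCrossSection stands for the normal, expensive calculation):

#include <unordered_set>

// Remember only the keys that are known to yield a zero cross section;
// a hit lets us skip the calculation entirely.
std::unordered_set<XSKey, XSKeyHash> zeroXS;

double GetCrossSectionCached(const XSKey& key)
{
    if (zeroXS.count(key))
        return 0.0;                           // cache hit: known zero result

    double sigma = ComputeCrossSection(key);  // normal (expensive) path
    if (sigma == 0.0)
        zeroXS.insert(key);                   // cache only zero outcomes
    return sigma;
}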
3.5. Increasing the hits ratio
The kinetic energy of a particle is stored in a double data type, which is 64 bits wide as per the IEEE 754 standard. Out of the 64 bits, 52 are used to represent the fraction part of the number, which translates to log_10(2^52) ~ 16 decimal digits of precision.
In order to increase the hits ratio, we consider using only up to N bits of the kinetic energy when we form the composite key.
union {
    double   d;
    uint64_t i;
} uDoubleInt;

uDoubleInt.d = part->GetKineticEnergy();

// Keep only the top 24 bits of the double (sign, exponent and the 12 most
// significant fraction bits); the lower mantissa bits are masked away.
std::tuple<const G4DynamicParticle *, const G4Material *, uint64_t>
    partmat(part, mat, uDoubleInt.i & 0xFFFFFF0000000000);
This method increases the hits ratio, since more identical tuples are generated, at the expense, though, of less precise results.
The following graph is not generated from the most optimized set of parameters (nor from the worst). It just portrays the dependence of the hits/misses ratio on the number of bits that are used when creating the hash key.
3.6. Which hadronic processes are amenable to cross-section caching?
Every hadronic process has its own instance of G4CrossSectionDataStore. By gathering statistics (hits, misses, load factor) on a per-G4CrossSectionDataStore-instance basis, we observed that not all of them exhibit the same statistical characteristics. Some instances would have a zero hits/misses ratio, whereas others a ratio of 20. Therefore, it only makes sense to enable caching in those cases where it could actually make a (positive) difference.
We subsequently modified the G4HadronicProcess and G4CrossSectionDataStore classes so that when the former instantiates the latter, it also passes its process name and the particle type as the "owner" of the cross-section data store. This way we can map a G4CrossSectionDataStore instance back to its hadronic process, and thus identify which ones interest us the most.
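Schematically, the change could look roughly like the following (the member names here are purely illustrative and not the actual Geant4 API):

#include <string>

// Illustrative only: tag each data store with the process/particle that
// owns it, so per-instance statistics can be attributed afterwards.
class G4CrossSectionDataStore {
public:
    void SetOwner(const std::string& processName,
                  const std::string& particleName)
    {
        fOwner = processName + " / " + particleName;
    }
    const std::string& GetOwner() const { return fOwner; }
private:
    std::string fOwner;   // e.g. "nFission / neutron"
    // ... actual cross-section data store members ...
};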
The combinations that manifest a considerable hits/(hits+misses) ratio, at their (presumed) steady state, in a 5,000-event Full CMS simulation are:
PhotonInelastic / gamma ( ~90% )
nFission / neutron ( ~98% )
ElectroNuclear / e- ( ~94% )
NeutronInelastic / neutron ( ~35% )
PositronNuclear / e+ ( ~67% )
hadElastic / alpha ( ~41% )
hadElasticTriton / triton ( ~2-3% )
nCapture / neutron ( ~2% )
The rest of the combinations fall into the 0-2% range.
The list of plots with cache hits and misses.
The list of plots with hash maps' load factors.
And here is the bash/R script that was used to generate all the above plots.
In order for a process to benefit from caching, not only do the hits need to outweigh the misses, but there also have to be a lot of cross-section calculations. A surrogate marker for the latter is to look at how many samples a hits or load plot has.
3.7. Validation of the caching algorithm
We can enable a code block, via conditional compilation with #if, which for every ::GetCrossSection() call calculates the cross-section both the normal (non-caching) way and by querying the hash map. It then compares the results, returns the correct one, and increments a mismatch counter if they differ. This way we can be sure that the algorithm doesn't produce spurious results.
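A sketch of this validation path (the VALIDATE_XS_CACHE flag, the helpers LookupCache, ComputeCrossSection, LookupOrCompute, and the nMismatches counter are made-up names for illustration):

G4double GetCrossSectionChecked(const G4DynamicParticle* part,
                                const G4Material* mat)
{
#if defined(VALIDATE_XS_CACHE)
    G4double cached  = LookupCache(part, mat);          // query the hash map
    G4double correct = ComputeCrossSection(part, mat);  // normal, non-caching path
    if (cached != correct)
        ++nMismatches;     // count disagreements between the two code paths
    return correct;        // always return the trusted value
#else
    return LookupOrCompute(part, mat);
#endif
}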
3.8. Various
Currently every hadronic process has its own instance of G4CrossSectionDataStore and therefore its own cache. I turned G4CrossSectionDataStore into a singleton in an attempt to unify the caches, but Geant4 would throw an exception from some other part of the program.
After talking to John Apostolakis about this, it appears that not being able to unify the caches should be considered a good thing.