Code:
for (i = 0; i < 4; i++)
{
    // one read-CAS event per memory controller channel (imc0..imc3)
    sprintf(sIMCCounter, "snbep_unc_imc%d::UNC_M_CAS_COUNT:RD:e=0:i=0:t=0", i);
    iError = PAPI_add_named_event(*pMaske, sIMCCounter);
    if (iError != PAPI_OK && iError != PAPI_ECNFLCT)
    {
        // proper error handling
    }
}
for (i = 0; i < 4; i++)
{
    // one write-CAS event per memory controller channel (imc0..imc3)
    sprintf(sIMCCounter, "snbep_unc_imc%d::UNC_M_CAS_COUNT:WR:e=0:i=0:t=0", i);
    iError = PAPI_add_named_event(*pMaske, sIMCCounter);
    if (iError != PAPI_OK && iError != PAPI_ECNFLCT)
    {
        // proper error handling
    }
}
According to the Intel documentation, the memory bandwidth can be measured as:
Code:
Memory Read BW [MBytes/s] = 1.0E-06*(snbep_unc_imc0::UNC_M_CAS_COUNT:RD + snbep_unc_imc1::UNC_M_CAS_COUNT:RD + snbep_unc_imc2::UNC_M_CAS_COUNT:RD + snbep_unc_imc3::UNC_M_CAS_COUNT:RD)*64.0/time
Memory Write BW [MBytes/s] = 1.0E-06*(snbep_unc_imc0::UNC_M_CAS_COUNT:WR + snbep_unc_imc1::UNC_M_CAS_COUNT:WR + snbep_unc_imc2::UNC_M_CAS_COUNT:WR + snbep_unc_imc3::UNC_M_CAS_COUNT:WR)*64.0/time
Memory BW [MBytes/s] = Memory Read BW [MBytes/s] + Memory Write BW [MBytes/s]
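For reference, the formulas above can be sketched as a small C helper. This is only an illustration: the names `cas_to_mbytes_per_s`, `NUM_IMC_CHANNELS` and `CACHE_LINE_BYTES` are hypothetical and not part of PAPI or likwid, and `seconds` is assumed to be the wall-clock time of the measured region.

```c
#include <assert.h>

#define NUM_IMC_CHANNELS 4     /* imc0..imc3, as in the event names above */
#define CACHE_LINE_BYTES 64.0  /* the formulas count 64 bytes per CAS */

/* Bandwidth in MBytes/s from per-channel CAS counts and elapsed seconds.
 * Hypothetical helper: it mirrors the formula 1.0E-06 * sum(CAS) * 64.0 / time. */
double cas_to_mbytes_per_s(const long long cas[NUM_IMC_CHANNELS], double seconds)
{
    double total = 0.0;
    int i;

    assert(seconds > 0.0);  /* a zero or negative runtime makes no sense here */
    for (i = 0; i < NUM_IMC_CHANNELS; i++)
        total += (double) cas[i];
    return 1.0E-06 * total * CACHE_LINE_BYTES / seconds;
}

/* Total bandwidth is simply the sum of the read and write bandwidths. */
double total_mbytes_per_s(const long long rd[NUM_IMC_CHANNELS],
                          const long long wr[NUM_IMC_CHANNELS],
                          double seconds)
{
    return cas_to_mbytes_per_s(rd, seconds) + cas_to_mbytes_per_s(wr, seconds);
}
```

As a made-up example, four channels with 1,000,000 CAS each over 2.0 seconds give 1.0E-06 * 4,000,000 * 64.0 / 2.0 = 128 MBytes/s.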
Using these formulas, I get an enormous memory bandwidth that cannot be real.
I checked the results against the ones obtained with likwid:
Code:
+--------------+-------------+
| Event        | core 16     |
+--------------+-------------+
| CAS_COUNT_RD | 3.17476e+07 |
| CAS_COUNT_WR | 4.49026e+08 |
| CAS_COUNT_RD | 5.67655e+07 |
| CAS_COUNT_WR | 4.48882e+08 |
| CAS_COUNT_RD | 5.8196e+07  |
| CAS_COUNT_WR | 4.49182e+08 |
| CAS_COUNT_RD | 5.76525e+07 |
| CAS_COUNT_WR | 4.48991e+08 |
+--------------+-------------+
+----------------------------+---------+
| Memory Read BW [MBytes/s]  | 469.748 |
| Memory Write BW [MBytes/s] | 4128.49 |
| Memory BW [MBytes/s]       | 4598.24 |
+----------------------------+---------+
In the case of PAPI I get:
Code:
CAS0_R=15637948602, CAS1_R=15652951072, CAS2_R=15655992784, CAS3_R=15619746407
CAS0_W=15663920078, CAS1_W=15654867324, CAS2_W=15671126429, CAS3_W=15706756602
The strange thing here is that the CAS_R counters are roughly 300 times bigger than the ones obtained with likwid, and the CAS_W counters are about 35 times bigger. With PAPI, CAS_R and CAS_W are also in the same range, which gives nearly identical read and write bandwidths (that should not be the case in my example).
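The size of the discrepancy can be checked from the numbers posted above. The following is a minimal sketch that compares the summed counts of the four channels. The function names `sum_counts` and `count_ratio` are hypothetical, and since the two tools were necessarily used in separate runs, the ratios are only indicative.

```c
#define NUM_IMC_CHANNELS 4

/* Counts copied from the likwid table and the PAPI output in this post. */
const double likwid_rd[NUM_IMC_CHANNELS] = {3.17476e+07, 5.67655e+07, 5.8196e+07, 5.76525e+07};
const double likwid_wr[NUM_IMC_CHANNELS] = {4.49026e+08, 4.48882e+08, 4.49182e+08, 4.48991e+08};
const double papi_rd[NUM_IMC_CHANNELS] = {15637948602.0, 15652951072.0, 15655992784.0, 15619746407.0};
const double papi_wr[NUM_IMC_CHANNELS] = {15663920078.0, 15654867324.0, 15671126429.0, 15706756602.0};

/* Sum the counts of all channels. */
double sum_counts(const double counts[NUM_IMC_CHANNELS])
{
    double total = 0.0;
    int i;

    for (i = 0; i < NUM_IMC_CHANNELS; i++)
        total += counts[i];
    return total;
}

/* How many times larger the aggregate PAPI count is than the likwid one. */
double count_ratio(const double papi[NUM_IMC_CHANNELS],
                   const double likwid[NUM_IMC_CHANNELS])
{
    return sum_counts(papi) / sum_counts(likwid);
}
```

With these inputs, `count_ratio(papi_rd, likwid_rd)` comes out slightly above 300 and `count_ratio(papi_wr, likwid_wr)` close to 35, while the PAPI read and write sums differ from each other by well under one percent.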
The code I'm profiling is:
Code:
for (j = 0; j < TIMES; j++)
{
    pValues = (long int *) malloc(liSize * sizeof(long int));
    if (!pValues)
    {
        // Error handling
    }
    // kernel W: write every element once
    for (i = 0; i < liSize; i++) pValues[i] = i;
    free((void *) pValues);
    pValues = NULL;
} // for
Can anybody explain what is happening, and how I can measure the memory bandwidth of an application using PAPI?
Thanks in advance!!
Carmen
