
Measuring memory bandwidth

Posted: Thu Nov 28, 2013 7:50 am
by cnavarrete
I'm trying to measure the memory bandwidth of an application running on a Sandy Bridge machine using the CAS counters:
Code:
/* Add one read-CAS event per memory controller channel (imc0..imc3) */
for (i = 0; i < 4; i++)
{
    sprintf(sIMCCounter, "snbep_unc_imc%d::UNC_M_CAS_COUNT:RD:e=0:i=0:t=0", i);
    iError = PAPI_add_named_event(*pMaske, sIMCCounter);
    if (iError != PAPI_OK && iError != PAPI_ECNFLCT)
    {
        /* proper error handling */
    }
}

/* Add one write-CAS event per memory controller channel (imc0..imc3) */
for (i = 0; i < 4; i++)
{
    sprintf(sIMCCounter, "snbep_unc_imc%d::UNC_M_CAS_COUNT:WR:e=0:i=0:t=0", i);
    iError = PAPI_add_named_event(*pMaske, sIMCCounter);
    if (iError != PAPI_OK && iError != PAPI_ECNFLCT)
    {
        /* proper error handling */
    }
}


According to the Intel documentation, the memory bandwidth can be measured as:
Code:
Memory Read BW [MBytes/s] = 1.0E-06*(snbep_unc_imc0::UNC_M_CAS_COUNT:RD + snbep_unc_imc1::UNC_M_CAS_COUNT:RD + snbep_unc_imc2::UNC_M_CAS_COUNT:RD + snbep_unc_imc3::UNC_M_CAS_COUNT:RD)*64.0/time
Memory Write BW [MBytes/s] = 1.0E-06*(snbep_unc_imc0::UNC_M_CAS_COUNT:WR + snbep_unc_imc1::UNC_M_CAS_COUNT:WR + snbep_unc_imc2::UNC_M_CAS_COUNT:WR + snbep_unc_imc3::UNC_M_CAS_COUNT:WR)*64.0/time
Memory BW [MBytes/s] = Memory Read BW [MBytes/s] + Memory Write BW [MBytes/s]
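
For reference, the formula above can be written as a small C helper. This is only a sketch: the function name `cas_to_mbytes_per_s` and its parameters are made up for illustration, and it assumes, as the formula does, that every CAS event transfers one 64-byte cache line.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of PAPI or likwid): converts per-channel
 * CAS counts into MBytes/s, assuming 64 bytes per CAS as in the formula. */
double cas_to_mbytes_per_s(const long long *cas, size_t channels,
                           double seconds)
{
    double total = 0.0;
    size_t c;

    assert(seconds > 0.0);              /* elapsed wall-clock time */
    for (c = 0; c < channels; c++)
        total += (double) cas[c];       /* sum over all IMC channels */

    return 1.0e-06 * total * 64.0 / seconds;
}
```

The read and write bandwidths would each come from one call (one with the four RD counts, one with the four WR counts), and the total is their sum.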


Using this, I get an enormous memory bandwidth that cannot be real.
To validate the results, I compared them with the ones obtained with likwid:
Code:
+-----------------------+-------------+
|         Event         |   core 16   |
+-----------------------+-------------+
|     CAS_COUNT_RD      | 3.17476e+07 |
|     CAS_COUNT_WR      | 4.49026e+08 |
|     CAS_COUNT_RD      | 5.67655e+07 |
|     CAS_COUNT_WR      | 4.48882e+08 |
|     CAS_COUNT_RD      | 5.8196e+07  |
|     CAS_COUNT_WR      | 4.49182e+08 |
|     CAS_COUNT_RD      | 5.76525e+07 |
|     CAS_COUNT_WR      | 4.48991e+08 |
+-----------------------+-------------+

+----------------------------+---------+
| Memory Read BW [MBytes/s]  | 469.748 |
| Memory Write BW [MBytes/s] | 4128.49 |
| Memory BW [MBytes/s]       | 4598.24 |
+----------------------------+---------+
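
As a consistency check on the likwid output, the bandwidth formula can be inverted to recover the measurement time from the counts and the reported bandwidth. A rough sketch in C, where the helper name `implied_seconds` is hypothetical and the numbers are copied from the table above:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: time = 1.0E-06 * sum(CAS) * 64 / bandwidth */
double implied_seconds(const double *cas, size_t channels, double mbytes_per_s)
{
    double total = 0.0;
    size_t c;

    assert(mbytes_per_s > 0.0);
    for (c = 0; c < channels; c++)
        total += cas[c];                /* sum over all channels */

    return 1.0e-06 * total * 64.0 / mbytes_per_s;
}

/* likwid counts from the table above, in row order */
const double likwid_rd[4] = { 3.17476e+07, 5.67655e+07, 5.8196e+07, 5.76525e+07 };
const double likwid_wr[4] = { 4.49026e+08, 4.48882e+08, 4.49182e+08, 4.48991e+08 };
```

Both `implied_seconds(likwid_rd, 4, 469.748)` and `implied_seconds(likwid_wr, 4, 4128.49)` come out at roughly 27.8 seconds, so the likwid counts and bandwidths are at least consistent with each other.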


With PAPI I get:
Code:
CAS0_R=15637948602, CAS1_R=15652951072, CAS2_R=15655992784, CAS3_R=15619746407
CAS0_W=15663920078, CAS1_W=15654867324, CAS2_W=15671126429, CAS3_W=15706756602


The strange thing here is that the CAS_R counters are about 300 times bigger than the ones obtained with likwid, and the CAS_W counters are almost 40 times bigger. The CAS_R and CAS_W counters are all in the same range, giving a "constant" RD_Bandwidth and RW_Bandwidth (it should be in my example).
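
To put numbers on that comparison, here is a minimal sketch in C that computes how many times larger a PAPI count is than the corresponding likwid count. The name `count_ratio` is hypothetical, and pairing the i-th PAPI counter with the i-th likwid row is an assumption.

```c
#include <assert.h>

/* Hypothetical helper: how many times larger the PAPI count is */
double count_ratio(long long papi_count, double likwid_count)
{
    assert(likwid_count > 0.0);
    return (double) papi_count / likwid_count;
}
```

With the values quoted above, `count_ratio(15652951072LL, 5.67655e+07)` is about 276 for a read counter and `count_ratio(15663920078LL, 4.49026e+08)` is about 35 for a write counter. The first read pair, `count_ratio(15637948602LL, 3.17476e+07)`, is larger at about 493.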

The code I'm profiling is:
Code:
for (j = 0; j < TIMES; j++)
{
    pValues = (long int *) malloc(liSize * sizeof(long int));
    if (!pValues)
    {
        /* Error handling */
    }

    /* kernel W: write every element once */
    for (i = 0; i < liSize; i++) pValues[i] = i;

    free((void *) pValues);
    pValues = NULL;
} /* for */
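
A back-of-the-envelope estimate of the write traffic this kernel should generate can help judge which counter values are plausible. The sketch below assumes every element reaches memory exactly once per outer iteration and ignores caching, write-allocate reads, and any page zeroing by the operating system, so it is only a rough reference. `expected_write_cas` is a hypothetical name, and the values of `TIMES` and `liSize` are not given here.

```c
#include <assert.h>

/* Hypothetical estimate: number of 64-byte lines written by kernel W */
double expected_write_cas(long times, long size, long elem_bytes)
{
    assert(times >= 0 && size >= 0 && elem_bytes > 0);
    return (double) times * (double) size * (double) elem_bytes / 64.0;
}
```

For the loop above this would be called as `expected_write_cas(TIMES, liSize, sizeof(long int))`. As an illustration with made-up sizes, 10 iterations over 1,000,000 elements of 8 bytes give 1,250,000 expected write CAS events.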


Can anybody explain what is happening and how I can measure the memory bandwidth of an application using PAPI?

Thanks in advance!!
Carmen