Profiling / Benchmarking code
You have a piece of code, its too slow so you want to optimize it, but to do that you first need to be able to meassure its speed, otherwise you wont know if a change improved speed or not
Benchmarking a piece of code sounds easy but it isnt, you can try to do it with various standard “get the current time” functions but these tend to be very inaccurate, canned tools like gprof are an option too, but in my experience they are also inaccurate and they can produce highly misleading results due to function inlining. A more accurate method on the x86 architecture is to use rdtsc
So now we have a accurate method to get the current “time” but what now? simply running the code once and getting the time before and afterwards? This would be a very bad choice as the first time a piece of code is executed its not in the code cache, the data isnt in the data cache, branch prediction has not seen the code yet, … so we could ignore the first few iterations, but still a single measurement isnt very trustworthy, next improvement would be to average many interations but still its very sensitive, just think about what happens when the kernel decides to stop your program and give the cpu to your mp3 player, its like having a interation which needs 50 times longer then the average assuming the code between the rdtsc instructions doesnt take too long. Solving this is easy though, just discard such outliers which need many times the average (see the START/STOP_TIMER macros in ffmpeg/libavutil/common.h)
Hi,
Hey, thanks Michael for the tip. Accurate timing is indeed important. :-P
Having a look at ffmpeg/libavutil/common.h and how those macros are use in snow.c, it appeared to me that those macros are to be used inside an inner loop in order to collect many timing informations, and discard the outliers.
Like in:
static void spatial_decompose53i(DWTELEM *buffer, int width, int height, int stride){
int y;
DWTELEM *b0= buffer + mirror(-2-1, height-1)*stride;
DWTELEM *b1= buffer + mirror(-2 , height-1)*stride;
for(y=-2; y
Comment by Guillaume POIRIER — 2005-12-20 @ 15:47
Woops, looks like my previous message was truncated…
What I was trying to get at is that in spatial_decompose53i(), START/STOP_TIMER macros are used inside a loop (therefore there are many calls of the timers, and many measure can be taken, and the outliers can be discarded), whereas in spatial_compose53i_dy_buffered(), it looks to me like there is only one call to the START/STOP_TIMER macros, therefore it doesn’t appear that the mesure “carries as much weight”…. or am I just all wrong, and spatial_compose53i_dy_buffered() is called a “large enough number of times”, so that the mesure is still pretty accurate? ^_^
Comment by Guillaume POIRIER — 2005-12-20 @ 16:56
START/STOP_TIMER store statistics in static variables so if there are in a function and the function is called often then thats as fine as if they are inside a loop which runs similarely many iterations
furthermore the time between START and STOP_TIMER must not be too large otherwise the outlier removial wont work (100 cpu cycles for a loop and 100100 if a task switch happens vs. 1000000 per loop and 1100000 for the task switch)
the code should be executed in a as realistic environment as possible (a rdtsc after every single instruction or artificial test data can be bad)
Comment by Michael — 2005-12-21 @ 23:43
another thing which must be kept in mind is that the outlier removial will cause very inaccurate measurements if the code sometimes needs much longer then on average, for example a single wavelet transform pass which is executed on 512×512 256×256 128×128 and so on
Comment by Michael — 2005-12-21 @ 23:47
Any experience/comments about oprofiler. I’ve seen that 2.6. kernels include support for it, but haven’t tried it yet. (My guess is that it will at least know for task-switching).
Comment by Ivan Kalvachev — 2006-01-03 @ 20:35
i tried oprofile sometime ago but failed to get it working …
Comment by Michael — 2006-01-03 @ 22:29