Poor Man Profiler
Last week I made a small profiler for MS VC++. Using it, you can locate hotspots in your programs and measure the effect of your optimization efforts.
First, you should include “profiler.h” in your source file. Then, you insert BEGIN_FUNC and END_FUNC into each function you want to profile:
void Test() { BEGIN_FUNC; // Do some work Sleep(50); END_FUNC; }
It may seem a lot of work, but in real programs you don’t have to put these lines in every function you write. Usually, there are 5-10 functions that account for 80% of execution time. Don’t use BEGIN_FUNC and END_FUNC for initialization routines, error handling code, and other O(1) stuff, but insert them into functions running O(N) and O(N2) loops.
The next step is to modify your main function. Insert START_PROFILING before calling any of profiled functions, and put STOP_PROFILING at the end:
int main() {
START_PROFILING;
Test();
Test2();
STOP_PROFILING;
Just like with any other profiler, you should prepare some test data that are correct and representative. The program should preferably do its work without any user interaction, because you will have to run it many times when trying different optimization approaches, and it will be tiresome to enter the same data again and again.
After calling STOP_PROFILING, use the following code to print the results (NUM_OF is a macro returning the number of elements in the array):
#ifdef PROFILE TCHAR buff[4096]; g_profiler.EndAndPrintResults(buff, NUM_OF(buff)); printf(buff); #endif return 0; }
Here is a sample output of the program:
Function name Total time Self time Calls ================================================================================ Test 394 ( 0%) 394 ( 0%) 1 Test1 982414326 (99%) 608406677 (61%) 2 Test2 374007649 (38%) 374007649 (38%) 10
Self time is the most important column. In this (artificial) example, Test1 takes 61% of overall execution time, so it’s the main candidate for optimization. Total time shows the time spend in the function and its children, and Calls show the number of calls to the function (Test1 was called 2 times, and Test2 was called 10 times).
The profiler can use QueryPerformanceCounter or RDTSC (to switch between them, define or undefine USE_RDTSC in profiler.h). With QueryPerformanceCounter, the time is reported in milliseconds (e.g., 1650 means 1.65 seconds); RDTSC cause the time to be reported in clock cycles (e.g., if you have an 1 GHz processor, 1 650 000 000 means 1.65 seconds). While RDTSC counter is more accurate than QueryPerformanceCounter, the large numbers may be hard to read. Also, if your processor supports SpeedStep or Cool’n'Quiet technology, you should turn them off in Control Panel to prevent the processor from changing clock frequency. Neither is QueryPerformanceCounter free from disadvantages: it has a long latency, returns incorrect results on Cool’n'Quiet processors and some chipsets.
The profiler uses sprintf function to output the results into a string. If you wish to get rid of the CRT library, you can switch to wsprintf (undefine USE_STD_LIB in profiler.h), but then you should use QueryPerformanceCounter, not RDTSC. A clock count from RDTSC is 64 bit wide, and wsprintf is unable to print 64-bit numbers on Win32, so you will see a negative time or other nonsense on the screen. The profiler warns you about this mistake; you can turn off the warning if you want. Percentages will be still valid, so if you need not the number of clock cycles, you don’t have to worry about the warning.
If the profiled functions reside in several source files, then you should define FILE_ID before including profiler.h:
#define FILE_ID 0 #include "profiler.h"
The first file should have FILE_ID set to 0, the second one should be 1, the third should be 2, etc.
The file profiler.h defines the maximum number of counters per one file, the total maximum number of counters, and the maximum stack depth. The recipe for their usage is: set these limits to some large values (say, 1024 functions), then forget about them. Stack depth in non-recursive programs cannot be larger than the maximum number of functions; if you use recursion, please increase it. BEGIN_FUNC and END_FUNC returns ERR_FAILED if you have exceeded some limit.
And here is what I consider the main feature. If you don’t define PROFILE, all profiling code will be removed from your program. So you can create 3 configurations: Release, Debug, and Profile. Profile will be the copy of Release, but will have PROFILE defined (in Project Properties -> Configuration Properties -> C/C++ -> Preprocessor -> Preprocessor Definitions). In Release version, you code will remain small and clean.
I recently used the profiler to improve the performance of one small program. There were 7 functions that could be hotspots. One of them did multiplications inside two O(N) loops, while other functions did only O(N) additions and reads from memory. Certainly, I thought multiplications were slow.
But, as it turned out, it was some other function that took 69% of overall execution time, while other functions took only 5% each. If I had not used the profiler, I would have spent several hours optimizing the wrong function. Say, if I made it 70% faster, the program’s performance would be improved only by 0.05 * 0.7 = 3.5%.
The real slow function had a problem with memory access pattern; I quickly rewrote it and got 15% improvement in overall performance. Another refactoring added more speed, and the program is now 2 times faster than before. The profiler was an indispensable tool for this work; it helped me to identify the slow code and to measure my improvements.
Download Poor Man Profiler (31 KB)
Loved your blog but surprised no one else is here. Keep up the good work.
Hope to visit again,
Comment by SmartOxymoron — September 30, 2006 @ 1:53 amSOxy.