Opcode Tracker for the Python Byte-Code Interpreter
============================================================

by Perry A. Stoll
  <[email protected]>

  Created for Python 1.3, January 1996
  Updated for Python 1.4, January 1997


Motivation
---------------
There has always been interest (on the python newsgroup, at least) in
improving the performance of Python. Two possible strategies are
compiling the bytecode to machine-specific binaries or improving the
performance of the interpreter. Every so often someone comes up with a
clever idea which they think should make a difference in the
performance of the interpreter. Often, people step up and put their
code where their mouth is, implementing the suggested
change. Invariably, the call is sounded for a way to measure the
effect of such changes. That's where this opcode tracker fits in: it
helps identify where the Python interpreter is spending its time.


What
---------------
This distribution contains the necessary tools to add the ability to
profile bytecodes run by the interpreter and generate statistics about
them. The files included in this distribution are:

* README - this file
* INSTALL - instructions to install this package for Unix
* NEWS - the original comp.lang.python article which inspired this stuff.
* ceval.optrack.patch - a small context diff file used to patch
   Python/ceval.c, adding the ability to keep statistics (total count,
   time spent executing, execution chain distributions). To enable opcode
   tracking, you must then compile ceval.c with ENABLE_OPTRACKER defined.
* optracker.h - definitions and data structure for opcode tracker objects
* optrackermodule.c - a C module defining the optracker object and a few
   useful functions.
* opstats.py - modeled after the pstats module in the standard
   distribution, provides profiling statistics for opcode objects.
* optrackerscript.py - an example of using this package to profile the opcodes.
   If you have NumPy, it does some more fancy analysis.
* pystone.py - a benchmark script from the Python-1.4 distribution,
   distributed here for demonstration purposes. This is actually a
   very boring program to profile, mainly function calls. Oh, well.


Note that an optracker-capable interpreter must be built at compile
time; there is no run-time switch to enable tracking in a stock build.

Features:
----------------------------------------
Optracker provides a way to collect execution stats of the Python
interpreter by keeping track of frequency, time, and history of opcode
execution:

1) Counts opcode executions in the Python interpreter.
2) Times opcode execution (each opcode is charged until the next
   opcode is dispatched).
3) Tracks opcode-pair execution, i.e. digraph analysis. (Easily
   interfaces with the Numeric package for more interesting analysis!)
4) Requires only minimal changes to Python/ceval.c.
5) Can be completely removed with a compile-time option.
6) Reports stats by interfacing with the existing Stats class in the
    pstats.py file in the distribution.


What is Actually Being Profiled?
------------------------------------------------------------
There is a loop in the Python bytecode interpreter (the for loop around the
massive switch statement in ceval.c:eval_code2()). Each time through
this loop, the count for the appropriate opcode is incremented by
one. That's sufficient to keep raw usage levels for opcodes.
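The counting idea can be sketched in pure Python. This is not the
actual ceval.c patch; it is a rough modern analogue using the
sys.settrace opcode-tracing hook (Python 3.7+), which fires an event
per bytecode instruction much like the counter in the interpreter loop:

```python
import dis
import sys
from collections import Counter

counts = Counter()

def trace(frame, event, arg):
    # Ask the interpreter to report an event for every opcode in this
    # frame, not just for every source line.
    frame.f_trace_opcodes = True
    if event == "opcode":
        # f_lasti is the offset of the instruction about to execute;
        # look up its numeric opcode in the code object's bytecode.
        op = frame.f_code.co_code[frame.f_lasti]
        counts[dis.opname[op]] += 1
    return trace

def demo():
    total = 0
    for i in range(3):
        total += i
    return total

sys.settrace(trace)
demo()
sys.settrace(None)

# Raw usage levels, most frequent first.
for name, n in counts.most_common(5):
    print(name, n)
```

The original patch does the equivalent increment in C inside the
dispatch loop, so it avoids all of this tracing overhead.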

Gathering timing information for the opcodes is complicated by the
fact that the interpreter (eval_code2) is often called recursively. If
there is no recursive call, then the time for the executed opcode is
simply the time through the large switch statement. If there is a
recursive call, then the time for the first opcode accumulates until
the next opcode begins. This means that all the recursive function
call overhead is charged to the previous opcode.  Also, if an opcode
calls out to a C routine, i.e., an extension, all time is charged to
that opcode. Something like this showed up when profiling NumPy: I was
wondering why a BINARY_DIVIDE took about 6 seconds. Then I realized it
was a really large matrix operation. This whole scheme may not be the
best, but it doesn't seem too bad either. There are other strategies
worth exploring and I hope people will play around.
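The charge-to-previous-opcode rule can be shown on synthetic data.
A minimal sketch, assuming the tracker observes a list of
(opcode, timestamp) dispatch events; the names and numbers below are
made up for illustration:

```python
from collections import defaultdict

def charge_intervals(events):
    """Attribute each inter-event interval to the opcode that began it.

    `events` is a list of (opcode_name, timestamp) pairs in dispatch
    order. All time until the next dispatch -- including time spent in
    recursive eval_code2 calls or in C extensions -- is charged to the
    opcode dispatched first, mirroring the scheme described above.
    """
    totals = defaultdict(float)
    for (op, start), (_, end) in zip(events, events[1:]):
        totals[op] += end - start
    return dict(totals)

# A CALL_FUNCTION whose callee runs for 6 seconds is charged all 6
# seconds, just like the BINARY_DIVIDE surprise with NumPy.
events = [("LOAD_FAST", 0.0), ("CALL_FUNCTION", 0.1),
          ("STORE_FAST", 6.1), ("RETURN_VALUE", 6.2)]
print(charge_intervals(events))
```

Note that the last opcode dispatched is charged nothing, since there
is no following dispatch to close its interval.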


Sample Output:
------------------------------------------------------------
Here's the timing information I get when running the optrackerscript,
a sample script demonstrating optracking.  Note that where the output
says "function calls", it really means opcodes executed. Also, the
tottime (total time) and cumtime (cumulative time) columns will always
be the same.

I didn't understand until I'd finished a first implementation that
timing recursive calls differently (which makes sense for functions)
doesn't make much sense for opcodes. Well, maybe it does: because the
interpreter can recur, there is some sense of total time spent in one
opcode. But the worth of tracking that seemed pretty dubious at best.


        556223 function calls in 25.283 CPU seconds

  Ordered by: internal time
  List reduced from 55 to 20 due to restriction <20>

  ncalls  tottime  percall  cumtime  percall opcode:opcode id:(NAME)
  158150    4.504 0.000028    4.504 0.000028 opcode:124(LOAD_FAST)
   18185    4.061 0.000223    4.061 0.000223 opcode:131(CALL_FUNCTION)
   34298    2.711 0.000079    2.711 0.000079 opcode:105(LOAD_ATTR)
   25954    2.216 0.000085    2.216 0.000085 opcode:23(BINARY_ADD)
   83446    1.930 0.000023    1.930 0.000023 opcode:127(SET_LINENO)
   67165    1.774 0.000026    1.774 0.000026 opcode:125(STORE_FAST)
   24558    1.232 0.000050    1.232 0.000050 opcode:25(BINARY_SUBSCR)
   39470    0.867 0.000022    0.867 0.000022 opcode:100(LOAD_CONST)
    8660    0.835 0.000096    0.835 0.000096 opcode:95(STORE_ATTR)
      29    0.695 0.023970    0.695 0.023970 opcode:72(PRINT_NEWLINE)
      54    0.648 0.012003    0.648 0.012003 opcode:107(IMPORT_NAME)
    7028    0.454 0.000065    0.454 0.000065 opcode:116(LOAD_GLOBAL)
    6150    0.446 0.000072    0.446 0.000072 opcode:60(STORE_SUBSCR)
   13664    0.387 0.000028    0.387 0.000028 opcode:1(POP_TOP)
    1955    0.386 0.000197    0.386 0.000197 opcode:13(UNARY_CONVERT)
    8273    0.384 0.000046    0.384 0.000046 opcode:92(UNPACK_TUPLE)
   13496    0.325 0.000024    0.325 0.000024 opcode:111(JUMP_IF_FALSE)
    6229    0.266 0.000043    0.266 0.000043 opcode:24(BINARY_SUBTRACT)
    9785    0.250 0.000026    0.250 0.000026 opcode:83(RETURN_VALUE)
    9642    0.209 0.000022    0.209 0.000022 opcode:110(JUMP_FORWARD)

Using NumPy for further analysis...

       POP_TOP is followed by   JUMP_IF_FALSE 97.2%
       DUP_TOP is followed by    RETURN_VALUE 90.1%
     UNARY_NOT is followed by       LOAD_FAST 90%
 UNARY_CONVERT is followed by       LOAD_ATTR 90%
 BINARY_DIVIDE is followed by       LOAD_FAST 100%
         SLICE is followed by       LOAD_FAST 98.1%
        SLICE3 is followed by       LOAD_FAST 76.5%
  STORE_SUBSCR is followed by       LOAD_FAST 94.2%
 DELETE_SUBSCR is followed by       LOAD_FAST 100%
    PRINT_ITEM is followed by    RETURN_VALUE 92.2%
 PRINT_NEWLINE is followed by      PRINT_ITEM 89.7%
    BREAK_LOOP is followed by      SET_LINENO 100%
   LOAD_LOCALS is followed by      STORE_NAME 100%
  RETURN_VALUE is followed by      LOAD_CONST 76.5%
     EXEC_STMT is followed by       LOAD_FAST 100%
   END_FINALLY is followed by         POP_TOP 100%
   BUILD_CLASS is followed by    RETURN_VALUE 100%
   UNPACK_LIST is followed by   CALL_FUNCTION 100%
    STORE_ATTR is followed by       LOAD_FAST 98.8%
     LOAD_ATTR is followed by       LOAD_FAST 92.4%
   IMPORT_NAME is followed by      SET_LINENO 100%
   IMPORT_FROM is followed by     IMPORT_NAME 100%
  JUMP_IF_TRUE is followed by       LOAD_FAST 97.1%
      FOR_LOOP is followed by      SET_LINENO 100%
    SETUP_LOOP is followed by      SET_LINENO 100%
  SETUP_EXCEPT is followed by      SET_LINENO 100%
 SETUP_FINALLY is followed by      SET_LINENO 100%
 MAKE_FUNCTION is followed by      LOAD_CONST 100%
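The digraph table above does not actually require NumPy. A minimal
sketch of the same computation in plain Python, assuming the tracker's
pair counts are available as a dict keyed by (opcode, next_opcode);
the counts below are invented for illustration:

```python
from collections import defaultdict

def follower_percentages(pair_counts, threshold=75.0):
    """For each opcode, report its most common successor when that
    successor accounts for at least `threshold` percent of the pairs."""
    by_first = defaultdict(dict)
    for (first, second), n in pair_counts.items():
        by_first[first][second] = n
    report = []
    for first, followers in by_first.items():
        total = sum(followers.values())
        second, n = max(followers.items(), key=lambda kv: kv[1])
        pct = 100.0 * n / total
        if pct >= threshold:
            report.append((first, second, pct))
    return report

pairs = {("POP_TOP", "JUMP_IF_FALSE"): 972,
         ("POP_TOP", "LOAD_FAST"): 28,
         ("BREAK_LOOP", "SET_LINENO"): 12}
for first, second, pct in follower_percentages(pairs):
    print(f"{first:>14} is followed by {second:>15} {pct:.1f}%")
```

With NumPy, the same pair counts can be kept as a 256x256 matrix
indexed by opcode number, which makes this kind of analysis a couple
of array operations.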