Benchmarking Krita performance

Tile engine

Data Manager

Test image dimensions: 4096x4096, RGB
I ran every test several times and picked the result that recurred most often.
The callgrind backend did not produce callgrind.* files, so I used valgrind directly; as a result, the profiles also cover the Qt test library itself.
http://lukast.mediablog.sk/callgrind/DatamanagerBenchmarks.tar.gz
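
Picking the result that recurs most often can be sketched as taking the mode of the rounded timings. This is only an illustration: the helper name and the sample timings are made up, and the selection for the tables on this page was done by hand.

```python
from collections import Counter

def most_frequent_result(timings_msec, digits=0):
    """Return the rounded timing that occurs most often (hypothetical helper)."""
    rounded = [round(t, digits) for t in timings_msec]
    value, _count = Counter(rounded).most_common(1)[0]
    return value

# Made-up walltimes (msec) from five runs of the same benchmark.
runs = [38.0, 38.2, 41.7, 37.9, 45.1]
print(most_frequent_result(runs))
```

Rounding before counting matters, because raw timings almost never repeat exactly.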

benchmark name	walltime	tickcounter	Mb/s
benchmarkWriteBytes	38.0 msec per iteration (total: 380, iterations: 10)	77,528,468.2 ticks per iteration (total: 775284683, iterations: 10)	1333.3 Mb/s
benchmarkReadBytes	39.3 msec per iteration (total: 394, iterations: 10)	77,311,910.2 ticks per iteration (total: 773119103, iterations: 10)	1628.4 Mb/s
benchmarkReadWriteBytes	46.2 msec per iteration (total: 462, iterations: 10)	91,198,881.7 ticks per iteration (total: 911988817, iterations: 10)	1391.3 Mb/s
benchmarkExtent	0.00020 msec per iteration (total: 34, iterations: 163840)	735.0 ticks per iteration (total: 7350, iterations: 10)	N/A
benchmarkClear	1.3 msec per iteration (total: 26, iterations: 20)	2,542,070.2 ticks per iteration (total: 25420702, iterations: 10)	N/A

Iterators

Horizontal Iterator

Image used: 4096x4096, RGBA colorspace, 8 bits per channel (64Mb)
http://lukast.mediablog.sk/callgrind/HLineBenchmarks.tar.gz

benchmark name	walltime	tickcounter	Mb/s
benchmarkWriteBytes	1,383.4 msec per iteration (total: 13834, iterations: 10)	4,389,801,089.3 ticks per iteration (total: 43898010893, iterations: 10)	46.3 Mb/s
benchmarkReadBytes	1,443.2 msec per iteration (total: 14433, iterations: 10)	4,461,418,645.5 ticks per iteration (total: 44614186455, iterations: 10)	44.4 Mb/s
benchmarkConstReadBytes	1,380.7 msec per iteration (total: 13808, iterations: 10)	4,501,257,062.3 ticks per iteration (total: 45012570623, iterations: 10)	46.3 Mb/s
benchmarkReadWriteBytes	2,041.7 msec per iteration (total: 20418, iterations: 10)	5,736,531,494.3 ticks per iteration (total: 57365314943, iterations: 10)	31.3 Mb/s
benchmarkNoMemCpy	655.7 msec per iteration (total: 6557, iterations: 10)	3,025,535,970.6 ticks per iteration (total: 30255359707, iterations: 10)	97.7 Mb/s
benchmarkConstNoMemCpy	583.7 msec per iteration (total: 5837, iterations: 10)	2,889,942,765.8 ticks per iteration (total: 28899427658, iterations: 10)	109.6 Mb/s
benchmarkTwoIteratorsNoMemCpy	1,205.7 msec per iteration (total: 12057, iterations: 10)	3,952,530,421.5 ticks per iteration (total: 39525304215, iterations: 10)	53.1 Mb/s
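
The Mb/s column appears to be the image size divided by the walltime per iteration. A minimal sketch under that assumption (the function names are hypothetical, and "Mb" is taken to mean 1024*1024 bytes):

```python
def image_size_mb(width, height, channels=4, bytes_per_channel=1):
    """Size of an uncompressed image in units of 1024*1024 bytes."""
    return width * height * channels * bytes_per_channel / (1024 * 1024)

def throughput_mb_per_s(size_mb, walltime_msec):
    """Data processed per second, given the walltime of one iteration."""
    return size_mb / (walltime_msec / 1000.0)

size = image_size_mb(4096, 4096)  # RGBA, 8 bits per channel
print(size)                                             # 64.0
print(round(throughput_mb_per_s(size, 1383.4), 1))      # 46.3, as for benchmarkWriteBytes
```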

Update, state: trunk, 17 Feb 2010 15:38

benchmark name	walltime	Mb/s
benchmarkWriteBytes	1,548.0 msec per iteration (total: 15481, iterations: 10)	41.34 Mb/s
benchmarkReadBytes	3,087.8 msec per iteration (total: 30878, iterations: 10)	20.73 Mb/s
benchmarkConstReadBytes	3,062.0 msec per iteration (total: 30620, iterations: 10)	20.90 Mb/s
benchmarkReadWriteBytes	3,725.0 msec per iteration (total: 37251, iterations: 10)	17.18 Mb/s
benchmarkNoMemCpy	2,264.4 msec per iteration (total: 22644, iterations: 10)	28.26 Mb/s
benchmarkConstNoMemCpy	2,316.8 msec per iteration (total: 23168, iterations: 10)	27.62 Mb/s
benchmarkTwoIteratorsNoMemCpy	2,950.0 msec per iteration (total: 29501, iterations: 10)	21.69 Mb/s

state: caching patch applied to trunk

benchmark name	walltime	Mb/s
benchmarkWriteBytes	1,211.4 msec per iteration (total: 12114, iterations: 10)	52.83 Mb/s (speedup 1.28)
benchmarkReadBytes	1,196.2 msec per iteration (total: 11962, iterations: 10)	53.50 Mb/s (speedup 2.58)
benchmarkConstReadBytes	1,202.2 msec per iteration (total: 12022, iterations: 10)	53.24 Mb/s (speedup 2.55)
benchmarkReadWriteBytes	1,563.0 msec per iteration (total: 15631, iterations: 10)	40.95 Mb/s (speedup 2.38)
benchmarkNoMemCpy	389.1 msec per iteration (total: 3891, iterations: 10)	164.48 Mb/s (speedup 5.82)
benchmarkConstNoMemCpy	372.5 msec per iteration (total: 3725, iterations: 10)	171.81 Mb/s (speedup 6.21)
benchmarkTwoIteratorsNoMemCpy	670.3 msec per iteration (total: 6704, iterations: 10)	95.48 Mb/s (speedup 4.4)
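
The speedup figures are consistent with dividing the trunk walltime by the walltime with the caching patch applied. A small sketch of that calculation for three of the rows (the function and variable names are hypothetical):

```python
def speedup(before_msec, after_msec):
    """How many times faster the patched run is compared to the baseline."""
    return before_msec / after_msec

# Walltimes per iteration (msec) copied from the two tables above.
trunk_msec = {
    "benchmarkWriteBytes": 1548.0,
    "benchmarkReadBytes": 3087.8,
    "benchmarkNoMemCpy": 2264.4,
}
patched_msec = {
    "benchmarkWriteBytes": 1211.4,
    "benchmarkReadBytes": 1196.2,
    "benchmarkNoMemCpy": 389.1,
}

for name in trunk_msec:
    print(name, round(speedup(trunk_msec[name], patched_msec[name]), 2))
```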

Vertical Iterator

Image used: 4096x4096, RGBA colorspace, 8 bits per channel (64Mb)
http://www.valdyas.org/~lukast/VLineIteratorBenchmarks.tar.gz

benchmark name	walltime	tickcounter	Mb/s
benchmarkWriteBytes	1,541.9 msec per iteration (total: 15419, iterations: 10)	Not measured	41.52 Mb/s
benchmarkReadBytes	1,534.4 msec per iteration (total: 15344, iterations: 10)	Not measured	41.7 Mb/s
benchmarkConstReadBytes	1,460.5 msec per iteration (total: 14606, iterations: 10)	Not measured	43.82 Mb/s
benchmarkReadWriteBytes	2,156.3 msec per iteration (total: 21563, iterations: 10)	Not measured	29.7 Mb/s
benchmarkNoMemCpy	649.0 msec per iteration (total: 6490, iterations: 10)	Not measured	98.6 Mb/s
benchmarkConstNoMemCpy	599.3 msec per iteration (total: 5994, iterations: 10)	Not measured	106.7 Mb/s
benchmarkTwoIteratorsNoMemCpy	1,231.5 msec per iteration (total: 12316, iterations: 10)	Not measured	52 Mb/s

Rectangular Iterator

Image used: 4096x4096, RGBA colorspace, 8 bits per channel (64Mb)
http://valdyas.org/~lukast/RectIteratorBenchmarks.tar.gz

benchmark name	walltime	Mb/s
benchmarkWriteBytes	118.2 msec per iteration (total: 1182, iterations: 10)	541.4 Mb/s
benchmarkReadBytes	121.7 msec per iteration (total: 1217, iterations: 10)	525.9 Mb/s
benchmarkConstReadBytes	120.5 msec per iteration (total: 1205, iterations: 10)	533.3 Mb/s
benchmarkReadWriteBytes	167.0 msec per iteration (total: 1670, iterations: 10)	383.2 Mb/s
benchmarkNoMemCpy	35.7 msec per iteration (total: 358, iterations: 10)	1792.7 Mb/s
benchmarkConstNoMemCpy	37.7 msec per iteration (total: 377, iterations: 10)	1697.6 Mb/s
benchmarkTwoIteratorsNoMemCpy	65.2 msec per iteration (total: 652, iterations: 10)	981.6 Mb/s

Random Iterator

Image used: 4096x4096, RGBA colorspace, 8 bits per channel (64Mb)
http://lukast.mediablog.sk/callgrind/RandomIterBenchmarks.tar.gz

benchmark name	walltime	Mb/s
benchmarkWriteBytes	1,641.5 msec per iteration (total: 16415, iterations: 10)	39.0 Mb/s
benchmarkReadBytes	1,598.5 msec per iteration (total: 15985, iterations: 10)	40.0 Mb/s
benchmarkConstReadBytes	1,654.5 msec per iteration (total: 16545, iterations: 10)	38.68 Mb/s
benchmarkReadWriteBytes	2,934.8 msec per iteration (total: 29348, iterations: 10)	21.8 Mb/s
benchmarkNoMemCpy	971.3 msec per iteration (total: 9714, iterations: 10)	65.9 Mb/s
benchmarkConstNoMemCpy	938.6 msec per iteration (total: 9386, iterations: 10)	68.2 Mb/s
benchmarkTwoIteratorsNoMemCpy	1,929.7 msec per iteration (total: 19298, iterations: 10)	33.2 Mb/s
benchmarkTileByTileWrite	1,310.0 msec per iteration (total: 13101, iterations: 10)	48.9 Mb/s
benchmarkTotalRandom	27,999 msec per iteration (total: 27999, iterations: 1)	2.2 Mb/s
benchmarkTotalRandomConst	29,124 msec per iteration (total: 29124, iterations: 1)	2.2 Mb/s

KisPainter

Composition (bitBlt)

Image used: 4096x4096, RGBA colorspace, 8 bits per channel (64Mb)
Two images are composited 20 times in a loop, with and without selections.
http://lukast.mediablog.sk/callgrind/KisPainterBenchmarks.tar.gz

benchmark name	walltime	Mb/s
benchmarkBitBlt	5,456.8 msec per iteration (total: 54569, iterations: 10)	234.6 Mb/s
benchmarkBitBltSelection	5,922.8 msec per iteration (total: 59228, iterations: 10)	216.1 Mb/s
benchmarkFixedBitBlt	3,635.5 msec per iteration (total: 36356, iterations: 10)	352.1 Mb/s
benchmarkFixedBitBltSelection	5,342.1 msec per iteration (total: 53421, iterations: 10)	239.6 Mb/s
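
The Mb/s values in this table only add up if all 20 compositions count towards the data processed in one iteration. A quick check under that assumption (the constant and function names are hypothetical):

```python
IMAGE_MB = 4096 * 4096 * 4 / (1024 * 1024)  # 64.0 for RGBA, 8 bits per channel
COMPOSITES_PER_ITERATION = 20               # from the description above

def bitblt_throughput(walltime_msec):
    """Throughput when one iteration composites the image 20 times."""
    return COMPOSITES_PER_ITERATION * IMAGE_MB / (walltime_msec / 1000.0)

print(round(bitblt_throughput(5456.8), 1))  # 234.6, as for benchmarkBitBlt
print(round(bitblt_throughput(3635.5), 1))  # 352.1, as for benchmarkFixedBitBlt
```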

Filters

Brightness/Contrast

A random image is generated at 3274x2067, RGBA 8-bit (the dimensions of pippin's test image) (25.82 Mb)
The curve is linear (0.0 - 1.0)
http://lukast.mediablog.sk/callgrind/BContrastBenchmark.tar.gz

benchmark name	walltime	Mb/s
benchmarkFilter	1,783.5 msec per iteration (total: 17835, iterations: 10)	14.47 Mb/s
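
With a linear curve from 0.0 to 1.0 the filter should leave pixel values unchanged, so the benchmark mostly measures the cost of running the filter rather than any visible change. The sketch below illustrates this with a lookup table for 8-bit channels; it is not Krita's implementation, and the function name is hypothetical.

```python
def build_lut(curve, levels=256):
    """Sample a curve defined on [0.0, 1.0] into a table of 8-bit output values."""
    top = levels - 1
    lut = []
    for i in range(levels):
        value = round(curve(i / top) * top)
        lut.append(min(top, max(0, value)))  # clamp to the valid range
    return lut

def linear(x):
    """The linear (0.0 - 1.0) curve: output equals input."""
    return x

lut = build_lut(linear)
pixels = bytes([0, 64, 128, 255])
filtered = bytes(lut[p] for p in pixels)
print(filtered == pixels)
```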

Blur

A random image is generated at 3274x2067, RGBA 8-bit (the dimensions of pippin's test image) (25.82 Mb)
The default settings are used for blur, which means a 5x5 convolution
http://lukast.mediablog.sk/callgrind/blurBenchmark.tar.gz

benchmark name	walltime	Mb/s
benchmarkFilter	31,674 msec per iteration (total: 31674, iterations: 1)	0.81 Mb/s
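
To give an idea of the work behind a 5x5 convolution: every output value needs 25 multiply-adds per channel. The sketch below works on a single-channel image with a uniform box kernel and clamped edges. Both the kernel weights and the edge handling are assumptions made for illustration; they are not taken from the blur filter.

```python
def convolve5x5(image, kernel):
    """Convolve a 2D list of floats with a 5x5 kernel, clamping at the edges."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for ky in range(5):
                for kx in range(5):
                    # Clamp source coordinates so border pixels are repeated.
                    sy = min(max(y + ky - 2, 0), height - 1)
                    sx = min(max(x + kx - 2, 0), width - 1)
                    acc += image[sy][sx] * kernel[ky][kx]
            out[y][x] = acc
    return out

box_kernel = [[1.0 / 25.0] * 5 for _ in range(5)]  # assumed uniform weights

# A single bright pixel is spread evenly over its 5x5 neighbourhood.
impulse = [[0.0] * 7 for _ in range(7)]
impulse[3][3] = 25.0
blurred = convolve5x5(impulse, box_kernel)
print(blurred[3][3], blurred[1][1], blurred[0][0])
```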

Projection

We load a 1000x753, 100 dpi image in Krita's native format, containing all types of layers (group, effect, adjustment, ...).
The projection is computed by refreshGraph().
We save the image in Krita's native format again.
http://lukast.mediablog.sk/callgrind/ProjectionBenchmark.tar.gz

Everything is benchmarked in one go.

benchmark name	walltime	Mb/s
benchmarkProjection	834.6 msec per iteration (total: 8346, iterations: 10)	N/A

Painting strokes

We paint on an empty 4096x4096 paint device.
The brush used is the 70px pixelbrush, autobrush (the default one).
The benchmark can run with any paintop; only the preset needs to be changed.
The first test paints the stroke shown in the preview box, at a different scale, on the 4096x4096px image.
The second test paints 20 random lines (the same 20 lines in every run) with varying pressure (from 0.0 to 1.0).
http://lukast.mediablog.sk/callgrind/strokeBenchmarks.tar.gz [TODO add bouds result]

benchmark name	walltime	Mb/s
benchmarkStroke	2,962 msec per iteration (total: 2962, iterations: 1)	N/A
benchmarkRandomLines	18,576 msec per iteration (total: 18576, iterations: 1)	N/A
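
Getting the same 20 random lines in every run comes down to seeding the random number generator with a fixed value. A minimal sketch of the idea, where the seed, the function names and the linear pressure ramp are all assumptions and not the benchmark's actual code:

```python
import random

def make_random_lines(count=20, size=4096, seed=12345):
    """Return the same list of (start, end) points for a given seed."""
    rng = random.Random(seed)  # private generator, unaffected by other code
    lines = []
    for _ in range(count):
        start = (rng.uniform(0, size), rng.uniform(0, size))
        end = (rng.uniform(0, size), rng.uniform(0, size))
        lines.append((start, end))
    return lines

def pressure_at(t):
    """Assumed pressure profile: ramps linearly from 0.0 to 1.0 along a line."""
    return min(1.0, max(0.0, t))

lines = make_random_lines()
print(len(lines), lines == make_random_lines())
```

Reusing the same lines keeps the painted area and the number of dabs comparable between runs, so differences in walltime reflect code changes rather than different input.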

First results

Computer specification

CPU: Intel(R) Core(TM)2 Duo CPU P7350 @2.00GHz ( details http://ark.intel.com/Product.aspx?id=36750&code=p7350 )
RAM: 2 GB
Graphics: NVidia 9200M
Fedora 12 i686 (32 bit version), KDE4.4 RC2, Qt 4.6.1

Compiler options

gcc -Wnon-virtual-dtor -Wno-long-long -ansi -Wundef -Wcast-align -Wchar-subscripts -Wall -W -Wpointer-arith -Wformat-security -fno-exceptions -DQT_NO_EXCEPTIONS -fno-check-new -fno-common -Woverloaded-virtual -fno-threadsafe-statics -fvisibility=hidden -fvisibility-inlines-hidden -O2 -g -fPIC -Wl,--enable-new-dtags

The CMake configuration has an option called KritaDevs, which is what I used for benchmarking. The flags above were obtained with make VERBOSE=1.

First optimizations

With performance fix + FastMath::atan2

benchmark name	walltime	Mb/s
benchmarkStroke	650.2 msec per iteration (total: 6503, iterations: 10)	N/A
benchmarkRandomLines	4,158.8 msec per iteration (total: 41589, iterations: 10)	N/A

Cyrille's tuning commits around lunch

benchmark name	walltime	Mb/s
benchmarkStroke	533.3 msec per iteration (total: 5334, iterations: 10)	N/A
benchmarkRandomLines	3,555.5 msec per iteration (total: 35556, iterations: 10)	N/A

Just with performance fix

benchmark name	walltime	Mb/s
benchmarkStroke	683.7 msec per iteration (total: 6838, iterations: 10)	N/A
benchmarkRandomLines	4,696.3 msec per iteration (total: 46964, iterations: 10)	N/A

Compute 1/4 for the symmetrical brushes

benchmark name	walltime	Mb/s
benchmarkStroke	257.3 msec per iteration (total: 2574, iterations: 10)	N/A
benchmarkRandomLines	1,449.2 msec per iteration (total: 14492, iterations: 10)	N/A
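
"Compute 1/4" presumably means evaluating only one quadrant of the brush mask and mirroring it into the other three, which works whenever the mask is symmetric about both axes. The sketch below shows the idea with a simple linear falloff; the falloff formula and the function names are assumptions, not the autobrush code.

```python
import math

def mask_value(dx, dy, radius):
    """Assumed brush falloff: 1.0 at the centre, 0.0 at the radius and beyond."""
    return max(0.0, 1.0 - math.hypot(dx, dy) / radius)

def full_mask(size):
    """Reference version: evaluate the falloff for every pixel."""
    centre, radius = (size - 1) / 2, size / 2
    return [[mask_value(x - centre, y - centre, radius) for x in range(size)]
            for y in range(size)]

def quarter_mirrored_mask(size):
    """Evaluate the top-left quadrant only and mirror it into the other three."""
    centre, radius = (size - 1) / 2, size / 2
    half = (size + 1) // 2  # includes the middle row/column for odd sizes
    mask = [[0.0] * size for _ in range(size)]
    for y in range(half):
        for x in range(half):
            value = mask_value(x - centre, y - centre, radius)
            mask[y][x] = value
            mask[y][size - 1 - x] = value
            mask[size - 1 - y][x] = value
            mask[size - 1 - y][size - 1 - x] = value
    return mask

reference = full_mask(70)
mirrored = quarter_mirrored_mask(70)
print(all(abs(a - b) < 1e-12
          for row_a, row_b in zip(reference, mirrored)
          for a, b in zip(row_a, row_b)))
```

For a 70px brush this evaluates the falloff 35*35 = 1225 times instead of 4900 times.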