Benchmarking Krita performance
Tile engine
Data Manager
- Image dimension for test 4096*4096, RGB
- I ran every test several times and kept the results that recurred most often
- The callgrind backend did not produce any callgrind.* files, so I ran valgrind directly; as a result the profiles also include the Qt test library itself
- http://lukast.mediablog.sk/callgrind/DatamanagerBenchmarks.tar.gz
benchmark name | walltime | tickcounter | Mb/s
benchmarkWriteBytes | 38.0 msec per iteration (total: 380, iterations: 10) | 77,528,468.2 ticks per iteration (total: 775284683, iterations: 10) | 1333.3 Mb/s
benchmarkReadBytes | 39.3 msec per iteration (total: 394, iterations: 10) | 77,311,910.2 ticks per iteration (total: 773119103, iterations: 10) | 1628.4 Mb/s
benchmarkReadWriteBytes | 46.2 msec per iteration (total: 462, iterations: 10) | 91,198,881.7 ticks per iteration (total: 911988817, iterations: 10) | 1391.3 Mb/s
benchmarkExtent | 0.00020 msec per iteration (total: 34, iterations: 163840) | 735.0 ticks per iteration (total: 7350, iterations: 10) | N/A
benchmarkClear | 1.3 msec per iteration (total: 26, iterations: 20) | 2,542,070.2 ticks per iteration (total: 25420702, iterations: 10) | N/A
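The Mb/s column appears to be derived from the walltime and the size of the test image. The sketch below shows that arithmetic under two assumptions that the notes do not state: each pixel occupies 4 bytes, and "Mb" means 1024*1024 bytes. With those assumptions a 4096*4096 image is 64 Mb, and 39.3 msec per iteration gives roughly the 1628.4 Mb/s listed for benchmarkReadBytes.

```python
# Assumptions (not stated in the benchmark notes): 4 bytes per pixel,
# and 1 Mb = 1024 * 1024 bytes.
WIDTH = 4096
HEIGHT = 4096
BYTES_PER_PIXEL = 4


def throughput_mb_per_s(walltime_msec):
    """Return the amount of image data processed per second, in Mb."""
    total_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
    megabytes = total_bytes / (1024 * 1024)
    seconds = walltime_msec / 1000.0
    return megabytes / seconds


if __name__ == "__main__":
    # Walltime of benchmarkReadBytes from the Data Manager table.
    print(f"{throughput_mb_per_s(39.3):.1f} Mb/s")
```

The same function can be applied to any walltime on this page whose benchmark touches the whole image exactly once per iteration.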
Iterators
Horizontal Iterator
benchmark name | walltime | tickcounter | Mb/s
benchmarkWriteBytes | 1,383.4 msec per iteration (total: 13834, iterations: 10) | 4,389,801,089.3 ticks per iteration (total: 43898010893, iterations: 10) | 46.3 Mb/s
benchmarkReadBytes | 1,443.2 msec per iteration (total: 14433, iterations: 10) | 4,461,418,645.5 ticks per iteration (total: 44614186455, iterations: 10) | 44.4 Mb/s
benchmarkConstReadBytes | 1,380.7 msec per iteration (total: 13808, iterations: 10) | 4,501,257,062.3 ticks per iteration (total: 45012570623, iterations: 10) | 46.3 Mb/s
benchmarkReadWriteBytes | 2,041.7 msec per iteration (total: 20418, iterations: 10) | 5,736,531,494.3 ticks per iteration (total: 57365314943, iterations: 10) | 31.3 Mb/s
benchmarkNoMemCpy | 655.7 msec per iteration (total: 6557, iterations: 10) | 3,025,535,970.6 ticks per iteration (total: 30255359707, iterations: 10) | 97.7 Mb/s
benchmarkConstNoMemCpy | 583.7 msec per iteration (total: 5837, iterations: 10) | 2,889,942,765.8 ticks per iteration (total: 28899427658, iterations: 10) | 109.6 Mb/s
benchmarkTwoIteratorsNoMemCpy | 1,205.7 msec per iteration (total: 12057, iterations: 10) | 3,952,530,421.5 ticks per iteration (total: 39525304215, iterations: 10) | 53.1 Mb/s
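The walltime and tickcounter cells keep the text that the Qt test library prints for each benchmark. When collecting many of these results, it can help to parse that text instead of copying numbers by hand. The following is a minimal sketch; the function name parse_result and the regular expression are illustrative choices, not part of any Krita or Qt tool.

```python
import re

# Matches e.g. "1,383.4 msec per iteration (total: 13834, iterations: 10)".
RESULT_RE = re.compile(
    r"(?P<per_iter>[\d,.]+) (?P<unit>msec|ticks) per iteration "
    r"\(total: (?P<total>[\d,]+), iterations: (?P<iterations>\d+)\)"
)


def parse_result(text):
    """Return (value per iteration, unit, total, iterations) from a result cell."""
    match = RESULT_RE.search(text)
    if match is None:
        raise ValueError(f"not a benchmark result: {text!r}")
    per_iter = float(match.group("per_iter").replace(",", ""))
    total = int(match.group("total").replace(",", ""))
    iterations = int(match.group("iterations"))
    return per_iter, match.group("unit"), total, iterations


if __name__ == "__main__":
    cell = "1,383.4 msec per iteration (total: 13834, iterations: 10)"
    print(parse_result(cell))  # (1383.4, 'msec', 13834, 10)
```

Dividing the parsed total by the iteration count gives a quick sanity check on the per-iteration figure.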
Update
state: trunk, 17 Feb 2010 15:38
benchmark name | walltime | Mb/s
benchmarkWriteBytes | 1,548.0 msec per iteration (total: 15481, iterations: 10) | 41.34 Mb/s
benchmarkReadBytes | 3,087.8 msec per iteration (total: 30878, iterations: 10) | 20.73 Mb/s
benchmarkConstReadBytes | 3,062.0 msec per iteration (total: 30620, iterations: 10) | 20.90 Mb/s
benchmarkReadWriteBytes | 3,725.0 msec per iteration (total: 37251, iterations: 10) | 17.18 Mb/s
benchmarkNoMemCpy | 2,264.4 msec per iteration (total: 22644, iterations: 10) | 28.26 Mb/s
benchmarkConstNoMemCpy | 2,316.8 msec per iteration (total: 23168, iterations: 10) | 27.62 Mb/s
benchmarkTwoIteratorsNoMemCpy | 2,950.0 msec per iteration (total: 29501, iterations: 10) | 21.69 Mb/s
state: caching patch applied to trunk
benchmark name | walltime | Mb/s
benchmarkWriteBytes | 1,211.4 msec per iteration (total: 12114, iterations: 10) | 52.83 Mb/s (speedup 1.28)
benchmarkReadBytes | 1,196.2 msec per iteration (total: 11962, iterations: 10) | 53.50 Mb/s (speedup 2.58)
benchmarkConstReadBytes | 1,202.2 msec per iteration (total: 12022, iterations: 10) | 53.24 Mb/s (speedup 2.55)
benchmarkReadWriteBytes | 1,563.0 msec per iteration (total: 15631, iterations: 10) | 40.95 Mb/s (speedup 2.38)
benchmarkNoMemCpy | 389.1 msec per iteration (total: 3891, iterations: 10) | 164.48 Mb/s (speedup 5.82)
benchmarkConstNoMemCpy | 372.5 msec per iteration (total: 3725, iterations: 10) | 171.81 Mb/s (speedup 6.21)
benchmarkTwoIteratorsNoMemCpy | 670.3 msec per iteration (total: 6704, iterations: 10) | 95.48 Mb/s (speedup 4.4)
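The speedup figures are the trunk walltime divided by the walltime with the caching patch applied. A short sketch of that calculation, using the walltimes from the two Update tables (the dictionary and function names are only illustrative):

```python
# Walltimes in msec per iteration, copied from the two "Update" tables.
TRUNK_MSEC = {
    "benchmarkWriteBytes": 1548.0,
    "benchmarkReadBytes": 3087.8,
    "benchmarkConstReadBytes": 3062.0,
    "benchmarkReadWriteBytes": 3725.0,
    "benchmarkNoMemCpy": 2264.4,
    "benchmarkConstNoMemCpy": 2316.8,
    "benchmarkTwoIteratorsNoMemCpy": 2950.0,
}
PATCHED_MSEC = {
    "benchmarkWriteBytes": 1211.4,
    "benchmarkReadBytes": 1196.2,
    "benchmarkConstReadBytes": 1202.2,
    "benchmarkReadWriteBytes": 1563.0,
    "benchmarkNoMemCpy": 389.1,
    "benchmarkConstNoMemCpy": 372.5,
    "benchmarkTwoIteratorsNoMemCpy": 670.3,
}


def speedups(before, after):
    """Return {benchmark name: before / after} for every shared benchmark."""
    return {name: before[name] / after[name] for name in before if name in after}


if __name__ == "__main__":
    for name, factor in speedups(TRUNK_MSEC, PATCHED_MSEC).items():
        print(f"{name}: {factor:.2f}")
```

A factor above 1 means the patched build is faster; the NoMemCpy variants gain the most because the copy no longer dominates their run time.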
Vertical Iterator
benchmark name | walltime | tickcounter | Mb/s
benchmarkWriteBytes | 1,541.9 msec per iteration (total: 15419, iterations: 10) | Not measured | 41.52 Mb/s
benchmarkReadBytes | 1,534.4 msec per iteration (total: 15344, iterations: 10) | Not measured | 41.7 Mb/s
benchmarkConstReadBytes | 1,460.5 msec per iteration (total: 14606, iterations: 10) | Not measured | 43.82 Mb/s
benchmarkReadWriteBytes | 2,156.3 msec per iteration (total: 21563, iterations: 10) | Not measured | 29.7 Mb/s
benchmarkNoMemCpy | 649.0 msec per iteration (total: 6490, iterations: 10) | Not measured | 98.6 Mb/s
benchmarkConstNoMemCpy | 599.3 msec per iteration (total: 5994, iterations: 10) | Not measured | 106.7 Mb/s
benchmarkTwoIteratorsNoMemCpy | 1,231.5 msec per iteration (total: 12316, iterations: 10) | Not measured | 52 Mb/s
Rectangular Iterator
benchmark name | walltime | Mb/s
benchmarkWriteBytes | 118.2 msec per iteration (total: 1182, iterations: 10) | 541.4 Mb/s
benchmarkReadBytes | 121.7 msec per iteration (total: 1217, iterations: 10) | 525.9 Mb/s
benchmarkConstReadBytes | 120.5 msec per iteration (total: 1205, iterations: 10) | 533.3 Mb/s
benchmarkReadWriteBytes | 167.0 msec per iteration (total: 1670, iterations: 10) | 383.2 Mb/s
benchmarkNoMemCpy | 35.7 msec per iteration (total: 358, iterations: 10) | 1792.7 Mb/s
benchmarkConstNoMemCpy | 37.7 msec per iteration (total: 377, iterations: 10) | 1697.6 Mb/s
benchmarkTwoIteratorsNoMemCpy | 65.2 msec per iteration (total: 652, iterations: 10) | 981.6 Mb/s
Random Iterator
benchmark name | walltime | Mb/s
benchmarkWriteBytes | 1,641.5 msec per iteration (total: 16415, iterations: 10) | 39.0 Mb/s
benchmarkReadBytes | 1,598.5 msec per iteration (total: 15985, iterations: 10) | 40.0 Mb/s
benchmarkConstReadBytes | 1,654.5 msec per iteration (total: 16545, iterations: 10) | 38.68 Mb/s
benchmarkReadWriteBytes | 2,934.8 msec per iteration (total: 29348, iterations: 10) | 21.8 Mb/s
benchmarkNoMemCpy | 971.3 msec per iteration (total: 9714, iterations: 10) | 65.9 Mb/s
benchmarkConstNoMemCpy | 938.6 msec per iteration (total: 9386, iterations: 10) | 68.2 Mb/s
benchmarkTwoIteratorsNoMemCpy | 1,929.7 msec per iteration (total: 19298, iterations: 10) | 33.2 Mb/s
benchmarkTileByTileWrite | 1,310.0 msec per iteration (total: 13101, iterations: 10) | 48.9 Mb/s
benchmarkTotalRandom | 27,999 msec per iteration (total: 27999, iterations: 1) | 2.2 Mb/s
benchmarkTotalRandomConst | 29,124 msec per iteration (total: 29124, iterations: 1) | 2.2 Mb/s
KisPainter
Composition (bitBlt)
benchmark name | walltime | Mb/s
benchmarkBitBlt | 5,456.8 msec per iteration (total: 54569, iterations: 10) | 234.6 Mb/s
benchmarkBitBltSelection | 5,922.8 msec per iteration (total: 59228, iterations: 10) | 216.1 Mb/s
benchmarkFixedBitBlt | 3,635.5 msec per iteration (total: 36356, iterations: 10) | 352.1 Mb/s
benchmarkFixedBitBltSelection | 5,342.1 msec per iteration (total: 53421, iterations: 10) | 239.6 Mb/s
Filters
Brightness/Contrast
benchmark name | walltime | Mb/s
benchmarkFilter | 1,783.5 msec per iteration (total: 17835, iterations: 10) | 14.47 Mb/s
Blur
benchmark name | walltime | Mb/s
benchmarkFilter | 31,674 msec per iteration (total: 31674, iterations: 1) | 0.81 Mb/s
Projection
Everything is benchmarked in one go.
benchmark name | walltime | Mb/s
benchmarkProjection | 834.6 msec per iteration (total: 8346, iterations: 10) | N/A
Painting strokes
- We paint on an empty 4096x4096 paint device
- The brush used is a 70px pixel brush, the autobrush (the default one)
- The benchmark can run with any paintop; only the preset needs to be changed
- The first test paints the stroke you can see in the preview box, at a different scale, on the 4096x4096px image
- The second test paints 20 random lines (the same 20 lines in every run) with pressure varying from 0.0 to 1.0
- http://lukast.mediablog.sk/callgrind/strokeBenchmarks.tar.gz [TODO add bouds result]
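The second test only gives comparable timings if every run paints the same 20 lines. One common way to get "random but repeatable" input is a pseudo-random generator with a fixed seed. The sketch below illustrates that idea; it is an assumption about how such input could be produced, not the code of the Krita benchmark, and the function name and seed value are hypothetical.

```python
import random


def make_random_lines(count=20, size=4096, seed=12345):
    """Return `count` lines as (start, end, start_pressure, end_pressure).

    A private generator with a fixed seed yields the same lines on every
    call, so repeated benchmark runs paint identical strokes.
    """
    rng = random.Random(seed)  # hypothetical seed; any constant works
    lines = []
    for _ in range(count):
        start = (rng.uniform(0, size), rng.uniform(0, size))
        end = (rng.uniform(0, size), rng.uniform(0, size))
        start_pressure = rng.uniform(0.0, 1.0)
        end_pressure = rng.uniform(0.0, 1.0)
        lines.append((start, end, start_pressure, end_pressure))
    return lines


if __name__ == "__main__":
    for start, end, p0, p1 in make_random_lines()[:3]:
        print(start, end, round(p0, 2), round(p1, 2))
```

Using a private random.Random instance rather than the module-level functions keeps the sequence independent of any other code that draws random numbers.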
benchmark name | walltime | Mb/s
benchmarkStroke | 2,962 msec per iteration (total: 2962, iterations: 1) | N/A
benchmarkRandomLines | 18,576 msec per iteration (total: 18576, iterations: 1) | N/A
First results
Computer specification
Compiler options
gcc -Wnon-virtual-dtor -Wno-long-long -ansi -Wundef -Wcast-align -Wchar-subscripts -Wall -W -Wpointer-arith -Wformat-security -fno-exceptions -DQT_NO_EXCEPTIONS -fno-check-new -fno-common -Woverloaded-virtual -fno-threadsafe-statics -fvisibility=hidden -fvisibility-inlines-hidden -O2 -g -fPIC -Wl,--enable-new-dtags
The CMake configuration has an option called KritaDevs, which is what I used for the benchmarking. The flags above were taken from the output of make VERBOSE=1.
First optimizations
With performance fix + FastMath::atan2
benchmark name | walltime | Mb/s
benchmarkStroke | 650.2 msec per iteration (total: 6503, iterations: 10) | N/A
benchmarkRandomLines | 4,158.8 msec per iteration (total: 41589, iterations: 10) | N/A
Cyrille's tuning commits around lunch
benchmark name | walltime | Mb/s
benchmarkStroke | 533.3 msec per iteration (total: 5334, iterations: 10) | N/A
benchmarkRandomLines | 3,555.5 msec per iteration (total: 35556, iterations: 10) | N/A
Just with performance fix
benchmark name | walltime | Mb/s
benchmarkStroke | 683.7 msec per iteration (total: 6838, iterations: 10) | N/A
benchmarkRandomLines | 4,696.3 msec per iteration (total: 46964, iterations: 10) | N/A
Compute 1/4 for the symmetrical brushes
benchmark name | walltime | Mb/s
benchmarkStroke | 257.3 msec per iteration (total: 2574, iterations: 10) | N/A
benchmarkRandomLines | 1,449.2 msec per iteration (total: 14492, iterations: 10) | N/A
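The idea behind computing 1/4 for symmetrical brushes is that a mask which is mirror-symmetric about both axes only needs its values evaluated for one quadrant; the other three quadrants are copies. The sketch below illustrates the principle with a hypothetical linear falloff (mask_value) standing in for the real brush shape. It is not Krita's autobrush code.

```python
import math


def mask_value(dx, dy, radius):
    """Hypothetical soft circular brush: 1.0 at the centre, 0.0 at the edge."""
    return max(0.0, 1.0 - math.hypot(dx, dy) / radius)


def full_mask(size):
    """Reference version: evaluate the falloff for every pixel."""
    centre = (size - 1) / 2.0
    radius = size / 2.0
    return [[mask_value(x - centre, y - centre, radius) for x in range(size)]
            for y in range(size)]


def quarter_mask(size):
    """Evaluate only the top-left quadrant and mirror it into the other three."""
    centre = (size - 1) / 2.0
    radius = size / 2.0
    half = (size + 1) // 2  # includes the middle row/column for odd sizes
    mask = [[0.0] * size for _ in range(size)]
    for y in range(half):
        for x in range(half):
            value = mask_value(x - centre, y - centre, radius)
            mask[y][x] = value
            mask[y][size - 1 - x] = value
            mask[size - 1 - y][x] = value
            mask[size - 1 - y][size - 1 - x] = value
    return mask


if __name__ == "__main__":
    a, b = full_mask(70), quarter_mask(70)
    same = all(abs(p - q) < 1e-12 for ra, rb in zip(a, b) for p, q in zip(ra, rb))
    print("masks match:", same)
```

For a 70px brush this cuts the number of falloff evaluations from 4900 to 1225, at the cost of three extra stores per evaluated pixel.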