
GSoC/2018/StatusReports/ivanyossiIván: Difference between revisions

From KDE Community Wiki
Ghevan (talk | contribs)
== Optimize Krita Soft, Gaussian and Stamp brushes mask generation to use AVX with Vc Library ==
Krita, a digital painting application, relies on a quick painting response to give a natural experience. A painted line is composed of thousands of images, called dabs, placed one after the other; each dab is masked to generate a different brush tip shape. Creating this mask and stamping it on the canvas must be very fast, as it is done hundreds or thousands of times per second (a small brush of 300x300px with 10% spacing produces around 600 dabs per second). If applying the images to the canvas is not fast enough, the painting process is compromised and the enjoyment of painting is reduced.


To optimize mask creation we can use the AVX instruction set to apply a transformation to a whole vector of data in one step. In this case the data are the coordinates of the image components that make up the mask. One way of programming AVX is in assembly, but this is neither manageable nor future-proof, as newer processors will come out with new, enhanced instruction sets. To stay future-proof, Krita has opted to use the Vc optimization library, which translates C++ template code into assembly code tailored to the features of the user's processor.
=== Summary ===
* '''Project Name:''' Optimize Krita Soft, Gaussian and Stamp brushes mask generation to use AVX with Vc Library
* '''Proposal:''' [https://docs.google.com/document/d/1TfAJnq_ZD3omGgsdGApBH_05B-w_OMC428UoHIJ0KWk/edit View Proposal]
* '''Abstract:''' Digital painting apps rely on a quick painting response to give a natural experience. A painted line is composed of thousands of images, called dabs, placed one after the other; each dab is masked to generate a different brush tip shape. As mask shapes become bigger and more complex, rendering them can be costly and painting becomes laggy. This project seeks to minimize the time spent generating the mask by implementing the generator with the AVX instruction set. The Vc library is used to interface with the SIMD operations. Testing suggests mask generation can be up to 10 times faster, which improves the workflow when using big brushes or complex multibrushes.
* '''[https://phabricator.kde.org/D14735 Final merged Code differential]'''


=== Project Goals ===
Implement Mask AVX optimization (Mask Type / Status, task) [https://phabricator.kde.org/T8580 Tasks]
* Circular Gauss ''implemented, merged'' [https://phabricator.kde.org/T8734 T8734]
* Circular Soft ''implemented, merged'' [https://phabricator.kde.org/T8868 T8868]
* Rectangular Gaussian ''implemented, merged'' [https://phabricator.kde.org/T9010 T9010]
* Rectangular Soft ''implemented, merged'' [https://phabricator.kde.org/T9133 T9133]
* Default Rectangular ''implemented, merged'' [https://phabricator.kde.org/T9344 T9344] ''(Added later to plan)''
* Stamp Mask ''progressively implementing'' [https://phabricator.kde.org/T9346 T9346]


=== Project related links ===
* [https://phabricator.kde.org/T8580 phabricator task]
* [https://colorathis.wordpress.com/tag/KDE/ Personal blog]
=== Code summaries ===
* '''54''' Commits merged
* [https://phabricator.kde.org/P237 Work done Differential] ''(code made until June 12, 2018)''
* '''[https://phabricator.kde.org/D14735 Final merged Code differential]'''
== Implementations Status ==
Status report on each goal implementation.
=== Unit Test: Similarity test ===
'''Goal:''' Test that the current mask generators produce the same mask representation.
This unit test makes sure the generated masks are equal to the dab shape agreed on by the Krita community. Mask shape equality ensures consistency between versions, and every implementation needs to adhere to the accepted shape (unless a new definition is decided upon).
'''Current Status''' The current test verifies equality between the old engine and the new AVX vectorized engine. The similarity is adjusted so that no pixel may differ by more than a brightness value of 2 (in 8-bit RGB space).
'''TODO''' Simplify the code. The test checks that the masks generated by the scalar and vector methods are equal, but it does not check that the generated mask is consistent with the expected mask (the one defined by Krita).
'''Challenges''' The mask shape has many variants that affect size, ratio, fade and antialiasing. These operations work in tandem, but in some situations an input variant does not alter the result, or needs to be tested separately. The test needs to cover as many variants in as few shapes as possible.
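As an illustration of the tolerance rule described above, the following is a minimal sketch of a per-pixel comparison. It is not the code of ''KisBrushMaskSimilarityTest''; the function name ''masksSimilar'' and the flat 8-bit buffer layout are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical helper (not Krita API): returns true when both 8-bit masks have
// the same size and no pixel differs by more than `tolerance` brightness levels.
bool masksSimilar(const std::vector<std::uint8_t>& scalarMask,
                  const std::vector<std::uint8_t>& vectorMask,
                  int tolerance = 2)
{
    assert(tolerance >= 0);
    if (scalarMask.size() != vectorMask.size()) {
        return false;
    }
    for (std::size_t i = 0; i < scalarMask.size(); ++i) {
        // Widen to int before subtracting so the difference cannot wrap around.
        const int diff = std::abs(int(scalarMask[i]) - int(vectorMask[i]));
        if (diff > tolerance) {
            return false;
        }
    }
    return true;
}
```

A check of this kind only shows that two implementations agree with each other; comparing against a stored reference mask would be a separate step.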
==== Related blog posts ====
* [https://colorathis.wordpress.com/2018/06/13/gsoc18_krta02/ Implementing dev and test environment]
==== Commits and Differentials ====
'''[https://phabricator.kde.org/T8581 Phabricator Task T8581]'''
* [https://phabricator.kde.org/R37:8fa826838aa903a0e615f4b1f0ebaa1405fa1e6d R37:8fa826838aa9: Adjust similarity Tolerance]
* [https://phabricator.kde.org/R37:5f60267ccd80e57e182c04a3c7625e522874a9f6 R37:5f60267ccd80: Modify similiarity test to try more mask variation]
* [https://phabricator.kde.org/R37:08efed86d2bb1365d87460de08d7755ea90636e2 R37:08efed86d2bb: Added CircleGauss to SimilarityTes]
* [https://phabricator.kde.org/R37:9963768b392bfc95637b2969b679172ff90a7b02 R37:9963768b392b: Correctly compare images alpha channel by setting fuzzy alpha and tolerance]
* [https://phabricator.kde.org/R37:d67ecdd905ad7022ae106a367b0f3271a7d30cc4 R37:d67ecdd905ad: Adhere code to coding style more strictly]
* [https://phabricator.kde.org/R37:db1ebe824c929a57b71fb3b2e6b38fe800f6e96e R37:db1ebe824c92: KisBrushMaskSimilarityTest:]
=== Circular Gauss ===
'''Goal:''' Implement Circular Gauss vectorized Mask generator using Vc
The Gaussian mask generator uses a Gauss function to control the fade of the mask shape. Because of that it is the slowest of all mask generators, since it calls the math ''erf()'' function twice for each pixel. The ''erf'' value can be approximated in a number of ways; the original implementation uses the standard double-precision ''erf()'' on each pixel, making it very slow.
'''Current Status:''' Implemented and merged to master. ''Released in Krita 4.1''. Mask generation is about 10 times faster. All tests pass, which shows that the scalar and vectorized implementations produce the same masks within the test tolerance. The code was profiled and no bottlenecks or code issues were found. Feature work is 100% complete.
'''Challenges''' The Gaussian mask depends on correct ''erf()'' values, but no such function existed for the vectorized data types of Vc. Implementing a correct and quick vectorized ''erf()'' using single-precision floats was the biggest issue. The standard ''erf()'' not only works in double precision but also performs different operations depending on the input value. The implemented '''vcErf()''' takes into account that any value it receives is between zero and 255. Working case by case, we reproduced the precision needed to match the original scalar implementation.
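To give an idea of what a branch-free, single-precision ''erf()'' for non-negative inputs can look like, here is a sketch based on the well-known Abramowitz and Stegun approximation 7.1.26. This is an assumption chosen for illustration and is not the actual '''vcErf()''' code; the name ''erfApprox'', the lane count and the plain array standing in for a SIMD register are all hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

constexpr std::size_t kLanes = 8; // assumed lane count, e.g. 8 floats per AVX register

// Evaluates the same arithmetic on every lane, with no per-value branching.
// Valid for x >= 0 only; the published maximum error of the formula is about 1.5e-7.
std::array<float, kLanes> erfApprox(const std::array<float, kLanes>& x)
{
    std::array<float, kLanes> out{};
    for (std::size_t i = 0; i < kLanes; ++i) {
        assert(x[i] >= 0.0f);
        const float t = 1.0f / (1.0f + 0.3275911f * x[i]);
        // Fifth-degree polynomial in t, evaluated with Horner's scheme.
        const float poly = t * (0.254829592f
                         + t * (-0.284496736f
                         + t * (1.421413741f
                         + t * (-1.453152027f
                         + t * 1.061405429f))));
        out[i] = 1.0f - poly * std::exp(-x[i] * x[i]);
    }
    return out;
}
```

Because every lane follows the same instruction sequence, a loop like this maps naturally onto vector types, which is the property the standard ''erf()'' with its input-dependent code paths lacks.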
==== Related blog posts ====
* [https://colorathis.wordpress.com/2018/06/26/gsoc18_krta03/ Optimizing Circular Gaussian Mask, Krita:GSoC]
==== Commits and Differentials ====
'''[https://phabricator.kde.org/T8734 Phabricator Task T8734]'''
* [https://phabricator.kde.org/R37:b55ed74ac98b6345e9885340e7385745de6d1957 R37:b55ed74ac98b: FIX: Gauss Circular Mask Antialiasing]
* [https://phabricator.kde.org/R37:45cf521214b566579cc6ad62c2d1f139727894df R37:45cf521214b5: FIX: Float precision bug masking issues for vectorized GaussMask generator]
* [https://phabricator.kde.org/R37:8dc950e705ee107e91f4e9607348931c19c2c14d R37:8dc950e705ee: FIX: Gauss Circular Mask Antialiasing]
* [https://phabricator.kde.org/R37:b395b05ef54d16bb7866ae13ed399b8e14bcdb78 R37:b395b05ef54d: FIX: Float precision bug masking issues for vectorized GaussMask generator]
* [https://phabricator.kde.org/R37:daac6985670c81165df05eb802448d03f6d6afd2 R37:daac6985670c: FIX: Missing Antialias on Vectorized Circular Gauss]
* [https://phabricator.kde.org/R37:df4cb29add283c6caaa1475ecb0fd0467e8b01dc R37:df4cb29add28: FIX: Missing Antialias on Vectorized Circular Gauss]
* [https://phabricator.kde.org/R37:884dcc104e3a408ae1839d1165013d471a6a6582 R37:884dcc104e3a: FIX: Missing Antialias on Vectorized Circular Gauss]
* [https://phabricator.kde.org/R37:08efed86d2bb1365d87460de08d7755ea90636e2 R37:08efed86d2bb: Added CircleGauss to SimilarityTest]
* [https://phabricator.kde.org/R37:37effe636a305debc2f936b9af76b6939b1f0e37 R37:37effe636a30: ADD: Vectorized CircularGaussMask, UnitTestPAssing]
* [https://phabricator.kde.org/R37:a9b6c3a4eb36960bf11b89c12794044d72d86b5e R37:a9b6c3a4eb36: ref T8734]
'''Differentials'''
* [https://phabricator.kde.org/D13052 D13052: Krita GaussMask AVX optimization full vectorized]
=== Circular Soft ===
'''Goal:''' Implement Circular Soft vectorized Mask generator using Vc
The Soft generator creates a mask based on curve values. The curve itself is generated elsewhere, using the initial values of the mask generator. The curve is defined by a list of points in which 0 < x < 1 and 0 < y < 1. Fade generation uses the same object as the Circular Gauss.
'''Current Status:''' Implemented and merged to master. Mask generation is about 5 times faster. The change is not as drastic as for the Gauss version because the scalar implementation depended less on math operations and more on memory reads. All test variants pass. Profiling the code shows no hot spots. The feature set is implemented in full.
'''Challenges''' Soft mask values are determined by a curve represented as a vector of gray values. The index of each value corresponds to the distance from the center of the mask. In a scalar approach, fetching values one by one using an index is trivial. With ''Vc'', however, the values need to sit next to each other in an array to allow for the best optimization. Getting the spaced-out values from the vector into the Vc SIMD array was the main problem to solve. Luckily there was no need for an in-house implementation, as '''Vc''' has a method to gather values at indexes from different regions of an array into a Vc array. Using this method with the data pointer of the vector made it possible to access the curve values efficiently.
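The gather idea can be sketched without Vc as a scalar stand-in: each lane picks the curve value at its own index. The names ''distanceToIndex'' and ''gatherCurve'', the lane count, and the assumption that distances are normalized to [0, 1] are all hypothetical and only serve the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kGatherWidth = 8; // assumed lane count

// Converts normalized distances in [0, 1] into curve indexes by truncation.
std::array<int, kGatherWidth> distanceToIndex(const std::array<float, kGatherWidth>& dist,
                                              std::size_t curveSize)
{
    assert(curveSize > 0);
    std::array<int, kGatherWidth> indexes{};
    for (std::size_t i = 0; i < kGatherWidth; ++i) {
        indexes[i] = static_cast<int>(dist[i] * static_cast<float>(curveSize - 1));
    }
    return indexes;
}

// Scalar stand-in for a SIMD gather: lane i receives curve[indexes[i]].
std::array<float, kGatherWidth> gatherCurve(const float* curve, std::size_t curveSize,
                                            const std::array<int, kGatherWidth>& indexes)
{
    std::array<float, kGatherWidth> lanes{};
    for (std::size_t i = 0; i < kGatherWidth; ++i) {
        assert(indexes[i] >= 0 && static_cast<std::size_t>(indexes[i]) < curveSize);
        lanes[i] = curve[indexes[i]];
    }
    return lanes;
}
```

Passing the raw data pointer of the curve vector, as in ''gatherCurve(curve.data(), curve.size(), indexes)'', mirrors the approach described above of handing the vector's data pointer to the gather operation.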
==== Related blog posts ====
* [https://colorathis.wordpress.com/2018/07/27/optimizing-circular-soft-mask-kritagsoc/ Optimizing Circular Soft Mask, Krita:GSoC]
==== Commits and Differentials ====
'''[https://phabricator.kde.org/T8868 Phabricator Task T8868]'''
* [https://phabricator.kde.org/R37:ae2f0e5cdaa10a5ca03745977819272b33726bed R37:ae2f0e5cdaa1: Adjust format and on CircSoft Mask FastRow]
* [https://phabricator.kde.org/R37:f6182887b9b550e310cf6fb895b069190753bde0 R37:f6182887b9b5: Modify maksBenchmark to create identical Soft Masks]
* [https://phabricator.kde.org/R37:e8de81d0db26b5206a481cc5148f6fe5650e482f R37:e8de81d0db26: - Soft Circular vectorized brush mask Add missing antialias modification for]
* [https://phabricator.kde.org/R37:dfae36961a09fd55dcbb2f05041c3b720a651990 R37:dfae36961a09: NEW: Implement Vectorized Soft Brush Mask Generator.]
'''Differentials'''
* [https://phabricator.kde.org/D13504 D13504: Krita SoftBrush AVX Mask generation Optim.]
=== Rectangular Gauss ===
'''Goal:''' Implement Rectangular Gauss vectorized Mask generator using Vc
'''Current Status:''' Implemented and merged to master. Speed-up of 10x; the mask generator reuses the '''vcErf()''' implemented for '''Circular Gauss'''. The vectorized and scalar implementations give the same result and profiling shows no bottleneck.
'''Challenges''' Rectangular shapes use a 2D fader to determine the gray value at a given coordinate. This value is calculated from a relation between the x and y coordinates. The 2D fader performed different operations depending on the x and y values received, and returned results at different points in the function. Adapting this to use masks on vectors was the main challenge. Also, since any operation is carried out for all values in the vector, the new implementation was written so that the costly double ''vcErf()'' operation is done only once per iteration.
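The step from early returns to vector masks can be illustrated with a small sketch. The fade function below is invented for the example and is not Krita's 2D fader; ''fadeScalar'', ''fadeLanes'' and the lane count are hypothetical names. The lane-wise version computes the general result for every lane and then overrides it per lane with boolean conditions, which is the role a SIMD mask plays.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kFaderWidth = 8; // assumed lane count
using FloatLanes = std::array<float, kFaderWidth>;

// Scalar reference with early returns (hypothetical fade rule).
float fadeScalar(float x, float y)
{
    if (x > 1.0f || y > 1.0f) return 1.0f;  // outside the shape
    if (x < 0.5f && y < 0.5f) return 0.0f;  // inside the solid core
    return std::max(x, y) * 2.0f - 1.0f;    // fade region
}

// Lane-wise version: every lane runs the same code, conditions become selects.
FloatLanes fadeLanes(const FloatLanes& x, const FloatLanes& y)
{
    FloatLanes out{};
    for (std::size_t i = 0; i < kFaderWidth; ++i) {
        assert(x[i] >= 0.0f && y[i] >= 0.0f);
        const bool outside = x[i] > 1.0f || y[i] > 1.0f;
        const bool core = x[i] < 0.5f && y[i] < 0.5f;
        float value = std::max(x[i], y[i]) * 2.0f - 1.0f; // computed for all lanes
        value = core ? 0.0f : value;
        value = outside ? 1.0f : value; // applied last, so it wins like the first early return
        out[i] = value;
    }
    return out;
}
```

Since the fade expression is evaluated for every lane regardless of the conditions, it pays to arrange the code so that expensive calls appear only once per iteration, which matches the concern about the ''vcErf()'' cost described above.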
==== Related blog posts ====
* ''No related blogposts''
==== Commits and Differentials ====
'''[https://phabricator.kde.org/T9010 Phabricator Task T9010]'''
* [https://phabricator.kde.org/R37:461af3f9a957c74444070f1834d4e96c23302c1e R37:461af3f9a957: Minor code clean up Set a bigger size for generated mask rect]
* [https://phabricator.kde.org/R37:55067015ca490159bd45e1e6845d3d2a065fe325 R37:55067015ca49: Include Gaussian Rectangular in FreeStrokeBenchMark]
* [https://phabricator.kde.org/R37:e36538da1a4537a8cff2987c894c7fbd69a22e19 R37:e36538da1a45: Optimize Rectangular Gauss Mask]
'''Differentials'''
* [https://phabricator.kde.org/D13646 D13646: Optimize Rectangular Gauss Mask Generator]
=== Rectangular Soft ===
'''Goal:''' Implement Rectangular Soft vectorized Mask generator using Vc
'''Current Status:''' Implemented and merged to master. Speed gains were at best 4x, since most of the time is spent fetching values from memory. The current implementation generates the same output as the scalar version.
'''Challenges''' The biggest challenge was to get the best performance from ''Vc'' initializations, to avoid adding time on top of the slow memory access. The original code used casting to get the integer part of the float value. At first this casting made the ''Vc'' code slow, but switching to the internal ''Vc'' index types made the casting much faster.
==== Related blog posts ====
* ''No related blogposts''
==== Commits and Differentials ====
'''[https://phabricator.kde.org/T9133 Phabricator Task T9133]'''
* [https://phabricator.kde.org/R37:728cec98fba82e326b014240de79d3f4e89f025f R37:728cec98fba8: Adjust Spacing of auto_soft_rect.kpp test preset to be 0.1]
* [https://phabricator.kde.org/R37:21afc59cea140a38e750daf363b219ba2b528dad R37:21afc59cea14: Use Vc Indexes instead of custom SimdArray for integer casting]
* [https://phabricator.kde.org/R37:b8f2080917a31f808e0cb28ccebf870f3c4bcf06 R37:b8f2080917a3: Reduce casting on Vector Indexes creation]
* [https://phabricator.kde.org/R37:2565a74bdc9c167253cb19b52e5c6fa3d86ab599 R37:2565a74bdc9c: Vectorize Soft Rect Mask Generator]
'''Differentials'''
* [https://phabricator.kde.org/D14314 D14314: Default and Soft Rectangular mask generator Optimization]
=== Rectangular Default ===
''Added to project goals as it was very similar to the other work done and not very time consuming''
'''Goal:''' Implement Rectangular Default vectorized Mask generator using Vc
'''Current Status:''' Implemented and merged to master. Speed is almost as fast as the fastest Vc implementation, the Default Circular. The generated mask is equal to the original scalar version and profiling shows no bottlenecks.
'''Challenges''' As usual with math-dependent implementations, the difference in precision made the final values vary by more than the threshold of the similarity test. This difference also made a bug surface much more often with single precision; fixing the bug and reducing the epsilon used for integer conversion allowed the masks to be equal. Default Rect does not use a 2D fader, which made it a bit tricky to mimic the logic of the scalar version.
==== Related blog posts ====
* ''No related blogposts''
==== Commits and Differentials ====
'''[https://phabricator.kde.org/T9344 Phabricator Task T9344]'''
* [https://phabricator.kde.org/R37:0a7d9b11379164c8712eb628be740b99ef9bf114 Add Default Rectangular to FreehandStrokeBenchmark]
* [https://phabricator.kde.org/R37:ff83ebb7382198d9b4e02978e10c80ec20d77750 Reduce difference gap of Default Rect Mask Vector impl]
* [https://phabricator.kde.org/R37:da249312e649d92cafa863376bd498188f3cbdf6 Implement Vc Optimization for Default Rect Mask]
'''Differentials'''
* [https://phabricator.kde.org/D14314 D14314: Default and Soft Rectangular mask generator Optimization]
=== Stamp Mask ===
'''Goal:''' Optimize Predefined Mask generator using vectorized operations with Vc
'''Current Status:''' Identified the current bottleneck: the ''qpainter.drawImage'' method is not very fast. Currently the code is being reorganized to avoid repetition in highly iterative sections, changing the current image transform from ''QImage >> QImage'' to ''QImage >> KisFixedPaintDevice''.
'''TODO''' Finish the implementation of the QImage transform to KisFixedPaintDevice and make the code use it. Port the new transform code to ''Vc'' operations for fast processing.
'''Challenges''' The base code is completely different from the ''auto_brush'' code used in the previous optimizations. The first challenge is identifying the slow parts of the code, then understanding how each part works to generate the mask. With the help of my mentor we discussed a plan to implement the optimization progressively, in steps. The biggest challenge will be to speed up the image transformation beyond what Qt currently does.
==== Related blog posts ====
* ''No related blogposts''
==== Commits and Differentials ====
'''[https://phabricator.kde.org/T9346 Phabricator Task T9346]'''
== GSoC Work report chronicle ==
During the first week, in the community bonding period, I read the documentation and made a first proposal for the unit test to be used in the implementation process. This unit test has to compare the new mask shape with the legacy one and assert that they are similar within a certain error. The unit test works, but it is not as isolated as needed, and other brush preparations could interfere with the brush mask testing.
In the following weeks, before the coding phase, I started to spend more time on IRC and the forums, helping users where I could. I began reading more about Vc and Intel AVX and started to make a small map of the brush mask code to know exactly what was going on. A second version of the unit test was made; this time we went deeper into the code and handled the masks directly through the data pointer of '''KisMaskGenerator'''.
=== Coding phase ===
I spent the '''first week''' of the coding phase working out how to implement a fully featured Circular Gaussian. I ran into the problem of implementing an in-house '''erf''' for vectorized operations. Once this implementation passed the test, I made a quick painting test and ran the '''FreehandStrokeBenchmark''' to see whether there was more speed gain than with the first dummy implementation. The new implementation was much faster.
'''Second week''' My mentor asked me to create a ''benchmark'' specifically for ''MaskGeneration''; the idea is to have even more evidence that we are getting much better performance from the new vectorized version. The benchmark did not take long to implement, and testing confirmed the speed gains seen in the other test. I sent the code for review and it was suggested that I merge it.
'''Third and fourth week''' The SoftBrush implementation was born, and during testing and feature implementation I realized that a feature was missing from the Gauss implementation: antialiasing. I ported the antialiasing code from the Soft mask to the Gauss mask (since both use the same logic in the scalar version), and while testing I discovered that the Gauss mask failed with some softness and fading values. The image confirmed the mask was not coming out properly. I spent the next two days finding the root cause and fixing the bug, which was caused by float imprecision and one bad guard condition. The fixes also applied to the Soft brush, and we finished the initial feature-complete implementation. I did not send it for review yet, as I wanted to do much more in-depth testing and optimization first.
We also used this time to help out a little with the new documentation platform. I specifically sent two proposals to help automate the LaTeX version of the manual, [https://phabricator.kde.org/D13205 D13205] and [https://phabricator.kde.org/D13204 D13204]. This should make it easier to deploy the PDF version of the documentation when it is needed.
'''Fifth and sixth week''' ''Circular Soft'' was sent for review and work on ''Rectangular Gauss'' began. This showed the need for a new static class holding all the extra ''Vc'' operations required by both Gaussian implementations, to avoid repeating code. My mentor pointed out that my current 1D fader implementation was not very good, as it made private members public (through methods), which defeated the purpose of encapsulation. A refactored 1D fader was made, and with those lessons learned I proceeded to write the Vc implementation of the 2D fader. The finished implementation had a missing file, so the review process was delayed. In the end the review found a regression on smaller brushes, which took me a few days to figure out: at first I thought it was due to my code, but it turned out that I had turned on supersampling, which is mainly used for the Default brush.
'''Seventh and eighth week''' I used this time to fix bugs and improve the tests, but since I was making good progress I started working on ''Rectangular Soft'', which was pushed to my branch one week later. The implementation took longer because it had to include all test cases, and also because I was testing other options to generate the square image. It took another full week to finish all the tests on how to make it faster (this included a version that used the stack instead of the heap, which did not prove fruitful).
Coding for GSoC had to be put aside during this period, since my university calendar wraps up the trimester around the end of July, and as part of my final project I had to finish a system and run a heavy simulation. Attempts to run it on Ubuntu proved unsuccessful, so I had to use the main computer to run the simulations, which took more than the 3 days anticipated, up to a week and a half. Running two simulations at a time produced errors and made the main computer unavailable for any type of work. In the end it took almost two weeks, because on day 5 I reduced the simulations to one at a time. This let me work on GSoC again, but limited any recompile to one core.
The slowdown caused by the simulation was extended a little by the write-up of the project report (a small thesis). However, I made it work by coding for GSoC half the day and writing the other half, which allowed me to finish the project in time. :]
Back to coding! ''But we barely have time left to code!''
'''Final weeks''' With two weeks to go, my mentor and I drew up a plan for the remaining mask optimization. Since Rectangular Default used the same ''auto_brush'' framework, we decided I should optimize it as well and then work on the Stamp mask (officially: predefined mask). I estimated that optimizing ''Rectangular Default'' would take me at most 3 days, but as with everything, the simplest-looking problem turned out to have hidden difficulties. In the end it took me an entire week to fix all the quirks in the code and make it render the same as the original version. I also made an alternate render of this mask that is a little softer in the corners, but I have not yet sent it for evaluation and inclusion. All in all, that lost week left me with one week to understand a completely new mask generator code for the predefined masks. In that week I was able to understand the code and start planning an implementation route. At the end of that week my mentor helped me to trace a concrete route for incorporating optimizations into the code, leading eventually to the Vc vectorization.
==Things left to do==
The project finished almost all objectives, and those finished are already merged into the master repository and ready to land in '''Krita 4.2'''. The implementation path for the missing mask generator is described below:
* The predefined mask generator is being refactored to use an in-house implementation of QPainter's drawImage method that works directly with the Krita paint device, with Vc to optimize it. Once the scalar implementation works the same as the QPainter one, the idea is to create a vectorized version that speeds up all math operations by doing what is now a two-step operation in one.

Latest revision as of 02:04, 11 August 2018

Optimize Krita Soft, Gaussian and Stamp brushes mask generation to use AVX with Vc Library

Summary

  • Project Name: Optimize Krita Soft, Gaussian and Stamp brushes mask generation to use AVX with Vc Library
  • Proposal: View Proposal
  • Abstract: Digital painting app relies on quick painting response to give a natural experience. A painted line is composed of thousands of images, called dabs, placed one after the other, each dab is masked to generate a different brush tip shape. As mask shapes are more complex and bigger, rendering them can be costly and painting becomes laggy. This project seeks to minimize the time spent generating the mask by implementing the generator using AVX instructions sets. Vc library is used to interface with the SIMD operations. Testing suggest the speed gains can be up to 10 times faster which improves the workflow using big brushes or complex multibrushes.
  • Final merged Code differential

Project Goals

Implement Mask AVX optimization (Mask Type / Status, task) Tasks

  • Circular Gauss implemented, merged T8734
  • Circular Soft implemented, merged T8868
  • Rectangular Gaussian implemented, merged T9010
  • Rectangular Soft implemented, merged T9133
  • Default Rectangular implemented, merged T9344 (Added later to plan)
  • Stamp Mask progressively implementing T9346

Project related links

Code summaries

Implementations Status

Status report on each goal implementation.

Unit Test: Similarity test

Goal: Test the current mask generators produce the same mask representation.

This unit test makes sure the masks generated are equal to the dab shape stated by the Krita community. The mask shape equality ensures consistency between versions and every implementation needs to adhere to the shape accepted (unless a new definition is decided upon).

Current Status Current test verifies the equality between the old engine and new AVX vectorized engine. The similarity is adjusted such as no pixel is allowed to be different by more than a brightness value of 2 (in RGB 8-but space).

TODO Simplify code. The test checks the mask generated from the scalar an vector method are equal, but it doesn't check the mask generated is consistent with the expected Mask (the one defined by Krita).

Challenges Mask shape has many variants that affect size, ratio, fade and antialias. Each of this operations work in tandem but in some situations input variants won't alter result, or need to be tested separately. The test needs to include as many variants in as few shapes as possible.

Related blog posts

Commits and Differentials

Phabricator Task T8581

Circular Gauss

Goal: Implement Circular Gauss vectorized Mask generator using Vc

Gaussian mask generator uses a Gauss function to control the fade of the mask shape. Because of that is the slowest of all mask generators, since it calls the math erf() function twice on each pixel. The erf number can be approximated in a number of ways, original implementation does this using standard double precision erf() on each pixel, making it very slow.

Current Status: Implemented and added merge to master. Released in Krita 4.1. Mask generation is 10 times more faster to render. All tests pass which proves both scalar and vectorized implementation are identical. Code profiled, no bottle necks or code issues found. Feature work 100%

Challenges Gaussian depends in the correct erf() values generation, but no such function existed for the vectorized data type of Vc. Implement a correct and quick vectorized erf() using single precision float was the biggest issue. The standard erf() not only works in double precision but it also makes different operations depending on the input value. The implemented vcErf() takes into account that any value it will receive is between zero and 255. Working with cases we replicated the precision needed to replicate the original Scalar implementation.

Related blog posts

Commits and Differentials

Phabricator Task T8734

Differentials


Circular Soft

Goal: Implement Circular Soft vectorized Mask generator using Vc

Soft Generator creates a Mask based on curve values. The curve itself is generated elsewhere using the initial values on the mask generator. The curve is defined by a list of points in which 0 < x < 1 and 0 < y < 1. Fade generation uses the same object as the Circular Gauss

Current Status: Implemented and merged to master. Mask generation improved by 5 times, the change is not as drastic as the Gauss version, this is because the scalar implementation was not as heavy dependent on math operations but on memory read. All tests variants pass. Profiling code shows no time consumer. Feature set is implemented in full.

Challenges Soft Mask values are determined by a curve represented as a Vector of gray values. Each value index position corresponds to the distance to the center of the Mask. For a Scalar approach getting value one by one using an index is something trivial. On Vc however the values needs to be in an array next to the other to allow for the best optimization. Getting the space values from the vector into the Vc SIMD array was the main problem to solve. Luckily there was no need for in house implementation as Vc has a method to gather indexes from different regions of an array into the Vc Array. Using this method and passing the data pointer of the vector allowed to access the curve values efficiently.

Related blog posts

Commits and Differentials

Phabricator Task T8868

Differentials


Rectangular Gauss

Goal: Implement Rectangular Gauss vectorized Mask generator using Vc

Current Status: Implemented and merged to master. Speed up of 10x, mask generator reuses vcErf() implemented for Circular Gauss. Vectorize and Scalar implementation give the same result and profiling show no bottleneck

Challenges Rectangular shapes use a 2D fader to determine the gray value a the given coordinate. This value is calculated in a relation between x and y coordinates. The 2D fader made different operations depending on the x and y values received and returned the results at different moments in the function. Adapting this to use masks on vectors was the main challenge. Also since any operation done would be done for all values, the new implementation was made such that the hard double vcErf() operation was done only once per iteration.

Related blog posts

  • No related blogposts

Commits and Differentials

Phabricator Task T9010

Differentials

Rectangular Soft

Goal: Implement Rectangular Soft vectorized Mask generator using Vc

Current Status: Implemented and merged to master. Speed gains were as good as 4x, since most time is spent getting values from memory. Current implementation generates the same output as the scalar version.

Challenges Biggest challenge was to get the best performance from Vc initializations to avoid adding time to the slow memory access time. Original code used casting to get the integer part of the float value, at first this casting made the Vc code slow, but switching to internal VcIndexes made the casting super fast.

Related blog posts

  • No related blogposts

Commits and Differentials

Phabricator Task T9133

Differentials

Rectangular Default

Added to the project goals because it was very similar to the other work done and not very time consuming.

Goal: Implement Rectangular Default vectorized Mask generator using Vc

Current Status: Implemented and merged to master. Speed is almost as good as that of the fastest Vc implementation, the Default Circular. The generated mask is equal to the original scalar version, and profiling shows no bottlenecks.

Challenges: As usual with math-dependent implementations, the difference in precision made the final values vary by more than the threshold allowed in the similarity test. This difference also made a bug surface much more often with single precision. Fixing the bug and reducing the epsilon used for integer conversion allowed the masks to be equal. Default Rect does not use a 2D fader, which made it a bit tricky to mimic the logic of the scalar version.

Related blog posts

  • No related blogposts

Commits and Differentials

Phabricator Task T9344

Differentials

Stamp Mask

Goal: Optimize Predefined Mask generator using vectorized operations with Vc

Current Status: Identified the current bottleneck: the qpainter.drawImage method is not very fast. Currently the code is being reorganized to avoid repetition in highly iterative sections, and the current QImage transform is being changed from QImage >> QImage to QImage >> KisFixedPaintDevice.

TODO: Finish the implementation of the QImage transform to KisFixedPaintDevice and make the code use it. Port the new transform code to Vc operations for fast processing.

Challenges: The base code is completely different from the auto_brush code used in the previous optimizations. The first challenge was identifying the slow parts of the code, then understanding how each part of the code works to generate the mask. With the help of my mentor we discussed a plan to implement the optimization progressively, in steps. The biggest challenge will be to speed up the image transformation beyond what Qt currently does.

Related blog posts

  • No related blogposts

Commits and Differentials

Phabricator Task T9346


GSoC Work report chronicle

During the first week, in the community bonding period, I read the documentation and made a first proposal for the unit test to be used throughout the implementation. This unit test has to compare the new mask shape with the legacy one and assert that they are similar within a certain error. The unit test works, but it is not as isolated as needed, and other brush preparation steps could interfere with the brush mask testing.
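A minimal sketch of such a similarity check, assuming both masks are available as flat buffers of 8-bit gray values, could look like this. The function name masksSimilar and the buffer layout are assumptions made for illustration; the actual Krita unit test works on its own data structures.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Returns true when both masks have the same size and no pixel differs
// by more than `tolerance` gray levels.
bool masksSimilar(const std::vector<std::uint8_t>& legacy,
                  const std::vector<std::uint8_t>& vectorized,
                  int tolerance)
{
    assert(tolerance >= 0);
    if (legacy.size() != vectorized.size()) {
        return false;
    }
    for (std::size_t i = 0; i < legacy.size(); ++i) {
        // Widen to int before subtracting so the difference cannot wrap.
        const int diff = std::abs(static_cast<int>(legacy[i]) -
                                  static_cast<int>(vectorized[i]));
        if (diff > tolerance) {
            return false;
        }
    }
    return true;
}
```

A tolerance of 0 demands identical masks, while a small positive value absorbs rounding differences between the scalar and vectorized code paths.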

In the following weeks, before the coding phase, I started spending more time on IRC and the forums, helping users where I could. I began reading more about Vc and Intel AVX and started drawing a small map of the brush mask code to know exactly what was going on. A second version of the unit test was made; this time we went deeper into the code and handled the masks directly through the data pointer of KisMaskGenerator.

Coding phase

I spent the first week of the coding phase working out how to implement a fully featured Circular Gaussian. I ran into the problem of implementing an in-house erf for vectorized operations. Once this implementation passed the test, I made a quick painting test and ran the FreehandStrokeBenchmark to see whether the speed gain was larger than with the first dummy implementation. The new implementation turned out to be very fast.

In the second week my mentor asked me to create a benchmark specifically for mask generation, the idea being to have even more evidence that the new vectorized version performs much better. The benchmark did not take long to implement, and testing confirmed the speed gains seen in the other test. I sent the code for review and it was suggested that I merge it.

In the third and fourth weeks the SoftBrush implementation was born. During testing and feature implementation I realized that a feature was missing from the Gauss implementation: antialiasing. I ported the antialiasing code from the Soft Mask to the Gauss Mask (since both use the same logic in the scalar version), and while testing I discovered that the Gauss Mask failed with some softness and fading values. The image confirmed the mask was not coming out properly. I spent the next two days finding the root cause and fixing the bug, which was caused by float imprecision and one bad guard condition. The fixes also applied to Soft Brush, and we finished the initial feature-complete implementation. I did not send it for review yet, as I wanted to do more in-depth testing and optimization first.

We also used this time to help out a little with the new documentation platform. Specifically, I sent two proposals to help automate the LaTeX version of the manual: D13205 and D13204. This should make it easier to deploy the PDF version of the documentation when it is needed.

In the fifth and sixth weeks Circular Soft was sent for review and work on Rectangular Gauss began. This showed the need for a new static class holding all the extra Vc operations needed by both Gaussian implementations, to avoid repeating code. My mentor pointed out that my current 1D Fader implementation was not very good, as it made private members public (through methods), which defeated the purpose of encapsulation. A refactored 1D Fader was made, and with those lessons learned I proceeded to make the Vc implementation of the 2D Fader. The finished implementation had a missing file, so the review process was delayed. In the end the review found a regression on smaller brushes, which took me a few days to figure out: at first I thought it was due to my code, but it turned out that I had turned on supersampling, which is mainly used for the Default brush.

Seventh and eighth weeks. I used this time to fix bugs and improve the tests, but since I was making good progress I started working on Rectangular Soft, which was pushed to my branch one week later. The implementation took longer than expected, because it had to cover all test cases and because I was testing other options for generating the square image. It took another full week to finish all the tests on how to make it faster (this included a version that used the stack instead of the heap, which did not prove fruitful).

Coding for GSoC had to be put aside during this period, since my university trimester wraps up around the end of July, and as part of my final project I had to finish a system and run a heavy simulation. Attempts to run it on Ubuntu proved unsuccessful, so I had to use my main computer for the simulations, which took longer than the 3 days anticipated, up to a week and a half. Running two simulations at a time produced errors and made the main computer unavailable for any other work. In the end it took almost two weeks, because on day 5 I reduced the load to one simulation at a time. This let me work on GSoC again, though with every recompile limited to one core.

The slowdown caused by the simulation was extended a little by the write-up of the project report (a small thesis). However, I made it work by coding for GSoC half of the day and writing the other half, which allowed me to finish the project in time. :]

Back to coding! But there is barely any time left to code!

Final weeks. With two weeks to go, my mentor and I drew up a plan for the remaining mask optimization. Since Rectangular Default used the same auto_brush framework, we decided I should optimize it as well and then work on the Stamp mask (officially: predefined mask). I estimated that optimizing Rectangular Default would take me at most 3 days but, as with everything, the simplest-looking problem turned out to have hidden difficulties. In the end it took me an entire week to fix all the quirks in the code and make it render the same as the original version. I also made an alternate render of this mask that is a little softer in the corners, but I haven't sent it for evaluation for inclusion. All in all, that lost week left me with one week to understand the completely new mask generator code for the predefined masks. In that week I was able to understand the code and start planning an implementation route. At the end of that week my mentor helped me lay out a concrete route for incorporating optimizations into the code progressively, leading eventually to the Vc vectorization.


Things left to do

The project met almost all of its objectives, and the finished ones are already merged into the master repository and ready to land in Krita 4.2. The implementation path for the missing mask generator is described below:

  • The predefined mask generator is being refactored to use an in-house implementation of QPainter's drawImage method that works directly with the Krita paint device, with Vc to optimize it. Once the scalar implementation works the same as the QPainter one, the idea is to create a vectorized version that speeds up all math operations by merging what is now a two-step operation into one.