Multi-threaded render discussion #1


MichaelAtOz
Administrator
This is to save clogging up GitHub.

Re Multithreaded CGAL geometry evaluation #1980.
Snapshot 2017.04.05...multithread here. You need to enable the feature in preferences.

I have some spooky results to collate, it will take a little while.

In the meantime, I'd be interested in the results of anyone rendering the following CSG file.
random-threads-k=566519-n=4000-parts=64.csg (generated by slightly different code than I posted on GitHub)
(anywhere from <2 minutes to 5+ minutes to render depending on PC; I wouldn't try it on under-powered systems)

If you don't get an error, try restricting the openscad.exe process CPU affinity to a reduced number of CPUs, about 50% or 75% of the cores/threads.


Admin - PM me if you need anything,
or if I've done something stupid...

Unless specifically shown otherwise above, my contribution is in the Public Domain; to the extent possible under law, I have waived all copyright and related or neighbouring rights to this work.
Obviously inclusion of works of previous authors is not included in the above.


The TPP is no simple “trade agreement.” Fight it! http://www.ourfairdeal.org/ time is running out!

Re: Multi-threaded render discussion #1

MichaelAtOz
Administrator
tl;dr - concurrency issues (presumably) & weird scheduling behavior.

This is all on my new box;
SSD 16 GB 4 core hyperthread  (8 threads)  i7-3770 (3.5/3.9 turbo GHz)
Windows 7 64 bit (fresh install - no updates applied - still un-authenticated)
Snapshot 2017.04.05/64
Microsoft sysinternals process-explorer monitoring, little else running.

I first came across the error using the random-thread code on my 2 core (non hyper) system, then was able to get it repeating by using the same key seed. Then as Marius pointed out, using the exported CSG would ensure the repeatability.

So I whittled the threads down to the CSG file posted above. On my new box:
"Threaded traversal phase 1: Generating 4003 leaf geometries
Threaded traversal phase 2: Spawning 16075 threads on 8 cores"

All results below are rendering that file, unless mentioned otherwise. All renders were preceded by a Flush Caches.

Background.
The i7-3770 has 4 hyperthreaded cores: "For each processor core that is physically present, the operating system addresses two virtual (logical) cores and shares the workload between them when possible. The main function of hyper-threading is to increase the number of independent instructions in the pipeline; it takes advantage of superscalar architecture, in which multiple instructions operate on separate data in parallel. With HTT, one physical core appears as two processors to the operating system, allowing concurrent scheduling of two processes per core. In addition, two or more processes can use the same resources: if resources for one process are not available, then another process can continue if its resources are available."

So to Windows it looks like 8 CPUs 0-7, each pair (0-1,2-3,4-5,6-7) is one core, with 1 physical processor and one 'logical' processor. "the logical processors in a hyper-threaded core share the execution resources. These resources include the execution engine, caches, and system bus interface; the sharing of resources allows two logical processors to work with each other more efficiently, and allows a logical processor to borrow resources from a stalled logical core (assuming both logical cores are associated with the same physical core). A processor stalls when it is waiting for data it has sent for so it can finish processing the present thread."

So, in terms of CPU use, one thread running flat out on one logical CPU shows as 12.5% total CPU usage (100% / 8). With good pipelining, well-cached threads can be seen to use 12.5% on each of the 8 logical CPUs.
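The 12.5% figure is just one logical CPU's share of an 8-thread machine. A minimal sketch of that arithmetic follows; the function name `total_cpu_percent` is made up for illustration and is not part of any monitoring tool.

```python
def total_cpu_percent(busy_logical_cpus: float, logical_cpus: int = 8) -> float:
    """Whole-machine CPU percentage when `busy_logical_cpus` logical CPUs
    are fully busy on a machine with `logical_cpus` logical CPUs."""
    if logical_cpus <= 0:
        raise ValueError("logical_cpus must be positive")
    return 100.0 * busy_logical_cpus / logical_cpus


# One saturated thread on the 4-core / 8-thread box described above.
print(total_cpu_percent(1))
# All eight logical CPUs saturated.
print(total_cpu_percent(8))
```

So a per-thread reading of ~6% in a process monitor would correspond to a thread using roughly half of one logical CPU.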

Note that the multi-threaded C code uses a non-yielding spinlock: a tight loop waiting for the lock to be freed. I speculate here that, if I wrote the kernel and/or the compiler optimiser, I would detect a spin condition and yield to other processes rather than chewing the CPU.
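To make the yielding/non-yielding distinction concrete, here is a minimal sketch of a spin-wait that gives up the CPU between attempts. It is not the lock used in the pull request; the class name `YieldingSpinLock` and the demo helpers are hypothetical, and Python is used only to show the shape of the loop.

```python
import threading
import time


class YieldingSpinLock:
    """Illustrative spin lock: retries a non-blocking acquire in a loop,
    but lets other threads run between attempts."""

    def __init__(self) -> None:
        self._lock = threading.Lock()

    def acquire(self) -> None:
        # A non-yielding spinlock would retry here with an empty loop body.
        while not self._lock.acquire(blocking=False):
            time.sleep(0)  # give the scheduler a chance to run another thread

    def release(self) -> None:
        self._lock.release()

    def __enter__(self) -> "YieldingSpinLock":
        self.acquire()
        return self

    def __exit__(self, *exc_info) -> None:
        self.release()


def hammer(lock: YieldingSpinLock, counter: list, rounds: int) -> None:
    """Increment a shared counter `rounds` times under the lock."""
    for _ in range(rounds):
        with lock:
            counter[0] += 1


def run_demo(threads: int = 4, rounds: int = 1000) -> int:
    """Run several threads contending for the lock; return the final count."""
    lock = YieldingSpinLock()
    counter = [0]
    workers = [
        threading.Thread(target=hammer, args=(lock, counter, rounds))
        for _ in range(threads)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return counter[0]


print(run_demo())
```

The only difference from a tight spin is the body of the retry loop. Whether yielding there helps or hurts depends on how long the lock is typically held.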

Windows allows the "processor affinity" to be set for a process, 8 check boxes to select which threads are available to the process, all 8 by default.

Results.
tl;dr - changing the affinity causes a CGAL error in some configurations (I speculate that this is due to a concurrency issue) and affects the thread behavior in strange ways (I speculate that the spinlock may have some issue with CPU cache architecture peculiarities).

End of part 1 - I need a break...  

Re: Multi-threaded render discussion #1

kintel
Administrator
In reply to this post by MichaelAtOz

> On Apr 16, 2017, at 20:38, MichaelAtOz <[hidden email]> wrote:
>
> This is to save clogging up GitHub.
>
> Re  Multithreaded CGAL geometry evaluation #1980
> <https://github.com/openscad/openscad/pull/1980>  .
> Snapshot  2017.04.05...multithread here
> <http://files.openscad.org/snapshots/>  . You need to enable the feature in
> preferences.
>
> I have some spooky results to collate, it will take a little while.
>
> In the mean time, I'd be interested in results of anyone rendering the
> following CSG file.

$ time ./OpenSCAD.app/Contents/MacOS/OpenSCAD random-threads.csg --enable thread-traversal -o out.stl
Threaded traversal phase 1: Generating 4003 leaf geometries
Threaded traversal phase 2: Spawning 16075 threads on 8 cores
real 14m40.628s
user 18m8.360s
sys 5m29.745s

This ran without errors, but only saturated 4 out of 8 cores.
Mac OS X doesn’t do per-process CPU affinity.
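As a rough cross-check on the `time` output above, (user + sys) / real gives the average number of logical CPUs kept busy over the whole run. This is a sketch of that arithmetic only; the helper names are hypothetical.

```python
import re


def to_seconds(stamp: str) -> float:
    """Convert a `time`-style stamp such as '14m40.628s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time stamp: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def average_busy_cpus(real: str, user: str, sys: str) -> float:
    """CPU time (user + sys) divided by wall-clock time."""
    return (to_seconds(user) + to_seconds(sys)) / to_seconds(real)


ratio = average_busy_cpus("14m40.628s", "18m8.360s", "5m29.745s")
print(f"average busy logical CPUs: {ratio:.2f}")
```

For the figures above this comes out at about 1.6, which suggests the four cores were only saturated for part of the run.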

 -Marius


_______________________________________________
OpenSCAD mailing list
[hidden email]
http://lists.openscad.org/mailman/listinfo/discuss_lists.openscad.org

Re: Multi-threaded render discussion #1

MichaelAtOz
Administrator
In reply to this post by MichaelAtOz
Part 2.

Results. Cont.

I indicate the affinity on/off as 0/1 for threads 0-7, i.e. A=11111111 means all CPUs available.
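For anyone scripting these runs, the A= notation can be turned into a CPU list and a bitmask. This sketch assumes the leftmost digit is CPU 0, which the post does not actually state; both function names are hypothetical.

```python
from typing import List


def parse_affinity(bits: str) -> List[int]:
    """Indices of enabled logical CPUs for a string like '11110000'.

    Assumption: the leftmost character corresponds to CPU 0.
    """
    if not bits or set(bits) - {"0", "1"}:
        raise ValueError(f"expected a non-empty string of 0s and 1s, got {bits!r}")
    return [index for index, flag in enumerate(bits) if flag == "1"]


def affinity_mask(bits: str) -> int:
    """Integer bitmask with bit i set when CPU i is enabled."""
    return sum(1 << index for index in parse_affinity(bits))


for setting in ("11111111", "11110000", "10101010", "01010101"):
    print(setting, parse_affinity(setting), hex(affinity_mask(setting)))
```

The hexadecimal form is the kind of mask that the Windows `start /affinity` command takes, with bit 0 standing for CPU 0.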

I had a textual display of the threads and a performance graph visible.
This is a representative example of a multi-threaded run; the lines on the CPU graph are (manually sized) at around 1 thread (12.5%). I forget the configuration at the time.


------------------------------------

A baseline.

A=01010101 -  thread feature disabled (affinity doesn't matter)
One thread @ 12.x% most of the time.

Compiling design (CSG Tree generation)...
Rendering Polygon Mesh using CGAL...
Geometries in cache: 15916
Geometry cache size in bytes: 11829776
CGAL Polyhedrons in cache: 60
CGAL cache size in bytes: 104662192
Total rendering time: 0 hours, 4 minutes, 15 seconds
   Top level object is a 3D object:
   Simple:        yes
   Vertices:    26334
   Halfedges:   79002
   Edges:       39501
   Halffacets:  40766
   Facets:      20383
   Volumes:      3451
Rendering finished.



-------------------------------------------------

All following results have multi-thread feature enabled.

-------------------------------------------------

Affinity=11110000
4 (busy) threads @ ~12.x%, tailing off to one thread @ 12.x%
Compiling design (CSG Tree generation)...
Rendering Polygon Mesh using CGAL...
Threaded traversal phase 1: Generating 4003 leaf geometries
Threaded traversal phase 2: Spawning 16075 threads on 8 cores
Geometries in cache: 15916
Geometry cache size in bytes: 11829776
CGAL Polyhedrons in cache: 60
CGAL cache size in bytes: 104662192
Total rendering time: 0 hours, 3 minutes, 17 seconds
   Top level object is a 3D object:
   Simple:        yes
   Vertices:    26334
   Halfedges:   79002
   Edges:       39501
   Halffacets:  40766
   Facets:      20383
   Volumes:      3451
Rendering finished.



---------------------------------------------------

Affinity=11111111
Used all threads, mostly in chunks @ ~12.x%, with some chunks @ ~7-9%
Compiling design (CSG Tree generation)...
Rendering Polygon Mesh using CGAL...
Threaded traversal phase 1: Generating 4003 leaf geometries
Threaded traversal phase 2: Spawning 16075 threads on 8 cores
ERROR: CGAL error in CGALUtils::applyBinaryOperator union: CGAL ERROR: assertion violation! Expr: cet->get_index() == ce->twin()->get_index() File: /opt/mxe/usr/x86_64-w64-mingw32.static/include/CGAL/Nef_3/SNC_external_structure.h Line: 1169
Geometries in cache: 15916
Geometry cache size in bytes: 11829776
CGAL Polyhedrons in cache: 63
CGAL cache size in bytes: 104536656
Total rendering time: 0 hours, 3 minutes, 10 seconds
   Top level object is a 3D object:
   Simple:        yes
   Vertices:    25902
   Halfedges:   77706
   Edges:       38853
   Halffacets:  40120
   Facets:      20060
   Volumes:      3399
Rendering finished.



Note the stats differ from the baseline run (e.g. 25902 vertices vs 26334).
--------------------------------------------------

Affinity=10101010
Watching the thread list, it appeared to use all 4 CPUs (~8 threads @ ~6%), but see the CPU graph
[perhaps I wasn't paying attention, so this could be a misinterpretation...]
Compiling design (CSG Tree generation)...
Rendering Polygon Mesh using CGAL...
Threaded traversal phase 1: Generating 4003 leaf geometries
Threaded traversal phase 2: Spawning 16075 threads on 8 cores
Geometries in cache: 15916
Geometry cache size in bytes: 11829776
CGAL Polyhedrons in cache: 60
CGAL cache size in bytes: 104662192
Total rendering time: 0 hours, 3 minutes, 31 seconds
   Top level object is a 3D object:
   Simple:        yes
   Vertices:    26334
   Halfedges:   79002
   Edges:       39501
   Halffacets:  40766
   Facets:      20383
   Volumes:      3451
Rendering finished.



----------------------------------------------------------

A=11010111
Used most of 6 CPUs at times
Compiling design (CSG Tree generation)...
Rendering Polygon Mesh using CGAL...
Threaded traversal phase 1: Generating 4003 leaf geometries
Threaded traversal phase 2: Spawning 16075 threads on 8 cores
ERROR: CGAL error in CGALUtils::applyBinaryOperator union: CGAL ERROR: assertion violation! Expr: e_below != SHalfedge_handle() File: /opt/mxe/usr/x86_64-w64-mingw32.static/include/CGAL/Nef_3/SNC_FM_decorator.h Line: 417
Geometries in cache: 15916
Geometry cache size in bytes: 11829776
CGAL Polyhedrons in cache: 66
CGAL cache size in bytes: 35077472
Total rendering time: 0 hours, 1 minutes, 30 seconds
   Top level object is a 3D object:
   Simple:        yes
   Vertices:      570
   Halfedges:    1710
   Edges:         855
   Halffacets:    610
   Facets:        305
   Volumes:        11
Rendering finished.



Note stats and missing objects in display. Different error.

----------------------------------------------------------

A=00011000
Used 2 CPUs
Compiling design (CSG Tree generation)...
Rendering Polygon Mesh using CGAL...
Threaded traversal phase 1: Generating 4003 leaf geometries
Threaded traversal phase 2: Spawning 16075 threads on 8 cores
Geometries in cache: 15916
Geometry cache size in bytes: 11829776
CGAL Polyhedrons in cache: 60
CGAL cache size in bytes: 104662192
Total rendering time: 0 hours, 3 minutes, 23 seconds
   Top level object is a 3D object:
   Simple:        yes
   Vertices:    26334
   Halfedges:   79002
   Edges:       39501
   Halffacets:  40766
   Facets:      20383
   Volumes:      3451
Rendering finished.



This is representative of the chunks where CPU was < 12.x%



Red is a finished thread (showing CPU at the time it finished); green is a new thread since the last refresh, i.e. it shows two sequential pairs of threads.

-----------------------------------------------------------

This is the weird scheduling one...

A=01010101
Only used 1 CPU
Compiling design (CSG Tree generation)...
Rendering Polygon Mesh using CGAL...
Threaded traversal phase 1: Generating 4003 leaf geometries
Threaded traversal phase 2: Spawning 16075 threads on 8 cores
Geometries in cache: 15916
Geometry cache size in bytes: 11829776
CGAL Polyhedrons in cache: 60
CGAL cache size in bytes: 104662192
Total rendering time: 0 hours, 4 minutes, 2 seconds
   Top level object is a 3D object:
   Simple:        yes
   Vertices:    26334
   Halfedges:   79002
   Edges:       39501
   Halffacets:  40766
   Facets:      20383
   Volumes:      3451
Rendering finished.



This is representative of the thread display



i.e. it shows three sequential threads.

----------------------------------------------------------
Can there be a non-concurrency reason why the errors occur or not, purely based on the number of threads available?

There were multiple other CGAL errors, which I'll post next, using different geometry.
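For comparison, the "Total rendering time" lines of the error-free runs above can be expressed as speedups over the single-threaded baseline (4 minutes, 15 seconds). This is a sketch of the arithmetic only; the two runs that hit CGAL errors are left out because their output geometry differs.

```python
def seconds(hours: int, minutes: int, secs: int) -> int:
    """Convert an 'H hours, M minutes, S seconds' render time to seconds."""
    return hours * 3600 + minutes * 60 + secs


def speedup(baseline_s: float, run_s: float) -> float:
    """How many times faster a run is than the baseline."""
    return baseline_s / run_s


BASELINE = seconds(0, 4, 15)  # thread feature disabled

# Error-free threaded runs, keyed by affinity setting.
runs = {
    "11110000": seconds(0, 3, 17),
    "10101010": seconds(0, 3, 31),
    "00011000": seconds(0, 3, 23),
    "01010101": seconds(0, 4, 2),
}

for affinity, elapsed in runs.items():
    print(f"A={affinity}: {elapsed} s, speedup {speedup(BASELINE, elapsed):.2f}x")
```

On these numbers the best error-free run (A=11110000) is roughly 1.3x faster than the baseline, and the one-CPU A=01010101 run is only marginally faster.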

End of part 2.


Re: Multi-threaded render discussion #1

MichaelAtOz
Administrator
In reply to this post by MichaelAtOz
Note all above was 3D objects.

Re: Multi-threaded render discussion #1

MichaelAtOz
Administrator
In reply to this post by kintel
kintel wrote
$ time ./OpenSCAD.app/Contents/MacOS/OpenSCAD random-threads.csg --enable thread-traversal -o out.stl
Threaded traversal phase 1: Generating 4003 leaf geometries
Threaded traversal phase 2: Spawning 16075 threads on 8 cores
real 14m40.628s
user 18m8.360s
sys 5m29.745s

This ran without errors, but only saturated 4 out of 8 cores.
Mac OS X doesn’t do per-process CPU affinity..

 -Marius
Hmmm...

a. IANALG (I am not a Linux guru)
b. Does OS X have this? Or perhaps kick off two competing renders; the point of the exercise is to cause thread contention, possibly affecting the order of execution. Tho I'm wondering if the spinlock is not as atomic as it should be, given the CPU cache architecture...

I have been thinking that there may be a need for a 'threads offset' option, with a value of +/-n, so you can use fewer threads or over-commit them. Fewer if you want to reserve capacity for other things, like funny kitten videos, while it renders in the background; over-commit because, as you see above, there are periods where CPU is free but no threads are available to use it. Experience will show what degree of lock contention occurs when over-committing...
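A minimal sketch of how such an offset could be resolved into a worker count follows. No such option exists in OpenSCAD as far as this thread shows; the function name `worker_count` and the clamp to at least one worker are assumptions made for illustration.

```python
import os
from typing import Optional


def worker_count(offset: int = 0, logical_cpus: Optional[int] = None) -> int:
    """Number of worker threads for a hypothetical 'threads offset' option.

    A negative offset reserves CPUs for other work, a positive offset
    over-commits. The result is never less than one worker.
    """
    if logical_cpus is None:
        logical_cpus = os.cpu_count() or 1
    return max(1, logical_cpus + offset)


print(worker_count(-2, logical_cpus=8))  # leave two logical CPUs free
print(worker_count(+4, logical_cpus=8))  # over-commit by four threads
```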

Re: Multi-threaded render discussion #1

MichaelAtOz
Administrator
In reply to this post by kintel
kintel wrote
This ran without errors, but only saturated 4 out of 8 cores.
Mac OS X
I presume you have had occasions where >4 threads are saturated?
What model machine & CPU do you have & OS X version?
My detailed Apple knowledge is in the garage with my 1st gen Mac 128KB...

I'm wondering if OS X handles locks differently...

Re: Multi-threaded render discussion #1

kintel
Administrator
> On Apr 17, 2017, at 18:56, MichaelAtOz <[hidden email]> wrote:
>
> I presume you have had occasions where >4 threads are saturated?
> What model machine & CPU do you have & OS X version?

This might just be a side effect of how this was implemented. I’ve got a quad-core i7 with hyperthreading, so a heavily CPU-bound process with spinlocks may very well grab one entire core. This is on OS X 10.10.5.

 -Marius




Re: Multi-threaded render discussion #1

MichaelAtOz
Administrator
It grabbed all 8 on Windows, frequently at ~12.5% each, hence both hyperthreads at ~100%, must be good cache (CPU) interleaving (or sitting there spinning the lock, wish there was a way to tell). Other times ~6-7% max, not sure what's happening then, possibly both halves getting cache misses. (that was 3D)

Anyway...

I know this is Shotgun-Testing® (lots of random stuff), but I had many goes with 256 shotguns firing ~1.5 MegaThreads® (my new clothing line) of slightly-non-trivial 2D geometries, without an error (detected). So I think 2D may be more thread safe. I suspect it is older & more mature?

However threads (8) were getting ~5% each for most of the peak shown here, ~50% total CPU.



I'm about to do performance comparisons.

Re: Multi-threaded render discussion #1

codifies
In reply to this post by MichaelAtOz
I tried your stress test from the first post of this thread, but I'm not sure how to monkey with affinity under Linux, never really having had the need. Anyhow, it works okay, but I did notice that the very last thread seems to take a significant portion of the overall render time.

I then tested it with a number of "sweeps" (a module that usually produces 180-360 hulls in a loop). When there are multiple sweeps (>3) it keeps all the cores busy, but again with the last thread taking its time solo...

With a single sweep only one core was ever active. I did try tinkering with the loop, i.e. a group inside the loop and not using children(), but couldn't seem to interest it in using multiple cores.

While this isn't a "catch all" solution, it is certainly a great step forward, and even if it means changing coding style a little to make it easier to portion up, it's still worth using!

Is there an *easy* way in Linux to alter core affinity to test any issues with the stress test?

great work!

Re: Multi-threaded render discussion #1

MichaelAtOz
Administrator
As I said, I'm not a Linux guru; have you seen this?
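For reference, Linux does have a per-process affinity tool: `taskset` from util-linux, whose `-c` option takes a CPU list. The sketch below only builds the command line. The binary name `openscad` and the file names are assumptions carried over from the OS X command earlier in the thread, and `taskset_command` is a hypothetical helper.

```python
import shlex
from typing import Iterable, List


def taskset_command(cpus: Iterable[int], command: List[str]) -> List[str]:
    """Prefix `command` with a taskset call pinning it to the given CPUs."""
    cpu_list = ",".join(str(cpu) for cpu in sorted(set(cpus)))
    if not cpu_list:
        raise ValueError("at least one CPU index is required")
    return ["taskset", "-c", cpu_list, *command]


# Assumed Linux equivalent of the command line used on OS X above.
render = ["openscad", "random-threads.csg",
          "--enable", "thread-traversal", "-o", "out.stl"]

# Restrict the render to the first four logical CPUs.
print(shlex.join(taskset_command([0, 1, 2, 3], render)))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run` to launch the pinned render directly.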

Re: Multi-threaded render discussion #1

MichaelAtOz
Administrator
(copy of what I posted on GitHub)
MichaelAtOz wrote
Other times ~6-7% max, not sure what's happening then, possibly both halves getting cache misses. (that was 3D)
...
However threads (8) were getting ~5% each for most of the peak shown here, ~50% total CPU.
...
I'm about to do performance comparisons.
I'm still working thru it, but I'm now leaning toward the view that these lower-CPU% threads are micro-workloads, for want of a better term: they finish before consuming much CPU. I'm still looking at the performance info, but close to 100% of threads record zero CPU (below some threshold level); only a handful record more (e.g. ~300/~50,000 2D, ~1,700/~50,000 3D). [small sample size, need more varied CSG workload, still working on it]

ATM 2D multi-threading is worse than single-threaded, sometimes up to 200% [elapsed]. Probably due to the above. Whether a chunkier thread/workload allocation, where thread overheads would be less, would be better needs consideration.
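To illustrate what a chunkier allocation could look like, here is a minimal sketch that groups many tiny tasks into batches, so each unit of thread work covers several of them. It is not how OpenSCAD schedules its traversal; `chunked` and `run_chunked` are hypothetical names, and Python threads are used only to show the batching shape, not to measure performance.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def chunked(items: Sequence, size: int) -> List[Sequence]:
    """Split `items` into consecutive chunks of at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [items[start:start + size] for start in range(0, len(items), size)]


def run_chunked(func: Callable, items: Sequence, chunk_size: int,
                workers: int = 4) -> list:
    """Apply `func` to every item, submitting one task per chunk
    rather than one task per item. Results keep the input order."""

    def run_chunk(chunk: Sequence) -> list:
        return [func(item) for item in chunk]

    results: list = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for part in pool.map(run_chunk, chunked(items, chunk_size)):
            results.extend(part)
    return results


# 50,000 micro-workloads become 50 pool tasks of 1,000 items each.
squares = run_chunked(lambda x: x * x, list(range(50_000)), chunk_size=1_000)
print(len(squares), squares[:5])
```

The trade-off is the usual one: larger chunks amortise per-task overhead, but leave fewer tasks to balance across cores near the end of the run.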