If you can't split the matrices themselves, maybe dividing them per component (or set of components) helps.
For example, split the base 3x3 over three executables, each holding one row of that base matrix inside its own major matrix. Then either use a fourth exe or make one of them the 'master' to fetch the components from all three sources per element when needed.
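A rough sketch of that layout, in Python purely for illustration. It assumes each element of the major matrix is itself a 3x3 base matrix, which is my reading of the question. `RowStore` and `Master` are hypothetical names, and each `RowStore` is an in-process stand-in for what would be a separate executable:

```python
from dataclasses import dataclass, field


@dataclass
class RowStore:
    """Stand-in for one executable: holds one row of every element's 3x3."""
    rows: dict = field(default_factory=dict)

    def put(self, i, j, row):
        self.rows[(i, j)] = tuple(row)

    def get(self, i, j):
        return self.rows[(i, j)]


class Master:
    """Scatters each 3x3 element over three stores and reassembles it on demand."""

    def __init__(self):
        self.stores = [RowStore() for _ in range(3)]

    def set_element(self, i, j, base3x3):
        # Row k of the base matrix goes to store k.
        for k, store in enumerate(self.stores):
            store.put(i, j, base3x3[k])

    def get_element(self, i, j):
        # Fetch one row from each of the three sources.
        return [list(store.get(i, j)) for store in self.stores]
```

Each store then only needs about a third of the total memory, at the cost of three lookups per element.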
Also, is there a valid reason for supporting 32-bit? Even the multiple-exe approach needs a 64-bit OS in the end to be useful, so it seems like a lot of overhead just for having a 32-bit exe. A native 64-bit executable would probably also perform better with the large float types you are using.
On the other hand, if you set up the communication between the exes to use TCP/IP (localhost), you potentially open the door to a primitive cluster implementation. This would also solve running out of physical memory (even on a 64-bit OS), because the workers could then be moved to other machines.
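To give an idea of how little is needed for the TCP/IP variant, here is a minimal sketch using Python's standard `socketserver` module. The one-line text protocol (`"i j\n"` in, three numbers out) and the names `RowHandler`, `start_worker` and `fetch_row` are made up for this example. The workers run as threads here only to keep it self-contained; in practice each would be its own exe, possibly on another machine:

```python
import socket
import socketserver
import threading


class RowHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # Each request line is "i j"; reply with that element's row.
        for line in self.rfile:
            i, j = map(int, line.split())
            row = self.server.rows[(i, j)]
            reply = " ".join(repr(x) for x in row) + "\n"
            self.wfile.write(reply.encode())


def start_worker(rows):
    # Port 0 lets the OS pick a free port; see server.server_address.
    server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), RowHandler)
    server.daemon_threads = True
    server.rows = rows  # hypothetical attribute holding this worker's data
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_row(address, i, j):
    # The master side: ask one worker for its row of element (i, j).
    with socket.create_connection(address) as conn:
        conn.sendall(f"{i} {j}\n".encode())
        reply = conn.makefile("r").readline()
    return [float(x) for x in reply.split()]
```

The master would call `fetch_row` once per worker to rebuild an element. Swapping `127.0.0.1` for a real host address is the only change needed to spread the workers over several machines, although opening a connection per fetch is slow and a real version would keep the connections open.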