Discussion: interest in an asynchronous linear algebra library?
Oswin Krause
2015-11-19 14:55:52 UTC
Hi list,

Disclaimer: What I am proposing here does not exist (yet), but it will
in the near future, and if there is interest I would like to give
something back to the Boost community.

I would like to propose a new library to Boost: asynchronous BLAS.

What is it and what can I do with it?
It is a linear algebra library where all computations are carried out in
an asynchronous manner on GPU and CPU. Computing a matrix-matrix product
does not block the current thread but instead launches a task which at
some point returns with a result, allowing the main thread to continue
until the result is needed. GPU and CPU tasks can be started at the same
time and depend on each other (e.g. a CPU computation waits until the GPU
has returned some needed result, and vice versa).

Why is this important? Which problems does it solve?
Many linear algebra libraries are targeted toward either
a) solving a few large linear algebra problems
b) solving many small similar problems

This library is targeted towards the middle ground: a medium number of
decently sized problems. Here the difficulty is that it is often hard to
use all the available computing resources: a single problem will not
exhaust the device, and there is no easy way to launch several similar
tasks in parallel.
This problem occurs often in GPU computing, where out-of-order execution
of kernels is used to fill up the device. Similarly, it is often hard to
make CPU and GPU work together, as GPU tasks are asynchronous while CPU
tasks are often synchronous, and it is not straightforward to couple
both without blocking either the CPU or the GPU.

How does it work?
Expression templates similar to boost::ublas are used to create a
dependency graph: which objects are used in which computations, and which
tasks have to wait to ensure a correct outcome?

A simple case would be two independent computations like

matrix A=prod(B,trans(B));
matrix C=prod(trans(B),B);

A normal library like uBLAS would wait until A is computed before C is
evaluated. Instead, it can be more efficient to just add the two tasks
to a list and let worker threads execute them as soon as resources are
available. First the computation of A is put into the graph, with A
marked as "being written to" and B marked as "being read". When C is
added to the list, it does not depend on the first computation, as
several tasks can read from B at the same time, so another worker thread
can grab the task and execute it.

If we now add the task

B += prod(D,prod(C,trans(D)));

the library computes it as

matrix temp = prod(C,trans(D));
B += prod(D,temp);

temp can be computed as soon as C is computed. But B can only be
computed after A and temp are computed, as we can only write to B after
all previous tasks using B have finished.
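
To make the bookkeeping concrete, here is a minimal sketch of the rule
used above (illustrative code only, not the library itself): every
variable remembers its pending readers and its pending writer; a new
read waits only for a pending write, while a new write waits for
everything.

#include <cstdio>
#include <vector>

using task_id = int;

struct var_state {
    std::vector<task_id> readers; // tasks currently reading this variable
    task_id writer = -1;          // task currently writing it, if any
};

// a new read of v must wait only for the pending write
std::vector<task_id> read_deps(var_state const& v) {
    return v.writer == -1 ? std::vector<task_id>{}
                          : std::vector<task_id>{v.writer};
}

// a new write of v must wait for the pending write and all readers
std::vector<task_id> write_deps(var_state const& v) {
    std::vector<task_id> deps = v.readers;
    if (v.writer != -1) deps.push_back(v.writer);
    return deps;
}

int main() {
    var_state B;
    B.readers.push_back(1); // task 1: A = prod(B,trans(B)) reads B
    // task 2: C = prod(trans(B),B) also only reads B: no dependency
    std::printf("task 2 waits for %zu tasks\n", read_deps(B).size()); // 0
    B.readers.push_back(2);
    // task 3: B += ... writes B and must wait for tasks 1 and 2
    std::printf("task 3 waits for %zu tasks\n", write_deps(B).size()); // 2
}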

Are you qualified?
I am the main contributor to https://github.com/Shark-ML/Shark/
and as part of it I forked ublas:
https://github.com/Shark-ML/Shark/tree/master/include/shark/LinAlg/BLAS
This fork now uses several linear algebra libraries as possible backends
(ATLAS, OpenBLAS, ...).


What is the timeline?
I would start developing a minimal example ASAP, based on the rewrite of
ublas above, with a minimal set of features (only dense square matrices)
and basic GPU support.

Opinions?

Best,
Oswin

Ben Pope
2015-11-19 16:58:32 UTC
Post by Oswin Krause
I would like to propose a new library to Boost: asynchronous BLAS.
[proposal details snipped]
This sounds like a specific solution to a general need for schedulers.

Have you looked at the current proposals from the various study groups,
such as SG1? Do they not help to solve your problem? Now is the time to
speak up if they don't. Go find Michael Wong or Hans Boehm.

Ben




Oswin Krause
2015-11-19 18:13:13 UTC
Hi,
Post by Ben Pope
This sounds like a specific solution to a general need for schedulers.
This proposal can work on top of any scheduler that allows for
dependency graphs. You could see this library as a specific application
of scheduling linear algebra tasks, because we know exactly how the
dependencies are and how to order stuff. Therefore, we can completely
hide the scheduling.
Post by Ben Pope
Have you looked at the current proposals from the various study groups,
such as SG1? Do they not help to solve your problem? Now is the time
to speak up if they don't. Go find Michael Wong or Hans Boehm.
I am not sure about the current state of the proposals; I only remember
a recent discussion about executors where I raised concerns about
possible deadlocks and missing guarantees. In any case, I would not know
where to start looking for the current state of the proposals.

I found this here:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4414.pdf

and I doubt one could implement dependency graphs with that proposal,
as the executor neither returns references to scheduled tasks nor
allows for explicit dependencies (I cannot use futures, as I want
operations to use existing objects as much as possible rather than
generate new objects, since that can exhaust memory pretty easily).
There is also no wording about "what happens when a work package adds
more work packages to the executor and waits for them", which makes any
thread_pool proposal worthless, as blocking worker threads might result
in deadlocks. I further think there is no hope that we get unified
dependency handling which allows tasks to be handled by different
executors/schedulers (a CPU and a GPU scheduler, for example).
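
To spell the deadlock out, here is a self-contained sketch (my own
illustration, not code from any proposal): a pool with a single worker
in which a task submits a subtask and then waits for it. The worker is
busy with the outer task, so the subtask never runs and the program
hangs.

#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <thread>

class single_thread_pool {
    std::deque<std::function<void()>> queue_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
    std::thread worker_;
public:
    single_thread_pool() : worker_([this] {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !queue_.empty(); });
                if (queue_.empty()) return; // done and drained
                task = std::move(queue_.front());
                queue_.pop_front();
            }
            task(); // while this runs, nothing else gets dequeued
        }
    }) {}
    ~single_thread_pool() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
        worker_.join();
    }
    template<class F>
    std::future<void> submit(F f) {
        auto p = std::make_shared<std::packaged_task<void()>>(std::move(f));
        std::future<void> fut = p->get_future();
        { std::lock_guard<std::mutex> lk(m_); queue_.emplace_back([p] { (*p)(); }); }
        cv_.notify_one();
        return fut;
    }
};

int main() {
    single_thread_pool pool;
    auto outer = pool.submit([&pool] {
        auto inner = pool.submit([] { /* subtask */ });
        inner.wait(); // the only worker is blocked right here: deadlock
    });
    outer.wait(); // never returns
}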

Best,
Oswin

Beman Dawes
2015-11-19 21:59:34 UTC
On Thu, Nov 19, 2015 at 1:13 PM, Oswin Krause wrote:
Post by Oswin Krause
I am not sure about the current state of the proposals; I only remember
a recent discussion about executors where I raised concerns about
possible deadlocks and missing guarantees. In any case, I would not know
where to start looking for the current state of the proposals.
The C++ committee's papers are a good place to start. See
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/

Work from newest (at the end) to oldest (at the top).

You might be particularly interested in:

P0159R0, Draft of Technical Specification for C++ Extensions for
Concurrency:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/p0159r0.html

N4514, Technical Specification for C++ Extensions for Transactional
Memory:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4514.pdf

N4507, Technical Specification for C++ Extensions for Parallelism:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4507.pdf

--Beman

Sebastian Schaetz
2015-11-21 05:12:19 UTC
Hi Oswin,
Post by Oswin Krause
I would like to propose a new library to Boost: asynchronous BLAS.
Asynchrony and optimization of the AST are very interesting techniques
that the DL community should leverage. There is a lot of existing
knowledge and code in this area, so in addition to looking at standards
proposals I would recommend you look into advanced expression template
libraries like NT2 [0] or Eigen [1] when researching prior work. I would
also recommend you look into HPX [2].

Best,
Sebastian

[0] https://github.com/jfalcou/nt2
[1] http://eigen.tuxfamily.org/index.php?title=Main_Page
[2] https://github.com/STEllAR-GROUP/hpx

Joel FALCOU
2015-11-21 17:42:36 UTC
Post by Sebastian Schaetz
[recommendation to look into NT2, Eigen, and HPX snipped]
FYI, we have experimental support for HPX inside NT2 already. My PhD
student recently defended on that. I still need to merge it into NT2
master.


Oswin Krause
2015-11-22 14:06:43 UTC
Post by Sebastian Schaetz
[recommendation to look into NT2, Eigen, and HPX snipped]
Hi Sebastian (and others),

I am a bit confused by the replies. Is this a "yes, there is interest
but do some proper research beforehand"?

Of course I know Eigen, Armadillo, and NT2, as I had to do some
research in the past for the linked library. For the CPU part I was
indeed planning a system with an interface similar to HPX, but I cannot
rely on HPX because it would then be a mandatory external library, and
distributed computation is currently not in the scope of the project
(honestly, I do not have enough experience with distributed computing
to start such an endeavor).

The question for me is: is there enough interest in such a thing that it
is worthwhile for me to create a minimal prototype on which a discussion
could be based?

Best,
Oswin

Hartmut Kaiser
2015-11-22 16:44:11 UTC
Post by Oswin Krause
Of course I know Eigen, Armadillo, and NT2, as I had to do some
research in the past for the linked library. For the CPU part I was
indeed planning a system with an interface similar to HPX, but I cannot
rely on HPX because it would then be a mandatory external library, and
distributed computation is currently not in the scope of the project
(honestly, I do not have enough experience with distributed computing
to start such an endeavor).
<rant>
Why do people fall into the trap of the NIH syndrome over and over again?
*sigh*
</rant>

Anyways, while HPX supports distributed computing, you don't have to use
it for this. HPX can be used just fine for SMP-style (local-only)
applications as well. In fact, we have shown that by simply switching
from std::async/std::future to hpx::async/hpx::future you can achieve a
speedup of up to a factor of 3-4 (depending on the application). That
does not even entail utilizing some of the more sophisticated features
HPX provides you with (futures extended as proposed by the Concurrency
TS, dataflow, parallel algorithms, task regions, GPU integration, etc.).
Those usually give you an additional benefit in terms of speedup,
readability, and maintainability.

Regards Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu




Robert Ramey
2015-11-22 18:32:46 UTC
Post by Hartmut Kaiser
<rant>
Why do people fall into the trap of the NIH syndrome over and over again?
*sigh*
</rant>
Hmmmm - is HPX a Boost library? Maybe you want to affiliate HPX with
Boost in the manner that Spirit is. That might make for an interesting
review discussion.

Robert Ramey

Hartmut Kaiser
2015-11-22 19:06:00 UTC
Robert,
Post by Robert Ramey
Hmmmm - is HPX a Boost library? Maybe you want to affiliate HPX with
Boost in the manner that Spirit is. That might make for an interesting
review discussion.
HPX is actually at least 5 (if not more) different libraries, so this
would fall through even before any discussion has started. It is also
~250k LOC; nobody would want to review that anyway.

While I understand the significance of being part of Boost (in terms of
legal and company BS), I don't think it would be worth the effort at
this point for us to invest time into such an endeavor. That said, if
somebody were interested in turning HPX (or part of it) into his 'own'
Boost library, please be our guest. We would be in full support of such
an effort.

We (HPX) have always used Boost as our role model for code quality,
portability requirements, coding guidelines, license, review discipline
(I believe we're even further along the road here compared to Boost),
etc. Thus, if somebody is interested in code quality and excellent
support, I don't see any reason not to use HPX in the first place, even
if it's external. It wouldn't be more 'external' than using Boost.

Regards Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu




Oswin Krause
2015-11-22 19:18:05 UTC
Hi,
Post by Hartmut Kaiser
<rant>
Why do people fall into the trap of the NIH syndrome over and over again?
*sigh*
</rant>
This is not NIH. I am happily using all kinds of external libraries.
However, for a general-purpose library, especially one targeting Boost,
I hesitate to depend on such a library without providing a reasonable
fallback implementation, especially if I cannot assume it to be readily
installed and available on a regular Unix system.

My experience with users of my libraries is that most users just want to
play around with it and do not care about it being fast and efficient.
Thus, requiring them to install a lot of external dependencies can make
them lose interest - and I do not want that.
Later, when they want to try it out for real tasks, they will remember
that they should install the external libraries, and then they can
actually make use of them and see the improvements.
(I do not work with skilled programmers but with students; they struggle
to get Boost installed on their Windows machines and then to get CMake
to find Boost, so I really don't want to add more hurdles!)
Post by Hartmut Kaiser
Anyways, while HPX supports distributed computing, you don't have to use
it for this. HPX can be used just fine for SMP-style (local-only)
applications as well. In fact, we have shown that by simply switching
from std::async/std::future to hpx::async/hpx::future you can achieve a
speedup of up to a factor of 3-4 (depending on the application).
This, for example, would be a reason for me to provide a default
implementation using std::future that I can exchange for the hpx::
versions using a preprocessor directive.
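
Roughly like this (a sketch only: the macro and namespace names are made
up, and HPX header paths may differ between versions):

#ifdef ABLAS_USE_HPX
  #include <hpx/include/future.hpp>
  namespace ablas {
      template<class T> using future = hpx::future<T>;
      using hpx::async;
  }
#else
  #include <future>
  namespace ablas {
      template<class T> using future = std::future<T>;
      using std::async;
  }
#endif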

Btw: is there a specific reason for that speedup? Is there just less
overhead in the implementation, or is it improved scheduling of tasks?
Post by Hartmut Kaiser
That does not even entail utilizing some of the more sophisticated
features HPX provides you with (futures extended as proposed by the
Concurrency TS, dataflow, parallel algorithms, task regions, GPU
integration, etc.). Those usually give you an additional benefit in
terms of speedup, readability, and maintainability.
And this would be an example where I would rather hold off on using
those features unless I can find ways to make everything work with some
default implementation.


Best,
Oswin

Hartmut Kaiser
2015-11-22 19:30:40 UTC
Post by Oswin Krause
This, for example, would be a reason for me to provide a default
implementation using std::future that I can exchange for the hpx::
versions using a preprocessor directive.
Is there a specific reason for that speedup? Is there just less
overhead in the implementation, or is it improved scheduling of tasks?
It's mainly because of a) very efficient user-level threading, and b)
low-overhead synchronization primitives.

Regards Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu




Joel FALCOU
2015-11-23 12:58:54 UTC
Post by Oswin Krause
This is not NIH. I am happily using all kinds of external libraries.
However, for a general-purpose library, especially one targeting Boost,
I hesitate to depend on such a library without providing a reasonable
fallback implementation, especially if I cannot assume it to be readily
installed and available on a regular Unix system.
Having a CMake script install whatever dependencies you require is not
terrible to write.

We tried to do our own async in NT2, then switched to HPX because it
was too much work.


Oswin Krause
2015-12-03 09:19:52 UTC
Hi,

Just a small heads-up.

There is now a minimal PoC at https://github.com/Ulfgard/aBLAS . Minimal
in the sense that I took my existing LinAlg, ripped it apart, and
rewrote it partially to fit the new needs of the library. Only the CPU
part is implemented yet. I am open to suggestions, helpful advice, and
of course people who are interested in working on it. I am not too happy
with the scheduling interface right now, and its implementation looks a
bit slower than necessary, but I think this will evolve over time.

Two basic examples showing what the library can already do are given in
examples/. For uBLAS users this should not look too foreign.

For all who are interested, here is the basic design and the locations
in include/aBLAS/:

1. Computational kernels are implemented in kernels/ and represent
typical bindings to the BLAS 1-3 functionality as well as a default
implementation (currently only dot, gemv, gemm, and assignment are
tested and working; no explicit bindings are included yet). Kernels are
enqueued in the scheduler via the expression template mechanism, and
kernels are not allowed to enqueue other kernels recursively.
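
For illustration, a toy version of a default kernel might look like this
(simplified names and a raw-pointer interface, not the actual aBLAS
kernel signatures); an explicit binding would instead forward to a BLAS
backend such as cblas_dgemm:

#include <cstddef>

namespace kernels {
// default fallback: C += A * B for dense row-major matrices,
// with A of size m x k, B of size k x n, and C of size m x n
template<class T>
void gemm(T const* A, T const* B, T* C,
          std::size_t m, std::size_t n, std::size_t k) {
    for (std::size_t i = 0; i != m; ++i)
        for (std::size_t p = 0; p != k; ++p) {
            T a = A[i * k + p];
            for (std::size_t j = 0; j != n; ++j)
                C[i * n + j] += a * B[p * n + j];
        }
}
}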

2. A simple PoC scheduler is implemented in scheduling/scheduling.hpp.
It implements a dependency graph between work packages, and work is
enqueued into a boost::thread::basic_thread_pool when all of its
dependencies are resolved. A kernel is enqueued together with a set of
dependency_node objects which encapsulate the dependencies of the
variables the kernel uses (i.e. every variable keeps track of what its
latest dependencies are and whether those dependencies read from it or
write to it). The current interface should be abstract enough to allow
implementations using different technologies (e.g. it should be possible
to implement the scheduler in terms of HPX).

One of the tasks of the scheduler is to allow the creation of closures
in which variables are guaranteed to exist until all kernels using them
have finished, as well as moving a variable into a closure. This is used
to prevent an issue similar to the blocking destructor of
std::future<T>: instead of blocking, the variable is moved into the
scheduler, which then ensures its lifetime until all kernels are
finished. This of course requires the kernels to be called in a way that
can cope with the moved types.

What is currently missing is "user-created dependencies" to be used in
conjunction with the GPU (as GPUs are fully asynchronous, we have to
register a callback that notifies the scheduler when the GPU is done
with its computations, just as the worker threads do).
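
To illustrate the "move into closure" idea, here is a stripped-down
sketch (illustrative names, not the actual aBLAS classes): the kernel
closure shares ownership of the storage, so destroying the vector object
neither blocks nor invalidates a kernel that is still running.

#include <cstddef>
#include <functional>
#include <memory>
#include <vector>

// stand-in for the real scheduler: runs tasks immediately; the real one
// defers them to worker threads once their dependencies are resolved
struct inline_scheduler {
    void enqueue(std::function<void()> task) { task(); }
};

template<class T>
class vector {
    // the extra indirection: kernels share ownership of the storage
    std::shared_ptr<std::vector<T>> storage_;
public:
    explicit vector(std::size_t n)
    : storage_(std::make_shared<std::vector<T>>(n)) {}

    // enqueue an asynchronous x *= factor; the closure keeps the buffer
    // alive even if this vector is destroyed before the kernel runs
    void scale_async(inline_scheduler& s, T factor) {
        auto data = storage_; // shared ownership moves into the closure
        s.enqueue([data, factor] {
            for (auto& x : *data) x *= factor;
        });
    }
};

int main() {
    inline_scheduler sched;
    vector<double> v(1000);
    v.scale_async(sched, 2.0); // no blocking destructor needed: the last
}                              // kernel to finish frees the buffer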

3. Basic matrix/vector classes are implemented in matrix.hpp and
vector.hpp. The implementation is a bit convoluted in order for the
"move into closure" to work: basically, they introduce another
indirection. When a kernel is created, it references a special closure
type of the variable (vector<T>::closure_type), which references that
indirection.

4. The remaining files in include/aBLAS*.hpp implement the expression
templates, which are similar to uBLAS. Two kinds of expression are
distinguished using the CRTP classes matrix_expression<Type,Device> and
vector_expression<Type,Device>. Type is the exact type of the
expression; Device marks the device this is working on (cpu_tag or
gpu_tag). The second template parameter ensures that one cannot mix GPU
and CPU expressions unless the operation explicitly allows this. While
this looks clumsy, it removes code duplication between the different
device types, as most of the code can be used for both implementations.
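
In code, the scheme looks roughly like this (a condensed sketch; the
real classes carry far more machinery):

struct cpu_tag {};
struct gpu_tag {};

// CRTP base: Type is the concrete expression, Device the device tag
template<class Type, class Device>
struct matrix_expression {
    Type const& operator()() const {
        return static_cast<Type const&>(*this);
    }
};

template<class E1, class E2, class Device>
struct matrix_addition
: matrix_expression<matrix_addition<E1, E2, Device>, Device> {
    E1 const& lhs;
    E2 const& rhs;
    matrix_addition(E1 const& l, E2 const& r) : lhs(l), rhs(r) {}
};

// only found by overload resolution when both operands carry the same
// Device tag, so cpu and gpu expressions cannot be mixed accidentally
template<class E1, class E2, class Device>
matrix_addition<E1, E2, Device>
operator+(matrix_expression<E1, Device> const& a,
          matrix_expression<E2, Device> const& b) {
    return matrix_addition<E1, E2, Device>(a(), b());
}

struct cpu_matrix : matrix_expression<cpu_matrix, cpu_tag> {};
struct gpu_matrix : matrix_expression<gpu_matrix, gpu_tag> {};

int main() {
    cpu_matrix a, b;
    gpu_matrix g;
    auto ok = a + b;  // fine: both cpu_tag
    // auto bad = a + g; // error: no common Device can be deduced
    (void)ok; (void)g;
}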

assignment.hpp implements the basic assignment operators (except op=).
A += B either calls kernels::assign<>(A,B), if B can be evaluated
efficiently elementwise (e.g. A += 2*A1+A2-A3), or calls
B.plus_assign_to(A), which then assigns the terms one by one using their
specialized kernels (e.g. A += prod(B,C)+D is evaluated as
kernels::gemm(B,C,A); kernels::assign<...>(A,D);).
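
As a sketch, the dispatch works along these lines (toy types standing in
for real expressions; the actual code dispatches on an evaluation
category of the expression type):

#include <iostream>

struct elementwise_tag {}; // cheap per element, e.g. 2*A1+A2-A3
struct blockwise_tag {};   // needs its own kernel, e.g. prod(B,C)

struct sum_expr {                       // cheap elementwise expression
    typedef elementwise_tag evaluation_category;
};

struct prod_expr {                      // redirects to its gemm kernel
    typedef blockwise_tag evaluation_category;
    template<class M> void plus_assign_to(M&) const {
        std::cout << "kernels::gemm(B,C,A)\n";
    }
};

template<class M, class E>
void plus_assign(M&, E const&, elementwise_tag) {
    std::cout << "kernels::assign<add>(A,B)\n"; // one elementwise pass
}

template<class M, class E>
void plus_assign(M& A, E const& B, blockwise_tag) {
    B.plus_assign_to(A); // the expression picks its specialized kernel
}

// A += B selects a path from the expression's evaluation category
template<class M, class E>
void plus_assign(M& A, E const& B) {
    plus_assign(A, B, typename E::evaluation_category());
}

int main() {
    int A = 0;
    plus_assign(A, sum_expr());  // elementwise path
    plus_assign(A, prod_expr()); // gemm path
}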

matrix/vector_proxy.hpp implement subranges, rows-of-matrix operations,
etc., and matrix/vector_expression.hpp implement the algebraic
operations.


Best,
Oswin

Hartmut Kaiser
2015-12-03 21:39:06 UTC
Post by Oswin Krause
There is now a minimal PoC at https://github.com/Ulfgard/aBLAS . [...]
I am not too happy with the scheduling interface right now, and its
implementation looks a bit slower than necessary, but I think this will
evolve over time.
The scheduling should not be part of your library; it's a pure
implementation detail.
Post by Oswin Krause
Two basic examples showing what the library can already do are given in
examples/. For uBLAS users this should not look too foreign.
I don't think you should make it explicit to the user whether a certain
operation has to be async or not. Just make every operation async by
default and let the library decide when to synchronize (i.e. whenever
the actual matrix data is accessed). An expression (like r += 2*x+1 +
prod(x,y) - btw: why prod(x,y) and not x * y?) should not execute any
code directly, just construct an expression dependency tree which starts
scheduling the asynchronous sub-tasks. You need to wait for the result
only when somebody accesses an element of r. The easiest would be to let
your matrix type wrap a future which references the actual matrix.
Additional transformations of the expression tree could be used to merge
the operations, if needed.
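
For instance (a sketch of the idea, not an existing API):

#include <cstddef>
#include <future>
#include <utility>
#include <vector>

// the matrix holds a shared_future to its data, so construction is
// non-blocking and element access becomes the synchronization point
template<class T>
class async_matrix {
    std::shared_future<std::vector<T>> data_;
    std::size_t rows_, cols_;
public:
    async_matrix(std::size_t r, std::size_t c,
                 std::shared_future<std::vector<T>> f)
    : data_(std::move(f)), rows_(r), cols_(c) {}

    // blocks only here, when the actual data is needed
    T operator()(std::size_t i, std::size_t j) const {
        return data_.get()[i * cols_ + j];
    }
};

int main() {
    auto producer = std::async(std::launch::async, [] {
        return std::vector<double>(4, 1.5); // stands in for a product
    });
    async_matrix<double> r(2, 2, producer.share());
    return r(1, 1) == 1.5 ? 0 : 1; // first access waits for the producer
}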
Post by Oswin Krause
[description of the computational kernels in kernels/ snipped]
Why do you implement your own BLAS functions?
Post by Oswin Krause
[description of the PoC scheduler and the closure lifetime handling
snipped]
As said, the scheduler should be a pure implementation detail, not
exposed to the user. I'd even suggest relying on executors for the
scheduling needs, leaving the scheduler out of the picture completely.

Also, if you need dependencies, those shouldn't be handled by the
executor but rather imposed by the code using it (via futures or
similar).
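
With HPX, for example, the dependency from the earlier example (B +=
prod(D,temp) runs only after A and temp are ready) can be imposed
directly with futures; a sketch (header paths vary between HPX versions;
hpx/hpx_main.hpp lets a plain main() run inside the HPX runtime):

#include <hpx/hpx_main.hpp>
#include <hpx/include/lcos.hpp>
#include <utility>

int main() {
    // two independent tasks, free to run concurrently
    hpx::future<int> a    = hpx::async([] { return 1; });
    hpx::future<int> temp = hpx::async([] { return 2; });

    // runs only once both inputs are ready; no hand-written graph needed
    hpx::future<int> b = hpx::dataflow(
        [](hpx::future<int> x, hpx::future<int> y) {
            return x.get() + y.get();
        },
        std::move(a), std::move(temp));

    return b.get() == 3 ? 0 : 1;
}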
Post by Oswin Krause
[description of the matrix/vector classes, expression templates, and
assignment kernels snipped]
Can I use other BLAS implementations, like for instance Intel MKL?


General note: I have the impression that - just like any decently sized
system eventually embeds a half-baked and bug-ridden implementation of
LISP - any asynchronous library nowadays contains its own half-assed
implementation of a parallel runtime system.

So let me ask again: why not use a full-blown, mature, and thoroughly
tested runtime system for your needs, instead of worrying about how to
schedule the 5 dependent tasks your library creates?

Regards Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu




Robert Ramey
2015-11-22 18:30:35 UTC
Post by Oswin Krause
Of course I know Eigen, Armadillo and NT2 ... but I cannot rely on HPX
because it would then be a mandatory external library ...
The question for me is: is there enough interest in such a thing that it
is worthwhile for me to create a minimal prototype on which a discussion
could be based?
Hmmm - if you made a Boost library which implemented what you desire,
wouldn't that become an "external library" which you could not then use?

Is the whole motivation of this proposal to work around some internal
prohibition of "external code" imposed for some non-technical reasons
like legal department BS or grant requirements or ???

I'm not getting the motivation here.

Robert Ramey


David Bellot
2015-12-05 14:49:20 UTC
Hi Oswin,

Just a simple question: why not contribute it to uBLAS directly?
uBLAS is getting old anyway and needs a serious rewrite of some parts.
Your contribution would be exceptional.

And your work would be recognized as part of uBLAS directly and be
used right away by thousands of users.
Let me know if you're interested in contributing back to uBLAS instead.
Cheers,
David

On Thu, Nov 19, 2015 at 2:55 PM, Oswin Krause wrote:
Post by Oswin Krause
[original proposal quoted in full; snipped]
Oswin Krause
2015-12-07 08:47:36 UTC
Hi David,
Post by David Bellot
And your work would be recognized as part of uBLAS directly and be
used right away by thousands of users.
Let me know if you're interested in contributing back to uBLAS instead.
Yes, I am interested, but currently I do not see a roadmap, as uBLAS is
so big that I fear I cannot fix as much code as my proposed changes
would break (even without the asynchronous part, GPUs, ...). One of the
starting points of my rewrite was that I deleted around 40k lines of
uBLAS code before I started and then designed the interface around a
defined subset of features (namely one dense format and one sparse
format, later adding the triangular packed format; I still haven't
gotten sparse right, though). From there I decided to break the
functionality apart in a few places (assignment, kernels, and iterators)
and use the expression templates as a simple connection between the
parts.

Another point: I would not propose a major change to an existing Boost
library without triggering a formal Boost review process - this change
would break any code that relies on uBLAS internals.

Best,
Oswin