Oswin Krause

2015-11-19 14:55:52 UTC

Hi list,

Disclaimer: What I am proposing here does not exist (yet). But it will in the near future, and if there is interest I would like to give something back to the Boost community.

I would like to propose a new library to Boost: asynchronous BLAS.

What is it and what can I do with it?

It is a linear algebra library where all computations are carried out asynchronously on GPU and CPU. Computing a matrix-matrix product does not block the current thread; instead it launches a task which at some point returns with a result, allowing the main thread to continue until the result is needed. GPU and CPU tasks can be started at the same time and depend on each other (e.g. a CPU computation waits until the GPU has returned some needed result, and vice versa).

Why is this important? Which problems does it solve?

Many linear algebra libraries are targeted towards either

a) solving a few large linear algebra problems

b) solving many small, similar problems

This library targets the middle ground: a medium number of decently sized problems. Here the difficulty is that it is often hard to use all the available computing resources, as a single problem will not exhaust the device and there is no easy way to launch several similar tasks in parallel.

This problem occurs often in GPU computing, where out-of-order execution of kernels is used to fill up the device. Similarly, it is often hard to make CPU and GPU work together, as GPU tasks are asynchronous while CPU tasks are usually synchronous, and it is not straightforward to couple the two without blocking either the CPU or the GPU.

How does it work?

Expression templates similar to boost::ublas are used to create a dependency graph: which object is used in which computations, and which tasks have to wait to ensure a correct outcome?

A simple case would be two independent computations like

matrix A=prod(B,trans(B));

matrix C=prod(trans(B),B);

A normal library like ublas would wait until A is computed before C is evaluated. Instead, it can be more efficient to just add the two tasks to a list and let worker threads execute them as soon as resources are available. First the computation of A is put into the graph, A is marked as "being written to", and B is marked as "being read". When C is added to the list, it does not depend on the first computation, since several tasks can read from B at the same time. So another worker thread can grab the task and execute it.

If we now add the task

B += prod(D,prod(C,trans(D)));

the library computes it as

matrix temp = prod(C,trans(D));

B += prod(D,temp);

temp can be computed as soon as C is available. But B can only be updated after A and temp are computed, as we can only write to B after all previous tasks using B have finished.

Are you qualified?

I am the main contributor to https://github.com/Shark-ML/Shark/ and as part of it I forked ublas:

https://github.com/Shark-ML/Shark/tree/master/include/shark/LinAlg/BLAS

which now supports several linear algebra libraries as possible backends (ATLAS, OpenBLAS, ...).

What is the timeline?

I would start developing a minimal example ASAP, based on the rewrite of ublas above. I would use a minimal feature set (only dense square matrices) and basic GPU support.

Opinions?

Best,

Oswin

_______________________________________________

Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
