Module Cudajit.Stream

CUDA streams are independent FIFO schedules for CUDA tasks, allowing them to potentially run in parallel. See: Stream Management.

type t

Stores a stream pointer and manages lifetimes of kernel launch arguments. See CUstream.

val sexp_of_t : t -> Sexplib0.Sexp.t
val mem_alloc : t -> size_in_bytes:int -> Deviceptr.t

See cuMemAllocAsync.

The pointer is finalized using cuMemFreeAsync.

val mem_free : t -> Deviceptr.t -> unit
val memcpy_H_to_D_unsafe : dst:Deviceptr.t -> src:unit Ctypes.ptr -> size_in_bytes:int -> t -> unit
val memcpy_H_to_D : ?host_offset:int -> ?length:int -> dst:Deviceptr.t -> src:('a, 'b, 'c) Stdlib.Bigarray.Genarray.t -> t -> unit

Copies the bigarray (or its interval) into the device memory asynchronously. host_offset and length are in numbers of elements. See memcpy_H_to_D_unsafe and cuMemcpyHtoDAsync.
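Since host_offset and length count elements rather than bytes, it can help to see how an element interval maps to a byte range. The helper below is a minimal sketch using only Stdlib.Bigarray. The name byte_range is hypothetical. The default for length (the rest of the array) and the treatment of a multi-dimensional array as one flat run of elements are assumptions, not documented behavior of this module.

```ocaml
(* Sketch only: converts an element interval of a bigarray into a
   (byte offset, byte count) pair. The default for [length] (everything
   after [host_offset]) is an assumption, not documented behavior. *)
let byte_range ?(host_offset = 0) ?length arr =
  let elem_bytes = Bigarray.kind_size_in_bytes (Bigarray.Genarray.kind arr) in
  (* Total number of elements across all dimensions. *)
  let total = Array.fold_left ( * ) 1 (Bigarray.Genarray.dims arr) in
  let length = Option.value length ~default:(total - host_offset) in
  if host_offset < 0 || length < 0 || host_offset + length > total then
    invalid_arg "byte_range: interval outside the bigarray";
  (host_offset * elem_bytes, length * elem_bytes)

let () =
  let a = Bigarray.Genarray.create Bigarray.float32 Bigarray.c_layout [| 4; 8 |] in
  let offset, count = byte_range ~host_offset:8 ~length:16 a in
  (* float32 elements are 4 bytes wide, so this prints "32 64". *)
  Printf.printf "%d %d\n" offset count
```

The byte count computed this way is also the minimum size_in_bytes a destination buffer would need to hold the copied interval.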

type kernel_param =
  | Tensor of Deviceptr.t
  | Int of int  (* Passed as C int. *)
  | Size_t of Unsigned.size_t
  | Single of float  (* Passed as C float. *)
  | Double of float  (* Passed as C double. *)

Parameters to pass to a kernel.
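As an illustration, a parameter list might be assembled and handed to launch_kernel as sketched below. This assumes the cudajit package is installed, that Deviceptr and Module are reachable as Cudajit.Deviceptr and Cudajit.Module, and that a CUDA context is already current. The kernel and the wrapper name launch_axpy are hypothetical. The order of the list is assumed to follow the kernel's own parameter order.

```ocaml
(* Sketch only. Assumes a hypothetical kernel declared roughly as
     axpy(float *x, float *y, int n, float alpha)
   that has already been loaded into [func]. *)
let launch_axpy (func : Cudajit.Module.func) (stream : Cudajit.Stream.t)
    ~(x : Cudajit.Deviceptr.t) ~(y : Cudajit.Deviceptr.t) ~(n : int)
    ~(alpha : float) =
  let block_dim_x = 256 in
  (* Ceiling division so that every one of the n elements gets a thread. *)
  let grid_dim_x = (n + block_dim_x - 1) / block_dim_x in
  Cudajit.Stream.launch_kernel func ~grid_dim_x ~block_dim_x
    ~shared_mem_bytes:0 stream
    Cudajit.Stream.[ Tensor x; Tensor y; Int n; Single alpha ]
```

Single is used for alpha because the hypothetical kernel takes a C float; a kernel taking a C double would need Double instead, even though both constructors carry an OCaml float.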

val sexp_of_kernel_param : kernel_param -> Sexplib0.Sexp.t
val no_stream : t

The NULL stream, the main synchronization stream of a device. Manages lifetimes of the corresponding kernel launch parameters.

val launch_kernel : Module.func -> grid_dim_x:int -> ?grid_dim_y:int -> ?grid_dim_z:int -> block_dim_x:int -> ?block_dim_y:int -> ?block_dim_z:int -> shared_mem_bytes:int -> t -> kernel_param list -> unit
val memcpy_D_to_H_unsafe : dst:unit Ctypes.ptr -> src:Deviceptr.t -> size_in_bytes:int -> t -> unit
val memcpy_D_to_H : ?host_offset:int -> ?length:int -> dst:('a, 'b, 'c) Stdlib.Bigarray.Genarray.t -> src:Deviceptr.t -> t -> unit

Copies from the device memory into the bigarray (or its interval) asynchronously. host_offset and length are in numbers of elements. See memcpy_D_to_H_unsafe and cuMemcpyDtoHAsync.

val memcpy_D_to_D : ?kind:('a, 'b) Stdlib.Bigarray.kind -> ?length:int -> ?size_in_bytes:int -> dst:Deviceptr.t -> src:Deviceptr.t -> t -> unit

Copies between two memory positions on the same device asynchronously. The size to copy can optionally be provided in numbers of elements via kind and length. Provide either both kind and length, or just size_in_bytes. See cuMemcpyDtoDAsync.
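The "either both kind and length, or just size_in_bytes" rule can be made concrete with a small stand-alone sketch. The function resolve_size_in_bytes is hypothetical and uses only Stdlib.Bigarray. How the library itself reacts to other combinations of arguments is not stated here, so rejecting them is an assumption.

```ocaml
(* Sketch only: computes the byte count to copy from either
   (kind, length) or size_in_bytes. Rejecting any other combination
   is an assumption about the intended contract. *)
let resolve_size_in_bytes ?kind ?length ?size_in_bytes () =
  match (size_in_bytes, kind, length) with
  | Some size, None, None -> size
  | None, Some kind, Some length -> length * Bigarray.kind_size_in_bytes kind
  | _ ->
      invalid_arg
        "provide either both kind and length, or just size_in_bytes"

let () =
  (* Ten float32 elements occupy 40 bytes; prints "40 64". *)
  Printf.printf "%d %d\n"
    (resolve_size_in_bytes ~kind:Bigarray.float32 ~length:10 ())
    (resolve_size_in_bytes ~size_in_bytes:64 ())
```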

val memcpy_peer : ?kind:('a, 'b) Stdlib.Bigarray.kind -> ?length:int -> ?size_in_bytes:int -> dst:Deviceptr.t -> dst_ctx:Context.t -> src:Deviceptr.t -> src_ctx:Context.t -> t -> unit

Copies between memory positions on two different devices asynchronously. The size to copy can optionally be provided in numbers of elements via kind and length. Provide either both kind and length, or just size_in_bytes. See cuMemcpyPeerAsync.

type attach_mem =
  | GLOBAL  (* Memory can be accessed by any stream on any device. *)
  | HOST  (* Memory cannot be accessed from devices. *)
  | SINGLE_stream  (* Memory can only be accessed by a single stream. *)
val sexp_of_attach_mem : attach_mem -> Sexplib0.Sexp.t
val attach_mem_of_sexp : Sexplib0.Sexp.t -> attach_mem
val attach_mem : t -> Deviceptr.t -> int -> attach_mem -> unit
val create : ?non_blocking:bool -> ?lower_priority:int -> unit -> t

Lower lower_priority numbers represent higher priorities; the default is 0. See cuStreamCreateWithPriority.

The stream value is finalized using cuStreamDestroy. This is safe without needing to set the proper context.

val get_context : t -> Context.t
val get_id : t -> Unsigned.uint64
val is_ready : t -> bool

Returns false when the query status is CUDA_ERROR_NOT_READY, and true if it is CUDA_SUCCESS. See cuStreamQuery.
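Because is_ready does not block, it lends itself to bounded polling. The sketch below is deliberately independent of the library: the hypothetical poll_until_ready takes the readiness check as a function, so with this module one could pass fun () -> Cudajit.Stream.is_ready stream. The demo uses a fake check in place of a real stream.

```ocaml
(* Sketch only: calls [is_ready] repeatedly. Returns [Some n] if the
   check succeeded after n not-ready answers, or [None] once more than
   [max_polls] not-ready answers have been seen. *)
let poll_until_ready ~is_ready ~max_polls =
  let rec loop polls =
    if is_ready () then Some polls
    else if polls >= max_polls then None
    else loop (polls + 1)
  in
  loop 0

let () =
  (* Fake readiness check standing in for a stream query:
     it reports not ready three times, then ready. *)
  let calls = ref 0 in
  let fake_is_ready () = incr calls; !calls > 3 in
  match poll_until_ready ~is_ready:fake_is_ready ~max_polls:10 with
  | Some n -> Printf.printf "ready after %d not-ready polls\n" n
  | None -> print_endline "gave up"
```

In real code the loop would normally do useful host-side work between polls; when there is nothing else to do, synchronize is the simpler choice.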

val synchronize : t -> unit

Waits until a stream's tasks are completed. See cuStreamSynchronize.

val memset_d8 : Deviceptr.t -> Unsigned.uchar -> length:int -> t -> unit
val memset_d16 : Deviceptr.t -> Unsigned.ushort -> length:int -> t -> unit

length is in number of elements. See cuMemsetD16Async.

val memset_d32 : Deviceptr.t -> Unsigned.uint32 -> length:int -> t -> unit

length is in number of elements. See cuMemsetD32Async.