A multithreaded parallel data processing system has at least one processing element for processing multiple threads of computation. Threads are described by thread descriptors which are stored while waiting to be processed in a thread descriptor storage. Thread descriptors are comprised of an instruction pointer and a frame pointer. The instruction pointer points to the next instruction to be executed, and the frame pointer points to a frame of memory locations that the next instruction will operate on. Included within the instruction on set of the at least one processing element is a load instruction that loads global data into local processing element memory that is performed to two phases: a request phase and a response phase. Also included are instructions to fork a thread into two threads and to join two threads into a single thread.