Distributed, Parallel and Multi-Core Terminology
Shared memory systems

In this model, memory is accessible by multiple threads. It has several advantages over message passing: communication between processors is fast, and it does not need data partitioning.

Message Passing

There is no shared resource between processes; concurrent processes communicate by exchanging messages. This method is lockless, and mutual exclusion is not required.
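As an illustration, the sketch below passes a message between two threads through a queue that plays the role of the channel; the BlockingQueue and the message are illustrative choices, not the only way to do message passing.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // A minimal message-passing sketch: the queue is the channel, and the
    // threads share no mutable state directly.
    public class MessagePassingDemo {
        public static void main(String[] args) {
            BlockingQueue<String> channel = new ArrayBlockingQueue<>(10);

            Thread producer = new Thread(() -> {
                try {
                    channel.put("hello");                         // send a message
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            Thread consumer = new Thread(() -> {
                try {
                    System.out.println("got: " + channel.take()); // receive a message
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            producer.start();
            consumer.start();
        }
    }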

Memory consistency model
  • Strict consistency specifies that memory operations (especially writes) are indivisible and the effect of an operation is immediately visible to later read operations.
  • Sequential consistency adds to this that all of a program's operations appear to execute in the order the programmer wrote them.
  • Relaxed (weak) consistency: memory operations are not atomic in all cases. This model leaves room for optimizations in hardware and compiler design.
Serializable
Two transactions A and B are serializable if their result is the same whether they run sequentially or concurrently.
Linearizable
A linearizable operation is atomic: it appears to take effect at a single instant, so once it starts, it completes before any interruption.
Process
A process (some operating systems use the term 'task') is an executing instance of a program that has its own memory space.
Thread
A thread is an executing instance within a process. A process can have multiple threads, and all threads in a process share the same address space.
Amdahl's Law
Amdahl's Law gives the maximum speedup S achievable with n processors. Assume the whole job takes 1 unit of time and p is the fraction of it that can run in parallel (0 <= p <= 1):
S = 1 / ((1 - p) + p/n)
Amdahl's Law shows that the parallel fraction matters, because even with unlimited processors (as n approaches infinity) the maximum speedup is bounded by:
1 / (1 - p)
Speedup therefore declines as the sequential proportion grows, so to get better performance we should minimize the sequential part.
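As a worked example, the sketch below evaluates the formula for a few processor counts; the value p = 0.9 is an arbitrary illustration.

    // Evaluating Amdahl's Law; p = 0.9 is an arbitrary example value.
    public class Amdahl {
        // Speedup for parallel fraction p (0 <= p <= 1) on n processors.
        static double speedup(double p, int n) {
            return 1.0 / ((1.0 - p) + p / n);
        }

        public static void main(String[] args) {
            double p = 0.9; // 90% of the job is parallelizable
            for (int n : new int[] {1, 2, 4, 8, 1024}) {
                System.out.printf("n=%4d  S=%.2f%n", n, speedup(p, n));
            }
            // As n grows, S approaches 1/(1-p) = 10.
        }
    }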
Computer architecture
  • Single Instruction, Single Data stream (SISD) provides no parallelism.
  • Single Instruction, Multiple Data streams (SIMD) is useful in multimedia applications such as image processing. A SIMD CPU has a single control unit and multiple ALUs; the control unit broadcasts instructions to the ALUs.
  • Multiple Instruction, Single Data stream (MISD) is not practical.
  • Multiple Instruction, Multiple Data streams (MIMD) is the architecture most of today's computers use. In this model, the CPU consists of a collection of fully independent processing units (cores); each core has its own control unit and its own ALU.
Parallel patterns
Most parallel algorithms' structures fall into one of the following pattern categories:
  • Organized by task
  • Organized by data
  • Organized by data flow
Parallelism's factors
When choosing an appropriate pattern for a parallel algorithm, the following factors should be considered:
  • Synchronization: threads need to coordinate the order of their execution. Synchronization is used to access a critical section or to wait for a specific condition.
  • Communication: data exchange between threads. Message passing is one of the common ways to communicate; MPI (Message Passing Interface) is the most common interface for it.
  • Load balancing: trying to divide the same amount of work among all threads.
  • Scalability: using system resources efficiently when running on a system with more cores.
Critical section
A critical section is a region of code that only one thread at a time is allowed to enter. Usually a critical section is used to control access to a shared resource such as a memory variable (see the sketch after this list). Controlling a critical section is possible using one of the following methods:
  • Machine-level instructions that provide an atomic test-and-set.
  • Semaphore: acts as a counter that controls the access of a specific number of threads to the critical section. A strong semaphore can guarantee the order of processes based on FCFS (first come, first served).
  • Condition variable: when a specific condition occurs, one or more of the waiting threads wake up.
  • Monitor: a higher-level construct that hides implementation details from the programmer.
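A minimal sketch of the semaphore approach, assuming a fair binary java.util.concurrent.Semaphore guarding a shared counter (the counter and thread count are illustrative):

    import java.util.concurrent.Semaphore;

    // A binary semaphore guarding a critical section.
    public class CriticalSectionDemo {
        static final Semaphore mutex = new Semaphore(1, true); // fair => FCFS ("strong" semaphore)
        static int sharedCounter = 0;

        public static void main(String[] args) throws InterruptedException {
            Runnable work = () -> {
                for (int i = 0; i < 100_000; i++) {
                    mutex.acquireUninterruptibly(); // enter the critical section
                    try {
                        sharedCounter++;            // only one thread at a time runs this
                    } finally {
                        mutex.release();            // leave the critical section
                    }
                }
            };
            Thread t1 = new Thread(work), t2 = new Thread(work);
            t1.start(); t2.start();
            t1.join(); t2.join();
            System.out.println(sharedCounter);      // always 200000
        }
    }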
Deadlock
When two or more threads hold the resources they have locked and ask for other locked resources, deadlock is possible. For example, thread1 holds lock1 and tries to acquire lock2, while thread2 holds lock2 and tries to acquire lock1. Thread1 is then waiting for lock2 and thread2 is waiting for lock1, and deadlock occurs. There are three solutions for handling deadlock (a prevention sketch follows the list):
  • Timeouts
  • Deadlock prevention
  • Deadlock detection
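A minimal sketch of deadlock prevention by lock ordering, under the two-lock scenario described above (lock1 and lock2 are illustrative names): every thread acquires lock1 before lock2, so the circular wait cannot form.

    // Deadlock prevention by lock ordering.
    public class LockOrdering {
        static final Object lock1 = new Object();
        static final Object lock2 = new Object();

        static void safeTask() {
            synchronized (lock1) {        // always take lock1 first...
                synchronized (lock2) {    // ...then lock2, in every thread
                    // work with both resources
                }
            }
        }

        public static void main(String[] args) {
            new Thread(LockOrdering::safeTask).start();
            new Thread(LockOrdering::safeTask).start();
        }
    }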
Starvation
In a concurrent system, each thread should get enough access to limited resources such as the CPU. Otherwise, one or more threads may be unable to make progress for lack of a chance to access the shared resources. Deadlock is an extreme form of starvation. A starvation-free algorithm guarantees that every thread that calls the lock method eventually enters the critical section.
Peterson lock algorithm
The Peterson lock algorithm is a 2-thread lock algorithm. It is a combination of the LockOne and LockTwo algorithms and provides a starvation-free and deadlock-free solution.
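A sketch of the Peterson lock in Java, assuming the two threads pass their ids (0 and 1) in explicitly; an AtomicIntegerArray stands in for a volatile boolean array.

    import java.util.concurrent.atomic.AtomicIntegerArray;

    // The Peterson 2-thread lock; thread ids 0 and 1 are assumed.
    class PetersonLock {
        private final AtomicIntegerArray flag = new AtomicIntegerArray(2); // 1 = interested
        private volatile int victim;  // the thread that yields if both are interested

        void lock(int id) {
            int other = 1 - id;
            flag.set(id, 1);          // I am interested
            victim = id;              // let the other thread go first
            while (flag.get(other) == 1 && victim == id) {
                // spin: the other thread is interested and I am the victim
            }
        }

        void unlock(int id) {
            flag.set(id, 0);          // no longer interested
        }
    }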
Filter Lock
The filter lock is an n-thread lock algorithm based on the Peterson lock. In this algorithm there are n levels that a thread must pass before gaining the lock. It is starvation-free and deadlock-free. Unfortunately, the filter lock is not a completely fair algorithm.
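A sketch of the filter lock along the same lines, again assuming thread ids 0..n-1 are passed in explicitly.

    import java.util.concurrent.atomic.AtomicIntegerArray;

    // The filter lock for n threads; ids 0..n-1 assumed.
    class FilterLock {
        private final int n;
        private final AtomicIntegerArray level;   // level[i] = level thread i is trying to enter
        private final AtomicIntegerArray victim;  // victim[L] = last thread to arrive at level L

        FilterLock(int n) {
            this.n = n;
            level = new AtomicIntegerArray(n);
            victim = new AtomicIntegerArray(n);
        }

        void lock(int id) {
            for (int L = 1; L < n; L++) {         // pass n-1 levels to reach the lock
                level.set(id, L);
                victim.set(L, id);
                // spin while another thread is at this level or higher and I am the victim
                boolean blocked = true;
                while (blocked && victim.get(L) == id) {
                    blocked = false;
                    for (int k = 0; k < n; k++) {
                        if (k != id && level.get(k) >= L) { blocked = true; break; }
                    }
                }
            }
        }

        void unlock(int id) {
            level.set(id, 0);
        }
    }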
Fairness
A lock algorithm is fair if, whenever P1 calls the lock() method before P2, P1 enters the critical section before P2 (first come, first served).
Bakery Lock Algorithm
The bakery lock algorithm is a fair algorithm (first come, first served). Each thread takes a number (in ascending order by arrival time) and then waits until no other thread with a smaller number is waiting. It is a deadlock-free, first-come-first-served (fair), and therefore starvation-free algorithm.
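A sketch of the bakery lock, with the same assumption that thread ids 0..n-1 are passed in explicitly.

    import java.util.concurrent.atomic.AtomicIntegerArray;

    // The bakery lock for n threads; ids 0..n-1 assumed.
    class BakeryLock {
        private final int n;
        private final AtomicIntegerArray choosing; // 1 while thread i is taking a number
        private final AtomicIntegerArray number;   // ticket numbers; 0 = not interested

        BakeryLock(int n) {
            this.n = n;
            choosing = new AtomicIntegerArray(n);
            number = new AtomicIntegerArray(n);
        }

        void lock(int id) {
            choosing.set(id, 1);
            int max = 0;                           // take a ticket one higher than any seen
            for (int k = 0; k < n; k++) max = Math.max(max, number.get(k));
            number.set(id, max + 1);
            choosing.set(id, 0);
            for (int k = 0; k < n; k++) {
                if (k == id) continue;
                while (choosing.get(k) == 1) { }   // wait while k is taking a number
                // wait while k holds a smaller ticket (ties broken by thread id)
                while (number.get(k) != 0
                        && (number.get(k) < number.get(id)
                            || (number.get(k) == number.get(id) && k < id))) { }
            }
        }

        void unlock(int id) {
            number.set(id, 0);
        }
    }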
Fiber
A fiber provides a kind of concurrency that does not need parallel execution. Threads are managed by the OS scheduler, but a fiber's time slice must be managed by the programmer. A thread can contain a number of fibers, and at any time only one of them can run. The Windows operating system supports fibers.
Thread Pools
Since creating a thread is an expensive operation, thread pools can save time. Threads are kept in the pool for reuse, which saves the cost of repeated thread creation. The Microsoft .NET Framework supports a thread pool mechanism.
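Java offers the same mechanism through ExecutorService; a minimal sketch (the pool size and tasks are illustrative):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // A fixed-size thread pool: 4 workers are created once and reused.
    public class PoolDemo {
        public static void main(String[] args) {
            ExecutorService pool = Executors.newFixedThreadPool(4);
            for (int i = 0; i < 10; i++) {
                final int task = i;
                pool.submit(() ->
                    System.out.println("task " + task + " on " + Thread.currentThread().getName()));
            }
            pool.shutdown(); // accept no new tasks; workers exit when the queue drains
        }
    }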
POSIX thread
POSIX threads (pthreads) is a thread library that provides a common interface for working with threads on multiple platforms. It is supported on most UNIX platforms, and an open-source implementation is also available for Windows.
Race condition
A race condition occurs when multiple threads have unsynchronized access to a shared resource and at least one of them tries to change it. The result is nondeterministic because the order of access to the shared data varies with the thread scheduler. To solve a race condition we can put the code that accesses the shared data in a critical section.
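A minimal sketch of a race and one fix: plain ++ on a shared int is a read-modify-write that loses updates, while an AtomicInteger (one alternative to a critical section) keeps the update atomic. The counters and iteration counts are illustrative.

    import java.util.concurrent.atomic.AtomicInteger;

    // Racy vs. atomic increments on shared state.
    public class RaceDemo {
        static int plainCounter = 0;                                  // racy
        static final AtomicInteger safeCounter = new AtomicInteger(); // atomic

        public static void main(String[] args) throws InterruptedException {
            Runnable work = () -> {
                for (int i = 0; i < 100_000; i++) {
                    plainCounter++;                // read-modify-write: loses updates
                    safeCounter.incrementAndGet(); // atomic: never loses updates
                }
            };
            Thread t1 = new Thread(work), t2 = new Thread(work);
            t1.start(); t2.start();
            t1.join(); t2.join();
            // plainCounter is usually less than 200000; safeCounter is always 200000
            System.out.println(plainCounter + " vs " + safeCounter.get());
        }
    }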
Priority ceiling
When threads have priority levels and a low-priority thread enters a mutual exclusion section, it can cause a performance problem: high-priority threads must wait for the low-priority thread to leave the critical section. In this situation the highest priority, called the ceiling, is assigned to the thread inside the mutual exclusion section. This trick temporarily raises the priority of the thread inside the mutual exclusion section so that it frees the critical section as soon as possible.
Fine-grained locking
A solution to reduce contention for a resource is to spread the contention out. If we have only one locked resource, all threads must contend for the same lock. If we can spread the contention among multiple locks, threads can access the resource concurrently. A good example of this technique is a distributed hash table (DHT).
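A minimal in-process sketch of the same idea, often called lock striping: contention is spread over an array of locks instead of one global lock. The stripe count and class names are illustrative.

    // A striped counter: keys map to stripes, each stripe has its own lock.
    public class StripedCounter {
        private static final int STRIPES = 16;
        private final Object[] locks = new Object[STRIPES];
        private final long[] counts = new long[STRIPES];

        public StripedCounter() {
            for (int i = 0; i < STRIPES; i++) locks[i] = new Object();
        }

        public void increment(int key) {
            int stripe = Math.floorMod(key, STRIPES); // pick the stripe for this key
            synchronized (locks[stripe]) {            // different stripes don't contend
                counts[stripe]++;
            }
        }

        public long total() {
            long sum = 0;
            for (int i = 0; i < STRIPES; i++) {
                synchronized (locks[i]) { sum += counts[i]; }
            }
            return sum;
        }
    }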
Non-blocking algorithm
In a non-blocking algorithm, thread execution is never postponed by mutual exclusion. Instead of locks, threads use atomic operations (such as CAS) to synchronize access to shared resources.
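As an illustration, the sketch below is a lock-free (Treiber-style) stack push: the new node is published with CAS, and the operation retries if the head moved, but no thread ever blocks on a lock.

    import java.util.concurrent.atomic.AtomicReference;

    // A lock-free stack push.
    public class LockFreeStack<T> {
        private static class Node<T> {
            final T value;
            Node<T> next;
            Node(T value) { this.value = value; }
        }

        private final AtomicReference<Node<T>> head = new AtomicReference<>();

        public void push(T value) {
            Node<T> node = new Node<>(value);
            while (true) {
                Node<T> current = head.get();
                node.next = current;                     // link to the current top
                if (head.compareAndSet(current, node)) { // publish if top unchanged
                    return;
                }
                // CAS failed: another thread changed the top; retry
            }
        }
    }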
ABA problem
In a non-blocking algorithm, the ABA problem occurs when the memory content is used to indicate that "nothing has changed". The memory content changes twice (or more): first to a new value, and then back to the original value. It then looks as if nothing has changed.
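One common Java remedy is AtomicStampedReference, which pairs the value with a stamp (version) so that an A -> B -> A change still bumps the stamp and the stale CAS fails; the values below are illustrative.

    import java.util.concurrent.atomic.AtomicStampedReference;

    // Detecting an A -> B -> A change with a stamped reference.
    public class AbaDemo {
        public static void main(String[] args) {
            AtomicStampedReference<String> ref = new AtomicStampedReference<>("A", 0);

            int[] stampHolder = new int[1];
            String seen = ref.get(stampHolder);  // read value and stamp together
            int seenStamp = stampHolder[0];

            // Another thread changes A -> B -> A; each change bumps the stamp.
            ref.compareAndSet("A", "B", 0, 1);
            ref.compareAndSet("B", "A", 1, 2);

            // The value is "A" again, but the stamp is 2, so this CAS fails.
            boolean ok = ref.compareAndSet(seen, "C", seenStamp, seenStamp + 1);
            System.out.println(ok); // false: the ABA change was detected
        }
    }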
CPU components
The central processing unit (CPU) performs the basic arithmetical, logical, and input/output operations in a computer system. A CPU is a set of functional units (such as the ALU and control unit). A functional unit is a combination of logic blocks, and a logic block consists of transistors, the smallest units in the CPU.
Context switch
A context switch occurs before thread switching: the processor saves the current state of execution and then switches to another thread's instructions.
Hyper-Threading Technology
In processors that support Hyper-Threading (HT), the resources of the same core are shared among threads; these threads are called logical processors. The number of registers in an HT processor is the same as in a processor without HT, and only one thread can use a shared resource at a time, but the processor fetches instructions from the threads concurrently, so the threads get more service from the processor. With HT we do not gain extra physical processors; one physical core is shared between threads. Each core has some functional units, and HT tries to share those units between threads. In a processor without HT, each core is usable by only one thread at a time; in a multi-core processor whose cores support HT, each core can run more than one thread concurrently.
Cache
A cache is memory that is closer to the CPU and therefore faster than main memory. Because access to main memory is slow, the cache is located between the CPU and main memory. When the CPU needs a memory address, it first looks it up in the cache. When a program reads or writes a variable, it is likely to access the same variable again soon (locality), and the cache is effective because of this locality. A cache is typically around a hundred times faster than main memory. Since cache is expensive, it is smaller than main memory, and only a small fraction of memory can fit into it.
Cache coherence
The cache coherence problem occurs when a memory address is shared among processors and each processor has its own cache. When one processor performs an update, the other processors must update their caches with the new value. One of the most common protocols for solving this problem is MESI.
Virtual memory
In multitasking operating systems, a large program may not fit into main memory. Virtual memory (VM) combines RAM with temporary space on persistent storage. If a RAM shortage occurs, data is moved from RAM to persistent storage to free up RAM; transferring the data makes it possible for other processes to complete their tasks.
MESI
MESI (Modified, Exclusive, Shared, Invalid) is a protocol for solving the cache coherence problem. MESI has been used in the Pentium processors.
Synchronization primitives
Modern multiprocessor architectures support synchronization primitives such as compare-and-swap (CAS). Architectures such as AMD's and Intel's support a CAS hardware instruction, and programming languages such as Java and C# use it for implementing synchronization. The general format of CAS is as follows:
CompareAndSwap(*address, oldValue, newValue)
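In Java this maps to the compareAndSet methods of the atomic classes; a minimal retry loop:

    import java.util.concurrent.atomic.AtomicInteger;

    // Java's counterpart of the general CAS format above:
    // compareAndSet(oldValue, newValue) on an AtomicInteger.
    public class CasCounter {
        private final AtomicInteger value = new AtomicInteger();

        public int increment() {
            int old;
            do {
                old = value.get();                        // read the current value
            } while (!value.compareAndSet(old, old + 1)); // retry until the CAS succeeds
            return old + 1;
        }
    }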
However, CAS is subject to the ABA problem described above.
CAP Theorem
Dr. Eric Brewer's CAP theorem states that a distributed (shared-data) system cannot achieve all three of consistency, availability, and partition tolerance; we can pick only two of them. A common way around the problem is eventual consistency.
  • Consistency: all nodes see the same data at the same time. If one part of a database modification fails, the entire modification fails and the database state is left unchanged. We can achieve this by using transactions.
  • Availability: if some nodes fail, substitute nodes take over their duties. This is achievable by using replication.
  • Partition tolerance: the system continues to operate despite loss of connectivity between some nodes.
Eventual Consistency

Eventual consistency (or optimistic consistency) sacrifices consistency to achieve high availability. In this model updates propagate through the nodes over time, and eventually all the replicas become consistent. Most modern (NoSQL) databases fall into this category.

Given the following definitions:

  • N = number of nodes that store replicas
  • W = minimum number of replicas that must acknowledge an update for it to succeed (if the system cannot write to W nodes because of failures, the write operation has to fail)
  • R = minimum number of replicas needed for a successful read (if the system cannot read an identical value from R nodes because of failures or inconsistency, the read operation has to fail)

We have strong consistency if W + R > N.
For example, if N=3, R=2, W=2, then we have strong consistency.

In eventual consistency (weak consistency), W + R <= N, which provides better latency.
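A tiny check of the quorum condition (the sample values mirror the examples above):

    // Read and write quorums must overlap for strong consistency.
    public class Quorum {
        static boolean strong(int n, int w, int r) {
            return w + r > n;
        }

        public static void main(String[] args) {
            System.out.println(strong(3, 2, 2)); // true: strong consistency
            System.out.println(strong(3, 1, 1)); // false: eventual (weak) consistency
        }
    }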

Multi-Version Concurrency Control

Multi-Version Concurrency Control (MVCC) is used to manage concurrent access to a distributed database. In MVCC, requests run in parallel without locks, so write operations do not block reads. To change a value in the database, you create a new version of that value and save it on top of the old one; a new request can append a new version of the data without having to wait to acquire a lock. A conflict can occur when the same data (with the same version) is changed in two different databases simultaneously. The conflict is detected and resolved by merging the two versions and saving the result as a new version.
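A minimal MVCC-flavoured sketch of the version check on write: a write is accepted only if it is based on the latest version; otherwise the caller must re-read and merge. The class and method names are illustrative, not a real database API.

    import java.util.concurrent.atomic.AtomicReference;

    // Each write creates a new version; stale writes are rejected as conflicts.
    public class VersionedValue<T> {
        public static final class Version<T> {
            final long number;
            final T value;
            Version(long number, T value) { this.number = number; this.value = value; }
        }

        private final AtomicReference<Version<T>> current =
                new AtomicReference<>(new Version<>(0, null));

        public Version<T> read() {                  // readers never block
            return current.get();
        }

        // Write a new value based on the version the caller read.
        public boolean write(Version<T> basedOn, T newValue) {
            Version<T> next = new Version<>(basedOn.number + 1, newValue);
            // Fails (a conflict) if someone committed a newer version in between.
            return current.compareAndSet(basedOn, next);
        }
    }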