Wednesday, April 24, 2013

Concurrency is not parallelism

No so new, but still good piece of reading:
"Concurrency is the composition of independently executing processes, while Parallelism is the simultaneous execution of (possibly related) computations"
As I wrote several times in the past, in OLTP, throughput is king, concurrency is the main thing that is put into the test.

Concurrency is where Facebook has a million "Like"s every second, each "Like" is independent, and they need to be processed concurrently.

Parallelism, is where few concurrent activities, say a few analytic reports run in Oracle Exadata, Vertica or GreenPlum. Every report is is sliced into many related computations that execute simultaneously.

Are these the same?

From 50,000 feet, we see many things running in the same time, in parallel, concurrently, maybe even distributed. But we need to be accurate, there is a huge difference, and it is in the source: how many "original" transactions we had to process? A million "Like"s vs. a few big analytic report. In both cases I see million operations coming out of them at the back, but:
In the "Like"s use case - those are the real transactions, concurrently running, distributed.
In the report use case - those are million pieces of the same initial single job.

Important! Not to be confused! Big difference! One is great for throughput scalability and one is not. More in my next post.

Wednesday, April 3, 2013

MySQL thread pool and scalability examples

Nice article about SimCity outage and ways to defend databases:

The graphs showing throughput with and without the thread pool are taken from the benchmark performed by Oracle and taken from here:

The main take away is this graph (all rights reserved to Oracle, picture original URL):
20x Better Scalability: Read/Write
Scalability is where throughput can grow and grow, as demand grows. I need to get more from the database, the question is: "can it scale to give it to me?". Scalability is where the response time remains "acceptable" while the throughput grows and grows.

Every database has a "knee point".
  1. In the best case scenario, in this knee-point, throughput will go into a flat plateau, and On the same point BTW,  response time will start climbing, passing the non-acceptable point.
  2. In a worse case scenario, in this knee-point, throughput, instead of a flat plateau, it will take a plunger. On the same point BTW, response time will start climbing fast to the roof.
Actually, the red best case scenario, is actually pretty bad... There's NO scalability there, throughput has a hard limit! It's around 6,500 transactions per second. I need to do more on my DB, there are additional connections - but the DB is not giving even 1 inch of additional throughput. It doesn't scale.

The thread pool feature is no more than a defense mechanism. It doesn't break the scalability limit of a single machine, rather its job is to defend the database from death.

Real scalability is when throughput graph is neither dropping or becoming flat - it goes up and up and up with a stable response time. This can be achieved only by Scale Out. Getting 7,500 TPS with 1 database with 32 connections, then add an additional database and the straight line going up will reach, say, 14,000. A system with 3 database can support 96 connections and 21,000 TPS... and on and on it goes... 

Data needs to be distributed across those databases, so the load can be distributed as well. Maintaining this distributed data on the scaled-out database is the key... I'll touch that in future posts.