Wednesday, April 9, 2014

Porting from Oracle to MySQL

A potential customer asked my about porting her application from Oracle Database to MySQL.

I always try to start with the "why" (a dear friend bought me this book, recommended:

She said "cloud!". I said "OK!".

I conducted a short research, found many things in many places all over the place, brought them to a nice email I sent her back and then thought I'll post it here and make it public as it might be useful for us all. If you feel that I missed something, add comments, send feedback.

These are the leading tools to do the actual migration of the data structure, data export/import, sprocs, triggers, etc.:
  1. MySQL Workbench has a migration feature:
  2. MySQLYog can be used to migrate:  (already in the conversation in the second comment there)
  3. Navicat can be used to migrate:
  4. Tungsten support Oracle-to-MySQL replication:
  5. Focused data migrators:

The way I see it, migrating the data is 15% of a database porting project. Efforts are in (partial list):

  1. Porting drivers and driver behavior in the app code
  2. Porting SQL commands all around the app code
    1. Conversion of non-standard SQL flavor
    2. Work-around restrictions and non-supported commands
  3. Ecosystem, monitoring, tuning, tools, scripts, hardware best practices, ops skills, dev skills

Way before the migration of the data on d-day.

A lot of services, some tools. Services-wise I see around:

  1. Pythian:
  2. Baron (Percona):

I bet the big SIs (Accenture et al) are strong in this game, as those would be the default go-to service provider for the Oracle shops.

Thursday, March 13, 2014

How Elasticity is dictated by Data Model?

I talk a lot about "Elasticity" and "Data Model", a prospect asked me today "what makes you think they are related?".

Not only are they related, the relation between them holds big part of the substance of ScaleBase, the technology I've been working on for the last 5 years...

Elasticity is the ability to grow or shrink in accordance to the demand.
The cloud makes it very easy to spin more machines, on demand and kill them a day after, pay by the hour, only for real usage. This alone offers fantastic elasticity. Remember that AWS's EC2 stands for "Elastic Compute Cloud".

Volatile/transient/stateless servers are easier to make elastic, AKA application servers, web servers. Just spin another same-image-server behind a round-robin load balancer would solve 80% of the problem.

Data is harder to "elastify".

  1. Data can be replicated across multiple identical servers behind the same round-robin load balancer, but data-replication multiplies data size (bad ROI) and cannot scale writes and updates to the data. 
  2. The only way to scale data is to have it distributed across multiple non-identical servers. 

New challenges:

  1. How would all data consumers (apps, tools) know where the data they look for resides? 
  2. If all for every access they need data from several (or all) the servers, load will end-up multiplied rather than distributed. = no scalability.
  3. OK not all or most, but the minority of accesses do need data from several (or all) the servers. How this data can be found on all quickly and aggregated? 

Challenge 1 is the simplest, just have an index expressing "I want to distribute my data by profile_id" and "put profiles 1-1000 on db1 and 1001-1500 on db2", and then force all data consumers check this index before every data access.

Challenges 2 and 3 are where data model kicks in. For NoSQLs, data model is a document, complete and self-contained, challenges 2 and 3 do not exist.
For SQL databases, a relational data model, takes challenges 2 and 3 to the extreme.

A carefully crafted data distribution policy and the ability to do real-time data aggregation are crucial for a successful scaling relational database.

In our profiles distribution example, identifying that "a profile" is actually a chunk of related data from 100 tables in a complex, multi-level, deep hierarchy - is a hard task to do.
ScaleBase Analysis Genie simplifies the authoring of a data distribution policy that makes sure that related data is stored together on the same server, solving challenge 2.

ScaleBase Controller employs multi-threaded massive parallel execution and advanced result aggregation, supporting all SQL aspects including support for GROUP BY, ORDER BY, HAVING, UNION, JOIN, SUBSELECT to solve challenge 3.

Thursday, November 14, 2013

Will AWS plans for PostgreSQL RDS help it finally pick up?

"Amazon to add Postgres to its most-favored database list" says GigaOM:
"To many this is no-brainer. Amazon wants to support the databases that its developer audiences want to use. This is simply a  case of Amazon responding to user demand and oh-by-the-way making its cloud infrastructure more attractive to a specific target audience. Some say Postgres has gained traction since Oracle’s acquisition of MySQL via its Sun buyout a few years back."

Some people I know said "yea, the writing was on the wall...". Well, was it?? Really? 

AWS finally got the time to "plan" for supporting Postgres now? After supporting MySQL, Oracle and SQL Servers for almost 3 years?! Writing was on the wall? Where can I find a wall this old?

PostgreSQL has not picked up. 

This is why it is a far 4th on Amazon's list. The writer of the text above also makes clear efforts not to pick a side here... "to many this is a no-brainer" or "some say Postgres has gained traction". 

It has been around for ages, thru many "oh! it's now happening!" events, such as the acquisition by of MySQL by Sun, then by Oracle... 

Technically, PostgreSQL's few superior capabilities, especially around schema online modifications (which gets more important these days!), probably could not change its fate, and it's still being held back by too many inferior capabilities, around performance, robustness, ecosystem... 

So - with plans for RDS, will Postgres now pick up? 

Feel free to Share your thoughts... 

Wednesday, April 24, 2013

Concurrency is not parallelism

No so new, but still good piece of reading:
"Concurrency is the composition of independently executing processes, while Parallelism is the simultaneous execution of (possibly related) computations"
As I wrote several times in the past, in OLTP, throughput is king, concurrency is the main thing that is put into the test.

Concurrency is where Facebook has a million "Like"s every second, each "Like" is independent, and they need to be processed concurrently.

Parallelism, is where few concurrent activities, say a few analytic reports run in Oracle Exadata, Vertica or GreenPlum. Every report is is sliced into many related computations that execute simultaneously.

Are these the same?

From 50,000 feet, we see many things running in the same time, in parallel, concurrently, maybe even distributed. But we need to be accurate, there is a huge difference, and it is in the source: how many "original" transactions we had to process? A million "Like"s vs. a few big analytic report. In both cases I see million operations coming out of them at the back, but:
In the "Like"s use case - those are the real transactions, concurrently running, distributed.
In the report use case - those are million pieces of the same initial single job.

Important! Not to be confused! Big difference! One is great for throughput scalability and one is not. More in my next post.

Wednesday, April 3, 2013

MySQL thread pool and scalability examples

Nice article about SimCity outage and ways to defend databases:

The graphs showing throughput with and without the thread pool are taken from the benchmark performed by Oracle and taken from here:

The main take away is this graph (all rights reserved to Oracle, picture original URL):
20x Better Scalability: Read/Write
Scalability is where throughput can grow and grow, as demand grows. I need to get more from the database, the question is: "can it scale to give it to me?". Scalability is where the response time remains "acceptable" while the throughput grows and grows.

Every database has a "knee point".
  1. In the best case scenario, in this knee-point, throughput will go into a flat plateau, and On the same point BTW,  response time will start climbing, passing the non-acceptable point.
  2. In a worse case scenario, in this knee-point, throughput, instead of a flat plateau, it will take a plunger. On the same point BTW, response time will start climbing fast to the roof.
Actually, the red best case scenario, is actually pretty bad... There's NO scalability there, throughput has a hard limit! It's around 6,500 transactions per second. I need to do more on my DB, there are additional connections - but the DB is not giving even 1 inch of additional throughput. It doesn't scale.

The thread pool feature is no more than a defense mechanism. It doesn't break the scalability limit of a single machine, rather its job is to defend the database from death.

Real scalability is when throughput graph is neither dropping or becoming flat - it goes up and up and up with a stable response time. This can be achieved only by Scale Out. Getting 7,500 TPS with 1 database with 32 connections, then add an additional database and the straight line going up will reach, say, 14,000. A system with 3 database can support 96 connections and 21,000 TPS... and on and on it goes... 

Data needs to be distributed across those databases, so the load can be distributed as well. Maintaining this distributed data on the scaled-out database is the key... I'll touch that in future posts. 

Tuesday, March 26, 2013

They say: "Relational Databases Aren't Dead"

This is a good read, claiming: "Relational Databases Aren't Dead. Heck, They're Not Even Sleeping", A key quote:
"While not comprehensive, the uses for NoSQL databases center around the acquisition of fast-growing data or data that does not easily fit within uniform structures."

There were 2 parts in the statement about NoSQL's uses. I'll start with the latter:

"data that does not easily fit within uniform structures" - NoSQL is probably the right choice, hmm although I always encourage thinking and architecting in advance. And also online structure changes do exist in the RDBMS world and recently in MySQL:
I would definitely warn about the caveats of NoSQL when it comes to actually use and query the data that is so easily stored there...

"acquisition of fast-growing data" - is no longer a no-go for RDBMS and MySQL database. Distributed RDBMS solutions do exist today and they can exploit performance and scalability from the good old MySQL itself

What do you think?

Friday, January 4, 2013

Partial partitioning and sharding

I came across this:
I was wondering if sharding is an alternate name for partial replication or not. What I have figured out that --
  • Partial Repl. – each data item has only copies at some but not all of the nodes (‘Sharding’?)
  • Pure Partial Repl. – has only copies of a subset of the data item but no node contains a full copy of the database
  • Hybrid Partial Repl. – a set of nodes are full replicas and another set of nodes are partial replicas

I thought it was a good topic, I write a really nice answer, but there was a problem when I pressed "post this answer", probably some error or mistake on their side. Anyway - this is what I have to say about partial partitioning and sharding:

Partial replication is an interesting way, in which you distribute the data with replication from a master to slaves, each contains a portion of the data. Eventually you get an array of smaller DBs, read only, each contains a portion of the data. Reads can very well be distributed and parallelized. 

But what about the writes? 

Those are still clogged, in 1 big fat lazy master database, tasks as buffer management, locking, thread locks/semaphores, and recovery tasks - are the real bottleneck of the OLTP, they make writes impossible to scale... As I wrote in many previous posts, for example here:

Sharding is where every piece of data lives only in one place, within an array of DBs. Each database is the complete owner of the data: data is read from there, data is written to there. This way, reads and writes are distributed and parallelized, real scale-out can be achieved. 

The idea behind sharding is great, the ultimate scaling solution (ask Facebook, Google, Twitter and all the other big guys) but it's a mess to handle, to maintain. It's hard as hell if done by yourself, ScaleBase enables an automatic transparent scale-out machine - that does all that, so you won't have too...