Databases

Resources:

Database types

SQL (Structured Query Language): technically differen from RDBMS (Relational Database Management System) but used interchangeably: Rigid schemas AND rigid relations between tables.
NoSQL: weak or no schemas. Relationships are handled differently.

Key-Value Store

Lists of key-value pairs. Keys must be unique. No schema or structures.

Used for Unstructured Data.

Strengths:

Very scalable because pairs can be distributed
Very fast
They handle multiple data types
Keys are linked directly to their values, no need for indexing or complex join operations
The content of a key can easily be copied to other systems without reprogramming the data, because they support many data types

Weaknesses:

Analytical queries are difficult since there are no Joins
Access pattern must be known in advance for optimum performance

Use cases:

Simple requirements
Data is made up of just names/keys and values, no or little structure

AWS services:

DynamoDB

In-memory Databases

For workloads that need ultrafast access to data that’s frequently accessed but rarely updated.
Data is stored as Key-Value pairs.

Strengths:

sub-milliseconds response times
Extremely scalable with no downtime

Weaknesses:

Not suitable for seldom changing data
Applications must tolerate stale data

Use cases:

Real-time data access
In-memory caching
Games
Session store

AWS services:

ElastiCache
MemoryDB for Redis

Wide Column Store

They contain a Partition key and optionally items can have additional keys.

Used for Structured Data.

Every item in a table has to have the same layout: the same number of keys. That keys combination is unique in that table.

Those are not the same tables as SQL databases. Each item doesn’t have to have the same attributes as the others.

Document

Entities are stored in collections and cannot be accessed using a key. Each document has a unique ID and is very flexible when it comes to fields (as opposed to SQL).

Used for Semi-structured and Unstructured Data.

Strengths:

Flexible
Don’t require planning data structures too much
Scalable

Weaknesses:

Flexibility comes with lack of ACID compliance
Cannot perform cross-file queries natively

Use cases:

Purchase orders
Contacts
Structures with deeply nested attributes
Structures that need to be linked to each other

AWS Services:

Amazon DocumentDB (with MongoDB compatibility)

Row Store

Like MySQL.

Used for Structured Data.

Also called OLTP (Online Transaction Processing)

Use cases:

Adding/updating/deleting things that are managed as items

Column Store

They’re not designed to describe an item, because they’re optimized to perform queries in a columnar way.

Used for Structured Data.

In row stores you may have | id | name | size | and add items in the form "{ 001, "this", 45 }", you then query for one or more fields (columns);
In column stores you add data and are not interested in the item as a whole, rather you want to perform queries on attributes. They’re very good for reporting. You may be interested in making a chart of items for a particular size you sold, so you query for all the sizes.

They’re very inefficient for transactions.

Strengths:

They support a large volume of data
Scalable
Fast write speeds

AWS Services:

Amazon Keyspaces (for Apache Cassandra)

Graph

They focus on relations.

Used for Structured, Semi-structured and Unstructured Data.

Relations, in a graph database, are data themselves, they don’t need to be computed. Relations can have attributes that define them.

Strengths:

Straightforward, fast retrieval of complex hierarchical structures

Weaknesses:

Not ideal for transactional data
Not efficient for analytics

Use cases:

DB for social media
Data with complex relationships
real-time big data mining
Recommendations

AWS Services:

Neptune

OLTP and OLAP (for Relational Databases)

Online transaction processing (OLTP) databases focus on recording Update, Insertion, and Deletion data transactions. OLTP queries are simple and short, which require less time and space to process.

Online analytical processing (OLAP) databases store historical data that has been input by OLTP. With OLAP databases, users can view different summaries of multidimensional data. Using OLAP, users can extract information from a large database and analyze it for decision making. A good example of an OLAP system is a business intelligence tool.

Transaction models

ACID and BASE a transaction models.

CAP Theorem

CAP stands for Consistency, Availability, Partition tolerant. Where:

Consistency: Every read either receives the most recent data written or an error. This is this definition is different from the that in ACID design (see below).
Availability: Every read from a non-failing has a response (not an error).
Partition tolerance: The system keeps working despite an arbitrary number of messages being dropped or delayed by the network between nodes.

"Any distributed data store can provide only two of the above".

In case of a network partition fail you need to decide between:

Consistency: cancel the transaction and throw an error or cause a timeout, because otherwise the next read is not guaranteed to read the most recent value. Availability drops.
Availability: proceed with the transaction and return the value, though it’s not guaranteed to be the most recently written.

In the absence of partitioning both Consistency and Availability can be guaranteed. E.g.: a single node database, where you either reach the node and DO get the most up-to-date value, or you can’t reach the node.

ACIDs are designed to favor consistency. BASEs favor availability.

ACID

In the ACID model transactions are:

Atomic: All the parts of the transaction are successful or the whole transaction is discarded and the state is reverted/unchanged. E.g.: a bank wants all transfers (reduce the amount on A, increase the amount on B) to be atomic.
Consistent: Given 2 valid states, transactions move the system from one to the other but at no time an in-between state exists.
Isolated: Multiple transactions happening at the same time don’t interfere with each other. They execute as if each were the only transaction running on the database.
Durable: Once a transaction has been committed, it remains committed even in the case of a system failure.

ACID usually refers to RDS/Aurora database and comes with a limited scalability.

BASE

BASE stands for Basically Available, Soft state, Eventually consistent.

Basically Available: read and write are available as much as possible, with no consistency guarantees. Those databases aim to spread data over the nodes.
Soft state: It delegates consistency to the developer. If the database optionally allows it, the developer must ask it to provide consistent data (I.e.: most up-to-date) or the application has to tolerate potential inconsistencies.
Eventually consistent (Optimistic replication): The system cannot guarantee that, immediately after a write operation, consistency is achieved.

BASE database are highly scalable and provide high performance.

Some BASE databases offer ACID functionality. E.g.: DynamoDB transactions.

Databases on EC2

They’re generally a very bad practice. It should be avoided and only allowed under very few circumstances.

Even running the application on one instance and the database on a dedicated instance is problematic, because it introduces a network dependency. Also, if the two instances are running in two different AZs this introduces cross-AZs networking costs.

Downsides

Admin overhead
You need to take care about backups and DR. ⇒ Admin overhead
EC2 is AZ resilient.
You can’t benefit from AWS solutions features
You have no serverless or scaling
Replication is on you ⇒ Admin overhead
You can’t have AWS solutions performance

When you’re forced to do so

You want to access the instance OS, maybe with root account permissions. Sometimes clients request this, and you need to understand if the need it, want it, or think they want it.
You need DB root account permissions (are you sure?)
You need to run a DB or DB version that AWS don’t support.
You requirea a specific DB and OS combination.
You require an architecture that AWS don’t provide, like some types of replication or resilience.
Decision makers say so.

Database Migration Service

A managed migration service to migrate from one DB to another, one of which must be in AWS. It uses an EC2 instance where you can define replication tasks, source endpoint and destination endpoint. It’s NOT limited to SQL.

It provides no-downtime migrations.

Many sources are supported, and include: MySQL, Aurora. MSSQL, MariaDB, MongoDB, PostgreSQL, Oracle, Azure SQL and others.

There are different kinds of jobs:

Full load: one-off migration of all data
Full load + CDC (Change Data Capture): migrates data and captures ongoing changes.
CDC only: allows for using custom migration tools like those provided by the source DB.

DMS does not offer schema conversion betweem DB versions or engines but you can use AWS SCT (Schema Conversion Tool) to do so.

UPDATE: Now it does. See DMS Schema Conversion

DMS can also be used for large migrations where data is serveral TB in size, in which case transferring over the network consumes time and capacity. It must be used with the Snowball family of products (That uses SCT):

Use SCT to extract data
Move the data to the snowball device you ordered
Ship the device to AWS (Data will be available in S3)
(optional) Use DMS CDC to monitor data changes
DMS migrates data from the S3 bucket to the target

Schema Conversion Tool (SCT)

It’s used only when the DB engine is incompatible, including a generic file format (no db related) for transfer.

Used for:

Convert schemas from one database engine to another.
Larger Migrations: when you need an alternative way to migrate data to AWS (including DB to S3) rather than using a data link.

It’s not used to move data across compatible engines.

Works with:

OLTP: MySQL, MSSQL, Oracle.
OLAP: Teradata, Oracle, Vertica, Greenplum.

Examples:

On-prem MySQL to AWS Aurora MySQL: compatible engines. Use DMS
On-prem MSSQL to AWS Aurora MySQL: incompatible engines. Use SCT
On-prem Oracle to AWS Aurora (only MySQL and Postgres supported): incompatible engines. Use SCT