Non-relational data management for big data
Modern information societies continuously produce massive amounts of
data: storing and processing such data are challenges that real-world
systems must address. Any solution must be able to scale up to
datasets on the order of petabytes.
Handling massive datasets must necessarily be performed in a distributed
environment. In particular, storage of data must be spread across multiple
servers, and computations over the data must exploit parallelism as
much as possible.
The need to handle massive datasets has led in recent years to the
development of NoSQL databases for distributed storage, and of the MapReduce
framework for parallel computation.
NoSQL, which stands for "Not Only SQL", is a broad class of database
management systems that do not adhere to the relational database management
system model.
In relational databases, data is highly structured, and is organized into
tables, consisting of fixed columns, each having a well-specified data type.
In NoSQL databases, instead, data is semi-structured, and is organized into
maps assigning values to keys, where values can be of arbitrary nature.
Relational databases can perform complex queries for retrieving data, and in
particular it is possible to join together data belonging to different
tables. In NoSQL databases, instead, joins are not possible, and there is
little functionality beyond the storage and retrieval of a single datum.
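The key-value access pattern described above can be sketched with a plain
Python dict standing in for a NoSQL store (the keys and values here are
hypothetical examples, not taken from any particular system):

```python
# A minimal sketch of the NoSQL key-value access pattern, using an
# in-memory Python dict as a stand-in for a distributed store.
store = {}

# Values can be of arbitrary, heterogeneous nature.
store["user:1001"] = {"name": "Alice", "logins": 42}
store["user:1002"] = {"name": "Bob", "email": "bob@example.com"}
store["page:home"] = "<html>...</html>"

# The only supported operations are storage and retrieval by key;
# there is no join or cross-key query.
record = store["user:1001"]
print(record["name"])  # Alice
```

Note that relating `user:1001` to any other key would require the
application itself to issue a second lookup; the store offers no join.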
The reduced query capabilities of the NoSQL model are compensated by marked
gains in scalability and performance. NoSQL databases can scale out, so
that ever larger datasets can be handled simply by adding more servers
running on commodity hardware to a cluster.
Therefore, NoSQL databases are preferred to relational databases when the
size of the data is very large, and the nature of the data does not require
joins for its manipulation.
The implementation and development of NoSQL databases is strongly influenced
by the CAP theorem, which states that a distributed data store can satisfy at
most two of the following three properties:
consistency, availability, and partition tolerance. Relational databases
satisfy consistency and availability, but do not tolerate network
partitions. NoSQL databases, instead, are designed for partition tolerance,
and therefore, according to the CAP theorem, must give up either consistency
or availability.
NoSQL databases can be classified according to the following taxonomy:
key-value databases (e.g., BerkeleyDB and Dynamo), tabular databases (e.g.,
BigTable and its clones), document databases (e.g., MongoDB and CouchDB),
and graph databases (e.g., Neo4j).
Key-value databases simply store a map of values indexed by a key.
Tabular databases store a table whose cells can be indexed by both a row key
and a column key. Document databases store a map whose values are documents
in a standardized format such as JSON or BSON. Finally, graph databases
store a graph consisting of vertices and edges.
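The four data models of the taxonomy can be contrasted with small in-memory
sketches; the keys, fields, and values below are hypothetical illustrations
of each model's shape, not the API of any particular database:

```python
# Key-value: a flat map from key to an opaque value.
kv = {"session:9f2": b"serialized-bytes"}

# Tabular: cells addressed by both a row key and a column key.
tabular = {
    ("user:1001", "name"): "Alice",
    ("user:1001", "email"): "alice@example.com",
}

# Document: values are structured documents (here, JSON-like dicts).
documents = {
    "order:7": {"items": [{"sku": "A1", "qty": 2}], "total": 19.9},
}

# Graph: a set of vertices plus labeled edges between them.
graph = {
    "vertices": {"alice", "bob"},
    "edges": [("alice", "knows", "bob")],
}

print(tabular[("user:1001", "name")])  # Alice
```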
NoSQL databases are used for storing the inputs and outputs of parallel
computations performed with the MapReduce framework. MapReduce is both a
programming model used for expressing distributed computations on massive
datasets, and an execution framework for large-scale data processing on
clusters of commodity servers. MapReduce was originally developed by Google
in 2003 (but has its roots in functional
programming) and has since enjoyed widespread adoption via an open-source
implementation called Hadoop.
MapReduce follows a divide-and-conquer approach involving two stages:
a map stage for dividing, and a reduce stage for conquering. In the map
stage, a large problem is partitioned into smaller independent subproblems,
which are tackled in parallel by different workers using a user-defined map
function. In the reduce stage, the outputs of the map stage are combined
together using a user-defined reduce function, to produce the final output.
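The two stages can be sketched in a single process with the canonical
word-count example (this is an illustrative sketch of the programming
model; real frameworks such as Hadoop distribute the same two user-defined
functions across a cluster of workers):

```python
from itertools import groupby
from operator import itemgetter

def map_fn(document):
    # Map stage: each worker independently turns its fragment of the
    # input into intermediate (key, value) pairs.
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce stage: all intermediate values for one key are combined.
    return (word, sum(counts))

def mapreduce(documents):
    # Apply the map function to every input document.
    pairs = [pair for doc in documents for pair in map_fn(doc)]
    # Shuffle/sort: group the intermediate pairs by key.
    pairs.sort(key=itemgetter(0))
    # Apply the reduce function to each group of values.
    return dict(
        reduce_fn(key, (value for _, value in group))
        for key, group in groupby(pairs, key=itemgetter(0))
    )

result = mapreduce(["to be or not to be", "to do is to be"])
print(result["to"])  # 4
```

Because each call to `map_fn` depends only on its own document, the map
stage parallelizes trivially; the grouping step between the two stages is
what a distributed framework implements as its shuffle phase.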