Non-relational data management for big data
Modern information societies continuously produce massive amounts of
data: storing and processing such data are challenges that real-world
systems must address. Any solution must be able to scale up to
datasets on the order of petabytes.
Handling massive datasets must necessarily be performed in a distributed
environment. In particular, storage of data must be spread across multiple
servers, and computations over the data must exploit parallelism as
much as possible.
The need to handle massive datasets has led in recent years to the
development of NoSQL databases for distributed storage, and of the MapReduce
framework for parallel computation.
NoSQL, which stands for "Not Only SQL", is a broad class of database
management systems that do not adhere to the relational database management
system model.
In relational databases, data is highly structured, and is organized into
tables, consisting of fixed columns, each having a well-specified data type.
In NoSQL databases, instead, data is semi-structured, and is organized into
maps assigning values to keys, where values can be of arbitrary nature.
Relational databases can perform complex queries for retrieving data, and in
particular it is possible to join together data belonging to different
tables. In NoSQL databases, instead, joins are not possible, and there is
little functionality beyond the storage and retrieval of a single datum.
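The key-value access pattern described above can be sketched with a plain
Python dict standing in for a NoSQL store (the keys and values here are
hypothetical examples, not taken from any particular system):

```python
# A minimal sketch of the NoSQL key-value access pattern, using an
# in-memory Python dict as a stand-in for a distributed store.
store = {}

# Values can be of arbitrary, heterogeneous nature.
store["user:1001"] = {"name": "Alice", "logins": 42}
store["user:1002"] = {"name": "Bob", "email": "bob@example.com"}
store["page:home"] = "<html>...</html>"

# The only supported operations are storage and retrieval by key;
# there is no join or cross-key query.
record = store["user:1001"]
print(record["name"])  # Alice
```

Note that relating `user:1001` to any other key would require the
application itself to issue a second lookup; the store offers no join.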
The reduced query capabilities of the NoSQL model are compensated by marked
gains in scalability and performance. NoSQL databases can scale out, so
that ever larger datasets can be handled simply by adding more servers
running on commodity hardware to a cluster.
Therefore, NoSQL databases are preferred to relational databases when the
size of the data is very large, and the nature of the data does not require
joins for its manipulation.
The implementation and development of NoSQL databases is strongly influenced
by the CAP theorem, which states that a distributed data store can satisfy at
most two of the following three properties:
consistency, availability, and partition tolerance. Relational databases
satisfy consistency and availability, but do not tolerate network
partitions. NoSQL databases, instead, are designed for partition tolerance,
and therefore, according to the CAP theorem, must give up either consistency
or availability.
NoSQL databases can be classified according to the following taxonomy:
key-value databases (e.g., BerkeleyDB and Dynamo), tabular databases (e.g.,
BigTable and its clones), document databases (e.g., MongoDB and CouchDB),
and graph databases (e.g., Neo4j).
Key-value databases simply store a map of values indexed by a key.
Tabular databases store a table whose cells can be indexed by both a row key
and a column key. Document databases store a map whose values are documents
in a standardized format such as JSON or BSON. Finally, graph databases
store a graph consisting of vertices and edges.
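The four data models of the taxonomy can be contrasted with small in-memory
sketches; the keys, fields, and values below are hypothetical illustrations
of each model's shape, not the API of any particular database:

```python
# Key-value: a flat map from key to an opaque value.
kv = {"session:9f2": b"serialized-bytes"}

# Tabular: cells addressed by both a row key and a column key.
tabular = {
    ("user:1001", "name"): "Alice",
    ("user:1001", "email"): "alice@example.com",
}

# Document: values are structured documents (here, JSON-like dicts).
documents = {
    "order:7": {"items": [{"sku": "A1", "qty": 2}], "total": 19.9},
}

# Graph: a set of vertices plus labeled edges between them.
graph = {
    "vertices": {"alice", "bob"},
    "edges": [("alice", "knows", "bob")],
}

print(tabular[("user:1001", "name")])  # Alice
```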
NoSQL databases are used for storing the inputs and outputs of parallel
computations performed with the MapReduce framework. MapReduce is both a
programming model used for expressing distributed computations on massive
datasets, and an execution framework for large-scale data processing on
clusters of commodity servers. MapReduce was originally developed by Google
in 2003 (but has its roots in functional
programming) and has since enjoyed widespread adoption via an open-source
implementation called Hadoop.
MapReduce follows a divide-and-conquer approach involving two stages:
a map stage for dividing, and a reduce stage for conquering. In the map
stage, a large problem is partitioned into smaller independent subproblems,
which are tackled in parallel by different workers using a user-defined map
function. In the reduce stage, the outputs of the map stage are combined
together using a user-defined reduce function, to produce the final output.
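The two stages can be sketched in a single process with the canonical
word-count example (this is an illustrative sketch of the programming
model; real frameworks such as Hadoop distribute the same two user-defined
functions across a cluster of workers):

```python
from itertools import groupby
from operator import itemgetter

def map_fn(document):
    # Map stage: each worker independently turns its fragment of the
    # input into intermediate (key, value) pairs.
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce stage: all intermediate values for one key are combined.
    return (word, sum(counts))

def mapreduce(documents):
    # Apply the map function to every input document.
    pairs = [pair for doc in documents for pair in map_fn(doc)]
    # Shuffle/sort: group the intermediate pairs by key.
    pairs.sort(key=itemgetter(0))
    # Apply the reduce function to each group of values.
    return dict(
        reduce_fn(key, (value for _, value in group))
        for key, group in groupby(pairs, key=itemgetter(0))
    )

result = mapreduce(["to be or not to be", "to do is to be"])
print(result["to"])  # 4
```

Because each call to `map_fn` depends only on its own document, the map
stage parallelizes trivially; the grouping step between the two stages is
what a distributed framework implements as its shuffle phase.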