Big Data Technologies

published by endre on Thu, 12/10/2015 - 13:56

Here is a brief overview of the leading open source Big Data technologies currently available on the market and their functions.

Hadoop - A scalable distributed file system that a great majority of the giant media companies use nowadays. It uses mapping and reducing algorithms to process search requests by first parallel mapping (searching) the data across the cluster and then reducing the results by merging the findings into a finished table. Hadoop has a new type of file system called HDFS (similar to the proprietary Google File System or GoogleFS). HDFS is a highly fault-tolerant scalable file system written in Java. It normally sits on a bunch (can be thousands) of inexpensive computers with cheap drives, with files and directories scattered everywhere and operates as a cluster. HDFS can support up to 4500 servers and 200 petabyte addressible file space for a partition. A petabyte being about 1000 terrabytes (TBs), that is a total capacity of about 200,000 TBs or 200 million gigabytes (GBs). The MapReduce framework functionality is able to search the entire cluster in milliseconds and locate the file or data. Access to the HDFS file system is with the use of specialized hdfs commands (Ex. "hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2" or "hadoop fs -ls /user/hadoop/dir1/filename.txt").

There is of course a lot more to Big Data than a cluster of file servers running a virtual file system. Below are the most popular Big Data database technologies and concepts that I'm familiar with, but first let's clear up what Big Data databases are about. Big Data databases are primarily focused on storing non-relational data. NoSQL is a concept that describes the process of storing non-relational data. The various data types are, document (ex. MongoDB, CouchDB), graph, key and value (ex. Riak), and wide-column hybrid (ex. Cassandra). At the highest level in a big data system all data is in the form of keys and values, where they keys are the indexes and the values are complex structures (documents, hashes, graphs, etc).

The various leading database technologies (in no particular order) are:

NoDB - A NoSQL Java API that creates in-memory eventual consistency (EC) object repository for an application to use. Basically you can store, update, delete, and retrieve anything that the application has access to from any source as long as it can be stored in a Java object. Capacity and performance depends on the machine or cloud platform it's on. The objects are defined by the application commonly as classes with serialized values.
Riak KV - A distributed NoSQL Key and Value store database. Similar to Cassandra just much simpler, offers fault tolerance, high availability, and scalability.
Riak TS - Same ideas as Riak KV, just used for time series data.
Riak S2 (Riak CS) - Same ideas as Riak KV and Riak TS, but geared toward large object storage such as video, software packages, backups, etc.
Cassandra - A distributed database system with a hybrid design that uses both relational and non-relational data concepts for its data model. The servers are distributed with high availability, fault tolerance, and high performance across the cluster. Intended uses are large number of low cost servers containing massive amounts of data scattered or clustered over large geographical regions. Supports clusters and data replication across multiple datacenters with very high best in class throughput. Nodes are identical in role so there is no single point of failure in the cluster. Performance and capacity goes up linearly with each new node added. The eventual consistency levels (essentially the QOS) are adjustable. Integrates with Hadoop.
CouchDB - NoSQL document database like MongoDB. Stores data as JSON documents. API calls are made using HTTP and the actual queries are done using JavaScript.
MongoDB - Most popular NoSQL document database. Uses dynamic schemas. Can be clustered and provides high availability replicating across multiple servers. Load balances using sharding where each shard contains a part of the data and consists of a master and multiple slave components. Anything such as JSON, XML, etc is supported. Can be used as a load balanced grid file system (GridFS) also. MongoDB GridFX implementations are used in NGINX and lighthttp web servers.
Pig - Used to run Hadoop MapReduce programs written in Pig Latin.
Pig Latin - programming language used to write Pig programs, essentially search functions. Integrates (can call) routines from common programming languages like Java, Python, and Java Script.
Apache Hive - Distributed database platform built into the Hadoop platform. Supports batch querying of text files (flat files), Key and Value pair format files (SequenceFiles), and Record Columnar Files (RCFiles). It's intended use is storing and managing large data sets located in distributed storage on the Hadoop File System (HDFS). Used the Hive query language HiveQL (simillar to SQL) which are compiled into MapReduce jobs. Ideal uses would be storing archive data of time series (ex. Stock prices) and running analysis on the data set frequently or storing tons of log files from a large number of sources for analytical or archival purposes. It's not intended for Online Transaction Processing (OLTP) applications.
Apache HBase - Just like Hive, Apache HBase is also made for the Hadoop platform and since it's on the HDFS it's also distributed, scalable, and fault tolerant. Unlike Hive which is intended more as a storage, and batch processor of data analysis sorts of jobs, HBase is made for real-time querying big data similarly to the way SPLUNK works, by storing data in massive tables with billions of records, and querying the tables frequently in a matter of seconds. HBase is modeled after Google's proprietary BigTable data storage system which Google runs on its own distributed Google File System (GFS) applications. Essentially Apache HBase is the big transactional database system for Hadoop the way BigTable is for the Google systems running GFS. HBase is intended to be used for data tables that contain upwards toward a billion or more records. Anything less and you're better off with an RDBMS system like MySQL or Oracle since the querying and reporting capacity of HBase isn't as advanced as the existing RDBMS systems out there.

endre's blog
Log in to post comments
Tweet

Endre Pálfi

Consulting Services

Portfolio

Contact Information

Find me on Facebook

You are here

Big Data Technologies

Recent blog posts

Recent comments

Publications and Internet Projects

Affiliated Sites

For Developers

Favorites

Endre Pálfi

User login

Consulting Services

Portfolio

Contact Information

Find me on Facebook

You are here

Big Data Technologies

Recent blog posts

Recent comments

Publications and Internet Projects

Affiliated Sites

For Developers

Favorites