Endre Pálfi

Moving business forward, one client at a time.

Big Data Technologies

published by endre on Thu, 12/10/2015 - 13:56

Here is a brief overview of the leading open source Big Data technologies currently available on the market and their functions.

Hadoop - A scalable framework for distributed storage and batch processing that many of the largest media and internet companies use nowadays. It processes requests with the MapReduce model: a map step first scans the data in parallel across the cluster, and a reduce step then merges the partial findings into a finished result table. MapReduce jobs are batch-oriented, so a scan of the entire cluster typically takes seconds to minutes or longer, depending on data size, rather than milliseconds.

Hadoop's file system is called HDFS (similar to the proprietary Google File System, or GFS). HDFS is a highly fault-tolerant, scalable file system written in Java. It normally sits on a bunch (can be thousands) of inexpensive computers with cheap drives and operates as a cluster, with each file split into blocks that are replicated across machines. Figures cited for large HDFS deployments run up to about 4,500 servers and 200 petabytes of addressable file space in one cluster. A petabyte being about 1,000 terabytes (TB), that is a total capacity of about 200,000 TB or 200 million gigabytes (GB). Access to HDFS is through specialized file system commands (ex. "hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2" or "hadoop fs -ls /user/hadoop/dir1/filename.txt").
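
The map-then-reduce flow described above can be illustrated with a small single-process sketch. This is not Hadoop's API: the function names and the sample data are made up for illustration, and on a real cluster the map calls would run in parallel on the machines that hold each block.

```python
from collections import defaultdict

# Hypothetical sample data: each string stands in for one block of a file
# stored on a different node of the cluster.
BLOCKS = [
    "big data needs big storage",
    "hadoop stores big files",
]


def map_phase(block):
    """Map step: emit a (word, 1) pair for every word in one block."""
    return [(word, 1) for word in block.split()]


def shuffle(pairs):
    """Group the emitted values by key, as happens between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: merge each key's values into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(blocks):
    # Here the map calls run one after another in a single process;
    # a cluster would run them concurrently, one per block.
    pairs = [pair for block in blocks for pair in map_phase(block)]
    return reduce_phase(shuffle(pairs))


counts = word_count(BLOCKS)  # counts["big"] is 3
```

The reduce step only ever sees values grouped under one key, which is what lets the work be spread over many machines and merged at the end.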

There is of course a lot more to Big Data than a cluster of file servers running a virtual file system. Below are the most popular Big Data database technologies and concepts that I'm familiar with, but first let's clear up what Big Data databases are about. Big Data databases are primarily focused on storing non-relational data. NoSQL is the umbrella term for databases that store non-relational data. The main data models are document (ex. MongoDB, CouchDB), graph, key-value (ex. Riak), and wide-column hybrid (ex. Cassandra). At the highest level, all data in a big data system takes the form of keys and values, where the keys are the indexes and the values are complex structures (documents, hashes, graphs, etc.).
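
To make the keys-and-values idea concrete, here is a minimal sketch that uses a plain Python dict as a stand-in for a key-value store. The key names, the sample values, and the get helper are all hypothetical and do not correspond to any particular product's API.

```python
# Hypothetical in-memory stand-in for a key-value store: a plain dict.
store = {}

# Document-style value: nested fields under one key.
store["user:42"] = {"name": "Ada", "tags": ["admin", "ops"]}

# Hash-style value: a flat field-to-value mapping.
store["session:9f"] = {"user": "user:42", "ttl_seconds": 3600}

# Graph-style value: an adjacency list of neighbouring keys.
store["follows:user:42"] = ["user:7", "user:13"]


def get(key, default=None):
    """Look a value up by its key, the only index this store offers."""
    return store.get(key, default)
```

Whatever shape the value has, the lookup is always by key; the different NoSQL families mostly differ in how much structure they understand inside the value.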

The various leading database technologies (in no particular order) are:

  • NoDB - A NoSQL Java API that creates an in-memory, eventually consistent (EC) object repository for an application to use. Basically, you can store, update, delete, and retrieve anything that the application has access to, from any source, as long as it can be stored in a Java object. Capacity and performance depend on the machine or cloud platform it runs on. The objects are commonly defined by the application as classes with serialized values.
  • Riak KV - A distributed NoSQL key-value store. Similar to Cassandra but much simpler; offers fault tolerance, high availability, and scalability.
  • Riak TS - Same ideas as Riak KV, just used for time series data.
  • Riak S2 (Riak CS) - Same ideas as Riak KV and Riak TS, but geared toward large object storage such as video, software packages, backups, etc.
  • Cassandra - A distributed database system with a hybrid design that uses both relational and non-relational data concepts for its data model. It provides high availability, fault tolerance, and high performance across the cluster. It is intended for large numbers of low-cost servers holding massive amounts of data, scattered or clustered over large geographical regions. Supports clusters and data replication across multiple datacenters with very high read and write throughput. Nodes are identical in role, so there is no single point of failure in the cluster. Performance and capacity go up roughly linearly with each new node added. The consistency levels (essentially the QOS) are adjustable per operation, from eventual to strong consistency. Integrates with Hadoop.
  • CouchDB - NoSQL document database like MongoDB. Stores data as JSON documents. API calls are made over HTTP, and queries are written as JavaScript map and reduce functions (views).
  • MongoDB - Most popular NoSQL document database. Uses dynamic schemas. Can be clustered and provides high availability by replicating across multiple servers. Scales out using sharding, where each shard contains a part of the data and consists of one master (primary) and multiple slave (secondary) components. Documents are stored as BSON, a binary JSON-like format, so other formats such as XML need to be converted or stored as strings. Can also be used as a load-balanced grid file system (GridFS). GridFS plugins are available for the NGINX and lighttpd web servers.
  • Pig - A platform for running Hadoop MapReduce programs written in Pig Latin.
  • Pig Latin - The programming language used to write Pig programs, essentially data-flow scripts that load, filter, join, and aggregate data. Can call routines written in common programming languages like Java, Python, and JavaScript.
  • Apache Hive - Data warehouse platform built on top of Hadoop. Supports batch querying of text files (flat files), key-value pair format files (SequenceFiles), and Record Columnar Files (RCFiles). Its intended use is storing and managing large data sets located in distributed storage on the Hadoop Distributed File System (HDFS). Uses the Hive query language, HiveQL (similar to SQL), whose queries are compiled into MapReduce jobs. Ideal uses would be storing archive data of time series (ex. stock prices) and running analysis on the data set frequently, or storing tons of log files from a large number of sources for analytical or archival purposes. It's not intended for Online Transaction Processing (OLTP) applications.
  • Apache HBase - Just like Hive, Apache HBase is made for the Hadoop platform, and since it runs on HDFS it's also distributed, scalable, and fault tolerant. Unlike Hive, which is intended more for storage and batch processing of data analysis jobs, HBase is made for real-time querying of big data, similar to the way Splunk works: it stores data in massive tables with billions of records and answers frequent queries against them in a matter of seconds. HBase is modeled after Google's proprietary BigTable data storage system, which Google runs on its own distributed Google File System (GFS). Essentially, Apache HBase is the big real-time read/write database for Hadoop the way BigTable is for the Google systems running GFS. HBase is intended for data tables that contain a billion or more records. Anything less and you're better off with an RDBMS like MySQL or Oracle, since the querying and reporting capacity of HBase isn't as advanced as that of existing RDBMSs.
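
The Cassandra entry above mentions adjustable consistency levels. A rough sketch of the usual rule of thumb behind that setting: with N replicas of each row, W replicas that must acknowledge a write, and R replicas consulted on a read, a read is guaranteed to touch at least one replica holding the latest acknowledged write when R + W > N. The function names below are hypothetical helpers for illustration, not part of any Cassandra driver.

```python
def quorum(replication_factor):
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def replicas_overlap(replication_factor, write_acks, read_replicas):
    """Return True when every read set must share a replica with every write set."""
    for name, value in (("write_acks", write_acks), ("read_replicas", read_replicas)):
        if not 1 <= value <= replication_factor:
            raise ValueError(f"{name} must be between 1 and {replication_factor}")
    # Pigeonhole argument: if R + W exceeds N, the two sets cannot be disjoint.
    return read_replicas + write_acks > replication_factor


# With three replicas, quorum writes plus quorum reads overlap,
# while single-replica writes and reads do not.
strong = replicas_overlap(3, quorum(3), quorum(3))
eventual = not replicas_overlap(3, 1, 1)
```

Lower values of R and W trade that guarantee for lower latency and higher availability, which is the adjustment the entry refers to.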
Copyright © 2012 Endre Pálfi. All rights reserved.