The RDBMS which was a standard for storing the data for most of the applications has lot of features which are good but not required for many of the current day data-driven applications. On the other hand the relational databases have shown poor performance on certain data-intensive applications, including indexing a large number of documents, serving pages on high-traffic websites, and delivering streaming media.
Looking at current trend of social applications, we need to look for data storage alternatives which are scalable as well as have low latency. This paper briefly discusses about No-SQL databases with detailed look on Hypertable, which is an open source database inspired by publications on the design of Google's BigTable. It will explain No-SQL database concept, Hypertable database architecture, Advantages of Hypertable as well as applications where Hypertable might not be the best choice.
What is No-SQL?
No-SQL databases are data storages that do not follow relational model, they are databases without a specific structure, and which don't use just SQL as a specific query language or rather uses different additional means to access data. These non-relational, distributed data stores often do not attempt to provide "ACID (Atomicity, Consistency, Isolation and Durability) guarantees" which is the key attributes of classic relational database systems.
No-SQL database emphasizes the advantages of Key-Value Stores, Document Databases, and Graph Databases. These databases are column-oriented unlike the RDBMS databases which are row-oriented. This orientation is further explained in next sections. Main driving force for using No-SQL database is the limitations of current RDBMS software. The price of processing data in an RDBMS scales exponentially with the amount of data - getting twice the processing speed on twice the data rarely costs only twice as much. No matter what you do, something has to give. Eventually, you can try to shard out databases but even these solutions still require expensive hardware, custom hash algorithms, compression, brittle analysis structures, and the licenses are enormously pricey. With those sorts of limitations, it's difficult to tackle "Web-Scale" problems.
Several No-SQL systems employ a distributed architecture, with the data held in a redundant manner on several servers, often using a distributed hash table. In this way, the system can readily scale out by adding more servers, and failure of a server can be tolerated. By employing many nodes in a cluster, No-SQL systems allow applications to be built on top of a No-SQL database that can process immense amounts of data by subdividing the work among the nodes.
What is Hypertable?
Hypertable is high performance, scalable and open source database. It is modeled after Google's BigTable database. Hypertable is more generic & more flexible implementation of BigTable. Hypertable has been developed as in-house software at Zvents Inc. It has been open-sourced under the GPL in 2007. It is later sponsored by Leading Chinese Search Engine Company "Baidu".
Hypertable runs on top of a distributed file system such as the Apache HDFS (Hadoop Distributed File System), GlusterFS, or the Kosmos File System (KFS). Hypertable has "Thrift Interface" to all popular languages such as Java, PHP, Ruby, Python, Perl etc. Hypertable has its own query language called HQL (Hypertable Query Language) and exposes a native C++ as well as a Thrift API. Hypertable is almost completely developed using C++ for better performance.
Hypertable is designed to manage the storage and processing of information on a large cluster of commodity servers, providing resilience to machine and component failures. Hypertable seeks to set the open source standard for highly available, petabyte scale, database systems. It is processing system for structured and unstructured data.
Data is represented in the system as a multi-dimensional table of information. The data in a table can be transformed and organized at high speed by performing computations in parallel, pushing them to where the data is physically stored.
Hypertable Architecture:
Hyperspace is a service that provides a file system for storing small amounts of metadata as well as it acts as a lock manager.
Range Server manages ranges of table data, caches updates in memory which is known as CellCache and periodically writes updates to physical disk in a specially formatted file called a CellStore. Initially each table consists of a single range that spans the entire row key space. As the table fills with data, the range will eventually exceed a size threshold which is pre-configured and will split into two ranges using the middle row key as a split point. One of the ranges will stay on the same range server that held the original range and the other will get reassigned to another range server by the Master. This splitting process continues for all of the ranges as they continue to grow.
The master handles all meta operations such as creating and deleting tables. It is responsible for detecting range server failures and re-assigning ranges if necessary in turn taking care of range server load balancing.
DFS broker is a process that translates standardized file system protocol messages into the system calls that are unique to the specific file system. DFS brokers have been developed for HDFS (hadoop), KFS, and local file system.
Graphical Representation of Storage:
Hypertable is two dimensional table with cell versions. Cells are identified by four-part key: 1. Row 2. Column Family 3.Column Qualifier & 4.Timestamp
In above diagram "com.facebook.com" is row, "title" and "content" are Column Families, Column Qualifier is not shown in above diagram & there is a timestamp dimension making it multi-dimensional array like structure.
Graphical Representation of Storage:
In above diagram, this multi-dimensional table of information is represented as a flat sorted list of key/value pairs.
Traditional databases are considered to be either row oriented or column oriented depending on how data is physically stored. With a row oriented database, all the data for a given row is stored contiguously on disk. With a column oriented database, all data for a given column is stored contiguously on disk. Hypertable is considered column oriented database as the data is stored as columns on the disk. Access groups in Hypertable provide a way to group columns together physically. All of the columns in an access group will have their data stored physically together in the same CellStore.
What is different in Hypertable?
In Hypertable we have a concept of "namespace" which is similar to concept of "database" in any RDBMS. So we need to create namespace before we could create the tables. But here we can have a namespace within namespace. Usually namespace contains tables as well as other sub-namespaces.
Name space can be created with simple syntax
hypertable> create namespace "test";
We can switch to newly created namespace by below syntax.
hypertable> use test;
Hypertable uses API as well as HQL (Hypertable Query Language) which is similar to traditional SQL. Above "create namespace" & "use namespace" are the examples of HQL query. There are many types HQL statements such as create, insert, delete etc. The script file containing HQL statements usually have extension .hql
One feature that Hypertable supports is that of access groups, which allow different logical column families to be together physically, which can have some performance advantages. Hypertable now supports regular expression based filtering. Queries can now filter cells by regular expression matches on the row key, column qualifier, and value. For this regular expression functionality, Hypertable uses Google's RE2 regular expression engine which is used in BigTable. RE2 allows Hypertable to deliver powerful pattern matching capability at the lowest hardware cost.
Read operations on hypertable have to read from all the CellStores as well as the CellCache. The CellCache accesses are all in-memory, but the CellStore accesses can result in many disk seeks, if the blocks containing the relevant keys are not in memory. BloomFilter is used to reduce disk lookups for rows and columns that have no values in the CellStore. The BloomFilter can be defined at the time of the creation of the table
Hypertable: Advantages vs. Limitations
Advantages:
High Performance Implementation due to extensive use of C++ as implementation language
Massively scalable as it is horizontally scalable storage. You can add additional server to increase the scalability
Distributed architecture: it runs on top of a distributed file system such as the Apache HDFS (Hadoop Distributed File System), GlusterFS, or the Kosmos File System (KFS).
High availability as it manages the storage and processing of information on a large cluster of commodity servers, providing resilience to machine and component failures. Hypertable seeks to set the open source standard for highly available, petabyte scale, database systems.
Proven architecture as it is modeled after Google's Bigtable which has many proven implementations.
Although main interface API is in C++ due to its implementation in C++ language, it has "Thrift Interface" for all popular High Level Languages like Java, Ruby, Python, and PHP etc.
Lower cost than competitive solutions at that scale as it is Open Source - GPL License.
Richer than simple key-value pairs.
Schema Flexibility & semi-structured data can be stored.
Limitations:
No transactions : Do not provide "ACID (Atomicity, Consistency, Isolation and Durability) guarantees"
Inherited complexity and problems with single points of failure from BigTable. Master node is the single point of failure at this moment.
No joins.
No secondary indexes.
Who should use Hypertable?
Now the important thing is who should utilize these new features which are provided with No-SQL database like Hypertable? According to me the web applications like search engines or social networking companies who has data-driven businesses or applications processing huge amount of data should make use of the features which are actually provided by Hypertable. This kind of "Open Source" No-SQL database can be used by start-up companies which are growing exponentially in terms of data storage. They can start with limited infrastructure for storage and then add the additional distributed servers for storage whenever they need to increase the scalability & performance.
The trend has changed from using "RDBMS" as data storage even if many important features are not used by that application. Hypertable can store structured data which is generally stored by traditional RDBMS systems as well as additionally it can store free text which is considered as un-structured data. As it supports structured as well as un-structured data it is called as semi-structured database.
Who should NOT use Hypertable?
When we discussed the advantages which we get from using semi-structured database as Hypertable for data-driven applications, we do realize that these databases do have some limitations when they provide these additional features such as exceptionally huge scalability, distributed flexible architecture where you can add more nodes or hardware to improve scalability & performance.
The limitations such as single point failure which was inherited from BigTable architecture, no joins or no transaction support makes it unusable for traditional applications where these features are must. Such applications include banking systems where transactions (all or none) are must. It is advisable that those kinds of applications should still stick with RDBMS where they actually utilize the heavy features of traditional RDBMS software.
Case Study
Rediff.com is largest India owned and operated web portal. It uses this for spam classification as well as giving recommendations. It has several hundred frontend web servers with tens of millions requests per day. It has peak rate of 500 queries/second. It has deployed hypertable database and has 99.99% sub-second response time.
Baidu is "Local Search" company located in China. It is one of the major sponsors for hypertable development. They have 120 node cluster running hypertable and HDFS. In their system, the log processing, viewing applications injecting approximately 500 gigabytes of data per day into their implementation.
Other than above two, there are some other companies like EPICS, Inepex, Endgame Systems, Dehems, ZVents, Tribalytic who has successfully implemented applications using Hypertable database.
Final Observation and Conclusion
Selection of appropriate technology has considerable impact on the success of the project. The software as well as the hardware which is running the software should be chosen very carefully. The project characteristics as well as requirements should be carefully studied and it should be carefully mapped to the technology or tools which are used.
When selecting the data storage system, there is no single right or wrong choice for database. Mainly if you are looking at massive data read-write and the applications which are mainly data-driven, still expects the low latency and do not care much about transactions, then hypertable might be a good choice for you.
But on the other hand side, if you are looking at traditional software system which does care about ACID (Atomicity, Consistency, Isolation and Durability) and cannot overlook any of these limitations then in that case traditional RDBMS is good choice.