Distribution
Lily has a distributed architecture. This distribution is manifested in two ways. First, there are nodes (= systems, servers) that perform different functions, causing a functional layering. Second, there are multiple nodes that perform the same function, for purposes of scalability and fault-tolerance.
This is illustrated in the following figure.
In this diagram, the Lily node serves as a black box node for different components, which are described further on.
Not every box in this diagram necessarily corresponds to a physical server. While multiple processes of the same kind should be run on different servers, you can run e.g. a Lily node, a HBase region node and a HDFS data node on the same server.
While the diagram shows three nodes of every kind, the actual numbers can differ for each type of node, depending on the needs.
For some kinds of nodes, it does not matter to what node to connect. For example, each client can connect to any arbitrary Lily node. For others, the node to connect to depends on the one that hosts the data. For example, a Lily node that wants to read a row from HBase will have to connect with the HBase node that hosts this row.
A Lily client does not connect to one fixed endpoint. It decides itself to what Lily node to connect, and directly talks to HDFS and HBase nodes when appropriate.
Main components
The diagram below shows the main components of the Lily data repository and the connections between them. For clarity, this figure shows only one instance of each component, but remember that there can be any number of them.
What we referred to as “Lily node” in the above section on distribution consists of different independent components such as the repository, the indexer, and the message queue. These could be run as different processes or in one process, this is of little importance for our discussion here.
HBase
Lily uses Apache HBase for the storage of fine-grained data. HBase is modeled after Google's BigTable. HBase has little in common with the SQL databases everyone knows: it does not offer much querying, nor transactions. But instead it offers scalability to very large amounts of data (billions of rows) by adding more hardware as needed. No manual repartitioning of the data is necessary. It also handles failing nodes automatically.
HBase has a special data model, whereby rows can contain very large amounts of columns, and columns without a value do not take space, so it is ideal for sparse data structures. The BigTable people concisely call it “a sparse, distributed, persistent multidimensional sorted map”. HBase does not know data types, it handles everything as bytes.
HBase stores its data on HDFS, described next.
HDFS
HDFS is the Hadoop distributed file system, thus a file system that spans across nodes. It is modeled after GFS, the Google File System. A file in HDFS is stored multiple times in the cluster (by default 3 times), so that if a node fails, the data is still available elsewhere. HDFS is best used for the storage of larger files. The namespace of the file system (the link between the names of the files and where they are stored) is maintained on one system, called the name node. The number of files one can store in HDFS is limited by the amount of memory in that system. Practically this means you can still store millions of files on it, but in Lily we will store smaller blobs in HBase, to avoid quickly hitting this limit. HDFS has a focus on high throughput rather than low latency.
The repository
The repository provides the basic record CRUD functionality. Clients connect to the repository using an Avro-based protocol. Avro is an efficient binary serialization system. The repository connects to HBase using the HBase Java API, which talks HBase RPC, also based on an efficient binary serialization.
The basic entity managed by the repository is called a record, see the repository model description. When reading a record, a client can specify to read only some fields, and when updating a record, a client only needs to communicate the changed fields.
The Java API exposed by the repository is based on simple data objects and service-style interfaces. This API-approach makes manipulation of the data objects straightforward.
The ID of the record can either be assigned by the user when creating the record, or is automatically assigned by the repository, in which case it is a UUID.
Fields in a record can be blobs. These blobs are stored either in HBase or on HDFS, depending on a size-based strategy. Smaller blobs like HTML pages can be stored in HBase, while bigger blobs that should be handled as streams are stored on HDFS (see also discussion on HDFS above).
One record, which can contain multiple versions, maps onto one row in HBase. This makes that a record is the unit of atomic manipulation.
The Write Ahead Log
When creating or updating a record, often secondary actions (= post-update actions) will need to happen, the most common example of which is keeping indexes up to date.
If we would naively update the row in HBase, and then update the corresponding indexes, there would be a possibility that the indexes would not be updated if the repository process dies.
In more traditional architectures, transactions are used to assure that multiple actions happen as one atomic operation. For our use-cases, full transaction support is not needed. We do not need atomicity, nor do we need rollback. All secondary actions are considered to be subordinate actions which should succeed, and if they fail, they should not invalidate the operation on the record.
The solution we use in Lily is a write-ahead-log, or WAL for short. Before performing an action to the repository, we write our intention to do this to the WAL. Then we update the repository, and confirm this to the WAL. Then the secondary actions are performed, each time confirming to the WAL. If at any point the process would be interrupted, upon restart the WAL can be checked to see up to where we got and to perform any remaining actions.
Lily's WAL is unrelated to the HBase WAL. It is also conceptually different, since Lily does not write the data of a record update to its WAL. Its only purpose is to guarantee the execution of the secondary actions.
The Message Queue
Updating indexes does not need to happen synchronously with the update of the record, while the client is waiting for its response. Rather, this can be done asynchronously. The usual solution to this is to make use of a message queue. Pushing a message onto the queue is a kind of secondary action that needs to happen when updating a record, and our WAL will assure that the message will surely be pushed to the queue even if the repository dies before it gets to that.
Now, rather than bringing an existing queue technology into our system, which would have its own persistence, admin needs, failover solution, etc. Lily will use a lightweight message queue that reuses HBase for persistence.
The Indexer
The role of the Indexer is to keep the Solr-index up to date when records are created, updated or deleted. For this purpose, the Indexer listens to the message queue.
The indexer maps Lily records onto Solr documents, by deciding (based on configuration) which records and what fields of the record need to be indexed. For blob fields, it can perform content extraction using the Tika library.
Denormalization
Lily records can contain link fields. Link fields are links to other records. During indexing, you can include information from linked records within the index of the current record. This is called denormalization. Information can be denormalized by following links multiple levels deep. Denormalization at index time is an alternative for SQL-join-like functionality at query time. General join-queries are not available in Lucene, and complicated to do with sharded databases in general. Denormalization makes querying faster and easier, but complicates indexing. Denormalization assumes you know at beforehand (= when indexing) what sort of queries you will want to do on linked content.
A consequence of denormalization is that when a record is updated, the index entries of other records might also become invalid, when they contain information from the updated record. The Lily Indexer will automatically update such index entries. For this, it maintains special tables that store the dependencies of indexed records, which allows to quickly and accurately find out which records need re-indexing in response to repository changes.
Solr
Solr is a search server based on Lucene, the well-known excellent text-search library. It provides powerful search functionality including full-text search (with spell check, search suggestions, and so on), fielded search and faceted navigation. The configuration can be tweaked, e.g. with regards to text analysis, to provide an optimal search experience. It supports distributed querying across a set of Solr nodes (to support data sets that do not fit on a single server), and Solr nodes can be replicated (to support many concurrent search requests).
Lily supports both SolrCloud and classic (non-cloud) Solr. For classic Solr, Lily is able to do the distributed indexing (the routing towards the shards), something which is taken care of automatically when using SolrCloud.
ZooKeeper
ZooKeeper provides some basic services for the coordination of distributed applications, like distributed synchronization, leader election and configuration. As these things are hard to get right, it is a good thing that many applications re-use ZooKeeper for this purpose. ZooKeeper is used by Lily, HBase, and SolrCloud.