1. Introduction
Most major DBMS vendors implement record-oriented
storage systems, where the attributes of a record (or tuple)
are placed contiguously in storage. With this row store
architecture, a single disk write suffices to push all of the
fields of a single record out to disk. Hence, high
performance writes are achieved, and we call a DBMS
with a row store architecture a write-optimized system.
These are especially effective on OLTP-style applications.
In contrast, systems oriented toward ad-hoc querying
of large amounts of data should be read-optimized. Data
warehouses represent one class of read-optimized system, in which periodically a bulk load of new data is
performed, followed by a relatively long period of ad-hoc
queries. Other read-mostly applications include customer
relationship management (CRM) systems, electronic
library card catalogs, and other ad-hoc inquiry systems. In
such environments, a column store architecture, in which
the values for each single column (or attribute) are stored
contiguously, should be more efficient. This efficiency
has been demonstrated in the warehouse marketplace by
products like Sybase IQ [FREN95, SYBA04], Addamark
[ADDA04], and KDB [KDB04]. In this paper, we discuss
the design of a column store called C-Store that includes a
number of novel features relative to existing systems.
With a column store architecture, a DBMS need only
read the values of columns required for processing a given
query, and can avoid bringing into memory irrelevant
attributes. In warehouse environments where typical
queries involve aggregates performed over large numbers
of data items, a column store has a sizeable performance
advantage. However, there are several other major
distinctions that can be drawn between an architecture that
is read-optimized and one that is write-optimized.
Current relational DBMSs were designed to pad
attributes to byte or word boundaries and to store values in
their native data format. It was thought that it was too
expensive to shift data values onto byte or word
boundaries in main memory for processing. However,
CPUs are getting faster at a much greater rate than disk
bandwidth is increasing. Hence, it makes sense to trade
CPU cycles, which are abundant, for disk bandwidth,
which is not. This tradeoff appears especially profitable in
a read-mostly environment.
There are two ways a column store can use CPU cycles
to save disk bandwidth. First, it can code data elements
into a more compact form. For example, if one is storing
an attribute that is a customer’s state of residence, then US
states can be coded into six bits, whereas the two-
character abbreviation requires 16 bits and a variable
length character string for the name of the state requires
many more. Second, one should densepack values in
storage. For example, in a column store it is
straightforward to pack N values, each K bits long, into N
* K bits. The coding and compressibility advantages of a column store over a row store have been previously
pointed out in [FREN95]. Of course, it is also desirable to
have the DBMS query executor operate on the compressed
representation whenever possible to avoid the cost of
decompression, at least until values need to be presented
to an application.
Commercial relational DBMSs store complete tuples
of tabular data along with auxiliary B-tree indexes on
attributes in the table. Such indexes can be primary,
whereby the rows of the table are stored in as close to
sorted order on the specified attribute as possible, or
secondary, in which case no attempt is made to keep the
underlying records in order on the indexed attribute. Such
indexes are effective in an OLTP write-optimized
environment but do not perform well in a read-optimized
world. In the latter case, other data structures are
advantageous, including bit map indexes [ONEI97], cross
table indexes [ORAC04], and materialized views
[CERI91]. In a read-optimized DBMS one can explore
storing data using only these read-optimized structures,
and not support write-optimized ones at all.
Hence, C-Store physically stores a collection of
columns, each sorted on some attribute(s). Groups of
columns sorted on the same attribute are referred to as
“projections”; the same column may exist in multiple
projections, possibly sorted on a different attribute in each.
We expect that our aggressive compression techniques
will allow us to support many column sort-orders without
an explosion in space. The existence of multiple sort-
orders opens opportunities for optimization.
Clearly, collections of off-the-shelf “blade” or “grid”
computers will be the cheapest hardware architecture for
computing and storage intensive applications such as
DBMSs [DEWI92]. Hence, any new DBMS architecture
should assume a grid environment in which there are G
nodes (computers), each with private disk and private
memory. We propose to horizontally partition data across
the disks of the various nodes in a “shared nothing”
architecture [STON86]. Grid computers in the near future
may have tens to hundreds of nodes, and any new system
should be architected for grids of this size. Of course, the
nodes of a grid computer may be physically co-located or
divided into clusters of co-located nodes. Since database
administrators are hard pressed to optimize a grid
environment, it is essential to allocate data structures to
grid nodes automatically. In addition, intra-query
parallelism is facilitated by horizontal partitioning of
stored data structures, and we follow the lead of Gamma [DEWI90] in implementing this construct.
Many warehouse systems (e.g. Walmart [WEST00])
maintain two copies of their data because the cost of
recovery via DBMS log processing on a very large
(terabyte) data set is prohibitive. This option is rendered
increasingly attractive by the declining cost per byte of
disks. A grid environment allows one to store such
replicas on different processing nodes, thereby supporting a Tandem-style highly-available system [TAND89]. However, there is no requirement that one store multiple
copies in the exact same way. C-Store allows redundant
objects to be stored in different sort orders providing
higher retrieval performance in addition to high
availability. In general, storing overlapping projections
further improves performance, as long as redundancy is
crafted so that all data can be accessed even if one of the
G sites fails. We call a system that tolerates K failures K-
safe. C-Store will be configurable to support a range of
values of K.
It is clearly essential to perform transactional updates,
even in a read-mostly environment. Warehouses have a
need to perform on-line updates to correct errors. As well,
there is an increasing push toward real-time warehouses,
where the delay to data visibility shrinks toward zero. The
ultimate desire is on-line update to data warehouses.
Obviously, in read-mostly worlds like CRM, one needs to
perform general on-line updates.
There is a tension between providing updates and
optimizing data structures for reading. For example, in
KDB and Addamark, columns of data are maintained in
entry sequence order. This allows efficient insertion of
new data items, either in batch or transactionally, at the
end of the column. However, the cost is a less-than-
optimal retrieval structure, because most query workloads
will run faster with the data in some other order.
However, storing columns in non-entry sequence will
make insertions very difficult and expensive.
C-Store approaches this dilemma from a fresh
perspective. Specifically, we combine in a single piece of
system software, both a read-optimized column store and
an update/insert-oriented writeable store, connected by a
tuple mover, as noted in Figure 1. At the top level, there
is a small Writeable Store (WS) component, which is
architected to support high performance inserts and
updates. There is also a much larger component called the
Read-optimized Store (RS), which is capable of
supporting very large amounts of information. RS, as the
name implies, is optimized for read and supports only a
very restricted form of insert, namely the batch movement
of records from WS to RS, a task that is performed by the
tuple mover of Figure 1. Of course, queries must access data in both storage
systems. Inserts are sent to WS, while deletes must be
marked in RS for later purging by the tuple mover.
Updates are implemented as an insert and a delete. In
order to support a high-speed tuple mover, we use a
variant of the LSM-tree concept [ONEI96], which
supports a merge out process that moves tuples from WS
to RS in bulk by an efficient method of merging ordered
WS data objects with large RS blocks, resulting in a new
copy of RS that is installed when the operation completes.
The architecture of Figure 1 must support transactions
in an environment of many large ad-hoc queries, smaller
update transactions, and perhaps continuous inserts.
Obviously, blindly supporting dynamic locking will result
in substantial read-write conflict and performance
degradation due to blocking and deadlocks.
Instead, we expect read-only queries to be run in
historical mode. In this mode, the query selects a
timestamp, T, less than the one of the most recently
committed transactions, and the query is semantically
guaranteed to produce the correct answer as of that point
in history. Providing such snapshot isolation [BERE95]
requires C-Store to timestamp data elements as they are
inserted and to have careful programming of the runtime
system to ignore elements with timestamps later than T.
Lastly, most commercial optimizers and executors are
row-oriented, obviously built for the prevalent row stores
in the marketplace. Since both RS and WS are column-
oriented, it makes sense to build a column-oriented
optimizer and executor. As will be seen, this software
looks nothing like the traditional designs prevalent today.
In this paper, we sketch the design of our updatable
column store, C-Store, that can simultaneously achieve
very high performance on warehouse-style queries and
achieve reasonable speed on OLTP-style transactions. C-
Store is a column-oriented DBMS that is architected to
reduce the number of disk accesses per query. The
innovative features of C-Store include:
1. A hybrid architecture with a WS component optimized
for frequent insert and update and an RS component
optimized for query performance.
2. Redundant storage of elements of a table in several
overlapping projections in different orders, so that a
query can be solved using the most advantageous
projection.
3. Heavily compressed columns using one of several
coding schemes.
4. A column-oriented optimizer and executor, with
different primitives than in a row-oriented system.
5. High availability and improved performance through
K-safety using a sufficient number of overlapping
projections.
6. The use of snapshot isolation to avoid 2PC and locking
for queries.
It should be emphasized that while many of these topics
have parallels with things that have been studied in
isolation in the past, it is their combination in a real
system that make C-Store interesting and unique.
It should be emphasized that while many of these topics
have parallels with things that have been studied in
isolation in the past, it is their combination in a real
system that make C-Store interesting and unique.
The rest of this paper is organized as follows. In
Section 2 we present the data model implemented by C-
Store. We explore in Section 3 the design of the RS
portion of C-Store, followed in Section 4 by the WS
component. In Section 5 we consider the allocation of C-
Store data structures to nodes in a grid, followed by a
presentation of C-Store updates and transactions in
Section 6. Section 7 treats the tuple mover component of
C-Store, and Section 8 presents the query optimizer and
executor. In Section 9 we present a comparison of C-
Store performance to that achieved by both a popular
commercial row store and a popular commercial column
store. On TPC-H style queries, C-Store is significantly
faster than either alternate system. However, it must be
noted that the performance comparison is not fully
completed; we have not fully integrated the WS and tuple
mover, whose overhead may be significant. Finally,
Sections 10 and 11 discuss related previous work and our
conclusions.
storage systems, where the attributes of a record (or tuple)
are placed contiguously in storage. With this row store
architecture, a single disk write suffices to push all of the
fields of a single record out to disk. Hence, high
performance writes are achieved, and we call a DBMS
with a row store architecture a write-optimized system.
These are especially effective on OLTP-style applications.
In contrast, systems oriented toward ad-hoc querying
of large amounts of data should be read-optimized. Data
warehouses represent one class of read-optimized system, in which periodically a bulk load of new data is
performed, followed by a relatively long period of ad-hoc
queries. Other read-mostly applications include customer
relationship management (CRM) systems, electronic
library card catalogs, and other ad-hoc inquiry systems. In
such environments, a column store architecture, in which
the values for each single column (or attribute) are stored
contiguously, should be more efficient. This efficiency
has been demonstrated in the warehouse marketplace by
products like Sybase IQ [FREN95, SYBA04], Addamark
[ADDA04], and KDB [KDB04]. In this paper, we discuss
the design of a column store called C-Store that includes a
number of novel features relative to existing systems.
With a column store architecture, a DBMS need only
read the values of columns required for processing a given
query, and can avoid bringing into memory irrelevant
attributes. In warehouse environments where typical
queries involve aggregates performed over large numbers
of data items, a column store has a sizeable performance
advantage. However, there are several other major
distinctions that can be drawn between an architecture that
is read-optimized and one that is write-optimized.
Current relational DBMSs were designed to pad
attributes to byte or word boundaries and to store values in
their native data format. It was thought that it was too
expensive to shift data values onto byte or word
boundaries in main memory for processing. However,
CPUs are getting faster at a much greater rate than disk
bandwidth is increasing. Hence, it makes sense to trade
CPU cycles, which are abundant, for disk bandwidth,
which is not. This tradeoff appears especially profitable in
a read-mostly environment.
There are two ways a column store can use CPU cycles
to save disk bandwidth. First, it can code data elements
into a more compact form. For example, if one is storing
an attribute that is a customer’s state of residence, then US
states can be coded into six bits, whereas the two-
character abbreviation requires 16 bits and a variable
length character string for the name of the state requires
many more. Second, one should densepack values in
storage. For example, in a column store it is
straightforward to pack N values, each K bits long, into N
* K bits. The coding and compressibility advantages of a column store over a row store have been previously
pointed out in [FREN95]. Of course, it is also desirable to
have the DBMS query executor operate on the compressed
representation whenever possible to avoid the cost of
decompression, at least until values need to be presented
to an application.
Commercial relational DBMSs store complete tuples
of tabular data along with auxiliary B-tree indexes on
attributes in the table. Such indexes can be primary,
whereby the rows of the table are stored in as close to
sorted order on the specified attribute as possible, or
secondary, in which case no attempt is made to keep the
underlying records in order on the indexed attribute. Such
indexes are effective in an OLTP write-optimized
environment but do not perform well in a read-optimized
world. In the latter case, other data structures are
advantageous, including bit map indexes [ONEI97], cross
table indexes [ORAC04], and materialized views
[CERI91]. In a read-optimized DBMS one can explore
storing data using only these read-optimized structures,
and not support write-optimized ones at all.
Hence, C-Store physically stores a collection of
columns, each sorted on some attribute(s). Groups of
columns sorted on the same attribute are referred to as
“projections”; the same column may exist in multiple
projections, possibly sorted on a different attribute in each.
We expect that our aggressive compression techniques
will allow us to support many column sort-orders without
an explosion in space. The existence of multiple sort-
orders opens opportunities for optimization.
Clearly, collections of off-the-shelf “blade” or “grid”
computers will be the cheapest hardware architecture for
computing and storage intensive applications such as
DBMSs [DEWI92]. Hence, any new DBMS architecture
should assume a grid environment in which there are G
nodes (computers), each with private disk and private
memory. We propose to horizontally partition data across
the disks of the various nodes in a “shared nothing”
architecture [STON86]. Grid computers in the near future
may have tens to hundreds of nodes, and any new system
should be architected for grids of this size. Of course, the
nodes of a grid computer may be physically co-located or
divided into clusters of co-located nodes. Since database
administrators are hard pressed to optimize a grid
environment, it is essential to allocate data structures to
grid nodes automatically. In addition, intra-query
parallelism is facilitated by horizontal partitioning of
stored data structures, and we follow the lead of Gamma [DEWI90] in implementing this construct.
Many warehouse systems (e.g. Walmart [WEST00])
maintain two copies of their data because the cost of
recovery via DBMS log processing on a very large
(terabyte) data set is prohibitive. This option is rendered
increasingly attractive by the declining cost per byte of
disks. A grid environment allows one to store such
replicas on different processing nodes, thereby supporting a Tandem-style highly-available system [TAND89]. However, there is no requirement that one store multiple
copies in the exact same way. C-Store allows redundant
objects to be stored in different sort orders providing
higher retrieval performance in addition to high
availability. In general, storing overlapping projections
further improves performance, as long as redundancy is
crafted so that all data can be accessed even if one of the
G sites fails. We call a system that tolerates K failures K-
safe. C-Store will be configurable to support a range of
values of K.
It is clearly essential to perform transactional updates,
even in a read-mostly environment. Warehouses have a
need to perform on-line updates to correct errors. As well,
there is an increasing push toward real-time warehouses,
where the delay to data visibility shrinks toward zero. The
ultimate desire is on-line update to data warehouses.
Obviously, in read-mostly worlds like CRM, one needs to
perform general on-line updates.
There is a tension between providing updates and
optimizing data structures for reading. For example, in
KDB and Addamark, columns of data are maintained in
entry sequence order. This allows efficient insertion of
new data items, either in batch or transactionally, at the
end of the column. However, the cost is a less-than-
optimal retrieval structure, because most query workloads
will run faster with the data in some other order.
However, storing columns in non-entry sequence will
make insertions very difficult and expensive.
C-Store approaches this dilemma from a fresh
perspective. Specifically, we combine in a single piece of
system software, both a read-optimized column store and
an update/insert-oriented writeable store, connected by a
tuple mover, as noted in Figure 1. At the top level, there
is a small Writeable Store (WS) component, which is
architected to support high performance inserts and
updates. There is also a much larger component called the
Read-optimized Store (RS), which is capable of
supporting very large amounts of information. RS, as the
name implies, is optimized for read and supports only a
very restricted form of insert, namely the batch movement
of records from WS to RS, a task that is performed by the
tuple mover of Figure 1. Of course, queries must access data in both storage
systems. Inserts are sent to WS, while deletes must be
marked in RS for later purging by the tuple mover.
Updates are implemented as an insert and a delete. In
order to support a high-speed tuple mover, we use a
variant of the LSM-tree concept [ONEI96], which
supports a merge out process that moves tuples from WS
to RS in bulk by an efficient method of merging ordered
WS data objects with large RS blocks, resulting in a new
copy of RS that is installed when the operation completes.
The architecture of Figure 1 must support transactions
in an environment of many large ad-hoc queries, smaller
update transactions, and perhaps continuous inserts.
Obviously, blindly supporting dynamic locking will result
in substantial read-write conflict and performance
degradation due to blocking and deadlocks.
Instead, we expect read-only queries to be run in
historical mode. In this mode, the query selects a
timestamp, T, less than the one of the most recently
committed transactions, and the query is semantically
guaranteed to produce the correct answer as of that point
in history. Providing such snapshot isolation [BERE95]
requires C-Store to timestamp data elements as they are
inserted and to have careful programming of the runtime
system to ignore elements with timestamps later than T.
Lastly, most commercial optimizers and executors are
row-oriented, obviously built for the prevalent row stores
in the marketplace. Since both RS and WS are column-
oriented, it makes sense to build a column-oriented
optimizer and executor. As will be seen, this software
looks nothing like the traditional designs prevalent today.
In this paper, we sketch the design of our updatable
column store, C-Store, that can simultaneously achieve
very high performance on warehouse-style queries and
achieve reasonable speed on OLTP-style transactions. C-
Store is a column-oriented DBMS that is architected to
reduce the number of disk accesses per query. The
innovative features of C-Store include:
1. A hybrid architecture with a WS component optimized
for frequent insert and update and an RS component
optimized for query performance.
2. Redundant storage of elements of a table in several
overlapping projections in different orders, so that a
query can be solved using the most advantageous
projection.
3. Heavily compressed columns using one of several
coding schemes.
4. A column-oriented optimizer and executor, with
different primitives than in a row-oriented system.
5. High availability and improved performance through
K-safety using a sufficient number of overlapping
projections.
6. The use of snapshot isolation to avoid 2PC and locking
for queries.
It should be emphasized that while many of these topics
have parallels with things that have been studied in
isolation in the past, it is their combination in a real
system that make C-Store interesting and unique.
It should be emphasized that while many of these topics
have parallels with things that have been studied in
isolation in the past, it is their combination in a real
system that make C-Store interesting and unique.
The rest of this paper is organized as follows. In
Section 2 we present the data model implemented by C-
Store. We explore in Section 3 the design of the RS
portion of C-Store, followed in Section 4 by the WS
component. In Section 5 we consider the allocation of C-
Store data structures to nodes in a grid, followed by a
presentation of C-Store updates and transactions in
Section 6. Section 7 treats the tuple mover component of
C-Store, and Section 8 presents the query optimizer and
executor. In Section 9 we present a comparison of C-
Store performance to that achieved by both a popular
commercial row store and a popular commercial column
store. On TPC-H style queries, C-Store is significantly
faster than either alternate system. However, it must be
noted that the performance comparison is not fully
completed; we have not fully integrated the WS and tuple
mover, whose overhead may be significant. Finally,
Sections 10 and 11 discuss related previous work and our
conclusions.

0 comments:
Post a Comment