The case for grid storage


1 The case for grid storage

My data, anywhere, anytime, anyhow. That sums up what 'data management' should be like from the perspective of the user. And yet, after 10 years of research and development in this area, the reality is nowhere near the mark.

The opening requirement is deceptively simple, yet it is very hard to distill technical detail from it. Compare it with the requirements for a new, yet-to-be-developed consumer gadget: when the engineering department asked the marketing department what the gadget should do, the marketing department said "it should outsell the competition." Such a requirement can be tested, but in every other respect it is completely useless.

What I propose below is a set of requirements that may give the engineers a better handle on the problem, and I will attempt to show that satisfying these requirements also satisfies the overall goal. If it did not, the exercise would be futile.

2 What is data?

To start with the entirely obvious, what we mean by data is simply a collection of binary strings of arbitrary length. What these strings represent is entirely up to the user. It is best at this point not to think of data items as files. A file is rather a way of exposing a data item to a user by means of a file system view, where each file has a path relative to the file system's entry point. The path is itself a form of data, typically called 'meta-data' because it is data about data. There are other ways of addressing data items, as we will discuss next.

3 Exposing data

Data can be exposed to the user in many ways, and a basic requirement is that, if the user so chooses, the data should be presented as a file system hierarchy. It is up to the user to organise the data into files and directories, but once this is done, all of the typical file system operations should be at the user's disposal.

The 'anyhow' part of the mother requirement dictates this. File systems are ubiquitous, and not only are users familiar with them, but sometimes they are forced to rely on a file system view simply because the legacy software they use understands little else.

Other means of exposing data to users are a directory service, a structured database, or an unstructured database.

It is, at least in theory, possible to map one kind of representation to another; databases can be implemented on top of file systems, and file systems can be implemented on top of databases. The aim should therefore be to implement the 'lowest' level of data access in a way that makes all the implementations 'on top' effective (by effective I mean feasible and scalable).
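As a minimal sketch of such a mapping, assume the 'lowest' level is a flat store of byte strings under opaque keys, and that the file system view is nothing more than meta-data (paths) layered on top of it. The class and method names below are hypothetical and chosen for illustration only; nothing here addresses scalability.

```python
class FlatStore:
    """Hypothetical lowest level: opaque keys mapped to byte strings."""

    def __init__(self):
        self._items = {}

    def put(self, key, data):
        self._items[key] = bytes(data)

    def get(self, key):
        return self._items[key]


class FileSystemView:
    """Hypothetical file system view: paths are meta-data kept beside the store."""

    def __init__(self, store):
        self._store = store
        self._paths = {}      # path -> opaque key in the flat store
        self._next_id = 0

    def write_file(self, path, data):
        key = self._paths.get(path)
        if key is None:
            # New path: allocate a fresh opaque key for the data item.
            key = f"item-{self._next_id}"
            self._next_id += 1
            self._paths[path] = key
        self._store.put(key, data)

    def read_file(self, path):
        return self._store.get(self._paths[path])

    def listdir(self, directory):
        # Directories are implied by the paths; nothing is stored for them.
        prefix = directory.rstrip("/") + "/"
        names = set()
        for path in self._paths:
            if path.startswith(prefix):
                names.add(path[len(prefix):].split("/", 1)[0])
        return sorted(names)
```

The same flat store could equally sit underneath a database-style view; only the meta-data layer would differ.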

4 Addressing data

There is a requirement for data to have meaningful addresses, at least meaningful to the user. This is to cover the 'anywhere' part of the mother requirement. We're considering storage and data management on a global scale, meaning that the data could be in one place and the user in another, and as long as one can be reached from the other through the network there should be a way to address the data uniformly. Part of the address will be set by the system, but the tail end of the address should be under the control of the user, and the user should have the opportunity to change (the user part of) the address.
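One way to picture such an address is as a pair: a part fixed by the system and a tail the user may change. The sketch below assumes a URL-like notation; the `grid://` scheme, the host name, and the `DataAddress` type are all hypothetical and not taken from any existing system.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DataAddress:
    """Hypothetical two-part address for a data item."""

    system_part: str  # set by the system; the user cannot change it
    user_part: str    # tail end of the address, under the user's control

    def __str__(self):
        return f"{self.system_part.rstrip('/')}/{self.user_part.lstrip('/')}"

    def renamed(self, new_user_part):
        # Only the user part changes; the system part is carried over as is.
        return replace(self, user_part=new_user_part)
```

For example, `DataAddress("grid://storage.example.org/vo", "run1/a.dat").renamed("calib/a.dat")` yields an address with the same system part and a new user-chosen tail.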

5 Accessing data

Once a data item is addressed, it should be possible to perform all the usual operations, such as open, read, write, seek, etc. All the operations must behave consistently, which is particularly challenging in a shared environment with shared write access.

6 Securing data

The 'owner' of the data should be in control of who may access the data, and in what manner. Permissions may be given to list, read, or write the data. In the grid world we see that individual ownership of data is not always a meaningful concept, so additional attributes to implement co-ownership, or to grant and revoke permissions for a whole category of users, may be necessary.
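The permission model described above might be sketched as follows, assuming a set of co-owners per data item and grants that apply either to a single user or to a whole category of users. All names are hypothetical, and the `group:` prefix used to tell categories from users is an assumption of this sketch.

```python
from enum import Flag, auto


class Permission(Flag):
    NONE = 0
    LIST = auto()
    READ = auto()
    WRITE = auto()


class AccessControl:
    """Hypothetical access control record for one data item."""

    def __init__(self, owners):
        self.owners = set(owners)   # co-ownership: more than one owner allowed
        self._grants = {}           # principal (user or "group:<name>") -> Permission

    def _require_owner(self, actor):
        if actor not in self.owners:
            raise PermissionError(f"{actor} is not an owner")

    def grant(self, actor, principal, perms):
        self._require_owner(actor)
        self._grants[principal] = self._grants.get(principal, Permission.NONE) | perms

    def revoke(self, actor, principal, perms):
        self._require_owner(actor)
        self._grants[principal] = self._grants.get(principal, Permission.NONE) & ~perms

    def allowed(self, user, groups, perm):
        if user in self.owners:
            return True
        # Combine the user's own grants with those of every group they belong to.
        held = self._grants.get(user, Permission.NONE)
        for group in groups:
            held |= self._grants.get(group, Permission.NONE)
        return (held & perm) == perm
```

Any co-owner may grant or revoke, so a grant made by one owner can be withdrawn by another; whether that is desirable is a policy question the sketch leaves open.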

7 Scalability

The scalability issues involving grid storage make the above requirements really hard to implement. The number of users that access a given data item at any time may range from one to one thousand; the size of the data items may range from a few bytes to several terabytes; the number of data items in a single collection may scale from one to one million. None of these ranges can be assumed or predicted, nor should any of these properties be exposed to, or even noticeable by, the user. Data item sizes may grow and shrink at will, access patterns fluctuate, and the system should manage this behind the scenes by whatever means are available.

8 Stability

Last but not least, the system as a whole should be so stable that the user never notices temporary glitches in parts of the system. Terms like graceful degradation come to mind, and given enough hardware, disk space and bandwidth, the system should be able to cope with whatever load the users put on it. In our experience, systems can be taxed very heavily, so measures must be taken to prevent overload, for instance by limiting resource consumption.
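One simple way to limit resource consumption is admission control: cap the number of operations in flight and turn away the rest instead of letting them pile up. The sketch below is an illustration under that assumption, not a description of any particular storage system; the class name and the limit are hypothetical.

```python
import threading


class AdmissionControl:
    """Hypothetical guard that caps the number of concurrent operations."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def try_acquire(self):
        # Non-blocking: returns False immediately when the system is full,
        # so the caller can back off instead of adding to the load.
        return self._slots.acquire(blocking=False)

    def release(self):
        # Must be called exactly once for every successful try_acquire().
        self._slots.release()
```

A caller that is refused can retry later or be redirected elsewhere, which is one form of graceful degradation: existing operations keep their resources while new ones wait.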

Author: Dennis <dennisvd@nikhef.nl>

Date: 2010-07-07 22:25:20 CEST
