1  Translator 101 Lesson

1.1   

 

This is the first post in a series that will explain some of the details of writing a GlusterFS translator, using some actual code to illustrate.

Before we begin, a word about environments. GlusterFS is over 300K lines of code spread across a few hundred files. That’s no Linux kernel or anything, but you’re still going to be navigating through a lot of code in every code-editing session, so some kind of cross-referencing is essential. I use cscope with the vim bindings, and if I couldn’t do “ctrl-\ g” and such to jump between definitions all the time my productivity would be cut in half. You may prefer different tools, but as I go through these examples you’ll need something functionally similar to follow along. OK, on with the show.

The first thing you need to know is that translators are not just bags of functions and variables. They need to have a very definite internal structure so that the translator-loading code can figure out where all the pieces are. The way it does this is to use dlsym to look for specific names within your shared-object file, as follows (from xlator.c):

 

if (!(xl->fops = dlsym (handle, "fops"))) {
        gf_log ("xlator", GF_LOG_WARNING, "dlsym(fops) on %s",
                dlerror ());
        goto out;
}

if (!(xl->cbks = dlsym (handle, "cbks"))) {
        gf_log ("xlator", GF_LOG_WARNING, "dlsym(cbks) on %s",
                dlerror ());
        goto out;
}

if (!(xl->init = dlsym (handle, "init"))) {
        gf_log ("xlator", GF_LOG_WARNING, "dlsym(init) on %s",
                dlerror ());
        goto out;
}

if (!(xl->fini = dlsym (handle, "fini"))) {
        gf_log ("xlator", GF_LOG_WARNING, "dlsym(fini) on %s",
                dlerror ());
        goto out;
}

 

In this example, xl is a pointer to the in-memory object for the translator we’re loading. As you can see, it’s looking up various symbols by name in the shared object it just loaded, and storing pointers to those symbols. Some of them (e.g. init) are functions, while others (e.g. fops) are dispatch tables containing pointers to many functions. Together, these make up the translator’s public interface.

Most of this glue or boilerplate can easily be found at the bottom of one of the source files that make up each translator. We’re going to use the rot-13 translator just for fun, so in this case you’d look in rot-13.c to see this:

 

struct xlator_fops fops = {
        .readv  = rot13_readv,
        .writev = rot13_writev
};

struct xlator_cbks cbks = {};

struct volume_options options[] = {
        { .key = {"encrypt-write"}, .type = GF_OPTION_TYPE_BOOL },
        { .key = {"decrypt-read"},  .type = GF_OPTION_TYPE_BOOL },
        { .key = {NULL} },
};

 

The fops table, defined in xlator.h, is one of the most important pieces. This table contains a pointer to each of the filesystem functions that your translator might implement – open, read, stat, chmod, and so on. There are 82 such functions in all, but don’t worry; any that you don’t specify here will be seen as null and filled in with defaults from defaults.c when your translator is loaded. In this particular example, since rot-13 is an exceptionally simple translator, we only fill in two entries, for readv and writev.
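To make the "fill in the blanks" idea concrete, here is a minimal sketch of how a loader could replace null table entries with defaults. Everything in it is hypothetical: the three-entry mini_fops table, the default_* functions and fill_defaults are stand-ins invented for illustration, not the real xlator_fops or the code in defaults.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical three-entry dispatch table; the real xlator_fops is far larger. */
typedef int (*fop_fn) (int arg);

struct mini_fops {
        fop_fn open;
        fop_fn readv;
        fop_fn writev;
};

/* Stand-ins for pass-through defaults (assumed behavior: return arg as-is). */
static int default_open (int arg)   { return arg; }
static int default_readv (int arg)  { return arg; }
static int default_writev (int arg) { return arg; }

/* A translator-specific override; it doubles its argument so it is easy to spot. */
static int my_readv (int arg) { return arg * 2; }

/* Replace every NULL entry with the matching default; leave overrides alone. */
static void
fill_defaults (struct mini_fops *fops)
{
        assert (fops != NULL);
        if (!fops->open)
                fops->open = default_open;
        if (!fops->readv)
                fops->readv = default_readv;
        if (!fops->writev)
                fops->writev = default_writev;
}
```

A translator would declare something like `struct mini_fops fops = { .readv = my_readv };` and, after fill_defaults runs, every entry is callable without a null check at each call site.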

There are actually two other tables, also required to have predefined names, that are also used to find translator functions: cbks (which is empty in this snippet) and dumpops (which is missing entirely). The first of these specifies entry points for when inodes are forgotten or file descriptors are released. In other words, they’re destructors for objects in which your translator might have an interest. Mostly you can ignore them, because the default behavior handles even the simpler cases of translator-specific inode/fd context automatically. However, if the context you attach is a complex structure requiring complex cleanup, you’ll need to supply these functions. As for dumpops, that’s just used if you want to provide functions to pretty-print various structures in logs. I’ve never used it myself, though I probably should. What’s noteworthy here is that we don’t even define dumpops. That’s because all of the functions that might use these dispatch functions will check for xl->dumpops being NULL before calling through it. This is in sharp contrast to the behavior for fops and cbks, which must be present. If they’re not, translator loading will fail, because these pointers are not checked every time and if they’re NULL then we’ll segfault. That’s why we provide an empty definition for cbks; it’s OK for the individual function pointers to be NULL, but not for the whole table to be absent.

The last piece I’ll cover today is options. As you can see, this is a table of translator-specific option names and some information about their types. GlusterFS actually provides a pretty rich set of types (volume_option_type_t in options.h) which includes paths, translator names, percentages, and times in addition to the obvious integers and strings. Also, the volume_option_t structure can include information about alternate names, min/max/default values, enumerated string values, and descriptions. We don’t see any of these here, so let’s take a quick look at some more complex examples from afr.c and then come back to rot-13.

 

{ .key           = {"data-self-heal-algorithm"},
  .type          = GF_OPTION_TYPE_STR,
  .default_value = "",
  .description   = "Select between \"full\", \"diff\". The "
                   "\"full\" algorithm copies the entire file from "
                   "source to sink. The \"diff\" algorithm copies to "
                   "sink only those blocks whose checksums don't match "
                   "with those of source.",
  .value         = { "diff", "full", "" }
},
{ .key           = {"data-self-heal-window-size"},
  .type          = GF_OPTION_TYPE_INT,
  .min           = 1,
  .max           = 1024,
  .default_value = "1",
  .description   = "Maximum number blocks per file for which self-heal "
                   "process would be applied simultaneously."
},
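As a rough illustration of what the min, max and default_value fields in entries like these are for, the sketch below resolves one integer option: it falls back to the default when nothing was supplied and rejects values outside the declared range. The int_option_spec structure and resolve_int_option are hypothetical names made up for this example; the actual GlusterFS option validation code is more general and is not reproduced here.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical, simplified description of one integer option. */
struct int_option_spec {
        const char *key;
        long        min;
        long        max;
        const char *default_value;
};

/*
 * Parse `text` (or the default when text is NULL) and range-check it.
 * Returns 0 and stores the value on success, -1 on any error.
 */
static int
resolve_int_option (const struct int_option_spec *spec, const char *text,
                    long *value)
{
        char *end = NULL;
        long  parsed = 0;

        assert (spec != NULL && value != NULL);
        if (!text)
                text = spec->default_value;
        if (!text || *text == '\0')
                return -1;

        errno = 0;
        parsed = strtol (text, &end, 10);
        if (errno != 0 || *end != '\0')
                return -1;      /* not a clean decimal integer */
        if (parsed < spec->min || parsed > spec->max)
                return -1;      /* outside the declared range */

        *value = parsed;
        return 0;
}
```

With a spec of { "data-self-heal-window-size", 1, 1024, "1" }, a missing value resolves to 1, "16" is accepted, and "4096" or "abc" is rejected.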

 

When your translator is loaded, all of this information is used to parse the options actually provided in the volfile, and then the result is turned into a dictionary and stored as xl->options. This dictionary is then processed by your init function, which you can see being looked up in the first code fragment above. We’re only going to look at a small part of rot-13’s init for now.

 

priv->decrypt_read = 1;
priv->encrypt_write = 1;

data = dict_get (this->options, "encrypt-write");
if (data) {
        if (gf_string2boolean (data->data, &priv->encrypt_write) == -1) {
                gf_log (this->name, GF_LOG_ERROR,
                        "encrypt-write takes only boolean options");
                return -1;
        }
}

 

What we can see here is that we’re setting some defaults in our priv structure, then looking to see if an “encrypt-write” option was actually provided. If so, we convert and store it. This is a pretty classic use of dict_get to fetch a field from a dictionary, and of using one of many conversion functions in common-utils.c to convert data->data into something we can use.

So far we’ve covered the basics of how a translator gets loaded, how we find its various parts, and how we process its options. In my next Translator 101 post, we’ll go a little deeper into other things that init and its companion fini might do, and how some other fields in our xlator_t structure (commonly referred to as this) are commonly used.
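The same default-then-override pattern can be tried outside GlusterFS. The following is only a sketch under stated assumptions: struct kv and kv_get stand in for the dictionary, parse_boolean stands in for a conversion function, and the set of spellings it accepts is my own choice rather than a description of what gf_string2boolean accepts.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flat key/value list standing in for a dict_t. */
struct kv {
        const char *key;
        const char *value;
};

/* Return the value stored under `key`, or NULL if the key is absent. */
static const char *
kv_get (const struct kv *pairs, size_t count, const char *key)
{
        size_t i;

        for (i = 0; i < count; i++) {
                if (strcmp (pairs[i].key, key) == 0)
                        return pairs[i].value;
        }
        return NULL;
}

/* Case-insensitive string equality using only ISO C. */
static int
equals_ignore_case (const char *a, const char *b)
{
        while (*a && *b) {
                if (tolower ((unsigned char) *a) != tolower ((unsigned char) *b))
                        return 0;
                a++;
                b++;
        }
        return *a == *b;
}

/* Accept a few assumed spellings; return 0 on success, -1 otherwise. */
static int
parse_boolean (const char *text, int *result)
{
        static const char *const truthy[] = { "1", "on", "yes", "true" };
        static const char *const falsy[]  = { "0", "off", "no", "false" };
        size_t i;

        assert (text != NULL && result != NULL);
        for (i = 0; i < 4; i++) {
                if (equals_ignore_case (text, truthy[i])) {
                        *result = 1;
                        return 0;
                }
                if (equals_ignore_case (text, falsy[i])) {
                        *result = 0;
                        return 0;
                }
        }
        return -1;
}

/* Default to 1, then override only if the option was supplied and valid. */
static int
load_flag (const struct kv *pairs, size_t count, const char *key, int *flag)
{
        const char *text = kv_get (pairs, count, key);

        *flag = 1;
        if (text && parse_boolean (text, flag) == -1)
                return -1;
        return 0;
}
```

An absent key leaves the default in place, a valid value overrides it, and an unparseable value makes load_flag fail, which is the point at which an init function would log an error and return -1.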

 

1.2   

 

In the previous post, we looked at some of the dispatch tables and options processing in a translator. This time we’re going to cover the rest of the “shell” of a translator – i.e. the other global parts not specific to handling a particular request.

Let’s start by looking at the relationship between a translator and its shared library. At a first approximation, this is the relationship between an object and a class in just about any object-oriented programming language. The class defines behaviors, but has to be instantiated as an object to have any kind of existence. In our case the object is an xlator_t. Several of these might be created within the same daemon, sharing all of the same code through init/fini and dispatch tables, but sharing no data. You could implement shared data (as static variables in your shared libraries) but that’s strongly discouraged. Every function in your shared library will get an xlator_t as an argument, and should use it. This lack of class-level data is one of the points where the analogy to common OOP systems starts to break down. Another place is the complete lack of inheritance. Translators inherit behavior (code) from exactly one shared library – looked up and loaded using the “type” field in a volfile “volume . . . end-volume” block – and that’s it – not even single inheritance, no subclasses or superclasses, no mixins or prototypes, just the relationship between an object and its class. With that in mind, let’s turn to the init function that we just barely touched on last time.

 

int32_t
init (xlator_t *this)
{
        data_t *data = NULL;
        rot_13_private_t *priv = NULL;

        if (!this->children || this->children->next) {
                gf_log ("rot13", GF_LOG_ERROR,
                        "FATAL: rot13 should have exactly one child");
                return -1;
        }

        if (!this->parents) {
                gf_log (this->name, GF_LOG_WARNING,
                        "dangling volume. check volfile ");
        }

        priv = GF_CALLOC (sizeof (rot_13_private_t), 1, 0);
        if (!priv)
                return -1;

 

At the very top, we see the function signature – we get a pointer to the xlator_t object that we’re initializing, and we return an int32_t status. As with most functions in the translator API, this should be zero to indicate success. In this case it’s safe to return -1 for failure, but watch out: in dispatch-table functions, the return value means the status of the function call rather than the request. A request error should be reflected as a callback with a non-zero op_ret value, but the dispatch function itself should still return zero. In fact, the handling of a non-zero return from a dispatch function is not all that robust (we recently had a bug report in HekaFS related to this) so it’s something you should probably avoid altogether. This only underscores the difference between dispatch functions and init/fini functions, where non-zero returns are expected and handled logically by aborting the translator setup. We can see that down at the bottom, where we return -1 to indicate that we couldn’t allocate our private-data area (more about that later).
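The distinction between "status of the call" and "status of the request" is easy to blur, so here is a small stand-alone sketch of the convention. The names request_cbk, record_result and mini_writev are hypothetical and exist only for this illustration; real dispatch functions take a call frame and unwind through it rather than calling a plain function pointer.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical completion callback: receives the status of the request. */
typedef void (*request_cbk) (void *cookie, int op_ret, int op_errno);

/* Records what the callback saw so that the caller can inspect it. */
struct result {
        int called;
        int op_ret;
        int op_errno;
};

static void
record_result (void *cookie, int op_ret, int op_errno)
{
        struct result *res = cookie;

        res->called++;
        res->op_ret = op_ret;
        res->op_errno = op_errno;
}

/*
 * A dispatch-style function.  A bad request is reported through the
 * callback (op_ret = -1, op_errno set); the function itself returns 0
 * because the call was handled.
 */
static int
mini_writev (const char *buf, size_t len, request_cbk cbk, void *cookie)
{
        assert (cbk != NULL);
        if (!buf) {
                cbk (cookie, -1, EINVAL);
                return 0;
        }
        cbk (cookie, (int) len, 0);
        return 0;
}
```

Calling mini_writev with a NULL buffer still returns 0, but the callback sees op_ret of -1 and op_errno of EINVAL, so the failure travels back to the requester through the callback path.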

The first thing this init function does is check that the translator is being set up in the right kind of environment. Translators are called by parents and in turn call children. Some translators are “initial” translators that inject requests into the system from elsewhere – e.g. mount/fuse injecting requests from the kernel, protocol/server injecting requests from the network. Those translators don’t need parents, but rot-13 does and so we check for that. Similarly, some translators are “final” translators that (from the perspective of the current process) terminate requests instead of passing them on – e.g. protocol/client passing them to another node, storage/posix passing them to a local filesystem. Other translators “multiplex” between multiple children – passing each parent request on to one (cluster/dht), some (cluster/stripe), or all (cluster/afr) of those children. Rot-13 fits into none of those categories either, so it checks that it has exactly one child. It might be more convenient or robust if translator shared libraries had standard variables describing these requirements, to be checked in a consistent way by the translator-loading infrastructure itself instead of by each separate init function, but this is the way translators work today.

The last thing we see in this fragment is allocating our private data area. This can literally be anything we want; the infrastructure just provides the private pointer as a convenience but takes no responsibility for how it’s used. In this case we’re using GF_CALLOC to allocate our own rot_13_private_t structure. This gets us all the benefits of GlusterFS’s memory-leak detection infrastructure, but the way we’re calling it is not quite ideal. For one thing, the first two arguments are reversed relative to calloc(3), which takes the element count first and the element size second. For another, notice how the last argument is zero. That can actually be an enumerated value, to tell the GlusterFS allocator what type we’re allocating. This can be very useful information for memory profiling and leak detection, so it’s recommended that you follow the example of any xxx-mem-types.h file elsewhere in the source tree instead of just passing zero here (even though that works).

To finish our tour of standard initialization/termination, let’s look at the end of init and the beginning of fini.

 

        this->private = priv;
        gf_log ("rot13", GF_LOG_DEBUG, "rot13 xlator loaded");
        return 0;
}

void
fini (xlator_t *this)
{
        rot_13_private_t *priv = this->private;

        if (!priv)
                return;
        this->private = NULL;
        GF_FREE (priv);

 

At the end of init we’re just storing our private-data pointer in the private field of our xlator_t, then returning zero to indicate that initialization succeeded. As is usually the case, our fini is even simpler. All it really has to do is GF_FREE our private-data pointer, which we do in a slightly roundabout way here. Notice how we don’t even have a return value here, since there’s nothing obvious and useful that the infrastructure could do if fini failed.

That’s practically everything we need to know to get our translator through loading, initialization, options processing, and termination. If we had defined no dispatch functions, we could actually configure a daemon to use our translator and it would work as a basic pass-through from its parent to a single child. In the next post I’ll cover how to build the translator and configure a daemon to use it, so that we can actually step through it in a debugger and see how it all fits together before we actually start adding functionality.
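Putting the pieces of init and fini together, the whole life cycle can be sketched without any GlusterFS headers. This is a minimal sketch, assuming a pared-down mini_xlator with just a child count and a private pointer; mini_init, mini_fini and mini_private are hypothetical names, and plain calloc/free stand in for GF_CALLOC/GF_FREE.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, pared-down stand-in for xlator_t. */
struct mini_xlator {
        int   child_count;
        void *private;
};

struct mini_private {
        int decrypt_read;
        int encrypt_write;
};

/* Returns 0 on success, -1 if the environment is wrong or allocation fails. */
static int
mini_init (struct mini_xlator *this)
{
        struct mini_private *priv = NULL;

        assert (this != NULL);
        if (this->child_count != 1)
                return -1;

        /* calloc(3) order: number of elements first, then element size. */
        priv = calloc (1, sizeof (*priv));
        if (!priv)
                return -1;

        priv->decrypt_read = 1;
        priv->encrypt_write = 1;
        this->private = priv;
        return 0;
}

/* Safe to call even if init never ran or failed; free(NULL) is a no-op. */
static void
mini_fini (struct mini_xlator *this)
{
        assert (this != NULL);
        free (this->private);
        this->private = NULL;
}
```

Two instances initialized this way share the code of mini_init and mini_fini but each gets its own mini_private, which mirrors the object-and-class relationship described earlier: changing a flag in one instance leaves the other untouched.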

 

1.3   

 

In the first two parts of this series, we learned how to write a basic translator skeleton that can get through loading, initialization, and option processing. This time we’ll cover how to build that translator, configure a volume to use it, and run the glusterfs daemon in debug mode.

Unfortunately, there’s not much direct support for writing new translators. You can check out a GlusterFS tree and splice in your own translator directory, but that’s a bit painful because you’ll have to update multiple makefiles plus a bunch of autoconf garbage. As part of the HekaFS project, I basically reverse engineered the truly necessary parts of the translator-building process and then pestered one of the Fedora glusterfs package maintainers (thanks daMaestro!) to add a glusterfs-devel package with the required headers. Since then the complexity level in the HekaFS tree has crept back up a bit, but I still remember the simple method and still consider it the easiest way to get started on a new translator. For the sake of those not using Fedora, I’m going to describe a method that doesn’t depend on that header package. What it does depend on is a GlusterFS source tree, such as one you might have cloned from one of the project’s public repositories. This tree doesn’t have to be fully built, but you do need to run autogen.sh and configure in it. Then you can take the following simple makefile and put it in a directory with your actual source.

 

# Change these to match your source code.
TARGET  = rot-13.so
OBJECTS = rot-13.o

# Change these to match your environment.
GLFS_SRC = /play/glusterfs
GLFS_LIB = /opt/glusterfs/3git/lib64
HOST_OS  = GF_LINUX_HOST_OS

# You shouldn't need to change anything below here.

CFLAGS  = -fPIC -Wall -O0 -g \
          -DHAVE_CONFIG_H -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -D$(HOST_OS) \
          -I$(GLFS_SRC) -I$(GLFS_SRC)/libglusterfs/src \
          -I$(GLFS_SRC)/contrib/uuid
LDFLAGS = -shared -nostartfiles -L$(GLFS_LIB) -lglusterfs -lpthread

$(TARGET): $(OBJECTS)
	$(CC) $(OBJECTS) $(LDFLAGS) -o $(TARGET)

 

Yes, it’s still Linux-specific. Mea culpa. As you can see, we’re sticking with the rot-13 example, so you can just copy the files from …/xlators/encryption/rot-13/src in your GlusterFS tree to follow along. Type “make” and you should be rewarded with a nice little .so file.

 


[jeff@gfs-i8c-01 xlator_example]$ ls -l rot-13.so

-rwxr-xr-x. 1 jeff jeff 40784 Nov 16 16:41 rot-13.so

 

Notice that we’ve built with optimization level zero and debugging symbols included, which would not typically be the case for a packaged version of GlusterFS. Let’s put our version of rot-13.so into a slightly different file on our system, so that it doesn’t stomp on the installed version (not that you’d ever want to use that anyway).

 


[root@gfs-i8c-01 xlator_example]# ls /opt/glusterfs/3git/lib64/glusterfs/3git/xlator/encryption/

crypt.so  crypt.so.0  crypt.so.0.0.0  rot-13.so   rot-13.so.0  rot-13.so.0.0.0

[root@gfs-i8c-01 xlator_example]# cp rot-13.so  /opt/glusterfs/3git/lib64/glusterfs/3git/xlator/encryption/my-rot-13.so

 

These paths represent the current Gluster filesystem layout, which is likely to be deprecated in favor of the Fedora layout; your paths may vary. At this point we’re ready to configure a volume using our new translator. To do that, I’m going to suggest something that’s strongly discouraged except during development (the Gluster guys are going to hate me for this): write our own volfile. Here’s just about the simplest volfile you’ll ever see.

 


volume my-posix

    type storage/posix

    option directory /play/export

end-volume

 

volume my-rot13

    type encryption/my-rot-13

    subvolumes my-posix

end-volume

 

All we have here is a basic brick using /play/export for its data, and then an instance of our translator layered on top – no client or server is necessary for what we’re doing, and the system will automatically push a mount/fuse translator on top if there’s no server translator. To try this out, all we need is the following command (assuming the directories involved already exist).

 


[jeff@gfs-i8c-01 xlator_example]$ glusterfs --debug -f my.vol  /play/import

 

You should be rewarded with a whole lot of log output, including the text of the volfile (this is very useful for debugging problems in the field). If you go to another window on the same machine, you can see that you have a new filesystem mounted.

 


[jeff@gfs-i8c-01 ~]$ df /play/import

Filesystem           1K-blocks      Used Available Use% Mounted on

/play/xlator_example/my.vol

                      114506240   2706176  105983488   3% /play/import

 

Just for fun, write something into a file in /play/import, then look at the corresponding file in /play/export to see it all rot-13’ed for you.

 


 

[jeff@gfs-i8c-01 ~]$ echo hello > /play/import/a_file

[jeff@gfs-i8c-01 ~]$ cat /play/export/a_file

uryyb

 

There you have it – functionality you control, implemented easily, layered on top of local storage. Now you could start adding functionality – real encryption, perhaps – and inevitably having to debug it. You could do that the old-school way, with gf_log (preferred) or even plain old printf, or you could run daemons under gdb instead. Alternatively, you could wait for the next Translator 101 post, where we’ll be doing exactly that.

 

1.4   

 

Now that we’ve learned what a translator looks like and how to build one, it’s time to run one and actually watch it work. The best way to do this is good old-fashioned gdb, as follows (using some of the examples from last time).

 


[root@gfs-i8c-01 xlator_example]# gdb glusterfs

GNU gdb (GDB) Red Hat Enterprise Linux (7.2-50.el6)

...

(gdb) r --debug -f my.vol /play/import

Starting program: /usr/sbin/glusterfs --debug -f my.vol /play/import

...

[2011-11-23 11:23:16.495516] I [fuse-bridge.c:2971:fuse_init]  0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel  7.13

 

If you get to this point, your glusterfs client process is already running. You can go to another window to see the mountpoint, do file operations, etc.

[root@gfs-i8c-01 ~]# df /play/import

Filesystem           1K-blocks      Used Available Use% Mounted on

/root/xlator_example/my.vol

                     114506240   2643968 106045568   3% /play/import

[root@gfs-i8c-01 ~]# ls /play/import

a_file

[root@gfs-i8c-01 ~]# cat /play/import/a_file

hello

Now let’s interrupt the process and see where we are.

 


^C

Program received signal SIGINT, Interrupt.

0x0000003a0060b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from  /lib64/libpthread.so.0

(gdb) info threads

  5 Thread 0x7fffeffff700 (LWP  27206)  0x0000003a002dd8c7 in readv ()

   from /lib64/libc.so.6

  4 Thread 0x7ffff50e3700 (LWP  27205)  0x0000003a0060b75b in  pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

  3 Thread 0x7ffff5f02700 (LWP 27204)  0x0000003a0060b3dc in  pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

  2 Thread 0x7ffff6903700 (LWP  27203)  0x0000003a0060f245 in sigwait  ()

   from /lib64/libpthread.so.0

* 1 Thread 0x7ffff7957700 (LWP 27196)   0x0000003a0060b3dc in pthread_cond_wait@@GLIBC_2.3.2 () from  /lib64/libpthread.so.0

 

Like any non-toy server, this one has multiple threads. What are they all doing? Honestly, even I don’t know. Thread 1 turns out to be in event_dispatch_epoll, which means it’s the one handling all of our network I/O. Note that this will change, with one thread in socket_poller per connection. Thread 2 is in glusterfs_sigwaiter, which means signals will be isolated to that thread. Thread 3 is in syncenv_task, so it’s a worker thread for synchronous requests such as those used by the rebalance and repair code. Thread 4 is in janitor_get_next_fd, so it’s waiting for a chance to close no-longer-needed file descriptors on the local filesystem. (I admit I had to look that one up, BTW.) Lastly, thread 5 is in fuse_thread_proc, so it’s the one fetching requests from our FUSE interface. You’ll often see many more threads than this, but it’s a pretty good basic set. Now, let’s set a breakpoint so we can actually watch a request.

 


(gdb) b rot13_writev

Breakpoint 1 at 0x7ffff50e4f0b: file rot-13.c, line 119.

(gdb) c

Continuing.

 

At this point we go into our other window and do something that will involve a write.

 


[root@gfs-i8c-01 ~]# echo goodbye > /play/import/another_file

(back to the first window)

[Switching to Thread 0x7fffeffff700 (LWP 27206)]

 

Breakpoint 1, rot13_writev (frame=0x7ffff6e4402c, this=0x638440,  fd=0x7ffff409802c,

    vector=0x7fffe8000cd8,  count=1, offset=0, iobref=0x7fffe8001070) at rot-13.c:119

119             rot_13_private_t  *priv = (rot_13_private_t *)this->private;

 

Remember how we built with debugging symbols enabled and no optimization? That will be pretty important for the next few steps. As you can see, we’re in rot13_writev, with several parameters.

frame is our always-present frame pointer for this request. Also, frame->local will point to any local data we created and attached to the request ourselves.

this is a pointer to our instance of the rot-13 translator. You can examine it if you like to see the name, type, options, parent/children, inode table, and other stuff associated with it.

fd is a pointer to a file-descriptor object (fd_t, not just a file-descriptor index which is what most people use “fd” for). This in turn points to an inode object (inode_t) and we can associate our own rot-13-specific data with either of these.

vector and count together describe the data buffers for this write, which we’ll get to in a moment.

offset is the offset into the file at which we’re writing.

iobref is a buffer-reference object, which is used to track the life cycle of buffers containing read/write data. If you look closely, you’ll notice that vector[0].iov_base points to the same address as iobref->iobrefs[0]->ptr, which should give you some idea of the inter-relationships between vector and iobref.

 

OK, now what about that vector? We can use it to examine the data beingwritten, like this.

(gdb) p vector[0]

$2 = {iov_base = 0x7ffff7936000, iov_len = 8}

(gdb) x/s 0x7ffff7936000

0x7ffff7936000: "goodbye\n"

It’s not always safe to view this data as a string, because it might just as well be binary data, but since we’re generating the write this time it’s safe and convenient. With that knowledge, let’s step through things a bit.

 


(gdb) s

120             if  (priv->encrypt_write)

(gdb)

121                     rot13_iovec  (vector, count);

(gdb)

rot13_iovec (vector=0x7fffe8000cd8, count=1) at rot-13.c:57

57              for (i = 0; i <  count; i++) {

(gdb)

58                      rot13  (vector[i].iov_base, vector[i].iov_len);

(gdb)

rot13 (buf=0x7ffff7936000 "goodbye\n", len=8) at rot-13.c:45

45              for (i = 0; i <  len; i++) {

(gdb)

46                      if (buf[i]  >= 'a' && buf[i] <= 'z')

(gdb)

47                               buf[i] = 'a' + ((buf[i] - 'a' + 13) % 26);

 

Here we’ve stepped into rot13_iovec, which iterates through our vector calling rot13, which in turn iterates through the characters in that chunk doing the rot-13 operation if/as appropriate. This is pretty straightforward stuff, so let’s skip to the next interesting bit.
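For readers following along without a GlusterFS tree, the two loops seen in the debugger can be reproduced in a few lines. This is a sketch modeled on the lines gdb printed above, not a copy of rot-13.c: the names rot13_buf and rot13_vec are mine, and the upper-case branch is an assumption, since only the lower-case test is visible in the session.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Rotate ASCII letters by 13 places in place; leave other bytes alone. */
static void
rot13_buf (unsigned char *buf, size_t len)
{
        size_t i;

        for (i = 0; i < len; i++) {
                if (buf[i] >= 'a' && buf[i] <= 'z')
                        buf[i] = 'a' + ((buf[i] - 'a' + 13) % 26);
                else if (buf[i] >= 'A' && buf[i] <= 'Z')
                        buf[i] = 'A' + ((buf[i] - 'A' + 13) % 26);
        }
}

/* Apply rot13_buf to each of the `count` buffers described by `vector`. */
static void
rot13_vec (struct iovec *vector, int count)
{
        int i;

        assert (vector != NULL || count == 0);
        for (i = 0; i < count; i++)
                rot13_buf (vector[i].iov_base, vector[i].iov_len);
}
```

Because rot-13 is its own inverse, applying rot13_vec twice to the same vector restores the original bytes, which is why one function can serve both the read and the write path.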

 


(gdb) fin

Run till exit from #0  rot13  (buf=0x7ffff7936000 "goodbye\n", len=8) at rot-13.c:47

rot13_iovec (vector=0x7fffe8000cd8, count=1) at rot-13.c:57

57              for (i = 0; i <  count; i++) {

(gdb) fin

Run till exit from #0  rot13_iovec  (vector=0x7fffe8000cd8, count=1) at rot-13.c:57

rot13_writev (frame=0x7ffff6e4402c, this=0x638440, fd=0x7ffff409802c,

    vector=0x7fffe8000cd8,  count=1, offset=0, iobref=0x7fffe8001070) at rot-13.c:123

123             STACK_WIND (frame,  

(gdb) b 129

Breakpoint 2 at 0x7ffff50e4f35: file rot-13.c, line 129.

(gdb) b rot13_writev_cbk

Breakpoint 3 at 0x7ffff50e4db3: file rot-13.c, line 106.

(gdb) c

 

So we’ve set breakpoints on both the callback and the statement following the STACK_WIND. Which one will we hit first?

 


Breakpoint 3, rot13_writev_cbk (frame=0x7ffff6e4402c,  cookie=0x7ffff6e440d8,

    this=0x638440, op_ret=8,  op_errno=0, prebuf=0x7fffefffeca0,

    postbuf=0x7fffefffec30) at  rot-13.c:106

106             STACK_UNWIND_STRICT (writev, frame,  op_ret, op_errno, prebuf, postbuf);

(gdb) bt

#0  rot13_writev_cbk  (frame=0x7ffff6e4402c, cookie=0x7ffff6e440d8, this=0x638440,

    op_ret=8, op_errno=0,  prebuf=0x7fffefffeca0, postbuf=0x7fffefffec30)

    at rot-13.c:106

#1  0x00007ffff52f1b37 in  posix_writev (frame=0x7ffff6e440d8,

    this=<value optimized out>, fd=<value optimized out>,

    vector=<value optimized out>, count=1, offset=<value optimized out>,

    iobref=0x7fffe8001070) at  posix.c:2217

#2  0x00007ffff50e513e in  rot13_writev (frame=0x7ffff6e4402c, this=0x638440,

    fd=0x7ffff409802c,  vector=0x7fffe8000cd8, count=1, offset=0,

    iobref=0x7fffe8001070) at  rot-13.c:123

 

Surprise! We’re in rot13_writev_cbk now, called (indirectly) while we’re still in rot13_writev before STACK_WIND returns (still at rot-13.c:123). If you did any request cleanup here, then you need to be careful about what you do in the remainder of rot13_writev because data may have been freed etc. It’s tempting to say you should just do the cleanup in rot13_writev after the STACK_WIND, but that’s not valid because it’s also possible that some other translator returned without calling STACK_UNWIND – i.e. before rot13_writev_cbk is called, so then it would be the one getting null-pointer errors instead. To put it another way, the callback and the return from STACK_WIND can occur in either order or even simultaneously on different threads. Even if you were to use reference counts, you’d have to make sure to use locking or atomic operations to avoid races, and it’s not worth it. Unless you really understand the possible flows of control and know what you’re doing, it’s better to do cleanup in the callback and nothing after STACK_WIND.

At this point all that’s left is a STACK_UNWIND and a return. The STACK_UNWIND invokes our parent’s completion callback, and in this case our parent is FUSE so at that point the VFS layer is notified of the write being complete. Finally, we return through several levels of normal function calls until we come back to fuse_thread_proc, which waits for the next request.

So that’s it. For extra fun, you might want to repeat this exercise by stepping through some other call – stat or setxattr might be good choices – but you’ll have to use a translator that actually implements those calls to see much that’s interesting. Then you’ll pretty much know everything I knew when I started writing my first for-real translators, and probably even a bit more. I hope you’ve enjoyed this series, or at least found it useful, and if you have any suggestions for other topics I should cover please let me know (via comments or email, IRC or Twitter).
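The ordering seen in that backtrace can be reproduced with ordinary function calls. The sketch below is a toy model, assuming a child that completes synchronously; parent_write, child_write and the trace structure are hypothetical names, and none of the real STACK_WIND/STACK_UNWIND machinery (or the asynchronous, multi-threaded case) is modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Event log so that the order of execution can be inspected afterwards. */
enum event { EV_CHILD_CALLED = 1, EV_CALLBACK, EV_AFTER_WIND };

struct trace {
        enum event events[8];
        size_t     count;
};

static void
trace_add (struct trace *t, enum event ev)
{
        assert (t->count < 8);
        t->events[t->count++] = ev;
}

typedef void (*write_cbk) (struct trace *t, int op_ret);

/* Parent's callback: in a real translator, request cleanup belongs here. */
static void
parent_write_cbk (struct trace *t, int op_ret)
{
        (void) op_ret;
        trace_add (t, EV_CALLBACK);
}

/* A child that completes synchronously, as posix_writev did in the trace. */
static void
child_write (struct trace *t, write_cbk cbk)
{
        trace_add (t, EV_CHILD_CALLED);
        cbk (t, 8);     /* the callback runs before child_write returns */
}

/* Parent's dispatch function: "winds" to the child, then returns. */
static int
parent_write (struct trace *t)
{
        child_write (t, parent_write_cbk);
        /* Anything touched here may already have been freed by the callback. */
        trace_add (t, EV_AFTER_WIND);
        return 0;
}
```

After parent_write returns, the trace reads child called, callback, after wind. With a child that defers completion to another thread, the last two could swap or overlap, which is exactly why code after the wind should not touch per-request state.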

 

1.5   

 

If you want to hack on distributed filesystems, there is no easier way to get started than by writing a GlusterFS translator. To prove this point, I’ve recently implemented two new translators which are very simple but provide significant benefits in certain situations. These have nothing to do with HekaFS, really, except that HekaFS takes advantage of this same simplicity to do what it does. The first translator does negative-lookup caching.

     This is a very simple translator to cache “negative lookups” for workloads in which the same file is looked up many times in places where it doesn’t exist. In particular, web script files with many includes/requires and long paths can generate hundreds of such lookups per front-end request. If we don’t cache the negative results, this can mean hundreds of back-end network round trips per front-end request. So we cache. Very simple tests for this kind of workload on two machines connected via GigE show an approximately 3x performance improvement.
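To show where the savings come from, here is a small self-contained sketch of the lookup path with a negative cache in front of a back end. It is not the translator's code: ghost_cache, ghost_find, ghost_add, cached_lookup and the always_missing back end are hypothetical, a simple djb2-style hash is used in place of SuperFastHash, and both locking and removal of entries when a file later appears are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define BUCKETS 64

struct ghost {
        struct ghost *next;
        char         *path;
};

struct ghost_cache {
        struct ghost *buckets[BUCKETS];
};

/* Simple string hash (djb2 style), standing in for SuperFastHash. */
static unsigned
ghost_hash (const char *path)
{
        unsigned h = 5381;

        while (*path)
                h = h * 33 + (unsigned char) *path++;
        return h % BUCKETS;
}

/* Return 1 if `path` is a known-missing entry, 0 otherwise. */
static int
ghost_find (const struct ghost_cache *cache, const char *path)
{
        const struct ghost *gp;

        for (gp = cache->buckets[ghost_hash (path)]; gp; gp = gp->next) {
                if (strcmp (gp->path, path) == 0)
                        return 1;
        }
        return 0;
}

/* Remember that `path` does not exist.  Returns 0, or -1 if out of memory. */
static int
ghost_add (struct ghost_cache *cache, const char *path)
{
        struct ghost *gp = calloc (1, sizeof (*gp));
        unsigned      bucket = ghost_hash (path);

        if (!gp)
                return -1;
        gp->path = malloc (strlen (path) + 1);
        if (!gp->path) {
                free (gp);
                return -1;
        }
        strcpy (gp->path, path);
        gp->next = cache->buckets[bucket];
        cache->buckets[bucket] = gp;
        return 0;
}

/* Demo back end that never finds anything (stands in for a network trip). */
static int
always_missing (const char *path)
{
        (void) path;
        return ENOENT;
}

/*
 * Lookup with negative caching: a cached miss is answered locally with
 * ENOENT; otherwise `backend` is asked, and a miss there is remembered.
 */
static int
cached_lookup (struct ghost_cache *cache, const char *path,
               int (*backend) (const char *path), int *backend_calls)
{
        int err;

        assert (cache != NULL && path != NULL && backend != NULL);
        if (ghost_find (cache, path))
                return ENOENT;
        (*backend_calls)++;
        err = backend (path);
        if (err == ENOENT)
                (void) ghost_add (cache, path);
        return err;
}
```

Looking up the same missing path twice costs one back-end call instead of two; repeated across hundreds of include paths per request, that is the round-trip saving the paragraph above describes.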

 

The second translator bypasses AFR.

        This is a proof-of-concept translator for an idea that was proposed at FUDcon 2012 in Blacksburg, VA. The idea is simply that we can forward writes only to local storage, bypassing AFR but setting the xattrs ourselves to indicate that self-heal is needed. This gives us near-local write speeds, and we can mount later without the bypass to force self-heal when it’s convenient. We can do almost the same thing for reads as well.

 

They weigh in at 224 and 229 lines respectively, with some of that taken up by licenses and white space. Each took less than a day to write. Please bear in mind, though, that these are only prototypes. They exist to teach and to make a point, not – in their current form – to be used in production. Making them suitable for real-world use would at least double their size and triple the time needed for testing. That’s still orders of magnitude better than what you’d have to do to implement similar functionality in other projects that claim to be competitive with GlusterFS, and the result is still much more functional than one of those stripped-down jokes that just have “FS” in the name to mislead users. If you’re a developer and you think you can do distribution or replication or caching or anything else better than GlusterFS, show us. Translators let you implement your ideas quickly, and then do a true “apples to apples” comparison vs. what came before. That could revolutionize distributed storage, but only if people take advantage of the opportunity.

 

 /negative.h

 

#ifndef __NEGATIVE_H__
#define __NEGATIVE_H__

#ifndef _CONFIG_H
#define _CONFIG_H
#include "config.h"
#endif

#include "mem-types.h"
#include "hashfn.h"

#define GHOST_BUCKETS 64
#define GHOST_HASH(x) (SuperFastHash(x,strlen(x)) % GHOST_BUCKETS)

typedef struct _ghost {
        struct _ghost *next;
        char *path;
} ghost_t;

typedef struct {
        ghost_t *ghosts[GHOST_BUCKETS];
} negative_private_t;

enum gf_negative_mem_types_ {
        gf_negative_mt_priv = gf_common_mt_end + 1,
        gf_negative_mt_ghost,
        gf_negative_mt_end
};

#endif /* __NEGATIVE_H__ */

 /  negative.c

 

#include <stdint.h>   /* uint32_t, uint64_t */

#include <string.h>   /* strcmp, strlen */

 

#ifndef _CONFIG_H

#define _CONFIG_H

#include "config.h"

#endif

 

#include "glusterfs.h"

#include "xlator.h"

#include "logging.h"

 

#include "negative.h"

 

void

exorcise (xlator_t *this,  char *spirit)

{

        negative_private_t  *priv = this->private;

        ghost_t  *gp = NULL;

        ghost_t  **gpp = NULL;

        uint32_t bucket =  0;

 

        bucket =  GHOST_HASH(spirit);

        for  (gpp = &priv->ghosts[bucket]; *gpp; gpp =  &(*gpp)->next) {

                gp  = *gpp;

                if  (!strcmp(gp->path,spirit)) {

                        *gpp  = gp->next;

                        GF_FREE(gp->path);  

                        GF_FREE(gp);  

                        gf_log(this->name,GF_LOG_DEBUG,"removed  %s",spirit);

                        break;  

                }  

        }

}

 

int32_t

negative_lookup_cbk (call_frame_t  *frame, void *cookie, xlator_t *this,  

                     int32_t op_ret, int32_t op_errno,  inode_t *inode,

                     struct  iatt *buf, dict_t *dict, struct iatt *postparent)  

{

        negative_private_t  *priv = this->private;

        ghost_t  *gp = NULL;

        uint64_t ctx = 0;

        uint32_t bucket =  0;

 

        inode_ctx_get(inode,this,&ctx);  

        if  (op_ret < 0) {

                gp  = GF_CALLOC(1,sizeof(ghost_t),gf_negative_mt_ghost);

                if  (gp) {

                        gp->path  = (char *)ctx;

                        bucket  = GHOST_HASH(gp->path);

                        /* TBD:  locking */

                        gp->next  = priv->ghosts[bucket];

                        priv->ghosts[bucket]  = gp;

                        gf_log(this->name,GF_LOG_DEBUG,"added  %s",

                               (char *)ctx);  

                        goto  unwind;

                }  

        }

        else  {

                gf_log(this->name,GF_LOG_DEBUG,"found  %s", (char *)ctx);

                exorcise(this,(char *)ctx);  

        }

 

        /* Both  positive result and allocation failure come here. */

        GF_FREE((void *)ctx);  

 

unwind:

        STACK_UNWIND_STRICT  (lookup, frame, op_ret, op_errno, inode, buf,

                             dict,  postparent);

        return  0;

}

 

int32_t

negative_lookup (call_frame_t  *frame, xlator_t *this, loc_t *loc,

                 dict_t  *xattr_req)

{

        negative_private_t  *priv = this->private;

        ghost_t  *gp = NULL;

        uint32_t bucket =  0;

 

        bucket =  GHOST_HASH(loc->path);

        for  (gp = priv->ghosts[bucket]; gp; gp = gp->next)  {

                if  (!strcmp(gp->path,loc->path)) {

                        gf_log(this->name,GF_LOG_DEBUG,"%s (%p)  => HIT",

                               loc->path,  loc->inode);

                        STACK_UNWIND_STRICT  (lookup, frame, -1, ENOENT,

                                             NULL, NULL, NULL, NULL);

                        return  0;

                }  

        }

        gf_log(this->name,GF_LOG_DEBUG,"%s (%p)  => MISS",

               loc->path,  loc->inode);

        inode_ctx_put(loc->inode,this,(uint64_t)gf_strdup(loc->path));  

        STACK_WIND  (frame, negative_lookup_cbk, FIRST_CHILD(this),

                    FIRST_CHILD(this)->fops->lookup,  loc, xattr_req);

        return  0;

}

 

int32_t

negative_create_cbk (call_frame_t  *frame, void *cookie, xlator_t *this,  

                     int32_t op_ret, int32_t op_errno,  fd_t *fd, inode_t *inode,

                     struct  iatt *buf, struct iatt *preparent,

                     struct  iatt *postparent)

{

        uint64_t ctx = 0;

 

        inode_ctx_get(inode,this,&ctx);  

        exorcise(this,(char *)ctx);  

        GF_FREE((void *)ctx);  

 

        STACK_UNWIND_STRICT  (create, frame, op_ret, op_errno, fd, inode, buf,

                             preparent,  postparent);

        return  0;

}

 

int32_t

negative_create (call_frame_t  *frame, xlator_t *this, loc_t *loc, int32_t flags,

                 mode_t  mode, fd_t *fd, dict_t *params)

{

        inode_ctx_put(loc->inode,this,(uint64_t)gf_strdup(loc->path));  

        STACK_WIND  (frame, negative_create_cbk, FIRST_CHILD(this),

                    FIRST_CHILD(this)->fops->create,  loc, flags, mode, fd,

                    params);  

        return  0;

}

 

int32_t

negative_mkdir_cbk (call_frame_t  *frame, void *cookie, xlator_t *this,  

                    int32_t op_ret, int32_t op_errno,  inode_t *inode,

                    struct  iatt *buf, struct iatt *preparent,

                    struct  iatt *postparent)

{

        uint64_t ctx = 0;

 

        inode_ctx_get(inode,this,&ctx);  

        exorcise(this,(char *)ctx);  

        GF_FREE((void *)ctx);  

 

        STACK_UNWIND_STRICT  (mkdir, frame, op_ret, op_errno, inode,

                             buf,  preparent, postparent);

        return  0;

}

 

int

negative_mkdir (call_frame_t  *frame, xlator_t *this, loc_t *loc, mode_t mode,

                dict_t  *params)

{

        inode_ctx_put(loc->inode,this,(uint64_t)gf_strdup(loc->path));  

        STACK_WIND  (frame, negative_mkdir_cbk, FIRST_CHILD(this),

                    FIRST_CHILD(this)->fops->mkdir,  loc, mode, params);

        return  0;

}

 

int32_t
init (xlator_t *this)
{
        negative_private_t *priv = NULL;

        if (!this->children || this->children->next) {
                gf_log ("negative", GF_LOG_ERROR,
                        "FATAL: negative should have exactly one child");
                return -1;
        }

        if (!this->parents) {
                gf_log (this->name, GF_LOG_WARNING,
                        "dangling volume. check volfile ");
        }

        priv = GF_CALLOC (1, sizeof (negative_private_t), gf_negative_mt_priv);
        if (!priv)
                return -1;

        this->private = priv;
        gf_log ("negative", GF_LOG_DEBUG, "negative xlator loaded");
        return 0;
}

void
fini (xlator_t *this)
{
        negative_private_t *priv = this->private;

        if (!priv)
                return;
        this->private = NULL;
        GF_FREE (priv);

        return;
}

 

struct xlator_fops fops = {

        .lookup  = negative_lookup,

        .create  = negative_create,

        .mkdir =  negative_mkdir,

};

 

struct xlator_cbks cbks = {

};

 

struct volume_options options[] = {

        { .key = {NULL} },

};

 

 /Makefile 

 

# Change these to match your source code.

TARGET = negative.so

OBJECTS = negative.o

 

# Change these to match your environment.

GLFS_SRC = /root/glusterfs_patches

GLFS_ROOT = /opt/glusterfs

GLFS_VERS = 3git

GLFS_LIB = $(GLFS_ROOT)/$(GLFS_VERS)/lib64  

HOST_OS = GF_LINUX_HOST_OS

 

# You shouldn't need to change anything below here.

 

CFLAGS = -fPIC -Wall -O0 -g \
	-DHAVE_CONFIG_H -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -D$(HOST_OS) \
	-I$(GLFS_SRC) -I$(GLFS_SRC)/libglusterfs/src \
	-I$(GLFS_SRC)/contrib/uuid -I.
LDFLAGS = -shared -nostartfiles -L$(GLFS_LIB) -lglusterfs -lpthread

$(TARGET): $(OBJECTS)
	$(CC) $(CFLAGS) $(OBJECTS) $(LDFLAGS) -o $(TARGET)

install: $(TARGET)
	cp $(TARGET) $(GLFS_LIB)/glusterfs/$(GLFS_VERS)/xlator/features

clean:
	rm -f $(TARGET) $(OBJECTS)

 

 /README.md  

 

This is a very simple translator to cache "negative lookups" for workloads in which the same file is looked up many times in places where it doesn't exist. In particular, web script files with many includes/requires and long paths can generate hundreds of such lookups per front-end request. If we don't cache the negative results, this can mean hundreds of back-end network round trips per front-end request. So we cache. Very simple tests for this kind of workload on two machines connected via GigE show an approximately 3x performance improvement.
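As a rough illustration of the caching idea described above, here is a minimal standalone sketch in plain C that runs outside GlusterFS. The `sketch_` names are hypothetical, and a djb2-style hash stands in for `SuperFastHash`. Only the overall shape (a fixed array of buckets, each a singly linked list of paths known to be missing) follows negative.h.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_BUCKETS 64

typedef struct sketch_ghost {
        struct sketch_ghost *next;
        char *path;
} sketch_ghost_t;

static sketch_ghost_t *sketch_table[SKETCH_BUCKETS];

/* djb2-style string hash; an assumption standing in for SuperFastHash. */
static unsigned sketch_hash (const char *s)
{
        unsigned long h = 5381;

        while (*s)
                h = h * 33 + (unsigned char)*s++;
        return (unsigned)(h % SKETCH_BUCKETS);
}

/* Return 1 if path is cached as "known missing", 0 otherwise. */
int sketch_is_ghost (const char *path)
{
        sketch_ghost_t *gp;

        assert (path != NULL);
        for (gp = sketch_table[sketch_hash (path)]; gp; gp = gp->next) {
                if (!strcmp (gp->path, path))
                        return 1;
        }
        return 0;
}

/* Remember a failed lookup. Returns 0 on success, -1 if out of memory. */
int sketch_add_ghost (const char *path)
{
        sketch_ghost_t *gp;
        size_t len;
        unsigned bucket;

        if (sketch_is_ghost (path))
                return 0;
        gp = calloc (1, sizeof (*gp));
        if (!gp)
                return -1;
        len = strlen (path) + 1;
        gp->path = malloc (len);
        if (!gp->path) {
                free (gp);
                return -1;
        }
        memcpy (gp->path, path, len);
        bucket = sketch_hash (path);
        gp->next = sketch_table[bucket];
        sketch_table[bucket] = gp;
        return 0;
}

/* Forget a path, e.g. once it has been found or created. */
void sketch_exorcise (const char *path)
{
        sketch_ghost_t **gpp;

        assert (path != NULL);
        for (gpp = &sketch_table[sketch_hash (path)]; *gpp;
             gpp = &(*gpp)->next) {
                sketch_ghost_t *gp = *gpp;

                if (!strcmp (gp->path, path)) {
                        *gpp = gp->next;
                        free (gp->path);
                        free (gp);
                        return;
                }
        }
}
```

A lookup path would call `sketch_is_ghost` first and fail fast on a hit, call `sketch_add_ghost` when the real lookup fails, and call `sketch_exorcise` after a successful lookup, create, or mkdir.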

 

This code is nowhere near ready for production use yet. It was originally developed as a pedagogical example, but one that could lead to something truly useful as well. Among other things, the following features need to be added.

 

· Support for other namespace-modifying operations: link, symlink, mknod, rename, even funky xattr requests.

· Time-based cache expiration to cover the case where another client creates a file that's in our cache because it wasn't there when we first looked it up. This might even include periodic pruning of entries that are already stale but will never be looked up (and therefore never reaped in-line) again.

· Locking on the cache for when we're called concurrently.
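Two of the missing features listed above, time-based expiration and locking, can be sketched together in standalone C. This is only one possible approach under stated assumptions: the `ttl_` names and the five-second lifetime are hypothetical, a single pthread mutex guards the whole table, and the caller passes the current time in explicitly so the behavior is easy to check.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define TTL_BUCKETS 64
#define TTL_SECONDS 5           /* hypothetical lifetime of an entry */

typedef struct ttl_ghost {
        struct ttl_ghost *next;
        char *path;
        time_t stamp;           /* when the negative result was recorded */
} ttl_ghost_t;

static ttl_ghost_t *ttl_table[TTL_BUCKETS];
static pthread_mutex_t ttl_lock = PTHREAD_MUTEX_INITIALIZER;

static unsigned ttl_hash (const char *s)
{
        unsigned long h = 5381;

        while (*s)
                h = h * 33 + (unsigned char)*s++;
        return (unsigned)(h % TTL_BUCKETS);
}

static int ttl_stale (const ttl_ghost_t *gp, time_t now)
{
        return difftime (now, gp->stamp) >= TTL_SECONDS;
}

/* Insert a path, or refresh its timestamp. Returns 0, or -1 on ENOMEM. */
int ttl_add (const char *path, time_t now)
{
        ttl_ghost_t *gp;
        size_t len;
        unsigned b;
        int ret = -1;

        assert (path != NULL);
        b = ttl_hash (path);
        pthread_mutex_lock (&ttl_lock);
        for (gp = ttl_table[b]; gp; gp = gp->next) {
                if (!strcmp (gp->path, path)) {
                        gp->stamp = now;
                        ret = 0;
                        goto out;
                }
        }
        gp = calloc (1, sizeof (*gp));
        if (!gp)
                goto out;
        len = strlen (path) + 1;
        gp->path = malloc (len);
        if (!gp->path) {
                free (gp);
                goto out;
        }
        memcpy (gp->path, path, len);
        gp->stamp = now;
        gp->next = ttl_table[b];
        ttl_table[b] = gp;
        ret = 0;
out:
        pthread_mutex_unlock (&ttl_lock);
        return ret;
}

/* Return 1 on a fresh hit. A stale entry is reaped in-line and counts
 * as a miss, as does an absent one. */
int ttl_check (const char *path, time_t now)
{
        ttl_ghost_t **gpp;
        int hit = 0;

        assert (path != NULL);
        pthread_mutex_lock (&ttl_lock);
        for (gpp = &ttl_table[ttl_hash (path)]; *gpp; gpp = &(*gpp)->next) {
                ttl_ghost_t *gp = *gpp;

                if (strcmp (gp->path, path))
                        continue;
                if (!ttl_stale (gp, now)) {
                        hit = 1;
                } else {
                        *gpp = gp->next;
                        free (gp->path);
                        free (gp);
                }
                break;
        }
        pthread_mutex_unlock (&ttl_lock);
        return hit;
}

/* Periodic pruning: drop every stale entry and return how many went. */
int ttl_prune (time_t now)
{
        int removed = 0;
        unsigned b;

        pthread_mutex_lock (&ttl_lock);
        for (b = 0; b < TTL_BUCKETS; b++) {
                ttl_ghost_t **gpp = &ttl_table[b];

                while (*gpp) {
                        ttl_ghost_t *gp = *gpp;

                        if (ttl_stale (gp, now)) {
                                *gpp = gp->next;
                                free (gp->path);
                                free (gp);
                                removed++;
                        } else {
                                gpp = &gp->next;
                        }
                }
        }
        pthread_mutex_unlock (&ttl_lock);
        return removed;
}
```

One global lock keeps the sketch short; per-bucket locks would reduce contention at the cost of more bookkeeping.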

 

This is intended to be a learning tool. I might not get back to this code myself for a long time, but I always have time to help anyone who's learning to write translators. If you want to help move it along, please fork and send me pull requests.

 

For more information on writing GlusterFS translators, check out my "Translator 101" series.

 

Module 2:

 

 /bypass.h

/*

 * Copyright (c)  2011 Red Hat  

 */

#ifndef __bypass_H__

#define __bypass_H__

#ifndef _CONFIG_H

#define _CONFIG_H

#include "config.h"

#endif

#include "mem-types.h"

/* Deal with casts for 32-bit architectures. */

#define CAST2INT(x) ((uint64_t)(long)(x))

#define CAST2PTR(x) ((void *)(long)(x))

typedef struct {

        xlator_t  *target;

} bypass_private_t;

enum gf_bypass_mem_types_ {

        gf_bypass_mt_priv_t  = gf_common_mt_end + 1,

        gf_by_mt_int32_t,  

        gf_bypass_mt_end  

};

#endif /* __bypass_H__ */

 

 /bypass.c

/*

 * Copyright (c)  2011 Red Hat  

 */

#include <arpa/inet.h>
#include <string.h>

#ifndef _CONFIG_H

#define _CONFIG_H

#include "config.h"

#endif

#include "glusterfs.h"

#include "call-stub.h"

#include "defaults.h"

#include "logging.h"

#include "xlator.h"

#include "bypass.h"

int32_t

bypass_readv (call_frame_t  *frame, xlator_t *this, fd_t *fd, size_t size,

              off_t offset)

{

        bypass_private_t  *priv = this->private;

        STACK_WIND  (frame, default_readv_cbk, priv->target,

                    priv->target->fops->readv,  fd, size, offset);

        return  0;

}

dict_t *
get_pending_dict (xlator_t *this)
{
        dict_t           *dict  = NULL;
        xlator_list_t    *trav  = NULL;
        char             *key   = NULL;
        int32_t          *value = NULL;
        xlator_t         *afr   = NULL;
        bypass_private_t *priv  = this->private;

        dict = dict_new();
        if (!dict) {
                gf_log (this->name, GF_LOG_WARNING, "failed to allocate dict");
                return NULL;
        }

        afr = this->children->xlator;
        for (trav = afr->children; trav; trav = trav->next) {
                /* Skip the subvolume that we write to directly. */
                if (trav->xlator == priv->target) {
                        continue;
                }
                if (gf_asprintf(&key,"trusted.afr.%s",trav->xlator->name) < 0) {
                        gf_log (this->name, GF_LOG_WARNING,
                                "failed to allocate key");
                        goto free_dict;
                }
                value = GF_CALLOC(3,sizeof(*value),gf_by_mt_int32_t);
                if (!value) {
                        gf_log (this->name, GF_LOG_WARNING,
                                "failed to allocate value");
                        goto free_key;
                }
                /* Amazingly, there's no constant for this. */
                /* The counters are 32 bits wide, so use htonl, not htons. */
                value[0] = htonl(1);
                if (dict_set_dynptr(dict,key,value,3*sizeof(*value)) < 0) {
                        gf_log (this->name, GF_LOG_WARNING,
                                "failed to set up dict");
                        goto free_value;
                }
        }
        return dict;

free_value:
        GF_FREE(value);
free_key:
        GF_FREE(key);
free_dict:
        dict_unref(dict);
        return NULL;
}

int32_t
bypass_set_pending_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                        int32_t op_ret, int32_t op_errno, dict_t *dict)
{
        if (op_ret < 0) {
                goto unwind;
        }

        /* The xattrs are set, so resume the stubbed write. */
        call_resume(cookie);
        return 0;

unwind:
        STACK_UNWIND_STRICT (writev, frame, op_ret, op_errno, NULL, NULL);
        return 0;
}

int32_t

bypass_writev_resume (call_frame_t  *frame, xlator_t *this, fd_t *fd,

                      struct  iovec *vector, int32_t count, off_t off,

                      struct  iobref *iobref)

{

        bypass_private_t  *priv = this->private;

        STACK_WIND  (frame, default_writev_cbk, priv->target,

                    priv->target->fops->writev,  fd, vector, count, off,

                    iobref);  

        return  0;

}

int32_t
bypass_writev (call_frame_t *frame, xlator_t *this, fd_t *fd,
               struct iovec *vector, int32_t count, off_t off,
               struct iobref *iobref)
{
        dict_t           *dict = NULL;
        call_stub_t      *stub = NULL;
        bypass_private_t *priv = this->private;

        /*
         * I wish we could just create the stub pointing to the target's
         * writev function, but then we'd get into another translator's code
         * with "this" pointing to us.
         */
        stub = fop_writev_stub(frame, bypass_writev_resume,
                               fd, vector, count, off, iobref);
        if (!stub) {
                gf_log (this->name, GF_LOG_WARNING, "failed to allocate stub");
                goto wind;
        }

        dict = get_pending_dict(this);
        if (!dict) {
                gf_log (this->name, GF_LOG_WARNING, "failed to allocate dict");
                goto free_stub;
        }

        STACK_WIND_COOKIE (frame, bypass_set_pending_cbk, stub,
                           priv->target, priv->target->fops->fxattrop,
                           fd, GF_XATTROP_ADD_ARRAY, dict);
        return 0;

free_stub:
        call_stub_destroy(stub);
wind:
        /* Fall back to a normal write through the first child. */
        if (dict) {
                dict_unref(dict);
        }
        STACK_WIND (frame, default_writev_cbk, FIRST_CHILD(this),
                    FIRST_CHILD(this)->fops->writev, fd, vector, count, off,
                    iobref);
        return 0;
}

/*

 * Even  applications that only read seem to call this, and it can force an

 * unwanted  self-heal.

 * TBD: there are  probably more like this - stat, open(O_RDONLY), etc.

 */

int32_t

bypass_fstat (call_frame_t  *frame, xlator_t *this, fd_t *fd)

{

        bypass_private_t  *priv = this->private;

        STACK_WIND  (frame, default_fstat_cbk, priv->target,

                    priv->target->fops->fstat,  fd);

        return  0;

}

int32_t
init (xlator_t *this)
{
        xlator_t         *tgt_xl = NULL;
        bypass_private_t *priv   = NULL;

        if (!this->children || this->children->next) {
                gf_log (this->name, GF_LOG_ERROR,
                        "FATAL: bypass should have exactly one child");
                return -1;
        }

        tgt_xl = this->children->xlator;
        /* TBD: check for cluster/afr as well */
        if (strcmp(tgt_xl->type,"cluster/replicate")) {
                gf_log (this->name, GF_LOG_ERROR,
                        "%s must be loaded above cluster/replicate",
                        this->type);
                return -1;
        }

        /* TBD: pass target-translator name as an option (instead of first) */
        tgt_xl = tgt_xl->children->xlator;

        priv = GF_CALLOC (1, sizeof (bypass_private_t), gf_bypass_mt_priv_t);
        if (!priv)
                return -1;

        priv->target = tgt_xl;
        this->private = priv;

        gf_log (this->name, GF_LOG_DEBUG, "bypass xlator loaded");
        return 0;
}

void
fini (xlator_t *this)
{
        bypass_private_t *priv = this->private;

        if (!priv)
                return;
        this->private = NULL;
        GF_FREE (priv);

        return;
}

struct xlator_fops fops = {
        .readv  = bypass_readv,
        .writev = bypass_writev,
        .fstat  = bypass_fstat
};

struct xlator_cbks cbks = {
};

struct volume_options options[] = {
        { .key = {NULL} },
};

 

 /Makefile

 

# Change these to match your source code.

TARGET = bypass.so

OBJECTS = bypass.o

# Change these to match your environment.

GLFS_SRC = /root/glusterfs_patches

GLFS_ROOT = /opt/glusterfs

GLFS_VERS = 3git

GLFS_LIB = `ls -d $(GLFS_ROOT)/$(GLFS_VERS)/lib*`

HOST_OS = GF_LINUX_HOST_OS

# You shouldn't need to change anything below here.

CFLAGS = -fPIC -Wall -O0 -g \
	-DHAVE_CONFIG_H -D_FILE_OFFSET_BITS=64 -D_GNU_SOURCE -D$(HOST_OS) \
	-I$(GLFS_SRC) -I$(GLFS_SRC)/libglusterfs/src \
	-I$(GLFS_SRC)/contrib/uuid -I.
LDFLAGS = -shared -nostartfiles -L$(GLFS_LIB) -lglusterfs -lpthread

$(TARGET): $(OBJECTS)
	$(CC) $(CFLAGS) $(OBJECTS) $(LDFLAGS) -o $(TARGET)

install: $(TARGET)
	cp $(TARGET) $(GLFS_LIB)/glusterfs/$(GLFS_VERS)/xlator/features

clean:
	rm -f $(TARGET) $(OBJECTS)

 

 

 /bytest-fuse.vol

volume bytest-posix-0

    type storage/posix

    option directory  /export/bytest1

end-volume

 

volume bytest-locks-0

    type features/locks

    subvolumes bytest-posix-0

end-volume

 

volume bytest-client-1

    type protocol/client

    option remote-host gfs1

    option remote-subvolume  /export/bytest2

    option transport-type tcp

end-volume

 

volume bytest-replicate-0

    type cluster/replicate

    subvolumes bytest-locks-0  bytest-client-1

end-volume

 

volume bytest-bypass

    type features/bypass

    subvolumes bytest-replicate-0

end-volume

 

volume bytest

    type debug/io-stats

    option latency-measurement off

    option count-fop-hits off

    subvolumes bytest-bypass

end-volume

 

 /README.md

This is a proof-of-concept translator for an idea that was proposed at FUDcon 2012 in Blacksburg, VA. The idea is simply that we can forward writes only to local storage, bypassing AFR but setting the xattrs ourselves to indicate that self-heal is needed. This gives us near-local write speeds, and we can mount later without the bypass to force self-heal when it's convenient. We can do almost the same thing for reads as well. I've tried this and it seems to work, but there are some major caveats.

 

· If multiple clients try to write the same file with bypass turned on, you'll get massive split-brain problems. Solutions might include honoring AFR's quorum-enforcement rules, auto-issuing locks during open to prevent such concurrent access, or simply documenting the fact that users must do such locking themselves. The last might sound like a cop-out, but such locking is already common for the likely use case of serving virtual-machine images.

· We only intercept readv, writev, and fstat. There are many other calls that can trigger self-heal, including plain old lookup. The only way to prevent a lookup self-heal would be to put another translator below AFR to intercept xattr requests and pretend everything's OK. Ick. Remember, though, that this is only a proof of concept. If we really wanted to get serious about this, we could implement the same technique within AFR and do all the necessary coordination there.

· It would be nice if the AFR subvolume to use could be specified as an option (instead of just picking the first child), if bypass could be made selective, etc.
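To make the "setting the xattrs ourselves" step more concrete, here is a small standalone sketch in plain C of the value that bypass.c builds for each skipped subvolume: a key of the form `trusted.afr.<subvolume>` and an array of three 32-bit counters with only the first one bumped. The `pending_` helper names are hypothetical. Treating the counters as big-endian 32-bit integers, and modelling the add-array operation as an element-wise sum, are my assumptions from reading the listing rather than anything the README states.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PENDING_SLOTS  3
#define PENDING_PREFIX "trusted.afr."

/* Build "trusted.afr.<peer>" in buf. Returns 0, or -1 if buf is too small. */
int pending_key (char *buf, size_t len, const char *peer)
{
        size_t need;

        assert (buf != NULL && peer != NULL);
        need = strlen (PENDING_PREFIX) + strlen (peer) + 1;
        if (need > len)
                return -1;
        strcpy (buf, PENDING_PREFIX);
        strcat (buf, peer);
        return 0;
}

/* Fill an increment that bumps only slot 0, as the listing does. */
void pending_increment (uint32_t delta[PENDING_SLOTS])
{
        memset (delta, 0, PENDING_SLOTS * sizeof (delta[0]));
        delta[0] = htonl (1);
}

/* Model of an add-array update: element-wise sum in host order,
 * stored back in network order. */
void pending_apply (uint32_t stored[PENDING_SLOTS],
                    const uint32_t delta[PENDING_SLOTS])
{
        int i;

        for (i = 0; i < PENDING_SLOTS; i++)
                stored[i] = htonl (ntohl (stored[i]) + ntohl (delta[i]));
}

/* Any non-zero counter means this copy is owed a self-heal. */
int pending_needs_heal (const uint32_t stored[PENDING_SLOTS])
{
        int i;

        for (i = 0; i < PENDING_SLOTS; i++) {
                if (stored[i] != 0)
                        return 1;
        }
        return 0;
}
```

Each bypassed write adds one to the first counter for every subvolume that did not receive the data, which is what a later mount without bypass would notice and heal.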

 

The coolest direction to go here would be to put information about writes we've seen onto a queue, with a separate process listening on that queue to perform asynchronous but nearly immediate self-heal on just those files. As long as the other consistency issues are handled properly, this might be a really easy way to get near-local performance for virtual-machine-image use cases without introducing consistency/recovery nightmares.
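The queue idea above could start from something as simple as the following standalone C sketch. It is an in-process simplification: the write path pushes file paths into a bounded ring buffer, and a healer thread (standing in for the separate process the text describes) pops them. All `hq_` names and both size limits are hypothetical, and nothing here talks to GlusterFS.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define HQ_SLOTS    128         /* hypothetical queue depth */
#define HQ_PATH_MAX 256         /* hypothetical longest path, with NUL */

static char hq_ring[HQ_SLOTS][HQ_PATH_MAX];
static unsigned hq_head;        /* index of the oldest entry */
static unsigned hq_count;       /* number of queued entries */
static pthread_mutex_t hq_lock = PTHREAD_MUTEX_INITIALIZER;

/* Producer side, called from the write path.
 * Returns 0 if queued, -1 if the queue is full or the path is too long. */
int hq_push (const char *path)
{
        size_t len;
        int ret = -1;

        assert (path != NULL);
        len = strlen (path);
        pthread_mutex_lock (&hq_lock);
        if (len < HQ_PATH_MAX && hq_count < HQ_SLOTS) {
                unsigned tail = (hq_head + hq_count) % HQ_SLOTS;

                memcpy (hq_ring[tail], path, len + 1);
                hq_count++;
                ret = 0;
        }
        pthread_mutex_unlock (&hq_lock);
        return ret;
}

/* Consumer side, called by the healer.
 * Copies the oldest path into out and returns 0, or -1 if empty. */
int hq_pop (char out[HQ_PATH_MAX])
{
        int ret = -1;

        assert (out != NULL);
        pthread_mutex_lock (&hq_lock);
        if (hq_count > 0) {
                strcpy (out, hq_ring[hq_head]);
                hq_head = (hq_head + 1) % HQ_SLOTS;
                hq_count--;
                ret = 0;
        }
        pthread_mutex_unlock (&hq_lock);
        return ret;
}
```

A real version would need to decide what happens when the queue is full (dropping an entry only delays the heal until the next full self-heal) and would probably collapse repeated writes to the same file into one entry.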