Blktap2 Userspace Tools + Library
================================

Dutch Meyer
4th June 2009

Andrew Warfield and Julian Chesterfield
16th June 2006


The blktap2 userspace toolkit provides a user-level disk I/O
interface. The blktap2 mechanism involves a kernel driver that acts
similarly to the existing Xen/Linux blkback driver, and a set of
associated user-level libraries.  Using these tools, blktap2 allows
virtual block devices presented to VMs to be implemented in userspace
and to be backed by raw partitions, files, network, etc.

The key benefit of blktap2 is that it makes it easy and fast to write
arbitrary block backends, and that these user-level backends actually
perform very well.  Specifically:

- Metadata disk formats such as Copy-on-Write, encrypted disks, sparse
  formats and other compression features can be easily implemented.

- Accessing file-based images from userspace avoids problems related
  to flushing dirty pages which are present in the Linux loopback
  driver.  (Specifically, doing a large number of writes to an
  NFS-backed image doesn't result in the OOM killer going berserk.)

- Per-disk handler processes enable easier userspace policing of block
  resources, and process-granularity QoS techniques (disk scheduling
  and related tools) may be trivially applied to block devices.

- It's very easy to take advantage of userspace facilities such as
  networking libraries, compression utilities, peer-to-peer
  file-sharing systems and so on to build more complex block backends.

- Crashes are contained -- incremental development/debugging is very
  fast.

How it works (in one paragraph):

Working in conjunction with the kernel blktap2 driver, all disk I/O
requests from VMs are passed to the userspace daemon (using a shared
memory interface) through a character device. Each active disk is
mapped to an individual device node, allowing per-disk processes to
implement individual block devices where desired.  The userspace
drivers are implemented using asynchronous (Linux libaio),
O_DIRECT-based calls to preserve the unbuffered, batched and
asynchronous request dispatch achieved with the existing blkback
code.  We provide a simple, asynchronous virtual disk interface that
makes it quite easy to add new disk implementations.

As of June 2009 the currently supported disk formats are:

 - Raw Images (both on partitions and in image files)
 - Fast shareable RAM disk between VMs (requires some form of
   cluster-based filesystem support e.g. OCFS2 in the guest kernel)
 - VHD, including snapshots and sparse images
 - Qcow, including snapshots and sparse images


Build and Installation Instructions
===================================

Make sure to configure the blktap2 backend driver in your dom0
kernel.  It will inter-operate with the existing backend and frontend
drivers.  It will also coexist with the original blktap driver.
However, some formats (currently aio and qcow) will default to their
blktap2 versions when specified in a VM configuration file.

To build the tools separately, run "make && make install" in
tools/blktap2.


Using the Tools
===============

Preparing an image for boot:

The userspace disk agent is configured to start automatically via xend.

Customize the VM config file to use the 'tap:tapdisk' handler,
followed by the driver type, e.g. for a raw image such as a file or
partition:

disk = ['tap:tapdisk:aio:<FILENAME>,sda1,w']

Alternatively, the vhd-util tool (installed with make install, or in
/blktap2/vhd) can be used to build sparse copy-on-write vhd images.

For example, to build a sparse image -
  vhd-util create -n MyVHDFile -s 1024

This creates a sparse 1GB file named "MyVHDFile" that can be mounted
and populated with data.

One can also base the image on a raw file -
  vhd-util snapshot -n MyVHDFile -p SomeRawFile -m

This creates a sparse VHD file named "MyVHDFile" using "SomeRawFile"
as a parent image.  Copy-on-write semantics ensure that writes will be
stored in "MyVHDFile" while reads will be directed to the most
recently written version of the data, either in "MyVHDFile" or
"SomeRawFile" as is appropriate.  Other options exist as well; consult
the vhd-util application for the complete set of VHD tools.

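The copy-on-write behaviour described above can be illustrated with a
small in-memory sketch.  This is not how the VHD driver is implemented
and it does not use the VHD on-disk format; struct cow_image, cow_init,
cow_write and cow_read are hypothetical names invented for this
example.  It only shows the rule that writes land in the child while
reads fall through to the parent for sectors the child has never
written.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512
#define NUM_SECTORS 8

/* Hypothetical in-memory stand-in for a parent/child image pair. */
struct cow_image {
    const uint8_t *parent;                     /* read-only parent data ("SomeRawFile") */
    uint8_t child[NUM_SECTORS * SECTOR_SIZE];  /* child data ("MyVHDFile") */
    uint8_t written[NUM_SECTORS];              /* 1 if the child holds this sector */
};

/* Start with an empty child layered on top of the given parent. */
static void cow_init(struct cow_image *img, const uint8_t *parent)
{
    assert(img != NULL && parent != NULL);
    memset(img, 0, sizeof(*img));
    img->parent = parent;
}

/* Writes always land in the child; the parent is never modified. */
static int cow_write(struct cow_image *img, unsigned sector, const uint8_t *buf)
{
    if (sector >= NUM_SECTORS)
        return -1;
    memcpy(img->child + (size_t)sector * SECTOR_SIZE, buf, SECTOR_SIZE);
    img->written[sector] = 1;
    return 0;
}

/* Reads come from the child if it holds the sector, else from the parent. */
static int cow_read(const struct cow_image *img, unsigned sector, uint8_t *buf)
{
    const uint8_t *src;

    if (sector >= NUM_SECTORS)
        return -1;
    src = img->written[sector] ? img->child : img->parent;
    memcpy(buf, src + (size_t)sector * SECTOR_SIZE, SECTOR_SIZE);
    return 0;
}
```

A real sparse format tracks which blocks are present in on-disk
metadata rather than in a byte array, but the read/write rule is the
same.
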
VHD files can be mounted automatically in a guest similarly to the
above AIO example simply by specifying the vhd driver.

disk = ['tap:tapdisk:vhd:<VHD FILENAME>,sda1,w']


Snapshots:

Pausing a guest will also plug the corresponding IO queue for blktap2
devices and stop blktap2 drivers.  This can be used to implement a
safe live snapshot of qcow and vhd disks.  An example script "xmsnap"
is provided in the tools/blktap2/drivers directory.  This script will
perform a live snapshot of a qcow disk.  VHD files can use the
"vhd-util snapshot" tool discussed above.  If this snapshot command is
applied to a raw file mounted with tap:tapdisk:aio, include the -m
flag and the driver will be reloaded as VHD.  If applied to an already
mounted VHD file, omit the -m flag.


Mounting images in Dom0 using the blktap2 driver
===============================================
Tap (and blkback) disks are also mountable in Dom0 without requiring an
active VM to attach.

The syntax is -
  tapdisk2 -n <type>:<full path to file>

For example -
  tapdisk2 -n aio:/home/images/rawFile.img

When successful the location of the new device will be provided by
tapdisk2 to stdout and tapdisk2 will terminate.  From that point
forward control of the device is provided through sysfs in the
directory -

  /sys/class/blktap2/blktap#/

Where # is a blktap2 device number present in the path that tapdisk2
printed before terminating.  The sysfs interface is largely intuitive;
for example, to remove tap device 0 one would run -

  echo 1 > /sys/class/blktap2/blktap0/remove

Similarly, a pause control is available, which can be used to plug
the request queue of a live running guest.

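A management tool written in C can drive the same sysfs controls by
writing to the attribute file directly.  The following is a minimal
sketch using only standard C I/O; the helper names blktap2_attr_path
and write_attr are made up for this example, and the sysfs root is
passed in as a parameter rather than hard-coded.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Build "<root>/blktap<minor>/<attr>" into buf.  Returns 0, or
 * -ENAMETOOLONG if the result does not fit. */
static int blktap2_attr_path(char *buf, size_t len, const char *root,
                             int minor, const char *attr)
{
    int n = snprintf(buf, len, "%s/blktap%d/%s", root, minor, attr);

    if (n < 0 || (size_t)n >= len)
        return -ENAMETOOLONG;
    return 0;
}

/* Write a short string to an attribute file.  Returns 0 on success or
 * a negative errno-style value on failure. */
static int write_attr(const char *path, const char *value)
{
    FILE *f;
    size_t len;
    int err = 0;

    assert(path != NULL && value != NULL);
    f = fopen(path, "w");
    if (f == NULL)
        return errno ? -errno : -EIO;

    len = strlen(value);
    if (fwrite(value, 1, len, f) != len)
        err = -EIO;
    /* The write may only reach the kernel when the stream is closed,
     * so a failure here must be reported as well. */
    if (fclose(f) == EOF && err == 0)
        err = errno ? -errno : -EIO;
    return err;
}
```

Building the path with root "/sys/class/blktap2", minor 0 and attr
"remove", then calling write_attr(path, "1\n"), is roughly equivalent
to the echo command above.
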
Previous versions of blktap mounted devices in dom0 by using blkfront
in dom0 and the xm block-attach command.  This approach is still
available, though slightly more cumbersome.


Tapdisk Development
===============================================

People regularly ask how to develop their own tapdisk drivers, and
while it has not yet been well documented, the process is relatively
easy.  Here I will provide a brief overview.  The best reference, of
course, comes from the existing drivers.  Specifically,
blktap2/drivers/block-ram.c and blktap2/drivers/block-aio.c provide
the clearest examples of simple drivers.


Setup:

First you need to register your new driver with blktap. This is done
in disktypes.h.  There are five things that you must do.  To
demonstrate, I will create a disk called "mynewdisk"; you can name
yours freely.

1) Forward declare an instance of struct tap_disk.

e.g. -
  extern struct tap_disk tapdisk_mynewdisk;

2) Claim one of the unused disk type numbers.  Take care to observe the
MAX_DISK_TYPES macro, increasing the number if necessary.

e.g. -
  #define DISK_TYPE_MYNEWDISK         10

3) Create an instance of disk_info_t.  The bulk of this file contains
examples of these.

e.g. -
  static disk_info_t mynewdisk_disk = {
          DISK_TYPE_MYNEWDISK,
          "My New Disk (mynewdisk)",
          "mynewdisk",
          0,
  #ifdef TAPDISK
          &tapdisk_mynewdisk,
  #endif
  };

A few words about what these mean.  The first field must be the disk
type number you claimed in step (2).  The second field is a string
describing your disk, and may contain any relevant info.  The third
field is the name of your disk as will be used by the tapdisk2 utility
and xend (for example tapdisk2 -n mynewdisk:/path/to/disk.image, or in
your xm create config file).  The fourth is binary and determines
whether you will have one instance of your driver, or many.  Here, a 1
means that your driver is a singleton and will coordinate access to
any number of tap devices.  0 is more common, meaning that you will
have one driver for each device that is created.  The final field
should contain a reference to the struct tap_disk you declared in step
(1).

4) Add a reference to your disk info structure (from step (3)) to the
dtypes array.  Take care here - you need to place it in the position
corresponding to the device type number you claimed in step (2).  So
we would place &mynewdisk_disk in dtypes[10].  Look at the other
devices in this array and pad with "&null_disk," as necessary.

5) Modify the xend python scripts.  You need to add your disk name to
the list of disks that xend recognizes.

edit:
  tools/python/xen/xend/server/BlktapController.py

And add your disk to the "blktap_disk_types" array near the top of
the file.  Use the same name you specified in the third field of step
(3).  The order of this list is not important.


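Steps (1) through (4) fit together as in the following standalone
sketch.  The types here are simplified stand-ins so that the example
compiles on its own: the field names in disk_info_t, the value of
MAX_DISK_TYPES, the contents of null_disk and the two helper functions
are assumptions made for illustration, so check disktypes.h for the
real definitions.  The point is the invariant from step (4): every real
entry in dtypes must sit at the index equal to its disk type number.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_DISK_TYPES      12   /* assumed value for this sketch */
#define DISK_TYPE_MYNEWDISK 10   /* step (2) */

/* Simplified stand-ins; the real definitions live in the blktap2 headers. */
struct tap_disk { const char *disk_type; };

typedef struct disk_info {
    int              idnum;           /* disk type number from step (2) */
    const char      *name;            /* free-form description */
    const char      *handle;          /* name used by tapdisk2 and xend */
    int              single_handler;  /* 1 = singleton, 0 = one per device */
    struct tap_disk *drv;             /* driver declared in step (1) */
} disk_info_t;

static struct tap_disk tapdisk_mynewdisk = { "tapdisk_mynewdisk" };

/* Placeholder used to pad unused slots (contents assumed). */
static disk_info_t null_disk = { -1, "null disk", "null", 0, NULL };

/* Step (3). */
static disk_info_t mynewdisk_disk = {
    DISK_TYPE_MYNEWDISK, "My New Disk (mynewdisk)", "mynewdisk", 0,
    &tapdisk_mynewdisk,
};

/* Step (4): slots 0-9 are padded so that mynewdisk lands in dtypes[10]. */
static disk_info_t *dtypes[MAX_DISK_TYPES] = {
    &null_disk, &null_disk, &null_disk, &null_disk, &null_disk,  /* 0-4 */
    &null_disk, &null_disk, &null_disk, &null_disk, &null_disk,  /* 5-9 */
    &mynewdisk_disk,                                             /* 10  */
    &null_disk,                                                  /* 11  */
};

/* Returns 1 if every non-placeholder entry sits at its own type number. */
static int dtypes_consistent(void)
{
    int i;

    for (i = 0; i < MAX_DISK_TYPES; i++) {
        if (dtypes[i] == NULL)
            return 0;
        if (dtypes[i] != &null_disk && dtypes[i]->idnum != i)
            return 0;
    }
    return 1;
}

/* Look a disk type up by the name given on the command line. */
static const disk_info_t *find_disk_by_handle(const char *handle)
{
    int i;

    assert(handle != NULL);
    for (i = 0; i < MAX_DISK_TYPES; i++)
        if (dtypes[i] != &null_disk && strcmp(dtypes[i]->handle, handle) == 0)
            return dtypes[i];
    return NULL;
}
```

A check like dtypes_consistent() is a cheap way to catch an entry that
was placed in the wrong slot.
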
Now your driver is ready to be written.  Create a block-mynewdisk.c in
tools/blktap2/drivers and add it to the Makefile.


Development:

Copying block-aio.c and block-ram.c would be a good place to start.
Read those files as you go through this; I will be assisting by
commenting on a few useful functions and structures.

struct tap_disk:

Remember the forward declaration in step (1) of the setup phase above?
Now is the time to make that structure a reality.  This structure
contains a list of function pointers for all the routines that will be
asked of your driver.  Currently the required functions are open,
close, read, write, get_parent_id, validate_parent, and debug.

e.g. -
  struct tap_disk tapdisk_mynewdisk = {
          .disk_type          = "tapdisk_mynewdisk",
          .flags              = 0,
          .private_data_size  = sizeof(struct tdmynewdisk_state),
          .td_open            = tdmynewdisk_open,
                 ....

The private_data_size field is used to provide a structure to store
the state of your device.  It is very likely that you will want
something here, but you are free to design whatever structure you
want.  Blktap will allocate this space for you; you just need to tell
it how much space you want.


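To make the role of private_data_size concrete, here is a standalone
sketch of the pattern.  It does not use the real blktap2 types:
struct tap_disk_sketch, framework_open and framework_close are
hypothetical stand-ins for what the tapdisk core does on your behalf,
and the callbacks take simplified arguments.  The driver only declares
how much state it needs; the caller allocates it zeroed and hands it
to every callback.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-device state for "mynewdisk". */
struct tdmynewdisk_state {
    int  fd;
    char path[256];
};

/* Simplified stand-in for struct tap_disk (only the fields used here). */
struct tap_disk_sketch {
    const char *disk_type;
    size_t      private_data_size;
    int       (*td_open)(void *data, const char *name);
    int       (*td_close)(void *data);
};

static int tdmynewdisk_open(void *data, const char *name)
{
    struct tdmynewdisk_state *state = data;

    if (strlen(name) >= sizeof(state->path))
        return -1;
    strcpy(state->path, name);
    state->fd = -1;  /* a real driver would open the image here */
    return 0;
}

static int tdmynewdisk_close(void *data)
{
    memset(data, 0, sizeof(struct tdmynewdisk_state));
    return 0;
}

static const struct tap_disk_sketch tapdisk_mynewdisk = {
    .disk_type         = "tapdisk_mynewdisk",
    .private_data_size = sizeof(struct tdmynewdisk_state),
    .td_open           = tdmynewdisk_open,
    .td_close          = tdmynewdisk_close,
};

/* Stand-in for the core: allocate the private area, then call open. */
static void *framework_open(const struct tap_disk_sketch *ops, const char *name)
{
    void *data;

    assert(ops != NULL && name != NULL);
    data = calloc(1, ops->private_data_size);
    if (data == NULL)
        return NULL;
    if (ops->td_open(data, name) != 0) {
        free(data);
        return NULL;
    }
    return data;
}

/* Stand-in for the core: call close, then release the private area. */
static void framework_close(const struct tap_disk_sketch *ops, void *data)
{
    ops->td_close(data);
    free(data);
}
```
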
tdmynewdisk_open:

This is the open routine.  The first argument is a structure
representing your driver.  Two fields in this structure are
interesting.

driver->data will contain a block of memory of the size you requested
in the .private_data_size field of your struct tap_disk (above).

driver->info contains a structure that details information about your
disk.  You need to fill this out.  By convention this is done with a
_get_image_info() function.  Assign a size (the total number of
sectors) and a sector_size (the size of each sector in bytes), and set
driver->info->info to 0.

The second parameter contains the name that was specified in the
creation of your device, either through xend, or on the command line
with tapdisk2.  Usually this specifies a file that you will open in
this routine.  The final parameter, flags, contains one of a number of
flags specified in tapdisk.h that may change the way you treat the
disk.


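A _get_image_info() helper for a file-backed disk might look like the
following sketch.  struct disk_info_sketch is a stand-in that carries
only the three fields named above (the real structure and its exact
field names are in tapdisk.h), the 512-byte sector size is an
assumption, and mynewdisk_probe is a hypothetical helper that finds
the image length with plain stdio.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define DEFAULT_SECTOR_SIZE 512  /* assumed sector size for this sketch */

/* Stand-in for the disk info structure described above. */
struct disk_info_sketch {
    uint64_t size;         /* total number of sectors */
    long     sector_size;  /* size of each sector in bytes */
    uint32_t info;         /* set to 0, as described above */
};

/* Fill in the disk info from an image length in bytes.  A trailing
 * partial sector is not exposed to the guest. */
static void mynewdisk_get_image_info(struct disk_info_sketch *info,
                                     uint64_t image_bytes)
{
    assert(info != NULL);
    info->sector_size = DEFAULT_SECTOR_SIZE;
    info->size        = image_bytes / DEFAULT_SECTOR_SIZE;
    info->info        = 0;
}

/* Find the length of the file called `name` and fill in `info`.
 * Returns 0 on success, -1 if the file cannot be examined. */
static int mynewdisk_probe(const char *name, struct disk_info_sketch *info)
{
    FILE *f = fopen(name, "rb");
    long len;

    if (f == NULL)
        return -1;
    if (fseek(f, 0, SEEK_END) != 0 || (len = ftell(f)) < 0) {
        fclose(f);
        return -1;
    }
    fclose(f);
    mynewdisk_get_image_info(info, (uint64_t)len);
    return 0;
}
```

A real driver would more likely keep the file descriptor open and use
fstat() or a block-device ioctl; stdio is used here only to keep the
sketch portable.
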
_queue_read/write:

These are your read and write operations.  What you do here will
depend on your disk, but you should do exactly one of -

1) Call td_complete_request with either an error or a success code.

2) Call td_forward_request, which will forward the request to the next
driver in the stack.

3) Queue the request for asynchronous processing with
td_prep_read/write.  In doing so, you will also register a callback
for request completion.  When the request completes you must do one of
options (1) or (2) above.  Finally, call td_queue_tiocb to submit the
request to a wait queue.

The above functions are defined in tapdisk-interface.c.  If you don't
use them as specified you will run into problems as your driver will
fail to inform blktap of the state of requests that have been
submitted.  Blktap keeps track of all requests and does not like
losing track.


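The "exactly one" rule is easiest to see in a synchronous RAM-backed
driver, which always takes option (1).  The sketch below is standalone
and does not call the real tapdisk interface: struct request_sketch
and complete_request are stand-ins invented here for the request type
and for td_complete_request, and the completion counter exists only so
the rule can be checked.  Every path through the read and write
routines completes the request once, including the error path.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE  512
#define DISK_SECTORS 16

/* Stand-in for a block request handed to the driver. */
struct request_sketch {
    uint64_t sector;       /* first sector of the request */
    int      secs;         /* number of sectors */
    uint8_t *buf;          /* data buffer, secs * SECTOR_SIZE bytes */
    int      completions;  /* times completed; must end up as exactly 1 */
    int      status;       /* 0 or a negative errno value */
};

static uint8_t ramdisk[DISK_SECTORS * SECTOR_SIZE];

/* Stand-in for td_complete_request. */
static void complete_request(struct request_sketch *req, int err)
{
    req->completions++;
    req->status = err;
}

/* Returns 1 if the request lies entirely within the disk. */
static int in_range(const struct request_sketch *req)
{
    return req->secs > 0 &&
           req->sector < DISK_SECTORS &&
           (uint64_t)req->secs <= DISK_SECTORS - req->sector;
}

static void mynewdisk_queue_read(struct request_sketch *req)
{
    if (!in_range(req)) {
        complete_request(req, -EINVAL);  /* error path still completes */
        return;
    }
    memcpy(req->buf, ramdisk + req->sector * SECTOR_SIZE,
           (size_t)req->secs * SECTOR_SIZE);
    complete_request(req, 0);
}

static void mynewdisk_queue_write(struct request_sketch *req)
{
    if (!in_range(req)) {
        complete_request(req, -EINVAL);
        return;
    }
    memcpy(ramdisk + req->sector * SECTOR_SIZE, req->buf,
           (size_t)req->secs * SECTOR_SIZE);
    complete_request(req, 0);
}
```

An asynchronous driver (option (3)) would move the complete_request
call into its completion callback instead, but the accounting is the
same.
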
_close, _get_parent_id, _validate_parent:

These last few tend to be very routine.  _close is called when the
device is closed, and also when it is paused (in this case, open will
also be called later).  The other functions are used in stacking
drivers.  Most often drivers will return TD_NO_PARENT and -EINVAL,
respectively.
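
For a driver that never stacks on top of another image, the two parent
routines reduce to stubs like the ones sketched here.  The argument
types are simplified stand-ins and the numeric value given to
TD_NO_PARENT is a placeholder chosen for this example; a real driver
should take both the prototypes and the constant from tapdisk.h.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Placeholder value for this sketch; use the definition in tapdisk.h. */
#define TD_NO_PARENT 1

/* Stand-in for the structure that would identify a parent image. */
struct parent_id_sketch {
    char *name;
    int   drivertype;
};

/* This disk has no parent image, so there is nothing to report. */
static int mynewdisk_get_parent_id(void *driver, struct parent_id_sketch *id)
{
    (void)driver;
    (void)id;
    return TD_NO_PARENT;
}

/* With no parent, no candidate parent can ever be valid. */
static int mynewdisk_validate_parent(void *driver, void *parent_driver, int flags)
{
    (void)driver;
    (void)parent_driver;
    (void)flags;
    return -EINVAL;
}
```
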