mirror of
				https://github.com/apache/cloudstack.git
				synced 2025-10-26 08:42:29 +01:00 
			
		
		
		
	
		
			
				
	
	
		
			322 lines
		
	
	
		
			12 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
			
		
		
	
	
			322 lines
		
	
	
		
			12 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
| Blktap2 Userspace Tools + Library
 | |
| ================================
 | |
| 
 | |
| Dutch Meyer
 | |
| 4th June 2009
 | |
| 
 | |
| Andrew Warfield and Julian Chesterfield
 | |
| 16th June 2006
 | |
| 
 | |
| 
 | |
| The blktap2 userspace toolkit provides a user-level disk I/O
 | |
| interface. The blktap2 mechanism involves a kernel driver that acts
 | |
| similarly to the existing Xen/Linux blkback driver, and a set of
 | |
| associated user-level libraries.  Using these tools, blktap2 allows
 | |
| virtual block devices presented to VMs to be implemented in userspace
 | |
| and to be backed by raw partitions, files, network, etc.
 | |
| 
 | |
| The key benefit of blktap2 is that it makes it easy and fast to write
 | |
| arbitrary block backends, and that these user-level backends actually
 | |
| perform very well.  Specifically:
 | |
| 
 | |
| - Metadata disk formats such as Copy-on-Write, encrypted disks, sparse
 | |
|   formats and other compression features can be easily implemented.
 | |
| 
 | |
| - Accessing file-based images from userspace avoids problems related
 | |
|   to flushing dirty pages which are present in the Linux loopback
 | |
|   driver.  (Specifically, doing a large number of writes to an
 | |
|   NFS-backed image don't result in the OOM killer going berserk.)
 | |
| 
 | |
| - Per-disk handler processes enable easier userspace policing of block
 | |
|   resources, and process-granularity QoS techniques (disk scheduling
 | |
|   and related tools) may be trivially applied to block devices.
 | |
| 
 | |
| - It's very easy to take advantage of userspace facilities such as
 | |
|   networking libraries, compression utilities, peer-to-peer
 | |
|   file-sharing systems and so on to build more complex block backends.
 | |
| 
 | |
| - Crashes are contained -- incremental development/debugging is very
 | |
|   fast.
 | |
| 
 | |
| How it works (in one paragraph):
 | |
| 
 | |
| Working in conjunction with the kernel blktap2 driver, all disk I/O
 | |
| requests from VMs are passed to the userspace deamon (using a shared
 | |
| memory interface) through a character device. Each active disk is
 | |
| mapped to an individual device node, allowing per-disk processes to
 | |
| implement individual block devices where desired.  The userspace
 | |
| drivers are implemented using asynchronous (Linux libaio),
 | |
| O_DIRECT-based calls to preserve the unbuffered, batched and
 | |
| asynchronous request dispatch achieved with the existing blkback
 | |
| code.  We provide a simple, asynchronous virtual disk interface that
 | |
| makes it quite easy to add new disk implementations.
 | |
| 
 | |
| As of June 2009 the current supported disk formats are:
 | |
| 
 | |
|  - Raw Images (both on partitions and in image files)
 | |
|  - Fast sharable RAM disk between VMs (requires some form of 
 | |
|    cluster-based filesystem support e.g. OCFS2 in the guest kernel)
 | |
|  - VHD, including snapshots and sparse images
 | |
|  - Qcow, including snapshots and sparse images
 | |
| 
 | |
| 
 | |
| Build and Installation Instructions
 | |
| ===================================
 | |
| 
 | |
| Make to configure the blktap2 backend driver in your dom0 kernel.  It
 | |
| will inter-operate with the existing backend and frontend drivers.  It
 | |
| will also cohabitate with the original blktap driver.  However, some
 | |
| formats (currently aio and qcow) will default to their blktap2
 | |
| versions when specified in a vm configuration file.
 | |
| 
 | |
| To build the tools separately, "make && make install" in
 | |
| tools/blktap2.
 | |
| 
 | |
| 
 | |
| Using the Tools
 | |
| ===============
 | |
| 
 | |
| Preparing an image for boot:
 | |
| 
 | |
| The userspace disk agent is configured to start automatically via xend
 | |
| 
 | |
| Customize the VM config file to use the 'tap:tapdisk' handler,
 | |
| followed by the driver type. e.g. for a raw image such as a file or
 | |
| partition:
 | |
| 
 | |
| disk = ['tap:tapdisk:aio:<FILENAME>,sda1,w']
 | |
| 
 | |
| Alternatively, the vhd-util tool (installed with make install, or in
 | |
| /blktap2/vhd) can be used to build sparse copy-on-write vhd images.
 | |
| 
 | |
| For example, to build a sparse image -
 | |
|   vhd-util create -n MyVHDFile -s 1024
 | |
| 
 | |
| This creates a sparse 1GB file named "MyVHDFile" that can be mounted
 | |
| and populated with data.
 | |
| 
 | |
| One can also base the image on a raw file -
 | |
|   vhd-util snapshot -n MyVHDFile -p SomeRawFile -m
 | |
| 
 | |
| This creates a sparse VHD file named "MyVHDFile" using "SomeRawFile"
 | |
| as a parent image.  Copy-on-write semantics ensure that writes will be
 | |
| stored in "MyVHDFile" while reads will be directed to the most
 | |
| recently written version of the data, either in "MyVHDFile" or
 | |
| "SomeRawFile" as is appropriate.  Other options exist as well, consult
 | |
| the vhd-util application for the complete set of VHD tools.
 | |
| 
 | |
| VHD files can be mounted automatically in a guest similarly to the
 | |
| above AIO example simply by specifying the vhd driver.
 | |
| 
 | |
| disk = ['tap:tapdisk:vhd:<VHD FILENAME>,sda1,w']
 | |
| 
 | |
| 
 | |
| Snapshots:
 | |
| 
 | |
| Pausing a guest will also plug the corresponding IO queue for blktap2
 | |
| devices and stop blktap2 drivers.  This can be used to implement a
 | |
| safe live snapshot of qcow and vhd disks.  An example script "xmsnap"
 | |
| is shown in the tools/blktap2/drivers directory.  This script will
 | |
| perform a live snapshot of a qcow disk.  VHD files can use the
 | |
| "vhd-util snapshot" tool discussed above.  If this snapshot command is
 | |
| applied to a raw file mounted with tap:tapdisk:AIO, include the -m
 | |
| flag and the driver will be reloaded as VHD.  If applied to an already
 | |
| mounted VHD file, omit the -m flag.
 | |
| 
 | |
| 
 | |
| Mounting images in Dom0 using the blktap2 driver
 | |
| ===============================================
 | |
| Tap (and blkback) disks are also mountable in Dom0 without requiring an
 | |
| active VM to attach. 
 | |
| 
 | |
| The syntax is -
 | |
|   tapdisk2 -n <type>:<full path to file>
 | |
| 
 | |
| For example -
 | |
|   tapdisk2  -n aio:/home/images/rawFile.img
 | |
| 
 | |
| When successful the location of the new device will be provided by
 | |
| tapdisk2 to stdout and tapdisk2 will terminate.  From that point
 | |
| forward control of the device is provided through sysfs in the
 | |
| directory-
 | |
| 
 | |
|   /sys/class/blktap2/blktap#/
 | |
| 
 | |
| Where # is a blktap2 device number present in the path that tapdisk2
 | |
| printed before terminating.  The sysfs interface is largely intuitive,
 | |
| for example, to remove tap device 0 one would-
 | |
|   
 | |
|   echo 1 > /sys/class/blktap2/blktap0/remove
 | |
| 
 | |
| Similarly, a pause control is available, which is can be used to plug
 | |
| the request queue of a live running guest.
 | |
| 
 | |
| Previous versions of blktap mounted devices in dom0 by using blkfront
 | |
| in dom0 and the xm block-attach command.  This approach is still
 | |
| available, though slightly more cumbersome.
 | |
| 
 | |
| 
 | |
| Tapdisk Development
 | |
| ===============================================
 | |
| 
 | |
| People regularly ask how to develop their own tapdisk drivers, and
 | |
| while it has not yet been well documented, the process is relatively
 | |
| easy.  Here I will provide a brief overview.  The best reference, of
 | |
| course, comes from the existing drivers.  Specifically,
 | |
| blktap2/drivers/block-ram.c and blktap2/drivers/block-aio.c provide
 | |
| the clearest examples of simple drivers.
 | |
|  
 | |
| 
 | |
| Setup:
 | |
| 
 | |
| First you need to register your new driver with blktap. This is done
 | |
| in disktypes.h.  There are five things that you must do.  To
 | |
| demonstrate, I will create a disk called "mynewdisk", you can name
 | |
| yours freely.
 | |
| 
 | |
| 1) Forward declare an instance of struct tap_disk.
 | |
| 
 | |
| e.g. -  
 | |
|   extern struct tap_disk tapdisk_mynewdisk;
 | |
| 
 | |
| 2) Claim one of the unused disk type numbers, take care to observe the
 | |
| MAX_DISK_TYPES macro, increasing the number if necessary.
 | |
| 
 | |
| e.g. -
 | |
|   #define DISK_TYPE_MYNEWDISK         10
 | |
| 
 | |
| 3) Create an instance of disk_info_t.  The bulk of this file contains examples of these.
 | |
| 
 | |
| e.g. -
 | |
|   static disk_info_t mynewdisk_disk = {
 | |
|           DISK_TYPE_MYNEWDISK,
 | |
|           "My New Disk (mynewdisk)",
 | |
|           "mynewdisk",
 | |
|           0,
 | |
|   #ifdef TAPDISK
 | |
|           &tapdisk_mynewdisk,
 | |
|   #endif
 | |
|   };
 | |
| 
 | |
| A few words about what these mean.  The first field must be the disk
 | |
| type number you claimed in step (2).  The second field is a string
 | |
| describing your disk, and may contain any relevant info.  The third
 | |
| field is the name of your disk as will be used by the tapdisk2 utility
 | |
| and xend (for example tapdisk2 -n mynewdisk:/path/to/disk.image, or in
 | |
| your xm create config file).  The forth is binary and determines
 | |
| whether you will have one instance of your driver, or many.  Here, a 1
 | |
| means that your driver is a singleton and will coordinate access to
 | |
| any number of tap devices.  0 is more common, meaning that you will
 | |
| have one driver for each device that is created.  The final field
 | |
| should contain a reference to the struct tap_disk you created in step
 | |
| (1).
 | |
| 
 | |
| 4) Add a reference to your disk info structure (from step (3)) to the
 | |
| dtypes array.  Take care here - you need to place it in the position
 | |
| corresponding to the device type number you claimed in step (2).  So
 | |
| we would place &mynewdisk_disk in dtypes[10].  Look at the other
 | |
| devices in this array and pad with "&null_disk," as necessary.
 | |
| 
 | |
| 5) Modify the xend python scripts.  You need to add your disk name to
 | |
| the list of disks that xend recognizes.
 | |
| 
 | |
| edit:
 | |
|   tools/python/xen/xend/server/BlktapController.py
 | |
| 
 | |
| And add your disk to the "blktap_disk_types" array near the top of
 | |
| your file.  Use the same name you specified in the third field of step
 | |
| (3).  The order of this list is not important.
 | |
| 
 | |
| 
 | |
| Now your driver is ready to be written.  Create a block-mynewdisk.c in
 | |
| tools/blktap2/drivers and add it to the Makefile.
 | |
| 
 | |
| 
 | |
| Development:
 | |
| 
 | |
| Copying block-aio.c and block-ram.c would be a good place to start.
 | |
| Read those files as you go through this, I will be assisting by
 | |
| commenting on a few useful functions and structures.
 | |
| 
 | |
| struct tap_disk:
 | |
| 
 | |
| Remember the forward declaration in step (1) of the setup phase above?
 | |
| Now is the time to make that structure a reality.  This structure
 | |
| contains a list of function pointers for all the routines that will be
 | |
| asked of your driver.  Currently the required functions are open,
 | |
| close, read, write, get_parent_id, validate_parent, and debug.
 | |
| 
 | |
| e.g. -
 | |
|   struct tap_disk tapdisk_mynewdisk = {
 | |
|           .disk_type          = "tapdisk_mynewdisk",
 | |
|           .flags              = 0,
 | |
|           .private_data_size  = sizeof(struct tdmynewdisk_state),
 | |
|           .td_open            = tdmynewdisk_open,
 | |
|                  ....
 | |
| 
 | |
| The private_data_size field is used to provide a structure to store
 | |
| the state of your device.  It is very likely that you will want
 | |
| something here, but you are free to design whatever structure you
 | |
| want.  Blktap will allocate this space for you, you just need to tell
 | |
| it how much space you want.
 | |
| 
 | |
| 
 | |
| tdmynewdisk_open:
 | |
| 
 | |
| This is the open routine.  The first argument is a structure
 | |
| representing your driver.  Two fields in this array are
 | |
| interesting. 
 | |
| 
 | |
| driver->data will contain a block of memory of the size your requested
 | |
| in in the .private_data_size field of your struct tap_disk (above).
 | |
| 
 | |
| driver->info contains a structure that details information about your
 | |
| disk.  You need to fill this out.  By convention this is done with a
 | |
| _get_image_info() function.  Assign a size (the total number of
 | |
| sectors), sector_size (the size of each sector in bytes, and set
 | |
| driver->info->info to 0.
 | |
| 
 | |
| The second parameter contains the name that was specified in the
 | |
| creation of your device, either through xend, or on the command line
 | |
| with tapdisk2.  Usually this specifies a file that you will open in
 | |
| this routine.  The final parameter, flags, contains one of a number of
 | |
| flags specified in tapdisk.h that may change the way you treat the
 | |
| disk.
 | |
| 
 | |
| 
 | |
| _queue_read/write:
 | |
| 
 | |
| These are your read and write operations.  What you do here will
 | |
| depend on your disk, but you should do exactly one of- 
 | |
| 
 | |
| 1) call td_complete_request with either error or success code.
 | |
| 
 | |
| 2) Call td_forward_request, which will forward the request to the next
 | |
| driver in the stack.
 | |
| 
 | |
| 3) Queue the request for asynchronous processing with
 | |
| td_prep_read/write.  In doing so, you will also register a callback
 | |
| for request completion.  When the request completes you must do one of
 | |
| options (1) or (2) above.  Finally, call td_queue_tiocb to submit the
 | |
| request to a wait queue.
 | |
| 
 | |
| The above functions are defined in tapdisk-interface.c.  If you don't
 | |
| use them as specified you will run into problems as your driver will
 | |
| fail to inform blktap of the state of requests that have been
 | |
| submitted.  Blktap keeps track of all requests and does not like losing track.
 | |
| 
 | |
| 
 | |
| _close, _get_parent_id, _validate_parent:
 | |
| 
 | |
| These last few tend to be very routine.  _close is called when the
 | |
| device is closed, and also when it is paused (in this case, open will
 | |
| also be called later).  The other functions are used in stacking
 | |
| drivers.  Most often drivers will return TD_NO_PARENT and -EINVAL,
 | |
| respectively.
 | |
| 
 | |
| 
 | |
| 
 | |
| 
 | |
| 
 | |
| 
 |