Blktap2 Userspace Tools + Library
=================================

Dutch Meyer
4th June 2009

Andrew Warfield and Julian Chesterfield
16th June 2006


The blktap2 userspace toolkit provides a user-level disk I/O
interface. The blktap2 mechanism involves a kernel driver that acts
similarly to the existing Xen/Linux blkback driver, and a set of
associated user-level libraries. Using these tools, blktap2 allows
virtual block devices presented to VMs to be implemented in userspace
and to be backed by raw partitions, files, network, etc.

The key benefit of blktap2 is that it makes it easy and fast to write
arbitrary block backends, and that these user-level backends actually
perform very well. Specifically:

- Metadata disk formats such as Copy-on-Write, encrypted disks, sparse
  formats and other compression features can be easily implemented.

- Accessing file-based images from userspace avoids problems related
  to flushing dirty pages which are present in the Linux loopback
  driver. (Specifically, doing a large number of writes to an
  NFS-backed image doesn't result in the OOM killer going berserk.)

- Per-disk handler processes enable easier userspace policing of block
  resources, and process-granularity QoS techniques (disk scheduling
  and related tools) may be trivially applied to block devices.

- It's very easy to take advantage of userspace facilities such as
  networking libraries, compression utilities, peer-to-peer
  file-sharing systems and so on to build more complex block backends.

- Crashes are contained -- incremental development/debugging is very
  fast.

How it works (in one paragraph):

Working in conjunction with the kernel blktap2 driver, all disk I/O
requests from VMs are passed to the userspace daemon (using a shared
memory interface) through a character device. Each active disk is
mapped to an individual device node, allowing per-disk processes to
implement individual block devices where desired. The userspace
drivers are implemented using asynchronous (Linux libaio),
O_DIRECT-based calls to preserve the unbuffered, batched and
asynchronous request dispatch achieved with the existing blkback
code. We provide a simple, asynchronous virtual disk interface that
makes it quite easy to add new disk implementations.

As of June 2009 the currently supported disk formats are:

- Raw Images (both on partitions and in image files)
- Fast shareable RAM disk between VMs (requires some form of
  cluster-based filesystem support e.g. OCFS2 in the guest kernel)
- VHD, including snapshots and sparse images
- Qcow, including snapshots and sparse images


Build and Installation Instructions
===================================

Make sure to configure the blktap2 backend driver in your dom0 kernel.
It will inter-operate with the existing backend and frontend drivers.
It will also coexist with the original blktap driver. However, some
formats (currently aio and qcow) will default to their blktap2
versions when specified in a VM configuration file.

To build the tools separately, "make && make install" in
tools/blktap2.


Using the Tools
===============

Preparing an image for boot:

The userspace disk agent is configured to start automatically via
xend.

Customize the VM config file to use the 'tap:tapdisk' handler,
followed by the driver type, e.g. for a raw image such as a file or
partition:

    disk = ['tap:tapdisk:aio:<FILENAME>,sda1,w']

Alternatively, the vhd-util tool (installed with make install, or in
/blktap2/vhd) can be used to build sparse copy-on-write vhd images.

For example, to build a sparse image -
    vhd-util create -n MyVHDFile -s 1024

This creates a sparse 1GB file named "MyVHDFile" that can be mounted
and populated with data.

One can also base the image on a raw file -
    vhd-util snapshot -n MyVHDFile -p SomeRawFile -m

This creates a sparse VHD file named "MyVHDFile" using "SomeRawFile"
as a parent image. Copy-on-write semantics ensure that writes will be
stored in "MyVHDFile" while reads will be directed to the most
recently written version of the data, either in "MyVHDFile" or
"SomeRawFile" as is appropriate. Other options exist as well; consult
the vhd-util application for the complete set of VHD tools.

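The copy-on-write behaviour described above can be pictured with a
small in-memory model. This is only a sketch under my own assumptions:
the sketch_image structure, its per-sector "present" flag and the tiny
8-sector disk are invented for illustration and say nothing about the
real on-disk VHD layout or the vhd-util code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_SECTOR_SIZE 512
#define SKETCH_SECTORS     8

/* Hypothetical in-memory image; real VHD files keep this state on disk. */
struct sketch_image {
    uint8_t data[SKETCH_SECTORS][SKETCH_SECTOR_SIZE];
    bool    present[SKETCH_SECTORS];  /* true once a sector was written here */
    struct sketch_image *parent;      /* NULL for a base (raw) image */
};

/* Writes always land in the image they are issued against (the child). */
static void sketch_write(struct sketch_image *img, unsigned sector,
                         const uint8_t *buf)
{
    memcpy(img->data[sector], buf, SKETCH_SECTOR_SIZE);
    img->present[sector] = true;
}

/* Reads walk from child to parent until some image holds the sector.
 * The base image answers for every sector nobody above it has written. */
static void sketch_read(const struct sketch_image *img, unsigned sector,
                        uint8_t *buf)
{
    for (; img != NULL; img = img->parent) {
        if (img->present[sector] || img->parent == NULL) {
            memcpy(buf, img->data[sector], SKETCH_SECTOR_SIZE);
            return;
        }
    }
}
```

With a child whose parent pointer refers to a base image, a read of a
sector written only in the base returns the base data, while a sector
rewritten in the child returns the child data and leaves the base
untouched.
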
VHD files can be mounted automatically in a guest similarly to the
above AIO example simply by specifying the vhd driver.

    disk = ['tap:tapdisk:vhd:<VHD FILENAME>,sda1,w']


Snapshots:

Pausing a guest will also plug the corresponding IO queue for blktap2
devices and stop blktap2 drivers. This can be used to implement a
safe live snapshot of qcow and vhd disks. An example script "xmsnap"
is provided in the tools/blktap2/drivers directory. This script will
perform a live snapshot of a qcow disk. VHD files can use the
"vhd-util snapshot" tool discussed above. If this snapshot command is
applied to a raw file mounted with tap:tapdisk:aio, include the -m
flag and the driver will be reloaded as VHD. If applied to an already
mounted VHD file, omit the -m flag.


Mounting images in Dom0 using the blktap2 driver
================================================

Tap (and blkback) disks are also mountable in Dom0 without requiring an
active VM to attach.

The syntax is -
    tapdisk2 -n <type>:<full path to file>

For example -
    tapdisk2 -n aio:/home/images/rawFile.img

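To make the "<type>:<full path to file>" form concrete, here is a
minimal sketch of splitting such an argument at the first colon. The
helper name sketch_split_spec is hypothetical and this is not the
parser that tapdisk2 itself uses; it only illustrates the shape of the
argument.

```c
#include <stddef.h>
#include <string.h>

/*
 * Split "<type>:<path>" at the first colon. Copies the type into
 * type_buf and returns a pointer to the path inside spec, or NULL if
 * the string has no colon, an empty type, an empty path, or a type
 * that does not fit in type_buf.
 */
static const char *sketch_split_spec(const char *spec, char *type_buf,
                                     size_t type_len)
{
    const char *colon = strchr(spec, ':');
    size_t n;

    if (colon == NULL || colon == spec || colon[1] == '\0')
        return NULL;

    n = (size_t)(colon - spec);
    if (n >= type_len)
        return NULL;

    memcpy(type_buf, spec, n);
    type_buf[n] = '\0';
    return colon + 1;
}
```

Splitting at the first colon (rather than the last) keeps any later
colons as part of the path.
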
When successful the location of the new device will be provided by
tapdisk2 to stdout and tapdisk2 will terminate. From that point
forward control of the device is provided through sysfs in the
directory -

    /sys/class/blktap2/blktap#/

Where # is a blktap2 device number present in the path that tapdisk2
printed before terminating. The sysfs interface is largely intuitive;
for example, to remove tap device 0 one would -

    echo 1 > /sys/class/blktap2/blktap0/remove

Similarly, a pause control is available, which can be used to plug
the request queue of a live running guest.

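The same sysfs control can be driven from a program instead of the
shell. The following is a rough sketch, assuming only the directory
layout and the "remove" file shown above; the function names are my
own, and the control file name is passed in by the caller rather than
guessed.

```c
#include <stdio.h>

/*
 * Build "/sys/class/blktap2/blktap<minor>/<control>" in buf.
 * Returns 0 on success, -1 if buf is too small.
 */
static int sketch_control_path(char *buf, size_t len, int minor,
                               const char *control)
{
    int n = snprintf(buf, len, "/sys/class/blktap2/blktap%d/%s",
                     minor, control);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Write "1" to a control file, like `echo 1 > path`.
 * Returns 0 on success, -1 if the file cannot be opened or written.
 */
static int sketch_trigger(const char *path)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (f == NULL)
        return -1;
    ok = (fputs("1\n", f) >= 0);
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

Calling sketch_control_path(buf, sizeof buf, 0, "remove") and then
sketch_trigger(buf) corresponds to the echo command above; it needs the
same privileges the shell version does.
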
Previous versions of blktap mounted devices in dom0 by using blkfront
in dom0 and the xm block-attach command. This approach is still
available, though slightly more cumbersome.


Tapdisk Development
===================

People regularly ask how to develop their own tapdisk drivers, and
while it has not yet been well documented, the process is relatively
easy. Here I will provide a brief overview. The best reference, of
course, comes from the existing drivers. Specifically,
blktap2/drivers/block-ram.c and blktap2/drivers/block-aio.c provide
the clearest examples of simple drivers.


Setup:

First you need to register your new driver with blktap. This is done
in disktypes.h. There are five things that you must do. To
demonstrate, I will create a disk called "mynewdisk"; you can name
yours freely.

1) Forward declare an instance of struct tap_disk.

e.g. -
    extern struct tap_disk tapdisk_mynewdisk;

2) Claim one of the unused disk type numbers. Take care to observe the
MAX_DISK_TYPES macro, increasing the number if necessary.

e.g. -
    #define DISK_TYPE_MYNEWDISK 10

3) Create an instance of disk_info_t. The bulk of this file contains
examples of these.

e.g. -
    static disk_info_t mynewdisk_disk = {
        DISK_TYPE_MYNEWDISK,
        "My New Disk (mynewdisk)",
        "mynewdisk",
        0,
    #ifdef TAPDISK
        &tapdisk_mynewdisk,
    #endif
    };

A few words about what these mean. The first field must be the disk
type number you claimed in step (2). The second field is a string
describing your disk, and may contain any relevant info. The third
field is the name of your disk as will be used by the tapdisk2 utility
and xend (for example tapdisk2 -n mynewdisk:/path/to/disk.image, or in
your xm create config file). The fourth is a boolean and determines
whether you will have one instance of your driver, or many. Here, a 1
means that your driver is a singleton and will coordinate access to
any number of tap devices. 0 is more common, meaning that you will
have one driver for each device that is created. The final field
should contain a reference to the struct tap_disk you declared in step
(1).

4) Add a reference to your disk info structure (from step (3)) to the
dtypes array. Take care here - you need to place it in the position
corresponding to the device type number you claimed in step (2). So
we would place &mynewdisk_disk in dtypes[10]. Look at the other
devices in this array and pad with "&null_disk," as necessary.

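The reason position matters is that the array is indexed by the disk
type number. The sketch below shows the idea with stand-in types: the
sketch_disk_info_t structure, its field names, the null entry and the
lookup helper are simplified assumptions of mine, not copies of what
disktypes.h contains.

```c
#include <stddef.h>

/* Simplified stand-in for disk_info_t (the driver pointer is omitted). */
typedef struct sketch_disk_info {
    int         idnum;           /* disk type number */
    const char *name;            /* human-readable description */
    const char *handle;          /* name used on the command line */
    int         single_handler;  /* 1 = singleton driver, 0 = one per device */
} sketch_disk_info_t;

#define SKETCH_DISK_TYPE_MYNEWDISK 10
#define SKETCH_MAX_DISK_TYPES      20

static sketch_disk_info_t sketch_null_disk = { -1, "null disk", "null", 0 };

static sketch_disk_info_t sketch_mynewdisk_disk = {
    SKETCH_DISK_TYPE_MYNEWDISK, "My New Disk (mynewdisk)", "mynewdisk", 0
};

/* Slots 0..9 are padded so that the new entry lands at index 10. */
static sketch_disk_info_t *sketch_dtypes[] = {
    &sketch_null_disk, &sketch_null_disk, &sketch_null_disk,
    &sketch_null_disk, &sketch_null_disk, &sketch_null_disk,
    &sketch_null_disk, &sketch_null_disk, &sketch_null_disk,
    &sketch_null_disk,
    &sketch_mynewdisk_disk,
};

/* The array must never outgrow the declared maximum. */
_Static_assert(sizeof(sketch_dtypes) / sizeof(sketch_dtypes[0])
               <= SKETCH_MAX_DISK_TYPES, "raise SKETCH_MAX_DISK_TYPES");

/* Return the entry for a type number, or NULL if out of range or padding. */
static const sketch_disk_info_t *sketch_lookup(int type)
{
    size_t count = sizeof(sketch_dtypes) / sizeof(sketch_dtypes[0]);

    if (type < 0 || (size_t)type >= count)
        return NULL;
    if (sketch_dtypes[type] == &sketch_null_disk)
        return NULL;
    return sketch_dtypes[type];
}
```

If the entry were placed one slot early or late, a lookup by type
number would silently return the wrong driver, which is why the
padding has to be counted carefully.
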
5) Modify the xend python scripts. You need to add your disk name to
the list of disks that xend recognizes.

edit:
    tools/python/xen/xend/server/BlktapController.py

And add your disk to the "blktap_disk_types" array near the top of
the file. Use the same name you specified in the third field of step
(3). The order of this list is not important.


Now your driver is ready to be written. Create a block-mynewdisk.c in
tools/blktap2/drivers and add it to the Makefile.


Development:

Copying block-aio.c or block-ram.c would be a good place to start.
Read those files as you go through this; I will assist by commenting
on a few useful functions and structures.

struct tap_disk:

Remember the forward declaration in step (1) of the setup phase above?
Now is the time to make that structure a reality. This structure
contains a list of function pointers for all the routines that will be
asked of your driver. Currently the required functions are open,
close, read, write, get_parent_id, validate_parent, and debug.

e.g. -
    struct tap_disk tapdisk_mynewdisk = {
        .disk_type = "tapdisk_mynewdisk",
        .flags = 0,
        .private_data_size = sizeof(struct tdmynewdisk_state),
        .td_open = tdmynewdisk_open,
        ....

The private_data_size field is used to provide a structure to store
the state of your device. It is very likely that you will want
something here, but you are free to design whatever structure you
want. Blktap will allocate this space for you; you just need to tell
it how much space you want.


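To illustrate the private_data_size idea in isolation, here is a sketch
of a caller allocating per-device state on the driver's behalf. Every
type here is a simplified stand-in that I made up (sketch_driver,
sketch_tap_disk, the two-argument open, and the contents of
tdmynewdisk_state); the real definitions live in the blktap2 headers
and carry more fields and arguments.

```c
#include <stdlib.h>

/* Hypothetical per-device state for the example driver. */
struct tdmynewdisk_state {
    int           fd;      /* backing file, -1 while nothing is open */
    unsigned long reads;   /* simple statistics counters */
    unsigned long writes;
};

/* Stand-ins for the framework's driver handle and operations table. */
struct sketch_driver {
    void *data;            /* points at private_data_size bytes */
};

struct sketch_tap_disk {
    const char *disk_type;
    size_t      private_data_size;
    int       (*td_open)(struct sketch_driver *driver, const char *name);
};

/* The driver only casts driver->data; it never allocates it. */
static int sketch_mynewdisk_open(struct sketch_driver *driver,
                                 const char *name)
{
    struct tdmynewdisk_state *state = driver->data;

    (void)name;            /* no file is opened in this sketch */
    state->fd = -1;
    return 0;
}

static const struct sketch_tap_disk sketch_mynewdisk = {
    .disk_type         = "tapdisk_mynewdisk",
    .private_data_size = sizeof(struct tdmynewdisk_state),
    .td_open           = sketch_mynewdisk_open,
};

/* What the framework side might do: allocate zeroed state, then open. */
static int sketch_attach(const struct sketch_tap_disk *ops,
                         struct sketch_driver *driver, const char *name)
{
    driver->data = calloc(1, ops->private_data_size);
    if (driver->data == NULL)
        return -1;
    if (ops->td_open(driver, name) != 0) {
        free(driver->data);
        driver->data = NULL;
        return -1;
    }
    return 0;
}
```

The point of the arrangement is that the driver only declares how much
space it needs and then treats driver->data as its own structure.
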
tdmynewdisk_open:

This is the open routine. The first argument is a structure
representing your driver. Two fields in this structure are
interesting.

driver->data will contain a block of memory of the size you requested
in the .private_data_size field of your struct tap_disk (above).

driver->info contains a structure that details information about your
disk. You need to fill this out. By convention this is done with a
_get_image_info() function. Assign a size (the total number of
sectors) and a sector_size (the size of each sector in bytes), and set
driver->info->info to 0.

The second parameter contains the name that was specified in the
creation of your device, either through xend, or on the command line
with tapdisk2. Usually this specifies a file that you will open in
this routine. The final parameter, flags, contains one of a number of
flags specified in tapdisk.h that may change the way you treat the
disk.


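As a rough sketch of what a _get_image_info() style helper has to
compute, the following fills in the three values mentioned above from
the length of an image in bytes. The sketch_disk_info structure, the
field widths, the 512-byte sector size and the choice to round a
partial trailing sector down are my assumptions, not the real
definitions.

```c
#include <stdint.h>

#define SKETCH_DEFAULT_SECTOR_SIZE 512

/* Stand-in for the disk info structure reached through the driver. */
struct sketch_disk_info {
    uint64_t size;         /* total number of sectors */
    uint64_t sector_size;  /* bytes per sector */
    uint32_t info;         /* set to 0, as described above */
};

/*
 * Fill in the disk geometry from the image length in bytes.
 * A partial trailing sector is ignored. Returns 0 on success, or -1
 * if the image is too small to hold even one sector.
 */
static int sketch_get_image_info(struct sketch_disk_info *info,
                                 uint64_t image_bytes)
{
    uint64_t sectors = image_bytes / SKETCH_DEFAULT_SECTOR_SIZE;

    if (sectors == 0)
        return -1;

    info->size        = sectors;
    info->sector_size = SKETCH_DEFAULT_SECTOR_SIZE;
    info->info        = 0;
    return 0;
}
```

In a real open routine the byte count would come from the file or
device named by the second parameter.
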
_queue_read/write:

These are your read and write operations. What you do here will
depend on your disk, but you should do exactly one of -

1) Call td_complete_request with either an error or a success code.

2) Call td_forward_request, which will forward the request to the next
driver in the stack.

3) Queue the request for asynchronous processing with
td_prep_read/write. In doing so, you will also register a callback
for request completion. When the request completes you must do one of
options (1) or (2) above. Finally, call td_queue_tiocb to submit the
request to a wait queue.

The above functions are defined in tapdisk-interface.c. If you don't
use them as specified you will run into problems as your driver will
fail to inform blktap of the state of requests that have been
submitted. Blktap keeps track of all requests and does not like
losing track.


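The "exactly one of" rule is easiest to see with option (1) in a
synchronous RAM-backed handler. The sketch below uses stand-ins
throughout: sketch_request and sketch_complete_request are invented
here and do not match the real request type or the signature of
td_complete_request. What it does show is that every path through the
handler reports completion once, and only once.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_SECTOR_SIZE  512
#define SKETCH_DISK_SECTORS 16

/* Simplified request; the real request type carries more fields. */
struct sketch_request {
    uint64_t sector;     /* first sector of the transfer */
    int      secs;       /* number of sectors */
    uint8_t *buf;        /* caller's data buffer */
    int      completed;  /* how many times completion was reported */
    int      result;     /* 0 on success, negative errno on failure */
};

static uint8_t sketch_disk[SKETCH_DISK_SECTORS * SKETCH_SECTOR_SIZE];

/* Stand-in for option (1): report the outcome of a request. */
static void sketch_complete_request(struct sketch_request *req, int err)
{
    req->completed++;
    req->result = err;
}

/* Returns nonzero when the request lies entirely inside the disk. */
static int sketch_in_range(const struct sketch_request *req)
{
    return req->secs > 0 && req->sector < SKETCH_DISK_SECTORS &&
           (uint64_t)req->secs <= SKETCH_DISK_SECTORS - req->sector;
}

static void sketch_queue_read(struct sketch_request *req)
{
    if (!sketch_in_range(req)) {
        sketch_complete_request(req, -EINVAL);  /* error path: once */
        return;
    }
    memcpy(req->buf, sketch_disk + req->sector * SKETCH_SECTOR_SIZE,
           (size_t)req->secs * SKETCH_SECTOR_SIZE);
    sketch_complete_request(req, 0);            /* success path: once */
}

static void sketch_queue_write(struct sketch_request *req)
{
    if (!sketch_in_range(req)) {
        sketch_complete_request(req, -EINVAL);
        return;
    }
    memcpy(sketch_disk + req->sector * SKETCH_SECTOR_SIZE, req->buf,
           (size_t)req->secs * SKETCH_SECTOR_SIZE);
    sketch_complete_request(req, 0);
}
```

A handler that returned early without completing, or completed twice,
is exactly the kind of bug the paragraph above warns about.
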
_close, _get_parent_id, _validate_parent:

These last few tend to be very routine. _close is called when the
device is closed, and also when it is paused (in this case, open will
also be called later). The other functions are used in stacking
drivers. Most often drivers will return TD_NO_PARENT and -EINVAL,
respectively.
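
For a driver that never stacks on a parent image, these routines can
be little more than stubs. The sketch below uses stand-in names and
simplified argument lists of my own; in particular SKETCH_TD_NO_PARENT
is a placeholder value, and a real driver should use the TD_NO_PARENT
constant from the blktap2 headers.

```c
#include <errno.h>

/* Placeholder only; real drivers use TD_NO_PARENT from the headers. */
#define SKETCH_TD_NO_PARENT 1

/* Stand-in for the driver handle passed to each routine. */
struct sketch_driver {
    int is_open;
};

/* Called on close and on pause; open may be called again afterwards. */
static int sketch_close(struct sketch_driver *driver)
{
    driver->is_open = 0;
    return 0;
}

/* This driver has no parent image to report. */
static int sketch_get_parent_id(struct sketch_driver *driver)
{
    (void)driver;
    return SKETCH_TD_NO_PARENT;
}

/* With no parent supported, any proposed parent is invalid. */
static int sketch_validate_parent(struct sketch_driver *driver,
                                  struct sketch_driver *parent)
{
    (void)driver;
    (void)parent;
    return -EINVAL;
}
```

Because _close also runs on pause, it should leave the state in a shape
that a later open can rebuild from scratch.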