.. ddumbfs documentation master file, created by
   sphinx-quickstart on Fri Sep 16 20:15:25 2011.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

ddumbfs's homepage
==================

**ddumbfs** is a fast inline **deduplication** filesystem for Linux, based on
`FUSE <http://fuse.sourceforge.net/>`_ and released under the GNU GPL.
Deduplication is a technique that avoids storing duplicate data on disk,
thereby increasing the disk's virtual capacity.

Documentation
-------------

* :doc:`Getting started <getstarted>`
* :doc:`How it works <howitworks>`
* :doc:`Performances and tuning <performance>`
* :doc:`Recovery <recovery>`
* :doc:`Regression suite <regression_suite>`
* :doc:`Resizing <resize>`
* :ref:`man_pages`
* :doc:`FAQ <faq>`
* *How to backup a* **ddumbfs** *volume* (to come)

Features
--------

* Works on Linux.
* ddumbfs is a userspace filesystem using `FUSE <http://fuse.sourceforge.net/>`_.
* Increases virtual disk capacity using inline deduplication.
* Deduplicates at block level, with block sizes from 4K up to 128K.
* Fast, thanks to its very simple index design.
* Not limited in size: can deduplicate petabytes of data.
* VMware support: can be exported through NFS and used to run or back up VMs.
* The index can be locked into memory for maximum performance, or stored on
  an SSD drive.

Requirements
------------

**ddumbfs** requires the **fuse** and **mhash** libraries. *fuse* >= 2.7 is
required. If you plan to export your filesystem through **NFS**, then
fuse >= 2.8 and Linux >= 2.6.29 are recommended.

To compile **ddumbfs** you need, as usual, **make** and **gcc**, the headers
for the **fuse** and **mhash** libraries, and **pkg-config**. Here are the
corresponding packages for **redhat**- and **debian**-based distributions:

============== ============================= ===================================
Distribution   To run                        To install from source
============== ============================= ===================================
**redhat**     fuse fuse-libs mhash          fuse-devel mhash-devel pkgconfig
**debian**     libfuse2 fuse-utils libmhash2 libfuse-dev libmhash-dev pkg-config
============== ============================= ===================================

RedHat 6 no longer includes *mhash* in its default repositories; you can find
it in the **EPEL** repository. You can add the *EPEL* repository using::

    # rpm -Uvh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-7.noarch.rpm

Install from source
-------------------

To install from source, extract the sources, *cd* into the *ddumbfs-X.X*
directory and run the commands::

    ./configure
    make
    make install

Limits
------

The data structure itself is not limited. The width of the block addresses is
set at filesystem creation time: addresses can be 3, 4, or 5 bytes wide, or
even more; there is no limit. Current architectures use 64-bit (8-byte)
integers, and these addresses are stored and handled in such 64-bit integers.
However, the result of multiplying two 32-bit values can require up to 64
bits, so it is not always safe to use values above 32 bits, even on a 64-bit
architecture. Fortunately, most of the operations are additions or bit shifts
that should support addresses wider than 32 bits, and for the few critical
places where 64 bits would not be enough for intermediate results, it should
be possible to tweak the code to make it work. So, with a few modifications,
it should be easy to handle addresses wider than 32 bits. It may even work
already, because I took care to order operations to minimize overflows, but I
have not tested it.
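For example, here is a minimal C sketch (illustrative only, not taken from
the ddumbfs sources) of the overflow issue described above: when a block
address is converted into a byte offset, an operand must be widened to 64
bits *before* the multiplication, otherwise the intermediate result is
truncated to 32 bits::

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t addr = 5000000;    /* a block address, fits in 32 bits */
        uint32_t block_size = 4096; /* 4K blocks                        */

        /* Wrong: the multiplication is performed in 32 bits and wraps
         * around (5000000 * 4096 is far above 2^32 - 1) before the
         * result is widened to 64 bits. */
        uint64_t bad = addr * block_size;

        /* Right: widen one operand first, so the multiplication itself
         * is carried out in 64 bits. */
        uint64_t good = (uint64_t)addr * block_size;

        printf("bad=%llu good=%llu\n",
               (unsigned long long)bad, (unsigned long long)good);
        return 0;
    }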
Using 32-bit addresses:

============ ===================== ==============
Block size   Max *BlockFile* size  Max index size
============ ===================== ==============
4K           16 TB                 146 GB
128K         512 TB                146 GB
============ ===================== ==============

Even with 4K blocks and a deduplication factor of 10, it would be possible to
store up to 160 TB (10 x the 16 TB *BlockFile* limit) in one single
filesystem. Filling such a volume at a constant throughput of 512 MB/s would
take a bit less than 4 days (160 TB / 512 MB/s is roughly 320,000 seconds).
Using 40-bit addresses (5 bytes) would allow going beyond the petabyte, but
where would you store the 37 TB *index*?

A filesystem suitable for backups
----------------------------------

Even if **ddumbfs** is fast and reliable and could be used as any
multi-purpose filesystem, it has been designed with *backups* in mind.

A **backup** shares 99% of its data with the previous and the next backups.
*Differential* techniques exist to reduce the amount of data to save, but
they complicate backup handling and restore procedures. The idea of
**ddumbfs** is to compute the difference at the filesystem level, letting
backup tools benefit from the space savings without constraints.

Moreover, accesses to a backup are less demanding than accesses to online
data. Most backup tools generate big archive files in one continuous write.
Read accesses to these files are infrequent, and random reads/writes are
really uncommon. Backups usually run during the night, and trading a longer
backup time for a longer retention period is worthwhile, as long as the
backup finishes before morning.

Given these requirements, we can imagine increasing disk capacity at the cost
of some flexibility. The proof: we have used, and still use, magnetic tapes
for backup despite their lack of flexibility! Modern filesystems look like
overkill for just storing backups and archives! This is where a dedicated
filesystem like **ddumbfs** can bring more to backup applications. With its
:ref:`writer_pool` and asynchronous features, **ddumbfs** can even surpass
other filesystems in throughput, though not in IOPS.

How safe is it
--------------

Everything has been done to make **ddumbfs** as reliable as possible:

* *ddumbfs* passes all the tests of the :doc:`regression suite <regression_suite>`.
* *ddumbfs* detects when it was uncleanly unmounted and runs a quick
  *filesystem check* at startup.
* *ddumbfs* is released with :doc:`fsckddumbfs <fsckddumbfs>`, which can
  deeply check and repair the filesystem, and report all detected, corrected
  and unrecoverable errors.

How **ddumbfs** is different
----------------------------

Other de-duplication filesystem alternatives_ exist. **ddumbfs** is:

- minimalist: it does de-duplication only, and does it well.
- optimized to work without any memory cache, but works faster with caching
  or SSD drives.
- **fast**, very fast, even faster!
- shipped with tools to access an offline *volume*.
- shipped with tools to repair, rebuild and test the filesystem integrity
  (see the sketch after this list).
- constant in access time: access time doesn't increase with disk usage.
- free of any garbage collector that would *freeze* the filesystem at runtime.
- reliable, with a lot of tests in the
  :doc:`regression suite <regression_suite>` to prove it.
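To give a concrete feel for how the command-line tools documented below fit
together, here is a sketch of a typical session. The paths are examples, and
the exact invocations (including the ``parent=`` mount option, which is an
assumption here) should be checked against the :ref:`man_pages`::

    # initialize the parent directory that holds the index and the block file
    # (see the mkddumbfs man page for block-size and index-size options)
    mkddumbfs /data/ddumbfs-parent

    # mount it through FUSE
    ddumbfs -o parent=/data/ddumbfs-parent /mnt/ddumbfs

    # ... use /mnt/ddumbfs like any other filesystem ...

    # unmount, then check (and repair if needed) the offline filesystem
    umount /mnt/ddumbfs
    fsckddumbfs /data/ddumbfs-parent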
.. _man_pages:

MAN pages
---------

* :doc:`ddumbfs <ddumbfs>`: mount a ddumbfs filesystem
* :doc:`mkddumbfs <mkddumbfs>`: initialize a ddumbfs filesystem
* :doc:`fsckddumbfs <fsckddumbfs>`: check and repair a ddumbfs filesystem
* :doc:`cpddumbfs <cpddumbfs>`: upload and download files from an offline
  ddumbfs filesystem
* :doc:`migrateddumbfs <migrateddumbfs>`: resize an offline ddumbfs filesystem
* :doc:`alterddumbfs <alterddumbfs>`: create anomalies in an offline ddumbfs
  filesystem, **for testing only**
* **testddumbfs**: run some integrity and performance tests
* **queryddumbfs**: query the filesystem about hashes, blocks and files

Download
--------

Sources and binaries can be downloaded from the project download page.
**RPM** packages are available for CentOS 6 and Fedora 17. **DEB** packages
are available for Ubuntu 12.04 and 12.10.

Contact
-------

Ask your questions on the project forum.

.. _alternatives:

De-duplication alternatives
---------------------------

* `lessfs <http://www.lessfs.com/>`_: includes a lot of features:
  compression, replication.
* `SDFS <http://www.opendedup.org/>`_: works on Windows too; very good
  marketing.

License
-------

Copyright (C) 2011 Alain Spineux

You can redistribute **ddumbfs** and/or modify it under the terms of either
(1) the GNU General Public License as published by the Free Software
Foundation; or (2) a commercial license, obtained by contacting the author.

You should have received a copy of the **GNU General Public License** along
with this program. If not, see http://www.gnu.org/licenses/.

**ddumbfs** is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
more details.

Contents:

.. toctree::
   :maxdepth: 2

   getstarted
   performance
   recovery
   regression_suite
   howitworks
   resize
   faq

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`