Why RPM is a terrible system to base a distribution on, the problems faced by package management systems for Linux, and what can be done about it.
This is a rant. A long, bitter rant about the inequities of life and an unjust, unfeeling Universe. It is a rant about how poorly thought-out software can kill a really great Linux distribution and drive away its proponents. It is a tirade about how a series of not-good, but not bad, software design decisions can add up to a really bad, unmaintainable system.
I'm Sean Russell, and I'm a recovering Mandrake user. I feel good about myself, and thanks to my support group, am well on the road to recovery from the mental scarring induced by RPM.
I've started articles on Why RPM Sucks a dozen times. Each time, I've failed to finish the article, in part because I didn't have an answer for RPM, and when I'm a good boy, I avoid complaining about things when I don't have a solution. Eventually, I realized that there are some inherent problems with modern operating systems that cause problems for all package management systems. Linux suffers from these problems as much as any other operating system, and I haven't seen any OS that solves them. Perhaps there is no solution.
However, I have seen, and worked with, a package management system that goes a long way to solving some of the worst RPM atrocities, and I'll describe this below.
I've finally written this paper because, over the course of seven months, my Mandrake-based laptop system has slowly degraded to the point where it is broken, and will be easier to fix by re-installing the operating system than by trying to fix the RPMs. For those of you who use RPM and think you like it, stop and think for a moment, and ask yourself if this sounds familiar:
You want to install some software package. You look for a package, and find several dozen packages, of varying versions, for different Linux distributions. From experience, you know that none of the RPMs built for other distributions will work on your system. In fact, if it isn't built specifically for the version of the distribution you have installed, it ain't gonna work.
You manage to find the exact RPM you need. You try to install it, and discover that you have to either install some other software, or upgrade some part of your system.
You go find the other pieces that you need, and download and try to install them. You discover that you need yet some more packages upgraded or installed to satisfy dependencies on those RPMs.
Repeat the previous step ad nauseam. Applications such as urpmi can alleviate this, but they can't fix the following problem, which is:
Eventually, if you're lucky, you get the software installed. More commonly, you've broken some other piece of your system -- "A" depends on "B" version 2, but "C" depends on "B" version 1, and there isn't a newer version of "C" that is happy with "B" version 2. You have to choose between "A" and "C". Even more commonly, you find yourself installing software that you never wanted or needed, but is a required install because of a bad dependency tree. "A", a purely text-based application, depends on "B", which depends on "C", which depends on "D", which depends on X11. This happens so often, it can only be seen as an architectural design flaw in RPM.
The problems start with the operating system. To greatly simplify the situation, modern software systems are built up of components. When a programmer writes software, they use existing libraries of work, much like an author references other, preexisting documents. Do this, and you avoid a lot of work: someone else has already written code that draws windows and buttons, code that sorts lists, code that accesses files on a file system, and so on. Now, you can go one of two ways with this. You can statically link everything, which means that your program has everything that it needs built into it. The problem with this approach is that your program will be very large, and will consume a lot of memory. The second approach is to dynamically link most of the libraries. With this approach, when the program runs it asks the operating system for the libraries it needs. Often, those libraries are already in use by other programs, so your application can reuse these resources, saving memory. These programs take up less space on the hard drive, too. The major problem with this approach is that the libraries can change. Their behavior can change, but also their APIs -- the things your program hooks into the library with -- can change. Either of these can break your application. If you expect some function X() to return "Sean" and it starts returning "Steve" all of a sudden, you're up the proverbial creek.
So most operating systems have versions on their libraries. That way, when you ask for library version 1, you know exactly what sort of behavior and API you're getting. With this mechanism, you can define dependencies. You can say, "my application needs library X with version Y to run". This way, you can do some basic bounds checking, to help make sure your application will run.
All this is fine -- it provides enough infrastructure to build a system that is reasonably robust and efficient. There hasn't been an OS that doesn't have dynamically linked libraries in a long time, and they all tend to work pretty much the same way. The problem is that most operating systems are set up so that they can't take advantage of this. Linux is no exception, and much of that is the fault of the Linux Standard Base, or LSB.
The LSB is intended to be a way of standardizing how Linux software distribution file systems are laid out. The goal is to provide a common environment so that application authors can build software that will install on multiple target distributions. Unfortunately, the LSB is working from a bad foundation, the standard Unix file system layout. The LSB itself is a fine project; given what they're working with, they're doing the right thing. However, they're just adding support to a bad design -- it may work, but it isn't right.
The Redhat Package Manager, or RPM, aggravates the issue in three ways:
RPM specs are complicated and difficult to build correctly. As a consequence, almost nobody does build them correctly. The switch good Linux admins most commonly reach for is "--badreloc".
RPM has poor dependency managing tools. RPM allows developers to define dependencies on other software, but this feature is extremely weak. In particular, RPM:
has only one type of dependency: hard. Much software has optional support for some features. For example, PostgreSQL can be compiled with some GUI tools. On a headless server, these GUI tools may not be desired. With RPM, your options are binary: you can either build the RPM with a dependency on the GUI components, or not. There is no option at install time to resolve the dependencies and "do the right thing" with the install. As a result, packages such as PostgreSQL are distributed as a bunch of separate RPMs, each providing a different feature. This quickly becomes unwieldy.
has only rudimentary dependency resolution. If package A depends on package B and package B depends on C, when you try to install A, it only tells you about the dependency on B.
has a coarse-grained dependency version mechanism. An RPM spec builder can't, for instance, say that package A requires some version of package B, where the version is greater than 2.0 but less than 3.0, and not version 2.3. In fact, you can't specify that any version less than 2.0 is acceptable.
As a result of this weak dependency management, package builders either choose the simplest dependency possible, or build their own complex, verbose dependency rules into RPM. However, no matter how carefully your RPM dependencies are specified, the whole system can fail if just one package you depend on has a bad dependency specification. In short, RPM dependencies are fragile.
RPM has no built-in dependency resolution mechanism. That is, RPM doesn't know where to go to get packages to resolve dependencies. Now, there are tools that sit on top of RPM to do this; urpmi, rupdate, and so on. However, in practice, this lack of awareness severely cripples any third-party resolution tool. This is because:
RPM packages are monolithic. That is to say, the package specifications are part of the entire package. Why is this a problem? Because to be able to resolve dependency trees, you have to have access to all of the packages in the tree. Here is the main failure of RPM: you can't get intelligent queries out of it. To answer a query about any dependency tree, you need, in the worst case, access to every RPM package in the entire world.
RPM makes it hard to install multiple versions of the same package on a system. This isn't the sole fault of RPM; the LSB -- the legacy left to Linux by Unix -- contributes much to this. Since two versions of the same package tend to install the same files in the same place, conflicts are common.
RPM is fundamentally a global software installation mechanism. It can't be used (not without a lot of pain) by non-admin users. This makes it useless for a distribution mechanism for shared systems.
Finally, RPM is stupid. If a dependency isn't in the database, it doesn't exist. For example, if I have library X version Y installed, RPM will insist that it isn't installed merely because it isn't in the database. I'm sorry, but this is just brain-dead. It would be trivial to check ldconfig and see if the library exists.
The end result is that RPM-based installations go one of two routes, which both end up at the same place. Either they get upgraded regularly, slowly degrading as various pieces of the system fall out of RPM sync, until an entire re-install of a newer version is needed; or, they don't get any extra software installed on them until, at some point, the system is so old that it needs to be re-installed with a new version. In both cases, woe to the systems admin who has to go back and re-install all of the third-party software that was previously on the system.
Aside from redesigning the Linux kernel, there are a number of software solutions. The following outlines a solution using available software.
Self-contained software, à la NeXTSTEP and Mac OS X. NeXTSTEP, and OS X via inheritance, tends to package software in .apps. These are directories that contain a mini Unix file system: each contains a lib, bin, man, and so on. The operating system is aware of these directories, and treats them specially. .apps can be "run", whereby the OS sets up environment variables to include the .app's mini file structure. The advantage is twofold:
.apps, being self-contained, are portable. They can be moved around on the file system, easily, they can be deleted easily, and it is easy to see which files belong to which application.
being self-contained, it is easy to have multiple versions of the same app on a system. There are no file conflicts.
Another compelling reason for this mechanism is for non-admin users. If applications are self-contained, it doesn't matter where they live. They can live in users' home directories, for instance. Most people building software for Linux, unfortunately, are working on single-user systems, and ignore the fact that not everybody has root access to their primary work machine.
Grafting software, such as graft, can extend the .app metaphor to system libraries, and can help the migration from a traditional Unix file system to a .app-based one. Linux would not work well with shared libraries packaged as .apps, because the Linux dynamic loader (ld.so) doesn't understand .apps; it expects libraries to follow a specific naming convention and live in known locations. Grafting can resolve some of these problems.
A better package management system would solve most of the RPM-specific problems. Most package management systems are superior to RPM; take your pick. However, any system should support at least the features of Portage. Portage, the package management system of Gentoo, is probably the best system available today. This is because:
Portage build files are not monolithic. A Portage build file is just the specification. The actual software is pulled from elsewhere. This means that you can form meaningful dependency queries without having access to all of the software.
Portage knows how to resolve a dependency tree. It can download and install any software that is needed to resolve dependencies. Furthermore, it does a pretty good job in maintaining old software dependencies, because:
Portage allows multiple versions of the same package to be installed on the same system. This support is limited, practically, by the dynamic loader issues mentioned above, but could be improved using the other parts of this solution.
Portage is fundamentally source-based. However, it does have support for binary distributions. This support would need to be strengthened to make Portage general purpose enough for casual users. The primary advantage of building everything from source is that the software is faster, since it is optimized for the system on which it was built. This source-based approach isn't that unusual. As any RPM-based Linux admin knows, you eventually have to resort to source RPMs, because you can never find the right binary for an RPM system that's been installed for more than a few months.
Portage has a very robust dependency resolution mechanism. Dependencies can be easily described with a flexible dependency language, allowing for version ranges, excludes, and includes.
Portage build files are easy to write. As a result, they are more often written correctly.
Portage has both hard and soft dependencies, meaning that optional features can be used if the dependencies are available. This means that, if you specify --without-x11 in a global configuration file, packages that you install where X11 is optional will install without installing X11. ImageMagick is a good example of this.
While Portage is as stupid as RPM in that software that isn't in the package database is invisible to Portage, an admin can tell Portage that the other software is indeed installed. This is different from RPM's "--nodeps", which simply tells RPM to ignore dependencies. Portage's mechanism is more granular, and sticky; i.e., once you tell Portage some other software is installed, it stays in the installation database.
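For a flavor of what these build-file dependencies look like, here is a hypothetical ebuild DEPEND block (the package names are invented) combining a version range with a USE-conditional dependency -- if the X USE flag is disabled, the X11 dependency is simply never pulled in:

```shell
DEPEND=">=dev-libs/libfoo-2.0
	<dev-libs/libfoo-3.0
	X? ( x11-libs/libX11 )"
```

Contrast this with RPM, where expressing the version range alone is awkward and the conditional is impossible.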
Portage has limitations. It, too, is a system-wide mechanism, making it unsuitable for non-admin users. I have a feeling that Portage could easily be patched to allow individual users to install software in their home directories.
Portage goes a long way to making a system maintainable. So much so, in fact, that it may be a long time before you notice where it fails.
The biggest culprit in the brittleness of the most popular Linux distributions is by far RPM. Debian, Gentoo, Sourcerer, and other distributions are much more maintainable. The source-based distributions, Gentoo and its ilk, are difficult to install -- they don't have Mandrake's flashy installation software. Debian... well, the only reason I can cite for Debian being less popular than the RPM distributions is the petty bickering and infighting of the Debian maintainers. Well, that, and the fact that Debian packages usually lag several months behind even Redhat, the slowest of the RPM distributions. By the time you get something on Debian, everybody else has already upgraded to the next version. In a nutshell, Debian suffers from a bad case of bureaucracy, proof that an excellent system can be dragged down by politics.
A combination of three mechanisms -- self-contained software, grafting, and better package management software -- would create an operating system where non-admin users would have more control of their software access, would improve and ease systems administration, and would create more robust, maintainable distributions.