Configuration Management With Autoconf, Pkgsrc, and Cfengine
============================================================

March 18, 2004
Hal Snyder

0. Introduction.

The purpose of this presentation is to

    A. Describe the configuration problem.
    B. Propose some solutions.

1. An example.

Some details of this example are made up, but the issues presented are
typical of what we face with any new subsystem on our VoIP platform,
such as call detail record (CDR) collection.

Suppose Jack is a programmer coding a SIP application MYAPP. When he
finishes, all we will need to do is copy the program to the production
server and start it up ...

Jack works with:

    a work area: ~jack/MYAPP/...
    on a server: sipsrv02
    with an operating system: x86 SunOS 5.8 SP 02/2003
    compiler: gcc-3.0.1 and libraries /usr/local/lib/libgcc*
    freeware third party support libraries: freetds, libwww
    proprietary third party speech recognition libraries
    SIP stack:
        /net/sipdev/home/build/sip/releases/20030211-03/SunOS/5.8/i386/

When the new application is finished, i.e. programming is done on
server sipsrv02, the result is to be somehow installed on the platform.

Jack has compiled the program in his home directory.
If you look at what the OS sees when he tests it, the dependencies will
look something like this:

    - shared library files used when myapp runs -

    libvccxml.so => /export/home/jack/MYAPP/sandbox/libvccxml.so
    libvjsdbc.so => /export/home/jack/MYAPP/sandbox/libvjsdbc.so
    libvvml.so => /export/home/jack/MYAPP/sandbox/libvvml.so
    libmyapp.so => /export/home/jack/MYAPP/sandbox/libmyapp.so
    libmozjs.so => /usr/local/mozilla/lib/libmozjs.so
    libnspr4.so => /usr/local/mozilla/lib/libnspr4.so
    libplc4.so => /usr/local/mozilla/lib/libplc4.so
    libplds4.so => /usr/local/mozilla/lib/libplds4.so
    libwwwapp.so.0 => /usr/local/lib/libwwwapp.so.0
    libwwwcache.so.0 => /usr/local/lib/libwwwcache.so.0
    libwwwhtml.so.0 => /usr/local/lib/libwwwhtml.so.0
    libwwwutils.so.0 => /usr/local/lib/libwwwutils.so.0
    libwwwcore.so.0 => /usr/local/lib/libwwwcore.so.0
    libwwwinit.so.0 => /usr/local/lib/libwwwinit.so.0
    libwwwmime.so.0 => /usr/local/lib/libwwwmime.so.0
    libwwwhttp.so.0 => /usr/local/lib/libwwwhttp.so.0
    libwwwfile.so.0 => /usr/local/lib/libwwwfile.so.0
    libwwwstream.so.0 => /usr/local/lib/libwwwstream.so.0
    libwwwftp.so.0 => /usr/local/lib/libwwwftp.so.0
    libwwwnews.so.0 => /usr/local/lib/libwwwnews.so.0
    libwwwdir.so.0 => /usr/local/lib/libwwwdir.so.0
    libwwwtelnet.so.0 => /usr/local/lib/libwwwtelnet.so.0
    libwwwtrans.so.0 => /usr/local/lib/libwwwtrans.so.0
    libwwwgopher.so.0 => /usr/local/lib/libwwwgopher.so.0
    libmd5.so.1 => /usr/lib/libmd5.so.1
    libnsl.so.1 => /usr/lib/libnsl.so.1
    libsocket.so.1 => /usr/lib/libsocket.so.1
    libpthread.so.1 => /usr/lib/libpthread.so.1
    librt.so.1 => /usr/lib/librt.so.1
    libsybdb.so.3 => /usr/local/freetds-0.61-1/lib/libsybdb.so.3
    libstdc++.so.3 => /usr/local/lib/libstdc++.so.3
    libm.so.1 => /usr/lib/libm.so.1
    libc.so.1 => /usr/lib/libc.so.1
    libstdc++.so.2.10.0 => /opt/sfw/lib/libstdc++.so.2.10.0
    libdl.so.1 => /usr/lib/libdl.so.1
    libgcc_s.so.1 => /usr/local/lib/libgcc_s.so.1
    libthread.so.1 => /usr/lib/libthread.so.1
    libmp.so.2 => /usr/lib/libmp.so.2
    libaio.so.1 => /usr/lib/libaio.so.1

On the development server Jack was using, the program was actually
pulling in libraries from 3 different versions of gcc at the same time.

When the program runs, it answers calls to a test number which is
routed through a test gateway. It registers as a test app with a test
SIP proxy. It accesses the test database, authenticating with test user
and password entries. The app appends to log files which are written
into Jack's home directory. It runs as user "jack" and accesses files
like

    /export/home/jack/MYAPP/sandbox/2

It is started manually and exits after each call. Each instance
requires a separate directory writable by the application, into which
log files are placed without limit.

Note this story is simplified because it makes no mention of
monitoring, call detail, failover, load balancing, resource management,
etc.

Meanwhile, out on the platform:

    SunOS 5.8 refuses to install on any of the new servers we buy, but
    we are able to install SunOS 5.9.

    A customer on another part of the platform reports a bug that they
    claim keeps them from going into production. According to vendor
    release notes, the bug is fixed in a new release of the speech
    recognition libraries.

When it is time to install myapp, we have to set

    location on the server from which the file runs
    user and group id under which myapp executes
    DNIS
    ANI to display to called party
    initial page of XML to be executed
    location of prompts used
    unique identifier for each instance to run
    UDP SIP port for each instance
    telnet port for each instance
    proxies with which to register
    database: name, user, password
    log directories: location, ownership, permissions, rotation scheme

In addition, there are often two or more versions of freetds, libwww,
and the C and C++ run-time libraries on the production servers.

This case is not the most complex we have dealt with. There exist
single Unix processes with over 100 configurable parameters.

2. Problems.

    a. How do we make sure that a program coupled to development
       resources (libraries, compilers, databases, SIP proxies, etc.)
       will work as desired in production?

    b. How do we make sure that programs developed in a manual,
       prototype setting function properly in a 24x7 shared
       environment? (i.e. that they can write data where it is needed,
       but don't fill up all disk space with logfiles, use up all
       memory, file descriptors, process table entries, etc.)

    c. How do we make software dependencies manageable during
       development? (How can Jack easily select one version vs.
       another of freetds and track what is done?)

    d. How do we set up all the needed interactions when moving into
       production?

3. What does not work.

    a. Copy the files and edit things manually on production servers.
    b. Ghost the hard drives.
    c. Jumpstart.
    d. Write some scripts.
    e. Keep a log of all the manual operations done.
    f. Write lots of in-house how-to pages.
    g. Whenever something is installed on a server, make an entry in a
       database.
    h. Run a program that scans every server to find out what is out
       there.

We have tried all of these.

The next three sections discuss three technologies that will help deal
with the problems listed in section 2.

4. Cfengine.

Cfengine is a tool for deploying and configuring software on large
numbers of servers. It originated at the University of Oslo in 1993 and
has been actively maintained ever since. Users of cfengine include
Cisco, Hewitt, NASA, Nokia, Nortel, Motorola, Red Hat, and Sun.

How it works.

Each production server is a cfengine client and belongs to various
cfengine classes.

    Policy host: server with files that describe what happens to each
    class of client.

    File masters: servers with content to be distributed to clients.

Servers may be configured to poll the policy host on a schedule (we do
this hourly) or, while under maintenance, to poll only when manually
triggered to do so.
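
The choice between scheduled and manual polling can be pictured with a
small wrapper around the agent. This is only a sketch under stated
assumptions, not how cfengine schedules its own runs: the hold file
path and the poll_policy function are hypothetical names invented for
illustration, and the only real command involved is cfagent.

```shell
#!/bin/sh
# Hypothetical wrapper for the hourly poll. HOLD_FILE and poll_policy are
# illustrative names, not cfengine features; cfagent is the real agent.
HOLD_FILE="${HOLD_FILE:-/var/cfengine/maintenance.hold}"
CFAGENT="${CFAGENT:-cfagent}"

# poll_policy [manual]
# Runs the agent unless the hold file exists. Passing "manual" overrides
# the hold, so an operator can still trigger a run during maintenance.
poll_policy() {
    if [ -e "$HOLD_FILE" ] && [ "${1:-}" != "manual" ]; then
        echo "skipped: maintenance hold"
        return 0
    fi
    "$CFAGENT"
}
```

An hourly cron job would call poll_policy with no argument, while an
operator working on a held server would call "poll_policy manual".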
We use cfengine to do the following sort of thing:

    if a server is going to be a myapp server, then
        install the software needed by myapp
        create startup entries (inittab) for myapp
        configure myapp on target host for sip domain, sip proxy, etc.

Other things cfengine handles:

    locally replicated prompts and grammars
    filtering of log messages for email to hosting customers
    setup of rsync and ftp servers
    initialization of filesystem databases
    scan for appearance of new core files
    creation of crontab entries
    associating rec clients with the right rec servers
    placement of SNMP MIBs
    localhost replication of initial page routing table
    maintaining users, groups, and sudo authorization

Cfengine uses pull rather than push (configuration is done when the
client requests it). This feature makes it easy to defer updates until
they are needed, and to catch up on changes if a server happens to be
offline for a while.

Cfengine will typically overwrite files that it is supposed to manage,
but we have several areas on each server where manual edits, if
required, will not be undone.

Our policy is not to stop or start processes with cfengine - for
example, installing myapp creates inittab entries but leaves them
"off"; they have to be set to "respawn" before the interpreter is
operational.

Potentially destructive operations, such as recompiling a recognition
package (with vs. without speaker verification) or generating a new
speech engine configuration file, are scripted in cfengine but only
happen if a special flag is added when the utility is invoked -
"cfagent -Dspeech_engine_config" for example.

Advantages.

Cfengine is better than jumpstart alone because it lets us update
servers any time after installation. Note that jumpstart or an
equivalent still has a role in bootstrapping the OS and the initial
cfengine bits.

Cfengine gives us a record, in the config files, of what configuration
is done to which servers. That record is accurate because it was
actually used to perform the configuration.
The record is maintained under version control, giving us a history of
the platform. There is a single, well-known area in CVS -
netadmin/cfengine/conf - where configuration rules for every server and
every service can be found and executed.

Problems.

Cfengine has root access on all client servers. That means it could do
immense damage if configured to do so. However, we have used it for
over a year on our VoiceXML hosting platform (about 30 servers for most
of that time) with only one instance in which production services
failed en masse - and that was before making the policy decision not to
start or stop processes from cfengine. In practice, when a mistake
occurs, the scope is limited and it can almost always be remedied
without taking services offline. There is nothing cfengine can do that
a human user could not do with admin privileges; it's just that
cfengine can do it to a lot of computers very fast.

Cfengine configuration file syntax is limited. It is difficult to
specify the ordering of certain kinds of actions and difficult to do
complex modifications of shared configuration files like inittab.

Cfengine is slow. In one recent test it was about 100 times slower than
rsync, taking almost a minute to replicate about 500 files.

Cfengine has bugs. Most of these cause it either to crash (in which
case replication succeeds the next time) or to perform a replication
when none is needed. I keep a list of defects I've found in cfengine -
at present it has over 30 entries.

On balance.

Cfengine has saved us a huge amount of work in the past year. While we
are still learning the best way to organize the policy files, we should
continue to use it. We are learning to work around its limitations.

Simplified example.

When creating a new type of server, edit cfagent.conf:

    # myapp hosts
    myapp_server::
        /var/cfengine/inputs/cf.myapp

Create cf.myapp and put things in it like this:

    groups:
        has_myapp = ( ReturnsZero(/usr/pkg/sbin/pkg_info -q -e myapp-0.8.1) )

which says a server is in a certain group if it has the myapp package.

    shellcommands:
        !has_myapp.ipv4_192_168_32::
            "/bin/true;PATH=$(cf_path2) SIP_DOMAIN=site_a.local MYAPP_HOME=/usr/pkg/site/myapp /usr/pkg/sbin/pkg_add ftp://cfesrv01.local/pub/pkgbin/i386sol-8/myapp-0.8.1.tgz" umask=022

The above rule says: if the server is on the Chicago production network
and does not have the myapp package, then install myapp and all
packages required by myapp, create any users, groups, and directories
needed, and make the necessary edits on the end server.

Now suppose you want to configure server sipsrv01 to run myapp. Add the
following to cfagent.conf

    myapp_server = ( sipsrv01 ... )

and either wait for regularly scheduled replication or log onto
sipsrv01 and do

    sudo cfagent

To put the server into production, log on, edit occurrences of "off" to
"respawn" in /etc/inittab, and do

    sudo kill -1 1

and you have added another server to the platform.

5. Pkgsrc.

What it is.

Pkgsrc is a suite of tools developed by the NetBSD core team for
packaging software prior to installation, similar to RPMs on Linux and
Solaris packages on SunOS.

The focus of the NetBSD project is portability. The OS runs on 54
different architectures. Writing portable software enforces a
discipline that makes you pay attention to architectural issues you
wouldn't notice as quickly otherwise. The pkgsrc system, unlike any of
the other major packaging systems, runs on every modern Unix-family OS.
A port to Windows Interix is in early development.

Pkgsrc is derived from the FreeBSD "ports" system. FreeBSD ports have
been in use since 1993 and today allow users of FreeBSD access to over
10,000 software packages. The OpenBSD packaging system is also based on
FreeBSD ports.

Like other packaging systems, pkgsrc allows you to specify a list of
files with destination directory, ownership, and permissions. A
database is kept of files which are installed, so that packages can be
uninstalled. Dependencies among packages are tracked: installation of a
package will not succeed if prerequisites are not present or
installable; a package will not be uninstalled under default settings
if it is required by another installed package.

Pkgsrc can create users and groups with specified id numbers. It can
apply patches when files are built on a staging server as well as when
they are installed on the target server. It offers a far more
comprehensive interface than RPM, which relegates much of the detail
other than copying of files to ad hoc installation scripts.

Pkgsrc allows us to keep a depot of version-tagged packages on a master
server. Cfengine can then install those packages as needed. Pkgsrc
allows finer control of configuration settings than cfengine does.

Here's a listing of some packages installed with pkgsrc on the current
production myapp servers:

    sipsrv05.local>pkg_info |tail| sort
    erlang-9.2nb1          Concurrent functional programming language
    freetds-0.61.2         LGPL'd implementation of Sybase's db-lib/ct-lib/ODBC libs
    gcc-2.95.3nb4          GNU Compiler Collection, version 2
    libwww-5.4.0           The W3C Reference Library
    moz-lib-1.0            mozilla libs needed for myapp
    openssl-0.9.6l         Secure Socket Layer and cryptographic library
    p5-DBD-Sybase-0.94nb2  Perl DBI/DBD driver for Sybase/MS-SQL databases
    thttpd-2.23.0.1nb1     Tiny/turbo/throttling HTTP server
    myapp-0.8.1            myapp main program and dedicated shlibs
    myapp-wav-1.0          generic prompts for myapp

The freetds package includes several local adaptations made to the
library.

Pkgsrc can be used to deploy binary-only packages from vendors and to
build packages from source for any target architecture.
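
Because pkgsrc keeps a database of what is installed, a script can ask
whether a given set of packages is present before acting. The sketch
below is illustrative only: missing_packages is a hypothetical helper
name, and the default pkg_info path is an assumption. It relies on the
same "pkg_info -q -e" exit-status test that the cfengine example uses,
where a zero exit status means the package is installed.

```shell
#!/bin/sh
# PKG_INFO can be overridden; the default path is an assumption.
PKG_INFO="${PKG_INFO:-/usr/pkg/sbin/pkg_info}"

# missing_packages PKG...
# Hypothetical helper: prints, one per line, each named package that
# "pkg_info -q -e" does not report as installed (non-zero exit status).
missing_packages() {
    for pkg in "$@"; do
        "$PKG_INFO" -q -e "$pkg" >/dev/null 2>&1 || echo "$pkg"
    done
}
```

For example, "missing_packages freetds-0.61.2 libwww-5.4.0" prints
nothing on a server where both packages are installed.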

Pkgsrc and its relatives on the other BSDs have been used for nearly a
decade by thousands of developers to solve many of the deployment and
configuration problems facing us today. It represents a colossal
investment of labor by a large number of advanced programmers. We would
be foolish to ignore it.

Using pkgsrc.

Create a distfile, a version-labeled tar archive of files to be
deployed, like

    myapp-0.8.1.tar.gz

and put the distfile into the distfile depot

    cfesrv04 /u1/ftp/pub/pkgsrc/distfiles>sudo scp .../myapp-0.8.1.tar.gz .

Create a directory in your work area

    ~/work/pkgsrc/site/myapp

Create a stub package

    url2pkg ftp://cfesrv04:/pub/pkgsrc/distfiles/myapp-0.8.1.tar.gz

Edit Makefile and DESCR. Special functions such as creating directories
and users on the target host are configured here, as are dependencies
on other packages.

Record a checksum for the distfile.

    bmake makesum

Test install and deinstall.

    bmake install
    bmake deinstall

Make the package.

    bmake package

Put the binary package into the pkgbin depot.

    cfesrv04 /u1/ftp/pub/pkgbin/i386sol>sudo scp .../myapp-0.8.1.tgz .

The package may now be installed manually or via cfengine with

    sudo pkg_add ftp://cfesrv04:/pub/pkgbin/i386sol/myapp-0.8.1.tgz

and may be deinstalled with

    sudo pkg_delete myapp

Deleting a package will not remove files that were edited or added
after installation.

Cost.

Most of the cost in creating a pkgsrc package will be in getting the
Makefile right. For someone familiar with the process, it takes from an
hour to a day, depending on the complexity of the project.
This week, in a couple of evenings, one programmer created the
following pkgsrc packages from existing code:

    cloud_mon-1.0         Erlang Cluster Monitor
    cdr_client-2.0        Call Detail Record (CDR) System Client
    cdr_mapper-1.0        Call Detail Record (CDR) System Mapper
    cdr_call_state-2.0    Call Detail Record (CDR) System Server
    cdr_subscriber-1.0    Call Detail Record (CDR) System Subscriber
    cdr_spool-2.0         Call Detail Record (CDR) System Spooler
    www_tools-1.0.1       WWW Tools
    resource_manager-1.0  Resource Manager

6. Autoconf.

GNU Autoconf has been one of the underpinnings of open source since
1991. It is used to adapt software to variations in Unix-like operating
systems. Generally, when you download a package, you get something like
foo-1.13.tar.gz. You extract files from that archive, then type

    sh configure
    make

and the program compiles.

The package downloaded is a collection of source files, traditionally C
or C++. The invocation of the "sh configure" script makes the
adjustments needed for the particular environment you use when building
the program. Thus, a programmer can write programs on Solaris and use
autoconf to make it easier to build those programs on Linux or FreeBSD,
or even a different release of Solaris. There is considerable support
for autoconf on Windows.

Nearly every open source program we use was built with the aid of
autoconf. The list includes

    gcc apache tomcat erlang perl openssl freetds libwww net-snmp
    tcpdump zebra

and hundreds more. With none of these do we worry about whether we are
building on Red Hat Linux or Solaris or FreeBSD. In fact this is one
area where the proprietary vendors are decades behind the times.

The main advantages of autoconf for us today come not from portability,
but from the ability to standardize on compile-time dependencies.
Autoconf macros are written to find needed libraries in their default
locations, but these locations may be overridden at configuration time,
for example

    sh configure --prefix=/usr/pkg --with-freetds=/usr/pkg/freetds-0.61-1

The autoconf toolset includes libtool, which we can use to our
advantage with native checking for required versions of shared
libraries (built into modern Unix operating systems) instead of the
various ad hoc "magic number" techniques we have used in legacy
software.

There is a high degree of synergy between pkgsrc and autoconf, because
large parts of pkgsrc were developed to assist in installation of
software products developed under autoconf.

Like pkgsrc, autoconf is the result of a huge amount of development,
testing and use. Think of it as $1M in free software consulting time.
We should not again undertake the creation of a build system without at
least taking a long, hard look at autoconf.

Example.

Suppose you have a program working - it is in CVS with files that look
like this:

    Makefile
    main.c
    header.h
    xyz.c

then

    autoscan
        (creates configure.scan)
    edit configure.scan -> configure.ac
    edit Makefile -> Makefile.am
        this step may require creating aclocal macros
    autoreconf -fiv
        creates config.h.in, configure, aclocal.m4
        code may be added here to deal with variations in OS, etc.
    sh configure --prefix=/usr/pkg
    gmake
    gmake dist
        (creates foo-1.6.tar.gz or such, the source distfile)

Cost.

The work in making a project autoconf-compatible lies in creation of
configure.ac, Makefile.am, and supporting aclocal.m4 files. Complex
projects such as myapp could take a week or more. Simpler projects,
such as most Erlang modules or our modifications to freetds, take less
than an afternoon for someone familiar with the process.

Making the code truly portable will require extra effort - autoconf can
only help find parts of the program that will need to support
variations in the OS and make it easier to do so.

7. Migration.

One advantage common to each of the three systems is that they can be
deployed incrementally. Each can coexist with legacy processes so that
conversion can proceed at whatever rate is feasible.

All three tools, but especially autoconf, have evolved a set of coding
guidelines that make programming more amenable to portability and
configurability. One of the best reasons for using these tools is that
they will help us to write better software.

8. Other issues.

a. Windows. Many autoconf projects, including the Erlang platform,
offer Windows compatibility, but we have not explored this field. Also,
we know we must support some MS deployment tools such as WMI.

b. Delegation of responsibility. We would like configuration for types
of servers to be managed by teams specific to those projects. One
approach is to permit the team in question to modify selected cfengine
files.

c. Data areas. Some collections of files are too large for regular
replication with cfengine, but change too rapidly for repackaging to be
practical. One example is the collection of prompts used by an
application. Probably we should use another replication tool, such as
rsync, for such things. Network services such as SFS (I would not use
bare NFS in mission critical applications) or custom servers could also
be used.

d. Backing out configuration. Sometimes we want to change a server, say
from call push to vapp. Some dependencies are not obvious, such as
incompatible sets of recognition packages. Probably it will never be
cost effective to automate all aspects of deinstallation - someone will
just have to know that placing a server into server class A means some
special intervention is needed if we ever want it in class B instead.

e. Updates. Updating something like the OS (Solaris 8 to Solaris 9) or
speech recognition libraries still presents challenges.
About the best we can do now is peel off a server into a new cfengine
class, test proposed changes to it, and gradually migrate other servers
into the new class. Sometimes you have to do several servers at once
because clients and servers need to be upgraded at the same time.

9. Suitability for our customers.

Are we locking ourselves into a system of using software which will
cause problems with customers?

Probably if you go to some customers and try to tell them all about
cfengine or pkgsrc, you will scare them off. On the other hand, I know
of no Unix development shop where autoconf tools do not play some role.

I think that if we present a working system of large-scale
configuration and deployment, it will add great value to our products.
The tools presented also make production of releases and tracking of
dependencies a routine task, so that it will be much easier to
productize our software for external use.

There is nothing requiring a customer to adopt cfengine if they don't
want it. They can always copy and edit the same files by hand or
substitute another deployment tool of their choosing. Similarly with
pkgsrc - they don't have to use packages - we can install them and just
tar up the installed files.

10. Licensing.

IANAL.

Autoconf and cfengine are GNU projects. I believe that means that if
you ship them, you need to make source for autoconf or cfengine
available (not the rest of your company software), and that if you
modify autoconf or cfengine and ship those modifications, then those
modifications need to be made available.

Pkgsrc is a BSD project. I believe you can do anything you want with
it, as long as you include appropriate attribution in your product.