Instructions for using the IPUMS Data Extraction System

The great size of the census microdata files has always been a major obstacle to their use. Accordingly, we have developed an interactive extract system on the World Wide Web to provide easy access to the IPUMS from personal computers, workstations, and mainframe computers. The system allows researchers to fashion smaller extracts of the data specifically oriented to their own research needs and suited to their available computing power and storage capacity. In practice, researchers rarely require all variables and all cases from a census year. In the past, however, they have had no choice but to obtain the entire census samples to get the cases they wanted. With our extract system, researchers can design subsamples incorporating a subset of variables pertaining to the specific population(s) of interest to them.

Preliminary version of the interface
We have already resolved some of the most difficult design issues for developing a new generation of data extraction software. The key component is the user interface, which we developed using the programming language Perl. This is still in preliminary form, but it is quite usable.

The extract procedure involves four steps, each on a separate Web page. The contents of each page depend on the selections made on the previous page. Before users initiate a data extract, they are prompted for their e-mail address, which acts as their password and provides us with a means of contacting them and constructing a unique file name for their extract output.

On the first page, users define the general characteristics of their desired extract. They select the particular census sample or combination of samples they want (e.g., the 1970 5% county group sample, or the 1880 sample). Users choose the preferred file structure for their extract: hierarchical (each household record followed by its person records) or rectangular ("flat," with all household information attached to the respective household members). Several sample densities are available, ranging from the 1-in-20 samples available in recent census years to very small samples constructed in all years for purposes of testing and instruction (the "small" and "tiny" samples). A feature allowing continuously variable sample densities will be added in the future. Finally, researchers may elect to include data quality flags, in which case the program will automatically append the flags corresponding to the selected variables.
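The difference between the two file structures can be illustrated with a minimal Python sketch. The record layout (a type code of "H" or "P" paired with a dictionary of fields) and the field names are assumptions made for illustration only; they are not the actual IPUMS record format.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical layout: ("H", fields) is a household record, ("P", fields) a person record.
Record = Tuple[str, Dict[str, object]]


def rectangularize(records: Iterable[Record]) -> Iterator[Dict[str, object]]:
    """Turn a hierarchical stream into flat rows, one per person."""
    household: Dict[str, object] = {}
    for rectype, fields in records:
        if rectype == "H":
            household = fields  # remember the household for the persons that follow
        elif rectype == "P":
            # attach the current household's fields to this person's fields
            yield {**household, **fields}
        else:
            raise ValueError(f"unknown record type: {rectype!r}")


# Invented example data: two households with three persons in total.
hierarchical = [
    ("H", {"serial": 1, "region": 21}),
    ("P", {"pernum": 1, "age": 34}),
    ("P", {"pernum": 2, "age": 31}),
    ("H", {"serial": 2, "region": 42}),
    ("P", {"pernum": 1, "age": 67}),
]
flat = list(rectangularize(hierarchical))
```

In the rectangular form every row repeats its household's fields, which makes the file larger but lets statistical packages read it as a single table.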

On the second page of the extract interface users select which variables they want to include in their extract. Only those variables available for the particular samples selected on the first page are displayed as options. If users have selected multiple census samples, all variables occurring in any of the specified samples are displayed. Some variables have a second check box allowing users to select cases based on the value of the variable. In the future, we will add case selection boxes for many more variables. In addition, clicking on a variable name will call up all relevant documentation (see below). Users can also select entire groups of related variables by checking a single box.

On the right-hand side of the variable selection page, there is a column for each census year selected on the previous (sample selection) page. Only columns for selected years are displayed, with symbols showing the availability of each variable across years. An "X" indicates that the variable is available for all individuals or households in the particular census year, an "S" means that the variable is only available for sample-line individuals, and so on.
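As a rough sketch of how such an availability grid could be assembled, the Python fragment below lists every variable found in any selected year and gives one symbol per year. The availability data, the variable names, and the "." used for "not available" are all invented for illustration; only the "X" and "S" symbols come from the description above.

```python
from typing import Dict, List

# Invented availability data: census year -> {variable name: symbol}.
# "X" = available for all cases, "S" = sample-line individuals only.
AVAILABILITY: Dict[int, Dict[str, str]] = {
    1880: {"VAR_A": "X"},
    1940: {"VAR_A": "X", "VAR_B": "S"},
    1970: {"VAR_A": "X", "VAR_B": "X", "VAR_C": "X"},
}


def availability_table(years: List[int]) -> Dict[str, List[str]]:
    """Map each variable present in ANY selected year to one symbol per year."""
    variables = sorted({name for year in years for name in AVAILABILITY[year]})
    # "." is an assumed placeholder for "not available in this year"
    return {
        name: [AVAILABILITY[year].get(name, ".") for year in years]
        for name in variables
    }


table = availability_table([1880, 1940])
for name, symbols in table.items():
    print(f"{name:<8}" + "  ".join(symbols))
```

Because 1970 was not selected, VAR_C does not appear in the table at all, which mirrors the rule that only variables available in the chosen samples are offered.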

The third page provides for case selection. Only those variables chosen for case selection on the second page will appear on the third. Depending on the type of variable, the page employs one of three procedures. For simple categorical variables such as region, the user selects values from a series of check boxes. With complex categorical variables such as birthplace, values are selected from a scroll-box that displays descriptive value labels rather than numeric codes. For numeric variables like age, users select minimum and maximum values. Users have the option of selecting only those individuals with the selected characteristics or entire households containing individuals with the selected characteristics.
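The case selection logic can be sketched in Python as follows. Check boxes and scroll-boxes both amount to choosing a set of codes, while numeric variables use a minimum and maximum. The data layout, the variable codes, and the decision to require all criteria at once (a logical AND) are assumptions for illustration, not a description of the actual extract engine.

```python
from typing import Callable, Dict, Iterable, List

Person = Dict[str, int]
Household = List[Person]
Predicate = Callable[[Person], bool]


def in_set(variable: str, allowed: Iterable[int]) -> Predicate:
    """Categorical selection: the person's code must be one of the chosen values."""
    allowed_codes = set(allowed)
    return lambda person: person[variable] in allowed_codes


def in_range(variable: str, minimum: int, maximum: int) -> Predicate:
    """Numeric selection: the value must lie between minimum and maximum, inclusive."""
    return lambda person: minimum <= person[variable] <= maximum


def select_cases(
    households: List[Household],
    predicates: List[Predicate],
    whole_households: bool,
) -> List[Household]:
    """Keep matching persons, or entire households containing a matching person."""
    selected: List[Household] = []
    for household in households:
        matches = [p for p in household if all(test(p) for test in predicates)]
        if matches:
            selected.append(list(household) if whole_households else matches)
    return selected


# Invented data and codes: region 1 and ages 65 to 99 are the selection criteria.
households = [
    [{"age": 70, "region": 1}, {"age": 40, "region": 1}],
    [{"age": 30, "region": 2}],
]
criteria = [in_range("age", 65, 99), in_set("region", [1])]
persons_only = select_cases(households, criteria, whole_households=False)
with_households = select_cases(households, criteria, whole_households=True)
```

With `whole_households=False` only the 70-year-old is kept; with `whole_households=True` the 40-year-old in the same household is kept as well.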

At the moment, extracts of the 5% samples from 1980 and 1990 are limited to cases from a single state. These files are so large that it is impractical and usually unnecessary to allow extracts on the entire samples. If either of these samples is selected, the user is required to choose a particular state. Depending on demand and continuing improvements in processor speed, this constraint might be removed in the future.
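A constraint of this kind could be checked when the request is submitted, as in the short Python sketch below. The sample identifiers and the wording of the message are hypothetical; the real system's internal names are not given here.

```python
from typing import List, Optional

# Hypothetical identifiers for the two samples that require a state selection.
STATE_RESTRICTED = {"1980_5pct", "1990_5pct"}


def validate_request(samples: List[str], state: Optional[str]) -> List[str]:
    """Return a list of error messages; an empty list means the request is acceptable."""
    errors: List[str] = []
    restricted = sorted(set(samples) & STATE_RESTRICTED)
    if restricted and state is None:
        errors.append("A state must be chosen for: " + ", ".join(restricted))
    return errors


missing_state = validate_request(["1980_5pct", "1880"], None)
with_state = validate_request(["1980_5pct"], "MN")
unrestricted = validate_request(["1880"], None)
```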

In the final step, users review their selections on a summary screen. If they are satisfied with their extract design, they submit it for processing. When they click the "submit" button, the program creates an extract request file that initiates the extract engine. We inform researchers via e-mail when their extract is completed and provide instructions for downloading their files. For each extract, users receive data, codebook, and "readme" files, and an SPSS command file. The command file contains the column locations of variables, variable labels, value labels for categorical variables, and missing values.
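To give a sense of what a command file of this kind holds, the Python sketch below writes SPSS syntax from a list of variable descriptions. It is an illustration only: the `Var` structure, the file name, the variable names, and all codes are invented, labels are assumed to contain no apostrophes, and the output is not claimed to match the file the extract system actually produces.

```python
from typing import Dict, List, NamedTuple


class Var(NamedTuple):
    name: str
    start: int                    # first column in the data file (1-based)
    end: int                      # last column in the data file
    label: str
    value_labels: Dict[int, str]  # empty for numeric variables
    missing: List[int]            # codes to declare as missing


def spss_commands(data_file: str, variables: List[Var]) -> str:
    """Build DATA LIST, VARIABLE LABELS, VALUE LABELS, and MISSING VALUES commands."""
    if not variables:
        raise ValueError("at least one variable is required")
    lines = [f"DATA LIST FILE='{data_file}' /"]
    for v in variables:
        # a one-column variable is written as a single column number
        columns = str(v.start) if v.start == v.end else f"{v.start}-{v.end}"
        lines.append(f"  {v.name} {columns}")
    lines[-1] += "."  # a period ends the DATA LIST command
    for v in variables:
        lines.append(f"VARIABLE LABELS {v.name} '{v.label}'.")
        if v.value_labels:
            pairs = " ".join(f"{code} '{text}'" for code, text in v.value_labels.items())
            lines.append(f"VALUE LABELS {v.name} {pairs}.")
        if v.missing:
            codes = ", ".join(str(code) for code in v.missing)
            lines.append(f"MISSING VALUES {v.name} ({codes}).")
    return "\n".join(lines)


# Invented example: column positions and codes are placeholders.
variables = [
    Var("AGE", 1, 3, "Age", {}, [999]),
    Var("SEX", 4, 4, "Sex", {1: "Male", 2: "Female"}, []),
]
text = spss_commands("extract.dat", variables)
print(text)
```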