IPUMSi: Integrating International Census Microdata, a project proposal
(c) Steven Ruggles, Robert McCaa, Deborah Levison, Todd Gardiner, and Matt Sobek 1999

Project Summary

This project will create and disseminate an integrated international census database incorporating 21 countries on six continents. It will be the world’s largest public-use demographic database, with multiple samples from each country enabling analyses across time and space. The project entails two complementary tasks: first, the collection of data that will support broad-based investigations in the social and behavioral sciences; second, the creation of a system incorporating innovative capabilities for worldwide web-based access to both metadata and microdata.

Although large machine-readable census samples exist for many countries, public access to these data is restricted in virtually every case. The investigators have proposed a plan for public access to these data to twenty countries, and every one of them has expressed enthusiasm and eagerness to cooperate. But the goal of this project is not simply to make international microdata available; it will also make them usable. Even in the few cases where microdata are available, comparison across countries or time periods is challenging owing to inconsistencies between datasets and inadequate documentation of comparability problems. Because of this, comparative international research based on pooled microdata is rarely attempted. This project will reduce the barriers to international research by preserving datasets and making them freely available, converting them into a uniform format, providing comprehensive documentation, and developing new web-based tools for disseminating the microdata and documentation.

The project builds on the experience of the Integrated Public Use Microdata Series (IPUMS), which received primary funding from NSF. The IPUMS is a coherent series of individual-level U.S. census data drawn from 13 census years between 1850 and 1990. By putting all the census samples in a compatible format and integrating their documentation, the IPUMS greatly simplifies the use of multiple census years. Just as important, new methods of electronic dissemination have democratized access to these resources. The IPUMS includes 22 samples and 65 million records drawn from one country, the United States. The IPUMS is one of the world’s largest public-use databases, but it is modest by comparison with the current project: 650 samples drawn from 21 countries, including about 550 million records. The database will take up some 250 gigabytes uncompressed, and require the equivalent of 14,000 pages of documentation. The scale of the database and documentation necessitates new navigation and extraction tools to keep access simple.

The project is composed of four interrelated elements. The first is planning and design. The international dimension of the database poses new design challenges, since it must accommodate variations in census design and cultural concepts. The basic design goals, however, remain the same as in the IPUMS: the system should simplify use of the data while losing no meaningful information.

The second element, microdata conversion, involves both domestic and international components. The domestic segment will include the IPUMS data while adding new U.S. samples to allow detailed study of the late 20th and early 21st century. It will incorporate the Current Population Survey from 1964 to 2008, the 2000 Census sample, and the American Community Surveys for 2000-2008. With these additions, the database will have a much stronger contemporary focus than the current IPUMS. The international component of the database falls into two categories. For some countries, the project will incorporate already-existing public-use samples. For other countries, no public-use census files presently exist. In these instances, new samples will be drawn from surviving census tapes using techniques to ensure that respondent confidentiality is preserved. These data files are often poorly documented, and will necessitate extensive assistance from the statistical offices and experts of each country to assure their correct interpretation.

The third element, the development of metadata, is central to the project and poses even greater challenges than the microdata. The documentation will not be confined to codebooks and census questionnaires. As with the IPUMS, a wide variety of ancillary information will be provided to aid in the interpretation of the data, including full detail on sample designs and sampling errors, procedural histories of each dataset, full documentation of error correction and other post-enumeration processing, and analyses of data quality.

The final element of the project is the creation of an integrated data access system to distribute both the data and the documentation on the Internet. Users will extract customized subsets of both data and documentation tailored to their particular research questions. The system will consist of a set of tools for navigating the mass of documentation, defining datasets, and constructing customized variables. Given the large number of variables and samples, the documentation will be so unwieldy as to be virtually unusable in printed form. Accordingly, we will develop software that will construct electronic documentation customized for the needs of each user.

Project Description

Objectives

This proposal seeks funding to create and disseminate an integrated international census database composed of high-precision, high-density samples of individuals and households. Our proposal has two components. First, we propose to collect data that will support broad-based investigations into the most important scientific questions facing social and behavioral science. Second, we will create a web-based data dissemination system that will incorporate innovative capabilities for worldwide access to both metadata and microdata.

Large machine-readable census microdata samples exist for many countries around the world, but access to these data is highly limited and the documentation is often inadequate. Even where such microdata are available for scholarly research, comparisons across countries or time periods are difficult because of inconsistencies in both data and documentation. This project will provide basic infrastructure for the social sciences by making the samples publicly available, converting them to a consistent format, supplying comprehensive documentation, and developing new web-based tools for disseminating the microdata and documentation.

The Internet is transforming the nature of electronic data dissemination. At the same time, the proliferation of fast personal computers and UNIX workstations has slashed the cost of large-scale data analysis. This proposal capitalizes on both of these developments by creating a population database of unprecedented size and power, and by providing tools to make it readily available for analysis on desktop machines.

The project builds on our experience with the Integrated Public Use Microdata Series (IPUMS), which received primary funding from the National Science Foundation (grants SBR-9118299, SBR-9422805 and SBR-9617820). The IPUMS is a coherent series of individual-level U.S. census data drawn from thirteen census years between 1850 and 1990. By putting all the census samples in a compatible format with consistent variable codes and integrating their documentation, the IPUMS greatly simplifies the use of multiple census years. Just as important, we have developed methods of electronic dissemination that have democratized access to these resources (http://www.ipums.umn.edu).

The original IPUMS project includes 22 samples drawn from one country, the United States. It contains 65 million records totaling 25 gigabytes when uncompressed. Although this is one of the world’s largest public-use databases, it is modest by comparison with our new endeavor: we plan to build a database with some 650 samples drawn from 21 countries on six continents, and it will include about 550 million records requiring some 250 gigabytes in uncompressed form. We will need to write the equivalent of about 14,000 pages of documentation, compared with 3,000 pages in the current IPUMS. This increase in scale will necessitate a proportionate increase in complexity, so we will develop new navigation and extraction tools to keep access to the data and documentation simple.

Our task is not merely to convert hundreds of additional samples into IPUMS format. Because of international variation in census concepts such as "group quarters," and cultural concepts such as race and marital status, we will need to design the database from the ground up. This design process will be undertaken in close collaboration with international and domestic microdata experts. The basic design goals, however, remain the same as in the original IPUMS: we will create a system that simplifies use of the data and at the same time loses no meaningful information except when necessary to protect respondent confidentiality.

The project will incorporate domestic and international data from a variety of sources. We will start with the U.S. census samples for the period 1850 through 1990 in the current IPUMS. Then we will add additional domestic samples to allow detailed study of the U.S. population in the late twentieth and early twenty-first centuries. Specifically, we will incorporate 528 monthly samples of the Current Population Survey (CPS) for the period 1964 through 2008, the 2000 Census Public Use Microdata Sample (PUMS), and the American Community Surveys (ACS) for the period 2000 through 2008. With these additions, the database will have a much stronger contemporary focus than the current IPUMS, and will be especially useful for national and local studies addressing policy questions.

The international component of the database falls into two categories. For some countries, we will incorporate public-use census or survey samples that already exist, just as we have done for the United States. These data are generally well-documented, but they will pose complexities we have not previously encountered because of national differences in census concepts, cultural practices, and language. For other countries, no public-use census files presently exist. In these instances, we will create new anonymized samples drawn from surviving census tapes that were used to construct census tabulations for publication. In collaboration with the statistical offices of the countries concerned, we will explore new techniques to ensure full respondent confidentiality while maximizing detail. These data files are often poorly documented, and we will require extensive assistance from the statistical offices and experts of each country to ensure that we interpret them correctly.

The development of metadata is central to the project and poses even greater challenges than the manipulation of the microdata. For every census and country we aim to provide comprehensive documentation at or exceeding the standards of the U.S. Census Bureau. The metadata will not be confined to codebooks and census questionnaires. As in the case of the existing IPUMS, we will provide a wide variety of ancillary information to aid in the interpretation of the data, including full detail on sample designs and sampling errors, procedural histories of each dataset, full documentation of error correction and other post-enumeration processing, and analyses of data quality.

Both the data and the documentation will be distributed through an integrated data access system on the Internet. Users will extract customized subsets of both data and documentation tailored to their particular research questions. This will not, however, simply be a data extraction system. Rather, it will be a set of tools for navigating documentation, defining datasets, constructing customized variables, and adding contextual information.

The most difficult task will be to provide a system whereby users can easily gauge the comparability of a particular variable in any sample to its counterpart variable in any other sample. Given the large number of samples, this level of documentation would be so unwieldy as to be virtually unusable in printed form. Accordingly, we will develop software that will construct electronic documentation customized for the needs of each user.

Background and Significance

The existing IPUMS database has been remarkably successful. Since 1995, we have distributed over two terabytes of IPUMS data to users around the world. We are now distributing about 95 gigabytes of data per month, which translates to 130 megabytes per hour, 24 hours per day (see Figure 1). We have prepared approximately 10,000 custom extracts of IPUMS data since our extraction system went on-line in May 1996, and are now processing approximately 500 data extracts per month (Figure 2).
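
As a rough check on the distribution figures above, the monthly volume can be converted to an hourly rate. The sketch below assumes a 30-day month and decimal units (1 gigabyte = 1,000 megabytes); both are assumptions made for illustration and are not taken from our usage logs.

```python
# Back-of-the-envelope conversion of monthly distribution volume to an hourly rate.
# Assumptions (not stated in the proposal): a 30-day month and 1 GB = 1,000 MB.
GB_PER_MONTH = 95
DAYS_PER_MONTH = 30
MB_PER_GB = 1000


def mb_per_hour(gb_per_month, days=DAYS_PER_MONTH):
    """Convert gigabytes per month to megabytes per hour, around the clock."""
    hours = days * 24
    return gb_per_month * MB_PER_GB / hours


rate = mb_per_hour(GB_PER_MONTH)
print(f"{rate:.0f} MB per hour")  # prints "132 MB per hour"
```

Under these assumptions the result is about 132 megabytes per hour, in line with the rounded figure of 130 quoted above.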

Figure 1. Quantity of IPUMS Data Distributed, 1993-1999

Figure 2. Number of Data Extracts Per Month, 1996-1998

Even though the database has been widely distributed only since late 1995, it has already served as the basis of three books, ten completed dissertations, and 65 articles, many of which appeared in leading journals of various disciplines such as the American Economic Review, the American Sociological Review, the American Historical Review, Social Forces and Demography (see http://www.ipums.umn.edu/~ipums98/research.html). The user base continues to grow, with some 500 new registered users during the past six months. In a very brief period the IPUMS has become one of the most widely used databases in American social science.

What accounts for this success? Some key factors are enumerated below.

A prerequisite of our success is the power of the underlying data. The national census files incorporated in the existing IPUMS database have four key strengths: broad chronological scope, large sample populations, national coverage, and high data quality. Social scientists have increasingly recognized that we cannot understand contemporary social behavior without investigating processes of change. Many have turned to longitudinal sample survey data in recent years, such as the Survey of Income and Program Participation and the National Longitudinal Survey of Youth. Such sources are invaluable for the study of short-run life-course transitions, but they are less useful for the analysis of longer-term change across periods or between cohorts. The census is the only source of microdata for the study of such long-run changes, and the IPUMS design makes such investigations comparatively simple. The second strength of the public use census files is their large size. The number of cases available for each census year ranges from the hundreds of thousands to the tens of millions. This allows the study of small and geographically dispersed population subgroups. Even the largest surveys are too small to allow analysis of small population subgroups such as Native Americans or particular occupational groups. Moreover, there are presently no national surveys large enough to be used for policy research at the municipal level. The third strength, national coverage, is important because it allows researchers flexibility and permits generalization at the national level. Datasets that cover only a particular region or municipality have considerably more limited application. The large size of the IPUMS files together with their national coverage permits multi-level analyses of the effects of local conditions on individual and family behavior (e.g. Ruggles 1997a, 1997b). Finally, the U.S. census offers precision, reliability and response rates that compare favorably with the very best alternative sources.

Despite the convenience of the IPUMS and the power of the underlying data, the existing database has two profound limitations. First, the most recent data are from 1990 and are therefore almost ten years old. Second, the IPUMS provides observations of the U.S. population only at ten-year intervals. The success of the IPUMS is remarkable considering that the data are so old, and the observations of the population so infrequent, that they cannot be used to study many of the most pressing issues in social and behavioral science.

The new database will be applicable to a far broader range of problems. The addition of data from the Current Population Survey will allow us to provide monthly samples from 1964 right up to the present. The data from the American Community Survey, which will begin in 2000, will provide high-density samples on an annual basis suitable for the study of small population subgroups and specific geographic areas. These data will find a broad range of applications for public policy and the analysis of contemporary social and economic conditions in temporal context.

Equally exciting is the globalization of the IPUMS paradigm to twenty-one countries, each of which will also incorporate a chronological dimension. For most of these countries, the new database will include multiple censuses spanning the period from 1960 to 2000. For some countries, the chronological depth is far greater, stretching back into the nineteenth century. The inclusion of international data will allow investigations of social change across space as well as through time. The last third of the twentieth century has been a time of unprecedented economic change and global population movement, and we have a great deal yet to learn about these processes. The inclusion of international data will not only create a key resource for comparative population studies, but will also enrich the study of U.S. population issues, owing to the dramatic rise of immigration to the United States during the past three decades.

The potential list of topics that can be addressed with these data is far too long to discuss within the space constraints of this proposal. Among key research areas are economic development, poverty and inequality, industrial and occupational structure, household and family composition, the household economy, female labor force participation, employment patterns, population growth, urbanization, internal migration, immigration, nuptiality, fertility, and education. In each case, these topics can be studied across countries and across census years. Analysts of immigration to the United States will be able to compare the characteristics of newcomers with those they left behind. Students of African, British, Italian, or Hispanic diasporas will be able to compare the characteristics of people with the same ethnic origin in a wide variety of countries, and to assess how those characteristics have changed over time in each country. Researchers working on the impact of NAFTA will be able to carry out multivariate analyses spanning the last three decades of the twentieth century using pooled microdata from Canada, Mexico, and the United States.

The creation of a global microdata archive will make a permanent and substantial contribution to the infrastructure of social and behavioral science. By making these data easily accessible to researchers around the world and developing comprehensive and comprehensible documentation, the project will stimulate new research that transcends national boundaries and static interpretation. In a very brief period, the existing IPUMS database has had a significant impact on social science scholarship, multiplying many-fold the volume of quantitative research on long-run change in the United States. We anticipate that expanding the database from 22 to 650 samples and from 1 to 21 nations will have an even more dramatic influence, with consequences for the direction of future research in economics, sociology, population studies, history, and even political science.

In addition to scholarly research, we anticipate that the new database will make important contributions to teaching in the social sciences. The existing data extract system is already widely used in undergraduate and graduate courses. The additional coverage of the new database will make it a suitable vehicle for integrating research and teaching, bringing the excitement of discovery into the classroom. We plan to work with the Russell Sage Foundation and other funding agencies to develop web-based instructional materials for the new database which capitalize on the flexibility of the data access system and the international scope of the data.

Methods and Procedures

1. Overall plan of work. The U.S. and international components of the project will proceed on two separate tracks. For the U.S. component, we will start by converting the Current Population Surveys (CPS), the Census 2000 Public Use Microdata Sample, and the American Community Survey (ACS) files for 2000-2002 into the existing IPUMS format. This strategy will ensure that the U.S. data are brought up to date as quickly as possible. After we have designed a new IPUMS format to accommodate international census variables and concepts, we will convert all the U.S. files to the new format and begin distributing them using our data access system. At that time, we will evaluate whether the new IPUMS format significantly compromises ease of use for researchers exclusively studying the United States. If so, we will consider maintaining the U.S. portion of the database in both the old IPUMS format and the new international format.

We anticipate that the Census 2000 and ACS files will pose few problems for us, since both are modeled on the 1990 Public Use Microdata Sample, which is already part of the IPUMS database. We expect to be able to release both Census and ACS microdata in IPUMS format within a week after the data are made available by the Census Bureau in 2001 or 2002. The CPS will take longer, since there are hundreds of samples and thousands of variables with constantly shifting codes and definitions. Though laborious, the task is straightforward, and is similar to work we have done before. We will begin with the March files, which include supplemental questions on demographic topics, and plan to complete this part of the work in September 2004. The complete series for all months will be brought up to date by 2007. The Census Bureau will review our work on both data transformations and documentation starting in 2003, when work on the 2000 census will be diminishing.

The international component of the project will proceed in three phases of seven countries each, taking place concurrently with the development of the U.S. data. Phase I consists of the highest priority countries that have already met all of our criteria for inclusion in the database (described below), and which have agreed to begin work on the project immediately. We anticipate that each country will require approximately three years of work, but there will be considerable variation from country to country. Some countries have simple, well-documented samples already in existence, and only require a straightforward conversion of data and documentation into standard form. In other cases, we must work from raw data files that were never intended for use outside the national census office, and these will often pose greater challenges.

In every country, we will work closely with local statistical experts. We will develop tools for web-based collaboration, so that we can directly engage our international partners in the design of variables and documentation. In most cases, we will contract with international partners to assemble all available information on each census, obtain copies of the data, translate documents, and write essays on the comparability of the censuses over time in cases where these items do not already exist. When the design is complete, we will develop translation matrices to convert each variable into standard format, and amplify the comparability analyses to cover cross-national as well as cross-temporal issues. Finally, we will turn again to our partners to evaluate both our data transformations and our documentation.
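
A translation matrix of the kind described above can be pictured as a lookup table from each sample's original codes to a harmonized code. The following is a minimal sketch only: the sample identifiers, the marital-status codes, and the function name `harmonize` are hypothetical and are not taken from any actual census codebook or from the IPUMS design.

```python
# Hypothetical translation tables: original code -> harmonized code, per sample.
# All sample names and code values are invented for illustration.
HARMONIZED_LABELS = {
    1: "single",
    2: "married",
    3: "widowed",
    4: "divorced or separated",
    9: "unknown",
}

TRANSLATION = {
    "sample_a": {"S": 1, "M": 2, "W": 3, "D": 4},
    # sample_b distinguishes two kinds of marriage and two kinds of dissolution
    "sample_b": {0: 1, 1: 2, 2: 2, 3: 3, 4: 4, 5: 4},
}


def harmonize(sample, original_code, unknown=9):
    """Return (harmonized code, original code).

    The original code is returned alongside the harmonized one so that
    detail collapsed by the harmonized scheme is not lost.
    """
    table = TRANSLATION[sample]
    return table.get(original_code, unknown), original_code


records = [("sample_a", "M"), ("sample_b", 2), ("sample_b", 7)]
for sample, code in records:
    general, detail = harmonize(sample, code)
    print(sample, detail, "->", general, HARMONIZED_LABELS[general])
```

Carrying the original code next to the harmonized one is one way to reflect the design goal of simplifying use while losing no meaningful information; an unrecognized code falls back to an explicit "unknown" value rather than being dropped.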

We have been working on methods of electronic dissemination for social science data and documentation for five years, and this research experience provides the foundation for our new data access system. The complexity of the new database will be an order of magnitude greater than that of the original IPUMS, but our goal is to make access to both microdata and metadata even simpler than it is in the current system. Accordingly, we will begin work on the data access system immediately, so that it will be functional by the time our first international samples are ready for distribution in 2003. As soon as each sample is prepared, it will be added to the online data system.

2. Inventory and preservation of surviving census microdata. Our experience has shown that careful planning, both of the systems for reconciling variables and of the design of metadata, is essential to the success of the project. We plan to design a system that is not only compatible with microdata samples available now, but one that can also incorporate samples that may become available in the future. To that end, at the outset of the project we will develop a comprehensive inventory of machine-readable census and large-scale survey microdata around the world. We will collect enumerator instructions, census forms, codebooks, studies of data quality, and any other ancillary documentation we can locate for all countries that will allow us access to this information. In each case we will also assess the potential for the data to be made freely available to scholars. Wherever feasible, we will obtain copies of microdata as well as metadata. All materials collected will be transferred to stable media and permanently archived in the new long-term digital storage facility of the University of Minnesota Library, which is located in an underground sandstone cavern on the University of Minnesota campus.

The international microdata inventory will meet several needs. First, it will constitute an important resource in its own right for researchers and data archivists. We will post on the web a detailed report on the inventory at the end of the first year of the project. Second, the inventory will underpin the design of the new database. It will allow us to design the system to accommodate future expansion by taking into account the range of variation in census content and concepts around the world. Third, the microdata inventory will help us identify additional census and survey resources that meet all the selection criteria described below and that should be incorporated into the database.

The microdata inventory will also help us to locate machine-readable materials that are not of high enough priority to include in the new database but which are nevertheless worthy of preservation. The integrated database will incorporate data from approximately 100 samples in 20 countries, not including the United States, but this represents a small fraction of existing machine-readable census data. We propose to preserve for future generations data from a much broader group of nations. This task is urgent; a substantial portion of the world’s microdata is on the verge of destruction. Our preliminary work has identified over 8,000 census microdata tapes at risk of becoming unreadable due to age and technological obsolescence. Many of these tapes are contained in the census microdata library of the United Nations Centro Latinoamericano y Caribeño de Demografía (CELADE), which is at extreme risk due to bone-deep budget cuts. CELADE’s entire collection of seven- and nine-track tapes recording the demographic experience of over two dozen Latin American countries from the 1960s through the 1990s is in danger. For some countries, the library contains entire censuses, while for many others there are sizeable samples, with densities ranging from 1 to 25 percent. Forced to marshal its scarce resources, CELADE maintains only the most recent microdata for each country represented. Once lost these data can never be recovered. For about $60 per tape, the entire collection can be preserved for future integration and research. We propose to allocate approximately 5% of our budget to preservation efforts, beginning with the CELADE collection.

We will carry out most of the research for the inventory by means of web-based collaboratories, e-mail, conventional mail, and telephone communication with statistical authorities and demographic researchers around the world, supplemented with published sources. We anticipate, however, that in a few cases on-site investigations will be needed to sort through paper documentation and tapes, and that in other cases we will need to pay the statistical agencies for the time it will take them to produce the needed information.

3. Criteria for inclusion of international census samples. We have selected all the samples to be included in the first phase of the project, and have tentatively identified the samples we will include in the second phase. The third and final phase is less certain, since our decisions will be based on the microdata inventory to be carried out in the first year of the project. We have six criteria for inclusion of international samples:

  1. Public accessibility. The United States is one of very few countries whose data products are fully in the public domain. The conditions for use of the new database will vary from country to country. In no case will we incorporate data unless we can make them available at no cost to certified academic researchers who are willing to sign a nondisclosure agreement. In most cases, the conditions of use will be only slightly more restrictive than they are for existing IPUMS data. For each country, we will draw up a contract to allow us to distribute the microdata without charge. For all countries identified for inclusion under Phase I of the project and most of the countries we are considering for Phase II, we have already received written assurances that such distribution will be possible. Indeed, we have been struck by the enthusiasm with which every country we have approached has embraced our proposal to make the microdata freely available (see attached letters). We will begin negotiating formal contracts as soon as the grant is approved.
  2. Data quality. All samples included in the database must meet certain minimum criteria of data quality. They must have a net undercount of no more than 12%, as determined either through a post-enumeration survey or by demographic estimation techniques. In addition, we will review documentation relating to enumeration methods and the procedural history of each census to detect significant flaws in methods of data collection and processing.
  3. Size. The costs associated with incorporating each additional sample do not vary greatly with size. To maximize cost-effectiveness, we will not normally incorporate a sample unless there are at least 100,000 cases available for analysis.
  4. Availability of key variables. All samples must provide information on individuals grouped into households, families, and/or dwellings. At a minimum, each census must provide information on age, sex, marital status, occupation, and birthplace. In most cases, we will also require family relationships, education, employment status, and income.
  5. Chronological depth. In all cases we will require a minimum of two available census years, and we prefer countries with four or more.
  6. Cooperation of experts and availability of documentation. We will only include countries in which we can secure the assistance of national statistical agencies or other data experts, and where we can obtain adequate documentation specifying the details of sample creation.

These criteria will not be the only factors we consider. In some cases we may bend the criteria on sample size, chronological depth, data quality, or key variables for samples with exceptional intrinsic interest. For example, we want to ensure geographic diversity of the database, and this may involve compromises. Moreover, we will give high priority to samples that are currently unavailable to scholars in any form.
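
The six criteria can be read as a screening checklist. The sketch below is illustrative only: the `Candidate` record, its field names, and the example values are hypothetical, and the thresholds are simply the ones stated above. It reports which criteria a candidate fails rather than giving a yes-or-no answer, because we may relax individual criteria for samples of exceptional interest.

```python
from dataclasses import dataclass

# Minimum variables named in criterion 4 (the field names are our own shorthand).
REQUIRED_VARIABLES = {"age", "sex", "marital_status", "occupation", "birthplace"}


@dataclass
class Candidate:
    """Hypothetical summary of one country's census samples."""
    country: str
    freely_distributable: bool   # criterion 1
    net_undercount_pct: float    # criterion 2 (worst census)
    min_sample_cases: int        # criterion 3 (smallest sample)
    variables: set               # criterion 4
    census_years: int            # criterion 5
    expert_support: bool         # criterion 6


def unmet_criteria(c):
    """Return the names of the criteria the candidate fails (empty = meets all six)."""
    checks = [
        ("public accessibility", c.freely_distributable),
        ("data quality", c.net_undercount_pct <= 12.0),
        ("size", c.min_sample_cases >= 100_000),
        ("key variables", REQUIRED_VARIABLES <= c.variables),
        ("chronological depth", c.census_years >= 2),
        ("expert cooperation", c.expert_support),
    ]
    return [name for name, ok in checks if not ok]


strong = Candidate("Exampleland", True, 4.5, 250_000,
                   REQUIRED_VARIABLES | {"income"}, 3, True)
weak = Candidate("Samplestan", True, 6.0, 80_000,
                 REQUIRED_VARIABLES, 1, True)

print(unmet_criteria(strong))  # []
print(unmet_criteria(weak))    # ['size', 'chronological depth']
```

A candidate with a short list of unmet criteria is not automatically excluded; the list identifies where a compromise would have to be justified, for example on grounds of geographic diversity.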

Table 1 lists the countries and census years we plan to include in Phase I. We will begin with these countries because they meet all of the criteria listed above and are ready to start work immediately. Each of the samples listed will have a density of at least 1-in-100 except for the survey data from the United States and Brazil. In most cases the sample densities will be much larger.

Table 1. Phase I Countries in the International Database

Country           Census years and surveys

Australia         1971, 1981, 1991, 1996, 2001
Brazil            1960, 1970, 1980, 1991, 2000; annual surveys, 1975-2000+
Canada            1871, 1901, 1971, 1981, 1991, 1996, 2001
Colombia*         1964, 1973, 1985, 1993, 2000
China             1982, 1990
Great Britain     1851, 1881, 1991, 2001
Mexico            1960, 1970, 1990, 2000
Norway            1801, 1865, 1875, 1891, 1900, 1910, 1920, 1970, 1980, 2000
United States     Decennial 1850-2000 (except 1890), monthly surveys 1964-2000+

* We have a separate pending proposal to NICHD to fund the construction of the Colombian data series, and it is not included in the present proposal.

Although we have not finalized our selection of Phase II countries, we tentatively plan to include the Philippines, Hungary, Indonesia, Italy, Chile, France and Ghana. Likely candidates for inclusion in Phase III include Kenya, Zimbabwe, South Africa, Zambia, Egypt, Turkey, Argentina, Costa Rica, Ecuador, Spain and India (which will require an act of parliament). We will consider alternate countries for Phases II and III as we learn more about them in the course of carrying out the world microdata inventory in the first year of the project.

4. Variable Design and Data Transformations. The design of the new database will take place in collaboration with data experts from the United States and our international partners. We held an initial planning meeting in Chicago in November 1998, and a second meeting will take place in Ottawa in May 1999, funded by the Canadian Economic and Social Research Council. We have requested funds in this proposal for additional planning conferences with our partners and other experts in years 1, 4 and 7 of the project, and have also requested travel funds sufficient to meet individually with experts in each country at least once during the project period. In addition, we will take advantage of web-based collaboratories and Internet meeting technology to stay in close touch with the entire group throughout the project.

The international census samples employ differing numeric classification systems and reconciliation of these codes is a major part of this project. The IPUMS has provided us invaluable experience in designing common variable classification schemes. The new database will pose additional challenges, however, because the process of reconciliation must extend across countries. The international dimension of the database requires careful attention to differing cultural meanings of questions and responses, and involves comparing sometimes strikingly different systems of classification. It is here that close collaboration with our international partners is especially critical.

Variable design often influences the analytical strategies adopted by researchers, and we must therefore develop our plans with care. We have two competing goals. On one hand, we want to keep the variables simple and easy to use for comparisons across time and space. This requires that we provide the lowest common denominator of detail that is fully comparable, with underlying complexities transparent to the user. On the other hand, we must retain all meaningful detail in each sample, even when it is unique to a single dataset.

We will employ several strategies to achieve these competing goals. In some cases, the original variables are compatible and their recoding into a common classification is straightforward. The documentation will note any subtle distinctions a user should be aware of when making comparisons. For most variables, however, it is impossible to construct a single uniform classification without losing information. Some samples provide far more detail than others, so the lowest common denominator of all samples inevitably loses important information. In these cases, we will construct composite coding schemes. The first one or two digits of the code will provide information available across all samples. The next one or two digits provide additional information available in a broad subset of samples. Finally, trailing digits provide detail only rarely available. The data access system will guide researchers to use only the level of detail appropriate for the particular cross-national or cross-temporal comparisons they are making. All data from the original enumerations will nevertheless be available to researchers who wish to use it.
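
The composite coding idea can be illustrated with a minimal sketch. The digit widths (2 + 2 + 2), the level names, and the example code value are hypothetical, chosen only for illustration; the actual widths would be decided variable by variable.

```python
# Hypothetical composite code layout: 2 digits shared by all samples,
# 2 more digits available in a broad subset, 2 trailing digits rarely available.
LEVEL_WIDTHS = {"general": 2, "intermediate": 4, "detailed": 6}
LEVEL_ORDER = ["general", "intermediate", "detailed"]


def truncate_code(code: str, level: str) -> str:
    """Keep the leading digits for the requested level and zero-fill the rest."""
    full_width = LEVEL_WIDTHS["detailed"]
    if len(code) != full_width or not code.isdigit():
        raise ValueError(f"expected a {full_width}-digit code, got {code!r}")
    return code[:LEVEL_WIDTHS[level]].ljust(full_width, "0")


def common_level(sample_levels):
    """Return the finest level of detail that every selected sample supports."""
    return LEVEL_ORDER[min(LEVEL_ORDER.index(level) for level in sample_levels)]


if __name__ == "__main__":
    # Two samples carry full detail, one carries only intermediate detail.
    level = common_level(["detailed", "intermediate", "detailed"])
    print(level, truncate_code("123456", level))
```

A data access system along these lines could apply `common_level` to the samples a user has selected and recommend codes truncated to that level, while still offering the untruncated codes on request.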

In some cases, incompatibilities across samples are so great that the composite coding scheme is significantly more cumbersome than the original variable coding design. In these cases, we will develop alternate versions of the variables suitable for particular comparisons across time and space. The data access system will recommend the most appropriate version of each variable to researchers based on user profile and the particular combination of datasets they are using. We anticipate that this approach will be needed more often in the international context than it was in the construction of the original IPUMS. Where feasible, we will base our coding designs on United Nations coding systems. For geographic variables, we will generally conform to the standard of the country.

Most data transformations are simple recodes of one value into another. As in the case of the original IPUMS, we will develop data transformation matrices for each variable which provide information on the location of the original variable in each sample, each original data value, and each new standardized data value. These matrices will be maintained in a standard relational database. The actual recoding operations, however, will be carried out with a C program operating as a sequential batch process, since that is the most efficient approach with respect to both storage and speed. In many instances, it is necessary to use information from more than one variable in the original to construct a new compatible variable. For example, one might need information on both province and subdistrict to identify a metropolitan area. Data transformation matrices can sometimes handle such complex transformations, but in other cases we will have to resort to customized programming solutions.
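
As an illustration only, the following sketch shows how a transformation matrix might drive a sequential recode. It is written in Python for brevity rather than in C, and the sample names, column positions, code values, and the "unknown" code are all hypothetical.

```python
# Hypothetical transformation matrix for one variable (marital status).
# Each row: (sample, zero-based start column, width, original value, standardized value).
MATRIX = [
    ("sample_a_1990", 10, 1, "1", "1"),   # married       -> 1
    ("sample_a_1990", 10, 1, "2", "2"),   # never married -> 2
    ("sample_b_1980", 22, 2, "01", "2"),  # single        -> 2
    ("sample_b_1980", 22, 2, "02", "1"),  # married       -> 1
]

UNKNOWN = "9"  # hypothetical code for values absent from the matrix


def build_recoder(matrix, sample):
    """Return a function that maps one fixed-width record to a standardized value."""
    rows = [row for row in matrix if row[0] == sample]
    if not rows:
        raise KeyError(sample)
    start, width = rows[0][1], rows[0][2]
    table = {original: new for _, _, _, original, new in rows}

    def recode(record: str) -> str:
        return table.get(record[start:start + width], UNKNOWN)

    return recode


def recode_file(lines, recode):
    """Sequential pass over the file: one record in, one standardized value out."""
    for line in lines:
        yield recode(line.rstrip("\n"))
```

Transformations that depend on more than one original variable (the province and subdistrict example) would need either a matrix keyed on several fields or a custom function in place of `recode`.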

The greatest challenge in a project of this sort is coping with the mass of detail. The original IPUMS project required some 130,000 data transformations; we anticipate that the new project will involve over a million. Each transformation must be planned, executed, checked, rechecked and documented. This work accounts for almost half of the total effort required for the project.

5. Additional Data Processing. In addition to recoding variables to maximize comparability, we will carry out additional processing to enhance usability. Some procedures are straightforward, such as the addition of compatible variables on serial number, census year, country code, size of unit, and case weights. Others are more complicated; some examples follow.

  1. Constructed Family Interrelationship and Household Composition Variables. One of the greatest contributions of the IPUMS to the original U.S. census files was the creation of family interrelationship variables in all years. We will construct similar variables for the international database. A system of logical rules identifies the record number within each household of every individual’s mother, father, or spouse, if they were present in the household. These "pointer" variables allow users to attach the characteristics of these kin or to construct measures of fertility and family composition. For example, use of the spouse pointer variable makes it easy for users to identify spouse’s income for each married person in the census. Because of variations across countries in the information available for identifying family interrelationships and in the cultural meaning of marriage (e.g., the high frequency of consensual unions in Latin America and of cohabitation in Scandinavia), we plan to revise the logic of the family interrelationship variables for the international database.

     We will also construct a wide variety of fully compatible variables describing family and household characteristics at the individual and household level. Some of these variables (such as family and subfamily membership, family and subfamily size, and number of own children) are incorporated in the existing IPUMS. For the new database, we will design new constructed variables to describe household and family composition in ways that reflect the diversity of family forms across countries.

  2. Missing Data Allocation. We will allocate missing and inconsistent values in all datasets that require it. Missing and inconsistent values are routinely replaced with allocated values in recent U.S. census data, by means of logical edits and probabilistic "hot deck" imputation procedures. For example, if sex is missing or illegible it is edited by logical inference from the family relationship field or based on the sex of a spouse. If such logical editing is not possible, probabilistic methods are used. For each variable, there is a series of criteria for matching a "donor" record used to impute the missing value. The donated value is then subjected to consistency checks and is rejected if unsuitable. A data-quality flag identifies allocated data items.

     Allocation of missing and inconsistent data significantly increases the precision of sample estimates and makes the samples simpler to use. Missing data allocation is not, however, routinely incorporated in non-U.S. microdata. We have considerable experience with these methods, as we have already adapted them to edit missing and inconsistent data items in the U.S. censuses of the period 1850-1920 as part of the existing IPUMS project (Ruggles and Sobek 1998, volume 3). We will modify the procedures to suit each individual sample in the international database, document the procedures fully, and allow users to eliminate allocated cases with a simple selection in the data access system.

  3. Confidentiality Protection and Sampling Procedures. All publicly accessible census microdata files are designed to protect the confidentiality of individuals. Countries have different standards, but in all cases names and detailed geographic information are suppressed and top-codes are imposed on variables such as income that might identify specific persons. Some countries take additional steps, such as "blurring" a small percentage of geographic information or randomizing the sequence of cases so that detailed geography cannot be inferred from file position.

Many datasets we will be working with will already have been subjected to confidentiality procedures by the national agency that created the files, and in these cases we will not need to take any additional steps. In other cases, however, we will be working with the original 100 percent machine-readable census returns, from which we will draw a nationally representative sample of specified density. In such cases, we will work closely with each country’s statistical office to ensure full confidentiality of all files before they are made public. We will work to develop new methods to maximize the available detail while maintaining full confidentiality.

When we must draw a new sample, we will use a multistage ratio estimation procedure to select cases within strata defined by geography, household size, household composition, and other key variables. The final sample will have equal weights across cases, except in cases where there is a clear need to oversample a minority population. Our goal will be to maximize sample precision while maintaining ease of use. In general, we will model our procedures on the design used to create the 1980 U.S. Public Use Microdata Sample, as described in Ruggles and Sobek (1998: Volume 3).
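
Returning to the missing data allocation described earlier in this section, the general shape of a sequential "hot deck" can be sketched as follows. This is a simplified illustration, not the procedure used for any actual census: the matching keys, the consistency check, and the flag naming are assumptions made for the example.

```python
def hot_deck(records, target, match_keys, is_consistent=lambda rec, value: True):
    """Fill missing `target` values from the most recent donor in the same cell.

    A "cell" is the combination of values on `match_keys`. Each returned record
    carries a flag `<target>_flag` (0 = as enumerated, 1 = allocated).
    Records with no suitable donor are left missing with flag 0.
    """
    last_donor = {}  # cell -> most recently seen reported value
    out = []
    for rec in records:
        rec = dict(rec)  # do not modify the caller's records
        cell = tuple(rec[key] for key in match_keys)
        flag = 0
        if rec.get(target) is None:
            donated = last_donor.get(cell)
            # The donated value is rejected if it fails the consistency check.
            if donated is not None and is_consistent(rec, donated):
                rec[target] = donated
                flag = 1
        else:
            last_donor[cell] = rec[target]
        rec[target + "_flag"] = flag
        out.append(rec)
    return out
```

Because every allocated item is flagged, a user who prefers to exclude allocated cases can simply drop records where the flag equals 1.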

6. Metadata. The microdata are of little use without adequate metadata to interpret them. The design of an integrated documentation system is central to the project, and poses the greatest challenge. We will provide comprehensive documentation on each of the samples included in the database. Thoroughness is essential, and we will be adding significantly to the coverage of even the existing public-access samples. Our preliminary estimate is that the database will require the equivalent of 14,000 pages of documentation. We will develop a variety of tools designed to enhance the usability of this vast quantity of information.

The metadata system will limit the scope of information to only those elements relevant to a given research project, as defined by the user, essentially creating a customized codebook for each research project. Comparability discussions, for example, will cover only the specific samples requested by the user. In this fashion, we anticipate that we can provide documentation that devotes as much attention to subtle problems of comparability as does our current documentation, but which for most studies will actually be briefer than the current system.

In the existing IPUMS, the bulk of the variable descriptions consist of discussions of comparability. We highlight important differences and provide warnings about likely errors and strategies for enhancing compatibility for specific comparisons. For many variables, these discussions are quite long, extending up to several thousand words. This format is already unwieldy; as we extend the database from 13 censuses to hundreds of samples, it will become completely impractical.

In light of the scale of the new database, we plan a different design. Every variable will have a core archetypal description. We will supplement this description with a series of comparability discussions describing the differences between each dataset and the archetypal model. Users will specify which datasets they are interested in before they enter the documentation system and only the relevant discussions will appear. Users will also be able to add and remove samples while navigating the system, with the documentation adjusting itself accordingly. This approach will allow us to pare the comparability discussions down to a manageable scale. We also plan to craft brief essays directly addressing the most common chronological and international comparisons that users are likely to make. For example, users focussing on multiple census years within a particular country will be given a discussion designed especially for that purpose. The process of writing this documentation is demanding intellectual labor, but it is critical to ensure the intelligent use of the database. Our international and domestic partners will collaborate in the development of these materials, and will review all documentation relevant to their expertise.
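
A minimal sketch of this design, assuming one archetypal description per variable and one comparability note per (variable, sample) pair. The variable name, sample names, and note texts are placeholders, not actual documentation.

```python
# Placeholder metadata: one archetypal description per variable, plus
# comparability notes keyed by (variable, sample).
ARCHETYPE = {
    "MARST": "Marital status: the person's current marital status.",
}
NOTES = {
    ("MARST", "sample_a"): "Consensual unions are reported as a separate category.",
    ("MARST", "sample_b"): "Divorced and separated persons are combined.",
    ("MARST", "sample_c"): "Asked only of persons above a minimum age.",
}


def build_codebook(variables, samples):
    """Assemble documentation restricted to the samples the user selected."""
    sections = []
    for var in variables:
        lines = [var, ARCHETYPE[var]]
        for sample in samples:
            note = NOTES.get((var, sample))
            if note:
                lines.append(f"  [{sample}] {note}")
        sections.append("\n".join(lines))
    return "\n\n".join(sections)


if __name__ == "__main__":
    print(build_codebook(["MARST"], ["sample_a", "sample_c"]))
```

Adding or removing a sample while browsing amounts to calling `build_codebook` again with the revised sample list.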

In the present IPUMS documentation, we provide marginal frequencies for every variable in large tables, with each column representing a different census year. In the new database, this will no longer be practical; instead, we will generate customized tables giving marginal frequency distributions restricted to the particular datasets under analysis.

Extensive variable discussions are not sufficient in themselves. As in the case of the existing IPUMS, we will provide a wide collection of supporting information to aid in the interpretation of the data. Users will often require access to information from the original census collection, so we plan to include facsimiles of census forms and enumerator instructions, and procedural histories of each census. Other elements from the original sample creation include full detail on sample designs and sampling errors, complete documentation of error correction and any other post-enumeration processing, and analyses of data quality, such as post-enumeration surveys. We will provide images of census forms, maps, and any other documentary elements not readily presentable in text format. Acquisition of such information is an important part of the inventory process in the first year of the project, and its availability is among the criteria for inclusion in the database. Where the original documentation is in another language, we will translate the most essential material into English. Where foreign-language material is extensive, however, we will provide English-language summaries as well as the full text in the original language.

The documentation system will also describe all data transformations we perform on the original samples to generate the integrated database. This documentation will include the actual computer code, the transformation matrices detailing specific variable recodes, and a textual description of the data manipulation process. Since we lose no information from the original data and document all changes, it will be theoretically possible for a user to reverse-engineer all our transformations for a given variable to reconstruct the original data.

Since the amount of material will be extensive, we will implement advanced automated search features in the documentation system. Users will be able to search by keywords and concepts across variable descriptions and all of the various elements of the metadata, including census forms, enumerator instructions, programming descriptions, and topical essays. Rich hypertext links throughout the documentation system will allow nonlinear access to information based on the user's needs and interests. All documentation will be available on the web and on CD-ROM or DVD-ROM, and we will also make a self-extracting downloadable version so users can easily install the documentation system on their desktop computer.

7. Data Access System. Access to the metadata and microdata will be integrated in a web-based data dissemination system. The IPUMS project provides an invaluable model for electronic dissemination, but we anticipate that new challenges arising from the international context will require fresh solutions.

There are two very different models for access to large microdata files on the Internet: the centralized model and the distributed model. In the centralized model, users request crosstabulations from a remote server and then simply download the results. This model is useful because users need no specialized software to obtain basic results even where computing resources and skills may be limited. Also, researchers can work with large datasets despite slow Internet connections. The problem with the centralized approach, however, is that analysts cannot do sophisticated data transformations or employ advanced statistical methods. Since all serious social science research requires such capability, this model is simply not appropriate for any but the most basic data analysis.

The dramatic rise in the performance of inexpensive desktop computers, along with the rapidly increasing capacity of the Internet, has allowed an alternative approach for accessing large databases on the Internet. Rather than centralized tabulation, we have chosen to develop our system on the distributed model. Thus, the server provides actual individual-level data, which the researcher downloads for processing on a desktop workstation. This requires greater sophistication on the part of users because they must be able to use a statistical package, such as SPSS, SAS, or Stata. The great advantage of this model is that it imposes no limits on the sophistication of data manipulation or statistical techniques available to the user.

We have already developed the most powerful web-based data extraction system available for access to large microdata files. The system is designed to be flexible and expandable, so it is a suitable foundation for the much more sophisticated data access system we envision for the new database. The recursive design of our web interface software makes system-wide modifications easy to implement. The system architecture has two key advantages over any existing alternative. First, every page of our web interface is constructed dynamically by means of Perl and JavaScript; there are no HTML pages in the entire extraction system. This means that the content of every page is customized to the needs of particular users. For example, if a user selects censuses only for the year 1990, then s/he will only be offered choices of variables present in that year. As the new database expands from 25 to 650 different samples, this ability to filter out extraneous information will become increasingly important. We plan to expand this concept of dynamically constructed pages from microdata access to metadata access, so users browsing documentation will only be offered information relevant to their analyses.

The second advantage of our approach is even more important. Unlike every other web-based system for large-scale data extraction, we do not rely on a statistical package—SAS or SPSS—to handle the data. Instead, we have programmed our own extraction engine. This is an absolutely essential prerequisite for the data access system we envisage. It allows us to maximize the speed and efficiency of data extraction by means of techniques such as the inversion of data matrices. Even more fundamental, this approach allows us to take full advantage of the hierarchical structure of census data.

The census microdata samples are simultaneously samples of households and of individuals, and within households the interrelationships among individuals are known. This hierarchical structure is one of the greatest strengths of census microdata files. By combining the characteristics of multiple individuals within a household, researchers can create a wide range of new variables about family and household composition and the characteristics of family members. Other web-based extraction systems, because of their reliance on statistical packages to do the data extraction, can only work with flat files. Thus, for example, the Census Bureau’s system does not even allow users to simultaneously choose household- and person-level characteristics in the same extract. Unlike other systems, we offer users the option of rectangular or hierarchical output files, and offer household, family or individual case selection based on individual-level characteristics. We plan to add two additional ways to make it easier for researchers to exploit the information embedded in the hierarchical structure of the data:

  1. A procedure for attaching characteristics of household heads, family heads, subfamily heads, spouses, own mothers, and own fathers to each individual’s record. For example, the system will allow analysts of marriage to create new variables describing spouse’s age or spouse’s birthplace. Similarly, analysts of school attendance will be able to attach information on father’s and mother’s income to the record of each child.
  2. A procedure for counting the number of persons within the household, family, subfamily, or group of own children with any combination of characteristics. Thus demographers using own-child fertility methods will be able to calculate the number of own children of each age for every mother, and to attach that information as new variables to the mother’s record. Similarly, economists will be able to construct new variables describing the number of employed co-residing kin. The system will also sum numeric characteristics (e.g., income or property) across households, families, subfamilies, or own children. This system can be used to construct virtually any conceivable measure of household composition. By making complex variables easy to create, this procedure has the potential to open up advanced analytical strategies to a broad range of users.
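
The two procedures above can be sketched as follows, assuming each person record carries a within-household record number and pointer fields for spouse and mother. The field names (`pernum`, `sploc`, `momloc`) and the example household are illustrative and are not taken from this proposal.

```python
def attach_from_pointer(household, pointer, source, new_name):
    """Copy `source` from the relative identified by `pointer` onto each record.

    `pointer` holds the within-household record number of the relative,
    or 0 when that relative is not present in the household.
    """
    by_number = {person["pernum"]: person for person in household}
    for person in household:
        relative = by_number.get(person[pointer])
        person[new_name] = relative[source] if relative else None
    return household


def count_own_children(household, max_age, new_name):
    """Count, for each person, co-resident own children aged `max_age` or younger."""
    for person in household:
        person[new_name] = sum(
            1
            for other in household
            if other["momloc"] == person["pernum"] and other["age"] <= max_age
        )
    return household


if __name__ == "__main__":
    household = [
        {"pernum": 1, "sploc": 2, "momloc": 0, "age": 40, "income": 100},
        {"pernum": 2, "sploc": 1, "momloc": 0, "age": 38, "income": 200},
        {"pernum": 3, "sploc": 0, "momloc": 2, "age": 4, "income": 0},
    ]
    attach_from_pointer(household, "sploc", "income", "sp_income")
    count_own_children(household, 4, "nchild_under5")
    for person in household:
        print(person)
```

Summing a numeric characteristic across a family would follow the same pattern, with `sum` applied to the characteristic rather than to a count of matching records.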

Our data access system has one additional advantage over other existing data extraction software: it is substantially easier to use. Our chief designers, Todd Gardner and Steven Ruggles, are heavy users of the system as well as programmers. Moreover, before we ever began work on automated data extraction software, we had a decade of experience creating thousands of extracts the old-fashioned way—with customized Fortran programs—for both novices and advanced users. This experience, combined with three years of user feedback on the IPUMS extraction system, has given us a good sense of what researchers of all levels of sophistication need, and has allowed us to build a system which is transparently simple to use but which incorporates powerful features.

As we expand the system to accommodate the new database, we will make every effort to ensure that we keep it user friendly. Indeed, our goal is to make the new system even easier to use than the existing IPUMS model. Given the far greater complexity of the new database, however, we will have to make substantial innovations to ensure that access remains easy. To take one example, we currently present the available variables as a simple list, either alphabetized or subject classified. This will no longer be practical in the new system, since the number of variables will grow from about 300 (excluding data-quality flags) to over 2,500. Therefore, we will develop new tools for navigating the variable list. Users will be able at any time to limit the variable list by keyword or by subject area. They will be able to reduce the list to only those variables that appear in all samples under study or to expand it to include all variables in any sample under study. We will provide reduced tables of the variables most commonly requested, as determined through analysis of extract logs. In cases where there are multiple variables in the same subject area (such as the occupation and industry variables), we will write a brief "usage" discussion for each variable explaining when it is the best choice, and when alternate variables would be better suited.
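
The reduce and expand options for the variable list correspond to a set intersection and a set union over the selected samples. A minimal sketch, with hypothetical sample and variable names:

```python
# Hypothetical availability table: which variables each sample contains.
AVAILABILITY = {
    "sample_a": {"AGE", "SEX", "MARST", "INCOME"},
    "sample_b": {"AGE", "SEX", "MARST"},
    "sample_c": {"AGE", "SEX", "OCC"},
}


def variable_list(samples, mode="all"):
    """List variables present in all selected samples ("all") or in any ("any")."""
    if mode not in ("all", "any"):
        raise ValueError(f"unknown mode: {mode!r}")
    sets = [AVAILABILITY[sample] for sample in samples]
    if not sets:
        return []
    combined = set.intersection(*sets) if mode == "all" else set.union(*sets)
    return sorted(combined)


if __name__ == "__main__":
    print(variable_list(["sample_a", "sample_b"], mode="all"))
    print(variable_list(["sample_a", "sample_c"], mode="any"))
```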

With each extract, users will have the option of obtaining a full set of customized documentation text, including the relevant variable descriptions, comparability discussions, marginal frequencies, and enumeration instructions. In addition to documentation designed for humans to read, we will generate a variety of customized metadata designed to be read by computer software. First, we will offer data definition files for SAS, SPSS, and Stata, the leading statistical analysis programs, tailored for each data extract. We will also create customized codebooks marked up according to the Data Documentation Initiative XML Document Type Definition metadata standard, which is currently in beta-test (see http://www.icpsr.umich.edu/DDI/codebook.html). We will monitor developments in this field closely, and adopt any new metadata standards that gain widespread acceptance.

The new system will need a number of additional capabilities to meet the requirements of participating countries. To comply with confidentiality regulations, we will need a more sophisticated security system than we have now. A few countries, such as Britain, will require users to sign agreements specifying conditions of use before they are authorized to access microdata. British authorities will be responsible for maintaining the list of authorized users, but we will have to implement a secure procedure for authenticating identities. Other countries will accept an electronic rather than a physical agreement on conditions for use, but the conditions vary from country to country. Some countries require only that users promise not to attempt to identify individuals. Some prohibit any use of the data except for scholarly research. Others want us to maintain a list of the names, addresses, and affiliations of users. Therefore, before an individual is authorized to use microdata for a particular country, the system will require the user to agree to any conditions of use for that country and to fill out any necessary electronic forms.

The new extract system will be slightly more complicated than the current one because of changes in the way we will store the data. We now store the data in a uniform record layout, in hierarchical column-format ASCII files. This somewhat wasteful approach was designed before the extraction system existed, because the uniform record layout was an essential component of sample compatibility. But users will access the new database exclusively through the data access system, which will have the capability to construct on the fly multi-sample data files with uniform record layouts. Thus, we will redesign our internal file structures to maximize efficiency with respect to data processing and storage, and this will require additional capabilities of the data extract engine.

The new system will offer several alternative output file formats. At present, all extracts are created as column-format Unix-compressed ASCII files with either rectangular or hierarchical structure. We plan to study a variety of alternative formats to determine which ones are in greatest demand. At a minimum, we will offer users the options of delimited files and alternate compression formats; if sufficient demand exists, we will explore various specialized formats, such as SPSS portable files.
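
The difference between hierarchical and rectangular output can be shown with a short sketch. It assumes a hypothetical layout in which the first character of each record is its type ("H" for household, "P" for person) and each household record is followed by the records of its members.

```python
def rectangularize(lines):
    """Turn a hierarchical file into rectangular records.

    Each output record is a person record prefixed with the fields of the
    household record that precedes it (record-type characters are dropped).
    """
    household = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("H"):
            household = line[1:]
        elif line.startswith("P"):
            if household is None:
                raise ValueError("person record before any household record")
            yield household + line[1:]


if __name__ == "__main__":
    hierarchical = ["H001MN\n", "P35F\n", "P04M\n", "H002WI\n", "P61M\n"]
    for record in rectangularize(hierarchical):
        print(record)
```

The rectangular form repeats household information on every person record, which is convenient for statistical packages but larger; the hierarchical form stores it once.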

The flexibility of the extraction system means that advanced users will create complex dataset definitions with differing countries, years, case selections, and constructed customized variables. Accordingly, we plan to add the capability for users to save, retrieve, and modify dataset definition files created by the extract query system. Our experience has revealed that users frequently need to replicate their extracts multiple times with only minor modifications. Moreover, investigators often want to recreate precisely the same dataset they used in a prior analysis. Users will have the option of storing their dataset definition files on our server or downloading them to their own system. In either case, the dataset definition files may be read back into the extract interface software at a later date. This will automatically fill in all the dialog boxes, which users can then modify and submit as a new extract.
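
One possible shape for a dataset definition file is sketched below. The JSON format, the field names, and the example values are assumptions for illustration; the proposal does not specify a file format.

```python
import json


def dump_definition(definition: dict) -> str:
    """Serialize a dataset definition so it can be stored or downloaded."""
    return json.dumps(definition, indent=2, sort_keys=True)


def load_definition(text: str) -> dict:
    """Read a definition back in, checking that the required fields are present."""
    definition = json.loads(text)
    missing = {"samples", "variables"} - set(definition)
    if missing:
        raise ValueError(f"definition is missing fields: {sorted(missing)}")
    return definition


def modified(definition: dict, **changes) -> dict:
    """Return a copy of a saved definition with a few fields changed."""
    updated = dict(definition)
    updated.update(changes)
    return updated


if __name__ == "__main__":
    original = {
        "samples": ["sample_a", "sample_b"],   # hypothetical sample names
        "variables": ["AGE", "SEX", "MARST"],
        "case_selection": {"AGE": [15, 49]},
        "structure": "rectangular",
    }
    saved = dump_definition(original)
    rerun = modified(load_definition(saved), structure="hierarchical")
    print(dump_definition(rerun))
```

Reloading the saved text reproduces the original request exactly, and `modified` covers the common case of resubmitting an extract with only minor changes.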

Finally, the new system will require redesign of our extract engine and file structures to maximize efficiency. We are counting on substantial increases in the speed of our extract servers, but the increased scale of the database together with large growth in the number of users will necessitate the adoption of every available method for the improvement of extract efficiency. For example, we will organize files to optimize the most frequent population and variable selections, and we will store the most commonly requested data items in discrete files to speed access. In addition, we will routinely keep the highest-demand data in memory, so no disk access will be required. We anticipate that we will eventually be able to process all but the largest data requests within a few minutes.

8. Project Management and Responsibilities. The project retains the core staff of the original IPUMS project, and adds two new co-principal investigators with expertise in international microdata. The complexity of the endeavor is substantial: we must carry out a million data transformations, write the equivalent of 14,000 pages of documentation, and coordinate activities of dozens of partners at other institutions. Accordingly, tightly integrated management will be essential.

The principal investigator and co-principal investigators will work closely together, with weekly meetings and daily interaction. Although all five investigators will share responsibility for the entire project, each will focus on a different aspect of project management.

9. Domestic and International Partners. Although our core staff has broad experience with U.S. and international microdata, a project of this complexity could not be contemplated without the close collaboration of experts from each participating country. Our partners will:

  1. Gather data and documentation for each census, which in some cases will involve substantial research into unpublished statistical agency documents;
  2. Translate key documents, including enumeration instructions, descriptions of enumeration procedures, post-enumeration survey data, descriptions of sample designs, and so on;
  3. Write essays discussing changes in the census or survey instruments and assessing comparability issues;
  4. Evaluate both documentation and data transformations carried out at Minnesota;
  5. Set up mirror sites for the data access system.

Not every partner, of course, will carry out each of these tasks. In countries with existing public-use data files, the necessary documentation often exists already. No translation is necessary for English-speaking countries, and we will do translations in Minnesota for the Spanish-speaking countries and for several other languages. Only a few of our partners will set up mirror sites. In every case, however, we will contract with national experts to evaluate and correct the work we do in Minnesota.

Our partners include representatives of the statistical agencies of every Phase I country we are working with. We are also working with many of the leading national and international organizations involved in the production, analysis, and dissemination of census microdata. Some examples include:

Other notable organizational partners include the African Census Analysis Project (ACAP), directed by Prof. Tukufu Zuberi; the U.S. Bureau of the Census; the Institut National de la Statistique et des Etudes Economiques, the agency responsible for collecting French statistical data; Mexico’s Instituto Nacional de Estadistica, Geografia e Informatica; Statistics Canada; and the Cathie Marsh Centre for Census and Survey Research (CCSR) of the University of Manchester. In addition, we are consulting with leading experts on international microdata, including Massimo Livi-Bacci, a past president of the International Union for the Scientific Study of Population; Zeng Yi, a Chinese demographer with appointments at Duke University, the Max Planck Institute, and the Institute of Population Research at Peking University; Nikolai Botev of the United Nations Population Activities Unit in Geneva; and Susan DeVos of the Center for Demography and Ecology at the University of Wisconsin.

10. Evaluation. In each year of the project we will carry out a thorough evaluation of the project’s effectiveness at reaching the international academic community. The two basic criteria in this assessment will be the quantity of data distributed and the number of users of the data access system. A third criterion in our self-assessment will be the number of publications that use the database and the number of citations. Due to the length of the publication cycle, there is a considerable delay before a new work generates substantial citations, but by the second year after the data access system goes on line, we expect to see growing evidence of usage in the citation record.

We will also analyze the patterns of use among academic disciplines and by U.S. and foreign researchers. We will create a log file that records every step that researchers take, whether they are browsing documentation or extracting data. We will then analyze these log files statistically to assess patterns of use. We will use this information to reveal which areas of the data access system and the documentation deserve the greatest attention. We are especially interested in monitoring patterns of international comparison, so that we can focus our variable comparability discussions on the highest priority issues.
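To make the monitoring of international comparisons concrete, the sketch below counts how often each pair of countries appears together in a single extract request. It is a minimal illustration in Python; the log record layout (a dictionary with a `countries` list) and the country codes are assumptions, since the proposal does not describe the log format.

```python
from collections import Counter
from itertools import combinations


def country_pair_counts(extract_log):
    """Count how often each pair of countries is requested in the same extract."""
    counts = Counter()
    for entry in extract_log:
        # Sort and de-duplicate so ("MX", "US") and ("US", "MX") count as one pair.
        countries = sorted(set(entry["countries"]))
        counts.update(combinations(countries, 2))
    return counts


# Hypothetical log entries: one record per extract request.
log = [
    {"countries": ["MX", "US"]},
    {"countries": ["US", "MX", "CA"]},
    {"countries": ["US"]},            # single-country extract: no pairs
]
pairs = country_pair_counts(log)
top_pair = pairs.most_common(1)[0]
```

The most frequent pairs would indicate where variable comparability discussions are likely to matter most.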

The development of our data access system is a collaborative enterprise with the user community. We will seek feedback continuously, and plan to carry out periodic surveys of users. After every 20th extract, each user will be asked to fill out an evolving questionnaire regarding their current research applications for the data, the strengths and weaknesses of the access system, and what new features would most aid their research.

In addition to user surveys, we expect to obtain regular feedback at the various academic conferences at which we present the database. When our assessment shows under-utilization by particular disciplines or groups who seem logical clients of the database, we will target the relevant conferences and post to the related list servers and web sites to advertise the availability and potential of the data.

11. Preservation and Sustainability. Long-run survival of the database beyond the project period is critical. Preservation of the data and documentation is the easy part. In addition to the digital storage facility of the University of Minnesota Library, we will deposit the completed database and machine-readable documentation in the Center for Electronic Records of the National Archives and Records Administration; the Inter-University Consortium for Political and Social Research (ICPSR); The Data Archive, which is the British national machine-readable archive in Colchester, England; and the Australian Consortium for Political and Social Research (ACPSR). Thus, we can be assured that our work on data and documentation will be permanently preserved.

Sustaining the data access system and maintaining the database so that new datasets are added on an ongoing basis is more difficult. Three institutions—The University of Wisconsin’s Center for Demography and Ecology, the University of Michigan’s Population Studies Center, and the University of Minnesota’s Social Science Research Facility—have made written commitments to maintain the data access system indefinitely. This is a good start, but it is not sufficient to guarantee that the system will continue to be upgraded. Our goal is to persuade a major permanent organization such as the National Archives, ICPSR, or the Association of Population Centers to take on the permanent responsibility for upgrading the software and adding new datasets as they become available. These organizations are highly interested in the project, but are unable to make firm commitments ten years in advance. We are confident, however, that we will be able to persuade a major permanent archive or consortium of data centers to take responsibility for the database. To make the task easier, we will design the data access system to be easily transportable to new platforms, document it well, and write a comprehensive manual of procedures for adapting new datasets to the database.

12. Schedule of Work and Deliverables.

During the first year of the project, we will carry out an inventory of surviving machine-readable census microdata around the world, write and disseminate a report on this inventory, develop a preliminary design for the principal variables describing individual-level characteristics, begin the consulting relationships with the seven Phase I countries, begin work on the data extract system, and prepare data transformations and documentation for Census 2000 and the American Community Survey (ACS). From the second year onwards, we will begin converting the international samples and the Current Population Survey (CPS) data into a standardized format, and begin developing the documentation system. Fact-checking will begin in the third year, as our first datasets and documentation are completed.

We anticipate releasing four major versions of the database during the ten years of the project. Our experience has shown that redesign is an ongoing process as datasets are added, so we plan major revisions on a three-year cycle. We will begin, however, by expanding the existing IPUMS database with no significant redesign by incorporating Census 2000, the American Community Survey, and selected Current Population Survey samples. We will release the Census 2000 and ACS data in this format as soon as the Census Bureau makes the data available (probably in 2002). The CPS will take longer; we plan to release a preliminary version of the March samples in September 2003, and the entire CPS data series in September 2007. On a parallel track, we will develop a preliminary version of the international database, and release it with data from the United States, Canada, Mexico, and China in September 2003. We plan to release a revised version of the database, incorporating all Phase I countries and at least four Phase II countries, in September 2006, and a second revised version containing all countries in 2009. The data access system will evolve continuously, but by the 2003 release we expect to have the basic features in place.

Results of Prior NSF Research

We have already described the success of the IPUMS project, which was funded by NSF; the most important of our NSF awards for the project was "Integrated Public Use Microdata Series," SBR-9118299, $464,913, 4/1992-10/1995. Many publications have resulted from this work; for examples, see Gardner 1995, 1998, 1999a, 1999b; Gardner, Sobek and Ruggles 1999; Ruggles 1993, 1994a, 1994b, 1995a, 1995b, 1996a, 1996b, 1997a, 1997b; Ruggles, Hacker and Sobek 1995; Ruggles and Menard 1995; Ruggles and Sobek 1995, 1998; Ruggles, Gardner, and Sobek 1996; Ruggles, Sobek, and Gardner 1996; Sobek 1996, 1997; Sobek and Ruggles 1999. For additional publications resulting from the IPUMS project, see http://www.ipums.umn.edu/~ipums98/research.html.

References Cited

Gardner, Todd. (1995). "Software development for the Public Use Microdata Samples." Historical Methods 28: 59-62.

Gardner, Todd. (1998). The Rise of the American Suburb, 1850-1950. Ph.D. Dissertation, University of Minnesota.

Gardner, Todd. (1999a). "Metropolitan classification for census years before World War II." Historical Methods (forthcoming).

Gardner, Todd. (1999b). "Suburbanization in the United States 1850-1940." Journal of Urban History (forthcoming).

Gardner, Todd, Matthew Sobek and Steven Ruggles. (1999). "The IPUMS data extraction system." Historical Methods (forthcoming).

Ruggles, Steven. (1993). "Historical demography from the census: applications of the American census microdata files," in Roger Schofield and David Reher (eds.) Old and New Methods in Historical Demography. Oxford: Oxford University Press, 383-393.

Ruggles, Steven. (1994a). "The origins of African-American family structure." American Sociological Review 59: 136-151.

Ruggles, Steven. (1994b). "The transformation of American family structure." American Historical Review 99: 103-128.

Ruggles, Steven. (1995a). "Sample designs and sampling errors in the Public Use Microdata Samples." Historical Methods 28: 40-46.

Ruggles, Steven. (1995b). "Family interrelationship coding in the Integrated Public Use Microdata Series." Historical Methods 28: 52-58.

Ruggles, Steven. (1996a). "The effects of demographic change on multigenerational family structure: United States Whites 1880-1980," in Alain Bideau, A. Perrenoud, K. A. Lynch, and G. Brunet (eds.) Les systèmes démographiques du passé. Lyons: Centre Jacques Cartier, 21-40.

Ruggles, Steven. (1996b). "Living arrangements of the elderly in America, 1880-1980," in Tamara K. Hareven (ed.) Aging and Generational Relations Over the Life Course: A Historical and Cross-Cultural Perspective. New York: Aldine de Gruyter, 254-271.

Ruggles, Steven. (1997a). "The rise of divorce and separation in the United States, 1880-1980." Demography 34: 455-466.

Ruggles, Steven. (1997b). "The effects of AFDC on American family structure, 1940-1990." Journal of Family History 22: 307-325.

Ruggles, Steven, J. David Hacker and Matthew Sobek. (1995). "Order out of chaos: General design of the Integrated Public Use Microdata Series." Historical Methods 28: 33-39.

Ruggles, Steven and Russell R. Menard. (1995). "The Minnesota Historical Census Projects." Historical Methods 28: 6-10.

Ruggles, Steven, Todd Gardner and Matthew Sobek. (1996). "Disseminating historical census data on the World Wide Web." IASSIST Quarterly 20: 4-18.

Ruggles, Steven and Matthew Sobek. (1998). IPUMS-98: Integrated Public Use Microdata Series (5 volumes).

Ruggles, Steven, Matthew Sobek and Todd Gardner. (1996). "Distributing large historical census samples on the Internet." History and Computing 9: 145-159.

Sobek, Matthew. (1996). "Work, status, and income: Men in the American occupational structure since the Nineteenth Century." Social Science History 20: 169-207.

Sobek, Matthew. (1997). Occupational Structure and the Labor Force in the United States, 1880-1990. Ph.D. Dissertation, University of Minnesota.

Sobek, Matthew and Steven Ruggles. (1999). "The IPUMS project: An update." Historical Methods (forthcoming).