spssread.pl

A Perl script to extract metadata from SPSS data files

On this page, I describe and make available for download a perl script, spssread, which can be used to extract metadata from SPSS data files.


Background

The native file format for SPSS data files is the SAV file. This is a binary file format containing both data and "dictionary" information pertaining to the dataset, including variable labels and value labels. Many statistical packages and other commercial third-party tools offer methods of converting SPSS data files into different formats. However, in addition to being costly, these tools may not always be as flexible as a given situation requires. Since source code is not available, it is not often possible to extract the required information in the most easily accessible format.

In the SPSS Developer's Guide (available on the product CD-ROM or via download from the support section of their website), SPSS provides details on how to use a royalty-free Windows DLL to programatically access the contents of a SAV file. In addition to requiring significant programming effort, this is limited to being of use solely within the win32 environment. Still, this is the only SPSS-approved way of accessing SAV files, with the exception, naturally, of the SPSS program itself. It is likely that the third-party conversion tools were developed using this DLL.

The only official documentation of the SAV file format specification that is publicly-available is an eight-page chapter from a very old manual (the latest version of SPSS referenced in the document is version 4.0). This document is available on Wotsit's format (search for "SAV").

Additional useful information can be found at the website of the ambitious PSPP project, an open source clone of SPSS.

Using these sources, I wrote a program to parse and print the metadata stored in SAV files. A first pass was made several years ago in C, but this release has been re-written in perl for beauty and simplicity.


Download


Instructions for use

You must have a perl interpreter installed. On unix-ish systems, check for /usr/bin/perl. On Windows, try ActiveState Perl.

At present, three reporting options are available:

Usage: ./spssread.pl [OPTION] [spss-filename.sav]
Choose one of the following single-character options:
  -h  Print File Header information
  -r  Print tab-delimited info about Variables
  -l  Print tab-delimited info about Value Labels

The 'h' option prints a list of the fields stored in the SPSS File Header. These are not entirely valuable in themselves but may provide insight if you run into errors using the other options. The following listing shows the results of running the 'h' option on the file "voter.sav" provided on the SPSS product CD-ROM.

$ ./spssread.pl -h voter.sav

Record type      $FL2
Product name     @(#) SPSS DATA FILE MS WINDOWS Release 8.0
Layout code      2
Case Size        6
Compression      1
Weight index     0
Number of cases  1847
Bias             100.000000
Creation date    22 Nov 98
Creation time    22:03:11
File label

The 'r' option prints a tab-delimited listing of the variable names, their type (whether numeric or string), and their label if any. Here, I have added a new variable to the "voter.sav" file to show how string variables are listed.

$ ./spssread.pl -r voter.sav
Name      Type          Label
PRES92    Numeric       VOTE FOR CLINTON, BUSH, PEROT
AGE       Numeric       AGE OF RESPONDENT
AGECAT    Numeric       age categories
EDUC      Numeric       HIGHEST YEAR OF SCHOOL COMPLETED
DEGREE    Numeric       RS HIGHEST DEGREE
SEX       Numeric       RESPONDENTS SEX
TEXTVAR   String (23)   New string variable of length 23.

The 'l' (that's a lower case version of 'L') option prints a tab-delimited listing of value labels associated with each variable. There are many conceivable ways of organizing this information, and I have chosen the one that best suits my needs. For each variable for which value labels are present, there is one row for each value/label pair associated with the variable. If you'd prefer to present the value label metadata differently, this can be done in the "print_value_labels" subroutine.

$ ./spssread.pl -l voter.sav
Varname   Value   Label
PRES92    1       Bush
PRES92    2       Perot
PRES92    3       Clinton
AGECAT    1        lt 35
AGECAT    2       35 - 44
AGECAT    3       45 - 64
AGECAT    4       65 +
DEGREE    0       lt high school
DEGREE    1       high school
DEGREE    2       junior college
DEGREE    3       bachelor
DEGREE    4       graduate degree
SEX       1       male
SEX       2       female

The script prints to standard output, making it easy to redirect to a file

$ ./spssread.pl -l voter.sav > labels.txt

Note that on Windows, you may have to start the script from a command prompt like this:

C:\> c:\perl\bin\perl.exe spssread.pl

Legal

SPSS is a registered trademark of SPSS Inc. Neither this website nor its author are affiliated with SPSS. This program is not endorsed or supported by SPSS.

spssread is licensed under the GNU General Public License.

Feel free to contact me at this link.