Data Sets Description

Two types of data sets are available: summary data sets and raw clinical data sets. Summary data sets combine data from multiple waves and are organized by generation. The G1 (generation 1) set includes data on the generation of the index cases. The G2 (generation 2) set includes all data on the offspring of the index parents, and the G3 (generation 3) set includes subjects who are the offspring of generation 2.

Raw clinical data sets are files that include the complete SADS and K-SADS collected at each wave. Because the versions changed over time, there are separate data sets for each wave for both the SADS and the K-SADS. Each of these files may contain multiple records for the same child because data were collected from multiple informants (parent and child); in some cases data were collected from a single informant (parent or child). Identifying (key) fields, described below, are used to locate records of interest. These identifiers are consistent across all files from all waves.

The key identifiers are character fields: Family (FAM, length 5), Individual-in-family (ID, length 4) and Informant-ID (INF, length 4). Subjects are uniquely identified by the concatenation of FAM+ID, and this concatenated field identifies the data as belonging to or describing that specific subject. In addition, the Informant-ID (INF) identifies the person providing the data.

For example, consider the following records:

FAM ID INF
4000 0001 0001
4000 0001 0032
4000 0002 0002
4000 0002 0032
4000 0003 0032
4000 0032 0032

These records all belong to a single family (4000), which has 4 distinct subjects (4000-0001, 4000-0002, 4000-0003, 4000-0032). Two subjects (4000-0001 and 4000-0002) have two records each: a self-report and a report by person 0032. Typically the second informant is a parent. Subject 4000-0003 has no self-report record, only the one provided by informant 0032. Subject 4000-0032 has a single record, which is a self-report.
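The grouping described above can be sketched in a few lines of Python. This is only an illustration, not part of the distributed data sets: the function names are hypothetical, the hyphen in the subject key is added for readability, and the rule "a record is a self-report when INF equals ID" is inferred from the example records.

```python
from collections import defaultdict

# Example records from the text, as (FAM, ID, INF) character fields.
records = [
    ("4000", "0001", "0001"),
    ("4000", "0001", "0032"),
    ("4000", "0002", "0002"),
    ("4000", "0002", "0032"),
    ("4000", "0003", "0032"),
    ("4000", "0032", "0032"),
]

def subject_key(fam, ind):
    """Concatenate FAM and ID; the hyphen is only for readability."""
    return f"{fam}-{ind}"

def informants_by_subject(rows):
    """Map each subject key to the list of informant IDs reporting on it."""
    grouped = defaultdict(list)
    for fam, ind, inf in rows:
        grouped[subject_key(fam, ind)].append(inf)
    return dict(grouped)

def has_self_report(rows, fam, ind):
    """Assumption: a record is a self-report when INF equals the subject's ID."""
    return any(f == fam and i == ind and inf == ind for f, i, inf in rows)

by_subject = informants_by_subject(records)
```

Keeping the identifiers as strings, as in the files, preserves the leading zeros that would be lost if they were read as numbers.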

In order to accommodate the (now obsolete) 8-character restriction on variable names, the following convention was used in naming variables: 2-character FORM CODE + 2-character CARD NUMBER with a leading zero + CM (indicating the column) + 2-character COLUMN NUMBER, also with a leading zero. For variables spanning more than one column, the column number of the first column for that variable is used.
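As a minimal sketch of this naming convention, the Python helpers below build and split a variable name. The function names are hypothetical, and the pattern assumes that form codes consist of uppercase letters or digits, which matches the codes shown on this page but is not stated as a rule.

```python
import re

# FORM CODE (2 chars) + CARD NUMBER (2 digits) + "CM" + COLUMN NUMBER (2 digits).
# Assumption: form codes use only uppercase letters and digits.
VAR_NAME = re.compile(r"(?P<form>[A-Z0-9]{2})(?P<card>\d{2})CM(?P<column>\d{2})")

def make_var_name(form, card, column):
    """Build an 8-character variable name from its three components."""
    if len(form) != 2:
        raise ValueError("form code must be exactly 2 characters")
    if not (0 <= card <= 99 and 0 <= column <= 99):
        raise ValueError("card and column numbers must fit in 2 digits")
    return f"{form}{card:02d}CM{column:02d}"

def parse_var_name(name):
    """Split a variable name into (form code, card number, column number)."""
    match = VAR_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid variable name: {name!r}")
    return match["form"], int(match["card"]), int(match["column"])
```

For a variable that spans several columns, the column number passed in would be that of its first column.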


Data on clinical symptoms was collected for either the subject's lifetime or for a specific interval of time (usually since the time of a previous clinical assessment). Lifetime data/diagnoses can be found in the following files: All other files contain catch-up clinical assessments since a previous interview.


Note on Wave 1-2 K-SADS files. The exact same form was used for both Child-on-Self and Parent-on-Child assessments in these two waves, with the following exception. At Wave 2, a series of Medical Questions was included, and responses to these were obtained from the mother only. These questions start on page 4 and continue through page 10. The variable names start with AD02CM14 and continue through A209CM63. These variables are not present in the Wave 1 data files. In addition, the variables in the Wave 1-2 K-SADS data sets have slightly different names depending on the wave and informant. Specifically, all the variable names begin with a 2-character code as follows:

Wave 1 Child-on-Self KD
Wave 1 Parent-on-Child PT
Wave 2 Child-on-Self A2
Wave 2 Parent-on-Child A1

For example, the first variable on the Wave 1-2 K-SADS is:

1. With whom does the child now live? Mother?

This datum has one of the following variable names depending on the wave and informant:

KD01CM31, PT01CM31, A201CM31 or A101CM31.
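The lookup from wave and informant to variable name can be sketched as follows. This is an illustrative Python snippet only: the dictionary, the function name and the "child"/"parent" labels are hypothetical, while the 2-character codes are those listed above.

```python
# 2-character codes for the Wave 1-2 K-SADS, keyed by (wave, informant).
# "child" = Child-on-Self, "parent" = Parent-on-Child (labels are illustrative).
PREFIX = {
    (1, "child"): "KD",
    (1, "parent"): "PT",
    (2, "child"): "A2",
    (2, "parent"): "A1",
}

def ksads_var_name(wave, informant, card, column):
    """Return the variable name for one item on a given wave and informant."""
    prefix = PREFIX[(wave, informant)]
    return f"{prefix}{card:02d}CM{column:02d}"

# The first K-SADS item sits on card 01, column 31.
names = [ksads_var_name(wave, informant, 1, 31) for wave, informant in PREFIX]
```

Here `names` holds the four names given above, one per wave and informant combination.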

Note on Wave 1-2 SADS (Form 71), page 8. The variable "ethnicity" was not collected at Wave 1-2 and is not in the dataset.


Funded by National Institute of Mental Health Grant: MH36197 NYSPI Div of Epidemiology ©Copyright 2016