README for the data used in Coding Assignment 1, USC course CSCI-544 (Applied Natural Language Processing), Spring 2021.

This package contains two types of files:

1. Three files with distributional data from the U.S. census: frequencies of male first names from the 1990 census, frequencies of female first names from the 1990 census, and frequencies of surnames from the 2010 census. The files were downloaded from the following web addresses, which belong to the U.S. government:

https://www2.census.gov/topics/genealogy/1990surnames/dist.male.first (accessed 2020-08-25)
https://www2.census.gov/topics/genealogy/1990surnames/dist.female.first (accessed 2020-08-25)
https://www2.census.gov/topics/genealogy/2010surnames/names.zip (accessed 2020-08-17)

The above data were collected and released by the U.S. government, and are therefore presumed to be in the public domain. The original names.zip archive contains the data in both Excel and CSV formats; only the CSV file is included in this package.

2. Additional lists of fictional names of couples and individuals. These are not names of real people. The lists were created by an original program written by Ron Artstein in August 2020 (revised January 2021), for use in a class exercise. The program that created the lists uses statistical properties derived from the U.S. census data above.

The fictional name lists are copyrighted by Ron Artstein, and released under a Creative Commons Attribution-ShareAlike 4.0 International License: https://creativecommons.org/licenses/by-sa/4.0/

The program used for creating the name lists is not released at this time.
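
For readers who want to inspect the 1990 first-name files, the following is a minimal sketch of how they might be loaded in Python. It assumes each line holds four whitespace-separated columns (name, frequency in percent, cumulative frequency in percent, rank); check this against the files themselves before relying on it. The names NameRecord and parse_name_distribution are hypothetical, and the sample lines are made up to illustrate the assumed layout. This is not the program that generated the fictional name lists.

```python
from typing import Iterable, List, NamedTuple


class NameRecord(NamedTuple):
    """One row of a first-name distribution file (assumed layout)."""
    name: str
    percent: float             # frequency of the name, in percent
    cumulative_percent: float  # running total of percent, by rank
    rank: int


def parse_name_distribution(lines: Iterable[str]) -> List[NameRecord]:
    """Parse lines assumed to look like 'NAME  1.234  5.678  9'."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank lines or lines that do not fit the assumed layout
        name, percent, cumulative, rank = fields
        records.append(NameRecord(name, float(percent), float(cumulative), int(rank)))
    return records


# Made-up sample lines in the assumed layout (not real census values).
sample = [
    "ALPHA          2.000  2.000      1",
    "BETA           1.500  3.500      2",
    "",
]
records = parse_name_distribution(sample)
```

To read one of the actual files, the same function can be given an open file object, for example: with open("dist.male.first") as f: records = parse_name_distribution(f)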

Ron Artstein
artstein@ict.usc.edu
2021-01-19