Data documentation

A description of every computed variable from the obits

Occupational coding

The full analysis of how well these OCC codes are doing, as judged against our hand-coded test set, is here. I've reproduced the descriptions of each item below.

OCC
This algorithm looks in both the title and the first sentence. It first removes any instances of the person's name and converts the whole sentence to lower case. It then matches against a dictionary of terms, allowing a 1-word "buffer" for 2-word terms and a 2-word buffer for 3-word terms, ignoring word order. For example, the term "marketing assistant" would match "assistant of marketing activities" but not "assistant to the marketing department". We ignore terms which are subsets of other matched terms, for instance ignoring the "president" in "vice president".
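A minimal sketch of the buffered matching described above, not the production code. The function names, the regex tokenizer, and the rule "buffer = number of words in the term minus one" are my assumptions; the description only specifies the buffer for 2-word and 3-word terms.

```python
import re

def tokenize(text, name=""):
    """Lower-case the text, blank out the person's name, return word tokens."""
    text = text.lower()
    for part in name.lower().split():
        text = re.sub(r"\b" + re.escape(part) + r"\b", " ", text)
    return re.findall(r"[a-z]+", text)

def term_matches(term, tokens):
    """True if every word of the term occurs inside one sliding window.

    Window size = term length + buffer, with buffer = term length - 1
    (assumed generalization of the 2-word and 3-word cases).
    """
    words = term.lower().split()
    window = len(words) + (len(words) - 1)
    needed = set(words)
    for start in range(max(1, len(tokens) - window + 1)):
        if needed <= set(tokens[start:start + window]):
            return True
    return False

def occ_match(text, dictionary, name=""):
    """Return matched terms, dropping any whose words are a proper subset
    of another matched term (e.g. "president" inside "vice president")."""
    tokens = tokenize(text, name)
    hits = [t for t in dictionary if term_matches(t, tokens)]
    return [t for t in hits
            if not any(set(t.split()) < set(o.split()) for o in hits)]

print(occ_match("assistant of marketing activities", ["marketing assistant"]))
print(occ_match("assistant to the marketing department", ["marketing assistant"]))
```

Passing a window of exactly the term length (no buffer) would give the behavior described for OCC_FsT_nobreaks.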
OCC_FsT_nobreaks
This algorithm is the same as OCC, except it does not allow any buffer words.
OCC_fullBody
This algorithm looks in the full body, excluding the title, using our curated dictionary. It does not allow buffer words, to improve performance (there are already way too many false positives).
OCC_syntax
This one isolates a specific grammatical construction which occurs often in the first sentence of the obituary. That is the appositional modifier, or "APPOS". The appos is a noun which immediately follows another, typically within a comma-delimited phrase. In the example below, "Archbishop" is the appositional modifier of James Peter Davis.
James Peter Davis, Archbishop of Santa Fe from 1964 to 1974, died Friday.
The algorithm then looks this word (or words, if it's a compound noun) up in our dictionary and records a match.
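To make the appositive lookup concrete, here is a rough sketch that works on a flattened dependency parse. The `(text, dep, head_index)` tuple format and the function name are hypothetical, and the parse below is written by hand for the example sentence rather than produced by a parser. The labels `appos` and `compound` are the ones spaCy's English models use.

```python
# Hand-written parse of the example sentence: (text, dependency label, head index).
PARSE = [
    ("James", "compound", 2), ("Peter", "compound", 2), ("Davis", "nsubj", 13),
    (",", "punct", 2), ("Archbishop", "appos", 2), ("of", "prep", 4),
    ("Santa", "compound", 7), ("Fe", "pobj", 5), ("from", "prep", 4),
    ("1964", "pobj", 8), ("to", "prep", 8), ("1974", "pobj", 10),
    (",", "punct", 2), ("died", "ROOT", 13), ("Friday", "npadvmod", 13),
    (".", "punct", 13),
]

def appos_phrases(parse, name_indices):
    """Return lower-cased appositives attached to any token in name_indices.

    Compound modifiers of the appositive are prepended so that compound
    nouns are looked up as a single phrase.
    """
    phrases = []
    for i, (text, dep, head) in enumerate(parse):
        if dep == "appos" and head in name_indices:
            compounds = [w for w, d, h in parse if d == "compound" and h == i]
            phrases.append(" ".join(compounds + [text]).lower())
    return phrases

# Index 2 ("Davis") is the head token of the name.
print(appos_phrases(PARSE, {2}))
```

Each returned phrase would then be checked against the dictionary.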
OCC_title
This algorithm looks exclusively in the description field of the title, which automatically excludes the obituarized's name. It again uses our curated dictionary.
OCC_wikidata
If there is a Wikidata entry which matches the obituarized's name exactly, we look at all occupations (P106) of the individual, and match the labels for these occupations against our dictionary. We then code the person as ALL of these occupations, as well as any superclass (P279) of each occupation. For example, Martin Luther King Jr. (Q8027) is listed as having the occupation preacher (Q432386), which is a subclass of religious servant (Q4504549), which in turn is a subclass of cleric (Q2259532). Beyond this point the superclasses become much more general and typically non-occupational (e.g. believer); these are filtered out.
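The superclass expansion can be sketched as a breadth-first walk over P279 edges. This toy version uses labels instead of Q-ids and an in-memory graph instead of querying Wikidata. The edge from cleric to believer is only illustrative, and filtering by dictionary membership is my assumption about how the non-occupational superclasses get removed.

```python
from collections import deque

# Toy "subclass of" (P279) graph keyed by label; the last edge is illustrative only.
SUBCLASS_OF = {
    "preacher": ["religious servant"],
    "religious servant": ["cleric"],
    "cleric": ["believer"],
}

def with_superclasses(occupations, subclass_of):
    """Return the occupations plus every superclass reachable from them."""
    seen = set()
    queue = deque(occupations)
    while queue:
        item = queue.popleft()
        if item in seen:
            continue
        seen.add(item)
        queue.extend(subclass_of.get(item, []))
    return seen

def occ_wikidata(occupations, subclass_of, dictionary):
    """Keep only the expanded occupations whose labels are in the dictionary."""
    return sorted(with_superclasses(occupations, subclass_of) & set(dictionary))

dictionary = {"preacher", "religious servant", "cleric"}
print(occ_wikidata(["preacher"], SUBCLASS_OF, dictionary))
```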
OCC_titleSyntaxI / OCC_titleSyntaxU
These are just the intersection (OCC_titleSyntaxI) and union (OCC_titleSyntaxU) of OCC_title and OCC_syntax, the highest precision and most straightforward use of our vocabulary.
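As a small illustration, assuming each OCC variable is stored as a set of matched terms (the example values here are made up):

```python
occ_title = {"archbishop", "priest"}   # hypothetical matches from the title
occ_syntax = {"archbishop"}            # hypothetical matches from the appositive

occ_title_syntax_i = occ_title & occ_syntax  # terms both methods agree on
occ_title_syntax_u = occ_title | occ_syntax  # terms found by either method

print(occ_title_syntax_i, occ_title_syntax_u)
```

The intersection only keeps a code when both methods agree, while the union keeps anything either one found.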

Basic attributes

age is extracted using regular expressions such as 'he was [0-9]{2}' in the first paragraph. A small (unknown) percentage of these matches are false positives.
gender is judged by the relative frequency of male to female pronouns.
kinship is a set which indicates whether the person was obituarized because of their relatives' status.
date_of_death is judged based on the obituary's publish date and statements such as "died yesterday", "died on October 17, 2018", etc.
died_sentence is the first sentence whose main verb is something about dying or killing.
firstSentence
fs_corrupted
lexicalAttributes
location_all
location_died
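The age and gender rules described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual extraction code: the single pattern (extended to cover "she was"), the pronoun lists, and the return labels are all mine.

```python
import re

# One pattern in the spirit of 'he was [0-9]{2}'; the real set of patterns is larger.
AGE_PATTERN = re.compile(r"\bs?he was ([0-9]{2})\b", re.IGNORECASE)

MALE_PRONOUNS = {"he", "him", "his", "himself"}
FEMALE_PRONOUNS = {"she", "her", "hers", "herself"}

def extract_age(first_paragraph):
    """Return the first two-digit age found, or None if nothing matches."""
    match = AGE_PATTERN.search(first_paragraph)
    return int(match.group(1)) if match else None

def guess_gender(text):
    """Compare counts of male and female pronouns; None on a tie."""
    tokens = re.findall(r"[a-z]+", text.lower())
    male = sum(t in MALE_PRONOUNS for t in tokens)
    female = sum(t in FEMALE_PRONOUNS for t in tokens)
    if male == female:
        return None
    return "male" if male > female else "female"

print(extract_age("He was 87 and lived in Santa Fe."))
print(extract_age("He was 19 when he enlisted."))  # a false match: not age at death
print(guess_gender("She was a teacher. Her husband said he admired her."))
```

The second call shows the kind of false match mentioned for age: the pattern fires on any "he was NN", whether or not NN is the age at death.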

Name

first_name
last_name
name is my old name
best_name was perfected and cleaned by David Strang
name_by_most_common
name_from_died_sent
name_parts
name_prior
namesInObit
not_obit

Named entities

These attributes are dumps of various properties available after spaCy parses a document.

spacy_ents a list of all of the entities, with tags.
spacy_ents_GPE geo-political entity (a location)
spacy_ents_NORP
spacy_ents_ORG organization
spacy_ents_PERSON person
spacy_noun_chunks

Because Stanford tags individual words, we group together adjacent words with the same tag. Thus instead of these being tagged lists of words (the output of the Stanford NER), these attributes are lists of strings which each represent a contiguous n-gram.

stanford_LOCATION
stanford_ORG
stanford_PERSON
stanford_PERSON_title
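The grouping of adjacent same-tag words described above can be sketched with `itertools.groupby`. The input format, a list of `(word, tag)` pairs with `"O"` for untagged words, and the hand-written tags below are assumptions for illustration.

```python
from itertools import groupby

def group_entities(tagged, tag):
    """Join each run of adjacent words carrying `tag` into one string."""
    return [
        " ".join(word for word, _ in run)
        for run_tag, run in groupby(tagged, key=lambda pair: pair[1])
        if run_tag == tag
    ]

# Hand-written tags for part of the example sentence.
tagged = [
    ("James", "PERSON"), ("Peter", "PERSON"), ("Davis", "PERSON"),
    (",", "O"), ("Archbishop", "O"), ("of", "O"),
    ("Santa", "LOCATION"), ("Fe", "LOCATION"),
]

print(group_entities(tagged, "PERSON"))
print(group_entities(tagged, "LOCATION"))
```

One consequence of this approach is that two distinct entities of the same type that happen to be adjacent, with no untagged word between them, are merged into a single string.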

Title information

title Gathered from the original LexisNexis dumps, with occasional modifications to remove extraneous crap.
title_corrupted Indicator (à la David Strang) of whether the title is corrupted.
title_info

Grammatical

nouns
proper_nouns
verbs
appos_words These are the words in apposition to the name, if the name is the first thing in the article.
whatTheyDid
whatTheyWere
wikiPageId
wiki_categories
wiki_content
wiki_death_date
wiki_link