Data documentation
A description of every computed variable from the obits
Occupational coding
The full analysis of how well these OCC codes are doing, as judged against our hand-coded test set, is here. I've reproduced the descriptions of each item here.
-
OCC
This algorithm looks in both the title and the first sentence. It first removes any instances of the person's name and converts the whole sentence to lower case. It then matches against a dictionary of terms, allowing a 1-word "buffer" for 2-word terms and a 2-word buffer for 3-word terms, and ignoring word order. For example, the term "marketing assistant" would match "assistant of marketing activities" but not "assistant to the marketing department". We ignore terms which are subsets of other matched terms, for instance ignoring the "president" in "vice president".
-
OCC_FsT_nobreaks
This algorithm is the same as OCC, except it does not allow any buffer words.
-
OCC_fullBody
This algorithm looks in the full body, excluding the title, using our curated dictionary. It does not allow buffer words, to improve performance (there are already way too many false positives).
-
OCC_syntax
This one isolates a specific grammatical construction which often occurs in the first sentence of the obituary: the appositional modifier, or "APPOS". An appos is a noun which immediately follows another noun, typically within a comma-delimited phrase. In the example below, "Archbishop" is the appositional modifier of "James Peter Davis".
James Peter Davis, Archbishop of Santa Fe from 1964 to 1974, died Friday.
The algorithm then looks up this word (or words, if it's a compound noun) in our dictionary and records any match.
-
OCC_title
This algorithm looks exclusively in the description field of the title, which automatically excludes the obituarized's name. It again uses our curated dictionary.
-
OCC_wikidata
If there is a Wikidata entry which matches the obituarized's name exactly, we look at all occupations (P106) of the individual and match the labels for these occupations against our dictionary. We then code the person with ALL of these occupations, as well as any superclass (P279) of each occupation. For example, Martin Luther King Jr. (Q8027) is listed as having the occupation preacher (Q432386), which is a subclass of religious servant (Q4504549), which in turn is a subclass of cleric (Q2259532). Beyond this point the superclasses become much more general and typically non-occupational (e.g. believer); these are filtered out.
-
OCC_titleSyntaxI / OCC_titleSyntaxU
These are just the intersection (OCC_titleSyntaxI) and union (OCC_titleSyntaxU) of OCC_title and OCC_syntax, the highest-precision and most straightforward uses of our vocabulary.
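The buffered matching described for OCC can be sketched roughly as follows. This is only an illustration under my own reading of the rule, not the project's actual code: I assume an n-word term may be spread over at most n + (n - 1) consecutive tokens, in any order, and that single-word terms get no buffer. The function names (tokenize, term_matches, code_occupations) and the tiny term lists are hypothetical.

```python
import re

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def term_matches(term, tokens):
    """True if every word of `term` falls inside one window of tokens.

    Assumed buffer rule: a 2-word term may span 3 tokens (1 spare word),
    a 3-word term may span 5 tokens (2 spare words), in any order.
    """
    words = term.split()
    window = len(words) + max(len(words) - 1, 0)
    needed = set(words)
    for start in range(len(tokens)):
        if needed <= set(tokens[start:start + window]):
            return True
    return False

def code_occupations(text, terms, name=""):
    """Return the dictionary terms found in `text`, minus subsumed terms."""
    lowered = text.lower()
    if name:
        # Remove every instance of the person's name before matching.
        lowered = lowered.replace(name.lower(), " ")
    tokens = tokenize(lowered)
    matched = [t for t in terms if term_matches(t, tokens)]
    # Drop any term whose words are a proper subset of another matched term,
    # e.g. "president" when "vice president" also matched.
    return [t for t in matched
            if not any(set(t.split()) < set(o.split()) for o in matched)]

print(code_occupations("assistant of marketing activities",
                       ["marketing assistant"]))      # ['marketing assistant']
print(code_occupations("assistant to the marketing department",
                       ["marketing assistant"]))      # []
print(code_occupations("He was vice president of sales",
                       ["president", "vice president"]))  # ['vice president']
```

Passing a window equal to the term length (no spare words) would give the behaviour described for OCC_FsT_nobreaks, again under the same assumptions.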
Basic attributes
-
age is extracted using regular expressions such as 'he was [0-9]{2}' in the first paragraph. A small (unknown) percentage of these values are wrong.
-
gender is judged by the relative frequency of male to female pronouns.
-
kinship is a set which indicates whether the person was obituarized because of their relatives' status.
-
date_of_death is inferred from the obituary's publication date and statements such as "died yesterday", "died on October 17, 2018", etc.
-
died_sentence is the first sentence whose main verb is something about dying or killing.
-
firstSentence
-
fs_corrupted
-
lexicalAttributes
-
location_all
-
location_died
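The age and gender heuristics in this list can be sketched as below. This is a minimal illustration, not the code that produced the data: the extra regex pattern, the pronoun lists, the "male"/"female" labels, and the function names extract_age and guess_gender are all my assumptions.

```python
import re

# Hypothetical patterns in the spirit of 'he was [0-9]{2}'.
AGE_PATTERNS = [
    re.compile(r"\b(?:he|she) was (\d{2,3})\b", re.IGNORECASE),
    re.compile(r"\bwas (\d{2,3}) years old\b", re.IGNORECASE),
]

# Assumed pronoun lists; the real ones may differ.
MALE = {"he", "him", "his", "himself"}
FEMALE = {"she", "her", "hers", "herself"}

def extract_age(first_paragraph):
    """Return the first age-like number found, or None."""
    for pattern in AGE_PATTERNS:
        match = pattern.search(first_paragraph)
        if match:
            return int(match.group(1))
    return None

def guess_gender(text):
    """Compare counts of male and female pronouns; None on a tie."""
    words = re.findall(r"[a-z]+", text.lower())
    male = sum(w in MALE for w in words)
    female = sum(w in FEMALE for w in words)
    if male == female:
        return None
    return "male" if male > female else "female"

print(extract_age("Mr. Davis died Friday. He was 79."))               # 79
print(guess_gender("She taught for years. Her students adored her."))  # female
```

A sentence like "He was 19 when he enlisted" would also match the first pattern, which is one way incorrect ages can arise with this kind of approach.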
Name
- first_name
- last_name
- name is my old name
- best_name was perfected and cleaned by David Strang
- name_by_most_common
- name_from_died_sent
- name_parts
- name_prior
- namesInObit
- not_obit
Named entities
These attributes are dumps of various properties available when spaCy parses a document.
- spacy_ents a list of all of the entities, with tags.
- spacy_ents_GPE geo-political entity (a location)
- spacy_ents_NORP
- spacy_ents_ORG organization
- spacy_ents_PERSON person
- spacy_noun_chunks
Because the Stanford NER tags individual words, we group together adjacent words with the same tag. Thus, instead of being tagged lists of words (the raw output of the Stanford NER), these attributes are lists of strings, each representing a contiguous n-gram.
- stanford_LOCATION
- stanford_ORG
- stanford_PERSON
- stanford_PERSON_title
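The grouping of adjacent same-tag words described for the stanford_ attributes can be sketched with itertools.groupby. I am assuming the tagger output is available as a list of (word, tag) pairs; the function name group_tagged_tokens is hypothetical and the sample tags are hand-written for illustration, not real tagger output.

```python
from itertools import groupby

def group_tagged_tokens(tagged, wanted_tag):
    """Join each run of adjacent tokens tagged `wanted_tag` into one string."""
    spans = []
    for tag, run in groupby(tagged, key=lambda pair: pair[1]):
        if tag == wanted_tag:
            spans.append(" ".join(word for word, _ in run))
    return spans

# Hand-written (word, tag) pairs; "O" marks words outside any entity.
tagged = [
    ("James", "PERSON"), ("Peter", "PERSON"), ("Davis", "PERSON"),
    (",", "O"), ("Archbishop", "O"), ("of", "O"),
    ("Santa", "LOCATION"), ("Fe", "LOCATION"),
]

print(group_tagged_tokens(tagged, "PERSON"))    # ['James Peter Davis']
print(group_tagged_tokens(tagged, "LOCATION"))  # ['Santa Fe']
```

One consequence of this approach is that two distinct entities with the same tag that happen to be adjacent, with no other word between them, would be merged into a single string.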
Title information
- title Gathered from the original LexisNexis dumps, with occasional modifications to remove extraneous crap.
- title_corrupted Indicator (à la David Strang) of whether the title is corrupted.
- title_info
Grammatical
-
nouns
-
proper_nouns
-
verbs
-
appos_words These are the words in apposition to the name, if the name is the first thing in the article.
-
whatTheyDid
-
whatTheyWere
-
wikiPageId
-
wiki_categories
-
wiki_content
-
wiki_death_date
-
wiki_link