Data documentation - Alec's Project Documentation

> Occupational coding

The full analysis of how well these OCC codes perform, as judged against our hand-coded test set, is [here](http://rpubs.com/amcgail/occupation_in_the_times). I've reproduced the descriptions of each item here.

+ **OCC** This algorithm looks in both the title and the first sentence. It first removes any instances of the person's name and converts the whole sentence to lower case. It then matches against a dictionary of terms, allowing a 1-word "buffer" for 2-word terms and a 2-word buffer for 3-word terms, ignoring word order. For example, the term "marketing assistant" would match "assistant of marketing activities" but not "assistant to the marketing department". We ignore terms which are subsets of other matched terms, for instance ignoring the "president" in "vice president".
+ **OCC_FsT_nobreaks** This algorithm is the same as `OCC`, except that it does not allow any buffer words.
+ **OCC_fullBody** This algorithm looks in the full body, excluding the title, using our curated dictionary. It does not allow buffer words, to improve performance (there are already way too many false positives).
+ **OCC_syntax** This one isolates a specific grammatical construction which occurs often in the first sentence of the obituary: the appositional modifier, or "APPOS". The appos is a noun which immediately follows another, typically within a comma-delimited phrase. In the following example, "Archbishop" is the appositional modifier of James Peter Davis: `James Peter Davis, Archbishop of Santa Fe from 1964 to 1974, died Friday.` The algorithm then looks this word (or words, if it's a compound noun) up in our dictionary and records any match.
+ **OCC_title** This algorithm looks exclusively in the description field of the title, which automatically excludes the obituarized's name. It again uses our curated dictionary.
+ **OCC_wikidata** If there is a Wikidata entry which matches the obituarized's name exactly, we look at all occupations ([P106](https://www.wikidata.org/wiki/Property:P106)) of the individual and match the labels for these occupations against our dictionary. We then code the person with ALL of these occupations, as well as any superclass ([P279](https://www.wikidata.org/wiki/Property:P279)) of each occupation. For example, Martin Luther King Jr. ([Q8027](https://www.wikidata.org/wiki/Q8027)) is listed as having the occupation preacher ([Q432386](https://www.wikidata.org/wiki/Q432386)), which is a subclass of religious servant ([Q4504549](https://www.wikidata.org/wiki/Q4504549)), which in turn is a subclass of cleric ([Q2259532](https://www.wikidata.org/wiki/Q2259532)). Beyond this, the superclasses become much more general and typically non-occupational (e.g. believer); these are filtered out.
+ **OCC_titleSyntaxI / OCC_titleSyntaxU** These are simply the intersection (OCC_titleSyntaxI) and union (OCC_titleSyntaxU) of OCC_title and OCC_syntax, the highest-precision and most straightforward uses of our vocabulary.

> Basic attributes

+ **age** is extracted using regular expressions such as `he was [0-9]{2}`, applied to the first paragraph. A small (unknown) percentage of these are incorrect.
+ **gender** is judged by the relative frequency of male to female pronouns.
+ **kinship** is a set indicating whether the person was obituarized because of their relatives' status.
+ **date_of_death** is judged based on the obituary's publication date and statements such as "died yesterday", "died on October 17, 2018", etc.
+ **died_sentence** is the first sentence whose main verb is something about dying or killing.
+ **firstSentence**
+ **fs_corrupted**
+ **lexicalAttributes**
+ **location_all**
+ **location_died**

> Name

+ **first_name**
+ **last_name**
+ **name** is my old name field
+ **best_name** was perfected and cleaned by David Strang
+ **name_by_most_common**
+ **name_from_died_sent**
+ **name_parts**
+ **name_prior**
+ **namesInObit**
+ **not_obit**

> Named entities

These are dumps of various attributes available when `spacy` parses a document.

+ **spacy_ents** a list of all of the entities, with tags
+ **spacy_ents_GPE** geo-political entity (a location)
+ **spacy_ents_NORP** nationalities or religious or political groups
+ **spacy_ents_ORG** organization
+ **spacy_ents_PERSON** person
+ **spacy_noun_chunks**

***

Because Stanford tags individual words, we group together adjacent words with the same tag. Thus, instead of being tagged lists of words (the output of the Stanford NER), these attributes are lists of strings, each of which represents a contiguous n-gram.

+ **stanford_LOCATION**
+ **stanford_ORG**
+ **stanford_PERSON**
+ **stanford_PERSON_title**

> Title information

+ **title** Gathered from the original LexisNexis dumps, with occasional modifications to remove extraneous text.
+ **title_corrupted** Indicator (à la David Strang) of whether the title is corrupted.
+ **title_info**

> Grammatical

+ **nouns**
+ **proper_nouns**
+ **verbs**
+ **appos_words** These are the words in apposition to the name, if the name is the first thing in the article.
+ **whatTheyDid**
+ **whatTheyWere**
+ **wikiPageId**
+ **wiki_categories**
+ **wiki_content**
+ **wiki_death_date**
+ **wiki_link**
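
To make the buffered dictionary matching described for **OCC** under "Occupational coding" concrete, here is a minimal Python sketch. It is not the project's actual code: the function names (`tokenize`, `term_matches`, `code_occupations`), the tokenization rule, and the three-term dictionary are all hypothetical. Two simplifications are assumed: terms longer than three words get a buffer of one less than their length (the description above only covers 2- and 3-word terms), and the subset rule compares matched terms as word sets rather than by position in the text.

```python
import re
from collections import Counter


def tokenize(text, name=""):
    """Drop the person's name, lower-case, and split into word tokens."""
    if name:
        text = re.sub(re.escape(name), " ", text, flags=re.IGNORECASE)
    return re.findall(r"[a-z]+", text.lower())


def term_matches(term, tokens):
    """True if all words of `term` fall inside one window, in any order."""
    words = term.split()
    # 1-word buffer for 2-word terms, 2-word buffer for 3-word terms.
    # Using len(words) - 1 for longer terms is an assumption of this sketch.
    window_size = len(words) + max(0, len(words) - 1)
    needed = Counter(words)
    for start in range(len(tokens)):
        window = Counter(tokens[start:start + window_size])
        if all(window[w] >= n for w, n in needed.items()):
            return True
    return False


def code_occupations(text, dictionary, name=""):
    """Return matched terms, dropping any that are subsets of another match."""
    tokens = tokenize(text, name)
    matched = [t for t in dictionary if term_matches(t, tokens)]
    return [
        t for t in matched
        if not any(set(t.split()) < set(other.split()) for other in matched)
    ]


# Hypothetical dictionary, for illustration only.
terms = ["marketing assistant", "president", "vice president"]

print(code_occupations("assistant of marketing activities", terms))
# -> ['marketing assistant']
print(code_occupations("assistant to the marketing department", terms))
# -> []
print(code_occupations("John Smith, vice president of Acme, died.", terms,
                       name="John Smith"))
# -> ['vice president']
```

Setting `window_size = len(words)` instead would give the no-buffer behaviour described for `OCC_FsT_nobreaks` and `OCC_fullBody`.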