<!-- TITLE: Quantitative analysis of sociological journal articles -->
<!-- SUBTITLE: A disheveled brainstorm session -->
# What is the end goal of this analysis?
I need to finish an A-exam, which translates to **preparing a publication**. Thus a relevant question to ask is [where to publish?](/journal-analysis/where-to-publish). I intend the paper to be methods- and theory-heavy, something akin to [this Poetics special edition](https://www.sciencedirect.com/journal/poetics/vol/68/suppl/C) I loved so much. As for topic, I'm trying to understand the development of theory in Sociology.
## Advice from the late great Ben Cornwell
> Keep it simple, general, generic
+ Produce a follow-up to [Moody's network analysis](https://www.jstor.org/stable/27700461)
+ Is there something more recent?
+ Model the paper's structure and content after Moody's
+ Emphasize this as an update, with more detail, more journals, more years
+ Progress on this front can be found [here](/journal-analysis/extendingMoody)
+ What's the simplest line-graph you can draw?
+ General, generic, "Kraft brand," boring, sociology
+ Get out the milk, or the glass of water
+ Make some excel spreadsheets
+ The life-table analysis of sociology of knowledge
+ *Don't* self-publish.
+ We're *not* the smartest people in the room.
+ Our work *has* to be criticized, refined, reworked, to be useful.
+ Self-publishing ensures that no one will read or use your work. [debatable?]
+ Don't discount major outlets
+ Find someone who would read your paper, whose work you respect, and look where their work is published
+ A publication in a low-tier journal (one no one has heard of) actually *discounts* your CV
# Theory: specifying a theoretically interesting analysis
## What are the good questions to ask?
### The Positivist
One way of stating this is asking whether existing theories are true or false. The steps in answering this question are as follows:
* Find well-formulated theories of the sociology of sociology
* Are they testable with the data I have? How?
* Are they important?
### The Qualitative Sociologist
**My favorite**
In this case our question wouldn't be `is this true?`, it would be `what is happening?`. That is, `what generalizations can I make?` These descriptive, inductive questions are themselves secondary to - that is, they come after - the questions `how should I look?` and `what should I look at?`. In this model the analysis proceeds as follows:
- Specify a multitude of **ways to look at** the phenomenon, hoping for triangulation
- Look
- Carefully make generalizations from what you see
These steps are just a heuristic, and multiple steps can happen at once, but the heuristic is useful. The emphasis on "ways to look" means that new tools for observation can be evaluated by what they allow the researcher to see. This way of conceiving of research questions allows the researcher to be surprised.
## Question forms
### What is scientific literature accomplishing, and how is it doing that?
Simply a matter of curiosity.
### What can we do to fix problem X? Here's the problem, and here's a way to fix it.
The implicit claim: we will all be better off if we know the answer to this question, and I have asked and answered it here.
This form of question-asking is the dominant trend in the [most prolific writers who cite Moody 2004](/journal-analysis/extendingMoody).
### [BAD] I've solved a problem in "intellectual history," uncovering the "true source" of an idea.
This question is only bad if by "true source" you mean the individual who originated the idea. It's not bad to assume there is a source, but uncovering lost sources is boring. Talking to "true sources" of an idea, and about "prediscoveries" of the idea, could be interesting.
### Why did theory progress in this way, instead of that way?
> What constructs sociology?
> What are the forces guiding it?
> If different forces had been applied, where would our discipline be now?
The answer to this must acknowledge and extend existing understanding of the progression of theory, and thus is a discourse with the philosophy of science and the sociology of science. Because it is a study of Sociology proper, the most up-to-date understanding is found in Baehr's *Founders, Classics, Canons* [^founders]. He gives a thorough accounting of existing theories, making some critical comments.
[^founders]: Baehr, Peter. 2016. *Founders, Classics, Canons: Modern Disputes over the Origins and Appraisal of Sociology’s Heritage*. 2nd ed. New Brunswick: Transaction Publishers.
**Next step** Build an explicit catalogue of all *testable* theories of the development of Sociological theory present in this book. Start by doing a first pass and filtering out most references, keeping a manageable collection for individual inspection.
## Research Questions
### What makes terminology "stick"?
> Still way too broad.
### The quantitative history of consensus in sociological terminology
Does sociology ever come to a "consensus," only for that consensus to shift and eventually be overturned?
**Try 1**
+ Quantify the "surprise" of each sentence within decade windows of each journal. This helps us understand where it "comes from" and what it's most akin to.
+ This "surprise" requires a *language model*. We could assume for this that each sentence is a random draw from tuples, with a bunch of other random words mixed in.
+ Tuples should occur in more than one journal.
+ The output is a `Journal-Year-Article-Sentence` and a measure of similarity with each `Journal-Decade`.
+ For each sentence, we can then draw a "trajectory" through Journal-Decade space. It would be a heat-map.
+ The next step is to look at relevant sentences, which should have a rich history. For this I should ask someone...
+ Another option is to compute this for all sentences in a document, and average the trajectories of all of them. ERP-style.
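A minimal sketch of the surprise computation, assuming a simple add-alpha unigram model per journal-decade slice as a stand-in for the tuple-based model suggested above; the `slices` structure (journal-decade → token counts) is a hypothetical input, not something that exists in the pipeline yet:

```python
# Hedged sketch: per-token "surprise" (negative log-likelihood) of a sentence
# under an add-alpha unigram model fit to one journal-decade slice.
# `slices` maps (journal, decade) -> Counter of tokens and is hypothetical.
import math
from collections import Counter

def surprise(sentence_tokens, slice_counts, vocab_size, alpha=1.0):
    """Average negative log-probability per token, with add-alpha smoothing."""
    total = sum(slice_counts.values())
    nll = 0.0
    for tok in sentence_tokens:
        p = (slice_counts[tok] + alpha) / (total + alpha * vocab_size)
        nll -= math.log(p)
    return nll / max(len(sentence_tokens), 1)

def trajectory(sentence_tokens, slices):
    """Surprise of one sentence against every (journal, decade) slice."""
    vocab = set()
    for counts in slices.values():
        vocab.update(counts)
    return {key: surprise(sentence_tokens, counts, len(vocab))
            for key, counts in slices.items()}

# e.g. slices = {("AJS", 1960): Counter(all AJS tokens 1960-1969), ...}
```

Lower surprise against a given `Journal-Decade` means the sentence is more akin to that slice; the dict returned by `trajectory` is the heat-map row for one sentence.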
### Are there any direct contradictions, clear cut disagreements, either over time or at a single time? What does the map of these contradictions look like?
The main methodological problem in this line of questioning is to identify contradictions in academic literature.
*random idea:* split sentences into cleaned phrases (parts of sentences), and see if one author says "PART is undeniable" and another says "PART cannot be true".
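A rough sketch of this phrase-matching idea, under heavy assumptions: the assertion and negation cue patterns below are placeholders, and the normalization is far cruder than the "cleaned phrases" imagined above:

```python
# Rough sketch: pair assertions like "X is undeniable" with negations like
# "X cannot be true". The cue phrases and the normalization are placeholders.
import re
from collections import defaultdict

ASSERT_CUES = [r"(.+?) is undeniable", r"it is clear that (.+)"]
NEGATE_CUES = [r"(.+?) cannot be true", r"it is false that (.+)"]

def normalize(phrase):
    return re.sub(r"[^a-z ]", "", phrase.lower()).strip()

def find_contradictions(sentences):
    """sentences: iterable of (author, sentence) pairs."""
    asserted, negated = defaultdict(set), defaultdict(set)
    for author, sent in sentences:
        for cue in ASSERT_CUES:
            for m in re.finditer(cue, sent, re.IGNORECASE):
                asserted[normalize(m.group(1))].add(author)
        for cue in NEGATE_CUES:
            for m in re.finditer(cue, sent, re.IGNORECASE):
                negated[normalize(m.group(1))].add(author)
    # a candidate contradiction: the same PART asserted by some and denied by others
    return {part: (asserted[part], negated[part])
            for part in asserted if part in negated}
```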
### How long do "conversations" go on?
I've seen a few conversations which last just a few papers, and others which balloon into global phenomena. What does this diffusion network look like? What ends a conversation? What keeps it going? How has this changed over time, with new global forces in the academic world?
> Before we answer any of these, we must define a conversation analytically.
**Try #1** It is a "back and forth" between individual authors. $p_1$ and $p_2$ are papers, and $p_1 \to_t p_2$ indicates that $p_1$ references $p_2$ at time $t$. Writing $A(p)$ for the author(s) of paper $p$, a conversation is a maximal set $P = \left\{ p_i \right\}$ such that for any $p \in P$ there is a chain of papers $c_1 \to c_2 \to c_3$ with $p$ among the $c_i$ and $A(c_1) = A(c_3)$.
This definition is likely too permissive, creating huge blocks which are in no way a part of the same conversation.
**Try #2** Same as Try #1, but you restrict to sets of authors who are in conversation with each other, and the conversations they have.
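A sketch of Try #2, treating a conversation as the papers exchanged by a pair of authors who cite each other; the `references` list and the `authors()` lookup are assumed input structures derived from the citation network described later:

```python
# Sketch of Try #2: a "conversation" is the set of papers two authors exchange
# while citing each other. `references` is a list of (citing_paper, cited_paper)
# and `authors(p)` returns the author set of paper p -- both are assumptions.
from collections import defaultdict
from itertools import product

def conversations(references, authors):
    # papers in which author a cites author b
    cites = defaultdict(set)
    for citing, cited in references:
        for a, b in product(authors(citing), authors(cited)):
            if a != b:
                cites[(a, b)].add(citing)
    # keep only pairs where both directions occur (a genuine back-and-forth)
    convos = {}
    for (a, b), papers in cites.items():
        if (b, a) in cites and a < b:
            convos[(a, b)] = papers | cites[(b, a)]
    return convos  # {(author_a, author_b): set of papers in the exchange}
```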
### How is a Sociological language built?
### Is Sociology progressing?
You have to be careful with questions like this. Certainly those who are doing it have some idea what the point is. And it is progressing, haphazardly, towards the ambitions of those who inhabit it (Sociologists).
## Helper Questions
### Who are the subjects in my study?
An ethnographic understanding of the social actors involved can help in many ways. But before I can go out and talk to them, I need to **clearly define who** they are. I'm interested in those who participate in the thought communities exemplified and formed by the papers under my study.
## Ways to look
### What are the extremes of the distributions?
Construct metrics of success, conceptual brokerage, conceptual creation. In these metrics, who are the outliers? How skewed is the distribution?
### Qualitative interviews with Sociologists
What do the subjects of my study have which is useful for my project? Indeed, the selves that wrote and published those papers are gone, and I could be misled by their post-hoc constructions of what and why. On the other hand, they could act as great informants, pointing out factors I had never considered. For example, I need to think critically about how scholarship was made in the past, what taken-for-granted assumptions I should keep in mind when analyzing these articles. This is something which isn't written anywhere, except in the minds of the participants.
> Ease of access
I can simply interview the sociologists in my department, and maybe those in some other related departments. Anyone at Cornell who publishes in my target journals. So what do I interview people about?
> Interview questions
I want to pose my research questions to them, and ask them what their intuition is. I want to know what their lay theories are. What do **they** think makes up a community of thought? And what are their histories? How did they get here? What did they learn, and where? Why do they cite X?
* You've cited X multiple times, what does this paper mean to you?
* When did you read X paper? Where were you? Why did you read it?
* What do you remember about the paper?
* Do you know the authors? When did you meet them?
* More free form: what's the most important paper you can think of?
* Have you read any papers in the last week? If so, which papers?
### Clustering concept trajectories over time and (social) space
The input of this method is the trends of concept change over time generated above. A distance is calculated between each pair of trends and a cluster analysis is done, grouping those that are similar. This grouping of trends is then examined by a person, looking for patterns among these patterns, and applying some interpretation.
Examinations of where, when, and for whom a certain "meaning shift" happened can be investigated. The researcher then connects these trends to development of meaning in some concrete case, showing that these trends can be used as indicators for X.
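A minimal sketch of the clustering step, assuming the trends arrive as equal-length arrays per term; correlation distance and average-linkage hierarchical clustering are one reasonable choice among many, not a commitment:

```python
# Minimal sketch: hierarchically cluster concept-change trends.
# `trends` is assumed to be a dict {term: 1-D array of change scores},
# with all arrays on the same time grid (constant trends filtered out earlier).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_trends(trends, n_clusters=10):
    terms = sorted(trends)
    X = np.vstack([trends[t] for t in terms])
    # correlation distance groups trends with the same *shape*,
    # ignoring overall magnitude
    dists = pdist(X, metric="correlation")
    Z = linkage(dists, method="average")
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    return {term: int(lab) for term, lab in zip(terms, labels)}
```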
### [BORING] Look at author publication sequences
Do they stick to the same journal or are they more varied? Over time how does this propensity evolve? Do authors "settle in" to places of publication?
# Methods: creating derivative datasets for analysis
## Existing datasets
### Cleaned body of the article
* Headers, footers, and footnotes are stripped from the document
* The "top-matter", the abstract, and references of a paper are removed.
* Tables and other non-textual information are removed.
* Hyphenation broken by page is remedied
* Pages are bound together, respecting hyphenation
* The document is split by sentence
* The document is word-tokenized (a sketch of the last few steps follows this list)
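A compressed sketch of those last few steps (de-hyphenation, sentence splitting, word tokenization) using NLTK; the earlier, corpus-specific steps (headers, top-matter, tables) are assumed to have already been applied to `pages`:

```python
# Sketch of the tail end of the cleaning pipeline: repair hyphenation broken
# across lines/pages, then split into sentences and word tokens.
# Assumes `pages` is a list of page strings with headers, footers, top matter,
# references, and tables already stripped.
import re
import nltk  # requires the "punkt" tokenizer models to be downloaded

def clean_body(pages):
    text = "\n".join(pages)
    # rejoin words split by a hyphen at a line or page break
    text = re.sub(r"-\s*\n\s*", "", text)
    # collapse remaining line breaks and whitespace
    text = re.sub(r"\s+", " ", text).strip()
    sentences = nltk.sent_tokenize(text)
    return [nltk.word_tokenize(s) for s in sentences]
```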
### Citation network
I have been able to extract nearly all perfectly ASA-style citations. I accomplished this using an Earley parser. A deterministic parser was much more effective than the [state-of-the-art neural-network method](doi.org/10.1007/s00799-018-0242-1) because the citation style is highly regulated.
This amounts to more than 260,000 citations from the bibliographies, complete with publisher, city, version, etc. More could be gathered by examining those that were missed and writing Earley parsers for them, a job which can easily be hired out.
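For illustration only, a crude regex approximation of one common ASA journal-article pattern; the actual extraction uses the grammar-based parser described above, and this toy version will miss many legitimate variants:

```python
# Crude regex approximation of one common ASA bibliography pattern:
#   Author(s). Year. "Title." Journal Volume(Issue):Pages.
# Illustrative only; the real extraction is grammar-based.
import re

ASA_ARTICLE = re.compile(
    r"""(?P<authors>[^.]+)\.\s+
        (?P<year>(?:19|20)\d{2})\.\s+
        [“"](?P<title>[^”"]+)[.”"]+\s+
        (?P<journal>[^0-9]+?)\s+
        (?P<volume>\d+)
        (?:\((?P<issue>[^)]+)\))?
        :(?P<pages>\d+(?:[-–]\d+)?)\.""",
    re.VERBOSE,
)

def parse_reference(entry):
    """Return a dict of citation fields, or None if the pattern doesn't match."""
    m = ASA_ARTICLE.search(entry)
    return m.groupdict() if m else None
```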
**Possible next steps**
+ Resolve citations to specific documents
+ Merge documents before ~1967, [when ASR announced it was requiring citations](https://www.jstor.org/stable/2092841)
+ These are contained in footnotes and in-text references of widely varying formats.
+ May require hand-coding
## Desired datasets
### n-gram frequency counts
An **n-gram** is simply a sequence of words. For instance "simply a sequence of words" is a 5-gram, an n-gram of length 5. n-grams, properly looked at, act as a quantitative window into meaning and can be used to study changes in meaning-making over time - that is, to study the development of concepts.
We can build a network of n-grams by looking at how many documents they appear together in. If you look only at documents from 1987, or only at a specific journal, say "Socius," you get a different network. This network can be looked at from many different angles, and provides a rich insight into meaning.
The various understandings of different subcommunities can be identified, leading to a deeper clustering of scientists, along their methods of argumentation, their phrasing, and the concepts they espouse.
+ This will likely be stored in a .pickle file, as it will need to be loaded into memory in its entirety for use anyway.
+ In-memory computation limits the total data size to ~10GB
This dataset consists of the following look-up tables (i.e. functions):
+ $Docs(n)$
n-gram $\to$ list of document ids
+ $count(n,y,j)$
n-gram, year, journal $\to$ count
+ $Ngrams(d)$
document $\to$ list of n-grams
>We compute marginal totals of the counts (from which residuals can later be computed), creating six new maps
+ $cyj(y, j) \equiv \sum_n count(n,y,j)$
+ $cny(n, y) \equiv \sum_j count(n,y,j)$
+ $cnj(n, j) \equiv \sum_y count(n,y,j)$
+ $cy(y) \equiv \sum_{n,j} count(n,y,j)$
+ $cn(n) \equiv \sum_{y,j} count(n,y,j)$
+ $cj(j) \equiv \sum_{n,y} count(n,y,j)$
> We must store more detailed information if we are to create term networks.
+ $Nbd_i(n, d)$
n-gram, document $\to$ neighboring n-grams
The $i$ here represents the neighborhood algorithm, of which there can be several.
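A minimal sketch of how these look-up tables might be built with plain dictionaries and pickled for in-memory use; `documents` (an iterator of `(doc_id, year, journal, tokens)`) and the choice of `n_max` are assumptions:

```python
# Sketch of the core look-up tables as plain dictionaries, built from an
# iterator over (doc_id, year, journal, tokens). The input iterator and
# n_max are assumed; Nbd_i is omitted here.
import pickle
from collections import defaultdict, Counter

def extract_ngrams(tokens, n_max=3):
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def build_tables(documents):
    Docs = defaultdict(set)        # n-gram -> set of doc ids
    Ngrams = defaultdict(list)     # doc id -> list of n-grams
    count = Counter()              # (n-gram, year, journal) -> count
    for doc_id, year, journal, tokens in documents:
        for ng in extract_ngrams(tokens):
            Docs[ng].add(doc_id)
            Ngrams[doc_id].append(ng)
            count[(ng, year, journal)] += 1
    return Docs, Ngrams, count

def marginal_cny(count):
    """cny(n, y): sum the counts over journals."""
    cny = Counter()
    for (ng, year, journal), c in count.items():
        cny[(ng, year)] += c
    return cny

def save(tables, path="ngram_tables.pickle"):
    """Persist the whole structure for fast in-memory reuse."""
    with open(path, "wb") as f:
        pickle.dump(tables, f)
```

The other five marginals ($cyj$, $cnj$, $cy$, $cn$, $cj$) follow the same pattern as `marginal_cny`.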
> **Methodological note:** The easiest way to assess how well this *scales* is to try it out. Trying to plan how much memory each object will use is boring and not necessary. I can build the thing and scale it as far as reasonable.
### n-gram egonetworks
Changes in the constituents of the ego-network serve as indicators of shifts in the meaning-making of the community.
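A small sketch of what such an indicator could look like, assuming `Docs`-style tables (n-gram → set of document ids) restricted to each time slice; Jaccard distance between the two ego-nets is just one plausible summary:

```python
# Sketch: the ego-network of an n-gram is the set of other n-grams that share
# at least `min_docs` documents with it; the shift between two time slices is
# summarized by Jaccard distance. Docs-style tables per slice are assumed.
def egonet(focal, docs_table, min_docs=2):
    focal_docs = docs_table.get(focal, set())
    return {ng for ng, d in docs_table.items()
            if ng != focal and len(d & focal_docs) >= min_docs}

def egonet_shift(focal, docs_slice_a, docs_slice_b):
    a, b = egonet(focal, docs_slice_a), egonet(focal, docs_slice_b)
    if not a and not b:
        return None
    return 1 - len(a & b) / len(a | b)   # 0 = identical, 1 = disjoint
```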
### Conceptual links
We can obtain a rich graph structure by looking for specific grammatical relations.
+ For every transitive verb, e.g. "makes", consider the subject connected to the object with a labeled edge.
+ More generally, consider the network of grammatical relations between subtrees of a dependency parse.
This graph can in turn be reduced to interesting subsets, for example as a result of a SPARQL query on this graph.
The resulting graphs will be much more easily interpretable, and the entire graph structure offers many inroads for analysis.
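A sketch of the transitive-verb extraction using spaCy's dependency parse; spaCy and its small English model are assumptions here (the notes don't commit to a particular parser), and real sentences will need more dependency labels than the two handled below:

```python
# Sketch: extract (subject, verb, object) triples from a dependency parse and
# treat them as labeled edges. spaCy and en_core_web_sm are assumptions; any
# dependency parser would do.
import spacy

nlp = spacy.load("en_core_web_sm")

def svo_edges(text):
    edges = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ != "VERB":
                continue
            subjects = [c for c in token.children if c.dep_ == "nsubj"]
            objects = [c for c in token.children if c.dep_ in ("dobj", "obj")]
            for s in subjects:
                for o in objects:
                    edges.append((s.lemma_, token.lemma_, o.lemma_))
    return edges

# svo_edges("Bureaucracy constrains innovation.") might yield roughly
# [("bureaucracy", "constrain", "innovation")]
```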
### Distribution of helpers
### Author metadata
Additional metadata about an author is taken from their CV. We pay MTurkers to find the person's CV. If one MTurker doesn't find a CV, we ask a second MTurker. If they can't find it either, we consider the data missing.
Once the PDF is received it is automatically split into sections. Metadata is parsed automatically using a collection of Earley parsers, depending on the section it is in.
The following metadata exists, in principle, to be extracted from these CVs:
+ Students
+ Papers published
+ Academic positions
+ Graduate and undergraduate education
+ Grants and fellowships
+ Awards
+ Talks
+ Service
+ Classes
The institution can be extracted from the body of the document itself.
### Textual and institutional context of each mention of central sociological terms and figures
Central terms of interest are taken from a glossary in a commonly used and modern Sociology textbook. Terms are marked with their part of speech and various alternative surface forms. Each sentence of each document is checked for each term. On a match, the exact position and document are recorded, as well as the full sentence in which the term appears.
>The famous-figures search might be a bit more complex, as names could fail to match exactly.
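A sketch of the term-mention pass, assuming the glossary arrives as a mapping from each canonical term to its alternative surface forms, and that sentences come from the cleaned, sentence-split body described above:

```python
# Sketch of the term-mention pass. `glossary` maps each canonical term to its
# alternative surface forms (an assumed input structure).
import re

def term_mentions(doc_id, sentences, glossary):
    """Yield (term, doc_id, sentence_index, char_offset, sentence)."""
    patterns = {
        term: re.compile(r"\b(" + "|".join(map(re.escape, forms)) + r")\b",
                         re.IGNORECASE)
        for term, forms in glossary.items()
    }
    for i, sent in enumerate(sentences):
        for term, pat in patterns.items():
            for m in pat.finditer(sent):
                yield term, doc_id, i, m.start(), sent

# e.g. glossary = {"anomie": ["anomie", "anomic"],
#                  "social capital": ["social capital"]}
```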
### Relevant terms appearing in each article
This uses a ground-up approach to extracting useful or important terms. These can be nouns or compound nouns, adjectives, or relations between statements. There is some literature on each of these, and what makes them "interesting" is something which can be rigorously specified.
## Potential or existing analyses
### Continuations
Current version [here](http://alecmcgail.com/continuations/) or [with Cortext](https://documents.cortext.net/d7ef/d7ef57cb2d894ad086388ddd360f95cc/143876/distants.output_distant.html) [you have to press the "c" in parentheses, and it's pretty slow].
The user interface allows you to see what typically happens around a word or phrase in my dataset.
**Possible next steps**
+ Add a visualization of relative frequency over time.
+ What journals are over-represented?
+ Create a web-form so these counts can be explored in real-time
+ Probably requires a nicer structure of n-grams (see the n-grams dataset)
## Just ideas
### Clustering of term network change over time and place
Options:
* An edge can be counted if two terms are in the same document or the same sentence. Alternatively, weighted edges can be distance-based, or based on syntactic relations.
* The aggregation of these edges over documents can be a simple sum or a log frequency.
* Post-hoc weighting, for example based on tf-idf, could be applied
* We can tweak what terms are part of the network in the first place, either a priori or after constructing a complete network
* There are a few options for a distance metric between egonetworks of a term
* Distance metric between time trends
*dubious, needs to be checked* The method uses a window of 5 years for "now," measuring the difference to the 5-year window starting the next year. The output is a single number for each slice (its distance to the previous slice). This should be nonnegative. This windowed method increases the size of the time series and smooths trends, while providing ample data for accurate construction of the term-term network at each time.
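A sketch of this windowed change measure for a single term, assuming a helper `egonet_weights(term, start_year, end_year)` that returns the term's edge weights built from documents in that window; cosine distance between edge-weight vectors is one of the distance options listed above, not the only one:

```python
# Sketch of the windowed change measure for one term. `egonet_weights`
# (term, start_year, end_year) -> {neighbor: weight} is an assumed helper;
# cosine distance is one choice among the options listed above.
import math

def cosine_distance(a, b):
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0) * b.get(k, 0) for k in keys)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    if na == 0 or nb == 0:
        return 1.0
    return 1 - dot / (na * nb)

def change_trend(term, egonet_weights, years, window=5):
    trend = []
    for y in years:
        now = egonet_weights(term, y, y + window)
        nxt = egonet_weights(term, y + 1, y + 1 + window)
        trend.append((y, cosine_distance(now, nxt)))
    return trend  # one nonnegative number per slice
```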
This magnitude of change can be extracted on different sets of documents, producing trends at different levels of granularity:
1. Whole dataset
2. Each journal
3. By geographic origin of writers
4. By stated subdisciplines / keywords
> **A methodological note** This method is extremely computationally intensive. Keeping track of counts of all tuples takes a lot of space and computational resources. This can be avoided to some extent by surveying a small set of representative documents and identifying terms of interest. Constructing a network for these fewer concepts is a bit easier.
> **Methodological note 2**
At the moment there is plenty of junk finding its way into my dataset. I have successfully removed bibliographies, headers, and footers, but I have yet to strip out tables. This doesn't pose much of a problem, because these "terms" form their own distant cluster.
We then filter the trends, keeping only those which show substantial variation and which have enough presence in the documents. Because we have several levels of granularity, we can exclude terms which don't change across the dataset, or are constant within the journal of interest. What we're left with is a rich indicator of semantic change over time for each term.
*One possible next step* [listed below](#clusterTrajectories) is to cluster these trends, looking for changes which co-occur, and can be said to constitute the same change in language.
### Term co-presence over time and place
The idea to analyze word egonets over time (see *Clustering of term network change over time and place*) can be simplified a bit by considering the trends of specific edges, instead of the entire egonetworks. The same idea of filtering and clustering to find trends could be applied.
### Semantic network
Anything which can be described by an Earley parser can be run efficiently against sentences. This method can't express certain intuitive syntax, but it is useful for extracting a small set of meaningful phrases. Sentences which can be decomposed are highlighted, marking the parts that are decomposable. This is done for about 1,000 sentences. These forms are then applied to the full dataset, and false positives are coded away until the data looks mostly clean. Each surface form is interpreted as a relation between various sub-statements.
Superfluous add-ons are also specified based on a reading of the sentences, such as "it is apparent that".
Sub-statements must then be reduced, grouped by similar or identical referents. Those statements which have sufficiently many relations are exported to a Neo4j graph which can then be queried. Each edge of the graph is labeled with the historical and textual context of that link.
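A minimal sketch of that export using the official `neo4j` Python driver; the connection details, node labels, and property names are placeholders, and `relations` is assumed to be an iterable of `(subject, relation, object, context)` tuples from the extraction steps above:

```python
# Minimal sketch of the Neo4j export. Connection details, labels, and property
# names are placeholders; `relations` is an assumed iterable of
# (subject, relation, object, context) tuples.
from neo4j import GraphDatabase

def export(relations, uri="bolt://localhost:7687", auth=("neo4j", "password")):
    driver = GraphDatabase.driver(uri, auth=auth)
    query = (
        "MERGE (a:Statement {text: $subj}) "
        "MERGE (b:Statement {text: $obj}) "
        "MERGE (a)-[r:RELATES {label: $rel}]->(b) "
        "SET r.context = $ctx"
    )
    with driver.session() as session:
        for subj, rel, obj, ctx in relations:
            session.run(query, subj=subj, obj=obj, rel=rel, ctx=ctx)
    driver.close()
```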