Technical Knowledge from Unstructured Text

    Alex Barbieri

    Clients often have large collections of technical reports generated by their business, which are dutifully archived. It takes significant knowledge and time to dig into those archives to get useful insights. Word embeddings offer a way to extract hidden technical relationships from the collections without requiring manual review.

    Word embeddings allow you to perform math on words in ways that preserve semantic meaning. The classic example is the analogical 'equation' 'king' - 'man' + 'woman' = 'queen', where the word embedding encodes the gendered relationship of the nouns at the same time as their general usage.
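    As a concrete illustration, the sketch below runs that arithmetic with gensim against a publicly available pre-trained GloVe model; the model name is just an illustrative choice, not the embedding built in this project.

    ```python
    # Minimal sketch of word-vector arithmetic with gensim.
    # The GloVe model used here is a public download chosen purely for
    # illustration; it is not the embedding discussed in this article.
    import gensim.downloader as api

    vectors = api.load("glove-wiki-gigaword-100")  # ~130 MB download

    # 'king' - 'man' + 'woman' ~= 'queen'
    result = vectors.most_similar(positive=["king", "woman"],
                                  negative=["man"], topn=5)
    print(result)  # 'queen' is typically the closest match
    ```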

    Recent research has shown that scientific texts can be used to build word embeddings that enhance likely candidate searches, in one case 'discovering' novel thermoelectric materials before they were explicitly researched.

    Tessella's AI accelerator, Cortex, allowed me the time and resources to try to extend this type of analysis to datasets that are closer to our real-life clients' collections. Because of the low cost (in time and computation) of building word embeddings, if the analysis 'works' on datasets in general, then it has broad applicability across all industry sectors in which Capgemini operates.


    Data collections

    I assembled three different datasets to study the performance of the analysis under varying corpus properties. Each is described below.

    Scientific Reports

    The first dataset was a collection of 140,000 scientific abstracts related to cancer research, selected primarily to replicate previous research. I built word embeddings from this dataset to predict whether pairs of genes/proteins interact and compared the results to StringDB, which acted as the 'ground truth'.
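    For readers who want a feel for the mechanics, the sketch below shows how such an embedding might be trained with gensim's Word2Vec. The two toy abstracts and the hyperparameters are placeholders, not the actual corpus or configuration used here.

    ```python
    # Sketch of building word embeddings from a collection of abstracts.
    # The toy abstracts and hyperparameters are placeholders; a real run
    # would use the full corpus and a larger min_count.
    import re
    from gensim.models import Word2Vec

    abstracts = [
        "BRCA1 mutations impair DNA repair in breast cancer cells.",
        "TP53 interacts with BRCA1 in the DNA damage response.",
    ]

    def tokenize(text):
        # Lowercase and keep alphanumeric tokens (gene symbols, words, etc.)
        return re.findall(r"[a-z0-9]+", text.lower())

    sentences = [tokenize(a) for a in abstracts]

    model = Word2Vec(
        sentences=sentences,
        vector_size=200,  # embedding dimension
        window=8,         # context window
        min_count=1,      # keep rare tokens only because the toy corpus is tiny
        sg=1,             # skip-gram variant
        workers=4,
    )
    print(model.wv.most_similar("brca1", topn=3))
    ```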

    Finding likely interactions between genes/proteins is often the first step toward targeted drug therapies; speeding up this process constitutes a huge gain for our clients in the life sciences. In a short amount of time, I was able to predict interactions with F1[1] scores nearing 0.3, a triumph given that the dataset is publicly available and that no genetics subject matter expert input was required.
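    One plausible way to turn an embedding into interaction predictions is to score each candidate pair by the cosine similarity of its vectors, threshold the score, and compare the decisions against labelled pairs. The sketch below uses random placeholder vectors, toy pairs and an arbitrary threshold; parsing StringDB itself is not shown.

    ```python
    # Score candidate gene/protein pairs by cosine similarity, threshold the
    # score, and evaluate against ground-truth labels (e.g. from StringDB).
    # Vectors, pairs, labels and threshold below are all placeholders.
    import numpy as np
    from sklearn.metrics import f1_score

    rng = np.random.default_rng(0)
    vectors = {g: rng.normal(size=200) for g in ["brca1", "tp53", "egfr", "kras"]}

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    candidate_pairs = [("brca1", "tp53"), ("egfr", "kras"), ("brca1", "egfr")]
    true_interaction = [1, 1, 0]   # ground-truth labels, e.g. from StringDB
    threshold = 0.4                # illustrative; tune on held-out data

    predicted = [1 if cosine(vectors[a], vectors[b]) > threshold else 0
                 for a, b in candidate_pairs]

    print("F1:", f1_score(true_interaction, predicted))
    ```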

    Patent Applications

    While some of our clients do collect scientific text, most are more likely to have collections of technical reports that are not published research. Does the analysis technique still apply to text that does not read like a scientific paper abstract?

    As an approximation of internal technical documents, I used the US Patent and Trademark Office’s collection of cancer moonshot patent applications. Patents are highly technical descriptions of techniques and inventions, and they often reference diagrams and tables that are not textual information. They are considerably longer (more than 60 pages) and have a more complicated textual structure than the single-paragraph scientific abstracts.
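    Because embedding training consumes short token sequences, a document of that length would typically be broken into sentences first, so the training input resembles the single-paragraph abstracts. The sketch below shows one naive way to do that; the example text and the simple sentence splitter are assumptions, not the pipeline used in the project.

    ```python
    # Sketch of preparing a long patent document for embedding training:
    # split the text into sentences and tokenize each one. The example text
    # and the naive sentence splitter are purely illustrative.
    import re

    def sentences_from_document(text):
        # Naive sentence split on ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            tokens = re.findall(r"[a-z0-9]+", sentence.lower())
            if tokens:
                yield tokens

    patent_text = (
        "The invention relates to inhibitors of EGFR. "
        "In one embodiment, the compound binds the kinase domain."
    )
    training_sentences = list(sentences_from_document(patent_text))
    print(training_sentences)
    ```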

    Using 40,000 cancer-related patents, I was able to build word embeddings and make predictions of cancer-related gene/protein interactions with the same level of accuracy as the scientific abstracts: the analysis method generalized to the technical document dataset and is not limited to published research.

    The use cases for likely candidate searches with technical documents are far-ranging: identifying the likely root cause of production failures, finding common elements of safety violations, finding unknown relationships between reagents from ELN records, etc.


    User Reviews

    The final dataset I analyzed was designed to stretch the analysis well beyond its limits: a set of recipes from food.com and their associated user reviews, obtained from Kaggle. To measure the 'information content' of the word embedding, I could not use a ground truth like StringDB, because a list of ingredients and their known flavors is not easily obtainable. Instead, I built families of analogies that tested my preconceived notions about ingredients. For example, to test whether the word embedding successfully encoded 'flavor', I used analogies of the form:

    > 'salt' - 'salty' + 'garlicy' = 'garlic'

    where there are two 'ingredients' and two 'flavors' involved. Then, using the vector math of the word embedding space, I tested whether 'garlic' was in the top-20 results closest to 'salt' - 'salty' + 'garlicy'. Other examples from the collection of analogies are:

    > 'apple' - 'tart' + 'spicy' = 'peppers'

    and

    > 'soy_sauce' - 'salty' + 'sweet' = 'banana'
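    Each such check reduces to a single nearest-neighbour query on the embedding. The sketch below shows the shape of the test using gensim's KeyedVectors; the random vectors stand in for the embedding actually trained on the food.com reviews, so the hit rate it prints is meaningless.

    ```python
    # Sketch of the analogy test: check whether the expected ingredient is
    # among the 20 words closest to ingredient1 - flavor1 + flavor2.
    # Random vectors stand in for the real recipe-review embedding.
    import numpy as np
    from gensim.models import KeyedVectors

    vocab = ["salt", "salty", "garlicy", "garlic",
             "apple", "tart", "spicy", "peppers",
             "soy_sauce", "sweet", "banana"]
    rng = np.random.default_rng(0)
    kv = KeyedVectors(vector_size=50)
    kv.add_vectors(vocab, rng.normal(size=(len(vocab), 50)))

    # Each analogy: ingredient1 - flavor1 + flavor2 should land near ingredient2.
    analogies = [
        ("salt", "salty", "garlicy", "garlic"),
        ("apple", "tart", "spicy", "peppers"),
        ("soy_sauce", "salty", "sweet", "banana"),
    ]

    hits = 0
    for ing1, flav1, flav2, expected in analogies:
        top20 = [w for w, _ in kv.most_similar(positive=[ing1, flav2],
                                               negative=[flav1], topn=20)]
        hits += expected in top20

    print(f"top-20 hit rate: {hits}/{len(analogies)}")
    ```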


    The user reviews comprise wildly non-technical, free-form responses from the general public. Even so, the intended word appeared within the top 20 closest results about one time in five. This shows that even incredibly messy, unstructured, noisy datasets can be used in the analysis to find likely candidates.

    This exact analysis could be used to support new food product development at consumer goods companies by associating ingredients with unexpected flavors reported by testers, and it could be extended to explore the connections between any set of 'nouns' and 'adjectives' that a client might be interested in.

    Intentionally Degrading Datasets

    In addition to extending the likely candidate search analyses to new datasets, I also studied how the performance of the analysis was affected by the size of the training corpus and by intentionally adding noise to the dataset as a way of approximating messier data sources.
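    The sketch below shows one way such an experiment might be set up: re-train on progressively smaller subsamples of the corpus, and on copies where a fraction of the term mentions has been randomly swapped to mimic noisier sources. The toy corpus, the swap-based noise model and the training/scoring stub are all assumptions, not the exact procedure used in the study.

    ```python
    # Sketch of the degradation experiments: shrink the corpus and corrupt a
    # fraction of the term mentions, then (in a real run) retrain and rescore.
    # The corpus, noise model and scoring stub below are illustrative only.
    import random

    def subsample(corpus, fraction, seed=0):
        rng = random.Random(seed)
        k = max(1, int(len(corpus) * fraction))
        return rng.sample(corpus, k)

    def add_noise(corpus, terms, noise_level, seed=0):
        rng = random.Random(seed)
        noisy = []
        for sentence in corpus:
            noisy.append([
                rng.choice(terms) if tok in terms and rng.random() < noise_level
                else tok
                for tok in sentence
            ])
        return noisy

    corpus = [["brca1", "interacts", "with", "tp53"],
              ["egfr", "signals", "through", "kras"]] * 1000
    terms = ["brca1", "tp53", "egfr", "kras"]

    for fraction in (1.0, 0.5, 0.1):
        smaller = subsample(corpus, fraction)
        noisier = add_noise(smaller, terms, noise_level=0.2)
        # A hypothetical train_and_score() would retrain Word2Vec on
        # `noisier` and recompute the F1 score here.
        print(fraction, len(noisier))
    ```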

    The figures below summarize the results, which show that performance falls off roughly logarithmically as the corpus shrinks and degrades gradually as signal purity decreases. There are no hard cliffs; much smaller or noisier datasets perform worse, but the performance may still be acceptable depending on the precise analysis requirements.

    [Figures: likely-candidate performance as a function of corpus size and of signal purity]

    Broad Applicability and Minimal Requirements

    My conclusion from the Cortex project is that the word embedding analysis applied to every dataset I could throw at it, and the results were surprisingly useful even for very small and noisy datasets drawn from a wide variety of domains.

    • Finding likely candidates in scientific or technical documents only requires:
      • ~20,000 documents
      • <20% noisy mentions of terms
    • Even in very noisy datasets, likely candidates can still be found with:
      • ~40,000 documents
    • There are no performance cliffs; falling below the above limits may still be fine if lower performance is acceptable

    [1] F1 score is a measure of accuracy computed as the harmonic mean of precision and recall, that is, F1 = TP / (TP + 0.5*(FP+FN)) where TP, FP, FN denote the numbers of true positives, false positives and false negatives, respectively. The maximum F1 score is therefore 1.0. Scores of 0.3 represent a very good algorithmic performance in the context of likely candidate searches.