TextDistil

TextDistil is an innovative end-to-end language processing pipeline that takes plain text documents as input and outputs knowledge facts as RDF triples conforming to a target ontology, ready to load into W3C-compliant triple stores or graph databases. The TextDistil runtime is fine-tuned on training data and configured with linguistic rules and neural classifiers that work together to extract the facts embedded in raw text as subject, predicate, object triples.

TextDistil has been in production since version 1.0 was released at the beginning of 2022. TextDistil is server software that, in its default configuration, runs in the background. The TextDistil server processes as many documents in parallel as there are cores on the server. It can also be connected to FactCheck, a visual debugger that enables live viewing of the extracted facts in an interactive visual graph.
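The one-document-per-core behavior described above can be illustrated with a minimal sketch using Python's standard `concurrent.futures` module. This is not TextDistil's implementation: `process_corpus` and `word_count` are hypothetical names, and the per-document work is a placeholder word count.

```python
import os
from concurrent.futures import Executor, ProcessPoolExecutor
from typing import Callable, Iterable, List, Type


def word_count(text: str) -> int:
    """Placeholder for per-document work (hypothetical); a real pipeline would extract facts."""
    return len(text.split())


def process_corpus(
    documents: Iterable[str],
    worker: Callable[[str], int] = word_count,
    executor_cls: Type[Executor] = ProcessPoolExecutor,
) -> List[int]:
    """Run `worker` over the documents with one worker slot per CPU core."""
    max_workers = os.cpu_count() or 1  # cpu_count() may return None
    with executor_cls(max_workers=max_workers) as pool:
        # Executor.map() returns results in the order of the inputs.
        return list(pool.map(worker, documents))
```

With the default process pool, call `process_corpus` from under an `if __name__ == "__main__":` guard so that worker processes can import the module cleanly. Passing `ThreadPoolExecutor` as `executor_cls` gives the same interface with threads.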

TextDistil is a novel platform for building custom language processing pipelines to suit a wide range of advanced text analytics. As the graph below shows, the most valuable quadrant is the top-right one: knowledge that adheres to a domain ontology and is the direct result of processing the raw, untampered source, the natural language text. TextDistil is one of a handful of solutions in this quadrant on the market today.

TextDistil has three out-of-the-box pipelines: (1) a knowledge extraction pipeline, (2) a text classification pipeline, and (3) a topic modeling and semantic index pipeline.

[Figure: knowledge graphs]

The knowledge extraction pipeline involves several stages of complex computation, each contributing to the overall transformation required for the end solution. Raw text, as it occurs naturally, is unstructured and complex data. To extract value from it, some structure must first be imposed and the different parts of the text must be tagged with what they actually are. Structure in running text starts with identifying sentences, tokenizing words, and tagging tokens with their parts of speech. A later stage in the pipeline assigns higher-level meaning to this basic structure, i.e. the part-of-speech tags are supplemented with higher-level tags such as the ‘subject’ and ‘object’ of the sentence.
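As a rough illustration of the first two structuring steps, sentence identification and tokenization, here is a minimal regex-based sketch. It is an assumption-laden stand-in, not TextDistil's method: the function names are hypothetical, the sentence rule is a naive heuristic that will mis-split abbreviations, and part-of-speech tagging is left out because it needs a trained model.

```python
import re
from typing import List

# Naive heuristic: a sentence ends at ., ! or ? followed by whitespace
# and an uppercase letter. Abbreviations such as "Dr. Smith" will be mis-split.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

# A token is either a run of word characters or a single punctuation mark.
_TOKEN = re.compile(r"\w+|[^\w\s]")


def split_sentences(text: str) -> List[str]:
    """Split running text into sentences using the heuristic above."""
    return [s.strip() for s in _SENTENCE_BOUNDARY.split(text) if s.strip()]


def tokenize(sentence: str) -> List[str]:
    """Split a sentence into word and punctuation tokens."""
    return _TOKEN.findall(sentence)


text = "George Martin, 72, lives in Santa Fe. He is married."
for sentence in split_sentences(text):
    print(tokenize(sentence))
```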

[Figure: text-based knowledge semantic web stack]

Phrases and clauses are also identified and, depending on configuration, made available to subsequent stages. Another stage in the pipeline resolves co-references and tags named entities as PERSON, LOCATION, ORGANIZATION, etc. Further stages tag the relations between entities and link the entities to the target ontology. At the end of the pipeline, the results are produced as knowledge facts: triples whose subject, predicate, and object are each resolved to a URI if they exist in the target knowledge base.
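The final linking step, resolving a mention to a URI only if it exists in the target knowledge base, can be sketched as a simple lookup. This is a toy illustration under stated assumptions: the `KNOWLEDGE_BASE` dictionary, the `example.org` URIs, and the `link_entity` function are all hypothetical, and real entity linking has to deal with ambiguity and context.

```python
from typing import Dict, Optional

# Hypothetical label-to-URI table standing in for the target knowledge base.
KNOWLEDGE_BASE: Dict[str, str] = {
    "george martin": "http://example.org/resource/George_Martin",
    "santa fe": "http://example.org/resource/Santa_Fe",
}


def link_entity(mention: str, kb: Dict[str, str] = KNOWLEDGE_BASE) -> Optional[str]:
    """Return the URI for a mention, or None if the knowledge base has no entry."""
    key = " ".join(mention.lower().split())  # normalize case and whitespace
    return kb.get(key)


for mention in ["George Martin", "Santa Fe", "Paris McBride"]:
    print(mention, "->", link_entity(mention))
```

Mentions without an entry, such as "Paris McBride" in this toy table, come back as `None` and stay unresolved.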

Below is the output graph produced at the end of the pipeline for the sentence:

“George Martin, 72, lives in Santa Fe, New Mexico, with his wife, Paris McBride.”

Named Entities: “George Martin”, “Santa Fe”, “Paris McBride”, “New Mexico”
Relations & Properties: “lives in”, “wife”
Numerical Values: “72”

[Figure: language generation knowledge graphs]
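To make the shape of such output concrete, the facts listed for the example sentence can be written as RDF triples in N-Triples syntax. The sketch below is illustrative only and is not actual TextDistil output: the `example.org` namespaces, the predicate names `livesIn`, `wife`, and `age`, and the `Triple` and `to_ntriples` helpers are assumptions. "New Mexico" is left out because the source lists no relation for it.

```python
from typing import NamedTuple, Union

EX = "http://example.org/resource/"   # hypothetical entity namespace
ONT = "http://example.org/ontology/"  # hypothetical ontology namespace
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


class Triple(NamedTuple):
    subject: str          # URI of the subject
    predicate: str        # URI of the predicate
    obj: Union[str, int]  # URI string, or an int for a numeric literal


def to_ntriples(triple: Triple) -> str:
    """Serialize one triple as a line of N-Triples."""
    if isinstance(triple.obj, int):
        # Numeric values become typed literals.
        obj = f'"{triple.obj}"^^<{XSD_INTEGER}>'
    else:
        obj = f"<{triple.obj}>"
    return f"<{triple.subject}> <{triple.predicate}> {obj} ."


facts = [
    Triple(EX + "George_Martin", ONT + "livesIn", EX + "Santa_Fe"),
    Triple(EX + "George_Martin", ONT + "wife", EX + "Paris_McBride"),
    Triple(EX + "George_Martin", ONT + "age", 72),
]

for fact in facts:
    print(to_ntriples(fact))
```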

The modular architecture of TextDistil makes it possible to reuse stages when composing novel text processing pipelines, and to upgrade individual stages with new state-of-the-art (SOTA) solutions from the fast-changing world of neural language models.
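One common way to get this kind of modularity is to treat every stage as a function from document to document, so that a pipeline is just an ordered list of stages. The sketch below shows that pattern in general terms. It is not TextDistil's API, and every name in it is hypothetical.

```python
from typing import Callable, Dict, List

Document = Dict[str, object]
Stage = Callable[[Document], Document]


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    """Pass the document through each stage in order."""
    for stage in stages:
        doc = stage(doc)
    return doc


def tokenize_stage(doc: Document) -> Document:
    # Adds a "tokens" field; the input document is not mutated.
    return {**doc, "tokens": str(doc["text"]).split()}


def lowercase_tokenize_stage(doc: Document) -> Document:
    # Drop-in replacement for tokenize_stage with different behavior.
    return {**doc, "tokens": str(doc["text"]).lower().split()}


def count_stage(doc: Document) -> Document:
    # Depends only on the "tokens" field, not on which stage produced it.
    return {**doc, "token_count": len(doc["tokens"])}


original = run_pipeline({"text": "Santa Fe is in New Mexico"}, [tokenize_stage, count_stage])
upgraded = run_pipeline({"text": "Santa Fe is in New Mexico"}, [lowercase_tokenize_stage, count_stage])
print(original["token_count"], upgraded["tokens"])
```

Because `count_stage` reads only the `tokens` field, the tokenizer can be swapped without touching the stages that follow it.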

The text classification pipeline uses both traditional machine learning and newer neural models to classify short texts such as tweets, customer reviews, Facebook posts, customer feedback, product reviews, problem reports, server logs, and chats. Each short text, be it a tweet or a product review, is assigned a single class label or multiple labels. The sentiment of the text is flagged as well. Short text classification helps the enterprise gauge the voice of the customer from customers’ feedback and comments.
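As an example of the traditional machine learning side, here is a minimal multinomial Naive Bayes classifier for single-label short text, written with the standard library only. It is a generic textbook technique used as a stand-in. The source does not say which models TextDistil uses, and the class name and tiny training set are made up for illustration.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple


def tokenize(text: str) -> List[str]:
    return text.lower().split()


class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self) -> None:
        self.word_counts: Dict[str, Counter] = defaultdict(Counter)
        self.label_counts: Counter = Counter()
        self.vocabulary: Set[str] = set()

    def fit(self, examples: List[Tuple[str, str]]) -> None:
        for text, label in examples:
            self.label_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)

    def predict(self, text: str) -> str:
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = "", -math.inf
        for label, doc_count in self.label_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training examples, for illustration only.
classifier = NaiveBayesClassifier()
classifier.fit([
    ("great product works well", "positive"),
    ("love it fast delivery", "positive"),
    ("terrible quality broke quickly", "negative"),
    ("awful support never again", "negative"),
])
print(classifier.predict("great delivery"))
```

A multi-label variant could train one such binary classifier per label and return every label whose classifier fires.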

The TextDistil platform also includes feature-rich, highly scalable, intelligent document corpus management with topic modeling, duplicate tagging, indexing, and semantic search. Documents are first organized using Elasticsearch for basic keyword and raw-text indexing and search. At a higher level, documents are categorized by subject topics and keywords, which are computed using the language processing pipeline. Duplicate documents are also flagged.
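The two basic ideas in this paragraph, keyword indexing and duplicate flagging, can be shown with a small in-memory sketch. It deliberately does not use Elasticsearch and does not reflect how TextDistil implements either feature: it is a plain inverted index plus exact-duplicate detection by content hash, with hypothetical function names.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Set


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return " ".join(text.lower().split())


def find_duplicates(documents: Dict[str, str]) -> Dict[str, str]:
    """Map each duplicate document id to the first id seen with the same content."""
    first_seen: Dict[str, str] = {}
    duplicates: Dict[str, str] = {}
    for doc_id, text in documents.items():
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest in first_seen:
            duplicates[doc_id] = first_seen[digest]
        else:
            first_seen[digest] = doc_id
    return duplicates


def build_index(documents: Dict[str, str]) -> Dict[str, Set[str]]:
    """Build an inverted index mapping each word to the ids of documents containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for word in normalize(text).split():
            index[word].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the ids of documents that contain every word in the query."""
    words = normalize(query).split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


docs = {
    "a": "Santa Fe is in New Mexico",
    "b": "santa  fe is in new mexico",
    "c": "Paris McBride lives in Santa Fe",
}
print(find_duplicates(docs))
print(sorted(search(build_index(docs), "santa fe")))
```

Hashing catches only exact matches after normalization. Near-duplicate detection would need a similarity measure such as shingling, which is outside this sketch.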

TextDistil works with standards-based annotation tools, taxonomies, thesauri, and ontology IDEs, both commercial and open source.