First Look – New STN – Big Data Creates Chemistry Without Limits
Arguably, one of the hottest topics in data analysis and visualization is analytics on extremely large data sets. This subfield even has its own moniker, Big Data, and some, including the über-analysts at McKinsey, postulate that it is the “next frontier for innovation, competition, and productivity”. Wikipedia provides the following definition of Big Data:
Big Data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
This is relevant to a discussion of chemistry and patent searching and analysis because information professionals have been working with big data for years, decades even, but didn’t use a catchy phrase to describe what they were doing. The universe of available patent documents worldwide is well over 80 million, and in the CAS world of chemistry, the running count of known organic and inorganic molecules currently stands at over 73 million substances, not including nearly 65 million additional sequences. Numbers of this size, together with the interconnectedness of the data, certainly qualify patent and chemical information as sources of Big Data.
The Wikipedia definition also points to an issue that chemistry and patent information professionals have understood and struggled with for years: large data collections are difficult to process. Anyone who has tried to run a broad structure search for a collection of compounds of interest, for example, will have encountered system limits and other barriers that prevented the search from running in a timely fashion, or at all. Professional searchers have always found ways around these issues, usually by segmenting the data in some fashion so that a search could run to completion. With the new STN platform, developed by the STN partners CAS and FIZ Karlsruhe, these workarounds become a thing of the past, catapulting chemistry and patent information into the exciting world of Big Data.
The Welcome page for new STN discusses some of the philosophy behind the system:
The new STN platform is being developed in versions. Version One focuses on the core search and retrieval functionality and most essential content, combining the complete CAS Registry and Chemical Abstracts content along with Thomson Reuters’ Derwent World Patents Index.
The first version is recommended for preliminary searches in the following areas:
Chemistry and general technology research
Intellectual property, such as basic novelty and prior art
First pass Freedom to Operate
In a nod to the tools of Big Data, applied to the fields of chemical and patent information, it was revealed during the 2013 PIUG Annual Conference that new STN is powered by Hadoop. The term may not be familiar to patent and chemical information professionals, but in data analysis circles it is regarded as the holy grail: the technology working in the background that makes all things Big Data possible.
Once again, Wikipedia provides some context on Hadoop and its value to the field of data retrieval and analysis:
Apache Hadoop (High-availability distributed object-oriented platform) is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license. It supports the running of applications on large clusters of commodity hardware.
The Hadoop framework transparently provides both reliability and data motion to applications. Hadoop implements a computational paradigm named MapReduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. It enables applications to work with thousands of computation-independent computers and petabytes of data.
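The MapReduce paradigm described in the quotation can be illustrated with a minimal, single-process Python sketch. It is only a toy and says nothing about how Hadoop or new STN is actually implemented: the shard contents, document and substance identifiers, and function names are all invented for illustration. Each shard stands in for a fragment of work that a real cluster could run on any node.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical toy data: each shard is a list of
# (document id, substance ids indexed in that document).
SHARDS = [
    [("doc1", ["S1", "S2"]), ("doc2", ["S2"])],
    [("doc3", ["S2", "S3"]), ("doc4", ["S1"])],
]


def map_shard(shard):
    """Map step: emit a (substance, 1) pair for every substance mention in one shard."""
    return [(substance, 1) for _doc, substances in shard for substance in substances]


def shuffle(pairs):
    """Group the emitted values by key, as the framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def run_job(shards):
    """Run map over every shard independently, then shuffle and reduce."""
    mapped = chain.from_iterable(map_shard(shard) for shard in shards)
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    print(run_job(SHARDS))
```

Because `map_shard` looks only at its own shard, the shards could be processed in any order, or re-processed after a failure, without changing the result. That independence is the property the quoted description relies on.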
What this means for STN users is that the new platform is very fast and can handle almost any query that an information professional can throw at it. It opens up an entirely new class of questions that can be explored without imposing constraints just to get them to run. For analysts, it puts the entire collection of chemical and patent data at their fingertips and allows them to manipulate it at will.
To demonstrate some of this potential, let’s take a hypothetical example from the pharmaceutical industry:
Your organization is interested in Dipeptidyl Peptidase-IV (DPP4) inhibitors. Januvia, one of a new class of anti-diabetic agents, targets this enzyme as its mechanism of action. In planning your own drug discovery effort, a researcher would like to know the composition of all compounds that are structurally similar to sitagliptin (the free base of Januvia) and have been studied in conjunction with DPP4.
The structure of sitagliptin is below:
The structure has two ring systems: one is a phenyl, and the other is a 6,5 system with four nitrogen atoms. Theoretically, the structure below could be used to find structurally related substances:
There are variable points of attachment on the first ring, and the A represents any atom except H, allowing heteroatoms or carbon at any of these positions. Ordinarily, this structure query would not run on the traditional STN system unless the user locked the rings, otherwise modified the structure, or added structure screens of one type or another. In new STN, this query runs in seconds and produces almost 2.5 million structures. But why stop here? In this query, both rings must have six atoms, and the chain between them must have four. Either ring can be substituted or be part of a larger ring unit, but the two six-membered rings connected by a four-atom chain still remain. Let’s try an even broader query:
In this case, the chain can be 3 to 6 atoms long, and the ring system on the right can include 5- to 7-membered rings, as opposed to just six. This structure ran in under a minute and generated more than 10 million structures.
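The difference between the two queries can be pictured as a set of allowed sizes for each part of the skeleton. The Python sketch below is a deliberately simplified model and not STN query syntax: the `SkeletonQuery` class, its field names, and the assumption that the first ring stays six-membered in the broader query are all introduced here for illustration. Real structure queries also cover atom types, substitution, and fused rings, which this toy ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkeletonQuery:
    """Allowed sizes for ring A, the connecting chain, and ring B (hypothetical model)."""

    ring_a: range
    chain: range
    ring_b: range

    def matches(self, ring_a: int, chain: int, ring_b: int) -> bool:
        """True if a candidate skeleton falls inside every allowed size range."""
        return ring_a in self.ring_a and chain in self.chain and ring_b in self.ring_b

    def skeleton_count(self) -> int:
        """Number of distinct (ring A, chain, ring B) size combinations allowed."""
        return len(self.ring_a) * len(self.chain) * len(self.ring_b)


# First query: six-membered rings joined by a four-atom chain.
NARROW = SkeletonQuery(ring_a=range(6, 7), chain=range(4, 5), ring_b=range(6, 7))

# Broader query: chain of 3 to 6 atoms, right-hand ring of 5 to 7 members.
BROAD = SkeletonQuery(ring_a=range(6, 7), chain=range(3, 7), ring_b=range(5, 8))

if __name__ == "__main__":
    print(NARROW.skeleton_count(), BROAD.skeleton_count())
```

In this model the broader query admits twelve skeleton size combinations where the first admits one, which gives a rough sense of why the number of matching structures grows so quickly.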
Now that the structures have been identified, they can be linked to DPP4 by crossing them into the CAplus℠ database and linking them to the CAS Registry Numbers® or chemical names associated with DPP4.
Stop and think about what is being requested: take more than 10 million substances, along with two CAS Registry Numbers® and 15 name segments, and identify all of the references that link these items. This search finished in less than 30 seconds and identified 523 references.
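Conceptually, this crossing step is an intersection: keep the references that index at least one substance from the structure answer set and also mention the target, either by registry number or by a name segment. The Python sketch below shows that logic on toy data. It is not the STN command language, and the record layout, the identifiers such as `RN-TARGET`, and the `cross_search` function are hypothetical.

```python
def cross_search(structure_hits, registry_numbers, name_segments, references):
    """Return ids of references that link a structure hit to the target.

    A reference qualifies if it indexes at least one substance from
    structure_hits AND refers to the target, either through one of its
    registry numbers or through a name segment found in its index terms.
    """
    hits = set(structure_hits)
    target_rns = set(registry_numbers)
    segments = [segment.lower() for segment in name_segments]

    matched = []
    for ref_id, record in references.items():
        substances = set(record["substances"])
        text = " ".join(record["index_terms"]).lower()
        has_hit = not hits.isdisjoint(substances)
        has_target = (not target_rns.isdisjoint(substances)) or any(
            segment in text for segment in segments
        )
        if has_hit and has_target:
            matched.append(ref_id)
    return sorted(matched)


# Hypothetical toy references; the identifiers are placeholders, not real data.
REFERENCES = {
    "ref1": {"substances": ["S1", "RN-TARGET"], "index_terms": ["enzyme inhibitors"]},
    "ref2": {"substances": ["S2"], "index_terms": ["Dipeptidyl peptidase IV inhibitors"]},
    "ref3": {"substances": ["S9"], "index_terms": ["Dipeptidyl peptidase IV"]},
    "ref4": {"substances": ["S1"], "index_terms": ["polymer coatings"]},
}

if __name__ == "__main__":
    print(cross_search({"S1", "S2", "S3"}, {"RN-TARGET"}, ["dipeptidyl peptidase"], REFERENCES))
```

Here `ref3` is dropped because it contains no substance from the structure answer set, and `ref4` is dropped because it never mentions the target.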
At this point an analyst could look at these records, or the substances in them can be extracted and viewed individually. Extraction from the 523 references produces over 32,000 substances. To focus on the more relevant substances, these were limited to those that had only two ring systems and also matched the original, broader structure query. This led to a collection of 2,480 substances for the research staff to examine. All of them are structurally related to Januvia and have been studied in conjunction with its molecular target, yet they allow for a great deal of diversity while retaining the core structure.
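The refinement described above amounts to two filters applied to the extracted substances. A minimal Python sketch follows, assuming each extracted substance comes with a ring-system count; the `refine` function, the identifiers, and the counts are invented for illustration and are not results from the search.

```python
def refine(extracted, broad_query_hits, ring_systems=2):
    """Keep substances with exactly `ring_systems` ring systems that also
    appear in the answer set of the broader structure query."""
    hits = set(broad_query_hits)
    return sorted(
        substance_id
        for substance_id, count in extracted.items()
        if count == ring_systems and substance_id in hits
    )


# Hypothetical toy data: substance id -> number of ring systems.
EXTRACTED = {"S1": 2, "S2": 3, "S3": 2, "S4": 2}
BROAD_HITS = {"S1", "S2", "S4"}

if __name__ == "__main__":
    print(refine(EXTRACTED, BROAD_HITS))
```

In the toy data, `S2` is removed for having three ring systems and `S3` for not matching the broader structure query.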
There are many additional functions available in Version One, but this example is meant to illustrate the potential of the new platform to completely change the way chemical and patent information professionals interact with the database collections that are available now, and will be available in the future, on the new STN system. Full traditional STN functionality is not yet available, but the STN partners acknowledge this and have released Version One to begin bringing this new paradigm to chemical and patent information professionals.
New STN truly brings the power and functionality of Big Data to the study of chemical and patent information. Searches that were nearly unthinkable in the past can now be done in record time. The breadth and depth of the available collections, combined with the deep indexing created by the database producers, open the door to exploring chemistry and patents in a way that has never existed before. With the ability to explore these vast collections at will, the new platform is certainly creating a world without limits.