Revisiting an Old Standard – 80% of Technical Information is Found Only in Patents

It is one of those old statistical measures that most of us take for granted: 80% of the information in patents is never published anywhere else. Information professionals have been saying this for years and, for the most part, many of us in the patent information profession have simply taken this as a true statement. In fact, this is one of those statistics that people have used for so long that, if questioned, most aren’t even sure where it originated or what proof there is for it. Such was the case when this question was raised recently on the Patent Information User Group’s (PIUG) discussion forum. The post elicited a number of relevant comments and is worth reading for historical perspective on the statistic.

It is generally accepted that the source of this saying is the “Eighth Technology Assessment and Forecast Report” from the USPTO, published in 1977, but as pointed out in the comments of the PIUG post, this study was done with a very small amount of data. A statistic like this is also likely to be technology dependent, with different areas being more or less focused on publishing only via patents. The PIUG thread also includes comments discussing more recent research on this question, published in 2005 and based on chemical information. These studies reached conclusions similar to the original statistic, although the results did vary depending on the chemical sub-discipline studied.

One thread that runs through all of the attempts to answer this question is the use of chemical information to study the issue. This is likely due to the existence of data from the American Chemical Society’s Chemical Abstracts Service (CAS), an organization that does a pretty comprehensive job of capturing discrete chemical entities from both patent and non-patent literature. Specific chemical substances are only one type of potential “technology”, but considering how difficult it generally is to capture, and subsequently search for, other types of technology, it makes sense to use something compartmentalized, like chemical substances, to look at a question like this one.

CAS has been saying for many years now that more than 70% of the new substances added to the CAS Registry from the literature come from patents. This statement is interesting in and of itself, and while it doesn’t directly answer the question associated with the oft-quoted statistic, it is a relevant piece of information since the majority of substances in the file have only one publication associated with them.

So while we have some evidence that, at least for the chemical sciences, this statement about patent publications is likely true, is there a way to study the issue more definitively? The previous studies have always had to settle for small sample sizes, or make certain assumptions, due to the sheer volume of data associated with chemical compounds and the limitations associated with analyzing them. Thinking about this, it occurred to me that while that used to be the case, we now have a powerful tool for studying large amounts of chemical information. I talked about this tool in a previous post when I provided a first look at the New STN system. In that post I talked about the idea of Big Data, and how the people behind New STN were taking advantage of recent advances in data analytics to bring the benefits of big data to the world of chemical information. The question of what percentage of chemical information described in patents is ever discussed anywhere else seems like an ideal example of the sort of question a big data solution for chemical information could answer.

One of the features of New STN is the ability to transfer information quickly between multiple files within the system. In this case I am interested in extracting substance information from the CAplus file, which is the database where literature references, both patents and non-patents, are stored, and transferring that data into the Registry file, where the substances are kept. I am also interested in finding all references associated with chemical substances once I have identified them. The commands to do this on New STN are subx, which extracts the substances from a set of references, and refx, which finds the references associated with a set of substances. Using these commands I was quickly able to come up with a comprehensive answer to the publication-in-patents-only question using the world’s largest collection of chemical information.

I started by simultaneously entering both the Registry and CAplus files on New STN and ran a search for patents as a document type in CAplus. This produced 9,543,607 patent references in the database. Extracting the substances from these references produced a collection of 49,058,846 compounds in the Registry database. These numbers are pretty staggering and, as was pointed out in the previous post on New STN, could not have been produced under the system limits imposed by the previous versions of STN. See the image below for a look at some of the most recent substances from this collection:

Screenshot of Most Recent Patented Substances from CAS Registry Database

Crossing all of these substances back into CAplus generates 24,824,536 literature references, both patent and non-patent, associated with these more than 49 million substances. Of these nearly 25 million references, 19,190,577 are not patents. I can now extract the substances from just the non-patent literature references, bring them back to Registry and compare that to my originally extracted patented substances collection of just over 49 million.

When I did this I found that the 19 million non-patent literature references generated 35,654,723 substances themselves. Already, this is a smaller number than the 49 million we started with, but the real question is what happens when this collection is NOTed out of the original collection of patented substances. What we find is that 46,449,600 substances remain when the substances associated with the non-patent literature references are removed from the starting collection of patented substances.

This means that about 95% of the substances coming from the patent collection on CAplus (46,449,600 of 49,058,846) did not have a corresponding non-patent literature reference associated with them.
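As a quick check on that percentage, the arithmetic can be reproduced from the two Registry counts reported in this post. The following is only a sketch in Python; the variable names are illustrative and are not part of STN.

```python
# Counts reported in the post (Registry file on New STN).
patent_substances = 49_058_846        # substances extracted from the patent references
patent_only_substances = 46_449_600   # left after removing substances seen in non-patent references

# Substances that appear in both patent and non-patent references.
shared_substances = patent_substances - patent_only_substances

patent_only_share = patent_only_substances / patent_substances
shared_share = shared_substances / patent_substances

print(f"shared substances: {shared_substances:,}")
print(f"patent-only share: {patent_only_share:.1%}")
print(f"shared share:      {shared_share:.1%}")
```

The difference between the two counts, about 2.6 million substances, is the part of the patent collection that also turns up in non-patent references, which is where the rounded 95% figure comes from.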

The series of search commands I followed for this is below:

L2    p/dt

CAplus: 9,543,607 (Patents in CAplus)

L3    subx L2

CAplus: -    REGISTRY: 49,058,846 (Substances associated with the patents)

L4    refx L3

CAplus: 24,824,536    REGISTRY: - (All references associated with the patented substances)

L5    L4 not p/dt

CAplus: 19,190,577    REGISTRY: - (Just the non-patent references)

L6    subx L5

CAplus: -    REGISTRY: 35,654,723 (Extraction of the substances from the non-patent literature references)

L7    L3 not L6

CAplus: -    REGISTRY: 46,449,600 (The substances found in the patents but not in the non-patent literature references)
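The logic of this sequence can be illustrated with ordinary set operations. The sketch below is a toy model in Python, not the STN interface: the `subx` and `refx` functions are hypothetical stand-ins for the commands of the same name, and the four references and their substances are made up purely for illustration.

```python
# Toy model of the search sequence; this is NOT the STN API.
# Each reference maps to (is_patent, set of substance identifiers). All data is invented.
references = {
    "pat-1": (True, {"A", "B", "C"}),
    "pat-2": (True, {"C", "D"}),
    "npl-1": (False, {"C", "X"}),  # article sharing substance C with the patents
    "npl-2": (False, {"Y"}),       # article with no patented substance
}


def subx(ref_ids):
    """Hypothetical stand-in for subx: all substances indexed in the given references."""
    substances = set()
    for ref_id in ref_ids:
        substances |= references[ref_id][1]
    return substances


def refx(substances):
    """Hypothetical stand-in for refx: all references indexing any of the given substances."""
    return {ref_id for ref_id, (_, subs) in references.items() if subs & substances}


l2 = {r for r, (is_patent, _) in references.items() if is_patent}  # p/dt
l3 = subx(l2)                                                      # subx L2
l4 = refx(l3)                                                      # refx L3
l5 = {r for r in l4 if not references[r][0]}                       # L4 not p/dt
l6 = subx(l5)                                                      # subx L5
l7 = l3 - l6                                                       # L3 not L6

print(sorted(l7), len(l7) / len(l3))
```

In this toy data the set standing in for L6 contains a substance (X) that never appears in a patent, which suggests why the L6 count should not be compared directly with L3; it is the final NOT step that isolates the patent-only substances.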

The patenting of chemical substances represents a reasonable percentage of the technologies covered by the world’s patenting authorities, and thus represents a reasonable collection to study to determine how often technologies mentioned in patents are never mentioned elsewhere. In this study the substances associated with every patent included in the CAS literature database were extracted. These discrete chemical entities were then searched, and all literature references associated with them discovered. From these the patent documents were excluded, and again the substances were extracted. A comparison of the substances coming from the non-patent literature references to those coming from the patent references showed that 95% of the patented substances did not appear in the non-patent literature references. Once again, this example only covered chemical technologies, and is thus not applicable to other technology areas, but in this case the percentage of information found in patents that is never published elsewhere is actually significantly larger than the 80% value that has been bandied about for more than 30 years.

 


discuss this post

  • Roger Sayle

    Is “L4 refx L4” a typo?

    • Anthony Trippe

      Hello Roger,

      Yes, certainly, and I have now corrected that. Thank you for pointing this out.

      Thanks again,
      Tony

  • Michael

    I really appreciated this post. Although my background is not in the chemical arts, it’s nice to see a data-driven approach to assessing a statistic many of us have heard cited over the years.

    • Anthony Trippe

      Thank you Michael!

      The approach certainly has its assumptions and limitations, but for what it represents I think the approach is reasonably rigorous.

      Stay tuned for some additional information on this data collection.

      Thanks again,
      Tony

  • David Walsh

    I welcome the analysis of this statistic, proving the importance of patent information. However, there are caveats that also need to be considered. Patents do contain large numbers of chemical molecules that are not necessarily the focus of the invention, but are present to define a space around the invention. Their intention is not particularly to be treated as an invention, but to ward off competition. In relation to sequences of proteins, DNA, RNA etc., it is more likely that they will have been deposited in the relevant sequence databanks as well as being reported in CAS. The statement is correct, but if anything is of commercial or research value, then it will appear in the non-patent literature, often before patent publication and multiple times after.

    • Anthony Trippe

      Hello David,

      Thank you for your comments and I agree completely with the caveats you mentioned. There has also been a reasonable amount of discussion on LinkedIn about the concept of inventiveness and invention and how this can be measured. I completely agree that this particular study is a single data point in a bigger landscape but no less valuable and interesting.

      Thanks,
      Tony

  • Interesting analysis but at 5.1 substances/patent this has caveats. Can you filter by C07D/A61K and even reduce to WO-only to get at the med chem specific bits?

    Slicing PubChem provides comparative slices but obviously with different caveats. Patent extraction ~ 15 million, literature extraction (mostly ChEMBL) ~ 1 million and intersect ~ 0.5 million

    • Anthony Trippe

      Hello Chris,

      It would certainly be possible to slice the data a number of different ways. Let me check in with the people at Chemical Abstracts to see if they would be okay with some continued exploration in this area.

      Looking at PubChem, and the material coming from IBM and/or SureChem that covers patents would also be pretty interesting but I don’t know if you can filter on the literature source.

      Thanks,
      Tony

  • Rutger

    It seems to me that we still don’t know if the 80% from the title is accurate.
    The calculations thus far tell us how many compounds are only disclosed in patents. To find out if the 80% is correct, it is necessary to also calculate how many compounds are only disclosed in the non-patent literature. Only then can you calculate the percentage of compounds only disclosed in patents.
    It would be interesting to see what percentage comes out and if the 80% from the title is correct.

    • Anthony Trippe

      Hello Rutger,

      The 80% figure comes from an old USPTO document. A link to the original article is included so you can have a look at it if you want to see the rationale. Personally, I think the study is not terribly relevant since it was conducted on such a small collection of inventions.

      Regarding your comment on the number of substances from the non-patent literature, I am afraid I don’t agree that this is required to say whether the small molecules identified as coming from patents were published in the non-patent literature. You simply need to see if the substances that are known to have come from patents were ever covered in an NPL document.

      Having said that, it is a separate question to ask how many substances from the NPL ever show up in a patent. I believe this percentage is also quite high indicating that for the most part the sources are somewhat mutually exclusive in the types of molecules covered.

      Let me know if I am misunderstanding the point of your question.

      Best regards,
      Tony

 
 
