Of the three machine learning tasks covered in Part 1 of this series, classification may be the least familiar to patent information professionals. The methods used for automatic classification have been around for some time and have been used by patent offices, publishers and database producers in association with patent information, but few commercial tools have offered classification capabilities to analysts and information retrieval specialists. This is unfortunate, since statistical classification can potentially lead to enormous benefits for patent information professionals. Before launching into an example of how classification can assist with the identification and prioritization of relevant references within large patent document sets, let’s look at some details of the task itself.
The Wikipedia entry on Statistical Classification provides the following description of this method:
In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.
An algorithm that implements classification, especially in a concrete implementation, is known as a classifier. The term “classifier” sometimes also refers to the mathematical function, implemented by a classification algorithm, that maps input data to a category.
In the terminology of machine learning, classification is considered an instance of supervised learning, i.e. learning where a training set of correctly identified observations is available.
Classification can be thought of as two separate problems – binary classification and multiclass classification. In binary classification, a better understood task, only two classes are involved, whereas multiclass classification involves assigning an object to one of several classes. Since many classification methods have been developed specifically for binary classification, multiclass classification often requires the combined use of multiple binary classifiers.
As the Wikipedia description notes, classification can be split into two separate problems. This post applies the first, binary classification, to patent information retrieval and analysis, using a support vector machine (SVM) implementation for the task. A general description of SVMs was provided in the previous post, but the linked Wikipedia entry also describes the motivation behind the method and provides an illustration:
Classifying data is a common task in machine learning. Suppose some given data points each belong to one of two classes, and the goal is to decide which class a new data point will be in. In the case of support vector machines, a data point is viewed as a p-dimensional vector (a list of p items), and we want to know whether we can separate such points with a (p − 1)-dimensional hyperplane. This is called a linear classifier. There are many hyperplanes that might classify the data. One reasonable choice as the best hyperplane is the one that represents the largest separation, or margin, between the two classes. So we choose the hyperplane so that the distance from it to the nearest data point on each side is maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane and the linear classifier it defines is known as a maximum margin classifier; or equivalently, the perceptron of optimal stability.
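To make the margin idea in the passage above concrete, here is a minimal sketch that compares a few candidate hyperplanes for a toy two-dimensional data set and picks the one with the largest margin. The points, labels and candidate hyperplanes are invented for illustration; a real SVM searches over all hyperplanes rather than a short list.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.

    The result is negative if at least one point is on the wrong side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

# Toy data (hypothetical): two classes labelled -1 and +1.
points = [(1.0, 1.0), (2.0, 1.0), (4.0, 4.0), (5.0, 4.0)]
labels = [-1, -1, 1, 1]

# Three candidate separating lines, each given as (w, b).
candidates = {
    "A": ((1.0, 0.0), -3.0),    # vertical line x = 3
    "B": ((1.0, 1.0), -5.5),    # diagonal x + y = 5.5
    "C": ((2.0, 3.0), -13.5),   # 2x + 3y = 13.5
}

margins = {
    name: geometric_margin(w, b, points, labels)
    for name, (w, b) in candidates.items()
}
best = max(margins, key=margins.get)

for name, margin in sorted(margins.items()):
    print(f"{name}: margin = {margin:.3f}")
print("Largest margin among the candidates:", best)
```

All three lines separate the two classes, but they leave different amounts of room between the line and the nearest point on each side. The maximum-margin criterion prefers the line that leaves the most room.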
To apply this method to patent analysis, retrieval and the prioritization of documents, let’s look at a portion of the chapter Evaluating Real Patent Retrieval Effectiveness from the book Current Challenges in Patent Information Retrieval. The chapter postulates that recall and precision in patent information retrieval are in conflict with one another, and that perhaps they should be treated as separate activities:
Thinking about the issues in searching IR researchers usually look at precision and recall simultaneously, and measure their methods by how techniques stack up against both elements. We would like to suggest that when it comes to patent searching that it might be more productive to separate these functions so that they can be maximized independently. It has been demonstrated that risk, precision and recall do not follow the same linear path when discussing the various types of patent searches. Since this is the case it might be more productive to begin with creating methods that produce high recall exclusive of precision. Once this is accomplished the results can be ranked using different methods to improve precision and manage the way the results are shared with the searcher. It will likely be the case that different methods will be used to provide higher recall than those that can be employed to share records with higher precision. Instead of expecting a single method to do both it would be useful to the patent searching community if the process was done stepwise to maximize the value to the user.
Binary classification provides a means of dividing a large collection of patent documents into the references that are likely to be of highest interest to the information professional and those that are likely unrelated but were still retrieved in a broad search. The positive examples in the training set will be references that are highly relevant to the interests of the analyst. In training the classifier, the analyst will also need to identify documents that are off-topic, so the classifier can establish a hyperplane that distinguishes between the two categories.
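As a rough illustration of training on positive and negative examples, the sketch below fits a small linear SVM by sub-gradient descent on the regularised hinge loss, using plain word counts as features. It is not how KMX works internally, which is not described here; the example texts, the helper names and the training settings are all assumptions chosen to keep the code short.

```python
from collections import Counter

def bag_of_words(text):
    """Lower-cased term counts; a stand-in for a real text vectorizer."""
    return Counter(text.lower().split())

def score(weights, features):
    """Signed score w.x; positive means 'looks relevant'."""
    return sum(weights.get(term, 0.0) * count for term, count in features.items())

def train_linear_svm(examples, epochs=50, learning_rate=0.1, reg=0.01):
    """Sub-gradient descent on the L2-regularised hinge loss (no bias term)."""
    weights = {}
    for _ in range(epochs):
        for features, label in examples:  # label is +1 or -1
            margin = label * score(weights, features)
            # Regularisation step: shrink every weight slightly.
            for term in weights:
                weights[term] *= 1.0 - learning_rate * reg
            # The hinge loss only contributes when the margin is below 1.
            if margin < 1.0:
                for term, count in features.items():
                    weights[term] = weights.get(term, 0.0) + learning_rate * label * count
    return weights

# Hypothetical training texts: on-topic (+1) and off-topic (-1) examples.
positive = [
    "wearable band tracks steps and sleep",
    "wristband sensor monitors activity and heart rate",
]
negative = [
    "bluetooth headset with noise cancelling microphone",
    "wireless speaker with bluetooth audio streaming",
]
examples = [(bag_of_words(t), +1) for t in positive]
examples += [(bag_of_words(t), -1) for t in negative]

weights = train_linear_svm(examples)

on_topic = score(weights, bag_of_words("wristband tracks sleep"))
off_topic = score(weights, bag_of_words("bluetooth speaker"))
print(f"on-topic score: {on_topic:.2f}, off-topic score: {off_topic:.2f}")
```

Without the off-topic examples there would be nothing on the other side of the hyperplane, which is why the analyst has to supply both kinds of documents.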
To demonstrate this, the KMX Patent Analytics solution from Treparel will be used in conjunction with the CLAIMS Global Patent Database (CLAIMS GPD) from IFI CLAIMS. The following description of Treparel and KMX was found on their website:
We at Treparel came up with a solution that, using an advanced mathematical algorithm called Support Vector Machines (SVM), can control the big data challenge! Our solutions use the SVM algorithm in our unique methodology that dramatically changes the way people obtain information from data by means of text mining and visualization.
In close corporation with some of the largest multinational companies in the world we have developed a unique Text Analytics platform: KMX. KMX – Knowledge Mapping and Exploration – can be applied to any knowledge domain in any industry, in numerous business functions. KMX is used by professionals by small to multinational organizations that have vested interest to extract unprecedented value from “unstructured” data to support their operational business processes.
The breadth and scope of the CLAIMS GPD are described on their website as well:
IFI’s comprehensive international patent collection. This state of the art database is normalized and curated to provide unprecedented consistency and quality. Key features include:
KMX is available with a direct interface to the CLAIMS Global Patent Database and can support many other data sources as well.
Wearable fitness monitors have been discussed previously, and this area of technology will provide the examples used for this post and for a number of the remaining ones in this series. Aliphcom (doing business as Jawbone) sells the Up fitness monitor, while Nike competes with the Nike+ FuelBand. Both organizations sell other products and have extensive portfolios that cover their fitness monitors as well as many additional items. Let’s study how a binary classifier can help identify the patents associated with the Up and the FuelBand in the midst of many other documents from these companies.
A worldwide search in the CLAIMS GPD finds 275 patent documents assigned to Aliphcom. Of these, we know from the previous analysis that at least 80 are associated with Up. Ten of these documents will serve as the positive examples in our training set, and one outcome of this example will be to see how well the classifier identifies and prioritizes the remaining 70 documents. The Aliphcom portfolio also contains patent documents associated with Bluetooth headsets and speakers. Ten documents associated with these items will serve as the negative examples.
After the documents are imported from the CLAIMS GPD, the first step in preparing to build a classifier is to decide on the sections of text, and their relative weighting, that will be used to create the individual document vectors within KMX. In this case, the source titles and abstracts were used, with the abstract getting a weight of five and the title a weight of two. All potential family members were imported from CLAIMS GPD in this case, but under normal usage it’s probably a good idea to put the documents through some type of family reduction before performing a classification task. The source titles and abstracts can be used for a binary classification, since the user is simply trying to separate relevant documents from the remainder of the collection. In a future post, when we want to segregate documents into multiple categories, a different choice for the text source will need to be made in order to deal with circumstances where documents that share a priority use the same title and abstract for the different filings.
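One simple way to picture field weighting is to count each term once per occurrence and multiply by the weight of the field it came from. The sketch below does this with the weights mentioned above (title 2, abstract 5). How KMX actually builds its document vectors is not described here, so the function, the field names and the sample record are assumptions for illustration only.

```python
from collections import Counter

# Field weights taken from the text: abstract counts 5, title counts 2.
FIELD_WEIGHTS = {"title": 2, "abstract": 5}

def weighted_term_vector(document, field_weights=FIELD_WEIGHTS):
    """Build a term vector where each occurrence counts by its field's weight."""
    vector = Counter()
    for field, weight in field_weights.items():
        for term in document.get(field, "").lower().split():
            vector[term] += weight
    return vector

# Hypothetical record, not taken from the actual data set.
doc = {
    "title": "Wearable activity monitor",
    "abstract": "A wearable band that records activity data",
}

vector = weighted_term_vector(doc)
for term, value in vector.most_common():
    print(f"{term:10s} {value}")
```

A term that appears in both fields, such as "wearable" here, ends up with the sum of the two weights, so vocabulary shared by the title and the abstract has the strongest influence on the vector.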
The next step involves training the classifier and, as discussed above, ten positive and ten negative examples were chosen to accomplish this. When the Classification tab, as opposed to the Landscaping tab, is selected at the bottom of the screen, the user can select positive and negative examples simply by clicking the plus or minus button next to the document of interest. The figure below illustrates how positive and negative documents are added to a learning set in KMX.
Once the training documents are selected, the “Train Classifier” button is pushed and, as long as the “Also classify after training” box is checked, the system will perform the first classification on the set. When this is completed, two new columns, Score and S, appear in the document list window. Clicking on the Score column heading sorts the documents by score. In the Aliphcom example, the initial classifier worked reasonably well, but many relevant documents within the set still received a low score, normally under 50. This is to be expected, and KMX uses the S column to suggest documents for the user to classify manually in order to improve the classifier. The figure below shows the second round of retraining that was conducted to build the Aliphcom Fitness Band classifier.
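How KMX decides which documents to flag in the S column is not stated here. A common heuristic for this kind of suggestion is uncertainty sampling: ask the user to label the documents whose scores sit closest to the decision threshold, since those are the ones the classifier is least sure about. The sketch below assumes a 0 to 100 score with a threshold of 50; the function name, document identifiers and scores are hypothetical.

```python
def suggest_for_labeling(scores, labeled_ids, threshold=50.0, how_many=3):
    """Return the unlabeled documents whose scores are closest to the threshold."""
    candidates = [
        (abs(doc_score - threshold), doc_id)
        for doc_id, doc_score in scores.items()
        if doc_id not in labeled_ids
    ]
    return [doc_id for _, doc_id in sorted(candidates)[:how_many]]

# Hypothetical scores after a first classification pass.
scores = {"D1": 97.0, "D2": 52.0, "D3": 8.0, "D4": 46.0, "D5": 61.0, "D6": 49.0}
already_labeled = {"D1", "D3"}  # the documents used for training so far

suggestions = suggest_for_labeling(scores, already_labeled)
print(suggestions)  # ['D6', 'D2', 'D4']
```

After the user labels the suggested documents, the classifier is retrained and the cycle repeats, which matches the retraining rounds described in this post.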
After a second retraining, the classifier correctly separated all but one of the Aliphcom documents into those covering the Up fitness band and those covering the remainder of the company’s products. The one exception, along with its equivalent family members, is a recently published document that deals with a new application of the product line. All in all, with minimal effort, a result with greater than 95% precision was achieved. But this was the easy part. Could the classifier be used to help classify a larger, more diverse portfolio?
To test this, 11,126 worldwide patent documents from Nike were imported from the CLAIMS GPD and submitted to the SVM for classification using the final classifier generated from the Aliphcom patents. As one might expect, the initial use of the Aliphcom classifier did not produce stellar results. Many of the patents previously associated with the Nike FuelBand using traditional searching methods did not score well with the classifier. Again, this is to be expected, since the language and functions described in the titles and abstracts of the Aliphcom documents are different from the ones used by Nike. This situation was remedied by classifying the documents checked in the S column and retraining the classifier, as was done with the Aliphcom classifier. After three generations of training, the classifier had scored ~85% of the Nike documents accurately. It still scored some of the originally discovered documents poorly, but many of these were associated more with the Nike + iPod sensor system than with the FuelBand. Conversely, the classifier identified several Nike families that were not discovered using a reasonable traditional search. Combining the methods, in this case, would have led to a more comprehensive result when studying the Nike fitness monitor filings.
Finally, 43,612 US and WO documents from 2008 to the present in IPC A61B005, the class to which the majority of the relevant documents analyzed to this point were assigned by their respective patent offices, were classified using the Aliphcom and Nike classifiers, as can be seen in the image below:
This is an enormous number of extraordinarily diverse documents, and a very tall order for a machine learning method. A61B005 is the IPC class for measuring for diagnostic purposes, and it includes MRI and blood glucose monitoring as well as the fitness devices being investigated. The language used in the titles and abstracts of these documents can be very different from what was used in the Nike and Aliphcom documents.
The first attempt at classification produced very poor results, with a few documents receiving a high score and a handful receiving reasonable scores despite being off-topic. Following the previous pattern, the S labeled documents were manually classified and retraining took place. Due to the size and diversity of the collection, this process was repeated five times before a reasonable outcome was produced. Retraining based on the suggestions of the system produced a better classifier in this case than an attempt to classify the set using the newly generated, fifth generation classifier in combination with the Nike and Aliphcom classifiers. Using the fifth generation classifier, 620 documents received a score of 50 or better. The titles of these were studied, and from a preliminary examination it appeared that ~80% were on topic.
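The figures in this paragraph can be turned into a quick back-of-the-envelope calculation. The sketch below uses only numbers stated in this post (43,612 documents, 620 scoring 50 or better, roughly 80% of those on topic). The 80% figure came from a preliminary look at titles, so the derived counts are estimates rather than measured results.

```python
total_documents = 43_612      # US and WO documents in IPC A61B005
above_threshold = 620         # documents with a score of 50 or better
estimated_precision = 0.80    # "~80%" from a preliminary review of titles

# Estimated number of on-topic documents among those above the threshold.
estimated_relevant = round(above_threshold * estimated_precision)

# Documents that fell below the threshold and were not reviewed.
below_threshold = total_documents - above_threshold

# Share of the collection the analyst actually has to look at.
share_to_review = above_threshold / total_documents

print(f"Estimated on-topic documents above threshold: {estimated_relevant}")
print(f"Documents below threshold: {below_threshold}")
print(f"Share of collection to review: {share_to_review:.1%}")
```

Recall cannot be estimated from these numbers alone, because the number of relevant documents hiding among those below the threshold is unknown.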
This left roughly 43,000 documents that received a lower score, and clearly some of these are relevant documents that were not properly classified, but given the diversity within the set this is not an unexpected result. Classifying a large number of documents from a diverse IPC collection is the ultimate test for a machine learning method, and in this example the SVM performed reasonably. In real world situations, it is recommended that equivalents be removed and that the document collections created not be quite so broad. Alternatively, additional retraining sessions will help bring more relevant references over the score threshold of 50.
Binary classification using an SVM can be a powerful tool for prioritizing patents within a larger collection of documents. One of the best aspects of this method is that the classifiers, once created, can be reapplied to other collections, including new documents that publish on a weekly basis. In this fashion, a search can be designed to maximize recall, with precision addressed in a second step using a classifier.
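The two-step idea can be sketched in a few lines: a deliberately broad search gathers candidates, and a previously trained linear classifier then ranks them. Everything in the sketch is hypothetical, including the term weights (which stand in for a classifier trained earlier), the query terms and the weekly documents. It shows the shape of the workflow, not the KMX implementation.

```python
def broad_search(documents, query_terms):
    """Step 1, recall-oriented: keep any document sharing a term with the query."""
    return [
        doc for doc in documents
        if query_terms & set(doc["text"].lower().split())
    ]

def rank_by_classifier(documents, weights):
    """Step 2, precision-oriented: sort the retrieved set by linear score."""
    def linear_score(doc):
        return sum(weights.get(term, 0.0) for term in doc["text"].lower().split())
    return sorted(documents, key=linear_score, reverse=True)

# Hypothetical weights standing in for a classifier trained earlier.
weights = {
    "wearable": 1.2, "activity": 0.9, "band": 0.8,
    "headset": -1.1, "speaker": -0.9,
}

# Hypothetical batch of newly published documents.
weekly_documents = [
    {"id": "W1", "text": "wireless headset with sensor"},
    {"id": "W2", "text": "wearable band with activity sensor"},
    {"id": "W3", "text": "shoe sole cushioning"},
    {"id": "W4", "text": "speaker dock with motion sensor"},
]

retrieved = broad_search(weekly_documents, {"sensor", "wearable"})
ranked_ids = [doc["id"] for doc in rank_by_classifier(retrieved, weights)]
print(ranked_ids)  # ['W2', 'W4', 'W1']
```

The broad query pulls in off-topic sensor documents on purpose, and the ranking step pushes them to the bottom of the list so the analyst sees the likely hits first.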