Machine Learning in Patent Analytics – Part 3: Spatial Concept Maps for Exploring Large Domains
Continuing the Machine Learning in Patent Analytics series, the next method to be covered is Spatial Concept Maps. This is one of the most popular machine learning tasks associated with patent analytics. It was pioneered by the ThemeScape® tool, in what was originally the Aureka™ platform by Aurigin Systems®, and can now be found in Thomson Innovation® by Thomson Reuters®. Another early innovator was the VxInsight® system, developed at Sandia National Laboratories. The methods associated with this tool were used to create the Research Landscape feature contained in STN AnaVist™ from the STN® partnership. The visualizations associated with this task go by many names, including thematic maps and concept maps, and involve determining the similarity of documents and representing it as a relative distance. Since the characteristic being visualized is relative distance, maps are often used as the visual device, because humans are accustomed to comparing distances on them. Beyond organizing, or clustering, the documents, calculating relative distance adds the benefit of showing which clusters are related to one another. This places distinct but related sub-categories or methods closer to one another, while placing different approaches or methods in another location.
As discussed in Part 1 of this series, spatial concept maps generally use unsupervised machine learning methods to group documents by similarity as one of the early steps in generating a map. The two primary algorithms used in spatial concept mapping are K-means and Force-Directed Placement. Each method starts with the creation of a vector that represents the characteristics, or fingerprint, of each document. The methods differ in how the vectors are measured against one another to determine similarity, and in how they are eventually used to create a visualization based on relative distances.
A vector is a mathematical construct that represents the identifiers associated with each document to be analyzed. In thematic, or concept, maps the words or phrases contained in the documents are the dimensions of the vector. Theoretically, the total dimensionality of the vector is the number of distinct words occurring in the corpus that will be used for comparing the documents. If a discrete identifier occurs in a document, its value in that document's vector is non-zero. In a sense, a document vector can be thought of as a long list of 1s and 0s, where a 1 means that a term from the document collection is present in the document of interest, while a 0 means that the term was not found in that document. Other identifiers, such as classification codes or citations, can also be used, but for spatial concept maps the focus will be on words from the documents of interest.
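The "long list of 1s and 0s" can be sketched in a few lines of Python. This is only an illustration: the three-document corpus is made up, whitespace splitting stands in for real tokenization, and the function names `build_vocabulary` and `binary_vector` are hypothetical rather than taken from any mapping tool.

```python
def build_vocabulary(documents):
    """Return the sorted list of distinct lowercase words across all documents."""
    vocab = set()
    for text in documents:
        vocab.update(text.lower().split())
    return sorted(vocab)


def binary_vector(text, vocabulary):
    """Return a list of 1s and 0s marking which vocabulary terms occur in text."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocabulary]


# Toy corpus (hypothetical, for illustration only).
docs = [
    "wrist band heart rate sensor",
    "heart rate monitor strap",
    "battery charging circuit",
]
vocabulary = build_vocabulary(docs)  # one dimension per distinct word
vectors = [binary_vector(d, vocabulary) for d in docs]
```

Every vector has the same length as the vocabulary, so the first two documents share non-zero entries in the "heart" and "rate" dimensions while the third shares none with either.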
There are several methods for determining which concepts will be used to build document vectors, but the one used most frequently is called Term Frequency-Inverse Document Frequency (tf-idf). Wikipedia provides some information on the application of this technique:
tf–idf, term frequency–inverse document frequency, is a numerical statistic which reflects how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others.
When establishing the terms that will make up the dimensions of a vector, it is critical to use words that distinguish one document from another. If a term is used too frequently within a corpus, it will not help discriminate between the documents. Conversely, if a term is used too infrequently, it is too specific and doesn’t help in aggregating the documents.
Ultimately, the choice of terms used to populate the document vectors is the single most important factor in producing a useful map. When working with original patent documents, an analysis of this type should be restricted to certain sections of the document, such as the claims, or the titles and abstracts. Working with the entire body of text can confuse the system, since sections such as the background of the invention discuss other inventions rather than the one covered by the patent. Large text documents also affect the words identified by the tf-idf algorithm, resulting in the selection of more generalized terms. The vectors generated in this case will likely be sub-optimal, since the words chosen won’t be as distinctive.
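A rough sketch of this idea follows, using one common tf-idf variant (raw term count multiplied by the log of the inverse document frequency). The cut-offs `min_df` and `max_df_ratio`, the function name `tf_idf`, and the toy corpus are all assumptions made for illustration; commercial mapping tools do not publish their exact settings.

```python
import math
from collections import Counter


def tf_idf(documents, min_df=2, max_df_ratio=0.8):
    """Return (terms, weights): the kept terms and one {term: weight} dict per document.

    min_df and max_df_ratio are illustrative cut-offs, not values taken
    from any particular mapping tool.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Drop terms too rare to aggregate documents or too common to discriminate.
    terms = sorted(
        t for t, n in df.items() if n >= min_df and n / n_docs <= max_df_ratio
    )

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append(
            {t: counts[t] * math.log(n_docs / df[t]) for t in terms if counts[t]}
        )
    return terms, weights


# Toy corpus (hypothetical): "device" appears everywhere, "strap" only once.
corpus = [
    "device heart rate sensor",
    "device heart rate strap",
    "device battery charger",
    "device battery sensor cable",
]
terms, weights = tf_idf(corpus)
```

In this toy corpus "device" is discarded because it occurs in every document, and "strap", "charger" and "cable" are discarded because each occurs only once, leaving the mid-frequency terms as vector dimensions.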
Looking at ThemeScape® and the Research Landscape in STN AnaVist™ as examples, there are two different variables associated with the generation of vector terms. In the STN AnaVist™ FAQs, the following statement on the concepts used for mapping can be found:
The software was significantly enhanced to improve visualization results, utilizing the expertise of our database building staff and scientists:
CAS vocabulary to standardize the clustering concepts
A stopword list to improve cluster results for sci-tech searches
These enhancements allow for the software to produce more scientifically relevant clusters that are focused on scientific and intellectual property information.
In the case of STN AnaVist™, the CAS vocabulary is applied in the background, but it generally provides good results, since synonymous terms are removed from consideration, reducing the number of candidates for the vector.
When ThemeScape® was first released, it could only access source titles and abstracts, claims, or the full text of patent documents. When it was integrated into Thomson Innovation®, the standardization and analysis provided by various fields in the Derwent World Patents Index® (DWPI) also became available. A particularly powerful combination is the use of the Advantage, Novelty and Use fields of the documents of interest. These fields highlight key aspects of the inventions associated with the patents, and generally produce maps that highlight the differences between, and uses of, the corresponding technology.
The image below was created from a collection of patent documents associated with personal fitness bands. The use of the DWPI fields has generated a first pass map that identifies many of the major applications associated with these devices, including monitoring heart and pulse rate, and exercise.
As alluded to in the STN AnaVist™ FAQs, the other major lever users have for influencing document vectors is to selectively add stopwords to their settings. Stopwords, also referred to as non-content-bearing words, can adversely impact similarity measurements if they are included in the vector, since they do not impart knowledge of the topic area. Almost all mapping tools come with a list of standard stopwords, such as “the”, “and”, “a”, and other non-content-bearing terms, but users can also look at initial results and identify words that do not add meaning to the technology being examined. New words can be added to stopword lists within tools on a map-by-map basis, or permanently. For additional information on the impact of stopwords on patent mapping, refer to Understanding and customizing stopword lists for enhanced patent mapping, by Antoine Blanchard, which provides many excellent tips and examples.
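The effect of a custom stopword list can be illustrated with a short Python sketch. The standard list below is a tiny stand-in for the much longer lists that ship with real tools, and `remove_stopwords` and its `extra_stopwords` parameter are hypothetical names chosen for this example.

```python
# A tiny stand-in for the standard stopword list shipped with a mapping tool.
STANDARD_STOPWORDS = {"the", "and", "a", "of", "for", "in", "to", "is"}


def remove_stopwords(text, extra_stopwords=()):
    """Return the words of text that are not standard or user-added stopwords."""
    stopwords = STANDARD_STOPWORDS | {w.lower() for w in extra_stopwords}
    return [w for w in text.lower().split() if w not in stopwords]


title = "The device for monitoring a heart rate"

# First pass: only the standard list is applied.
first_pass = remove_stopwords(title)

# Second pass: the analyst decides "device" adds no meaning in this domain.
second_pass = remove_stopwords(title, extra_stopwords=["device"])
```

Running the same text with and without the extra entry shows how a domain-specific stopword drops out of the candidate terms before any vectors are built.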
Now that the document vectors have been created, the methods used to compare them and generate the corresponding maps can be discussed. ThemeScape® and IN-SPIRE™ use the K-means algorithm to create clusters. As discussed in Part 1 of this series, this algorithm aims to partition n observations (documents in this case) into k clusters in which each observation belongs to the cluster with the nearest mean (closest similarity), which serves as a prototype of the cluster. After the clusters are created, the software compares them against each other for similarity and arranges them in high-dimensional space (about 200 axes) so that similar clusters are located close together. It is not clear from the publicly available explanations what methods are used to create a vector for each cluster, as opposed to each document, or how those vectors are compared to one another. Theoretically, this could also be done using K-means in a second step, before dimensionality reduction takes the high-dimensional space down to two dimensions.
STN AnaVist™ and its cousin VxInsight® use a completely different mechanism for measuring similarity and representing it visually, even though their maps look similar to those generated using K-means. These tools use Force-Directed Placement in conjunction with cosine similarity. The force-directed placement ordination routine accepts a list of pre-computed similarities and outputs an x,y location for each object. The Wikipedia entry on Force-directed graph drawing explains the process of ordination further:
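For readers who want to see the partitioning step concretely, here is a minimal sketch of the standard K-means iteration (Lloyd's algorithm) in plain Python. It is not the implementation used by ThemeScape® or IN-SPIRE™, whose internals are not public. Seeding the centroids with the first k vectors is an assumption made so the example is reproducible, and the four toy vectors are invented.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def k_means(vectors, k, max_iter=100):
    """Return (assignments, centroids) after running Lloyd's algorithm."""
    # Assumption: seed with the first k vectors so the result is reproducible.
    centroids = [list(v) for v in vectors[:k]]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each document joins the cluster with the nearest mean.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # no document changed cluster, so the partition is stable
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = mean(members)
    return assignments, centroids


# Toy document vectors (hypothetical): two about topic A, two about topic B.
doc_vectors = [
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [1.0, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.9, 1.0],
]
assignments, centroids = k_means(doc_vectors, k=2)
```

The resulting centroids are the cluster "prototypes" mentioned above; one plausible way to compare clusters with each other, though not confirmed for any particular tool, would be to treat these centroids as cluster-level vectors.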
Their purpose is to position the nodes of a graph in two-dimensional or three-dimensional space so that all the edges are of more or less equal length and there are as few crossing edges as possible, by assigning forces among the set of edges and the set of nodes, based on their relative positions, and then using these forces either to simulate the motion of the edges and nodes or to minimize their energy.
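The idea of turning pre-computed similarities into x,y locations can be sketched with a simple spring model. This is only an illustration under stated assumptions and not the VxInsight® or STN AnaVist™ routine: each pair of objects is joined by a spring whose rest length is assumed to be 1 minus the similarity, the starting positions are placed on a circle for reproducibility, and the function name, step size, iteration count and similarity values are all hypothetical.

```python
import math


def force_directed_layout(similarity, iterations=500, step=0.05):
    """Return one (x, y) tuple per object from a symmetric similarity matrix.

    Each pair is joined by a spring with rest length 1 - similarity, so
    similar objects are pulled together and dissimilar ones are held apart.
    """
    n = len(similarity)
    # Deterministic start: spread the objects evenly around a unit circle.
    pos = [
        [math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n)]
        for i in range(n)
    ]
    for _ in range(iterations):
        forces = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                dx = pos[j][0] - pos[i][0]
                dy = pos[j][1] - pos[i][1]
                dist = math.hypot(dx, dy) or 1e-9  # avoid division by zero
                # Positive when the pair is too far apart (attract),
                # negative when the pair is too close (repel).
                stretch = dist - (1.0 - similarity[i][j])
                fx, fy = stretch * dx / dist, stretch * dy / dist
                forces[i][0] += fx
                forces[i][1] += fy
                forces[j][0] -= fx
                forces[j][1] -= fy
        # Move every object a small step in the direction of its net force.
        for i in range(n):
            pos[i][0] += step * forces[i][0]
            pos[i][1] += step * forces[i][1]
    return [tuple(p) for p in pos]


def distance(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


# Hypothetical pre-computed similarities: objects 0 and 1 are alike, as are 2 and 3.
similarities = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.9, 1.0],
]
positions = force_directed_layout(similarities)
```

After the layout settles, the two similar pairs sit close together and the two groups sit apart, which is the relative-distance behavior the maps rely on. Only the distances between points carry meaning; the coordinates themselves do not.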
The pre-computed similarities in this case are generated using a method called cosine similarity which is also defined in Wikipedia:
Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any other angle. It is thus a judgement of orientation and not magnitude: two vectors with the same orientation have a Cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude. Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0,1].
Note that these bounds apply for any number of dimensions, and Cosine similarity is most commonly used in high-dimensional positive spaces. For example, in Information Retrieval, each term is notionally assigned a different dimension and a document is characterized by a vector where the value of each dimension corresponds to the number of times that term appears in the document. Cosine similarity then gives a useful measure of how similar two documents are likely to be in terms of their subject matter.
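The definition above translates directly into code. The sketch below is plain Python; returning 0.0 for an all-zero vector is a convention chosen for this example, since the cosine is undefined in that case.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat an empty document as similar to nothing
    return dot / (norm_a * norm_b)


# Same orientation, different magnitude: similarity is 1 (up to rounding).
same_direction = cosine_similarity([1, 2, 0], [2, 4, 0])

# No terms in common (vectors at 90 degrees): similarity is 0.
no_overlap = cosine_similarity([1, 0], [0, 1])
```

Because term counts and tf-idf weights are never negative, document vectors live in the positive space the quotation mentions, so in practice the values fall between 0 and 1.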
Once more in English: a vector is created for each document, the vectors are compared using cosine similarity, and the documents are then positioned using force-directed placement. Based on this explanation, it is probably correct to say that this method of generating a spatial concept map is not technically clustering, since the documents are not initially partitioned, but since it is an unsupervised machine learning method, it is usually placed in that category. The image below illustrates a spatial concept map generated by STN AnaVist™.
Maps have also been created using classification, a supervised machine learning method, especially in the case of Kohonen Self Organizing Maps, but this method is not frequently applied to patent documents, and thus will not be covered in detail in this series.
To complete this examination of spatial concept maps there are a few additional items, common to both major methods of creating them, which are worth considering.
While the maps and the document organization are provided in two dimensions, a third dimension is often added after the fact by incorporating document density. The number of documents found in a cluster, or in reasonable proximity to one another, can be used to call out topics of higher interest than others in the collection. On a topographical version of the spatial maps, this is represented by an implied increase in peak height, visualized by a change in color.
Many of the spatial maps, especially the ones based on clustering methods, also provide contour lines on the diagrams. Generally, these lines are drawn based on the distance between the document dots. The distance between a dot and its nearest neighbor determines the boundaries of the lines. Once the threshold is exceeded the line is drawn between the two dots. It has been speculated that contour lines encompassing multiple groups on a map imply a relationship between those groups, but generally this is not the case, and the lines are simply based on the spread of the documents.
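One simple way to picture the density dimension is to count documents per grid cell, as in the sketch below. This is an assumption for illustration only: commercial tools may use smoother estimates, and the function name `density_grid`, the cell size, and the coordinates are hypothetical.

```python
import math
from collections import Counter


def density_grid(points, cell_size=1.0):
    """Count documents per square grid cell; higher counts suggest taller peaks."""
    counts = Counter()
    for x, y in points:
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        counts[cell] += 1
    return counts


# Hypothetical 2-D document positions taken from a finished map layout.
document_positions = [(0.1, 0.2), (0.4, 0.9), (0.5, 0.5), (2.3, 2.1), (-0.5, 0.2)]
density = density_grid(document_positions)

# The busiest cell is the natural candidate for the highest "peak".
peak_cell, peak_count = density.most_common(1)[0]
```

A renderer could then map each cell's count to a color ramp, so that the cell holding three documents is drawn as a higher elevation than the cells holding one.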
Finally, once the analyst feels comfortable that the system has done a reasonable job clustering documents and positioning the clusters relative to one another, they can change the labels on the map so they reflect the terminology used by the stakeholders of the project. Most systems generate labels by looking at frequently used words, especially if they are unique to a particular cluster. Sometimes this works well, but often the label terms are too generic and don’t really reflect the contents of the cluster. The clustering, in fact, may have been quite good, but a poor label may be the first, and only, thing that a client sees. If the labels are poor and don’t reflect meaningful categories, the client can lose interest or believe that the map is not meaningful. Labels can be changed within most mapping tools, and this should be done on a cluster-by-cluster basis by examining the titles of the individual documents within each cluster.
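The automatic labeling behavior described above (frequent words, favored when they are unique to a cluster) can be approximated as follows. The scoring rule, the function name `suggest_labels`, and the two toy clusters are assumptions for illustration; no vendor's actual labeling method is implied.

```python
from collections import Counter


def suggest_labels(clusters, top_n=2):
    """Return {cluster_id: [label terms]} for a {cluster_id: [texts]} mapping.

    Score = (documents in the cluster containing the term)
            x (share of all documents containing the term that are in this cluster),
    so frequent terms score well and terms shared with other clusters are penalized.
    """
    cluster_df = {}
    total_df = Counter()
    for cluster_id, texts in clusters.items():
        df = Counter()
        for text in texts:
            df.update(set(text.lower().split()))
        cluster_df[cluster_id] = df
        total_df.update(df)  # adds this cluster's counts to the corpus totals

    labels = {}
    for cluster_id, df in cluster_df.items():
        # Highest score first; ties are broken alphabetically for stable output.
        ranked = sorted(df, key=lambda t: (-(df[t] * df[t] / total_df[t]), t))
        labels[cluster_id] = ranked[:top_n]
    return labels


# Hypothetical clusters of document titles.
toy_clusters = {
    "A": ["heart rate sensor device", "heart rate monitor device"],
    "B": ["battery charging device", "battery charging circuit"],
}
labels = suggest_labels(toy_clusters)
```

Here the generic word "device" is frequent in cluster A but loses out to "heart" and "rate" because it also appears in cluster B. An analyst would still want to review such suggestions against the document titles before showing the map to a client.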
One last item on the interpretation of spatial concept maps: most maps appear to have an X-axis and a Y-axis, so many users think these visualizations behave like a scatterplot, where extrapolating between items along the axes can identify the contents of empty spaces on the map. In reality, there is no X-axis or Y-axis associated with spatial concept maps, and the distances between documents, usually represented by dots, are relative and based on similarity, as previously discussed. Since these distances are relative, and based on the contents of the collection, guesses cannot typically be made about what sort of document might occupy an empty space on the map.
It is also worth pointing out that a few years ago the analysts at Bristol-Myers Squibb published two articles in World Patent Information that also discussed spatial concept maps in the context of patent analysis. These articles also provide examples and additional tips for getting the most out of this technique. The titles, and links to purchase them, are below:
While spatial concept maps are used frequently by patent information professionals, and have been for over a decade, analysts still get many questions on how they are generated and how they should be interpreted. There is also a desire to be able to influence the key attributes that are represented in the corresponding visualization, to ensure that the immediate impact of the labeling and organization of the map is meaningful to their clients. By understanding the process involved in creating document vectors, and recognizing ways that it can be adjusted, analysts can produce maps that are directed to the attributes they want to highlight. Labels can also be changed in order to provide immediate relevance to the end-users of the analysis.