Patent Assignee Cleanup Using Google Refine (Open Refine) – Text Facets and Clustering
There are a variety of tasks which patent analytics practitioners need to become proficient at in order to accurately portray the insights they glean from patent information. In previous posts we have looked at counting patents within families and how patenting practice within different countries can influence this. We have also looked at counting citations by patent family, within the US, and in Europe. In this post we will look at the task of List Cleanup and in particular Patent Assignee cleanup and what analysts can do to make this task more efficient.
List Cleanup is the process of manual or automatic standardization of terms or items within a data field to correct errors or inconsistencies. It is required by patent analysts in order to produce statistically relevant results. It is necessary since raw patent data is notorious “messy” and requires clean-up or standardization to produce accurate results. Misspellings, for instance, are a common occurrence with certain fields and require corrections.
One of the fields within patent information which is most often in need of being cleaned up is the Patent Assignee (PA) field. In this particular case, there are typos and misspellings found with the entries associated with each specific patent document. When looking at a large collection of documents, even if the company name is represented properly, the individual data providers may have added the organization’s location, division name or other assorted details to the PA field since that is the way the information was provided to them by the applicant.
One of the simplest ways to cleanup or organize the contents of the PA field is to create a Pivot Table in MS Excel and after sorting the column in alphabetic order, grouping the items manually. A recent Intellogist blog post covered aspects of this method and it is one of the least expensive, if time consuming, ways to cleanup the PA field since most people already have a copy of MS Excel or a similar spreadsheet tool.
There are also a variety of tools available from commercial database and tool producers for data cleanup. It is beyond the scope of this post to go through each one but you can find additional details on this function at the Orbit.com, Thomson Innovation, VantagePoint, STN AnaVist, Innography, PatSeer, Patent iNSIGHT Pro, and Minesoft websites. Some of these tools allow the user to manually group items, others rely on lists of common terms or “Corporate Trees” and some have developed their own thesaurus for assigning standardized names or codes to companies. A few of the tools also use algorithms to identify PAs which are similar to one another based on string-matching or homology and allows the user to group these items.
Getting this data cleaned or standardized is critical since eventually we will want to compare two entries or create a list or histogram based on the items in the PA field and we will likely do that using a PivotTable or filtering method that is going to look for an exact string match when it comes to counting the discrete item. If the field is not cleaned up , that is if the strings do not match, then we will see individual line items for each of the entries. The image below shows a list of PAs which come directly from the source documents without cleanup:
Notice in this case how there are multiple entries for the US Government. Just in this list, where any organization with more than three patents is represented, we see that there are two entries for the US Navy and one for the US Air Force. You might want a distinction to be made between the Navy and the Air Force or you might want all of the patents from the US grouped together into one category. In any event, we need to do something about the multiple entries for the US Navy. What caused the issue, in this rather long string, is at the very end where one Navy entry has US included and the other doesn’t. While I only show the entries with at least three mentions in the image above the actual list being used in this case is quite long and contains over a 1,000 items. In long lists cleanup becomes even more critical since merging entries can have a dramatic effect on their final count and the ranking of the items.
While there are a number of methods and tools available for data cleanup the focus of this blog post is on the use of Google Refine, soon to be called Open Refine, for conducting this task. I first learned of this tool in November of 2010 in a posting to the PIUG wiki by Amit Kumar Goyal who suggested that the patent information community might find it useful for this task.
Google Refine (let’s just call it Refine from this point on) is a desktop tool which loads in your browser so, unlike other Google Apps, this one does not require you to upload your data to the cloud. While patent data is typically publicly available it is still good to know that for some projects the information is not leaving your local environment. A copy of Refine can be downloaded from – http://code.google.com/p/google-refine/ (in the not too distant future it will be located at https://github.com/OpenRefine) for all major operating systems. Once launched and after clicking on the Create Project button on the upper left hand of the browser window the user can push Choose Files to import data into the tool.
In this example, input from a spreadsheet was used to bring the data into Refine. The tool can port data from a variety of formats but I find that the Comma Separated Value (CSV) format works best. Once the import is complete, the system will ask about how it should be parsed so it appears in a collection of rows and columns.
Since this is a CSV file the Columns are separated by button needs to be set to Comma. You can also tell the system how many header lines you have among other settings and see a preview of what your data will look like in Refine. Once it looks the way you expect hit the Create Project button in the upper right corner to create the project.
Ordinarily, a user might have many columns associated with various fields they are interested in but to simplify matters for this post only the relevant PA field was brought into Refine.
In this particular collection of PA data, the location information has also been included. Ideally, getting rid of this, or at least migrating it to a separate column, would be useful. Remember that cleanup and counting work on string matches so the tighter we can make the strings in the field we are working with, the easier our job will be later when conducting the analysis. Fortunately, there is a comma that marks where the organization name ends and the location begins. We can use the Edit column function under the drop down menu associated with this data column to split the cell into multiple ones.
This will take our initial, single column and split it into four columns, one each for the organization name, organization city, organization state and organization country in this particular case. While having the additional location data can be very helpful in some studies, for the purposes of this example only the first column for the organization name will be used. Now that we have tightened this field we open a Text Facet to see the items in this column and clean them.
The facet we just opened is on the left hand side of the window and at the top right corner of this window we see a button called Cluster. Pushing this button will activate the cleanup algorithms built into Refine.
There are six different algorithms provided for cleaning up data using the clustering methods found in Refine. There is a documentation page that provides the details on the various methods provided. This page suggests going through each of the methods in order starting with the Key Collision methods and ending with the Nearest Neighbor methods. It doesn’t take long to do this, but personally, I have found that I get the maximum value by using the Metaphone3 algorithm. If you only use one method use this one.
Once the algorithm is launched it will provide the results along with some statistics about the entries in the cells. The results of the clustering can be reviewed and if you agree with the way the algorithm performed the user can check the Merge box. A New Cell Value can also be added so the organization name looks the way the users wants it to as opposed to what is provided by the tool.
When finished, click the Merge Selected and Close button to get back to the main Refine page. At this point the Count link at the top of the facet window can be selected to see the cell values in descending order as opposed to alphabetic. In this case, the decision was made to group all of the US Gov agencies into a single entry.
Within the text facet window there is also the capacity to manually edit any of the values so if the algorithms missed something they can be cleaned up manually by scanning the list while in alphabetic order. If a change is required, pass the cursor over an entry and an edit link will appear that allows the user to change the value to something different. Remember to stay consistent if you are repeating an organization name that was used in a different part of the list.
Once the list is ready, it can be exported back out of Refine by using the Export button in the top right corner of the browser window.
Comparing our original list to our cleaned one we can see some significant differences in the counts and the relative ranking of the organizations. Ethicon, for instance, has moved up several spots. Both lists require at least three documents in the collection, in order to make the cut, and it can be seen that seven additional organizations have now made the list, since the cleanup grouped together various misspellings and assorted errors.
Data cleanup can be one of the most time intensive processes in a patent analysis project. Refine is a tool that can be used to make this task simpler and more efficient while still keeping the data private on the user’s computer. Refine can also be used for a number of other useful tasks that will be covered in future posts.