app.munge module
-
exception app.munge.DataFormatError
Bases: Exception
-
args
-
with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
-
-
class app.munge.Munger(limit, cache)
Bases: object
Collects patent data into a graph by querying the USPTO.
Initializes a munger.
Parameters: - limit – the maximum number of patents to process and query
- cache – whether to use data cached in a CSV file instead of making a fresh query
-
get_edges()
Returns the edges from this query if it has already been made; otherwise, loads the data first.
Returns: the edge list in a dataframe, including date
-
get_network(metadata=False, limit=None)
Constructs a citation network from the edge list.
Parameters: - metadata – whether or not to include metadata
- limit – a limit to the number of documents to return
Returns: the NetworkX graph
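As a rough, dependency-free sketch of what building a citation network from an edge list can look like: the code below uses a plain adjacency mapping in place of the NetworkX graph that get_network returns. The function name, the (citing, cited, date) row layout, the edge direction, and the reading of limit as a cap on distinct patents are all assumptions made for illustration, not the module's actual behaviour.

```python
from collections import defaultdict


def build_citation_graph(edges, limit=None):
    """Build a directed adjacency map from (citing, cited, date) rows.

    Assumption: `limit` caps the number of distinct patents (nodes);
    an edge that would push the node count past the cap is skipped.
    """
    graph = defaultdict(dict)  # citing -> {cited: date}
    nodes = set()
    for citing, cited, date in edges:
        new_nodes = {citing, cited} - nodes
        if limit is not None and len(nodes) + len(new_nodes) > limit:
            continue
        nodes |= new_nodes
        graph[citing][cited] = date
    return dict(graph), nodes


# Hypothetical edge list: each row reads "citing patent, cited patent, date".
edges = [
    ("B", "A", "2001-05-01"),
    ("C", "A", "2003-02-11"),
    ("D", "B", "2005-09-30"),
]
graph, nodes = build_citation_graph(edges, limit=3)
```

With limit=3 the third row is dropped because it would introduce a fourth patent. In a NetworkX setting the same loop would presumably call add_edge on a directed graph instead of writing into a dict.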
-
load_data_from_file(datafile)
Loads data from a file for this query (using the unique make_filename function).
Parameters: datafile – the file to search for
Returns: this instance
-
make_filename(prefix='QUERY', dirname='query')
Creates a filename under which to store the munged data.
Returns: the filename
-
static query(json_query)
Makes a query to the USPTO using a JSON attributes object.
Parameters: json_query – the JSON query, following the PatentsView API
Returns: the query response in JSON format
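For orientation, here is a minimal sketch of how a JSON query object of this kind might be assembled before being passed to query. The q/f/o key layout and the _gte operator reflect the legacy PatentsView query format as commonly described and should be checked against the API documentation; build_query and the sample criteria are hypothetical names introduced for this example, and no request is actually sent.

```python
import json

# Field names copied from the query_fields attribute documented on this page.
QUERY_FIELDS = [
    "patent_number",
    "cited_patent_number",
    "cited_patent_date",
    "citedby_patent_number",
    "citedby_patent_date",
]


def build_query(criteria, page=1, per_page=1000):
    """Assemble a request body with q (criteria), f (fields), o (options).

    Assumption: this key layout matches what the API expects.
    """
    return {
        "q": criteria,
        "f": QUERY_FIELDS,
        "o": {"page": page, "per_page": per_page},
    }


# Hypothetical criteria: patents dated on or after 1 January 2015.
payload = build_query({"_gte": {"patent_date": "2015-01-01"}}, per_page=100)
body = json.dumps(payload)  # serialized form, ready to send as a request body
```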
-
query_fields = ['patent_number', 'cited_patent_number', 'cited_patent_date', 'citedby_patent_number', 'citedby_patent_date']
-
query_to_dataframe(info, bcites=False)
Converts the JSON query results from PatentsView to an edge list dataframe.
Parameters: - info – the query JSON output
- bcites – whether or not to include backward citations
Returns: the dataframe containing an edge list with the query results
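The conversion described above can be sketched as follows, under stated assumptions. The response shape (a "patents" list whose items carry "cited_patents" and "citedby_patents" sub-lists) is inferred from the query_fields names and is not confirmed by this page; response_to_edges and the citing/cited/date column names are hypothetical. The sketch returns a list of row dicts, which could then be handed to a dataframe constructor.

```python
def response_to_edges(info, bcites=False):
    """Flatten a PatentsView-style response into citing -> cited rows.

    Assumption: forward citations are always included and backward
    citations only when `bcites` is true.
    """
    rows = []
    for patent in info.get("patents") or []:
        number = patent["patent_number"]
        # Forward citations: later patents that cite this one.
        for cite in patent.get("citedby_patents") or []:
            if cite.get("citedby_patent_number"):
                rows.append({
                    "citing": cite["citedby_patent_number"],
                    "cited": number,
                    "date": cite.get("citedby_patent_date"),
                })
        if bcites:
            # Backward citations: earlier patents that this one cites.
            # The date kept here is the one reported for the cited patent.
            for cite in patent.get("cited_patents") or []:
                if cite.get("cited_patent_number"):
                    rows.append({
                        "citing": number,
                        "cited": cite["cited_patent_number"],
                        "date": cite.get("cited_patent_date"),
                    })
    return rows


# Hypothetical response with one patent, one backward and one forward citation,
# plus an empty placeholder entry that should be ignored.
sample = {
    "patents": [
        {
            "patent_number": "100",
            "cited_patents": [
                {"cited_patent_number": "42", "cited_patent_date": "1999-03-02"},
            ],
            "citedby_patents": [
                {"citedby_patent_number": "250", "citedby_patent_date": "2010-07-13"},
                {"citedby_patent_number": None, "citedby_patent_date": None},
            ],
        }
    ]
}
forward_only = response_to_edges(sample)
with_backward = response_to_edges(sample, bcites=True)
```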
-
exception app.munge.QueryError
Bases: Exception
-
args
-
with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
-
-
class app.munge.QueryMunger(query_json, limit=None, cache=True, per_page=1000, allow_external=False, feature_keys=['cpc_category', 'cpc_group_id', 'assignee_type', 'assignee_total_num_patents', 'assignee_id', 'inventor_id', 'inventor_total_num_patents', 'ipc_class', 'ipc_main_group', 'nber_category_id', 'nber_subcategory_id', 'patent_date', 'patent_num_claims', 'patent_num_cited_by_us_patents', 'patent_processing_time', 'uspc_mainclass_id', 'uspc_subclass_id', 'wipo_field_id'])
Bases: app.munge.Munger
A special munger designed to make a specific query to the PatentsView API.
Initializes the query munger.
Parameters: - query_json – the JSON for the query
- limit – the maximum number of documents to munge
- cache – whether or not to use cached query data
- per_page – the number of patents to request in each individual query
-
ensure_data()
Checks that the edge list has been minimally loaded.
-
ensure_meta()
Checks that the metadata has been loaded.
-
static get_citation_keys()
-
get_edges()
Returns the edges from this query if it has already been made; otherwise, loads the data first.
Returns: the edge list in a dataframe, including date
-
static get_filename_from_stem(file_string, dir_name)
-
get_network(metadata=False, limit=None)
Constructs a citation network from the edge list.
Parameters: - metadata – whether or not to include metadata
- limit – a limit to the number of documents to return
Returns: the NetworkX graph
-
load_data()
Loads data from a query or a file into a dataframe.
Returns: the instance
-
load_data_from_file(datafile)
Loads data from a file for this query (using the unique make_filename function).
Parameters: datafile – the file to search for
Returns: this instance
-
load_metadata()
Queries for metadata about each patent and adds it to the dataframe.
-
make_filename(prefix='QUERY', dirname='query')
Creates a filename under which to store the munged data.
Returns: the filename
-
static post_request(json_query)
-
static query(json_query)
Makes a query to the USPTO using a JSON attributes object.
Parameters: json_query – the JSON query, following the PatentsView API
Returns: the query response in JSON format
-
query_fields = ['patent_number', 'cited_patent_number', 'cited_patent_date', 'citedby_patent_number', 'citedby_patent_date']
-
query_paginated(page, per_page)
Iteratively queries the PatentsView API (to avoid a timeout and to gather data to file over time).
Parameters: - page – the current page number to query
- per_page – the number of patents per query
Returns: a dataframe containing the query page results
-
query_sounding()
Sends a sounding query to establish the number of documents to scrape.
Returns: the number of patents to scrape
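To illustrate how a sounding count and per_page might combine into a sequence of query_paginated calls, here is a small sketch of the page arithmetic. plan_pages is a hypothetical helper, and both the 1-based page numbering and the way limit caps the total are assumptions rather than documented behaviour.

```python
import math


def plan_pages(total_found, per_page=1000, limit=None):
    """Return (number of patents to fetch, list of page numbers to request).

    Assumptions: pages are numbered from 1, and `limit` caps the total.
    """
    target = total_found if limit is None else min(total_found, limit)
    n_pages = math.ceil(target / per_page)
    return target, list(range(1, n_pages + 1))


# A sounding query reporting 2500 matches, fetched 1000 at a time.
target, pages = plan_pages(total_found=2500, per_page=1000)

# The same query with a document limit of 1200.
capped_target, capped_pages = plan_pages(total_found=2500, per_page=1000, limit=1200)
```

The last page is generally partial, so a caller that enforces limit strictly would still need to trim the rows it receives.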
-
query_to_dataframe(info, bcites=False)
Converts the JSON query results from PatentsView to an edge list dataframe.
Parameters: - info – the query JSON output
- bcites – whether or not to include backward citations
Returns: the dataframe containing an edge list with the query results
-
summary()
Summarizes the edge list.
-
summary_meta()
Summarizes the metadata.
-
write_data_to_file(filename)
Writes the collected data to a file.
Parameters: filename – the name of the file, typically the query name
-
static write_graph_to_file(G, filename)
-
class app.munge.RootMunger(patent_number, depth, limit=None, cache=True)
Bases: app.munge.Munger
A special munger to fetch the descendants of a given patent number.
Initializes the root munger.
Parameters: - patent_number – the root patent number
- depth – the depth of the search (the number of generations of children)
- limit – a document limit
- cache – whether to use a cached query in the filesystem
-
ensure_data()
Checks that the edge list has been minimally loaded.
-
ensure_meta()
Checks that the metadata has been loaded.
-
features = ['cpc_category', 'cpc_group_id', 'assignee_type', 'assignee_total_num_patents', 'assignee_id', 'inventor_id', 'inventor_total_num_patents', 'ipc_class', 'ipc_main_group', 'nber_category_id', 'nber_subcategory_id', 'patent_date', 'patent_num_claims', 'patent_num_cited_by_us_patents', 'patent_processing_time', 'uspc_mainclass_id', 'uspc_subclass_id', 'wipo_field_id']
-
get_children(curr_num, curr_depth)
Recursively fetches the children of the current patent, up to the maximum depth, and adds them to the edge list.
Parameters: - curr_num – the current patent being munged
- curr_depth – the current depth away from the root patent number
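The recursion described above can be sketched with an in-memory lookup standing in for the API. CITED_BY is hypothetical sample data, max_depth, edges, and seen are passed explicitly here rather than held on an instance, and the seen set (which avoids expanding a patent twice) is an addition for this example that the module may or may not use.

```python
# Hypothetical stand-in for the API: patent -> patents that cite it (children).
CITED_BY = {
    "root": ["a", "b"],
    "a": ["c"],
    "b": [],
    "c": ["d"],
    "d": [],
}


def get_children(curr_num, curr_depth, max_depth, edges, seen):
    """Depth-limited recursive walk that appends (child, parent) edges."""
    if curr_depth >= max_depth:
        return  # reached the requested number of generations
    for child in CITED_BY.get(curr_num, []):
        edges.append((child, curr_num))
        if child not in seen:
            seen.add(child)
            get_children(child, curr_depth + 1, max_depth, edges, seen)


edges = []
get_children("root", 0, 2, edges, {"root"})
```

With a depth of 2, the walk records the root's children and grandchildren and stops before reaching "d".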
-
static get_citation_keys()
-
get_edges()
Returns the edges from this query if it has already been made; otherwise, loads the data first.
Returns: the edge list in a dataframe, including date
-
static get_filename_from_stem(file_string, dir_name)
-
get_network(metadata=False, limit=None)
Constructs a citation network from the edge list.
Parameters: - metadata – whether or not to include metadata
- limit – a limit to the number of documents to return
Returns: the NetworkX graph
-
load_data()
Loads data from a query or a file into a dataframe.
Returns: the instance
-
load_data_from_file(datafile)
Loads data from a file for this query (using the unique make_filename function).
Parameters: datafile – the file to search for
Returns: this instance
-
load_metadata()
Queries for metadata about each patent and adds it to the dataframe.
-
make_filename(prefix='QUERY', dirname='query')
Creates a filename under which to store the munged data.
Returns: the filename
-
static post_request(json_query)
-
static query(json_query)
Makes a query to the USPTO using a JSON attributes object.
Parameters: json_query – the JSON query, following the PatentsView API
Returns: the query response in JSON format
-
query_fields = ['patent_number', 'cited_patent_number', 'cited_patent_date', 'citedby_patent_number', 'citedby_patent_date']
-
query_to_dataframe(info, bcites=False)
Converts the JSON query results from PatentsView to an edge list dataframe.
Parameters: - info – the query JSON output
- bcites – whether or not to include backward citations
Returns: the dataframe containing an edge list with the query results
-
summary()
Summarizes the edge list.
-
summary_meta()
Summarizes the metadata.
-
write_data_to_file(filename)
Writes the collected data to a file.
Parameters: filename – the name of the file, typically the query name
-
static write_graph_to_file(G, filename)