app.munge module
-
exception app.munge.DataFormatError
Bases: Exception
-
args¶
-
with_traceback()¶ Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
-
-
class app.munge.Munger(limit, cache)
Bases: object
Collects patent data into a graph by querying the USPTO.
Initializes a munger.
Parameters: - limit – the maximum number of patents to process and query
- cache – whether to use data cached in a CSV file rather than make a fresh query
-
get_edges()[source]¶ Return the edges from this query, if it has been made; else, load data
Returns: the edge list in a dataframe, including date
-
get_network(metadata=False, limit=None) Constructs a citation network from the edge list.
Parameters: - metadata – whether or not to include metadata
- limit – a limit to the number of documents to return
Returns: the NetworkX graph
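As a rough illustration of what constructing a citation network from an edge list involves, the sketch below builds a directed adjacency mapping from (citing, cited) pairs using only the standard library. The function name `build_citation_graph`, the tuple layout, and the way `limit` caps the number of distinct documents are assumptions made for this example; the documented method returns a NetworkX graph and may apply `limit` differently.

```python
def build_citation_graph(edges, limit=None):
    """Build a directed adjacency mapping from (citing, cited) pairs.

    `limit` (an assumption for this sketch) caps the number of distinct
    documents admitted to the graph; edges that would exceed it are skipped.
    """
    graph = {}
    seen = set()
    for citing, cited in edges:
        new_nodes = {citing, cited} - seen
        if limit is not None and len(seen) + len(new_nodes) > limit:
            continue  # adding this edge would exceed the document limit
        seen |= new_nodes
        graph.setdefault(citing, set()).add(cited)
        graph.setdefault(cited, set())  # cited-only patents still become nodes
    return graph


edges = [("B", "A"), ("C", "A"), ("D", "B")]
graph = build_citation_graph(edges, limit=3)
```

With NetworkX available, the same loop could call `add_edge` on a `DiGraph` instead of filling a dictionary.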
-
load_data_from_file(datafile) Load data from file for this query (using the unique make_filename function).
Parameters: datafile – the file to search for
Returns: this instance
-
make_filename(prefix='QUERY', dirname='query') Creates a filename under which to store the munged data.
Returns: the filename
-
static
query(json_query) Makes a query to the USPTO using a JSON attributes object.
Parameters: json_query – the JSON query, formatted according to the PatentsView API
Returns: the query response in JSON format
-
query_fields= ['patent_number', 'cited_patent_number', 'cited_patent_date', 'citedby_patent_number', 'citedby_patent_date']¶
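To make the shape of such a query concrete, here is a minimal sketch that assembles a request body from the `query_fields` listed above. It assumes the legacy PatentsView convention of `q` (criteria), `f` (fields to return) and `o` (options such as paging). The helper name `build_query` and the example criterion are hypothetical, and nothing is sent over the network.

```python
import json

QUERY_FIELDS = [
    "patent_number",
    "cited_patent_number",
    "cited_patent_date",
    "citedby_patent_number",
    "citedby_patent_date",
]


def build_query(criteria, page=1, per_page=1000):
    """Return a JSON string for one page of results (hypothetical helper)."""
    payload = {
        "q": criteria,                               # search criteria
        "f": QUERY_FIELDS,                           # fields to return
        "o": {"page": page, "per_page": per_page},   # paging options (assumed keys)
    }
    return json.dumps(payload)


body = build_query({"_gte": {"patent_date": "2015-01-01"}}, per_page=100)
```

Keeping the payload construction separate from the HTTP call makes it easy to inspect or log a query before sending it.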
-
query_to_dataframe(info, bcites=False)[source]¶ Converts the JSON query results from PatentsView to an edge list dataframe.
Parameters: - info – the query json output
- bcites – whether or not to include backward citations
Returns: the dataframe containing an edge list with the query results
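The conversion can be pictured as flattening nested citation records into rows. The sketch below assumes a response in which each patent carries `citedby_patents` and `cited_patents` lists whose entries use the field names from `query_fields`. That layout, the function name `to_edge_rows`, and the citing-to-cited edge direction are all assumptions. Plain tuples keep the example dependency-free; the rows could equally be loaded into a dataframe.

```python
def to_edge_rows(info, bcites=False):
    """Flatten a PatentsView-style response into (citing, cited, date) rows."""
    rows = []
    for patent in info.get("patents") or []:
        number = patent["patent_number"]
        # Forward citations: later patents that cite this one.
        for cite in patent.get("citedby_patents") or []:
            if cite.get("citedby_patent_number"):
                rows.append((cite["citedby_patent_number"], number,
                             cite.get("citedby_patent_date")))
        # Backward citations: earlier patents that this one cites.
        if bcites:
            for cite in patent.get("cited_patents") or []:
                if cite.get("cited_patent_number"):
                    rows.append((number, cite["cited_patent_number"],
                                 cite.get("cited_patent_date")))
    return rows


info = {
    "patents": [
        {
            "patent_number": "100",
            "cited_patents": [
                {"cited_patent_number": "50", "cited_patent_date": "1990-01-01"}
            ],
            "citedby_patents": [
                {"citedby_patent_number": "200", "citedby_patent_date": "2010-05-04"}
            ],
        }
    ]
}
forward_only = to_edge_rows(info)
with_backward = to_edge_rows(info, bcites=True)
```

The date in each row is whichever date accompanies the citation record, so forward and backward rows carry dates of different patents.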
-
exception app.munge.QueryError
Bases: Exception
-
args¶
-
with_traceback()¶ Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
-
-
class app.munge.QueryMunger(query_json, limit=None, cache=True, per_page=1000, allow_external=False, feature_keys=['cpc_category', 'cpc_group_id', 'assignee_type', 'assignee_total_num_patents', 'assignee_id', 'inventor_id', 'inventor_total_num_patents', 'ipc_class', 'ipc_main_group', 'nber_category_id', 'nber_subcategory_id', 'patent_date', 'patent_num_claims', 'patent_num_cited_by_us_patents', 'patent_processing_time', 'uspc_mainclass_id', 'uspc_subclass_id', 'wipo_field_id'])
Bases: app.munge.Munger
A special munger designed to make a specific query to the PatentsView API.
Initializes the query munger.
Parameters: - query_json – the JSON for the query
- limit – the maximum number of documents to munge
- cache – whether or not to use cached query data
- per_page – the number of patents to request in each individual query
-
ensure_data()¶ Check that edge list has been minimally loaded
-
ensure_meta()¶ Check that metadata has been loaded
-
static
get_citation_keys()¶
-
get_edges()¶ Return the edges from this query, if it has been made; else, load data
Returns: the edge list in a dataframe, including date
-
static
get_filename_from_stem(file_string, dir_name)¶
-
get_network(metadata=False, limit=None) Constructs a citation network from the edge list.
Parameters: - metadata – whether or not to include metadata
- limit – a limit to the number of documents to return
Returns: the NetworkX graph
-
load_data()¶ Loads data from query or file to a dataframe.
Returns: the instance
-
load_data_from_file(datafile) Load data from file for this query (using the unique make_filename function).
Parameters: datafile – the file to search for
Returns: this instance
-
load_metadata()¶ Query for metadata about each patent and add to dataframe
-
make_filename(prefix='QUERY', dirname='query') Creates a filename under which to store the munged data.
Returns: the filename
-
static
post_request(json_query)¶
-
static
query(json_query) Makes a query to the USPTO using a JSON attributes object.
Parameters: json_query – the JSON query, formatted according to the PatentsView API
Returns: the query response in JSON format
-
query_fields= ['patent_number', 'cited_patent_number', 'cited_patent_date', 'citedby_patent_number', 'citedby_patent_date']¶
-
query_paginated(page, per_page)[source]¶ Iteratively queries the PatentsView API (so as not to receive a timeout, and to gather data to file over time)
Parameters: - page – the current page number to query
- per_page – the number of patents per query
Returns: a dataframe containing the query page results
-
query_sounding()[source]¶ Sends a sounding query to establish the number of documents to scrape
Returns: the number of patents to scrape
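Taken together, the sounding query and the paginated query suggest a loop like the one sketched here: ask once for the total, then request pages until the total or the limit is covered. `fetch_page` stands in for a real API call and, like `collect_pages` and the fake data, is hypothetical.

```python
import math


def collect_pages(fetch_page, total, per_page=1000, limit=None):
    """Request pages one at a time until `total` (or `limit`) items are gathered."""
    wanted = total if limit is None else min(total, limit)
    n_pages = math.ceil(wanted / per_page)
    results = []
    for page in range(1, n_pages + 1):  # page numbers assumed to start at 1
        results.extend(fetch_page(page, per_page))
    return results[:wanted]


# Fake data source standing in for the API: 25 patent numbers.
PATENTS = [str(n) for n in range(25)]


def fake_fetch(page, per_page):
    """Return one slice of PATENTS, mimicking a paged response."""
    start = (page - 1) * per_page
    return PATENTS[start:start + per_page]


total = len(PATENTS)  # what a sounding query would report
everything = collect_pages(fake_fetch, total, per_page=10)
capped = collect_pages(fake_fetch, total, per_page=10, limit=12)
```

Fetching in fixed-size pages keeps each request small enough to avoid timeouts, and partial results can be written to file between pages.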
-
query_to_dataframe(info, bcites=False)¶ Converts the JSON query results from PatentsView to an edge list dataframe.
Parameters: - info – the query json output
- bcites – whether or not to include backward citations
Returns: the dataframe containing an edge list with the query results
-
summary()¶ Summarize the edge list
-
summary_meta()¶ Summarize the metadata
-
write_data_to_file(filename)¶ Write the data collected to a file
Parameters: filename – the name of the file, typically the query name
-
static
write_graph_to_file(G, filename)¶
-
class app.munge.RootMunger(patent_number, depth, limit=None, cache=True)
Bases: app.munge.Munger
A special munger to fetch the descendants of a given patent number.
Initializes the root munger.
Parameters: - patent_number – the root patent number
- depth – the depth of the search (the number of generations of children)
- limit – a document limit
- cache – whether to use a cached query in the filesystem
-
ensure_data()¶ Check that edge list has been minimally loaded
-
ensure_meta()¶ Check that metadata has been loaded
-
features= ['cpc_category', 'cpc_group_id', 'assignee_type', 'assignee_total_num_patents', 'assignee_id', 'inventor_id', 'inventor_total_num_patents', 'ipc_class', 'ipc_main_group', 'nber_category_id', 'nber_subcategory_id', 'patent_date', 'patent_num_claims', 'patent_num_cited_by_us_patents', 'patent_processing_time', 'uspc_mainclass_id', 'uspc_subclass_id', 'wipo_field_id']¶
-
get_children(curr_num, curr_depth) Recursively fetches the children of the current patent, up to the maximum depth, and adds them to the edge list.
Parameters: - curr_num – the current patent being munged
- curr_depth – the current depth away from the root patent number
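A depth-limited recursive walk of this kind can be sketched as follows. The lookup table `CITED_BY`, the extra `max_depth`, `edges` and `visited` arguments, and the choice to store edges as (child, parent) pairs are assumptions for illustration; the documented method takes only `curr_num` and `curr_depth` and presumably queries the API for each patent's citing patents.

```python
# Hypothetical lookup: patent number -> patents that cite it (its children).
CITED_BY = {
    "root": ["a", "b"],
    "a": ["c"],
    "c": ["d"],
}


def get_children(curr_num, curr_depth, max_depth, edges, visited=None):
    """Recursively add (child, parent) edges down to `max_depth` generations."""
    if visited is None:
        visited = set()
    if curr_depth >= max_depth or curr_num in visited:
        return edges
    visited.add(curr_num)  # guard against expanding the same patent twice
    for child in CITED_BY.get(curr_num, []):
        edges.append((child, curr_num))
        get_children(child, curr_depth + 1, max_depth, edges, visited)
    return edges


edges = get_children("root", 0, max_depth=2, edges=[])
```

With `max_depth=2` the walk records the root's children and grandchildren but stops before expanding the grandchildren themselves.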
-
static
get_citation_keys()¶
-
get_edges()¶ Return the edges from this query, if it has been made; else, load data
Returns: the edge list in a dataframe, including date
-
static
get_filename_from_stem(file_string, dir_name)¶
-
get_network(metadata=False, limit=None) Constructs a citation network from the edge list.
Parameters: - metadata – whether or not to include metadata
- limit – a limit to the number of documents to return
Returns: the NetworkX graph
-
load_data()¶ Loads data from query or file to a dataframe.
Returns: the instance
-
load_data_from_file(datafile) Load data from file for this query (using the unique make_filename function).
Parameters: datafile – the file to search for
Returns: this instance
-
load_metadata()¶ Query for metadata about each patent and add to dataframe
-
make_filename(prefix='QUERY', dirname='query') Creates a filename under which to store the munged data.
Returns: the filename
-
static
post_request(json_query)¶
-
static
query(json_query) Makes a query to the USPTO using a JSON attributes object.
Parameters: json_query – the JSON query, formatted according to the PatentsView API
Returns: the query response in JSON format
-
query_fields= ['patent_number', 'cited_patent_number', 'cited_patent_date', 'citedby_patent_number', 'citedby_patent_date']¶
-
query_to_dataframe(info, bcites=False)¶ Converts the JSON query results from PatentsView to an edge list dataframe.
Parameters: - info – the query json output
- bcites – whether or not to include backward citations
Returns: the dataframe containing an edge list with the query results
-
summary()¶ Summarize the edge list
-
summary_meta()¶ Summarize the metadata
-
write_data_to_file(filename)¶ Write the data collected to a file
Parameters: filename – the name of the file, typically the query name
-
static
write_graph_to_file(G, filename)¶