app.munge module

exception app.munge.DataFormatError[source]

Bases: Exception

args
with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

class app.munge.Munger(limit, cache)[source]

Bases: object

Collects patent data into a graph by querying the USPTO.

Initializes a munger.

Parameters:
  • limit – the maximum number of patents to process and query
  • cache – whether to use data cached in a CSV file instead of making a fresh query
ensure_data()[source]

Check that edge list has been minimally loaded

ensure_meta()[source]

Check that metadata has been loaded

static get_citation_keys()[source]
get_edges()[source]

Return the edges from this query, if it has been made; else, load data

Returns:the edge list in a dataframe, including date
static get_filename_from_stem(file_string, dir_name)[source]
get_network(metadata=False, limit=None)[source]

Constructs a citation network from the edge list.

Parameters:
  • metadata – whether or not to include metadata
  • limit – a limit to the number of documents to return
Returns:

the NetworkX graph
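
As a rough illustration of turning an edge list into a directed citation graph, the sketch below uses a plain dict of sets as a dependency-free stand-in for the NetworkX graph. The function name build_citation_graph, the (citing, cited) edge orientation, and the reading of limit as a cap on the number of nodes are all assumptions made for this example, not details confirmed by the module.

```python
def build_citation_graph(edges, limit=None):
    """Build a directed adjacency map {citing: {cited, ...}} from edge pairs."""
    graph = {}
    for citing, cited in edges:
        new_nodes = {citing, cited} - set(graph)
        # Assumed meaning of limit: skip edges that would add nodes beyond the cap.
        if limit is not None and len(graph) + len(new_nodes) > limit:
            continue
        graph.setdefault(citing, set()).add(cited)
        graph.setdefault(cited, set())  # make sure sink nodes are present too
    return graph


# Illustrative edges only: each pair reads "first patent cites second patent".
edges = [("B", "A"), ("C", "A"), ("C", "B"), ("D", "C")]
full = build_citation_graph(edges)
capped = build_citation_graph(edges, limit=3)
```

With NetworkX installed, the same pairs could instead be loaded into a directed graph, which is presumably closer to what get_network returns.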

load_data()[source]

Loads data from query or file to a dataframe.

Returns:the instance
load_data_from_file(datafile)[source]

Load data from file for this query (using the unique make_filename function).

Parameters:datafile – the file to search for
Returns:this instance

load_metadata()[source]

Query for metadata about each patent and add to dataframe

make_filename(prefix='QUERY', dirname='query')[source]

Creates a filename under which to store the munged data.

Returns:the filename
static post_request(json_query)[source]
static query(json_query)[source]

Makes a query to the USPTO using a JSON attributes object.

Parameters:json_query – the json query according to the PatentsView API.
Returns:the return query in JSON format
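
To make the shape of json_query more concrete, here is a minimal sketch that assembles a query body and serializes it. The helper build_query_payload is hypothetical, and the "q" (criteria), "f" (fields) and "o" (options, with page and per_page) keys reflect an assumption about the PatentsView query format rather than anything stated on this page. The patent number is an arbitrary placeholder.

```python
import json


def build_query_payload(criteria, fields, page=1, per_page=25):
    """Assemble a PatentsView-style query body (hypothetical helper)."""
    return {
        "q": criteria,                              # search criteria (assumed key)
        "f": list(fields),                          # fields to return (assumed key)
        "o": {"page": page, "per_page": per_page},  # paging options (assumed key)
    }


# Field names copied from the query_fields attribute documented for Munger.
query_fields = ["patent_number", "cited_patent_number", "cited_patent_date",
                "citedby_patent_number", "citedby_patent_date"]

payload = build_query_payload({"patent_number": "1234567"}, query_fields)
json_query = json.dumps(payload)  # serialized form that a request would carry
```
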
query_data()[source]

Queries data from the USPTO into a dataframe.

query_fields = ['patent_number', 'cited_patent_number', 'cited_patent_date', 'citedby_patent_number', 'citedby_patent_date']
query_to_dataframe(info, bcites=False)[source]

Converts the JSON query results from PatentsView to an edge list dataframe.

Parameters:
  • info – the query json output
  • bcites – whether or not to include backward citations
Returns:

the dataframe containing an edge list with the query results
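
The conversion can be pictured as flattening nested citation lists into one row per edge. The sketch below is only an approximation under stated assumptions: the response layout (a "patents" list with nested "cited_patents" and "citedby_patents" entries), the helper name response_to_edges, and the citing/cited/date row keys are hypothetical, and the rows are plain dicts rather than a dataframe.

```python
def response_to_edges(info, bcites=False):
    """Flatten a PatentsView-style response into edge rows (assumed layout)."""
    rows = []
    for patent in info.get("patents") or []:
        number = patent["patent_number"]
        # Forward citations: later patents that cite this one.
        for cite in patent.get("citedby_patents") or []:
            if cite.get("citedby_patent_number"):
                rows.append({"citing": cite["citedby_patent_number"],
                             "cited": number,
                             "date": cite.get("citedby_patent_date")})
        # Backward citations: earlier patents that this one cites.
        if bcites:
            for cite in patent.get("cited_patents") or []:
                if cite.get("cited_patent_number"):
                    rows.append({"citing": number,
                                 "cited": cite["cited_patent_number"],
                                 "date": cite.get("cited_patent_date")})
    return rows


# Made-up sample response; the identifiers and dates are placeholders.
sample = {"patents": [{
    "patent_number": "B",
    "cited_patents": [{"cited_patent_number": "A",
                       "cited_patent_date": "2001-01-02"}],
    "citedby_patents": [{"citedby_patent_number": "C",
                         "citedby_patent_date": "2010-05-04"}],
}]}

forward_only = response_to_edges(sample)
with_backward = response_to_edges(sample, bcites=True)
```

A list of dicts like this can be passed directly to pandas.DataFrame to obtain a tabular edge list.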

summary()[source]

Summarize the edge list

summary_meta()[source]

Summarize the metadata

write_data_to_file(filename)[source]

Write the data collected to a file

Parameters:filename – the name of the file, typically the query name
static write_graph_to_file(G, filename)[source]
exception app.munge.QueryError[source]

Bases: Exception

args
with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

class app.munge.QueryMunger(query_json, limit=None, cache=True, per_page=1000, allow_external=False, feature_keys=['cpc_category', 'cpc_group_id', 'assignee_type', 'assignee_total_num_patents', 'assignee_id', 'inventor_id', 'inventor_total_num_patents', 'ipc_class', 'ipc_main_group', 'nber_category_id', 'nber_subcategory_id', 'patent_date', 'patent_num_claims', 'patent_num_cited_by_us_patents', 'patent_processing_time', 'uspc_mainclass_id', 'uspc_subclass_id', 'wipo_field_id'])[source]

Bases: app.munge.Munger

A special munger designed to make a specific query to the PatentsView API

Initializes the query munger

Parameters:
  • query_json – the JSON for the query
  • limit – the maximum number of documents to munge
  • cache – whether or not to use cached query data
  • per_page – the number of patents to request in each individual query
ensure_data()

Check that edge list has been minimally loaded

ensure_meta()

Check that metadata has been loaded

static get_citation_keys()
get_edges()

Return the edges from this query, if it has been made; else, load data

Returns:the edge list in a dataframe, including date
static get_filename_from_stem(file_string, dir_name)
get_network(metadata=False, limit=None)[source]

Constructs a citation network from the edge list.

Parameters:
  • metadata – whether or not to include metadata
  • limit – a limit to the number of documents to return
Returns:

the NetworkX graph

handle_external()[source]
load_data()

Loads data from query or file to a dataframe.

Returns:the instance
load_data_from_file(datafile)

Load data from file for this query (using the unique make_filename function).

Parameters:datafile – the file to search for
Returns:this instance

load_metadata()

Query for metadata about each patent and add to dataframe

make_filename(prefix='QUERY', dirname='query')[source]

Creates a filename under which to store the munged data.

Returns:the filename
static post_request(json_query)
static query(json_query)

Makes a query to the USPTO using a JSON attributes object.

Parameters:json_query – the json query according to the PatentsView API.
Returns:the return query in JSON format
query_data()[source]

Queries data from the USPTO into a dataframe.

query_fields = ['patent_number', 'cited_patent_number', 'cited_patent_date', 'citedby_patent_number', 'citedby_patent_date']
query_paginated(page, per_page)[source]

Iteratively queries the PatentsView API (so as not to receive a timeout, and to gather data to file over time)

Parameters:
  • page – the current page number to query
  • per_page – the number of patents per query
Returns:

a dataframe containing the query page results

query_sounding()[source]

Sends a sounding query to establish the number of documents to scrape

Returns:the number of patents to scrape
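
The interplay between a sounding query and paginated queries can be sketched as follows: the sounding result gives the total, from which the number of pages is derived, and each page is then fetched in turn. Everything here is illustrative. fetch_all and fake_fetch_page are hypothetical names, pages are assumed to be 1-indexed, and an in-memory list stands in for the remote API.

```python
import math


def fetch_all(fetch_page, total, per_page=1000, limit=None):
    """Collect results page by page by calling fetch_page(page, per_page)."""
    if limit is not None:
        total = min(total, limit)          # never request more than the limit
    pages = math.ceil(total / per_page)    # number of requests needed
    results = []
    for page in range(1, pages + 1):       # pages assumed to be 1-indexed
        results.extend(fetch_page(page, per_page))
    return results[:total]                 # trim any overshoot on the last page


# Stand-in for the remote API: 2,500 fake patent numbers served in pages.
FAKE_PATENTS = [str(n) for n in range(2500)]


def fake_fetch_page(page, per_page):
    start = (page - 1) * per_page
    return FAKE_PATENTS[start:start + per_page]


total = len(FAKE_PATENTS)                  # what a sounding query would report
everything = fetch_all(fake_fetch_page, total)
first_1200 = fetch_all(fake_fetch_page, total, limit=1200)
```

Fetching in fixed-size pages keeps each request small, which matches the stated goal of avoiding timeouts.
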
query_to_dataframe(info, bcites=False)

Converts the JSON query results from PatentsView to an edge list dataframe.

Parameters:
  • info – the query json output
  • bcites – whether or not to include backward citations
Returns:

the dataframe containing an edge list with the query results

summary()

Summarize the edge list

summary_meta()

Summarize the metadata

write_data_to_file(filename)

Write the data collected to a file

Parameters:filename – the name of the file, typically the query name
static write_graph_to_file(G, filename)
class app.munge.RootMunger(patent_number, depth, limit=None, cache=True)[source]

Bases: app.munge.Munger

A special munger to fetch the descendants of a given patent number

Initializes the root munger

Parameters:
  • patent_number – the root patent number
  • depth – the depth of the search (the number of generations of children)
  • limit – a document limit
  • cache – whether to use a cached query in the filesystem
ensure_data()

Check that edge list has been minimally loaded

ensure_meta()

Check that metadata has been loaded

features = ['cpc_category', 'cpc_group_id', 'assignee_type', 'assignee_total_num_patents', 'assignee_id', 'inventor_id', 'inventor_total_num_patents', 'ipc_class', 'ipc_main_group', 'nber_category_id', 'nber_subcategory_id', 'patent_date', 'patent_num_claims', 'patent_num_cited_by_us_patents', 'patent_processing_time', 'uspc_mainclass_id', 'uspc_subclass_id', 'wipo_field_id']
get_children(curr_num, curr_depth)[source]

Recursively fetches the children of the current patent, up to the maximum depth, and adds them to the edge list

Parameters:
  • curr_num – the current patent being munged
  • curr_depth – the current depth away from the root patent number
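
A depth-limited recursive walk of this kind might look like the sketch below. It assumes an in-memory citing_map (patent number to the patents that cite it) in place of live queries, and it adds a seen set so that a patent reached by two paths is expanded only once; both are choices made for this example, and collect_descendants is a hypothetical name.

```python
def collect_descendants(root, citing_map, max_depth):
    """Depth-limited recursive walk collecting (child, parent) citation edges."""
    edges = []
    seen = {root}

    def visit(curr_num, curr_depth):
        if curr_depth >= max_depth:            # stop at the requested generation
            return
        for child in citing_map.get(curr_num, []):
            edges.append((child, curr_num))    # child cites curr_num
            if child not in seen:              # avoid re-expanding a patent
                seen.add(child)
                visit(child, curr_depth + 1)

    visit(root, 0)
    return edges


# Made-up citation data: each key is cited by the patents in its list.
citing_map = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}

one_generation = collect_descendants("A", citing_map, max_depth=1)
two_generations = collect_descendants("A", citing_map, max_depth=2)
```
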
static get_citation_keys()
get_edges()

Return the edges from this query, if it has been made; else, load data

Returns:the edge list in a dataframe, including date
static get_filename_from_stem(file_string, dir_name)
get_network(metadata=False, limit=None)

Constructs a citation network from the edge list.

Parameters:
  • metadata – whether or not to include metadata
  • limit – a limit to the number of documents to return
Returns:

the NetworkX graph

load_data()

Loads data from query or file to a dataframe.

Returns:the instance
load_data_from_file(datafile)

Load data from file for this query (using the unique make_filename function).

Parameters:datafile – the file to search for
Returns:this instance

load_metadata()

Query for metadata about each patent and add to dataframe

make_filename(prefix='QUERY', dirname='query')[source]

Creates a filename under which to store the munged data.

Returns:the filename
static parse_features(patent_info)[source]
static post_request(json_query)
static query(json_query)

Makes a query to the USPTO using a JSON attributes object.

Parameters:json_query – the json query according to the PatentsView API.
Returns:the return query in JSON format
query_data()[source]

Queries data from the USPTO into a dataframe.

static query_features(patents=None, query=None)[source]
static query_features_single(patent)[source]
query_fields = ['patent_number', 'cited_patent_number', 'cited_patent_date', 'citedby_patent_number', 'citedby_patent_date']
query_to_dataframe(info, bcites=False)

Converts the JSON query results from PatentsView to an edge list dataframe.

Parameters:
  • info – the query json output
  • bcites – whether or not to include backward citations
Returns:

the dataframe containing an edge list with the query results

summary()

Summarize the edge list

summary_meta()

Summarize the metadata

write_data_to_file(filename)

Write the data collected to a file

Parameters:filename – the name of the file, typically the query name
static write_graph_to_file(G, filename)
app.munge.chunks(l, n)[source]

Yields successive n-sized chunks from l.
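
A minimal sketch of a generator with this behavior, assuming l is a sliceable sequence and n is a positive integer. The module's own implementation is not shown on this page and may differ; the patent numbers are placeholders.

```python
def chunks(l, n):
    """Yield successive n-sized chunks from the sequence l."""
    # Step through the sequence n items at a time; the last chunk may be shorter.
    for start in range(0, len(l), n):
        yield l[start:start + n]


patent_numbers = ["1000001", "1000002", "1000003", "1000004", "1000005"]
batches = list(chunks(patent_numbers, 2))
```

Batching like this is a common way to split a long list of patent numbers into smaller groups, for example one group per request.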