app.munge module¶

exception app.munge.DataFormatError[source]¶

Bases: Exception

args¶

with_traceback()¶: Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

class app.munge.Munger(limit, cache)[source]¶

Bases: object

Collects patent data into a graph by querying the USPTO.

Initializes a munger.

Parameters:	limit – the maximum number of patents to process and query
	cache – whether to use data cached in a CSV file instead of making a fresh query

ensure_data()[source]¶: Check that edge list has been minimally loaded

ensure_meta()[source]¶: Check that metadata has been loaded

static get_citation_keys()[source]¶

get_edges()[source]¶

Return the edges from this query, if it has been made; else, load data

Returns:	the edge list in a dataframe, including date

static get_filename_from_stem(file_string, dir_name)[source]¶

get_network(metadata=False, limit=None)[source]¶

Constructs a citation network from the edge list.

Parameters:	metadata – whether or not to include metadata
	limit – a limit to the number of documents to return
Returns:	the NetworkX graph
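As a rough illustration of how an edge list becomes a citation network, the sketch below builds a directed adjacency mapping from (citing, cited) pairs. It deliberately uses a plain dict instead of NetworkX so that it runs with the standard library alone. The function name build_citation_graph, the tuple layout, and the edge direction are assumptions made for this example, not the module's actual implementation.

```python
from collections import defaultdict


def build_citation_graph(edges):
    """Build a directed adjacency mapping from (citing, cited) pairs.

    Hypothetical stand-in for a NetworkX directed graph: each key is a
    patent number and each value is the set of patents it cites.
    """
    graph = defaultdict(set)
    for citing, cited in edges:
        graph[citing].add(cited)
        # Touch the cited patent so that nodes without outgoing
        # citations still appear in the mapping.
        graph.setdefault(cited, set())
    return dict(graph)


# Illustrative edge list: C cites A and B, and B cites A.
edges = [("B", "A"), ("C", "A"), ("C", "B")]
graph = build_citation_graph(edges)
```

With NetworkX available, the same pairs could be passed to a directed graph's add_edges_from method instead of the dict.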

load_data()[source]¶

Loads data from query or file to a dataframe.

Returns:	the instance

load_data_from_file(datafile)[source]¶

Load data from file for this query (using the unique make_filename function).

Parameters:	datafile – the file to search for
Returns:	this instance

load_metadata()[source]¶: Query for metadata about each patent and add to dataframe

make_filename(prefix='QUERY', dirname='query')[source]¶

Creates a filename under which to store the munged data.

Returns:	the filename

static post_request(json_query)[source]¶

static query(json_query)[source]¶

Makes a query to the USPTO using a JSON attributes object.

Parameters:	json_query – the json query according to the PatentsView API.
Returns:	the return query in JSON format
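To make the shape of json_query more concrete, here is a minimal sketch of assembling a query body. It assumes the "q" (criteria), "f" (fields), and "o" (options) layout of the legacy PatentsView query interface. The helper name build_patents_query and the date criterion are hypothetical, and nothing is sent over the network.

```python
import json


def build_patents_query(criteria, fields, page=1, per_page=1000):
    """Serialize a query body: criteria ("q"), fields ("f"), options ("o").

    Hypothetical helper; the key names follow the assumed PatentsView
    query layout and are not taken from this module's source.
    """
    payload = {
        "q": criteria,
        "f": list(fields),
        "o": {"page": page, "per_page": per_page},
    }
    return json.dumps(payload)


# The field list mirrors the query_fields attribute documented for Munger.
query_fields = [
    "patent_number",
    "cited_patent_number",
    "cited_patent_date",
    "citedby_patent_number",
    "citedby_patent_date",
]

# Illustrative criterion: patents dated on or after 1 January 2015.
json_query = build_patents_query(
    {"_gte": {"patent_date": "2015-01-01"}}, query_fields, per_page=100
)
```

A string like json_query is the kind of value that could plausibly be handed to query(json_query), though the exact type the method expects is not stated here.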

query_data()[source]¶: Queries data from the USPTO to dataframe.

query_fields = ['patent_number', 'cited_patent_number', 'cited_patent_date', 'citedby_patent_number', 'citedby_patent_date']¶

query_to_dataframe(info, bcites=False)[source]¶

Converts the JSON query results from PatentsView to an edge list dataframe.

Parameters:	info – the query JSON output
	bcites – whether or not to include backward citations
Returns:	the dataframe containing an edge list with the query results
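The conversion from a JSON response to an edge list might look roughly like the sketch below. It assumes that the response nests citations under "citedby_patents" and "cited_patents" inside each entry of "patents", and it returns plain tuples rather than a dataframe so that it runs without pandas. The function name response_to_edges, the (citing, cited, date) layout, and the sample payload are all made up for illustration.

```python
def response_to_edges(info, bcites=False):
    """Flatten an assumed PatentsView-style response into edge tuples.

    Each edge is (citing_patent, cited_patent, date), where date is the
    date field returned alongside the citation. Hypothetical sketch.
    """
    edges = []
    for patent in info.get("patents") or []:
        number = patent["patent_number"]
        # Forward citations: later patents that cite this one.
        for citing in patent.get("citedby_patents") or []:
            if citing.get("citedby_patent_number"):
                edges.append(
                    (citing["citedby_patent_number"], number,
                     citing.get("citedby_patent_date"))
                )
        # Backward citations: earlier patents cited by this one.
        if bcites:
            for cited in patent.get("cited_patents") or []:
                if cited.get("cited_patent_number"):
                    edges.append(
                        (number, cited["cited_patent_number"],
                         cited.get("cited_patent_date"))
                    )
    return edges


# Fabricated sample payload, for illustration only.
info = {
    "patents": [
        {
            "patent_number": "100",
            "citedby_patents": [
                {"citedby_patent_number": "200",
                 "citedby_patent_date": "2010-01-05"}
            ],
            "cited_patents": [
                {"cited_patent_number": "50",
                 "cited_patent_date": "1999-03-02"}
            ],
        }
    ]
}
forward_only = response_to_edges(info)
with_backward = response_to_edges(info, bcites=True)
```

A list of tuples in this form could be turned into a dataframe by passing it to the pandas DataFrame constructor with three column names.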

summary()[source]¶: Summarize the edge list

summary_meta()[source]¶: Summarize the metadata

write_data_to_file(filename)[source]¶

Write the data collected to a file

Parameters:	filename – the name of the file, typically the query name

static write_graph_to_file(G, filename)[source]¶

exception app.munge.QueryError[source]¶

Bases: Exception

args¶

with_traceback()¶: Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

class app.munge.QueryMunger(query_json, limit=None, cache=True, per_page=1000, allow_external=False, feature_keys=['cpc_category', 'cpc_group_id', 'assignee_type', 'assignee_total_num_patents', 'assignee_id', 'inventor_id', 'inventor_total_num_patents', 'ipc_class', 'ipc_main_group', 'nber_category_id', 'nber_subcategory_id', 'patent_date', 'patent_num_claims', 'patent_num_cited_by_us_patents', 'patent_processing_time', 'uspc_mainclass_id', 'uspc_subclass_id', 'wipo_field_id'])[source]¶

Bases: app.munge.Munger

A special munger designed to make a specific query to the PatentsView API

Initializes the query munger

Parameters:	query_json – the JSON for the query
	limit – the maximum number of documents to munge
	cache – whether or not to use cached query data
	per_page – the number of patents to request in each individual query

ensure_data()¶: Check that edge list has been minimally loaded

ensure_meta()¶: Check that metadata has been loaded

static get_citation_keys()¶

get_edges()¶

Return the edges from this query, if it has been made; else, load data

Returns:	the edge list in a dataframe, including date

static get_filename_from_stem(file_string, dir_name)¶

get_network(metadata=False, limit=None)[source]¶

Constructs a citation network from the edge list.

Parameters:	metadata – whether or not to include metadata
	limit – a limit to the number of documents to return
Returns:	the NetworkX graph

handle_external()[source]¶

load_data()¶

Loads data from query or file to a dataframe.

Returns:	the instance

load_data_from_file(datafile)¶

Load data from file for this query (using the unique make_filename function).

Parameters:	datafile – the file to search for
Returns:	this instance

load_metadata()¶: Query for metadata about each patent and add to dataframe

make_filename(prefix='QUERY', dirname='query')[source]¶

Creates a filename under which to store the munged data.

Returns:	the filename

static post_request(json_query)¶

static query(json_query)¶

Makes a query to the USPTO using a JSON attributes object.

Parameters:	json_query – the json query according to the PatentsView API.
Returns:	the return query in JSON format

query_data()[source]¶: Queries data from the USPTO to dataframe.

query_fields = ['patent_number', 'cited_patent_number', 'cited_patent_date', 'citedby_patent_number', 'citedby_patent_date']¶

query_paginated(page, per_page)[source]¶

Iteratively queries the PatentsView API (so as not to receive a timeout, and to gather data to file over time)

Parameters:	page – the current page number to query
	per_page – the number of patents per query
Returns:	a dataframe containing the query page results

query_sounding()[source]¶

Sends a sounding query to establish the number of documents to scrape

Returns:	the number of patents to scrape
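The interplay between a sounding query and paginated requests can be sketched as follows: first learn how many documents exist, then request them page by page. Here fetch_page and collect_all are hypothetical stand-ins that slice a local list instead of contacting the API, and the 1-based page numbering is an assumption.

```python
import math


def fetch_page(page, per_page, records):
    """Hypothetical stand-in for one paginated request (1-based pages)."""
    start = (page - 1) * per_page
    return records[start:start + per_page]


def collect_all(records, per_page, limit=None):
    """Gather every page, optionally capped at `limit` documents."""
    # Stand-in for the sounding query: how many documents are there?
    total = len(records)
    if limit is not None:
        total = min(total, limit)

    # Round up so a final partial page is still requested.
    pages = math.ceil(total / per_page)

    collected = []
    for page in range(1, pages + 1):
        collected.extend(fetch_page(page, per_page, records))

    # The last page may overshoot the limit, so trim the result.
    return collected[:total]


records = list(range(25))
everything = collect_all(records, per_page=10)
capped = collect_all(records, per_page=10, limit=12)
```

Splitting the work into pages keeps each request small, which matches the stated goal of avoiding timeouts and allows partial results to be written out as they arrive.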

query_to_dataframe(info, bcites=False)¶

Converts the JSON query results from PatentsView to an edge list dataframe.

Parameters:	info – the query JSON output
	bcites – whether or not to include backward citations
Returns:	the dataframe containing an edge list with the query results

summary()¶: Summarize the edge list

summary_meta()¶: Summarize the metadata

write_data_to_file(filename)¶

Write the data collected to a file

Parameters:	filename – the name of the file, typically the query name

static write_graph_to_file(G, filename)¶

class app.munge.RootMunger(patent_number, depth, limit=None, cache=True)[source]¶

Bases: app.munge.Munger

A special munger to fetch the descendants of a given patent number

Initializes the root munger

Parameters:	patent_number – the root patent number
	depth – the depth of the search (the number of generations of children)
	limit – a document limit
	cache – whether to use a cached query in the filesystem

ensure_data()¶: Check that edge list has been minimally loaded

ensure_meta()¶: Check that metadata has been loaded

features = ['cpc_category', 'cpc_group_id', 'assignee_type', 'assignee_total_num_patents', 'assignee_id', 'inventor_id', 'inventor_total_num_patents', 'ipc_class', 'ipc_main_group', 'nber_category_id', 'nber_subcategory_id', 'patent_date', 'patent_num_claims', 'patent_num_cited_by_us_patents', 'patent_processing_time', 'uspc_mainclass_id', 'uspc_subclass_id', 'wipo_field_id']¶

get_children(curr_num, curr_depth)[source]¶

Recursively fetches the children of the current patent, up to the maximum depth, and adds them to the edge list

Parameters:	curr_num – the current patent being munged
	curr_depth – the current depth away from the root patent number
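The depth-limited recursion described above can be illustrated with a small sketch. It walks an in-memory mapping from each patent to the patents that cite it, rather than querying the USPTO. The name collect_descendants, the max_depth and edges arguments, and the sample mapping are assumptions introduced for this example.

```python
def collect_descendants(citations, curr_num, curr_depth, max_depth, edges):
    """Append (child, parent) edges for descendants within max_depth.

    `citations` maps a patent number to the patents that cite it.
    Hypothetical sketch; no deduplication is attempted, so a patent
    reachable by two paths is visited twice.
    """
    if curr_depth >= max_depth:
        return edges
    for child in citations.get(curr_num, []):
        edges.append((child, curr_num))
        collect_descendants(citations, child, curr_depth + 1, max_depth, edges)
    return edges


# Made-up citation map: B and C cite A, D cites B, E cites D.
citations = {"A": ["B", "C"], "B": ["D"], "D": ["E"]}
edges = collect_descendants(citations, "A", 0, 2, [])
```

With a depth of 2, the walk stops before reaching E, which is three generations away from the root.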

static get_citation_keys()¶

get_edges()¶

Return the edges from this query, if it has been made; else, load data

Returns:	the edge list in a dataframe, including date

static get_filename_from_stem(file_string, dir_name)¶

get_network(metadata=False, limit=None)¶

Constructs a citation network from the edge list.

Parameters:	metadata – whether or not to include metadata
	limit – a limit to the number of documents to return
Returns:	the NetworkX graph

load_data()¶

Loads data from query or file to a dataframe.

Returns:	the instance

load_data_from_file(datafile)¶

Load data from file for this query (using the unique make_filename function).

Parameters:	datafile – the file to search for
Returns:	this instance

load_metadata()¶: Query for metadata about each patent and add to dataframe

make_filename(prefix='QUERY', dirname='query')[source]¶

Creates a filename under which to store the munged data.

Returns:	the filename

static parse_features(patent_info)[source]¶

static post_request(json_query)¶

static query(json_query)¶

Makes a query to the USPTO using a JSON attributes object.

Parameters:	json_query – the json query according to the PatentsView API.
Returns:	the return query in JSON format

query_data()[source]¶: Queries data from the USPTO to dataframe.

static query_features(patents=None, query=None)[source]¶

static query_features_single(patent)[source]¶

query_fields = ['patent_number', 'cited_patent_number', 'cited_patent_date', 'citedby_patent_number', 'citedby_patent_date']¶

query_to_dataframe(info, bcites=False)¶

Converts the JSON query results from PatentsView to an edge list dataframe.

Parameters:	info – the query JSON output
	bcites – whether or not to include backward citations
Returns:	the dataframe containing an edge list with the query results

summary()¶: Summarize the edge list

summary_meta()¶: Summarize the metadata

write_data_to_file(filename)¶

Write the data collected to a file

Parameters:	filename – the name of the file, typically the query name

static write_graph_to_file(G, filename)¶

app.munge.chunks(l, n)[source]¶: Yields successive n-sized chunks from l.
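A generator matching this description is commonly written as below. This is a conventional sketch of the technique, assuming l supports len() and slicing, and is not necessarily the module's exact code.

```python
def chunks(l, n):
    """Yield successive n-sized chunks from l; the last may be shorter."""
    for i in range(0, len(l), n):
        yield l[i:i + n]


# Split five patent numbers into batches of at most two.
batches = list(chunks(["p1", "p2", "p3", "p4", "p5"], 2))
```

Batching of this kind is useful when a long list of patent numbers has to be split across several requests.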