retriever.lib package

Submodules

retriever.lib.cleanup module

class retriever.lib.cleanup.Cleanup(function=<function no_cleanup>, **kwargs)

Bases: object

This class represents a custom cleanup function and a dictionary of arguments to be passed to that function.
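
A minimal usage sketch, assuming the class stores the function and keyword arguments as function and args attributes (the attribute names and the 'missing_values' keyword are assumptions, not documented here):

    from retriever.lib.cleanup import Cleanup, correct_invalid_value

    # Hypothetical example: pair the cleanup function with the values it
    # should treat as missing.  The keyword name 'missing_values' is assumed.
    null_cleanup = Cleanup(correct_invalid_value, missing_values=['-999', 'NA'])

    # An engine could later apply it to each raw value (attribute names assumed):
    cleaned = null_cleanup.function('-999', null_cleanup.args)  # expected: None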

retriever.lib.cleanup.correct_invalid_value(value, args)

This cleanup function replaces missing value indicators with None.

retriever.lib.cleanup.floatable(value)

Check if a value can be converted to a float.

retriever.lib.cleanup.no_cleanup(value, args)

Default cleanup function, returns the unchanged value.

retriever.lib.compile module

retriever.lib.compile.add_dialect(table_dict, table)

Reads the dialect key of the JSON script and extracts key-value pairs to store them in the Python script.

Contains properties such as ‘missingValues’, ‘delimiter’, etc.

retriever.lib.compile.add_schema(table_dict, table)

Reads the schema key of the JSON script and extracts values to store them in the Python script.

Contains properties related to table schema, such as ‘fields’ and cross-tab column name (‘ct_column’).

retriever.lib.compile.compile_json(json_file)

Compile JSON script files to Python scripts. The JSON scripts are created with retriever new_json <script_name> from the command line.

retriever.lib.datapackage module

retriever.lib.datapackage.clean_input(prompt='', split_char='', ignore_empty=False, dtype=None)

Clean the user-input from the CLI before adding it.

retriever.lib.datapackage.create_json()

Create a datapackage.JSON script (see http://specs.frictionlessdata.io/data-packages/#descriptor-datapackagejson). Takes input from the user via the command line.

Usage: retriever new_json

retriever.lib.datapackage.delete_json(json_file)
retriever.lib.datapackage.edit_dict(obj, tabwidth=0)

Recursive helper function for edit_json() to edit a datapackage.JSON script file.

retriever.lib.datapackage.edit_json(json_file)

Edit existing datapackage.JSON script.

Usage: retriever edit_json <script_name>
Note: the name of the script is the dataset name.

retriever.lib.datapackage.get_contains_pk(dialect)

Set contains_pk property.

retriever.lib.datapackage.get_delimiter(dialect)

Get the string delimiter for the dataset file(s).

retriever.lib.datapackage.get_do_not_bulk_insert(dialect)

Set do_not_bulk_insert property.

retriever.lib.datapackage.get_fixed_width(dialect)

Set fixed_width property.

retriever.lib.datapackage.get_header_rows(dialect)

Get number of rows considered as the header.

retriever.lib.datapackage.get_nulls(dialect)

Get the list of strings that denote missing values in the dataset.

retriever.lib.datapackage.get_replace_columns(dialect)

Get list of tuples with old and new names for the columns in the table.

retriever.lib.datapackage.get_script_filename(shortname)
retriever.lib.datapackage.is_empty(val)

Check if a variable is an empty string or an empty list.

retriever.lib.datasets module

retriever.lib.datasets.dataset_names()

Return list of all available dataset names.

retriever.lib.datasets.datasets(arg_keyword=None)

Return list of all available datasets.

retriever.lib.datasets.license(dataset)

Get the license for a dataset.

retriever.lib.defaults module

retriever.lib.engine module

class retriever.lib.engine.Engine

Bases: object

A generic database system. Specific database platforms will inherit from this class.

add_to_table(data_source)

This function adds data to a table from one or more lines specified in engine.table.source.

auto_create_table(table, url=None, filename=None, pk=None)

Create table automatically by analyzing a data source and predicting column names, data types, delimiter, etc.

auto_get_datatypes(pk, source, columns, column_values)

Determine data types for each column.

For string columns, an additional 100 characters are added to the maximum observed length to provide extra space for cases where special characters are counted differently by different engines.
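
For illustration, a minimal sketch of that padding rule (variable names are hypothetical):

    # longest observed string in the column, plus 100 characters of headroom
    max_observed = max(len(str(v)) for v in column_values)
    char_column_size = max_observed + 100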

auto_get_delimiter(header)

Determine the delimiter.

Find out which of a set of common delimiters occurs most in the header line and use this as the delimiter.
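
A simplified sketch of that idea, not the engine's actual implementation:

    def guess_delimiter(header, candidates=(",", "\t", ";")):
        """Return the candidate delimiter that occurs most often in the header."""
        return max(candidates, key=header.count)

    guess_delimiter("genus,species,year,count")  # -> ","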

connect(force_reconnect=False)
connection
convert_data_type(datatype)

Convert Retriever generic data types to database platform specific data types.

create_db()

Create a new database based on settings supplied in Database object engine.db.

create_db_statement()

Return SQL statement to create a database.

create_raw_data_dir()

Check whether the archive directory exists and create it if necessary.

create_table()

Create new database table based on settings supplied in Table object engine.table.

create_table_statement()

Return SQL statement to create a table.

cursor

Get db cursor.

database_name(name=None)

Return name of the database.

datatypes = []
db = None
debug = False
disconnect()
download_file(url, filename)

Download file to the raw data directory.

download_files_from_archive(url, filenames, filetype='zip', keep_in_dir=False, archivename=None)

Download files from an archive into the raw data directory.

drop_statement(objecttype, objectname)

Return drop table or database SQL statement.

execute(statement, commit=True)

Execute given statement.

executemany(statement, values, commit=True)

Execute given statement with multiple values.

exists(script)

Check to see if the given table exists.

extract_fixed_width(line)

Split a line based on the fixed widths and return a list of the values.

final_cleanup()

Close the database connection.

find_file(filename)

Check for an existing datafile.

format_data_dir()

Return correctly formatted raw data directory location.

format_filename(filename)

Return full path of a file in the archive directory.

format_insert_value(value, datatype)

Format a value for an insert statement based on data type.

Different data types need to be formatted differently to be properly stored in database management systems. The correct formats are obtained by (see the sketch after the list below):

  1. Removing extra enclosing quotes
  2. Harmonizing null indicators
  3. Cleaning up badly formatted integers
  4. Obtaining consistent float representations of decimals
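
A rough, illustrative sketch of those four steps for a hypothetical value/datatype pair; this is not the engine's actual code:

    def format_value_sketch(value, datatype):
        if value is None:
            return "null"
        s = str(value).strip()
        # 1. Remove extra enclosing quotes
        if len(s) >= 2 and s[0] == s[-1] and s[0] in "\"'":
            s = s[1:-1]
        # 2. Harmonize null indicators
        if s in ("", "NA", "NULL"):
            return "null"
        # 3. Clean up badly formatted integers, e.g. "1,024" -> "1024"
        if datatype == "int":
            return str(int(float(s.replace(",", ""))))
        # 4. Obtain a consistent float representation of decimals
        if datatype == "double":
            return str(float(s))
        return s
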
get_connection()

This method should be overloaded by specific implementations of Engine.

get_ct_data(lines)

Create cross tab data.

get_ct_line_length(lines)

Return the number of real lines for cross-tab data.

get_cursor()

Get db cursor.

get_input()

Manually get user input for connection information when script is run from terminal.

insert_data_from_archive(url, filenames)

Insert data from files located in an online archive. This function extracts the file, inserts the data, and deletes the file if raw data archiving is not set.

insert_data_from_file(filename)

The default function to insert data from a file. This function simply inserts the data row by row. Database platforms with support for inserting bulk data from files can override this function.

insert_data_from_url(url)

Insert data from a web resource, such as a text file.

insert_statement(values)

Return SQL statement to insert a set of values.

instructions = 'Enter your database connection information:'
load_data(filename)

Generator returning lists of values from lines in a data file.

  1. Works on both delimited (csv module) and fixed-width data (extract_fixed_width)
  2. Identifies the delimiter if not known
  3. Removes extra line endings
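
A simplified sketch of the delimited-file case described above (fixed-width handling omitted); parameter names are assumptions, not the engine's implementation:

    import csv

    def load_data_sketch(filename, delimiter=","):
        """Yield a list of values for each line of a delimited data file."""
        with open(filename, newline="", encoding="ISO-8859-1") as data_file:
            for row in csv.reader(data_file, delimiter=delimiter):
                yield [value.strip("\r\n") for value in row]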

name = ''
pkformat = '%s PRIMARY KEY %s '
required_opts = []
script = None
set_engine_encoding()
set_table_delimiter(file_path)
table = None
table_exists(dbname, tablename)

This can be overridden to return True if a table exists. It returns False by default.

table_name(name=None, dbname=None)

Return full tablename.

to_csv()
use_cache = True
warning(warning)
warnings = []
retriever.lib.engine.file_exists(path)

Return True if a file exists and its size is greater than 0.

retriever.lib.engine.filename_from_url(url)

Extract and return the filename from the URL.

retriever.lib.engine.gen_from_source(source)

Return generator from a source tuple.

Source tuples are of the form (callable, args) where callable(*args) returns either a generator or another source tuple. This allows indefinite regeneration of data sources.
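
A minimal sketch of the source-tuple convention, using a hypothetical file-reading helper and a placeholder path:

    from retriever.lib.engine import gen_from_source

    def lines_from_file(path):
        with open(path) as input_file:
            for line in input_file:
                yield line

    # 'data.csv' is a placeholder path, not a file shipped with the package
    source = (lines_from_file, ("data.csv",))   # (callable, args)
    for line in gen_from_source(source):        # can be regenerated at will
        print(line, end="")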

retriever.lib.engine.reporthook(count, block_size, total_size)

Generate the progress bar.

Uses the file size to calculate the percentage of the file downloaded. If the total_size of the file being downloaded is not in the header, progress is reported as the number of bytes downloaded in KB, MB, or GB.
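
The percentage calculation is roughly the following (assumed behaviour, for illustration only):

    def percent_downloaded(count, block_size, total_size):
        """Approximate fraction of the file downloaded so far, as a percentage."""
        if total_size <= 0:      # size missing from the response header
            return None
        return min(int(count * block_size * 100 / total_size), 100)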

retriever.lib.engine.skip_rows(rows, source)

Skip over the header lines by reading them before processing.

retriever.lib.excel module

Data Retriever Excel Functions

This module contains optional functions for importing data from Excel.

class retriever.lib.excel.Excel

Bases: object

static cell_value(cell)

Return string value of an excel spreadsheet cell.

static empty_cell(cell)

Test if excel cell is empty or contains only whitespace.

retriever.lib.get_opts module

retriever.lib.parse_script_to_json module

retriever.lib.repository module

Checks the repository for updates.

retriever.lib.repository.check_for_updates(quiet=False)

Check for updates to datasets.

This updates the HOME_DIR scripts directory with the latest script versions.

retriever.lib.scripts module

retriever.lib.scripts.MODULE_LIST(force_compile=False)

Load scripts from scripts directory and return list of modules.

retriever.lib.scripts.SCRIPT_LIST(force_compile=False)
retriever.lib.scripts.get_script(dataset)

Return the script for a named dataset.

retriever.lib.scripts.open_csvw(csv_file, encode=True)

Open a csv writer forcing the use of Linux line endings on Windows.

Also sets the dialect to ‘excel’ and sets the escape character.

retriever.lib.scripts.open_fr(file_name, encoding='ISO-8859-1', encode=True)

Open file for reading respecting Python version and OS differences.

Sets newline to Linux line endings on Windows and Python 3. When encode=False, does not set the encoding on *nix and Python 3, so the contents are kept as bytes.

retriever.lib.scripts.open_fw(file_name, encoding='ISO-8859-1', encode=True)

Open file for writing respecting Python version and OS differences.

Sets newline to Linux line endings on Python 3. When encode=False, does not set the encoding on *nix and Python 3, so the contents are kept as bytes.
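
A small usage example for the two helpers above; the file names are placeholders and the calls assume the functions return standard file objects:

    from retriever.lib.scripts import open_fr, open_fw

    with open_fr("input.csv") as in_file, open_fw("output.csv") as out_file:
        for line in in_file:
            out_file.write(line)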

retriever.lib.scripts.to_str(object, object_encoding=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>)

retriever.lib.table module

class retriever.lib.table.Table(name, **kwargs)

Bases: object

Information about a database table.

auto_get_columns(header)

Get column names from the header row.

Identifies the column names from the header row. Replaces database keywords with alternatives. Replaces special characters and spaces.

clean_column_name(column_name)

Clean column names using the expected SQL guidelines: remove leading whitespace, replace SQL keywords, etc.

combine_on_delimiter(line_as_list)

Combine a list of values into a line of csv data.

get_column_datatypes()

Get set of column names for insert statements.

get_insert_columns(join=True, create=False)

Get column names for insert statements.

create should be set to True if the returned values are going to be used for creating a new table. It includes the pk_auto column if present. This column is not included by default because it is not used when generating insert statements for database management systems.

values_from_line(line)

retriever.lib.templates module

Datasets are defined as scripts and have unique properties. This module defines generic dataset properties and the functions available for inheritance by the scripts or datasets.

class retriever.lib.templates.BasicTextTemplate(**kwargs)

Bases: retriever.lib.templates.Script

Defines the pre-processing required for scripts.

Scripts that need standard pre-processing should use the download function from this class. Scripts that require extra tuning should override this class.

download(engine=None, debug=False)

Defines the download process for scripts that use the default pre-processing steps provided by the retriever.

print_message()
reference_url()
class retriever.lib.templates.DownloadOnlyTemplate(**kwargs)

Bases: retriever.lib.templates.Script

Script template for non-tabular data that are only for download.

download(engine=None, debug=False)
class retriever.lib.templates.HtmlTableTemplate(title='', description='', name='', urls={}, tables={}, ref='', public=True, addendum=None, citation='Not currently available', licenses=[{'name': None}], retriever_minimum_version='', version='', encoding='', message='', **kwargs)

Bases: retriever.lib.templates.Script

Script template for parsing data in HTML tables.

class retriever.lib.templates.Script(title='', description='', name='', urls={}, tables={}, ref='', public=True, addendum=None, citation='Not currently available', licenses=[{'name': None}], retriever_minimum_version='', version='', encoding='', message='', **kwargs)

Bases: object

This class defines the properties of a generic dataset.

Each dataset inherits attributes from this class to define its unique functionality.

checkengine(engine=None)

Returns the required engine instance.

download(engine=None, debug=False)

Generic function to prepare for installation or download.

exists(engine=None)
matches_terms(terms)
reference_url()

retriever.lib.tools module

Data Retriever Tools

This module contains miscellaneous classes and functions used in Retriever scripts.

retriever.lib.tools.create_file(data, output='output_file')

Write lines to file from a list.

retriever.lib.tools.create_home_dir()

Create the home directory for the retriever.

retriever.lib.tools.file_2list(input_file)

Read in a csv file and return the lines as a list.

retriever.lib.tools.final_cleanup(engine)

Perform final cleanup operations after all scripts have run.

retriever.lib.tools.get_module_version()

This function gets the version numbers of the scripts and returns them in list form.

retriever.lib.tools.getmd5(data, data_type='lines')

Get MD5 of a data source.

retriever.lib.tools.json2csv(input_file, output_file=None, header_values=None)

Convert Json file to CSV.

This function is used only for testing and can handle files small enough to be processed in memory.

retriever.lib.tools.name_matches(scripts, arg)

Check for a match of the script in the available scripts.

If the argument is ‘all’, return the entire script list. If the exact script is available, return that script. If no exact script name is detected, match the argument against the keywords, title, and name of all scripts and return the closest matches.
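
A rough sketch of those matching rules; the script attribute names (name, title, keywords) are assumptions:

    def name_matches_sketch(scripts, arg):
        term = arg.lower()
        if term == "all":
            return scripts
        exact = [s for s in scripts if s.name.lower() == term]
        if exact:
            return exact
        return [s for s in scripts
                if term in s.name.lower()
                or term in s.title.lower()
                or any(term in k.lower() for k in getattr(s, "keywords", []))]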

retriever.lib.tools.reset_retriever(scope='all', ask_permission=True)

Remove stored information on scripts, data, and connections.

retriever.lib.tools.set_proxy()

Check for proxies and make them available to urllib.

retriever.lib.tools.sort_csv(filename)

Sort CSV rows minus the header and return the file.

This function is used only for testing and can handle files small enough to be processed in memory.

retriever.lib.tools.sort_file(file_path)

Sort file by line and return the file.

This function is used only for testing and can handle files small enough to be processed in memory.

retriever.lib.tools.xml2csv(input_file, outputfile=None, header_values=None, row_tag='row')

Convert xml to csv.

This function is used only for testing and can handle files small enough to be processed in memory.

retriever.lib.warning module

class retriever.lib.warning.Warning(location, warning)

Bases: object

Module contents

retriever.lib contains the core Data Retriever modules.

retriever.lib.check_for_updates(quiet=False)

Check for updates to datasets.

This updates the HOME_DIR scripts directory with the latest script versions.

retriever.lib.datasets(arg_keyword=None)

Return list of all available datasets.

retriever.lib.dataset_names()

Return list of all available dataset names.

retriever.lib.download(dataset, path='./', quiet=False, subdir=False, debug=False)

Download scripts for retriever.

retriever.lib.reset_retriever(scope='all', ask_permission=True)

Remove stored information on scripts, data, and connections.

retriever.lib.install_csv(dataset, table_name='./{db}_{table}.csv', debug=False, use_cache=True)

Install datasets into CSV.

retriever.lib.install_mysql(dataset, user='root', password='', host='localhost', port=3306, database_name='{db}', table_name='{db}.{table}', debug=False, use_cache=True)

Install datasets into MySQL.

retriever.lib.install_postgres(dataset, user='postgres', password='', host='localhost', port=5432, database='postgres', database_name='{db}', table_name='{db}.{table}', debug=False, use_cache=True)

Install datasets into PostgreSQL.

retriever.lib.install_sqlite(dataset, file='./sqlite.db', table_name='{db}_{table}', debug=False, use_cache=True)

Install datasets into SQLite.

retriever.lib.install_msaccess(dataset, file='./access.mdb', table_name='[{db} {table}]', debug=False, use_cache=True)

Install datasets into MS Access.

retriever.lib.install_json(dataset, table_name='./{db}_{table}.json', debug=False, use_cache=True)

Install datasets into JSON.

retriever.lib.install_xml(dataset, table_name='./{db}_{table}.xml', debug=False, use_cache=True)

Install datasets into XML.
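
A minimal usage sketch of this module-level API; the dataset name ‘iris’ is only an example and must correspond to an available script on your system:

    from retriever.lib import dataset_names, install_csv

    print(dataset_names())                               # list available datasets
    install_csv("iris", table_name="./{db}_{table}.csv")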