retriever.lib package


retriever.lib.cleanup module

class retriever.lib.cleanup.Cleanup(function=<function no_cleanup>, **kwargs)

Bases: object

This class represents a custom cleanup function and a dictionary of arguments to be passed to that function.

retriever.lib.cleanup.correct_invalid_value(value, args)

This cleanup function replaces missing value indicators with None.
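As a rough illustration of this kind of cleanup function, the sketch below maps missing-value indicators to None. It is not the library's implementation: the function name `replace_missing` and the `"missing_values"` key in `args` are assumptions made for this example.

```python
def replace_missing(value, args):
    """Return None if value is a missing-value indicator, else the value unchanged."""
    # "missing_values" is an assumed key name, not taken from the library.
    indicators = args.get("missing_values", [])
    # Compare as strings so the number -999 and the text "-999" both match.
    if str(value) in {str(item) for item in indicators}:
        return None
    return value


cleaned = [replace_missing(v, {"missing_values": ["NA", -999]})
           for v in ["12.5", "NA", "-999", "oak"]]
```

Values that are not listed as indicators pass through untouched, which mirrors the behaviour described for the default no_cleanup function.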


Check if a value can be converted to a float

retriever.lib.cleanup.no_cleanup(value, args)

Default cleanup function, returns the unchanged value.

retriever.lib.compile module

retriever.lib.compile.add_dialect(table_dict, table)

Reads the dialect key of a JSON script and extracts its key-value pairs to store them in the Python script.

Contains properties such as ‘missingValues’, ‘delimiter’, etc.

retriever.lib.compile.add_schema(table_dict, table)

Reads the schema key of a JSON script and extracts its values to store them in the Python script.

Contains properties related to table schema, such as ‘fields’ and cross-tab column name (‘ct_column’).


Compiles JSON script files to Python scripts. The JSON scripts are created from the command line with retriever new_json <script_name>.

retriever.lib.datapackage module

retriever.lib.datapackage.clean_input(prompt='', split_char='', ignore_empty=False, dtype=None)

Clean the user-input from the CLI before adding it


Creates datapackage.JSON script. Takes input from user via command line.

Usage: retriever new_json

retriever.lib.datapackage.edit_dict(obj, tabwidth=0)

Recursive helper function for edit_json() to edit a datapackage.JSON script file.


Edits existing datapackage.JSON script.

Usage: retriever edit_json <script_name>

Note: The name of the script is the dataset name.


Set contains_pk property


Get the string delimiter for the dataset file(s)


Set do_not_bulk_insert property


Set fixed_width property


Get number of rows considered as the header


Get list of strings that denote missing value in the dataset


Get list of tuples with old and new names for the columns in the table


Check if a variable is an empty string or an empty list

retriever.lib.engine module

class retriever.lib.engine.Engine

Bases: object

A generic database system. Specific database platforms will inherit from this class.


This function adds data to a table from one or more lines specified in engine.table.source.

auto_create_table(table, url=None, filename=None, pk=None)

Creates a table automatically by analyzing a data source and predicting column names, data types, delimiter, etc.

auto_get_datatypes(pk, source, columns, column_values)

Determines data types for each column.

For string columns, adds 100 characters to the maximum observed length to provide extra space for cases where special characters are counted differently by different engines.
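The padding rule can be sketched as follows. This is a simplified illustration, not the method's actual logic: the helper name `infer_column_type` and the tuple-style return values are assumptions, and the real method takes different arguments.

```python
def infer_column_type(values, padding=100):
    """Guess a type for a list of raw strings: int, double, or padded char."""

    def all_parse(caster):
        # True when every value can be converted with the given caster.
        try:
            for value in values:
                caster(value)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return ("int",)
    if all_parse(float):
        return ("double",)
    # String column: longest observed value plus extra room.
    return ("char", max(len(value) for value in values) + padding)
```

With the default padding, a column whose longest value is "maple" (5 characters) would be declared with a width of 105.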


Determine the delimiter

Find out which of a set of common delimiters occurs most in the header line and use this as the delimiter.
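A minimal sketch of this counting approach is shown below. The function name `guess_delimiter` and the candidate set are assumptions for the example; the library's own list of common delimiters may differ.

```python
def guess_delimiter(header_line, candidates=("\t", ",", ";", "|")):
    """Return the candidate delimiter that occurs most often in the header line."""
    # str.count is used as the ranking key; ties go to the earliest candidate.
    return max(candidates, key=header_line.count)
```

For example, a header such as `site;year;count` contains two semicolons and none of the other candidates, so the semicolon is selected.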


Converts Retriever generic data types to database platform specific data types


Creates a new database based on settings supplied in Database object engine.db


Returns a SQL statement to create a database.


Checks to see if the archive directory exists and creates it if necessary.


Creates a new database table based on settings supplied in Table object engine.table.


Returns a SQL statement to create a table


Gets the db cursor.


Returns the name of the database

datatypes = []
db = None
debug = False
download_file(url, filename)

Downloads a file to the raw data directory.

download_files_from_archive(url, filenames, filetype='zip', keep_in_dir=False, archivename=None)

Downloads files from an archive into the raw data directory.

drop_statement(objecttype, objectname)

Returns a drop table or database SQL statement.

execute(statement, commit=True)

Executes the given statement

executemany(statement, values, commit=True)

Executes the given statement with multiple values


Checks to see if the given table exists


Splits a line based on the fixed width and returns a list of the values
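Fixed-width splitting can be sketched like this. The helper name `split_fixed_width`, the explicit `widths` argument, and the stripping of padding spaces are assumptions for illustration; the engine method may read the widths from the table definition instead.

```python
def split_fixed_width(line, widths):
    """Cut a line into consecutive fields of the given widths."""
    values = []
    start = 0
    for width in widths:
        # Slice out one field and drop the padding spaces around it.
        values.append(line[start:start + width].strip())
        start += width
    return values


fields = split_fixed_width("oak  2001 12", [5, 5, 2])
```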


Close the database connection.


Checks for an existing datafile


Returns the correctly formatted raw data directory location.


Returns the full path of a file in the archive directory.

format_insert_value(value, datatype)

Format a value for an insert statement based on data type

Different data types need to be formatted differently to be properly stored in database management systems. The correct formats are obtained by:

  1. Removing extra enclosing quotes
  2. Harmonizing null indicators
  3. Cleaning up badly formatted integers
  4. Obtaining consistent float representations of decimals

This method should be overloaded by specific implementations of Engine.
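The four steps listed above might look roughly like the sketch below. It is an approximation under stated assumptions: the name `format_value`, the plain-string datatype labels, and the list of null indicators are all hypothetical, and the real method may return SQL-ready strings rather than Python values.

```python
def format_value(value, datatype):
    """Normalize one raw value before it is placed in an INSERT statement."""
    text = str(value).strip()
    # 1. Remove one pair of matching enclosing quotes.
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "'\"":
        text = text[1:-1].strip()
    # 2. Harmonize null indicators (this list is an assumption).
    if text.lower() in ("", "null", "none", "na", "nan"):
        return None
    # 3. Clean up badly formatted integers such as "1,200.0".
    if datatype == "int":
        return int(float(text.replace(",", "")))
    # 4. Use one consistent float representation for decimals.
    if datatype == "double":
        return float(text)
    return text
```

A platform-specific engine would typically wrap or replace this logic, for instance to apply its own quoting rules.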


Creates cross tab data


Returns the number of real lines for cross-tab data


Gets the db cursor.


Manually get user input for connection information when script is run from terminal.

insert_data_from_archive(url, filenames)

Insert data from files located in an online archive. This function extracts the file, inserts the data, and deletes the file if raw data archiving is not set.


The default function to insert data from a file. This function simply inserts the data row by row. Database platforms with support for inserting bulk data from files can override this function.


Insert data from a web resource, such as a text file.


Returns a SQL statement to insert a set of values.

instructions = 'Enter your database connection information:'

Generator returning lists of values from lines in a data file

  1. Works on both delimited (csv module) and fixed width data (extract_fixed_width)
  2. Identifies the delimiter if not known
  3. Removes extra line endings
name = ''
pkformat = '%s PRIMARY KEY %s '
required_opts = []
script = None
table = None
table_exists(dbname, tablename)

This can be overridden to return True if a table exists. It returns False by default.

table_name(name=None, dbname=None)

Returns the full tablename.

use_cache = True
warnings = []

Returns true if a file exists and its size is greater than 0.


Extracts and returns the filename from the url
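One way to extract a filename from a URL with the standard library is sketched below. The function name `filename_from_url` is hypothetical, and dropping the query string is an assumption about the desired behaviour rather than a statement of what the library does.

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url):
    """Return the last path component of a URL, ignoring any query string."""
    return posixpath.basename(urlsplit(url).path)


name = filename_from_url("https://example.org/data/species.csv?version=2")
```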


Returns a generator from a source tuple.

Source tuples are of the form (callable, args) where callable(*args) returns either a generator or another source tuple. This allows indefinite regeneration of data sources.
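The source-tuple idea can be illustrated with a small standalone sketch. The names `gen_from_source` and `read_rows` are stand-ins chosen for this example and are not confirmed to match the library.

```python
def gen_from_source(source):
    """Resolve a (callable, args) source tuple into a generator."""
    # Keep calling until the result is no longer a source tuple.
    while isinstance(source, tuple):
        func, args = source
        source = func(*args)
    return source


def read_rows(rows):
    """Toy data source that yields rows one at a time."""
    for row in rows:
        yield row


source = (read_rows, (["a,b", "1,2"],))
first_pass = list(gen_from_source(source))
second_pass = list(gen_from_source(source))  # same data, regenerated
nested = list(gen_from_source((lambda: source, ())))
```

Because the tuple stores the recipe rather than the exhausted generator, the same source can be read as many times as needed.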

retriever.lib.engine.reporthook(count, block_size, total_size)

Generates the progress bar.

Uses the file size to calculate the percentage of the file downloaded. If the total_size of the file being downloaded is not in the header, reports progress as the number of bytes downloaded, in KB, MB, or GB.
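The (count, block_size, total_size) arguments match the hook signature used by `urllib.request.urlretrieve`, which passes a non-positive total size when the server does not report one. The sketch below only builds the progress text; the function name `progress_text` and the exact output format are assumptions.

```python
def progress_text(count, block_size, total_size):
    """Describe download progress as a percentage or as a byte count."""
    downloaded = count * block_size
    if total_size > 0:
        # Total size known: report a percentage, capped at 100.
        percent = min(100, downloaded * 100 // total_size)
        return "{}%".format(percent)
    # Total size unknown: report the amount downloaded in the largest fitting unit.
    for unit, scale in (("GB", 1024 ** 3), ("MB", 1024 ** 2), ("KB", 1024)):
        if downloaded >= scale:
            return "{:.1f} {}".format(downloaded / scale, unit)
    return "{} bytes".format(downloaded)
```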

retriever.lib.engine.skip_rows(rows, source)

Skip over the header lines by reading them before processing.

retriever.lib.excel module

Data Retriever Excel Functions

This module contains optional functions for importing data from Excel.

class retriever.lib.excel.Excel

Bases: object

static cell_value(cell)

Returns the string value of an excel spreadsheet cell

static empty_cell(cell)

Tests whether an excel cell is empty or contains only whitespace

retriever.lib.get_opts module

retriever.lib.parse_script_to_json module

retriever.lib.repository module

Checks the repository for updates.


Check for updates to scripts.

This updates the HOME_DIR scripts directory with the latest script versions

retriever.lib.repository.download_from_repository(filepath, newpath, repo='')

Downloads the latest version of a file from the repository.


Shows a progress bar. Takes a number between 0 and 1 to indicate progress from 0 to 100%, and sets bar_length according to the console size.
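A progress bar of this kind could be rendered as in the sketch below. The function name `render_bar`, the bar characters, and the margin subtracted from the console width are assumptions for illustration.

```python
import shutil


def render_bar(progress, bar_length=None):
    """Return a text progress bar for a value between 0 and 1."""
    if bar_length is None:
        # Size the bar from the console width, leaving room for the percentage.
        bar_length = max(10, shutil.get_terminal_size().columns - 10)
    progress = min(1.0, max(0.0, float(progress)))
    filled = int(round(bar_length * progress))
    return "[{}{}] {:.0f}%".format(
        "#" * filled, "-" * (bar_length - filled), progress * 100
    )
```

Passing an explicit `bar_length` keeps the output predictable, for example in tests or when output is redirected to a file.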

retriever.lib.table module

class retriever.lib.table.Table(name, **kwargs)

Bases: object

Information about a database table.


Gets the column names from the header row

Identifies the column names from the header row. Replaces database keywords with alternatives. Replaces special characters and spaces.


Cleans column names following the expected SQL guidelines: removes leading whitespace, replaces SQL keywords, etc.
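Column-name cleaning along these lines might be sketched as follows. The function name `clean_column`, the keyword replacements, and the lower-casing step are hypothetical; the library's own replacement table and rules are not reproduced here.

```python
import re


def clean_column(name, replacements=None):
    """Turn a raw header cell into a name that is safe to use in SQL."""
    # Hypothetical keyword replacements, chosen only for this example.
    if replacements is None:
        replacements = {"order": "order_col", "group": "group_col"}
    name = name.strip().lower()
    # Collapse runs of spaces and special characters into single underscores.
    name = re.sub(r"[^0-9a-z_]+", "_", name).strip("_")
    return replacements.get(name, name)
```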


Combine a list of values into a line of csv data
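With the standard `csv` module, combining values into one CSV line can be sketched as below. The function name `to_csv_line` is an assumption, and the quoting shown is the `csv` module default rather than a confirmed detail of the library.

```python
import csv
import io


def to_csv_line(values):
    """Join a list of values into a single CSV-formatted line."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(values)
    # The writer appends a line terminator; drop it to return a bare line.
    return buffer.getvalue().rstrip("\r\n")


line = to_csv_line(["oak", "a,b", 3])
```

Letting the writer handle quoting avoids broken rows when a value itself contains the delimiter.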


Gets a set of column names for insert statements.

get_insert_columns(join=True, create=False)

Gets column names for insert statements

create should be set to True if the returned values are going to be used for creating a new table. It includes the pk_auto column if present. This column is not included by default because it is not used when generating insert statements for database management systems.
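The effect of the join and create flags can be sketched with a standalone function. The column layout used here, a list of (name, datatype tuple) pairs with a `"pk-auto"` type marker, is an assumption for the example.

```python
def insert_columns(columns, join=True, create=False):
    """List column names, skipping the auto-increment key unless create is True."""
    names = [name for name, datatype in columns
             if create or datatype[0] != "pk-auto"]
    return ", ".join(names) if join else names


columns = [("record_id", ("pk-auto",)), ("site", ("char", 20)), ("count", ("int",))]
for_insert = insert_columns(columns)
for_create = insert_columns(columns, join=False, create=True)
```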


retriever.lib.templates module

Class models for dataset scripts from various locations. Scripts should inherit from the most specific class available.

class retriever.lib.templates.BasicTextTemplate(**kwargs)

Bases: retriever.lib.templates.Script

Script template based on data files from Ecological Archives.

download(engine=None, debug=False)
class retriever.lib.templates.DownloadOnlyTemplate(**kwargs)

Bases: retriever.lib.templates.Script

Script template for non-tabular data that are only for download

download(engine=None, debug=False)
class retriever.lib.templates.HtmlTableTemplate(title='', description='', name='', urls={}, tables={}, ref='', public=True, addendum=None, citation='Not currently available', retriever_minimum_version='', version='', encoding='', message='', **kwargs)

Bases: retriever.lib.templates.Script

Script template for parsing data in HTML tables

class retriever.lib.templates.Script(title='', description='', name='', urls={}, tables={}, ref='', public=True, addendum=None, citation='Not currently available', retriever_minimum_version='', version='', encoding='', message='', **kwargs)

Bases: object

This class represents a database toolkit script. Scripts should inherit from this class and execute their code in the download method.

download(engine=None, debug=False)
reference_url() module

Data Retriever Tools

This module contains miscellaneous classes and functions used in Retriever scripts.

..., choice=True)

Prompts the user to select a database engine.

..., output='output_file')

Writes a string to a file for use by tests.

Returns file contents as a string.

Performs final cleanup operations after all scripts have run.

Gets the version numbers of the scripts and returns them as an array.

..., data_type='lines')

Gets the MD5 of a data source.

..., output_file=None, header_values=None)

Converts a JSON file to CSV. This function is used only for testing and can handle the file of the size

..., arg)

Removes stored information on scripts, data, and connections.

Sorts CSV rows, excluding the header, and returns the file. This function is used only for testing and can handle the file of the size

Sorts a file by line and returns the file. This function is used only for testing and can handle the file of the size

..., outputfile=None, header_values=None, row_tag='row')

Converts an XML file to CSV. This function is used only for testing and can handle the file of the size

retriever.lib.warning module

class retriever.lib.warning.Warning(location, warning)

Bases: object

Module contents

retriever.lib contains the core Data Retriever modules.