retriever.lib package

Submodules

retriever.lib.cleanup module

class retriever.lib.cleanup.Cleanup(function=<function no_cleanup>, **kwargs)

Bases: object

This class represents a custom cleanup function and a dictionary of arguments to be passed to that function.

retriever.lib.cleanup.correct_invalid_value(value, args)

This cleanup function replaces null indicators with None.

retriever.lib.cleanup.floatable(value)

Check if a value can be converted to a float
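A check of this kind is commonly written as a guarded call to float(). The sketch below shows that pattern; it is an assumption about the approach, not a copy of the library code.

```python
def floatable(value):
    """Return True if float(value) succeeds, False otherwise."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        # TypeError covers None and other non-numeric objects,
        # ValueError covers strings such as "abc".
        return False
```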

retriever.lib.cleanup.no_cleanup(value, args)

Default cleanup function, returns the unchanged value.
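The following self-contained sketch illustrates how a cleanup function and its arguments might be paired, based only on the signatures above. The "nulls" key and the attribute names function and args are assumptions made for illustration.

```python
def no_cleanup(value, args):
    """Default cleanup: return the value unchanged."""
    return value


def correct_invalid_value(value, args):
    """Return None when the value matches a null indicator.

    Assumption: indicators are listed under the hypothetical key "nulls".
    """
    if str(value) in [str(null) for null in args.get("nulls", [])]:
        return None
    return value


class Cleanup:
    """Pair a cleanup function with the keyword arguments it needs."""

    def __init__(self, function=no_cleanup, **kwargs):
        self.function = function
        self.args = kwargs


cleanup = Cleanup(correct_invalid_value, nulls=[-999, "NA"])
cleaned = [cleanup.function(v, cleanup.args) for v in ["12", "-999", "NA"]]
```

Here cleaned keeps "12" and replaces the two null indicators with None.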

retriever.lib.compile module

retriever.lib.compile.add_dialect(table_dict, table)

Reads dialect key of JSON script and extracts key-value pairs to store them in python script

Contains properties such as ‘nulls’, ‘delimiter’, etc.

retriever.lib.compile.add_schema(table_dict, table)

Reads schema key of JSON script and extracts values to store them in python script

Contains properties related to table schema, such as ‘fields’ and cross-tab column name (‘ct_column’).

retriever.lib.compile.compile_json(json_file)

Compiles JSON script files to Python scripts. The scripts are created with retriever create_json <script_name> on the command line.

retriever.lib.engine module

class retriever.lib.engine.Engine

Bases: object

A generic database system. Specific database platforms will inherit from this class.

add_to_table(data_source)

This function adds data to a table from one or more lines specified in engine.table.source.

auto_create_table(table, url=None, filename=None, pk=None)

Creates a table automatically by analyzing a data source and predicting column names, data types, delimiter, etc.

auto_get_datatypes(pk, source, columns, column_values)

Determines data types for each column.

For string columns, adds an additional 100 characters to the maximum observed length to provide extra space for cases where special characters are counted differently by different engines.
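A minimal sketch of this kind of type inference for a single column follows. The type labels "int", "double", and "char", the function name guess_datatype, and the treatment of empty strings are assumptions for illustration; only the 100-character padding comes from the description above.

```python
def guess_datatype(values, padding=100):
    """Guess a datatype tuple for one column of string values."""
    observed = [v for v in values if v != ""]  # ignore empty cells
    try:
        for v in observed:
            int(v)
        return ("int",)
    except ValueError:
        pass
    try:
        for v in observed:
            float(v)
        return ("double",)
    except ValueError:
        pass
    # String column: pad the longest observed value with extra room.
    longest = max((len(v) for v in observed), default=0)
    return ("char", longest + padding)
```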

auto_get_delimiter(header)

Determine the delimiter

Find out which of a set of common delimiters occurs most in the header line and use this as the delimiter.
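The counting approach described above can be sketched in one line. The candidate set below is an assumption; the actual set of delimiters considered by the engine is not listed here.

```python
def guess_delimiter(header, candidates=("\t", ",", ";", "|")):
    """Return the candidate that occurs most often in the header line."""
    # max() keeps the first candidate when counts are tied.
    return max(candidates, key=header.count)
```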

connect(force_reconnect=False)
connection
convert_data_type(datatype)

Converts Retriever generic data types to database platform specific data types

create_db()

Creates a new database based on settings supplied in Database object engine.db

create_db_statement()

Returns a SQL statement to create a database.

create_raw_data_dir()

Checks to see if the archive directory exists and creates it if necessary.

create_table()

Creates a new database table based on settings supplied in Table object engine.table.

create_table_statement()

Returns a SQL statement to create a table

cursor

Gets the db cursor.

database_name(name=None)

Returns the name of the database

datatypes = []
db = None
debug = False
disconnect()
download_file(url, filename)

Downloads a file to the raw data directory.

download_files_from_archive(url, filenames, filetype='zip', keep_in_dir=False, archivename=None)

Downloads files from an archive into the raw data directory.

drop_statement(objecttype, objectname)

Returns a drop table or database SQL statement.
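As an illustration, a statement of this kind can be built from the two arguments. The IF EXISTS clause is an assumption; not every database platform supports it, which is one reason a specific engine might override this method.

```python
def drop_statement(objecttype, objectname):
    """Build a DROP statement for a table or database (illustrative)."""
    return "DROP %s IF EXISTS %s" % (objecttype, objectname)
```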

escape_double_quotes(value)

Escapes double quotes in the value

escape_single_quotes(value)

Escapes single quotes in the value

execute(statement, commit=True)

Executes the given statement

exists(script)

Checks to see if the given table exists

extract_fixed_width(line)

Splits a line based on the fixed width and returns a list of the values
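A sketch of fixed-width splitting is shown below. The method above takes only the line, so the widths are presumably stored elsewhere; here they are passed as a hypothetical widths argument, and stripping of padding whitespace is also an assumption.

```python
def extract_fixed_width(line, widths):
    """Split a line into values using a list of column widths."""
    values = []
    position = 0
    for width in widths:
        # Slice one column and drop its padding whitespace.
        values.append(line[position:position + width].strip())
        position += width
    return values
```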

final_cleanup()

Close the database connection.

find_file(filename)

Checks for an existing datafile

format_data_dir()

Returns the correctly formatted raw data directory location.

format_filename(filename)

Returns the full path of a file in the archive directory.

format_insert_value(value, datatype, escape=True, processed=False)

Format a value for an insert statement based on data type

Different data types need to be formatted differently to be properly stored in database management systems. The correct formats are obtained by:

  1. Removing extra enclosing quotes
  2. Harmonizing null indicators
  3. Cleaning up badly formatted integers
  4. Obtaining consistent float representations of decimals

The optional escape argument controls whether additional quotes in strings are escaped, as needed for SQL database management systems (escape=True), or not escaped, as needed for flat file based engines (escape=False).

The optional processed argument indicates that the engine has its own escaping mechanism, e.g. the csv engine, which uses its own dialect.
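The four formatting steps and the escape switch can be sketched as follows. This is a simplified illustration, not the engine's code: the function name format_value, the type labels, the default null indicators, and the quote-doubling escape are all assumptions, and the processed argument is left out.

```python
def format_value(value, datatype, nulls=("", "NA", "NULL"), escape=True):
    """Format one value for an insert statement (simplified sketch)."""
    text = str(value).strip()
    # 1. Remove one pair of enclosing quotes, if present.
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "'\"":
        text = text[1:-1]
    # 2. Harmonize null indicators.
    if text in nulls:
        return "null"
    # 3. Clean up badly formatted integers such as "1,200" or "42.0".
    if datatype == "int":
        return str(int(float(text.replace(",", ""))))
    # 4. Produce a consistent float representation.
    if datatype == "double":
        return repr(float(text))
    # Strings: double single quotes for SQL engines when escape is True.
    if escape:
        text = text.replace("'", "''")
    return "'" + text + "'"
```

With escape=False the string is quoted but inner quotes are left alone, which corresponds to the flat-file case described above.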

get_connection()

This method should be overloaded by specific implementations of Engine.
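The override pattern might look like the sketch below, which uses stand-in classes rather than the real Engine. The caching behaviour of connect and the InMemorySqliteEngine subclass are assumptions made for illustration.

```python
import sqlite3


class Engine:
    """Stand-in base class; the real Engine has many more members."""

    connection = None

    def get_connection(self):
        raise NotImplementedError("specific engines must override this")

    def connect(self, force_reconnect=False):
        # Assumed behaviour: reuse the connection unless told otherwise.
        if force_reconnect or self.connection is None:
            self.connection = self.get_connection()
        return self.connection


class InMemorySqliteEngine(Engine):
    """Hypothetical engine backed by an in-memory SQLite database."""

    def get_connection(self):
        return sqlite3.connect(":memory:")


engine = InMemorySqliteEngine()
first = engine.connect()
```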

get_ct_data(lines)

Creates cross tab data

get_cursor()

Gets the db cursor.

get_input()

Manually get user input for connection information when script is run from terminal.

insert_data_from_archive(url, filenames)

Insert data from files located in an online archive. This function extracts the file, inserts the data, and deletes the file if raw data archiving is not set.

insert_data_from_file(filename)

The default function to insert data from a file. This function simply inserts the data row by row. Database platforms with support for inserting bulk data from files can override this function.

insert_data_from_url(url)

Insert data from a web resource, such as a text file.

insert_statement(values)

Returns a SQL statement to insert a set of values.

instructions = 'Enter your database connection information:'
load_data(filename)

Generator returning lists of values from lines in a data file

  1. Works on both delimited (csv module) and fixed width data (extract_fixed_width)
  2. Identifies the delimiter if not known
  3. Removes extra line endings
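The delimited case of the steps above can be sketched as a generator. Fixed-width handling is omitted, and the function name load_rows and the candidate delimiters are assumptions.

```python
import csv
import itertools


def load_rows(lines, delimiter=None, candidates=("\t", ",", ";", "|")):
    """Yield a list of values for each line of delimited data."""
    # Remove extra line endings lazily, one line at a time.
    stripped = (line.rstrip("\r\n") for line in lines)
    first = next(stripped, None)
    if first is None:
        return  # empty input yields nothing
    if delimiter is None:
        # Identify the delimiter from the first line if not known.
        delimiter = max(candidates, key=first.count)
    rows = itertools.chain([first], stripped)
    yield from csv.reader(rows, delimiter=delimiter)
```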
name = ''
pkformat = '%s PRIMARY KEY %s '
required_opts = []
script = None
set_engine_encoding()
set_table_delimiter(file_path)
table = None
table_exists(dbname, tablename)

This can be overridden to return True if a table exists. It returns False by default.

table_name(name=None, dbname=None)

Returns the full tablename.

to_csv()
warning(warning)
warnings = []
retriever.lib.engine.file_exists(path)

Returns true if a file exists and its size is greater than 0.
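A check matching this description can be written with os.path. The sketch below also builds a few temporary files to show the three cases; the file names are illustrative only.

```python
import os
import tempfile


def file_exists(path):
    """Return True if path is a file whose size is greater than 0."""
    return os.path.isfile(path) and os.path.getsize(path) > 0


with tempfile.TemporaryDirectory() as folder:
    empty = os.path.join(folder, "empty.csv")
    filled = os.path.join(folder, "filled.csv")
    open(empty, "w").close()
    with open(filled, "w") as handle:
        handle.write("a,b\n")
    results = (
        file_exists(empty),    # exists but has no content
        file_exists(filled),   # exists and has content
        file_exists(os.path.join(folder, "missing.csv")),
    )
```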

retriever.lib.engine.filename_from_url(url)

Extracts and returns the filename from the url
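One way to extract a filename from a URL with the standard library is sketched below. Dropping the query string is an assumption about the intended behaviour, and the example URL is made up.

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url):
    """Return the last path component of a URL, without the query string."""
    return posixpath.basename(urlsplit(url).path)


name = filename_from_url("https://example.org/data/species.csv?version=2")
```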

retriever.lib.engine.gen_from_source(source)

Returns a generator from a source tuple.

Source tuples are of the form (callable, args) where callable(*args) returns either a generator or another source tuple. This allows indefinite regeneration of data sources.
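The source-tuple idea can be sketched as a loop that keeps calling until something other than a tuple comes back. The helper count_up is hypothetical and only serves as a data source.

```python
def gen_from_source(source):
    """Resolve a (callable, args) source tuple to a generator."""
    while isinstance(source, tuple):
        func, args = source
        source = func(*args)  # may return a generator or another tuple
    return source


def count_up(limit):
    """Hypothetical data source yielding 0 .. limit - 1."""
    for number in range(limit):
        yield number


source = (count_up, (3,))
first_pass = list(gen_from_source(source))
second_pass = list(gen_from_source(source))  # the tuple can be reused
nested = (lambda: (count_up, (2,)), ())
```

Because the tuple stores the callable rather than a consumed generator, each call produces a fresh pass over the data.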

retriever.lib.engine.reporthook(count, block_size, total_size)

Generates the progress bar.

Uses file size to calculate the percentage of the file downloaded. If the total_size of the file being downloaded is not in the header, reports progress as the number of bytes downloaded, in KB, MB, or GB.
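The three-argument signature matches the callback form accepted by urllib.request.urlretrieve, which passes a non-positive total size when the server does not report one. The sketch below computes only the progress text; the function name progress_text, the output format, and the unit thresholds are assumptions.

```python
def progress_text(count, block_size, total_size):
    """Describe download progress as a percentage or a byte count."""
    downloaded = count * block_size
    if total_size > 0:
        # The last block may overshoot, so cap the value at 100.
        percent = min(100, downloaded * 100 // total_size)
        return "%d%%" % percent
    # Size unknown: report the amount downloaded in a readable unit.
    for unit, factor in (("GB", 1024 ** 3), ("MB", 1024 ** 2)):
        if downloaded >= factor:
            return "%.1f %s" % (downloaded / factor, unit)
    return "%.1f KB" % (downloaded / 1024)
```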

retriever.lib.engine.skip_rows(rows, source)

Skip over the header lines by reading them before processing.
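Skipping header lines by reading them can be sketched as below. For simplicity this version takes a plain iterable of lines rather than a source tuple, which is an assumption.

```python
def skip_rows(rows, lines):
    """Consume the first `rows` items and return the remaining iterator."""
    iterator = iter(lines)
    for _ in range(rows):
        next(iterator, None)  # the default avoids StopIteration on short input
    return iterator
```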

retriever.lib.excel module

Data Retriever Excel Functions

This module contains optional functions for importing data from Excel.

class retriever.lib.excel.Excel

Bases: object

static cell_value(cell)

Returns the string value of an excel spreadsheet cell

static empty_cell(cell)

Tests whether an excel cell is empty or contains only whitespace

retriever.lib.get_opts module

retriever.lib.lists module

retriever.lib.models module

Data Retriever Data Model

This module contains basic class definitions for the Retriever platform.

retriever.lib.repository module

Checks the repository for updates.

retriever.lib.repository.check_for_updates()

Check for updates to scripts.

This updates the HOME_DIR scripts directory with the latest script versions

retriever.lib.repository.download_from_repository(filepath, newpath, repo='https://raw.github.com/weecology/retriever/master/')

Downloads the latest version of a file from the repository.

retriever.lib.repository.update_progressbar(progress)

Shows a progress bar. Takes a number between 0 and 1 to indicate progress from 0 to 100%, and sets the bar_length according to the console size.
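A sketch of rendering such a bar follows. It returns the text instead of printing it; the function name progressbar_text, the bar characters, and the 20-column margin are assumptions.

```python
import shutil


def progressbar_text(progress, bar_length=None):
    """Render progress (0 to 1) as a text bar sized to the console."""
    if bar_length is None:
        # Leave room for the brackets and the percentage.
        bar_length = max(10, shutil.get_terminal_size().columns - 20)
    progress = min(1.0, max(0.0, float(progress)))  # clamp to [0, 1]
    filled = int(round(bar_length * progress))
    bar = "#" * filled + "-" * (bar_length - filled)
    return "[%s] %3d%%" % (bar, round(progress * 100))
```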

retriever.lib.table module

class retriever.lib.table.Table(name, **kwargs)

Bases: object

Information about a database table.

auto_get_columns(header)

Gets the column names from the header row

Identifies the column names from the header row. Replaces database keywords with alternatives. Replaces special characters and spaces.

clean_column_name(column_name)

Cleans column names using the expected SQL guidelines: removes leading whitespace, replaces SQL keywords, etc.
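The kind of cleaning described can be sketched as below. The keyword replacements, the lowercasing, and the use of underscores for special characters are assumptions; the real replacement table is not shown here.

```python
import re

# Hypothetical replacements for names that clash with SQL keywords.
KEYWORD_REPLACEMENTS = {"order": "sporder", "group": "spgroup"}


def clean_column_name(column_name):
    """Return a column name that is safe to use in SQL statements."""
    name = column_name.strip().lower()
    # Collapse each run of special characters or spaces to one underscore.
    name = re.sub(r"[^0-9a-z_]+", "_", name).strip("_")
    return KEYWORD_REPLACEMENTS.get(name, name)
```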

combine_on_delimiter(line_as_list)

Combine a list of values into a line of csv data
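Joining values into one CSV line while quoting values that contain the delimiter can be done with the csv module. The sketch below takes the delimiter as an argument, which is an assumption; the method above presumably reads it from the table.

```python
import csv
import io


def combine_on_delimiter(line_as_list, delimiter=","):
    """Combine a list of values into one line of CSV data."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="")
    writer.writerow(line_as_list)  # quotes values containing the delimiter
    return buffer.getvalue()
```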

get_column_datatypes()

Gets a set of column names for insert statements.

get_insert_columns(join=True, create=False)

Gets column names for insert statements

create should be set to True if the returned values are going to be used for creating a new table. It includes the pk_auto column if present. This column is not included by default because it is not used when generating insert statements for database management systems.
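The effect of the join and create flags can be illustrated with a stand-alone function. The column layout of (name, datatype tuple) pairs and the "pk-auto" marker are assumptions made for this sketch.

```python
def get_insert_columns(columns, join=True, create=False):
    """Return column names, skipping the auto key unless create is True."""
    names = [
        name for name, datatype in columns
        if create or datatype[0] != "pk-auto"
    ]
    return ", ".join(names) if join else names


columns = [
    ("record_id", ("pk-auto",)),  # hypothetical auto-increment key
    ("species", ("char", 50)),
    ("count", ("int",)),
]
```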

values_from_line(line)

retriever.lib.templates module

Class models for dataset scripts from various locations. Scripts should inherit from the most specific class available.

class retriever.lib.templates.BasicTextTemplate(**kwargs)

Bases: retriever.lib.templates.Script

Script template based on data files from Ecological Archives.

download(engine=None, debug=False)
reference_url()
class retriever.lib.templates.DownloadOnlyTemplate(**kwargs)

Bases: retriever.lib.templates.Script

Script template for non-tabular data that are only for download

download(engine=None, debug=False)
class retriever.lib.templates.HtmlTableTemplate(name='', description='', shortname='', urls={}, tables={}, ref='', public=True, addendum=None, citation='Not currently available', retriever_minimum_version='', version='', **kwargs)

Bases: retriever.lib.templates.Script

Script template for parsing data in HTML tables

class retriever.lib.templates.Script(name='', description='', shortname='', urls={}, tables={}, ref='', public=True, addendum=None, citation='Not currently available', retriever_minimum_version='', version='', **kwargs)

Bases: object

This class represents a database toolkit script. Scripts should inherit from this class and execute their code in the download method.

checkengine(engine=None)
download(engine=None, debug=False)
exists(engine=None)
matches_terms(terms)
reference_url()

retriever.lib.tools module

Data Retriever Tools

This module contains miscellaneous classes and functions used in Retriever scripts.

retriever.lib.tools.choose_engine(opts, choice=True)

Prompts the user to select a database engine

retriever.lib.tools.create_file(data, output='output_file')

Writes a string to a file for use by tests

retriever.lib.tools.file_2string(input_file)

Returns file contents as a string.

retriever.lib.tools.final_cleanup(engine)

Perform final cleanup operations after all scripts have run.

retriever.lib.tools.get_default_connection()

Gets the first (most recently used) stored connection from connections.config.

retriever.lib.tools.get_saved_connection(engine_name)

Given the name of an engine, returns the stored connection for that engine from connections.config.

retriever.lib.tools.getmd5(data, data_type='lines')

Get MD5 of a data source
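An MD5 checksum over a sequence of lines can be computed incrementally with hashlib. The sketch below covers only the line-based case; the data_type argument and the UTF-8 encoding are not modelled from the library and are assumptions.

```python
import hashlib


def getmd5(lines):
    """Return the hex MD5 digest of an iterable of text lines."""
    checksum = hashlib.md5()
    for line in lines:
        checksum.update(line.encode("utf-8"))
    return checksum.hexdigest()
```

Feeding the lines one at a time gives the same digest as hashing the concatenated text, so large files do not need to be held in memory.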

retriever.lib.tools.json2csv(input_file, output_file=None, header_values=None)

Converts a JSON file to CSV. This function is used only for testing and can handle the file of the size

retriever.lib.tools.name_matches(scripts, arg)
retriever.lib.tools.reset_retriever(scope)

Remove stored information on scripts, data, and connections

retriever.lib.tools.save_connection(engine_name, values_dict)

Saves connection information for an engine in connections.config.

retriever.lib.tools.sort_csv(filename)

Sorts CSV rows, excluding the header, and returns the file. This function is used only for testing and can handle the file of the size

retriever.lib.tools.sort_file(file_path)

Sorts a file by line and returns the file. This function is used only for testing and can handle the file of the size

retriever.lib.tools.xml2csv(input_file, outputfile=None, header_values=None, row_tag='row')

Converts an XML file to CSV. This function is used only for testing and can handle the file of the size

retriever.lib.warning module

class retriever.lib.warning.Warning(location, warning)

Bases: object

Module contents

retriever.lib contains the core Data Retriever modules.