retriever.lib package


retriever.lib.cleanup module

class retriever.lib.cleanup.Cleanup(function=<function no_cleanup>, **kwargs)

Bases: object

This class represents a custom cleanup function and a dictionary of arguments to be passed to that function.

retriever.lib.cleanup.correct_invalid_value(value, args)

This cleanup function replaces missing value indicators with None.
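A minimal sketch of a cleanup function of this kind, not the library's implementation. The function name `replace_missing` and the `missing_values` key in `args` are assumptions made for illustration.

```python
def replace_missing(value, args):
    """Return None when value matches a missing-value indicator.

    `args` is a dict; the "missing_values" key is a hypothetical name
    chosen for this sketch.
    """
    missing = {str(m) for m in args.get("missing_values", [])}
    # Compare as stripped strings so that -999 and " -999 " both match.
    if str(value).strip() in missing:
        return None
    return value
```

Values that are not listed as missing indicators pass through unchanged, which mirrors the behaviour described for `no_cleanup`.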


Check if a value can be converted to a float

retriever.lib.cleanup.no_cleanup(value, args)

Default cleanup function, returns the unchanged value.

retriever.lib.create_scripts module

Module to auto create scripts from source

class retriever.lib.create_scripts.RasterPk(**kwargs)

Bases: TabularPk

Raster package class


Get resource information from raster file

get_resources(file_path, driver_name=None, skip_lines=None, encoding=None)

Get raster resources

get_source(file_path, driver=None)

Read raster data source

multi_formats = ['hdf']
pk_formats = ['gif', 'img', 'bil', 'jpg', 'tif', 'tiff', 'hdf', 'l1b', '.gif', '.img', '.bil', '.jpg', '.tif', '.tiff', '.hdf', '.l1b']

Set raster specific properties

class retriever.lib.create_scripts.TabularPk(name='fill', title='fill', description='fill', citation='fill', licenses=[], keywords=[], archived='fill or remove this field if not archived', homepage='fill', version='1.0.0', resources=[], retriever='True', retriever_minimum_version='2.1.0', **kwargs)

Bases: object

Main Tabular data package

create_tabular_resources(file, skip_lines, encoding)

Create resources for tabular scripts

get_resources(file_path, driver_name=None, skip_lines=None, encoding='utf-8')

Get resource values from tabular data source

class retriever.lib.create_scripts.VectorPk(**kwargs)

Bases: TabularPk

Vector package class

create_vector_resources(path, driver_name)

Create vector data resources

get_resources(file_path, driver_name=None, skip_lines=None, encoding=None)

Get resource values from tabular data source

get_source(source, driver_name=None)

Open a data source

pk_formats = ['.shp', 'shp']

Set vector values


Replace the characters ‘.’ and ‘-’ with ‘_’.
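The replacement described above can be sketched as follows. The function name `clean_table_name` is hypothetical and is not taken from the library.

```python
def clean_table_name(name):
    """Replace '.' and '-' with '_' so the name is safe as an identifier."""
    return name.replace(".", "_").replace("-", "_")
```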

retriever.lib.create_scripts.create_package(path, data_type, file_flag, out_path=None, skip_lines=None, encoding='utf-8')

Creates package for a path

path: string path to files to be processed

data_type: string data type of the files to be processed

file_flag: boolean for whether the files are processed as files or directories

out_path: string path to write scripts out to

skip_lines: int number of lines to skip as a list

encoding: encoding of source

retriever.lib.create_scripts.create_raster_datapackage(pk_type, path, file_flag, out_path)

Creates raster package for a path

retriever.lib.create_scripts.create_script_dict(pk_type, path, file, skip_lines, encoding)

Create a script dict, or skip the file if resources cannot be made

retriever.lib.create_scripts.create_tabular_datapackage(pk_type, path, file_flag, out_path, skip_lines, encoding)

Creates tabular package for a path

retriever.lib.create_scripts.create_vector_datapackage(pk_type, path, file_flag, out_path)

Creates vector package for a path


Returns absolute directory path for a path.

retriever.lib.create_scripts.process_dirs(pk_type, sub_dirs_path, out_path, skip_lines, encoding)

Creates a script for each directory.

retriever.lib.create_scripts.process_singles(pk_type, single_files_path, out_path, skip_lines, encoding)

Creates a script for each file

If the filepath is a file, creates a single script for that file. If the filepath is a directory, creates a single script for each file in the directory.

retriever.lib.create_scripts.process_source(pk_type, path, file_flag, out_path, skip_lines=None, encoding='utf-8')

Process source file or source directory

retriever.lib.create_scripts.write_out_scripts(script_dict, path, out_path)

Writes scripts out to a given path

retriever.lib.datapackage module

retriever.lib.datapackage.clean_input(prompt='', split_char='', ignore_empty=False, dtype=None)

Clean the user-input from the CLI before adding it.


Check if a variable is an empty string or an empty list.

retriever.lib.datasets module


Return set with all available licenses.


Return list of all available dataset names.

retriever.lib.datasets.dataset_verbose_list(script_names: list)

Returns the verbose list of the specified dataset(s)

retriever.lib.datasets.datasets(keywords=None, licenses=None)

Search all datasets by keywords and licenses.


Get the license for a dataset.

retriever.lib.defaults module

…, path='./', quiet=False, sub_dir='', debug=False, use_cache=True)

Download scripts for retriever.

retriever.lib.dummy module

Dummy connection classes for connectionless engine instances

This module contains dummy classes required for non-db based children of the Engine class.

class retriever.lib.dummy.DummyConnection

Bases: object

Dummy connection class


Dummy close connection


Dummy commit


Dummy cursor function


Dummy rollback

class retriever.lib.dummy.DummyCursor

Bases: DummyConnection

Dummy connection cursor

retriever.lib.engine module

class retriever.lib.engine.Engine

Bases: object

A generic database system. Specific database platforms will inherit from this class.


Adds data to a table from one or more lines specified in engine.table.source.

auto_create_table(table, url=None, filename=None, pk=None, make=True)

Create table automatically by analyzing a data source and predicting column names, data types, delimiter, etc.

auto_get_datatypes(pk, source, columns)

Determine data types for each column.

For string columns adds an additional 100 characters to the maximum observed value to provide extra space for cases where special characters are counted differently by different engines.
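A rough sketch of the padding rule for string columns, assuming values arrive as strings. The function name, the type labels `"int"`, `"double"` and `"char"`, and the tuple return shape are illustrative assumptions, not the engine's actual output.

```python
def guess_column_type(values, padding=100):
    """Guess a generic type for one column from its observed string values."""
    non_empty = [v for v in values if v != ""]
    # Try the strictest numeric type first, then fall back to float.
    for caster, name in ((int, "int"), (float, "double")):
        try:
            for v in non_empty:
                caster(v)
            return (name,)
        except ValueError:
            continue
    # String column: longest observed value plus extra space.
    longest = max((len(v) for v in non_empty), default=0)
    return ("char", longest + padding)
```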


Determine the delimiter.

Find out which of a set of common delimiters occurs most in the header line and use this as the delimiter.
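The counting approach described above can be sketched like this. The candidate set and the function name `detect_delimiter` are assumptions for illustration; the engine's own candidate list may differ.

```python
def detect_delimiter(header_line, candidates=("\t", ",", ";", "|")):
    """Return the candidate that occurs most often in the header line."""
    counts = {d: header_line.count(d) for d in candidates}
    best = max(counts, key=counts.get)
    # Return None when no candidate appears at all.
    return best if counts[best] > 0 else None
```

Ties are resolved in favour of the candidate listed first, which is a design choice of this sketch only.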


Check if a bulk insert could be performed on the data


Create a connection.

property connection

Create a connection.


Convert Retriever generic data types to database platform specific data types.


Create a new database based on settings supplied in Database object engine.db.


Return SQL statement to create a database.


Check to see if the archive directory exists and create it if necessary.


Create new database table based on settings supplied in Table object engine.table.


Return SQL statement to create a table.

property cursor

Get db cursor.

data_path = None

Return name of the database.

datatypes = []
db = None
debug = False

Disconnect a connection.


File systems should override this method.

Enables commit per file object.

download_file(url, filename)

Download file to the raw data directory.

download_files_from_archive(url, file_names=None, archive_type='zip', keep_in_dir=False, archive_name=None)

Download files from an archive into the raw data directory.

download_from_kaggle(data_source, dataset_name, archive_dir, archive_full_path)

Download files from Kaggle into the raw data directory

download_from_socrata(url, path, progbar)

Download files from Socrata to the raw data directory

download_response(url, path, progbar)

Returns True|None according to the download GET response

drop_statement(object_type, object_name)

Return drop table or database SQL statement.

encoding = None
excel_to_csv(src_path, path_to_csv, excel_info=None, encoding='utf-8')

Convert excel files to csv files.

execute(statement, commit=True)

Execute given statement.

executemany(statement, values, commit=True)

Execute given statement with multiple values.


Split line based on the fixed width, returns list of the values.
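A minimal sketch of splitting a fixed-width line, assuming the column widths are given as a list of integers. The name `split_fixed_width` and the whitespace stripping are assumptions of this sketch.

```python
def split_fixed_width(line, widths):
    """Cut a line into consecutive fields of the given widths."""
    values = []
    pos = 0
    for width in widths:
        # Strip the padding that fixed-width formats add to each field.
        values.append(line[pos:pos + width].strip())
        pos += width
    return values
```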

extract_gz(archive_path, archivedir_write_path, file_name=None, open_archive_file=None, archive=None)

Extract gz files.

Extracts a given file name or all the files in the gz.

extract_tar(archive_path, archivedir_write_path, archive_type, file_name=None)

Extract tar or tar.gz files.

Extracts a given file name or the files in the tar or tar.gz archive. Note that gzip archives can only contain a single file.

extract_zip(archive_path, archivedir_write_path, file_name=None)

Extract zip files.

Extracts a given file name or all the files in the archive.

fetch_tables(dataset, table_names)

This can be overridden to return the tables of sqlite db as pandas data frame. Return False by default.


Close the database connection.


Check for an existing datafile.


Return correctly formatted raw data directory location.


Return full path of a file in the archive directory.

format_insert_value(value, datatype)

Format a value for an insert statement based on data type.

Different data types need to be formatted differently to be properly stored in database management systems. The correct formats are obtained by:

  1. Removing extra enclosing quotes

  2. Harmonizing null indicators

  3. Cleaning up badly formatted integers

  4. Obtaining consistent float representations of decimals
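The four steps above can be sketched as a single function. This is an illustration under stated assumptions, not the engine's code: the name `format_value`, the null indicator list, and the type labels `"int"` and `"double"` are all hypothetical.

```python
def format_value(value, datatype, nulls=("", "NA", "NULL")):
    """Normalize one raw value before it is placed in an insert statement."""
    text = str(value).strip()
    # 1. Remove one pair of matching enclosing quotes.
    if len(text) >= 2 and text[0] == text[-1] and text[0] in ("'", '"'):
        text = text[1:-1].strip()
    # 2. Map every null indicator to a single representation.
    if text in nulls:
        return None
    # 3. Clean up integers written as "1,234" or "42.0".
    if datatype == "int":
        digits = text.replace(",", "")
        try:
            return int(digits)
        except ValueError:
            return int(float(digits))
    # 4. Use Python floats as the consistent decimal representation.
    if datatype == "double":
        return float(text)
    return text
```

A real engine would additionally quote or escape string values for its SQL dialect; that step is left out here.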


This method should be overridden by specific implementations of Engine.


Create cross tab data.


Returns the number of real lines for cross-tab data


Get db cursor.


Manually get user input for connection information when script is run from terminal.

insert_data_from_archive(url, filenames)

Insert data from files located in an online archive. This function extracts the file, inserts the data, and deletes the file if raw data archiving is not set.


The default function to insert data from a file. This function simply inserts the data row by row. Database platforms with support for inserting bulk data from files can override this function.


Insert data from a web resource, such as a text file.

insert_raster(path=None, srid=None)

Base function for installing raster data from path


Return SQL statement to insert a set of values.

insert_vector(path=None, srid=None)

Base function for installing vector data from path

instructions = 'Enter your database connection information:'

Generator returning lists of values from lines in a data file.

  1. Works on both delimited (csv module) and fixed width data (extract_fixed_width)

  2. Identifies the delimiter if not known

  3. Removes extra line ending

name = ''
pkformat = '%s PRIMARY KEY %s '
placeholder = None
process_geojson2csv(src_path, path_to_csv, encoding='utf-8')
process_hdf52csv(src_path, path_to_csv, data_name, data_type, encoding='utf-8')
process_json2csv(src_path, path_to_csv, headers, encoding='utf-8')
process_sqlite2csv(src_path, path_to_csv, table_name=None, encoding='utf-8')

Process sqlite database to csv files.

process_xml2csv(src_path, path_to_csv, header_values=None, empty_rows=1, encoding='utf-8')

Register table names of scripts

required_opts = []
script = None
script_table_registry = {}

Set up the encoding to be used.


Get the delimiter from the data file and set it.

spatial_support = False
supported_raster(path, ext=None)

Spatial data is not currently supported for this database type or file format. PostgreSQL is currently the only supported output for spatial data.

table = None
table_name(name=None, dbname=None)

Return full table name.

to_csv(sort=True, path=None, select_columns=None, select_table=None)

Create a CSV file from a data store.

sort: flag to create a sorted file. path: path to write the file to; if not given, the file is written to the PWD. select_columns: flag used by large files to select columns data; it has SELECT LIMIT 3.

use_cache = True

Create a warning message using the current script and table.

warnings = []
write_fileobject(archivedir_write_path, file_name, file_obj=None, archive=None, open_object=False)

Write a file object from an archive object to a given path.

The open_object flag helps with zip files: when set, both the zip and the file are opened.


Return true if a file exists and its size is greater than 0.


Extract and return the filename from the url.
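One way to extract a filename from a URL with the standard library, shown as a sketch. The function name `filename_from_url` is hypothetical, and dropping the query string is an assumption of this example.

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url):
    """Return the last path component of a URL, ignoring query and fragment."""
    path = urlsplit(url).path
    return posixpath.basename(path)
```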


Return generator from a source tuple.

Source tuples are of the form (callable, args) where callable(*args) returns either a generator or another source tuple. This allows indefinite regeneration of data sources.
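The source tuple idea can be sketched as follows. The names `gen_from_source` and `read_numbers` are chosen for illustration only.

```python
def gen_from_source(source):
    """Resolve a (callable, args) tuple, following nested tuples, to a generator."""
    while isinstance(source, tuple):
        func, args = source
        source = func(*args)
    return source


def read_numbers(limit):
    """Example data source: a generator that is exhausted after one pass."""
    return (n for n in range(limit))


# Keeping the tuple rather than the generator lets the data be regenerated.
source = (read_numbers, (3,))
```

Calling `gen_from_source(source)` twice yields two independent generators, so the same data can be read once to infer data types and again to insert rows.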

retriever.lib.engine.reporthook(tqdm_inst, filename=None)

tqdm wrapper to generate progress bar for urlretrieve


Set the CSV size limit based on the available resources

retriever.lib.engine.skip_rows(rows, source)

Skip over the header lines by reading them before processing.
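A minimal sketch of skipping header lines by consuming them from an iterator first. The name `skip_header` and the integer `header_rows` argument are assumptions; the library function takes `rows` and `source` as shown in its signature.

```python
def skip_header(lines, header_rows):
    """Return an iterator positioned after the first header_rows lines."""
    iterator = iter(lines)
    for _ in range(header_rows):
        # The default of None avoids StopIteration on short inputs.
        next(iterator, None)
    return iterator
```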

retriever.lib.engine_tools module

Data Retriever Tools

This module contains miscellaneous classes and functions used in Retriever scripts.

retriever.lib.engine_tools.create_file(data, output='output_file')

Write lines to file from a list.


Create Directory for retriever.


Read in a csv file and return lines as a list.

retriever.lib.engine_tools.geojson2csv(input_file, output_file, encoding)

Convert Geojson file to csv.

Function is used for testing only.

retriever.lib.engine_tools.getmd5(data, data_type='lines', encoding='utf-8')

Get MD5 of a data source.
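A sketch of computing an MD5 checksum over a list of lines with `hashlib`. The function name `md5_of_lines` is hypothetical, and this example covers only the case where the data is a sequence of strings.

```python
import hashlib


def md5_of_lines(lines, encoding="utf-8"):
    """Return the hex MD5 digest of the concatenated, encoded lines."""
    checksum = hashlib.md5()
    for line in lines:
        checksum.update(line.encode(encoding))
    return checksum.hexdigest()
```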

retriever.lib.engine_tools.hdf2csv(file, output_file, data_name, data_type, encoding='utf-8')
retriever.lib.engine_tools.json2csv(input_file, output_file=None, header_values=None, encoding='utf-8', row_key=None)

Convert Json file to CSV.

retriever.lib.engine_tools.reset_retriever(scope='all', ask_permission=True)

Remove stored information on scripts and data.


Check for proxies and make them available to urllib.

retriever.lib.engine_tools.sort_csv(filename, encoding='utf-8')

Sort CSV rows minus the header and return the file.
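Sorting the rows while leaving the header in place can be sketched in memory as below. Unlike the library function, which works on a file name, this hypothetical `sort_csv_text` takes and returns CSV text.

```python
import csv
import io


def sort_csv_text(text):
    """Sort the data rows of CSV text, keeping the first row as the header."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[:1], rows[1:]
    # Rows are lists of strings, so the order is lexicographic.
    body.sort()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(header + body)
    return out.getvalue()
```

Because the comparison is on strings, a value such as "10" sorts before "2".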

Function is used only for testing and can handle the file of the size.

retriever.lib.engine_tools.sort_file(file_path, encoding='utf-8')

Sort file by line and return the file.

Function is used only for testing and can handle the file of the size.

retriever.lib.engine_tools.sqlite2csv(input_file, output_file, table_name=None, encoding='utf-8')

Convert sqlite database file to CSV.

retriever.lib.engine_tools.walker(raw_data, row_key=None, header_values=None, rows=[], normalize=False)

Extract rows of data from json datasets

retriever.lib.engine_tools.xml2csv(input_file, output_file, header_values=None, empty_rows=1, encoding='utf-8')

Convert xml to csv.

retriever.lib.engine_tools.xml2csv_test(input_file, outputfile=None, header_values=None, row_tag='row')

Convert xml to csv.

Function is used only for testing and can handle the file of the size.

retriever.lib.engine_tools.xml2dict(data, node, level)

Convert xml to dict type.

retriever.lib.excel module

Data Retriever Excel Functions

This module contains optional functions for importing data from Excel.

class retriever.lib.excel.Excel

Bases: object

Excel class to handle excel values

static cell_value(cell)

Return string value of an excel spreadsheet cell.

static empty_cell(cell)

Test if excel cell is empty or contains only whitespace.

retriever.lib.fetch module

retriever.lib.fetch.fetch(dataset, file='sqlite.db', table_name='{db}_{table}', data_dir='.')

Import a dataset into pandas data frames

retriever.lib.get_opts module

retriever.lib.install module

retriever.lib.install.install_csv(dataset, table_name='{db}_{table}.csv', data_dir='.', debug=False, use_cache=True, force=False, hash_value=None)

Install datasets into csv.

retriever.lib.install.install_hdf5(dataset, file='hdf5.h5', table_name='{db}_{table}', data_dir='.', debug=False, use_cache=True, hash_value=None)

Install datasets into hdf5.

retriever.lib.install.install_json(dataset, table_name='{db}_{table}.json', data_dir='.', debug=False, use_cache=True, pretty=False, force=False, hash_value=None)

Install datasets into json.

retriever.lib.install.install_msaccess(dataset, file='access.mdb', table_name='[{db} {table}]', data_dir='.', debug=False, use_cache=True, force=False, hash_value=None)

Install datasets into msaccess.

retriever.lib.install.install_mysql(dataset, user='root', password='', host='localhost', port=3306, database_name='{db}', table_name='{db}.{table}', debug=False, use_cache=True, force=False, hash_value=None)

Install datasets into mysql.

retriever.lib.install.install_postgres(dataset, user='postgres', password='', host='localhost', port=5432, database='postgres', database_name='{db}', table_name='{db}.{table}', bbox=[], debug=False, use_cache=True, force=False, hash_value=None)

Install datasets into postgres.

retriever.lib.install.install_sqlite(dataset, file='sqlite.db', table_name='{db}_{table}', data_dir='.', debug=False, use_cache=True, force=False, hash_value=None)

Install datasets into sqlite.

retriever.lib.install.install_xml(dataset, table_name='{db}_{table}.xml', data_dir='.', debug=False, use_cache=True, force=False, hash_value=None)

Install datasets into xml.

retriever.lib.load_json module


Read Json dataset package files

Load each json and get the appropriate encoding for the dataset. Reload the json using the encoding to ensure correct character sets.

retriever.lib.models module

Data Retriever Data Model

This module contains basic class definitions for the Retriever platform.

retriever.lib.provenance module

retriever.lib.provenance.commit(dataset, commit_message='', path=None, quiet=False)

Commit dataset to a zipped file.

retriever.lib.provenance.commit_info_for_commit(dataset, commit_message, encoding='utf-8')

Generate info for a particular commit.


Returns a dictionary with commit info and changes in old and current environment


Shows logs for a committed dataset which is in provenance directory

retriever.lib.provenance.commit_writer(dataset, commit_message, path, quiet)

Creates the committed zipped file

retriever.lib.provenance.install_committed(path_to_archive, engine, force=False, quiet=False)

Installs the committed dataset

retriever.lib.provenance.installation_details(metadata_info, quiet)

Outputs details of the commit, e.g. commit message, time, changes in environment


Returns a dictionary with details of installed packages in the current environment

retriever.lib.provenance_tools module


Returns a dictionary after reading metadata.json file of a committed dataset


Reads script from archive.

retriever.lib.rdatasets module

retriever.lib.rdatasets.create_rdataset(engine, package, dataset_name, script_path=None)

Download files for RDatasets to the raw data directory


Displays the list of rdataset names present in the package(s) provided


Returns a list of all the available RDataset names present


Updates the datasets_url.json from the github repo

retriever.lib.rdatasets.update_rdataset_contents(data_obj, package, dataset_name, json_file)

Update the contents of json script

retriever.lib.rdatasets.update_rdataset_script(data_obj, dataset_name, package, script_path)

Renames and updates the RDataset script

retriever.lib.repository module

Checks the repository for updates.


Check for updates to datasets.

This updates the HOME_DIR scripts directory with the latest script versions

retriever.lib.scripts module


Return Loaded scripts.

Ensure that only one instance of SCRIPTS is created.

class retriever.lib.scripts.StoredScripts

Bases: object

Stored scripts class


Return shared scripts


Set shared scripts


Return true if a script’s version number is greater than the retriever’s version.
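A version comparison of this kind can be sketched by comparing tuples of integers. The function names are hypothetical and the sketch assumes purely numeric, dot-separated version strings.

```python
def version_tuple(version):
    """Turn '2.10.0' into (2, 10, 0) so that parts compare numerically."""
    return tuple(int(part) for part in version.split("."))


def is_newer(script_version, retriever_version):
    """Return True if the script's version is greater than the retriever's."""
    return version_tuple(script_version) > version_tuple(retriever_version)
```

Comparing tuples avoids the string comparison pitfall where "2.10.0" would sort before "2.9.1".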


Basic method for getting upstream data

retriever.lib.scripts.get_dataset_names_upstream(keywords=None, licenses=None, repo='')

Search all datasets upstream by keywords and licenses. If the keywords or licenses argument is passed, GitHub’s search API is used for looking in the repositories. Otherwise, the version.txt file is read and the script names are then returned.


Return the versions of the present local scripts


Return the script for a named dataset.


Get the citation list for a script

retriever.lib.scripts.get_script_upstream(dataset, repo='')

Return the upstream script for a named dataset.

retriever.lib.scripts.get_script_version_upstream(dataset, repo='')

Return the upstream script version for a named dataset.

retriever.lib.scripts.name_matches(scripts, arg)

Check for a match of the script in available scripts

If all, return the entire script list. If the exact script is available, return that script. If no exact script name is detected, match the argument with the keywords, title and name of all scripts and return the closest matches.
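The three cases can be sketched with `difflib` from the standard library. This simplified example matches against names only, whereas the description also mentions keywords and titles; the function name `match_names` and the cutoff value are assumptions.

```python
import difflib


def match_names(names, arg, cutoff=0.6):
    """Return all names, the exact match, or the closest matches for arg."""
    if arg == "all":
        return list(names)
    if arg in names:
        return [arg]
    # No exact match: fall back to fuzzy matching on the names.
    return difflib.get_close_matches(arg, names, n=3, cutoff=cutoff)
```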


Open a csv writer forcing the use of Linux line endings on Windows.

Also sets dialect to ‘excel’ and escape characters to ‘\’.
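A sketch of a CSV writer configured this way, assuming the escape character is a backslash. The function name `make_csv_writer` is hypothetical, and the sketch takes an already open file object rather than a file name.

```python
import csv
import io


def make_csv_writer(file_obj):
    """Return a csv writer that always ends rows with a Linux newline."""
    return csv.writer(
        file_obj,
        dialect="excel",
        escapechar="\\",
        lineterminator="\n",
    )
```

When writing to a real file, opening it with `newline=""` keeps the platform from translating the line endings again.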

retriever.lib.scripts.open_fr(file_name, encoding='utf-8', encode=True)

Open file for reading respecting Python version and OS differences.

Sets newline to Linux line endings on Windows and Python 3. When encode=False, does not set encoding on nix and Python 3, to keep the data as bytes.

retriever.lib.scripts.open_fw(file_name, encoding='utf-8', encode=True)

Open file for writing respecting Python version and OS differences.

Sets newline to Linux line endings on Python 3. When encode=False, does not set encoding on nix and Python 3, to keep the data as bytes.


Read the version of a script from a JSON file

retriever.lib.scripts.read_py_version(script_name, search_path)

Read the version of a script from a python file


Load scripts from scripts directory and return list of modules.

retriever.lib.scripts.to_str(object, object_encoding=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>, object_decoder='utf-8')

Convert to str

retriever.lib.socrata module

retriever.lib.socrata.create_socrata_dataset(engine, name, resource, script_path=None)

Downloads raw data and creates a script for the socrata dataset


Returns metadata for the following dataset id

Returns the list of dataset names after autocompletion


Returns the dataset information of the dataset name provided

retriever.lib.socrata.update_socrata_contents(json_file, script_name, url, resource)

Update the contents of the json script

retriever.lib.socrata.update_socrata_script(script_name, filename, url, resource, script_path)

Renames the script name and the contents of the script

retriever.lib.socrata.url_response(url, params)

Returns the GET response for the given url and params

retriever.lib.table module

class retriever.lib.table.Dataset(name=None, url=None)

Bases: object

Dataset generic properties

class retriever.lib.table.RasterDataset(name=None, url=None, dataset_type='RasterDataset', **kwargs)

Bases: Dataset

Raster table implementation

class retriever.lib.table.TabularDataset(name=None, url=None, pk=True, contains_pk=False, delimiter=None, header_rows=1, column_names_row=1, fixed_width=False, cleanup=<retriever.lib.cleanup.Cleanup object>, record_id=0, columns=[], replace_columns=[], missingValues=None, cleaned_columns=False, number_of_records=None, **kwargs)

Bases: Dataset

Tabular database table.


Initialize dialect table properties.

These include a table’s null or missing values, the delimiter, the function to perform on missing values and any values in the dialect’s dict.


Add a schema to the table object.

Define the data type for the columns in the table.


Get column names from the header row.

Identifies the column names from the header row. Replaces database keywords with alternatives. Replaces special characters and spaces.


Clean column names using the expected SQL guidelines: remove leading whitespace, replace SQL keywords, etc.


Combine a list of values into a line of csv data.


Get set of column names for insert statements.

get_insert_columns(join=True, create=False)

Get column names for insert statements.

create should be set to True if the returned values are going to be used for creating a new table. It includes the pk_auto column if present. This column is not included by default because it is not used when generating insert statements for database management systems.


Return expected row values

Includes dynamically generated field values like auto pk

class retriever.lib.table.VectorDataset(name=None, url=None, dataset_type='VectorDataset', **kwargs)

Bases: Dataset

Vector table implementation.

retriever.lib.templates module

Datasets are defined as scripts and have unique properties. This module defines generic dataset properties and models the functions available for inheritance by the scripts or datasets.

class retriever.lib.templates.BasicTextTemplate(**kwargs)

Bases: Script

Defines the pre processing required for scripts.

Scripts that need pre processing should use the download function from this class. Scripts that require extra tuning should override this class.

download(engine=None, debug=False)

Defines the download processes for scripts that utilize the default pre processing steps provided by the retriever.

process_archived_data(table_obj, url)

Pre-process archived files.

Archive info is specified for a single resource or entire data package. Extract the files from the archived source based on the specifications. Either extract a single file or all files. If the archived data is excel, use the xls_sheets to obtain the files to be extracted.


Process spatial data for insertion

process_tables(table_obj, url)

Obtain the clean file and create a table

If xls_sheets, convert excel to csv. Create the table from the file.

process_tabular_insert(table_obj, url)

Process tabular data for insertion

class retriever.lib.templates.HtmlTableTemplate(title='', description='', name='', urls={}, tables={}, ref='', public=True, addendum=None, citation='Not currently available', licenses=[{'name': None}], retriever_minimum_version='', version='', encoding='utf-8', message='', **kwargs)

Bases: Script

Script template for parsing data in HTML tables.

class retriever.lib.templates.Script(title='', description='', name='', urls={}, tables={}, ref='', public=True, addendum=None, citation='Not currently available', licenses=[{'name': None}], retriever_minimum_version='', version='', encoding='utf-8', message='', **kwargs)

Bases: object

This class defines the properties of a generic dataset.

Each Dataset inherits attributes from this class to define its unique functionality.


Returns the required engine instance

download(engine=None, debug=False)

Generic function to prepare for installation or download.


Check if the terms match a script’s metadata info


Get a reference url as the parent url from data url

…, path_to_csv, excel_info=None, encoding='utf-8')

Convert an excel sheet to csv

Read src_path excel file and write the excel sheet to path_to_csv. excel_info contains the index of the sheet and the excel file name.

Open a csv writer forcing the use of Linux line endings on Windows.

Also sets dialect to ‘excel’ and escape characters to ‘\’.

…, encoding='utf-8', encode=True)

Open file for reading respecting Python version and OS differences.

Sets newline to Linux line endings on Windows and Python 3. When encode=False, does not set encoding on nix and Python 3, to keep the data as bytes.

…, encoding='utf-8', encode=True)

Open file for writing respecting Python version and OS differences.

Sets newline to Linux line endings on Python 3. When encode=False, does not set encoding on nix and Python 3, to keep the data as bytes.

…, object_encoding=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>, object_decoder='utf-8')

Convert encoded values to string

Return relative paths of files in the directory

retriever.lib.warning module

class retriever.lib.warning.Warning(location, warning)

Bases: object

Custom warning class