retriever.lib package¶
Submodules¶
retriever.lib.cleanup module¶
- class retriever.lib.cleanup.Cleanup(function=<function no_cleanup>, **kwargs)¶
Bases:
object
This class represents a custom cleanup function and a dictionary of arguments to be passed to that function.
- retriever.lib.cleanup.correct_invalid_value(value, args)¶
This cleanup function replaces missing value indicators with None.
- retriever.lib.cleanup.floatable(value)¶
Check if a value can be converted to a float
- retriever.lib.cleanup.no_cleanup(value, args)¶
Default cleanup function, returns the unchanged value.
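The cleanup module pairs a cleanup function with a dictionary of arguments, as described above. A minimal sketch of how the pieces fit together (illustrative reimplementations of the documented behavior, not the library's code):

```python
def no_cleanup(value, args):
    """Default cleanup function: return the value unchanged."""
    return value

def floatable(value):
    """Check whether a value can be converted to a float."""
    try:
        float(value)
        return True
    except (ValueError, TypeError):
        return False

def correct_invalid_value(value, args):
    """Replace missing-value indicators with None."""
    return None if value in args.get("missing_values", []) else value

class Cleanup:
    """Pair a cleanup function with the keyword arguments it needs."""
    def __init__(self, function=no_cleanup, **kwargs):
        self.function = function
        self.args = kwargs

# Usage: build a Cleanup and apply it to a raw value
cleanup = Cleanup(correct_invalid_value, missing_values=["-999", "NA"])
cleanup.function("-999", cleanup.args)  # -> None
```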
retriever.lib.create_scripts module¶
Module to auto-create scripts from source
- class retriever.lib.create_scripts.RasterPk(**kwargs)¶
Bases:
TabularPk
Raster package class
- create_raster_resources(file_path)¶
Get resource information from raster file
- get_resources(file_path, driver_name=None, skip_lines=None, encoding=None)¶
Get raster resources
- get_source(file_path, driver=None)¶
Read raster data source
- multi_formats = ['hdf']¶
- pk_formats = ['gif', 'img', 'bil', 'jpg', 'tif', 'tiff', 'hdf', 'l1b', '.gif', '.img', '.bil', '.jpg', '.tif', '.tiff', '.hdf', '.l1b']¶
- set_global(src_ds)¶
Set raster specific properties
- class retriever.lib.create_scripts.TabularPk(name='fill', title='fill', description='fill', citation='fill', licenses=[], keywords=[], archived='fill or remove this field if not archived', homepage='fill', version='1.0.0', resources=[], retriever='True', retriever_minimum_version='2.1.0', **kwargs)¶
Bases:
object
Main Tabular data package
- create_tabular_resources(file, skip_lines, encoding)¶
Create resources for tabular scripts
- get_resources(file_path, driver_name=None, skip_lines=None, encoding='utf-8')¶
Get resource values from tabular data source
- class retriever.lib.create_scripts.VectorPk(**kwargs)¶
Bases:
TabularPk
Vector package class
- create_vector_resources(path, driver_name)¶
Create vector data resources
- get_resources(file_path, driver_name=None, skip_lines=None, encoding=None)¶
Get resource values from tabular data source
- get_source(source, driver_name=None)¶
Open a data source
- pk_formats = ['.shp', 'shp']¶
- set_globals(da_layer)¶
Set vector values
- retriever.lib.create_scripts.clean_table_name(table_name)¶
Replace the characters '.' and '-' in a table name with '_'
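The documented replacement can be sketched as follows (an illustrative reimplementation; the real function may handle additional characters):

```python
def clean_table_name(table_name):
    """Replace '.' and '-' in a table name with '_'."""
    return table_name.replace(".", "_").replace("-", "_")

clean_table_name("my-data.2020")  # -> "my_data_2020"
```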
- retriever.lib.create_scripts.create_package(path, data_type, file_flag, out_path=None, skip_lines=None, encoding='utf-8')¶
Creates package for a path
path: string path to files to be processed
data_type: string data type of the files to be processed
file_flag: boolean, whether the files are processed as files or directories
out_path: string path to write scripts out to
skip_lines: int, number of lines to skip, as a list
encoding: encoding of the source
- retriever.lib.create_scripts.create_raster_datapackage(pk_type, path, file_flag, out_path)¶
Creates raster package for a path
- retriever.lib.create_scripts.create_script_dict(pk_type, path, file, skip_lines, encoding)¶
Create a script dict, or skip the file if resources cannot be made
- retriever.lib.create_scripts.create_tabular_datapackage(pk_type, path, file_flag, out_path, skip_lines, encoding)¶
Creates tabular package for a path
- retriever.lib.create_scripts.create_vector_datapackage(pk_type, path, file_flag, out_path)¶
Creates vector package for a path
- retriever.lib.create_scripts.get_directory(path)¶
Returns absolute directory path for a path.
- retriever.lib.create_scripts.process_dirs(pk_type, sub_dirs_path, out_path, skip_lines, encoding)¶
Creates a script for each directory.
- retriever.lib.create_scripts.process_singles(pk_type, single_files_path, out_path, skip_lines, encoding)¶
Creates a script for each file
If the filepath is a file, creates a single script for that file. If the filepath is a directory, creates a single script for each file in the directory.
- retriever.lib.create_scripts.process_source(pk_type, path, file_flag, out_path, skip_lines=None, encoding='utf-8')¶
Process source file or source directory
- retriever.lib.create_scripts.write_out_scripts(script_dict, path, out_path)¶
Writes scripts out to a given path
retriever.lib.datapackage module¶
- retriever.lib.datapackage.clean_input(prompt='', split_char='', ignore_empty=False, dtype=None)¶
Clean the user-input from the CLI before adding it.
- retriever.lib.datapackage.is_empty(val)¶
Check if a variable is an empty string or an empty list.
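A sketch of the documented check (illustrative, not the library's code):

```python
def is_empty(val):
    """True only for an empty string or an empty list."""
    return val in ("", [])
```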
retriever.lib.datasets module¶
- retriever.lib.datasets.dataset_licenses()¶
Return set with all available licenses.
- retriever.lib.datasets.dataset_names()¶
Return list of all available dataset names.
- retriever.lib.datasets.dataset_verbose_list(script_names: list)¶
Returns the verbose list of the specified dataset(s)
- retriever.lib.datasets.datasets(keywords=None, licenses=None)¶
Search all datasets by keywords and licenses.
- retriever.lib.datasets.license(dataset)¶
Get the license for a dataset.
retriever.lib.defaults module¶
retriever.lib.download module¶
- retriever.lib.download.download(dataset, path='./', quiet=False, sub_dir='', debug=False, use_cache=True)¶
Download scripts for retriever.
retriever.lib.dummy module¶
Dummy connection classes for connectionless engine instances
This module contains dummy classes required for non-db based children of the Engine class.
- class retriever.lib.dummy.DummyConnection¶
Bases:
object
Dummy connection class
- close()¶
Dummy close connection
- commit()¶
Dummy commit
- cursor()¶
Dummy cursor function
- rollback()¶
Dummy rollback
- class retriever.lib.dummy.DummyCursor¶
Bases:
DummyConnection
Dummy connection cursor
retriever.lib.engine module¶
- class retriever.lib.engine.Engine¶
Bases:
object
A generic database system. Specific database platforms will inherit from this class.
- add_to_table(data_source)¶
Adds data to a table from one or more lines specified in engine.table.source.
- auto_create_table(table, url=None, filename=None, pk=None, make=True)¶
Create table automatically by analyzing a data source and predicting column names, data types, delimiter, etc.
- auto_get_datatypes(pk, source, columns)¶
Determine data types for each column.
For string columns, adds an additional 100 characters to the maximum observed length to provide extra space for cases where special characters are counted differently by different engines.
- auto_get_delimiter(header)¶
Determine the delimiter.
Find out which of a set of common delimiters occurs most in the header line and use this as the delimiter.
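The heuristic described above can be sketched as follows; the candidate delimiter set here is an assumption:

```python
def guess_delimiter(header, candidates=(",", "\t", ";")):
    """Return the candidate delimiter occurring most often in the header,
    or None if no candidate appears at all."""
    counts = {d: header.count(d) for d in candidates}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```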
- check_bulk_insert()¶
Check if a bulk insert could be performed on the data
- connect(force_reconnect=False)¶
Create a connection.
- property connection¶
Create a connection.
- convert_data_type(datatype)¶
Convert Retriever generic data types to database platform specific data types.
- create_db()¶
Create a new database based on settings supplied in Database object engine.db.
- create_db_statement()¶
Return SQL statement to create a database.
- create_raw_data_dir(path=None)¶
Check to see if the archive directory exists and creates it if necessary.
- create_table()¶
Create new database table based on settings supplied in Table object engine.table.
- create_table_statement()¶
Return SQL statement to create a table.
- property cursor¶
Get db cursor.
- data_path = None¶
- database_name(name=None)¶
Return name of the database.
- datatypes = []¶
- db = None¶
- debug = False¶
- disconnect()¶
Disconnect a connection.
- disconnect_files()¶
File systems should override this method.
Enables commit per file object.
- download_file(url, filename)¶
Download file to the raw data directory.
- download_files_from_archive(url, file_names=None, archive_type='zip', keep_in_dir=False, archive_name=None)¶
Download files from an archive into the raw data directory.
- download_from_kaggle(data_source, dataset_name, archive_dir, archive_full_path)¶
Download files from Kaggle into the raw data directory
- download_from_socrata(url, path, progbar)¶
Download files from Socrata to the raw data directory
- download_response(url, path, progbar)¶
Returns True or None depending on the download GET response
- drop_statement(object_type, object_name)¶
Return drop table or database SQL statement.
- encoding = None¶
- excel_to_csv(src_path, path_to_csv, excel_info=None, encoding='utf-8')¶
Convert excel files to csv files.
- execute(statement, commit=True)¶
Execute given statement.
- executemany(statement, values, commit=True)¶
Execute given statement with multiple values.
- extract_fixed_width(line)¶
Split line based on the fixed width, returns list of the values.
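Fixed-width splitting can be sketched like this; the field widths are hypothetical and would in practice come from the table definition:

```python
def extract_fixed_width(line, widths=(5, 3, 8)):
    """Slice a line at cumulative field widths and strip padding."""
    values, pos = [], 0
    for w in widths:
        values.append(line[pos:pos + w].strip())
        pos += w
    return values

extract_fixed_width("Alice 42London  ", (6, 2, 8))  # -> ["Alice", "42", "London"]
```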
- extract_gz(archive_path, archivedir_write_path, file_name=None, open_archive_file=None, archive=None)¶
Extract gz files.
Extracts a given file name or all the files in the gz.
- extract_tar(archive_path, archivedir_write_path, archive_type, file_name=None)¶
Extract tar or tar.gz files.
Extracts a given file name or the file in the tar or tar.gz (gzip archives can only contain a single file).
- extract_zip(archive_path, archivedir_write_path, file_name=None)¶
Extract zip files.
Extracts a given file name or all files in the archive.
- fetch_tables(dataset, table_names)¶
This can be overridden to return the tables of an sqlite db as pandas data frames. Returns False by default.
- final_cleanup()¶
Close the database connection.
- find_file(filename)¶
Check for an existing datafile.
- format_data_dir()¶
Return correctly formatted raw data directory location.
- format_filename(filename)¶
Return full path of a file in the archive directory.
- format_insert_value(value, datatype)¶
Format a value for an insert statement based on data type.
Different data types need to be formatted differently to be properly stored in database management systems. The correct formats are obtained by:
Removing extra enclosing quotes
Harmonizing null indicators
Cleaning up badly formatted integers
Obtaining consistent float representations of decimals
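The four steps above can be sketched as follows; the null indicators and type names used here are assumptions, not the engine's actual rules:

```python
def format_insert_value(value, datatype):
    """Hedged sketch of value formatting for an insert statement."""
    if isinstance(value, str):
        # 1. Remove extra enclosing quotes
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        # 2. Harmonize null indicators (assumed indicator set)
        if value.strip().lower() in ("", "na", "null", "none"):
            return None
    if value is None:
        return None
    if datatype == "int":
        # 3. Clean up badly formatted integers (e.g. "1,000")
        return int(str(value).replace(",", ""))
    if datatype == "double":
        # 4. Obtain a consistent float representation of decimals
        return float(value)
    return value
```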
- get_connection()¶
This method should be overridden by specific implementations of Engine.
- get_ct_data(lines)¶
Create cross tab data.
- get_ct_line_length(lines)¶
Returns the number of real lines for cross-tab data
- get_cursor()¶
Get db cursor.
- get_input()¶
Manually get user input for connection information when script is run from terminal.
- insert_data_from_archive(url, filenames)¶
Insert data from files located in an online archive. This function extracts the file, inserts the data, and deletes the file if raw data archiving is not set.
- insert_data_from_file(filename)¶
The default function to insert data from a file. This function simply inserts the data row by row. Database platforms with support for inserting bulk data from files can override this function.
- insert_data_from_url(url)¶
Insert data from a web resource, such as a text file.
- insert_raster(path=None, srid=None)¶
Base function for installing raster data from path
- insert_statement(values)¶
Return SQL statement to insert a set of values.
- insert_vector(path=None, srid=None)¶
Base function for installing vector data from path
- instructions = 'Enter your database connection information:'¶
- load_data(filename)¶
Generator returning lists of values from lines in a data file.
1. Works on both delimited (csv module) and fixed width data (extract_fixed_width)
2. Identifies the delimiter if not known
3. Removes extra line endings
- name = ''¶
- pkformat = '%s PRIMARY KEY %s '¶
- placeholder = None¶
- process_geojson2csv(src_path, path_to_csv, encoding='utf-8')¶
- process_hdf52csv(src_path, path_to_csv, data_name, data_type, encoding='utf-8')¶
- process_json2csv(src_path, path_to_csv, headers, encoding='utf-8')¶
- process_sqlite2csv(src_path, path_to_csv, table_name=None, encoding='utf-8')¶
Process sqlite database to csv files.
- process_xml2csv(src_path, path_to_csv, header_values=None, empty_rows=1, encoding='utf-8')¶
- register_tables()¶
Register table names of scripts
- required_opts = []¶
- script = None¶
- script_table_registry = {}¶
- set_engine_encoding()¶
Set up the encoding to be used.
- set_table_delimiter(file_path)¶
Get the delimiter from the data file and set it.
- spatial_support = False¶
- supported_raster(path, ext=None)¶
Spatial data is not currently supported for this database type or file format. PostgreSQL is currently the only supported output for spatial data.
- table = None¶
- table_name(name=None, dbname=None)¶
Return full table name.
- to_csv(sort=True, path=None, select_columns=None, select_table=None)¶
Create a CSV file from the a data store.
The sort flag creates a sorted file; path is where the file is written, otherwise the current working directory; the select_columns flag is used for large files to select column data, using SELECT with LIMIT 3.
- use_cache = True¶
- warning(warning)¶
Create a warning message using the current script and table.
- warnings = []¶
- write_fileobject(archivedir_write_path, file_name, file_obj=None, archive=None, open_object=False)¶
Write a file object from an archive object to a given path.
The open_object flag helps with zip files: open the zip, then the file within it.
- retriever.lib.engine.file_exists(path)¶
Return True if a file exists and its size is greater than 0.
- retriever.lib.engine.filename_from_url(url)¶
Extract and return the filename from the url.
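A minimal sketch using the standard library (the real function may differ):

```python
from urllib.parse import urlparse

def filename_from_url(url):
    """Return the last path component of a URL, ignoring any query string."""
    return urlparse(url).path.rsplit("/", 1)[-1]

filename_from_url("https://example.com/data/birds.csv?raw=1")  # -> "birds.csv"
```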
- retriever.lib.engine.gen_from_source(source)¶
Return generator from a source tuple.
Source tuples are of the form (callable, args) where callable(*args) returns either a generator or another source tuple. This allows indefinite regeneration of data sources.
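The source-tuple pattern can be sketched as follows (illustrative; `line_source` is a hypothetical helper):

```python
def gen_from_source(source):
    """Resolve (callable, args) tuples until a generator/iterator remains."""
    while isinstance(source, tuple):
        func, args = source
        source = func(*args)
    return source

def line_source(lines):
    # Returns another source tuple, showing the recursive resolution
    return (iter, (lines,))

gen = gen_from_source((line_source, (["a", "b"],)))
list(gen)  # -> ["a", "b"]
```

Because the outer tuple can be kept around, the data source can be regenerated indefinitely by calling `gen_from_source` on it again.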
- retriever.lib.engine.reporthook(tqdm_inst, filename=None)¶
tqdm wrapper to generate a progress bar for urlretrieve
- retriever.lib.engine.set_csv_field_size()¶
Set the CSV size limit based on the available resources
- retriever.lib.engine.skip_rows(rows, source)¶
Skip over the header lines by reading them before processing.
retriever.lib.engine_tools module¶
Data Retriever Tools
This module contains miscellaneous classes and functions used in Retriever scripts.
- retriever.lib.engine_tools.create_file(data, output='output_file')¶
Write lines to file from a list.
- retriever.lib.engine_tools.create_home_dir()¶
Create Directory for retriever.
- retriever.lib.engine_tools.file_2list(input_file)¶
Read in a csv file and return its lines as a list.
- retriever.lib.engine_tools.geojson2csv(input_file, output_file, encoding)¶
Convert Geojson file to csv.
Function is used for testing only.
- retriever.lib.engine_tools.getmd5(data, data_type='lines', encoding='utf-8')¶
Get MD5 of a data source.
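Assuming that data_type='lines' means hashing each encoded line in order (an assumption), the idea can be sketched with hashlib:

```python
import hashlib

def getmd5_lines(lines, encoding="utf-8"):
    """Return the MD5 hex digest of a sequence of lines."""
    checksum = hashlib.md5()
    for line in lines:
        checksum.update(line.encode(encoding))
    return checksum.hexdigest()
```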
- retriever.lib.engine_tools.hdf2csv(file, output_file, data_name, data_type, encoding='utf-8')¶
- retriever.lib.engine_tools.json2csv(input_file, output_file=None, header_values=None, encoding='utf-8', row_key=None)¶
Convert Json file to CSV.
- retriever.lib.engine_tools.reset_retriever(scope='all', ask_permission=True)¶
Remove stored information on scripts and data.
- retriever.lib.engine_tools.set_proxy()¶
Check for proxies and make them available to urllib.
- retriever.lib.engine_tools.sort_csv(filename, encoding='utf-8')¶
Sort CSV rows, excluding the header, and return the file.
Function is used for testing only.
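The sort-minus-header behavior can be sketched in memory (illustrative; the library operates on a file on disk):

```python
import csv
import io

def sort_csv_text(text):
    """Sort CSV rows while keeping the header row in place."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], sorted(rows[1:])
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows([header] + body)
    return out.getvalue()
```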
- retriever.lib.engine_tools.sort_file(file_path, encoding='utf-8')¶
Sort a file by line and return the file.
Function is used for testing only.
- retriever.lib.engine_tools.sqlite2csv(input_file, output_file, table_name=None, encoding='utf-8')¶
Convert sqlite database file to CSV.
- retriever.lib.engine_tools.walker(raw_data, row_key=None, header_values=None, rows=[], normalize=False)¶
Extract rows of data from json datasets
- retriever.lib.engine_tools.xml2csv(input_file, output_file, header_values=None, empty_rows=1, encoding='utf-8')¶
Convert xml to csv.
- retriever.lib.engine_tools.xml2csv_test(input_file, outputfile=None, header_values=None, row_tag='row')¶
Convert xml to csv.
Function is used for testing only.
- retriever.lib.engine_tools.xml2dict(data, node, level)¶
Convert xml to dict type.
retriever.lib.excel module¶
Data Retriever Excel Functions
This module contains optional functions for importing data from Excel.
retriever.lib.fetch module¶
- retriever.lib.fetch.fetch(dataset, file='sqlite.db', table_name='{db}_{table}', data_dir='.')¶
Import a dataset into pandas data frames
retriever.lib.get_opts module¶
retriever.lib.install module¶
- retriever.lib.install.install_csv(dataset, table_name='{db}_{table}.csv', data_dir='.', debug=False, use_cache=True, force=False, hash_value=None)¶
Install datasets into csv.
- retriever.lib.install.install_hdf5(dataset, file='hdf5.h5', table_name='{db}_{table}', data_dir='.', debug=False, use_cache=True, hash_value=None)¶
Install datasets into hdf5.
- retriever.lib.install.install_json(dataset, table_name='{db}_{table}.json', data_dir='.', debug=False, use_cache=True, pretty=False, force=False, hash_value=None)¶
Install datasets into json.
- retriever.lib.install.install_msaccess(dataset, file='access.mdb', table_name='[{db} {table}]', data_dir='.', debug=False, use_cache=True, force=False, hash_value=None)¶
Install datasets into msaccess.
- retriever.lib.install.install_mysql(dataset, user='root', password='', host='localhost', port=3306, database_name='{db}', table_name='{db}.{table}', debug=False, use_cache=True, force=False, hash_value=None)¶
Install datasets into mysql.
- retriever.lib.install.install_postgres(dataset, user='postgres', password='', host='localhost', port=5432, database='postgres', database_name='{db}', table_name='{db}.{table}', bbox=[], debug=False, use_cache=True, force=False, hash_value=None)¶
Install datasets into postgres.
- retriever.lib.install.install_sqlite(dataset, file='sqlite.db', table_name='{db}_{table}', data_dir='.', debug=False, use_cache=True, force=False, hash_value=None)¶
Install datasets into sqlite.
- retriever.lib.install.install_xml(dataset, table_name='{db}_{table}.xml', data_dir='.', debug=False, use_cache=True, force=False, hash_value=None)¶
Install datasets into xml.
retriever.lib.load_json module¶
- retriever.lib.load_json.read_json(json_file)¶
Read JSON dataset package files.
Load each JSON file and determine the appropriate encoding for the dataset, then reload the JSON using that encoding to ensure the correct character sets.
retriever.lib.models module¶
Data Retriever Data Model
This module contains basic class definitions for the Retriever platform.
retriever.lib.provenance module¶
- retriever.lib.provenance.commit(dataset, commit_message='', path=None, quiet=False)¶
Commit dataset to a zipped file.
- retriever.lib.provenance.commit_info_for_commit(dataset, commit_message, encoding='utf-8')¶
Generate info for a particular commit.
- retriever.lib.provenance.commit_info_for_installation(metadata_info)¶
Returns a dictionary with commit info and changes between the old and current environments
- retriever.lib.provenance.commit_log(dataset)¶
Shows the logs for a committed dataset in the provenance directory
- retriever.lib.provenance.commit_writer(dataset, commit_message, path, quiet)¶
Creates the committed zipped file
- retriever.lib.provenance.install_committed(path_to_archive, engine, force=False, quiet=False)¶
Installs the committed dataset
- retriever.lib.provenance.installation_details(metadata_info, quiet)¶
Outputs details of the commit, e.g. commit message, time, and changes in the environment
- retriever.lib.provenance.package_details()¶
Returns a dictionary with details of installed packages in the current environment
retriever.lib.provenance_tools module¶
- retriever.lib.provenance_tools.get_metadata(path_to_archive)¶
Returns a dictionary after reading the metadata.json file of a committed dataset
- retriever.lib.provenance_tools.get_script_provenance(path_to_archive)¶
Reads script from archive.
retriever.lib.rdatasets module¶
- retriever.lib.rdatasets.create_rdataset(engine, package, dataset_name, script_path=None)¶
Download files for RDatasets to the raw data directory
- retriever.lib.rdatasets.display_all_rdataset_names(package_name=None)¶
Displays the list of RDataset names present in the provided package(s)
- retriever.lib.rdatasets.get_rdataset_names()¶
Returns a list of all available RDataset names
- retriever.lib.rdatasets.update_rdataset_catalog(test=False)¶
Updates datasets_url.json from the GitHub repository
- retriever.lib.rdatasets.update_rdataset_contents(data_obj, package, dataset_name, json_file)¶
Update the contents of json script
- retriever.lib.rdatasets.update_rdataset_script(data_obj, dataset_name, package, script_path)¶
Renames and updates the RDataset script
retriever.lib.repository module¶
Checks the repository for updates.
- retriever.lib.repository.check_for_updates(repo='https://raw.githubusercontent.com/weecology/retriever-recipes/main/')¶
Check for updates to datasets.
This updates the HOME_DIR scripts directory with the latest script versions
retriever.lib.scripts module¶
- retriever.lib.scripts.SCRIPT_LIST()¶
Return Loaded scripts.
Ensure that only one instance of SCRIPTS is created.
- class retriever.lib.scripts.StoredScripts¶
Bases:
object
Stored scripts class
- get_scripts()¶
Return shared scripts
- set_scripts(script_list)¶
Set shared scripts
- retriever.lib.scripts.check_retriever_minimum_version(module)¶
Return True if a script’s version number is greater than the retriever’s version.
- retriever.lib.scripts.get_data_upstream(search_url)¶
Basic method for getting upstream data
- retriever.lib.scripts.get_dataset_names_upstream(keywords=None, licenses=None, repo='https://raw.githubusercontent.com/weecology/retriever-recipes/main/')¶
Search all datasets upstream by keywords and licenses. If the keywords or licenses argument is passed, GitHub’s search API is used to look in the repositories. Otherwise, the version.txt file is read and the script names are returned.
- retriever.lib.scripts.get_retriever_citation()¶
- retriever.lib.scripts.get_retriever_script_versions()¶
Return the versions of the present local scripts
- retriever.lib.scripts.get_script(dataset)¶
Return the script for a named dataset.
- retriever.lib.scripts.get_script_citation(dataset=None)¶
Get the citation list for a script
- retriever.lib.scripts.get_script_upstream(dataset, repo='https://raw.githubusercontent.com/weecology/retriever-recipes/main/')¶
Return the upstream script for a named dataset.
- retriever.lib.scripts.get_script_version_upstream(dataset, repo='https://raw.githubusercontent.com/weecology/retriever-recipes/main/')¶
Return the upstream script version for a named dataset.
- retriever.lib.scripts.name_matches(scripts, arg)¶
Check for a match of the script in available scripts
If arg is all, return the entire script list. If the exact script is available, return that script. If no exact script name is detected, match the argument against the keywords, title, and name of all scripts and return the closest matches.
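A simplified sketch of the matching rules, with scripts modeled as plain dicts and "closest matches" reduced to substring/keyword checks (both are assumptions):

```python
def name_matches(scripts, arg):
    """'all' returns every script; an exact name returns that script;
    otherwise fall back to name/title/keyword matching."""
    if arg == "all":
        return scripts
    exact = [s for s in scripts if s["name"] == arg]
    if exact:
        return exact
    return [s for s in scripts
            if arg in s["name"] or arg in s.get("title", "")
            or arg in s.get("keywords", [])]

scripts = [
    {"name": "iris", "title": "Iris", "keywords": ["plants"]},
    {"name": "portal", "title": "Portal rodents", "keywords": ["mammals"]},
]
```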
- retriever.lib.scripts.open_csvw(csv_file)¶
Open a csv writer forcing the use of Linux line endings on Windows.
Also sets the dialect to 'excel' and the escape character to a backslash ('\')
- retriever.lib.scripts.open_fr(file_name, encoding='utf-8', encode=True)¶
Open file for reading respecting Python version and OS differences.
Sets newline to Linux line endings on Windows under Python 3. When encode=False, does not set an encoding on *nix under Python 3, keeping the contents as bytes.
- retriever.lib.scripts.open_fw(file_name, encoding='utf-8', encode=True)¶
Open file for writing respecting Python version and OS differences.
Sets newline to Linux line endings on Python 3. When encode=False, does not set an encoding on *nix under Python 3, keeping the contents as bytes.
- retriever.lib.scripts.read_json_version(json_file)¶
Read the version of a script from a JSON file
- retriever.lib.scripts.read_py_version(script_name, search_path)¶
Read the version of a script from a python file
- retriever.lib.scripts.reload_scripts()¶
Load scripts from scripts directory and return list of modules.
- retriever.lib.scripts.to_str(object, object_encoding=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>, object_decoder='utf-8')¶
Convert to str
retriever.lib.socrata module¶
- retriever.lib.socrata.create_socrata_dataset(engine, name, resource, script_path=None)¶
Downloads raw data and creates a script for the socrata dataset
- retriever.lib.socrata.find_socrata_dataset_by_id(dataset_id)¶
Returns metadata for the given dataset id
- retriever.lib.socrata.socrata_autocomplete_search(dataset)¶
Returns the list of dataset names after autocompletion
- retriever.lib.socrata.socrata_dataset_info(dataset_name)¶
Returns the dataset information of the dataset name provided
- retriever.lib.socrata.update_socrata_contents(json_file, script_name, url, resource)¶
Update the contents of the json script
- retriever.lib.socrata.update_socrata_script(script_name, filename, url, resource, script_path)¶
Renames the script name and the contents of the script
- retriever.lib.socrata.url_response(url, params)¶
Returns the GET response for the given url and params
retriever.lib.table module¶
- class retriever.lib.table.Dataset(name=None, url=None)¶
Bases:
object
Dataset generic properties
- class retriever.lib.table.RasterDataset(name=None, url=None, dataset_type='RasterDataset', **kwargs)¶
Bases:
Dataset
Raster table implementation
- class retriever.lib.table.TabularDataset(name=None, url=None, pk=True, contains_pk=False, delimiter=None, header_rows=1, column_names_row=1, fixed_width=False, cleanup=<retriever.lib.cleanup.Cleanup object>, record_id=0, columns=[], replace_columns=[], missingValues=None, cleaned_columns=False, number_of_records=None, **kwargs)¶
Bases:
Dataset
Tabular database table.
- add_dialect()¶
Initialize dialect table properties.
These include a table’s null or missing values, the delimiter, the function to perform on missing values and any values in the dialect’s dict.
- add_schema()¶
Add a schema to the table object.
Define the data type for the columns in the table.
- auto_get_columns(header)¶
Get column names from the header row.
Identifies the column names from the header row. Replaces database keywords with alternatives. Replaces special characters and spaces.
- clean_column_name(column_name)¶
Clean column names following the expected SQL guidelines: remove leading whitespace, replace SQL keywords, etc.
- combine_on_delimiter(line_as_list)¶
Combine a list of values into a line of csv data.
- get_column_datatypes()¶
Get set of column names for insert statements.
- get_insert_columns(join=True, create=False)¶
Get column names for insert statements.
create should be set to True if the returned values are going to be used for creating a new table. It includes the pk_auto column if present. This column is not included by default because it is not used when generating insert statements for database management systems.
- values_from_line(line)¶
Return expected row values
Includes dynamically generated field values like auto pk
retriever.lib.templates module¶
Datasets are defined as scripts and have unique properties. This module defines generic dataset properties and models the functions available for inheritance by scripts or datasets.
- class retriever.lib.templates.BasicTextTemplate(**kwargs)¶
Bases:
Script
Defines the pre-processing required for scripts.
Scripts that need pre-processing should use the download function from this class. Scripts that require extra tuning should override this class.
- download(engine=None, debug=False)¶
Defines the download processes for scripts that utilize the default pre processing steps provided by the retriever.
- process_archived_data(table_obj, url)¶
Pre-process archived files.
Archive info is specified for a single resource or an entire data package. Extract the files from the archived source based on the specifications: either a single file or all files. If the archived data is Excel, use xls_sheets to obtain the files to be extracted.
- process_spatial_insert(table_obj)¶
Process spatial data for insertion
- process_tables(table_obj, url)¶
Obtain the clean file and create a table.
If xls_sheets is specified, convert the Excel file to csv, then create the table from the file.
- process_tabular_insert(table_obj, url)¶
Process tabular data for insertion
- class retriever.lib.templates.HtmlTableTemplate(title='', description='', name='', urls={}, tables={}, ref='', public=True, addendum=None, citation='Not currently available', licenses=[{'name': None}], retriever_minimum_version='', version='', encoding='utf-8', message='', **kwargs)¶
Bases:
Script
Script template for parsing data in HTML tables.
- class retriever.lib.templates.Script(title='', description='', name='', urls={}, tables={}, ref='', public=True, addendum=None, citation='Not currently available', licenses=[{'name': None}], retriever_minimum_version='', version='', encoding='utf-8', message='', **kwargs)¶
Bases:
object
This class defines the properties of a generic dataset.
Each Dataset inherits attributes from this class to define its unique functionality.
- checkengine(engine=None)¶
Returns the required engine instance
- download(engine=None, debug=False)¶
Generic function to prepare for installation or download.
- matches_terms(terms)¶
Check whether the terms match a script’s metadata
- reference_url()¶
Get a reference url as the parent url from data url
retriever.lib.tools module¶
- retriever.lib.tools.excel_csv(src_path, path_to_csv, excel_info=None, encoding='utf-8')¶
Convert an excel sheet to csv.
Reads the src_path excel file and writes the sheet to path_to_csv. excel_info contains the index of the sheet and the excel file name.
- retriever.lib.tools.open_csvw(csv_file)¶
Open a csv writer forcing the use of Linux line endings on Windows.
Also sets the dialect to 'excel' and the escape character to a backslash ('\')
- retriever.lib.tools.open_fr(file_name, encoding='utf-8', encode=True)¶
Open file for reading respecting Python version and OS differences.
Sets newline to Linux line endings on Windows under Python 3. When encode=False, does not set an encoding on *nix under Python 3, keeping the contents as bytes.
- retriever.lib.tools.open_fw(file_name, encoding='utf-8', encode=True)¶
Open file for writing respecting Python version and OS differences.
Sets newline to Linux line endings on Python 3. When encode=False, does not set an encoding on *nix under Python 3, keeping the contents as bytes.
- retriever.lib.tools.to_str(object, object_encoding=<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>, object_decoder='utf-8')¶
Convert encoded values to string
- retriever.lib.tools.walk_relative_path(dir_name)¶
Return relative paths of files in the directory
retriever.lib.warning module¶
- class retriever.lib.warning.Warning(location, warning)¶
Bases:
object
Custom warning class