Function Reference¶
- sqlite_manager.create_sqlite_db(df: DataFrame, schema_file: str | PathLike, db_file: str | PathLike, table_name: str, log_dir: str | PathLike = None) None¶
Create or update an SQLite database using a schema file and load data into a specified table from a pandas DataFrame.
- Parameters:
df (pd.DataFrame) – The data to be loaded into the database.
schema_file (Union[str, os.PathLike]) – Path to the SQL file containing the schema definition.
db_file (Union[str, os.PathLike]) – Path to the SQLite database file to be created or updated.
table_name (str) – Name of the table to insert the DataFrame into.
log_dir (Union[str, os.PathLike], optional) – Directory where the log file is written. Default is the current working directory.
- Raises:
FileNotFoundError – If the schema file does not exist.
ValueError – If the specified table name is not found in the schema or the schema does not match the DataFrame structure.
Examples
import pandas as pd from sqlite_manager import create_sqlite_db # Define the DataFrame to insert data = { "InvoiceNo": ["A001", "A002", "A003"], "StockCode": ["P001", "P002", "P003"], "Description": ["Product 1", "Product 2", "Product 3"], "Quantity": [10, 5, 20], "InvoiceDate": ["2023-01-01", "2023-01-02", "2023-01-03"], "UnitPrice": [12.5, 8.0, 15.0], "CustomerID": ["C001", "C002", "C003"], "Country": ["USA", "UK", "Germany"] } df = pd.DataFrame(data) # Schema file (SQL file defining the database schema) schema_file = "schema.sql" # Contents of schema.sql: # CREATE TABLE IF NOT EXISTS OnlineRetail ( # InvoiceNo TEXT NOT NULL, # StockCode TEXT NOT NULL, # Description TEXT, # Quantity INTEGER NOT NULL, # InvoiceDate TEXT NOT NULL, # UnitPrice REAL NOT NULL, # CustomerID TEXT, # Country TEXT # ); # SQLite database file to create or update db_file = "data/online_retail.db" # Create or update the database and insert data into the table create_sqlite_db( df=df, schema_file=schema_file, db_file=db_file, table_name="OnlineRetail", log_dir="." # Optional )
- sqlite_manager.run_sql_queries(db_file: str | PathLike, query_dir: str | PathLike = None, output_dir: str | PathLike = None, rerun_all: bool = False, rerun_queries: List[str] = None, log_dir: str | PathLike = None) None¶
Execute all SQL queries in a directory (including subdirectories) on a SQLite database. Output results as CSV files mirroring the SQL directory structure.
- Parameters:
db_file (str or Path) – The path to the SQLite database file.
query_dir (str or Path, optional) – Path to the directory containing SQL query files. Default is os.path.abspath(“./sql_queries”)
output_dir (str or Path, optional) – Path to the output directory for CSV files. Default is os.path.abspath(“./query_results”)
rerun_all (bool, optional) – Whether to rerun queries whose output already exists (default is False).
rerun_queries (list of str, optional) – List of specific query filenames to rerun, regardless of existing output.
log_dir (Union[str, os.PathLike], optional) – Directory where the log file is written. Default is the current working directory.
- Return type:
None
Examples
from sqlite_manager import run_sql_queries query_dir = "sql_queries" # Directory containing SQL files db_file = "data/online_retail.db" # SQLite database file output_dir = "output" # Directory to store query results # Run all queries, skipping those with existing outputs run_sql_queries ( query_dir, db_file, output_dir, log_dir="." ) # Rerun all queries regardless of existing outputs run_sql_queries ( query_dir, db_file, output_dir, rerun_all=True ) # Rerun specific queries run_sql_queries ( query_dir, db_file, output_dir, rerun_queries=["query1.sql", "query2.sql"] ) # # Input directory # sql_queries/ # ├── task_1/ # │ ├── query1.sql # │ ├── query2.sql # ├── task_2/ # │ ├── query3.sql # │ └── query4.sql # # # Output Directory (Query Results): # output/ # ├── task_1/ # │ ├── query1.csv # │ ├── query2.csv # ├── task_2/ # │ ├── query3.csv # │ └── query4.csv
- sqlite_manager.run_plot_functions(query_results_dir: str | Path, plot_functions_dir: str | Path, output_dir: str | Path, log_dir: str | Path = None, rerun_all: bool = False, rerun_functions: list[str] | None = None) None¶
Run plotting functions based on query results and instructions.
- Parameters:
query_results_dir (str or Path) – The path to the directory containing .csv results from SQL queries.
plot_functions_dir (str or Path) – The path to the directory containing Python files with plotting functions.
output_dir (str or Path) – The path to the directory where the plots should be saved.
log_dir (str or Path) – The directory where the log file should be saved. A timestamped log file will be created automatically. Default is the current working dir.
rerun_all (bool, optional) – Force rerun of all plotting functions, by default False.
rerun_functions (list of str, optional) – List of specific plotting functions to rerun, by default None.
Examples
Basic usage:
Assuming you have a directory structure like this:
project/ ├── query_results/ │ └── data.csv ├── plot_functions/ │ └── plot_example.py └── output/ └── log.txt
You can call the function as follows:
from pathlib import Path from sqlite_manager.run_plot_functions import run_plot_functions run_plot_functions( query_results_dir=Path("project/query_results"), plot_functions_dir=Path("project/plot_functions"), output_dir=Path("project/output"), log_file=Path("project/log.txt"), rerun_all=True )
This will: - Load all .csv files from the query_results/ directory. - Execute all plotting functions in plot_functions/. - Save plots to the output/ directory. - Log messages to log.txt.