Basic usage — ConnectorX (original) (raw)
Contents
Basic usage#
ConnectorX enables you to run the SQL query, load data from databases into a Pandas Dataframe in the fastest and most memory efficient way.
API#
connectorx.read_sql(conn: Union[str, Dict[str, str]], query: Union[List[str], str], *, return_type: str = "pandas", protocol: str = "binary", partition_on: Optional[str] = None, partition_range: Optional[Tuple[int, int]] = None, partition_num: Optional[int] = None)
Parameters#
conn: Union[str, Dict[str, str]]
: Connection string URI for querying single database or dict of database names (key) and connection string URIs (value) for querying multiple databases.- Please check out here for connection string examples of each database
query: Union[str, List[str]]
: SQL query or list of partitioned SQL queries for fetching data.return_type: str = "pandas"
: The return type of this function. It can bearrow
,pandas
,modin
,dask
orpolars
.protocol: str = "binary"
: The protocol used to fetch data from source, default isbinary
. Check out here to see more details.partition_on: Optional[str]
: The column to partition the result.partition_range: Optional[Tuple[int, int]]
: The value range of the partition column.partition_num: Optional[int]
: The number of partitions to generate.index_col: Optional[str]
: The index column to set for the result dataframe. Only applicable whenreturn_type
ispandas
,modin
ordask
.pre_execution_query: Optional[Union[str, List[str]]]
: SQL query or list of SQL queries executed before main query. Can be used to set runtime configurations using SET statements. Only applicable for source “Postgres” and “MySQL”
Examples#
- Read a DataFrame from a SQL using a single thread
import connectorx as cx
postgres_url = "postgresql://username:password@server:port/database"
query = "SELECT * FROM lineitem"
cx.read_sql(postgres_url, query) - Read a DataFrame parallelly using 10 threads by automatically partitioning the provided SQL on the partition column (
partition_range
will be automatically queried if not given)
import connectorx as cx
postgres_url = "postgresql://username:password@server:port/database"
query = "SELECT * FROM lineitem"
cx.read_sql(postgres_url, query, partition_on="l_orderkey", partition_num=10) - Read a DataFrame parallelly using 2 threads by manually providing two partition SQLs (the schemas of all the query results should be same)
import connectorx as cx
postgres_url = "postgresql://username:password@server:port/database"
queries = ["SELECT * FROM lineitem WHERE l_orderkey <= 30000000", "SELECT * FROM lineitem WHERE l_orderkey > 30000000"]
cx.read_sql(postgres_url, queries) - Read a DataFrame parallelly using 4 threads from a more complex query
import connectorx as cx
postgres_url = "postgresql://username:password@server:port/database"
query = f"""
SELECT l_orderkey,
SUM(l_extendedprice * ( 1 - l_discount )) AS revenue,
o_orderdate,
o_shippriority
FROM customer,
orders,
lineitem
WHERE c_mktsegment = 'BUILDING'
AND c_custkey = o_custkey
AND l_orderkey = o_orderkey
AND o_orderdate < DATE '1995-03-15'
AND l_shipdate > DATE '1995-03-15'
GROUP BY l_orderkey,
o_orderdate,
o_shippriority
"""
cx.read_sql(postgres_url, query, partition_on="l_orderkey", partition_num=4)
- Read a DataFrame from a SQL joined from multiple databases (experimental, only support PostgreSQL for now)
import connectorx as cx
db1 = "postgresql://username1:password1@server1:port1/database1"
db2 = "postgresql://username2:password2@server2:port2/database2"
query = "SELECT * FROM db1.nation n, db2.region r where n.n_regionkey = r.r_regionkey"
cx.read_sql({"db1": db1, "db2": db2}, query)