Backend agnostic arrays – Ibis (original) (raw)
Introduction
This is a redux of a previous post showing Ibis’s portability in action.
Ibis is portable across complex operations and backends of very different scales and deployment models!
Results differ slightly between BigQuery and DuckDB
The datasets used in each backend are slightly different.
I opted to avoid ETL for the BigQuery backend by reusing the Google-provided IMDB dataset.
The tradeoff is the slight discrepancy in results.
Basics
We’ll start with from ibis.interactive import *
for maximum convenience.
from ibis.interactive import *
1
from ibis.interactive import *
imports Ibis APIs into the global namespace and enables interactive mode.
Connect to your database
- DuckDB
- BigQuery
ddb = ibis.connect("duckdb://")
ddb.create_table(
"name_basics", ex.imdb_name_basics.fetch(backend=ddb).rename("snake_case")
)
ddb.create_table(
"title_basics", ex.imdb_title_basics.fetch(backend=ddb).rename("snake_case")
)
1
Create a table called name_basics
in our DuckDB database using ibis.examples
data
2
Create a table called title_basics
in our DuckDB database using ibis.examples
data
┏━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ tconst ┃ title_type ┃ primary_title ┃ original_title ┃ is_adult ┃ start_year ┃ end_year ┃ runtime_minutes ┃ genres ┃ ┡━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ string │ string │ string │ string │ int64 │ int64 │ string │ int64 │ string │ ├───────────┼────────────┼─────────────────────────────────────────────┼─────────────────────────────────────────────┼──────────┼────────────┼──────────┼─────────────────┼──────────────────────────┤ │ tt0000001 │ short │ Carmencita │ Carmencita │ 0 │ 1894 │ NULL │ 1 │ Documentary,Short │ │ tt0000002 │ short │ Le clown et ses chiens │ Le clown et ses chiens │ 0 │ 1892 │ NULL │ 5 │ Animation,Short │ │ tt0000003 │ short │ Pauvre Pierrot │ Pauvre Pierrot │ 0 │ 1892 │ NULL │ 4 │ Animation,Comedy,Romance │ │ tt0000004 │ short │ Un bon bock │ Un bon bock │ 0 │ 1892 │ NULL │ 12 │ Animation,Short │ │ tt0000005 │ short │ Blacksmith Scene │ Blacksmith Scene │ 0 │ 1893 │ NULL │ 1 │ Comedy,Short │ │ tt0000006 │ short │ Chinese Opium Den │ Chinese Opium Den │ 0 │ 1894 │ NULL │ 1 │ Short │ │ tt0000007 │ short │ Corbett and Courtney Before the Kinetograph │ Corbett and Courtney Before the Kinetograph │ 0 │ 1894 │ NULL │ 1 │ Short,Sport │ │ tt0000008 │ short │ Edison Kinetoscopic Record of a Sneeze │ Edison Kinetoscopic Record of a Sneeze │ 0 │ 1894 │ NULL │ 1 │ Documentary,Short │ │ tt0000009 │ movie │ Miss Jerry │ Miss Jerry │ 0 │ 1894 │ NULL │ 45 │ Romance │ │ tt0000010 │ short │ Leaving the Factory │ La sortie de l'usine Lumière à Lyon │ 0 │ 1895 │ NULL │ 1 │ Documentary,Short │ │ … │ … │ … │ … │ … │ … │ … │ … │ … │ └───────────┴────────────┴─────────────────────────────────────────────┴─────────────────────────────────────────────┴──────────┴────────────┴──────────┴─────────────────┴──────────────────────────┘
bq = ibis.connect("bigquery://ibis-gbq")
bq.set_database("bigquery-public-data.imdb")
1
Google provides a public BigQuery dataset for IMDB data.
Let’s pull out the name_basics
table, which contains names and metadata about people listed on IMDB. We’ll call this ents
(short for entities
), and remove some columns we won’t need:
- DuckDB
- BigQuery
ddb_ents = ddb.tables.name_basics.drop("birth_year", "death_year")
ddb_ents.order_by("nconst")
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃ ┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ string │ string │ string │ string │ ├───────────┼─────────────────┼─────────────────────────────────────┼─────────────────────────────────────────┤ │ nm0000001 │ Fred Astaire │ soundtrack,actor,miscellaneous │ tt0053137,tt0072308,tt0045537,tt0050419 │ │ nm0000002 │ Lauren Bacall │ actress,soundtrack │ tt0037382,tt0117057,tt0075213,tt0038355 │ │ nm0000003 │ Brigitte Bardot │ actress,soundtrack,music_department │ tt0057345,tt0054452,tt0049189,tt0056404 │ │ nm0000004 │ John Belushi │ actor,soundtrack,writer │ tt0072562,tt0078723,tt0077975,tt0080455 │ │ nm0000005 │ Ingmar Bergman │ writer,director,actor │ tt0083922,tt0069467,tt0050976,tt0050986 │ │ nm0000006 │ Ingrid Bergman │ actress,soundtrack,producer │ tt0038109,tt0036855,tt0034583,tt0038787 │ │ nm0000007 │ Humphrey Bogart │ actor,soundtrack,producer │ tt0037382,tt0034583,tt0042593,tt0043265 │ │ nm0000008 │ Marlon Brando │ actor,soundtrack,director │ tt0068646,tt0070849,tt0078788,tt0047296 │ │ nm0000009 │ Richard Burton │ actor,soundtrack,producer │ tt0057877,tt0059749,tt0061184,tt0087803 │ │ nm0000010 │ James Cagney │ actor,soundtrack,director │ tt0042041,tt0035575,tt0029870,tt0031867 │ │ … │ … │ … │ … │ └───────────┴─────────────────┴─────────────────────────────────────┴─────────────────────────────────────────┘
bq_ents = bq.tables.name_basics.drop("birth_year", "death_year")
bq_ents.order_by("nconst")
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃ ┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ string │ string │ string │ string │ ├───────────┼─────────────────┼─────────────────────────────────────┼─────────────────────────────────────────┤ │ nm0000001 │ Fred Astaire │ soundtrack,actor,miscellaneous │ tt0072308,tt0053137,tt0031983,tt0050419 │ │ nm0000002 │ Lauren Bacall │ actress,soundtrack │ tt0038355,tt0075213,tt0117057,tt0037382 │ │ nm0000003 │ Brigitte Bardot │ actress,soundtrack,music_department │ tt0049189,tt0054452,tt0056404,tt0057345 │ │ nm0000004 │ John Belushi │ actor,soundtrack,writer │ tt0072562,tt0078723,tt0080455,tt0077975 │ │ nm0000005 │ Ingmar Bergman │ writer,director,actor │ tt0050976,tt0083922,tt0069467,tt0050986 │ │ nm0000006 │ Ingrid Bergman │ actress,soundtrack,producer │ tt0034583,tt0038787,tt0038109,tt0036855 │ │ nm0000007 │ Humphrey Bogart │ actor,soundtrack,producer │ tt0037382,tt0043265,tt0034583,tt0042593 │ │ nm0000008 │ Marlon Brando │ actor,soundtrack,director │ tt0068646,tt0070849,tt0047296,tt0078788 │ │ nm0000009 │ Richard Burton │ actor,soundtrack,producer │ tt0061184,tt0087803,tt0057877,tt0059749 │ │ nm0000010 │ James Cagney │ actor,soundtrack,director │ tt0042041,tt0029870,tt0031867,tt0035575 │ │ … │ … │ … │ … │ └───────────┴─────────────────┴─────────────────────────────────────┴─────────────────────────────────────────┘
Splitting strings into arrays
We can see that known_for_titles
looks sort of like an array, so let’s call the split method on that column and replace the existing column:
- DuckDB
- BigQuery
ddb_ents = ddb_ents.mutate(known_for_titles=_.known_for_titles.split(","))
ddb_ents.order_by("nconst")
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃ ┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ string │ string │ string │ array │ ├───────────┼─────────────────┼─────────────────────────────────────┼────────────────────────────────────┤ │ nm0000001 │ Fred Astaire │ soundtrack,actor,miscellaneous │ ['tt0053137', 'tt0072308', ... +2] │ │ nm0000002 │ Lauren Bacall │ actress,soundtrack │ ['tt0037382', 'tt0117057', ... +2] │ │ nm0000003 │ Brigitte Bardot │ actress,soundtrack,music_department │ ['tt0057345', 'tt0054452', ... +2] │ │ nm0000004 │ John Belushi │ actor,soundtrack,writer │ ['tt0072562', 'tt0078723', ... +2] │ │ nm0000005 │ Ingmar Bergman │ writer,director,actor │ ['tt0083922', 'tt0069467', ... +2] │ │ nm0000006 │ Ingrid Bergman │ actress,soundtrack,producer │ ['tt0038109', 'tt0036855', ... +2] │ │ nm0000007 │ Humphrey Bogart │ actor,soundtrack,producer │ ['tt0037382', 'tt0034583', ... +2] │ │ nm0000008 │ Marlon Brando │ actor,soundtrack,director │ ['tt0068646', 'tt0070849', ... +2] │ │ nm0000009 │ Richard Burton │ actor,soundtrack,producer │ ['tt0057877', 'tt0059749', ... +2] │ │ nm0000010 │ James Cagney │ actor,soundtrack,director │ ['tt0042041', 'tt0035575', ... +2] │ │ … │ … │ … │ … │ └───────────┴─────────────────┴─────────────────────────────────────┴────────────────────────────────────┘
bq_ents = bq_ents.mutate(known_for_titles=_.known_for_titles.split(","))
bq_ents.order_by("nconst")
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃ ┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ string │ string │ string │ array │ ├───────────┼─────────────────┼─────────────────────────────────────┼────────────────────────────────────┤ │ nm0000001 │ Fred Astaire │ soundtrack,actor,miscellaneous │ ['tt0072308', 'tt0053137', ... +2] │ │ nm0000002 │ Lauren Bacall │ actress,soundtrack │ ['tt0038355', 'tt0075213', ... +2] │ │ nm0000003 │ Brigitte Bardot │ actress,soundtrack,music_department │ ['tt0049189', 'tt0054452', ... +2] │ │ nm0000004 │ John Belushi │ actor,soundtrack,writer │ ['tt0072562', 'tt0078723', ... +2] │ │ nm0000005 │ Ingmar Bergman │ writer,director,actor │ ['tt0050976', 'tt0083922', ... +2] │ │ nm0000006 │ Ingrid Bergman │ actress,soundtrack,producer │ ['tt0034583', 'tt0038787', ... +2] │ │ nm0000007 │ Humphrey Bogart │ actor,soundtrack,producer │ ['tt0037382', 'tt0043265', ... +2] │ │ nm0000008 │ Marlon Brando │ actor,soundtrack,director │ ['tt0068646', 'tt0070849', ... +2] │ │ nm0000009 │ Richard Burton │ actor,soundtrack,producer │ ['tt0061184', 'tt0087803', ... +2] │ │ nm0000010 │ James Cagney │ actor,soundtrack,director │ ['tt0042041', 'tt0029870', ... +2] │ │ … │ … │ … │ … │ └───────────┴─────────────────┴─────────────────────────────────────┴────────────────────────────────────┘
Similarly for primary_profession
, since people involved in show business often have more than one responsibility on a project:
- DuckDB
- BigQuery
ddb_ents = ddb_ents.mutate(primary_profession=_.primary_profession.split(","))
bq_ents = bq_ents.mutate(primary_profession=_.primary_profession.split(","))
Array length
Let’s see how many titles each entity is known for, and then show the five people with the largest number of titles they’re known for.
This is computed using the length API on array expressions:
- DuckDB
- BigQuery
(
ddb_ents.select("primary_name", num_titles=_.known_for_titles.length())
.order_by(_.num_titles.desc())
.limit(5)
)
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓ ┃ primary_name ┃ num_titles ┃ ┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩ │ string │ int64 │ ├──────────────────┼────────────┤ │ Alex Koenigsmark │ 5 │ │ Carrie Schnelker │ 5 │ │ Henry Townsend │ 5 │ │ Sally Sun │ 5 │ │ Matthew Kavuma │ 5 │ └──────────────────┴────────────┘
(
bq_ents.select("primary_name", num_titles=_.known_for_titles.length())
.order_by(_.num_titles.desc())
.limit(5)
)
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓ ┃ primary_name ┃ num_titles ┃ ┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩ │ string │ int64 │ ├─────────────────────┼────────────┤ │ José Jaime Espinosa │ 6 │ │ Paul Winter │ 6 │ │ Nicolas Bernier │ 6 │ │ Chris Estrada │ 6 │ │ Tsuyotake Matsuda │ 5 │ └─────────────────────┴────────────┘
It seems like the length of the known_for_titles
might be capped at some small number!
Index
We can see the position of "actor"
or "actress"
in primary_profession
s:
- DuckDB
- BigQuery
ddb_ents.primary_profession.index("actor")
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ ArrayPosition(primary_profession, 'actor') ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ int64 │ ├────────────────────────────────────────────┤ │ 1 │ │ -1 │ │ -1 │ │ 0 │ │ 2 │ │ -1 │ │ 0 │ │ 0 │ │ 0 │ │ 0 │ │ … │ └────────────────────────────────────────────┘
ddb_ents.primary_profession.index("actress")
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ ArrayPosition(primary_profession, 'actress') ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ int64 │ ├──────────────────────────────────────────────┤ │ -1 │ │ 0 │ │ 0 │ │ -1 │ │ -1 │ │ 0 │ │ -1 │ │ -1 │ │ -1 │ │ -1 │ │ … │ └──────────────────────────────────────────────┘
bq_ents.primary_profession.index("actor")
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ ArrayPosition(primary_profession, 'actor') ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ int64 │ ├────────────────────────────────────────────┤ │ -1 │ │ -1 │ │ -1 │ │ -1 │ │ -1 │ │ -1 │ │ -1 │ │ -1 │ │ -1 │ │ -1 │ │ … │ └────────────────────────────────────────────┘
bq_ents.primary_profession.index("actress")
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ ArrayPosition(primary_profession, 'actress') ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ int64 │ ├──────────────────────────────────────────────┤ │ -1 │ │ -1 │ │ -1 │ │ -1 │ │ -1 │ │ -1 │ │ -1 │ │ -1 │ │ -1 │ │ -1 │ │ … │ └──────────────────────────────────────────────┘
A return value of -1
indicates that "actor"
is not present in the value.
Let’s look for entities that are not primarily actors.
We can do this using the index method by checking whether the positions of the strings "actor"
or "actress"
are both greater than 0:
- DuckDB
- BigQuery
actor_index = ddb_ents.primary_profession.index("actor")
actress_index = ddb_ents.primary_profession.index("actress")
ddb_not_primarily_acting = (actor_index > 0) & (actress_index > 0)
ddb_not_primarily_acting.mean()
actor_index = bq_ents.primary_profession.index("actor")
actress_index = bq_ents.primary_profession.index("actress")
bq_not_primarily_acting = (actor_index > 0) & (actress_index > 0)
bq_not_primarily_acting.mean()
Who are they?
- DuckDB
- BigQuery
ddb_ents[ddb_not_primarily_acting].order_by("nconst")
┏━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓ ┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃ ┡━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩ │ string │ string │ array │ array │ └────────┴──────────────┴────────────────────┴──────────────────┘
bq_ents[bq_not_primarily_acting].order_by("nconst")
┏━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓ ┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃ ┡━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩ │ string │ string │ array │ array │ └────────┴──────────────┴────────────────────┴──────────────────┘
It’s not 100% clear whether the order of elements in primary_profession
matters here.
Containment
We can get people who are listed as actors or actresses using contains
:
- DuckDB
- BigQuery
ddb_non_actors = bq_ents[
~_.primary_profession.contains("actor") & ~_.primary_profession.contains("actress")
]
ddb_non_actors.order_by("nconst")
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃ ┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ string │ string │ array │ array │ ├───────────┼──────────────────┼────────────────────────────────────────────┼────────────────────────────────────┤ │ nm0000016 │ Georges Delerue │ ['composer', 'soundtrack', ... +1] │ ['tt8847712', 'tt0091763', ... +2] │ │ nm0000025 │ Jerry Goldsmith │ ['music_department', 'soundtrack', ... +1] │ ['tt0077269', 'tt0117731', ... +2] │ │ nm0000033 │ Alfred Hitchcock │ ['director', 'producer', ... +1] │ ['tt0054215', 'tt0052357', ... +2] │ │ nm0000035 │ James Horner │ ['music_department', 'soundtrack', ... +1] │ ['tt0177971', 'tt0120338', ... +2] │ │ nm0000040 │ Stanley Kubrick │ ['director', 'writer', ... +1] │ ['tt0120663', 'tt0066921', ... +2] │ │ nm0000041 │ Akira Kurosawa │ ['writer', 'director', ... +1] │ ['tt0080979', 'tt0089881', ... +2] │ │ nm0000049 │ Henry Mancini │ ['music_department', 'soundtrack', ... +1] │ ['tt0383216', 'tt0054698', ... +2] │ │ nm0000055 │ Alfred Newman │ ['music_department', 'composer', ... +1] │ ['tt0049408', 'tt0434409', ... +2] │ │ nm0000065 │ Nino Rota │ ['composer', 'soundtrack', ... +1] │ ['tt0071562', 'tt0056801', ... +2] │ │ nm0000067 │ Miklós Rózsa │ ['music_department', 'composer', ... +1] │ ['tt0052618', 'tt0038109', ... +2] │ │ … │ … │ … │ … │ └───────────┴──────────────────┴────────────────────────────────────────────┴────────────────────────────────────┘
bq_non_actors = bq_ents[
~_.primary_profession.contains("actor") & ~_.primary_profession.contains("actress")
]
bq_non_actors.order_by("nconst")
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃ ┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ string │ string │ array │ array │ ├───────────┼──────────────────┼────────────────────────────────────────────┼────────────────────────────────────┤ │ nm0000016 │ Georges Delerue │ ['composer', 'soundtrack', ... +1] │ ['tt8847712', 'tt0091763', ... +2] │ │ nm0000025 │ Jerry Goldsmith │ ['music_department', 'soundtrack', ... +1] │ ['tt0077269', 'tt0117731', ... +2] │ │ nm0000033 │ Alfred Hitchcock │ ['director', 'producer', ... +1] │ ['tt0054215', 'tt0052357', ... +2] │ │ nm0000035 │ James Horner │ ['music_department', 'soundtrack', ... +1] │ ['tt0177971', 'tt0120338', ... +2] │ │ nm0000040 │ Stanley Kubrick │ ['director', 'writer', ... +1] │ ['tt0120663', 'tt0066921', ... +2] │ │ nm0000041 │ Akira Kurosawa │ ['writer', 'director', ... +1] │ ['tt0080979', 'tt0089881', ... +2] │ │ nm0000049 │ Henry Mancini │ ['music_department', 'soundtrack', ... +1] │ ['tt0383216', 'tt0054698', ... +2] │ │ nm0000055 │ Alfred Newman │ ['music_department', 'composer', ... +1] │ ['tt0049408', 'tt0434409', ... +2] │ │ nm0000065 │ Nino Rota │ ['composer', 'soundtrack', ... +1] │ ['tt0071562', 'tt0056801', ... +2] │ │ nm0000067 │ Miklós Rózsa │ ['music_department', 'composer', ... +1] │ ['tt0052618', 'tt0038109', ... +2] │ │ … │ … │ … │ … │ └───────────┴──────────────────┴────────────────────────────────────────────┴────────────────────────────────────┘
Element removal
We can remove elements from arrays too.
remove() does not mutate the underlying data
Let’s see who only has “actor” in the list of their primary professions:
- DuckDB
- BigQuery
ddb_ents.filter(
[
_.primary_profession.length() > 0,
_.primary_profession.remove("actor").length() == 0,
_.primary_profession.remove("actress").length() == 0,
]
).order_by("nconst")
┏━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓ ┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃ ┡━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩ │ string │ string │ array │ array │ └────────┴──────────────┴────────────────────┴──────────────────┘
bq_ents.filter(
[
_.primary_profession.length() > 0,
_.primary_profession.remove("actor").length() == 0,
_.primary_profession.remove("actress").length() == 0,
]
).order_by("nconst")
┏━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓ ┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃ ┡━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩ │ string │ string │ array │ array │ └────────┴──────────────┴────────────────────┴──────────────────┘
Slicing with square-bracket syntax
Let’s remove everyone’s first profession from the list, but only if they have more than one profession listed:
- DuckDB
- BigQuery
ddb_ents[_.primary_profession.length() > 1].mutate(
primary_profession=_.primary_profession[1:],
).order_by("nconst")
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃ ┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ string │ string │ array │ array │ ├───────────┼─────────────────┼────────────────────────────────────┼────────────────────────────────────┤ │ nm0000001 │ Fred Astaire │ ['actor', 'miscellaneous'] │ ['tt0053137', 'tt0072308', ... +2] │ │ nm0000002 │ Lauren Bacall │ ['soundtrack'] │ ['tt0037382', 'tt0117057', ... +2] │ │ nm0000003 │ Brigitte Bardot │ ['soundtrack', 'music_department'] │ ['tt0057345', 'tt0054452', ... +2] │ │ nm0000004 │ John Belushi │ ['soundtrack', 'writer'] │ ['tt0072562', 'tt0078723', ... +2] │ │ nm0000005 │ Ingmar Bergman │ ['director', 'actor'] │ ['tt0083922', 'tt0069467', ... +2] │ │ nm0000006 │ Ingrid Bergman │ ['soundtrack', 'producer'] │ ['tt0038109', 'tt0036855', ... +2] │ │ nm0000007 │ Humphrey Bogart │ ['soundtrack', 'producer'] │ ['tt0037382', 'tt0034583', ... +2] │ │ nm0000008 │ Marlon Brando │ ['soundtrack', 'director'] │ ['tt0068646', 'tt0070849', ... +2] │ │ nm0000009 │ Richard Burton │ ['soundtrack', 'producer'] │ ['tt0057877', 'tt0059749', ... +2] │ │ nm0000010 │ James Cagney │ ['soundtrack', 'director'] │ ['tt0042041', 'tt0035575', ... +2] │ │ … │ … │ … │ … │ └───────────┴─────────────────┴────────────────────────────────────┴────────────────────────────────────┘
bq_ents[_.primary_profession.length() > 1].mutate(
primary_profession=_.primary_profession[1:],
).order_by("nconst")
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃ ┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ string │ string │ array │ array │ ├───────────┼─────────────────┼────────────────────────────────────┼────────────────────────────────────┤ │ nm0000001 │ Fred Astaire │ ['actor', 'miscellaneous'] │ ['tt0072308', 'tt0053137', ... +2] │ │ nm0000002 │ Lauren Bacall │ ['soundtrack'] │ ['tt0038355', 'tt0075213', ... +2] │ │ nm0000003 │ Brigitte Bardot │ ['soundtrack', 'music_department'] │ ['tt0049189', 'tt0054452', ... +2] │ │ nm0000004 │ John Belushi │ ['soundtrack', 'writer'] │ ['tt0072562', 'tt0078723', ... +2] │ │ nm0000005 │ Ingmar Bergman │ ['director', 'actor'] │ ['tt0050976', 'tt0083922', ... +2] │ │ nm0000006 │ Ingrid Bergman │ ['soundtrack', 'producer'] │ ['tt0034583', 'tt0038787', ... +2] │ │ nm0000007 │ Humphrey Bogart │ ['soundtrack', 'producer'] │ ['tt0037382', 'tt0043265', ... +2] │ │ nm0000008 │ Marlon Brando │ ['soundtrack', 'director'] │ ['tt0068646', 'tt0070849', ... +2] │ │ nm0000009 │ Richard Burton │ ['soundtrack', 'producer'] │ ['tt0061184', 'tt0087803', ... +2] │ │ nm0000010 │ James Cagney │ ['soundtrack', 'director'] │ ['tt0042041', 'tt0029870', ... +2] │ │ … │ … │ … │ … │ └───────────┴─────────────────┴────────────────────────────────────┴────────────────────────────────────┘
Set operations and sorting
Treating arrays as sets is possible with the union and intersect APIs.
Let’s take a look at intersect
.
Intersection
Let’s see if we can use array intersection to figure which actors share known-for titles and sort the result:
- DuckDB
- BigQuery
left = ddb_ents.filter(_.known_for_titles.length() > 0).limit(10_000)
right = left.view()
shared_titles = (
left
.join(right, left.nconst != right.nconst)
.select(
s.startswith("known_for_titles"),
left_name="primary_name",
right_name="primary_name_right",
)
.filter(_.known_for_titles.intersect(_.known_for_titles_right).length() > 0)
.group_by(name="left_name")
.agg(together_with=_.right_name.collect())
.mutate(together_with=_.together_with.unique().sort())
)
shared_titles
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ name ┃ together_with ┃ ┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ string │ array │ ├──────────────────────┼───────────────────────────────────────────────┤ │ Ava Gardner │ ['Ernest Gold', 'Fred Astaire'] │ │ Cyd Charisse │ ['Fred Astaire'] │ │ John Landis │ ['Dan Aykroyd', 'Dick Ziker', ... +14] │ │ Michael Curtiz │ ['Alan Hale', 'Ann Blyth', ... +19] │ │ Francis Ford Coppola │ ['Abe Vigoda', 'Al Pacino', ... +19] │ │ Bernardo Bertolucci │ ['Armand Abplanalp', 'James Acheson', ... +3] │ │ Karl Malden │ ['Abraxas Aaran', 'Alex North', ... +14] │ │ Richard Conte │ ['Abe Vigoda', 'Al Pacino', ... +9] │ │ George Orwell │ ['John Hurt', 'Richard Burton'] │ │ Joseph L. Mankiewicz │ ['Alfred Newman', 'Anne Baxter', ... +13] │ │ … │ … │ └──────────────────────┴───────────────────────────────────────────────┘
left = bq_ents.filter(_.known_for_titles.length() > 0).limit(10_000)
right = left.view()
shared_titles = (
left
.join(right, left.nconst != right.nconst)
.select(
s.startswith("known_for_titles"),
left_name="primary_name",
right_name="primary_name_right",
)
.filter(_.known_for_titles.intersect(_.known_for_titles_right).length() > 0)
.group_by(name="left_name")
.agg(together_with=_.right_name.collect())
.mutate(together_with=_.together_with.unique().sort())
)
shared_titles
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ name ┃ together_with ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ string │ array │ ├─────────────────────────┼─────────────────────────────────────────────────────┤ │ Pavel Vrba │ ['F.C. Lokomotiv Moscow', 'Jeffrey Bruma', ... +4] │ │ Greg Carrolan │ ['Al Cambronne', 'Alana Tornello', ... +20] │ │ Aleksander Parzychowski │ ['Adam Korszun', 'Grzegorz Wawrzenczyk', ... +5] │ │ James Walt │ ['Anton Testino', 'Ben Walanka', ... +10] │ │ Ellen Dallaglio │ ['Antonia Giovanazzi', 'Fra McCann', ... +10] │ │ Catarina Martins │ ['Miguel Oliveira', 'Ricardo Gordon', ... +1] │ │ Stanislav Sesták │ ['Martin Glenn', 'Miso Brecko', ... +6] │ │ Allison Cabot │ ['Brenda Beard', 'Brian Fenmore', ... +16] │ │ Vasilis Bouzianas │ ['Aggelos Kasolas', 'Christos Patriarheas', ... +3] │ │ Marie Muldoon │ ['Alan Oxley', 'Andrew Raeber', ... +39] │ │ … │ … │ └─────────────────────────┴─────────────────────────────────────────────────────┘
Advanced operations
Flatten arrays into rows
Thanks to the tireless efforts of the folks working on sqlglot, as of version 7.0.0 Ibis supports unnest for BigQuery!
You can use it standalone on a column expression:
- DuckDB
- BigQuery
ddb_ents.primary_profession.unnest()
┏━━━━━━━━━━━━━━━━━━━━┓ ┃ primary_profession ┃ ┡━━━━━━━━━━━━━━━━━━━━┩ │ string │ ├────────────────────┤ │ soundtrack │ │ actor │ │ miscellaneous │ │ actress │ │ soundtrack │ │ actress │ │ soundtrack │ │ music_department │ │ actor │ │ soundtrack │ │ … │ └────────────────────┘
bq_ents.primary_profession.unnest()
┏━━━━━━━━━━━━━━━━━━━━┓ ┃ primary_profession ┃ ┡━━━━━━━━━━━━━━━━━━━━┩ │ string │ ├────────────────────┤ │ actor │ │ actor │ │ actor │ │ actor │ │ actor │ │ actor │ │ actor │ │ actor │ │ actor │ │ actor │ │ … │ └────────────────────┘
You can also use it in select
/mutate
calls to expand the table accordingly:
- DuckDB
- BigQuery
ddb_ents.mutate(primary_profession=_.primary_profession.unnest()).order_by("nconst")
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃ ┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ string │ string │ string │ array │ ├───────────┼─────────────────┼────────────────────┼────────────────────────────────────┤ │ nm0000001 │ Fred Astaire │ soundtrack │ ['tt0053137', 'tt0072308', ... +2] │ │ nm0000001 │ Fred Astaire │ miscellaneous │ ['tt0053137', 'tt0072308', ... +2] │ │ nm0000001 │ Fred Astaire │ actor │ ['tt0053137', 'tt0072308', ... +2] │ │ nm0000002 │ Lauren Bacall │ soundtrack │ ['tt0037382', 'tt0117057', ... +2] │ │ nm0000002 │ Lauren Bacall │ actress │ ['tt0037382', 'tt0117057', ... +2] │ │ nm0000003 │ Brigitte Bardot │ music_department │ ['tt0057345', 'tt0054452', ... +2] │ │ nm0000003 │ Brigitte Bardot │ soundtrack │ ['tt0057345', 'tt0054452', ... +2] │ │ nm0000003 │ Brigitte Bardot │ actress │ ['tt0057345', 'tt0054452', ... +2] │ │ nm0000004 │ John Belushi │ soundtrack │ ['tt0072562', 'tt0078723', ... +2] │ │ nm0000004 │ John Belushi │ writer │ ['tt0072562', 'tt0078723', ... +2] │ │ … │ … │ … │ … │ └───────────┴─────────────────┴────────────────────┴────────────────────────────────────┘
bq_ents.mutate(primary_profession=_.primary_profession.unnest()).order_by("nconst")
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃ ┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ string │ string │ string │ array │ ├───────────┼─────────────────┼────────────────────┼────────────────────────────────────┤ │ nm0000001 │ Fred Astaire │ miscellaneous │ ['tt0072308', 'tt0053137', ... +2] │ │ nm0000001 │ Fred Astaire │ actor │ ['tt0072308', 'tt0053137', ... +2] │ │ nm0000001 │ Fred Astaire │ soundtrack │ ['tt0072308', 'tt0053137', ... +2] │ │ nm0000002 │ Lauren Bacall │ actress │ ['tt0038355', 'tt0075213', ... +2] │ │ nm0000002 │ Lauren Bacall │ soundtrack │ ['tt0038355', 'tt0075213', ... +2] │ │ nm0000003 │ Brigitte Bardot │ music_department │ ['tt0049189', 'tt0054452', ... +2] │ │ nm0000003 │ Brigitte Bardot │ soundtrack │ ['tt0049189', 'tt0054452', ... +2] │ │ nm0000003 │ Brigitte Bardot │ actress │ ['tt0049189', 'tt0054452', ... +2] │ │ nm0000004 │ John Belushi │ soundtrack │ ['tt0072562', 'tt0078723', ... +2] │ │ nm0000004 │ John Belushi │ actor │ ['tt0072562', 'tt0078723', ... +2] │ │ … │ … │ … │ … │ └───────────┴─────────────────┴────────────────────┴────────────────────────────────────┘
Unnesting can be useful when joining nested data.
Here we use unnest to find people known for any of the godfather movies:
- DuckDB
- BigQuery
basics = ddb.tables.title_basics.filter(
[
_.title_type == "movie",
_.original_title.lower().startswith("the godfather"),
_.genres.lower().contains("crime"),
]
)
ddb_known_for_the_godfather = (
ddb_ents.mutate(tconst=_.known_for_titles.unnest())
.join(basics, "tconst")
.select("primary_title", "primary_name")
.distinct()
.order_by(["primary_title", "primary_name"])
)
ddb_known_for_the_godfather
1
Filter the title_basics
data set to only the Godfather movies
2
Unnest the known_for_titles
array column
3
Join with basics
to get movie titles
4
Ensure that each entity is only listed once and sort the results
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓ ┃ primary_title ┃ primary_name ┃ ┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩ │ string │ string │ ├───────────────┼─────────────────────┤ │ The Godfather │ A. Emmett Adams │ │ The Godfather │ Abe Vigoda │ │ The Godfather │ Al Lettieri │ │ The Godfather │ Al Martino │ │ The Godfather │ Al Pacino │ │ The Godfather │ Albert S. Ruddy │ │ The Godfather │ Alex Rocco │ │ The Godfather │ Andrea Eastman │ │ The Godfather │ Angelo Infanti │ │ The Godfather │ Anna Hill Johnstone │ │ … │ … │ └───────────────┴─────────────────────┘
basics = bq.tables.title_basics.filter(
[
_.title_type == "movie",
_.original_title.lower().startswith("the godfather"),
_.genres.lower().contains("crime"),
]
)
bq_known_for_the_godfather = (
bq_ents.mutate(tconst=_.known_for_titles.unnest())
.join(basics, "tconst")
.select("primary_title", "primary_name")
.distinct()
.order_by(["primary_title", "primary_name"])
)
bq_known_for_the_godfather
1
Filter the title_basics
data set to only the Godfather movies
2
Unnest the known_for_titles
array column
3
Join with basics
to get movie titles
4
Ensure that each entity is only listed once and sort the results
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓ ┃ primary_title ┃ primary_name ┃ ┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩ │ string │ string │ ├───────────────┼─────────────────────┤ │ The Godfather │ A. Emmett Adams │ │ The Godfather │ Abe Vigoda │ │ The Godfather │ Al Lettieri │ │ The Godfather │ Al Martino │ │ The Godfather │ Al Pacino │ │ The Godfather │ Albert S. Ruddy │ │ The Godfather │ Alex Rocco │ │ The Godfather │ Andrea Eastman │ │ The Godfather │ Angelo Infanti │ │ The Godfather │ Anna Hill Johnstone │ │ … │ … │ └───────────────┴─────────────────────┘
Let’s summarize by showing how many people are known for each Godfather movie:
- DuckDB
- BigQuery
ddb_known_for_the_godfather.primary_title.value_counts()
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓ ┃ primary_title ┃ primary_title_count ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩ │ string │ int64 │ ├────────────────────────┼─────────────────────┤ │ The Godfather Part II │ 117 │ │ The Godfather │ 93 │ │ The Godfather Part III │ 196 │ └────────────────────────┴─────────────────────┘
bq_known_for_the_godfather.primary_title.value_counts()
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓ ┃ primary_title ┃ primary_title_count ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩ │ string │ int64 │ ├────────────────────────┼─────────────────────┤ │ The Godfather Part II │ 114 │ │ The Godfather Part III │ 202 │ │ The Godfather │ 97 │ └────────────────────────┴─────────────────────┘
Filtering array elements
Filtering array elements can be done with the filter method, which applies a predicate to each array element and returns an array of elements for which the predicate returns True
.
This method is similar to Python’s filter function.
Let’s show all people who are neither editors nor actors:
- DuckDB
- BigQuery
ddb_ents.mutate(
primary_profession=_.primary_profession.filter(
lambda pp: ~pp.isin(("actor", "actress", "editor"))
)
).filter(_.primary_profession.length() > 0).order_by("nconst")
1
This filter
call is applied to each array element
2
This filter
call is applied to the table
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃ ┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ string │ string │ array │ array │ ├───────────┼─────────────────┼────────────────────────────────────┼────────────────────────────────────┤ │ nm0000001 │ Fred Astaire │ ['soundtrack', 'miscellaneous'] │ ['tt0053137', 'tt0072308', ... +2] │ │ nm0000002 │ Lauren Bacall │ ['soundtrack'] │ ['tt0037382', 'tt0117057', ... +2] │ │ nm0000003 │ Brigitte Bardot │ ['soundtrack', 'music_department'] │ ['tt0057345', 'tt0054452', ... +2] │ │ nm0000004 │ John Belushi │ ['soundtrack', 'writer'] │ ['tt0072562', 'tt0078723', ... +2] │ │ nm0000005 │ Ingmar Bergman │ ['writer', 'director'] │ ['tt0083922', 'tt0069467', ... +2] │ │ nm0000006 │ Ingrid Bergman │ ['soundtrack', 'producer'] │ ['tt0038109', 'tt0036855', ... +2] │ │ nm0000007 │ Humphrey Bogart │ ['soundtrack', 'producer'] │ ['tt0037382', 'tt0034583', ... +2] │ │ nm0000008 │ Marlon Brando │ ['soundtrack', 'director'] │ ['tt0068646', 'tt0070849', ... +2] │ │ nm0000009 │ Richard Burton │ ['soundtrack', 'producer'] │ ['tt0057877', 'tt0059749', ... +2] │ │ nm0000010 │ James Cagney │ ['soundtrack', 'director'] │ ['tt0042041', 'tt0035575', ... +2] │ │ … │ … │ … │ … │ └───────────┴─────────────────┴────────────────────────────────────┴────────────────────────────────────┘
bq_ents.mutate(
primary_profession=_.primary_profession.filter(
lambda pp: ~pp.isin(("actor", "actress", "editor"))
)
).filter(_.primary_profession.length() > 0).order_by("nconst")
1
This filter
call is applied to each array element
2
This filter
call is applied to the table
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃ ┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ string │ string │ array │ array │ ├───────────┼─────────────────┼────────────────────────────────────┼────────────────────────────────────┤ │ nm0000001 │ Fred Astaire │ ['soundtrack', 'miscellaneous'] │ ['tt0072308', 'tt0053137', ... +2] │ │ nm0000002 │ Lauren Bacall │ ['soundtrack'] │ ['tt0038355', 'tt0075213', ... +2] │ │ nm0000003 │ Brigitte Bardot │ ['soundtrack', 'music_department'] │ ['tt0049189', 'tt0054452', ... +2] │ │ nm0000004 │ John Belushi │ ['soundtrack', 'writer'] │ ['tt0072562', 'tt0078723', ... +2] │ │ nm0000005 │ Ingmar Bergman │ ['writer', 'director'] │ ['tt0050976', 'tt0083922', ... +2] │ │ nm0000006 │ Ingrid Bergman │ ['soundtrack', 'producer'] │ ['tt0034583', 'tt0038787', ... +2] │ │ nm0000007 │ Humphrey Bogart │ ['soundtrack', 'producer'] │ ['tt0037382', 'tt0043265', ... +2] │ │ nm0000008 │ Marlon Brando │ ['soundtrack', 'director'] │ ['tt0068646', 'tt0070849', ... +2] │ │ nm0000009 │ Richard Burton │ ['soundtrack', 'producer'] │ ['tt0061184', 'tt0087803', ... +2] │ │ nm0000010 │ James Cagney │ ['soundtrack', 'director'] │ ['tt0042041', 'tt0029870', ... +2] │ │ … │ … │ … │ … │ └───────────┴─────────────────┴────────────────────────────────────┴────────────────────────────────────┘
Applying a function to array elements
You can apply a function to run an ibis expression on each element of an array using the map method.
Let’s normalize the case of primary_profession to upper case:
- DuckDB
- BigQuery
ddb_ents.mutate(
primary_profession=_.primary_profession.map(lambda pp: pp.upper())
).filter(_.primary_profession.length() > 0).order_by("nconst")
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃ ┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ string │ string │ array │ array │ ├───────────┼─────────────────┼───────────────────────────────────┼────────────────────────────────────┤ │ nm0000001 │ Fred Astaire │ ['SOUNDTRACK', 'ACTOR', ... +1] │ ['tt0053137', 'tt0072308', ... +2] │ │ nm0000002 │ Lauren Bacall │ ['ACTRESS', 'SOUNDTRACK'] │ ['tt0037382', 'tt0117057', ... +2] │ │ nm0000003 │ Brigitte Bardot │ ['ACTRESS', 'SOUNDTRACK', ... +1] │ ['tt0057345', 'tt0054452', ... +2] │ │ nm0000004 │ John Belushi │ ['ACTOR', 'SOUNDTRACK', ... +1] │ ['tt0072562', 'tt0078723', ... +2] │ │ nm0000005 │ Ingmar Bergman │ ['WRITER', 'DIRECTOR', ... +1] │ ['tt0083922', 'tt0069467', ... +2] │ │ nm0000006 │ Ingrid Bergman │ ['ACTRESS', 'SOUNDTRACK', ... +1] │ ['tt0038109', 'tt0036855', ... +2] │ │ nm0000007 │ Humphrey Bogart │ ['ACTOR', 'SOUNDTRACK', ... +1] │ ['tt0037382', 'tt0034583', ... +2] │ │ nm0000008 │ Marlon Brando │ ['ACTOR', 'SOUNDTRACK', ... +1] │ ['tt0068646', 'tt0070849', ... +2] │ │ nm0000009 │ Richard Burton │ ['ACTOR', 'SOUNDTRACK', ... +1] │ ['tt0057877', 'tt0059749', ... +2] │ │ nm0000010 │ James Cagney │ ['ACTOR', 'SOUNDTRACK', ... +1] │ ['tt0042041', 'tt0035575', ... +2] │ │ … │ … │ … │ … │ └───────────┴─────────────────┴───────────────────────────────────┴────────────────────────────────────┘
bq_ents.mutate(
primary_profession=_.primary_profession.map(lambda pp: pp.upper())
).filter(_.primary_profession.length() > 0).order_by("nconst")
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃ ┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ string │ string │ array │ array │ ├───────────┼─────────────────┼───────────────────────────────────┼────────────────────────────────────┤ │ nm0000001 │ Fred Astaire │ ['SOUNDTRACK', 'ACTOR', ... +1] │ ['tt0072308', 'tt0053137', ... +2] │ │ nm0000002 │ Lauren Bacall │ ['ACTRESS', 'SOUNDTRACK'] │ ['tt0038355', 'tt0075213', ... +2] │ │ nm0000003 │ Brigitte Bardot │ ['ACTRESS', 'SOUNDTRACK', ... +1] │ ['tt0049189', 'tt0054452', ... +2] │ │ nm0000004 │ John Belushi │ ['ACTOR', 'SOUNDTRACK', ... +1] │ ['tt0072562', 'tt0078723', ... +2] │ │ nm0000005 │ Ingmar Bergman │ ['WRITER', 'DIRECTOR', ... +1] │ ['tt0050976', 'tt0083922', ... +2] │ │ nm0000006 │ Ingrid Bergman │ ['ACTRESS', 'SOUNDTRACK', ... +1] │ ['tt0034583', 'tt0038787', ... +2] │ │ nm0000007 │ Humphrey Bogart │ ['ACTOR', 'SOUNDTRACK', ... +1] │ ['tt0037382', 'tt0043265', ... +2] │ │ nm0000008 │ Marlon Brando │ ['ACTOR', 'SOUNDTRACK', ... +1] │ ['tt0068646', 'tt0070849', ... +2] │ │ nm0000009 │ Richard Burton │ ['ACTOR', 'SOUNDTRACK', ... +1] │ ['tt0061184', 'tt0087803', ... +2] │ │ nm0000010 │ James Cagney │ ['ACTOR', 'SOUNDTRACK', ... +1] │ ['tt0042041', 'tt0029870', ... +2] │ │ … │ … │ … │ … │ └───────────┴─────────────────┴───────────────────────────────────┴────────────────────────────────────┘
Conclusion
Ibis has a sizable collection of array APIs that work with many different backends and as of version 7.0.0, Ibis supports a much larger set of those APIs for BigQuery!
Check out the API documentation for the full set of available methods.
Try it out, and let us know what you think.
Back to top