GaussDB-AISQL: a composable cloud-native SQL system with AI capabilities
