Add documentation section on Scaling · Issue #28315 · pandas-dev/pandas
From the user survey, the most critical feature request was "Improve scaling to larger datasets".
While we continue to do work within pandas itself to improve scaling (fewer copies, a native string dtype, etc.), we can also document a few strategies that are available today and may help with scaling.
- Use efficient dtypes and make sure you don't have `object` dtypes. Possibly use `Categorical` for strings, if they have low cardinality. Possibly use lower-precision numeric dtypes. (A sketch follows this list.)
- Avoid unnecessary work. When loading data, select only the columns you need — `usecols=` for CSV, `columns=` for parquet. Probably some other examples worth collecting here too. (Sketch below.)
- Use out-of-core methods, like `pd.read_csv(..., chunksize=)`, to process a file in pieces rather than reading it all into memory. (Sketch below.)
- Use other libraries. I would of course recommend Dask, but I'm not opposed to a section highlighting Vaex, and possibly Spark (though installing it in our doc environment may be difficult). (Sketch below.)
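For the dtypes item, something like this could go in the docs. A minimal sketch — the DataFrame, column names, and sizes are invented for illustration:

```python
import numpy as np
import pandas as pd

# Invented example data: a low-cardinality string column
# and an integer column with a small value range.
df = pd.DataFrame(
    {
        "state": np.random.choice(["CA", "NY", "TX"], size=1_000_000),
        "count": np.random.randint(0, 100, size=1_000_000),
    }
)
print(df.memory_usage(deep=True))  # object dtype makes "state" expensive

# Low-cardinality strings compress well as Categorical.
df["state"] = df["state"].astype("category")
# Values in [0, 100) fit in a much smaller unsigned integer type.
df["count"] = pd.to_numeric(df["count"], downcast="unsigned")
print(df.memory_usage(deep=True))  # substantially smaller
```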
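For column selection at load time, note the keyword differs between readers: `read_csv` takes `usecols=` while `read_parquet` takes `columns=`. The file names here are placeholders:

```python
import pandas as pd

cols = ["id", "value"]

# Only the requested columns are parsed / read from disk.
df_csv = pd.read_csv("data.csv", usecols=cols)
df_parquet = pd.read_parquet("data.parquet", columns=cols)
```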
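For the out-of-core item, a sketch of chunked processing, assuming a hypothetical `data.csv` with a numeric `"value"` column:

```python
import pandas as pd

# Iterate over the file in 100,000-row chunks instead of loading it whole;
# each chunk is an ordinary DataFrame.
total = 0
for chunk in pd.read_csv("data.csv", chunksize=100_000):
    total += chunk["value"].sum()
print(total)
```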
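And for the Dask section, even a tiny example showing the familiar API might be enough. Again, the file glob and column names are placeholders:

```python
import dask.dataframe as dd

# Build a lazy, partitioned DataFrame over many files; nothing is read yet.
ddf = dd.read_csv("data-*.csv")

# Same groupby API as pandas; .compute() runs the work in parallel
# and returns an ordinary pandas object.
result = ddf.groupby("state")["value"].mean().compute()
print(result)
```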
Do people have thoughts on this? Any objections to highlighting outside projects like Dask?
Are there other strategies we should mention?