ReadStatTables.jl
Read and write Stata, SAS and SPSS data files with Julia tables
ReadStatTables.jl is a Julia package for reading and writing Stata, SAS and SPSS data files with Tables.jl-compatible tables. It utilizes the ReadStat C library developed by Evan Miller for parsing and writing the data files. The same C library is also the backend of popular packages in other languages such as pyreadstat for Python and haven for R. As the Julia counterpart for similar purposes, ReadStatTables.jl leverages the state-of-the-art Julia ecosystem for usability and performance. Its read performance, especially when taking advantage of multiple threads, surpasses all related packages by a sizable margin based on the project's benchmark results.
Features
ReadStatTables.jl provides the following features in addition to wrapping the C interface of ReadStat:
- Fast multi-threaded data collection from ReadStat parsers to a Tables.jl-compatible ReadStatTable
- Interface of file-level and variable-level metadata compatible with DataAPI.jl
- Integration of value labels into data columns via a custom array type LabeledArray
- Translation of date and time values into Julia time types Date and DateTime
- Write support for Tables.jl-compatible tables (experimental)
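As a rough illustration of the experimental write support, the sketch below writes a small column table to a temporary .dta file and reads it back. The column names, values and file path are made up for this example, and it assumes that writestat accepts a plain NamedTuple of vectors like any other Tables.jl-compatible table.

```julia
using ReadStatTables

# A NamedTuple of vectors is a Tables.jl-compatible column table.
# The column names and values are invented for illustration.
tb_in = (id = Int32[1, 2, 3], score = [1.5, 2.5, 3.5])

# Hypothetical output location inside a temporary directory.
path = joinpath(mktempdir(), "example.dta")

# The file format is taken from the .dta extension.
writestat(path, tb_in)

# Read the file back as a ReadStatTable.
tb_out = readstat(path)
```

Because write support is marked experimental, it is worth comparing the table read back against the input before relying on the written file.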
Supported File Formats
ReadStatTables.jl recognizes data files with the following file extensions at this moment:
- Stata: .dta
- SAS: .sas7bdat and .xpt
- SPSS: .sav and .por
Installation
ReadStatTables.jl can be installed with the Julia package manager Pkg. From the Julia REPL, type ] to enter the Pkg REPL and run:
pkg> add ReadStatTables
Quick Start
To read a data file located at data/sample.dta:
julia> using ReadStatTables
julia> tb = readstat("data/sample.dta")
5×7 ReadStatTable:
 Row │  mychar    mynum      mydate                dtime         mylabl           myord               mytime
     │ String3  Float64       Date?            DateTime?  Labeled{Int8}  Labeled{Int8?}             DateTime
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │       a      1.1  2018-05-06  2018-05-06T10:10:10           Male             low  1960-01-01T10:10:10
   2 │       b      1.2  1880-05-06  1880-05-06T10:10:10         Female          medium  1960-01-01T23:10:10
   3 │       c  -1000.3  1960-01-01  1960-01-01T00:00:00           Male            high  1960-01-01T00:00:00
   4 │       d     -1.4  1583-01-01  1583-01-01T00:00:00         Female             low  1960-01-01T16:10:10
   5 │       e   1000.3     missing              missing           Male         missing  2000-01-01T00:00:00
To access a column from the above table:
julia> tb.myord
5-element LabeledVector{Union{Missing, Int8}, Vector{Union{Missing, Int8}}, Union{Char, Int32}}:
 1 => low
 2 => medium
 3 => high
 1 => low
 missing => missing
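The two halves of a labeled column can also be pulled apart. The sketch below builds a small LabeledArray by hand instead of reading a file, so the values and labels are invented for illustration. It assumes the constructor takes an array of values and a Dict mapping values to label strings, and that refarray and valuelabels are available after loading the package.

```julia
using ReadStatTables

# Invented example data: integer codes and the labels attached to them.
labels = Dict{Int,String}(1 => "low", 2 => "medium", 3 => "high")
x = LabeledArray([1, 2, 3, 1], labels)

# The underlying values, without labels.
codes = refarray(x)

# The label of each element, collected into a vector of strings.
label_text = collect(valuelabels(x))
```

Keeping the codes and the labels together in one array means that nothing is lost when a labeled variable is read, which is the behavior the column display above reflects.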
Notice that for data variables with value labels, both the original values and the value labels are preserved. For variables representing date/time, the translation to Julia Date/DateTime is lazy. One can access the underlying numerical values as follows:
julia> tb.mydate.data
5-element SentinelArrays.SentinelVector{Float64, Float64, Missing, Vector{Float64}}:
      21310.0
     -29093.0
          0.0
    -137696.0
       missing
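The raw values printed above are consistent with Stata's convention of counting days from 1 January 1960. A minimal sketch of that arithmetic, using only the Dates standard library, is given below. The names STATA_EPOCH and to_date are hypothetical, and the code illustrates the conversion rather than the package's own implementation.

```julia
using Dates

# Stata daily dates count days elapsed since 1960-01-01.
const STATA_EPOCH = Date(1960, 1, 1)

# Hypothetical helper: convert one raw value, passing missing through.
to_date(x) = ismissing(x) ? missing : STATA_EPOCH + Day(Int(x))

# The raw values shown for the mydate column.
raw = [21310.0, -29093.0, 0.0, -137696.0, missing]

dates = to_date.(raw)
```

The resulting dates match the mydate column in the table display: 2018-05-06, 1880-05-06, 1960-01-01, 1583-01-01 and missing.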
File-level and variable-level metadata can be retrieved and modified via methods compatible with DataAPI.jl:
julia> metadata(tb)
ReadStatMeta:
  row count           => 5
  var count           => 7
  modified time       => 2021-04-23T04:36:00
  file format version => 118
  file label          => A test file
  file extension      => .dta
julia> colmetadata(tb, :mylabl)
ReadStatColMeta:
  label         => labeled
  format        => %16.0f
  type          => READSTAT_TYPE_INT8
  value label   => mylabl
  storage width => 1
  display width => 16
  measure       => READSTAT_MEASURE_UNKNOWN
  alignment     => READSTAT_ALIGNMENT_RIGHT
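Individual metadata fields can be read and changed by key in the DataAPI.jl style. The sketch below is only an illustration: it writes a throwaway file so that it does not depend on data/sample.dta, and it assumes the string keys "file_ext", "file_label" and "label" correspond to the fields displayed above. The table contents and label text are invented.

```julia
using ReadStatTables

# Create a small throwaway file to work with (contents are invented).
path = joinpath(mktempdir(), "example.dta")
writestat(path, (id = Int32[1, 2, 3], score = [1.5, 2.5, 3.5]))
tb = readstat(path)

# Read a single file-level field by key.
ext = metadata(tb, "file_ext")

# Modify a file-level field and a variable-level field in place.
metadata!(tb, "file_label", "Example file")
colmetadata!(tb, :score, "label", "Test score")

# Read the modified values back.
file_label = metadata(tb, "file_label")
score_label = colmetadata(tb, :score, "label")
```

These changes affect the table in memory only; writing the table out again would be needed for them to reach a file.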
For more details, please see the documentation.