GitHub - junyuan-chen/ReadStatTables.jl: Read and write Stata, SAS and SPSS data files with Julia tables

ReadStatTables.jl

Read and write Stata, SAS and SPSS data files with Julia tables

ReadStatTables.jl is a Julia package for reading and writing Stata, SAS and SPSS data files with Tables.jl-compatible tables. It uses the ReadStat C library developed by Evan Miller for parsing and writing the data files. The same C library is also the backend of popular packages in other languages, such as pyreadstat for Python and haven for R. As the Julia counterpart for similar purposes, ReadStatTables.jl leverages the state-of-the-art Julia ecosystem for usability and performance. Its read performance, especially when taking advantage of multiple threads, surpasses that of all related packages by a sizable margin according to the project's benchmark results.

Features

ReadStatTables.jl provides the following features in addition to wrapping the C interface of ReadStat:

Supported File Formats

ReadStatTables.jl recognizes data files with the following file extensions at this moment:

Installation

ReadStatTables.jl can be installed with the Julia package manager Pkg. From the Julia REPL, type ] to enter the Pkg REPL and run:

pkg> add ReadStatTables
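Outside the Pkg REPL, for example in a setup script, the functional Pkg API can perform the same installation. This is a minimal sketch that assumes network access to the General registry:

```julia
using Pkg

# Install the registered package into the active environment
Pkg.add("ReadStatTables")

# Load it to confirm that the installation succeeded
using ReadStatTables
```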

Quick Start

To read a data file located at data/sample.dta:

julia> using ReadStatTables

julia> tb = readstat("data/sample.dta")
5×7 ReadStatTable:
 Row │  mychar    mynum      mydate                dtime         mylabl           myord               mytime
     │ String3  Float64       Date?            DateTime?  Labeled{Int8}  Labeled{Int8?}             DateTime
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │       a      1.1  2018-05-06  2018-05-06T10:10:10           Male             low  1960-01-01T10:10:10
   2 │       b      1.2  1880-05-06  1880-05-06T10:10:10         Female          medium  1960-01-01T23:10:10
   3 │       c  -1000.3  1960-01-01  1960-01-01T00:00:00           Male            high  1960-01-01T00:00:00
   4 │       d     -1.4  1583-01-01  1583-01-01T00:00:00         Female             low  1960-01-01T16:10:10
   5 │       e   1000.3     missing              missing           Male         missing  2000-01-01T00:00:00
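For larger files it can help to read only part of the data and to hand the result to other table types. The sketch below assumes that `readstat` accepts the keyword arguments `usecols` and `row_limit` (worth confirming with `?readstat` for the installed version) and that DataFrames.jl is installed separately. The `DataFrame` constructor should accept the result because it takes any Tables.jl-compatible table.

```julia
using ReadStatTables
using DataFrames  # separate package, assumed to be installed

# Read two columns and the first three rows only.
# `usecols` and `row_limit` are assumed keyword names; check `?readstat`.
tb_small = readstat("data/sample.dta"; usecols=[:mychar, :mynum], row_limit=3)

# Convert the Tables.jl-compatible result to a DataFrame
df = DataFrame(tb_small)
```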

To access a column from the above table:

julia> tb.myord
5-element LabeledVector{Union{Missing, Int8}, Vector{Union{Missing, Int8}}, Union{Char, Int32}}:
 1 => low
 2 => medium
 3 => high
 1 => low
 missing => missing
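The display above pairs each stored value with its label. To work with the two parts separately, something along the following lines may be useful. It assumes that `refarray`, `valuelabels` and `getvaluelabels` are available after loading the package, which should be checked against the documentation for the installed version:

```julia
using ReadStatTables

tb = readstat("data/sample.dta")
col = tb.myord

# Underlying stored values without their labels
vals = refarray(col)

# Labels in row order, one per element of the column
lbls = valuelabels(col)

# Lookup table from stored values to labels
dict = getvaluelabels(col)
```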

Notice that for data variables with value labels, both the original values and the value labels are preserved. For variables representing date/time, the translation to Julia Date/DateTime is lazy. One can access the underlying numerical values as follows:

julia> tb.mydate.data
5-element SentinelArrays.SentinelVector{Float64, Float64, Missing, Vector{Float64}}:
   21310.0
  -29093.0
       0.0
 -137696.0
    missing
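These numbers are day counts relative to 1960-01-01, the epoch Stata uses for daily dates. The sketch below reproduces the dates in the `mydate` column by hand using only the Dates standard library. The helper `stata_days_to_date` is hypothetical and written for illustration; it is not part of ReadStatTables.jl, whose internal conversion may differ.

```julia
using Dates

# Stata stores daily dates as the number of days since 1960-01-01
const STATA_EPOCH = Date(1960, 1, 1)

# Hypothetical helper for illustration; not a ReadStatTables.jl function
stata_days_to_date(x::Real) = STATA_EPOCH + Day(round(Int, x))
stata_days_to_date(::Missing) = missing

# The raw values shown in tb.mydate.data
raw = [21310.0, -29093.0, 0.0, -137696.0, missing]

# Yields 2018-05-06, 1880-05-06, 1960-01-01, 1583-01-01 and missing
dates = stata_days_to_date.(raw)
```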

File-level and variable-level metadata can be retrieved and modified via methods compatible with DataAPI.jl:

julia> metadata(tb)
ReadStatMeta:
  row count           => 5
  var count           => 7
  modified time       => 2021-04-23T04:36:00
  file format version => 118
  file label          => A test file
  file extension      => .dta

julia> colmetadata(tb, :mylabl)
ReadStatColMeta:
  label         => labeled
  format        => %16.0f
  type          => READSTAT_TYPE_INT8
  value label   => mylabl
  storage width => 1
  display width => 16
  measure       => READSTAT_MEASURE_UNKNOWN
  alignment     => READSTAT_ALIGNMENT_RIGHT
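As a sketch of the modification side, the DataAPI.jl-style setters `metadata!` and `colmetadata!` can change individual entries, and `writestat` can save the result to a new file. The key names `"file_label"` and `"label"` are assumptions inferred from the fields displayed above, and the output path is hypothetical, so both should be verified against the documentation:

```julia
using ReadStatTables

tb = readstat("data/sample.dta")

# Look up single metadata entries by key (key names are assumed)
file_label = metadata(tb, "file_label")
var_label = colmetadata(tb, :mylabl, "label")

# Modify a file-level and a variable-level entry in place
metadata!(tb, "file_label", "An updated file label")
colmetadata!(tb, :mylabl, "label", "Gender")

# Write the modified table to a new Stata file (hypothetical path)
writestat("data/sample_copy.dta", tb)
```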

For more details, please see the documentation.