Austin R. Benson datasets

Data!

This is a collection of datasets from my research projects. I strive to make the data used in my research easily accessible. If you encounter problems, please email me at arb@cs.cornell.edu.

Temporal higher-order networks (hypergraphs)
Hypergraphs with labeled nodes

Each of these datasets is a hypergraph where the nodes are labeled into discrete classes. These can be used for community detection or node prediction experiments. We used them in the following papers:

Dataset pages:

US county networks for node regression

These are networks of US counties, where edges come from physical adjacency or Facebook connectedness. The nodes are accompanied by various covariates, such as demographic features, climate measurements, and election statistics, depending on the dataset. We used these for transductive node regression experiments. Some of the datasets have demographic features and election statistics from both 2012 and 2016, and we used these for inductive learning experiments. The data was used in the following papers:

Dataset pages:

Hypergraphs with categorical edge labels

Each of these datasets is a hypergraph (or an ordinary graph) in which every edge carries a discrete, categorical label. These datasets were collected for and analyzed in the following papers:

Dataset pages:

Largeish weighted graphs

Each of these datasets is an undirected weighted graph of nontrivial size. These datasets were used in the following paper:

Dataset pages:

Graphs and hypergraphs with core-fringe structure

Each of these datasets is a graph or hypergraph, where the nodes are labeled as "core" or "fringe" according to the data collection process. Specifically, all of the graphs measured communication involving a set of nodes, and this set of nodes serves as the core. This induces what we call "core-fringe" structure in the network. We studied how well one can recover the core-labeled nodes from the network structure. In this setup, the core nodes form a "planted vertex cover" in the graph case and a "planted hitting set" in the hypergraph case. We studied these datasets in the following papers:

Dataset pages:
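
The "planted hitting set" property described above can be checked directly: every edge or hyperedge must contain at least one core node. The following is a minimal sketch under stated assumptions. The function name `is_hitting_set` and the in-memory representation (hyperedges as collections of node ids) are hypothetical and are not the file format of the datasets.

```python
def is_hitting_set(core, hyperedges):
    """Return True if every hyperedge contains at least one core node.

    For ordinary graphs (all hyperedges of size 2) this is exactly the
    vertex cover condition.
    """
    core = set(core)
    return all(bool(core & set(edge)) for edge in hyperedges)


# Toy example with hypothetical node ids: nodes 1 and 2 are the core.
core = {1, 2}
graph_edges = [(1, 3), (2, 4), (1, 2)]
hypergraph_edges = [{1, 3, 5}, {2, 4}]

graph_ok = is_hitting_set(core, graph_edges)
hypergraph_ok = is_hitting_set(core, hypergraph_edges)
# An edge between two fringe nodes breaks the property.
broken = is_hitting_set(core, graph_edges + [(3, 4)])
```

Recovering the core from structure alone is the harder problem studied in the papers. This check only verifies a candidate core.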

Stack exchange co-tagging networks

These are 168 weighted co-tagging networks from Stack Exchange communities. In each network, the weight of edge (i, j) is the number of questions that were annotated with both tag i and tag j. The data was analyzed in the following paper:

Dataset page:
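
The edge-weight definition above can be sketched in a few lines. This assumes each question is available as a list of tag strings. That representation and the name `cotag_weights` are hypothetical; the sketch illustrates the definition and is not the pipeline that produced the data.

```python
from collections import Counter
from itertools import combinations


def cotag_weights(questions):
    """Count, for each unordered tag pair, the questions carrying both tags."""
    weights = Counter()
    for tags in questions:
        # Deduplicate and sort so (i, j) and (j, i) map to the same key.
        for i, j in combinations(sorted(set(tags)), 2):
            weights[(i, j)] += 1
    return weights


# Toy questions with made-up tags.
questions = [
    ["python", "numpy"],
    ["python", "numpy", "pandas"],
    ["pandas", "python"],
]
weights = cotag_weights(questions)
```

Here the pair ("numpy", "python") gets weight 2 because two of the three toy questions carry both tags.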

Temporal networks

These are temporal networks where (i, j, t) signifies a directed edge from i to j at time t. The networks were used in the following paper:

Dataset pages:
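
A temporal network of (i, j, t) triples can be held as a time-sorted list of edges. The sketch below assumes one whitespace-separated `i j t` triple per line with integer fields. That layout is an assumption, so check the individual dataset pages for the actual format. `TemporalEdge` and `parse_temporal_edges` are hypothetical names.

```python
from typing import NamedTuple


class TemporalEdge(NamedTuple):
    src: int   # node i, the source of the directed edge
    dst: int   # node j, the target of the directed edge
    time: int  # timestamp t


def parse_temporal_edges(lines):
    """Parse 'i j t' lines into TemporalEdge records sorted by time."""
    edges = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        i, j, t = line.split()
        edges.append(TemporalEdge(int(i), int(j), int(t)))
    edges.sort(key=lambda e: e.time)
    return edges


# Toy input with made-up node ids and timestamps.
sample = ["1 2 30", "2 3 10", "", "1 3 20"]
edges = parse_temporal_edges(sample)
```

Keeping `src` and `dst` as separate fields preserves edge direction, which matters because (i, j, t) and (j, i, t) are different events.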

Spatial networks
Sequences of sets
Discrete subset choices

These datasets are from people making choices from a discrete set of alternatives. In datasets with "universal choice sets," the set of alternatives is the same for every choice that is made. In datasets with "variable choice sets," the set of alternatives changes with each subset selection. These datasets were used in the following paper:

Universal choice dataset pages:

Variable choice dataset pages:
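
The distinction between universal and variable choice sets can be made concrete with a small record type. This is a sketch under stated assumptions: the `Choice` class and `has_universal_choice_set` helper are hypothetical and do not reflect the format of the dataset files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Choice:
    alternatives: frozenset  # the items offered for this choice
    selected: frozenset      # the subset that was chosen

    def __post_init__(self):
        # A selection must be drawn from the offered alternatives.
        if not self.selected <= self.alternatives:
            raise ValueError("selected items must be among the alternatives")


def has_universal_choice_set(choices):
    """True if every choice was made from the same set of alternatives."""
    return len({c.alternatives for c in choices}) <= 1


menu = frozenset({"a", "b", "c"})
universal = [
    Choice(menu, frozenset({"a"})),
    Choice(menu, frozenset({"b", "c"})),
]
variable = [
    Choice(frozenset({"a", "b"}), frozenset({"a"})),
    Choice(frozenset({"b", "c", "d"}), frozenset({"c", "d"})),
]
```

Storing the alternatives with every record handles both cases uniformly. For a universal choice dataset the field is simply identical across records.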

Genius.com data

This is a curated dataset of users, songs, and lyrical annotations from the website Genius.com. The dataset was used in the following paper:

Dataset page:

Manhattan taxi cab trajectories

This dataset contains 1,000 sequences of Manhattan neighborhoods visited by taxi cabs over a one-year period. The dataset was used in the following paper:

Dataset page:

Flow cytometry

This flow cytometry dataset represents abundances of fluorescent molecules labeling antibodies that bind to specific targets on the surface of blood cells. The dataset was used in the following paper:

Dataset page: