Network Construction and Analysis Project (original) (raw)


Network Construction and Analysis Project
Networked Life
Spring 2005
Prof. Michael Kearns


PROJECT DESCRIPTION

In the early portion of Networked Life,one of your main tasks will be to construct a network based on a source of real-world data of your own choosing, to measure and analyze various properties of your network, and to compare these properties with those arising in other networks and models we will examine in the course. The hope is that this will be a creative and fun project that will help illustrate, ground and possibly challenge some of the abstract concepts about networks we will be studying.

The project will be broken over time into a number of manageable assignments. More details will be provided below as they develop.


DESCRIPTION AND TIMELINE OF ASSIGNMENTS

Task 1: Identification of data source and definition of network. Due as a hardcopy write-up at the start of class, Thursday, Jan 27.

In this first assignment, you should identify a specific source of real-world data, the precise definition of the network (vertices and edges) you plan to extract from this data, and the methodology by which you will extract it.

We will be generous with the term "real-world", which could include data from the domains of biology, sociology, economics and finance, technology, etc. However, it must be a well-defined, objective data source gathered by a third party. An example of an entirely acceptable data source is the recently released corpus of emails exchanged by Enron executives, where it would be natural to examine the network of whom exchanged email with whom. An example of an unacceptable data source and network would be "I wrote down a list of all my friends and then connected any pair of them that I thought shared a lot of common interests". This example is too subjective and the data is not gathered by a third party.

To be sure there is some minimal level of complexity to your network, we require that the number of vertices in the network be at least 25, and the total number of edges in the network to be at least 25. However, considerably more ambitious networks are encouraged.

By the "methodology" by which you will extract your network, we mean how you plan to go from the raw data source and your defined network to an acutal representation of your network in our simple format (see below, but essentially nothing more than a list of all the vertices in your network, followed by a list of all those pairs of vertices that are connected by an edge). Examples would be the careful application of your definition to the data "by hand", or via the use of Excel macros, or via the creation of a Perl script that computes the network from the raw data, and so on.

For Task 1, you should submit a brief write-up detailing the information described above for your network. If your data source is online, please provide the URLs for the source; feel free to include a small portion of the raw data in your write-up if it would be helpful to do so. Be sure to be as precise as possible in all aspects of your write-up, from network definition to methodology. As an informal test, your write-up should be sufficiently precise that a third party could independently create the same network you will from your description.

As a rough guideline, I would expect most write-ups to be between 1 and 4 pages long.

Tips on finding and downloading data sources:Here are a couple of simple tips. First of all, to find a data source that is closest to your idea, it might be helpful to append the words "database" or "statistics" to your basic Google query --- i.e. search for "movie credits database" as opposed to just "movie credits".

For those of you considering automating or partially automating the compilation of your data source or network extraction, a very handy utility is the program known as wget,which takes any URL and downloads the html found at that URL into a local file, thus letting you download many web pages without tediously having to surf them manually. The wget program is installed on Eniac and most of the other servers around, and there are also Windows versions available on the web. On Eniac, just typing "wget" at the command line will tell you how to get more info on usage.

Task 2: Extraction of your network from your data source. Due by electronic mail to Elliot and Prof Kearns (fengyi@seas.upenn.edu, mkearns@cis.upenn.edu) by midnight on Tuesday, March 1, in the exact form specified below. The subject line of your email MUST be "Task 2".

Once we have approved your choices in Task 1, you will then be asked to go ahead and use the precise definition of vertices and edges you gave and extract the network from the data. Note that while Task 1 was entirely conceptual, here's where you will actually need to manipulate your data source. Depending on how large and complex the network you defined in Task 1 is, this manipulation may require, or be greatly aided by, the use of various software tools like Excel. However, the lower bound on network size of 25 vertices and 25 edges was chosen to allow the possibility of doing all steps of this project "by hand" if you so prefer.

You will be asked to submit your network as a file in a specific format that is extremely straightforward --- essentially simply listing the names or IDs of your vertices, followed by a list of pairs of vertices that are connected by an edge.

More precisely, in the email you send by March 1 to receive credit for the assignment, the body of the email should be plain text (no attachments), and should have the following (sample) form:

graph G {
Alice;
Bob;
Chuck;
Dora;
Alice -- Bob;
Alice -- Dora;
Bob -- Chuck;
Bob -- Dora;
}

In this example, we first "declare" there to be four vertices that will have the names or labels "Alice", "Bob", "Chuck" and "Dora"; and then we declare the four edges listed above. Of course, the names or labels can be anything you like, so we suggest you make them meaningful in the context of your network. But you should probably avoid strange characters, underscores, etc. The syntax above is important --- i.e. you need semicolons after each vertex and edge declaration, you need to use "--", not "-", etc.

We will then provide you with web access to a program that will generate a nice visualization of your network from this formatted file, and we will also create and examine a gallery of all the networks for the class.

Task 3: FINAL REPORTS: Quantitative and Qualitative Analysis. Exact due date to be determined, but sometime towards the end of finals week.

The culmination of the project will be a written report analyzing your network from the perspective of the themes of the entire class. I anticipate that the typical length of these reports, including all figures, diagrams, references, etc. will be between 10 and 15 pages. You should think of the overarching goal of the report to be twofold: (a) Demonstrating your mastery of the technical and conceptual themes of the class, and (b) Demonstrating your ability to apply these themes to a specific real-world network.

You should view the inputs to your report as including at least the following items:


EVALUATION CRITERIA
Your efforts on the project will be evaluated according to a few criteria: * Creativity.How novel and imaginative your data source and network are will be considered. Try to both have fun and examine a network whose structural properties might actually show something interesting about some aspect of the world. It might be helpful to define and examine a network from an area you have a personal interest in. * Scale and Complexity.While it will be possible to succeed on the project examining a small but interesting network, we will also appreciate those who tackle the extraction and analysis of large-scale networks. * Depth of Analysis.Your grasp of the main principles of networks discussed in class and your ability to apply them in a compelling way to your network (including if your network seems to violate some of the principles) will be considered.