Network Construction and Analysis Project (original) (raw)
Network Construction and Analysis Project
Networked Life
Spring 2005
Prof. Michael Kearns
PROJECT DESCRIPTION
In the early portion of Networked Life,one of your main tasks will be to construct a network based on a source of real-world data of your own choosing, to measure and analyze various properties of your network, and to compare these properties with those arising in other networks and models we will examine in the course. The hope is that this will be a creative and fun project that will help illustrate, ground and possibly challenge some of the abstract concepts about networks we will be studying.
The project will be broken over time into a number of manageable assignments. More details will be provided below as they develop.
DESCRIPTION AND TIMELINE OF ASSIGNMENTS
Task 1: Identification of data source and definition of network. Due as a hardcopy write-up at the start of class, Thursday, Jan 27.
In this first assignment, you should identify a specific source of real-world data, the precise definition of the network (vertices and edges) you plan to extract from this data, and the methodology by which you will extract it.
We will be generous with the term "real-world", which could include data from the domains of biology, sociology, economics and finance, technology, etc. However, it must be a well-defined, objective data source gathered by a third party. An example of an entirely acceptable data source is the recently released corpus of emails exchanged by Enron executives, where it would be natural to examine the network of whom exchanged email with whom. An example of an unacceptable data source and network would be "I wrote down a list of all my friends and then connected any pair of them that I thought shared a lot of common interests". This example is too subjective and the data is not gathered by a third party.
To be sure there is some minimal level of complexity to your network, we require that the number of vertices in the network be at least 25, and the total number of edges in the network to be at least 25. However, considerably more ambitious networks are encouraged.
By the "methodology" by which you will extract your network, we mean how you plan to go from the raw data source and your defined network to an acutal representation of your network in our simple format (see below, but essentially nothing more than a list of all the vertices in your network, followed by a list of all those pairs of vertices that are connected by an edge). Examples would be the careful application of your definition to the data "by hand", or via the use of Excel macros, or via the creation of a Perl script that computes the network from the raw data, and so on.
For Task 1, you should submit a brief write-up detailing the information described above for your network. If your data source is online, please provide the URLs for the source; feel free to include a small portion of the raw data in your write-up if it would be helpful to do so. Be sure to be as precise as possible in all aspects of your write-up, from network definition to methodology. As an informal test, your write-up should be sufficiently precise that a third party could independently create the same network you will from your description.
As a rough guideline, I would expect most write-ups to be between 1 and 4 pages long.
Tips on finding and downloading data sources:Here are a couple of simple tips. First of all, to find a data source that is closest to your idea, it might be helpful to append the words "database" or "statistics" to your basic Google query --- i.e. search for "movie credits database" as opposed to just "movie credits".
For those of you considering automating or partially automating the compilation of your data source or network extraction, a very handy utility is the program known as wget,which takes any URL and downloads the html found at that URL into a local file, thus letting you download many web pages without tediously having to surf them manually. The wget program is installed on Eniac and most of the other servers around, and there are also Windows versions available on the web. On Eniac, just typing "wget" at the command line will tell you how to get more info on usage.
Task 2: Extraction of your network from your data source. Due by electronic mail to Elliot and Prof Kearns (fengyi@seas.upenn.edu, mkearns@cis.upenn.edu) by midnight on Tuesday, March 1, in the exact form specified below. The subject line of your email MUST be "Task 2".
Once we have approved your choices in Task 1, you will then be asked to go ahead and use the precise definition of vertices and edges you gave and extract the network from the data. Note that while Task 1 was entirely conceptual, here's where you will actually need to manipulate your data source. Depending on how large and complex the network you defined in Task 1 is, this manipulation may require, or be greatly aided by, the use of various software tools like Excel. However, the lower bound on network size of 25 vertices and 25 edges was chosen to allow the possibility of doing all steps of this project "by hand" if you so prefer.
You will be asked to submit your network as a file in a specific format that is extremely straightforward --- essentially simply listing the names or IDs of your vertices, followed by a list of pairs of vertices that are connected by an edge.
More precisely, in the email you send by March 1 to receive credit for the assignment, the body of the email should be plain text (no attachments), and should have the following (sample) form:
graph G {
Alice;
Bob;
Chuck;
Dora;
Alice -- Bob;
Alice -- Dora;
Bob -- Chuck;
Bob -- Dora;
}
In this example, we first "declare" there to be four vertices that will have the names or labels "Alice", "Bob", "Chuck" and "Dora"; and then we declare the four edges listed above. Of course, the names or labels can be anything you like, so we suggest you make them meaningful in the context of your network. But you should probably avoid strange characters, underscores, etc. The syntax above is important --- i.e. you need semicolons after each vertex and edge declaration, you need to use "--", not "-", etc.
We will then provide you with web access to a program that will generate a nice visualization of your network from this formatted file, and we will also create and examine a gallery of all the networks for the class.
Task 3: FINAL REPORTS: Quantitative and Qualitative Analysis. Exact due date to be determined, but sometime towards the end of finals week.
The culmination of the project will be a written report analyzing your network from the perspective of the themes of the entire class. I anticipate that the typical length of these reports, including all figures, diagrams, references, etc. will be between 10 and 15 pages. You should think of the overarching goal of the report to be twofold: (a) Demonstrating your mastery of the technical and conceptual themes of the class, and (b) Demonstrating your ability to apply these themes to a specific real-world network.
You should view the inputs to your report as including at least the following items:
- The original data used to form your network.
- Visualizations of your network or fragments of it. If you would like different visualizations than those we have already generated for you, you should communicate with the TAs and they will help you.
- Quantitative and numerical analyses performed on your network.
- Your knowledge of the domain from which the network comes.
- Any articles or other materials you feel are relevant.
While I am flexible about the specific format of the reports, there will be some required elements, and a number of themes and topics that you should be sure to touch on. - Your report should begin with a precise recounting of the exact definition of your network (edges and vertices), as well as of your methodology in constructing the network, including all data sources. Again, the goal here is to be precise enough that a third party could replicate your efforts.
- Your report should analyze various structural properites of your network, including (but not restricted to) things like its diameter, clustering coefficient(s), degree distribution, and so on. You should consider doing computations (either by hand or computer) of these quantities, or at least make an effort to estimate or bound them. You are free to approach the TAs for help here as well, but please be aware that they are busy and will help you on their own schedules.
- You should further discuss these structural properties in light of what you know about what actually happens on your network, or what aspect of the real world it represents. For instance, if there are vertices with extraordinarily high degree, you should discuss why it makes sense that they have high degree (if in fact it does make sense), how they might have come to have this property, and so on. Ditto for the other structural properties you discuss.
- You should discuss plausible generative models for your network, and contrast them with those we examined in class. You should discuss which of the generative models we examined seem like better or worse explanations for how your network formed.
- You should discuss how structural properties of your network either do or do not influence what actually happens on your network.
- You should make liberal use of network visualizations, graphs, plots, charts, tables, etc. Anything that helps make your points more precise and understandable is encouraged.
- Other guidance will be added to this list as it occurs to me.