UploadWizard/Software design - MediaWiki

Some documentation for developers and reviewers interested in how UploadWizard works.

Incomplete Frontend docs

Incomplete Location spec

PHP (server side files)

New

includes/upload/UploadStash.php

includes/specials/SpecialUploadStash.php

extensions/UploadWizard/ApiQueryStashImageInfo.php

extensions/UploadWizard/SpecialUploadWizard.php

extensions/UploadWizard/UploadWizard.alias.php

extensions/UploadWizard/UploadWizard.i18n.php

extensions/UploadWizard/UploadWizardMessages.php (should be obsoleted by ResourceLoader in the near future)

extensions/UploadWizard/UploadWizardPage.php (should be obsoleted by ResourceLoader in the near future)

Modified

includes/upload/UploadBase.php

includes/api/ApiUpload.php

includes/filerepo/File.php

Changed config

includes/AutoLoader.php

includes/SpecialPage.php

languages/messages/MessagesEn.php

JavaScript

To be documented

UploadWizard is:

To achieve this, we've changed a lot about how uploads are accomplished.

The standard MediaWiki way

Typical media upload timeline diagram for a standard MediaWiki install

This is how media uploads have worked for a long time with MediaWiki -- very simply.

The file is uploaded with an HTML form, along with wikitext for the File: page that will surround the image.

Each wiki page could be very different; there's little standard formatting.

However, we still use the base operation here -- to upload a media file with accompanying wikitext.

Customized upload timeline diagram for Wikimedia Commons

This is how Wikimedia Commons works in late 2010.

Nothing fundamental has changed here -- they are still uploading a media file with some associated wikitext. But it's being done just a little differently. There is more bureaucracy up front to try to categorize various media types. (At left we see only one example of many.) The user fills out a form, and some JavaScript on the page creates equivalent wikitext, and sends that with the media file to the server.

There is much more preamble, as they feel they need to warn uploaders about Commons' licensing and interface requirements in very scary text.

The form page is very complicated, and has more structure and required fields, but ultimately it's just creating wikitext.

While an improvement over the previous version, the usability is now very poor.

The page spends half its time warning you about bad things that can happen.

The UploadWizard way

UploadWizard's multiple-file upload timeline diagram

UploadWizard at heart uses the same system -- associate a media file with wikitext. But it adds two new layers to the entire interaction.

The most obvious change is that we are shepherding multiple files through this process, at the same time.

On the client side, in the user's browser, we now have a "wizard" style interface flow. Information that is related is gathered at the same time, and then the user proceeds to the next step. For example, there's exactly one screen about licensing, and for the most part everything is handled there.

On the server side, we have a new way of storing data and media files that stops just short of publishing them to the Wiki.

This is important for us mostly due to a quirk of how web browsers have traditionally worked. Web browsers cannot analyze the files they are uploading or provide any information about them, not even a thumbnail -- they need help. The Firefogg extension is one kind of help, but the one that works with all browsers is to upload the file to the server and then ask it for what it can determine about these files. So UploadWizard first uploads the files to the server, and then it gets:

So the user can complete filling out all this information in relative peace, focusing on one thing at a time, not worrying if they've accidentally released an unlicensed file into the public sphere.

And then when they're ready, they can publish it to the wiki.

To make the above design for UploadWizard work, we needed to store files with the following constraints:

The new UploadStash module and Special:UploadStash page answer the need for such a file area within MediaWiki.

This is not a radically new concept, as we've been using temporary stashes for uploads. If an uploaded file is found to have some problem which the user could fix before it was committed to the database (typically, a naming conflict) the file would be placed into the repository's "temp" area, its location saved in the user's "session", and a session "key" returned to the user so s/he can refer to this stashed file later.
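The older stash behaviour described above can be sketched roughly as follows. This is a minimal illustration in Python rather than MediaWiki's PHP; the class name `SessionStash`, its methods, and the layout of the session dictionary are all hypothetical and only mirror the description: the file goes into a temp area, its location is saved in the session, and a key is handed back.

```python
import os
import secrets
import shutil


class SessionStash:
    """Toy model of session-keyed stashing (hypothetical, not MediaWiki code)."""

    def __init__(self, temp_dir, session):
        self.temp_dir = temp_dir  # stands in for the repository's "temp" area
        self.session = session    # dict standing in for the user's session

    def stash(self, uploaded_path, metadata):
        """Copy the upload into the temp area and return an opaque session key."""
        key = secrets.token_hex(8)
        stored_path = os.path.join(self.temp_dir, key)
        shutil.copyfile(uploaded_path, stored_path)
        # Only the location and a fixed set of metadata are remembered.
        self.session.setdefault("stash", {})[key] = {
            "path": stored_path,
            "metadata": dict(metadata),
        }
        return key

    def get(self, key):
        """Look the file up again; raises KeyError for a key this session never stashed."""
        return self.session.get("stash", {})[key]
```

Because the record lives in the user's own session, a key is meaningless to any other session, which is the property the later sections rely on.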

However, aside from storing the file with a fixed set of metadata, there wasn't anything else one could do with the file.

How the file storage system(s) used in UploadWizard interact

We add a few new features with UploadStash:

The implementation is straightforward; a key-value PHP object is serialized into the stash.

But other keys can be used.
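As a rough illustration of serializing a key-value record into the stash, the sketch below uses Python and JSON as a stand-in for PHP serialization. The function names and the field names (`path`, `metadata`) are assumptions made for the example, not the actual stash layout.

```python
import json


def serialize_stash_entry(path, metadata):
    """Flatten a key-value stash record into a string that can be stored."""
    entry = {"path": path, "metadata": metadata}
    return json.dumps(entry, sort_keys=True)


def deserialize_stash_entry(blob):
    """Recover the key-value record from its stored form."""
    return json.loads(blob)
```

Any extra keys placed in `metadata` survive the round trip unchanged, so callers are free to store more than a fixed set of fields.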

By our current design, the temp area does not have to be web-accessible. Furthermore, even if it were, MediaWiki (and Wikimedia projects especially) has zero security for media files. So, to keep from inadvertently "publishing" files, we simply create special URLs under Special:UploadStash that, when invoked, look up a "session key" in the user's current session and read the file directly back to the user with appropriate HTTP headers. In other words, for this limited purpose, PHP takes on the role that Apache normally does in serving media files. See below for security and other implications. Incidentally, this is not the first time we've done this with MediaWiki; Tim Starling's WebStore module uses a similar strategy, although there the reason isn't security.
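The serving step can be sketched as below, again in Python rather than PHP. The function `serve_stashed_file`, the session layout, and the exact choice of headers are hypothetical; the point is only that the application looks the key up in the requesting user's session and streams the bytes back itself, so the temp area never needs a public URL.

```python
import mimetypes


def serve_stashed_file(session, key):
    """Return (status, headers, body) for a stashed file, or a 404 if the
    key is not present in this user's session."""
    entry = session.get("stash", {}).get(key)
    if entry is None:
        return 404, {"Content-Type": "text/plain"}, b"Not found"

    with open(entry["path"], "rb") as handle:
        body = handle.read()

    # Prefer a recorded MIME type; otherwise guess from the file name.
    content_type = (
        entry.get("mime")
        or mimetypes.guess_type(entry["path"])[0]
        or "application/octet-stream"
    )
    headers = {
        "Content-Type": content_type,
        "Content-Length": str(len(body)),
        "Cache-Control": "private",  # the file is not published yet
    }
    return 200, headers, body
```

A request carrying a key from someone else's session simply gets the 404 branch, since the lookup never leaves the caller's own session data.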

This is to get thumbnails. This uses the standard facilities for transforming files in MediaWiki. Sound files and other non-visual media should be assigned icons of the appropriate size. These icons and other files will be stored in the temporary area. Since they are stored under their content hash, identical icons are only stored once. These thumbnails are then "stashed" themselves and thus become accessible in the way noted above.
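Storing under the content hash is what makes identical icons collapse into a single file. A minimal sketch, assuming SHA-1 as the hash and a flat directory layout (both are assumptions for illustration, as is the function name `store_by_content_hash`):

```python
import hashlib
import os


def store_by_content_hash(temp_dir, data, extension):
    """Write data to a file named after its hash; identical data maps to
    the same path, so it is only written once."""
    digest = hashlib.sha1(data).hexdigest()
    path = os.path.join(temp_dir, f"{digest}.{extension}")
    if not os.path.exists(path):
        with open(path, "wb") as handle:
            handle.write(data)
    return path
```

Two sound files that are assigned the same placeholder icon would both receive the same path back, and the temp area would hold one copy.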

A new module, ApiQueryStashImageInfo, a subclass of ApiQueryImageInfo, is being added.
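For orientation, a client might build a query for this module roughly as follows. The parameter names (`prop=stashimageinfo`, `siifilekey`, `siiprop`, `siiurlwidth`) follow the MediaWiki API as later published and should be treated as assumptions here, since they may differ from the version this page describes; the wiki URL is a placeholder and `stash_imageinfo_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def stash_imageinfo_url(api_base, file_key, thumb_width=None):
    """Build an api.php URL asking for information about a stashed file."""
    params = {
        "action": "query",
        "prop": "stashimageinfo",
        "siifilekey": file_key,      # key returned when the file was stashed
        "siiprop": "url|size|mime",
        "format": "json",
    }
    if thumb_width is not None:
        # Ask the server to also describe a scaled thumbnail.
        params["siiurlwidth"] = str(thumb_width)
    return api_base + "?" + urlencode(params)
```

For example, `stash_imageinfo_url("https://wiki.example.org/w/api.php", "abc123", thumb_width=120)` yields a URL that the browser-side wizard could fetch to obtain a thumbnail URL and basic metadata before anything is published.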

All of the above has been carefully designed to be 100% compatible with the previous methods of stashing files (in fact, from a data perspective, identical).

(Update: in mid-2011, we transitioned from storing the UploadStash data in $_SESSION to storing it in the database.)

Security and other implications

Since UploadStash allows one to read temp files off the MediaWiki server in a new way, it has to be checked very carefully that it does not open any new security holes. Here is what is in place:

Even so, it is conceivable that if it were ever used for "upload by URL", the user could turn MediaWiki into a sort of silent, private, slow, inefficient web proxy.

There is an opportunity for a denial of service attack, by uploading files and requesting transformations ad infinitum.

Opportunities for rationalizing other parts of the codebase

Incidentally, this "stashing" functionality has existed for a while in our base class UploadBase, but extremely similar code can also be found in the extension FirefoggChunkedUpload, as well as in other extensions in various states of upkeep (SpecialUploadMogile, MultiUpload, SemanticForms, and SocialProfile). UploadStash aims to encompass all the use cases noted above and in most cases should be a drop-in replacement. It should also make other forms of asynchronous uploading (such as Upload By URL) simpler to manage.