Analysis of Web page image tag distribution characteristics

Modelling the characteristics of Web page outlinks

Scientometrics, 2004

Using data sampled from top-level Web pages across five high-level domains and from sample pages within individual websites, the authors investigate the frequency distribution of outlinks in Web pages. The observed distributions were fitted to different theoretical distributions to determine the best-fitting model for representing outlink frequency across Web pages. Theoretical models tested include the modified power law (MPL), Mandelbrot (MDB), generalized Waring (GW), generalized inverse Gaussian-Poisson (GIGP), and generalized negative binomial (GNB) distributions. The GIGP and GNB provided good fits for data sets for top-level pages across the high-level domains tested, with the GIGP performing slightly better. The lumpiness and bimodal nature of two of the observed outlink distributions from Web pages within a given website resulted in poor fits of the theoretical models. The GIGP was able to provide better fits to these data sets after the top components were truncated. The ability to effectively model Web page attributes, such as the distribution of the number of outlinks per page, paves the way for simulation models of Web page structural content, and makes it possible to estimate the number of outlinks that may be encountered within Web pages of a specific domain or within individual websites.
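
The fitting step described in this abstract can be illustrated with a minimal sketch. It does not reproduce the authors' models: it swaps in the ordinary negative binomial (not the GIGP or GNB), fitted by the method of moments, and scores the fit with an unbinned chi-square statistic. The `outlinks` sample and all function names are hypothetical.

```python
import math
from collections import Counter

def fit_negative_binomial(counts):
    """Method-of-moments fit. Returns (r, p), where mean = r * (1 - p) / p."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / (n - 1)
    if var <= mean:
        raise ValueError("no overdispersion; a Poisson model may suffice")
    p = mean / var
    r = mean * p / (1 - p)
    return r, p

def nb_pmf(k, r, p):
    """P(X = k) for a negative binomial with real-valued r."""
    log_pmf = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
               + r * math.log(p) + k * math.log(1 - p))
    return math.exp(log_pmf)

def chi_square_statistic(counts, r, p):
    """Sum of (observed - expected)^2 / expected over k = 0..max(counts).

    Simplification: cells are not pooled, so sparse cells make the
    statistic unreliable on small samples.
    """
    n = len(counts)
    observed = Counter(counts)
    stat = 0.0
    for k in range(max(counts) + 1):
        expected = n * nb_pmf(k, r, p)
        if expected > 0:
            stat += (observed.get(k, 0) - expected) ** 2 / expected
    return stat

# Hypothetical outlink counts for 15 pages (not data from the paper).
outlinks = [0, 1, 1, 2, 2, 3, 3, 4, 5, 7, 8, 12, 15, 21, 40]
r, p = fit_negative_binomial(outlinks)
print(f"r = {r:.3f}, p = {p:.3f}, "
      f"chi-square = {chi_square_statistic(outlinks, r, p):.2f}")
```

Comparing several candidate models would mean computing the same statistic for each fitted distribution and preferring the smallest value, ideally after pooling sparse cells.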

Analysis of Usage Patterns in Large Multimedia Websites

Advanced Information and Knowledge Processing, 2010

User behavior on a website is a critical indicator of the website's usability and success. An understanding of usage patterns is therefore essential to website design optimization. In this context, large multimedia websites pose a significant challenge for comprehension of the complex and diverse user behaviors they sustain. This is due to the complexity of analyzing and understanding user-data interactions in media-rich contexts. In this chapter we present a novel multi-perspective approach for usability analysis of large media-rich websites. Our research combines multimedia web content analysis with elements of web-log analysis and visualization/visual mining of web usage metadata. Multimedia content analysis allows direct estimation of the information cues presented to a user by the web content. Analysis of web logs and usage metadata, such as location, type, and frequency of interactions, provides a complementary perspective on the site's usage. The entire set of information is leveraged through powerful visualization and interactive querying techniques to provide analysis of usage patterns, measures of design quality, and the ability to rapidly identify problems in the website design. Experiments on media-rich sites, including the SkyServer, a large multimedia web-based astronomy information repository, demonstrate the efficacy and promise of the proposed approach.

Statistical analysis of Web documents: a proposal and a case study

12th International Workshop on Database and Expert Systems Applications, 2000

The quality metrics so far adopted for web document analysis suffer from a serious limitation: they take into account single documents, disregarding the specific context the web pages belong to. As a formal tool suitable to overcome such a limitation, we introduce new metrics which take as input sets of web pages and return statistical distributions about the number of paragraphs of text, the area covered by the images and the number of (internal/external) hyperlinks. The strategy for the practical evaluation of the quality of the organization of a generic set of web pages requires the comparison of their statistical distributions against reference distributions computed by applying the metrics to a "selected" set of web documents. The paper reports on an experiment where the general strategy is instantiated to the specific domain of courseware; in numbers: seven thousand pages make up the reference set and two courseware, totalling about 250 pages, make up the actual case study. The experiment showed that our measures correspond to the kind of quality we might expect.
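
The abstract does not say how a page set's distribution is compared with the reference distribution. As one possible illustration, the sketch below uses the two-sample Kolmogorov-Smirnov distance (the largest gap between the two empirical CDFs); that choice, the function names, and the paragraph counts are all assumptions, not the authors' method or data.

```python
from bisect import bisect_right

def ecdf(sample):
    """Return the empirical CDF of a sample as a function of x."""
    data = sorted(sample)
    n = len(data)
    return lambda x: bisect_right(data, x) / n

def ks_distance(sample, reference):
    """Largest absolute gap between the two empirical CDFs (0 to 1)."""
    f, g = ecdf(sample), ecdf(reference)
    # Both ECDFs are step functions, so the maximum gap is attained
    # at one of the observed values.
    points = set(sample) | set(reference)
    return max(abs(f(x) - g(x)) for x in points)

# Hypothetical paragraphs-per-page counts (not data from the paper).
reference_set = [3, 4, 4, 5, 5, 6, 6, 7, 8, 9]
courseware_a = [4, 5, 5, 6, 6, 7]
courseware_b = [1, 1, 2, 2, 15, 20]

print("courseware A vs reference:", ks_distance(courseware_a, reference_set))
print("courseware B vs reference:", ks_distance(courseware_b, reference_set))
```

A smaller distance means the page set's distribution sits closer to the reference; the same function could be applied to image area or hyperlink counts.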

A poisson model for user accesses to web pages

2003

Predicting the next request of a user as she visits Web pages has gained importance as Web-based activity increases. There are a number of different approaches to prediction. This paper concentrates on the discovery and modelling of the user's aggregate interest in a session. This approach relies on the premise that the visiting time of a page is an indicator of the user's interest in that page. Even the same person may have different desires at different times.

Quantitative analysis of user-generated content on the Web

Proceedings of webevolve2008: web …, 2008

User-generated content (UGC) is becoming the most popular and valuable information available on the WWW. However, little serious research has been conducted to measure the properties of its production process. This paper presents an in-depth quantitative analysis of 9 popular websites that are based on different UGC types. The Information Production Process is used as a framework for the analysis. The findings provide, for the first time, strong scientific evidence for previously anecdotal knowledge: UGC production follows "long-tail" distributions and is marked by a strong "participation inequality". The analysis also arrived at unexpected findings: not all UGC types follow the inverse power-law distribution, and large content collections can be dominated by the presence of ultraproductive users. The analysis results also have implications for the administration of UGC-based websites.
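
"Participation inequality" can be quantified in several ways; the abstract does not name the measure used. The sketch below shows two common ones, the Gini coefficient and the share of content produced by the most active users. Both are offered as assumptions for illustration, and the contribution counts are hypothetical.

```python
def gini(values):
    """Gini coefficient: 0 for perfect equality, near 1 for extreme inequality."""
    data = sorted(values)
    n = len(data)
    total = sum(data)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive contribution")
    # Rank-weighted sum over the ascending-sorted values (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(data, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

def top_share(values, fraction):
    """Share of all contributions made by the top `fraction` of users."""
    data = sorted(values, reverse=True)
    k = max(1, int(fraction * len(data)))
    return sum(data[:k]) / sum(data)

# Hypothetical items contributed per user (not data from the paper).
contributions = [1, 1, 1, 1, 2, 2, 3, 5, 40, 144]

print(f"Gini coefficient: {gini(contributions):.3f}")
print(f"Share from top 20% of users: {top_share(contributions, 0.2):.3f}")
```

With these hypothetical counts the two most active users account for 184 of 200 items, which is the kind of dominance by ultraproductive users the abstract describes.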

The portrait of a common HTML web page

Proceedings of the 2006 ACM symposium on Document engineering - DocEng '06, 2006

Web pages are not purely text, nor are they solely HTML. This paper surveys HTML web pages; not only on textual content, but with an emphasis on higher order visual features and supplementary technology. Using a crawler with an in-house developed rendering engine, data on a pseudo-random sample of web pages is collected. First, several basic attributes are collected to verify the collection process and confirm certain assumptions on web page text. Next, we take a look at the distribution of different types of page content (text, images, plug-in objects, and forms) in terms of rendered visual area. Those different types of content are broken down into a detailed view of the ways in which the content is used. This includes a look at the prevalence and usage of scripts and styles. We conclude that more complex page elements play a significant and underestimated role in the visually attractive, media rich, and highly interactive web pages that are currently being added to the World Wide Web.

Statistical Models for Web Pages Usability

The usability of an interface is a fundamental issue to elucidate. Many researchers have argued that many usability results and recommendations lack empirical and experimental data. In this research, the usability of web pages is evaluated using several carefully selected statistical models. University web pages were chosen as subjects for this work for ease of comparison and ease of collecting data. A series of experiments was conducted to investigate the usability and design of the university web pages. Prototype web pages were developed according to structured methodologies of web page design and usability. The university web pages were evaluated together with the prototype web pages using a questionnaire designed according to Human Computer Interaction (HCI) heuristics. Nine respondent (user) variables and 14 web page variables (items) were studied. Stringent statistical analysis was adopted to extract the required information from the data acquired, followed by augmented interpretation of the statistical results. The analysis of variance (ANOVA) procedure showed that there were significant differences among the university web pages regarding most of the 23 items studied. The Duncan Multiple Range Test (DMRT) showed that the prototype performed significantly better on usability regarding most of the items. The correlation analysis showed significant positive and negative correlations between many items. The regression analysis revealed that the most significant factors (items) that contributed to the best model of university web page design and usability were: multimedia in the web pages, the organisation and design of the web page icons (alone), and graphics attractiveness. The results showed some of the limitations of some heuristics used in conventional interface systems design and proposed some additional heuristics for web page design and usability.

Understanding the everyday use of images on the web

Proceedings of the 6th Nordic Conference on Human-Computer Interaction Extending Boundaries - NordiCHI '10, 2010

This paper presents a qualitative study of domestic Web-based image use, and specifically asks why users access images online. This work is not limited to image search per se, but instead aims to understand holistically the circumstances in which images are accessed through Web-based tools. As such, we move beyond the existing information seeking literature, and instead provide contextual examples of image use as well as an analysis of both how and why images are used. The paper concludes with design recommendations that take into account this wider range of activities.

Heavy-Tailed Distributions, Generalized Source Coding and Optimal Web Layout Design

2000

The design of robust and reliable networks and network services has become an increasingly challenging task in today's Internet world. To achieve this goal, understanding the characteristics of Internet traffic plays a more and more critical role. Empirical studies of measured traffic traces have led to the wide recognition of self-similarity in network traffic. Moreover, a direct link has been established between the self-similar nature of measured aggregate network traffic and the underlying heavy-tailed distributions of the Web traffic at the source level.