Miguel Costa | Universidade de Lisboa (original) (raw)
Papers by Miguel Costa
The Past Web
Web archives are a historically valuable source of information. In some respects, web archives ar... more Web archives are a historically valuable source of information. In some respects, web archives are the only record of the evolution of human society in the last two decades. They preserve a mix of personal and collective memories, the importance of which tends to grow as they age. However, the value of web archives depends on their users being able to search and access the information they require in efficient and effective ways. Without the possibility of exploring and exploiting the archived contents, web archives are useless. Web archive access functionalities range from basic browsing to advanced search and analytical services, accessed through user-friendly interfaces. Full-text and URL search have become the predominant and preferred forms of information discovery in web archives, fulfilling user needs and supporting search APIs that feed complex applications. Both full-text and URL search are based on the technology developed for modern web search engines, since the Web is the main resource targeted by both systems. However, while web search engines enable searching over the most recent web snapshot, web archives enable searching over multiple snapshots from the past. This means that web archives have to deal with a temporal dimension that is the cause of new challenges and opportunities, discussed throughout this chapter.
Web archiving has been gaining interest and recognized importance for modern societies around the... more Web archiving has been gaining interest and recognized importance for modern societies around the world. However, for web archivists it is frequently difficult to demonstrate this fact, for instance, to funders. This study provides an updated and global overview of web archiving. The obtained results showed that the number of web archiving initiatives significantly grew after 2003 and they are concentrated on developed countries. We statistically analyzed metrics, such as, the volume of archived data, archive file formats or number of people engaged. Web archives all together must process more data than any web search engine. Considering the complexity and large amounts of data involved in web archiving, the results showed that the assigned resources are scarce. A Wikipedia page was created to complement the presented work and be collaboratively kept up-to-date by the community.
ISPRS International Journal of Geo-Information
Datasets collecting demographic and socio-economic statistics are widely available. Still, the da... more Datasets collecting demographic and socio-economic statistics are widely available. Still, the data are often only released for highly aggregated geospatial areas, which can mask important local hotspots. When conducting spatial analysis, one often needs to disaggregate the source data, transforming the statistics reported for a set of source zones into values for a set of target zones, with a different geometry and a higher spatial resolution. This article reports on a novel dasymetric disaggregation method that uses encoder–decoder convolutional neural networks, similar to those adopted in image segmentation tasks, to combine different types of ancillary data. Model training constitutes a particular challenge. This is due to the fact that disaggregation tasks are ill-posed and do not entail the direct use of supervision signals in the form of training instances mapping low-resolution to high-resolution counts. We propose to address this problem through self-training. Our method it...
Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval - SIGIR '14, 2014
Web archives already hold together more than 534 billion files and this number continues to grow ... more Web archives already hold together more than 534 billion files and this number continues to grow as new initiatives arise. Searching on all versions of these files acquired throughout time is challenging, since users expect as fast and precise answers from web archives as the ones provided by current web search engines. This work studies, for the first time, how to improve the search effectiveness of web archives, including the creation of novel temporal features that exploit the correlation found between web document persistence and relevance. The persistence was analyzed over 14 years of web snapshots. Additionally, we propose a temporal-dependent ranking framework that exploits the variance of web characteristics over time influencing ranking models. Based on the assumption that closer periods are more likely to hold similar web characteristics, our framework learns multiple models simultaneously, each tuned for a specific period. Experimental results show significant improvements over the search effectiveness of single-models that learn from all data independently of its time. Thus, our approach represents an important step forward on the state-of-the-art IR technology usually employed in web archives.
8th International Web …, 2008
This paper introduces the Portuguese Web Archive initiative, pre-senting its main objectives and ... more This paper introduces the Portuguese Web Archive initiative, pre-senting its main objectives and work in progress. Term search over web archives collections is a desirable feature that raises new chal-lenges. It is discussed how the terms index size could be reduced ...
Multilingual Information Access for Text, Speech and Images, 2005
This paper describes the participation of the XLDB Group in the CLEF monolingual ad hoc task for ... more This paper describes the participation of the XLDB Group in the CLEF monolingual ad hoc task for Portuguese. We present tumba!, a Portuguese search engine and describe its architecture and the underlying assumptions. We discuss the way we used tumba! in CLEF, providing details on our runs and our experiments with ranking algorithms.
The Past Web
Web archives are a historically valuable source of information. In some respects, web archives ar... more Web archives are a historically valuable source of information. In some respects, web archives are the only record of the evolution of human society in the last two decades. They preserve a mix of personal and collective memories, the importance of which tends to grow as they age. However, the value of web archives depends on their users being able to search and access the information they require in efficient and effective ways. Without the possibility of exploring and exploiting the archived contents, web archives are useless. Web archive access functionalities range from basic browsing to advanced search and analytical services, accessed through user-friendly interfaces. Full-text and URL search have become the predominant and preferred forms of information discovery in web archives, fulfilling user needs and supporting search APIs that feed complex applications. Both full-text and URL search are based on the technology developed for modern web search engines, since the Web is the main resource targeted by both systems. However, while web search engines enable searching over the most recent web snapshot, web archives enable searching over multiple snapshots from the past. This means that web archives have to deal with a temporal dimension that is the cause of new challenges and opportunities, discussed throughout this chapter.
Web archiving has been gaining interest and recognized importance for modern societies around the... more Web archiving has been gaining interest and recognized importance for modern societies around the world. However, for web archivists it is frequently difficult to demonstrate this fact, for instance, to funders. This study provides an updated and global overview of web archiving. The obtained results showed that the number of web archiving initiatives significantly grew after 2003 and they are concentrated on developed countries. We statistically analyzed metrics, such as, the volume of archived data, archive file formats or number of people engaged. Web archives all together must process more data than any web search engine. Considering the complexity and large amounts of data involved in web archiving, the results showed that the assigned resources are scarce. A Wikipedia page was created to complement the presented work and be collaboratively kept up-to-date by the community.
ISPRS International Journal of Geo-Information
Datasets collecting demographic and socio-economic statistics are widely available. Still, the da... more Datasets collecting demographic and socio-economic statistics are widely available. Still, the data are often only released for highly aggregated geospatial areas, which can mask important local hotspots. When conducting spatial analysis, one often needs to disaggregate the source data, transforming the statistics reported for a set of source zones into values for a set of target zones, with a different geometry and a higher spatial resolution. This article reports on a novel dasymetric disaggregation method that uses encoder–decoder convolutional neural networks, similar to those adopted in image segmentation tasks, to combine different types of ancillary data. Model training constitutes a particular challenge. This is due to the fact that disaggregation tasks are ill-posed and do not entail the direct use of supervision signals in the form of training instances mapping low-resolution to high-resolution counts. We propose to address this problem through self-training. Our method it...
Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval - SIGIR '14, 2014
Web archives already hold together more than 534 billion files and this number continues to grow ... more Web archives already hold together more than 534 billion files and this number continues to grow as new initiatives arise. Searching on all versions of these files acquired throughout time is challenging, since users expect as fast and precise answers from web archives as the ones provided by current web search engines. This work studies, for the first time, how to improve the search effectiveness of web archives, including the creation of novel temporal features that exploit the correlation found between web document persistence and relevance. The persistence was analyzed over 14 years of web snapshots. Additionally, we propose a temporal-dependent ranking framework that exploits the variance of web characteristics over time influencing ranking models. Based on the assumption that closer periods are more likely to hold similar web characteristics, our framework learns multiple models simultaneously, each tuned for a specific period. Experimental results show significant improvements over the search effectiveness of single-models that learn from all data independently of its time. Thus, our approach represents an important step forward on the state-of-the-art IR technology usually employed in web archives.
8th International Web …, 2008
This paper introduces the Portuguese Web Archive initiative, pre-senting its main objectives and ... more This paper introduces the Portuguese Web Archive initiative, pre-senting its main objectives and work in progress. Term search over web archives collections is a desirable feature that raises new chal-lenges. It is discussed how the terms index size could be reduced ...
Multilingual Information Access for Text, Speech and Images, 2005
This paper describes the participation of the XLDB Group in the CLEF monolingual ad hoc task for ... more This paper describes the participation of the XLDB Group in the CLEF monolingual ad hoc task for Portuguese. We present tumba!, a Portuguese search engine and describe its architecture and the underlying assumptions. We discuss the way we used tumba! in CLEF, providing details on our runs and our experiments with ranking algorithms.