Francisco Pereira | Massachusetts Institute of Technology (MIT)
Papers by Francisco Pereira
Public transport smartcard data can be used to detect large crowds. By comparing smartcard data with statistics on habitual behavior (e.g. average by time of day), one can specifically identify non-habitual crowds, which are often problematic for the transport system. While habitual overcrowding (e.g. during peak hour) is well understood by traffic managers and travelers, non-habitual overcrowding hotspots can be very disruptive because they are generally unexpected. By quickly understanding and reacting to cases of overcrowding, transport managers can mitigate transport system disruptions.
We propose a probabilistic data analysis model that breaks each non-habitual overcrowding hotspot into a set of explanatory components. Potential explanatory components are retrieved from social networks and special events websites and then processed through text-analysis techniques. We then use the probabilistic model to estimate each component’s specific share of total overcrowding counts.
We first validate with synthetic data and then test our model with real data from Singapore’s public transport system (EZLink), focusing on three case study areas. We demonstrate that it generates explanations that are intuitively plausible and consistent both locally (correlation coefficient, CC, from 85% to 99% for the three areas) and globally (CC from 41.2% to 83.9%).
This model is directly applicable to domains that are sensitive to crowd formation due to large social events (e.g. communications, water, energy, waste).
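The paper's actual probabilistic model is not reproduced in this abstract; purely as a toy illustration of the "share of total overcrowding counts" idea, the sketch below floors observed counts against habitual averages and then splits the excess among candidate events in proportion to assumed plausibility weights. All event names and numbers are invented.

```python
# Toy sketch (NOT the paper's model): allocate non-habitual overcrowding
# counts among candidate explanatory components via simple proportional
# allocation. All names and figures below are hypothetical.

def hotspot_counts(observed, habitual):
    """Extra arrivals beyond habitual behavior, floored at zero."""
    return [max(o - h, 0) for o, h in zip(observed, habitual)]

def allocate_shares(extra, component_weights):
    """Split each interval's extra count among components in proportion
    to their (e.g. text-derived) plausibility weights."""
    shares = {name: 0.0 for name in component_weights}
    for t, count in enumerate(extra):
        total_w = sum(w[t] for w in component_weights.values())
        if total_w == 0:
            continue  # interval left unexplained
        for name, w in component_weights.items():
            shares[name] += count * w[t] / total_w
    return shares

# Hourly tap-in counts vs. habitual averages (hypothetical numbers).
observed = [120, 400, 650, 300]
habitual = [100, 110, 120, 110]
extra = hotspot_counts(observed, habitual)   # [20, 290, 530, 190]

# Two candidate events found on event websites, with per-hour weights.
weights = {"concert": [0.0, 0.8, 0.9, 0.5],
           "football": [0.0, 0.2, 0.3, 0.5]}
shares = allocate_shares(extra, weights)
print(shares)
```

Note that intervals where no candidate event is plausible (weight zero) stay unexplained, mirroring the need for a residual component in a real model.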
Over the last few years, much online volunteered geographic information (VGI) has emerged and has been increasingly analyzed to understand places and cities, as well as human mobility and activity. However, there are concerns about the quality and usability of such VGI. In this study, we demonstrate a complete process that comprises the collection, unification, classification and validation of a type of VGI—online point-of-interest (POI) data—and develop methods to utilize such POI data to estimate disaggregated land use (i.e., employment size by category) at a very high spatial resolution (census block level), using part of the Boston metropolitan area as an example. With recent advances in activity-based land use, transportation, and environment (LUTE) models, such disaggregated land use data become important to allow LUTE models to analyze and simulate a person’s choices of work location and activity destinations and to understand policy impacts on future cities. These data can also be used as alternatives to explore economic activities at the local level, especially as government-published census-based disaggregated employment data have become less available in the recent decade. Our new approach provides opportunities for cities to estimate land use at high resolution and low cost by utilizing VGI while ensuring its quality with a certain accuracy threshold. The automatic classification of POIs can also be utilized for other types of analyses on cities.
Transportation systems are inherently uncertain due to the occurrence of random disruptions; meanwhile, real-time traveler information offers the potential to help travelers make better route choices under such disruptions. This paper presents the first revealed preference (RP) study of routing policy choice, in which travelers opt for routing policies instead of fixed paths. A routing policy is defined as a decision rule, applied at each link, that maps possible realized traffic conditions to decisions on the link to take next. It represents a traveler’s ability to look ahead in order to incorporate real-time information not yet available at the time of decision. An efficient algorithm to find the optimal routing policy (ORP) in large-scale networks is presented, as the algorithm is a building block of any routing policy choice set generation method. Two case studies are conducted, in Stockholm, Sweden and Singapore, respectively. Data for the underlying stochastic time-dependent network are generated from taxi Global Positioning System (GPS) traces through map-matching and non-parametric link travel time estimation. The routing policy choice sets are then generated by link elimination and simulation, in which the ORP algorithm is repeatedly executed. The generated choice sets are first evaluated on whether they include the observed GPS traces on a specific day, which is defined as coverage. They are then evaluated on adaptiveness, defined as the capability of a routing policy to be realized as different paths over different days. It is shown that a combination of link elimination and simulation methods yields satisfactory coverage. The comparison to a path choice set benchmark suggests that a routing policy choice set could potentially provide better coverage and capture the adaptive nature of route choice.
The routing policy choice set generation enables the development of a discrete choice model of routing policy choice, which will be explored in the second stage of the study.
ICML 2014, Jun 22, 2014
Learning from multiple annotators took a valuable step towards modeling data that does not fit the usual single-annotator setting, since multiple annotators sometimes offer varying degrees of expertise. When disagreements occur, establishing the correct label through trivial solutions such as majority voting may not be adequate: without considering heterogeneity among the annotators, we risk generating a flawed model. In this paper, we generalize GP classification in order to account for multiple annotators with different levels of expertise. By explicitly handling uncertainty, Gaussian processes (GPs) provide a natural framework for building proper multiple-annotator models. We empirically show that our model significantly outperforms other commonly used approaches, such as majority voting, without a significant increase in the computational cost of approximate Bayesian inference. Furthermore, an active learning methodology is proposed, which is able to reduce annotation cost even further.
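The paper's GP model itself is beyond an abstract, but the failure mode of majority voting it targets is easy to demonstrate. The toy simulation below (all accuracies and counts are invented) compares plain majority voting against a simple reliability-weighted vote when one expert is pooled with several noisy annotators:

```python
# Toy illustration (NOT the paper's GP model): with heterogeneous
# annotators, majority voting can be badly beaten by any scheme that
# accounts for per-annotator reliability. Accuracies are assumptions.
import math
import random
from collections import Counter

random.seed(0)

def annotate(true_label, accuracy):
    """Binary annotator that is correct with the given probability."""
    return true_label if random.random() < accuracy else 1 - true_label

def majority_vote(labels):
    return Counter(labels).most_common(1)[0][0]

def weighted_vote(labels, accuracies):
    """Weight each annotator by the log-odds of being correct."""
    score = sum(math.log(a / (1 - a)) * (1 if lab == 1 else -1)
                for lab, a in zip(labels, accuracies))
    return 1 if score > 0 else 0

# One expert (90% accurate) plus four noisy annotators (55% accurate).
accuracies = [0.9, 0.55, 0.55, 0.55, 0.55]
true_labels = [random.randint(0, 1) for _ in range(2000)]
maj_correct = wtd_correct = 0
for y in true_labels:
    labels = [annotate(y, a) for a in accuracies]
    maj_correct += majority_vote(labels) == y
    wtd_correct += weighted_vote(labels, accuracies) == y

print(f"majority voting:      {maj_correct / 2000:.3f}")
print(f"reliability-weighted: {wtd_correct / 2000:.3f}")
```

In practice the annotator reliabilities are unknown and must be inferred jointly with the labels, which is exactly what the multiple-annotator GP formulation does.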
IEEE Transactions on Intelligent Transportation Systems, Apr 1, 2014
This paper presents a methodology for estimating the upper and lower bounds of a real-time traffic prediction system, i.e. its prediction interval (PI). Without complex implementation work, our model can complement any pre-existing prediction system with extra uncertainty information, such as the 5% and 95% quantiles. We treat the traffic prediction system as a black box that provides a feed of predictions. Given this feed together with the observed values, we train conditional quantile regression methods that estimate upper and lower quantiles of the error.
The goal of conditional quantile regression is to determine a function, dτ(x), that returns the specific quantile τ of a target variable d, given an input vector x. Following Koenker [1], we implement two functional forms of dτ(x): locally weighted linear, which relies on values in the neighborhood of x; and splines, piecewise-defined smooth polynomial functions.
We demonstrate this methodology with three different traffic prediction models applied to two freeway datasets, from Irvine, CA, and Tel Aviv, Israel. We contrast the results with a traditional confidence-interval approach that assumes the error is normally distributed with constant (homoscedastic) variance. We apply several evaluation measures based on earlier literature and also contribute two new measures that focus on relative interval length and the balance between accuracy and interval length. For the available datasets, we verified that conditional quantile regression outperforms the homoscedastic baseline on the vast majority of the indicators.
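The loss behind conditional quantile regression can be sketched in a few lines. The "pinball" loss below is the standard quantile loss; averaged over a sample of prediction errors, it is minimized at the empirical τ-quantile, which is why fitting it on a black-box predictor's errors yields interval bounds. The residual values are made up for illustration:

```python
# Sketch of the quantile ("pinball") loss used by quantile regression.
# Minimizing its sample average recovers the empirical tau-quantile.

def pinball_loss(tau, d, q):
    """Quantile loss for target d and candidate quantile q."""
    return max(tau * (d - q), (tau - 1) * (d - q))

def empirical_quantile(tau, sample):
    """Minimize the summed pinball loss over candidates drawn from the
    sample itself (the convex piecewise-linear minimum lies on a point)."""
    return min(sample,
               key=lambda q: sum(pinball_loss(tau, d, q) for d in sample))

# Hypothetical residuals of a black-box traffic predictor.
errors = [-8, -3, -1, 0, 1, 2, 2, 4, 6, 15]
lower = empirical_quantile(0.05, errors)
upper = empirical_quantile(0.95, errors)
print(lower, upper)   # bounds of a 90% prediction interval on the error
```

Conditional quantile regression generalizes this by letting the quantile depend on an input vector x (via the locally weighted linear and spline forms mentioned above) instead of being a single constant per τ.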
Transportation Research Part C
Due to the heterogeneous, case-by-case nature of traffic incidents, much relevant information is recorded in free-text fields instead of constrained value fields. As a result, such text components enclose considerable richness that is invaluable for incident analysis, modeling and prediction. However, the difficulty of formally interpreting such data has led to minimal consideration in previous work.
In this paper, we focus on the task of incident duration prediction, more specifically on predicting clearance time, the period between incident reporting and road clearance. An accurate prediction will help traffic operators implement appropriate mitigation measures and better inform drivers about expected road blockage time.
The key contribution is the introduction of topic modeling, a text analysis technique, as a tool for extracting information from incident reports in real time. We analyze a dataset of two years of accident cases and develop a machine-learning-based duration prediction model that integrates textual with non-textual features. To demonstrate the value of the approach, we compare predictions with and without text analysis using several different prediction models. Models using textual features consistently outperform the others in nearly all circumstances, presenting errors up to 28% lower than models without such information.
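As a much-simplified stand-in for the paper's topic-model features, the sketch below turns the free-text field of an incident report into a small numeric vector that a duration model could consume alongside non-textual features. A real topic model (e.g. LDA) learns its topics from the corpus; here the "topics" are hand-picked keyword groups, invented purely for illustration:

```python
# Simplified stand-in for topic-model features (NOT the paper's method):
# count occurrences of hand-picked "topic" keywords in an incident report.
import re

TOPIC_KEYWORDS = {             # invented keyword groups, not from the paper
    "heavy_vehicle": {"truck", "trailer", "lorry"},
    "injury": {"injury", "injured", "ambulance"},
    "spill": {"spill", "fuel", "debris"},
}

def text_features(report):
    """Map a free-text report to per-topic keyword counts."""
    tokens = re.findall(r"[a-z]+", report.lower())
    return {topic: sum(tok in keywords for tok in tokens)
            for topic, keywords in TOPIC_KEYWORDS.items()}

report = "Overturned truck, fuel spill across two lanes, driver injured."
features = text_features(report)
print(features)   # {'heavy_vehicle': 1, 'injury': 1, 'spill': 2}
```

The resulting counts (or, in the paper's setting, learned topic proportions) can simply be appended to the non-textual feature vector before training any standard regression model for clearance time.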
Pervasive Computing, Jan 1, 2010
This paper deals with the analysis of crowd mobility during special events. We analyze nearly 1 million cell-phone traces and associate their destinations with social events. We show that the origins of people attending an event are strongly correlated with the type of event. This has implications for city management, since knowledge of these additional flows can be critical information for decisions about event management and congestion mitigation.
From Social Butterfly to …, Jan 1, 2011
The midsized and large cities of the twenty-first century lead a "double life," because they exist both in the physical and the digital worlds. Although these worlds do not physically share the same spatial or temporal dimensions, the anonymous citizen constantly projects the physical world onto the digital one. Websites such as Flickr, Twitter, Facebook, and Wikipedia are repositories of what citizens sense in the city, and include reports or announcements of events and descriptions of space.
The Future Mobility Survey (FMS) is a smartphone-based prompted-recall travel survey that aims to support data collection initiatives for transport modelling purposes. This paper details the considerations that have gone into its development, including the smartphone apps for the iPhone and Android platforms, the online activity diary and user interface, and the background intelligence for processing collected data into activity locations and travel traces. We discuss the various trade-offs regarding user comprehension, resource use, and participant burden, including findings from usability tests and a pilot study. We find that close attention should be paid to the simplicity of the user interaction, the determination of activity locations (such as the false positive/false negative trade-off in their automatic classification), and the clarity of interactions in the activity diary. The FMS system design and implementation provide pragmatic, useful insights into the development of similar platforms and approaches for travel/activity surveys.
European Transport Research Review, Jan 1, 2009
The task of map-matching consists of finding a correspondence between a geographical point or sequence of points (e.g. obtained from GPS) and a given map. For many reasons, notably noisy input data and incomplete or inaccurate maps, this task is not trivial, and it can affect the validity of applications that depend on it. This includes any transport research project that relies on post-hoc analysis of traces (e.g. via Floating Car Data). In this article, we describe an off-line map-matching algorithm that allows us to handle incomplete map databases. We test and compare it with other approaches and ultimately provide guidelines for use within other applications. The project is provided as open source.
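The geometric core shared by many map-matching approaches can be sketched briefly. This is not the article's algorithm, just the basic building block: project a GPS fix onto each candidate road segment and keep the closest. Coordinates are treated as planar, which is only reasonable over short distances; the segment map below is hypothetical:

```python
# Sketch of the basic map-matching primitive (NOT the article's
# algorithm): snap a noisy GPS point to the nearest road segment.
import math

def project_to_segment(p, a, b):
    """Closest point to p on segment a-b, and its distance to p."""
    ax, ay = a; bx, by = b; px, py = p
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0:
        t = 0.0                       # degenerate zero-length segment
    else:
        # clamp the orthogonal projection parameter to the segment
        t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len2))
    qx, qy = ax + t * dx, ay + t * dy
    return (qx, qy), math.hypot(px - qx, py - qy)

def map_match(point, segments):
    """Return (segment_id, snapped_point) for the nearest segment."""
    best = min(((sid, *project_to_segment(point, a, b))
                for sid, (a, b) in segments.items()),
               key=lambda r: r[2])
    return best[0], best[1]

# Hypothetical two-segment map and a noisy GPS fix.
segments = {"main_st": ((0, 0), (10, 0)),
            "side_rd": ((0, 5), (0, 15))}
sid, snapped = map_match((3, 1), segments)
print(sid, snapped)
```

A full map-matcher adds, on top of this primitive, route continuity constraints across consecutive fixes, which is precisely where incomplete map databases become problematic.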
International Journal of …, Jan 1, 2009
… human-readable information including geographic, demographic, environmental, historical and, perhaps … the area of Ubiquitous Computing and relates deeply with the connection humans have … We present here a PhD project named KUSCO (Knowledge discovering by …
Advances in Artificial Intelligence– …, Jan 1, 2008
In this paper, we present an approach to a challenge well known from the area of Ubiquitous Computing: extracting meaning out of geo-referenced information. The importance of this semantics-of-place problem is proportional to the number of available services and data that are …
Ambient Intelligence, Jan 1, 2011
This paper is about the automatic tagging of urban areas considering their constituent Points of Interest. First, our approach geographically clusters places that offer similar services in the same generic category (e.g. Food & Dining; Entertainment & Arts) in order to identify …
… of the 3rd International Workshop on …, Jan 1, 2010
During the last few years, the amount of online descriptive information about places and their dynamics has reached a reasonable dimension for many cities in the world. Such enriched information can now support semantic analysis of space, particularly with respect to what exists there and what happens there.