Nilani Algiriyage | Massey University

Papers by Nilani Algiriyage

Research paper thumbnail of Prediction of type 2 diabetes risk factor using machine learning in Sri Lanka

Diabetes mellitus ranks third in the index of 20 major diseases contributing to deaths in Sri Lanka. Diagnosing diabetes is an important yet demanding task, and no reliable, easy, and accurate method has been established for identifying diabetes mellitus at an early stage. Currently, diabetes detection is done using blood tests such as the glycated hemoglobin (A1C) test, random blood sugar test, fasting plasma glucose test, oral glucose tolerance test, and blood sugar series. People who do not have a special condition are generally unwilling to go for a blood test, a process that costs them time and money. Diabetes mellitus cannot be fully cured, but if identified at the prediabetes stage, it is possible to prevent prediabetes from developing into type 2 diabetes through actions such as eating healthy foods, losing weight, and being physically active. As there are no regular medical check-ups to diagnose pre-diabetes among the general public, identification of pre-diabetes is problematic in Sri Lanka. Ma...

Research paper thumbnail of DEES: a real-time system for event extraction from disaster-related web text

Social Network Analysis and Mining, Dec 11, 2022

Research paper thumbnail of Detecting access patterns through analysis of web logs

With the evolution of the Internet and the continuous growth of the global information infrastructure, the amount of data collected online from transactions and events has drastically increased. Web server access log files collect substantial data about web visitor access patterns. Data mining techniques can be applied to such data (a practice known as Web Mining) to reveal a lot of useful information about navigational patterns. In this research we analyze the patterns of web crawlers and human visitors through web server access log files. The objectives of this research are to detect web crawlers, identify suspicious crawlers, detect Googlebot impersonation, and profile human visitors. During human visitor profiling we group similar web visitors into clusters based on their browsing patterns and profile them. We show that web crawlers can be identified and successfully classified using heuristics. We evaluated our proposed methodology using seven test crawler scenarios and found that approximately 53.25% of web crawler sessions were from "known" crawlers and 34.16% exhibited suspicious behavior. We present an effective methodology to detect fake Googlebot crawlers by analyzing web access logs. We propose using Markov chain models to learn profiles of real and fake Googlebots based on their patterns of web resource access sequences. We calculated log-odds ratios for a given set of crawler sessions, and our results show that the higher the log-odds score, the higher the probability that a given sequence comes from the real Googlebot. Experimental results show that, at a threshold log-odds score, we can distinguish the real Googlebot from fakes. For human visitor profiling, an improved similarity measure is proposed and used as the distance measure in agglomerative hierarchical clustering on a data set from an e-commerce web site. To generate profiles, frequent item set mining is applied over the clusters. Our results show that proper visitor clustering can be achieved with the improved similarity measure.
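The heuristic crawler classification described above can be sketched roughly as follows. This is an illustrative sketch only: the session fields, user-agent patterns, and thresholds are all assumptions, not the thesis's actual rules.

```python
import re

# Hypothetical heuristics: known user-agent strings, robots.txt access,
# and the ratio of embedded-resource requests. Values are illustrative.
KNOWN_CRAWLER_AGENTS = re.compile(r"googlebot|bingbot|slurp", re.IGNORECASE)

def classify_session(session):
    """Label a log session dict as 'known_crawler', 'suspicious', or 'human'."""
    if KNOWN_CRAWLER_AGENTS.search(session["user_agent"]):
        return "known_crawler"
    # Crawlers typically fetch robots.txt and request few embedded images.
    fetched_robots = "/robots.txt" in session["paths"]
    image_ratio = session["image_requests"] / max(session["total_requests"], 1)
    if fetched_robots or image_ratio < 0.1:
        return "suspicious"
    return "human"

session = {"user_agent": "Mozilla/5.0 (compatible; Googlebot/2.1)",
           "paths": ["/robots.txt", "/index.html"],
           "image_requests": 0, "total_requests": 2}
print(classify_session(session))  # known_crawler
```

A real system would combine many more signals (request rate, HEAD requests, referrer fields); the point here is only the rule-based shape of the classification.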

Research paper thumbnail of A simulation approach for reduced outpatient waiting time

2014 14th International Conference on Advances in ICT for Emerging Regions (ICTer), 2014

Extended waiting times for treatment in national hospitals are very common in Sri Lanka. This situation has created several problems for patients, doctors, and even other health workers. The quality of service leaves a lot to be desired and is costly to the economy. This study analyses the different queues that create bottlenecks in the Out-Patient Department at the National Eye Hospital in Sri Lanka and critically evaluates several appointment scheduling rules with the help of a simulation model, to arrive at a solution that minimises total patient waiting time. Our results show that total patient waiting time can be reduced by more than 60% using a proper appointment scheduling system with process improvement.
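A toy single-server queue illustrates why appointment scheduling reduces waiting. This is far simpler than the paper's simulation model; the clinic size, service time, and slot spacing below are invented for illustration.

```python
import random

def total_wait(arrivals, service_time=5.0):
    """Total waiting time (minutes) for first-come-first-served service
    at a single consultation room."""
    clock, wait = 0.0, 0.0
    for t in sorted(arrivals):
        start = max(clock, t)   # service begins when both patient and room are free
        wait += start - t
        clock = start + service_time
    return wait

random.seed(1)
n, horizon = 60, 360.0  # 60 patients over a 6-hour clinic
walk_in = [random.uniform(0, horizon) for _ in range(n)]      # random walk-ins
scheduled = [i * horizon / n for i in range(n)]               # a slot every 6 minutes

# With 5-minute service and 6-minute slots, scheduled patients never queue,
# while random walk-ins cluster and accumulate waiting time.
print(total_wait(scheduled) == 0.0, total_wait(walk_in) > 0.0)
```

Real appointment-rule evaluation would also model no-shows, multiple doctors, and triage queues, but the queueing mechanics are the same.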

Research paper thumbnail of Distinguishing Real Web Crawlers from Fakes: Googlebot Example

2018 Moratuwa Engineering Research Conference (MERCon), 2018

Web crawlers are programs or automated scripts that scan web pages methodically to create indexes. Search engines such as Google and Bing use crawlers to provide web surfers with relevant information. Today there are also many crawlers that impersonate well-known web crawlers. For example, it has been observed that Google’s Googlebot crawler is impersonated to a high degree. This raises ethical and security concerns, as such impersonators can potentially be used for malicious purposes. In this paper, we present an effective methodology to detect fake Googlebot crawlers by analyzing web access logs. We propose using Markov chain models to learn profiles of real and fake Googlebots based on their patterns of web resource access sequences. We calculated log-odds ratios for a given set of crawler sessions, and our results show that the higher the log-odds score, the higher the probability that a given sequence comes from the real Googlebot. Experimental results show that, at a threshold log-odds score, we can distinguish the real Googlebot from fakes.

Research paper thumbnail of Real-time disaster event extraction from unstructured text sources

We present a system for real-time event extraction to support emergency response. Automatically extracting events from unstructured text can address the challenge of information scarcity currently faced by emergency responders. The task is to identify the main event from online text sources such as online news and tweets by answering the 5W1H questions (who, what, when, where, why, and how).

Research paper thumbnail of Offline analysis of web logs to identify offensive web crawlers

With the continuous growth and rapid advancement of web-based services, the traffic generated by web servers has drastically increased. Analyzing such data, normally known as clickstream data, can reveal a lot of information about web visitors. These data are often stored in web server “access log files” and in other related resources. Web clients can be broadly categorized into two groups: web crawlers and human visitors. In the recent past, the traffic generated by web crawlers has drastically increased. Web crawlers are programs or automated scripts that scan web pages methodically to create indexes. They traverse the hyperlink structure of the World Wide Web to locate and retrieve information. Web crawler programs are alternatively known as web robots, spiders, bots, and scrapers.

Research paper thumbnail of Traffic Flow Estimation Based on Deep Learning for Emergency Traffic Management using CCTV Images

Traffic flow estimation is the first step in the management of road traffic infrastructure and is essential for the successful deployment of intelligent transportation systems. Closed-circuit television (CCTV) systems are now popular and are mounted in many public places to support real-time surveillance. The data generated by CCTV cameras can be used as the foundation for accurate traffic flow estimation. This lightning talk is based on research carried out to answer the questions: 1) What object detection algorithm is best suited to the CCTV image data set for vehicle detection? 2) Can traffic flow be estimated by counting the number of vehicles in CCTV images using an object detection algorithm? We collect real-time CCTV imagery from traffic cameras through the New Zealand Transport Agency's (NZTA) traffic cameras Application Programming Interface (API). In the first experiment, we compare the performance and accuracy of the faster R-CNN, Mask R-CNN and YOLOv3 algorithms...

Research paper thumbnail of Multi-source Multimodal Data and Deep Learning for Disaster Response: A Systematic Review

SN Computer Science

Mechanisms for sharing information in a disaster situation have drastically changed due to new technological innovations throughout the world. The use of social media applications and collaborative technologies for information sharing has become increasingly popular. With these advancements, the amount of data collected increases daily in different modalities, such as text, audio, video, and images. However, to date, practical Disaster Response (DR) activities are mostly dependent on textual information, such as situation reports and email content, and the benefit of other media is often not realised. Deep Learning (DL) algorithms have recently demonstrated promising results in extracting knowledge from multiple modalities of data, but the use of DL approaches for DR tasks has thus far mostly been pursued in an academic context. This paper conducts a systematic review of 83 articles to identify the successes, current and future challenges, and opportunities in using DL for DR tasks. Our analysis is centred around the components of learning, a set of aspects that govern the application of Machine Learning (ML) to a given problem domain. A flowchart and guidance for future research are developed as an outcome of the analysis to ensure the benefits of DL for DR activities are realised.

Research paper thumbnail of Identifying Research Gap and Opportunities in the use of Multimodal Deep Learning for Emergency Management

Research paper thumbnail of Towards Real-time Traffic Flow Estimation using YOLO and SORT from Surveillance Video Footage

ISCRAM 2021 Conference Proceedings – 18th International Conference on Information Systems for Crisis Response and Management, 2021

Traffic emergencies and the resulting delays have a significant impact on the economy and society. Traffic flow estimation is one of the early steps in urban planning and managing traffic infrastructure. Traditionally, traffic flow rates were commonly measured using underground inductive loops, pneumatic road tubes, and temporary manual counts. However, these approaches cannot be used over large areas due to high costs, road surface degradation, and implementation difficulties. Recent advancements in computer vision techniques, in combination with freely available closed-circuit television (CCTV) datasets, have provided opportunities for vehicle detection and classification. This study addresses the problem of estimating traffic flow using low-quality video data from a surveillance camera. We trained the YOLOv4 algorithm for five object classes (car, truck, van, bike, and bus). We also introduce an algorithm to count vehicles using the SORT tracker based on movement direction, such as "northbound" and "southbound", to obtain traffic flow rates. The experimental results for CCTV footage in Christchurch, New Zealand show the effectiveness of the proposed approach. In future research, we expect to train on larger and more diverse datasets that cover various weather and lighting conditions.
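A direction-based counter in the spirit of the SORT-based approach can be illustrated as follows. The per-track centroid histories (as a tracker such as SORT would emit), the displacement threshold, and the image-coordinate convention are all assumptions, not the paper's implementation.

```python
def count_flows(tracks, min_displacement=20):
    """tracks: {track_id: [(x, y), ...]} centroid histories in image
    coordinates (y grows downward). Counts each track by net vertical motion."""
    counts = {"northbound": 0, "southbound": 0}
    for history in tracks.values():
        dy = history[-1][1] - history[0][1]
        if dy <= -min_displacement:
            counts["northbound"] += 1   # moved up in the frame
        elif dy >= min_displacement:
            counts["southbound"] += 1   # moved down in the frame
    return counts

tracks = {1: [(100, 300), (102, 240), (101, 180)],   # moving up
          2: [(220, 100), (221, 160), (223, 230)],   # moving down
          3: [(50, 150), (51, 155)]}                 # too little motion to count
print(count_flows(tracks))  # {'northbound': 1, 'southbound': 1}
```

Dividing the counts by the observation interval then yields the per-direction flow rates.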

Research paper thumbnail of Identifying Disaster-related Tweets: A Large-Scale Detection Model Comparison

ISCRAM 2021 Conference Proceedings – 18th International Conference on Information Systems for Crisis Response and Management, 2021

Social media applications such as Twitter and Facebook are fast becoming a key instrument for gaining situational awareness (understanding the bigger picture of the situation) during disasters. This has provided multiple opportunities to gather relevant information in a timely manner to improve disaster response. In recent years, identifying crisis-related social media posts has been treated as an automatic task using machine learning (ML) or deep learning (DL) techniques. However, such supervised learning algorithms require labelled training data in the early hours of a crisis. Recently, multiple manually labelled disaster-related open-source Twitter datasets have been released. In this work, we collected 192,948 tweets by combining a number of such datasets, then preprocessed, filtered, and deduplicated them, resulting in 117,954 tweets. We then evaluated the performance of multiple ML and DL algorithms in classifying disaster-related tweets in three settings, namely "in-disaster", "out-disaster" and "cross-disaster". Our results show that the Bidirectional LSTM model with Word2Vec embeddings performs well for the tweet classification task in all three settings. We also make the preprocessing steps and trained weights available for future research.
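The combine/preprocess/deduplicate step above can be sketched like this. The normalisation rules (lower-casing, stripping URLs and mentions) are plausible assumptions, not the paper's exact pipeline.

```python
import re

def normalise(tweet):
    """Canonical form used only for duplicate detection."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", "", text)   # strip URLs
    text = re.sub(r"@\w+", "", text)           # strip user mentions
    return re.sub(r"\s+", " ", text).strip()

def combine_and_dedupe(datasets):
    """Merge labelled (tweet, label) datasets, keeping the first copy
    of each tweet that is identical after normalisation."""
    seen, merged = set(), []
    for dataset in datasets:
        for tweet, label in dataset:
            key = normalise(tweet)
            if key and key not in seen:
                seen.add(key)
                merged.append((tweet, label))
    return merged

a = [("Flood in town! http://t.co/x", 1), ("nice weather", 0)]
b = [("flood in town!", 1)]  # duplicate of the first tweet after normalisation
print(len(combine_and_dedupe([a, b])))  # 2
```

Near-duplicate detection (retweets with small edits) would need fuzzier matching, e.g. shingling, but exact matching after normalisation already removes a large share of repeats.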

Research paper thumbnail of Web user profiling using hierarchical clustering with improved similarity measure

2015 Moratuwa Engineering Research Conference (MERCon), 2015

Web user profiling aims to group users into clusters with similar interests. Web sites attract many visitors, and gaining insight into their access patterns yields a lot of information. Web server access log files record every single request made by web site visitors. Applying web usage mining techniques allows interesting patterns to be identified. In this paper we improve the similarity measure proposed by Velásquez et al. [1] and use it as the distance measure in agglomerative hierarchical clustering on a data set from an online banking web site. To generate profiles, frequent item set mining is applied over the clusters. Our results show that proper visitor clustering can be achieved with the improved similarity measure.
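The clustering step can be illustrated with a minimal hand-rolled agglomerative (single-linkage) procedure. Jaccard distance on visited-page sets is used here purely as a stand-in for the paper's improved similarity measure; the sessions and threshold are made up.

```python
from itertools import combinations

def jaccard_distance(a, b):
    """Stand-in distance between two sessions (sets of visited pages)."""
    return 1.0 - len(a & b) / len(a | b)

def agglomerate(sessions, threshold=0.5):
    """Single-linkage agglomerative clustering: repeatedly merge the two
    closest clusters until the closest pair is farther than `threshold`."""
    clusters = [{i} for i in range(len(sessions))]
    while len(clusters) > 1:
        (i, j), d = min(
            ((pair, min(jaccard_distance(sessions[p], sessions[q])
                        for p in clusters[pair[0]] for q in clusters[pair[1]]))
             for pair in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1])
        if d > threshold:
            break
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters

sessions = [{"home", "loans"}, {"home", "loans", "rates"},
            {"cards", "offers"}, {"cards", "offers", "home"}]
print(agglomerate(sessions))  # [{0, 1}, {2, 3}]
```

Frequent item set mining over each resulting cluster (e.g. pages visited by most of its members) would then yield the profiles.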

Research paper thumbnail of Identification and characterization of crawlers through analysis of web logs

2013 IEEE 8th International Conference on Industrial and Information Systems, 2013

Research paper thumbnail of Traffic Flow Estimation based on Deep Learning for Emergency Traffic Management using CCTV Images

ISCRAM, 2020

Emergency Traffic Management (ETM) is one of the main problems in smart cities. This paper focuses on selecting an appropriate object detection model for identifying and counting vehicles in closed-circuit television (CCTV) images and then estimating traffic flow, as the first step in a broader project. A case study was therefore selected on one of the busiest roads in Christchurch, New Zealand. Two experiments were conducted in this research: 1) evaluating the accuracy and speed of three well-known object detection models, namely faster R-CNN, mask R-CNN, and YOLOv3, on the data set; 2) estimating traffic flow by counting the number of vehicles in each of four classes (car, bus, truck, and motorcycle). A simple Region of Interest (ROI) heuristic algorithm is used to classify vehicle movement direction, such as "left-lane" and "right-lane". This paper presents these early results and discusses the next steps.
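A simple ROI heuristic of the kind mentioned above can be sketched as follows: detections whose bounding-box centroid falls left of a dividing line are counted as "left-lane", otherwise "right-lane". The coordinates, classes, and divider position are invented for illustration.

```python
from collections import Counter

def count_by_lane(detections, divider_x=320):
    """detections: list of (class_name, (x1, y1, x2, y2)) bounding boxes
    from an object detector; returns per-(lane, class) counts."""
    counts = Counter()
    for cls, (x1, y1, x2, y2) in detections:
        lane = "left-lane" if (x1 + x2) / 2 < divider_x else "right-lane"
        counts[(lane, cls)] += 1
    return counts

dets = [("car", (100, 200, 180, 260)), ("bus", (400, 180, 560, 300))]
print(count_by_lane(dets))
```

In practice the ROI would be a polygon matched to the camera's view of each lane rather than a single vertical line.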

Research paper thumbnail of Prediction of type 2 diabetes risk factor using machine learning in Sri Lanka

Diabetes mellitus is in third place in the index of 20 major diseases affecting deaths in Sri Lan... more Diabetes mellitus is in third place in the index of 20 major diseases affecting deaths in Sri Lanka. Diagnosis of diabetes is a key and insipid task. A successful, easy and correct method has not been identified to identify the diabetes mellitus in the early stage. Currently, the Diabetes detection is done using blood tests, such as Glycated hemoglobin (A1C) test, Random blood sugar test, Fasting Plasma Glucose test, Oral Glucose Tolerance Test, and Blood Sugar Series. People who do not have a special condition are generally unwilling to go for a blood test, which is a process that costs them time and money. Diabetes mellitus cannot be fully cured, but if identified in prediabetes, it is possible to prevent prediabetes from developing into type II by the actions such as eating healthy foods, losing weight, being physically active. As there are no regular medical checkups to diagnose pre-diabetes among the general public, identification of pre-diabetes is problematic in Sri Lanka. Ma...

Research paper thumbnail of DEES: a real-time system for event extraction from disaster-related web text

Social Network Analysis and Mining, Dec 11, 2022

Research paper thumbnail of Detecting access patterns through analysis of web logs

With the evolution of the Internet and continuous growth of the global information infrastructure... more With the evolution of the Internet and continuous growth of the global information infrastructure, the amount of data collected online from transactions and events has been drastically increased. Web server access log files collect substantial data about web visitor access patterns. Data mining techniques can be applied on such data (which is known as Web Mining) to reveal lot of useful information about navigational patterns. In this research we analyze the patterns of web crawlers and human visitors through web server access log files. The objectives of this research are to detect web crawlers, identify suspicious crawlers, detect Googlebot impersonation and profile human visitors. During human visitor profiling we group similar web visitors into clusters based on their browsing patterns and profile them. We show that web crawlers can be identified and successfully classified using heuristics. We evaluated our proposed methodology using seven test crawler scenarios. We found that approximately 53.25% of web crawler sessions were from â ˘ AIJknownâ˘A ˙I crawlers and 34.16% exhibit suspicious behavior. We present an effective methodology to detect fake Googlebot crawlers by analyzing web access logs. We propose using Markov chain models to learn profiles of real and fake Googlebots based on their patterns of web resource access sequences. We have calculated log-odds ratios for a given set of crawler sessions and our results show that the higher the log-odds score, the higher the probability that a given sequence comes from the real Googlebot. Experimental results show, at a threshold log-odds score we can distinguish the real Googlebot from the fake. For the purpose of human visitor profiling, an improved similarity measure is proposed and it is used as the distance measure in an agglomerative hierarchical clustering for a data set from an e-commerce web site. 
To generate profiles, frequent item set mining is applied over the clusters. Our results show that proper visitor clustering can be achieved with the improved similarity measure

Research paper thumbnail of A simulation approach for reduced outpatient waiting time

2014 14th International Conference on Advances in ICT for Emerging Regions (ICTer), 2014

Extended waiting time for treatment in National hospitals is very common in Sri Lanka. This situa... more Extended waiting time for treatment in National hospitals is very common in Sri Lanka. This situation has created several problems to patients, doctors and even to other health workers. The quality of service leaves a lot to be desired and is costly to the economy. This study analyses different queues which create bottlenecks in the Out Patient Department at national eye hospital in Sri Lanka and critically evaluate several appointment scheduling rules with the help of a simulation model to come up with a solution which minimises the total patient waiting time. Our results shows that total patient waiting time can be reduced more than 60% using proper appointment scheduling system with process improvement.

Research paper thumbnail of Distinguishing Real Web Crawlers from Fakes: Googlebot Example

2018 Moratuwa Engineering Research Conference (MERCon), 2018

Web crawlers are programs or automated scripts that scan web pages methodically to create indexes... more Web crawlers are programs or automated scripts that scan web pages methodically to create indexes. Search engines such as Google, Bing use crawlers in order to provide web surfers with relevant information. Today there are also many crawlers that impersonate well-known web crawlers. For example, it has been observed that Google’s Googlebot crawler is impersonated to a high degree. This raises ethical and security concerns as they can potentially be used for malicious purposes. In this paper, we present an effective methodology to detect fake Googlebot crawlers by analyzing web access logs. We propose using Markov chain models to learn profiles of real and fake Googlebots based on their patterns of web resource access sequences. We have calculated log-odds ratios for a given set of crawler sessions and our results show that the higher the log-odds score, the higher the probability that a given sequence comes from the real Googlebot. Experimental results show, at a threshold log-odds score we can distinguish the real Googlebot from the fake.

Research paper thumbnail of Real-time disaster event extraction from unstructured text sources

We present a system for real-time event extraction to support emergency response. Automatically e... more We present a system for real-time event extraction to support emergency response. Automatically extracting events from the unstructured text can address the challenge of information scarcity currently faced by emergency responders. The task is to identify the main event from online text sources such as online news and tweets by answering 5W1H questions (who did, what, when, where, why and how).

Research paper thumbnail of Offline analysis of web logs to identify offensive web crawlers

With the continuous growth and rapid advancement of web based services, the traffic generated by ... more With the continuous growth and rapid advancement of web based services, the traffic generated by web servers have drastically increased. Analyzing such data, which is normally known as click stream data, could reveal a lot of information about the web visitors. These data are often stored in web server “access log files” and in other related resources. Web clients can be broadly categorized into two groups: web crawlers and human visitors. During recent past, the traffic generated by web crawlers has drastically increased. Web crawlers are programs or automated scripts that scan web pages methodically to create indexes. They traverse the hyperlink structure of the worldwide web to locate and retrieve information. Web crawler programs are alternatively known as web robots, spiders, bots and scrapers.

Research paper thumbnail of Traffic Flow Estimation Based on Deep Learning for Emergency Traffic Management using CCTV Images

Traffic flow estimation is the first step in the management of road traffic infrastructure and is... more Traffic flow estimation is the first step in the management of road traffic infrastructure and is essential for the successful deployment of intelligent transportation systems. Closed-circuit television (CCTV) systems are now popular and are mounted in many public places to support real-time surveillance. The data generated by CCTV cameras can be used as the foundation for accurate traffic flow estimation. The lightning talk is based on research carried out seeking to answer the questions; 1) What object detection algorithm is best suited to the CCTV image data set for vehicle detection? 2) Can traffic flow be estimated by counting the number of vehicles in CCTV images using an object detection algorithm?<br>We collect real-time CCTV imagery from traffic cameras through the New Zealand Transport Agency's (NZTA) traffic cameras Application Programming Interface (API). In the first experiment, we compare the performance and accuracy of faster R-CNN, Mask R-CNN and YOLOv3 alg...

Research paper thumbnail of Multi-source Multimodal Data and Deep Learning for Disaster Response: A Systematic Review

SN Computer Science

Mechanisms for sharing information in a disaster situation have drastically changed due to new te... more Mechanisms for sharing information in a disaster situation have drastically changed due to new technological innovations throughout the world. The use of social media applications and collaborative technologies for information sharing have become increasingly popular. With these advancements, the amount of data collected increases daily in different modalities, such as text, audio, video, and images. However, to date, practical Disaster Response (DR) activities are mostly depended on textual information, such as situation reports and email content, and the benefit of other media is often not realised. Deep Learning (DL) algorithms have recently demonstrated promising results in extracting knowledge from multiple modalities of data, but the use of DL approaches for DR tasks has thus far mostly been pursued in an academic context. This paper conducts a systematic review of 83 articles to identify the successes, current and future challenges, and opportunities in using DL for DR tasks. Our analysis is centred around the components of learning, a set of aspects that govern the application of Machine learning (ML) for a given problem domain. A flowchart and guidance for future research are developed as an outcome of the analysis to ensure the benefits of DL for DR activities are utilized.

Research paper thumbnail of Identifying Research Gap and Opportunities in the use of Multimodal Deep Learning for Emergency Management

Research paper thumbnail of Towards Real-time Traffic Flow Estimation using YOLO and SORT from Surveillance Video Footage

ISCRAM 2021 Conference Proceedings – 18th International Conference on Information Systems for Crisis Response and Management, 2021

Traffic emergencies and resulting delays cause a significant impact on the economy and society. T... more Traffic emergencies and resulting delays cause a significant impact on the economy and society. Traffic flow estimation is one of the early steps in urban planning and managing traffic infrastructure. Traditionally, traffic flow rates were commonly measured using underground inductive loops, pneumatic road tubes, and temporary manual counts. However, these approaches can not be used in large areas due to high costs, road surface degradation and implementation difficulties. Recent advancement of computer vision techniques in combination with freely available closed-circuit television (CCTV) datasets has provided opportunities for vehicle detection and classification. This study addresses the problem of estimating traffic flow using low-quality video data from a surveillance camera. Therefore, we have trained the novel YOLOv4 algorithm for five object classes (car, truck, van, bike, and bus). Also, we introduce an algorithm to count the vehicles using the SORT tracker based on movement direction such as "northbound" and "southbound" to obtain the traffic flow rates. The experimental results, for a CCTV footage in Christchurch, New Zealand shows the effectiveness of the proposed approach. In future research, we expect to train on large and more diverse datasets that cover various weather and lighting conditions.

Research paper thumbnail of Identifying Disaster-related Tweets: A Large-Scale Detection Model Comparison

ISCRAM 2021 Conference Proceedings – 18th International Conference on Information Systems for Crisis Response and Management, 2021

Social media applications such as Twitter and Facebook are fast becoming a key instrument for gaining situational awareness (understanding the bigger picture of the situation) during disasters. This provides multiple opportunities to gather relevant information in a timely manner to improve disaster response. In recent years, identifying crisis-related social media posts has been treated as an automatic classification task using machine learning (ML) or deep learning (DL) techniques. However, such supervised learning algorithms require labelled training data in the early hours of a crisis. Recently, multiple manually labelled disaster-related open-source Twitter datasets have been released. In this work, we collected 192,948 tweets by combining a number of such datasets, then preprocessed, filtered and deduplicated them, resulting in 117,954 tweets. We then evaluated the performance of multiple ML and DL algorithms in classifying disaster-related tweets in three settings, namely "in-disaster", "out-disaster" and "cross-disaster". Our results show that a Bidirectional LSTM model with Word2Vec embeddings performs well on the tweet classification task in all three settings. We also make the preprocessing steps and trained weights available for future research.
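The preprocessing and duplicate-removal step described above can be sketched as follows. The exact normalisation rules are not stated in the abstract; the choices here (lowercasing, stripping URLs, @mentions and the "RT" prefix) are illustrative assumptions, not the authors' pipeline.

```python
import re

def normalise(tweet):
    """Normalise a tweet for near-duplicate detection: lowercase,
    strip the 'RT' prefix, URLs and @mentions, replace punctuation
    with spaces, collapse whitespace. Rules are illustrative."""
    t = tweet.lower()
    t = re.sub(r"^rt\s+", "", t)            # retweet prefix
    t = re.sub(r"https?://\S+", "", t)      # URLs
    t = re.sub(r"@\w+", "", t)              # mentions
    t = re.sub(r"[^a-z0-9# ]", " ", t)      # punctuation -> space
    return " ".join(t.split())

def deduplicate(tweets):
    """Keep the first tweet seen for each normalised form."""
    seen, kept = set(), []
    for tweet in tweets:
        key = normalise(tweet)
        if key and key not in seen:
            seen.add(key)
            kept.append(tweet)
    return kept
```

A retweet of an already-collected tweet then collapses onto the same key and is dropped, which is how combining several open-source datasets can shrink from 192,948 to 117,954 tweets.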

Research paper thumbnail of Web user profiling using hierarchical clustering with improved similarity measure

2015 Moratuwa Engineering Research Conference (MERCon), 2015

Web user profiling aims to group users into clusters with similar interests. Web sites attract many visitors, and gaining insight into their access patterns reveals a lot of information. Web server access log files record every single request made by web site visitors, and applying web usage mining techniques allows interesting patterns to be identified. In this paper we improve the similarity measure proposed by Velásquez et al. [1] and use it as the distance measure in an agglomerative hierarchical clustering of a data set from an online banking web site. To generate profiles, frequent item set mining is applied over the clusters. Our results show that proper visitor clustering can be achieved with the improved similarity measure.
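Agglomerative clustering over a custom session distance, as described above, can be sketched with SciPy. The toy feature vectors and the cosine-style similarity are illustrative stand-ins; the paper's improved measure also compares visited-page sequences, which is omitted here.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Toy session vectors: time spent on each of four page categories.
# Illustrative data, not the banking-site log data used in the paper.
sessions = np.array([
    [30.0, 0.0, 5.0, 0.0],
    [28.0, 2.0, 4.0, 0.0],
    [0.0, 40.0, 0.0, 10.0],
    [1.0, 38.0, 0.0, 12.0],
])

def session_distance(u, v):
    """Distance = 1 - similarity, with similarity taken as the cosine
    of the angle between time-on-category vectors."""
    sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return 1.0 - sim

# Average-linkage agglomerative clustering on the pairwise distances,
# cut into two visitor clusters.
dist = pdist(sessions, metric=session_distance)
labels = fcluster(linkage(dist, method="average"), t=2, criterion="maxclust")
```

Frequent item set mining over the pages visited within each resulting cluster would then yield the profiles.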

Research paper thumbnail of Identification and characterization of crawlers through analysis of web logs

2013 IEEE 8th International Conference on Industrial and Information Systems, 2013

Research paper thumbnail of Traffic Flow Estimation based on Deep Learning for Emergency Traffic Management using CCTV Images

ISCRAM, 2020

Emergency Traffic Management (ETM) is one of the main challenges in smart cities. This paper focuses on selecting an appropriate object detection model for identifying and counting vehicles in closed-circuit television (CCTV) images and then estimating traffic flow, as the first step in a broader project. A case was selected on one of the busiest roads in Christchurch, New Zealand. Two experiments were conducted in this research: 1) evaluating the accuracy and speed of three well-known object detection models, namely Faster R-CNN, Mask R-CNN and YOLOv3, on the data set; 2) estimating traffic flow by counting the number of vehicles in each of four classes: car, bus, truck and motorcycle. A simple Region of Interest (ROI) heuristic algorithm is used to classify vehicle movement direction, such as "left-lane" and "right-lane". This paper presents the early results and discusses the next steps.
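The ROI heuristic for assigning detections to a lane can be sketched as follows. This is a minimal sketch under the assumption of a fixed camera view split by a vertical divider; the divider position and bounding-box format are illustrative, not the paper's exact configuration.

```python
def lane_of(bbox, divider_x=320):
    """Classify a detection as 'left-lane' or 'right-lane' from the
    horizontal position of its bounding-box centroid. divider_x is an
    illustrative camera-specific constant."""
    x1, y1, x2, y2 = bbox
    cx = (x1 + x2) / 2.0
    return "left-lane" if cx < divider_x else "right-lane"

def flow_by_lane(detections, divider_x=320):
    """Tally per-lane counts for one frame's detections."""
    counts = {"left-lane": 0, "right-lane": 0}
    for bbox in detections:
        counts[lane_of(bbox, divider_x)] += 1
    return counts
```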

Research paper thumbnail of Distinguishing Real Web Crawlers from Fakes: Googlebot Example

Web crawlers are programs or automated scripts that scan web pages methodically to create indexes. Search engines such as Google and Bing use crawlers to provide web surfers with relevant information. Today there are also many crawlers that impersonate well-known web crawlers; for example, it has been observed that Google's Googlebot crawler is impersonated to a high degree. This raises ethical and security concerns, as such crawlers can potentially be used for malicious purposes. In this paper, we present an effective methodology to detect fake Googlebot crawlers by analyzing web access logs. We propose using Markov chain models to learn profiles of real and fake Googlebots based on their patterns of web resource access sequences. We calculated log-odds ratios for a given set of crawler sessions, and our results show that the higher the log-odds score, the higher the probability that a given sequence comes from the real Googlebot. Experimental results show that, at a threshold log-odds score, we can distinguish the real Googlebot from the fake.
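The log-odds scoring of resource-access sequences can be sketched as follows. The tiny training sequences, the three resource types and the add-one smoothing are illustrative assumptions; the paper fits the two Markov chains to labelled real and fake Googlebot sessions from the logs.

```python
import math
from collections import defaultdict

def fit_markov(sequences, states):
    """First-order Markov chain: transition probabilities with
    add-one (Laplace) smoothing over the given state set."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    n = len(states)
    probs = {}
    for a in states:
        total = sum(counts[a].values()) + n
        probs[a] = {b: (counts[a][b] + 1) / total for b in states}
    return probs

def log_odds(seq, real, fake):
    """Sum of log(P_real / P_fake) over a session's transitions;
    higher scores indicate the real Googlebot."""
    return sum(math.log(real[a][b] / fake[a][b]) for a, b in zip(seq, seq[1:]))

# Illustrative resource types and training sessions.
states = ["robots", "page", "image"]
real_model = fit_markov([["robots", "page", "page", "image"]], states)
fake_model = fit_markov([["page", "image", "image", "image"]], states)
```

Classifying a session then reduces to comparing its log-odds score against the learned threshold.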

Research paper thumbnail of A Simulation for Reduced Outpatient Waiting Time

Extended waiting time for treatment in national hospitals is very common in Sri Lanka. This situation has created several problems for patients, doctors and even other health workers; the quality of service leaves a lot to be desired and is costly to the economy. This study analyses the different queues that create bottlenecks in the Outpatient Department at the National Eye Hospital in Sri Lanka and critically evaluates several appointment scheduling rules with the help of a simulation model, to arrive at a solution that minimises total patient waiting time. Our results show that total patient waiting time can be reduced by more than 60% using a proper appointment scheduling system combined with process improvement.
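The core comparison the simulation makes, walk-in arrivals versus scheduled appointments at a single service point, can be sketched with a minimal discrete-event model. The arrival and service times below are illustrative, not the hospital data or scheduling rules used in the paper.

```python
def simulate(arrivals, service_time):
    """Single-server FIFO queue: returns the mean patient waiting
    time (time from arrival until service starts), in minutes."""
    server_free_at = 0.0
    total_wait = 0.0
    for t in sorted(arrivals):
        start = max(t, server_free_at)   # wait if the server is busy
        total_wait += start - t
        server_free_at = start + service_time
    return total_wait / len(arrivals)

# Walk-in pattern: all ten patients arrive at opening time.
burst = [0.0] * 10
# Scheduled pattern: appointments spaced at the service time.
scheduled = [i * 6.0 for i in range(10)]

mean_wait_burst = simulate(burst, service_time=6.0)
mean_wait_scheduled = simulate(scheduled, service_time=6.0)
```

Even this toy model shows how spacing appointments to the service rate collapses queueing delay, which is the intuition behind the scheduling rules evaluated in the paper.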
