Carlo Aliprandi - Academia.edu
Papers by Carlo Aliprandi
Carlo Aliprandi, Federico Neri – SYNTHEMA; Andrea Marchetti, Francesco Ronzano, Maurizio Tesconi – CNR IIT; Claudia Soria, Monica Monachini – CNR ILC; Piek Vossen – VUA/IRION; Wauter Bosma – VUA; Eneko Agirre, Xabier Artola, Arantza Diaz de Ilarraza, German Rigau, Aitor Soroa – EHU ... Knowledge Yielding Ontologies for Transition-based Organization ... Prof. Dr. Piek TJM Vossen, VU University Amsterdam, Tel. + 31 (0) 20 5986466, Fax. + 31 (0) 20 5986500, Email: p.vossen@let.vu.nl
Data Mining IX, 2008
This paper describes a content enabling system that provides deep semantic search and information access to large quantities of distributed multimedia data for both experts and the general public. It provides language-independent search and dynamic classification features for a broad range of data collected from several sources in a number of culturally diverse languages. This system is part of the Online Police Station, launched by the Italian Minister of the Interior in 2006. The Online Police Station uses a virtual reality interface to provide general information and online assistance. Citizens can download forms, make complaints, receive advice and/or report events of an illegal nature. Police specialists can monitor criminal trends to ensure that responses are appropriately focused and that scarce resources are employed more effectively against criminality. The Online Police Station was voted the "Most inspiring good practice for creative solutions to common challenges" at the European eGovernment Awards 2007.
Multimedia Tools and Applications, 2015
Demand for subtitling of multimedia content has grown quickly in recent years, especially after the adoption of the new European audiovisual legislation, which forces to make multimedia content...
This paper describes the data collection, annotation and sharing activities carried out within the FP7 EU-funded SAVAS project. The project aims to collect, share and reuse audiovisual language resources from broadcasters and subtitling companies in order to develop large-vocabulary continuous speech recognisers for specific domains and new languages, with the purpose of meeting the automated subtitling needs of the media industry.
International Broadcasting Convention (IBC) 2014 Conference, 2014
The demand for Access Services has grown quickly over the years, mainly due to national and international laws. This trend is expected to consolidate for subtitling in particular, as almost every broadcaster now works with digital content and large amounts of existing assets are going to be digitized in the near future. In terms of accessibility, digitization is a very challenging task that can be turned into a profitable process if addressed with adequate technology. In this paper we focus on an emerging technique: Assisted Subtitling. Assisted Subtitling consists of applying Automatic Speech Recognition (ASR) to generate transcripts of programs and using the transcripts as the basis for subtitles. This paper reports on recent advances in ASR, presenting SAVAS, a novel Speaker Independent ASR technology specifically designed for Live Subtitling. We describe the technology and, evaluating its performance, present the promising results achieved so far.
2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014), 2014
Organised crime uses information technology systems to communicate, work or expand its influence. The EU FP7 Security Research Project CAPER (Collaborative information, Acquisition, Processing, Exploitation and Reporting for the prevention of organised crime), created in cooperation with European Law Enforcement Agencies (LEAs), aims to build a common collaborative and information sharing platform for the detection and prevention of organised crime, which exploits Open Source Intelligence (OSINT). LEAs are becoming more inclined to use OSINT tools, particularly tools able to manage data from Online Social Networks (OSNs). This paper presents the CAPER Facebook crawling and analysis subsystem. Heuristic algorithms have been implemented to extract specific properties of Facebook's social graph, in particular user interactions. To support analysis tasks specifically, extensive effort has been spent on the analysis of textual user-generated content and on the recognition of named entities, in particular person names, locations and organisations. Relationships between users and entities mentioned in posts and in related comments are created and merged into the user networks extracted from the social graph. All entity relationships are finally visualised in user-friendly network graphs.
HCI International 2014 - Posters’ Extended Abstracts, 2014
In this paper we present a web application that exploits OpeNER Cloud Services. Ent-it-UP monitors Social Media and traditional Mass Media content, performing multilingual Named Entity Recognition and Sentiment Analysis. Since consumers tend to trust the opinions of other consumers, reviews and ratings on the internet are increasingly important. Given the huge amount of data flowing through the web, it has become necessary to adopt an automatic data analysis strategy in order to understand what people think about a certain product, brand or topic. The goal of Ent-it-UP is to compute statistics about retrieved entities and display the results in a communicative, intuitive and user-friendly interface. In this way the end user can easily get a sense of people's opinions without spending too much time analyzing the huge amount of User-Generated Content.
Communications in Computer and Information Science, 2014
Law Enforcement Agencies (LEAs) are increasingly reliant on information and communication technologies and affected by a society shaped by the Internet and Social Media. The richness and quantity of information available from open sources, if properly gathered and processed, can provide valuable intelligence and help in drawing inferences from existing closed source intelligence. This paper presents CAPER, a state-of-the-art platform for the prevention of organised crime, created in cooperation with European LEAs. CAPER supports information sharing and multi-modal analysis of open and closed information sources, mainly based on Natural Language Processing (NLP) and Visual Analytics (VA) technologies.
At present, the availability of high quality annotated corpora is fundamental to carrying out or evaluating several Natural Language Processing and Text Mining tasks. To create consistently annotated corpora, direct human intervention is a key factor: teams of manual taggers, usually composed of linguistically skilled people, are needed to refine existing annotations or to add new ones. As a consequence, manual corpus annotation is an expensive and highly demanding task in terms of the resources involved.
EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020, 2020
In this paper we describe the systems we used to participate in the TAG-it task of EVALITA 2020. The first system we developed uses a linear Support Vector Machine as its learning algorithm. The other two systems are based on the pretrained Italian Language Model UmBERTo: one was developed following the Multi-Task Learning approach, the other following the Single-Task Learning approach. These systems have been evaluated on the official TAG-it test sets and ranked first in all the TAG-it subtasks, demonstrating the validity of the approaches we followed.
Procedia Manufacturing, 2021
In recent years there has been a growing interest in Circular Economy (CE), which promises to reduce waste and improve sustainability. The promise of CE is to change the conventional "take-make-dispose" model, which causes massive waste flows, by integrating demanufacturing and remanufacturing processes within value chains. This integration requires breaking the "silos" of the circular chain to establish new collaborative and sustainable value networks. The paper introduces a novel digital platform for the CE, which is currently under development in the H2020 DigiPrime project. The platform is intended to facilitate seamless and trusted information exchange across circular actors, while offering a range of value-added services that enable manufacturers, remanufacturers, recyclers and other actors to gain insights into the status of recycling and waste management processes. These insights facilitate the implementation of zero waste processes, along with the assessment of the performance of the circular chain. The paper introduces the architecture of the digital platform, along with its data modelling, exchange and data traceability mechanisms. It also presents a CE use case used to validate the platform.
Demand for subtitling has grown quickly over the years. Manual subtitling is no longer feasible, due to increased costs and reduced production times. Assisted Subtitling is an emerging technique that consists of applying Automatic Speech Recognition (ASR) to automatically generate program transcripts. This paper reports on recent advances in ASR, presenting SAVAS, a novel Speaker Independent ASR technology specifically designed for Live Subtitling. We describe the technology, presenting its features and detailing the language and domain-specific tunings that we have carried out. We also introduce the S.Scribe!, S.Live! and S.Respeak! systems, which are based on SAVAS. S.Scribe! is a batch Speaker Independent Transcription system for subtitling. S.Live! is a first-of-a-kind Speaker Independent Transcription System, with real-time performance for online subtitling. S.Respeak! is a collaborative Respeaking System, for live and batch production of multilin...
2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 2012
The Web is a huge virtual space in which to express and share individual opinions, influencing any aspect of life, with implications for marketing and communication alike. Social Media are influencing consumers' preferences by shaping their attitudes and behaviors. Monitoring Social Media activities is a good way to measure customers' loyalty, keeping track of their sentiment towards brands or products. Social Media are the next logical marketing arena. Currently, Facebook dominates the digital marketing space, followed closely by Twitter. This paper describes a Sentiment Analysis study performed on more than 1000 Facebook posts about newscasts, comparing the sentiment for Rai, the Italian public broadcasting service, with that for the emerging and more dynamic private company La7. The study maps its results to observations made by the Osservatorio di Pavia, an Italian research institute specialized in media analysis at the theoretical and empirical level and engaged in the analysis of political communication in the mass media. The study also takes into account the data provided by Auditel regarding newscast audience, correlating the analysis of Social Media, and of Facebook in particular, with measurable data available in the public domain.
Communications in Computer and Information Science, 2011
We introduce the CAPER project (Collaborative information, Acquisition, Processing, Exploitation and Reporting), partially funded by the European Commission. The goal of CAPER is to create a common platform for the prevention of organized crime through sharing, exploitation and linking of Open and Closed information Sources. CAPER will support collaborative multilingual analysis of unstructured and audiovisual contents, based on Text Mining
We present FastType, a word prediction system for Italian, an inflected language, and its user-centric interface. FastType has greatly evolved from its original features. We have added new linguistic resources, implemented more efficient prediction algorithms and built a brand-new user interface. Thanks to the prediction engine upgrades, such as the generation of word and Part-of-Speech n-gram collections and the introduction of a linear combination algorithm, performance has greatly improved. Keystroke Saving reached 48% and is now comparable to that achieved with state-of-the-art methods for non-inflected languages. DonKey, the new human-computer interface, allows the user to benefit from automatic word completion in any application. FastType is primarily designed for users with special needs and to reduce misspellings for users with linguistic difficulties.
We present FastType, an innovative system for word and letter prediction for an inflected language, namely Italian. The system is based on combined statistical and lexical methods and uses robust language resources. Word prediction is particularly useful for minimising keystrokes for users with special needs and for reducing misspellings for users with limited Italian proficiency. Word prediction can also be used effectively in language learning, by suggesting correct and well-formed words to non-native users. This is significant, and particularly difficult to cope with, for inflected languages such as Italian, where the correct word form depends on the context. After describing the system, we evaluate its performance and, besides the high Keystroke Saving, we show that FastType overcomes typical word prediction limitations, achieving outstanding results even over a very large dictionary of words.
Much information of potential relevance to police investigations of organised crime is available in public sources without being recognised and used. Barriers to the simple and efficient exploitation of this information include the facts that not everything is easily searchable and that sources may be written in a language other than the investigator's. To help overcome these problems, the CAPER project aims to create an integrated platform for acquisition, processing, and analysis of information in multiple languages, and to link this to legacy police IT systems. Full Natural Language Processing pipelines for multiple languages and media are used to map persons and organisations to actions and events, and multilingual lexicons and gazetteers allow cross-lingual search in the indexed data. Domain-specific lexicons contain words and slang expressions with special senses in the context of organised crime. The system supports multilingual analysis of unstructured and audiovisual contents, based on text mining for fourteen languages, and uses language-neutral interfaces, so that the addition of further languages will not require any modification of existing components.
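The two quantities this abstract mentions, a linear combination of prediction sources and Keystroke Saving, can be illustrated with a minimal sketch. The abstract does not give FastType's formulas, so the code below uses a common definition of Keystroke Saving and a plain weighted sum; the function names, the weight and the toy probabilities are all hypothetical.

```python
def keystroke_saving(chars_total: int, keys_pressed: int) -> float:
    """Percentage of keystrokes saved versus typing every character.

    Assumed definition (not taken from the abstract):
    100 * (chars_total - keys_pressed) / chars_total.
    """
    return 100.0 * (chars_total - keys_pressed) / chars_total


def interpolate(word_probs: dict, pos_probs: dict, weight: float) -> dict:
    """Linearly combine two score tables: weight * word + (1 - weight) * pos.

    A candidate missing from one table contributes 0.0 from that table.
    """
    candidates = set(word_probs) | set(pos_probs)
    return {
        w: weight * word_probs.get(w, 0.0) + (1.0 - weight) * pos_probs.get(w, 0.0)
        for w in candidates
    }


def top_predictions(scores: dict, k: int) -> list:
    """Return the k best candidates, highest score first, ties broken alphabetically."""
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Toy example: the word model cannot separate the two candidates,
# the Part-of-Speech model prefers "casa".
combined = interpolate({"casa": 0.5, "cane": 0.5}, {"casa": 1.0}, weight=0.5)
print(top_predictions(combined, 2))
print(keystroke_saving(chars_total=100, keys_pressed=52))
```

With these toy inputs the Part-of-Speech scores break the tie left by the word model, which is the general motivation for combining the two sources.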
Lecture Notes in Computer Science, 2011
Many attempts have been made to extract structured data from Web resources, exposing them as RDF triples and interlinking them with other RDF datasets: in this way it is possible to create clouds of highly integrated Semantic Web data collections. In this paper we describe an approach to enhance the extraction of semantic contents from unstructured textual documents, in particular considering Wikipedia articles and focusing on event mining. Starting from the deep parsing of a set of English Wikipedia articles, we produce a semantic annotation compliant with the Knowledge Annotation Format (KAF). We extract events from the KAF semantic annotation and then we structure each event as a set of RDF triples linked to both DBpedia and WordNet. We point out examples of automatically mined events, providing some general evaluation of how our approach may discover new events and link them to existing contents.
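As a rough illustration of structuring an event as a set of triples, the sketch below represents each triple as a plain (subject, predicate, object) tuple. Only the DBpedia resource prefix is a real namespace; the `example.org` namespace, the role names and the sense identifier are placeholders, since the abstract does not describe the vocabulary actually used.

```python
DBPEDIA = "http://dbpedia.org/resource/"
EX = "http://example.org/event/"  # hypothetical namespace, for illustration only


def event_to_triples(event_id: str, predicate_sense: str, participants: dict) -> list:
    """Turn one mined event into (subject, predicate, object) tuples.

    predicate_sense stands in for a link to a WordNet sense; participants
    maps a role name to an entity label that is linked to DBpedia.
    """
    subject = EX + event_id
    triples = [(subject, EX + "type", predicate_sense)]
    # Sort roles so the output order is deterministic.
    for role, entity in sorted(participants.items()):
        triples.append((subject, EX + role, DBPEDIA + entity.replace(" ", "_")))
    return triples


for triple in event_to_triples("e1", "wn:attack", {"agent": "Julius Caesar"}):
    print(triple)
```

A real pipeline would emit these with an RDF library and the project's own ontology; tuples are used here only to keep the sketch self-contained.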
2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014), 2014
Law Enforcement Agencies (LEAs) are increasingly reliant on information and communication technologies and affected by a society shaped by the Internet. The richness and quantity of information available from open sources, if properly gathered and processed, can provide valuable intelligence and help in drawing inferences from existing closed source intelligence. Today the intelligence cycle is characterized by manual collection and integration of data. Named Entity Recognition (NER) plays a fundamental role in Open Source Intelligence (OSINT) solutions when fighting crime. This paper describes the implementation of a NER-based focused web crawler under the EU FP7 Security Research Project CAPER (Collaborative information, Acquisition, Processing, Exploitation and Reporting for the prevention of organized crime). The crawler supports two modes: (1) looking for documents starting from a URL down to a parametric depth of levels, optionally specifying a keyword that has to be contained in the page and in the related links; and (2) looking for a parametric number of documents starting from a keyword, entrusting the keyword search to one of the principal search engines and thus behaving as a meta-search engine. In addition, the crawler is able to retrieve only those documents that contain information semantically relevant to the query (in other words, the required keyword with the required sense). This is achieved through the use of NER technologies. In this paper we present the CAPER NER-based Semantic Crawler, which has proven to be a suitable tool for focused crawling, allowing LEAs to drastically reduce data collection and integration efforts.
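The first crawling mode described above (start URL, depth limit, required keyword) can be sketched as a breadth-first traversal. This is not the CAPER implementation: it runs over an in-memory dictionary instead of the network, it assumes a link is followed only from pages that contain the keyword, and it omits the NER-based sense filtering. The `pages` structure and all names are hypothetical.

```python
from collections import deque


def crawl(pages: dict, start_url: str, max_depth: int, keyword: str) -> list:
    """Breadth-first crawl up to max_depth, keeping pages that contain keyword.

    pages maps a URL to a (text, outgoing_links) pair and stands in for
    fetching and parsing real documents.
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    hits = []
    while queue:
        url, depth = queue.popleft()
        text, links = pages.get(url, ("", []))
        if keyword.lower() not in text.lower():
            continue  # assumption: non-matching pages are neither kept nor expanded
        hits.append(url)
        if depth < max_depth:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return hits


toy_pages = {
    "a": ("mafia report", ["b", "c"]),
    "b": ("weather", ["d"]),
    "c": ("mafia trial", ["d"]),
    "d": ("mafia news", []),
}
print(crawl(toy_pages, "a", max_depth=1, keyword="mafia"))
```

Raising `max_depth` to 2 lets the traversal reach page `d` through `c`, which shows how the depth parameter bounds the amount of collected material.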
Carlo Aliprandi, Federico Neri – SYNTHEMA Andrea Marchetti , Francesco Ronzano, Maurizio Tesconi ... more Carlo Aliprandi, Federico Neri – SYNTHEMA Andrea Marchetti , Francesco Ronzano, Maurizio Tesconi – CNR IIT Claudia Soria, Monica Monachini – CNR ILC Piek Vossen – VUA/IRION Wauter Bosma – VUA Eneko Agirre, Xabier Artola, Arantza Diaz de Ilarraza, German Rigau, Aitor Soroa - EHU ... Knowledge Yielding Ontologies for Transition-based Organization ... Knowledge Yielding Ontologies for Transition-based Organization ... Prof. Dr. Piek TJM Vossen VU University Amsterdam Tel. + 31 (0) 20 5986466 Fax. + 31 (0) 20 5986500 Email: p.vossen@let.vu.nl
Data Mining IX, 2008
This paper describes a content enabling system that provides deep semantic search and information... more This paper describes a content enabling system that provides deep semantic search and information access to large quantities of distributed multimedia data for both experts and the general public. It provides a language independent search and dynamic classification features for a broad range of data collected from several sources in a number of culturally diverse languages. This system is part of the Online Police Station, launched by the Italian Minister of the Interior in 2006. The Online Police Station uses a virtual reality interface to provide general information and online assistance. Citizens can download forms, make complaints, receive advice and/or report events of an illegal nature. Police specialists can monitor criminal trends to ensure that responses are appropriately focused, and that scarce resources are more effectively employed against criminality. Online Police Station was voted as the Most inspiring good practice for creative solutions to common challenges, during the last European eGovernment Awards 2007.
Multimedia Tools and Applications, 2015
The subtitling demand of multimedia content has grown quickly over the last years, especially aft... more The subtitling demand of multimedia content has grown quickly over the last years, especially after the adoption of the new European audiovisual legislation, which forces to make multimedia content...
This paper describes the data collection, annotation and sharing activities carried out within th... more This paper describes the data collection, annotation and sharing activities carried out within the FP7 EU-funded SAVAS project. The project aims to collect, share and reuse audiovisual language resources from broadcasters and subtitling companies to develop large vocabulary continuous speech recognisers in specific domains and new languages, with the purpose of solving the automated subtitling needs of the media industry.
International Broadcasting Convention (IBC) 2014 Conference, 2014
ABSTRACT The demand for Access Services has quickly grown over the years, mainly due to National ... more ABSTRACT The demand for Access Services has quickly grown over the years, mainly due to National and International laws. This trend is expected to consolidate for subtitling in particular, as almost every broadcaster is nowadays working with digital content: large amounts of existing assets are going to be digitized in the near future. In terms of accessibility, digitalization is a very challenging task that can be turned into a profitable process if addressed with adequate technology. In this paper we will focus on an emerging technique: Assisted Subtitling. Assisted Subtitling consists in the application of Automatic Speech Recognition (ASR) to generate transcripts of programs and to use the transcripts as the basis for subtitles. This paper will report on recent advances in ASR, presenting SAVAS, a novel Speaker Independent ASR technology specifically designed for Live Subtitling. We will describe the technology and, evaluating its performances, we will present the promising results we have so far achieved.
2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014), 2014
Organised crime uses information technology systems to communicate, work or expand its influence.... more Organised crime uses information technology systems to communicate, work or expand its influence. The EU FP7 Security Research Project CAPER (Collaborative information, Acquisition, Processing, Exploitation and Reporting for the prevention of organised crime), created in cooperation with European Law Enforcement Agencies (LEAs), aims to build a common collaborative and information sharing platform for the detection and prevention of organised crime, which exploits Open Source Intelligence (OSINT). LEAs are becoming more inclined to using OSINT tools, and particularly tools able to manage Online Social Networks (OSNs) data. This paper presents the CAPER Facebook crawling and analysis subsystem. Heuristic algorithms have been implemented in order to extract specific properties of Facebook's social graph, in particular user interactions. To support analysis tasks specifically, extensive effort has been spent on the analysis of textual user generated content and on the recognition of named-entities, in particular person names, locations and organisations. Relationships between users and entities mentioned in posts and in related comments are created and merged into the users networks extracted from the social graph. All entity relationships are finally visualised in user-friendly network graphs.
HCI International 2014 - Posters’ Extended Abstracts, 2014
In this paper we present a web application that exploits OpeNER Cloud Services. Ent-it-UP monitor... more In this paper we present a web application that exploits OpeNER Cloud Services. Ent-it-UP monitors Social Media and traditional Mass Media contents, performing multilingual Named Entity Recognition and Sentiment Analysis. Since consumers tend to trust the opinion of other consumers, reviews and ratings on the internet are increasingly important. Given the huge amount of data flowing in the web, it has become necessary to adopt an automatic data analysis strategy, in order to understand what people think about a certain product, brand or topic. The goal of Ent-it-Up is to carry out statistics about retrieved entities and display results in a communicative, intuitive and user friendly interface. In this way the final user can easily have a hint about people opinions without wasting too much time in analyzing the huge amount of User-Generated Content.
Communications in Computer and Information Science, 2014
Law Enforcement Agencies (LEAs) are increasingly more reliant on information and communication te... more Law Enforcement Agencies (LEAs) are increasingly more reliant on information and communication technologies and affected by a society shaped by the Internet and Social Media. The richness and quantity of information available from open sources, if properly gathered and processed, can provide valuable intelligence and help drawing inference from existing closed source intelligence. This paper presents CAPER, a state-of-the-art platform for the prevention of organised crime, created in cooperation with European LEAs. CAPER supports information sharing and multi-modal analysis of open and closed information sources, mainly based on Natural Language Processing (NLP) and Visual Analytics (VA) technologies.
At present, the availability of high quality annotated corpora is fundamental to carry out or to ... more At present, the availability of high quality annotated corpora is fundamental to carry out or to evaluate several Natural Language Processing and Text Mining tasks. To create consistently annotated corpora, direct human intervention represents a key factor: teams of manual taggers, usually composed by linguistically skilled people, are needed to refine existing annotations or to add new ones. As a consequence, manual corpora annotation is an expensive and a highly demanding task in term of involved resources.
EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020, 2020
In this paper we describe the systems we used to participate in the task TAG-it of EVALITA 2020. ... more In this paper we describe the systems we used to participate in the task TAG-it of EVALITA 2020. The first system we developed uses linear Support Vector Machine as learning algorithm. The other two systems are based on the pretrained Italian Language Model UmBERTo: one of them has been developed following the Multi-Task Learning approach, while the other following the Single-Task Learning approach. These systems have been evaluated on TAG-it official test sets and ranked first in all the TAG-it subtasks, demonstrating the validity of the approaches we followed.
Procedia Manufacturing, 2021
In recent years there has been a growing interest in Circular Economy (CE), which promises to red... more In recent years there has been a growing interest in Circular Economy (CE), which promises to reduce waste and improve sustainability. The promise of CE is to change the conventional "take-make-dispose" that causes massive waste flows based on the integration of demanufacturing and remanufacturing processes within value chains. This integration requires breaking the "silos" of the circular chain to establish new collaborative and sustainable value networks. The paper introduces a novel digital platform for the CE, which is currently under development in the H2020 DigiPrime project. The platform is destined to facilitate seamless and trusted information exchange across circular actors, while offering a range of value-added services that enable manufacturers, remanufactures, recyclers and other actors to gain insights in the status of recycling and waste management processes. The latter facilitates the implementation of zero waste processes, along with the assessment of the performance of the circular chain. The paper introduces the architecture of the digital platform, along with its data modelling, exchange and data traceability mechanisms. It also presents a CE use case used to validate the platform.
The subtitling demand has grown quickly over the years. The path of manual subtitling is no longe... more The subtitling demand has grown quickly over the years. The path of manual subtitling is no longer feasible, due to increased costs and reduced production times. Assisted Subtitling is an emerging technique, consisting in the application of Automatic Speech Recognition (ASR) to automatically generate program transcripts. This paper will report on recent advances in ASR, presenting SAVAS, a novel Speaker Independent ASR technology specifically designed for Live Subtitling. We will describe the technology, presenting its features and detailing language and domain-specific tunings that we have carried out. We will also introduce the S.Scribe!, S.Live! and S.Respeak! systems, which are based on SAVAS. S.Scribe! is a batch Speaker Independent Transcription system for subtitling. S.Live! is a first-of-a-kind Speaker Independent Transcription System, with real-time performances for online subtitling. S.Respeak! is a collaborative Respeaking System, for live and batch production of multilin...
2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 2012
ABSTRACT The Web is a huge virtual space where to express and share individual opinions, influenc... more ABSTRACT The Web is a huge virtual space where to express and share individual opinions, influencing any aspect of life, with implications for marketing and communication alike. Social Media are influencing consumers' preferences by shaping their attitudes and behaviors. Monitoring the Social Media activities is a good way to measure customers' loyalty, keeping a track on their sentiment towards brands or products. Social Media are the next logical marketing arena. Currently, Facebook dominates the digital marketing space, followed closely by Twitter. This paper describes a Sentiment Analysis study performed on over than 1000 Facebook posts about newscasts, comparing the sentiment for Rai -the Italian public broadcasting service -towards the emerging and more dynamic private company La7. This study maps study results with observations made by the Osservatorio di Pavia, which is an Italian institute of research specialized in media analysis at theoretical and empirical level, engaged in the analysis of political communication in the mass media. This study takes also in account the data provided by Auditel regarding newscast audience, correlating the analysis of Social Media, of Facebook in particular, with measurable data, available to public domain.
Communications in Computer and Information Science, 2011
We introduce the CAPER project (Collaborative information, Acquisition, Processing, Exploitation and Reporting), partially funded by the European Commission. The goal of CAPER is to create a common platform for the prevention of organized crime through sharing, exploitation and linking of Open and Closed information Sources. CAPER will support collaborative multilingual analysis of unstructured and audiovisual contents, based on Text Mining
We present FastType, a word prediction system for Italian, an inflected language, and its user-centric interface. FastType has greatly evolved from its original features. We have added new linguistic resources, implemented more efficient prediction algorithms and built a brand-new user interface. Thanks to prediction engine upgrades, such as the generation of word and Part-of-Speech n-gram collections and the introduction of a linear combination algorithm, performance has greatly improved. Keystroke Saving reached 48% and is now comparable to that achieved with state-of-the-art methods for non-inflected languages. DonKey, the new human-computer interface, allows the user to benefit from automatic word completion in any application. FastType is primarily designed for users with special needs and to reduce misspellings for users with linguistic difficulties.
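As a rough illustration of the two ideas mentioned in this abstract (a linear combination of n-gram scores, and the Keystroke Saving metric), here is a minimal sketch. It is not FastType's implementation: the function names, the interpolation weight `lam`, the restriction to word unigrams and bigrams (no Part-of-Speech n-grams) and the toy corpus are all assumptions made for the example. Keystroke Saving is computed with the commonly used definition, the percentage of characters the user did not have to type.

```python
from collections import Counter, defaultdict


def train(tokens):
    """Count word unigrams and bigrams from a list of tokens."""
    unigrams = Counter(tokens)
    bigrams = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        bigrams[prev][cur] += 1
    return unigrams, bigrams


def predict(prefix, prev, unigrams, bigrams, lam=0.7, k=3):
    """Rank words starting with `prefix` by lam*P(w|prev) + (1-lam)*P(w).

    `lam` is a hypothetical interpolation weight, not a value from the paper.
    """
    total = sum(unigrams.values())
    follow = bigrams.get(prev, Counter())
    follow_total = sum(follow.values())
    scores = {}
    for word, count in unigrams.items():
        if not word.startswith(prefix):
            continue
        p_uni = count / total
        p_bi = follow[word] / follow_total if follow_total else 0.0
        scores[word] = lam * p_bi + (1 - lam) * p_uni
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


def keystroke_saving(typed_keys, total_chars):
    """Percentage of characters the user did not have to type."""
    return 100.0 * (total_chars - typed_keys) / total_chars


# Toy corpus, invented for the example.
tokens = "il gatto mangia il pesce e il gatto dorme".split()
unigrams, bigrams = train(tokens)
suggestions = predict("", "il", unigrams, bigrams)
```

With this toy corpus, the words most often seen after "il" are ranked first. A real system would add smoothing, larger n-gram orders and morphological information.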
We present FastType, an innovative system for word and letter prediction for an inflected language, namely Italian. The system is based on combined statistical and lexical methods and uses robust language resources. Word prediction is particularly useful to minimise keystrokes for users with special needs, and to reduce misspellings for users with limited Italian proficiency. Word prediction can also be used effectively in language learning, by suggesting correct and well-formed words to non-native users. This is significant, and particularly difficult, for inflected languages such as Italian, where the correct word form depends on the context. After describing the system, we evaluate its performance and, besides the high Keystroke Saving, we show that FastType overcomes typical word prediction limitations, achieving outstanding results even over a very large dictionary of words.
Much information of potential relevance to police investigations of organised crime is available in public sources without being recognised and used. Barriers to the simple and efficient exploitation of this information include the fact that not everything is easily searchable and that it may be written in a language other than that of the investigator. To help overcome these problems, the CAPER project aims to create an integrated platform for acquisition, processing, and analysis of information in multiple languages, and to link this to legacy police IT systems. Full Natural Language Processing pipelines for multiple languages and media are used to map persons and organisations to actions and events, and multilingual lexicons and gazetteers allow cross-lingual search in the indexed data. Domain-specific lexicons contain words and slang expressions with special senses in the context of organised crime. The system supports multilingual analysis of unstructured and audiovisual contents, based on text mining for fourteen languages, and uses language-neutral interfaces, so that the addition of further languages will not require any modification of existing components.
Lecture Notes in Computer Science, 2011
Many attempts have been made to extract structured data from Web resources, exposing them as RDF triples and interlinking them with other RDF datasets: in this way it is possible to create clouds of highly integrated Semantic Web data collections. In this paper we describe an approach to enhance the extraction of semantic contents from unstructured textual documents, in particular considering Wikipedia articles and focusing on event mining. Starting from the deep parsing of a set of English Wikipedia articles, we produce a semantic annotation compliant with the Knowledge Annotation Format (KAF). We extract events from the KAF semantic annotation and then we structure each event as a set of RDF triples linked to both DBpedia and WordNet. We point out examples of automatically mined events, providing some general evaluation of how our approach may discover new events and link them to existing contents.
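To make "structuring each event as a set of RDF triples" more concrete, the sketch below serialises one event as N-Triples lines using plain string formatting. It only illustrates the general shape of such output. The `EX` and `WN` namespaces, the `hasSense` predicate, the role names and the sample event are hypothetical and do not come from the paper; only the DBpedia resource prefix is a real, well-known URI pattern.

```python
EX = "http://example.org/event/"    # hypothetical namespace for events and predicates
WN = "http://example.org/wordnet/"  # placeholder for a WordNet synset namespace
DBR = "http://dbpedia.org/resource/"


def event_to_triples(event_id, synset_id, roles):
    """Return N-Triples lines describing one event.

    `roles` maps a role name (e.g. "agent") to a DBpedia resource name.
    """
    subject = f"<{EX}{event_id}>"
    # Link the event to the sense of its predicate (hypothetical property).
    triples = [f"{subject} <{EX}hasSense> <{WN}{synset_id}> ."]
    # Link each participant to its DBpedia resource.
    for role, entity in roles.items():
        triples.append(f"{subject} <{EX}{role}> <{DBR}{entity}> .")
    return triples


# Invented sample event, for illustration only.
triples = event_to_triples("e1", "found-v-1", {"agent": "Steve_Jobs", "patient": "Apple_Inc."})
```

A production pipeline would more likely use an RDF library and an established event vocabulary than hand-built strings.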
2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014), 2014
Law Enforcement Agencies (LEAs) are increasingly reliant on information and communication technologies and affected by a society shaped by the Internet. The richness and quantity of information available from open sources, if properly gathered and processed, can provide valuable intelligence and help in drawing inferences from existing closed source intelligence. Today the intelligence cycle is characterized by manual collection and integration of data. Named Entity Recognition (NER) plays a fundamental role in Open Source Intelligence (OSINT) solutions when fighting crime. This paper describes the implementation of a NER-based focused web crawler under the EU FP7 Security Research Project CAPER (Collaborative information, Acquisition, Processing, Exploitation and Reporting for the prevention of organized crime). The crawler supports two modes: (1) looking for documents starting from a URL down to a configurable depth, also specifying a keyword that has to be contained in the page and in the related links, and (2) looking for a configurable number of documents starting from a keyword (entrusting the keyword search to one of the principal search engines, thus behaving as a meta-search engine). In addition, the crawler is able to retrieve only those documents that contain information semantically relevant to the query (in other words, the required keyword with the required sense). This is achieved through the use of NER technologies. In this paper we present the CAPER NER-based Semantic Crawler, which has proven to be a suitable tool for focused crawling, allowing LEAs to drastically reduce data collection and integration efforts.
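The first crawling mode described above (start from a URL, follow links down to a configurable depth, keep only pages containing a keyword) can be sketched as a breadth-first traversal. This is a simplified illustration, not the CAPER crawler: the `focused_crawl` function, the injected `fetch` callable and the in-memory `WEB` dictionary are assumptions made so the example runs without network access. It uses plain substring matching instead of NER-based sense filtering, and it assumes that links are followed only from pages that contain the keyword.

```python
from collections import deque


def focused_crawl(start_url, fetch, max_depth, keyword):
    """Breadth-first crawl from `start_url`, at most `max_depth` link hops away.

    `fetch(url)` must return a (text, links) pair, or None if unavailable.
    Returns the URLs of pages whose text contains `keyword` (case-insensitive).
    """
    seen = {start_url}
    queue = deque([(start_url, 0)])
    hits = []
    while queue:
        url, depth = queue.popleft()
        page = fetch(url)
        if page is None:
            continue
        text, links = page
        if keyword.lower() not in text.lower():
            continue  # irrelevant page: neither kept nor expanded
        hits.append(url)
        if depth < max_depth:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return hits


# Toy in-memory "web", invented for the example: url -> (text, outgoing links).
WEB = {
    "a": ("mafia report", ["b", "c"]),
    "b": ("weather forecast", ["d"]),
    "c": ("Mafia trial opens", ["d"]),
    "d": ("more mafia news", []),
}
results = focused_crawl("a", WEB.get, 1, "mafia")
```

Passing `fetch` as a parameter keeps the traversal logic separate from HTTP handling, so a real downloader and HTML link extractor could be substituted later.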