Albert Weichselbraun | University of Applied Sciences Chur

Papers by Albert Weichselbraun

Slides: Context Aware Sentiment Detection

CrudeBERT: Applying Economic Theory Towards Fine-Tuning Transformer-Based Sentiment Analysis Models to the Crude Oil Market

B 10 Ontologien und Linked Open Data

De Gruyter eBooks, Nov 21, 2022

Improving Company Valuations with Automated Knowledge Discovery, Extraction and Fusion

arXiv (Cornell University), Oct 19, 2020

Abstract: Company valuations in the biotech, pharmaceutical, and medical technology sectors are a demanding task, particularly when the unique risks that biotech start-ups face when entering new markets are taken into account. Companies specializing in global valuation services therefore combine valuation models and past experience with heterogeneous metrics and indicators that provide insights into a company's performance. This article illustrates how automated knowledge identification, extraction, and integration can be used to (i) identify additional indicators that provide insights into a company's success in product development, and (ii) support labor-intensive data collection processes for company valuation.

Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web

Journal of Open Source Software, Oct 16, 2021

Inscriptis provides a library, command line client and Web service for converting HTML to plain text. Its development has been triggered by the need to obtain accurate text representations for knowledge extraction tasks that preserve the spatial alignment of text without drawing upon heavyweight, browser-based solutions such as Selenium (Huggins et al., 2021). In contrast to existing software packages such as HTML2text (Swartz, 2021), jusText (Belica, 2021) and Lynx (Dickey, 2021), Inscriptis

1. provides a layout-aware conversion of HTML that more closely resembles the rendering obtained from standard Web browsers and, therefore, better preserves the spatial arrangement of text elements. Inscriptis excels in terms of conversion quality, since it correctly converts complex HTML constructs such as nested tables and also interprets a subset of HTML (e.g., align, valign) and CSS (e.g., display, white-space, margin-top, vertical-align) attributes that determine the text alignment.

2. supports annotation rules, i.e., user-provided mappings that allow for annotating the extracted text based on structural and semantic information encoded in HTML tags and attributes used for controlling structure and layout in the original HTML document.

These unique features ensure that downstream knowledge extraction components can operate on accurate text representations, and may even use information on the semantics and structure of the original HTML document, if annotation support has been enabled.

Statement of need: Research in a growing number of scientific disciplines relies upon Web content. Li et al. (2014), for instance, studied the impact of company-specific news coverage on stock prices; in medicine and pharmacovigilance, social media listening plays an important role in gathering insights into patient needs and monitoring adverse drug effects (Convertino et al., 2018); and communication sciences analyze media coverage to obtain information on the perception and framing of issues as well as on the rise and fall of topics within news and social media (Scharl et al., 2017; Weichselbraun et al., 2021). Computer science focuses on analyzing content by applying knowledge extraction techniques such as entity recognition (Fu et al., 2021) to automatically identify entities (e.g., persons, organizations, locations, products) within text documents, entity linking (Ding et al., 2021) to link these entities to knowledge bases such as Wikidata and DBpedia, and sentiment ...
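As a conceptual illustration of the layout-aware conversion the abstract describes, the following stdlib-only sketch inserts line breaks at block-level tags so paragraphs and headings stay on separate lines. This is a toy approximation, not Inscriptis itself; the tag set and class name are invented for the example.

```python
from html.parser import HTMLParser

# Simplified subset of tags that start a new line in the rendered layout.
BLOCK_TAGS = {"p", "div", "table", "tr", "li", "h1", "h2", "h3", "br"}

class BlockAwareExtractor(HTMLParser):
    """Toy layout-aware HTML-to-text converter (illustrative, not Inscriptis)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # A block-level element begins a new line in the text rendering.
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.parts.append(stripped)

    def text(self):
        return "".join(self.parts).strip()

extractor = BlockAwareExtractor()
extractor.feed("<div><h1>Title</h1><p>First paragraph.</p><p>Second.</p></div>")
print(extractor.text())
```

A real converter such as Inscriptis additionally handles nested tables, CSS display rules and whitespace semantics, which is exactly where naive approaches like this sketch break down.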

Slot Filling for Extracting Reskilling and Upskilling Options from the Web

Lecture Notes in Computer Science, 2022

Disturbances in the job market such as advances in science and technology, crises and increased competition have triggered a surge in reskilling and upskilling programs. Information on suitable continuing education options is distributed across many sites, rendering the search, comparison and selection of useful programs a cumbersome task. This paper, therefore, introduces a knowledge extraction system that integrates reskilling and upskilling options into a single knowledge graph. The system collects educational programs from 488 different providers and uses context extraction for identifying and contextualizing relevant content. Afterwards, entity recognition and entity linking methods draw upon a domain ontology to locate relevant entities such as skills, occupations and topics. Finally, slot filling integrates entities based on their context into the corresponding slots of the continuing education knowledge graph. We also introduce a German gold standard that comprises 169 documents and over 3800 annotations for benchmarking the necessary content extraction, entity linking, entity recognition and slot filling tasks, and provide an overview of the system's performance.
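The slot-filling step the abstract describes can be pictured as routing linked entities into knowledge-graph slots based on the context section they were extracted from. The sketch below is a minimal, hypothetical illustration of that idea; the slot names, context labels and entity identifiers are invented for the example and do not come from the paper's ontology.

```python
# Hypothetical mapping from an entity's extraction context to a
# knowledge-graph slot of a continuing-education program.
SLOT_BY_CONTEXT = {
    "prerequisites": "requires_skill",
    "learning objectives": "teaches_skill",
    "target audience": "suitable_for_occupation",
}

def fill_slots(entities):
    """entities: list of (surface_form, linked_id, context_label) triples."""
    record = {}
    for surface, linked_id, context in entities:
        slot = SLOT_BY_CONTEXT.get(context)
        if slot:
            # Integrate the linked entity into the slot suggested by its context.
            record.setdefault(slot, []).append(linked_id)
    return record

program = fill_slots([
    ("Python basics", "skill:python", "prerequisites"),
    ("machine learning", "skill:ml", "learning objectives"),
])
print(program)
```

In the real system, the context labels would come from the context extraction stage and the linked identifiers from entity linking against the domain ontology.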

Framing Named Entity Linking Error Types

Named Entity Linking (NEL) and relation extraction form the backbone of Knowledge Base Population tasks. The recent rise of large open-source Knowledge Bases and the continuous focus on improving NEL performance have led to the creation of automated benchmark solutions during the last decade. Benchmarking NEL systems offers a valuable approach to understanding a NEL system's performance quantitatively. However, an in-depth qualitative analysis that helps improve NEL methods by identifying error causes usually requires a more thorough error analysis. This paper proposes a taxonomy to frame common errors and applies this taxonomy in a survey study to assess the performance of four well-known Named Entity Linking systems on three recent gold standards.

Optimierung von Unternehmensbewertungen durch automatisierte Wissensidentifikation, -extraktion und -integration

Information - Wissenschaft & Praxis, 2020

Abstract: Company valuations in the biotech, pharmaceutical, and medical technology sectors are a demanding task, particularly when the unique risks that biotech start-ups face when entering new markets are taken into account. Companies specializing in global valuation services therefore combine valuation models and past experience with heterogeneous metrics and indicators that provide insights into a company's performance. This article illustrates how automated knowledge identification, extraction, and integration can be used to (i) identify additional indicators that provide insights into a company's success in product development, and (ii) support labor-intensive data collection processes for company valuation.

Semantic Systems and Visual Tools to Support Environmental Communication

IEEE Systems Journal, 2017

Given the intense attention that environmental topics such as climate change attract in news and social media coverage, scientists and communication professionals want to know how different stakeholders perceive observable threats and policy options, how specific media channels react to new insights, and how journalists present scientific knowledge to the public. This paper investigates the potential of semantic technologies to address these questions. After summarizing methods to extract and disambiguate context information, we present visualization techniques to explore the lexical, geospatial, and relational context of topics and entities referenced in these repositories. The examples stem from the Media Watch on Climate Change, the Climate Resilience Toolkit and the NOAA Media Watch, three applications that aggregate environmental resources from a wide range of online sources. These systems not only show the value of providing comprehensive information to the public, but have also helped to develop a novel communication success metric that goes beyond bipolar assessments of sentiment.

Integrating Economic Theory, Domain Knowledge, and Social Knowledge into Hybrid Sentiment Models for Predicting Crude Oil Markets

Cognitive Computation

For several decades, sentiment analysis has been considered a key indicator for assessing market mood and predicting future price changes. Accurately predicting commodity markets requires an understanding of fundamental market dynamics such as the interplay between supply and demand, which are not considered in standard affective models. This paper introduces two domain-specific affective models, CrudeBERT and CrudeBERT+, that adapt sentiment analysis to the crude oil market by incorporating economic theory with common knowledge of the mentioned entities and social knowledge extracted from Google Trends. To evaluate the predictive capabilities of these models, comprehensive experiments were conducted using dynamic time warping to identify the model that best approximates WTI crude oil futures price movements. The evaluation included news headlines and crude oil prices between January 2012 and April 2021. The results show that CrudeBERT+ outperformed RavenPack, BERT, FinBERT, and ear...
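The dynamic time warping comparison mentioned above can be sketched in a few lines: DTW aligns two sequences that may be shifted or stretched in time and returns an accumulated alignment cost, with lower values meaning a closer match. The minimal pure-Python version below is illustrative only and is not the authors' implementation.

```python
def dtw_distance(a, b):
    """Classic dynamic-programming DTW distance between two number sequences."""
    n, m = len(a), len(b)
    INF = float("inf")
    # D[i][j] = minimal accumulated cost of aligning a[:i] with b[:j].
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible alignment moves.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# Identical series align perfectly; diverging series accumulate cost.
print(dtw_distance([1, 2, 3], [1, 2, 3]))  # 0.0
print(dtw_distance([0, 0], [1, 1]))        # 2.0
```

In a setting like the paper's, one series would be an aggregated sentiment signal and the other the price series, with the model yielding the lowest DTW cost considered the closest approximation of the price movements.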

Building Knowledge Graphs and Recommender Systems for Suggesting Reskilling and Upskilling Options from the Web

Information

As advances in science and technology, crises, and increased competition impact labor markets, reskilling and upskilling programs have emerged to mitigate their effects. Since information on continuing education is highly distributed across websites, choosing career paths and suitable upskilling options is currently considered a challenging and cumbersome task. This article, therefore, introduces a method for building a comprehensive knowledge graph from the education providers' Web pages. We collect educational programs from 488 providers and leverage entity recognition and entity linking methods in conjunction with contextualization to extract knowledge on entities such as prerequisites, skills, learning objectives, and course content. Slot filling then integrates these entities into an extensive knowledge graph that contains close to 74,000 nodes and over 734,000 edges. A recommender system leverages the created graph and background knowledge on occupations to provide a career path a...

StoryLens

Proceedings of the 8th International Conference on Web Intelligence, Mining and Semantics, 2018

The news media landscape tends to focus on long-running narratives. Correctly processing new information, therefore, requires considering multiple lenses when analyzing media content. Traditionally, it would have been considered sufficient to extract the topics or entities contained in a text in order to classify it, but today it is also important to look at more sophisticated annotations related to fine-grained geolocation, events, stories and the relations between them. In order to leverage such lenses, we propose a new corpus that offers a diverse set of annotations over texts collected from multiple media sources. We also showcase the framework used for creating the corpus, as well as how the information from the various lenses can be used to support different use cases in the EU project InVID for verifying the veracity of online video.

Medienkritik-Forschung mit der Suchmaschine WebLyzard - Strukturelle und inhaltliche Ergebnisse der Deutschschweiz 2014-2018

Science in the Swiss Public. The State of Science Communication and Public Engagement with Science in Switzerland

Science communication and public engagement with science have repeatedly been called for in recent years, particularly during the COVID-19 pandemic. Therefore, the Swiss Academies of the Arts and Sciences have set up an expert group to assess the state of science communication in Switzerland and to provide recommendations for how to improve it. The expert group report is based on a comprehensive review of the available interdisciplinary scholarship analyzing science communication and public engagement with science in Switzerland. Selectively, it also incorporates original data, international findings, and secondary analyses where little or no published scholarly work was available. The report covers a wide range of facets of science communication and public engagement in Switzerland, from public attitudes towards science, through individuals and organizations engaging in science communication and engagement formats, to news and social media representations of science. On this basis, it ...

Classifying News Media Coverage for Corruption Risks Management with Deep Learning and Web Intelligence

Proceedings of the 10th International Conference on Web Intelligence, Mining and Semantics, 2020

A substantial number of international corporations have been affected by corruption. The research presented in this paper introduces the Integrity Risks Monitor, an analytics dashboard that applies Web Intelligence and Deep Learning to English and German-speaking documents for the task of (i) tracking and visualizing past corruption management gaps and their respective impacts, (ii) understanding present and past integrity issues, and (iii) supporting companies in analyzing news media for identifying and mitigating integrity risks. Afterwards, we discuss the design, implementation, training and evaluation of classification components capable of identifying English documents covering the integrity topic of corruption. Domain experts created a gold standard dataset compiled from Anglo-American media coverage on corruption cases that has been used for training and evaluating the classifier. The experiments performed to evaluate the classifiers draw upon popular algorithms used for text classification such as Naïve Bayes, Support Vector Machines (SVM) and Deep Learning architectures (LSTM, BiLSTM, CNN) that draw upon different word embeddings and document representations. They also demonstrate that although classical machine learning approaches such as Naïve Bayes struggle with the diversity of the media coverage on corruption, state-of-the-art Deep Learning models perform sufficiently well in the project's context.

Mitigating linked data quality issues in knowledge-intense information extraction methods

Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics, 2017

Advances in research areas such as named entity linking and sentiment analysis have triggered the emergence of knowledge-intensive information extraction methods that combine classical information extraction with background knowledge from the Web. Despite data quality concerns, linked data sources such as DBpedia, GeoNames and Wikidata, which encode facts in a standardized structured format, are particularly attractive for such applications. This paper addresses the problem of data quality by introducing a framework that elaborates on linked data quality issues relevant to different stages of the background knowledge acquisition process, their impact on information extraction performance and applicable mitigation strategies. Applying this framework to named entity linking and data enrichment demonstrates the potential of the introduced mitigation strategies to lessen the impact of different kinds of data quality problems. An industrial use case that aims at the automatic generation of image metadata from image descriptions illustrates the successful deployment of knowledge-intensive information extraction in real-world applications and constraints introduced by data quality concerns.

Improving Named Entity Linking Corpora Quality

Proceedings - Natural Language Processing in a Deep Learning World, 2019

A Regional News Corpora for Contextualized Entity Discovery and Linking

This paper presents a German corpus for Named Entity Linking (NEL) and Knowledge Base Population (KBP) tasks. We describe the annotation guideline, the annotation process, NIL clustering techniques and the conversion to popular NEL formats such as NIF and TAC that have been used to construct this corpus based on news transcripts from the German regional broadcaster RBB (Rundfunk Berlin-Brandenburg). Since creating such language resources requires significant effort, the paper also discusses how to derive additional evaluation resources for tasks like named entity contextualization or ontology enrichment by exploiting the links between named entities from the annotated corpus. The paper concludes with an evaluation that shows how several well-known NEL tools perform on the corpus, a discussion of the evaluation results, and suggestions on how to keep evaluation corpora and datasets up to date.

Extracting and Grounding Sentiment Lexicons

Name Variants for Improving Entity Discovery and Linking

Identifying all names that refer to a particular set of named entities is a challenging task, as quite often we need to consider many features that include a lot of variation, such as abbreviations, aliases, hypocorisms, multilingualism or partial matches. Each entity type can also have specific rules for name variants: people's names can include titles, country and branch names are sometimes removed from organization names, while locations are often plagued by the issue of nested entities. The lack of a clear strategy for collecting, processing and computing name variants significantly lowers the recall of tasks such as Named Entity Linking and Knowledge Base Population, since name variants are frequently used in all kinds of textual content. This paper proposes several strategies to address these issues. Recall can be improved by combining knowledge repositories and by computing additional variants based on algorithmic approaches. Heuristics and machine learning methods then analyze the...

Research paper thumbnail of Slides: Context Aware Sentiment Detection

Research paper thumbnail of CrudeBERT: Applying Economic Theory Towards Fine-Tuning Transformer-Based Sentiment Analysis Models to the Crude Oil Market

Research paper thumbnail of B 10 Ontologien und Linked Open Data

De Gruyter eBooks, Nov 21, 2022

Research paper thumbnail of Improving Company Valuations with Automated Knowledge Discovery, Extraction and Fusion

arXiv (Cornell University), Oct 19, 2020

Zusammenfassung: Unternehmensbewertungen in der Biotech-Branche, Pharmazie und Medizintechnik ste... more Zusammenfassung: Unternehmensbewertungen in der Biotech-Branche, Pharmazie und Medizintechnik stellen eine anspruchsvolle Aufgabe dar, insbesondere bei Berücksichtigung der einzigartigen Risiken, denen Biotech-Startups beim Eintritt in neue Märkte ausgesetzt sind. Unternehmen, die auf globale Bewertungsdienstleistungen spezialisiert sind, kombinieren daher Bewertungsmodelle und Erfahrungen aus der Vergangenheit mit heterogenen Metriken und Indikatoren, die Einblicke in die Leistung eines Unternehmens geben. Dieser Beitrag veranschaulicht, wie automatisierte Wissensidentifikation,-extraktion und-integration genutzt werden können, um (i) zusätzliche Indikatoren zu ermitteln, die Einblicke in den Erfolg eines Unternehmens in der Produktentwicklung geben und um (ii) arbeitsintensive Datensammelprozesse zur Unternehmensbewertung zu unterstützen.

Research paper thumbnail of Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web

Journal of open source software, Oct 16, 2021

Inscriptis provides a library, command line client and Web service for converting HTML to plain t... more Inscriptis provides a library, command line client and Web service for converting HTML to plain text. Its development has been triggered by the need to obtain accurate text representations for knowledge extraction tasks that preserve the spatial alignment of text without drawing upon heavyweight, browser-based solutions such as Selenium (Huggins et al., 2021). In contrast to existing software packages such as HTML2text (Swartz, 2021), jusText (Belica, 2021) and Lynx (Dickey, 2021), Inscriptis 1. provides a layout-aware conversion of HTML that more closely resembles the rendering obtained from standard Web browsers and, therefore, better preserves the spatial arrangement of text elements. Inscriptis excels in terms of conversion quality, since it correctly converts complex HTML constructs such as nested tables and also interprets a subset of HTML (e.g., align, valign) and CSS (e.g., display, white-space, margin-top, vertical-align, etc.) attributes that determine the text alignment. 2. supports annotation rules, i.e., user-provided mappings that allow for annotating the extracted text based on structural and semantic information encoded in HTML tags and attributes used for controlling structure and layout in the original HTML document. These unique features ensure that downstream knowledge extraction components can operate on accurate text representations, and may even use information on the semantics and structure of the original HTML document, if annotation support has been enabled. Statement of need Research in a growing number of scientific disciplines relies upon Web content. Li et al. 
(2014), for instance, studied the impact of company-specific News coverage on stock prices, in medicine and pharmacovigilance social media listening plays an important role in gathering insights into patient needs and the monitoring of adverse drug effects (Convertino et al., 2018), and communication sciences analyze media coverage to obtain information on the perception and framing of issues as well as on the rise and fall of topics within News and social media (Scharl et al., 2017; Weichselbraun et al., 2021). Computer science focuses on analyzing content by applying knowledge extraction techniques such as entity recognition (Fu et al., 2021) to automatically identify entities (e.g., persons, organizations, locations, products, etc.) within text documents, entity linking (Ding et al., 2021) to link these entities to knowledge bases such as Wikidata and DBPedia, and sentiment Weichselbraun, A., (2021). Inscriptis-A Python-based HTML to text conversion library optimized for knowledge extraction from the Web.

Research paper thumbnail of Slot Filling for Extracting Reskilling and Upskilling Options from the Web

Lecture Notes in Computer Science, 2022

Disturbances in the job market such as advances in science and technology, crisis and increased c... more Disturbances in the job market such as advances in science and technology, crisis and increased competition have triggered a surge in reskilling and upskilling programs. Information on suitable continuing education options is distributed across many sites, rendering the search, comparison and selection of useful programs a cumbersome task. This paper, therefore, introduces a knowledge extraction system that integrates reskilling and upskilling options into a single knowledge graph. The system collects educational programs from 488 different providers and uses context extraction for identifying and contextualizing relevant content. Afterwards, entity recognition and entity linking methods draw upon a domain ontology to locate relevant entities such as skills, occupations and topics. Finally, slot filling integrates entities based on their context into the corresponding slots of the continuous education knowledge graph. We also introduce a German gold standard that comprises 169 documents and over 3800 annotations for benchmarking the necessary content extraction, entity linking, entity recognition and slot filling tasks, and provide an overview of the system's performance.

Research paper thumbnail of Framing Named Entity Linking Error Types

Named Entity Linking (NEL) and relation extraction forms the backbone of Knowledge Base Populatio... more Named Entity Linking (NEL) and relation extraction forms the backbone of Knowledge Base Population tasks. The recent rise of large open source Knowledge Bases and the continuous focus on improving NEL performance has led to the creation of automated benchmark solutions during the last decade. The benchmarking of NEL systems offers a valuable approach to understand a NEL system's performance quantitatively. However, an in-depth qualitative analysis that helps improving NEL methods by identifying error causes usually requires a more thorough error analysis. This paper proposes a taxonomy to frame common errors and applies this taxonomy in a survey study to assess the performance of four well-known Named Entity Linking systems on three recent gold standards.

Research paper thumbnail of Optimierung von Unternehmensbewertungen durch automatisierte Wissensidentifikation, -extraktion und -integration

Information - Wissenschaft & Praxis, 2020

Zusammenfassung Unternehmensbewertungen in der Biotech-Branche, Pharmazie und Medizintechnik stel... more Zusammenfassung Unternehmensbewertungen in der Biotech-Branche, Pharmazie und Medizintechnik stellen eine anspruchsvolle Aufgabe dar, insbesondere bei Berücksichtigung der einzigartigen Risiken, denen Biotech-Startups beim Eintritt in neue Märkte ausgesetzt sind. Unternehmen, die auf globale Bewertungsdienstleistungen spezialisiert sind, kombinieren daher Bewertungsmodelle und Erfahrungen aus der Vergangenheit mit heterogenen Metriken und Indikatoren, die Einblicke in die Leistung eines Unternehmens geben. Dieser Beitrag veranschaulicht, wie automatisierte Wissensidentifikation, -extraktion und -integration genutzt werden können, um (i) zusätzliche Indikatoren zu ermitteln, die Einblicke in den Erfolg eines Unternehmens in der Produktentwicklung geben und um (ii) arbeitsintensive Datensammelprozesse zur Unternehmensbewertung zu unterstützen.

Research paper thumbnail of Semantic Systems and Visual Tools to Support Environmental Communication

IEEE Systems Journal, 2017

Given the intense attention that environmental topics such as climate change attract in news and ... more Given the intense attention that environmental topics such as climate change attract in news and social media coverage, scientists and communication professionals want to know how different stakeholders perceive observable threats and policy options, how specific media channels react to new insights, and how journalists present scientific knowledge to the public. This paper investigates the potential of semantic technologies to address these questions. After summarizing methods to extract and disambiguate context information, we present visualization techniques to explore the lexical, geospatial, and relational context of topics and entities referenced in these repositories. The examples stem from the Media Watch on Climate Change, the Climate Resilience Toolkit and the NOAA Media Watch-three applications that aggregate environmental resources from a wide range of online sources. These systems not only show the value of providing comprehensive information to the public, but also have helped to develop a novel communication success metric that goes beyond bipolar assessments of sentiment.

Research paper thumbnail of Integrating Economic Theory, Domain Knowledge, and Social Knowledge into Hybrid Sentiment Models for Predicting Crude Oil Markets

Cognitive Computation

For several decades, sentiment analysis has been considered a key indicator for assessing market ... more For several decades, sentiment analysis has been considered a key indicator for assessing market mood and predicting future price changes. Accurately predicting commodity markets requires an understanding of fundamental market dynamics such as the interplay between supply and demand, which are not considered in standard affective models. This paper introduces two domain-specific affective models, CrudeBERT and CrudeBERT+, that adapt sentiment analysis to the crude oil market by incorporating economic theory with common knowledge of the mentioned entities and social knowledge extracted from Google Trends. To evaluate the predictive capabilities of these models, comprehensive experiments were conducted using dynamic time warping to identify the model that best approximates WTI crude oil futures price movements. The evaluation included news headlines and crude oil prices between January 2012 and April 2021. The results show that CrudeBERT+ outperformed RavenPack, BERT, FinBERT, and ear...

Research paper thumbnail of Building Knowledge Graphs and Recommender Systems for Suggesting Reskilling and Upskilling Options from the Web

Information

As advances in science and technology, crisis, and increased competition impact labor markets, re... more As advances in science and technology, crisis, and increased competition impact labor markets, reskilling and upskilling programs emerged to mitigate their effects. Since information on continuing education is highly distributed across websites, choosing career paths and suitable upskilling options is currently considered a challenging and cumbersome task. This article, therefore, introduces a method for building a comprehensive knowledge graph from the education providers’ Web pages. We collect educational programs from 488 providers and leverage entity recognition and entity linking methods in conjunction with contextualization to extract knowledge on entities such as prerequisites, skills, learning objectives, and course content. Slot filling then integrates these entities into an extensive knowledge graph that contains close to 74,000 nodes and over 734,000 edges. A recommender system leverages the created graph, and background knowledge on occupations to provide a career path a...

Research paper thumbnail of StoryLens

Proceedings of the 8th International Conference on Web Intelligence, Mining and Semantics, 2018

The news media landscape tends to focus on long-running narratives. Correctly processing new information, therefore, requires considering multiple lenses when analyzing media content. Traditionally it would have been considered sufficient to extract the topics or entities contained in a text in order to classify it, but today it is important to also look at more sophisticated annotations related to fine-grained geolocation, events, stories and the relations between them. In order to leverage such lenses we propose a new corpus that offers a diverse set of annotations over texts collected from multiple media sources. We also showcase the framework used for creating the corpus, as well as how the information from the various lenses can be used in order to support different use cases in the EU project InVID for verifying the veracity of online video.

Research paper thumbnail of Medienkritik-Forschung mit der Suchmaschine WebLyzard - Strukturelle und inhaltliche Ergebnisse der Deutschschweiz 2014‐2018

Research paper thumbnail of Science in the Swiss Public. The State of Science Communication and Public Engagement with Science in Switzerland

Science communication and public engagement with science have repeatedly been called for in recent years, particularly during the COVID-19 pandemic. Therefore, the Swiss Academies of the Arts and Sciences have set up an expert group to assess the state of science communication in Switzerland, and to provide recommendations for how to improve it. The expert group report is based on a comprehensive review of the available interdisciplinary scholarship analyzing science communication and public engagement with science in Switzerland. Selectively, it also incorporates original data, international findings, and secondary analyses where little or no published scholarly work was available. The report covers a wide range of facets of science communication and public engagement in Switzerland, from public attitudes towards science, through individuals and organizations engaging in science communication and their engagement formats, to news and social media representations of science. On this basis, it ...

Research paper thumbnail of Classifying News Media Coverage for Corruption Risks Management with Deep Learning and Web Intelligence

Proceedings of the 10th International Conference on Web Intelligence, Mining and Semantics, 2020

A substantial number of international corporations have been affected by corruption. The research presented in this paper introduces the Integrity Risks Monitor, an analytics dashboard that applies Web Intelligence and Deep Learning to English- and German-language documents for the task of (i) tracking and visualizing past corruption management gaps and their respective impacts, (ii) understanding present and past integrity issues, (iii) supporting companies in analyzing news media for identifying and mitigating integrity risks. Afterwards, we discuss the design, implementation, training and evaluation of classification components capable of identifying English documents covering the integrity topic of corruption. Domain experts created a gold standard dataset compiled from Anglo-American media coverage on corruption cases that has been used for training and evaluating the classifier. The experiments performed to evaluate the classifiers draw upon popular algorithms used for text classification such as Naïve Bayes, Support Vector Machines (SVM) and Deep Learning architectures (LSTM, BiLSTM, CNN) that draw upon different word embeddings and document representations. They also demonstrate that although classical machine learning approaches such as Naïve Bayes struggle with the diversity of the media coverage on corruption, state-of-the-art Deep Learning models perform sufficiently well in the project's context. CCS CONCEPTS • Information systems → Data analytics; • Computing methodologies → Neural networks; • Applied computing → Economics; Annotation.
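As a minimal sketch of the classical baseline mentioned above, the following implements a multinomial Naïve Bayes classifier with Laplace smoothing over bag-of-words features. The training headlines are invented for illustration; the gold-standard dataset described in the abstract is far larger and the paper's stronger results come from Deep Learning models, not this baseline.

```python
from collections import Counter, defaultdict
import math

# Tiny illustrative training set (invented headlines).
train = [
    ("officials charged in bribery scandal", "corruption"),
    ("court probes kickbacks at ministry", "corruption"),
    ("quarterly profits rise on strong sales", "other"),
    ("new product line announced at trade fair", "other"),
]

def fit(samples):
    """Count per-class word frequencies and class priors."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in samples:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    return word_counts, class_counts

def predict(text, word_counts, class_counts):
    """Pick the class with the highest smoothed log-likelihood."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for label in class_counts:
        lp = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.split():
            lp += math.log((word_counts[label][w] + 1) / denom)  # Laplace smoothing
        if lp > best_lp:
            best, best_lp = label, lp
    return best

wc, cc = fit(train)
print(predict("ministry bribery probes", wc, cc))  # → corruption
```

With word-level features like these, the classifier keys on surface vocabulary, which is exactly why the abstract notes that Naïve Bayes struggles with the topical diversity of real corruption coverage.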

Research paper thumbnail of Mitigating linked data quality issues in knowledge-intense information extraction methods

Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics, 2017

Advances in research areas such as named entity linking and sentiment analysis have triggered the emergence of knowledge-intensive information extraction methods that combine classical information extraction with background knowledge from the Web. Despite data quality concerns, linked data sources such as DBpedia, GeoNames and Wikidata which encode facts in a standardized structured format are particularly attractive for such applications. This paper addresses the problem of data quality by introducing a framework that elaborates on linked data quality issues relevant to different stages of the background knowledge acquisition process, their impact on information extraction performance and applicable mitigation strategies. Applying this framework to named entity linking and data enrichment demonstrates the potential of the introduced mitigation strategies to lessen the impact of different kinds of data quality problems. An industrial use case that aims at the automatic generation of image metadata from image descriptions illustrates the successful deployment of knowledge-intensive information extraction in real-world applications and constraints introduced by data quality concerns.

Research paper thumbnail of Improving Named Entity Linking Corpora Quality

Proceedings - Natural Language Processing in a Deep Learning World, 2019

Research paper thumbnail of A Regional News Corpora for Contextualized Entity Discovery and Linking

This paper presents a German corpus for Named Entity Linking (NEL) and Knowledge Base Population (KBP) tasks. We describe the annotation guideline, the annotation process, NIL clustering techniques and conversion to popular NEL formats such as NIF and TAC that have been used to construct this corpus based on news transcripts from the German regional broadcaster RBB (Rundfunk Berlin Brandenburg). Since creating such language resources requires significant effort, the paper also discusses how to derive additional evaluation resources for tasks like named entity contextualization or ontology enrichment by exploiting the links between named entities from the annotated corpus. The paper concludes with an evaluation that shows how several well-known NEL tools perform on the corpus, a discussion of the evaluation results, and with suggestions on how to keep evaluation corpora and datasets up to date.

Research paper thumbnail of Extracting and Grounding Sentiment Lexicons

Research paper thumbnail of Name Variants for Improving Entity Discovery and Linking

Identifying all names that refer to a particular set of named entities is a challenging task, as quite often we need to consider many features that include a lot of variation like abbreviations, aliases, hypocorisms, multilingualism or partial matches. Each entity type can also have specific rules for name variants: people's names can include titles, country and branch names are sometimes removed from organization names, while locations are often plagued by the issue of nested entities. The lack of a clear strategy for collecting, processing and computing name variants significantly lowers the recall of tasks such as Named Entity Linking and Knowledge Base Population, since name variants are frequently used in all kinds of textual content. This paper proposes several strategies to address these issues. Recall can be improved by combining knowledge repositories and by computing additional variants based on algorithmic approaches. Heuristics and machine learning methods then analyze the...
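A simplified sketch of algorithmic variant generation for organization names: stripping legal-form suffixes and deriving acronyms are two of the rule families alluded to above. The suffix list and rules here are chosen purely for illustration; a production system would also cover titles, hypocorisms, transliterations, and nested entities.

```python
import re

def name_variants(name):
    """Generate simple surface-form variants of an organization name."""
    variants = {name}
    # Drop common legal-form suffixes ("Acme Corp." -> "Acme").
    stripped = re.sub(r"\s+(Inc\.?|Corp\.?|Ltd\.?|GmbH|AG)$", "", name)
    variants.add(stripped)
    # Acronym from capitalized tokens
    # ("International Business Machines" -> "IBM").
    tokens = stripped.split()
    if len(tokens) > 1:
        variants.add("".join(t[0] for t in tokens if t[0].isupper()))
    return variants

print(name_variants("International Business Machines Corp."))
```

Matching any of these generated surface forms against text, rather than only the canonical name, is what lifts recall in downstream Named Entity Linking and Knowledge Base Population.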