Bounce rate prediction for clicked ads in sponsored search advertising is crucial for improving the quality of ads shown to the user. Bounce rate represents the proportion of landing pages for clicked ads on which users spend less than a specified time, signifying that the user did not find a match between their query intent and the landing page content. In the pay-per-click revenue model for search engines, higher bounce rates mean advertisers are charged without meaningful user engagement, which hurts user and advertiser retention in the long term. In real-time search engine settings, complex ML models are prohibitive due to stringent latency requirements. Historical logs are also ineffective for rare (tail) queries where the data is sparse, as well as for matching user intent to ad copy when the query and bidded keywords do not exactly overlap (smart match). In this paper, we propose a real-time bounce rate prediction system that leverages lightweight features, such as modified tf, positional, and proximity features computed from ad landing pages, and improves prediction for rare queries. The model preserves privacy and uses no user-based features. The entire ensemble is trained on millions of examples from the offline user logs of the Bing commercial search engine and improves the ranking metrics for tail queries and smart match by more than 2x compared to a model that only uses ad-copy-advertiser features.
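As a hedged illustration of the label being predicted, the bounce rate described above can be derived from click logs as the fraction of clicks whose landing-page dwell time falls below a threshold. The function name and the 30-second cutoff below are hypothetical, not taken from the paper:

```python
from collections import defaultdict

# Hypothetical sketch: deriving per-ad bounce rates from click logs.
# A click "bounces" if dwell time on the landing page is below a threshold;
# the 30-second cutoff is an assumption, not the paper's value.
DWELL_THRESHOLD_SECS = 30

def bounce_rate(clicks):
    """clicks: iterable of (ad_id, dwell_time_secs) -> {ad_id: bounce rate}."""
    totals, bounces = defaultdict(int), defaultdict(int)
    for ad_id, dwell in clicks:
        totals[ad_id] += 1
        if dwell < DWELL_THRESHOLD_SECS:
            bounces[ad_id] += 1
    return {ad: bounces[ad] / totals[ad] for ad in totals}

log = [("ad1", 5), ("ad1", 120), ("ad2", 3), ("ad2", 8)]
print(bounce_rate(log))  # → {'ad1': 0.5, 'ad2': 1.0}
```

In the actual system these labels come from offline logs, while the predictor at serving time only sees the lightweight landing-page features.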
Data in the energy domain grows at unprecedented rates and is usually generated by heterogeneous energy systems. Despite the great potential that big data-driven technologies can bring to the energy sector, general adoption is still lagging. Several challenges related to controlled data exchange and data integration remain unresolved. As a result, fragmented applications are developed against energy data silos, and data exchange is limited to a few applications. In this paper, we analyze the challenges and requirements of energy-related data applications. We also evaluate the use of Energy Data Ecosystems (EDEs) as data-driven infrastructures to overcome the current limitations of fragmented energy applications. EDEs are inspired by the International Data Space (IDS) initiative launched in Germany at the end of 2014 with the overall objective of taking both the development and use of the IDS reference architecture model to a European/global level. The reference architecture model consists of four architectures covering business, security, data and service, and software aspects. This paper illustrates the applicability of EDEs and the IDS reference architecture in real-world scenarios from the energy sector. The analyzed scenario is positioned in the context of the EU-funded H2020 project PLATOON.
Realizing smart factories according to the Industry 4.0 vision requires intelligent human-to-machine and machine-to-machine communication. To achieve this goal, components such as actuators, sensors, and cyber-physical systems, along with their data, need to be described; moreover, interoperability conflicts arising from the various semantic representations of these components also demand solutions. To empower communication in smart factories, a variety of standards and standardization frameworks have been proposed. These standards enable the description of the main properties of components, systems, and processes, as well as the interactions between them. Standardization frameworks classify, align, and integrate industrial standards according to their purposes and features. Various standardization frameworks have been proposed all over the world by industrial communities, e.g., RAMI4.0 or IICF. While expressive enough to categorize existing standards, standardization frameworks may present divergent classifications of the same standard. Mismatches between standard classifications generate semantic interoperability conflicts that negatively impact the effectiveness of communication in smart factories. In this article, we tackle the problem of standard interoperability across different standardization frameworks, and devise a knowledge-driven approach that allows standards and standardization frameworks to be described in an Industry 4.0 knowledge graph (I40KG). The STO ontology represents properties of standards and standardization frameworks, as well as relationships among them. The I40KG integrates more than 200 standards and four standardization frameworks. To populate the I40KG, the landscape of standards has been analyzed from a semantic perspective, and the resulting I40KG represents knowledge expressed in more than 200 industry-related documents, including technical reports, research articles, and white papers.
Additionally, the I40KG has been linked to existing knowledge graphs, and automated reasoning has been implemented to reveal implicit relations between standards as well as mappings across standardization frameworks. We analyze both the number of discovered relations between standards and the accuracy of these relations. Observed results indicate that the reasoning and linking processes together increase the connectivity of the knowledge graph by up to 80%, while up to 96% of the relations can be validated. These outcomes suggest that integrating standards and standardization frameworks into the I40KG enables the resolution of semantic interoperability conflicts, empowering communication in smart factories.
Nowadays, the volume and variety of generated data, and how to process it to create value through scalable analytics, are main challenges for industries and real-world practices such as talent analytics. For instance, large enterprises and job centres must perform data-intensive matching of job seekers to many job positions at the same time. In other words, the result should be the large-scale assignment of the best-fit (right) talents (Person) with the right expertise (Profession) to the right job (Position) at the right time (Period); we refer to this as the 4P rule in this paper. All enterprises should consider the 4P rule in their daily recruitment processes as part of efficient workforce development strategies. Doing so demands integrating large volumes of disparate data from various sources and strongly calls for scalable algorithms and analytics. The diversity of data in human resource management requires speeding up analytical processes. The main challenge here is not only how and where to store the data, but also how to analyse it to create value (knowledge discovery). In this paper, we propose a generic Career Knowledge Representation (CKR) model able to capture most competences that exist in a wide variety of careers. Regenerated job qualification data of 15 million employees with 84 dimensions (competences), derived from real HRM data, is used to test and evaluate the proposed Evolutionary MapReduce (EMR) K-Means method. The proposed EMR method yields faster and more accurate experimental results than similar approaches on real large-scale datasets, and the achieved results are discussed.
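A single iteration of the MapReduce-style K-Means underlying such a pipeline can be sketched as follows. This is a plain-Python toy of the map (assign) and reduce (re-average) steps, not the paper's Evolutionary MapReduce variant, which additionally applies evolutionary operators to the centroids:

```python
import math

def nearest(point, centroids):
    """Index of the centroid closest to a competence vector."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def kmeans_iteration(points, centroids):
    # "Map": assign each competence vector to its nearest centroid.
    assignments = [(nearest(p, centroids), p) for p in points]
    # "Reduce": recompute each centroid as the mean of its assigned points.
    new_centroids = []
    for k in range(len(centroids)):
        members = [p for (c, p) in assignments if c == k]
        if members:
            dims = len(members[0])
            new_centroids.append(tuple(
                sum(m[d] for m in members) / len(members) for d in range(dims)))
        else:
            new_centroids.append(centroids[k])  # keep empty clusters in place
    return new_centroids

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(kmeans_iteration(pts, [(0.0, 0.0), (10.0, 10.0)]))  # → [(0.0, 0.5), (10.0, 10.5)]
```

In an actual MapReduce deployment the assignment step runs in parallel over data shards and the reduce step aggregates partial sums per cluster.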
Understanding how researchers grow and succeed in their research field, relative to the growth of the field itself, is an important but difficult question. In many cases, people rely only on indices such as citation count, h-index, and i10-index, and compare scientists from different fields in a similar situation with the same variables. This is not a fair comparison, since fields differ in how they develop and how they are cited. In this paper, we borrow the acceleration concept from physics and propose a new method with new metrics to efficiently and fairly evaluate scientists, based on a real-time analysis of their recent status relative to their field's growth. The method takes into account various inputs, such as whether a scientist is a beginner or an established professional, and applies all such key inputs in the evaluation. The evaluation also accounts for change over time. The results showed better evaluation compared to state-of-the-art metrics.
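The physics analogy can be sketched as a discrete second difference of a yearly citation series, normalized by the same quantity for the field. This is a minimal illustration of the idea, not the paper's exact formula:

```python
# Hypothetical sketch: an "acceleration"-style metric for a scientist,
# normalized by the growth of their field. Not the paper's exact definition.

def acceleration(counts):
    """Second difference of a yearly citation series (discrete acceleration)."""
    return [counts[i + 1] - 2 * counts[i] + counts[i - 1]
            for i in range(1, len(counts) - 1)]

def field_normalized(scientist_counts, field_counts):
    """Scientist's acceleration relative to the field's, year by year."""
    a_s, a_f = acceleration(scientist_counts), acceleration(field_counts)
    return [s / f if f else 0.0 for s, f in zip(a_s, a_f)]

print(acceleration([1, 3, 7, 13]))  # increments grow by 2 each year → [2, 2]
print(field_normalized([1, 3, 7, 13], [10, 20, 40, 80]))  # → [0.2, 0.1]
```

The normalization step is what makes scientists from fast-citing and slow-citing fields comparable on the same scale.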
The Web of Things (WoT) is capable of promoting knowledge discovery and addressing the interoperability problems of various Internet of Things (IoT) applications. To integrate semantic information on the WoT, Semantic Sensor Network (SSN)-based knowledge engineering is utilized for the uniform representation of identical knowledge through sensor ontologies. However, it is arduous to link numerous heterogeneous sensor entities, and Sensor Ontology Matching (SOM) is a newly emerging technique for solving the ontology heterogeneity problem, which aims at finding the semantically identical sensor entities in two ontologies. In this paper, the SOM problem is framed as a regression problem: a number of Entity Similarity Measures (ESMs) are integrated to estimate the real similarity score between two sensor entities. To address it, we propose an Artificial Neural Network (ANN)-based sensor ontology matching technique (ANN-OM), which employs representative entities to enhance both the quality of the alignment and the matching efficiency. The experimental results illustrate that ANN-OM is capable of determining alignments of higher quality than state-of-the-art ontology matchers.
Moving vehicles generate a large amount of sensor data every second. To enable automatic driving in a complex driving environment, large amounts of data must be transmitted, stored, and processed in a short time. Real-time perception of traffic, target characteristics, and traffic density is important for achieving safe driving and a stable driving experience. However, it is very difficult to adjust the pricing strategy according to the actual demand of the network. To analyze the interaction between task vehicles and service vehicles, the Stackelberg game model is introduced. Considering the communication model, the computation model, the optimization objectives, and the delay constraints, this paper constructs the utility functions of the service vehicle and the task vehicle based on the Stackelberg game model. From these utility functions, we can obtain the optimal pricing strategy of service vehicles and the optimal purchase strategy of task vehicles.
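The leader-follower structure of such a game can be sketched with backward induction: the service vehicle (leader) posts a unit price, and the task vehicle (follower) best-responds with how much computation to buy. The concrete utility forms and constants below are illustrative assumptions, not the paper's models:

```python
# Hedged sketch of a two-stage Stackelberg pricing game. Utility forms and
# constants are assumed for illustration only.
A = 10.0    # follower's valuation of computation (assumed)
COST = 0.5  # leader's unit cost of serving (assumed)

def follower_best_response(price):
    """Maximize A*log(1+x) - price*x over x >= 0  =>  x* = A/price - 1."""
    return max(A / price - 1.0, 0.0)

def leader_utility(price):
    x = follower_best_response(price)   # leader anticipates the follower
    return (price - COST) * x

# Backward induction: the leader searches prices, each evaluated against the
# follower's best response (closed form here: p* = sqrt(A * COST)).
prices = [0.6 + 0.01 * i for i in range(400)]
best_price = max(prices, key=leader_utility)
print(round(best_price, 2), round(leader_utility(best_price), 2))
```

With these utilities the equilibrium price is sqrt(A * COST); the delay constraints in the paper would enter as bounds on the follower's feasible purchase.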
With the rapid development of the Web of Things (WoT), a large number of sensors have been deployed. Advanced knowledge can be obtained through deep learning methods and integrated more easily with open Web standards. The large volume of data generated by sensors requires extra processing resources, because the resources of the sensors themselves are limited. Due to bandwidth limitations and low-latency requirements, it is impossible to transfer such large amounts of data to cloud servers for processing. Thus, the concept of distributed fog computing has been proposed to process such big data into knowledge in real time. Large-scale fog computing systems are built from cheap devices, denoted as fog nodes. Therefore, resiliency to fog node failures should be considered in the design of distributed fog computing. LT codes (LTC) have important applications in the design of modern distributed computing, as they can reduce the latency of computing tasks such as the matrix multiplications in deep learning methods. In this paper, we consider that fog nodes may fail, and an improved LT code is applied to the matrix multiplication of the distributed fog computing process to reduce latency. Numerical results show that the improved LTC-based scheme can reduce the average overhead and degree simultaneously, which reduces the latency and computation complexity of distributed fog computing.
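The coded-computation idea can be sketched in a toy form: row blocks of a matrix are summed into coded tasks, some fog nodes fail, and a peeling decoder still recovers the full product. The fixed encoding below is for illustration; the paper's contribution is an improved degree distribution, not this toy scheme:

```python
# Toy sketch of LT-style coded matrix-vector multiplication over fog nodes.
# Each coded task is the sum of a subset of row blocks of A; a peeling
# decoder recovers A @ x even though some fog nodes fail.

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def coded_matvec(blocks, x, encoding, failed):
    # Encode and "run": task t computes (sum of blocks[i] for i in encoding[t]) @ x.
    results = []
    for t, idxs in enumerate(encoding):
        if t in failed:  # this fog node crashed or straggled
            continue
        coded = [[sum(blocks[i][r][c] for i in idxs)
                  for c in range(len(x))] for r in range(len(blocks[0]))]
        results.append((set(idxs), matvec(coded, x)))
    # Peeling decoder: repeatedly resolve symbols with one unknown block.
    decoded, changed = {}, True
    while changed and len(decoded) < len(blocks):
        changed = False
        for idxs, y in results:
            unknown = idxs - decoded.keys()
            if len(unknown) == 1:
                (i,) = unknown
                decoded[i] = [yr - sum(decoded[j][r] for j in idxs & decoded.keys())
                              for r, yr in enumerate(y)]
                changed = True
    return [v for i in sorted(decoded) for v in decoded[i]]

blocks = [[[1, 0]], [[0, 2]], [[3, 1]]]      # three 1x2 row blocks of A
encoding = [{0}, {1}, {2}, {0, 1}, {1, 2}]   # five coded tasks
print(coded_matvec(blocks, [1.0, 1.0], encoding, failed={0, 1}))  # → [1.0, 2.0, 4.0]
```

Despite losing the two nodes holding blocks 0 and 1 directly, the mixed tasks {0,1} and {1,2} let the master peel out every block's result.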
In this paper, a novel end-to-end hand detection method, YOLObile-KCF, for mobile devices based on the Web of Things (WoT) is presented, which can also be applied in practice. While hand detection has become a hot topic in recent years, little attention has been paid to its practical use on mobile devices. We demonstrate that our hand detection system can effectively detect and track hands with high accuracy and speed, enabling users not only to communicate with each other on mobile devices, but also to assist and guide the person on the other side in real time. The method used in our study is object detection based on deep learning. A lightweight neural network suitable for mobile devices, with few parameters and easy deployment, is adopted in our model. In addition, the KCF tracking algorithm is incorporated into our model. Several experiments were carried out to test the validity of the hand detection system. The experiments show that the YOLObile-KCF hand detection system based on the WoT is effective, making smart-life applications more efficient and convenient. Our work on hand detection for smart life proves to be encouraging.
Image inpainting aims to reconstruct the missing or unknown regions of a given image. As one of the most important topics in image processing, this task has attracted increasing research interest over the past few decades. Learning-based methods have been employed to solve this task and have achieved superior performance. Nevertheless, existing methods often produce artificial traces, due to the lack of constraints on image characterization under different semantics. To address this issue, we propose a novel artistic Progressive Semantic Reasoning (PSR) network, composed of a superposition of three generation networks with shared parameters. More precisely, the proposed PSR algorithm follows a typical end-to-end training procedure, learning low-level semantic features and transferring them to a high-level semantic network for inpainting purposes. Furthermore, a simple but effective Cross Feature Reconstruction (CFR) strategy is proposed to trade off semantic information from different levels. Empirically, the proposed approach is evaluated via intensive experiments on a variety of real-world datasets. The results confirm the effectiveness of our algorithm compared with other state-of-the-art methods. The source code can be found at https://github.com/sfwyly/PSR-Net.
The construction of a timing-driven Steiner minimum tree is a critical issue in VLSI routing design. Since the interconnection model of the X-architecture can make fuller use of routing resources than the traditional Manhattan architecture, constructing a Timing-Driven X-architecture Steiner Minimum Tree (TDXSMT) is of great significance for improving routing performance. In this paper, an efficient algorithm based on Social Learning Multi-Objective Particle Swarm Optimization (SLMOPSO) is proposed to construct a TDXSMT that minimizes the maximum source-to-sink pathlength. An X-architecture Prim-Dijkstra model is presented to construct an initial Steiner tree that optimizes both the wirelength and the maximum source-to-sink pathlength. To find better solutions, an SLMOPSO method based on a nearest-and-best selection strategy is presented to improve the global exploration capability of the algorithm. In addition, mutation and crossover operators are utilized to implement the discrete particle update process, thereby better solving the discrete TDXSMT problem. The experimental results indicate that the proposed algorithm achieves an excellent trade-off between the wirelength and the maximum source-to-sink pathlength of the routing tree and can greatly optimize the timing delay.
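The classic Prim-Dijkstra trade-off behind such an initial tree can be sketched as follows. For simplicity this toy uses Manhattan distances and a blending parameter c (the paper works in the X-architecture): each pin joins the tree via the edge minimizing c * pathlength(source, u) + dist(u, v), sliding between Prim (c = 0, minimum wirelength) and Dijkstra (c = 1, minimum source-to-sink path lengths):

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def prim_dijkstra(source, pins, c):
    """Grow a tree: pin v joins via tree node u minimizing c*pathlen[u] + dist(u, v)."""
    in_tree, pathlen, edges = [source], {source: 0}, []
    remaining = list(pins)
    while remaining:
        u, v = min(((u, v) for u in in_tree for v in remaining),
                   key=lambda uv: c * pathlen[uv[0]] + manhattan(uv[0], uv[1]))
        pathlen[v] = pathlen[u] + manhattan(u, v)
        in_tree.append(v)
        remaining.remove(v)
        edges.append((u, v))
    wirelength = sum(manhattan(u, v) for u, v in edges)
    return wirelength, max(pathlen.values())

source, pins = (0, 0), [(4, 0), (4, 1), (4, -1), (0, 5)]
print(prim_dijkstra(source, pins, 0.0))  # c=0, Prim: minimum total wirelength
print(prim_dijkstra(source, pins, 1.0))  # c=1, Dijkstra: minimum path lengths
```

Intermediate values of c trade extra wirelength for shorter source-to-sink paths, which is the two-objective space the SLMOPSO search then explores.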
The Web of Things (WoT) is inclined to suffer from internal attacks launched by compromised nodes. Due to the resource constraints of WoT devices, traditional security methods cannot be deployed. One of the most appropriate protection mechanisms against internal attacks is a trust management system. To evaluate the behavior of WoT nodes reasonably and appropriately, we improve the Beta-based reputation system and propose a Poisson Distribution-based Trust Model (PDTM) in this paper. To evaluate the behavior of a sensor node (or terminal), its reputation and trust are represented by the Poisson distribution. PDTM is used to find reliable nodes for data transmission and to weaken malicious attacks within WoTs. The simulation results indicate that PDTM can resist internal attacks effectively, thereby strengthening network security.
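One common way to build a Poisson-based trust score, sketched here under assumed formulas rather than the paper's exact PDTM definitions, is a Gamma-Poisson update: misbehavior counts per time window are modeled as Poisson, the conjugate Gamma prior gives a closed-form rate estimate, and trust is the probability of observing no misbehavior in the next window:

```python
import math

# Hedged sketch: a Gamma-Poisson trust update (illustrative, not the
# paper's exact PDTM formulation).
class PoissonTrust:
    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha = alpha  # Gamma shape: prior pseudo-misbehaviors
        self.beta = beta    # Gamma rate: prior pseudo-windows

    def observe(self, misbehaviors, windows=1):
        self.alpha += misbehaviors
        self.beta += windows

    def rate(self):
        return self.alpha / self.beta  # posterior mean misbehavior rate

    def trust(self):
        # P(X = 0) for X ~ Poisson(rate): chance of a clean next window.
        return math.exp(-self.rate())

node = PoissonTrust()
node.observe(0, windows=5)    # five clean windows: trust grows
honest_trust = node.trust()
node.observe(8, windows=1)    # burst of malicious behavior
print(honest_trust > node.trust())  # → True: trust drops after the attack
```

A routing layer would then prefer relays whose trust exceeds a threshold, which is how such a model weakens internal attackers.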
Rockburst disasters seriously threaten the safety of construction workers and the progress of construction. To improve the accuracy of rockburst tendency prediction and ensure the rationality of index weighting, classification, and identification, a SOM clustering-combined weighting VIKOR model is proposed to predict rockburst. Based on a comprehensive analysis of rockburst conditions, the samples are classified using three indicators: the rock brittleness index, the tangential stress index, and the elastic strain energy index. The method accurately classifies samples through a self-organizing feature mapping (SOM) network, calculates the weights of the different indicators through a combination weighting method, and finally ranks the rockburst grades through a multi-criteria compromise solution ranking (VIKOR) method. This makes the multi-information fusion of rockburst prediction more objective and operable. Comparison with engineering examples shows that the predictions of the SOM clustering and combination weighting VIKOR method are basically consistent with the observed outcomes.
This paper aims to solve the multiple-mapping problem of Received Signal Strength Indications (RSSIs) and the location estimation problem in mobile positioning. A mobile positioning method based on a Time-distributed Auto Encoder and a Gated Recurrent Unit (TAE-GRU) is proposed. To distinguish identical RSSIs at different temporal steps, this paper develops a reconstruction model based on a Time-distributed Auto Encoder (TAE), which facilitates the subsequent learning of the estimation model. Specifically, the time-distributed technique transforms the data of each temporal step separately, accommodating the temporal characteristics of RSSI data. In addition, an estimation model based on a Gated Recurrent Unit (GRU) is developed to learn the temporal relationships in RSSI data and estimate the locations of mobile devices. By combining the TAE model and the GRU model, the proposed method gains the capability to solve the multiple-mapping and mobile positioning dilemma. Extensive experimental results demonstrate that the proposed method outperforms comparative methods in solving multiple-mapping and positioning problems.
Fraud poses a severe threat to e-commerce platforms, and anti-fraud systems have become indispensable infrastructure for them. Recently, a large number of fraud detection models have been proposed to monitor online purchasing transactions and extract hidden fraud patterns. Thanks to these models, we have observed a significant reduction in committed frauds over the last several years. However, according to our recent statistics, there has been an increasing number of malicious sellers on e-commerce platforms who purposely circumvent these online fraud detection systems by shifting their fake purchasing behaviors from online to offline. As a result, the effectiveness of existing fraud detection systems built upon online transactions is compromised. To solve this problem, we study a new problem in this paper, called offline fraud community detection, which can greatly strengthen existing fraud detection systems. We propose a new FRaud COmmunity Detection from Online to Offline (FRODO) framework which combines the strengths of both online and offline data views, especially offline spatial-temporal data, for fraud community discovery. Moreover, a new Multi-view Heterogeneous Graph Neural Network model is proposed within the FRODO framework which can find anomalous graph patterns, such as biclique communities, from only a small number of black seeds, i.e., a small number of labeled fraud users. The seeds are processed by a streamlined pipeline of three components: label propagation for high coverage, multi-view heterogeneous graph neural networks for recognizing high-risk fraud users, and spatial-temporal network reconstruction and mining for offline fraud community detection. Extensive experimental results on a large real-life Taobao network, with 20 million users, 5 million product items, and 30 million transactions, demonstrate the effectiveness of the proposed methods.
Batch reinforcement learning (RL) is concerned with learning a decision policy from a given dataset without interacting with the environment. Although research actively addresses learning-related issues (e.g., convergence speed, stability, and safety), the empirical challenges that arise before learning are largely ignored. Many RL practitioners face the challenge of determining whether a designed Markov Decision Process (MDP) is valid and meaningful. This study proposes a model-based method to check whether an MDP designed for a given dataset is well formulated, through a heuristic-based feature analysis. We tested our method in constructed as well as more realistic environments. Our results show that our approach can identify potential problems in the data. To the best of our knowledge, performing validity analysis on batch RL data is a novel direction, and we envision our tool serving as a motivational example that helps practitioners apply RL more easily.
An industrial sponsored search system (SSS) can be logically divided into three modules: keyword matching, ad retrieval, and ranking. The number of ad candidates grows exponentially during ad retrieval. Due to limited latency and computing resources, the candidates have to be pruned early. Suppose we set a pruning line that cuts the SSS into two parts: upstream and downstream. The problem we address is how to pick the best K items from the N candidates provided by the upstream so as to maximize the total system revenue. Since the industrial downstream is very complicated and updated frequently, a crucial restriction is that the selection scheme must adapt to the downstream. In this paper, we propose a novel model-free reinforcement learning approach to this problem. Our approach treats the downstream as a black-box environment: the agent sequentially selects items and finally feeds them into the downstream, where the revenue is estimated and used as a reward to improve the selection policy. The idea has been successfully deployed in Baidu's sponsored search system, and a long-term online A/B test shows remarkable improvements in revenue.
Homelessness service provision, a task of great societal relevance, requires solutions to several urgent problems facing our humanity. Data science, which has recently emerged as a potential catalyst for addressing long-standing problems related to human services, offers immense potential. However, homelessness service provision presents unignorable challenges (e.g., assessment methods and data bias) that are seldom found in other domains, requiring cross-discipline collaborations and cross-pollination of ideas. This work summarizes the challenges posed by homelessness service provision tasks, as well as the problems and opportunities that exist for advancing both data science and human services. We begin by highlighting typical goals of homelessness service provision, and subsequently describe homelessness service data along with the properties that make it challenging to use traditional data science methods. Along the way, we discuss some existing efforts and promising directions for data science, and conclude by discussing the importance of deep collaboration between data scientists and domain experts for synergistic advancements in both disciplines.
This paper describes in detail the design and development of a novel annotation framework and annotated resources for Internal Displacement, the outcome of a collaboration with the Internal Displacement Monitoring Centre aimed at improving the accuracy of their monitoring platform IDETECT. The schema includes a multi-faceted description of the events, including cause, number of people displaced, location, and date. Higher-order facets aimed at improving information extraction, such as document relevance and type, are also proposed. We additionally report a case study applying machine learning to the document classification tasks. Finally, we discuss the importance of standardized schemas in dataset benchmark development and their impact on the development of reliable disaster monitoring infrastructure.
Workforce diversification is essential to increase productivity in any world economy. In the context of the Fourth Industrial Revolution, that need is even more urgent since technological sectors are male-dominated. Despite the significant progress made towards gender equality in the last decades, we are far from the ideal scenario. Changes towards equality are too slow and uneven across different world regions. Monitoring gender parity is essential to understand priorities and specificities in each world region. However, it is challenging because of the scarcity and cost of obtaining data, especially in less developed countries. In this paper, we study how the Facebook Advertising Platform (Facebook Ads) can be used to assess gender imbalance in education, focusing on STEM (Science, Technology, Engineering, and Mathematics) areas, which are the main focus of the Fourth Industrial Revolution. As a case study, we apply our methodology to characterize Brazil in terms of gender balance in STEM and to correlate the results obtained from Facebook Ads data with official Brazilian government numbers. Our results suggest that, even considering a biased population in which the majority is female, the proportion of men interested in some majors is higher than the proportion of women. Within STEM areas, we can identify two different patterns: Life Science and Math/Physical Sciences show female dominance, while Environmental Science, Technology, and Engineering majors are still concentrated towards men. We also assess the impact of educational level and age on interest in majors. The gender gap in STEM increases with women's educational level and age, as confirmed by official data in Brazil.
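The core comparison such a methodology relies on can be sketched as a ratio of interest rates over audience estimates. All numbers below are made up for illustration; they are not Facebook Ads figures, and the function is a hypothetical simplification of the paper's analysis:

```python
def gender_gap(men_interested, men_total, women_interested, women_total):
    """Ratio of male to female interest rates; values > 1 mean male-skewed.

    Normalizing by each gender's total audience corrects for the overall
    gender imbalance of the platform population.
    """
    return (men_interested / men_total) / (women_interested / women_total)

# Made-up audience estimates for a hypothetical Engineering major.
print(round(gender_gap(40_000, 1_000_000, 30_000, 1_200_000), 2))  # → 1.6
```

Even though more women than men are on the platform in this toy example, the normalized interest in the major is 1.6x higher among men, which is the kind of pattern the paper reports for Engineering.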
The world-wide refugee problem has a long history, but continues to this day, and will unfortunately continue into the foreseeable future. Efforts to anticipate, mitigate and prepare for refugee counts, however, are still lacking. There are many potential causes, but the published research has primarily focused on identifying ways to integrate already existing refugees into the various communities wherein they ultimately reside, rather than on preventive measures. The work proposed herein uses a set of features that can be divided into three basic categories: 1) sociocultural, 2) socioeconomic, and 3) economic, which refer to the nature of each proposed predictive feature. For example, corruption perception is a sociocultural feature, access to healthcare is a socioeconomic feature, and inflation is an economic feature. Forty-five predictive features were collected for various years and countries of interest. As may seem intuitive, the features that fell under the category of "economic" produced the highest predictive value from the regression technique employed. However, additional potential predictive features that have not been previously addressed stood out in our experiments. These include: the global peace index (gpi), freedom of expression (fe), internet users (iu), access to healthcare (hc), cost of living index (coli), local purchasing power index (lppi), homicide rate (hr), access to justice (aj), and women's property rights (wpr). Many of these features are nascent in terms of both their development and collection, as well as the fact that some of these features are not yet collected at a universal level, meaning that the data is missing for some countries and years. Ongoing work regarding these datasets for predicting refugee counts is also discussed in this work.
It has been observed in several works that ranking candidates by score can be biased against candidates belonging to a minority community. In recent works, fairness-aware representative ranking was proposed for computing fairness-aware re-rankings of results. The proposed algorithm achieves the desired distribution of top-ranked results with respect to one or more protected attributes. In this work, we highlight the bias in fairness-aware representative ranking for an individual, and for a group if the group is sub-active on the platform. We define individual unfairness and group unfairness from two different perspectives. We further propose methods to generate ideal individual and group fair representative rankings when the universal representation ratio is known. The paper concludes with open challenges and further directions.
Machine-driven topic identification of online content is a prevalent task in the natural language processing (NLP) domain. Social media deliberation reflects society's opinion, and a structured analysis of this content allows us to decipher it. We employ an NLP-based approach to investigate migration-related Twitter discussions. Besides traditional deep learning-based models, we also consider pre-trained transformer-based models for analyzing our corpus. We successfully classify multiple strands of public opinion related to European migrants. Finally, we use 'BertViz' to visually explore the interpretability of the better-performing transformer-based models.
In recent years, Deep Neural Networks (DNNs) have emerged as a widely adopted approach in many application domains. Training DNN models is also becoming a significant fraction of the datacenter workload. Recent evidence has demonstrated that modern DNNs are becoming more complex and that the size of DNN parameters (i.e., weights) is increasing. In addition, a large amount of input data is required to train DNN models to target accuracy. As a result, training performance has become one of the major challenges limiting DNN adoption in real-world applications. Recent works have explored different parallelism strategies (i.e., data parallelism and model parallelism) and used multiple GPUs in datacenters to accelerate the training process. However, naively adopting data parallelism and model parallelism across multiple GPUs can lead to sub-optimal executions. The major reasons are i) the large amount of data movement, which prevents the system from feeding the GPUs with the required data in a timely manner (for data parallelism); and ii) low GPU utilization caused by data dependencies between layers placed on different devices (for model parallelism).
In this paper, we identify the main challenges in adopting data parallelism and model parallelism on multi-GPU platforms. Then, we conduct a survey including recent research works targeting these challenges. We also provide an overview of our work-in-progress project on optimizing DNN training on GPUs. Our results demonstrate that simple-yet-effective system optimizations can further improve the training scalability compared to prior works.
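The data-parallel pattern discussed above can be sketched in miniature: each worker computes gradients on its own shard, then the gradients are averaged, which stands in for the all-reduce step whose communication cost limits scalability. This is a pedagogical toy (least-squares on a single scalar weight), not the surveyed systems' implementation:

```python
def sgd_data_parallel(shards, w, lr=0.1):
    """One data-parallel SGD step for the 1-D least-squares model y ≈ w * x."""
    grads = []
    for shard in shards:  # each "GPU" works on its own shard of the data
        g = sum(2 * (w * x - y) * x for x, y in shard) / len(shard)
        grads.append(g)
    g_avg = sum(grads) / len(grads)  # all-reduce: average across workers
    return w - lr * g_avg

# Two shards of points lying on y = 2x, split across two workers.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
w = 0.0
for _ in range(50):
    w = sgd_data_parallel(shards, w)
print(round(w, 3))  # → 2.0, the true slope
```

In real systems the averaging is a network collective over multi-gigabyte gradient tensors every step, which is exactly the data-movement bottleneck identified for data parallelism.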
Recent advances in deep learning technologies have made GPU clusters popular as training platforms. In this paper, we study reliability issues, focusing on training job failures, by analyzing logs collected from deep learning workloads running on a large-scale production GPU cluster. These failures fall largely into two categories based on their sources, infrastructure and user, and reveal diverse causes. With the insights obtained from the failure analysis, we suggest several ways to improve the stability of shared GPU clusters designed for DL training and to optimize the user experience by reducing failure occurrences.
Maintaining efficient utilization of allocated compute resources and controlling their capital and operating expenditure is important for running a hyperscale datacenter infrastructure. Power is one of the most constrained and difficult to manage resources in datacenters. Accurate accounting of power usage across clients of multi-tenant web services can improve budgeting, planning and provisioning of compute resources.
In this work, we propose a queuing-theory-based transitive power modeling framework that estimates the total power cost of a client request across the stack of shared services running in Facebook datacenters. By capturing the non-linearity of the power-versus-load relation, our model is able to estimate the marginal change in power consumption of a system upon serving a request with a mean error of less than 4% when applied to production services. Given that datacenter capacity is planned for peak demand, we test this model at peak load and report up to 2x improvement in accuracy compared to a mathematical model. We further leverage this framework, along with a distributed tracing system, to estimate the power demand shift for serving particular product features to within a fraction of a percent, and to guide the decision to shift their computation to off-peak times.
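The abstract does not spell out the model's equations; as a toy illustration of why capturing the non-linear power-versus-load relation matters for marginal cost estimation, consider the following sketch (the curve shape and the `idle_w`, `peak_w`, and `alpha` parameters are hypothetical, not values from the paper):

```python
def power_watts(utilization, idle_w=100.0, peak_w=300.0, alpha=1.6):
    """Toy non-linear power curve: power grows super-linearly with load.

    idle_w, peak_w and alpha are illustrative parameters, not taken
    from the paper's production measurements.
    """
    return idle_w + (peak_w - idle_w) * (utilization ** alpha)


def marginal_power(utilization, delta):
    """Marginal power of serving extra load `delta` at a given utilization.

    A linear model would return the same marginal cost at any load;
    the non-linear curve shows the cost rising near peak load, which
    is why peak-demand accuracy matters.
    """
    return power_watts(min(utilization + delta, 1.0)) - power_watts(utilization)
```

Under such a curve, the same increment of load is more expensive near peak than at low utilization, which a linear per-request cost model would miss.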
The application of deep learning models brings significant improvement to many services and products at Microsoft. However, it is challenging to provide efficient computation and memory capabilities for both DNN inference and training given that model sizes and complexity keep increasing. On the serving side, many DL models suffer from long inference latency and high cost, preventing their deployment in production. On the training side, large-scale model training often requires complex refactoring of models and access to prohibitively expensive GPU clusters, which are not always accessible to many practitioners. We want to deliver solid solutions and systems while exploring cutting-edge techniques to address these issues. In this talk, I will introduce our experience and lessons from designing and implementing optimizations for both DNN serving and training at large scale, with remarkable compute and memory efficiency improvements and infrastructure cost reductions.
Recent years have witnessed substantial efforts devoted to ensuring algorithmic fairness for machine learning (ML), spanning from formalizing fairness metrics to designing fairness-enhancing methods. These efforts lead to numerous possible choices in terms of fairness definitions and fairness-enhancing algorithms. However, finding the best fairness configuration (including both fairness definition and fairness-enhancing algorithms) for a specific ML task is extremely challenging in practice. The large design space of fairness configurations combined with the tremendous cost required for fairness deployment poses a major obstacle to this endeavor. This raises an important issue: can we enable automated fairness configurations for a new ML task on a potentially unseen dataset?
To this end, we design Auto-Fair, a system that recommends fairness configurations by ranking all fairness configuration candidates based on their evaluations on prior ML tasks. At the core of Auto-Fair lies a meta-learning model that ranks all fairness configuration candidates by utilizing: (1) a set of meta-features derived from both the datasets and the fairness configurations used in prior evaluations; and (2) the knowledge accumulated from previous evaluations of fairness configurations on related ML tasks and datasets. Experimental results on 350 different fairness configurations and 1,500 data samples demonstrate the effectiveness of Auto-Fair.
“Fairness” is a multi-faceted concept that is contested within and across disciplines. In machine learning, it usually denotes some form of equality of measurable outcomes of algorithmic decision making. In this paper, we start from a viewpoint of sociology and media studies, which highlights that to even claim fair treatment, individuals and groups first have to be visible. We draw on a notion and a quantitative measure of diversity that expresses this wider requirement. We used the measure to design and build the Diversity Searcher, a Web-based tool to detect and enhance the representation of socio-political actors in news media. We show how the tool's combination of natural language processing and a rich user interface can help news producers and consumers detect and understand diversity-relevant aspects of representation, which can ultimately contribute to enhancing diversity and fairness in media. We comment on our observation that, through interactions with target users during the construction of the tool, NLP results and interface questions became increasingly important, so that the formal measure of diversity ended up serving as a catalyst for functionality while becoming less important in itself.
Misinformation/disinformation about COVID-19 has been rampant on social media around the world. In this study, we investigate COVID-19 misinformation/disinformation on social media in multiple languages/countries: Chinese (Mandarin)/China, English/USA, and Farsi (Persian)/Iran; and on multiple platforms such as Twitter, Facebook, Instagram, WhatsApp, Weibo, WeChat and TikTok. Misinformation, especially about a global pandemic, is a global problem, yet it is common for studies of COVID-19 misinformation on social media to focus on a single language, like English, a single country, like the USA, or a single platform, like Twitter. We utilized opportunistic sampling to compile 200 specific items of viral yet debunked misinformation across these languages, countries and platforms that emerged between January 1 and August 31. We then categorized this collection based both on the topics of the misinformation and on the underlying roots of that misinformation. Our multi-cultural and multi-linguistic team observed that the nature of COVID-19 misinformation on social media varied in substantial ways across different languages/countries depending on the cultures, beliefs/religions, popularity of social media, types of platforms, freedom of speech and the power of people versus governments. We observe that politics is at the root of most of the collected misinformation across all three languages in this dataset. We further observe the differing impact of government restrictions on platforms and platform restrictions on content in China, Iran, and the USA, and their bearing on a key question of our age: how do we control misinformation without silencing the voices we need to hold governments accountable?
Growing dissatisfaction with platform governance decisions at major social media platforms like Twitter, Facebook, and Instagram has led to a number of substantial efforts, originating both on the political right and the political left, to shift to new platforms. In this paper, we examine one of the most impactful of these platform migration efforts, a recent effort primarily on the political right to shift from Twitter to Parler in response to Twitter's increased efforts to flag misinformation in the lead up to the 2020 election in the US. As a case study, we analyze the usage of Parler by all members of the United States Congress and compare that to their usage of Twitter. Even though usage of Parler, even at its peak, was only a small percentage of Twitter usage, Parler usage has been impactful. Specifically, it was linked to the planning of the January 6, 2021 attack on the United States Capitol building. Going forward, Parler itself may not have a large and lasting impact, but it offers important lessons about the relationship between political polarization, platform migration, and the real-world political impacts of platform governance decisions and the splintering of our media landscape.
We audit the presence of domain-level source diversity bias in video search results. Using a virtual agent-based approach, we compare outputs of four Western and one non-Western search engines for English and Russian queries. Our findings highlight that source diversity varies substantially depending on the language, with English queries returning more diverse outputs. We also find a disproportionately high presence of a single platform, YouTube, in top search outputs for all Western search engines except Google. At the same time, we observe that YouTube’s major competitors, such as Vimeo or Dailymotion, do not appear in the sampled Google video search results. This finding suggests that Google might be downgrading results from the main competitors of Google-owned YouTube, and highlights the necessity for further studies focusing on the presence of own-content bias in Google’s search results.
Jigsaw’s Perspective API aims to protect voices in online conversation by developing and serving machine learning models that identify toxic text. This talk will share how the team behind Perspective thinks about the issues of Fairness, Accountability, Transparency, Ethics and Society through the lens of Google’s AI Principles. For the Perspective team, building technology that is fair and ethical is a continuous, ongoing effort. The talk will cover concrete strategies the Perspective team has already used to mitigate bias in ML models as well as new strategies currently being explored. Finally, with examples of how Perspective is being used in the real world, the talk will show how machine learning, combined with thoughtful human moderation and participation, can help improve online conversations.
Micro-credit loans serve as an indispensable supplement for people lacking verifiable credit records or unqualified for conventional bank loans. Financial fraud detection is one of the mainstream methods to control loan risk: its goal is to utilize a set of corresponding features (e.g., customer behaviors) to predict whether a customer will fail to make required payments in the future. To the best of our knowledge, few works pay attention to the permanent residential locations of customers. However, a study of real data shows that customer location information potentially provides additional power in financial fraud detection. Three challenges hinder the full use of location information: (1) data sparsity makes it hard for a financial fraud detection model to learn the relationship between location information and fraud behaviors; (2) a financial fraud detection model that considers location information alone, without resident personality, might weaken the fraud-distinguishing power of location information; and (3) the representation of location information should be effective and easy to apply, so that it can be used in various applications. In this paper, we propose the Fuller Location Information Embedding (FLIE) network. FLIE handles the above challenges, and its performance is verified by experiments on the tasks of fraudulent customer prediction and customer segmentation.
Every year, publicly listed companies file financial reports to give insights into their activities. These reports are meant for shareholders or the general public to evaluate a company’s health and decide whether to buy or sell stakes in the company. However, these annual financial reports tend to be long, and it is time-consuming to go through the reports for each company. We propose a Goal-Guided Summarization technique through which the summary is extracted. The goal, in our case, is the decision to buy or sell the company’s shares. We use hierarchical neural models for achieving this goal while extracting summaries. By means of intrinsic and extrinsic evaluation, we observe that the summaries extracted by our approach can model the decision of buying and selling shares better than summaries extracted by other summarization techniques, as well as the complete document itself. We also observe that the summary extractor model can be used to construct stock portfolios that give better returns compared to a major stock index.
Existing datasets are mostly composed of official documents, statements, news articles, and so forth. So far, little attention has been paid to the numerals in financial social comments. Therefore, this paper presents CFinNumAttr, a financial numeral attribute dataset in Chinese, built by annotating stock reviews and comments collected from a social networking platform. We also conduct several experiments on the CFinNumAttr dataset with state-of-the-art methods to discover the importance of financial numeral attributes. The experimental results on the CFinNumAttr dataset show that the numeral attributes in social reviews and comments contain rich semantic information, and that the numeral clue extraction and attribute classification tasks can greatly improve financial text understanding.
Neural networks for language modeling have proven effective on several sub-tasks of natural language processing. Training deep language models, however, is time-consuming and computationally intensive. Pre-trained language models such as BERT are thus appealing since (1) they yield state-of-the-art performance, and (2) they relieve practitioners of the burden of preparing the adequate resources (time, hardware, and data) to train models. Nevertheless, because pre-trained models are generic, they may underperform on specific domains. In this study, we investigate the case of multi-class text classification, a task that is relatively less studied in the literature evaluating pre-trained language models. Our work is further placed in the industrial setting of the financial domain. We thus leverage generic benchmark datasets from the literature and two proprietary datasets from our partners in the financial technology industry. After highlighting a challenge for generic pre-trained models (BERT, DistilBERT, RoBERTa, XLNet, XLM) in classifying a portion of the financial document dataset, we investigate the intuition that a specialized pre-trained model for financial documents, such as FinBERT, should be leveraged. Nevertheless, our experiments show that the FinBERT model, even with an adapted vocabulary, does not lead to improvements over the generic BERT models.
Motivated by recent applications of sequential decision making in matching markets, in this paper we attempt to formulate and abstract market designs for P2P lending. We describe a paradigm that sets the stage for how peer-to-peer investments can be conceived from a matching-market perspective, especially when both borrower and lender preferences are respected. We model these specialized markets as an optimization problem and consider different utilities for agents on both sides of the market, while also studying the impact of equitable allocations to borrowers. We devise a technique based on sequential decision making that allows lenders to adjust their choices based on the dynamics of uncertainty arising from competition over time, which also impacts the rewards they receive for their investments. Using simulated experiments, we show the dynamics of the regret relative to the optimal borrower-lender matching, and find that lender regret depends on the initial preferences set by the lenders, which can affect their learning over decision-making steps.
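Under the standard bandit-style definition of regret (the cumulative gap between the reward of the chosen arm and that of the best arm), the quantity being tracked can be sketched as follows; the borrower names and mean-reward values below are invented for illustration, and the sketch ignores the competition dynamics among lenders that the paper models:

```python
def cumulative_regret(mean_rewards, choices):
    """Cumulative regret of a lender's choices versus always picking the
    borrower (arm) with the highest mean reward.

    mean_rewards: dict mapping borrower -> mean reward (hypothetical values).
    choices: sequence of borrowers the lender actually picked over time.
    """
    best = max(mean_rewards.values())
    return sum(best - mean_rewards[choice] for choice in choices)
```

In the paper's setting, rewards would additionally depend on how many competing lenders pick the same borrower; this sketch only captures the baseline notion of regret against the optimal matching.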
Intent detection plays an important role in customer service dialog systems for providing high-quality service in the financial industry. The lack of publicly available datasets and high annotation cost are two challenging issues in this research direction. To overcome these challenges, we propose a social media enhanced self-training approach for intent detection by using label names only. The experimental results show the effectiveness of the proposed method.
Document processing is a foundational pre-processing task in natural language applications in the financial domain. In this paper, we present the results of FinSBD-3, the 3rd shared task on Structure Boundary Detection in unstructured text in the financial domain. The shared task is organized as part of the 1st Workshop on Financial Technology on the Web. Participants were asked to create systems detecting the boundaries of elements in unstructured text extracted from financial PDFs. This edition extends the previous shared tasks by adding the boundaries of visual elements such as tables, figures, page headers and page footers, on top of the sentences, lists and list items already present in previous editions of the shared task.
FinSBD-3 is a shared task organized in the context of the 1st Workshop on Financial Technology on the Web. The task focuses on extracting the entire structure of noisy PDF financial documents, including 1) sentences, lists, items, and the organization of lists and items; 2) figures and tables; and 3) headers and footers. This paper describes an approach that allows us to extract figures and tables using their visual cues. We applied object segmentation techniques from image processing to detect the locations of figures and tables in the PDF files. A post-processing method is then executed to find the exact content. The results show the potential of this approach.
This paper presents the method with which we tackled the FinSBD-3 shared task (structure boundary detection): extracting the boundaries of sentences, lists, and items, as well as structural elements such as footers, headers, and tables, from noisy unstructured English and French financial texts. We propose a hybrid deep learning model that combines a word-embedding-based deep attention model with data augmentation and a BERT model to detect sentence, list-item, footer, header, and table boundaries in noisy English and French texts, and uses the deep attention model to classify list-item sentences into lists and different item types. The experiments show that the proposed method is an effective solution to the FinSBD-3 shared task. The submitted result ranks first on the task metrics in the final leaderboard.
FinSim-2 is the second edition of the FinSim Shared Task on Learning Semantic Similarities for the Financial Domain, co-located with the FinWeb workshop. FinSim-2 proposed the challenge of automatically learning effective and precise semantic models for the financial domain. This second edition offered a dataset enriched in terms of volume and quality, and invited systems that make creative use of relevant resources such as ontologies and lexica, as well as systems that make use of contextual word embeddings such as BERT. Going beyond the mere representation of words is a key step towards industrial applications that make use of Natural Language Processing (NLP). This is typically addressed using either unsupervised corpus-derived representations like word embeddings, which are typically opaque to human understanding but very useful in NLP applications, or manually created resources such as taxonomies and ontologies, which typically have low coverage and contain inconsistencies, but provide a deeper understanding of the target domain. FinSim is inspired by previous endeavours in the SemEval community, which organized several competitions on semantic/lexical relation extraction between concepts/words. This year, 18 system runs were submitted by 7 teams, and systems were ranked according to 2 metrics, Accuracy and Mean Rank. All the systems beat our baseline 1 model by over 15 points, and the best systems beat baseline 2 by 1-3 points in accuracy.
This paper presents the FinMatcher system and its results for the FinSim 2021 shared task which is co-located with the Workshop on Financial Technology on the Web (FinWeb) in conjunction with The Web Conference. The FinSim-2 shared task consists of a set of concept labels from the financial services domain. The goal is to find the most relevant top-level concept from a given set of concepts. The FinMatcher system exploits three publicly available knowledge graphs, namely WordNet, Wikidata, and WebIsALOD. The graphs are used to generate explicit features as well as latent features which are fed into a neural classifier to predict the closest hypernym.
Ontologies have been increasingly used for machine reasoning over the last few years. They can provide explanations of concepts, or be used for concept classification if there exists a mapping from the desired labels to the relevant ontology. Another advantage of using ontologies is that they do not require a learning process, meaning that no training data or training time is needed before using them. This paper presents a practical use of an ontology for a classification problem from the financial domain. It first transforms a given ontology into a graph and then proceeds with generalization, with the aim of finding common semantic descriptions of the input sets of financial concepts.
We present a solution to the shared task on Learning Semantic Similarities for the Financial Domain (FinSim-2 task). The task is to design a system that can automatically classify concepts from the Financial domain into the most relevant hypernym concept in an external ontology - the Financial Industry Business Ontology. We propose a method that maps given concepts to the mentioned ontology and performs a graph search for the most relevant hypernyms. We also employ a word vectorization method and a machine learning classifier to supplement the method with a ranked list of labels for each concept.
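The graph search described above can be sketched as a breadth-first walk up a hypernym graph until a top-level concept is reached. The toy graph and labels below are invented for illustration; they are not the actual Financial Industry Business Ontology:

```python
from collections import deque

# Toy hypernym graph (child -> parents); labels are illustrative,
# not taken from the real FIBO ontology.
HYPERNYMS = {
    "floating rate note": ["bond"],
    "bond": ["debt instrument"],
    "equity index future": ["future"],
    "future": ["derivative"],
}
TOP_LEVEL = {"debt instrument", "derivative", "equity"}


def closest_top_level(term):
    """Breadth-first search upward from a term until a top-level
    concept is reached; returns None if no path exists."""
    queue = deque([term])
    seen = set()
    while queue:
        node = queue.popleft()
        if node in TOP_LEVEL:
            return node
        for parent in HYPERNYMS.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return None
```

Because the walk is breadth-first, the first top-level concept found is the one nearest to the input term, which matches the "most relevant hypernym" objective; the paper's system additionally supplements this search with word vectors and a learned classifier.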
In this paper, we present the different methods proposed for the FinSim-2 Shared Task 2021 on Learning Semantic Similarities for the Financial Domain. The main focus of this task is to evaluate the classification of financial terms into corresponding top-level concepts (also known as hypernyms) extracted from an external ontology. We approached the task as a semantic textual similarity problem. By relying on a siamese network with pre-trained language model encoders, we derived semantically meaningful term embeddings and computed similarity scores between them in a ranked manner. Additionally, we report the results of different baselines in which the task is tackled as a multi-class classification problem. The proposed methods outperformed our baselines and demonstrated the robustness of the siamese-network-based textual similarity models.
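At inference time, a siamese setup of this kind reduces to ranking candidate hypernym labels by the similarity of their embeddings to the term embedding. A minimal cosine-similarity ranker is sketched below; the 2-dimensional embeddings are made up for illustration, whereas the paper derives real embeddings from pre-trained language model encoders:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_labels(term_vec, label_vecs):
    """Return hypernym labels sorted by decreasing similarity to the term."""
    return sorted(label_vecs, key=lambda lbl: cosine(term_vec, label_vecs[lbl]),
                  reverse=True)
```

The ranked output directly supports the Mean Rank metric used by the shared task, since every candidate label receives a position rather than a single hard prediction.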
In this paper, we present our approaches for the FinSim 2021 Shared Task on Learning Semantic Similarities for the Financial Domain. The aim of the FinSim shared task is to automatically classify a given list of terms from the financial domain into the most relevant hypernym (or top-level) concept in an external ontology. Two different word representations have been compared in our study, i.e., customized word2vec provided by the shared task and FinBERT. We first create a customized corpus from the given prospectuses and relevant articles from Investopedia. Then we train the domain-specific word2vec embeddings using the customized data with customized word2vec and FinBERT as the initialized embeddings respectively. Our experimental results demonstrate that these customized word embeddings can effectively improve the classification performance and achieve better results than the direct utilization of the provided word embeddings. The class imbalance issue of the given data is also explored. We empirically study the classification performance by employing several different strategies for imbalanced classification problems. Our system ranks 2nd on both Average Accuracy and Mean Rank metrics.
Recent advances in neural network architectures have provided several opportunities to develop systems that automatically extract and represent information from domain-specific unstructured text sources. The FinSim-2021 shared task, co-located with the FinNLP workshop, offered the challenge of automatically learning effective and precise semantic models of financial domain concepts. Building such semantic representations of domain concepts requires knowledge about the specific domain, which can be obtained from the contextual information available in raw text documents on those domains. In this paper, we propose a transformer-based BERT architecture that captures such contextual information from a set of domain-specific raw documents and then performs a classification task to segregate domain terms into a fixed number of class labels. The proposed model not only considers the contextual BERT embeddings but also incorporates a TF-IDF vectorizer that gives word-level importance to the model. The performance of the model has been evaluated against several baseline architectures.
In this contribution, we describe the systems presented by the PolyU CBS Team at the second Shared Task on Learning Semantic Similarities for the Financial Domain (FinSim-2), where participating teams had to identify the right hypernyms for a list of target terms from the financial domain.
For this task, we ran our classification experiments with several distributional, string-based, and Transformer features. Our results show that a simple logistic regression classifier, when trained on a combination of word embeddings, semantic and string similarity metrics and BERT-derived probabilities, achieves a strong performance (above 90%) in financial hypernymy detection.
This paper describes the method that we submitted to the FinSim-2 task on learning semantic similarities for the financial domain. This task aims to automatically classify financial domain terms into the most relevant hypernym (or top-level) concept in an external ontology. The paper reports the results of experiments using CatBoost, Attention-LSTM, BERT, and RoBERTa to develop an automatic finance domain classifier via a word ontology and embeddings. The experimental results demonstrate that each of these models can be an effective method to tackle the FinSim-2 task.
LocWeb2021 (Eleventh International Workshop on Location and the Web) is a workshop at The Web Conference 2021, with evolving topics around location-aware information access, Web architecture, spatial social computing, and social good. It is designed as a meeting place for researchers around the location topic at The Web Conference.
In this work, we predict user lifetime within the anonymous and location-based social network Jodel in the Kingdom of Saudi Arabia. Jodel’s location-based nature leads to the establishment of disjoint communities country-wide and enables, for the first time, the study of user lifetime across a large set of disjoint communities. A user’s lifetime is an important measurement for evaluating and steering customer bases, as it can be leveraged to predict churn and possibly apply suitable methods to avert potential user losses. We train and test off-the-shelf machine learning techniques with 5-fold cross-validation to predict user lifetime as both a regression and a classification problem, and identify the Random Forest as providing very strong results. Discussing model complexity and quality trade-offs, we also dive deep into a time-dependent feature subset analysis, which does not work very well; easing the classification problem into a binary decision (lifetime longer than a timespan x) enables a practical lifetime predictor with very good performance. We identify implicit similarities across community models according to strong correlations in feature importance. A single countrywide model generalizes the problem and works equally well for any tested community; internally, this overall model works similarly to the others, as also indicated by its feature importances.
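The binary reframing of the lifetime target ("lifetime longer than timespan x") and the 5-fold splitting can be sketched with the stdlib alone; the threshold and lifetime values below are invented, and the actual paper trains off-the-shelf models such as a Random Forest on top of splits like these:

```python
def binarize_lifetimes(lifetime_days, threshold_days):
    """Reframe lifetime regression as binary classification:
    did the user stay longer than `threshold_days`?"""
    return [1 if days > threshold_days else 0 for days in lifetime_days]


def kfold_indices(n_samples, k=5):
    """Contiguous k-fold train/test index splits (no shuffling, for
    determinism; a real pipeline would shuffle first)."""
    folds = []
    base, extra = divmod(n_samples, k)
    start = 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        test = list(range(start, start + size))
        train = [j for j in range(n_samples) if j < start or j >= start + size]
        folds.append((train, test))
        start += size
    return folds
```

Sweeping the threshold over several timespans x yields one binary predictor per horizon, which is the practical predictor family the abstract describes.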
Many online services, including search engines, content delivery networks, ad networks, and fraud detection systems, utilize IP geolocation databases to map IP addresses to their physical locations. However, IP geolocation databases are often inaccurate. We present a novel IP geolocation technique that combines propagating IP location information along traceroutes with IP interpolation. Using a large ground-truth set, we show that the physical locations of IP addresses can be propagated along traceroute paths. We also experiment with and expand upon the concept of IP range location interpolation, where we use the locations of individual addresses in an IP range to assign a location to the entire range. The results show that our approach significantly outperforms commercial geolocation, by up to 31 percentage points. We open-source several components to aid in reproducing our results.
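The range-interpolation idea, using the known locations of a few addresses inside a range to locate the whole range, can be sketched as below. The centroid-with-agreement-check logic and the `max_spread_deg` threshold are illustrative assumptions, not the paper's actual algorithm:

```python
def interpolate_range_location(anchor_locations, max_spread_deg=1.0):
    """Assign a location to an entire IP range from anchors inside it.

    anchor_locations: [(lat, lon), ...] for addresses in the range whose
    locations are already known. Returns their centroid, or None when
    the anchors disagree too much (spread above `max_spread_deg`), in
    which case the range should not be interpolated.
    """
    lats = [p[0] for p in anchor_locations]
    lons = [p[1] for p in anchor_locations]
    if max(lats) - min(lats) > max_spread_deg or max(lons) - min(lons) > max_spread_deg:
        return None
    return (sum(lats) / len(lats), sum(lons) / len(lons))
```

Refusing to interpolate when anchors disagree is one simple way to avoid the kind of database inaccuracy the paper sets out to correct; the paper additionally propagates locations along traceroute paths to obtain more anchors.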
Geolocated user-generated content is a promising source of data reflecting how citizens live and feel. Information extracted from this source is being increasingly used for urban planning and policy evaluation purposes. While a lot of existing research focuses on the relationship between locations and sentiment in social media postings, we aim to uncover relations between location and sentiment that are consistent over cities around the world. In this paper, we therefore analyze the relationship between multiple categories of points of interest (POIs) in the OpenStreetMap dataset and the sentiment of English microblogging messages sent nearby using a three-stage processing pipeline: (1) extract sentiment scores from geolocated microblogs posted on Twitter, (2) spatial aggregation of sentiment in cities and POIs, (3) analyze relationships in aggregated sentiment. We identify differences in Twitter users’ sentiments within cities based on POIs, and we investigate the temporal dynamics of these sentiments and compare our findings between major cities in multiple countries.
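Stage (2) of the pipeline, spatial aggregation of sentiment around POIs, can be sketched with a nearest-POI assignment; the POI categories, coordinates, and sentiment scores below are invented for illustration:

```python
from collections import defaultdict


def aggregate_sentiment(posts, pois):
    """Mean sentiment per POI category, assigning each post to the
    nearest POI by squared Euclidean distance on (lat, lon).

    posts: [(lat, lon, sentiment_score)]
    pois:  [(lat, lon, category)]
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for lat, lon, score in posts:
        # Pick the closest POI; ties break on category name.
        _, category = min(
            ((lat - plat) ** 2 + (lon - plon) ** 2, cat)
            for plat, plon, cat in pois
        )
        sums[category] += score
        counts[category] += 1
    return {cat: sums[cat] / counts[cat] for cat in sums}
```

A production pipeline would use a proper geodesic distance and a spatial index instead of brute-force nearest-neighbor search, but the aggregation logic is the same.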
Understanding human activity patterns in cities enables more efficient and sustainable energy, transport, and resource planning. In this invited talk, after laying out the background on spatio-temporal representation, I will present our unsupervised approaches to handling large-scale multivariate sensor data from heterogeneous sources, prior to modelling them further with the rich contextual signals obtained from the environment. I will also present several spatio-temporal prediction and recommendation problems, leveraging graph-based enrichment and embedding techniques, with applications in continuous trajectory prediction, visitor intent profiling, and urban flow forecasting.
Through recent advancements in speech technologies and the introduction of smart assistants, such as Amazon Alexa, Apple Siri and Google Home, an increasing number of users are interacting with various applications through voice commands. E-commerce companies typically display short product titles on their webpages, either human-curated or algorithmically generated, when brevity is required. However, these titles are dissimilar from natural spoken language. For example, ”Lucky Charms Gluten Free Breakfast Cereal, 20.5 oz a box Lucky Charms Gluten Free” is acceptable to display on a webpage, while a similar title cannot be used in a voice-based text-to-speech application. In such conversational systems, an easy-to-comprehend sentence, such as ”a 20.5 ounce box of lucky charms gluten free cereal”, is preferred. Compared to display devices, where images and detailed product information can be presented to users, short product titles that convey the most important information are necessary when interfacing with voice assistants. We propose eBERT, a sequence-to-sequence approach that further pre-trains the BERT embeddings on an e-commerce product description corpus, and then fine-tunes the resulting model to generate short, natural, spoken-language titles from input web titles. Our extensive experiments on a real-world industry dataset, as well as human evaluation of model output, demonstrate that eBERT summarization outperforms comparable baseline models. Owing to the efficacy of the model, a version of this model has been deployed in a real-world setting.
The rise of deep learning methods has transformed the research area of natural language processing beyond recognition. New benchmark performances are reported on a daily basis ranging from machine translation to question-answering. Yet, some of the unsolved practical research questions are not in the spotlight and this includes, for example, issues arising at the interface between spoken and written language processing.
We identify sentence boundary detection and speaker change detection applied to automatically transcribed texts as two NLP problems that have not yet received much attention but are nevertheless of practical relevance. We frame both problems as binary tagging tasks that can be addressed by fine-tuning a transformer model and we report promising results.
Much of what we do today is centered around humans — whether it is creating the next generation smartphones, understanding interactions with social media platforms, or developing new mobility strategies. A better understanding of people can not only answer fundamental questions about “us” as humans, but can also facilitate the development of enhanced, personalized technologies. In this talk, I will overview the main challenges (and opportunities) faced by research on multimodal sensing of human behavior, and illustrate these challenges with projects conducted in the Language and Information Technologies lab at Michigan.
A talk with two parts covering three modalities. In the first part, I will talk about NLP Beyond Text, where we integrate visual context into a speech recognition model and find that the recovery of different types of masked speech inputs is improved by fine-grained visual grounding against detected objects. In the second part, I will come Back Again, and talk about the benefits of textual supervision in cross-modal speech–vision retrieval models.
Human knowledge and use of language is inextricably connected to perception, action and the organization of the brain, yet natural language processing is still dominated by text! More research involving language (including speech) in the context of other modalities and environments is needed, and there has never been a better time to do it. Without ever invoking the worn-out, overblown phrase "how babies learn" in the talk, I'll cover three of my team's efforts involving language, vision and action. First: our work on speech-image representation learning and retrieval, where we demonstrate settings in which directly encoding speech outperforms the hard-to-beat strategy of using automatic speech recognition and strong text encoders. Second: two models for text-to-image generation: a multi-stage model which exploits user guidance in the form of mouse traces and a single-stage one which uses cross-modal contrastive losses. Third: Room-across-Room, a multilingual dataset for vision-and-language navigation, for which we collected spoken navigation instructions, high-quality text transcriptions, and fine-grained alignments between words and pixels in high-definition 360-degree panoramas. I'll wrap up with some thoughts on how work on computational language grounding more broadly presents new opportunities to enhance and advance our scientific understanding of language and its fundamental role in human intelligence.
Most language use is driven by specific communicative goals in interactive setups, where often visual perception goes hand in hand with language processing. I will discuss some recent projects by my research group related to modelling language generation in socially and visually grounded contexts, arguing that such models can help us to better understand the cognitive processes underpinning these abilities in humans and contribute to more human-like conversational agents.
In this paper, we conducted a tweet sentiment analysis of the 2020 U.S. Presidential Election between Donald Trump and Joe Biden. Specifically, we identified the Multi-Layer Perceptron classifier as the methodology with the best performance on the Sanders Twitter benchmark dataset. We collected a sample of over 260,000 tweets related to the 2020 U.S. Presidential Election via the Twitter API, performed feature extraction, and applied a Multi-Layer Perceptron to classify these tweets with a positive or negative sentiment. From the results, we concluded that (1) contrary to popular poll results, the candidates had a very close negative-to-positive sentiment ratio, (2) negative sentiment is more common and prominent than positive sentiment within the social media domain, (3) some key events can be detected through trends of sentiment on social media, and (4) sentiment analysis can be used as a low-cost and easy alternative for gathering political opinion.
Multi-task learning (MTL) aims to make full use of the knowledge contained in multi-task supervision signals to improve overall performance. How to share knowledge appropriately across multiple tasks is an open problem for MTL. Most existing deep MTL models are based on parameter sharing. However, a suitable sharing mechanism is hard to design, as the relationships among tasks are complicated. In this paper, we propose a general framework called Multi-Task Neural Architecture Search (MTNAS) to efficiently find a suitable sharing route for a given MTL problem. MTNAS modularizes the sharing part into multiple layers of sub-networks. It allows sparse connections among these sub-networks, and soft sharing based on gating is enabled for a certain route. Benefiting from such a setting, each candidate architecture in our search space defines a dynamic sparse sharing route, which is more flexible than the full sharing in previous approaches. We show that existing typical sharing approaches are sub-graphs in our search space. Extensive experiments on three real-world recommendation datasets demonstrate that MTNAS achieves consistent improvements over single-task models and typical multi-task methods while maintaining high computational efficiency. Furthermore, in-depth experiments demonstrate that MTNAS can learn a suitable sparse route to mitigate negative transfer.
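The soft sharing via gating can be illustrated with a minimal sketch, assuming softmax gate weights that mix the outputs of several sub-networks; the sub-network outputs below are placeholder vectors, not the paper's architecture, and all names are illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def gated_mixture(subnet_outputs, gate_logits):
    """Combine sub-network outputs with softmax gate weights.

    A gate weight near zero effectively prunes that connection,
    giving the sparse sharing route described in the abstract.
    """
    gates = softmax(gate_logits)
    dim = len(subnet_outputs[0])
    return [sum(g * out[d] for g, out in zip(gates, subnet_outputs))
            for d in range(dim)]

# Three sub-network outputs for one input, mixed for a given task:
# the large negative logit nearly disables the second sub-network.
outputs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
mixed = gated_mixture(outputs, gate_logits=[2.0, -2.0, 0.0])
print([round(v, 3) for v in mixed])
```

In a real MTNAS-style search, the gate logits would be learned per task and per layer; here they are fixed constants for illustration.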
News media reflects the present state of a country or region to its audiences. Media outlets of a region post different kinds of news for their local and global audiences. In this paper, we focus on Europe (more precisely, the EU) and propose a method to identify news that has an impact on Europe from any aspect, such as finance, business, crime, or politics. Predicting the location of the news is itself a challenging task. Most approaches restrict themselves to named entities or handcrafted features. In this paper, we try to overcome that limitation; i.e., instead of focusing only on named entities (European locations, politicians, etc.) and some hand-crafted rules, we also explore the context of news articles with the help of the pre-trained language model BERT. The BERT-based European news detector shows about 9-19% improvement in terms of F-score over baseline models. Interestingly, we observe that such models automatically capture named entities, their origin, etc.; hence, no separate information is required. We also evaluate the role of such entities in the prediction and explore the tokens that BERT really looks at when deciding the news category. Entities such as persons, locations, and organizations turn out to be good rationale tokens for the prediction.
To attract unsuspecting readers, news article headlines and abstracts are often written with speculative sentences or clauses. Male dominance in the news is very evident, whereas females are seen as “eye candy” or “inferior”, and are underrepresented and under-examined within the same news categories as their male counterparts. In this paper, we present an initial study on gender bias in news abstracts in two large English news datasets used for news recommendation and news classification. We perform three large-scale, yet effective, text-analysis fairness measurements on 296,965 news abstracts. In particular, to our knowledge, we construct two of the largest benchmark datasets of possessive (gender-specific and gender-neutral) nouns and attribute (career-related and family-related) words, which we will release to foster bias and fairness research and to aid in developing fair NLP models that eliminate the paradox of gender bias. Our studies demonstrate that females are immensely marginalized and suffer from socially-constructed biases in the news. This paper devises a methodology whereby news content can be analyzed on a large scale, utilizing natural language processing (NLP) techniques from machine learning (ML) to discover both implicit and explicit gender biases.
In today’s news deluge, it can often be overwhelming to understand the significance of a news article or verify the facts within. One approach to address this challenge is to identify relevant data so that crucial statistics or facts can be highlighted for the user to easily digest, and thus improve the user’s comprehension of the news story in a larger context. In this paper, we look toward structured tables on the Web, especially the high-quality data tables from Wikipedia, to assist in news understanding. Specifically, we aim to automatically find tables related to a news article. To that end, we leverage the content and entities extracted from news articles and their matching tables to fine-tune a Bidirectional Encoder Representations from Transformers (BERT) model. The resulting model is, therefore, an encoder tailored for article-to-table matching. To find the matching tables for a given news article, the fine-tuned BERT model encodes each table in the corpus and the news article into their respective embedding vectors. The tables with the highest cosine similarities to the news article in this new representation space are considered the possible matches. Comprehensive experimental analyses show that the new approach significantly outperforms the baselines over a large, weakly labeled dataset obtained from Web click logs as well as a small, crowdsourced evaluation set. Specifically, our approach achieves nearly 90% accuracy@5, as opposed to baselines varying between 30% and 64%.
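The retrieval step (encode the article and each table, then rank by cosine similarity) can be sketched as follows. The vectors here are toy stand-ins for the BERT embeddings, and the function names are illustrative, not from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def top_k_tables(article_vec, table_vecs, k=2):
    """Return indices of the k tables most similar to the article."""
    ranked = sorted(range(len(table_vecs)),
                    key=lambda i: cosine(article_vec, table_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy embeddings: the first and third tables point in roughly the
# same direction as the article vector, the second does not.
article = [0.9, 0.1, 0.0]
tables = [[0.8, 0.2, 0.1],
          [0.0, 1.0, 0.0],
          [0.7, 0.0, 0.3]]
print(top_k_tables(article, tables))  # -> [0, 2]
```

In practice the table embeddings would be precomputed once per corpus, so only the article needs to be encoded at query time.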
This paper proposes a vision and research agenda for the next generation of news recommender systems (RS), called the table d’hôte approach. A table d’hôte (which translates as the host’s table) meal is a sequence of courses that create a balanced and enjoyable dining experience for a guest. Likewise, we believe news RS should strive to create a similar experience for users by satisfying their news-diet needs. While extant news RS consider criteria such as diversity and serendipity, and RS bundles have been studied for other contexts such as tourism, table d’hôte goes further by ensuring the recommended articles satisfy a diverse set of user needs in the right proportions and in a specific order. In table d’hôte, available articles need to be stratified based on the different ways that news can create value for the reader, building on theories and empirical research in journalism and user engagement. Using theories and empirical research from communication on the uses and gratifications (U&G) consumers derive from media, we define two main strata in a table d’hôte news RS, each with its own substrata: 1) surveillance, which consists of information the user needs to know, and 2) serendipity, which covers articles offering unexpected surprises. The diversity of the articles according to the defined strata and the order of the articles within the list of recommendations are also two important aspects of the table d’hôte approach, in order to give users the most effective reading experience. We propose our vision, link it to existing concepts in the RS literature, and identify challenges for future research.
Automatically identifying fake news from the Internet is a challenging problem in deception detection tasks. Online news is modified constantly during its propagation; e.g., malicious users distort the original truth and make up fake news. However, this continuous evolution process can generate unprecedented fake news and cheat the original model. We present the Fake News Evolution (FNE) dataset: a new dataset tracking the fake news evolution process. Our dataset is composed of 950 paired examples, each of which consists of articles representing the three significant phases of the evolution process: the truth, the fake news, and the evolved fake news. We analyze features during the evolution, including the disinformation techniques, text similarity, top-10 keywords, classification accuracy, parts of speech, and sentiment properties.
News recommendation is crucial for online news services to improve user experience and alleviate information overload. Precisely learning representations of news and users is the core problem in news recommendation. Existing models usually focus on implicit text information to learn the corresponding representations, which may be insufficient for modeling user interests. Even if entity information is considered from external knowledge, it may still not be used explicitly and effectively for user modeling. In this paper, we propose a novel news recommendation approach that combines an explicit entity graph with implicit text information. The entity graph consists of two types of nodes and three kinds of edges, which represent chronological, relatedness, and affiliation relationships. A graph neural network is then utilized for reasoning over these nodes. Extensive experiments on a real-world dataset, the Microsoft News Dataset (MIND), validate the effectiveness of our proposed approach.
The amount of scientific literature continuously grows, which poses an increasing challenge for researchers to manage, find and explore research results. Therefore, the classification of scientific work is widely applied to enable retrieval, to support the search for suitable reviewers during the reviewing process, and in general to organize the existing literature according to a given schema. The automation of this classification process not only simplifies the submission process for authors, but also ensures the coherent assignment of classes. However, especially fine-grained classes and new research fields do not provide sufficient training data to automate the process. Additionally, given the large number of non-mutually-exclusive classes, it is often difficult and computationally expensive to train models able to deal with multi-class multi-label settings. To overcome these issues, this work presents a preliminary Deep Learning framework as a solution for multi-label text classification of scholarly papers in Computer Science. The proposed model addresses the issue of insufficient data by utilizing the semantics of classes, which is explicitly provided by latent representations of class labels. This study uses Knowledge Graphs as a source of these required external class definitions, identifying corresponding entities in DBpedia to improve the overall classification.
In the era of misinformation and information inflation, the credibility assessment of the produced news is of the essence. However, fact-checking can be challenging considering the limited references presented in the news. This challenge can be transcended by utilizing the knowledge graph that is related to the news articles. In this work, we present a methodology for creating scientific news article representations by modeling the directed graph between the scientific news articles and the cited scientific publications. The network used for the experiments comprises the scientific news articles, their topics, the cited research literature, and the corresponding authors. We implement and present three different approaches: 1) a baseline Relational Graph Convolutional Network (R-GCN), 2) a Heterogeneous Graph Neural Network (HetGNN), and 3) a Heterogeneous Graph Transformer (HGT). We test these models on the downstream task of link prediction on a) news article–paper links and b) news article–topic links. The results show promising applications of graph neural network approaches in the domains of knowledge tracing and scientific news credibility assessment.
With substantial and continuing increases in the number of published papers across the scientific literature, development of reliable approaches for automated discovery and assessment of published findings is increasingly urgent. Tools which can extract critical information from scientific papers and metadata can support representation and reasoning over existing findings, and offer insights into replicability, robustness and generalizability of specific claims. In this work, we present a pipeline for the extraction of statistical information (p-values, sample size, number of hypotheses tested) from full-text scientific documents. We validate our approach on 300 papers selected from the social and behavioral science literatures, and suggest directions for next steps.
Although many FAIR principles could be fulfilled by 5-star Linked Open Data, the successful realization of FAIR poses a multitude of challenges. FAIR publishing and retrieval of Linked Data is still rather a FAIRytale than reality, for users and machines. In this paper, we give an overview of four major approaches that tackle individual challenges of FAIR data and present our vision of a FAIR Linked Data backbone. We propose 1) DBpedia Databus - a flexible, heavily automatable dataset management and publishing platform based on DataID metadata, which is extended by 2) the novel Databus Mods architecture, which allows for flexible, unified, community-specific metadata extensions and (search/annotation) overlay systems; 3) DBpedia Archivo - an archiving solution for the unified handling and improvement of FAIRness for ontologies on both the publisher and consumer side; as well as 4) the DBpedia Global ID management and lookup services to cluster and discover equivalent entities and properties.
A huge number of scholarly articles published every day in different domains makes it hard for experts to organize and stay updated with the new research in a particular domain. This study gives an overview of a new approach, HierClasSArt, for knowledge-aware hierarchical classification of scholarly articles in mathematics into a predefined taxonomy. The method uses a combination of neural networks and Knowledge Graphs for better document representation, along with the meta-data information. This position paper further discusses open problems regarding the incorporation of new articles and evolving hierarchies into the pipeline. The mathematics domain is used as a use case.
Finding suitable citations for scientific publications can be challenging and time-consuming. To this end, context-aware citation recommendation approaches that recommend publications as candidates for in-text citations have been developed. In this paper, we present C-Rex, a web-based demonstration system available at http://c-rex.org for context-aware citation recommendation based on the Neural Citation Network and millions of publications from the Microsoft Academic Graph. Our system is one of the first online context-aware citation recommendation systems and the first to incorporate not only a deep learning recommendation approach, but also explanation components to help users better understand why papers were recommended. In our offline evaluation, our model performs similarly to the one presented in the original paper and can serve as a basic framework for further implementations. In our online evaluation, we found that the explanations of recommendations increased users’ satisfaction.
In the context of open science, good research data management (RDM), including data sharing and data reuse, has become a major goal of research policy. However, studies and monitors reveal that open science practices are not yet widely mainstream. Rewards and incentives have been suggested as a solution, to facilitate and accelerate the development of open and transparent RDM. Based on relevant literature, our paper provides a critical analysis of three main issues: what should be rewarded and incentivized, who should be rewarded, and what kind of rewards and incentives should be used? Concluding the analysis, we ask if it is really necessary and appropriate to consider RDM as an individual (behavioral) issue, as the main challenges are elsewhere, not personal, but technological, institutional and financial.
New discoveries in science are often built upon previous knowledge. Ideally, such dependency information should be made explicit in a scientific knowledge graph. The Keystone Framework was proposed for tracking the validity dependency among papers. A keystone citation indicates that the validity of a given paper depends on a previously published paper it cites. In this paper, we propose and evaluate a strategy that repurposes rhetorical category classifiers for the novel application of extracting keystone citations that relate to research methods. Five binary rhetorical category classifiers were constructed to identify Background, Objective, Methods, Results, and Conclusions sentences in biomedical papers. The resulting classifiers were used to test the strategy against two datasets. The initial strategy assumed that only citations contained in Methods sentences were methods keystone citations, but our analysis revealed that citations contained in sentences classified as either Methods or Results had a high likelihood of being methods keystone citations. Future work will focus on fine-tuning the rhetorical category classifiers, experimenting with multiclass classifiers, evaluating the revised strategy with more data, and constructing a larger gold standard citation context sentence dataset for model training.
The growth rate of the number of scientific publications is constantly increasing, creating important challenges in the identification of valuable research and in various scholarly data management applications, in general. In this context, measures which can effectively quantify the scientific impact could be invaluable. In this work, we present BIP! DB, an open dataset that contains a variety of impact measures calculated for a large collection of more than 100 million scientific publications from various disciplines.
Measuring the quality of research work is an essential component of the scientific process. With the ever-growing rates of articles being submitted to top-tier conferences, and the potential consistency and bias issues in the peer-review process identified by the scientific community, it is both necessary and challenging to automatically evaluate submissions. Existing works mainly focus on exploring relevant factors and applying machine learning models simply to be accurate at predicting the acceptance of a given academic paper, while ignoring the interpretability required by a wide range of applications. In this paper, we propose a framework to construct decision sets that consist of unordered if-then rules for predicting paper acceptance. We formalize the decision set learning problem via a joint objective function that simultaneously optimizes the accuracy and interpretability of the rules, rather than organizing them in a hierarchy. We evaluate the effectiveness of the proposed framework by applying it to a public scientific peer review dataset. Experimental results demonstrate that the interpretable decision sets learned by our framework perform on par with state-of-the-art classification algorithms that optimize exclusively for predictive accuracy, while being much more interpretable than rule-based methods.
The automatic extraction of topics is a standard technique for summarizing text corpora from various domains (e.g., news articles, transport or logistic reports, scientific publications) that has several applications. Since, in many cases, topics are subject to continuous change, there is a need to monitor the evolution of a set of topics of interest as the corresponding corpora are updated. The evolution of scientific topics, in particular, is of great interest for researchers, policy makers, fund managers, and other professionals/engineers in the research and academic community. In this work, we demonstrate a prototype that provides intuitive visualisations for the evolution of scientific topics, providing insights about topic transformation, merging, and splitting during recent years. Although the prototype works on top of a scientific text corpus, its implementation is generic and can be easily applied to texts from other domains as well.
Neural models have been applied to many text summarization tasks recently. In general, a large number of high-quality reference summaries are required to train well-performing neural models. The reference summaries, i.e., the ground truth, are usually written by humans and are costly to obtain. Thus, in this paper, we focus on the unsupervised summarization problem by exploring news and readers’ comments in linking tweets, i.e., tweets with URLs linking to the news. Our data analysis shows that the linking tweets collectively highlight important information in news but may not fully cover all of its content. This inspires us to propose a dual-attention based model, named DAS, to address the issues observed above. The dual-attention mechanism extracts both the important information highlighted by linking tweets and the salient content in the news. Specifically, it consists of two similar Transformer structures with multi-head attention. We propose position-dependent word salience, which reflects the effect of local context. The word salience is computed from the dual-attention mechanism, and sentence salience is then estimated from the word salience. Experimental results on a benchmark dataset show that DAS outperforms state-of-the-art unsupervised models and achieves comparable results with state-of-the-art supervised models.
Hate speech on social media platforms has become a severe issue in recent years. To cope with it, researchers have developed machine learning-based classification models. Due to the complexity of the problem, the models are far from perfect. A promising approach to improve them is to integrate social network data as additional features in the classification. Unfortunately, there is a lack of datasets containing text and social network data to investigate this phenomenon. Therefore, we develop an approach to identify and collect hater networks on Twitter that uses a pre-trained classification model to focus on hateful content. The contributions of this article are (1) an approach to identify hater networks and (2) an anonymized German offensive language dataset that comprises social network data. The dataset consists of 4,647,200 labeled tweets and a social graph with 49,353 users and 122,053 edges.
Microblogs have become the preferred means of communication for people to share information and feelings, especially for fast-evolving events. Understanding the emotional reactions of people allows decision makers to formulate policies that are likely to be better received by the public and hence better accepted, especially during policy implementation. However, uncovering the topics and emotions related to an event over time is a challenge due to the short and noisy nature of microblogs. This work proposes a weakly supervised learning approach to learn coherent topics and the corresponding emotional reactions as an event unfolds. We summarize the event by giving the representative microblogs and the emotion distributions associated with the topics over time. Experiments on multiple real-world event datasets demonstrate the effectiveness of the proposed approach over existing solutions.
Chatbots are increasingly used for delivering mental health assistance. As part of our effort to develop a chatbot on academic and social issues for Cantonese-speaking students, we have constructed a dataset of 1,028 post-reply pairs on test anxiety and loneliness. The posts, harvested from Cantonese social media, are manually classified into a symptom category drawn from the counselling literature; the replies are human-crafted, offering brief advice for each post. For response selection, the chatbot predicts the quality of a candidate post-reply pair with a regression model. During training, the symptom categories were used as proxies of reply relevance. In experiments, this approach improved response selection accuracy over a binary classification model and a weakly supervised regression model. This result suggests that manual annotation of symptom categories can help boost the performance of a counsellor chatbot.
As user-generated content thrives, so does the spread of toxic comments. Detecting toxic comments has therefore become an active research area, and it is often handled as a text classification task. Pre-trained language model-based methods, recently popular for text classification tasks, are at the forefront of natural language processing, achieving state-of-the-art performance on various NLP tasks. However, there is a paucity of studies using such methods for toxic comment classification. In this work, we study how to best make use of pre-trained language model-based methods for toxic comment classification, and the performance of different pre-trained language models on these tasks. Our results show that, out of the three most popular language models, i.e., BERT, RoBERTa, and XLM, BERT and RoBERTa generally outperform XLM on toxic comment classification. We also show that using a basic linear downstream structure outperforms complex ones such as CNN and BiLSTM. What is more, we find that further fine-tuning a pre-trained language model with light hyper-parameter settings improves the downstream toxic comment classification task, especially when the task has a relatively small dataset.
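The finding that a basic linear downstream structure suffices can be illustrated with a minimal sketch: a logistic-regression head trained over fixed sentence embeddings. The embeddings and the training loop below are toy stand-ins, not the paper's actual BERT pipeline; all names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_linear_head(X, y, lr=0.5, epochs=200):
    """Logistic-regression head trained with plain stochastic
    gradient descent over fixed embedding vectors X."""
    dim = len(X[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - target  # gradient of log loss w.r.t. the logit
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    """True if the comment embedding x is classified as toxic."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5

# Toy "embeddings": the first coordinate loosely encodes toxicity.
X = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.2, 0.7]]
y = [1, 1, 0, 0]  # 1 = toxic, 0 = non-toxic
w, b = train_linear_head(X, y)
print(predict(w, b, [0.85, 0.2]))
```

In the paper's setting, the head would sit on top of BERT/RoBERTa outputs and the encoder itself would be fine-tuned; the point of the sketch is only that the downstream structure can be this simple.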
Social media has become an essential part of the daily routines of children and adolescents. Moreover, enormous efforts have been made to ensure the psychological and emotional well-being of young users as well as their safety when interacting with various social media platforms. In this paper, we investigate the exposure of those users to inappropriate comments posted on YouTube videos targeting this demographic. We collected a large-scale dataset of approximately four million records and studied the presence of five age-inappropriate categories and the amount of exposure to each category. Using natural language processing and machine learning techniques, we constructed ensemble classifiers that achieved high accuracy in detecting inappropriate comments. Our results show a large percentage of worrisome comments with inappropriate content: we found 11% of the comments on children’s videos to be toxic, highlighting the importance of monitoring comments, particularly on children’s platforms.
We propose a multitask deep neural network for detecting affect-retweet pairs for Twitter tweets. Each task given to our network jointly learns a given affect, e.g. hate, sarcasm etc., along with learning retweeting behaviour as an auxiliary task, from a given tweet corpus. On test data, this model allows us to predict retweet behaviour in the absence of any further meta-data, along with identifying affect. This allows us also to predict whether a tweet with affect would go viral or not. Our model delivers F1-scores of 0.93 and 0.91 for hate and sarcasm detection respectively, and predicts retweets with the accuracy of 71% and 60% respectively, delivering state-of-the-art performance on benchmark data.
Graph or network representations are an important foundation for data mining and machine learning tasks on relational data. Many tools of network analysis, like centrality measures, information ranking, or cluster detection, rest on the assumption that links capture direct influence and that paths represent possible indirect influence. This assumption is invalidated in time series data capturing, e.g., time-stamped social interactions, time-resolved co-occurrences, or other types of relational time series. In such data, for two time-stamped links (A,B) and (B,C), the chronological ordering and timing determine whether a causal path from node A via B to C exists. A number of works have shown that, for this reason, network analysis cannot be directly applied to time-stamped data. Existing methods to address this issue require statistics on causal paths, which is computationally challenging for big time series data.
Addressing this problem, we develop an efficient algorithm to count causal paths in time-stamped network data. Applying it to empirical data, we show that our method is more efficient than a baseline method implemented in an open-source data analytics package. Our method works efficiently for different values of the maximum time difference between consecutive links of a causal path and supports streaming scenarios. With it, we close a gap that has hindered the efficient analysis of large temporal networks.
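The core counting idea can be illustrated with a minimal sketch (our own toy illustration, not the paper's algorithm): two time-stamped links (a, b, t1) and (b, c, t2) form a causal path a -> b -> c only if t1 < t2 and the gap t2 - t1 does not exceed a maximum time difference delta.

```python
from collections import defaultdict

def count_causal_paths(edges, delta):
    """edges: list of (source, target, timestamp) triples; returns the
    number of causal paths of length two."""
    in_by_node = defaultdict(list)  # timestamps of links arriving at each node
    for u, v, t in edges:
        in_by_node[v].append(t)
    count = 0
    for u, v, t2 in edges:
        # every earlier link arriving at u within delta extends into this link
        count += sum(1 for t1 in in_by_node[u] if 0 < t2 - t1 <= delta)
    return count

edges = [("A", "B", 1), ("B", "C", 2), ("B", "C", 5)]
print(count_causal_paths(edges, delta=2))  # only A->B->C via t=1,2 qualifies: 1
```

Note how the same static topology yields different counts for different values of delta, which is exactly why path statistics, rather than link statistics, are needed for temporal networks.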
Machine learning models and recommender systems play a crucial role in web applications, providing personalized experiences to each customer. Recurring visits of the same customer raise a nontrivial question about the persistence of the experience. Given a changing user context, alongside online algorithms that update over time, the optimal treatment might differ from past model decisions. However, changing customer experience may create inconsistency and harm customer satisfaction and business process completion.
This paper discusses the tradeoff between providing the user with a consistent experience and suggesting an up-to-date optimal treatment. We offer preliminary approaches to tackle the persistence problem and explore the tradeoffs in a simulated study.
The open-source software package pathpy, available at https://www.pathpy.net, implements statistical techniques to learn optimal graphical models for the causal topology generated by paths in time series data. Operationalizing Occam's razor, these models balance model complexity with explanatory power for empirically observed paths in relational time series. Standard network analysis is justified if the inferred optimal model is a first-order network model. Optimal models with orders larger than one indicate higher-order dependencies and can be used to improve the analysis of dynamical processes, node centralities, and clusters.
The visual analysis of temporal network data is often hindered by the cognitively demanding nature of dynamic graph visualizations. Addressing this issue, the graph visualization tool HOTVis generates time-aware static network visualizations that highlight the causal topology of temporal networks, i.e., which nodes can directly and indirectly influence each other, and are thus considerably easier to interpret than state-of-the-art dynamic graph visualizations.
References are an essential part of Wikipedia: each statement in Wikipedia should be referenced. In this paper, we explore the creation and collection of references for new Wikipedia articles from an editor's perspective. We map out the workflow of editors when creating a new article, emphasising how they select references.
Online peer production communities such as Wikipedia typically rely on a distinct class of users, called administrators, to enforce cooperation when good-faith collaboration fails. Assessing a user's intentions is a complex task, however, especially when operating under time pressure with a limited number of (costly to collect) cues. In such situations, individuals typically rely on simplifying heuristics to make decisions, at the cost of precision. In this paper, we hypothesize that administrators' community governance policy might be influenced by general trust attitudes acquired mostly outside the Wikipedia context. We use a decontextualized online experiment to elicit levels of trust in strangers in a sample of 58 English Wikipedia administrators. We show that low-trusting admins exercise their policing rights significantly more (e.g., blocking about 81% more users than high-trusting types on average). We conclude that efficiency gains might be reaped from the further development of tools aimed at inferring users' intentions from digital trace data.
The Wikidata knowledge base (KB) is one of the most popular structured data repositories on the web, containing more than 1 billion statements for over 90 million entities. Like most major KBs, it is nonetheless incomplete and therefore operates under the open-world assumption (OWA): statements not contained in Wikidata should be assumed to have an unknown truth value. The OWA ignores, however, that a significant part of interesting knowledge is negative and cannot be readily expressed in this data model.
In this paper, we review the challenges arising from the OWA, as well as some specific attempts Wikidata has made to overcome them. We review a statistical inference method for negative statements, called peer-based inference, and present Wikinegata, a platform that implements this inference over Wikidata. We discuss lessons learned from the development of this platform, as well as how it can be used both for learning about interesting negations and for exploring modelling challenges inside Wikidata. Wikinegata is available at https://d5demos.mpi-inf.mpg.de/negation.
The Bengali Wikipedia crossed the milestone of 100,000 articles in December 2020, after a journey of almost 17 years. In this journey, the Bengali language edition of the world's largest encyclopedia has experienced multiple changes, with a promising increase in overall performance in terms of the growth of community members and content. This paper analyzes the various associated factors throughout this journey, including the number of active editors, the number of content pages, and pageviews, along with the connection of outreach activities to these parameters. The gender gap is a worldwide problem and is quite prevalent in Bengali Wikipedia as well; it seems to have remained unchanged over the years and consequently leaves a conspicuous disparity in the movement. The paper inspects the present scenario of Bengali Wikipedia through quantitative factors, with a relative comparison with other regional languages.
Wikipedia is a major source of information utilized by internet users around the globe for fact-checking and access to general, encyclopedic information. For researchers, it offers an unprecedented opportunity to measure how societies respond to events and how our collective perception of the world evolves over time and in response to events. Wikipedia use and the reading patterns of its users reflect our collective interests and the way they are expressed in our search for information, whether as part of fleeting, zeitgeist-fed trends or long-term interests, on almost every topic, from personal to business, through political, health-related, academic, and scientific. In a very real sense, events are defined by how we interpret them and how they affect our perception of the context in which they occurred, rendering Wikipedia invaluable for understanding events and their context. This paper introduces WikiShark (www.wikishark.com), an online tool that allows researchers to analyze Wikipedia traffic and trends quickly and effectively by (1) instantly querying pageview traffic data; (2) comparing traffic across articles; (3) surfacing and analyzing trending topics; and (4) easily leveraging findings for use in their own research.
Wikipedia articles are known for their exhaustive knowledge and extensive collaboration. Users perform various tasks, including editing (adding new facts or rectifying mistakes), looking up new topics, or simply browsing. In this paper, we investigate the impact of gradual edits on the re-positioning and organization of factual information in Wikipedia articles. The literature shows that in a collaborative system, a set of contributors is responsible for seeking, perceiving, and organizing the information. However, very little is known about the evolution of information organization in Wikipedia articles. Based on our analysis, we show that in a Wikipedia article, the crowd is capable of placing factual information in its correct position, eventually reducing knowledge gaps. We also show that the majority of information re-arrangement occurs in the initial stages of article development and gradually decreases in later stages.
Our findings advance our understanding of the fundamentals of information organization on Wikipedia articles and can have implications for developers aiming to improve the content quality and completeness of Wikipedia articles.
The advent of Wikidata represented a breakthrough as a collaborative and constantly advancing knowledge base. As originally envisioned, it simplified linkage and data reuse among different Wikimedia projects. The Catalan Wikipedia is one project where Wikidata has been heavily adopted by the community, for example in its integration with article infoboxes or in automatically generated lists. In this article we highlight the possibilities of taking advantage of structured data from Wikidata for evaluating new biographical articles, thus helping users engage in diversity challenges or track potential vandalism and errors.
The quality of Wikipedia articles is evaluated manually, which is time-consuming as well as susceptible to human bias. An automated assessment of these articles may help minimize the overall time and manual errors. In this paper, we present a novel approach based on the structural analysis of the Wikigraph to automate the estimation of the quality of Wikipedia articles. We examine the network built from the complete set of English Wikipedia articles and identify how the network signatures of the articles vary with their quality. Our study shows that these signatures are useful for estimating the quality grades of un-assessed articles with an accuracy surpassing existing approaches in this direction. The results of the study may help reduce the need for human involvement in quality assessment tasks.
Current state-of-the-art task-agnostic visio-linguistic approaches, such as ViLBERT, are limited to domains in which texts have a visual materialization (e.g., a person running). This work describes a project towards achieving the next generation of models, which can deal with open-domain media and learn visio-linguistic representations that reflect the data's context by jointly reasoning over media, a domain knowledge graph, and temporal context. This ambition will be supported by a Wikimedia data framework comprising comprehensive, high-quality data covering a wide range of social, cultural, political, and other types of events. Towards this goal, we propose a research setup comprising an open-domain data framework and a set of novel independent research tasks.
A major challenge for many analyses of Wikipedia dynamics (e.g., imbalances in content quality, geographic differences in what content is popular, what types of articles attract more editor discussion) is grouping the very diverse range of Wikipedia articles into coherent, consistent topics. This problem has been addressed using various approaches based on Wikipedia's category network, WikiProjects, and external taxonomies. However, these approaches have always been limited in their coverage: typically, only a small subset of articles can be classified, or the method cannot be applied across the more than 300 languages on Wikipedia. In this paper, we propose a language-agnostic approach, based on the links in an article, for classifying articles into a taxonomy of topics that can be easily applied to (almost) any language and article on Wikipedia. We show that it matches the performance of a language-dependent approach while being simpler and having much greater coverage.
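The language-agnostic intuition behind a link-based classifier can be sketched as follows (our own toy illustration, not the paper's model, and the Wikidata IDs and topic labels are made up for the example): each article is represented by the set of its outlinks, resolved to language-independent Wikidata IDs, so the same features apply across all language editions.

```python
# Toy sketch of language-agnostic, link-based topic classification.
# Articles are represented by sets of outlink IDs; a simple
# nearest-neighbour rule assigns the topic with the highest link overlap.
def jaccard(a, b):
    """Overlap between two link sets (Jaccard similarity)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def classify(article_links, labeled):
    """`labeled` maps topic -> link set of a labeled example article."""
    return max(labeled, key=lambda topic: jaccard(article_links, labeled[topic]))

labeled = {
    "STEM":   {"Q413", "Q7186", "Q37226"},     # hypothetical science item IDs
    "Sports": {"Q2736", "Q1194951", "Q5369"},  # hypothetical sports item IDs
}
print(classify({"Q413", "Q37226", "Q42"}, labeled))  # -> STEM
```

Because the features are item IDs rather than words, the same labeled examples can classify an article regardless of the language it is written in, which is the coverage advantage the abstract describes.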
Mathematical information retrieval (MathIR) applications such as semantic formula search and question answering systems rely on knowledge-bases that link mathematical expressions to their natural language names. For database population, mathematical formulae need to be annotated and linked to semantic concepts, which is very time-consuming. In this paper, we present our approach to structure and speed up this process by using an application-driven strategy and AI-aided system. We evaluate the quality and time-savings of AI-generated formula and identifier annotation recommendations on a test selection of Wikipedia articles from the physics domain. Moreover, we evaluate the community acceptance of Wikipedia formula entity links and Wikidata item creation and population to ground the formula semantics. Our evaluation shows that the AI guidance was able to significantly speed up the annotation process by a factor of 1.4 for formulae and 2.4 for identifiers. Our contributions were accepted in 88% of the edited Wikipedia articles and 67% of the Wikidata items. The AnnoMathTeX annotation recommender system is hosted by Wikimedia at annomathtex.wmflabs.org. In the future, our data refinement pipeline will be integrated seamlessly into the Wikimedia user interfaces.
Wikidata recently added support for entity schemas based on Shape Expressions (ShEx). These play an important role in validating items belonging to a multitude of domains on Wikidata. However, the number of entity schemas created by contributors is relatively low compared to the number of WikiProjects. The past couple of years have seen attempts at simplifying shape expressions and building tools for creating them. In this article, we present ShExStatements, which aims to simplify writing shape expressions for Wikidata.
In this paper, we investigate the state of the art of machine learning models for inferring sociodemographic attributes of Wikipedia editors based on their public profile pages, and the corresponding implications for editor privacy. To build models for inferring sociodemographic attributes, ground-truth labels are obtained via different strategies, using publicly disclosed information from editor profile pages. Different embedding techniques are used to derive features from editors' profile texts. In comparative evaluations of different machine learning models, we show that the highest prediction accuracy can be obtained for the attribute gender, with precision values of 82% and 91% for women and men respectively, as well as an averaged F1-score of 0.78. For other attributes, like age group, education, and religion, the classifiers exhibit F1-scores in the range of 0.32 to 0.74, depending on the model class. By merely using publicly disclosed information about Wikipedia editors, we highlight issues surrounding editor privacy on Wikipedia and discuss ways to mitigate this problem. We believe our work can help start a conversation about carefully weighing the potential benefits and harms that come with the existence of information-rich, pre-labeled profile pages of Wikipedia editors.
Quantifying the moral narratives expressed in user-generated text, news, or public discourse is fundamental for understanding individuals' concerns and viewpoints and for preventing violent protests and social polarisation. Moral Foundations Theory (MFT) was developed precisely to operationalise morality in a five-dimensional scale system. Recent developments of the theory urged the introduction of a new foundation, liberty. Having only recently been added to the theory, liberty has no available linguistic resources for assessing it in text corpora. Given its importance to current social issues such as the vaccination debate, we propose a data-driven approach to derive a liberty lexicon based on aligned documents from online encyclopedias with different worldviews. Despite the preliminary nature of our study, we show proof of concept that large encyclopedia corpora can point out differences in the way people with contrasting viewpoints express themselves. Such differences can be used to derive a novel lexicon identifying linguistic markers of the liberty foundation.
Wikipedia has been a critical information source during the COVID-19 pandemic. Analyzing how information is created, edited, and viewed on this platform can yield new insights for risk communication strategies for the next pandemic. Here, we study the content editor and viewer patterns of COVID-19-related documents on Wikipedia, using a near-complete dataset covering 11 languages over 238 days in 2020. Based on the analysis of the daily access and edit logs of the identified Wikipedia pages, we discuss how regional and cultural closeness factors affect information demand and supply.
Wikipedia is a critical platform for organizing and disseminating knowledge. One of the key principles of Wikipedia is neutral point of view (NPOV), so that bias is not injected into objective treatment of subject matter. As part of our research vision to develop resilient bias detection models that can self-adapt over time, we present in this paper our initial investigation of the potential of a cross-domain transfer learning approach to improve Wikipedia bias detection. The ultimate goal is to future-proof Wikipedia in the face of dynamic, evolving kinds of linguistic bias and adversarial manipulations intended to evade NPOV issues. We highlight the impact of incorporating evidence of bias from other subjectivity rich domains into further pre-training a BERT-based model, resulting in strong performance in comparison with traditional methods.
Wikipedia, the online encyclopedia, is a trusted source of knowledge for millions of individuals worldwide. As anyone can start a new article, it is often necessary to decide whether certain entries meet the standards for inclusion set forth by the community. These decisions (known as “Articles for Deletion”, or AfD) are taken by groups of editors in a deliberative fashion and are known for displaying a number of common biases associated with group decision making. Here, we present an analysis of 1,967,768 AfD discussions between 2005 and 2018. We perform a signed network analysis to capture the dynamics of agreement and disagreement among editors. We measure each editor's preference for voting toward either inclusion or deletion. We further describe the evolution of individual editors and their voting preferences over time, finding four major opinion groups. Finally, we develop a predictive model of discussion outcomes based on latent factors. Our results shed light on an important, yet overlooked, aspect of curation dynamics in peer production communities and could inform the design of improved processes of collective deliberation on the web.
Wikipedia is an online, free, multi-language, and collaborative encyclopedia, currently one of the most significant information sources on the web. The open nature of Wikipedia contributions raises concerns about the quality of its information. Previous studies have addressed this issue using manual evaluations and proposing generic measures for quality assessment. In this work, we focus on the quality of health-related content. For this purpose, we use general and health-specific features from Wikipedia articles to propose health-specific metrics. We evaluate these metrics using a set of Wikipedia articles previously assessed by WikiProject Medicine. We conclude that it is possible to combine generic and specific metrics to determine health-related content’s information quality. These metrics are computed automatically and can be used by curators to identify quality issues. Along with the explored features, these metrics can also be used in approaches that automatically classify the quality of Wikipedia health-related articles.
This paper aims to give a systemic vision of data-driven mobile applications in urban data management processes, which is essential to ensure a sustainable smart city ecosystem and requires diversification among stakeholders and data sources. Developing such applications calls for sustainable, data-driven smart solutions, built on an urban data platform, that enable citizen wellbeing in the smart city. In this paper, we present five case-study mobile applications developed using AppSheet and Google Apps Script technologies to prevent the spread of COVID-19 and provide support to (potentially) infected citizens. Several aspects relevant to the coronavirus pandemic are considered: quick COVID-19 patient assessment based on user-provided symptoms, integrated with contact tracing; volunteer help during quarantine; UAV-based COVID-19 outdoor safety surveillance; test scheduling; and an AR-based pharmacy shop assistant.
Planning and establishing digital transformation (DT) is a complex process for any organization. A city's DT is an even more challenging and complex process, demanding both a leading, dedicated role for the local government and the engagement and commitment of local stakeholders around a commonly agreed vision and plan. The European Commission launched its Digital Cities Challenge (DCC) and Intelligent Cities Challenge (ICC) initiatives to provide cities with guidance and support in designing and implementing corresponding digital transformation strategies. Shaping this strategy became harder during the ICC due to the COVID-19 pandemic, which changed all the local priorities and affected the initial city planning. The aim of this work-in-progress paper is to present the strategic planning process for digital transformation followed by the municipality of Trikala in Greece, which, despite being a well-known smart city, had to join the DCC and ICC initiatives in order to carry out this process methodically. Useful evidence is presented regarding the different stakeholders' perspectives and priorities within the city's digital transformation, and especially whether and how the COVID-19 outbreak re-arranged or re-shaped them.
The formulation of carpooling schemes for mutual cost benefit between driver and passengers has a long history. However, the convenience of driving alone (especially under the current COVID-19 pandemic), the increase in car ownership, and the difficulty of finding travelers with a matching schedule and route keep car occupancy low. Technology is a key enabler of online platforms that facilitate the ride matching process and drive the growth of carpooling services. The aim of this work-in-progress article is to clarify the value proposition of carpooling platforms in smart cities, especially under conditions like the pandemic. To this end, an extensive bibliometric analysis of three separate specialized literature collections using the bibliometrix R-Tool, combined with a systematic literature review of selected papers, is performed. We identify that smart carpooling platforms could generate additional value for participants and smart cities through real-time ride matching, interconnection with public transportation and other city services, secure transactions, reputation-based services, and closed-organization carpooling schemes. To deliver this value to a smart city, a multi-sided platform business model is proposed, suitable for a carpooling service provider with multiple customer segments and partners.
In the insurance industry, the assessor's role is essential and requires significant effort conversing with the claimant. This is a highly professional process that involves many complex operations leading to a final insurance report. To reduce costs, the traditionally offline insurance assessment procedure is gradually moving online. However, for junior assessors, who often lack practical experience, it is not easy to quickly handle such an online procedure; yet this is important, as the insurance company decides how much compensation the claimant should receive based on the assessor's feedback. In this paper, we present an insurance assistant that applies NLP technologies to help junior insurance assessors do their job better. The insurance assistant recommends appropriate inquiry policies and auto-completes the case report during the insurance assessment procedure. We demonstrate the system via a short video.1
We present ClaimLinker, a Web service and API that links arbitrary text to a knowledge graph of fact-checked claims, offering a novel kind of semantic annotation of unstructured content. Given a text, ClaimLinker matches parts of it to fact-checked claims mined from popular fact-checking sites and integrated into a rich knowledge graph, thus allowing further exploration of the linked claims and their associations. The application is based on a scalable, fully unsupervised, and modular approach that requires no training or tuning and can serve high-quality results in real time (outperforming existing unsupervised methods). This allows its easy deployment for different contexts and application scenarios.
Despite rapid developments in the field of machine learning research, collecting high-quality labels for supervised learning remains a bottleneck for many applications. This difficulty is exacerbated by the fact that state-of-the-art models for NLP tasks are becoming deeper and more complex, often increasing the amount of training data required even for fine-tuning. Weak supervision methods, including data programming, address this problem and reduce the cost of label collection by using noisy label sources for supervision. However, until recently, data programming was only accessible to users who knew how to program. To bridge this gap, the Data Programming by Demonstration (DPBD) framework was proposed to facilitate the automatic creation of labeling functions based on a few examples labeled by a domain expert. This framework has proven successful for generating high-accuracy labeling models for document classification. In this work, we extend the DPBD framework to span-level annotation tasks, arguably one of the most time-consuming NLP labeling tasks. We built a novel tool, TagRuler, that makes it easy for annotators to build span-level labeling functions without programming and encourages them to explore trade-offs between different labeling models and active learning strategies. We empirically demonstrate that an annotator can achieve a higher F1 score using the proposed tool compared to manual labeling for different span-level annotation tasks.
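To make the notion of a span-level labeling function concrete, here is a toy example of our own (not TagRuler's actual generated code): a pattern-based rule that emits (start, end, label) spans, the kind of noisy labeler that a data programming pipeline aggregates.

```python
import re

# Toy span-level labeling function: tag four-digit years as DATE spans.
# A weak-supervision framework would combine many such noisy rules into
# a single labeling model rather than trusting any one of them.
def lf_date_spans(text):
    """Return a list of (start, end, label) spans for four-digit years."""
    return [(m.start(), m.end(), "DATE")
            for m in re.finditer(r"\b\d{4}\b", text)]

print(lf_date_spans("Founded in 1998, relaunched in 2004."))
# [(11, 15, 'DATE'), (31, 35, 'DATE')]
```

Tools in this space let an annotator author such rules by demonstration, i.e., by highlighting a few example spans instead of writing the regex by hand.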
Information Extraction (IE) tasks are commonly studied topics in various domains of research. Hence, the community continuously produces multiple techniques, solutions, and tools to perform such tasks. However, running those tools and integrating them within existing infrastructure requires time, expertise, and resources. One pertinent task here is triple extraction and linking, where structured triples are extracted from a text and aligned to an existing Knowledge Graph (KG). In this paper, we present Plumber, the first framework that allows users to manually and automatically create suitable IE pipelines from a community-created pool of tools to perform triple extraction and alignment on unstructured text. Our approach provides an interactive medium for altering the pipelines and performing IE tasks. A short video showing the working of the framework for different use cases is available online.1
Large-scale social networks have become more and more popular with the rapid progress of social media. A number of social network analysis tasks are conducted on real large-scale networks. However, the prohibitive cost of obtaining the underlying large network, in terms of both time and data privacy, makes it hard to evaluate the performance of analysis algorithms on real-world social networks. In this paper, we present a tool called FastSNG, which generates heterogeneous social network datasets according to a user-defined configuration describing the desired characteristics of the expected social network, such as community structures, attributes, and node degree distributions. Moreover, the generation algorithm of FastSNG adopts a degree distribution generation (D2G) model that efficiently generates web-scale social network datasets. Finally, the tool provides user-friendly and succinct interfaces for interaction with general users.
Scientists always look for the most accurate and relevant answers to their queries in the scholarly literature. Traditional scholarly search systems list documents instead of providing direct answers to search queries. Because the data in scholarly documents are not semantically annotated, they are not machine-readable; a search on scholarly knowledge graphs therefore ends up as a full-text search, not a search over the content of the scholarly literature. In this demo, we present a faceted search system that retrieves data from a scholarly knowledge graph, which can be compared and filtered to better satisfy user information needs. The novelty of our approach is the use of dynamic facets: facets are not fixed and change according to the content of a comparison.
Anonymity offers strong guarantees for people to use technology without fear of mass surveillance, identity theft, and data misuse. To anonymize and share data widely, de-identification is the main tool used in academia and industry. Yet mounting evidence suggests that de-identification may not protect people's privacy in practice. We present The Observatory of Anonymity, an interactive website demonstrating how few pieces of personal data can easily re-identify us. Taking advantage of modern web technologies, it allows participants to explore their correctness score, the likelihood of being correctly and uniquely identified from their demographics alone. Trained on census data from 89 countries, it demonstrates the effectiveness of re-identification attacks on deemed-anonymous data. The website further allows analysts to upload their own data samples to train our machine learning models in real time. The Observatory provides a unique tool for individuals, researchers, and practitioners to assess whether current de-identification practices satisfy the anonymization standards of modern data protection laws such as the GDPR and CCPA.
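The intuition behind such re-identification attacks can be illustrated with a toy computation (our own sketch, not the Observatory's statistical model): count how many records in a population share a given combination of quasi-identifiers; a record that is unique on those attributes is re-identifiable.

```python
from collections import Counter

# Toy sketch of demographic re-identification: a record is re-identifiable
# when its combination of quasi-identifiers (here: ZIP code, birth year,
# gender) appears exactly once in the population.
def uniqueness(records):
    """Fraction of records whose attribute combination appears exactly once."""
    counts = Counter(records)
    return sum(1 for r in records if counts[r] == 1) / len(records)

population = [
    ("02139", 1990, "F"),
    ("02139", 1990, "F"),
    ("02139", 1985, "M"),
    ("10001", 1972, "F"),
]
print(uniqueness(population))  # 0.5: two of the four records are unique
```

Even this naive measure shows why a handful of demographic attributes can defeat de-identification: as attributes are added, combinations quickly become unique, and the uniqueness fraction approaches 1.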
Capturing and exploiting a content's semantics is a key success factor for Web search. To this end, it is crucial to extract, ideally automatically, the core semantics of the data being processed and link this information to a formal representation such as an ontology. Intertwining both makes search semantic while simultaneously giving end users structured access to the data via the underlying ontology. Connecting both, we introduce the SEMANNOREX framework to provide semantically enriched access to a news corpus drawn from websites and Wikinews.
Data is scattered across service providers and heterogeneously structured in various formats. This lack of interoperability hinders data portability and thus inhibits user control. An interoperable data portability solution for transferring personal data is needed. We demo PROV4ITDaTa, a Web application that allows users to transfer personal data in an interoperable format to their personal data store. PROV4ITDaTa leverages the open-source solutions RML.io, Comunica, and Solid: (i) the RML.io toolset to describe how to access data from service providers and generate interoperable datasets; (ii) Comunica to query these and flexibly generate enriched datasets; and (iii) Solid Pods to store the generated data as Linked Data in personal data stores. As opposed to other (hard-coded) solutions, PROV4ITDaTa is fully transparent: each component of the pipeline is fully configurable and automatically generates detailed provenance trails. Furthermore, transforming the personal data into RDF allows for an interoperable solution. By maximizing the use of open-source tools and open standards, PROV4ITDaTa facilitates the shift towards a data ecosystem wherein users control their data and providers can focus on their service instead of trying to adhere to interoperability requirements.
In this demonstration we present the Access Risk Knowledge (ARK) Platform, a socio-technical risk governance system. Through the ARK Virus Project, the ARK Platform has been extended for risk management of personal protective equipment (PPE) in healthcare settings during the COVID-19 pandemic. ARK demonstrates the benefits of a Semantic Web approach for supporting both the integration and classification of qualitative and quantitative PPE risk data across multiple healthcare organisations, in order to generate a unique, unified evidence base of risk. This evidence base could be used to inform decision-making processes regarding PPE use.
In this demonstration, we put ourselves in the place of a website manager who seeks to use browser fingerprinting for web authentication. The first step is to choose which attributes to implement among the hundreds that are available. To support this choice, we developed BrFAST, an attribute selection platform that includes FPSelect, an algorithm that rigorously selects attributes according to a trade-off between security and usability. BrFAST ships with a set of parameter values so that it is usable as is; we notably include the resources needed to use two publicly available browser fingerprint datasets. BrFAST can be extended with other parameters: other attribute selection methods, other measures of security and usability, or other fingerprint datasets. BrFAST helps visualize the exploration of the search space for the best attribute set, evaluate the properties of attribute sets, and compare several attribute selection methods. During the demonstration, we compare the attribute sets selected by FPSelect with those selected by the usual methods, according to the properties of the resulting browser fingerprints (e.g., their usability and their uniqueness).
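The security/usability trade-off can be made concrete with a toy sketch: pick the cheapest attribute subset whose projected fingerprints are distinctive enough. The dataset, the per-attribute costs, and the exhaustive search below are illustrative assumptions, not FPSelect's actual lattice-exploration algorithm:

```python
from itertools import combinations

# Hypothetical toy dataset: each browser's fingerprint as attribute -> value.
fingerprints = [
    {"user_agent": "UA1", "timezone": "UTC+1", "canvas": "c1"},
    {"user_agent": "UA1", "timezone": "UTC+2", "canvas": "c2"},
    {"user_agent": "UA2", "timezone": "UTC+1", "canvas": "c1"},
    {"user_agent": "UA2", "timezone": "UTC+2", "canvas": "c3"},
]
# Illustrative per-attribute collection costs (a usability proxy).
costs = {"user_agent": 1.0, "timezone": 1.0, "canvas": 5.0}

def distinctness(attrs):
    """Security proxy: fraction of browsers whose projected fingerprint is unique."""
    projected = [tuple(fp[a] for a in attrs) for fp in fingerprints]
    return sum(projected.count(p) == 1 for p in projected) / len(projected)

def select(attr_pool, min_distinctness):
    """Return (subset, cost) of the cheapest subset meeting the security requirement."""
    best = None
    for r in range(1, len(attr_pool) + 1):
        for subset in combinations(attr_pool, r):
            if distinctness(subset) >= min_distinctness:
                cost = sum(costs[a] for a in subset)
                if best is None or cost < best[1]:
                    best = (subset, cost)
    return best
```

On this toy data, requiring full distinctness selects the two cheap attributes and skips the expensive canvas attribute, which is exactly the kind of trade-off BrFAST lets a manager explore visually.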
Web tables contain a large amount of useful knowledge. Takco is a new large-scale platform designed for extracting facts from tables so that they can be added to Knowledge Graphs (KGs) like Wikidata. Focusing on achieving high precision, current techniques are biased towards extracting redundant facts, i.e., facts already in the KG. Takco aims to find more novel facts, still at high precision. Our demonstration has two goals. The first is to illustrate the main features of Takco’s novel interpretation algorithm. The second is to use our platform to show to what extent other state-of-the-art systems are biased towards the extraction of redundant facts, thus raising awareness of this important problem.
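The redundancy notion above can be stated in a few lines. A minimal sketch, assuming facts are (subject, predicate, object) triples and the KG is a plain set (this is not Takco's interpretation algorithm, only the metric it argues about):

```python
# A fact is redundant when the KG already contains the same triple;
# the novelty rate is the share of extracted facts that are not redundant.
def novelty_rate(extracted_facts, kg):
    """Fraction of extracted (s, p, o) facts not already present in the KG."""
    return sum(fact not in kg for fact in extracted_facts) / len(extracted_facts)
```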
App reviews convey user opinions and emerging issues (e.g., new bugs) about app releases. Due to the dynamic nature of app reviews, the topics and sentiment of the reviews change across app release versions. Although several studies have focused on summarizing user opinions by analyzing user sentiment towards app features, no practical tool has been released. The large quantity of reviews and the presence of noise words also necessitate an automated tool for monitoring user reviews. In this paper, we introduce TOUR for dynamic TOpic and sentiment analysis of User Reviews. TOUR is able to (i) detect and summarize emerging app issues over app versions, (ii) identify user sentiment towards app features, and (iii) prioritize important user reviews to facilitate developers’ examination. The core techniques of TOUR are an online topic modeling approach and a sentiment prediction strategy. TOUR provides entries for developers to customize the hyper-parameters, and the results are presented in an interactive way. We evaluated TOUR in a developer survey involving 15 developers, all of whom confirmed the practical usefulness of the feature changes recommended by TOUR.
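The "sentiment towards app features" task can be illustrated with a deliberately simple lexicon-based sketch. The word lists and feature vocabulary below are hypothetical, and TOUR's actual sentiment prediction strategy is model-based rather than lexicon-based:

```python
# Toy sketch: map each app feature mentioned in a review to a polarity
# (+1 positive, -1 negative, 0 neutral) using illustrative word lists.
POSITIVE = {"great", "love", "fast", "useful"}
NEGATIVE = {"crash", "slow", "bug", "annoying"}
FEATURES = {"login", "notifications", "search"}  # hypothetical feature lexicon

def feature_sentiment(review):
    """Return {feature: polarity} for every known feature mentioned in the review."""
    tokens = review.lower().replace(",", " ").replace(".", " ").split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    polarity = (score > 0) - (score < 0)  # sign of the review-level score
    return {f: polarity for f in FEATURES if f in tokens}
```

Aggregating such per-review signals over release versions is what lets a tool like TOUR surface emerging issues.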
The echo chamber effect refers to the phenomenon in which online users exhibit selective exposure and ideological segregation on political issues. Prior studies indicate a connection between the spread of misinformation and online echo chambers. In this paper, to help users escape from an echo chamber, we propose a novel news-analysis platform that provides a panoramic view of the stances of different news media sources towards a particular event. Moreover, to help users better recognize the stances of the news sources that published these articles, we adopt a news stance classification model to categorize each stance as “agree”, “disagree”, “discuss”, or “unrelated” to a relevant claim for specified events with political stances. Finally, we propose two ways of showing the echo chamber effect: 1) visualizing an event and the associated pieces of news; and 2) visualizing the stance distribution of news from sources of different political ideologies. By making the echo chamber effect explicit, we expect online users to become exposed to more diverse perspectives on a specific event. A demo video of our platform is available on YouTube.
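The second visualization, the per-source stance distribution, amounts to normalizing label counts over the four stance classes. A minimal sketch, assuming per-article labels have already been produced by the classification model (the input shape is an assumption for illustration):

```python
from collections import Counter

STANCES = ("agree", "disagree", "discuss", "unrelated")

def stance_distribution(predictions):
    """Aggregate per-article stance labels into a per-source distribution.

    `predictions` maps a news source to the list of stance labels its
    articles received for a given claim; labels are drawn from STANCES.
    """
    dist = {}
    for source, labels in predictions.items():
        counts = Counter(labels)
        total = sum(counts.values())
        dist[source] = {s: counts.get(s, 0) / total for s in STANCES}
    return dist
```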
This research explores the content of marketing communications on the Web in its dialogic and data aspects. Bringing together theories from the fields of digital marketing communications and the Semantic Web, the work investigates how web marketing content is used to build semantic relationships between data nodes (via schema.org) and across semiotic interpretative routes (by adhering to dialogic principles of communication).
The rise of the counterfactual concept has promoted the study of reasoning, and we apply it, for the first time, to Knowledge Base Question Answering (KBQA) multi-hop reasoning as a form of data augmentation. We propose a model-agnostic Counterfactual Samples Synthesizing (CSS) training scheme. CSS uses two augmentation methods, Q-CSS and T-CSS, to augment the training set: for each training instance, we create two augmented instances, one per augmentation method. Furthermore, we apply the Dynamic Answer Equipment (DAE) algorithm to dynamically assign ground-truth answers to the expanded questions, constructing counterfactual examples. After training with the supplemented examples, the KBQA model can focus on all key entities and words, which significantly improves the model’s sensitivity. Experiments verified the effectiveness of CSS and showed consistent improvements across settings with different extents of KB incompleteness.
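The general shape of counterfactual question augmentation can be sketched as follows: perturb a key part of the question so the model cannot rely on it alone. The masking strategy below is an illustrative assumption, not the paper's exact Q-CSS/T-CSS procedures:

```python
# Illustrative sketch of counterfactual question augmentation: for each key
# entity, emit a variant of the question with that entity masked, forcing the
# model to attend to the remaining entities and words.
def make_counterfactual(question, key_entities, mask_token="[MASK]"):
    """Return one augmented question per key entity found in the question."""
    augmented = []
    for entity in key_entities:
        if entity in question:
            augmented.append(question.replace(entity, mask_token))
    return augmented
```

In the CSS scheme, each such expanded question would then be paired with a dynamically assigned ground-truth answer (the role of DAE) before being added to the training set.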
Modelling multilingual text data over time is a challenging task. This PhD focuses on the semantic representation of domain-specific, short to mid-length, time-stamped textual data. The proposed method is evaluated on job postings, where we model the demand for IT jobs. More specifically, we address the following three problems: unifying the representation of multilingual text data; clustering similar textual data; and using the proposed semantic representation to model and predict the future demand for jobs. This work starts with a problem statement, followed by a description of the proposed approach and methodology, and concludes with an overview of the first results and a summary of the ongoing research.
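Clustering similar postings presupposes a pairwise similarity measure over text representations. As a baseline sketch, bag-of-words cosine similarity (the PhD's proposed semantic representation would replace these sparse count vectors with multilingual embeddings):

```python
from collections import Counter
from math import sqrt

# Toy baseline: cosine similarity between bag-of-words vectors of two postings.
def cosine(text_a, text_b):
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

A weakness this baseline makes obvious: two postings in different languages share no surface tokens and score 0.0, which is precisely why a unified multilingual representation is needed.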
Demand forecasting is a crucial component of demand management. Accurate forecasts, together with insights into the reasons driving them, provide value to the organization by increasing confidence and assisting decision-making. In this Ph.D., we aim to develop state-of-the-art demand forecasting models for irregular demand, develop explainability mechanisms that avoid exposing fine-grained information about model features, create a recommender system to assist users in decision-making, and develop mechanisms to enrich knowledge graphs with feedback provided by users through artificial-intelligence-powered feedback modules. We have already developed models that produce accurate forecasts for both steady and irregular demand, as well as an architecture that provides forecast explanations while preserving sensitive information about model features. These explanations highlight real-world events that give insight into the general context captured by the dataset features, while surfacing actionable items and suggesting datasets for future data enrichment.
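A classic baseline for the irregular (intermittent) demand mentioned above is Croston's method, which smooths demand sizes and inter-demand intervals separately. The sketch below is that baseline, not the models developed in this PhD:

```python
# Croston's method: exponentially smooth the nonzero demand sizes and the
# intervals between them, and forecast their ratio (expected demand per period).
def croston(demand, alpha=0.1):
    """Forecast intermittent demand from a history of per-period quantities."""
    size = interval = None
    periods_since = 1
    for d in demand:
        if d > 0:
            if size is None:  # first nonzero observation initialises both series
                size, interval = float(d), float(periods_since)
            else:
                size += alpha * (d - size)
                interval += alpha * (periods_since - interval)
            periods_since = 1
        else:
            periods_since += 1
    if size is None:
        return 0.0  # no demand observed at all
    return size / interval
```

Separating size from timing is what lets the method cope with the long runs of zeros that break ordinary exponential smoothing.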