LIVIVO - The Search Portal for Life Sciences

Search results

Results 1-9 of 9

  1. Book ; Online: Fairness and Bias in Truth Discovery Algorithms

    Lazier, Simone / Thirumuruganathan, Saravanan / Anahideh, Hadis

    An Experimental Analysis

    2023  

    Abstract Machine learning (ML) based approaches are increasingly being used in a number of applications with societal impact. Training ML models often requires vast amounts of labeled data, and crowdsourcing is a dominant paradigm for obtaining labels from multiple workers. Crowd workers may sometimes provide unreliable labels, and to address this, truth discovery (TD) algorithms such as majority voting are applied to determine the consensus labels from conflicting worker responses. However, it is important to note that these consensus labels may still be biased based on sensitive attributes such as gender, race, or political affiliation. Even when sensitive attributes are not involved, the labels can be biased due to different perspectives of subjective aspects such as toxicity. In this paper, we conduct a systematic study of the bias and fairness of TD algorithms. Our findings, using two existing crowd-labeled datasets, reveal that a non-trivial proportion of workers provide biased results, and that using simple approaches for TD is sub-optimal. Our study also demonstrates that popular TD algorithms are not a panacea. Additionally, we quantify the impact of these unfair workers on downstream ML tasks and show that conventional methods for achieving fairness and correcting label biases are ineffective in this setting. We end the paper with a plea for the design of novel bias-aware truth discovery algorithms that can ameliorate these issues.

    Comment: Accepted in Algorithmic Fairness in Artificial intelligence, Machine learning and Decision Making workshop at SDM 2023
    Keywords Computer Science - Machine Learning ; Computer Science - Computers and Society ; Computer Science - Databases
    Subject code 006
    Publishing date 2023-04-25
    Publishing country us
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)

  2. Article ; Online: Big Data, Small Personas: How Algorithms Shape the Demographic Representation of Data-Driven User Segments.

    Salminen, Joni / Chhirang, Kamal / Jung, Soon-Gyo / Thirumuruganathan, Saravanan / Guan, Kathleen W / Jansen, Bernard J

    Big data

    2022  Volume 10, Issue 4, Page(s) 313–336

    Abstract Derived from the notion of algorithmic bias, it is possible that creating user segments such as personas from data results in over- or under-representing certain segments (FAIRNESS), does not properly represent the diversity of the user populations (DIVERSITY), or produces inconsistent results when hyperparameters are changed (CONSISTENCY). Collecting user data on 363M video views from a global news and media organization, we compare personas created from this data using different algorithms. Results indicate that the algorithms fall into two groups: those that generate personas with
    MeSH term(s) Algorithms ; Big Data ; Cultural Diversity ; Demography/statistics & numerical data
    Language English
    Publishing date 2022-08-15
    Publishing country United States
    Document type Journal Article
    ISSN 2167-647X
    ISSN (online) 2167-647X
    DOI 10.1089/big.2021.0177
    Database MEDical Literature Analysis and Retrieval System OnLINE

  3. Book ; Online: Fair Active Learning

    Anahideh, Hadis / Asudeh, Abolfazl / Thirumuruganathan, Saravanan

    2020  

    Abstract Machine learning (ML) is increasingly being used in high-stakes applications impacting society. Therefore, it is of critical importance that ML models do not propagate discrimination. Collecting accurate labeled data in societal applications is challenging and costly. Active learning is a promising approach to build an accurate classifier by interactively querying an oracle within a labeling budget. We design algorithms for fair active learning that carefully select data points to be labeled so as to balance model accuracy and fairness. We demonstrate the effectiveness and efficiency of our proposed algorithms over widely used benchmark datasets using demographic parity and equalized odds notions of fairness.
    Keywords Computer Science - Machine Learning ; Statistics - Machine Learning
    Publishing date 2020-01-06
    Publishing country us
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)

  4. Book ; Online: Fair Active Learning

    Anahideh, Hadis / Asudeh, Abolfazl / Thirumuruganathan, Saravanan

    2020  

    Abstract Machine learning (ML) is increasingly being used in high-stakes applications impacting society. Therefore, it is of critical importance that ML models do not propagate discrimination. Collecting accurate labeled data in societal applications is challenging and costly. Active learning is a promising approach to build an accurate classifier by interactively querying an oracle within a labeling budget. We design algorithms for fair active learning that carefully select data points to be labeled so as to balance model accuracy and fairness. Specifically, we focus on demographic parity, a widely used measure of fairness. Extensive experiments over benchmark datasets demonstrate the effectiveness of our proposed approach.

    Comment: This was intended as a replacement of arXiv:2001.01796; please see the updated version there
    Keywords Computer Science - Machine Learning ; Statistics - Machine Learning
    Publishing date 2020-06-20
    Publishing country us
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)

  5. Book ; Online: Local Embeddings for Relational Data Integration

    Cappuzzo, Riccardo / Papotti, Paolo / Thirumuruganathan, Saravanan

    2019  

    Abstract Deep learning based techniques have been recently used with promising results for data integration problems. Some methods directly use pre-trained embeddings that were trained on a large corpus such as Wikipedia. However, they may not always be an appropriate choice for enterprise datasets with custom vocabulary. Other methods adapt techniques from natural language processing to obtain embeddings for the enterprise's relational data. However, this approach blindly treats a tuple as a sentence, thus losing a large amount of contextual information present in the tuple. We propose algorithms for obtaining local embeddings that are effective for data integration tasks on relational databases. We make four major contributions. First, we describe a compact graph-based representation that allows the specification of a rich set of relationships inherent in the relational world. Second, we propose how to derive sentences from such a graph that effectively "describe" the similarity across elements (tokens, attributes, rows) in the two datasets. The embeddings are learned based on such sentences. Third, we propose effective optimization to improve the quality of the learned embeddings and the performance of integration tasks. Finally, we propose a diverse collection of criteria to evaluate relational embeddings and perform an extensive set of experiments validating them against multiple baseline methods. Our experiments show that our framework, EmbDI, produces meaningful results for data integration tasks such as schema matching and entity resolution both in supervised and unsupervised settings.

    Comment: Accepted to SIGMOD 2020 as Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks. Code can be found at https://gitlab.eurecom.fr/cappuzzo/embdi
    Keywords Computer Science - Databases ; Computer Science - Computation and Language ; Computer Science - Machine Learning
    Subject code 004
    Publishing date 2019-09-03
    Publishing country us
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)

  6. Book ; Online: ZeroER

    Wu, Renzhi / Chaba, Sanya / Sawlani, Saurabh / Chu, Xu / Thirumuruganathan, Saravanan

    Entity Resolution using Zero Labeled Examples

    2019  

    Abstract Entity resolution (ER) refers to the problem of matching records in one or more relations that refer to the same real-world entity. While supervised machine learning (ML) approaches achieve state-of-the-art results, they require a large number of labeled examples that are expensive and oftentimes infeasible to obtain. We investigate an important problem that vexes practitioners: is it possible to design an effective algorithm for ER that requires Zero labeled examples, yet can achieve performance comparable to supervised approaches? In this paper, we answer in the affirmative through our proposed approach dubbed ZeroER. Our approach is based on a simple observation: the similarity vectors for matches should look different from those of unmatches. Operationalizing this insight requires a number of technical innovations. First, we propose a simple yet powerful generative model based on Gaussian Mixture Models for learning the match and unmatch distributions. Second, we propose an adaptive regularization technique customized for ER that ameliorates the issue of feature overfitting. Finally, we incorporate the transitivity property into the generative model in a novel way resulting in improved accuracy. On five benchmark ER datasets, we show that ZeroER greatly outperforms existing unsupervised approaches and achieves comparable performance to supervised approaches.

    Comment: Published at 2020 ACM SIGMOD International Conference on Management of Data
    Keywords Computer Science - Databases ; Computer Science - Machine Learning
    Subject code 006
    Publishing date 2019-08-16
    Publishing country us
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)

  7. Article ; Online: An Empirical Study of Questionnaires for the Diagnosis of Pediatric Obstructive Sleep Apnea.

    Ahmed, Sadia / Hasani, Sona / Koone, Mary / Thirumuruganathan, Saravanan / Diaz-Abad, Montserrat / Mitchell, Ron / Isaiah, Amal / Das, Gautam

    Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual International Conference

    2018  Volume 2018, Page(s) 4097–4100

    Abstract Pediatric Obstructive Sleep Apnea (OSA) is a chronic disorder characterized by the disruption in sleep due to involuntary and temporary cessation of breathing. Definitive diagnosis of OSA requires an intrusive and expensive approach based on polysomnography where the children spend a night in the hospital under the supervision of a sleep technician. The prevalence of OSA is increasing, making the traditional diagnostic approach prohibitively expensive. There has been increasing interest in designing inexpensive approaches to screen children such as the use of questionnaires. In this paper, we study the efficacy of five widely used and representative questionnaires on their ability to diagnose and stratify OSA. Our experiments show that the diagnostic ability of each of these questionnaires is insufficient for widespread clinical use. Using techniques from data mining, we identify the most informative questions and propose a new questionnaire. We show that machine learning models trained based on the answers to our questionnaire can stratify OSA with higher accuracy.
    MeSH term(s) Humans ; Machine Learning ; Polysomnography ; Prevalence ; Sleep Apnea, Obstructive ; Surveys and Questionnaires
    Language English
    Publishing date 2018-09-17
    Publishing country United States
    Document type Journal Article ; Research Support, Non-U.S. Gov't ; Research Support, U.S. Gov't, Non-P.H.S.
    ISSN 2694-0604
    ISSN (online) 2694-0604
    DOI 10.1109/EMBC.2018.8513389
    Database MEDical Literature Analysis and Retrieval System OnLINE

  8. Book ; Online: DeepER -- Deep Entity Resolution

    Ebraheem, Muhammad / Thirumuruganathan, Saravanan / Joty, Shafiq / Ouzzani, Mourad / Tang, Nan

    2017  

    Abstract Entity resolution (ER) is a key data integration problem. Despite 70+ years of effort on all aspects of ER, there is still a high demand for democratizing ER: humans are heavily involved in labeling data, performing feature engineering, tuning parameters, and defining blocking functions. With the recent advances in deep learning, in particular distributed representation of words (a.k.a. word embeddings), we present a novel ER system, called DeepER, that achieves good accuracy, high efficiency, as well as ease-of-use (i.e., much less human effort). For accuracy, we use sophisticated composition methods, namely uni- and bi-directional recurrent neural networks (RNNs) with long short-term memory (LSTM) hidden units, to convert each tuple to a distributed representation (i.e., a vector), which can in turn be used to effectively capture similarities between tuples. We consider both the case where pre-trained word embeddings are available as well as the case where they are not; we present ways to learn and tune the distributed representations. For efficiency, we propose a locality sensitive hashing (LSH) based blocking approach that uses distributed representations of tuples; it takes all attributes of a tuple into consideration and produces much smaller blocks, compared with traditional methods that consider only a few attributes. For ease-of-use, DeepER requires much less human labeled data and does not need feature engineering, compared with traditional machine learning based approaches which require handcrafted features and similarity functions along with their associated thresholds. We evaluate our algorithms on multiple datasets (including benchmarks, biomedical data, as well as multi-lingual data) and the extensive experimental results show that DeepER outperforms existing solutions.

    Comment: Accepted to PVLDB 2018 as "Distributed Representations of Tuples for Entity Resolution"
    Keywords Computer Science - Databases
    Subject code 006
    Publishing date 2017-10-02
    Publishing country us
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)

  9. Book ; Online: Malware in the Future? Forecasting of Analyst Detection of Cyber Events

    Bakdash, Jonathan Z. / Hutchinson, Steve / Zaroukian, Erin G. / Marusich, Laura R. / Thirumuruganathan, Saravanan / Sample, Charmaine / Hoffman, Blaine / Das, Gautam

    2017  

    Abstract There have been extensive efforts in government, academia, and industry to anticipate, forecast, and mitigate cyber attacks. A common approach is time-series forecasting of cyber attacks based on data from network telescopes, honeypots, and automated intrusion detection/prevention systems. This research has uncovered key insights such as systematicity in cyber attacks. Here, we propose an alternate perspective of this problem by performing forecasting of attacks that are analyst-detected and -verified occurrences of malware. We call these instances of malware cyber event data. Specifically, our dataset consisted of analyst-detected incidents from a large operational Computer Security Service Provider (CSSP) for the U.S. Department of Defense, which rarely relies only on automated systems. Our data set consists of weekly counts of cyber events over approximately seven years. Since all cyber events were validated by analysts, our dataset is unlikely to have false positives, which are often endemic in other sources of data. Further, the higher-quality data could be used for a number of purposes, such as resource allocation, estimation of security resources, and the development of effective risk-management strategies. We used a Bayesian State Space Model for forecasting and found that events one week ahead could be predicted. To quantify bursts, we used a Markov model. Our findings of systematicity in analyst-detected cyber attacks are consistent with previous work using other sources. The advanced information provided by a forecast may help with threat awareness by providing a probable value and range for future cyber events one week ahead. Other potential applications for cyber event forecasting include proactive allocation of resources and capabilities for cyber defense (e.g., analyst staffing and sensor configuration) in CSSPs. Enhanced threat awareness may improve cybersecurity.

    Comment: Revised version resubmitted to journal
    Keywords Computer Science - Cryptography and Security
    Subject code 006
    Publishing date 2017-07-11
    Publishing country us
    Document type Book ; Online
    Database BASE - Bielefeld Academic Search Engine (life sciences selection)
