LIVIVO - The Search Portal for Life Sciences

Search results

Results 1–6 of 6

  1. Book ; Online: LT@Helsinki at SemEval-2020 Task 12

    Pàmies, Marc / Öhman, Emily / Kajava, Kaisla / Tiedemann, Jörg

    Multilingual or language-specific BERT?

    2020  

    Abstract: This paper presents the different models submitted by the LT@Helsinki team for the SemEval 2020 Shared Task 12. Our team participated in sub-tasks A and C, titled offensive language identification and offense target identification, respectively. In both cases we used the so-called Bidirectional Encoder Representations from Transformers (BERT), a model pre-trained by Google and fine-tuned by us on the OLID and SOLID datasets. The results show that offensive tweet classification is one of several language-based tasks where BERT can achieve state-of-the-art results.

    Comment: Accepted at SemEval-2020 Task 12. Identical to camera-ready version except where adjustments to fit arXiv requirements were necessary
    Keywords: Computer Science - Computation and Language
    Publishing date: 2020-08-03
    Publishing country: US
    Document type: Book ; Online
    Database: BASE - Bielefeld Academic Search Engine (life sciences selection)

  2. Book ; Online: XED

    Öhman, Emily / Pàmies, Marc / Kajava, Kaisla / Tiedemann, Jörg

    A Multilingual Dataset for Sentiment Analysis and Emotion Detection

    2020  

    Abstract: We introduce XED, a multilingual fine-grained emotion dataset. The dataset consists of human-annotated Finnish (25k) and English sentences (30k), as well as projected annotations for 30 additional languages, providing new resources for many low-resource languages. We use Plutchik's core emotions to annotate the dataset with the addition of neutral to create a multilabel multiclass dataset. The dataset is carefully evaluated using language-specific BERT models and SVMs to show that XED performs on par with other similar datasets and is therefore a useful tool for sentiment analysis and emotion detection.

    Comment: Accepted at COLING 2020
    Keywords: Computer Science - Computation and Language
    Publishing date: 2020-11-03
    Publishing country: US
    Document type: Book ; Online
    Database: BASE - Bielefeld Academic Search Engine (life sciences selection)

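The multilabel, multiclass scheme the abstract describes, Plutchik's eight core emotions plus a neutral label, can be sketched as follows. This is an illustrative encoding only; the label names and the helper function are assumptions for illustration, not code from the XED release.

```python
# Illustrative multilabel emotion encoding over Plutchik's eight core
# emotions plus "neutral", as described in the XED abstract. The label
# names and this helper are assumptions, not XED's own code.
PLUTCHIK_CORE = [
    "anger", "anticipation", "disgust", "fear",
    "joy", "sadness", "surprise", "trust",
]
LABELS = PLUTCHIK_CORE + ["neutral"]

def to_multihot(emotions):
    """Encode a set of emotion labels as a multi-hot vector over LABELS."""
    unknown = set(emotions) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if name in emotions else 0 for name in LABELS]

# A single sentence may carry several emotions at once (multilabel):
vec = to_multihot({"joy", "surprise"})
```

Multi-hot targets like this are what the language-specific BERT models and SVMs mentioned in the abstract would be trained to predict, one output per label.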

  3. Book ; Online: Biomedical and Clinical Language Models for Spanish

    Carrino, Casimiro Pio / Armengol-Estapé, Jordi / Gutiérrez-Fandiño, Asier / Llop-Palao, Joan / Pàmies, Marc / Gonzalez-Agirre, Aitor / Villegas, Marta

    On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario

    2021  

    Abstract: This work presents biomedical and clinical language models for Spanish by experimenting with different pretraining choices, such as masking at word and subword level, varying the vocabulary size and testing with domain data, looking for better language representations. Interestingly, in the absence of enough clinical data to train a model from scratch, we applied mixed-domain pretraining and cross-domain transfer approaches to generate a performant bio-clinical model suitable for real-world clinical data. We evaluated our models on Named Entity Recognition (NER) tasks for biomedical documents and challenging hospital discharge reports. When compared against the competitive mBERT and BETO models, we outperform them in all NER tasks by a significant margin. Finally, we studied the impact of the model's vocabulary on NER performance by offering an interesting vocabulary-centric analysis. The results confirm that domain-specific pretraining is fundamental to achieving higher performance in downstream NER tasks, even within a mid-resource scenario. To the best of our knowledge, we provide the first biomedical and clinical transformer-based pretrained language models for Spanish, intending to boost native Spanish NLP applications in biomedicine. Our best models are freely available in the HuggingFace hub: https://huggingface.co/BSC-TeMU.

    Comment: 9 pages
    Keywords: Computer Science - Computation and Language
    Subject code: 410
    Publishing date: 2021-09-08
    Publishing country: US
    Document type: Book ; Online
    Database: BASE - Bielefeld Academic Search Engine (life sciences selection)

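One of the pretraining choices the abstract compares, masking at the word level versus the subword level, can be sketched as below. This is a simplified illustration under assumed inputs (pre-tokenized subwords with word ids), not the authors' implementation.

```python
import random

def mask_tokens(tokens, word_ids, level="subword", p=0.15, seed=0):
    """Mask tokens for masked-language-model pretraining.

    level="subword": each subword token is masked independently.
    level="word":    all pieces of a sampled word are masked together
                     (whole-word masking).
    """
    rng = random.Random(seed)
    if level == "subword":
        picked = {i for i in range(len(tokens)) if rng.random() < p}
    else:
        masked_words = {w for w in sorted(set(word_ids)) if rng.random() < p}
        picked = {i for i, w in enumerate(word_ids) if w in masked_words}
    return ["[MASK]" if i in picked else tok for i, tok in enumerate(tokens)]

# "hemoglobina" is split into two subword pieces sharing word id 1:
tokens = ["la", "hemo", "##globina", "baja"]
word_ids = [0, 1, 1, 2]
masked = mask_tokens(tokens, word_ids, level="word", p=0.5, seed=3)
```

Under word-level masking the two pieces of "hemoglobina" are always masked or kept together, which is the behavioral difference this pretraining choice introduces.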

  4. Book ; Online: MarIA

    Gutiérrez-Fandiño, Asier / Armengol-Estapé, Jordi / Pàmies, Marc / Llop-Palao, Joan / Silveira-Ocampo, Joaquín / Carrino, Casimiro Pio / Gonzalez-Agirre, Aitor / Armentano-Oller, Carme / Rodriguez-Penagos, Carlos / Villegas, Marta

    Spanish Language Models

    2021  

    Abstract: This work presents MarIA, a family of Spanish language models and associated resources made available to the industry and the research community. Currently, MarIA includes RoBERTa-base, RoBERTa-large, GPT2 and GPT2-large Spanish language models, which can arguably be presented as the largest and most proficient language models in Spanish. The models were pretrained using a massive corpus of 570GB of clean and deduplicated texts with 135 billion words extracted from the Spanish Web Archive crawled by the National Library of Spain between 2009 and 2019. We assessed the performance of the models with nine existing evaluation datasets and with a novel extractive Question Answering dataset created ex novo. Overall, MarIA models outperform the existing Spanish models across a variety of NLU tasks and training settings.
    Keywords: Computer Science - Computation and Language ; Computer Science - Artificial Intelligence
    Publishing date: 2021-07-15
    Publishing country: US
    Document type: Book ; Online
    Database: BASE - Bielefeld Academic Search Engine (life sciences selection)

  5. Book ; Online: BigBIO

    Fries, Jason Alan / Weber, Leon / Seelam, Natasha / Altay, Gabriel / Datta, Debajyoti / Garda, Samuele / Kang, Myungsun / Su, Ruisi / Kusa, Wojciech / Cahyawijaya, Samuel / Barth, Fabio / Ott, Simon / Samwald, Matthias / Bach, Stephen / Biderman, Stella / Sänger, Mario / Wang, Bo / Callahan, Alison / Periñán, Daniel León /
    Gigant, Théo / Haller, Patrick / Chim, Jenny / Posada, Jose David / Giorgi, John Michael / Sivaraman, Karthik Rangasai / Pàmies, Marc / Nezhurina, Marianna / Martin, Robert / Cullan, Michael / Freidank, Moritz / Dahlberg, Nathan / Mishra, Shubhanshu / Bose, Shamik / Broad, Nicholas Michio / Labrak, Yanis / Deshmukh, Shlok S / Kiblawi, Sid / Singh, Ayush / Vu, Minh Chien / Neeraj, Trishala / Golde, Jonas / del Moral, Albert Villanova / Beilharz, Benjamin

    A Framework for Data-Centric Biomedical Natural Language Processing

    2022  

    Abstract: Training and evaluating language models increasingly requires the construction of meta-datasets: diverse collections of curated data with clear provenance. Natural language prompting has recently led to improved zero-shot generalization by transforming existing, supervised datasets into a diversity of novel pretraining tasks, highlighting the benefits of meta-dataset curation. While successful in general-domain text, translating these data-centric approaches to biomedical language modeling remains challenging, as labeled biomedical datasets are significantly underrepresented in popular data hubs. To address this challenge, we introduce BigBIO, a community library of 126+ biomedical NLP datasets, currently covering 12 task categories and 10+ languages. BigBIO facilitates reproducible meta-dataset curation via programmatic access to datasets and their metadata, and is compatible with current platforms for prompt engineering and end-to-end few- and zero-shot language model evaluation. We discuss our process for task schema harmonization, data auditing, and contribution guidelines, and outline two illustrative use cases: zero-shot evaluation of biomedical prompts and large-scale, multi-task learning. BigBIO is an ongoing community effort and is available at https://github.com/bigscience-workshop/biomedical

    Comment: Submitted to NeurIPS 2022 Datasets and Benchmarks Track
    Keywords: Computer Science - Computation and Language
    Subject code: 006
    Publishing date: 2022-06-30
    Publishing country: US
    Document type: Book ; Online
    Database: BASE - Bielefeld Academic Search Engine (life sciences selection)

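The harmonized task schemas that the abstract credits for reproducible curation can be illustrated with a minimal, hypothetical NER-style record. The real BigBIO schemas live in the linked repository and are richer; this sketch only shows the idea of a shared structure with auditable offsets.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    # Character offsets into the passage, plus surface string and entity type.
    start: int
    end: int
    text: str
    type: str

@dataclass
class NerExample:
    # Hypothetical, simplified harmonized record; the actual BigBIO schemas
    # (https://github.com/bigscience-workshop/biomedical) differ in detail.
    id: str
    passage: str
    entities: list = field(default_factory=list)

    def offsets_consistent(self):
        """Check that every entity's offsets point at its surface text."""
        return all(self.passage[e.start:e.end] == e.text for e in self.entities)

ex = NerExample(
    id="doc-0",
    passage="Aspirin inhibits platelet aggregation.",
    entities=[Entity(0, 7, "Aspirin", "CHEMICAL")],
)
```

A consistency check like `offsets_consistent()` is the kind of data audit the abstract mentions: once every dataset shares one schema, such checks can run uniformly across the whole collection.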

  6. Book ; Online: BLOOM

    BigScience Workshop / Scao, Teven Le / Fan, Angela / Akiki, Christopher / Pavlick, Ellie / Ilić, Suzana / Hesslow, Daniel / Castagné, Roman / Luccioni, Alexandra Sasha / Yvon, François / Gallé, Matthias / Tow, Jonathan / Rush, Alexander M. / Biderman, Stella / Webson, Albert / Ammanamanchi, Pawan Sasanka / Wang, Thomas / Sagot, Benoît /
    Muennighoff, Niklas / del Moral, Albert Villanova / Ruwase, Olatunji / Bawden, Rachel / Bekman, Stas / McMillan-Major, Angelina / Beltagy, Iz / Nguyen, Huu / Saulnier, Lucile / Tan, Samson / Suarez, Pedro Ortiz / Sanh, Victor / Laurençon, Hugo / Jernite, Yacine / Launay, Julien / Mitchell, Margaret / Raffel, Colin / Gokaslan, Aaron / Simhi, Adi / Soroa, Aitor / Aji, Alham Fikri / Alfassy, Amit / Rogers, Anna / Nitzav, Ariel Kreisberg / Xu, Canwen / Mou, Chenghao / Emezue, Chris / Klamm, Christopher / Leong, Colin / van Strien, Daniel / Adelani, David Ifeoluwa / Radev, Dragomir / Ponferrada, Eduardo González / Levkovizh, Efrat / Kim, Ethan / Natan, Eyal Bar / De Toni, Francesco / Dupont, Gérard / Kruszewski, Germán / Pistilli, Giada / Elsahar, Hady / Benyamina, Hamza / Tran, Hieu / Yu, Ian / Abdulmumin, Idris / Johnson, Isaac / Gonzalez-Dios, Itziar / de la Rosa, Javier / Chim, Jenny / Dodge, Jesse / Zhu, Jian / Chang, Jonathan / Frohberg, Jörg / Tobing, Joseph / Bhattacharjee, Joydeep / Almubarak, Khalid / Chen, Kimbo / Lo, Kyle / Von Werra, Leandro / Weber, Leon / Phan, Long / allal, Loubna Ben / Tanguy, Ludovic / Dey, Manan / Muñoz, Manuel Romero / Masoud, Maraim / Grandury, María / Šaško, Mario / Huang, Max / Coavoux, Maximin / Singh, Mayank / Jiang, Mike Tian-Jian / Vu, Minh Chien / Jauhar, Mohammad A. 
/ Ghaleb, Mustafa / Subramani, Nishant / Kassner, Nora / Khamis, Nurulaqilla / Nguyen, Olivier / Espejel, Omar / de Gibert, Ona / Villegas, Paulo / Henderson, Peter / Colombo, Pierre / Amuok, Priscilla / Lhoest, Quentin / Harliman, Rheza / Bommasani, Rishi / López, Roberto Luis / Ribeiro, Rui / Osei, Salomey / Pyysalo, Sampo / Nagel, Sebastian / Bose, Shamik / Muhammad, Shamsuddeen Hassan / Sharma, Shanya / Longpre, Shayne / Nikpoor, Somaieh / Silberberg, Stanislav / Pai, Suhas / Zink, Sydney / Torrent, Tiago Timponi / Schick, Timo / Thrush, Tristan / Danchev, Valentin / Nikoulina, Vassilina / Laippala, Veronika / Lepercq, Violette / Prabhu, Vrinda / Alyafeai, Zaid / Talat, Zeerak / Raja, Arun / Heinzerling, Benjamin / Si, Chenglei / Taşar, Davut Emre / Salesky, Elizabeth / Mielke, Sabrina J. / Lee, Wilson Y. / Sharma, Abheesht / Santilli, Andrea / Chaffin, Antoine / Stiegler, Arnaud / Datta, Debajyoti / Szczechla, Eliza / Chhablani, Gunjan / Wang, Han / Pandey, Harshit / Strobelt, Hendrik / Fries, Jason Alan / Rozen, Jos / Gao, Leo / Sutawika, Lintang / Bari, M Saiful / Al-shaibani, Maged S. / Manica, Matteo / Nayak, Nihal / Teehan, Ryan / Albanie, Samuel / Shen, Sheng / Ben-David, Srulik / Bach, Stephen H. 
/ Kim, Taewoon / Bers, Tali / Fevry, Thibault / Neeraj, Trishala / Thakker, Urmish / Raunak, Vikas / Tang, Xiangru / Yong, Zheng-Xin / Sun, Zhiqing / Brody, Shaked / Uri, Yallow / Tojarieh, Hadar / Roberts, Adam / Chung, Hyung Won / Tae, Jaesung / Phang, Jason / Press, Ofir / Li, Conglong / Narayanan, Deepak / Bourfoune, Hatim / Casper, Jared / Rasley, Jeff / Ryabinin, Max / Mishra, Mayank / Zhang, Minjia / Shoeybi, Mohammad / Peyrounette, Myriam / Patry, Nicolas / Tazi, Nouamane / Sanseviero, Omar / von Platen, Patrick / Cornette, Pierre / Lavallée, Pierre François / Lacroix, Rémi / Rajbhandari, Samyam / Gandhi, Sanchit / Smith, Shaden / Requena, Stéphane / Patil, Suraj / Dettmers, Tim / Baruwa, Ahmed / Singh, Amanpreet / Cheveleva, Anastasia / Ligozat, Anne-Laure / Subramonian, Arjun / Névéol, Aurélie / Lovering, Charles / Garrette, Dan / Tunuguntla, Deepak / Reiter, Ehud / Taktasheva, Ekaterina / Voloshina, Ekaterina / Bogdanov, Eli / Winata, Genta Indra / Schoelkopf, Hailey / Kalo, Jan-Christoph / Novikova, Jekaterina / Forde, Jessica Zosa / Clive, Jordan / Kasai, Jungo / Kawamura, Ken / Hazan, Liam / Carpuat, Marine / Clinciu, Miruna / Kim, Najoung / Cheng, Newton / Serikov, Oleg / Antverg, Omer / van der Wal, Oskar / Zhang, Rui / Zhang, Ruochen / Gehrmann, Sebastian / Mirkin, Shachar / Pais, Shani / Shavrina, Tatiana / Scialom, Thomas / Yun, Tian / Limisiewicz, Tomasz / Rieser, Verena / Protasov, Vitaly / Mikhailov, Vladislav / Pruksachatkun, Yada / Belinkov, Yonatan / Bamberger, Zachary / Kasner, Zdeněk / Rueda, Alice / Pestana, Amanda / Feizpour, Amir / Khan, Ammar / Faranak, Amy / Santos, Ana / Hevia, Anthony / Unldreaj, Antigona / Aghagol, Arash / Abdollahi, Arezoo / Tammour, Aycha / HajiHosseini, Azadeh / Behroozi, Bahareh / Ajibade, Benjamin / Saxena, Bharat / Ferrandis, Carlos Muñoz / McDuff, Daniel / Contractor, Danish / Lansky, David / David, Davis / Kiela, Douwe / Nguyen, Duong A. 
/ Tan, Edward / Baylor, Emi / Ozoani, Ezinwanne / Mirza, Fatima / Ononiwu, Frankline / Rezanejad, Habib / Jones, Hessie / Bhattacharya, Indrani / Solaiman, Irene / Sedenko, Irina / Nejadgholi, Isar / Passmore, Jesse / Seltzer, Josh / Sanz, Julio Bonis / Dutra, Livia / Samagaio, Mairon / Elbadri, Maraim / Mieskes, Margot / Gerchick, Marissa / Akinlolu, Martha / McKenna, Michael / Qiu, Mike / Ghauri, Muhammed / Burynok, Mykola / Abrar, Nafis / Rajani, Nazneen / Elkott, Nour / Fahmy, Nour / Samuel, Olanrewaju / An, Ran / Kromann, Rasmus / Hao, Ryan / Alizadeh, Samira / Shubber, Sarmad / Wang, Silas / Roy, Sourav / Viguier, Sylvain / Le, Thanh / Oyebade, Tobi / Le, Trieu / Yang, Yoyo / Nguyen, Zach / Kashyap, Abhinav Ramesh / Palasciano, Alfredo / Callahan, Alison / Shukla, Anima / Miranda-Escalada, Antonio / Singh, Ayush / Beilharz, Benjamin / Wang, Bo / Brito, Caio / Zhou, Chenxi / Jain, Chirag / Xu, Chuxin / Fourrier, Clémentine / Periñán, Daniel León / Molano, Daniel / Yu, Dian / Manjavacas, Enrique / Barth, Fabio / Fuhrimann, Florian / Altay, Gabriel / Bayrak, Giyaseddin / Burns, Gully / Vrabec, Helena U. 
/ Bello, Imane / Dash, Ishani / Kang, Jihyun / Giorgi, John / Golde, Jonas / Posada, Jose David / Sivaraman, Karthik Rangasai / Bulchandani, Lokesh / Liu, Lu / Shinzato, Luisa / de Bykhovetz, Madeleine Hahn / Takeuchi, Maiko / Pàmies, Marc / Castillo, Maria A / Nezhurina, Marianna / Sänger, Mario / Samwald, Matthias / Cullan, Michael / Weinberg, Michael / De Wolf, Michiel / Mihaljcic, Mina / Liu, Minna / Freidank, Moritz / Kang, Myungsun / Seelam, Natasha / Dahlberg, Nathan / Broad, Nicholas Michio / Muellner, Nikolaus / Fung, Pascale / Haller, Patrick / Chandrasekhar, Ramya / Eisenberg, Renata / Martin, Robert / Canalli, Rodrigo / Su, Rosaline / Su, Ruisi / Cahyawijaya, Samuel / Garda, Samuele / Deshmukh, Shlok S / Mishra, Shubhanshu / Kiblawi, Sid / Ott, Simon / Sang-aroonsiri, Sinee / Kumar, Srishti / Schweter, Stefan / Bharati, Sushil / Laud, Tanmay / Gigant, Théo / Kainuma, Tomoya / Kusa, Wojciech / Labrak, Yanis / Bajaj, Yash Shailesh / Venkatraman, Yash / Xu, Yifan / Xu, Yingxin / Xu, Yu / Tan, Zhe / Xie, Zhongli / Ye, Zifan / Bras, Mathilde / Belkada, Younes / Wolf, Thomas

    A 176B-Parameter Open-Access Multilingual Language Model

    2022  

    Abstract: Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
    Keywords: Computer Science - Computation and Language
    Subject code: 410
    Publishing date: 2022-11-09
    Publishing country: US
    Document type: Book ; Online
    Database: BASE - Bielefeld Academic Search Engine (life sciences selection)
