Can We Quantify Domainhood?: Exploring Measures to Assess Domain-Specificity in Web Corpora
RISE SICS, Linköping, Sweden.
RISE SICS, Linköping, Sweden; Linköping University, Linköping, Sweden.
RISE SICS, Linköping, Sweden; Linköping University, Linköping, Sweden.
Örebro University, School of Science and Technology. ORCID iD: 0000-0002-4001-2087
2018 (English). In: Database and Expert Systems Applications: DEXA 2018 International Workshops / [ed] Elloumi, M.; Granitzer, M.; Hameurlain, A.; Seifert, C.; Stein, B.; Tjoa, AM.; Wagner, R., Springer Berlin/Heidelberg, 2018, pp. 207-217. Conference paper, Published paper (Refereed)
Abstract [en]

Web corpora are a cornerstone of modern Language Technology. Corpora built from the web are convenient because their creation is fast and inexpensive. Several studies have been carried out to assess the representativeness of general-purpose web corpora by comparing them to traditional corpora. Less attention has been paid to assessing the representativeness of specialized or domain-specific web corpora. In this paper, we focus on the assessment of domain representativeness of web corpora and we claim that it is possible to assess the degree of domain-specificity, or domainhood, of web corpora. We present a case study where we explore the effectiveness of different measures (namely the Mann-Whitney-Wilcoxon test, the Kendall correlation coefficient, Kullback-Leibler divergence, log-likelihood and burstiness) to gauge domainhood. Our findings indicate that burstiness is the most suitable measure to single out domain-specific words from a specialized corpus and to allow for the quantification of domainhood.
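
To make two of the measures named in the abstract concrete, here is a minimal Python sketch. It uses common textbook formulations (burstiness as the mean number of occurrences of a word in the documents that contain it, and Kullback-Leibler divergence over smoothed word distributions). These are assumptions for illustration and not necessarily the definitions used in the paper. The function names, the smoothing constant `eps`, and the toy documents are all hypothetical.

```python
import math
from collections import Counter


def burstiness(docs):
    """Return {word: total occurrences / number of documents containing it}.

    `docs` is a list of token lists. This is one common formulation of
    burstiness; it is an assumption, not the paper's confirmed definition.
    """
    total = Counter()
    doc_freq = Counter()
    for tokens in docs:
        total.update(tokens)          # corpus-wide frequency
        doc_freq.update(set(tokens))  # each document counts once per word
    return {word: total[word] / doc_freq[word] for word in total}


def kl_divergence(p_counts, q_counts, eps=1e-9):
    """Return KL(P || Q) in bits for two word-count mappings.

    Additive smoothing with `eps` (a hypothetical choice) avoids division
    by zero for words missing from one of the corpora.
    """
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + eps * len(vocab)
    q_total = sum(q_counts.values()) + eps * len(vocab)
    kl = 0.0
    for word in vocab:
        p = (p_counts.get(word, 0) + eps) / p_total
        q = (q_counts.get(word, 0) + eps) / q_total
        kl += p * math.log2(p / q)
    return kl


# Toy data, invented purely for illustration.
domain_docs = [
    ["insulin", "dose", "insulin", "patient"],
    ["insulin", "insulin", "insulin", "the"],
    ["the", "patient", "dose"],
]
general_docs = [["the", "weather", "is", "the", "topic"]]

scores = burstiness(domain_docs)
domain_counts = Counter(t for d in domain_docs for t in d)
general_counts = Counter(t for d in general_docs for t in d)
divergence = kl_divergence(domain_counts, general_counts)
```

Under this formulation, a word that clusters heavily in the few documents where it appears (here "insulin") gets a higher burstiness score than a word spread thinly across documents, which is the intuition behind using burstiness to single out domain-specific vocabulary.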

Place, publisher, year, edition, pages
Springer Berlin/Heidelberg, 2018. pp. 207-217
Series
Communications in Computer and Information Science, ISSN 1865-0929, E-ISSN 1865-0937 ; 903
Identifiers
URN: urn:nbn:se:oru:diva-73233
DOI: 10.1007/978-3-319-99133-7_17
ISI: 000460552400017
Scopus ID: 2-s2.0-85052001976
ISBN: 978-3-319-99133-7 (digital)
ISBN: 978-3-319-99132-0 (print)
OAI: oai:DiVA.org:oru-73233
DiVA, id: diva2:1297164
Conference
29th International Conference on Database and Expert Systems Applications (DEXA), Regensburg, Germany, September 3-6, 2018
Available from: 2019-03-19. Created: 2019-03-19. Last updated: 2019-03-19. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Person

Alirezaie, Marjan
