[multilingual-dh] open global multilingual television and online news annotation datasets
kalev.leetaru5 at gmail.com
Wed Sep 28 05:53:39 PDT 2022
I thought many of you might find these open, global, multilingual news
annotation datasets of interest.
In collaboration with the Internet Archive's Television News Archive, we
have made available selections from their entire global holdings, spanning
98 channels across 50 countries and territories in 35 languages and
dialects over 20 years:
Each news broadcast is sampled into a grid of images, one every 4 seconds.
The sampled frames are downloadable as a ZIP file to enable at-scale
non-consumptive computational research, such as scanning for how media
from one country is repurposed by the media in other countries:
And 300+ language video OCR, for those interested in the use of onscreen
text across the world:
We have also posted experiments on how state-of-the-art commercial ASR
transcription performs on television news from around the world, including
unsolved challenges like code switching:
For those interested in still imagery, we've analyzed around half a billion
news images from around the world, including 300+ language OCR and EXIF
metadata extraction:
And for those interested in textual multilingualism, we also have a
150-language realtime ngram dataset:
For those interested in low-resource languages and language detection, we
will also soon be rolling out across our datasets our first 400-language
I thought a lot of these resources might be of interest! Email me if you
have any questions!