Frequency Lists

Frequency lists from news articles

Why

Many language students use frequency lists to expand their vocabulary. However, many frequency lists count surface forms rather than lexemes, so inflected forms of the same word are counted as separate entries. A given frequency list may also have been compiled from outdated source material, in which case it may not reflect how often words are actually used today.
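To make the double-counting concrete, here is a minimal Python sketch. The tiny `LEMMAS` lookup table and the token list are made up for illustration; a real pipeline would use a proper lemmatizer instead of a hand-written table.

```python
from collections import Counter

# Hypothetical hand-written lookup from surface form to lemma (illustration only).
LEMMAS = {
    "hablo": "hablar",
    "hablas": "hablar",
    "habló": "hablar",
    "casas": "casa",
}

def count_forms(tokens):
    """Count each surface form as its own entry."""
    return Counter(tokens)

def count_lemmas(tokens):
    """Count tokens after mapping each to its lemma; unknown forms map to themselves."""
    return Counter(LEMMAS.get(t, t) for t in tokens)

tokens = ["hablo", "hablas", "habló", "casa", "casas"]
form_counts = count_forms(tokens)
lemma_counts = count_lemmas(tokens)
print(form_counts)   # five separate entries, one per surface form
print(lemma_counts)  # two entries: hablar and casa
```

Counting by surface form spreads one verb across three entries, while counting by lemma collapses them into a single entry with the combined count.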

Langliter processes thousands of news articles, which enables much of its advanced functionality. With that data already available, creating a lemma-based frequency list is not all that difficult. In the app, this data is used to calculate a modified version of the Dale–Chall readability formula, which gives users an indication of the difficulty of a given article, as shown here:

Readability Score
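The modification used in the app is not described here. For reference, the standard (new) Dale–Chall formula can be sketched as below. The function name, the naive sentence and word splitting, and the `familiar_words` parameter are assumptions made for this example, not the app's implementation.

```python
import re

def dale_chall_score(text, familiar_words):
    """Standard (new) Dale-Chall score; NOT the app's modified version.

    familiar_words: a set of lowercase words treated as "easy" (assumed input).
    """
    # Naive splitting: sentences end at . ! or ?, words are runs of letters.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    difficult = sum(1 for w in words if w not in familiar_words)
    pct_difficult = 100.0 * difficult / len(words)
    avg_sentence_len = len(words) / len(sentences)

    score = 0.1579 * pct_difficult + 0.0496 * avg_sentence_len
    # The formula adds a fixed adjustment when more than 5% of words are difficult.
    if pct_difficult > 5.0:
        score += 3.6365
    return score

easy = dale_chall_score("The cat sat. The dog ran.", {"the", "cat", "sat", "dog", "ran"})
hard = dale_chall_score("The cat sat. The dog ran.", {"the"})
print(easy, hard)
```

A higher score means a harder text: the same two sentences score much higher when most of their words fall outside the familiar set.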

Lists

I will try to update these lists quarterly as the data set grows. Sign up at the bottom to get notified when lists are updated or added. Lists come with some basic PoS information, so if you just want to study the top N verbs, a simple spreadsheet filter is all it should take.

The three columns are Lemma, PoS, Count.
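The same filter can also be done in code. This is a minimal sketch, assuming the spreadsheet has been exported to CSV with a header row of Lemma, PoS, Count, and that verbs carry the tag `V` as in the Spanish tag table. The helper name `top_n_by_pos` is hypothetical, and the rows are placeholder values, not real counts from the lists.

```python
import csv
import io

def top_n_by_pos(rows, pos, n):
    """Return the n most frequent lemmas that carry the given PoS tag."""
    matching = [r for r in rows if r["PoS"] == pos]
    matching.sort(key=lambda r: int(r["Count"]), reverse=True)
    return [r["Lemma"] for r in matching[:n]]

# Placeholder data for illustration only; these are not real counts.
SAMPLE_CSV = """Lemma,PoS,Count
ser,V,900
casa,N,400
tener,V,700
grande,A,300
decir,V,500
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(top_n_by_pos(rows, "V", 2))
```

To run it on a real export, replace the `io.StringIO(SAMPLE_CSV)` source with an opened CSV file.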

All lists are licensed under the Apache License, Version 2.0; see license.txt.

Spanish

Collected from a corpus of ~30k news articles published over the past two years by a variety of newspapers.

es_freq.xlsx

Here is a Google Sheets version with some English translations: Google Sheets Version

Tag Translation:

A = Adjective
C = Conjunction
D = Determiner
N = Noun
P = Pronoun
R = Adverb
S = Adposition
V = Verb

