english – Techdirt

Stories filed under: "english"

Research Suggests A Large Proportion Of Web Material In Languages Other Than English Is Machine Translations Of Poor Quality Texts

from the curse-of-recursion dept

The latest generative AI tools are certainly impressive, but they bring with them a wide range of complex problems, as numerous posts on Techdirt attest. A new academic paper, published on arXiv, raises more of them, but from a new angle. Entitled “A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism”, it studies the impact of today’s low-cost AI translation tools on the online world:

We explore the effects that the long-term availability of low cost Machine Translation (MT) has had on the web. We show that content on the web is often translated into many languages, and the quality of these multi-way translations indicates they were primarily created using MT.

“Multi-way” in this context means that two or more sentences can be found translated in several different languages. According to the researchers, of the 6.38 billion sentences studied, 2.19 billion are found in multi-way translations. In particular, languages that appear less frequently online had more multi-way sentences, with disproportionately more found among the rarest languages. Another key feature observed is that highly multi-way parallel translations are “significantly worse” than two-way translations. Moreover, the multi-way data consisted of shorter, more predictable sentences compared to two-way translations. Inspecting a random sample of 100 highly multi-way parallel sentences, the researchers found:

the vast majority came from articles that we characterized as low quality, requiring little or no expertise or advance effort to create, on topics like being taken more seriously at work, being careful about your choices, six tips for new boat owners, deciding to be happy, etc. Furthermore, we were unable to find any translationese or other errors that would suggest the articles were being translated into English (either by human translators or MT), suggesting it is instead being generated in English and translated to other languages.

Taking these observations together, the paper suggests that highly multi-way sentences are generated using AI, specifically machine translations of low-quality English-language originals. Further analysis showed that in the languages found less commonly online, most translations are multi-way parallel, which means that AI content dominates translated material in those languages. In addition:

a large fraction of the total sentences in lower resource languages have at least one translation implying that a large fraction of the total web in those languages is MT generated

In other words, however bad the problems are that AI is creating for English-language material, they are probably worse in languages found less commonly online, since a major proportion of the Web in those languages is generated by machines, not humans.

If this conclusion holds true beyond the dataset studied by the researchers, there is another interesting issue. Generative AI depends on large training sets, which often come from the Web. For languages other than English, the new paper suggests that much of the training material will be translations by AI of low-quality, possibly AI-generated texts. This issue of generative AI feeding on itself has been studied in earlier research. One group summarized their results on “The Curse of Recursion” as follows:

We find that use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear. We refer to this effect as Model Collapse and show that it can occur in Variational Autoencoders, Gaussian Mixture Models and LLMs [Large Language Models]. We build theoretical intuition behind the phenomenon and portray its ubiquity amongst all learned generative models. We demonstrate that it has to be taken seriously if we are to sustain the benefits of training from large-scale data scraped from the web.

The new research suggests this is likely to be a more serious problem when building generative AI systems in languages for which there is less material online that can be used for training. The good news is that the presence of multi-way sentences in languages other than English is a strong indication that they have been produced by AI, which offers a means to spot them and filter them out. The bad news is that if this technique is applied to improve the quality of training materials and avoid “model collapse”, the extra processing will make the already energy-hungry business of training generative AI systems even more damaging for the planet.
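To make the filtering idea concrete, here is a minimal Python sketch. It is not the researchers' method: the input format (one dict per group of mutually translated sentences, keyed by language code), the function names, and the threshold of three languages for “multi-way” are all assumptions made for illustration.

```python
from collections import Counter

# Assumption: a group counts as "multi-way" once it spans this many languages.
MULTI_WAY_THRESHOLD = 3


def split_by_parallelism(groups, threshold=MULTI_WAY_THRESHOLD):
    """Separate translation groups into multi-way ones and all the others.

    Each group is a dict mapping a language code to one sentence; all the
    sentences in a group are assumed to be translations of one another.
    """
    multi_way, other = [], []
    for group in groups:
        if len(group) >= threshold:
            multi_way.append(group)
        else:
            other.append(group)
    return multi_way, other


def multi_way_share(groups, threshold=MULTI_WAY_THRESHOLD):
    """For each language, return the fraction of its sentences in multi-way groups."""
    total = Counter()
    flagged = Counter()
    for group in groups:
        for lang in group:
            total[lang] += 1
            if len(group) >= threshold:
                flagged[lang] += 1
    return {lang: flagged[lang] / total[lang] for lang in total}


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the paper.
    sample = [
        {"en": "Decide to be happy.", "fr": "Décidez d'être heureux."},
        {"en": "Six tips for new boat owners.", "fr": "...", "wo": "...", "xh": "..."},
    ]
    kept = split_by_parallelism(sample)[1]
    print(len(kept), multi_way_share(sample))
```

In a sketch like this, a language whose share is close to 1.0 would be one where almost everything with a translation is likely machine-made, which is the pattern the paper reports for the rarest languages. A real pipeline would first have to find the translation groups, which is the expensive part.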

Follow me @glynmoody on Mastodon and on Bluesky.

Filed Under: ai, arxiv, climate crisis, energy, english, generative ai, llms, machine translation, training

Italian Legislators Seek To Secure Their Existence And Future For Italian Children By Outlawing English Use By Citizens

from the che-cazzo? dept

The past is a foreign country. They do things differently there. – L.P. Hartley, The Go-Between

Accurate. And sometimes a foreign country wants to be the past. Somewhere between a faded photo of Benito Mussolini’s inverted corpse and an ill-received performance by the Stormtroopers of Death at the Teatro alla Scala lies this inexplicable decision to return Italy to its nationalistic roots.

Italians who use English and other foreign words in official communications could face fines of up to €100,000 ($108,705) under new legislation introduced by Prime Minister Giorgia Meloni’s Brothers of Italy party.

Fabio Rampelli, a member of the lower chamber of deputies, introduced the legislation, which is supported by the prime minister.

While the legislation encompasses all foreign languages, it is particularly geared at “Anglomania” or use of English words, which the draft states “demeans and mortifies” the Italian language, adding that it is even worse because the UK is no longer part of the EU.

“Demeans.” “Mortifies.” Holy hell. This sounds like those border dwellers who stubbornly insist everyone should speak American despite being able to track their roots back to non-English speaking countries like Norway, Germany, and probably Liechtenstein.

It’s apparently demeaning for Italian citizens (and especially their elected reps) to casually drop E-bombs during conversation. But of course that makes sense to these legislators, who have formed a far right party far more concerned with excising anything not strictly Italian than solving actual problems, all while pretending to be the representative voice of millions of apparently disenfranchised Italians. Basically, it’s the Republican party but with better fish options during fundraising banquets.

The “Brothers of Italy” have received plenty of labels from political opponents, onlookers, journalists, and outside observers. They’re the ones you expect: “nationalist,” “neo-fascist,” “anti-immigrant,” “nativist.” And now they’re making laws. Extremely stupid laws.

The first article of the legislation guarantees that even in offices that deal with non Italian-speaking foreigners, Italian must be the primary language used.

Article 2 would make Italian “mandatory for the promotion and use of public goods and services in the national territory.” Not doing so could garner fines between €5,000 ($5,435) and €100,000 ($108,705).

The law would turn over language enforcement to the Culture Ministry, which would be able to levy fines to anyone using another language or, as CNN points out, incorrectly pronouncing Italian words. Anyone running for office would have to prove their Italian language bona fides [is Latin still ok?] before being allowed to participate in this ongoing backslide into national socialism.

Meanwhile, the same ruling party has introduced another bill targeting synthetically produced food under the theory that moving towards more sustainable food supplies presents a threat to the “nation’s heritage.”

I assume some readers will see this and find nothing wrong with it. Why not preserve native languages and cultures? Even if there’s a downside (WWII), there’s also an upside, as is stated in this classic Italian adage:

Sicuramente questo farà funzionare i treni in orario.

Maybe?

I would like to say this legislation is going nowhere but I also made it clear in the years 1974-2016 that there was no way Donald Trump could win a presidential election. Nothing is beyond the realm of imagination at this point. A bunch of dipshits playing to an even dipshittier base are in power. All bets are off. Stupid people will be harming innocent people just to score internet points with 8chan regulars. That’s the way the world operates now. Enjoy your illegal English while you can, Italians.

Filed Under: anglomania, english, fines, foreign languages, free speech, giorgia meloni, italian, italy

China Forbids The Use Of English Words In Mobile Games

from the not-as-crazy-as-it-looks dept

Techdirt has run many articles about China’s direct assault on Internet freedom. Indeed, its attempts to muzzle online dissent are so all-encompassing you might think it has run out of things to censor. But you’d be wrong: China is now reining in games for mobile phones, as a post on Tech in Asia explains:

> A little over a month ago, Chinese censorship bureau SAPPRFT announced new rules that require every mobile game launched in China to be pre-approved by SAPPRFT (already-launched games will have to get retroactive approval before the grace period ends in October). Before the rules had even gone into effect, developers and analysts alike were predicting things could be bad, and that the rules might dismantle China’s indie mobile gaming scene entirely.

Making sure games aren’t seditious in any way might be expected, but there’s a rather weird twist to this latest move:

> One developer’s rant has gone viral in the Chinese web after their game was supposedly rejected by SAPPRFT for containing English words. Not offensive English words, mind you, but completely innocuous ones like “mission start” and “warning.” “I’m really fucking surprised,” wrote the developer of the rejection.
>
> Another developer confirmed that their game had been rejected for the same reason: including English words like “go” and “lucky.” SAPPRFT’s rules also forbid the use of traditional Chinese characters.

The use of English here is hardly subversive. The words in question form part of a global gaming language that has little to do with either the US or the UK. The ban on traditional Chinese characters, as opposed to the simplified ones that are generally used in China, is more understandable: Taiwan still uses the traditional form, so their inclusion might be seen as some kind of subliminal political statement.

The consequence is likely to be fewer games from smaller Chinese software companies, who are less able to meet the stringent new demands. As the Tech in Asia post rightly points out:

> We could be facing a future where China’s entire mobile game catalogue consists only of the games produced by powerful corporations like Tencent and Netease, with no room for startups and indies.

And that is probably the real reason for this latest move: big companies tend to be far more willing to toe the government line than smaller independents, since they have far more to lose. So, as with other apparently arbitrary moves, the latest unexpected clampdown by the Chinese government looks to be yet another example of its shrewd and subtle control of the online world.

Follow me @glynmoody on Twitter or identi.ca, and +glynmoody on Google+

Filed Under: censorship, china, english, mobile, mobile games, words

Obama Administration Learns: If You Redefine Every Word In The Dictionary, You Can Get Away With Just About Anything

from the words-mean-something dept

We’ve written before about how the NSA uses its own definitions of some fairly basic English words, in order to pretend to have the authority to do things it probably… doesn’t really have authority to do. It’s become clear that this powergrab-by-redefinition is not unique to the NSA when it comes to the executive branch of the government. Earlier this year, we also wrote about the stunning steady redefinition of words within the infamous “Authorization to Use Military Force” (AUMF) that was passed by Congress immediately after September 11, 2001. It officially let the President use “all necessary and appropriate force” against those who “planned, authorized, committed or aided the terrorist attacks that occurred on September 11, 2001.” But, over time, the AUMF was being used to justify efforts against folks who had nothing to do with September 11th, leading to this neat sleight of hand in which the military started pretending that the AUMF also applied to “associated forces.” That phrase appears nowhere in the AUMF, but it’s a phrase that is regularly repeated and claimed by the administration and the military.

But, it goes beyond that. As Trevor Timm highlights over at The Guardian, pretty much the entire drone bombing (drones, by the way, are also apparently “authorized” by the AUMF) of Syria involves the administration conveniently redefining basic English to suit its purposes. Let’s start with the authorization for the bombing itself:

For instance, in his Tuesday statement that US airstrikes that have expanded into Syria, Obama studiously avoided any discussion about his domestic legal authority to conduct these strikes. That dirty work was apparently left up to anonymous White House officials, who told the New York Times’s Charlie Savage that both the Authorization of Use of Military Force (AUMF) from 2001 (meant for al-Qaida) and the 2002 war resolution (meant for Saddam Hussein’s Iraq) gave the government the authority to strike Isis in Syria.

In other words: the legal authority provided to the White House to strike al-Qaida and invade Iraq more than a dozen years ago now means that the US can wage war against a terrorist organization that’s decidedly not al-Qaida, in a country that is definitely not Iraq.

It’s amazing what you can accomplish when you pretend words mean something entirely different than they do. Hell, if you can just make words mean whatever the hell you want them to mean, there’s no such thing as a limitation on what you can do. It’s all fair game. Who needs laws when the law is basically a mad libs for you to fill in what you want?

Moving on. The definitional jujitsu covers the people who were killed by the bombing as well. Civilians? What civilians?

Buzzfeed’s Evan McMorris-Santoro reported that the Pentagon is “confident” that no civilians were killed in any of the initial airstrikes in Syria, despite a credible report to the contrary. But we have no idea what that actually means either. The White House previously embraced a re-definition of “civilian” so it could easily deny its drone strikes were killing anyone other than “militants” in Yemen, Pakistan, and elsewhere, according to a New York Times report in 2012:

> It in effect counts all military-age males in a strike zone as combatants, according to several administration officials, unless there is explicit intelligence posthumously proving them innocent.

So any casualties, if they’re men, might well be tallied as “militants” even if the actual dead people were not.

Kill anyone you want, just as long as they’re men of a certain age. Thank you Pentagon dictionary. You just wiped out civilian deaths.

But why stop there? How about “imminent threats”? Because that sounds pretty scary, right? It sure is — especially when it can mean whatever the hell the administration wants:

In addition to conducting airstrikes against Isis in Syria on Tuesday, the Obama administration also announced it had also targeted the “Khorasan Group”, a separate al-Qaida-linked terrorist organization. They justified it by claiming that the group was plotting an “imminent” attack on the US. Before last week, hardly anyone had heard of the Khorasan Group (in fact, even their name was classified), so it’s difficult to judge from public information just how threatening their alleged plot really was. But when you add in the administration’s definition of “imminent,” it becomes impossible.

Take, for example, this definition from a Justice Department white paper, which was leaked last year, intended to justify the killing of Americans overseas:

> [A]n “imminent” threat of violent attack against the United States does not require the United States to have clear evidence that a specific attack on U.S. persons will take place in the immediate future.

To translate: “imminent” can mean a lot of things – including “not imminent”.

This is pretty neat. Anything else you’ve got for us? How about “combat” or “ground troops”? They’re not what you think they are either, because a malleable language can do anything:

As the New York Times’s Mark Landler detailed over the weekend, the White House has “an extremely narrow definition of combat – a definition rejected by virtually every military expert.” According to the Obama administration, the 1600 “military advisers” that have steadily been flowing in Iraq fall outside this definition, despite the fact that “military advisers” can be: embedded with Iraqi troops; carry weapons; fire their weapons if fired upon; and call in airstrikes. In the bizarro dictionary of war employed by this White House, none of that qualifies as “combat”.

Yes, the English language changes over time and that’s generally a good thing. But we’re not talking about the way the word “decimate” once meant to lop off 10% and now means “destroy everything.” This is a deliberate misrepresentation of things.

Hell, this seems to go further than Orwell even imagined with his authoritarian use of language and rewriting of history. In this case, rather than just saying “we were always at war with Eurasia,” he could have just changed the definition of “we,” “were,” “always,” “at,” “war,” “with,” and “Eurasia,” and it would have been that much more powerful.

Filed Under: aumf, authorization to use military force, civilians, combat, definitions, english, fud, imminent, language, obama administration, terror, war

Language School's Blogger Fired For Writing A Post On Homophones; Director Fears Association With 'Gay Sex'

from the not-a-hoax dept

Let’s say you’re a company and you hired a social media expert to run your social media and blogging tasks. Now let’s say you want to fire that person. You need a decent reason, right? Maybe your company is just going through layoffs and that job happens to fall under the ax (though, make sure you get control of your Twitter account before dropping the blade). Or maybe your “expert” ran a tweet/response campaign that backfired as badly as it possibly could. Those are good reasons to fire your social media and blogging guy.

What’s a poor reason for firing that person? How about: Well, we thought the person’s post about homophones for our language school’s blog might make people think we’re all gay and whatnot? Yeah, that pretty much covers it.

But when the social-media specialist for a private Provo-based English language learning center wrote a blog explaining homophones, he was let go for creating the perception that the school promoted a gay agenda. Tim Torkildson says after he wrote the blog on the website of his employer, Nomen Global Language Center, his boss and Nomen owner Clarke Woodger, called him into his office and told him he was fired.

Now, I know what you’re thinking: that didn’t f#&$ing happen. Well, au contraire, bonjour, it sure as hell did happen. A school entirely built to teach the English language to non-English speaking immigrants in Utah fired a guy for blogging about homophones. And, just so we’re clear, homophones are not telephones run by the homosexual-ati as a hotline designed to disrupt the traditional family values of ‘Merica. No, homophones are words that sound alike but have different definitions, like “I” and “eye.”

Torkildson’s account includes some eyebrow-raising quotes of Woodger claiming not to know what homophones were, claiming that they don’t teach that kind of “advanced” language study to their English language students, and worrying that the post would associate the school with homosexuality for reasons unknown to this writer. Woodger’s account is different, but vaguely not so different.

Woodger says his reaction to Torkildson’s blog has nothing to do with homosexuality but that Torkildson had caused him concern because he would “go off on tangents” in his blogs that would be confusing and sometimes could be considered offensive…Woodger says his school has taught 6,500 students from 58 countries during the past 15 years. Most of them, he says, are at basic levels of English and are not ready for the more complicated concepts such as homophones.

Er, so yeah. It had nothing to do with homosexuality, except it has something to do with tangents and being offensive, and they don’t teach the concept of homophones to English students because it’s so advanced. I’d ask you to hazard a guess what the tangents and “offensive” stuff were in these damned language posts, but you’ve already probably guessed correctly.

Regardless, if homophones associate a language school with homosexuality, then I guess all of us Homo sapiens are at least a little gay. Right?

Filed Under: clarke woodger, english, homophobia, homophones, language, tim torkildson
Companies: nomen global language center

DailyDirt: You Say Tomato, I Say Tomahto

from the urls-we-dig-up dept

Dead languages don’t change and evolve. It’s the languages that people speak the most that develop new words and new dialects. In the past, it’s been difficult to track the evolution of language, but with more and more ~~wiretapped phonecalls~~ digital voice recordings available for analysis, linguists are in a better position to study how languages are changing. Here are just a few interesting links on language dialects.

Joshua Katz used linguistic data from Bert Vaux’s dialect survey to generate interactive maps of how people speak across the continental US. What is your generic term for a sweetened carbonated beverage? [url]
Phonemica is a project to record the thousands of different Chinese dialects in order to preserve the richness of the language for future generations. It’s run by volunteers who want to collect spoken stories, and it was started with an Indiegogo fundraising campaign. [url]
There are several barriers that prevent various English dialects from becoming their own languages. Modern literacy and the increasing global mobility of people make it harder and harder for new languages to split off and develop. [url]

If you’d like to read more awesome and interesting stuff, check out this unrelated (but not entirely random!) Techdirt post via StumbleUpon.

Filed Under: chinese, dialects, english, language, linguists, literacy, mandarin, phonemica, speaking, voices
Companies: indiegogo

DailyDirt: English Curiosities

from the urls-we-dig-up dept

The English language is one of the hardest languages to learn. There are countless irregularities and significant differences between written and spoken English grammar that can trip up almost anyone. Here are just a few linguistic analyses of slightly older versions of English.

A method of diagramming English sentences was invented 166 years ago as a way to teach English grammar in a simpler way. Imagine if this was invented today…. [url]
Linguistic anthropology looks at how language has changed and influenced social life. Some words have changed their meanings more rapidly than others, and here’s a chart showing some of the words that have stayed the most consistent with time. [url]
The most significant change to the English language is… the progressive passive. People used to say sentences like: “The house was building” instead of “The house was being built”. (And long ago, it was “The house is a-building” or “The house is on building”.) [url]

Filed Under: diagramming, english, grammar, language, linguistic anthropology, progressive passive, sentences

DailyDirt: Learning A Foreign Language

from the urls-we-dig-up dept

Apparently, Japanese is the most difficult foreign language for native English speakers to learn. Not only does it have different written and spoken codes, it also has three different writing systems. Furthermore, Japanese syntax is left branching, which is the complete opposite of English syntax, which is right branching. Learning a foreign language is never easy (although some people seem to have an easier time than others), but it’s not impossible with enough time and effort put into it. Here are a few more links about learning foreign languages.

A father spoke to his son in only Klingon for the first three years of his life. He was apparently interested in whether his kid, who was just going through his first language acquisition process, would pick up Klingon just like any human language. And, yes, the kid did start to learn it. [url]
What it takes to learn Chinese, or any other foreign language, is simply lots of hard work. You don’t have to be talented. Just follow the “10,000 Hour Rule,” and practice, practice, practice. [url]
Scientists in China think they’ve figured out a better way to teach Chinese. Using network theory, they developed a learning strategy that exploits the structural relationships between Chinese characters, which are actually composed of a fairly limited number of sub-characters. [url]

Filed Under: education, english, klingon, language, learning, mandarin, syntax

DailyDirt: Talking Funny

from the urls-we-dig-up dept

Some people don’t think they have an accent when they talk, but even Mid-westerners have a few discernible pronunciations that can give them away. But beyond accents, there are also speaking quirks like slang and syntax and other weird sounds that people use when they communicate. Here are just a few examples.

The Voices of California project is trying to document California’s English accents. The researchers are gathering examples of different slang, pronunciation and syntax that Californians use — such as “hella,” “pin/pen,” and the “positive anymore” phenomenon. [url]
Is Obama addicted to saying “is is” when he talks? Thankfully, he doesn’t use extreme isisism, like the “triple is” in a sentence like: “What the American people’s understanding is, is, is that…” [url]
The vocal fry is a low creak in someone’s voice while they talk — that mostly occurs at the end of sentences. It was thought to be more common among young women, but newer studies suggest otherwise. [url]

If you’d like to read more awesome and interesting stuff, check out this unrelated (but not entirely random!) Techdirt post.

Filed Under: accents, english, isisism, language, slang, spoken, vocal fry

DailyDirt: Making Up Words

from the urls-we-dig-up dept

The English language creates new words all the time and steals words from other languages to bulk up its vocabulary. Maybe it’s not fair to other languages, but the consequence is that English grammar is highly irregular and correct spelling sometimes requires knowledge of a word’s origins. Here are just a few interesting tidbits on creating new words.

The usage of “OMG” apparently dates back as far as 1917 — when Lord John Fisher used it in a letter to Winston Churchill. However, the Oxford English Dictionary only added OMG to its lexicon in 2011. [url]
How many words exist in the English language? Unabridged dictionaries have hundreds of thousands of entries, but scientific estimates put it closer to a million. A 2011 Culturonomics paper suggests the English language is growing at a rate of about 8,500 new words per year, but that rate is actually slowing down. [url]
Lingodroids are creating new words that humans might be able to use. Perhaps fittingly, these bots are generating a whole lot of new 4-letter words. [url]

Filed Under: culturonomics, english, language, lingodroids, omg, words