data mining – Techdirt

Stories filed under: "data mining"

from the that's-not-good dept

The weird and persistently silly copyright reform process in the EU Parliament continues to get more and more bizarre and stupid. Last month, we told you about the first committee vote, which we feared would be terrible, but turned out to be only marginally stupid, as the worst parts of the proposal were rejected. Now, two more committees — the Culture and Education (CULT) and Industry, Research and Energy (ITRE) Committees — have voted on their own reform proposals and the results are really, really bad if you support things like culture, education, research and the public. And, yes, I get the irony of the fact that the Culture and Education Committee in the EU just declared a giant “fuck you” to culture and education with its vote.

Among the many problematic aspects approved by these committees is a filter requirement that would block users from uploading legally obtained media into the cloud. This makes no sense — especially given that the EU already has additional “you must be a pirate” taxes on situations where individuals are making copies of their legally acquired works.
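To make concrete why an upload filter can't tell a legal private copy from infringement, here is a minimal sketch of how fingerprint-based upload filtering typically works. The fingerprint scheme and blocklist are hypothetical stand-ins, not any specific vendor's system:

```python
import hashlib

# Hypothetical blocklist of content fingerprints supplied by rightsholders.
BLOCKED_FINGERPRINTS = {"3f786850e387550fdab836ed7e6dc881de23001b"}

def fingerprint(data: bytes) -> str:
    """Reduce an uploaded file to a content fingerprint (plain SHA-1 here;
    real systems use perceptual hashes that survive re-encoding)."""
    return hashlib.sha1(data).hexdigest()

def allow_upload(data: bytes) -> bool:
    # The filter only sees the bytes. It cannot know whether the uploader
    # bought the file, ripped their own CD, or already paid a private-copying
    # levy on it: a fingerprint match is a fingerprint match.
    return fingerprint(data) not in BLOCKED_FINGERPRINTS
```

That blindness to licensing context is exactly why a mandatory filter ends up blocking legally obtained media.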

And then there’s the whole “snippet tax” which legacy newspapers are demanding because they’ve failed to adapt to the digital age, and they want Google News to send them money for daring to send them traffic without monetary compensation. The whole concept is backwards… and here, it’s been expanded. As Copybuzz explains:

* The press publishers’ right went from applying to “digital” uses of press to all uses, including print. Aside from the fact that this seems a violation of Article 10(1) of the Berne Convention (which establishes a mandatory exception for “press summaries”), the impact of such a massive extension is unfathomable.
* The definition of press publications has become so broad that infringements of article 11 are impossible to predict and hence prevent. The “exceptions” to the applications of this new right just add to the potential legal uncertainty, as the CULT text states: “The [publisher] rights granted under this Directive should be without prejudice to the authors’ rights and should not apply to the legitimate uses of press publications by individual users acting in a private and non-commercial capacity. The protection granted to press publications under this Directive should apply to content automatically generated by an act of hyperlinking related to a press publication without prejudice to the legitimate use of quotations.” This paragraph alone opens a Pandora’s box of unanswered questions:
  * What is a legitimate use of press publications? And who’s the judge of the legitimacy?
  * When are you acting in your private and non-commercial capacity?
  * Content automatically generated by an act of hyperlinking related to a press publication: does that mean that when you share a link on social media and it automatically triggers the appearance of a snippet, you are now officially in trouble?
  * When are you “legitimately” quoting? Is that a new criterion imposed on top of the only mandatory exception globally? And if so, who judges if you comply?

None of that sounds good or well thought out. It sounds like the kind of thing that someone not very knowledgeable about the subject would put together after just hearing one side from a bunch of whining newspaper execs.

And then there’s this nonsense, as summarized by Member of the European Parliament (MEP) Julia Reda:

Incredibly, the ITRE committee — responsible for research and usually a staunch defender of open access — even voted to extend the extra copyright to academic publications, which would make open access publishing virtually impossible. It would stop people from linking to academic content, despite the content itself being free. This would apply to both online publications and print journals. The chilling effects on the spread of academic works and information would be substantial.

Yes, linking to academic content will now require payment — even if it’s open access. That’s… nuts.

And, finally, there’s the “text and data mining” issue — one of the key points the EU has been fighting over in this copyright reform effort. Here, ITRE severely limited who can benefit from the data mining exception, restricting it to research organisations and tiny startups. Again from Copybuzz:

The ITRE Committee for example has in its extreme generosity decided to leave the benefit of the Text and Data Mining exception limited to research organisations and “start-up companies”, defined as “any company with fewer than 10 employees and an annual turnover or balance sheet below €2 million and which was established not earlier than three years before benefiting from the exception”. The message for European start-ups is clear: don’t dare scale up in your first three years of business if you want to mine content, and if you do, move away from the EU (and move anyway after 3 years)! Never mind jobs and growth, the EU mantra we keep on hearing. Oh, and please do not be innovative any longer once you are an established player: we would not want our economy to be competitive on the international scene.
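Translated directly into code, the quoted eligibility test looks like this (a sketch; the thresholds are exactly the ones in the ITRE text above):

```python
from datetime import date

def qualifies_for_tdm_exception(employees: int,
                                annual_turnover_eur: float,
                                founded: date,
                                today: date) -> bool:
    """ITRE's text-and-data-mining test, as quoted above: fewer than 10
    employees, turnover or balance sheet below EUR 2 million, and
    established no more than three years ago."""
    age_years = (today - founded).days / 365.25
    return employees < 10 and annual_turnover_eur < 2_000_000 and age_years <= 3

# A startup that hires its 10th engineer, crosses EUR 2M in revenue, or
# simply survives past year three loses the right to mine -- the rule
# punishes exactly the growth the EU says it wants.
print(qualifies_for_tdm_exception(9, 1_500_000, date(2015, 1, 1), date(2017, 6, 1)))   # True
print(qualifies_for_tdm_exception(12, 1_500_000, date(2015, 1, 1), date(2017, 6, 1)))  # False
```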

This is really a killer for innovation. There’s a massive industry now being built up around machine learning, AI and autonomous machines — and an awful lot of it actually relies on the ability to do text and data mining on the internet. With this proposal, the (of all things) “Industry & Research” committee is basically saying there shall be no such industry or research in Europe. It’s pushing one of the most promising up-and-coming industries out of the EU entirely. Incredible.

It’s almost stunning how bad these decisions were. But, of course, some of the legacy copyright industry folks decided to celebrate, claiming that the votes showed that the EU Parliament “would not tolerate free-riding platforms.” That’s complete nonsense and an insult. Again: things like news aggregators and search engines have been enormously helpful in creating new markets and expanding attention and traffic to sites. If anything, legacy content producers have been “free riding” on those platforms.

Hopefully saner heads will prevail as this process moves forward, but the EU seems to be going down a dark and dangerous road on copyright policy.

Filed Under: copyright, cult, data mining, education, eu, eu parliament, filters, itre, open access, research, snippet tax

from the yes,-but...-how-about-doing-it-right? dept

We recently warned that there were efforts underway to make the EU’s copyright reform proposal even more draconian and ridiculous. Thankfully, the “compromise,” which wasn’t a compromise at all and would have made things much worse, was rejected by the Internal Market and Consumer Protection (IMCO) committee, but there was still plenty of bad stuff to be concerned about.

The mandatory filtering (i.e., mandatory censorship) regime for internet platforms was rejected. That’s a good thing. But, on the flip side, the so-called “link tax,” requiring those who link to and aggregate news to pay news publishers, has moved forward. Two other small bits of good news made it in: a “freedom of panorama,” allowing people to photograph buildings and sculptures without violating someone’s copyright, and a “remix right,” protecting the public’s ability to do basic remixing of copyright-covered works. There are still concerns about the “text and data mining” rules, which limit what content can be mined.

So, basically, it’s a mixed bag. Some, of course, will argue that any “compromise” will involve some good and some bad, but that assumes we need a compromise here. Why not aim for a policy that’s actually better overall, rather than a “compromise” solution? Europe has the chance to lead the way, but appears to have little interest in doing so. Either way, there are more steps in this process and other committees still to vote, so the policy has a long way to go. Hopefully, by the end, it moves more and more toward true copyright reform, rather than just “propping up old industries” reform.

Filed Under: copyright, data mining, eu, eu parliament, filters, google tax, link tax, remix right

Russia Provides Glimpse Of A Future Where Powerful Facial Recognition Technology Has Abolished Public Anonymity

from the are-we-really-ready-for-that? dept

As hardware and software advance, so facial recognition becomes more accurate and more attractive as a potential solution to various problems. Techdirt first wrote about this area back in 2012, when Facebook had just started experimenting with facial recognition (now we’re at the inevitable lawsuit stage). Since then, we’ve reported on an increasing number of organizations exploring the use of facial recognition, including the FBI, the NSA, Boston police and even the church. But all of those pale in comparison to what is happening in Russia, reported here by the Guardian:

> FindFace, launched two months ago and currently taking Russia by storm, allows users to photograph people in a crowd and work out their identities, with 70% reliability.
>
> It works by comparing photographs to profile pictures on Vkontakte, a social network popular in Russia and the former Soviet Union, with more than 200 million accounts. In future, the designers imagine a world where people walking past you on the street could find your social network profile by sneaking a photograph of you, and shops, advertisers and the police could pick your face out of crowds and track you down via social networks.

One of FindFace’s founders, Alexander Kabakov, points out the service could have a big impact on dating:

> “If you see someone you like, you can photograph them, find their identity, and then send them a friend request.” The interaction doesn’t always have to involve the rather creepy opening gambit of clandestine street photography, he added: “It also looks for similar people. So you could just upload a photo of a movie star you like, or your ex, and then find 10 girls who look similar to her and send them messages.”

Definitely not creepy at all.

Of course, a 70% hit rate isn’t that good: perhaps FindFace isn’t really such a threat to public anonymity. The trouble is, the Guardian article reports that the company has performed three million searches on its database of around a billion photographs using just four common-or-garden servers. It’s easy to imagine what might be achieved with some serious hardware upgrades, along with tweaks to the software, or with access to even bigger, more complete databases. For example, government ones: according to the Guardian, FindFace’s founders think the big money will come from selling their system to “law enforcement and retail.” Although they’ve not yet been contacted by Russia’s FSB security agency, they say they’d be happy to listen to offers from them. Perhaps comforted by the thought of all that future business coming his way, Kabakov is philosophical about the social implications of his company’s technology:
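The Guardian piece doesn't describe FindFace's internals, but the basic shape of this kind of search is well known: map each photo to an embedding vector, then find the nearest neighbors of a query face. A minimal sketch with NumPy, using random vectors as stand-ins for a real face-embedding model:

```python
import numpy as np

# Stand-in database: one 128-d embedding per profile photo. A real system
# would get these from a face-embedding model, not a random generator.
rng = np.random.default_rng(0)
db = rng.standard_normal((100_000, 128)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)

def top_matches(query: np.ndarray, k: int = 10) -> np.ndarray:
    """Return indices of the k profile photos most similar to the query face."""
    query = query / np.linalg.norm(query)
    scores = db @ query              # cosine similarity via dot products
    return np.argsort(scores)[-k:][::-1]

# Brute force is O(n) per query; at a billion photos you shard the database
# and switch to an approximate nearest-neighbor index, which is why modest
# hardware upgrades buy so much extra reach.
print(top_matches(rng.standard_normal(128).astype(np.float32)))
```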

> “In today’s world we are surrounded by gadgets. Our phones, televisions, fridges, everything around us is sending real-time information about us. Already we have full data on people’s movements, their interests and so on. A person should understand that in the modern world he is under the spotlight of technology. You just have to live with that.”

That may well be true. But the question is, are we ready to do so?

Follow me @glynmoody on Twitter or identi.ca, and +glynmoody on Google+

Filed Under: data mining, facial recognition, privacy, russia
Companies: findface

Is It Really That Big A Deal That Twitter Blocked US Intelligence Agencies From Mining Public Tweets?

from the it's-public-info dept

Over the weekend, some news broke about how Twitter was blocking Dataminr, a (you guessed it) social media data mining firm, from providing its analytics of real-time tweets to US intelligence agencies. Dataminr — which, as everyone is careful to note, has investments from both Twitter and the CIA’s venture arm, In-Q-Tel — has access to Twitter’s famed “firehose” API of basically every public tweet. The company already has relationships with financial firms, big companies and other parts of the US government, including the Department of Homeland Security, which has been known to snoop around on Twitter for quite some time.
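For a rough sense of what "analytics of real-time tweets" means mechanically, here is a toy sketch: consume a stream of public tweets and surface those matching watchlisted terms. The `tweet_stream()` generator is a hypothetical stand-in for firehose access, which Twitter grants only under contract:

```python
# Hypothetical watchlist; real Dataminr-style analytics are far richer
# (geotagging, clustering, anomaly detection), but the shape is the same.
WATCHLIST = {"earthquake", "explosion", "outage"}

def tweet_stream():
    """Hypothetical stand-in for a firehose connection yielding public tweets."""
    yield {"user": "example_user", "text": "Huge explosion reported downtown"}
    yield {"user": "another_user", "text": "Nice weather today"}

def flag_events(stream):
    for tweet in stream:
        words = {w.strip(".,!?").lower() for w in tweet["text"].split()}
        if words & WATCHLIST:
            yield tweet

for hit in flag_events(tweet_stream()):
    print(hit["user"], "->", hit["text"])
```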

Apparently, some (unnamed) intelligence agencies within the US government had signed up for a free pilot program, and as that program was ending, Twitter reminded Dataminr that the terms of its firehose agreement bar the data from being used for government surveillance. Twitter insists that this isn’t a change; it’s just enforcing existing policies.

Many folks are cheering Twitter on in this move, and given the company’s past actions, the stance is perhaps not that surprising. The company was one of the very first to challenge government attempts to get access to Twitter account info (well before the whole Snowden stuff happened). Also, some of the Snowden documents revealed that Twitter was alone among internet companies in refusing to sign up for the NSA’s PRISM program, which made it easier for internet firms to supply the NSA with info in response to FISA Court orders. And, while most other big internet firms “settled” with the government over revealing government requests for information, Twitter has continued to fight on, pushing for the right to be much more specific about how often the government asks for what kinds of information. In other words, Twitter has a long and proud history of standing up to attempts to use its platform for surveillance purposes — and it deserves kudos for its principled stance on these issues.

That said… I’m not really sure that blocking this particular usage makes any sense. This is public information, rather than private information. And, yes, not everyone has access to “the firehose,” so Twitter can put whatever restrictions it wants on its usage; but seeing as the underlying tweets are public, it’s likely that others have workarounds (though perhaps not quite as timely). Separately, reviewing public information doesn’t actually seem like a bad idea for the intelligence community. Yes, we can all agree (and we’ve been among the most vocal in arguing this) that the intelligence agencies have a long and horrifying history of questionable datamining of databases they should not have access to. But publicly posted tweets seem like a weird thing for anyone to be concerned about. There’s no reasonable expectation of privacy in that information, and not because of some dumb “third party doctrine” concept, but because the individuals who tweet make a proactive decision to post that information publicly.

So, perhaps I’m missing something here (and I expect that some of you will explain what I’m missing in the comments), but I don’t see why it’s such a problem for intelligence agencies to do datamining on public tweets. We can argue that the intelligence community has abused its datamining capabilities in the past, and that’s true, but those concerns have generally been about private info. I’m not sure it’s helpful to argue that the intelligence community shouldn’t even be allowed to scan publicly available information. It feels like being “anti-intelligence” rather than “anti-abusive intelligence.”

Filed Under: data mining, intelligence, intelligence community, public info, surveillance, tweets
Companies: dataminr, twitter

DailyDirt: Recipes Analyzed By Algorithms

from the urls-we-dig-up dept

Algorithms are data mining every aspect of our lives and the world around us, pulling out interesting bits of information that we should act on. Companies like Google and Facebook come up with algorithms to figure out when to put ads in front of our eyes and how to display pertinent information (sometimes at the same time). Other algorithms are apparently watching what we eat, trying to highlight what makes food taste good to us, formulate the “perfect Pepsis” or find unexpected recipes and flavor combinations. Here are just a few examples of software-based culinary art.
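One well-known approach behind such systems (the "food pairing" idea used in projects like IBM's cognitive cooking) scores ingredient pairs by how many flavor compounds they share. A toy version, with invented compound lists:

```python
# Invented flavor-compound sets; real systems draw on databases with
# hundreds of volatile compounds per ingredient.
COMPOUNDS = {
    "strawberry": {"furaneol", "linalool", "hexanal"},
    "basil":      {"linalool", "eugenol", "estragole"},
    "chocolate":  {"furaneol", "pyrazine", "vanillin"},
}

def pairing_score(a: str, b: str) -> int:
    """Food-pairing heuristic: the more shared compounds, the better the match."""
    return len(COMPOUNDS[a] & COMPOUNDS[b])

pairs = [(a, b) for a in COMPOUNDS for b in COMPOUNDS if a < b]
for a, b in sorted(pairs, key=lambda p: -pairing_score(*p)):
    print(f"{a} + {b}: {pairing_score(a, b)} shared compound(s)")
```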

If you’d like to read more awesome and interesting stuff, check out this unrelated (but not entirely random!) Techdirt post via StumbleUpon.

Filed Under: ai, algorithms, artificial intelligence, cognitive computing, cognitive cooking, cuisine, data mining, flavors, food, recipes, watson
Companies: ibm

DailyDirt: Data Is Everywhere, Let's Use It

from the urls-we-dig-up dept

If you’ve been reading Techdirt for a while, you probably know that we’re not big fans of the myth that “if you’re not paying for the product, you are the product.” Regardless of whether or not you pay for something, some companies will still treat their customers horribly. Likewise, some corporations try to treat customers (or users) with respect, without expecting payment for the favor. That said, when it comes to analyzing consumer behavior, it’s easy to make mistakes that get misinterpreted. An unintentional email message to a targeted (or even un-targeted) group of customers can enrage a whole community. Consumer data is available to a lot of companies, but it might be wise for these companies to tread lightly with their data scientists. Here are just a few cases that data miners might want to check out.

If you’d like to read more awesome and interesting stuff, check out this unrelated (but not entirely random!) Techdirt post via StumbleUpon.

Filed Under: advertising, consumer behavior, data, data mining, emotional contagion, marketing, psychology, reputation, social experiments
Companies: facebook, okcupid, shutterfly, target

DailyDirt: It's What You Say AND How You Say It

from the urls-we-dig-up dept

Studying how language can predict behavior is a fascinating field. As communications are increasingly digital, everyone’s messages are more easily data mined for all sorts of analysis (ahem, and not all of it is done by the CIA). Marketing folks are looking at how catchy phrases might increase sales — which may be why you’re seeing more headlines like “8 simple ways to …” and “one simple trick that …” in ads. Here are just a few other linguistic studies for you to peruse. Also, happy belated National Grammar Day!
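As a small illustration of the kind of surface features these studies extract, here is a sketch that scores headlines against a couple of the formulas mentioned above (both the headlines and the patterns are invented for the example):

```python
import re

# Invented example headlines; the patterns mirror the "8 simple ways" /
# "one simple trick" formulas mentioned above.
HEADLINES = [
    "8 simple ways to organize your desk",
    "One simple trick that saves you money",
    "Quarterly earnings report released",
]

CLICKBAIT_PATTERNS = [r"^\d+ simple ways", r"one simple trick"]

def clickbait_score(headline: str) -> int:
    """Count how many catchy-phrase patterns a headline matches."""
    return sum(bool(re.search(p, headline.lower())) for p in CLICKBAIT_PATTERNS)

for h in HEADLINES:
    print(clickbait_score(h), h)
```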

If you’d like to read more awesome and interesting stuff, check out this unrelated (but not entirely random!) Techdirt post via StumbleUpon.

Filed Under: analysis, behavior, big data, communication, data mining, dating, grammar, language, linguistics, predictions, whom

DailyDirt: Making Money The Old-Fashioned Way… By Algorithms

from the urls-we-dig-up dept

People are changing the way they make decisions now that technology can help them crunch more numbers than ever before. Instead of just going with a gut instinct, decisions can be based on all kinds of random data analysis (for better or worse). Big data is a popular trend, and more and more successful examples of data mining for profit seem to get publicized every day. But are we only looking at the winning combinations and ignoring the losers? Here are just a few examples of algorithms that might be making some money.
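That "ignoring the losers" worry is survivorship bias, and it is easy to demonstrate: simulate thousands of strategies that are pure coin flips, and the best of them still looks brilliant. A quick sketch:

```python
import random

random.seed(42)

def random_strategy_return(n_trades: int = 250) -> float:
    """A 'strategy' that just flips a coin each trade: +1% or -1%."""
    value = 1.0
    for _ in range(n_trades):
        value *= 1.01 if random.random() < 0.5 else 0.99
    return value

results = [random_strategy_return() for _ in range(10_000)]
best = max(results)
winners = sum(r > 1.0 for r in results)

# With 10,000 zero-skill strategies, the single best one typically shows a
# substantial gain. Publicize only that one and "data mining for profit"
# looks easy.
print(f"best strategy grew to {best:.2f}x; {winners} of 10000 ended profitable")
```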

If you’d like to read more awesome and interesting stuff, check out this unrelated (but not entirely random!) Techdirt post via StumbleUpon.

Filed Under: algorithm, artificial intelligence, big data, data mining, gigo, poker, simulation, venture capital

Former NSA Boss: We Don't Data Mine Our Giant Data Collection, We Just Ask It Questions

from the um,-that's-the-same-thing dept

General Michael Hayden, the former head of both the NSA and the CIA, has already been out making silly statements about how the real “harm” of the latest leaks is that they show the US “can’t keep a secret.” However, he’s now given an even more ridiculous interview trying to defend both the mass dragnet collection of all phone records and the PRISM collection of internet data. In both cases, some of his claims are quite incredible. Let’s start with this whopper, in which he claims that they don’t do any data mining on the mass dragnet data they collect; they just “ask it questions.”

HAYDEN: It is a successor to the activities we began after 9/11 on President Bush’s authority, later became known as the Terrorist Surveillance Program.

So, NSA gets these records and puts them away, puts them in files. They are not touched. So, fears or accusations that the NSA then data mines or trolls through these records, they’re just simply not true.

MARTIN: Why would you be collecting this information if you didn’t want to use it?

HAYDEN: Well, that’s – no, we’re going to use it. But we’re not going to use it in the way that some people fear. You put these records, you store them, you have them. It’s kind of like, I’ve got the haystack now. And now let’s try to find the needle. And you find the needle by asking that data a question. I’m sorry to put it that way, but that’s fundamentally what happens. All right. You don’t troll through the data looking for patterns or anything like that. The data is set aside. And now I go into that data with a question that – a question that is based on articulable(ph), arguable, predicate to a terrorist nexus. Sorry, long sentence.

I’m not sure if Hayden is just playing dumb or what, but asking the data questions is data mining. What he describes is exactly the kind of data mining people are afraid of. On top of that, the fact that he flat out admits they’re assembling the haystack to “try to find the needle” is exactly the kind of issue people are so concerned about. The whole point of the 4th Amendment is that you’re not allowed to collect the haystack in the first place. You’re only supposed to be able, in narrow circumstances and with proper oversight, to go looking for the needle. Yet, here, he admits that there’s no such oversight once they have that haystack:

MARTIN: May I back up? Do you have to have approval…

HAYDEN: No.

MARTIN: …from the FISA court…

HAYDEN: No.

MARTIN: …which is the intelligence surveillance court established in order to go in and ask that question.

HAYDEN: You have had a generalized approval, and so you’ve got to justify the overall approach to the judge. But you do not have to go to the judge, saying, hey, I got this number now. I’ll go ahead and get a FISA request written up for you. No, you don’t have to do that.

That should be a “wow” moment right there, because it also appears to contradict President Obama’s claim that “if anybody in government wanted to go further than just that top-line data … they’d have to go back to a federal judge and — and — and indicate why, in fact, they were doing further — further probing.” Furthermore, he’s admitting that the NSA gives the FISA Court some vague reason why it needs every possible record on phone calls, and then there’s no oversight by the court on how those records are used — other than vague promises from the NSA that they’re not being abused for data mining, just used for “asking questions,” which is data mining.
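To see why "asking the data a question" is data mining by another name, consider what such a question looks like once the haystack exists. A toy contact-chaining query over stored call records (the records are invented; the hop-based structure mirrors how contact-chaining queries have been publicly described):

```python
from collections import defaultdict

# Toy stand-in for the stored haystack: every phone call, by everyone.
CALL_RECORDS = [
    ("555-0001", "555-0002"),
    ("555-0002", "555-0003"),
    ("555-0003", "555-0004"),
]

contacts = defaultdict(set)
for caller, callee in CALL_RECORDS:
    contacts[caller].add(callee)
    contacts[callee].add(caller)

def ask_a_question(seed: str, hops: int = 2) -> set:
    """Hayden's 'question': everyone within N hops of a seed number."""
    found, frontier = {seed}, {seed}
    for _ in range(hops):
        frontier = {n for f in frontier for n in contacts[f]} - found
        found |= frontier
    return found - {seed}

# A two-hop query from one number sweeps in the contacts of contacts --
# pattern extraction over the whole database, i.e., data mining.
print(ask_a_question("555-0001"))
```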

Moving on to PRISM, Hayden’s responses are equally astounding. He’s asked about the NSA’s admission that, in deciding whether or not to keep someone’s data, it uses a system to determine whether it is at least 51% sure that the person is foreign. As the interviewer notes, 51% “seems mushy.” Hayden’s response is ridiculous:

Yeah, well, actually, in some ways, you know, that’s actually the literal definition of probable, in probable cause.

Um, whether or not that’s the standard for probable cause is meaningless. Probable cause is the standard for determining whether someone can be arrested (or have a search done). It is not the standard for determining whether a person is foreign, and thus subject to mass surveillance by the NSA. The 4th Amendment requires probable cause for a search, but that means probable cause of criminal activity, not probable cause of foreignness. Is Hayden honestly suggesting that being foreign is probable cause of criminality? Because that’s insane.
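It's worth doing the arithmetic on how weak a 51% threshold is. A quick sketch, with an assumed (purely illustrative) query volume:

```python
# If "foreignness" determinations are made at exactly 51% confidence and
# that confidence is well calibrated, nearly half of them are wrong.
confidence = 0.51
selections_per_year = 100_000          # assumed volume, purely illustrative
wrongly_swept_in = (1 - confidence) * selections_per_year
print(f"{wrongly_swept_in:,.0f} people wrongly treated as foreign")  # 49,000
```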

Filed Under: data mining, michael hayden, nsa, nsa surveillance, probable cause, surveillance

Oh, And One More Thing: NSA Directly Accessing Information From Google, Facebook, Skype, Apple And More

from the not-a-good-week-for-the-nsa dept

Obviously, the Verizon/NSA situation was merely a small view into just how much spying the NSA is doing on everyone. And it seems to be spurring further leaks and disclosures. The latest, from the Washington Post, is that the NSA has direct data mining capabilities into the data held by nine of the biggest internet/tech companies:

The technology companies, which participate knowingly in PRISM operations, include most of the dominant global players of Silicon Valley. They are listed on a roster that bears their logos in order of entry into the program: “Microsoft, Yahoo, Google, Facebook, PalTalk, AOL, Skype, YouTube, Apple.” PalTalk, although much smaller, has hosted significant traffic during the Arab Spring and in the ongoing Syrian civil war.

Dropbox, the cloud storage and synchronization service, is described as “coming soon.”

This program, like the constant surveillance of phone records, began in 2007, though other programs predated it. The NSA claims it’s not collecting all the data, but it’s not clear that makes a real difference:

The PRISM program is not a dragnet, exactly. From inside a company’s data stream the NSA is capable of pulling out anything it likes, but under current rules the agency does not try to collect it all.

Analysts who use the system from a Web portal at Fort Meade key in “selectors,” or search terms, that are designed to produce at least 51 percent confidence in a target’s “foreignness.” That is not a very stringent test. Training materials obtained by the Post instruct new analysts to submit accidentally collected U.S. content for a quarterly report, “but it’s nothing to worry about.”

Even when the system works just as advertised, with no American singled out for targeting, the NSA routinely collects a great deal of American content.

I expect we’ll be seeing more such revelations before long.

Filed Under: data, data mining, nsa, privacy, spying
Companies: aol, apple, dropbox, facebook, google, microsoft, paltalk, skype, yahoo