data mining – Techdirt
Stories filed under: "data mining"
Latest EU Parliament Votes On Copyright: Fuck The Public, Give Big Corporations More Copyright
from the that's-not-good dept
The weird and persistently silly copyright reform process in the EU Parliament continues to get more and more bizarre and stupid. Last month, we told you about the first committee vote, which we feared would be terrible, but turned out to be only marginally stupid, as the worst parts of the proposal were rejected. Now, two more committees — the Culture and Education (CULT) and Industry, Research and Energy (ITRE) Committees — have voted on their own reform proposals and the results are really, really bad if you support things like culture, education, research and the public. And, yes, I get the irony of the fact that the Culture and Education Committee in the EU just declared a giant “fuck you” to culture and education with its vote.
Among the many problematic aspects approved by these committees is a filter requirement that would block users from uploading legally obtained media into the cloud. This makes no sense — especially given that the EU already has additional “you must be a pirate” taxes on situations where individuals are making copies of their legally acquired works.
And then there’s the whole “snippet tax,” which legacy newspapers are demanding because they’ve failed to adapt to the digital age and now want Google News to pay them for daring to send traffic their way for free. The whole concept is backwards… and here, it’s been expanded. As Copybuzz explains:
> * The press publishers’ right went from applying to “digital” uses of press to all uses, including print. Aside from the fact that this seems a violation of Article 10(1) of the Berne Convention (which establishes a mandatory exception for “press summaries”), the impact of such a massive extension is unfathomable.
> * The definition of press publications has become so broad that infringements to article 11 are impossible to predict and hence prevent. The “exceptions” to the applications of this new right just add to the potential legal uncertainty, as the CULT text states “The [publisher] rights granted under this Directive should be without prejudice to the authors’ rights and should not apply to the legitimate uses of press publications by individual users acting in a private and non-commercial capacity. The protection granted to press publications under this Directive should apply to content automatically generated by an act of hyperlinking related to a press publication without prejudice to the legitimate use of quotations.” This paragraph alone opens such a Pandora’s box of unanswered questions, such as:
>   * What is a legitimate use of press publications? And who’s the judge of the legitimacy?
>   * When are you acting in your private and non-commercial capacity?
>   * Content automatically generated by an act of hyperlinking related to a press publication: does that mean that when you share a link on social media and that triggers automatically the appearance of a snippet, you are now officially in trouble?
>   * When are you “legitimately” quoting? Is that a new criterion imposed on top of the only mandatory exception globally? And if so, who judges if you comply?
None of that sounds good or well thought out. It sounds like the kind of thing that someone not very knowledgeable about the subject would put together after just hearing one side from a bunch of whining newspaper execs.
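To make the hyperlink question concrete: the “content automatically generated by an act of hyperlinking” is, in practice, just the platform fetching metadata that the publisher itself embedded in its own page (OpenGraph tags and the like). Here is a minimal sketch of that mechanism, using only the Python standard library; the example URL and fallback strings are illustrative:

```python
# How a link-preview "snippet" is typically generated when a user shares a
# URL: the platform fetches the page and reads the metadata the publisher
# itself embedded (OpenGraph <meta> tags). Illustrative only; real platforms
# add caching, sanitization, and opt-out handling on top of this.
from html.parser import HTMLParser
from urllib.request import urlopen

class OpenGraphParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property") or ""
        if prop.startswith("og:") and attrs.get("content"):
            self.tags[prop] = attrs["content"]

def build_snippet(url):
    """Fetch a page and return the title/description its publisher supplied."""
    html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
    parser = OpenGraphParser()
    parser.feed(html)
    return {
        "title": parser.tags.get("og:title", "(no og:title)"),
        "description": parser.tags.get("og:description", "(no og:description)"),
    }

# Example: build_snippet("https://example.com/some-article")
```

The snippet, in other words, is assembled from metadata the publisher deliberately publishes to get better-looking links, which is what makes taxing its “automatic generation” look so backwards.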
And then there’s this nonsense, as summarized by Parliament Member Julia Reda:
> Incredibly, the ITRE committee — responsible for research and usually a staunch defender of open access — even voted to extend the extra copyright to academic publications, which would make open access publishing virtually impossible. It would stop people from linking to academic content, despite the content itself being free. This would apply to both online publications and print journals. The chilling effects on the spread of academic works and information would be substantial.
Yes, under this proposal, linking to academic content would require payment — even if it’s open access. That’s… nuts.
And, finally, there’s the “text and data mining” issue — one of the key points the EU has been fighting over in this copyright reform effort — where ITRE severely limited who may mine text and data, restricting the exception to research organisations and tiny startups. Again from Copybuzz:
> The ITRE Committee for example has in its extreme generosity decided to leave the benefit of the Text and Data Mining exception limited to research organisations and “start-up companies”, defined as “any company with fewer than 10 employees and an annual turnover or balance sheet below €2 million and which was established not earlier than three years before benefiting from the exception”. The message for European start-ups is clear: don’t dare scale up your first three years of business if you want to mine content and if you do, move away from the EU (and move anyway after 3 years)! Never mind jobs and growth, the EU mantra we keep on hearing. Oh, and please do not be innovative any longer once you are an established player: we would not want our economy to be competitive on the international scene.
This is really a killer for innovation. There’s a massive industry now being built up around machine learning, AI and autonomous machines — and an awful lot of it relies on the ability to do text and data mining on the internet. With this proposal, the (of all things) “Industry & Research” committee is basically saying there shall be no such industry or research in Europe. It’s pushing one of the most promising up-and-coming industries out of the EU entirely. Incredible.
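The quoted thresholds are concrete enough to write down as code, which makes the perversity easy to see. A toy sketch (the function name and structure are hypothetical, but the numbers come straight from the ITRE text):

```python
# The TDM-exception eligibility test as the quoted ITRE definition describes
# it: fewer than 10 employees, annual turnover or balance sheet below EUR 2
# million, established no more than three years ago. Hypothetical encoding,
# for illustration only.
def qualifies_for_tdm_exception(employees: int,
                                turnover_eur: float,
                                years_since_founding: float,
                                is_research_org: bool = False) -> bool:
    if is_research_org:
        return True
    return (
        employees < 10
        and turnover_eur < 2_000_000
        and years_since_founding <= 3
    )

# The perverse incentive in two lines: the same company, one year later,
# loses its right to mine text and data in the EU.
assert qualifies_for_tdm_exception(9, 1_500_000, 3) is True
assert qualifies_for_tdm_exception(9, 1_500_000, 4) is False
```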
It’s almost stunning how bad these decisions were. But, of course, some of the legacy copyright industry folks decided to celebrate, claiming that the votes showed that the EU Parliament “would not tolerate free-riding platforms.” That’s complete nonsense and an insult. Again: things like news aggregators and search engines have been enormously helpful in creating new markets and expanding attention and traffic to sites. If anything, legacy content producers have been “free riding” on those platforms.
Hopefully saner heads will prevail as this process moves forward, but the EU seems to be going down a dark and dangerous road on copyright policy.
Filed Under: copyright, cult, data mining, education, eu, eu parliament, filters, itre, open access, research, snippet tax
EU Copyright Proposal: Not Good, But Not As Blatantly Terrible As It Could Have Been
from the yes,-but...-how-about-doing-it-right? dept
We recently warned that there were efforts underway to make the EU’s copyright reform proposal even more draconian and ridiculous. Thankfully, the “compromise,” which wasn’t a compromise at all and would have made things much worse, was rejected by the Internal Market and Consumer Protection (IMCO) committee, but there was still plenty of bad stuff to be concerned about.
The mandatory filtering (i.e., mandatory censorship) regime for internet platforms was rejected. That’s a good thing. But, on the flip side, the so-called “link tax,” which would require those who link to and aggregate news to pay news publishers, has moved forward. Two other small bits of good news were included: a “freedom of panorama,” allowing people to photograph buildings and sculptures without violating someone’s copyright, and a “remix right” that would protect members of the public who do basic remixing of copyright-covered works. There are still concerns about the “text and data mining” rules, which limit who may mine content and what content may be mined.
So, basically, it’s a mixed bag. Some, of course, will argue that any “compromise” will involve some good and some bad, but that assumes we need a compromise here. Why not aim for a policy that’s actually better overall, rather than a “compromise” solution? Europe has the chance to lead the way, but appears to have little interest in doing so. Either way, there are more steps in this process and more committees still to vote, so the final policy remains a long way off. Hopefully, by the end, it moves closer to true copyright reform, rather than just “propping up old industries” reform.
Filed Under: copyright, data mining, eu, eu parliament, filters, google tax, link tax, remix right
Russia Provides Glimpse Of A Future Where Powerful Facial Recognition Technology Has Abolished Public Anonymity
from the are-we-really-ready-for-that? dept
As hardware and software advance, so facial recognition becomes more accurate and more attractive as a potential solution to various problems. Techdirt first wrote about this area back in 2012, when Facebook had just started experimenting with facial recognition (now we’re at the inevitable lawsuit stage). Since then, we’ve reported on an increasing number of organizations exploring the use of facial recognition, including the FBI, the NSA, Boston police and even the church. But all of those pale in comparison to what is happening in Russia, reported here by the Guardian:
> FindFace, launched two months ago and currently taking Russia by storm, allows users to photograph people in a crowd and work out their identities, with 70% reliability.
>
> It works by comparing photographs to profile pictures on Vkontakte, a social network popular in Russia and the former Soviet Union, with more than 200 million accounts. In future, the designers imagine a world where people walking past you on the street could find your social network profile by sneaking a photograph of you, and shops, advertisers and the police could pick your face out of crowds and track you down via social networks.
One of FindFace’s founders, Alexander Kabakov, points out the service could have a big impact on dating:
> “If you see someone you like, you can photograph them, find their identity, and then send them a friend request.” The interaction doesn’t always have to involve the rather creepy opening gambit of clandestine street photography, he added: “It also looks for similar people. So you could just upload a photo of a movie star you like, or your ex, and then find 10 girls who look similar to her and send them messages.”
Definitely not creepy at all.
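For those curious about the mechanics, services like FindFace are generally understood to work by converting each face photo into a fixed-length numeric embedding and then searching for the most similar profile pictures. A minimal sketch of that approach follows (this is the general technique, not FindFace’s actual code; the embedding model is stubbed out):

```python
# Face matching in two steps: (1) map each face image to a fixed-length
# embedding vector, (2) rank database faces by cosine similarity to the
# query. embed() is a stand-in for a trained neural network.
import numpy as np

def embed(image_pixels: np.ndarray) -> np.ndarray:
    """Stub: a real system maps a face crop to, say, a 128-d vector."""
    rng = np.random.default_rng(int(image_pixels.sum()) % (2**32))
    return rng.standard_normal(128)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_matches(query_image: np.ndarray, profiles: dict, top_k: int = 10):
    """Return top_k (profile_id, similarity) pairs; profiles maps
    profile_id -> precomputed embedding vector."""
    q = embed(query_image)
    scores = [(pid, cosine_similarity(q, emb)) for pid, emb in profiles.items()]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]
```

At scale, the linear scan gets replaced by an approximate nearest-neighbor index, which is how a handful of ordinary servers can handle millions of searches across a huge photo database.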
Of course, a 70% hit rate isn’t that good: perhaps FindFace isn’t really such a threat to public anonymity. The trouble is, the Guardian article reports that the company has performed three million searches on its database of around a billion photographs using just four common-or-garden servers. It’s easy to imagine what might be achieved with some serious hardware upgrades, along with tweaks to the software, or with access to even bigger, more complete databases: government ones, for example. According to the Guardian, FindFace’s founders think the big money will come from selling their system to “law enforcement and retail.” Although they’ve not yet been contacted by Russia’s FSB security agency, they say they’d be happy to listen to offers. Perhaps comforted by the thought of all that future business coming his way, Kabakov is philosophical about the social implications of his company’s technology:
> “In today’s world we are surrounded by gadgets. Our phones, televisions, fridges, everything around us is sending real-time information about us. Already we have full data on people’s movements, their interests and so on. A person should understand that in the modern world he is under the spotlight of technology. You just have to live with that.”
That may well be true. But the question is, are we ready to do so?
Follow me @glynmoody on Twitter or identi.ca, and +glynmoody on Google+
Filed Under: data mining, facial recognition, privacy, russia
Companies: findface
Is It Really That Big A Deal That Twitter Blocked US Intelligence Agencies From Mining Public Tweets?
from the it's-public-info dept
Over the weekend, some news broke about how Twitter was blocking Dataminr, a (you guessed it) social media data mining firm, from providing its analytics of real-time tweets to US intelligence agencies. Dataminr — which, as everyone is careful to note, has investments from both Twitter and the CIA’s venture arm, In-Q-Tel — has access to Twitter’s famed “firehose” API of basically every public tweet. The company already has relationships with financial firms, big companies and other parts of the US government, including the Department of Homeland Security, which has been known to snoop around on Twitter for quite some time.
Apparently, some (unnamed) intelligence agencies within the US government had signed up for a free pilot program, and it was as that program was ending that Twitter reminded Dataminr that the terms of its firehose access forbid its use for government surveillance. Twitter insists that this isn’t a change; it’s simply enforcing existing policies.
Many folks are cheering Twitter on in this move, and given the company’s past actions, the stance is perhaps not that surprising. The company was one of the very first to challenge government attempts to get access to Twitter account info (well before the whole Snowden stuff happened). Also, some of the Snowden documents revealed that Twitter was alone among internet companies in refusing to sign up for the NSA’s PRISM program, which made it easier for internet firms to supply the NSA with info in response to FISA Court orders. And, while most other big internet firms “settled” with the government over revealing government requests for information, Twitter has continued to fight on, pushing for the right to be much more specific about how often the government asks for what kinds of information. In other words, Twitter has a long and proud history of standing up to attempts to use its platform for surveillance purposes — and it deserves kudos for its principled stance on these issues.
That said… I’m not sure that blocking this particular use really makes any sense. This is public information, rather than private information. And, yes, not everyone has access to “the firehose,” so Twitter can put whatever restrictions it wants on firehose usage; but since the underlying tweets are public, others likely have workarounds (though perhaps not quite as timely). But separately, reviewing public information actually doesn’t seem like a bad idea for the intelligence community. Yes, we can all agree (and we’ve been among the most vocal in arguing this) that the intelligence agencies have a long and horrifying history of questionable data mining of databases they should not have access to. But publicly posted tweets seem like a weird thing for anyone to be concerned about. There’s no reasonable expectation of privacy in that information, and not because of some dumb “third party doctrine” concept, but because the individuals who tweet make a proactive decision to post that information publicly.
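For context, here is roughly what “mining” public tweets amounts to in practice: scanning a stream of already-public posts for signals. A minimal sketch (the watchword list and the stream source are stand-ins, not Dataminr’s actual system):

```python
# Keyword alerting over a stream of (user, text) items that are already
# public. Dataminr-style analytics adds far more sophistication, but the
# raw input is the same: posts their authors chose to publish.
from collections import Counter

WATCHWORDS = {"explosion", "earthquake", "outage"}  # illustrative only

def scan_stream(public_posts, window=1000, threshold=5):
    """Yield (keyword, count) alerts whenever a watchword spikes within
    a sliding window of public posts."""
    counts, seen = Counter(), 0
    for user, text in public_posts:
        seen += 1
        for word in WATCHWORDS:
            if word in text.lower():
                counts[word] += 1
                if counts[word] == threshold:
                    yield (word, counts[word])
        if seen % window == 0:
            counts.clear()  # start a fresh window
```

Nothing in that loop touches anything non-public; the dispute is entirely about the contractual terms Twitter attaches to its firehose.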
So, perhaps I’m missing something here (and I expect that some of you will explain what I’m missing in the comments), but I don’t see why it’s such a problem for intelligence agencies to do data mining on public tweets. We can argue that the intelligence community has abused its data mining capabilities in the past, and that’s true, but those concerns have generally involved private information. I’m not sure it’s helpful to argue that the intelligence community shouldn’t even be allowed to scan publicly available information as well. It feels like being “anti-intelligence” rather than “anti-abusive intelligence.”
Filed Under: data mining, intelligence, intelligence community, public info, surveillance, tweets
Companies: dataminr, twitter
DailyDirt: Recipes Analyzed By Algorithms
from the urls-we-dig-up dept
Algorithms are data mining every aspect of our lives and the world around us — to pull out interesting bits of information that we should act on. Companies like Google and Facebook come up with algorithms to figure out when to put ads in front of our eyes and how to display pertinent information (sometimes at the same time). Other algorithms are apparently watching what we eat, and trying to highlight what makes food taste good for us or how to formulate the “perfect Pepsis” or find unexpected recipes or flavor combinations. Here are just a few examples of software-based culinary art.
- Data mining thousands of Indian food recipes reveals that chefs of this cuisine pair flavors in a way that western cuisine generally does not. Western food combines ingredients that have overlapping flavors, but Indian cuisine pairs ingredients that seem to minimize common flavors. (A sketch of the pairing metric appears after this list.) [url]
- IBM’s Chef Watson (based on its Jeopardy-winning algorithms) has created some “cognitive cooking” by analyzing thousands of recipes to create new dishes of its own. When will a robot competitor appear in an Iron Chef episode? [url]
- Some foodie snobs are worried that artificial intelligence in the kitchen will lead to the destruction of cuisine as an art and part of culture. Other folks, though, are more optimistic that kitchen AI will free humans from the drudgery of cooking and open up a new world of culinary art. (Still others are concerned that software can’t be an inventor under US intellectual property law, and that novel recipes or inventions created by AI won’t be protected by patents or copyright.) [url]
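The pairing analysis in the first item boils down to a simple metric: for every pair of ingredients in a recipe, count the flavor compounds they share, then average. A toy sketch (the compound data below is a tiny made-up placeholder, not the real flavor database the researchers used):

```python
# Food-pairing score: average number of shared flavor compounds across all
# ingredient pairs in a recipe. High scores suggest "western-style" pairing
# (overlapping flavors); low scores match the Indian-cuisine finding.
from itertools import combinations

FLAVOR_COMPOUNDS = {  # hypothetical toy data
    "tomato":   {"c1", "c2", "c3"},
    "basil":    {"c2", "c3", "c4"},
    "cumin":    {"c5"},
    "tamarind": {"c6"},
}

def mean_shared_compounds(ingredients):
    pairs = list(combinations(ingredients, 2))
    if not pairs:
        return 0.0
    shared = sum(len(FLAVOR_COMPOUNDS[a] & FLAVOR_COMPOUNDS[b])
                 for a, b in pairs)
    return shared / len(pairs)

print(mean_shared_compounds(["tomato", "basil"]))    # 2.0 -- high overlap
print(mean_shared_compounds(["cumin", "tamarind"]))  # 0.0 -- low overlap
```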
If you’d like to read more awesome and interesting stuff, check out this unrelated (but not entirely random!) Techdirt post via StumbleUpon.
Filed Under: ai, algorithms, artificial intelligence, cognitive computing, cognitive cooking, cuisine, data mining, flavors, food, recipes, watson
Companies: ibm
DailyDirt: Data Is Everywhere, Let's Use It
from the urls-we-dig-up dept
If you’ve been reading Techdirt for a while, you probably know that we’re not big fans of the myth that “if you’re not paying for the product, you are the product.” Regardless of whether or not you pay for something, some companies will still treat their customers horribly. Likewise, some corporations try to treat customers (or users) with respect, without expecting payment for the favor. That said, when it comes to analyzing consumer behavior, it’s easy to make mistakes that get misinterpreted. An unintentional email to a targeted (or even untargeted) group of customers can enrage a whole community. Consumer data is available to a lot of companies, but it might be wise for them to tread lightly with their data scientists. Here are just a few cases that data miners might want to check out.
- Facebook participated in some social experiments, but creating an “emotional contagion” resulted in some unwanted public attention. The actual ability of a social network to measure or affect various emotions is far from proven, but the potential to cause widespread distress through a social network is probably something users should be concerned about. (A sketch of how such experiments typically assign users to test groups appears after this list.) [url]
- Shutterfly made a seemingly small mistake in mass-emailing a bunch of its customers a congratulatory message about an upcoming newborn. The photo printing service wasn’t even using data mining techniques (e.g., as Target did) to try to figure out who might be pregnant, but in this data-driven world, folks are trained to expect that companies may be trying to pry into their personal lives. [url]
- Social psychology has had some problems with scientific fraud, and thankfully, there are some investigators who are developing methods to find fake or massaged data. It’s hard enough to actually design psych experiments that have conclusive results, but sometimes the data can’t lie. [url]
- OKCupid admits to experimenting on its users, too. The difference with dating sites is that the people using them seem to be tacitly agreeing to be experimented upon. [url]
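A quick sketch of how sites typically assign users to experiment groups: hash a stable user id together with the experiment name, so the assignment is deterministic and roughly uniform without storing any state. (Details here are illustrative, not Facebook’s or OKCupid’s actual systems.)

```python
# Deterministic A/B bucketing: the same user always lands in the same arm
# of a given experiment, and arms stay balanced across the user base.
import hashlib

def bucket(user_id: str, experiment: str, arms=("control", "treatment")):
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return arms[int(digest, 16) % len(arms)]

print(bucket("user42", "emotional_contagion_2014"))  # same answer every run
```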
If you’d like to read more awesome and interesting stuff, check out this unrelated (but not entirely random!) Techdirt post via StumbleUpon.
Filed Under: advertising, consumer behavior, data, data mining, emotional contagion, marketing, psychology, reputation, social experiments
Companies: facebook, okcupid, shutterfly, target
DailyDirt: It's What You Say AND How You Say It
from the urls-we-dig-up dept
Studying how language can predict behavior is a fascinating field. As communications are increasingly digital, everyone’s messages are more easily data mined for all sorts of analysis (ahem, and not all of it is done by the CIA). Marketing folks are looking at how catchy phrases might increase sales — which may be why you’re seeing more headlines like “8 simple ways to …” and “one simple trick that …” in ads. Here are just a few other linguistic studies for you to peruse. Also, happy belated National Grammar Day!
- Men who use the pronoun “whom” in an online dating profile receive 31% more responses from women. And you probably don’t even need to use it correctly… well, unless another study concludes that women are significantly better at grammar usage than men. [url]
- Can the language you speak influence your behavior? Speaking a language with weak future-tense grammar (e.g., German, Finnish or Estonian) seems to correlate with more future-oriented behaviors, such as an increased rate of financial saving, lower rates of smoking, higher rates of exercise, and higher condom usage, compared with speaking a language with stronger future-tense grammar, like English or French. (A sketch of the underlying correlation appears after this list.) [url]
- Four minutes of conversation is about all it takes for a speed dating participant to figure out if there’s any real chemistry between a potential couple. Protip: language analysis suggests you might want to sound sympathetic and not ask too many questions. [url]
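The future-tense study in the second item is, at bottom, a correlation between a language feature and a behavioral rate. A toy sketch of that computation follows; all the numbers are made-up placeholders, not the study’s data, and the real work controls for income, religion, country effects and more:

```python
# Pearson correlation between a binary "strong future tense" flag and a
# (hypothetical) household savings rate, per language.
import statistics

# (language, strong_future_tense?, savings_rate) -- placeholder values
rows = [
    ("German",   0, 0.11), ("Finnish", 0, 0.10), ("Estonian", 0, 0.09),
    ("English",  1, 0.05), ("French",  1, 0.06),
]

def pearson(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

flags = [r[1] for r in rows]
rates = [r[2] for r in rows]
print(pearson(flags, rates))  # strongly negative with these toy numbers
```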
If you’d like to read more awesome and interesting stuff, check out this unrelated (but not entirely random!) Techdirt post via StumbleUpon.
Filed Under: analysis, behavior, big data, communication, data mining, dating, grammar, language, linguistics, predictions, whom
DailyDirt: Making Money The Old-Fashioned Way… By Algorithms
from the urls-we-dig-up dept
People are changing the way they make decisions now that technology can help them crunch more numbers than ever before. Instead of just going with a gut instinct, decisions can be based on all kinds of random data analysis (for better or worse). Big data is a popular trend, and more and more successful examples of data mining for profit seem to get publicized every day. But are we only looking at the winning combinations and ignoring the losers? Here are just a few examples of algorithms that might be making some money.
- If you don’t think your cellphone metadata matters to anyone, consider this: some VCs apparently fund entrepreneurs based partly on an unconventional algorithm that weighs things such as the age of the founder’s cellphone number and the average time of his/her first call in the morning. [url]
- Can an algorithm pick stocks better than human financial analysts? Sure, but a monkey throwing darts can, too, sometimes. The wisdom of a crowd of analysts might not be a bad algorithm to use, but it still relies on a crowd of humans. [url]
- Poker players can bluff to win, but now that more players are practicing against algorithms and using simulations, it might be harder for those bluffs to work. It’s not so reliable to try to guess when a player is bluffing, but a simulation of thousands of poker hands can give you some statistical confidence (a toy version of such a simulation follows this list)…. [url]
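To see how simulation substitutes for bluff-reading intuition, here is a toy Monte Carlo estimator for the simplest possible “poker” (one card each, high card wins). Real solvers do the same thing over full hold’em hands; this just shows how repetition turns a guess into a confidence level:

```python
# Monte Carlo hand equity in a one-card, high-card-wins toy game: deal the
# opponent a random remaining card many times and count wins. Ties (same
# rank, different suit) count as non-wins here.
import random

RANKS = list(range(2, 15))  # 2..10, J=11, Q=12, K=13, A=14
DECK = [(rank, suit) for rank in RANKS for suit in "shdc"]

def win_probability(my_card, trials=100_000):
    """Estimate P(win) for my_card against one random opponent card."""
    remaining = [card for card in DECK if card != my_card]
    wins = sum(my_card[0] > random.choice(remaining)[0] for _ in range(trials))
    return wins / trials

print(win_probability((14, "s")))  # an ace: ~0.94 (48 of 51 cards lose to it)
```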
If you’d like to read more awesome and interesting stuff, check out this unrelated (but not entirely random!) Techdirt post via StumbleUpon.
Filed Under: algorithm, artificial intelligence, big data, data mining, gigo, poker, simulation, venture capital
Former NSA Boss: We Don't Data Mine Our Giant Data Collection, We Just Ask It Questions
from the um,-that's-the-same-thing dept
General Michael Hayden, the former head of both the NSA and the CIA, has already been out making silly statements about how the real “harm” in the latest leaks is that they show the US “can’t keep a secret.” However, he’s now given an even more ridiculous interview trying to defend both the mass dragnet collection of all phone records and the PRISM collection of internet data. In both cases, some of his claims are quite incredible. Let’s start with this whopper, in which he claims that they don’t do any data mining on the mass dragnet data they collect; they just “ask it questions.”
> HAYDEN: It is a successor to the activities we began after 9/11 on President Bush’s authority, later became known as the Terrorist Surveillance Program.
>
> So, NSA gets these records and puts them away, puts them in files. They are not touched. So, fears or accusations that the NSA then data mines or trolls through these records, they’re just simply not true.
>
> MARTIN: Why would you be collecting this information if you didn’t want to use it?
>
> HAYDEN: Well, that’s – no, we’re going to use it. But we’re not going to use it in the way that some people fear. You put these records, you store them, you have them. It’s kind of like, I’ve got the haystack now. And now let’s try to find the needle. And you find the needle by asking that data a question. I’m sorry to put it that way, but that’s fundamentally what happens. All right. You don’t troll through the data looking for patterns or anything like that. The data is set aside. And now I go into that data with a question that – a question that is based on articulable, arguable predicate to a terrorist nexus. Sorry, long sentence.
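To make the semantics concrete: “asking that data a question” means running a query over everyone’s stored call records. A toy sketch of what such a question looks like (the schema and rows are made-up placeholders):

```python
# A two-hop "contact chaining" question asked of a bulk call-records
# database. The records are fabricated placeholders; the point is that the
# "question" is a query run across the entire stored haystack.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE calls (caller TEXT, callee TEXT, ts TEXT)")
db.executemany(
    "INSERT INTO calls VALUES (?, ?, ?)",
    [  # toy rows, for illustration only
        ("+15550001", "+15550002", "2013-06-01"),
        ("+15550002", "+15550003", "2013-06-02"),
        ("+15550003", "+15550004", "2013-06-03"),
    ],
)

SEED = "+15550001"  # the "articulable" starting number

one_hop = {row[0] for row in db.execute(
    "SELECT callee FROM calls WHERE caller = ?", (SEED,))}
placeholders = ",".join("?" * len(one_hop))
two_hop = {row[0] for row in db.execute(
    f"SELECT callee FROM calls WHERE caller IN ({placeholders})",
    tuple(one_hop))}

print(one_hop, two_hop)  # everyone swept in by a single "question"
```

Whatever you call that, it is a query over a bulk database of records about people suspected of nothing, which is data mining by any ordinary definition.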
I’m not sure if Hayden is just playing dumb or what, but asking it questions is data mining. What he describes as “asking it questions” is exactly the kind of data mining people are afraid of. On top of that, the fact that he flat out admits that they’re putting together the haystack to “try to find the needle” is exactly the kind of issue people are so concerned about. The whole point of the 4th Amendment is that you’re not allowed to collect the haystack in the first place. You’re only supposed to be able, in narrow circumstances and with proper oversight, to go looking for the needle. Yet, here, he admits that there’s no such oversight once they have that haystack:
> MARTIN: May I back up? Do you have to have approval…
>
> HAYDEN: No.
>
> MARTIN: …from the FISA court…
>
> HAYDEN: No.
>
> MARTIN: …which is the intelligence surveillance court established in order to go in and ask that question.
>
> HAYDEN: You have had a generalized approval, and so you’ve got to justify the overall approach to the judge. But you do not have to go to the judge, saying, hey, I got this number now. I’ll go ahead and get a FISA request written up for you. No, you don’t have to do that.
That should be a “wow” moment right there, because it also appears to contradict President Obama’s claim that “if anybody in government wanted to go further than just that top-line data … they’d have to go back to a federal judge and — and — and indicate why, in fact, they were doing further — further probing.” Furthermore, he’s basically admitting that they give the FISA Court some vague reason why they need every possible record on phone calls, and then there’s no oversight by the court on how those records are used — other than vague promises from the NSA that they’re not being abused for data mining, but merely used for “asking questions,” which is data mining.
Moving on to PRISM, Hayden’s responses are equally astounding. He’s asked about the NSA’s admission that, in deciding whether or not to keep a person’s data, it uses a system to determine whether it is at least 51% sure that the person is foreign. As the interviewer notes, 51% “seems mushy.” Hayden’s response is ridiculous:
> Yeah, well, actually, in some ways, you know, that’s actually the literal definition of probable, in probable cause.
Um, whether or not that’s the literal definition of “probable” is beside the point. Probable cause is the standard used to determine whether someone can be arrested (or searched). It is not the standard for determining whether a person is foreign and can therefore be subjected to mass NSA surveillance. The 4th Amendment requires probable cause for a search, but that means probable cause of criminal activity, not probable cause of foreignness. Is Hayden honestly suggesting that being foreign is probable cause of criminality? Because that’s insane.
Filed Under: data mining, michael hayden, nsa, nsa surveillance, probable cause, surveillance
Oh, And One More Thing: NSA Directly Accessing Information From Google, Facebook, Skype, Apple And More
from the not-a-good-week-for-the-nsa dept
Obviously, the Verizon/NSA situation was merely a small view into just how much spying the NSA is doing on everyone. And it seems to be spurring further leaks and disclosures. The latest, from the Washington Post, is that the NSA can directly mine the data held by nine of the biggest internet/tech companies:
> The technology companies, which participate knowingly in PRISM operations, include most of the dominant global players of Silicon Valley. They are listed on a roster that bears their logos in order of entry into the program: “Microsoft, Yahoo, Google, Facebook, PalTalk, AOL, Skype, YouTube, Apple.” PalTalk, although much smaller, has hosted significant traffic during the Arab Spring and in the ongoing Syrian civil war.
>
> Dropbox, the cloud storage and synchronization service, is described as “coming soon.”
This program, like the constant surveillance of phone records, began in 2007, though other programs predated it. The NSA claims it’s not collecting all data, but it’s not clear that makes a real difference:
> The PRISM program is not a dragnet, exactly. From inside a company’s data stream the NSA is capable of pulling out anything it likes, but under current rules the agency does not try to collect it all.
>
> Analysts who use the system from a Web portal at Fort Meade key in “selectors,” or search terms, that are designed to produce at least 51 percent confidence in a target’s “foreignness.” That is not a very stringent test. Training materials obtained by the Post instruct new analysts to submit accidentally collected U.S. content for a quarterly report, “but it’s nothing to worry about.”
>
> Even when the system works just as advertised, with no American singled out for targeting, the NSA routinely collects a great deal of American content.
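For a sense of how low a bar “at least 51 percent confidence” is, here is what such a foreignness test plausibly reduces to: a handful of weak signals summed into a score and compared against 0.51. The signals and weights below are hypothetical, not the NSA’s actual selector logic:

```python
# A hypothetical "foreignness confidence" score: weak signals, summed and
# thresholded at 0.51, i.e., barely better than a coin flip.
def foreignness_confidence(signals: dict) -> float:
    weights = {  # illustrative weights only
        "non_us_ip": 0.4,
        "non_us_language": 0.2,
        "foreign_contacts": 0.3,
        "non_us_timezone": 0.1,
    }
    score = sum(weights[name] for name, present in signals.items() if present)
    return min(score, 1.0)

target = {"non_us_ip": True, "non_us_language": False,
          "foreign_contacts": True, "non_us_timezone": False}

if foreignness_confidence(target) >= 0.51:
    print("collect")  # scores 0.7 here; a coin flip would score 0.50
```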
I expect we’ll be seeing more such revelations before long.
Filed Under: data, data mining, nsa, privacy, spying
Companies: aol, apple, dropbox, facebook, google, microsoft, paltalk, skype, yahoo