Stats and Bayesian econometrics resources
The Foundations of Statistics: A Simulation-based Approach. Nice intro to R too.
A course in Bayesian Econometrics, by Gary Koop.
Gelman galore
Uh.. Um..
I [Mark Liberman] took a quick look at demographic variation in the frequency of the filled pauses conventionally written as “uh” and “um”.
Marginal vs marginal
To me, the most interesting bit of terminological confusion is that the word “marginal” has opposite meanings in statistics and economics. In statistics, the margin (as in “marginal distribution”) is the average or, in mathematical terms, the integral. In economics, the margin (as in “marginal cost”) is the change or, in mathematical terms, the derivative.
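A toy illustration of the two meanings (the numbers are made up): summing a joint distribution over the other variable gives the statistician's margin, while differentiating a cost function gives the economist's.

```python
import numpy as np

# Statistics: the margin of a joint distribution is a sum (integral)
# over the other variable.
joint = np.array([[0.10, 0.20],   # P(X=0, Y=0), P(X=0, Y=1)
                  [0.30, 0.40]])  # P(X=1, Y=0), P(X=1, Y=1)
marginal_x = joint.sum(axis=1)    # P(X=0)=0.30, P(X=1)=0.70

# Economics: the marginal cost is the derivative of total cost with
# respect to quantity. For C(q) = 100 + 5q + 0.5q^2, C'(q) = 5 + q.
def total_cost(q):
    return 100 + 5 * q + 0.5 * q ** 2

def marginal_cost(q, eps=1e-6):   # numerical derivative
    return (total_cost(q + eps) - total_cost(q - eps)) / (2 * eps)

print(marginal_x)          # an integral (sum)
print(marginal_cost(10))   # a derivative: 5 + 10 = 15
```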
If you are jumpy, you are less likely to be liberal
In a group of 46 adult participants with strong political beliefs, individuals with measurably lower physical sensitivities to sudden noises and threatening visual images were more likely to support foreign aid, liberal immigration policies, pacifism, and gun control, whereas individuals displaying measurably higher physiological reactions to those same stimuli were more likely to favor defense spending, capital punishment, patriotism, and the Iraq War.
Friday the 13th, unlucky?
A study published on Thursday by the Dutch Centre for Insurance Statistics (CVS) showed that fewer accidents and reports of fire and theft occur when the 13th of the month falls on a Friday than on other Fridays. . . . In the last two years, Dutch insurers received reports of an average 7,800 traffic accidents each Friday, the CVS study said. But the average figure when the 13th fell on a Friday was just 7,500.
Datacharmer recently made a good comment on this:
Apart from avoiding risky behaviour on Friday the 13th because it is deemed unlucky (which might well be happening), you should also consider that Friday the 13th – unlike other Fridays – CAN’T be Christmas or New Year’s (where people get drunk and drive), and it will also be associated with a lower (or higher) probability of falling before a bank holiday weekend (or I guess in the States Independence day, etc).
I guess all I'm saying is that it could well be other factors driving this result, rather than a change in people's behaviour because Friday the 13th is 'unlucky'.
How about accidents on Friday the 12th or Friday the 14th? The article only compares Friday the 13th with an average Friday – in fact, it doesn't even reveal whether the 13th is the least accident-prone Friday in the book…
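A back-of-envelope check on the reported gap. The two averages (7,800 and 7,500) come from the CVS report; the number of Friday-the-13ths in the two-year window (around 7) is my own guess.

```python
import math

n13, mean13 = 7, 7500     # my assumed count of Friday-the-13ths
mean_other = 7800         # average for an ordinary Friday (from CVS)

# If each Friday's accident count were an independent Poisson draw,
# the standard error of a mean over n13 Fridays would be sqrt(mean/n):
se = math.sqrt(mean13 / n13)        # about 33
z = (mean_other - mean13) / se      # about 9

# A gap of ~9 standard errors is far too big for pure Poisson noise,
# so *something* systematic is going on -- but that something could
# just as well be the calendar confounds above (no Christmas or New
# Year's, fewer pre-holiday Fridays) as a change in driver behaviour.
print(z)
```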
Stata is blogging
Stata now has an official blog, Not Elsewhere Classified. And here's a list of a few unofficial ones.
Dodging the Vietnam draft
What percentage of draft-eligible men did not, or would not, join the US military despite being drafted? How many men's enlistment in the military ultimately depended on the outcome of the lottery? I will let you ponder these questions for a moment, and put the answer under the fold...
Are double-blind trials underestimating drug effectiveness?
In a double-blind trial, patients exhibit the placebo (and nocebo) effects because of the expectation they might be on a real drug.
If expectations are so important, could it be that patients being administered the real drug don't react to it fully due to the expectation they might be on the placebo?
Addendum: Steve Waldman (of Interfluidity and Naked Capitalism) posts in the comments:
@llimllib on twitter - Bill Mill - posted a cite (in response to a tweet on this post) that seems like a nice confirmation of your conjecture. Almost perfect. Wow.
Google's broken hiring process
From Google's director of research:
One of the interesting things we've found, when trying to predict how well somebody we've hired is going to perform when we evaluate them a year or two later, is one of the best indicators of success within the company was getting the worst possible score on one of your interviews. We rank people from one to four, and if you got a one on one of your interviews, that was a really good indicator of success.
Ryan Tate uses this as evidence that Google's interview process is broken.
Craig Newmark sets the record straight.
In related news, I was surprised by suggestions that Larry Page still reviews CVs personally, and half-surprised to find out that lots of Google employees struggle with bureaucracy within the company.
Funny graphs
Dave sent those via email - I've seen some of them before, but they are still hilarious (note: there are more graphs under the fold).
Safer data mining
Machine learning department members from Carnegie Mellon protest at the recent G20. Hat tip (and more pics): the Social Science Statistics blog.
Probability brain teaser of the day
Over at Andrew Gelman's, a reader emails:
I read the abstract for your paper What is the probability your vote will make a difference? [...] I'd note that the abstract prima facie contains an error. Your sentence in the abstract, "On average, a voter in America had a 1 in 60 million chance of being decisive in the presidential election." can not be correct. If we assume that this sentence is correct that means that given the actual turnout of 132,618,580 people the sum total probability of voters being decisive is larger than one. This of course [sic] is impossible. The total amount of decisiveness must be at most one.
What is going on here? Can the probabilities sum to more than one? Take a minute to think it through, and then read the rest of the entry to find out the answer...
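One way to see the resolution in a toy simulation of my own (a 101-voter electorate of 50-50 coin-flip voters, not the model in the paper):

```python
import random

random.seed(1)
n, trials = 101, 20_000
total_decisive = 0
for _ in range(trials):
    votes_a = sum(random.random() < 0.5 for _ in range(n))
    # With n odd there are no ties, and a single voter is decisive
    # exactly when the margin is one vote -- in which case EVERY voter
    # on the winning side could flip the result by switching sides.
    if abs(2 * votes_a - n) == 1:
        total_decisive += (n + 1) // 2

# The average number of decisive voters per election equals the sum of
# the individual decisiveness probabilities, and nothing caps that sum
# at 1: "voter i is decisive" and "voter j is decisive" are not
# mutually exclusive events.
avg_decisive = total_decisive / trials
print(avg_decisive)   # well above 1
```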
The normal distribution
This beautiful image comes courtesy of W. J. Youden.
I've been reading Edward Tufte's superb The Visual Display of Quantitative Information - perhaps the most perfect book I've ever come across. Almost every page is a revelation, so expect me to be posting more of the wonderful graphs Tufte has collected in the future.
The importance of being clear
A formatting fubar involving an Excel spreadsheet has left Barclays Capital with contracts involving collapsed investment bank Lehman Brothers that it never meant to acquire.
Working to a tight deadline, a junior law associate at Cleary Gottlieb Steen & Hamilton LLP converted an Excel file into a PDF format document. The doc was to be posted on a bankruptcy court's website before a midnight purchase offer deadline on 18 September, just four hours after Barclays sent the spreadsheet to the lawyers. The Excel file contained 1,000 rows of data and 24,000 cells.
Some of these details on various trading contracts were marked as hidden because they were not intended to form part of Barclays' proposed deal. However, this "hidden" distinction was ignored during the reformatting process so that Barclays ended up offering to take on an additional 179 contracts as part of its bankruptcy buyout deal, Finextra reports.
The Register has the full story. As Merv has always warned, 'horrible things happen when you hide cells in Excel'.
I see this as a manifestation of a wider lack of education on the importance of communicating information efficiently. The Spartans, Tufte, Strunk and White, the Economist, Picasso and numerous econometricians have done a lot to improve things, but management-speak, TV advertising and other such phenomena show we still have a long way to go.
Stata lessons & other resources
A friend asked for a quick list, so here goes:
UCLA's excellent resources to help you learn and use Stata
Another great collection of Stata Resources by Park Hun Myoung, as well as a Stata command cheat-card
London School of Economics Stata Resources
Syracuse University's Stata tutorial
Program in Statistics and Methodology by the political scientists at Ohio State University
Duke Stata tutorials
A lesson in statistics
[...] when he talks with people about statistical procedures, engineers focus on the algorithm being applied to the data, whereas statisticians are always thinking about the psychology of the person doing the analysis.
This is a big topic for another day - the day I start posting on my main area of expertise, econometrics - but these are words worth pondering.
Fun with statistics, "which Palin is the mother?" edition
Well, Sarah, I'm calling you a liar. And not even a good one. Trig Paxson Van Palin is not your son. He is your grandson.
The Daily Kos has an 'interesting' story claiming that Sarah Palin's youngest son, who has been diagnosed with Down's syndrome, is really her daughter's son. There is much evidence presented, mainly consisting of pictures where certain bellies appear to be too large while others too small. But statistics deliver the killer argument:
The final point of interest is that Trig Palin has been diagnosed with Down's syndrome (aka trisomy 21). This is an interesting point, as chances of having offspring with Down's Syndrome increases from under 1% to 3% after a mother reaches the age of 40. However, 80% of the cases of Down's Syndrome are in mother's (sic) under the age of 35, through sheer quantities of births in this age group.
And of course, 99% of deaths are not due to suicide, so killing yourself is safe!
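The quoted statistic is just base rates at work: a much lower per-birth risk among younger mothers is swamped by their far larger share of births. Illustrative numbers of my own (not the true rates), purely to show the arithmetic:

```python
# Low per-birth risk, but many births:
births_under_35, risk_under_35 = 3_500_000, 0.001
# Ten times the risk, but far fewer births:
births_over_40,  risk_over_40  =   300_000, 0.010

cases_young = births_under_35 * risk_under_35   # 3,500 cases
cases_old   = births_over_40  * risk_over_40    # 3,000 cases
share_young = cases_young / (cases_young + cases_old)
print(share_young)  # the majority of cases come from the LOW-risk group
```

So the "80% of cases are in mothers under 35" figure says nothing about any individual mother's risk, which is exactly why it is no evidence about who Trig's mother is.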
Thanks to Scatterplot for the pointer.
I hate seasonally adjusted series
To all data providers, wherever they may be:
Please, please stop seasonally adjusting the series you give me. I can run a regression with seasonal dummies myself, thank you very much. So can everyone else. The difficult thing is to get from the seasonally adjusted series to the original one, and I don't see why you should deprive me of the privilege.
Yes, I know I shouldn't want to in most cases, but let me be the judge of that, OK? What's your problem anyway, what is it to you? Please, just do me this favour.
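To make the rant concrete, here is a minimal sketch (with made-up quarterly data) of the easy direction, seasonal adjustment by dummy regression:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy quarterly series: a level, a seasonal pattern, and noise.
n_years = 10
quarter = np.tile([0, 1, 2, 3], n_years)
seasonal = np.array([5.0, -2.0, 1.0, -4.0])[quarter]
y = 100 + seasonal + rng.normal(0, 1, 4 * n_years)

# Regress on a constant and three quarter dummies (Q1 is the base).
X = np.column_stack([np.ones_like(y)] +
                    [(quarter == q).astype(float) for q in (1, 2, 3)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Subtracting the fitted dummy effects gives the seasonally adjusted
# series. Going the other way -- recovering the raw series from an
# adjusted one -- is the hard part, which is the whole complaint.
y_adjusted = y - (X[:, 1:] @ beta[1:])
```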
What are the odds?
You are on holiday in some strange land, and you bump into Sue, an old friend. What are the odds?
Well, the probability is 1. 100%. It's certain. It bloody happened.
OK, you say, fair point - but that's not what you meant. What you meant is: what is the probability you would bump into Sue, assuming you hadn't just bumped into her? I reply that that's a silly assumption to make as you just did bump into her, but you insist.
Well then, the probability is whatever you want it to be - pick a number, and I'll explain to you why it's plausible. You look perplexed and ask what I mean.
I explain: first of all, you need to specify the probability of what you are interested in.
-Do you want to know the probability you'd bump into Sue at the time and place you did
-Do you want the probability you would bump into an old friend while on holiday in general
-Or do you want the probability that something 'remarkable' enough would happen to you at some point in your life that would make you start asking silly questions about 'what is the probability of that happening'?
You say the first, obviously, and that I should stop being clever. I wasn't done, I say, and proceed to ask what is the information set I should base my probability estimate on - quantum mechanics aside, randomness is in the eye of the beholder after all, and if I were all-knowing God the probability of whatever it is that happened would be 1 even before it happened.
You throw your pina colada on my head and vow never to speak to me again.
A few days later, you are kind of missing me but don't feel like talking to me yet, so you visit bluematter. as a first step in rebuilding the relationship. And the first thing you see is this delightful little story, via Andrew Gelman:
In the city of Syracuse, the strangest thing happened in Tuesday's Democratic presidential primary.
Sen. Hillary Clinton and Sen. Barack Obama received the exact same number of votes, according to unofficial Board of Election results.
Clinton: 6,001.
Obama: 6,001.
The odds of Clinton and Obama tying were less than one in 1 million, said Syracuse University mathematics Professor Hyune-Ju Kim.
Elaborating on Thursday, she [Professor Hyune-Ju Kim] noted: "The "almost impossible" odd is obtained when we assume the Syracuse voter distribution follows the New York state distribution. Since it is almost impossible to observe what we have observed, statistically we can conclude that Syracuse voter distribution is significantly different from the New York state distribution."
There would be less than one in 1 million chance of a tie occurring between Clinton and Obama in voting by a randomly selected group of 12,346 New York Democratic voters, she said.
To which Andrew replies:
Not to pick on some harried mathematics professor who'd probably rather be out proving theorems, but . . . of course Syracuse voters are not a randomly selected group of New Yorkers. You don't need a statistical test to see that. Regarding the probability of an exact tie: I don't think that's so low: a quick calculation might say that either Clinton or Obama could've received between, say, 5000 and 7000 votes, giving something like a 1/2000 chance of an exact tie. That's gotta be the right order of magnitude.
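Andrew's back-of-envelope can be turned into a small simulation (my own sketch; the spread I put on the underlying vote share is an assumption, and I hold the 12,002-vote total fixed):

```python
import math
import random

random.seed(42)
n, trials, ties = 12_002, 200_000, 0
for _ in range(trials):
    # Uncertain Clinton share of the two-candidate vote:
    p = min(max(random.gauss(0.5, 0.05), 0.0), 1.0)
    # Normal approximation to Binomial(n, p), rounded to a vote count:
    mu, sd = n * p, math.sqrt(n * p * (1 - p))
    clinton = round(random.gauss(mu, sd))
    if 2 * clinton == n:   # an exact 6,001 - 6,001 tie
        ties += 1

p_tie = ties / trials
# Orders of magnitude more likely than one in a million -- roughly the
# "one over the plausible range of margins" that Andrew describes.
print(p_tie)
```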
If there was one thing you were ever certain about, it is that you don't want to read what I have to say on this. A baseball bat happens to lie next to you (what are the odds!?). You grab it with both your shaky hands and smash the computer monitor to pieces.
Roses are red, violets are blue, and correlation is not causation
Merv emails me this article:
Football clubs with red team strips are more successful than those with other colours, according to a study released Wednesday.
The fact that English clubs Manchester United, Liverpool and Arsenal regularly top league tables is not a coincidence, say the experts from Durham University and the University of Plymouth.
Red shirts give the team an advantage due to a deep-rooted biological response to the colour. "In nature, red is often associated with male aggression and display," they said, giving the example of the red-breasted robin.
"It is a testosterone-driven signal of male quality, and its striking effect has even been harnessed by soldiers in the past," added the researchers, after analyzing data on English football league results since World War II.
The red-breasted robin? That's the most fearsome red beast they can think of?
I can't access the paper, but just reading the abstract is enough to convince me it's bonkers. The authors also have a 2005 paper - in Nature no less - entitled 'Red enhances human performance in contests'. If any reader has access to Nature, would you be kind enough to email me the article so I can -ahem- review it?
Planning an informed jump off a bridge
Zubin Jelveh has the inside scoop on where to go to avoid the crowds:
Econometric causality
James Heckman has an excellent paper on the subject (free access).
Stop abusing statistical significance
I just made my first edit on Wikipedia, on the article on 'statistical power'. Here's the old text, with the deleted parts in bold:
There are times when the recommendations of power analysis regarding sample size will be inadequate. Power analysis is appropriate when the concern is with the correct acceptance or rejection of a null hypothesis. In many contexts, the issue is less about determining if there is or is not a difference but rather with getting a more refined estimate of the population effect size. For example, if we were expecting a population correlation between intelligence and job performance of around .50, a sample size of 20 will give us approximately 80% power (alpha = .05, two-tail). However, in doing this study we are probably more interested in knowing whether the correlation is .30 or .60 or .50. In this context we would need a much larger sample size in order to reduce the confidence interval of our estimate to a range that is acceptable for our purposes. These and other considerations often result in the true but somewhat simplistic recommendation that when it comes to sample size, "More is better!"
However, huge sample sizes can lead to statistical tests becoming so powerful that the null hypothesis is always rejected for real data. This is a problem in studies of differential item functioning.
Leaving the cost of collecting data aside, larger (appropriately collected) samples are ALWAYS BETTER. At the end of the day, if your sample is *too* large (for example if your statistical software restricts the amount of information you can load on it and you don't need the extra information anyways) you can always obtain a smaller random sample from your larger random sample. So, the 'more is better' recommendation is simple, but not simplistic.
The last paragraph reveals a fundamental misconception about statistical significance that refuses to go away. If the effect of an independent variable on the dependent variable is zero, using a very large sample will result in an estimated effect that is 0 to many decimal places; as the sample size increases further, the estimate will get even closer to exactly zero. NEVER USE STATISTICAL SIGNIFICANCE AS A PROXY FOR PRACTICAL SIGNIFICANCE. I have no clue whether large sample sizes have been seen as a problem in the past in studies of differential item functioning, but if that is the case then the researchers are idiots.
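The distinction is easy to demonstrate with a toy sketch (the sample size and effect size are my own made-up numbers):

```python
import math
import random

random.seed(0)
# A practically negligible true effect, but a huge sample:
n, true_effect = 100_000, 0.02
x = [random.gauss(true_effect, 1.0) for _ in range(n)]

mean = sum(x) / n
sd = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))
se = sd / math.sqrt(n)
z = mean / se   # far beyond any conventional significance cutoff

# The test screams "significant", yet the estimated effect itself is
# tiny -- and the estimate, not the star, is the number that matters.
print(mean, z)
```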
Here is another post on problematic applications of statistical significance.