Lies, Damned Lies, and Statistics (45): Anonymity in Surveys Changes Survey Results

Whether interviewees are given anonymity or not makes a big difference in survey results:

When people are assured of anonymity, it turns out, a lot more of them will acknowledge that they have had same-sex experiences and that they don’t entirely identify as heterosexual. But it also turns out that when people are assured of anonymity, they will show significantly higher rates of anti-gay sentiment. These results suggest that recent surveys have been understating, at least to some degree, two different things: the current level of same-sex activity and the current level of opposition to gay rights. (source)

Anonymity can result in data that are closer to the truth, so it’s tempting to require it, especially in the case of surveys that may suffer from social desirability bias (surveys asking about opinions that people are reluctant to divulge because these opinions are socially unacceptable – the Bradley effect is one example). However, anonymity can also create problems. For example, it may make it difficult to avoid questioning the same people more than once.

Go here for other posts in this series.

Measuring Human Rights (24): Measuring Racism, Ctd.

Measuring racism is a problem, as I’ve argued before. Asking people whether they’re racist won’t work because they don’t answer the question truthfully, and understandably so: this is the social desirability bias at work. Surveys may minimize this bias if they approach the subject indirectly. For example, rather than simply asking people if they are racist or if they believe blacks are inferior, surveys could ask some of the following questions:

  • Do you believe God has created the races separately?
  • What do you believe are the reasons for higher incarceration rates/lower IQ scores/… among blacks?
  • Etc.

Still, there’s no guarantee that bias won’t distort the results. Maybe it’s better to dump the survey method altogether and go for something even more indirect. For example, you can measure

  • racism in employment decisions, such as the number of callbacks received by job applicants with black-sounding names
  • racism in criminal justice, for example the degree to which rulings by black federal lower-court judges are overturned more often than rulings authored by similar white judges, or differences in crime rates by race of the perpetrator, or jury behavior
  • racial profiling
  • residential racial segregation
  • racist consumer behavior, e.g. reluctance to buy something from a black seller
  • the numbers of interracial marriages
  • the numbers and membership of hate groups
  • the number of hate crimes
  • etc.

A disadvantage of many of these indirect measurements is that they don’t necessarily reflect the beliefs of the whole population. You can’t just extrapolate the rates you find in these measurements: the fact that some judges and police officers are racist doesn’t mean that the same proportion of the total population is racist. Not all people who live in predominantly white neighborhoods do so because they don’t want to live in mixed neighborhoods. Different crime rates by race can be an indicator of racist law enforcement, but can also hide other causes, such as different poverty rates by race (which can themselves be indicators of racism). Higher numbers of hate crimes or hate groups may represent a radicalization of an increasingly small minority. And so on.

Another alternative measurement system is the Implicit Association Test. This is a psychological test that measures implicit attitudes and beliefs that people are either unwilling or unable to report.

Because the IAT requires that users make a series of rapid judgments, researchers believe that IAT scores may reflect attitudes which people are unwilling to reveal publicly. (source)

Participants in an IAT are asked to rapidly decide which words are associated. For example, is “female” or “male” associated with “family” and “career” respectively? This way, you can measure the strength of association between mental constructs such as “female” or “male” on the one hand and attributes such as “family” or “career” on the other. And this allows you to detect prejudice. The same is true for racism. You can read here or here how an IAT is usually performed.
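To make the scoring intuition concrete, here’s a rough sketch in Python of a simplified IAT-style score: the difference in mean reaction time between the “incompatible” and “compatible” pairings, scaled by the overall spread of reaction times. The reaction times and the scoring shortcut are invented for illustration; the real IAT uses a more elaborate scoring procedure with additional data-cleaning steps.

```python
from statistics import mean, stdev

# Hypothetical reaction times (milliseconds) for one participant.
# "Compatible" block: pairings the participant finds easy to associate;
# "incompatible" block: pairings that clash with implicit associations.
compatible_rt = [612, 589, 640, 575, 603, 598, 621, 584]
incompatible_rt = [745, 802, 690, 760, 731, 718, 779, 753]

def simplified_iat_score(compatible, incompatible):
    """Mean latency difference, scaled by the spread of all latencies.

    A positive score means the incompatible pairings took longer,
    i.e. the participant associates the compatible pairings more strongly.
    """
    return (mean(incompatible) - mean(compatible)) / stdev(compatible + incompatible)

print(round(simplified_iat_score(compatible_rt, incompatible_rt), 2))
```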

Yet another measurement system uses evidence from Google search data, such as in this example. The advantage of this system is that it avoids the social desirability bias: Google searches are done alone, online, and without the searcher knowing that the search data will be used to measure racism. Hence, people searching on Google are more likely to express social taboos. In this respect, the measurement system is similar to the IAT. Another advantage of the Google method, compared to traditional surveys, is that the Google sample is very large and more or less evenly distributed across all areas of a country. This allows for some fine-grained geographical breakdown of racial animus.

More specifically, the purpose of the Google method is to analyze trends in searches that include words like “nigger” or “niggers” (not “nigga” because that’s slang in some Black communities, and not necessarily a disparaging term). In order to avoid searches for the term “nigger” by people who may not be racially motivated – such as researchers (Google can’t tell the difference) – you could refine the method and analyze only searches for phrases like “why are niggers lazy”, “Obama+nigger”, “niggers/blacks+apes” etc. If you find that those searches are more common in some locations than others, or that they are becoming more common over time in some locations, then you can try to correlate those findings with other, existing indicators of racism such as those cited above, or with historic indicators such as prevalence of slavery or lynchings.
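As a rough sketch of that last step, here is what correlating regional search rates with another indicator could look like. All region names and numbers below are invented for illustration; a real analysis would use actual Google Trends data and a validated indicator.

```python
from statistics import correlation  # Python 3.10+

# Invented illustrative data: relative frequency of racially charged searches
# per region, and an independent indicator of racism for the same regions
# (e.g. a callback gap from a resume audit study).
search_rate     = {"Region A": 1.8, "Region B": 0.9, "Region C": 2.4, "Region D": 1.1, "Region E": 3.0}
other_indicator = {"Region A": 0.12, "Region B": 0.05, "Region C": 0.16, "Region D": 0.07, "Region E": 0.21}

regions = sorted(search_rate)
x = [search_rate[r] for r in regions]
y = [other_indicator[r] for r in regions]

# Pearson correlation: values near +1 mean the two measures move together
# across regions, which is what the Google method looks for.
print(round(correlation(x, y), 2))
```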

More posts in this series are here.

Lies, Damned Lies, and Statistics (37): When Surveyed, People Express Opinions They Don’t Hold

It’s been a while since the last post in this series, so here’s a recap of its purpose. This blog promotes the quantitative approach to human rights: we need to complement the traditional approaches – anecdotal, journalistic, legal, judicial etc. – with one that focuses on data, country rankings, international comparisons, catastrophe measurement, indexes etc.

Because this statistical approach is important, it’s also important to engage with measurement problems, and there are quite a few in the case of human rights. After all, you can’t measure respect for human rights like you can measure the weight or size of an object. There are huge obstacles to overcome in human rights measurement. On top of the measurement difficulties that are specific to the area of human rights, this area suffers from some of the general problems in statistics. Hence, there’s a blog series here about problems and abuse in statistics in general.

Take for example polling or surveying. A lot of information on human rights violations – though not all of it – comes from surveys and opinion polls, and it’s therefore of the utmost importance to describe what can go wrong when designing, implementing and using surveys and polls. (Previous posts about problems in polling and surveying are here, here, here, here and here).

One interesting problem is the following:

Simply because the surveyor is asking the question, respondents believe that they should have an opinion about it. For example, researchers have shown that large minorities would respond to questions about obscure or even fictitious issues, such as providing opinions on countries that don’t exist. (source, source)

Of course, when people express opinions they don’t have, we risk drawing the wrong conclusions from surveys. We also risk that a future survey asking the same questions comes up with totally different results. Confusion guaranteed. After all, if we make up our opinions on the spot when someone asks us, and those aren’t really our opinions but rather unreflective reactions given out of a sense of obligation, it’s unlikely that we will express the same opinion in the future.

Another reason for this effect is probably our reluctance to come across as ignorant: rather than selecting the “I don’t know/no opinion” answer, we just pick one of the other possible answers. Again a cause of distortions.

Measuring Human Rights (12): Measuring Public Opinion on Torture

Measuring the number and gravity of cases of actual torture is extremely difficult, for obvious reasons. It takes place in secret, and the people subjected to torture are often in prison long afterwards, or don’t survive it. Either way, they can’t tell us.

That’s why people try to find other ways to measure torture. Asking the public when and under which circumstances they think torture is acceptable may give an approximation of the likelihood of torture, at least as long as we assume that in democratic countries governments will only engage in torture if there’s some level of public support for it. This approach won’t work in dictatorships, obviously, since public opinion in a dictatorship is often completely irrelevant.

However, measuring public opinion on torture has proven to be very difficult and misleading:

Many journalists and politicians believe that during the Bush administration, a majority of Americans supported torture if they were assured that it would prevent a terrorist attack. … But this view was a misperception … we show here that a majority of Americans were opposed to torture throughout the Bush presidency…even when respondents were asked about an imminent terrorist attack, even when enhanced interrogation techniques were not called torture, and even when Americans were assured that torture would work to get crucial information. Opposition to torture remained stable and consistent during the entire Bush presidency.

Gronke et al. attribute confusion of beliefs [among many journalists] to the so-called false consensus effect studied by cognitive psychologists, in which people tend to assume that others agree with them. For example: The 30% who say that torture can “sometimes” be justified believe that 62% of Americans do as well. (source)

Lies, Damned Lies, and Statistics (32): The Questioner Matters

I’ve discussed the role of framing before: the way you ask questions in surveys influences the answers you get and therefore modifies the survey results. (See here and here for instance). It happens quite often that polling organizations or media inadvertently or even deliberately frame questions in a way that nudges people toward a particular answer. In fact you can frame questions in such a way that you get almost any answer you want.

However, the questioner may matter just as much as the question.

Consider this fascinating new study, based on surveys in Morocco, which found that the gender of the interviewer and how that interviewer was dressed had a big impact on how respondents answered questions about their views on social policy. …

[T]his paper asks whether and how two observable interviewer characteristics, gender and gendered religious dress (hijab), affect survey responses to gender and non-gender-related questions. [T]he study finds strong evidence of interviewer response effects for both gender-related items, as well as those related to support for democracy and personal religiosity … Interviewer gender and dress affected responses to survey questions pertaining to gender, including support for women in politics and the role of Shari’a in family law, and the effects sometimes depended on the gender of the respondent. For support for gender equality in the public sphere, both male and female respondents reported less progressive attitudes to female interviewers wearing hijab than to other interviewer groups. For support for international standards of gender equality in family law, male respondents reported more liberal views to female interviewers who do not wear hijab, while female respondents reported more liberal views to female interviewers, irrespective of dress. (source, source)

Other data indicate that the effect occurs in the U.S. as well. This is potentially a bigger problem than the framing effect since questions are usually public and can be verified by users of the survey results, whereas the nature of the questioner is not known to the users.

There’s an overview of some other effects here. More on the headscarf is here. More posts in this series are here.

Lies, Damned Lies, and Statistics (31): Common Problems in Opinion Polls

Opinion polls or surveys are very useful tools in human rights measurement. We can use them to measure public opinion on certain human rights violations, such as torture or gender discrimination. High levels of public approval of such rights violations may make them more common and more difficult to stop. And surveys can measure what governments don’t want to measure. Since we can’t trust oppressive governments to give accurate data on their own human rights record, surveys may fill in the blanks. Although even that won’t work if the government is so utterly totalitarian that it doesn’t allow private or international polling of its citizens, or if it has scared its citizens to such an extent that they won’t participate honestly in anonymous surveys.

But apart from physical access and respondent honesty in the most dictatorial regimes, polling in general is vulnerable to mistakes and fraud (fraud being a conscious mistake). Here’s an overview of the issues that can mess up public opinion surveys, inadvertently or not.

Wording effect

There’s the well-known problem of question wording, which I’ve discussed in detail before. Pollsters should avoid leading questions, questions that are put in such a way that they pressure people to give a certain answer, questions that are confusing or easily misinterpreted, wordy questions, questions using jargon, abbreviations or difficult terms, double or triple questions etc. Also quite common are “silly questions”, questions that don’t have meaningful or clear answers: for example “is the Catholic Church a force for good in the world?” What on earth can you answer to that? It depends on which elements of the church you’re talking about, and on what circumstances, country or even historical period you’re asking about. The answer is most likely “yes and no”, and hence useless.

The importance of wording is illustrated by the often substantial effects of small modifications in survey questions. Even the replacement of a single word by another, related word, can radically change survey results.

Of course, it’s often claimed that biased poll questions corrupt the average survey responses, but that the overall results of the survey can still be used to learn about time trends and differences between groups. As long as you make the same mistake consistently, you may still find something useful. That’s true, but it’s no reason not to take care with wording: the same trends and differences can be seen in survey results produced with correctly worded questions.

Order effect or contamination effect

Answers to questions depend on the order in which they’re asked, and especially on the questions that preceded them. Here’s an example:

Fox News yesterday came out with a poll that suggested that just 33 percent of registered voters favor the Democrats’ health care reform package, versus 55 percent opposed. … The Fox News numbers on health care, however, have consistently been worse for Democrats than those shown by other pollsters. (source)

The problem is not the framing of the question. This was the question: “Based on what you know about the health care reform legislation being considered right now, do you favor or oppose the plan?” Nothing wrong with that.

So how can Fox News ask a seemingly unbiased question of a seemingly unbiased sample and come up with what seems to be a biased result? The answer may have to do with the questions Fox asks before the question on health care. … the health care questions weren’t asked separately. Instead, they were questions #27-35 of their larger, national poll. … And what were some of those questions? Here are a few: … Do you think President Obama apologizes too much to the rest of the world for past U.S. policies? Do you think the Obama administration is proposing more government spending than American taxpayers can afford, or not? Do you think the size of the national debt is so large it is hurting the future of the country? … These questions run the gamut from slightly leading to full-frontal Republican talking points. … A respondent who hears these questions, particularly the series of questions on the national debt, is going to be primed to react somewhat unfavorably to the mention of another big Democratic spending program like health care. And evidently, an unusually high number of them do. … when you ask biased questions first, they are infectious, potentially poisoning everything that comes below. (source)

If you want to avoid this mistake – if we can call it that (since in this case it’s quite likely to have been a “conscious mistake” aka fraud) – randomizing the question order for each respondent might help.
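A minimal sketch of that remedy, assuming nothing more than a list of question texts: give every respondent an independently shuffled order, so that no single question systematically primes the ones that follow.

```python
import random

questions = [
    "Do you favor or oppose the health care reform plan?",
    "Do you think the national debt is hurting the future of the country?",
    "Do you approve or disapprove of the president's foreign policy?",
]

def questionnaire_for(respondent_id: int) -> list[str]:
    """Return an independently shuffled question order for each respondent."""
    rng = random.Random(respondent_id)  # reproducible, but different per respondent
    order = questions[:]
    rng.shuffle(order)
    return order

print(questionnaire_for(1))
print(questionnaire_for(2))
```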

Similar to the order effect is the effect created by follow-up questions. It’s well-known that follow-up questions of the type “but what if…” or “would you change your mind if …” change the answers to the initial questions.

Bradley effect

The Bradley effect is a theory proposed to explain observed discrepancies between voter opinion polls and election outcomes in some U.S. government elections where a white candidate and a non-white candidate run against each other.

Contrary to the wording and order effects, this isn’t an effect created – intentionally or not – by the pollster, but by the respondents. The theory proposes that some voters tend to tell pollsters that they are undecided or likely to vote for a black candidate, and yet, on election day, vote for the white opponent. It was named after Los Angeles Mayor Tom Bradley, an African-American who lost the 1982 California governor’s race despite being ahead in voter polls going into the elections.

The probable cause of this effect is the phenomenon of social desirability bias. Some white respondents may give a certain answer for fear that, by stating their true preference, they will open themselves to criticism of racial motivation. They may feel under pressure to provide a politically correct answer. The existence of the effect is, however, disputed. (Some say the election of Obama disproves the effect, thereby making another statistical mistake).

Fatigue effect

Another effect created by the respondents rather than the pollsters is the fatigue effect. As respondents grow increasingly tired over the course of a long interview, the accuracy of their responses can decrease. They may look for shortcuts to shorten the interview, for example by figuring out a pattern (say, that only positive or only negative answers trigger follow-up questions). Or they may just give up halfway, causing incompletion bias.

However, this effect isn’t entirely due to respondents. Survey design can be at fault as well: there may be repetitive questioning (sometimes deliberate, for control purposes), the survey may be too long or longer than initially promised, or the pollster may want to make his life easier and group different polls into one (which is what seems to have happened in the Fox poll mentioned above, creating an order effect – but that’s the charitable view of course). The fatigue effect may also be caused by a pollster interviewing people who don’t care much about the topic.

Sampling effect

Ideally, the sample of people interviewed for a survey should be a fully random subset of the entire population: every person in the population should have an equal chance of being included. That means there shouldn’t be self-selection (a typical flaw in many if not all internet surveys of the “Polldaddy” variety) or self-deselection. Both reduce the randomness of the sample; self-selection, for example, tends to produce polarized results. The size of the sample also matters: samples that are too small typically produce unreliable results.

Even the determination of the total population from which the sample is taken can lead to biased results. And yes, that has to be determined… For example, do we include inmates, illegal immigrants etc. in the population? See here for some examples of the consequences of such choices.

House effect

A house effect occurs when a particular pollster’s surveys systematically lean toward one or the other party’s candidates; Rasmussen is known for this.
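One rough way to quantify a house effect, sketched below with invented numbers: compare each pollster’s results for the same races with the average across all pollsters, and look at the average deviation. A consistently positive or negative deviation suggests a lean.

```python
from statistics import mean

# Hypothetical results (share for one party, in %) for the same three races,
# as reported by three different pollsters. All numbers are invented.
polls = {
    "Pollster A": [48.0, 51.0, 45.5],
    "Pollster B": [47.5, 50.5, 45.0],
    "Pollster C": [44.0, 47.0, 41.5],
}

# Average across pollsters for each race.
race_avg = [mean(vals) for vals in zip(*polls.values())]

# House effect: how far a pollster's numbers sit, on average,
# from the cross-pollster average for the same races.
for name, vals in polls.items():
    deviation = mean(v - a for v, a in zip(vals, race_avg))
    print(f"{name}: {deviation:+.1f} points")
```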

I probably forgot an effect or two. Fill in the blanks if you care. Go here for other posts in this series.

Lies, Damned Lies, and Statistics (29): How (Not) to Frame Survey Questions, Ctd.

Here’s a nice example of the way in which small modifications in survey questions can radically change survey results:

Our survey asked the following familiar question concerning the “right to die”: “When a person has a disease that cannot be cured and is living in severe pain, do you think doctors should or should not be allowed by law to assist the patient to commit suicide if the patient requests it?”

57 percent said “doctors should be allowed,” and 42 percent said “doctors should not be allowed.” As Joshua Green and Matthew Jarvis explore in their chapter in our book, the response patterns to euthanasia questions will often differ based on framing. Framing that refers to “severe pain” and “physicians” will often lead to higher support for ending the patient’s life, while including the word “suicide” will dramatically lower support. (source)

Similarly, seniors are willing to pay considerably more for “medications” than for “drugs” or “medicine” (source). Yet another example involves the use of “Wall Street”: there’s greater public support for banking reform when the issue is more specifically framed as regulating “Wall Street banks”.

What’s the cause of this sensitivity? It’s difficult to tell. Cognitive bias probably plays a role, as does the psychology of associations (“suicide” brings up images of blood and pain, whereas “physicians” brings up images of control; similarly, “homosexual” evokes sleazy bars while “gay” evokes art and design types). Perhaps also the wish not to offend the person asking the question. In any case, the conclusion is that pollsters should be very careful when framing questions. One tactic could be to use as many different words and synonyms as possible in order to avoid a bias created by any particular word.

Lies, Damned Lies, and Statistics (28): Push Polls

Push polls are used in election campaigns, not to gather information about public opinion, but to modify public opinion in favor of a certain candidate, or – more commonly – against a certain candidate. They are called “push” polls because they intend to “push” the people polled towards a certain point of view.

Push polls are not cases of “lying with statistics” as we usually understand them, but it’s appropriate to talk about them since they are very similar to a “lying technique” that I discussed many times, namely leading questions (see here for example). The difference here is that leading questions aren’t used to manipulate poll results, but to manipulate people.

The push poll isn’t really a poll at all, since the purpose isn’t information gathering, which is why many people don’t like the term and label it oxymoronic. A better term would indeed be advocacy telephone campaign. A push poll is more like a gossip campaign, a propaganda effort or telemarketing. Push polls are very similar to political attack ads, in the sense that they intend to smear candidates, often with little basis in fact. Compared to political ads, push polls have the “advantage” that they don’t seem to emanate from the campaign offices of one of the candidates (they are typically conducted by bogus polling agencies). Hence it’s more difficult for the recipients of a push poll to classify the “information” it contains as political propaganda, and they are therefore more likely to believe it. Which is of course the reason push polls are used. Also, the fact that they are presented as “polls” rather than campaign messages makes it more likely that people listen, and as they listen more, they internalize the messages better than in the case of outright campaigning (which they often dismiss as propaganda).

Push polls usually, but not necessarily, contain lies or false rumors. They may also be limited to misleading or leading questions. For example, a push poll may ask people: “Do you think that the widespread and persistent rumors about Obama’s Muslim faith, based on his own statements, connections and acquaintances, are true?”. Some push polls may even contain some true but unpleasant facts about a candidate, and then hammer on these facts in order to change the opinions of the people being “polled”.

Infamous examples include the push poll used by Bush against McCain in the Republican primaries of 2000 (insinuating that McCain had an illegitimate black child), and the one used by McCain (a fast learner!) against Obama in 2008 (alleging that Obama had ties with the PLO).

One way to distinguish legitimate polls from push polls is the sample size. The former are usually content with relatively small sample sizes (but not too small), whereas the latter typically want to “reach” as many people as possible. Push polls won’t include demographic questions about the people being polled (gender, age, etc.) since there is no intention to aggregate results, let alone aggregate by type of respondent. Another way to identify push polls is the selection of the target population: normal polls try to reach a random subset of the population; push polls are often targeted at certain types of voters, namely those likely to be swayed by negative campaigning about a certain candidate. Push polls also tend to be quite short compared to regular polls, since the purpose is to reach a maximum number of people.

Measuring Poverty (2): Some Problems With Poverty Measurement

The struggle against poverty is a worthy social goal, and the absence of poverty is a human right. But poverty is also an obstacle to other social goals, particularly the full realization of other human rights. A necessary instrument in poverty reduction is data: how many people suffer from poverty? Without an answer to that question it’s very difficult to assess the success of poverty reduction policies (such as development aid).

And that’s where the problems start. There’s some uncertainty in the data: they may not accurately reflect the real number of people living in poverty. There are definition issues – what is poverty? – that may reduce the accuracy of the data or the comparability between different measurements of poverty (or between measurements over time), and there are issues related to the measurements themselves. I’ll focus on the latter for the moment.

Poverty is often measured by way of surveys. These surveys, however, can be biased because of

  1. sample errors: underrepresentation of the very rich and the very poor in the sample (more on sample errors here), and
  2. reporting errors: failure of the very rich and the very poor to report accurately.

The rich are less likely than middle-income people to respond to surveys because they are less accessible (their houses, for instance, are harder to reach). In addition, when they do respond, they tend to underreport a larger fraction of their wealth because they have more incentives to hide it (for tax reasons, for example).

The very poor may also be inaccessible, but for other reasons. They may be hard to interview when they don’t have a fixed address or an official identification. In poor countries, they may be hard to find because they live in remote areas with inadequate transportation access. And again, when they report, it may be difficult to estimate their “wealth” because their assets are often in kind rather than in currency.

Because of underrepresentation and underreporting at both extremes of the wealth distribution, we come to believe that the income distribution is more egalitarian than it really is. Hence we underestimate income inequality and relative poverty.
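A small simulation can illustrate the point (the incomes below are simulated, not real data): draw incomes from a skewed distribution, drop the extremes to mimic underrepresentation at both ends, and compare a standard inequality measure – the Gini coefficient – before and after.

```python
import random

def gini(incomes):
    """Gini coefficient via the standard sorted-rank formula."""
    xs = sorted(incomes)
    n = len(xs)
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * sum(xs)) - (n + 1) / n

random.seed(0)
# Simulated skewed income distribution (lognormal), purely illustrative.
population = [random.lognormvariate(10, 1.0) for _ in range(10_000)]

# Mimic surveys that miss the very rich and the very poor:
# drop the top and bottom 5% of the distribution.
xs = sorted(population)
cut = len(xs) // 20
truncated = xs[cut:-cut]

print(f"Gini, full population:   {gini(population):.3f}")
print(f"Gini, extremes left out: {gini(truncated):.3f}")  # noticeably lower
```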

But apart from relative poverty we also underestimate absolute poverty since we’re often unable to include the very poor in the reporting for the reasons given above. By “cutting off” the people at the poor end of the distribution, it seems like most people are middle class and society largely egalitarian.

However, absolute poverty can also be overestimated: if the poor respond, we may fail to accurately assess their “wealth” given that much of it is in kind. And it’s unlikely that these two errors – underestimation and overestimation – cancel each other out.

These and other problems of poverty measurement make it difficult to claim that we “know” more or less precisely how many poor people there are, but if we make the same errors consistently we may be able to guess, not the levels of poverty, but at least the trends: is poverty going up or down?

Lies, Damned Lies, and Statistics (16): Measuring Public Opinion in Dictatorships

Measuring human rights requires a certain level of respect for human rights (freedom to travel, freedom to speak, to interview etc.). Trying to measure human rights in situations characterized by the absence of freedom is quite difficult, and can even lead to unexpected results: the absence of (access to) good data may give the impression that things aren’t as bad as they really are. Conversely, when a measurement shows a deteriorating situation, the cause of this may simply be better access to better data. And this better access to better data may be the result of more openness in society. Deteriorating measurements may therefore signal an actual improvement. I gave an example of this dynamic here (it’s an example of statistics on violence against women).

Measuring public opinion in authoritarian countries is always difficult, but if you ask people whether they love or hate their government, you’ll likely find higher rates of “love” in the more authoritarian countries. After all, in those countries it can be pretty dangerous to tell someone in the street that you hate your government. People choose to lie and say that they approve. That’s the safest answer, but probably in many cases not the real one. I don’t believe for a second that the percentage of people approving of their government is 19 times higher in Azerbaijan than in Ukraine, when Ukraine is in fact much more liberal than Azerbaijan.

In the words of Robert Coalson:

The Gallup chart is actually an index of fear. What it reflects is not so much attitudes toward the government as a willingness to openly express one’s attitudes toward the government. As one member of RFE/RL’s Azerbaijan Service told me, “If someone walked up to me in Baku and asked me what I thought about the government, I’d say it was great too”.

Lies, Damned Lies, and Statistics (10): How (Not) to Frame Survey Questions

I’ve mentioned before that information on human rights depends heavily on opinion surveys. Unfortunately, surveys can be wrong and misleading for so many different reasons that we have to be very careful when designing surveys and when using and interpreting survey data. One reason I haven’t mentioned before is the framing of the questions.

Even very small differences in framing can produce widely divergent answers. And there is a wide variety of problems linked to the framing of questions:

  • Questions can be leading questions – questions that suggest the answer. For example: “It’s wrong to discriminate against people of another race, isn’t it?” Or: “Don’t you agree that discrimination is wrong?”
  • Questions can be put in such a way that they put pressure on people to give a certain answer. For example: “Most reasonable people think racism is wrong. Are you one of them?” This is also a leading question of course, but it’s more than simply “leading”.
  • Questions can be confusing or easily misinterpreted. Such questions often include a negative, or, worse, a double negative. For example: “Do you agree that it isn’t wrong to discriminate under no circumstances?” Needless to say, your survey results will be contaminated by answers that are the opposite of what they should have been.
  • Questions can be wordy. For example: “What do you think about discrimination (a term that refers to treatment taken toward or against a person of a certain group that is based on class or category rather than individual merit) as a type of behavior that promotes a certain group at the expense of another?” This is obviously a subtype of the confusing variety.
  • Questions can also be confusing because they use jargon, abbreviations or difficult terms. For example: “Do you believe that UNESCO and ECOSOC should administer peer-to-peer expertise regarding discrimination in an ad hoc or a systemic way?”
  • Questions can in fact be double or even triple questions, but there is only one answer required and allowed. Hence people who may have opposing answers to the two or three sub-questions will find it difficult to provide a clear answer. For example: “Do you agree that racism is a problem and that the government should do something about it?”
  • Open questions should be avoided in a survey. For example: “What do you think about discrimination?” Such questions do not yield answers that can be quantified and aggregated.
  • You also shouldn’t ask questions that exclude some possible answers, and neither should you provide a multiple-choice set of answers that leaves out some possible answers. For example: “How much did the government improve its anti-discrimination efforts relative to last year? Somewhat? Average? A lot?” Notice that such a framing doesn’t allow people to respond that the effort hasn’t improved or has worsened. Another example: failure to include “don’t know” as a possible answer.

Here’s a real-life example:

In one of the most infamous examples of flawed polling, a 1992 poll conducted by the Roper organization for the American Jewish Committee found that 1 in 5 Americans doubted that the Holocaust occurred. How could 22 percent of Americans report being Holocaust deniers? The answer became clear when the original question was re-examined: “Does it seem possible or does it seem impossible to you that the Nazi extermination of the Jews never happened?” This awkwardly-phrased question contains a confusing double-negative which led many to report the opposite of what they believed. Embarrassed Roper officials apologized, and later polls, asking clear, unambiguous questions, found that only about 2 percent of Americans doubt the Holocaust. (source)

Lies, Damned Lies, and Statistics (9): Too Small Sample Sizes in Surveys

So many things can go wrong in the design and execution of opinion surveys. And opinion surveys are a common tool in data gathering in the field of human rights.

As it’s often impossible (and undesirable) to question a whole population, statisticians usually select a sample from the population and ask their questions only to the people in this sample. They assume that the answers given by the people in the sample are representative of the opinions of the entire population. But that’s only the case if the sample is a fully random subset of the population – that means that every person in the population should have an equal chance of being chosen – and if the sample hasn’t been distorted by other factors such as self-selection by respondents (a common thing in internet polls) or personal bias by the statistician who selects the sample.

A sample that is too small is also not representative of the entire population. For example, if we ask 100 people whether they approve or disapprove of discrimination against homosexuals, and 55 of them say they approve, we might assume that about 55% of the entire population approves. Now it could possibly be that only 45% of the total population approves, but that we just happened, by chance, to interview an unusually large percentage of people who approve. For example, this may have happened because, by chance and without being aware of it, we selected the people in our sample in such a way that there are more religious conservatives in our sample than in society at large, relatively speaking.

This is the problem of sample size: the smaller the sample, the greater the influence of luck on the results we get. Asking the opinion of 100 people, and taking this as representative of millions of citizens, is like throwing a coin 10 times and assuming – after getting 3 heads and 7 tails – that the probability of throwing heads is 30%. We all know that it’s not 30 but 50%. And we know this because we know that when we increase the “sample size” – i.e. when we throw more than 10 times, say a thousand times – we will get heads and tails approximately half of the time each. Likewise, in our example of the survey on homosexuality, increasing the sample size reduces the chance that religious conservatives (or other groups) are disproportionately represented in the sample.
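The coin-flip intuition can be checked with a short simulation (the 45% approval figure is just the illustrative value from the example above): draw many samples of 100 and of 1,000 respondents from a population in which 45% approve, and see how far the sample estimates stray from the truth.

```python
import random

random.seed(42)
TRUE_APPROVAL = 0.45  # illustrative population value from the example above

def sample_estimate(n):
    """Share of approvers in one random sample of size n."""
    return sum(random.random() < TRUE_APPROVAL for _ in range(n)) / n

for n in (100, 1000):
    estimates = [sample_estimate(n) for _ in range(2000)]
    spread = max(estimates) - min(estimates)
    off_by_5 = sum(abs(e - TRUE_APPROVAL) > 0.05 for e in estimates) / len(estimates)
    print(f"n={n}: estimates range over {spread:.2f}, "
          f"{off_by_5:.0%} of samples are off by more than 5 points")
```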

When analyzing survey results, the first thing to look at is the sample size, as well as the level of confidence (usually 95%) that the results are within a certain margin of error (usually + or – 5%). High levels of confidence that the results are correct within a small margin of error indicate that the sample was sufficiently large and random.
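The usual back-of-the-envelope check, sketched below: for a proportion, the 95% margin of error is roughly 1.96 times the standard error √(p(1−p)/n), which is why samples of around 1,000 respondents give the familiar margin of plus or minus 3 points.

```python
from math import sqrt

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a sample proportion."""
    return z * sqrt(p * (1 - p) / n)

# Worst case (p = 0.5) for a few common sample sizes.
for n in (100, 400, 1000):
    print(f"n={n}: +/- {margin_of_error(0.5, n):.1%}")
```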

Lies, Damned Lies, and Statistics (6): Statistical Bias in the Design and Execution of Surveys

Statisticians can – wittingly or unwittingly – introduce bias in their work. Take the case of surveys for instance. Two important steps in the design of a survey are the definition of the population and the selection of the sample. As it’s often impossible (and undesirable) to question a whole population, statisticians usually select a sample from the population and ask their questions only to the people in this sample. They assume that the answers given by the people in the sample are representative of the opinions of the entire population.

Bias can be introduced

  • at the moment of the definition of the population
  • at the moment of the selection of the sample
  • at the moment of the execution of the survey (as well as at other moments of the statistician’s work, which I won’t mention here).

Population

Let’s take a fictional example of a survey. Suppose statisticians want to measure public opinion regarding the level of respect for human rights in the country called Dystopia.

First, they set about defining their “population”, i.e. the group of people whose “public opinion” they want to measure. “That’s easy”, you think. So do they, unfortunately. It’s the people living in this country, of course, or is it?

Not quite. Suppose the level of rights protection in Dystopia is very low, as you might expect. That means that many people have probably fled the country. Including only the country’s residents in the survey population will then overestimate the level of rights protection. And there is another point: dead people can’t talk. We can assume that many victims of rights violations have died because of them. Not including these dead people in the survey will also artificially push up the level of rights protection. (I’ll mention in a moment how it is at all possible to include dead people in a survey; bear with me).

Hence, doing a survey and then assuming that the people who answered it are representative of the whole population means discarding the opinions of refugees and dead people. If those opinions were included, the results would be different and more correct. Of course, in the case of dead people it’s obviously impossible to include their opinions, but perhaps it would be advisable to make a statistical correction for them. After all, we know their answers: people who died because of rights violations in their country presumably wouldn’t have had a good opinion of their political regime.

Sample

And then there are the problems linked to the definition of the sample. An unbiased sample should be a fully random subset of the entire and correctly defined population (needless to say, if the population is defined incorrectly, as in the example above, then the sample is by definition also biased, even if no sampling mistakes have been made). That means that every person in the population should have an equal chance of being chosen, and that there shouldn’t be self-selection (a typical flaw in many if not all internet surveys of the “Polldaddy” variety) or self-deselection. The latter is very likely in my Dystopia example. People who are too afraid to talk won’t talk. The harsher the rights violations, the more people will fail to cooperate. So you get the perverse effect that very cruel regimes may score better on human rights surveys than modestly cruel regimes. The latter are cruel, but not cruel enough to scare the hell out of people.
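A toy simulation of that perverse effect (all numbers invented): suppose dissenters are much less likely to agree to be interviewed. The measured approval rate then overshoots the true one, and the overshoot grows as repression scares more dissenters into silence.

```python
import random

random.seed(1)

def measured_approval(true_approval, dissenter_response_rate, n=100_000):
    """Approval rate among those who actually agree to answer.

    Supporters always respond; dissenters respond with the given probability.
    """
    responses = []
    for _ in range(n):
        approves = random.random() < true_approval
        responds = approves or random.random() < dissenter_response_rate
        if responds:
            responses.append(approves)
    return sum(responses) / len(responses)

# The harsher the repression, the lower the dissenters' response rate.
for rate in (0.8, 0.4, 0.1):
    print(f"dissenters responding {rate:.0%}: "
          f"measured approval {measured_approval(0.40, rate):.0%} (true: 40%)")
```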

The classic example of a sampling error comes from a poll on the 1948 presidential election in the U.S.:

On Election night, the Chicago Tribune printed the headline DEWEY DEFEATS TRUMAN, which turned out to be mistaken. In the morning the grinning President-Elect, Harry S. Truman, was photographed holding a newspaper bearing this headline. The reason the Tribune was mistaken is that their editor trusted the results of a phone survey. Survey research was then in its infancy, and few academics realized that a sample of telephone users was not representative of the general population. Telephones were not yet widespread, and those who had them tended to be prosperous and have stable addresses. (source)

Execution

Another reason why bias in the sampling may occur is the way in which the surveys are executed. If the government of Dystopia allows statisticians to operate on its territory, it will probably not allow them to operate freely, or circumstances may not permit them to operate freely. So the people doing the interviews are not allowed to, or don’t dare to, travel around the country. Hence they themselves deselect entire groups from the survey, distorting the randomness of the sample. Again, the more repressive the regime, the more this happens, with possibly perverse effects: perhaps only people living in urban areas, close to where the statisticians are based, can be interviewed, and those people may have a relatively large stake in the government, which makes them paint a rosy picture of the regime.