Archive for the ‘Data’ Category

Obesity – Maps

Monday, May 21st, 2012

Many of us first got into the field of the built environment and health because of a series of maps that the CDC put together using data from the BRFSS.  Dr. Richard Jackson used these maps as he went around the country to increase awareness of the issue.

The maps can be viewed here:

CDC obesity maps

Census geography – a primer

Saturday, May 19th, 2012

Sooner or later, every urban planner or public health practitioner finds themselves needing to work with census data.  Many people are uncertain as to how the census characterizes various facets of geography.  There are also complaints that the census-defined geography doesn’t fit an individual’s idea of what it should be in a particular area.  Part of that comes from the need to have national standards; it is perhaps too much to expect that a national standard would exactly apply to every individual area.  Perhaps that is a topic for a future blog post.  Today, I am going to go through the various levels of census geography, starting from the most local.  I am not going to cover every level reported; there are far too many different types of geography made available.  The ones covered here represent those most often used in health and planning.

Blocks.  Blocks represent the smallest area that the census makes data available for.   In urban areas, these often correspond to actual city blocks, but in less dense areas, they may not.  However, the census tries to have them bounded by actual physical features such as roads and railroad tracks.  As valuable as the data might be on this very local level, most census data is suppressed at the block level because of confidentiality issues.

Block groups.  As the name suggests, these are amalgamations of blocks (all higher level census geographies are collections of smaller areas).  The block group aims to have about 1500 people, but this may vary.  Note that all geographies can be of any size; in Alaska, a block group may cover thousands of square miles.  They are sized for population, not land area.  In general, most data are reported for block groups.  They represent a finer grain of analysis than census tracts, but rarely do block groups correspond to anything that residents may find identifiable.  In that sense they don’t represent neighborhoods; they are mostly analytical constructs.

Census tracts.  These are the basic reporting units of the census.  Again, they are sized for population, not land area.  In my experience, the smallest are about 100 acres; the largest was over 10,000 square miles.  The census aims for a population of about 4500 persons (or about 3 block groups).  Tracts never cross county borders, and their numbering system reflects this restriction.  Many researchers talk about census tracts as if they were neighborhoods, but be very clear:  they are not!  They are drawn to try to correspond to some sort of on the ground reality, but the need to divide the country into tracts overrides any ability to reflect true neighborhood boundaries.  In some communities, the tracts are much bigger than what locals consider to be a neighborhood; in others, much smaller.  Though the “tracting” of the US began a hundred years ago (in part to assist health departments to understand local population numbers), it did not become widespread until 1970 and was not fully implemented until 1990.  Tract boundaries have also changed over time.  In areas with growing populations, they have been split (I know of no circumstances where they have been combined – but that doesn’t mean that has not happened).  Also, over time, the census has moved to rationalize tracts: make them more compact and eliminate incongruities.  A tract with a given number in 2010 may not totally correspond to the same numbered tract in 1970.
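
The nesting described above (blocks inside block groups, block groups inside tracts, tracts inside counties) shows up directly in the identifiers the census publishes.  Here is a minimal sketch in Python, assuming the standard 15-digit block identifier laid out as 2 digits of state, 3 of county, 6 of tract and 4 of block, with the first block digit giving the block group.  The function names and the sample identifier are made up for illustration.

```python
from typing import NamedTuple


class BlockGeoid(NamedTuple):
    state: str        # 2-digit state FIPS code
    county: str       # 3-digit county FIPS code
    tract: str        # 6-digit tract code
    block_group: str  # first digit of the block code
    block: str        # 4-digit block code


def parse_block_geoid(geoid: str) -> BlockGeoid:
    """Split a 15-digit census block identifier into its nested parts."""
    if len(geoid) != 15 or not geoid.isdigit():
        raise ValueError(f"expected 15 digits, got {geoid!r}")
    return BlockGeoid(
        state=geoid[0:2],
        county=geoid[2:5],
        tract=geoid[5:11],
        block_group=geoid[11],
        block=geoid[11:15],
    )


def tract_geoid(geoid: str) -> str:
    """Return the 11-digit tract identifier (state + county + tract)."""
    parts = parse_block_geoid(geoid)
    return parts.state + parts.county + parts.tract


# Hypothetical block identifier: state 25, county 025, an invented tract and block.
example = parse_block_geoid("250250101011002")
```

Because the county code is part of every tract identifier, a tract cannot belong to two counties, and the same six-digit tract code can be reused in different counties without conflict.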

Zip Code Tabulation Areas (ZCTA).  Given the popularity of the US Postal Service’s zip code, the census has moved to define ZCTAs.  Note that these are not exactly the same as zip codes.  Zip codes are defined based on the needs of delivering mail; ZCTAs might be considered to be “rounded” zip codes adapted to make them fit the constraints of census geography.  Many zip codes are for post offices or single office buildings. In general, these do not have a corresponding ZCTA.  So don’t be annoyed when you can’t find data on every zip code in your community.  Another issue is that neither ZCTAs nor zip codes can be relied upon to correspond to any way a neighborhood might be thought of.  The reality is that there is no census definition of neighborhood or any census geography level that corresponds to what we might consider a neighborhood to be.

Counties.  The US contains roughly 3,100 counties and county equivalents.  Even a geography that might seem as straightforward as the county contains a few quirks.  Counties are called parishes in Louisiana and boroughs in Alaska (I think).  Some states, particularly Maryland and Virginia, have independent cities, not considered part of any county.  Examples of these include Manassas Park in Virginia and Baltimore City in Maryland. These independent cities are considered and treated as if they were counties by the census.  The District of Columbia, treated as if it were a state by the census, does not have any counties.  Texas has the most. Note that even though counties in some states, such as Massachusetts, have ceased to have any legal powers, they continue to be used by the census.

Metropolitan statistical areas.  These are made up of one or more counties with a principal urban area (think center city) of at least 50,000.  For the most part, but not always, these are what most people think of as metropolitan areas.  They are defined based on commuting patterns and in consultation with state and local government.  Sometimes, these definitions have been fairly static over time.  In other metropolitan areas, they have expanded as the metro areas themselves have grown. In addition, the federal Office of Management and Budget (the official agency that defines MSAs; it’s not the Census Bureau that does it) sometimes defines new ones.  Along with micropolitan areas, MSAs make up what are known as Core Based Statistical Areas. The names of these MSAs were changed prior to the 2010 census. In the past, they tended to be named after the single largest city in the MSA. Only a few MSAs, such as San Francisco – Oakland, had multiple cities in their names.  Now, a large number of MSAs have multiple cities in their names.  Most researchers and residents ignore the smaller city names and tend to refer to the MSA by its largest city.

Micropolitan statistical areas.  These are similar to MSAs, but they are smaller in population: the center city has between 10,000 and 50,000 people.

New England City and Town Areas (NECTA).  As noted above, counties have no real meaning in some parts of New England.  Defining metropolitan areas based on counties might therefore not produce totally meaningful data.  Thus in the six New England states, there are metropolitan (and micropolitan) areas based on amalgamations of cities and towns.

Metropolitan divisions.  For some of the larger metropolitan areas, the census has defined subareas, called metropolitan divisions (in New England, there are also equivalent NECTA divisions).  These consist of one or more counties.  For example, the San Francisco MSA has a San Francisco metropolitan division and an Oakland metropolitan division (actually, again, a metropolitan division can have multiple cities included in its name).

Combined Statistical Areas.  Some metropolitan areas seem to be clustered together and highly economically integrated with their neighbors, yet each metropolitan area is still independent enough to make a complete consolidation imprecise.  The way this has been accommodated is by the Combined Statistical Area.  These can contain both metropolitan and micropolitan areas.  Again, there are equivalents in New England for the city and town based areas.  An example is the San Jose – San Francisco – Oakland CSA.  This CSA is a cluster of six MSAs.

The National Health and Nutrition Examination Survey (NHANES)

Monday, April 30th, 2012

The CDC has a number of great national surveys. Last week I talked about the BRFSS, a large scale annual survey.  But perhaps the most widely known survey is NHANES.  This survey is conducted in waves over several years. It consists of two teams that travel across the country, settling in one place for about six weeks, and extensively testing carefully selected survey respondents. Again, the survey methodology is critical because it informs us of how to use the survey and what its limitations are.

Survey respondents are selected in advance of the arrival of the survey team. They are sent letters and invited to participate. Those who agree to participate arrive at the survey site where, over a two-day process, they are extensively surveyed and given a number of lab tests including the drawing of blood and the taking of urine samples. Respondents are not randomly selected. The CDC uses a cluster sampling methodology and oversamples blacks and Hispanics. The survey includes both children and adults.

Analyzing the data requires special software that has the appropriate survey commands that can accommodate the cluster sampling design. Standard statistical packages such as SAS and Stata have these commands. Simple analytical packages may not. A major limitation of NHANES is that it is not geographically representative of the United States. The sample is selected to be demographically representative, but because the two teams can only visit a total of 16 places a year, it is impossible to achieve a good geographic spread. This means that if you want to know differences between parts of the country, it just doesn’t work. Therefore, the survey is best used for national data or maybe for very large states or groups of states. It’s also not very good for looking at changes over time because one doesn’t know if the changes are because of the geographic irregularities of the survey. About a year ago, the news was heralded that the US obesity rate appeared to have leveled off. This was based on NHANES data, and so those of us in the know were rather skeptical whether the leveling off represented a real change or was just a function of a different geography participating in a newer wave of subjects; we just didn’t know.

Again, however, these limitations are what they are. NHANES has provided vital information on a number of health issues including, for example, the buildup of fire retardant chemicals in adults and children. It was this survey that helped identify the problem with self-report of height and weight. The NHANES researchers asked subjects to report their height and weight but also had trained professionals measure these things. Thus it was found that people lie! Or at least misrepresent their height and weight.  NHANES has been used as the source for thousands of studies and is an invaluable part of the arsenal that people use to identify environmental and health problems in the United States.

The Behavioral Risk Factor Surveillance System (BRFSS)

Monday, April 23rd, 2012

One of the great things that the CDC provides researchers and planners is access to detailed data.  They fund a number of national surveys that have an incredible power to help increase our understanding of critical health issues.  Unfortunately, using these surveys is not always easy.  But they are not complicated.  One must understand the survey methodology, the special ways they must be analyzed and the limitations of each survey.

The largest of these health surveys is the Behavioral Risk Factor Surveillance System (BRFSS).  It began in the early 1980s in just a few states but now is in every state, the District of Columbia, and even in a number of overseas possessions of the United States. In 2010, the last year for which data have been released, the survey included over 450,000 respondents. It may well be the single largest health survey in the entire world.

It’s important to consider how the survey is undertaken because that helps explain the special ways that the data from the survey must be analyzed. The survey is funded by the CDC, but is conducted by each state (or equivalent) health department under very strict guidelines. Most important, the BRFSS does not use a random survey design. Instead, it uses a cluster sampling strategy and oversamples certain groups such as blacks and Hispanics so that the final sample includes large enough numbers of these groups to provide some statistical power. This means that standard ways of data analysis are inappropriate for the sample and that the cluster strategy and weighting methodology must be taken into account when analyzing data for the survey. Fortunately, most statistical packages have what are known as survey commands that enable this analysis to be done fairly easily. However, non-statistical data packages, such as Microsoft Excel, probably do not have these special commands and therefore cannot be used to analyze the data.
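
To see why the weighting matters, here is a minimal sketch in Python of a weighted prevalence estimate.  It assumes each respondent record carries a final survey weight (roughly, the number of people in the population that respondent stands for).  The function name and the five records are made up for illustration, and the sketch only produces a point estimate; correct standard errors still require survey software that accounts for the sampling design.

```python
def weighted_prevalence(outcomes, weights):
    """Weighted share of respondents with the outcome (1 = yes, 0 = no)."""
    if len(outcomes) != len(weights):
        raise ValueError("outcomes and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(y * w for y, w in zip(outcomes, weights)) / total_weight


# Invented data: the two respondents with the outcome come from an
# oversampled group, so each one stands for relatively few people.
outcomes = [1, 1, 0, 0, 0]
weights = [100, 100, 400, 400, 1000]

unweighted = sum(outcomes) / len(outcomes)        # 2 of 5 respondents
weighted = weighted_prevalence(outcomes, weights)  # 200 of 2000 weighted persons
```

In this invented case the unweighted share is 0.4 while the weighted share is 0.1, which is the kind of distortion that ignoring the weights can produce when some groups are deliberately oversampled.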

Note that the survey is given over the phone and thus may undersample people who do not have telephones or do not answer the telephones. The survey is generally only offered in English and Spanish, so other linguistic minorities may not be in the sample. Finally the sample only includes adults. You must be 18 years old to participate in the survey so it does not include children or adolescents.

There are two groups of questions in the survey: core questions and optional survey modules. The core questions are asked of every single qualifying respondent in the survey (men are not asked about mammography, for example). States can also use optional survey modules or add additional questions of their own if they want to study another topic. For a given year, the data set is quite large but it can be easily downloaded from the CDC website. As noted below, however, zip code level identifiers are not available from the CDC but must be accessed directly from the states. The issues and topics covered by the survey are very broad. These include health status, drinking and smoking habits, height and weight, knowledge about strokes, asthma, physical activity, calculated variables such as obesity and diet measures, and a large range of basic demographic information.

The standard large-scale data set also includes an identifier for a subject’s county of residence.  This is the standard FIPS identifier used by the census, so it’s very easy to attach census data, on the county level, to information on individuals. This allows for some multilevel modeling. I used the county identifier to assign a metropolitan area to each respondent and then used the information in multilevel analysis to explore the effects of income inequality, racial residential segregation, and urban sprawl on health. This represents a very powerful tool for exploring the effects of environment on individuals.
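
Attaching county data to individuals comes down to building the same five-digit FIPS string on both sides (two digits of state, three of county) and looking it up.  Here is a minimal sketch in Python; the field names (`state`, `county`, `median_income`) and the values are hypothetical, not the actual BRFSS or census variable names.

```python
def county_fips(state_code: int, county_code: int) -> str:
    """Build the 5-digit county FIPS string, keeping leading zeros."""
    return f"{state_code:02d}{county_code:03d}"


def attach_county_data(respondents, county_table):
    """Return copies of the respondent records with county fields added.

    Respondents whose county is missing from the table are kept without
    county fields, so unmatched records can be counted afterwards.
    """
    merged = []
    for person in respondents:
        fips = county_fips(person["state"], person["county"])
        merged.append({**person, "fips": fips, **county_table.get(fips, {})})
    return merged


# Invented records: one respondent matches the county table, one does not.
respondents = [
    {"id": 1, "state": 25, "county": 25, "bmi": 31.2},
    {"id": 2, "state": 6, "county": 37, "bmi": 24.8},
]
county_table = {"25025": {"median_income": 50000}}

merged = attach_county_data(respondents, county_table)
```

Keeping the FIPS code as a zero-padded string rather than a number avoids a common mismatch, where a state code such as 6 silently loses its leading zero and no longer matches the census table.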

Since about 1998, Massachusetts has been asking subjects their zip code. And in the last five years this has become standard across the country. In theory, it should be possible to analyze outcomes by zip code. However, there are a number of methodological issues. For one thing, the CDC does not include these identifiers in its national data set. If you want to know the zip codes by respondent you must go to the health department of each state and go through their cumbersome application process. This might be worthwhile if you’re only doing one state and you have a good relationship with your state health department. But it pretty much precludes any possibility of doing a national analysis on a zip code level. The other problem is that even though the survey is immense, the number of people in any given zip code in any given year is fairly small. This means there aren’t enough subjects in a given zip code to provide any kind of statistical reliability for analysis in that given area. Even combining data from multiple survey years does not provide enough subjects in one zip code. The data were very useful for a study of neighborhood effects by zip code on obesity in Massachusetts that was published a number of years ago. However, I could not calculate obesity rates for individual zip codes because there simply weren’t enough subjects at the local level even though the data set included six years of surveys.
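
A rough piece of arithmetic shows how quickly reliability falls apart with small numbers.  The sketch below, in Python, uses the textbook margin of error for a proportion under simple random sampling, which is an assumption: the real survey design would generally make the uncertainty larger, not smaller.  The obesity rate and sample sizes are made up for illustration.

```python
import math


def proportion_margin(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a proportion p from n subjects,
    assuming simple random sampling (a simplification)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    return z * math.sqrt(p * (1 - p) / n)


# Invented numbers: a 25% obesity rate estimated from 30 subjects in one
# zip code versus 3000 subjects in a larger area.
small_area = proportion_margin(0.25, 30)    # about 0.155
large_area = proportion_margin(0.25, 3000)  # about 0.0155
```

With 30 subjects, a 25 percent estimate could plausibly be anywhere from about 10 to 40 percent, which is too wide to say anything useful about a single zip code.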

Another important limitation of the survey is that it’s based upon self-report. The problem is that some people either don’t know the right answers or they make things up. Comparing self-reported height and weight to independently measured height and weight, for example, suggests that men tend to overreport their height and women tend to underreport their weight. This is a problem with all self-report surveys whether or not there is response bias.

Despite all these limitations, the BRFSS is an extremely powerful source of information. Many of us first became alarmed about obesity in the United States when we saw a presentation by Richard Jackson that included a series of maps that showed the increase in obesity by state over what was then a 15-year period. The growth in obesity was alarming. The survey provided information that was used to first connect urban sprawl to obesity and has been used in thousands of studies that have been useful at least down to the county and metropolitan area level.