## Original link: http://tecdat.cn/?p=22805

### Why do I need dummy variables?

Many variables, such as height and weight, are measured numerically. Categorical variables such as gender, season, and location are not. Instead, we use dummy variables to represent them in a regression.

## Example: Gender

Let’s assume that the effect of x on y differs between men and women.

For men, $y = 10 + 5x + e$.

For women, $y = 5 + x + e$.

Here e is a random error term with mean zero. So in the true relationship between y and x, gender affects both the intercept and the slope.

First, let’s generate the data we need.

```
# True slope: male = 5, female = 1 (assumes d and e already exist)
d$y <- ifelse(d$gender == 1, 10 + 5 * d$x + e, 5 + d$x + e)
```
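The line above presupposes a data frame `d` and an error vector `e`. A minimal sketch of the full data-generating step might look like the following; the seed, the sample size `n`, the range of `x`, and the error standard deviation are assumptions chosen for illustration, not values from the original.

```r
set.seed(123)                       # assumption: any seed, for reproducibility
n <- 200                            # assumption: sample size

d <- data.frame(
  gender = rbinom(n, 1, 0.5),       # 1 = male, 0 = female
  x      = runif(n, 0, 10)          # assumption: x uniform on [0, 10]
)
e <- rnorm(n, mean = 0, sd = 1)     # random error with mean zero

# True model: men y = 10 + 5x + e, women y = 5 + x + e
d$y <- ifelse(d$gender == 1, 10 + 5 * d$x + e, 5 + d$x + e)

d$gender <- factor(d$gender)        # treat gender as categorical from here on
head(d)
```

Converting `gender` to a factor after computing `y` lets `lm()` and plotting functions treat it as a category rather than a number.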

Next, we can plot y against x and color the points by gender, for example with ggplot2:

`ggplot(d, aes(x, y, color = gender)) + geom_point()`

Clearly, the relationship between y and x should not be described by a single line. We need two: one for men and one for women.

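If we regress y only on x and gender, gender can shift the intercept but not the slope, and the estimated coefficient of x is incorrect.

The correct specification lets gender affect both the intercept and the slope, through an interaction between x and gender.

Alternatively, the dummy variable and its product with x can be added to the model by hand.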
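The three specifications described here can be sketched as follows. The original model code is not shown, so the formulas below are one plausible way to write them; the simulated data, the seed, and the helper column `male` are assumptions made for this example.

```r
set.seed(123)                                  # assumption: illustrative seed
n <- 500                                       # assumption: sample size
gender <- rbinom(n, 1, 0.5)                    # 1 = male, 0 = female
x <- runif(n, 0, 10)
e <- rnorm(n)
y <- ifelse(gender == 1, 10 + 5 * x + e, 5 + x + e)
d <- data.frame(x, y, gender = factor(gender))

# 1. Gender shifts only the intercept: the slope of x is forced to be common
m1 <- lm(y ~ x + gender, data = d)

# 2. Gender changes intercept and slope: x * gender expands to x + gender + x:gender
m2 <- lm(y ~ x * gender, data = d)

# 3. Same model with a hand-built 0/1 dummy (hypothetical column name `male`)
d$male <- as.numeric(d$gender == "1")
m3 <- lm(y ~ x + male + I(x * male), data = d)

coef(m1)
coef(m2)
coef(m3)
```

In `m1` the single slope lands between the two true slopes, which is why it is misleading. `m2` and `m3` are the same model written two ways, so their coefficients agree.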

The model shows that for women (gender = 0), the estimated relationship is $y = 5.20 + 0.99x$. For men (gender = 1), the estimated relationship is $y = 5.20 + 0.99x + 4.5 + 4.02x$, that is, $y = 9.70 + 5.01x$, which is quite close to the true relationship.
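The arithmetic above follows from writing the interaction model with a dummy $g$ (1 for men, 0 for women). The coefficient names $\beta_0, \dots, \beta_3$ are notation introduced here for illustration, not taken from the original:

$$
\begin{aligned}
y &= \beta_0 + \beta_1 x + \beta_2 g + \beta_3 (x \cdot g) + e \\
g = 0:\quad y &= \beta_0 + \beta_1 x + e \\
g = 1:\quad y &= (\beta_0 + \beta_2) + (\beta_1 + \beta_3)\,x + e
\end{aligned}
$$

With the estimates quoted above, $\beta_0 + \beta_2 = 5.20 + 4.5 = 9.70$ and $\beta_1 + \beta_3 = 0.99 + 4.02 = 5.01$.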

Next, let’s try two dummy variables: gender and location.

## Dummy variables for gender and location

### Gender is not important, but location is important

Let’s generate some data in which gender is not important but location is.

Plot the relationship between x and y, coloring the points by gender and splitting the panels by location, for example with ggplot2:

`ggplot(d, aes(x, y, color = gender)) + geom_point() + facet_grid(~ location)`

Gender does not appear to have a significant effect on y. But when you compare the Chicago data with the Toronto data, both the intercept and the slope differ.

If we ignore gender and location and regress y on x alone, the R-squared is quite low.

We know that gender is not important, but we still add it to see whether it makes a difference.

As expected, the effect of gender is not significant.

Now let’s look at the impact of location.

The impact of location is large. But this model specification only lets location change the intercept.

What if location changes the intercept and the slope at the same time? We can try that as well, by interacting x with location.

In this data set, gender is not important, and location changes both the intercept and the slope.
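The sequence of models discussed in this section can be sketched as follows. The original data and model code are not shown, so everything here is an assumption for illustration: the seed, the sample size, and in particular the coefficients used for Chicago and Toronto are hypothetical.

```r
set.seed(42)                                   # assumption: illustrative seed
n <- 400                                       # assumption: sample size
d <- data.frame(
  x        = runif(n, 0, 10),
  gender   = factor(rbinom(n, 1, 0.5)),
  location = factor(sample(c("Chicago", "Toronto"), n, replace = TRUE))
)
e <- rnorm(n)

# Hypothetical coefficients: gender has no effect, location changes both
# the intercept and the slope
d$y <- ifelse(d$location == "Chicago", 2 + 2 * d$x + e, 10 + 5 * d$x + e)

m_x   <- lm(y ~ x, data = d)                   # ignores gender and location
m_g   <- lm(y ~ x + gender, data = d)          # adds gender
m_loc <- lm(y ~ x + location, data = d)        # location shifts the intercept only
m_int <- lm(y ~ x * location, data = d)        # location shifts intercept and slope

# Compare the fit of the four specifications
r2 <- sapply(list(x = m_x, gender = m_g, location = m_loc, interaction = m_int),
             function(m) summary(m)$r.squared)
r2
```

`summary(m_g)` shows the gender coefficient and its p-value, and `coef(m_int)` gives the Chicago line directly and the Toronto line as offsets from it, because `lm()` takes the alphabetically first level as the reference.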

### Gender and location are important, two locations

Now let’s generate data in which both gender and location matter. Let’s start with two locations.

```
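# Note: the first branch (gender "1" in Toronto) is missing from the source,
# so those rows fall through to NA here.
d$y <- ifelse(d$gender == "0" & d$location == "Toronto", 1 + 1 * d$x + e,
       ifelse(d$gender == "1" & d$location == "Chicago", 20 + 2 * d$x + e,
       ifelse(d$gender == "0" & d$location == "Chicago", 2 + 2 * d$x + e, NA)))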
```

`ggplot(d, aes(x, y, color = gender)) + geom_point() + facet_grid(~ location)`
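A model for this two-location case could be sketched as below. Three of the four branches use the coefficients given above. The branch for gender "1" in Toronto is not available in the source, so `10 + 1 * x` is a hypothetical stand-in, as are the seed and sample size.

```r
set.seed(1)                                    # assumption: illustrative seed
n <- 400                                       # assumption: sample size
d <- data.frame(
  x        = runif(n, 0, 10),
  gender   = sample(c("0", "1"), n, replace = TRUE),
  location = sample(c("Toronto", "Chicago"), n, replace = TRUE)
)
e <- rnorm(n)

d$y <- ifelse(d$gender == "1" & d$location == "Toronto", 10 + 1 * d$x + e,  # hypothetical branch
       ifelse(d$gender == "0" & d$location == "Toronto", 1 + 1 * d$x + e,
       ifelse(d$gender == "1" & d$location == "Chicago", 20 + 2 * d$x + e,
       ifelse(d$gender == "0" & d$location == "Chicago", 2 + 2 * d$x + e, NA))))

# Full interaction: every gender/location cell gets its own intercept and slope
fit <- lm(y ~ x * gender * location, data = d)

# Recover the fitted line for each cell from predictions at x = 0 and x = 1
cells <- expand.grid(gender = c("0", "1"),
                     location = c("Toronto", "Chicago"),
                     stringsAsFactors = FALSE)
at0 <- unname(predict(fit, newdata = data.frame(cells, x = 0)))
at1 <- unname(predict(fit, newdata = data.frame(cells, x = 1)))
cells$intercept <- at0
cells$slope     <- at1 - at0
cells
```

Reading intercepts and slopes off predictions avoids adding up interaction coefficients by hand, which gets error-prone once there are several dummies.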

### Gender and location are important, five locations

Finally, let’s try a model with five locations.

```
# Fragment: the opening branches and the end of this expression are missing from the source.
ifelse(d$gender == "1" & d$location == "Chicago", 2 + 10 * d$x + e,
ifelse(d$gender == "0" & d$location == "Chicago", 2 + 2 * d$x + e,
ifelse(d$gender == "1" & d$location == "New York", 3 + 15 * d$x + e,
ifelse(d$gender == "0" & d$location == "New York", 3 + 5 * d$x + e,
ifelse(d$gender == "1" & d$location == "Beijing", 8 + 30 * d$x + e,
ifelse(d$gender == "0" & d$location == "Beijing", 8 + 2 * d$x + e,
ifelse(d$gender == "1" & d$location == "Shanghai",
# ... (truncated in the source)
```

`ggplot(d, aes(x, y, color = gender)) + geom_point() + facet_grid(~ location)`
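With five locations, deeply nested `ifelse()` calls become hard to read. One alternative, sketched here and not taken from the original, is to keep the coefficients in a lookup table and merge it onto the data. The Chicago, New York, and Beijing values come from the fragment above. The Shanghai values are truncated in the source and the fifth location is not shown, so the Shanghai and Toronto rows (and the choice of Toronto itself) are hypothetical.

```r
set.seed(7)                                    # assumption: illustrative seed

# Intercept and slope per location/gender cell
coefs <- data.frame(
  location  = rep(c("Toronto", "Chicago", "New York", "Beijing", "Shanghai"), each = 2),
  gender    = rep(c("1", "0"), times = 5),
  intercept = c(1, 1,   2, 2,   3, 3,   8, 8,   6, 6),  # Toronto, Shanghai: hypothetical
  slope     = c(4, 1,  10, 2,  15, 5,  30, 2,  20, 3)   # Toronto, Shanghai: hypothetical
)

n <- 1000                                      # assumption: sample size
d <- data.frame(
  x        = runif(n, 0, 10),
  gender   = sample(c("0", "1"), n, replace = TRUE),
  location = sample(unique(coefs$location), n, replace = TRUE)
)

# Attach each row's true intercept and slope, then build y
d <- merge(d, coefs, by = c("location", "gender"))
d$y <- d$intercept + d$slope * d$x + rnorm(n)

# Full interaction model: one intercept and one slope per cell
fit <- lm(y ~ x * gender * location, data = d)
length(coef(fit))   # 2 genders x 5 locations x (intercept + slope) = 20
```

The lookup table keeps the true parameters in one place, which makes it easy to compare them with the fitted values afterwards.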

Therefore, if you think that some factors (gender, location, season, and so on) may affect your response variable or its relationship with the explanatory variables, include them in the model as dummy variables.
