My experiment is the following:

- Temperature 26C vs. Temperature 30C
- Treatment Saline vs. Treatment BMC
- Timepoints 0, 2, and 3.

My intention is to create GLM model to look at differential abundance between groups:

- 26C vs. 30C for Saline treatment
- 26C vs. 30C for BMC treatment
- Saline vs. BMC at 26C
- Saline vs. BMC at 30C

In doing so, I would like to include all the samples possible when calculating the dispersion which is why I'm using a GLM instead of a Fisher's Exact Test for subsets of samples. I would also like to incorporate the ordered time information.

I'm trying to understand why my GLM model is failing. I tried using all of my data including the baseline T0 which was an issue with my design matrix because it was not of full rank. I understand why the baseline was an issue as it didn't add any information for any of the other categories. However, I'm not sure why this occurs for `Temperature_30C`

.

Here's my data, code, and experiment design. I'm using Python as my main environment with an R package called `edgeR`

in the backend using an interface called `rpy2`

if this helps having this info.

```
from rpy2 import robjects as ro
from rpy2 import rinterface as ri
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
R = ro.r
NULL = ri.NULL
random_state=0
ro.r('set.seed')(random_state)
# Package
edgeR = importr("edgeR")
limma = importr("limma")
def pandas_to_rpy2(df, rpy2_version_major=None):
if rpy2_version_major is None:
from rpy2 import __version__ as rpy2_version
rpy2_version_major = int(rpy2_version.split(".")[0])
if rpy2_version_major == 2: # v2.x
return pandas2ri.py2ri(df)
if rpy2_version_major == 3: # v3.x
from rpy2.robjects.conversion import localconverter
with localconverter(ro.default_converter + pandas2ri.converter):
return ro.conversion.py2rpy(df)
def build_glm_model(X, design_matrix, random_state=0):
components = X.columns
# Run edgeR
r_X = pandas_to_rpy2(X.T)
r_components = ro.vectors.StrVector(components)
r_design = pandas_to_rpy2(design_matrix)
# Give DGEList the gene expression table with groups
model = edgeR.DGEList(
counts= r_X,
genes = r_components,
# group = r_y,
)
# Normalize the expression values
# if kws["calculate_normalization_factors"]:
model = edgeR.calcNormFactors(model)
# Calculate dispersion
model = edgeR.estimateGLMCommonDisp(model, r_design)
# if kws["estimate_glm_trended_dispersion"]:
model = edgeR.estimateGLMTrendedDisp(model, r_design)
# Estimate tagwise dispersion
# if kws["estimate_glm_tagwise_dispersion"]:
model = edgeR.estimateGLMTagwiseDisp(model, r_design)
# Fit
fit = edgeR.glmFit(model, r_design)
return (model, fit)
# Counts
X = pd.read_csv("https://pastebin.com/raw/J7kmL8Ly", sep="\t", index_col=0)
# Metadata
df_metadata = pd.read_csv("https://pastebin.com/raw/PANaC3r5", sep="\t", index_col=0)
design_matrix = pd.concat([
pd.get_dummies(df_metadata["treatment"].fillna("Baseline"), prefix="Treatment"),
pd.get_dummies(df_metadata["temperature"], prefix="Temperature"),
df_metadata["collection_time_numeric"].to_frame("t"),
], axis=1).astype(int)
# Remove the baseline samples
design_matrix = design_matrix.drop("Treatment_Baseline", axis=1).query("t > 0")
# Create the GLM model
model, fit = build_glm_model(X.loc[design_matrix.index], design_matrix)
# RRuntimeError: Error in glmFit.default(y, design = design, dispersion = dispersion, offset = offset, :
# Design matrix not of full rank. The following coefficients not estimable:
# Temperature_30C
```

Also posted on Cross Validated https://stats.stackexchange.com/questions/521917/

Answered by Gordon here: How to structure my edgeR model to incorporate a temporal variable and 2 categorical variables?