I am trying to create a figure which shows the relationship between viral copies and genome coverage (GCC). This is what my data looks like:

At first, I just plotted a linear regression but my supervisors told me that was incorrect, and to try a sigmoidal curve. So I did this using geom_smooth:

```
library(scales)

ggplot(scatter_plot_new, aes(x = Copies_per_uL, y = Genome_cov, colour = Virus)) +
  geom_point() +
  scale_x_continuous(trans = log10_trans(),
                     breaks = trans_breaks("log10", function(x) 10^x),
                     labels = trans_format("log10", math_format(10^.x))) +
  geom_smooth(method = "gam", formula = y ~ s(x), se = FALSE, size = 1) +
  theme_bw() +
  theme(legend.position = 'top',
        legend.text = element_text(size = 10),
        legend.title = element_text(size = 12),
        axis.text = element_text(size = 10),
        axis.title = element_text(size = 12),
        axis.title.y = element_text(margin = margin(r = 10)),
        axis.title.x = element_text(margin = margin(t = 10))) +
  labs(x = "Virus copies/µL", y = "GCC (%)") +
  scale_y_continuous(breaks = c(25, 50, 75, 100))
```

However, my supervisors say this is incorrect too because the curves make it look like GCC can go over 100%, which it can’t.

My question is: what is the best way to show the relationship between virus copies and GCC? I want to make it clear that A) low virus copies = low GCC, and that B) after a certain amount of virus copies the GCC plateaus.

I’ve researched a lot of different methods – GAM, LOESS, logistic, piecewise – but I don’t know how to tell what is the best method for my data.

EDIT: this is the data:

```
> print(scatter_plot_new)
   Subsample Virus Genome_cov Copies_per_uL
1  S1.1_RRAV  RRAV    100       92500
2  S1.2_RRAV  RRAV    100       95900
3  S1.3_RRAV  RRAV    100       92900
4  S2.1_RRAV  RRAV    100        4049.54
5  S2.2_RRAV  RRAV     96.9935   3809
6  S2.3_RRAV  RRAV     94.5054   3695.06
7  S3.1_RRAV  RRAV      3.7235     86.37
8  S3.2_RRAV  RRAV     11.8186     84.2
9  S3.3_RRAV  RRAV     11.0929     95.2
10 S4.1_RRAV  RRAV      0           2.12
11 S4.2_RRAV  RRAV      5.0799      2.71
12 S4.3_RRAV  RRAV      0           2.39
13 S5.1_RRAV  RRAV      4.9503      0.16
14 S5.2_RRAV  RRAV      0           0.08
15 S5.3_RRAV  RRAV      4.4147      0.08
16 S1.1_UMAV  UMAV      5.7666      1.38
17 S1.2_UMAV  UMAV     26.0379      1.72
18 S1.3_UMAV  UMAV      7.4128      2.52
19 S2.1_UMAV  UMAV     21.172      31.06
20 S2.2_UMAV  UMAV     16.1663     29.87
21 S2.3_UMAV  UMAV      9.121      32.82
22 S3.1_UMAV  UMAV     92.903     627.24
23 S3.2_UMAV  UMAV     83.0314    615.36
24 S3.3_UMAV  UMAV     90.3458    632.67
25 S4.1_UMAV  UMAV     98.6696  11180
26 S4.2_UMAV  UMAV     98.8405  12720
27 S4.3_UMAV  UMAV     98.7939   8680
28 S5.1_UMAV  UMAV     98.6489 318200
29 S5.2_UMAV  UMAV     99.1303 346100
30 S5.3_UMAV  UMAV     98.8767 345100
```

**Answer**

Another way to go about this would be to use a Bayesian formulation. It can be a bit heavy going to start with, but it tends to make it much easier to express the specifics of your problem, as well as giving a better idea of where the “uncertainty” is.

Stan is a Monte Carlo sampler with a relatively easy-to-use programmatic interface. Libraries are available for R and other languages, but I’m using Python here.

We use a sigmoid like everybody else: it has biochemical motivations as well as being mathematically very convenient to work with. A nice parameterization for this task is:

```
import numpy as np

def sigfn(x, alpha, beta):
    return 1 / (1 + np.exp(-(x - alpha) * beta))
```

where `alpha` defines the midpoint of the sigmoid curve (i.e. where it crosses 50%) and `beta` defines the slope; values nearer zero are flatter.
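To see how the parameters act, here is a quick check of `sigfn` at and around the midpoint (the function repeated from above so the snippet stands alone):

```python
import numpy as np

def sigfn(x, alpha, beta):
    return 1 / (1 + np.exp(-(x - alpha) * beta))

# at x == alpha the curve crosses exactly 50%
print(sigfn(5.5, 5.5, 3))    # 0.5
# a larger beta makes the transition steeper around the midpoint
print(sigfn(6.0, 5.5, 3))    # ~0.82
print(sigfn(6.0, 5.5, 10))   # ~0.99
```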

To show what this looks like, we can pull in your data and plot it with:

```
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# read the data pasted above and give the columns shorter names
df = pd.read_table('raw_data.txt', delim_whitespace=True)
df.columns = ['subsample', 'virus', 'coverage', 'copies']
df.coverage /= 100  # rescale coverage from percent to [0, 1]

# overlay a hand-tuned sigmoid on the raw data
x = np.logspace(-1, 6, 201)
plt.semilogx(x, sigfn(np.log(x), 5.5, 3), label='sigfn', color='C2')
sns.scatterplot(x=df.copies, y=df.coverage, hue=df.virus, edgecolor='none')
```

where `raw_data.txt` contains the data you gave, and I transformed the coverage to a 0–1 scale, which is more useful here. The coefficients 5.5 and 3 look nice and give a plot very much like the other answers:

To “fit” this function using Stan, we need to define our model in its own language, which is a mix between R and C++. A simple model would be something like:

```
data {
  int<lower=1> N;  // number of rows
  vector[N] log_copies;
  vector<lower=0,upper=1>[N] coverage;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  vector[N] mu;
  mu = 1 ./ (1 + exp(-(log_copies - alpha) * beta));
  sigma ~ cauchy(0, 0.1);
  alpha ~ normal(0, 5);
  beta ~ normal(0, 5);
  coverage ~ normal(mu, sigma);
}
```

which hopefully reads OK. We have a `data` block that defines the data we expect when we sample the model, `parameters` defines the things that are sampled, and `model` defines the likelihood function. You tell Stan to “compile” the model, which takes a while, and then you can sample from it with some data. For example:

```
import pystan

# `code` is a string holding the Stan program above
model = pystan.StanModel(model_code=code)
fit = model.sampling(data=dict(
    N=len(df),
    log_copies=np.log(df.copies),
    coverage=df.coverage,
), iter=10000, chains=4, thin=10)

import arviz
arviz.plot_trace(fit)
```

`arviz` makes nice diagnostic plots easy, while printing the fit gives you a nice R-style parameter summary:

```
4 chains, each with iter=10000; warmup=5000; thin=10;
post-warmup draws per chain=500, total post-warmup draws=2000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
alpha 5.51 6.0e-3 0.26 4.96 5.36 5.49 5.64 6.12 1849 1.0
beta 2.89 0.04 1.71 1.55 1.98 2.32 2.95 8.08 1698 1.0
sigma 0.08 2.7e-4 0.01 0.06 0.07 0.08 0.09 0.1 1790 1.0
lp__ 57.12 0.04 1.76 52.9 56.1 57.58 58.51 59.19 1647 1.0
```

The large standard deviation on `beta` says that the data really doesn’t provide much information about this parameter. Also, some of the answers giving 10+ significant digits in their model fits are overstating things somewhat.

Because some answers noted that each virus might need its own parameters, I extended the model to allow `alpha` and `beta` to vary by `Virus`. It all gets a bit fiddly, but the two viruses almost certainly have different `alpha` values (i.e. you need more copies/µL of RRAV for the same coverage), and a plot showing this is:

The data is the same as before, but I’ve drawn a curve for 40 samples of the posterior. `UMAV` seems relatively well determined, while `RRAV` could follow the same slope and need a higher copy count, or have a steeper slope and a similar copy count. Most of the posterior mass is on needing a higher copy count, but this uncertainty might explain some of the differences in other answers finding different things.
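For reference, the per-virus extension can be sketched in Stan roughly as follows. This is my own reconstruction, not necessarily what the notebook uses; it simply indexes `alpha` and `beta` by a virus ID passed in with the data:

```
data {
  int<lower=1> N;                 // number of rows
  int<lower=1> V;                 // number of viruses
  int<lower=1,upper=V> virus[N];  // virus index per row
  vector[N] log_copies;
  vector<lower=0,upper=1>[N] coverage;
}
parameters {
  vector[V] alpha;
  vector[V] beta;
  real<lower=0> sigma;
}
model {
  vector[N] mu;
  for (n in 1:N)
    mu[n] = 1 / (1 + exp(-(log_copies[n] - alpha[virus[n]]) * beta[virus[n]]));
  sigma ~ cauchy(0, 0.1);
  alpha ~ normal(0, 5);
  beta ~ normal(0, 5);
  coverage ~ normal(mu, sigma);
}
```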

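The posterior-curve overlay can be sketched like this. With pystan you would pull the draws from `fit.extract()`; here I use stand-in normal draws (my own choice, roughly matching the summary table above) so the snippet runs on its own:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend, no display needed
import matplotlib.pyplot as plt

def sigfn(x, alpha, beta):
    return 1 / (1 + np.exp(-(x - alpha) * beta))

# stand-in posterior draws; with pystan you would instead do e.g.
#   samples = fit.extract()
#   alphas, betas = samples['alpha'][:40], samples['beta'][:40]
rng = np.random.default_rng(0)
alphas = rng.normal(5.5, 0.26, 40)
betas = rng.normal(2.9, 0.5, 40)

# one translucent curve per posterior draw shows the spread
x = np.logspace(-1, 6, 201)
for a, b in zip(alphas, betas):
    plt.semilogx(x, sigfn(np.log(x), a, b), color='C2', alpha=0.1)
plt.xlabel('copies/µL')
plt.ylabel('coverage')
plt.savefig('posterior_curves.png')
```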
I mostly used answering this as an exercise to improve my knowledge of Stan, and I’ve put a Jupyter notebook of this here in case anyone is interested/wants to replicate this.

**Attribution**
*Source: Link, Question Author: teaelleceecee, Answer Author: Sam Mason*