# What kind of curve (or model) should I fit to my percentage data?

I am trying to create a figure which shows the relationship between viral copies and genome coverage (GCC). This is what my data looks like:

At first, I just plotted a linear regression but my supervisors told me that was incorrect, and to try a sigmoidal curve. So I did this using geom_smooth:

```r
library(scales)
ggplot(scatter_plot_new, aes(x = Copies_per_uL, y = Genome_cov, colour = Virus)) +
  geom_point() +
  scale_x_continuous(trans = log10_trans(),
                     breaks = trans_breaks("log10", function(x) 10^x),
                     labels = trans_format("log10", math_format(10^.x))) +
  geom_smooth(method = "gam", formula = y ~ s(x), se = FALSE, size = 1) +
  theme_bw() +
  theme(legend.position = "top",
        legend.text = element_text(size = 10),
        legend.title = element_text(size = 12),
        axis.text = element_text(size = 10),
        axis.title = element_text(size = 12),
        axis.title.y = element_text(margin = margin(r = 10)),
        axis.title.x = element_text(margin = margin(t = 10))) +
  labs(x = "Virus copies/µL", y = "GCC (%)") +
  scale_y_continuous(breaks = c(25, 50, 75, 100))
```

However, my supervisors say this is incorrect too, because the curves make it look like GCC can exceed 100%, which it can’t.

My question is: what is the best way to show the relationship between virus copies and GCC? I want to make it clear that A) low virus copies = low GCC, and that B) after a certain amount of virus copies the GCC plateaus.

I’ve researched a lot of different methods – GAM, LOESS, logistic, piecewise – but I don’t know how to tell which is best for my data.

EDIT: this is the data:

```
> print(scatter_plot_new)
Subsample   Virus   Genome_cov  Copies_per_uL
1   S1.1_RRAV   RRAV    100 92500
2   S1.2_RRAV   RRAV    100 95900
3   S1.3_RRAV   RRAV    100 92900
4   S2.1_RRAV   RRAV    100 4049.54
5   S2.2_RRAV   RRAV    96.9935 3809
6   S2.3_RRAV   RRAV    94.5054 3695.06
7   S3.1_RRAV   RRAV    3.7235  86.37
8   S3.2_RRAV   RRAV    11.8186 84.2
9   S3.3_RRAV   RRAV    11.0929 95.2
10  S4.1_RRAV   RRAV    0   2.12
11  S4.2_RRAV   RRAV    5.0799  2.71
12  S4.3_RRAV   RRAV    0   2.39
13  S5.1_RRAV   RRAV    4.9503  0.16
14  S5.2_RRAV   RRAV    0   0.08
15  S5.3_RRAV   RRAV    4.4147  0.08
16  S1.1_UMAV   UMAV    5.7666  1.38
17  S1.2_UMAV   UMAV    26.0379 1.72
18  S1.3_UMAV   UMAV    7.4128  2.52
19  S2.1_UMAV   UMAV    21.172  31.06
20  S2.2_UMAV   UMAV    16.1663 29.87
21  S2.3_UMAV   UMAV    9.121   32.82
22  S3.1_UMAV   UMAV    92.903  627.24
23  S3.2_UMAV   UMAV    83.0314 615.36
24  S3.3_UMAV   UMAV    90.3458 632.67
25  S4.1_UMAV   UMAV    98.6696 11180
26  S4.2_UMAV   UMAV    98.8405 12720
27  S4.3_UMAV   UMAV    98.7939 8680
28  S5.1_UMAV   UMAV    98.6489 318200
29  S5.2_UMAV   UMAV    99.1303 346100
30  S5.3_UMAV   UMAV    98.8767 345100
```

Another way to go about this would be to use a Bayesian formulation. It can be a bit heavy going to start with, but it tends to make it much easier to express the specifics of your problem, and it gives you a better idea of where the “uncertainty” lies.

Stan is a Monte Carlo sampler with a relatively easy-to-use programmatic interface. Libraries are available for R and other languages, but I’m using Python here.

We use a sigmoid like everybody else: it has biochemical motivations and is mathematically very convenient to work with. A nice parameterization for this task is:

```python
import numpy as np

def sigfn(x, alpha, beta):
    return 1 / (1 + np.exp(-(x - alpha) * beta))
```

where `alpha` defines the midpoint of the sigmoid curve (i.e. where it crosses 50%) and `beta` defines the slope; values nearer zero give a flatter curve.
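As a quick sanity check on this parameterization (the values here are arbitrary, chosen just for illustration), the curve passes through 0.5 exactly at `x = alpha`, and a larger `beta` gives a sharper transition away from the midpoint:

```python
import numpy as np

def sigfn(x, alpha, beta):
    return 1 / (1 + np.exp(-(x - alpha) * beta))

# at the midpoint x == alpha the sigmoid is exactly 0.5, whatever the slope
print(sigfn(5.5, 5.5, 3))    # 0.5

# one unit above the midpoint, a larger beta pushes the curve towards a step
print(sigfn(6.5, 5.5, 1))    # ~0.73
print(sigfn(6.5, 5.5, 10))   # ~1.0
```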

To show what this looks like, we can pull in your data and plot it with:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# raw_data.txt holds the whitespace-separated table above; pandas treats the
# unnamed first column of row numbers as the index
df = pd.read_csv('raw_data.txt', sep=r'\s+')
df.columns = ['subsample', 'virus', 'coverage', 'copies']
df.coverage /= 100  # rescale percentages to [0, 1]

x = np.logspace(-1, 6, 201)
plt.semilogx(x, sigfn(np.log(x), 5.5, 3), label='sigfn', color='C2')

sns.scatterplot(x=df.copies, y=df.coverage, hue=df.virus, edgecolor='none')
```

where `raw_data.txt` contains the data you gave, and I transformed the coverage to a 0–1 scale, which is more convenient to work with. The coefficients 5.5 and 3 look nice and give a plot very much like the other answers:
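Those hand-picked coefficients already reproduce the extremes of the data reasonably well; for example, plugging the highest and lowest RRAV copy counts from the table above into the function:

```python
import numpy as np

def sigfn(x, alpha, beta):
    return 1 / (1 + np.exp(-(x - alpha) * beta))

# S1.1_RRAV: 92500 copies/uL, observed coverage 100%
print(sigfn(np.log(92500), 5.5, 3))  # ~1.0

# S4.1_RRAV: 2.12 copies/uL, observed coverage 0%
print(sigfn(np.log(2.12), 5.5, 3))   # ~0.0
```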

To “fit” this function using Stan, we need to define our model in its own language, which is a mix between R and C++. A simple model would be something like:

```stan
data {
  int<lower=1> N;  // number of rows
  vector[N] log_copies;
  vector<lower=0,upper=1>[N] coverage;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  vector[N] mu;
  mu = 1 ./ (1 + exp(-(log_copies - alpha) * beta));

  sigma ~ cauchy(0, 0.1);
  alpha ~ normal(0, 5);
  beta ~ normal(0, 5);

  coverage ~ normal(mu, sigma);
}
```

which hopefully reads OK. We have a `data` block that defines the data we expect when we sample the model, a `parameters` block that defines the things that are sampled, and a `model` block that defines the likelihood function. You tell Stan to “compile” the model, which takes a while, and then you can sample from it with some data. For example:

```python
import pystan

# `code` is a string holding the Stan program above
model = pystan.StanModel(model_code=code)
fit = model.sampling(data=dict(
    N=len(df),
    log_copies=np.log(df.copies),
    coverage=df.coverage,
), iter=10000, chains=4, thin=10)

import arviz
arviz.plot_trace(fit)
```

`arviz` makes nice diagnostic plots easy, while printing the fit gives you a nice R-style parameter summary:

```
4 chains, each with iter=10000; warmup=5000; thin=10;
post-warmup draws per chain=500, total post-warmup draws=2000.

        mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
alpha   5.51  6.0e-3   0.26   4.96   5.36   5.49   5.64   6.12   1849    1.0
beta    2.89    0.04   1.71   1.55   1.98   2.32   2.95   8.08   1698    1.0
sigma   0.08  2.7e-4   0.01   0.06   0.07   0.08   0.09    0.1   1790    1.0
lp__   57.12    0.04   1.76   52.9   56.1  57.58  58.51  59.19   1647    1.0
```

The large standard deviation on `beta` says that the data really doesn’t provide much information about this parameter. It also suggests that the answers quoting 10+ significant digits in their model fits are overstating things somewhat.
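To see what that uncertainty in `beta` means in practice, compare the curves at the 2.5% and 97.5% posterior quantiles from the summary above, holding `alpha` at its posterior mean (the exact numbers will of course vary from run to run):

```python
import numpy as np

def sigfn(x, alpha, beta):
    return 1 / (1 + np.exp(-(x - alpha) * beta))

# beta's 2.5% and 97.5% posterior quantiles from the fit summary
for beta in (1.55, 8.08):
    # predicted coverage one log-unit below the midpoint alpha = 5.51
    print(beta, sigfn(5.51 - 1, 5.51, beta))  # ~0.17 vs ~0.0003
```

so one log-unit below the midpoint, the posterior is compatible with anything from roughly 17% coverage down to essentially zero — a genuinely wide range.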

Because some answers noted that each virus might need its own parameters, I extended the model to allow `alpha` and `beta` to vary by `Virus`. It all gets a bit fiddly, but the two viruses almost certainly have different `alpha` values (i.e. you need more copies/μL of RRAV for the same coverage), and a plot showing this is:

The data is the same as before, but I’ve drawn a curve for 40 samples from the posterior. `UMAV` seems relatively well determined, while `RRAV` could either follow the same slope and need a higher copy count, or have a steeper slope and a similar copy count. Most of the posterior mass is on needing a higher copy count, but this uncertainty might explain some of the differences between other answers finding different things.
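One way to sketch that extension (this indexing scheme is my own choice, not the only way to do it) is to pass an integer virus index per row and give each virus its own `alpha` and `beta`:

```stan
data {
  int<lower=1> N;
  int<lower=1> K;                  // number of viruses (2 here)
  int<lower=1,upper=K> virus[N];   // e.g. 1 = RRAV, 2 = UMAV
  vector[N] log_copies;
  vector<lower=0,upper=1>[N] coverage;
}
parameters {
  vector[K] alpha;
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  vector[N] mu;
  for (n in 1:N)
    mu[n] = 1 / (1 + exp(-(log_copies[n] - alpha[virus[n]]) * beta[virus[n]]));

  sigma ~ cauchy(0, 0.1);
  alpha ~ normal(0, 5);
  beta ~ normal(0, 5);

  coverage ~ normal(mu, sigma);
}
```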

I mostly used answering this as an exercise to improve my knowledge of Stan, and I’ve put a Jupyter notebook of this here in case anyone is interested or wants to replicate it.