7
$\begingroup$

I don’t know much Probability Theory beyond the undergraduate level.

I was trying to model a simple scenario with my family. What is the probability I will develop type 1 diabetes in the following years?

I did some research on the internet, and it seems that the onset age for those with type 1 diabetes looks like this: [figure: graph of age at onset of type 1 diabetes]

The disease is much more likely to develop during infancy and adolescence: the curve peaks during your teens, then drops and flattens out during adulthood. Maybe this graph is a little imprecise, and I need to read more papers, but that is beside the point here.

I figure that this is the graph of the PDF whose associated random variable returns the age at which you will develop the disease, assuming you will develop the disease in the first place; besides, doesn’t this remind us of survival analysis?

However, not everyone develops the disease. The general incidence is 3%, but, in my case, since I already have a diabetic sibling, the (conditional) probability I will develop the disease is 7%.

I would like to have a curve that when integrated over an interval tells me the probability I will develop the disease. In my mind, I should simply take the previous graph, whose integral over the full domain was 1 (it is a PDF, after all) and rescale it, so that its integral is 0.07 in my case, and 0.03 for a general patient.

Is my line of reasoning correct? Does this thing have a name? It looks like a density function with all of the properties of an honest PDF, except that $\int _{-\infty} ^\infty f(x) dx \neq 1$.

$\endgroup$
4
  • 2
    $\begingroup$ Well, this looks like a conditional probability. That is: Given that a patient does contract the disease, what is the probability that it happens in year $i$. You could divide by the total integral to make it into a standard probability. $\endgroup$ Commented Jun 11 at 11:06
  • 3
    $\begingroup$ Alternatively, you could add an atom representing the case where the patient never contracts the disease. Thus, making your distribution into a mixed continuous/discrete sort of thing. But I think that viewing it as a conditional probability better captures what you are after. $\endgroup$ Commented Jun 11 at 11:12
  • 3
    $\begingroup$ A curve does not have to be a PDF to convey useful information. In this case, I guess the graph would be more useful if it showed how likely a random person is to develop diabetes at age $i$ given that they are alive at that age. If that's the case, it doesn't even have to add up to the general incidence, since not every age $i$ is populated by the same number of people. Your interpretation, that the graph is the conditional probability given that you get the disease sooner or later, is less useful IMO, since it also mixes in the likelihood of being alive. $\endgroup$ Commented Jun 11 at 13:52
  • 1
    $\begingroup$ Consider also that rescaling to $7\%$ to account for family history might not be correct. The shape of the function (whichever interpretation of it is correct) may change too: you could be more likely (or less likely) to develop diabetes earlier or later than someone with no family history. $\endgroup$ Commented Jun 11 at 13:57

4 Answers

8
$\begingroup$

The other answers cover everything except your question at the end: "Does this thing have a name?". Yes, it is called a defective or improper distribution. For example, William Feller in An Introduction to Probability Theory and Its Applications (Volume 1, Chapter XIII on "Recurrent Events. Renewal Theory") discusses a situation very similar to yours, where $f_n$ is the probability that $T=n$ and the $f_n$ sum to $f<1$, and states "Then $T$ is an improper or defective random variable, which with probability $1-f$ does not assume a numerical value". In my experience, "defective" is used when the probabilities sum to a value less than unity, and "improper" when the probabilities sum or integrate to infinity. In Bayesian analysis, an improper or defective distribution can often be used as a prior distribution (see Wikipedia).
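To make the "defective" idea concrete, here is a minimal sketch of the rescaling described in the question. The triangular shape of the onset curve (support 0 to 60, peak at 13) is purely an illustrative assumption, not data; only the 7% figure comes from the question. The rescaled curve integrates to 0.07, and the missing mass 0.93 is the probability that onset never happens.

```python
# Hypothetical onset-age shape: triangular on [0, 60] with its peak at 13.
# These three numbers are illustrative assumptions, not data from the post.
LOW, PEAK, HIGH = 0.0, 13.0, 60.0
LIFETIME_RISK = 0.07  # the 7% figure quoted in the question


def onset_pdf(t):
    """Proper PDF of onset age, conditional on ever developing the disease."""
    if t < LOW or t > HIGH:
        return 0.0
    if t <= PEAK:
        return 2.0 * (t - LOW) / ((HIGH - LOW) * (PEAK - LOW))
    return 2.0 * (HIGH - t) / ((HIGH - LOW) * (HIGH - PEAK))


def defective_density(t):
    """Rescaled curve: integrates to LIFETIME_RISK instead of 1."""
    return LIFETIME_RISK * onset_pdf(t)


def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


total_mass = integrate(defective_density, LOW, HIGH)   # total probability of ever having onset
never_mass = 1.0 - total_mass                          # mass of the "never" outcome
p_20_to_30 = integrate(defective_density, 20.0, 30.0)  # probability of onset between ages 20 and 30

print(f"total mass   = {total_mass:.4f}")
print(f"never mass   = {never_mass:.4f}")
print(f"P(20<=T<=30) = {p_20_to_30:.4f}")
```

Integrating `defective_density` over any age interval gives the unconditional probability of onset in that interval, which is what the question asks for, under the assumption that family history changes only the scale and not the shape of the curve.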

$\endgroup$
6
$\begingroup$

The answer is no. A probability density function must integrate to $1$, since by the definition of a probability density function

$$\mathbb P(X\in A)=\int_Ap(x)\, dx$$

for every nice set $A$. What "nice" means is a bit more technical, and since you don't have a strong background I'll just link the Wikipedia page. All you need to know is that the real numbers are a nice set, so you must have $$1=\mathbb P(X\in \mathbb R) =\int_{-\infty}^\infty p(x)\,dx$$

Now, what does this imply for a real-world application such as yours? For one, it could be, as the comments suggested, that the graph shows the probability that you contract diabetes at age $t$ given that you are one of the people who contract diabetes. I don't think that's it, as that doesn't seem like a very useful graph for the general public. A much more reasonable interpretation, in my opinion, is that even if you live $100$ years, you could still develop diabetes out of nowhere (admittedly, with a very low probability). So you can't just cut off the probability at some point: you would have to choose the point at which the probability of being alive equals $0$, and it's not clear how you would do that. What researchers do instead is assume in the model that if you were to live forever, then you would definitely contract diabetes (the actual term would be almost surely).

So if you want to calculate the probability of contracting diabetes, assuming you're not immortal, you could let $X$ be the age at which you contract diabetes (given by the above graph) and $Y$ be the age at which you die. Then the fact that some people never contract diabetes is expressed by

$$\mathbb P(X<Y)<1$$

i.e. the probability that you contract diabetes before you die is less than one.
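As a rough illustration of this setup, $\mathbb P(X<Y)$ can be estimated by simulation. Everything numeric below is a made-up assumption chosen only to make the sketch run: $X$ is exponential with a constant rate of 0.001 per year (so it is finite almost surely, as the model requires), $Y$ is Weibull with scale 85 and shape 8, and the two are taken to be independent.

```python
import random

# All parameters are hypothetical, chosen for illustration only.
ONSET_RATE = 0.001                   # constant onset hazard per year for X
LIFE_SCALE, LIFE_SHAPE = 85.0, 8.0   # Weibull scale and shape for the age at death Y
N_SAMPLES = 200_000

rng = random.Random(0)  # fixed seed so the run is reproducible


def onset_before_death(rng):
    """Draw one independent (X, Y) pair and report whether X < Y."""
    onset_age = rng.expovariate(ONSET_RATE)                  # X: proper, finite a.s.
    death_age = rng.weibullvariate(LIFE_SCALE, LIFE_SHAPE)   # Y: age at death
    return onset_age < death_age


hits = sum(onset_before_death(rng) for _ in range(N_SAMPLES))
p_onset_before_death = hits / N_SAMPLES

print(f"Estimated P(X < Y) = {p_onset_before_death:.4f}")
```

Even though $X$ itself has a proper distribution here, the estimate comes out well below 1, because most simulated lifetimes end before onset would have occurred.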

Now, there's the whole old discussion of what probability means, and whether there are reasonable models for real life and so on. For instance, what does it mean that at $100000000$ years old I can still contract diabetes if I will surely not live that long? Models will almost always have some dissonance with real life; we just have to decide which dissonance we find tolerable and which we don't.

$\endgroup$
2
$\begingroup$

The "risk" or "hazard" function is, at each point in time, a PDF-like quantity for the probability of an event, but conditional upon the event not yet having happened. Think about your example. The data you looked up is likely the percentage of people $t$ years old who develop the disease within the year. They have not yet had the disease. That's why it’s called onset. It’s the same way for mortality. If you look up a life table, it will give you the proportion of people aged $t$ who died within the year. They can only die once, so the probability is conditioned on the time of death being greater than $t$.

$$ h(t) = \lim_{r\to 0} r^{-1} P( t< T < t + r \; | \; T>t ) $$

If we multiply a hazard function by its survival function, $S(t)=P(T>t)$ (or complementary CDF, as it's also called), we do get a density: $$ f(t) = h(t)S(t) = \lim_{r\to 0} r^{-1} P( t< T < t + r ) $$

Notice that $f(t) = -\frac{d}{dt} S(t)$,

thus $$ \frac{d}{dt} S(t) = -h(t) S(t) $$

so $$ S(t) = \exp\left( -\int_0^t h(u)\, du \right) $$
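The relations above can be checked numerically. The sketch below assumes a hypothetical decaying hazard $h(t) = a e^{-bt}$ with $a = 0.01$ and $b = 0.1$ (made-up values), builds $S(t)$ from the cumulative hazard, and confirms that $f = hS$ integrates to $1 - S(t_{\max})$. Because this particular hazard has a finite total integral $a/b$, $S$ levels off near $e^{-a/b} > 0$, so the density integrates to less than 1, which is the situation in the question.

```python
import math

# Hypothetical hazard parameters, for illustration only.
A, B = 0.01, 0.1


def hazard(t):
    """Decaying hazard h(t) = A * exp(-B * t)."""
    return A * math.exp(-B * t)


def survival_curve(hazard_fn, t_max, steps):
    """Return grid times and S(t) = exp(-integral of the hazard from 0 to t)."""
    dt = t_max / steps
    times, surv = [0.0], [1.0]   # S(0) = 1
    cumulative = 0.0
    for i in range(steps):
        # Midpoint rule for the cumulative hazard over one grid cell.
        cumulative += hazard_fn((i + 0.5) * dt) * dt
        times.append((i + 1) * dt)
        surv.append(math.exp(-cumulative))
    return times, surv


T_MAX, STEPS = 100.0, 10_000
times, surv = survival_curve(hazard, T_MAX, STEPS)

# Density f(t) = h(t) * S(t) on the grid, integrated with the trapezoid rule.
dens = [hazard(t) * s for t, s in zip(times, surv)]
dt = T_MAX / STEPS
density_mass = sum(0.5 * (dens[i] + dens[i + 1]) * dt for i in range(STEPS))

print(f"S(t_max)        = {surv[-1]:.6f}")
print(f"integral of f   = {density_mass:.6f}")
print(f"1 - S(t_max)    = {1.0 - surv[-1]:.6f}")
```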

$\endgroup$
0
$\begingroup$

The graph you've sketched looks a little like this graph of incidence rates*: [figure: incidence rates by age and gender, from Rogers et al. (2017)]

Assuming that this is the graph you're thinking of, it shows the number of people diagnosed with diabetes per year per 100,000 persons. (In this case for the period 2001-2015, and pooled into age categories 5 years wide.)

"I figure that this is the graph of the PDF whose associated random variable returns the age at which you will develop the disease, assuming you will develop the disease in the first place;"

Not quite, though it is closely related. It is the graph showing the proportion of people of any given age and gender who will be first diagnosed with diabetes in the next year. For example, approximately 25 in every 100,000 fifty-five-year-old men will be diagnosed with diabetes.

So assuming that diagnostic criteria, treatment options, and population dynamics don't change much in the coming decades (a very dubious assumption, but I can't predict the future so I have to work with past data), then: conditional on you reaching the age of 55 without already being diagnosed, your probability of being diagnosed with diabetes before the age of 56 is about 25/100,000.

This is a little higher than the unconditional probability that you will be diagnosed for the first time at age 55: you could die before reaching 55, or you could be diagnosed at an earlier age and so be excluded from that category. To convert this chart into the PDF you want, you would need to account for both of these effects.
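One way to sketch that conversion in discrete yearly steps is shown below. The flat incidence of 25 per 100,000 at every age and the flat yearly mortality of 0.002 are placeholder assumptions, not values read off the chart, and the sketch also assumes that mortality does not depend on diagnosis status and that diagnosis in a given year is counted before death in that year.

```python
def first_diagnosis_probabilities(incidence, mortality):
    """Unconditional probability of a first diagnosis at each age.

    incidence[a]: P(diagnosed during year a | alive and undiagnosed at age a)
    mortality[a]: P(dying during year a | alive at age a)
    """
    at_risk = 1.0  # probability of being alive and undiagnosed at the start of the year
    probs = []
    for q, m in zip(incidence, mortality):
        probs.append(at_risk * q)
        # To stay at risk next year: not diagnosed this year and not dead.
        at_risk *= (1.0 - q) * (1.0 - m)
    return probs


# Hypothetical inputs for ages 0..55 (placeholders, not data from the chart).
AGES = range(0, 56)
incidence = [25 / 100_000 for _ in AGES]
mortality = [0.002 for _ in AGES]

probs = first_diagnosis_probabilities(incidence, mortality)
conditional_at_55 = incidence[55]   # given alive and undiagnosed at 55
unconditional_at_55 = probs[55]     # seen from birth

print(f"conditional at 55   = {conditional_at_55:.6f}")
print(f"unconditional at 55 = {unconditional_at_55:.6f}")
```

The values in `probs` sum to the lifetime probability of diagnosis up to the last age considered, which is generally less than 1, so the resulting curve is again a density with total mass below 1.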


* (Rogers, Mary & Kim, Catherine & Banerjee, Tanima & Lee, Joyce. (2017). Fluctuations in the incidence of type 1 diabetes in the United States from 2001 to 2015: A longitudinal study. BMC Medicine. 15. 10.1186/s12916-017-0958-6.) (Licensed under CC-BY. My thanks to the authors for releasing it under open access.)

$\endgroup$
