Caleb Stanford

Big Data Health Monitoring Should Be Mainstream Already

big-data health

The healthcare system in the US is often rightfully criticized for the cost to patients: insurance costs, “out-of-network” costs, and so on. But in my own interactions with healthcare over the last year or two, I’ve been equally frustrated by something else: medical data is not being sufficiently gathered and utilized. In the age of big data, medicine seems to be many years behind:

  1. A large amount of medical data is generated in current healthcare routine, but outside of scientific studies it is only utilized on a per-patient basis.

  2. An even larger amount of medical data could be easily gathered, through patient-recorded logs and wearable monitoring devices, but is not being gathered or utilized.

This is sad to me, because if we’ve learned anything in the last 10 years, it’s that data is extremely valuable when used in aggregate, at maximum scale, to update models and make predictions. In particular, we are probably missing out on more sophisticated, more accurate, and more proactive health monitoring.

I’m not suggesting that computers should replace humans as doctors; at least, not anytime soon. Although algorithms can analyze data faster and at a larger scale than humans, they probably can’t replace doctor expertise in the short term, especially with interactive tasks (like deciding what questions to ask in response to patient symptoms, or determining what data would be most valuable to collect).

What I am suggesting is that we: (1) gather more data, (2) make inferences at a larger scale from a larger amount of data, that can be applied in a more individualized fashion. It was encouraging to read that at least one researcher is taking this approach.

Can you be more specific about “medical data”?

Medical data points are taken constantly, even under the current system. If you go in for a doctor visit, several data points are gathered: the problem or symptoms you have (or lack thereof), as well as basic vital signs (pulse, blood pressure) and sometimes blood tests, urine tests, etc. Additionally, many people take data measurements at home, e.g. fitness trackers, sleep trackers, and blood pressure measurements; and people often notice symptoms and record them (e.g. stomach ache today, chest pain or back pain, feeling lousy in general, etc.). Many people would also be happy to record and track more data. I personally keep a log of minor health problems — just in case some pattern emerges, I suppose.

How is the way we currently use medical data insufficient?

It seems to me that current medical data is not used in aggregate to update medical models and improve future predictions. In fact, my doctors have not looked at my individual data points at all. Instead, you go in for a sick visit and they listen briefly to your array of symptoms, or you go in to get help with a specific thing like sleep, or you go in for a yearly checkup and they review any problems or questions you have. Then, the doctor (a human) makes a prediction or diagnosis about a specific issue, but does not compare with your past data in aggregate and has often forgotten about other minor or major problems you may have experienced. Additionally, many minor issues are ignored if they cannot be easily diagnosed. Finally, doctors do not apply the information about your case to future cases; thus, revision to standard practices only happens over a longer period of time through medical research and controlled studies.

The result of this limited use of health data is that while doctors are good at diagnosing specific sicknesses and medical problems, they are not good at overall monitoring “healthy” individuals to make sure they continue to stay healthy, without risk for future problems. Such monitoring would require a much more subtle understanding of our medical data; an understanding that is evidently not possible using the current approach.

But is there any guarantee that more aggressive data mining would be successful? Would models and predictions actually improve?

No strict guarantee, no. But from my perspective there is good reason to believe the answer to the second question is “yes”; all the recent breakthroughs in big data and machine learning have shown that with enough data and computational resources, it’s usually possible to outperform humans on concrete tasks (e.g. predicting future data from past data). These methods are only continuing to improve and be applied to various domains, not just computer science problems. So I am confident that machine learning algorithms will get better, and there is no reason that that will not extend to medical diagnosis and monitoring. However, there is one important precondition, and that is that machine learning algorithms require a lot of data to perform well.

Right now, no one has access to that kind of data. But the data exists and much of it is already collected, or else easy to collect. Perhaps in the near future, we can utilize it?

ProvableP vs ProvableNP

math logic theory-of-computation

Introduction and Background

The versus question asks: is it possible to determine whether a Boolean formula is satisfiable, in time polynomial in the length of the formula? Determining whether a Boolean formulas is satisfiable is called the SAT problem, and programs which solve the problem are called SAT solvers. SAT solvers are used in many modern programming tools (usually, for automatically solving some system of constraints), so the SAT problem is of great practical importance today. And it is often suggested that a solution to “ versus ” would resolve whether there exists any efficient algorithm at all for SAT.

There are a number of reasons why this suggestion could be wrong, or at least, a number of ways to argue against it. But here’s a particular thought: what if (that is, SAT can be solved in polynomial time), but the polynomial-time algorithm is beyond our ability to find – and in particular, beyond our ability to prove correct? What if there exists some magic algorithm that happens to work, but there doesn’t exist any proof that it works? Or perhaps there doesn’t exist any proof that it runs in polynomial time? Such a scenario would be an unsatisfying resolution to versus , in a way: because practically speaking, we generally only trust algorithms that we can prove correct.

The question I’m trying to motivate is: does ? That is, what happens if we restrict our attention to algorithms which can be proven correct? Is this question answerable, or is it just as hard as versus ?


It turns out that if and only if . I’ll define what I mean by and . This post is adapted from a question and answer I made on Computer Science StackExchange.

The ProvableP versus ProvableNP question

Fix a logic that is strong enough to encode statements about Turing machines. By this I mean the same requirements as Godel’s second incompleteness theorem; but for the purposes of this question, just assume the following:

  • is consistent (it doesn’t prove False); and

  • implies the Peano arithmetic axioms of (PA).

Then define

There’s an issue with this definition: how do we encode languages (of which there are uncountably many) as finite objects in order for the logic to have something to say about them? In particular, how do we encode formulas for and ? Let us assume that languages are subsets of the natural numbers , and that any in or must be given by an arbitrary formula of the symbols of , i.e. it is any definable subset of .

From the definition we have that . But does ? How does this relate to the original vs question?

The Answer

Obligatory warning: to my knowledge, this proof has not been rigorously checked by someone else :) But we now argue that

For the forward direction, assume , and consider any language in . Note that is accepted by a nondeterministic Turing machine , such that there exists a polynomial which bounds the longest execution branch of on input a string of length . Then let be a nondeterministic Turing machine with the following description: first, count the input length ; second, calculate ; and finally, runs for at most steps nondeterministically. If doesn’t halt (this never actually occurs, but doesn’t know that for sure), rejects.

Now observe that (more specifically, PA) proves that runs in time . Moreover, can describe “the language of strings accepted by ” by a formula (roughly, “there exists an sequence of configurations such that the sequence is a run of and ends in an accept state”). Then we have that proves . Therefore, . But we know (even if doesn’t) that by construction. So . But , so .

The backward direction is a similar trick. Assume and that . Then , so . From here, we know there is some Turing machine and polynomial such that runs in time and . Let be a deterministic Turing machine which, on input of length , first calculates , and then runs for at most steps. If doesn’t halt, returns a default value, say .

Then similarly to before, proves that halts in time and therefore that . it follows that , but we know (even if doesn’t) that . Thus, .

Coq Logo Vector Image

research miscellaneous

If you’re familiar with the Coq proof assistant, you may recognize their logo (a brown rooster shape). However, it’s difficult to find a good (high-resolution) image of their logo online — the image file on their website is a sad 66x100 pixels. Probably not good enough for use in a presentation or poster.

I made a vector version of the image (actually in September 2017) which I’m posting online now. This was made in Inkscape. The lines may differ very slightly from the original logo. I hope someone finds this useful!

Vector image (SVG):

Coq Logo SVG

500x804 image (PNG):

Coq Logo 500x804

1000x1607 image (PNG):

Coq Logo 1000x1607

Older Posts »