This is a short post, but I felt it needed to be written. There is a lot of documentation on the marathon resources page (https://mesosphere.com/blog/2016/02/17/marathon-production-ready-containers/) as well as across blogs and, of course, Stack Overflow. However, the particular issue I'm about to discuss is not well documented, and the only resources available are GitHub issues and comments posted by users like me who ran into it while deploying a new cluster.
What's the problem?
I'm not going to discuss what Apache Mesos and Mesosphere Marathon are, because those are two insanely huge topics. Maybe I'll write something about my experience building out a cluster from the ground up at some point later. What I do want to discuss is the issue of lost marathon tasks. From time to time, running tasks that marathon has scheduled in mesos get into a bad state. In mesos jargon these tasks are "lost"; in other words, they failed for any number of reasons. The tasks appear in the list of tasks returned by the /v2/apps/<APP>/tasks endpoint with the status LOST. At first glance this may appear innocent enough; however, with only a basic marathon configuration, the lost tasks prevent marathon deployments from completing correctly. As a result, scaling the application, restarting it, and other actions are not possible.
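For example, here is a quick way to spot those tasks from the command line. This is just a sketch: it assumes marathon is reachable on localhost:8080, that your app id is /my-app, and that your marathon version includes a state field in each task object (verify all of these against your own setup):

```
# List all tasks for an app and filter for lost ones with jq.
# Assumes marathon on localhost:8080 and an app id of /my-app.
curl -s http://localhost:8080/v2/apps/my-app/tasks \
  | jq '.tasks[] | select(.state == "TASK_LOST") | {id, host, stagedAt}'
```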
Take a step back
I'm not an expert when it comes to marathon configuration, so I followed the official mesos documentation to configure a simple cluster in GCE (Google Compute Engine). I configured 3 mesos masters and set the quorum value to 2, which is the minimum configuration for a highly available cluster (a majority of the 3 masters). On each mesos master I configured a marathon scheduler, and I set up 5 mesos slaves to run tasks. I went with the simplest option and installed zookeeper, mesos, and marathon from the APT packages, which automatically install mesos and marathon as services in init.d.
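For reference, the Mesosphere APT packages read the master's flags from files under /etc/mesos-master, so setting the quorum on each master is a one-liner. A minimal sketch, assuming the standard Mesosphere package layout:

```
# Each file under /etc/mesos-master becomes a --<filename> flag on the
# mesos-master binary, so this file turns into --quorum=2 at startup.
echo 2 | sudo tee /etc/mesos-master/quorum
sudo service mesos-master restart
```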
A quick look at the marathon init script shows that the start command simply runs the marathon binary: