Dealing with Mesosphere Marathon Lost Tasks
This is a short post, but I felt like it needed to be written. There is a lot of documentation on the Marathon resources page: https://mesosphere.com/blog/2016/02/17/marathon-production-ready-containers/ as well as all over blogs and, of course, Stack Overflow. However, the particular issue I'm about to discuss is not very well documented, and the only resources available are GitHub issues and comments posted by users like me who ran into it while deploying a new cluster.
What's the problem?
I'm not going to discuss what Apache Mesos and Mesosphere Marathon are, because those are huge topics in their own right. Maybe I'll write something about my experience building out a cluster from the ground up at some point later. What I do want to discuss is the issue of lost Marathon tasks. From time to time, running tasks that Marathon has scheduled on Mesos get into a bad state. In Mesos jargon these tasks are "lost"; in other words, they failed for any number of reasons. The tasks appear in the list returned by /v2/apps/<APP>/tasks with the status LOST. At first glance this may seem innocent enough, but with only a basic Marathon configuration, the lost tasks prevent Marathon deployments from completing correctly. As a result, scaling the application, restarting it and other actions are no longer possible.
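For reference, checking for them outside the UI is just a call to that endpoint, for example with curl; marathon-host, the port and my-app below are placeholders for your own setup:
# List an app's tasks and look for any reported with a LOST status.
curl -s http://marathon-host:8080/v2/apps/my-app/tasks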
Take a step back
I'm not an expert when it comes to Marathon configuration; I followed the official Mesos documentation to configure a simple cluster on GCE (Google Compute Engine). I configured 3 Mesos masters and set the quorum value to 2, which is the minimum configuration for a highly available cluster. On each Mesos master I configured a Marathon scheduler, and I set up 5 Mesos slaves to run tasks. I went with the simplest option and installed ZooKeeper, Mesos and Marathon from the APT packages, which automatically register mesos and marathon as init.d services.
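Roughly, the package-based setup on a master boils down to something like this; the exact package names and config paths follow the Mesosphere APT repository conventions and may differ on your distribution:
# Masters: install ZooKeeper, Mesos and Marathon from the APT packages
# (exact package names depend on the repository you use).
sudo apt-get update
sudo apt-get install -y zookeeper mesos marathon
# Set the master quorum; the packaged mesos-master service reads this file.
echo 2 | sudo tee /etc/mesos-master/quorum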
A quick look at the Marathon init script shows that the start command simply runs the marathon binary:
start() {
    start-stop-daemon --start --background --quiet \
        --pidfile "$PID" --make-pidfile \
        --exec /usr/bin/marathon
}
On the surface everything looks correct. The Marathon UI runs, applications spin up with the right number of tasks, and the REST API, which backs the Marathon UI, works as expected. I can start and stop the service with the standard service command:
sudo service marathon start/stop
The problem started to manifest itself when the host VMs of the Mesos slaves were restarted unexpectedly. The Mesos platform should handle this scenario gracefully: tasks that are no longer running get recreated on another active Mesos slave. That part did work as expected, and Mesos restarted the killed tasks. In rare cases, however, the task that was killed would reappear in the application's task list as a lost task. This seems innocent enough at first, but it has an interesting side effect: Marathon deployments of the affected application get stuck. That includes everything from scaling the application to restarting or destroying it. Furthermore, other applications' deployments were getting stuck as well, because they were waiting for the affected application's deployment to complete, and so the entire cluster ended up in a weird, essentially unusable, state.
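For what it's worth, the stuck deployments are also visible through the REST API, and a wedged one can be cancelled by hand. That only clears the symptom rather than the cause, but it helps with diagnosing; the host and deployment id below are placeholders:
# Show deployments that are currently in flight (stuck ones linger here).
curl -s http://marathon-host:8080/v2/deployments
# Force-remove a wedged deployment by id; the changes it made are not reverted.
curl -s -X DELETE "http://marathon-host:8080/v2/deployments/<deployment-id>?force=true"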
The solution
After some extensive research through the Marathon documentation, online forums, GitHub issues and Stack Overflow, I did stumble upon the solution, and it was unimpressively simple. All I needed to do was add a few arguments to the marathon binary to configure it properly to deal with lost tasks.
/usr/bin/marathon --task_lost_expunge_gc 180000 --task_lost_expunge_initial_delay 120000 --task_lost_expunge_interval 300000
--task_lost_expunge_gc tells Marathon how long, in milliseconds, a lost task is kept before it is garbage collected (180000 = 3 minutes).
--task_lost_expunge_initial_delay sets the delay, in milliseconds, before Marathon runs its first expunge pass after starting up (120000 = 2 minutes).
--task_lost_expunge_interval sets how often, in milliseconds, Marathon scans its tasks and expunges the lost ones (300000 = 5 minutes).
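An easy way to confirm the running scheduler actually picked these flags up is to look at its command line:
# The expunge flags should show up on the marathon process command line.
ps aux | grep [m]arathon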
To put this all together, I added the above arguments to the init.d script, passing them through to the binary after start-stop-daemon's -- separator:
start() {
    start-stop-daemon --start --background --quiet \
        --pidfile "$PID" --make-pidfile \
        --exec /usr/bin/marathon -- \
        --task_lost_expunge_gc 180000 \
        --task_lost_expunge_initial_delay 120000 \
        --task_lost_expunge_interval 300000
}
After restarting Marathon on all the masters with the above configuration, the lost tasks immediately disappeared from the affected applications, and all the deployments completed as expected. I have not seen the issue pop up since.
This may or may not be the correct or best solution, but it is a solution, and given my inexperience with the platform, I'm running with it until I find something more appropriate.