Only the hard way: develop your own metrics and split the test data into groups. Have people assess the data as experts and compare the algorithm's results with their opinion, and also compare against the previous metric values and the best result so far for each dataset and class. The framework should be able to recompute all the metrics for previous versions of the algorithm as well as the new one, because the quality evaluation system itself will be revised regularly. Trying to reduce the evaluation to a binary pass/fail will only give you "the average temperature in the hospital": a single aggregate that hides what is happening in each group. I'm pretty sure you won't find a ready-made solution; the task is too atypical for a mass-market tool. At least I haven't found one.
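To make the idea a bit more concrete, here is a minimal sketch of such a harness in Python. Everything in it is my own assumption for illustration: the names `evaluate`, `regressions`, `accuracy`, the `(group, input, expert_label)` sample layout and the toy `v1`/`v2` algorithms are hypothetical, not taken from any existing tool. It computes each metric per data group, reruns every stored version of the algorithm against the current metric set, and reports the groups where a new version falls below the best known result.

```python
from collections import defaultdict


def accuracy(predicted, expected):
    """Share of items where the algorithm agrees with the expert label."""
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)


def evaluate(algorithm, samples, metrics):
    """Compute every metric separately for each data group.

    samples: list of (group, input, expert_label) tuples (assumed layout).
    Returns {group: {metric_name: value}}.
    """
    by_group = defaultdict(list)
    for group, item, expert_label in samples:
        by_group[group].append((algorithm(item), expert_label))

    report = {}
    for group, pairs in by_group.items():
        predicted = [p for p, _ in pairs]
        expected = [e for _, e in pairs]
        report[group] = {
            name: metric(predicted, expected) for name, metric in metrics.items()
        }
    return report


def regressions(new_report, best_report, tolerance=0.0):
    """Return (group, metric, best, new) where the new value is below the best."""
    found = []
    for group, best_metrics in best_report.items():
        for name, best in best_metrics.items():
            new = new_report.get(group, {}).get(name)
            if new is not None and new < best - tolerance:
                found.append((group, name, best, new))
    return found


# Toy data: the expert labels say whether a number is odd or even.
samples = [
    ("easy", 1, "odd"), ("easy", 2, "even"),
    ("hard", 3, "odd"), ("hard", 4, "even"),
]
metrics = {"accuracy": accuracy}  # extend this dict as the metrics evolve


def v1(x):
    return "odd" if x % 2 else "even"


def v2(x):
    # Deliberately broken on the "hard" group for the demonstration.
    return "odd" if x in (1, 4) else "even"


# Rerun ALL versions whenever the metric set changes, not just the newest one.
versions = {"v1": v1, "v2": v2}
reports = {name: evaluate(fn, samples, metrics) for name, fn in versions.items()}

print(regressions(reports["v2"], reports["v1"]))
```

Here `v2` has an overall accuracy of 0.5, which looks merely mediocre, while the per-group report shows it is perfect on "easy" and completely wrong on "hard". That is exactly the detail a single pass/fail number would hide.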