Clood CBR: towards microservices oriented case-based reasoning.

,


Introduction
Several case-based reasoning (CBR) development frameworks and toolkits have been introduced to the CBR community [12][13][14]. These have been extended for recommender systems [7] and textual CBR [11] and more recently for selfmanagement systems [1]. However many of these CBR systems are mostly implemented with monolithic architectures such as desktop standalone applications, with heavy demands due to siloed in-memory batch processing. This is not compatible with recent software development trends, which are increasingly using REST APIs for communication with cloud computing platforms.
Cloud computing is a term used to describe the use of remote hardware and software to deliver on-demand computing services through a network (usually the Internet). In the past, applications or programs were run from software downloaded on to a physical computer or server. In contrast cloud computing lets users access these applications through the internet. Implementing software applications in the cloud offer several benefits which include efficient/cost reduction, scalability, mobility, and disaster recovery. Distribution of CBR applications and cases enables, MapReduce type algorithms to exploit the parallelism opportunity that is to be had with pair-wise similarity computations [18]. Interestingly, CBR has also been applied to support cloud provisioning, whereby similar Amazon Web Services (AWS) 3 configurations are recommended given a characterisation of a user's compute task [8]. This helps the user to make decisions about the types of cloud services for the given task. But having to monitor resource utilisation and change service requirements accordingly is a challenge which in turn has paved the way for microservice based architectures.
A CBR framework using a microservice based architecture provides (amongst other things) flexibility in both the technology being used (e.g. programming language) as well as dynamic scalability that can adapt to user application demands (e.g. spikes in casebase querying, seasonal effects). This is because, individual microservices are independently scaled and developed such that the overall system architecture is a scalable distributed application [5]. Importantly, the computation of services are stateless since they are automatically provisioned only when needed and then stopped when no longer required. This is particularly advantageous to CBR in situations where there is in-memory demand due to its inherent nature of being a lazy learner.
In this paper we discuss how the CBR cyle can be organised into multiple microservices and how service discovery is facilitated between these independent components using rest communications. A microservice is considered efficient when the system is loosely coupled and highly cohesive [9]. Identifying which functionalities within the CBR cycle should be decoupled and organising them into microservices is a key design challenge that we address in this paper. We do this by introducing, Clood, a generic open-source CBR cloud-based microservice framework, and make the following key contributions: create a novel design using the microservice paradigm for CBR; introduce, Clood, an extensible open source microservice CBR framework 4 ; evaluate the scalability of the retrieval phase on a recommender task; and identify areas of future development that are essential for the sustainability of Clood CBR.
Rest of the paper is organised as follows; in Section 2 we discuss existing frameworks, jcolibri and myCBR. The design paradigm appears in Section 3 and the Clood implementation is discussed in Section 4. Results from a scalability experiment with half a million cases is presented in Section 5 followed by conclusions and future directions in Section 6.

Related Work in CBR Development Architectures
There are two well-established open-source frameworks for building CBR applications: myCBR and colibri, though they follow different approaches and support different phases of the CBR application development.
myCBR has been a tool for researchers and practitioners over the last ten years [15]. This framework is focused on the developing of a knowledge model for representing cases and computing similarity through the myCBR-workbench tool [2]. This knowledge model can be instantiated through the building blocks and functionality provided by the myCBR-SDK, that is a Java library following a classical monolithic software architecture. However, their authors have recently presented the myCBR Rest API which exposes the functionality of both myCBR-SDK and myCBR-workbench through a RESTful API [1]. Instead of forcing users to integrate their myCBR systems into a Java environment, this novel API enables users to model a CBR system using myCBR's workbench and then deploying the application as a web service. The goal is to make it easier to build, test, compare and deploy CBR applications.
colibri, on the other hand, is focused on the development of a wide range of CBR applications [10]. As a platform, colibri offers a well defined architecture for designing CBR systems, a reference implementation of that architecture: the jcolibri framework [12], and several development tools that aid users in the implementation and sharing of new CBR systems and components. These tools have been integrated in the colibri Studio development environment [13]. Both tools make up the COLIBRI platform following a two layer architecture. jcolibri is the white-box layer of the architecture: a framework for developing CBR applications in Java. This framework represents the bottom layer of the platform. It includes most of the code required to implement a wide collection of CBR systems: Standard, Textual, Knowledge-Intensive, Data-Intensive, Recommender Systems, and Distributed CBR applications. It also includes evaluation, maintenance and casebase visualisation tools. All this functionality has established jcolibri as a reference CBR framework with more than 35K downloads. However, jcolibri still follows the same monolithic Java architecture like myCBR and is not suitable for modern web environments.
The need for both these platforms to evolve into web services architecture is clear. However, there are different approaches to implement this evolution. myCBR proposes wrapping its existing java components as web services. It is a straightforward option but has several drawbacks. Mostly, the wrapping of the existing java components does not allow to take advantage of the capabilities of cloud architectures regarding availability or scalability. The alternative option is to create a cloud-based CBR framework from scratch in order to exploit the features of modern cloud architectures. This is the option adopted by Clood, that can be considered as a re-implementation of the functionalities provided by the jcolibri and myCBR frameworks, but instead of wrapping its existing java components, it redesigns entirely the CBR architecture for the cloud. In this manner, Clood adopts the CBR architecture defined in colibri based on a pre/post-cycle to load/release required resources. Clood also reproduces the case structure representation based on a composite pattern, and the similarity computation through global/local similarity functions that both jcolibri and myCBR implement.
In summary, our goal is to create a cloud architecture that is able to provide the same functionalities using familiar methods currently being used in jcolibri and, thereafter, further integrate existing web services found in myCBR. As we will present in the following section, Clood re-implements jcolibri's methods using modern web services technologies such as Elasticsearch or JSON-based communications that extend the existing capabilities of the framework regarding flexibility and data-intensive processing.

Microservices Design Paradigm for CBR
A microservice is an independent process which can carry out specific tasks in isolation [5]. These should be deployed, tested and scaled independently for a single functional responsibility; such as similarity, ranking, casebase editing, etc. Key to this architecture are the concept of serverless functions also referred to as Function-as-a-Service (FaaS) [3] -logic that is split into small code snippets and executed in a managed compute service. Well known examples include AWS's Lambda 5 and Google's Cloud Functions 6 . Figure 1 shows a high-level overview of the system's design consisting of 3 components: REST API; Serverless functions; and the ElasticSearch (ES) service. The core CBR tasks -retrieve, reuse, revise, retain -are implemented as serverless computing functions. Functions can interact with external interfaces (e.g. dashboard) and internally with other functions through REST APIs. Decomposition of the CBR cycle into smaller functions provides flexibility to introduce similarity functions and deploy them independently. Such functions will also include relevant knowledge container provisions. The post-cycle or maintenance tasks, like forgetting cases or recomputing footprint cases can be confined to the Retain service. Other management services for Clood project management and configurations are also included as microservices accessible through REST. The ES service is used as the casebase which allows the serverless functions to query and retrieve. Data sources and connectors forming the pre-cycle communicate with the Casebase once they are synced with the ES. Data source's can either be external or within the cloud platform which gives flexibility for the community to use existing data sources. An important distinction here with the pre-cycle is that it remains lean (as compared to jcolibri, or myCBR); in that it does not involve loading cases into memory once cases are made persistent.

The Casebase
Popular CBR systems like jcolibri keep the casebase in memory during operation. An in-memory casebase guarantees speed when interacting with the casebase but will incur massive costs to scale up for big data. Also, using the CBR system in a distributed manner can be problematic with in-memory casebase as memory is an expensive resource even on the cloud.
In the serverless architecture, we maintain the casebase in a NoSQL full-text distributed search engine for all types of data. ES and Solr are popular examples of such distributed, scalable open-source search tools for textual, numerical, geospatial, structured, and unstructured data. These tools provide a significant improvement regarding the representation of cases in previous CBR frameworks, because the case structure does not need to be fixed. Therefore, the cases in the casebase can have different attributes, and similarity metrics are applied according to each particular data types.
Moreover, as these search tools are built on Apache Lucene, they are extendable, allowing users to write custom similarity metric scripts against a data index. Accordingly, the type of operations that would normally occur in-memory can be done in the data store index which is usually file-based 7 . While there are several databases with search capability to choose from, we focus on ES because of its popularity and close integration with existing cloud service providers.

Local Similarity
A subset of the serverless functions are used to generate similarity scripts to measure local similarity. These metric functions perform retrieval from the case-base at the attribute level. Each generated similarity function script depends on the data type of the attribute. Supported data types include string, numeric, Boolean, date and object. Similarity metrics to retrieve exact matches are inbuilt in ES or similar services. And custom similarity metrics functions have been implemented to support other local similarity functions that are used for CBR retrieval in the jcolibri and myCBR frameworks.

Global Similarity
The global similarity function which aggregates local similarities determines the order in which cases are retrieved from the casebase and their ranking. Both a weighted and non-weighted form can be used to identify the nearest neighbours and is managed directly by the ES index service. Each local similarity function script is executed in memory, in response to a single query, to obtain the global similarity as a sum. Custom scripts can be created as needed to vary the weights associated with different attributes. These weights can be dynamically modified for each retrieval task or alternatively remain static for all queries. The latter corresponds to learning a attribute weighting scheme that is used unchanged with every casebase query; whilst the former provides the opportunity to change attribute weights to suit the query context. The default global aggregation can be replaced with a custom aggregation script; whilst this does offer greater flexibility it will also incur greater computing memory when working with medium to large casebases since all the cases that are returned by the local functions will be held in memory (as with the monolithic organisation of jcolibri and myCBR).

Implication for CBR Cycle
The major improvement over the architectures used by jcolibri and myCBR is the lack of a two-layer persistence strategy. In previous frameworks there is a need to load cases into memory from a persistence media such as a data-base, text file, etc. However, the use of ES services allows to manage cases directly from its internal index.
Absence of the two-layer persistence strategy, has an immediate impact on the application structure because unlike previously where a precycle step was needed prior to the CBR cycle itself for loading cases into memory, this is no longer required. However Clood maintain the possibility of executing a precycle (or its complementary postcycle) in order to perform additional pre/post-processing of the data, if the CBR system requires it.
Another significant benefit of cloud-based technologies is concurrency, which directly creates the opportunity to execute CBR processes in parallel. This feature is quite limited in current frameworks and is also very relevant in order to parallelise time-consuming algorithms such as kNN or noise removal methods such as BBNR (Blame-based noise reduction), CRR (Conservative Redundancy Removal), RENN (Repeated Edited Nearest Neighbour), RC (Relative Cover), ICF (Iterative Case Filtering), etc.

Clood CBR System
Clood is implemented using python functions following the design paradigm presented in Section 3. These functions run on Amazon Web Services (AWS) Lambda, which is the severless event-driven computing service of AWS. The casebase uses the AWS ES service and the client application is implemented with JavaScript and HTML using the AngularJS framework 8 . Using a test application provided by jcolibri 9 we describe the Clood implementation (see Figure 2) and discuss how CBR functionality is achieved with cloud capabilities.

Casebase using Elasticsearch
ES is an open-source highly distributed and horizontally scalable full-text search engine with various capabilities built on Apache Lucene [6]. ES uses RESTful interfaces to manipulate its schema-free JSON document store and performs searches at very high speeds maintaining an index that is about 20% the size of the indexed documents [17]. Compared to traditional database management systems, the ES "index" is somewhat like the database table as queries are executed against the index. Although it is "schema-free", ES internally generates a schema based on the field (attributes/columns) values of documents to be indexed. Relying on an ES-generated schema can be problematic in some cases. For example, a field for storing alphanumeric values can be designated as numeric by ES if the first documents to be indexed have numeric values only for that field. In order to avoid undesirable field properties, we create an explicit mapping which indicates the data type to be stored for each field in the casebase. The ES index "mapping" is comparable to the database schema as it describes the fields (columns) in the JSON documents along with their data types.
An explicit index mapping supports the specification of how a field's values should be indexed and the local similarity metric to be used for retrieving the values of that field. Where possible, we delay specifying the local similarity function for a field until retrieval time for greater flexibility. This is because the index specification for a field cannot be modified once data is added to the index. With query script similarity functions supplied at retrieval time, the method of retrieval can be varied without having to modify the underlying index mapping. Introducing a new attribute to an existing casebase can be done by extending the index mappings with the new field. The structure of cases that do not have values for newly created fields will remain unchanged. Clood's severless functions interact with ES by HTTP requests and responses using a python Elasticsearch client, elasticsearch-py 10 . The casebase is a separate service which can be hosted anywhere with exposed API end-points further highlighting the distributed nature of Clood. Table 1 shows the local similarity metric functions that are currently implemented on Clood, reproducing some relevant functions available in jcolibri and myCBR. Although several similarity metrics are currently missing in Clood, the goal here is to demonstrate the potential of the framework and to encourage code contributions in the future. Each similarity metric is implemented as a python function which generates and returns a Painless script. Painless is the scripting language that is specifically designed for writing inline and stored scripts on ES. Alternative languages for writing ES scripts are Java, Lucene expressions language 11 , and Mustache template 12 .

Clood Similarity Functions
McSherry, INRECA, Interval and EnumDistance are re-implementations of local similarity metrics found in jcolibri. Figure 3 shows an example of the python code used to dynamically generate scripts for measuring Interval similarity on numeric attributes. Figure 4 is the generated script when the Interval function is called with parameters "queryval" = 10, "interval" = 5, and "weight" = 1. At retrieval, generated scripts for each case attribute, are combined into a single multi-match query script. For textual CBR, we specifically implemented the Semantic local similarity metric (Semantic USE) for text content, using the Universal Sentence Encoder (USE) which embeds texts in a dense vector space of 512 dimensions [4]. This vector representation is generated using a lite version of USE based on the Transformer architecture 13 [16] and is stored as a dense vector field on ES. Textual retrieval follows the same process of generating the vector representation of a query string. Afterwards, the Semantic USE local similarity function measures the cosine similarity between query vectors and documents' vectors to identify the most semantically similar content.

REST API
REST APIs are stateless in that the API server does not remember the state of its clients and every call to an end-point is independent of other calls. REST API uses existing protocols such as HTTP for Web APIs. As a result, client applications do not need additional software to use the service. REST improves portability to different types of platforms since all interactions are completed through universally understood interfaces. With Clood, each REST API endpoint is a serverless function. The replication of an end-point and the resources allocated to it vary to meet changing demands without affecting the other endpoints. Table 2 summarises the major REST API end-points of Clood.
Clood is able to concurrently manage multiple CBR projects (use-cases) referred to as "project" in Table 2. The system's capabilities can be easily extended by introducing new serverless functions (e.g. similarity functions, reuse functions, revise functions). Functions that will become part of the REST API are specified in a YAML file along with their access protocols.

Clood CBR Dashboard
Client applications can perform CBR operations through the RESTful API endpoints of Clood. The Clood CBR client application is a light-weight HTML and JavaScript implementation that is able to manage multiple CBR projects through API calls. Figure 5 shows the interface for specifying the attributes of a project's casebase. Clood system's configuration provides guidance on allowed operations when specifying attributes. For example, it indicates that the Interval local similarity metric only applies to numeric attributes. Once the attribute specifications are completed, Clood generates an index mapping for the case representation on ES. Logstash is an open-source data processing pipeline from the ES stack for ingesting data into ES 14 . Using Logstash, cases can be added to a Clood's casebase from multipel data sources including files (e.g. CSV file), databases with JDBC interfaces (e.g. MySQL, PostgreSQL, Oracle), and NoSQL databases (e.g. MongoDB, CouchDB). However, we also include a file upload utility for adding cases from CSV files through a RESTful end-point and which should be sufficient for file sizes that will not overwhelm the Web browser.
The retrieve operation begins with specifying some attribute values along with weights for aggregating the local similarity measures. Attributes with known values become part of the problem space while attributes with unknown values form the solution space. Furthermore, a retrieve strategy can be specified per attribute as shown on the user interface in Figure 6. For example, the Best match can be retrieved for one attribute while the Mean of the k best matches retrieved for another attribute. The k nearest neighbours to retrieve and the global similarity method can also be specified at the retrieve phase.
The reuse interface in Figure 7 displays the retrieval results for reuse. The recommended case (candidate solution) mixes the user-supplied attribute values with the retrieved values for unknown attribute values. The k most similar cases to the query case are also presented for possible reuse. The reuse button against a retrieved case is used to make it the recommended case. The recommended case can be revised by adjusting it as required. Afterwards, the case can be retained by adding to the casebase.

Evaluation
A scalability test is conducted to evaluate Clood based CBR application, to examine how resource demands both on the casebase and the serverless CBR functions are met. We expect a fairly consistent compute performance for different CBR tasks across different project sizes (compared to a jcolibriapplication). We focus on case retrieval for evaluation since it is the most commonly performed and time-consuming stage of the CBR cycle.

Experimental Setup and Dataset
Six CBR projects of increasing casebase sizes (10, 10 2 , 10 3 , 10 4 , 10 5 , and 540, 394) were created from a used cars dataset 15 (1.35GB CSV file), and case retrieval efficiency compared with Clood and jcolibri . A case has 25 attributes describing the physical features of a car (e.g. manufacturer, model, colour), car location (e.g. region, state, coordinates), and the listing price. In the comparative study, 10 nearest neighbours (NN) are retrieved using the following query. Clood, consists of: the time spent to dynamically generate a query using the appropriate similarity functions for the query case, retrieve the 10 NN of the query case from the casebase, generate a recommended case for reuse using specified reuse strategy, and generate a response through the API. We do not include the time lapse between the client application and the API endpoints as that is very dependent on the network connection speed and client's platform resources. For jcolibri, Retrieve time is measured in the cycle phase consisting of: the time spent to retrieve the similarity configuration, perform NN scoring over the cases (in-memory), and select the 10 best cases. jcolibri was run on a Windows 10 PC having 6th generation Intel core i7 processor and 16GB RAM with 2GiB Java heap size. Clood uses AWS Lambda functions for its operations while the casebase was hosted on a single cluster of the AWS ES Service with 2GiB and 1 vCPU (t2.small.elasticsearch instance). Figure 8 shows the average case retrieve times for Clood and jcolibri on log scales with standard deviations as error bars. jcolibri was marginally faster on the smallest casebase (10) but the superior performance of Clood is apparent with increasing casebase sizes. Similar case retrieval times were obtained at at about 100 cases; however at casebase size of 1,000, Clood was 5.5 times faster than jcolibri and at casebase size of 540,394, Clood was 3,737 times faster than jcolibri. Close examination of Clood's Retrieve time spent on the ES casebase when measured separately (Query time) shows to have increased due to time spent querying the casebase (see Figure 9).

Conclusion
We introduced Clood CBR, a novel microservices-oriented CBR framework which leverages the serverless architecture for CBR operations and a distributed data storage service (Elasticsearch) for CBR knowledge persistence. Implementation of the extensible Clood CBR framework is an ongoing opensource project. We demonstrated the robustness of Clood on a CBR project of half a million cases and showed how Clood is more scalable than existing systems like jcolibri. On-going work is extending support for additional similarity and data types (e.g. myCBR's table similarity and user-defined similarity functions); and include functions for reuse and revise, casebase maintenance and visualisation. In future we aim to make Clood a Python library to reuse the Clood Elasticsearch similarity functions for the community and to include seamless support for multiple cloud providers.