
A Comprehensive Survey of Existing Chatbot Architectures and Techniques (Part 1) | by Tejwinder Singh


In this series of blog posts, I will investigate the various types of dialogue systems, more commonly known as chatbots, along with some of the design techniques and algorithms used to develop them. Since their inception in the 1960s, dialogue systems have gained increasing attention due to their ability to streamline conversations between humans and machines. User experience and engagement have become important factors for the growth of businesses across the globe, and dialogue systems are a natural way to engage users and enhance their overall experience. With this in mind, this post covers the various chatbot architectures, ranging from rule-based to generation-based. In the next post, I will investigate the social responsiveness of these dialogue systems and how their architectures can be implemented.

Computer programs that receive natural language text or human voice as input and generate human-like responses, enabling an intelligent and natural conversation between a human and a machine, are known as conversational agents. These dialogue systems have wide applications in e-commerce, hospitality, healthcare, automotive, question answering, and many more domains. Existing dialogue systems can be classified either as task-oriented dialogue systems, which are designed for a specific task and tend to have short conversations with the user to gather information and complete the task, or as chatbots, which focus on human engagement and social conversation [1]. A dialogue system architecture can be classified as either rule-based or corpus-based, and corpus-based systems can be further classified into information retrieval systems and generation-based systems. In the sections that follow, I will provide details of each of these architectures.

Rule-based systems rely on the simple method of decomposing the input text according to certain criteria and looking for the presence of keywords. When found, these keywords are transformed and reassembled according to the rules associated with them and certain assembly specifications. ELIZA was the first rule-based dialogue system, developed to imitate a Rogerian psychotherapist [2]; it was followed by PARRY, a rule-based chatbot built to study schizophrenia [3]. More recent rule-based chatbots include the DBpedia Chatbot, designed to enhance user interaction with the DBpedia platform [4].
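A minimal sketch of this keyword decompose-and-reassemble loop, in the spirit of ELIZA (the rules and templates below are illustrative, not ELIZA's actual script):

```python
import re

# Each rule pairs a regex that decomposes the input with a template that
# reassembles the captured fragment into a response.
RULES = [
    (re.compile(r"i need (.*)", re.I), "Why do you need {0}?"),
    (re.compile(r"i am (.*)", re.I), "How long have you been {0}?"),
    (re.compile(r"my (.*)", re.I), "Tell me more about your {0}."),
]
DEFAULT = "Please go on."

def respond(text):
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            # Transform the captured keyword phrase and reassemble it.
            return template.format(match.group(1).strip().rstrip(".!?"))
    return DEFAULT  # no keyword matched

print(respond("I need a holiday"))     # Why do you need a holiday?
print(respond("The weather is nice"))  # Please go on.
```

The fallback response illustrates the limitation discussed below: any input outside the hand-written patterns gets only a canned reply.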

The DBpedia Discussion and Developers mailing lists were used as data sources; the messages were cleaned, vectorized, and converted into a matrix containing the frequency of each term in every message. These vectors were clustered into topics and threads using cosine similarity, and the resulting patterns were converted into rules that follow the structure of regular expressions.
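The vectorize-and-cluster step described above can be sketched as follows. This is a toy illustration with invented messages and an arbitrary similarity threshold, not the DBpedia team's actual pipeline:

```python
import math
from collections import Counter

messages = [
    "how to load dbpedia dump",
    "loading the dbpedia dump fails",
    "sparql endpoint is down",
]

# Build a term-frequency vector for each message over a shared vocabulary.
vocab = sorted({w for m in messages for w in m.split()})

def tf_vector(message):
    counts = Counter(message.split())
    return [counts[w] for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Greedily group messages whose vectors exceed a similarity threshold.
threshold = 0.3
clusters = []
for msg in messages:
    vec = tf_vector(msg)
    for cluster in clusters:
        if cosine(vec, tf_vector(cluster[0])) >= threshold:
            cluster.append(msg)
            break
    else:
        clusters.append([msg])

print(clusters)  # the two "dump" messages group together; the SPARQL one stands alone
```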

These chatbots cannot answer questions whose patterns do not match the rules they were built with. Moreover, hand-writing rules for every possible scenario is a cumbersome task that quickly becomes infeasible.


Due to the inherent problems of hand-built rules, machine learning solutions are applied to corpora constructed from real-world data, hence the name corpus-based chatbots. These corpora can be categorized by the type and aspects of the dialogue interactions they contain: written or spoken corpora; human-human corpora (interactions between humans) or human-machine corpora (interactions between a human and a machine); natural or unnatural corpora; and so on [5]. Corpus-based natural language generation is used for spoken dialogue systems because of its inherent advantage of mimicking a human being rather than merely acting on predefined rules or templates [6]. Corpus-based chatbots can be further classified into information retrieval models and generation-based models.

These bots are trained on a corpus of dialogue interactions: a set of queries along with their possible responses. From the set of available responses, the system ranks all matching responses and outputs the most relevant one. Query-Response Similarity and Query-Post Similarity measures were introduced for a post-comment pair dataset to measure the similarity between a query and a response, where both are represented as TF-IDF vectors and fluent responses are returned [7].
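A minimal sketch of TF-IDF-based response retrieval (the query-response pairs and the smoothing choice here are illustrative, not the dataset or weighting used in [7]):

```python
import math
from collections import Counter

# Toy corpus of (query, response) pairs the retriever can draw from.
pairs = [
    ("what time does the store open", "We open at 9 am every day."),
    ("how can I reset my password", "Use the 'forgot password' link on the login page."),
    ("do you ship internationally", "Yes, we ship to most countries."),
]

# Inverse document frequency over the stored queries (with +1 smoothing).
documents = [q for q, _ in pairs]
idf = {}
for term in {w for doc in documents for w in doc.split()}:
    df = sum(term in doc.split() for doc in documents)
    idf[term] = math.log(len(documents) / df) + 1.0

def tfidf(text):
    counts = Counter(text.split())
    return {w: c * idf.get(w, 0.0) for w, c in counts.items()}

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(query):
    # Rank stored pairs by query-query similarity; return the best response.
    qv = tfidf(query)
    ranked = sorted(pairs, key=lambda p: cosine(qv, tfidf(p[0])), reverse=True)
    return ranked[0][1]

print(retrieve("how do I reset a password"))
```

The same scoring function could compare the query against stored responses instead of stored queries, mirroring the query-response versus query-post distinction in [7].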

One major problem with retrieval-based chatbots is the semantic gap that exists between two object descriptions due to linguistically different representations. To keep the user engaged, it is of utmost importance that the responses generated by the model are more than just answers to a given question: they need to contain some extra information that keeps the conversation going and fosters a sense of discussion with the user.

The topic-aware convolutional neural tensor network (TACNTN) was introduced to improve the matching algorithm; it incorporates topic information by embedding the message, the response, and their related topic information into vector spaces, which are then processed by neural tensors for matching [8]. With this approach, the system outperformed baseline models such as LSTM, CNN, and cosine-similarity matching.

The discussion so far has centred on single-turn retrieval-based systems, which take into account only the last input message in the conversation. In multi-turn conversations, responses are generated based not only on the last input message but on the overall context of the conversation. The challenge in a multi-turn system is identifying the right information and then using it to find responses that match the context while preserving the relationships between the different turns.

A Sequential Matching Network was proposed in which a context vector is generated and then compared with the response vector [9]. For every message-response pair, the model constructs word-word and sequence-sequence similarity matrices using word embeddings and gated recurrent units. In this way, useful information obtained by comparing the response with the previous turns can be retained, which adds to the overall performance of the chatbot.
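The word-word similarity matrix at the heart of this model can be sketched with toy embeddings (the vectors and vocabulary below are invented for illustration; the actual model uses learned embeddings and GRU-based sequence representations):

```python
import numpy as np

# Hypothetical 3-dimensional embeddings for a tiny vocabulary.
emb = {
    "book":   np.array([0.9, 0.1, 0.0]),
    "novel":  np.array([0.8, 0.2, 0.1]),
    "flight": np.array([0.0, 0.9, 0.4]),
    "ticket": np.array([0.1, 0.8, 0.5]),
}

def similarity_matrix(utterance, response):
    """M[i, j] = dot product between word i of the utterance and word j of the response."""
    U = np.stack([emb[w] for w in utterance])
    R = np.stack([emb[w] for w in response])
    return U @ R.T

M = similarity_matrix(["book", "flight"], ["ticket", "novel"])
print(M.round(2))  # "flight"/"ticket" score high; "book"/"ticket" score low
```

In the full model, one such matrix is built per previous turn, and a convolutional layer followed by a GRU accumulates the matching signal across turns.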

Rather than choosing the most appropriate response to a query from an available corpus, generation-based chatbots generate their answers using an encoder-decoder framework: the query is encoded into a vector representation and given as input to the decoder unit, which generates the response. We can think of this as machine translation from the user's turn to a response.

Ritter et al. investigated the use of statistical machine translation for response generation, the first approach to open-domain linguistic response generation [10].

They identified that message-response pairs are not semantically equivalent, as they are in machine translation of bilingual text, and are far less aligned. Later, encoder-decoder or Seq2Seq models used recurrent neural networks to read the input message and generate the corresponding response one token at a time [11]. Since each predicted token is fed back as input to predict the next one, and the entire output is not known beforehand, a less greedy approach is beam search, in which multiple candidate outputs from the previous step are carried forward to predict the next step. This simple, general architecture could produce simple responses and maintain a basic level of conversation, but it failed to capture the long-term objectives and information in human conversation, which amount to more than predicting the next step from the previous one.
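Beam-search decoding, as described above, can be sketched over a toy next-token distribution (the probability table is invented; a real decoder would score tokens with the trained RNN):

```python
import math

# Hypothetical next-token probabilities given the previous token.
NEXT = {
    "<s>":  {"i": 0.6, "the": 0.4},
    "i":    {"am": 0.7, "can": 0.3},
    "the":  {"cat": 0.9, "am": 0.1},
    "am":   {"fine": 0.8, "</s>": 0.2},
    "can":  {"help": 0.9, "</s>": 0.1},
    "cat":  {"</s>": 1.0},
    "fine": {"</s>": 1.0},
    "help": {"</s>": 1.0},
}

def beam_search(beam_width=2, max_len=5):
    # Each hypothesis is (tokens, cumulative log-probability);
    # only the best `beam_width` hypotheses survive each step.
    beams = [(["<s>"], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == "</s>":
                candidates.append((tokens, score))  # finished hypothesis
                continue
            for tok, p in NEXT[tokens[-1]].items():
                candidates.append((tokens + [tok], score + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]

print(beam_search())
```

Note that the winning sequence here starts from the locally worse first token "the": beam search keeps that hypothesis alive, whereas greedy decoding would have committed to "i" and missed the higher-probability full sequence.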

To overcome this architecture's inability to model longer prior context, hierarchical recurrent neural networks with a bidirectional RNN encoder were used, allowing the model to see prior turns and introducing additional short-term dependencies for the longer messages present in the dataset [12].

A further enhancement was made by including additional non-dialogue datasets covering the same topics and types of human conversation. Another problem with the simple Seq2Seq architecture is that across multiple turns the generated responses tend to be incoherent, so it is only suitable for generating single responses.

This problem was tackled by introducing a hierarchical latent-variable encoder-decoder model for generating dialogues and maintaining dialogue context [13]. A continuous, high-dimensional latent variable is attached to each dialogue message; based on it, the high-level semantic content of the response is represented first, followed by word-by-word response generation.

This method helped address the high variability among messages and made conversations more natural. A reinforcement learning-based dialogue manager was also proposed to address issues of robustness, flexibility, and reproducibility in task-completion dialogue systems, and it can be applied to chatbots to allow natural interactions [14].

In this post, I introduced the idea of dialogue systems and the various architectures around which to model chatbots. In the coming posts, I will discuss the techniques and algorithms widely used to implement these architectures and touch upon the idea of social and emotional awareness in chatbots.

[1] Speech and Language Processing. Daniel Jurafsky & James H. Martin.

[2] Weizenbaum, J. (1966). ELIZA: a computer program for the study of natural language communication between man and machine. Communications of the ACM, 9(1), 36-45.

[3] Colby, K. M., Hilf, F. D., Weber, S., & Kraemer, H. C. (1972). Turing-like indistinguishability tests for the validation of a computer simulation of paranoid processes. Artificial Intelligence, 3, 199–221.

[4] Athreya, R. G., Ngonga Ngomo, A. C., & Usbeck, R. (2018, April). Enhancing Community Interactions with Data-Driven Chatbots: The DBpedia Chatbot. In Companion Proceedings of The Web Conference 2018 (pp. 143-146). International World Wide Web Conferences Steering Committee.

[5] Serban, I. V., Lowe, R., Henderson, P., Charlin, L., & Pineau, J. (2015). A survey of available corpora for building data-driven dialogue systems. arXiv preprint arXiv:1512.05742.

[6] Oh, A. H., & Rudnicky, A. I. (2000). Stochastic language generation for spoken dialogue systems. In ANLP-NAACL 2000 Workshop: Conversational Systems.

[7] Ji, Z., Lu, Z., & Li, H. (2014). An information retrieval approach to short text conversation. arXiv preprint arXiv:1408.6988.

[8] Wu, Y., Li, Z., Wu, W., & Zhou, M. (2018). Response selection with topic clues for retrieval-based chatbots. Neurocomputing, 316, 251–261.

[9] Wu, Y., Wu, W., Xing, C., Zhou, M., & Li, Z. (2016). Sequential matching network: A new architecture for multi-turn response selection in retrieval-based chatbots. arXiv preprint arXiv:1612.01627.

[10] Ritter, A., Cherry, C., & Dolan, W. B. (2011, July). Data-driven response generation in social media. In Proceedings of the conference on empirical methods in natural language processing (pp. 583–593). Association for Computational Linguistics.

[11] Vinyals, O., & Le, Q. (2015). A neural conversational model. arXiv preprint arXiv:1506.05869.

[12] Serban, I. V., Sordoni, A., Bengio, Y., Courville, A., & Pineau, J. (2016, March). Building end-to-end dialogue systems using generative hierarchical neural network models. In Thirtieth AAAI Conference on Artificial Intelligence.

[13] Serban, I. V., Sordoni, A., Lowe, R., Charlin, L., Pineau, J., Courville, A., & Bengio, Y. (2017, February). A hierarchical latent variable encoder-decoder model for generating dialogues. In Thirty-First AAAI Conference on Artificial Intelligence.

[14] Li, J., Monroe, W., Ritter, A., Galley, M., Gao, J., & Jurafsky, D. (2016). Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541.
