
Using Curiosity for an Even Representation of Tasks in Continual Offline Reinforcement Learning

Abstract

In this work, we investigate the use of curiosity on replay buffers to improve offline multi-task continual reinforcement learning when tasks, which are defined by the non-stationarity in the environment, are unlabeled and not evenly exposed to the learner over time. In particular, we investigate curiosity both as a tool for task-boundary detection and as a priority metric for retaining old transition tuples, and use these two roles to propose two different buffers. First, we propose a Hybrid Reservoir Buffer with Task Separation (HRBTS), where curiosity is used to detect task boundaries that are not known due to the task-agnostic nature of the problem. Second, by using curiosity as a priority metric for retaining old transition tuples, we propose a Hybrid Curious Buffer (HCB). We show that these buffers, in conjunction with standard reinforcement learning algorithms, can alleviate the catastrophic forgetting suffered by state-of-the-art replay buffers when the agent’s exposure to tasks is unequal over time. We evaluate catastrophic forgetting and the efficiency of our proposed buffers against recent works such as the Hybrid Reservoir Buffer (HRB) and the Multi-Time Scale Replay Buffer (MTR) in three different continual reinforcement learning settings. These settings are defined by how many times the agent encounters the same task, how long the encounters last, and how different new tasks are from old ones (i.e., how large the task drift is). The three settings are: 1. prolonged task encounters with substantial task drift and no task re-visitation; 2. frequent, short-lived task encounters with substantial task drift and task re-visitation; and 3. every-timestep task encounters with small task drift and task re-visitation. Experiments were conducted on classical control tasks and the Meta-World environment. They show that our proposed replay buffers display better immunity to catastrophic forgetting than existing works in all settings except the every-timestep task encounter with small task drift and task re-visitation. In that scenario curiosity is always high and is therefore not a useful measure for either proposed buffer, so the proposed buffers are not universally better than other approaches across all types of continual learning settings, which opens an avenue for further research.
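
To make the core mechanism concrete, the following is a minimal, illustrative sketch (in Python/PyTorch, matching the paper's tooling) of how curiosity can be obtained as the prediction error of a forward-dynamics model and thresholded to flag a possible task boundary. The class and function names, network sizes, and the running-mean threshold rule are illustrative assumptions of this sketch, not the exact formulation used by HRBTS or HCB.

```python
# Hedged sketch: curiosity as forward-model prediction error, plus a simple
# running-mean threshold used to flag a possible task boundary. Names and
# constants are illustrative, not taken from the paper's implementation.
import torch
import torch.nn as nn

class ForwardModel(nn.Module):  # hypothetical helper, not from the paper's code
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        # Predict the next state from the current state and action.
        return self.net(torch.cat([state, action], dim=-1))

def curiosity(model, state, action, next_state):
    """Curiosity of one transition = forward-model prediction error."""
    with torch.no_grad():
        prediction = model(state, action)
    return torch.mean((prediction - next_state) ** 2).item()

def looks_like_task_boundary(c_t, running_mean_c, k=3.0):
    """Flag a boundary when curiosity spikes well above its running mean.
    The factor k = 3.0 is an illustrative choice, not a value from the paper."""
    return c_t > k * running_mean_c
```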

Data Availability

All the experiments were conducted exclusively on open-source software, namely PyTorch [55], Roboschool environments via PyBullet Gymperium [53], and OpenAI Gym [52]. The implementation, requirements, and configurations needed to run the experiments are provided in the following reproducibility checklist.

• Implementation: https://github.com/punk95/Continual-Learning-With-Curiosity

• Hyperparameter configuration: hyperparameter values are specified in Appendix B.

• Hardware requirements: all the algorithms were run on a 12-core AMD Ryzen™ 9 5900X processor and an Nvidia RTX 3080 GPU.

• Number of runs: each result was obtained by averaging over 8 experiments with different seeds.

• Seeds: A different randomly generated seed was used for every individual experiment.

• Statistics used for the results: we used the discounted reward and the composition ratio of the buffers as our statistics to evaluate the algorithms. The results were averaged over 8 runs (a minimal sketch of this computation is given below).
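
As an illustration of how the reported statistic could be computed, the sketch below averages per-episode discounted returns over several seeded runs. The reward traces and the discount factor of 0.99 are placeholders of this sketch, not values taken from the paper.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^t * r_t over one episode."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * np.asarray(rewards)))

# One reward trace per seeded run (the paper averages over 8 runs).
runs = [np.random.randn(200) for _ in range(8)]  # placeholder reward traces
mean_discounted_return = np.mean([discounted_return(r) for r in runs])
print(mean_discounted_return)
```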

References

  1. Parisi GI, Kemker R, Part JL, Kanan C, Wermter S. Continual lifelong learning with neural networks: a review. CoRR. 2018;abs/1802.07569. arXiv:1802.07569.

  2. Lesort T, Lomonaco V, Stoian A, Maltoni D, Filliat D, Díaz-Rodríguez N.: Continual learning for robotics: definition, framework, learning strategies, opportunities and challenges. arXiv:1907.00182.

  3. Díaz-Rodríguez N, Lomonaco V, Filliat D, Maltoni D.: Don’t forget, there is more than forgetting: new metrics for Continual Learning. arXiv:1810.13166.

  4. Sutton RS, Barto AG. Reinforcement learning: an introduction. MIT press; 2018.

  5. Khetarpal K, Riemer M, Rish I, Precup D.: Towards continual reinforcement learning: a review and perspectives. arXiv:2012.13490.

  6. Zenke F, Poole B, Ganguli S. Continual learning through synaptic intelligence. In: Precup D, Teh YW, editors. Proceedings of the 34th International Conference on Machine Learning. vol. 70 of Proceedings of Machine Learning Research. PMLR; 2017. p. 3987–3995. Available from: http://proceedings.mlr.press/v70/zenke17a.html.

  7. Kirkpatrick J, Pascanu R, Rabinowitz NC, Veness J, Desjardins G, Rusu AA, et al. Overcoming catastrophic forgetting in neural networks. CoRR. 2016;abs/1612.00796. arXiv:1612.00796.

  8. Oudeyer P. Computational theories of curiosity-driven learning. CoRR. 2018;abs/1802.10546. arXiv:1802.10546.

  9. Ten A, Oudeyer PY, Moulin-Frier C.: Curiosity-driven exploration: diversity of mechanisms and functions. https://doi.org/10.31234/osf.io/n2byt.

  10. Baranes A, Oudeyer P. Active learning of inverse models with intrinsically motivated goal exploration in robots. CoRR. 2013;abs/1301.4862. arXiv:1301.4862.

  11. Gottlieb J, Oudeyer PY, Lopes M, Baranes A. Information-seeking, curiosity, and attention: computational and neural mechanisms. Trends Cogn Sci. 2013;17(11):585–93. https://doi.org/10.1016/j.tics.2013.09.001.

  12. French R. Semi-distributed representations and catastrophic forgetting in connectionist networks. Connect Sci. 1992;01(4):365–77. https://doi.org/10.1080/09540099208946624.

  13. French R. Dynamically constraining connectionist networks to produce distributed, orthogonal representations to reduce catastrophic interference. In: Proceedings of the 16th Annual Cognitive Science Society Conference; 1994.

  14. Robins AV. Catastrophic forgetting, rehearsal and pseudorehearsal. Connect Sci. 1995;7:123–46.

  15. Silver D, Mercer R. The task rehearsal method of life-long learning: overcoming impoverished data. In: Advances in Artificial Intelligence; 2002. p. 90–101. https://doi.org/10.1007/3-540-47922-8_8.

  16. French RM. Pseudo-recurrent connectionist networks: an approach to the ‘sensitivity-stability’ dilemma. Connect Sci. 1997;9:353–80.

  17. Ans B, Rousset S. Avoiding catastrophic forgetting by coupling two reverberating neural networks. Comptes Rendus de l’Académie des Sciences - Series III - Sciences de la Vie. 1997;p. 989–997. https://doi.org/10.1016/S0764-4469(97)82472-9.

  18. Xiong F, Liu Z, Huang K, Yang X, Qiao H. State primitive learning to overcome catastrophic forgetting in robotics. Cogn Comput. 2021;p. 394–402. https://doi.org/10.1007/s12559-020-09784-8.

  19. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res. 2014;15(56):1929–58. http://jmlr.org/papers/v15/srivastava14a.html

  20. Goodfellow IJ, Mirza M, Xiao D, Courville A, Bengio Y.: An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv:1312.6211.

  21. Ammar HB, Eaton E, Ruvolo P, Taylor ME. Online multi-task learning for policy gradient methods. In: Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32. ICML’14. JMLR.org; 2014. p. II-1206-II-1214.

  22. Borsa D, Graepel T, Shawe-Taylor J.: Learning shared representations in multi-task reinforcement learning. arXiv:1603.02041.

  23. Rusu AA, Rabinowitz NC, Desjardins G, Soyer H, Kirkpatrick J, Kavukcuoglu K, et al.: Progressive neural networks. arXiv:1606.04671.

  24. Hinton G, Vinyals O, Dean J.: Distilling the knowledge in a neural network. arXiv:1503.02531.

  25. Traoré R, Caselles-Dupré H, Lesort T, Sun T, Cai G, Díaz-Rodríguez N, et al.: DisCoRL: Continual reinforcement learning via policy distillation. arXiv:1907.05855.

  26. Traoré R, Caselles-Dupré H, Lesort T, Sun T, Díaz-Rodríguez N, Filliat D.: Continual reinforcement learning deployed in real-life using policy distillation and sim2real transfer. arXiv:1906.04452.

  27. Rusu AA, Colmenarejo SG, Gulcehre C, Desjardins G, Kirkpatrick J, Pascanu R, et al.: Policy distillation. arXiv:1511.06295.

  28. Kaplanis C, Shanahan M, Clopath C.: Policy consolidation for continual reinforcement learning. arXiv:1902.00255.

  29. Tirumala D, Noh H, Galashov A, Hasenclever L, Ahuja A, Wayne G, et al.: Exploiting hierarchy for learning and transfer in KL-regularized RL; 2020. arXiv:1903.07438.

  30. Rebuffi S, Kolesnikov A, Lampert CH. iCaRL: incremental classifier and representation learning. CoRR. 2016;abs/1611.07725. arXiv:1611.07725.

  31. Rolnick D, Ahuja A, Schwarz J, Lillicrap TP, Wayne G.: Experience replay for continual learning. arXiv:1811.11682.

  32. Isele D, Cosgun A.: Selective experience replay for lifelong learning. arXiv:1802.10269.

  33. Chaudhry A, Rohrbach M, Elhoseiny M, Ajanthan T, Dokania PK, Torr PHS, et al.: On tiny episodic memories in continual learning; 2019. arXiv:1902.10486.

  34. Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, et al.: Playing Atari with deep reinforcement learning; 2013. arXiv:1312.5602.

  35. Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, et al.: Continuous control with deep reinforcement learning; 2019. arXiv:1509.02971.

  36. Haarnoja T, Zhou A, Abbeel P, Levine S. Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. CoRR. 2018;abs/1801.01290. arXiv:1801.01290.

  37. Vitter JS. Random sampling with a reservoir. ACM Trans Math Softw. 1985;11(1):37–57.

  38. Isele D, Cosgun A. Selective experience replay for lifelong learning. CoRR. 2018;abs/1802.10269. arXiv:1802.10269.

  39. Kaplanis C, Clopath C, Shanahan M.: Continual reinforcement learning with multi-timescale replay; 2020. arXiv:2004.07530.

  40. Bellemare MG, Srinivasan S, Ostrovski G, Schaul T, Saxton D, Munos R. Unifying count-based exploration and intrinsic motivation. CoRR. 2016;abs/1606.01868. arXiv:1606.01868.

  41. Lopes M, Lang T, Toussaint M, Oudeyer PY. Exploration in model-based reinforcement learning by empirically estimating learning progress. In: Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1. NIPS’12. Red Hook, NY, USA: Curran Associates Inc.; 2012. p. 206–214.

  42. Houthooft R, Chen X, Duan Y, Schulman J, Turck FD, Abbeel P. Curiosity-driven exploration in deep reinforcement learning via Bayesian neural networks. CoRR. 2016;abs/1605.09674. arXiv:1605.09674.

  43. Schmidhuber J. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Trans Auton Ment Dev. 2010;2(3):230–47. https://doi.org/10.1109/TAMD.2010.2056368.

  44. Pathak D, Agrawal P, Efros AA, Darrell T.: Curiosity-driven Exploration by Self-supervised Prediction. arXiv:1705.05363.

  45. Doncieux S, Filliat D, Rodríguez ND, Hospedales TM, Duro RJ, Coninx A, et al. Open-ended learning: a conceptual framework based on representational redescription. Front Neurorobot. 2018;12.

  46. Todorov E, Erez T, Tassa Y. MuJoCo: A physics engine for model-based control. In: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems; 2012. p. 5026–5033. https://doi.org/10.1109/IROS.2012.6386109.

  47. Tassa Y, Doron Y, Muldal A, Erez T, Li Y, de Las Casas D, et al.: DeepMind control suite; 2018. arXiv:1801.00690.

  48. Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, et al.: Playing Atari with deep reinforcement learning; 2013. arXiv:1312.5602.

  49. Kempka M, Wydmuch M, Runc G, Toczek J, Jaśkowski W.: ViZDoom: a doom-based AI research platform for visual reinforcement learning; 2016. arXiv:1605.02097.

  50. Rusu AA, Flennerhag S, Rao D, Pascanu R, Hadsell R.: Probing transfer in deep reinforcement learning without task engineering; 2022. arXiv:2210.12448.

  51. Bellemare MG, Naddaf Y, Veness J, Bowling M. The arcade learning environment: an evaluation platform for general agents. J Artif Intell Res. 2013;47:253–79. https://doi.org/10.1613/jair.3912.

  52. Brockman G, Cheung V, Pettersson L, Schneider J, Schulman J, Tang J, et al.: OpenAI Gym; 2016. arXiv:1606.01540.

  53. Ellenberger B.: PyBullet Gymperium; 2018-2019. https://github.com/benelot/pybullet-gym.

  54. Yu T, Quillen D, He Z, Julian R, Narayan A, Shively H, et al.: Meta-world: a benchmark and evaluation for multi-task and meta reinforcement learning; 2021. arXiv:1910.10897.

  55. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, et al.: PyTorch: an imperative style, high-performance deep learning library; 2019. arXiv:1912.01703.

Acknowledgements

We thank D. H. S. Maithripala and Elisa Massi for valuable feedback on an early version of this article.

Funding

Natalia Díaz-Rodríguez is supported by grant IJC2019-039152-I funded by MCIN/AEI/10.13039/501100011033 and by “ESF Investing in your future,” by the Google Research Scholar Program, and by the European Union through the Marie Skłodowska-Curie Postdoctoral Fellowship (Project: 101059332 - RRR-XAI - HORIZON-MSCA-2021-PF-01). J. Del Ser is supported by the Basque Government through the ELKARTEK program and the consolidated research group MATHMODE (IT1456-22).

Author information

Corresponding author

Correspondence to Pankayaraj Pathmanathan.

Ethics declarations

Ethical Approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Conflict of Interest

J. Del Ser is an editorial board member of Cognitive Computation. The authors declare that they have no other conflicts of interest regarding this work. Views and opinions expressed are, however, those of the author(s) only and do not necessarily reflect those of the European Union or the University of Granada. Neither the European Union nor the granting authority can be held responsible for them.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

A. Implementation

The PyTorch [55] based implementations of all the methods (HCB ("Using Curiosity as a Priority Measure when Retaining Old Samples"), HRBTS ("Using Curiosity for Task Change Detection"), HRB [38], MTR [39], and FIFO) and the experiments used in this work can be accessed at https://github.com/punk95/Continual-Learning-With-Curiosity
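
For orientation, the sketch below illustrates the general idea behind curiosity-prioritized retention (the "curious" part of HCB): when the buffer is full, the transition with the lowest stored curiosity is evicted so that samples from rarely seen, high-curiosity tasks remain represented. It omits the hybrid (FIFO) portion of the buffer and is not the exact algorithm; the repository above contains the actual implementation.

```python
# Hedged sketch only: a buffer that keeps the transitions with the highest
# recorded curiosity once capacity is exceeded. Not the paper's exact HCB.
import heapq
import itertools
import random

class CuriosityPrioritizedBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.heap = []                      # min-heap keyed by curiosity
        self.counter = itertools.count()    # tie-breaker for equal keys

    def push(self, transition, curiosity):
        item = (curiosity, next(self.counter), transition)
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, item)
        elif curiosity > self.heap[0][0]:
            heapq.heapreplace(self.heap, item)  # evict the least-curious sample

    def sample(self, batch_size):
        # Uniform sampling over the retained (curiosity-filtered) transitions.
        return [t for _, _, t in random.sample(self.heap, batch_size)]
```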

B. Hyperparameters of the Curiosity-Based Algorithms and Experiments

The continual learning setup was emulated in the environment by changing certain parameters of the control task. In both the Walker2D and Hopper environments, the power parameter of the learner, as defined in [53], was changed, whereas in the Pendulum environment the length of the pendulum was changed. In the Pendulum's case, the length was varied within the range of 1.0 to 1.8 in-environment units. In Hopper's case, the power ranged from 0.75 to 8.75 in-environment units, while Walker2D's power ranged from 1.4 to 13.4 in-environment units. The learning agents were not provided with explicit knowledge of these changes. Except for the every-timestep task encounter with small task drift and task re-visitation setting, tasks were defined by the minimum, maximum, and mean of the range. For example, the three tasks in the Pendulum's case correspond to learning to balance a pendulum of length 1.0, 1.4, and 1.8 in-environment units.
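
The following sketch shows one way such non-stationarity could be emulated for the Pendulum case: an environment parameter is overwritten on a fixed schedule while the agent interacts, without the agent being told. The schedule mirrors the phase boundaries reported in Table 5; the attribute name `l`, the random-action placeholder policy, and the classic (pre-0.26) Gym API are assumptions of this sketch rather than the paper's exact code.

```python
import gym

# (timestep at which the task starts, pendulum length for that task);
# boundaries follow the phases reported in Table 5. Illustrative only.
SCHEDULE = [(0, 1.0), (30_000, 1.4), (130_000, 1.8)]

def task_length(t):
    """Return the pendulum length active at timestep t."""
    length = SCHEDULE[0][1]
    for start, value in SCHEDULE:
        if t >= start:
            length = value
    return length

env = gym.make("Pendulum-v1")
obs = env.reset()
for t in range(150_000):
    env.unwrapped.l = task_length(t)     # overwrite the length parameter
    action = env.action_space.sample()   # placeholder policy
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
```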

We believe that certain characteristics should be maintained when selecting these parameter ranges (for the prolonged task encounter with substantial task drift and no task re-visitation setting, and for the frequent, short-lived task encounter with substantial task drift and task re-visitation setting) in order to create meaningful experiments that are in line with the scope of this paper. 1. Since neural architectural changes are outside the scope of this paper, the range should be selected such that a single architecture can, to a reasonable extent, generalize across the tasks. 2. The tasks selected should be independent of each other, at least to some degree. This is essential because one could otherwise mistake predominant training on a single task, when that task dominates in time, for generalization. 3. The tasks should be immune to few-shot learning when trained with a network that was optimized for a previous task. To elaborate, when tasks are close to each other by nature and we train on a subsequent task starting from a network optimized for the previous one, seeing only a few transition tuples of the new task may be enough for the network to start performing well (to adjust to the new task). This is possible only because those tasks happen to share some similarity, and the phenomenon cannot be expected in general. If we chose two tasks that are similar or easily adaptable in this way, we could misinterpret results that are a consequence of the algorithm predominantly training on one task as the algorithm generalizing well, when in fact this is a one-off scenario and the algorithm may perform worse in other settings. It is therefore essential to choose tasks that are not very similar in nature. To these ends, via trial and error, we selected the aforementioned ranges so that the tasks defined by their minimum, maximum, and mean satisfy these characteristics.

B.1. Parameter Tuning

See Table 2.

Table 2 Hyperparameters for all the algorithms (HCB ("Using Curiosity as a Priority Measure when Retaining Old Samples"), HRBTS ("Using Curiosity for Task Change Detection"), HRB [38], MTR [39], and FIFO) in the Pendulum, Hopper, and Walker2D environments

B.2. Parameters: Every Timestep Task Encounter with Small Task Drift and Task Re-Visitation

See Table 3.

Table 3 Hyperparameters for all the algorithms (HCB ("Using Curiosity as a Priority Measure when Retaining Old Samples"), HRBTS ("Using Curiosity for Task Change Detection"), HRB [38], MTR [39], and FIFO) in the Pendulum and Hopper environments

B.3. Parameters: Frequent, Short-Lived Task Encounter With Substantial Task Drift, and Task Re-Visitation

See Table 4.

Table 4 Hyperparameters for all the algorithms (HCB ("Using Curiosity as a Priority Measure when Retaining Old Samples"), HRBTS ("Using Curiosity for Task Change Detection"), HRB [38], MTR [39], and FIFO) in the Pendulum and Hopper environments

C. Additional Results and Analyses

C.1. Prolonged Task Encounter with Substantial Task Drift and No Task Re-Visitation

See Tables 5, 6, and 7.

Table 5 Rewards for Pendulum in all phases. Here, Task 1 corresponds to a pendulum length of 1.0, Task 2 to a length of 1.4, and Task 3 to a length of 1.8. Phase 1 corresponds to t = 0 to t = 30,000, where the agent was exposed to Task 1. Phase 2 corresponds to t = 30,000 to t = 130,000, where the agent was exposed to Task 2. Phase 3 corresponds to t = 130,000 to t = 150,000, where the agent was exposed to Task 3. The dominant task is the task the agent was exposed to the most (the task on which the agent spent most of its training time). Bold reward values show the best reward among the compared entries in each row
Table 6 Rewards for Hopper in all phases. Here, Task 1 corresponds to a Hopper power of 0.75, Task 2 to a power of 4.75, and Task 3 to a power of 8.75. Phase 1 corresponds to t = 0 to t = 50,000, where the agent was exposed to Task 1. Phase 2 corresponds to t = 50,000 to t = 350,000, where the agent was exposed to Task 2. Phase 3 corresponds to t = 350,000 to t = 400,000, where the agent was exposed to Task 3. The dominant task is the task the agent was exposed to the most (the task on which the agent spent most of its training time). Bold reward values show the best reward among the compared entries in each row
Table 7 Rewards for Walker2D in all phases. Here, Task 1 corresponds to a Walker2D power of 1.40, Task 2 to a power of 7.40, and Task 3 to a power of 13.40. Phase 1 corresponds to t = 0 to t = 250,000, where the agent was exposed to Task 1. Phase 2 corresponds to t = 250,000 to t = 350,000, where the agent was exposed to Task 2. Phase 3 corresponds to t = 350,000 to t = 400,000, where the agent was exposed to Task 3. The dominant task is the task the agent was exposed to the most (the task on which the agent spent most of its training time). Bold reward values show the best reward among the compared entries in each row

C.2. Frequent, Short-Lived Task Encounter With Substantial Task Drift, and Task Re-Visitation

See Tables 8 and 9.

Table 8 Rewards for Pendulum in all phases. Here, Task 1 corresponds to a pendulum length of 1.0, Task 2 to a length of 1.4, and Task 3 to a length of 1.8. Phase 1 corresponds to t = 56,602 to t = 61,602, where the agent was exposed to Task 3. Phase 2 corresponds to t = 105,977 to t = 110,977, where the agent was exposed to Task 1. Phase 3 corresponds to t = 123,155 to t = 141,158, where the agent was exposed to Task 2. The dominant task is the task the agent was exposed to the most (the task on which the agent spent most of its training time). Bold reward values show the best reward among the compared entries in each row
Table 9 Rewards for Hopper in all phases. Here, Task 1 corresponds to a Hopper power of 0.75, Task 2 to a power of 4.75, and Task 3 to a power of 8.75. Phase 1 corresponds to t = 86,067 to t = 111,067, where the agent was exposed to Task 1. Phase 2 corresponds to t = 111,067 to t = 150,939, where the agent was exposed to Task 2. Phase 3 corresponds to t = 315,080 to t = 327,580, where the agent was exposed to Task 3. The dominant task is the task the agent was exposed to the most (the task on which the agent spent most of its training time). Bold reward values show the best reward among the compared entries in each row

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Pathmanathan, P., Díaz-Rodríguez, N. & Del Ser, J. Using Curiosity for an Even Representation of Tasks in Continual Offline Reinforcement Learning. Cogn Comput 16, 425–453 (2024). https://doi.org/10.1007/s12559-023-10213-9
