Streaming data cleaning based on speed change

  • Regular Paper
The VLDB Journal

Abstract

Errors are prevalent in data sequences such as GPS trajectories and sensor readings. Existing methods for cleaning sequential data impose a constraint on the speed at which values change and perform constraint-based repairing. While such speed constraints are effective in identifying large spike errors, small errors that deviate little from the truth, and therefore still satisfy the speed constraints, can hardly be identified or repaired. To handle such small errors, we propose a cleaning method based on the probability of speed changes. Rather than declaring a broad constraint on maximum and minimum speeds, we model the probability distribution of speed changes. The repairing problem is then to find the repair that maximizes the probability of the sequence with respect to this distribution. We formalize the probability-based repairing problem and devise algorithms for streaming scenarios. Experiments on real data sets from various applications demonstrate the superiority of our proposal.
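
The idea in the abstract can be illustrated with a minimal sketch. It is not the algorithm from the paper: it assumes unit-spaced timestamps and integer values, estimates the speed-change distribution empirically from a trusted history, and breaks ties by the smallest total modification. The names `speed_changes`, `fit_log_prob`, `repair`, and the parameters `floor` and `radius` are all hypothetical.

```python
import math
from collections import Counter


def speed_changes(xs):
    """Change in speed between consecutive steps (unit-spaced timestamps assumed)."""
    return [xs[i + 1] - 2 * xs[i] + xs[i - 1] for i in range(1, len(xs) - 1)]


def fit_log_prob(history, floor=1e-6):
    """Empirical log-probability of each speed change seen in a trusted history.

    Unseen speed changes get log(floor); `floor` is an illustrative assumption.
    """
    counts = Counter(speed_changes(history))
    total = sum(counts.values())
    table = {u: math.log(c / total) for u, c in counts.items()}
    return lambda u: table.get(u, math.log(floor))


def repair(xs, log_prob, radius=2):
    """Pick one candidate per point so the summed log-probability of the
    resulting speed changes is maximal (dynamic programming over the last
    two chosen values). Ties are broken by the smallest total change."""
    n = len(xs)
    if n < 3:
        return list(xs)  # no speed change is defined for fewer than 3 points
    # Candidate repairs: integers within `radius` of each observation.
    cands = [[x + d for d in range(-radius, radius + 1)] for x in xs]
    # State (a, b) = last two chosen values; score = (log-likelihood, -distance).
    best = {
        (a, b): ((0.0, -(abs(a - xs[0]) + abs(b - xs[1]))), [a, b])
        for a in cands[0]
        for b in cands[1]
    }
    for i in range(2, n):
        nxt = {}
        for (a, b), ((ll, neg_dist), path) in best.items():
            for c in cands[i]:
                score = (ll + log_prob(c - 2 * b + a), neg_dist - abs(c - xs[i]))
                if (b, c) not in nxt or score > nxt[(b, c)][0]:
                    nxt[(b, c)] = (score, path + [c])
        best = nxt
    return max(best.values(), key=lambda entry: entry[0])[1]


history = list(range(20))          # trusted data: constant speed, no speed change
log_prob = fit_log_prob(history)
dirty = [0, 1, 2, 4, 4, 5, 6]      # small error at index 3 (true value 3)
print(repair(dirty, log_prob, radius=1))  # [0, 1, 2, 3, 4, 5, 6]
```

In this toy input the largest speed is 2, so a maximum-speed constraint of 2 or more would accept the dirty sequence, whereas the speed changes 1, -2, 1 around index 3 are improbable under the fitted distribution. The sketch keeps full paths in each state for readability; back-pointers would be the usual choice for long sequences.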

Notes

  1. Except for QP, our methods traverse the repairing space of each data point, so continuous values must be discretized in advance.

  2. The log probability is only defined in this proof. In actual repairing, it is constructed from original data.

  3. The probability distribution is constructed before dynamic programming and remains unchanged during repairing.

  4. http://finance.yahoo.com/q/hp?s=AIP.L+Historical+Prices.

  5. Mixed Gaussian distribution whose density function is \({\mathcal {N}}(-10,1)+{\mathcal {N}}(10,1)\).

  6. To save experiment time, we only use part of the dataset in some experiments.
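
The discretization mentioned in note 1 could look like the following sketch, which snaps a continuous observation to a grid and lists nearby grid points as candidate repairs. The function name `candidate_grid` and the parameters `step` and `radius` are hypothetical; the notes do not say how the grid is chosen.

```python
def candidate_grid(observed, step, radius):
    """Candidate repair values on a grid of width `step` around `observed`.

    The observation is snapped to its nearest grid point, and `radius`
    grid points on each side are returned as well.
    """
    center = round(observed / step)
    return [(center + k) * step for k in range(-radius, radius + 1)]


print(candidate_grid(3.26, 0.5, 1))  # [3.0, 3.5, 4.0]
```

A finer `step` approximates the continuous repair space more closely but enlarges the space that has to be traversed for each data point.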

Acknowledgements

This work is supported in part by the National Natural Science Foundation of China (62072265, 62102023, 62021002, 62232005), National Key Research and Development Plan (2021YFB3300500, 2019YFB1705301), Beijing National Research Center for Information Science and Technology (BNR2022RC01011), and Alibaba Group through Alibaba Innovative Research (AIR) Program. Shaoxu Song (https://sxsong.github.io/) is the corresponding author.

Author information

Corresponding author

Correspondence to Shaoxu Song.

About this article

Cite this article

Wang, H., Zhang, A., Song, S. et al. Streaming data cleaning based on speed change. The VLDB Journal 33, 1–24 (2024). https://doi.org/10.1007/s00778-023-00796-y

  • DOI: https://doi.org/10.1007/s00778-023-00796-y
