I have a sequence of characters in chronological order. I am looking for an algorithm that groups the sequence into runs and removes errors in the data. I am having a hard time expressing the requirement in words or maths, but roughly: the outcome should group as many characters as possible into chains of a single repeated letter, and the minimum chain length should be a setting. The examples below illustrate what I mean (minimum of 3 characters):
AAACCAACA => AAAAAAAAA
ACACAAAAA => AAAAAAAAA
AYYYYYYYA => YYYYYYYYY
YYYYAAYYY => YYYYYYYYY
AYAYAYYYY => YYYYYYYYY
For longer sequences it becomes a little trickier:
AYAYAYAYAAAAAAAAAAYAYYYYYYYYYYYYYY => AAAAAAAAAAAAAAAAAAYYYYYYYYYYYYYYYY
HTHTTHHHTTHHH => TTTTTHHHHHHHH
TTOAOAOOAATTA => OOOOOOOOAAAAA
The last one is a bit tricky because, even though the 3 O’s aren’t directly next to each other, they are still the most probable uninterrupted sequence of characters.
Does anyone know of an algorithm (machine learning?), or the name of this kind of problem, that can do this type of error correction?
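One possible way to make the requirement precise, as a rough sketch rather than a known solution: relabel the sequence so that every run has at least the minimum length, while minimizing the number of changed characters plus a fixed penalty for each change of letter. The function name `smooth`, the parameter `switch_cost`, and its default of 1.5 are assumptions made only for illustration. The examples above do not all seem to follow from one penalty value, so the output should be read as an approximation of the intent.

```python
from math import inf

def smooth(seq, min_len=3, switch_cost=1.5):
    """Relabel seq so every run has at least min_len characters, minimizing
    (changed characters) + switch_cost * (number of run boundaries).
    The cost model and the default switch_cost are illustrative assumptions."""
    n = len(seq)
    if n == 0:
        return ""
    letters = sorted(set(seq))
    k = min(min_len, n)  # a sequence shorter than min_len can only be one run
    # State (c, r): current letter c, current run length r (capped at k).
    cost = {(c, 1): (0 if c == seq[0] else 1) for c in letters}
    back = [{}]  # back[i][state] = best predecessor state at position i - 1
    for i in range(1, n):
        new_cost, ptr = {}, {}
        for (c, r), cur in cost.items():
            # Option 1: extend the current run.
            cands = [((c, min(r + 1, k)), cur)]
            # Option 2: start a new run, only once the current run is long enough.
            if r == k:
                cands += [((d, 1), cur + switch_cost) for d in letters if d != c]
            for (d, r2), base in cands:
                total = base + (0 if d == seq[i] else 1)
                if total < new_cost.get((d, r2), inf):
                    new_cost[(d, r2)] = total
                    ptr[(d, r2)] = (c, r)
        cost = new_cost
        back.append(ptr)
    # The final run must also have reached the minimum length.
    state = min((s for s in cost if s[1] == k), key=lambda s: cost[s])
    out = []
    for i in range(n - 1, -1, -1):  # walk the back-pointers to recover the labels
        out.append(state[0])
        if i > 0:
            state = back[i][state]
    return "".join(reversed(out))

print(smooth("AAACCAACA"))                  # AAAAAAAAA
print(smooth("YYYYAAYYY", switch_cost=0))   # YYYAAAYYY
```

With the penalty set to 0, only the minimum length is enforced, so `YYYYAAYYY` becomes `YYYAAAYYY` (one change) instead of the all-Y result wanted above. This is why some cost for switching letters seems necessary in addition to the minimum length.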
Thank you in advance