Find identical strings

Problem Description#

Given an array of n strings, each of length m, and a target string of length m, write a program to obtain the indices of the strings in the array that are identical to the target string.

Example:

String[] toFind = new String[]{"abca","abab","abba","abca"};
String target = "abca";
// Result [0,3]

Solution#

Traversal#

The simplest method is to traverse the string array and compare each string with the target string. If they are equal, add the index to the result, and finally return the result.

List<Integer> result = new ArrayList<>();
for (int i = 0; i < toFind.length; i++) {
    if (target.equals(toFind[i])) {
        result.add(i);
    }
}
return result;

Trie#

Use the array of strings to construct a trie, storing the index of each string at the root node. Use the target string to search in the trie to obtain the result.

Schematic Diagram#

photo_2020-09-18_17-26-52

Code Implementation#

class Main {
    public static void main(String[] args) {
        String[] toFind = new String[] { "abca", "abab", "abba", "abca" };
        String target = "abca";
        TrieTreeNode root = new TrieTreeNode();
        for (int i = 0; i < toFind.length; i++) {
            root.addStringToTree(toFind[i], i);
        }
        System.out.println(root.searchInTree(target));
    }
}

// Simple implementation of a trie node
class TrieTreeNode {
    // Child nodes
    private List<TrieTreeNode> list;
    // Stores indices
    private List<Integer> result;
    // Character at this node
    private char ch;

    // Empty constructor (constructs root node)
    public TrieTreeNode() {
    }

    // Constructor for child nodes
    private TrieTreeNode(char ch) {
        this.ch = ch;
    }

    // Check if the character at this node is ch
    private boolean isChar(char ch) {
        return this.ch == ch;
    }

    // Get the child node with character ch (automatically adds if not present)
    private TrieTreeNode getNext(char ch) {
        if (list == null) {
            list = new ArrayList<>();
        }
        for (var item : list) {
            if (item.isChar(ch)) {
                return item;
            }
        }
        var tmp = new TrieTreeNode(ch);
        list.add(tmp);
        return tmp;
    }

    // Use the above private functions to add a string to the trie
    public void addStringToTree(String str, int index) {
        var tmp = this;
        for (var item : str.toCharArray()) {
            tmp = tmp.getNext(item);
        }
        if (tmp.result == null) {
            tmp.result = new ArrayList<>();
        }
        tmp.result.add(index);
    }

    // Search for the target string in the trie
    public List<Integer> searchInTree(String target) {
        var tmp = this;
        for (var item : target.toCharArray()) {
            if (tmp.list == null || tmp.list.isEmpty()) {
                return new ArrayList<Integer>();
            }
            tmp = tmp.getNext(item);
        }
        return tmp.result;
    }
}

Hash Table#

Use a String->List<Integer> hash table, with the string as the key and the corresponding indices as the value. Directly return the result by looking up the table.

Map<String, List<Integer>> map = new HashMap<>();
for (int i = 0; i < toFind.length; i++) {
    if (map.get(toFind[i]) == null) {
        map.put(toFind[i], new ArrayList<Integer>());
    }
    map.get(toFind[i]).add(i);
}
return map.get(target);

Split the String and Use Hash Table (Homework)#

Explanation#

When the given string is too long or too large in scale, directly applying a hash table can lead to significant space waste. In such cases, we can split the string, using each substring as a key, and a list of pairs (i,j) as the value, where i is the index of the string and j is the position of the substring.

When searching for the target string, we also split the target string into substrings and use each substring as a key to query the list of pairs. If a pair (i,j) in the list has the substring position (j) matching the current substring's position in the target string, it indicates that the substring of toFind[i] matches the target string, and we add i to the index set of that substring. Finally, we take the intersection of the index sets for all fields of the target string to obtain the result.

For each element i in the result, all substrings of toFind[i] correspond to the substrings of the target string. That is, toFind[i] is identical to the target string.

Schematic Diagram#

photo_2020-09-18_18-33-27

Code Implementation (including input and output, replacing String[] with List<String>)#

import java.util.List;
import java.util.ArrayList;
import java.util.Scanner;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Set;
import java.util.Map;

class Main {
    public static void main(String[] args) {
        // Input strings to find
        List<String> list = new ArrayList<>();
        String target;
        try (var scanner = new Scanner(System.in)) {
            System.out.print("Please input target String:");
            // Target string
            target = scanner.nextLine();
            System.out.println("Please input String array to find(end with space):");
            String tmp;
            // Read into list if not empty
            while (!"".equals(tmp = scanner.nextLine())) {
                list.add(tmp);
            }
        }
        // Output search results
        System.out.println(find(target, getMap(list)));

    }

    // String splitting function
    private static List<String> split(String src, int size) {
        List<String> result = new ArrayList<>();
        int begin = 0;
        while (begin < src.length()) {
            // If the start position plus the cut length exceeds the source string length, cut to the end of the string and break
            if (begin + size > src.length()) {
                result.add(src.substring(begin));
                break;
            }
            result.add(src.substring(begin, begin + size));
            begin += size;
        }
        return result;
    }

    private static Map<String, List<Pair>> getMap(List<String> list) {
        Map<String, List<Pair>> map = new HashMap<>();
        for (int i = 0; i < list.size(); i++) {
            String item = list.get(i);
            // Split the string to find into List<String>
            var array = split(item, 10);
            // Traverse all split results
            for (int j = 0; j < array.size(); j++) {
                String key = array.get(j);
                // Use the split result as key to get value from map, if null, add an empty List
                if (map.get(key) == null) {
                    map.put(key, new ArrayList<Pair>());
                }
                // Add the pair <i,j> (where i is the index of the string to find, j is the position in that string)
                map.get(key).add(new Pair(i, j));
            }
        }
        return map;
    }

    private static Set<Integer> find(String target, Map<String, List<Pair>> map) {
        // If target is null, return empty set
        if (target == null) {
            return new HashSet<Integer>();
        }
        List<Set<Integer>> list = new ArrayList<>();
        // Split the target string
        var array = split(target, 10);
        // Traverse the split string array
        for (int i = 0; i < array.size(); i++) {
            var key = array.get(i);
            // Set of row indices that match this segment
            Set<Integer> set = new HashSet<>();
            // If this segment matches, add the row index to the set
            for (var item : map.get(key)) {
                if (item.getLast() == i) {
                    set.add(item.getFirst());
                }
            }
            // Add the set to the list
            list.add(set);
        }
        // If the list is empty, return an empty set
        if (list.isEmpty()) {
            return new HashSet<Integer>();
        }
        // Take the first set from the list
        Set<Integer> result = list.get(0);
        // Take the intersection of this set with all sets
        for (var item : list) {
            result.retainAll(item);
        }
        // Return the result
        return result;
    }
}

// A simple Pair implementation (<Integer,Integer>)
class Pair {
    private final Integer first;
    private final Integer last;

    public Pair(Integer first, Integer last) {
        this.first = first;
        this.last = last;
    }

    public Integer getFirst() {
        return this.first;
    }

    public Integer getLast() {
        return this.last;
    }

    @Override
    public boolean equals(Object compare) {
        if (compare == null) {
            return false;
        }
        Pair tmp = (Pair) compare;
        return this.first.equals(tmp.first) && this.last.equals(tmp.last);
    }
}

Further Thoughts (Homework)#

In the above algorithm, we used fixed lengths to split the search domain toFind and the target target, but using this method may lead to some substrings appearing too frequently. Therefore, we can consider optimizing the string splitting algorithm by shifting substrings that appear too frequently (i.e., concatenating the characters following the substring to the end of it). For example:

// Split the string with a length of 3
"abcedfabcqweabcertabctyu"
// Result can be seen, abc appears too frequently, shift it
"abc|edf|abc|qwe|abc|ert|abc|tyu"
// After shifting, abc is differentiated into abce|abcq|abct, reducing the frequency of this substring
"abce|edf|abcq|qwe|abce|ert|abct|tyu"

Points to Note in the Algorithm#

How to define "too frequent"?

Clearly, we need to set a ratio to indicate the maximum allowed frequency of a substring. When the frequency of a substring exceeds this threshold, we shift that substring, attempting to "differentiate" it to a lower frequency. Repeat this process to achieve the final result where "the frequency of all substrings is below this threshold."
The integrity of the string splitting algorithm.

In this problem, we used the same rules to split each string in the search domain and the target string, comparing corresponding positions, and finally taking the intersection of the row indices that are equal in corresponding positions to obtain the result.

In the case of ordinary fixed-length splitting, we can handle all strings separately. However, in the current algorithm framework, if we handle strings separately, it may lead to inconsistent splitting methods. Therefore, when performing substring frequency statistics, substring shifting, and other operations, we need to treat the set of {search domain, target} as a whole, ensuring that each operation can successfully apply to every string in the search domain and the target string.

My Java Code Implementation (using recursion and higher complexity)#

import java.util.*;

class Main {
    // Global statistics for the length of substrings that need to be shifted
    private static Map<String, Integer> globalCount = new HashMap<>();
    // Statistics for the occurrence count of substrings in each split
    private static Map<String, Integer> count = new HashMap<>();

    public static void main(String[] args) {
        String[] range = new String[] { "abcabcedf", "abeabeefg", "bcdbcabec" };
        String target = "abcabcedf";
        // The splitting method for target and range should be the same, so both need to be processed simultaneously
        List<List<String>> list = cutStrings(range, target, 3, 0.2);
        System.out.println(list);
    }

    // Get the list after splitting, parameters are the search domain, target, original split length, and maximum occurrence frequency of substrings
    private static List<List<String>> cutStrings(String[] range, String target, int width, double pro) {
        // Total number of substrings equals (length of string / split length) rounded up multiplied by the number of strings
        int totalCount = ((int) Math.ceil((double) target.length() / width)) * (range.length + 1);
        // Maximum occurrence frequency less than (1 / total number of substrings) is impossible, return empty 2D array
        // (Although there are limits, if pro is set too low, stack overflow may still occur)
        if (pro < 1.0 / totalCount) {
            return new ArrayList<List<String>>();
        }
        count.clear();
        List<List<String>> result = new ArrayList<>();
        // Traverse the search domain, split the strings by specified length
        for (var item : range) {
            result.add(cutString(item, width));
        }
        // Split the target by specified length
        result.add(cutString(target, width));
        // After splitting all strings, we have the occurrence count of all substrings in count
        // Next, we need to traverse count, marking substrings with occurrence frequency greater than the limit, and shifting them in the next split
        boolean flag = false; // flag to mark if any substring has an occurrence frequency greater than the limit
        for (var item : count.entrySet()) {
            if ((double) item.getValue() / totalCount > pro) {
                addForGlobalCount(item.getKey().substring(0, width));
                flag = true;
            }
        }
        // If all substring occurrence frequencies are below the limit, return directly
        if (!flag) {
            return result;
        }
        // Otherwise, split again
        return cutStrings(range, target, width, pro);
    }

    // Count occurrences of str
    private static void addForCount(String str) {
        count.put(str, count.getOrDefault(str, 0) + 1);
    }

    // Count the length of substrings that need to be shifted (the basic logic is if shifting still results in high frequency, continue shifting)
    private static void addForGlobalCount(String str) {
        globalCount.put(str, globalCount.getOrDefault(str, 0) + 1);
    }

    // Split str and count occurrences of all substrings (the splitting method will be corrected by globalCount in each loop, gradually approaching the correct result)
    private static List<String> cutString(String str, int width) {
        List<String> result = new ArrayList<>();
        int start = 0;
        String sub;
        while (start < str.length()) {
            // If reaching the end, cut to the end
            if (start + width > str.length()) {
                sub = str.substring(start);
                result.add(sub);
                addForCount(sub);
                break;
            }
            Integer extraWidth;
            // Check if shifting is needed
            if ((extraWidth = globalCount.get(str.substring(start, start + width))) != null) {
                // If shifting is needed, check if it reaches the end
                int tmp = start + width + extraWidth > str.length() ? str.length() : start + width + extraWidth;
                sub = str.substring(start, tmp);
            } else {
                sub = str.substring(start, start + width);
            }
            result.add(sub);
            addForCount(sub);
            start += width;
        }
        return result;
    }
}