Problem Description#
Given an array of n strings, each of length m, and a target string of length m, write a program to obtain the indices of the strings in the array that are identical to the target string.
Example:
String[] toFind = new String[]{"abca","abab","abba","abca"};
String target = "abca";
// Result [0,3]
Solution#
Traversal#
The simplest method is to traverse the string array and compare each string with the target string. If they are equal, add the index to the result, and finally return the result.
List<Integer> result = new ArrayList<>();
for (int i = 0; i < toFind.length; i++) {
if (target.equals(toFind[i])) {
result.add(i);
}
}
return result;
Trie#
Use the array of strings to construct a trie, storing the index of each string at the root node. Use the target string to search in the trie to obtain the result.
Schematic Diagram#
Code Implementation#
class Main {
public static void main(String[] args) {
String[] toFind = new String[] { "abca", "abab", "abba", "abca" };
String target = "abca";
TrieTreeNode root = new TrieTreeNode();
for (int i = 0; i < toFind.length; i++) {
root.addStringToTree(toFind[i], i);
}
System.out.println(root.searchInTree(target));
}
}
// Simple implementation of a trie node
class TrieTreeNode {
// Child nodes
private List<TrieTreeNode> list;
// Stores indices
private List<Integer> result;
// Character at this node
private char ch;
// Empty constructor (constructs root node)
public TrieTreeNode() {
}
// Constructor for child nodes
private TrieTreeNode(char ch) {
this.ch = ch;
}
// Check if the character at this node is ch
private boolean isChar(char ch) {
return this.ch == ch;
}
// Get the child node with character ch (automatically adds if not present)
private TrieTreeNode getNext(char ch) {
if (list == null) {
list = new ArrayList<>();
}
for (var item : list) {
if (item.isChar(ch)) {
return item;
}
}
var tmp = new TrieTreeNode(ch);
list.add(tmp);
return tmp;
}
// Use the above private functions to add a string to the trie
public void addStringToTree(String str, int index) {
var tmp = this;
for (var item : str.toCharArray()) {
tmp = tmp.getNext(item);
}
if (tmp.result == null) {
tmp.result = new ArrayList<>();
}
tmp.result.add(index);
}
// Search for the target string in the trie
public List<Integer> searchInTree(String target) {
var tmp = this;
for (var item : target.toCharArray()) {
if (tmp.list == null || tmp.list.isEmpty()) {
return new ArrayList<Integer>();
}
tmp = tmp.getNext(item);
}
return tmp.result;
}
}
Hash Table#
Use a String->List<Integer>
hash table, with the string as the key and the corresponding indices as the value. Directly return the result by looking up the table.
Map<String, List<Integer>> map = new HashMap<>();
for (int i = 0; i < toFind.length; i++) {
if (map.get(toFind[i]) == null) {
map.put(toFind[i], new ArrayList<Integer>());
}
map.get(toFind[i]).add(i);
}
return map.get(target);
Split the String and Use Hash Table (Homework)#
Explanation#
When the given string is too long or too large in scale, directly applying a hash table can lead to significant space waste. In such cases, we can split the string, using each substring as a key, and a list of pairs (i,j) as the value, where i is the index of the string and j is the position of the substring.
When searching for the target string, we also split the target string into substrings and use each substring as a key to query the list of pairs. If a pair (i,j) in the list has the substring position (j) matching the current substring's position in the target string, it indicates that the substring of toFind[i] matches the target string, and we add i to the index set of that substring. Finally, we take the intersection of the index sets for all fields of the target string to obtain the result.
For each element i in the result, all substrings of toFind[i] correspond to the substrings of the target string. That is, toFind[i] is identical to the target string.
Schematic Diagram#
Code Implementation (including input and output, replacing String[] with List<String>)#
import java.util.List;
import java.util.ArrayList;
import java.util.Scanner;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Set;
import java.util.Map;
class Main {
public static void main(String[] args) {
// Input strings to find
List<String> list = new ArrayList<>();
String target;
try (var scanner = new Scanner(System.in)) {
System.out.print("Please input target String:");
// Target string
target = scanner.nextLine();
System.out.println("Please input String array to find(end with space):");
String tmp;
// Read into list if not empty
while (!"".equals(tmp = scanner.nextLine())) {
list.add(tmp);
}
}
// Output search results
System.out.println(find(target, getMap(list)));
}
// String splitting function
private static List<String> split(String src, int size) {
List<String> result = new ArrayList<>();
int begin = 0;
while (begin < src.length()) {
// If the start position plus the cut length exceeds the source string length, cut to the end of the string and break
if (begin + size > src.length()) {
result.add(src.substring(begin));
break;
}
result.add(src.substring(begin, begin + size));
begin += size;
}
return result;
}
private static Map<String, List<Pair>> getMap(List<String> list) {
Map<String, List<Pair>> map = new HashMap<>();
for (int i = 0; i < list.size(); i++) {
String item = list.get(i);
// Split the string to find into List<String>
var array = split(item, 10);
// Traverse all split results
for (int j = 0; j < array.size(); j++) {
String key = array.get(j);
// Use the split result as key to get value from map, if null, add an empty List
if (map.get(key) == null) {
map.put(key, new ArrayList<Pair>());
}
// Add the pair <i,j> (where i is the index of the string to find, j is the position in that string)
map.get(key).add(new Pair(i, j));
}
}
return map;
}
private static Set<Integer> find(String target, Map<String, List<Pair>> map) {
// If target is null, return empty set
if (target == null) {
return new HashSet<Integer>();
}
List<Set<Integer>> list = new ArrayList<>();
// Split the target string
var array = split(target, 10);
// Traverse the split string array
for (int i = 0; i < array.size(); i++) {
var key = array.get(i);
// Set of row indices that match this segment
Set<Integer> set = new HashSet<>();
// If this segment matches, add the row index to the set
for (var item : map.get(key)) {
if (item.getLast() == i) {
set.add(item.getFirst());
}
}
// Add the set to the list
list.add(set);
}
// If the list is empty, return an empty set
if (list.isEmpty()) {
return new HashSet<Integer>();
}
// Take the first set from the list
Set<Integer> result = list.get(0);
// Take the intersection of this set with all sets
for (var item : list) {
result.retainAll(item);
}
// Return the result
return result;
}
}
// A simple Pair implementation (<Integer,Integer>)
class Pair {
private final Integer first;
private final Integer last;
public Pair(Integer first, Integer last) {
this.first = first;
this.last = last;
}
public Integer getFirst() {
return this.first;
}
public Integer getLast() {
return this.last;
}
@Override
public boolean equals(Object compare) {
if (compare == null) {
return false;
}
Pair tmp = (Pair) compare;
return this.first.equals(tmp.first) && this.last.equals(tmp.last);
}
}
Further Thoughts (Homework)#
In the above algorithm, we used fixed lengths to split the search domain toFind
and the target target
, but using this method may lead to some substrings appearing too frequently. Therefore, we can consider optimizing the string splitting algorithm by shifting substrings that appear too frequently (i.e., concatenating the characters following the substring to the end of it). For example:
// Split the string with a length of 3
"abcedfabcqweabcertabctyu"
// Result can be seen, abc appears too frequently, shift it
"abc|edf|abc|qwe|abc|ert|abc|tyu"
// After shifting, abc is differentiated into abce|abcq|abct, reducing the frequency of this substring
"abce|edf|abcq|qwe|abce|ert|abct|tyu"
Points to Note in the Algorithm#
-
How to define "too frequent"?
Clearly, we need to set a ratio to indicate the maximum allowed frequency of a substring. When the frequency of a substring exceeds this threshold, we shift that substring, attempting to "differentiate" it to a lower frequency. Repeat this process to achieve the final result where "the frequency of all substrings is below this threshold."
-
The integrity of the string splitting algorithm.
In this problem, we used the same rules to split each string in the search domain and the target string, comparing corresponding positions, and finally taking the intersection of the row indices that are equal in corresponding positions to obtain the result.
In the case of ordinary fixed-length splitting, we can handle all strings separately. However, in the current algorithm framework, if we handle strings separately, it may lead to inconsistent splitting methods. Therefore, when performing substring frequency statistics, substring shifting, and other operations, we need to treat the set of {search domain, target} as a whole, ensuring that each operation can successfully apply to every string in the search domain and the target string.
My Java Code Implementation (using recursion and higher complexity)#
import java.util.*;
class Main {
// Global statistics for the length of substrings that need to be shifted
private static Map<String, Integer> globalCount = new HashMap<>();
// Statistics for the occurrence count of substrings in each split
private static Map<String, Integer> count = new HashMap<>();
public static void main(String[] args) {
String[] range = new String[] { "abcabcedf", "abeabeefg", "bcdbcabec" };
String target = "abcabcedf";
// The splitting method for target and range should be the same, so both need to be processed simultaneously
List<List<String>> list = cutStrings(range, target, 3, 0.2);
System.out.println(list);
}
// Get the list after splitting, parameters are the search domain, target, original split length, and maximum occurrence frequency of substrings
private static List<List<String>> cutStrings(String[] range, String target, int width, double pro) {
// Total number of substrings equals (length of string / split length) rounded up multiplied by the number of strings
int totalCount = ((int) Math.ceil((double) target.length() / width)) * (range.length + 1);
// Maximum occurrence frequency less than (1 / total number of substrings) is impossible, return empty 2D array
// (Although there are limits, if pro is set too low, stack overflow may still occur)
if (pro < 1.0 / totalCount) {
return new ArrayList<List<String>>();
}
count.clear();
List<List<String>> result = new ArrayList<>();
// Traverse the search domain, split the strings by specified length
for (var item : range) {
result.add(cutString(item, width));
}
// Split the target by specified length
result.add(cutString(target, width));
// After splitting all strings, we have the occurrence count of all substrings in count
// Next, we need to traverse count, marking substrings with occurrence frequency greater than the limit, and shifting them in the next split
boolean flag = false; // flag to mark if any substring has an occurrence frequency greater than the limit
for (var item : count.entrySet()) {
if ((double) item.getValue() / totalCount > pro) {
addForGlobalCount(item.getKey().substring(0, width));
flag = true;
}
}
// If all substring occurrence frequencies are below the limit, return directly
if (!flag) {
return result;
}
// Otherwise, split again
return cutStrings(range, target, width, pro);
}
// Count occurrences of str
private static void addForCount(String str) {
count.put(str, count.getOrDefault(str, 0) + 1);
}
// Count the length of substrings that need to be shifted (the basic logic is if shifting still results in high frequency, continue shifting)
private static void addForGlobalCount(String str) {
globalCount.put(str, globalCount.getOrDefault(str, 0) + 1);
}
// Split str and count occurrences of all substrings (the splitting method will be corrected by globalCount in each loop, gradually approaching the correct result)
private static List<String> cutString(String str, int width) {
List<String> result = new ArrayList<>();
int start = 0;
String sub;
while (start < str.length()) {
// If reaching the end, cut to the end
if (start + width > str.length()) {
sub = str.substring(start);
result.add(sub);
addForCount(sub);
break;
}
Integer extraWidth;
// Check if shifting is needed
if ((extraWidth = globalCount.get(str.substring(start, start + width))) != null) {
// If shifting is needed, check if it reaches the end
int tmp = start + width + extraWidth > str.length() ? str.length() : start + width + extraWidth;
sub = str.substring(start, tmp);
} else {
sub = str.substring(start, start + width);
}
result.add(sub);
addForCount(sub);
start += width;
}
return result;
}
}