LeetCode //C - 609. Find Duplicate File in System

609. Find Duplicate File in System

Given a list paths of directory info, including the directory path, and all the files with contents in this directory, return all the duplicate files in the file system in terms of their paths. You may return the answer in any order.

A group of duplicate files consists of at least two files that have the same content.

A single directory info string in the input list has the following format:

  • “root/d1/d2/…/dm f1.txt(f1_content) f2.txt(f2_content) … fn.txt(fn_content)”

It means there are n files (f1.txt, f2.txt … fn.txt) with content (f1_content, f2_content … fn_content) respectively in the directory “root/d1/d2/…/dm”. Note that n >= 1 and m >= 0. If m = 0, it means the directory is just the root directory.

The output is a list of groups of duplicate file paths. For each group, it contains all the file paths of the files that have the same content. A file path is a string that has the following format:

  • “directory_path/file_name.txt”
     
Example 1:

Input: paths = ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd)","root/c/d 4.txt(efgh)","root 4.txt(efgh)"]
Output: [["root/a/2.txt","root/c/d/4.txt","root/4.txt"],["root/a/1.txt","root/c/3.txt"]]

Example 2:

Input: paths = ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd)","root/c/d 4.txt(efgh)"]
Output: [["root/a/2.txt","root/c/d/4.txt"],["root/a/1.txt","root/c/3.txt"]]

Constraints:
  • 1 <= paths.length <= 2 * 10^4
  • 1 <= paths[i].length <= 3000
  • 1 <= sum(paths[i].length) <= 5 * 10^5
  • paths[i] consists of English letters, digits, ‘/’, ‘.’, ‘(’, ‘)’, and ‘ ’.
  • You may assume no files or directories share the same name in the same directory.
  • You may assume each given directory info represents a unique directory. A single blank space separates the directory path and file info.

From: LeetCode
Link: 609. Find Duplicate File in System


Solution:

Ideas:

1. Parsing Input:
For each directory info string, we split the string by spaces. The first token is the directory. Each subsequent token represents a file with its content enclosed in parentheses. We extract the file name and content and then form the full file path string (i.e. “directory/fileName”).

2. Grouping by Content:
We use a hash table (with chaining) keyed by file content. Each key stores a dynamically growing array of file paths that share that content.

3. Building the Result:
After processing all inputs, we count only groups with more than one file. For each such group, we allocate an array for that group and record its size in returnColumnSizes.

4. Memory Management:
The returned arrays and their sizes are malloced, as required. The allocated memory for nodes and temporary arrays is freed before returning.
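
The parsing step in idea 1 can be sketched in isolation. The helper below is a minimal illustration, not part of the solution that follows: the name splitFileToken and its buffer-based signature are assumptions made for this example. It takes one token of the form "fileName(content)" and copies the two parts into caller-provided buffers.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not used by the solution below): splits one token of
 * the form "name(content)" into its two parts.
 * Returns 1 on success, 0 if the token is malformed or a buffer is too small. */
int splitFileToken(const char *token,
                   char *name, size_t nameCap,
                   char *content, size_t contentCap) {
    const char *open = strchr(token, '(');   /* file name ends here */
    if (!open) return 0;
    const char *close = strchr(open, ')');   /* content ends here */
    if (!close) return 0;

    size_t nameLen = (size_t)(open - token);
    size_t contentLen = (size_t)(close - open - 1); /* excludes '(' and ')' */
    if (nameLen + 1 > nameCap || contentLen + 1 > contentCap) return 0;

    memcpy(name, token, nameLen);
    name[nameLen] = '\0';
    memcpy(content, open + 1, contentLen);
    content[contentLen] = '\0';
    return 1;
}
```

For the token "1.txt(abcd)" this yields the name "1.txt" and the content "abcd"; joining the directory "root/a" and the name with a '/' then gives the full path "root/a/1.txt".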

Code:
#include <stdio.h>   // snprintf
#include <stdlib.h>  // malloc, calloc, realloc, free
#include <string.h>  // strdup, strtok, strchr, strncpy, strcmp, strlen

// A node to hold the file paths for a given file content.
typedef struct Node {
    char *content;         // file content key (dynamically allocated)
    char **paths;          // dynamic array of file paths having this content
    int count, capacity;   // current number and capacity of the array
    struct Node *next;     // pointer for chaining in the hash table bucket
} Node;

#define TABLE_SIZE 10007

// djb2 hash function
unsigned long hashFunc(const char *str) {
    unsigned long hash = 5381;
    int c;
    while ((c = *str++))
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
    return hash;
}

// Insert a file path into the hash table for the given content key.
void insert(Node **hashTable, const char *content, char *filePath) {
    unsigned long hash = hashFunc(content);
    int index = hash % TABLE_SIZE;
    Node *cur = hashTable[index];
    // Search for a node with the same content.
    while(cur) {
        if(strcmp(cur->content, content) == 0) {
            // Found it. Add filePath to the array.
            if(cur->count == cur->capacity) {
                cur->capacity *= 2;
                cur->paths = realloc(cur->paths, cur->capacity * sizeof(char *));
            }
            cur->paths[cur->count++] = filePath;
            return;
        }
        cur = cur->next;
    }
    // Not found: create a new node.
    Node *newNode = (Node*)malloc(sizeof(Node));
    newNode->content = strdup(content);
    newNode->capacity = 2;
    newNode->count = 0;
    newNode->paths = (char**)malloc(newNode->capacity * sizeof(char *));
    newNode->paths[newNode->count++] = filePath;
    newNode->next = hashTable[index];
    hashTable[index] = newNode;
}

/**
 * Return an array of arrays of size *returnSize.
 * The sizes of the arrays are returned as *returnColumnSizes array.
 * Note: Both returned array and *columnSizes array must be malloced, assume caller calls free().
 */
char*** findDuplicate(char** paths, int pathsSize, int* returnSize, int** returnColumnSizes) {
    // Create hash table (array of Node pointers).
    Node **hashTable = (Node**)calloc(TABLE_SIZE, sizeof(Node*));

    // Process each directory info string.
    for (int i = 0; i < pathsSize; i++) {
        char *s = paths[i];
        // Duplicate the string because strtok modifies it.
        char *info = strdup(s);
        char *token = strtok(info, " ");
        if (!token) {
            free(info);
            continue;
        }
        // The first token is the directory path.
        char *dir = token;
        
        // Process the remaining tokens, each representing a file.
        while ((token = strtok(NULL, " ")) != NULL) {
            // token is of the form "filename(content)"
            // Find the '(' character.
            char *p1 = strchr(token, '(');
            if (!p1) continue;
            // p1 points to '('; file name ends at p1.
            int nameLen = p1 - token;
            // Find the closing ')'
            char *p2 = strchr(p1, ')');
            if (!p2) continue;
            int contentLen = p2 - p1 - 1; // excluding '(' and ')'

            // Extract file content.
            char *fileContent = (char*)malloc((contentLen + 1) * sizeof(char));
            strncpy(fileContent, p1 + 1, contentLen);
            fileContent[contentLen] = '\0';

            // Extract file name.
            char *fileName = (char*)malloc((nameLen + 1) * sizeof(char));
            strncpy(fileName, token, nameLen);
            fileName[nameLen] = '\0';

            // Build full file path: "directory/fileName"
            int fullLen = strlen(dir) + 1 + strlen(fileName) + 1;
            char *fullPath = (char*)malloc(fullLen * sizeof(char));
            snprintf(fullPath, fullLen, "%s/%s", dir, fileName);

            // Insert the fullPath into hash table under fileContent.
            insert(hashTable, fileContent, fullPath);

            free(fileContent);
            free(fileName);
        }
        free(info);
    }

    // Now iterate over the hash table to count groups with duplicate files.
    int groupCount = 0;
    for (int i = 0; i < TABLE_SIZE; i++) {
        Node *cur = hashTable[i];
        while(cur) {
            if(cur->count > 1)
                groupCount++;
            cur = cur->next;
        }
    }

    // Allocate result array and the returnColumnSizes array.
    char ***result = (char***)malloc(groupCount * sizeof(char**));
    *returnColumnSizes = (int*)malloc(groupCount * sizeof(int));
    *returnSize = groupCount;

    // Fill in the result with groups.
    int index = 0;
    for (int i = 0; i < TABLE_SIZE; i++) {
        Node *cur = hashTable[i];
        while(cur) {
            if(cur->count > 1) {
                // Allocate an array for this group.
                char **group = (char**)malloc(cur->count * sizeof(char*));
                for (int j = 0; j < cur->count; j++) {
                    group[j] = cur->paths[j];
                }
                result[index] = group;
                (*returnColumnSizes)[index] = cur->count;
                index++;
            }
            cur = cur->next;
        }
    }

    // Free the hash table nodes and their paths arrays. Path strings that were
    // copied into the result are now owned by the caller. Paths in groups of
    // size 1 were not returned, so free them here to avoid a leak.
    for (int i = 0; i < TABLE_SIZE; i++) {
        Node *cur = hashTable[i];
        while(cur) {
            Node *temp = cur;
            cur = cur->next;
            if (temp->count == 1)
                free(temp->paths[0]);
            free(temp->paths);
            free(temp->content);
            free(temp);
        }
    }
    free(hashTable);
    
    return result;
}
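
Because the result, each group array, each path string, and the column-sizes array are all malloced, a caller outside the LeetCode judge has to release them. The following is a sketch of what that cleanup might look like; the name freeDuplicateResult and its return value (the number of path strings freed) are assumptions made for illustration and are not required by the problem.

```c
#include <stdlib.h>

/* Hypothetical caller-side cleanup for the value returned by findDuplicate.
 * Frees every path string, every group array, the outer array, and the
 * column-sizes array. Returns how many path strings were freed. */
int freeDuplicateResult(char ***result, int returnSize, int *returnColumnSizes) {
    int freed = 0;
    for (int i = 0; i < returnSize; i++) {
        for (int j = 0; j < returnColumnSizes[i]; j++) {
            free(result[i][j]);   /* one full path, e.g. "root/a/1.txt" */
            freed++;
        }
        free(result[i]);          /* the group array itself */
    }
    free(result);                 /* free(NULL) is a no-op */
    free(returnColumnSizes);
    return freed;
}
```

A caller would invoke it once after consuming the groups, passing the returned pointer together with the values written to returnSize and returnColumnSizes.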
