This project implements an external sorting algorithm in Go, capable of handling large datasets that exceed available memory. The program reads a dataset of random strings from a file, sorts them, and writes the sorted output to a new file. It includes a dataset generator that creates random strings for testing purposes.
- External Sorting in Go
- Features
- Project Structure
- Requirements
- Installation
- Build
- Usage
- How It Works
- Sequence Diagram
- Memory Management
- Contributing
- License
- Generates a dataset of random strings.
- Implements external merge sort to handle large datasets.
- Uses a temporary directory for storing intermediate sorted chunks.
- Supports sorting of strings.
go_external_sort/ │ ├── cmd/main.go │ ├── dataset/ │ └── main.go # Entry point of the program to generate dataset │ └── sort │ └── main.go # Entry point of the program to sort dataset ├── sorter/ │ ├── sorter.go # External sorting logic │ └── merge.go # Merging sorted chunks └── utils/ ├── file.go # File handling utilities └── logger.go # Logging utilities - Go 1.22.3 or later
- Clone the repository:
git clone https://github.com/fajarnugraha37/go_external_sort.git cd go_external_sort- Ensure you have Go installed on your machine. You can download it from golang.org.
To build the project for development, you can use the following command:
go build -o ./bin/dataset.exe ./cmd/dataset/main.go go build -o ./bin/sort.exe ./cmd/sort/main.goThis command compiles the code and creates executables in the bin directory.
For a optimized-optimized build, you can use the following command:
go build -o ./bin/dataset.exe -ldflags="-s -w" ./cmd/dataset/main.go go build -o ./bin/sort.exe -ldflags="-s -w" ./cmd/sort/main.go-ldflags="-s -w": This flag reduces the size of the binary by omitting the symbol table and debug
If you are on a different operating system and want to cross-compile for another OS, you can set the GOOS and GOARCH environment variables accordingly.
To generate a dataset of random strings, use the following command:
go run ./cmd/dataset/main.go -output dataset_input.txt -count 10000000 -length 16or
./bin/dataset.exe -output dataset_input.txt -count 10000000 -length 16-output: Path to the output file where the dataset will be saved.-count: Number of random strings to generate (default is 100,000).-length: Length of each random string (default is 10).
To sort the generated dataset, use the following command:
go run ./cmd/sort/main.go -input dataset_input.txt -output dataset_output.txtor
./bin/sort.exe -input dataset_input.txt -output dataset_output.txt-input: Path to the input file containing the dataset to be sorted.-output: Path to the output file where the sorted dataset will be saved.
- Dataset Generation: The
./cmd/dataset/main.gofile generates a specified number of random strings and writes them to a file. - External Sorting: The
./cmd/sort/main.gofile orchestrates the sorting process:- It reads the dataset from the input file.
- It splits the dataset into manageable chunks, sorts each chunk, and writes them to temporary files.
- It merges the sorted chunks into a final sorted output file.
sequenceDiagram participant User participant Generator participant Sorter participant FileSystem as FS participant TempDirectory as TempDir User->>Generator: Run dataset generation command Generator->>FS: Create dataset file FS-->>Generator: Confirm file creation Generator->>FS: Write random strings to dataset file FS-->>Generator: Confirm write completion Generator->>User: Dataset generated successfully User->>Sorter: Run sorting command Sorter->>FS: Open dataset file FS-->>Sorter: Provide dataset file Sorter->>TempDir: Create temp directory (if not exists) Sorter->>FS: Read dataset file FS-->>Sorter: Provide dataset content Sorter->>Sorter: Split dataset into chunks Sorter->>Sorter: Sort each chunk Sorter->>TempDir: Write sorted chunks to temp files TempDir-->>Sorter: Confirm chunk write completion Sorter->>TempDir: Merge sorted chunks Sorter->>FS: Write sorted output file FS-->>Sorter: Confirm output file creation Sorter->>User: Sorting completed successfully Participants: The diagram includes five participants: User, Generator, Sorter, FileSystem, and TempDirectory.Interactions: The arrows represent the interactions between these participants, showing the flow of commands and responses during the dataset generation and sorting processes.
The program is designed to use minimal memory by:
- Limiting the size of each chunk to 10 MB.
- Using buffered I/O for efficient file operations.
- Storing temporary files in a dedicated
tempdirectory within the working directory.
Contributions are welcome! If you have suggestions or improvements, feel free to open an issue or submit a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.