Skip to content

This project demonstrates an external sorting algorithm in Go, designed to efficiently handle large data sets that exceed available memory. The program reads a random string data set from a file, sorts it using an external merge sort technique, and writes the sorted output to a new file.

License

Notifications You must be signed in to change notification settings

fajarnugraha37/go_external_sort

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

External Sorting in Go

This project implements an external sorting algorithm in Go, capable of handling large datasets that exceed available memory. The program reads a dataset of random strings from a file, sorts them, and writes the sorted output to a new file. It includes a dataset generator that creates random strings for testing purposes.

Table of Contents

Features

  • Generates a dataset of random strings.
  • Implements external merge sort to handle large datasets.
  • Uses a temporary directory for storing intermediate sorted chunks.
  • Supports sorting of strings.

Project Structure

go_external_sort/ │ ├── cmd/main.go │ ├── dataset/ │ └── main.go # Entry point of the program to generate dataset │ └── sort │ └── main.go # Entry point of the program to sort dataset ├── sorter/ │ ├── sorter.go # External sorting logic │ └── merge.go # Merging sorted chunks └── utils/ ├── file.go # File handling utilities └── logger.go # Logging utilities 

Requirements

  • Go 1.22.3 or later

Installation

  1. Clone the repository:
git clone https://github.com/fajarnugraha37/go_external_sort.git cd go_external_sort
  1. Ensure you have Go installed on your machine. You can download it from golang.org.

Build

Simple Build

To build the project for development, you can use the following command:

go build -o ./bin/dataset.exe ./cmd/dataset/main.go go build -o ./bin/sort.exe ./cmd/sort/main.go

This command compiles the code and creates executables in the bin directory.

Optimized Build

For a optimized-optimized build, you can use the following command:

go build -o ./bin/dataset.exe -ldflags="-s -w" ./cmd/dataset/main.go go build -o ./bin/sort.exe -ldflags="-s -w" ./cmd/sort/main.go
  • -ldflags="-s -w": This flag reduces the size of the binary by omitting the symbol table and debug

Cross-Compilation

If you are on a different operating system and want to cross-compile for another OS, you can set the GOOS and GOARCH environment variables accordingly.

Usage

Generating a Dataset

To generate a dataset of random strings, use the following command:

go run ./cmd/dataset/main.go -output dataset_input.txt -count 10000000 -length 16

or

./bin/dataset.exe -output dataset_input.txt -count 10000000 -length 16
  • -output: Path to the output file where the dataset will be saved.
  • -count: Number of random strings to generate (default is 100,000).
  • -length: Length of each random string (default is 10).

Sorting the Dataset

To sort the generated dataset, use the following command:

go run ./cmd/sort/main.go -input dataset_input.txt -output dataset_output.txt

or

./bin/sort.exe -input dataset_input.txt -output dataset_output.txt
  • -input: Path to the input file containing the dataset to be sorted.
  • -output: Path to the output file where the sorted dataset will be saved.

How It Works

  1. Dataset Generation: The ./cmd/dataset/main.go file generates a specified number of random strings and writes them to a file.
  2. External Sorting: The ./cmd/sort/main.go file orchestrates the sorting process:
    • It reads the dataset from the input file.
    • It splits the dataset into manageable chunks, sorts each chunk, and writes them to temporary files.
    • It merges the sorted chunks into a final sorted output file.

Sequence Diagram

sequenceDiagram participant User participant Generator participant Sorter participant FileSystem as FS participant TempDirectory as TempDir User->>Generator: Run dataset generation command Generator->>FS: Create dataset file FS-->>Generator: Confirm file creation Generator->>FS: Write random strings to dataset file FS-->>Generator: Confirm write completion Generator->>User: Dataset generated successfully User->>Sorter: Run sorting command Sorter->>FS: Open dataset file FS-->>Sorter: Provide dataset file Sorter->>TempDir: Create temp directory (if not exists) Sorter->>FS: Read dataset file FS-->>Sorter: Provide dataset content Sorter->>Sorter: Split dataset into chunks Sorter->>Sorter: Sort each chunk Sorter->>TempDir: Write sorted chunks to temp files TempDir-->>Sorter: Confirm chunk write completion Sorter->>TempDir: Merge sorted chunks Sorter->>FS: Write sorted output file FS-->>Sorter: Confirm output file creation Sorter->>User: Sorting completed successfully 
Loading
  • Participants: The diagram includes five participants: User, Generator, Sorter, FileSystem, and TempDirectory.
  • Interactions: The arrows represent the interactions between these participants, showing the flow of commands and responses during the dataset generation and sorting processes.

Memory Management

The program is designed to use minimal memory by:

  • Limiting the size of each chunk to 10 MB.
  • Using buffered I/O for efficient file operations.
  • Storing temporary files in a dedicated temp directory within the working directory.

Contributing

Contributions are welcome! If you have suggestions or improvements, feel free to open an issue or submit a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for details.

About

This project demonstrates an external sorting algorithm in Go, designed to efficiently handle large data sets that exceed available memory. The program reads a random string data set from a file, sorts it using an external merge sort technique, and writes the sorted output to a new file.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages