
Solving the 15-Puzzle with Parallel A*

Anna Gupta (annag), William Xiao (williamx)

View Our Code on Github

Proposal

Title
Solving the 15-Puzzle with Parallel A*

Summary
We will implement a parallel version of the A* algorithm to solve the 15-Puzzle with minimal moves on the Gates or latedays machines.

Background
The 15-Puzzle consists of a 4x4 frame of square tiles (numbered from 1 to 15), with one tile missing. The object of the game is to place the tiles in numerical order by sliding them into the empty space. The puzzle comes in different sizes; in general, an NxN frame uses tiles numbered from 1 to N^2 - 1.
The game can be represented as a graph, with nodes representing board states and edges representing actions (sliding a particular tile). We would like to implement a graph-searching algorithm to find the shortest sequence of moves to solve the puzzle.
In heuristic graph-search algorithms, a heuristic function ranks the edges leaving a node, deciding which actions are preferable given the state of the game. The 15-Puzzle is well suited to such algorithms. One possible heuristic sums the taxicab (Manhattan) distance of each tile from its desired position. This opens the door for us to use A* (or other heuristic algorithms) to solve the puzzle.

[Figure: the Fifteen Puzzle]

The Challenge
Unlike other graph-search algorithms, A* is difficult to parallelize because of the heuristic. The sequential algorithm maintains a priority queue that ranks nodes by cost. In each iteration, the least-cost node is popped from the queue, its information is updated, and its neighbors are pushed onto the queue. These neighbors may have higher priority than nodes already in the queue, which makes it hard to use multiple threads: the second-best node before an iteration may no longer be the best node after it.

Resources
We will start writing code from scratch. We will research A* and other heuristic algorithms to determine one that is optimal for solving the 15-Puzzle. By reading research papers, we will also determine a benchmark for the speed-up achievable by parallelizing A*. We will utilize the Gates or latedays clusters for our project.

Goals & Deliverables

Goals (if time permits):

We will construct a rubric for our work by researching prior attempts to parallelize A* and the speed-ups they achieved. Ideally, we would like to come as close as possible to the best reported speed-up.


Platform Choice
CUDA will give us access to a large number of threads, which will be useful in solving large instances of the puzzle. In addition, we can utilize thread blocks and shared memory as a form of communication between the threads.

Schedule

Week of | Task
April 17 | Determine benchmark for speed-up; implement a sequential version of A*
April 24 | Research and implement a parallel version of A*
May 1 | Test and make the implementation scalable on different-sized grids
May 8 | Optimize the implementation and finish the final report + presentation

Checkpoint

Work Completed
The goal of our project is to implement a parallel version of A* to solve the 15-Puzzle. We've successfully implemented a sequential version of the algorithm.

Updated Schedule

Time Period | Task: William | Task: Anna
April 24-27 | Brainstorm different methods of parallelizing A* | Research and understand shared priority queues
May 2-5 | Implement parallel version of A* | Implement parallel version of A*
May 6-9 | Test implementation on larger grids + make optimizations as necessary | Compare speed-up across thread counts + determine optimal number of threads
May 10-12 | Finish final report + presentation | Finish final report + presentation

Updated Goals & Deliverables
So far, we are on schedule, as detailed in our proposal. We have begun researching possible ways to parallelize A* and designing our implementation. In previous attempts to parallelize A*, researchers sacrificed solution optimality for speed; their parallel versions, although fast, may not return the best solution.
Going forward, one concern is whether we will have to sacrifice optimality as well. Initially, we hoped to achieve significant speed-up while still returning the optimal solution. If time permits, we will implement multiple versions of parallel A* that vary in solution optimality, letting us study the trade-off between speed-up and suboptimality.

Updated Goals:

Other Information
For the parallelism competition, we will provide both a demo and a graph. The demo will print the steps needed to solve the puzzle optimally, so the user could follow them to solve a randomly generated puzzle. We will also present a graph demonstrating the speed-up of the parallel version we will implement over the sequential version we have already implemented.

Final Report

Summary
We parallelized the A* algorithm to solve the 15-Puzzle optimally. We used pthreads for parallelism and tested on the GHC machines.

Background
A*:
A* is a heuristic-based pathfinding algorithm. Given a weighted graph and two nodes, the algorithm finds the shortest path between them. The algorithm uses two functions, which we'll name g(n) and h(n). g(n) is the cost from the start to n, i.e., the sum of the edge weights on the path connecting start and n. h(n) is a heuristic, a function that estimates the cheapest cost from n to the goal. This function must be admissible, meaning it never overestimates the actual cost. Together, key(n) = g(n) + h(n) represents the priority of the node: a lower key value means a lower cost is incurred to travel through that node, and hence a higher priority. The algorithm explores nodes in order of priority.

15-Puzzle:
This puzzle fits perfectly in the context of A*. Each board arrangement represents a node in the graph, with edges representing possible moves in the game. Each edge will have a weight of 1, so g(n) represents the smallest number of moves to reach n from start. Using A*, we can find the shortest path between the start and goal arrangements.
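As a concrete illustration of the heuristic described above, here is a minimal sketch of the Manhattan-distance computation for an NxN board. The function name, board encoding (row-major vector with 0 as the blank), and goal convention (tile t belongs at index t - 1) are our assumptions for this example, not necessarily the exact representation in our code.

```cpp
#include <cstdlib>
#include <vector>

// Manhattan-distance heuristic for an N x N sliding puzzle.
// `board` is a row-major vector of tile values, with 0 as the blank.
int manhattan(const std::vector<int>& board, int n) {
    int h = 0;
    for (int i = 0; i < n * n; ++i) {
        int t = board[i];
        if (t == 0) continue;  // the blank contributes nothing
        int goal = t - 1;      // tile t belongs at index t - 1
        h += std::abs(i / n - goal / n) + std::abs(i % n - goal % n);
    }
    return h;
}
```

This heuristic is admissible because each move slides one tile one step, so a tile that is k steps from home needs at least k moves.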

[Figure: A* search]

Priority Queue:
The order of expansion is maintained by a priority queue. We begin by pushing the start node into the queue. Then, we repeat the following steps: pop the least-cost node, check whether it is the goal, and push its neighbors onto the queue with updated key values.

Sequentially, one node is popped from the queue at a time. We saw potential parallelism in this process: many nodes could be popped and processed at the same time.
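The sequential pop-and-expand loop can be sketched as follows on a small abstract graph. This is an illustration, not our exact code: the adjacency-list representation, the zero-based node ids, and the stale-entry skip are assumptions made for this sketch.

```cpp
#include <limits>
#include <queue>
#include <utility>
#include <vector>

// Sequential A* over an adjacency list; adj[n] holds (neighbor, weight)
// pairs and h[n] is an admissible heuristic. Returns the optimal cost,
// or -1 if the goal is unreachable.
int astar(const std::vector<std::vector<std::pair<int, int>>>& adj,
          const std::vector<int>& h, int start, int goal) {
    const int INF = std::numeric_limits<int>::max();
    std::vector<int> g(adj.size(), INF);
    using Entry = std::pair<int, int>;  // (key, node), min-heap by key
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;
    g[start] = 0;
    pq.push({h[start], start});
    while (!pq.empty()) {
        auto [key, n] = pq.top();
        pq.pop();
        if (key != g[n] + h[n]) continue;  // stale entry: a better path was found
        if (n == goal) return g[n];        // first pop of the goal is optimal
        for (auto [m, w] : adj[n])         // relax and push each neighbor
            if (g[n] + w < g[m]) {
                g[m] = g[n] + w;
                pq.push({g[m] + h[m], m});
            }
    }
    return -1;
}
```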

Hash Table:
To represent a board, we created a Board class. For every board, we wished to create only one Board object. To keep track of the boards created, we kept a hash table, ensuring a unique object for each board in the game.
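The one-object-per-board idea can be sketched with an interning table. The `BoardTable` class, the `intern` method, and the string encoding of a board are illustrative stand-ins, not the exact classes in our code.

```cpp
#include <memory>
#include <string>
#include <unordered_map>

// Illustrative Board: a string encoding of the tile arrangement.
struct Board { std::string tiles; };

// Hash table that hands out exactly one Board object per arrangement.
class BoardTable {
    std::unordered_map<std::string, std::unique_ptr<Board>> table_;
public:
    // Returns the unique Board* for this arrangement, creating it on
    // first request and reusing it thereafter.
    Board* intern(const std::string& tiles) {
        auto& slot = table_[tiles];
        if (!slot) slot = std::make_unique<Board>(Board{tiles});
        return slot.get();
    }
};
```

Because every lookup of the same arrangement returns the same pointer, pointer equality can be used as board equality elsewhere in the search.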

Previous Ideas
We brainstormed and researched many ideas for parallelizing the A* algorithm, all of which had flaws.

Our Final Approach
POSIX Threads and TSPriorityQueue class:
Our final idea involved spawning multiple POSIX threads, each popping from and inserting into a coarse-grained priority queue simultaneously. However, due to high contention on the root node and low arithmetic intensity (each thread accesses the priority queue frequently), speed-up would be minimal. Therefore, we split the data structure into multiple priority queues. Our TSPriorityQueue class contains a vector of coarse-grained priority queues, with the length of the vector equal to the number of threads in the system. Each thread then pops from its own priority queue with no contention, increasing parallelism.
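A minimal sketch of the TSPriorityQueue idea follows: one mutex-protected min-heap per "shard", with states routed to shards by a hash. The class layout, the `long` state ids, and the method names are assumptions made for this sketch, not our exact interface.

```cpp
#include <functional>
#include <mutex>
#include <queue>
#include <utility>
#include <vector>

// Sharded thread-safe priority queue: each shard is a coarse-grained
// (one mutex) min-heap of (key, state id) pairs.
class TSPriorityQueue {
    struct Shard {
        std::mutex m;
        std::priority_queue<std::pair<int, long>,
                            std::vector<std::pair<int, long>>,
                            std::greater<>> pq;
    };
    std::vector<Shard> shards_;
public:
    explicit TSPriorityQueue(int nqueues) : shards_(nqueues) {}

    // Route a state to a shard by hashing it, so a node's neighbors
    // scatter across shards and load stays balanced.
    void push(long state, int key) {
        Shard& s = shards_[std::hash<long>{}(state) % shards_.size()];
        std::lock_guard<std::mutex> lk(s.m);
        s.pq.push({key, state});
    }

    // Each thread pops from "its own" shard; an empty shard returns false.
    bool try_pop(int shard, std::pair<int, long>& out) {
        Shard& s = shards_[shard];
        std::lock_guard<std::mutex> lk(s.m);
        if (s.pq.empty()) return false;
        out = s.pq.top();
        s.pq.pop();
        return true;
    }
};
```

With one shard per thread, pops are contention-free; pushes may touch another thread's shard, which is exactly the cross-thread communication described above.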

[Figure: TSPriorityQueue]

Hash Table and Hash Function:
We created a hash function mapping a Board object (a board in the game) to a value. This value mod n, where n is the number of threads, is its index into the TSPriorityQueue, i.e., the priority queue it maps to. Since our hash function is randomized, the neighbors of a node are likely to hash into different priority queues. This increases communication between threads and balances load among the priority queues. Along with the vector of priority queues, we also keep a vector of hash tables: each priority queue is paired with a hash table containing the Board objects that have been pushed into it. This avoids creating duplicate Board objects for the same board arrangement.

Ensuring Optimality of Solution:
Our parallel version saw speed-up; however, we wished to ensure the optimality of the solution it outputs. The sequential A* is optimal because it pops nodes in monotonically nondecreasing key order: when we pop the goal state, we are assured its key value is the lowest possible. This is not guaranteed with multiple threads. For example, thread 0 could pop a goal state with a key value of 100 while thread 5 pops another node with a key value of 50, and that node could lead to the goal state with a key value below 100. To solve this, we replicated assignment 4 and created a handle-tick function. This function checks whether the goal state has been expanded and, if so, compares its key value to the global minimum key across the TSPriorityQueue. If the goal state's key value is smaller, no shorter path to the goal state can exist.
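The termination test above can be sketched as a single predicate. The function name and the precomputed per-queue minima are assumptions for this sketch; the key point is the comparison against the global minimum.

```cpp
#include <algorithm>
#include <climits>
#include <vector>

// `queue_mins[i]` is the smallest key currently in priority queue i
// (INT_MAX if that queue is empty). Because key(n) = g(n) + h(n) with
// an admissible h, every remaining node's key lower-bounds any path
// through it, so a popped goal is provably optimal once its key is no
// larger than the global minimum.
bool goal_is_optimal(int goal_key, const std::vector<int>& queue_mins) {
    int global_min = INT_MAX;
    for (int k : queue_mins) global_min = std::min(global_min, k);
    return goal_key <= global_min;
}
```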

Further Optimization
In our parallel implementation, each thread is mapped to a single priority queue. However, we wished to further lower contention and increase flexibility. We made further optimizations by increasing the number of priority queues through an integer "multiplier": we create multiplier priority queues per thread, rather than one each.

Results

[Figure 1: Runtime]

Figure 1: To produce this graph, we ran 3 files, each with a 4x4 board input. We compared our sequential baseline with our parallel version, varying the number of threads used. We ran this code on the GHC machines, which have 8 hyperthreaded cores. The initial increase in execution time (between the sequential and 2-thread versions) is likely due to the overhead of spawning threads. As the thread count increases, execution time decreases significantly.

[Figure 2: Speedup]

Figure 2: This displays the same data as Figure 1. On the y-axis, we plot speed-up, rather than execution time. The 16-threaded version sees 2x speed-up compared to the sequential version.

[Figure 3: Runtime, optimized]

Figure 3: This graph shows runs of the same files as before. All factors are held constant, except we utilize 128 priority queues per thread rather than one. We chose 128 after meticulous testing: too few priority queues results in contention, whereas too many wastes memory. Compared to Figure 1, there is a sharper decrease in execution time, with the 16-thread version taking significantly less time.

[Figure 4: Speedup, optimized]

Figure 4: This displays the same data as Figure 3, plotting speed-up rather than execution time. Recall that with 1 priority queue per thread, we achieved a 2x speedup. With the increase in priority queues, we achieve a 5-6x speedup. This optimization was very successful.

Challenges
While developing our implementation, we discovered that the C++ standard library's priority queue does not support an "update" operation: to modify the key value of an element in the queue, one must delete the element and reinsert it with the updated key value. To circumvent this, we created our own priority queue implementation that supports updates. We also implemented a thread-safe data structure consisting of a vector of coarse-grained priority queues.

It was difficult to create an efficient, load-balanced hash function. Initially, our parallel version did not see much speed-up; we discovered this was the result of a poor hash function, which caused load imbalance across the priority queues.

References
We developed the parallelization scheme and the optimality check ourselves.

Distribution of Work
Equal work was done by each group member.