Jekyll2020-08-19T15:22:36+02:00https://www.lukaschaefer.com/feed.xmlLukas Schäfer | German CompSci StudentI am a German computer science student passionate about artificial intelligence, esports, music, literature and much more. Here you can find my professional profile and blog where I write about all sorts of random things as well as projects I work(ed) on.Lukas SchaeferASNets Evaluation and Conclusion in Classical Planning2019-06-26T00:00:00+02:002019-06-26T00:00:00+02:00https://www.lukaschaefer.com/ai/2019/06/26/Thesis-4<h4>Domain-Dependent Policy Learning using Neural Networks for Classical Planning (4/4)</h4>
<!--more-->
<hr />
<div style="text-align: left"> &lt;&lt;&lt; <a href="/ai/2019/06/02/Thesis-3.html"> Part 3 </a> </div>
<p>This is the fourth and final post of my
<a href="../../../../../assets/files/bsc_thesis.pdf">undergraduate dissertation</a> series. It covers the detailed
evaluation of Action Schema Networks for classical automated planning and proposes future work that might
address the identified weaknesses, before concluding the project as a whole.</p>
<hr />
<h1 id="evaluation">Evaluation</h1>
<p>As mentioned in the <a href="/ai/2019/05/05/Thesis-2.html">second post of this series</a>, Sam Toyer already
conducted an empirical evaluation of ASNets <a class="citation" href="#toyer:thesis:17">(Toyer, 2017)</a>, but primarily focused on probabilistic planning
tasks. While he also considered classical planning, these experiments were only performed for the Gripper domain
which is solved fairly easily by most of the baseline planners considered. Therefore, we decided to evaluate
the method extensively on multiple domains of varying complexity, comparing the performance of ASNets to successful
satisficing and optimal planners.</p>
<h2 id="evaluation-objectives">Evaluation Objectives</h2>
<p>The goal of this evaluation is to answer the following questions with respect to the suitability of ASNets to solve
classical planning tasks:</p>
<ol>
<li><strong>Are ASNets able to learn effective (maybe even optimal) policies?</strong> We use satisficing and optimal teacher
search configurations during training. We expect that ASNet policies will be constrained in their quality by the
applied teacher search. However, it is essential to observe to what extent the policies reach and potentially even
exceed the teacher’s effectiveness.</li>
<li><strong>Which properties do the domains on which ASNets perform well have in common?</strong> Such findings about apparent
limitations and strengths of the method can enable further progress to improve learning techniques for automated
planning applications.</li>
<li><strong>How long do ASNets need to be trained until they perform reasonably well?</strong> It is to be expected
that training time depends significantly on the configuration and planning domain used.</li>
</ol>
<h2 id="evaluation-setup">Evaluation Setup</h2>
<h3 id="hardware">Hardware</h3>
<p>The evaluation, including execution and training of all baseline planners and ASNet configurations, was run on an
x86-64 server using a single core of an Intel Xeon E5-2650 v4 CPU clocked at 2.2GHz and 96GiB of RAM.</p>
<h3 id="domains-and-problems">Domains and Problems</h3>
<p>We use eight domains of different complexity and characteristics to reliably evaluate ASNets. These are mostly taken
from previous iterations of the International Planning Competition (IPC) and the
<a href="https://fai.cs.uni-saarland.de/hoffmann/ff-domains.html">FF-Domain collection</a> of Prof. Dr. Hoffmann. All domains
with the number of problem instances used for ASNet training and evaluation as well as the expected difficulty
can be found in the table below.</p>
<center>
<img src="../../../../../assets/img/posts_images/bsc_thesis/asnet_eval_domains.png" alt="ASNet evaluation domains" width="80%" />
</center>
<p>For details on the domains and used problem instances, see
<a href="../../../../../assets/files/bsc_thesis.pdf#page=39">section 7.2.1 of the thesis</a>.</p>
<h3 id="baseline-planners">Baseline Planners</h3>
<p>As baseline planners, we used three heuristic search approaches which dominate classical automated planning.
All baselines were implemented in the Fast-Downward planning system <a class="citation" href="#helmert:fd:06">(Helmert, 2006)</a> and their search time for
each problem was limited to 30 minutes.</p>
<ul>
<li>
<p><strong><script type="math/tex">A^*</script> with <script type="math/tex">h^{LM-cut}</script> or <script type="math/tex">h^{add}</script></strong>: <script type="math/tex">A^*</script> is one of the most popular heuristic search algorithms.
It maintains a prioritised open list of states ordered by their <em>f-value</em> which is the sum of the cost <script type="math/tex">g(s)</script> to
reach the state as well as its heuristic value <script type="math/tex">h(s)</script>. In each iteration, the state with the lowest f-value is expanded
and, unless it is a goal state, its children are added to the list. This process starts at the initial state and
continues until a goal is reached or the open list becomes empty, in which case the task is unsolvable. When applied
with admissible heuristics, <script type="math/tex">A^*</script> is guaranteed to find optimal plans that minimise the cost <a class="citation" href="#hart:etal:72">(Hart, Nilsson, & Raphael, 1972)</a>.
This guarantee is why the algorithm, despite being fairly expensive, is still widely used in optimal planning.
As heuristic functions, we use the admissible LM-cut heuristic <a class="citation" href="#helmert:domshlak:icaps-09">(Helmert & Domshlak, n.d.)</a> as well as the
inadmissible, additive heuristic <script type="math/tex">h^{add}</script> <a class="citation" href="#bonet:geffner:ai-01">(Bonet & Geffner, 2001)</a>.</p>
</li>
<li>
<p><strong>Greedy best-first search with <script type="math/tex">h^{FF}</script></strong>: Greedy best-first search (GBFS) is the counterpart of <script type="math/tex">A^*</script> for satisficing
planning described by Russell and Norvig <a class="citation" href="#russel:norvig:95">(Russell & Norvig, 1995)</a>. The algorithms share the same procedure, but GBFS
uses the heuristic values of states alone as the open list priorities instead of f-values. For most domains, GBFS
finds a plan considerably faster than <script type="math/tex">A^*</script>, but it provides no optimality guarantees. The search is used with the
relaxed plan heuristic <script type="math/tex">h^{FF}</script> <a class="citation" href="#hoffmann:nebel:jair-01">(Hoffmann & Nebel, 2001)</a> using a dual queue with preferred operators. This setup
has proven to be fairly effective in satisficing planning.</p>
</li>
<li>
<p><strong>LAMA-2011</strong>: The LAMA-2011 planning system <a class="citation" href="#richter:etal:ipc-11">(Richter, Westphal, & Helmert, 2011)</a> was one of the winners at the 2011 IPC and
combines multiple approaches from previous searches. It starts with GBFS runs combining the relaxed plan heuristic with
landmark heuristics <a class="citation" href="#richter:westphal:jair-10">(Richter & Westphal, 2010)</a> to quickly find a plan. Afterwards, it iteratively aims to improve
and find better plans using (weighted) <script type="math/tex">A^*</script> and pruning techniques.</p>
</li>
</ul>
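<p>The difference between <script type="math/tex">A^*</script> and GBFS comes down to the open list priority. The following is a minimal Python
sketch of that idea, not the Fast-Downward implementation: the function name <code>best_first_search</code> and its
callback parameters are chosen for illustration only.</p>

```python
import heapq
from itertools import count


def best_first_search(start, is_goal, successors, h, use_g=True):
    """Best-first search over hashable states.

    use_g=True  orders the open list by f(s) = g(s) + h(s), i.e. A*.
    use_g=False orders it by h(s) alone, i.e. greedy best-first search.
    successors(state) must return (action, next_state, cost) triples.
    """
    tie = count()  # tie-breaker so the heap never has to compare states
    open_list = [(h(start), next(tie), 0, start, [])]
    best_g = {start: 0}
    while open_list:
        _, _, g, state, plan = heapq.heappop(open_list)
        if g > best_g[state]:
            continue  # stale entry, a cheaper path to this state is known
        if is_goal(state):
            return plan, g
        for action, nxt, cost in successors(state):
            new_g = g + cost
            if new_g >= best_g.get(nxt, float("inf")):
                continue  # no improvement over the best known path
            best_g[nxt] = new_g
            priority = new_g + h(nxt) if use_g else h(nxt)
            heapq.heappush(open_list, (priority, next(tie), new_g, nxt, plan + [action]))
    return None, float("inf")  # open list is empty: the task is unsolvable
```

<p>With an admissible heuristic, the <code>use_g=True</code> variant returns a cheapest plan, while the greedy variant may
commit to a more expensive plan that merely looks close to the goal.</p>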
<h3 id="asnet-configurations">ASNet configurations</h3>
<p>All evaluations use the same ASNet configuration with a hidden representation size <script type="math/tex">d_h = 16</script> and two
layers. In all intermediate nodes, the ELU activation function <a class="citation" href="#clevert:etal:elu:15">(Clevert, Unterthiner, & Hochreiter, 2015)</a> is applied.
To minimise overfitting, we apply L2 regularisation with <script type="math/tex">\lambda = 0.001</script> as well as dropout
<a class="citation" href="#srivastava:etal:dropout:14">(Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014)</a> with a probability of <script type="math/tex">p = 0.25</script>.</p>
<p>Training runs for <script type="math/tex">T_{max-epochs} = 10</script> epochs with <script type="math/tex">T_{prob-epochs} = 3</script> problem epochs each and is limited to two hours.
Note that training samples for each problem are accumulated in each epoch to maximise sample efficiency.
On each problem, <script type="math/tex">T_{train-epochs} = 100</script> training epochs are executed using the Adam optimizer <a class="citation" href="#kingma:ba:adam:14">(Kingma & Ba, 2014)</a> with a
learning rate of <script type="math/tex">\alpha = 0.001</script>.
As the teacher searches during training, we use the optimal <script type="math/tex">A^*</script> with <script type="math/tex">h^{LM-cut}</script>, as well as <script type="math/tex">A^*</script> with <script type="math/tex">h^{add}</script>
and GBFS with <script type="math/tex">h^{FF}</script>.</p>
<h2 id="evaluation-results">Evaluation Results</h2>
<p>A brief overview of the evaluation results clearly indicates that ASNets were unable to learn effective policies for most
domains and tasks, with Tyreworld being the notable exception.</p>
<center>
<img src="../../../../../assets/img/posts_images/bsc_thesis/asnet_eval_results.png" alt="ASNet evaluation results" width="80%" />
</center>
<p>For the <strong>Floortile</strong> and <strong>TurnAndOpen</strong> domains, the teacher searches simply took too long. Hence, the sampling
during training did not terminate in time and hardly any training could be completed. Given these issues, it is unsurprising that ASNets
were unable to solve these tasks. However, we observed that learned plans for the Floortile domain successfully
avoided potential dead-ends. Since these were previously identified as the major challenge of the domain, this success was
surprising given the brief training.</p>
<p>The domains <strong>Floortile</strong>, <strong>Sokoban</strong> and <strong>TurnAndOpen</strong> all require some form of movement through an environment. Unfortunately,
these movement actions can often be chosen interchangeably due to symmetries in the environment. As a result, related actions
received almost equal probabilities, and ASNets frequently chose an inverse action that returned to a
previously encountered state. Because the simple policy search stops as soon as a state is revisited in order to avoid cycles,
this immediately caused the planning procedure to fail.</p>
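<p>To illustrate this failure mode, here is a minimal sketch of such a simple policy search. It is not the code used in
the thesis: <code>follow_policy</code> and its callbacks are hypothetical names, and the policy is assumed to return a
dictionary mapping each applicable action to its probability.</p>

```python
def follow_policy(initial_state, policy, apply_action, is_goal, max_steps=1000):
    """Follow the most probable action; fail on dead-ends and revisited states."""
    state = initial_state
    visited = {state}
    plan = []
    for _ in range(max_steps):
        if is_goal(state):
            return plan
        probs = policy(state)  # assumed: dict of applicable action -> probability
        if not probs:
            return None  # dead-end: no applicable action
        action = max(probs, key=probs.get)
        state = apply_action(state, action)
        if state in visited:
            return None  # the policy walked back into a known state
        visited.add(state)
        plan.append(action)
    return None  # step limit reached without finding a goal
```

<p>In a corridor where moving left and right are nearly equally probable, a single slightly wrong preference is enough to
step back into a visited state and abort the whole run.</p>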
<p>Training data for the domains <strong>Blocksworld</strong>, <strong>Hanoi</strong> and <strong>ParcPrinter</strong> looked very promising. For Hanoi and Blocksworld,
training consistently terminated within an hour and reached stable success rates (solving the problems currently
trained on) of above 70%. However, this performance hardly generalised beyond the set of problem instances used during training.
For Blocksworld and Hanoi, we assume that problem instances are too diverse, requiring very different solutions despite the
common domain. On the ParcPrinter domain, however, large parts of the required scheduling were learned
perfectly. Only one of the last few tasks, which involves printing the correct images on specific sheets, consistently failed.
We are unsure why ASNets were capable of learning the previous scheduling tasks flawlessly, but failed to learn this part
of the process.</p>
<p>The only domain on which ASNets generalised and performed well is <strong>Tyreworld</strong>. We previously anticipated that it would be one
of the easiest domains due to its repetitious patterns which can be learned and simply repeated for each tyre. Additionally,
it seems to be essential that each subproblem of replacing a tyre is independent and hence these can be executed in any order.
Therefore, any indecisiveness of ASNets with respect to the order of these actions did not harm the performance on the problems.</p>
<h3 id="scalability-concerns">Scalability Concerns</h3>
<p>However, even on the Tyreworld domain, a considerable weakness of ASNets could be identified. The networks contain one module
for each grounding in every layer which causes the networks to blow up in size fairly quickly. Below, you can see figures
showing the increase in size and creation time for networks for the evaluated planning tasks.</p>
<center>
<img src="../../../../../assets/img/posts_images/bsc_thesis/asnet_network_creation_time.png" alt="ASNet network creation time" width="80%" />
</center>
<p>While most domains are unproblematic as they involve at most 1,000 - 2,000 groundings, a few domains like TurnAndOpen
or Elevator involve many more groundings. Creation time and network size increase linearly with the number of
groundings, and it becomes hard to justify training if generating such a model takes 30 minutes or more.</p>
<center>
<img src="../../../../../assets/img/posts_images/bsc_thesis/asnet_network_size.png" alt="ASNet network size" width="80%" />
</center>
<h3 id="unstable-training">Unstable Training</h3>
<p>Besides these considerable scalability issues, we also observed that training was highly unstable. It appears that training on
each problem quickly converges to good performance and low loss values. However, any such convergence seems to indicate overfitting
and reverts any previous progress on other problem files. This leads to no consistent improvement during training.</p>
<center>
<img src="../../../../../assets/img/posts_images/bsc_thesis/asnet_loss_hanoi.png" alt="ASNet training loss on Hanoi domain" width="80%" />
</center>
<p>This figure shows such loss development during training on the Hanoi domain. Similar plots could be observed for the majority of
domains.</p>
<h1 id="future-work">Future Work</h1>
<p>This project serves as a starting point to get insight into the potential application of deep learning methods for policy learning
in classical automated planning tasks. There are many logical extensions and related approaches worth exploring for future research
in this field.</p>
<h3 id="additional-input-features">Additional Input Features</h3>
<p>One straightforward extension that should be implemented for classical planning is the addition of heuristic input features, as
already proposed and evaluated by Sam Toyer <a class="citation" href="#toyer:etal:17">(Toyer, Trevizan, Thiébaux, & Xie, 2017)</a> <a class="citation" href="#toyer:thesis:17">(Toyer, 2017)</a>. While our work
considered such inputs and already provides the framework for such additions, they have not been implemented yet due to time constraints.
Previous work considered binary values representing landmarks computed by the LM-cut heuristic. These features were found to assist
ASNets in overcoming limitations in their receptive field. We propose to consider non-binary values as well. While simplified
inputs can assist learning, neural networks are generally able to learn complex relations and might extract more information
out of non-binary values.</p>
<p>Besides additional heuristic input features, one could also provide action costs in the input. Currently, the network cannot
directly reason about action costs, which are only indirectly recognized through the teacher search values in samples. Providing
action costs as direct input might speed up learning and make good network policies less dependent on the teacher search.</p>
<h3 id="sampling">Sampling</h3>
<p>During our sampling procedure, we collect data from states which are largely extracted from goal trajectories of the applied
teacher search. Such information is useful to learn a reasonable policy, but also contains a strong bias as sampled states are
heavily correlated to each other. This can lead to a bias of ASNet policies to simply imitate the teacher search and limits the
ability to generalise to problems outside of the training set. We can imagine that this was one of the reasons why stable training
was not achieved.</p>
<p>One approach to reduce the redundancy and bias of the sampling set is to reduce the number of sampled runs for goal trajectories of
the teacher search. These trajectories often share large parts of their states and quickly dominate the sampling data. One could
e.g. limit the teacher search sampling to only a single trajectory starting in the initial state. However, it would have to be
analysed whether this significantly smaller sampling set is still sufficient to allow training progress.</p>
<p>Another approach to avoid such a bias entirely would be to use a uniform sampling strategy, i.e. collecting uncorrelated data
randomly. This means that no teacher search could be used to collect connected trajectories, as the states along a trajectory
depend on each other. Sampling truly random states without an underlying bias from a planning problem state space is challenging,
but could improve the quality of the sampling data and therefore the learning considerably.</p>
<h3 id="improved-policy-search">Improved Policy Search</h3>
<p>Another logical extension of our work would be the implementation of sophisticated search engines for policies. The current search
simply follows the most probable action, which accurately represents the policy, but probably limits the effectiveness. Backtracking
could be added to our existing approach using an open list of already encountered states. This would allow the search to continue
whenever duplicate or dead-end states are reached without failing the entire planning procedure. This could already elevate
performance in tasks involving interchangeable paths or dead-ends.</p>
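<p>A minimal sketch of what such backtracking could look like is given below. It is only one possible design, not an
implemented part of our system: the open list is used as a stack, alternatives are ordered by policy probability, and
all names are hypothetical.</p>

```python
def policy_search_with_backtracking(initial_state, policy, apply_action, is_goal):
    """Depth-first policy search that falls back to the next-best action
    whenever a duplicate state or a dead-end is reached."""
    open_list = [(initial_state, [])]  # stack of (state, plan leading to it)
    closed = set()
    while open_list:
        state, plan = open_list.pop()
        if state in closed:
            continue  # duplicate: skip it instead of failing
        closed.add(state)
        if is_goal(state):
            return plan
        probs = policy(state)  # assumed: dict of applicable action -> probability
        # Push the least probable action first so that the most probable
        # action ends up on top of the stack and is expanded next.
        for action in sorted(probs, key=probs.get):
            open_list.append((apply_action(state, action), plan + [action]))
    return None  # every reachable state was explored without finding a goal
```

<p>On a symmetric corridor where the policy slightly prefers the inverse movement, this variant skips the revisited
state and continues with the second-best action instead of aborting.</p>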
<p>Second, one could aim to combine the ASNet policies with well-established heuristic search. The policy's action probabilities
could e.g. be used to break ties between multiple paths, or one could combine them
with heuristic values to compute a common priority measure.</p>
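<p>Both ideas can be sketched as open list priorities. The formulas below are illustrative assumptions only; in
particular, the blended measure with its <code>weight</code> parameter is one arbitrary choice among many and has not
been evaluated.</p>

```python
import math


def tie_breaking_priority(h_value, action_prob):
    """Order by heuristic value first; prefer more probable actions on ties."""
    return (h_value, -action_prob)


def blended_priority(h_value, action_prob, weight=1.0, eps=1e-9):
    """Hypothetical common measure: heuristic value plus a penalty that grows
    as the policy considers the action leading to the state less likely."""
    return h_value - weight * math.log(action_prob + eps)
```

<p>In both cases a lower value means that the state should be expanded earlier, matching the convention of a min-ordered
open list.</p>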
<p>Furthermore, we identified symmetries in the state space to be a considerable challenge for ASNets. Pruning techniques capable of
identifying and removing symmetric paths could compensate for the network policy’s indecisiveness in these
situations. Such pruning methods are well-researched in the planning community <a class="citation" href="#fox:long:ijcai-99">(Fox & Long, n.d.)</a>
<a class="citation" href="#domshlak:etal:ipc-15">(Domshlak, Katz, & Shleyfman, 2015)</a>.</p>
<h1 id="conclusion">Conclusion</h1>
<p>The objective of this project was to evaluate the suitability of domain-dependent policy learning using Action Schema Networks for
classical automated planning. We integrated this deep learning architecture in the Fast-Downward planning system. In doing so,
we extended the PDDL translation to compute relations between abstract action schemas, predicates as well as their groundings,
added a policy representation to the framework with a simple search and implemented the neural network architecture using Keras.
The training procedure largely follows previous work of Sam Toyer, but extends the teacher policy to support arbitrary search
configurations implemented in the Fast-Downward system. This leads to large flexibility given the various, already implemented
planning strategies due to the system’s popularity.</p>
<p>Our extensive empirical evaluation aimed to primarily answer whether ASNets are suited to solve classical planning tasks. Although
the network generalised poorly on most domains, significant learning could be found for most tasks. Hence, we would not consider
ASNets strictly unsuitable, but rather found shortcomings in the approach taken primarily regarding its training and the sampling
process. We provide analysis of the training process for each domain to identify encountered problems and based on our findings
propose further research which might alleviate or even solve the identified challenges.</p>
<p>However, the final assessment regarding the suitability of Action Schema Networks for classical planning will depend on the
results of further research building upon our work.</p>
<hr />
<p>For more details on this project, see my <a href="../../../../../assets/files/bsc_thesis.pdf">undergraduate thesis</a> or feel free
to reach out to me (contact information can be found below).</p>Lukas SchäferDomain-Dependent Policy Learning using Neural Networks for Classical Planning (4/4)ASNets for Classical Planning2019-06-02T00:00:00+02:002019-06-02T00:00:00+02:00https://www.lukaschaefer.com/ai/2019/06/02/Thesis-3<h4>Domain-Dependent Policy Learning using Neural Networks for Classical Planning (3/4)</h4>
<!--more-->
<hr />
<div style="text-align: left"> &lt;&lt;&lt; <a href="/ai/2019/05/05/Thesis-2.html"> Part 2 </a> </div>
<p>This third post about my
<a href="../../../../../assets/files/bsc_thesis.pdf">undergraduate dissertation</a> will cover my primary contributions
to translate the architecture of Action Schema Networks, introduced in the previous post, for classical automated
planning in the Fast-Downward framework.</p>
<hr />
<p>The dissertation focuses on the application of ASNets in deterministic, classical planning. For this purpose, the
network architecture was implemented and integrated into the Fast-Downward planning system <a class="citation" href="#helmert:fd:06">(Helmert, 2006)</a>
which is prominently used throughout classical planning research.</p>
<h1 id="network-definition">Network Definition</h1>
<p>Prior to the integration of the network in this system, the training and evaluation, we have to define the
network. As described in the previous post, the architecture of ASNets is inherently dependent on the given
planning task encoded as a PDDL domain and problem specification.</p>
<h2 id="pddl-in-fast-downward">PDDL in Fast-Downward</h2>
<p>Fast-Downward itself already translates such task descriptions into planning task representations containing abstract
action schemas and predicates with their respective groundings. Throughout this process, the tasks are also simplified and
normalised, removing any quantifiers as well as propositions which can never be fulfilled or remain constant. These steps are
essential for ASNet construction as they frequently reduce the number of groundings and hence also the number of units in the
proposition and action layers considerably.</p>
<p>Imagine a transport task with two trucks and eight locations connected in a circle, such that each location is
connected to its adjacent locations. The task contains a total of eight streets, which can be driven by each
truck in both directions, so it has overall <script type="math/tex">8 \cdot 2 \cdot 2 = 32</script> drive actions. Naively instantiating all
possible groundings would lead to <script type="math/tex">2 \cdot 8 \cdot 8 = 128</script> such actions (2 trucks, 8 possible origin locations
and 8 potential target locations). Similarly, the number of connected propositions,
indicating whether two locations are connected, would be reduced from <script type="math/tex">8 \cdot 8 = 64</script> to just 16. This has
significant impact on the network size of ASNets and therefore improves their scalability.</p>
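<p>The numbers in this example can be reproduced with a few lines of Python. The object names such as
<code>truck1</code> and <code>loc0</code> are made up for this illustration.</p>

```python
from itertools import product

trucks = ["truck1", "truck2"]
locations = [f"loc{i}" for i in range(8)]

# Ring of eight streets: every location is connected to its successor,
# and each street can be driven in both directions.
connected = set()
for i in range(len(locations)):
    a, b = locations[i], locations[(i + 1) % len(locations)]
    connected.add((a, b))
    connected.add((b, a))

# Naive grounding: every truck with every origin/target combination.
naive_drive = list(product(trucks, locations, locations))
# Simplified grounding: only drive actions along existing streets.
reachable_drive = [(t, o, d) for t, o, d in naive_drive if (o, d) in connected]

print(len(naive_drive), len(reachable_drive))  # 128 32
print(len(locations) ** 2, len(connected))     # 64 16
```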
<p>Furthermore, these simplified task representations can be used to efficiently compute relations among grounded
actions and propositions used to derive connections among ASNet layers. First, we derive the relations among the
fewer abstract action schemas and predicates. For each grounding, we can then derive the respective initialisation
<script type="math/tex">A</script> used and apply <script type="math/tex">A</script> to each respective related predicate or action schema to obtain the related groundings.
This process is visualised in the table below:</p>
<center>
<img src="../../../../../assets/img/posts_images/bsc_thesis/relations_initialisation.png" alt="Initialisation abstract relations" width="60%" />
</center>
<p>To find all propositions related to the grounded action <script type="math/tex">a = drive(truck, L, E)</script>, we can extract the initialisation
applied to obtain <script type="math/tex">a</script> from the abstract action schema <script type="math/tex">drive(?v, ?from, ?to)</script>. This initialisation <script type="math/tex">A</script>
can also be applied to the related predicates of the underlying action schema to find all grounded propositions
related to <script type="math/tex">a</script>.</p>
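<p>As a rough sketch, this step is a simple substitution of parameters by objects. The list of predicates related to
<code>drive</code> below is an assumption made for the example and is not taken from the Fast-Downward translation.</p>

```python
def initialisation(schema_params, grounded_args):
    """Map each schema parameter to the object used in the grounded action."""
    return dict(zip(schema_params, grounded_args))


def ground(predicate, binding):
    """Apply a parameter binding to an abstract predicate."""
    name, params = predicate
    return (name, tuple(binding[p] for p in params))


drive_params = ("?v", "?from", "?to")
# Assumed predicates related to the drive schema (illustrative only).
related_predicates = [
    ("at", ("?v", "?from")),
    ("at", ("?v", "?to")),
    ("connected", ("?from", "?to")),
]

binding = initialisation(drive_params, ("truck", "L", "E"))
related_propositions = [ground(p, binding) for p in related_predicates]
print(related_propositions)
```

<p>Since the relations are computed once per action schema, only this cheap substitution has to be repeated for every
grounding.</p>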
<h2 id="keras-network-definition">Keras Network Definition</h2>
<p>Following the creation of a planning task representation and the extraction of relations, the networks can be
constructed. We defined the architecture of ASNets using <em>Keras</em> <a class="citation" href="#chollet2015keras">(Chollet & others, 2015)</a> on top of the
<em>Tensorflow</em> <a class="citation" href="#tensorflow2015-whitepaper">(Martín Abadi et al., 2015)</a> backend. Keras is a Python library serving as an API to
multiple machine learning libraries, in our case Tensorflow, offering a modular approach with high-level
abstraction. This makes Keras model definitions comparably simple to read and write as well as easily extensible.
During our experiments, we used Keras version 2.1.6 with Tensorflow 1.8.0.</p>
<p>Generally, ASNets are structured in action and proposition layers. The respective action and proposition modules
correspond to grounded actions and propositions, but their weights are shared among all modules in the layer
that share the same underlying abstract action schema or predicate. Hence, we distinguished the input extraction
depending on the respective grounding and the shared main module holding the weights and doing the primary
computation. All such structures were implemented as custom Keras layers.
Lastly, the masked softmax function to compute the output policy in the final network layer is implemented as
a Keras layer to output a probability distribution over the set of actions.</p>
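<p>The masking idea itself is independent of Keras. The following plain Python sketch, which is not our custom layer,
shows how inapplicable actions receive zero probability while the remaining values are normalised.</p>

```python
import math


def masked_softmax(logits, mask):
    """Softmax over the entries whose mask value is 1; all others get 0."""
    applicable = [x for x, m in zip(logits, mask) if m]
    if not applicable:
        raise ValueError("no applicable action to distribute probability over")
    shift = max(applicable)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) if m else 0.0 for x, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


print(masked_softmax([1.0, 2.0, 3.0], [1, 0, 1]))
```

<p>Here the second action is masked out, so the full probability mass is split between the first and third action.</p>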
<h1 id="training">Training</h1>
<h2 id="training-overview">Training Overview</h2>
<p>To be able to apply and exploit ASNets during planning search to solve problems of a given domain, we have to
acquire knowledge and train the networks first. Such training repeatedly updates the network parameters <script type="math/tex">\theta</script>
including weight matrices and bias vectors with the goal of improving the network policy guidance.
ASNets for a common domain can share their parameters as all problems of the same domain involve the same
underlying action schemas and predicates. Therefore, it is possible to train ASNets on comparably small problem
instances for a domain to obtain such parameters and exploit the obtained policy on arbitrary problem instances.
This concept is essential for the generalisation capability of ASNets.</p>
<p>Sam Toyer already proposed a supervised training algorithm in his thesis <a class="citation" href="#toyer:thesis:17">(Toyer, 2017)</a>. However, we
made minor modifications for application in deterministic, classical planning rather than probabilistic planning.
Throughout training, multiple epochs are executed in which the network is trained for each given training problem
in a predetermined set of training tasks <script type="math/tex">P_{train}</script>. For each problem, the respective network is constructed
before multiple training iterations, which we refer to as problem epochs, are executed. Each such problem epoch
involves the sampling of states and following update steps over the set of sampled states <script type="math/tex">\mathcal{M}</script> to
optimise the parameters <script type="math/tex">\theta</script>. Pseudocode for the entire training cycle can be seen below:</p>
<center>
<img src="../../../../../assets/img/posts_images/bsc_thesis/asnet_training.png" alt="ASNet training algorithm" width="60%" />
</center>
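<p>The nesting of epochs, problems and problem epochs can also be sketched in Python. This skeleton only mirrors the
structure described above: the three callbacks and the shared <code>weights</code> dictionary are placeholders, and
accumulating samples per problem within one epoch is our reading of the procedure rather than a verified detail.</p>

```python
def train(problems, build_network, sample_states, update_step,
          max_epochs=10, problem_epochs=3, train_epochs=100):
    """Structural sketch of the training cycle with placeholder callbacks."""
    weights = {}  # parameters shared by the networks of all problems
    for _ in range(max_epochs):
        for problem in problems:
            # Each problem gets its own network built around the shared weights.
            network = build_network(problem, weights)
            memory = []  # sampled states accumulated for this problem
            for _ in range(problem_epochs):
                memory.extend(sample_states(problem, network))
                for _ in range(train_epochs):
                    update_step(network, memory)  # updates weights in place
    return weights
```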
<p>For some domains, good policies can be learned fairly easily and quickly. In these cases, it is unnecessary to
execute a large number of epochs. Therefore, we potentially stop training early as proposed by Sam Toyer himself.
Training is stopped early whenever a large portion of network searches during sampling successfully reached a goal
and the success rate of the network search has hardly improved for multiple epochs.</p>
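<p>Such a criterion could be sketched as follows. The thresholds <code>min_rate</code>, <code>patience</code> and
<code>min_delta</code> are hypothetical values chosen for illustration and are not the ones used in our experiments.</p>

```python
def should_stop_early(success_rates, min_rate=0.9, patience=3, min_delta=0.01):
    """Decide whether to stop, given the success rate of each finished epoch.

    Stops if the last `patience` epochs all reached at least `min_rate`
    and improved by at most `min_delta` over the epoch before them.
    """
    if patience >= len(success_rates):
        return False  # not enough history to judge stagnation
    recent = success_rates[-patience:]
    baseline = success_rates[-patience - 1]
    high_enough = min(recent) >= min_rate
    stagnating = min_delta >= max(recent) - baseline
    return high_enough and stagnating
```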
<h2 id="loss-function">Loss Function</h2>
<p>During the training steps, the network parameters <script type="math/tex">\theta</script> are updated to ensure that “good” actions are
chosen in the sampled states from <script type="math/tex">\mathcal{M}</script>. This is achieved by optimising the parameters to minimise the
following binary crossentropy loss function proposed by Sam Toyer et al. <a class="citation" href="#toyer:etal:17">(Toyer, Trevizan, Thiébaux, & Xie, 2017)</a>:</p>
<script type="math/tex; mode=display">\mathcal{L}_\theta(\mathcal{M}) = \sum_{s \in \mathcal{M}} \sum_{a \in \mathcal{A}} -(1 - y_{s,a}) \cdot \log(1 - \pi^\theta(a \mid s)) - y_{s,a} \cdot \log(\pi^\theta(a \mid s))</script>
<p>where <script type="math/tex">\pi^\theta(a \mid s)</script> is the probability that the network policy with parameters <script type="math/tex">\theta</script>
chooses action <script type="math/tex">a</script> in state <script type="math/tex">s</script>, and <script type="math/tex">y_{s,a}</script> is a binary label: <script type="math/tex">y_{s,a} = 1</script> if action <script type="math/tex">a</script>
starts an optimal plan from state <script type="math/tex">s</script> onwards according to the teacher search <script type="math/tex">S^*</script>, and <script type="math/tex">y_{s,a} = 0</script> otherwise. These values can
be acquired during the sampling process.</p>
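<p>Written out in plain Python, the loss could look like the sketch below. It is not the Keras loss used for training;
the input format (one pair of dictionaries per state) and the clipping constant <code>eps</code> are assumptions added
to keep the example self-contained and numerically safe.</p>

```python
import math


def asnet_loss(samples, eps=1e-12):
    """Sum of binary cross-entropy terms over all states and actions.

    samples: iterable of (probs, labels) pairs, one per sampled state, where
    probs maps each action to pi(a | s) and labels maps it to y_{s,a}.
    """
    total = 0.0
    for probs, labels in samples:
        for action, p in probs.items():
            y = labels[action]
            p = min(max(p, eps), 1.0 - eps)  # keep both logarithms finite
            total += -(1 - y) * math.log(1 - p) - y * math.log(p)
    return total
```

<p>The loss is small when the policy puts high probability on actions labelled 1 and low probability on all others.</p>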
<h2 id="sampling">Sampling</h2>
<p>In order to enable training of the networks using the supervised algorithm described above, labelled data is
needed. For ASNets in particular, such data must include state information for the network input as well as all
further information required to evaluate the mentioned loss. Therefore, a sample for state <script type="math/tex">s</script> can be
represented by a tuple <script type="math/tex">(g, t_s, a_s, y_s)</script> of four lists of binary features. The values of <script type="math/tex">g</script> indicate
for each fact whether it is contained in the planning task’s goal. <script type="math/tex">t_s</script> shows which facts are true in <script type="math/tex">s</script>,
<script type="math/tex">a_s</script> indicates which actions are applicable in <script type="math/tex">s</script> (required for the Softmax mask) and <script type="math/tex">y_s</script> includes
the <script type="math/tex">y_{s,a}</script> values described in the loss section for each action. Note that <script type="math/tex">g</script> remains the same for all
states and hence does not have to be computed for each sample.
The networks will receive <script type="math/tex">g, t_s</script> and <script type="math/tex">a_s</script> as inputs, while the <script type="math/tex">y_{s,a}</script> values are just used to compute
the loss for optimisation.</p>
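<p>One way to represent such a sample in Python is a named tuple of binary vectors. The field names and the helper
functions are hypothetical and only serve to make the four components concrete.</p>

```python
from typing import NamedTuple, Tuple


class Sample(NamedTuple):
    goal: Tuple[int, ...]        # g: 1 if the fact is part of the goal
    true_facts: Tuple[int, ...]  # t_s: 1 if the fact holds in state s
    applicable: Tuple[int, ...]  # a_s: 1 if the action is applicable in s
    labels: Tuple[int, ...]      # y_s: 1 if the action starts an optimal teacher plan


def indicator(universe, subset):
    """Binary vector marking which elements of universe are in subset."""
    return tuple(int(item in subset) for item in universe)


def make_sample(facts, actions, goal, state, applicable, optimal):
    """Build the sample tuple from sets of facts and actions."""
    return Sample(
        goal=indicator(facts, goal),
        true_facts=indicator(facts, state),
        applicable=indicator(actions, applicable),
        labels=indicator(actions, optimal),
    )
```

<p>The first three fields would form the network input, while <code>labels</code> is only needed for the loss.</p>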
<h3 id="sampling-search">Sampling Search</h3>
<p>These samples are collected as part of a search applied on each planning problem in <script type="math/tex">P_{train}</script>. During this
search we want to collect samples for states the network policy encounters to improve upon its previous
performance. However, these states do not necessarily provide guidance towards the goal, especially at the
beginning of training. Therefore, we also sample states along trajectories of an applied teacher search <script type="math/tex">S^*</script>
to ensure that the network is trained on objectively “good” states.</p>
<p>In the sampling search, we first explore the state space of the current problem by applying the network search
<script type="math/tex">S^\theta</script> which naively follows the previously constructed ASNet. Starting at the initial state <script type="math/tex">s_0</script>, the
most probable action according to the network policy <script type="math/tex">\pi^\theta</script> is followed in each state until
either a dead-end, a goal or a previously encountered state is reached. The latter condition prevents the
network policy from repeatedly exploring the same states, as the network would apply the same actions when reaching
a state for the second time. All states encountered along the explored trajectory are collected in the state
memory <script type="math/tex">\mathcal{M}</script>.</p>
<p>After exploring these states based on the network policy, a predetermined teacher search <script type="math/tex">S^*</script> is started
from all previously collected states. Similarly to the first phase, we explore and collect states alongside
the followed trajectory until a goal or dead-end is reached.</p>
<center>
<img src="../../../../../assets/img/posts_images/bsc_thesis/sampling_search.png" alt="Sampling search algorithm" width="60%" />
</center>
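<p>The two phases of the sampling search can be sketched as follows. This is a simplified illustration and not the Fast-Downward implementation: states are assumed to be hashable, and the callbacks <code>network_action</code>, <code>teacher_action</code>, <code>successor</code>, <code>is_goal</code> and <code>is_dead_end</code> are hypothetical. Stopping teacher rollouts on repeated states is an added safety guard.</p>

```python
def rollout(start, choose_action, successor, is_goal, is_dead_end):
    """Follow choose_action from start and return the visited states in order.

    Stops at a goal, a dead-end, or a state already seen in this rollout.
    """
    visited, seen, state = [], set(), start
    while state not in seen:
        seen.add(state)
        visited.append(state)
        if is_goal(state) or is_dead_end(state):
            break
        state = successor(state, choose_action(state))
    return visited


def sampling_search(s0, network_action, teacher_action,
                    successor, is_goal, is_dead_end):
    """Collect the state memory for one problem (illustrative sketch)."""
    memory, in_memory = [], set()

    def collect(states):
        for s in states:
            if s not in in_memory:
                in_memory.add(s)
                memory.append(s)

    # Phase 1: follow the network policy from the initial state.
    explored = rollout(s0, network_action, successor, is_goal, is_dead_end)
    collect(explored)
    # Phase 2: start the teacher search from every explored state.
    for s in explored:
        collect(rollout(s, teacher_action, successor, is_goal, is_dead_end))
    return memory
```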
<p>For each state <script type="math/tex">s</script> collected during the sampling, the corresponding tuple <script type="math/tex">(g, t_s, a_s, y_s)</script> has to be
extracted. Identifying the values <script type="math/tex">g, t_s</script> and <script type="math/tex">a_s</script> is straightforward and only requires a simple lookup
of fact values in the goal set and the current state as well as a check for applicable actions. However, obtaining the
<script type="math/tex">y_{s,a}</script> values is more complicated. As a reminder, these values indicate whether action <script type="math/tex">a</script> starts an
optimal plan from <script type="math/tex">s</script> according to the teacher search <script type="math/tex">S^*</script>. Hence, for each sampled state we have to
identify which actions start optimal plans with respect to the teacher search. First, we compute a plan from <script type="math/tex">s</script>
with <script type="math/tex">S^*</script> and store its cost <script type="math/tex">c_s</script>. Then, for each applicable action <script type="math/tex">a</script>, we obtain the successor state <script type="math/tex">s'</script> by applying <script type="math/tex">a</script> in <script type="math/tex">s</script>
and compute the cost <script type="math/tex">c_{s'}</script> of the plan found for <script type="math/tex">s'</script> using <script type="math/tex">S^*</script>. With <script type="math/tex">c_a</script> denoting the cost of <script type="math/tex">a</script>, if <script type="math/tex">c_{s'} + c_a \leq c_s</script> then choosing
<script type="math/tex">a</script> in <script type="math/tex">s</script> appears to be optimal with respect to the teacher search and therefore <script type="math/tex">y_{s,a} = 1</script>,
otherwise <script type="math/tex">y_{s,a} = 0</script>.</p>
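<p>The labelling rule above can be sketched as a small function. The callbacks are assumptions for illustration: <code>teacher_cost</code> stands for running the teacher search from a state and returning the cost of the plan it finds (or infinity if it finds none), which is not necessarily how the real implementation exposes it.</p>

```python
import math


def teacher_labels(state, applicable, successor, action_cost, teacher_cost):
    """Return {action: y_{s,a}} for all applicable actions of a state.

    y_{s,a} = 1 iff c_{s'} + c_a <= c_s, where s' is the successor of s
    under a and all costs come from the (hypothetical) teacher_cost callback.
    """
    c_s = teacher_cost(state)
    labels = {}
    for a in applicable:
        c_next = teacher_cost(successor(state, a))
        # Guard (an assumption): successors without a teacher plan get label 0.
        good = math.isfinite(c_next) and c_next + action_cost(a) <= c_s
        labels[a] = int(good)
    return labels
```

<p>Note that this requires one teacher search per applicable action in every sampled state, which is why computing these values dominates the cost of sampling.</p>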
<h2 id="policies-in-fast-downward">Policies in Fast-Downward</h2>
<p>In addition to the described sampling search implemented in the Fast-Downward system <a class="citation" href="#helmert:fd:06">(Helmert, 2006)</a>, we
also extended the system with a general framework for policies as an alternative evaluator to the usually
applied heuristic functions encountered throughout automated planning.</p>
<p>Based on this added concept, we specifically implemented a network policy for ASNets which serves as an interface
to deep learning models representing policies. This policy is based on a Fast-Downward representation of such
networks responsible for extracting and feeding the required input data into the network and extracting its
output. Such interaction with the networks was achieved by storing the network models as <em>Protobuf</em> networks.</p>
<h3 id="policy-search">Policy Search</h3>
<p>Lastly, we added a new search engine based on our implementation of policies. It naively follows the most probable
action for each state according to the given policy. While such a search is very simple and probably limits the
achieved performance, it is solely reliant on the policy. Hence, it allows us to purely evaluate the quality of
network policies. To improve performance in the future, it will certainly be of interest to apply these network policies in
more sophisticated policy searches such as Monte-Carlo Tree Search <a class="citation" href="#coulom:cg-06">(Coulom, 2007)</a>.</p>
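<p>A minimal sketch of such a greedy policy search is shown below. It is not the Fast-Downward search engine itself: the callbacks are hypothetical, and both the loop detection and the step limit are assumptions added so that the sketch always terminates.</p>

```python
def policy_search(s0, policy, applicable, successor, is_goal, max_steps=1000):
    """Greedily follow the most probable applicable action.

    policy(state) is assumed to return a dict mapping actions to
    probabilities. Returns the plan as a list of actions, or None on failure.
    """
    plan, state, seen = [], s0, {s0}
    while not is_goal(state):
        actions = applicable(state)
        if not actions or len(plan) >= max_steps:
            return None  # dead-end or step limit reached
        probs = policy(state)
        best = max(actions, key=lambda a: probs.get(a, 0.0))
        state = successor(state, best)
        if state in seen:
            return None  # a deterministic policy would cycle forever
        seen.add(state)
        plan.append(best)
    return plan
```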
<hr />
<p>In the last post, I will summarise the extensive evaluation results and my concluding thoughts on this project.</p>
<div style="text-align: right"><a href="/ai/2019/06/26/Thesis-4.html"> Part 4 </a> >>> </div>Lukas SchäferDomain-Dependent Policy Learning using Neural Networks for Classical Planning (3/4)Action Schema Networks: Planning meets Deep Learning2019-05-05T00:00:00+02:002019-05-05T00:00:00+02:00https://www.lukaschaefer.com/ai/2019/05/05/Thesis-2<h4>Domain-Dependent Policy Learning using Neural Networks for Classical Planning (2/4)</h4>
<!--more-->
<hr />
<div style="text-align: left"> <<< <a href="/ai/2018/12/03/Thesis-1.html"> Part 1 </a> </div>
<p>This post, the second in the series about my
<a href="../../../../../assets/files/bsc_thesis.pdf">undergraduate dissertation</a>, covers the underlying
architecture of Action Schema Networks.</p>
<hr />
<h1 id="action-schema-networks">Action Schema Networks</h1>
<p><em>Action Schema Networks</em>, or <em>ASNets</em> for short, are a neural network architecture proposed by Sam Toyer et al.
<a class="citation" href="#toyer:thesis:17">(Toyer, 2017; Toyer, Trevizan, Thiébaux, & Xie, 2017)</a> for application in automated planning. The networks are capable of
learning domain-specific policies that can be exploited on arbitrary problems of a given (P)PDDL domain. This post
covers the general architecture and design of the networks, their training and how the learned
knowledge is exploited. Lastly, it presents Sam Toyer’s empirical evaluation of ASNets on planning
tasks.</p>
<h2 id="architecture">Architecture</h2>
<p>ASNets are composed of alternating action and proposition layers, containing action and proposition modules
for each ground action or proposition respectively, starting and ending with an action layer. Overall the network
computes a policy <script type="math/tex">\pi^\theta</script>, outputting a probability <script type="math/tex">\pi^\theta(a \mid s)</script> to choose action <script type="math/tex">a</script> in a
given state <script type="math/tex">s</script> for every action <script type="math/tex">a \in \mathcal{A}</script>. One naive approach to exploit this policy during
search on planning tasks would be to greedily follow <script type="math/tex">\pi^\theta</script>, i.e. choosing
<script type="math/tex">argmax_{a} \pi^\theta(a \mid s)</script> in state <script type="math/tex">s</script>.</p>
<p>The following figure from Toyer’s thesis <a class="citation" href="#toyer:thesis:17">(Toyer, 2017)</a> illustrates an ASNet with <script type="math/tex">L</script> layers.</p>
<center>
<img src="../../../../../assets/img/posts_images/bsc_thesis/asnet_illustration.png" alt="ASNet illustration" width="60%" />
</center>
<h3 id="action-modules">Action modules</h3>
<p>Each action module in an action layer <script type="math/tex">l</script> represents a ground action <script type="math/tex">a</script> and computes a hidden representation</p>
<script type="math/tex; mode=display">\phi^l_a = f(W^l_a \cdot u^l_a + b^l_a)</script>
<p>where <script type="math/tex">u^l_a \in \mathbb{R}^{d^l_a}</script> is an input vector, <script type="math/tex">W^l_a \in \mathbb{R}^{d_h \times d^l_a}</script> is a learned
weight matrix and <script type="math/tex">b^l_a \in \mathbb{R}^{d_h}</script> is the corresponding bias. <script type="math/tex">f</script> is a nonlinear function, e.g.
ReLU or tanh, and <script type="math/tex">d_h</script> is the chosen hidden representation size. The input is constructed as follows:</p>
<script type="math/tex; mode=display">u^l_a = \begin{bmatrix} \psi^{l-1}_1 \\ \vdots \\ \psi^{l-1}_M \end{bmatrix}</script>
<p>where <script type="math/tex">\psi^{l-1}_i</script> is the hidden representation of the proposition module of the <script type="math/tex">i</script>-th proposition <script type="math/tex">p_i</script>
related to <script type="math/tex">a</script> in the preceding proposition layer. Proposition <script type="math/tex">p \in \mathcal{P}</script> is said to be related to
action <script type="math/tex">a \in \mathcal{A}</script> iff <script type="math/tex">p</script> appears in the precondition <script type="math/tex">pre_a</script>, the add list <script type="math/tex">add_a</script> or the delete list
<script type="math/tex">del_a</script> of action <script type="math/tex">a</script>. This concept of relatedness yields the sparse connectivity of ASNets,
which is essential for their weight sharing and efficient learning.</p>
<p>Coming back to the dummy planning task introduced in the <a href="/ai/2018/12/03/Thesis-1.html">first post</a>, where a package has to be delivered from Manchester to Edinburgh, the action module for <script type="math/tex">drive(truck, M, E)</script>
would look like this:</p>
<center>
<img src="../../../../../assets/img/posts_images/bsc_thesis/asnet_action_module.png" alt="ASNet action module illustration" width="60%" />
</center>
<p>Note that the number of related propositions <script type="math/tex">M</script> of two actions <script type="math/tex">a_1</script> and <script type="math/tex">a_2</script>, which are instantiated from
the same action schema of the underlying domain, will always be the same. Hence, all action modules of e.g.
<script type="math/tex">drive</script> actions will have the same structure based on the action schema and relatedness. ASNets use this
to share the weight matrix <script type="math/tex">W^l_a</script> and bias <script type="math/tex">b^l_a</script> among all actions of the same action schema as <script type="math/tex">a</script> in
layer <script type="math/tex">l</script>. Through this approach, ASNets are able to share these parameters among arbitrary problem instances
of the same domain, as these all share the same action schemas.</p>
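<p>To make the weight sharing concrete, the following sketch computes the hidden representation of an action module in plain Python. The layout of <code>schema_params</code> (one weight matrix and bias per action schema) and all names are assumptions for illustration, not the data structures of an actual ASNet implementation.</p>

```python
import math


def affine(W, u, b):
    """Compute W * u + b for a list-of-rows matrix W and vectors u, b."""
    return [sum(w * x for w, x in zip(row, u)) + b_i for row, b_i in zip(W, b)]


def action_module(schema_params, schema, related_reprs, f=math.tanh):
    """Hidden representation phi = f(W u + b) of one ground action.

    schema_params maps an action schema name to its (W, b) pair, so every
    ground action of that schema reuses the same parameters.
    related_reprs holds the representations of the related propositions.
    """
    W, b = schema_params[schema]
    u = [x for psi in related_reprs for x in psi]  # stack the inputs
    return [f(z) for z in affine(W, u, b)]
```

<p>Two ground actions such as <code>drive(truck, L, M)</code> and <code>drive(truck, M, E)</code> would both look up the parameters stored under <code>"drive"</code> and differ only in the proposition representations they receive as input.</p>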
<p>Action modules of the very first action layer receive specific input vectors <script type="math/tex">u^1_a</script> containing binary features
representing the truth values of related propositions in input state <script type="math/tex">s</script>, values indicating the relevance of
related propositions for the problem’s goal as well as a value showing whether <script type="math/tex">a</script> is applicable in <script type="math/tex">s</script>.
Additionally, Sam Toyer et al. experimented with heuristic features regarding disjunctive action landmarks,
computed for the LM-cut heuristic <a class="citation" href="#helmert:domshlak:icaps-09">(Helmert & Domshlak, n.d.)</a>, as additional input to overcome
limitations in the receptive field of ASNets. Otherwise, ASNets are only able to reason about action chains with
length at most <script type="math/tex">L</script>.</p>
<p>In the output action layer, the network has to output a policy represented by a probability
distribution <script type="math/tex">\pi^\theta(a \mid s)</script>. This is achieved using a masked softmax activation function, where a mask
<script type="math/tex">m</script> is applied to ensure that <script type="math/tex">\pi^\theta(a \mid s) = 0</script> iff <script type="math/tex">pre_a \nsubseteq s</script>, i.e. only applicable
actions receive nonzero probability. The mask represents this as binary features with <script type="math/tex">m_i = 1</script> iff <script type="math/tex">pre_{a_i} \subseteq s</script> and <script type="math/tex">m_i = 0</script> otherwise. Overall, the activation function computes the probability <script type="math/tex">\pi_i = \pi^\theta(a_i \mid s)</script> as follows for all actions <script type="math/tex">\mathcal{A} = \{a_1, ..., a_N\}</script>:</p>
<script type="math/tex; mode=display">\pi_i = \frac{m_i \cdot exp(\phi^{L + 1}_{a_i})}{\sum_{j=1}^N m_j \cdot exp(\phi^{L + 1}_{a_j})}</script>
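<p>The masked softmax can be sketched in plain Python as follows. Subtracting the largest unmasked value before exponentiating is a standard numerical stability trick that is not part of the formula above and does not change the result; the function name is hypothetical.</p>

```python
import math


def masked_softmax(logits, mask):
    """Softmax over the entries with mask 1; masked entries get probability 0."""
    active = [z for z, m in zip(logits, mask) if m]
    if not active:
        raise ValueError("no applicable action: the mask is all zeros")
    shift = max(active)  # stability shift, cancels out in the ratio
    weights = [math.exp(z - shift) if m else 0.0 for z, m in zip(logits, mask)]
    total = sum(weights)
    return [w / total for w in weights]
```

<p>For instance, with two applicable actions of equal score and one inapplicable action with a much higher score, the inapplicable action still receives probability zero and the other two share the probability mass equally.</p>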
<h3 id="proposition-modules">Proposition modules</h3>
<p>Proposition modules are constructed very similarly to action modules but only occur in intermediate layers.
Therefore a hidden representation produced by the module for proposition <script type="math/tex">p \in \mathcal{P}</script> in the <script type="math/tex">l</script>-th
layer looks like the following</p>
<script type="math/tex; mode=display">\psi^l_p = f(W^l_p \cdot v^l_p + b^l_p)</script>
<p>where <script type="math/tex">v_p^l \in \mathbb{R}^{d_p^l}</script> is an input vector and <script type="math/tex">W^l_p</script>, <script type="math/tex">b^l_p</script> represent the respective
weight matrix and bias vector and <script type="math/tex">f</script> is the same nonlinearity applied in action modules.
The main difference between proposition and action modules is that the number of actions related to one
proposition can vary making the input construction slightly more complicated.</p>
<p>To deal with this variation and be able to share weights among proposition modules as for action modules, the
input feature vector’s dimensionality <script type="math/tex">d_p^l</script> has to be equal for all propositions with the same underlying
predicate. Therefore the action schemas <script type="math/tex">A_1, ..., A_S</script> referencing the predicate of proposition <script type="math/tex">p</script> in their
preconditions, add or delete lists are collected. When building the hidden representation of proposition <script type="math/tex">p</script>, all
related ground actions from the listed action schemas are considered, with the action module representations of the
same action schema being combined into a single <script type="math/tex">d_h</script>-dimensional vector using a pooling function:</p>
<script type="math/tex; mode=display">v^l_p = \begin{bmatrix} pool(\{\phi^l_a \mid op(a) = A_1 \wedge R(a, p)\}) \\ \vdots \\ pool(\{\phi^l_a \mid op(a) = A_S \wedge R(a, p)\}) \end{bmatrix}</script>
<p>Here, <script type="math/tex">op(a)</script> denotes the action schema of ground action <script type="math/tex">a</script> and <script type="math/tex">R(a, p)</script> holds iff <script type="math/tex">a</script> and <script type="math/tex">p</script> are
related.</p>
<center>
<img src="../../../../../assets/img/posts_images/bsc_thesis/asnet_proposition_module.png" alt="ASNet proposition module illustration" width="60%" />
</center>
<p>The figure illustrates a proposition module for <script type="math/tex">at(truck, L)</script> in the described planning task of the
previous post.</p>
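<p>The construction of the pooled input vector can be sketched as below. Two choices are assumptions: element-wise max pooling is used as the pooling function, and schemas without any related ground action contribute a zero vector. The tuple layout of <code>ground_actions</code> is likewise hypothetical.</p>

```python
def proposition_input(prop, schemas, ground_actions, phi, d_h):
    """Build the input vector v_p of a proposition module.

    ground_actions: list of (name, schema, related_props) tuples
    phi: maps a ground action name to its d_h-dimensional representation
    """
    v = []
    for schema in schemas:
        group = [phi[name] for name, sch, related in ground_actions
                 if sch == schema and prop in related]
        if group:
            pooled = [max(column) for column in zip(*group)]  # max pooling
        else:
            pooled = [0.0] * d_h  # assumption: empty groups pool to zeros
        v.extend(pooled)
    return v
```

<p>Because every schema contributes exactly <code>d_h</code> entries regardless of how many ground actions are related, the resulting vector has the same length for all propositions of the same predicate.</p>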
<h2 id="supervised-training">Supervised Training</h2>
<p>During training, the ASNet is executed on small problems from a domain to learn weights that still lead to an
efficient policy on larger problems of the same domain. The supervised training algorithm proposed by Sam Toyer et al. relies on a teacher policy.</p>
<p>At the beginning of training, weights and biases are initialised using Glorot initialisation, also known as Xavier initialisation <a class="citation" href="#glorot:bengio:10">(Glorot & Bengio, 2010)</a>, with a zero-centred Gaussian distribution.
After initialising the parameters, the network explores the state space of each training problem over
multiple epochs. Starting from the problem’s initial state <script type="math/tex">s_0</script>, the exploration follows the network policy
<script type="math/tex">\pi^\theta</script> and stops when a limit on the number of traversed states is hit or a goal or dead-end state is
reached. Dead-ends can be detected efficiently using a delete-relaxed heuristic <a class="citation" href="#hoffmann:nebel:jair-01">(Hoffmann & Nebel, 2001)</a>.
Let the set of explored states in epoch <script type="math/tex">e</script> be denoted as <script type="math/tex">S_{exp}^e</script>.</p>
<p>Additionally, for every <script type="math/tex">s \in S_{exp}^e</script> a teacher policy (usually an optimal policy) <script type="math/tex">\pi^*</script> is used to
extract all states encountered when following the policy from <script type="math/tex">s</script>. All these states are collected in the set
<script type="math/tex">S_{opt}^e</script> to ensure that the network is always trained with “good” states, while <script type="math/tex">S_{exp}^e</script> is essential
to allow the network to improve upon its performance in already visited states. Afterwards, the set of training
states <script type="math/tex">\mathcal{M}</script> is updated as <script type="math/tex">\mathcal{M} = \mathcal{M} \cup S_{exp}^e \cup S_{opt}^e</script>. After each
exploration phase the ASNet’s weights <script type="math/tex">\theta</script> are updated using the loss function</p>
<script type="math/tex; mode=display">\mathcal{L}_\theta(\mathcal{M}) = \frac{1}{|\mathcal{M}|} \sum_{s \in \mathcal{M}} \sum_{a \in \mathcal{A}} \pi^\theta(a \mid s) \cdot Q^*(s, a)</script>
<p>where <script type="math/tex">Q^*(s,a)</script> denotes the expected cost of reaching a goal from <script type="math/tex">s</script> by following the policy <script type="math/tex">\pi^*</script> after
taking action <script type="math/tex">a</script>. The parameter updates are performed using minibatch <em>stochastic gradient descent</em>, which avoids the
significant expense of computing gradients on the entire state collection <script type="math/tex">\mathcal{M}</script> and can generally
converge faster <a class="citation" href="#li:etal:14">(Li, Zhang, Chen, & Smola, 2014)</a>. The Adam optimisation algorithm, proposed by Kingma and Ba
<a class="citation" href="#kingma:ba:adam:14">(Kingma & Ba, 2014)</a>, is used to update <script type="math/tex">\theta</script> in a direction minimising <script type="math/tex">\mathcal{L}_\theta(\mathcal{M})</script>.</p>
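<p>As a sanity check of the loss definition, the following sketch evaluates it for a given state collection. The callbacks <code>policy</code> (returning action probabilities for a state) and <code>q_star</code> (returning the teacher's cost-to-go) are hypothetical stand-ins; real training would compute this on minibatches inside a deep learning framework so that gradients are available.</p>

```python
def asnet_loss(memory, policy, q_star):
    """Average over states of sum_a pi(a | s) * Q*(s, a).

    memory: non-empty list of states
    policy(s): dict mapping each action to its probability in s
    q_star(s, a): cost of reaching a goal after taking a in s (teacher)
    """
    total = 0.0
    for s in memory:
        total += sum(p * q_star(s, a) for a, p in policy(s).items())
    return total / len(memory)
```

<p>The loss decreases when the policy shifts probability mass towards actions with a lower teacher cost, which is exactly the behaviour the training is meant to encourage.</p>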
<p>Exploration is stopped early once an early stopping condition is fulfilled. This is the case when the network policy
<script type="math/tex">\pi^\theta</script> reached a goal state from at least <script type="math/tex">99.9\%</script> of the states in <script type="math/tex">\mathcal{M}</script> during the last epoch
and the success rate of <script type="math/tex">\pi^\theta</script> has not increased by more than <script type="math/tex">0.01\%</script> over the previous best rate for
at least five epochs.</p>
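<p>One possible reading of this stopping condition is sketched below. It assumes that success rates are recorded per epoch as fractions in [0, 1] and that the 0.01% margin is meant in absolute terms (0.0001); the function and parameter names are hypothetical.</p>

```python
def should_stop(success_rates, threshold=0.999, min_gain=0.0001, patience=5):
    """Decide whether to stop, given the per-epoch success rates so far.

    Stops if the latest rate reaches the threshold and none of the last
    `patience` epochs improved on the earlier best by more than min_gain.
    """
    if len(success_rates) <= patience:
        return False  # not enough history to judge stagnation
    if success_rates[-1] < threshold:
        return False
    best_before = max(success_rates[:-patience])
    recent = success_rates[-patience:]
    return all(rate <= best_before + min_gain for rate in recent)
```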
<h2 id="empirical-evaluation">Empirical Evaluation</h2>
<p>Sam Toyer conducted an experiment comparing ASNets to state-of-the-art probabilistic planners LRTDP
<a class="citation" href="#bonet:geffner:icaps-03">(Bonet & Geffner, n.d.)</a>, ILAO<script type="math/tex">^*</script> <a class="citation" href="#hansen:zilberstein:ilao*:01">(Hansen & Zilberstein, 2001)</a> and SSiPP
<a class="citation" href="#trevizan:veloso:ssipp:14">(Trevizan & Veloso, 2014)</a> to be able to evaluate their performance. All experiments were run with
the admissible LM-cut and the inadmissible <script type="math/tex">h^{add}</script> heuristics and were limited to 9000 s and 10 GB of memory.</p>
<p>The ASNet was trained using two layers, a hidden representation size <script type="math/tex">d_h = 16</script> for each module and the ELU
activation function <a class="citation" href="#clevert:etal:elu:15">(Clevert, Unterthiner, & Hochreiter, 2015)</a> for each domain using small problem instances. The learning rate
during training was 0.0005 and a batch size of 128 was utilized for the Adam optimization. Additionally, <script type="math/tex">L_2</script>
regularization with <script type="math/tex">\lambda = 0.001</script> on the weights and dropout <a class="citation" href="#srivastava:etal:dropout:14">(Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014)</a> with
<script type="math/tex">p = 0.25</script> on the outputs of all intermediate layers was used to prevent overfitting.
Training was limited to two hours. LRTDP with <script type="math/tex">h^{LM-cut}</script> and <script type="math/tex">h^{add}</script>
was employed as the teacher policy for the ASNets.</p>
<p>Three probabilistic planning domains were used in the evaluation: CosaNostra Pizza
<a class="citation" href="#stephenson:snow_crash:92">(Stephenson, 1992)</a>, Probabilistic Blocks World <a class="citation" href="#younes:etal:jair-05">(Younes, Littman, Weissman, & Asmuth, 2005)</a> and
Triangle Tire World <a class="citation" href="#little:thiebaux:icaps-ws-07">(Little & Thiebaux, 2007)</a>.
Besides probabilistic planning, Sam Toyer also briefly evaluated ASNets on the deterministic classical planning
domain Gripper <a class="citation" href="#long:etal:aim-00">(Long et al., 2000)</a> against Greedy best-first search (GBFS), <script type="math/tex">A^*</script> using the
<script type="math/tex">h^{LM-cut}</script> and <script type="math/tex">h^{add}</script> heuristics, LAMA-2011 <a class="citation" href="#richter:etal:ipc-11">(Richter, Westphal, & Helmert, 2011)</a> and LAMA-first, which won the
International Planning Competition (IPC) 2011.</p>
<p>ASNets performed comparatively better on large problems, where the required training truly pays off. They were able
to significantly outperform all baseline planners on large problem instances of the CosaNostra Pizza and Triangle
Tire World domains, learning optimal or near-optimal policies for many problems. For the Probabilistic Blocks World
domain, the LM-cut policy was too expensive to compute and the exploration using the <script type="math/tex">h^{add}</script> teacher policy
was insufficient to outperform the baseline planners for complex tasks.</p>
<p>In deterministic planning, ASNets took significantly more time to train and evaluate compared to most baseline
planners. The solutions of ASNets using additional heuristic input features were found to be optimal on all
problems, but without such input the networks were unable to solve even problems of medium size. However, the LAMA
planners outperformed ASNets in the considered problems finding optimal solutions significantly faster.
It should be noted that the experiment primarily focused on probabilistic planning and the classical planning part
was merely to show the ability of ASNets to be executed on these tasks. To measure and evaluate the performance of
ASNets in classical deterministic planning, a comprehensive experiment would still be needed.</p>
<hr />
<p>In the third post, I will explain my contributions to extending the capabilities of ASNets in deterministic
automated planning and how these networks were integrated into the Fast-Downward planning system
<a class="citation" href="#helmert:fd:06">(Helmert, 2006)</a>.</p>
<div style="text-align: right"><a href="/ai/2019/06/02/Thesis-3.html"> Part 3 </a> >>> </div>Lukas SchäferDomain-Dependent Policy Learning using Neural Networks for Classical Planning (2/4)Classical Automated Planning and Deep Learning2018-12-03T00:00:00+01:002018-12-03T00:00:00+01:00https://www.lukaschaefer.com/ai/2018/12/03/Thesis-1<h4>Domain-Dependent Policy Learning using Neural Networks for Classical Planning (1/4)</h4>
<!--more-->
<hr />
<p>I finished my undergraduate Bachelor studies last summer, and as a start to this blog I will outline the
work I did for my dissertation titled <a href="../../../../../assets/files/bsc_thesis.pdf">“Domain-Dependent Policy
Learning using Neural Networks in Classical Planning”</a>. I will split this summary over four posts, which will
mostly consist of paragraphs from my thesis, summaries of them or parts of the
<a href="../../../../../assets/files/kolloquium.pdf">colloquium presentation</a> I gave at the group seminar of
the <a href="https://fai.cs.uni-saarland.de/">Foundations of Artificial Intelligence (FAI) group</a> at Saarland
University.</p>
<p><strong>TL;DR:</strong> I transferred and applied a neural network architecture called Action Schema Networks, designed for
policy learning for probabilistic planning, to deterministic, classical planning and evaluated its performance.</p>
<hr />
<h1 id="introduction">Introduction</h1>
<p>Machine learning (ML) is a subfield of artificial intelligence (AI) which has received tremendous media
and research attention in recent years. While the two terms are frequently used as if they were synonyms, such
usage is misleading. ML specifically covers branches of AI involving some form of learning, mostly from large sets
of data, while AI is broader and more general. ML therefore aims to solve one of the main remaining challenges of
computers: their limited ability to obtain and rationally apply knowledge. This limitation is arguably the main reason why humans
are still superior to computers in many tasks despite the steadily increasing computational power of the latter.</p>
<p>The success of ML can already be seen in applications such as AlphaGo <a class="citation" href="#silver:etal:nature-16">(Silver et al., 2016)</a> and AlphaGo
Zero <a class="citation" href="#silver:etal:arxiv-17">(Silver et al., 2017)</a>, Google DeepMind’s agents capable of beating human professionals at the
Chinese board game Go. This achievement was assumed to still be decades away due to the game’s computational
complexity, which stems from its <script type="math/tex">\approx 10^{170}</script> different board states
<a class="citation" href="#muller:ComputerGo-02">(Müller, 2002; Tromp & Farnebäck, 2006)</a>. For comparison, it is assumed that the observable universe
has about <script type="math/tex">10^{80}</script> atoms.
At the core of these programs were neural networks, which are often referred to under the title of deep learning,
combined with algorithms from or at least related to automated planning, another field of AI with the big goal of
creating an intelligent agent capable of efficiently solving (almost) arbitrary problems. While this sounds like a
vision far in the future, modern planning systems are already able to solve a wide variety of tasks, e.g.
complex scheduling tasks.</p>
<p>However, it might be surprising that planning has seen little interaction with the field of machine learning despite
its rise in popularity. Only recently have these two fields been combined, with mixed success.
One of these combined approaches is the work of Toyer et al. from the Australian National University, who
recently proposed a new neural network structure designed for application in probabilistic and classical automated
planning, called Action Schema Networks (ASNets) <a class="citation" href="#toyer:thesis:17">(Toyer, 2017; Toyer, Trevizan, Thiébaux, & Xie, 2017)</a>. These are able to learn
domain-specific knowledge in planning and apply it to unseen problems of the same domain. The promising structure
was primarily introduced and evaluated with respect to probabilistic planning.</p>
<p>Therefore, the goal of my dissertation was to evaluate the performance Action Schema Networks can achieve in
classical planning.
The main contribution of the thesis was the implementation of this novel neural network structure in the
Fast-Downward planning system <a class="citation" href="#helmert:fd:06">(Helmert, 2006)</a> for application in deterministic, classical planning, together with the
necessary extensions to the framework. In addition, an extensive empirical evaluation was conducted to assess ASNets on
multiple tasks of varying complexity. This evaluation considered different configurations in order to determine
whether Action Schema Networks are a suitable method for classical planning and, if so, under which conditions,
with the aim of learning complex relations occurring in planning tasks.</p>
<h1 id="background">Background</h1>
<h2 id="classical-automated-planning">Classical Automated Planning</h2>
<p>Classical automated planning focuses on finite, deterministic, fully-observable problems solved by a single agent.
The predominant formalisation for planning tasks is <em>STRIPS</em> <a class="citation" href="#fikes:nilsson:ai-71">(Fikes & Nilsson, 1971)</a> representing such a task as <script type="math/tex">\Pi = (\mathcal{P}, \mathcal{A}, c, I, G)</script>:</p>
<ul>
<li>
<p><script type="math/tex">\mathcal{P}</script> is a set of <em>propositions</em> (or facts)</p>
</li>
<li>
<p><script type="math/tex">\mathcal{A}</script> is a set of <em>actions</em> where each action <script type="math/tex">a \in \mathcal{A}</script> is a triple
<script type="math/tex">(pre_a , add_a , del_a)</script> with <script type="math/tex">pre_a, add_a , del_a \subseteq \mathcal{P}</script> including a’s preconditions,
add list and delete list with <script type="math/tex">add_a \cap del_a = \emptyset</script></p>
<ul>
<li>
<p>preconditions are facts, which have to be true for <script type="math/tex">a</script> to be applicable</p>
</li>
<li>
<p>add list contains all propositions becoming true after applying <script type="math/tex">a</script></p>
</li>
<li>
<p>delete list contains all propositions becoming false after applying <script type="math/tex">a</script></p>
</li>
</ul>
</li>
<li>
<p><script type="math/tex">c : \mathcal{A} \rightarrow \mathbb{R}^+_0</script> is the <em>cost function</em> assigning each action its cost</p>
</li>
<li>
<p><script type="math/tex">I \subseteq \mathcal{P}</script> is the <em>initial state</em> containing all propositions, which are true at the start of
the task</p>
</li>
<li>
<p><script type="math/tex">G \subseteq \mathcal{P}</script> is the <em>goal</em> with all facts which have to become true to solve the task</p>
</li>
</ul>
<h3 id="brief-example">Brief Example</h3>
<center>
<img src="../../../../../assets/img/posts_images/bsc_thesis/transport_planning_task.png" alt="Transport planning-task" width="60%" />
</center>
<p>For example, one could describe a transportation task, as illustrated above, in which a truck has to deliver a
package from some location to its destination by driving along streets and loading or unloading packages. In this concrete
example, the truck has to drive from London (L) to Manchester (M), pick up the package, drive to Edinburgh (E) and
unload the package there. The task would include the propositions <script type="math/tex">\mathcal{P} = \{at(o, x) \mid o \in \{truck, package\},</script>
<script type="math/tex">x \in \{L, M, G, E\}\}</script> and actions <script type="math/tex">\mathcal{A} = \{drive(x, y, z) \mid x \in \{truck\},</script>
<script type="math/tex">y, z \in \{L, M, G, E\},</script> <script type="math/tex">y \text{ and } z \text{ are connected}\}</script> <script type="math/tex">\cup</script>
<script type="math/tex">\{load(x, y, z), unload(x, y, z) \mid x \in \{truck\},</script> <script type="math/tex">y \in \{package\}, z \in \{L, M, G, E\}\}</script>.
The goal could be formalised as <script type="math/tex">\{at(package, E)\}</script> and the initial state describes the starting position
of the truck and package as <script type="math/tex">\{at(truck, L), at(package, M)\}</script>.</p>
<p>To solve any task <script type="math/tex">\Pi</script> in automated planning, the planner has to observe the current state and choose actions,
one at a time, in order to reach a goal state <script type="math/tex">s^*</script> with <script type="math/tex">G \subseteq s^*</script>. The sequence of actions applied
to get to such a state is called a <em>plan</em> for <script type="math/tex">\Pi</script>. A plan is considered optimal if it has the least cost out of
all plans reaching a goal. For example, the optimal plan for our transport task would be
<script type="math/tex">\langle drive(truck, L, M), load(truck, package, M), drive(truck, M, E), unload(truck, package, E) \rangle</script>.</p>
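<p>The STRIPS semantics and the plan above can be illustrated with a short Python sketch. Two things are assumptions and not part of the formalisation in this post: all actions have unit cost, and a fact <code>in(package, truck)</code> is introduced to model the loaded package.</p>

```python
from typing import FrozenSet, NamedTuple


class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]
    cost: float = 1.0  # unit cost is an assumption


def fs(*facts: str) -> FrozenSet[str]:
    return frozenset(facts)


def apply(state: FrozenSet[str], action: Action) -> FrozenSet[str]:
    """Apply an action: remove the delete list, then add the add list."""
    if not action.pre.issubset(state):
        raise ValueError(f"{action.name} is not applicable")
    return (state - action.delete) | action.add


def run_plan(initial, plan, goal):
    """Execute a plan and report whether the goal holds and the total cost."""
    state, cost = initial, 0.0
    for action in plan:
        state = apply(state, action)
        cost += action.cost
    return goal.issubset(state), cost


drive_LM = Action("drive(truck, L, M)", fs("at(truck, L)"),
                  fs("at(truck, M)"), fs("at(truck, L)"))
load_M = Action("load(truck, package, M)",
                fs("at(truck, M)", "at(package, M)"),
                fs("in(package, truck)"), fs("at(package, M)"))
drive_ME = Action("drive(truck, M, E)", fs("at(truck, M)"),
                  fs("at(truck, E)"), fs("at(truck, M)"))
unload_E = Action("unload(truck, package, E)",
                  fs("at(truck, E)", "in(package, truck)"),
                  fs("at(package, E)"), fs("in(package, truck)"))

initial = fs("at(truck, L)", "at(package, M)")
goal = fs("at(package, E)")
solved, cost = run_plan(initial, [drive_LM, load_M, drive_ME, unload_E], goal)
```

<p>Under these assumptions the four-step plan reaches the goal, while trying to apply <code>unload_E</code> in the initial state fails because its preconditions do not hold.</p>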
<h3 id="modelling---pddl">Modelling - PDDL</h3>
<p>One essential component of planning is to model the task at hand. This process is usually split into two components:
the <em>domain</em> and the <em>problem</em>. This separation has its origin in the main modelling language for planning, <em>PDDL</em>
(Planning Domain Definition Language), introduced by McDermott et al. <a class="citation" href="#pddl-handbook">(McDermott & others, 1998)</a>. A domain describes a
whole family of problems sharing the same core idea. It contains predicates defined on abstract objects as well
as action schemas. Each problem instance is assigned a domain, which predefines these elements. In the
problem file, concrete objects are defined, instantiating the predicates and action schemas of the domain to
propositions and actions respectively. Furthermore, the initial and goal states are specified.</p>
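<p>The instantiation step can be sketched in Python as a plain enumeration of all parameter combinations. This is a naive illustration with hypothetical names; actual planners additionally filter out instantiations that can never become applicable, e.g. drives between unconnected locations.</p>

```python
from itertools import product


def ground(schema_name, param_types, objects_by_type):
    """Instantiate an action schema with all type-correct object tuples."""
    pools = [objects_by_type[t] for t in param_types]
    return [f"{schema_name}({', '.join(args)})" for args in product(*pools)]


objects = {"truck": ["truck"], "location": ["L", "M"]}
drive_actions = ground("drive", ["truck", "location", "location"], objects)
```

<p>With one truck and two locations this yields four ground <code>drive</code> actions, including the useless self-loops from a location to itself.</p>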
<p>Most of these planning problems seem conceptually easy for rational-thinking humans, but this impression can be
misleading. In fact, planning is computationally extremely difficult. Merely deciding whether a task is solvable is
already PSPACE-complete <a class="citation" href="#bylander:ai-94">(Bylander, 1994)</a>.</p>
<h2 id="deep-learning">Deep Learning</h2>
<h3 id="foundation">Foundation</h3>
<p>The idea of neural networks (NNs) has a long history reaching back to the 1940s <a class="citation" href="#mcculloch:pitts:86">(McCulloch & Pitts, 1986)</a>.
It was inspired by the human brain, whose impressive capabilities are partly due to the dense connectivity of its
neurons. The introduction of the perceptron, which was capable of learning, by F. Rosenblatt in 1958
<a class="citation" href="#rosenblatt:58">(Rosenblatt, 1958)</a> and of backpropagation by Rumelhart et al. in 1986 <a class="citation" href="#Rumelhart:1988:LRB:65669.104451">(Rumelhart, Hinton, & Williams, 1988)</a>
laid the foundation for modern NNs.</p>
<p>The simplest modern NN architecture is the <em>fully-connected feedforward network</em> or <em>multi-layer perceptron</em> (MLP)
as illustrated below. Like (almost) all neural network architectures, it consists of several layers containing
nodes or units which are successively connected with each other. One usually distinguishes an <em>input layer</em> (yellow),
<em>hidden layers</em> (blue) and an <em>output layer</em> (green). Each connection between two nodes is weighted with a parameter.</p>
<center>
<img src="../../../../../assets/img/posts_images/bsc_thesis/MLP.png" alt="Multi-Layer-Perceptron" width="60%" />
</center>
<p>A node usually computes a fairly simple mathematical operation: it applies the weights to the outputs
of the previous layer, adds a bias and finally applies a nonlinear function <script type="math/tex">f</script>, often called the
<em>activation function</em>. The output of the <script type="math/tex">l</script>-th layer <script type="math/tex">h^l</script> can therefore be computed as follows:</p>
<script type="math/tex; mode=display">h^l = f(W^l \cdot h^{l-1} + b^l)</script>
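<p>As a minimal sketch of this formula, the following pure-Python function computes the output of one fully-connected layer. The weights, biases, input and the choice of ReLU as activation function are arbitrary values picked for illustration.</p>

```python
from typing import Callable, List


def relu(z: float) -> float:
    """A common activation function: max(0, z)."""
    return max(0.0, z)


def layer_forward(W: List[List[float]], b: List[float], h_prev: List[float],
                  f: Callable[[float], float] = relu) -> List[float]:
    """Compute h^l = f(W^l . h^{l-1} + b^l); each row of W belongs to one node."""
    return [f(sum(w * h for w, h in zip(row, h_prev)) + bias)
            for row, bias in zip(W, b)]


# Hypothetical layer with two inputs and two nodes.
W1 = [[1.0, -1.0],
      [0.5, 0.5]]
b1 = [0.0, -1.0]
x = [2.0, 1.0]

h1 = layer_forward(W1, b1, x)
print(h1)  # [1.0, 0.5]
```

<p>Stacking several such calls, each taking the previous output as <code>h_prev</code>, gives the forward pass of a whole MLP.</p>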
<p>The weights and bias vectors together form the parameters, the “learned intelligence”, of these networks and are
gradually optimised with respect to some objective. This objective is usually represented by a <em>loss function</em>
<script type="math/tex">L(\hat{y}, y)</script> depending on the true output <script type="math/tex">y</script> and the prediction <script type="math/tex">\hat{y}</script> computed by the network
for a given input <script type="math/tex">x</script>. This form of training is called <em>supervised learning</em> and depends on labelled training
data containing inputs as well as their expected outputs.
The optimisation is usually achieved by <em>gradient descent</em>, in which all parameters, denoted as <script type="math/tex">\theta</script>,
are updated in the direction of the steepest descent of the loss function <script type="math/tex">L</script> with some step size <script type="math/tex">\alpha</script>,
called the <em>learning rate</em>:</p>
<script type="math/tex; mode=display">\theta = \theta - \alpha \nabla_\theta L(\hat{y}, y)</script>
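<p>The update rule can be tried out on a toy problem. The sketch below assumes a model with a single parameter, <script type="math/tex">\hat{y} = \theta x</script>, and a squared-error loss <script type="math/tex">L(\hat{y}, y) = (\hat{y} - y)^2</script>; the training example and learning rate are made up for illustration.</p>

```python
def loss(theta: float, x: float, y: float) -> float:
    """Squared error between prediction theta * x and target y."""
    return (theta * x - y) ** 2


def grad(theta: float, x: float, y: float) -> float:
    """Derivative of the loss with respect to theta."""
    return 2.0 * (theta * x - y) * x


def gradient_descent(theta: float, x: float, y: float,
                     alpha: float, steps: int) -> float:
    """Repeatedly apply theta = theta - alpha * gradient."""
    for _ in range(steps):
        theta = theta - alpha * grad(theta, x, y)
    return theta


# One labelled example (x, y) = (2, 6); the loss is minimal at theta = 3.
theta = gradient_descent(theta=0.0, x=2.0, y=6.0, alpha=0.05, steps=100)
print(round(theta, 6))
```

<p>With these values each step shrinks the distance to the optimum by a constant factor; a learning rate that is too large would make the updates overshoot and diverge instead.</p>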
<h3 id="convolutional-neural-networks">Convolutional Neural Networks</h3>
<p>Over the last few years, many different network architectures have evolved for specific applications.
<em>Recurrent neural networks</em> (RNNs) are capable of considering past knowledge and decisions and therefore incorporate
something similar to a memory. This property has proven especially valuable for processing language,
reaching state-of-the-art results in tasks such as machine translation <a class="citation" href="#bahdanau2014neural">(Bahdanau, Cho, & Bengio, 2014; Cho et al., 2014)</a> and
speech recognition <a class="citation" href="#graves2013speech">(Graves, Mohamed, & Hinton, 2013)</a>.</p>
<p>Similarly, <em>convolutional neural networks</em> (CNNs) became the standard for many multi-dimensional, mostly visual,
input tasks, reaching previously unseen accuracy in e.g. image classification <a class="citation" href="#krizhevsky:etal:imagenet:12">(Krizhevsky, Sutskever, & Hinton, 2012)</a>.
The main characteristic of CNNs is the application of the mathematical <em>convolution operation</em>. This linear
operation replaces the typical matrix multiplication known from MLPs, where every unit in each layer has a
weighted connection to every node in the successive layer. In convolution, smaller weight matrices, called <em>filters</em>
or <em>kernels</em>, are applied by “sliding” them over the units in one layer, each applying its operation to a set of neighbouring inputs. This form of processing brings multiple advantages.</p>
<p>Because filters are usually small, CNNs have <em>sparse connectivity</em>, only combining neighbouring units in
one operation. This exploits local properties to extract input features like edges in visual domains, which is
especially meaningful in deep CNNs. While shallow layers detect e.g. edges or shapes in an input image, filters
in deeper layers can build upon these features and detect increasingly abstract objects like cars and humans.</p>
<center>
<img src="../../../../../assets/img/posts_images/bsc_thesis/cnn_car.jpg" alt="Convolutional Neural Network Application" width="60%" />
<br />
<br />
<i>Source: <a href="https://www.mathworks.com/discovery/convolutional-neural-network.html">https://www.mathworks.com/discovery/convolutional-neural-network.html</a> (03.12.2018)</i>
</center>
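<p>The “sliding” of a filter and its use as an edge detector can be sketched as follows. The tiny image and the filter values are made up for illustration, and the function computes a cross-correlation (no kernel flip), which is what deep learning libraries commonly implement under the name convolution.</p>

```python
from typing import List

Matrix = List[List[float]]


def conv2d(image: Matrix, kernel: Matrix) -> Matrix:
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(kernel[i][j] * image[r + i][c + j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


# Dark left half (0), bright right half (1): a vertical edge in the middle.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]

# Responds where brightness increases from one pixel to its right neighbour.
edge_filter = [[-1, 1]]

feature_map = conv2d(image, edge_filter)
print(feature_map)  # [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
```

<p>Each output value only depends on two neighbouring pixels, and the same two filter weights are applied at every position of the image.</p>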
<p>Additionally, filters are reused repeatedly during the “sliding”, applying their operation to (partially)
different input units. This form of <em>weight sharing</em> significantly reduces the number of parameters needed.
Hence, the memory requirements of the networks are lowered, making them more efficient than fully-connected
NNs: because fewer parameters have to be learned, the required training time and data can be reduced.</p>
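<p>A rough parameter count illustrates the effect of weight sharing. The layer sizes below (a 32x32 RGB input mapped to an output of the same size, and 3x3 filters) are hypothetical and only chosen to make the difference visible.</p>

```python
def dense_params(n_in: int, n_out: int) -> int:
    """One weight per input-output pair plus one bias per output unit."""
    return n_in * n_out + n_out


def conv_params(kernel_h: int, kernel_w: int,
                in_channels: int, out_channels: int) -> int:
    """One small filter per output channel, shared across all positions."""
    return kernel_h * kernel_w * in_channels * out_channels + out_channels


n_units = 32 * 32 * 3  # 3072 values in a 32x32 RGB image

dense = dense_params(n_units, n_units)
conv = conv_params(3, 3, in_channels=3, out_channels=3)

print(dense)  # 9440256
print(conv)   # 84
```

<p>The convolutional layer's parameter count is independent of the image size, whereas the fully-connected layer grows with the product of input and output sizes.</p>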
<hr />
<p>In the next post, I will outline ASNets as a network structure before explaining my adjustments for their application
to classical planning as well as the results of the evaluation in the third and fourth parts.</p>
<div style="text-align: right"> <a href="/ai/2019/05/05/Thesis-2.html"> Part 2 </a> >>> </div>