This article details the experimental setup for evaluating RECKONING, a novel bi-level learning algorithm, on three diverse multi-hop logical reasoning datasets.

Evaluating Dynamic Knowledge Encoding: Experimental Setup for Multi-Hop Logical Reasoning

2025/10/24 09:15

Abstract and 1. Introduction

  2. Background

  3. Method

  4. Experiments

    4.1 Multi-hop Reasoning Performance

    4.2 Reasoning with Distractors

    4.3 Generalization to Real-World Knowledge

    4.4 Run-time Analysis

    4.5 Memorizing Knowledge

  5. Related Work

  6. Conclusion, Acknowledgements, and References

A. Dataset

B. In-context Reasoning with Distractors

C. Implementation Details

D. Adaptive Learning Rate

E. Experiments with Large Language Models

4 Experiments

Setup We conduct our experiments on three datasets focusing on multi-hop logical reasoning over natural language knowledge: ProofWriter [73], which measures the model’s ability to emulate reasoning over facts and rules expressed in natural language; CLUTRR-SG [28], which is generated from the CLUTRR [71] benchmark, a logical reasoning task that involves reasoning over family relationships between entities grounded in first-order logical proofs; and FOLIO [29], a reasoning benchmark with first-order logical reasoning problems written by expert annotators based on real-world knowledge. Each problem in these datasets requires multiple reasoning hops to answer.[1]
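To make the task shape concrete, here is a minimal, illustrative sketch of a ProofWriter-style multi-hop problem. The facts, rules, question, and field names are invented for illustration, not drawn from the datasets themselves; the point is only that answering requires chaining several rules.

```python
# Illustrative sketch of a multi-hop logical reasoning example: a set of
# natural-language facts and rules, a question, and a label. The content
# and field names are hypothetical, not taken from ProofWriter/CLUTRR/FOLIO.
example = {
    "knowledge": [
        "Anne is kind.",                              # fact
        "If someone is kind then they are nice.",     # rule (hop 1)
        "If someone is nice then they are green.",    # rule (hop 2)
    ],
    "question": "Is Anne green?",
    "label": True,
    "hops": 2,  # ProofWriter would call this the proof depth
}

def num_hops(ex):
    """Number of reasoning steps needed to derive the answer."""
    return ex["hops"]
```

Answering the question above cannot be done from any single statement; the model must compose the two rules, which is exactly the multi-hop behavior these benchmarks measure.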

We compare our method against the following baselines: (1) a fine-tuned model that performs a forward pass on only the question, without access to the knowledge (No-Facts); (2) a fine-tuned model that performs a forward pass on only the knowledge, without access to the question (No-Question); (3) a model trained using RECKONING with random knowledge that is not relevant to the questions (Random-Facts); and (4) an ICR baseline that concatenates the knowledge K with the question x in a single context and is trained with supervised learning to predict the answer (FT-ICR). The first three baselines sanity-check whether surface-level patterns in the questions and facts can be exploited to make accurate predictions. The last baseline compares RECKONING to the conventional way of reasoning with language models. Unless stated otherwise, we use the GPT-2-small [59] model (∼124M parameters) as our initialization, and we use RECKONING to refer to our method trained with the multi-task objective. We report each score as the average across three runs. For more details on the implementation, datasets, and examples, see Appendix A and Appendix C.
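The four baselines differ only in how the model's input context is assembled from the knowledge K and the question x. The sketch below is a hedged illustration of that difference, under assumed formatting conventions; the function name, distractor facts, and string layout are our own, not the paper's code.

```python
# Hypothetical sketch of input construction for each baseline. The exact
# formatting (separators, distractor pool) is an assumption for illustration.
import random

def build_input(baseline, knowledge, question):
    rng = random.Random(0)  # fixed seed so the sketch is deterministic
    if baseline == "No-Facts":       # question only, no knowledge
        return question
    if baseline == "No-Question":    # knowledge only, no question
        return " ".join(knowledge)
    if baseline == "Random-Facts":   # irrelevant facts in place of K
        distractors = ["The sky is blue.", "Cats chase mice."]
        rng.shuffle(distractors)
        return " ".join(distractors) + " " + question
    if baseline == "FT-ICR":         # concatenate K with x in one context
        return " ".join(knowledge) + " " + question
    raise ValueError(f"unknown baseline: {baseline}")

K = ["Anne is kind.", "If someone is kind then they are nice."]
x = "Is Anne nice?"
print(build_input("FT-ICR", K, x))
```

Note that for RECKONING itself the knowledge is not placed in the context at all: it is encoded into the model's parameters via the inner loop of the bi-level objective, and only the question is seen at inference time.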


:::info Authors:

(1) Zeming Chen, EPFL (zeming.chen@epfl.ch);

(2) Gail Weiss, EPFL (antoine.bosselut@epfl.ch);

(3) Eric Mitchell, Stanford University (eric.mitchell@cs.stanford.edu);

(4) Asli Celikyilmaz, Meta AI Research (aslic@meta.com);

(5) Antoine Bosselut, EPFL (antoine.bosselut@epfl.ch).

:::


:::info This paper is available on arxiv under CC BY 4.0 DEED license.

:::

[1] In ProofWriter, the number of reasoning hops is called the proof depth. To unify the presentation of the results, we use the term “hop” to describe the number of reasoning steps across datasets.
