Every SEAS student hits this wall
This was the final project for CSCI 6366, Neural Networks and Deep Learning, at GWU's School of Engineering and Applied Science, built with my classmate Anurag Dhungana. We had a semester to build something real using what we'd learned, and we wanted to pick a problem we actually felt.
The problem we picked was one every SEAS student has hit: trying to figure out which courses you need before you can take the one you actually want. The official GWU systems are fragmented. The course bulletin is a PDF, the schedule is a separate portal, and prerequisite chains require you to manually trace through multiple pages to understand a dependency you could follow in 10 seconds if it were properly modeled.
We wanted to build something that could answer the question naturally: "What do I need to take before CSCI 6364?"
LLMs hallucinate prerequisites. Confidently.
The obvious first move was fine-tuning a language model on course data. We did that. It worked for simple questions. But it completely broke on anything that required tracing a chain. "Can I take X if I've completed Y and Z?" Language models don't naturally reason about structured relationships. They pattern-match. They hallucinate. They'd give you a confident wrong answer about prerequisites that didn't exist.
The core insight that reframed the whole project: course relationships aren't a text problem, they're a graph problem. Prerequisites form a directed acyclic graph. If you want to answer multi-hop questions correctly, you need something that can traverse that structure, not just retrieve text that sounds relevant.
That's what pushed us toward building a Knowledge Graph.
A knowledge graph that does the reasoning first
We scraped and structured data from two sources: the GWU CSCI and DATS course bulletin (187 courses) and the Spring 2026 course schedule (586 instances). From that, we built a Knowledge Graph using NetworkX where nodes are courses and edges are prerequisite relationships. spaCy handled the entity extraction from the bulletin text to identify course codes, relationships, and topic clusters.
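The construction step can be sketched with NetworkX. The course codes and edges below are illustrative stand-ins, not our actual bulletin data:

```python
import networkx as nx

# (prerequisite, course) pairs as extracted from bulletin text;
# an edge A -> B means "A must be taken before B".
prereq_pairs = [
    ("CSCI 1311", "CSCI 2113"),
    ("CSCI 2113", "CSCI 6212"),
    ("CSCI 6212", "CSCI 6364"),
]

G = nx.DiGraph()
for prereq, course in prereq_pairs:
    G.add_edge(prereq, course, relation="prerequisite_for")

# Prerequisites should form a DAG; a cycle would signal a scraping error.
assert nx.is_directed_acyclic_graph(G)

# The full prerequisite chain for a course is the set of its ancestors.
chain = nx.ancestors(G, "CSCI 6364")
print(sorted(chain))  # ['CSCI 1311', 'CSCI 2113', 'CSCI 6212']
```

Once the relationships are edges rather than sentences, a multi-hop chain is a one-line graph query instead of a retrieval problem.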
The Knowledge Graph became the backbone for our QA system. At query time, instead of just feeding the question to the model, we'd first traverse the graph to resolve any prerequisite chains or course relationships in the question, then pass that structured context to the language model to generate a natural language answer. The model's job changed from "figure out what the prerequisites are" to "explain the prerequisites I've already retrieved correctly."
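The query-time flow looks roughly like this. The function name, prompt wording, and course codes are hypothetical, a minimal sketch of the traverse-then-generate pattern rather than our production code:

```python
import networkx as nx

def build_graph_context(G: nx.DiGraph, course: str) -> str:
    """Resolve the full prerequisite chain before the model sees the question."""
    chain = nx.ancestors(G, course)
    # Order the chain so earlier courses come first.
    ordered = [c for c in nx.topological_sort(G) if c in chain]
    return f"Prerequisite chain for {course}: {' -> '.join(ordered)}"

G = nx.DiGraph([("CSCI 2113", "CSCI 6212"), ("CSCI 6212", "CSCI 6364")])
context = build_graph_context(G, "CSCI 6364")

# The LLM's job is now explanation, not retrieval: it answers from
# structured context the graph has already resolved.
prompt = (
    f"Context (retrieved from the knowledge graph):\n{context}\n\n"
    "Question: What do I need to take before CSCI 6364?\n"
    "Answer using only the context above."
)
print(context)  # Prerequisite chain for CSCI 6364: CSCI 2113 -> CSCI 6212
```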
For the language model itself, we fine-tuned Llama 3.1 8B using LoRA adapters via Unsloth. We generated 2,828 training Q&A pairs from the course data: a mix of simple factual questions (who teaches this, when does it meet) and complex multi-hop questions (prerequisite chains, co-requisite planning, conditional enrollment). LoRA meant we weren't fine-tuning all 8 billion parameters. We were adapting a small set of low-rank matrices that plugged into the existing model. This made training feasible on academic resources: significantly less compute and time than full fine-tuning would have required.
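The parameter savings are easy to check on the back of an envelope. For a low-rank adapter of rank r on a d_in × d_out weight matrix, LoRA trains r·(d_in + d_out) parameters instead of d_in·d_out. The rank below is an illustrative choice, not necessarily what we used:

```python
# Rank-16 adapter on a 4096x4096 attention projection (Llama-class hidden size).
d_in, d_out, r = 4096, 4096, 16

full = d_in * d_out        # params updated by full fine-tuning of this matrix
lora = r * (d_in + d_out)  # params in the low-rank A (d_in x r) and B (r x d_out) pair

print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.3%}")
# full: 16,777,216  lora: 131,072  ratio: 0.781%
```

That sub-1% ratio per adapted matrix is what makes training an 8B model feasible on academic hardware.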
Four attempts, four lessons
We didn't get to the Knowledge Graph approach immediately. We went through four approaches across the semester.
The first two were baseline experiments: simple fine-tuning on the course data without any graph augmentation. These handled factual lookups reasonably well but failed on anything requiring relational reasoning. Accuracy on multi-hop questions was around 26%.
The third approach introduced the Knowledge Graph but only as a retrieval mechanism. We'd pull relevant graph context and inject it into the prompt, but the model wasn't trained with graph-augmented examples. Better, but inconsistent.
The fourth approach, training the model with KG-augmented examples so it learned to use the graph context rather than ignore it, is where we hit 34-38% accuracy on multi-hop reasoning questions. That gap between 26% and 38% was the whole point of the Knowledge Graph.
Why 38% is a real result
34-38% might sound low. In the context of multi-hop reasoning on domain-specific academic data with a relatively small training set, it's a meaningful result. The questions we were evaluating on weren't "who teaches CSCI 6212." They were things like: "if I've completed CSCI 6212 and CSCI 6221, what are all the graduate-level AI courses I'm now eligible to take, and which of those have the fewest remaining prerequisites?" That's genuinely hard for any system.
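The graph side of that eligibility question is mechanical once the structure exists. A sketch, with illustrative course codes and a deliberately tiny graph:

```python
import networkx as nx

# prereq -> course edges (illustrative, not the real catalog)
G = nx.DiGraph([
    ("CSCI 6212", "CSCI 6364"),
    ("CSCI 6221", "CSCI 6364"),
    ("CSCI 6212", "CSCI 6511"),
    ("CSCI 6511", "CSCI 6515"),
])

completed = {"CSCI 6212", "CSCI 6221"}

def eligible(G, completed):
    # A course is eligible if every direct prerequisite is completed.
    return {
        c for c in G.nodes
        if c not in completed and all(p in completed for p in G.predecessors(c))
    }

print(sorted(eligible(G, completed)))  # ['CSCI 6364', 'CSCI 6511']
```

The hard part, and where the 34-38% comes from, is the model correctly mapping a student's natural-language question onto this kind of traversal and explaining the result without drifting.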
The direction matters as much as the absolute number. Each approach in our progression got better, and the Knowledge Graph augmentation was clearly responsible for the improvement. That's a finding worth documenting.
On HuggingFace, not behind a paywall
Both fine-tuned models are on HuggingFace: the KG-QA system (which requires loading the knowledge graph files alongside the model weights) and the simpler fine-tuned version for straightforward factual queries. We built a full Next.js frontend with an interactive knowledge graph visualization, training metrics, methodology walkthrough, and a chat interface. The chat inference isn't hosted (running a Llama 3.1 8B model requires GPU resources we don't have free access to), but the models are public and runnable locally.
Designing for a system that's sometimes wrong
Building the frontend raised a design problem that the model work didn't: how do you build an interface for a system that's partially correct, where the stakes of being wrong are real? A student asking about prerequisites might use the answer to decide what to register for. If the system is confidently wrong, that's a problem. If it shows appropriate uncertainty, students can calibrate how much to trust it.
The choice to use a chat interface rather than a search box was deliberate. Search implies you're querying a database of known facts. Chat implies you're talking to something that reasons. That framing is actually more honest for this system: it does reason, just not always correctly. More practically, chat lets students ask in the natural language they already use. "What do I need before 6364?" is a question a student would ask an advisor. Forcing that into a structured query form would add cognitive overhead for no benefit. The conversation format lowers the barrier to asking follow-up questions, which matters when prerequisite chains are multi-step.
The interface distinguishes between two modes of response: simple factual lookups (who teaches this course, when does it meet) and complex graph traversal queries (prerequisite chains, conditional enrollment, multi-hop reasoning). These get different visual treatment in the output card. A factual lookup shows a compact answer. A graph traversal query shows which part of the knowledge graph was traversed and what relationships were resolved before the language model generated its answer. This is communicating model confidence through design rather than burying a probability score in an API response. The student can see whether the system found a clean graph path or whether it was working from partial information.
The knowledge graph visualization is an interactive D3-rendered network where nodes are courses and edges are prerequisites, zoomable and explorable. It's not there purely for aesthetics or to show off the underlying structure. It's a trust-building mechanism. When a student can look at the graph and trace the prerequisite path themselves, they can verify what the system told them. The system's answer becomes checkable, not just a black box response. That changes the relationship between the user and the AI output. Instead of "do I trust this?", the question becomes "does this match what I can see in the graph?" That's a much healthier position for a tool that will sometimes be wrong.
For anything touching graduation requirements or prerequisite chains, the output card includes an active prompt to confirm with an academic advisor. Not a disclaimer tucked at the bottom in small text, but a visible element in the response itself, at the same level as the answer. The design intent was that the verification nudge should be as prominent as the answer, because the cost of acting on a wrong prerequisite chain is high enough that the interface should normalize checking rather than treating it as an afterthought.
Rate limiting and error states were designed to feel honest rather than broken. When the system hits a limit or fails to retrieve graph context, the message says so plainly. Not a generic error. Not a spinner that times out. An explanation of what the system tried and why it couldn't complete it. The goal was for the system's limits to feel like information, not like failure.
The model wasn't the problem. The data structure was.
This project was my first time working with knowledge graphs seriously, and the thing that stuck with me was how much the data representation matters. The language model was the same throughout our four approaches. The training data was similar in size. What changed was how we structured the knowledge, and that alone was the difference between a system that hallucinated prerequisites and one that could trace them correctly.
Good ML isn't just about the model. It's about building the right substrate for the model to reason on.
And that insight extends past the model into the interface. Because the data was structured as a graph, the interface could expose graph-level concepts to users. The interactive knowledge graph visualization, where students can trace prerequisite paths themselves, was only possible because the underlying data was organized that way. If the course relationships had been stored as unstructured text or a flat table, there would have been nothing to visualize. The data architecture determined what was possible in the interface.
This is the connection that I think gets missed when ML projects are treated as purely technical work. Data architecture and UX are not separate decisions made by separate people at separate times. How you structure your data determines what your interface can show. If prerequisites are edges in a graph, you can render a graph and let users navigate it. If they're sentences in a document, you can only return text. The decision made at the data modeling stage either opens or closes the design space for everything that comes after it. On this project, choosing the right data structure wasn't just what made the model work. It was what made the interface trustworthy.