Dec 2024 · 12 min

SEAS Search: Building a Knowledge Graph-Based Course QA System

How we built an intelligent course question-answering system using Knowledge Graphs and LLMs to solve the problem of multi-hop reasoning in academic advising.

Machine Learning · Knowledge Graphs · LLM · RAG · Neural Networks · NLP

Standard large language models like Llama 3 or GPT have no knowledge of a particular university's curriculum structure or a given semester's schedule. Ask them about courses and they hallucinate fake offerings and can't trace prerequisite chains across multiple steps. For our Neural Networks final project, my teammate Anurag Dhungana and I set out to fix this with SEAS Search, a knowledge graph-based course question-answering system.

The Core Problem: LLMs Are Blind to Reality

When you ask a general-purpose LLM about course prerequisites or scheduling, it will confidently give you wrong answers. It might invent courses like "CSCI 9999" or fail to understand that to take Machine Learning, you first need Linear Algebra, which requires Calculus. These multi-hop reasoning problems are exactly where traditional fine-tuning approaches fail.

Our Data Pipeline

We started by scraping 187 courses from the GWU bulletin and 586 course instances from the Spring schedule, focusing on the Computer Science and Data Science departments. We used regular expressions to extract prerequisites and spaCy for topic modeling, producing clean, structured data that could be transformed into a knowledge graph.
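To make the extraction step concrete, here is a minimal sketch of a regex-based prerequisite parser; the actual patterns and bulletin formatting in our pipeline differ, so treat the regexes and field names below as illustrative.

```python
import re

# Illustrative sketch only: the real bulletin text and patterns differ.
PREREQ_CLAUSE = re.compile(r"Prerequisites?:\s*(?P<body>.+?)(?:\.|$)", re.IGNORECASE)
COURSE_CODE = re.compile(r"\b(?P<dept>[A-Z]{2,4})\s?(?P<num>\d{4})\b")

def extract_prerequisites(description: str) -> list[str]:
    """Pull course codes out of a bulletin description's prerequisite clause."""
    match = PREREQ_CLAUSE.search(description)
    if not match:
        return []
    return [
        f"{m.group('dept')} {m.group('num')}"
        for m in COURSE_CODE.finditer(match.group("body"))
    ]

print(extract_prerequisites(
    "Prerequisites: CSCI 1112 and MATH 2184. Covers neural networks."
))
# ['CSCI 1112', 'MATH 2184']
```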

The Foundation: Llama 3.1 8B

Our core model is Llama 3.1 with 8 billion parameters—a decoder-only Transformer architecture optimized for generative tasks. Using 4-bit quantization and LoRA (Low-Rank Adaptation), we could fine-tune this massive model on a single GPU in under 2 hours while retaining 95% of its reasoning ability.
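For readers who want to reproduce this setup, here is a hedged sketch of loading Llama 3.1 8B in 4-bit and attaching LoRA adapters with the Hugging Face transformers and peft libraries; the rank, target modules, and other hyperparameters shown are assumptions, not our exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization so the 8B model fits on a single GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA: train small low-rank adapters instead of the full weight matrices.
# These hyperparameters are illustrative placeholders.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters are trainable
```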

Our Evolution Through Four Methodologies

Methodology 1: Synthetic Data Generation

Our first attempt used Llama 3.2 3B to auto-generate training examples from raw bulletin text. The idea was elegant: let the small model "write the textbook" for us. The problem? The 3B model was too small for this complex task—it broke JSON formatting constantly and had low diversity, producing only 34 usable pairs instead of thousands. Lesson learned: synthetic data generation needs a much larger model.
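To show why broken JSON translated directly into so few usable pairs, here is a hypothetical sketch of the filtering step; the function and field names are placeholders, not our actual code.

```python
import json

def keep_valid_pairs(raw_outputs: list[str]) -> list[dict]:
    """Keep only generations that parse as JSON Q&A pairs."""
    pairs = []
    for raw in raw_outputs:
        try:
            pair = json.loads(raw)
        except json.JSONDecodeError:
            continue  # the 3B model frequently failed at this step
        if isinstance(pair, dict) and {"question", "answer"} <= pair.keys():
            pairs.append(pair)
    return pairs
```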

Methodology 2: Standard Fine-Tuning

Next, we established a baseline by fine-tuning Llama 3.1 8B on 2,828 Q&A pairs with LoRA and 4-bit quantization. Our final loss of 0.46 looked great—until we realized the model was just memorizing the training data word-for-word. Without a validation split, we had no way to detect this overfitting. When asked novel questions, it hallucinated because it hadn't learned to reason—only to repeat.

Methodology 3: Optimized Fine-Tuning

We introduced rigorous ML engineering practices: an 80/20 train/validation split, early stopping with patience of 3, and a cosine learning rate scheduler. Our final validation loss was 0.75—higher than before, but this represented the true difficulty of the task, not memorization. The model generalized better but still couldn't reason about unseen prerequisites.
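Here is a sketch of that setup with the Hugging Face Trainer, assuming `dataset` and `model` carry over from the loading step above and are already tokenized; the batch size, learning rate, and epoch count are illustrative, and the eval-strategy argument name varies slightly across transformers versions.

```python
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

# 80/20 train/validation split.
split = dataset.train_test_split(test_size=0.2, seed=42)

args = TrainingArguments(
    output_dir="seas-search-ft",
    num_train_epochs=5,
    per_device_train_batch_size=4,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",        # cosine learning rate schedule
    eval_strategy="epoch",             # older versions: evaluation_strategy
    save_strategy="epoch",
    load_best_model_at_end=True,       # required for early stopping
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```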

Methodology 4: Knowledge Graph-Based QA

Our breakthrough came when we stopped trying to make the model memorize facts and instead taught it to read a map. We built a knowledge graph with 489 nodes and 566 edges—courses, instructors, and topics connected by relationships like "requires," "taught_by," and "covers."
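A minimal sketch of the graph construction with NetworkX, assuming `courses` is the list of structured records from the scraping step; the field names are illustrative.

```python
import networkx as nx

G = nx.DiGraph()

for course in courses:
    G.add_node(course["code"], type="course", title=course["title"])
    for prereq in course["prerequisites"]:
        G.add_edge(course["code"], prereq, relation="requires")
    for instructor in course["instructors"]:
        G.add_node(instructor, type="instructor")
        G.add_edge(course["code"], instructor, relation="taught_by")
    for topic in course["topics"]:
        G.add_node(topic, type="topic")
        G.add_edge(course["code"], topic, relation="covers")

print(G.number_of_nodes(), G.number_of_edges())
```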

The pipeline works in three steps: First, we extract entities from the user query. Second, we retrieve a subgraph within 2 hops using NetworkX. Third, we format that subgraph as natural language context and feed it to Llama. The key insight: the model doesn't have to guess—it just looks up the answer in the provided context.
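Here is a hedged sketch of that three-step pipeline, building on the graph from the previous snippet; the exact-match entity extraction and prompt wording are simplifications of the real system.

```python
import networkx as nx

def retrieve_context(G: nx.DiGraph, query: str, hops: int = 2) -> str:
    # 1. Entity extraction: find graph nodes mentioned in the query
    #    (simplified here to case-insensitive substring matching).
    entities = [n for n in G.nodes if str(n).lower() in query.lower()]

    # 2. Retrieve the subgraph within `hops` of every matched entity.
    nodes = set()
    for entity in entities:
        nodes |= set(nx.ego_graph(G, entity, radius=hops, undirected=True).nodes)
    subgraph = G.subgraph(nodes)

    # 3. Format the edges as natural-language facts for the LLM prompt.
    facts = [f"{u} {data['relation']} {v}" for u, v, data in subgraph.edges(data=True)]
    return "\n".join(facts)

context = retrieve_context(G, "What do I need before taking CSCI 6364?")
prompt = f"Answer using only these facts:\n{context}\n\nQuestion: ..."
```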

Results That Changed Our Thinking

Standard fine-tuning achieved 26% accuracy, while optimized fine-tuning reached 38%. Our KG-based approach hit 34%—but here's the critical point: it was tested on significantly harder multi-hop reasoning tasks. The real win was efficiency: KG-based training took 4.17 minutes versus 108 minutes for optimized fine-tuning—a 25x speedup.

Why so fast? We weren't teaching the model new facts; we were teaching it how to format answers using structured graph context. The high-quality KG context reduced learning complexity dramatically. Our final loss of 0.30 was the lowest of all approaches, suggesting high confidence in generated answers.

Key Learnings

Structure beats scale. 195 graph examples outperformed 2,828 standard examples. The knowledge graph provides explicit relationships that the model can follow, rather than forcing it to infer patterns from text.

Validation splits are critical. Without them, low training loss is meaningless—you're measuring memorization, not learning.

RAG retrieves text, KG-RAG retrieves structure. Structure (edges) is better for reasoning chains, prerequisites, and intersections. The model cannot hallucinate what doesn't exist because its context window only contains real edges from the graph.

Future Directions

SEAS Search currently covers only 187 CS and Data Science courses; we plan to expand to 500+ courses across MATH, ECE, and STAT. We also want to add degree requirements and time conflict detection, build a hybrid model combining fine-tuning with graphs, and ultimately deploy this as a real advising tool.

This approach generalizes to any structured domain—legal, medical, enterprise databases. Anywhere you have entities with relationships, a knowledge graph can provide the grounding that LLMs desperately need.

Try It Yourself

The live project is available at seas-search.vercel.app, and the source code is on GitHub. We built this on an NVIDIA A100 GPU with 80GB VRAM, but the efficiency of our approach means you can experiment with smaller hardware too.