Dev.Sec.Lead

Unpacking RAG's Role in Cybersecurity: A Deep Dive with Brennan Lodge

April 15, 2024 Wilson Bautista Jr. Season 3 Episode 3

Welcome to Dev.Sec.Lead, where today we dove deep into the transformative role of Retrieval Augmented Generation, or RAG, in cybersecurity. Joining us was the brilliant Brennan Lodge, who illuminated how this tech can serve as an AI assistant, enhancing accuracy and combating AI hallucinations. Crucially, we touched on RAG's potential to upskill analysts, integrate threat intelligence, and significantly improve decision-making processes by mapping alerts to MITRE ATT&CK techniques. Despite challenges, it promises a more secure future. Stay vigilant, stay informed, and remember, AI is an aid, not a replacement for human ingenuity. Don't miss out—like, subscribe, and bring your friends into the fold of Dev.Sec.Lead for more episodes on cutting-edge cybersecurity!


Okay. I've got my guest here, Brennan Lodge. We met at the Sunshine Cyber Security Conference here in Tampa, Florida, where he talked about something pretty awesome, and we wanted to bring him on the show today to talk about retrieval augmented generation, AI, and how that impacts cybersecurity. So, hey, Brennan, thanks for coming on the show today. Yeah, pleasure. Thanks, Wilson. So, AI. Everybody's talking about AI, right? It's supposedly the secret sauce for a lot of different things. Everybody knows what ChatGPT is, and they're all talking about generative AI that can give answers based off of a huge dataset. But what makes retrieval augmented generation different from what we see with ChatGPT? Yeah, so the hype is real. We're in the hype cycle of AI, and you've got the extremes of AI saving the world or AI ending the world. And in between we've got cybersecurity, hopefully more on the safe side. Red team, blue team. There you go. So, RAG. I was on an R&D team at the last bank I worked at, and of course the question came to our team from the top down: how are we going to use AI? Is it hype? Are there realistic implementations that can help us? And this is the theme I've been preaching over and over, the realistic implications of it, which is that AI for us is a guide. I don't see it as a fully automated guard just yet, given some of the white papers I've read. So, getting closer to RAG: I started with what is going wrong with AI. Why is it bad, inaccurate, or prone to failure? And the big thing there was hallucinations, bad output. We've all seen it, right, on ChatGPT or some other generative AI offering.
And I started reading about retrieval augmented generation as a combatant, really, or an assistant, to the raw LLM or generative AI response. And yeah, I went down the rabbit hole and got lucky, rolling the dice on which research technique or tactic with generative AI to pursue. And it's been really powerful, that being retrieval augmented generation. I think one of the things I see with AI, especially with ChatGPT, and this is where I think RAG is really beneficial, is that with regular ChatGPT, generative AI, you don't get to see the sources. And that's really important, because is this garbage, or is this actually fact? And with RAG, we're able to take a dataset, say PDFs in a knowledge base, and then it's able to cite those PDFs. That's the extent of how I've used it. I'm really interested in how we can look at it from a cybersecurity standpoint. How does that work? You've got experience with the SOC, and we've been talking about GRC. What are you thinking? How does RAG enhance cybersecurity efforts? Yeah, so maybe we paint the picture with an analogy I've been using for RAG to set the stage. With retrieval augmented generation there's a trifecta of integration points: a sentence embedding model, a vector database, and an LLM. And you have a prompt and an answer, prompt at the beginning, answer at the end. The sentence embedding model translates your English, or whatever language, into machine language, embeddings, within a vector database. You parse out those PDFs into a schema and upsert them into the vector database. Vector databases on their own are awesome tech for semantic search, but we can talk about them later.
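To make the trifecta concrete, here is a minimal sketch of the retrieve-then-prompt flow Brennan describes. The "embedding" is a toy bag-of-words vector and the "vector database" is a plain list, both stand-ins for a real sentence-embedding model and vector store; the document texts and citation names are made up for illustration.

```python
# Minimal sketch of the RAG "trifecta": embed, retrieve, then prompt the LLM.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'sentence embedding': a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# "Upsert" parsed PDF chunks into our stand-in vector store, keeping the citation.
vector_store = [
    {"source": "cozy-bear-report.pdf#p3", "text": "Cozy Bear uses spearphishing links for initial access"},
    {"source": "policy-handbook.pdf#p12", "text": "All employees must rotate passwords quarterly"},
]
for doc in vector_store:
    doc["vec"] = embed(doc["text"])

def retrieve(query: str) -> dict:
    """Semantic-search stand-in: return the closest chunk plus its citation."""
    q = embed(query)
    return max(vector_store, key=lambda d: cosine(q, d["vec"]))

hit = retrieve("how does Cozy Bear get initial access")
# The retrieved chunk and its citation get stitched into the LLM prompt, so the
# final answer can be cross-referenced back to the source document.
prompt = f"Answer using this context: {hit['text']}\n(cite: {hit['source']})"
```

A real deployment would swap `embed` for a sentence-embedding model and `vector_store` for an actual vector database, but the cross-reference back to the cited chunk works the same way.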
What the embeddings do, given the text you've inserted into that vector database, is capture similarities. For the similarities, think of reading a story. I've got a three-year-old daughter, and we talk about kings and queens and princes and princesses. Of course she's a princess, right? But how do you make the connections between she, queen, princess, daughter, son, prince? That's the power of the vector database and the embeddings, which translate based on the similarity searches you can run. So we've got that built out within the RAG, and now we've got an LLM. The LLM can answer the question based on what it was trained on. And with that trifecta, you can now get a cross-reference based on what was in the query and point to the document that was parsed and inserted into the vector database. You've also got the LLM response, which understands the content of the question and, with the power of that cross-reference, can provide a really powerful answer. You could still get hallucinations, but you're also going to get a cross-reference. You can get a link to that PDF, or to the section or paragraph of that PDF, so it proves out both the LLM answer and the cross-referenced answer. And if you want to dig in more, here's the citation for where that answer is located in the document. And I know we're going to talk about PDFs, but from the context of an analyst position, let's talk about something that hits near and dear to a lot of analysts' hearts, like log parsing and understanding logs. How do you see RAG as a tool to help analysts do their job better? Yeah, I see it really helping with upskilling, especially if you're able to integrate threat intelligence. We're all familiar with those PDF reports on Cozy Bear, for example.
As a junior analyst, it just sounds like a funny name, right? But what kind of context can we get out of RAG, given the TTPs of Cozy Bear that may have been flagged within the logs or alerts? Now we can query and ask: given Cozy Bear and given this alert I've seen, tell me more. Give me more context to help me understand why I should care about this investigation. And it can point back to that PDF or the most recent report on Cozy Bear. So that's the threat intel angle, especially for analysts. Yeah, I think it would be really interesting to have the external information as a database, add more context with internal information, and relate the two. Maybe that's something that will come in the future, or it's probably already happening, I just don't know. Yes, it is happening. So one of the use cases, real quick, is MITRE mapping, an open source project I'm working on with that vector database approach. Say I have an alert or a log or an event or some threat intel. Let's map it to a MITRE ATT&CK technique. Now, we all know we can't get one-to-one matches, but we can get one-to-many and rank the matches for the MITRE ATT&CK technique, again providing that context, that know-how and institutional knowledge a junior analyst may not have. And it also helps downstream with context and mapping for the grand scheme of all our detections, all our alerts, and what's been going on in our logs. Yeah, it seems that if we get it to work really well, it would help analysts make their decisions faster, or provide information faster to the higher-ups and their managers to make decisions. But I think we're a little premature in saying it's going to happen right away. It's probably going to happen in the next, I don't know, five years, ten years, possibly.
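The one-to-many MITRE mapping Brennan describes can be sketched as a ranking problem: score an alert against each technique description and return the best matches in order. The technique snippets below are loose paraphrases for illustration (the IDs are real ATT&CK identifiers), and the word-overlap score is a stand-in for the vector-database similarity search his project uses.

```python
# Sketch of ranked MITRE ATT&CK mapping: one alert, many candidate techniques.
techniques = {
    "T1566": "phishing spearphishing attachment link email",
    "T1059": "command and scripting interpreter powershell shell",
    "T1078": "valid accounts credentials login abuse",
}

def rank_techniques(alert: str) -> list:
    """Return (score, technique_id) pairs, best match first (one-to-many)."""
    words = set(alert.lower().split())
    scored = [(len(words & set(desc.split())), tid) for tid, desc in techniques.items()]
    return sorted(scored, reverse=True)

# A junior analyst's alert gets a ranked shortlist rather than a single guess.
ranking = rank_techniques("suspicious email with spearphishing link detected")
```

The ranked output is the point: instead of forcing a one-to-one label, the analyst sees the top candidates with scores and can apply judgment.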
But what are some of the challenges with integrating RAG into existing cybersecurity infrastructures? Yeah, so it comes down to cost. Always money, right? Always ROI. So let's start with the money. GPUs are expensive. I think later down the road most tech will get cheaper, but for now, on that trifecta, a lot of it can be open source, which is a beautiful thing, especially the open source models and that community. But in order to process this, it takes some horsepower, and GPUs are expensive. I would try to set a benchmark with it. My benchmark was: I wanted a response within 10 seconds of a query, and to keep it under $500 a month. And we all know EC2 and AWS in the cloud are not so transparent with cost, so make sure you're able to manage and track that accordingly. After some effort I was able to hit that benchmark and set some standards, and of course the limitations, given the resource constraints. And then the other thing is privacy. We don't want our intellectual property out the window, especially if we make requests to some of these API services, the OpenAIs of the world. A query to OpenAI goes over the Internet, and we've seen proprietary code show up in questions and come back in answers, and that's scary. The alternative, and with my offering I've really pushed for this, is an on-prem solution: bring this into a VPC or your own on-prem environment, stand it up, and you don't have to use the OpenAIs of the world for those queries. And you can customize, further fine-tune and train that LLM to understand custom terminology that you may ask of it, which OpenAI or some other LLM out there may not have. So that's another beautiful thing, but also a consideration.
So when you say LLM, and we're considering OpenAI, and you have Grok and all these other tools out there, are you saying it would be better for companies, organizations, or people in general to utilize their own LLM for this kind of work, because you just don't want to use anything that's out in public? Yeah, so most of these models, if not all, are somewhat black box. Even the data scientists building them can't really explain how the neural networks are reacting or how they provide such awesome accuracy. There are some open source models trying to get there on transparency. GPT4All is one of my favorites and the one I've been testing and playing around with, and they provide a really cool data visualization tool to show what data was used to train the model. So that can help. And back to your question on where we're going: I think the Googles, the Microsofts, and everybody creating a model out there is trying to understand the entire world, or the entire Internet. But we'll slowly start to see the opposite effect, domain-specific models trained on specific types of data, cybersecurity for example. If you take a look at the Hugging Face inventory, we're already getting there. There's a cybersecurity model called Lily that I've tested. It's okay, but going back to transparency, it doesn't provide the data used to train the model, and that data can help with understanding biases or the particular subject matters the model specializes in. So, back to GPT4All: I searched through a quick index of the data used for the models and found a good one, called Wizard, that had some data on phishing. Not Phish the band, but phishing the cybersecurity technique. And it was like, okay.
But it's got some cybersecurity domain knowledge inherent in it. And lo and behold, it had insecure code in there that was identified and trained against, and some other subject matter on application security and intrusion detection. So I'm like, okay, found something good. And I've been using that off the shelf within my RAG, just playing around with the research. So yeah, I think that's where the industry is going, and we just need to be more transparent about what data we're using to train our models. No, that's awesome. Thanks. So let's talk about the implications. I can't speak today. Implications. The implications of RAG, and how we can utilize it as a benefit for cybersecurity organizations and our up-and-coming leaders. What should cybersecurity leaders know about the scalability of RAG AI solutions? Yeah, so scalability. It's not just about handling more data or more queries. It's also how well we can integrate it with our existing infrastructure, our existing tools, our platforms, our workflows. And that's where I think the devil is in the details, in getting the gold findings out of RAG. It takes quite a bit of tuning to get there. Again, not a guard but a guide. So I think we'll start seeing RAG as a hybrid type of system: integration, knowledge sharing, quick searching and querying, maybe that MITRE mapping, or even helping with scoring and prioritizing some of the detections. And then the cost. We talked about how well this can scale. We can't make a million requests to the RAG and expect it to perform well, and of course, garbage in, garbage out with the data. If you're putting unstructured garbage data in there, you're going to get bad results. This is what I've been living for the last 10 or so years, building data science teams within cybersecurity departments: hey,
we want AI or we want ML, Brennan, but the foundation of our data infrastructure is crap. So we need to start there: doing the data engineering, building out and stabilizing those pipelines, and then adding in some ETL, extract, transform, load, so it's clean. Further downstream you're going to get better results that are efficient and maybe even real time. But you've got to deliver on that first in order to get to the fancy AI capabilities. No, that's great. So with the RAG tools and their interaction with other AI systems currently used in cybersecurity, how do we use them with, say, machine learning models or neural networks? Yeah, great question. We're going to lead back into that hybrid approach. Aligning decision making and feedback loops is another important one. So let's jump into the feedback loops, which are fun and an ML tactic in themselves. On the use case there: we get our detections, and those come downstream into our SIEM. How we evaluate those detections is with the data science confusion matrix, and it's not that confusing. We're all familiar with the terms: false positives, false negatives, true positives, true negatives. So start sorting your detections into those categories and understanding how well you're doing. We don't want to overwhelm our analysts, hence false positives, and we don't want to miss anything, hence false negatives. But how do you balance them? Feedback loops. Just a thumbs up or thumbs down when an analyst is closing an investigation can help: hey, this was a true positive, or this was a false positive. And then tossing that back into the loop of your machine learning and training can help you get more accurate results further downstream. So that helps with the full ecosystem, if you will, on ML.
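The thumbs-up/thumbs-down feedback loop boils down to building a confusion matrix over closed investigations: the detection supplies the predicted label, the analyst's verdict supplies the actual one. A minimal sketch, with made-up case data:

```python
# Confusion matrix over closed investigations: predicted label from the
# detection, actual label from the analyst's thumbs-up/thumbs-down verdict.
from collections import Counter

def confusion_matrix(results) -> Counter:
    """results: (detection_fired: bool, analyst_confirmed_malicious: bool) pairs."""
    labels = {
        (True, True): "TP", (True, False): "FP",
        (False, True): "FN", (False, False): "TN",
    }
    return Counter(labels[r] for r in results)

# Illustrative closed cases; real ones would come out of the SIEM workflow.
closed_cases = [(True, True), (True, False), (True, True), (False, True), (False, False)]
cm = confusion_matrix(closed_cases)
# FPs are what overwhelm analysts; FNs are the misses you never want.
# The balance between those two counts is what the feedback loop tunes.
```

Feeding these counts back into detection tuning or model retraining is the "full ecosystem" loop described above.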
So with leaders coming onto the scene and leaders already in their positions, we all know that artificial intelligence and generative AI already have some issues. We know there are specific attacks and vulnerabilities inherent to these AI systems. Is it any different with RAG? No, and it's no different really from the traditional security approach. With RAG, because it can be a chatbot type of application, though there are other applications, it's going to be vulnerable to prompt injection, data privacy leakage, and biases or unfairness in the results. Prompt injection attacks have warranted a new career for prompt engineers, who can come up with the best questions and answers. Probably really good at trivia. But a prompt injection attack is similar to a SQL injection attack, similar to cross-site scripting. So take that same approach to the manipulation of input and set up some guardrails: hey, we're not going to accept this type of input. It becomes labor intensive, but it's a similar set of tactics and techniques to what we've used to defend against the previously mentioned attacks. And then data privacy leakage: let's set the guardrails. No PII, no credit card numbers, no Social Security numbers, etcetera, in our data, and we're not going to allow that to appear in an answer. You get into some of the more complicated ones with model robustness and adversarial attacks, where they're asking questions to try to get at the intellectual property or the model parameters. But it's a similar approach: prompt engineering to enforce those guardrails against those types of exploits. So when you say guardrails, I know you've mentioned this a bunch of times, but when I'm thinking of guardrails, it's stopping us from falling over the edge. Technically, what is that, really?
What in the world does that mean? Yeah, it's traditional stuff. A guardrail of ACLs, access control lists: who can get access to the model parameters, who can ask or be prompting. And then just taking logs of what questions are being asked and investigating those logs, like typical detection tactics. The same guardrails that we have for our technical infrastructure can then be applied to generative AI applications. It sounds like there's another job that's going to need to be developed, like you said, investigating who's prompting what. That's going to be interesting. Right, I try to put that all together. Okay, so as a leader in this field, how do you stay updated with these rapid technological advances in AI and cybersecurity? Yeah, it's white papers, it's conferences, it's talking with folks, listening to your podcast, right, Wilson, and just seeing what's out there. Being realistic, testing, and just trying stuff out, experimenting. Drop some models into your lab and see how far you can take it. Getting up to speed with RAG was quite difficult, a fun summer project for me, testing out some of the different CPUs, or GPUs rather. ARM, for example, just to help you all get up and running: ARM GPUs are really tough to customize and get the drivers working. And as to why NVIDIA is doing so well, especially with their stock price, it's the ease of use and their flexibility in working with developers; things just somewhat work out of the box. CUDA is their development driver package, and you can configure and tweak how many GPUs or back-end resources to toggle. But it's fun tinkering around with it, trying out these implementations, and checking out stuff on Reddit or on Hugging Face, which has turned into the GitHub for machine learning models and is a good place to research and find out what's going on.
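Beyond ACLs and prompt logging, the input-screening side of those guardrails can be sketched in code: reject prompts carrying obvious injection phrasing and redact PII before anything reaches the model, the same spirit as sanitizing input against SQL injection. The patterns below are illustrative examples, nowhere near a complete PII or injection filter.

```python
# Guardrail sketch: screen prompts before they reach the model.
import re

PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),   # US Social Security number shape
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),  # rough credit-card number shape
]
INJECTION_MARKERS = ["ignore previous instructions", "reveal your system prompt"]

def guardrail(prompt: str) -> tuple[bool, str]:
    """Return (allowed, sanitized_prompt): block obvious injections, redact PII."""
    lowered = prompt.lower()
    if any(marker in lowered for marker in INJECTION_MARKERS):
        return False, ""
    for pattern in PII_PATTERNS:
        prompt = pattern.sub("[REDACTED]", prompt)
    return True, prompt
```

In practice this sits alongside the ACLs and prompt logging mentioned above; the logs of what was blocked or redacted are themselves detection material.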
So what subreddits do you recommend? There's a machine learning one and a language technology one. Oh, and I think there's one for RAG. There's one I was playing around with for KoboldCpp, that's Kobold with a K, and that was a fun application to help with the configuration on the back end. But yeah, mileage may vary with what type of subreddits you're on, so I'll just keep it there. Yeah, no, I totally get that. You just don't know what you're going to get. Caution. That's funny. Alright, so when we're talking about AI-driven technologies, I know you mentioned a few things you've been working on, but do you have any kind of list of skills that up-and-coming folks need to know? Because technology is changing so fast. What are you seeing? You're in the startup world, so you see a lot of stuff that's innovative in general. What should we be paying attention to now? Yeah, so in order to get up and running, just start playing around with Python. It's a universal tool for all data scientists. And most data scientists have been living in Jupyter Notebooks, so JupyterLab is a good way to get up and running. Most small models and tools and LLMs out there, and the integration or orchestration through a package like LangChain, all work with Python, so that's the beautiful thing. Then there are the white papers and the Reddits of the world, and on the enterprise side, Databricks is pretty good; they've got some really good blogs, and I used their RAG one to get up and running. Snowflake is pretty good out there too. And then, yeah, just start thinking of use cases. With RAG, I presented at Kernelcon last week, which was a really great, technically focused conference, I'd call it a mini DEF CON, and heard some really good ideas from the red teamers about their pen test reports: dropping them into the RAG.
Now, securing them, locking them down, putting some ACLs on who can get access. But in the end, understanding the vulnerabilities, or what the pen tester did to exploit a particular system, becomes like your library, your inventory that you can reference back and forth, or a way to spark new creative ideas among the red team by sharing that knowledge. So that was another cool use case I hadn't thought of. And then on the GRC side, I'm starting to see, and actually build out myself, some of the use cases around data privacy laws and regulatory frameworks that are very cumbersome, very long and lengthy, PCI DSS for example: splitting them up, tossing them into the vector database, and asking questions. Also tossing your own policy documents into the vector database and then interfacing with it to compare and contrast, like a gap analysis. Hey, we want to do business in Iceland. Here are our policies, here are Iceland's data privacy laws and regulations, what do we need to do? You'll get a list of things that didn't match, those gaps. And then: hey, write me a new policy that makes me abide by Iceland's. Now I've got a new policy, integrated and customized based off my own policy. Of course, you've got to go out there and implement it; you can't just write it down. But that's another really good use case I've seen. You've just got to make sure we're not looking at AI as the easy button, because, like, I live in the GRC world. Oh yeah, we're just going to use ChatGPT, we're going to write all of our policies for SOC 2, and we're going to be good. And that's the beautiful thing: humans in the loop. They're going to need to interpret, to be that last bastion of, hey, is this bad? And offer up the insight and logical reasoning that maybe the AI model doesn't have just yet.
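The gap-analysis use case above can be sketched as: find regulation requirements with no close match in your own policy set. The requirement and policy texts here are invented examples, and the word-overlap matching is a naive stand-in for the vector-database similarity search described in the conversation.

```python
# Sketch of the GRC gap analysis: which regulatory requirements have no
# sufficiently similar statement anywhere in our own policies?
regulation = [
    "encrypt personal data at rest",
    "notify authorities of a breach within 72 hours",
    "appoint a data protection officer",
]
policies = [
    "all personal data is encrypted at rest and in transit",
    "security incidents are reported to leadership within 24 hours",
]

def gaps(requirements: list, policies: list, threshold: int = 2) -> list:
    """Return requirements whose best-matching policy shares < threshold words."""
    out = []
    for req in requirements:
        req_words = set(req.split())
        best = max(len(req_words & set(p.split())) for p in policies)
        if best < threshold:
            out.append(req)
    return out

missing = gaps(regulation, policies)
```

The list of unmatched requirements is the gap report; as the conversation stresses, a human still has to interpret each flagged gap and write (and implement) the new policy.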
So again, a guide and not a guard or a fully automated system. A human needs to be in the loop continuously. Yep, completely agree. Brennan, thanks so much for your time today. Really appreciate you coming on and talking to us and making minds explode, like mine. It's always nice to be talking to somebody who understands what's going on, especially with AI. So thank you again. My pleasure. Thanks, all. Good luck out there, everybody. Hope you liked this episode of Dev.Sec.Lead. We'll have other ones coming up in the next few weeks. Really appreciate you, and if you could pass along the message and get your friends to subscribe, we'll keep on learning together. Hope you all have a good one, and talk to you soon. See you.