AI Loophole #1; Your GitHub README.md

I used to be the Security Team Lead for Web Applications at one of the largest government data centers in the world, but now I do mostly “source available” security, mainly focusing on BSD. I’m on GitHub but I run a self-hosted Gogs (which Gitea came from) git repo at Quadhelion Engineering Dev.

Well, on that server I tried to deny AI with Suricata, robots.txt, “NO AI” Licenses, Human Intelligence (HI) License links in the software, and “NO AI” comments in posts everywhere on the Internet where my software was posted. Here is what I found today after correlating all my logs of git clones or scrapes and tracing them all back to IP/Company/Server.
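
For anyone who wants to repeat this kind of log correlation on a self-hosted git server, here is a minimal sketch of a possible first step; it is not the tooling used for this post. It assumes the web server writes the common "combined" access log format and that clones go over git's smart HTTP protocol, where the request path contains `git-upload-pack`. The function name `tally_clone_agents` and the sample log lines (documentation IP ranges, made-up paths and user agents) are hypothetical.

```python
import re
from collections import Counter

# Combined log format (assumed):
# ip - user [time] "METHOD path HTTP/x" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"$'
)

def tally_clone_agents(lines):
    """Count git-upload-pack requests (clone/fetch over smart HTTP) per (ip, user agent)."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and "git-upload-pack" in match.group("path"):
            counts[(match.group("ip"), match.group("agent"))] += 1
    return counts

# Hypothetical sample lines, for illustration only.
SAMPLE = [
    '203.0.113.7 - - [20/Jun/2024:10:00:00 +0000] "GET /user/repo.git/info/refs?service=git-upload-pack HTTP/1.1" 200 512 "-" "git/2.43.0"',
    '203.0.113.7 - - [20/Jun/2024:10:00:01 +0000] "POST /user/repo.git/git-upload-pack HTTP/1.1" 200 90210 "-" "git/2.43.0"',
    '198.51.100.9 - - [20/Jun/2024:10:05:00 +0000] "GET /user/repo/src/master/README.md HTTP/1.1" 200 4096 "-" "ExampleBot/1.0"',
]

counts = tally_clone_agents(SAMPLE)
for (ip, agent), hits in counts.most_common():
    print(ip, agent, hits)
```

The resulting IP list is only a starting point: tracing each address back to a company or server (whois, reverse DNS) is a separate, manual step, and user agents can be spoofed.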

Formerly having been loath to even give my thinking pattern to a potential enemy, I asked Perplexity AI questions specifically about BSD security, a very niche topic. Although there is a huge data pool here in general over many decades, my type of software is pretty unique; is buried, as it does not come up in the first two pages of a GitHub search for BSD Security, which is all most users will click through; is very recent compared to the “dead pool” of old knowledge; and is fairly well received, yet not generally popular, so GitHub Traffic Analysis is very useful.

The traceback and AI result analysis shows the following:

1. GitHub cloning vs visitor activity in the Traffic tab DOES NOT MATCH any useful pattern for me, the Engineer. Rough estimate of the likelihood of AI training on my own repositories: 60% of clones are AI/Automata.
2. GitHub README.md is not licensable material and is a public document able to be trained on no matter what the software license, copyright, statements, or any technical measures used to dissuade/defeat it.
   a. I’m trying to see whether tracking down if any README.md, no matter what the context, is trainable is a solvable engineering project considering my life constraints.
3. Plagiarisation of technical writing: Probable
4. Theft of programming “snippets” or perhaps “single lines of code” and overall logic design pattern for that solution: Probable
5. Supremely interesting choice of datasets used vs available, in summary use, but also checking for validation against other software and weighted upon reputation factors with “Coq” like proofing, GitHub “Stars”, Employer History?
6. Even though I can see my own writing and formatting right out of my README.md, the citation was to “Phoronix Forum”, but that isn’t true. That’s like saying your post is something “Tick Tock” said. I wrote that; a real flesh and blood human being took comparatively massive amounts of time to do that. My birthname is there in the post 2 times [EDIT: post signature with my name no longer? Name not in “about” either hmm], in the repo, in the comments, all over the Internet.

[EDIT continued] Did it choose the Phoronix vector to that information because it was less attributable? It found my other repos in other ways. My Phoronix handle is the same as my GitHub username, where my handle is my name, easily inferable in any case, and there is a biography link with my full name in the about. [EDIT cont end]

You should test this out for yourself as I’m not going to take days or a week making a great presentation of a technical case. Check your own niche code, a specific code question of application, or make a mock repo with super niche stuff with lots of code in the README.md and then check it against AI every day until you see it.
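
One way to make the mock-repo test easier to check, sketched here as a suggestion rather than the method used for this post: put a unique "canary" identifier into the README's code example, then search AI answers for it. The helper names `make_canary`, `readme_snippet`, and `answer_mentions`, and the `qh` prefix, are all hypothetical.

```python
import secrets

def make_canary(prefix="qh"):
    """Return a random identifier that is very unlikely to exist anywhere else."""
    return f"{prefix}_{secrets.token_hex(8)}"

def readme_snippet(canary):
    """Build a README code example that uses the canary as a function name."""
    return (
        "## Usage\n\n"
        f"    def {canary}(path):\n"
        "        return open(path).read()\n"
    )

def answer_mentions(answer_text, canary):
    """True if pasted AI output contains the canary (case-insensitive)."""
    return canary.lower() in answer_text.lower()

canary = make_canary()
readme = readme_snippet(canary)
print(readme)
```

Save the generated identifier somewhere private, publish the README, and paste each day's AI answer into `answer_mentions`. A hit shows the README text reached the model or its search index; a miss proves nothing either way.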

P.S. I pulled up TabNine and tried to write Ruby so complicated and magically mashed that AI could offer me nothing, just as an AI obfuscation/smartness test. You should try something similar to see what results you get.

wizardbeard , 4 days ago

Hey Elias, found some confounding info: looks like Perplexity AI doesn’t respect the methods of blocking scrapers through robots.txt so this might just be an issue with them specifically being assholes.

Couldn’t figure out how to tag you in a comment on the other post, so I’ll edit this comment in a moment with the link.

Link: lemmy.world/post/16716107

elias_griffin OP , 7 days ago (edited 7 days ago)

Thanks for all the comments affirming my hard working planned 6 month AI honeypot endeavouring to be a threat to anything that even remotely has the possibility of becoming anti-human. It was in my capability and interest to do, so I did it. This phase may pass and we won’t have to worry, but we aren’t there yet, I believe.

I did some more digging in Perplexity on niche security. This is tangential and speculative, unlike my previous evidenced analysis, but I do think I’m on to something and maybe others can help me crack it.

I wrote this nice article www.quadhelion.engineering/…/freebsd-synfin.html about FreeBSD sysctl tunables, dropping SYN FIN, and its performance impact on web hosting and security, so I searched for that. There are many conf files out there containing this directive, and performance data in aggregate, but I couldn’t find any specific data on a controlled test of just that tunable, so I tested it months ago.

Searched for it on Perplexity:

It gave me a contradictorily worded and badly explained answer with the correct conclusion as from two different people

None of the sources it claimed said anything about its performance trade-off

The answers change daily

One answer one day gave an identical fork of a gist with the author’s name in comments in the second line. I went on GitHub and notified the original author. gist.github.com/clemensg/8828061?permalink_commen… Then I went back to take a screenshot, I would say maybe 5-10 minutes later, and I could not recreate that gist as a source anymore. I figured it would be consistent so I didn’t need to take a screenshot right then!

The forked gist was: gist.github.com/…/ac748b77fa3c001ef3791478815f7b6…

[Contradiction over time] The impact was none, negligible, trivial, improve

[Errors] Corrected after yesterday, and in following with my comments on the web that it actually improves performance as in my months old article

It is not minimal -> trivial; it’s a huge decision that has a definite and measurable impact on today’s web stacks. This is an obvious duh moment once you realize you are changing the TCP stack, and that is hardly ever negligible, certainly never none.

drop_synfin mainly mitigates fingerprinting, not DoS/DDoS; that’s a SYN flood, which is what it means, but I also tested this in my article!
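
To make concrete what the tunable looks at: as far as I understand it, FreeBSD's `net.inet.tcp.drop_synfin` discards TCP segments that have both the SYN and FIN flags set, a combination no normal connection produces and that scanners use to fingerprint an OS. A minimal sketch of that flag test follows; it is an illustration, not kernel code. The flag bit values come from the TCP header definition, while the function name `is_synfin` is hypothetical.

```python
# TCP header flag bits (low byte of the flags field).
TCP_FIN = 0x01
TCP_SYN = 0x02
TCP_ACK = 0x10

def is_synfin(flags):
    """True when both SYN and FIN are set, whatever other flags are present."""
    mask = TCP_SYN | TCP_FIN
    return (flags & mask) == mask

print(is_synfin(TCP_SYN))            # ordinary connection attempt
print(is_synfin(TCP_SYN | TCP_FIN))  # the probe pattern drop_synfin targets
```

A SYN flood, by contrast, is made of ordinary SYN-only segments, so this check would not catch it; that is the distinction drawn above.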

Anyone feel like an experiment here in this thread and ask ChatGPT the same question for me/us?

https://lemmy.world/pictrs/image/0c9fa84b-eab5-4f9b-9728-032a8d7686fd.png

https://lemmy.world/pictrs/image/d27ba955-ad03-4ab5-a88d-3e230ee124f8.png

https://lemmy.world/pictrs/image/b400d0c8-3cda-4566-b200-e907176a4b1c.png

Sanctus , 8 days ago

I agree with you that they have consumed far more of the internet than they let on, and that scrapers are shoving just everything into these regardless of legality or consent. It’s messed up. Once more, if the world wasn’t just a concrete jungle, this could probably be a great ubiquitous tool, in a faster and safer manner than it is now.

elias_griffin OP , 8 days ago

I also just realized why I’m getting heat here, lawsuits.

I just gave legal cause: that the practice was not properly disclosed by Microsoft and was abused by OpenAI. And I gave legal grounds: that a README.markdown containing code is software, not speech, integral to the licensed software, and so covered by said license.

If an entity does find out, like me, that its technical writing or code from a README is in AI, they are perhaps liable?

AlexanderESmith , 8 days ago

Eh. This is not a new argument, and not the first evidence of it. I don't think you're gonna be high on their list of retaliation targets, if you register at all (to say nothing of the low-to-middling reach of the fediverse in general).

Hell, just look at photographers/painters v. image generators, or the novel/article/technical authors v. ... practically all LLMs really, or any other of a dozen major stories about "AI" absorbing content and spitting out huge chunks of essentially unmodified code/writing/images.

elias_griffin OP , 8 days ago

It all started with this today:

Perplexity AI Is Lying about Their User Agent rknight.me/…/perplexity-ai-is-lying-about-its-use…

Blaster_M , 8 days ago

So… if you don’t want the world to see your work, why are you hosting it publicly?

AlexanderESmith , 8 days ago

"The world seeing [their] work" is not equal to "Some random company selling access to their regurgitated content, used without permission after explicitly attempting to block it".

LLMs and image generators that weren't trained on content wholly owned by the group creating the model are theft.

Not saying LLMs and image generators are innately thievery. It's like the whole "illegal mp3" argument. mp3s are just files with compressed audio. If they contain copyrighted work, and were obtained illegitimately, THEN they're thievery. Same with content generators.

VictoriaAScharleau , 7 days ago

stealing removes something. copying makes more of it. it’s not theft

AlexanderESmith , 7 days ago

The MPAA and music industry would beg to differ. As would the US courts, as well as any court in a country we share copyright agreements with.

Consider that if a movie uses a scene from another movie without permission, or a music producer uses a melody without permission, or either of them use too much of an existing song without permission, everyone sues everyone else, and they win.

Consider also that if a large corporation uses an individual's content without permission, we have documented cases of the individual suing, and winning (or settling).

Some other facts to consider;

An mp3 file is not inherently illegal. Nor is a torrent file/tracker/download.

If the mp3 file contains audio you don't own the rights to, it is illegal, same for the torrent you used to download/distribute it. In the eyes of the law, it's theft.

A trained LLM or image generation model is not inherently theft, if you only use open-source or licensed/owned content to train it

(at odds in our conversation) What of a model that was trained with content the trainer didn't own?

In the mp3 example, it's largely an individual stealing from a large company. On the Internet, this is frequently cheered as the user "sticking it to the man" (unless, of course, you're an indie creator who can't support yourself because everyone's downloading your content for free). Discussions regarding the morality of this have been had - and will be had - for a long time, but its legality is a settled matter: It's not legal.

In the case of "AI" models, it's large companies stealing from a huge number of individuals who have no support or established recourse.

You're suggesting that it's fine because, essentially, the creators haven't lost anything. This makes it extremely clear to me that you've never attempted to support yourself as a creator (and I suspect you haven't created anything of meaning in the public domain either).

I guess what it comes down to is this: If creators can be stolen from without consequence, what incentive does anyone have to create anything? Are you going to work your 40-60 hours a week, then come home and work another 20-40 hours to create something for no personal benefit other than the act of creation? Truly, some people will. Most won't.

VictoriaAScharleau , 7 days ago

this doesn’t address what I said at all.

AlexanderESmith , 7 days ago

The first sentence directly addresses your comment "it's not theft" with "the law says it is".

The rest of the post attempts to explain why it is so and some of the moral or ethical discussions surrounding some examples.

VictoriaAScharleau , 7 days ago

the law does not say it is theft.

Eiim , 7 days ago

Copyright violations ≠ conversion. Those are two completely different sets of laws. If you’re going to argue that legal definitions back you up, at least make sure you know what they are?

VictoriaAScharleau , 7 days ago

people made art, music, and stories long before copyright

Hawk , 8 days ago

If I copy McDonald’s site one-to-one for my own restaurant and just change the name, I can expect to be sued.

And yet, their site is available publicly?

elias_griffin OP , 8 days ago

The comments so far aren’t real people posting how they really feel. An agenda or automata. Does that tell you I’m over the target or what?

Look, my post is doing really well on the cybersecurity exchanges. So to all real developers and program managers out there:

Recommend the removal of any “primary logic” functional code examples out of your README.md, that’s it.

PSA, Here to help, Elias

bamboo , 8 days ago

Lmao you got some criticism and now you’re saying everyone else is a bot or has an agenda. I am a software engineer and my organization does not gain any specific benefits for promoting AI in any way. They don’t sell AI products and never will. We do publish open source work however, and per its license anyone is free to use it for any purpose, AI training included. It’s actually great that our work is in training sets, because it means our users can ask tools like ChatGPT questions and it can usually generate accurate code, at least for the simple cases. Saves us time answering those questions ourselves.

I think that the anti-AI hysteria is stupid virtue signaling for luddites. LLMs are here, whether or not they train on your random project isn’t going to affect them in any meaningful way, there are more than enough fully open source works to train on. Better to have your work included so that the LLM can recommend it to people or answer questions about it.

Chronographs , 8 days ago

The way that I see it, LLMs are a powerful tool to quickly and easily generate an output that should then be checked by a human. The problem is that it’s being shoehorned into every product it feasibly can be, often as an unchecked source of truth, by people who don’t understand it and just don’t want to miss out. If at any point you have to simply trust an LLM is “right”, it’s being used wrong.

bamboo , 8 days ago

Yeah this is super sensible. Out of curiosity, do you have any decent examples of bad usage? I think chatbots and GitHub Copilot type stuff are fine. I find the rewording applications to be fine. I haven’t used it, but Duolingo has an AI mode now and it is questionable sounding, but maybe it is elementary enough and fine tuned well enough for the content in the supported courses that errors are extremely rare or even detectable.

Chronographs , 7 days ago

I would say chatbots are bad if their job is to provide accurate information, similarly is their use in search engines. Github on the other hand would be an example of a good use, as the code will be checked by whoever is using it. I also like all the image generation/processing uses, assuming that they aren’t taken as a source of truth.

bamboo , 7 days ago

Chatbots are fine as long as it’s clearly disclosed to the user that anything they generate could be wrong. They’re super useful just as an idea generating machine for example, or even as a starting point for technical questions when you don’t know what the right vocabulary is to describe a problem.

Chronographs , 7 days ago

Yeah I was thinking more along the lines of customer support chatbots

bamboo , 7 days ago

Oh yeah those are problematic, but I’m pretty sure a court has ruled in a customer’s favor when the AI fucked up, which is good at least.

AlexanderESmith , 8 days ago

"you got some criticism and now you’re saying everyone else is a bot or has an agenda"

Please look up ad hominem, and stop doing it. Yes, their responses are a distraction from the topic at hand, but so were the random posts calling OP paranoid. I'd have been on the defensive too.

"[Our company] publish[s] open source work ... anyone is free to use it for any purpose, AI training included"

Great, I hope this makes the models better. But you made that decision. OP clearly didn't. In fact, they attempted to use several methods to explicitly block it, and the model trainers did it anyway.

"I think that the anti-AI hysteria is stupid virtue signaling for luddites"

Many loudly outspoken figures against the use of stolen data for the training of generative models work in the tech industry, myself included (I've been in the industry for over two decades). We're far from Luddites.

"LLMs are here"

I've heard this used as a justification for using them, and reasonable people can discuss the merits of the technology in various contexts. However, this is not a justification for defending the blatant theft of content to train the models.

"whether or not they train on your random project isn’t going to affect them in any meaningful way"

And yet, they did it while ignoring explicit instructions to the contrary.

"there are more than enough fully open source works to train on"

I agree, and model trainers should use that content, instead of whatever they happen to grab off every site they happen to scrape.

"Better to have your work included so that the LLM can recommend it to people or answer questions about it"

I agree if you give permission for model trainers to do so. That's not what happened here.

bamboo , 8 days ago

Why do you think they need your permission to use information you posted publicly to train their models? Copyright isn’t unlimited, and model training is probably fair use.

AlexanderESmith , 8 days ago

"Your honor, we can use whatever data we want because model training is probably fair use, or whatever".

I don't know what's worse, the fact that you think creators don't have the right to dictate how their works are used, or that you apparently have no idea what fair use is.

This might help; https://copyright.gov/fair-use/

VictoriaAScharleau , 7 days ago

authors should have no say in how published works are used.

AlexanderESmith , 7 days ago

I already replied to the essence of this in my reply to your other post about how "illegal downloads aren't theft because it's a copy", but I'll mention here that this is even more evidence that you aren't a creator, and I suggest that your opinions on this subject aren't relevant, and you should avoid subjecting other people to them.

VictoriaAScharleau , 7 days ago

your attacks on my identity don’t undercut my claims at all.

AlexanderESmith , 7 days ago

"evidence suggests that you probably aren't a creator"
"As a result, I suggests that your opinions aren't relevant"

Aside from the fact that these are not character attacks, I encourage you to refute my assumptions. Otherwise, my points will stand on their own.

VictoriaAScharleau , 7 days ago

on the internet, no one knows you’re a dog. whether I have or not, saying so doesn’t prove it. what I said stands on its own merits and your inability to make an argument without attacking identity speaks to the strength of your argument, your understanding of the subject, and your ability (or willingness) to engage in good faith.

catloaf , 7 days ago

Authors shouldn’t be paid for their labor?

VictoriaAScharleau , 7 days ago

I didn’t say that. you’re making a leap of logic

catloaf , 7 days ago

Yes, I am. Logically, if an author creates something and cannot control its distribution, it is available to everyone at no cost, therefore the author will never see a dime for their labor.

This discounts the donation model, because in practice, it rarely pays the bills. It also ignores patronage, because I doubt that you want the creation of art to be dependent on the generosity of the rich.

Thus, it makes sense for the author to maintain certain rights over the product of their labor. They provide the work under their terms, e.g. requiring payment for a copy, and that relatively low cost to the average Joe provides the money they need to buy food, pay rent, etc.

VictoriaAScharleau , 7 days ago

you recognize two well known cases where copyright is not necessary to get paid. I don’t think there is even an argument at this point. have a nice day.

catloaf , 7 days ago

Yes, and I said they’re not feasible, because they’ve been tried in the past and present and found to not work very well. If you disagree, I’m happy to hear your thoughts.

VictoriaAScharleau , 7 days ago

you claim they are not feasible, but we know people do get paid through them, so you’re just lying.

catloaf , 7 days ago

Yes, they do get paid, but not a living wage.

For the donation model, most people doing that work that I’ve talked to have day jobs, and do the other work on the side. There’s a reason the donation platform buttons say things like “buy me a coffee” and not “pay my rent for the month”: it’s because the donations don’t cover rent.

For the patronage model, like I said, I don’t think anyone wants work like this to be controlled by a handful of rich people.

I’m still interested in hearing your thoughts if you have more than “nuh uh” and “you’re lying”.

VictoriaAScharleau , 7 days ago

only one person would need to be able to live on either model to disprove your claim. since that has definitely happened, you’re definitely lying.

catloaf , 7 days ago

Just because it works once doesn’t mean it’ll work all the time for everyone.

VictoriaAScharleau , 7 days ago

any time it has worked proves you are wrong. the top 50 patreons clear over $100k a year

catloaf , 7 days ago

Exactly, and since it certainly follows a long tail distribution, the rest of the 250,000 creators on patreon make a tiny fraction of that. For the vast majority of people, it doesn’t provide a primary income.

I’m not sure you want to rely on Patreon in any case, since it also relies on the retention of rights for profit. In your scenario, when they upload to Patreon, anyone involved could tell them to get fucked and pay the author nothing.

VictoriaAScharleau , 7 days ago

“Exactly, and since it certainly follows a long tail distribution, the rest of the 250,000 creators on patreon make a tiny fraction of that. For the vast majority of people, it doesn’t provide a primary income.”

this is true for the vast majority of storytellers and artists and musicians through all of history.

VictoriaAScharleau , 7 days ago

“I’m not sure you want to rely on Patreon in any case, since it also relies on the retention of rights for profit. In your scenario, when they upload to Patreon, anyone involved could tell them to get fucked and pay the author nothing.”

anyone could do that now. people still get paid.

bamboo , 7 days ago

I mean, this is how courts work. Someone will sue because a work they hold copyright to was used in a training set without their authorization, the defendant will claim it was fair use, the judge will pick a side. To the best of my knowledge this hasn’t happened just yet, and since I’m not a judge, I use “probably”. Fair use is both vague and broad, and this is important to ensure copyright holders don’t have complete control over their work. It was recognized a long time ago that you can make works that utilize another copyrighted work, but don’t functionally replace the original work, and are therefore fair use. The whole point was to try and foster innovation, not to allow copyright holders to dictate how their works are used, and fair use is an essential part of that.

Training an LLM with a work doesn’t functionally replace that work. If there is a filter that prevents 1:1 reproduction, then it literally cannot. It also provides significant benefit to have these LLMs, they are a unique and valuable work themselves. That’s why it’s fair use.

AlexanderESmith , 7 days ago

Agreed on all points, except my personal interpretation of "fair use" specific to the case of generative models.

You call out "doesn't replace the original work". Is that not how you see an LLM Q/A bot replacing a user going to a git repo for established examples, or a website for an article (generating page views, subscriptions, ad revenue), or similar? Why would anyone go to the source materials if they're getting their answer from the bot?

This is practically the same as when Google started showing articles in AMP, and not bringing people to the original website, is it not?

bamboo , 7 days ago

How would an LLM answering questions about a git repo be legally different from a person answering those same questions (think stackoverflow)? Specific to this case, US law does not consider “APIs” to be copyrightable (Oracle v Google, Google reimplemented Java using the same APIs but their own implementation code, court ruled that Oracle couldn’t copyright the APIs).

Regarding “replace”, the primary use of the git repo is the code itself, not the Q&A about how to use it. The LLM doesn’t generate code that fully replaces that library or program, or if it does, it is distinct enough to be a different work.

AlexanderESmith , 7 days ago

First, a chat bot is not an API. Second, they were talking about the formatting and delivery method of the data, not the content.

Regarding the output of the model: Some repos are entirely READMEs by their nature. No code, just documentation and walkthroughs. Notwithstanding that: If I set a flag that says "don't use my data" and they use it anyway, that's theft, even if it's only one file, even if the file is just a description of the code. That's my work, not yours. You don't get to use it however you want, unless I specifically note that it's public domain (or you use it and follow the license, like attributing me, or linking to the repo, etc).

As to the difference between a bot and a human (re: stack overflow)? The former is a representative of a company (automation or not, whether it's a bot or a page on their corporate site), the latter is a person relating experience and opinion. The legal difference is that one is using the data commercially, and the other is just a person in the world, answering another person's question for no reason other than a desire to be helpful (and if they're decent, attributing the source instead of claiming that they're generating wisdom on their own).

That last parenthetical used to be called plagiarism, by the way.

elias_griffin OP , 8 days ago

Discussion Primer: From my perspective, and potentially that of millions of others, the readme is part of the software; it is delivered with the software whether zip, tar, or git. Markdown itself is a specification, and the document can be considered software.

In fact README is so integral to the software you cannot run the software without it.

Conclusion: I think we all think of a readme, especially one with examples of your code in it, as code. I have evidence AI trains on your README even if you tell it specifically not to use the readme, block the readme, block markdowns; it still goes after it. Kinda scary?

I want everyone else to have the evidence I have, Science.

catloaf , 8 days ago

I mean this in the best possible way, but have you ever had any mental health evaluations? I’m not sure if they’re still calling it paranoid schizophrenia, but the way you write makes me concerned.

elias_griffin OP , 8 days ago

I write the smartest in the room, passionate, with wisdom and evidence. The way you defame someone like this makes me definitely sure you are not afraid to defame someone’s character with no evidence of anything but your own stupidity and un-awareness.

catloaf , 8 days ago

This is out of genuine concern, my dude. Your other comment accusing me of not being a real person is positively alarming.

elias_griffin OP , 8 days ago

Your rapacious backwards insult of caring is gross and obvious. You called me “my dude” like a teenager who’s chill, and calm, and correct, but just … a child and wrong in the end. How old are you, child? My Lemmy profile is my name with my Seal, naturally born March 4th, 1974 as Elias Christopher Griffin. I’ve done more in my life than most people do in 10. My mental health is top 3%, as is my intellect.

You are an un-named rando lemmy account named “catloaf” who averages 16 posts a day for the past 4 months with no original posts of your own because you aren’t original.

I make only original posts. You seem nothing like a real person. Want to tell us who you are? What makes you special, outside of the mandated counseling you receive or data models you intake?

You know what, no one takes what you say seriously loaf of cat, I certainly didn’t, don’t, and won’t. Here is space for your next hairball

subignition , 7 days ago

I take back the benefit of the doubt I gave in my earlier reply. This reply is as unhinged as the Navy SEAL copypasta. You need mental health support.

DudeDudenson , 7 days ago

This really reads like copypasta. If someone told me you were an LLM configured to make anti-AI people look bad, I’d believe them.

subignition , 7 days ago

I think your problem is here:

"You should test this out for yourself as I'm not going to take days or a week making a great presentation of a technical case."

You've written a whole lot to try to be convincing but ultimately stopped short of actually proving what you've alleged. It looks to me like you are frustrated that no one is taking you at your word and going down this rabbit hole themselves, when the various reputational elements you're relying on are going to be important only to a minority of users. Burden of proof works how it always has, however.

AlexanderESmith , 8 days ago

It's not paranoia if you have proof that they're stealing your content without permission or compensation.

You come off as an AI bro apologist. What they're doing isn't okay.

catloaf , 8 days ago

Just because they are out to get you doesn’t mean you’re not paranoid, and vice versa.

I have nothing for or against AI/ML as a tool, my issue with it is when companies scrape huge amounts of data in violation of the author’s rights, as in OP’s example. Although I’m not quite sure why he’s keeping code in the README.md file; usually that’s for basic installation and usage, and full examples are kept in full documentation. That said, I highly doubt README.md files are public domain, so they shouldn’t be automatically used as training materials.

AlexanderESmith , 8 days ago

I'm not quite sure whose argument you're making here. It reads like you agree with OP and me (e.g. "LLMs shouldn't be using other people's content without permission", et al).

But you called OP paranoid... I assumed because you thought OP thought their content was being used without their permission. And it's extremely clear that this is what is happening...

What am I missing?

catloaf , 7 days ago

You’re not missing anything. Both things can be true.

wizardbeard , 8 days ago

These concepts are not mutually exclusive. You can be right about AI considerably overstepping boundaries and still be exhibiting classic signs of paranoia issues, which OP is.

Their response to people not reacting to this post and their comments is to immediately jump to the idea that they’re being targeted by their designated enemy. That’s not particularly healthy.

I’m worried that AI is becoming the new gangstalking for tech-aligned people predisposed to disordered thinking.

AlexanderESmith , 8 days ago

I agree that their replies are a little... over the top. That's all kind of a distraction from the main topic though, isn't it? Do we really need to be rendering armchair diagnoses about someone we know very little about?

I mean, if I posted a legitimate concern - with evidence - and I was dog-piled with a bunch of responses that I was a nutter, I'd probably go on the defensive too. Some people don't know how to handle criticism or stressful interactions, it doesn't mean we should necessarily write them (or their verified concerns) off.

DudeDudenson , 7 days ago

Frankly, OP replied to his own post multiple times with no prompting whatsoever. Just reading through this stuff, I’m concerned about him as well. LLM stuff notwithstanding, and even if he’s right, he seems somewhat obsessed with this in an unhealthy way.

bamboo , 8 days ago

Anything you put publicly on the internet in a well known format is likely to end up in a training set. It hasn’t been decided legally yet, but it’s very likely that training a model will fall under fair use. Commercial solutions go a step further and prevent exact 1:1 reproductions, which would likely settle any ambiguity. You can throw anti-AI licenses on it, but until it’s determined to be a violation of copyright, it is literally meaningless.

Also, if you just hope to spam tab with any of the AI code generators and get good results, you won’t. That’s not how those work. Saying something like this just shows the world that you have no idea how to use the tool, not the quality of the tool itself. AI is a useful tool, it’s not a magic bullet.

elias_griffin OP , 8 days ago (edited 8 days ago)

Sounds like AI or an AI influencer post. The first paragraph is so far off-topic, might as well be talking about sailing. You completely misunderstood what I meant using TabNine. I wrote my own code and obfuscated my own code. Then tried to have AI complete another function using my code.

Nothing you said is relevant in any way, shape, or form.

[EDIT] www.tabnine.com

wizardbeard , 8 days ago (edited 8 days ago)

My guy, your posts are particularly hard to follow, and you are very very quick to jump to the conclusion that you’re somehow being targeted and under attack. It’s no surprise that people aren’t responding to what you think is appropriate for them to respond to.

You’ve gone out of your way to provide extra info about irrelevant details: Why does the particular flavor of git you use matter at all to this conversation beyond the fact that you self host, why does it matter that you are on github as well when we are specifically discussing things you believe were sourced from readme.mds you have self hosted?

Meanwhile you don’t give many details or explanation about the core thing you are trying to discuss, seemingly expecting people to be able to just follow your ramblings.

Edit: After having re-read your OP, it’s less messy than I initially thought, but jesus christ man you need to work on arranging your points better. It shouldn’t take reading your main post, a few of your comments, and the main post again to get your point: “AI data scrapers appear to treat readme files as public data regardless of any anti-AI precautions or licensing you’ve tried to apply, and they appear to not only grab from GitHub but also from self-hosted git repositories.”

Chronographs , 8 days ago

Seriously. OP might have a legitimate point but they’re making it with the energy of someone trying to convince me that vole people live in the antiposition of the time cube.

AlexanderESmith , 8 days ago

In fairness, a lot of the more exceptional engineers I've worked with couldn't write their way out of a wet paper bag.

On top of that, even great technical writers are often bad at picking - or sticking with - an appropriate target audience.

catloaf , 8 days ago

I think that training models for fair use purposes, like education, not commercialization, will also fall under fair use. But even so, it’s very difficult to prove that someone has trained their model on your data without a license, so as long as it’s available, I’m sure that it’ll be used.

AlexanderESmith , 8 days ago

This "fair use" argument is excellent if used specifically in the context of "education, not commercialization". Best one I've seen yet, actually.

The only problem is that perplexity.ai isn't marketing itself as educational, or as a commentary on the work, or as parody. They tout themselves as a search engine. They also have paid "pro" and "enterprise" plans. Do you think they're specifically contextualizing their training data based on which user is asking the question? I absolutely do not.

the_doktor , 5 days ago

And this is why AI needs to be banned from use. People own the things they post / place them under various licenses, and AI coming along and taking what you did is a blatant violation of copyright, ownership, trust, and is just general theft.

I am absolutely angry with the concept of AI and have campaigned against its use and written at length, many times, to every company that believes it’s allowed to scour the internet for training data for its highly flawed, often incorrect, sometimes dangerous AI garbage. To hell with that and to hell with anyone who supports AI.

bamboo , 5 days ago

It hasn’t been decided in court yet, but it’s likely that AI training won’t be considered a copyright violation, especially if there is a measure in place to prevent exact 1:1 reproductions of the training material.

But even then, how are the questionable choices of some LLM trainers a reason to ban all AI? There are some models that are trained exclusively on material that is explicitly licensed for this purpose. There’s nothing legally or morally dubious about training an LLM if the training material is all properly licensed, right?

Recommend the removal of any “primary logic” functional code examples out of your `README.md`, that’s it.
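
If you want to act on that recommendation, a first step is simply finding every code example your README currently ships. Below is a rough Python sketch of one way to do it, not something from the original post: the names `FencedBlock` and `find_fenced_blocks` are made up for illustration, and it only handles fenced blocks (backtick or tilde fences), not indented code blocks.

```python
import re
from typing import List, NamedTuple


class FencedBlock(NamedTuple):
    """One fenced code block found in a Markdown document (hypothetical helper type)."""
    start_line: int  # 1-indexed line number of the opening fence
    language: str    # info string after the fence, "" if none was given
    body: str        # the code between the fences


# A fence line: three or more backticks or tildes, optionally followed by a language tag.
FENCE = re.compile(r"^(`{3,}|~{3,})\s*([\w+-]*)\s*$")


def find_fenced_blocks(markdown: str) -> List[FencedBlock]:
    """Return every fenced code block in `markdown`, in document order."""
    blocks: List[FencedBlock] = []
    open_fence = None  # the fence string that opened the current block, if any
    start, language, body = 0, "", []
    for number, line in enumerate(markdown.splitlines(), start=1):
        match = FENCE.match(line.strip())
        if open_fence is None:
            if match:
                # Opening fence: remember its marker, language tag, and position.
                open_fence, language = match.group(1), match.group(2)
                start, body = number, []
        elif (
            match
            and match.group(1)[0] == open_fence[0]
            and len(match.group(1)) >= len(open_fence)
            and not match.group(2)
        ):
            # Closing fence: same marker character, at least as long, no language tag.
            blocks.append(FencedBlock(start, language, "\n".join(body)))
            open_fence = None
        else:
            body.append(line)
    return blocks
```

Feeding it the text of your `README.md` gives you a list of blocks with their starting line numbers, so you can decide by hand which ones count as “primary logic” and move those into separate documentation. The judgment call about what to remove stays with you; the script only lists candidates.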