It's not an easy problem to fix, and traces back to many generative AI projects being "effectively rushed to market" and made widely accessible because the field is so competitive, said Stanford Internet Observatory's chief technologist David Thiel, who authored the report.

"Taking an entire internet-wide scrape and making that dataset to train models is something that should have been confined to a research operation, if anything, and is not something that should have been open-sourced without a lot more rigorous attention," Thiel said in an interview.

A prominent LAION user that helped shape the dataset's development is London-based startup Stability AI, maker of the Stable Diffusion text-to-image models. New versions of Stable Diffusion have made it much harder to create harmful content, but an older version introduced last year - which Stability AI says it didn't release - is still baked into other applications and tools and remains "the most popular model for generating explicit imagery," according to the Stanford report.

"That model is in the hands of many people on their local machines," said Lloyd Richardson, director of information technology at the Canadian Centre for Child Protection, which runs Canada's hotline for reporting online sexual exploitation.

Stability AI on Wednesday said it only hosts filtered versions of Stable Diffusion and that "since taking over the exclusive development of Stable Diffusion, Stability AI has taken proactive steps to mitigate the risk of misuse." "Those filters remove unsafe content from reaching the models," the company said in a prepared statement. "By removing that content before it ever reaches the model, we can help to prevent the model from generating unsafe content."

LAION was the brainchild of a German researcher and teacher, Christoph Schuhmann, who told the AP earlier this year that part of the reason to make such a huge visual database publicly accessible was to ensure that the future of AI development isn't controlled by a handful of powerful companies. "It will be much safer and much more fair if we can democratize it so that the whole research community and the whole general public can benefit from it," he said.

Much of LAION's data comes from another source, Common Crawl, a repository of data constantly trawled from the open internet, but Common Crawl's executive director, Rich Skrenta, said it was "incumbent on" LAION to scan and filter what it took before making use of it.

LAION said this week it developed "rigorous filters" to detect and remove illegal content before releasing its datasets and is still working to improve those filters. The Stanford report acknowledged LAION's developers made some attempts to filter out "underage" explicit content but might have done a better job had they consulted earlier with child safety experts.

Many text-to-image generators are derived in some way from the LAION database, though it's not always clear which ones. OpenAI, maker of DALL-E and ChatGPT, said it doesn't use LAION and has fine-tuned its models to refuse requests for sexual content involving minors. Google built its text-to-image Imagen model based on a LAION dataset but decided against making it public in 2022 after an audit of the database "uncovered a wide range of inappropriate content including pornographic imagery, racist slurs, and harmful social stereotypes."

Trying to clean up the data retroactively is difficult, so the Stanford Internet Observatory is calling for more drastic measures. One is for anyone who's built training sets off of LAION-5B - named for the more than 5 billion image-text pairs it contains - to "delete them or work with intermediaries to clean the material." Another is to effectively make an older version of Stable Diffusion disappear from all but the darkest corners of the internet.