"[...] you agree not to sell, license, rent, modify, distribute, copy, reproduce, transmit, publicly display, publicly perform, publish, adapt, edit or create derivative works from any materials or content accessible on the Service. Use of the Goodreads Content or materials on the Service for any purpose not expressly permitted by this Agreement is strictly prohibited."
Also, did the reviewers give you permission to feed their content into an LLM?
esskay 1 days ago [-]
Fairly meaningless in this day and age. Also IIRC scraping legality depends heavily on jurisdiction. Some places take a more permissive view of accessing publicly available information, even if a site's TOS forbids bots.
In the US there’s a major precedent [0] which held that scraping public-facing pages isn’t a CFAA "unauthorized access" issue. That’s a big part of why we’ve seen entire venture-backed scraping companies pop up - it’s not considered hacking if the data is already public.
> However, after further appeal in another court, hiQ was found to be in breach of LinkedIn's terms, and there was a settlement.
So why would the same not apply here?
hrimfaxi 22 hours ago [-]
They settled out of court, that doesn't mean that they were found to be in breach of the terms.
These were some of the notable elements (worth noting that none mention breaching terms of service):
> Damages: Judgment in the amount of $500,000 is entered against hiQ, with all other monetary relief waived.
> CFAA liability: hiQ stipulates that LinkedIn experienced losses sufficient to, and “may establish liability” under a CFAA civil claim “based on hiQ’s data collection practices and based on hiQ’s direct access to password-protected pages on LinkedIn’s platforms using fake accounts.”
> California “CFAA”: hiQ stipulates that LinkedIn “may establish civil liability” under California’s state-law counterpart to the CFAA based on hiQ’s data collection practices, use of fake accounts and other means to evade detection by LinkedIn, hiQ’s direct access to password-protected pages on LinkedIn’s platforms using fake accounts, and hiQ’s unauthorized commercial use of data.
> Trespass: hiQ stipulates that LinkedIn has established judgment as to liability under California law for the common law torts of trespass to chattels and misappropriation.
> Irreparable harm: hiQ stipulates that LinkedIn has established that it has suffered an irreparable injury and that LinkedIn satisfied the remaining factors and is entitled to a permanent injunction.
A settlement means there was no legal ruling and no precedent set. The entire case is legally moot.
In America, you can simply pay to not lose any lawsuit ever, and thus never have to face legal consequence or changes to the law you don't like.
hrimfaxi 22 hours ago [-]
> CFAA liability: hiQ stipulates that LinkedIn experienced losses sufficient to, and “may establish liability” under a CFAA civil claim “based on hiQ’s data collection practices and based on hiQ’s direct access to password-protected pages on LinkedIn’s platforms using fake accounts.”
This was part of the terms of the settlement.
exe34 1 days ago [-]
it's only legal if you have a team of lawyers though. the law still applies to the rest of us.
paulnpace 1 days ago [-]
It is the future: I own nothing, and I've never been happier. They can sue me and take nothing.
exe34 1 days ago [-]
I've been trying to convince myself I'd be able to live like Diogenes, sleep in the streets, bathe in the sea and just generally survive off scraps - but I think that only works if other people can afford to throw away scraps.
portaouflop 1 days ago [-]
If you live in the west it’s no problem - the amount of waste there is insane
exe34 24 hours ago [-]
It won't be after the billionaires own everything and the rest are living off scraps already.
voidUpdate 1 days ago [-]
So if you are legally allowed to "adapt, edit or create derivative works from any materials", what's the point of the TOS?
margalabargala 17 hours ago [-]
The TOS specify the circumstances in which the corp may take action that is unrelated to the legal system. Just because they can't sue you (and easily win) for scraping, doesn't mean they can't block you if they notice you doing it.
Google for example has a TOS and is well known for permanently banning accounts for real or imagined or AI-generated violations of it. Google banning you for breaking TOS doesn't mean you broke the law, just that you broke their rules, which apparently include a clause against being in the wrong place at the wrong time.
zigzag312 1 days ago [-]
I believe TOS is binding as long as it doesn't conflict with the law. If something is deemed fair use under the law, TOS cannot override those legal rights.
hrimfaxi 22 hours ago [-]
Legal rights are signed away all the time in contracts though.
zigzag312 2 hours ago [-]
Aren't there some limits on which legal rights can be signed away through a contract?
I imagine that a contract in which someone agrees to become a slave would be void.
hopelite 1 days ago [-]
That’s a good question. It also would not be the first time that companies use trickery and manipulation or even deliberately illegal practices for various business/financial reasons. At the very least it could be used as a tool to underpin intimidating lawsuits and another step up, regardless of the legality in the relevant jurisdiction, it could be used to influence official government foreign policy to exert pressure on a jurisdiction that permits scraping.
6stringmerc 24 hours ago [-]
Tell that to judyrecords with the same smug attitude.
Your textbook versus reality conceptualization of things is dogshit. It’s exploitation to do what OP did. You’re endorsing it and minimizing the ethics and this certainly shall poison the well from which you drink. Godspeed.
kulahan 19 hours ago [-]
This is so overly dramatic it’s hard to even consider the point you’re trying to make.
MichaelBosworth 1 days ago [-]
What expectation of confidentiality are you ascribing to people having posted publicly accessible opinions on the internet?
Out of curiosity, is your point about TOS out of concern for the poster or for Goodreads?
voidUpdate 1 days ago [-]
My expectation isn't of confidentiality, but of attribution. Sure, my website is perfectly accessible on the internet, and I'm fine with being able to find it on google, but if you pipe it into an algorithm that will start throwing out stuff based on what I wrote, with zero reference to me at all, I'd get a bit annoyed. This website has taken the combined output of probably thousands of people, shoved it into an algorithm and is then using their work to give "original" ideas. If one person wanted their content removed from the system, how would you do that?
caconym_ 1 days ago [-]
What does that comment have to do with confidentiality?
MichaelBosworth 20 hours ago [-]
That he viewed a review on Goodreads as the reviewer’s intellectual property hadn’t occurred to me. I see why, in aggregate, many such opinions become valuable, but the whole is more than the sum of its parts.
So does it feel to you guys like your comments, say, here in this Hacker News thread should be considered effectively copyrighted as your personal IP?
If so, do you feel the same way about opinions you share out in a supermarket or on the street?
nextaccountic 20 minutes ago [-]
Of course comments are copyrighted, if they happen to contain text that is novel. As an example, in the reddit TOS, they require commenters to license their comments to reddit.
> If so, do you feel the same way about opinions you share out in a supermarket or on the street?
Well, being novel isn't the only criterion for copyright; the work must also be "fixed", and opinions in a supermarket usually aren't (but they can be, if I film them and post on reels or something; then the video itself is copyrighted).
> To meet the fixation requirement, a work of authorship must be fixed in a tangible medium of expression. Protection attaches automatically to an eligible work the moment the work is fixed. A work is considered to be fixed so long as it is sufficiently permanent or stable to permit it to be perceived, reproduced, or otherwise communicated for a period of more than transitory duration.
caconym_ 19 hours ago [-]
There are well established legal standards for what is copyrightable and I believe written literary criticism trivially qualifies (as it should). Stuff you yell at the supermarket doesn't, IIUC, as it isn't fixed in a tangible form. Social media comments are, IIUC, generally protected. The exception would be comments that don't meet the bar to be considered "original", "creative", etc.
(not a lawyer)
pantropy 1 days ago [-]
Technically speaking, none of Goodreads' material or content is being used publicly; the only information displayed on the site is freely available (Title, Author) and not Goodreads' property.
You could try to argue that this falls under "create derivative works from any materials or content accessible on the Service" but even then it seems really flimsy to say that recommending books based on Goodreads reviews is an infringement.
It's just not that different to a youtuber saying "I read reviews for 50 books, here's the ones to read"
voidUpdate 1 days ago [-]
I'd be impressed if a youtuber could read 3 billion reviews and recommend books to you based on that
tonyhart7 1 days ago [-]
What about a YouTuber who builds a machine that scrapes 3 billion reviews and makes recommendations based on the data?
bravoetch 20 hours ago [-]
Skip that step. This project enables a Youtuber that automates pulling related booklists from this site, and uses AI to make the recommendation videos. Thousands of videos.
croes 1 days ago [-]
I visit your garden and take 1 apple from your tree
I visit your garden and take 1000 apples from your tree.
Not that different.
kemotep 1 days ago [-]
Not only am I taking 1,000 apples, but I use those 1,000 apples to start my own orchard and encourage people to come to it instead of yours.
simianparrot 1 days ago [-]
Yeah but if I program a drone swarm to automate this process it’s for the greater good — more apples for everyone!
And I only charge a tiny subscription for access to all my drone-managed orchards, you can eat as many apples as you want. But don’t steal any and start your own orchard or I sue.
croes 1 days ago [-]
All the people who care for the trees and pick the apples have lost their jobs while an apple became nearly worthless, but without a job it’s still unaffordable.
Replace your drones with China or India and you have the current situation in the US.
Apple farmers go out of business so you lose the people who create new varieties.
lm28469 23 hours ago [-]
> but I use those 1,000 apples to start my own orchard
Steal cuttings, not the fruit, if you plan to start an orchard. From 1000 apples you'll get ~10 000 seeds, statistically you won't even end up with one good tree.
> An output of three cultivars from around 50.000 seeds means that 17.000 seeds were needed to get one cultivar. Only one out of around 9.000 scab resistant seedlings showed the appropriate quality to become a cultivar. This proportion underlines the enormous effort which is necessary to develop a new cultivar.
... and somehow your garden did not lose any apple in the process.
croes 1 days ago [-]
But you are an apple seller
jalk 1 days ago [-]
Not a great analogy, since a digital copy leaves the original intact unlike your apples
petralithic 16 hours ago [-]
For every apple I take, you still have your apple on the tree, because my apple is only a copy of yours.
contravariant 23 hours ago [-]
At what point are they feeding reviews into an LLM? From what I got the only personal data they're using is which user read which books.
irl_zebra 19 hours ago [-]
This is, essentially, why I've withdrawn from posting content from my human brain almost anywhere on the open internet (except here, sometimes) and have retired blog posts, opinions, and so on to our friends WAN.
kosolam 1 days ago [-]
I’m not taking sides in this debate, however since feeding whole books into LLMs is considered legal fair use now, I guess these reviews don’t require a permission as well. Would be great to hear a professional lawyer take on this.
saaaaaam 1 days ago [-]
The hidden gotcha in the Anthropic judgement (which I think is what you’re referencing?) is that feeding whole books into LLMs is considered legal fair use if you obtain them legitimately.
I suspect we need to wait for the NYT (and others) case to be decided before we know whether scraping sites in contravention of their terms is also fair use for LLM training.
My own opinion (as someone who creates written content on an occasional professional basis) is that if you can’t monetise your content in some other way than blocking people from accessing it then your content probably isn’t as valuable as you think.
But at the same time that’s tricky when it’s genuine journalism, as in NYT’s case.
Obviously user generated content reviewing books online is rather different because the motivation of the reviewers was (presumably) not to generate money. And, indeed, with goodreads there’s a strong argument that people have already been screwed over after their good faith review submissions were packaged up as an asset and flogged to Amazon. A lot of people were quite upset by that when it happened a decade or so back.
So from a ‘moral arguments’ perspective I don’t think scraping goodreads is as problematic as other scraping examples.
(Sorry, none of this was aimed at you - your comment just got me thinking and it seemed as good a place as any to put it!)
IncreasePosts 23 hours ago [-]
Goodreads offers those reviews up publicly by serving them from their webservers to anyone who asks for it.
saaaaaam 20 hours ago [-]
Sorry, I don’t understand the point you’re making. I know that these are publicly available - the point I was making, drawing off the parent comment, is that where it has been deemed fair use in copyright to use books to train LLMs when the content has been legitimately obtained then a similar assessment might apply for this sort of ingestion.
If content is publicly available that does not necessarily mean it’s free of copyright control: the justification for using the reviews to train an LLM would be based on the fact that fair use means it is not an infringement of copyright. But if the publisher has terms that forbid scraping then that may mean the fair use argument is undermined if it is precedent in the content being legitimately obtained. I’m not a lawyer but it’s quite easy to see how “books can be used for LLM training under fair use but not if you pirate them” extends to “content on the web can be used for LLM training under fair use but not if you’ve breached the terms set out by the publisher”.
petralithic 1 days ago [-]
Why ask questions you already know the answers to?
hananova 1 days ago [-]
Because some tech adjacent people still have morals?
lunias 1 days ago [-]
If it's on the internet, and people can access it, then it's public. I would have no expectations for what people do with public data; that just seems like setting yourself up for disappointment.
Vvector 23 hours ago [-]
Is a pirated movie, found on bittorrent, public?
IMO, your definition is overbroad
lunias 23 hours ago [-]
If it's on bittorrent then, yes, it's public. It doesn't matter if you intended it to be or not, it's publicly accessible, therefore it's public.
vessenes 2 days ago [-]
OK, I just added books until you told me I had too many. Fun idea! I have a couple of suggestions:
* UI - once someone clicks "Add" you really should remove that item from the suggested list - it's very confusing to still see it.
* Beam search / diversification -- Your system threw like 100 books at me of which I'd read 95 and heard of 2 of the other 3, so it worked for me as a predictor of what I'd read, but not so well for discovery.
I'd be interested in recommendations that pushed me into a new area, or gave me a surprising read. This is easier to do if you have a fairly complete list of what someone's read, I know. But off the top of my head, I'm imagining finding my eigenfriends, then finding books that are either controversial (very wide rating differences amongst my fellow readers) or possibly ghettoized, that is, some portion of similar readers also read this X or Y subject, but not all.
Anyway, thanks, this is fun! Hook up a VLM and let people take pictures of their bookshelf next.
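To make the "controversial books" idea concrete, here is a minimal sketch, not anything the site is known to do. It assumes you already have a set of similar readers and their 1-5 ratings; the input shape, the function name `controversial_books`, and the `min_ratings` cutoff are all made up for illustration. It ranks books by the variance of ratings among those readers.

```python
from statistics import pvariance


def controversial_books(ratings_by_reader, min_ratings=3):
    """Rank books by how much a group of similar readers disagrees on them.

    ratings_by_reader: {reader_id: {book_title: rating}} (hypothetical shape).
    Books with fewer than min_ratings ratings are skipped, because a
    variance computed from one or two ratings says very little.
    """
    per_book = {}
    for ratings in ratings_by_reader.values():
        for book, rating in ratings.items():
            per_book.setdefault(book, []).append(rating)

    scored = [
        (pvariance(book_ratings), book)
        for book, book_ratings in per_book.items()
        if len(book_ratings) >= min_ratings
    ]
    # Highest variance first; ties are broken alphabetically by title.
    return [book for _, book in sorted(scored, key=lambda t: (-t[0], t[1]))]
```

A book near the top of this list is one that readers with similar shelves split on, which is one possible proxy for "surprising".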
kace91 2 days ago [-]
(From the site)
>If you visit the "intersect" page, you can input multiple books and find the set of users that have read all of those books. This can be useful for finding longer tail books that weren't popular enough to meet the threshold. For instance, if you like reading about the collapse of the Soviet Union, you could put in "Lenin's Tomb" and "Secondhand Time", and see what other books the resultant users have read.
This is how filmaffinity works, which is the best recommendation system I've tried. They have a group of several dozen 'soulmates', which are users with the most similar set of films seen and ratings given; recommendations are other stuff they also liked, and you get direct access to their lists.
>then finding books that are either controversial or possibly ghettoized
Naively, I’d say the surprises are going to be better if you filter for more different friends, rather than more controversial books among your friends. As in “find me a person that’s like me only in some ways, tell me what they love”. Long term this method is much better at exposing you to new ideas rather than just finding your clique’s holy wars.
idoubtit 1 days ago [-]
The "Intersect" page was useless for me. I added 15 books, but got no matching user. I entered a cycle of removing-searching, and at 10 books I had 2 users: one had read 41353 books, and the other 85363, with no ratings...
To be useful, the "Intersect" page should have:
- find near matches when there is no exact match with every book,
- ignore fake users (can any human read 80k books in many languages?),
- do not ignore users' votes (my input was books I liked, I expected to find users that rated them highly).
With the "Recommend" page I had the same problem as the GP, and all the recommendations were useless. To fix that, I think some features are needed:
- do not list books by authors from my list (I don't need recommendations for them),
- add a button for marking a suggested book as "disliked" (at the bare minimum, it should remove it from the suggestion, and ideally it should influence the suggestions as much as a "liked" book),
- do not suggest several books by the same author,
- add a button to hide a suggestion or show more suggestions (there were dozens of books I'd read but wouldn't rate high).
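The three "Intersect" suggestions above (near matches, dropping implausible accounts, respecting ratings) could be sketched roughly like this. It is only an illustration under assumed data shapes: the `shelves` mapping, the `matching_readers` name, and the thresholds are hypothetical and say nothing about how the site stores its data.

```python
def matching_readers(shelves, liked, min_overlap, min_rating=4, max_shelf_size=5000):
    """Find readers who rated at least min_overlap of the liked books highly.

    shelves: {user: {book: rating or None}} (hypothetical shape; None means
    the book was shelved without a rating).
    Users with more than max_shelf_size books are treated as implausible
    accounts and skipped.
    """
    liked = set(liked)
    matches = []
    for user, shelf in shelves.items():
        if len(shelf) > max_shelf_size:
            continue  # e.g. an account claiming 80k books read
        overlap = sum(
            1 for book in liked if (shelf.get(book) or 0) >= min_rating
        )
        if overlap >= min_overlap:
            matches.append((overlap, user))
    # Best overlap first; ties are broken by user name for stable output.
    return [user for _, user in sorted(matches, key=lambda m: (-m[0], m[1]))]
```

Setting `min_overlap` below the number of input books gives near matches instead of requiring every book, and `min_rating=0` turns the rating filter off for people who only care about what was read.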
costco 1 days ago [-]
What do you think the probability that someone else read 15 books you also read is? It’s very unlikely unless they are all staples of a genre, part of the same series, or just extremely popular in general. 3-5 books is how much I would use on that page. I have found interesting accounts of medievalists, people who work at think tanks, etc with it.
Fake users I would agree should be filtered, but I don’t think filtering out users who gave it a bad review is necessarily the intended behavior. If I put in 3 semi-obscure Russian history books, I am presumably looking for someone who is an expert in Russian history to see what else they read. In that case I don’t care whether they liked one of the books or not. Approximate matches would require something like LSH or cosine similarity of the average input book embedding against the average embedding of every user's read books, which I think wouldn’t work well for retrieving anyone with a moderately long interaction history.
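For readers unfamiliar with the approximate-matching approach mentioned here, this is a small pure-Python sketch of the averaged-embedding variant. The embeddings, shelf data, and function names are invented for illustration; the site's actual model and vectors are not shown in this thread.

```python
import math


def mean_vector(vectors):
    """Component-wise average of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_readers(query_books, reader_shelves, embeddings, top_k=3):
    """Rank readers by similarity between averaged book embeddings.

    embeddings: {book: vector}; reader_shelves: {reader: [books]}.
    Both are hypothetical inputs for this sketch.
    """
    query = mean_vector([embeddings[b] for b in query_books])
    scored = []
    for reader, shelf in reader_shelves.items():
        profile = mean_vector([embeddings[b] for b in shelf])
        scored.append((cosine(query, profile), reader))
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [reader for _, reader in scored[:top_k]]
```

The weakness described in the comment shows up directly in `mean_vector`: the more varied books a reader has shelved, the more their average drifts toward the middle of the space, so long histories all start to look alike.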
idoubtit 20 hours ago [-]
I wanted to find users that loved the same kinds of classical novels. The core of my list was each famous work of famous classical writers like Dostoievsky, Tolstoi, Huxley and Borges. I added a few excellent authors, still famous but to a lesser degree, like Italo Calvino or Marguerite Yourcenar. I know there are many readers of the whole list I wrote, I could name a few among my friends and family.
So I think the problem was not in the existence of similar readers, but in the way to reach them. Few people that read classical books log in Goodreads (I don't) and even fewer input what they've read over the past decades.
dbl000 2 days ago [-]
Echoing what everyone else has said here - awesome site, love how fast it was.
I did notice that when I put in a single book in a series (in my case Going Postal, Discworld #33) that tended to dominate the rest of the selection. That does make sense, but I don't want recommendations for a series I'm already well into.
Also noticed that a few books (Spycraft by Nadine Akkerman and Pete Langman, Tribalism is Dumb by Andrew Heaton) that I know are in goodreads and reviewed didn't show up in the search. I tried both the author's name and the title of the book. Maybe they aren't in the dataset.
It did stumble with some more niche books (The Complete Yes Minister). Trying the "Similar" button gave me more books that were _technically_ similar because they were novelizations of British comedy shows, but not what I was looking for.
For more common books though it lined up very well with books already on my wishlist!
costco 2 days ago [-]
Yes, I would say the handling of series is probably the biggest problem. Once my test metrics got to a point I was happy with and my quality spot checks passed (can I follow the model's recommendations from one generic history book to Steven Runciman, also making sure popular books don't always dominate the results), I was ready to release because I had been working on this project for so long. The solution is probably using the transformer model to generate 100-200 candidates and then having a reranker on top.
walletdrainer 2 days ago [-]
Not just series, but I seem to mostly get a list of other books from the same authors.
The recommendations from other authors are good, but as far as I can tell I’ve read every single one of them.
Continuing to aggressively add everything it recommends eventually does seem to result in some interesting books I wasn’t familiar with, but I also end up with more and more books that are of zero interest to me.
For what it’s worth, I started with:
Infinite Jest David Foster Wallace
Europe Central William T. Vollmann
Gravity’s Rainbow Thomas Pynchon
White Noise Don DeLillo
One Hundred Years of Solitude Gabriel García Márquez
It is possible that there simply aren’t many books like these in existence, so the pool of relevant recommendations gets exhausted fairly quickly. I’d guess trending towards unrelated popular books is also just a feature of the source data, that largely sums up my experience with goodreads anyway.
Very cool project though. I did end up ordering a couple of new books, so thank you very much.
IanCal 1 days ago [-]
Releasing is the right choice, well done with this it’s really cool.
I’ve only had a short play but a solution to this problem might be to show authors rather than books. Or select authors outside of the list the user has shared and then a top n (1,3,5) for each of those.
I feel like that’s how you’d recommend to someone else - type of book -> unknown author -> best matching few books from them.
After that the other side would be trying to find some diversity (if you think I’d like author X, personally you might suggest three different styles of book from them rather than three very similar books from them)
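A rough sketch of the author-first idea, assuming the recommender already produces scored candidates. The tuple layout, the `recommend_by_author` name, and the default limits are hypothetical; this is a post-processing step, not a description of the site's pipeline.

```python
def recommend_by_author(candidates, known_authors, per_author=3, max_authors=5):
    """Group scored candidates by author, skipping authors the user already knows.

    candidates: iterable of (title, author, score) tuples (hypothetical shape).
    Returns {author: [titles]} with at most per_author titles per author and
    at most max_authors authors, ordered by each author's best-scoring book.
    """
    known = {author.casefold() for author in known_authors}
    by_author = {}
    for title, author, score in sorted(candidates, key=lambda c: -c[2]):
        if author.casefold() in known:
            continue
        picks = by_author.setdefault(author, [])
        if len(picks) < per_author:
            picks.append(title)
    # Dicts preserve insertion order, so the first authors seen are the
    # ones with the highest-scoring candidates.
    return dict(list(by_author.items())[:max_authors])
```

Picking different styles of book within one author, as suggested above, would need some extra notion of similarity between titles that this sketch leaves out.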
Peritract 23 hours ago [-]
> It did stumble with ... more niche books (The Complete Yes Minister).
If you haven't already read it, you might like Lawrence Durrell's Antrobus [1].
Going Postal is awesome. The flood of the mails and the test to be the post master where you would have to slide mail into the hole where a vicious dog is barking.
mscbuck 2 days ago [-]
Awesome site and speed!
My advice from someone who has built recommendation systems: Now comes the hard part! It seems like a lot of the feedback here is that it's operating pretty heavily like a content-based system, which is fine. But this is where you can probably start evaluating on other metrics like serendipity, novelty, etc. One of the best things I did for recommender systems in production is having different ones for different purposes, then aggregating them together into a final ranking. Have a heavy content-based one to keep people in the rabbit hole. Have a heavy graph-based one to try and traverse and find new stuff. Have one that is heavily tuned on a specific metric for a specific purpose. Hell, throw in a pure TF-IDF/BM25/SPLADE-based one.
The real trick of rec systems is that people want to be recommended things differently. Having multiple systems that you can weight differently per user is one way to be able to achieve that; usually one algorithm can't quite do that effectively.
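As an illustration of the per-user weighting described here, a minimal weighted-sum blend might look like the following. The recommender names, the assumption that scores are comparable across recommenders (say, normalized to 0-1), and the `blend` function itself are all hypothetical; production systems often use more careful fusion than a plain weighted sum.

```python
def blend(recommenders, weights, top_k=10):
    """Combine several recommenders' scores with per-user weights.

    recommenders: {name: {book: score}}; weights: {name: weight}.
    A recommender with no weight for this user contributes nothing.
    """
    combined = {}
    for name, scores in recommenders.items():
        weight = weights.get(name, 0.0)
        for book, score in scores.items():
            combined[book] = combined.get(book, 0.0) + weight * score
    ranked = sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))
    return [book for book, _ in ranked[:top_k]]
```

The same two recommenders then produce different lists for a user weighted toward the content-based model and one weighted toward the graph-based model, which is the point of keeping them separate.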
maaaaattttt 17 hours ago [-]
Speaking of TF-IDF, I once added it “after” the recommendations to downscore items that were too popular and tended to be recommended too much/with too many other items (think Beatles/iPhone) and inversely for more niche items. It might be too costly to do depending on how you generate the recommendations though.
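One way to read that post-hoc adjustment is as an IDF-style multiplier applied to scores that already exist. The sketch below assumes you know how many readers shelved each book; the function name and inputs are made up, and the exact weighting used in the original system is not stated.

```python
import math


def downscore_popular(scores, reader_counts, total_readers):
    """Re-rank books by score times an IDF-like factor log(N / n_readers).

    scores: {book: recommender score}; reader_counts: {book: readers who
    shelved it}. Books shelved by almost everyone get a factor near zero,
    while niche books are boosted.
    """
    adjusted = {}
    for book, score in scores.items():
        n_readers = max(reader_counts.get(book, 1), 1)
        adjusted[book] = score * math.log(total_readers / n_readers)
    return sorted(adjusted, key=lambda book: (-adjusted[book], book))
```

Because this only touches the final candidate list, its cost depends on the number of candidates rather than the size of the catalogue.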
diffeomorphism 1 days ago [-]
The robots.txt is pretty explicit that this scraping is "disallowed"
I agree. As a frequent reviewer on Goodreads, this feels really icky.
psandor 23 hours ago [-]
You are right.
At the same time, everything you ever posted online has already been scraped by hundreds (maybe thousands) of entities and distributed/sold to countless other entities. The only difference is that OP shared his project here.
sputr 1 days ago [-]
Why would it be unethical?
This obsession with "everything must be commercialized" is really killing creativity.
Now if the author was commercializing other people's reviews, sure, it's potentially(!) unethical.
But scraping a website for reviews that are publicly(!) posted, training a recommendation LLM and then sharing it, for free, seems ... exactly the ideal use case for this technology.
paulnpace 1 days ago [-]
It is truly criminal that such a bright and brilliant model of ethics, Amazon, should endure such an attack.
diffeomorphism 1 days ago [-]
Unethical behavior does not become good just because it happens to hurt "bad people" (or more accurately, companies bought by bad people).
paulnpace 1 days ago [-]
Using a sword to stab someone is evil, therefore, stabbing someone who is stabbing me with a sword is evil?
wongarsu 24 hours ago [-]
Another factor is that Amazon is big enough that crawling a minor website under their umbrella for a noncommercial project is unlikely to notably affect them.
Stabbing people with swords is evil, unless they are so big that to them it's at worst a light poke with a fork
contravariant 22 hours ago [-]
If it's unethical it's not because of what the robots.txt says.
Blindly violating it is bad manners, but deliberately scraping a single website over a month isn't the worst.
blehn 2 days ago [-]
You should filter out authors from the input books in the output. If I liked a book by an author, surely I'd read more of their work if I wanted to; recommending them isn't helpful. Along the same lines, I think interesting recommendations tend to be the ones that (1) I like and (2) I didn't expect. The more similar the recommendations are to the input, the more likely I already know them, and the more likely to create a recommendation echo chamber.
Semaphor 1 days ago [-]
> You should filter out authors from the input books in the output.
No, or at least make it configurable.
I’d agree for series, but not for Authors, just because I once read a book by someone doesn’t mean I even know they have other stuff, the list of Authors I read and enjoyed is very long.
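Both positions can be accommodated if the filter is a pair of switches applied after the model runs. A minimal sketch, assuming each book is a dict with `title`, `author`, and an optional `series` key (a hypothetical schema, not the site's):

```python
def filter_recommendations(recs, input_books, exclude_authors=True, exclude_series=True):
    """Drop recommendations the user has already covered.

    recs and input_books are lists of dicts with 'title', 'author' and an
    optional 'series' key. Books already in the input are always dropped;
    same-author and same-series filtering are configurable.
    """
    seen_titles = {book["title"] for book in input_books}
    seen_authors = {book["author"] for book in input_books}
    seen_series = {book["series"] for book in input_books if book.get("series")}

    kept = []
    for rec in recs:
        if rec["title"] in seen_titles:
            continue
        if exclude_authors and rec["author"] in seen_authors:
            continue
        if exclude_series and rec.get("series") in seen_series:
            continue
        kept.append(rec["title"])
    return kept
```

With `exclude_authors=False, exclude_series=True` you get the middle ground from this subthread: other work by the same author still shows up, but not the next volume of a series already on the list.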
martin82 7 hours ago [-]
I don't agree at all.
VERY few authors write consistently good books.
If you liked one book by an author, it is not at all likely that you will like the other books as well. For example, Neal Stephenson is probably my favorite author alive today, but I hate almost half of his books.
The only author that I can think of where I read and liked every single book was Terry Pratchett, and that might have been a case of "I was still young and easy to impress".
Ntrails 21 hours ago [-]
Agree entirely - more excluding series than authors but both should be options.
I also need a way to describe its recommendations as "meh". For example, if I put Gone Girl in, I get Girl on a Train. Which, personally, I thought was bad. I want to exclude that from all future rec sets, and ideally align my preferences to the intersection of liked A and disliked B. vOv
honkycat 2 days ago [-]
yep, was gonna say this. Getting recommended all of the same books I've already read isn't great
yoz-y 2 days ago [-]
It works pretty well in the sense that after inputting only a few quite diverse books it gave me recommendations for a lot of books that I’ve already also read and enjoyed.
I would also really like a possibility to add negative signal. It did also recommend books that seemed interesting to me but I ultimately didn’t like.
Overall quite impressive.
varenc 2 days ago [-]
I love this site, and the approach! Great seeing someone making good use of Goodreads data.
Sadly my experience with the book recommender isn't too great because of the 64 book limit. If I import either the most recent or least recent 64 books, 95% of the books it recommends to me are books I've read. Though it was helpful for spotting a few books I've read that I didn't log on Goodreads. Guess I'm pretty consistent.
costco 2 days ago [-]
I think I will expand the input books limit (sadly requires retraining) and/or the output books limit of 30.
varenc 16 hours ago [-]
I ended up playing with it more and found the recommender useful! I just removed a bunch of books of a certain theme, then got a bunch of good recs for the theme that remains.
Hilift 23 hours ago [-]
This seems like it should be an easy task for an AI to implement. For example, the question "what is the most helpful rated negative review of the book 'Original Sin' by Jake Tapper?" There are obvious and prominent "helpfulness" ratings of reviews, but they don't seem to be scraped, at least not by Gemini. Additionally, Gemini reports seemingly inaccurate or minimal-effort information:
"It is difficult to pinpoint a single "most helpful" negative review of Original Sin by Jake Tapper, as helpfulness ratings on platforms like Amazon or Goodreads are dynamic and subjective, and the provided search results *do not include specific user reviews with their respective helpfulness votes*."
One suggestion would be to make the search less strict on diacritics. Searching for popular cook J. Kenji López Alt was only successful if I entered the correct O.
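Diacritic-insensitive matching is usually done by normalizing both the query and the indexed names before comparing. A small sketch with Python's standard `unicodedata` module follows; the `fold` and `search` helpers are illustrative names, and how the site's search is actually implemented is unknown.

```python
import unicodedata


def fold(text):
    """Lowercase text and strip combining marks (accents) from it."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def search(query, names):
    """Return the names that contain the query, ignoring case and accents."""
    needle = fold(query)
    return [name for name in names if needle in fold(name)]
```

With this, typing "lopez" finds "López". Letters that are not a base letter plus an accent (for example "ø" or "ß") are not decomposed by NFKD and would need their own mapping.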
sosuke 22 hours ago [-]
Feature request: Combine book series into a single entity. Bummer getting recommendations for another book in the same series as one I already liked and read.
Feature request: Exclude books already in shelf. This is harder I'm sure. I've got 1146 books in my Read shelf.
aj_hackman 2 days ago [-]
Thank you! Because of this, "The Making of Prince of Persia: Journals 1985–1993" by Jordan Mechner is on its way to my house.
mentos 1 days ago [-]
Here’s a human recommendation: if you like that, you may like these that I’ve read:
1. Sid Meier’s Memoir!: A Life in Computer Games — Sid Meier
2. Source Code: My Beginnings — Bill Gates
3. Build: An Unorthodox Guide to Making Things Worth Making — Tony Fadell
4. Prince of Persia: The Journals — Jordan Mechner
5. A Theory of Fun for Game Design — Raph Koster
6. Ask Iwata: Words of Wisdom from Satoru Iwata, Nintendo’s Legendary CEO — Viz Media (Editor)
7. Control Freak: My Epic Adventure Making Video Games — Cliff Bleszinski
8. Once Upon Atari: How I Made History by Killing an Industry — Howard Scott Warshaw
9. Press Reset: Ruin and Recovery in the Video Game Industry — Jason Schreier
10. Masters of Doom: How Two Guys Created an Empire and Transformed Pop Culture — David Kushner
aj_hackman 23 hours ago [-]
Thank you for the recs, Masters of Doom really lit a fire inside of me as a young programmer. I'd also recommend "Soul of a New Machine" by Tracy Kidder.
qingcharles 2 days ago [-]
You definitely will not regret that purchase. It's a very enjoyable read.
mcbrit 2 days ago [-]
I don't know. I entered, trying to be popular but at least slightly? opinionated:
Tigana, Hyperion, A Fire Upon the Deep, Blindsight, Moby Dick
and I got a list. Sure, read all that or wasn't interested for reasons, I added (only Neuromancer on initial recommendations):
If I provide that list, a (real) person doesn't ask me if I've read the Hobbit.
teaearlgraycold 2 days ago [-]
I don’t think past liked books are nearly enough information to provide a good book for you today. You need a lot more information about the state of someone’s mind.
mcbrit 2 days ago [-]
You're talking to a dude. (in my case.) I mentioned 8 books.
I won't tell you exactly what to do, but one way to do it is to measure your surprise with me choosing each of those 8 books when you provide a recommendation back to me of what I should read next. I think I get kind of that experience talking to someone about books.
The algorithm didn't do that.
teaearlgraycold 2 days ago [-]
Talking to someone about books gives you so much more information than a book list. Their expressions, their accent, their energy level, their clothes, and many other things help to provide supplemental information.
boplicity 1 days ago [-]
Please remove my reviews from your LLM model and training data. Thank you.
mparnisari 1 days ago [-]
Why?
NitpickLawyer 2 days ago [-]
Interesting. I tested it with sci-fi, and it definitely recommends good books, but not sure how accurate it is at surfacing the sub genres / themes. For example for [aurora -ksr, seveneves, project hail mary, ender's game] it gave me dune. Which is a great book, but not in the "first-ish contact" style I hoped it would be.
Another thing I noticed is that it tends to recommend 2nd and 3rd books in a series, which is a bit so-so. If I add the first book in a series, I probably already read the whole series...
28304283409234 2 days ago [-]
Came here to say this (recommending book 2 and 3 in a trilogy). Great app otherwise!
zeroq 2 days ago [-]
Great work!
Some five years ago I was daydreaming about a recommendation engine for movies where you could say "hey Ciri, give me a good gangster flick", and it would come up with something that you haven't seen yet but you'd definitely love.
To my amazement almost everyone, even true AI believers, thought it was impossible to achieve. :d
But my question is - having such a huge dataset, do we really need AI for it? SASRec/RAG is sexy, but could the same result be achieved with simple ranking and intersections like lastfm did in the past with music?
Some twenty years ago I came up with an idea for a "brain" data structure for recommendations where you have all your items (books, movies or articles) modeled as a graph, and whenever you pick something it makes a ripple effect, raising the scores of every adjacent item in a cascade.
Just like your brain works - when you stumble upon something new it immediately brings back memories of similar things from the past. I never had the opportunity to implement it and test it in a real-life scenario, but I'd be surprised if a variant of this is not widely used across different recommendation systems, like Amazon's.
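One possible reading of the ripple idea, as a rough sketch: a breadth-first spread from the picked item where the boost shrinks at each hop (often called spreading activation). The function name `ripple` and the `boost`, `decay` and `max_depth` parameters are assumptions for illustration, not anything from a real system.

```python
from collections import defaultdict


def ripple(graph, start, boost=1.0, decay=0.5, max_depth=3):
    """Spread a score boost outward from `start`, weakening at each hop.

    graph: dict mapping an item to a list of adjacent items.
    Returns a dict of item -> score; each item is boosted once,
    at its shortest hop distance from `start`.
    """
    scores = defaultdict(float)
    frontier = {start}
    visited = {start}
    for depth in range(max_depth + 1):
        amount = boost * (decay ** depth)
        for item in frontier:
            scores[item] += amount
        # Collect unvisited neighbours for the next, weaker wave.
        next_frontier = set()
        for item in frontier:
            for neighbor in graph.get(item, []):
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.add(neighbor)
        frontier = next_frontier
        if not frontier:
            break
    return dict(scores)
```

Summing the returned scores over every item a user has picked would give a ranking of unseen items; how edges are built (shared readers, shared genre, and so on) is left open here.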
krisoft 1 days ago [-]
> To my amazement almost everyone, even true AI believers, thought it was impossible to achieve.
I mean. Is it possible? Hey Zeroq give me a good gangster flick please?
How will you know what i have seen and what I haven’t yet? How will you know what movies I like? Are there even good gangster flicks i would enjoy and haven’t seen yet?
The way the problem is phrased it sounds like your dream recommender has two properties: “it doesn’t receive any other information than what is in the prompt” and “it always recommends a movie you haven’t seen and will enjoy”. Those together are what makes your dream recommendation engine impossible. If you relax those then of course it is possible. That’s just a recommendation engine.
esafak 2 days ago [-]
last.fm used a primitive machine learning algorithm too, else what are you going to rank by?
zeroq 1 days ago [-]
Did they? I recall a similar site from as far back as 2008. Might be them or something similar.
Anyway. I can totally see such a site running purely on statistics. Every song, every artist, every genre is a bucket. You listen to a song, you put a drop in these buckets. Once there's enough water running we can compare you to other users and their buckets.
It might be hard to run it at scale in real time, but c'mon, in terms of complexity it's a junior-level leetcode assignment.
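A rough sketch of the bucket idea, assuming each user's listening history is kept as a counter of (kind, name) buckets and users are compared with cosine similarity. The names `add_listen` and `cosine` are made up for this example, and cosine is just one reasonable way to compare buckets.

```python
import math
from collections import Counter


def add_listen(buckets: Counter, song: str, artist: str, genre: str) -> None:
    """Put one 'drop' into the song, artist, and genre buckets."""
    buckets[("song", song)] += 1
    buckets[("artist", artist)] += 1
    buckets[("genre", genre)] += 1


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bucket-count profiles (0.0 to 1.0)."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Recommendations would then come from the fullest buckets of the most similar users that the current user has not filled yet.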
slipperybeluga 23 hours ago [-]
[dead]
majormajor 2 days ago [-]
Neat! It's a validation of the model that 75%+ of the recommendations are things I've read and also enjoyed, with a few "read, didn't like" and some more "didn't read, don't really want to."
But I think to break the content-bubble effects to find the longer tail, some way to reject or blacklist things - and have that be taken into effect in the model - might help.
robertritz 2 days ago [-]
To add to this Youtube afaik uses multiple models to sprinkle in new content alongside your usual recommendation just for this.
daemonologist 1 days ago [-]
Likewise, I put in six of my favorites and had already read (and enjoyed) 29 of the 30 recommendations (I'll have to check out Blindsight by Watts). Working great but it would be cool - as with pretty much every recommendation algorithm ever - to have more of a "discovery" capability.
MattGrommes 2 days ago [-]
This is cool but I'd love the option to filter out the author of the book you entered. I put in Shroud by Adrian Tchaikovsky and almost all the books are others by him, which is fine but doesn't really mix up the stuff I'm reading.
wtf242 19 hours ago [-]
That's super cool. I launched a book recommendations feature this year, which works vastly different. I ask users what their favorite books are(which you can rank), and then allow them to import their goodreads data which includes star reviews, and books they have read, then I determine your favorite style of books based on genres and subjects, then use opensearch to find similar books. It's a lot more complicated than that, but seems to work well. I'm always looking for ideas on how to improve this feature. Interested in what you are actually doing on the backend on the how-it-works page. thanks!
It's much more powerful if you're a member. You can restrict the results to certain genres or book lengths, as well as published date ranges, etc. If someone wants to try out the more powerful feature, DM me and I'll mark your account as a member.
The best way I’ve found for finding predictably enjoyable fiction is to read interviews with the authors I like, and read about the works and authors they admire or are influenced by. Or who they exchanged letters or communications with, if they’re long dead and no interviews proper exist.
Strongly recommend giving that a try yourself. And trying to build an algorithm around it!
Here’s an example: Tolstoy really admired Turgenev, who was friends with Theodore Storm and Gustave Flaubert, and greatly admired Gogol.
If you like Anna Karenina you’ll probably find something of value in Torrents of Spring, Immensee, Madame Bovary or Dead Souls.
It spiders out pretty quickly!
sexylibrarian 1 days ago [-]
We've been working on this data set since 2016 and have it covered! Our app is on test flight in private beta and will be sharing it very soon xo
contravariant 22 hours ago [-]
For the intersect page you probably want to order the users by the size of their shelf. For some more obscure combinations I'm mostly getting users who read 10,000s of books, which is less useful than the users with <1000 books.
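A small sketch of that ordering, assuming shelves are available as sets of book ids keyed by user id. `rank_intersect` is a hypothetical name and says nothing about how the intersect page is actually implemented.

```python
def rank_intersect(users, wanted):
    """Return users who have every wanted book, smallest shelf first.

    users: dict of user id -> set of book ids on their shelf.
    wanted: set of book ids the searcher asked to intersect on.
    """
    # Keep only users whose shelf contains all of the wanted books.
    hits = [uid for uid, shelf in users.items() if wanted <= shelf]
    # Smaller shelves first; the user id breaks ties deterministically.
    return sorted(hits, key=lambda uid: (len(users[uid]), uid))
```

A softer variant would score each user by overlap divided by shelf size instead of sorting purely on size.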
simlevesque 2 days ago [-]
The How it works is way too short :) I'd love to see some scripts, know the hardware you use, etc...
By the way you could use Summa FTS Wasm + Duckdb Wasm to have the same website without any backend except file hosting. Maybe even just Duckdb Wasm with its FTS would be enough. Summa FTS is very similar to meilisearch in essence because they're both derived from Tantivy.
I use a Hetzner server with Ryzen 7 3700X and an SSD.
I think I could get the model to work with ONNX web but it'd be a 2GB download so the user experience wouldn't be too great. My Meilisearch index is ~40GB but I don't know how much that could be compressed down.
Cool so you stole my data and now you're bragging about it?
greenie_beans 24 hours ago [-]
so sick!! always wanted to do something like this but would use it for my saas (https://bookhead.net), so i'm hesitant to scrape goodreads due to their terms of service.
in addition to the goodreads data, i think it would be interesting to add an author's favorite books by scraping their paris review interviews (and other interviews) and using that as a "review" because i've learned about so much good stuff through an artist mention.
and in a retail context, i've always wondered if a recommendation engine like this could have its own "flavor" based on a specific store's customer buying history. like if a bookstore's customers were weighted in the algorithm so that their similarities scores were given preference. much of what a bookstore carries is based on their customers' taste. you could use the goodreads etc as the base recommendations and then train it on a bookstore's sales history.
a project like this is a bit outside of my expertise but i have a tiny bit of knowledge about it, and now i have a lot more to learn after seeing this. if anybody has any good book recommendations (hehe) or papers i should read to learn about these sort of systems, please let me know!
thank you for sharing!!
foresterre 2 days ago [-]
It seems to work decently even with just one or two titles for popular titles, but less so for the niche.
For example, the title "Impro: Improvisation and the Theatre" by Keith Johnstone, linked by another article posted to HN today gives back the following suggestions:
- Truth in Comedy: The Manual of Improvisation by Charna Halpern
- Steve Jobs by Walter Isaacson
- 1984 by George Orwell
- Harry Potter and the Sorcerer's Stone (Harry Potter, #1) by J.K. Rowling
- Sapiens: A Brief History of Humankind by Yuval Noah Harari
- The Alchemist by Paulo Coelho
- The Tipping Point: How Little Things Can Make a Big Difference by Malcolm Gladwell
- Dune (Dune, #1) by Frank Herbert
It's a bit unfortunate that all suggestions are fairly popular titles, which are fairly easy to find, while the unpopular or niche may be just as well written but a lot harder to find.
Within niche topics or books, it is also usually harder to provide multiple similar enough titles up front.
costco 2 days ago [-]
It's recommended that you put at least 3 books in. If you would like recommendations just based on one book, click the similar button on the book, it should take you to this page: https://book.sv/similar?id=297914
tgv 1 days ago [-]
It seems you cannot reject recommendations. Based on two books, the system showed reasonable recommendations, some of which I'd read, including one that I really didn't like. There must be useful information in that, too.
loremm 18 hours ago [-]
For intersect I also wonder if you add a filter that the books are within the top rated. Like if I give my favorite books and want to find someone who has my same taste, it doesn't help if they hated (all/most/some) of those books. Tricky in that not all users give star ratings
sajb 1 days ago [-]
You could color code results from the same author or from the same series as an already added book, since the user most likely already knows about them. Perhaps a toggle to filter these out altogether.
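Something along these lines, as a sketch: tag each recommendation with the reason it is probably already known, so the UI can either color it or hide it behind a toggle. The `Book` fields and the `flag_known` / `only_new` helpers are assumptions for illustration, since the site's actual data model is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Book:
    title: str
    author: str
    series: Optional[str] = None  # None for standalone books


def flag_known(inputs, recs):
    """Pair each recommendation with 'same series', 'same author', or None."""
    authors = {b.author for b in inputs}
    series = {b.series for b in inputs if b.series}
    flagged = []
    for rec in recs:
        if rec.series and rec.series in series:
            reason = "same series"
        elif rec.author in authors:
            reason = "same author"
        else:
            reason = None
        flagged.append((rec, reason))
    return flagged


def only_new(inputs, recs):
    """The 'filter these out altogether' toggle: keep unflagged books only."""
    return [rec for rec, reason in flag_known(inputs, recs) if reason is None]
```

The reason string maps naturally to a color in the results list, and `only_new` is the toggle.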
skayvr 2 days ago [-]
I've worked in recommender systems for a while, and it's great to see them publicized.
SASRec was released in 2018 just after the transformer paper, and uses the same attention mechanism but different losses than LLMs. Any plans to upgrade to other item/user prediction models?
costco 2 days ago [-]
I'm not an expert by any means but as far as sequential recommendations go, aren't SASRec and its derivatives pretty much the name of the game? I probably should have looked into HSTUs more. Also this / sparse transformers in general: https://arxiv.org/pdf/2212.04120
skayvr 2 days ago [-]
There's a few alternatives, but SASRec is a good baseline for next-item recommendation. I'd look at BERT4Rec too. HSTU is definitely a strong step forward, but stays in the domain of ID models. HSTU also seems to rely heavily on some extra item information that SASRec does not (timestamps).
Other models include Google's TIGER model which uses a VAE to encode more information about items. Similar to how modern text-to-voice operates.
costco 2 days ago [-]
Thank you for the recommendations. I didn't try BERT4Rec because I assumed it would perform the same or worse as what I already had after having read https://dl.acm.org/doi/pdf/10.1145/3699521. The TIGER paper seems interesting - I definitely want to explore semantic IDs in general and also because I think it could allow including more long-tail items.
bigskydog 2 days ago [-]
Recommend OneRec which is an improvement of HSTU and it recently became open source
josvdwest 13 hours ago [-]
Super cool tool!
Feature request: Be able to import all my goodreads books, unread as well. Not only 64. Most of the recommendations were already on my shelf.
illdave 1 days ago [-]
This is really great - refreshing to have something that's instantly useful, with no need to signup/login. Really fast, immediately helpful - this is wonderful.
sodality2 2 days ago [-]
This is fantastic!!! I've added many results to my want-to-read list, they're very on-point from very few inputs. It would be really cool to import from a user ID, where you can choose some subset of your read list to inspire new suggestions, while excluding all books in your want-to-read and already-read lists. But that's an ongoing scrape to maintain, it's a cat and mouse game you probably don't want to start. I wonder what the legal status of scraped training data is... if you don't reproduce any of the review data I presume you're fine?
costco 2 days ago [-]
You can import the first or last 64 books of your read, to-read, or currently-reading shelves if you press the "Import Goodreads" button and provide your Goodreads ID.
sodality2 2 days ago [-]
D'oh, didn't even notice that button :P Wow, that greatly improved the recommendations, it even found a book I wouldn't say is particularly related to the others but I found it interesting-sounding. Thanks for such a cool site!!
fridental 1 days ago [-]
I've entered books from The Expanse and Lockwood & Co series and its output was not really overwhelming:
- other books from the series (duh, I don't need a recommender for that recommendation)
- Hobbit, Harry Potter, Asimov etc (duh, I like scifi and surely I've already read all the classic works).
jamesponddotco 2 days ago [-]
The recommendations are pretty good; even though I only input six books, it was enough for it to recommend books I have on my wish list. Definitely going to play around some more. Plus, the website is super fast, very impressive.
Any chance we could get an API going at some point? Are you planning to open source the work?
I'm interested in the scraping of Goodreads too. I'm building a book metadata aggregation API and plan on building a scraper for Goodreads, but I imagine using a data center IP address will be a problem very fast. Were you scraping from your home network?
costco 2 days ago [-]
Thank you for the compliments :) I used 50-100 datacenter proxies. I just logged requests made by the iOS app with Charles and then recreated the headers to the best of my ability though the server did not seem to be very strict at all. Worth noting though that static residential proxies are not too expensive these days anyways.
Re the API: The model does actually run fairly well on CPU so it probably wouldn't be too expensive to serve. I guess if there is demand for it I could do it. I think most social book sites would probably like to own their recommendation system though.
goatsi 2 days ago [-]
Speaking of sustained scraping for AI services, I found a strange file on your site: https://book.sv/robots.txt. Would you be able to explain the intent behind it?
costco 2 days ago [-]
I didn't want an agent to get stuck on an infinite loop invoking endpoints that cost GPU resources. Those fears are probably unfounded, so if people really cared I could remove those. /similar is blocked by default because I don't want 500000 "similar books for" pages to pollute the search results for my website but I do not mind if people scrape those pages.
dbl000 2 days ago [-]
I would love an API or the dataset if you could share it somehow! Just to play around with my own book lists.
rapatel0 2 days ago [-]
So I tried a few disparate books independently:
- Guns Germs and steel
- The Alchemist
- The Ramayana
(a few others)
Harry Potter and the sorcerers stone came up in all of them near the top. :D
costco 2 days ago [-]
> Note 1: If you only provide one or two books, the model doesn't have a lot to work with and may include a handful of somewhat unrelated popular books in the results. If you want recommendations based on just one book, click the "Similar" button next to the book after adding it to the input book list on the recommendations page.
xkbarkar 2 days ago [-]
Have nothing to add that hasn’t already been commented. Like the entries in the add list stay.
Other than that, my recommendation list keeps coming up with books I have already read and loved and I am hitting the limit :(.
So filtering would be great,
I have seen a few versions of the same books listed more than once.
Loved this. Hope you get to tune it a little.
Also, thank you for not ruining the site with a single popup, email subscription list offer, chatbot, wheelspin from hell anywhere.
Blessings from the popup hating part of the interwebs.
foota 1 days ago [-]
Many years ago I built a D3 graph based explorer for movies using the IMDB API. You would start by entering a movie title and it would pull up the similar movies from IMDB and you could click them and see more similar ones and they would all be connected based on similarity. It was very fun!
easywood 1 days ago [-]
Thank you very much, I have always wished for something like this to be part of Goodreads itself. The intersect function especially will help me find hidden gems that other likeminded people have found. I'm looking forward to finding out what books I have missed all my life.
mring33621 23 hours ago [-]
FEEDBACK:
I should be able to mark recommended books as "Read, Liked"; "Read, Didn't Like"; "Remove, Other Reason"
and then allow a rerun of Recs, based on additional info
_virtu 2 days ago [-]
Hey OP I’m building a bookclub app. Do you happen to have an api I could plug into? I’d love to add this to our member suggestions section.
laszlojamf 1 days ago [-]
I gave it a spin with Gravity's Rainbow. Most of the recommendations are what you'd expect, Pynchon himself, Don Delillo, David Foster Wallace... and then right at the end... The Hobbit, or There and Back Again >.<
nsypteras 2 days ago [-]
I'm impressed it recommended so many books i've already read and liked! I have a big reading backlog but once it's whittled down I will likely come back to this. One feature request would be to also show a "why this is recommended" for each recommendation so I can further narrow down the list for what I'm looking for
spullara 2 days ago [-]
I would love to be able to filter the resulting list, certainly by removing all books that are in the same series, but I think removing all books by authors that I have already listed would also be great for getting new things that I haven't already read. The resulting recommendations maybe included 1 new book for me.
qingcharles 2 days ago [-]
I put in a bunch of books and hit recommendations and... I'd already read 95% of them, so at least we know it works well! (checking out the other 5% now)
p.s. one idea: when you click [Add] on the recommended books list, it should remove it from that list
p.p.s. if there is a way to filter out the spam "Summary of ____" books, that would be good too
jacquesm 2 days ago [-]
I have a hard time remembering titles of books I've read if they are not directly related to the subject matter. No problem remembering the content though. With movies I remember both.
aaronax 23 hours ago [-]
Can you create a list of the most common book names? I think it would be funny to have a shelf of books that all have the same name.
maxglute 1 days ago [-]
Would be nice to not recommend books by author already added for novel discoverability.
mhb 18 hours ago [-]
Do you have any thoughts on how SASRec compares with the SVD-based Cinematch algorithm?
nickthesick 2 days ago [-]
I have a web app https://bookhive.buzz which is a GoodReads alternative based on BlueSky’s protocol. I scrape all of the book data from Goodreads too.
I would love to be able to add a recommendation system based on this.
noir_lord 2 days ago [-]
It has a tendency to recommend books in the same series as are input (putting aside that if I like a book in a series I've likely already read the series).
It did suggest Murderbot Diaries (not on the input but a series I have read and did like) and an Adrian Tchaikovsky I hadn't read :).
costco 2 days ago [-]
It's explicitly trained to predict the next book read in a sequence, which is why you get that behavior. There's probably a better way for me to handle it rather than having 5 books from the same series tend towards the top though.
noir_lord 2 days ago [-]
If you have the data to know the other books in a series maybe split the results so you have "books in series" in one column and "books not in a series mentioned" in the other but other than that it did a better job than Kindle recommendations which are often hilariously off the mark.
bananaflag 2 days ago [-]
Yeah the hardest problem for recommendation systems is to find non-Star Wars books which are like some specific Star Wars books and unlike some other Star Wars books. I would say it's AGI-complete ;)
noir_lord 2 days ago [-]
Ironically that is one of the few uses where I've found an LLM to actually be useful.
ChatGPT does a fairly good job at letting you negate/refine whatever it was you were looking for.
dylan604 2 days ago [-]
I would expect a recommend of Star Trek if it were AGI-complete just to troll
cyrusradfar 2 days ago [-]
I think this is cool and super fast -- kudos on whatever tech you needed to tackle to make it so.
I don't see anyone saying safety or ethics, so I'll just put it out there that it has some safety and ethical considerations you should consider.
Consider "inflammatory" books and how they could be used to harm a group of people. Although I recognize folks post this "publicly", I think the intersection feature provides more than Goodreads.
Let's say, people who have read "Mein Kampf" & "The Anarchist Cookbook" or some other combination that says "Antifa" to the current regime.
I'd recommend you have a list that you consider private, always and allow Users to add to that list so it's more scalable. If folks try to intersect with anything in that list, you can warn that you don't allow intersection with private books.
Anyway, super fun demo!
costco 1 days ago [-]
Thank you for the kind remarks. I would say your request is reasonable and it’s something I thought about but it’s worth noting that if you scroll down on a Goodreads book page you can see all of the users who gave it 1 star, 2 stars, etc (not just reviews, ratings too). And in fact Goodreads does not tell people this but these lists include private users. So someone looking to cause controversy would probably just find an incendiary book and look for everyone who gave it 5 stars. My data does not include people who marked their account as private or only visible to those who are signed in and also does not give information on stars. It is also very easy to remove your data if you are so inclined. I only log request URLs but I can tell that many thousands of people visited the site and less than 10 opted out despite “Remove My Data” being prominently displayed and referenced in the attached text to my post.
Also, I don’t use them but based on forum posts I read I think other services like LibraryThing or Storygraph do expose similar information about book readers.
zeroq 2 days ago [-]
It always baffled me how we censor "Mein Kampf" but we - as a society - are super fine with either Alex Jones shouting about lasers and lizard people or Joe Rogan leaving an open mic to people claiming there are nuclear plants and space stations buried under pyramids [1].
Mein Kampf is an absolutely terrible piece of literature, not because of its message but because of its quality. It's exactly something I would expect to find in Alex Jones' cell if we were to sentence him to a year of solitary confinement.
[1] just a tiny exaggeration
colechristensen 2 days ago [-]
I don't want "safety" or "ethics" if the requirements for them are banning or hiding books based on somebody's ideology whether or not it agrees with mine.
cyrusradfar 1 days ago [-]
I'm not asking him to hide the book recommendations, I'm talking about the "intersect" user feature not doxxing people reading them.
colechristensen 20 hours ago [-]
I mean if that's the case a person not wanting other people to know they're reading something probably shouldn't put that on a tool designed to share what you're reading for other people (Goodreads).
I don't think anybody needs to be protected from themselves sharing things intentionally.
billfruit 1 days ago [-]
I feel that the last added book in one's list seem to have more influence on the recommendations, which results in a rather similar type of recommendations.
costco 1 days ago [-]
This is a result of the use of positional embeddings, which typically results in the final item being weighted very highly. The problem is that this information is shown to be very relevant to the task of predicting the next item interacted with. If you add more books the effect of this is somewhat diluted.
Jayakumark 2 days ago [-]
Cool work, How much did it cost to train ? Will the source for training be open source ?
6031769 1 days ago [-]
We won't know how much it cost until after the court case.
aidenn0 1 days ago [-]
I have "too many books" to add more to books I like and I've read almost all of the recommendations...
comrade1234 2 days ago [-]
I gave up on goodreads reviews. I've been burned too many times by highly rated books that weren't that good. If you're into (horny) ya romance fantasy then goodreads is great, but it's not for me. I haven't really found a substitute.
owenversteeg 2 days ago [-]
Any broadly used ratings system is total garbage. Goodreads ratings, Google Maps ratings, Amazon reviews, Vivino for wine, et cetera. Even assuming the reviews are real and genuine, most people just aren’t good at writing reviews, and the handful that are often have wildly different criteria than you. Someone already commented with one enthusiast site - and sure, enthusiast sites are often better than the mainstream option (see also: CellarTracker for wine) but honestly my advice is to get good at determining the quality of the thing yourself. For books there are a ton of hints about what you’ll be getting. “NYT Bestseller”, “xyz book club”, certain publishers, who’s quoted on the back, when was it published, who wrote it? All of those things can help you rapidly identify books. I personally dislike most modern books and prefer the “classics”, so a lot of this is only useful as a negative signal, but even then there are positive signals, for example a reference to a much older book.
HeinzStuckeIt 2 days ago [-]
GR is also great if you are into academic nonfiction, Classics, poetry, etc. The site does, after all, let you track and review any publication with an ISBN. What my peers and I use it for is worlds apart from the romance novel or LGBT young-adult book reviewing community that often puts GR in the news, and far away from all the drama that rages around genre fiction.
jamesponddotco 2 days ago [-]
I'm not into the social aspect, so Goodreads was never an option, but Hardcover[1] seems like a pretty good alternative.
Considering how much treasure has been poured into building recommendation engines for just about everything online, books have always been very difficult for me to find recommendations that work. Interested to try it!
SomeUserName432 1 days ago [-]
I've made multiple attempts to use "AI" to find the names of books I've read in the past. Never managed to find a single one. Quite disappointed with that. I've read so many books that I have no idea what they were called.
smcleod 1 days ago [-]
Very neat! By chance have you open sourced the code and model anywhere? I'd be really keen to have a play with this.
cfraenkel 2 days ago [-]
FYI, on this android tablet (android v12 / FF 144.0.2), the 'start typing a book title...' field doesn't do anything. On the Mac, it brings up a list of matches to select from.
mayahisali 23 hours ago [-]
Curious about how you trained the model on a billion reviews. What architecture did you use?
nwhnwh 2 days ago [-]
I entered "Alone Together: Why We Expect More from Technology and Less from Each Other" and I received books about Steve Jobs, Harry Potter and "The Subtle Art of Not Giving a F*ck". Like how???
These seem to fit the description you are going for better. The model is trained to predict the next book in the sequence. Those other books you listed happen to be very popular, so in the absence of information about you (only having 1 book), the model will tend to recommend those.
BeetleB 2 days ago [-]
> Provide 3+ books for best results.
jimmoores 2 days ago [-]
I unexpectedly liked this. I thought the recommendations were actually useful.
parkersweb 2 days ago [-]
I sadly didn’t share that experience - I fed it my goodreads most recent - but it largely picked up on 2 or 3 series I’ve been slowly working my way through so that most of the recommendation list was ALL the other books in the series (and the spin-off series) so I didn’t really get anything useful…
__alexander 2 days ago [-]
Care to share the scraped data? I would love to play around with it.
So you're ok with stealing the data yourself but not ok with providing it to others, ironic.
demaga 2 days ago [-]
I am not sure about legal side of things here, but a Kaggle dataset would be really cool
guelo 2 days ago [-]
I'm surprised he got that much data. Goodreads uses several tricks to try to stop scrapers, for example pagination only works up to a few pages.
jacquesm 2 days ago [-]
They might send him a bill for use of resources.
cjaackie 2 days ago [-]
I’m wondering about how ethical it is to load down a resource in this way, open to opinions. There is a mention “I didn’t hammer down the servers” but what does that really even mean? The site isn’t being used as intended and just curious how other people feel about that.
fennec-posix 2 days ago [-]
Very neat. Even found a couple Cold War-setting books to read and an entire series of 6 books on the same topic, All from searching up Team Yankee.
Thanks for the new reading list :D
thinkcontext 3 days ago [-]
I'm impressed! It didn't take many books for it to start suggesting other books that I liked and it showed me several solid choices I'm adding to my queue.
logicprog 1 days ago [-]
I tested the model out, and I can attest that it's very on point! It gets the assignment
esafak 2 days ago [-]
It is interesting that you chose a contextual recommender when you would think book affinity is not very susceptible to context. Did you try other models too?
iamcreasy 1 days ago [-]
Very fast. Thanks for building it.
Besides the title, I'd like to provide a suggestion of what type of books I am looking for.
caro_kann 1 days ago [-]
I tried books from three different genres, and for each I got a recommendation for 1984 by George Orwell.
djent 19 hours ago [-]
Why not scrape the content of the books too?
mpern 1 days ago [-]
I read the title as "I scraped 38 book reviews". Time to get reading glasses...
hmokiguess 20 hours ago [-]
People of HN:
- Remove my data, unethical, my words are mine! Argh!
Also People of HN:
- I built an HN aggregator that shows sentiment analysis of comments and . . .
calebt3141 23 hours ago [-]
This is fantastic. Thanks for sharing this.
djoldman 2 days ago [-]
Can you share the details about the Meilisearch instance? How big is the box and database size?
costco 2 days ago [-]
Everything (namely Meilisearch, Postgres and the web server in Go) besides the model inference is running on a Hetzner server with a large SSD and an "AMD Ryzen 7 3700X 8-Core Processor." The data.ms directory is about 40GB. Once the HN traffic dies down I will probably move the model back to the Hetzner server so I don't have to pay $0.15/hour for an A4000.
skerit 2 days ago [-]
Please make this for tv series too!
Llamamoe 1 days ago [-]
My impressions are similar to those of others:
- It seems to mostly show me books I've already read and know of, including sequels of what I added, which isn't very useful.
- It ultimately seems to prioritize "highest rated in category" too much, rather than focusing more on what made my chosen books stand out over others.
- Needs a "disliked books" list, especially when the recommendations show me a lot of superficially similar books I hated. I'd like to blacklist them.
- Would be cool to have a discovery mechanism for less popular and even obscure titles. Again, the top picks of each category are very well-known.
- Might not be practical, but I'd like some way to filter by specific features of reviews. E.g. prioritize reviews that say "the MC is a psychopath/murderhobo/rapist" higher for anti-recommendation, ignore reviews that say "whiny character", etc.
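One way to approximate the "discovery mechanism" bullet above, as a rough sketch and not the site's actual ranking: subtract a popularity penalty from each candidate's relevance score before sorting. The function name, the `alpha` knob, and the toy titles and numbers are all made up for illustration.

```python
import math

def rerank_for_discovery(candidates, alpha=0.5):
    """Re-rank (title, relevance, num_ratings) tuples, discounting popular titles.

    adjusted = relevance - alpha * log10(1 + num_ratings)
    alpha is a hypothetical knob: 0 keeps the original order,
    larger values push obscure titles up.
    """
    def adjusted(item):
        _title, relevance, num_ratings = item
        return relevance - alpha * math.log10(1 + num_ratings)

    return sorted(candidates, key=adjusted, reverse=True)

# Toy data (invented): one blockbuster, one niche title.
candidates = [
    ("Very Popular Book", 5.0, 999_999),
    ("Niche Book", 4.0, 99),
]

plain = [title for title, _, _ in rerank_for_discovery(candidates, alpha=0.0)]
discovery = [title for title, _, _ in rerank_for_discovery(candidates, alpha=0.5)]
print(plain)
print(discovery)
```

With `alpha=0.0` the blockbuster stays on top; with `alpha=0.5` the niche title overtakes it. A real system would need to tune the penalty so it does not surface junk.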
stichers 1 days ago [-]
I don't get what "Add" does
hasbot 1 days ago [-]
Try it and see! But, it adds the recommended book to the list of selected books.
stevage 2 days ago [-]
This is great. would be really nice to be able to reject suggestions though.
mna_ 1 days ago [-]
I typed "Introduction to Real Analysis by Bartle" and I got:
Steve Jobs by Walter Isaacson
Harry Potter and the Sorcerer's Stone (Harry Potter, #1) by JK Rowling
Topology by James R Munkres
and so on. Munkres' book is relevant and I want to read it, but what have Steve Jobs and Harry Potter got to do with mathematics?
> Note 1: If you only provide one or two books, the model doesn't have a lot to work with and may include a handful of somewhat unrelated popular books in the results. If you want recommendations based on just one book, click the "Similar" button next to the book after adding it to the input book list on the recommendations page.
mna_ 7 hours ago [-]
Oh, my mistake. That's quite cool. I'll most probably use your site over Goodreads.
franticgecko3 1 days ago [-]
They have nothing to do with mathematics but everything to do with being extremely popular books.
Most people that have read a mathematics textbook have also read and enjoyed Harry Potter.
Given you have enjoyed drinking water and breathing in the past, there is a high likelihood that you will enjoy watching the Star Wars films.
SilverSlash 1 days ago [-]
Bless this man! I despise Goodreads but continue to use it, because there are no real alternatives. It feels like they outsourced the creation of that website to some cheap consulting agency in a low cost location and then left it at that. For example, Goodreads hasn't updated its outdated version of React in years.
For a while now I have really wanted good book recommendations matching my tastes. The LLMs suck at this (likely due to the mode collapse that Karpathy mentioned in his excellent podcast appearance on Dwarkesh) and Amazon is very good but only recommends based on the current book you're browsing.
I will try this out now! But could you increase the number of books fed to the recommender or maybe get the top-64 highest rated books instead of just the most or least recent 64?
the_coffee_bean 2 days ago [-]
Amazing work. Do you plan on publishing the training code?
mhb 1 days ago [-]
Import doesn't work for me.
runnr_az 23 hours ago [-]
same. will stop back by in a day or two when the site has a chance to recover
lifeisstillgood 1 days ago [-]
Goodreads - “hey those user written comments belong to us, you need to pay us”
HNUser - “OpenAI told you to go swivel until they made a billion and you accepted that. Samesies “
momocowcow 2 days ago [-]
Whatever I put in, it wants me to read Sapiens :_(
sigh No thanks. I'd rather not have my comments used by random LLMs.
bossyTeacher 19 hours ago [-]
You scraping Goodreads to make a replacement is like your company making you train the new hire so he can replace you eventually
freen 20 hours ago [-]
The commons of the creative output of humanity is a resource, just like oil or lithium.
We are rapidly replaying the worst of the Resource Curse.
jauntywundrkind 2 days ago [-]
Where do nice scrapes like this end up? Are there BitTorrents out there for scrapes like this?
Honestly this would finally be the web2.0 we all wanted & hoped for. It's against majesty that it's all captured owned user content that is legally captured by essentially public message boards/sites.
dbingham 2 days ago [-]
See, now this is an excellent use of LLMs (if we're going to be using them at all). Low stakes if it gets shit wrong, but can provide some really useful and surprising answers!
One request: it would be nice to not have to add Goodreads, since I don't use it. I'd love to be able to enter a couple of book titles or an author and just get recommendations!
costco 2 days ago [-]
You don't have to import your Goodreads profile. You can type titles and authors in the box and find books to add to the list that way.
Invictus0 23 hours ago [-]
The problem with recommender engines is they're always recommending the most popular books that are in the same vein as what you've already read. So you're always getting pop-culture pap and not actually-interesting, somewhat more niche and unrelated books that are only tangentially related. The recommendations I got were all pop-psych stuff and other titles by the authors I've already read.
deanc 1 days ago [-]
Looks cool, and no bullshit. Please let us filter recommendations. If I put in a non-fiction book I'm probably looking for recommendations of other non-fiction books :)
conartist6 1 days ago [-]
SHAME. Gross. Morally bankrupt. Greedy.
tristor 2 days ago [-]
Two bugs to know about. First, you are using a deprecated API call that fails in Firefox. Second, you are using an HTTP endpoint that fails to upgrade to HTTPS to call the GoodReads API, which also fails with HTTPS-Only enabled in both Chrome and Firefox.
The idea seems good, but since I can't import my GoodReads successfully, it's hard for me to try
costco 2 days ago [-]
I use `fetch` on relative endpoints so that's odd. There shouldn't be any external API calls on my website other than whatever the Cloudflare captcha uses. I also use HTTPS-only in Chrome and did not experience any issues. I just tested Firefox with HTTPS-only on/off and Safari on my phone and I was able to import shelves for multiple users. Are you sure that you do not have any privacy settings on (can you access your shelf in Incognito mode)?
submeta 2 days ago [-]
Like the idea! Wondering: Weren’t the early LLMs trained on data in Goodreads as well? I can upload and ask ChatGPT as well, and it will give me similar recommendations, no?
piskov 1 days ago [-]
Tried and unfortunately it was meh.
Svoka 24 hours ago [-]
Honestly, with this I see the same results as with any other recommendation system - I type in some nice sci-fi/fantasy I've read, and it gives me generic sci-fi/fantasy I've already read. Even the ones I really didn't like.
I add those I like to the list, ignoring those I didn't, and in the end I just end up with recommendations I've already read and didn't like.
I feel like I wasted my time with yet another smart recommendation system.
brailsafe 2 days ago [-]
In some sense, it seems to work well, but the results are sort of nothing special and that's not what I'd personally hope for. I put in three books that are unrelated and got results that compare to a standard book store, either from the same series or other meme startup tech bro recommendations that I'd often literally see on the same shelf. I can't say it's not good, because obviously that's how people browse books and that's what you'd get from reviews, which is perhaps why I never consult reviews for anything.
I put in Thinking in Systems and got a bunch of engineering management stuff which I don't care about. Deep work of course gave me all the rich dad poor dad, steve jobs bio, tim ferriswheel crap which shouldn't surprise me at all. Girl with the dragon tattoo gave me the rest of the series.
Thematic similarity + popularity just seems boring, I'd like something that surfaces unusual deep cuts that I wouldn't necessarily find at the book store on the same shelf, but maybe that I could find if I went to a great library and might be out of print, or that I could find on libgen.
With these:
- Thinking In Systems: A Primer
- Paddle to the Amazon: The Ultimate 12,000-Mile Canoe Adventure
- The Elements of Typographic Style
I was kind of hoping to at least get "Grid Systems in Graphic Design" or something, but mostly got Alchemist, Zen', Into the Wild, almost comically mainstream cuts that of course in some cases I've already read or could find in a Cupertino trash can, not that any of them are not worth reading necessarily, but very typical.
An option to surface rarer choices that combine signals from all the books on the list would be neat, like in the above case, the least read real adventure book that somehow touches on the economics of places travelled through with musings about signage or that just happens to use a similar prose that Robert Bringhurst used to make print design theory not dull. Recommendations that only someone with a real sweaty and weird venn diagram of genuine personal deep interests might conjure up, and that a normal person might say "why the hell would I ever read that" but that otherwise amazing books that are just slept on and might never have found a market, or maybe thematically dissimilar+ conceptually similar in aggregate + unpopular. I'd like to be able to input a seed of inspiration that I haven't been able to find the next deeper step in, rather than all the books on how to start a startup in the garage I don't have. If it's James Hoffman's book on brewing coffee at a high level, I wouldn't want another YouTubers book on brewing coffee at a high level, I'd want the Physics of Filter Coffee, or something in an adjacent sphere grid / tree branch that gives me a way to pursue depth AND breadth but not necessarily the same book by someone else, or the same book with different characters. If I've found a seedling or a mushroom, I'd like to explore the root system of that fruiting body, and then at a certain point find a new seedling based on what I've learned so far, or the one video with 50 views that's somehow the best explanation of how to handle back-pressure in highly concurrent systems after I've realized that I don't know shit about concurrency, but not so deep in the stack that I can't bridge the gap; make the series for me.
Granted, my take here might just be an indictment of reviews in general, or at least those sourced from a generic site like goodreads/amazon which is all about popularity and armchair criticism.
costco 2 days ago [-]
I would agree the results are generally OK but do not feel magical in most cases (I think in some specific cases they do though). The results can be not great if you add books across many disciplines. For instance if you add "The Elements of Typographic Style" and "The Design of Everyday Things" (https://book.sv/#671857,18518), you do get "Grid Systems in Graphic Design" but under its German name "Rastersysteme für die visuelle Gestaltung."
brailsafe 1 days ago [-]
Wow, quite impressive actually, some of those additional entries are quite good as well, with Ellen Lupton, Steve Krugman, Erik Spiekermann, and then also 1984 and Mark Manson for some reason. Would the latter few be because they're just super widely read/recommended/connected regardless of genre?
Great project anyway, thanks for sharing and responding to my stream of consciousness, I certainly didn't mean to be insulting and hope my tone didn't come across that way.
costco 1 days ago [-]
Yeah, the latter are just because they are popular. If you have 3+ books you tend to get less random popular books included
6stringmerc 24 hours ago [-]
Wow I wonder how fun conversations in the afterlife with Aaron are going to be for the OP. There are ways to improve broken systems. Pimping them out with glee is not one of them in my book.
Everything about this concept I hate and it’s difficult not to conflate that with the creator. I make comparisons and equivocations. This is an ethical discussion akin to “Can you enjoy Bill Cosby comedy knowing he was a Rapist” and I’m not being glib.
[0] https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn
These were some of the notable elements (worth noting that none mention breaching terms of service):
> Damages: Judgment in the amount of $500,000 is entered against hiQ, with all other monetary relief waived.
> CFAA liability: hiQ stipulates that LinkedIn experienced losses sufficient to, and “may establish liability” under a CFAA civil claim “based on hiQ’s data collection practices and based on hiQ’s direct access to password-protected pages on LinkedIn’s platforms using fake accounts.”
> California “CFAA”: hiQ stipulates that LinkedIn “may establish civil liability” under California’s state-law counterpart to the CFAA based on hiQ’s data collection practices, use of fake accounts and other means to evade detection by LinkedIn, hiQ’s direct access to password-protected pages on LinkedIn’s platforms using fake accounts, and hiQ’s unauthorized commercial use of data.
> Trespass: hiQ stipulates that LinkedIn has established judgment as to liability under California law for the common law torts of trespass to chattels and misappropriation.
> Irreparable harm: hiQ stipulates that LinkedIn has established that it has suffered an irreparable injury and that LinkedIn satisfied the remaining factors and is entitled to a permanent injunction.
https://natlawreview.com/article/hiq-and-linkedin-reach-prop...
In America, you can simply pay to not lose any lawsuit ever, and thus never have to face legal consequence or changes to the law you don't like.
This was part of the terms of the settlement.
Google for example has a TOS and is well known for permanently banning accounts for real or imagined or AI-generated violations of it. Google banning you for breaking TOS doesn't mean you broke the law, just that you broke their rules, which apparently include a clause against being in the wrong place at the wrong time.
I imagine that a contract in which someone agrees to become a slave would be void.
Your textbook versus reality conceptualization of things is dogshit. It’s exploitation to do what OP did. You’re endorsing it and minimizing the ethics and this certainly shall poison the well from which you drink. Godspeed.
Out of curiosity, is your point about TOS out of concern for the poster or for Goodreads?
So does it feel to you guys like your comments, say, here in this Hacker News thread should be considered effectively copyrighted as your personal IP?
If so, do you feel the same way about opinions you share out in a supermarket or on the street?
> If so, do you feel the same way about opinions you share out in a supermarket or on the street?
Well, being novel isn't the only criterion for copyright; the work must also be "fixed", and opinions in a supermarket usually aren't (but they can be, if I film them and post them on reels or something; then the video itself is copyrighted)
https://copyrightalliance.org/education/copyright-law-explai...
> Fixation
> To meet the fixation requirement, a work of authorship must be fixed in a tangible medium of expression. Protection attaches automatically to an eligible work the moment the work is fixed. A work is considered to be fixed so long as it is sufficiently permanent or stable to permit it to be perceived, reproduced, or otherwise communicated for a period of more than transitory duration.
(not a lawyer)
You could try to argue that this falls under "create derivative works from any materials or content accessible on the Service" but even then it seems really flimsy to say that recommending books based on Goodreads reviews is an infringement.
It's just not that different to a youtuber saying "I read reviews for 50 books, here's the ones to read"
I visit your garden and take 1000 apples from your tree.
Not that different.
And I only charge a tiny subscription for access to all my drone-managed orchards, you can eat as many apples as you want. But don’t steal any and start your own orchard or I sue.
Replace your drones with China or India and you have the current situation in the US.
Apple farmers go out of business so you lose the people who create new varieties.
Steal cuttings, not the fruit, if you plan to start an orchard. From 1000 apples you'll get ~10 000 seeds, statistically you won't even end up with one good tree.
> An output of three cultivars from around 50.000 seeds means that 17.000 seeds were needed to get one cultivar. Only one out of around 9.000 scab resistant seedlings showed the appropriate quality to become a cultivar. This proportion underlines the enormous effort which is necessary to develop a new cultivar.
https://orgprints.org/id/eprint/13698/1/220-225.pdf
I suspect we need to wait for the NYT (and others) case to be decided before we know whether scraping sites in contravention of their terms is also fair use for LLM training.
My own opinion (as someone who creates written content on an occasional professional basis) is that if you can’t monetise your content in some other way than blocking people from accessing it then your content probably isn’t as valuable as you think.
But at the same time that’s tricky when it’s genuine journalism, as in NYT’s case.
Obviously user generated content reviewing books online is rather different because the motivation of the reviewers was (presumably) not to generate money. And, indeed, with goodreads there’s a strong argument that people have already been screwed over after their good faith review submissions were packaged up as an asset and flogged to Amazon. A lot of people were quite upset by that when it happened a decade or so back.
So from a ‘moral arguments’ perspective I don’t think scraping goodreads is as problematic as other scraping examples.
(Sorry, none of this was aimed at you - your comment just got me thinking and it seemed as good a place as any to put it!)
If content is publicly available that does not necessarily mean it’s free of copyright control: the justification for using the reviews to train an LLM would be based on the fact that fair use means it is not an infringement of copyright. But if the publisher has terms that forbid scraping then that may mean the fair use argument is undermined if it is predicated on the content being legitimately obtained. I’m not a lawyer but it’s quite easy to see how “books can be used for LLM training under fair use but not if you pirate them” extends to “content on the web can be used for LLM training under fair use but not if you’ve breached the terms set out by the publisher”.
IMO, your definition is overbroad
* UI - once someone clicks "Add" you really should remove that item from the suggested list - it's very confusing to still see it.
* Beam search / diversification -- Your system threw like 100 books at me of which I'd read 95 and heard of 2 of the other 3, so it worked for me as a predictor of what I'd read, but not so well for discovery.
I'd be interested in recommendations that pushed me into a new area, or gave me a surprising read. This is easier to do if you have a fairly complete list of what someone's read, I know. But off the top of my head, I'm imagining finding my eigenfriends, then finding books that are either controversial (very wide rating differences amongst my fellow readers) or possibly ghettoized, that is, some portion of similar readers also read this X or Y subject, but not all.
Anyway, thanks, this is fun! Hook up a VLM and let people take pictures of their bookshelf next.
This is how filmaffinity works, which is the best recommendation system I've tried. They have a group of several dozen 'soulmates', which are users with the most similar set of films seen and ratings given; recommendations are other stuff they also liked, and you get direct access to their lists.
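A minimal sketch of the "soulmates" idea described above, under heavy assumptions: it ignores ratings entirely and uses plain Jaccard overlap of shelves, which is a simplification of my own and not how filmaffinity or this site is known to work. All user names and book ids are invented.

```python
def jaccard(a, b):
    """Overlap between two sets of book ids, from 0.0 to 1.0."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def soulmate_recommendations(my_books, shelves, k=2):
    """Pick the k users whose shelves overlap most with mine, then
    count how many of them read each book I have not read."""
    ranked = sorted(shelves, key=lambda u: jaccard(my_books, shelves[u]), reverse=True)
    soulmates = ranked[:k]
    votes = {}
    for user in soulmates:
        for book in shelves[user] - my_books:
            votes[book] = votes.get(book, 0) + 1
    # Most-voted first; ties broken alphabetically for a stable order.
    recs = sorted(votes, key=lambda b: (-votes[b], b))
    return soulmates, recs

# Toy shelves (invented).
shelves = {
    "ann": {"A", "B", "C", "D"},
    "bob": {"A", "B", "D", "E"},
    "cat": {"X", "Y"},
}
soulmates, recs = soulmate_recommendations({"A", "B", "C"}, shelves, k=2)
print(soulmates, recs)
```

Here "ann" and "bob" come out as the soulmates and "D" ranks first because both of them read it. Returning the soulmate list alongside the recommendations mirrors the "direct access to their lists" part.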
>then finding books that are either controversial or possibly ghettoized
Naively, I’d say the surprises are going to be better if you filter for more different friends, rather than more controversial books among your friends. As in “find me a person that’s like me only in some ways, tell me what they love”. Long term this method is much better at exposing you to new ideas rather than just finding your clique's holy wars.
To be useful, the "Intersect" page should have:
- find near matches when there is no exact match with every book,
- ignore fake users (can any human read 80k books in many languages?),
- do not ignore users' votes (my input was books I liked, I expected to find users that rated them highly).
With the "Recommend" page I had the same problem as the GP, and all the recommendations were useless. To fix that, I think some features are needed:
- do not list books by authors from my list (I don't need recommendations for them),
- add a button for marking a suggested book as "disliked" (at the bare minimum, it should remove it from the suggestions, and ideally it should influence the suggestions as much as a "liked" book),
- do not suggest several books by the same author,
- add a button to hide a suggestion or show more suggestions (there were dozens of books I'd read but wouldn't rate high).
Fake users I would agree should be filtered, but I don’t think filtering out users who gave it a bad review is necessarily the intended behavior. If I put in 3 semi obscure Russian history books, I am presumably looking for someone who is an expert in Russian history to see what else they read. In that case I don’t care whether they liked one of the books or not. Approximate matches would require something like LSH or cosine similarity of the average input book embedding against the average embedding of the read books of every user, which I think wouldn’t work well for retrieving anyone with a moderately long interaction history.
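For anyone curious what the cosine-similarity variant mentioned above could look like, here is a toy sketch in plain Python. The 2-D embeddings, user names, and function names are invented for illustration; a real system would use learned embeddings and an approximate nearest-neighbour index.

```python
import math

def mean_vector(vectors):
    """Element-wise average of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_readers(input_books, user_shelves, embeddings, top_n=1):
    """Rank users by cosine similarity between the mean embedding of the
    input books and the mean embedding of each user's shelf."""
    query = mean_vector([embeddings[b] for b in input_books])
    scored = []
    for user, shelf in user_shelves.items():
        profile = mean_vector([embeddings[b] for b in shelf])
        scored.append((cosine(query, profile), user))
    scored.sort(reverse=True)
    return [user for _score, user in scored[:top_n]]

# Toy 2-D embeddings (invented): first axis ~ history, second ~ sci-fi.
embeddings = {
    "hist1": [1.0, 0.0], "hist2": [0.9, 0.1],
    "scifi1": [0.0, 1.0], "scifi2": [0.1, 0.9],
}
user_shelves = {
    "historian": ["hist1", "hist2"],
    "generalist": ["hist1", "scifi1", "scifi2"],
}
best = nearest_readers(["hist2"], user_shelves, embeddings)
print(best)
```

The "generalist" also read a history book, but averaging their whole shelf drags the profile toward sci-fi. That is the dilution problem for users with long histories in miniature.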
So I think the problem was not in the existence of similar readers, but in the way to reach them. Few people that read classical books log in to Goodreads (I don't) and even fewer input what they've read over the past decades.
I did notice that when I put in a single book in a series (in my case Going Postal, Discworld #33) that tended to dominate the rest of the selection. That does make sense, but I don't want recommendations for a series I'm already well into.
Also noticed that a few books (Spycraft by Nadine Akkerman and Pete Langman, Tribalism is Dumb by Andrew Heaton) that I know are in goodreads and reviewed didn't show up in the search. I tried both the authors' names and the titles of the books. Maybe they aren't in the dataset.
It did stumble with some more niche books (The Complete Yes Minister). Trying the "Similar" button gave me more books that were _technically_ similar because they were novelizations of British comedy shows, but not what I was looking for.
For more common books though it lined up very well with books already on my wishlist!
The recommendations from other authors are good, but as far as I can tell I’ve read every single one of them.
Continuing to aggressively add everything it recommends eventually does seem to result in some interesting books I wasn’t familiar with, but I also end up with more and more books that are of zero interest to me.
For what it’s worth, I started with:
It is possible that there simply aren’t many books like these in existence, so the pool of relevant recommendations gets exhausted fairly quickly. I’d guess trending towards unrelated popular books is also just a feature of the source data, that largely sums up my experience with goodreads anyway.
Very cool project though. I did end up ordering a couple of new books, so thank you very much.
I’ve only had a short play but a solution to this problem might be to show authors rather than books. Or select authors outside of the list the user has shared and then a top n (1,3,5) for each of those.
I feel like that’s how you’d recommend to someone else - type of book -> unknown author -> best matching few books from them.
After that the other side would be trying to find some diversity (if you think I’d like author X, personally you might suggest three different styles of book from them rather than three very similar books from them)
If you haven't already read it, you might like Lawrence Durrell's Antrobus [1].
[1] https://www.goodreads.com/book/show/759709.Antrobus_complete
My advice from someone who has built recommendation systems: Now comes the hard part! It seems like a lot of the feedback here is that it's operating pretty heavily like a content-based system, which is fine. But this is where you can probably start evaluating on other metrics like serendipity, novelty, etc. One of the best things I did for recommender systems in production is having different ones for different purposes, then aggregating them together into a final list. Have a heavy content-based one to keep people in the rabbit hole. Have a heavy graph-based one to try and traverse and find new stuff. Have one that is heavily tuned on a specific metric for a specific purpose. Hell, throw in a pure TF-IDF/BM25/Splade based one.
The real trick of rec systems is that people want to be recommended things differently. Having multiple systems that you can weigh differently per user is one way to achieve that; usually one algorithm can't quite do that effectively.
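To make the "multiple systems, weighted per user" idea concrete, here is a minimal sketch. The recommender names, scores, and weight profiles are hypothetical, and it assumes each recommender already emits scores on a comparable 0-1 scale, which in practice takes some normalization work.

```python
def blend(rec_lists, weights):
    """Combine several recommenders' scores using per-user weights.

    rec_lists: {recommender_name: {book: score}}
    weights:   {recommender_name: weight}  (missing names count as 0)
    """
    combined = {}
    for name, scores in rec_lists.items():
        w = weights.get(name, 0.0)
        for book, score in scores.items():
            combined[book] = combined.get(book, 0.0) + w * score
    # Highest combined score first; ties broken alphabetically.
    return sorted(combined, key=lambda b: (-combined[b], b))

# Toy outputs from two hypothetical recommenders.
rec_lists = {
    "content": {"sequel": 1.0, "adjacent": 0.5},
    "graph": {"adjacent": 0.5, "surprise": 1.0},
}

rabbit_hole = blend(rec_lists, {"content": 1.0, "graph": 0.0})
explorer = blend(rec_lists, {"content": 0.25, "graph": 1.0})
print(rabbit_hole)
print(explorer)
```

The same two recommenders produce opposite orderings depending on the weight profile, so one user can stay in the rabbit hole while another gets pushed toward new territory.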
https://www.goodreads.com/robots.txt
So legalities aside, this seems unethical.
At the same time, everything you ever posted online has already been scraped by hundreds (maybe thousands) of entities and distributed/sold to countless other entities. The only difference is that OP shared his project here.
This obsession with "everything must be commercialized" is really killing creativity.
Now if the author was commercializing other peoples reviews, sure, it's potentially(!) unethical. But scraping a website for reviews that are publicly(!) posted, training a recommendation LLM and then sharing it, for free, seems ... exactly the ideal use case for this technology.
Stabbing people with swords is evil, unless they are so big that to them it's at worst a light poke with a fork
Blindly violating it is bad manners, but deliberately scraping a single website over a month isn't the worst.
No, or at least make it configurable.
I’d agree for series, but not for Authors, just because I once read a book by someone doesn’t mean I even know they have other stuff, the list of Authors I read and enjoyed is very long.
VERY few authors write consistently good books.
If you liked one book by an author, it is not at all likely that you will like the other books as well. For example, Neal Stephenson is probably my favorite author alive today, but I hate almost half of his books.
The only author that I can think of where I read and liked every single book was Terry Pratchett, and that might have been a case of "I was still young and easy to impress".
I also need a way to describe its recommendations as "meh". For example, if I put Gone Girl in, I get Girl on a Train. Which, personally, I thought was bad. I want to exclude that from all future rec sets, and ideally align my preferences to the intersection of liked A and disliked B. vOv
I would also really like a possibility to add negative signal. It did also recommend books that seemed interesting to me but I ultimately didn’t like.
Overall quite impressive.
Sadly my experience with the book recommender isn't too great because of the 64-book limit. If I import either the most recent or least recent 64 books, 95% of the books it recommends to me are books I've read. Though it was helpful for spotting a few books I've read that I didn't log on Goodreads. Guess I'm pretty consistent.
"It is difficult to pinpoint a single "most helpful" negative review of Original Sin by Jake Tapper, as helpfulness ratings on platforms like Amazon or Goodreads are dynamic and subjective, and the provided search results *do not include specific user reviews with their respective helpfulness votes*."
https://share.google/aimode/OnWrGe4j508c4u3gh
One suggestion would be to make the search less strict on diacritics. Searching for popular cook J. Kenji López Alt was only successful if I entered the correct O.
Feature request: Exclude books already in shelf. This is harder I'm sure. I've got 1146 books in my Read shelf.
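The diacritics suggestion above is usually handled by folding both the query and the indexed text before comparing. A small sketch using only Python's standard `unicodedata` module; the helper names `fold` and `matches` are my own, and this says nothing about how the site's search is actually configured.

```python
import unicodedata

def fold(text):
    """Lowercase and strip combining marks so 'López' matches 'lopez'."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

def matches(query, candidate):
    """Substring match that ignores case and diacritics."""
    return fold(query) in fold(candidate)

print(fold("J. Kenji López-Alt"))
print(matches("lopez", "J. Kenji López-Alt"))
```

NFKD decomposition splits "ó" into "o" plus a combining acute accent, and the filter then drops the accent. Letters that have no decomposition (for example "ø") are left untouched, so this is not a complete transliteration scheme.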
1. Sid Meier’s Memoir!: A Life in Computer Games - Sid Meier
2. Source Code: My Beginnings - Bill Gates
3. Build: An Unorthodox Guide to Making Things Worth Making - Tony Fadell
4. Prince of Persia: The Journals - Jordan Mechner
5. A Theory of Fun for Game Design - Raph Koster
6. Ask Iwata: Words of Wisdom from Satoru Iwata, Nintendo’s Legendary CEO - Viz Media (Editor)
7. Control Freak: My Epic Adventure Making Video Games - Cliff Bleszinski
8. Once Upon Atari: How I Made History by Killing an Industry - Howard Scott Warshaw
9. Press Reset: Ruin and Recovery in the Video Game Industry - Jason Schreier
10. Masters of Doom: How Two Guys Created an Empire and Transformed Pop Culture - David Kushner
Tigana, Hyperion, A Fire Upon the Deep, Blindsight, Moby Dick
and I got a list. Sure, read all that or wasn't interested for reasons, I added (only Neuromancer on initial recommendations):
Neuromancer, VALIS, Quantum Thief, Towing Jehovah.
List did not get more interesting.
Book recommendations are still kind of difficult.
I won't tell you exactly what to do, but one way to do it is to measure your surprise with me choosing each of those 8 books when you provide a recommendation back to me of what I should read next. I think I get kind of that experience talking to someone about books.
The algorithm didn't do that.
Another thing I noticed is that it tends to recommend 2nd and 3rd books in a series, which is a bit so-so. If I add the first book in a series, I probably already read the whole series...
Some five years ago I was daydreaming about a recommendation engine for movies where you could say "hey Ciri, give me a good gangster flick", and it would come up with something that you haven't seen yet but you'd definitely love.
To my amazement almost everyone, even true AI believers, thought it was impossible to achieve. :d
But my question is - having such a huge dataset, do we really need AI for it? SASRec/RAG is sexy, but could the same result be achieved with simple ranking and intersections like lastfm did in the past with music?
Some twenty years ago I came up with an idea for a "brain" data structure for recommendations where you have all your items (books, movies or articles) modeled as a graph, and whenever you pick something it makes a ripple effect, raising the scores of every adjacent item in a cascade.
Just like your brain works - when you stumble upon something new it immediately brings back memories of similar things from the past. I never had the opportunity to implement it and test it in a real-life scenario, but I'd be surprised if a variant of this is not widely used across different recommendation systems, like Amazon.
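For what it's worth, the ripple idea resembles what is usually called spreading activation. A rough sketch, assuming the item graph is a plain adjacency dict and that the boost halves with every hop; `ripple`, `decay` and `max_depth` are made-up names and values for illustration, not anything the poster or the site specified.

```python
from collections import defaultdict


def ripple(graph, picked, decay=0.5, max_depth=3):
    """Spread a score outward from `picked`, weakening by `decay` per hop.

    graph: dict mapping an item to a list of adjacent items (an assumption).
    Returns a dict of item -> accumulated boost, excluding `picked` itself.
    """
    scores = defaultdict(float)
    visited = {picked}
    frontier = {picked}
    boost = 1.0
    for _ in range(max_depth):
        boost *= decay
        next_frontier = set()
        for node in frontier:
            for neighbor in graph.get(node, ()):
                if neighbor not in visited:
                    # An item reachable from several frontier nodes
                    # gets boosted once per incoming edge.
                    scores[neighbor] += boost
                    next_frontier.add(neighbor)
        visited |= next_frontier
        frontier = next_frontier
    return dict(scores)


# Tiny made-up graph, purely for illustration.
example_graph = {
    "Dune": ["Hyperion", "Foundation"],
    "Hyperion": ["Dune", "Ilium"],
    "Foundation": ["Dune", "Ilium"],
    "Ilium": ["Hyperion", "Foundation"],
}
example_scores = ripple(example_graph, "Dune")
```

Calling `ripple` once per picked item and summing the results would give the cumulative effect described above; how edges are defined (shared readers, same author, same shelf) is the part that would actually decide the quality.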
I mean. Is it possible? Hey Zeroq give me a good gangster flick please?
How will you know what I have seen and what I haven’t yet? How will you know what movies I like? Are there even good gangster flicks I would enjoy and haven’t seen yet?
The way the problem is phrased it sounds like your dream recommender has two properties: “it doesn’t receive any other information than what is in the prompt” and “it always recommends a movie you haven’t seen and will enjoy”. Those together are what makes your dream recommendation engine impossible. If you relax those then of course it is possible. That’s just a recommendation engine.
Anyway. I can totally see such a site running purely on statistics. Every song, every artist, every genre is a bucket. You listen to a song, you put a drop in these buckets. Once there's enough water running we can compare you to other users and their buckets.
It might be hard to run it at scale in real time, but c'mon, the complexity is at the level of a junior leetcode assignment.
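A small sketch of the bucket idea, assuming each user is just a counter keyed by song, artist and genre, and that "comparing buckets" means cosine similarity between two counters. The function names and the choice of cosine similarity are assumptions for illustration, not how lastfm or any other site is known to do it.

```python
from collections import Counter
from math import sqrt


def add_listen(buckets: Counter, song: str, artist: str, genre: str) -> None:
    """Put one drop into the song, artist and genre buckets of a user."""
    buckets[("song", song)] += 1
    buckets[("artist", artist)] += 1
    buckets[("genre", genre)] += 1


def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two users' buckets (0.0 if either is empty)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

Recommendations would then come from whatever the most similar users have in their buckets that you don't. The pairwise comparison is the easy part; doing it across millions of users in real time is where the scale problem mentioned above shows up.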
But I think to break the content-bubble effects to find the longer tail, some way to reject or blacklist things - and have that be taken into effect in the model - might help.
here's my recommendations feature: https://thegreatestbooks.org/recommendations
It's much more powerful if you're a member. you can restrict the results to certain genres or book lengths, as well as published date ranges, etc. If someone wants to try out the more powerful feature, DM me and i'll mark your account as a member.
Here is the URL with your books: https://book.sv/#52752877,46049530,18437030,52480873,3260654...
Strongly recommend giving that a try yourself. And trying to build an algorithm around it!
Here’s an example: Tolstoy really admired Turgenev, who was friends with Theodore Storm and Gustave Flaubert, and greatly admired Gogol.
If you like Anna Karenina you’ll probably find something of value in Torrents of Spring, Immensee, Madame Bovary or Dead Souls.
It spiders out pretty quickly!
By the way you could use Summa FTS Wasm + Duckdb Wasm to have the same website without any backend except file hosting. Maybe even just Duckdb Wasm with its FTS would be enough. Summa FTS is very similar to meilisearch in essence because they're both derived from Tantivy.
https://izihawa.github.io/summa/quick-start/
I think I could get the model to work with ONNX web but it'd be a 2GB download so the user experience wouldn't be too great. My Meilisearch index is ~40GB but I don't know how much that could be compressed down.
Here's how the similar page for books is generated, which I forgot to mention on the "how it works" page: https://gist.github.com/chris124567/8d06d64bfe827cb7f6121f93...
in addition to the goodreads data, i think it would be interesting to add an author's favorite books by scraping their paris review interviews (and other interviews) and using that as a "review" because i've learned about so much good stuff through an artist mention.
and in a retail context, i've always wondered if a recommendation engine like this could have its own "flavor" based on a specific store's customer buying history. like if a bookstore's customers were weighted in the algorithm so that their similarities scores were given preference. much of what a bookstore carries is based on their customers' taste. you could use the goodreads etc as the base recommendations and then train it on a bookstore's sales history.
a project like this is a bit outside of my expertise but i have a tiny bit of knowledge about it, and now i have a lot more to learn after seeing this. if anybody has any good book recommendations (hehe) or papers i should read to learn about these sort of systems, please let me know!
thank you for sharing!!
For example, the title "Impro: Improvisation and the Theatre" by Keith Johnstone, linked by another article posted to HN today gives back the following suggestions:
- Truth in Comedy: The Manual of Improvisation by Charna Halpern
- Steve Jobs by Walter Isaacson
- 1984 by George Orwell
- Harry Potter and the Sorcerer's Stone (Harry Potter, #1) by J.K. Rowling
- Sapiens: A Brief History of Humankind by Yuval Noah Harari
- The Alchemist by Paulo Coelho
- The Tipping Point: How Little Things Can Make a Big Difference by Malcolm Gladwell
- Dune (Dune, #1) by Frank Herbert
It's a bit unfortunate that all suggestions are fairly popular titles, which are fairly easy to find, while the unpopular or niche may be just as well written but a lot harder to find.
Within niche topics or books, it is also usually harder to provide multiple similar enough titles up front.
SASRec was released in 2018, just after the transformer paper, and uses the same attention mechanism but different losses than LLMs. Any plans to upgrade to other item/user prediction models?
Other models include Google's TIGER model which uses a VAE to encode more information about items. Similar to how modern text-to-voice operates.
Feature request: Be able to import all my goodreads books, unread as well. Not only 64. Most of the recommendations were already on my shelf.
Any chance we could get an API going at some point? Are you planning to open source the work?
I'm interested in the scraping of Goodreads too. I'm building a book metadata aggregation API and plan on building a scraper for Goodreads, but I imagine using a data center IP address will become a problem very fast. Were you scraping from your home network?
Re the API: The model does actually run fairly well on CPU so it probably wouldn't be too expensive to serve. I guess if there is demand for it I could do it. I think most social book sites would probably like to own their recommendation system though.
- Guns, Germs, and Steel
- The Alchemist
- The Ramayana
(a few others)
Harry Potter and the Sorcerer's Stone came up in all of them near the top. :D
So filtering would be great,
I have seen a few versions of the same books listed more than once.
Loved this. Hope you get to tune it a little.
Also, thank you for not ruining the site with a single popup, email subscription list offer, chatbot, wheelspin from hell anywhere.
Blessings from the popup hating part of the interwebs.
I should be able to mark recommended books as "Read, Liked"; "Read, Didn't Like"; "Remove, Other Reason"
and then allow a rerun of Recs, based on additional info
p.s. one idea: when you click [Add] on the recommended books list, it should remove it from that list
p.p.s. if there is a way to filter out the spam "Summary of ____" books, that would be good too
I would love to be able to add a recommendation system based on this.
It did suggest Murderbot Diaries (not on the input but a series I have read and did like) and an Adrian Tchaikovsky I hadn't read :).
ChatGPT does a fairly good job at letting you negate/refine whatever it was you were looking for.
I don't see anyone mentioning safety or ethics, so I'll just put it out there that this raises some safety and ethical issues you should consider.
Consider "inflammatory" books and how they could be used to harm a group of people. Although I recognize folks post this "publicly", I think the intersection feature provides more than Goodreads.
Let's say, people who have read "Mein Kampf" & "The Anarchist Cookbook" or some other combination that says "Antifa" to the current regime.
I'd recommend you have a list that you consider private, always and allow Users to add to that list so it's more scalable. If folks try to intersect with anything in that list, you can warn that you don't allow intersection with private books.
Anyway, super fun demo!
Also, I don’t use them but based on forum posts I read I think other services like LibraryThing or Storygraph do expose similar information about book readers.
Mein Kampf is an absolutely terrible piece of literature, not because of its message but because of its quality. It's exactly something I would expect to find in Alex Jones's cell if we sentenced him to a year of solitary confinement.
[1] just a tiny exaggeration
I don't think anybody needs to be protected from themselves sharing things intentionally.
[1]: https://hardcover.app
These seem to fit the description you are going for better. The model is trained to predict the next book in the sequence. Those other books you listed happen to be very popular, so in the absence of information about you (only having 1 book), the model will tend to recommend those.
Thanks for the new reading list :D
Besides title, I'd like to provide suggestion on what type of books I am looking for.
Also People of HN: - I built an HN aggregator that shows sentiment analysis of comments and . . .
- It seems to mostly show me books I've already read and know of, including sequels of what I added, which isn't very useful.
- It ultimately seems to prioritize "highest rated in category" too much, rather than focusing more on what made my chosen books stand out over others.
- Needs a "disliked books" list, especially when the recommendations show me a lot of superficially similar books I hated. I'd like to blacklist them.
- Would be cool to have a discovery mechanism for less popular and even obscure titles. Again, the top picks of each category are very well-known.
- Might not be practical, but I'd like some way to filter by specific features of reviews. E.g. prioritize reviews that say "the MC is a psychopath/murderhobo/rapist" higher for anti-recommendation, ignore reviews that say "whiny character", etc.
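Several of the requests above (skip books already read, blacklist disliked ones, surface less popular titles) could be approximated as a post-processing step on top of whatever the model returns. A rough sketch, assuming each candidate arrives as a `(title, model_score, rating_count)` tuple and that popularity is penalized by the log of the rating count. The tuple layout, `rerank` and the `alpha` knob are all hypothetical, and nothing here is the site's actual pipeline.

```python
from math import log


def rerank(candidates, already_read=frozenset(), disliked=frozenset(), alpha=0.1):
    """Drop excluded titles, then penalize popularity before sorting.

    candidates: iterable of (title, model_score, rating_count) tuples (assumed shape).
    alpha: strength of the popularity penalty; 0 keeps the original order.
    """
    excluded = set(already_read) | set(disliked)
    adjusted = [
        (title, score - alpha * log(1 + rating_count))
        for title, score, rating_count in candidates
        if title not in excluded
    ]
    # Highest adjusted score first.
    return sorted(adjusted, key=lambda pair: pair[1], reverse=True)
```

Note that this only filters and reorders. It does not feed the dislike back into the model as a negative signal, which is the harder part of what is being asked for.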
Steve Jobs by Walter Isaacson
Harry Potter and the Sorcerer's Stone (Harry Potter, #1) by JK Rowling
Topology by James R Munkres
and so on.. Munkres' book is relevant and I want to read it, but what have Steve Jobs and Harry Potter got to do with mathematics?
https://book.sv/similar?id=211570
> Note 1: If you only provide one or two books, the model doesn't have a lot to work with and may include a handful of somewhat unrelated popular books in the results. If you want recommendations based on just one book, click the "Similar" button next to the book after adding it to the input book list on the recommendations page.
Most people that have read a mathematics textbook have also read and enjoyed Harry Potter.
Given you have enjoyed drinking water and breathing in the past, there is a high likelihood that you will enjoy watching the Star Wars films.
For a while now I have really wanted good book recommendations matching my tastes. The LLMs suck at this (likely due to the mode collapse that Karpathy mentioned in his excellent podcast appearance on Dwarkesh) and Amazon is very good but only recommends based on the current book you're browsing.
I will try this out now! But could you increase the number of books fed to the recommender or maybe get the top-64 highest rated books instead of just the most or least recent 64?
HNUser - “OpenAI told you to go swivel until they made a billion and you accepted that. Samesies “
https://book.sv/#2300585,644416
We are rapidly replaying the worst of the Resource Curse.
Honestly this would finally be the web2.0 we all wanted & hoped for. It's against majesty that it's all captured owned user content that is legally captured by essentially public message boards/sites.
One request: it would be nice to not have to add Goodreads, since I don't use it. I'd love to be able to enter a couple of book titles or an author and just get recommendations!
The idea seems good, but since I can't import my GoodReads successfully, it's hard for me to try
I add those I like to the list, ignoring those I didn't, and in the end I just end up with recommendations I already read and didn't like.
I feel like I wasted my time with yet another smart recommendation system.
I put in Thinking in Systems and got a bunch of engineering management stuff which I don't care about. Deep work of course gave me all the rich dad poor dad, steve jobs bio, tim ferriswheel crap which shouldn't surprise me at all. Girl with the dragon tattoo gave me the rest of the series.
Thematic similarity + popularity just seems boring, I'd like something that surfaces unusual deep cuts that I wouldn't necessarily find at the book store on the same shelf, but maybe that I could find if I went to a great library and might be out of print, or that I could find on libgen.
With these:
- Thinking In Systems: A Primer
- Paddle to the Amazon: The Ultimate 12,000-Mile Canoe Adventure
- The Elements of Typographic Style
I was kind of hoping to at least get "Grid Systems in Graphic Design" or something, but mostly got Alchemist, Zen', Into the Wild, almost comically mainstream cuts that of course in some cases I've already read or could find in a Cupertino trash can, not that any of them are not worth reading necessarily, but very typical.
An option to surface rarer choices that combine signals from all the books on the list would be neat, like in the above case, the least read real adventure book that somehow touches on the economics of places travelled through with musings about signage or that just happens to use prose similar to what Robert Bringhurst used to make print design theory not dull. Recommendations that only someone with a real sweaty and weird Venn diagram of genuine personal deep interests might conjure up, and that a normal person might say "why the hell would I ever read that" but that are otherwise amazing books that are just slept on and might never have found a market, or maybe thematically dissimilar + conceptually similar in aggregate + unpopular.
I'd like to be able to input a seed of inspiration that I haven't been able to find the next deeper step in, rather than all the books on how to start a startup in the garage I don't have. If it's James Hoffmann's book on brewing coffee at a high level, I wouldn't want another YouTuber's book on brewing coffee at a high level, I'd want the Physics of Filter Coffee, or something in an adjacent sphere grid / tree branch that gives me a way to pursue depth AND breadth but not necessarily the same book by someone else, or the same book with different characters.
If I've found a seedling or a mushroom, I'd like to explore the root system of that fruiting body, and then at a certain point find a new seedling based on what I've learned so far, or the one video with 50 views that's somehow the best explanation of how to handle back-pressure in highly concurrent systems after I've realized that I don't know shit about concurrency, but not so deep in the stack that I can't bridge the gap; make the series for me.
Granted, my take here might just be an indictment of reviews in general, or at least those sourced from a generic site like goodreads/amazon which is all about popularity and armchair criticism.
Great project anyway, thanks for sharing and responding to my stream of consciousness, I certainly didn't mean to be insulting and hope my tone didn't come across that way.
Everything about this concept I hate and it’s difficult not to conflate that with the creator. I make comparisons and equivocations. This is an ethical discussion akin to “Can you enjoy Bill Cosby comedy knowing he was a Rapist” and I’m not being glib.