• icosahedron@ttrpg.network
    1 year ago

    I tested it out just now. It seems to be much better than GPT-3.5, but just a little worse than GPT-4.

    I tested this: “Explain the plot of Cinderella in a sentence where each word has to begin with the next letter in the alphabet from A to Z, without repeating any letters.”

    GPT-4 consistently gets it right on the first try. It gave me: “A beautiful Cinderella, determined, endures, finds glass heel, invites joy; king’s love magically nurtures opulence, prince quests, restores slipper, transformation unveils, validates wistful xerophyte yearning zealously.”

    Claude 2.0 took 4 tries to get it right. It gave me these: “A beautiful young girl called Cinderella did evil for her fair godmother’s horrid instructions just kicking laundry maidens’ nice outfits purportedly quelling ridiculous step-sisters’ terrible underhanded villainy with x-rays yielding zilch.” and then “Awaited ball causes Downtrodden Ella’s fairy godmother’s help. I’ll just knock loud mean old prune’s quest, righteously sabotaging two ugly vicious women. Xanthippe yells, “Zounds!”” and then “Abandoned by cruel destiny, Ella finds godmother’s hidden invitation juxtaposing king’s lavish ball. Magic occurs, placing radiant slipper to unveil victorious wife. Xerxes yearns, “Zounds, she’s found!”” and then finally “Awaiting ball, Cinderella dreams every Friday godmother helps. Inside jousting knights leer maidens’ nice outfits. Prince quietly rescues sweetheart, though ugly vengeful wife x-rays yellow zebra.”
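
    For anyone who wants to check attempts like these without counting letters by hand, here is a minimal sketch of a checker. The function names and the word-splitting regex are my own assumptions (hyphenated words such as "x-rays" are treated as a single word); it simply collects the first letter of every word and compares the result against the alphabet.

```python
import re
import string

def first_letters(sentence: str) -> str:
    """Return the lowercase first letter of every word, in order."""
    # Assumption: a word is a run of letters that may contain internal
    # apostrophes or hyphens, so "king’s" and "x-rays" each count once.
    words = re.findall(r"[A-Za-z]+(?:['’\-][A-Za-z]+)*", sentence)
    return "".join(word[0].lower() for word in words)

def is_a_to_z(sentence: str) -> bool:
    """True if the words start with a, b, c, ... z exactly once each, in order."""
    return first_letters(sentence) == string.ascii_lowercase

gpt4_answer = (
    "A beautiful Cinderella, determined, endures, finds glass heel, invites joy; "
    "king’s love magically nurtures opulence, prince quests, restores slipper, "
    "transformation unveils, validates wistful xerophyte yearning zealously."
)
claude_first_try = (
    "A beautiful young girl called Cinderella did evil for her fair godmother’s "
    "horrid instructions just kicking laundry maidens’ nice outfits purportedly "
    "quelling ridiculous step-sisters’ terrible underhanded villainy with x-rays "
    "yielding zilch."
)
claude_fourth_try = (
    "Awaiting ball, Cinderella dreams every Friday godmother helps. Inside "
    "jousting knights leer maidens’ nice outfits. Prince quietly rescues "
    "sweetheart, though ugly vengeful wife x-rays yellow zebra."
)

for name, text in [
    ("GPT-4", gpt4_answer),
    ("Claude, try 1", claude_first_try),
    ("Claude, try 4", claude_fourth_try),
]:
    print(f"{name}: {is_a_to_z(text)}")
```

    This only checks the letter constraint, not whether the sentence actually summarizes Cinderella.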

    After that, I tried this prompt: “Andrew is free from 11 am to 3 pm, Joanne is free from noon to 2 pm and then 3:30 pm to 5 pm. Hannah is available at noon for half an hour, and then 4 pm to 6 pm. What are some options for start times for a 30 minute meeting for Ca[…]”

    GPT-4 responded with: “Given the availability of Andrew, Joanne, and Hannah, we need to find overlapping times for a 30-minute meeting.

    After cross-referencing their schedules, the potential start times that accommodate all three schedules are:

    12:00 pm to 12:30 pm: This slot is within Andrew’s free time (11 am - 3 pm), Joanne’s free time (12 pm - 2 pm), and Hannah’s first free half hour at noon. There is no other slot that fits within all three schedules, due to Joanne’s break from 2 pm to 3:30 pm and Hannah’s availability starting again only at 4 pm, when Andrew is no longer available. Therefore, there is just one option for a 30-minute meeting that works for all three of them.”

    Claude 2.0 responded with: “Okay, based on the availability you provided, here are some options for 30 minute meeting start times that work for Andrew, Hannah and Joanne: 11:30 am - This works because Andrew is free starting at 11am, Hannah is available at noon for half an hour, and Joanne is free from noon to 2pm. 3:30 pm - This works because Andrew is free until 3pm, Hannah is free from 4pm to 6pm, and Joanne is free from 3:30pm to 5pm. 4:30 pm - This also works for the same reasons as 3:30pm. Andrew is free until 3pm, Hannah is free from 4pm to 6pm, and Joanne from 3:30pm to 5pm. 5 pm - This could work as well since Andrew is free until 3pm, Hannah until 6pm, and Joanne until 5pm. It’s pushing the end of Joanne’s availability though. So in summary, the best options are 11:30 am, 3:30 pm or 4:30 pm to ensure there is overlap with all three people’s availability. Let me know if you need any clarification or have additional constraints!”

    I told Claude it is incorrect, and it proceeded to correct itself and tell me there is NO available time overlap. After several more attempts, it eventually gave up and asked me for the correct answer.
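
    The scheduling answer can be verified mechanically. Below is a rough sketch that intersects the three people's free intervals and keeps the windows long enough for the meeting. The helper names (`hm`, `intersect`, `meeting_windows`) are hypothetical, and times are stored as minutes since midnight purely for convenience.

```python
from itertools import product

def hm(hour: int, minute: int = 0) -> int:
    """Convert a 24-hour clock time to minutes since midnight."""
    return hour * 60 + minute

def intersect(a, b):
    """Intersect two lists of (start, end) intervals given in minutes."""
    overlaps = []
    for (s1, e1), (s2, e2) in product(a, b):
        start, end = max(s1, s2), min(e1, e2)
        if start < end:  # keep only non-empty overlaps
            overlaps.append((start, end))
    return sorted(overlaps)

def meeting_windows(schedules, duration):
    """Return (earliest_start, latest_start) pairs that fit every schedule."""
    common = schedules[0]
    for schedule in schedules[1:]:
        common = intersect(common, schedule)
    # A meeting must end inside the window, so the latest start is end - duration.
    return [(s, e - duration) for s, e in common if e - s >= duration]

# Availability exactly as stated in the prompt
andrew = [(hm(11), hm(15))]
joanne = [(hm(12), hm(14)), (hm(15, 30), hm(17))]
hannah = [(hm(12), hm(12, 30)), (hm(16), hm(18))]

windows = meeting_windows([andrew, joanne, hannah], duration=30)
for earliest, latest in windows:
    print(f"start between {earliest // 60}:{earliest % 60:02d} "
          f"and {latest // 60}:{latest % 60:02d}")
```

    The only common window is 12:00 to 12:30, so the earliest and latest start times coincide at noon. That agrees with the single-option answer quoted above and with none of the start times Claude suggested.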

    So although Claude 2.0 is much better than GPT-3.5, it is still worse than GPT-4 for complex reasoning.

    EDIT: I just realized that Claude has a 100k token context window. This is SIGNIFICANTLY more than GPT-4’s normal 8k context and 32k API context. This immediately makes it much more useful, as it can parse ~75,000 words. I also have not tested Claude’s creative, programming, or mathematical skills. But however those compare to GPT’s, the sheer quantity of tokens Claude can parse is a big advantage on its own.
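
    For reference, the ~75,000 word figure follows from the common rule of thumb of roughly 0.75 English words per token. That ratio is an assumption on my part (it varies with the tokenizer and the text), and the names below are made up for illustration.

```python
# Assumption: about 0.75 English words per token (a rough rule of thumb,
# not an exact property of any particular tokenizer).
WORDS_PER_TOKEN = 0.75

# Context sizes as quoted in the comment above, in tokens
context_windows = {
    "GPT-4 (8k)": 8_000,
    "GPT-4 API (32k)": 32_000,
    "Claude 2.0 (100k)": 100_000,
}

approx_words = {
    name: round(tokens * WORDS_PER_TOKEN)
    for name, tokens in context_windows.items()
}

for name, words in approx_words.items():
    print(f"{name}: ~{words:,} words")
```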

    • folshost@lemmy.world
      1 year ago

      That’s interesting. I haven’t tried its reasoning skills. I did try playing Jeopardy! with it though, and it showed a lot of improvement from previous attempts. Usually chatbots are very bad at Jeopardy!, telling you just about any answer is correct, but Claude 2 did really well, explaining why I was wrong several times. In the initial prompt I did ask it to explain whether each of my answers was right or wrong, so that might’ve made a difference.