Made me look. One link seems to be about the tweet you reference on Claude 3 Opus; the other is about results on GPT-4 and Claude 2.1.

It looks like one has to construct the test very carefully and use the right prompts to do well on this problem. I'm sure the models will do better as they get bigger, as the hardware to run them becomes more tailored to them, etc. ("Add a subroutine with a clever retort if the score on 'could this be a trick question?' is high...") And, yes, it eventually becomes a philosophical problem about how good a simulation needs to be to be "the real thing".

I have vague recollections that a company in Japan (maybe it was Panasonic - story about a 1990 patent) was touting their "AI" in their vacuum cleaners in the 1980s (figuring out whether there was still dirt to be picked up).

It's not there yet. I don't know if anyone really knows how long it will take to get there. The folks dumping billions into it NOW NOW NOW may be upset if it takes another 30 years...