The reason for my skepticism is that I'm not sure xAI would give away Grok 3 and push it on x.com so aggressively if it cost an arm and a leg to run, as GPT-4.5 pricing indicates.
01.03.2025 17:31

I know there is no official info, of course. I'm following these rumors pretty closely, too. The compute FLOPs they have had could have been achieved with a ~2T model, no? I think Elon said that they used a ton of synthetically generated data, and many rollouts to find good solutions for RL, too.
01.03.2025 17:26

Source that Grok 3 is 10T? I'm very skeptical of that. Maybe they scaled training data substantially but parameters not *that* much.
01.03.2025 05:42

Perhaps this laziness is an intentional nudge towards using reasoning models (which are not yet available, though - I mean reasoners based on 4.5).
27.02.2025 20:54

We need an uncertainty knob similar to temperature.
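For reference, temperature is just a single scalar that rescales logits before sampling; the post is asking for an analogous single dial for uncertainty. A minimal sketch of the temperature mechanics (plain Python, no real LLM API assumed):

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Standard temperature scaling: divide logits by T before softmax.
    T < 1 sharpens the distribution; T > 1 flattens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=2.0)
# Lower temperature concentrates probability on the top token.
assert sharp[0] > flat[0]
```

An "uncertainty knob" would presumably have to modulate what the model asserts about its own confidence, not just token-level randomness, which is why it doesn't fall out of the sampling machinery as easily as temperature does.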
25.02.2025 14:30

They promised to open-source the previous generation after releasing the next, so we will know.
21.02.2025 03:54

And then, if anything goes wrong or unexpected during the "wet" phase, a clueless wannabe would like to pull up a VLM with a camera and ask the model for "debug instructions". You can't do that "on Google".
06.02.2025 20:04

You cannot get precise and detailed instructions for making a bomb or poison or other very dangerous stuff from things you can buy legitimately in just "a few clicks on Google". At a very minimum, it's days of research, including how to gaslight vendors, how to prepare things, etc.
06.02.2025 20:02

Analyzing the ethics and risks of autonomous agents is crucial. Thank you for your insightful work @mmitchell.bsky.social @evijit.io @sashamtl.bsky.social @giadapistilli.com
06.02.2025 15:58

Bad take. Censorship of "recipes for ruin" is good. A blanket deontological rule like "censorship is bad" doesn't work.
06.02.2025 19:41

I guess in AI agent(cy) engineering, the equivalent transition will be towards method design and decomposition: dialogue? multi-role debate? argument tree? the data model the model is operating on top of? reward design for RL post-training/fine-tuning?
05.02.2025 01:07

(take not mine) the current AI/agent(cy) engineering is much like pre-DL computer vision, when people tried to massage the problem around a few fairly rigid algos like SIFT. There was also an equivalent of RAG: scikit-image.org/docs/stable/... and it also didn't work very well.
05.02.2025 00:59

With DL and end-to-end training in CV, loss design became a more important skill than heuristic bricolage.
05.02.2025 01:01

Wrong, this is still liberal hysteria.
03.02.2025 20:49

I have yet to regret shooting a request to undermind. It always finds something interesting. My requests are always of the form "who has done research in roughly this shape" (where I'm sure that someone has, but it's hard to find via Scholar).
03.02.2025 13:31

FWIW in my impression, none of the services in this category (Perplexity, You.com, etc.) live up to the "deep" label except undermind.ai so far. Didn't try PaperQA though.
03.02.2025 06:37

Google's Deep Research is a total flop. I paid for a subscription to try it. Tried it on maybe 10 requests across a very broad range, from technical to cultural to philosophical. It spit out bland, often outright wrong slop every. single. time. Idk why you keep praising it.
03.02.2025 06:35

Zero information. "Consistently candid" Sama will say whatever the specific audience likes. I'm sure in different rooms he says the opposite of this.
01.02.2025 16:10

Claude is very hit-or-miss for perplexity-like questions. Same for everything else: ChatGPT, Gemini, exa.ai, You.com.
Meta-search with all of them may be helpful, even if not fully automatable yet: if LLMs knew good answers to these searches, they would not be so hit-or-miss to begin with.
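One way the meta-search idea above could be sketched: fan the same question out to several services and treat cross-provider agreement as a (weak) confidence signal. The provider functions below are stubs, not any real API:

```python
from collections import Counter

# Hypothetical provider callables; in practice these would wrap the
# ChatGPT, Claude, Gemini, exa.ai, You.com, ... endpoints.
def provider_a(question): return "answer-x"
def provider_b(question): return "answer-x"
def provider_c(question): return "answer-y"

def meta_search(question, providers, quorum=2):
    """Ask every provider the same question and surface answers that at
    least `quorum` of them agree on; the raw tally is returned too, so
    disagreement itself is visible as a 'this question is hard' signal."""
    answers = [p(question) for p in providers]
    counts = Counter(answers)
    agreed = [a for a, n in counts.items() if n >= quorum]
    return agreed, counts

agreed, counts = meta_search("who did X?", [provider_a, provider_b, provider_c])
# agreed holds the quorum answer; counts shows the dissenting one as well.
```

Exact-string agreement is of course too crude for real answers; a deployed version would need semantic matching, which is part of why this isn't "fully automatable yet".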
Maybe no hard distinction. It's a continuum between Sonnet and "reasoning" models.
08.01.2025 05:32

Good design. "Native" tool calling makes LLM APIs much more complex than they should be. Using just a single "tool" - code execution - to rule them all is better.
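A rough sketch of the single-tool design being praised here: the host exposes one code-execution "tool", and the model emits code against a small namespace of host functions instead of per-tool structured call objects. All names are illustrative, not any provider's actual API:

```python
# Host-side capability that would otherwise need its own tool schema.
def get_weather(city):
    return {"city": city, "temp_c": 21}

# The only "tool surface" the model sees: names it may call from code.
ALLOWED = {"get_weather": get_weather}

def run_model_code(code):
    """Execute model-emitted code with access only to ALLOWED names.
    (A real host would sandbox this properly; bare exec is just the sketch.)"""
    env = {"__builtins__": {}}
    env.update(ALLOWED)
    exec(code, env)
    return env.get("result")

# What a model might emit instead of a structured tool-call object:
temp = run_model_code("result = get_weather('Berlin')['temp_c']")
assert temp == 21
```

The upside is composability (the model can loop, branch, and chain calls in one round trip); the cost is that the host must solve sandboxing instead of just validating JSON arguments.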
01.01.2025 05:20

Does the book argue for expected utility/value based decision-making? "Radical Uncertainty" by @profjohnkay.bsky.social directly argues against that.
26.12.2024 10:03

I think o1/o3 should be better at this (I don't use them at the moment), but breaking the flow and waiting would be weird. An o1-capable coder with access to the context, one that constantly does some analysis in the background and makes insightful suggestions for me from time to time, would be best.
24.12.2024 05:43

Similarly, when writing a somewhat long function that includes 2-3 copies of similar but not identical logic (e.g., loop bodies), LLMs are never capable of factoring those out to shorten the function overall.
24.12.2024 05:43

I find it annoying that LLMs often tend to write their own functions for doing something instead of using the standard library or the "utils" that I've already created in my project.
24.12.2024 05:43

Presumably, people on Mechanical Turk got 75%. However, I would argue that people on Mechanical Turk are self-selected for something like openness to new tasks and problems.
21.12.2024 06:43

Ethan, do you cherry-pick the stuff you post on twitter/bsky? How many experiments do you do that never make it to your twitter in which none of the AIs do anything remarkable or badly misunderstand your intent?
16.12.2024 05:38

Google just shadow-banned my account, returning bullshit 403s to all requests. At least Anthropic and OpenAI don't have this BS.
12.12.2024 13:54