The machines, now inaccessible, are arguably more secure than before.

  • Sailor Sega Saturn@awful.systems
    6 months ago

    Fair warning that I’ll be ranty because I hate losers talking about DEI hires.

    > So why is memory address 0x9c trying to be read from? Well because… programmer error.

    > So what happened is that the programmer forgot to check that the object it’s working with isn’t valid, it tried to access one of the objects member variables…

    This is a huge assumption. The last rumor I’ve read from actual cybersecurity people is that Crowdstrike’s update files were corrupt (update: disproven by Crowdstrike’s blog post). If this is true it’s likely still from programmer error at some level, but maybe not as simple as “whoopsie I forgot an if (data == nullptr) teehee”.

    He, like the rest of us that don’t work at Crowdstrike, has no idea what happened. I have seen computers do the weirdest gosh darn things. I know better than to assume anything at this point. I wouldn’t even rule out weird stuff like the data getting corrupted between release qualification and release yet.

    > It turns out that C++, the language crowdstrike is using, likes to use address 0x0 as a special value to mean “there’s nothing here”, don’t try to access it or you’ll die.

    This thread is full of these sorts of small technical inaccuracies and oversimplifications so I won’t point out all of them, but nothing in the C++ standard requires null pointers to refer to memory address 0x0. Nor does it require that dereferencing a null pointer terminates the program.

    Windows died not because C++ asked it nicely to, but because a driver tried to access an address which wasn’t paged in.
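As a hedged illustration of how a small address like 0x9c can come out of a null object pointer: on the common platforms where a null pointer is represented as address 0 (which, as noted above, the C++ standard does not require), reading a member through it asks for address 0 plus that member's offset. The `Entry` struct and `member_address` helper below are invented for this sketch and say nothing about Crowdstrike's actual code or what really went wrong.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical layout, invented for illustration: 0x9c bytes of other
// fields, followed by the member the code wants to read.
struct Entry {
    unsigned char other_fields[0x9c];
    std::uint32_t value;
};

// Address that a read of `base->value` would target, computed with plain
// integer arithmetic so that nothing is ever dereferenced.
constexpr std::uintptr_t member_address(std::uintptr_t base) {
    return base + offsetof(Entry, value);
}

// The member sits 0x9c bytes into the object...
static_assert(offsetof(Entry, value) == 0x9c, "unexpected layout");
// ...so a base address of zero yields a target address of 0x9c.
static_assert(member_address(0) == 0x9c, "unexpected address");
```

The same arithmetic applies to any bad base pointer, so a faulting address near zero is consistent with a null object pointer but does not prove one.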

    > Crowdstrike should have set up automated testing using address sanitizer and thread sanitizer that runs on every code update.

    The funny thing about accessing memory that isn’t paged in from kernel space:

    1. It will crash regardless of whether it’s running under ASan or not; sanitizers are literally irrelevant based on what we know so far.
    2. The ASan version he linked to is for user space. In the Windows kernel you’d need KASAN instead.

    (If this was a simple nullptr dereference on bad input data then perhaps a fuzzer would have helped. Fuzzers are great though I have no idea how hard they are to use with kernel drivers)
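A minimal sketch of the fuzzing idea, in user space only. `lookup` and its record format (one index byte followed by a table) are hypothetical and invented here; nothing in the thread describes the real channel file format. The loop is a naive random-input fuzzer, not a coverage-guided one like libFuzzer or AFL, and it says nothing about how hard this is to do for a kernel driver.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <random>
#include <vector>

// Hypothetical record format: byte 0 is an index, the remaining bytes are
// the table that the index refers into.
std::optional<std::uint8_t> lookup(const std::vector<std::uint8_t>& input) {
    if (input.empty()) return std::nullopt;
    const std::size_t index = input[0];
    const std::size_t table_size = input.size() - 1;
    // Without this bounds check, hostile or corrupt input would read past
    // the end of the buffer. That is the kind of bug fuzzing tends to find.
    if (index >= table_size) return std::nullopt;
    return input[1 + index];
}

// Feed `iterations` random inputs to lookup() and count the rejected ones.
std::size_t fuzz_lookup(std::size_t iterations, std::uint32_t seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> length(0, 16);
    std::uniform_int_distribution<int> byte(0, 255);
    std::size_t rejected = 0;
    for (std::size_t i = 0; i < iterations; ++i) {
        std::vector<std::uint8_t> input(static_cast<std::size_t>(length(rng)));
        for (auto& b : input) b = static_cast<std::uint8_t>(byte(rng));
        if (!lookup(input)) ++rejected;
    }
    return rejected;
}
```

Calling `fuzz_lookup(100000, 1)` in a build with a sanitizer enabled is where the two tools complement each other: the fuzzer produces the bad input, and the sanitizer turns a silent out-of-bounds read into a report.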

    > C++ is hard. Maybe they have a DEI engineer that did this

    Dude would probably call me a “DEI hire”; but I bet I could beat him in a C++ deathmatch so neener neener.

    • V0ldek@awful.systems
      6 months ago

      Also, and this shouldn’t be left unsaid, we’re talking about the Windows kernel here. A place with C++ code so cursed it is legendarily unhealthy to work in, as the cosmic horrors contained within slowly eat away at your sanity and warp the perception of time and space. Seeing that code for a few hours is enough to make a grown man cry. Seeing that code for a few weeks is enough to make you never cry again, as the terrible truth worms its way into your mind.

      “DEI hire”, hah! The creature makes no distinction for race or gender as it fattens itself upon your failure! Even a glimpse at the edge of its abyss is enough to trigger a cycle of revelation - all modern software lies upon a rotting pile of ancient mistakes.

      • Noah Gibbs@ruby.social
        6 months ago

        From a lovely response to the Crowdstrike error and various speculation on what caused it (https://ruby.social/deck/@[email protected]/112824202708490681), comes this gem:

        > all modern software lies upon a rotting pile of ancient mistakes.

        To be clear: this is 100% true. As we slowly, painfully work our way toward being less awful at software engineering, we are better than we have ever been. As fucked as modern code is, old code was worse.

        The lower in the stack you go, the more horrifying the revelations, just as a rule.

        • V0ldek@awful.systems
          6 months ago

          Absolutely stellar writing, except for this one weird bit

          Database people are systems people. Modern databases have their own memory management, thread scheduler, and a fucking compiler inside. A promising research direction is to just bundle the database with your own bloody kernel that you handwrote with a box of scraps to make the entire thing less cursed and not have to wrestle with Linux.

          You know, just in case you were looking for people to include in your postapo gang, database experts will also murder whatever you want with bare hands.

          • Fazal Majid@social.vivaldi.net
            6 months ago

            @V0ldek there’s a difference between people who develop database engines, and people who use an existing database engine to write database applications in SQL or whatever.

            It’s just your comedic hyperbolic turns of phrase reminded me of Mickens’.

            • V0ldek@awful.systems
              6 months ago

              If more than one systems dev launches into a Lovecraftian stream of epithets about how incomprehensibly horrific it is when you ask them about their work, then there just may be some truth in it.

    • 0xb0ba@awful.systems
      6 months ago

      Mention C (and to an extent C++) and turbo nerds froth to show off how ultra cool they are cause they are LoW lEvEl programmers. But like most things, these loud freaks are mostly incoherent with their random insertion of tech words. Putting aside the DEI stuff cause I will rant forever against this racist and sexist fuckwit, it’s massively annoying working in an industry where dummies love to be all hand wavy and suggest something like sanitizers. Thanks bro, let’s all add runtime sanitizers and watch perf tank in the most critical section of your computer. And as you pointed out he doesn’t even mention the right one.

      Next time Crowdstrike should just have an if check all registers after every instruction to make sure their values are within your address space! And and and make sure a woman doesn’t program it cause according to him they are exempt from code reviews cause of the left agenda or some bullshit

    • Mike Knell@blat.at
      6 months ago

      @sailor_sega_saturn And given enough time and enough scale even the most improbably weird things will eventually happen. Update file corrupted by a storage controller that flips a couple of bits at random after every 720 hours of uptime but only if it’s 23.682 seconds after the hour? Weirder shit has happened.
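One common guard against this kind of bit flip in storage or transit is to ship a checksum alongside the file and verify it before parsing. A minimal sketch using the standard CRC-32 follows; `matches_checksum` is a hypothetical helper, and whether Crowdstrike's update path does anything like this is not known from the thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bitwise CRC-32 using the reflected polynomial 0xEDB88320 (the variant
// used by zlib and PNG). Slow but short; real code would use a table.
std::uint32_t crc32(const std::vector<std::uint8_t>& data) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::uint8_t byte : data) {
        crc ^= byte;
        for (int bit = 0; bit < 8; ++bit) {
            // All ones if the low bit is set, all zeros otherwise.
            const std::uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return ~crc;
}

// Returns true only if the payload still hashes to the checksum that was
// recorded when the file was produced.
bool matches_checksum(const std::vector<std::uint8_t>& payload,
                      std::uint32_t expected) {
    return crc32(payload) == expected;
}
```

A CRC catches random corruption such as a single flipped bit, but it only proves the file is the one that was published. It does nothing about a file that was published with bad contents in the first place.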

      • YourNetworkIsHaunted@awful.systems
        6 months ago

        I once helped one of my company’s customers troubleshoot an issue that had seen the same ridiculous edge case error happen three times over the course of a few years. At one point the actual sustaining developer we worked with was able to narrow down a specific bit that was getting flipped somehow, and pitched that cosmic radiation was a plausible explanation given how rarely this kind of thing impacted other customers.

        It was at this point that we remembered that the customer was either a university with a nuclear physics lab or a hospital with a nuclear medicine program (can’t remember now, ironically enough) that the server rack lived adjacent to.

      • flere-imsaho@awful.systems
        6 months ago

        some twenty four years ago i managed, amongst others, a company’s samba and print server (that was at the time when all the company’s servers were beige boxes with less memory and disk than the laptop i’m using to type this – and still they served a few hundred employees).

        the machine developed a strange custom of hard-resetting itself, which we initially tracked to specific files being sent for printing; the behaviour was fully reproducible.

        as it happened, it was a hardware fault somewhere between the mainboard and the integrated SCSI card; installing a separate SCSI card and reconnecting the disks and backup tape device fixed the problem. (i did not have the budget for a new server, no.)

        establishing the actual cause took me fucking weeks.

    • within_epsilon@beehaw.org
      6 months ago

      “DEI hire” is arrogant. That’s a great way to other people instead of owning the flaw. I appreciate the call for maturity in the field. Own your flaws.

        • alm@awful.systems
          6 months ago

          It actually blows my mind that these people can see a bad thing happen, know exactly zero about it, and conclude “must have been a (insert slur) who did that”. They did the same shit with the Baltimore bridge collapse.

    • Architeuthis@awful.systems
      6 months ago

      > (update: disproven by Crowdstrike’s blog post).

      How do you mean? The current top post on the blog seems to mention .sys files as part of the problem very prominently.

      > Channel file “C-00000291*.sys” with timestamp of 0527 UTC or later is the reverted (good) version. Channel file “C-00000291*.sys” with timestamp of 0409 UTC is the problematic version.

      • Sailor Sega Saturn@awful.systems
        6 months ago

        https://www.crowdstrike.com/blog/technical-details-on-todays-outage/

        > This is not related to null bytes contained within Channel File 291 or any other Channel File.

        That to me implied that the channel file wasn’t actually necessarily corrupt (or as corrupt as people thought), but that it triggered a logic error. In particular this point implies that it wasn’t from garbage zero bytes in the file.

        (That said I could have worded this better, in my defense I’m sick in bed and only half thinking straight)

        • froztbyte@awful.systems
          6 months ago

          yeah that phrase of “null bytes” reads like addressing one of the rumours

          “what was the problem?” “well it wasn’t null bytes” “so… what was it then?” “have definitely eliminated null bytes from the running!”

          • Sailor Sega Saturn@awful.systems
            6 months ago

            Aside, but I have been in some weird as heck discussions about how to phrase public blog posts. A few times I’ve had to point out some phrasing is so cryptic that no one will even know what we’re talking about, and really there’s nothing wrong with being a bit clearer about what we want to express. Sometimes it’s like companies want the audience to be bewildered and confused, and I’m not totally sure where this instinct comes from.

            (Though in this case they probably don’t want to share too much yet for stonk or legal reasons)