The Nibbler 4-Bit CPU board is optimized for exactly one thing: creating a fully functional computer whose CPU is split into its components, with as few chips as possible, so that it is easy to build and easy to understand what is actually going on. It does an excellent job at this, and one can even learn a lot about CPU architecture design from the things that have been left out of the design on purpose. Here are a couple of examples and how to work around the missing parts:
No stack: Any program that does more than switch an LED on and off will almost certainly be split into subroutines or use a library of routines written by others. To be able to jump to a subroutine from several places and return to the right one, the current content of the program counter is put on the stack to serve as a return address. In addition, all input variables to be used by the subroutine are pushed onto the stack as well. The subroutine then retrieves (“pops”) the variables from the stack, does whatever it needs to do and then executes a 'Return from Subroutine' CPU instruction. The CPU then loads the return address from the stack back into the program counter, which effectively returns to the main program thread. As the Nibbler does not have a stack, it's not possible to push the program counter and variables onto it and return from a subroutine to various places with a single instruction at the end of the subroutine. The workaround is to implement a jump cascade at the end of the subroutine. Before jumping to the subroutine, the caller writes a value into a memory location that selects which jump to take, and each call site uses a different value. In other words, if the subroutine is used in 8 places in the program, there is a cascade of cmp/jz instructions at the end of the subroutine with one branch per call site, and the cascade has to be extended with a new return target whenever the subroutine is called from an additional location. Not elegant at all, but the only way to have subroutines without a stack. To pass variables to the subroutine, they have to be put in memory (the 'heap') at predefined locations. I'm pretty sure that somebody who has never heard of the 'stack' concept wouldn't take long to come up with it, as it's just a pain to do just about anything without one.
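The jump cascade is perhaps easier to see in a small model. The following Python sketch is not Nibbler code: the `memory` dictionary stands in for RAM, each function stands in for a labelled block of code, and all names (`RET_SEL`, `ARG`, `RESULT`, `double`, `after_a`, `after_b`) are made up for illustration. Returning a label name models a jump.

```python
# Hypothetical model of a stack-less subroutine call with a jump cascade.
# `memory` stands in for fixed RAM locations; each function is a "label".
memory = {"RET_SEL": 0, "ARG": 0, "RESULT": 0}
trace = []


def main_a():
    memory["ARG"] = 3          # pass the argument via a fixed memory cell
    memory["RET_SEL"] = 0      # call site 0: return to after_a
    return "double"            # jump to the subroutine


def after_a():
    trace.append(("a", memory["RESULT"]))
    memory["ARG"] = 5
    memory["RET_SEL"] = 1      # call site 1: return to after_b
    return "double"


def after_b():
    trace.append(("b", memory["RESULT"]))
    return None                # halt


def double():
    memory["RESULT"] = (memory["ARG"] * 2) & 0xF   # keep the result to 4 bits
    # Jump cascade: one compare-and-jump per call site.
    if memory["RET_SEL"] == 0:
        return "after_a"
    if memory["RET_SEL"] == 1:
        return "after_b"
    return None                # unknown selector: stop


LABELS = {"main_a": main_a, "after_a": after_a,
          "after_b": after_b, "double": double}


def run(start):
    label = start
    while label is not None:
        label = LABELS[label]()


run("main_a")
print(trace)
```

Adding a third call site means touching `double` itself to add another branch, which is exactly the maintenance burden described above.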
No indexed addressing: One thing computers are good at is doing simple things quickly, over and over again. For example, in many cases it's required to make the same calculation on consecutive input data and to put the results back into memory one after another. Another repetitive task is writing into a buffer, e.g. writing a string to be sent to the LCD display into a buffer, one byte (or nibble in this case) at a time. An elegant way to do repetitive things is a loop with one index variable that points to the current input and another that points to where the next output can be written in a buffer. At the machine instruction level this is called indexed addressing. An instruction that writes into memory is given a base address to which the content of an index register is added. After writing to memory, the index register is increased by one and the next loop iteration begins. The thing is, there is no index register on the Nibbler and therefore no indexed addressing, again for the purpose of making the hardware as simple as possible. The only way to work around this is to do repetitive things one after another rather than in a loop. If an action on different memory locations needs to be repeated 20 times, no loop can be used. Instead, the same instructions have to be written out 20 times in the code with different source and destination addresses. Like the return cascades above, the missing functionality produces very ugly code and makes more complicated tasks that require many iterations over different input and/or output data difficult to implement on the Nibbler.
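One way to picture the unrolling is to let a script on the PC write out the repeated instructions. The following Python sketch generates such an unrolled copy sequence as text. The mnemonics `LD` and `ST` and the address format are placeholders chosen for illustration and have not been checked against the Nibbler assembler; `unroll_copy` is a hypothetical helper.

```python
def unroll_copy(src_base, dst_base, count):
    """Emit one load/store pair per element, each with a hard-coded address.

    The loop runs here, on the PC, at code-generation time; the generated
    code itself contains no loop and no index register.
    """
    lines = []
    for i in range(count):
        lines.append(f"LD 0x{src_base + i:03X}")   # placeholder mnemonic
        lines.append(f"ST 0x{dst_base + i:03X}")   # placeholder mnemonic
    return lines


for line in unroll_copy(0x010, 0x020, 3):
    print(line)
```

The output grows linearly with the number of iterations, so 20 repetitions of a two-instruction body already take 40 instructions of program ROM.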
No hardware interrupts: A great way of checking for external events is to use hardware interrupts. When the CPU notices an interrupt bit being set, it suspends normal program execution and automatically sets the program counter to the beginning of a service routine for that interrupt. This makes it easy, for example, to check for the user pressing a key and to react to it without delay. On PCs, hardware (and software) interrupts are used for many things, for example by peripheral devices to indicate that data has become available for processing. It should come as no surprise that the Nibbler does not have interrupts. Checking for key input on the Nibbler thus requires polling the single 4-bit input register to detect when a bit connected to one of the input keys changes its state. This has to be done frequently, as otherwise there will be a noticeable delay between the user pressing a key and the computer reacting. In programs that use delay loops between activities, checking for key input must be done inside the delay loops to avoid this lag.
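Polling inside a delay loop can be sketched as follows. This is a Python model, not Nibbler code: `read_input` is a hypothetical callback that stands in for reading the 4-bit input register, and `key_mask` selects the bit that is assumed to be wired to the key of interest.

```python
def delay_with_polling(read_input, ticks, key_mask, last_state):
    """Wait for `ticks` loop iterations, polling the input nibble each time.

    Returns (tick, state) as soon as a bit selected by `key_mask` differs
    from `last_state`, or (None, last_state) if the delay ran to completion.
    """
    for tick in range(ticks):
        state = read_input() & 0xF           # only 4 input bits exist
        if (state ^ last_state) & key_mask:  # did the watched bit change?
            return tick, state
    return None, last_state


# Simulated input: the key on bit 0 goes from 1 to 0 on the third poll.
samples = iter([0b1111, 0b1111, 0b1110, 0b1110])
result = delay_with_polling(lambda: next(samples), 4, 0b0001, 0b1111)
print(result)
```

The trade-off is that the delay is no longer exact: every poll adds instructions to the loop body, and the loop ends early when a key event is seen.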
No add with carry: Another thing that very much simplifies the hardware design but makes life difficult on the software side is that there is no add with carry instruction. Therefore, adding integers that consist of more than one nibble requires saving the carry flag in a variable and checking it when using the add command on the next nibble. In practice even more work has to be done, because a carry bit can result either from adding one value to another or from adding the saved carry bit to a value. Together with the lack of an index register, this makes the whole affair quite complicated in practice.
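The two sources of a carry are easier to see in a small model. The Python sketch below adds two numbers stored as lists of nibbles, least significant nibble first, using only a 4-bit add that reports its carry. The function names and the data layout are assumptions made for illustration.

```python
def add4(a, b):
    """4-bit add: returns (sum nibble, carry out)."""
    total = a + b
    return total & 0xF, total >> 4


def add_multi_nibble(x, y):
    """Add two equal-length nibble lists (least significant nibble first)."""
    result = []
    carry = 0                                    # the "carry variable" kept in RAM
    for a, b in zip(x, y):
        partial, carry_from_carry = add4(a, carry)   # first add the saved carry
        digit, carry_from_add = add4(partial, b)     # then add the other operand
        # At most one of the two steps can overflow, so OR-ing them is enough.
        carry = carry_from_carry | carry_from_add
        result.append(digit)
    return result, carry


# 0xFF + 0x01 = 0x100: the second nibble overflows purely from the carry.
print(add_multi_nibble([0xF, 0xF], [0x1, 0x0]))
```

The Python `for` loop is a convenience of the model. On the Nibbler, without an index register, the loop body would have to be written out once per nibble.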
Every instruction executed in two clock cycles: One of the brilliant design choices that significantly reduces hardware complexity is to execute every instruction in exactly two clock cycles. In the first clock cycle the instruction is loaded into the FETCH register, and it is then executed during the second clock cycle. As the first clock cycle has also advanced the program counter, the program ROM already presents the next 8 bits during the second cycle. These are used in one of two ways. If the current instruction does not increase the program counter during its second clock cycle, they become the next instruction. Otherwise they serve as the lower 8 bits of an address, are combined with another 4 bits from the current instruction and are put on the address bus, so that the content of a RAM cell can be used as the second operand of the operation. Very clever, but that obviously also limits the complexity of what a single instruction can do. That's why more complex CPUs use a variable number of steps per instruction and a more generic addressing scheme. Needless to say, more hardware would be required for that. On the other hand, there are only 16 instructions anyway, so there's little opportunity for making some of them more complex.
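The two-cycle scheme can be modelled in a few lines of Python. This is a rough sketch under assumptions, not a description of the real encoding: it assumes the opcode sits in the upper 4 bits of the fetched byte and the remaining 4 bits are either an immediate or the top of a 12-bit address, and `uses_address` is a hypothetical predicate that says which opcodes take a second ROM byte.

```python
def run_cycles(rom, uses_address, max_instructions):
    """Decode `rom` with a fixed fetch cycle followed by an execute cycle."""
    pc = 0
    executed = []
    for _ in range(max_instructions):
        if pc >= len(rom):
            break
        # Cycle 1: load the FETCH register and advance the program counter.
        fetch = rom[pc]
        pc += 1
        # Cycle 2: execute. The ROM is already presenting rom[pc].
        opcode = fetch >> 4          # assumed position of the opcode
        low = fetch & 0xF
        if uses_address(opcode):
            # The ROM byte is the low 8 bits of a 12-bit address, so the
            # program counter is advanced a second time to skip over it.
            address = (low << 8) | rom[pc]
            pc += 1
            executed.append((opcode, "addr", address))
        else:
            # The program counter stays put, so rom[pc] is fetched next.
            executed.append((opcode, "imm", low))
    return executed


# Hypothetical encoding: opcode 5 takes an address, opcode 2 an immediate.
decoded = run_cycles([0x2A, 0x51, 0x23, 0x2B], lambda op: op == 5, 10)
print(decoded)
```

Either way each instruction costs exactly two cycles, which is what keeps the control logic so small.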
That's a lot of technical detail for one post, and it is perhaps better understood by looking directly at source code. I've put one of the programs I've written for the Nibbler on GitHub, and it goes into the details of all the topics mentioned above. It can be compiled and run in the Nibbler simulator or, of course, on the real hardware.