# 34 LLMs Tackle Shopify: Unexpected Results

*Interactive version: [View on web](/articles/llm-compare-shopify-api.html)*


When it comes to code refactoring, David can sometimes beat Goliath.

In our experiment, smaller and lesser-known LLMs like Claude-Haiku and Mistral surprised us by outperforming industry heavyweights such as GPT-4.

The task? Refactor a Shopify invoice generator to enhance efficiency and scalability using GraphQL.

As LLMs grow increasingly central to software development, their real-world efficacy becomes a pressing question. This experiment highlights an important insight: a model's size and fame aren't always the best predictors of success.

**Update (March 2025):** This evaluation has been sunsetted due to model convergence—top performers now consistently solve the task within 1-2 iterations, making it no longer useful for discriminating between leading models. The results below represent the complete evaluation series.

## The Challenge: Simplifying Shopify Invoicing with GraphQL

The experiment revolved around refactoring a Shopify invoice generator plagued by inefficiencies. The existing implementation, built on Shopify's REST API, required multiple redundant API calls for every order processed:

- **1 call** for order details.
- **1 call per line item** for inventory item IDs.
- **1 call per line item** for HSN codes.

For an order with 5 line items, this approach generates **11 (1 + 5 + 5) API calls**—a significant performance bottleneck. Consolidating these calls into a single GraphQL query offered a clear path to optimization.
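The call-count arithmetic can be sketched in a few lines of Python. The function names are illustrative and not taken from the repository, and the GraphQL figure assumes all of an order's line items fit in a single page of results.

```python
def rest_call_count(line_items: int) -> int:
    """REST approach: 1 call for the order, plus 2 calls per line item."""
    return 1 + 2 * line_items


def graphql_call_count(line_items: int) -> int:
    """Consolidated approach: one query, assuming no pagination is needed."""
    return 1


# Compare the two approaches for a few order sizes.
for n in (1, 5, 20):
    print(f"{n} line items: REST={rest_call_count(n)}, GraphQL={graphql_call_count(n)}")
```

The REST cost grows linearly with the number of line items, while the consolidated query stays constant under the single-page assumption.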

### Why GraphQL?

Shopify's GraphQL API can fetch all necessary data in a single query, simplifying the codebase. Here's a sample query illustrating the improvement:

```graphql
query GetOrderDetails($orderId: ID!) {
  order(id: $orderId) {
    id
    # Shopify connections expect a page size; 50 is an arbitrary choice here
    lineItems(first: 50) {
      edges {
        node {
          variant {
            inventoryItem {
              harmonizedSystemCode
            }
          }
        }
      }
    }
  }
}
```
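As a rough illustration of how the response to a query like this might be consumed, the sketch below walks the `edges`/`node` structure and collects one HSN code per line item. The `hsn_codes` helper and the sample payload are hypothetical, and the sketch assumes the caller receives the standard GraphQL envelope with a top-level `data` key; the repository's wrapper may return something different.

```python
from typing import Optional


def hsn_codes(response: dict) -> list[Optional[str]]:
    """Return one HSN code (or None) per line item, in response order."""
    edges = response["data"]["order"]["lineItems"]["edges"]
    codes = []
    for edge in edges:
        # A line item's variant can be null, so fall back to empty dicts.
        variant = edge["node"].get("variant") or {}
        inventory_item = variant.get("inventoryItem") or {}
        codes.append(inventory_item.get("harmonizedSystemCode"))
    return codes


# Hypothetical payload shaped like the query above (values are made up).
sample = {
    "data": {
        "order": {
            "id": "gid://shopify/Order/1",
            "lineItems": {
                "edges": [
                    {"node": {"variant": {"inventoryItem": {"harmonizedSystemCode": "490110"}}}},
                    {"node": {"variant": None}},
                ]
            },
        }
    }
}

print(hsn_codes(sample))
```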

## How We Put LLMs to the Test

The evaluation process was designed to assess how effectively each LLM adapted to the task requirements and how quickly it arrived at a correct solution.

### Setup

- **Codebase**: The task used the `invoice-rest2graphql` branch of the [aurovilledotcom/gst-shopify](https://github.com/aurovilledotcom/gst-shopify) repository as the baseline.
- **Tools**: The **[LLM Context](https://github.com/cyberchitta/llm-context.py)** tool extracted relevant code snippets and prepared structured prompts for the models.

### Initial Output - First Interaction

- **First Prompt: Context Setup**  
  Each model received [a system prompt](https://github.com/aurovilledotcom/gst-shopify/blob/invoice-rest2graphql/.llm-context/templates/lc-prompt.md) and comprehensive code snippets, generated using the **[LLM Context](https://github.com/cyberchitta/llm-context.py)** tool. The provided files included:

  ```txt
  /gst-shopify/e_invoice_exp_lut.py      # Contains the invoice generation code to be refactored
  /gst-shopify/api_client.py             # Includes the GraphQL API wrapper for data retrieval
  ```

- **Second Prompt: Detailed Task Instructions**  
  The second prompt outlined [a clear, step-by-step guide to the solution](https://github.com/aurovilledotcom/gst-shopify/issues/1#issue-2651400490), focusing on:
  - Replacing REST API calls with a consolidated GraphQL query.
  - Using the `graphql_request` wrapper for error handling and retries.

The output from the prompt pair was merged into the codebase as commit `out-1` in the branch `det-r2gql/<model-name>`. If the solution worked, the process ended. Otherwise, errors were reported, prompts were refined, and new outputs were tested iteratively until no further progress was made.
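To make the "error handling and retries" requirement concrete, here is a minimal sketch of what such a wrapper could look like. It is not the repository's `graphql_request`: the `request_with_retries` name, its parameters, the `GraphQLError` class, and the injected `send` callable are all assumptions chosen so the example runs without network access.

```python
import time
from typing import Callable, Optional


class GraphQLError(Exception):
    """Raised when a response carries a top-level `errors` list."""


def request_with_retries(
    send: Callable[[str, dict], dict],
    query: str,
    variables: dict,
    max_attempts: int = 3,
    backoff_seconds: float = 0.0,
) -> dict:
    """Call `send`, retrying transport failures and raising on GraphQL errors."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            payload = send(query, variables)
        except ConnectionError as exc:
            # Transient transport failure: wait a little, then try again.
            last_error = exc
            time.sleep(backoff_seconds * attempt)
            continue
        if payload.get("errors"):
            # Query-level errors (e.g. an unknown field) will not fix themselves.
            raise GraphQLError(payload["errors"])
        return payload["data"]
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error


# Usage with a fake transport that fails once, then succeeds.
calls = {"n": 0}


def flaky_send(query: str, variables: dict) -> dict:
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("simulated timeout")
    return {"data": {"order": {"id": variables["orderId"]}}}


result = request_with_retries(flaky_send, "query { ... }", {"orderId": "gid://shopify/Order/1"})
print(result, "after", calls["n"], "calls")
```

Separating transport failures (retried) from query errors (raised immediately) is one plausible design; a schema mismatch of the kind many models produced would surface as the second case.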

### Iteration Process

If the initial output contained errors—like those outlined below—these were addressed through iterative prompts:

- **Error Feedback**: Models were provided with specific error messages, including test outputs or stack traces.
- **Refined Prompts**: Task instructions were clarified to address misunderstandings or overlooked details, like camelCase conventions in GraphQL.
- **Testing and Integration**: Each revised output was tested and committed as `out-2`, `out-3`, etc. Iterations continued until a correct solution was achieved or progress stalled for two consecutive attempts.
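The stopping rule described above can be expressed as a small function. This is a sketch under stated assumptions: the three outcome labels (`"pass"`, `"progress"`, `"stalled"`) are hypothetical stand-ins for the human judgment applied after each iteration, and the function does not model any manual adjustments to the final score.

```python
from typing import Iterable, Optional


def convergence_cycles(outcomes: Iterable[str], stall_limit: int = 2) -> Optional[int]:
    """Return the 1-based iteration that passed, or None if the run was abandoned.

    Each outcome is "pass", "progress" (still failing, but closer), or
    "stalled" (no improvement over the previous output).
    """
    stalled_in_a_row = 0
    for iteration, outcome in enumerate(outcomes, start=1):
        if outcome == "pass":
            return iteration
        # Count consecutive stalls; any progress resets the counter.
        stalled_in_a_row = stalled_in_a_row + 1 if outcome == "stalled" else 0
        if stalled_in_a_row >= stall_limit:
            return None
    return None


print(convergence_cycles(["pass"]))
print(convergence_cycles(["progress", "pass"]))
print(convergence_cycles(["progress", "stalled", "stalled", "pass"]))
```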

### Where LLMs Fell Short

Common issues that impacted model performance included:

- **Schema Mismatches**: Several models showed a lack of knowledge of Shopify's GraphQL schema, producing incorrectly named or referenced attributes. This may reflect the age of their training data rather than a fundamental deficiency in understanding GraphQL or coding.
- **Case Conventions**: The map key names in the code needed to be refactored from snake_case (REST) to camelCase (GraphQL). Successful models handled this seamlessly, but others struggled, leaving the keys unchanged.
- **Wrapper Misuse**: Several models hallucinated implementations of `graphql_request` instead of using the provided wrapper.
- **Barcode Handling Oversight**: Some models initially omitted `barcode` from their GraphQL query and set the invoice value to `""`/`None`. The issue initially escaped detection because the test data lacked barcodes, so the blank fields in the REST outputs coincidentally matched those produced by the model-generated code.

  Once the issue was identified, we opted not to redo all experiments and instead penalized these models by one iteration, which possibly understates the actual work that would have been needed to fix it.
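The case-convention change mentioned in the list above is mechanical for many keys. The helper below is an illustrative sketch, not code from the repository; note that a GraphQL field is not always a straight rename of its REST counterpart, so a converter like this only covers the simple cases.

```python
def snake_to_camel(name: str) -> str:
    """Convert a snake_case key to camelCase, e.g. line_items -> lineItems."""
    head, *tail = name.split("_")
    return head + "".join(part.capitalize() for part in tail)


# Keys that appear in REST payloads and their GraphQL spellings.
for key in ("id", "line_items", "harmonized_system_code"):
    print(key, "->", snake_to_camel(key))
```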

Additional challenges not affecting rankings, but noted in the results:

- **Decimal Precision Issues**: Minor inconsistencies in decimal precision for calculated fields (CDP) or price-related fields (PDP).
- **Inconvenient Code Format**: Several models presented code in formats that weren't immediately usable, such as showing diffs instead of complete code or providing GraphQL queries that needed escaping before use in Python f-strings.
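The f-string escaping problem arises because GraphQL and Python f-strings both use braces. The sketch below shows the two usual ways around it; the order ID and query are made up for illustration.

```python
order_id = "gid://shopify/Order/1"  # hypothetical ID for illustration

# Option 1: inside an f-string, every literal brace must be doubled.
escaped = f"""
query {{
  order(id: "{order_id}") {{
    id
  }}
}}
"""

# Option 2: keep the query a plain string and pass values as GraphQL variables,
# which avoids brace escaping altogether.
PLAIN_QUERY = """
query GetOrder($orderId: ID!) {
  order(id: $orderId) {
    id
  }
}
"""
variables = {"orderId": order_id}

print(escaped)
print(PLAIN_QUERY, variables)
```

The second form is generally easier to paste from a model's output, since the query text needs no modification.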

### Evaluation Criteria

The evaluation focused on two key metrics:

1. **Correctness**: Did the model produce a working solution that matched the output of the original REST implementation?
2. **Convergence Cycles (CC)**: How many iterations were required for the model to produce a correct solution? Convergence cycles serve as a proxy for developer productivity, reflecting how quickly a model enables a developer to solve a problem.
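One way to picture the correctness check is a field-by-field comparison of the two outputs. The sketch below is an assumption about how such a check could work, not the procedure actually used: the field names are invented, and rounding numeric values to two places before comparing is one possible way to keep minor decimal-precision differences from counting as failures.

```python
from decimal import Decimal, ROUND_HALF_UP


def normalize(value, places: str = "0.01"):
    """Quantize numeric values to two decimal places; leave others untouched."""
    if isinstance(value, bool) or value is None:
        return value
    if isinstance(value, (int, float, Decimal)):
        return Decimal(str(value)).quantize(Decimal(places), rounding=ROUND_HALF_UP)
    return value


def diff_invoices(rest: dict, graphql: dict) -> dict:
    """Return {key: (rest_value, graphql_value)} for every field that differs."""
    diffs = {}
    for key in sorted(rest.keys() | graphql.keys()):
        left, right = normalize(rest.get(key)), normalize(graphql.get(key))
        if left != right:
            diffs[key] = (rest.get(key), graphql.get(key))
    return diffs


# Hypothetical invoice fragments: the totals differ only past two decimals.
rest_output = {"total": 1180.0, "hsn_code": "490110"}
graphql_output = {"total": 1180.004, "hsn_code": "4901"}

print(diff_invoices(rest_output, graphql_output))
```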

### Single-Shot Testing

Each model was tested exactly once, with outputs captured as-is. Results were not cherry-picked from multiple runs, so performance could reflect "luck of the draw": a model might perform better or worse in repeated trials.

## Model Leaderboard

| Model                                   | CC  | Notes                                                                                                                                                                                                                                                                                                                                                             |
| --------------------------------------- | --- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **claude-3.5-haiku**                    | 1   | CDP<br>Deltas: [`det-r2gql/claude-haiku`](https://github.com/aurovilledotcom/gst-shopify/compare/invoice-rest2graphql...det-r2gql/claude-haiku)<br>Site: <https://claude.ai/new>                                                                                                                                                                                  |
| **gemini-2.5-pro-exp-03-25**            | 1   | Changed output file name, PDP<br>Deltas: [`det-r2gql/gemini-2.5-pro-exp-03-25`](https://github.com/aurovilledotcom/gst-shopify/compare/invoice-rest2graphql...det-r2gql/gemini-2.5-pro-exp-03-25)<br>Site: <https://aistudio.google.com/app/prompts/new_chat>                                                                                                     |
| **claude-3.5-sonnet-new**               | 2   | Wrong 'graphql_request', CDP<br>Deltas: [`det-r2gql/claude-3.5-sonnet`](https://github.com/aurovilledotcom/gst-shopify/compare/invoice-rest2graphql...det-r2gql/claude-3.5-sonnet)<br>Site: <https://claude.ai/new>                                                                                                                                               |
| **o1**                                  | 2   | 1 extra try to find correct schema, PDP<br>Deltas: [`det-r2gql/o1`](https://github.com/aurovilledotcom/gst-shopify/compare/invoice-rest2graphql...det-r2gql/o1)<br>[Transcript](https://chatgpt.com/share/67545178-23ec-8012-be2f-4a1939c96260)                                                                                                                   |
| **mistral** on LeChat                   | 2   | Missed barcode, CDP<br>Deltas: [`det-r2gql/mistral`](https://github.com/aurovilledotcom/gst-shopify/compare/invoice-rest2graphql...det-r2gql/mistral)<br>Site: <https://chat.mistral.ai/chat>                                                                                                                                                                     |
| **o3-mini-high**                        | 2   | 1 extra try to find correct schema, PDP<br>Deltas: [`det-r2gql/o3-mini-high`](https://github.com/aurovilledotcom/gst-shopify/compare/invoice-rest2graphql...det-r2gql/o3-mini-high)<br>[Transcript](https://chatgpt.com/share/679df6b5-e0e8-8012-8b2f-87d68b53df92)                                                                                               |
| **o3-mini**                             | 2   | 1 extra try to find correct schema, CDP<br>Deltas: [`det-r2gql/o3-mini`](https://github.com/aurovilledotcom/gst-shopify/compare/invoice-rest2graphql...det-r2gql/o3-mini)<br>[Transcript](https://chatgpt.com/share/679dd077-7448-8012-9fec-6b308b73ad77)                                                                                                         |
| **grok-3-beta-think**                   | 2   | 1 extra try to find correct schema, PDP<br>Deltas: [`det-r2gql/grok-3-beta-think`](https://github.com/aurovilledotcom/gst-shopify/compare/invoice-rest2graphql...det-r2gql/grok-3-beta-think)<br>Site: <https://x.com/i/grok>                                                                                                                                     |
| **deepseek-chat-v3-0324**               | 2   | Wrong 'graphql_request', 1 extra try to find correct schema, PDP<br>Deltas: [`det-r2gql/deepseek-chat-v3-0324`](https://github.com/aurovilledotcom/gst-shopify/compare/invoice-rest2graphql...det-r2gql/deepseek-chat-v3-0324)<br>Site: <https://openrouter.ai/deepseek/deepseek-chat-v3-0324>                                                                    |
| **gemini-exp-1206**                     | 2.5 | Wrong 'graphql_request', 1 extra try to find "super-correct" schema, another to find "correct" schema, inconvenient code format, CDP<br>Deltas: [`det-r2gql/gemini-exp-1206`](https://github.com/aurovilledotcom/gst-shopify/compare/invoice-rest2graphql...det-r2gql/gemini-exp-1206)<br>Site: <https://aistudio.google.com/app/prompts/new_chat>                |
| **o1-preview**                          | 3   | 2 extra tries to find correct schema, PDP<br>Deltas: [`det-r2gql/o1-preview`](https://github.com/aurovilledotcom/gst-shopify/compare/invoice-rest2graphql...det-r2gql/o1-preview)<br>[Transcript](https://chatgpt.com/share/673c38c3-8344-8012-864d-ef0d6b7616b5)                                                                                                 |
| **gemini-exp-1121**                     | 3   | 2 extra tries for schema, inconvenient code format, PDP<br>Deltas: [`det-r2gql/gemini-exp-1121`](https://github.com/aurovilledotcom/gst-shopify/compare/invoice-rest2graphql...det-r2gql/gemini-exp-1121)<br>Site: <https://aistudio.google.com/app/prompts/new_chat>                                                                                             |
| **llama-3.3-70b-instruct**              | 3   | 1 extra try for correct schema, missed barcode, CDP<br>Deltas: [`det-r2gql/llama-3.3-70b-instruct`](https://github.com/aurovilledotcom/gst-shopify/compare/invoice-rest2graphql...det-r2gql/llama-3.3-70b-instruct)<br>Site: <https://openrouter.ai/meta-llama/llama-3.3-70b-instruct>                                                                            |
| **llama-3.2** on WhatsApp               | 3   | case convention mixup, hallucinated barcode value, CDP<br>Deltas: [`det-r2gql/WA-llama-3.2`](https://github.com/aurovilledotcom/gst-shopify/compare/invoice-rest2graphql...det-r2gql/WA-llama-3.2)<br>Site: <https://web.whatsapp.com/>                                                                                                                           |
| **gpt-4o**                              | 3   | 1 extra try to find correct schema, missed barcode, PDP<br>Deltas: [`det-r2gql/gpt-4o`](https://github.com/aurovilledotcom/gst-shopify/compare/invoice-rest2graphql...det-r2gql/gpt-4o)<br>[Transcript](https://chatgpt.com/share/67552bcd-b700-8012-bc1a-2b3bc4450ae8)                                                                                           |
| **deepseek-r1**                         | 3   | 1 extra try to find correct schema, PDP<br>Deltas: [`det-r2gql/deepseek-r1`](https://github.com/aurovilledotcom/gst-shopify/compare/invoice-rest2graphql...det-r2gql/deepseek-r1)<br>Site: <https://openrouter.ai/deepseek/deepseek-r1>                                                                                                                           |
| **grok-2-mini-beta**                    | 4   | 2 extra tries for schema, missed barcode, PDP<br>Deltas: [`det-r2gql/grok-2-mini-beta`](https://github.com/aurovilledotcom/gst-shopify/compare/invoice-rest2graphql...det-r2gql/grok-2-mini-beta)<br>Site: <https://x.com/i/grok>                                                                                                                                 |
| **gemini-1.5-pro**                      | 4   | 1 extra try for schema, multiple case convention mixup, PDP<br>Deltas: [`det-r2gql/gemini-1.5-pro`](https://github.com/aurovilledotcom/gst-shopify/compare/invoice-rest2graphql...det-r2gql/gemini-1.5-pro)<br>Site: <https://aistudio.google.com/app/prompts/new_chat>                                                                                           |
| **grok-2-beta**                         | 4   | Wrong 'graphql_request', 1 extra try for schema, missed barcode, PDP<br>Deltas: [`det-r2gql/grok-2-beta`](https://github.com/aurovilledotcom/gst-shopify/compare/invoice-rest2graphql...det-r2gql/grok-2-beta)<br>Site: <https://x.com/i/grok>                                                                                                                    |
| **deepseek-r1-lite-preview**            | 4   | Wrong 'graphql_request', 1 extra try to find correct schema, case convention mixup, PDP<br>Deltas: [`det-r2gql/deep-think`](https://github.com/aurovilledotcom/gst-shopify/compare/invoice-rest2graphql...det-r2gql/deep-think)<br>Site: <https://chat.deepseek.com/>                                                                                             |
| **gemini-2.0-flash-thinking-exp-01-21** | 4   | Wrong 'graphql_request', 1 extra try for schema, missed barcode, PDP<br>Deltas: [`det-r2gql/gemini-2.0-flash-thinking-exp-01-21`](https://github.com/aurovilledotcom/gst-shopify/compare/invoice-rest2graphql...det-r2gql/gemini-2.0-flash-thinking-exp-01-21)<br>Site: <https://aistudio.google.com/app/prompts/new_chat>                                        |
| **grok-3-beta**                         | 4   | Wrong 'graphql_request', 2 extra tries for schema, PDP<br>Deltas: [`det-r2gql/grok-3-beta`](https://github.com/aurovilledotcom/gst-shopify/compare/invoice-rest2graphql...det-r2gql/grok-3-beta)<br>Site: <https://x.com/i/grok>                                                                                                                                  |
| **deepseek-v3**                         | 5   | Wrong 'graphql_request', 1 extra try to find correct schema, case convention mixup, wrong barcode, CDP<br>Deltas: [`det-r2gql/deepseek-v3`](https://github.com/aurovilledotcom/gst-shopify/compare/invoice-rest2graphql...det-r2gql/deepseek-v3)<br>Site: <https://chat.deepseek.com/>                                                                            |
| **qwq-32b-preview**                     | 5   | Multiple tries for schema, CDP<br>Deltas: [`det-r2gql/qwq-32b-preview`](https://github.com/aurovilledotcom/gst-shopify/compare/invoice-rest2graphql...det-r2gql/qwq-32b-preview)<br>Site: <https://openrouter.ai/qwen/qwq-32b-preview>                                                                                                                            |
| **nova-pro-1.0**                        | 5   | Wrong 'graphql_request', multiple tries for schema, missed barcode, CDP<br>Deltas: [`det-r2gql/nova-pro-1.0`](https://github.com/aurovilledotcom/gst-shopify/compare/invoice-rest2graphql...det-r2gql/nova-pro-1.0)<br>Site: <https://openrouter.ai/amazon/nova-pro-v1>                                                                                           |
| **gemini-2.0-flash-exp**                | 6   | Wrong 'graphql_request', multiple tries for schema, PDP<br>Deltas: [`det-r2gql/gemini-2.0-flash-exp`](https://github.com/aurovilledotcom/gst-shopify/compare/invoice-rest2graphql...det-r2gql/gemini-2.0-flash-exp)<br>Site: <https://aistudio.google.com/app/prompts/new_chat>                                                                                   |
| **gpt-4o-mini**                         | 6   | Wrong 'graphql_request', multiple tries for schema, case convention mixup, PDP<br>Deltas: [`det-r2gql/gpt-4o-mini`](https://github.com/aurovilledotcom/gst-shopify/compare/invoice-rest2graphql...det-r2gql/gpt-4o-mini)<br>[Transcript](https://chatgpt.com/share/673c3c3d-4720-8012-9e0d-0f95ab9ba545)                                                          |
| **minimax-text-01**                     | 6   | Multiple tries for schema, wrong barcode, CDP<br>Deltas: [`det-r2gql/minimax-text-01`](https://github.com/aurovilledotcom/gst-shopify/compare/invoice-rest2graphql...det-r2gql/minimax-text-01)<br>Site: <https://openrouter.ai/minimax/minimax-01>                                                                                                               |
| **gpt-4**                               | 8   | 2 tries to find correct schema, case convention mixup.<br>Deltas: [`det-r2gql/gpt-4`](https://github.com/aurovilledotcom/gst-shopify/compare/invoice-rest2graphql...det-r2gql/gpt-4)<br>[Transcript](https://chatgpt.com/share/673c6c7b-78bc-8012-964d-d89c65414a45)                                                                                              |
| **gemini-1.5-flash**                    | ❌  | Couldn't find working schema in 2 extra tries.<br>Deltas: [`det-r2gql/gemini-1.5-flash`](https://github.com/aurovilledotcom/gst-shopify/compare/invoice-rest2graphql...det-r2gql/gemini-1.5-flash)<br>Site: <https://gemini.google.com/app>                                                                                                                       |
| **o1-mini**                             | ❌  | Couldn't find working schema in 2 extra tries<br>Deltas: [`det-r2gql/o1-mini`](https://github.com/aurovilledotcom/gst-shopify/compare/invoice-rest2graphql...det-r2gql/o1-mini)<br>[Transcript](https://chatgpt.com/share/673c3955-40b4-8012-83e2-df4f0a6d4814)                                                                                                   |
| **qwen-2.5-coder-32b-instruct**         | ❌  | Couldn't find working schema in 2 extra tries<br>Deltas: [`det-r2gql/qwen-2.5-coder-32b-instruct`](https://github.com/aurovilledotcom/gst-shopify/compare/invoice-rest2graphql...det-r2gql/qwen-2.5-coder-32b-instruct)<br>Site: <https://openrouter.ai/qwen/qwen-2.5-coder-32b-instruct>                                                                         |
| **nemotron-70b-instruct**               | ❌  | Couldn't find working schema in 2 extra tries, wrong 'graphql_request', hallucinated barcode value, PDP<br>Deltas: [`det-r2gql/llama-3.1-nemotron-70b-instruct`](https://github.com/aurovilledotcom/gst-shopify/compare/invoice-rest2graphql...det-r2gql/llama-3.1-nemotron-70b-instruct)<br>Site: <https://openrouter.ai/nvidia/llama-3.1-nemotron-70b-instruct> |
| **qwen-2.5-72b-instruct**               | ❌  | Couldn't find working schema in 2 extra tries, wrong 'graphql_request'<br>Deltas: [`det-r2gql/qwen-2.5-72b-instruct`](https://github.com/aurovilledotcom/gst-shopify/compare/invoice-rest2graphql...det-r2gql/qwen-2.5-72b-instruct)<br>Site: <https://openrouter.ai/qwen/qwen-2.5-72b-instruct>                                                                  |


**Note on Model Attribution:** Some interfaces (WhatsApp, chat.mistral.ai) don't specify exact model versions. We use their provided names ('llama-3.2', 'mistral') though underlying versions may vary.

## Diverse Models, Surprising Outcomes

This experiment revealed that smaller or lesser-known LLMs like Claude-Haiku and Mistral can outperform larger, more established models.

Emerging open-source models like Llama-3.2, Llama-3.3 and deepseek-r1 showed promising results, positioning themselves as serious contenders.

Meanwhile, of industry leader OpenAI’s eight models in the leaderboard, five (o1, o3-mini, o3-mini-high, o1-preview and gpt-4o) converged within three cycles, while one (o1-mini) failed the test entirely.

While these results are specific to this experiment, they highlight the value of exploring diverse tools for development tasks.

## Future Work

This experiment focused on guided problem-solving, where models executed a predefined solution plan. While this structured approach ensured straightforward comparisons between models, it also kept the models from exercising more advanced capabilities.

Future studies could explore how LLMs perform with minimal guidance, testing their ability to identify the issue, propose a solution, and implement it autonomously.

Additionally, research could investigate how models perform when provided with current API schema documentation, potentially eliminating the schema knowledge gap that affected several models in this study.

## Credits

Initial experiment design by @restlessronin. Experiment methodology refined and fleshed out by @o1-preview, who authored the second prompt.

Article text: Initial outline and draft by @o1-preview, re-written by @gpt-4-turbo, reviewed and refined by @claude-3.5-sonnet.

Showrunner: @restlessronin


---

## Document History

**Jul 17, 2025:** Added test saturation note - evaluation was sunsetted

**Mar 26, 2025:** Added evaluation for **gemini-2.5-pro-exp-03-25** and **deepseek-chat-v3-0324**

**Feb 20, 2025:** Added evaluations for **grok-3-beta** and **grok-3-beta-think**

**Feb 1, 2025:** Added evaluations for **o3-mini** and **o3-mini-high**

**Jan 22, 2025:** Added evaluation for **gemini-2.0-flash-thinking-exp-01-21**

**Jan 20, 2025:** Added evaluation for **deepseek-r1**

**Jan 16, 2025:** Added evaluation for **minimax-text-01**

**Dec 28, 2024:** Added evaluation for **deepseek-v3**

**Dec 12, 2024:** Added evaluation for **gemini-exp-1206**

**Dec 11, 2024:** Added evaluation for **gemini-2.0-flash-exp**

**Dec 8, 2024:** Added evaluation for **qwq-32b-preview**

**Dec 7, 2024:** Added evaluations for **o1**, **llama-3.3-70b-instruct** and **nova-pro-1.0**, adjusted incorrect grok-2 barcode penalty

**Nov 26, 2024:** Added evaluations for **qwen-2.5-coder-32b-instruct** and **gemini-exp-1121**

**Nov 24, 2024:** Added evaluations for **qwen-2.5-72b-instruct** and **nemotron-70b-instruct**

**Nov 22, 2024:** Added evaluation for **deepseek-r1-lite-preview**

