NavTrans Errors - 2024

Uit De Vliegende Brigade
(Doorverwezen vanaf NavTrans Problem - Spring 2024)
Naar navigatie springen Naar zoeken springen

In spring 2024, we experienced sudden and frequent crashes of a MariaDB server and equally frequent corruption of databases. This was on a webserver with a bunch of WooCommerce sites. The symptoms seemed very similar to the Transients Layered Nav Counts problem we had in autumn 2023 - but it can be something entirely differently. However, Ockam's razor tells me, that it is something quite similar. Especially since we don't have adequate measurements in place since autumn 2023; only some quick fixes.

Brainstorm

  • NavTransients are a problem, independent of the number of filters: When any NavTrans gets too big, it doesn't work anymore in the interface + table locking problems

First symptoms

What were the very first things that indicated that there was a problem?

In practice, there might be a bunch of stuff happening seemingly at the same time, or in such short succession, that it isn't always possible or useful to isolate what seems to be the first thing. Additionally, this question can quickly become philosofical, when considering what is cause and what is effect. Here, we mean with symptoms the first things that folks who worked with the system, noticed.

First symptoms
Data Event Notes
2024.04.25 Site nl_nl had database-connectivity problems → Database server was down Database-connectivity problems & database server down, are both symptoms rather than causes in this context
2024.05.27 Database server was down with internal errors according to its log From symptom to cause: What causes these crashes?
2024.05.02-06 Individual sites don't work due to corrupted databases From symptom to cause: What causes these crashes?
2024.05.08
  • At 12:07, report came about a non-functioning site
  • According to MySQL error log, MySQL server had crashed at 12:05

Note that these are actually two different symptoms:

  • Corrupted database
  • Crashed database servers.

For now, let's assume they are related to the same underlying cause.

Also note that the events happened at several different moments. For now, let's assume it's one big happy problem.

What happened next

At this moment of writing (2024.05), it isn't clear how the chains of cause-and-effect look like and what the final cause or causes are. But let's try to figure out what's directly behind the aforementioned symptoms

Database corruption

I don't remember exactly what the situation was, and it's too much work to figure it out now:

  • When trying to drop existing databases, I got errors that certain files couldn't be deleted
  • Something went wrong when trying to read tables?

Ah, I found something: 02-May-2024 21:43:22 UTC] WordPress database error Table 'example_autralia_com.wp_posts' doesn't exist in engine for query SHOW FULL COLUMNS FROM `wp_posts` made by wp_install, wp_install_defaults

Database crashes

As an example, this is the part of the MySQL error log (/var/lib/mysql/error.log) around the crash on 2024.05.08-12:05:

2024-05-08  9:53:19 0 [Note] /usr/sbin/mariadbd: ready for connections.
Version: '10.6.16-MariaDB-0ubuntu0.22.04.1'  socket: '/run/mysqld/mysqld.sock'  port: 3306  Ubuntu 22.04
2024-05-08  9:53:19 0 [Note] InnoDB: Rolled back recovered transaction 46179644
2024-05-08  9:53:19 0 [Note] InnoDB: Rollback of non-prepared transactions completed
2024-05-08  9:53:19 9 [ERROR] InnoDB: Table mysql.innodb_table_stats not found.
2024-05-08 10:31:19 0 [Note] Starting MariaDB 10.6.16-MariaDB-0ubuntu0.22.04.1 source revision  as process 381314
2024-05-08 10:31:19 0 [Note] mariadbd: Aria engine: starting recovery
tables to flush: 3 2 1 0
 (0.0 seconds); 
2024-05-08 10:31:19 0 [Note] mariadbd: Aria engine: recovery done
2024-05-08 10:31:19 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2024-05-08 10:31:19 0 [Note] InnoDB: Number of pools: 1
2024-05-08 10:31:19 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
2024-05-08 10:31:19 0 [Note] InnoDB: Initializing buffer pool, total size = 134217728, chunk size = 134217728
2024-05-08 10:31:19 0 [Note] InnoDB: Completed initialization of buffer pool
2024-05-08 10:31:19 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=14370627528336,14370674627487
2024-05-08 10:31:20 0 [Note] InnoDB: 3 transaction(s) which must be rolled back or cleaned up in total 3 row operations to undo
2024-05-08 10:31:20 0 [Note] InnoDB: Trx id counter is 46412788
2024-05-08 10:31:20 0 [Note] InnoDB: To recover: 3282 pages
2024-05-08 10:31:20 0 [Note] InnoDB: 128 rollback segments are active.
2024-05-08 10:31:20 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
2024-05-08 10:31:20 0 [Note] InnoDB: Removed temporary tablespace data file: "./ibtmp1"
2024-05-08 10:31:20 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2024-05-08 10:31:20 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2024-05-08 10:31:20 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2024-05-08 10:31:20 0 [Note] InnoDB: 10.6.16 started; log sequence number 14370682353825; transaction id 46412789
2024-05-08 10:31:20 0 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool
2024-05-08 10:31:20 0 [Note] Plugin 'FEEDBACK' is disabled.
2024-05-08 10:31:20 0 [Note] InnoDB: Cannot open '/var/lib/mysql/ib_buffer_pool' for reading: No such file or directory
2024-05-08 10:31:20 0 [Note] Server socket created on IP: '0.0.0.0'.
2024-05-08 10:31:20 0 [Note] Server socket created on IP: '::'.
2024-05-08 10:31:20 0 [Note] InnoDB: Rolled back recovered transaction 46406898
2024-05-08 10:31:20 0 [Note] InnoDB: Rolled back recovered transaction 46412786
2024-05-08 10:31:20 0 [Note] InnoDB: Rolled back recovered transaction 46412784
2024-05-08 10:31:20 0 [Note] InnoDB: Rollback of non-prepared transactions completed
2024-05-08 10:31:20 1 [Warning] Failed to load slave replication state from table mysql.gtid_slave_pos: 1932: Table 'mysql.gtid_slave_pos' doesn't exist in engine
2024-05-08 10:31:20 0 [Note] /usr/sbin/mariadbd: ready for connections.
Version: '10.6.16-MariaDB-0ubuntu0.22.04.1'  socket: '/run/mysqld/mysqld.sock'  port: 3306  Ubuntu 22.04
2024-05-08 10:31:20 4 [ERROR] InnoDB: Table mysql.innodb_table_stats not found.


2024-05-08 12:05:39 0x7fed1c681640  InnoDB: Assertion failure in file ./storage/innobase/buf/buf0lru.cc line 285
InnoDB: Failing assertion: !block->page.in_file()
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mariadbd startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
InnoDB: about forcing recovery.
240508 12:05:39 [ERROR] mysqld got signal 6 ;
Sorry, we probably made a mistake, and this is a bug.

Your assistance in bug reporting will enable us to fix this for the next release.
To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.

Server version: 10.6.16-MariaDB-0ubuntu0.22.04.1 source revision: 
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=192
max_threads=302
thread_count=192
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 796155 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7febe8000c68
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7fed1c680c38 thread_stack 0x49000
/usr/sbin/mariadbd(my_print_stacktrace+0x32)[0x555904162e82]
/usr/sbin/mariadbd(handle_fatal_signal+0x478)[0x555903c97298]
/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7feda2afe520]
/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7feda2b529fc]
/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7feda2afe476]
/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7feda2ae47f3]
/usr/sbin/mariadbd(+0x66b7ee)[0x55590392d7ee]
/usr/sbin/mariadbd(+0x675564)[0x555903937564]
/usr/sbin/mariadbd(+0xdd1705)[0x555904093705]
/usr/sbin/mariadbd(+0xe07cbb)[0x5559040c9cbb]
/usr/sbin/mariadbd(+0xe0bb5c)[0x5559040cdb5c]
/usr/sbin/mariadbd(+0xe0c320)[0x5559040ce320]
/usr/sbin/mariadbd(+0xdb5199)[0x555904077199]
/usr/sbin/mariadbd(+0xd6fe12)[0x555904031e12]
/usr/sbin/mariadbd(+0xd72c4b)[0x555904034c4b]
/usr/sbin/mariadbd(+0xd746b0)[0x5559040366b0]
/usr/sbin/mariadbd(+0xd53b28)[0x555904015b28]
/usr/sbin/mariadbd(+0xcb9b06)[0x555903f7bb06]
/usr/sbin/mariadbd(_ZN7handler13ha_update_rowEPKhS1_+0x23a)[0x555903ca5f4a]
/usr/sbin/mariadbd(_Z12mysql_updateP3THDP10TABLE_LISTR4ListI4ItemES6_PS4_jP8st_orderybPySA_+0x1adc)[0x555903b31fcc]
/usr/sbin/mariadbd(_Z21mysql_execute_commandP3THDb+0x23b3)[0x555903a68b83]
/usr/sbin/mariadbd(_Z11mysql_parseP3THDPcjP12Parser_state+0x1d8)[0x555903a6b988]
/usr/sbin/mariadbd(_Z16dispatch_command19enum_server_commandP3THDPcjb+0xbcd)[0x555903a6d52d]
/usr/sbin/mariadbd(_Z10do_commandP3THDb+0x131)[0x555903a6f031]
/usr/sbin/mariadbd(_Z24do_handle_one_connectionP7CONNECTb+0x3af)[0x555903b7657f]
/usr/sbin/mariadbd(handle_one_connection+0x5d)[0x555903b768fd]
/usr/sbin/mariadbd(+0xc0fa26)[0x555903ed1a26]
/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x7feda2b50ac3]
/lib/x86_64-linux-gnu/libc.so.6(+0x126850)[0x7feda2be2850]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x7fea9c000070): UPDATE `wp_options` SET `option_value` = 'a:44660:{i:0;b:0;s:32:\"602d...

Connection ID (thread ID): 28743
Status: NOT_KILLED

Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off,hash_join_cardinality=off,cset_narrowing=off

The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mariadbd/ contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /var/lib/mysql
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             514201               514201               processes 
Max open files            32768                32768                files     
Max locked memory         524288               524288               bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       514201               514201               signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
Core pattern: |/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E

Kernel version: Linux version 5.15.0-105-generic (buildd@lcy02-amd64-007) (gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #115-Ubuntu SMP Mon Apr 15 09:52:04 UTC 2024

2024-05-08 12:11:21 0 [Note] Starting MariaDB 10.6.16-MariaDB-0ubuntu0.22.04.1 source revision  as process 864516
2024-05-08 12:11:21 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2024-05-08 12:11:21 0 [Note] InnoDB: Number of pools: 1
2024-05-08 12:11:21 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
2024-05-08 12:11:21 0 [Note] InnoDB: Initializing buffer pool, total size = 134217728, chunk size = 134217728
2024-05-08 12:11:21 0 [Note] InnoDB: Completed initialization of buffer pool
2024-05-08 12:11:21 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=14626521631583,14626555390524
2024-05-08 12:11:21 0 [ERROR] InnoDB: Corrupted page identifier at 14626576696142; set innodb_force_recovery=1 to ignore the record.
2024-05-08 12:11:21 0 [Note] InnoDB: Dump from the start of the mini-transaction (LSN=14626576696142) to 100 bytes after the record: len 117; hex 388... :"9948eb07";a:;
2024-05-08 12:11:21 0 [ERROR] InnoDB: Plugin initialization aborted with error Generic error
2024-05-08 12:11:21 0 [Note] InnoDB: Starting shutdown...
2024-05-08 12:11:21 0 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
2024-05-08 12:11:21 0 [Note] Plugin 'FEEDBACK' is disabled.
2024-05-08 12:11:21 0 [ERROR] Unknown/unsupported storage engine: InnoDB
2024-05-08 12:11:21 0 [ERROR] Aborting

Two things that stood out from this log:

  • Line Query (0x7fea9c000070)...: That was a NavTrans update query
  • Line near bottom  :"9948eb07";a:;: That was an NavTrans update query.

A single oversized NavTrans can crash a filter - Case: 2024.05.10

Dit is site nl_nl moved to a new VPS. Applying this 'dimension filter' causes a database error. This most likely happens because the associated navTransient became too big. So even on a brand new server, navTrans are problematic, even when they don't crash the database server

Is this a separate problem, a symptom or additional data? Maybe.

  • When a single transient becomes too big, the corresponding filter isn't working anymore on the site
  • So, it is not only about the total number of NavTransients. Hence, simply reducing the total number of filters on a site, isn't a sufficient solution
  • This happened on multiple servers and so far, it doesn't seem to be caused by something that's wrong on a server
  • This might also indicate, that these NavTrans are basically a bad idea and that probably not having them at all, is an improvement.
  • The error depicted in the image, was again generated at 2024.05.10-07:59

Additional information

About sizes of various NavTransients:

  • wp transient list --fields=name,expiration --human-readable --search=wc_layered_nav_counts_pa_breedte
  • wp transient get wc_layered_nav_counts_pa_breedte > /home/jeroen/navtrans-breedte.txt → Only 414.7 kB
  • wp transient get wc_layered_nav_counts_pa_breedte --format=json > /home/jeroen/navtrans-breedte-json.txt → Only 183 kB
  • wp transient get wc_layered_nav_counts_pa_submodel > /home/jeroen/navtrans-submodel.txt → 1.2MB - Still working
  • wp transient get wc_layered_nav_counts_productlijn > /home/jeroen/navtrans-productlijn.txt → 420kB

Sizes according to script db_size_monitor:

2024-05-10-08:00 - db: 652.95MB. wp_options (storage): 357M. wp_options: 92.30MB. Tr: 91.01MB. NavTr: 85.88MB. #NavTr: 16 → 5.36MB/NavTr

Cause: Out-of-memory

The PHP log contains endless errors like these:

[10-May-2024 06:17:01 UTC] PHP Fatal error:  Allowed memory size of 1073741824 bytes exhausted (tried to allocate 57989641 bytes) in /var/www/example.nl/wp-includes/class-wpdb.php on line 2349
[10-May-2024 06:17:02 UTC] PHP Fatal error:  Allowed memory size of 1073741824 bytes exhausted (tried to allocate 57975238 bytes) in /var/www/example.nl/wp-includes/class-wpdb.php on line 2349
[10-May-2024 06:17:02 UTC] PHP Fatal error:  Allowed memory size of 1073741824 bytes exhausted (tried to allocate 57975238 bytes) in /var/www/example.nl/wp-includes/class-wpdb.php on line 2349
[10-May-2024 06:17:02 UTC] PHP Fatal error:  Allowed memory size of 1073741824 bytes exhausted (tried to allocate 57975238 bytes) in /var/www/example.nl/wp-includes/class-wpdb.php on line 2349
[10-May-2024 06:17:03 UTC] PHP Fatal error:  Allowed memory size of 1073741824 bytes exhausted (tried to allocate 57989641 bytes) in /var/www/example.nl/wp-includes/class-wpdb.php on line 2349

Cause: Object not loaded in time

The log line below appeared only once in the last 100 lines. My liberal interpretation is, that some object became too big to load in time for execution - I can well imagine this being part of the NavTrans problem discussed here:

[10-May-2024 06:16:36 UTC] PHP Fatal error:  mysqli_real_connect(): Cannot use output buffering in output buffering display handlers in /var/www/example.nl/wp-includes/class-wpdb.php on line 1990
[10-May-2024 06:16:36 UTC] PHP Fatal error:  Uncaught Error: mysqli object is not fully initialized in /var/www/example.nl/wp-includes/class-wpdb.php:1283
Stack trace: ...

Temporary solutions: Delete transients + optimize wp_options

All issues + associated php log lines stop upon execution of this:

wp transient list --fields=name --format=csv | grep wc_layered_nav_counts_pa_ | cut -d',' -f1 | xargs -n 1 wp transient delete

Along the way, it seems to help to avoid other NavTrans-related problems to optimize table wp_options:

optimize table wp_options;

These are both temporary solutions, as they address symptoms, not the underlying problem of NavTrans growing out of whack.

Additional information

How NavTrans bring down a webserver

One mechanism:

  • WooCommerce updates more and more data in transients in table wp_options
  • These updates are so frequent, that table wp_options grow uncontrolably. In november 2023 at a rate of some 2GB/minute
  • Eventually, it takes MySQL too long to lock such a row or field, and locking time-outs occur
  • When locking time-outs occur, those SQL updates don't happen anymore, and the complete SQL update statements, are copied to the PHP errors log
  • Instead of the database, now the PHP error log grows with about 2GB/minute (autumn 2023).

In autumn 2023, a quick fix was, to have each hour these transients removed + table wp_options optimized. I doubt if it's a good idea to do that every hour.

Solutions from ChatGPT

On WordPress/WooCommerce sites, I have a bunch of filters on the shop page. Amongst others, WooCommerce produces transients for these filters that start with the name `wc_layered_nav_counts_pa_`. These transients are stored in table `wp_options`. These transients grow out of proportions and make servers crash and filters not work. How to solve this?

Excessive transients can indeed be a performance issue on WordPress/WooCommerce sites, especially if they're not being managed properly. Here are a few steps you can take to address this problem:

Increase Transient Expiry Time: By default, transients in WordPress expire after a certain period. You can increase the expiry time for these transients so they're not regenerated as frequently. You can do this using the set_transient() function with a longer expiration time. Use a Transient Management Plugin: There are plugins available specifically for managing transients on WordPress sites. These plugins can help you monitor, clean up, and optimize transient data. Manually Delete Unused Transients: If you're comfortable with PHP and WordPress, you can manually delete unused transients from the database. You can use the delete_transient() function to remove transients that are no longer needed. Optimize Queries and Code: Sometimes, excessive transients can be a symptom of inefficient code or database queries. Review your theme and plugin code to ensure that transients are being used appropriately and that queries are optimized for performance. Cache Queries Instead of Using Transients: Depending on your specific use case, caching the results of queries directly may be more efficient than using transients. You can use caching plugins like WP Super Cache or W3 Total Cache to implement query caching. Consider Offloading to a Separate Server: If your site is receiving a large amount of traffic and generating a significant number of transients, you may want to consider offloading some of the workload to a separate server or using a caching service like Redis or Memcached. Monitor Server Resources: Keep an eye on your server's resource usage to identify any issues caused by excessive transients. You may need to upgrade your server resources or optimize your server configuration to handle the load more efficiently.

And specifically for these "NavTrans" transients?

To specifically address the issue with the wc_layered_nav_counts_pa_ transients in WooCommerce, you can take the following steps:

Increase Expiry Time: As with any transient, you can increase the expiry time for these specific transients using the set_transient() function with a longer expiration time. This will reduce the frequency at which these transients are regenerated. Use a Transient Management Plugin: Look for WordPress plugins designed to manage transients, some of which might offer specific tools for handling WooCommerce transients. These plugins can help you monitor and clean up unused transients, including those related to WooCommerce layered navigation. Manually Delete Unused Transients: If you're comfortable working with PHP and MySQL, you can manually delete specific transients from the database. You can identify and delete wc_layered_nav_counts_pa_ transients that are no longer needed. Optimize Layered Navigation: Review your layered navigation setup in WooCommerce. If you have excessive filters or unnecessary variations, consider simplifying or optimizing them to reduce the number of transients being generated. Cache Layered Navigation Results: Instead of relying solely on transients, you can implement caching mechanisms specifically for layered navigation results. This can involve caching the results of queries or pre-generating navigation data to reduce the reliance on transient storage. Monitor and Test: Continuously monitor your site's performance and test changes to ensure they're effective. Keep an eye on server resources, database queries, and overall site speed to identify any remaining issues related to transients.

Questions

  • What causes database crashes?
  • What causes database corruptions?
  • How to solve the NavTrans problem? → Start with the table at NavTrans Problem or do some fresh research.

See also