Lessons Learned During a Recent Magento 2 Deploy
Published: May 17, 2018
I had an interesting Magento 2 deployment experience recently. I learned quite a few things in the process, and wanted to share them here.
“It’s Stuck”
A co-worker was assigned to handle the deploy, but I was on call in case escalation was required. I got a ping from my co-worker that the deploy had been hanging on setup:upgrade for an unusual amount of time.
I got on the server and attached an strace to the process. It was hanging on restart_syscall…
$ strace -p17555 -f
Process 17555 attached
restart_syscall(<... resuming interrupted call ...>
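(For anyone following along: the PID came from the server’s process list. Something like the following would find it, though the exact pgrep pattern depends on how the command was launched.)
$ pgrep -af 'setup:upgrade'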
I did some Googling and the best explanation I could find of how to interpret this is in an answer to the “How do you interpret strace on an apache process returning restart_syscall” Server Fault question…
That line means that the current system call was interrupted by your strace and that strace asked to re-run it. I think that the process you are tracing is sleeping.
Bottom line is it wasn’t a good sign that the strace was showing the process as hung on restart_syscall.
Re-trying the deploy
We agreed to cancel the current deployment and re-attempt it. Maybe it would just magically work the second time around, but more importantly, we could also get an strace going on setup:upgrade before it got stuck.
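In practice that just meant starting the command under strace from the start rather than attaching after the fact — something along these lines, where the flags and output path are my own choices rather than a requirement:
$ strace -f -tt -o /tmp/setup-upgrade.trace php bin/magento setup:upgrade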
Watching The strace - Learning #1
As I watched the strace, one thing I noticed was that an insane amount of time was being spent opening and accessing files in var/ and pub/media.
stat("/var/www/html/var/report/1286546962302", {st_mode=S_IFREG|0664, st_size=10133, ...}) = 0
access("/var/www/html/var/report/1286546962302", W_OK) = 0
lstat("/var/www/html/var/report/1159876265576", {st_mode=S_IFREG|0664, st_size=10133, ...}) = 0
stat("/var/www/html/var/report/1159876265576", {st_mode=S_IFREG|0664, st_size=10133, ...}) = 0
access("/var/www/html/var/report/1159876265576", W_OK) = 0
I did some quick research on this and it led me to the “Why my bin/magento setup:upgrade is really slow” question on the Magento Stack Exchange. One user answered the question by recommending mv-ing pub/media before running setup:upgrade and then mv-ing it back afterwards.
We found out that we can increase the speed by moving the folder pub/media/catalog to temp/media/catalog before running bin/magento setup:upgrade.
This was really intriguing to me and I brought it up with my co-workers.
One of my co-workers responded and let me know that this is because setup:upgrade runs a check on every file in pub/media and var to make sure that the current user has write access. This can be slow, especially over NFS.
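For what it’s worth, the workaround from that Stack Exchange answer boils down to a few lines around the deploy — this is just a sketch, the temp path is arbitrary, and it should stay on the same filesystem so each mv is a cheap rename rather than a copy:
$ mkdir -p temp/media
$ mv pub/media/catalog temp/media/catalog
$ php bin/magento setup:upgrade
$ mv temp/media/catalog pub/media/catalog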
While the mv approach is definitely one way to solve it, he also let me know that there’s a patch that Magento auto-applies when deploying to Commerce Cloud that disables that check.
Upon further investigation I also found the post “Improving performance when upgrading Magento 2 with Elastic File System” by Max Pronko which suggests simply skipping the check for symlinks.
Either way, I’ll be working with my team at Something Digital to review integrating a fix into our deploy process for non-Commerce Cloud projects.
Getting Stuck Again - Learning #2
Eventually the command finished checking pub/media, shortly after which it got stuck again. Fortunately, I had additional diagnostic info from the strace.
Looking at the strace I could see that it had just tried to run a DROP TRIGGER MySQL query…
access("/var/www/html/vendor/composer/../magento/framework/DB/Ddl/Trigger.php", F_OK) = 0
lstat("/var/www/html/vendor/composer/../magento/framework/DB/Ddl/Trigger.php", {st_mode=S_IFREG|0755, st_size=5736, ...}) = 0
open("/var/www/html/vendor/magento/framework/DB/Ddl/Trigger.php", O_RDONLY) = 13
fstat(13, {st_mode=S_IFREG|0755, st_size=5736, ...}) = 0
fstat(13, {st_mode=S_IFREG|0755, st_size=5736, ...}) = 0
fstat(13, {st_mode=S_IFREG|0755, st_size=5736, ...}) = 0
mmap(NULL, 5736, PROT_READ, MAP_SHARED, 13, 0) = 0x7f509c3aa000
munmap(0x7f509c3aa000, 5736) = 0
close(13) = 0
sendto(6, "9\0\0\0\3SELECT `main_table`.* FROM "..., 61, MSG_DONTWAIT, NULL, 0) = 61
poll([{fd=6, events=POLLIN|POLLERR|POLLHUP}], 1, 1471228928) = 1 ([{fd=6, revents=POLLIN}])
recvfrom(6, "\1\0\0\1\6D\0\0\2\3def\tmage_prod\nmain_tab"..., 3072, MSG_DONTWAIT, NULL, NULL) = 1444
sendto(6, "Z\0\0\0\3SELECT `mview_state`.* FROM"..., 94, MSG_DONTWAIT, NULL, 0) = 94
poll([{fd=6, events=POLLIN|POLLERR|POLLHUP}], 1, 1471228928) = 1 ([{fd=6, revents=POLLIN}])
recvfrom(6, "\1\0\0\1\6E\0\0\2\3def\tmage_prod\vmview_st"..., 1628, MSG_DONTWAIT, NULL, NULL) = 449
sendto(6, "h\0\0\0\3SELECT `mview_state`.* FROM"..., 108, MSG_DONTWAIT, NULL, 0) = 108
poll([{fd=6, events=POLLIN|POLLERR|POLLHUP}], 1, 1471228928) = 1 ([{fd=6, revents=POLLIN}])
recvfrom(6, "\1\0\0\1\6E\0\0\2\3def\tmage_prod\vmview_st"..., 1179, MSG_DONTWAIT, NULL, NULL) = 449
sendto(6, "g\0\0\0\3SELECT `mview_state`.* FROM"..., 107, MSG_DONTWAIT, NULL, 0) = 107
poll([{fd=6, events=POLLIN|POLLERR|POLLHUP}], 1, 1471228928) = 1 ([{fd=6, revents=POLLIN}])
recvfrom(6, "\1\0\0\1\6E\0\0\2\3def\tmage_prod\vmview_st"..., 730, MSG_DONTWAIT, NULL, NULL) = 449
sendto(6, "A\0\0\0\3DROP TRIGGER IF EXISTS `trg"..., 69, MSG_DONTWAIT, NULL, 0) = 69
poll([{fd=6, events=POLLIN|POLLERR|POLLHUP}], 1, 1471228928
My next course of action was to connect to the database to check the PROCESSLIST and the ENGINE INNODB STATUS. Here’s what I learned from the PROCESSLIST (note that the query with id 87913191 is truncated for brevity)…
mysql> SHOW FULL PROCESSLIST\G
*************************** 1. row ***************************
Id: 87913191
User: magento_user
Host: magento-web3:36996
db: magento_prod
Command: Query
Time: 56308
State: Creating sort index
Info: /*
Parent SKU - v191
UPC - sku
Type (Parent / Child) - cpe
Title - v71
.
.
.
*/
SELECT
vStyle.`value` AS `parent sku`,
cpe.sku AS `sku`,
cpe.type_id AS `type`,
vTitle.`value` AS `title`,
tDesc.`value` AS `description`,
eBrand.`value` AS `brand name`,
.
.
.
FROM catalog_product_entity AS cpe
JOIN catalog_product_entity_varchar AS vStyle
ON cpe.entity_id = vStyle.row_id AND vStyle.attribute_id = 191 AND vStyle.store_id = 0
LEFT JOIN catalog_product_super_link AS cpsl
ON cpe.entity_id = cpsl.product_id
/*child stuff*/
JOIN catalog_product_entity_varchar AS vTitle
ON vTitle.attribute_id = 71 AND vTitle.store_id = 0 AND ((cpe.row_id = vTitle.row_id AND cpe.type_id = "configurable") OR (cpsl.parent_id = vTitle.row_id AND cpe.type_id = "simple"))
JOIN catalog_product_entity_int AS iBrand
ON iBrand.attribute_id = 81 AND iBrand.store_id = 0 AND ((cpe.row_id = iBrand.row_id AND cpe.type_id = "configurable") OR (cpsl.parent_id = iBrand.row_id AND cpe.type_id = "simple"))
JOIN eav_attribute_option_value AS eBrand
ON iBrand.`value` = eBrand.option_id AND eBrand.store_id = 0
.
.
.
GROUP BY cpe.sku
ORDER BY cpe.sku, mgv.position) AS img
LIMIT 100
Rows_sent: 0
Rows_examined: 1242249
*************************** 2. row ***************************
Id: 88121455
User: magento_user
Host: magento-web3:36530
db: magento_prod
Command: Sleep
Time: 5334
State:
Info: NULL
Rows_sent: 0
Rows_examined: 0
*************************** 3. row ***************************
Id: 88122191
User: magento_user
Host: magento-web3:40442
db: magento_prod
Command: Query
Time: 117
State: Waiting for table metadata lock
Info: DROP TRIGGER IF EXISTS `trg_catalog_product_entity_after_insert`
Rows_sent: 0
Rows_examined: 0
A-ha! So it was waiting on a lock. But what was holding the lock?
I did some quick Googling and found the article “Who Holds the Metadata Lock? MySQL 5.7.3 Brings Help”, which explains that information about what is holding a table metadata lock is only available as of MySQL 5.7.3 and requires a non-standard configuration.
As such I decided to look in more detail at the process list to theorize what might be holding the lock…
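For future reference, in an environment where that feature is available (MySQL 5.7.3+ with the mdl instrument enabled), the check looks roughly like this — I didn’t get to run it here, and the column list is simply what the performance_schema documentation describes:
mysql> UPDATE performance_schema.setup_instruments SET ENABLED = 'YES', TIMED = 'YES' WHERE NAME = 'wait/lock/metadata/sql/mdl';
mysql> SELECT OBJECT_SCHEMA, OBJECT_NAME, LOCK_TYPE, LOCK_STATUS, OWNER_THREAD_ID FROM performance_schema.metadata_locks WHERE OBJECT_NAME = 'catalog_product_entity';
The OWNER_THREAD_ID can then be mapped back to a processlist Id via performance_schema.threads.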
The Strange SELECT query
The first thing that stood out to me was the SELECT query (id 87913191) that had been running for 56308 seconds. A quick seconds-to-hours conversion told me the query had been running for literally almost 16 hours.
It was also SELECT-ing FROM catalog_product_entity, one of the tables at play with the TRIGGER that was being dropped.
But what would be issuing such a strange query, and what might be the side effects if I kill-ed it…?
I grep-ed the code base for “/*child stuff*/”, a comment which was included in the query. Strangely, that string was not present in any files in vendor/ or app/.
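(The grep itself was nothing fancy — something along these lines, run from the Magento root, with -F so the asterisks are treated literally:)
$ grep -rF '/*child stuff*/' app/ vendor/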
Looking more closely at the query, it appeared that it was being used to generate some kind of report. As it was blocking the deploy and we were getting very close to the end of our scheduled release window, I made the executive decision to kill the query.
mysql> kill 87913191;
Immediately I saw my DROP TRIGGER query complete in my strace. Success!
Whodunnit
In the end, we determined that the slow query was run by an employee at the client in order to generate a report he needed. The fact that it had been running for nearly 16 hours when it blocked our deploy is definitely an issue. We’ll be working to either optimize the query or set up a non-production MySQL instance for this type of usage.