[00:04:03] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:04:07] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:07:13] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:07:17] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:37:17] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:37:27] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:38:51] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:39:01] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:12:03] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, 
dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:12:15] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:02:21] (03PS49) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) [03:46:41] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: site=eqsin https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [03:52:07] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) is CRITICAL: Test Get site-specific CSS returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [03:53:41] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [04:16:39] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [04:34:53] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) is CRITICAL: Test Get site-specific CSS returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [04:38:05] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [04:59:35] (03PS1) 10Marostegui: dbproxy1016: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/533371 (https://phabricator.wikimedia.org/T202367) [05:00:15] (03CR) 10Marostegui: [C: 03+2] dbproxy1016: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/533371 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [05:08:33] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Remove db2060 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533372 (https://phabricator.wikimedia.org/T231625) [05:10:18] !log Restart wikibugs [05:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:11:13] 10Operations, 10DBA, 10Patch-For-Review: Decommission db2060.codfw.wmnet - https://phabricator.wikimedia.org/T231625 (10Marostegui) p:05Triage→03Normal [05:11:17] Cool, now it works [05:11:42] (03CR) 10Marostegui: [C: 03+2] db-eqiad,db-codfw.php: Remove db2060 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533372 (https://phabricator.wikimedia.org/T231625) (owner: 10Marostegui) [05:12:04] 10Operations, 10DBA, 10Patch-For-Review: Decommission db2060.codfw.wmnet - https://phabricator.wikimedia.org/T231625 (10Marostegui) [05:12:13] 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Marostegui) [05:12:47] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db2060 from config 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/533372 (https://phabricator.wikimedia.org/T231625) (owner: 10Marostegui) [05:13:04] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db2060 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533372 (https://phabricator.wikimedia.org/T231625) (owner: 10Marostegui) [05:14:12] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db2060 from config T231625 (duration: 00m 53s) [05:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:18] T231625: Decommission db2060.codfw.wmnet - https://phabricator.wikimedia.org/T231625 [05:14:19] PROBLEM - Disk space on alsafi is CRITICAL: DISK CRITICAL - free space: / 338 MB (3% inode=89%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=alsafi&var-datasource=codfw+prometheus/ops [05:14:53] PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [05:15:01] (03PS1) 10Marostegui: mariadb: Decommission db2060 [puppet] - 10https://gerrit.wikimedia.org/r/533374 (https://phabricator.wikimedia.org/T231625) [05:15:39] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db2060 from config T231625 (duration: 00m 53s) [05:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:21:46] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db2060 [puppet] - 10https://gerrit.wikimedia.org/r/533374 (https://phabricator.wikimedia.org/T231625) (owner: 10Marostegui) [05:23:24] 10Operations, 10DBA, 10Patch-For-Review: Decommission db2060.codfw.wmnet - https://phabricator.wikimedia.org/T231625 (10Marostegui) [05:23:44] !log Remove db2060 from tendril and zarcillo - T231625 [05:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:49] T231625: 
Decommission db2060.codfw.wmnet - https://phabricator.wikimedia.org/T231625 [05:25:09] !log Stop MySQL on db2060 - T231625 [05:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:26:32] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2060.codfw.wmnet - https://phabricator.wikimedia.org/T231625 (10Marostegui) a:05Marostegui→03RobH [05:26:47] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2060.codfw.wmnet - https://phabricator.wikimedia.org/T231625 (10Marostegui) This host is ready for #dc-ops to decommission [05:26:57] 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Marostegui) [05:44:21] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:44:31] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:48:31] (03CR) 10Marostegui: [C: 03+1] "We place the socket at /run/mysqld" [puppet] - 10https://gerrit.wikimedia.org/r/533308 (https://phabricator.wikimedia.org/T224828) (owner: 10Bstorm) [05:55:44] 10Operations, 10SRE-Access-Requests: Request access to Analytics cluster for Urbanecm - https://phabricator.wikimedia.org/T231616 (10Urbanecm) [05:57:59] PROBLEM - Host elastic2051 is DOWN: PING CRITICAL - Packet loss = 100% [05:58:28] onimisionipe: ^ [06:00:09] RECOVERY - Host elastic2051 is UP: PING OK - Packet loss = 0%, RTA = 30.25 ms [06:02:55] 10Operations, 10Traffic: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (10Vgutierrez) [06:03:34] 10Operations, 10Traffic: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (10Vgutierrez) 
p:05Triage→03Normal [06:04:33] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:06:43] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:07:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1076 for upgrade - T230785', diff saved to https://phabricator.wikimedia.org/P9006 and previous config saved to /var/cache/conftool/dbconfig/20190830-060702-marostegui.json [06:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:09] T230785: Switchover s2 primary database master db1066 -> db1122 - 17th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230785 [06:07:15] !log Upgrade db1076 [06:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:33] 10Operations, 10SRE-Access-Requests: Request access to Analytics cluster for Urbanecm - https://phabricator.wikimedia.org/T231616 (10Urbanecm) >>! In T231616#5453138, @Krenair wrote: >>>! In T231616#5453129, @Urbanecm wrote: >>>>! In T231616#5453124, @Krenair wrote: >> I think you just need the researcher grou... 
[06:10:45] (03PS1) 10Vgutierrez: ATS: Add known websocket endpoints to the TLS instance mapping rules [puppet] - 10https://gerrit.wikimedia.org/r/533379 (https://phabricator.wikimedia.org/T231627) [06:12:55] (03CR) 10Vgutierrez: "PCC looks happy: https://puppet-compiler.wmflabs.org/compiler1001/18123/" [puppet] - 10https://gerrit.wikimedia.org/r/533379 (https://phabricator.wikimedia.org/T231627) (owner: 10Vgutierrez) [06:15:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1076 after upgrade', diff saved to https://phabricator.wikimedia.org/P9007 and previous config saved to /var/cache/conftool/dbconfig/20190830-061546-marostegui.json [06:15:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:19:44] 10Operations, 10Traffic: Cannot download STL files due to network error - https://phabricator.wikimedia.org/T231422 (10Vgutierrez) I can reproduce the issue, using HTTP/2 and HTTP/1.1 on eqsin: `willikins:~ vgutierrez$ curl https://upload.wikimedia.org/wikipedia/commons/d/de/65-msr-sandaled-foot-5.stl -o /dev/... 
[06:25:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'More traffic to db1076 after upgrade', diff saved to https://phabricator.wikimedia.org/P9008 and previous config saved to /var/cache/conftool/dbconfig/20190830-062517-marostegui.json [06:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:27] RECOVERY - Disk space on alsafi is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=alsafi&var-datasource=codfw+prometheus/ops [06:29:32] 10Operations, 10Traffic: Perform HTTPS redirect without crossing domain boundaries for non canonical domains - https://phabricator.wikimedia.org/T231513 (10Vgutierrez) 05Open→03Resolved [06:37:45] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 133, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:38:31] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:39:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'More traffic to db1076 after upgrade', diff saved to https://phabricator.wikimedia.org/P9009 and previous config saved to /var/cache/conftool/dbconfig/20190830-063949-marostegui.json [06:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:43:59] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:44:43] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:48:39] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 133, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:49:23] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:01:07] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:01:53] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:09:47] Telia intervention is being rather noisy :) [07:10:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'More traffic to db1076 after upgrade', diff saved to https://phabricator.wikimedia.org/P9010 and previous config saved to /var/cache/conftool/dbconfig/20190830-071043-marostegui.json [07:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:18] 10Operations, 10WMF-Legal, 10serviceops: Move old transparency report pages to historical URLs and setup redirect - https://phabricator.wikimedia.org/T230638 (10Dzahn) I am confused by this statement. The content repo is explicitely setup this way so that people working on this site can merge content changes... 
[07:15:59] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:16:45] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 133, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:20:32] 10Operations, 10WMF-Legal, 10serviceops: Move old transparency report pages to historical URLs and setup redirect - https://phabricator.wikimedia.org/T230638 (10Dzahn) All the changes you can see above, including your own, have been deployed to production servers automatically in the past. The puppet code... [07:21:36] (03PS7) 10Vgutierrez: acme_chief: Provide OCSP responses [software/acme-chief] - 10https://gerrit.wikimedia.org/r/530465 (https://phabricator.wikimedia.org/T219765) [07:21:38] (03PS3) 10Vgutierrez: api: Allow acme-chief clients to fetch OCSP responses [software/acme-chief] - 10https://gerrit.wikimedia.org/r/530806 (https://phabricator.wikimedia.org/T219765) [07:23:43] (03CR) 10Vgutierrez: "Thanks for the review!" 
(031 comment) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/530465 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez) [07:29:30] (03PS8) 10Vgutierrez: acme_chief: Provide OCSP responses [software/acme-chief] - 10https://gerrit.wikimedia.org/r/530465 (https://phabricator.wikimedia.org/T219765) [07:29:32] (03PS4) 10Vgutierrez: api: Allow acme-chief clients to fetch OCSP responses [software/acme-chief] - 10https://gerrit.wikimedia.org/r/530806 (https://phabricator.wikimedia.org/T219765) [07:42:39] !log Upgrade db2055 db2071 db2072 db2092 [07:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:11] (03CR) 10Dzahn: [C: 03+2] ATS: remove wikiba.se backend [puppet] - 10https://gerrit.wikimedia.org/r/532976 (https://phabricator.wikimedia.org/T99531) (owner: 10Dzahn) [07:44:20] (03PS2) 10Dzahn: ATS: remove wikiba.se backend [puppet] - 10https://gerrit.wikimedia.org/r/532976 (https://phabricator.wikimedia.org/T99531) [07:55:17] (03PS2) 10KartikMistry: Move ContentTranslation out of Beta in jvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533172 (https://phabricator.wikimedia.org/T231207) [08:00:31] 10Operations, 10Traffic: Cannot download STL files due to network error - https://phabricator.wikimedia.org/T231422 (10ema) p:05Triage→03Normal a:03ema [08:03:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1076 after upgrade', diff saved to https://phabricator.wikimedia.org/P9011 and previous config saved to /var/cache/conftool/dbconfig/20190830-080334-marostegui.json [08:03:37] volans: ^ :p [08:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:22] lol [08:10:15] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10Patch-For-Review, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC - https://phabricator.wikimedia.org/T229657 (10Marostegui) Window reserved on the Deployments page. 
It will happen at the same t... [08:12:29] 10Operations, 10Analytics, 10Traffic: varnishkafka statsv and webrequest crashed on cp1081 - https://phabricator.wikimedia.org/T231331 (10Nuria) 05Open→03Resolved [08:12:34] 10Operations, 10Analytics, 10Traffic: varnishkafka statsv and webrequest crashed on cp1081 - https://phabricator.wikimedia.org/T231331 (10Nuria) Agreed, closing. [08:19:30] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10Patch-For-Review, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC - https://phabricator.wikimedia.org/T229657 (10Marostegui) [08:19:51] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10Patch-For-Review, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC - https://phabricator.wikimedia.org/T229657 (10Marostegui) [08:20:47] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10Patch-For-Review, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC - https://phabricator.wikimedia.org/T229657 (10Marostegui) [08:25:25] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 53.33% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [08:35:24] (03CR) 10Ema: [C: 03+1] trafficserver: fix grafana link [puppet] - 10https://gerrit.wikimedia.org/r/533229 (owner: 10CDanis) [08:35:51] oops [08:35:53] that's my fault [08:35:57] thanks cdanis & ema [08:36:42] vgutierrez: ah! The dashboard was removed and re-created wasn't it? [08:37:04] yeah, I created the ones with support for layers [08:37:10] removed the old ones and renamed the new ones [08:37:12] vgutierrez: that explains everything (see emails to ops@). Totally forgot about that [08:37:59] I guess that doesn't qualify for a t-shirt, right? 
[08:38:03] :_) [08:38:15] not even close [08:40:55] PROBLEM - snapshot of s3 in eqiad on db1115 is CRITICAL: snapshot for s3 at eqiad taken more than 4 days ago: Most recent backup 2019-08-26 08:23:11 https://wikitech.wikimedia.org/wiki/MariaDB/Backups [08:43:30] (03CR) 10Vgutierrez: [C: 03+1] "this has been triggered by me refactoring the dashboards to add layer (tls/backend) support, thanks cdanis!" [puppet] - 10https://gerrit.wikimedia.org/r/533229 (owner: 10CDanis) [08:47:13] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [08:49:49] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1075.eqiad.wmnet,service=ats-be [08:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:07] !log repool ats-be on cp1075 and verify if T231504 is fixed [08:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:12] T231504: Unexpectedly received mobile version of an article while logged out - https://phabricator.wikimedia.org/T231504 [08:59:13] 10Operations, 10Discovery, 10Elasticsearch, 10Discovery-Search (Current work), 10Patch-For-Review: Icinga should alert on free disk space < 15% (now < 12%) on Elasticsearch hosts - https://phabricator.wikimedia.org/T130329 (10Dzahn) a:05Dzahn→03EBernhardson Should we lower it to 10% now? 
[09:03:35] (03PS1) 10Dzahn: ssl: add certificate for planet [puppet] - 10https://gerrit.wikimedia.org/r/533483 (https://phabricator.wikimedia.org/T210411) [09:13:24] (03PS1) 10Dzahn: add fake SSL key for planet.discovery.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/533485 [09:13:40] !log cp1075: upgrade ATS to 8.0.5-1wm4 [09:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:57] (03CR) 10Dzahn: [C: 03+2] ssl: add certificate for planet [puppet] - 10https://gerrit.wikimedia.org/r/533483 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [09:15:08] (03PS2) 10Dzahn: ssl: add certificate for planet [puppet] - 10https://gerrit.wikimedia.org/r/533483 (https://phabricator.wikimedia.org/T210411) [09:15:21] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile- [09:15:21] the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [09:16:39] ema: ^ could that be the upgrade? 
[09:16:53] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [09:16:58] :) [09:18:35] PROBLEM - Host db1074 is DOWN: PING CRITICAL - Packet loss = 100% [09:19:12] jynus: ^^ [09:20:34] !log ema@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1075.eqiad.wmnet,service=ats-be [09:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:04] mutante: yeah, it can very well be, there was a constant rate of 504s after the upgrade :( [09:21:07] (03PS1) 10Volans: netbox: add role spare::system to new VMs [puppet] - 10https://gerrit.wikimedia.org/r/533487 (https://phabricator.wikimedia.org/T223291) [09:21:39] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [09:23:13] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [09:23:42] (03CR) 10Dzahn: [C: 03+1] netbox: add role spare::system to new VMs [puppet] - 10https://gerrit.wikimedia.org/r/533487 (https://phabricator.wikimedia.org/T223291) (owner: 10Volans) [09:23:53] (03CR) 10Volans: [C: 03+2] netbox: add role spare::system to new VMs [puppet] - 10https://gerrit.wikimedia.org/r/533487 (https://phabricator.wikimedia.org/T223291) (owner: 10Volans) [09:24:02] (03PS2) 10Volans: netbox: add role spare::system to new VMs [puppet] - 10https://gerrit.wikimedia.org/r/533487 (https://phabricator.wikimedia.org/T223291) [09:24:30] !log cp1075: depool ats-be due to low but constant 504 rate after 8.0.5-1wm4 upgrade [09:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:39] PROBLEM - MariaDB Slave IO: s2 on db1125 
is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1074.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1074.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [09:24:55] ema: well, but it was only very short, i thought it's just an issue during the upgrade itself in that moment [09:25:53] mutante: unfortunately not, I've seen a few real 504 errors in the logs that didn't happen before the upgrade [09:27:11] hmm.ack [09:27:55] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [09:28:10] oh, but maybe it was unrelated after all ^ [09:28:36] ATS is now depooled on cp1075 [09:29:29] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [09:29:53] doh, https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps is empty :( [09:31:57] PROBLEM - MariaDB Slave Lag: s2 on db1125 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 939.62 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [09:31:57] ah, yea, that is because the URL is generated as Service/Monitoring/$title in puppet [09:32:15] marostegui, jynus: you around? 
[09:33:57] ema: interview [09:33:58] marostegui is currently on an interview [09:34:09] jynus ^^ [09:34:11] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) is CRITICAL: Test Get preview mobile HTML for test page returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for Ap [09:34:11] rned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [09:34:47] does anybody know how to act on these mobileapps endpoints alerts? [09:35:39] can they be related to the database issues? (db1074 down, db1125 lagging) [09:35:47] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [09:38:24] its only happening on one of the scb2 hosts, not on the others [09:39:01] the general "Mobileapps LVS" check did not get triggered [09:39:05] true [09:39:51] well in icinga.log I see: [09:39:53] [1567139501] SERVICE ALERT: scb2001;mobileapps endpoints health;CRITICAL;SOFT;1;/{domain}/v1/page/media/{title} (Get media in test page) is CRITICAL [09:40:03] and that timestamp is Fri 30 Aug 2019 06:31:41 AM CEST [09:41:02] they stayed SOFT because they recovered within the time it takes to check 3 times [09:41:08] then we won't see them on IRC [09:41:15] right [09:42:01] db1074 being down is the issue, db1125 replicates from there [09:43:02] apergos: ah, thanks! should we try restarting it or wait for DBAs? 
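Note on the 09:39-09:41 exchange above: the icinga.log line quoted there and the "stayed SOFT" explanation can be illustrated with a small sketch. It assumes the usual Nagios/Icinga 1 log layout (`[epoch] SERVICE ALERT: host;service;state;state_type;attempt;output`) and a retry limit of 3, taken from "the time it takes to check 3 times". `parse_service_alert`, `would_notify` and `MAX_CHECK_ATTEMPTS` are hypothetical names for illustration, not part of Icinga, and the state handling is deliberately simplified.

```python
from datetime import datetime, timezone

# Assumption: 3 attempts before a problem becomes HARD, per the chat above.
MAX_CHECK_ATTEMPTS = 3


def parse_service_alert(line):
    """Split one 'SERVICE ALERT' line from icinga.log into its fields."""
    stamp, rest = line.split("] ", 1)
    epoch = int(stamp.lstrip("["))
    kind, payload = rest.split(": ", 1)
    # The plugin output may itself contain ';', so split at most 5 times.
    host, service, state, state_type, attempt, output = payload.split(";", 5)
    return {
        "time_utc": datetime.fromtimestamp(epoch, tz=timezone.utc),
        "kind": kind,
        "host": host,
        "service": service,
        "state": state,
        "state_type": state_type,
        "attempt": int(attempt),
        "output": output,
    }


def would_notify(results, max_attempts=MAX_CHECK_ATTEMPTS):
    """Return True once consecutive non-OK results reach max_attempts.

    Simplified model: a problem stays SOFT (no IRC notification) until it
    has failed max_attempts checks in a row; any OK result resets the count.
    """
    failures = 0
    for state in results:
        if state == "OK":
            failures = 0
            continue
        failures += 1
        if failures >= max_attempts:
            return True
    return False


if __name__ == "__main__":
    line = ("[1567139501] SERVICE ALERT: scb2001;mobileapps endpoints health;"
            "CRITICAL;SOFT;1;/{domain}/v1/page/media/{title} "
            "(Get media in test page) is CRITICAL")
    alert = parse_service_alert(line)
    print(alert["time_utc"].isoformat(), alert["host"], alert["state_type"])
    print(would_notify(["CRITICAL", "CRITICAL", "OK"]))
```

Under this model, a check that fails twice and then recovers never reaches the HARD state, which would be consistent with those alerts showing up in icinga.log but not on IRC.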
[09:43:12] we don't want to restart mysql over there even if the host were to ome back up, a dba needs to look at it [09:43:23] *come back up [09:43:38] I meant restarting the host (db1074) to see if it comes back online [09:43:50] that would be ok, yes [09:44:00] mysql won't auto-start there? [09:44:13] mysql isn't set to autostart after reboot, generally [09:44:17] this is intentional [09:44:19] ack [09:45:22] 4000 errors per minute [09:45:36] that is an improvement over the 2 million ones per minute we used to have [09:45:47] jynus: ok to try power-cycle db1074? I'm in console but there's nothing to see [09:45:56] what is the ssh status? [09:46:13] or ipmi? [09:46:14] the host is down [09:46:28] (03PS1) 10Dzahn: add discovery name for planet [dns] - 10https://gerrit.wikimedia.org/r/533491 [09:46:30] let's check hw logs first [09:46:31] db1115 is tendril so it's annoying but won't break the wikis if there's lag [09:46:46] jynus: ack I'll leave it to you then [09:47:05] mysql crashing is normal due to a hw failure [09:47:08] the host crashing is not [09:48:21] in any case, this is not an outage [09:48:25] we can take time [09:48:34] it is only a redundancy degradation [09:50:27] if someone can depool db1074, however, it will speed me up [09:50:39] (m*nuel is in an interview) [09:51:23] "Smart Storage Battery failure (Battery 1, service information: 0x0A). Action: Gather AHS log and contact Support" [09:52:00] how can a battery failure lead to a crash? [09:52:26] maybe that's not the cause, just another issue [09:53:23] I think it may have caused a power issue [09:53:44] instead of just pass through power [09:54:06] but even disks going down should have made mysql crash, not the os [09:54:31] ok, I will depool the host [09:54:55] need to check the runbook, never done it before [09:55:37] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_slave [09:57:13] how can I help? 
[09:57:48] !log jynus@cumin1001 dbctl commit (dc=all): 'Depool db1074 after crash', diff saved to https://phabricator.wikimedia.org/P9013 and previous config saved to /var/cache/conftool/dbconfig/20190830-095747-jynus.json [09:57:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:11] not sure that is ok, what about the other loads? [09:58:32] (03PS2) 10Dzahn: add discovery name for planet [dns] - 10https://gerrit.wikimedia.org/r/533491 [09:58:43] what do you mean? [09:58:51] (03CR) 10Dzahn: [C: 03+2] add discovery name for planet [dns] - 10https://gerrit.wikimedia.org/r/533491 (owner: 10Dzahn) [09:58:56] it was removed from both the main load and the groups it is in [09:59:10] volans: that wasn't shown on the diff [09:59:19] the other thing I would report is the: [09:59:25] WARNING:etcd.client:etcd response did not contain a cluster ID [09:59:29] it should, it's in the phabricator paste with the diff [09:59:31] Backend error: The request requires user authentication : Insufficient credentials [09:59:36] sudo -i [09:59:44] see https://phabricator.wikimedia.org/P9013 [09:59:46] for the diff [10:00:31] yeah, I got that, otherwise it wouldn't have gone through [10:03:52] (03PS1) 10Jcrespo: mariadb: Depool db1074 after crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533492 [10:04:59] (03CR) 10Jcrespo: [C: 03+2] mariadb: Depool db1074 after crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533492 (owner: 10Jcrespo) [10:05:53] (03Merged) 10jenkins-bot: mariadb: Depool db1074 after crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533492 (owner: 10Jcrespo) [10:06:06] I don't think we've been mirroring that anymore for a while [10:06:08] but not 100% sure [10:06:39] (03CR) 10jenkins-bot: mariadb: Depool db1074 after crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533492 (owner: 10Jcrespo) [10:08:25] db-codfw.php: Promote db2129 to s6 codfw master Tue Aug 27 09:12:32 2019 +0200 [10:08:56] (03PS1) 
Dzahn: planet: add Hiera keys and include class vor envoy [puppet] - https://gerrit.wikimedia.org/r/533493 (https://phabricator.wikimedia.org/T210411) [10:10:15] so things that help us: creating a ticket (maybe someone did already?) [10:10:44] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Mirror dbctl depool of db1074 (duration: 00m 55s) [10:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:00] just to be clear, I am not saying you should do that- just answering what things one can do without specific knowledge of db later :-D [10:12:36] RECOVERY - snapshot of s3 in eqiad on db1115 is OK: snapshot for s3 at eqiad taken less than 4 days ago and larger than 90 GB: Last one 2019-08-30 08:20:42 from db1095.eqiad.wmnet:3313 (828 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [10:13:26] ^that is the unrelated issue I mentioned above, nothing to do with db1074 [10:13:48] (PS1) Ema: ATS: log Cookies [puppet] - https://gerrit.wikimedia.org/r/533494 (https://phabricator.wikimedia.org/T227432) [10:14:06] Operations, DBA: db1074 crashed - https://phabricator.wikimedia.org/T231638 (Dzahn) [10:14:11] there you go jynus [10:14:16] thanks daniel, I was about to do it [10:14:39] again, that is a favour you do to me, not complaining [10:14:40] hey [10:14:43] I am back from the interview [10:14:48] anything I can help with? [10:14:58] nothing urgent at the moment [10:15:11] we have to put db1074 back up and debug [10:15:48] I can take care of that, you want me to do that or take care of something else? [10:17:04] definitely will need help, maybe checking sanitarium [10:17:12] and see what is the best way to proceed with that? [10:17:39] sure, is db1074 hard down and unrecoverable?
[10:18:10] Operations, DBA: db1074 crashed - https://phabricator.wikimedia.org/T231638 (Dzahn) [10:18:18] no, I was going to put it up now [10:19:10] maybe you want to wait to see what is the state after that [10:20:13] yeah, exactly [10:20:20] let's see how it comes back [10:20:37] I will leave you to run all the CLI there [10:21:09] So we don't touch stuff at the same time :) [10:21:41] double check with me it is db1074 we want to reboot, right? [10:22:44] correct [10:22:47] db1074 doesn't respond to ping [10:22:50] so go for it [10:23:40] !log resetting db1074 from iLo [10:24:17] marostegui: still around after netsplit on irc? [10:24:21] yep [10:24:23] (I think!) [10:24:32] jynus: can you read me? [10:24:33] yeah, you are still here to me [10:24:37] cool [10:24:38] I can read you [10:24:51] I have disabled notifications on db1074 [10:25:08] log will not go through, I guess, however [10:25:30] I can do that for you on the task [10:25:57] Operations, DBA: db1074 crashed - https://phabricator.wikimedia.org/T231638 (Marostegui) [10:23:40] !log resetting db1074 from iLo [10:26:09] Operations, DBA: db1074 crashed - https://phabricator.wikimedia.org/T231638 (jcrespo) On reboot: ` 313-HPE Smart Storage Battery 1 Failure - Battery Shutdown Event Code: 0x0400. Action: Restart system. Contact HPE support if condition persists. 
Important information available or errors detected ` [10:26:51] ^it is booting, though, that was the same thing as in the health logs [10:27:10] the support expired a few months ago [10:27:27] (PS2) Dzahn: add fake SSL key for planet.discovery.wmnet [labs/private] - https://gerrit.wikimedia.org/r/533485 [10:27:42] (CR) Dzahn: [V: +2 C: +2] add fake SSL key for planet.discovery.wmnet [labs/private] - https://gerrit.wikimedia.org/r/533485 (owner: Dzahn) [10:28:01] (PS1) Dzahn: ATS/varnish: switch planet to discovery name, disable codfw backend [puppet] - https://gerrit.wikimedia.org/r/533495 (https://phabricator.wikimedia.org/T210411) [10:28:29] 86/PAT: hpasmlited:698 map pfn expected mapping type uncached-minus for [mem 0x79170000-0x79172fff], got write-back [10:29:15] disks themselves seem to be ok [10:29:27] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/18125/planet1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/533493 (https://phabricator.wikimedia.org/T210411) (owner: Dzahn) [10:29:43] marostegui: do you know what is the s2 candidate master? [10:29:51] it is not db1074, right? [10:30:03] yep, db1122 [10:30:05] Operations, DBA: db1074 crashed - https://phabricator.wikimedia.org/T231638 (Marostegui) >>! In T231638#5453726, @jcrespo wrote: > On reboot: > ` > 313-HPE Smart Storage Battery 1 Failure - Battery Shutdown Event Code: 0x0400. > Action: Restart system. Contact HPE support if condition persists. > > >... [10:30:10] (CR) Dzahn: "i checked nothing is listening on 443 on planet1001 right now" [puppet] - https://gerrit.wikimedia.org/r/533493 (https://phabricator.wikimedia.org/T210411) (owner: Dzahn) [10:30:12] s2 will be failed over in 2 weeks to that new host [10:30:17] ok [10:30:43] forcing all icinga checks before touching mysql [10:31:04] jynus: did you check ilo logs, or you want me to?
[10:31:17] I did on web iface [10:31:21] but not copied it [10:31:31] ok, I will do that [10:31:32] it only said the same as on reboot: battery [10:31:33] Operations, DBA: db1074 crashed - https://phabricator.wikimedia.org/T231638 (Marostegui) [10:31:36] you can focus on the host itself [10:31:47] if you want to do it from ssh and copy them that would be great [10:31:53] yep, doing it [10:32:32] Battery count: 0 [10:32:37] great [10:32:56] ferm also failed to start, but that may be unrelated [10:33:32] started it manually now [10:33:52] I would start mysql with no replication and consider failing it over to codfw [10:34:03] (the sanitarium replication) [10:34:23] let's start it without replication for now yeah [10:34:33] without a battery either it would not catch up or would be an unreliable replica [10:34:37] we can maybe enable replication till the point we can stop that host with codfw master and then change it [10:34:55] oh, we can easily get the coords as long as it is up [10:35:15] Operations, DBA: db1074 crashed - https://phabricator.wikimedia.org/T231638 (Marostegui) BBU is broken: ` description=Smart Storage Battery failure (Battery 1, service information: 0x0A). Action: Gather AHS log and contact Support Verbs cd vers description=POST Error: 313-HPE Smart Storage... [10:36:07] Operations, DBA: db1074 crashed: Broken BBU - https://phabricator.wikimedia.org/T231638 (Marostegui) [10:36:23] InnoDB: Starting crash recovery from checkpoint LSN=24770133479800 [10:36:44] recovery seems clean [10:36:49] nice [10:37:33] oh, I just realized you meant replication on sanitarium [10:37:43] not db1074, you are absolutely right [10:37:49] sorry I misunderstood [10:37:53] no worries :) [10:38:11] db1074 is up and stopped [10:39:07] db1066-bin.001234:4 ?
[10:39:08] RECOVERY - MariaDB Slave IO: s2 on db1125 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [10:39:11] checking [10:39:17] too much coincidence it is at the start of a log? [10:39:46] db1066-bin.001234:4 seems to be yes [10:40:08] let's see db1125 [10:40:35] the master is at: | db1066-bin.001234 | 234395014 | at the moment, so it is a coincidence yes [10:40:52] db1125 started replicating [10:41:00] yeah, it caught up [10:41:06] let me see what position it was at when its master crashed [10:41:22] db1074-bin.004422:456 [10:45:20] the last executed statement seems to be: on nlwiki.page page_id=4810735 [10:45:43] That was run on db1074 and also replicated to sanitarium [10:45:59] And that is db1074-bin.004421:139305511 [10:46:29] I can see also lots of information_schema_p activity [10:46:58] any suggestion on how to proceed? [10:47:11] we should probably look for that event on sanitarium codfw [10:47:15] and start replication from there [10:47:22] ok [10:47:34] double check what I posted above [10:47:39] to make sure we are looking for the right event [10:48:03] do you have a gtid for me? [10:48:16] one sec and I will have it [10:51:05] 0-180359173-4858865027,171966574-171966574-2132058081,171966668-171966668-2580,171966670-171966670-2410812544,171970567-171970567-390719906,171978766-171978766-732975,180359173-180359173-70817914,180359241-180359241-121693516 [10:51:06] XDDD [10:51:08] Let me try to filter that [10:51:20] PROBLEM - HHVM rendering on mw1290 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:51:21] yeah, I only need the master id [10:51:33] (PS2) Vgutierrez: prometheus: Add basic ATS network and ssl metrics [puppet] - https://gerrit.wikimedia.org/r/533193 (https://phabricator.wikimedia.org/T231533) [10:52:01] don't worry, 10.5 will fix that!
[10:52:05] 171966668-171966668-2580 I think [10:52:08] But let me confirm [10:52:23] it is ok, I will work with that and reverse your setops [10:52:42] Operations, ops-eqiad: Degraded RAID on db1074 - https://phabricator.wikimedia.org/T231639 (ops-monitoring-bot) [10:52:52] RECOVERY - HHVM rendering on mw1290 is OK: HTTP OK: HTTP/1.1 200 OK - 79470 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:53:08] Operations, ops-eqiad: Degraded RAID on db1074 - https://phabricator.wikimedia.org/T231639 (Marostegui) [10:53:12] Operations, DBA: db1074 crashed: Broken BBU - https://phabricator.wikimedia.org/T231638 (Marostegui) [10:55:09] jynus: https://phabricator.wikimedia.org/P9014 [10:55:31] so that list (from the binlog) matches: 171966668-171966668-2580 which is the slave_pos output from db1125 [10:55:45] thanks [10:55:46] let's see if we reach the same conclusion [10:56:36] give me some time, I want to also check the previous transactions [10:56:39] sure [10:57:01] From my investigation the last executed one was an UPDATE on nlwiki.page for the page_id= 4810735 [10:57:05] and that one reached db1125 [10:57:14] after that, there is nothing apart from the binlog rotation [10:57:21] that's what I have seen for now [10:59:16] 171966668 is db1074 [10:59:50] yes [10:59:55] we are more interested in 171966574 ids [11:00:12] db1066?
[11:00:19] the original master, yes [11:00:23] I have that too [11:00:28] 171966574-171966574-2132058081 is what I saw [11:00:39] it is possible that we had additional writes [11:00:52] or that those get overwritten on format change [11:00:52] that is the last GTID from the master on db1074 [11:03:08] I am trying to correlate that GTID with db2094 to see if before that one we have the same update on nlwiki.page [11:03:32] db2094/db2095 [11:04:39] I confirm that the last write I get is UPDATE `nlwiki`.`page` @1=4810735 [11:04:42] before the crash [11:04:51] same as I did [11:05:10] I am now correlating 171966574-171966574-2132058081 on codfw, to see if that is the GTID after the nlwiki update [11:05:12] I would like to check if there were additional writes on replication start [11:05:53]