[00:04:03] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:04:07] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:07:13] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:07:17] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:37:17] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:37:27] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:38:51] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:39:01] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:12:03] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, 
dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:12:15] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [03:02:21] (03PS49) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) [03:46:41] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: site=eqsin https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [03:52:07] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) is CRITICAL: Test Get site-specific CSS returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [03:53:41] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [04:16:39] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [04:34:53] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) is CRITICAL: Test Get site-specific CSS returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [04:38:05] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [04:59:35] (03PS1) 10Marostegui: dbproxy1016: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/533371 (https://phabricator.wikimedia.org/T202367) [05:00:15] (03CR) 10Marostegui: [C: 03+2] dbproxy1016: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/533371 (https://phabricator.wikimedia.org/T202367) (owner: 10Marostegui) [05:08:33] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Remove db2060 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533372 (https://phabricator.wikimedia.org/T231625) [05:10:18] !log Restart wikibugs [05:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:11:13] 10Operations, 10DBA, 10Patch-For-Review: Decommission db2060.codfw.wmnet - https://phabricator.wikimedia.org/T231625 (10Marostegui) p:05Triage→03Normal [05:11:17] Cool, now it works [05:11:42] (03CR) 10Marostegui: [C: 03+2] db-eqiad,db-codfw.php: Remove db2060 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533372 (https://phabricator.wikimedia.org/T231625) (owner: 10Marostegui) [05:12:04] 10Operations, 10DBA, 10Patch-For-Review: Decommission db2060.codfw.wmnet - https://phabricator.wikimedia.org/T231625 (10Marostegui) [05:12:13] 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Marostegui) [05:12:47] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db2060 from config 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/533372 (https://phabricator.wikimedia.org/T231625) (owner: 10Marostegui) [05:13:04] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db2060 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533372 (https://phabricator.wikimedia.org/T231625) (owner: 10Marostegui) [05:14:12] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db2060 from config T231625 (duration: 00m 53s) [05:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:18] T231625: Decommission db2060.codfw.wmnet - https://phabricator.wikimedia.org/T231625 [05:14:19] PROBLEM - Disk space on alsafi is CRITICAL: DISK CRITICAL - free space: / 338 MB (3% inode=89%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=alsafi&var-datasource=codfw+prometheus/ops [05:14:53] PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [05:15:01] (03PS1) 10Marostegui: mariadb: Decommission db2060 [puppet] - 10https://gerrit.wikimedia.org/r/533374 (https://phabricator.wikimedia.org/T231625) [05:15:39] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db2060 from config T231625 (duration: 00m 53s) [05:15:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:21:46] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission db2060 [puppet] - 10https://gerrit.wikimedia.org/r/533374 (https://phabricator.wikimedia.org/T231625) (owner: 10Marostegui) [05:23:24] 10Operations, 10DBA, 10Patch-For-Review: Decommission db2060.codfw.wmnet - https://phabricator.wikimedia.org/T231625 (10Marostegui) [05:23:44] !log Remove db2060 from tendril and zarcillo - T231625 [05:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:23:49] T231625: 
Decommission db2060.codfw.wmnet - https://phabricator.wikimedia.org/T231625 [05:25:09] !log Stop MySQL on db2060 - T231625 [05:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:26:32] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2060.codfw.wmnet - https://phabricator.wikimedia.org/T231625 (10Marostegui) a:05Marostegui→03RobH [05:26:47] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2060.codfw.wmnet - https://phabricator.wikimedia.org/T231625 (10Marostegui) This host is ready for #dc-ops to decommission [05:26:57] 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Marostegui) [05:44:21] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:44:31] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:48:31] (03CR) 10Marostegui: [C: 03+1] "We place the socket at /run/mysqld" [puppet] - 10https://gerrit.wikimedia.org/r/533308 (https://phabricator.wikimedia.org/T224828) (owner: 10Bstorm) [05:55:44] 10Operations, 10SRE-Access-Requests: Request access to Analytics cluster for Urbanecm - https://phabricator.wikimedia.org/T231616 (10Urbanecm) [05:57:59] PROBLEM - Host elastic2051 is DOWN: PING CRITICAL - Packet loss = 100% [05:58:28] onimisionipe: ^ [06:00:09] RECOVERY - Host elastic2051 is UP: PING OK - Packet loss = 0%, RTA = 30.25 ms [06:02:55] 10Operations, 10Traffic: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (10Vgutierrez) [06:03:34] 10Operations, 10Traffic: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (10Vgutierrez) 
p:05Triage→03Normal [06:04:33] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:06:43] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:07:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1076 for upgrade - T230785', diff saved to https://phabricator.wikimedia.org/P9006 and previous config saved to /var/cache/conftool/dbconfig/20190830-060702-marostegui.json [06:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:09] T230785: Switchover s2 primary database master db1066 -> db1122 - 17th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230785 [06:07:15] !log Upgrade db1076 [06:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:33] 10Operations, 10SRE-Access-Requests: Request access to Analytics cluster for Urbanecm - https://phabricator.wikimedia.org/T231616 (10Urbanecm) >>! In T231616#5453138, @Krenair wrote: >>>! In T231616#5453129, @Urbanecm wrote: >>>>! In T231616#5453124, @Krenair wrote: >> I think you just need the researcher grou... 
[06:10:45] (03PS1) 10Vgutierrez: ATS: Add known websocket endpoints to the TLS instance mapping rules [puppet] - 10https://gerrit.wikimedia.org/r/533379 (https://phabricator.wikimedia.org/T231627) [06:12:55] (03CR) 10Vgutierrez: "PCC looks happy: https://puppet-compiler.wmflabs.org/compiler1001/18123/" [puppet] - 10https://gerrit.wikimedia.org/r/533379 (https://phabricator.wikimedia.org/T231627) (owner: 10Vgutierrez) [06:15:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1076 after upgrade', diff saved to https://phabricator.wikimedia.org/P9007 and previous config saved to /var/cache/conftool/dbconfig/20190830-061546-marostegui.json [06:15:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:19:44] 10Operations, 10Traffic: Cannot download STL files due to network error - https://phabricator.wikimedia.org/T231422 (10Vgutierrez) I can reproduce the issue, using HTTP/2 and HTTP/1.1 on eqsin: `willikins:~ vgutierrez$ curl https://upload.wikimedia.org/wikipedia/commons/d/de/65-msr-sandaled-foot-5.stl -o /dev/... 
[06:25:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'More traffic to db1076 after upgrade', diff saved to https://phabricator.wikimedia.org/P9008 and previous config saved to /var/cache/conftool/dbconfig/20190830-062517-marostegui.json [06:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:27] RECOVERY - Disk space on alsafi is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=alsafi&var-datasource=codfw+prometheus/ops [06:29:32] 10Operations, 10Traffic: Perform HTTPS redirect without crossing domain boundaries for non canonical domains - https://phabricator.wikimedia.org/T231513 (10Vgutierrez) 05Open→03Resolved [06:37:45] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 133, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:38:31] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:39:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'More traffic to db1076 after upgrade', diff saved to https://phabricator.wikimedia.org/P9009 and previous config saved to /var/cache/conftool/dbconfig/20190830-063949-marostegui.json [06:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:43:59] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:44:43] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:48:39] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 133, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:49:23] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:01:07] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:01:53] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:09:47] Telia intervention is being rather noisy :) [07:10:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'More traffic to db1076 after upgrade', diff saved to https://phabricator.wikimedia.org/P9010 and previous config saved to /var/cache/conftool/dbconfig/20190830-071043-marostegui.json [07:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:18] 10Operations, 10WMF-Legal, 10serviceops: Move old transparency report pages to historical URLs and setup redirect - https://phabricator.wikimedia.org/T230638 (10Dzahn) I am confused by this statement. The content repo is explicitely setup this way so that people working on this site can merge content changes... 
[07:15:59] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:16:45] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 133, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:20:32] 10Operations, 10WMF-Legal, 10serviceops: Move old transparency report pages to historical URLs and setup redirect - https://phabricator.wikimedia.org/T230638 (10Dzahn) All the changes you can see above, including your own, have been deployed to production servers automatically in the past. The puppet code... [07:21:36] (03PS7) 10Vgutierrez: acme_chief: Provide OCSP responses [software/acme-chief] - 10https://gerrit.wikimedia.org/r/530465 (https://phabricator.wikimedia.org/T219765) [07:21:38] (03PS3) 10Vgutierrez: api: Allow acme-chief clients to fetch OCSP responses [software/acme-chief] - 10https://gerrit.wikimedia.org/r/530806 (https://phabricator.wikimedia.org/T219765) [07:23:43] (03CR) 10Vgutierrez: "Thanks for the review!" 
(031 comment) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/530465 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez) [07:29:30] (03PS8) 10Vgutierrez: acme_chief: Provide OCSP responses [software/acme-chief] - 10https://gerrit.wikimedia.org/r/530465 (https://phabricator.wikimedia.org/T219765) [07:29:32] (03PS4) 10Vgutierrez: api: Allow acme-chief clients to fetch OCSP responses [software/acme-chief] - 10https://gerrit.wikimedia.org/r/530806 (https://phabricator.wikimedia.org/T219765) [07:42:39] !log Upgrade db2055 db2071 db2072 db2092 [07:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:11] (03CR) 10Dzahn: [C: 03+2] ATS: remove wikiba.se backend [puppet] - 10https://gerrit.wikimedia.org/r/532976 (https://phabricator.wikimedia.org/T99531) (owner: 10Dzahn) [07:44:20] (03PS2) 10Dzahn: ATS: remove wikiba.se backend [puppet] - 10https://gerrit.wikimedia.org/r/532976 (https://phabricator.wikimedia.org/T99531) [07:55:17] (03PS2) 10KartikMistry: Move ContentTranslation out of Beta in jvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533172 (https://phabricator.wikimedia.org/T231207) [08:00:31] 10Operations, 10Traffic: Cannot download STL files due to network error - https://phabricator.wikimedia.org/T231422 (10ema) p:05Triage→03Normal a:03ema [08:03:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1076 after upgrade', diff saved to https://phabricator.wikimedia.org/P9011 and previous config saved to /var/cache/conftool/dbconfig/20190830-080334-marostegui.json [08:03:37] volans: ^ :p [08:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:22] lol [08:10:15] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10Patch-For-Review, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC - https://phabricator.wikimedia.org/T229657 (10Marostegui) Window reserved on the Deployments page. 
It will happen at the same t... [08:12:29] 10Operations, 10Analytics, 10Traffic: varnishkafka statsv and webrequest crashed on cp1081 - https://phabricator.wikimedia.org/T231331 (10Nuria) 05Open→03Resolved [08:12:34] 10Operations, 10Analytics, 10Traffic: varnishkafka statsv and webrequest crashed on cp1081 - https://phabricator.wikimedia.org/T231331 (10Nuria) Agreed, closing. [08:19:30] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10Patch-For-Review, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC - https://phabricator.wikimedia.org/T229657 (10Marostegui) [08:19:51] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10Patch-For-Review, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC - https://phabricator.wikimedia.org/T229657 (10Marostegui) [08:20:47] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10Patch-For-Review, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC - https://phabricator.wikimedia.org/T229657 (10Marostegui) [08:25:25] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 53.33% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [08:35:24] (03CR) 10Ema: [C: 03+1] trafficserver: fix grafana link [puppet] - 10https://gerrit.wikimedia.org/r/533229 (owner: 10CDanis) [08:35:51] oops [08:35:53] that's my fault [08:35:57] thanks cdanis & ema [08:36:42] vgutierrez: ah! The dashboard was removed and re-created wasn't it? [08:37:04] yeah, I created the ones with support for layers [08:37:10] removed the old ones and renamed the new ones [08:37:12] vgutierrez: that explains everything (see emails to ops@). Totally forgot about that [08:37:59] I guess that doesn't qualify for a t-shirt, right? 
[08:38:03] :_) [08:38:15] not even close [08:40:55] PROBLEM - snapshot of s3 in eqiad on db1115 is CRITICAL: snapshot for s3 at eqiad taken more than 4 days ago: Most recent backup 2019-08-26 08:23:11 https://wikitech.wikimedia.org/wiki/MariaDB/Backups [08:43:30] (03CR) 10Vgutierrez: [C: 03+1] "this has been triggered by me refactoring the dashboards to add layer (tls/backend) support, thanks cdanis!" [puppet] - 10https://gerrit.wikimedia.org/r/533229 (owner: 10CDanis) [08:47:13] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [08:49:49] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1075.eqiad.wmnet,service=ats-be [08:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:07] !log repool ats-be on cp1075 and verify if T231504 is fixed [08:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:12] T231504: Unexpectedly received mobile version of an article while logged out - https://phabricator.wikimedia.org/T231504 [08:59:13] 10Operations, 10Discovery, 10Elasticsearch, 10Discovery-Search (Current work), 10Patch-For-Review: Icinga should alert on free disk space < 15% (now < 12%) on Elasticsearch hosts - https://phabricator.wikimedia.org/T130329 (10Dzahn) a:05Dzahn→03EBernhardson Should we lower it to 10% now? 
[09:03:35] (03PS1) 10Dzahn: ssl: add certificate for planet [puppet] - 10https://gerrit.wikimedia.org/r/533483 (https://phabricator.wikimedia.org/T210411) [09:13:24] (03PS1) 10Dzahn: add fake SSL key for planet.discovery.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/533485 [09:13:40] !log cp1075: upgrade ATS to 8.0.5-1wm4 [09:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:57] (03CR) 10Dzahn: [C: 03+2] ssl: add certificate for planet [puppet] - 10https://gerrit.wikimedia.org/r/533483 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [09:15:08] (03PS2) 10Dzahn: ssl: add certificate for planet [puppet] - 10https://gerrit.wikimedia.org/r/533483 (https://phabricator.wikimedia.org/T210411) [09:15:21] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile- [09:15:21] the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [09:16:39] ema: ^ could that be the upgrade? 
[09:16:53] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [09:16:58] :) [09:18:35] PROBLEM - Host db1074 is DOWN: PING CRITICAL - Packet loss = 100% [09:19:12] jynus: ^^ [09:20:34] !log ema@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1075.eqiad.wmnet,service=ats-be [09:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:04] mutante: yeah, it can very well be, there was a constant rate of 504s after the upgrade :( [09:21:07] (03PS1) 10Volans: netbox: add role spare::system to new VMs [puppet] - 10https://gerrit.wikimedia.org/r/533487 (https://phabricator.wikimedia.org/T223291) [09:21:39] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [09:23:13] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [09:23:42] (03CR) 10Dzahn: [C: 03+1] netbox: add role spare::system to new VMs [puppet] - 10https://gerrit.wikimedia.org/r/533487 (https://phabricator.wikimedia.org/T223291) (owner: 10Volans) [09:23:53] (03CR) 10Volans: [C: 03+2] netbox: add role spare::system to new VMs [puppet] - 10https://gerrit.wikimedia.org/r/533487 (https://phabricator.wikimedia.org/T223291) (owner: 10Volans) [09:24:02] (03PS2) 10Volans: netbox: add role spare::system to new VMs [puppet] - 10https://gerrit.wikimedia.org/r/533487 (https://phabricator.wikimedia.org/T223291) [09:24:30] !log cp1075: depool ats-be due to low but constant 504 rate after 8.0.5-1wm4 upgrade [09:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:39] PROBLEM - MariaDB Slave IO: s2 on db1125 
is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1074.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1074.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [09:24:55] ema: well, but it was only very short, i thought it's just an issue during the upgrade itself in that moment [09:25:53] mutante: unfortunately not, I've seen a few real 504 errors in the logs that didn't happen before the upgrade [09:27:11] hmm.ack [09:27:55] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) is CRITICAL: Test Get page content HTML for test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [09:28:10] oh, but maybe it was unrelated after all ^ [09:28:36] ATS is now depooled on cp1075 [09:29:29] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [09:29:53] doh, https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps is empty :( [09:31:57] PROBLEM - MariaDB Slave Lag: s2 on db1125 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 939.62 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [09:31:57] ah, yea, that is because the URL is generated as Service/Monitoring/$title in puppet [09:32:15] marostegui, jynus: you around? 
[09:33:57] ema: interview [09:33:58] marostegui is currently on an interview [09:34:09] jynus ^^ [09:34:11] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) is CRITICAL: Test Get preview mobile HTML for test page returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for Ap [09:34:11] rned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [09:34:47] does anybody know how to act on these mobileapps endpoints alerts? [09:35:39] can they be related to the database issues? (db1074 down, db1125 lagging) [09:35:47] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [09:38:24] its only happening on one of the scb2 hosts, not on the others [09:39:01] the general "Mobileapps LVS" check did not get triggered [09:39:05] true [09:39:51] well in icinga.log I see: [09:39:53] [1567139501] SERVICE ALERT: scb2001;mobileapps endpoints health;CRITICAL;SOFT;1;/{domain}/v1/page/media/{title} (Get media in test page) is CRITICAL [09:40:03] and that timestamp is Fri 30 Aug 2019 06:31:41 AM CEST [09:41:02] they stayed SOFT because they recovered within the time it takes to check 3 times [09:41:08] then we won't see them on IRC [09:41:15] right [09:42:01] db1074 being down is the issue, db1125 replicates from there [09:43:02] apergos: ah, thanks! should we try restarting it or wait for DBAs? 
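The icinga.log entry quoted at 09:39:53 and the SOFT/HARD remark that follows it can be read with a small parser. This is only a sketch: the field layout is inferred from that single quoted line, the names `ALERT_RE`, `parse_service_alert` and `would_notify` are hypothetical, and the notification rule is simplified to "only HARD states reach IRC", as described in the channel.

```python
import re
from datetime import datetime, timezone

# Field layout inferred from the one line quoted in the channel:
# [epoch] SERVICE ALERT: host;service;state;state_type;attempt;output
ALERT_RE = re.compile(
    r"^\[(?P<epoch>\d+)\] SERVICE ALERT: "
    r"(?P<host>[^;]+);(?P<service>[^;]+);(?P<state>[^;]+);"
    r"(?P<state_type>SOFT|HARD);(?P<attempt>\d+);(?P<output>.*)$"
)


def parse_service_alert(line):
    """Return the alert fields as a dict, or None if the line does not match."""
    match = ALERT_RE.match(line)
    if match is None:
        return None
    alert = match.groupdict()
    alert["attempt"] = int(alert["attempt"])
    # The bracketed number is a Unix epoch; convert it to an aware UTC datetime.
    alert["time_utc"] = datetime.fromtimestamp(int(alert["epoch"]), tz=timezone.utc)
    return alert


def would_notify(alert):
    """Simplified rule from the discussion: SOFT states are not announced."""
    return alert["state_type"] == "HARD"


SAMPLE = (
    "[1567139501] SERVICE ALERT: scb2001;mobileapps endpoints health;"
    "CRITICAL;SOFT;1;/{domain}/v1/page/media/{title} "
    "(Get media in test page) is CRITICAL"
)

alert = parse_service_alert(SAMPLE)
print(alert["host"], alert["state_type"], alert["time_utc"].isoformat(), would_notify(alert))
```

The epoch converts to 04:31:41 UTC on 30 Aug 2019, which is the 06:31:41 CEST mentioned in the channel, and the SOFT state type explains why that alert never appeared on IRC.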
[09:43:12] we don't want to restart mysql over there even if the host were to ome back up, a dba needs to look at it [09:43:23] *come back up [09:43:38] I meant restarting the host (db1074) to see if it comes back online [09:43:50] that would be ok, yes [09:44:00] mysql won't auto-start there? [09:44:13] mysql isn't set to autostart after reboot, generally [09:44:17] this is intentional [09:44:19] ack [09:45:22] 4000 errors per minute [09:45:36] that is an improvement over the 2 million ones per minute we used to have [09:45:47] jynus: ok to try power-cycle db1074? I'm in console but there's nothing to see [09:45:56] what is the ssh status? [09:46:13] or ipmi? [09:46:14] the host is down [09:46:28] (03PS1) 10Dzahn: add discovery name for planet [dns] - 10https://gerrit.wikimedia.org/r/533491 [09:46:30] let's check hw logs first [09:46:31] db1115 is tendril so it's annoying but won't break the wikis if there's lag [09:46:46] jynus: ack I'll leave it to you then [09:47:05] mysql crashing is normal due to a hw failure [09:47:08] the host crashing is not [09:48:21] in any case, this is not an outage [09:48:25] we can take time [09:48:34] it is only a redundancy degradation [09:50:27] if someone can depool db1074, however, it will speed me up [09:50:39] (m*nuel is in an interview) [09:51:23] "Smart Storage Battery failure (Battery 1, service information: 0x0A). Action: Gather AHS log and contact Support" [09:52:00] how can a battery failure lead to a crash? [09:52:26] maybe that's not the cause, just another issue [09:53:23] I think it may have caused a power issue [09:53:44] instead of just pass through power [09:54:06] but even disks going down should have made mysql crash, not the os [09:54:31] ok, I will depool the host [09:54:55] need to check the runbook, never done it before [09:55:37] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_slave [09:57:13] how can I help? 
[09:57:48] !log jynus@cumin1001 dbctl commit (dc=all): 'Depool db1074 after crash', diff saved to https://phabricator.wikimedia.org/P9013 and previous config saved to /var/cache/conftool/dbconfig/20190830-095747-jynus.json [09:57:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:11] not sure that is ok, what about the other loads? [09:58:32] (03PS2) 10Dzahn: add discovery name for planet [dns] - 10https://gerrit.wikimedia.org/r/533491 [09:58:43] what do you mean? [09:58:51] (03CR) 10Dzahn: [C: 03+2] add discovery name for planet [dns] - 10https://gerrit.wikimedia.org/r/533491 (owner: 10Dzahn) [09:58:56] it was removed from both the main load and the groups it is in [09:59:10] volans: that wasn't shown on the diff [09:59:19] the other thing I would report is the: [09:59:25] WARNING:etcd.client:etcd response did not contain a cluster ID [09:59:29] it should, it's in the phabricator paste with the diff [09:59:31] Backend error: The request requires user authentication : Insufficient credentials [09:59:36] sudo -i [09:59:44] see https://phabricator.wikimedia.org/P9013 [09:59:46] for the diff [10:00:31] yeah, I got that, otherwise it wouldn't have gone through [10:03:52] (03PS1) 10Jcrespo: mariadb: Depool db1074 after crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533492 [10:04:59] (03CR) 10Jcrespo: [C: 03+2] mariadb: Depool db1074 after crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533492 (owner: 10Jcrespo) [10:05:53] (03Merged) 10jenkins-bot: mariadb: Depool db1074 after crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533492 (owner: 10Jcrespo) [10:06:06] I don't think we've been mirroring that for a while [10:06:08] but not 100% sure [10:06:39] (03CR) 10jenkins-bot: mariadb: Depool db1074 after crash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533492 (owner: 10Jcrespo) [10:08:25] db-codfw.php: Promote db2129 to s6 codfw master Tue Aug 27 09:12:32 2019 +0200 [10:08:56] (03PS1) 
10Dzahn: planet: add Hiera keys and include class vor envoy [puppet] - 10https://gerrit.wikimedia.org/r/533493 (https://phabricator.wikimedia.org/T210411) [10:10:15] so things that help us: creating a ticket (maybe someone did already?) [10:10:44] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Mirror dbctl depool of db1074 (duration: 00m 55s) [10:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:00] just to be clear, I am not saying you should do that- just answering what things one can do without specific knowledge of db later :-D [10:12:36] RECOVERY - snapshot of s3 in eqiad on db1115 is OK: snapshot for s3 at eqiad taken less than 4 days ago and larger than 90 GB: Last one 2019-08-30 08:20:42 from db1095.eqiad.wmnet:3313 (828 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [10:13:26] ^that is the unrelated issue I mentioned above, nthng to do with db1074 [10:13:48] (03PS1) 10Ema: ATS: log Cookies [puppet] - 10https://gerrit.wikimedia.org/r/533494 (https://phabricator.wikimedia.org/T227432) [10:14:06] 10Operations, 10DBA: db1074 crashed - https://phabricator.wikimedia.org/T231638 (10Dzahn) [10:14:11] there you go jynus [10:14:16] thanks daniel, I was about to do it [10:14:39] again, that is a favour you do to me, not complaining [10:14:40] hey [10:14:43] I am back from the interview [10:14:48] anything I can help with? [10:14:58] nothing urget at the moment [10:15:11] we have to put db1074 back up and debug [10:15:48] I can take care of that, you want me to do that or take care of something else? [10:17:04] definitely will need help, maybe checking sanitarium [10:17:12] and see what is the best way to proceed with taht? [10:17:39] sure, is db1074 hard down and unrecoverable? 
[10:18:10] 10Operations, 10DBA: db1074 crashed - https://phabricator.wikimedia.org/T231638 (10Dzahn) [10:18:18] no, I was going to put it up now [10:19:10] maybe you want to wait to see what is the state after that [10:20:13] yeah, exactly [10:20:20] let's see how it comes back [10:20:37] I will leave you to run all the CLI there [10:21:09] So we don't touch stuff at the same time :) [10:21:41] double check with me it is db1074 we want to reboot, right? [10:22:44] correct [10:22:47] db1074 doesn't respond to ping [10:22:50] so go for it [10:23:40] !log reseting db1074 from iLo [10:24:17] marostegui: still around after netslpit on irc? [10:24:21] yep [10:24:23] (I think!) [10:24:32] jynus: can you read me? [10:24:33] yeah, you are still here to me [10:24:37] cool [10:24:38] I can read you [10:24:51] I have disabled notifications on db1074 [10:25:08] log will not go through, I guess, however [10:25:30] I can do that for you on the task [10:25:57] 10Operations, 10DBA: db1074 crashed - https://phabricator.wikimedia.org/T231638 (10Marostegui) [10:23:40] !log reseting db1074 from iLo [10:26:09] 10Operations, 10DBA: db1074 crashed - https://phabricator.wikimedia.org/T231638 (10jcrespo) On reboot: ` 313-HPE Smart Storage Battery 1 Failure - Battery Shutdown Event Code: 0x0400. Action: Restart system. Contact HPE support if condition persists. 
Important information available or errors detected ` [10:26:51] ^it is booting, though, that was the same thing that on the health logs [10:27:10] the support expired a few months ago [10:27:27] (03PS2) 10Dzahn: add fake SSL key for planet.discovery.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/533485 [10:27:42] (03CR) 10Dzahn: [V: 03+2 C: 03+2] add fake SSL key for planet.discovery.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/533485 (owner: 10Dzahn) [10:28:01] (03PS1) 10Dzahn: ATS/varnish: switch planet to discovery name, disable codfw backend [puppet] - 10https://gerrit.wikimedia.org/r/533495 (https://phabricator.wikimedia.org/T210411) [10:28:29] 86/PAT: hpasmlited:698 map pfn expected mapping type uncached-minus for [mem 0x79170000-0x79172fff], got write-back [10:29:15] disk themselves seem to be ok [10:29:27] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/18125/planet1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/533493 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [10:29:43] marostegui: do you know what is the s2 candidate master? [10:29:51] it is not db1074, right? [10:30:03] yep, db1122 [10:30:05] 10Operations, 10DBA: db1074 crashed - https://phabricator.wikimedia.org/T231638 (10Marostegui) >>! In T231638#5453726, @jcrespo wrote: > On reboot: > ` > 313-HPE Smart Storage Battery 1 Failure - Battery Shutdown Event Code: 0x0400. > Action: Restart system. Contact HPE support if condition persists. > > >... [10:30:10] (03CR) 10Dzahn: "i checked nothing is listening on 443 on planet1001 right now" [puppet] - 10https://gerrit.wikimedia.org/r/533493 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [10:30:12] s2 will be failed over in 2 weeks to that new host [10:30:17] ok [10:30:43] forcing all icinga checks before touching mysql [10:31:04] jynus: did you check ilo logs, or you want me to? 
[10:31:17] I did on web iface [10:31:21] but not copied it [10:31:31] ok, I will do that [10:31:32] it only said the same that on reboot, battery [10:31:33] 10Operations, 10DBA: db1074 crashed - https://phabricator.wikimedia.org/T231638 (10Marostegui) [10:31:36] you can focus on the host itself [10:31:47] if you want to do it from ssh and copy them that would be great [10:31:53] yep, doing it [10:32:32] Battery count: 0 [10:32:37] great [10:32:56] ferm also failed to start, but that may be unrelated [10:33:32] started it manually now [10:33:52] I would start mysql with no replication and consider failing it over to codfw [10:34:03] (the sanitarium replication) [10:34:23] let's start it without replication for now yeah [10:34:33] without a battery either it would not catch up or would be an unreliable replica [10:34:37] we can maybe enable replication till the point we can stop that host with codfw master and then change it [10:34:55] oh, we can just get easily the coords as long as it is up [10:35:15] 10Operations, 10DBA: db1074 crashed - https://phabricator.wikimedia.org/T231638 (10Marostegui) BBU is broken: ` description=Smart Storage Battery failure (Battery 1, service information: 0x0A). Action: Gather AHS log and contact Support Verbs cd vers description=POST Error: 313-HPE Smart Storage... [10:36:07] 10Operations, 10DBA: db1074 crashed: Broken BBU - https://phabricator.wikimedia.org/T231638 (10Marostegui) [10:36:23] InnoDB: Starting crash recovery from checkpoint LSN=24770133479800 [10:36:44] recovery seems clean [10:36:49] nice [10:37:33] oh, I just realized you meant replication on sanitarium [10:37:43] not db1074, you are absolutely right [10:37:49] sorry I missunderstood [10:37:53] no worries :) [10:38:11] db1074 is up an stopped [10:39:07] db1066-bin.001234:4 ? 
[10:39:08] RECOVERY - MariaDB Slave IO: s2 on db1125 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [10:39:11] checking [10:39:17] too much coincidence it is a the start of a log? [10:39:46] db1066-bin.001234:4 seems to be yes [10:40:08] let's see db1125 [10:40:35] the master is at: | db1066-bin.001234 | 234395014 | at the moment, so it is a coincidence yes [10:40:52] db1125 started replicating [10:41:00] yeah, it caught up [10:41:06] let me see what position when its master crashed [10:41:22] db1074-bin.004422:456 [10:45:20] the last executed statement seems to be: on nlwiki.page page_id=4810735 [10:45:43] That was run on db1074 and also replicated to sanitarium [10:45:59] And that is db1074-bin.004421:139305511 [10:46:29] I can see also lots of information_schema_p activity [10:46:58] any suggestion on how to proceed? [10:47:11] we should probably look for that event on sanitarium codfw [10:47:15] and start replication from there [10:47:22] ok [10:47:34] double check what I posted above [10:47:39] to make sure we are looking for the right event [10:48:03] do you have a gtid for me? [10:48:16] one sec and I will have it [10:51:05] 0-180359173-4858865027,171966574-171966574-2132058081,171966668-171966668-2580,171966670-171966670-2410812544,171970567-171970567-390719906,171978766-171978766-732975,180359173-180359173-70817914,180359241-180359241-121693516 [10:51:06] XDDD [10:51:08] Let me try to filter that [10:51:20] PROBLEM - HHVM rendering on mw1290 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:51:21] yeah, I only need the master id [10:51:33] (03PS2) 10Vgutierrez: prometheus: Add basic ATS network and ssl metrics [puppet] - 10https://gerrit.wikimedia.org/r/533193 (https://phabricator.wikimedia.org/T231533) [10:52:01] don't worry, 10.5 will fix that! 
[10:52:05] 171966668-171966668-2580 I think [10:52:08] But let me confirm [10:52:23] it is ok, I will work with that and reverse your setops [10:52:42] 10Operations, 10ops-eqiad: Degraded RAID on db1074 - https://phabricator.wikimedia.org/T231639 (10ops-monitoring-bot) [10:52:52] RECOVERY - HHVM rendering on mw1290 is OK: HTTP OK: HTTP/1.1 200 OK - 79470 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:53:08] 10Operations, 10ops-eqiad: Degraded RAID on db1074 - https://phabricator.wikimedia.org/T231639 (10Marostegui) [10:53:12] 10Operations, 10DBA: db1074 crashed: Broken BBU - https://phabricator.wikimedia.org/T231638 (10Marostegui) [10:55:09] jynus: https://phabricator.wikimedia.org/P9014 [10:55:31] so that list (from the binlog) matches: 171966668-171966668-2580 which is the slave_pos output from db1125 [10:55:45] thanks [10:55:46] let's see if we reach the same conclusion [10:56:36] give me some time, I want to also check the previous transactions [10:56:39] sure [10:57:01] From my investigation the last executed one was an UPDATE on nlwiki.page for the page_id= 4810735 [10:57:05] and that one reached db1125 [10:57:14] after that, there is nothing apart from the binlog rotation [10:57:21] that's what I have seen for now [10:59:16] 171966668 is db1074 [10:59:50] yes [10:59:55] we are more interested on 171966574 ids [11:00:12] db1066? 
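The filtering step discussed above (picking one server's entry out of a long multi-domain GTID position) can be sketched in Python. This is only an illustration, assuming the usual MariaDB `domain_id-server_id-sequence` triplet format; `Gtid`, `parse_gtid_list` and `for_server` are hypothetical names, not part of any tooling mentioned in the channel.

```python
from typing import List, NamedTuple


class Gtid(NamedTuple):
    """One MariaDB GTID triplet: domain_id-server_id-sequence."""
    domain_id: int
    server_id: int
    sequence: int


def parse_gtid_list(position: str) -> List[Gtid]:
    """Split a comma-separated GTID position into its triplets."""
    gtids = []
    for item in position.split(","):
        item = item.strip()
        if not item:
            continue
        parts = item.split("-")
        if len(parts) != 3:
            raise ValueError(f"not a domain-server-sequence triplet: {item!r}")
        gtids.append(Gtid(*(int(p) for p in parts)))
    return gtids


def for_server(position: str, server_id: int) -> List[Gtid]:
    """Keep only the triplets written by the given server_id."""
    return [g for g in parse_gtid_list(position) if g.server_id == server_id]


# The position pasted in the channel at 10:51:05.
SLAVE_POS = (
    "0-180359173-4858865027,171966574-171966574-2132058081,"
    "171966668-171966668-2580,171966670-171966670-2410812544,"
    "171970567-171970567-390719906,171978766-171978766-732975,"
    "180359173-180359173-70817914,180359241-180359241-121693516"
)

# 171966668 is identified as db1074 in the discussion.
DB1074_ENTRIES = for_server(SLAVE_POS, 171966668)
```

Filtering on `server_id` rather than `domain_id` matters here because one server (180359173) appears under two different domains in that position.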
[11:00:19] the original master, yes [11:00:23] I have that too [11:00:28] 171966574-171966574-2132058081 is what I saw [11:00:39] it is possible that we had additional writes [11:00:52] or that those get overwritten on format change [11:00:52] that is the last GTID from the master on db1074 [11:03:08] I am trying to correlate that GTID with db2094 to see if before that one we have the same update on nlwiki.page [11:03:32] db2094/db2095 [11:04:39] I confirm that the last write I get is UPDATE `nlwiki`.`page` @1=4810735 [11:04:42] before the crash [11:04:51] same as I did [11:05:10] I am now correlating 171966574-171966574-2132058081 on codfw, to see if that is the GTID after the nlwiki update [11:05:12] I would like to check if there were additional writes on replication start [11:05:53] jynus: maybe start IO thread? [11:05:57] on db1074 [11:07:27] it is true it doesn't give an original master gtid, interesting [11:10:08] I am guessing a format change generates new transactions, making gtids not really global? 
[11:10:14] yeah, it is very weird [11:10:38] because they have the same id on the intermediate master [11:11:50] The last GTID executed (for the same transaction page_id=4810735) on the sanitarium master in codfw (db2126) is: 171966574-171966574-2131614606 [11:11:54] there are no more transactions in it [11:12:10] it is true that our setup is non-standard [11:12:24] but that makes gtid for us even less useful [11:12:24] So I think we should start db1125:3312 to replicate from db2126:db2126-bin.000086:3277753 [11:12:36] but please check that [11:12:51] yeah, I will not look at your numbers and check it myself [11:13:44] yep [11:13:44] happily it crashed at a whole second [11:13:50] so it is easy to see [11:13:54] just after the heartbeat [11:14:11] or probably- at a low write moment [11:16:47] interesting, I get db2126-bin.000086:268653481 [11:17:01] let me see [11:17:53] previous transaction at 268653188 [11:19:07] and before, 2 heartbeats, instead of 1, at 268652500 [11:19:15] mmm, that's strange [11:19:34] (that is expected due to the cut on eqiad -> codfw) [11:19:50] codfw heartbeats won't be on eqiad [11:20:16] yes, your position is right [11:20:32] mine is the previous one on nlwiki.page for that page_id [11:20:34] so this is not about who is right, more like, how you got yours? [11:20:43] Mine is wrong [11:20:50] Because mine is previous [11:20:52] did you search for the transaction by grepping? 
[11:20:54] it happened at 4am [11:21:06] I use timestamp first [11:21:27] which is not precise enough but a good starting point assuming no lag [11:21:35] yours matches all the columns on the row for db1125 [11:21:45] and mine does, but not the page_random one [11:22:11] so 268653481 is the one [11:26:58] STOP SLAVE; CHANGE MASTER TO MASTER_HOST='db2126.codfw.wmnet', MASTER_PORT=3312, MASTER_USER='repl', MASTER_LOG_FILE='db2126-bin.000086', MASTER_LOG_POS=268653481, MASTER_SSL=1; START SLAVE; [11:27:18] checking [11:27:25] you probably need reset slave all [11:27:34] from db1125:s2 [11:27:53] adding the reset [11:28:05] port=3306 [11:28:12] 06? [11:28:13] double checking the position again [11:28:33] yes, db2126 is the sanitarium master and it runs on 3306 [11:28:36] didn't get the port thing [11:28:46] ah, so I remove it [11:28:49] like db1074 [11:28:49] yep [11:29:10] that is why I need help, easy to mistake with so many instances and ports :-D [11:29:12] give me a second to double check again the position [11:29:19] please do [11:29:39] checking the UPDATE against what is in the DB [11:30:44] this is much easier with WMFReplication.setup() ! [11:31:19] ok, looks good [11:31:29] maybe also let's stop replication on labsdbs, just in case? [11:31:34] for s2 [11:31:36] ok, let me do that [11:31:49] waiting here for you [11:31:56] !log Temporary stop s2 replication on labsdb1009-labsdb1012 [11:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:38] ok [11:32:40] replication stopped [11:32:45] you can proceed [11:33:11] what is the right log for this, switching s2 sanitarium to replicate from codfw? 
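The review of the CHANGE MASTER statement above (add a RESET SLAVE ALL, drop MASTER_PORT because db2126 listens on the default 3306) can be sketched as a small statement builder. This is a minimal sketch, not the `WMFReplication.setup()` mentioned in the channel, whose interface is not shown here; `change_master_statements` and its parameters are hypothetical, and credentials are left out just as they are in the pasted statement.

```python
from typing import List

DEFAULT_MYSQL_PORT = 3306


def change_master_statements(
    host: str,
    log_file: str,
    log_pos: int,
    user: str = "repl",
    port: int = DEFAULT_MYSQL_PORT,
    ssl: bool = True,
    reset: bool = True,
) -> List[str]:
    """Build the SQL statements to repoint a replica at binlog coordinates.

    MASTER_PORT is only emitted for a non-default port, which avoids
    carrying over the replica's own multi-instance port by mistake.
    """
    options = [f"MASTER_HOST='{host}'"]
    if port != DEFAULT_MYSQL_PORT:
        options.append(f"MASTER_PORT={port}")
    options.append(f"MASTER_USER='{user}'")
    options.append(f"MASTER_LOG_FILE='{log_file}'")
    options.append(f"MASTER_LOG_POS={int(log_pos)}")
    if ssl:
        options.append("MASTER_SSL=1")

    statements = ["STOP SLAVE"]
    if reset:
        # Forget the old master's connection settings first.
        statements.append("RESET SLAVE ALL")
    statements.append("CHANGE MASTER TO " + ", ".join(options))
    statements.append("START SLAVE")
    return statements


# Coordinates agreed on in the channel for db1125:s2 -> db2126.
STATEMENTS = change_master_statements(
    "db2126.codfw.wmnet", "db2126-bin.000086", 268653481
)
```

Joining `STATEMENTS` with `"; "` gives a single line comparable to the one pasted at 11:26:58, minus the port and plus the reset.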
[11:33:18] yeah [11:33:21] sounds good [11:33:56] !log switching db1125:s2 (eqiad sanitarium) to replicate from codfw T231638 [11:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:01] T231638: db1074 crashed: Broken BBU - https://phabricator.wikimedia.org/T231638 [11:34:44] it complained about codfw heartbeat [11:34:52] expected then [11:34:54] that is expected although I forgot about that [11:34:58] me too [11:35:03] adding a dummy column [11:35:35] you mean row? [11:35:42] oh yeah, sorry [11:35:45] :-D [11:35:49] just checking! :) [11:37:13] what is the PK of heartbeat table? [11:37:58] the server_id [11:38:07] bah, just inserted the real row [11:38:11] working now [11:38:19] cool [11:38:27] wonder if the view will work well [11:38:35] I guess so, as it worked when codfw was primary [11:38:41] which view? [11:38:52] heartbeat_p [11:38:54] we have to enable GTID on db1125:3312 [11:38:58] yeah [11:39:03] wanted to leave it for a wait [11:39:05] let's wait for it to catch up [11:39:05] *while [11:39:06] yeah [11:39:22] then we restart labs [11:39:22] we should enable replication only on one of the labs hosts first [11:39:35] and enable gtid [11:40:15] I have asked DCOPs to see if we have a spare BBU in eqiad [11:40:39] I wouldn't want to lose db1074 so early in its life! [11:41:37] repl flowing https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1125&var-port=13312&from=now-1h&to=now [11:42:48] io took only 1 minute to catch up [11:42:51] db1125:3332 almost up-to-date! [11:43:04] which is more than ok for cross-dc [11:43:32] ok, then let's start one of the lab hosts [11:43:40] should we start GTID first? [11:43:43] on the sanitarium itself? 
[11:43:47] ah, yes [11:43:51] doing [11:43:56] do that and once you are done, I will start labsdb1012 [11:44:22] RECOVERY - MariaDB Slave Lag: s2 on db1125 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [11:45:45] I need to check my code to remember how to enable gtid [11:45:57] STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID=Slave_pos; START SLAVE; [11:46:03] that should do it [11:47:20] it doesn't nead double quotes? [11:47:35] don't think so [11:47:36] or single ones [11:47:48] ok done [11:48:13] again, too accustomed to set_gtid_mode('slave_pos') [11:48:14] ok, going for labsdb1012 [11:48:16] :-D [11:48:24] !log Start s2 replication on labsdb1012 [11:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:39] done [11:48:42] let's see how it goes [11:49:22] catching up nicely for now [11:49:47] we should start replication on db1074 anyways [11:50:00] you handle labs, and I do that? [11:50:04] yep [11:50:11] I will wait for 1012 to fully be in sync [11:50:14] before starting the others [11:52:33] interestingly db1125:3312 stills shows on show slave hosts; [11:52:33] labsdb1012 sync'ed fine, let's go with 1011 [11:52:49] we sure that replication is not active, right? [11:53:07] from db1074? [11:53:11] yeah [11:53:23] it is probably cached, but I will check [11:54:07] I don't see it connected with an "ss" [11:54:13] there is one repl user connected on processlist [11:54:43] 26 | repl | 10.64.48.14:58642 | NULL | Binlog Dump | 4593 | Master has sent all binlog to s [11:54:48] root@db2126:~# ss -a | grep -i "10.64.48.14" [11:54:48] tcp ESTAB 0 0 ::ffff:10.192.32.182:mysql ::ffff:10.64.48.14:48632 [11:54:48] Am I crazy? 
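The staged restart agreed on above (start replication on one labsdb host, wait until it is fully in sync, then move on to the others) can be sketched as a generic loop. This is an assumption-laden illustration only: `staged_start` is a hypothetical helper, and the `start_replication` and `is_in_sync` callables stand in for whatever actually issues START SLAVE and checks lag on each host.

```python
import time
from typing import Callable, Sequence


def staged_start(
    batches: Sequence[Sequence[str]],
    start_replication: Callable[[str], None],
    is_in_sync: Callable[[str], bool],
    poll_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Start replication batch by batch.

    Every host in a batch is started, then the loop waits until all of
    them report being in sync before touching the next batch.
    """
    for batch in batches:
        for host in batch:
            start_replication(host)
        while not all(is_in_sync(host) for host in batch):
            sleep(poll_seconds)


# Canary first, then one more host, then the remaining two together.
S2_BATCHES = [["labsdb1012"], ["labsdb1011"], ["labsdb1009", "labsdb1010"]]
```

Starting with a single canary host limits the damage if the new replication coordinates turn out to be wrong, since only that host would need to be rebuilt.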
[11:55:02] root@db1074:~# ss -a | grep -i "10.64.48.14" [11:55:02] root@db1074:~# [11:55:14] see processlist [11:55:22] checking [11:55:49] :/ [11:56:06] Master_Host: db2126.codfw.wmnet on the replica [11:56:24] root@db1125:~# mysql -S /run/mysqld/mysqld.s2.sock -e "show all slaves status\G" | grep -i Master_host [11:56:24] Master_Host: db2126.codfw.wmnet [11:56:38] so it is ok, right, just giving bad info? [11:56:42] or cached one? [11:57:06] probably cached [11:57:12] maybe stop mysql on db1074 and restart it [11:57:14] I will kill the connection [11:57:15] it should clean [11:57:17] on db1074 [11:57:18] or thatyes [11:57:22] going to start labsdb1011 [11:57:24] agreed? [11:57:27] yes [11:57:52] !log Start replication s2 on labsdb1011 [11:57:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:00] killed, nothing broke? [11:58:27] db1125:3322 replicating fine [11:58:49] so probably very aggressive caching [11:59:00] labsdb1012 also replicating fine [11:59:00] but I don't like, it is very missleading [11:59:11] yeah [11:59:14] very dangerous [11:59:31] I started replication on db1074 [12:00:06] cool [12:00:18] 1011 catching up [12:00:26] if it does it fine, I will start 1009 and 1010 at the same time [12:03:21] once everthing catches up, I will reenable alerting on db1074 [12:03:32] cool [12:03:46] or maybe I will leave it with downtime for some days [12:03:54] actually, yeah, till monday maybe [12:03:58] we won't repool it anyways for now [12:04:17] we have had hosts repooled without a BBU, so maybe we can repool it on monday [12:12:16] jynus: 1011 is now in sync, going to start 1009 and 1010 [12:12:23] cool [12:12:39] !log Start replication s2 on labsdb1009 and labsdb1010 [12:12:43] !log Start replication s2 on labsdb1009 and labsdb1010 [12:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:56] Done [12:21:48] (03CR) 10Alex Monk: [C: 03+2] acme_chief: Provide OCSP responses 
[software/acme-chief] - 10https://gerrit.wikimedia.org/r/530465 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez) [12:22:20] \o/ [12:22:31] acme-chief is getting ocsp stapling [12:24:36] (03Merged) 10jenkins-bot: acme_chief: Provide OCSP responses [software/acme-chief] - 10https://gerrit.wikimedia.org/r/530465 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez) [12:24:56] (03Merged) 10jenkins-bot: api: Allow acme-chief clients to fetch OCSP responses [software/acme-chief] - 10https://gerrit.wikimedia.org/r/530806 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez) [12:27:19] (03CR) 10jenkins-bot: acme_chief: Provide OCSP responses [software/acme-chief] - 10https://gerrit.wikimedia.org/r/530465 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez) [12:27:58] (03CR) 10jenkins-bot: api: Allow acme-chief clients to fetch OCSP responses [software/acme-chief] - 10https://gerrit.wikimedia.org/r/530806 (https://phabricator.wikimedia.org/T219765) (owner: 10Vgutierrez) [12:32:49] (03CR) 10Ema: [C: 03+2] ATS: log Cookies [puppet] - 10https://gerrit.wikimedia.org/r/533494 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [12:40:49] (03PS2) 10Dzahn: planet: add Hiera keys and include class vor envoy [puppet] - 10https://gerrit.wikimedia.org/r/533493 (https://phabricator.wikimedia.org/T210411) [12:41:30] (03CR) 10Dzahn: [C: 03+2] planet: add Hiera keys and include class vor envoy [puppet] - 10https://gerrit.wikimedia.org/r/533493 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [12:43:05] 10Operations, 10WMF-Legal, 10serviceops: Move old transparency report pages to historical URLs and setup redirect - https://phabricator.wikimedia.org/T230638 (10Varnent) I am incredibly confused right now. I have made some additions to the code it builds from. However, that was basically adding a line of HTM... 
[12:43:36] !log ema@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1075.eqiad.wmnet,service=ats-be [12:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:38] 10Operations, 10WMF-Legal, 10serviceops: Move old transparency report pages to historical URLs and setup redirect - https://phabricator.wikimedia.org/T230638 (10Varnent) >>! In T230638#5453416, @Dzahn wrote: > I am confused by this statement. The content repo is explicitely setup this way so that people work... [12:49:48] (03CR) 10Alexandros Kosiaris: [C: 03+1] "+1 from my side" [puppet] - 10https://gerrit.wikimedia.org/r/528617 (owner: 10BryanDavis) [12:52:07] (03PS1) 10Dzahn: ssl/planet: rename cert to .crt not .crt.pem [puppet] - 10https://gerrit.wikimedia.org/r/533514 [12:54:06] (03CR) 10Dzahn: [C: 03+2] ssl/planet: rename cert to .crt not .crt.pem [puppet] - 10https://gerrit.wikimedia.org/r/533514 (owner: 10Dzahn) [12:54:17] (03PS2) 10Dzahn: ssl/planet: rename cert to .crt not .crt.pem [puppet] - 10https://gerrit.wikimedia.org/r/533514 [12:59:57] (03CR) 10Gehel: [C: 03+1] "LGTM, will merge on Monday" [puppet] - 10https://gerrit.wikimedia.org/r/533189 (https://phabricator.wikimedia.org/T231516) (owner: 10Mathew.onipe) [13:10:35] 10Operations, 10WMF-Legal, 10serviceops: Move old transparency report pages to historical URLs and setup redirect - https://phabricator.wikimedia.org/T230638 (10BBlack) Some clarifying points: 1. `Code` - means different things to different people in different contexts. From the SRE perspective, the whole... 
[13:13:11] (03CR) 10Gehel: [C: 04-1] "see comments inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/531922 (https://phabricator.wikimedia.org/T225125) (owner: 10Mathew.onipe) [13:14:52] 10Operations, 10WMF-Legal, 10serviceops: Move old transparency report pages to historical URLs and setup redirect - https://phabricator.wikimedia.org/T230638 (10Varnent) Okay - so basically SRE will not setup the requested configuration until we figure out how to setup the script to output the old site into... [13:16:52] (03CR) 10Gehel: [C: 03+1] "LGTM, let's see if @volans has more to say." (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: 10Mathew.onipe) [13:17:08] more :-P [13:17:20] gehel: ^^^ [13:17:22] he doesn't :) [13:18:55] volans: :) [13:19:27] !log cp1075: pause ats-be testing during the weekend T228629 [13:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:33] * gehel feels like a Friday, took him 3 minutes to find out this was a joke while he did not find the comment from volans [13:19:33] T228629: ATS Backends: Test live cache_text traffic - https://phabricator.wikimedia.org/T228629 [13:19:46] lol [13:19:48] !log ema@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1075.eqiad.wmnet,service=ats-be [13:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:54] I'll try to have a look at it today [13:28:02] (03PS2) 10CDanis: trafficserver: fix grafana link [puppet] - 10https://gerrit.wikimedia.org/r/533229 [13:28:27] (03CR) 10CDanis: [C: 03+2] trafficserver: fix grafana link [puppet] - 10https://gerrit.wikimedia.org/r/533229 (owner: 10CDanis) [13:30:04] (03CR) 10Dzahn: [C: 03+2] ATS/varnish: switch planet to discovery name, disable codfw backend [puppet] - 10https://gerrit.wikimedia.org/r/533495 (https://phabricator.wikimedia.org/T210411) (owner: 10Dzahn) [13:30:41] (03PS2) 10Dzahn: ATS/varnish: 
switch planet to discovery name, disable codfw backend [puppet] - 10https://gerrit.wikimedia.org/r/533495 (https://phabricator.wikimedia.org/T210411) [13:30:55] 10Operations, 10WMF-Legal, 10serviceops: Move old transparency report pages to historical URLs and setup redirect - https://phabricator.wikimedia.org/T230638 (10BBlack) There are two separate things to do here: 1) The move of content to the new `/historical/` sub-path: For this, nothing will actually chan... [13:37:25] 10Operations, 10WMF-Legal, 10serviceops: Move old transparency report pages to historical URLs and setup redirect - https://phabricator.wikimedia.org/T230638 (10Varnent) Awesome - thank you @BBlack - super helpful and I get now what you all need done. Apologies for the confusion on my part, I thought this wa... [13:48:28] 10Operations, 10WMF-Legal, 10serviceops: Move old transparency report pages to historical URLs and setup redirect - https://phabricator.wikimedia.org/T230638 (10BBlack) On the broader meta-topics: Long-lived canonical URLs are important, and I think that `transparency.wikimedia.org` seems like a more-natural... [13:54:29] davidwbarratt: k, will be on mwdebug1002 once it lands in CI. meanwhile could verify that its reproducible on that server now? [13:54:56] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (10Dzahn) [13:54:59] Krinkle uhh, how would I do that? [13:55:39] davidwbarratt: Monitor https://logstash.wikimedia.org/app/kibana#/dashboard/mwdebug1002, then via https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug view a page where that code runs. I think logged-in views on action=history would do it. [13:56:02] That's actually how I found it, when I was deploying/verifying a different patch. [13:56:11] oo [13:58:44] hmm, so I'm on mwdebug1002 and I'm on an action=history page (I see the links) but nothing in logstash [14:01:04] davidwbarratt: XWD enabled and set to the same server? 
[14:01:14] yep [14:01:28] davidwbarratt: wmf.20 wiki? [14:01:35] oh dang [14:01:38] good catch [14:01:57] I used test2 yesterday [14:02:06] there we go! [14:02:56] Only ∞ minutes left before I can stage the fix. [14:03:11] BAHAHA [14:05:05] https://www.xkcd.com/612/ [14:05:13] should be ~ 11 min or so though [14:10:43] (03PS1) 10Gergő Tisza: Fix mwvagrant sudo rules [puppet] - 10https://gerrit.wikimedia.org/r/533529 [14:12:06] (03CR) 10Volans: "LGTM, just a comment on the query parameter and a couple of very minor things inline." (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: 10Mathew.onipe) [14:13:34] davidwbarratt: staged on mwdebug1002 [14:13:44] Krinkle great! one moment [14:14:42] Krinkle sweet! no log messages after the deploy [14:16:12] davidwbarratt: ack, syncing now [14:17:01] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.20/extensions/Thanks/includes/: T231617 - 8a3c458c4d937 (duration: 00m 54s) [14:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:50] T231617: DBPerformance: Expectation masterConns <= 0 not met by PermissionManager->isBlockedFrom - https://phabricator.wikimedia.org/T231617 [14:21:41] (03PS1) 10Ema: ATS: cache responses to cookies [puppet] - 10https://gerrit.wikimedia.org/r/533530 (https://phabricator.wikimedia.org/T227432) [14:25:07] (03PS17) 10Jhedden: openstack: Add codfw1dev nova API and metadata to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/530580 (https://phabricator.wikimedia.org/T223907) [14:25:52] davidwbarratt: cool, looks all clear now at https://logstash.wikimedia.org/goto/4a870776f31dd6441e3e67b1bdc235a5 [14:25:59] YAY! 
[14:26:42] Krinkle I'm glad that fixed the problem, still odd that we didn't see it months ago [14:30:13] (03CR) 10Jhedden: [C: 03+2] openstack: Add codfw1dev nova API and metadata to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/530580 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [14:56:21] 10Operations, 10SRE-Access-Requests: Request access to Analytics cluster for Urbanecm - https://phabricator.wikimedia.org/T231616 (10Trizek-WMF) >>! In T231616#5453042, @MMiller_WMF wrote: > Hello -- I'm not working with @Urbanecm on this specific project, but he is a paid part-time contractor working with @Tr... [15:20:11] (03PS1) 10Zoranzoki21: Fix wgMetaNamespaceTalk for bswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533538 (https://phabricator.wikimedia.org/T231654) [15:22:53] 10Operations, 10WMF-Legal, 10serviceops, 10Patch-For-Review: Move old transparency report pages to historical URLs and setup redirect - https://phabricator.wikimedia.org/T230638 (10Varnent) >>! In T230638#5453970, @BBlack wrote: > On the broader meta-topics: Long-lived canonical URLs are important, and I t... [15:24:12] (03CR) 10Srdjan m: [C: 03+1] "> Uploaded patch set 1." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533538 (https://phabricator.wikimedia.org/T231654) (owner: 10Zoranzoki21) [16:07:04] 10Operations, 10MobileFrontend, 10Traffic, 10Mobile: https://en.wikipedia.org/wiki/Heteromyidae shows the mobile version on desktop - https://phabricator.wikimedia.org/T231620 (10AntiCompositeNumber) [16:11:39] (03PS1) 10Andrew Bogott: openstack configs: forward some mitaka updates to newton [puppet] - 10https://gerrit.wikimedia.org/r/533549 [16:21:25] 10Operations, 10MobileFrontend, 10Traffic, 10Mobile: https://en.wikipedia.org/wiki/Heteromyidae shows the mobile version on desktop - https://phabricator.wikimedia.org/T231620 (10CDanis) Seems likely this is the same as {T231504} ? 
[16:23:51] 10Operations, 10DBA, 10StructuredDiscussions, 10Growth-Team (Current Sprint), 10WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (10Catrope) >>! In T107610#5443335, @Anomie wrote: > Or do we want one table that's shared by all... [16:28:37] (03PS1) 10Jhedden: openstack: Add codfw1dev neutron server to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/533552 (https://phabricator.wikimedia.org/T223907) [16:33:42] 10Operations, 10WMF-Legal, 10serviceops, 10Patch-For-Review: Move old transparency report pages to historical URLs and setup redirect - https://phabricator.wikimedia.org/T230638 (10BBlack) >>! In T230638#5454212, @Varnent wrote: > I agree that is a better long-term setup and is something I can bring up wit... [16:38:00] !log cloudelastic-chi all indices auto_expand_replicas set to '0-1' [16:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:29] 10Operations, 10MediaWiki-Configuration, 10conftool, 10Performance-Team (Radar): noc.wm.o/db.php: remove hosts information, or fetch it from etcd somehow - https://phabricator.wikimedia.org/T229631 (10CDanis) 05Open→03Resolved The remaining work on this task is now part of {T231642} [16:40:33] 10Operations, 10DBA, 10MediaWiki-Configuration, 10Patch-For-Review, and 3 others: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126 (10CDanis) [16:44:22] !log restarting restbase2017-b with live hack startup script (adds logging) -- T231027 [16:45:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:28] T231027: Outage of restbase2017-b - https://phabricator.wikimedia.org/T231027 [16:46:11] 10Operations, 10WMF-Legal, 10serviceops, 10Patch-For-Review: Move old transparency report pages to historical URLs and setup redirect - https://phabricator.wikimedia.org/T230638 (10Varnent) Sorry - should have 
clarified. I meant if in theory down the road the wikimediafoundation.org site was moved to our s... [16:53:05] (03PS1) 10CDanis: dbctl: document config commit --batch [software/conftool] - 10https://gerrit.wikimedia.org/r/533556 (https://phabricator.wikimedia.org/T231629) [16:56:13] (03PS1) 10Volans: transports: add JunOS transport [software/homer] - 10https://gerrit.wikimedia.org/r/533558 (https://phabricator.wikimedia.org/T228388) [16:57:07] (03PS2) 10Jhedden: openstack: Add codfw1dev neutron server to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/533552 (https://phabricator.wikimedia.org/T223907) [17:01:23] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [17:01:33] (03CR) 10Volans: [C: 03+1] "LGTM" [software/conftool] - 10https://gerrit.wikimedia.org/r/533556 (https://phabricator.wikimedia.org/T231629) (owner: 10CDanis) [17:04:28] (03CR) 10Jhedden: "PCC results: https://puppet-compiler.wmflabs.org/compiler1002/18128/" [puppet] - 10https://gerrit.wikimedia.org/r/533552 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [17:06:07] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [17:08:08] 10Operations, 10ops-eqiad, 10cloud-services-team: Degraded RAID on cloudvirt1024 -- Filesystem mounted read-only - https://phabricator.wikimedia.org/T230289 (10Cmjohnson) updated the idrac and raid f/w [17:09:40] (03PS2) 10Andrew Bogott: Fix mwvagrant sudo rules [puppet] - 10https://gerrit.wikimedia.org/r/533529 (owner: 10Gergő Tisza) [17:10:24] (03CR) 10Jhedden: [C: 03+1] Fix mwvagrant sudo rules [puppet] - 10https://gerrit.wikimedia.org/r/533529 (owner: 10Gergő Tisza) [17:24:39] 10Operations, 10ops-eqiad, 10DBA: 
db1074 crashed: Broken BBU - https://phabricator.wikimedia.org/T231638 (10wiki_willy) a:03Cmjohnson @Cmjohnson @Jclark-ctr - do you guys know offhand if we have a spare BBU lying around from a decom'd server by any chance? If not, let me know and we'll order the part. Th... [17:25:59] 10Operations, 10ops-eqiad, 10DBA: db1074 crashed: Broken BBU - https://phabricator.wikimedia.org/T231638 (10Cmjohnson) @wiki_willy negative, we do not have any spare BBUs lying around. [17:28:37] (03CR) 10Bstorm: [C: 03+2] Fix mwvagrant sudo rules [puppet] - 10https://gerrit.wikimedia.org/r/533529 (owner: 10Gergő Tisza) [17:31:18] 10Operations, 10ops-eqiad, 10DBA: Request to Order HP BBU for HP ProLiant DL360 Gen9 (db1074) - https://phabricator.wikimedia.org/T231670 (10wiki_willy) [17:31:57] 10Operations, 10ops-eqiad, 10DBA: db1074 crashed: Broken BBU - https://phabricator.wikimedia.org/T231638 (10wiki_willy) Thanks for confirming @Cmjohnson , subtask T231670 created for Rob to order the part. Thanks, Willy [17:32:43] wiki_willy: thank you so much for ^ <3 [17:34:08] 10Operations, 10ops-eqiad, 10DBA: db1074 crashed: Broken BBU - https://phabricator.wikimedia.org/T231638 (10Marostegui) >>! In T231638#5454666, @wiki_willy wrote: > Thanks for confirming @Cmjohnson , subtask T231670 created for Rob to order the part. 
Thanks, Willy Thank you sooo much :) [17:40:21] 10Operations, 10DBA, 10procurement: Request to Order HP BBU for HP ProLiant DL360 Gen9 (db1074) - https://phabricator.wikimedia.org/T231670 (10wiki_willy) [17:41:53] (03PS1) 10Herron: prometheus: aggregate ipsec_status and add alert [puppet] - 10https://gerrit.wikimedia.org/r/533563 (https://phabricator.wikimedia.org/T230570) [17:42:15] 10Operations, 10DBA, 10procurement: eqiad: Request to Order HP BBU for HP ProLiant DL360 Gen9 (db1074) - https://phabricator.wikimedia.org/T231670 (10wiki_willy) [17:43:40] (03PS2) 10Herron: prometheus: aggregate ipsec_status and add alert [puppet] - 10https://gerrit.wikimedia.org/r/533563 (https://phabricator.wikimedia.org/T230236) [17:45:38] (03CR) 10jerkins-bot: [V: 04-1] prometheus: aggregate ipsec_status and add alert [puppet] - 10https://gerrit.wikimedia.org/r/533563 (https://phabricator.wikimedia.org/T230236) (owner: 10Herron) [17:47:53] (03PS3) 10Herron: prometheus: aggregate ipsec_status and add alert [puppet] - 10https://gerrit.wikimedia.org/r/533563 (https://phabricator.wikimedia.org/T230236) [18:13:39] PROBLEM - Disk space on Hadoop worker on an-worker1090 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/k 26 GB (0% inode=99%): /var/lib/hadoop/data/e 26 GB (0% inode=99%): /var/lib/hadoop/data/b 25 GB (0% inode=99%): /var/lib/hadoop/data/d 26 GB (0% inode=99%): /var/lib/hadoop/data/g 26 GB (0% inode=99%): /var/lib/hadoop/data/c 26 GB (0% inode=99%): /var/lib/hadoop/data/m 28 GB (0% inode=99%): /var/lib/hadoop/data/i [18:13:39] 99%): /var/lib/hadoop/data/j 27 GB (0% inode=99%): /var/lib/hadoop/data/h 26 GB (0% inode=99%): /var/lib/hadoop/data/f 26 GB (0% inode=99%): /var/lib/hadoop/data/l 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [18:15:41] 10Operations, 10ops-eqiad, 10Traffic: cp1085 - IPMI not working - https://phabricator.wikimedia.org/T231525 (10wiki_willy) a:03Cmjohnson [18:17:46] 10Operations, 
10ops-eqiad, 10User-Elukey: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10wiki_willy) a:05elukey→03Cmjohnson Assigning over to @Cmjohnson for @elukey 's question. [18:18:35] (03PS2) 10Volans: transports: add JunOS transport [software/homer] - 10https://gerrit.wikimedia.org/r/533558 (https://phabricator.wikimedia.org/T228388) [18:18:37] (03PS1) 10Volans: config: inject role and site to the configuration [software/homer] - 10https://gerrit.wikimedia.org/r/533568 (https://phabricator.wikimedia.org/T228388) [18:19:26] 10Operations, 10ops-eqiad, 10netops: (Need By: Sept 30) upgrade msw1-eqiad from EX4200 to EX4300 - https://phabricator.wikimedia.org/T225121 (10wiki_willy) a:05Papaul→03Cmjohnson [18:25:55] (03CR) 10Ayounsi: [C: 03+1] "tested, it works" [software/homer] - 10https://gerrit.wikimedia.org/r/533568 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [18:30:40] (03CR) 10Ayounsi: [C: 03+1] "code looks good and tested successfully" [software/homer] - 10https://gerrit.wikimedia.org/r/533558 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [18:32:43] (03PS1) 10Volans: CLI: suppress ncclient noisy logger [software/homer] - 10https://gerrit.wikimedia.org/r/533570 (https://phabricator.wikimedia.org/T228388) [18:35:15] (03PS1) 10Krinkle: mediawiki: Disable loadExitNodes.php maintenance script [puppet] - 10https://gerrit.wikimedia.org/r/533571 (https://phabricator.wikimedia.org/T229736) [18:35:17] (03PS1) 10Krinkle: mediawiki: Remove loadExitNodes.php maintenance script [puppet] - 10https://gerrit.wikimedia.org/r/533572 (https://phabricator.wikimedia.org/T229736) [18:36:03] (03PS2) 10Volans: CLI: suppress ncclient noisy logger [software/homer] - 10https://gerrit.wikimedia.org/r/533570 (https://phabricator.wikimedia.org/T228388) [18:37:03] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: Disable loadExitNodes.php maintenance script [puppet] - 10https://gerrit.wikimedia.org/r/533571 
(https://phabricator.wikimedia.org/T229736) (owner: 10Krinkle) [18:41:49] (03PS2) 10Krinkle: mediawiki: Disable loadExitNodes.php maintenance script [puppet] - 10https://gerrit.wikimedia.org/r/533571 (https://phabricator.wikimedia.org/T229736) [18:47:20] 10Operations, 10Traffic, 10Performance-Team (Radar): Some load.php requests failing due to "ERR_SPDY_PROTOCOL_ERROR 200" - https://phabricator.wikimedia.org/T220022 (10Krinkle) [18:47:29] 10Operations, 10Traffic, 10Wikimedia-Apache-configuration, 10Patch-For-Review, and 2 others: Visual Editor gets stuck opening article (net::ERR_SPDY_PROTOCOL_ERROR 200/Loading failed for the