[00:00:04] twentyafterfour: (Dis)respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190808T0000). Please do the needful.
[00:02:29] !log tstarling@deploy1001 Synchronized wmf-config/CommonSettings.php: hack for Parsoid testing on scandium (duration: 00m 55s)
[00:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:23] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[00:19:33] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[00:25:38] (PS2) Dzahn: parsoid-testing: add Hiera switch between parsoid/JS and parsoid/PHP [puppet] - https://gerrit.wikimedia.org/r/528615 (https://phabricator.wikimedia.org/T229363)
[00:34:56] Operations, wikimediafoundation.org, Security: Setting up static maintenance page on Foundation servers for Foundation website - https://phabricator.wikimedia.org/T230075 (Southparkfan) GDNSD supports [[ https://github.com/gdnsd/gdnsd/wiki/GdnsdPluginHttpStatus | HTTP health checks ]] (or [[ https://...
[00:42:47] PROBLEM - PHP opcache health on mwdebug2002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:48:06] !log mwdebug2002 - sudo -i restart-php7.2-fpm
[00:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:48:27] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[00:48:49] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[00:49:13] RECOVERY - PHP opcache health on mwdebug2002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[00:52:10] (PS3) Dzahn: parsoid-testing: add Hiera switch between parsoid/JS and parsoid/PHP [puppet] - https://gerrit.wikimedia.org/r/528615 (https://phabricator.wikimedia.org/T229363)
[00:53:23] (PS1) DannyS712: Add Hubblesite.org and Spacetelescope.org to commons wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/528975 (https://phabricator.wikimedia.org/T230083)
[01:00:16] (PS2) DannyS712: Add Hubblesite.org and Spacetelescope.org to commons wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/528975 (https://phabricator.wikimedia.org/T230083)
[01:09:40] Operations, ops-eqiad, DBA: db1114 crashed due to memory issues (server under warranty) - https://phabricator.wikimedia.org/T229452 (wiki_willy) @Cmjohnson - just following up on this one, since you were out on vacation last week when the task came in. Thanks, Willy
[01:13:20] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/17793/" [puppet] - https://gerrit.wikimedia.org/r/528615 (https://phabricator.wikimedia.org/T229363) (owner: Dzahn)
[01:23:07] Operations, ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T224794 (wiki_willy) Drives received last Wed, July 31 by @Jclark-ctr
[01:39:08] (PS1) Dzahn: parsoid::testing: switch parsoid/JS to parsoid/PHP [puppet] - https://gerrit.wikimedia.org/r/528976
[01:40:33] (CR) jerkins-bot: [V: -1] parsoid::testing: switch parsoid/JS to parsoid/PHP [puppet] - https://gerrit.wikimedia.org/r/528976 (owner: Dzahn)
[01:44:10] (PS2) Dzahn: parsoid::testing: switch parsoid/JS to parsoid/PHP [puppet] - https://gerrit.wikimedia.org/r/528976
[01:45:29] (CR) jerkins-bot: [V: -1] parsoid::testing: switch parsoid/JS to parsoid/PHP [puppet] - https://gerrit.wikimedia.org/r/528976 (owner: Dzahn)
[02:38:48] (PS3) Dzahn: parsoid::testing: allow switching parsoid/JS to parsoid/PHP [puppet] - https://gerrit.wikimedia.org/r/528976 (https://phabricator.wikimedia.org/T229363)
[02:46:15] (PS4) Dzahn: parsoid::testing: allow switching parsoid/JS to parsoid/PHP [puppet] - https://gerrit.wikimedia.org/r/528976 (https://phabricator.wikimedia.org/T229363)
[02:47:11] (CR) jerkins-bot: [V: -1] parsoid::testing: allow switching parsoid/JS to parsoid/PHP [puppet] - https://gerrit.wikimedia.org/r/528976 (https://phabricator.wikimedia.org/T229363) (owner: Dzahn)
[02:51:05] (PS5) Dzahn: parsoid::testing: allow switching parsoid/JS to parsoid/PHP [puppet] - https://gerrit.wikimedia.org/r/528976 (https://phabricator.wikimedia.org/T229363)
[02:52:19] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 26576800 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:52:44] Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban): cloudelastic1002: SMART/disk error - https://phabricator.wikimedia.org/T230088 (Mathew.onipe)
[02:53:05] Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban): cloudelastic1002: SMART/disk error - https://phabricator.wikimedia.org/T230088 (Mathew.onipe) p:Triage→Normal
[02:53:55] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 164352 and 75 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:01:43] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[03:02:55] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[03:04:31] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/compiler1001/17800/scandium.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/528976 (https://phabricator.wikimedia.org/T229363) (owner: Dzahn)
[03:17:42] (CR) jenkins-bot: Add conditional loading of Parsoid/PHP as an extension [mediawiki-config] - https://gerrit.wikimedia.org/r/528591 (https://phabricator.wikimedia.org/T229354) (owner: Subramanya Sastry)
[03:22:21] (CR) Subramanya Sastry: [C: +1] parsoid::testing: allow switching parsoid/JS to parsoid/PHP [puppet] - https://gerrit.wikimedia.org/r/528976 (https://phabricator.wikimedia.org/T229363) (owner: Dzahn)
[03:22:33] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[03:23:49] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[03:31:15] PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[03:39:19] Operations, SRE-Access-Requests: Requesting access to Mwmaint1002, Stat1007 for Abijeet Patro - https://phabricator.wikimedia.org/T230020 (Arrbee) I am Abijeet's manager. This is an approved request. Thanks.
[04:07:22] Operations, Cloud-VPS, Toolforge, LDAP, and 2 others: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280 (bd808)
[04:09:31] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:10:05] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:24:28] (PS1) Subramanya Sastry: Fix the update_parsoid.sh script for the Parsoid/PHP usecase [puppet] - https://gerrit.wikimedia.org/r/528980 (https://phabricator.wikimedia.org/T229858)
[04:56:31] (PS9) Vgutierrez: Backport required OCSP commits [debs/trafficserver] - https://gerrit.wikimedia.org/r/528716 (https://phabricator.wikimedia.org/T220383)
[04:59:41] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:00:11] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:06:19] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:06:25] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:41:39] Operations, MediaWiki-Configuration, conftool, Patch-For-Review, Performance-Team (Radar): noc.wm.o/db.php: remove hosts information, or fetch it from etcd somehow - https://phabricator.wikimedia.org/T229631 (Marostegui) >>! In T229631#5399719, @CDanis wrote: > @Marostegui as of now, there is...
[05:46:38] (PS2) Smalyshev: Add L and M to allowed statement starts [puppet] - https://gerrit.wikimedia.org/r/526755
[05:49:24] (PS1) Marostegui: mariadb: Decommission db2035 [puppet] - https://gerrit.wikimedia.org/r/528982 (https://phabricator.wikimedia.org/T229784)
[05:51:50] (CR) Marostegui: [C: +2] mariadb: Decommission db2035 [puppet] - https://gerrit.wikimedia.org/r/528982 (https://phabricator.wikimedia.org/T229784) (owner: Marostegui)
[05:52:17] Operations, ops-codfw, DBA, DC-Ops, and 2 others: Decommission db2035 - https://phabricator.wikimedia.org/T229784 (Marostegui)
[05:52:55] !log Remove db2035 from tendril and zarcillo T229784
[05:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:53:05] T229784: Decommission db2035 - https://phabricator.wikimedia.org/T229784
[05:53:52] (PS10) Vgutierrez: Backport required OCSP commits [debs/trafficserver] - https://gerrit.wikimedia.org/r/528716 (https://phabricator.wikimedia.org/T220383)
[05:54:43] !log Stop MySQL on db2035 for decommissioning T229784
[05:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:10] (PS11) Vgutierrez: Backport required OCSP commits [debs/trafficserver] - https://gerrit.wikimedia.org/r/528716 (https://phabricator.wikimedia.org/T220383)
[05:56:43] Operations, ops-codfw, DBA, DC-Ops, and 2 others: Decommission db2035 - https://phabricator.wikimedia.org/T229784 (Marostegui)
[05:57:13] Operations, ops-codfw, DC-Ops, decommission: Decommission db2035 - https://phabricator.wikimedia.org/T229784 (Marostegui) a:Marostegui→RobH This host is now ready for #dc-ops to decommission
[05:57:50] (PS1) Vgutierrez: Backport commits required to report SSL stats to an origin server [debs/trafficserver] - https://gerrit.wikimedia.org/r/528984 (https://phabricator.wikimedia.org/T228135)
[06:07:11] (PS1) Elukey: cdh: increase the max retention of hdfs-audit.log [puppet] - https://gerrit.wikimedia.org/r/528985
[06:09:21] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs_80: Servers wdqs1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:09:33] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs_80: Servers wdqs1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:10:40] (PS2) Vgutierrez: Backport commits required to report SSL stats to an origin server [debs/trafficserver] - https://gerrit.wikimedia.org/r/528984 (https://phabricator.wikimedia.org/T228135)
[06:10:48] PROBLEM - LVS HTTP IPv4 #page on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[06:10:53] uh
[06:10:59] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs1004.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[06:11:49] (CR) Elukey: [C: +2] cdh: increase the max retention of hdfs-audit.log [puppet] - https://gerrit.wikimedia.org/r/528985 (owner: Elukey)
[06:12:16] RECOVERY - LVS HTTP IPv4 #page on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.013 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[06:12:35] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:12:47] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[06:23:00] (PS1) Marostegui: mariadb: Promote db2096 to x1 codfw master [puppet] - https://gerrit.wikimedia.org/r/528987 (https://phabricator.wikimedia.org/T220170)
[06:24:02] (PS2) Giuseppe Lavagetto: mediawiki::php: tune after an appserver 100% in production [puppet] - https://gerrit.wikimedia.org/r/528176
[06:25:01] (PS1) Marostegui: db-codfw.php: Promote db2096 to x1 codfw master [mediawiki-config] - https://gerrit.wikimedia.org/r/528988 (https://phabricator.wikimedia.org/T228258)
[06:25:55] (CR) jerkins-bot: [V: -1] db-codfw.php: Promote db2096 to x1 codfw master [mediawiki-config] - https://gerrit.wikimedia.org/r/528988 (https://phabricator.wikimedia.org/T228258) (owner: Marostegui)
[06:26:02] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[06:27:42] (CR) Giuseppe Lavagetto: [C: +2] mediawiki::php: tune after an appserver 100% in production [puppet] - https://gerrit.wikimedia.org/r/528176 (owner: Giuseppe Lavagetto)
[06:30:44] PROBLEM - puppet last run on mw2285 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/tmpreaper.conf] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:40:43] <_joe_> !log restarting php-fpm on the application servers to pick up the change
[06:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:54] RECOVERY - PHP opcache health on mwdebug1002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[06:44:12] RECOVERY - PHP opcache health on mwdebug1001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[06:45:50] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 124, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:47:00] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:50:06] (CR) Giuseppe Lavagetto: [C: +1] "it's so hacky I love it!" [puppet] - https://gerrit.wikimedia.org/r/528906 (https://phabricator.wikimedia.org/T230002) (owner: Jbond)
[06:58:28] RECOVERY - puppet last run on mw2285 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[07:00:38] PROBLEM - PHP opcache health on mw1242 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[07:01:14] PROBLEM - PHP opcache health on mw1226 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[07:02:12] PROBLEM - PHP opcache health on mw1343 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[07:02:14] RECOVERY - PHP opcache health on mw1242 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[07:02:42] PROBLEM - PHP opcache health on mw1307 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[07:03:48] RECOVERY - PHP opcache health on mw1343 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[07:04:16] RECOVERY - PHP opcache health on mw1307 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[07:04:26] RECOVERY - PHP opcache health on mw1226 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[07:05:08] PROBLEM - PHP opcache health on mw1249 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[07:05:16] PROBLEM - PHP opcache health on mw1244 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[07:06:52] RECOVERY - PHP opcache health on mw1244 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[07:07:26] PROBLEM - PHP opcache health on mw1258 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[07:08:18] RECOVERY - PHP opcache health on mw1249 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[07:09:00] RECOVERY - PHP opcache health on mw1258 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[07:10:32] (PS3) Vgutierrez: ATS: Include ATS tls instance in cache upload role [puppet] - https://gerrit.wikimedia.org/r/513970 (https://phabricator.wikimedia.org/T221594)
[07:10:34] PROBLEM - PHP opcache health on mw1246 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[07:11:24] (CR) Ema: [C: +1] "Very nice! A few minor comments." (3 comments) [puppet] - https://gerrit.wikimedia.org/r/528887 (owner: Giuseppe Lavagetto)
[07:12:08] RECOVERY - PHP opcache health on mw1246 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[07:13:54] (CR) Giuseppe Lavagetto: [C: +2] envoyproxy: create module, add tls terminator definition [puppet] - https://gerrit.wikimedia.org/r/526110 (owner: Giuseppe Lavagetto)
[07:22:26] (PS1) ArielGlenn: look at dumps logs every so often for exceptions and report them [puppet] - https://gerrit.wikimedia.org/r/528995 (https://phabricator.wikimedia.org/T230099)
[07:23:50] (CR) jerkins-bot: [V: -1] look at dumps logs every so often for exceptions and report them [puppet] - https://gerrit.wikimedia.org/r/528995 (https://phabricator.wikimedia.org/T230099) (owner: ArielGlenn)
[07:25:40] (CR) Vgutierrez: [C: +1] Add discovery CNAME webserver-misc-static -> bromine [dns] - https://gerrit.wikimedia.org/r/528721 (https://phabricator.wikimedia.org/T210411) (owner: Ema)
[07:25:42] (CR) Giuseppe Lavagetto: profile::tlsproxy::envoy: new TLS terminator for services (3 comments) [puppet] - https://gerrit.wikimedia.org/r/528887 (owner: Giuseppe Lavagetto)
[07:26:02] (PS6) Giuseppe Lavagetto: envoyproxy: create module, add tls terminator definition [puppet] - https://gerrit.wikimedia.org/r/526110
[07:26:04] (PS3) Giuseppe Lavagetto: profile::tlsproxy::envoy: new TLS terminator for services [puppet] - https://gerrit.wikimedia.org/r/528887 (https://phabricator.wikimedia.org/T210411)
[07:26:06] (PS2) Giuseppe Lavagetto: role::webserver_misc_static: add TLS termination with envoy [puppet] - https://gerrit.wikimedia.org/r/528900 (https://phabricator.wikimedia.org/T210411)
[07:26:50] (PS2) ArielGlenn: look at dumps logs every so often for exceptions and report them [puppet] - https://gerrit.wikimedia.org/r/528995 (https://phabricator.wikimedia.org/T230099)
[07:28:38] (PS2) Muehlenhoff: varnish: Remove support for jessie [puppet] - https://gerrit.wikimedia.org/r/528637
[07:30:54] (PS3) Giuseppe Lavagetto: role::webserver_misc_static: add TLS termination with envoy [puppet] - https://gerrit.wikimedia.org/r/528900 (https://phabricator.wikimedia.org/T210411)
[07:31:26] (PS2) Ema: Add discovery CNAME webserver-misc-static -> bromine [dns] - https://gerrit.wikimedia.org/r/528721 (https://phabricator.wikimedia.org/T210411)
[07:32:17] (PS2) Marostegui: db-codfw.php: Promote db2096 to x1 codfw master [mediawiki-config] - https://gerrit.wikimedia.org/r/528988 (https://phabricator.wikimedia.org/T228258)
[07:33:10] (PS3) Muehlenhoff: varnish: Remove support for jessie [puppet] - https://gerrit.wikimedia.org/r/528637
[07:33:15] (CR) jerkins-bot: [V: -1] db-codfw.php: Promote db2096 to x1 codfw master [mediawiki-config] - https://gerrit.wikimedia.org/r/528988 (https://phabricator.wikimedia.org/T228258) (owner: Marostegui)
[07:33:36] (CR) Ema: [C: +2] Add discovery CNAME webserver-misc-static -> bromine [dns] - https://gerrit.wikimedia.org/r/528721 (https://phabricator.wikimedia.org/T210411) (owner: Ema)
[07:35:53] (PS3) Marostegui: db-codfw.php: Promote db2096 to x1 codfw master [mediawiki-config] - https://gerrit.wikimedia.org/r/528988 (https://phabricator.wikimedia.org/T228258)
[07:37:14] (CR) Muehlenhoff: [C: +2] varnish: Remove support for jessie [puppet] - https://gerrit.wikimedia.org/r/528637 (owner: Muehlenhoff)
[07:37:39] (PS1) Elukey: role::analytics_test_cluster::hadoop::master|standby: test java G1GC [puppet] - https://gerrit.wikimedia.org/r/529034
[07:38:27] (PS4) Giuseppe Lavagetto: role::webserver_misc_static: add TLS termination with envoy [puppet] - https://gerrit.wikimedia.org/r/528900 (https://phabricator.wikimedia.org/T210411)
[07:38:58] (CR) Giuseppe Lavagetto: [C: +2] profile::tlsproxy::envoy: new TLS terminator for services [puppet] - https://gerrit.wikimedia.org/r/528887 (https://phabricator.wikimedia.org/T210411) (owner: Giuseppe Lavagetto)
[07:39:03] _joe_: shall I puppet-merge your " envoyproxy: create module, add tls terminator definition " commit along?
[07:39:11] (PS4) Giuseppe Lavagetto: profile::tlsproxy::envoy: new TLS terminator for services [puppet] - https://gerrit.wikimedia.org/r/528887 (https://phabricator.wikimedia.org/T210411)
[07:39:13] !log Switchover x1 codfw master db2069 -> db2096 T220170
[07:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:21] T220170: Address Database infrastructure blockers on datacenter switchover & multi-dc deployment - https://phabricator.wikimedia.org/T220170
[07:39:45] <_joe_> moritzm: sure sorry
[07:39:49] doing
[07:40:02] <_joe_> I was about to merge the second patch in the batch but you got in the middle of it
[07:40:53] (PS2) Elukey: Test Java's G1GC in the Hadoop Test Namenodes [puppet] - https://gerrit.wikimedia.org/r/529034
[07:41:00] (PS1) Ema: ATS: use TLS and discovery hostname for bromine [puppet] - https://gerrit.wikimedia.org/r/529035 (https://phabricator.wikimedia.org/T210411)
[07:41:19] (CR) Marostegui: [C: +2] db-codfw.php: Promote db2096 to x1 codfw master [mediawiki-config] - https://gerrit.wikimedia.org/r/528988 (https://phabricator.wikimedia.org/T228258) (owner: Marostegui)
[07:42:02] (PS3) Elukey: Test Java's G1GC in the Hadoop Test Namenodes [puppet] - https://gerrit.wikimedia.org/r/529034
[07:42:05] (CR) jenkins-bot: db-codfw.php: Promote db2096 to x1 codfw master [mediawiki-config] - https://gerrit.wikimedia.org/r/528988 (https://phabricator.wikimedia.org/T228258) (owner: Marostegui)
[07:42:21] Operations, SRE-Access-Requests: Requesting access to Mwmaint1002, Stat1007 for Abijeet Patro - https://phabricator.wikimedia.org/T230020 (abi_)
[07:42:21] _joe_: going to merge your change
[07:42:32] (PS4) Elukey: Test Java's G1GC in the Hadoop Test Namenodes [puppet] - https://gerrit.wikimedia.org/r/529034
[07:42:41] <_joe_> marostegui: I just did
[07:42:54] (PS2) Marostegui: mariadb: Promote db2096 to x1 codfw master [puppet] - https://gerrit.wikimedia.org/r/528987 (https://phabricator.wikimedia.org/T220170)
[07:42:55] me too :-/
[07:42:56] <_joe_> marostegui: which change you meant?
[07:43:08] <_joe_> the ops/puppet one?
[07:43:14] profile::tlsproxy::envoy: new TLS terminator for services (0bc9b90483)
[07:43:16] thatone
[07:43:21] <_joe_> I just merged it
[07:43:23] <_joe_> :P
[07:43:25] me too
[07:43:37] (CR) Elukey: [C: +2] Test Java's G1GC in the Hadoop Test Namenodes [puppet] - https://gerrit.wikimedia.org/r/529034 (owner: Elukey)
[07:43:38] I am getting errors from puppet1002, 2001, 2002...
[07:43:46] ouch
[07:44:04] (PS3) Marostegui: mariadb: Promote db2096 to x1 codfw master [puppet] - https://gerrit.wikimedia.org/r/528987 (https://phabricator.wikimedia.org/T220170)
[07:44:54] I just ran a puppet merge and it went fine marostegui
[07:45:00] cool
[07:45:00] (CR) Mobrovac: [C: +1] Fix the update_parsoid.sh script for the Parsoid/PHP usecase [puppet] - https://gerrit.wikimedia.org/r/528980 (https://phabricator.wikimedia.org/T229858) (owner: Subramanya Sastry)
[07:45:14] I guess a race condition
[07:45:22] (CR) Marostegui: [C: +2] mariadb: Promote db2096 to x1 codfw master [puppet] - https://gerrit.wikimedia.org/r/528987 (https://phabricator.wikimedia.org/T220170) (owner: Marostegui)
[07:46:44] (PS3) ArielGlenn: look at dumps logs every so often for exceptions and report them [puppet] - https://gerrit.wikimedia.org/r/528995 (https://phabricator.wikimedia.org/T230099)
[07:47:19] (PS2) Ema: ATS: use TLS and discovery hostname for bromine [puppet] - https://gerrit.wikimedia.org/r/529035 (https://phabricator.wikimedia.org/T210411)
[07:48:27] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Promote db2096 as codfw x1 master T220170 (duration: 00m 57s)
[07:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:34] T220170: Address Database infrastructure blockers on datacenter switchover & multi-dc deployment - https://phabricator.wikimedia.org/T220170
[07:49:08] Operations, DBA, Patch-For-Review: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[07:50:30] wow
[07:50:35] that's amazing
[07:51:19] (PS5) Marostegui: mariadb: Productionize dbproxy2003 into m3-codfw [puppet] - https://gerrit.wikimedia.org/r/528734 (https://phabricator.wikimedia.org/T202367)
[07:58:00] (CR) Marostegui: [C: +2] mariadb: Productionize dbproxy2003 into m3-codfw [puppet] - https://gerrit.wikimedia.org/r/528734 (https://phabricator.wikimedia.org/T202367) (owner: Marostegui)
[08:08:52] (CR) Giuseppe Lavagetto: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/17804/bromine.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/528900 (https://phabricator.wikimedia.org/T210411) (owner: Giuseppe Lavagetto)
[08:09:00] <_joe_> ema ^^
[08:09:09] <_joe_> I'll disable puppet on bromine, run it on vega
[08:09:18] (PS5) Giuseppe Lavagetto: role::webserver_misc_static: add TLS termination with envoy [puppet] - https://gerrit.wikimedia.org/r/528900 (https://phabricator.wikimedia.org/T210411)
[08:10:09] <_joe_> waiting on jenkins...
[08:10:34] <_joe_> jeez that thing is slow
[08:12:12] <_joe_> puppet-merge is excruciatingly slow too
[08:12:38] * ema follows along on vega
[08:13:05] Aug 8 08:12:40 vega puppet-agent[3830]: (/Stage[main]/Envoyproxy/File[/usr/local/sbin/build-envoy-config]) Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/envoyproxy/files/build_envoy_config.py
[08:13:22] !log Stop MySQL on db2065 to test dbproxy2003
[08:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:30] <_joe_> envoyproxy.service vs envoy.service
[08:13:32] <_joe_> sigh
[08:13:37] <_joe_> ema: amending
[08:13:38] almost :)
[08:13:41] _joe_: ack!
[08:14:13] (PS1) Urbanecm: Use editautoreviewprotected for autoreview protection level only [mediawiki-config] - https://gerrit.wikimedia.org/r/529039 (https://phabricator.wikimedia.org/T230103)
[08:14:33] Operations, SRE-Access-Requests: Requesting access to LogStash for Abijeet Patro - https://phabricator.wikimedia.org/T230104 (abi_)
[08:14:47] (PS2) Urbanecm: Use editautoreviewprotected for autoreview protection level only [mediawiki-config] - https://gerrit.wikimedia.org/r/529039 (https://phabricator.wikimedia.org/T230103)
[08:14:55] (PS3) Urbanecm: Use editautoreviewprotected for autoreview protection level only [mediawiki-config] - https://gerrit.wikimedia.org/r/529039 (https://phabricator.wikimedia.org/T230103)
[08:16:11] (PS3) Ema: ATS: use TLS and discovery hostname for bromine [puppet] - https://gerrit.wikimedia.org/r/529035 (https://phabricator.wikimedia.org/T210411)
[08:16:19] PROBLEM - puppet last run on vega is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/build-envoy-config],Service[envoy.service] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:16:45] we know we know
[08:17:09] <_joe_> yeah not sure about the former failure
[08:17:13] ACKNOWLEDGEMENT - puppet last run on vega is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/build-envoy-config],Service[envoy.service] Ema Known https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:17:37] <_joe_> oh I see it now, d'oh
[08:18:49] should source be puppet:///modules/envoyproxy/build_envoy_config.py instead?
[08:19:02] <_joe_> yes
[08:19:56] that confuses me every time
[08:19:58] (PS4) Vgutierrez: ATS: Include TLS instance in cache upload role [puppet] - https://gerrit.wikimedia.org/r/513970 (https://phabricator.wikimedia.org/T221594)
[08:20:00] (PS1) Vgutierrez: ATS: Toggle X-Forwarded-For header [puppet] - https://gerrit.wikimedia.org/r/529040 (https://phabricator.wikimedia.org/T221594)
[08:21:07] (PS1) Giuseppe Lavagetto: envoy: fix file path, service name [puppet] - https://gerrit.wikimedia.org/r/529041
[08:21:09] (PS4) Urbanecm: Use editautoreviewprotected for autoreview protection level only [mediawiki-config] - https://gerrit.wikimedia.org/r/529039 (https://phabricator.wikimedia.org/T230103)
[08:21:52] (CR) Vgutierrez: "almost a NOOP for the backend ATS instance: https://puppet-compiler.wmflabs.org/compiler1001/17806/" [puppet] - https://gerrit.wikimedia.org/r/529040 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[08:23:31] <_joe_> it's 3 minutes I wait on CI
[08:23:45] (CR) Giuseppe Lavagetto: [C: +2] envoy: fix file path, service name [puppet] - https://gerrit.wikimedia.org/r/529041 (owner: Giuseppe Lavagetto)
[08:26:02] (CR) Ema: [C: +1] ATS: Toggle X-Forwarded-For header [puppet] - https://gerrit.wikimedia.org/r/529040 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[08:26:08] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[08:26:32] RECOVERY - puppet last run on vega is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:26:43] <_joe_> ema: more issues, but I can work on them, I'll let you know when I have everything sorted out
[08:26:46] (PS1) Urbanecm: [wip] Use editautopatrolprotected right for pages protected for autopatrollers [mediawiki-config] - https://gerrit.wikimedia.org/r/529043 (https://phabricator.wikimedia.org/T230103)
[08:26:56] _joe_: ok! Let me know if I can help
[08:31:14] PROBLEM - Check systemd state on vega is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:31:40] (CR) VolkerE: design.wikimedia.org: add new dir and repo for strategy site (1 comment) [puppet] - https://gerrit.wikimedia.org/r/528922 (https://phabricator.wikimedia.org/T230053) (owner: Dzahn)
[08:32:19] Operations, Cloud-Services, Traffic: All sites served by cloudweb2001-dev return 503 - https://phabricator.wikimedia.org/T230105 (ema)
[08:32:25] Operations, Cloud-Services, Traffic: All sites served by cloudweb2001-dev return 503 - https://phabricator.wikimedia.org/T230105 (ema) p:Triage→Normal
[08:32:51] (PS1) Elukey: Use Java's G1 GC for the Hadoop Analytics HDFS Namenodes [puppet] - https://gerrit.wikimedia.org/r/529044
[08:35:13] <_joe_> :q
[08:35:22] <_joe_> wrong window!
[08:35:36] (03PS2) 10Elukey: Use Java's G1 GC for the Hadoop Analytics HDFS Namenodes [puppet] - 10https://gerrit.wikimedia.org/r/529044 [08:36:04] _joe_: I thought you were an emacs user! Now we know what you do when nobody is watching :) [08:36:48] <_joe_> ema: I was using "view" in production :P [08:36:57] !log Remove math table from s5 T196055 [08:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:06] T196055: Remove table `math` from the database - https://phabricator.wikimedia.org/T196055 [08:37:25] (03CR) 10Elukey: [C: 03+2] Use Java's G1 GC for the Hadoop Analytics HDFS Namenodes [puppet] - 10https://gerrit.wikimedia.org/r/529044 (owner: 10Elukey) [08:38:37] (03CR) 10Gehel: [C: 04-1] "see comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/528885 (https://phabricator.wikimedia.org/T229621) (owner: 10Mathew.onipe) [08:38:54] (03PS2) 10Urbanecm: Use editautopatrolprotected right for pages protected for autopatrollers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529043 (https://phabricator.wikimedia.org/T230103) [08:39:19] (03PS1) 10Muehlenhoff: Fix comparison of 'ops' groups in daily account check [puppet] - 10https://gerrit.wikimedia.org/r/529045 [08:40:20] (03CR) 10jerkins-bot: [V: 04-1] Fix comparison of 'ops' groups in daily account check [puppet] - 10https://gerrit.wikimedia.org/r/529045 (owner: 10Muehlenhoff) [08:40:39] (03PS1) 10Urbanecm: Use editeditorprotected for protecting pages for editors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529046 (https://phabricator.wikimedia.org/T230103) [08:42:12] 10Operations, 10SRE-Access-Requests: Requesting access to LogStash for Abijeet Patro - https://phabricator.wikimedia.org/T230104 (10Arrbee) This is an approved request for Abijeet. Thanks. 
[08:43:22] (03PS14) 10Mathew.onipe: Add maps reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) [08:44:38] !log installing OpenJDK security updates on elastic* servers [08:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:27] !log restart hadoop namenodes on an-master100* to pick up new GC settings (CMS -> G1 switch) [08:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:33] (03PS1) 10Urbanecm: [tests] Test wgRestrictionLevels entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529047 (https://phabricator.wikimedia.org/T230103) [08:50:41] (03PS2) 10Muehlenhoff: Fix comparison of 'ops' groups in daily account check [puppet] - 10https://gerrit.wikimedia.org/r/529045 [08:50:51] (03CR) 10jerkins-bot: [V: 04-1] [tests] Test wgRestrictionLevels entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529047 (https://phabricator.wikimedia.org/T230103) (owner: 10Urbanecm) [08:51:52] PROBLEM - puppet last run on elastic1030 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[openjdk-8-jdk] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:52:30] moritzm: ^ related to your upgrade? 
[08:53:24] looks like a transient, should recover in a minute [08:56:40] (03PS2) 10Urbanecm: [tests] Test wgRestrictionLevels entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529047 (https://phabricator.wikimedia.org/T230103) [08:57:04] RECOVERY - puppet last run on elastic1030 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [08:57:58] (03CR) 10jerkins-bot: [V: 04-1] [tests] Test wgRestrictionLevels entries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529047 (https://phabricator.wikimedia.org/T230103) (owner: 10Urbanecm) [08:58:23] gehel: yeah, it's the transitient log spam; it happens when puppet runs during a fleet-wide deploy [08:59:35] 10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission db2069 - https://phabricator.wikimedia.org/T230107 (10Marostegui) [08:59:49] 10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission db2069 - https://phabricator.wikimedia.org/T230107 (10Marostegui) p:05Triage→03Normal [09:00:42] (03PS1) 10Muehlenhoff: Add Cumin alias for cloudelastic [puppet] - 10https://gerrit.wikimedia.org/r/529050 [09:01:34] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Remove db2069 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529051 (https://phabricator.wikimedia.org/T230107) [09:03:25] (03PS2) 10Marostegui: db-eqiad,db-codfw.php: Remove db2069 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529051 (https://phabricator.wikimedia.org/T230107) [09:03:27] (03CR) 10jerkins-bot: [V: 04-1] db-eqiad,db-codfw.php: Remove db2069 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529051 (https://phabricator.wikimedia.org/T230107) (owner: 10Marostegui) [09:05:25] (03PS2) 10Muehlenhoff: Add Cumin alias for cloudelastic [puppet] - 10https://gerrit.wikimedia.org/r/529050 [09:05:50] 10Operations, 10ops-eqiad, 10DBA, 10decommission, 10Patch-For-Review: Decommission db2069 - 
https://phabricator.wikimedia.org/T230107 (10Marostegui) [09:12:05] (03CR) 10Muehlenhoff: [C: 03+2] Add Cumin alias for cloudelastic [puppet] - 10https://gerrit.wikimedia.org/r/529050 (owner: 10Muehlenhoff) [09:17:23] (03CR) 10Marostegui: [C: 03+2] db-eqiad,db-codfw.php: Remove db2069 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529051 (https://phabricator.wikimedia.org/T230107) (owner: 10Marostegui) [09:18:24] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db2069 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529051 (https://phabricator.wikimedia.org/T230107) (owner: 10Marostegui) [09:19:16] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db2069 from config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529051 (https://phabricator.wikimedia.org/T230107) (owner: 10Marostegui) [09:19:19] 10Operations, 10ops-eqiad, 10DBA, 10decommission, 10Patch-For-Review: Decommission db2069 - https://phabricator.wikimedia.org/T230107 (10Marostegui) [09:19:54] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db2069 from config T230107 (duration: 00m 57s) [09:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:03] T230107: Decommission db2069 - https://phabricator.wikimedia.org/T230107 [09:20:58] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db2069 from config T230107 (duration: 00m 55s) [09:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:27] (03PS3) 10Marostegui: maintain-views.yaml: Remove math table [puppet] - 10https://gerrit.wikimedia.org/r/528724 (https://phabricator.wikimedia.org/T196055) [09:24:10] hi all i plan to disable puppet fleet wide for a period of about 5 minutes at 10:30 while we restarte the puppetdb service [09:24:28] sorry 9:30 UTC [09:24:31] 10Operations, 10Puppet: Stop forcing RUNNER=php for foreachwiki/foreachwikiindblist - https://phabricator.wikimedia.org/T230110 
(10MarcoAurelio) [09:25:49] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/528719 (https://phabricator.wikimedia.org/T229262) (owner: 10Filippo Giunchedi) [09:26:07] 10Operations, 10cloud-services-team, 10serviceops, 10Core Platform Team Legacy (Watching / External), and 4 others: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (10MarcoAurelio) [09:26:11] !log Drop table math from labswiki (wikitech) and labtestwiki T196055 [09:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:20] T196055: Remove table `math` from the database - https://phabricator.wikimedia.org/T196055 [09:27:47] (03PS5) 10Vgutierrez: ATS: Include TLS instance in cache upload role [puppet] - 10https://gerrit.wikimedia.org/r/513970 (https://phabricator.wikimedia.org/T221594) [09:27:49] (03PS1) 10Vgutierrez: ATS: Propagate config_prefix to trafficserver::lua_infra [puppet] - 10https://gerrit.wikimedia.org/r/529052 (https://phabricator.wikimedia.org/T221594) [09:28:41] (03CR) 10Jbond: [C: 03+1] "LGTM and thanks" [puppet] - 10https://gerrit.wikimedia.org/r/528586 (owner: 10Cwhite) [09:30:16] (03CR) 10jerkins-bot: [V: 04-1] ATS: Include TLS instance in cache upload role [puppet] - 10https://gerrit.wikimedia.org/r/513970 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [09:30:19] 10Operations, 10MediaWiki-API, 10Wikidata, 10Wikidata-Campsite, 10Core Platform Team Workboards (Clinic Duty Team): wikidata.org handles GET MWAPI requests, but silently fails on POST - https://phabricator.wikimedia.org/T230051 (10mobrovac) p:05Triage→03Normal [09:30:27] WTF? 
[09:30:46] !log disabling puppet fleet wide [09:30:52] RECOVERY - Check systemd state on vega is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:22] (03CR) 10Vgutierrez: "NOOP for existing ATS backend instance: https://puppet-compiler.wmflabs.org/compiler1002/17807/ the diff showed by pcc is due to the previ" [puppet] - 10https://gerrit.wikimedia.org/r/529052 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [09:32:06] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [09:34:19] (03CR) 10Muehlenhoff: "Have the widespread-puppet-agent* checks been encountered live to validate that they are working as expected, either by intentionally forc" [puppet] - 10https://gerrit.wikimedia.org/r/528719 (https://phabricator.wikimedia.org/T229262) (owner: 10Filippo Giunchedi) [09:36:32] (03CR) 10Vgutierrez: [C: 03+2] ATS: Toggle X-Forwarded-For header [puppet] - 10https://gerrit.wikimedia.org/r/529040 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [09:36:43] (03PS2) 10Vgutierrez: ATS: Toggle X-Forwarded-For header [puppet] - 10https://gerrit.wikimedia.org/r/529040 (https://phabricator.wikimedia.org/T221594) [09:37:50] apergos: flow is a headache re. 
T207627 [09:37:51] T207627: Disable unused Flow extension on ur.wikibooks - https://phabricator.wikimedia.org/T207627 [09:37:58] (03PS2) 10Mathew.onipe: lvs: isolate cloudelastic icinga check [puppet] - 10https://gerrit.wikimedia.org/r/528885 (https://phabricator.wikimedia.org/T229621) [09:38:00] (03PS1) 10Mathew.onipe: lvs: allow access to wdqs lvs on port 8888 [puppet] - 10https://gerrit.wikimedia.org/r/529053 (https://phabricator.wikimedia.org/T176875) [09:38:01] yeah I know [09:38:28] I think RoanKattouw knows how to deal with the leftovers [09:39:39] if you look at the history of these, the problem revisions are all actually wikitext, or at least the ones I have inspected by retrieving the content from external storage and uncompressing it. [09:40:23] so they did run the migration script apparently right? [09:40:30] but when flow was disabled, the model in the page table was set properly to wikitext while the entry in the content table for the specific revisions were not. [09:40:37] I don't know if they did [09:40:57] and I don't know if it would DTRT now in any case [09:41:12] I think it'd be easier to get rid of those page_ids (delete) [09:41:13] !log installing OpenJDK security updates on WDQS servers [09:41:19] but I'm not sure myself either [09:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:30] you can't delete them via mw [09:41:35] because of the bad content handler issue [09:41:36] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: site=eqsin https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [09:41:40] these revisions are all top [09:41:46] Indeed. 
I've tried and failed, royally [09:42:25] the real problem is that I don't know who does these sorts of direct alterations of fields in the db [09:42:28] there's no procedure [09:42:47] and there's no one I know to ask even to point me to someone else [09:43:00] jynus or marostegui maybe? [09:43:09] ummmm :-D [09:43:10] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [09:43:28] I doubt it but let's see what marostegui says. [09:43:59] Care to modify some content_model fields in the db for dewikiversity and urwikibooks? I *think* that's all that's needed but can't be 100% sure because [09:44:03] it's flow :-/ [09:44:24] apergos: I also think that given that the namespace where those pages no longer exists it is causing issues as well [09:44:39] the namespace doesn't exist [09:44:54] at least one of these cases the page had the unfortunate name before there was flow [09:45:02] [[wikitech:Flow]] says: "$wgExtraNamespaces[2600] = 'Topic';" [09:45:03] so the fact that it had the same prefix was chance [09:45:16] that'd maybe make them accesible via the UI [09:45:34] without the namespace they will just be treated as namespace 0 entries [09:46:30] hmm [09:46:34] MariaDB [urwikibooks_p]> select count(*) from page where page_namespace=2600; returns 5 [09:46:42] so mediawiki thinks they're still in NS 2600 [09:47:19] What's up? [09:47:26] hmmm these ones in urwikibooks are all ns 2600, well that's just jolly [09:47:37] oh, I see you were doing the same thing :-D [09:47:52] so there's more that's needed. 
uuuggghh [09:48:07] marostegui: it's this same old ticket https://phabricator.wikimedia.org/T207627 and the dewikiversity one [09:48:17] trying to figure out exactly what interventions are needed to clean up the mess [09:48:59] given that mediawiki can't acces the revisions or pages because of the now unknown content handler [09:49:05] sudo bash; then delete * from urwikibooks; done :P :P [09:51:09] it looks like resetting the content_model to 1 for the specific rows in he content table, and resetting the namespace to 0 for the corresponding entries in the page table, *might* be enough, but dunno, and we have no volunteers to actually try it either :-/ [09:51:45] I'd have to check reach revision to be sure the content really is wikitext when retrieved directly from ext store and uncompressed, I've only spot checked a couple so far [09:51:45] I'll take a look at dewikiversity [09:52:48] select page_id, page_title, page_namespace from page where page_namespace=2600; returns just one [09:52:56] NS 2600 ofc [09:52:58] sigh [09:53:38] hi apergos & hauskatze [09:53:42] https://phabricator.wikimedia.org/T220594 [09:53:47] all the info on that one rev is there [09:53:52] may I suggest to follow https://wikitech.wikimedia.org/wiki/Flow ? [09:53:57] it's verified to be wikitext [09:54:04] see if that let us delete the stuff? 
[09:54:08] hi Urbanecm [09:54:55] that's not enough [09:55:05] hauskatze apergos I wouldn't know really, I think we need someone with more context of what would MW expect [09:55:07] the problem is the content_model in the content table [09:55:15] the content_model in the page table is already correct [09:55:21] I guess that these instructions predate mcr [09:57:37] Whatever intervention that needs to be done, should go thru MW though [09:57:46] ie: a script [09:57:47] (03CR) 10Vgutierrez: "pcc shows an actual NOOP now that the CR chain is clean: https://puppet-compiler.wmflabs.org/compiler1001/17808/" [puppet] - 10https://gerrit.wikimedia.org/r/529052 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [09:57:55] (03CR) 10Elukey: "Adding the Traffic team for a more experienced review :)" [puppet] - 10https://gerrit.wikimedia.org/r/529053 (https://phabricator.wikimedia.org/T176875) (owner: 10Mathew.onipe) [09:58:06] title": "Spezial:Badtitle/NS2600:Was wir h\u00f6ren und sehen", [09:58:15] well there's a direct UPDATE page SET page_content_model [09:58:23] so why can't we have one of those for the content table? [09:58:39] https://de.wikiversity.org/wiki/Spezial:ApiSandbox#action=query&format=json&prop=info&pageids=47279 <-- says indeed page content model is wikitext. [09:58:55] it ays on the ticket because I did a select to verify [09:59:14] all the info directly from the database for the revisions, slots, content, page rows is on there [09:59:23] (03CR) 10Ema: [C: 03+1] ATS: Propagate config_prefix to trafficserver::lua_infra [puppet] - 10https://gerrit.wikimedia.org/r/529052 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [09:59:44] I'd try adding the fake NS and try to move it out of NS 2600 afterwards [09:59:54] it's not going to harm the wiki [10:00:04] Urbanecm: ? 
^ [10:00:09] hauskatze, can try that [10:00:21] (03PS6) 10Vgutierrez: ATS: Include TLS instance in cache upload role [puppet] - 10https://gerrit.wikimedia.org/r/513970 (https://phabricator.wikimedia.org/T221594) [10:00:34] hauskatze, for dewikiversity, I guess? [10:00:55] Urbanecm: and urwikibooks, but maybe do dewikiversity first? [10:01:02] at least I can read the words there [10:01:06] let's do one by one :) [10:01:15] urwikibooks being rtl is complicates my life [10:01:24] :) [10:01:38] hauskatze, play on mwdebug1002 [10:01:41] *1001 [10:01:45] sure [10:01:52] * hauskatze enables x-wikimedia-debug [10:01:57] 1001 [10:02:02] yes, mwdebug1001 [10:02:42] accesing... [10:02:53] it's loading [10:03:14] (03CR) 10jerkins-bot: [V: 04-1] ATS: Include TLS instance in cache upload role [puppet] - 10https://gerrit.wikimedia.org/r/513970 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [10:03:33] keeps loading [10:05:24] PROBLEM - HHVM rendering on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [10:05:46] Urbanecm: "title": "Topic:Was wir h\u00f6ren und sehen", :D [10:05:52] I'll try to move it [10:05:55] ok [10:05:58] via API [10:06:05] * Urbanecm wondering why it takes so long [10:06:49] indeed takes long, even on PHP 7.x [10:07:10] (03PS7) 10Vgutierrez: ATS: Include TLS instance in cache upload role [puppet] - 10https://gerrit.wikimedia.org/r/513970 (https://phabricator.wikimedia.org/T221594) [10:07:45] 10Operations, 10Discovery, 10Traffic, 10WMDE-Analytics-Engineering, and 4 others: Allow access to wdqs.svc.eqiad.wmnet on port 8888 - https://phabricator.wikimedia.org/T176875 (10Gehel) A few more comments after discussion with @elukey : * the use of port 8888 to get extended query timeouts is exceptional... 
[10:09:50] Urbanecm: "info": "[XUv03wpAIHsAAGvQFdwAAABO] Caught exception of type MWUnknownContentModelException", [10:09:52] :\ [10:09:57] :/ [10:10:00] just instantiating the page means you have to have the content-handler for 'flow-board' which doesn't exist. so no operations on the page are going to work. not moves, renders, deletes, either via api or not [10:10:02] kurwa [10:10:21] this is what I've been trying to explain [10:12:34] makes sense [10:14:05] any maintenance script that can delete them? maybe deletebatch for the single page [10:14:07] ? [10:14:28] I'd say it will end up with the same result [10:16:09] (03PS1) 10Ema: profile::cache::ssl::wikibase: ensure dhparam creation [puppet] - 10https://gerrit.wikimedia.org/r/529056 [10:16:36] RECOVERY - HHVM rendering on mwdebug1001 is OK: HTTP OK: HTTP/1.1 200 OK - 75077 bytes in 6.547 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:17:41] (03CR) 10jerkins-bot: [V: 04-1] profile::cache::ssl::wikibase: ensure dhparam creation [puppet] - 10https://gerrit.wikimedia.org/r/529056 (owner: 10Ema) [10:21:32] ha [10:21:36] Urbanecm: and https://phabricator.wikimedia.org/T207627#5402068 ? [10:21:38] (03PS2) 10Ema: profile::cache::ssl::wikibase: ensure dhparam creation [puppet] - 10https://gerrit.wikimedia.org/r/529056 [10:22:01] hauskatze, could you try to move https://de.wikiversity.org/w/index.php?curid=47279? [10:22:14] Urbanecm: on mwdebug or normal? [10:22:17] (03CR) 10Vgutierrez: [C: 03+2] ATS: Propagate config_prefix to trafficserver::lua_infra [puppet] - 10https://gerrit.wikimedia.org/r/529052 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [10:22:22] mwdebug1001 hauskatze [10:22:25] sure [10:22:26] (03PS2) 10Vgutierrez: ATS: Propagate config_prefix to trafficserver::lua_infra [puppet] - 10https://gerrit.wikimedia.org/r/529052 (https://phabricator.wikimedia.org/T221594) [10:23:09] MediaWiki internal error. 
[10:23:09] Original exception: [XUv3-wpAIHsAAGvQF9EAAABT] 2019-08-08 10:22:55: Fatal exception of type "Wikimedia\Assert\PostconditionException" [10:23:22] hmm [10:23:45] (03CR) 10Vgutierrez: [C: 03+1] profile::cache::ssl::wikibase: ensure dhparam creation [puppet] - 10https://gerrit.wikimedia.org/r/529056 (owner: 10Ema) [10:25:03] hauskatze, see https://de.wikiversity.org/wiki/Spezial:Logbuch/contentmodel [10:25:34] heh, was trying that as well :) [10:26:28] Urbanecm: I've managed to edit the page [10:26:32] I'll try to move it now [10:26:42] ok [10:27:52] I'm pretty sure we would be able to delete the page now hauskatze [10:27:58] but not sure if we need it [10:28:23] Urbanecm apergos https://de.wikiversity.org/wiki/Broken/T207626 [10:28:25] now there [10:28:33] and after that... incinerator!! [10:28:44] hauskatze, wonderful [10:29:17] Done [10:29:26] so whole dewikiversity is done now? hauskatze [10:29:29] checking mysql just in case [10:30:17] select count(*) from page where page_namespace=2600; [10:30:19] 0 [10:30:22] success I guess [10:30:43] if apergos verifies it's all okay from his end then we can call it really a success [10:31:35] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good! If we were a Bay Area startup, we'd host this under puppetdb-migration.io and charge thousands of dollars for it :-)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/528906 (https://phabricator.wikimedia.org/T230002) (owner: 10Jbond) [10:31:54] let's do the same on urwikibooks [10:32:01] should be possible [10:32:38] it's deleted? 
[10:32:51] should be apergos [10:32:55] wikiadmin@10.64.16.7(dewikiversity)> select * from page where page_id = 47279; [10:32:55] Empty set (0.00 sec) [10:33:09] hauskatze did a delete there [10:33:40] a bit of change content model, edit, move and delete fixed it [10:33:48] not elegant but it worked :) [10:34:28] so when you do the next 'change content model' [10:34:35] please stop, tell me what you did, and let me inspect [10:34:41] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/529045 (owner: 10Muehlenhoff) [10:34:57] then we can see exactly what's impacted and maybe be able to update those instructions [10:35:25] apergos: not doing anything atm myself [10:35:32] I ran abstracts over a page range including the bad page and it ran to completion [10:35:50] for the record, I've added "$wgContentHandlers["flow-board"] = $wgContentHandlers["wikitext"];" to mwdebug1001, which enabled hauskatze to move/delete/whatever the page via the web [10:35:58] ahhhhh [10:36:00] that's the piece [10:36:20] Urbanecm: so the content model change was not really required? [10:36:26] I don't think so [10:36:28] (03CR) 10Jbond: "> Patch Set 4: Code-Review+1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/528906 (https://phabricator.wikimedia.org/T230002) (owner: 10Jbond) [10:36:41] so... set extra namespace and $wgContentHandlers [10:36:46] but now if anyone ever tries to undelete that, it will probably fail [10:36:51] hopefully they never do [10:37:01] true [10:37:26] it just had one revision [10:37:39] (but if it fails before undeletion completed, does it matter?) [10:38:11] I just mean if anyone ever wants any of these pages back... [10:40:22] yeah, not ideal :/ [10:40:32] they could theoretically copy the content [10:40:34] (at least would work) [10:41:35] As far as I can see, the page we deleted was just some test edits. 
[10:42:12] the other ones, I don't know how useful they were [10:42:16] on urwikibooks I mean [10:42:30] hauskatze, let me know once I can put things back [10:43:11] !log installing exim4 security updates on buster hosts (our exim config is not vulnerable) [10:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:36] Urbanecm: I think you can now. dewikiversity is done. [10:43:51] (03PS1) 10Giuseppe Lavagetto: envoy: several fixes [puppet] - 10https://gerrit.wikimedia.org/r/529060 [10:43:54] if you want to go with urwikibooks I have time [10:44:04] <_joe_> ema: ^^ [10:44:27] hauskatze, go forward, I'll simply revert later :) [10:44:49] <_joe_> Urbanecm: are you !log ging any local modifications on mwdebug1001? if not please do from now on :) [10:44:49] apergos: okay from you to do the same on urwikibooks? [10:45:02] <_joe_> (and log when you run scap pull to restore state at the end) [10:45:02] sec [10:45:10] waiting [10:45:19] ok _joe_ [10:45:22] <_joe_> or apergos, whoever's doing things :) [10:45:23] (03CR) 10jerkins-bot: [V: 04-1] envoy: several fixes [puppet] - 10https://gerrit.wikimedia.org/r/529060 (owner: 10Giuseppe Lavagetto) [10:45:26] <_joe_> thanks! [10:45:34] the rev lengths are all tiny so yes, go ahead [10:45:55] Okay. Doing urwikibooks now on mwdebug1001 [10:46:32] !log Set $wgContentHandlers["flow-board"] = $wgContentHandlers["wikitext"]; locally on mwdebug1001 to fix few bad pages (T207627) [10:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:41] T207627: Disable unused Flow extension on ur.wikibooks - https://phabricator.wikimedia.org/T207627 [10:47:17] <_joe_> wtf? [10:48:04] _joe_, is that a reaction to the log statement? 
[10:48:14] <_joe_> no to jenkins, lol sorry :D [10:48:27] heh [10:48:28] <_joe_> jenkins just did something it shouldn't and I'm disappointed [10:48:34] aha, thanks _joe_ [10:48:50] I understand why it's called 'je*rk*ins' now :P [10:49:29] (03PS1) 10Effie Mouzeli: hieradata: enable php72_only on mw122[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/529061 (https://phabricator.wikimedia.org/T219150) [10:50:40] let me know when I should test [10:50:46] jouncebot now [10:50:46] No deployments scheduled for the next 0 hour(s) and 9 minute(s) [10:50:51] jouncebot next [10:50:51] In 0 hour(s) and 9 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190808T1100) [10:50:52] Urbanecm: apergos - so I've moved the first page on urwikibooks [10:50:57] https://ur.wikibooks.org/wiki/Broken/Id-2687 [10:51:03] then changed the content model [10:51:09] (it was flow-board) [10:51:17] to unformatted-text [10:51:26] PROBLEM - puppet last run on centrallog1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[exim4-config] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:51:31] now if they want to restore the page somewhat, they'll be able to [10:52:23] Yup. 
No exceptions: https://ur.wikibooks.org/w/index.php?title=%D8%AE%D8%A7%D8%B5:%D8%A8%D8%AD%D8%A7%D9%84&target=Broken%2FId-2687&timestamp=20130702083347 [10:52:28] go ahead and do the batch, I'll be testing all of them [10:52:30] at once [10:52:31] Doing the same with the rest of them [10:52:49] I hope I can finish before the start of the SWAT [10:53:20] !log Disable puppet, depool and pool mw1221, mw1222, mw1223 for 529061 [10:53:24] (03PS5) 10Jbond: puppetmaster upgrade: add a lua filter to remove the job_id [puppet] - 10https://gerrit.wikimedia.org/r/528906 (https://phabricator.wikimedia.org/T230002) [10:53:25] if not, we'd simply ask them to wait a while with the window hauskatze :) [10:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:03] Moving done, and content model change as well [10:56:06] I'll delete now [10:56:22] (03CR) 10Jbond: [C: 03+2] puppetmaster upgrade: add a lua filter to remove the job_id [puppet] - 10https://gerrit.wikimedia.org/r/528906 (https://phabricator.wikimedia.org/T230002) (owner: 10Jbond) [10:57:20] Done. [10:57:29] (03CR) 10Effie Mouzeli: "https://puppet-compiler.wmflabs.org/compiler1002/17810/" [puppet] - 10https://gerrit.wikimedia.org/r/529061 (https://phabricator.wikimedia.org/T219150) (owner: 10Effie Mouzeli) [10:58:23] Checked the deleted content to see if it returned an exception. It did not. [10:58:32] you've deleted all of the bad pages, yes? [10:58:49] apergos: yes, after moving and changing the content model to wikitext [10:58:59] I see they are gone from the page table [10:59:10] I will run a full abstracts dump here with histrical revisions included [10:59:13] they were all moved to Broken/Id-$page_id [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor I � Unicode. All rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190808T1100). 
[11:00:05] tarrow: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:10] o/ [11:00:19] Please hold SWAT for a minute [11:00:20] hauskatze: still busy or can we go ahead? [11:00:22] ok [11:00:32] danke und sorry [11:00:39] no problem, only two patches [11:00:55] I have run both it and stubs for urwikibooks, ran to completion, no exceptions [11:00:58] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: enable php72_only on mw122[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/529061 (https://phabricator.wikimedia.org/T219150) (owner: 10Effie Mouzeli) [11:01:03] giving the thumbs up [11:01:08] apergos: so everything's okay? [11:01:09] ack [11:01:12] from my end yes [11:01:17] (03PS2) 10Effie Mouzeli: hieradata: enable php72_only on mw122[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/529061 (https://phabricator.wikimedia.org/T219150) [11:01:22] o/ [11:01:25] Urbanecm: I think you can revert mwdebug1001 now [11:01:35] to the state it's suposed to be [11:01:39] no hurry, we're kicking jenkins right now anyway [11:02:04] poor jenkins [11:02:40] darn - outside mwdebug 1001 trying to see the deleted revids returns mw exception [11:02:43] at least you're not ... 
[11:02:46] kicking it while it's down :-P [11:04:17] :D [11:05:08] if someone updates the wiki page with the added steps, that would be awesome [11:05:30] then maybe we will get lucky about this in the future (at least have the bad data hidden in the archive table :-P) [11:05:32] hauskatze, ok, will do [11:05:53] !log Run scap pull on mwdebug1001 to revert local modifications (T207627) [11:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:02] T207627: Disable unused Flow extension on ur.wikibooks - https://phabricator.wikimedia.org/T207627 [11:06:30] apergos: so the revisions of those pages before changing the content model are unreachable via UI but given that flow-board content model is not known to the wiki anymore I guess that's expected [11:06:40] (03PS1) 10Muehlenhoff: Import cas-overlay-template [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/529063 [11:06:57] yes [11:07:06] perfect then [11:07:18] I guess SWAT can start now? [11:07:20] at some point there will be a default content handler for these cases [11:07:28] Urbanecm: please let Lucas_WMDE know when they can start swat [11:07:54] I'd say we're done&ready for SWAT [11:08:02] Lucas_WMDE, you can deploy the patches now [11:08:06] okay [11:08:12] tarrow: do you want to deploy them or should I do it? [11:08:21] thanks very much for the fixups! [11:08:33] (thanks for the ping btw) [11:08:34] Lucas_WMDE: I think we'll depoly them :) [11:08:39] okay :) [11:08:44] go ahead [11:08:48] I'm showing jakob_WMDE [11:09:14] we're still waiting on the second patch to pass Jenkins gate on master because of flaky browser tests... :/ [11:09:19] doing the first one now though [11:09:33] (03PS1) 10Jbond: puppetdb: enable lua filter to munge puppet reports [puppet] - 10https://gerrit.wikimedia.org/r/529064 (https://phabricator.wikimedia.org/T230002) [11:09:42] Thanks for the help @all, and sorry for halting the SWAT for some minutes. 
[11:11:17] hauskatze, I've tried to add the required steps to https://wikitech.wikimedia.org/w/index.php?title=Flow&diff=1834974&oldid=1786777 [11:11:37] (03PS1) 10Marostegui: wmnet: Decrease m5-master TTL to 1M [dns] - 10https://gerrit.wikimedia.org/r/529065 (https://phabricator.wikimedia.org/T229657) [11:11:55] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Import cas-overlay-template [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/529063 (owner: 10Muehlenhoff) [11:12:18] Urbanecm: looks good to me [11:12:30] It seems that'd cover MCR [11:12:39] (03PS1) 10Muehlenhoff: Add .git-review file [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/529066 [11:13:42] we'll find out the nex time :-) [11:13:45] thanks again! [11:14:04] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: site={codfw,eqsin,ulsfo} https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [11:14:10] I've also added a note that the wgContentHandlers change shouldn't be required once T220608 is done [11:14:10] T220608: Introduce UnknownContentHandler and UnknownContent - https://phabricator.wikimedia.org/T220608 [11:14:57] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Add .git-review file [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/529066 (owner: 10Muehlenhoff) [11:15:08] great [11:15:59] (03CR) 10Jbond: [C: 03+2] puppetdb: enable lua filter to munge puppet reports [puppet] - 10https://gerrit.wikimedia.org/r/529064 (https://phabricator.wikimedia.org/T230002) (owner: 10Jbond) [11:17:07] Was anyone else hoping to SWAT anything? 
Looks like we might be waiting a little bit for jenkins [11:18:12] tarrow, if you're waiting on jenkins, I have a patch to deploy [11:19:24] RECOVERY - puppet last run on centrallog1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [11:20:28] Urbanecm: go for it [11:20:32] ok, doing [11:20:44] can you ping me when you are done [11:20:53] (03CR) 10Urbanecm: [C: 03+2] Add Hubblesite.org and Spacetelescope.org to commons wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528975 (https://phabricator.wikimedia.org/T230083) (owner: 10DannyS712) [11:20:56] certainly [11:21:39] thanks! (by the was we are waiting for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/529055) but I assume you aren't trying to backport Wikibase [11:22:08] no, I'm doing a config patch [11:22:12] 10Operations, 10serviceops, 10Performance-Team (Radar), 10User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (10jijiki) [11:22:16] (03Merged) 10jenkins-bot: Add Hubblesite.org and Spacetelescope.org to commons wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528975 (https://phabricator.wikimedia.org/T230083) (owner: 10DannyS712) [11:22:32] (03CR) 10jenkins-bot: Add Hubblesite.org and Spacetelescope.org to commons wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528975 (https://phabricator.wikimedia.org/T230083) (owner: 10DannyS712) [11:23:16] (03CR) 10Ema: "There seems to be a problem with pep8:" [puppet] - 10https://gerrit.wikimedia.org/r/529060 (owner: 10Giuseppe Lavagetto) [11:23:54] (03PS9) 10Urbanecm: Add hd variations for zhwikiource project logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527773 (owner: 10Viztor) [11:23:57] (03PS1) 10Muehlenhoff: Sync with upstream 6.0 branch [software/cas-overlay-template] - 
10https://gerrit.wikimedia.org/r/529067 [11:24:02] (03CR) 10Urbanecm: [C: 03+2] Add hd variations for zhwikiource project logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527773 (owner: 10Viztor) [11:24:07] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 9a4494a: Add Hubblesite.org and Spacetelescope.org to commons wgCopyUploadsDomains (T230083) (duration: 00m 57s) [11:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:15] T230083: Add Hubblesite.org and Spacetelescope.org to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T230083 [11:24:40] curl 'http://wdqs1005.eqiad.wmnet' on stat1007 times out for me. Did we change something? [11:25:13] Amir1, aren't stats host firewalled nowadays? [11:25:13] (03Merged) 10jenkins-bot: Add hd variations for zhwikiource project logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527773 (owner: 10Viztor) [11:25:15] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Sync with upstream 6.0 branch [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/529067 (owner: 10Muehlenhoff) [11:25:28] (03CR) 10jenkins-bot: Add hd variations for zhwikiource project logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/527773 (owner: 10Viztor) [11:25:31] but this is internal [11:25:51] (03PS3) 10Muehlenhoff: Fix comparison of 'ops' groups in daily account check [puppet] - 10https://gerrit.wikimedia.org/r/529045 [11:26:16] https://lists.wikimedia.org/pipermail/analytics/2019-July/006648.html says "will cause any non-localhost or known traffic to be blocked by default" [11:26:23] not sure if that host is known traffic [11:26:34] but it's certainly not localhost traffic [11:27:19] (03CR) 10Muehlenhoff: [C: 03+2] Fix comparison of 'ops' groups in daily account check [puppet] - 10https://gerrit.wikimedia.org/r/529045 (owner: 10Muehlenhoff) [11:27:58] Interesting [11:28:21] !log urbanecm@deploy1001 Synchronized 
static/images/project-logos/: SWAT: be886ad: Add hd variations for zhwikiource project logo (T229715) (duration: 00m 56s) [11:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:29] T229715: Set up 2x logo for Chinese Wikisouce - https://phabricator.wikimedia.org/T229715 [11:30:08] elukey: hey, this is causing issue for our graphs :( T214894 [11:30:09] T214894: Grafana Datamodel References dashboard broken (daily data) - https://phabricator.wikimedia.org/T214894 [11:30:11] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: be886ad: Add hd variations for zhwikiource project logo (T229715) (duration: 00m 55s) [11:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:29] !log Purge https://en.wikipedia.org/static/images/project-logos/zhwikisource.png (T229715) [11:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:59] (03PS1) 10Muehlenhoff: cas: Switch overlay repo to internally forked version [puppet] - 10https://gerrit.wikimedia.org/r/529069 [11:36:04] (03PS1) 10Urbanecm: HD logo for enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529070 (https://phabricator.wikimedia.org/T230114) [11:36:16] Lucas_WMDE: should I +2 the second backport now. I fear that it's too late to make it through the gate before the end of the window [11:36:23] (03CR) 10Urbanecm: [C: 03+2] HD logo for enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529070 (https://phabricator.wikimedia.org/T230114) (owner: 10Urbanecm) [11:36:39] Or should I just defer it until tonight? [11:36:43] tarrow, I'm not Lucas_WMDE, but feel free to +2 the second backport [11:36:46] it won't do any harm [11:36:54] Urbanecm: ok! 
[11:37:06] tarrow: go ahead [11:37:07] (03CR) 10jerkins-bot: [V: 04-1] cas: Switch overlay repo to internally forked version [puppet] - 10https://gerrit.wikimedia.org/r/529069 (owner: 10Muehlenhoff) [11:37:09] (03CR) 10jerkins-bot: [V: 04-1] HD logo for enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529070 (https://phabricator.wikimedia.org/T230114) (owner: 10Urbanecm) [11:37:11] jouncebot: next [11:37:12] In 4 hour(s) and 22 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190808T1600) [11:37:20] (03CR) 10jerkins-bot: [V: 04-1] HD logo for enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529070 (https://phabricator.wikimedia.org/T230114) (owner: 10Urbanecm) [11:37:32] Urbanecm: you know that it might not be done before the end of the hour? [11:37:40] (03CR) 10Krinkle: [C: 03+1] mediawiki: Introduce startupregistrystats.pp to record RL modules registry [puppet] - 10https://gerrit.wikimedia.org/r/528526 (https://phabricator.wikimedia.org/T229836) (owner: 10Ladsgroup) [11:38:22] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM, thanks" [dns] - 10https://gerrit.wikimedia.org/r/529065 (https://phabricator.wikimedia.org/T229657) (owner: 10Marostegui) [11:38:23] tarrow, wdym? [11:38:42] I'm not sure how fast jenkins would be, honestly [11:39:06] (03PS2) 10Urbanecm: HD logo for enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529070 (https://phabricator.wikimedia.org/T230114) [11:39:26] Urbanecm: if it merges after the end of SWAT don't I end up leaving the branch "dirty" [11:39:35] like with undeployed patches [11:39:44] ah, then you should revert the change [11:39:49] tarrow: I think it’s okay to continue syncing it after the end of SWAT [11:39:54] (03CR) 10Holger Knust: "I'll be out next week and don't think this will be merged before then anyway." 
[software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) (owner: 10Holger Knust) [11:39:55] deployment calendar is free for hours afterwards [11:40:00] that's the second possibility :) [11:40:03] Lucas_WMDE: right! I see [11:40:08] (but !log extension of window then) [11:40:09] cool, that's fine then [11:40:13] (03PS2) 10Muehlenhoff: cas: Switch overlay repo to internally forked version [puppet] - 10https://gerrit.wikimedia.org/r/529069 [11:40:26] (03PS3) 10Urbanecm: HD logo for enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529070 (https://phabricator.wikimedia.org/T230114) [11:40:30] (03PS1) 10Ema: envoy: add security settings to systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/529071 [11:40:32] (03CR) 10Urbanecm: [C: 03+2] HD logo for enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529070 (https://phabricator.wikimedia.org/T230114) (owner: 10Urbanecm) [11:42:55] 10Operations, 10serviceops: Update component/php72 to 7.2.20 - https://phabricator.wikimedia.org/T230024 (10MoritzMuehlenhoff) Status: php7.2 currently fails to build on boron due to some build time hostname check which fails on boron, I still need to get to the bottom of that. 
[11:43:00] the first backport was merged and can be synced btw [11:43:05] (03Merged) 10jenkins-bot: HD logo for enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529070 (https://phabricator.wikimedia.org/T230114) (owner: 10Urbanecm) [11:43:30] * Urbanecm is syncing his last change [11:44:25] (03CR) 10Ema: [C: 03+1] ATS: Include TLS instance in cache upload role [puppet] - 10https://gerrit.wikimedia.org/r/513970 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [11:44:58] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: SWAT: dfeb2a9: HD logo for enwikivoyage (T230114) (duration: 00m 56s) [11:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:06] T230114: Create HIDPI logos for Wikivoyages - https://phabricator.wikimedia.org/T230114 [11:45:54] (03CR) 10Ema: [C: 03+1] Backport required OCSP commits [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/528716 (https://phabricator.wikimedia.org/T220383) (owner: 10Vgutierrez) [11:46:11] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: dfeb2a9: HD logo for enwikivoyage (T230114) (duration: 00m 56s) [11:46:15] (03CR) 10Ema: [C: 03+1] "No need to bump revision number twice, other than that LGTM." (031 comment) [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/528984 (https://phabricator.wikimedia.org/T228135) (owner: 10Vgutierrez) [11:46:16] tarrow, the machine is clear [11:46:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:23] Urbanecm: awesome! 
[11:46:25] Thanks :) [11:46:39] yw [11:47:35] (03PS3) 10Ema: profile::cache::ssl::wikibase: ensure dhparam creation [puppet] - 10https://gerrit.wikimedia.org/r/529056 [11:47:37] (03CR) 10jenkins-bot: HD logo for enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529070 (https://phabricator.wikimedia.org/T230114) (owner: 10Urbanecm) [11:50:46] (03CR) 10Ema: [C: 03+2] profile::cache::ssl::wikibase: ensure dhparam creation [puppet] - 10https://gerrit.wikimedia.org/r/529056 (owner: 10Ema) [11:52:04] (03PS1) 10Holger Knust: Add cassandra-table-properties tool to Cassandra deployments [puppet] - 10https://gerrit.wikimedia.org/r/529074 (https://phabricator.wikimedia.org/T226555) [11:57:37] (03PS1) 10MarcoAurelio: [WIP] profile::mediawiki::maintenance: purge_checkuser to use periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/529077 [11:58:22] (03PS2) 10MarcoAurelio: [WIP] profile::mediawiki::maintenance: purge_checkuser to use periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/529077 [11:59:50] (03CR) 10jerkins-bot: [V: 04-1] [WIP] profile::mediawiki::maintenance: purge_checkuser to use periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/529077 (owner: 10MarcoAurelio) [12:00:43] !log Running SWAT a little over time because late start and slow jenkins [12:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:31] (03PS2) 10Holger Knust: Add cassandra-table-properties tool to Cassandra deployments [puppet] - 10https://gerrit.wikimedia.org/r/529074 (https://phabricator.wikimedia.org/T226555) [12:01:38] !log tarrow@deploy1001 Synchronized php-1.34.0-wmf.17/extensions/Wikibase/: SWAT: [[gerrit:529055|Split ParserCache on Termbox (T228978)]] (duration: 01m 21s) [12:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:47] T228978: Unknown placeholder: entityViewPlaceholder-entitytermsview-entitytermsforlanguagelistview-class - https://phabricator.wikimedia.org/T228978 [12:02:14] 
RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [12:07:52] (03CR) 10CDanis: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/528719 (https://phabricator.wikimedia.org/T229262) (owner: 10Filippo Giunchedi) [12:15:14] !log tarrow@deploy1001 Synchronized php-1.34.0-wmf.17/extensions/Wikibase/: SWAT: [[gerrit:529059|Add hook to invalidate cache entries missing TermboxOption (T228978)]] (duration: 01m 14s) [12:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:22] T228978: Unknown placeholder: entityViewPlaceholder-entitytermsview-entitytermsforlanguagelistview-class - https://phabricator.wikimedia.org/T228978 [12:15:54] !log EU midday SWAT done [12:16:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:50] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: site=eqiad https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [12:20:54] (03PS3) 10MarcoAurelio: [WIP] profile::mediawiki::maintenance: purge_checkuser to use periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/529077 [12:21:47] (03PS1) 10Jbond: puppetdb - filter_job_id: only munge data we need to [puppet] - 10https://gerrit.wikimedia.org/r/529080 [12:22:23] (03CR) 10jerkins-bot: [V: 04-1] [WIP] profile::mediawiki::maintenance: purge_checkuser to use periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/529077 (owner: 10MarcoAurelio) [12:23:58] 'Widespread puppet agent failures' - these are all me, sorry for the noise but much better with the recent improvments [12:24:32] (03PS4) 10MarcoAurelio: [WIP] profile::mediawiki::maintenance: purge_checkuser to use periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/529077 [12:24:53] 
(03CR) 10Jbond: [C: 03+2] puppetdb - filter_job_id: only munge data we need to [puppet] - 10https://gerrit.wikimedia.org/r/529080 (owner: 10Jbond) [12:26:01] (03CR) 10jerkins-bot: [V: 04-1] [WIP] profile::mediawiki::maintenance: purge_checkuser to use periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/529077 (owner: 10MarcoAurelio) [12:26:46] 10Operations, 10LDAP, 10cloud-services-team (Kanban): LDAP: multiples accounts for Phamhi - https://phabricator.wikimedia.org/T230126 (10aborrero) p:05Triage→03High [12:29:04] (03PS5) 10MarcoAurelio: [WIP] profile::mediawiki::maintenance: purge_checkuser to use periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/529077 [12:31:18] (03CR) 10MarcoAurelio: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/529077 (owner: 10MarcoAurelio) [12:34:16] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: site=eqiad https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [12:34:53] (03CR) 10MarcoAurelio: [WIP] profile::mediawiki::maintenance: purge_checkuser to use periodic_job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/529077 (owner: 10MarcoAurelio) [12:35:36] (03PS6) 10MarcoAurelio: [WIP] profile::mediawiki::maintenance: purge_checkuser to use periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/529077 [12:39:47] (03PS2) 10Giuseppe Lavagetto: envoy: several fixes [puppet] - 10https://gerrit.wikimedia.org/r/529060 [12:39:49] (03PS1) 10Giuseppe Lavagetto: taskgen: automatically exclude python files with own tox.ini [puppet] - 10https://gerrit.wikimedia.org/r/529083 [12:41:22] (03CR) 10Ema: [C: 03+1] taskgen: automatically exclude python files with own tox.ini [puppet] - 10https://gerrit.wikimedia.org/r/529083 (owner: 10Giuseppe Lavagetto) [12:42:02] (03PS7) 10MarcoAurelio: [WIP] profile::mediawiki::maintenance: purge_checkuser to use periodic_job [puppet] - 10https://gerrit.wikimedia.org/r/529077 
[12:42:36] (03CR) 10Ema: [C: 03+1] envoy: several fixes [puppet] - 10https://gerrit.wikimedia.org/r/529060 (owner: 10Giuseppe Lavagetto) [12:42:44] (03CR) 10Giuseppe Lavagetto: [C: 03+2] taskgen: automatically exclude python files with own tox.ini [puppet] - 10https://gerrit.wikimedia.org/r/529083 (owner: 10Giuseppe Lavagetto) [12:43:48] (03PS1) 10Jbond: puppetdb - filter_job_id: add some debug [puppet] - 10https://gerrit.wikimedia.org/r/529085 [12:44:00] (03CR) 10MarcoAurelio: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/529077 (owner: 10MarcoAurelio) [12:44:06] (03PS2) 10Jbond: puppetdb - filter_job_id: add some debug [puppet] - 10https://gerrit.wikimedia.org/r/529085 [12:45:07] (03CR) 10CDanis: "If you're curious about the Puppet side of things that generates these JSON files, see https://gerrit.wikimedia.org/r/plugins/gitiles/oper" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528938 (https://phabricator.wikimedia.org/T229631) (owner: 10CDanis) [12:45:32] (03CR) 10Jbond: [C: 03+2] puppetdb - filter_job_id: add some debug [puppet] - 10https://gerrit.wikimedia.org/r/529085 (owner: 10Jbond) [12:47:41] (03PS3) 10Giuseppe Lavagetto: envoy: several fixes [puppet] - 10https://gerrit.wikimedia.org/r/529060 [12:47:53] (03CR) 10MarcoAurelio: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/253/" [puppet] - 10https://gerrit.wikimedia.org/r/529077 (owner: 10MarcoAurelio) [12:48:34] _joe_: do profile::mediawiki::periodic_job works for 'foreachwiki`? [12:49:05] <_joe_> hauskatze: uh I think so, but I'd have to check, I'm in the middle of something [12:49:19] <_joe_> lemme get back to you later [12:49:25] _joe_: alright. 
I'm working on a WIP patch to migrate a cron to use that [12:49:44] sure, I'm taking a nap so have all the time you need :) [12:50:06] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/529077/ fwiw [12:50:09] <_joe_> don't tell that to jenkins, he might take it even easier than he is right now [12:50:14] (03CR) 10Giuseppe Lavagetto: [C: 03+2] envoy: several fixes [puppet] - 10https://gerrit.wikimedia.org/r/529060 (owner: 10Giuseppe Lavagetto) [12:51:08] (03PS2) 10Giuseppe Lavagetto: envoy: add security settings to systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/529071 (owner: 10Ema) [12:52:45] (03CR) 10Giuseppe Lavagetto: [C: 03+2] envoy: add security settings to systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/529071 (owner: 10Ema) [12:58:22] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/528826 (owner: 10Elukey) [12:58:30] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/528827 (owner: 10Elukey) [13:02:13] (03PS1) 10Jbond: puppetmaster1003 - add neodymium back as a canary server [puppet] - 10https://gerrit.wikimedia.org/r/529087 [13:02:32] (03CR) 10Marostegui: [C: 03+2] wmnet: Decrease m5-master TTL to 1M [dns] - 10https://gerrit.wikimedia.org/r/529065 (https://phabricator.wikimedia.org/T229657) (owner: 10Marostegui) [13:02:48] (03PS1) 10Giuseppe Lavagetto: envoyproxy: further small fixes [puppet] - 10https://gerrit.wikimedia.org/r/529088 [13:03:45] (03CR) 10Jbond: [C: 03+2] puppetmaster1003 - add neodymium back as a canary server [puppet] - 10https://gerrit.wikimedia.org/r/529087 (owner: 10Jbond) [13:04:20] (03CR) 10jerkins-bot: [V: 04-1] envoyproxy: further small fixes [puppet] - 10https://gerrit.wikimedia.org/r/529088 (owner: 10Giuseppe Lavagetto) [13:06:12] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:08:34] 10Operations, 10Puppet: clean up systemd::timer::job logging basedir mess - https://phabricator.wikimedia.org/T230127 (10CDanis) [13:09:13] !log Drop table math from s8 T196055 [13:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:23] T196055: Remove table `math` from the database - https://phabricator.wikimedia.org/T196055 [13:12:35] (03PS2) 10Giuseppe Lavagetto: envoyproxy: further small fixes [puppet] - 10https://gerrit.wikimedia.org/r/529088 [13:14:18] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/17812/vega.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/529088 (owner: 10Giuseppe Lavagetto) [13:14:29] (03PS3) 10Giuseppe Lavagetto: envoyproxy: further small fixes [puppet] - 10https://gerrit.wikimedia.org/r/529088 [13:14:32] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] envoyproxy: further small fixes [puppet] - 10https://gerrit.wikimedia.org/r/529088 (owner: 10Giuseppe Lavagetto) [13:15:21] (03PS3) 10Elukey: admin: add the gpu-users group [puppet] - 10https://gerrit.wikimedia.org/r/528826 [13:16:29] 10Operations, 10Puppet: clean up systemd::timer::job logging basedir mess - https://phabricator.wikimedia.org/T230127 (10MarcoAurelio) I'm seeing this behaviour on https://puppet-compiler.wmflabs.org/compiler1002/253/: `File[/var/log/mediawiki/mediawiki_job_mediawiki_purge_checkuser/mediawiki_job_mediawiki_pur... 
[13:18:28] cdanis: part of that class was done by me, it is something to improve yes :( [13:18:36] (I am referring to ---^) [13:18:58] (03CR) 10Elukey: [C: 03+2] admin: add the gpu-users group [puppet] - 10https://gerrit.wikimedia.org/r/528826 (owner: 10Elukey) [13:19:13] (03PS3) 10Elukey: Add gpu-users to stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/528827 [13:19:29] (03CR) 10Elukey: [C: 03+2] Add gpu-users to stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/528827 (owner: 10Elukey) [13:19:41] elukey: item #452 on the list of "weird unfortunate interactions in our puppet codebase" 🙃 [13:22:45] (03PS1) 10Jbond: puppetmaster1003: add canary hosts back and remove debug logging [puppet] - 10https://gerrit.wikimedia.org/r/529095 (https://phabricator.wikimedia.org/T228657) [13:23:43] 10Operations, 10Puppet: clean up systemd::timer::job logging basedir mess - https://phabricator.wikimedia.org/T230127 (10CDanis) Actually @MarcoAurelio that looks fine -- the $title is only repeated twice, not thrice. profile::mediawiki::periodic_job sets logging_basedir so the truly pathological behavior doe... 
[13:23:50] (03PS2) 10Jbond: puppetmaster1003: add canary hosts back and remove debug logging [puppet] - 10https://gerrit.wikimedia.org/r/529095 (https://phabricator.wikimedia.org/T228657) [13:26:02] (03CR) 10Jbond: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/528719 (https://phabricator.wikimedia.org/T229262) (owner: 10Filippo Giunchedi) [13:26:35] (03CR) 10Jbond: [C: 03+2] puppetmaster1003: add canary hosts back and remove debug logging [puppet] - 10https://gerrit.wikimedia.org/r/529095 (https://phabricator.wikimedia.org/T228657) (owner: 10Jbond) [13:34:08] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [13:35:22] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [13:41:04] (03PS1) 10CDanis: fetch_dbconfig: sort output for easier reading [puppet] - 10https://gerrit.wikimedia.org/r/529100 [13:42:23] (03CR) 10CDanis: [C: 03+2] fetch_dbconfig: sort output for easier reading [puppet] - 10https://gerrit.wikimedia.org/r/529100 (owner: 10CDanis) [13:47:35] (03CR) 10CDanis: [V: 03+2 C: 03+2] fetch_dbconfig: sort output for easier reading [puppet] - 10https://gerrit.wikimedia.org/r/529100 (owner: 10CDanis) [13:50:01] (03PS1) 10Elukey: admin: add analytics-admins and ops to gpu-users [puppet] - 10https://gerrit.wikimedia.org/r/529101 [13:52:37] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/17813/stat1005.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/529101 (owner: 10Elukey) [13:57:37] (03PS2) 10Elukey: admin: add analytics-admins and ops to gpu-users [puppet] - 10https://gerrit.wikimedia.org/r/529101 [14:01:03] 10Operations, 10DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (10Marostegui) [14:04:20] 10Operations, 10ops-eqiad, 
10cloud-services-team: (OoW) cloudvirt1006 - RAID battery failed - https://phabricator.wikimedia.org/T222950 (10Cmjohnson) I received what I think may be the battery. It doesn't look like any of the batteries we've had in the past. I will need to crack this open and verify that it... [14:04:58] 10Operations, 10ops-codfw, 10DBA, 10decommission: Decommission db2069 - https://phabricator.wikimedia.org/T230107 (10Cmjohnson) [14:21:22] (03CR) 10Ema: [C: 04-1] vcl: add Access-Control-Allow-Origin to mobile redirects (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/526627 (https://phabricator.wikimedia.org/T229385) (owner: 10Lucas Werkmeister (WMDE)) [14:25:21] (03CR) 10Lucas Werkmeister (WMDE): vcl: add Access-Control-Allow-Origin to mobile redirects (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/526627 (https://phabricator.wikimedia.org/T229385) (owner: 10Lucas Werkmeister (WMDE)) [14:25:45] (03PS3) 10Lucas Werkmeister (WMDE): vcl: add Access-Control-Allow-Origin to mobile redirects [puppet] - 10https://gerrit.wikimedia.org/r/526627 (https://phabricator.wikimedia.org/T229385) [14:27:37] (03PS4) 10Lucas Werkmeister (WMDE): vcl: add Access-Control-Allow-Origin to mobile redirects [puppet] - 10https://gerrit.wikimedia.org/r/526627 (https://phabricator.wikimedia.org/T229385) [14:30:51] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133 - https://phabricator.wikimedia.org/T229657 (10Marostegui) [14:31:50] 10Operations, 10Continuous-Integration-Config: Fix operations/puppet.git "rebase hell" - https://phabricator.wikimedia.org/T224033 (10CDanis) We unfortunately did not discuss this during the SRE summit. Here is my two lepta: * The current situation of ff-only is both not as safe as it seems, and often create... 
[14:32:46] 10Operations, 10DBA, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): Switchover m5 primary master: db1073 to db1133 - https://phabricator.wikimedia.org/T229657 (10Marostegui) [14:34:23] (03CR) 10Thcipriani: "A idea: instead of storing the mtime in the cache file itself you could set the mtime on the cache file to be the same as the InitialiseSe" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528447 (https://phabricator.wikimedia.org/T217830) (owner: 10Krinkle) [14:41:43] (03CR) 10Bstorm: "Looks like the maintain-views script just needs the --clean option added to it (in addition to --replace-all and --all-databases). This t" [puppet] - 10https://gerrit.wikimedia.org/r/528724 (https://phabricator.wikimedia.org/T196055) (owner: 10Marostegui) [14:41:48] (03CR) 10Bstorm: [C: 03+1] maintain-views.yaml: Remove math table [puppet] - 10https://gerrit.wikimedia.org/r/528724 (https://phabricator.wikimedia.org/T196055) (owner: 10Marostegui) [14:50:04] 10Operations, 10ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T224794 (10Cmjohnson) Replaced the disk at slot 7. Letting that rebuild and will replace the second failed disk [14:51:36] PROBLEM - Check systemd state on elastic1046 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:52:00] !log continue cr1-codfw:re1 replacement - T226422 [14:52:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:10] T226422: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422 [14:53:20] * onimisionipe is looking at elastic [14:56:17] extending downtime [15:00:42] (03PS5) 10Ema: vcl: add Access-Control-Allow-Origin to mobile redirects [puppet] - 10https://gerrit.wikimedia.org/r/526627 (https://phabricator.wikimedia.org/T229385) (owner: 10Lucas Werkmeister (WMDE)) [15:02:06] Lucas_WMDE: I've amended your patch to also test that the header is set, merging shortly [15:03:34] Lucas_WMDE_: in case you haven't seen my previous message, I've amended your patch to also test that the header is set, merging shortly. Tests can be executed with ./modules/varnish/files/tests/run.sh FTR :) [15:04:07] (03CR) 10Ema: [C: 03+2] vcl: add Access-Control-Allow-Origin to mobile redirects [puppet] - 10https://gerrit.wikimedia.org/r/526627 (https://phabricator.wikimedia.org/T229385) (owner: 10Lucas Werkmeister (WMDE)) [15:07:15] 10Operations, 10ops-eqiad, 10DC-Ops, 10Wikimedia-Logstash, and 2 others: Decommission old eqiad logstash hardware hosts logstash100[456] - https://phabricator.wikimedia.org/T217556 (10Cmjohnson) a:05Cmjohnson→03Jclark-ctr @Jclark-ctr Please wipe logstash1004 and 1005 and then remove from rack and updat... 
[15:08:01] Operations, ops-eqiad, DC-Ops, Wikimedia-Logstash, and 2 others: Decommission old eqiad logstash hardware hosts logstash100[456] - https://phabricator.wikimedia.org/T217556 (Cmjohnson)
[15:11:03] !log commit synchronize on cr1-codfw - T226422
[15:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:11] T226422: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422
[15:13:19] !log rebooting fermium (lists) for security updates
[15:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:51] Operations, ops-eqiad, DC-Ops: backup1001 can't address the disk shelf's drives - https://phabricator.wikimedia.org/T227335 (Cmjohnson) This is odd, I am not getting a link light on the raid controller connections.
[15:18:43] Operations, ops-eqiad, cloud-services-team (Kanban): Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T229156 (Cmjohnson) The ticket was approved. the new ssd should arrive today or tomorrow
[15:19:45] (PS3) Cwhite: add conniecc1 to analytics-(wmde|privatedata)-users,researchers group [puppet] - https://gerrit.wikimedia.org/r/525578 (https://phabricator.wikimedia.org/T228447) (owner: Herron)
[15:21:41] (CR) Thcipriani: [C: +1] ">" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/528447 (https://phabricator.wikimedia.org/T217830) (owner: Krinkle)
[15:22:04] (PS4) Cwhite: add conniecc1 to analytics-(wmde|privatedata)-users,researchers group [puppet] - https://gerrit.wikimedia.org/r/525578 (https://phabricator.wikimedia.org/T228447) (owner: Herron)
[15:23:11] Critical Alert for device cr1-codfw.wikimedia.org - Juniper alarm active
[15:23:50] Operations, ops-eqiad: Degraded RAID on sulfur - https://phabricator.wikimedia.org/T229134 (Cmjohnson) This doesn't really tell me anything about the bad disk. I am not able to ssh into the host for more details.
I will create a ticket and hope that there is something Dell can use in their TSR
[15:27:48] (CR) Cwhite: [C: +2] add conniecc1 to analytics-(wmde|privatedata)-users,researchers group [puppet] - https://gerrit.wikimedia.org/r/525578 (https://phabricator.wikimedia.org/T228447) (owner: Herron)
[15:29:43] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003 and notebook1004] and groups for cchen - https://phabricator.wikimedia.org/T228447 (colewhite) Open→Resolved a:colewhite
[15:31:25] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003 and notebook1004] and groups for cchen - https://phabricator.wikimedia.org/T228447 (colewhite) Group membership change has been deployed. Please feel free to reopen if you enc...
[15:38:07] (PS1) Ema: Add class base::policy_rcd_not_allowed [puppet] - https://gerrit.wikimedia.org/r/529109
[15:40:52] (PS3) Jbond: cas: Switch overlay repo to internally forked version [puppet] - https://gerrit.wikimedia.org/r/529069 (owner: Muehlenhoff)
[15:41:55] (CR) jerkins-bot: [V: -1] cas: Switch overlay repo to internally forked version [puppet] - https://gerrit.wikimedia.org/r/529069 (owner: Muehlenhoff)
[15:43:16] (PS4) Jbond: cas: Switch overlay repo to internally forked version [puppet] - https://gerrit.wikimedia.org/r/529069 (owner: Muehlenhoff)
[15:47:17] can ignore ^
[15:49:09] !log set virtual-chassis vcp-snmp-statistics to all VC - T228824
[15:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:18] T228824: Add VCP stats monitoring - https://phabricator.wikimedia.org/T228824
[15:49:53] (CR) Jbond: [C: +2] cas: Switch overlay repo to internally forked version [puppet] - https://gerrit.wikimedia.org/r/529069 (owner: Muehlenhoff)
[15:50:01] (PS5) Jbond: cas: Switch overlay repo to
internally forked version [puppet] - https://gerrit.wikimedia.org/r/529069 (owner: Muehlenhoff)
[15:50:13] I really feel like i should be involved in this project since it pings me all of the time :)
[15:50:40] lol sorry i keep forgeting it triggers you
[15:50:57] hehe it's okay :)
[16:00:04] godog and _joe_: I, the Bot under the Fountain, allow thee, The Deployer, to do Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190808T1600).
[16:00:04] Smalyshev: A patch you scheduled for Puppet SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:00:32] Operations, Patch-For-Review: puppetdb queue size went up since July 30 - https://phabricator.wikimedia.org/T230002 (jbond) The lua hack seems to have worked. I have again updated the config to send some canary servers to puppetmaster1003. So far i have seen no errors in the puppetdb log. Further via...
[16:02:22] Operations, LDAP, cloud-services-team (Kanban): LDAP: multiples accounts for Phamhi - https://phabricator.wikimedia.org/T230126 (Phamhi) I prefer the first one 10:59 AM uid=phamhi,ou=people,dc=wikimedia,dc=org
[16:04:07] <_joe_> SMalyshev: I'm confused, can't gehel merge that for you? :)
[16:05:09] _joe_: yep, gehel should be able to do it, he just needs to find 5 minutes in his schedule :(
[16:05:32] <_joe_> SMalyshev: I'll merge and apply it tomorrow, I'm in a meeting
[16:12:38] Operations, netops, observability: Add VCP stats monitoring - https://phabricator.wikimedia.org/T228824 (ayounsi) Open→Resolved a:ayounsi We now have visibility on all VCPs; https://librenms.wikimedia.org/ports/ifType=vcp/format=list_basic/ They also benefit from the same alerting as regu...
[16:13:26] Operations, ops-codfw, netops: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422 (Papaul) Return information for bad SCB {F30000784}
[16:24:20] (PS6) Dzahn: parsoid::testing: allow switching parsoid/JS to parsoid/PHP [puppet] - https://gerrit.wikimedia.org/r/528976 (https://phabricator.wikimedia.org/T229363)
[16:31:37] (CR) Dzahn: [C: +2] parsoid::testing: allow switching parsoid/JS to parsoid/PHP [puppet] - https://gerrit.wikimedia.org/r/528976 (https://phabricator.wikimedia.org/T229363) (owner: Dzahn)
[16:33:02] Operations, Traffic, Performance, Performance-Team (Radar), User-notice: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (Elitre)
[16:33:17] Operations, Traffic, Performance, Performance-Team (Radar), User-notice: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (Elitre) Removing our tag then.
[16:34:55] subbu: the config file on scandium switched to the Parsoid/PHP URL now, that's ok, right
[16:40:48] !log fdans@deploy1001 Started deploy [analytics/refinery@cef01d3]: deploying analytics refinery
[16:40:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:47:48] (PS1) Dzahn: parsoid::testing: switch from Parsoid/PHP to Parsoid/JS [puppet] - https://gerrit.wikimedia.org/r/529118 (https://phabricator.wikimedia.org/T229363)
[16:47:50] (CR) Eevans: [C: -1] "Setting this -1 because until the package is in the repository, it cannot be merged."
[puppet] - https://gerrit.wikimedia.org/r/529074 (https://phabricator.wikimedia.org/T226555) (owner: Holger Knust)
[16:48:07] (PS3) Eevans: [WIP]: Add cassandra-table-properties tool to Cassandra deployments [puppet] - https://gerrit.wikimedia.org/r/529074 (https://phabricator.wikimedia.org/T226555) (owner: Holger Knust)
[16:49:42] (CR) Dzahn: [C: +2] parsoid::testing: switch from Parsoid/PHP to Parsoid/JS [puppet] - https://gerrit.wikimedia.org/r/529118 (https://phabricator.wikimedia.org/T229363) (owner: Dzahn)
[16:49:51] (PS2) Dzahn: parsoid::testing: switch from Parsoid/PHP to Parsoid/JS [puppet] - https://gerrit.wikimedia.org/r/529118 (https://phabricator.wikimedia.org/T229363)
[16:50:40] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[16:52:05] Operations, Wikimedia-Mailing-lists: New Mailing lists for AzWiki sysops - https://phabricator.wikimedia.org/T228542 (colewhite) a:colewhite
[16:52:48] Database read only? Hmm
[16:53:14] Operations, Wikimedia-Mailing-lists: New Mailing lists for AzWiki sysops - https://phabricator.wikimedia.org/T228542 (colewhite) The list has been created and the list administrators password has been emailed to azimoveldar085 and wertuose gmail addresses. Please let us know if you have trouble accessin...
[16:53:27] Operations, Wikimedia-Mailing-lists: New Mailing lists for AzWiki sysops - https://phabricator.wikimedia.org/T228542 (colewhite) Open→Resolved
[16:53:30] Getting a "Database locked" message on Simple
[16:55:13] !log ppchelko@deploy1001 Started deploy [changeprop/deploy@069d297]: Remove workaround for ORES not supporting eventgate events T228688
[16:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:55:21] T228688: Fix revision-score event production in change-prop after migration of revision-create to eventgate-main - https://phabricator.wikimedia.org/T228688
[16:55:41] (CR) Eevans: [C: -1] "Also, shouldn't this be under https://phabricator.wikimedia.org/T226553 (Install Cassandra table properties Debian package on Cassandra ho" [puppet] - https://gerrit.wikimedia.org/r/529074 (https://phabricator.wikimedia.org/T226555) (owner: Holger Knust)
[16:56:37] !log ppchelko@deploy1001 Finished deploy [changeprop/deploy@069d297]: Remove workaround for ORES not supporting eventgate events T228688 (duration: 01m 24s)
[16:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:59:17] (PS1) Dzahn: parsoid-testing: add both PHP and JS parsoid URLs to rt client config [puppet] - https://gerrit.wikimedia.org/r/529122 (https://phabricator.wikimedia.org/T229363)
[17:00:04] cscott, arlolra, subbu, halfak, and accraze: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / Parsoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190808T1700).
[17:00:19] (CR) Eevans: [C: -1] [WIP]: Add cassandra-table-properties tool to Cassandra deployments (1 comment) [puppet] - https://gerrit.wikimedia.org/r/529074 (https://phabricator.wikimedia.org/T226555) (owner: Holger Knust)
[17:00:28] (PS2) Dzahn: parsoid-testing: add both PHP and JS parsoid URLs to rt client config [puppet] - https://gerrit.wikimedia.org/r/529122 (https://phabricator.wikimedia.org/T229363)
[17:01:23] Operations, Traffic, Wikidata-Bridge-Sprint-3: Data Bridge can’t load entity data on mobile clients - https://phabricator.wikimedia.org/T229385 (Lydia_Pintscher) Open→Resolved \o/
[17:02:00] (CR) Dzahn: [C: +2] parsoid-testing: add both PHP and JS parsoid URLs to rt client config [puppet] - https://gerrit.wikimedia.org/r/529122 (https://phabricator.wikimedia.org/T229363) (owner: Dzahn)
[17:02:12] (PS3) Dzahn: parsoid-testing: add both PHP and JS parsoid URLs to rt client config [puppet] - https://gerrit.wikimedia.org/r/529122 (https://phabricator.wikimedia.org/T229363)
[17:06:04] (PS1) Giuseppe Lavagetto: envoy: use newer definitions for clusters, other fixes [puppet] - https://gerrit.wikimedia.org/r/529123
[17:06:09] Operations, Elasticsearch, SRE-tools, Discovery-Search (Current work): cookbook sre.elasticsearch.rolling-restart failed with cluster relforge - https://phabricator.wikimedia.org/T229807 (debt)
[17:06:29] (PS1) Ppchelko: Switch high-traffic jobs to eventgate.
[mediawiki-config] - https://gerrit.wikimedia.org/r/529124 (https://phabricator.wikimedia.org/T228705)
[17:07:25] (CR) jerkins-bot: [V: -1] envoy: use newer definitions for clusters, other fixes [puppet] - https://gerrit.wikimedia.org/r/529123 (owner: Giuseppe Lavagetto)
[17:09:39] (CR) Dzahn: [C: +2] Fix the update_parsoid.sh script for the Parsoid/PHP usecase [puppet] - https://gerrit.wikimedia.org/r/528980 (https://phabricator.wikimedia.org/T229858) (owner: Subramanya Sastry)
[17:09:43] (CR) Ppchelko: "Previous time we've had 2 issues:" [mediawiki-config] - https://gerrit.wikimedia.org/r/529124 (https://phabricator.wikimedia.org/T228705) (owner: Ppchelko)
[17:09:48] (PS2) Dzahn: Fix the update_parsoid.sh script for the Parsoid/PHP usecase [puppet] - https://gerrit.wikimedia.org/r/528980 (https://phabricator.wikimedia.org/T229858) (owner: Subramanya Sastry)
[17:12:09] (PS2) Giuseppe Lavagetto: envoy: use newer definitions for clusters, other fixes [puppet] - https://gerrit.wikimedia.org/r/529123
[17:13:11] Operations, SRE-Access-Requests: Requesting access to Mwmaint1002, Stat1007 for Abijeet Patro - https://phabricator.wikimedia.org/T230020 (colewhite)
[17:14:00] Operations, SRE-Access-Requests: Requesting access to Mwmaint1002, Stat1007 for Abijeet Patro - https://phabricator.wikimedia.org/T230020 (colewhite)
[17:16:25] !log fdans@deploy1001 Started deploy [analytics/refinery@cef01d3]: deploy analytics refinery, second attempt
[17:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:17:14] Operations, SRE-Access-Requests: Requesting access to Mwmaint1002, Stat1007 for Abijeet Patro - https://phabricator.wikimedia.org/T230020 (Dzahn) Hi @abi may i ask what the part you are planning to run on mwmaint servers will be? It's less common to see that requested related to "analytics and metrics"....
[17:17:32] (PS3) Giuseppe Lavagetto: envoy: use newer definitions for clusters, other fixes [puppet] - https://gerrit.wikimedia.org/r/529123
[17:18:37] (CR) Subramanya Sastry: "Oh, how did gerrit merge this even with a Depends-On tag?" [puppet] - https://gerrit.wikimedia.org/r/528980 (https://phabricator.wikimedia.org/T229858) (owner: Subramanya Sastry)
[17:19:02] (CR) Giuseppe Lavagetto: [V: +2 C: +2] envoy: use newer definitions for clusters, other fixes [puppet] - https://gerrit.wikimedia.org/r/529123 (owner: Giuseppe Lavagetto)
[17:21:07] !log add user jbond to network devices
[17:21:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:21:26] (PS9) Effie Mouzeli: profile::mediawiki::hhvm: default php to php7 on stretch [puppet] - https://gerrit.wikimedia.org/r/425027 (https://phabricator.wikimedia.org/T195392) (owner: Giuseppe Lavagetto)
[17:22:23] (PS4) Holger Knust: WIP: Add cassandra-table-properties tool to Cassandra deployments [puppet] - https://gerrit.wikimedia.org/r/529074 (https://phabricator.wikimedia.org/T226553)
[17:23:03] (CR) jerkins-bot: [V: -1] profile::mediawiki::hhvm: default php to php7 on stretch [puppet] - https://gerrit.wikimedia.org/r/425027 (https://phabricator.wikimedia.org/T195392) (owner: Giuseppe Lavagetto)
[17:23:29] (CR) Holger Knust: "Add WIP and changed the task #" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/529074 (https://phabricator.wikimedia.org/T226553) (owner: Holger Knust)
[17:24:16] (PS1) Cwhite: admin: admin data and access for Abijeet Patro [puppet] - https://gerrit.wikimedia.org/r/529125 (https://phabricator.wikimedia.org/T230020)
[17:24:48] (CR) jerkins-bot: [V: -1] admin: admin data and access for Abijeet Patro [puppet] - https://gerrit.wikimedia.org/r/529125 (https://phabricator.wikimedia.org/T230020) (owner: Cwhite)
[17:25:08] (CR) Subramanya Sastry: "I'll just have someone review that other patch
now and have it merged. No need to revert this since we run the update_parsoid.sh script ma" [puppet] - https://gerrit.wikimedia.org/r/528980 (https://phabricator.wikimedia.org/T229858) (owner: Subramanya Sastry)
[17:26:43] Operations, netops: BGP session down for AS4739 on cr4-ulsfo - https://phabricator.wikimedia.org/T230005 (elukey) a:elukey
[17:26:58] Operations, netops: BGP session down for AS4739 on cr4-ulsfo - https://phabricator.wikimedia.org/T230005 (elukey) Sent an email just now to their NOC, will wait for the answer before closing.
[17:27:06] Operations, netops: BGP session down for AS 20485 on cr2-esams - https://phabricator.wikimedia.org/T230004 (elukey) a:elukey
[17:27:18] Operations, netops: BGP session down for AS 20485 on cr2-esams - https://phabricator.wikimedia.org/T230004 (elukey) Sent an email to their NOC, going to wait for an answer before closing.
[17:29:43] (PS13) Thcipriani: Scap: scap_source correct gid [puppet] - https://gerrit.wikimedia.org/r/361796
[17:32:06] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to Mwmaint1002, Stat1007 for Abijeet Patro - https://phabricator.wikimedia.org/T230020 (colewhite) Description of roles and groups can be found here: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modu...
[17:33:17] !log fdans@deploy1001 Finished deploy [analytics/refinery@cef01d3]: deploy analytics refinery, second attempt (duration: 16m 52s)
[17:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:32] (CR) Thcipriani: [C: +1] gerrit: Re-enable the use of HTTP auth tokens [puppet] - https://gerrit.wikimedia.org/r/527596 (https://phabricator.wikimedia.org/T225308) (owner: Paladox)
[17:37:25] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to Mwmaint1002, Stat1007 for Abijeet Patro - https://phabricator.wikimedia.org/T230020 (colewhite) @RStallman-legalteam Would you mind confirming NDA on file for Abijeet? @Arrbee From the Phabricator profile, it says that Abijeet i...
[17:40:58] (PS10) Effie Mouzeli: profile::mediawiki::hhvm: default php to php7 [puppet] - https://gerrit.wikimedia.org/r/425027 (https://phabricator.wikimedia.org/T195392) (owner: Giuseppe Lavagetto)
[17:41:05] (PS2) Cwhite: admin: admin data and access for Abijeet Patro [puppet] - https://gerrit.wikimedia.org/r/529125 (https://phabricator.wikimedia.org/T230020)
[17:42:19] (PS1) Giuseppe Lavagetto: envoy: further fixes [puppet] - https://gerrit.wikimedia.org/r/529128
[17:42:21] (CR) jerkins-bot: [V: -1] profile::mediawiki::hhvm: default php to php7 [puppet] - https://gerrit.wikimedia.org/r/425027 (https://phabricator.wikimedia.org/T195392) (owner: Giuseppe Lavagetto)
[17:42:51] (CR) Giuseppe Lavagetto: [C: +2] envoy: further fixes [puppet] - https://gerrit.wikimedia.org/r/529128 (owner: Giuseppe Lavagetto)
[17:45:39] Operations, SRE-Access-Requests: Requesting access to LogStash for Abijeet Patro - https://phabricator.wikimedia.org/T230104 (colewhite) p:Triage→Normal a:colewhite
[17:46:37] Operations, SRE-Access-Requests: Requesting access to LogStash for Abijeet Patro - https://phabricator.wikimedia.org/T230104 (colewhite) Assuming NDA approval from T230020 once it is
available.
[17:47:59] Operations, Wikimedia-Mailing-lists: Yoruba Wikimedians User Group Official Mailing List - https://phabricator.wikimedia.org/T229749 (colewhite) p:Triage→Normal a:colewhite
[17:48:22] PROBLEM - puppet last run on vega is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 44 seconds ago with 2 failures. Failed resources (up to 3 shown): Exec[verify-envoy-config] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[17:51:52] Operations, Wikimedia-Mailing-lists: Yoruba Wikimedians User Group Official Mailing List - https://phabricator.wikimedia.org/T229749 (colewhite) Mailing list has been provisioned and the list administrator password emailed to reachout2isaac and jerrykufang22 at gmail. Please let us know if you encounter...
[17:52:01] Operations, Wikimedia-Mailing-lists: Yoruba Wikimedians User Group Official Mailing List - https://phabricator.wikimedia.org/T229749 (colewhite) Open→Resolved
[17:52:11] !log run /usr/local/sbin/restart-php7.2-fpm on mwdebug1001
[17:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:52:20] (PS8) MarcoAurelio: profile::mediawiki::maintenance: purge_checkuser to use periodic_job [puppet] - https://gerrit.wikimedia.org/r/529077
[17:53:54] RECOVERY - puppet last run on vega is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[17:54:47] gehel: can you or onimisionipe reload 1005 and repool it? https://phabricator.wikimedia.org/T229876
[17:55:01] SMalyshev: yep, on my todo list for this night
[17:55:08] great, thanks
[17:56:10] ACKNOWLEDGEMENT - Check the Netbox report-s- puppetdb for fail status.
on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL Ayounsi https://phabricator.wikimedia.org/T229998 https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[17:57:04] Operations: decom cookbook: dry-run mode not working / PuppetDB and Debmonitor removals can fail - https://phabricator.wikimedia.org/T229998 (ayounsi) I ACKed the Netbox/PuppetDB alert (`missing VM from Netbox: poolcounter2001`) linking to that task.
[17:58:16] (PS11) Effie Mouzeli: profile::mediawiki::hhvm: default php to php7 [puppet] - https://gerrit.wikimedia.org/r/425027 (https://phabricator.wikimedia.org/T195392) (owner: Giuseppe Lavagetto)
[17:59:16] (CR) jerkins-bot: [V: -1] profile::mediawiki::hhvm: default php to php7 [puppet] - https://gerrit.wikimedia.org/r/425027 (https://phabricator.wikimedia.org/T195392) (owner: Giuseppe Lavagetto)
[18:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Morning SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190808T1800).
[18:00:04] No GERRIT patches in the queue for this window AFAICS.
[18:02:15] (CR) MarcoAurelio: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/529077 (owner: MarcoAurelio)
[18:06:09] (PS1) CRusnov: netbox: Make it so that failed reports dont also systemd alert [puppet] - https://gerrit.wikimedia.org/r/529132
[18:07:08] (CR) jerkins-bot: [V: -1] netbox: Make it so that failed reports dont also systemd alert [puppet] - https://gerrit.wikimedia.org/r/529132 (owner: CRusnov)
[18:07:44] mutante: Good day.
I made https://gerrit.wikimedia.org/r/529077/ for what we talked about yesterday :)
[18:08:10] (PS12) Effie Mouzeli: profile::mediawiki::hhvm: default php to php7 [puppet] - https://gerrit.wikimedia.org/r/425027 (https://phabricator.wikimedia.org/T195392) (owner: Giuseppe Lavagetto)
[18:08:18] RECOVERY - Check systemd state on elastic1046 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:09:11] (PS2) CRusnov: netbox: Make it so that failed reports dont also systemd alert [puppet] - https://gerrit.wikimedia.org/r/529132
[18:10:11] (CR) jerkins-bot: [V: -1] netbox: Make it so that failed reports dont also systemd alert [puppet] - https://gerrit.wikimedia.org/r/529132 (owner: CRusnov)
[18:11:37] (PS3) CRusnov: netbox: Make it so that failed reports dont also systemd alert [puppet] - https://gerrit.wikimedia.org/r/529132
[18:12:45] (CR) jerkins-bot: [V: -1] netbox: Make it so that failed reports dont also systemd alert [puppet] - https://gerrit.wikimedia.org/r/529132 (owner: CRusnov)
[18:13:07] (CR) Eevans: WIP: Add cassandra-table-properties tool to Cassandra deployments (1 comment) [puppet] - https://gerrit.wikimedia.org/r/529074 (https://phabricator.wikimedia.org/T226553) (owner: Holger Knust)
[18:18:26] (CR) MarcoAurelio: "PCC for PS8: https://puppet-compiler.wmflabs.org/compiler1002/254/" [puppet] - https://gerrit.wikimedia.org/r/529077 (owner: MarcoAurelio)
[18:23:16] (PS13) Effie Mouzeli: profile::mediawiki::hhvm: default php to php7 [puppet] - https://gerrit.wikimedia.org/r/425027 (https://phabricator.wikimedia.org/T195392) (owner: Giuseppe Lavagetto)
[18:25:02] (PS4) CRusnov: netbox: Make it so that failed reports dont also systemd alert [puppet] - https://gerrit.wikimedia.org/r/529132
[18:27:15] (CR) Effie Mouzeli: "LGTM https://puppet-compiler.wmflabs.org/compiler1001/17822/mwmaint1002.eqiad.wmnet/" [puppet] -
https://gerrit.wikimedia.org/r/425027 (https://phabricator.wikimedia.org/T195392) (owner: Giuseppe Lavagetto)
[18:31:01] Operations, LDAP, cloud-services-team (Kanban): LDAP: multiples accounts for Phamhi - https://phabricator.wikimedia.org/T230126 (bd808) a:Phamhi→bd808 * `uid=phamhi,ou=people,dc=wikimedia,dc=org` is preferred per T230126#5402992 ** This is the Developer account connected to @Phamhi here on Ph...
[18:33:16] (CR) Ayounsi: [C: +1] "Not tested but looks good to me!" [puppet] - https://gerrit.wikimedia.org/r/529132 (owner: CRusnov)
[18:33:49] (CR) CRusnov: [C: +2] netbox: Make it so that failed reports dont also systemd alert [puppet] - https://gerrit.wikimedia.org/r/529132 (owner: CRusnov)
[18:54:31] Operations, LDAP, cloud-services-team (Kanban): LDAP: multiples accounts for Phamhi - https://phabricator.wikimedia.org/T230126 (Phamhi) * `uid=hpham,ou=people,dc=wikimedia,dc=org` has no Cloud VPS memberships ** Created by the OIT group on the first day.. I think.. according to them.. it's for "your...
[19:00:04] brennen: (Dis)respected human, time to deploy MediaWiki train - American version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190808T1900). Please do the needful.
[19:02:10] (PS1) Brennen Bearnes: all wikis to 1.34.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/529144
[19:02:12] (CR) Brennen Bearnes: [C: +2] all wikis to 1.34.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/529144 (owner: Brennen Bearnes)
[19:03:12] (Merged) jenkins-bot: all wikis to 1.34.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/529144 (owner: Brennen Bearnes)
[19:03:17] (PS1) Elukey: profiile::analytics::refinery::job::test::refine: use refinery 0.97 [puppet] - https://gerrit.wikimedia.org/r/529145
[19:03:31] (CR) jenkins-bot: all wikis to 1.34.0-wmf.17 [mediawiki-config] - https://gerrit.wikimedia.org/r/529144 (owner: Brennen Bearnes)
[19:04:20] (CR) Elukey: [C: +2] profiile::analytics::refinery::job::test::refine: use refinery 0.97 [puppet] - https://gerrit.wikimedia.org/r/529145 (owner: Elukey)
[19:06:40] !log brennen@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.34.0-wmf.17
[19:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:33] (PS2) Ejegg: Make banner-preview CSP match normal CSP [mediawiki-config] - https://gerrit.wikimedia.org/r/527183 (https://phabricator.wikimedia.org/T225261)
[19:11:35] (PS5) Ejegg: CSP for banner preview: allow remind me later host [mediawiki-config] - https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T194019)
[19:19:36] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:24:28] PROBLEM - Check the Netbox report-s- librenms for fail status.
on netmon1002 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[19:27:07] !log gehel@cumin1001 START - Cookbook sre.wdqs.data-transfer
[19:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:27:47] in a few minutes we'll be restarting gerrit
[19:28:09] (PS3) CDanis: gerrit: Re-enable the use of HTTP auth tokens [puppet] - https://gerrit.wikimedia.org/r/527596 (https://phabricator.wikimedia.org/T225308) (owner: Paladox)
[19:28:52] RECOVERY - Disk space on wdqs1005 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=wdqs1005&var-datasource=eqiad+prometheus/ops
[19:29:03] (CR) CDanis: [C: +2] gerrit: Re-enable the use of HTTP auth tokens [puppet] - https://gerrit.wikimedia.org/r/527596 (https://phabricator.wikimedia.org/T225308) (owner: Paladox)
[19:34:23] !log restart gerrit-replica on gerrit2001 to pick up new config
[19:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:36:07] !log restart gerrit on cobalt to pick up new config
[19:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:37:50] cdanis: ^ all done thanks for the merge :)
[19:37:58] :=1:
[19:38:01] 👍
[19:42:19] (PS3) CDanis: Add L and M to allowed statement starts [puppet] - https://gerrit.wikimedia.org/r/526755 (owner: Smalyshev)
[19:45:22] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_wikistats-v2],Exec[git_pull_analytics.wikimedia.org] https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[19:50:54] (CR) Ppchelko: "I believe we need a .tgz file as well?"
(3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/526679 (https://phabricator.wikimedia.org/T229287) (owner: MSantos)
[19:54:59] (CR) Ayounsi: [C: -1] "I do use those pages when I work with the API." [puppet] - https://gerrit.wikimedia.org/r/528531 (owner: CRusnov)
[19:59:34] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[20:07:44] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[20:30:00] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[20:33:05] !log gehel@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[20:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:34:22] PROBLEM - High lag on wdqs1010 is CRITICAL: 3740 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[20:43:02] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 40.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[20:45:22] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 60.00% of data under the critical threshold [50.0]
https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[20:45:25] (PS3) Dzahn: scap/mw-maint: switch foreachwikiindblist to PHP 7.2 [puppet] - https://gerrit.wikimedia.org/r/528613 (https://phabricator.wikimedia.org/T195392)
[20:48:50] RECOVERY - High lag on wdqs1010 is OK: (C)3600 ge (W)1200 ge 1155 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[20:49:51] (PS1) CRusnov: switch swagger to nonpublic mode [software/netbox] - https://gerrit.wikimedia.org/r/529171
[20:52:22] Operations, cloud-services-team, serviceops, Core Platform Team Legacy (Watching / External), and 4 others: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (Dzahn)
[20:56:36] PROBLEM - Host thumbor2004 is DOWN: PING CRITICAL - Packet loss = 100%
[21:01:47] Operations, LDAP, cloud-services-team (Kanban): LDAP: multiples accounts for Phamhi - https://phabricator.wikimedia.org/T230126 (bd808) >>! In T230126#5403598, @Phamhi wrote: > * `uid=hpham,ou=people,dc=wikimedia,dc=org` has no Cloud VPS memberships > ** Created by the OIT group on the first day.. I...
[21:03:50] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [21:04:34] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [21:08:38] (03PS1) 10Subramanya Sastry: Make scandium a read-only appserver [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529173 (https://phabricator.wikimedia.org/T228069) [21:10:09] (03PS1) 10Viztor: Update HD logo for wikisource using default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529175 [21:10:12] (03CR) 10Dzahn: [C: 03+1] Make scandium a read-only appserver [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529173 (https://phabricator.wikimedia.org/T228069) (owner: 10Subramanya Sastry) [21:17:23] (03PS2) 10Viztor: Update HD logo for wikisource using default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/529175 [21:22:27] !log built new scap version 3.12.0-1 on boron, imported packages on install1002 (apt.wm.org), copied from stretch to jessie and buster (T230144) [21:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:35] T230144: Deploy scap 3.12.0-1 to production - https://phabricator.wikimedia.org/T230144 [21:26:26] (03PS1) 10Subramanya Sastry: Tweak update_parsoid.sh to handle symlinking correctly [puppet] - 10https://gerrit.wikimedia.org/r/529177 [21:28:20] !log rolling out new scap package 3.12.0-1 on contint servers [21:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:32] 10Operations, 10netops: csw2-esams's VCP link flapped - https://phabricator.wikimedia.org/T229755 (10ayounsi) Seems 
like this device is seeing its end coming with the esams refresh. The on disk logs have rolled over but syslog logs are visible on https://logstash.wikimedia.org/goto/08b36fc12fdc83accef419f308f... [21:28:45] 10Operations, 10netops: csw2-esams's VCP link flapped - https://phabricator.wikimedia.org/T229755 (10ayounsi) a:03ayounsi [21:30:44] !log rolling out new scap package 3.12.0-1 on mw-canary servers via debdeploy (T230144) [21:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:52] T230144: Deploy scap 3.12.0-1 to production - https://phabricator.wikimedia.org/T230144 [21:31:54] jouncebot: next [21:31:54] In 1 hour(s) and 28 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190808T2300) [21:36:37] (03CR) 10Dzahn: [C: 03+2] scap/mw-maint: switch foreachwikiindblist to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/528613 (https://phabricator.wikimedia.org/T195392) (owner: 10Dzahn) [21:40:21] !log mwmaint1002 - manually running (weekly) echo_mail cron job (user notifications) to confirm it works after switching foreachwikiindblist to use php7.2 (T195392) [21:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:31] T195392: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 [21:41:34] (03CR) 10Dzahn: [C: 03+2] Tweak update_parsoid.sh to handle symlinking correctly [puppet] - 10https://gerrit.wikimedia.org/r/529177 (owner: 10Subramanya Sastry) [21:41:42] (03PS2) 10Dzahn: Tweak update_parsoid.sh to handle symlinking correctly [puppet] - 10https://gerrit.wikimedia.org/r/529177 (owner: 10Subramanya Sastry) [21:47:28] RECOVERY - Check the Netbox report-s- librenms for fail status. 
on netmon1002 is OK: librenms.LibreNMS OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [21:48:28] 10Operations, 10LDAP, 10cloud-services-team (Kanban): LDAP: multiples accounts for Phamhi - https://phabricator.wikimedia.org/T230126 (10bd808) [21:50:06] !log mwmaint1002 - manually running purgeUnusedProjects with PageAssessments extension to confirm no issues after switch to PHP7.2 [21:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:56] !log mwmaint1002 - manually running purge_old_cx_drafts maintenance job for ContentTranslation - after switching helper script to PHP 7.2 [21:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:19] 10Operations, 10LDAP, 10cloud-services-team (Kanban): LDAP: multiples accounts for Phamhi - https://phabricator.wikimedia.org/T230126 (10bd808) [21:54:17] !log (purge unpublished articles from ContentTranslation older than 455 days) [21:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:46] 10Operations, 10cloud-services-team, 10serviceops, 10Core Platform Team Legacy (Watching / External), and 4 others: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (10Dzahn) [21:54:52] 10Operations, 10ops-eqiad: Degraded RAID on helium - https://phabricator.wikimedia.org/T224794 (10Jclark-ctr) Replaced the disk at slot 11 [21:54:59] 10Operations, 10LDAP, 10cloud-services-team (Kanban): LDAP: multiples accounts for Phamhi - https://phabricator.wikimedia.org/T230126 (10bd808) >>! In T230126#5403925, @bd808 wrote: > I can't find the creation of `uid=hpham,ou=people,dc=wikimedia,dc=org` in either which is confusing, especially because the '... 
[21:59:44] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudelastic1002: SMART/disk error - https://phabricator.wikimedia.org/T230088 (10wiki_willy) a:03Cmjohnson [22:00:43] !log rolling out new scap version 3.12.0-1 on all of codfw [22:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:36] !log mwdebug2002 - scap pull to test new scap, nothing to do [22:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:33] !log rolling out new scap version 3.12.0-1 on all of eqiad [22:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:51] jouncebot: next [22:09:52] In 0 hour(s) and 50 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190808T2300) [22:16:38] (03PS3) 10Cwhite: admin: admin data and access for Abijeet Patro [puppet] - 10https://gerrit.wikimedia.org/r/529125 (https://phabricator.wikimedia.org/T230020) [22:16:40] (03PS1) 10Cwhite: admin: add Jaime Anstee to ldap only users [puppet] - 10https://gerrit.wikimedia.org/r/529184 (https://phabricator.wikimedia.org/T229959) [22:25:12] (03PS1) 10CDanis: dbctl: reduce argparse boilerplate [software/conftool] - 10https://gerrit.wikimedia.org/r/529185 [23:00:04] MaxSem, RoanKattouw, and Niharika: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190808T2300). [23:00:04] ejegg and Krenair: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:02:44] hey [23:07:51] 10Operations, 10SRE-Access-Requests: Requesting access to machines [stat1004, stat1007, stat1006, notebook1003 and notebook1004] and groups for cchen - https://phabricator.wikimedia.org/T228447 (10kzimmerman) Thanks @colewhite! [23:09:40] 10Operations, 10Product-Analytics, 10SRE-Access-Requests: Help speed up onboarding for new Analysts - https://phabricator.wikimedia.org/T230173 (10kzimmerman) [23:10:19] (03CR) 10Eevans: "Another pass..." (0310 comments) [software/cassandra-table-properties] - 10https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) (owner: 10Holger Knust) [23:19:51] Krenair, ejegg: Seems no one claimed SWAT yet? [23:19:52] If wanted, I can SWAT your patches :-) [23:20:45] oh hi! [23:21:12] yes [23:21:13] please [23:21:18] Thanks, there was a question from another potential SWATter about one of the config patches [23:21:41] and we were trying to nail down the paper trail of legal and security review of a related feature from 5 years ago [23:22:11] but we in fr-tech are pretty confident that the actual config patch to be deployed doesn't open up any cans of worms that weren't already open [23:22:25] ok Krenair, +2'ed your backport, waiting on CI [23:22:28] just points in the direction of one of them [23:22:34] woah woah wait, ejegg [23:22:49] Krenair: nothing to do with your patch! [23:22:56] you want to make a CSP change in SWAT? [23:22:58] 10Operations: install2002 short on disk space - https://phabricator.wikimedia.org/T229997 (10Dzahn) Oh, thanks. looks like i forgot to add it to fstab when i created and mounted it. This should fix it indeed. 
[23:23:04] I know, I'm looking at yours [23:23:19] Yeah, it's for a very special case [23:23:29] only loaded when force-previewing a CentralNotice banner [23:23:36] with banner=BlahDeDah [23:23:46] basically only CentralNotice admins use that [23:24:04] and we set an extra-strict CSP on that preview so we can catch any unintended privacy leaks [23:24:26] Urbanecm, thanks [23:25:00] the site-wide CSP is report-only, but the banner-preview CSP is actually blocking, and is served up along with a bit of JS to catch the CSP violation error and visibly alert the user about it [23:25:28] so banner creators don't inadvertently leak referrers by including e.g. an image or script hosted off-wiki [23:25:53] the changes in this SWAT deploy are 1) bring the banner-preview CSP in line with site-wide report-only CSP [23:27:28] and 2) whitelist an additional domain (our bulk email provider) in the banner-preview CSP, which has been in use in banners for the past 5 years to implement the email reminder feature [23:28:24] the fundraising folks who create banners have been complaining since we added the strict CSP on preview that they can't test the email reminders, even though they work for real banner views [23:28:31] so I have verified that it will only be used when the 'banner' param is specified. Urbanecm, are you confident enough with CSP to verify that new value being a stricter policy than the existing thing? 
[23:29:29] I think if this is making a CSP policy anywhere less strict and more open, it should get a proper security review before going to SWAT [23:31:12] here's where we set the header: https://phabricator.wikimedia.org/diffusion/ECNO/browse/wmf_deploy/includes/CentralNoticeHooks.php$237 [23:31:27] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/527183/2/wmf-config/CommonSettings.php definitely tightens the CSP policy, given the previous policy contains *.wikimedia.org, which is indeed pretty bad (whitelists people.wikimedia.org as well, which is a place where "anyone can edit") [23:31:28] you can see it's the content-security-policy header [23:31:52] and not the content-security-policy-report-only header [23:32:12] people.wikimedia.org is probably less scary than the wikis Urbanecm [23:32:36] IIRC there are domains under wikimedia.org pointing to outside the wm production cluster, those would match and be far more scary [23:33:15] true too [23:33:23] anyway [23:33:25] if you're happy then okay [23:33:26] (03PS1) 10Dzahn: mediawiki:maintenance: switch translationnotifications to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/529190 (https://phabricator.wikimedia.org/T195392) [23:33:57] (03CR) 10jerkins-bot: [V: 04-1] mediawiki:maintenance: switch translationnotifications to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/529190 (https://phabricator.wikimedia.org/T195392) (owner: 10Dzahn) [23:34:28] however, I'm not comfortable with adding another domain (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/526756), without a +1 from someone from security [23:34:28] (03PS2) 10Dzahn: mediawiki:maintenance: switch translationnotifications to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/529190 (https://phabricator.wikimedia.org/T195392) [23:34:32] in principle it seems like a risky idea to assume that whoever handles a given swat window will be willing to review such a change. 
I haven't been a deployer for a few years but I probably would've sent this back for a proper security review [23:34:45] +1 [23:35:05] ejegg, could you please get a clear "go ahead" from #security and reschedule? [23:36:00] (03CR) 10Dzahn: [C: 04-1] "please get a review from security and or traffic for adding this non-WMF domain" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/526756 (https://phabricator.wikimedia.org/T194019) (owner: 10Ejegg) [23:37:35] (03CR) 10Dzahn: [C: 03+2] mediawiki:maintenance: switch translationnotifications to PHP 7.2 [puppet] - 10https://gerrit.wikimedia.org/r/529190 (https://phabricator.wikimedia.org/T195392) (owner: 10Dzahn) [23:37:44] OK Urbanecm, I'll see what I can do [23:37:47] IMHO it's a pretty important expectation that a swat deployer vetoes everything and anything they do not feel qualified to review [23:38:17] thank you ejegg [23:42:30] !log mwmaint1002 - manually running TranslationNotifications DigestEmailer maintenance cron [23:42:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:12] Krenair, your patch is on mwdebug1002 [23:43:41] ah, yes, I need that extension don't I [23:43:52] yeah :) [23:43:58] (or test with curl) [23:44:13] 10Operations, 10cloud-services-team, 10serviceops, 10Core Platform Team Legacy (Watching / External), and 4 others: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (10Dzahn) [23:44:14] Krenair, https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug#Browser_extensions [23:45:13] apparently it's already installed in my browser, just not showing up? 
will try chrome, one sec [23:45:30] sure [23:47:02] ok [23:47:21] Urbanecm, yep, works [23:47:32] thanks, syncing [23:48:22] !log mwmaint1002 - manually running purge_securepoll maintenance script [23:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:19] !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.17/extensions/WikiEditor/modules/jquery.wikiEditor.dialogs.config.js: SWAT: 6dcab39: Follow-up Ia75d685c: Fix the insert file dialog (T230078) (duration: 00m 50s) [23:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:28] T230078: WikiEditor insert file dialog broken - https://phabricator.wikimedia.org/T230078 [23:49:40] Krenair, synced [23:50:27] Urbanecm, works, thank you! [23:50:32] happy to help! [23:50:37] !log Evening SWAT done [23:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:32] 10Operations, 10cloud-services-team, 10serviceops, 10Core Platform Team Legacy (Watching / External), and 4 others: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (10Dzahn)