[00:04:07] (PS12) Alex Monk: elastalert: new module [puppet] - https://gerrit.wikimedia.org/r/502773 (https://phabricator.wikimedia.org/T213933) (owner: Filippo Giunchedi)
[00:04:09] (PS10) Alex Monk: elastalert: enable on logstash1007 [puppet] - https://gerrit.wikimedia.org/r/505762 (https://phabricator.wikimedia.org/T213933) (owner: Filippo Giunchedi)
[00:05:07] (PS11) Alex Monk: elastalert: enable on logstash1007 [puppet] - https://gerrit.wikimedia.org/r/505762 (https://phabricator.wikimedia.org/T213933) (owner: Filippo Giunchedi)
[00:06:43] (PS13) Alex Monk: elastalert: new module [puppet] - https://gerrit.wikimedia.org/r/502773 (https://phabricator.wikimedia.org/T213933) (owner: Filippo Giunchedi)
[00:06:45] (PS12) Alex Monk: elastalert: enable on logstash1007 [puppet] - https://gerrit.wikimedia.org/r/505762 (https://phabricator.wikimedia.org/T213933) (owner: Filippo Giunchedi)
[00:10:05] Operations, netbox, netops: Netbox racks consistency report - https://phabricator.wikimedia.org/T212878 (ayounsi)
[00:44:53] (PS2) Dzahn: eventlogging: add Icinga notes URLs [puppet] - https://gerrit.wikimedia.org/r/510053 (https://phabricator.wikimedia.org/T197873)
[00:46:21] (CR) Dzahn: [C: +2] "this seemed relatively obvious to me so just self-merging to save your valuable time. but if you see something better please let me know o" [puppet] - https://gerrit.wikimedia.org/r/510053 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn)
[00:50:34] (CR) Ayounsi: "Thanks! comment inline (some more useful than others)." (12 comments) [software/netbox-reports] - https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507) (owner: CRusnov)
[00:53:55] PROBLEM - Disk space on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[00:55:31] PROBLEM - configured eth on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused
[00:55:51] PROBLEM - DPKG on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused
[00:55:55] PROBLEM - dhclient process on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused
[00:56:05] PROBLEM - MD RAID on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[00:56:23] PROBLEM - Check systemd state on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused
[00:57:37] PROBLEM - puppet last run on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused
[00:59:33] any op mind figuring out for me what image is being used to run the zotero container?
[00:59:55] PROBLEM - Check the NTP synchronisation status of timesyncd on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused
[01:03:25] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:03:37] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:08:03] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[01:08:05] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[01:17:17] RECOVERY - dhclient process on stat1007 is OK: PROCS OK: 0 processes with command name dhclient
[01:17:27] RECOVERY - MD RAID on stat1007 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[01:17:49] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational
[01:18:43] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:20:21] PROBLEM - HHVM rendering on mw2146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[01:21:13] RECOVERY - HHVM rendering on mw2146 is OK: HTTP OK: HTTP/1.1 200 OK - 77938 bytes in 0.406 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[01:30:05] RECOVERY - Check the NTP synchronisation status of timesyncd on stat1007 is OK: OK: synced at Wed 2019-05-15 01:30:03 UTC.
[01:30:43] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[01:32:33] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[01:34:37] RECOVERY - DPKG on stat1007 is OK: All packages OK
[01:36:14] (CR) DannyS712: [C: -1] "The "rollbacker" group is created, but no other group is given the ability to add or remove it" (4 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/510269 (https://phabricator.wikimedia.org/T219617) (owner: Zoranzoki21)
[01:36:29] RECOVERY - configured eth on stat1007 is OK: OK - interfaces up
[01:55:57] (PS1) Alex Monk: deployment-prep: Use new Citoid service running inside Docker [puppet] - https://gerrit.wikimedia.org/r/510290 (https://phabricator.wikimedia.org/T220235)
[02:15:06] (CR) CRusnov: [C: +1] "trivial LGTM. Note to self to remove it from the automation." [software/netbox-reports] - https://gerrit.wikimedia.org/r/510268 (owner: Faidon Liambotis)
[02:20:11] Operations, Release Pipeline, Release-Engineering-Team, Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (Krenair) May 14 08:24:34 <_joe_> and thanks for helping with that. In an ideal world, the teams developing new services would i...
[02:30:30] (PS6) CRusnov: Add LibreNMS parity check report. [software/netbox-reports] - https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507)
[02:30:44] (CR) CRusnov: "Thanks for review!" (12 comments) [software/netbox-reports] - https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507) (owner: CRusnov)
[02:31:23] (CR) Alex Monk: "also, the existing one just seems to be completely down. From RB at https://en.wikipedia.beta.wmflabs.org/api/rest_v1/data/citation/mediaw" [puppet] - https://gerrit.wikimedia.org/r/510290 (https://phabricator.wikimedia.org/T220235) (owner: Alex Monk)
[02:34:43] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[02:40:11] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[03:03:15] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[03:22:11] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:22:45] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:52:47] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.6123 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[03:54:11] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.258 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[03:55:35] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash
[03:56:55] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash
[04:18:59] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:58:38] (PS24) Alex Monk: cumin: Allow Puppet DB backend to be used within Labs projects that use it [puppet] - https://gerrit.wikimedia.org/r/437052
[05:07:19] PROBLEM - puppet last run on mw1233 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[05:10:51] <_joe_> uhm
[05:11:37] <_joe_> this message ^^ is inaccurate
[05:11:41] <_joe_> the puppetmaster failed
[05:34:09] RECOVERY - puppet last run on mw1233 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:52:10] (CR) Effie Mouzeli: [C: +1] Empty mediawiki_memcached_servers for 3 mw hosts [puppet] - https://gerrit.wikimedia.org/r/510153 (https://phabricator.wikimedia.org/T214275) (owner: Elukey)
[05:53:09] (CR) Effie Mouzeli: [V: +1 C: +2] debug_proxy: force http/1.1 when proxying [puppet] - https://gerrit.wikimedia.org/r/509848 (https://phabricator.wikimedia.org/T217846) (owner: Effie Mouzeli)
[05:56:57] (PS2) Ppchelko: [EventBus] Make EventFactory and event destination configurable. [mediawiki-config] - https://gerrit.wikimedia.org/r/509092 (https://phabricator.wikimedia.org/T222822)
[05:57:13] (CR) Giuseppe Lavagetto: "I think this makes sense, but why not go with all of codfw first?" [puppet] - https://gerrit.wikimedia.org/r/510153 (https://phabricator.wikimedia.org/T214275) (owner: Elukey)
[05:59:26] (CR) Giuseppe Lavagetto: [C: +2] deployment-prep: Use new Citoid service running inside Docker [puppet] - https://gerrit.wikimedia.org/r/510290 (https://phabricator.wikimedia.org/T220235) (owner: Alex Monk)
[06:02:23] (PS1) Ppchelko: [EventBus] Add eventgate-main event service. [mediawiki-config] - https://gerrit.wikimedia.org/r/510299 (https://phabricator.wikimedia.org/T222822)
[06:03:50] (PS3) Ppchelko: [EventBus] Make EventFactory and event destination configurable. [mediawiki-config] - https://gerrit.wikimedia.org/r/509092 (https://phabricator.wikimedia.org/T222822)
[06:05:45] (CR) Elukey: "> I think this makes sense, but why not go with all of codfw first?" [puppet] - https://gerrit.wikimedia.org/r/510153 (https://phabricator.wikimedia.org/T214275) (owner: Elukey)
[06:07:49] <_joe_> elukey: why would a rollback in codfw be painful?
[06:07:54] <_joe_> it receives no traffic
[06:08:07] <_joe_> unless we use that hiera key somewhere else
[06:08:23] _joe_ yes you are right, I am lazy
[06:08:27] this is the real reason
[06:08:29] :P
[06:16:42] anyway, codfw first!
[06:16:47] going to send a patch in a second
[06:18:24] (PS1) Effie Mouzeli: debug_proxy: force http/1.1 when proxying [puppet] - https://gerrit.wikimedia.org/r/510305 (https://phabricator.wikimedia.org/T217846)
[06:19:55] (Abandoned) Effie Mouzeli: debug_proxy: force http/1.1 when proxying [puppet] - https://gerrit.wikimedia.org/r/509848 (https://phabricator.wikimedia.org/T217846) (owner: Effie Mouzeli)
[06:20:13] RECOVERY - Disk space on stat1007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[06:21:13] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational
[06:24:01] !log force remount of /mnt/hdfs on stat1007 - fuse hdfs stuck
[06:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:04] what a lovely tool
[06:27:07] (PS3) Giuseppe Lavagetto: Disable the PHP7 beta feature [mediawiki-config] - https://gerrit.wikimedia.org/r/508177 (https://phabricator.wikimedia.org/T219128)
[06:28:26] (PS2) Effie Mouzeli: debug_proxy: force http/1.1 when proxying [puppet] - https://gerrit.wikimedia.org/r/510305 (https://phabricator.wikimedia.org/T217846)
[06:29:33] PROBLEM - puppet last run on mw1247 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[06:30:21] PROBLEM - puppet last run on db2086 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[06:30:32] (CR) Effie Mouzeli: [C: +2] debug_proxy: force http/1.1 when proxying [puppet] - https://gerrit.wikimedia.org/r/510305 (https://phabricator.wikimedia.org/T217846) (owner: Effie Mouzeli)
[06:33:51] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational
[06:36:35] Operations, Patch-For-Review, Performance-Team (Radar): PHP fatal error handler not working on mwdebug servers - https://phabricator.wikimedia.org/T217846 (jijiki) Open→Resolved a:jijiki @Krinkle sorry for the delay in merging this. LGTM now, please reopen if there are issues.
[06:37:57] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:38:18] known --^
[06:42:01] (PS2) Elukey: Empty mediawiki_memcached_servers for 3 mw hosts [puppet] - https://gerrit.wikimedia.org/r/510153 (https://phabricator.wikimedia.org/T214275)
[06:42:03] (PS1) Elukey: Remove nutcracker memcached config in codfw [puppet] - https://gerrit.wikimedia.org/r/510309 (https://phabricator.wikimedia.org/T214275)
[06:42:09] _joe_ --^ :)
[06:42:47] PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[06:43:05] ufff
[06:43:36] this is surely again mc1029
[06:43:45] (tx bandwidth saturated)
[06:44:00] the key responsible for this should get reduced in the next mw train
[06:44:03] hopefully
[06:45:04] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/16555/mw2245.codfw.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/510309 (https://phabricator.wikimedia.org/T214275) (owner: Elukey)
[06:45:31] RECOVERY - MediaWiki memcached error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[06:47:44] (yep confirmed mc1029)
[06:53:35] Operations, Release Pipeline, Release-Engineering-Team, Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (WMDE-leszek) >>! In T220402#5183351, @Krenair wrote: > May 14 08:24:34 <_joe_> and thanks for helping with that. In an ideal wo...
[06:56:45] (PS1) Elukey: Move superset.wikimedia.org to analytics-tool1004 [puppet] - https://gerrit.wikimedia.org/r/510310 (https://phabricator.wikimedia.org/T212243)
[06:57:13] RECOVERY - puppet last run on db2086 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:01:45] RECOVERY - puppet last run on mw1247 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[07:02:17] PROBLEM - Host an-worker1094 is DOWN: PING CRITICAL - Packet loss = 100%
[07:06:31] Operations, Traffic, media-storage, Patch-For-Review: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035 (Ankry) Is this really fixed? I still get strange Content-Type for the thumbnail from de...
[07:13:54] (CR) Giuseppe Lavagetto: [C: +1] Remove nutcracker memcached config in codfw [puppet] - https://gerrit.wikimedia.org/r/510309 (https://phabricator.wikimedia.org/T214275) (owner: Elukey)
[07:15:18] (CR) Giuseppe Lavagetto: [C: +2] Disable the PHP7 beta feature [mediawiki-config] - https://gerrit.wikimedia.org/r/508177 (https://phabricator.wikimedia.org/T219128) (owner: Giuseppe Lavagetto)
[07:16:14] (CR) jenkins-bot: Disable the PHP7 beta feature [mediawiki-config] - https://gerrit.wikimedia.org/r/508177 (https://phabricator.wikimedia.org/T219128) (owner: Giuseppe Lavagetto)
[07:21:55] !log oblivian@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Remove the php7 beta feature T219128 (duration: 00m 59s)
[07:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:00] T219128: Remove php7 beta feature - https://phabricator.wikimedia.org/T219128
[07:22:59] (CR) Elukey: [C: +2] Remove nutcracker memcached config in codfw [puppet] - https://gerrit.wikimedia.org/r/510309 (https://phabricator.wikimedia.org/T214275) (owner: Elukey)
[07:24:13] ah snap I just saw an-worker1094
[07:24:14] lovely
[07:24:30] Operations, RESTBase, RESTBase-API, serviceops, and 4 others: Make RESTBase spec standard compliant and switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218 (mobrovac)
[07:25:06] Operations, RESTBase, RESTBase-API, serviceops, and 4 others: Make RESTBase spec standard compliant and switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218 (mobrovac)
[07:28:25] (CR) Filippo Giunchedi: "Thanks for working on this! LGTM, just a nit inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/510216 (https://phabricator.wikimedia.org/T144169) (owner: Jbond)
[07:29:01] !log powercycle an-worker1094 (OEM event occurred, checking if temporary)
[07:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:31] PROBLEM - nutcracker port on mw2245 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker
[07:29:45] <_joe_> hah
[07:29:49] <_joe_> elukey: ^^
[07:31:01] RECOVERY - Host an-worker1094 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[07:31:14] <_joe_> oh you're busy
[07:31:18] <_joe_> lemme take a look then
[07:32:55] nono I am free now sorry
[07:33:07] <_joe_> nutcracker is running, it's just not allowing memcached connections i guess
[07:33:10] I have restarted nutcracker in there
[07:33:19] good that I did on one only first :D
[07:33:43] I didn't think about the alarms
[07:33:59] !log restart nutcracker on mw2245 to pick up config changes (removal of memcached config)
[07:34:01] <_joe_> me neither
[07:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:14] <_joe_> so we need to change the monitoring
[07:34:24] <_joe_> before we proceed
[07:34:27] <_joe_> I can work on it
[07:35:29] super, I didn't meant to derail your day on this though, I can work on it if you are busy with other things
[07:39:16] <_joe_> nah don't worry
[07:39:24] I have a patch coming :)
[07:39:27] <_joe_> I'll do it once I finish my current convo
[07:39:32] <_joe_> ah ok then go on :D
[07:44:25] Operations, Traffic, media-storage, Patch-For-Review: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035 (matmarex) May be a separate issue? There are a few tasks that mention 'application/x-ww...
[07:46:18] (PS1) Elukey: profile::mediawiki::nutcracker: add port alarms only if config is deployed [puppet] - https://gerrit.wikimedia.org/r/510434 (https://phabricator.wikimedia.org/T214275)
[07:47:02] not a beauty but it should do its job
[07:47:13] once we complete eqiad I'll clean up that role
[07:47:16] err profile
[07:49:05] Operations, serviceops, Beta-Feature, MW-1.34-notes (1.34.0-wmf.6; 2019-05-21), and 2 others: Remove php7 beta feature - https://phabricator.wikimedia.org/T219128 (Joe) A net effect of this patch is now all logged-in users are back to HHVM. I think we need to backport the patch above to the runni...
[07:49:12] (CR) Hashar: CI Tests: add a check to ensure all python files have a py extension (2 comments) [puppet] - https://gerrit.wikimedia.org/r/510216 (https://phabricator.wikimedia.org/T144169) (owner: Jbond)
[07:49:37] (CR) Hashar: [C: +1] CI Tests: add a check to ensure all python files have a py extension [puppet] - https://gerrit.wikimedia.org/r/510216 (https://phabricator.wikimedia.org/T144169) (owner: Jbond)
[07:51:20] (CR) Giuseppe Lavagetto: [C: -1] profile::mediawiki::nutcracker: add port alarms only if config is deployed (1 comment) [puppet] - https://gerrit.wikimedia.org/r/510434 (https://phabricator.wikimedia.org/T214275) (owner: Elukey)
[07:51:28] <_joe_> elukey: you'll hate me for this
[07:51:35] <_joe_> but tbh
[07:51:42] <_joe_> I'm happy to work on it myself
[07:52:06] jouncebot: now
[07:52:06] No deployments scheduled for the next 3 hour(s) and 7 minute(s)
[07:52:10] <_joe_> it's my team's responsibility after all
[07:52:15] <_joe_> James_F: hi!
[07:52:15] (Abandoned) Elukey: profile::mediawiki::nutcracker: add port alarms only if config is deployed [puppet] - https://gerrit.wikimedia.org/r/510434 (https://phabricator.wikimedia.org/T214275) (owner: Elukey)
[07:52:36] _joe_ not at all, makes sense, we don't really check the redis port :)
[07:52:38] Hey _joe_. Want me to deploy the HHVM/PHP7 Beta Feature backport?
[07:53:27] <_joe_> James_F: I'm ok doing it during SWAT myself, a +1 is enough
[07:53:57] <_joe_> we won't die with 3 hours of our logged in users being back to the tired and tested rendering engine
[07:53:59] <_joe_> :P
[07:54:19] PROBLEM - BGP status on cr1-eqsin is CRITICAL: BGP CRITICAL - AS1299/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:54:49] <_joe_> uh
[07:55:07] <_joe_> elukey: if you would care to take a look at this ^^ instead :P
[07:55:48] _joe_: OK.
[07:56:18] the BGP status?
[07:56:23] <_joe_> yep
[07:56:50] ack, I have limited understanding of that, I can try :)
[07:57:27] (moving the discussion to -sre)
[07:58:29] PROBLEM - OSPF status on cr1-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:59:19] this should be related to the other one, a transit link is down
[07:59:51] RECOVERY - OSPF status on cr1-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:03:09] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 71 probes of 419 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[08:07:15] Warning Alert for device cr2-eqsin.wikimedia.org - Traffic on tunnel link
[08:09:16] Operations, LDAP-Access-Requests, WMF-Legal, WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T222788 (ArielGlenn)
[08:09:37] Operations, LDAP-Access-Requests, WMF-Legal, WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T222788 (ArielGlenn)
[08:13:01] (PS1) Giuseppe Lavagetto: nutcracker: monitor the redis socket when memcached is not present [puppet] - https://gerrit.wikimedia.org/r/510439
[08:13:10] <_joe_> elukey: ^^
[08:13:13] <_joe_> this should do
[08:13:30] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 48 probes of 458 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[08:13:53] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 15 probes of 419 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[08:14:19] Operations, Continuous-Integration-Infrastructure, SRE-Access-Requests, Release-Engineering-Team (Backlog): Request: add awight to contint-docker - https://phabricator.wikimedia.org/T223262 (ArielGlenn) p:Triage→Normal
[08:15:09] _joe_ ah ok so we add the redis monitoring while we decom the memcached conf
[08:15:46] <_joe_> note in theory you could do both
[08:15:52] yep yep
[08:16:13] Operations, Continuous-Integration-Infrastructure, SRE-Access-Requests, Release-Engineering-Team (Backlog): Request: add awight to contint-docker - https://phabricator.wikimedia.org/T223262 (ArielGlenn) Let's get sign-off from your WMDE manager (who is that?) and from @hashar or @greg in releng....
[08:17:15] ̶W̶a̶r̶n̶i̶n̶g Device cr2-eqsin.wikimedia.org recovered from Traffic on tunnel link
[08:17:20] (CR) Elukey: [C: +1] "elukey@mw2245:~$ sudo /usr/lib/nagios/plugins/check_tcp -H /var/run/nutcracker/redis_codfw.sock --timeout 2" [puppet] - https://gerrit.wikimedia.org/r/510439 (owner: Giuseppe Lavagetto)
[08:17:30] Operations, SRE-Access-Requests, Patch-For-Review, Release-Engineering-Team (Watching / External): Grant James Forrester access to contint-admins and contint-docker, and to the ciadmin LDAP group - https://phabricator.wikimedia.org/T223137 (ArielGlenn) p:Triage→Normal
[08:18:55] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 5 probes of 458 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[08:19:17] RECOVERY - BGP status on cr1-eqsin is OK: BGP OK - up: 264, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:20:52] (CR) Elukey: [C: +1] "also https://puppet-compiler.wmflabs.org/compiler1001/16559/mw2245.codfw.wmnet/ looks good" [puppet] - https://gerrit.wikimedia.org/r/510439 (owner: Giuseppe Lavagetto)
[08:21:27] <_joe_> elukey: no actually it's wrong :D
[08:22:30] (PS2) Giuseppe Lavagetto: nutcracker: monitor the redis socket when memcached is not present [puppet] - https://gerrit.wikimedia.org/r/510439
[08:22:44] <_joe_> elukey: this ^^ is what we want
[08:23:46] _joe_ you are right, otherwise the memcached alert remains :(
[08:24:19] <_joe_> https://puppet-compiler.wmflabs.org/compiler1002/16560/mw2245.codfw.wmnet/ more like it
[08:25:19] yep yep
[08:25:20] my bad
[08:30:07] (PS1) Vgutierrez: ATS: Improve paths handling for multi-instance support [puppet] - https://gerrit.wikimedia.org/r/510442 (https://phabricator.wikimedia.org/T221217)
[08:30:42] (PS1) Muehlenhoff: Skip profile::rsyslog::kafka_shipper on trusty [puppet] - https://gerrit.wikimedia.org/r/510443
[08:31:05] (CR) jerkins-bot: [V: -1] ATS: Improve paths handling for multi-instance support [puppet] - https://gerrit.wikimedia.org/r/510442 (https://phabricator.wikimedia.org/T221217) (owner: Vgutierrez)
[08:31:47] !log rebooting mw2164
[08:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:30] (CR) Ema: [C: +1] Move superset.wikimedia.org to analytics-tool1004 [puppet] - https://gerrit.wikimedia.org/r/510310 (https://phabricator.wikimedia.org/T212243) (owner: Elukey)
[08:35:12] (PS1) Awight: Replace own ssh key after prod/cloud mistake [puppet] - https://gerrit.wikimedia.org/r/510444
[08:35:55] (PS2) Vgutierrez: ATS: Improve paths handling for multi-instance support [puppet] - https://gerrit.wikimedia.org/r/510442 (https://phabricator.wikimedia.org/T221217)
[08:36:31] !log stop superset on analytics-tool1003 as prep step for the migration to the new host - T212243
[08:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:37] T212243: Staging environment for upgrades of superset - https://phabricator.wikimedia.org/T212243
[08:36:45] (CR) jerkins-bot: [V: -1] ATS: Improve paths handling for multi-instance support [puppet] - https://gerrit.wikimedia.org/r/510442 (https://phabricator.wikimedia.org/T221217) (owner: Vgutierrez)
[08:37:59] (PS3) Vgutierrez: ATS: Improve paths handling for multi-instance support [puppet] - https://gerrit.wikimedia.org/r/510442 (https://phabricator.wikimedia.org/T221217)
[08:41:29] (CR) Giuseppe Lavagetto: [C: +2] nutcracker: monitor the redis socket when memcached is not present [puppet] - https://gerrit.wikimedia.org/r/510439 (owner: Giuseppe Lavagetto)
[08:44:05] Operations, ops-eqiad, Patch-For-Review, cloud-services-team (Kanban): Cloud Services: reallocate workload from rack B5-eqiad - https://phabricator.wikimedia.org/T223148 (hashar) Following the maintenance email, for the `integration` project, I could use some reallocations / sync up before the ma...
[08:44:26] (PS4) Vgutierrez: ATS: Improve paths handling for multi-instance support [puppet] - https://gerrit.wikimedia.org/r/510442 (https://phabricator.wikimedia.org/T221217)
[08:44:44] !log elukey@deploy1001 Started deploy [analytics/superset/deploy@31c2c30]: Superset 0.32
[08:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:09] !log elukey@deploy1001 Finished deploy [analytics/superset/deploy@31c2c30]: Superset 0.32 (duration: 00m 26s)
[08:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:20] <_joe_> elukey: you can go on with the nutcracker work, but will need to re-run puppet on the servers
[08:46:33] (PS5) Vgutierrez: ATS: Improve paths handling for multi-instance support [puppet] - https://gerrit.wikimedia.org/r/510442 (https://phabricator.wikimedia.org/T221217)
[08:48:47] _joe_ ack!
[08:50:10] (PS6) Vgutierrez: ATS: Improve paths handling for multi-instance support [puppet] - https://gerrit.wikimedia.org/r/510442 (https://phabricator.wikimedia.org/T221217)
[08:52:05] (CR) Vgutierrez: "pcc shows (almost) a NOOP for cp4021: https://puppet-compiler.wmflabs.org/compiler1001/16565/" [puppet] - https://gerrit.wikimedia.org/r/510442 (https://phabricator.wikimedia.org/T221217) (owner: Vgutierrez)
[08:54:57] (CR) Volans: [C: +1] "reply to comment inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/510216 (https://phabricator.wikimedia.org/T144169) (owner: Jbond)
[08:58:57] (PS2) ArielGlenn: Replace own ssh key after prod/cloud mistake [puppet] - https://gerrit.wikimedia.org/r/510444 (owner: Awight)
[09:00:35] (CR) Hashar: local_dev: Add config for dev-images docker-pkg (3 comments) [puppet] - https://gerrit.wikimedia.org/r/510249 (https://phabricator.wikimedia.org/T223329) (owner: Brennen Bearnes)
[09:01:22] (CR) ArielGlenn: [C: +2] Replace own ssh key after prod/cloud mistake [puppet] - https://gerrit.wikimedia.org/r/510444 (owner: Awight)
[09:01:33] (PS2) Hashar: local_dev: Add config for dev-images docker-pkg [puppet] - https://gerrit.wikimedia.org/r/510249 (https://phabricator.wikimedia.org/T223329) (owner: Brennen Bearnes)
[09:02:02] Operations, Traffic, Wikimedia-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366 (Aklapper)
[09:03:43] Operations, serviceops, Beta-Feature, MW-1.34-notes (1.34.0-wmf.6; 2019-05-21), and 2 others: Remove php7 beta feature - https://phabricator.wikimedia.org/T219128 (Nohtr15) What was the point remove PHP7 beta feature? Now all logged-in users are forced to use HHVM. There is a huge difference betw...
[09:06:48] Operations, Continuous-Integration-Infrastructure, SRE-Access-Requests, Release-Engineering-Team (Backlog): Request: add awight to contint-docker - https://phabricator.wikimedia.org/T223262 (awight) My manager is @Tobi_WMDE_SW, and the CI work would usually full under the maintenance / 20% catego...
[09:09:49] Operations, Continuous-Integration-Infrastructure, SRE-Access-Requests, Release-Engineering-Team (Backlog): Request: add awight to contint-docker - https://phabricator.wikimedia.org/T223262 (ArielGlenn) The ssh key issue has been cleared up.
[09:11:12] Operations, serviceops, Beta-Feature, MW-1.34-notes (1.34.0-wmf.6; 2019-05-21), and 2 others: Remove php7 beta feature - https://phabricator.wikimedia.org/T219128 (Joe) >>! In T219128#5183956, @Nohtr15 wrote: > What was the point remove PHP7 beta feature? Now all logged-in users are forced to use...
[09:12:31] Operations, Operations-Software-Development, Patch-For-Review, cloud-services-team (Kanban): Flake8 for python files without extension in puppet repo - https://phabricator.wikimedia.org/T144169 (Volans) @jbond apart the mentioned in the CR `.erb.py` files and `modules/mediawiki/files/mediawiki-fi...
[09:17:30] (CR) Volans: [C: +1] "LGTM, nice hack but let's hope upstream would make it easier to query efficiently the custom fields." [software/netbox-reports] - https://gerrit.wikimedia.org/r/510267 (owner: Faidon Liambotis)
[09:19:27] (CR) Ema: [C: +1] ATS: Improve paths handling for multi-instance support [puppet] - https://gerrit.wikimedia.org/r/510442 (https://phabricator.wikimedia.org/T221217) (owner: Vgutierrez)
[09:22:31] (CR) Volans: cumin: Allow Puppet DB backend to be used within Labs projects that use it (1 comment) [puppet] - https://gerrit.wikimedia.org/r/437052 (owner: Alex Monk)
[09:22:33] (CR) Vgutierrez: [C: +2] ATS: Improve paths handling for multi-instance support [puppet] - https://gerrit.wikimedia.org/r/510442 (https://phabricator.wikimedia.org/T221217) (owner: Vgutierrez)
[09:22:43] (PS7) Vgutierrez: ATS: Improve paths handling for multi-instance support [puppet] - https://gerrit.wikimedia.org/r/510442 (https://phabricator.wikimedia.org/T221217)
[09:24:36] (CR) Fsero: [C: +2] local_dev: Add config for dev-images docker-pkg [puppet] - https://gerrit.wikimedia.org/r/510249 (https://phabricator.wikimedia.org/T223329) (owner: Brennen Bearnes)
[09:26:09] Operations, serviceops, Beta-Feature, MW-1.34-notes (1.34.0-wmf.6; 2019-05-21), and 2 others: Remove php7 beta feature - https://phabricator.wikimedia.org/T219128 (Nohtr15) HHVM feels very slow after using PHP7, so it doesn't make any sense to use HHVM. I'm very sad because I cannot choose PHP7 a...
[09:28:42] Operations, SRE-Access-Requests, Patch-For-Review, Release-Engineering-Team (Watching / External): Grant James Forrester access to contint-admins and contint-docker, and to the ciadmin LDAP group - https://phabricator.wikimedia.org/T223137 (ArielGlenn) Added to ciadmin ldap group.
[09:30:25] Operations, serviceops, Beta-Feature, MW-1.34-notes (1.34.0-wmf.6; 2019-05-21), and 2 others: Remove php7 beta feature - https://phabricator.wikimedia.org/T219128 (Reedy) >>! In T219128#5183998, @Nohtr15 wrote: > HHVM feels very slow after using PHP7, so it doesn't make any sense to use HHVM. I'm...
[09:31:31] (PS2) ArielGlenn: admin: add jforrester to contint-{admins,docker} [puppet] - https://gerrit.wikimedia.org/r/509891 (https://phabricator.wikimedia.org/T223137) (owner: Jforrester)
[09:32:51] Operations, netops, cloud-services-team (Kanban): CloudVPS: evaluate if we can make rsync use 10G in cloudvirts - https://phabricator.wikimedia.org/T223272 (aborrero)
[09:33:13] Operations, serviceops, Beta-Feature, MW-1.34-notes (1.34.0-wmf.6; 2019-05-21), and 2 others: Remove php7 beta feature - https://phabricator.wikimedia.org/T219128 (Joe) >>! In T219128#5183998, @Nohtr15 wrote: > HHVM feels very slow after using PHP7, so it doesn't make any sense to use HHVM. I'm v...
[09:33:35] !log Disable CI castor cache system since the instance is being migrated. Some / most CI jobs might have failed for the last 20 minutes or so T223148
[09:33:38] (CR) ArielGlenn: [C: +2] admin: add jforrester to contint-{admins,docker} [puppet] - https://gerrit.wikimedia.org/r/509891 (https://phabricator.wikimedia.org/T223137) (owner: Jforrester)
[09:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:40] T223148: Cloud Services: reallocate workload from rack B5-eqiad - https://phabricator.wikimedia.org/T223148
[09:33:48] (PS5) Fsero: mcrouter: feat(T221346) add icinga check for certs [puppet] - https://gerrit.wikimedia.org/r/510082 (https://phabricator.wikimedia.org/T221346)
[09:34:26] vgutierrez: can merge your change too?
[09:34:44] uh?
[09:34:45] (CR) jerkins-bot: [V: -1] mcrouter: feat(T221346) add icinga check for certs [puppet] - https://gerrit.wikimedia.org/r/510082 (https://phabricator.wikimedia.org/T221346) (owner: Fsero)
[09:34:55] Vgutierrez: ATS: Improve paths handling for multi-instance support (19153d5d6e)
[09:35:04] it asks me for multiple when doing puppet-merge
[09:35:17] go for it
[09:35:22] * vgutierrez is a moron
[09:35:31] and ema must be laughing pretty hard right now
[09:36:35] apergos: let me know when the merge is completed, please
[09:37:00] all through
[09:37:07] thx
[09:37:53] (PS1) Volans: icinga: increase the open files limit [puppet] - https://gerrit.wikimedia.org/r/510451 (https://phabricator.wikimedia.org/T220297)
[09:38:38] Operations, SRE-Access-Requests, Patch-For-Review, Release-Engineering-Team (Watching / External): Grant James Forrester access to contint-admins and contint-docker, and to the ciadmin LDAP group - https://phabricator.wikimedia.org/T223137 (ArielGlenn) You should check in about half an hour once...
[09:45:53] (CR) Hashar: "recheck" [puppet] - https://gerrit.wikimedia.org/r/510082 (https://phabricator.wikimedia.org/T221346) (owner: Fsero)
[09:45:57] Operations, LDAP-Access-Requests, WMF-Legal, WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T222788 (ArielGlenn)
[09:46:45] (PS1) Jcrespo: mariadb-backups: Refactor dumps and snapshots [puppet] - https://gerrit.wikimedia.org/r/510453 (https://phabricator.wikimedia.org/T206203)
[09:46:47] (CR) jerkins-bot: [V: -1] mcrouter: feat(T221346) add icinga check for certs [puppet] - https://gerrit.wikimedia.org/r/510082 (https://phabricator.wikimedia.org/T221346) (owner: Fsero)
[09:46:58] Operations, LDAP-Access-Requests, WMF-Legal, WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T222788 (ArielGlenn) Done. Please verify that this gives you the expected access.
[09:47:46] vgutierrez: ema has to write performance reviews today, he needs a *very* good joke to laugh
[09:48:18] PROBLEM - HHVM rendering on mw2215 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[09:49:10] RECOVERY - HHVM rendering on mw2215 is OK: HTTP OK: HTTP/1.1 200 OK - 77939 bytes in 0.386 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[09:51:12] (PS3) Hashar: local_dev: Add config for dev-images docker-pkg [puppet] - https://gerrit.wikimedia.org/r/510249 (https://phabricator.wikimedia.org/T223329) (owner: Brennen Bearnes)
[09:51:27] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/510249 (https://phabricator.wikimedia.org/T223329) (owner: Brennen Bearnes)
[09:55:04] (CR) Hashar: [V: +1 C: +1] "Ppc result:" [puppet] - https://gerrit.wikimedia.org/r/510249 (https://phabricator.wikimedia.org/T223329) (owner: Brennen Bearnes)
[09:57:04] <_joe_> jouncebot: next
[09:57:04] In 1 hour(s) and 2 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190515T1100)
[09:57:25] (CR) Jbond: "All Comments resolved will push in ~30 minutes unless further comments" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/510216 (https://phabricator.wikimedia.org/T144169) (owner: Jbond)
[10:03:45] (PS1) Arturo Borrero Gonzalez: openstack: wmcs-cold-migrate: fix typo in error message [puppet] - https://gerrit.wikimedia.org/r/510455
[10:04:12] !log elukey@deploy1001 Started deploy [analytics/superset/deploy@9cdb9c5]: Superset 0.32 - update pyhive dependency
[10:04:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:17] Operations, Traffic, media-storage, Patch-For-Review: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035 (ema) Resolved→Open
[10:04:27] (CR) jerkins-bot: [V: -1] openstack: wmcs-cold-migrate: fix typo in error message [puppet] - https://gerrit.wikimedia.org/r/510455 (owner: Arturo Borrero Gonzalez)
[10:04:39] !log elukey@deploy1001 Finished deploy [analytics/superset/deploy@9cdb9c5]: Superset 0.32 - update pyhive dependency (duration: 00m 26s)
[10:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:44] (PS2) Elukey: Move superset.wikimedia.org to analytics-tool1004 [puppet] - https://gerrit.wikimedia.org/r/510310 (https://phabricator.wikimedia.org/T212243)
[10:06:51] Operations, Traffic, media-storage, Patch-For-Review: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035 (ema) The issue is indeed reproducible again, affecting ATS hosts. Swift is still retur...
[10:08:05] (PS2) Arturo Borrero Gonzalez: openstack: wmcs-cold-migrate: cleanup [puppet] - https://gerrit.wikimedia.org/r/510455
[10:08:48] (CR) Arturo Borrero Gonzalez: [C: +2] openstack: wmcs-cold-migrate: cleanup [puppet] - https://gerrit.wikimedia.org/r/510455 (owner: Arturo Borrero Gonzalez)
[10:12:12] Operations, Operations-Software-Development, Patch-For-Review, cloud-services-team (Kanban): Flake8 for python files without extension in puppet repo - https://phabricator.wikimedia.org/T144169 (jbond) @Volans >utils/pcc left this till last as im not sure how/where its called >modules/mediawi...
[10:12:57] (PS7) Jbond: CI Tests: add a check to ensure all python files have a py extension [puppet] - https://gerrit.wikimedia.org/r/510216 (https://phabricator.wikimedia.org/T144169)
[10:16:36] Operations, Traffic, media-storage, Patch-For-Review: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035 (ema) Actually no, we did fix the issue at the Swift layer (T162348), hence we removed t...
[10:18:30] (CR) Elukey: [C: +2] Move superset.wikimedia.org to analytics-tool1004 [puppet] - https://gerrit.wikimedia.org/r/510310 (https://phabricator.wikimedia.org/T212243) (owner: Elukey)
[10:18:39] (PS3) Elukey: Move superset.wikimedia.org to analytics-tool1004 [puppet] - https://gerrit.wikimedia.org/r/510310 (https://phabricator.wikimedia.org/T212243)
[10:19:56] (CR) Filippo Giunchedi: [C: +2] conftool-data: add restbase10[19-27] [puppet] - https://gerrit.wikimedia.org/r/509423 (https://phabricator.wikimedia.org/T219404) (owner: Filippo Giunchedi)
[10:20:07] (PS2) Filippo Giunchedi: conftool-data: add restbase10[19-27] [puppet] - https://gerrit.wikimedia.org/r/509423 (https://phabricator.wikimedia.org/T219404)
[10:20:12] (PS8) Jbond: CI Tests: add a check to ensure all python files have a py extension [puppet] - https://gerrit.wikimedia.org/r/510216 (https://phabricator.wikimedia.org/T144169)
[10:21:19] running puppet on cp text nodes
[10:22:38] (PS2) Jcrespo: mariadb-backups: Refactor dumps and snapshots [puppet] - https://gerrit.wikimedia.org/r/510453 (https://phabricator.wikimedia.org/T206203)
[10:27:01] !log installing linux 4.9.168-1+deb9u2 kernel on stretch hosts (no reboots, just installing the new package)
[10:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:29] (CR) Jbond: [C: +2] CI Tests: add a check to ensure all python files have a py extension [puppet] - https://gerrit.wikimedia.org/r/510216 (https://phabricator.wikimedia.org/T144169) (owner: Jbond)
[10:31:42] !log superset.wikimedia.org moved to analytics-tool1004 (Buster + python 3.7 + Superset 0.32 upgrade)
[10:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:01] (PS1) Arturo Borrero Gonzalez: openstack: wmcs-cold-migrate: be verbose about the name of the instance [puppet] - https://gerrit.wikimedia.org/r/510463
[10:37:27] (CR) Arturo Borrero Gonzalez: [C: +2] openstack: wmcs-cold-migrate: be verbose about the name of the instance [puppet] - https://gerrit.wikimedia.org/r/510463 (owner: Arturo Borrero Gonzalez)
[10:38:34] (PS1) Jbond: flake8 - misc: add py extension so CI can run [puppet] - https://gerrit.wikimedia.org/r/510465 (https://phabricator.wikimedia.org/T144169)
[10:39:44] Operations, Traffic, observability, PHP 7.2 support, and 2 others: [Regression] Varnish is replacing the detailed HTTP 500 page from PHP 7 with "503 Service Temporarily Unavailable" - https://phabricator.wikimedia.org/T223336 (ArielGlenn) p:Triage→High
[10:40:17] Operations: URL shortener subdomains for useful Wikimedia infrastructure - https://phabricator.wikimedia.org/T223319 (ArielGlenn) p:Triage→High
[10:40:58] Operations, Performance-Team, PHP 7.2 support: Monitoring PHP 7 APC usage - https://phabricator.wikimedia.org/T223180 (ArielGlenn) p:Triage→Normal
[10:41:48] Operations, ops-eqiad, DC-Ops, Discovery: Memory correctable errors -EDAC- elastic1029 - https://phabricator.wikimedia.org/T214283 (ArielGlenn) p:Triage→Normal
[10:42:13] Operations, Traffic, media-storage, Patch-For-Review: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser - https://phabricator.wikimedia.org/T162035 (fgiunchedi) Other data points, the 250px thumb has the correct c-t (image/png) although...
[10:42:29] Operations, Availability (MediaWiki-MultiDC), codfw-rollout: Add redundancy to IRC recent changes service - https://phabricator.wikimedia.org/T128592 (ArielGlenn) p:Triage→Normal
[10:43:31] Operations, Traffic, observability, PHP 7.2 support, and 2 others: [Regression] Varnish is replacing the detailed HTTP 500 page from PHP 7 with "503 Service Temporarily Unavailable" - https://phabricator.wikimedia.org/T223336 (ema) Please provide the full responses, including headers, returned by...
[10:52:18] (PS3) Filippo Giunchedi: conftool-data: add restbase10[19-27] [puppet] - https://gerrit.wikimedia.org/r/509423 (https://phabricator.wikimedia.org/T219404)
[10:52:18] (PS1) Filippo Giunchedi: site: add restbase1019 to production [puppet] - https://gerrit.wikimedia.org/r/510467 (https://phabricator.wikimedia.org/T219404)
[10:52:46] Operations, Proton, Reading-Infrastructure-Team-Backlog: proton experienced a period of high CPU usage, busy queue, lockups - https://phabricator.wikimedia.org/T214975 (Tgr)
[10:53:22] (PS2) Filippo Giunchedi: site: add restbase1019 to production [puppet] - https://gerrit.wikimedia.org/r/510467 (https://phabricator.wikimedia.org/T219404)
[10:53:22] (PS4) Filippo Giunchedi: conftool-data: add restbase10[19-27] [puppet] - https://gerrit.wikimedia.org/r/509423 (https://phabricator.wikimedia.org/T219404)
[10:54:32] (CR) Filippo Giunchedi: [C: +2] site: add restbase1019 to production [puppet] - https://gerrit.wikimedia.org/r/510467 (https://phabricator.wikimedia.org/T219404) (owner: Filippo Giunchedi)
[10:59:10] Operations, PHP 7.2 support, Wikimedia-production-error: PHP7 opcache sometimes corrupts when cleared (was: Fatal ConfigException, undefined InitialiseSettings variable) - https://phabricator.wikimedia.org/T221347 (ArielGlenn)
[10:59:11] Operations, Discovery-Search (Current work): Elasticsearch nodes overloading in eqiad - https://phabricator.wikimedia.org/T220901 (ArielGlenn)
[11:00:04] MaxSem, RoanKattouw, and Niharika: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190515T1100).
[11:00:05] MatmaRex and _joe_: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:15] hi
[11:00:43] Operations, Operations-Software-Development, Patch-For-Review, cloud-services-team (Kanban): Flake8 for python files without extension in puppet repo - https://phabricator.wikimedia.org/T144169 (Volans) >>! In T144169#5184068, @jbond wrote: > @Volans > >>utils/pcc > left this till last as im no...
[11:01:51] <_joe_> hi
[11:02:00] <_joe_> I'm here debugging an issue
[11:03:15] <_joe_> but I'm here
[11:03:22] <_joe_> who's SWATTING today?
[11:03:41] Operations, Cloud-VPS, Toolforge, LDAP, and 2 others: LDAP server running out of memory frequently and disrupting Cloud VPS clients - https://phabricator.wikimedia.org/T217280 (ArielGlenn) Is https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/496991/ still on the table for merging, or do we a...
[11:03:49] Operations, ops-codfw, Traffic, Patch-For-Review: lvs2006 crashed into (what it seems) an unrecoverable state - https://phabricator.wikimedia.org/T209337 (ArielGlenn) It's been blocked for some months; where are we on this?
[11:04:21] <_joe_> MatmaRex: it looks like no one is around for this SWAT window
[11:04:28] (PS4) Alexandros Kosiaris: Add logging support [software/service-checker] - https://gerrit.wikimedia.org/r/495238
[11:05:17] Operations, Phabricator, serviceops, Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568 (ArielGlenn) The blocking ticket is closed; what else is needed for this to move forward?
[11:05:25] Operations, Phabricator, serviceops, Release-Engineering-Team (Watching / External): Reimage both phab1001 and phab2001 to stretch - https://phabricator.wikimedia.org/T190568 (ArielGlenn)
[11:05:48] _joe_: uhh, the list of deployers is weird. aren't all these people in SF?
[11:05:59] <_joe_> yup
[11:05:59] where it's 4 am right now
[11:06:12] <_joe_> I think some people are leaving for the hackathon already
[11:06:13] _joe_: could you deploy it? D:
[11:06:25] <_joe_> I can, technically, but I shouldn't
[11:07:24] why shouldn't you?
[11:07:32] hm, i guess i don't know how y'all coordinate deployments
[11:07:39] (i can't deploy it even technically)
[11:08:06] <_joe_> well I'm in SRE, I have the right to deploy but I'm supposed to limit it to small config changes/emergency merges
[11:08:17] <_joe_> having said that, let me look at your patches
[11:08:27] <_joe_> if they're simple enough, I'm willing to deploy them
[11:08:46] <_joe_> but it's two backports, I see :/
[11:08:54] Operations, observability, Patch-For-Review, User-CDanis, User-fgiunchedi: Better organization for SRE grafana dashboards - https://phabricator.wikimedia.org/T178690 (Tgr) It would be nice to make it a little clearer what the intended replacement is (either put it in the task description or t...
[11:09:23] _joe_: thanks. it's one logical change that i split into two commits since it seemed nicer that way
[11:10:44] <_joe_> yeah that's not the problem, it's they're in a branch, and I'm not 100% confident deploying a full branch
[11:10:55] <_joe_> please note I have the same issue with *my* patches :D
[11:12:08] i think folks would usually just sync the individual files that were changed
[11:12:56] <_joe_> MatmaRex: hold on for now, I'm discussing the best course of action with others
[11:13:21] (CR) Filippo Giunchedi: [C: +2] "PCC https://puppet-compiler.wmflabs.org/compiler1001/16568/" [puppet] - https://gerrit.wikimedia.org/r/510467 (https://phabricator.wikimedia.org/T219404) (owner: Filippo Giunchedi)
[11:13:40] sure
[11:16:59] (PS1) Arturo Borrero Gonzalez: openstack: wmcs-cold-migrate: speed up rsync [puppet] - https://gerrit.wikimedia.org/r/510472 (https://phabricator.wikimedia.org/T223272)
[11:17:55] (CR) Arturo Borrero Gonzalez: [C: +2] openstack: wmcs-cold-migrate: speed up rsync [puppet] - https://gerrit.wikimedia.org/r/510472 (https://phabricator.wikimedia.org/T223272) (owner: Arturo Borrero Gonzalez)
[11:21:26] PROBLEM - Restbase root url on restbase1019 is CRITICAL: connect to address 10.64.0.100 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase
[11:21:35] <_joe_> uh?
[11:21:50] that's me, provisioning
[11:21:54] I'll silence it
[11:21:55] <_joe_> hah ok
[11:22:27] bummer we can't silence non-existing objects in icinga pre-provisioning
[11:22:28] RECOVERY - Restbase root url on restbase1019 is OK: HTTP OK: HTTP/1.1 200 - 16254 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/RESTBase
[11:22:52] (PS9) Alexandros Kosiaris: First draft of a wikibase-termbox chart [deployment-charts] - https://gerrit.wikimedia.org/r/505788 (https://phabricator.wikimedia.org/T220402)
[11:23:27] (PS10) Alexandros Kosiaris: Introduce the wikibase-termbox chart [deployment-charts] - https://gerrit.wikimedia.org/r/505788 (https://phabricator.wikimedia.org/T220402)
[11:23:41] <_joe_> MatmaRex: ok let's see if I manage not to break things
[11:23:42] (CR) Alexandros Kosiaris: [V: +2 C: +2] Introduce the wikibase-termbox chart [deployment-charts] - https://gerrit.wikimedia.org/r/505788 (https://phabricator.wikimedia.org/T220402) (owner: Alexandros Kosiaris)
[11:24:16] (PS1) Arturo Borrero Gonzalez: openstack: wmcs-cold-migrate: fix misplaced option in rsync ssh command [puppet] - https://gerrit.wikimedia.org/r/510473 (https://phabricator.wikimedia.org/T223272)
[11:24:25] _joe_: thanks :o
[11:24:55] (CR) Arturo Borrero Gonzalez: [C: +2] openstack: wmcs-cold-migrate: fix misplaced option in rsync ssh command [puppet] - https://gerrit.wikimedia.org/r/510473 (https://phabricator.wikimedia.org/T223272) (owner: Arturo Borrero Gonzalez)
[11:25:10] PROBLEM - cassandra-a CQL 10.64.0.101:9042 on restbase1019 is CRITICAL: connect to address 10.64.0.101 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[11:25:19] yes yes
[11:26:41] godog: \o/
[11:26:43] thank youuuuuuuu
[11:27:15] <_joe_> MatmaRex: first I need to find out where the instructions are on wikitech
[11:27:17] mobrovac: np!
[11:27:19] !log akosiaris@deploy1001 scap-helm citoid upgrade -f citoid-staging-values.yaml staging stable/citoid [namespace: citoid, clusters: staging]
[11:27:20] !log akosiaris@deploy1001 scap-helm citoid cluster staging completed
[11:27:20] !log akosiaris@deploy1001 scap-helm citoid finished
[11:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:16] heh akosiaris, we'll have to work on scap-helm's verbosity, it's a bit too eager to log
[11:28:30] especially the scap-helm help command :P
[11:28:31] _joe_: in a branch, as opposes to not being in git?
[11:28:31] (PS3) Jcrespo: mariadb-backups: Refactor dumps and snapshots [puppet] - https://gerrit.wikimedia.org/r/510453 (https://phabricator.wikimedia.org/T206203)
[11:28:33] it's going away rather soon
[11:28:42] as in hopefully in the next 2 weeks
[11:29:20] !log upgrade to statsd_export 0.9 for citoid T220709
[11:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:24] T220709: Upgrade statsd_exporter to 0.9 - https://phabricator.wikimedia.org/T220709
[11:29:24] (CR) Jcrespo: [C: +2] mariadb-backups: Refactor dumps and snapshots [puppet] - https://gerrit.wikimedia.org/r/510453 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo)
[11:29:27] <_joe_> Krinkle: I need to merge cherry-picks to wmf.4
[11:29:30] <_joe_> for MatmaRex
[11:29:35] <_joe_> and then for my work
[11:29:52] <_joe_> and there are clear and detailed instructions on wikitech, but I fail to find them
[11:30:15] _joe_: yeah, just wanted to ask what characteristic you meant as being unusual or different. All files are in a branch somewhere, assuming they're in git.
[11:30:26] _joe_: https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment ? https://wikitech.wikimedia.org/wiki/How_to_deploy_code ?
[11:30:36] <_joe_> yeah I'm just used to sync things from mediawiki-config
[11:30:39] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/510443 (owner: Muehlenhoff)
[11:30:45] <_joe_> MatmaRex: neither, apparently
[11:30:52] Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (ArielGlenn)
[11:31:02] !log bootstrap restbase1019-a - T219404
[11:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:06] T219404: rack/setup/install restbase10[19-27].eqiad.wmnet - https://phabricator.wikimedia.org/T219404
[11:31:14] _joe_: right. So non core repos are connected to core's wmf branch as git submodulew
[11:31:59] !log akosiaris@deploy1001 scap-helm citoid upgrade -f citoid-codfw-values.yaml production stable/citoid [namespace: citoid, clusters: codfw]
[11:32:01] !log akosiaris@deploy1001 scap-helm citoid cluster codfw completed
[11:32:01] _joe_: +2 in Gerrit will, once Jenkins integrated it and merged it, result in an automatic commit updating the core branch submodule
[11:32:01] !log akosiaris@deploy1001 scap-helm citoid finished
[11:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:27] (CR) Arturo Borrero Gonzalez: [C: -1] flake8 - misc: add py extension so CI can run (1 comment) [puppet] - https://gerrit.wikimedia.org/r/510465 (https://phabricator.wikimedia.org/T144169) (owner: Jbond)
[11:32:27] <_joe_> Krinkle: yes but there are security patches and such
[11:33:11] Then you pull it down from the php core directory, which will auto rebase. Then status of ext dir will be dirty
[11:33:18] (CR) Arturo Borrero Gonzalez: [C: -1] "> Patch Set 1: Code-Review-1" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/510465 (https://phabricator.wikimedia.org/T144169) (owner: Jbond)
[11:33:38] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[11:33:55] Cd to ext dir and if no sec patches there, cd back and apply submodule update
[11:34:03] Then stage, test, deploy
[11:34:44] _joe_: Lucas_WMDE is here and can take over if you want
[11:34:57] <_joe_> yes please :)
[11:35:09] o/
[11:35:10] <_joe_> Krinkle: it's almost certainly clear to me, but I prefer not to improvise
[11:35:13] * Lucas_WMDE reads backlog
[11:35:28] Operations, ops-eqiad, Patch-For-Review, cloud-services-team (Kanban): Cloud Services: reallocate workload from rack B5-eqiad - https://phabricator.wikimedia.org/T223148 (hashar) `integration` is covered. webperformance.integration.eqiad.wmflabs and integration-slave-docker-1050 would have to b...
[11:35:51] <_joe_> Lucas_WMDE: TL;DR is - no one's around for SWAT
[11:36:07] just back from lunch
[11:36:11] I can swat something if needed
[11:36:26] <_joe_> hashar: yeah there are patches by both MatmaRex and me
[11:36:43] <_joe_> and I don't really feel confident deploying changes to core
[11:36:52] I dont either
[11:36:59] I just push buttons and cross fingers nowadays
[11:37:12] <_joe_> well you know which buttons to push I guess
[11:37:14] <_joe_> I don't
[11:37:28] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[11:37:36] git pull && scap sync-dir .../php-x-yz/extensions/Bar ;)
[11:37:49] this page has pretty good documentation https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers
[11:37:52] for backports as well IIRC
[11:37:57] Lucas_WMDE: which patch do you need?
[11:38:02] I don’t need any patch
[11:38:07] but I could deploy as well if needed
[11:38:53] <_joe_> Lucas_WMDE: that's what I was searching for, yes
[11:39:43] <_joe_> hashar: https://wikitech.wikimedia.org/wiki/Deployments#Wednesday,_May_15
[11:39:55] <_joe_> it's 4 patches total
[11:40:00] <_joe_> 2 by MatmaRex, 2 by me
[11:40:29] (CR) Giuseppe Lavagetto: [C: +2] Handle application/octet-stream requests properly; release v0.1.5 [software/service-checker] - https://gerrit.wikimedia.org/r/507531 (https://phabricator.wikimedia.org/T220401) (owner: Mobrovac)
[11:41:05] !log akosiaris@deploy1001 scap-helm citoid upgrade -f citoid-eqiad-values.yaml production stable/citoid [namespace: citoid, clusters: eqiad]
[11:41:06] !log akosiaris@deploy1001 scap-helm citoid cluster eqiad completed
[11:41:06] !log akosiaris@deploy1001 scap-helm citoid finished
[11:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:09] _joe_: I dont think we need to backport your patches to wmf branches
[11:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:23] seems instead we can just disable the php7 betafeature from mediawiki-config?
[11:41:27] jouncebot: now
[11:41:27] For the next 0 hour(s) and 18 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190515T1100)
[11:41:30] <_joe_> hashar: we did that
[11:41:33] ^ link with anchor
[11:41:38] <_joe_> hashar: and we need this too
[11:42:05] <_joe_> gory details at https://phabricator.wikimedia.org/T219128
[11:43:36] <_joe_> go on with MatmaRex's patches
[11:43:36] but why do you need the code to be removed if the feature is disabled/gone anyway?
[11:43:41] <_joe_> i can work on mine
[11:43:53] the code removal can be done as part of the usual train next week sin't it?
[11:43:58] <_joe_> https://phabricator.wikimedia.org/T219128#5183787
[11:44:15] <_joe_> the two need to go together, really
[11:44:30] now all logged-in users are back to HHVM.
[11:44:30] <_joe_> but I can deploy my code myself, please help out MatmaRex :)
[11:44:32] so solved?
[11:44:42] <_joe_> no
[11:44:53] <_joe_> we want them treated as the other users
[11:45:24] <_joe_> but, can we postpone this discussion and deploy the other patches that were scheduled in this window?
[11:45:53] Hey was wondering if i can add a last min config change to eu swat, its just a simple enabling for wgUseSandboxLink
[11:46:10] Zppix: we won’t even have enough time for the changes already in the window, sorry
[11:46:12] <_joe_> Zppix: I doubt it, we're already short on swatters
[11:46:21] Understood, thanks anyways!
[11:46:28] (PS2) Jbond: flake8 - misc: add py extension so CI can run [puppet] - https://gerrit.wikimedia.org/r/510465 (https://phabricator.wikimedia.org/T144169)
[11:46:35] !log akosiaris@deploy1001 scap-helm mathoid upgrade --wait -f mathoid-staging-values.yaml staging stable/mathoid [namespace: mathoid, clusters: staging]
[11:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:46:41] let’s +2 MatmaRex’ backports? they’ll take a while to go through CI anyways
[11:46:42] hashar: it's not strictly removing redundant/unused code, it's changing behavior. We need the new behavior
[11:46:43] Good luck :)
[11:46:48] <_joe_> Lucas_WMDE: +1
[11:47:00] okay, doing
[11:47:00] !log akosiaris@deploy1001 scap-helm mathoid cluster staging completed
[11:47:00] !log akosiaris@deploy1001 scap-helm mathoid finished
[11:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:47:50] MatmaRex: your VisualEditor patches are in the pipeline
[11:47:58] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[11:48:07] yeah, i've been following the conversation. thanks
[11:48:13] expected ^. I should actually fix that now
[11:49:18] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[11:51:28] _joe_: Krinkle: ok I got it, it is a commit having two features (remove beta features code which is no more needed and switch logged in users to be sampled)
[11:51:45] I would argue that the removal of the beta features code is not worth backporting but thenn... the patches are there
[11:52:43] anyway +2ed both
[11:53:53] Operations, Traffic, observability, PHP 7.2 support, and 2 others: [Regression] Varnish is replacing the detailed HTTP 500 page from PHP 7 with "503 Service Temporarily Unavailable" - https://phabricator.wikimedia.org/T223336 (Joe) Hi, I've tested a few combinations of errors, and the the only ca...
[11:54:05] since I don’t think we’ll actually deploy four patches in the seven minutes left… [11:54:10] 10Operations, 10observability, 10PHP 7.2 support, 10Performance-Team (Radar), 10User-jijiki: [Regression] Varnish is replacing the detailed HTTP 500 page from PHP 7 with "503 Service Temporarily Unavailable" - https://phabricator.wikimedia.org/T223336 (10Joe) [11:54:22] does the pre-train sanity break still apply if the train only happens later? [11:54:22] 10Operations, 10observability, 10serviceops, 10PHP 7.2 support, and 2 others: [Regression] Varnish is replacing the detailed HTTP 500 page from PHP 7 with "503 Service Temporarily Unavailable" - https://phabricator.wikimedia.org/T223336 (10Joe) [11:54:31] the calendar has two breaks but only one train [11:54:42] (though I don’t know why the train is called “European” if it uses the later window, but perhaps I’m just confused) [11:54:58] <_joe_> the deployment calendar is confusing today, indeed [11:55:55] 10Operations, 10observability, 10serviceops, 10PHP 7.2 support, and 2 others: [Regression] fatal-errors.php action=segfault results in a 503 error under php7-fpm. - https://phabricator.wikimedia.org/T223336 (10Joe) [11:57:07] hashar: I don't know for sure, but I think the issue is that the sampling and beta cannot be compatible. Beta is opt-in. Sampling applies to everyone (out of sample is forced hhvm, in sample is forced php 7). It cannot be together with beta feature code. Previously it worked because one is for logged in and one for logged out. 
[11:57:39] 10Operations, 10Phabricator, 10serviceops, 10Release-Engineering-Team (Kanban), and 2 others: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832 (10ArielGlenn) [11:57:40] Krinkle: yeah sorry, I am probably just overthinking ;D [11:57:59] and gate-and-submit-swat triggers the php71 jobs :/ [11:58:24] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [11:58:30] is it me or is prometheus not in the best place right now? [11:58:40] oh enwiki timeout? [11:58:49] <_joe_> on rb [11:58:55] all the wikis seem super slow [11:59:06] 10Operations, 10observability, 10serviceops, 10PHP 7.2 support, and 2 others: [Regression] fatal-errors.php action=segfault results in a 503 error under php7-fpm. - https://phabricator.wikimedia.org/T223336 (10Joe) I changed the title of the task to reflect my findings, and changed the associated tags acc... [11:59:20] <_joe_> indeed they do [11:59:23] everything seems super slow [11:59:28] I can't even open phabricator [11:59:36] <_joe_> so, network? [11:59:36] network issues? [11:59:38] Oh [11:59:38] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [11:59:41] and some esams probe died [11:59:44] Is that why phab is not working for me? 
[11:59:55] <_joe_> esams I guess akosiaris [11:59:58] using el wp and it's fast enough (logged in user) [12:00:03] seems fast again now [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190515T1200) [12:00:12] <_joe_> wtf was that [12:00:15] yeah everything seems fine again [12:00:25] almost certainly something network [12:00:25] <_joe_> we're having a critical lvs on text-lb in esams though [12:00:31] and indeed phab is now zippy again [12:00:37] <_joe_> let's see if that comes back or not [12:02:06] <_joe_> yes it's gone away [12:02:08] incoming spike at esams per librenms [12:02:44] Krinkle: _joe_ and wmf prod does not use php7.0 but php7.2 isn't it? [12:02:54] 7.2 indeed [12:03:08] Yep [12:03:39] <_joe_> yes [12:04:05] (03PS1) 10Jbond: flake8 - pcc: add extension so file can be checked [puppet] - 10https://gerrit.wikimedia.org/r/510489 (https://phabricator.wikimedia.org/T144169) [12:04:33] (03CR) 10Jbond: [C: 03+2] Skip profile::rsyslog::kafka_shipper on trusty [puppet] - 10https://gerrit.wikimedia.org/r/510443 (owner: 10Muehlenhoff) [12:04:41] (03PS2) 10Jbond: Skip profile::rsyslog::kafka_shipper on trusty [puppet] - 10https://gerrit.wikimedia.org/r/510443 (owner: 10Muehlenhoff) [12:05:34] MatmaRex: looks like one of your backports has a CI failure already :/ [12:05:46] wait [12:06:01] sorry, that’s _joe_’s change [12:06:15] <_joe_> Lucas_WMDE: which one? [12:06:29] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/510431 [12:06:32] <_joe_> I'm following an outage right now [12:06:37] not yet finished but you can see the failure in zuul already [12:07:03] so we probably shouldn’t deploy right now? while the outage is ongoing? 
[12:07:06] Be sure to file a Shared Build Failure task about the flaky ci issue [12:07:25] <_joe_> it is finished, but [12:07:27] <_joe_> [E] [MWBOT] Login failed: WikiAdmin@http://127.0.0.1:9412/ [12:07:29] <_joe_> 13:57:11 Unhandled rejection Error: Could not login: WrongToken [12:07:37] <_joe_> this doesn't seem related to my patch, sorry [12:08:10] I think it's not [12:08:22] <_joe_> 14:00:17 ConfigException: HashConfig::get: undefined option: 'AllowConfirmedEmail' [12:08:29] <_joe_> anyways, let's not deploy those [12:08:38] <_joe_> I'll deal with them later [12:09:29] so MatmaRex’ changes were both merged, I think [12:09:34] but neither was deployed yet [12:09:38] yeah, they seem to have merged [12:09:51] but if we’re in an outage then perhaps we shouldn’t deploy at the moment? [12:10:05] (also the sanity break question is still open as far as I’m aware) [12:10:11] <_joe_> no go on [12:10:11] (03CR) 10Jbond: flake8 - misc: add py extension so CI can run (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/510465 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [12:10:15] ok [12:10:24] <_joe_> and the sanity break for a train that's not happening? [12:10:59] I think having undeployed changes in the deployment branch isn’t especially sane either [12:11:08] <_joe_> +1 [12:11:08] so let’s deploy, I guess [12:11:41] hashar doesn’t seem to be on the deployment server right now, so I assume he’s doing something else and I’ll do the deployment [12:11:55] Lucas_WMDE: yes please ) [12:12:08] the sanity train can be ignored [12:12:15] I am not going to deploy the train this afternoon but later tonight [12:12:16] ;) [12:12:43] ok thanks [12:13:25] MatmaRex: can you test your patches? [12:13:34] yeah [12:13:34] and if yes, can you test them individually or should I deploy both at once? [12:14:12] MatmaRex: the first one should be on mwdebug1002 now [12:14:55] Lucas_WMDE: both, please. 
the first one should be a no-op, but the second depends on those changes [12:15:04] ok [12:15:30] now both should be on mwdebug1002 [12:16:12] Lucas_WMDE: i'm testing. the pages are taking forever to load, as usual on the mwdebug machines [12:16:30] (the test is to visit https://de.wikipedia.org/w/index.php?title=Wikipedia:Spielwiese&veaction=edit and confirms that it loads the visual editor) [12:17:04] (03CR) 10Arturo Borrero Gonzalez: "> Patch Set 2:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/510465 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [12:17:09] yeah, mwdebug1002 has hhvm at 100% CPU for some reason [12:17:25] IIRC whatever causes that eventually times out and then it’s fast again for a while [12:17:43] but I don’t remember what it was [12:18:08] Lucas_WMDE: anyway, it seems to be working! [12:18:26] alright, then I’ll deploy it [12:18:26] well, VE doesn't quite open because it times out after 30 seconds. but it starts loading, so that's enough [12:18:28] (03PS1) 10Alexandros Kosiaris: Depool esams, network issues [dns] - 10https://gerrit.wikimedia.org/r/510492 [12:20:11] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Depool esams, network issues [dns] - 10https://gerrit.wikimedia.org/r/510492 (owner: 10Alexandros Kosiaris) [12:20:22] (03CR) 10Alexandros Kosiaris: [C: 03+2] Depool esams, network issues [dns] - 10https://gerrit.wikimedia.org/r/510492 (owner: 10Alexandros Kosiaris) [12:20:49] !log depool esams, network issues [12:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:55] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.34.0-wmf.4/extensions/VisualEditor/: SWAT: [[gerrit:510217|VisualEditorHooks: Use isVisualAvailable() when changing tabs/editsections]] + [[gerrit:510218|DesktopArticleTarget.init: Allow veaction=edit to override namespace settings (T221892)]] (duration: 01m 15s) [12:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:00] 
T221892: &veaction=edit no longer overrides namespace settings (can't edit global sandbox page using VE) - https://phabricator.wikimedia.org/T221892 [12:22:02] okay, the link you posted still loads the visual editor on the prod servers [12:22:04] so I think it’s working, yay [12:22:19] !log EU SWAT done [12:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:47] Lucas_WMDE: thanks! [12:23:08] np [12:23:51] <_joe_> hashar: any idea what is happening with https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikimediaEvents/+/510431/ ? [12:24:48] <_joe_> Lucas_WMDE: I'll remove your -2 when we're ready to merge my changes later in the day, ok? [12:24:57] sure [12:25:01] 10Operations, 10Operations-Software-Development, 10Patch-For-Review, 10cloud-services-team (Kanban): Flake8 for python files without extension in puppet repo - https://phabricator.wikimedia.org/T144169 (10jbond) have created changes for [[https://gerrit.wikimedia.org/r/510489| pcc]] and [[https://gerrit.wi... [12:25:03] I was going to remove it myself as soon as it goes through [12:25:07] just wanted to block the auto-merge [12:25:44] <_joe_> no you did well I understand [12:25:51] <_joe_> thanks [12:27:10] _joe_: Lucas_WMDE: seems some selenium job fails to login :-((( [12:27:33] <_joe_> I... don't see how it can be related to my change? 
[12:27:36] the browser test suite is way too sensitive at login, I believe it is subject to some race condition [12:27:49] <_joe_> anyways, I'm afk for now [12:27:58] Could not login: WrongToken [12:28:01] it got confused :/ [12:28:01] <_joe_> I'll ask to merge my patches first in the evening swat [12:32:05] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:32:06] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:32:06] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:32:06] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase 
[12:32:06] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [12:32:08] PROBLEM - wiki content on commons on commons.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/project/view/1118/ [12:32:39] * arturo paged [12:32:58] Hmm [12:32:58] Commons is not loading for me [12:33:06] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [12:33:06] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:33:12] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:33:12] Nor is wikipedia [12:33:13] no, we're having issues [12:33:14] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:33:23] folks are looking into it [12:33:30] PROBLEM - graphoid endpoints health on scb1003 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [12:33:41] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:33:41] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 24.33 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [12:33:49] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [12:33:50] (03PS1) 10Giuseppe Lavagetto: Revert "Depool esams, network issues" [dns] - 10https://gerrit.wikimedia.org/r/510495 [12:33:55] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:33:55] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedi [12:33:55] se [12:33:55] <_joe_> akosiaris: ^^ [12:33:58] _joe_: +! 
[12:33:58] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:33:59] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:33:59] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:33:59] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:34:00] _joe_: +1 [12:34:10] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: 
/en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:34:11] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:34:11] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [12:34:11] PROBLEM - graphoid endpoints health on scb1002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [12:34:14] PROBLEM - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Graphoid [12:34:19] PROBLEM - graphoid endpoints health on scb1001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [12:34:20] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above 
the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [12:34:36] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [12:34:36] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:34:36] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:34:45] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Revert "Depool esams, network issues" [dns] - 10https://gerrit.wikimedia.org/r/510495 (owner: 10Giuseppe Lavagetto) [12:35:28] (03CR) 10Filippo Giunchedi: [C: 03+1] Revert "Depool esams, network issues" [dns] - 10https://gerrit.wikimedia.org/r/510495 (owner: 10Giuseppe Lavagetto) [12:35:28] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: 
/en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:35:28] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:35:28] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:35:28] PROBLEM - graphoid endpoints health on scb1004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [12:35:28] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:35:28] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:35:28] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: 
All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [12:35:32] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [12:35:50] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:35:50] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:36:02] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:36:20] PROBLEM - wikidata.org dispatch lag is REALLY high ---4000s- on www.wikidata.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/project/view/71/ [12:36:37] PROBLEM - Host text-lb.eqiad.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [12:37:27] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:37:34] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:37:41] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:38:36] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 106 probes of 458 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [12:38:45] (03PS4) 10星耀晨曦: Enable FlaggedRevisions on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507932 (https://phabricator.wikimedia.org/T221933) [12:39:39] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:39:50] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [12:40:03] PROBLEM - SSH on lvs1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:40:09] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) 
timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:40:09] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:40:10] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:40:27] PROBLEM - Host en.planet.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [12:40:45] (03CR) 10jerkins-bot: [V: 04-1] Enable FlaggedRevisions on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507932 (https://phabricator.wikimedia.org/T221933) (owner: 10星耀晨曦) [12:41:47] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 12.98 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [12:41:49] PROBLEM - Wikitech-static main page has content on labweb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [12:41:54] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:41:54] RECOVERY - graphoid endpoints health on scb1003 is OK: All endpoints are 
healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [12:42:02] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:42:04] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:42:04] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:42:07] RECOVERY - Host text-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [12:42:07] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:42:08] RECOVERY - graphoid endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [12:42:12] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [12:42:14] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [12:42:14] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:42:18] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:42:18] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:42:18] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:42:19] RECOVERY - restbase endpoints health on 
restbase1013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:42:19] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:42:22] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:42:24] RECOVERY - Host en.planet.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [12:42:28] RECOVERY - graphoid endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [12:42:28] (03PS5) 10星耀晨曦: Enable FlaggedRevisions on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507932 (https://phabricator.wikimedia.org/T221933) [12:42:29] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [12:42:32] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [12:42:36] RECOVERY - Graphoid LVS eqiad on graphoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Graphoid [12:42:36] RECOVERY - graphoid endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [12:42:42] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:42:52] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [12:42:52] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:42:52] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:42:52] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:42:54] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:42:58] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:43:02] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:43:04] RECOVERY - Wikitech-static main page has content on labweb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 33523 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [12:43:06] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Citoid [12:43:08] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:43:11] RECOVERY - wiki content on commons on commons.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 164979 bytes in 0.020 second response time https://phabricator.wikimedia.org/project/view/1118/ [12:43:15] 04Critical Alert for device cr1-eqiad.wikimedia.org - Primary inbound port utilisation over 80% [12:44:24] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:44:34] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 92.99 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [12:46:56] RECOVERY - wikidata.org dispatch lag is REALLY high ---4000s- on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 2006 bytes in 0.110 second response time https://phabricator.wikimedia.org/project/view/71/ [12:47:36] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 89.18 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [12:48:28] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [utfa] https://wikitech.wikimed [12:48:28] ase [12:49:00] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 5 probes of 459 (alerts on 35) - 
https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [12:49:43] (03CR) 10A2093064: [C: 04-1] Enable FlaggedRevisions on zhwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507932 (https://phabricator.wikimedia.org/T221933) (owner: 10星耀晨曦) [12:51:10] 10Operations, 10Release Pipeline, 10Maps (Kartotherian), 10Patch-For-Review: Create blubberfile for deploying kartotherian into docker environment. - https://phabricator.wikimedia.org/T223275 (10hashar) To run the pipeline, we need a CI configuration change in `integration/config.git` which would look like... [12:51:12] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [12:52:26] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [12:53:15] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr1-eqiad.wikimedia.org recovered from Primary inbound port utilisation over 80% [12:53:44] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [12:53:52] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [12:55:48] !sal [12:55:48] https://wikitech.wikimedia.org/wiki/Server_Admin_Log https://tools.wmflabs.org/sal/production See it and you will know all you need. 
[12:56:11] (03CR) 10星耀晨曦: Enable FlaggedRevisions on zhwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507932 (https://phabricator.wikimedia.org/T221933) (owner: 10星耀晨曦) [12:58:20] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 50.22 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [12:59:08] (03PS6) 10星耀晨曦: Enable FlaggedRevisions on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/507932 (https://phabricator.wikimedia.org/T221933) [13:03:52] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 86.03 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [13:09:18] PROBLEM - Check systemd state on analytics-tool1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:10:29] PROBLEM - superset on analytics-tool1003 is CRITICAL: connect to address 10.64.36.112 and port 9080: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Superset [13:10:42] (03PS1) 10BBlack: geoip: add blackhole to localhost capability [dns] - 10https://gerrit.wikimedia.org/r/510500 [13:10:44] (03PS1) 10BBlack: Temporarily blackhole Hetzner via DNS [dns] - 10https://gerrit.wikimedia.org/r/510501 [13:10:55] superset is downtime expired [13:10:56] my bad [13:11:04] (03CR) 10jerkins-bot: [V: 04-1] geoip: add blackhole to localhost capability [dns] - 10https://gerrit.wikimedia.org/r/510500 (owner: 10BBlack) [13:11:06] (03CR) 10jerkins-bot: [V: 04-1] Temporarily blackhole Hetzner via DNS [dns] - 10https://gerrit.wikimedia.org/r/510501 (owner: 10BBlack) [13:11:08] (03CR) 10Ottomata: [EventBus] Add eventgate-main event service. 
(031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510299 (https://phabricator.wikimedia.org/T222822) (owner: 10Ppchelko) [13:12:22] 04Critical Alert for device cr2-esams.wikimedia.org - Primary inbound port utilisation over 80% [13:12:50] (03PS2) 10BBlack: geoip: add blackhole to localhost capability [dns] - 10https://gerrit.wikimedia.org/r/510500 [13:12:52] (03PS2) 10BBlack: Temporarily blackhole Hetzner via DNS [dns] - 10https://gerrit.wikimedia.org/r/510501 [13:13:40] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /api/rest_v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a res [13:13:40] d: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [13:14:54] (03CR) 10BBlack: [C: 03+2] geoip: add blackhole to localhost capability [dns] - 10https://gerrit.wikimedia.org/r/510500 (owner: 10BBlack) [13:15:02] (03CR) 10BBlack: [C: 03+2] Temporarily blackhole Hetzner via DNS [dns] - 10https://gerrit.wikimedia.org/r/510501 (owner: 10BBlack) [13:16:25] (03PS1) 10Elukey: Set analytics-tool1004 as primary superset host [puppet] - 10https://gerrit.wikimedia.org/r/510502 (https://phabricator.wikimedia.org/T212243) [13:16:56] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [13:17:22] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [13:18:13] (03PS1) 10Anomie: Set 
ActorTableSchemaMigrationStage => write-new/read-new on test wikis & mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510503 (https://phabricator.wikimedia.org/T188327) [13:18:24] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 40.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [13:18:45] (03CR) 10Anomie: [C: 03+2] "Deploying planned config change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510503 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [13:19:48] (03Merged) 10jenkins-bot: Set ActorTableSchemaMigrationStage => write-new/read-new on test wikis & mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510503 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [13:20:09] (03CR) 10jenkins-bot: Set ActorTableSchemaMigrationStage => write-new/read-new on test wikis & mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/510503 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [13:20:11] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/16569/" [puppet] - 10https://gerrit.wikimedia.org/r/510502 (https://phabricator.wikimedia.org/T212243) (owner: 10Elukey) [13:21:15] !log anomie@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Setting actor migration to write-new/read-new on testwikis and mediawikiwiki (T188327) (duration: 00m 57s) [13:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:21] T188327: Deploy refactored actor storage - https://phabricator.wikimedia.org/T188327 [13:21:43] hi [13:21:55] https://phabricator.wikimedia.org/T222994 this should be UBN [13:22:09] It is happening to several users, and it prevents them the basic functionality of Commons: uploading files. 
[13:22:50] There's still hundreds of files being uploaded every hour [13:22:51] https://commons.wikimedia.org/w/index.php?title=Special:NewFiles&offset=&limit=500 [13:23:30] it doesn't happen to everybody, but yet it is a serious issue [13:24:38] <_joe_> yannf: it's happening to you? if so, did you opt-in to the php7 beta? [13:25:08] not to me [13:25:20] but several users have reported it [13:26:08] would php7 change something? [13:26:51] <_joe_> it potentially could, who knows :) [13:27:22] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-esams.wikimedia.org recovered from Primary inbound port utilisation over 80% [13:28:10] 10Operations, 10Performance-Team, 10observability, 10serviceops, and 2 others: [Regression] fatal-errors.php action=segfault results in a 503 error under php7-fpm. - https://phabricator.wikimedia.org/T223336 (10Krinkle) [13:30:59] _joe_, where is this php7 option? [13:31:25] <_joe_> yannf: it was removed today [13:31:33] <_joe_> so it might be simpler for now [13:31:50] I don't see it in [[Special:Preferences#mw-prefsection-betafeatures]] [13:31:53] <_joe_> to ask people if the problem resolved itself [13:32:14] I don't understand [13:32:14] <_joe_> yannf: yes it was removed as part of the php7 migration, the beta was closed earlier today [13:32:33] <_joe_> so logged-in users are (for now) all on HHVM [13:37:34] (03PS1) 10BBlack: Block two other hosting networks [dns] - 10https://gerrit.wikimedia.org/r/510510 [13:40:22] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [13:40:30] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [13:42:34] PROBLEM - puppet last run on netmon2001 is CRITICAL: CRITICAL: Failed to apply catalog, zero 
resources tracked by Puppet. It might be a dependency cycle. [13:44:03] 10Operations, 10Release Pipeline, 10Maps (Kartotherian), 10Patch-For-Review: Create blubberfile for deploying kartotherian into docker environment. - https://phabricator.wikimedia.org/T223275 (10MSantos) @hashar and @Mathew.onipe, kartotherian is considered a third-party and is supposed to have its own com... [13:50:09] (03PS2) 10Andrew Bogott: Stop installing the obsolete puppet-common transition package [puppet] - 10https://gerrit.wikimedia.org/r/510159 (owner: 10Muehlenhoff) [13:51:32] 10Operations, 10Release Pipeline, 10Maps (Kartotherian), 10Patch-For-Review: Create blubberfile for deploying kartotherian into docker environment. - https://phabricator.wikimedia.org/T223275 (10MSantos) If we need the CI to read the source repo, we might need to have a specific WMF fork of kartotherian/ti... [13:51:41] (03CR) 10Andrew Bogott: [C: 03+2] Stop installing the obsolete puppet-common transition package [puppet] - 10https://gerrit.wikimedia.org/r/510159 (owner: 10Muehlenhoff) [13:52:43] (03PS2) 10Andrew Bogott: Replace git-sync-upstream on labspuppetmasters, remove from puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/509915 (https://phabricator.wikimedia.org/T171188) [13:55:26] (03CR) 10Andrew Bogott: [C: 03+2] Replace git-sync-upstream on labspuppetmasters, remove from puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/509915 (https://phabricator.wikimedia.org/T171188) (owner: 10Andrew Bogott) [13:57:29] (03CR) 10BBlack: [C: 03+2] Block two other hosting networks [dns] - 10https://gerrit.wikimedia.org/r/510510 (owner: 10BBlack) [14:09:24] RECOVERY - puppet last run on netmon2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:12:14] RECOVERY - cassandra-a CQL 10.64.0.101:9042 on restbase1019 is OK: TCP OK - 0.000 second response time on 10.64.0.101 port 9042 https://phabricator.wikimedia.org/T93886 [14:12:15] (03PS1) 10BBlack: Revert "Block two other 
hosting networks" [dns] - 10https://gerrit.wikimedia.org/r/510521 [14:12:54] (03CR) 10BBlack: [C: 03+2] Revert "Block two other hosting networks" [dns] - 10https://gerrit.wikimedia.org/r/510521 (owner: 10BBlack) [14:13:13] (03PS1) 10BBlack: Revert "Temporarily blackhole Hetzner via DNS" [dns] - 10https://gerrit.wikimedia.org/r/510522 [14:13:21] (03CR) 10jerkins-bot: [V: 04-1] Revert "Temporarily blackhole Hetzner via DNS" [dns] - 10https://gerrit.wikimedia.org/r/510522 (owner: 10BBlack) [14:13:56] (03PS2) 10BBlack: Revert "Temporarily blackhole Hetzner via DNS" [dns] - 10https://gerrit.wikimedia.org/r/510522 [14:14:20] (03CR) 10BBlack: [C: 03+2] Revert "Temporarily blackhole Hetzner via DNS" [dns] - 10https://gerrit.wikimedia.org/r/510522 (owner: 10BBlack) [14:15:09] !log bootstrap restbase1019-b - T219404 [14:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:15] T219404: rack/setup/install restbase10[19-27].eqiad.wmnet - https://phabricator.wikimedia.org/T219404 [14:16:34] !log depooling labweb1002 to test https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/509916/ [14:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:11] (03PS2) 10Andrew Bogott: labweb/wikitech: set PHP version to 7.2 in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/509916 (owner: 10Dzahn) [14:18:54] (03CR) 10Andrew Bogott: [C: 03+2] labweb/wikitech: set PHP version to 7.2 in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/509916 (owner: 10Dzahn) [14:19:08] PROBLEM - puppet last run on mw1299 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/etc/systemd/system/nginx.service.d/security.conf],File[/usr/local/bin/apache-status] [14:23:13] 10Operations: Investigate increase in tx bandwidth usage for mc1033 - https://phabricator.wikimedia.org/T223310 (10elukey) @aaron hi! 
I am trying to figure out if Redis traffic for mc1033 is causing this increase in bandwidth usage. From tcpdump I can see a lot of traffic for `GET global:Wikimedia\Rdbms\Chronolo... [14:28:16] !log repooling labweb1002 [14:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:11] (03CR) 10Andrew Bogott: [C: 03+2] "This is applied and everything looks good. Is that all we need for the php upgrade? I'm sure I can tell what version it's actually runni" [puppet] - 10https://gerrit.wikimedia.org/r/509916 (owner: 10Dzahn) [14:40:35] (03PS1) 10Papaul: DNS: Remove mgmt DNS for labtestweb2001 [dns] - 10https://gerrit.wikimedia.org/r/510536 [14:41:13] (03PS5) 10Filippo Giunchedi: conftool-data: add restbase10[19-27] [puppet] - 10https://gerrit.wikimedia.org/r/509423 (https://phabricator.wikimedia.org/T219404) [14:41:15] (03PS1) 10Filippo Giunchedi: site: move restbase102[0-7] to production [puppet] - 10https://gerrit.wikimedia.org/r/510537 (https://phabricator.wikimedia.org/T219404) [14:41:20] <_joe_> andrewbogott: we will need to do some more work to fix everything, but that's a start :) [14:41:42] _joe_: just cleaning up old packages? Or is there more to it? [14:41:50] <_joe_> that [14:42:02] <_joe_> and then probably fixing the vhosts for the actual site [14:42:11] <_joe_> for wikitech we can try a full switch [14:42:13] <_joe_> IMHO [14:42:26] <_joe_> switchback is going to be a simple revert, in case [14:42:27] (03CR) 10Alexandros Kosiaris: "many +1s just for the tests!" 
[software/service-checker] - 10https://gerrit.wikimedia.org/r/507531 (https://phabricator.wikimedia.org/T220401) (owner: 10Mobrovac) [14:42:49] (03CR) 10Filippo Giunchedi: [C: 03+2] "Note this will not start bootstrapping cassandra but will start restbase, which will be depooled until https://gerrit.wikimedia.org/r/c/op" [puppet] - 10https://gerrit.wikimedia.org/r/510537 (https://phabricator.wikimedia.org/T219404) (owner: 10Filippo Giunchedi) [14:43:47] (03PS5) 10Paladox: prometheus: Add gerrit.yaml under targets [puppet] - 10https://gerrit.wikimedia.org/r/510251 [14:44:46] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [14:45:12] (03CR) 10Papaul: [C: 03+2] DNS: Remove mgmt DNS for labtestweb2001 [dns] - 10https://gerrit.wikimedia.org/r/510536 (owner: 10Papaul) [14:45:54] RECOVERY - puppet last run on mw1299 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:46:31] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommmision: labtestweb2001.wikimedia.org - https://phabricator.wikimedia.org/T218024 (10Papaul) [14:47:14] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommmision: labtestweb2001.wikimedia.org - https://phabricator.wikimedia.org/T218024 (10Papaul) 05Open→03Resolved complete [14:47:35] (03CR) 10Elukey: [C: 03+2] Set analytics-tool1004 as primary superset host [puppet] - 10https://gerrit.wikimedia.org/r/510502 (https://phabricator.wikimedia.org/T212243) (owner: 10Elukey) [14:47:41] (03PS2) 10Elukey: Set analytics-tool1004 as primary superset host [puppet] - 10https://gerrit.wikimedia.org/r/510502 (https://phabricator.wikimedia.org/T212243) [14:48:47] PROBLEM - LVS HTTPS IPv4 on text-lb.esams.wikimedia.org is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:48:47] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [14:50:07] RECOVERY - LVS HTTPS IPv4 on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 16078 bytes in 5.763 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:52:32] RECOVERY - Check systemd state on analytics-tool1003 is OK: OK - running: The system is fully operational [14:53:10] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/mobile-sections/{ti [14:53:10] sections for a test page on enwiki) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [14:53:34] RECOVERY - superset on analytics-tool1003 is OK: TCP OK - 0.000 second response time on 10.64.36.112 port 9080 https://wikitech.wikimedia.org/wiki/Analytics/Systems/Superset [14:54:19] this was me --^ [14:56:34] (03PS1) 10Andrew Bogott: Remove labs-ns and labs-recursor names [dns] - 10https://gerrit.wikimedia.org/r/510546 (https://phabricator.wikimedia.org/T221183) [14:58:22] 04Critical Alert for device cr2-esams.wikimedia.org - Primary inbound port utilisation over 80% [14:58:51] Hmm [14:59:23] Well Wikipedia not working for me :( [14:59:42] paladox: very much known, unfortunately [15:00:10] known issue, folks are working on it [15:00:14] see topic [15:00:18] (sorry) [15:00:24] Ah thanks, so it’s still ongoing. 
Thought it was fixed [15:00:25] 10Operations, 10ops-codfw, 10Traffic, 10Patch-For-Review: lvs2006 crashed into (what it seems) an unrecoverable state - https://phabricator.wikimedia.org/T209337 (10Papaul) @ArielGlenn that system is out of warranty and the plan is to replace it with the systems in T196560 [15:00:28] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [15:02:56] nope, still slugging away at it [15:03:08] I guess folks will change the topic once it's resolved for sure [15:08:12] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [15:09:50] (03PS2) 10Alexandros Kosiaris: profile::redis::master: Switch hiera() to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/509873 [15:09:52] (03PS1) 10Alexandros Kosiaris: kube-apiserver: Don't alert on long CONNECTs [puppet] - 10https://gerrit.wikimedia.org/r/510555 [15:15:45] (03CR) 10Fsero: [C: 03+1] kube-apiserver: Don't alert on long CONNECTs [puppet] - 10https://gerrit.wikimedia.org/r/510555 (owner: 10Alexandros Kosiaris) [15:17:21] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-esams.wikimedia.org recovered from Primary inbound port utilisation over 80% [15:19:31] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/510546 (https://phabricator.wikimedia.org/T221183) (owner: 10Andrew Bogott) [15:21:24] 10Operations, 10Analytics, 10Analytics-Kanban, 10vm-requests, and 2 others: Create an-tool1005 (Staging environment for Superset) - https://phabricator.wikimedia.org/T217738 (10elukey) @MoritzMuehlenhoff I am ready to make this host a proper staging environment for superset, let me know if we can proceed o... 
[15:21:32] !log reedy@deploy1001 Synchronized php-1.34.0-wmf.4/extensions/GrowthExperiments/includes/HelpPanel/QuestionPoster.php: T222980 (duration: 00m 57s) [15:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:38] T222980: ConfigException: HashConfig::get: undefined option: 'AllowConfirmedEmail' - https://phabricator.wikimedia.org/T222980 [15:27:12] 08Warning Alert for device cr2-esams.wikimedia.org - Memory over 85% [15:30:11] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:40:54] (03PS6) 10Fsero: mcrouter: feat(T221346) add icinga check for certs [puppet] - 10https://gerrit.wikimedia.org/r/510082 (https://phabricator.wikimedia.org/T221346) [15:41:49] (03CR) 10jerkins-bot: [V: 04-1] mcrouter: feat(T221346) add icinga check for certs [puppet] - 10https://gerrit.wikimedia.org/r/510082 (https://phabricator.wikimedia.org/T221346) (owner: 10Fsero) [15:44:47] (03PS7) 10Fsero: mcrouter: feat(T221346) add icinga check for certs [puppet] - 10https://gerrit.wikimedia.org/r/510082 (https://phabricator.wikimedia.org/T221346) [15:45:35] (03CR) 10jerkins-bot: [V: 04-1] mcrouter: feat(T221346) add icinga check for certs [puppet] - 10https://gerrit.wikimedia.org/r/510082 (https://phabricator.wikimedia.org/T221346) (owner: 10Fsero) [15:45:53] 10Operations, 10Reading-Infrastructure-Team-Backlog, 10Traffic, 10Maps (Tilerator): Tilerator should purge Varnish cache - https://phabricator.wikimedia.org/T109776 (10LGoto) p:05High→03Normal [15:47:17] jouncebot: now [15:47:17] No deployments scheduled for the next 0 hour(s) and 12 minute(s) [15:47:18] (03PS8) 10Fsero: mcrouter: feat(T221346) add icinga check for certs [puppet] - 10https://gerrit.wikimedia.org/r/510082 (https://phabricator.wikimedia.org/T221346) [15:47:19] jouncebot: next [15:47:19] In 0 hour(s) and 12 minute(s): Morning SWAT (Max 6 patches) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190515T1600) [15:47:52] Reedy: You've just about got time… ;-) [15:48:03] James_F: The patches are in the SWAT [15:48:08] (03CR) 10jerkins-bot: [V: 04-1] mcrouter: feat(T221346) add icinga check for certs [puppet] - 10https://gerrit.wikimedia.org/r/510082 (https://phabricator.wikimedia.org/T221346) (owner: 10Fsero) [15:48:14] See when jerkins merges them ;P [15:48:19] Excuses, excuses. :-) [15:48:58] We should add the PHP7 tag to the list of tags not to display, I guess? [15:49:04] Or should we wait? Hmm. [15:50:40] I think wait for now [15:51:04] Kk. [15:52:14] While we've still got some users using hhvm, knowing easily if an edit/similar causes problems on PHP7 it's worthwhile [15:53:38] <_joe_> indeed [15:53:43] <_joe_> we want to see it James_F [15:54:00] <_joe_> also we're having 95% of users on HHVM [15:55:22] <_joe_> we'll move to 10% tomorrow [15:55:28] <_joe_> supposed to be now, but you know [15:58:09] it's always the internets fault [15:59:43] (03CR) 10CDanis: "dumb question where does the default come from on our hosts?" [puppet] - 10https://gerrit.wikimedia.org/r/510451 (https://phabricator.wikimedia.org/T220297) (owner: 10Volans) [15:59:57] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "small nits, lgtm otherwise." (033 comments) [software/service-checker] - 10https://gerrit.wikimedia.org/r/495238 (owner: 10Alexandros Kosiaris) [16:00:04] MaxSem, RoanKattouw, and Niharika: Your horoscope predicts another unfortunate Morning SWAT (Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190515T1600). [16:00:04] _joe_ and Zoranzoki21: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:18] <_joe_> here I am [16:00:32] <_joe_> Reedy: will you deploy both? 
[16:00:40] (03PS6) 10Jcrespo: network::constants: Move mysql_root_clients from special_hosts [puppet] - 10https://gerrit.wikimedia.org/r/505407 (https://phabricator.wikimedia.org/T220894) (owner: 10Alex Monk) [16:00:45] Yeah [16:00:50] <_joe_> <3 [16:00:53] just waiting for jekins [16:00:57] +r [16:01:09] <_joe_> I can try to test them on mwdebug once they're in deploy1001 [16:01:30] <_joe_> uhm [16:01:38] <_joe_> something funny happening with appservers I think [16:01:44] Not sure if there's much point if the betafeature is already disabled [16:01:45] PROBLEM - HHVM jobrunner on mw1308 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Jobrunner [16:01:59] <_joe_> ' [16:02:09] <_joe_> Reedy: there is a small change in behaviour [16:02:37] <_joe_> uhm it's the jobrunners [16:02:42] <_joe_> a ton of videos I'd say [16:02:59] RECOVERY - HHVM jobrunner on mw1308 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [16:03:15] !log disable puppet on all production databases [16:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:26] <_joe_> Reedy: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikimediaEvents/+/508172/ merged [16:03:35] <_joe_> argh sorry [16:03:37] <_joe_> wrong patch [16:03:52] <_joe_> nah they're still both waiting for jerkins [16:03:54] it's just finishing off php72 tests [16:04:01] the second will submit when the first does [16:04:15] MaxSem RoanKattouw Niharika if you're swatting... I may have an urgent CentralNotice patch to deploy in about 45 min, if that's doable? [16:04:17] 10Operations: maint-announce calendar: Google changes made "no action needed button" disappear - https://phabricator.wikimedia.org/T223388 (10Dzahn) "Group settings have been updated. If you are experiencing issues, check the FAQ for help." -- As pointed out by @ArielGlenn we had to first check what group typ... 
[16:04:19] thx in advance! [16:04:32] very exciting browsertests [16:04:39] _joe_: there is user [16:04:43] <_joe_> ??? [16:04:58] they uploaded quite a few 2.5 GB videos [16:05:01] Reedy: You doing SWAT? [16:05:06] Aye [16:05:16] _joe_: https://commons.wikimedia.org/wiki/Special:NewFiles?user=&mediatype%5B%5D=VIDEO&start=&end=&wpFormIdentifier=specialnewimages&limit=50&offset= [16:05:16] <_joe_> jijiki: so expected [16:05:19] (03CR) 10Jcrespo: [C: 03+2] network::constants: Move mysql_root_clients from special_hosts [puppet] - 10https://gerrit.wikimedia.org/r/505407 (https://phabricator.wikimedia.org/T220894) (owner: 10Alex Monk) [16:06:03] AndyRussG: Reedy is swat-ing. :) [16:06:25] I might die of boredom first though [16:06:29] <_joe_> yeah [16:06:36] jerkins is only just kicking off phpunit tests on that one [16:06:42] Niharika: gotcha thanks... Reedy just gonna try a revert patch and see if it fixes stuff on the beta cluster [16:06:43] <_joe_> https://memegenerator.net/img/instances/65392936/waiting-for-jenkins-to-build.jpg [16:07:03] maybe a book would help? [16:07:10] (03CR) 10Volans: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/510451 (https://phabricator.wikimedia.org/T220297) (owner: 10Volans) [16:07:11] To whack jenkins with? 
[16:07:16] i always just think "Leeeroy" when waiting for it https://knowyourmeme.com/memes/leeroy-jenkins
[16:07:18] <_joe_> https://memegenerator.net/img/instances/65289046/waiting-for-jenkins-to-finish-build.jpg
[16:07:51] <_joe_> it's 20 minutes
[16:07:56] <_joe_> this is ridiculous
[16:07:59] PROBLEM - Host mw1296 is DOWN: PING CRITICAL - Packet loss = 100%
[16:08:07] RECOVERY - Host mw1296 is UP: PING OK - Packet loss = 0%, RTA = 4.50 ms
[16:08:08] _joe_: We've got proof PHP7.2 isn't faster
[16:08:17] HHVM was finished ages ago :P
[16:08:33] success
[16:09:19] <_joe_> .5 merged
[16:09:47] <_joe_> .4 too
[16:10:07] (CR) Jcrespo: "The change is functionally correct, verifying now no production impact before running it everywhere:" [puppet] - https://gerrit.wikimedia.org/r/505407 (https://phabricator.wikimedia.org/T220894) (owner: Alex Monk)
[16:10:38] _joe_: both are on mwdebug
[16:10:48] <_joe_> Reedy: thanks
[16:10:55] uh, 1002
[16:11:01] probably useful information to give
[16:11:16] <_joe_> I know you devs
[16:11:24] <_joe_> you say "mwdebug" you mean the default one
[16:11:37] <_joe_> I can't seem to render enwiki there
[16:11:58] <_joe_> php7 works
[16:12:19] <_joe_> mwPhp7Seed=c41
[16:12:24] <_joe_> even as a logged in user
[16:12:26] <_joe_> ok
[16:12:58] #wmhack
[16:13:05] sorry, I wanted to join
[16:13:15] <_joe_> ok, it works
[16:13:21] <_joe_> Reedy: you can deploy
[16:13:25] sweet
[16:13:27] _joe_: bookmarked, I’ve been thinking about making an image like this several times already ^^
[16:13:43] <_joe_> Lucas_WMDE: I just searched google
[16:13:55] of course it already existed :D
[16:13:57] <_joe_> we're not the first ones frustrated with a sluggish ci
[16:14:45] (CR) Reedy: "Not SWAT-ing due to CR-1" [mediawiki-config] - https://gerrit.wikimedia.org/r/510269 (https://phabricator.wikimedia.org/T219617) (owner: Zoranzoki21)
[16:14:46] (CR) Hashar: "Running locally with: bundle exec rake -j 1 test" [puppet] - https://gerrit.wikimedia.org/r/510082 (https://phabricator.wikimedia.org/T221346) (owner: Fsero)
[16:14:49] (PS4) Reedy: Enable SandboxLink on papwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/510142 (https://phabricator.wikimedia.org/T223166) (owner: Zppix)
[16:14:51] (CR) Reedy: [C: +2] Enable SandboxLink on papwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/510142 (https://phabricator.wikimedia.org/T223166) (owner: Zppix)
[16:14:54] nice of you to join us wikibugs
[16:14:56] !log reedy@deploy1001 Synchronized php-1.34.0-wmf.4/extensions/WikimediaEvents/: T219128 (duration: 01m 06s)
[16:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:01] T219128: Remove php7 beta feature - https://phabricator.wikimedia.org/T219128
[16:15:05] (Merged) jenkins-bot: Enable SandboxLink on papwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/510142 (https://phabricator.wikimedia.org/T223166) (owner: Zppix)
[16:15:07] (CR) jenkins-bot: Enable SandboxLink on papwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/510142 (https://phabricator.wikimedia.org/T223166) (owner: Zppix)
[16:15:22] (CR) Ppchelko: [EventBus] Add eventgate-main event service. (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/510299 (https://phabricator.wikimedia.org/T222822) (owner: Ppchelko)
[16:15:52] Operations, serviceops, Patch-For-Review, User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (Joe)
[16:15:59] Operations, serviceops, Beta-Feature, MW-1.34-notes (1.34.0-wmf.6; 2019-05-21), and 2 others: Remove php7 beta feature - https://phabricator.wikimedia.org/T219128 (Joe) Open→Resolved
[16:16:14] !log reedy@deploy1001 Synchronized php-1.34.0-wmf.5/extensions/WikimediaEvents/: T219128 (duration: 01m 13s)
[16:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:36] (PS4) Fsero: local_dev: Add config for dev-images docker-pkg [puppet] - https://gerrit.wikimedia.org/r/510249 (https://phabricator.wikimedia.org/T223329) (owner: Brennen Bearnes)
[16:17:04] (PS1) Ottomata: Remove deprecated eventgate-analytics chart [deployment-charts] - https://gerrit.wikimedia.org/r/510564
[16:17:21] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T223166 (duration: 00m 56s)
[16:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:17:25] T223166: Enable Extension:SandboxLink on papwiki - https://phabricator.wikimedia.org/T223166
[16:18:17] (PS1) Bstorm: wiki replicas: depool labsdb1010 for view updates [puppet] - https://gerrit.wikimedia.org/r/510565 (https://phabricator.wikimedia.org/T212972)
[16:19:34] (CR) Fsero: [C: +2] local_dev: Add config for dev-images docker-pkg [puppet] - https://gerrit.wikimedia.org/r/510249 (https://phabricator.wikimedia.org/T223329) (owner: Brennen Bearnes)
[16:20:28] ^ thanks for that.
[16:21:12] (CR) Jcrespo: [C: -1] wiki replicas: depool labsdb1010 for view updates [puppet] - https://gerrit.wikimedia.org/r/510565 (https://phabricator.wikimedia.org/T212972) (owner: Bstorm)
[16:21:40] (CR) Jcrespo: [C: -1] "Coordinate with me, there is ongoing db maintenance on labsdbs" [puppet] - https://gerrit.wikimedia.org/r/510565 (https://phabricator.wikimedia.org/T212972) (owner: Bstorm)
[16:22:30] (CR) Bstorm: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/510565 (https://phabricator.wikimedia.org/T212972) (owner: Bstorm)
[16:23:30] AndyRussG: I've gotta go... Someone else should be good to help you SWAT if you need help :)
[16:23:40] Reedy: ok....
[16:23:45] (CR) Ayounsi: [C: +1] Add LibreNMS parity check report. [software/netbox-reports] - https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507) (owner: CRusnov)
[16:23:54] Or I'll be around a bit later this evening (my time)
[16:24:05] Reedy: thx! no worries eh :)
[16:24:47] MaxSem RoanKattouw Niharika or anyone else for SWAT?
[16:25:03] PROBLEM - Disk space on mw1299 is CRITICAL: DISK CRITICAL - free space: /tmp 229 MB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[16:26:45] !log T212972 updated all views on labsdb1009
[16:26:47] (CR) Ottomata: [EventBus] Add eventgate-main event service. (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/510299 (https://phabricator.wikimedia.org/T222822) (owner: Ppchelko)
[16:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:26:51] T212972: Remove reference to text fields replaced by the comment table from WMCS views - https://phabricator.wikimedia.org/T212972
[16:27:45] RECOVERY - Disk space on mw1299 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[16:28:05] !log restart nutcracker on mw2240 to pick up the new config (no more memcached settings)
[16:28:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:28:17] so this --^ should not alarm
[16:28:21] if it does, my fault :)
[16:28:43] (for nutcracker memcached port not set/present)
[16:30:59] (PS1) Papaul: DNS: Remove mgmt DNS for labtestservices2001 [dns] - https://gerrit.wikimedia.org/r/510567
[16:31:56] (CR) Papaul: [C: +2] DNS: Remove mgmt DNS for labtestservices2001 [dns] - https://gerrit.wikimedia.org/r/510567 (owner: Papaul)
[16:32:36] Operations, ops-codfw, decommission, Patch-For-Review: decommission: labtestservices2001.wikimedia.org - https://phabricator.wikimedia.org/T218022 (Papaul)
[16:32:54] Operations, ops-codfw, decommission, Patch-For-Review: decommission: labtestservices2001.wikimedia.org - https://phabricator.wikimedia.org/T218022 (Papaul) Open→Resolved Complete
[16:32:57] (PS9) Fsero: mcrouter: feat(T221346) add icinga check for certs [puppet] - https://gerrit.wikimedia.org/r/510082 (https://phabricator.wikimedia.org/T221346)
[16:33:54] (CR) jerkins-bot: [V: -1] mcrouter: feat(T221346) add icinga check for certs [puppet] - https://gerrit.wikimedia.org/r/510082 (https://phabricator.wikimedia.org/T221346) (owner: Fsero)
[16:33:54] cd /Users/ltoscano/puppet
[16:33:59] ufff
[16:34:08] sorry :)
[16:35:11] (PS10) Fsero: mcrouter: feat(T221346) add icinga check for certs [puppet] - https://gerrit.wikimedia.org/r/510082 (https://phabricator.wikimedia.org/T221346)
[16:36:35] Reedy: oh thanks for swatting my patch :)
[16:38:22] Operations: maint-announce calendar: Google changes made "no action needed button" disappear - https://phabricator.wikimedia.org/T223388 (Dzahn) update: while the missing button is gone and we can "mark as no action needed" again.. the filters seem to have stopped working for now. so filtering "all unresol...
[16:40:25] !log bootstrap restbase1019-c - T219404
[16:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:40:30] T219404: rack/setup/install restbase10[19-27].eqiad.wmnet - https://phabricator.wikimedia.org/T219404
[16:40:40] Operations, WMDE-QWERTY-Team, wikidiff2, WMDE-QWERTY-Sprint-2019-05-15: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391 (thiemowmde)
[16:40:50] Operations, WMDE-QWERTY-Team, wikidiff2, WMDE-QWERTY-Sprint-2019-05-15: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391 (thiemowmde)
[16:47:25] (PS6) 20after4: switch phabricator from phab1001 to phab1003 [puppet] - https://gerrit.wikimedia.org/r/437620 (https://phabricator.wikimedia.org/T196019) (owner: Dzahn)
[16:47:36] hashar: are you doing the train today?
[16:49:27] Operations, Cassandra, RESTBase-Cassandra, Core Platform Team Backlog (Later), and 4 others: Configure a threshold for earlier notification of /srv/cassandra/instance-data - https://phabricator.wikimedia.org/T191659 (ArielGlenn) How is this looking, anyone?
[16:50:09] !log restart Hadoop HDFS namenodes on an-master100[1,2] to pick up new settings
[16:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:50:55] AndyRussG: yeah in a couple hours from now
[16:51:04] 9pm in europe or 19:00 UTC
[16:51:39] I am off for dinner etc
[16:51:40] ;)
[16:51:45] hashar: ok thanks!
[16:58:40] !log T212972 updated all views on labsdb1012
[16:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:58:44] T212972: Remove reference to text fields replaced by the comment table from WMCS views - https://phabricator.wikimedia.org/T212972
[16:59:22] Operations: Investigate increase in tx bandwidth usage for mc1033 - https://phabricator.wikimedia.org/T223310 (Addshore) For things added by wmf.3, it looks like there were indeed some chronology protector changes https://github.com/wikimedia/mediawiki/compare/wmf/1.34.0-wmf.1...wmf/1.34.0-wmf.3 https://ger...
[17:01:47] !log bootstrap restbase1019-c - T219404
[17:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:01:54] T219404: rack/setup/install restbase10[19-27].eqiad.wmnet - https://phabricator.wikimedia.org/T219404
[17:08:03] (PS1) Jbond: CI - python: update python type checking to use mime type [puppet] - https://gerrit.wikimedia.org/r/510575 (https://phabricator.wikimedia.org/T144169)
[17:09:13] !log powerup elastic2038 (was down for maintenance)
[17:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:13:26] Operations, ops-codfw, Discovery-Search (Current work): elastic2038 DOWN (CPU/memory errors ) - https://phabricator.wikimedia.org/T217398 (elukey) Started up just now, forgot to add the task in the SAL :)