[00:01:07] (PS7) Zoranzoki21: Update logos for for tiwiki and tiwikt (part II) [mediawiki-config] - https://gerrit.wikimedia.org/r/592507 (https://phabricator.wikimedia.org/T150618)
[00:03:12] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 87, down: 0, dormant: 0, excluded: 2, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:04:00] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:13:20] Operations, LDAP-Access-Requests, SRE-Access-Requests: Onboarding Wolfgang Kandek - https://phabricator.wikimedia.org/T249352 (wkandek) ` gpg --fingerprint 9B51CE0772203719B26C8ED3EEABB9556398421F pub rsa4096 2020-04-23 [SC] 9B51 CE07 7220 3719 B26C 8ED3 EEAB B955 6398 421F uid [ul...
[00:14:23] Operations, LDAP-Access-Requests, SRE-Access-Requests: Onboarding Wolfgang Kandek - https://phabricator.wikimedia.org/T249352 (wkandek) RSA Key from yubikey: ` ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDZ4qNjoBpqHTj25VV1MgfQiNqv5jupK1FtJ6M84dLxHQZgaoSvoDsoBJrfgVxZXC46b8S31rv46VkG6WPTfN4Y+h8w39RlK3ekUl+...
[01:32:44] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[01:34:34] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:00:56] (PS1) VulpesVulpes825: Enable DiscussionTools as a beta feature on zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/592514 (https://phabricator.wikimedia.org/T251075)
[05:17:19] (PS1) Majavah: Enable cross-project search [mediawiki-config] - https://gerrit.wikimedia.org/r/592515 (https://phabricator.wikimedia.org/T250724)
[05:19:44] (PS2) Majavah: Enable cross-project search on frwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/592515 (https://phabricator.wikimedia.org/T250724)
[05:27:13] (PS1) Marostegui: dbproxy: Depool labsdb1011 [puppet] - https://gerrit.wikimedia.org/r/592516 (https://phabricator.wikimedia.org/T249188)
[05:31:20] Operations, ops-codfw, DBA: Degraded RAID on es2004 - https://phabricator.wikimedia.org/T251017 (Marostegui) a:jcrespo @jcrespo this host is scheduled for decommission {T222592} however it is (or it was) being used for backups, what would you like to do with this task? Do you want the disk replac...
[05:32:03] (CR) Marostegui: [C: +2] dbproxy: Depool labsdb1011 [puppet] - https://gerrit.wikimedia.org/r/592516 (https://phabricator.wikimedia.org/T249188) (owner: Marostegui)
[05:33:51] !log Depool labsdb1011 T249188
[05:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:33:58] T249188: Reimage labsdb1011 to Buster and 10.4 - https://phabricator.wikimedia.org/T249188
[05:47:29] Operations, Cognate, ContentTranslation, DBA, and 10 others: Restart extension1 (x1) database primary master (db1120) - https://phabricator.wikimedia.org/T250701 (Marostegui) >>! In T250701#6078884, @Addshore wrote: > Hmm, Cognate should be in the lists in the description? > Or am I confusing som...
[05:47:38] Operations, ops-eqiad: msw1-a6-eqiad flopping up and down mgmt connections on A6 - https://phabricator.wikimedia.org/T250652 (Marostegui)
[05:47:39] !log rolling restart ats-tls in cp[1085,1089] and text@esams - T249335
[05:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:45] T249335: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335
[05:53:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1104 for defragmentation - T232446', diff saved to https://phabricator.wikimedia.org/P11039 and previous config saved to /var/cache/conftool/dbconfig/20200427-055320-marostegui.json
[05:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:53:28] T232446: Compress new Wikibase tables - https://phabricator.wikimedia.org/T232446
[05:54:55] (PS1) Marostegui: db1104: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/592517
[05:55:55] (CR) Marostegui: [C: +2] db1104: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/592517 (owner: Marostegui)
[05:56:50] !log Compress tables on db1104 - T232446
[05:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:58:48] !log installing git security updates on jessie
[05:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:01] !log Stop MySQL on labsdb1011 for reimage - T249188
[06:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:07] T249188: Reimage labsdb1011 to Buster and 10.4 - https://phabricator.wikimedia.org/T249188
[06:00:53] (CR) Marostegui: [C: +2] wikireplicas: Set innodb_purge_threads to 10 [puppet] - https://gerrit.wikimedia.org/r/591320 (https://phabricator.wikimedia.org/T247978) (owner: Jcrespo)
[06:05:02] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[06:05:18] ^ me
[06:05:20] (PS1) DannyS712: Remove use of `wgAllowImageMoving` [mediawiki-config] - https://gerrit.wikimedia.org/r/592518 (https://phabricator.wikimedia.org/T245293)
[06:05:41] (PS2) DannyS712: Remove use of `wgAllowImageMoving` [mediawiki-config] - https://gerrit.wikimedia.org/r/592518 (https://phabricator.wikimedia.org/T245293)
[06:17:53] checking mw1280!
[06:19:00] Correctable memory error rate exceeded for DIMM_A1 / DIMM_B1
[06:19:08] IIRC this host has a history of failures
[06:19:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[06:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:20] new memory was ordered for mw1280 in https://phabricator.wikimedia.org/T245670
[06:21:01] and was swapped in T240187, so seems this didn't help...
[06:21:02] T240187: mw1280 crashed logging correctable memory errors - https://phabricator.wikimedia.org/T240187
[06:21:08] yeah I was checking
[06:22:05] opening a task and depooling explicitly
[06:22:09] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[06:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:16] (CR) RhinosF1: [C: +1] "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/592514 (https://phabricator.wikimedia.org/T251075) (owner: VulpesVulpes825)
[06:26:58] Operations, ops-eqiad, serviceops: mw1280 correctable memory errors logged in getsel - https://phabricator.wikimedia.org/T251077 (elukey)
[06:30:35] !log elukey@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw1280.eqiad.wmnet
[06:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:03] Operations, ops-eqiad, serviceops: mw1280 correctable memory errors logged in getsel - https://phabricator.wikimedia.org/T251077 (elukey) ` eqiad/api_appserver/apache2/mw1280.eqiad.wmnet: pooled changed yes => inactive eqiad/api_appserver/nginx/mw1280.eqiad.wmnet: pooled changed yes => inactive WARNI...
[06:31:25] ACKNOWLEDGEMENT - Host mw1280 is DOWN: PING CRITICAL - Packet loss = 100% Elukey T251077
[06:31:31] ok done :)
[06:32:05] Operations, serviceops: Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (Joe)
[06:34:06] ACKNOWLEDGEMENT - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui on it https://wikitech.wikimedia.org/wiki/HAProxy
[06:36:01] (PS1) Giuseppe Lavagetto: mcrouter: enable failover route for on all canaries [puppet] - https://gerrit.wikimedia.org/r/592519 (https://phabricator.wikimedia.org/T244852)
[06:36:03] (PS1) Giuseppe Lavagetto: mcrouter: enable the gutter pool everywhere. [puppet] - https://gerrit.wikimedia.org/r/592520 (https://phabricator.wikimedia.org/T244852)
[06:40:45] _joe_ wow! \o/
[06:41:58] <_joe_> elukey: I plan to activate it on the canaries this morning, everywhere tomorrow if we don't see anything funny
[06:42:10] \o/
[06:42:49] !log installing Java security updates on IDP hosts, will void current SSO sessions
[06:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:17] _joe_ sounds wonderful, thanks!
[06:47:02] (CR) JMeybohm: [C: +2] Revert "Update .ruby-version to what is running in production" [puppet] - https://gerrit.wikimedia.org/r/591358 (https://phabricator.wikimedia.org/T250538) (owner: Hashar)
[06:47:49] (CR) JMeybohm: [C: +2] Run rubocop on changes to .ruby-version [puppet] - https://gerrit.wikimedia.org/r/591359 (https://phabricator.wikimedia.org/T250538) (owner: Hashar)
[06:48:12] Operations, ops-codfw, DBA: Degraded RAID on es2004 - https://phabricator.wikimedia.org/T251017 (jcrespo) Open→Declined
[06:54:55] (CR) Jcrespo: [C: +2] mariadb-backups: Check backups size also based on previous runs [puppet] - https://gerrit.wikimedia.org/r/591326 (https://phabricator.wikimedia.org/T138562) (owner: Jcrespo)
[06:58:26] PROBLEM - Check systemd state on contint2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:59:02] !log force ifdown/ifup eno1 on analytics1052 - interface negotiated speed flapping
[06:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:34] <_joe_> elukey: have you tried turn it off and on again?
[06:59:38] <_joe_> :P
[07:01:33] _joe_ I am trying sophisticated fixes to this problem, please don't interfere :P
[07:02:05] (CR) Elukey: [C: +1] "<3" [puppet] - https://gerrit.wikimedia.org/r/592519 (https://phabricator.wikimedia.org/T244852) (owner: Giuseppe Lavagetto)
[07:02:22] <_joe_> did you try to hit the case near the ehternet card?
[07:02:29] <_joe_> a gentle nudge usually fixes the problems
[07:02:46] I'll ask Chris some help if this doesn't work :D
[07:07:24] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[07:07:46] PROBLEM - Host an-worker1089 is DOWN: PING CRITICAL - Packet loss = 100%
[07:08:21] really?
[07:09:11] <_joe_> lol
[07:10:01] (CR) Giuseppe Lavagetto: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/22136/" [puppet] - https://gerrit.wikimedia.org/r/592519 (https://phabricator.wikimedia.org/T244852) (owner: Giuseppe Lavagetto)
[07:10:04] I fix one and another goes down, wack a mole this morning
[07:13:31] (CR) Giuseppe Lavagetto: [C: +2] safe-service-restart: allow for a grace period after depooling [puppet] - https://gerrit.wikimedia.org/r/588961 (owner: Giuseppe Lavagetto)
[07:14:29] !log powercycle an-worker1089 - unreachable via ssh, mgmt serial available, soft cpu lock events registered in dmesg
[07:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:47] <_joe_> elukey: wanna bet on which one's next?
[07:15:04] _joe_ please do
[07:15:11] <_joe_> :D
[07:15:39] !log Kill updateSpecialPages.php wikidatawiki --override --only=Fewestrevisions as it is causing lag - T238199
[07:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:44] T238199: SpecialFewestRevisions::reallyDoQuery takes more than 9h to run - https://phabricator.wikimedia.org/T238199
[07:16:46] RECOVERY - Host an-worker1089 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms
[07:22:17] (PS1) Jcrespo: mariadb-backups: Tune backup size monitoring threasholds [puppet] - https://gerrit.wikimedia.org/r/592599 (https://phabricator.wikimedia.org/T138562)
[07:23:06] (PS2) Jcrespo: mariadb-backups: Tune backup size monitoring thresholds [puppet] - https://gerrit.wikimedia.org/r/592599 (https://phabricator.wikimedia.org/T138562)
[07:25:01] (CR) Jcrespo: "Will compile, but may want input already. Maybe it should be different for dumps and snapshots, as the latter have almost daily frequency?" [puppet] - https://gerrit.wikimedia.org/r/592599 (https://phabricator.wikimedia.org/T138562) (owner: Jcrespo)
[07:25:42] !log roll restart elastic-chi on cloudelastic100[1-4] to pick up the last JVM GC settings - T231517
[07:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:48] T231517: Investigate and fix GC issues on cloudelastic machines - https://phabricator.wikimedia.org/T231517
[07:26:18] (CR) jerkins-bot: [V: -1] mariadb-backups: Tune backup size monitoring thresholds [puppet] - https://gerrit.wikimedia.org/r/592599 (https://phabricator.wikimedia.org/T138562) (owner: Jcrespo)
[07:26:49] (PS3) Jcrespo: mariadb-backups: Tune backup size monitoring thresholds [puppet] - https://gerrit.wikimedia.org/r/592599 (https://phabricator.wikimedia.org/T138562)
[07:29:07] (PS4) Jcrespo: mariadb-backups: Tune backup size monitoring thresholds [puppet] - https://gerrit.wikimedia.org/r/592599 (https://phabricator.wikimedia.org/T138562)
[07:32:28] (PS1) Muehlenhoff: Enable base::service_auto_restart for CAS on fallback servers and staging host [puppet] - https://gerrit.wikimedia.org/r/592600 (https://phabricator.wikimedia.org/T135991)
[07:34:23] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/591330 (https://phabricator.wikimedia.org/T233930) (owner: Jbond)
[07:34:36] (CR) jerkins-bot: [V: -1] Enable base::service_auto_restart for CAS on fallback servers and staging host [puppet] - https://gerrit.wikimedia.org/r/592600 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[07:34:53] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/591331 (https://phabricator.wikimedia.org/T233930) (owner: Jbond)
[07:36:49] (CR) Jbond: [C: +2] idp: alow specifying specific branches for the idp profile [puppet] - https://gerrit.wikimedia.org/r/591330 (https://phabricator.wikimedia.org/T233930) (owner: Jbond)
[07:37:06] (CR) Jbond: [C: +2] idp_test: update idp_test to use staging branch [puppet] - https://gerrit.wikimedia.org/r/591331 (https://phabricator.wikimedia.org/T233930) (owner: Jbond)
[07:40:33] (PS2) Muehlenhoff: Enable base::service_auto_restart for CAS on fallback servers [puppet] - https://gerrit.wikimedia.org/r/592600 (https://phabricator.wikimedia.org/T135991)
[07:44:03] (PS5) Jcrespo: mariadb-backups: Tune backup size monitoring thresholds [puppet] - https://gerrit.wikimedia.org/r/592599 (https://phabricator.wikimedia.org/T138562)
[07:46:36] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 31.76 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[07:50:34] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[07:51:23] (PS3) Muehlenhoff: Enable base::service_auto_restart for CAS on fallback servers [puppet] - https://gerrit.wikimedia.org/r/592600 (https://phabricator.wikimedia.org/T135991)
[07:54:47] (PS10) Jbond: apereo_cas: update templates login page [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/587538 (https://phabricator.wikimedia.org/T233939)
[07:55:36] (CR) Jbond: "I have now uploaded theses changes to the new staging site which should hopefully make it easier to review then screen shots and diffs" [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/587538 (https://phabricator.wikimedia.org/T233939) (owner: Jbond)
[07:56:11] ACKNOWLEDGEMENT - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui on it https://wikitech.wikimedia.org/wiki/HAProxy
[08:01:49] (PS2) Dzahn: ATS: switch backends for and research and bienvenida.wm.org [puppet] - https://gerrit.wikimedia.org/r/591306 (https://phabricator.wikimedia.org/T247650)
[08:02:42] (CR) Marostegui: mariadb-backups: Tune backup size monitoring thresholds (1 comment) [puppet] - https://gerrit.wikimedia.org/r/592599 (https://phabricator.wikimedia.org/T138562) (owner: Jcrespo)
[08:03:49] (CR) Jcrespo: "Indeed, thanks." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/592599 (https://phabricator.wikimedia.org/T138562) (owner: Jcrespo)
[08:04:48] (PS6) Jcrespo: mariadb-backups: Tune backup size monitoring thresholds [puppet] - https://gerrit.wikimedia.org/r/592599 (https://phabricator.wikimedia.org/T138562)
[08:06:20] Puppet, Wikimedia Meet: Puppetize the account manager - https://phabricator.wikimedia.org/T251034 (Dzahn) I'd be happy to help with puppetizing. A first step would be to move the repo from github to Gerrit. Then we can have puppet clone the source. Should i start with requesting the project?
[08:07:22] (CR) Marostegui: [C: +1] mariadb-backups: Tune backup size monitoring thresholds [puppet] - https://gerrit.wikimedia.org/r/592599 (https://phabricator.wikimedia.org/T138562) (owner: Jcrespo)
[08:07:26] (CR) Dzahn: [C: +2] ATS: switch backends for and research and bienvenida.wm.org [puppet] - https://gerrit.wikimedia.org/r/591306 (https://phabricator.wikimedia.org/T247650) (owner: Dzahn)
[08:09:42] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is OK: (C)100 gt (W)80 gt 71.19 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1001&panelId=37
[08:11:29] (CR) Jcrespo: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/22138/db1115.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/592599 (https://phabricator.wikimedia.org/T138562) (owner: Jcrespo)
[08:11:50] (CR) Gergő Tisza: [C: -2] "According to T250878#6083317 this is on hold pending investigation of potential LanguageConverter issues." [mediawiki-config] - https://gerrit.wikimedia.org/r/592455 (https://phabricator.wikimedia.org/T250878) (owner: Zoranzoki21)
[08:12:11] (PS1) Muehlenhoff: Add Cumin alias for idp test [puppet] - https://gerrit.wikimedia.org/r/592601
[08:12:40] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[08:15:31] !log add 80G to prometheus global LV
[08:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:06] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 81.35 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[08:16:08] (PS1) Dzahn: Merge branch 'production' of ssh://gerrit.wikimedia.org:29418/operations/puppet into production [puppet] - https://gerrit.wikimedia.org/r/592602
[08:16:10] (PS1) Dzahn: ATS: switch backends for transparency and transparency-private.wm.org [puppet] - https://gerrit.wikimedia.org/r/592603 (https://phabricator.wikimedia.org/T247650)
[08:16:41] (CR) jerkins-bot: [V: -1] Merge branch 'production' of ssh://gerrit.wikimedia.org:29418/operations/puppet into production [puppet] - https://gerrit.wikimedia.org/r/592602 (owner: Dzahn)
[08:16:42] RECOVERY - Check no envoy runtime configuration is left persistent on idp-test2001 is OK: HTTP OK: HTTP/1.1 200 OK - 286 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[08:18:20] !log running puppet on all cp-ats
[08:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:06] (Abandoned) Dzahn: Merge branch 'production' of ssh://gerrit.wikimedia.org:29418/operations/puppet into production [puppet] - https://gerrit.wikimedia.org/r/592602 (owner: Dzahn)
[08:19:15] (PS2) Dzahn: ATS: switch backends for transparency and transparency-private.wm.org [puppet] - https://gerrit.wikimedia.org/r/592603 (https://phabricator.wikimedia.org/T247650)
[08:24:30] !log Truncating and optimizing parsercache for pc1010 and pc2010 T247787
[08:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:36] T247787: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787
[08:26:12] !log Deploy schema change on s5 codfw, lag will show up - T250055
[08:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:17] T250055: Remove image.img_deleted column from production - https://phabricator.wikimedia.org/T250055
[08:28:26] (PS11) Jbond: apereo_cas: update templates login page [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/587538 (https://phabricator.wikimedia.org/T233939)
[08:29:06] (PS3) Dzahn: ATS: switch backends for the 3 transparency sites [puppet] - https://gerrit.wikimedia.org/r/592603 (https://phabricator.wikimedia.org/T247650)
[08:30:11] (CR) Dzahn: "I don't want to be the one to judge who is or is not harassing others. Is there some decision made by others that could be linked to?" [puppet] - https://gerrit.wikimedia.org/r/592438 (https://phabricator.wikimedia.org/T251001) (owner: Dereckson)
[08:31:27] (CR) Muehlenhoff: [C: +1] "LGTM" [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/587538 (https://phabricator.wikimedia.org/T233939) (owner: Jbond)
[08:31:54] (PS2) Dzahn: Prune non existing domains from Planet [puppet] - https://gerrit.wikimedia.org/r/592439 (https://phabricator.wikimedia.org/T168459) (owner: Dereckson)
[08:33:10] (CR) Jcrespo: [C: +2] transfer.py: Let netcat's listen instance timeout after 10 seconds [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/592247 (owner: Jcrespo)
[08:33:32] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 118 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1001&panelId=37
[08:35:49] (PS1) Jcrespo: mariadb-backups: Update transfer.py to HEAD [puppet] - https://gerrit.wikimedia.org/r/592608 (https://phabricator.wikimedia.org/T138562)
[08:36:12] Operations, DBA, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (Kormat) Open→Resolved Work is now complete.
[08:36:19] (PS3) Dzahn: planet: Prune 2 non existing domains [puppet] - https://gerrit.wikimedia.org/r/592439 (https://phabricator.wikimedia.org/T168459) (owner: Dereckson)
[08:38:14] (CR) Dzahn: [C: +2] planet: Prune 2 non existing domains [puppet] - https://gerrit.wikimedia.org/r/592439 (https://phabricator.wikimedia.org/T168459) (owner: Dereckson)
[08:40:00] (CR) Jcrespo: [C: +2] mariadb-backups: Update transfer.py to HEAD [puppet] - https://gerrit.wikimedia.org/r/592608 (https://phabricator.wikimedia.org/T138562) (owner: Jcrespo)
[08:41:14] (PS1) Dzahn: ATS: switch backends for design and sitemaps static sites [puppet] - https://gerrit.wikimedia.org/r/592610 (https://phabricator.wikimedia.org/T247650)
[08:41:43] Operations, DBA, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (Marostegui) Thank you! For the record: the incident report for this is at: https://wikitech.wikimedia.org/wiki/Incident_documentation/2020031...
[08:42:47] (PS1) Ema: ATS: stop monitoring PURGE fifo logs [puppet] - https://gerrit.wikimedia.org/r/592611 (https://phabricator.wikimedia.org/T248067)
[08:43:09] (PS1) Dzahn: ATS: switch backends for wikiworkshop static sites [puppet] - https://gerrit.wikimedia.org/r/592612 (https://phabricator.wikimedia.org/T247650)
[08:43:47] (CR) Muehlenhoff: [C: +2] Add Cumin alias for idp test [puppet] - https://gerrit.wikimedia.org/r/592601 (owner: Muehlenhoff)
[08:44:42] Operations, observability, cloud-services-team (Kanban): remove cloud "dev" hosts from Icinga? - https://phabricator.wikimedia.org/T250787 (fgiunchedi) Agreed the hosts might as well not show up on `https://icinga.wikimedia.org/alerts`, it seems to me we can extend what we do for test hosts to these...
[08:50:34] Operations, observability, cloud-services-team (Kanban): remove cloud "dev" hosts from Icinga? - https://phabricator.wikimedia.org/T250787 (MoritzMuehlenhoff) We can simply set profile::base::notifications=disabled in Hiera for these?
[08:50:45] (CR) Filippo Giunchedi: [C: +2] logstash: log safepoints only when running the daemon [puppet] - https://gerrit.wikimedia.org/r/587705 (https://phabricator.wikimedia.org/T221052) (owner: Filippo Giunchedi)
[08:50:52] (PS2) Filippo Giunchedi: logstash: log safepoints only when running the daemon [puppet] - https://gerrit.wikimedia.org/r/587705 (https://phabricator.wikimedia.org/T221052)
[08:52:14] !log restarting cas on idp1001 to pick up Java 11 security update (will void active SSO sessions)
[08:52:17] (PS1) Dzahn: planet: remove 3 German blogs that have been closed [puppet] - https://gerrit.wikimedia.org/r/592614 (https://phabricator.wikimedia.org/T168459)
[08:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:18] Operations, observability, cloud-services-team (Kanban): remove cloud "dev" hosts from Icinga? - https://phabricator.wikimedia.org/T250787 (ayounsi) >>! In T250787#6083928, @MoritzMuehlenhoff wrote: > We can simply set profile::base::notifications=disabled in Hiera for these? I didn't check the code,...
[08:53:28] (PS2) Vgutierrez: Release 8.1.0-unreleased-1wm1 [debs/trafficserver] (8.1.x) - https://gerrit.wikimedia.org/r/591308
[08:53:41] (CR) jerkins-bot: [V: -1] Release 8.1.0-unreleased-1wm1 [debs/trafficserver] (8.1.x) - https://gerrit.wikimedia.org/r/591308 (owner: Vgutierrez)
[08:54:15] (CR) Dzahn: [C: +2] "All 3 have explicit messages saying the blogs have been closed, so not just temp issues." [puppet] - https://gerrit.wikimedia.org/r/592614 (https://phabricator.wikimedia.org/T168459) (owner: Dzahn)
[08:57:21] (PS12) Jbond: apereo_cas: update templates login page [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/587538 (https://phabricator.wikimedia.org/T233939)
[09:00:12] PROBLEM - Check systemd state on labsdb1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:00:16] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[09:03:50] RECOVERY - Check systemd state on labsdb1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:06:02] ^ labdb1011 expected
[09:07:02] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37
[09:07:54] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[09:08:15] (CR) Dereckson: [C: -1] "That would be indeed a good idea to include in the task description a link for that." [puppet] - https://gerrit.wikimedia.org/r/592438 (https://phabricator.wikimedia.org/T251001) (owner: Dereckson)
[09:09:08] (PS1) Ema: ATS: stop logging PURGE traffic [puppet] - https://gerrit.wikimedia.org/r/592615 (https://phabricator.wikimedia.org/T248067)
[09:09:34] (Abandoned) Ema: ATS: stop monitoring PURGE fifo logs [puppet] - https://gerrit.wikimedia.org/r/592611 (https://phabricator.wikimedia.org/T248067) (owner: Ema)
[09:10:25] dereckson: i noticed your blog has an issue connecting to its database currently
[09:11:20] (PS1) Jcrespo: Decommission es2001, es2002, es2003, es2004 [puppet] - https://gerrit.wikimedia.org/r/592617 (https://phabricator.wikimedia.org/T222592)
[09:11:24] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is OK: HTTP OK: HTTP/1.0 200 OK - 22716 bytes in 0.259 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[09:11:32] !log Deploy schema change on s1 codfw, lag will show up - T250055
[09:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:39] T250055: Remove image.img_deleted column from production - https://phabricator.wikimedia.org/T250055
[09:12:09] (CR) Jcrespo: "CC @moritz for microcode check code references." [puppet] - https://gerrit.wikimedia.org/r/592617 (https://phabricator.wikimedia.org/T222592) (owner: Jcrespo)
[09:12:53] (CR) Dereckson: [C: +1] planet: remove 3 German blogs that have been closed [puppet] - https://gerrit.wikimedia.org/r/592614 (https://phabricator.wikimedia.org/T168459) (owner: Dzahn)
[09:12:55] (CR) Ema: "pcc looks good: https://puppet-compiler.wmflabs.org/compiler1002/22142/" [puppet] - https://gerrit.wikimedia.org/r/592615 (https://phabricator.wikimedia.org/T248067) (owner: Ema)
[09:13:34] (CR) Muehlenhoff: "Thanks for the ping, these can simply be dropped. When the remaining servers with such old CPUs are dropped, I'll remove the entire blackl" [puppet] - https://gerrit.wikimedia.org/r/592617 (https://phabricator.wikimedia.org/T222592) (owner: Jcrespo)
[09:14:39] (CR) Jcrespo: "with "dropped" do you mean a virtual +1 to deploy this as is, or to skip all changes?" [puppet] - https://gerrit.wikimedia.org/r/592617 (https://phabricator.wikimedia.org/T222592) (owner: Jcrespo)
[09:14:51] (CR) Vgutierrez: [C: +1] ATS: stop logging PURGE traffic [puppet] - https://gerrit.wikimedia.org/r/592615 (https://phabricator.wikimedia.org/T248067) (owner: Ema)
[09:15:28] Operations, DBA, Patch-For-Review: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070 (Dzahn) Projects currently still using the "simplelamp" class: Project: glampipe All project instances Project: gratitude...
[09:15:31] (PS1) Ema: pcc: use python3 [puppet] - https://gerrit.wikimedia.org/r/592618
[09:16:04] (CR) Marostegui: [C: +1] "I would also suggest to convert the task to the normal decommissioning template, just for consistency" [puppet] - https://gerrit.wikimedia.org/r/592617 (https://phabricator.wikimedia.org/T222592) (owner: Jcrespo)
[09:17:00] PROBLEM - Unmerged changes on repository puppet on labtestpuppetmaster2001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[09:17:53] Operations, LDAP-Access-Requests: LDAP access to the wmf group for Sam Walton - https://phabricator.wikimedia.org/T250189 (Samwalton9) That makes sense, I'll use my staff account then. I'm a little confused about that, though, since I can't find any Wikitech login details for my staff account. https://t...
[09:18:30] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1004-cloudelastic-chi-eqiad on cloudelastic1004 is OK: (C)100 gt (W)80 gt 74.24 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1004&panelId=37 [09:18:59] (03CR) 10Ema: [C: 03+2] ATS: stop logging PURGE traffic [puppet] - 10https://gerrit.wikimedia.org/r/592615 (https://phabricator.wikimedia.org/T248067) (owner: 10Ema) [09:19:24] mutante: hi! ok to puppet-merge your planet change? [09:20:22] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [09:20:56] ema: yes please, very harmless, just got distracted [09:21:09] done! [09:21:15] thx [09:22:21] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for squid [puppet] - 10https://gerrit.wikimedia.org/r/592619 (https://phabricator.wikimedia.org/T135991) [09:22:28] RECOVERY - Unmerged changes on repository puppet on labtestpuppetmaster2001 is OK: No changes to merge. 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [09:22:46] (03PS7) 10Jbond: varnish: update varnish config to use the abuse_networks global [puppet] - 10https://gerrit.wikimedia.org/r/583342 (https://phabricator.wikimedia.org/T233945) [09:23:04] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/583342 (https://phabricator.wikimedia.org/T233945) (owner: 10Jbond) [09:23:40] (03CR) 10Jcrespo: [C: 03+2] Decommission es2001, es2002, es2003, es2004 [puppet] - 10https://gerrit.wikimedia.org/r/592617 (https://phabricator.wikimedia.org/T222592) (owner: 10Jcrespo) [09:23:49] (03CR) 10Filippo Giunchedi: [C: 03+2] modules: add thanos-sidecar define and profile [puppet] - 10https://gerrit.wikimedia.org/r/586312 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [09:24:14] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/592600 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:25:38] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for squid [puppet] - 10https://gerrit.wikimedia.org/r/592619 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [09:25:40] !log Stop MySQL on labsdb1012 to reclone labsdb1011 - T249188 [09:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:46] T249188: Reimage labsdb1011 to Buster and 10.4 - https://phabricator.wikimedia.org/T249188 [09:25:52] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Sam Walton - https://phabricator.wikimedia.org/T250189 (10Dzahn) This looks like a legit bug. The user samwalton is definitely in LDAP ( using ldapsearch on mwmaint1002) and it's even listed as member of various toolforge groups and project... 
[09:26:06] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for squid [puppet] - 10https://gerrit.wikimedia.org/r/592619 (https://phabricator.wikimedia.org/T135991) [09:26:18] (03CR) 10Jbond: [C: 03+1] "thanks this was on my list glad it was so easy :)" [puppet] - 10https://gerrit.wikimedia.org/r/592618 (owner: 10Ema) [09:26:21] 10Operations, 10LDAP-Access-Requests, 10LDAP: LDAP access to the wmf group for Sam Walton - https://phabricator.wikimedia.org/T250189 (10Dzahn) [09:27:37] mutante: you're dzahn right? Can I throw you a possible similar case? [09:27:54] listed on ldap but wikitech account vanished [09:28:13] I'll pm you if so because of the nature of the account [09:29:10] (03CR) 10Filippo Giunchedi: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1003/22143/prometheus1003.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/586313 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [09:29:14] (03CR) 10Ema: [C: 03+2] pcc: use python3 [puppet] - 10https://gerrit.wikimedia.org/r/592618 (owner: 10Ema) [09:29:22] (03PS4) 10Filippo Giunchedi: prometheus: add thanos-sidecar to prometheus@ops [puppet] - 10https://gerrit.wikimedia.org/r/586313 (https://phabricator.wikimedia.org/T233956) [09:30:04] hoo: Dear deployers, time to do the Wikidata deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200427T0930). [09:30:06] RhinosF1: I am the person. But i am just a reporter of the bug myself. 
[09:31:22] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [09:31:29] ^ expected [09:31:33] !log jynus@cumin2001 START - Cookbook sre.hosts.decommission [09:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:51] mutante: see PM [09:31:56] ACKNOWLEDGEMENT - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui Working on labsdb1011 https://wikitech.wikimedia.org/wiki/HAProxy [09:32:24] !log jynus@cumin2001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [09:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:33] (03PS1) 10Jbond: abuse_networks: add dummy data [labs/private] - 10https://gerrit.wikimedia.org/r/592620 [09:32:34] !log jynus@cumin2001 START - Cookbook sre.hosts.decommission [09:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:53] (03CR) 10Jbond: [V: 03+2 C: 03+2] abuse_networks: add dummy data [labs/private] - 10https://gerrit.wikimedia.org/r/592620 (owner: 10Jbond) [09:33:12] !log jynus@cumin2001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [09:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:21] !log jynus@cumin2001 START - Cookbook sre.hosts.decommission [09:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:37] jbond42: ok to merge your change ? 
[09:34:01] !log jynus@cumin2001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [09:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:15] !log jynus@cumin2001 START - Cookbook sre.hosts.decommission [09:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:53] !log jynus@cumin2001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [09:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:31] I'll stand by for now [09:35:51] also for some reason my change isn't showing up anyways on puppet-merge, for some reason [09:37:34] (03PS1) 10Elukey: role::elasticsearch::cloudelastic: set CMS occupancy fraction to 85 [puppet] - 10https://gerrit.wikimedia.org/r/592621 (https://phabricator.wikimedia.org/T231517) [09:37:35] ah got it, puppet-merge goes through changes one by one now [09:37:59] I answered no to John's change and yes to mine, then got "problems merging labs" [09:38:14] (03PS1) 10Jbond: etc/varnish/blocked-nets.inc.vcl: fix whitespace and quoting [labs/private] - 10https://gerrit.wikimedia.org/r/592622 [09:39:06] godog: yes i think thats a bug best to always just say yes to the labs one. i have another labs one now so ill take a look [09:39:30] (03CR) 10Jbond: [V: 03+2 C: 03+2] etc/varnish/blocked-nets.inc.vcl: fix whitespace and quoting [labs/private] - 10https://gerrit.wikimedia.org/r/592622 (owner: 10Jbond) [09:39:56] jbond42: ack, thanks! 
LMK how it goes for you [09:40:16] afaict my change hasn't been pushed to puppet masters [09:40:47] godog: looks like it got pushed with my labs changes, everything is now on b9939d622d7d8c5bd31c12f1b48aa101e1539099 [09:41:18] let me know if you see issues [09:41:19] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/22145/" [puppet] - 10https://gerrit.wikimedia.org/r/592621 (https://phabricator.wikimedia.org/T231517) (owner: 10Elukey) [09:41:21] godog: yeah I think answering "no" to the labs/private question and "yes" to prod does not produce the expected result [09:41:28] jbond42: sweet, thanks! [09:41:39] I have seen that myself in the past at least [09:42:11] heh I haven't got to the labs question, I got asked first to merge john's patch, which I answered 'no' and then to mine which I said yes [09:42:18] anyways, all well now [09:42:26] godog: my merge was to labs private [09:42:40] ah! got it [09:42:59] (03PS1) 10Jcrespo: Remove es2001-4 DNS but leaving asset tags for decommission [dns] - 10https://gerrit.wikimedia.org/r/592623 (https://phabricator.wikimedia.org/T222592) [09:43:38] OCD review for puppet-merge coming up [09:43:41] (03PS1) 10Filippo Giunchedi: puppetmaster: add trailing newline on puppet-merge.py problems [puppet] - 10https://gerrit.wikimedia.org/r/592624 [09:43:45] ^ [09:45:01] (03CR) 10Ema: [C: 03+2] Release 0.8 [software/purged] - 10https://gerrit.wikimedia.org/r/591300 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [09:45:19] 10Puppet, 10User-jbond: puppet-merge: answering not to merging labs-private prvents puppet-merge from pushing to all puppet masters - https://phabricator.wikimedia.org/T251104 (10jbond) p:05Triage→03Medium [09:45:28] godog: ema: ^^ [09:45:57] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/592624 (owner: 10Filippo Giunchedi) [09:46:02] yup sounds good!
thanks jbond42 [09:46:19] (03CR) 10Filippo Giunchedi: [C: 03+2] puppetmaster: add trailing newline on puppet-merge.py problems [puppet] - 10https://gerrit.wikimedia.org/r/592624 (owner: 10Filippo Giunchedi) [09:46:22] godog: the ultimate goal is to convert the bash part of puppet-merge to python [09:46:25] jbond42: thanks! [09:46:52] yeah that'd be sweet, erb-expanded code makes my eye twitch every time [09:46:52] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is CRITICAL: 125.1 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37 [09:47:12] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 54.83 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [09:47:25] (03CR) 10Marostegui: "I am not sure about this - I never remove the DNS for mgmt entries. You might want to ask Rob about this." 
[dns] - 10https://gerrit.wikimedia.org/r/592623 (https://phabricator.wikimedia.org/T222592) (owner: 10Jcrespo) [09:47:44] (03PS2) 10Jcrespo: Remove es2001-4 entries for decommission, but not touching asset tags [dns] - 10https://gerrit.wikimedia.org/r/592623 (https://phabricator.wikimedia.org/T222592) [09:48:52] (03PS8) 10Jbond: varnish: update varnish config to use the abuse_networks global [puppet] - 10https://gerrit.wikimedia.org/r/583342 (https://phabricator.wikimedia.org/T233945) [09:48:58] (03CR) 10Jcrespo: "> Patch Set 1:" [dns] - 10https://gerrit.wikimedia.org/r/592623 (https://phabricator.wikimedia.org/T222592) (owner: 10Jcrespo) [09:49:49] (03CR) 10Marostegui: "Then go for it" [dns] - 10https://gerrit.wikimedia.org/r/592623 (https://phabricator.wikimedia.org/T222592) (owner: 10Jcrespo) [09:49:58] (03CR) 10Muehlenhoff: "Ack, Jaime is correct, the mgmt DNS entries are removed once DC ops handle the decom steps for disk wipe etc." [dns] - 10https://gerrit.wikimedia.org/r/592623 (https://phabricator.wikimedia.org/T222592) (owner: 10Jcrespo) [09:50:16] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 59.45 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [09:50:54] (03CR) 10Jcrespo: [C: 03+2] Remove es2001-4 entries for decommission, but not touching asset tags [dns] - 10https://gerrit.wikimedia.org/r/592623 (https://phabricator.wikimedia.org/T222592) (owner: 10Jcrespo) [09:51:25] looks a jump in http requests (not https though) a while ago, and now showing up as a drop [09:52:58] (03CR) 10Filippo Giunchedi: [C: 03+2] Add Thanos query [puppet] - 10https://gerrit.wikimedia.org/r/586314 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [09:53:06] (03PS5) 10Filippo Giunchedi: Add Thanos query [puppet] - 10https://gerrit.wikimedia.org/r/586314 
(https://phabricator.wikimedia.org/T233956) [09:54:39] !log hoo@deploy1001 Synchronized php-1.35.0-wmf.28/extensions/Wikibase: Add pruneItemsPerSite maintenance script (T249613) (duration: 01m 06s) [09:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:46] T249613: Cleanup rows from wb_items_per_site that should not be there any more - https://phabricator.wikimedia.org/T249613 [09:57:25] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: scrape thanos sidecar/query metrics [puppet] - 10https://gerrit.wikimedia.org/r/586315 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [10:00:32] (03PS5) 10Filippo Giunchedi: prometheus: scrape thanos sidecar/query metrics [puppet] - 10https://gerrit.wikimedia.org/r/586315 (https://phabricator.wikimedia.org/T233956) [10:04:21] (03CR) 10Jbond: "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/583342 (https://phabricator.wikimedia.org/T233945) (owner: 10Jbond) [10:06:24] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: scrape thanos sidecar/query metrics [puppet] - 10https://gerrit.wikimedia.org/r/586315 (https://phabricator.wikimedia.org/T233956) (owner: 10Filippo Giunchedi) [10:10:12] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is OK: (C)100 gt (W)80 gt 30.51 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1001&panelId=37 [10:15:14] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 78.18 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [10:19:21] jan_drewniak: Would you mind if I stretch my deployment slot a little? 
[10:22:59] !log depool and restart wdqs1007 (deadlocks) T242453 [10:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:06] T242453: Deadlock in blazegraph blocking all queries and updates - https://phabricator.wikimedia.org/T242453 [10:23:24] PROBLEM - Query Service HTTP Port on wdqs1007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [10:23:29] ^^ thats me [10:24:32] RECOVERY - Query Service HTTP Port on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [10:26:13] hoo: no problem. Mine only takes a few minutes [10:26:23] (03PS1) 10Muehlenhoff: Allow specifying a process name for services without a native systemd unit (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/592628 [10:27:13] jan_drewniak: Thanks :) Just +2ed my cherry-pick… I hope to be done by 10:45 UTC :) [10:27:57] 10Operations: Add a systemd unit for DHCP - https://phabricator.wikimedia.org/T251112 (10MoritzMuehlenhoff) [10:29:38] !log contint2001 - jenkins failed and can't start because address is already in use [10:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:43] (03CR) 10jerkins-bot: [V: 04-1] Allow specifying a process name for services without a native systemd unit (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/592628 (owner: 10Muehlenhoff) [10:30:04] jan_drewniak: My dear minions, it's time we take the moon! Just kidding. Time for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200427T1030). [10:30:18] 10Operations: Add a systemd unit for DHCP - https://phabricator.wikimedia.org/T251112 (10MoritzMuehlenhoff) p:05Triage→03Medium [10:32:42] !log contint2001 - systemd status was degraded. icinga alerted. failed unit was jenkins. 
starting it failed with "address already in use". manually started without using systemctl? killed jenkins and started again with systemctl. T224591 [10:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:49] T224591: Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 [10:36:44] RECOVERY - Check systemd state on contint2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:39:50] (03PS1) 10Esanders: Load DiscussionTools on en.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592630 (https://phabricator.wikimedia.org/T249376) [10:40:20] (03PS1) 10Hnowlan: mediawiki:jobrunner_tls: Remove runjobs monitoring [puppet] - 10https://gerrit.wikimedia.org/r/592631 (https://phabricator.wikimedia.org/T243096) [10:43:06] (03Abandoned) 10Hnowlan: changeprop: allow setting arbitrary keys for kafka options [deployment-charts] - 10https://gerrit.wikimedia.org/r/589059 (https://phabricator.wikimedia.org/T248677) (owner: 10Hnowlan) [10:44:40] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/592619 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:46:06] mutante: checking, thanks [10:48:17] (03PS4) 10Cparle: Enable quality constraints on production commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/588988 (https://phabricator.wikimedia.org/T248117) [10:49:09] !log hoo@deploy1001 Synchronized php-1.35.0-wmf.28/extensions/Wikibase: pruneItemsPerSite: Fix join_condition call signature (T249613) (duration: 01m 01s) [10:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:17] T249613: Cleanup rows from wb_items_per_site that should not be there any more - https://phabricator.wikimedia.org/T249613 [10:51:19] Forgot the submodule update :/ [10:51:25] Will be done in a second [10:52:00] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission 
es2001, es2002, es2003, es2004 - https://phabricator.wikimedia.org/T222592 (10Marostegui) [10:52:12] !log hoo@deploy1001 Synchronized php-1.35.0-wmf.28/extensions/Wikibase: pruneItemsPerSite: Fix join_condition call signature (T249613) (duration: 01m 02s) [10:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:48] !log Running the pruneItemsPerSite on mwmaint1002 maintenance script for Wikidata (T249613) [10:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:56] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592634 (https://phabricator.wikimedia.org/T128546) [10:58:02] jouncebot: next [10:58:02] In 0 hour(s) and 1 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200427T1100) [10:58:23] 10Operations: Onboarding Stephen Shirley - https://phabricator.wikimedia.org/T250134 (10Dzahn) [10:59:18] 10Operations: Onboarding Stephen Shirley - https://phabricator.wikimedia.org/T250134 (10Dzahn) Added to pwstore. re-signed .users file and re-encrypted all the files for the "ops" group. Stephen confirmed he can decrypt the management file and has a working mgmt pass now. [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200427T1100). [11:00:04] cormacparle, Zoranzoki21, rxy, Majavah, and VulpesVulpes825: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:07] o/ [11:00:12] o/ [11:00:13] o/ [11:00:34] ready to go when you are Lucas_WMDE [11:00:40] I’m ready, go ahead :) [11:00:45] cool [11:00:48] (you’re a deployer, right?) 
[11:00:52] (03CR) 10Cparle: [C: 03+2] Enable quality constraints on production commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/588988 (https://phabricator.wikimedia.org/T248117) (owner: 10Cparle) [11:00:56] o/ but not ready [11:00:58] apparently :) [11:01:00] * Urbanecm waves [11:02:03] (03Merged) 10jenkins-bot: Enable quality constraints on production commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/588988 (https://phabricator.wikimedia.org/T248117) (owner: 10Cparle) [11:02:13] hoo: should we wait for your script to finish or is it fine to run in parallel? [11:02:22] (first SWAT is a config change for commonswiki, not wikidata-related) [11:02:24] Lucas_WMDE: It would be fine [11:02:27] ok thanks [11:02:29] but also it's done already :) [11:02:33] ah! [11:02:36] nice :D [11:02:50] I thought it would take longer ^^ [11:03:19] Hello, sorry for lating, I'm here :D [11:03:27] oh hey, wait a sec, could I do the portals deploy first, was waiting for hoo [11:04:11] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for pmacctd [puppet] - 10https://gerrit.wikimedia.org/r/592636 (https://phabricator.wikimedia.org/T135991) [11:04:15] oh, sorry [11:04:24] I’m not sure if cormacparle__ is syncing again or something [11:04:30] (03PS8) 10Zoranzoki21: Update logos for for tiwiki and tiwikt (part II) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592507 (https://phabricator.wikimedia.org/T150618) [11:04:31] but we can fit you in after him, at least [11:04:35] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for pmacctd [puppet] - 10https://gerrit.wikimedia.org/r/592636 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:04:48] (03PS3) 10Zoranzoki21: Update logos for for tiwiki and tiwikt (part I) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592506 (https://phabricator.wikimedia.org/T150618) [11:04:51] Lucas_WMDE: sure no problem, mine is quick [11:04:57] ok sounds good [11:05:05] yeah hang on, I'm 
just testing ... [11:05:06] (03PS9) 10Zoranzoki21: Update logos for for tiwiki and tiwikt (part II) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592507 (https://phabricator.wikimedia.org/T150618) [11:06:27] (03CR) 10KartikMistry: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592479 (https://phabricator.wikimedia.org/T246383) (owner: 10VulpesVulpes825) [11:06:50] it’s definitely making wbcheckconstraints requests [11:06:56] I haven’t seen any failures yet [11:07:57] ok grand, syncing [11:08:40] !log cparle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [SDC] Enable constraints on production commons (duration: 00m 58s) [11:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:51] ... and syncing again [11:09:22] (03PS1) 10Marostegui: Revert "wikireplicas: Set innodb_purge_threads to 10" [puppet] - 10https://gerrit.wikimedia.org/r/592638 [11:09:28] (03CR) 10Marostegui: "Let's revert for now" [puppet] - 10https://gerrit.wikimedia.org/r/592638 (owner: 10Marostegui) [11:09:36] cormacparle__: that's no longer necessary, according to a recent email sent to the ops list [11:09:44] ah! [11:09:46] was about to write that :) [11:09:46] ok great [11:09:53] !log cparle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [SDC] Enable constraints on production commons (duration: 00m 57s) [11:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:00] but doesn’t hurt either, so no harm done [11:10:03] indeed [11:10:05] jan_drewniak: go ahead!
[11:10:21] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592634 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:11:05] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Add two domains in wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592306 (https://phabricator.wikimedia.org/T250903) (owner: 10Zoranzoki21) [11:11:22] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592634 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:12:07] (03PS2) 10Rxy: Add transwiki import sources in zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592330 (https://phabricator.wikimedia.org/T250972) [11:12:50] (03CR) 10Muehlenhoff: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/592636 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:13:51] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:592634| Bumping portals to master (563985)]] (duration: 00m 58s) [11:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:28] (03CR) 10Dzahn: [C: 03+1] Enable base::service_auto_restart for squid [puppet] - 10https://gerrit.wikimedia.org/r/592619 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:14:48] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:592634| Bumping portals to master (563985)]] (duration: 00m 57s) [11:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:45] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Enable cross-project search on frwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592515 (https://phabricator.wikimedia.org/T250724) (owner: 10Majavah) [11:15:49] Lucas_WMDE: ok I'm all done [11:16:01] ok then let’s continue with Zoranzoki21! 
[11:16:05] thanks [11:16:20] (03PS4) 10Lucas Werkmeister (WMDE): Add two domains in wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592306 (https://phabricator.wikimedia.org/T250903) (owner: 10Zoranzoki21) [11:16:31] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592306 (https://phabricator.wikimedia.org/T250903) (owner: 10Zoranzoki21) [11:16:33] Ok, I'm here :) [11:16:50] Patch for wgCopyUploadsDomains no needs testing :) [11:16:54] with mwdebug [11:17:13] ok [11:17:19] the other one I’m not going to deploy, sorry [11:17:24] (03Merged) 10jenkins-bot: Add two domains in wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592306 (https://phabricator.wikimedia.org/T250903) (owner: 10Zoranzoki21) [11:17:24] I see you already replied on Phabricator [11:17:33] but I don’t feel comfortable overriding that -2 by myself [11:17:39] Lucas_WMDE: I added another patches for SWAT [11:17:40] I know [11:17:50] I would at least need some acknowledgment by ppelberg on phab [11:17:59] * Lucas_WMDE reloads deployment spage [11:20:59] <_joe_> !log restarted php-fpm on mw1407 to pick up enlarged opcache values, T99740 [11:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:06] T99740: Use static php array files for l10n cache at WMF (instead of CDB) - https://phabricator.wikimedia.org/T99740 [11:21:20] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:592306|Add two domains in wgCopyUploadsDomains (T250903, T250904)]] (duration: 00m 57s) [11:21:20] 10Operations, 10netops: OSPF metrics - https://phabricator.wikimedia.org/T200277 (10faidon) Interesting idea! Couple of notes: - What do you mean by "virtual links" and Netbox not supporting them? Is that VLANs for our transports over the PtMP VPLS? - What do you envision the difference to be between "primary"... 
[11:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:28] T250903: Add www.iau.org to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T250903 [11:21:28] T250904: Add www.kari.re.kr to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T250904 [11:22:17] Zoranzoki21: the 1.5x and 2x versions of those logos look blurry to me, is that how it should be? [11:22:36] Let' me check it again [11:22:53] https://meta.wikimedia.org/wiki/File:Wikipedia-logo-v2-ti.svg gives me higher-res versions [11:24:31] It looks good for me on my laptop [11:24:49] let me download the change, maybe gerrit is doing something funny [11:25:26] Lucas_WMDE: https://prnt.sc/s6pedi [11:25:35] !log repool wdqs1007 T242453 [11:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:41] T242453: Deadlock in blazegraph blocking all queries and updates - https://phabricator.wikimedia.org/T242453 [11:26:04] 2x https://prnt.sc/s6peqb [11:26:31] 1.5x https://prnt.sc/s6pf6x [11:27:17] weird [11:27:40] https://i.imgur.com/oNz3uyA.png [11:27:54] that’s what they look like on my end [11:28:39] maybe you haven’t pushed the latest version to Gerrit or something? [11:29:07] I was [11:29:50] On my end: https://prnt.sc/s6phhj [11:30:52] wait, the wikipedia logo looks fine to me [11:31:01] but that change doesn’t touch the wikipedia logo [11:31:19] especially not t*l*wiki [11:31:28] sorry, I got confused [11:31:32] it does touch a wikipedia logo [11:31:35] but tiwiki, not tlwiki [11:31:46] and tiwiktionary [11:32:43] Uhh, I was also [11:33:42] https://prnt.sc/s6pkda [11:34:38] Urbanecm: Can you help? 
[11:34:57] maybe we can continue with other changes in the meantime [11:35:03] I’m looking at rxy’s now [11:35:06] yes [11:35:10] o/ [11:35:26] (03PS3) 10Rxy: Add transwiki import sources in zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592330 (https://phabricator.wikimedia.org/T250972) [11:35:56] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Add transwiki import sources in zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592330 (https://phabricator.wikimedia.org/T250972) (owner: 10Rxy) [11:36:04] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592330 (https://phabricator.wikimedia.org/T250972) (owner: 10Rxy) [11:36:13] ok you already rebased it ^^ [11:36:16] (03PS1) 10Jbond: mcrouter: add default timeout values [puppet] - 10https://gerrit.wikimedia.org/r/592641 [11:36:18] (03PS1) 10Jbond: profile::idp: add mcrouter (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/592642 (https://phabricator.wikimedia.org/T233933) [11:36:59] (03Merged) 10jenkins-bot: Add transwiki import sources in zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592330 (https://phabricator.wikimedia.org/T250972) (owner: 10Rxy) [11:37:48] (03PS3) 10Majavah: Enable cross-project search on frwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592515 (https://phabricator.wikimedia.org/T250724) [11:38:07] rxy: do you know a sysop on zhwiki to test this? if I read https://zh.wikipedia.org/w/index.php?title=Special:%E7%94%A8%E6%88%B7%E6%9D%83%E9%99%90/Rxy&uselang=en correctly you’re not a sysop yourself [11:38:28] I'm steward [11:38:36] ah :) [11:38:39] sorry ^^ [11:38:42] np [11:38:47] change is on mwdebug1001, can you test it? 
[11:39:15] (03CR) 10jerkins-bot: [V: 04-1] mcrouter: add default timeout values [puppet] - 10https://gerrit.wikimedia.org/r/592641 (owner: 10Jbond) [11:40:13] ok, it works as expected [11:40:17] Please deploy to prod [11:40:23] great, thanks [11:40:28] thanks too :D [11:41:41] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:592330|Add transwiki import sources in zhwiki (T250972)]] (duration: 00m 57s) [11:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:47] T250972: Transwiki import in zhwiki - https://phabricator.wikimedia.org/T250972 [11:42:25] Majavah: are you ready now? [11:42:38] sure [11:43:01] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592515 (https://phabricator.wikimedia.org/T250724) (owner: 10Majavah) [11:43:20] server: mw1403.eqiad.wmnet server: mw1351.eqiad.wmnet : It works correctly. Thanks for the deploy :D [11:43:28] great \o/ [11:43:46] (03PS2) 10Jbond: mcrouter: add default timeout values [puppet] - 10https://gerrit.wikimedia.org/r/592641 [11:44:03] (03Merged) 10jenkins-bot: Enable cross-project search on frwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592515 (https://phabricator.wikimedia.org/T250724) (owner: 10Majavah) [11:44:08] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for fastnetmon [puppet] - 10https://gerrit.wikimedia.org/r/592643 (https://phabricator.wikimedia.org/T135991) [11:44:26] change is on mwdebug1001, testing [11:44:46] seems to look okay, doesn’t clash horribly with the “Outils de recherche évolués Outils de recherche évolués : ” [11:44:49] Lucas_WMDE: appears to work as expected on mwdebug1001 [11:44:54] ok, syncing [11:45:56] (03CR) 10Ayounsi: [C: 03+1] "Tested that a pmacctd restart doesn't impact nfacctd.
LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/592636 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:46:11] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:592515|Enable cross-project search on frwiktionary (T250724)]] (duration: 00m 57s) [11:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:18] T250724: Enable cross-wiki (interwiki) search for French Wiktionary - https://phabricator.wikimedia.org/T250724 [11:46:47] VulpesVulpes825 doesn’t seem to be online [11:47:17] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] "(not SWATted because the 1.5x and 2x logos are blurry)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592506 (https://phabricator.wikimedia.org/T150618) (owner: 10Zoranzoki21) [11:48:06] Lucas_WMDE: according to /whois they are online on #wikimedia-ops [11:48:30] ah, that might be channel confusion then [11:48:33] thanks [11:48:36] (03CR) 10Ayounsi: [C: 03+1] Enable base::service_auto_restart for fastnetmon [puppet] - 10https://gerrit.wikimedia.org/r/592643 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:49:01] yeah the names are pretty similar [11:49:10] pinged them there [11:49:30] !log Started Wikibase rebuildItemsPerSite on mwmaint1002 for wikidatawiki. Can be killed at any time, if necessary. 
(T249613) [11:49:31] (03PS4) 10Zoranzoki21: Update logos for for tiwiki and tiwikt (part I) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592506 (https://phabricator.wikimedia.org/T150618) [11:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:37] T249613: Cleanup rows from wb_items_per_site that should not be there any more - https://phabricator.wikimedia.org/T249613 [11:49:56] (03PS10) 10Zoranzoki21: Update logos for for tiwiki and tiwikt (part II) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592507 (https://phabricator.wikimedia.org/T150618) [11:50:07] (03PS5) 10Zoranzoki21: Update logos for for tiwiki and tiwikt (part I) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592506 (https://phabricator.wikimedia.org/T150618) [11:50:14] (03PS11) 10Zoranzoki21: Update logos for for tiwiki and tiwikt (part II) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592507 (https://phabricator.wikimedia.org/T150618) [11:50:22] …or we continue with the logos, it looks like [11:50:33] (03CR) 10Gehel: [C: 03+1] "Minor comment inline, but LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/592621 (https://phabricator.wikimedia.org/T231517) (owner: 10Elukey) [11:51:36] Lucas_WMDE: In tiwiki-1.5x.png it was +2 pixels more for each, I made it 2 pixels less, as in simular file for another wiki.. I'm not sure should we continue right now, maybe someone from Design should check logos [11:51:45] yeah, ok [11:52:01] (03CR) 10DCausse: [C: 03+1] role::elasticsearch::cloudelastic: set CMS occupancy fraction to 85 [puppet] - 10https://gerrit.wikimedia.org/r/592621 (https://phabricator.wikimedia.org/T231517) (owner: 10Elukey) [11:53:24] ok, let’s just close the SWAT window now, it’s been productive enough I think :) [11:53:27] !log EU SWAT done [11:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:57] Lucas_WMDE: Yes, thanks! 
[11:57:51] (03CR) 10Jcrespo: [C: 03+1] Revert "wikireplicas: Set innodb_purge_threads to 10" [puppet] - 10https://gerrit.wikimedia.org/r/592638 (owner: 10Marostegui) [12:03:14] Lucas_WMDE: they’ve moved to wednesday for the zhwiki task [12:03:31] ok [12:04:06] Lucas_WMDE: andre has raised a concern on the task as well so reached out to editing for feedback [12:06:32] !log cp: upgrade purged to 0.8 T249583 [12:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:38] T249583: Create vhtcpd replacement - https://phabricator.wikimedia.org/T249583 [12:09:47] (03CR) 10Marostegui: [C: 03+2] Revert "wikireplicas: Set innodb_purge_threads to 10" [puppet] - 10https://gerrit.wikimedia.org/r/592638 (owner: 10Marostegui) [12:11:10] (03PS3) 10VulpesVulpes825: wmf-config/: Adjust MT threshold for Chinese Wikipedia to 70% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592479 (https://phabricator.wikimedia.org/T246383) [12:11:29] (03PS2) 10VulpesVulpes825: Enable DiscussionTools as a beta feature on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592514 (https://phabricator.wikimedia.org/T251075) [12:12:40] PROBLEM - PHP opcache health on mw1407 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:14:59] !log Remove empty table T248086_wb_terms from commonswiki and testcommonswiki on s4 master - T248086 [12:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:05] T248086: Drop wb_terms in production from s4 (commonswiki, testcommonswiki), s3 (testwikidatawiki), s8 (wikidatawiki) - https://phabricator.wikimedia.org/T248086 [12:16:57] 10Operations, 10Wikimedia-Logstash, 10observability: config file change canarying for logstash - https://phabricator.wikimedia.org/T221052 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi I'm going to boldly resolve this, we're testing Logstash config at puppet run 
time now, which is meant to at least p... [12:18:08] hi VulpesVulpes825 [12:21:05] (03PS1) 10Majavah: Create a bunch of namespace aliases for thwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592645 (https://phabricator.wikimedia.org/T251118) [12:21:36] (03PS4) 10Hnowlan: changeprop: Use global puppet CA cert [deployment-charts] - 10https://gerrit.wikimedia.org/r/589570 (https://phabricator.wikimedia.org/T249633) [12:21:42] Hi RhinosF1, I just found out I was monitoring wikimedia-ops, rather than this one... [12:22:04] (03PS3) 10Hnowlan: changeprop: Use staging eventgate in staging environment. [deployment-charts] - 10https://gerrit.wikimedia.org/r/589562 (https://phabricator.wikimedia.org/T249739) [12:23:08] (03PS1) 10Ema: purged: pass host_regex [puppet] - 10https://gerrit.wikimedia.org/r/592646 (https://phabricator.wikimedia.org/T249583) [12:24:42] (03CR) 10Elukey: [C: 03+2] role::elasticsearch::cloudelastic: set CMS occupancy fraction to 85 [puppet] - 10https://gerrit.wikimedia.org/r/592621 (https://phabricator.wikimedia.org/T231517) (owner: 10Elukey) [12:25:02] (03CR) 10Elukey: "ouch sorry didn't see the comments, hit +2 by mistake" [puppet] - 10https://gerrit.wikimedia.org/r/592621 (https://phabricator.wikimedia.org/T231517) (owner: 10Elukey) [12:25:16] 10Operations, 10Traffic, 10Wikimedia-Incident: cp1083: ats-tls and varnish-fe crashed due to insufficient memory - https://phabricator.wikimedia.org/T241593 (10fgiunchedi) Untagging observability for now since there doesn't seem to be any action [12:25:34] (03CR) 10Elukey: role::elasticsearch::cloudelastic: set CMS occupancy fraction to 85 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/592621 (https://phabricator.wikimedia.org/T231517) (owner: 10Elukey) [12:26:27] (03PS2) 10Elukey: role::elasticsearch::cloudelastic: set CMS occupancy fraction to 85 [puppet] - 10https://gerrit.wikimedia.org/r/592621 (https://phabricator.wikimedia.org/T231517) [12:27:40] (03CR) 10Ema: "pcc output seems 
correct: https://puppet-compiler.wmflabs.org/compiler1001/22150/" [puppet] - 10https://gerrit.wikimedia.org/r/592646 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [12:28:52] (03CR) 10Elukey: "new pcc https://puppet-compiler.wmflabs.org/compiler1001/22151/" [puppet] - 10https://gerrit.wikimedia.org/r/592621 (https://phabricator.wikimedia.org/T231517) (owner: 10Elukey) [12:32:29] !log Remove empty table T248086_wb_terms from wikidatawiki on s8 codfw master - T248086 [12:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:37] T248086: Drop wb_terms in production from s4 (commonswiki, testcommonswiki), s3 (testwikidatawiki), s8 (wikidatawiki) - https://phabricator.wikimedia.org/T248086 [12:32:55] (03PS1) 10Ema: cache: test purged on cp2030, part of cache_upload [puppet] - 10https://gerrit.wikimedia.org/r/592648 (https://phabricator.wikimedia.org/T249583) [12:33:58] (03PS2) 10MSantos: Re-enable OSM replication [puppet] - 10https://gerrit.wikimedia.org/r/591028 (https://phabricator.wikimedia.org/T249086) [12:34:52] ^ gehel [12:36:43] !log Remove empty table T248086_wb_terms from wikidatawiki on s3 codfw master - T248086 [12:36:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:00] RECOVERY - PHP opcache health on mw1407 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [12:41:27] !log Remove empty table T248086_wb_terms from wikidatawiki on s3 eqiad - T248086 [12:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:33] T248086: Drop wb_terms in production from s4 (commonswiki, testcommonswiki), s3 (testwikidatawiki), s8 (wikidatawiki) - https://phabricator.wikimedia.org/T248086 [12:44:29] (03PS1) 10Dzahn: planet: remove some more feeds that don't exist anymore [puppet] - 10https://gerrit.wikimedia.org/r/592649 (https://phabricator.wikimedia.org/T168459) [12:45:09] !log upgrade etherpad 
to 1.8.3 [12:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:22] akosiaris: /o\ [12:45:34] :D [12:47:13] (03PS2) 10Dzahn: planet: remove some more feeds that don't exist anymore [puppet] - 10https://gerrit.wikimedia.org/r/592649 (https://phabricator.wikimedia.org/T168459) [12:48:47] pretty interesting... seems like I've killed etherpad? [12:49:09] akosiaris: yeah, it is not working for me [12:49:37] ok, let's do a rollback [12:49:56] !log rolling back etherpad to 1.8.0 [12:50:00] !log Removed img_deleted from s1 (enwiki) T250055 [12:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:08] T250055: Remove image.img_deleted column from production - https://phabricator.wikimedia.org/T250055 [12:50:27] ok, it's back [12:50:30] weird [12:50:34] no logs... nothing [12:50:53] akosiaris: works now yeah [12:50:55] "weird" uttering and etherpad don't go together [12:51:05] let's retest this locally. I already did, but maybe I missed something [12:51:18] but my first action on a Monday resulting in an outage [12:51:20] nice...
[12:51:43] akosiaris: it must be a good week :) [12:52:01] it sure has the smell of it right now [12:52:04] (03CR) 10Vgutierrez: purged: pass host_regex (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/592646 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [12:52:38] (03CR) 10Vgutierrez: [C: 03+1] cache: test purged on cp2030, part of cache_upload [puppet] - 10https://gerrit.wikimedia.org/r/592648 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [12:52:50] (03PS2) 10Jbond: profile::idp: add mcrouter [puppet] - 10https://gerrit.wikimedia.org/r/592642 (https://phabricator.wikimedia.org/T233933) [12:53:01] (03PS3) 10Jbond: profile::idp: add mcrouter [puppet] - 10https://gerrit.wikimedia.org/r/592642 (https://phabricator.wikimedia.org/T233933) [12:53:07] akosiaris: good luck! [12:53:20] ♥ [12:53:26] thanks. I might need it [12:53:43] !log Drop T248086_wb_terms from db1104 - T248086 [12:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:49] T248086: Drop wb_terms in production from s4 (commonswiki, testcommonswiki), s3 (testwikidatawiki), s8 (wikidatawiki) - https://phabricator.wikimedia.org/T248086 [12:56:22] 10Operations, 10SRE-Access-Requests: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T251122 (10ahemmer) [12:59:09] (03PS1) 10Ema: VCL: clarify scripted requests error [puppet] - 10https://gerrit.wikimedia.org/r/592652 [12:59:40] (03PS4) 10Volans: scripts: add offline device script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/576461 [13:00:46] (03PS1) 10Dzahn: create a role like simplelamp but using mariadb, not mysql [puppet] - 10https://gerrit.wikimedia.org/r/592653 (https://phabricator.wikimedia.org/T162070) [13:01:10] 10Operations, 10LDAP-Access-Requests: LDAP access to the wf group for Antonino Hemmer (superset, turnilo, hue) - https://phabricator.wikimedia.org/T251124 (10RhinosF1) [13:01:28] (03CR) 10Volans: 
"@crusnov: FYI, script refactored to cover the use case of DCOps needing to offline (unrack) a device. The requirement to update the gsheet" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/576461 (owner: 10Volans) [13:02:05] 10Operations, 10LDAP-Access-Requests: LDAP access to the wf group for Antonino Hemmer (superset, turnilo, hue) - https://phabricator.wikimedia.org/T251124 (10RhinosF1) Please can you provide the following: When filing your request: Username: (The user name used on Wikitech.) Shell access: Yes/No (Whether you... [13:02:26] (03PS2) 10Ema: purged: pass host_regex [puppet] - 10https://gerrit.wikimedia.org/r/592646 (https://phabricator.wikimedia.org/T249583) [13:02:55] (03CR) 10jerkins-bot: [V: 04-1] create a role like simplelamp but using mariadb, not mysql [puppet] - 10https://gerrit.wikimedia.org/r/592653 (https://phabricator.wikimedia.org/T162070) (owner: 10Dzahn) [13:04:26] 10Operations, 10Analytics, 10LDAP-Access-Requests, 10Product-Analytics (Kanban): LDAP access to the wf group for Antonino Hemmer (superset, turnilo, hue) - https://phabricator.wikimedia.org/T251123 (10ahemmer) [13:05:35] 10Operations, 10LDAP-Access-Requests: LDAP access to the wf group for Antonino Hemmer (superset, turnilo, hue) - https://phabricator.wikimedia.org/T251124 (10RhinosF1) [13:05:43] 10Operations, 10Analytics, 10LDAP-Access-Requests, 10Product-Analytics (Kanban): LDAP access to the wf group for Antonino Hemmer (superset, turnilo, hue) - https://phabricator.wikimedia.org/T251123 (10RhinosF1) [13:05:58] (03PS2) 10Dzahn: create a role like simplelamp but using mariadb, not mysql [puppet] - 10https://gerrit.wikimedia.org/r/592653 (https://phabricator.wikimedia.org/T162070) [13:06:17] 10Operations, 10LDAP-Access-Requests: LDAP access to the wf group for Antonino Hemmer (superset, turnilo, hue) - https://phabricator.wikimedia.org/T251124 (10ahemmer) Hi guys, this is a duplication of a T251123, Sorry for this, unsure of how to delete this. 
This can be removed. [13:06:26] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:07:07] 10Operations, 10LDAP-Access-Requests: LDAP access to the wf group for Antonino Hemmer (superset, turnilo, hue) - https://phabricator.wikimedia.org/T251124 (10RhinosF1) >>! In T251124#6084852, @ahemmer wrote: > Hi guys, this is a duplication of a T251123, Sorry for this, unsure of how to delete this. > > This... [13:07:43] (03CR) 10Elukey: [C: 03+2] role::elasticsearch::cloudelastic: set CMS occupancy fraction to 85 [puppet] - 10https://gerrit.wikimedia.org/r/592621 (https://phabricator.wikimedia.org/T231517) (owner: 10Elukey) [13:08:04] (03CR) 10jerkins-bot: [V: 04-1] create a role like simplelamp but using mariadb, not mysql [puppet] - 10https://gerrit.wikimedia.org/r/592653 (https://phabricator.wikimedia.org/T162070) (owner: 10Dzahn) [13:08:51] 10Operations, 10Analytics, 10LDAP-Access-Requests, 10Product-Analytics (Kanban): LDAP access to the wf group for Antonino Hemmer (superset, turnilo, hue) - https://phabricator.wikimedia.org/T251123 (10Dzahn) [13:08:53] 10Operations, 10LDAP-Access-Requests: LDAP access to the wf group for Antonino Hemmer (superset, turnilo, hue) - https://phabricator.wikimedia.org/T251124 (10Dzahn) [13:08:55] 10Operations, 10LDAP-Access-Requests: LDAP access to the wf group for Antonino Hemmer (superset, turnilo, hue) - https://phabricator.wikimedia.org/T251124 (10Dzahn) The fix for duplicates to "merge in duplicates" on the task to be kept. Done! [13:09:05] !log Deploy schema change on s7 codfw, lag will show up - T250055 [13:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:12] T250055: Remove image.img_deleted column from production - https://phabricator.wikimedia.org/T250055 [13:09:27] mutante: I’d already closed the task as dupe. 
[13:09:41] * RhinosF1 goes to test something [13:09:56] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is OK: HTTP OK: HTTP/1.0 200 OK - 22721 bytes in 0.257 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:10:20] (03CR) 10Jcrespo: "comment fix, unless it is just a temporarily named class." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/592653 (https://phabricator.wikimedia.org/T162070) (owner: 10Dzahn) [13:10:49] !log roll restart elastic on cloudelastic-chi again to pick up new JVM settings - T231517 [13:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:55] T231517: Investigate and fix GC issues on cloudelastic machines - https://phabricator.wikimedia.org/T231517 [13:12:06] mutante: I found a phab bug. You can merge a task that is already closed as a dupe of the same task into it again. Reproduced on https://phab.wmflabs.org/T80 [13:12:20] 10Operations, 10SRE-Access-Requests: Requesting Access to sites from Google Search Console - https://phabricator.wikimedia.org/T251128 (10ahemmer) [13:12:26] (03PS3) 10Ema: purged: pass host_regex [puppet] - 10https://gerrit.wikimedia.org/r/592646 (https://phabricator.wikimedia.org/T249583) [13:12:47] 10Operations, 10observability, 10cloud-services-team (Kanban): remove cloud "dev" hosts from Icinga? - https://phabricator.wikimedia.org/T250787 (10Dzahn) The goal is to clear the "unacknowledged services CRITICAL" column in Icinga. Disabling notifications is not a form of acknowledging, scheduling downtime... [13:14:07] 10Operations, 10Analytics, 10LDAP-Access-Requests, 10Product-Analytics (Kanban): LDAP access to the wmf group for Antonino Hemmer (superset, turnilo, hue) - https://phabricator.wikimedia.org/T251123 (10Aklapper) [13:15:03] RhinosF1: confirmed, i repeated what you did before. 
not too worried about it though [13:15:11] 10Operations, 10Analytics, 10LDAP-Access-Requests, 10Product-Analytics (Kanban): LDAP access to the wmf group for Antonino Hemmer (superset, turnilo, hue) - https://phabricator.wikimedia.org/T251123 (10Aklapper) (For future reference, see https://phabricator.wikimedia.org/project/profile/1564/ for required... [13:15:35] mutante: I’ll drop a message on the upstream discource [13:16:35] mutante: what phab version are we? [13:19:19] (03PS4) 10Ema: purged: pass host_regex [puppet] - 10https://gerrit.wikimedia.org/r/592646 (https://phabricator.wikimedia.org/T249583) [13:19:31] RhinosF1: 82a0d4251ae61e2387d9e4abc17992db8c9e40db (Feb 12 2020) [13:22:25] (03CR) 10Vgutierrez: [C: 03+1] purged: pass host_regex [puppet] - 10https://gerrit.wikimedia.org/r/592646 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [13:22:43] (03CR) 10Ema: [C: 03+2] purged: pass host_regex [puppet] - 10https://gerrit.wikimedia.org/r/592646 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) [13:22:58] (03PS2) 10Ema: cache: test purged on cp2030, part of cache_upload [puppet] - 10https://gerrit.wikimedia.org/r/592648 (https://phabricator.wikimedia.org/T249583) [13:25:21] (03PS1) 10Ottomata: eventgate-{analytics,main} - use Kafka TLS [deployment-charts] - 10https://gerrit.wikimedia.org/r/592658 (https://phabricator.wikimedia.org/T250149) [13:26:12] (03CR) 10Elukey: [C: 03+1] eventgate-{analytics,main} - use Kafka TLS [deployment-charts] - 10https://gerrit.wikimedia.org/r/592658 (https://phabricator.wikimedia.org/T250149) (owner: 10Ottomata) [13:26:17] (03CR) 10Ottomata: [C: 03+2] eventgate-{analytics,main} - use Kafka TLS [deployment-charts] - 10https://gerrit.wikimedia.org/r/592658 (https://phabricator.wikimedia.org/T250149) (owner: 10Ottomata) [13:28:29] (03CR) 10Ema: [C: 03+2] cache: test purged on cp2030, part of cache_upload [puppet] - 10https://gerrit.wikimedia.org/r/592648 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema) 
[13:28:31] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' . [13:28:31] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' . [13:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:11] <_joe_> !log depooled mw1409 as well as mw1407 for further benchmarking, T99740 [13:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:18] T99740: Use static php array files for l10n cache at WMF (instead of CDB) - https://phabricator.wikimedia.org/T99740 [13:32:01] mutante: https://discourse.phabricator-community.org/t/possible-to-mark-a-task-as-a-duplicate-of-another-twice/3794 [13:33:07] RhinosF1: looks like i can't read the content, but i trust you [13:33:20] oh, i can :) [13:33:33] :) [13:34:58] (03CR) 10Ppchelko: [C: 03+1] mediawiki:jobrunner_tls: Remove runjobs monitoring [puppet] - 10https://gerrit.wikimedia.org/r/592631 (https://phabricator.wikimedia.org/T243096) (owner: 10Hnowlan) [13:37:01] (03CR) 10Ppchelko: [C: 03+2] changeprop: Use staging eventgate in staging environment. [deployment-charts] - 10https://gerrit.wikimedia.org/r/589562 (https://phabricator.wikimedia.org/T249739) (owner: 10Hnowlan) [13:38:14] <_joe_> !log repooling both mw1407 and mw1409 for tesing T99740 [13:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:21] T99740: Use static php array files for l10n cache at WMF (instead of CDB) - https://phabricator.wikimedia.org/T99740 [13:39:21] (03PS4) 10Ppchelko: changeprop: Use staging eventgate in staging environment. [deployment-charts] - 10https://gerrit.wikimedia.org/r/589562 (https://phabricator.wikimedia.org/T249739) (owner: 10Hnowlan) [13:39:33] (03CR) 10Ppchelko: changeprop: Use staging eventgate in staging environment. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/589562 (https://phabricator.wikimedia.org/T249739) (owner: 10Hnowlan) [13:39:37] (03CR) 10Ppchelko: [C: 03+2] changeprop: Use staging eventgate in staging environment. [deployment-charts] - 10https://gerrit.wikimedia.org/r/589562 (https://phabricator.wikimedia.org/T249739) (owner: 10Hnowlan) [13:39:55] 10Operations, 10Patch-For-Review: Upgrade install servers to Buster - https://phabricator.wikimedia.org/T224576 (10Dzahn) [13:40:00] (03Merged) 10jenkins-bot: changeprop: Use staging eventgate in staging environment. [deployment-charts] - 10https://gerrit.wikimedia.org/r/589562 (https://phabricator.wikimedia.org/T249739) (owner: 10Hnowlan) [13:41:36] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [13:41:39] !log dzahn@cumin1001 END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97) [13:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:42] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [13:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:20] 10Operations, 10Patch-For-Review: Upgrade install servers to Buster - https://phabricator.wikimedia.org/T224576 (10Dzahn) [13:42:28] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics' for release 'production' . [13:42:28] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics' for release 'canary' . 
[13:42:31] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [13:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:37] 10Operations, 10Patch-For-Review: Upgrade install servers to Buster - https://phabricator.wikimedia.org/T224576 (10Dzahn) [13:42:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:54] !log Drop img_deleted column from s7 eqiad - T250055 [13:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:02] T250055: Remove image.img_deleted column from production - https://phabricator.wikimedia.org/T250055 [13:46:04] !log Drop img_deleted column from wikitech - T250055 [13:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:41] (03PS1) 10Jbond: build.gradle: add memcached support to cas blob [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/592659 (https://phabricator.wikimedia.org/T233931) [13:47:25] (03CR) 10Andrew Bogott: "retest" [puppet] - 10https://gerrit.wikimedia.org/r/589741 (owner: 10Andrew Bogott) [13:47:40] !log Deploy schema change on s3 codfw, lag will show up - T250055 [13:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:57] (03PS1) 10Jbond: apero_cas: alow ability to use memcached for tickets [puppet] - 10https://gerrit.wikimedia.org/r/592660 (https://phabricator.wikimedia.org/T233933) [13:47:59] (03PS1) 10Jbond: apero_cas: enable memcached on idp_test [puppet] - 10https://gerrit.wikimedia.org/r/592661 (https://phabricator.wikimedia.org/T233931) [13:48:39] (03CR) 10Andrew Bogott: [C: 03+1] mcrouter: add default timeout values [puppet] - 10https://gerrit.wikimedia.org/r/592641 (owner: 10Jbond) [13:48:52] (03PS2) 10Jbond: apero_cas: alow ability to use memcached for tickets [puppet] - 10https://gerrit.wikimedia.org/r/592660 
(https://phabricator.wikimedia.org/T233933) [13:49:04] (03PS2) 10Jbond: apero_cas: enable memcached on idp_test [puppet] - 10https://gerrit.wikimedia.org/r/592661 (https://phabricator.wikimedia.org/T233931) [13:51:25] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [13:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:04] (03CR) 10jerkins-bot: [V: 04-1] apero_cas: alow ability to use memcached for tickets [puppet] - 10https://gerrit.wikimedia.org/r/592660 (https://phabricator.wikimedia.org/T233933) (owner: 10Jbond) [13:52:10] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [13:52:13] !log decom'ing install1002 and install2002 - see install1003/2003 and apt1001/2001 [13:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:01] (03PS3) 10Jbond: apero_cas: alow ability to use memcached for tickets [puppet] - 10https://gerrit.wikimedia.org/r/592660 (https://phabricator.wikimedia.org/T233933) [13:53:23] !log restart ats-tls on cp3056 - T249335 [13:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:30] T249335: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 [13:53:45] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10Dzahn) [13:54:11] (03CR) 10Muehlenhoff: [C: 03+2] Enable base::service_auto_restart for CAS on fallback servers [puppet] - 10https://gerrit.wikimedia.org/r/592600 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:54:30] 10Operations, 10Patch-For-Review: Sort out plan for install* servers in edge sites - https://phabricator.wikimedia.org/T242602 (10Dzahn) install1002 and install2002 have now been removed. install1003/2003 and apt1001/2001 have replaced them. 
[13:54:41] (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/589741 (owner: 10Andrew Bogott) [13:55:49] !log depool cp4026 and upgrade to ATS 8.1.0 - T249335 [13:55:51] (03CR) 10jerkins-bot: [V: 04-1] apero_cas: alow ability to use memcached for tickets [puppet] - 10https://gerrit.wikimedia.org/r/592660 (https://phabricator.wikimedia.org/T233933) (owner: 10Jbond) [13:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:56] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [13:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:59] (03PS4) 10Jbond: apero_cas: alow ability to use memcached for tickets [puppet] - 10https://gerrit.wikimedia.org/r/592660 (https://phabricator.wikimedia.org/T233933) [13:59:13] (03PS3) 10Jbond: apero_cas: enable memcached on idp_test [puppet] - 10https://gerrit.wikimedia.org/r/592661 (https://phabricator.wikimedia.org/T233931) [14:00:33] 10Operations: git-pbuilder incorrectly copies DIST=stretch package files into results/buster-amd64 on deneb.codfw.wmnet - https://phabricator.wikimedia.org/T250803 (10MoritzMuehlenhoff) This seems to be specific to the gbp buildpackage integration, the correct build environment is picked with a plain pbuilder in... 
[14:00:43] (03CR) 10Dzahn: [C: 03+2] "decom'ed with cookbook" [dns] - 10https://gerrit.wikimedia.org/r/569683 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [14:00:55] (03PS1) 10Dzahn: apt_repo: stop including migration rsync class [puppet] - 10https://gerrit.wikimedia.org/r/592662 (https://phabricator.wikimedia.org/T224576) [14:02:23] (03PS2) 10Dzahn: apt_repo: stop including migration rsync class [puppet] - 10https://gerrit.wikimedia.org/r/592662 (https://phabricator.wikimedia.org/T224576) [14:02:42] (03CR) 10Dzahn: [C: 03+2] apt_repo: stop including migration rsync class [puppet] - 10https://gerrit.wikimedia.org/r/592662 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [14:06:41] (03PS2) 10Dzahn: remove install1002/install2002 [dns] - 10https://gerrit.wikimedia.org/r/569683 (https://phabricator.wikimedia.org/T224576) [14:07:07] (03CR) 10jerkins-bot: [V: 04-1] remove install1002/install2002 [dns] - 10https://gerrit.wikimedia.org/r/569683 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [14:09:33] (03PS3) 10Dzahn: remove install1002/install2002 [dns] - 10https://gerrit.wikimedia.org/r/569683 (https://phabricator.wikimedia.org/T224576) [14:12:44] (03CR) 10Dzahn: [C: 03+2] remove install1002/install2002 [dns] - 10https://gerrit.wikimedia.org/r/569683 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [14:13:13] (03Abandoned) 10Hashar: Update .ruby-version to what is running in production [puppet] - 10https://gerrit.wikimedia.org/r/591361 (https://phabricator.wikimedia.org/T250538) (owner: 10Hashar) [14:13:15] (03PS4) 10Dzahn: remove install1002/install2002 [dns] - 10https://gerrit.wikimedia.org/r/569683 (https://phabricator.wikimedia.org/T224576) [14:15:00] 10Operations, 10Patch-For-Review: Upgrade install servers to Buster - https://phabricator.wikimedia.org/T224576 (10Dzahn) [14:15:12] 10Operations, 10Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 
(10Dzahn) [14:15:14] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10Dzahn) [14:15:16] 10Operations, 10Patch-For-Review: Upgrade install servers to Buster - https://phabricator.wikimedia.org/T224576 (10Dzahn) 05Open→03Resolved [14:15:18] 10Operations, 10Patch-For-Review: Sort out plan for install* servers in edge sites - https://phabricator.wikimedia.org/T242602 (10Dzahn) [14:15:30] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1002&panelId=37 [14:15:48] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1004-cloudelastic-chi-eqiad on cloudelastic1004 is CRITICAL: 164.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1004&panelId=37 [14:16:51] ACKNOWLEDGEMENT - Rate of JVM GC Old generation-s runs - cloudelastic1004-cloudelastic-chi-eqiad on cloudelastic1004 is CRITICAL: 164.7 gt 100 daniel_zahn https://phabricator.wikimedia.org/T231517 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloude [14:16:51] d=37 [14:18:35] (03CR) 10Muehlenhoff: [C: 03+2] Enable base::service_auto_restart for pmacctd [puppet] - 10https://gerrit.wikimedia.org/r/592636 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:19:20] (03CR) 10Dzahn: create a role like simplelamp but using mariadb, not 
mysql (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/592653 (https://phabricator.wikimedia.org/T162070) (owner: 10Dzahn) [14:19:55] (03PS3) 10Dzahn: create a role like simplelamp but using mariadb, not mysql [puppet] - 10https://gerrit.wikimedia.org/r/592653 (https://phabricator.wikimedia.org/T162070) [14:19:57] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission es2001, es2002, es2003, es2004 - https://phabricator.wikimedia.org/T222592 (10Papaul) p:05Triage→03Low [14:20:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1078 for schema change', diff saved to https://phabricator.wikimedia.org/P11045 and previous config saved to /var/cache/conftool/dbconfig/20200427-142006-marostegui.json [14:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:28] (03CR) 10jerkins-bot: [V: 04-1] create a role like simplelamp but using mariadb, not mysql [puppet] - 10https://gerrit.wikimedia.org/r/592653 (https://phabricator.wikimedia.org/T162070) (owner: 10Dzahn) [14:25:40] (03PS1) 10Ottomata: eventgate-analytics-external - Support backwards compatible eventlogging_ topic prefixing [deployment-charts] - 10https://gerrit.wikimedia.org/r/592664 (https://phabricator.wikimedia.org/T238230) [14:26:39] (03CR) 10Ottomata: [C: 03+2] eventgate-analytics-external - Support backwards compatible eventlogging_ topic prefixing [deployment-charts] - 10https://gerrit.wikimedia.org/r/592664 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [14:26:42] (03CR) 10Muehlenhoff: [C: 03+2] Enable base::service_auto_restart for fastnetmon [puppet] - 10https://gerrit.wikimedia.org/r/592643 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:27:18] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' . 
[14:27:18] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' . [14:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:32] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is OK: (C)100 gt (W)80 gt 78.31 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37 [14:30:31] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' . [14:30:31] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' . [14:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:39] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/592659 (https://phabricator.wikimedia.org/T233931) (owner: 10Jbond) [14:33:36] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'production' . [14:33:36] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-analytics-external' for release 'canary' . 
[14:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:10] (03PS1) 10KartikMistry: Update cxserver to 2020-04-27-061703-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/592665 (https://phabricator.wikimedia.org/T249852) [14:40:47] (03CR) 10Muehlenhoff: profile::idp: add mcrouter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/592642 (https://phabricator.wikimedia.org/T233933) (owner: 10Jbond) [14:45:12] (03PS5) 10Ppchelko: changeprop: Use global puppet CA cert [deployment-charts] - 10https://gerrit.wikimedia.org/r/589570 (https://phabricator.wikimedia.org/T249633) (owner: 10Hnowlan) [14:45:47] (03PS4) 10Jbond: profile::idp: add mcrouter [puppet] - 10https://gerrit.wikimedia.org/r/592642 (https://phabricator.wikimedia.org/T233933) [14:46:20] (03CR) 10Ppchelko: [C: 03+2] changeprop: Use global puppet CA cert [deployment-charts] - 10https://gerrit.wikimedia.org/r/589570 (https://phabricator.wikimedia.org/T249633) (owner: 10Hnowlan) [14:46:45] (03Merged) 10jenkins-bot: changeprop: Use global puppet CA cert [deployment-charts] - 10https://gerrit.wikimedia.org/r/589570 (https://phabricator.wikimedia.org/T249633) (owner: 10Hnowlan) [14:46:47] !log pool cp4026 running ATS 8.1.0 - T249335 [14:46:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:54] T249335: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 [14:47:14] (03CR) 10Jbond: "thanks updated" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/592642 (https://phabricator.wikimedia.org/T233933) (owner: 10Jbond) [14:47:31] (03PS5) 10Jbond: apero_cas: alow ability to use memcached for tickets [puppet] - 10https://gerrit.wikimedia.org/r/592660 (https://phabricator.wikimedia.org/T233933) [14:47:41] (03PS4) 10Jbond: apero_cas: enable memcached on idp_test [puppet] - 10https://gerrit.wikimedia.org/r/592661 
(https://phabricator.wikimedia.org/T233931) [14:50:06] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1004-cloudelastic-chi-eqiad on cloudelastic1004 is OK: (C)100 gt (W)80 gt 72.2 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1004&panelId=37 [14:50:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1078', diff saved to https://phabricator.wikimedia.org/P11046 and previous config saved to /var/cache/conftool/dbconfig/20200427-145010-marostegui.json [14:50:13] !log setting default etherpadlite db on m1 to utf8mb4_bin [14:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:09] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: scb1001: Memory correctable errors -EDAC- - https://phabricator.wikimedia.org/T250482 (10akosiaris) I 'll second @MoritzMuehlenhoff on this one. No need for that. We can even take the entire server out of rotation now that most apps are off of the cluster[... 
[14:56:10] 10Operations, 10observability: run nic_saturation_exporter on all physical hosts - https://phabricator.wikimedia.org/T250401 (10CDanis) [14:58:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1075 for schema change', diff saved to https://phabricator.wikimedia.org/P11047 and previous config saved to /var/cache/conftool/dbconfig/20200427-145851-marostegui.json [14:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:53] 10Operations, 10Cloud-VPS (Project-requests), 10cloud-services-team (Kanban): Create vm-reaper job to manage lifespan of VMs - https://phabricator.wikimedia.org/T251152 (10Andrew) [15:12:41] 10Operations, 10DBA: Upgrade and restart s5 and s6 primary DB master - https://phabricator.wikimedia.org/T251154 (10Marostegui) [15:12:54] 10Operations, 10DBA: Upgrade and restart s5 and s6 primary DB master - https://phabricator.wikimedia.org/T251154 (10Marostegui) [15:13:01] 10Operations, 10DBA: Upgrade and restart s5 and s6 primary DB master - https://phabricator.wikimedia.org/T251154 (10Marostegui) p:05Triage→03Medium [15:13:32] 10Operations, 10netops, 10observability: replace check_ripe_atlas Python script with a check_prometheus backed by atlasexporter data - https://phabricator.wikimedia.org/T251155 (10CDanis) [15:13:46] 10Operations, 10netops, 10observability: replace check_ripe_atlas Python script with a check_prometheus backed by atlasexporter data - https://phabricator.wikimedia.org/T251155 (10CDanis) p:05Triage→03Low [15:14:01] 10Operations, 10netops, 10observability: replace check_ripe_atlas Python script with a check_prometheus backed by atlasexporter data - https://phabricator.wikimedia.org/T251155 (10CDanis) [15:14:03] 10Operations, 10observability: Add RIPE atlas data to Prometheus - https://phabricator.wikimedia.org/T167689 (10CDanis) [15:16:10] 10Operations, 10netops, 10observability: add traceroute measurements to RIPE Atlas prometheus data - https://phabricator.wikimedia.org/T251156 
(10CDanis) p:05Triage→03Low [15:16:12] 10Operations, 10observability: Add RIPE atlas data to Prometheus - https://phabricator.wikimedia.org/T167689 (10CDanis) [15:16:15] 10Operations, 10netops, 10observability: add traceroute measurements to RIPE Atlas prometheus data - https://phabricator.wikimedia.org/T251156 (10CDanis) [15:16:17] 10Operations, 10observability: Add RIPE atlas data to Prometheus - https://phabricator.wikimedia.org/T167689 (10CDanis) 05Open→03Resolved Broke out some pending nice-to-haves into child tasks; resolving this one. [15:19:30] 10Operations, 10DBA: Upgrade and restart s3 and s7 primary DB master - https://phabricator.wikimedia.org/T251158 (10Marostegui) [15:19:44] 10Operations, 10DBA: Upgrade and restart s3 and s7 primary DB master - https://phabricator.wikimedia.org/T251158 (10Marostegui) p:05Triage→03Medium [15:20:40] 10Operations, 10DBA: Upgrade and restart s5 and s6 primary DB master - https://phabricator.wikimedia.org/T251154 (10Marostegui) [15:22:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1075', diff saved to https://phabricator.wikimedia.org/P11048 and previous config saved to /var/cache/conftool/dbconfig/20200427-152242-marostegui.json [15:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:42] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . 
[15:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:01] (03PS1) 10Hnowlan: changeprop: release new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/592683 (https://phabricator.wikimedia.org/T249633) [15:35:58] (03PS11) 10Cwhite: smart: add _check_output wrapper method and tests [puppet] - 10https://gerrit.wikimedia.org/r/587811 (https://phabricator.wikimedia.org/T199236) [15:36:02] (03CR) 10Ppchelko: [C: 03+2] changeprop: release new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/592683 (https://phabricator.wikimedia.org/T249633) (owner: 10Hnowlan) [15:36:14] (03CR) 10jerkins-bot: [V: 04-1] smart: add _check_output wrapper method and tests [puppet] - 10https://gerrit.wikimedia.org/r/587811 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [15:36:18] (03CR) 10Cwhite: smart: add _check_output wrapper method and tests (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/587811 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [15:36:28] (03Merged) 10jenkins-bot: changeprop: release new version [deployment-charts] - 10https://gerrit.wikimedia.org/r/592683 (https://phabricator.wikimedia.org/T249633) (owner: 10Hnowlan) [15:38:46] (03PS12) 10Cwhite: smart: add _check_output wrapper method and tests [puppet] - 10https://gerrit.wikimedia.org/r/587811 (https://phabricator.wikimedia.org/T199236) [15:39:34] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . 
[15:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:20] !log restart wdqs-updater on all servers [15:45:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:54] (03CR) 10Thcipriani: [C: 03+1] gerrit: add Keepalive=on to ProxyPass config lines [puppet] - 10https://gerrit.wikimedia.org/r/591304 (https://phabricator.wikimedia.org/T246763) (owner: 10Dzahn) [15:48:16] (03CR) 10Filippo Giunchedi: "There's an extra print() but LGTM otherwise" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/587811 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [15:50:23] (03PS13) 10Cwhite: smart: add _check_output wrapper method and tests [puppet] - 10https://gerrit.wikimedia.org/r/587811 (https://phabricator.wikimedia.org/T199236) [15:50:35] (03PS1) 10Hnowlan: changeprop: correct naming of puppet_ca_cert variable [deployment-charts] - 10https://gerrit.wikimedia.org/r/592686 (https://phabricator.wikimedia.org/T249633) [15:50:55] (03CR) 10Cwhite: smart: add _check_output wrapper method and tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/587811 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [15:52:03] (03CR) 10Ppchelko: [C: 03+2] changeprop: correct naming of puppet_ca_cert variable [deployment-charts] - 10https://gerrit.wikimedia.org/r/592686 (https://phabricator.wikimedia.org/T249633) (owner: 10Hnowlan) [15:52:24] (03Merged) 10jenkins-bot: changeprop: correct naming of puppet_ca_cert variable [deployment-charts] - 10https://gerrit.wikimedia.org/r/592686 (https://phabricator.wikimedia.org/T249633) (owner: 10Hnowlan) [15:52:45] (03PS1) 10Ema: Revert "cache: test purged on cp2030, part of cache_upload" [puppet] - 10https://gerrit.wikimedia.org/r/592687 [15:53:46] 10Operations, 10LDAP: Add uid=srodlund,ou=people,dc=wikimedia,dc=org to cn=wmf,ou=groups,dc=wikimedia,dc=org - https://phabricator.wikimedia.org/T251163 (10bd808) [15:54:07] (03CR) 10Ema: [C: 03+2] Revert 
"cache: test purged on cp2030, part of cache_upload" [puppet] - 10https://gerrit.wikimedia.org/r/592687 (owner: 10Ema) [15:54:28] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [15:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:10] 10Operations, 10SRE-Access-Requests, 10LDAP: Add uid=srodlund,ou=people,dc=wikimedia,dc=org to cn=wmf,ou=groups,dc=wikimedia,dc=org - https://phabricator.wikimedia.org/T251163 (10bd808) [15:58:49] (03CR) 10Filippo Giunchedi: [C: 03+1] smart: add _check_output wrapper method and tests [puppet] - 10https://gerrit.wikimedia.org/r/587811 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [15:59:01] 10Operations, 10SRE-Access-Requests, 10LDAP: Add uid=srodlund,ou=people,dc=wikimedia,dc=org to cn=wmf,ou=groups,dc=wikimedia,dc=org - https://phabricator.wikimedia.org/T251163 (10bd808) @Bmueller Your +1 would be appreciated here per https://wikitech.wikimedia.org/wiki/Production_access. [16:01:37] (03PS1) 10JMeybohm: Add a script to simplify imports of new upstream versions [debs/helm] - 10https://gerrit.wikimedia.org/r/592689 [16:01:39] (03PS1) 10JMeybohm: Clean up debian directory [debs/helm] - 10https://gerrit.wikimedia.org/r/592690 [16:02:17] !log hnowlan@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' . [16:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:52] 10Operations, 10Analytics, 10LDAP-Access-Requests, 10Product-Analytics (Kanban): LDAP access to the wmf group for Antonino Hemmer (superset, turnilo, hue) - https://phabricator.wikimedia.org/T251123 (10crusnov) @Dzahn is this complete then? [16:07:24] !log hnowlan@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' . 
[16:07:27] 10Operations, 10ops-eqiad, 10serviceops: mw1280 correctable memory errors logged in getsel - https://phabricator.wikimedia.org/T251077 (10crusnov) p:05Triage→03Medium [16:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:43] 10Operations, 10Analytics, 10LDAP-Access-Requests, 10Product-Analytics (Kanban): LDAP access to the wmf group for Antonino Hemmer (superset, turnilo, hue) - https://phabricator.wikimedia.org/T251123 (10Dzahn) @crusnov No, it's not complete. There were just 2 tickets for the same thing. But it still needs d... [16:08:52] 10Operations: git-pbuilder incorrectly copies DIST=stretch package files into results/buster-amd64 on deneb.codfw.wmnet - https://phabricator.wikimedia.org/T250803 (10crusnov) p:05Triage→03Medium [16:15:28] (03PS2) 10Dzahn: gerrit: add Keepalive=on to ProxyPass config lines [puppet] - 10https://gerrit.wikimedia.org/r/591304 (https://phabricator.wikimedia.org/T246763) [16:18:28] !log gerrit - enabling Keepalive for httpd->jetty reverse proxy connections (T246763) [16:24:10] !log gerrit - restarted httpd [16:25:37] a bot is missing [16:41:11] 10Operations, 10Analytics, 10LDAP-Access-Requests: LDAP access to the wmf group for Antonino Hemmer (superset, turnilo, hue) - https://phabricator.wikimedia.org/T251123 (10nshahquinn-wmf) [16:41:32] 10Operations: git-pbuilder incorrectly copies DIST=stretch package files into results/buster-amd64 on deneb.codfw.wmnet - https://phabricator.wikimedia.org/T250803 (10akosiaris) This is https://github.com/agx/git-buildpackage/commit/021fe9dfb39815827e14199ff480d8c00096df55 and also bigger than just `results/`. I... 
[16:41:55] (03CR) 10Ppchelko: [C: 03+2] changeprop: enable more rules, increase number of replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/592700 (https://phabricator.wikimedia.org/T248677) (owner: 10Hnowlan) [16:42:20] (03Merged) 10jenkins-bot: changeprop: enable more rules, increase number of replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/592700 (https://phabricator.wikimedia.org/T248677) (owner: 10Hnowlan) [16:46:18] !log hnowlan@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [16:46:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:08] (03CR) 10Marostegui: "The commit says 10 seconds, but you are actually using -w 5, so I am not sure if your intention was to place 10 seconds, or 5 :)" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/592247 (owner: 10Jcrespo) [16:49:51] (03CR) 10JMeybohm: "> Patch Set 1: Verified+2" [debs/helm] - 10https://gerrit.wikimedia.org/r/592689 (owner: 10JMeybohm) [16:53:26] (03CR) 10Herron: role::mail::mx: enable jumpcloud test domain (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/588425 (https://phabricator.wikimedia.org/T244792) (owner: 10Jbond) [16:54:28] 10Operations, 10Traffic, 10Wikimedia-Apache-configuration: Stop advertising webmaster@wikimedia.org in apache configs - https://phabricator.wikimedia.org/T251005 (10crusnov) p:05Triage→03Medium [16:55:44] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:58:00] 10Operations, 10Traffic, 10Wikimedia-Apache-configuration: Stop advertising webmaster@wikimedia.org in 
apache configs - https://phabricator.wikimedia.org/T251005 (10Dzahn) Should be replaced with noc@wikimedia.org. The ones above are the special cases. Almost everything else uses noc@ as ServerAdmin if you... [17:00:05] gehel and onimisionipe: I, the Bot under the Fountain, allow thee, The Deployer, to do Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200427T1700). [17:01:09] (03CR) 10Elukey: [C: 03+1] mcrouter: add default timeout values [puppet] - 10https://gerrit.wikimedia.org/r/592641 (owner: 10Jbond) [17:03:10] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:04:49] (03PS1) 10Reedy: Use noc@ not webmaster@ [puppet] - 10https://gerrit.wikimedia.org/r/592712 (https://phabricator.wikimedia.org/T251005) [17:13:19] (03CR) 10Dzahn: [C: 03+1] Use noc@ not webmaster@ [puppet] - 10https://gerrit.wikimedia.org/r/592712 (https://phabricator.wikimedia.org/T251005) (owner: 10Reedy) [17:13:30] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [17:30:47] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:30:47] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:30:47] (03PS1) 10Ottomata: InitialiseSettings-labs.php - Merge default beta stream config 
with production stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592715 (https://phabricator.wikimedia.org/T242122) [17:30:47] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:31:12] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [17:31:26] (03CR) 10Ottomata: "Hi James, will this do what I want? I want to only have to add overrides for streams in beta. All the production specified stream config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592715 (https://phabricator.wikimedia.org/T242122) (owner: 10Ottomata) [17:31:54] !log ppchelko@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'changeprop' for release 'production' . [17:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:36] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 31 probes of 550 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:34:02] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:34:36] !log ppchelko@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'changeprop' for release 'production' . 
[17:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:10] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:37:34] 10Operations, 10Discovery-Search: Create cookbook to reindex into elasticsearch / cirrus - https://phabricator.wikimedia.org/T219507 (10Mstyles) [17:39:50] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:41:56] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:42:16] (03PS1) 10CRusnov: Edit Project Config [software/netbox] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/592720 [17:43:29] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission es2001, es2002, es2003, es2004 - https://phabricator.wikimedia.org/T222592 (10Papaul) ` [edit interfaces interface-range disabled] member xe-7/0/6 { ... } + member ge-1/0/2; [edit interfaces] - ge-1/0/2 { - description es2001... 
[17:43:42] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:45:19] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission es2001, es2002, es2003, es2004 - https://phabricator.wikimedia.org/T222592 (10Papaul) [17:45:20] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:45:38] (03PS1) 10CRusnov: Update Netbox to v2.8.1-wmf [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/592721 [17:46:22] (03CR) 10CRusnov: "This change is ready for review." [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/592721 (owner: 10CRusnov) [17:46:34] (03CR) 10CRusnov: [C: 04-1] "Oops missed a step.e" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/592721 (owner: 10CRusnov) [17:47:25] (03CR) 10Jhedden: [C: 03+2] cloudvps: metricsinfra add prometheus alert manager and email notifications [puppet] - 10https://gerrit.wikimedia.org/r/591202 (https://phabricator.wikimedia.org/T250206) (owner: 10Jhedden) [17:49:00] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:49:09] 10Operations, 10ops-eqiad, 10serviceops: mw1280 correctable memory errors logged in getsel - https://phabricator.wikimedia.org/T251077 (10wiki_willy) 
a:03Jclark-ctr [17:49:28] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:50:55] (03CR) 10Jforrester: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592715 (https://phabricator.wikimedia.org/T242122) (owner: 10Ottomata) [17:51:43] (03CR) 10Ottomata: "Perfect thank you!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592715 (https://phabricator.wikimedia.org/T242122) (owner: 10Ottomata) [17:51:52] (03CR) 10Ottomata: [C: 03+2] InitialiseSettings-labs.php - Merge default beta stream config with production stream config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592715 (https://phabricator.wikimedia.org/T242122) (owner: 10Ottomata) [17:52:14] (03PS1) 10Cmjohnson: Adding prodcution dns for restbase1028-1030 [dns] - 10https://gerrit.wikimedia.org/r/592725 (https://phabricator.wikimedia.org/T241784) [17:54:52] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [17:55:21] (03CR) 10Cmjohnson: [C: 03+2] Adding prodcution dns for restbase1028-1030 [dns] - 10https://gerrit.wikimedia.org/r/592725 (https://phabricator.wikimedia.org/T241784) (owner: 10Cmjohnson) [17:56:40] (03PS1) 10Ottomata: wgEventStreams - Add SearchSatisfaction stream config and remove beta specific overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592726 (https://phabricator.wikimedia.org/T238230) [17:59:00] (03CR) 10Ottomata: [C: 03+2] wgEventStreams - Add SearchSatisfaction stream config and remove beta specific overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592726 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [17:59:34] (03PS4) 10Jhedden: cloudvps: metricsinfra add 
prometheus alert manager and email notifications [puppet] - 10https://gerrit.wikimedia.org/r/591202 (https://phabricator.wikimedia.org/T250206) [18:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Morning SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200427T1800). [18:00:04] MatmaRex and DannyS712: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:09] hello [18:00:15] I can SWAT today! [18:00:16] oh [18:00:18] swat time [18:00:20] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:00:32] hi i was about to sync-file something for beta [18:00:46] ottomata: go ahead and ping me once done :-) [18:00:49] already merged, mind if i do that real quick before swatting?
[18:00:50] ok thanks [18:01:11] I'm here for swatting [18:01:30] Though as an fyi there is nothing to be done to test mine - the change is removing a setting that matches the default [18:01:50] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:02:08] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:02:16] (03Abandoned) 10CRusnov: Edit Project Config [software/netbox] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/592720 (owner: 10CRusnov) [18:02:19] MatmaRex: +2'ed your backport as it will take some time to get merged [18:02:30] yup, thanks [18:02:46] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: wgEventStreams: configure SearchSatisfaction - T249261 (duration: 00m 58s) [18:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:52] T249261: Vertical: Migrate SearchSatisfaction EventLogging event stream to Event Platform - https://phabricator.wikimedia.org/T249261 [18:02:57] (03CR) 10Jhedden: [C: 03+2] cloudvps: metricsinfra add prometheus alert manager and email notifications [puppet] - 10https://gerrit.wikimedia.org/r/591202 (https://phabricator.wikimedia.org/T250206) (owner: 10Jhedden) [18:03:12] (03PS2) 10CRusnov: Update Netbox to v2.8.1-wmf [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/592721 [18:04:13] (03PS3) 10DannyS712: Remove use of `wgAllowImageMoving` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592518 (https://phabricator.wikimedia.org/T245293) [18:04:37] !log 
otto@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: wgEventStreams: in beta, merge settings from production - T242122 (duration: 00m 56s) [18:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:43] T242122: Deploy EventStreamConfig extension - https://phabricator.wikimedia.org/T242122 [18:04:59] all done Urbanecm thank you [18:05:04] thanks ottomata ! [18:05:16] ottomata: You don't need to sync the -labs files, BTW. [18:05:22] oh! i don't? [18:05:28] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method= [18:05:29] ottomata: Production servers never load them. Only Beta Cluster. [18:05:34] Intentional isolation. [18:05:36] Hmm. [18:05:37] i thought we had to to make sure prod was always 100% in sync [18:05:38] even though [18:05:45] they weren't used [18:05:50] Yeah, for the real files. [18:05:57] i guess I should at least just merge on deployment host? [18:06:02] but no need to sync? [18:06:03] yup [18:06:04] For the -labs ones, just make sure that there's a clean checkout of master on the deployment host. [18:06:07] ahh k [18:06:16] that is good to know [18:06:17] thank you [18:06:29] Any time.
[18:06:58] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592518 (https://phabricator.wikimedia.org/T245293) (owner: 10DannyS712) [18:08:12] (03Merged) 10jenkins-bot: Remove use of `wgAllowImageMoving` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592518 (https://phabricator.wikimedia.org/T245293) (owner: 10DannyS712) [18:09:10] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method= [18:09:30] DannyS712: syncing [18:10:25] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 8b71f38: Remove use of `wgAllowImageMoving` (T245293) (duration: 00m 57s) [18:10:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:32] T245293: Remove $wgAllowImageMoving - https://phabricator.wikimedia.org/T245293 [18:10:48] (03PS1) 10Cmjohnson: Adding new restbases to dhcpd file and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/592730 (https://phabricator.wikimedia.org/T241784) [18:11:24] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:11:32] (03PS1) 10Urbanecm: Create several namespace aliases for thwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592731 (https://phabricator.wikimedia.org/T251134) [18:11:34] 
PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:12:40] (03CR) 10jerkins-bot: [V: 04-1] Create several namespace aliases for thwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592731 (https://phabricator.wikimedia.org/T251134) (owner: 10Urbanecm) [18:12:46] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:12:46] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:12:48] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:12:58] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:13:04] (03CR) 10Cmjohnson: [C: 03+2] Adding new restbases to dhcpd file and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/592730 (https://phabricator.wikimedia.org/T241784) (owner: 10Cmjohnson) [18:13:10] RECOVERY - 
recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:13:50] (03PS2) 10Urbanecm: Create several namespace aliases for thwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592731 (https://phabricator.wikimedia.org/T251134) [18:14:48] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:14:48] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review: (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10Cmjohnson) [18:17:27] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:17:27] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [18:17:27] PROBLEM - PHP opcache health on mwdebug2001 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [18:17:28] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike 
- good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:17:38] MatmaRex: pulled onto mwdebug1001, could you check, please? [18:17:57] yes. looking [18:18:40] 10Operations, 10Traffic: Statistics on a CN banner - https://phabricator.wikimedia.org/T251177 (10Ciell) [18:18:54] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [18:19:12] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:19:52] Urbanecm: all looks good [18:19:58] thanks, going to sync [18:20:04] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:20:10] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:20:14] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:21:17] (03PS3) 10Urbanecm: Create several namespace aliases for thwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592731 (https://phabricator.wikimedia.org/T251134) [18:21:22] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592731 (https://phabricator.wikimedia.org/T251134) (owner: 10Urbanecm) [18:21:44] !log urbanecm@deploy1001 Synchronized php-1.35.0-wmf.28/extensions/Kartographer/modules/: SWAT: 
6cd2847: Do not use remove() on maplinks (T250620; T251053) (duration: 00m 58s) [18:21:47] MatmaRex: done [18:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:51] T250620: Visual Editor fails during save dialog on specific page containing a Kartographer map - https://phabricator.wikimedia.org/T250620 [18:21:51] T251053: VisualEditor throws map.remove error while loading pages with maps for editing - https://phabricator.wikimedia.org/T251053 [18:21:54] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:22:18] (03Merged) 10jenkins-bot: Create several namespace aliases for thwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592731 (https://phabricator.wikimedia.org/T251134) (owner: 10Urbanecm) [18:22:53] thanks Urbanecm [18:22:58] happy to help! [18:24:00] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 56a447e: Create several namespace aliases for thwikisource (T251134) (duration: 00m 58s) [18:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:06] T251134: Create an alias for some namespace in Thai Wikisource - https://phabricator.wikimedia.org/T251134 [18:24:56] Prod clear? [18:25:16] James_F: yes [18:25:22] Cool, thanks. [18:25:28] (03PS3) 10Jforrester: VisualEditor: Allow external link paste on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/591416 (owner: 10Esanders) [18:25:32] !log Run namespaceDupes.php for thwikisource (T251134) [18:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:45] (03CR) 10Jforrester: [C: 03+2] "Let's fix the tech-debt soon, though." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/591416 (owner: 10Esanders) [18:26:34] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:26:41] (03Merged) 10jenkins-bot: VisualEditor: Allow external link paste on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/591416 (owner: 10Esanders) [18:26:43] (03PS2) 10Reedy: Use noc@ not webmaster@ [puppet] - 10https://gerrit.wikimedia.org/r/592712 (https://phabricator.wikimedia.org/T251005) [18:26:46] (03PS1) 10Reedy: Use noc@ not webmaster@ in lists.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/592734 (https://phabricator.wikimedia.org/T251005) [18:26:54] 10Operations, 10Traffic, 10Wikimedia-Apache-configuration, 10Patch-For-Review: Stop advertising webmaster@wikimedia.org in apache configs - https://phabricator.wikimedia.org/T251005 (10Reedy) >>! In T251005#6086077, @Dzahn wrote: > .. except the apache and httpd modules are also used in cloud VPS projects... 
[18:27:30] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:27:54] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:27:54] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:28:04] (03PS1) 10Ottomata: wgEventStreams - properly prefix legacy eventlogging analytics stream names with eventlogging_ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592735 (https://phabricator.wikimedia.org/T238230) [18:28:42] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 52 probes of 550 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:29:09] (03CR) 10jerkins-bot: [V: 04-1] wgEventStreams - properly prefix legacy eventlogging analytics stream names with eventlogging_ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592735 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [18:29:44] (03PS2) 10Ottomata: wgEventStreams - properly prefix legacy eventlogging analytics stream names with eventlogging_ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592735 (https://phabricator.wikimedia.org/T238230) [18:29:54] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was 
received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:30:10] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:31:08] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:31:13] 10Operations: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183 (10CDanis) >>! In T215183#5575143, @CDanis wrote: > An update: as of now, 20.8% of the fleet (or 30% of hosts with software RAID enabled) have redundant bootloaders. This is just from fixing the partman configs and... [18:31:22] 10Operations: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183 (10CDanis) p:05Medium→03Low [18:32:58] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:33:04] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:33:22] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:34:32] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 31 probes of 550 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts 
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:34:46] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:34:48] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:35:22] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:35:42] (03PS3) 10Reedy: [WIP] Use noc@ not webmaster@ [puppet] - 10https://gerrit.wikimedia.org/r/592712 (https://phabricator.wikimedia.org/T251005) [18:36:34] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:38:26] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:39:02] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Use noc@ not webmaster@ [puppet] - 10https://gerrit.wikimedia.org/r/592712 (https://phabricator.wikimedia.org/T251005) (owner: 10Reedy) [18:39:11] Urbanecm: done with swat? 
[18:40:14] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed [18:40:14] ponse was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:40:24] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed o [18:40:24] nse was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:40:44] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response wa [18:40:44] ain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:41:14] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: 
/{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed o [18:41:14] nse was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:41:34] we have more busy appserver worker threads than idle ones [18:41:40] we might be about to have an outage [18:42:00] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:42:10] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out bef [18:42:10] s received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:42:16] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} 
(Description addition suggestions) timed o [18:42:16] nse was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:42:38] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [18:42:46] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:43:04] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:43:04] (03CR) 10Ottomata: [C: 03+2] wgEventStreams - properly prefix legacy eventlogging analytics stream names with eventlogging_ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592735 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [18:43:57] hm Urbanecm sorry maybe you aren't done with swat? [18:44:07] i see an unmerged commit in mediawiki-staging [18:44:08] Oh, sorry, got distracted. [18:44:19] ottomata: I am, but James_F was deploying sth [18:44:44] James_F: my commit is there too now [18:44:48] but it is a noop atm [18:44:51] nothing is using it yet [18:44:52] And now there's no way to safely deploy mine without yours.
[18:45:01] You can sync it [18:45:03] This is why you're meant to ask in here first. :-) [18:45:05] OK. [18:45:15] i did...but heard no answer and then proceeded... sorry [18:45:15] James_F: git reset --hard HEAD~ isn't safe? [18:45:20] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review: (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10Cmjohnson) [18:45:25] i didn't merge mine yet [18:45:33] just see it in origin/master [18:45:40] Urbanecm: Getting away from master isn't a good idea. [18:45:41] you could merge just yours in instead of origin/master [18:45:43] if you prefer [18:46:00] Eh, it's fine. [18:46:01] then i could merge origin/master [18:46:04] yeah, it is safe [18:46:07] mine isn't used so proceed [18:46:09] (03PS4) 10Reedy: [WIP] Use noc@ not webmaster@ [puppet] - 10https://gerrit.wikimedia.org/r/592712 (https://phabricator.wikimedia.org/T251005) [18:46:09] thanks sorry [18:46:44] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: IS: Set wmgVisualEditorAllowExternalLinkPaste false everywhere except officewiki (duration: 01m 17s) [18:46:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:34] OK, sync going out, all done from my end. Over to ottomata. [18:48:02] no diff in origin/master so am assuming mine is out too.
thank you [18:48:12] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [18:48:13] !log Sync failure to mw1279.eqiad.wmnet (timeout) [18:48:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:26] Okie-dokie [18:48:47] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Ready wmgVisualEditorAllowExternalLinkPaste to set wgVisualEditorAllowExternalLinkPaste (duration: 01m 29s) [18:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:30] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review: (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqi... 
[18:49:30] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:49:32] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:49:42] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [18:50:00] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [18:50:20] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:50:26] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [18:50:26] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:50:26] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:50:33] !log Manually ran `scap pull` on mw1279.eqiad.wmnet as it flaked during deploy. [18:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:40] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var- [18:50:40] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review: (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqi... 
[18:51:10] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:51:10] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:51:12] PROBLEM - termbox eqiad on termbox.svc.eqiad.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [18:51:18] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:51:22] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:51:40] (03CR) 10Reedy: "Do we want to split the hiera variables for each of apache and httpd? Or have them the same?" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/592712 (https://phabricator.wikimedia.org/T251005) (owner: 10Reedy) [18:51:58] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review: (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqi...
[18:52:12] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:52:14] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:52:16] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [18:52:18] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:52:54] James_F: ottomata: what changed in your deploys? [18:52:56] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:52:56] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:53:00] RECOVERY - termbox eqiad on termbox.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [18:53:02] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:53:08] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:53:08] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:53:08] RECOVERY - restbase endpoints health on restbase2022 is OK: 
All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:53:12] we started having a bad appserver latency spike a little before 18:40 [18:53:56] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:53:56] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:54:00] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:54:00] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:54:03] it does not look like databases or memcached are at fault [18:54:20] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [18:56:09] cdanis: Nothing at all about RB. Just a minor config addition for the JS runtime code. [18:56:41] cdanis: The EventLogging change of ottomata's looks OK too? Just a prefixing. 
[18:56:42] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:56:43] James_F: the restbase alerts are caused by the appserver latency increase [18:56:48] okay, maybe it's unrelated then [18:56:52] But the logging system is dark magic from my POV. [18:56:58] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:57:07] or it's possibly something about traffic patterns as well [18:57:16] cdanis: Could be that the php-fpm cache clear added extra load? [18:57:18] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:57:26] it's possible, although it was starting earlier as well [18:57:28] https://grafana.wikimedia.org/d/ifM0GzjWk/cdanis-xxx-php-worker-threads?orgId=1&from=now-2d&to=now [18:57:46] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:57:48] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:57:50] Right. [18:58:11] cdanis: nothing is using that config yet at all anyway [18:58:34] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:58:42] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:58:46] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:58:46] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:59:00] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [18:59:10] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [18:59:20] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:01:00] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:02:35] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review: (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['restbase1029.eqiad.wmnet'] `... [19:02:45] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review: (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['restbase1029.eqiad.wmnet'] `... [19:04:08] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:04:13] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review: (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['restbase1030.eqiad.wmnet'] `... 
[19:04:18] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:04:25] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [19:04:26] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:04:28] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:04:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:41] (03PS1) 10Cmjohnson: Adding production dns for cloudelastic1005-6 [dns] - 10https://gerrit.wikimedia.org/r/592743 (https://phabricator.wikimedia.org/T249062) [19:04:44] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [19:04:56] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: 
/{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:05:40] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review: (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqi... [19:05:42] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review: (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqi... [19:05:51] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [19:05:51] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [19:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:09] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review: (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqi... 
[19:06:16] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:06:26] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review: (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['restbase1029.eqiad.wmnet'] `... [19:06:36] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:06:36] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:06:38] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team), 10Patch-For-Review: (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqi...
[19:06:51] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [19:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:36] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:07:38] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:07:58] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:08:00] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:08:30] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:08:43] (03CR) 10Cmjohnson: [C: 03+2] Adding production dns for cloudelastic1005-6 [dns] - 10https://gerrit.wikimedia.org/r/592743 (https://phabricator.wikimedia.org/T249062) (owner: 10Cmjohnson) [19:09:18] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:09:28] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:09:36] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:09:36] 10Operations, 10ops-eqiad, 10Patch-For-Review: (Need by: TDB) rack/setup/install cloudelastic100[56] - https://phabricator.wikimedia.org/T249062 (10Cmjohnson) [19:11:02] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [19:12:38] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:12:38] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:12:50] PROBLEM - recommendation_api endpoints health on 
scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:12:50] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:12:50] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:13:16] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:14:20] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [19:14:40] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:14:54] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:15:04] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:15:36] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:18:16] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:18:36] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:18:48] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team): (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['restbase1030.eqiad.wmnet'] ` Of which those **FAIL... 
[19:18:50] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:18:50] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:18:54] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team): (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['restbase1028.eqiad.wmnet'] ` Of which those **FAIL... [19:19:16] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:19:20] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:19:23] 10Operations, 10ops-eqiad, 10Core Platform Team Workboards (Clinic Duty Team): (Need by: TBD) rack/setup/install restbase1028, restbase1029, restbase1030 - https://phabricator.wikimedia.org/T241784 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['restbase1029.eqiad.wmnet'] ` Of which those **FAIL... [19:20:28] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [19:20:31] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [19:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:54] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:21:04] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [19:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:50] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [19:21:58] !log ppchelko@deploy1001 Started deploy [changeprop/deploy@ecca66b]: Switch off rules moved to k8s T248677 [19:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:04] T248677: Finalise changeprop migration to k8s - https://phabricator.wikimedia.org/T248677 [19:22:40] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:22:40] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:22:51] !log cmjohnson@cumin1001 END (FAIL) - Cookbook
sre.hosts.downtime (exit_code=99) [19:22:52] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:22:54] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:22:56] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:20] !log ppchelko@deploy1001 Finished deploy [changeprop/deploy@ecca66b]: Switch off rules moved to k8s T248677 (duration: 01m 22s) [19:23:22] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:23:22] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:23:26] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:56] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:24:12] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:24:16] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:24:16] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:24:44] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:25:14] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [19:25:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:04] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [19:26:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:02] RECOVERY - PHP opcache health on mwdebug2001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:27:28] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [19:27:56] PROBLEM - MediaWiki exceptions and fatals per minute on 
icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:28:38] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:28:40] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:28:40] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:29:14] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received:
/{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:29:36] 10Operations, 10ops-eqiad: restbase1025 reported DIMM issues in getsel - https://phabricator.wikimedia.org/T250027 (10wiki_willy) a:03Cmjohnson [19:29:42] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var- [19:29:58] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:31:20] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [19:31:52] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:31:52] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:32:24] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:32:44] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:32:48] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:34:06] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:34:48] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:36:06] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out 
before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:37:06] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:37:08] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:37:44] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:38:00] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:38:02] 10Operations, 10ops-eqiad, 10decommission, 10fundraising-tech-ops: decommission frav1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T222109 (10Papaul) 05Open→03Resolved Complete [19:38:14] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL:
/{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed [19:38:14] ponse was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:38:16] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var- [19:38:16] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:38:44] PROBLEM - proton endpoints health on proton2001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [19:38:48] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:38:48] PROBLEM - restbase endpoints health on restbase2020 is 
CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:38:54] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:39:28] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:39:32] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:39:40] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:40:00] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:40:28] RECOVERY - proton endpoints health on proton2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [19:40:30] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:40:30] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:40:34] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:40:36] RECOVERY - 
recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:40:40] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:40:40] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:41:40] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:41:56] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [19:41:56] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3054 is OK: HTTP OK: HTTP/1.0 200 OK - 22720 bytes in 0.259 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [19:43:21] (03PS1) 10Andrew Bogott: Add wmcs-instancepurge.py and a timer job that uses it [puppet] - 10https://gerrit.wikimedia.org/r/592750 [19:44:02] 10Operations, 10MediaWiki-Cache, 10Traffic, 10serviceops, and 4 others: Stop sending purges for `action=history` for linked pages. 
- https://phabricator.wikimedia.org/T250261 (10Krinkle) a:05Krinkle→03daniel [19:44:20] (03PS2) 10Andrew Bogott: Add wmcs-instancepurge.py and a timer job that uses it [puppet] - 10https://gerrit.wikimedia.org/r/592750 [19:45:04] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:45:16] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:45:20] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:45:38] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:46:12] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:46:12] PROBLEM - recommendation_api 
endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:46:58] (03PS2) 10QEDK: Enable VisualEditor for more namespaces on vecwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592427 (https://phabricator.wikimedia.org/T250419) [19:47:30] (03CR) 10jerkins-bot: [V: 04-1] Add wmcs-instancepurge.py and a timer job that uses it [puppet] - 10https://gerrit.wikimedia.org/r/592750 (owner: 10Andrew Bogott) [19:48:02] (03CR) 10QEDK: "Recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592427 (https://phabricator.wikimedia.org/T250419) (owner: 10QEDK) [19:48:55] (03PS3) 10Andrew Bogott: Add wmcs-instancepurge.py and a timer job that uses it [puppet] - 10https://gerrit.wikimedia.org/r/592750 [19:50:35] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:51:05] (03PS4) 10Andrew Bogott: Add wmcs-instancepurge.py and a timer job that uses it [puppet] - 10https://gerrit.wikimedia.org/r/592750 [19:52:14] 10Operations, 10ops-eqiad, 10decommission, 10fundraising-tech-ops: decommission bismuth.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T248516 (10Papaul) ` [edit interfaces interface-range vlan-administration] - member "ge-[0-1]/0/18"; [edit interfaces interface-range disabled] member ge-0/0... 
[19:52:17] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:52:49] (03PS5) 10Andrew Bogott: Add wmcs-instancepurge.py and a timer job that uses it [puppet] - 10https://gerrit.wikimedia.org/r/592750 [19:53:49] 10Operations, 10ops-eqiad, 10decommission, 10fundraising-tech-ops: decommission bismuth.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T248516 (10Papaul) [19:53:52] (03PS1) 10RLazarus: admin: Upgrade wkandek from ldap_only_users to root shell [puppet] - 10https://gerrit.wikimedia.org/r/592753 (https://phabricator.wikimedia.org/T249352) [19:55:31] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:55:37] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:55:49] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:55:51] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:57:03] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:57:41] 10Operations, 10ops-eqiad, 10decommission: decommission thulium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T203520 (10Papaul) [19:58:55] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:59:11] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:59:11] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [19:59:29] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:59:29] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api 
[20:00:04] halfak and accraze: #bothumor I � Unicode. All rise for Services – Graphoid / Citoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200427T2000). [20:00:29] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:00:32] (03CR) 10Wolfgang Kandek: [C: 04-1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/592753 (https://phabricator.wikimedia.org/T249352) (owner: 10RLazarus) [20:00:39] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:00:52] (03CR) 10Wolfgang Kandek: [C: 03+1] admin: Upgrade wkandek from ldap_only_users to root shell [puppet] - 10https://gerrit.wikimedia.org/r/592753 (https://phabricator.wikimedia.org/T249352) (owner: 10RLazarus) [20:02:51] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:03:11] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:04:25] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:04:30] Does anyone know how to fix the bot at https://phabricator.wikimedia.org/T249613 ? 
[20:04:35] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed [20:04:35] ponse was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:04:47] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:05:36] (03CR) 10CDanis: [C: 03+1] admin: Upgrade wkandek from ldap_only_users to root shell [puppet] - 10https://gerrit.wikimedia.org/r/592753 (https://phabricator.wikimedia.org/T249352) (owner: 10RLazarus) [20:06:13] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:06:51] 10Operations, 10LDAP-Access-Requests, 10LDAP: LDAP access to the wmf group for Sam Walton - https://phabricator.wikimedia.org/T250189 (10bd808) >>! In T250189#6084077, @Dzahn wrote: > This looks like a legit bug. The user samwalton is definitely in LDAP ( using ldapsearch on mwmaint1002) and it's even listed... 
[20:07:29] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:08:51] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response wa [20:08:51] ain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:08:53] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:09:13] DannyS712: Possibly legoktm or valhallasw per https://www.mediawiki.org/wiki/User:ReleaseTaggerBot [20:09:15] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review: Onboarding Wolfgang Kandek - https://phabricator.wikimedia.org/T249352 (10wkandek) [20:09:24] Hello [20:09:52] hi legoktm :) See the ReleaseTaggerBot actions on T249613 [20:09:52] T249613: Cleanup rows from wb_items_per_site that should not be there any more - https://phabricator.wikimedia.org/T249613 [20:09:52] Huh 
[20:10:09] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:10:13] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:10:23] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:10:23] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:10:27] it seems to be wheel warning with itself or maybe just trying to add 2 milestones to the same ticket [20:10:31] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:10:31] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:10:45] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from
Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:11:13] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var- [20:11:37] can you not have 2 milestones on the same ticket? [20:11:47] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:11:49] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:11:49] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:11:53] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:11:57] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:12:05] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:12:19] RECOVERY - 
restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:12:21] (03CR) 10RLazarus: [C: 03+2] admin: Upgrade wkandek from ldap_only_users to root shell [puppet] - 10https://gerrit.wikimedia.org/r/592753 (https://phabricator.wikimedia.org/T249352) (owner: 10RLazarus) [20:12:44] 10Operations, 10ops-eqiad, 10Analytics, 10decommission: Decommission analytics100[1,2] - https://phabricator.wikimedia.org/T205507 (10Papaul) [20:13:21] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:13:22] 2020-04-27 20:01:06,828 - forrestbot - INFO - Adding PHID {'PHID-PROJ-b4o3sskyehcn6dpnw5ma', 'PHID-PROJ-uuxy726tm3zsedw7k6fv'} to T249613. [20:13:22] 2020-04-27 20:01:06,896 - forrestbot - DEBUG - Existing PHIDs: {'PHID-PROJ-b4o3sskyehcn6dpnw5ma', 'PHID-PROJ-xza7ajpadfzs2hdxyzdm', 'PHID-PROJ-cqqgehrardkzz7qwxhiv', 'PHID-PROJ-7ocjej2gottz7cikkdc6', 'PHID-PROJ-vumw5jyyw4r3fv52k34y'} [20:13:23] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:13:29] RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:13:39] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:13:47] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:13:49] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:15:05] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:15:40] DannyS712, bd808: filed https://phabricator.wikimedia.org/T251194, will look more in depth after I finish class [20:16:19] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [20:16:20] 10Operations, 10ops-eqiad, 10decommission: decommission dbstore1001.eqiad.wmnet - https://phabricator.wikimedia.org/T236227 (10Papaul) [20:17:39] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:21:01] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:22:02] 10Operations, 10ops-eqiad, 10Patch-For-Review: Decommission labsdb1002 - https://phabricator.wikimedia.org/T146455 (10Papaul) [20:23:07] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:25:00] (03PS1) 10Cwhite: smart: disable timeout fetching facts [puppet] - 10https://gerrit.wikimedia.org/r/592755 (https://phabricator.wikimedia.org/T199236) [20:25:14] (03PS1) 10Ottomata: refine.pp - Slight refactor to use new unified refine tranform functions [puppet] - 
10https://gerrit.wikimedia.org/r/592756 (https://phabricator.wikimedia.org/T238230) [20:25:18] (03CR) 10jerkins-bot: [V: 04-1] smart: disable timeout fetching facts [puppet] - 10https://gerrit.wikimedia.org/r/592755 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [20:25:25] Thanks @legoktm [20:26:17] (03PS2) 10Ottomata: refine.pp - Slight refactor to use new unified refine tranform functions [puppet] - 10https://gerrit.wikimedia.org/r/592756 (https://phabricator.wikimedia.org/T238230) [20:26:35] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:26:45] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:26:47] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:27:21] (03PS2) 10Cwhite: smart: disable timeout fetching facts [puppet] - 10https://gerrit.wikimedia.org/r/592755 (https://phabricator.wikimedia.org/T199236) [20:27:34] (03CR) 10jerkins-bot: [V: 04-1] smart: disable timeout fetching facts [puppet] - 10https://gerrit.wikimedia.org/r/592755 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [20:28:40] !log holger@mwmaint1002 Restarting uppercaseTitlesForUnicodeTransition.php as part of T219279 for 2 wikis [20:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:49] 
T219279: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 [20:28:58] 10Operations, 10ops-eqiad, 10decommission, 10fundraising-tech-ops: decommission americium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T245038 (10Papaul) ` [edit interfaces interface-range vlan-fundraising] - member "ge-[0-1]/0/10"; [edit interfaces interface-range disabled] member "ge-[0-... [20:28:59] @hknust which wikis? [20:29:13] en,fr [20:29:17] (03PS6) 10Andrew Bogott: Add wmcs-instancepurge.py and a timer job that uses it [puppet] - 10https://gerrit.wikimedia.org/r/592750 [20:29:39] (03CR) 10jerkins-bot: [V: 04-1] refine.pp - Slight refactor to use new unified refine tranform functions [puppet] - 10https://gerrit.wikimedia.org/r/592756 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [20:30:05] (03PS3) 10Cwhite: smart: disable timeout fetching facts [puppet] - 10https://gerrit.wikimedia.org/r/592755 (https://phabricator.wikimedia.org/T199236) [20:30:32] 10Operations, 10ops-eqiad, 10decommission, 10fundraising-tech-ops: decommission americium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T245038 (10Papaul) [20:31:32] (03CR) 10Cwhite: [C: 03+2] smart: disable timeout fetching facts [puppet] - 10https://gerrit.wikimedia.org/r/592755 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [20:32:01] (03PS7) 10Andrew Bogott: Add wmcs-instancepurge.py and a timer job that uses it [puppet] - 10https://gerrit.wikimedia.org/r/592750 [20:32:23] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:32:52] (03PS3) 10Urbanecm: Enable DiscussionTools as a beta feature on zhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592514 
(https://phabricator.wikimedia.org/T251075) (owner: 10VulpesVulpes825) [20:33:25] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:33:33] (03PS10) 10Cwhite: smart: abstract parsing from data gathering and add tests [puppet] - 10https://gerrit.wikimedia.org/r/587816 (https://phabricator.wikimedia.org/T199236) [20:33:37] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:33:45] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:33:45] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:33:49] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:33:57] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:34:01] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:34:23] (03CR) 10jerkins-bot: [V: 04-1] smart: abstract parsing from data gathering and add tests [puppet] - 10https://gerrit.wikimedia.org/r/587816 (https://phabricator.wikimedia.org/T199236) (owner: 10Cwhite) [20:34:42] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Move netflow to TLS encryption/authentication via librdkafka - https://phabricator.wikimedia.org/T248980 (10Nuria) 05Open→03Resolved [20:35:04] (03PS8) 10Andrew Bogott: Add wmcs-instancepurge.py and a timer job that uses it 
[puppet] - https://gerrit.wikimedia.org/r/592750 [20:35:37] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@99c350c]: Update mobileapps to 09cb7c2e [20:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:01] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:36:46] (PS9) Andrew Bogott: Add wmcs-instancepurge.py and a timer job that uses it [puppet] - https://gerrit.wikimedia.org/r/592750 [20:39:25] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:39:49] (PS11) Cwhite: smart: abstract parsing from data gathering and add tests [puppet] - https://gerrit.wikimedia.org/r/587816 (https://phabricator.wikimedia.org/T199236) [20:40:08] (CR) jerkins-bot: [V: -1] Add wmcs-instancepurge.py and a timer job that uses it [puppet] - https://gerrit.wikimedia.org/r/592750 (owner: Andrew Bogott) [20:41:31] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:42:01] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@99c350c]: Update mobileapps to 09cb7c2e (duration: 06m 24s) [20:42:03] (PS8) Cwhite: smart: add tests for _parse_smart_info and _parse_smart_attributes [puppet] - https://gerrit.wikimedia.org/r/587877 (https://phabricator.wikimedia.org/T199236) 
[20:42:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:45] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor [20:42:45] received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:42:51] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor [20:42:51] received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:43:00] (CR) jerkins-bot: [V: -1] smart: add tests for _parse_smart_info and _parse_smart_attributes [puppet] - https://gerrit.wikimedia.org/r/587877 (https://phabricator.wikimedia.org/T199236) (owner: Cwhite) [20:43:01] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: 
/{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out bef [20:43:01] s received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:43:01] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before [20:43:01] eceived: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedi [20:43:01] es/Monitoring/recommendation_api [20:43:02] !log mobileapps deployed failed due to timeouts, rolled back. 
[20:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:09] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:43:11] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:43:15] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:43:17] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good artic [20:43:17] ut before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:43:19] (PS9) Cwhite: smart: add tests for _parse_smart_info and _parse_smart_attributes [puppet] - https://gerrit.wikimedia.org/r/587877 (https://phabricator.wikimedia.org/T199236) [20:43:21] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve 
aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:43:21] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:43:29] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:43:29] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:43:31] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:43:33] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:43:47] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:43:47] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:43:47] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var- [20:44:37] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:44:51] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:44:51] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:45:01] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:45:01] PROBLEM - restbase endpoints health on 
restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:45:01] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:45:07] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:45:15] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:45:19] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:45:23] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:45:37] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:45:46] (PS10) Andrew Bogott: Add wmcs-instancepurge.py and a timer job that uses it [puppet] - https://gerrit.wikimedia.org/r/592750 [20:46:10] (PS1) Ladsgroup: Increase the memory limit from 660MB to 666MB [mediawiki-config] - 
https://gerrit.wikimedia.org/r/592761 [20:46:35] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:46:35] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:46:43] (CR) Framawiki: [C: +1] Increase the memory limit from 660MB to 666MB [mediawiki-config] - https://gerrit.wikimedia.org/r/592761 (owner: Ladsgroup) [20:46:49] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:46:51] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:46:59] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:47:01] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:48:35] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:48:43] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:48:49] (PS11) Andrew Bogott: Add wmcs-instancepurge.py and a timer job that uses it [puppet] - https://gerrit.wikimedia.org/r/592750 [20:49:21] (PS6) Cwhite: smart: simplify PD [puppet] - https://gerrit.wikimedia.org/r/588515 (https://phabricator.wikimedia.org/T199236) [20:49:29] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [20:49:42] (PS3) Cwhite: smart: move metrics registry and metrics init to global [puppet] - https://gerrit.wikimedia.org/r/588759 (https://phabricator.wikimedia.org/T199236) [20:49:53] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedi [20:49:53] se [20:49:53] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:50:11] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:50:11] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out 
before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:50:13] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:50:27] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:50:41] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:50:51] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:50:54] (PS12) Andrew Bogott: Add wmcs-instancepurge.py and a timer job that uses it [puppet] - https://gerrit.wikimedia.org/r/592750 [20:50:57] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:50:59] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:51:03] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:51:03] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: 
/en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:51:19] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [20:51:41] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:51:43] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [20:51:47] (PS6) Zoranzoki21: Update logos for for tiwiki and tiwikt (part I) [mediawiki-config] - https://gerrit.wikimedia.org/r/592506 (https://phabricator.wikimedia.org/T150618) [20:52:01] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:52:03] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:52:05] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:52:05] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:52:07] (PS12) Zoranzoki21: Update logos for for tiwiki and tiwikt (part II) [mediawiki-config] - https://gerrit.wikimedia.org/r/592507 (https://phabricator.wikimedia.org/T150618) [20:52:09] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:52:11] RECOVERY - 
restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:52:19] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:52:19] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:52:21] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:52:31] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:52:31] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:52:35] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:52:39] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:52:39] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:52:39] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:52:49] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:52:51] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: 
/en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:52:55] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:53:31] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:53:35] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:53:51] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:54:15] (CR) RhinosF1: [C: -1] "Per task. Do not merge without ppelberg’s approval or editing team otherwise." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/592514 (https://phabricator.wikimedia.org/T251075) (owner: VulpesVulpes825) [20:54:19] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:54:37] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:54:57] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:55:11] (PS13) Andrew Bogott: Add wmcs-instancepurge.py and a timer job that uses it [puppet] - https://gerrit.wikimedia.org/r/592750 [20:55:39] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:55:49] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@99c350c]: Update mobileapps to 09cb7c2e [20:55:54] !log holger@mwmaint1002 END (enwiki=success, frwiki=fail) uppercaseTitlesForUnicodeTransition.php as part of T219279 [20:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:02] T219279: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 [20:56:09] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:56:41] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@99c350c]: Update mobileapps to 09cb7c2e (duration: 00m 52s) [20:56:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:54] (PS14) Andrew Bogott: Add wmcs-instancepurge.py and a timer job that uses it [puppet] - 
https://gerrit.wikimedia.org/r/592750 [20:57:53] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:58:03] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:58:09] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:58:35] !log mobileapps deploy on canary failed due to timeouts, rolled back. [20:58:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:04] Reedy and sbassett: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200427T2100). [21:00:09] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:00:29] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [21:05:29] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:05:49] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:07:25] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:09:19] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:09:29] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:10:26] Operations, MediaWiki-General, serviceops, Core Platform Team Workboards (Clinic Duty Team), and 4 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (holger.knust) @tstarling enwiki worked. frwiki failed. I d... 
[21:10:39] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:10:49] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:12:57] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:14:21] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:14:49] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:16:21] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:16:23] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:16:35] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:16:39] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed [21:16:39] ponse was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:19:51] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:19:57] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:20:17] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:20:17] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:20:17] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:20:33] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike 
- good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:20:48] 10Operations, 10ops-codfw: Degraded RAID on restbase2014 - https://phabricator.wikimedia.org/T250050 (10Eevans) 05Open→03Resolved AFAIK, this is complete [21:21:39] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:21:45] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:21:53] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:21:57] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:22:09] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:22:35] (03PS1) 10DannyS712: Activate DiscussionTools as a beta feature on enwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592768 (https://phabricator.wikimedia.org/T249785) [21:23:20] (03PS2) 10DannyS712: Activate DiscussionTools as a beta feature on enwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592768 (https://phabricator.wikimedia.org/T249785) [21:24:31] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [21:25:55] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:25:55] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:25:55] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:26:05] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:27:39] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:27:41] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:27:43] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:28:04] (03CR) 10Jforrester: [C: 04-1] "Per the team." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/592768 (https://phabricator.wikimedia.org/T249785) (owner: 10DannyS712) [21:28:52] 10Operations, 10SRE-Access-Requests, 10LDAP: Add uid=srodlund,ou=people,dc=wikimedia,dc=org to cn=wmf,ou=groups,dc=wikimedia,dc=org - https://phabricator.wikimedia.org/T251163 (10Aklapper) See https://phabricator.wikimedia.org/project/profile/956/ which links to https://phabricator.wikimedia.org/maniphest/ta... [21:29:13] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:29:35] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:31:23] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:31:51] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [21:33:43] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [21:37:41] (03PS1) 10RLazarus: icinga: Privileged access for wkandek [puppet] - 10https://gerrit.wikimedia.org/r/592769 (https://phabricator.wikimedia.org/T249352) [21:39:11] (03CR) 10Wolfgang Kandek: [C: 03+1] icinga: Privileged access for wkandek [puppet] - 10https://gerrit.wikimedia.org/r/592769 (https://phabricator.wikimedia.org/T249352) (owner: 10RLazarus) [21:42:17] (03CR) 10RLazarus: [C: 03+2] icinga: Privileged access for wkandek [puppet] - 10https://gerrit.wikimedia.org/r/592769 (https://phabricator.wikimedia.org/T249352) (owner: 10RLazarus) [21:46:08] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review: Onboarding Wolfgang Kandek - https://phabricator.wikimedia.org/T249352 (10wkandek) [21:55:00] (03PS1) 10Ssingh: Avoid manually formatting the datetime object for OONI's query [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/592770 [21:57:09] (03CR) 10Ssingh: [C: 03+2] Avoid manually formatting the datetime object for OONI's query [software/censorship-monitoring] - 10https://gerrit.wikimedia.org/r/592770 (owner: 10Ssingh) [22:08:11] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592506 (https://phabricator.wikimedia.org/T150618) (owner: 10Zoranzoki21) [22:10:45] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [22:15:19] 10Operations, 10MediaWiki-General, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), and 4 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10Framawiki) >>! In T219279#6086938, @holger.knust wrote: >... [22:19:31] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 127.1 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1001&panelId=37 [22:20:59] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is CRITICAL: 126.1 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1002&panelId=37 [22:22:43] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is CRITICAL: 128.1 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37 [22:25:25] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1004-cloudelastic-chi-eqiad on cloudelastic1004 is CRITICAL: 128.1 gt 100 
https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1004&panelId=37 [22:28:11] 10Operations, 10MediaWiki-General, 10serviceops, 10Core Platform Team Workboards (Clinic Duty Team), and 4 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10DannyS712) >>! In T219279#6087202, @Framawiki wrote: >>>!... [22:41:18] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Onboarding Wolfgang Kandek - https://phabricator.wikimedia.org/T249352 (10wkandek) [23:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Evening SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200427T2300). [23:00:05] Zoranzoki21: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:55] Here, as always :) [23:01:36] I can SWAT [23:01:44] Great :) [23:01:49] This is a record [23:02:15] Usually when I schedule a patch for Evening SWAT, I wait a long time or no one deploys it :/ [23:02:17] But thanks! 
[23:03:51] (03CR) 10Catrope: "> Patch Set 2: Code-Review-2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592455 (https://phabricator.wikimedia.org/T250878) (owner: 10Zoranzoki21) [23:04:06] (03CR) 10Catrope: [C: 03+2] Enable visualeditor on srwiki by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592455 (https://phabricator.wikimedia.org/T250878) (owner: 10Zoranzoki21) [23:05:20] (03Merged) 10jenkins-bot: Enable visualeditor on srwiki by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592455 (https://phabricator.wikimedia.org/T250878) (owner: 10Zoranzoki21) [23:06:37] Zoranzoki21: Re language converter + VE + srwiki: does srwiki have pages that use syntax like `-{sr-el:foo; sr-el:bar}-` ? [23:06:51] Or other special language converter syntax documented on https://www.mediawiki.org/wiki/Writing_systems/Syntax [23:07:14] If not, great. If there are such pages, I recommend testing VE on those pages to double-check that it doesn't explode [23:07:40] mwdebug is ready [23:07:43] We usually use -{ text }- [23:08:01] For mwdebug I mean it is enabled [23:08:04] Can I test patch now? [23:08:37] PROBLEM - DPKG on fermium is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [23:08:46] Yes, go ahead and test on mwdebug1002 [23:09:18] OK, -{ text }- counts, please make sure to test VE on a page that has that syntax [23:10:41] Testing... 
[23:11:27] Ok for me, I can't find problems [23:12:00] (03PS1) 10Nray: Add Config for Growth Study Quick Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592788 (https://phabricator.wikimedia.org/T248421) [23:12:45] (03CR) 10Nray: [C: 04-1] "I need to clarify a few things about this works" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592788 (https://phabricator.wikimedia.org/T248421) (owner: 10Nray) [23:12:59] Alright, I'll deploy [23:14:31] Ok [23:15:22] (03CR) 10Catrope: [C: 03+2] Update logos for for tiwiki and tiwikt (part I) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592506 (https://phabricator.wikimedia.org/T150618) (owner: 10Zoranzoki21) [23:15:25] (03CR) 10Catrope: [C: 03+2] Update logos for for tiwiki and tiwikt (part II) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592507 (https://phabricator.wikimedia.org/T150618) (owner: 10Zoranzoki21) [23:16:01] !log catrope@deploy1001 Synchronized wmf-config/config/srwiki.yaml: Enable VisualEditor by default on srwiki (T250878) (duration: 00m 58s) [23:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:09] T250878: Scale: VE in Serbian Wikipedia - https://phabricator.wikimedia.org/T250878 [23:16:18] (03Merged) 10jenkins-bot: Update logos for for tiwiki and tiwikt (part I) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592506 (https://phabricator.wikimedia.org/T150618) (owner: 10Zoranzoki21) [23:16:23] (03Merged) 10jenkins-bot: Update logos for for tiwiki and tiwikt (part II) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/592507 (https://phabricator.wikimedia.org/T150618) (owner: 10Zoranzoki21) [23:16:23] Zoranzoki21: Let me know if that worked, I'm not sure if syncing srwiki.yaml is enough or if I also need to sync the .dblist file [23:16:45] Also the logos are now on mwdebug1002 for testing [23:17:32] RoanKattouw: You should sync and dblist, VisualEditor is in beta features, on mwdebug it wasn't [23:17:37] I'll test logos now 
[23:18:49] !log catrope@deploy1001 Synchronized dblists/visualeditor-nondefault.dblist: Enable VisualEditor by default on srwiki (T250878) (duration: 00m 57s) [23:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:04] Logos look good [23:19:37] srwiki looks good now [23:20:00] Great [23:20:09] Logos are syncing now [23:20:38] !log catrope@deploy1001 Synchronized static/images/project-logos/: Update logos for tiwiki and tiwiktionary (T150618, T249451) (duration: 00m 58s) [23:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:47] T150618: Provide HD logos for all projects - https://phabricator.wikimedia.org/T150618 [23:20:47] T249451: Logo update for tiwiki and tiwikt - https://phabricator.wikimedia.org/T249451 [23:21:20] Ok [23:22:31] RECOVERY - DPKG on fermium is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [23:25:00] 10Puppet, 10Wikimedia Meet: Puppetize the account manager - https://phabricator.wikimedia.org/T251034 (10Ladsgroup) >>! In T251034#6083749, @Dzahn wrote: > I'd be happy to help with puppetizing. Thank you! You also helped with codesearch too, thank you <3 > A first step would be to move the repo from github... [23:25:19] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Update logos for tiwiki and tiwiktionary (T150618, T249451) (duration: 00m 57s) [23:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:57] RoanKattouw: Umm.. I don't see changes [23:27:17] Without mwdebug [23:27:22] On mwdebug I see it [23:30:21] It's working for me now [23:30:26] Yes, also [23:30:28] Thanks! 
[23:30:41] I forgot to do that second sync at first [23:31:44] Everything looks good now :) [23:37:25] I'm preparing one more for T150618 [23:37:26] T150618: Provide HD logos for all projects - https://phabricator.wikimedia.org/T150618 [23:38:10] I think that we will have time to deploy it [23:42:14] Ok, I won't do it now as I'm confused about something [23:43:06] I'm going to sleep now [23:43:09] :me waves [23:43:11] oops [23:43:13] * Zoranzoki21 waves