[00:00:04] twentyafterfour: That opportune time is upon us again. Time for a Phabricator update deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191003T0000). [00:01:38] mutante did it disconnect you? [00:01:52] paladox: yea, and i am on it via mgmt [00:01:58] ok [00:03:31] paladox: it's about the wrong row it looks [00:03:38] oh [00:03:49] 208.80.154.136 is wrong in v6 [00:11:14] 10Operations, 10ops-eqiad: Verify switch port connections - https://phabricator.wikimedia.org/T233302 (10Cmjohnson) 05Open→03Resolved ports have been identified and updated [00:12:36] removed the IPv6 addresses from interface and ran puppet again. that is already better [00:14:53] PROBLEM - Check systemd state on gerrit1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:15:56] ACKNOWLEDGEMENT - Check systemd state on gerrit1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
daniel_zahn new server https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:24:28] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [00:37:40] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [00:47:54] (03PS1) 10Dzahn: fix IPv6 address for gerrit1001, wrong row [dns] - 10https://gerrit.wikimedia.org/r/540514 (https://phabricator.wikimedia.org/T222391) [00:48:46] (03CR) 10Dzahn: [C: 03+2] fix IPv6 address for gerrit1001, wrong row [dns] - 10https://gerrit.wikimedia.org/r/540514 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [00:51:48] !log gerrit1001 - removing wrong IPv6 address from interface, running puppet [00:51:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:52:56] RECOVERY - Check systemd state on gerrit1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:05:26] !log gerrit1001 - shutdown - scheduled downtime [01:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:40:16] (03PS1) 10Krinkle: Enable AbuseFilterCachingParser for hewiki and commonswiki (group1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540516 (https://phabricator.wikimedia.org/T156095) [01:56:20] (03CR) 10Krinkle: [C: 03+2] Enable AbuseFilterCachingParser for hewiki and commonswiki (group1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540516 (https://phabricator.wikimedia.org/T156095) (owner: 10Krinkle) [01:57:10] (03Merged) 10jenkins-bot: Enable 
AbuseFilterCachingParser for hewiki and commonswiki (group1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540516 (https://phabricator.wikimedia.org/T156095) (owner: 10Krinkle) [01:58:16] * Krinkle staging on mwdebug1001 [02:00:03] (03PS1) 10Dzahn: Revert "gerrit: add role on gerrit1001 and remove gerrit::migration" [puppet] - 10https://gerrit.wikimedia.org/r/540518 [02:01:23] (03CR) 10Dzahn: [C: 03+2] Revert "gerrit: add role on gerrit1001 and remove gerrit::migration" [puppet] - 10https://gerrit.wikimedia.org/r/540518 (owner: 10Dzahn) [02:07:23] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 1c599baea51f9 (duration: 01m 03s) [02:07:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:18:43] 10Operations, 10SRE-Access-Requests: Requesting access to analytics cluster for Djellel Difallah - https://phabricator.wikimedia.org/T234473 (10leila) approved (Manager). [03:12:41] 10Operations, 10Wikimedia-Site-requests, 10HHVM: Set hhvm.virtual_host[default][always_decode_post_data] = false - https://phabricator.wikimedia.org/T208191 (10MaxSem) 05Open→03Declined Now that we're not supporting HHVM anymore, this is moot. [03:13:49] 10Operations, 10HHVM: Enable the usage of `hhvm -m debug --debug-host ::1` from mw1017 so developers can step through code (think gdb) in production to see what is going wrong. - https://phabricator.wikimedia.org/T94951 (10MaxSem) 05Open→03Invalid Now that we're not supporting HHVM anymore, this is moot. [03:24:47] 10Operations, 10MediaWiki-API, 10Availability, 10HHVM: HHVM is leaking memory on the API appservers - https://phabricator.wikimedia.org/T133674 (10Krinkle) 05Open→03Declined Declining per T192166 / T176370. [03:29:08] 10Operations, 10HHVM: Fix config file handling for /etc/hhvm/php.ini - https://phabricator.wikimedia.org/T157306 (10Krinkle) 05Open→03Declined Obsolete per T229792. 
[03:33:33] 10Operations, 10Puppet, 10HHVM: Tighten permissions on HHVM bytecode cache - https://phabricator.wikimedia.org/T85990 (10Krinkle) 05Open→03Declined Obsolete per T229792. [03:33:42] 10Operations, 10Scap (Scap3-MediaWiki-MVP): Make scap able to depool/repool servers via the conftool API - https://phabricator.wikimedia.org/T104352 (10Krinkle) [03:36:22] 10Operations, 10Patch-For-Review, 10Vuln-DoS: Long running mediawiki web requests impacts service availability, specially databases - https://phabricator.wikimedia.org/T149421 (10Krinkle) [03:36:32] 10Operations, 10MediaWiki-Core-Testing, 10HHVM: Re-add complete URL parsing fix from HHVM 3.18.7 release - https://phabricator.wikimedia.org/T185024 (10Krinkle) [03:36:54] 10Operations, 10MediaWiki-Core-Testing, 10HHVM: Re-add complete URL parsing fix from HHVM 3.18.7 release - https://phabricator.wikimedia.org/T185024 (10Krinkle) 05Open→03Resolved a:03Krinkle Obsolete per T229792. [03:38:30] 10Operations, 10MediaWiki-General: Refresh the appservers puppet code/configs - https://phabricator.wikimedia.org/T131748 (10Krinkle) [03:40:41] 10Operations, 10HHVM, 10Patch-For-Review, 10Upstream: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176 (10Krinkle) [03:51:25] (03PS2) 10Vgutierrez: Deploy inactive digicert 2019 unified certs [puppet] - 10https://gerrit.wikimedia.org/r/540470 (https://phabricator.wikimedia.org/T209515) (owner: 10BBlack) [03:56:38] (03PS1) 10Vgutierrez: add dummy SSL keys for digicert-2019-{rsa,unified}-unified [labs/private] - 10https://gerrit.wikimedia.org/r/540522 (https://phabricator.wikimedia.org/T209515) [03:58:22] (03PS2) 10Vgutierrez: add dummy SSL keys for digicert-2019-{rsa,ecdsa}-unified [labs/private] - 10https://gerrit.wikimedia.org/r/540522 (https://phabricator.wikimedia.org/T209515) [03:58:25] E_COFFEE :) [03:58:48] vgutierrez: coffee.wikimedia.org (good night) [03:58:49] (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] add dummy SSL keys for 
digicert-2019-{rsa,ecdsa}-unified [labs/private] - 10https://gerrit.wikimedia.org/r/540522 (https://phabricator.wikimedia.org/T209515) (owner: 10Vgutierrez) [04:02:17] (03CR) 10Vgutierrez: "pcc happy on ats-tls nodes: https://puppet-compiler.wmflabs.org/compiler1002/18721/" [puppet] - 10https://gerrit.wikimedia.org/r/540470 (https://phabricator.wikimedia.org/T209515) (owner: 10BBlack) [04:08:01] (03PS3) 10Vgutierrez: fifo-log-tailer: Retry on errors [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/539312 [04:08:17] (03CR) 10Vgutierrez: fifo-log-tailer: Retry on errors (031 comment) [software/fifo-log-demux] - 10https://gerrit.wikimedia.org/r/539312 (owner: 10Vgutierrez) [04:18:16] (03CR) 10Krinkle: [C: 04-1] "Per brief chat on #mediawiki-core, this is super safe and trivial. If we can find a way to limit the ability to auto-create your account t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540380 (https://phabricator.wikimedia.org/T222117) (owner: 10Urbanecm) [05:26:33] (03PS2) 10Marostegui: dumps-misc.sh.erb: Remove designate_pool_manager from backups [puppet] - 10https://gerrit.wikimedia.org/r/539839 (https://phabricator.wikimedia.org/T233978) [05:30:51] (03CR) 10Marostegui: [C: 03+2] dumps-misc.sh.erb: Remove designate_pool_manager from backups [puppet] - 10https://gerrit.wikimedia.org/r/539839 (https://phabricator.wikimedia.org/T233978) (owner: 10Marostegui) [05:41:37] 10Operations, 10ops-eqiad, 10DBA: db1114 crashed due to memory issues (server under warranty) - https://phabricator.wikimedia.org/T229452 (10Marostegui) Thank you! 
[05:45:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission [05:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [05:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:07] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2050.codfw.wmnet - https://phabricator.wikimedia.org/T230391 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db2050.codfw.wmnet` - db2050.codfw.wmnet (**PASS**) - Downtimed host on Ic... [05:47:04] (03PS1) 10Marostegui: site.pp: Remove db2050 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/540531 (https://phabricator.wikimedia.org/T230391) [05:47:27] (03PS1) 10Marostegui: wmnet: Remove production DNS entries for db2050 [dns] - 10https://gerrit.wikimedia.org/r/540532 (https://phabricator.wikimedia.org/T230391) [05:48:02] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove db2050 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/540531 (https://phabricator.wikimedia.org/T230391) (owner: 10Marostegui) [05:48:47] (03CR) 10Marostegui: [C: 03+2] wmnet: Remove production DNS entries for db2050 [dns] - 10https://gerrit.wikimedia.org/r/540532 (https://phabricator.wikimedia.org/T230391) (owner: 10Marostegui) [05:49:13] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "this seems ready to go. 
We should merge it as soon as you wake up today, and then fix the unintended consequences :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/539181 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [05:49:28] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2050.codfw.wmnet - https://phabricator.wikimedia.org/T230391 (10Marostegui) a:05RobH→03Papaul [05:49:46] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2050.codfw.wmnet - https://phabricator.wikimedia.org/T230391 (10Marostegui) Host ready for @Papaul for on-site steps and switch port disablement. [05:57:18] (03PS1) 10KartikMistry: Update cxserver to 2019-10-03-054958-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/540533 (https://phabricator.wikimedia.org/T232986) [06:16:03] !log Deploy schema change on db2089:3316 T233135 T234066 [06:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:09] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135 [06:16:09] T234066: Schema change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066 [06:28:01] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [06:29:19] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [06:29:44] 10Operations, 10MediaWiki-General: Refresh the appservers puppet code/configs - https://phabricator.wikimedia.org/T131748 (10Joe) 05Open→03Resolved [06:36:40] (03CR) 10Marostegui: [C: 03+1] Remove grants for labpuppet [puppet] - 10https://gerrit.wikimedia.org/r/538270 (https://phabricator.wikimedia.org/T233281) (owner: 10Muehlenhoff) [06:37:56] !log Rename tables on m5 master on designate_pool_manager - T233978 [06:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:00] T233978: Drop 'designate_pool_manager' database from m5 and remove associated grants - https://phabricator.wikimedia.org/T233978 [06:41:48] (03PS1) 10Marostegui: production-m5.sql.erb: Remove grants from designate_pool_manager [puppet] - 10https://gerrit.wikimedia.org/r/540534 (https://phabricator.wikimedia.org/T233978) [06:42:10] (03PS1) 10Alexandros Kosiaris: zotero: kill logs [deployment-charts] - 10https://gerrit.wikimedia.org/r/540535 (https://phabricator.wikimedia.org/T207200) [06:46:09] (03CR) 10Giuseppe Lavagetto: [C: 03+1] zotero: kill logs [deployment-charts] - 10https://gerrit.wikimedia.org/r/540535 (https://phabricator.wikimedia.org/T207200) (owner: 10Alexandros Kosiaris) [06:47:34] 10Operations, 10DBA, 10Patch-For-Review: Check/remove unused databases following labpuppetmaster deprecation - https://phabricator.wikimedia.org/T233281 (10Marostegui) I am going to drop the following grants: ` root@db1133.eqiad.wmnet[(none)]> select user,host from mysql.user where user like 'labspuppet'; +-... 
[06:48:03] !log Drop database grants on m5 for labspuppet - T233281 [06:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:06] T233281: Check/remove unused databases following labpuppetmaster deprecation - https://phabricator.wikimedia.org/T233281 [06:49:26] (03CR) 10Alexandros Kosiaris: [C: 03+2] zotero: kill logs [deployment-charts] - 10https://gerrit.wikimedia.org/r/540535 (https://phabricator.wikimedia.org/T207200) (owner: 10Alexandros Kosiaris) [06:50:09] (03PS2) 10Alexandros Kosiaris: zotero: kill logs [deployment-charts] - 10https://gerrit.wikimedia.org/r/540535 (https://phabricator.wikimedia.org/T207200) [06:50:11] (03PS2) 10Alexandros Kosiaris: restrouter: Revert the initialDelay seconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/540365 (https://phabricator.wikimedia.org/T223953) [06:53:14] 10Operations, 10Dumps-Generation: Migrate dumpsdata hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224563 (10ArielGlenn) I can start on this once the new dumpsdata host is racked and has a base install. [06:58:00] (03PS3) 10Alexandros Kosiaris: restrouter: Revert the initialDelay seconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/540365 (https://phabricator.wikimedia.org/T223953) [06:58:02] (03PS1) 10Alexandros Kosiaris: zotero: Set the currently deployed version [deployment-charts] - 10https://gerrit.wikimedia.org/r/540537 [06:58:20] (03CR) 10Alexandros Kosiaris: [C: 03+2] zotero: Set the currently deployed version [deployment-charts] - 10https://gerrit.wikimedia.org/r/540537 (owner: 10Alexandros Kosiaris) [06:58:32] (03Merged) 10jenkins-bot: zotero: Set the currently deployed version [deployment-charts] - 10https://gerrit.wikimedia.org/r/540537 (owner: 10Alexandros Kosiaris) [06:59:13] !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'zotero' for release 'production' . 
[06:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:16] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'zotero' for release 'staging' . [07:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:44] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'zotero' for release 'production' . [07:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:50] (03PS2) 10Alexandros Kosiaris: Revert "Revert "rsyslog: populate kubernetes configuration"" [puppet] - 10https://gerrit.wikimedia.org/r/539912 (https://phabricator.wikimedia.org/T207200) [07:12:20] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Request access to Analytics cluster for Urbanecm - https://phabricator.wikimedia.org/T231616 (10elukey) >>! In T231616#5504151, @Krenair wrote: > The idea that access for a formal collaboration with an external non-wikimedia group is an acceptable use... [07:18:31] (03PS2) 10Marostegui: Remove grants for labpuppet [puppet] - 10https://gerrit.wikimedia.org/r/538270 (https://phabricator.wikimedia.org/T233281) (owner: 10Muehlenhoff) [07:19:17] !log Remove unused labspuppet database from m5 - T233281 [07:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:21] T233281: Check/remove unused databases following labpuppetmaster deprecation - https://phabricator.wikimedia.org/T233281 [07:20:42] (03CR) 10Marostegui: [C: 03+2] Remove grants for labpuppet [puppet] - 10https://gerrit.wikimedia.org/r/538270 (https://phabricator.wikimedia.org/T233281) (owner: 10Muehlenhoff) [07:21:33] 10Operations, 10DBA, 10Patch-For-Review: Check/remove unused databases following labpuppetmaster deprecation - https://phabricator.wikimedia.org/T233281 (10Marostegui) 05Open→03Resolved database dropped: ` root@db1133.eqiad.wmnet[labspuppet]> show tables; +-------------------------+ | Tables_in_labspuppe... 
[07:49:17] !log Set notes on the sanitarium masters - T234039 [07:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:22] T234039: Add "candidate master and sanitarium master" on dbctl for the candidate masters and sanitarium masters in eqiad/codfw - https://phabricator.wikimedia.org/T234039 [08:11:20] (03PS3) 10Alexandros Kosiaris: Revert "Revert "rsyslog: populate kubernetes configuration"" [puppet] - 10https://gerrit.wikimedia.org/r/539912 (https://phabricator.wikimedia.org/T207200) [08:15:40] (03CR) 10Joal: [C: 03+1] "Thanks a lot @ottomata!" [puppet] - 10https://gerrit.wikimedia.org/r/540457 (https://phabricator.wikimedia.org/T231828) (owner: 10Ottomata) [08:15:40] !log slowly rolling restart all pods in eqiad, codfw, staging for log rollover before merging https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/539912 [08:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:33] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:26:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1096:3316 for schema change T233135 T234066', diff saved to https://phabricator.wikimedia.org/P9236 and previous config saved to /var/cache/conftool/dbconfig/20191003-082651-marostegui.json [08:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:57] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135 [08:26:58] T234066: Schema change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066 [08:28:15] (03PS2) 10Elukey: Run hdfs balancer with threshold of 5% [puppet] - 10https://gerrit.wikimedia.org/r/540457 
(https://phabricator.wikimedia.org/T231828) (owner: 10Ottomata) [08:28:58] !log Deploy schema change on db1096:3316 - T233625 [08:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:02] T233625: Change PK and remove partitions from the logging table - https://phabricator.wikimedia.org/T233625 [08:33:13] !log Deploy schema change on db2087:3316 T233135 T234066 [08:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:19] T233135: Schema change for refactored actor and comment storage - https://phabricator.wikimedia.org/T233135 [08:33:19] T234066: Schema change to rename user_newtalk indexes - https://phabricator.wikimedia.org/T234066 [08:38:31] (03CR) 10Elukey: [C: 03+2] Run hdfs balancer with threshold of 5% [puppet] - 10https://gerrit.wikimedia.org/r/540457 (https://phabricator.wikimedia.org/T231828) (owner: 10Ottomata) [08:48:49] has anybody checked the mw fatals? [08:48:54] https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops [08:49:48] (mostly exceptions, sorry) [08:52:34] I am seeing lag on some slaves [08:53:47] hello! [09:01:05] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:11:28] (03PS4) 10Alexandros Kosiaris: Add a commit message guide [puppet] - 10https://gerrit.wikimedia.org/r/540366 [09:22:38] (03PS1) 10Phedenskog: Grafana: Add external Graphite for synthetic testing [puppet] - 10https://gerrit.wikimedia.org/r/540572 (https://phabricator.wikimedia.org/T231870) [09:24:23] (03CR) 10jerkins-bot: [V: 04-1] Grafana: Add external Graphite for synthetic testing [puppet] - 10https://gerrit.wikimedia.org/r/540572 (https://phabricator.wikimedia.org/T231870) (owner: 10Phedenskog) [09:25:41] (03PS2) 10Phedenskog: Grafana: Add external Graphite for synthetic testing [puppet] - 10https://gerrit.wikimedia.org/r/540572 (https://phabricator.wikimedia.org/T231870) [09:32:51] !log run apt-get autoremove incrementally on all the hadoop prod workers to remove python2 deps (and verify that they are not used anymore by Hadoop) [09:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:43] 10Operations, 10DC-Ops, 10SRE-tools: Host decommission improvements - https://phabricator.wikimedia.org/T231066 (10Volans) Updated Lifecycle page accordingly: https://wikitech.wikimedia.org/w/index.php?title=Server_Lifecycle&type=revision&diff=1839914&oldid=1837183 [09:49:33] (03CR) 10Gehel: [C: 04-1] "PCC fails: https://puppet-compiler.wmflabs.org/compiler1001/18724/" [puppet] - 10https://gerrit.wikimedia.org/r/538572 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [09:51:58] (03CR) 10Gehel: [C: 04-1] "Minor issue inline, and then we can check PCC" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/538572 (https://phabricator.wikimedia.org/T232297) (owner: 10Mathew.onipe) [09:52:14] (03PS1) 10Arturo Borrero Gonzalez: openstack: glance: image_sync: cleanup unused cron [puppet] - 10https://gerrit.wikimedia.org/r/540573 
[09:55:10] (03PS3) 10Marostegui: dump-misc.sh.erb: Remove puppet database from the backups [puppet] - 10https://gerrit.wikimedia.org/r/539840 (https://phabricator.wikimedia.org/T231539) [09:55:12] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: glance: image_sync: cleanup unused cron [puppet] - 10https://gerrit.wikimedia.org/r/540573 (owner: 10Arturo Borrero Gonzalez) [09:57:22] (03PS4) 10Marostegui: dump-misc.sh.erb: Remove puppet database from the backups [puppet] - 10https://gerrit.wikimedia.org/r/539840 (https://phabricator.wikimedia.org/T231539) [09:59:19] (03CR) 10Marostegui: [C: 03+2] dump-misc.sh.erb: Remove puppet database from the backups [puppet] - 10https://gerrit.wikimedia.org/r/539840 (https://phabricator.wikimedia.org/T231539) (owner: 10Marostegui) [10:03:43] (03CR) 10Volans: "moving discussion to the task, more appropriate" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/539013 (https://phabricator.wikimedia.org/T233183) (owner: 10CRusnov) [10:05:13] (03PS1) 10Marostegui: dumps-misc.sh.erb: This script is no longer in use [puppet] - 10https://gerrit.wikimedia.org/r/540574 [10:05:53] (03CR) 10Marostegui: "jcrespo: thoughts on this?" 
[puppet] - 10https://gerrit.wikimedia.org/r/540574 (owner: 10Marostegui) [10:18:51] !log Manually cleared signup throttle for IP 90.176.155.12 at cswiki, issue with introduced throttle rule [10:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:58] !log Manually cleared signup throttle for IP 88.100.221.84 at cswiki, issue with introduced throttle rule [10:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:09] !log Manually cleared signup throttle for IP 80.188.128.54 at cswiki, issue with introduced throttle rule [10:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:29] !log rolling upgrade of openssl packages [10:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:06] !log killed rsync processes in "D" state on stat1007, force umount/mount of /mnt/hdfs [10:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:12] (03PS1) 10Giuseppe Lavagetto: Make the debianization compatible with stretch [debs/confd] - 10https://gerrit.wikimedia.org/r/540576 [10:36:22] (03CR) 10ArielGlenn: [C: 03+2] one-off for generating some page meta history files [dumps] - 10https://gerrit.wikimedia.org/r/537546 (owner: 10ArielGlenn) [10:44:19] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Make the debianization compatible with stretch [debs/confd] - 10https://gerrit.wikimedia.org/r/540576 (owner: 10Giuseppe Lavagetto) [10:45:56] Hello, gerrit's dead? [10:46:58] <_joe_> Daimona: you wish! [10:47:15] <_joe_> Daimona: jokes aside, it works for me now [10:48:36] Hah, poor gerrit [10:48:52] Still doesn't for me, so something locally I guess [10:49:29] <_joe_> Daimona: what are you experiencing? just difficulties connecting or are you getting errors/timeouts from our servers? [10:50:35] It's... Weird. First of all it didn't load reviewers suggestions for a specific patch. 
I then tried switching to the new UI, I was able to add a reviewer, but as soon as I pressed "send" it froze. Trying to open any gerrit page will result in infinite loading [10:50:49] <_joe_> uh [10:51:11] <_joe_> i would be tempted to bet on packet loss between you and gerrit [10:51:38] (03PS1) 10Jbond: admin::jbond update aliases [puppet] - 10https://gerrit.wikimedia.org/r/540577 [10:51:58] <_joe_> best I can do is to point you to https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue [10:52:13] Back working now... [10:52:28] Let's see if it can also let me add reviewers [10:52:56] (03PS2) 10Meshvogel: db::views: Bring back abuse_filter_history table [puppet] - 10https://gerrit.wikimedia.org/r/498773 (https://phabricator.wikimedia.org/T123978) [10:53:30] Of course it doesn't. Crap. [10:54:11] (03CR) 10Jbond: [C: 03+2] admin::jbond update aliases [puppet] - 10https://gerrit.wikimedia.org/r/540577 (owner: 10Jbond) [10:56:43] (03PS3) 10Meshvogel: db::views: Bring back abuse_filter_history table [puppet] - 10https://gerrit.wikimedia.org/r/498773 (https://phabricator.wikimedia.org/T123978) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Dear deployers, time to do the European Mid-day SWAT(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191003T1100). [11:00:04] No GERRIT patches in the queue for this window AFAICS. [11:05:26] (03CR) 10Meshvogel: "I've now applied all the suggested changes. 
Thank you again for the reviews and let me know if there's still something that needs to be do" [puppet] - 10https://gerrit.wikimedia.org/r/498773 (https://phabricator.wikimedia.org/T123978) (owner: 10Meshvogel) [11:09:47] (03PS1) 10Jbond: admin: update alias for jbond [puppet] - 10https://gerrit.wikimedia.org/r/540580 [11:11:52] (03CR) 10Jbond: [C: 03+2] admin: update alias for jbond [puppet] - 10https://gerrit.wikimedia.org/r/540580 (owner: 10Jbond) [11:19:09] (03PS1) 10ArielGlenn: add another fixup script for completing broken dump runs [dumps] - 10https://gerrit.wikimedia.org/r/540582 [11:38:20] (03PS1) 10Elukey: profile::analytics::refienry::job::project_namespace_map: fix syslog id [puppet] - 10https://gerrit.wikimedia.org/r/540583 [11:38:44] (03CR) 10Elukey: [C: 03+2] profile::analytics::refienry::job::project_namespace_map: fix syslog id [puppet] - 10https://gerrit.wikimedia.org/r/540583 (owner: 10Elukey) [11:55:41] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [11:59:21] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:59:49] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [11:59:59] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:59:59] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [12:00:11] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:00:19] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [12:00:45] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [12:00:57] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:01:07] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 
handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var- [12:01:17] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:01:18] uh oh [12:01:27] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [12:01:31] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:02:33] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:02:35] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:02:45] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [12:02:53] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:03:03] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [12:03:03] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [12:03:07] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:03:13] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:03:13] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [12:03:27] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:03:33] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [12:03:59] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [12:06:17] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [12:12:00] (PS2) ArielGlenn: add another fixup script for completing broken dump runs [dumps] - https://gerrit.wikimedia.org/r/540582 [12:18:03] https://grafana.wikimedia.org/d/Bw2mQ3iWz/gerrit-javamelody?orgId=1 [12:18:04] Gerrit about to go down [12:18:05] thcipriani: ^ [12:18:06] _joe_: what daimona explains is the thread issue we are having :( [12:18:11] It’ll affect certain people [12:18:23] Then once it goes to 60+ threads it’ll affect everyone [12:18:49] or hashar ^ [12:19:25]
https://grafana.wikimedia.org/d/Bw2mQ3iWz/gerrit-javamelody?orgId=1&panelId=16&fullscreen [12:21:49] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:25:41] It’s stuck on the /changes/ rest api [12:25:43] With some long number in ms [12:25:43] (CR) ArielGlenn: [C: +2] add another fixup script for completing broken dump runs [dumps] - https://gerrit.wikimedia.org/r/540582 (owner: ArielGlenn) [12:26:06] (Merged) jenkins-bot: add another fixup script for completing broken dump runs [dumps] - https://gerrit.wikimedia.org/r/540582 (owner: ArielGlenn) [12:26:31] Url /changes/149848/detail?O=10004 , 16,456,877 time in ms [12:29:21] PROBLEM - PHP7 rendering on mw1242 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:29:25] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [12:29:30] is the site down? [12:29:43] Uh-oh [12:29:55] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:30:13] If you report this error to the Wikimedia System Administrators, please include the details below.
[12:30:14] Request from 'removed' via cp1087 cp1087, Varnish XID 333611332 [12:30:14] Error: 503, Backend fetch failed at Thu, 03 Oct 2019 12:28:59 GMT [12:30:23] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [12:30:23] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method= [12:30:45] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:30:47] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [12:30:53] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 
https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [12:30:59] RECOVERY - PHP7 rendering on mw1242 is OK: HTTP OK: HTTP/1.1 200 OK - 75707 bytes in 9.231 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:31:03] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:31:09] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [12:31:11] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a res [12:31:11] d: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [12:31:11] Any opsen around? 
[12:31:15] PROBLEM - PHP7 rendering on mw1317 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:31:19] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [12:31:31] PROBLEM - PHP7 rendering on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:31:31] PROBLEM - HHVM rendering on mw1243 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:31:31] PROBLEM - HHVM rendering on mw1254 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:31:49] PROBLEM - Nginx local proxy to apache on mw1273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:31:51] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:31:57] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed o [12:31:57] nse was received: 
/{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:31:57] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was receiv [12:31:57] article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:31:57] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed [12:31:58] onse was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:31:59] PROBLEM - Apache HTTP on mw1274 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:31:59] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out 
before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor [12:31:59] received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:32:00] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translatio [12:32:00] med out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:32:01] PROBLEM - PHP7 rendering on mw1262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:32:01] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:32:02] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [12:32:02] PROBLEM - Restrouter LVS codfw on restrouter.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/metadata/{title} (Get metadata from storage) timed out before a re [12:32:03] ed: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [12:32:03] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:32:04] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [12:32:05] PROBLEM - termbox eqiad on termbox.svc.eqiad.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [12:32:05] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: 
cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:32:07] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:32:07] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:32:11] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:32:11] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:32:11] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/transfor [12:32:11] html/{title} (Get preview mobile HTML for test page) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was 
received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:32:11] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:32:11] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [12:32:12] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [12:32:12] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org [12:32:13] nitoring/restbase [12:32:13] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200): 
/en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org [12:32:14] nitoring/restbase [12:32:19] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was [12:32:19] in}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:32:19] PROBLEM - PHP7 rendering on mw1273 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:32:19] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor [12:32:19] received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:32:21] 
PROBLEM - PHP7 rendering on mw1254 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:32:21] PROBLEM - Apache HTTP on mw1275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:32:27] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response wa [12:32:27] ain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:32:27] PROBLEM - Nginx local proxy to apache on mw1275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:32:33] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [12:32:33] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:32:33] #page something seems to be broken [12:32:37] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: 
/{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor [12:32:37] received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:32:37] PROBLEM - Apache HTTP on mw1282 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:32:39] PROBLEM - Apache HTTP on mw1262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:32:41] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received: [12:32:41] /summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileap [12:32:43] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: 
/en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:32:43] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:32:43] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:32:43] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org [12:32:43] nitoring/restbase [12:32:45] PROBLEM - Apache HTTP on mw1272 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:32:45] PROBLEM - HHVM rendering on mw1272 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:32:47] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: 
/en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia [12:32:47] s/Monitoring/restbase [12:32:47] PROBLEM - PHP7 rendering on mw1272 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:32:47] PROBLEM - PHP7 rendering on mw1275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:32:49] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:32:49] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITI [12:32:49] raph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:32:49] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:32:53] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: 
/{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a respons [12:32:53] {domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:32:57] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:32:57] PROBLEM - HHVM rendering on mw1224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:32:59] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domai [12:32:59] ed/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:33:01] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: 
/{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed ou [12:33:01] se was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimed [12:33:01] ces/Monitoring/recommendation_api [12:33:05] RECOVERY - PHP7 rendering on mw1261 is OK: HTTP OK: HTTP/1.1 200 OK - 75707 bytes in 2.005 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:33:05] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:33:05] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was rece [12:33:05] 1/page/random/title (retrieve a random article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:33:05] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before 
a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a res [12:33:06] d: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:33:06] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:33:07] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:33:07] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:33:08] PROBLEM - proton endpoints health on proton2002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a 
response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not [12:33:08] xistent title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [12:33:09] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out be [12:33:09] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:33:10] RECOVERY - HHVM rendering on mw1243 is OK: HTTP OK: HTTP/1.1 200 OK - 75707 bytes in 3.068 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:33:10] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received: /{domain}/v1/p [12:33:11] r}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:33:19] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1274.eqiad.wmnet, mw1242.eqiad.wmnet, mw1240.eqiad.wmnet, mw1322.eqiad.wmnet, mw1272.eqiad.wmnet, mw1263.eqiad.wmnet, mw1269.eqiad.wmnet, mw1264.eqiad.wmnet, mw1333.eqiad.wmnet, mw1241.eqiad.wmnet, mw1324.eqiad.wmnet, mw1255.eqiad.wmnet, mw1331.eqiad.wmnet, mw1268.eqiad.wmnet, mw1254.eqiad.wmnet, 
mw1262.eqiad. [12:33:19] down but pooled: api_80: Servers mw1229.eqiad.wmnet, mw1280.eqiad.wmnet, mw1221.eqiad.wmnet, mw1232.eqiad.wmnet, mw1277.eqiad.wmnet, mw1227.eqiad.wmnet, mw1278.eqiad.wmnet, mw1290.eqiad.wmnet, mw1317.eqiad.wmnet, mw1231.eqiad.wmnet, mw1340.eqiad.wmnet, mw1225.eqiad.wmnet, mw1285.eqiad.wmnet, mw1228.eqiad.wmnet, mw1345.eqiad.wmnet, mw1339.eqiad.wmnet are marked down but pooled: apaches_80: Servers mw1265.eqiad.wmnet, mw1242.eqiad. [12:33:19] ad.wmnet, mw1267.eqiad.wmnet, mw1322.eqiad.wmnet, mw1272.eqiad.wmnet, mw1263.eqiad.wmnet, mw1258.eqiad.wmnet, mw1329.eqiad.wmnet, mw1264.eqiad.wmnet, mw1321.eqiad.wmnet, mw1333.eqiad.wm https://wikitech.wikimedia.org/wiki/PyBal [12:33:19] PROBLEM - Restrouter LVS eqiad on restrouter.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wiki [12:33:19] STBase [12:33:19] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a resp [12:33:20] https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:33:20] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:33:21] PROBLEM - 
restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:33:23] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var- [12:33:27] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:33:27] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org [12:33:27] nitoring/restbase [12:33:29] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1265.eqiad.wmnet, mw1242.eqiad.wmnet, mw1240.eqiad.wmnet, mw1322.eqiad.wmnet, mw1331.eqiad.wmnet, mw1272.eqiad.wmnet, mw1258.eqiad.wmnet, mw1269.eqiad.wmnet, mw1264.eqiad.wmnet, mw1250.eqiad.wmnet, mw1321.eqiad.wmnet, mw1333.eqiad.wmnet, mw1241.eqiad.wmnet, mw1324.eqiad.wmnet, 
mw1255.eqiad.wmnet, mw1320.eqiad. [12:33:29] ad.wmnet, mw1254.eqiad.wmnet, mw1248.eqiad.wmnet, mw1270.eqiad.wmnet, mw1239.eqiad.wmnet are marked down but pooled: api_80: Servers mw1284.eqiad.wmnet, mw1346.eqiad.wmnet, mw1280.eqiad.wmnet, mw1232.eqiad.wmnet, mw1344.eqiad.wmnet, mw1227.eqiad.wmnet, mw1314.eqiad.wmnet, mw1226.eqiad.wmnet, mw1340.eqiad.wmnet, mw1225.eqiad.wmnet, mw1228.eqiad.wmnet, mw1221.eqiad.wmnet, mw1317.eqiad.wmnet, mw1278.eqiad.wmnet, mw1224.eqiad.wmnet, [12:33:29] t, mw1235.eqiad.wmnet, mw1231.eqiad.wmnet, mw1285.eqiad.wmnet, mw1277.eqiad.wmnet, mw1339.eqiad.wmnet are marked down but pooled: apaches_80: Servers mw1274.eqiad.wmnet, mw1242.eqiad.wm https://wikitech.wikimedia.org/wiki/PyBal [12:33:39] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:33:41] RECOVERY - Apache HTTP on mw1274 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 8.812 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:33:41] RECOVERY - PHP7 rendering on mw1262 is OK: HTTP OK: HTTP/1.1 200 OK - 75706 bytes in 7.981 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:33:43] PROBLEM - proton endpoints health on proton2001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [12:33:45] PROBLEM - Nginx local proxy to apache on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:33:45] RECOVERY - termbox eqiad on termbox.svc.eqiad.wmnet is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [12:33:55] PROBLEM - HHVM rendering on mw1226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:33:59] RECOVERY - PHP7 rendering on mw1254 is OK: HTTP OK: HTTP/1.1 200 OK - 75707 bytes in 6.670 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:33:59] RECOVERY - PHP7 rendering on mw1273 is OK: HTTP OK: HTTP/1.1 200 OK - 75707 bytes in 7.499 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:33:59] PROBLEM - PHP7 rendering on mw1269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:34:03] RECOVERY - Apache HTTP on mw1275 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.923 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:34:09] RECOVERY - Nginx local proxy to apache on mw1275 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 9.972 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:34:09] PROBLEM - Apache HTTP on mw1269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:34:13] RECOVERY - Apache HTTP on mw1282 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 6.195 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:34:13] RECOVERY - Apache HTTP on mw1262 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 3.893 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:34:19] PROBLEM - Nginx local proxy to apache on mw1272 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:34:21] 10Operations: Wikimedia sites are down - 
https://phabricator.wikimedia.org/T234528 (10Matanya) [12:34:25] RECOVERY - Apache HTTP on mw1272 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 9.024 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:34:25] RECOVERY - HHVM rendering on mw1272 is OK: HTTP OK: HTTP/1.1 200 OK - 75707 bytes in 8.451 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:34:25] RECOVERY - PHP7 rendering on mw1272 is OK: HTTP OK: HTTP/1.1 200 OK - 75707 bytes in 7.546 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:34:27] RECOVERY - HHVM rendering on mw1224 is OK: HTTP OK: HTTP/1.1 200 OK - 75706 bytes in 0.156 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:34:31] 10Operations: Wikimedia sites are down - https://phabricator.wikimedia.org/T234528 (10Matanya) p:05Triage→03Unbreak! [12:34:31] RECOVERY - PHP7 rendering on mw1317 is OK: HTTP OK: HTTP/1.1 200 OK - 75707 bytes in 4.848 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:34:39] opened https://phabricator.wikimedia.org/T234528 [12:34:43] RECOVERY - proton endpoints health on proton2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [12:34:43] RECOVERY - HHVM rendering on mw1254 is OK: HTTP OK: HTTP/1.1 200 OK - 75705 bytes in 0.927 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:34:43] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:34:44] 10Operations: Wikimedia sites are down - https://phabricator.wikimedia.org/T234528 (10Marostegui) We are checking [12:35:15] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the 
links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [12:35:17] PROBLEM - termbox codfw on termbox.svc.codfw.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [12:35:17] RECOVERY - Nginx local proxy to apache on mw1290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 3.024 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:35:29] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [12:35:35] RECOVERY - PHP7 rendering on mw1269 is OK: HTTP OK: HTTP/1.1 200 OK - 75707 bytes in 6.945 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:35:39] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:35:49] PROBLEM - Apache HTTP on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:35:57] RECOVERY - Nginx local proxy to apache on mw1272 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 9.959 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:35:57] PROBLEM - Apache HTTP on mw1268 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:35:57] PROBLEM - Nginx local proxy to apache on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:35:59] PROBLEM - Nginx local proxy to apache on mw1268 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:36:03] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [12:36:03] RECOVERY - PHP7 rendering on mw1275 is OK: HTTP OK: HTTP/1.1 200 OK - 75707 bytes in 7.065 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:36:05] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:36:09] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [12:36:11] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [12:36:17] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:36:17] PROBLEM - Apache HTTP on mw1228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:36:23] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:36:37] PROBLEM - Packet loss ratio for UDP on logstash1007 is CRITICAL: 0.2585 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [12:36:39] RECOVERY - Restrouter LVS eqiad on restrouter.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [12:36:39] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:36:39] RECOVERY - Nginx local proxy to 
apache on mw1273 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:36:43] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:36:45] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:36:47] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [12:36:53] RECOVERY - termbox codfw on termbox.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [12:36:57] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:36:59] RECOVERY - Restrouter LVS codfw on restrouter.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [12:37:01] RECOVERY - proton endpoints health on proton2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [12:37:05] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:37:05] RECOVERY - HHVM rendering on mw1226 is OK: HTTP OK: HTTP/1.1 200 OK - 75706 bytes in 0.754 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:37:05] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:37:05] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:37:05] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:37:05] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:37:07] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:37:13] PROBLEM - Nginx local proxy to apache on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:37:15] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:37:15] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:37:19] RECOVERY - Apache HTTP on mw1269 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:37:19] RECOVERY - Apache HTTP on mw1290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.099 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:37:27] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [12:37:27] RECOVERY - Apache HTTP on mw1268 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.029 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:37:29] RECOVERY - Nginx local proxy to apache on mw1285 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 1.489 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:37:29] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:37:29] 
RECOVERY - Nginx local proxy to apache on mw1268 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:37:35] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:37:37] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:37:37] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:37:39] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:37:39] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:37:39] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:37:39] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:37:43] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:37:43] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [12:37:45] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [12:37:45] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:37:47] RECOVERY - restbase endpoints 
health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:37:47] RECOVERY - Apache HTTP on mw1228 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:37:49] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:37:51] 10Operations: Wikimedia sites are down - https://phabricator.wikimedia.org/T234528 (10IKhitron) Sometimes the page opens, but instead of contents you get ` [XZXq4wpAADkAADTd@iwAAAUF] 2019-10-03 12:36:02: Fatal exception of type "WMFTimeoutException" ` [12:37:55] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:37:57] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:37:57] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:37:57] <_joe_> wtf? 
[12:37:59] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:37:59] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:38:01] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:38:05] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:38:08] _joe_: check security [12:38:11] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:38:11] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:38:15] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:38:15] RECOVERY - Packet loss ratio for UDP on logstash1007 is OK: (C)0.1 ge (W)0.05 ge 0 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash [12:38:19] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [12:38:21] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:38:30] (03PS1) 10Jgreen: remove fr-tech from alerts for check_log_messages [puppet] - 10https://gerrit.wikimedia.org/r/540588 [12:38:33] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:38:33] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:38:33] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:38:33] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:38:33] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:38:39] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [12:38:41] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:38:41] RECOVERY - Nginx local proxy to apache on mw1347 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:38:43] 10Operations: Wikimedia sites are down - https://phabricator.wikimedia.org/T234528 (10IKhitron) {F30535503} [12:38:53] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:38:57] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:39:09] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:39:47] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:40:05] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:40:11] (03CR) 10Jgreen: [C: 03+2] remove fr-tech from alerts for check_log_messages [puppet] - 10https://gerrit.wikimedia.org/r/540588 (owner: 10Jgreen) [12:40:15] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [12:40:19] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:40:25] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:40:25] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [12:40:39] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:40:45] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [12:41:11] RECOVERY - HTTP availability for Varnish at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [12:41:55] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [12:42:17] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [12:42:53] 10Operations, 10SRE-Access-Requests: Requesting access to 'analytics-privatedata-users' and 'researchers' for Erin Yener - https://phabricator.wikimedia.org/T234529 (10EYener) [12:44:07] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:44:53] PROBLEM - PHP7 rendering on scandium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:45:09] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [12:45:27] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor [12:45:27] received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/ [12:45:27] itoring/recommendation_api [12:46:15] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:46:19] RECOVERY - PHP7 rendering on scandium is OK: HTTP OK: HTTP/1.1 
200 OK - 75734 bytes in 0.146 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:46:45] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [12:46:45] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed [12:46:45] ponse was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:46:45] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was receiv [12:46:45] article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:46:45] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a 
response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor [12:46:46] received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/ [12:46:46] itoring/recommendation_api [12:46:47] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [12:46:47] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed ou [12:46:48] se was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:46:48] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: 
/{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before [12:46:49] eceived: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:46:49] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:46:50] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:46:50] PROBLEM - Restrouter LVS codfw on restrouter.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1 [12:46:51] itle} (Get metadata from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: 
/en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a respons [12:46:51] tps://wikitech.wikimedia.org/wiki/RESTBase [12:46:52] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org [12:46:52] nitoring/restbase [12:46:55] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:46:55] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [12:46:57] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:47:34] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1242.eqiad.wmnet, mw1246.eqiad.wmnet, mw1267.eqiad.wmnet, mw1322.eqiad.wmnet, mw1333.eqiad.wmnet, mw1323.eqiad.wmnet, mw1328.eqiad.wmnet, mw1243.eqiad.wmnet, 
mw1272.eqiad.wmnet, mw1269.eqiad.wmnet, mw1250.eqiad.wmnet, mw1319.eqiad.wmnet, mw1254.eqiad.wmnet, mw1252.eqiad.wmnet, mw1262.eqiad.wmnet are marked dow [12:47:34] ches_80: Servers mw1242.eqiad.wmnet, mw1246.eqiad.wmnet, mw1267.eqiad.wmnet, mw1322.eqiad.wmnet, mw1323.eqiad.wmnet, mw1328.eqiad.wmnet, mw1243.eqiad.wmnet, mw1272.eqiad.wmnet, mw1269.eqiad.wmnet, mw1326.eqiad.wmnet, mw1268.eqiad.wmnet, mw1333.eqiad.wmnet, mw1321.eqiad.wmnet, mw1254.eqiad.wmnet, mw1248.eqiad.wmnet, mw1252.eqiad.wmnet, mw1239.eqiad.wmnet, mw1262.eqiad.wmnet, mw1332.eqiad.wmnet, mw1330.eqiad.wmnet are marked down b [12:47:34] /wikitech.wikimedia.org/wiki/PyBal [12:47:38] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var- [12:47:40] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:47:40] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:47:40] PROBLEM - Restrouter LVS eqiad on restrouter.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a 
response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a resp [12:47:40] : /en.wikipedia.org/v1/page/metadata/{title} (Get metadata from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [12:47:44] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1275.eqiad.wmnet, mw1267.eqiad.wmnet, mw1322.eqiad.wmnet, mw1319.eqiad.wmnet, mw1323.eqiad.wmnet, mw1327.eqiad.wmnet, mw1243.eqiad.wmnet, mw1272.eqiad.wmnet, mw1269.eqiad.wmnet, mw1266.eqiad.wmnet, mw1333.eqiad.wmnet, mw1324.eqiad.wmnet, mw1244.eqiad.wmnet, mw1254.eqiad.wmnet, mw1252.eqiad.wmnet, mw1239.eqiad. [12:47:44] ad.wmnet, mw1332.eqiad.wmnet, mw1330.eqiad.wmnet are marked down but pooled: apaches_80: Servers mw1242.eqiad.wmnet, mw1246.eqiad.wmnet, mw1267.eqiad.wmnet, mw1322.eqiad.wmnet, mw1333.eqiad.wmnet, mw1243.eqiad.wmnet, mw1272.eqiad.wmnet, mw1269.eqiad.wmnet, mw1266.eqiad.wmnet, mw1268.eqiad.wmnet, mw1319.eqiad.wmnet, mw1324.eqiad.wmnet, mw1238.eqiad.wmnet, mw1254.eqiad.wmnet, mw1252.eqiad.wmnet, mw1239.eqiad.wmnet, mw1262.eqiad.wmn [12:47:44] wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:47:48] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out be [12:47:48] as received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:47:48] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a 
response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:47:48] PROBLEM - Apache HTTP on mw1269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:47:48] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:47:58] PROBLEM - PHP7 rendering on mw1267 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:48:04] PROBLEM - Nginx local proxy to apache on mw1323 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:48:06] PROBLEM - proton endpoints health on proton2001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [12:48:08] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:48:08] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:48:12] PROBLEM - graphoid endpoints health on scb1004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [12:48:12] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/tit [12:48:12] ev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:48:14] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received: /{domain}/v1/page/featu [12:48:14] }/{day} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:48:14] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was 
received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received: [12:48:14] /summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:48:14] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 500 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a r [12:48:14] ved https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [12:48:15] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200): /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve a [12:48:15] ntent for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:48:16] PROBLEM - Apache HTTP on mw1238 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:48:16] PROBLEM - Nginx local proxy to apache on mw1238 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:48:20] PROBLEM - 
PHP7 rendering on mw1233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:48:22] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a res [12:48:22] d: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured art [12:48:22] , 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:48:24] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed o [12:48:24] nse was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:48:24] PROBLEM - HHVM rendering on mw1284 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers [12:48:24] PROBLEM - Apache HTTP on mw1323 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:48:28] PROBLEM - PHP7 rendering on mw1244 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:48:32] PROBLEM - Nginx local proxy to apache on mw1267 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:48:46] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:48:46] PROBLEM - HHVM rendering on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:48:48] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [12:48:48] PROBLEM - PHP7 rendering on mw1262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:48:54] PROBLEM - PHP7 rendering on mw1257 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:49:00] PROBLEM - Nginx local proxy to apache on mw1347 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:49:04] RECOVERY - Nginx local proxy to apache on mw1323 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 7.706 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:49:04] PROBLEM - PHP7 rendering on mw1269 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:49:06] PROBLEM - HHVM rendering on mw1242 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:49:08] PROBLEM - termbox eqiad on termbox.svc.eqiad.wmnet is CRITICAL: /termbox (get rendered termbox) is CRITICAL: Test get rendered termbox returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [12:49:10] RECOVERY - graphoid endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [12:49:12] RECOVERY - Apache HTTP on mw1238 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 3.055 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:49:12] RECOVERY - Nginx local proxy to apache on mw1238 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 3.060 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:49:14] RECOVERY - 
PHP7 rendering on mw1233 is OK: HTTP OK: HTTP/1.1 200 OK - 75730 bytes in 0.123 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:49:18] RECOVERY - HHVM rendering on mw1284 is OK: HTTP OK: HTTP/1.1 200 OK - 75730 bytes in 0.158 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:49:22] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:49:26] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [12:49:26] RECOVERY - PHP7 rendering on mw1244 is OK: HTTP OK: HTTP/1.1 200 OK - 75731 bytes in 6.181 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:49:28] RECOVERY - Apache HTTP on mw1323 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 8.896 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:49:34] RECOVERY - Nginx local proxy to apache on mw1267 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 9.471 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:49:34] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [12:49:34] PROBLEM - proton endpoints health on proton2002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [12:49:36] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:49:38] RECOVERY - Restrouter LVS eqiad on restrouter.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [12:49:40] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw1323.eqiad.wmnet, mw1328.eqiad.wmnet, mw1333.eqiad.wmnet, mw1261.eqiad.wmnet, mw1243.eqiad.wmnet, mw1324.eqiad.wmnet, mw1242.eqiad.wmnet, mw1272.eqiad.wmnet, mw1267.eqiad.wmnet, mw1269.eqiad.wmnet, mw1262.eqiad.wmnet, mw1332.eqiad.wmnet, mw1330.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [12:49:42] RECOVERY - HHVM rendering on mw1328 is OK: HTTP OK: HTTP/1.1 200 OK - 75731 bytes in 3.703 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:49:42] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [12:49:44] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:49:46] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:49:48] RECOVERY - PHP7 rendering on mw1257 is OK: HTTP OK: HTTP/1.1 200 OK - 75731 bytes in 1.348 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:49:52] RECOVERY - Nginx local proxy to apache on mw1347 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:49:52] RECOVERY - PHP7 rendering on 
mw1267 is OK: HTTP OK: HTTP/1.1 200 OK - 75730 bytes in 0.121 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:49:58] RECOVERY - HHVM rendering on mw1242 is OK: HTTP OK: HTTP/1.1 200 OK - 75730 bytes in 0.404 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:49:58] RECOVERY - PHP7 rendering on mw1269 is OK: HTTP OK: HTTP/1.1 200 OK - 75731 bytes in 2.536 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:50:00] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:50:02] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:50:04] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:50:04] RECOVERY - Restrouter LVS codfw on restrouter.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [12:50:08] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:50:08] RECOVERY - termbox eqiad on termbox.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service [12:50:10] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:50:12] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:50:12] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:50:12] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:50:12] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:50:14] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:50:22] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:50:22] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [12:50:32] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:50:38] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:50:40] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:50:44] RECOVERY - Apache HTTP on mw1269 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:50:44] RECOVERY - PHP7 rendering on mw1262 is OK: HTTP OK: HTTP/1.1 200 OK - 75730 bytes in 0.412 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [12:51:18] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [12:51:30] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:51:42] RECOVERY - proton endpoints health on proton2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [12:51:42] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:51:46] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [12:52:06] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:52:08] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:52:08] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:52:08] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [12:52:12] RECOVERY - proton endpoints health on proton2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [12:52:42] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [12:53:10] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:53:12] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [12:53:18] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:53:24] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:53:28] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [12:53:44] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:54:08] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:54:20] !log force shard allocation on eqiad chi cluster [12:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:46] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [12:55:38] Operations: Wikimedia sites are down - https://phabricator.wikimedia.org/T234528 (SUM1) {F30535524} {F30535526} {F30535522} [12:55:44] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [12:56:12] Operations: Wikimedia sites are down - https://phabricator.wikimedia.org/T234528 (Marostegui) We are on it [12:57:10] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was 
received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:57:12] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:57:46] PROBLEM - Nginx local proxy to apache on mw1323 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:58:24] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was receiv [12:58:24] article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:58:24] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good articl [12:58:24] t before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:58:54] RECOVERY - Nginx local proxy to apache on mw1323 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 0.511 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:59:08] PROBLEM - HTTP availability for 
Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [12:59:08] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured [12:59:08] l 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:59:08] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:59:10] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good artic [12:59:10] ut before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - bad article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api 
[12:59:10] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before [12:59:10] eceived: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:59:10] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good artic [12:59:11] ut before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:59:14] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed ou [12:59:14] se was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:59:14] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed [12:59:14] ponse was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:59:20] PROBLEM - proton endpoints health on proton2001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [12:59:26] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [12:59:28] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:59:40] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: 
/{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received: /{domain}/v1/page/ [12:59:40] month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:59:40] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:59:42] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:59:54] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:59:54] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:59:56] PROBLEM - Apache HTTP on mw1256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [12:59:58] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: 
/{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [12:59:58] PROBLEM - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var- [12:59:58] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:59:58] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:00:10] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:00:10] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:00:10] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on 
icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:00:12] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:00:26] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [13:00:30] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:00:38] RECOVERY - proton endpoints health on proton2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [13:00:38] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:00:38] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [13:00:38] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [13:00:38] PROBLEM - Restrouter LVS codfw on restrouter.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [13:00:42] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:00:42] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [13:00:52] PROBLEM - Apache HTTP on mw1326 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:00:56] PROBLEM - Nginx local proxy to apache on mw1246 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:00:58] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [13:01:00] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good artic [13:01:00] ut before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - bad article title) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [13:01:08] RECOVERY - Apache HTTP on mw1256 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.127 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:01:10] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:01:14] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [13:01:26] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:01:42] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:01:44] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [13:01:48] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [13:01:54] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [13:01:54] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [13:01:56] RECOVERY - Restrouter LVS codfw on restrouter.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [13:02:02] RECOVERY - Apache HTTP on mw1326 is OK: HTTP OK: HTTP/1.1 301 Moved 
Permanently - 626 bytes in 0.033 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:02:08] RECOVERY - Nginx local proxy to apache on mw1246 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 0.719 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:02:16] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [13:02:16] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [13:02:16] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:02:18] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:02:18] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [13:02:18] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [13:02:30] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:02:34] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:02:34] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:02:46] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase 
[13:02:46] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:03:04] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [13:03:04] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [13:03:04] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [13:03:08] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [13:03:08] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [13:03:46] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:04:02] RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST [13:04:14] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:04:15] <_joe_> !log restarting php7 on mw1275 [13:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:30] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [13:04:34] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [13:04:36] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:04:42] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [13:04:46] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:04:50] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [13:04:50] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [13:05:48] Operations, Discovery-Search, Elasticsearch: Unassigned shards in eqiad - https://phabricator.wikimedia.org/T233403 (Mathew.onipe) This issue has come up again. Currently, we have only `enwiki_content_1546970425` unassigned with `too many shards [1] allocated to this node for index [enwiki_content_15... [13:05:56] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:05:56] 10Operations: Wikimedia sites are down - https://phabricator.wikimedia.org/T234528 (10Marostegui) We believe we are out of the woods now. Can people confirm? [13:06:34] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:06:35] Marostegui: Yes, apparently. [13:07:04] up for me [13:14:34] 10Operations: Wikimedia sites are down - https://phabricator.wikimedia.org/T234528 (10Ghuron) looks like it's working again [13:14:36] 10Operations: Wikimedia sites are down - https://phabricator.wikimedia.org/T234528 (10Marostegui) p:05Unbreak!→03High We believe we are back OK; some people reported so on IRC and the graphs look good again. We'll create an incident report. Thanks for reporting this issue. [13:15:15] 10Operations: Wikimedia sites are down - https://phabricator.wikimedia.org/T234528 (10IKhitron) OK for me too. [13:22:27] !log Gerrit might be dead again; taking traces [13:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:51] hashar: thanks a lot, was about to ask [13:22:57] I need a stack trace ;) [13:23:10] you always need stacktraces! [13:23:11] :P [13:23:31] though now we have openjdk dbg symbols [13:23:38] so we would get the native trace as well !! [13:24:12] 10Operations: Wikimedia sites are down - https://phabricator.wikimedia.org/T234528 (10Marostegui) 05Open→03Resolved a:03Marostegui Incident report is being created.
[13:24:49] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [13:24:53] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [13:26:14] * hashar traces [13:26:16] traces [13:26:19] thcipriani: taking trace for gerrit [13:26:20] definitely feels dead from here [13:26:22] and I will restart it [13:26:33] Caused by: sun.jvm.hotspot.types.WrongTypeException: No suitable match for type of address 0x00000002c03fe8a8 [13:26:34] :-\ [13:26:48] Java ❤️ [13:26:58] !log restarting Gerrit due to a deadlock in SendEmail task and AccountCacheImpl [13:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:27] I have ruined my Tuesday trying to understand the garbage collector and JVM memory allocation [13:27:41] that can ruin an entire week [13:27:46] hashar: thank you! [13:27:52] It’ll be because of the http threads [13:27:54] Per the ping from around an hour before :) [13:27:55] Heh [13:27:58] I am smart, only took a day to figure out I would ruin a whole week ;-] [13:28:06] :-D [13:28:36] I think I have kicked it [13:28:54] hashar: are there any metrics that we can use to alert us, rather than waiting until Gerrit is down before restarting it? It seems to be the same problem that has been occurring for weeks, right?
[13:29:20] !log Gerrit should be back [13:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:39] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 26094 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [13:29:45] elukey: yeah that got mentioned, but I have no idea how to detect it :-\ [13:29:45] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 865 bytes in 0.069 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [13:30:27] hashar: https://grafana.wikimedia.org/d/Bw2mQ3iWz/gerrit-javamelody?panelId=16&fullscreen&orgId=1&from=now-3h&to=now [13:30:49] I recall that the metric was doing the same some time ago when we debugged this together [13:31:03] and I proposed alerting on the active thread count, even in the task IIRC [13:31:06] :) [13:31:15] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: 0.01009 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:31:35] 10Operations: Wikimedia sites are down - https://phabricator.wikimedia.org/T234528 (10MBH) @Marostegui and where is this report? Post a link, please. [13:32:35] 10Operations: Wikimedia sites are down - https://phabricator.wikimedia.org/T234528 (10Marostegui) It is being worked on.
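The proposal in the exchange above, to alert on the active thread count before Gerrit stops answering, could be sketched roughly as follows. This is only a minimal illustration under stated assumptions: the function name `should_alert`, the threshold of 50, and the three-sample window are hypothetical, and the samples are assumed to be read periodically from something like the JavaMelody thread panel linked in the log.

```python
from typing import Sequence


def should_alert(samples: Sequence[int], threshold: int, sustained: int) -> bool:
    """Return True if the last `sustained` samples are all at or above `threshold`.

    Requiring several consecutive high samples avoids alerting on a brief spike.
    """
    if sustained <= 0:
        raise ValueError("sustained must be positive")
    if len(samples) < sustained:
        return False
    return all(value >= threshold for value in samples[-sustained:])


# Hypothetical numbers: active HTTP threads sampled once per minute.
healthy = [12, 15, 40, 14, 13]
stuck = [12, 30, 58, 60, 60]
print(should_alert(healthy, threshold=50, sustained=3))
print(should_alert(stuck, threshold=50, sustained=3))
```

A check of this shape would fire while the thread pool is saturating, which is earlier than the socket-timeout checks that only trip once Gerrit is already unresponsive. The right threshold would have to come from the real graphs.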
[13:32:50] (03PS2) 10Pmiazga: beta: noop: remove unused Minerva EventLogging error tracking configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540404 (https://phabricator.wikimedia.org/T233663) [13:34:05] (03PS2) 10Gilles: Document Apache gzip sidestepping [puppet] - 10https://gerrit.wikimedia.org/r/539842 (https://phabricator.wikimedia.org/T232615) [13:34:35] (03CR) 10Gilles: Document Apache gzip sidestepping (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/539842 (https://phabricator.wikimedia.org/T232615) (owner: 10Gilles) [13:35:51] (03PS2) 10Marostegui: db-eqiad.php: Depool es1019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540006 (https://phabricator.wikimedia.org/T233698) [13:37:34] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool es1019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540006 (https://phabricator.wikimedia.org/T233698) (owner: 10Marostegui) [13:38:22] (03Merged) 10jenkins-bot: db-eqiad.php: Depool es1019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540006 (https://phabricator.wikimedia.org/T233698) (owner: 10Marostegui) [13:40:06] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool es1019 for on-site maintenance T233698 (duration: 01m 01s) [13:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:10] T233698: es1019 IPMI and its management interface are unresponsive (again2) - https://phabricator.wikimedia.org/T233698 [13:44:43] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: 0.01009 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:44:53] !log Stop MySQL and shutdown es1019 for on-site maintenance - T233698 [13:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:18] (03CR) 10Andrew Bogott: [C: 03+1] production-m5.sql.erb: Remove grants from designate_pool_manager [puppet] - 
10https://gerrit.wikimedia.org/r/540534 (https://phabricator.wikimedia.org/T233978) (owner: 10Marostegui) [13:48:19] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: es1019 IPMI and its management interface are unresponsive (again2) - https://phabricator.wikimedia.org/T233698 (10Marostegui) @Cmjohnson es1019 is now off and ready for you. Once you are done, power it back on. Thanks! [13:49:07] !log roll restart hadoop yarn resource managers for openssl updates on Hadoop workers [13:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:39] (03PS1) 10Jhedden: openstack: set glance image dir permissions on controllers [puppet] - 10https://gerrit.wikimedia.org/r/540607 (https://phabricator.wikimedia.org/T234518) [13:52:25] elukey: cdanis: I am rephrasing the task about alarming for that gerrit issue ( was https://phabricator.wikimedia.org/T230138 ) [13:52:27] rephrasing it now [13:53:25] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Tested in tools, it shouldn't break anything this time around. 
Also the entirety of the production pods have be refreshed twice to avoid a" [puppet] - 10https://gerrit.wikimedia.org/r/539912 (https://phabricator.wikimedia.org/T207200) (owner: 10Alexandros Kosiaris) [13:53:33] (03PS4) 10Alexandros Kosiaris: Revert "Revert "rsyslog: populate kubernetes configuration"" [puppet] - 10https://gerrit.wikimedia.org/r/539912 (https://phabricator.wikimedia.org/T207200) [13:53:51] (03PS3) 10BBlack: Deploy inactive digicert 2019 unified certs [puppet] - 10https://gerrit.wikimedia.org/r/540470 (https://phabricator.wikimedia.org/T209515) [13:55:41] (03CR) 10BBlack: [C: 03+2] Deploy inactive digicert 2019 unified certs [puppet] - 10https://gerrit.wikimedia.org/r/540470 (https://phabricator.wikimedia.org/T209515) (owner: 10BBlack) [13:55:51] hashar: ack thanks, will follow up in there [13:56:42] (03PS4) 10BBlack: Deploy inactive digicert 2019 unified certs [puppet] - 10https://gerrit.wikimedia.org/r/540470 (https://phabricator.wikimedia.org/T209515) [13:57:23] (03CR) 10Andrew Bogott: [C: 03+1] openstack: set glance image dir permissions on controllers [puppet] - 10https://gerrit.wikimedia.org/r/540607 (https://phabricator.wikimedia.org/T234518) (owner: 10Jhedden) [13:57:30] elukey: rephrased :) [13:57:33] (03CR) 10Jhedden: [C: 03+2] openstack: set glance image dir permissions on controllers [puppet] - 10https://gerrit.wikimedia.org/r/540607 (https://phabricator.wikimedia.org/T234518) (owner: 10Jhedden) [13:57:45] (03PS2) 10Jhedden: openstack: set glance image dir permissions on controllers [puppet] - 10https://gerrit.wikimedia.org/r/540607 (https://phabricator.wikimedia.org/T234518) [13:57:48] it is not like I am going to work on it though :-\ I am busy with other details [14:01:32] heads up, some 50k events added to logstash in the last 3 minutes, courtesy of https://gerrit.wikimedia.org/r/539912 [14:02:11] it was the backlog and was processed quickly, if lag in the logging pipeline increases will be the wake of that [14:02:26] * akosiaris 
monitoring https://grafana.wikimedia.org/d/000000561/logstash [14:03:08] but it does look like it's fine [14:04:44] (03PS2) 10Volans: prospector: disable McCabe complexity check [cookbooks] - 10https://gerrit.wikimedia.org/r/539500 [14:04:46] (03PS1) 10Jhedden: Revert "openstack: set glance image dir permissions on controllers" [puppet] - 10https://gerrit.wikimedia.org/r/540608 [14:05:49] (03CR) 10Jhedden: [V: 03+2 C: 03+2] Revert "openstack: set glance image dir permissions on controllers" [puppet] - 10https://gerrit.wikimedia.org/r/540608 (owner: 10Jhedden) [14:06:18] (03PS1) 10Andrew Bogott: designate: remove mitaka files and templates and a mitaka-specific arg [puppet] - 10https://gerrit.wikimedia.org/r/540609 (https://phabricator.wikimedia.org/T233978) [14:07:40] (03CR) 10Alexandros Kosiaris: [C: 03+1] logstash: parse nested json from mmkubernetes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/539978 (https://phabricator.wikimedia.org/T207200) (owner: 10Filippo Giunchedi) [14:08:48] (03CR) 10Volans: [C: 03+2] prospector: disable McCabe complexity check [cookbooks] - 10https://gerrit.wikimedia.org/r/539500 (owner: 10Volans) [14:09:25] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.003631 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:10:56] godog: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/539912/ merged without destroying everything this time around it seems [14:10:59] (03Merged) 10jenkins-bot: prospector: disable McCabe complexity check [cookbooks] - 10https://gerrit.wikimedia.org/r/539500 (owner: 10Volans) [14:11:27] (03CR) 10Andrew Bogott: "harmless-looking diff at https://puppet-compiler.wmflabs.org/compiler1002/18726/" [puppet] - 10https://gerrit.wikimedia.org/r/540609 (https://phabricator.wikimedia.org/T233978) (owner: 10Andrew Bogott) [14:11:38] also https://gerrit.wikimedia.org/r/540481 by urandom should also remove 
some of the less well structured logs [14:14:08] (03PS2) 10Andrew Bogott: designate: remove mitaka files and templates and a mitaka-specific arg [puppet] - 10https://gerrit.wikimedia.org/r/540609 (https://phabricator.wikimedia.org/T233978) [14:18:19] (03PS5) 10BBlack: Deploy inactive digicert 2019 unified certs [puppet] - 10https://gerrit.wikimedia.org/r/540470 (https://phabricator.wikimedia.org/T209515) [14:18:26] rebase rebase rebase [14:18:43] someday we need to loop back to that conversation about our merge strategy and tooling, etc :) [14:19:53] bblack: T224033 ;) [14:20:03] T224033: Fix operations/puppet.git "rebase hell" - https://phabricator.wikimedia.org/T224033 [14:22:24] hahahaah [14:22:41] I didn't recall the title [14:23:25] lol [14:27:22] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2035 - https://phabricator.wikimedia.org/T229784 (10Papaul) [14:32:21] 10Operations, 10ops-codfw, 10netops: msw-c1 down? - https://phabricator.wikimedia.org/T234411 (10Papaul) 05Open→03Resolved I was not on site yesterday when this happened. This morning all seems to be good. It was maybe the cable, so I just made sure that the cable was not loose. [14:37:24] (03CR) 10Awight: Add a commit message guide (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/540366 (owner: 10Alexandros Kosiaris) [14:39:28] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2050.codfw.wmnet - https://phabricator.wikimedia.org/T230391 (10Papaul) ` papaul@asw-c-codfw# show| compare [edit interfaces interface-range vlan-private1-c-codfw] - member ge-6/0/19; [edit interfaces interface-range disabled] mem...
[14:40:27] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2050.codfw.wmnet - https://phabricator.wikimedia.org/T230391 (10Papaul) [14:40:34] (03PS1) 10Elukey: Switch the Hadoop test cluster to krb1001/krb2001 [puppet] - 10https://gerrit.wikimedia.org/r/540610 (https://phabricator.wikimedia.org/T226089) [14:41:25] (03PS2) 10Elukey: Switch the Hadoop test cluster to krb1001/krb2001 [puppet] - 10https://gerrit.wikimedia.org/r/540610 (https://phabricator.wikimedia.org/T226089) [14:43:01] PROBLEM - Check systemd state on an-tool1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:43:21] 10Operations, 10SRE-Access-Requests: Requesting access to analytics cluster for Djellel Difallah - https://phabricator.wikimedia.org/T234473 (10Ottomata) [14:45:24] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Scoring-platform-team: Grant LDAP groups and deployment shell access to Kevin Bazira - https://phabricator.wikimedia.org/T234209 (10Ottomata) [14:45:30] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Scoring-platform-team: Grant LDAP groups and deployment shell access to Kevin Bazira - https://phabricator.wikimedia.org/T234209 (10Ottomata) [14:48:38] (03PS1) 10Jhedden: openstack: set glance image dir permissions on controllers [puppet] - 10https://gerrit.wikimedia.org/r/540612 (https://phabricator.wikimedia.org/T234518) [14:50:00] an-tool1006 is me [14:50:27] forgot to set downtime [14:53:17] (03PS3) 10Elukey: Switch the Hadoop test cluster to krb1001/krb2001 [puppet] - 10https://gerrit.wikimedia.org/r/540610 (https://phabricator.wikimedia.org/T226089) [14:53:28] (03PS2) 10Marostegui: realm.pp: Remove filejournal table from private list [puppet] - 10https://gerrit.wikimedia.org/r/534389 (https://phabricator.wikimedia.org/T51195) [14:56:45] 10Operations, 10Wikimedia-Incident: Wikimedia sites are down - 
https://phabricator.wikimedia.org/T234528 (10Aklapper) [14:56:47] (03PS4) 10Elukey: Switch the Hadoop test cluster to krb1001/krb2001 [puppet] - 10https://gerrit.wikimedia.org/r/540610 (https://phabricator.wikimedia.org/T226089) [14:58:45] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/18729/" [puppet] - 10https://gerrit.wikimedia.org/r/540610 (https://phabricator.wikimedia.org/T226089) (owner: 10Elukey) [14:58:51] (03CR) 10Elukey: [C: 03+2] Switch the Hadoop test cluster to krb1001/krb2001 [puppet] - 10https://gerrit.wikimedia.org/r/540610 (https://phabricator.wikimedia.org/T226089) (owner: 10Elukey) [15:07:35] (03PS1) 10Ayounsi: Modify access rules [homer] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/540614 [15:19:11] (03PS13) 10Dzahn: parsoid: introduce parameter to use parsoid/PHP [puppet] - 10https://gerrit.wikimedia.org/r/539181 (https://phabricator.wikimedia.org/T233654) [15:20:06] (03CR) 10Dzahn: [C: 03+2] parsoid: introduce parameter to use parsoid/PHP [puppet] - 10https://gerrit.wikimedia.org/r/539181 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [15:21:28] (03PS14) 10Dzahn: parsoid: introduce parameter to use parsoid/PHP [puppet] - 10https://gerrit.wikimedia.org/r/539181 (https://phabricator.wikimedia.org/T233654) [15:25:23] (03CR) 10Dzahn: "looks like the "has_lvs" removed in the last patch set was needed after all. Could not find data item has_lvs in any Hiera data file .." 
[puppet] - 10https://gerrit.wikimedia.org/r/539181 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [15:26:19] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:27:17] (03PS1) 10Dzahn: wtp2001: explicitly set has_lvs to true [puppet] - 10https://gerrit.wikimedia.org/r/540615 [15:27:46] (03PS2) 10Dzahn: wtp2001: explicitly set has_lvs to true [puppet] - 10https://gerrit.wikimedia.org/r/540615 (https://phabricator.wikimedia.org/T233654) [15:28:33] (03PS3) 10Dzahn: wtp2001: explicitly set has_lvs to true [puppet] - 10https://gerrit.wikimedia.org/r/540615 (https://phabricator.wikimedia.org/T233654) [15:29:09] (03CR) 10Dzahn: [C: 03+2] wtp2001: explicitly set has_lvs to true [puppet] - 10https://gerrit.wikimedia.org/r/540615 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [15:36:55] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:44:18] chaomodus: ^^^ maybe related to the known issue, but still spamming abit [15:45:17] (03CR) 10Jhedden: "PCC results: https://puppet-compiler.wmflabs.org/compiler1002/18730/" [puppet] - 10https://gerrit.wikimedia.org/r/540612 (https://phabricator.wikimedia.org/T234518) (owner: 10Jhedden) [15:45:29] (03CR) 10Jhedden: [C: 03+2] openstack: set glance image dir permissions on controllers [puppet] - 10https://gerrit.wikimedia.org/r/540612 (https://phabricator.wikimedia.org/T234518) (owner: 10Jhedden) [15:45:59] (03PS2) 10Jhedden: openstack: set glance image dir permissions on controllers [puppet] - 10https://gerrit.wikimedia.org/r/540612 (https://phabricator.wikimedia.org/T234518) [15:51:34] (03PS1) 10Jhedden: Revert "openstack: set glance image dir permissions on 
controllers" [puppet] - 10https://gerrit.wikimedia.org/r/540618 [15:51:54] (03CR) 10Jhedden: [V: 03+2 C: 03+2] Revert "openstack: set glance image dir permissions on controllers" [puppet] - 10https://gerrit.wikimedia.org/r/540618 (owner: 10Jhedden) [15:54:41] hello! has https://phabricator.wikimedia.org/T233271 received any attention? This seems to be happening more and more often. From my data, it was extremely rare prior to September 20 (except for Sep 6-7, which we all know about) [15:55:38] (03CR) 10Alexandros Kosiaris: Add a commit message guide (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/540366 (owner: 10Alexandros Kosiaris) [15:55:40] I still need to take a crash course on Logstash. I tried searching for requests that I know gave a 503, and I can't find them (plenty of other 503s in there, but that by itself I assume is not unusual) [15:56:36] (03PS5) 10Alexandros Kosiaris: Add a commit message guide [puppet] - 10https://gerrit.wikimedia.org/r/540366 [15:58:30] musikanimal: you mean anything newer since sbass.ett's last comment on Oct 1? Cause it does give the impression that it's been handled [15:58:55] no the CheckUser problem seems to be unrelated, I've concluded [15:59:03] I'm going off of automated error reports from XTools [15:59:47] and there's an email thread about the 503s on wikitech-l, someone reported the issue again today, around the same time I see the errors from XTools [16:00:04] godog and _joe_: My dear minions, it's time we take the moon! Just kidding. Time for Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191003T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:00:19] musikanimal: We had some issues today around 08:46 UTC and at around 12:00 UTC too [16:02:33] musikanimal: there was an incident around the time the user reported this in wikitech-l, so maybe related?
[16:02:49] yes, and the same time it happened to XTools [16:04:34] musikanimal: https://phabricator.wikimedia.org/T234528 [16:04:39] that's the task tracking it [16:04:57] RECOVERY - Check systemd state on an-tool1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:04:58] the incident report is not yet on wikitech, but should be soon [16:05:55] got it, thank you. Do you think the 503s over the past few weeks are all due to the same thing? Or rather, do you believe this to be an ongoing issue, and I don't need to report when these incidents happen? [16:07:22] oh, do report them, cause they may very well be unrelated [16:09:08] okee doke, will do. thanks! [16:09:19] I doubt though all those 503s during the past few weeks were related btw [16:12:14] 10Operations, 10cloud-services-team (Kanban): Migrate labmon* to Stretch (or Buster, better yet!) - https://phabricator.wikimedia.org/T224585 (10Phamhi) I commented the lines related to puppetdb (34 to 44 https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/promethe... [16:22:57] (03CR) 10Awight: Add a commit message guide (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/540366 (owner: 10Alexandros Kosiaris) [16:26:46] (03PS6) 10Alexandros Kosiaris: Add a commit message guide [puppet] - 10https://gerrit.wikimedia.org/r/540366 [16:27:05] _joe_: https://grafana.wikimedia.org/d/-sq5te5Wk/kubernetes-dns?panelId=42&fullscreen&orgId=1&from=now-2d&to=now&var-dc=eqiad%20prometheus%2Fk8s [16:27:12] interesting... SERVFAILs [16:27:20] I wonder what they are about... 
[16:27:29] really really low levels though [16:27:37] <_joe_> yeah still [16:27:58] (03CR) 10Alexandros Kosiaris: Add a commit message guide (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/540366 (owner: 10Alexandros Kosiaris) [16:35:19] (03PS1) 10Ayounsi: Move jnt's config.yaml to Homer's new filestructure [homer/public] - 10https://gerrit.wikimedia.org/r/540619 [16:55:15] (03PS1) 10Ayounsi: README, common and asw2-a/b/c-eqiad mock private data [homer/mock-private] - 10https://gerrit.wikimedia.org/r/540622 [16:57:05] 10Operations, 10Core Platform Team, 10Performance-Team, 10Readers-Web-Backlog, and 7 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Jdlrobson) > I've tagged Reading-Web and CPT on the task here for their feedback from product and technic... [16:57:22] 10Operations, 10ops-codfw, 10decommission, 10media-storage, 10User-fgiunchedi: decom ms-be201[345] - https://phabricator.wikimedia.org/T221068 (10Papaul) [17:00:04] cscott, arlolra, subbu, halfak, and accraze: Your horoscope predicts another unfortunate Services – Graphoid / Parsoid / Citoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191003T1700). [17:07:33] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@31b2703]: Update mobileapps to 1db84a7 [17:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:39] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@31b2703]: Update mobileapps to 1db84a7 (duration: 06m 06s) [17:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:51] Daimona: hi, can do a few AF pushes today, whenever you're ready/available. 
[17:36:54] Did one last night as well [17:37:38] Sure [17:37:45] 10Operations, 10ops-eqiad, 10Patch-For-Review: (Aug 30th, 2019) rack/setup/install elastic10[53-67].eqiad.wmnet - https://phabricator.wikimedia.org/T230746 (10Cmjohnson) a:05Cmjohnson→03Jclark-ctr I need some clarification on this, please. We do not have 10G space for all of these servers. Especially if... [17:38:02] I'd say either group1 or even group1+2 [17:38:05] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 58.13 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:39:27] Daimona: yeah, looking at the frequency of filtered actions by wiki, we have commonswiki as 3rd biggest. [17:39:41] so we can do wikidata next, then enwiki, and then default? [17:39:41] It seems good so far [17:39:43] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 74.23 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:39:54] Sure [17:39:57] So let's do wikidata now? 
[17:40:01] yeah [17:40:12] trying ot figure out now why traffic plummeted earlier today [17:40:17] https://grafana.wikimedia.org/d/000000393/mediawiki-abusefilter-profiling?orgId=1&from=1570099431666&to=1570113577762 [17:40:20] big gap here in edits [17:40:50] There was an incident [17:41:15] (03CR) 10BBlack: [C: 03+1] wikimediacloud.org: add initial zone file [dns] - 10https://gerrit.wikimedia.org/r/540148 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [17:41:27] 10Operations, 10ops-eqiad, 10Patch-For-Review: (Aug 30th, 2019) rack/setup/install elastic10[53-67].eqiad.wmnet - https://phabricator.wikimedia.org/T230746 (10EBernhardson) The servers today will not be able to utilize 10G, so they could go in 1G racks for the time being. The cluster can't take advantage of... [17:43:14] (03PS1) 10Daimona Eaytoy: Use AbuseFilterCachingParser on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540627 (https://phabricator.wikimedia.org/T156095) [17:43:34] Krinkle ^ config patch [17:43:48] (03PS3) 10Andrew Bogott: designate: remove mitaka files and templates and a mitaka-specific arg [puppet] - 10https://gerrit.wikimedia.org/r/540609 (https://phabricator.wikimedia.org/T233978) [17:44:33] Daimona: thx, found the task, proper incident will follow [17:44:42] I've added a {mediawiki} annotation in Grafana and enabled those on the AF dash [17:46:11] (03CR) 10Andrew Bogott: [C: 03+2] designate: remove mitaka files and templates and a mitaka-specific arg [puppet] - 10https://gerrit.wikimedia.org/r/540609 (https://phabricator.wikimedia.org/T233978) (owner: 10Andrew Bogott) [17:48:37] Ah, cool [17:49:24] * Krinkle takes mwdebug1002 [17:49:30] (03CR) 10Krinkle: [C: 03+2] Use AbuseFilterCachingParser on wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540627 (https://phabricator.wikimedia.org/T156095) (owner: 10Daimona Eaytoy) [17:50:24] (03Merged) 10jenkins-bot: Use AbuseFilterCachingParser on wikidatawiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/540627 (https://phabricator.wikimedia.org/T156095) (owner: 10Daimona Eaytoy) [17:51:41] As a side note, I wonder how noticeable the stampede would be, and if it's worth trying to avoid it [17:52:14] But I guess we can think about that later on [17:52:27] Daimona: Yeah, we can warm some of the caches via mwdebug though [17:52:50] On a daily basis I mean... Esp for wikis as big as WD [17:52:58] But nvm [17:53:04] Ping me when it's on mwdebug :) [17:53:10] Right. Yeah, WANObjectCache provides protections for that as well [17:53:17] it does preliminary refreshes from a post-send callback at random [17:53:20] to keep things warm [17:53:33] Daimona: live on mwdebug1002 [17:54:50] This is on APCU though [17:54:56] OK, test run now [17:55:00] Hm.. only on apcu? [17:55:14] Might be worth benching to see how parsing compares with a memcached fetch [17:55:38] Might not be worth it per-filter, but maybe with a batch key [17:56:05] e.g. all global filters as 1 key and 1 per-wiki batch key returning the tokens [17:56:06] anyway [17:56:24] Can use wan->getWithSet for that. But yeah, we need data first. [17:56:46] Sure, let's do that later [17:56:55] What a mess I've just made [17:57:18] Forgot to enable the extension for the first edits, then disabled flash instead of JS *shrug* [17:58:51] Looking good anyway [18:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Morning SWAT (Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191003T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS.
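The batch-key idea from the discussion above (one cache key per wiki holding the tokens for all of its filters, rather than one key per filter) might look roughly like this. It is a sketch under stated assumptions, not AbuseFilter's or WANObjectCache's actual code: `SimpleCache`, `get_with_set`, `tokenize`, `tokens_for_wiki` and the key format are hypothetical stand-ins, and the cache is a plain in-process dict with a TTL instead of a shared cache.

```python
import time
from typing import Callable, Dict, List, Tuple


class SimpleCache:
    """Tiny in-memory TTL cache; a stand-in for a real shared cache."""

    def __init__(self) -> None:
        self._store: Dict[str, Tuple[float, object]] = {}
        self.misses = 0

    def get_with_set(self, key: str, ttl: float, compute: Callable[[], object]) -> object:
        """Return the cached value for `key`, computing and storing it on a miss."""
        entry = self._store.get(key)
        now = time.monotonic()
        if entry is not None and entry[0] > now:
            return entry[1]
        self.misses += 1
        value = compute()
        self._store[key] = (now + ttl, value)
        return value


def tokenize(rule: str) -> List[str]:
    """Hypothetical placeholder for the expensive parse step."""
    return rule.split()


def tokens_for_wiki(cache: SimpleCache, wiki: str, filters: Dict[int, str]) -> object:
    # One batch key per wiki: a single fetch returns tokens for every filter.
    return cache.get_with_set(
        f"filter-tokens:{wiki}",
        ttl=3600,
        compute=lambda: {fid: tokenize(rule) for fid, rule in filters.items()},
    )


cache = SimpleCache()
filters = {1: "action == 'edit'", 2: "user_age < 3600"}
first = tokens_for_wiki(cache, "wikidatawiki", filters)
second = tokens_for_wiki(cache, "wikidatawiki", filters)
print(cache.misses)
```

The trade-off is that editing any one filter invalidates the whole batch, so whether this beats per-filter keys or simply re-parsing depends on the benchmark data the participants say they still need.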
[18:00:57] OK, rolling out [18:01:02] checking logstash one more time [18:01:12] Sure [18:01:16] I'm checking too [18:02:26] (03CR) 10Volans: [C: 03+1] "LGTM" [homer/mock-private] - 10https://gerrit.wikimedia.org/r/540622 (owner: 10Ayounsi) [18:03:06] (03PS1) 10Mforns: reportupdater::jobs::mysql.pp: Absent flow reports [puppet] - 10https://gerrit.wikimedia.org/r/540641 (https://phabricator.wikimedia.org/T223414) [18:03:12] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 5389d0243ee9c (duration: 01m 01s) [18:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:41] (03PS1) 10Andrew Bogott: openstack: move codfw1dev to openstack Newton [puppet] - 10https://gerrit.wikimedia.org/r/540642 [18:03:43] (03PS1) 10Andrew Bogott: Openstack: move eqiad1 glance/keystone/nova/neutron to Newton [puppet] - 10https://gerrit.wikimedia.org/r/540643 (https://phabricator.wikimedia.org/T212302) [18:04:36] (03CR) 10Mforns: "This one can be merged any time." [puppet] - 10https://gerrit.wikimedia.org/r/540641 (https://phabricator.wikimedia.org/T223414) (owner: 10Mforns) [18:05:39] (03CR) 10Volans: [C: 03+1] "LGTM in general, I couldn't check all the values." [homer/public] - 10https://gerrit.wikimedia.org/r/540619 (owner: 10Ayounsi) [18:06:03] (03PS4) 10Volans: Add commit action to the Homer class [software/homer] - 10https://gerrit.wikimedia.org/r/539551 (https://phabricator.wikimedia.org/T228388) [18:06:54] Daimona: next one in 30min or so? [18:07:07] (03PS2) 10Jhedden: wikimediacloud.org: add initial zone file [dns] - 10https://gerrit.wikimedia.org/r/540148 (https://phabricator.wikimedia.org/T223907) [18:07:22] I don't know if I'll be there. 
But if WD is fine, probably everything else will be :) [18:07:44] OK [18:07:45] BTW, AFAICS wikidata is still using the old parser [18:07:57] Yeah, I'm not seeing a big shift yet [18:08:07] Oh, and for other wikis we'd have to fix three filters beforehand [18:08:10] (03CR) 10Jhedden: [C: 03+2] wikimediacloud.org: add initial zone file [dns] - 10https://gerrit.wikimedia.org/r/540148 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [18:08:14] https://phabricator.wikimedia.org/T156096#5538576 [18:08:37] We can pin those to old parser for now [18:08:51] To check what it's using, just head to https://www.wikidata.org/wiki/Special:AbuseFilter/tools, type "()" and check syntax. Failure <=> new parser :) [18:08:52] Sure [18:08:59] Daimona: I noticed yesterday with commons as well that it took a long time before traffic shifted [18:09:01] (03CR) 10Andrew Bogott: [C: 03+2] openstack: move codfw1dev to openstack Newton [puppet] - 10https://gerrit.wikimedia.org/r/540642 (owner: 10Andrew Bogott) [18:09:20] (03CR) 10jerkins-bot: [V: 04-1] Add commit action to the Homer class [software/homer] - 10https://gerrit.wikimedia.org/r/539551 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [18:09:23] Uhm, sounds like some hosts are already using the new parser [18:09:36] At least, some requests are [18:10:45] strange [18:10:56] (03PS2) 10Ayounsi: Move jnt's config.yaml to Homer's new filestructure [homer/public] - 10https://gerrit.wikimedia.org/r/540619 [18:11:29] (03CR) 10Ayounsi: [V: 03+1 C: 03+2] README, common and asw2-a/b/c-eqiad mock private data [homer/mock-private] - 10https://gerrit.wikimedia.org/r/540622 (owner: 10Ayounsi) [18:11:32] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] README, common and asw2-a/b/c-eqiad mock private data [homer/mock-private] - 10https://gerrit.wikimedia.org/r/540622 (owner: 10Ayounsi) [18:12:21] Yeah, that's it though... 
It's around 1/10 requests [18:13:14] (03PS5) 10Volans: Add commit action to the Homer class [software/homer] - 10https://gerrit.wikimedia.org/r/539551 (https://phabricator.wikimedia.org/T228388) [18:14:15] Daimona: ha [18:14:18] it only affects wmf.25 [18:14:25] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: no-op / config cache issue? (duration: 01m 00s) [18:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:28] no, never mind [18:14:36] I checked the wrong config cache file [18:15:18] Now it's working [18:18:08] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Move jnt's config.yaml to Homer's new filestructure [homer/public] - 10https://gerrit.wikimedia.org/r/540619 (owner: 10Ayounsi) [18:18:22] (03PS1) 10Krinkle: Enable AbuseFilterCachingParser for (almost) all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540647 (https://phabricator.wikimedia.org/T156095) [18:18:53] (03PS2) 10Krinkle: Enable AbuseFilterCachingParser for (almost) all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540647 (https://phabricator.wikimedia.org/T156095) [18:22:19] (03CR) 10Daimona Eaytoy: [C: 03+1] Enable AbuseFilterCachingParser for (almost) all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540647 (https://phabricator.wikimedia.org/T156095) (owner: 10Krinkle) [18:23:51] 10Operations, 10ops-codfw: refresh/replace scs-c1-codfw - https://phabricator.wikimedia.org/T231687 (10Papaul) [18:24:17] 10Operations, 10ops-codfw: refresh/replace scs-c1-codfw - https://phabricator.wikimedia.org/T231687 (10Papaul) [18:24:38] 10Operations, 10Growth-Team, 10Notifications, 10serviceops, 10CPT Initiatives (Multi-DC Echo Notification Storage): Provision Kask for Echo timestamp storage in k8s - https://phabricator.wikimedia.org/T234376 (10Eevans) [18:25:10] 10Operations, 10Growth-Team, 10Notifications, 10serviceops, and 2 others: Echostore service endpoints - https://phabricator.wikimedia.org/T234464 (10Eevans) 
[18:27:39] (03PS1) 10Paladox: Modify access rules [homer/public] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/540651 [18:32:13] (03PS1) 10Paladox: Modify access rules [homer] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/540652 [18:33:16] (03Abandoned) 10Paladox: Modify access rules [homer] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/540652 (owner: 10Paladox) [18:33:39] (03PS1) 10Paladox: Modify access rules [homer] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/540653 [18:35:56] (03CR) 10Ottomata: [C: 03+2] reportupdater::jobs::mysql.pp: Absent flow reports [puppet] - 10https://gerrit.wikimedia.org/r/540641 (https://phabricator.wikimedia.org/T223414) (owner: 10Mforns) [18:36:02] (03PS2) 10Ottomata: reportupdater::jobs::mysql.pp: Absent flow reports [puppet] - 10https://gerrit.wikimedia.org/r/540641 (https://phabricator.wikimedia.org/T223414) (owner: 10Mforns) [18:36:06] (03CR) 10Ottomata: [V: 03+2 C: 03+2] reportupdater::jobs::mysql.pp: Absent flow reports [puppet] - 10https://gerrit.wikimedia.org/r/540641 (https://phabricator.wikimedia.org/T223414) (owner: 10Mforns) [18:38:59] (03CR) 10Ayounsi: [C: 03+1] Modify access rules [homer] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/540653 (owner: 10Paladox) [18:39:06] Daimona: I've added "cache hit ratio" as graph [18:39:06] https://grafana.wikimedia.org/d/000000393/mediawiki-abusefilter-profiling?orgId=1&from=now-6h&to=now&var-wiki=wikidatawiki&var-metric=p95 [18:39:13] not sure I got it right, but looks pretty good so far [18:39:26] not surprising I suppose, but nice to see - assuming I got the query right [18:39:59] (03CR) 10Krinkle: [C: 03+2] Enable AbuseFilterCachingParser for (almost) all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540647 (https://phabricator.wikimedia.org/T156095) (owner: 10Krinkle) [18:40:47] (03Merged) 10jenkins-bot: Enable AbuseFilterCachingParser for (almost) all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540647 
(https://phabricator.wikimedia.org/T156095) (owner: 10Krinkle) [18:43:08] (03CR) 10Thcipriani: [V: 03+2 C: 03+2] Modify access rules [homer] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/540653 (owner: 10Paladox) [18:43:27] (03Abandoned) 10Paladox: Modify access rules [homer/public] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/540651 (owner: 10Paladox) [18:43:30] thanks! [18:43:42] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: c2b3d7ce57e9c422 (duration: 00m 59s) [18:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:46] Cool, thanks, let's see what the ratio is [18:51:35] looks like the config cache got stuck this time as well [18:51:36] weird [18:51:37] re-syncing [18:51:46] 10Operations, 10SRE-tools, 10Traffic, 10Goal, and 2 others: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 (10BBlack) [18:52:49] !log krinkle@deploy1001 Synchronized wmf-config/InitialiseSettings.php: no-op / config cached? (duration: 00m 59s) [18:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:47] As a side note, I was expecting the eval phase to be way quicker, but instead, it's pretty slow... We can try to improve it later refs https://phabricator.wikimedia.org/T234427 [18:58:08] Alright, I'm done for today. I'll check the graphs tomorrow, thanks a lot! [18:59:41] (03PS1) 10Mforns: reportupdater::jobs::mysql.pp: Absent jobs affected by migration [puppet] - 10https://gerrit.wikimedia.org/r/540658 (https://phabricator.wikimedia.org/T223414) [19:00:04] marxarelli: Your horoscope predicts another unfortunate MediaWiki train - American version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191003T1900). [19:01:24] (03CR) 10Mforns: "Let's wait to deploy this on Monday 2019-10-07." 
[puppet] - 10https://gerrit.wikimedia.org/r/540658 (https://phabricator.wikimedia.org/T223414) (owner: 10Mforns) [19:05:31] (03PS1) 10Eevans: staging/sessionstore: Upgrade image to 2019-10-03-182310-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/540660 (https://phabricator.wikimedia.org/T227514) [19:06:51] (03CR) 10Eevans: [V: 03+2 C: 03+2] staging/sessionstore: Upgrade image to 2019-10-03-182310-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/540660 (https://phabricator.wikimedia.org/T227514) (owner: 10Eevans) [19:07:03] (03Merged) 10jenkins-bot: staging/sessionstore: Upgrade image to 2019-10-03-182310-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/540660 (https://phabricator.wikimedia.org/T227514) (owner: 10Eevans) [19:07:30] (03PS1) 10Mforns: ::reportupdater::jobs: Migrate MySQL jobs to Hive [puppet] - 10https://gerrit.wikimedia.org/r/540661 (https://phabricator.wikimedia.org/T223414) [19:10:07] (03CR) 10jerkins-bot: [V: 04-1] ::reportupdater::jobs: Migrate MySQL jobs to Hive [puppet] - 10https://gerrit.wikimedia.org/r/540661 (https://phabricator.wikimedia.org/T223414) (owner: 10Mforns) [19:11:03] (03CR) 10Mforns: "We should merge this when we deploy the related task, see dependency." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/540661 (https://phabricator.wikimedia.org/T223414) (owner: 10Mforns) [19:11:57] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'sessionstore' for release 'staging' . 
[19:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:09] (03PS2) 10Mforns: ::reportupdater::jobs: Migrate MySQL jobs to Hive [puppet] - 10https://gerrit.wikimedia.org/r/540661 (https://phabricator.wikimedia.org/T223414) [19:16:39] (03CR) 10jerkins-bot: [V: 04-1] ::reportupdater::jobs: Migrate MySQL jobs to Hive [puppet] - 10https://gerrit.wikimedia.org/r/540661 (https://phabricator.wikimedia.org/T223414) (owner: 10Mforns) [19:18:57] (03PS1) 10Dduvall: all wikis to 1.34.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540666 [19:19:01] (03CR) 10Dduvall: [C: 03+2] all wikis to 1.34.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540666 (owner: 10Dduvall) [19:19:01] !log puppetmaster1001 - revoke cert for parsoid.discovery.wmnet - creating new ones for each DC and a unified one with both (T233654) [19:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:05] T233654: Make the parsoid cluster support parsoid/PHP - https://phabricator.wikimedia.org/T233654 [19:19:52] (03Merged) 10jenkins-bot: all wikis to 1.34.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540666 (owner: 10Dduvall) [19:20:29] here comes the choo choo [19:21:48] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.34.0-wmf.25 [19:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:16] 10Operations, 10Core Platform Team, 10Performance-Team, 10Readers-Web-Backlog, and 7 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Keegan) >>! In T120085#5544560, @Jdlrobson wrote: >> I've tagged Reading-Web and CPT on the task here for... 
[19:29:37] (03PS3) 10Mforns: ::reportupdater::jobs: Migrate MySQL jobs to Hive [puppet] - 10https://gerrit.wikimedia.org/r/540661 (https://phabricator.wikimedia.org/T223414) [19:30:07] !log 1.34.0-wmf.25 promoted to all wikis, cc: T220750. no rise in relevant error rates. no new errors [19:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:11] T220750: 1.34.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T220750 [19:31:47] 10Operations, 10ops-eqiad: Can't SSH to mw1290.mgmt - https://phabricator.wikimedia.org/T234153 (10Cmjohnson) @Dzahn the server needs to be powered off and power removed...can you depool again and leave it depooled for 24 hours please. I will update the task once complete. [19:32:04] 10Operations, 10Core Platform Team, 10Performance-Team, 10Readers-Web-Backlog, and 7 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Jdlrobson) Yup. en.wikipedia.org for desktop (which redirects to mobile en.m.wikipedia.org). 
[19:35:13] (03PS2) 10Andrew Bogott: Openstack: move eqiad1 glance/keystone/nova/neutron to Newton [puppet] - 10https://gerrit.wikimedia.org/r/540643 (https://phabricator.wikimedia.org/T212302) [19:35:15] (03PS1) 10Andrew Bogott: move cloudbackup2001/2002 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/540668 (https://phabricator.wikimedia.org/T224528) [19:36:19] 10Operations, 10ops-eqiad: Move YHSM from auth1001 to auth1002 - https://phabricator.wikimedia.org/T233821 (10Cmjohnson) a:05Jclark-ctr→03Cmjohnson @MoritzMuehlenhoff the HSM has been moved..please confirm you see it [19:38:04] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1290.eqiad.wmnet [19:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:14] (03CR) 10Andrew Bogott: [C: 03+2] move cloudbackup2001/2002 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/540668 (https://phabricator.wikimedia.org/T224528) (owner: 10Andrew Bogott) [19:40:04] 10Operations, 10ops-eqiad: Can't SSH to mw1290.mgmt - https://phabricator.wikimedia.org/T234153 (10Dzahn) @Cmjohnson Depooled and scheduled an Icinga downtime for about 2 days. Go ahead. [19:40:37] !log mw1290 - depooled and scheduled downtime in Icinga for hardware maintenance T234153 [19:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:41] T234153: Can't SSH to mw1290.mgmt - https://phabricator.wikimedia.org/T234153 [19:41:32] 10Operations, 10ops-eqiad, 10DBA: es1019 IPMI and its management interface are unresponsive (again2) - https://phabricator.wikimedia.org/T233698 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson I powered the server off, unplugged everything, removed the PSU drained flea power by holding the power button f... [19:46:31] 10Operations, 10ops-eqiad, 10DBA: es1019 IPMI and its management interface are unresponsive (again2) - https://phabricator.wikimedia.org/T233698 (10Marostegui) Thanks - I can access too! 
[19:55:37] (03PS1) 10Eevans: sessionstore: Upgrade to v1.0.5 release [deployment-charts] - 10https://gerrit.wikimedia.org/r/540671 (https://phabricator.wikimedia.org/T227514) [19:57:03] (03CR) 10Eevans: [V: 03+2 C: 03+2] sessionstore: Upgrade to v1.0.5 release [deployment-charts] - 10https://gerrit.wikimedia.org/r/540671 (https://phabricator.wikimedia.org/T227514) (owner: 10Eevans) [20:01:19] (03PS1) 10Dzahn: ssl: add new parsoid.svc(eqiad|codfw) certs [puppet] - 10https://gerrit.wikimedia.org/r/540672 (https://phabricator.wikimedia.org/T233654) [20:01:54] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'sessionstore' for release 'production' . [20:01:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:51] (03PS1) 10Dzahn: ssl: fix file extension of mwmwaint cert file [puppet] - 10https://gerrit.wikimedia.org/r/540673 [20:05:30] !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'sessionstore' for release 'production' . [20:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:39] (03PS1) 10Dzahn: add fake keys for new parsoid certs [labs/private] - 10https://gerrit.wikimedia.org/r/540675 [20:07:27] PROBLEM - DPKG on wtp2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [20:08:22] (03PS2) 10Dzahn: add fake keys for new parsoid certs [labs/private] - 10https://gerrit.wikimedia.org/r/540675 (https://phabricator.wikimedia.org/T233654) [20:09:03] RECOVERY - DPKG on wtp2001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [20:10:14] (03CR) 10Dzahn: [V: 03+2 C: 03+2] add fake keys for new parsoid certs [labs/private] - 10https://gerrit.wikimedia.org/r/540675 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [20:10:52] (03PS3) 10Dzahn: add fake keys for new parsoid certs [labs/private] - 10https://gerrit.wikimedia.org/r/540675 (https://phabricator.wikimedia.org/T233654) [20:11:44] (03CR) 
10Dzahn: [V: 03+2 C: 03+2] add fake keys for new parsoid certs [labs/private] - 10https://gerrit.wikimedia.org/r/540675 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [20:12:34] (03PS2) 10Dzahn: ssl: add new parsoid.svc(eqiad|codfw) certs [puppet] - 10https://gerrit.wikimedia.org/r/540672 (https://phabricator.wikimedia.org/T233654) [20:14:01] (03CR) 10Dzahn: [C: 03+2] ssl: add new parsoid.svc(eqiad|codfw) certs [puppet] - 10https://gerrit.wikimedia.org/r/540672 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [20:17:19] PROBLEM - Check systemd state on wtp2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:19:31] ACKNOWLEDGEMENT - Check systemd state on wtp2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn WIP https://phabricator.wikimedia.org/T233654 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:19:31] ACKNOWLEDGEMENT - DPKG on wtp2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages daniel_zahn WIP https://phabricator.wikimedia.org/T233654 https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [20:21:26] wtp2001 is becoming an appserver .. 
that's why [20:21:27] PROBLEM - php7.2-fpm service on wtp2001 is CRITICAL: NRPE: Command check_php7.2-fpm-state not defined https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:21:44] all that stuff is just being added right now and wasn't there before [20:23:46] ACKNOWLEDGEMENT - php7.2-fpm service on wtp2001 is CRITICAL: NRPE: Command check_php7.2-fpm-state not defined daniel_zahn WIP https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:24:44] 10Operations, 10Wikimedia-Logstash, 10observability: Standardize the logging format - https://phabricator.wikimedia.org/T234565 (10colewhite) [20:27:15] PROBLEM - Nginx local proxy to apache on wtp2001 is CRITICAL: connect to address 10.192.16.43 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [20:28:39] schedules downtime for most of these once they show up. puppet run ongoing [20:29:31] RECOVERY - Check systemd state on wtp2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:29:37] RECOVERY - php7.2-fpm service on wtp2001 is OK: OK - php7.2-fpm is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:30:09] 10Operations, 10Traffic, 10observability: global HTTP (un)availability number, as reported in Frontend Traffic dashboard, is bogus - https://phabricator.wikimedia.org/T234567 (10CDanis) [20:31:44] (03PS1) 10CDanis: prometheus global: add rules for correct global HTTP avail [puppet] - 10https://gerrit.wikimedia.org/r/540676 (https://phabricator.wikimedia.org/T234567) [20:32:25] RECOVERY - Nginx local proxy to apache on wtp2001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 0.217 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:34:36] (03PS1) 10Ladsgroup: Set $wgMainPageIsDomainRoot true for fixcopyrightwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540678 
(https://phabricator.wikimedia.org/T120085) [20:34:38] (03PS1) 10Ladsgroup: Get rid of main page hack for fixcopyrightwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540679 (https://phabricator.wikimedia.org/T120085) [20:35:25] 10Operations, 10Traffic, 10observability, 10Patch-For-Review: global HTTP (un)availability number, as reported in Frontend Traffic dashboard, is bogus - https://phabricator.wikimedia.org/T234567 (10CDanis) [20:35:35] 10Operations, 10Traffic, 10observability, 10Patch-For-Review: global HTTP (un)availability number, as reported in Frontend Traffic dashboard, is bogus - https://phabricator.wikimedia.org/T234567 (10CDanis) p:05Triage→03Normal [20:37:36] (03CR) 10Dzahn: "The certificate situation was a bit cumbersome but has been fixed now. We needed a separate key for each cert as well. There are now 3 cer" [puppet] - 10https://gerrit.wikimedia.org/r/539181 (https://phabricator.wikimedia.org/T233654) (owner: 10Dzahn) [20:38:07] (03PS1) 10Dzahn: parsoid: turn wtp1025 into the first eqiad parsoid appserver [puppet] - 10https://gerrit.wikimedia.org/r/540680 (https://phabricator.wikimedia.org/T233654) [20:40:22] 10Operations, 10serviceops, 10Patch-For-Review: Make the parsoid cluster support parsoid/PHP - https://phabricator.wikimedia.org/T233654 (10Dzahn) certificate issues fixed. wtp2001 after the mediawiki roles have been applied now: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=wtp2001... [20:41:30] 10Operations, 10Wikimedia-Logstash, 10observability: Standardize the logging format - https://phabricator.wikimedia.org/T234565 (10colewhite) [20:44:27] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (10Andrew) I am finally back looking at this! I'm not sure quite what I should expect regarding the raids here -- I reimaged (for buster) and... 
[20:45:27] 10Operations, 10Core Platform Team, 10Parsing-Team, 10Performance-Team, and 8 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Krinkle) [20:46:27] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for nqowiki - https://phabricator.wikimedia.org/T230543 (10Bstorm) [20:49:43] (03PS6) 10Volans: Add commit action to the Homer class [software/homer] - 10https://gerrit.wikimedia.org/r/539551 (https://phabricator.wikimedia.org/T228388) [20:50:43] 10Operations, 10Core Platform Team, 10Editing-team, 10Parsing-Team, and 9 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Krinkle) @ssastry, @Esanders Hi - could you review this RFC for potential impact on Parsoid and VisualEditor? The... [20:51:39] (03CR) 10Krinkle: [C: 03+1] Set $wgMainPageIsDomainRoot true for fixcopyrightwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540678 (https://phabricator.wikimedia.org/T120085) (owner: 10Ladsgroup) [20:51:42] (03CR) 10Krinkle: [C: 03+1] Get rid of main page hack for fixcopyrightwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540679 (https://phabricator.wikimedia.org/T120085) (owner: 10Ladsgroup) [20:54:48] (03PS1) 10Dzahn: conftool: turn wtp2001 into a test server [puppet] - 10https://gerrit.wikimedia.org/r/540684 (https://phabricator.wikimedia.org/T233654) [20:59:29] 10Operations, 10SRE-Access-Requests: Requesting access to 'analytics-privatedata-users' and 'researchers' for Jerrie Kumalah - https://phabricator.wikimedia.org/T234433 (10DStrine) Hey all, An Advancement team member is sick, and out of communication due to an unforeseen event for an extended period of time.... 
[20:59:50] 10Operations, 10SRE-Access-Requests: Requesting access to 'analytics-privatedata-users' and 'researchers' for Erin Yener - https://phabricator.wikimedia.org/T234529 (10DStrine) Hey all, An Advancement team member is sick, and out of communication due to an unforeseen event for an extended period of time. We... [21:02:02] (03PS2) 10Dzahn: ssl: fix file extension of mwmwaint cert file [puppet] - 10https://gerrit.wikimedia.org/r/540673 [21:02:55] (03CR) 10Dzahn: [C: 03+2] ssl: fix file extension of mwmwaint cert file [puppet] - 10https://gerrit.wikimedia.org/r/540673 (owner: 10Dzahn) [21:05:30] (03PS1) 10Jhedden: openstack: use wikimediacloud for keystone API requests [puppet] - 10https://gerrit.wikimedia.org/r/540685 (https://phabricator.wikimedia.org/T223907) [21:10:09] (03CR) 10Dzahn: [C: 03+2] Gerrit: Tweak replication config [puppet] - 10https://gerrit.wikimedia.org/r/540458 (owner: 10Paladox) [21:10:17] (03PS6) 10Dzahn: Gerrit: Tweak replication config [puppet] - 10https://gerrit.wikimedia.org/r/540458 (owner: 10Paladox) [21:10:33] thanks mutante ! [21:10:41] that'll need gerrit to be restarted. [21:11:50] (03PS7) 10Dzahn: Gerrit: Rename "slaves" in replication config to replica_codfw [puppet] - 10https://gerrit.wikimedia.org/r/540462 (owner: 10Paladox) [21:12:48] (03CR) 10Dzahn: [C: 03+2] Gerrit: Rename "slaves" in replication config to replica_codfw [puppet] - 10https://gerrit.wikimedia.org/r/540462 (owner: 10Paladox) [21:13:54] (03PS2) 10Jhedden: openstack: update codfw1 keystone clients for wikimediacloud domain [puppet] - 10https://gerrit.wikimedia.org/r/540685 (https://phabricator.wikimedia.org/T223907) [21:14:12] 10Operations, 10Core Platform Team, 10Editing-team, 10Parsing-Team, and 9 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10cscott) From Parsoid's perspective: 1) is $wgMainPageIsDomainRoot available in SiteInfo? Parsoid/JS and the non-in... 
[21:15:56] (03CR) 10Jforrester: [C: 03+1] Set $wgMainPageIsDomainRoot true for fixcopyrightwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/540678 (https://phabricator.wikimedia.org/T120085) (owner: 10Ladsgroup) [21:22:05] (03PS7) 10Volans: Add commit action to the Homer class [software/homer] - 10https://gerrit.wikimedia.org/r/539551 (https://phabricator.wikimedia.org/T228388) [21:28:05] (03PS3) 10Jhedden: openstack: update codfw1 keystone clients for wikimediacloud domain [puppet] - 10https://gerrit.wikimedia.org/r/540685 (https://phabricator.wikimedia.org/T223907) [21:32:20] (03PS4) 10Jhedden: openstack: update codfw1 keystone clients for wikimediacloud domain [puppet] - 10https://gerrit.wikimedia.org/r/540685 (https://phabricator.wikimedia.org/T223907) [21:38:12] (03PS5) 10Jhedden: openstack: update codfw1 keystone clients for wikimediacloud domain [puppet] - 10https://gerrit.wikimedia.org/r/540685 (https://phabricator.wikimedia.org/T223907) [21:38:45] (03CR) 10Jhedden: [C: 03+2] "PCC results: https://puppet-compiler.wmflabs.org/compiler1002/18740/" [puppet] - 10https://gerrit.wikimedia.org/r/540685 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [21:41:08] (03PS8) 10Volans: Add commit action to the Homer class [software/homer] - 10https://gerrit.wikimedia.org/r/539551 (https://phabricator.wikimedia.org/T228388) [21:53:16] 10Operations, 10ops-eqiad, 10DBA: es1019 IPMI and its management interface are unresponsive (again2) - https://phabricator.wikimedia.org/T233698 (10Dzahn) 05Resolved→03Open ssh to mgmt works now. confirmed. But IPMI from remote fails to establish a session. Could you please check if "IPMI over LAN" is d... [21:55:59] 10Operations, 10ops-eqiad, 10DBA: es1019 IPMI and its management interface are unresponsive (again2) - https://phabricator.wikimedia.org/T233698 (10Dzahn) 05Open→03Resolved I take that comment back. (now) it works for me. [22:00:51] (03CR) 10Cwhite: [C: 03+1] "Looks good to me!" 
[puppet] - 10https://gerrit.wikimedia.org/r/540676 (https://phabricator.wikimedia.org/T234567) (owner: 10CDanis) [22:04:09] (03CR) 10Cwhite: initial commit (032 comments) [debs/prometheus-swagger-exporter] - 10https://gerrit.wikimedia.org/r/536376 (owner: 10Cwhite) [22:04:57] (03CR) 10Cwhite: initial commit (032 comments) [debs/prometheus-swagger-exporter] - 10https://gerrit.wikimedia.org/r/536376 (owner: 10Cwhite) [22:05:49] (03CR) 10Cwhite: initial commit (032 comments) [debs/prometheus-swagger-exporter] - 10https://gerrit.wikimedia.org/r/536376 (owner: 10Cwhite) [22:13:25] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2035 - https://phabricator.wikimedia.org/T229784 (10Papaul) [22:19:24] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: Decommission db2054.codfw.wmnet - https://phabricator.wikimedia.org/T232969 (10Papaul) [22:40:21] (03PS5) 10Cwhite: initial commit [debs/prometheus-swagger-exporter] - 10https://gerrit.wikimedia.org/r/536376 [22:41:00] (03CR) 10Cwhite: "All feedback incorporated. Please let me know if you see anything else?" (031 comment) [debs/prometheus-swagger-exporter] - 10https://gerrit.wikimedia.org/r/536376 (owner: 10Cwhite) [22:43:10] (03PS1) 10Paladox: gerrit: add role on gerrit1001 and remove gerrit::migration [puppet] - 10https://gerrit.wikimedia.org/r/540703 (https://phabricator.wikimedia.org/T222391) [22:43:41] (03PS2) 10Paladox: gerrit: add role on gerrit1001 and remove gerrit::migration [puppet] - 10https://gerrit.wikimedia.org/r/540703 (https://phabricator.wikimedia.org/T222391) [22:43:47] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/540703 (https://phabricator.wikimedia.org/T222391) (owner: 10Paladox) [22:52:01] (03PS1) 10Dzahn: Revert "fix IPv6 address for gerrit1001, wrong row" [dns] - 10https://gerrit.wikimedia.org/r/540704 [22:52:33] (03CR) 10Paladox: [C: 03+1] "Per chat." 
[dns] - 10https://gerrit.wikimedia.org/r/540704 (owner: 10Dzahn) [22:56:34] (03PS1) 10Dzahn: move gerrit-new service IP to B network [dns] - 10https://gerrit.wikimedia.org/r/540707 (https://phabricator.wikimedia.org/T222391) [22:57:00] (03CR) 10jerkins-bot: [V: 04-1] move gerrit-new service IP to B network [dns] - 10https://gerrit.wikimedia.org/r/540707 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [22:58:53] (03PS2) 10Dzahn: Revert "fix IPv6 address for gerrit1001, wrong row" [dns] - 10https://gerrit.wikimedia.org/r/540704 [22:59:42] (03CR) 10Dzahn: [C: 03+2] Revert "fix IPv6 address for gerrit1001, wrong row" [dns] - 10https://gerrit.wikimedia.org/r/540704 (owner: 10Dzahn) [22:59:47] (03PS3) 10Dzahn: Revert "fix IPv6 address for gerrit1001, wrong row" [dns] - 10https://gerrit.wikimedia.org/r/540704 [23:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191003T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:00:57] 10Operations, 10ops-codfw: refresh/replace scs-c1-codfw - https://phabricator.wikimedia.org/T231687 (10Papaul) [23:01:11] 10Operations, 10ops-codfw: refresh/replace scs-c1-codfw - https://phabricator.wikimedia.org/T231687 (10Papaul) 05Open→03Resolved This is complete [23:01:57] (03CR) 10Dzahn: "reverted. this was wrong. 
the thing that needed fixing was the service IP, not the server IP" [dns] - 10https://gerrit.wikimedia.org/r/540514 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [23:02:57] 10Operations, 10ops-codfw, 10decommission, 10media-storage, 10User-fgiunchedi: decom ms-be201[345] - https://phabricator.wikimedia.org/T221068 (10Papaul) [23:03:52] (03PS2) 10Dzahn: move gerrit-new service IP to B network [dns] - 10https://gerrit.wikimedia.org/r/540707 (https://phabricator.wikimedia.org/T222391) [23:07:22] (03PS3) 10Dzahn: move gerrit-new service IP to B network [dns] - 10https://gerrit.wikimedia.org/r/540707 (https://phabricator.wikimedia.org/T222391) [23:07:47] (03CR) 10jerkins-bot: [V: 04-1] move gerrit-new service IP to B network [dns] - 10https://gerrit.wikimedia.org/r/540707 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [23:09:54] (03CR) 10Nuria: [C: 03+1] reportupdater::jobs::mysql.pp: Absent jobs affected by migration [puppet] - 10https://gerrit.wikimedia.org/r/540658 (https://phabricator.wikimedia.org/T223414) (owner: 10Mforns) [23:10:50] (03PS4) 10Dzahn: move gerrit-new service IP to B network [dns] - 10https://gerrit.wikimedia.org/r/540707 (https://phabricator.wikimedia.org/T222391) [23:11:16] (03CR) 10jerkins-bot: [V: 04-1] move gerrit-new service IP to B network [dns] - 10https://gerrit.wikimedia.org/r/540707 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [23:12:39] (03CR) 10Paladox: move gerrit-new service IP to B network (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/540707 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [23:13:31] (03PS5) 10Dzahn: move gerrit-new service IP to B network [dns] - 10https://gerrit.wikimedia.org/r/540707 (https://phabricator.wikimedia.org/T222391) [23:19:50] (03CR) 10Dzahn: "everything should be in the B network now, which is the 2 in the IPv6 addresses" [dns] - 10https://gerrit.wikimedia.org/r/540707 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) 
[23:19:52] (03CR) 10Dzahn: [C: 03+1] move gerrit-new service IP to B network [dns] - 10https://gerrit.wikimedia.org/r/540707 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [23:23:05] (03PS1) 10Dzahn: gerrit1001: fix service IP to IP in B network [puppet] - 10https://gerrit.wikimedia.org/r/540710 (https://phabricator.wikimedia.org/T222391) [23:23:14] (03PS2) 10Dzahn: gerrit1001: fix service IP to IP in B network [puppet] - 10https://gerrit.wikimedia.org/r/540710 (https://phabricator.wikimedia.org/T222391) [23:23:45] (03CR) 10Paladox: [C: 03+1] gerrit1001: fix service IP to IP in B network [puppet] - 10https://gerrit.wikimedia.org/r/540710 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [23:23:51] (03CR) 10Paladox: [C: 03+1] move gerrit-new service IP to B network [dns] - 10https://gerrit.wikimedia.org/r/540707 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [23:24:28] (03CR) 10Dzahn: [C: 03+2] move gerrit-new service IP to B network [dns] - 10https://gerrit.wikimedia.org/r/540707 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [23:25:49] (03CR) 10Dzahn: [C: 03+2] gerrit1001: fix service IP to IP in B network [puppet] - 10https://gerrit.wikimedia.org/r/540710 (https://phabricator.wikimedia.org/T222391) (owner: 10Dzahn) [23:27:06] (03PS3) 10Paladox: gerrit: add role on gerrit1001 and remove gerrit::migration [puppet] - 10https://gerrit.wikimedia.org/r/540703 (https://phabricator.wikimedia.org/T222391) [23:27:11] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/540703 (https://phabricator.wikimedia.org/T222391) (owner: 10Paladox) [23:30:11] paladox: can now search https://puppet-compiler.wmflabs.org/compiler1002/296/gerrit1001.wikimedia.org/change.gerrit1001.wikimedia.org.pson for just "208.80.154.13" [23:30:21] since they are following each other [23:30:53] (03CR) 10Dzahn: [C: 03+2] gerrit: add role on gerrit1001 and remove gerrit::migration [puppet] - 
10https://gerrit.wikimedia.org/r/540703 (https://phabricator.wikimedia.org/T222391) (owner: 10Paladox) [23:30:59] \o/ [23:32:05] (03CR) 10Dzahn: [C: 03+2] "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/540703 (https://phabricator.wikimedia.org/T222391) (owner: 10Paladox) [23:32:16] i could not submit before that button [23:32:19] start review [23:32:49] oh [23:32:53] i forgot to unwip :P [23:35:06] paladox: works so far.. IP is fixed [23:35:12] yay!!! [23:35:30] icinga is also green [23:44:18] 10Operations, 10Core Platform Team, 10Editing-team, 10Parsing-Team, and 9 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Johan) Something like this for Tech News? (Plus links and clearer handling of URLs.) The URL of the main page of th... [23:50:55] !log gerrit - restarting for replication config tweaks [23:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:52:36] (03PS1) 10Paladox: Gerrit: Add new line to gerrit1001.yaml [puppet] - 10https://gerrit.wikimedia.org/r/540727 [23:54:17] (03PS1) 10Papaul: DNS: Remove mgmt DNS for db2035 and db2054 [dns] - 10https://gerrit.wikimedia.org/r/540728 [23:54:22] (03PS2) 10Paladox: Gerrit: Add todo comment to gerrit1001.yaml [puppet] - 10https://gerrit.wikimedia.org/r/540727 [23:55:00] (03CR) 10Dzahn: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/540727 (owner: 10Paladox) [23:55:44] (03CR) 10Dzahn: [C: 03+2] "using this to test replication" [puppet] - 10https://gerrit.wikimedia.org/r/540727 (owner: 10Paladox)