[01:07:23] RECOVERY - Check systemd state on kafkamon1001 is OK: OK - running: The system is fully operational
[01:10:42] PROBLEM - Check systemd state on kafkamon1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:02:16] (PS2) GTirloni: shinken: Adjustments necessary to upgrade 1.4->2.0 and Trusty->Jessie [puppet] - https://gerrit.wikimedia.org/r/468792 (https://phabricator.wikimedia.org/T204562)
[02:03:12] RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 117 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[03:30:13] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 717.32 seconds
[03:34:23] PROBLEM - puppet last run on cp4028 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-ISP.mmdb.gz]
[04:01:03] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 193.09 seconds
[04:05:02] RECOVERY - puppet last run on cp4028 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[06:28:12] PROBLEM - puppet last run on dbproxy1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/smart-data-dump]
[06:28:52] PROBLEM - puppet last run on authdns2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/20-confd.conf]
[06:31:03] PROBLEM - puppet last run on cloudservices1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled]
[06:31:33] PROBLEM - puppet last run on mw1305 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ferm/functions.conf]
[06:56:32] RECOVERY - puppet last run on cloudservices1004 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[06:57:02] RECOVERY - puppet last run on mw1305 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:33] RECOVERY - puppet last run on dbproxy1010 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:59:22] RECOVERY - puppet last run on authdns2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[09:08:53] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.4769 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash
[09:09:53] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash
[09:35:02] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on einsteinium is CRITICAL: 137.7 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[11:10:48] *.archive.org is whitelisted on Commons for URL to Upload https://commons.wikimedia.org/wiki/Commons:Upload_tools/wgCopyUploadsDomains
[11:11:09] but https://ia800406.us.archive.org/21/items/BharatvarshiyMadhyayuginCharitrakoshCropped/lila%20charitra_cropped.pdf can't be uploaded
[11:11:15] any idea?
[11:14:08] yannf, yeah that won't match
[11:14:25] *.archive.org is whitelisted, not *.*.archive.org
[11:15:23] https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/refs/heads/master/includes/upload/UploadFromUrl.php#83
[11:18:35] oh :/
[11:22:42] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:25:26] Operations, MediaWiki-Page-deletion, Performance: Deleting pages on the English Wikipedia is very slow - https://phabricator.wikimedia.org/T207530 (MarcoAurelio) AIUI the method we requested in the task above was to be applied when a page had a high number of revisions, not for all page deletions. If...
[11:27:25] https://phabricator.wikimedia.org/T207581
[11:27:46] Krenair, parent task is ok?
[11:28:55] yeah
[11:29:22] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[11:29:35] you will need to add site-requests though yannf
[11:29:42] Operations, DNS, Traffic: Add punjabi.wikimedia.org to DNS - https://phabricator.wikimedia.org/T207583 (Urbanecm)
[11:36:48] (PS5) MarcoAurelio: Close chairwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/443585 (https://phabricator.wikimedia.org/T184961)
[11:37:16] (PS1) Urbanecm: Add punjabiwikimedia [dns] - https://gerrit.wikimedia.org/r/468812 (https://phabricator.wikimedia.org/T207583)
[11:46:46] Operations, DNS, Traffic, Patch-For-Review: Add punjabi.wikimedia.org to DNS and Apache - https://phabricator.wikimedia.org/T207583 (Urbanecm)
[11:46:56] Operations, DNS, Traffic, Wikimedia-Apache-configuration, Patch-For-Review: Add punjabi.wikimedia.org to DNS and Apache - https://phabricator.wikimedia.org/T207583 (Urbanecm)
[11:48:03] (PS1) Urbanecm: Add punjabi.wikimedia.org to Apache [puppet] - https://gerrit.wikimedia.org/r/468814
[11:48:27] (PS2) Urbanecm: Add punjabi.wikimedia.org to Apache [puppet] - https://gerrit.wikimedia.org/r/468814 (https://phabricator.wikimedia.org/T207583)
[11:58:37] (PS1) Urbanecm: Initial configuration for punjabiwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/468815 (https://phabricator.wikimedia.org/T204477)
[12:03:12] (PS2) Urbanecm: Initial configuration for punjabiwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/468815 (https://phabricator.wikimedia.org/T204477)
[12:15:20] (PS3) Urbanecm: Add punjabi.wikimedia.org to Apache [puppet] - https://gerrit.wikimedia.org/r/468814 (https://phabricator.wikimedia.org/T207583)
[12:26:24] (PS1) Urbanecm: Close internalwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/468823 (https://phabricator.wikimedia.org/T205584)
[12:30:53] PROBLEM - High lag on wdqs1003 is CRITICAL: 1.247e+04 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[12:34:39] * onimisionipe is looking into WDQS
[12:39:12] !log depooling wdqs1003 to catchup on lag time
[12:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:31] onimisionipe: thanks for looking into it!
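A note on the upload-whitelist question from 11:10–11:15 above: the wildcard entries in $wgCopyUploadsDomains span a single DNS label, so *.archive.org matches web.archive.org but not ia800406.us.archive.org. The Python sketch below only illustrates that single-label matching behaviour; it is not the actual check in the UploadFromUrl.php code linked at 11:15, and the extra *.*.archive.org entry mirrors the whitelist change proposed later in this log.

```python
import re

def host_allowed(host, whitelist):
    """Return True if `host` matches any whitelist entry.

    Illustration only: each '*' is treated as exactly one DNS label
    (no dots), so '*.archive.org' covers 'web.archive.org' but not
    'ia800406.us.archive.org'.
    """
    for entry in whitelist:
        # Escape literal dots, then let each '*' match a single label.
        pattern = re.escape(entry).replace(r'\*', r'[^.]+')
        if re.fullmatch(pattern, host):
            return True
    return False

whitelist = ['*.archive.org']
print(host_allowed('web.archive.org', whitelist))             # True
print(host_allowed('ia800406.us.archive.org', whitelist))     # False: two labels before archive.org
print(host_allowed('ia800406.us.archive.org',
                   whitelist + ['*.*.archive.org']))          # True once the extra entry is whitelisted
```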
[12:41:35] (PS1) Urbanecm: Anniversary logo for cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/468824 (https://phabricator.wikimedia.org/T207589)
[12:42:01] We should have a patch for the kafka poller tomorrow
[12:43:01] gehel: You're welcome!
[12:47:10] (PS2) Zoranzoki21: Enable suppressredirect and markbotedit rights to rollbackers on it.wikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/468075 (https://phabricator.wikimedia.org/T207300)
[12:47:14] (CR) Zoranzoki21: Enable suppressredirect and markbotedit rights to rollbackers on it.wikiversity (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/468075 (https://phabricator.wikimedia.org/T207300) (owner: Zoranzoki21)
[12:47:20] (PS3) Zoranzoki21: Enable suppressredirect and markbotedit rights to rollbackers on it.wikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/468075 (https://phabricator.wikimedia.org/T207300)
[12:55:43] (PS2) Urbanecm: Enable rollbacker right on srwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/468080 (https://phabricator.wikimedia.org/T206935) (owner: Zoranzoki21)
[12:56:04] (PS2) Zoranzoki21: Enable autopatroller, patroller and rollbacker rights on srwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/468079 (https://phabricator.wikimedia.org/T206936)
[12:56:52] (PS3) Urbanecm: Enable autopatroller, patroller and rollbacker rights on srwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/468079 (https://phabricator.wikimedia.org/T206936) (owner: Zoranzoki21)
[12:59:16] (CR) Urbanecm: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/443585 (https://phabricator.wikimedia.org/T184961) (owner: MarcoAurelio)
[13:00:24] (CR) Urbanecm: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/468075 (https://phabricator.wikimedia.org/T207300) (owner: Zoranzoki21)
[13:21:13] (PS3) Zoranzoki21: Enable rollbacker right on srwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/468080 (https://phabricator.wikimedia.org/T206935)
[13:22:55] (CR) Urbanecm: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/468080 (https://phabricator.wikimedia.org/T206935) (owner: Zoranzoki21)
[13:33:16] (PS4) Zoranzoki21: Enable rollbacker right on srwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/468080 (https://phabricator.wikimedia.org/T206935)
[13:37:40] (PS4) Zoranzoki21: Enable autopatroller, patroller and rollbacker rights on srwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/468079 (https://phabricator.wikimedia.org/T206936)
[13:38:53] (PS1) Framawiki: Whitelist *.*.archive.org in wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/468831 (https://phabricator.wikimedia.org/T207581)
[13:39:32] (CR) Urbanecm: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/468079 (https://phabricator.wikimedia.org/T206936) (owner: Zoranzoki21)
[13:47:07] Operations, Wikidata, Wikidata-Query-Service, cloud-services-team: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service - https://phabricator.wikimedia.org/T206636 (Andrew) Thanks, Stas. There are two ways I think we can go forward with this:...
[13:47:22] (PS6) Andrew Bogott: nova: update scheduling pools for main and eqiad1 [puppet] - https://gerrit.wikimedia.org/r/468377
[13:48:32] (CR) Andrew Bogott: [C: 2] nova: update scheduling pools for main and eqiad1 [puppet] - https://gerrit.wikimedia.org/r/468377 (owner: Andrew Bogott)
[13:49:48] (CR) Andrew Bogott: [C: 2] "I don't know what this was supposed to do, but clearly we're getting along without it :)" [puppet] - https://gerrit.wikimedia.org/r/468697 (owner: Faidon Liambotis)
[13:49:58] (PS2) Andrew Bogott: designate/mitaka: remove typo'ed extension [puppet] - https://gerrit.wikimedia.org/r/468697 (owner: Faidon Liambotis)
[13:52:05] (PS3) Andrew Bogott: labsaliaser: use keystone public port instead of admin port [puppet] - https://gerrit.wikimedia.org/r/468709 (https://phabricator.wikimedia.org/T207533) (owner: Alex Monk)
[13:53:31] (CR) Andrew Bogott: [C: 2] labsaliaser: use keystone public port instead of admin port [puppet] - https://gerrit.wikimedia.org/r/468709 (https://phabricator.wikimedia.org/T207533) (owner: Alex Monk)
[13:57:09] (CR) Andrew Bogott: [C: -1] "This seems right but we probably need the include in an upstream role. As it is this fails on cloudservices1003:" [puppet] - https://gerrit.wikimedia.org/r/468714 (https://phabricator.wikimedia.org/T207533) (owner: Alex Monk)
[13:57:37] (PS7) Zoranzoki21: Edited syntax of the code where is the content for user rights for mlwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/464485
[13:57:54] (CR) Andrew Bogott: [C: 2] "Tested post-merge and it looks good. Thanks!" [puppet] - https://gerrit.wikimedia.org/r/468709 (https://phabricator.wikimedia.org/T207533) (owner: Alex Monk)
[13:58:59] (PS2) Andrew Bogott: labs recursor: require interface alias before trying to start pdns-recursor [puppet] - https://gerrit.wikimedia.org/r/468708 (owner: Alex Monk)
[13:59:33] PROBLEM - configured eth on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[13:59:52] PROBLEM - MD RAID on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[13:59:54] andrewbogott, I'm not sure about that clientlib thing. Shouldn't I be able to apply the pdns recursor profile and have it work without requiring other profiles?
[14:00:02] PROBLEM - dhclient process on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[14:00:12] PROBLEM - Check systemd state on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[14:00:12] PROBLEM - DPKG on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[14:00:13] PROBLEM - Check whether ferm is active by checking the default input chain on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[14:00:22] PROBLEM - Keyholder SSH agent on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[14:00:23] PROBLEM - Disk space on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[14:00:23] PROBLEM - Check size of conntrack table on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[14:01:40] Krenair: yeah, ideally the profile should be able to live on its own. I haven't actually looked, do you know where the conflicting include is?
[14:02:02] PROBLEM - puppet last run on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[14:02:19] hm
[14:02:30] (CR) Andrew Bogott: [C: 2] labs recursor: require interface alias before trying to start pdns-recursor [puppet] - https://gerrit.wikimedia.org/r/468708 (owner: Alex Monk)
[14:02:39] I just noticed that I'm trying to include a main profile inside a base profile too
[14:02:53] I imagine it conflicts with stuff like modules/profile/manifests/openstack/main/designate/service.pp: require ::profile::openstack::main::clientlib
[14:04:49] btw, do you know what Faidon means about 'cloud->prod flow'? Are there things being communicated from the VMs to the recursor other than the names and IPs of instances?
[14:04:53] RECOVERY - Check size of conntrack table on cumin1001 is OK: OK: nf_conntrack is 0 % full
[14:05:12] RECOVERY - configured eth on cumin1001 is OK: OK - interfaces up
[14:05:23] RECOVERY - MD RAID on cumin1001 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[14:05:33] RECOVERY - dhclient process on cumin1001 is OK: PROCS OK: 0 processes with command name dhclient
[14:05:43] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational
[14:05:43] RECOVERY - DPKG on cumin1001 is OK: All packages OK
[14:05:43] RECOVERY - Check whether ferm is active by checking the default input chain on cumin1001 is OK: OK ferm input default policy is set
[14:05:52] RECOVERY - Keyholder SSH agent on cumin1001 is OK: OK: Keyholder is armed with all configured keys.
[14:05:53] RECOVERY - Disk space on cumin1001 is OK: DISK OK
[14:06:33] andrewbogott, I assume he means cloud stuff able to talk to prod stuff in ways that random hosts on the internet would not be able to
[14:07:08] ah, so just because that's a non-public IP? Hm…
[14:07:12] RECOVERY - puppet last run on cumin1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[14:07:20] andrewbogott, well
[14:07:26] it's a public IP
[14:07:39] sorry, I meant to type, a non-public port
[14:07:47] yeah
[14:07:54] I don't generally think of DNS servers as an attack vector
[14:08:01] this is kind of a tangent anyway
[14:08:30] it seems the basic idea is to move as much cloud infrastructure as possible into cloud VMs
[14:08:50] see PM
[14:16:33] PROBLEM - MD RAID on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[14:16:42] PROBLEM - dhclient process on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[14:16:53] PROBLEM - Check systemd state on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[14:16:53] PROBLEM - DPKG on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[14:16:53] PROBLEM - Check whether ferm is active by checking the default input chain on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[14:16:57] Operations, Cloud-VPS, Patch-For-Review: Move labs-recursors in WMCS - https://phabricator.wikimedia.org/T207533 (Andrew) My only concern about this is that those recursors are used about every second on every VM, so they're a huge, vital point of failure and I'm a bit reluctant to rock the boat. In...
[14:17:02] PROBLEM - Keyholder SSH agent on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[14:17:03] PROBLEM - Disk space on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[14:17:12] PROBLEM - Check size of conntrack table on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[14:17:32] PROBLEM - configured eth on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[14:19:42] PROBLEM - puppet last run on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[14:23:43] PROBLEM - Check the NTP synchronisation status of timesyncd on cumin1001 is CRITICAL: Return code of 255 is out of bounds
[14:34:43] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational
[14:34:43] RECOVERY - DPKG on cumin1001 is OK: All packages OK
[14:34:52] RECOVERY - Check whether ferm is active by checking the default input chain on cumin1001 is OK: OK ferm input default policy is set
[14:34:53] RECOVERY - Keyholder SSH agent on cumin1001 is OK: OK: Keyholder is armed with all configured keys.
[14:34:53] RECOVERY - Disk space on cumin1001 is OK: DISK OK
[14:35:02] RECOVERY - puppet last run on cumin1001 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[14:35:02] RECOVERY - Check size of conntrack table on cumin1001 is OK: OK: nf_conntrack is 0 % full
[14:35:22] RECOVERY - configured eth on cumin1001 is OK: OK - interfaces up
[14:35:33] RECOVERY - MD RAID on cumin1001 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[14:35:42] RECOVERY - dhclient process on cumin1001 is OK: PROCS OK: 0 processes with command name dhclient
[14:53:52] RECOVERY - Check the NTP synchronisation status of timesyncd on cumin1001 is OK: OK: synced at Sun 2018-10-21 14:53:45 UTC.
[15:35:22] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on einsteinium is CRITICAL: 136.3 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[15:57:28] !log adjust patch for T194204
[15:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:11:08] (PS1) Reedy: Updating interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/468845
[16:11:10] (CR) Reedy: [C: 2] Updating interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/468845 (owner: Reedy)
[16:14:52] (Merged) jenkins-bot: Updating interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/468845 (owner: Reedy)
[16:15:57] !log reedy@deploy1001 Synchronized wmf-config/interwiki.php: Updating interwiki cache (duration: 04m 52s)
[16:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:25:50] (CR) jenkins-bot: Updating interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/468845 (owner: Reedy)
[16:33:08] What's happening with jobs?
[16:33:37] gate-and-submit stopped working and started everything over from the start
[16:33:56] https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/467119/ is not merged
[16:47:43] I looked ^^ things look fine to me. The patch they referenced was odd (they Verified+2'd instead of just CodeReview+2'ing it); I re-+2'd, we'll see
[16:50:45] ok, maybe not?
zuul isn't picking up my +2
[16:53:11] i see it in the queue greg-g
[16:53:19] 467119,6
[16:53:25] and 453660,14
[17:03:23] (PS3) GTirloni: shinken: Adjustments necessary to upgrade 1.4->2.0 and Trusty->Jessie [puppet] - https://gerrit.wikimedia.org/r/468792 (https://phabricator.wikimedia.org/T204562)
[17:04:05] (CR) jerkins-bot: [V: -1] shinken: Adjustments necessary to upgrade 1.4->2.0 and Trusty->Jessie [puppet] - https://gerrit.wikimedia.org/r/468792 (https://phabricator.wikimedia.org/T204562) (owner: GTirloni)
[17:05:54] paladox: there's no update from zuul: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/467119/
[17:07:53] paladox: but yeah, I see that 467119,6 there being worked up, we'll see
[17:12:37] (PS1) GTirloni: shinken: Remove unused 'Keyholder status' check [puppet] - https://gerrit.wikimedia.org/r/468848 (https://phabricator.wikimedia.org/T183454)
[17:22:41] (CR) Alex Monk: [C: -1] "This is not for toolsbeta." [puppet] - https://gerrit.wikimedia.org/r/468848 (https://phabricator.wikimedia.org/T183454) (owner: GTirloni)
[17:22:58] heh, and now it failed
[17:26:16] (CR) GTirloni: "> Patch Set 1: Code-Review-1" [puppet] - https://gerrit.wikimedia.org/r/468848 (https://phabricator.wikimedia.org/T183454) (owner: GTirloni)
[17:26:19] (CR) Alex Monk: [C: -1] "That said looks like it has been broken since Ieac6487d." [puppet] - https://gerrit.wikimedia.org/r/468848 (https://phabricator.wikimedia.org/T183454) (owner: GTirloni)
[17:27:46] (PS2) Alex Monk: shinken: Remove broken 'Keyholder status' check [puppet] - https://gerrit.wikimedia.org/r/468848 (https://phabricator.wikimedia.org/T183454) (owner: GTirloni)
[17:27:57] (CR) Alex Monk: [C: 1] shinken: Remove broken 'Keyholder status' check [puppet] - https://gerrit.wikimedia.org/r/468848 (https://phabricator.wikimedia.org/T183454) (owner: GTirloni)
[17:37:08] LGTM now https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/467119/
[19:11:11] Hi, why doesn't phab.wmflabs.org send me email?
[19:15:05] Zoranzoki21: it's better to ask in -cloud, or the relevant tool maintainers directly
[19:15:11] because it's in neutron
[19:17:32] Operations, Wikidata, Wikidata-Query-Service, cloud-services-team: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service - https://phabricator.wikimedia.org/T206636 (Smalyshev) > Run a second set of tests on a similar VM that shares a host with o...
[19:38:21] (PS1) GTirloni: git-sync-upstream: Send cron mail in case of failures [puppet] - https://gerrit.wikimedia.org/r/468865 (https://phabricator.wikimedia.org/T184261)
[20:42:26] !log resuming replication on s4@dbstore2002 (T204930)
[20:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:42:29] T204930: dbstore2002 tables compression status check - https://phabricator.wikimedia.org/T204930
[20:56:18] I ack'd a predictive disk failure on db2061 (https://phabricator.wikimedia.org/T207212#4679442)
[20:56:40] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team, monitoring, Patch-For-Review: Releases Jenkins Icinga check failing after restricting access - https://phabricator.wikimedia.org/T206579 (hashar) Sorry for the trailing `/` and thank you for the quick fix.
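On the Zuul question from 16:33–17:07 above: the exchange suggests the gate-and-submit pipeline is enqueued off a Code-Review +2 approval, which is why a bare Verified +2 on the referenced patch did nothing until it was re-+2'd. The snippet below is a hypothetical, heavily simplified sketch of that kind of approval filtering, for illustration only; real Zuul reads its trigger rules from its layout configuration, and the names and structures here are assumptions, not Zuul's actual API.

```python
# Hypothetical sketch: decide whether a Gerrit "comment-added" event
# should enqueue a change into a gate pipeline that triggers on a
# Code-Review +2 vote.  Field names are illustrative assumptions.

GATE_TRIGGER = {"category": "Code-Review", "value": 2}

def should_enqueue(event_approvals):
    """event_approvals: list of dicts like {'category': ..., 'value': ...}."""
    return any(
        a["category"] == GATE_TRIGGER["category"] and a["value"] >= GATE_TRIGGER["value"]
        for a in event_approvals
    )

# A Verified+2 vote alone does not satisfy a Code-Review-based trigger:
print(should_enqueue([{"category": "Verified", "value": 2}]))      # False
# A fresh Code-Review+2 (like the re-+2 mentioned at 16:47) does:
print(should_enqueue([{"category": "Code-Review", "value": 2}]))   # True
```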
[21:45:44] Operations, MediaWiki-Page-deletion, Performance: Deleting pages on the English Wikipedia is very slow - https://phabricator.wikimedia.org/T207530 (BPirkle) It is unexpected that the website would appear to hang, whether or not the deletion is batched. The batch deletion threshold is controlled by $...
[22:11:09] (PS4) GTirloni: shinken: Adjustments necessary to upgrade 1.4->2.0 and Trusty->Jessie [puppet] - https://gerrit.wikimedia.org/r/468792 (https://phabricator.wikimedia.org/T204562)
[22:11:19] (CR) Mathew.onipe: "I have some concern with the way I implemented this. I have another proposal tomorrow that I think is better. Will discuss in standup. Tha" (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T202885) (owner: Mathew.onipe)
[22:14:23] RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 1131 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[22:15:39] !log repooling wdqs1003 as it has caught up on lag
[22:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:20:20] onimisionipe: thanks again!
[22:20:23] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[22:21:22] gehel: uwc!
[22:22:42] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
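For readers puzzled by the "(C)3600 ge (W)1200 ge 1131" style output in the wdqs and logstash alerts throughout this log: one plausible reading, consistent with the PROBLEM and RECOVERY values shown above, is that the check compares the current value against the warning and critical thresholds using the stated operator ("ge" meaning greater-or-equal). The sketch below only illustrates that reading with the wdqs1003 numbers from 12:30:53 and 22:14:23; it is an assumption about the output format, not the actual check plugin.

```python
# Sketch of one plausible reading of the "(C)<crit> ge (W)<warn> ge <value>"
# alert output: the value is compared against the critical threshold first,
# then the warning threshold, using the stated operator.

def classify(value, warn, crit, op="ge"):
    cmp = (lambda a, b: a >= b) if op == "ge" else (lambda a, b: a <= b)
    if cmp(value, crit):
        return "CRITICAL"   # e.g. 1.247e+04 ge 3600 at 12:30:53
    if cmp(value, warn):
        return "WARNING"
    return "OK"             # e.g. (C)3600 ge (W)1200 ge 1131 at 22:14:23

print(classify(12470, warn=1200, crit=3600))  # CRITICAL
print(classify(1131, warn=1200, crit=3600))   # OK
```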