[00:16:27] RECOVERY - logstash process on logstash1003 is OK: PROCS OK: 1 process with UID = 998 (logstash), command name java, args logstash
[00:23:47] 10Ops-Access-Requests, 6operations: Requesting access to (wmf-)deployment for jzerebecki - https://phabricator.wikimedia.org/T116487#1751220 (10hoo) >>! In T116487#1751166, @Krenair wrote: >>>! In T116487#1751145, @hoo wrote: >> While (probably) not a formal requirement, I think you should also get access to t...
[00:27:26] 10Ops-Access-Requests, 6operations: Requesting access to (wmf-)deployment for jzerebecki - https://phabricator.wikimedia.org/T116487#1751230 (10Krenair) >>! In T116487#1751220, @hoo wrote: >>>! In T116487#1751166, @Krenair wrote: >>>>! In T116487#1751145, @hoo wrote: >>> While (probably) not a formal requireme...
[00:28:53] 10Ops-Access-Requests, 6operations: Requesting access to (wmf-)deployment for jzerebecki - https://phabricator.wikimedia.org/T116487#1751233 (10JanZerebecki) For the shell group I count 2 wmf employees, 1 disabled account. None for the gerrit group. The mediawiki gerrit group includes the wmf ldap group, but t...
[00:33:09] ori: Grafana uses POST for larger queries to graphite, and for the labs graphite we use the grafana proxy.
[00:33:21] However POST is disabled, so the dashboard doesn't work at https://grafana.wikimedia.org/dashboard/db/labs-project-board
[00:33:35] because POST to /api/datasources/proxy/3/ is rejected
[00:34:47] Actually, I'll change the datasource back to not use the proxy
[00:35:02] I was hoping it would be less buggy when using a request directly to the internal graphite server but it's just as buggy
[00:35:03] https://grafana-admin.wikimedia.org/dashboard/db/labs-project-board
[00:35:19] Some panels are randomly missing there because graphite.wmflabs is responding with PNG instead of JSON sometimes
[00:35:25] it's not deterministic though
[00:35:46] so naturally grafana JS gets confused and can't read the response
[01:51:15] 7Puppet, 6Labs, 6Phabricator: phabricator puppet at labs broken - https://phabricator.wikimedia.org/T116442#1751263 (10zhuyifei1999) >>! In T116442#1751011, @Negative24 wrote: > `role::phabricator::main` isn't the right Puppet class to use in Labs. I'm pretty sure the error had to do with the site variables....
[02:23:32] !log l10nupdate@tin Synchronized php-1.27.0-wmf.3/cache/l10n: l10nupdate for 1.27.0-wmf.3 (duration: 07m 59s)
[02:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:27:25] 7Puppet, 6Labs, 6Phabricator: phabricator puppet at labs broken - https://phabricator.wikimedia.org/T116442#1751267 (10Negative24) Its not documented. But its not hard. Essentially its just: 1. Apply `role::phabricator::labs` and run puppet 2. `cd /srv/phab/phabricator` and run `sudo bin/storage upgrade` 3....
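The dashboard breakage discussed at 00:33 comes down to Grafana sending POST render queries to Graphite, either through its datasource proxy or directly, and sometimes getting a PNG back instead of JSON. A minimal sketch of the kind of request involved, assuming a reachable graphite host (hostname and metric name below are illustrative, not the actual labs setup):

```
# Query via Grafana's datasource proxy (proxy id 3 is taken from the log above;
# the metric target is purely illustrative). This is the POST that was rejected.
curl -s -X POST 'https://grafana.wikimedia.org/api/datasources/proxy/3/render' \
  --data-urlencode 'target=example.metric.path' \
  --data-urlencode 'from=-1h' \
  --data-urlencode 'format=json'

# The same render query sent straight to a graphite-web instance; if the backend
# answers with image/png rather than application/json, Grafana's JS cannot parse
# the response, which matches the "randomly missing panels" symptom.
curl -s -X POST 'http://graphite.example/render' \
  --data-urlencode 'target=example.metric.path' \
  --data-urlencode 'format=json' \
  -o /dev/null -w '%{content_type}\n'
```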
[02:28:08] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.3) at 2015-10-25 02:28:08+00:00
[02:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:39:06] PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[02:39:15] PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[02:41:55] PROBLEM - Restbase endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:05:37] RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy
[03:10:55] PROBLEM - Restbase endpoints health on aqs1002 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[03:14:16] RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy
[03:19:36] PROBLEM - Restbase endpoints health on aqs1003 is CRITICAL: /pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200)
[03:21:15] PROBLEM - Analytics Cassanda CQL query interface on aqs1001 is CRITICAL: Connection timed out
[03:25:47] RECOVERY - Restbase endpoints health on aqs1001 is OK: All endpoints are healthy
[03:26:26] RECOVERY - Analytics Cassanda CQL query interface on aqs1001 is OK: TCP OK - 0.000 second response time on port 9042
[03:26:36] RECOVERY - Restbase endpoints health on aqs1003 is OK: All endpoints are healthy
[03:26:45] RECOVERY - Restbase endpoints health on aqs1002 is OK: All endpoints are healthy
[03:35:26] PROBLEM - puppet last run on mw2038 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:35:36] PROBLEM - puppet last run on mw1252 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:01:07] PROBLEM - puppet last run on mw1102 is CRITICAL: CRITICAL: puppet fail
[04:01:47] RECOVERY - puppet last run on mw2038 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[04:01:56] RECOVERY - puppet last run on mw1252 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[04:29:07] RECOVERY - puppet last run on mw1102 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[05:17:36] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 0 below the confidence bounds
[05:21:05] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds
[05:28:06] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds
[05:31:45] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 3 below the confidence bounds
[05:35:07] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 2 below the confidence bounds
[05:36:33] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun Oct 25 05:36:33 UTC 2015 (duration 36m 32s)
[05:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:42:06] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds
[05:47:25] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds
[05:52:36] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds
[06:08:36] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 8 below the confidence bounds
[06:13:46] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 8 below the confidence bounds
[06:30:25] PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:26] PROBLEM - puppet last run on mw2023 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:45] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: puppet fail
[06:31:06] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:16] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:56] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:16] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:25] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:36] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:15] PROBLEM - puppet last run on mw2016 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:26] PROBLEM - puppet last run on mw2136 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:41:13] 6operations, 10Traffic, 7Availability, 5Patch-For-Review: Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820#1751407 (10aaron)
[06:55:47] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[06:56:06] RECOVERY - puppet last run on mw2016 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:56:56] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[06:56:56] RECOVERY - puppet last run on mw2081 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[06:57:05] RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:05] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:57:15] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:56] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:07] RECOVERY - puppet last run on mw2136 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[06:58:26] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:57] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:03:56] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 out: 300 virgin: 25)
[08:19:36] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits.
[08:54:40] (03PS1) 10Ori.livneh: Replace dynamic gdash.wikimedia.org with static mirror [puppet] - 10https://gerrit.wikimedia.org/r/248678 (https://phabricator.wikimedia.org/T104365)
[08:57:58] (03PS2) 10Ori.livneh: Replace dynamic gdash.wikimedia.org with static mirror [puppet] - 10https://gerrit.wikimedia.org/r/248678 (https://phabricator.wikimedia.org/T104365)
[09:07:27] (03CR) 10Ori.livneh: [C: 032] Replace dynamic gdash.wikimedia.org with static mirror [puppet] - 10https://gerrit.wikimedia.org/r/248678 (https://phabricator.wikimedia.org/T104365) (owner: 10Ori.livneh)
[09:35:07] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 16 data above and 7 below the confidence bounds
[09:45:46] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 8 below the confidence bounds
[09:47:15] ori: I wonder how long before you get ticked off enough by the daily logrotate based puppet failures to do something about them
[09:47:22] ticking-off-based-development!
[09:48:05] i fixed it before but godog (I think) reverted
[09:48:33] haha reallyyy
[09:49:39] https://gerrit.wikimedia.org/r/#/c/219788/
[09:50:04] revert (with reason): https://gerrit.wikimedia.org/r/#/c/221149/
[09:50:56] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 7 below the confidence bounds
[09:51:06] hmm we can probably re-revert it by directing the passenger logs to elsewhere
[09:52:35] yeah we could. i didn't even want to use chronolog, though. i think the best solution is logrotate's "copytruncate" option, which truncates the log file in place after creating a copy
[09:53:28] ah nice
[09:53:53] filippo vetoed that because that approach is susceptible to dropping log records during the brief gap between the copy and truncate operations
[09:55:29] I wonder how many times people look at the apache log files
[09:55:34] IMO, we have holes in our log coverage big enough to drive a truck through
[09:55:37] this is not one of them
[09:55:48] so i think the opposition is academic and fussy
[09:55:58] you know godog, always walking around with his monocle
[09:56:25] yay, monocling his way around airbnb rentals...
[09:57:17] it's literally a one line fix
[09:57:43] do it and let's see if godog still wants to veto
[09:57:52] i think the real reason people resist fixing it is that it has become something like an IRC timekeeping device, like a sundial
[09:58:04] I think we can use _joe_ waking up as an alternative
[09:58:32] Could also have a bot that just announces the UTC midnight
[09:58:58] "we do."
[09:59:17] hahaha
[10:00:30] well i taunted godog above
[10:00:38] to bait him into re-considering
[10:00:42] so let's see :)
[10:01:32] oh shit did the clock change?
[10:01:55] I don't think the taunt is complete unless you actually submit a patch and add him as cc
[10:01:59] what clock?
[10:02:17] the local clock
[10:02:33] (it didn't; dst ends nov. 1)
[10:02:34] ?
[10:02:38] hahaha
[10:02:47] did you just forget an hour pass by instead?
[10:03:17] no, i hadn't been paying attention to the time
[10:03:48] yeah for a moment I was surprised too since it was 3:03 AM and the last time I looked at the time it was 1:18AM
[10:03:53] it just ended in europe
[10:03:55] so i wasn't totally off
[10:04:05] Sunday, October 25, 2015 at 1:00 UTC, clocks will be set back one hour when Daylight Saving Time (DST) ends in most of Europe.
[10:04:39] 10 hrs ago
[10:05:12] I wonder how a world without measured time will look like
[10:05:22] where the unit of measurement is the day rather than the hour and minute
[10:05:41] known but uncared for, like how we treat milliseconds most of the time
[10:06:19] i would have gone with "known but uncared for, like how we treat icinga alerts"
[10:06:41] yeah, I was thinking at some point of 'labs issues' but that's probably not strictly true
[10:07:47] it's the alert that is over-sensitive
[10:07:51] it's not a genuine spike
[10:08:09] the graphite 5xx one?
[10:08:10] yeah
[10:08:22] although the reverse could be true too
[10:08:30] we used to completely ignore the labstore on high load alert allll the time
[10:08:39] and then we actually fixed the systems
[10:08:46] and the original parameters were actually sane for a sane system
[10:09:03] and was alerting all the time because we were operating close to max capacity all the time
[10:09:26] I shall now attempt to fall asleep.
[10:10:03] night!
[10:39:33] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 19 data above and 9 below the confidence bounds
[10:45:36] ori: consider the copytruncate veto lifted :) my reasoning at the time IIRC was that we were working on upgrading the puppetmasters anyways, not the case ATM
[10:50:01] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 15 data above and 9 below the confidence bounds
[11:00:41] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 15 data above and 9 below the confidence bounds
[11:04:11] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 19 data above and 9 below the confidence bounds
[11:10:16] the high 503 is applebot doing malformed requests and us not returning 404s
[11:11:12] it's been like that since 5UTC
[11:33:29] (03PS2) 10Glaisher: Remove Page and Index namespaces from $wgContentNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240640 (https://phabricator.wikimedia.org/T54709)
[11:44:22] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[12:09:02] 6operations, 7Database: Spikes of job runner new connection errors to mysql "Error connecting to 10.64.32.24: Can't connect to MySQL server on '10.64.32.24' (4)" - https://phabricator.wikimedia.org/T107072#1751635 (10jcrespo) db1035 continues having issues: Compare db1044 (same shard and hardware): {F2767546...
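The fix discussed between 09:52 and 10:45 is logrotate's copytruncate mode: the live log is copied aside and then truncated in place, so the writing process (the puppetmaster's Apache/Passenger here) never has to reopen its file handle. A minimal sketch of such a stanza; the path and rotation schedule are assumptions for illustration, not the actual puppetmaster configuration:

```
# /etc/logrotate.d/puppetmaster-apache  (illustrative path and schedule)
/var/log/apache2/*.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    # Copy the log aside, then truncate the original in place. No restart or
    # reload of the writing process is needed, at the cost of possibly losing
    # records written between the copy and the truncate (the concern raised
    # in the conversation above).
    copytruncate
}
```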
[12:14:22] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds
[12:23:03] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 6 below the confidence bounds
[12:26:33] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds
[12:35:32] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 9 below the confidence bounds
[12:40:43] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 9 below the confidence bounds
[12:46:01] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 9 below the confidence bounds
[12:49:12] RECOVERY - puppet last run on install2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:49:23] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[13:15:52] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds
[13:21:11] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds
[13:26:22] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 6 below the confidence bounds
[13:27:23] akosiaris, jynus - maps replication failed again with FATAL: could not receive data from WAL stream: ERROR: requested WAL segment ... has already been removed
[13:27:37] i will try to re-sync it, but any suggestions are welcome
[13:27:47] * yurik is googling how to restart postgres replication
[13:31:51] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 6 below the confidence bounds
[13:35:21] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 8 below the confidence bounds
[13:47:12] o.O
[13:52:51] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds
[14:03:23] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[14:06:22] _O_O_o
[15:27:52] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 8 below the confidence bounds
[15:42:01] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds
[16:16:12] PROBLEM - puppet last run on mw2100 is CRITICAL: CRITICAL: puppet fail
[16:44:21] RECOVERY - puppet last run on mw2100 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[16:45:02] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[17:30:55] (03PS1) 10Ori.livneh: gdash: add notice pointing users to grafana.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/248697 (https://phabricator.wikimedia.org/T104365)
[17:31:21] (03CR) 10Ori.livneh: [C: 032 V: 032] gdash: add notice pointing users to grafana.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/248697 (https://phabricator.wikimedia.org/T104365) (owner: 10Ori.livneh)
[17:37:53] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps DWDM]BR
[17:38:10] 6operations, 7Graphite, 7Monitoring, 5Patch-For-Review: deprecate gdash - https://phabricator.wikimedia.org/T104365#1751905 (10ori) I replaced gdash with a static HTML mirror, migrated it to krypton, and added a site notice pointing users to grafana. Closing as resolved.
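For the question yurik raises at 13:27, the usual way to recover a standby once the master has already recycled the WAL segments it needs is to re-seed it with a fresh base backup rather than restart the existing stream. A minimal sketch, assuming PostgreSQL 9.x-era streaming replication; the hostnames, paths, and version are illustrative and not the actual maps deployment procedure:

```
# On the standby whose WAL stream broke (hostnames and paths are illustrative):
sudo service postgresql stop

# Move the stale data directory aside and take a fresh base backup from the
# master; --xlog-method=stream also copies the WAL needed for startup.
sudo mv /var/lib/postgresql/9.4/main /var/lib/postgresql/9.4/main.broken
sudo -u postgres pg_basebackup -h master.example -U replication \
    -D /var/lib/postgresql/9.4/main --xlog-method=stream --progress

# recovery.conf (with primary_conninfo pointing at the master) must be present
# in the new data directory before restarting, e.g. copied from the old one.
sudo -u postgres cp /var/lib/postgresql/9.4/main.broken/recovery.conf \
    /var/lib/postgresql/9.4/main/
sudo service postgresql start
```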
[17:38:35] 6operations, 7Graphite, 7Monitoring, 5Patch-For-Review: deprecate gdash - https://phabricator.wikimedia.org/T104365#1751906 (10ori) 5Open>3Resolved a:5fgiunchedi>3ori
[17:46:32] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 228, down: 0, dormant: 0, excluded: 0, unused: 0
[18:03:29] (03PS1) 10Cmjohnson: Removing dns entries for es1001-es1010 [dns] - 10https://gerrit.wikimedia.org/r/248700
[18:06:31] (03CR) 10Cmjohnson: [C: 032] Removing dns entries for es1001-es1010 [dns] - 10https://gerrit.wikimedia.org/r/248700 (owner: 10Cmjohnson)
[18:21:35] (03PS1) 10Cmjohnson: Adding dns entries for ms-be1019-1020 [dns] - 10https://gerrit.wikimedia.org/r/248702
[18:23:11] (03CR) 10Cmjohnson: [C: 032] Adding dns entries for ms-be1019-1020 [dns] - 10https://gerrit.wikimedia.org/r/248702 (owner: 10Cmjohnson)
[18:25:27] 6operations, 10ops-eqiad: Rack/Setup ms-be1019-ms-1021 - https://phabricator.wikimedia.org/T116542#1751953 (10Cmjohnson) 3NEW a:3Cmjohnson
[18:31:40] (03PS1) 10Cmjohnson: Addin dhcp and netboot cfg for ms-be1019-1021 [puppet] - 10https://gerrit.wikimedia.org/r/248705
[18:34:20] (03PS2) 10Cmjohnson: Addin dhcp and netboot cfg for ms-be1019-1021 [puppet] - 10https://gerrit.wikimedia.org/r/248705
[18:35:22] (03CR) 10Cmjohnson: [C: 032] Addin dhcp and netboot cfg for ms-be1019-1021 [puppet] - 10https://gerrit.wikimedia.org/r/248705 (owner: 10Cmjohnson)
[18:37:31] 6operations, 10ops-eqiad: Rack/Setup ms-be1019-ms-1021 - https://phabricator.wikimedia.org/T116542#1751972 (10Cmjohnson) All on-site work has been completed including bios setup and raid cfg. partman and dhcp https://gerrit.wikimedia.org/r/#/c/248705/ dns https://gerrit.wikimedia.org/r/#/c/248702/ mgmt dns...
[18:40:41] 6operations, 10ops-eqiad: Decommission es1001-es1010 - https://phabricator.wikimedia.org/T113080#1751976 (10Cmjohnson) 5Open>3Resolved Completed
[19:25:43] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [500.0]
[19:29:13] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[20:00:02] yurik: that probably is the reason https://dpaste.de/mHgH . Dropping and recreating an index in a 100GB table is not exactly the best thing you can do to a database
[20:00:34] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Grant Joseph and Dan deploy permissions on aqs100[1-3] - https://phabricator.wikimedia.org/T116169#1752045 (10Dzahn) a:5Dzahn>3None
[20:11:12] (03PS1) 10Papaul: Add DNS entries for ms-be20[1-2][0-6] Bug:T114712 [dns] - 10https://gerrit.wikimedia.org/r/248712 (https://phabricator.wikimedia.org/T114712)
[20:18:46] (03PS2) 10Papaul: Add DNS entries for ms-be20[1-2][0-6] Bug:T114712 [dns] - 10https://gerrit.wikimedia.org/r/248712 (https://phabricator.wikimedia.org/T114712)
[20:40:03] PROBLEM - puppet last run on mw1207 is CRITICAL: CRITICAL: Puppet has 1 failures
[21:08:02] RECOVERY - puppet last run on mw1207 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:36:18] 7Puppet: Disable 80 character line length check in puppet-lint - https://phabricator.wikimedia.org/T116552#1752131 (10siebrand) 3NEW
[22:37:15] 7Puppet, 6operations: Disable 80 character line length check in puppet-lint - https://phabricator.wikimedia.org/T116552#1752138 (10yuvipanda)
[22:38:25] 7Puppet, 6operations: Disable 80 character line length check in puppet-lint - https://phabricator.wikimedia.org/T116552#1752131 (10yuvipanda) I already see `--no-80chars-check` in .puppet-lint.rc in operations/puppet.git. Is that not being read by the strict check?
[22:48:31] 7Puppet, 6operations: Disable 80 character line length check in puppet-lint - https://phabricator.wikimedia.org/T116552#1752144 (10scfc) I think this is a failure in the Jenkins job. According to https://integration.wikimedia.org/ci/job/translatewiki-puppetlint-strict/2322/console: ``` [translatewiki-puppetl...
[22:49:57] 7Puppet, 6operations, 10Continuous-Integration-Config: Disable 80 character line length check in puppet-lint - https://phabricator.wikimedia.org/T116552#1752145 (10scfc)
[22:51:32] 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: maps-test200{2-4} PostgreSQL replication needs rebuilding - https://phabricator.wikimedia.org/T116553#1752152 (10Yurik)
[22:55:11] PROBLEM - salt-minion processes on lvs3001 is CRITICAL: Timeout while attempting connection
[22:55:22] PROBLEM - salt-minion processes on lvs3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[22:57:02] PROBLEM - pybal on lvs3001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[22:58:41] RECOVERY - pybal on lvs3001 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal
[23:00:22] RECOVERY - salt-minion processes on lvs3001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[23:16:22] RECOVERY - salt-minion processes on lvs3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
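The cause akosiaris points to at 20:00 (a large index rebuild generating WAL faster than the standby could consume it, so the master recycled segments the standby still needed) is commonly mitigated by keeping more WAL on the master. A small sketch of the relevant postgresql.conf knobs; the values are illustrative, not the maps servers' actual configuration:

```
# postgresql.conf on the master (illustrative values, 9.x-era settings)
wal_keep_segments = 512     # retain roughly 8 GB of WAL (512 x 16 MB) for lagging standbys
max_wal_senders   = 5       # allow a few concurrent streaming connections
# On PostgreSQL 9.4+, a physical replication slot makes the master retain WAL
# until the standby has consumed it, at the risk of filling the disk if the
# standby stalls for a long time:
#   SELECT pg_create_physical_replication_slot('maps_standby');
```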